EngineRoom

General Linear Modeling Tutorial

Tutorial

Click to Download Tutorial Transcript

Links to Example Datasets

Click to Access an EngineRoom Project with all 3 Data Files

Click to Download Example 1 (Random Block) Data File

Click to Download Example 2 (Interactions and Higher Order Terms) Data File

Click to Download Example 3 (Nested Terms) Data File

When to use this tool

The General Linear Model (GLM) is a versatile statistical tool that helps you understand the relationships between one or more predictor variables and a continuous response variable. Whether you're analyzing the effects of different treatments, comparing group means, or exploring interactions between variables, GLM provides a flexible framework for uncovering insights in your data.

GLM allows you to:

  • Build models using both categorical and continuous predictors
  • Account for known sources of random variation using random blocking
  • Include interaction terms and higher order terms as well as nested terms
  • Simplify your model with different reduction methods
  • Interpret coefficients, p-values, and model fit statistics

GLM is especially useful in experimental design, quality improvement, and advanced regression analysis.

How is this different from Multiple Regression?

General Linear Modeling is an umbrella that includes Multiple Regression.

Multiple Regression involves using multiple predictor variables to predict a response variable. The response must be continuous and the predictors can be continuous or categorical.

General Linear Modeling expands this concept to include linear combinations of other factors including interactions between factors, higher order terms, and nested terms.

General Linear Modeling can produce a more accurate model because it accounts for more detailed terms.

On the other hand, Multiple Regression is a simpler type of model that is great for cases where the predictor variables have a linear relationship with the response.

How to use this tool in EngineRoom

Example 1 (Random Block)

Click to Download Data File

Our first dataset is from a manufacturing process where we are analyzing the effects of Temperature and Pressure on Strength. In our dataset, we have 5 measurements per Raw Material Batch. It is known that strength can vary based on raw material batch, but there is no way to predict exactly what strength the next batch will have, they follow a random distribution.

Let's model strength using temperature and pressure as our independent variables with and without raw material batch as a random effect in the model.

Dataset for first GLM example using random blocks

1. Open the General Linear Modeling tool onto the workspace by going to Analyze > Regression Analysis > General Linear Modeling.

The first screen of the General Linear Modeling tool which shows directions to drag and drop continuous and categorical variables onto the study.

2. Drag Temperature and Pressure onto the Continuous Variables Dropzone, select "Continue", and then drag Strength onto the Response Variable Dropzone.

GLM example 1 with dropzones filled in

3. One new feature of GLM is the addition of multiple Model Reduction Methods. For this example we will leave on "Backwards Elimination" with an alpha-to-remove of 0.5 as the default.

Highlighting model reduction methods on GLM start screen

Model Reduction Methods

Note that for all methods, any model terms that are included in a significant higher order term in the model are kept in the model, regardless of their significance, to maintain hierarchy.

Backwards Elimination

  • After fitting the model with all terms specified, every model term that has a p-value that exceeds the "Backward Elimination Alpha-to-Remove" is removed from the final model one at a time in the output.

Forward Selection

  • Starting with no terms in the model, terms are added one at a time until no more terms with a p-value less than the "Forward Selection Alpha-to-Enter" can be added.

Stepwise

  • A combination of forward and backward elimination using both p-value cutoffs to add and remove terms.

No Reduction

  • The model specified is fit and no terms are added or removed based on significance.

4. Click "Calculate" to jump straight to the output.

GLM output for fixed model in example 1

In our output we can see that the model reduction removed Pressure from the model and the model with only Temperature has an R-Squared of 45%. Without including the Raw Material Batch in the model, the batch to batch variation is included in the pure error (noise). This leads to higher p-values for our fixed effect terms and a lower R-Squared.

More information on the model output below.


Fixed Effect Model Output

Conclusion Statement

  • Lists which factors are significant at the selected alpha level.

Model Summary

  • These statistics help determine whether the model generated is a good fit for your data.

Model Equation (Coded)

  • An equation generated using the input standardization method. The default standardization method is Centering, or subtracting the mean. In the Advanced example, we will show how this is set.
  • This equation is useful for comparing effect sizes by comparing term coefficients, but this model should not be used for directly predicting the output.

Model Equation (Uncoded)

  • An equation generated from your data. Use this model when using real values of your data to predict the output.

Analysis of Variance and Coefficients Tables

  • ANOVA table breaking down the significance of the terms in your model. The significant terms are highlighted in green.

Coefficients - Coded

  • Coded coefficients of each term, their statistical significance, and their Variance Inflation Factor (VIF)
  • VIFs measures how much the given term is correlated with other terms in the model. VIFs greater than 10 typically mean that there is high multicollinearity with other terms in the model and it may be risky to trust the coefficient sizes.

Model Parameters

  • Informational table about some of the selected options.

Let's fit the model again with the Raw Material Batch included as a random effect to account for its known random variation.

5. Click "edit options" in the output to go back to "Basic Setup".

6. Drag Raw Material Batch over to the Categorical Variables Dropzone and activate the "Random" slider in the categorical variables setup options.

Adding raw material batch and making it random in basic set up of GLM

Categorical Variable Setup

First, you have the option here to select which type of encoding you would use. Dummy Coding, also known as One-Hot Encoding, will calculate the coefficients of the categorical levels based on a reference level you select. In Effect Coding, the coefficients of the categorical levels will be set according to the mean. We will leave this at Dummy Coding.

Second, you have the option of setting whether a particular categorical variable is Random. Random variables are variables that are not selected specifically. In this case, our Raw Material Batch number is not something that we are specifically selecting as the Batches will come in randomly and will change in the future as well.


7. Click "Calculate" to jump straight to the output.

GLM output for example 1 with random block

At first glance we see that adding Raw Material Batch as a random block in our model has lead to Pressure being kept in the model and our R-Squared to go from 45% in the last model to 93% now. Accounting for the random batch to batch variation of Strength has lead to a much improved model.

More information on the model output below.


Mixed Model Output

Conclusion Statement

  • Lists which factors are significant at the selected alpha level.

Model Summary

  • These statistics help determine whether the model generated is a good fit for your data.

Marginal Model Equation (Coded)

  • The "Marginal" model does not include the random effect(s) in the model.
  • An equation generated using the input standardization method. The default standardization method is Centering, or subtracting the mean. In the Advanced example, we will show how this is set.
  • This equation is useful for comparing effect sizes by comparing term coefficients, but this model should not be used for directly predicting the output.

Marginal Model Equation (Uncoded)

  • An equation generated from your data. Use this model when using real values of your data to predict the output.

Coefficients - Coded

  • Coded coefficients of each term and their statistical significance
  • Variance Inflation Factors (VIFs) cannot be calculated in a mixed model

Variance Components

  • Breaks down the variation due to pure error (noise) and the random effect(s) in the model. The higher the %Total Variance of the random effect(s), the more significant their p-value.

Model Parameters

  • Informational table about some of the selected options.

Random Factors

  • Reports the Best Linear Unbiased Predictors (BLUPs) of each level of the given random effect.
  • BLUPs are the best guess of the effect of each level of the random effect.

Conditional Model Equation (Coded)

  • Unlike the "Marginal" model, the "Conditional" model does include the random effect(s) in the coded model.

Example 2 (Interactions and Higher Order Terms)

Click to Download Data File

This dataset contains Activity and Weight as input variables and Bone Density as a response variable.

GLM example 2 dataset

1. Open the General Linear Modeling tool onto the workspace by going to Analyze > Regression Analysis > General Linear Modeling.

The first screen of the General Linear Modeling tool which shows directions to drag and drop continuous and categorical variables onto the study.

2. Click on the Data Source and drag on Activity variable and the Weight variable onto the Continuous Variables Dropzone.

Activity and Weight variables are added to the G L M study

3. Click Continue.

4. Drag on the Bone Density variable onto the Response Variable Dropzone.

Bone Density variable added to response dropzone.

5. We are going to leave the Basic Setup options at their defaults.

6. Click on Advanced Setup.

Advanced Setup screen showing options for Nested Variables, Interactions and Higher Order Terms, and Input Standardization Method. Each as an edit button next to it.

7. Click on the "Edit" button next to "Interactions and Higher Order Terms". The Interactions and Higher Order Terms screen allows you to select which interactions or higher order terms you want to consider for the model. This adds additional complexity to your model, so be aware when adding many interactions or higher order terms.

Interaction and Higher Order Terms screen which shows a left column containing Activity and Weight, a center column containing two buttons for Add Higher Order Terms and Add Interaction Terms, and a third column for Included in Model which has Activity and Weight.

8. Click on both Activity and Weight on the left side.

9. Click on "Add Higher Order Terms >>" to add the quadratic terms Activity * Activity and Weight * Weight to the model.

GLM example 2 advanced settings

11. Click "Add Interaction Terms >>" with Activity and Weight still selected on the left. This will add Activity * Weight to the model.

GLM example 2 advanced settings continued

Note : Adjusting the number in the dropdown allows you to add 3rd order terms and three-way interactions.

12. Click "Calculate".

GLM example 2 output
GLM example 2 output continued

Our model shows a moderate fit to the data and removed the Weight quadratic term (Weight * Weight) in the model reduction phase. We expected a large amount of noise in this response variable, so seeing these terms as significant including is a success in learning how our factors can impact our response.

More information on the model output below.


Fixed Effect Model Output

Conclusion Statement

  • Lists which factors are significant at the selected alpha level.

Model Summary

  • These statistics help determine whether the model generated is a good fit for your data.

Model Equation (Coded)

  • An equation generated using the input standardization method. The default standardization method is Centering, or subtracting the mean. In the Advanced example, we will show how this is set.
  • This equation is useful for comparing effect sizes by comparing term coefficients, but this model should not be used for directly predicting the output.

Model Equation (Uncoded)

  • An equation generated from your data. Use this model when using real values of your data to predict the output.

Analysis of Variance and Coefficients Tables

  • ANOVA table breaking down the significance of the terms in your model. The significant terms are highlighted in green.

Coefficients - Coded

  • Coded coefficients of each term, their statistical significance, and their Variance Inflation Factor (VIF)
  • VIFs measures how much the given term is correlated with other terms in the model. VIFs greater than 10 typically mean that there is high multicollinearity with other terms in the model and it may be risky to trust the coefficient sizes.

Model Parameters

  • Informational table about some of the selected options.

Example 3 (Nested Terms)

Click to Download Data File

The baking company "Rise and Shine" wants to find the optimal Time and Temperature to bake their bread at in order to get the optimal loaf Weight of 900 grams. They have two bread baking Ovens each with a top and bottom Rack. The bakers know there exists bias between the ovens and between the racks but have never studied their true differences.

In order to optimize when to take out the oven from each individual rack, the bakers at "Rise and Shine" ran a designed experiment varying Time and Temperature for every Oven and Rack combination. The results from the completely randomized full factorial experiment are below.

GLM example 3 dataset

Nested Terms are used when the levels of one of the factors in your model is part or reliant upon the levels of another factor. In this example, which rack the bread is baked on depends on both the oven and the rack (e.g. the top rack of oven A is not the same as the top rack of oven B). Thus we say that "rack is nested in oven".

1. To start, open the General Linear Modeling tool onto the workspace by going to Analyze > Regression Analysis > General Linear Modeling.

The first screen of the General Linear Modeling tool which shows directions to drag and drop continuous and categorical variables onto the study.

2. Click on the Data Source then drag the variables Oven and Rack onto the Categorical Variables Dropzone. Drag the variables Temperature (F) and Time (min) onto the Continuous Variable Dropzone.

GLM example 3 nested model setup

3. Drag on the variable Weight (g) onto the Response Variable Dropzone.

GLM example 3 nested model setup continued

4. Leave the Basic Setup options as they are, and click on "Advanced Setup."

GLM example 3 nested model setup continued

5. On the Advanced Setup screen, click "Edit" for Nested Variables.

6. On the Nested Variables screen, select Rack is nested in Oven.

GLM example 3 nested model setup continued

7. Click "Calculate".

GLM example 3 nested model output
GLM example 3 nested model output after scrolling down
GLM example 3 nested model output after scrolling down further

Our model shows good fit to the data and includes all model terms. Notice that the term for Rack is "Rack(Oven)". This is the parentheses nomenclature to represent Rack being nested in Oven.

We can interpret the model similarly to how we have in the previous examples. The "Tukey Adjustment" table for differences in our categoric variables and the "Factorial Plots" are new focuses of this output.

More information on the model output below.


Fixed Effect Model Output (with Categorical Variables)

Conclusion Statement

  • Lists which factors are significant at the selected alpha level.

Model Summary

  • These statistics help determine whether the model generated is a good fit for your data.

Model Equation (Coded)

  • An equation generated using the input standardization method. The default standardization method is Centering, or subtracting the mean. In the Advanced example, we will show how this is set.
  • This equation is useful for comparing effect sizes by comparing term coefficients, but this model should not be used for directly predicting the output.

Model Equation (Uncoded)

  • An equation generated from your data. Use this model when using real values of your data to predict the output.

Analysis of Variance and Coefficients Tables

  • ANOVA table breaking down the significance of the terms in your model. The significant terms are highlighted in green.

Coefficients - Coded

  • Coded coefficients of each term, their statistical significance, and their Variance Inflation Factor (VIF)
  • VIFs measures how much the given term is correlated with other terms in the model. VIFs greater than 10 typically mean that there is high multicollinearity with other terms in the model and it may be risky to trust the coefficient sizes.

Model Parameters

  • Informational table about some of the selected options.

Simultaneous Tests for Difference of Means (Tukey Adjustment)

  • For each categorical variable in the model, this table tests the statistical significance of all paired differences between every combination of levels in the variable.
  • The confidence interval and p-value can both be used to assess significance.
  • The "Confidence Intervals" plots in the bottom right of the output visualize these differences and confidence intervals.

Note also that there are two tabs at the top. Switching to the tab called "Factorial Plots" will show the main effect plots for the categorical variables in the model. Here we can see the average difference between ovens A and B, as well as see the differences between the 4 unique racks.

GLM example 3 nested model factorial plots

Additional Options

Two other options not previously discussed:

Basic Setup Screen:

Model Reduction Method: Used to simplify the model by adding or removing factors based on their significance. Choose from Backward Elimination, Forward Selection, Stepwise, or No Reduction.

Dropdown showing the four options for Model Reduction Method

Advanced Setup Screen:

Input Standardization Method: Sets how you might be interested in comparing coefficients on the same scale. The options are Subtract Mean (Center), Divide by Standard Deviation, Subtract Mean and Divide by Standard Deviation, Code Min/Max to -1/+1, or None.

Dropdown showing the options for the input standarization method

Was this helpful?