EngineRoom

Regression Analysis Overview

Regression Analysis is a statistical tool used to investigate the relationships between variables. It lets you estimate the effect of one or more variables upon another, for instance, the effect of the square footage and location of houses on their sale prices. To investigate such a question, data are collected on the variables of interest, and regression analysis is used to estimate the quantitative effect of the explanatory variables (called the 'independent' or 'predictor' variables) upon the variable that they influence (called the 'dependent' or 'response' variable). Whether an estimated effect can be read as causal depends on how the data were collected; with observational data, regression establishes association only. The analysis also allows you to assess the “statistical significance” of the estimated relationships, that is, how confident you can be that an estimated relationship reflects a true relationship rather than chance variation in the sample.

Regression Analysis is used when it is hypothesized that one variable is dependent upon:

  • another single independent or input variable (Simple Regression), or
  • multiple independent or input variables (Multiple Regression).
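As a rough illustration of the two cases above, the following Python sketch fits a simple and a multiple regression by ordinary least squares using only the standard library. The data values and the names `fit_simple`, `fit_multiple`, `sqft_k`, `age`, and `price_k` are made up for this example; this is not EngineRoom's implementation, and a statistics package would normally be used instead.

```python
# Illustrative, made-up data: house size, age, and sale price.
sqft_k = [1.0, 1.5, 2.0, 2.5]            # size, thousands of square feet
age = [10.0, 5.0, 20.0, 1.0]             # age, years
price_k = [130.0, 190.0, 210.0, 298.0]   # sale price, thousands


def fit_simple(x, y):
    """Least-squares fit of y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_multiple(rows, y):
    """Least-squares fit with several predictors.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]   # prepend a column of 1s for the intercept
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(xr[i] * xr[j] for xr in X) for j in range(k)]
        + [sum(xr[i] * yi for xr, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simple regression: price explained by size alone.
simple_coefs = fit_simple(sqft_k, price_k)

# Multiple regression: price explained by size and age together.
multiple_coefs = fit_multiple(list(zip(sqft_k, age)), price_k)
```

The only difference between the two calls is the number of predictor columns supplied; the response variable is the same in both.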

The independent (predictor) variables may be continuous or discrete. Continuous variables can be included in the analysis without any change. A discrete variable, whether it has nominal levels (say 'state1', 'state2', 'state3', etc.) or ordinal levels (say 'low', 'medium' and 'high'), must be represented by indicator ('dummy') variables, one for each level of the variable, minus one. For example, if your independent variable Location has three levels, 'East', 'Southeast' and 'Northwest', you will need two dummy variables (3 levels - 1 = 2) to fully represent this nominal predictor. You may call these 'East' (which takes the value 1 if the location is in the east and 0 otherwise) and 'Southeast' (which takes the value 1 if the location is in the southeast and 0 otherwise). You don't need a third dummy variable for 'Northwest', because setting both 'East' and 'Southeast' to 0 identifies the only remaining cases, which are the locations in the northwest.
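The Location example can be sketched in a few lines of Python. The helper name `dummy_code` and the sample `locations` list are hypothetical and exist only for this illustration; 'Northwest' is treated as the reference level that gets no dummy variable of its own.

```python
def dummy_code(values, levels, reference):
    """Encode a discrete variable as (number of levels - 1) 0/1 indicators.

    The reference level gets no indicator; it is identified by all
    indicators being 0.
    """
    kept = [level for level in levels if level != reference]
    return [{level: int(value == level) for level in kept} for value in values]


# Made-up observations of the Location predictor.
locations = ["East", "Northwest", "Southeast", "East"]

encoded = dummy_code(
    locations,
    levels=["East", "Southeast", "Northwest"],
    reference="Northwest",
)

for location, row in zip(locations, encoded):
    print(location, row)
```

Each row carries two indicators, 'East' and 'Southeast'; the 'Northwest' observation is the one where both are 0.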

The dependent (response) variable itself may be measured on a continuous or discrete scale. If the variable is continuous, you can use the typical simple or multiple regression technique, which can handle such data. If instead the dependent variable is binary (Pass/Fail), you should use logistic regression, which accounts for the binomial distribution of the variable.
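For the binary case, a minimal sketch of logistic regression with one predictor is shown below. It maximizes the log-likelihood by plain gradient ascent, which is chosen here only because it is short; statistical software typically uses other fitting methods. The data (`hours`, `passed`) and the names `sigmoid` and `fit_logistic` are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        # Difference between the observed outcome and the predicted probability.
        errors = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * xi for e, xi in zip(errors, x)) / n
    return b0, b1


# Made-up data: hours of preparation and a Pass (1) / Fail (0) outcome.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)

# Estimated probability of passing after 0.5 and 4.0 hours.
p_low = sigmoid(b0 + b1 * 0.5)
p_high = sigmoid(b0 + b1 * 4.0)
```

Unlike an ordinary regression line, the fitted values are probabilities, so they always stay between 0 and 1.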
