EngineRoom

Classification and Regression Tree Tutorial

Regression Case: Click to Download Data File

Classification Case: Click to Download Data File

When to use this tool

There are two main uses for the Classification and Regression Tree:

  • Predict the output for a given set of features (inputs)
  • Understand which features have the most effect on the prediction of your outputs

Use this tool when you have a large amount of categorical and continuous data and the relationship between the features (inputs) and the response may not be linear.

How to use this tool in EngineRoom

In general this tool performs best with larger datasets.

1. Go to the Analyze (DMAIC) menu and click Classification and Regression Tree to open it on the workspace.

Classification and regression tree start up menu.

2. Drag the continuous variables onto the Continuous dropzone and the categorical variables onto the Categorical dropzone.

Classification and regression tree with categorical variables.

3. Click "Continue".

4. Drag your response variable onto the Response Variable dropzone.

  • In general: If your response is text, it is treated as Categorical. If the response is numeric and has more than 6 unique values, it is treated as Continuous. If it is numeric and has fewer than 6 unique values, it is treated as Discrete.
Classification and regression tree with response.
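The rule of thumb above can be sketched in Python. This is only an illustration of the stated rule, not EngineRoom's implementation: the function name `infer_response_type` and the `threshold` parameter are hypothetical, and because the text does not say how exactly 6 unique values are handled, the sketch assumes that case is treated as Discrete.

```python
def infer_response_type(values, threshold=6):
    """Guess the response type from the rule of thumb in step 4."""
    # Text responses are treated as Categorical.
    if any(isinstance(v, str) for v in values):
        return "Categorical"
    unique_count = len(set(values))
    # Numeric with more than `threshold` unique values: Continuous.
    if unique_count > threshold:
        return "Continuous"
    # Assumption: exactly `threshold` unique values falls through to Discrete,
    # because the tutorial text does not specify this boundary case.
    return "Discrete"


print(infer_response_type(["Yes", "No", "Yes"]))   # Categorical
print(infer_response_type([1, 2, 3, 1, 2]))        # Discrete
print(infer_response_type(list(range(10))))        # Continuous
```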

5. (Note: This screen may be skipped.) If your numeric data has fewer than 11 unique values, the Type of Analysis screen lets you confirm whether the numeric input should be treated as Continuous or Discrete. Select the appropriate value and click "Continue".

6. Select options related to how the tree should be built.

  • Max. Tree Depth - the maximum number of splits allowed before reaching a leaf node (terminating node).
  • Min. Samples to Split Node - the minimum number of observations that need to be in a resulting node to create a split. The tree stops splitting once it reaches the maximum tree depth set above, even if its nodes have not yet shrunk to the minimum number of observations.
Classification and regression tree with min samples to split added.
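How the two options interact can be sketched in a few lines of Python. This is a minimal illustration, not EngineRoom's implementation: the function name `should_split`, the default values, and the convention that a node must hold at least `min_samples_split` observations before it can be split (the convention used by common CART libraries) are all assumptions.

```python
def should_split(depth, n_samples, max_depth=5, min_samples_split=20):
    """Return True if a node may be split further under both stopping rules."""
    # Rule 1: never split once the maximum tree depth has been reached.
    if depth >= max_depth:
        return False
    # Rule 2 (assumed convention): the node needs enough observations to split.
    return n_samples >= min_samples_split


print(should_split(depth=2, n_samples=150))  # True: below max depth, enough data
print(should_split(depth=5, n_samples=150))  # False: max depth reached
print(should_split(depth=2, n_samples=8))    # False: too few observations
```

Note that the depth check comes first, so a node at the maximum depth becomes a leaf no matter how many observations it still holds.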

7. Click "Continue".

8. Select options related to the testing of the tree. This gives an indication of how well the tree would perform on a test ("unseen") data set.

Indented list explaining cross-validation with K=10 folds.
Testing/Validation menu.
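The idea behind K-fold cross-validation with K=10 can be sketched as follows. This is a generic illustration under stated assumptions, not EngineRoom's procedure: the helper `k_fold_indices` is hypothetical, and it splits rows in their original order, whereas practical implementations usually shuffle the rows first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one row.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each row is held out exactly once; the tree is fit on the other nine folds.
for fold, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(f"fold {fold}: {len(train)} training rows, {len(test)} held-out rows")
```

Averaging the performance measured on the ten held-out folds gives an estimate of how the tree would behave on data it has not seen.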

9. Click "Continue".

10. View the results of the tree. A Discrete output will result in a "Classification" tree and a Continuous output will result in a "Regression" tree.

The Output

The output includes:

  • Table Output: The tables on the left differ between the Classification and Regression Trees, but in both cases the first few tables give an indication of the performance of the tree.
Sample classification output.
  • Feature Importance: The next few tables indicate the relative importance of each feature, as well as the features that were removed from the model because they had little to no effect.
Feature importance tables.
  • Tree Visualization: The tree output shows the results at each node. In the Regression case, the nodes are colored according to how close they are to the minimum or the maximum. In the Classification case, they are colored according to the dominant output class in that node.
Sample tree output.
  • Pareto of Feature Importance (%): This Pareto chart visualizes the relative impact of each feature on the tree. It presents the same information as the feature importance table on the left.
Sample classification pareto chart output.
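The numbers behind a Pareto chart of feature importance can be sketched in Python. The helper `pareto_table` and the importance scores below are made up for illustration; they are not EngineRoom output, and the way EngineRoom computes importance is not described here.

```python
def pareto_table(importances):
    """Return (feature, percent, cumulative percent) rows, largest first."""
    total = sum(importances.values())
    rows = []
    cumulative = 0.0
    # Sort features from most to least important, as a Pareto chart does.
    for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
        percent = 100.0 * value / total
        cumulative += percent
        rows.append((name, round(percent, 1), round(cumulative, 1)))
    return rows


# Hypothetical raw importance scores for three features.
scores = {"FeatureA": 0.10, "FeatureB": 0.60, "FeatureC": 0.30}
for name, percent, cumulative in pareto_table(scores):
    print(f"{name}: {percent}% (cumulative {cumulative}%)")
```

The bars of the chart correspond to the percent column, and the cumulative column shows how few features account for most of the tree's splits.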

Example 1: Classification Tree

A consumer credit card company wants to better predict customers at risk of attrition (exiting the company). It has collected data from its consumer credit card portfolio, with the aim of helping analysts predict customer attrition. Develop a classification tree to help predict customer attrition with high accuracy.

  1. Open a Classification and Regression Tree from the Analyze menu.
  2. Drag the input variables onto the dropzones. For this example, use the following:
  • EducationLevel - Categorical
  • MaritalStatus - Categorical
  • IncomeCategory - Categorical
  • CardCategory - Categorical
  • CustomerAge - Continuous
  • DependentCount - Continuous
  • MonthsOnBook - Continuous
  • TotalRelationshipCount - Continuous
  • MonthsInactive - Continuous
  • ContactsCount - Continuous
  • CreditLimit - Continuous
  • TotalRevolvingBalance - Continuous
  • AvgOpenToBuy - Continuous
  • TotalChangeQ4toQ1 - Continuous
  • TotalTransactionAmount - Continuous
  • TotalTransactionCount - Continuous
  • TotalCountChangeQ4toQ1 - Continuous
  • AvgUtilizationRatio - Continuous
Sample classification tree start up menu.

3. Drag "Attrition" onto the Response Variable dropzone.

Start up menu with "attrition" added on.

4. Click "Continue".

5. We will accept the defaults, so click "Continue" and then "Calculate".

6. The result will be a classification tree organizing the observations based on attrition. Note that EngineRoom provides a recommendation that indicates that there is an imbalance in the data set, which may affect the results of the tree.

Sample classification output tree.
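Why an imbalanced response matters can be illustrated with a short Python sketch. The class labels and counts below are made up and are not taken from the tutorial data file; the helper names are hypothetical. The point is that when one class dominates, a model that always predicts that class already scores a high accuracy, so accuracy alone can look better than the tree really is.

```python
from collections import Counter


def class_shares(labels):
    """Return the share of observations belonging to each response class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    return max(class_shares(labels).values())


# Made-up labels for illustration only.
labels = ["Existing"] * 8 + ["Attrited"] * 2
print(class_shares(labels))
print(majority_baseline_accuracy(labels))
```

Comparing the tree's reported accuracy against this baseline is one simple way to judge whether the tree has learned anything about the minority class.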

Example 2: Regression Tree

A newly introduced e-commerce platform is trying to determine the optimal retail price to set for its products. It has developed a dataset containing product listings, ratings, and sales performance data. Build a regression tree that can help inform the optimal pricing strategy for a product based on its ratings, merchant ratings count, mean discount, and other metrics.

  1. Open a Classification and Regression Tree from the Analyze menu.
  2. Drag the input variables onto the dropzones. For this example, drag the following variables onto the Continuous Variables dropzone:
  • ListedProducts
  • TotalUnitsSold
  • MeanUnitsSoldPerProduct
  • Rating
  • MerchantRatingsCount
  • MeanProductPrices
  • AverageDiscount
  • MeanDiscount
  • MeanProductRatingsCount
Sample regression tree output.

3. Click "Continue".

4. For this example, drag "RetailPrices" onto the Response Variable dropzone.

Sample regression tree with "RetailPrices" added.

5. Click "Continue".

6. We will accept all of the defaults for options. Click "Continue" and then "Calculate".

7. The result will be a regression tree for the RetailPrices response.

Sample regression tree output.
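For readers curious about what a single regression tree split does, the sketch below finds the threshold on one continuous feature that minimizes the total squared error of the two resulting nodes. This is the criterion commonly used by CART-style regression trees; whether EngineRoom uses exactly this criterion is an assumption, and the function names and data values are made up for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, cost) minimizing SSE(left) + SSE(right)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        # Identical neighboring feature values cannot be separated.
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        # Candidate threshold: midpoint between neighboring feature values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Made-up feature values and prices with an obvious break between 3 and 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 5, 5, 20, 20, 20]
print(best_split(xs, ys))  # (6.5, 0.0)
```

Each leaf of a regression tree then predicts the mean response of the observations that fall into it, which is why the split that minimizes squared error around the node means is a natural choice.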

Note: The classification and regression tree data sets were taken from kaggle.com.
