Rosella       Machine Intelligence & Data Mining

Variable Relevancy Analysis / Principal Component Analysis for Predictive Modeling

One of the most important steps in predictive modeling is identifying the independent variables that are significantly correlated with the dependent (target) variable. Weakly correlated variables should be left out of predictive models, since they tend to increase overfitting and reduce accuracy; only independent variables that are strongly correlated with the dependent variable should be used. CMSR Data Miner / Machine Learning / Rule Engine Studio provides confusion map and correlation analysis tools for this purpose. In CMSR Studio, variable importance is measured with the following:

  • Cramer coefficients
  • Chi-square statistic
  • GINI index
  • Correlation coefficients
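
To make the first three measures concrete, the following is a minimal Python sketch of how they can be computed from two categorical columns. It is not CMSR's implementation: "Cramer coefficient" is assumed here to mean Cramér's V, the GINI index is assumed to mean the reduction in Gini impurity of the target after splitting on the variable, and the function names and sample data are hypothetical.

```python
from collections import Counter
from math import sqrt

def chi_square(xs, ys):
    """Pearson chi-square statistic of the contingency table of xs vs ys."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, rn in row_totals.items():
        for c, cn in col_totals.items():
            expected = rn * cn / n  # count expected if xs and ys were independent
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected
    return chi2

def cramers_v(xs, ys):
    """Cramér's V: chi-square rescaled to the range 0..1 (assumed measure)."""
    k = min(len(set(xs)), len(set(ys))) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return sqrt(chi_square(xs, ys) / (len(xs) * k))

def gini_impurity(ys):
    """1 minus the sum of squared class proportions."""
    n = len(ys)
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

def gini_gain(xs, ys):
    """Drop in Gini impurity of ys after grouping by xs (assumed measure)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    weighted = sum(len(g) / len(ys) * gini_impurity(g) for g in groups.values())
    return gini_impurity(ys) - weighted

# Hypothetical sample: one relevant and one irrelevant variable.
risk = ["bad", "bad", "good", "good"]
housing = ["rent", "rent", "own", "own"]   # separates the classes perfectly
colour = ["red", "blue", "red", "blue"]    # unrelated to the classes

for name, column in (("housing", housing), ("colour", colour)):
    print(name, chi_square(column, risk), cramers_v(column, risk), gini_gain(column, risk))
```

In this sample the perfectly separating variable gets a Cramér's V of 1 and the unrelated variable gets 0, which is the kind of contrast these measures are meant to expose.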

The CMSR Confusion Map Chart is a powerful tool for variable relevancy analysis: it can reveal which variables are good for predictive modeling. A confusion map shows how target class values are distributed over the values of an independent variable. By analysing these distribution patterns, we can identify independent variables for predictive modeling. The following chart shows confusion maps:

Confusion Map Variation Analysis
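
The information behind a confusion map, that is, how target classes are distributed over each value of an independent variable, can also be tabulated by hand. The sketch below is only an illustration of that idea in plain Python, not of how CMSR builds or draws its chart; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def class_distribution(values, targets):
    """For each value of an independent variable, return the share of each target class."""
    counts = defaultdict(Counter)
    for v, t in zip(values, targets):
        counts[v][t] += 1
    return {
        v: {t: c / sum(by_class.values()) for t, c in by_class.items()}
        for v, by_class in counts.items()
    }

# Hypothetical sample data.
housing = ["rent", "rent", "own", "own", "own", "rent"]
risk = ["bad", "bad", "good", "good", "bad", "good"]

dist = class_distribution(housing, risk)
for value, shares in dist.items():
    print(value, {cls: round(share, 2) for cls, share in shares.items()})
```

A variable is a promising candidate when the class shares differ clearly from one value to the next; if every value shows roughly the same mix of classes, the variable tells the model little about the target.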

The following figure shows the correlation analysis tool:

Factor Analysis

The figure shows correlations with the dependent variable "RISKFLAG". The right-hand window frame lists independent variables and category items in order of correlation strength, strongest first. "r-value" is the correlation coefficient and "r-square" is its square. The higher the absolute value, the more relevant the item. The top-listed items and variables are good candidates for inclusion as independent variables. The tool is powerful and easy to use.
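
The ranking described above can be approximated outside the tool. The following sketch assumes the r-value is the ordinary Pearson correlation coefficient against a 0/1 target flag, which the page does not state explicitly; the column names and data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def rank_by_correlation(columns, target):
    """Return (name, r, r squared) tuples, strongest absolute correlation first."""
    rows = []
    for name, values in columns.items():
        r = pearson_r(values, target)
        rows.append((name, r, r * r))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

# Hypothetical sample: a 0/1 risk flag and three candidate variables.
riskflag = [1, 1, 0, 0]
columns = {
    "weak": [1, 0, 1, 0],      # unrelated to the flag
    "partial": [3, 1, 1, 1],   # partly related
    "inverse": [0, 0, 1, 1],   # perfectly, but negatively, related
}

ranking = rank_by_correlation(columns, riskflag)
for name, r, r2 in ranking:
    print(f"{name:8s} r-value={r:+.3f} r-square={r2:.3f}")
```

Sorting on the absolute value matters: a strongly negative correlation is as useful to a model as a strongly positive one, so the inverse column ranks first here.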

In addition to the correlation analysis tool, categorical bar charts and histograms can be used to identify modeling variables. This process should also include data transformation. Categorical variables can be transformed into a smaller number of numerical flag variables. Numerical variables can also be transformed into numerical flag variables. These procedures are described fully in the paper "Modeling-Guide-for-Neural-Network.pdf" (available to CMSR users).
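
As a rough illustration of what flag variables look like, the sketch below turns a categorical column into 0/1 flags for a chosen subset of categories and a numerical column into a 0/1 flag by a cutoff. The exact procedures are those of the cited guide, which is not reproduced here; the function names, the choice of categories to keep, and the cutoff are all assumptions made for the example.

```python
def category_flags(values, keep):
    """One 0/1 flag column per category in `keep`; all other categories map to 0 in every flag."""
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in keep}

def threshold_flag(values, cutoff):
    """0/1 flag that is 1 where the numerical value is at or above `cutoff`."""
    return [1 if v >= cutoff else 0 for v in values]

# Hypothetical sample data.
occupation = ["clerk", "driver", "clerk", "teacher", "farmer"]
late_payments = [0, 3, 1, 0, 5]

# Keep flags only for the categories judged relevant (an assumed selection),
# so five rows with four categories collapse into two flag variables.
flags = category_flags(occupation, keep=["clerk", "driver"])
many_late = threshold_flag(late_payments, cutoff=3)

print(flags)
print(many_late)
```

Which categories to keep and where to place a cutoff would normally be decided from the confusion maps, bar charts, and histograms mentioned above, not fixed in advance as in this example.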