Correlation and Link Analysis

Link analysis takes many forms. Its most common use is the analysis of web-page hypertext links, as described in Web Navigation Pattern Analysis. Another common use is Healthcare Fraud Detection, where it analyzes associations between providers and patients potentially involved in healthcare insurance scams. Other uses of link analysis are described in the subsequent sections.

Correlation analysis and Categorical data

A correlation coefficient is a numerical measure of the linear association between two numerical variables. Correlation analysis is very important in selecting variables for clustering/segmentation and predictive modeling. Correlation coefficients range between -1 and 1. If two variables are perfectly negatively correlated, the coefficient is -1. If they are perfectly positively correlated, it is 1. A simple coefficient computation reveals linear correlations such as those shown below:

(Linear correlation)
An example of positive correlation
An example of negative correlation
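
As a minimal sketch of the simple coefficient computation mentioned above, the following computes the standard Pearson correlation coefficient in plain Python. The function name `pearson` and the sample data are made up for illustration; this is not taken from the product itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Illustrative data only.
xs = [1, 2, 3, 4, 5]
positive = pearson(xs, [2, 4, 6, 8, 10])  # y rises exactly with x
negative = pearson(xs, [10, 8, 6, 4, 2])  # y falls exactly with x
```

With these perfectly linear samples, `positive` comes out at 1 and `negative` at -1, up to floating-point rounding.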

However, the type of correlation shown below cannot be exposed by this simple computation. To compute non-linear correlation, StarProbe employs discretization, averaging, and ranking techniques. First, it divides the independent variable into intervals and computes the average of the dependent variable for each interval. These averages are then used to represent the independent variable. Arranged in ascending order of the averages, the distribution becomes linear. In addition to the averages, StarProbe uses the ranks of the averages. Note that ranking may make the independent variable more uniform.

(Non-linear correlation)
An example of non-linear correlation
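
The discretization, averaging, and ranking steps described above might be sketched roughly as follows. This is an illustration under stated assumptions, not StarProbe's actual implementation: the equal-width intervals, the interval count of 5, the function names, and the U-shaped sample data are all choices made here for the example.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def interval_scores(xs, ys, n_bins=5, use_ranks=False):
    """Replace each x by the average y of its interval (or by that average's rank).

    Assumption: equal-width intervals; ties between averages are not handled.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("independent variable is constant")
    # Interval index of every x; the maximum value falls into the last interval.
    bins = [min(int((x - lo) / width), n_bins - 1) for x in xs]
    groups = defaultdict(list)
    for b, y in zip(bins, ys):
        groups[b].append(y)
    averages = {b: sum(v) / len(v) for b, v in groups.items()}
    if use_ranks:
        # Rank 1 goes to the interval with the smallest average.
        ordered = sorted(averages, key=averages.get)
        scores = {b: rank for rank, b in enumerate(ordered, start=1)}
    else:
        scores = averages
    return [scores[b] for b in bins]


# Illustrative U-shaped relationship: y depends strongly on x, but not linearly.
xs = [i * 0.5 for i in range(21)]
ys = [(x - 5) ** 2 for x in xs]

raw = pearson(xs, ys)                                            # close to 0
by_average = pearson(interval_scores(xs, ys), ys)                # clearly positive
by_rank = pearson(interval_scores(xs, ys, use_ranks=True), ys)   # clearly positive
```

On this sample the raw coefficient is near zero even though the dependence is strong, while both transformed versions yield a high coefficient. The number of intervals is a tuning choice: too few hides the shape, too many leaves few points per interval.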

In addition, correlation coefficients cannot be computed directly from categorical variables. Normally, linearization techniques are used. One commonly used method is to transform a categorical variable into numerical indicator variables. For example, "gender" can be transformed into two numerical variables: "gender=male" and "gender=female". This method works well if a variable has only two categorical values. With more than two values, the result becomes confusing. To detect such correlation, StarProbe employs averaging and ranking procedures. In the example below, the average of the dependent variable is computed for each category, and then the ranks of the averages are computed. The averages and ranks are then used to represent the categorical variable. The effect on the example is that the categories are re-arranged in linear order: "cat1", "cat3", and "cat2". The result is a very high categorical correlation coefficient.

(Non-linear Categorical Correlation)
An example of categorical correlation
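
The averaging and ranking procedure for categorical variables can be sketched along the following lines. This is a hedged illustration rather than StarProbe's code: the function names and the sample values are invented here, chosen only so that the category averages order as "cat1", "cat3", "cat2" as in the example above.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def category_averages(categories, values):
    """Average of the numerical variable within each category."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    return {c: sum(v) / len(v) for c, v in groups.items()}


# Made-up sample: a categorical variable and a numerical dependent variable.
categories = ["cat1", "cat2", "cat3", "cat1", "cat2", "cat3", "cat1", "cat2", "cat3"]
values = [10, 31, 19, 12, 29, 21, 11, 30, 20]

averages = category_averages(categories, values)
# Categories ordered by ascending average; rank 1 is the smallest average.
order = sorted(averages, key=averages.get)
ranks = {c: rank for rank, c in enumerate(order, start=1)}

# Represent each category by its average or by its rank, then correlate.
corr_by_average = pearson([averages[c] for c in categories], values)
corr_by_rank = pearson([ranks[c] for c in categories], values)
```

Here `order` lists the categories as "cat1", "cat3", "cat2", and both coefficients are close to 1 because the values vary little within each category.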

Case study: retail store sales and trend analysis

Correlation analysis is primarily used as a pre-analysis tool for predictive modeling. It is normally used to determine which variables may be used in predictive models. By the same rationale, it can be used to determine which variables may have a bearing on the sales of retail stores. Suppose retail stores have a variety of information available, such as the geographic and demographic composition of their business areas drawn from census and GIS data. Correlation link analysis can be used to determine which variables correlate closely with the sales revenues of retail stores. This information is then used in developing segmentation for Sales Trend Analysis and Forecasting. A superb tool for sales management!
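
To make the screening step concrete, here is a small sketch that ranks candidate variables by the strength of their linear correlation with store sales. Everything in it is hypothetical: the variable names, the five-store figures, and the helper `pearson` are invented for illustration and do not come from any real dataset or from the product.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-store data (one entry per store).
sales = [100, 150, 200, 250, 300]
candidates = {
    "population": [10, 15, 21, 24, 30],
    "median_income": [50, 40, 55, 45, 52],
    "distance_to_highway": [9, 8, 5, 4, 1],
}

# Correlate every candidate with sales, then sort by absolute strength so that
# strong negative correlations rank as highly as strong positive ones.
scores = {name: pearson(column, sales) for name, column in candidates.items()}
ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
ranked_names = [name for name, _ in ranked]
```

In this made-up sample, `population` and `distance_to_highway` rank ahead of `median_income`, so they would be the first candidates to consider when building segments. Variables with a non-linear or categorical relationship to sales would need the averaging and ranking treatment described earlier before being screened this way.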


Correlation analysis is a feature of CMS - Cramer Modeling & Segmentation. A download is available for evaluation from here.