Rosella Machine Intelligence & Data Mining

The Cookbook for Predictive Modeling and Machine Learning

This page provides a comprehensive guide to predictive analytics: profiling, segmentation, and predictive modeling with machine learning. It covers everything from data preparation to model deployment and real-world application.

The key factors to success

The most important factor in a successful implementation of predictive analytics is the availability of useful information, especially past historical data. Such data does not exist everywhere; only a limited number of industries have the right data for predictive modeling, e.g., insurance, credit finance, direct marketing, etc. The second factor is intimate knowledge of the business domain and the data. The third factor, which may be the most important one, is intuition and insight driven by experience. Note that this cannot be obtained without the second factor, and it can only be gained while developing real predictive models. Varied experimentation provides real heuristic knowledge. You cannot learn to drive without actually driving a car! The same is true for predictive analytics.

If you need predictive modeling tools, please read Predictive Modeling Software. Software download is available from that page.

1. Concept Design

Goals and final outputs should be clearly defined. For example: developing profiles of the most valued customers, developing a segmentation that supports the most efficient marketing campaigns, or developing predictive models that can identify customers at risk of churning. Once the goal is defined, methods and tools are selected, e.g., profiling, segmentation, predictive modeling, and so on. Note that the goals and techniques selected will determine the data preparation requirements of the next step.

2. Data Preparation

Once goals and outputs are defined, the next step is to collect and prepare data. In general, data fall into the following two major categories:

  • Numerical data: Also known as linear data. Numerical data represent measurable numerical values such as age, height, weight, quantities, amounts, numbers of years, etc.
  • Categorical data: Also known as nominal data. A categorical variable has a finite number of pre-determined values. For example, gender has "Male" and "Female"; other examples are vocation, job position, and religion. Boolean variables may be handled as categorical variables with "YES" and "NO" categories.

In addition, the following data may be included after transformation:

  • Date and time: Date and time is a special data type that usually cannot be used directly in analytics. It requires transformation into numerical or categorical values. For example, date of birth can be transformed into a numerical "age", and calendar months can be translated into categorical "seasons": Spring, Summer, Fall, and Winter.
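
The following is a minimal sketch of such date/time transformations using pandas; the column names ("date_of_birth", "order_date") and the reference date are hypothetical, not part of the original guide.

```python
import pandas as pd

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1980-05-02", "1995-11-20"]),
    "order_date": pd.to_datetime(["2024-01-15", "2024-07-03"]),
})

# Date of birth -> numerical "age" (whole years at a chosen reference date)
reference = pd.Timestamp("2024-12-31")
df["age"] = ((reference - df["date_of_birth"]).dt.days // 365).astype(int)

# Calendar month -> categorical "season"
month_to_season = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Fall", 10: "Fall", 11: "Fall"}
df["season"] = df["order_date"].dt.month.map(month_to_season)
print(df[["age", "season"]])
```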

Identifiers do not have analytic significance. However, they may be required as record identifiers when data updates are performed by identifier. When segmentation or prediction models are applied to database records, a record identifier variable is needed.

  • Identifiers: Identifiers such as employee identifiers, customer registration numbers, etc. Except for the one used for record identification, identifier variables are normally excluded in data mining. For fast updates, the record identifier variable should be indexed or serve as the primary key.

Finally, the following data type has no value in data mining and should not be included in the prepared data. Note that, to be useful for data mining, variables must lend themselves to statistical interpretation:

  • Textual descriptive information: Textual descriptions, such as book abstracts, addresses, product descriptions, etc., cannot be used in data mining directly. If they contain any useful information, it should be extracted and transformed into numerical or categorical variables.

Customer Information

For customers, the following variables might be useful in profiling, segmentation, and predictive modeling:

  • Demographic segmentation variables describe characteristics of customers and include age, gender, race, education, occupation, income, religion, marital status, family size, children, home ownership, socioeconomic status, and so on. Note that demographic segmentation normally refers to segmentation with these demographic variables.
  • Financial variables include income, asset values, property ownership, etc.
  • Geographic variables include various classification of geographic areas, for example, zip code, state, country, region, climate, population, and other geographical census data. Note that this information can come from national census data. For more, see geographic segmentation.
  • Psychographic segmentation variables describe life style, personality, values, attitudes, and so on. Note that psychographic segmentation normally refers to segmentation with these psychographic variables.
  • Behavioral segmentation variables include product usage rate and end use, brand loyalty, benefits sought, decision-making units, readiness-to-buy stage, and so on.
  • Past business history: Customers' past business track records can be extremely useful for segmentation. These may include total amounts purchased, purchasing frequency, (credit) default records, (insurance) claims, responsiveness to marketing campaigns, and so on.

2.1 Where to prepare your data?

Before collecting data, you need to decide where the data will be stored. If your data is only for simple statistical tallies and analysis, comma-separated text files editable in a text editor will be fine. However, most analytical requirements demand more than simple analysis; they often involve updating data fields, e.g., customer scoring, segmentation of data, etc. It is recommended to prepare data on a relational database system, as tables with a unique primary key field. Primary keys allow fast update operations; alternatively, record identifier fields need to be indexed. Commonly used database systems include MS Access, MS SQL Server, PostgreSQL, MySQL, etc., all of which are used by many data analysts.
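
As a minimal sketch of this setup, the following uses SQLite as a stand-in for the database systems mentioned above; the table, column names, and score value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect("analytics.db")
cur = conn.cursor()

# A customer table with a unique primary key so that score updates are fast.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id INTEGER PRIMARY KEY,   -- record identifier used for updates
        age         INTEGER,
        gender      TEXT,
        income      REAL,
        churn_score REAL                   -- field to be updated when scoring
    )
""")

# Updating a score by primary key uses the key's index and is therefore fast.
cur.execute("UPDATE customers SET churn_score = ? WHERE customer_id = ?", (0.83, 1001))
conn.commit()
conn.close()
```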

2.2 Enriching Internal data with External data

If available, external data can be useful in developing segmentation and predictive models. It can enrich internal data in the areas where your internal data is lacking. Examples of external data include:

  • Geographic and census data: Census data often contain valuable information for marketing. Commercial psychographic and geographic data sources are also available.
  • Vehicle models: Average repair costs, theft ratios, reliability test results, and so on.
  • Credit bureau data: data obtained from external credit bureaus.

2.3 Data Extraction and Transformation

Your database may not have data ready for analytical work. Data may have to be extracted from various databases and merged into a single database table. It is often the case that the most useful data fields do not exist in customer records as raw data; you may have to derive or transform them into useful information. This process is called feature extraction.
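
A minimal sketch of this merge-and-derive step with pandas follows; the tables and the derived features (total purchased, purchase frequency) are hypothetical examples.

```python
import pandas as pd

# Hypothetical raw tables: a customer master and a transaction log.
customers = pd.DataFrame({"customer_id": [1, 2], "gender": ["Female", "Male"]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [120.0, 80.0, 45.0],
})

# Derive per-customer features (feature extraction) ...
features = transactions.groupby("customer_id")["amount"].agg(
    total_purchased="sum", purchase_frequency="count").reset_index()

# ... and merge everything into a single mining table.
mining_table = customers.merge(features, on="customer_id", how="left")
print(mining_table)
```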

2.4 Data Cleaning

Most databases contain garbage. Software bugs often introduce garbage information into databases. Incomplete customer registrations may leave records that do not represent real customers, and categorical fields may contain values that are not part of the defined domain. Garbage information can distort statistics and results. More importantly, it can create extreme outliers with far larger or smaller values than normal ones. A few steps will help ensure cleaner data (a short sketch follows the list):

  1. Identify bogus records that do not represent a customer (or entity) and remove them.
  2. For numerical fields, find the minimum and maximum values and make sure that there are no outliers (or extreme values). You may use histograms or scatter plots to identify them.
  3. For categorical fields, there may be values that are not in the domain or are inconsistently coded. Gender, for example, consists of "Male" and "Female". Some records may contain "M" or "F", while others contain "Male" and "Female" correctly. Even worse, some may contain "jhasd", "cat", and so on. This often occurs due to software bugs in transactional processing systems. Standardize the coding and correct wrong values; if correction is not possible, replacing them with "NULL" may be the better alternative.
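
The sketch below illustrates steps 2 and 3 with pandas on a small, made-up table; the column names and garbage codes are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 999, 28],                  # 999 is an obvious outlier / garbage code
    "gender": ["Male", "F", "cat", "Female"],  # inconsistent and bogus codes
})

# Step 2: numerical fields -- inspect the range and blank out implausible values.
print(df["age"].describe())                            # min/max reveal the outlier
df["age"] = df["age"].where(df["age"].between(0, 120)) # outside the range -> NULL

# Step 3: categorical fields -- list observed codes, standardize, and null out garbage.
print(df["gender"].value_counts())
df["gender"] = df["gender"].replace({"M": "Male", "F": "Female"})
df["gender"] = df["gender"].where(df["gender"].isin(["Male", "Female"]))
```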

2.5 Target Variable Encoding

To optimize segmentation and modeling, there should be at least one variable that can be used to measure the performance of segmentation or predictive models. After all, your goal is to optimize segmentation and predictive models so that outcomes on these variables are maximized or minimized. They may be called optimization targets and include the following:

  • Credit risk scoring: credit default/delinquency amounts/frequencies.
  • Insurance risk scoring: insurance claim amounts/frequencies.
  • Direct mail marketing: Purchase amount and/or quantity.
  • Churn prevention: Churn indicator ("YES" or "NO").

Target variables are typically encoded in one or more of the following ways (a short sketch follows the list):

  1. "YES"/"NO": Most target variables can be coded as a binary categorical field. For example for insurance, if a customer has a claim history, he/she may be coded as "YES", "TRUE", etc. For no claims, "NO", "FALSE", etc.
  2. "0" and "1": Binary categorical variables may be linearized using 0 (for false) and 1 (for true). This is often used to have normalized values for numerical data. For example, "0" for 0 and negative values. "1" for any positive values.
  3. Original amount/quantity: This uses original numerical values as they are. For example, purchase amount, claim amount, default amount, etc.
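
A minimal pandas sketch of these three encodings, assuming a hypothetical "claim_amount" column:

```python
import pandas as pd

df = pd.DataFrame({"claim_amount": [0.0, 1250.0, 0.0, 300.0]})

# 1. Binary categorical target: "YES" if the customer has any claim, else "NO".
df["has_claim"] = df["claim_amount"].apply(lambda a: "YES" if a > 0 else "NO")

# 2. 0/1 linearized target: 1 for any positive amount, 0 otherwise.
df["claim_flag"] = (df["claim_amount"] > 0).astype(int)

# 3. Original amount: keep claim_amount as-is for regression-style targets.
print(df)
```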

2.6 Preparing training and validation sets

Profiling and segmentation use the whole data as a single mining set. For predictive modeling, however, you may need to prepare a training set and a number of cross-validation sets. A model is trained on the training set and verified over several other validation datasets. The following methods may be used to prepare these datasets (a minimal sketch follows the list):

  • Divide the population randomly. You may use random number generators.
  • Use the whole population but create multiple sets from different periods, e.g., monthly, quarterly, or yearly data. These sets normally share most data fields, but the target performance fields come from different periods.
  • Sometimes multiple datasets occur naturally. Use them!
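
The following is a minimal random-split sketch with pandas; the file name, split ratio, and the "period" column in the commented alternative are hypothetical.

```python
import pandas as pd

df = pd.read_csv("mining_table.csv")   # hypothetical prepared dataset

# Random split: roughly 70% training, 30% validation.
train = df.sample(frac=0.7, random_state=42)
valid = df.drop(train.index)

# Period-based alternative (assuming a "period" column exists):
# valid_q1 = df[df["period"] == "2024Q1"]
```
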
How to find useful variables?

How do you find useful variables for predictive modeling? This is a very important question. After all, the principle of predictive modeling is to produce segments that predict customer or population behaviors statistically. Naturally, correlation analysis is a good mechanism, since it can measure the degree of association between variables and outcomes systematically; note that it captures negative as well as positive relationships. In addition, neural network weight analysis can be used; it shows the importance of variables visually. For more, read Variable Relevancy Analysis / Principal Component Analysis. Software tools are available from CMSR Data Mining / Machine Learning / Rule Engine Software.
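
As a generic illustration of correlation analysis (not the CMSR tool itself), the sketch below ranks numeric variables by their correlation with a hypothetical binary "churn" target.

```python
import pandas as pd

df = pd.read_csv("mining_table.csv")             # hypothetical prepared dataset
target = (df["churn"] == "YES").astype(int)      # hypothetical 0/1 target variable

# Pearson correlation of each numeric factor with the target;
# large absolute values (positive or negative) indicate stronger association.
correlations = df.select_dtypes("number").corrwith(target)
print(correlations.sort_values(key=abs, ascending=False))
```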

Profiling & Segmentation

The Pareto Principle (aka the 80-20 rule) teaches us that 80% of outcomes come from 20% of the population. It is often the case that the ratio is 90-10: 90% of outcomes come from 10% of the population. The purpose of segmentation is to identify that 10 or 20% of the population and develop profiles for them.

  • Hotspot profiling analysis: This identifies the hottest segment profiles.
  • Neural clustering: Similar populations or customers are grouped together to form segments (see the sketch after this list).
  • Drill-down decision tree: A decision tree divides populations or customers into sub-segments in a way that maximizes the probability of each category of the target field.
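
The page's own techniques are the CMSR tools listed above; as a generic stand-in for clustering-based segmentation, the sketch below uses k-means from scikit-learn on hypothetical customer columns.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("mining_table.csv")             # hypothetical prepared dataset
X = StandardScaler().fit_transform(df[["age", "income", "total_purchased"]])

# Group similar customers into a handful of segments.
df["segment"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# Profile each segment, e.g., by its average purchase amount.
print(df.groupby("segment")["total_purchased"].mean())
```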

Predictive Modeling

With proper software tools and the right data on hand, developing predictive models is fairly straightforward. Predictive models are developed as follows (a generic sketch follows the steps):

  1. Correlation analysis: With the data collected, correlation analysis is performed to determine whether factor variables have a strong association with the target values. Factor variables with weak correlation may be excluded from model development, as they provide little predictive ability.
  2. Model configuration and training: Using the software tools, predictive models are configured with modeling variables and trained on the collected data.
  3. Model validation: Predictive models are validated using data not used for model development. It is highly desirable to test against a number of different validation datasets and to deploy models only if they predict consistently well across them. For training and validation purposes, the collected data may be divided into two or more datasets.
  4. Model integration: Models are integrated using Rule Engine. Model integration can incorporate different models. In addition, Rule engines provide conditional model application and exception rule handling capabilities.
  5. Apply to real data: Fully developed models are applied to data in two ways. One is to apply models directly from CMSR Data Mining / Machine Learning / Rule Engine Software.
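
The following is a generic scikit-learn sketch of steps 2 and 3 (training and validation), not the CMSR workflow; the file names, feature list, and "churn_flag" target are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")        # hypothetical prepared training set
valid = pd.read_csv("valid.csv")        # hypothetical validation set

features = ["age", "income", "purchase_frequency"]       # hypothetical factor variables
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["churn_flag"])           # step 2: training

# Step 3: validate on data not used for model development.
scores = model.predict_proba(valid[features])[:, 1]
print("validation AUC:", roc_auc_score(valid["churn_flag"], scores))
```
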
The impact of previous optimization

Most businesses have internal policies that filter out certain business transactions. When such data is used directly, training datasets will necessarily be biased, and it may not be possible to obtain training datasets that truthfully represent the population. In this case, it is desirable to apply the internal policies first and then apply predictive models developed specifically for the remaining segments.

Predictive modeling also introduces biases into future training datasets. Once businesses deploy predictive models, subsequent data will contain far fewer samples of certain types, which leads to inaccurate new models. Bagging with previous models may be a solution: apply previously deployed models along with the new models and combine the results. This may reduce biases significantly.
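
A minimal sketch of combining an old and a new model in this way, assuming scikit-learn-style models with predict_proba and a hypothetical equal weighting:

```python
import numpy as np

def combined_score(old_model, new_model, X, weight_old=0.5):
    """Blend the scores of a previously deployed model with a newly trained one
    to dampen the bias introduced once the old model started filtering data."""
    old = old_model.predict_proba(X)[:, 1]
    new = new_model.predict_proba(X)[:, 1]
    return weight_old * old + (1.0 - weight_old) * new
```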

The data skew issues

Predictive modeling can be extremely useful if it can predict rare events accurately. For example, the rates of insurance claims, credit defaults, and direct mail-order responses are very low, say, 10% or even 2%. Predictive models developed naively from such skewed data show very low accuracy in predicting the events of interest, say, less than 30% or even 10%.

Is this a limitation of predictive modeling in general? The simple answer is: absolutely not! You may have developed simplistic models using the wrong techniques. For example, decision tree (and other classification) techniques will show very poor performance on such data, and you will normally end up with totally useless models. To learn more about how to develop better models, read How to develop highly accurate predictive models.

For more information, please read the following pages:

If you need predictive modeling tools, please read Predictive Modeling Software. Software download is available from that page.