The Cookbook for Predictive Modeling and Machine Learning
This page provides a comprehensive guide to predictive analytics:
profiling, segmentation, and predictive modeling with machine learning. It covers
everything from data preparation to model deployment and real-world application.
The key factors for success
The most important factor in a successful implementation
of predictive analytics is the availability of useful information, especially
historical data.
Such data does not exist everywhere. Only a limited number of
industries have the right data for predictive modeling, e.g.,
insurance, credit finance, and direct marketing.
The second factor is intimate knowledge of business domains and data.
The third factor, which may be the most important one, is intuition
and insight driven by experience. It cannot
be obtained without the second factor, and it can only be gained
by developing real predictive models. Experimentation
provides the real
heuristic knowledge. You cannot learn to drive without actually
driving a car! The same is true for predictive analytics.
If you need predictive modeling tools, please read Predictive Modeling Software.
Software download is available from that page.
1. Concept Design
First, goals and final outputs are clearly defined.
Examples include developing profiles of the most valued customers,
developing a segmentation that yields the most efficient marketing
campaigns, or developing predictive models that identify
customers who are at risk of churning.
Once the goal is defined, methods and tools are selected, e.g.,
profiling, segmentation, or predictive modeling.
The goals and techniques selected determine
the data preparation requirements of the next step.
2. Data Preparation
Once goals and outputs are defined, the next step is to collect and prepare data.
In general, data fall into the following two major categories:
- Numerical data: Also known as linear data. Numerical data represent
measurable numerical values such as age, height, weight, quantities, amount,
numbers of years, etc.
- Categorical data: Also known as nominal data. A categorical variable has
a finite number of values. For example, gender has "Male" and "Female".
Other examples are vocation, job position, religion. They have a finite
number of pre-determined values.
Boolean variables may be handled as categorical variables: "YES" or "NO" categories.
In addition, the following data may be included after transformation:
- Date and time: Date and time is a special data type that may not be used
directly in analytics. This requires transformation into numerical or categorical
values. For example, date of birth can be transformed into numerical "age". Calendar months
can be translated into categorical "seasons": Spring, Summer, Fall, and Winter.
- Identifiers: Identifiers such as employee identifiers, customer registration
numbers, etc. do not have analytic significance. However, one may be required
as a record identifier when data is updated by identifier.
When segmentation or prediction models are applied to database records,
a record identifier variable is needed.
Except for the one used for record identification, identifier variables are
normally excluded from data mining.
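The date and time transformations described above (date of birth to a numerical "age", calendar month to a categorical "season") can be sketched as follows. This is only an illustration: the function names are hypothetical, and the month-to-season mapping assumes Northern Hemisphere meteorological seasons, which may not suit every market.

```python
from datetime import date


def age_in_years(date_of_birth: date, as_of: date) -> int:
    """Whole years elapsed between date_of_birth and as_of."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Assumption: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def season_of(d: date) -> str:
    """Translate a date into a categorical season label."""
    return SEASON_BY_MONTH[d.month]


dob = date(1980, 7, 15)
print(age_in_years(dob, date(2024, 7, 14)))  # 43 (birthday not yet reached)
print(season_of(date(2024, 10, 3)))          # Fall
```

Age should be computed relative to a fixed reference date (for example, the data extraction date) so that all records are comparable.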
Finally, the following data type has no direct value in data mining
and should not be included in prepared data.
To be useful for data mining, variables must
lend themselves to statistical interpretation:
- Textual descriptive information: Textual descriptions, such as
book abstracts, addresses, product descriptions, etc., cannot be
used in data mining directly. If they contain any useful information,
it should be extracted and transformed into either numerical
or categorical variables.
For customers, the following variables might be useful in profiling,
segmentation and predictive modeling:
- Demographic segmentation variables describe characteristics of customers and
include age, gender, race, education, occupation, income, religion,
marital status, family size, children, home ownership, socioeconomic status, and so on.
Note that demographic segmentation normally refers to segmentation with these demographic variables.
- Financial variables include income, asset values, property ownership, etc.
- Geographic variables include various classifications of geographic
areas, for example, zip code, state, country, region, climate, population,
and other geographical census data. Note that this information can come from
national census data. For more, see geographic
- Psychographic segmentation variables describe life style, personality, values, attitudes,
and so on.
Note that psychographic segmentation normally refers to segmentation with these psychographic variables.
- Behavioral segmentation variables include product usage rate and end, brand loyalty,
benefit sought, decision making units, ready-to-buy stage, and so on.
- Past business history: Customers' past business track records
can be extremely useful for segmentation. This may include total amounts purchased,
purchasing frequency, (credit) default records, (insurance) claims, responsiveness
to marketing campaigns, and so on.
2.1 Where to prepare your data?
Before collecting data, you need to decide where to store it.
If your data is for simple statistical tallies and analysis, comma-separated
files editable in a text editor will be fine. However, most
analytical requirements demand more than simple analysis. They often involve
updating data fields, e.g., customer scoring, segmentation of data, etc.
It is recommended to prepare data in a relational database system, as tables with a
unique primary key field. Primary keys allow fast update operations.
Commonly used database systems include MS Access, MS SQL Server, PostgreSQL, and MySQL,
all of which are widely used by data analysts.
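As a rough sketch of the recommendation above, the following uses SQLite (through Python's built-in sqlite3 module) in place of the database systems named in the text. The table name, column names, and score values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- unique record identifier
        age         INTEGER,
        gender      TEXT,
        score       REAL                  -- filled in later by a model
    )
    """
)
conn.executemany(
    "INSERT INTO customer (customer_id, age, gender) VALUES (?, ?, ?)",
    [(1, 34, "Female"), (2, 51, "Male"), (3, 27, "Female")],
)

# Hypothetical scores produced elsewhere, keyed by the primary key.
scores = {1: 0.82, 2: 0.15, 3: 0.47}
conn.executemany(
    "UPDATE customer SET score = ? WHERE customer_id = ?",
    [(score, cid) for cid, score in scores.items()],
)
conn.commit()

rows = conn.execute(
    "SELECT customer_id, score FROM customer ORDER BY customer_id"
).fetchall()
print(rows)  # [(1, 0.82), (2, 0.15), (3, 0.47)]
```

Because each update locates its row through the primary key, writing scores or segment labels back to the table stays fast even as the table grows.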
2.2 Enriching Internal data with External data
If available, external data can be useful in developing segmentation and predictive models.
It can fill in what your internal data lacks.
Examples of external data may include:
- Geographic and census data: Census data often contain valuable information
for marketing. Commercial psychographic and geographic information is also available.
- Vehicle models: Average repair costs, theft ratios, reliability test results,
and so on.
2.3 Data Extraction and Transformation
Your database may not have data ready for analytical work.
Data may have to be extracted from various databases and merged into
a single customer table. Often the most useful data
fields do not exist in customer records as raw data. You may have to
derive them or transform raw data into useful information.
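The merge-and-derive step can be sketched as below. The input structures, field names, and the function name build_customer_table are assumptions made for this example; in practice the same result is often produced with SQL joins and aggregates.

```python
from collections import defaultdict

# Hypothetical raw inputs from two separate sources.
customers = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 27}}
purchases = [(1, 120.0), (1, 80.0), (2, 40.0)]  # (customer_id, amount)


def build_customer_table(customers, purchases):
    """Merge purchase transactions into one row per customer with derived fields."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for customer_id, amount in purchases:
        totals[customer_id] += amount
        counts[customer_id] += 1

    table = []
    for customer_id, fields in sorted(customers.items()):
        row = {"customer_id": customer_id, **fields}
        # Derived fields: these do not exist as raw data in either source.
        row["total_purchased"] = totals.get(customer_id, 0.0)
        row["purchase_count"] = counts.get(customer_id, 0)
        table.append(row)
    return table


table = build_customer_table(customers, purchases)
for row in table:
    print(row)
```

Customers with no transactions still get a row, with zero totals, so that the merged table covers the whole population.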
2.4 Data Cleaning
Most databases contain garbage.
Software bugs often introduce garbage information into databases.
In addition, incomplete customer registrations may leave records
that do not represent customers, and categorical fields
may contain values outside their valid domain. Garbage
information can distort statistics and results. More importantly,
it can create extreme outliers with far larger or smaller
values than normal ones. A few steps will ensure cleaner data:
- Identify bogus records that do not represent a customer (or entity)
and remove them.
- For numerical fields, find the minimum and maximum values. Make
sure that there are no outliers (or extreme values)
in numerical fields. You may
use histograms or scatter plots to identify them.
- For categorical fields, there might be values that do not exist
in the domain or that are inconsistently coded. Gender, for example,
consists of "Male" and
"Female". Some records may contain "M" or "F", while others
contain "Male" and "Female" correctly. Even worse, some might
even contain "jhasd", or "cat", and so on. This often occurs
due to software bugs in transaction processing systems.
Identify and correct wrong data. If correction is not possible,
replacing it with "NULL" values may be a better alternative.
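The categorical and numerical checks above might look like the following minimal sketch. The mapping table, the function names, and the 0 to 120 age range are assumptions for illustration. Python's None stands in for a database NULL.

```python
# Known codings mapped to the valid domain values.
GENDER_CODES = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}


def clean_gender(raw):
    """Map known codings to domain values; anything else becomes None (NULL)."""
    if raw is None:
        return None
    return GENDER_CODES.get(raw.strip().lower())


def out_of_range(values, low, high):
    """Return the values outside [low, high], skipping missing ones."""
    return [v for v in values if v is not None and not (low <= v <= high)]


raw_genders = ["Male", "F", "female ", "jhasd", "cat", None]
cleaned = [clean_gender(g) for g in raw_genders]
print(cleaned)  # ['Male', 'Female', 'Female', None, None, None]

ages = [34, 51, 27, 999, -1, None]
print(min(a for a in ages if a is not None),
      max(a for a in ages if a is not None))  # -1 999
print(out_of_range(ages, 0, 120))             # [999, -1]
```

Flagged values should be reviewed before correction, since an extreme value is sometimes legitimate rather than garbage.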
2.5 Target Variable Encoding
To optimize segmentation and modeling, there should be at least one
variable that can be used to measure the performance of the segmentation or
models. After all, your goal is to optimize segmentation and
models to maximize or minimize outcomes on these variables.
These variables may be called optimization targets,
and may include the following:
- Credit scoring: credit default/delinquency amounts/frequencies.
- Insurance scoring: insurance claim amounts/frequencies.
- Direct mail marketing: Purchase amount and/or quantity.
- Churn prevention: Churn indicator ("YES" or "NO").
Target variables are typically encoded using one or more of the following methods:
- "YES"/"NO": Most target variables can be coded as a binary categorical field.
For example, in insurance, a customer with a claim history may
be coded as "YES", "TRUE", etc., and a customer with no claims as "NO", "FALSE", etc.
- "0" and "1": Binary categorical variables may be linearized using 0 (for false)
and 1 (for true). This is often used to obtain normalized values from numerical
data. For example, "0" for 0 and negative values, and "1" for any positive values.
- Original amount/quantity: This uses original numerical values as they are.
For example, purchase amount, claim amount, default amount, etc.
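The three encodings listed above can be sketched as follows, using hypothetical insurance claim amounts. The function names are illustrative. The cut-off (positive versus zero or negative) follows the "0" and "1" rule described in the text.

```python
def encode_yes_no(amount):
    """Binary categorical target: "YES" for any positive amount, else "NO"."""
    return "YES" if amount > 0 else "NO"


def encode_binary(amount):
    """Linearized binary target: 1 for any positive amount, 0 otherwise."""
    return 1 if amount > 0 else 0


# Hypothetical claim amounts; the original values can also be used as they are.
claim_amounts = [0.0, 1250.0, 0.0, 310.5, -20.0]

yes_no = [encode_yes_no(a) for a in claim_amounts]
binary = [encode_binary(a) for a in claim_amounts]
print(yes_no)  # ['NO', 'YES', 'NO', 'YES', 'NO']
print(binary)  # [0, 1, 0, 1, 0]
```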
2.6 Preparing training and validation sets
Profiling and segmentation use the whole data as a single mining set.
For predictive modeling, however, you may need to prepare a training set and a
number of cross-validation sets. A model is trained using the training set and verified
over several other validation datasets. The following methods may be used in preparing them:
- Divide populations randomly. You may use random number generators.
- You may use whole populations. But create multiple sets using data from
different periods, e.g., monthly, quarterly, yearly information, etc.
These sets normally share most data fields, but the target performance
fields come from different periods.
- Sometimes multiple datasets occur naturally. Use them!
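The first method, random division with a random number generator, can be sketched as below. The 60% training fraction, the number of validation sets, and the fixed seed are assumptions chosen for the example, not recommendations from the text.

```python
import random


def random_split(records, train_fraction=0.6, n_validation=2, seed=42):
    """Randomly divide records into one training set and several validation sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)

    n_train = round(len(shuffled) * train_fraction)
    training = shuffled[:n_train]
    rest = shuffled[n_train:]
    # Deal the remaining records round-robin into the validation sets.
    validation = [rest[i::n_validation] for i in range(n_validation)]
    return training, validation


records = list(range(100))  # stand-in for 100 customer records
training, validation = random_split(records)
print(len(training), [len(v) for v in validation])  # 60 [20, 20]
```

Every record lands in exactly one set, so validation results are not contaminated by records the model was trained on.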
How to find useful variables?
How do you find useful variables for predictive modeling?
This is very important in predictive modeling. After all,
the principle of predictive modeling is
to produce segments that predict customer or population behavior statistically.
Naturally, correlation analysis is a good mechanism since it can
measure the degree of association between variables and outcomes systematically.
Note that it can measure both negative and positive relationships.
In addition, neural network weight analysis can be used. It shows
the importance of variables visually.
For more information, read Variable Relevancy Analysis - Correlation Analysis.
Software tools are available from CMSR
Data Mining / Machine Learning Software.
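As an illustration of correlation analysis for variable selection, the sketch below computes the Pearson correlation coefficient between each candidate variable and a 0/1 target, then ranks variables by absolute correlation so that strong negative relationships count as much as strong positive ones. Pearson correlation is one common choice and is an assumption here; the text does not say which measure is intended. The data and names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear association
    return cov / sqrt(var_x * var_y)


def rank_variables(columns, target):
    """Rank variables by the absolute value of their correlation with the target."""
    scored = [(name, pearson(values, target)) for name, values in columns.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


target = [0, 0, 1, 1, 1, 0]  # e.g. 1 = defaulted
columns = {
    "late_payments": [0, 1, 3, 4, 5, 1],
    "age": [45, 52, 23, 31, 28, 60],
    "shoe_size": [9, 9, 9, 9, 9, 9],
}
ranked = rank_variables(columns, target)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

Correlation captures only linear association, so a variable with a low coefficient may still be useful in a nonlinear model.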
Profiling & Segmentation
The Pareto Principle (aka the 80-20 rule) teaches us that 80% of outcomes
come from 20% of the population.
Often the ratio is 90-10: 90% of outcomes come from 10% of the
population. The purpose of segmentation is to identify that 10% or 20% of the population
and develop profiles for them.
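A quick way to check how concentrated outcomes are in your own data is to compute the share contributed by the top slice of the population. The sketch below uses made-up revenue figures, and the function name is hypothetical.

```python
def top_share(outcomes, top_fraction=0.2):
    """Fraction of the total outcome contributed by the top slice of the population."""
    ranked = sorted(outcomes, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Hypothetical revenue per customer for ten customers.
revenue = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
print(top_share(revenue, 0.2))  # 0.8: the top 20% of customers bring 80% of revenue
print(top_share(revenue, 0.1))  # 0.5
```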
- Hotspot profiling analysis: This is to identify
hottest segment profiles.
- Neural clustering: similar populations
or customers are grouped together to form segments.
- Drill-down decision tree:
Decision tree divides populations or customers into sub segments in
such a way that can maximize probability of each categories of
With proper software tools and the right data on hand,
developing predictive models is fairly straightforward.
Predictive models are developed as follows:
- Correlation analysis:
With the data collected, correlation analysis is
performed to determine whether factor variables have a strong association with the
target values. Factor variables with weak correlation may be excluded
when developing models, as they provide little predictive ability.
- Model configuration and training: From software tools,
predictive models are configured with modeling variables and trained
using the collected data.
- Model validation: Predictive models are validated using data
not used for model development. It is highly desirable to test with a
number of different
validation datasets and deploy models only if they consistently
predict well over the different datasets.
For training and validation purposes, collected data might be divided
into two or more datasets.
- Model integration: Models are integrated using
Rule Engines. Model integration can
incorporate different models. In addition, Rule engines provide
conditional model application and exception rule handling capabilities.
- Apply to real data: Fully developed models are applied
to data in two ways. One is to apply models directly from CMSR
Data Mining Software.
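The validation step above, deploying a model only if it predicts consistently well over several validation datasets, can be expressed as a simple check. Everything here is an assumption for illustration: the toy model stands in for a trained model, and the thresholds (minimum accuracy 0.7, maximum spread 0.15) are arbitrary.

```python
def accuracy(model, dataset):
    """Share of (features, label) pairs that the model classifies correctly."""
    correct = sum(1 for features, label in dataset if model(features) == label)
    return correct / len(dataset)


def consistent_enough(model, validation_sets, min_accuracy=0.7, max_spread=0.15):
    """Return (deploy?, scores): every set must score well and scores must be close."""
    scores = [accuracy(model, ds) for ds in validation_sets]
    ok = min(scores) >= min_accuracy and max(scores) - min(scores) <= max_spread
    return ok, scores


# Hypothetical stand-in for a trained model.
def toy_model(features):
    return "YES" if features["late_payments"] >= 3 else "NO"


validation_sets = [
    [({"late_payments": 4}, "YES"), ({"late_payments": 0}, "NO"),
     ({"late_payments": 3}, "YES"), ({"late_payments": 1}, "NO")],
    [({"late_payments": 5}, "YES"), ({"late_payments": 2}, "YES"),
     ({"late_payments": 0}, "NO"), ({"late_payments": 1}, "NO")],
]
deploy, scores = consistent_enough(toy_model, validation_sets)
print(deploy, scores)  # False [1.0, 0.75]
```

Here the model is rejected even though both scores exceed the minimum, because the gap between the datasets suggests its performance is not stable.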
The impact of previous optimization
Most businesses have internal policies that filter out certain business
When such data is used directly, this necessarily biases training datasets. It may not be
possible to obtain training datasets that are truly representative.
In this case, it is desirable to apply internal policies first, then
apply predictive models developed specifically for the resulting segments.
Predictive modeling also introduces biases in future training datasets.
Once businesses deploy predictive models, subsequent data will contain far
fewer samples of certain types. This will lead to
inaccurate predictive models. Bagging with previous models may
be a solution. That is, apply previously deployed models along with
new models and combine the results. This may reduce biases significantly.
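One simple way to combine a previously deployed model with a new one is a weighted average of their scores, sketched below. The averaging rule, the two toy models, and their score values are assumptions for illustration; the text does not prescribe a particular way to combine results.

```python
def combine_scores(models, features, weights=None):
    """Weighted average of the scores returned by several models."""
    if weights is None:
        weights = [1.0] * len(models)  # equal weighting by default
    total_weight = sum(weights)
    return sum(w * m(features) for w, m in zip(weights, models)) / total_weight


# Hypothetical stand-ins for a previously deployed model and a newly trained one.
def previous_model(features):
    return 0.8 if features["late_payments"] >= 2 else 0.1


def new_model(features):
    return 0.4 if features["late_payments"] >= 2 else 0.2


customer = {"late_payments": 3}
combined = combine_scores([previous_model, new_model], customer)
print(round(combined, 2))  # 0.6
```

The previous model was trained before the deployed model started filtering the population, so it retains knowledge of sample types that are now scarce in the new training data.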
The data skew issues
Predictive modeling can be extremely useful if it can predict rare
events accurately. For example, the rates of insurance claims, credit defaults,
and direct mail-order marketing responses are very low, say, 10% or even 2%.
Predictive models developed using such skewed events show very low
accuracy in predicting that an event will happen, say, less than 30%, or even 10%.
Is this a limitation of predictive modeling in general? The simple answer
is absolutely not! You may have developed simplistic models using the wrong
techniques. For example, decision tree (and other classification)
techniques will show very poor performance. You will normally end up with
totally useless models. To learn more about how you can develop better
models, read How to develop highly accurate predictive models.
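To see why skewed data is deceptive, it helps to measure accuracy on the rare event separately from overall accuracy. The sketch below uses precision and recall for the "YES" class as one possible reading of "accuracy in predicting that an event will happen". That interpretation, the 2% event rate, and the sample predictions are assumptions for illustration.

```python
def overall_accuracy(actual, predicted):
    """Share of all records predicted correctly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def precision_recall(actual, predicted, positive="YES"):
    """Precision and recall for the rare (positive) class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 2% event rate: 2 events among 100 records.
actual = ["YES"] * 2 + ["NO"] * 98

# A useless model that never predicts the event still looks 98% accurate.
always_no = ["NO"] * 100
print(overall_accuracy(actual, always_no))  # 0.98
print(precision_recall(actual, always_no))  # (0.0, 0.0)

# A model that catches one of the two events at the cost of four false alarms.
some_hits = ["YES", "NO"] + ["YES"] * 4 + ["NO"] * 94
print(precision_recall(actual, some_hits))  # (0.2, 0.5)
```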