Rosella       Machine Intelligence & Data Mining

Deviation Analysis and Deviation Detection

Deviation analysis can reveal surprising facts hidden inside data. CMSR Data Miner provides tools for detecting deviations, anomalies, and outliers. Detection is useful for several reasons:

  • Knowledge discovery: such information is often a vital part of important business decisions and scientific discoveries.
  • Auditing: examining such information can reveal problems and malpractice.
  • Fraud detection: fraudulent claims often carry inconsistent information, which can expose fraud cases.
  • Data cleaning: such information may come from data-entry mistakes that should be corrected.

Hotspot Analysis

Hotspot Analysis drills down into data to detect outliers. More specifically, it detects patterns of outliers, defined in terms of profile conditions. Outliers can have extremely high or low averages, probabilities, etc. With CMSR Data Miner, you can proceed as follows:

  • Search for hotspot profiles.
  • Query the database using the hotspot profiles and examine the result rows.
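
The profile search can be illustrated outside CMSR with a small brute-force sketch. This is not CMSR's algorithm; the records, the field names (`age`, `home`, `bad`) and the `min_support` threshold are hypothetical. It groups rows by every one- and two-field profile condition and ranks the profiles by the average of the target field.

```python
from itertools import combinations
from statistics import mean

# Hypothetical credit records; "bad" is 1 for a defaulted account.
ROWS = [
    {"age": "young", "home": "rent", "bad": 1},
    {"age": "young", "home": "rent", "bad": 1},
    {"age": "young", "home": "own",  "bad": 0},
    {"age": "old",   "home": "rent", "bad": 0},
    {"age": "old",   "home": "own",  "bad": 0},
    {"age": "old",   "home": "own",  "bad": 0},
    {"age": "young", "home": "rent", "bad": 0},
    {"age": "old",   "home": "rent", "bad": 1},
]

def hotspots(rows, fields, target, min_support=2, max_depth=2):
    """Return (profile, average, count) tuples, highest average first."""
    found = []
    for depth in range(1, max_depth + 1):
        for combo in combinations(fields, depth):
            # Group target values by the field values that define a profile.
            groups = {}
            for row in rows:
                key = tuple((f, row[f]) for f in combo)
                groups.setdefault(key, []).append(row[target])
            for profile, values in groups.items():
                # Ignore profiles backed by too few rows.
                if len(values) >= min_support:
                    found.append((profile, mean(values), len(values)))
    return sorted(found, key=lambda item: item[1], reverse=True)

top = hotspots(ROWS, ["age", "home"], "bad")
```

Here the first entry of `top` is the riskiest profile. Its conditions can then be turned into a database query to pull the matching rows for inspection.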

The following figure shows a CMSR Hotspot Analysis drill-down tree for credit risk data. The data is drilled down by successively refining it in search of higher credit risk segments.

Figure: Credit Hotspot Profiling of Risky Credit Segments

Predictive Modeling (by Machine Learning) and Anomaly Detection

Predictive modeling techniques, such as decision trees, regression, rule induction and neural networks, can be used to detect deviations. The principal idea is to develop machine learning models that predict generally expected values, and then to compare actual values against the model-generated expected values. If the differences are large, anomalies are detected. With CMSR Data Miner/Machine Learning Studio, this works as follows:

  • Build predictive models for targeted fields using other fields as induction fields.
  • Apply the models to data in the database and save the results onto database rows.
  • Compare actual and predicted values and identify records with large differences. For numerical fields, ratios of values can be used. You can perform this easily with simple SQL statements.
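
The compare step can be sketched as follows. This is a minimal illustration, not the CMSR workflow: a least-squares line stands in for the predictive model, and the fields `income` and `claim`, the sample records, and the `max_ratio` threshold are assumptions. A record is flagged when the ratio of actual to predicted value is unusually high or low.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_deviations(records, max_ratio=2.0):
    """Return ids whose actual/predicted ratio falls outside the allowed band."""
    slope, intercept = fit_line(
        [r["income"] for r in records], [r["claim"] for r in records]
    )
    flagged = []
    for r in records:
        predicted = slope * r["income"] + intercept
        ratio = r["claim"] / predicted
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flagged.append(r["id"])
    return flagged

# Hypothetical records; id 4 claims far more than similar incomes do.
RECORDS = [
    {"id": 1, "income": 10, "claim": 1},
    {"id": 2, "income": 20, "claim": 2},
    {"id": 3, "income": 30, "claim": 3},
    {"id": 4, "income": 40, "claim": 12},
    {"id": 5, "income": 50, "claim": 5},
    {"id": 6, "income": 60, "claim": 6},
]
suspects = flag_deviations(RECORDS)
```

In this sketch the outlier also influences the fitted line. A real model would normally be trained on a larger data set, so a single deviant record would have less effect on the expected values.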

Another approach is to use models that predict anomalies directly, as follows:

  • Build predictive models for targeted fields using other fields as induction fields. Note that the target fields contain abnormality status information.
  • Apply the models to data in the database and save the abnormality status onto database rows.
  • Sort database rows by abnormality status value and print them. You can perform this easily with simple SQL statements.
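
The final sorting step might look like the sketch below, which uses Python's built-in `sqlite3` module in place of a production database. The table name `claims`, the column `abnormal_score`, the sample scores and the 0.5 cutoff are all hypothetical; the scores stand in for values a model has already saved onto the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (id INTEGER PRIMARY KEY, amount REAL, abnormal_score REAL)"
)
# Pretend these scores were written back by a predictive model.
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, 120.0, 0.05), (2, 9800.0, 0.93), (3, 450.0, 0.40), (4, 7600.0, 0.78)],
)
# List the most suspicious rows first, keeping only scores above the cutoff.
suspects = conn.execute(
    "SELECT id, abnormal_score FROM claims "
    "WHERE abnormal_score >= ? ORDER BY abnormal_score DESC",
    (0.5,),
).fetchall()
conn.close()
```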

Cross Tables and Hidden Patterns

CMSR Data Miner supports powerful deviation detection methods for cross tables. It shows deviation in terms of overperforming and underperforming cells, based on chi-square statistics. These methods can reveal patterns and information hidden inside cross table numbers. Note that this tool is also available as Excel Addin Cross Tables.
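
One common way to express cell-level deviation is the Pearson residual, (observed - expected) / sqrt(expected), whose squares sum to the chi-square statistic. The sketch below uses that textbook formula as an assumption; the source does not say exactly which statistic CMSR computes per cell. The sample table is hypothetical.

```python
from math import sqrt

def cell_residuals(table):
    """Pearson residual per cell; positive means more than expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand
            out.append((observed - expected) / sqrt(expected))
        residuals.append(out)
    return residuals

# Hypothetical counts: rows are regions, columns are products.
TABLE = [[30, 10], [10, 30]]
RES = cell_residuals(TABLE)
CHI_SQUARE = sum(r * r for row in RES for r in row)
```

Cells with large positive residuals are the overperforming ones and cells with large negative residuals are the underperforming ones.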

Clustering

Clustering objects by similarity and analyzing the resulting clusters can reveal further problem entities. With CMSR Data Miner, you can proceed as follows:

  • Cluster objects based on similarity.
  • Examine clusters using cluster visualization tools.
  • Apply the clustering model to a database table and examine the identified records.
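
As a rough illustration of the idea, the sketch below runs plain k-means and then treats unusually small clusters as candidates for inspection. The choice of k-means, the starting centroids, the sample points and the size cutoff are all assumptions; the source does not state which clustering method CMSR uses.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D objects: two dense groups and one isolated point.
POINTS = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (20, 1)]
centroids, clusters = kmeans(POINTS, [(1, 1), (8, 8), (20, 1)])

# Clusters with very few members are candidates for closer examination.
small = [members for members in clusters if len(members) < 2]
```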

Deviation Detection in Time-Series Trend Data

Rule-based automation can be used to detect deviant trends automatically. RME-EP is a rule engine that supports various time-series regression and statistical functions, which makes it well suited to automating time-series trend deviation detection. For example, the following questions can be implemented easily:

  • Which KPIs fluctuate wildly in the series data? This can be detected using the coefficient of determination and the regression coefficient of variance.
  • Which KPIs exceed the expected values predicted by regression by more than the average of absolute errors?
  • Which KPIs fall below the values expected by regression by more than the standard deviation?
  • Which KPIs grow linearly by more than 2% on average, with a linear coefficient of determination of 0.25 or more?
  • Which KPIs decrease linearly by more than 2% on average, with a linear coefficient of determination of 0.25 or more?
  • Which KPIs grow or decrease steadily and linearly, with a correlation coefficient over 0.8 or below -0.8?
  • Which KPIs grow or decrease steadily and exponentially, with a coefficient of determination of 0.7 or more and a growth rate over 1%?
  • Which KPIs start declining?
  • And so on.
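
Several of the questions above reduce to a handful of regression statistics. The Python sketch below is an illustration only and does not use RME-EP. The KPI names and values are hypothetical, and "growth per period" is assumed here to mean the fitted slope divided by the series mean, which may differ from the definition RME-EP applies.

```python
from math import sqrt
from statistics import mean

def linear_trend(series):
    """Fit y = intercept + slope * t over t = 0..n-1 and return trend statistics."""
    xs = range(len(series))
    mx, my = mean(xs), mean(series)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in series)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, series))
    slope = sxy / sxx
    r = sxy / sqrt(sxx * syy) if syy else 0.0  # correlation coefficient
    return {
        "slope": slope,
        "r": r,
        "r2": r * r,            # coefficient of determination
        "growth": slope / my,   # assumed definition: slope relative to the mean
    }

# Hypothetical KPI series, oldest value first.
KPIS = {
    "sales":  [100, 103, 106, 109, 112],
    "visits": [50, 80, 40, 90, 45],
    "churn":  [20, 19, 18, 17, 16],
}
stats = {name: linear_trend(values) for name, values in KPIS.items()}

# KPIs that grow or decrease by more than 2% with r2 of 0.25 or more.
growing = sorted(k for k, s in stats.items() if s["growth"] > 0.02 and s["r2"] >= 0.25)
declining = sorted(k for k, s in stats.items() if s["growth"] < -0.02 and s["r2"] >= 0.25)
```

A series such as `visits`, which swings widely without any trend, shows up with a coefficient of determination near zero and lands in neither list.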

Rules of this type can be easily implemented with the RME-EP rule engine. It requires KPI data to be stored in a relational database table. In addition, you need to learn SQL-style coding. RME-EP is available from CMSR Data Miner/Machine Learning Studio. The following is a coding example. CMSR Download has the full code in the model "KPI Trends Detection Scan" of the "Demo" project. The model implements some of the detections listed above. RME-EP can update the KPI database table and/or write to a CSV file. More code examples can be found here.

/*
 * Flag a big jump if the difference between the current value and the
 * predicted value is bigger than the rate (1.5) times the regression
 * average absolute error.
 */
// Evaluation status for big jump: 'Yes' if detected
DECLARE "State-BigJump"  AS STRING OUTPUT INITIAL VALUE 'No'; 

RULE BigJumpDetection:
IF ABS("Current" - TIMESERIES(1, (LINEAR, EXPONENTIAL), 1, "Series")) > 
	TIMESERIES(AVG_ABS_ERROR, (LINEAR, EXPONENTIAL), 1, "Series")
	 *  1.5 THEN
	SET "State-BigJump" AS 'Yes'
END;
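
For readers without RME-EP at hand, the same test can be approximated in Python. This is a sketch under stated assumptions, not a translation of the rule: it fits only a linear trend (the rule above names both LINEAR and EXPONENTIAL), it predicts one step past the end of the history, and the function name `big_jump` and the sample series are hypothetical.

```python
from statistics import mean

def big_jump(series, current, rate=1.5):
    """True if `current` is further from the linear one-step-ahead
    prediction than `rate` times the average absolute fitting error."""
    n = len(series)
    xs = list(range(n))
    mx, my = mean(xs), mean(series)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, series))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    # Average absolute error of the fitted line over the history.
    avg_abs_error = mean(abs(y - (intercept + slope * x)) for x, y in zip(xs, series))
    predicted = intercept + slope * n  # one step past the last observation
    return abs(current - predicted) > rate * avg_abs_error

HISTORY = [10, 12, 11, 13, 12, 14]
jumped = big_jump(HISTORY, 25)     # far above the trend
steady = big_jump(HISTORY, 14.5)   # close to the trend
```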