Wednesday, July 23, 2014


According to Berry and Linoff, data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. This definition, justifiably, raises the question: how does data mining differ from OLAP? OLAP (Online Analytical Processing) is undoubtedly a semiautomatic means of analyzing data, but the main difference lies in the quantities of data that can be handled.

There are other differences as well. Tables 1 and 2 summarize these differences.

Table 1: OLAP vs Data Mining (Past vs Future)

| OLAP: Report on the past | Data Mining: Predict the future |
|---|---|
| Who were our 100 best customers over the last three years? | Which 100 customers offer the best profit potential? |
| Which customers defaulted on their mortgages in the last two years? | Which customers are likely to be bad credit risks? |
| What were the sales by territory last quarter compared to the targets? | What are the anticipated sales by territory and region for next year? |
| Which salespersons sold more than their quota during the last four quarters? | Which salespersons are expected to exceed their quotas next year? |
| Last year, which stores exceeded the total prior year sales? | For the next two years, which stores are likely to have the best performance? |
| Last year, which were the top five promotions that performed well? | What is the expected return for next year’s promotions? |
| Which customers switched to other phone companies last year? | Which customers are likely to switch to the competition next year? |

Table 2: Differences between OLAP and Data Mining

| Feature | OLAP | Data Mining |
|---|---|---|
| Motivation for information request | What is happening in the enterprise? | Predict the future based on why this is happening |
| Data granularity | Summary data | Detailed transaction-level data |
| Number of business dimensions | Limited number of dimensions | Large number of dimensions |
| Number of dimension attributes | Small number of attributes | Many dimension attributes |
| Sizes of datasets for the dimensions | Not large for each dimension | Usually very large for each dimension |
| Analysis approach | User-driven, interactive analysis | Data-driven, automatic knowledge discovery |
| Analysis techniques | Multidimensional, drill-down, and slice-and-dice | Prepare data, launch mining tool, and sit back |
| State of the technology | Mature and widely used | Still emerging; some parts of the technology more mature |
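To make the OLAP side of this comparison concrete, here is a minimal sketch of user-driven roll-up, drill-down, and slice operations on a handful of records. The data, the dimension names, and the helpers `roll_up` and `slice_by` are all hypothetical and written only for illustration; a real OLAP tool works against a multidimensional store, not a Python list.

```python
from collections import defaultdict

# Hypothetical records: (territory, quarter, product, amount).
sales = [
    ("North", "Q1", "phone", 120.0),
    ("North", "Q1", "plan", 80.0),
    ("North", "Q2", "phone", 150.0),
    ("South", "Q1", "phone", 90.0),
    ("South", "Q2", "plan", 60.0),
]

# Position of each (assumed) dimension inside a record.
DIMENSIONS = {"territory": 0, "quarter": 1, "product": 2}

def roll_up(records, *dims):
    """Sum the amount over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[DIMENSIONS[d]] for d in dims)
        totals[key] += rec[3]
    return dict(totals)

def slice_by(records, dim, value):
    """Keep only the records where one dimension has a fixed value."""
    return [rec for rec in records if rec[DIMENSIONS[dim]] == value]

# Summary level: total sales per territory.
by_territory = roll_up(sales, "territory")
# Drill-down: add the quarter dimension to the same question.
by_territory_quarter = roll_up(sales, "territory", "quarter")
# Slice: fix the quarter to Q1, then summarize by territory.
q1_by_territory = roll_up(slice_by(sales, "quarter", "Q1"), "territory")
```

Every one of these results answers a question the analyst chose to ask about the past. Nothing here discovers a pattern on its own, which is the contrast the table draws with data mining.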

Why Now?
Why is data mining being put to use in more and more businesses? Here are some basic reasons:
In today’s world, an organization generates more information in a week than most people can read in a lifetime. It is humanly impossible to study, decipher, and interpret all that data to find useful patterns.
A data warehouse pools all the data after proper transformation and cleansing into well-organized data structures. Nevertheless, the sheer volume of data makes it impossible for anyone to use analysis and query tools to discern useful patterns.
In recent times, many data mining tools suitable for a wide range of applications have appeared in the market. The tools and products are now mature enough for business use.
Data mining needs substantial computing power. Parallel hardware, databases, and other powerful components are available and are becoming very affordable.
Organizations are placing enormous emphasis on building sound customer relationships, and for good reasons. Companies want to know how they can sell more to existing customers. Organizations are interested in determining which of their customers will prove to be of long-term value to them. Companies need to discover any existing natural classifications among their customers so that each such class may be properly targeted with products and services. Data mining enables companies to find answers and discover patterns in their customer data.
Finally, competitive considerations weigh heavily on organizations to get into data mining. Perhaps competitors are already using data mining.

Data Mining Techniques

Data mining covers a broad range of techniques. Each technique has been heavily researched in recent years, and several mature and efficient algorithms have evolved for each of them. The main techniques are: cluster detection, decision trees, memory-based reasoning, link analysis, rule induction, association rule discovery, outlier detection and analysis, neural networks, genetic algorithms, and sequential pattern discovery. The algorithms associated with these techniques are not discussed here for two main reasons: first, they are too mathematical and technical in nature; second, numerous well-written textbooks serve the needs of those who are especially interested in the subject.

Table 3 below summarizes the important features of some of these techniques. The model structure refers to how the technique is perceived, not how it is actually implemented. For example, a decision tree model may actually be implemented through SQL statements. In the framework, the basic process is the process performed by the particular data mining technique. For example, decision trees perform the process of splitting at decision points. How a technique validates the model is important. In the case of neural networks, the technique does not contain a validation method to determine termination. The model calls for processing the input records through the different layers of nodes and terminates the discovery at the output node.
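As an illustration of what "splitting at decision points" can look like, the sketch below picks the attribute to split on by information gain, one common entropy-based criterion. It is not taken from any particular tool; the function names, the customer records, and the labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels):
    """Pick the attribute whose split yields the largest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))

# Hypothetical customer records with a "defaulted" label for each.
rows = [
    {"income": "low", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "high", "owns_home": "yes"},
]
labels = ["yes", "yes", "no", "no"]

split_attribute = best_split(rows, labels)
```

With this toy data, splitting on `income` separates the two classes completely, while splitting on `owns_home` leaves each branch as mixed as before, so `best_split` chooses `income`. A full decision tree would repeat the same choice recursively inside each branch.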

Table 3: Summary of Data Mining Techniques

| Data Mining Technique | Underlying Structure | Basic Process | Validation Method |
|---|---|---|---|
| Cluster Detection | Distance calculation in n-vector space | Grouping of values in the same neighbourhood | Cross validation to verify accuracy |
| Decision Trees | n-ary tree | Splits at decision points based on entropy | Cross validation |
| Memory-Based Reasoning | Predictive structure based on distance and combination functions | Association of unknown instances with known | Cross validation |
| Link Analysis | | Discover links among variables by their values | Not applicable |
| Neural Networks | Forward propagation | Weighted inputs of predictors at each node | Not applicable |
| Genetic Algorithms | Fitness functions | Survival of the fittest on mutation of derived values | Mostly cross validation |
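The "distance and combination functions" structure summarized above is commonly realized as a nearest-neighbour classifier, the usual form of memory-based reasoning. The sketch below assumes Euclidean distance as the distance function and a majority vote as the combination function; both are illustrative choices, and the names `distance`, `combine`, and `classify` as well as the sample records are hypothetical.

```python
import math
from collections import Counter

def distance(a, b):
    """Distance function: Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combine(neighbour_labels):
    """Combination function: majority vote among the neighbours' labels."""
    return Counter(neighbour_labels).most_common(1)[0][0]

def classify(known, unknown, k=3):
    """Associate an unknown record with its k nearest known records."""
    nearest = sorted(known, key=lambda item: distance(item[0], unknown))[:k]
    return combine([label for _, label in nearest])

# Hypothetical known instances: (age, income in thousands) -> credit risk.
known = [
    ((25, 30), "bad"),
    ((27, 32), "bad"),
    ((30, 28), "bad"),
    ((45, 90), "good"),
    ((50, 85), "good"),
    ((48, 95), "good"),
]

risk_a = classify(known, (47, 88))
risk_b = classify(known, (26, 31))
```

In practice the attributes would be scaled before computing distances so that no single attribute dominates, and the value of `k` would be chosen by cross validation, the validation method the table lists for this technique.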

Data Mining Applications
Data mining technology encompasses a rich collection of proven techniques that cover a wide range of applications in both the commercial and non-commercial realms. In some cases, multiple techniques are used, back to back, to greater advantage. For instance, a cluster detection technique to identify clusters of customers may be followed by a predictive algorithm applied to some of the identified clusters to discover the expected behaviour of the customers in those clusters.
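The back-to-back use of two techniques described above can be sketched as follows: a plain k-means pass stands in for cluster detection, and a deliberately naive per-cluster repeat-purchase rate stands in for the predictive algorithm. The customer data, the choice of k-means, the initial centroids, and all function names are assumptions made for illustration only.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to centroids, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Hypothetical customers: (visits per month, average basket value), and
# whether each one bought again in the following quarter (1 = yes, 0 = no).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 78)]
bought_again = [0, 0, 1, 1, 1, 1]

# Step 1: cluster detection, seeded with two of the points as centroids.
centroids = k_means(customers, [customers[0], customers[3]])

# Step 2: a naive predictor per cluster, the observed repeat-purchase rate.
totals = [[0, 0] for _ in centroids]
for customer, outcome in zip(customers, bought_again):
    i = nearest(customer, centroids)
    totals[i][0] += outcome
    totals[i][1] += 1
repeat_rate = [hits / count for hits, count in totals]

def predict(customer):
    """Expected repeat-purchase rate for a new customer, via its cluster."""
    return repeat_rate[nearest(customer, centroids)]
```

Calling `predict((9, 82))` assigns the new customer to the high-spend cluster and returns that cluster's observed rate. A real application would replace the second step with a proper predictive model trained separately on each cluster of interest.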

Non-commercial use of data mining is strong and pervasive in the research area. In oil exploration and research, data mining techniques discover locations suitable for drilling based on potential mineral and oil deposits. Pattern discovery and matching techniques have military applications in helping to identify targets. Medical research is a field ripe for data mining. The technology helps researchers discover correlations between diseases and patient characteristics. Crime investigation agencies use the technology to connect criminal profiles to crimes. In astronomy and cosmology, data mining helps predict cosmic events.

The scientific community makes use of data mining to a moderate extent, but the technology has widespread applications in the commercial arena. Most of the tools target the commercial sector. Consider the following list of a few major applications of data mining in the business area.

Customer Segmentation: This is one of the most widespread applications. Businesses use data mining to understand their customers. Cluster detection algorithms discover clusters of customers sharing the same characteristics.
Market Basket Analysis: This is a very useful application for the retail industry. Association rule algorithms uncover affinities between products that are bought together. Other businesses such as upscale auction houses use these algorithms to find customers to whom they can sell higher-value items.
Risk Management: Insurance companies and mortgage businesses use data mining to uncover risks associated with potential customers.
Fraud Detection: Credit card companies use data mining to discover abnormal spending patterns of customers. Such patterns can expose fraudulent use of the cards.
Delinquency Tracking: Loan companies use the technology to track customers who are likely to default on repayments.
Demand Prediction: Retail and other businesses use data mining to match demand and supply trends and to forecast demand for specific products.
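For the market basket analysis item in the list above, the affinity between products is usually expressed through the support and confidence of a rule. The sketch below computes both for pairs of items by brute force. The baskets, the thresholds, and the function names are hypothetical; production tools use more efficient algorithms over far larger data.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also
    contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def pair_rules(min_support=0.4, min_confidence=0.7):
    """List rules (x, y), read as 'x implies y', that pass both thresholds."""
    items = sorted(set().union(*baskets))
    found = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            if (support({x, y}) >= min_support
                    and confidence({x}, {y}) >= min_confidence):
                found.append((x, y))
    return found

rules = pair_rules()
```

With these baskets, only the bread and butter pair passes both thresholds, in both directions. Butter implies bread with confidence 1.0, because every basket with butter also has bread, while bread implies butter with a confidence of about 0.75.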

Table 4: Application of Data Mining Techniques

| Application Area | Examples of Mining Functions | Mining Processes | Mining Techniques |
|---|---|---|---|
| Fraud Detection | Credit Card Frauds; Internal Audits; Warehouse Pilferage | Determination of Variation from Norms | Data Visualization; Memory-based Reasoning; Outlier Detection and Analysis |
| Risk Management | Credit Card Upgrades; Mortgage Loans; Customer Retention; Credit Rating | Detection and Analysis of Association; Affinity Grouping | Decision Trees; Memory-based Reasoning; Neural Networks |
| Market Analysis | Market Basket Analysis; Target Marketing; Cross Selling; Customer Relationship | Predictive Modeling; Database Segmentation | Cluster Detection; Decision Trees; Association Rules; Genetic Algorithms |
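The fraud detection entry in Table 4 relies on determining variation from norms. One simple way to sketch that idea is a z-score test of a new transaction against a customer's past spending. The threshold of three standard deviations, the function name `is_outlier`, and the spending history are illustrative assumptions, not a description of how any card issuer actually works.

```python
import statistics

def is_outlier(history, amount, threshold=3.0):
    """Flag an amount lying more than `threshold` sample standard
    deviations away from the mean of the historical amounts."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)  # needs at least two data points
    if spread == 0:
        # All past amounts identical: any different amount is a deviation.
        return amount != mean
    return abs(amount - mean) / spread > threshold

# Hypothetical card spending history for one customer.
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

typical = is_outlier(history, 49.0)    # close to the usual spending
suspicious = is_outlier(history, 900.0)  # far outside the usual range
```

A flagged transaction is only a candidate for review. Real spending is rarely normally distributed, so practical systems combine several signals instead of relying on a single threshold like this one.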

