12CE10015 TP2 .pdf
Original filename: 12CE10015_TP2.pdf
Author: sai krishna
This PDF 1.5 document has been generated by Microsoft® Word 2013, and has been sent on pdf-archive.com on 08/04/2015 at 18:14, from IP address 203.110.x.x.
DATA MINING TECHNIQUES, COMPARISON & THEIR USES
CH. SAI KRISHNA
Roll Number 12CE10015
Department of Civil Engineering
Indian Institute of Technology, Kharagpur – 721302, India
This paper surveys the approaches, techniques and areas of interest that are important in the field of
data mining. The concept of data mining is summarized, and its significance and methodologies are
illustrated. Nowadays, large organizations and multinational companies distributed across the world
generate and operate on large volumes of data, and strategic decisions are made by taking all of this
data into account. In today’s environment, however, efficiency or speed is not the only key to
competitiveness. Hence, to analyze and manage data and to make strategic decisions, we require data
mining techniques. A comparison of these techniques, their situational usage and further research
interests are discussed.
Data mining is the process of extracting information from large sets of data using techniques and
algorithms drawn from the fields of statistics, machine learning and database management systems.
Traditional data analysis methods rely on manual interpretation of the data, which is slow, expensive
and highly subjective.
Rigorous data collection and new storage technologies have made it possible for organizations to
accumulate large amounts of data at minimal expense. Exploiting and analyzing this stored data in
order to extract useful and actionable information for decision making is the overall goal of the activity
termed ‘data mining’. It is an interdisciplinary subfield of computer science that involves the
computational processing of large data sets and the discovery of patterns in them. The goal of this
advanced analysis process is to extract information from a data set and transform it into an
understandable structure for further use. The methods used lie at the intersection of artificial
intelligence, machine learning, statistics, database systems and business intelligence. Data mining is
about solving problems by analyzing data already present in databases.
Data mining broadly has five major elements:
Extraction, transformation and loading of transaction data onto the data warehouse system.
Storing and managing the data in a single or multidimensional database system.
Providing data access to analysts and information technology professionals.
Analyzing the data using software of choice.
Presenting the data in a useful, appealing and easily understandable format.
Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive
mining characterizes the general properties of the data already present in the database. It relies only on
statistical concepts and mainly summarizes the data points, making it possible to study important aspects
of the data set. Predictive mining performs inference on the existing data: it finds patterns in the data and
uses them to make predictions about future behavior and to anticipate the consequences of change.
Predictive data mining is the most common type of data mining and the one with the most direct business
applications. When data mining is carried out without prior knowledge of what is being searched for, it is
called Exploratory Data Analysis.
Methods of Data Mining
Genetic Algorithms
In the field of artificial intelligence (AI), a genetic algorithm is a search heuristic that mimics the process of
natural selection and is routinely used to generate useful solutions to optimization and search
problems. The general idea behind genetic algorithms is that a better solution can be built by
combining the good parts of other solutions, just as DNA is recombined in living beings. Genetic
algorithms belong to the larger class of evolutionary algorithms, which generate solutions to optimization
problems using techniques inspired by natural evolution.
A genetic algorithm starts by generating an initial population at random, so that the entire range of
possible solutions in the search space is allowed. Individuals are then selected using a fitness function,
with fitter solutions typically being more likely to be selected. The next population of solutions is
produced by applying the genetic operators (crossover and mutation) to the selected individuals.
Crossover is the process of taking more than one parent solution and producing a new solution (child
solution) from them. Parents are generally selected by fitness-proportionate or tournament selection.
Several crossover techniques exist, such as one-point crossover, two-point crossover, cut and splice,
uniform crossover and half-uniform crossover.
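As a minimal sketch of two of the crossover techniques named above, the following Python code applies one-point and uniform crossover to bit-string parents. The function names, the list-of-bits representation and the example parents are assumptions made for illustration only.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at one random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


def uniform_crossover(parent_a, parent_b, rng):
    """Take each gene independently from either parent with probability 0.5."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
mother = [0, 0, 0, 0, 0, 0]
father = [1, 1, 1, 1, 1, 1]
child_1, child_2 = one_point_crossover(mother, father, rng)
mixed = uniform_crossover(mother, father, rng)
```

With these parents, each one-point child is a run of genes from one parent followed by a run from the other, while the uniform child may interleave the two parents at any position.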
Mutation is used to maintain genetic diversity from one generation of the population to the next. Mutation
alters one or more gene values from their initial state, so a mutated solution may differ entirely from the
previous solution. Types of mutation include bit string mutation, flip bit, uniform and non-uniform
mutation.
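Bit string mutation can be sketched as follows, assuming individuals are lists of 0/1 genes. The function name and the mutation rates are hypothetical choices for illustration; a rate of 1.0 inverts every gene, which corresponds to the flip bit variant.

```python
import random


def bit_string_mutation(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
original = [0, 1, 0, 1, 0, 1, 0, 1]
unchanged = bit_string_mutation(original, 0.0, rng)  # rate 0: nothing flips
inverted = bit_string_mutation(original, 1.0, rng)   # rate 1: every bit flips
mutated = bit_string_mutation(original, 0.1, rng)    # a typical small rate
```

A small rate keeps most of the parent's genes intact while still introducing the occasional change that crossover alone could not produce.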
These genetic operators produce a new generation of individuals by recombining features of their
parents. Eventually a generation of individuals is interpreted back into the original problem domain, and
the fittest individual represents the solution.
The search process is repeated until a termination condition is reached. Common termination conditions
are that a solution is found that satisfies the minimum criteria, that a fixed number of generations is
reached, or that the allocated budget is exhausted.
Figure 1. Flow chart of Genetic Algorithm
Figure 1 represents the steps of a typical genetic algorithm.
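The steps described above (random initial population, fitness-based selection, crossover, mutation and termination) can be put together in a short sketch. The toy objective (maximizing the number of 1-bits), the parameter values, the use of tournament selection and the decision to carry the best individual over unchanged are all assumptions made for illustration, not details taken from the figure.

```python
import random


def fitness(individual):
    """Toy objective (an assumption): the number of 1-bits in the string."""
    return sum(individual)


def tournament(population, rng, size=3):
    """Return the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)


def evolve(n_bits=20, pop_size=30, max_generations=200, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population over the whole search space.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(max_generations):  # terminate after a fixed number of generations
        if fitness(best) == n_bits:   # or when the minimum criterion is met
            break
        next_population = [best]      # keep the best individual (assumed elitism)
        while len(next_population) < pop_size:
            # Step 2: fitness-based selection of two parents.
            parent_a = tournament(population, rng)
            parent_b = tournament(population, rng)
            # Step 3: one-point crossover followed by bit mutation.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


best, history = evolve()
```

Because the best individual is always carried over, the best fitness recorded in `history` never decreases from one generation to the next.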
Uses of Genetic Algorithms
Genetic algorithms are used in a wide range of areas, such as automotive design, robotics, evolvable
hardware, software testing, optimized telecommunication routing, computer gaming, encryption and
code breaking, financial investment strategies, and marketing and merchandising.
Limitations of Genetic Algorithms
A solution is only better in comparison with other solutions. As a result, the termination criterion is
not clear in every problem.
They may tend to converge towards local optima or arbitrary points rather than the global
optimum.
Decision Trees
Decision trees are predictive models that map observations about an item to conclusions about the item's
target value. They are one of the predictive modelling approaches used in statistics, data mining and
machine learning. A decision tree is a flow-chart-like structure in which each internal node denotes a
test on an attribute value, each branch represents an outcome of the test, and the leaves represent
classes or class distributions. A decision tree is most often used for classification. It partitions
the data set into cells, each belonging to one class, and this partition is represented as a sequence of tests.
An input instance is classified by starting at the root node and, depending on the
results of the tests, following the appropriate branches until a leaf node is reached.
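The classification procedure just described can be sketched in a few lines of Python. The tree below, its attribute names and its class labels are hypothetical and chosen only to illustrate the walk from the root to a leaf.

```python
# Hypothetical tree: internal nodes test one attribute, branches are the
# possible outcomes of that test, and leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "stay in", False: "play"},
        },
    },
}


def classify(node, instance):
    """Start at the root and follow the branch chosen by each test to a leaf."""
    while isinstance(node, dict):
        outcome = instance[node["attribute"]]
        node = node["branches"][outcome]
    return node


label = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": False})
```

Each instance is tested only on the attributes along its own path, so an overcast day is classified without looking at humidity or wind at all.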
A decision tree can thus be viewed as a tree in which each branch is a classification question and the
leaves represent partitions of the data set with their classification. From a business perspective, decision
trees can be viewed as generating a segmentation of the original data set. These segments are then used
for predictive study. The resulting models are easy to understand because each segment is labelled with a
description of its characteristics.
Decision trees are simple to understand and interpret, require little data preparation, can handle both
numerical and categorical data, and perform well with large data sets under standard conditions.
Limitations of Decision Trees
They can create over-complex trees that do not generalize well from the training data (known
as overfitting). Methods like pruning are necessary to avoid this problem.
For data with categorical variables that have different numbers of levels, decision trees favor the
attributes with more levels; this bias can be avoided by the conditional inference approach.
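The bias towards attributes with many levels can be seen in a small sketch that assumes the tree chooses its splits by information gain, a common criterion that the text does not name explicitly. The six records are invented for illustration: the `id` attribute has a distinct level for every record and carries no general knowledge, yet it receives the highest possible gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Invented records: "id" is unique per row, "windy" has only two levels.
rows = [
    {"id": 1, "windy": True, "play": "no"},
    {"id": 2, "windy": True, "play": "no"},
    {"id": 3, "windy": True, "play": "yes"},
    {"id": 4, "windy": False, "play": "yes"},
    {"id": 5, "windy": False, "play": "yes"},
    {"id": 6, "windy": False, "play": "no"},
]
gain_id = information_gain(rows, "id", "play")
gain_windy = information_gain(rows, "windy", "play")
```

Splitting on `id` separates the training data perfectly, so its gain equals the full entropy of the target, but such a split would be useless for any record not already in the data set.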
Figure 2. Structure of a Decision Tree
Neural Networks
A Neural Network (NN) or Artificial Neural Network (ANN) is an information processing paradigm that
is inspired by the way biological nervous systems, such as the brain, process information. The key
element of this paradigm is the novel structure of the information processing system. Neural networks are
used to estimate or approximate functions that can depend on a large number of inputs and are generally
unknown. They compute values from inputs and adjust their weights to produce the desired output. A simple
neural network consists of input nodes, hidden layers and an output node, connected by weighted
links. A neural network resembles the brain to some extent. It can replace an explicit
mathematical model and trains itself to perform the task of prediction. This powerful predictive modelling
technique creates very complex models that are difficult even for experts to understand.
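How such a network computes an output from its inputs can be sketched as a forward pass through one hidden layer. The sigmoid activation, the layer sizes and the hand-picked weights below are assumptions for illustration; in practice the weights would be learned during training.

```python
from math import exp


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input nodes -> one hidden layer -> a single output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical weights chosen by hand, not learned from data.
hidden_weights = [[0.5, -0.5], [-0.5, 0.5]]  # one row of weights per hidden node
hidden_biases = [0.0, 0.0]
output_weights = [1.0, 1.0]
output_bias = -1.0
y = forward([1.0, 1.0], hidden_weights, hidden_biases, output_weights, output_bias)
```

Every node performs the same simple operation, a weighted sum followed by a nonlinearity, and the behavior of the whole network is determined entirely by the values of the weights.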
Figure 3. Simple Architectural representation of Neural Network
Advantages of Neural Networks
They act more like a real nervous system.
Their parallel organization permits solutions to problems where multiple constraints must be satisfied.
They are more robust because of their weights and perform well in noisy environments.
They improve their performance by learning from the data.
A low error rate and a high degree of accuracy can be achieved after training.
When dealing with large amounts of data, multilayer perceptrons, which have several intermediate hidden
layers, are generally used. A multilayer perceptron can model more complex relationships than a
single-layer perceptron. The performance of a model under different input parameters can be cross-checked.
Artificial neural networks have become a powerful tool in tasks such as pattern recognition, decision
problems and prediction applications. They are among the newer signal processing technologies. An ANN
is an adaptive, nonlinear system that learns to perform a function from data. The adaptive phase is
normally the training phase, during which the system parameters are changed.
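The training phase, in which the parameters change in response to data, can be illustrated with the classic perceptron learning rule applied to a single neuron. This rule, the logical AND data set, the learning rate and the epoch count are assumptions chosen to keep the sketch small; they are not a method prescribed by the text.

```python
def train_perceptron(samples, epochs=100, learning_rate=0.1):
    """Adjust weights and bias whenever a prediction disagrees with its target."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction  # -1, 0 or +1
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Threshold unit: output 1 when the weighted sum is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


# Logical AND: linearly separable, so a single neuron can learn it.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
outputs = [predict(weights, bias, x) for x, _ in and_samples]
```

Once every sample is classified correctly the error is zero for all of them, so the parameters stop changing and the adaptive phase is effectively over.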
Regression
In statistics, regression analysis is a statistical process for estimating the relationships among variables. It
includes many techniques for modelling and analyzing several variables when the focus is on the
relationship between a dependent variable and one or more independent variables.
Three types of regression model are distinguished: linear, polynomial and logistic regression. The
model type attribute indicates the type of regression used.
Linear and stepwise-polynomial regression are designed for numeric dependent variables that have a
continuous spectrum of values. These models should contain exactly one regression table. The
normalization method and target category attributes are not used in that case.
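For the linear case, a single regression table amounts to one intercept and one coefficient per independent variable. A minimal sketch of fitting such a table by ordinary least squares for one independent variable is given below; the data points and the function name are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Invented observations of a numeric dependent variable.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 5.0  # predicted value for a new input
```

For these points the fitted slope is about 1.94 and the intercept about 1.15, so the prediction is itself a continuous numeric value and needs no conversion into a probability.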
Logistic regression is designed for categorical dependent variables. These models should contain exactly
one regression table for each target category. The normalization method describes whether and how the
prediction is converted into a probability.
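The idea of one regression table per target category, followed by a normalization step, can be sketched as follows. Softmax is used here as one possible normalization method, and the two categories and their coefficients are hypothetical values chosen for illustration.

```python
from math import exp


def category_scores(features, tables):
    """Evaluate one linear regression table per target category."""
    return {
        category: table["intercept"]
        + sum(c * x for c, x in zip(table["coefficients"], features))
        for category, table in tables.items()
    }


def softmax(scores):
    """Normalize raw scores into probabilities that sum to one."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {category: exp(s - peak) for category, s in scores.items()}
    total = sum(exps.values())
    return {category: e / total for category, e in exps.items()}


# Hypothetical tables: one intercept and one coefficient per category.
tables = {
    "low": {"intercept": 1.0, "coefficients": [-1.0]},
    "high": {"intercept": -1.0, "coefficients": [1.0]},
}
probabilities = softmax(category_scores([1.0], tables))
```

With an input of 1.0 both categories receive the same raw score and therefore the same probability; larger inputs shift the probability towards the "high" category.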
Figure 4. Graphical representation of Linear & Logistic models
Regression models are relatively easy to develop, and their quality can be assessed by visualizing the
distribution of the data points. In general, the more data points there are, the more reliable the regression
model is, although problems such as overfitting may still occur, typically when the model is too complex
for the data.
Rule Extraction
The catalogue of rule extraction uses three main criteria for the evaluation of algorithms: the scope of
use, the type of dependency and the format of the extracted description. The first criterion concerns the
scope of use of an algorithm, either regression or classification. The second criterion concerns the
dependency of the extraction algorithm on the underlying model (independent versus dependent
algorithms). The third criterion concerns the format of the obtained rules (predictive versus descriptive
algorithms).
A rule usually consists of two parts: a left-hand precursor (antecedent) and a right-hand subsequent
(consequent). A precursor can have one or more conditions, all of which must be true for the subsequent
to be true with a given accuracy, whereas a subsequent is just a single condition. Thus, when mining a
rule from a database, the precursor, subsequent, accuracy and coverage are all targeted.
Accuracy – How often is the rule correct?
Coverage – How often does the rule apply?
Just because a pattern in the database is expressed as a rule does not mean that it is always true. As with
other data mining algorithms, it is therefore important to identify and make explicit the uncertainty in
the rule. This is called accuracy. The coverage of the rule means how much of the database it “covers” or
applies to.
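Accuracy and coverage as defined above can be computed directly from a set of records. In the sketch below, the records and the rule "if bread then butter" are invented for illustration: coverage is the fraction of records to which the precursor applies, and accuracy is the fraction of those records for which the subsequent also holds.

```python
# Invented transaction records for the hypothetical rule "bread -> butter".
records = [
    {"bread": True, "butter": True},
    {"bread": True, "butter": True},
    {"bread": True, "butter": False},
    {"bread": False, "butter": True},
    {"bread": False, "butter": False},
]


def evaluate_rule(records, precursor, subsequent):
    """Return (accuracy, coverage) of the rule 'if precursor then subsequent'."""
    matching = [r for r in records if precursor(r)]
    coverage = len(matching) / len(records)
    if not matching:
        return 0.0, coverage  # the rule never applies, so it is never correct
    accuracy = sum(1 for r in matching if subsequent(r)) / len(matching)
    return accuracy, coverage


accuracy, coverage = evaluate_rule(
    records,
    precursor=lambda r: r["bread"],
    subsequent=lambda r: r["butter"],
)
```

Here the rule applies to three of the five records (coverage 0.6) and is correct for two of those three (accuracy of about 0.67), which shows that a rule can apply widely without always being true.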
Important criteria for rule extraction:
Comprehensibility: The extent to which extracted representations are humanly comprehensible.
Fidelity: The extent to which extracted representations accurately model the networks from
which they were extracted.
Accuracy: The ability of extracted representations to make accurate predictions on previously
unseen cases.
Scalability: The ability of the method to scale to networks with large input spaces and large
numbers of weighted connections.
Generality: The extent to which the method requires special training.
Pros and Cons
Each of the methodologies discussed above has its own utilities and drawbacks. Genetic algorithms
converge very slowly because their mutation is unguided. This problem can be overcome by combining
them with search algorithms that perform a guided search, such as differential evolution. Overfitting and
long optimization times are other disadvantages of genetic algorithms. For better performance, instead of
a simple weighted approach, an intermediary intelligent system, for example one based on neural
networks or fuzzy logic, can be added and optimized.
Decision trees express the acquired knowledge in a readable form. Their main advantage is
interpretability: the information is easy to understand and process. Their bias towards highly branched
splits is a drawback, and overfitting may also occur. Decision trees have been applied successfully in
many areas, such as medical diagnosis and credit risk prediction.
Neural networks are well known for their adaptive learning, real-time operation and ability to process
complicated and imprecise data. Their applications include character recognition, image processing,
classification problems, stock market prediction, security applications and feature extraction.
At present, data mining is a new, fast-growing and important area of research. The extent to
which different methodologies are applied in so many areas is astonishing. Artificial neural
networks are an easy-to-apply and suitable method for solving data mining problems, as they offer
good robustness, self-organizing adaptivity, parallel processing and distributed storage,
along with a high degree of fault tolerance. New applications and new fields find
solutions using these methodologies every day.
This paper is largely theoretical, covering various techniques and methodologies of data mining
and explaining their advantages and contexts of usage.
New methodologies in data mining are being developed by combining two or more existing
methods. The availability of high-end technologies and computing power enables us to develop more
efficient and powerful data mining methods. In a fast-paced world where terabytes of new data are
generated on a daily basis, there is a need for simple, fast and yet effective tools with which strategic
decisions can be made.