Introduction to Data Mining and Machine Learning

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases.

Aim of the book:

    “The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability. If you wish to experiment with your own data, you will be able to do this easily with the Weka software.”


Data Mining and Machine Learning

Data mining is about solving problems by analyzing data already present in databases.

Data mining is defined as the process of discovering patterns in data.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

  1. As a black box whose innards are effectively incomprehensible, and
  2. As a transparent box whose construction reveals the structure of the pattern.

Both, we are assuming, make good predictions.

Patterns that capture the decision structure in an explicit way are called structural patterns.

The rules do not really generalize from the data; they merely summarize it.

Problems with datasets:

  1. The set of examples given as input is far from complete.
  2. Values are not specified for all the features in all the examples.
  3. Rules appear to classify the examples correctly, whereas often, because of errors or noise in the data, misclassifications occur even on the data that is used to train the classifier.

One proposed definition of learning:

Things learn when they change their behavior in a way that makes them perform better in the future.

A set of rules that are intended to be interpreted in sequence is called a decision list (Classification Rules).
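
The "interpreted in sequence" part is what distinguishes a decision list from an unordered rule set: the first rule whose condition holds decides the class, and later rules are never consulted. A minimal sketch, assuming a hypothetical weather-style dataset; the attribute names, the rules, and the default class are illustrative and not taken from these notes:

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, object]
Rule = Tuple[Callable[[Example], bool], str]

# Hypothetical rules for an illustrative dataset; their order matters.
DECISION_LIST: List[Rule] = [
    (lambda e: e["outlook"] == "sunny" and e["humidity"] == "high", "no"),
    (lambda e: e["outlook"] == "rainy" and bool(e["windy"]), "no"),
    (lambda e: e["outlook"] == "overcast", "yes"),
    (lambda e: e["humidity"] == "normal", "yes"),
]
DEFAULT_CLASS = "yes"  # assumed fallback when no rule fires


def classify(example: Example,
             rules: List[Rule] = DECISION_LIST,
             default: str = DEFAULT_CLASS) -> str:
    """Return the class of the first rule whose condition holds."""
    for condition, label in rules:
        if condition(example):
            return label
    return default
```

A rainy, windy day with normal humidity matches both the second and the fourth rule, but only the second one is applied because it comes first. Taken out of the list and read in isolation, the fourth rule would give the opposite answer.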

A problem in which the attributes take numeric values is called a numeric-attribute problem. If only some of the attributes are numeric and the rest are not, it is called a mixed-attribute problem.

Rules that strongly associate different attribute values are called association rules.

People frequently use machine learning techniques to gain insight into the structure of their data rather than to make predictions for new cases.

One challenge in machine learning is deciding on the most natural and easily understood format for the output of a machine learning scheme.

The process of determining regression equation weights is called regression.
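
As an illustration of determining the weights, the sketch below fits a regression equation with a single numeric attribute, y = w0 + w1 * x. It assumes ordinary least squares as the fitting criterion, which is a common choice but is not specified in these notes; the function name `fit_line` is hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1
```

With several attributes the same idea applies, but the weights are usually found with a linear algebra routine rather than these closed-form sums.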

The fact that the decision structure is comprehensible is a key feature in the successful adoption of the application.


Fielded Applications

Decisions Involving Judgment:

When you apply for a loan, you have to fill out a questionnaire that asks for relevant financial and personal information. This information is used by the loan company as the basis for its decision as to whether to lend you money. Such decisions are typically made in two stages. First, statistical methods are used to determine clear “accept” and “reject” cases. The remaining borderline cases are more difficult and call for human judgment. For example, one loan company uses a statistical decision procedure to calculate a numeric parameter based on the information supplied in the questionnaire. Applicants are accepted if this parameter exceeds a preset threshold and rejected if it falls below a second threshold. This accounts for 90% of cases, and the remaining 10% are referred to loan officers for a decision. On examining historical data on whether applicants did indeed repay their loans, however, it turned out that half of the borderline applicants who were granted loans actually defaulted. Although it would be tempting simply to deny credit to borderline customers, credit industry professionals pointed out that if only their repayment future could be reliably determined it is precisely these customers whose business should be wooed; they tend to be active customers of a credit institution because their finances remain in a chronically volatile condition. A suitable compromise must be reached between the viewpoint of a company accountant, who dislikes bad debt, and that of a sales executive, who dislikes turning business away.
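
The two-threshold procedure described above can be sketched as follows. The function name, the score scale, and the threshold values used in the usage note are hypothetical; the notes only say that a numeric parameter is compared against two preset thresholds.

```python
def screen_applicant(score: float, accept_above: float, reject_below: float) -> str:
    """First-stage screening: clear accept, clear reject, or refer to a loan officer."""
    if reject_below > accept_above:
        raise ValueError("reject_below must not exceed accept_above")
    if score > accept_above:
        return "accept"
    if score < reject_below:
        return "reject"
    # Borderline case: between the two thresholds, left to human judgment.
    return "refer"
```

For example, with assumed thresholds of 0.7 and 0.3, a score of 0.5 is referred to a loan officer. These referred cases are the borderline examples on which the machine learning procedure was later trained.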

    Enter machine learning. The input was 1000 training examples of borderline cases for which a loan had been made that specified whether the borrower had finally paid off or defaulted. For each training example, about 20 attributes were extracted from the questionnaire, such as age, years with current employer, years at current address, years with the bank, and other credit cards possessed. A machine learning procedure was used to produce a small set of classification rules that made correct predictions on two-thirds of the borderline cases in an independently chosen test set. Not only did these rules improve the success rate of the loan decisions, but the company also found them attractive because they could be used to explain to applicants the reasons behind the decision. Although the project was an exploratory one that took only a small development effort, the loan company was apparently so pleased with the result that the rules were put into use immediately.


Screening Images:

Since the early days of satellite technology, environmental scientists have been trying to detect oil slicks from satellite images to give early warning of ecological disasters and deter illegal dumping. Radar satellites provide an opportunity for monitoring coastal waters day and night, regardless of weather conditions. Oil slicks appear as dark regions in the image whose size and shape evolve depending on weather and sea conditions. However, other look-alike dark regions can be caused by local weather conditions such as high wind. Detecting oil slicks is an expensive manual process requiring highly trained personnel who assess each region in the image.

    A hazard detection system has been developed to screen images for subsequent manual processing. Intended to be marketed worldwide to a wide variety of users—government agencies and companies—with different objectives, applications, and geographic areas, it needs to be highly customizable to individual circumstances. Machine learning allows the system to be trained on examples of spills and nonspills supplied by the user and lets the user control the tradeoff between undetected spills and false alarms. Unlike other machine learning applications, which generate a classifier that is then deployed in the field, here it is the learning method itself that will be deployed.

The input is a set of raw pixel images from a radar satellite, and the output is a much smaller set of images with putative oil slicks marked by a colored border. First, standard image processing operations are applied to normalize the image. Then, suspicious dark regions are identified. Several dozen attributes are extracted from each region, characterizing its size, shape, area, intensity, sharpness and jaggedness of the boundaries, proximity to other regions, and information about the background in the vicinity of the region. Finally, standard learning techniques are applied to the resulting attribute vectors.

Several interesting problems were encountered. One is the scarcity of training data. Oil slicks are (fortunately) very rare, and manual classification is extremely costly. Another is the unbalanced nature of the problem: of the many dark regions in the training data, only a very small fraction are actual oil slicks. A third is that the examples group naturally into batches, with regions drawn from each image forming a single batch, and background characteristics vary from one batch to another. Finally, the performance task is to serve as a filter, and the user must be provided with a convenient means of varying the false alarm rate.
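
One simple way to give the user control over the false alarm rate is to expose a threshold on a per-region score. The sketch below is only an illustration of that tradeoff, not the deployed system's method: it assumes each dark region already has a hypothetical score, where higher means more slick-like, and a known label.

```python
from typing import List, Tuple


def error_rates(scores: List[float], is_spill: List[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (false_alarm_rate, miss_rate) when regions scoring >= threshold are flagged."""
    flagged = [s >= threshold for s in scores]
    positives = sum(1 for y in is_spill if y)
    negatives = len(is_spill) - positives
    false_alarms = sum(1 for f, y in zip(flagged, is_spill) if f and not y)
    misses = sum(1 for f, y in zip(flagged, is_spill) if not f and y)
    # Guard against division by zero when one class is absent.
    false_alarm_rate = false_alarms / negatives if negatives else 0.0
    miss_rate = misses / positives if positives else 0.0
    return false_alarm_rate, miss_rate
```

Lowering the threshold flags more regions, so fewer spills go undetected but more look-alikes are passed on for manual inspection; raising it does the opposite. Because spills are rare, the two rates are computed separately for each class rather than folded into a single accuracy figure.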


Market basket analysis is the use of association techniques to find groups of items that tend to occur together in transactions, typically supermarket checkout data.
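
A minimal sketch of the idea, restricted to pairs of items: count how often each pair appears in the same transaction and keep the pairs whose support (the fraction of transactions containing both items) reaches a chosen minimum. The function names and the `min_support` parameter are illustrative assumptions; practical association rule learners handle larger item sets and far more data.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def pair_support(transactions: List[Set[str]]) -> Dict[Pair, float]:
    """Fraction of transactions in which each pair of items occurs together."""
    counts: Counter = Counter()
    for basket in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: count / n for pair, count in counts.items()}


def frequent_pairs(transactions: List[Set[str]], min_support: float) -> Dict[Pair, float]:
    """Keep only the pairs whose support is at least min_support."""
    return {p: s for p, s in pair_support(transactions).items() if s >= min_support}
```

Sorting each basket before generating pairs ensures that ("bread", "milk") and ("milk", "bread") are counted as the same pair.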


Other Applications:

Sophisticated manufacturing processes often involve tweaking control parameters. Separating crude oil from natural gas is an essential prerequisite to oil refinement, and controlling the separation process is a tricky job. British Petroleum used machine learning to create rules for setting the parameters. This now takes just 10 minutes, whereas previously human experts took more than a day. Westinghouse faced problems in their process for manufacturing nuclear fuel pellets and used machine learning to create rules to control the process. This was reported to save them more than $10 million per year (in 1984). The Tennessee printing company R.R. Donnelly applied the same idea to control rotogravure printing presses to reduce artifacts caused by inappropriate parameter settings, reducing the number of artifacts from more than 500 each year to less than 30.


Machine Learning and Statistics

If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all.


Some of the most important decisions in a machine learning system are:

  1. The concept description language.
  2. The order in which the space is searched.
  3. The way that overfitting to the particular training data is avoided.



  • Data Mining.
  • Machine Learning.
  • Pattern.
  • Structural Pattern.
  • Decision List.
  • Classification Rules.
  • Numeric Attribute Problem.
  • Mixed Attribute Problem.
  • Association Rules.
  • Regression.
  • Market Basket Analysis.