Chapter 2, Statistical Learning

What is Statistical Learning

Input variables to a statistical model are often called predictors, independent variablesfeatures, or just variables.

Output variable from a statistical model is often called response, or dependent variable.
In essence, statistical learning refers to a set of approaches for estimating f (error term)


Why Estimate f

Estimating f is used for prediction and inference.

The prediction accuracy depends on two quantities, reducible error and irreducible error.

The reducible error measure the prediction error as function of X and predicted Y, which by using appropriate statistical learning technique can be reduced.

The irreducible error measure the prediction error as function of ε and the predicted Y. Its irreducible as no matter how well we estimate Y, we cannot reduce the error introduced by ε. The irreducible error may contain unknown variables, variations and variance of the estimated output Y.

The techniques of statistical learning focus on estimating of f with aim of minimizing the reducible error.

The inference task is about understanding the relations between the output and the predictors. Some questions to ask are

  • Which predictors have strong influence on Y
  • What is the relationship between the response and each predictor
  • Is the relation between Y and predictors representable in linear equation or not

Linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches

Non-Linear models provide quite accurate predictions for Y , but this comes at the expense of a less interpretable model for which inference is more challenging

How F is estimated

The process here is to observe training data and train a statistical model to find Y ≈ f(X). There are two methods to do so parametric and nonparametric.

Parametric methods involve two step model based approach

  • Define and select a linear model f(X)=WX+b
  • Select a procedure for fitting the train data to the selected model. Example of such procedure is least squares method.

Choosing close model to the true function of f is challenging. Choosing a too far model would result in poor estimation.

Overfitting is a phenomena when the model fits much to the noise in the training data instead of generalizing to capture the underlying f.

Nonparametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate close to the training data without being too rough or too wiggly. Advantage of such methods is ability to cover wide range of shapes of f, though this requires training data way more than parametric methods to generalize.

Usually restrictive, simple and linear models are more suitable than flexible models for inference tasks.

Regression problems are about predicting an quantitative response while the classification problems are about predicting a qualitative response.

Model Accuracy Assessment

In regression problems mean square error is commonly used.

Mean Square Error


Chapter 1, Introduction

Statistical learning refers to a vast set of tools for understanding data. Types of statistical learning:

  • Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs
  • Unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data
Linear regression is used for predicting quantitative values, such as an individual’s salary
Statistical Learning History:
  • Around 1900, Legendre and Gauss published method of least squares papers (a form of linear regression)
  • In 1936, Fisher proposed linear discriminant analysis for predicting qualitative regression problems.
  • In 1940, various authors proposed logistic regression for solving the regression problem
  • In the early 1970s, Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases.
  • In 1970, many linear models were available however, fitting a nonlinear relationship was computationally infeasible at that time.
  • In 1980, the computing technology had improved so it can solve nonlinear problems.
  • In mid 1980, Breiman, Friedman, Olshen and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection
  • In 1986, Hastie and Tibshirani coined the term generalized additive models for a class of nonlinear extensions to generalized linear models
  • In the past years, progress on statistical learning techniques and tools has been marked by the increasing availability of powerful and relatively user-friendly software

Udacity Deep Learning Course Summary


Logistic Classifier (Linear Classifier): Takes an input (image pixels) and applies a linear function to them to make predictions.

Linear Function: WX + b = Y, where W is weight matrix, X is input matrix, b is bias and Y is output matrix. The W and b terms are the trained values.

Softmax: is a way to transform scores into probabilities. Increasing the number of training data set would make the classifier more confident and discriminate individual classes clearly.

One-Hot Encoding is when the correct class has probability of 1.0 and other classes 0 probability. This technique is not efficient on large number of classes. The benefit is measuring cross-entropy is easy.

Multinomial Logistic Classification is setting where input is feeded to linear model to logistic to softmax to one-hot encoding.

Gradient Descent is the technique used for finding the value to minimize the loss function. A correct simulation should update Theta0 and Theta1 at same time not after each other.

Learning Rate is variable that controls how big a step the gradient descent takes downhill.

For a Local Minimum the slope is going to be zero and this will result in 0 for derivative term which would result in same value of theta.

Training Dataset is the data set that the classifier uses to train on.

Validation Dataset is the data set used for validation and also it’s possible to reconstruct this dataset until the classifier looks like generalized.

Test Dataset is the data set used for real testing. Data samples that the classifier never saw before.

Rule of 30 is an idea to measure if improvements in the classification model is significant or not.

Usually models use more 30K examples for validation to assure significance. Whenever there’s an improvements more than 0.1% this will be significant, which makes measuring the improvements easy.

The main problem with Gradient Descent is that it’s gets calculated for each example in the training data set, and the derivative term takes long time from processing standpoint. This problem will arise in scalability scenarios where the training data set is huge.

To overcome the scalability and performance issues of gradient descent, using a small random data (to calculate its average loss) set out of the training data set with small learning rate overcomes this problem. That technique is called Stochastic Gradient Descent (SAG).

Use running average of the loss instead of relying solely on the current instance helps reducing the number of steps to converge. This is called momentum technique

Using lower learning rate usually tends to get a more accurate classifier.

Stochastic Gradient Descent Parameters:

  • Initial Learning Rate: the initial value for learning rate
  • Learning Rate Decay: how much learning rate should increase
  • Momentum: the running average of the loss
  • Batch Size: the random sample used by SGD
  • Weight Initialization

ADAGRAD is a technique that eliminates couple of SGD parameters and require batch size and weights only.

Deep Neural Networks

Rectified Linear Units (RELU): non-linear function that’s y=x for y > 0 and 0 otherwise.

Backpropagation is used to calculate the gradient of the cost function to update the weights.

Deep Learning is about adding more hidden layers to the model. This helps because of:

  • Parameter efficiency: uses few parameters but goes deeper.
  • Model Abstraction: going deeper results into layers that learn more generalized and useful shapes, like geometric shape instead of line

Why deep learning has not been used before?

  • Deep learning, requires large dataset which has been available recently
  • Better regularization techniques

Regularization is a technique used to prevent from overfitting

Avoid overfitting:

  • Early termination for the training using validation set as soon as the performance stops improving
  • Using regularization techniques
    • Adding L2 for the weights vector
    • Using Dropout

Dropout (by Geoffrey Hinton) is a regularization technique used to prevent overfitting by stochastically turning off activations for hidden layers. This forces the network not to rely on a specific activation in its training and it learns redundant representation for everything across the input data.

Redundant representations results in a network that need some consensus technique to decide the result.

Convolutional Neural Network

If the structure of the input is known, then there’s a chance to reduce the NN complexity by relying on such fact. For example, if the input is a colored letter then the NN can be simplified by using grayscale instead of adding the colors complexity to the NN.

Features that do not change across time, space or everywhere are called statistical invariant. For example, translation invariance is used to simplify the NN by disregard learning the location of the object in an image.

Weight sharing is used to simplifying learning in a NN by utilizing statistical invariance.

CNN (by Yann LeCun) are NN that share their parameters across the space.

Convolution is the process of applying a mask to a NN in convolution pattern (like image filters) to a given image. This would result into a matrix that has different width, height and depth from the original image.

The layer’s parameters consist of a set of learnable filters (or kernels/patches), which have a small receptive field, but extend through the full depth of the input volume.

An RGB image has three feature maps each corresponds to one color intensity.

Stride is the number of pixels shifted each time the filter is moved

Valid padding is way to cut the image as the size of the filter

Same padding is way to pad zeros at the edge of the image to keep sizes same.

Striding reduces the feature map size but it’s very aggressive and sometimes result in loss of data.  Pooling is a technique used for combining information for an image sample. Two common pooling techniques are max pooling and average pooling.

1×1 convolution is a way to reduce the dimensionality of the NN in a simple way.

Recurrent Neural Network is a NN that’s used with temporal classification/prediction problems.

Long Short Term Memory (LSTM) is a version from RNN that’s being used widely.

Racing Games AI

This genre is mostly divided into two main groups, vehicular and specialty.

The variants of vehicular games appeared early, and the split stuck. They are differentiated by their camera perspective: the first-third person racing game and an overhead view.

The specialty racing games are mostly fad driven – they involve the hot racing style sport at the time.

One last subtype is the cart racing game, which simplifies the driving portion of the game and adds obstacles, strange trucks, and other action elements.

Common AI Elements

Track AI

Where the road meets the rubber, track AI is the system needed to keep a CPU-controlled car on racetrack (or city street) at high speed within the rules of the game. This is usually a state-based system with the different vehicle states detailing the main ways that a racer can be on track. As guidelines, most games use combination of physical and optimal lines of travel (which are either data path laid down in track editor, or calculated automatically by a technique known as “finding the path of minimum curvature”, as shown in figure below) that mimic the invisible lines of travel that humans use when they race on tracks and roads.




Some games use traffic systems that look very realistic with lane changes, cars getting over for police vehicles, proper use of traffic lights and intersections. This is mostly FSM behaviors, with a lot of message passing to ensure that accidents don’t happen and some randomness to ensure that things don’t look repetitive.


A game like GTA 2 has different types of pedestrians for example, one type can stop by and steal your car, other one can pass away while you are moving. Other systems use simple flocking behavior.

Enemy and Combat

That’s about adding AI bots with weapons or something to play against the player.

Nonplayer Characters

NPCs are folks that you deal with to give you information, clean your car, etc… These characters are most probably scripted.

Other competitive behavior

Adding more interesting playing elements to the game, like having bridge fall down while you are racing with the car.

Useful AI Techniques

Finite State Machines (FSMs)
Scripted Systems
Messaging System

The traffic and pedestrians mostly use this technique to communicate and coordinate their movement.

Genetic Algorithms

A genetic algorithm is used to tune car performance parameters until it reaches best outcome to be used. Most probably that is done offline.

Areas that need improvement

Other areas of interest than crime

Not everybody’s mother wants to see her kid running over a prostitute for her wallet.

More intelligent AI enemies
Persisting world

it’d be great if the whole world simulations are computed together.

Representations for Classical Planning

A restricted state-transition system is one that meets all of the restrictive assumptions A0 through A7 given in before. It is a deterministic, static, finite, and fully observable state-transition system with restricted goals and implicit time.

Such a system is denoted (S,A, F) instead of (S,A, E, Y) because there are no contingent events.

Classical planning (i.e. STRIPS planning) refers generically to planning for restricted state-transition systems.

Motivations for studying classical planning:

  1. As usual in science when one is facing a very complex problem, it is very useful to make restrictive assumptions in order to work out well-founded models and approaches. In planning, assumptions A0 through A7 led to this baseline class.

Main issues (i.e. problems) in classical planning are:

  1. How to represent the states and the actions in a way that does not explicitly enumerate S,A, and y. Without such a representation, it is not possible to develop domain-independent approaches to planning.
  2. How to perform the search for a solution efficiently: which search space, which algorithm, and what heuristics and control techniques to use for finding a solution.

Why we need problem representation?

A necessary input to any planning algorithm is a description of the problem to be solved. In practice, it usually would be impossible for this problem description to include an explicit enumeration of all the possible states and state transitions: Such a problem description would be exceedingly large, and generating it would usually require more work than solving the planning problem. Instead, a problem representation is needed that does not explicitly enumerate the states and state transitions but makes it easy to compute them on-the-fly.

There are three different ways of to represent classical planning problems:

  1. In a set-theoretic representation.
    Each state of the world is a set of propositions, and each action is a syntactic expression specifying which propositions belong to the state in order for the action to be applicable and which propositions the action will add or remove in order to make a new state of the world.
  2. In a classical representation.
    The states and actions are like the ones described for set-theoretic representations except that first-order literals and logical connectives are used instead of propositions, This is the most popular choice for restricted state-transition systems.
  3. In a state-variable representation.
    Each state is represented by a tuple of values of n state variables {Xl,…, Xn}, and each action is represented by a partial function that maps this tuple into some other tuple of values of the n state variables. This approach is especially useful for representing domains in which a state is a set of attributes that range over finite domains and whose values change over time.


Set-Theoretic Representation

We will usually call such problems set-theoretic planning problems, and we will refer to the representation scheme as set-theoretic planning.

Note that minimal solution can’t be redundant.

Properties of the set-theoretic representations:

  1. Readability.
    One advantage of the set-theoretic representation is that it provides a more concise and readable representation of the state-transition system than we would get by enumerating all of the states and transitions explicitly.
  2. Ease of Computation.
    Most of computations depend on basic set theory operations.
  3. Expressively.
    Some problems can’t be expressed in this form (i.e. finding prime number problem)


Classical Representation

The classical representation scheme generalizes the set-theoretic representation scheme using notation derived from first-order logic. States are represented as sets of logical atoms that are true or false within some interpretation. Actions are represented by planning operators that change the truth values of these atoms.

A state is a set of ground atoms of L. Since L has no function symbols, the set S of all possible states is guaranteed to be finite. As in the set-theoretic representation scheme, an atom p holds in s iff p belongs to s. If g is a set of literals (i.e., atoms and negated atoms), we will say that s satisfies g (denoted s |= g) when there is a substitution SEGMA such that every positive literal of SEGMA (g) is in s and no negated literal of SEGMA (g) is in s.


The predicate which can be considered as a function of the set of states; will be called a fluent or flexible relation.


Predicates that are constant from state to state are called rigid relation.


Closed-world assumption means that an atom that is not explicitly specified in a state does not hold in that state.


Rigid relations cannot appear in the effects of any operator o because they are invariant over all the states; they can be used only in precond(o). In other words, any predicate in effects(o) is a flexible relation.

An action a is relevant for g, i.e., a can produce a state that satisfies g, if:    

DWR = Dock-Worker Robots.

Extensions to classical representation:

  1. Simple Syntactical Extensions.
  2. Conditional Planning Operators.
  3. Quantified Expressions.
  4. Disjunctive Preconditions.
  5. Axiomatic Inference.
    1. This technique distinct two new classes of flexible relations: Primary Relations and Secondary Relations.
  6. Function Symbols.
  7. Attached procedures.
  8. Extended Goals.


State-Variable Representation

A state-variable representation relies on the following ingredients:

Search Procedures and Computational Complexity


Some of problem-solving properties:

  1. Soundness:
    A deterministic procedure is sound if, whenever it is invoked on some problem P and returns a value v not equal to failure, v is guaranteed to be a solution for P. A nondeterministic procedure is sound if every successful execution trace returns a value that is guaranteed to be a solution to P.
  2. Completeness:
    A deterministic procedure is complete if, whenever it is invoked on a solvable problem P, it is guaranteed to return a value v not equal to failure. A nondeterministic procedure is complete if, whenever it is invoked on a solvable problem P, at least one of its execution traces will return a value v not equal to failure whenever P is solvable.
  3. Admissibility:
    If there is some measure of optimality for solutions to a problem P, then a deterministic procedure is admissible if it is guaranteed to return an optimal solution whenever P is solvable. Note that if a procedure is admissible, then it is also sound and complete.

A well-known class of search problems is state-space search problems. The state space is a set of nodes called states, and the objective is to find a state s that satisfies some goal condition g.

The space complexity of breadth first search is bd where b is the number of children for each node and d is the depth of the tree. Whereas the space complexity of depth-first search is d.

Search Procedures:

Breadth-First Search.

Depth-First Search.

Best First Search:

In some cases, the objective is to find a goal state s that minimizes some objective function f(s). In this case, a nondeterministic search will not work properly: instead of returning the solution found in a single execution trace, we must look at all of the execution traces to see which one leads to the best solution.

One way to do this deterministically is to use a best-first search. This is like a breadth-first search in the sense that it maintains an active list of nodes that have been generated but not yet visited. However, instead of using this list as a queue the way a breadth-first search does a best-first search instead uses the list as a priority queue: the next node chosen from the active list will be the one whose f value is smallest. Best-first search is sound. If f is monotonically increasing, i.e., if f(s) <= f(s’) whenever s’ is a child of s, then a best-first search will never return a non-optimal solution and is admissible in finite search spaces. If there is a number Gamma > 0 such that if f(s) + Gamma <= f(s’) whenever s’ is a child of s, then a best-first search is admissible even in infinite search spaces.

The well-known A* search procedure is a special case of best-first state-space search, with some modifications to handle situations where there are multiple paths to the same state.

Depth-First Branch-and-Bound Search:

Another search procedure for minimizing an objective function is branch-and-bound. The most general form of branch-and bound is general enough to include nearly all top-down search procedures as special cases. The best-known version of branch-and-bound is a simple depth-first version similar to the procedure shown in Figure 1. In this procedure, s* is a global variable that holds the best solution seen so far, with s* equal to some dummy value and f (s*) = infinity when the procedure is initially invoked. If the state space is finite and acyclic and if f is monotonically increasing, then Depth-first-BB is admissible. If f is non-monotonic, then Depth-first-BB may not be admissible, and if the state space is infinite, then Depth-first-BB may fail to terminate.

Fig. 1: A branch-and-bound version of state space search.


Greedy Search:

A greedy search is a depth-first search procedure with no backtracking. It works as follows. If s is a solution, then return it; otherwise, repeat the search at the child o(s) whose f value is smallest. There are no guarantees of whether this procedure will find an optimal solution, but it can sometimes save a huge amount of time over what would be needed to find a guaranteed optimal solution.

Hill-Climbing Search:

This is similar to greedy search, except that in a hill-climbing problem, every node is a solution, and a hill-climbing procedure will only go from s to o(s) if f(o(s)) < f(s).


Most of the above search procedures can be modified using a variety of heuristic techniques. These can be divided roughly into two classes:

  1. Pruning techniques:
    These are ways to determine that some nodes will not lead to a solution (or to a desirable solution), so that the procedure can prune these nodes (i.e., remove them from the search space).
  2. Node-selection techniques:
    These are ways to guess which nodes will lead to desirable solutions (e.g., solutions that are optimal or near-optimal, or solutions that can be found quickly), so that the procedure can visit these nodes first.

There are a number of special cases in which these properties can be preserved.

Iterative deepening is a technique that can be used in conjunction with a depth-first search procedure to make it complete. It can be done in two ways: breadth-first or best-first.

Breadth-first iterative deepening does a depth-first search that backtracks whenever it reaches depth i and repeats this search for i = 1, 2 … until a solution is found. Like ordinary breadth-first search, breadth-first iterative deepening is complete but not admissible.

Best-first iterative deepening does a depth-first search that backtracks during its ith iteration whenever it reaches a node s such that f(s) >= fi, where f0 =f(s0), and for i > 0,

fi = min{ f(s) | the search backtracked at s during iteration i – 1 }

The well-known IDA* procedure uses best-first iterative deepening.


Problem-Reduction Search

Another kind of search space is a problem-reduction space, in which each state s represents a problem to be solved, and each operator o(s) produces not just a single child state s’ but an entire set of children {Si . . . . . Sk} (the number of children may vary from one state to another). The children are called subproblems of s, and a solution for one of them represents a portion of the solution to s. Thus, to solve s it is not sufficient just to find a solution below some descendant of s. Instead, the search space is an AND/OR graph, and a solution for s consists of a set of solutions {Vl . . . . . Vk} that are the leaf nodes of a solution graph rooted at s.


A function f(n) is said to be logarithmically bounded if f (n)= O(log n); polynomially bounded if there is a constant c such that f(n)= O(nC); and exponentially bounded if there is a constant c such that f(n) = O(cn).


Computational Complexity of Problems

Given a character string s, is s belongs to L (i.e. language).

Here are some complexity classes dealing with time:

  • P is the set of all languages L such that L has a deterministic recognition procedure whose worst-case running time is polynomially bounded.
  • NP is the set of all languages L such that L has a nondeterministic recognition procedure whose worst-case running time is polynomially bounded.
  • EXPTIME is the set of all languages L such that L has a deterministic recognition procedure whose worst-case running time is exponentially bounded.
  • NEXPTIME is the set of all languages L such that L has a nondeterministic recognition procedure whose worst-case running time is exponentially bounded.

Here are some complexity classes dealing with space:

  • NLOGSPACE is the set of all languages L such that L has a nondeterministic recognition procedure whose worst-case space requirement is logarithmically bounded.
  • PSPACE is the set of all languages L such that L has a recognition procedure whose worst-case space requirement is polynomially bounded. It makes no difference whether the procedure is deterministic or nondeterministic; in either case we will get the same set of languages.
  • EXPSPACE is the set of all languages L such that L has a recognition procedure whose worst-case space requirement is exponentially bounded. It makes no difference whether the procedure is deterministic or nondeterministic; in either case we will get the same set of languages.

If C is one of the complexity classes and L is a language, then L is C-hard if every language in C is reducible to L in polynomial time. L is C-complete if C is L-hard and L belongs to C. Intuitively, if L is C-complete, then L is one of the hardest languages in C: if we can recognize L, then we can also recognize any other language in C with at most a polynomial amount of additional overhead.


Planning Domains as Language-Recognition Problems

Given an alphabet L in which to write statements of planning problems, we can define the following languages:

  • PLAN-EXISTENCE is the set of all strings s belongs to L* such that s is the statement of a solvable planning problem.
  • PLAN-LENGTH is the set of all strings of the form (s,k) such that s is the statement of a solvable planning problem, k is a nonnegative integer, and s has a solution plan that contains no more than k actions

The definition of PLAN-LENGTH follows the standard procedure for converting optimization problems into language-recognition problems.

What really interests us, of course, is not the problem of determining whether there is a plan of length k or less but the problem of finding the shortest plan. If the length of the shortest plan is polynomially bounded, then it can be shown that the two problems are polynomially reducible to each other. However, if the length of the shortest plan is not polynomially bounded, then finding the shortest plan can be much harder than determining whether there is a plan of length k or less. For example, in the well-known Towers of Hanoi problem and certain generalizations of it, the length of the shortest plan can be found in low-order polynomial time, but actually producing a plan of that length requires exponential time and space because the plan has exponential length.

Preface, Introduction and Overview

Motivations of studying planning:

  1. Need for information processing tools that provide affordable and efficient planning resources (Practice)
  2. Planning is an important component of rational behavior. One of the important purposes of AI (Theory)


First Intuitions on Planning



Planning is the reasoning side of acting. It is an abstract, explicit deliberation process that chooses and organizes actions by anticipating their expected outcomes. This deliberation aims at achieving as best as possible some presented objectives. Automated planning is an area of Artificial Intelligence (AI) that studies this deliberation process computationally.


When we don’t need planning:

When the purpose of an action is immediate given our knowledge of that action, or when we perform well-trained behaviors for which we have presented plans, or when the course of an action can be freely adapted while acting, then we usually act and adapt our actions without explicitly planning them.


Reactive vs. Deliberative agents:

Reactive agents simply retrieve pre-set behaviors similar to reflexes without maintaining any internal state. On the other hand, deliberative agents behave more like they are thinking, by searching through a space of behaviors, maintaining internal state, and predicting the effects of actions. Although the line between reactive and deliberative agents can be somewhat blurry, an agent with no internal state is certainly reactive, and one which bases its actions on the predicted actions of other agents is deliberative.


Forms of Planning:

  1. Path and motion planning:
    Is concerned with the synthesis of a geometric path from a starting position in space to a goal and of a control trajectory along that path that specifies the state variables in the configuration space of a mobile system, such as a truck, a mechanical arm, a robot, or a virtual character.
  2. Perception planning:
    Is concerned with plans involving sensing actions for gathering information. Perception planning addresses questions such as which information is needed and when it is needed, where to look for it, which sensors are most adequate for this particular task, and how to use them.
  3. Navigation planning:
    Combines the two previous problems of motion and perception planning in order to reach a goal or to explore an area.
  4. Manipulation planning:
    Is concerned with handling objects, e.g., to build assemblies.
  5. Communication Planning:
    Arises in dialog and in cooperation problems between several agents, human or artificial. It addresses issues such as when and how to query needed information and which feedback should be provided.


Domain Specific Approaches

Disadvantages of domain specific approaches:

  1. They require external understanding of general planning forms.
  2. It is more costly to address each planning problem anew instead of relying on and adapting some general tools.
  3. Domain-specific approaches are not satisfactory for studying and designing an autonomous intelligent machine.

For all these reasons, automated planning is interested in domain-independent general approaches to planning.

Models used in domain-independent planning:

  1. Project planning:
    In which models of actions are reduced mainly to temporal and precedence constraints, e.g., the earliest and latest start times of an action or its latency with respect to another action
  2. Scheduling and resource allocation:
    In which the action models include the above types of constraints plus constraints on the resources to be used by each action.
  3. Plan Synthesis:
    In which the action models enrich the precedent models with the conditions needed for the applicability of an action and the effects of the action on the state of the world.


Conceptual Model for Planning

A conceptual model is a simple theoretical device for describing the main elements of a problem.

Since planning is concerned with choosing and organizing actions for changing the state of a system, a conceptual model for planning requires a general model for a dynamic system.

We’ll use state-transition system (also called discrete-event systems) to describe the conceptual model of the planning.


Fig. 1: A simple conceptual model for planning.


A plan is a structure that gives the appropriate actions that leads you to the objective. The objective can be specified in several different ways:

  • Specifying goal state or set of goal states.
  • More generally, the objective is to satisfy some condition over the sequence of states followed by the system.
  • An alternative specification is through a utility function attached to states, with penalties and rewards. The goal is to optimize some compound function of these utilities (e.g., sum or maximum) over the sequence of states followed by the system.
  • Another alternative is to specify the objective as tasks that the system should perform. These tasks can be defined recursively as sets of actions and other tasks.

A more realistic model interleaves planning and acting, with plan supervision, plan revision, and replanning mechanisms. There is a need for a closed loop between the planner and the controller (Figure 1). The latter returns to the planner the execution status of the plan to enable dynamic planning.

Fig2: A conceptual model for dynamic planning.


Fully observable systems are systems were the controller is able to know all information about the system. Otherwise it’s called partially observable.

The restricted conceptual model plan is unconditional, and the controller executing the plan is an open-loop controller, i.e., it does not get any feedback about the state of the system.

Restricted Planning Model:

  1. Finite system.
  2. Fully observable.
  3. Deterministic.
  4. Static.
  5. Restricted goals.
  6. Sequential plan.
  7. Implicit time.
  8. Offline planning.

Conformant planning deals with nondeterministic and partially observable systems.

Introduction to Data Mining and Machine Learning

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases.

Aim of the book:

    “The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability. If you wish to experiment with your own data, you will be able to do this easily with the Weka software


Data Mining and Machine Learning

Data mining is about solving problems by analyzing data already present in databases.

Data mining is defined as the process of discovering patterns in data.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

  1. As a black box whose innards are effectively incomprehensible and,
  2. As a transparent box whose construction reveals the structure of the pattern. Both, we are assuming, make good predictions.

Patterns that affect decision making process explicitly are called structural patterns.

The rules do not really generalize from the data; they merely summarize it.

Problems with datasets:

  1. The set of examples given as input is far from complete.
  2. Values are not specified for all the features in all the examples.
  3. Rules appear to classify the examples correctly, whereas often, because of errors or noise in the data, misclassifications occur even on the data that is used to train the classifier.

One of proposed definition to learning:

Things learn when they change their behavior in a way that makes them perform better in the future

A set of rules that are intended to be interpreted in sequence is called a decision list (Classification Rules).

Numeric attribute problem is problem that appears in numeric values. Otherwise it’s called, mixed-attribute problem.

Rules that strongly associate different attribute values are called association rules.

People frequently use machine learning techniques to gain insight into the structure of their data rather than to make predictions for new cases.

One of challenges in machine learning is, the question of what is the most natural and easily understood format for the output from a machine learning scheme.

The process of determining regression equation weights is called regression.

The fact that the decision structure is comprehensible is a key feature in the successful adoption of the application.


Fielded Applications

Decisions involving judgment:

When you apply for a loan, you have to fill out a questionnaire that asks for relevant financial and personal information. This information is used by the loan company as the basis for its decision as to whether to lend you money. Such decisions are typically made in two stages. First, statistical methods are used to determine clear “accept” and “reject” cases. The remaining borderline cases are more difficult and call for human judgment. For example, one loan company uses a statistical decision procedure to calculate a numeric parameter based on the information supplied in the questionnaire. Applicants are accepted if this parameter exceeds a preset threshold and rejected if it falls below a second threshold. This accounts for 90% of cases, and the remaining 10% are referred to loan officers for a decision. On examining historical data on whether applicants did indeed repay their loans, however, it turned out that half of the borderline applicants who were granted loans actually defaulted. Although it would be tempting simply to deny credit to borderline customers, credit industry professionals pointed out that if only their repayment future could be reliably determined it is precisely these customers whose business should be wooed; they tend to be active customers of a credit institution because their finances remain in a chronically volatile condition. A suitable compromise must be reached between the viewpoint of a company accountant, who dislikes bad debt, and that of a sales executive, who dislikes turning business away.

    Enter machine learning. The input was 1000 training examples of borderline cases for which a loan had been made that specified whether the borrower had finally paid off or defaulted. For each training example, about 20 attributes were extracted from the questionnaire, such as age, years with current employer, years at current address, years with the bank, and other credit cards possessed. A machine learning procedure was used to produce a small set of classification rules that made correct predictions on two-thirds of the borderline cases in an independently chosen test set. Not only did these rules improve the success rate of the loan decisions, but the company also found them attractive because they could be used to explain to applicants the reasons behind the decision. Although the project was an exploratory one that took only a small development effort, the loan company was apparently so pleased with the result that the rules were put into use immediately.


Screening Images:

Since the early days of satellite technology, environmental scientists have been trying to detect oil slicks from satellite images to give early warning of ecological disasters and deter illegal dumping. Radar satellites provide an opportunity for monitoring coastal waters day and night, regardless of weather conditions. Oil slicks appear as dark regions in the image whose size and shape evolve depending on weather and sea conditions. However, other look-alike dark regions can be caused by local weather conditions such as high wind. Detecting oil slicks is an expensive manual process requiring highly trained personnel who assess each region in the image.

    A hazard detection system has been developed to screen images for subsequent manual processing. Intended to be marketed worldwide to a wide variety of users—government agencies and companies—with different objectives, applications, and geographic areas, it needs to be highly customizable to individual circumstances. Machine learning allows the system to be trained on examples of spills and nonspills supplied by the user and lets the user control the tradeoff between undetected spills and false alarms. Unlike other machine learning applications, which generate a classifier that is then deployed in the field, here it is the learning method itself that will be deployed.

The input is a set of raw pixel images from a radar satellite, and the output is a much smaller set of images with putative oil slicks marked by a colored border. First, standard image processing operations are applied to normalize the image. Then, suspicious dark regions are identified. Several dozen attributes are extracted from each region, characterizing its size, shape, area, intensity, sharpness and jaggedness of the boundaries, proximity to other regions, and information about the background in the vicinity of the region. Finally, standard learning techniques are applied to the resulting attribute vectors.

Several interesting problems were encountered. One is the scarcity of training data. Oil slicks are (fortunately) very rare, and manual classification is extremely costly. Another is the unbalanced nature of the problem: of the many dark regions in the training data, only a very small fraction are actual oil slicks. A third is that the examples group naturally into batches, with regions drawn from each image forming a single batch, and background characteristics vary from one batch to another. Finally, the performance task is to serve as a filter, and the user must be provided with a convenient means of varying the false alarm rate.


Market basket analysis is the use of association techniques to find groups of items that tend to occur together in transactions, typically supermarket checkout data.


Other Applications:

Sophisticated manufacturing processes often involve tweaking control parameters. Separating crude oil from natural gas is an essential prerequisite to oil refinement, and controlling the separation process is a tricky job. British Petroleum used machine learning to create rules for setting the parameters. This now takes just 10 minutes, whereas previously human experts took more than a day. Westinghouse faced problems in their process for manufacturing nuclear fuel pellets and used machine learning to create rules to control the process. This was reported to save them more than $10 million per year (in 1984). The Tennessee printing company R.R. Donnelly applied the same idea to control rotogravure printing presses to reduce artifacts caused by inappropriate parameter settings, reducing the number of artifacts from more than 500 each year to less than 30.


Machine Learning and Statistics

If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all.


Some of the most important decisions in machine learning system are:

  1. The concept description language.
  2. The order in which the space is searched.
  3. The way that overfitting to the particular training data is avoided.



  • Data Mining.
  • Machine Learning.
  • Pattern.
  • Structural Pattern.
  • Decision List.
  • Classification Rules.
  • Numeric Attribute Problem.
  • Mixed Attribute Problem.
  • Association Rules.
  • Regression.

Market Basket Analysis.

Introduction to the Introduction

Download this Article

What Is Machine Learning?

We do not know exactly which people are likely to buy a particular product, or which author to suggest to people who enjoy reading Hemingway. If we knew, we would not need any analysis of the data; we would just go ahead and write down the code. But because we do not, we can only collect data and hope to extract the answers to these and similar questions from data.

We do believe that there is a process that explains the data we observe. Though we do not know the details of the process underlying the generation of data-for example, consumer behavior-we know that it is not completely random. People do not go to supermarkets and buy things at random. When they buy beer, they buy chips; they buy ice cream in summer and spices for Ghihwein in winter. There are certain patterns in the data.

We may not be able to identify the process completely, but we believe we can construct a good and useful approximation. That approximation may not explain everything, but may still be able to account for some part of the data. We believe that though identifying the complete process may not be possible, we can still detect certain patterns or regularities. This is the niche of machine learning.

Application of machine learning methods to large databases is called data mining.

But machine learning is not just a database problem; it is also a part of artificial intelligence. To be intelligent, a system that is in a changing environment should have the ability to learn. If the system can learn and adapt to such changes, the system designer need not foresee and provide solutions for all possible situations.

The model may be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both.

Examples of Machine Learning Applications

Learning Associations:

In the case of retail-for example, a supermarket chain-one application of machine learning is basket analysis, which is finding associations between products bought by customers: If people who buy X typically also buy Y, and if there is a customer who buys X and does not buy Y, he or she is a potential Y customer. Once we find such customers, we can target them for cross-selling.

In finding an association rule, we are interested in learning a conditional probability of the form where is the product we would like to condition on , which is the product or the set of products which we know that the customer has already purchased.

Let us say, going over our data, we calculate that = 0.7. Then, we can define the rule:

70 percent of customers who buy beer also buy chips

We may want to make a distinction among customers and toward this, estimate where is the set of customer attributes, for example, gender, age, marital status, and so on, assuming that we have access to this information.


In credit scoring (Hand 1998), the bank calculates the risk given the amount of credit and the information about the customer. The information about the customer includes data we have access to and is relevant in calculating his or her financial capacity-namely, income, savings, collaterals, profeSSion, age, past financial history, and so forth. The bank has a record of past loans containing such customer data and whether the loan was paid back or not. From this data of particular applications, the aim is to infer a general rule coding the association between a customer’s attributes and his risk. That is, the machine learning system fits a model to the past data to be able to calculate the risk for a new application and then decides to accept or refuse it accordingly.

This is an example of a classification problem where there are two classes: low-risk and high-risk customers. The information about a customer makes up the input to the classifier whose task is to assign the input to one of the two classes.

A discriminant function, is a function that separates the examples of different classes. This type of applications is used for prediction.

There are many applications of machine learning in pattern recognition. One is optical character recognition, which is recognizing character codes from their images.

In medical diagnosis, the inputs are the relevant information we have about the patient and the classes are the illnesses. The inputs contain the patient’s age, gender, past medical history, and current symptoms. Some tests may not have been applied to the patient, and thus these inputs would be missing. Tests take time, may be costly, and may inconvience the patient so we do not want to apply them unless we believe that they will give us valuable information. In the case of a medical diagnosis, a wrong decision may lead to a wrong or no treatment, and in cases of doubt it is preferable that the classifier reject and defer decision to a human expert.

Learning also performs compression in that by fitting a rule to the data, we get an explanation that is simpler than the data, requiring less memory to store and less computation to process. Once you have the rules of addition, you do not need to remember the sum of every possible pair of numbers.

Another use of machine learning is outlier detection, which is finding the instances that do not obey the rule and are exceptions. In this case, after learning the rule, we are not interested in the rule but the exceptions not covered by the rule, which may imply anomalies requiring attention for example, fraud.


Let us say we want to have a system that can predict the price of a used car. Inputs are the car attributes-brand, year, engine capacity, milage, and other information-that we believe affect a car’s worth. The output is the price of the car. Such problems where the output is a number are regression problems.

Both regression and classification are supervised learning problems where there is an input, X, an output, Y, and the task is to learn the mapping from the input to the output. The approach in machine learning is that we assume a model defined up to a set of parameters: .

where is the model and are its parameters. is a number in regression and is a class code (e.g., 0/1) in the case of classification. is the regression function or in classification, it is the discriminant function separating the instances of different classes. The machine learning program optimizes the parameters, , such that the approximation error is minimized, that is, our estimates are as close as possible to the correct values given in the training set.

One can envisage other applications of regression where one is trying to optimize a function. Let us say we want to build a machine that roasts coffee. The machine has many inputs that affect the quality: various settings of temperatures, times, coffee bean type, and so forth. We make a number of experiments and for different settings of these inputs, we measure the quality of the coffee, for example, as consumer satisfaction. To find the optimal setting, we fit a regression model linking these inputs to coffee quality and choose new points to sample near the optimum of the current model to look for a better configuration. We sample these points, check quality, and add these to the data and fit a new model. This is generally called response surface design.

Unsupervised Learning:

In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, DENSITY ESTIMATION this is called density estimation.

One method for density estimation is clustering where the aim is to find clusters or groupings of input. In the case of a company with a data of past customers, the customer data contains the demographic information as well as the past transactions with the company, and the company may want to see the distribution of the profile of its customers, to see what type of customers frequently occur. In such a case, a clustering model allocates customers similar in their attributes to the same group, providing the company with natural groupings of its customers. Once such groups are found, the company may decide strategies, for example, services and products, specific to different groups. Such a grouping also allows identifying those who are outliers, namely, those who are different from other customers, which may imply a niche in the market that can be further exploited by the company.

An interesting application of clustering is in image compression. In this case, the input instances are image pixels represented as RGB values. A clustering program groups pixels with similar colors in the same group, and such groups correspond to the colors occurring frequently in the image. If in an image, there are only shades of a small number of colors and if we code those belonging to the same group with one color, for example, their average, then the image is quantized. Let us say the pixels are 24 bits to represent 16 million colors, but if there are shades of only 64 main colors, for each pixel, we need 6 bits instead of 24. For example, if the scene has various shades of blue in different parts of the image, and if we use the same average blue for all of them, we lose the details in the image but gain space in storage and transmission. Ideally, one would like to identify higher-level regularities by analyzing repeated image patterns, for example, texture, objects, and so forth. This allows a higher-level, simpler, and more useful description of the scene, and for example, achieves better compression than compressing at the pixel level.

One application area of computer science in molecular biology is alignment, which is matching one sequence to another. This is a difficult string matching problem because strings may be quite long, there are many template strings to match against, and there may be deletions, insertions, and substitutions. Clustering is used in learning motifs, which are sequences of amino acids that occur repeatedly in proteins. Motifs are of interest because they may correspond to structural or functional elements within the sequences they characterize. The analogy is that if the amino acids are letters and proteins are sentences, motifs are like words, namely, a string of letters with a particular meaning occurring frequently in different sentences.

Reinforcement Learning:

In some applications, the output of the system is a sequence of actions. In such a case, a single action is not important; what is important is the policy that is the sequence of correct actions to reach the goal. an action is good if it is part of a good policy. In such a case, the machine learning program should be able to assess the goodness of policies and learn from past good action sequences to be able to generate a policy. Such learning methods are called reinforcement learning algorithms.

A good example is game playing where a single move by itself is not that important; it is the sequence of right moves that is good. A move is good if it is part of a good game playing policy.

A robot navigating in an environment in search of a goal location is another application area of reinforcement learning. At any time, the robot can move in one of a number of directions. After a number of trial runs, it should learn the correct sequence of actions to reach to the goal state from an initial state, doing this as quickly as possible and without hitting any of the obstacles.

One factor that makes reinforcement learning harder is when the system has unreliable and partial sensory information. For example, a robot equipped with a video camera has incomplete information and thus at any time is in a partially observable state and should decide taking into account this uncertainty. A task may also require a concurrent operation of multiple agents that should interact and cooperate to accomplish a common goal. An example is a team of robots playing soccer.


Induction is the process of extracting general rules from a set of particular cases.

In statistics, going from particular observations to general descriptions is called inference and learning is called estimation. Classification is called discriminant analysis in statistics.

In engineering, classification is called pattern recognition and the approach is nonparametric and much more empirical.

Machine learning is related to artificial intelligence (Russell and Norvig 1995) because an intelligent system should be able to adapt to changes in its environment.

Data mining is the name coined in the business world for the application of machine learning algorithms to large amounts of data (Weiss and Indurkhya 1998). In computer science, it is also called knowledge discovery in databases (KDD).

Chapter’s Important Keywords:

  1. Machine Learning.
  2. Data Mining.
  3. Descriptive Model.
  4. Predictive Model.
  5. Association Rule.
  6. Classification.
  7. Discriminant Function.
  8. Knowledge Extraction.
  9. Compression.
  10. Outlier Detection.
  11. Supervised Learning.
  12. Response Surface Design.
  13. Density Estimation.
  14. Clustering.
  15. Reinforcement Learning.
  16. Partially Observable State.
  17. Induction.
  18. Inference.
  19. Estimation.
  20. Discriminant Analysis.
  21. Knowledge Discovery in Databases.


Machine learning is about programming computers to optimize a performance
criterion using example data or past experience.

Examples of Machine Learning

  1. Recognition of spoken speech. That is, converting the acoustic speech signal to an ASCrr text.
  2. Consider routing packets over a computer network. The path maximizing the quality of service from a source to destination changes continuously as the network traffic changes. A learning routing program is able to adapt to the best path by monitoring the network traffic.
  3. Intelligent user interface that can adapt to the biometrics of its user, namely, his or her accent, handwriting, working habits, and so forth.
  4. Retail companies analyze their past sales data to learn their customers’ behavior to improve customer relationship management.
  5. Financial institutions analyze past transactions to predict customers’ credit risks.
  6. Robots learn to optimize their behavior to complete a task using minimum resources.
  7. In bioinformatics, the huge amount of data can only be analyzed and knowledge be extracted using computers.
  8. Cars that can drive themselves under different road and weather conditions.
  9. Phones that can translate in real time to and from a foreign language.
  10. Autonomous robots that can navigate in a new environment, for example, on the surface of another planet.


The book I’m reading, discusses many methods that have their bases in different fields; statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.