Udacity Deep Learning Course Summary

Introduction

Logistic Classifier (Linear Classifier): takes an input (e.g. image pixels) and applies a linear function to it to make predictions.

Linear Function: WX + b = Y, where W is the weight matrix, X is the input matrix, b is the bias and Y is the output matrix. W and b are the trained values.

Softmax is a way to transform scores into probabilities. Increasing the size of the training data set makes the classifier more confident and better at discriminating between individual classes.
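A minimal NumPy sketch of softmax (subtracting the max is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores (logits) into probabilities that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))  # highest score -> highest probability
```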

One-Hot Encoding is when the correct class has a probability of 1.0 and all other classes have probability 0. This technique is not efficient for a large number of classes; its benefit is that measuring cross-entropy is easy.

Multinomial Logistic Classification is the setting where the input is fed to a linear model, the resulting scores go through a softmax, and the output probabilities are compared against one-hot encoded labels.
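The pipeline (linear model → softmax → cross-entropy against a one-hot label) can be sketched in NumPy; the weights, bias and input below are arbitrary toy values, not trained ones:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

# Hypothetical values for a 3-class, 2-feature problem.
W = np.array([[0.5, -0.2],
              [0.1,  0.8],
              [-0.3, 0.4]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([1.0, 2.0])             # input

scores = W @ x + b                   # linear model: WX + b
probs = softmax(scores)              # scores -> probabilities
one_hot = np.array([0.0, 1.0, 0.0])  # correct class is index 1

# Cross-entropy is easy to measure against a one-hot label:
cross_entropy = -np.sum(one_hot * np.log(probs))
```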

Gradient Descent is the technique used for finding the parameter values that minimize the loss function. A correct implementation updates Theta0 and Theta1 at the same time, not one after the other.
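The simultaneous-update rule can be illustrated on a hypothetical two-parameter loss whose minimum is known in advance:

```python
# Gradient descent on a toy loss L(theta0, theta1) = (theta0 - 3)^2 + (theta1 + 1)^2,
# a hypothetical function chosen so the minimum (theta0=3, theta1=-1) is known.
learning_rate = 0.1
theta0, theta1 = 0.0, 0.0

for _ in range(200):
    grad0 = 2 * (theta0 - 3)   # dL/dtheta0
    grad1 = 2 * (theta1 + 1)   # dL/dtheta1
    # Simultaneous update: both new values are computed from the OLD thetas,
    # then assigned together; updating one before the other would be incorrect.
    theta0, theta1 = theta0 - learning_rate * grad0, theta1 - learning_rate * grad1
```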

Learning Rate is the variable that controls how big a step gradient descent takes downhill.

At a local minimum the slope is zero, which gives 0 for the derivative term, so the update leaves theta unchanged.

Training Dataset is the data set that the classifier uses to train on.

Validation Dataset is the data set used for validation; it's possible to keep tuning against this dataset until the classifier merely appears generalized.

Test Dataset is the data set used for real testing. Data samples that the classifier never saw before.

Rule of 30 is an idea for measuring whether an improvement in the classification model is significant or not.

Usually models use more than 30K examples for validation to assure significance. With 30K examples, any improvement of more than 0.1% (i.e., 30 examples) is significant, which makes measuring improvements easy.

The main problem with Gradient Descent is that the gradient gets calculated over every example in the training data set, and the derivative term takes a long time from a processing standpoint. This problem arises in scalability scenarios where the training data set is huge.

To overcome the scalability and performance issues of gradient descent, a small random sample of the training data (used to calculate its average loss) combined with a small learning rate overcomes this problem. That technique is called Stochastic Gradient Descent (SGD).

Using a running average of the gradients, instead of relying solely on the current step, helps reduce the number of steps to converge. This is called the momentum technique.

Using lower learning rate usually tends to get a more accurate classifier.

Stochastic Gradient Descent Parameters:

  • Initial Learning Rate: the initial value of the learning rate
  • Learning Rate Decay: how much the learning rate should decrease over time
  • Momentum: the running average of the gradients
  • Batch Size: the size of the random sample used by SGD
  • Weight Initialization

ADAGRAD is a technique that adapts the learning rate automatically, eliminating a couple of the SGD hyperparameters; only the batch size and the initial weights still need to be chosen.
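Putting the SGD parameters above together, a toy sketch of mini-batch SGD with momentum and learning-rate decay (the data, model and hyperparameter values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + 1 plus a little noise (arbitrary choice).
X = rng.uniform(-1, 1, size=500)
y = 2 * X + 1 + 0.01 * rng.normal(size=500)

w, b = 0.0, 0.0
vel_w, vel_b = 0.0, 0.0
lr = 0.1          # initial learning rate
decay = 0.99      # learning-rate decay: the rate *decreases* each epoch
momentum = 0.9    # weight on the running average of past gradients
batch_size = 32   # size of the random sample used per step

for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        err = (w * xb + b) - yb
        grad_w = 2 * np.mean(err * xb)   # gradient of the batch's average loss
        grad_b = 2 * np.mean(err)
        vel_w = momentum * vel_w - lr * grad_w
        vel_b = momentum * vel_b - lr * grad_b
        w += vel_w
        b += vel_b
    lr *= decay
```

The fitted parameters should approach the generating values w ≈ 2, b ≈ 1.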

Deep Neural Networks

Rectified Linear Unit (RELU): a non-linear function where y = x for x > 0 and y = 0 otherwise.
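As a one-line sketch in NumPy:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: y = x for x > 0, and y = 0 otherwise."""
    return np.maximum(0, x)

out = relu(np.array([-2.0, -0.5, 0.0, 1.5]))  # zeros for non-positives, identity above 0
```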

Backpropagation is used to calculate the gradient of the cost function to update the weights.

Deep Learning is about adding more hidden layers to the model. This helps because of:

  • Parameter efficiency: uses fewer parameters by going deeper.
  • Model abstraction: going deeper results in layers that learn more generalized and useful shapes, e.g. a geometric shape instead of a line

Why has deep learning not been used before?

  • Deep learning requires large datasets, which have only become available recently
  • Better regularization techniques

Regularization is a technique used to prevent overfitting

Avoid overfitting:

  • Early termination of training, using the validation set, as soon as performance stops improving
  • Using regularization techniques
    • Adding an L2 penalty for the weights vector
    • Using Dropout

Dropout (by Geoffrey Hinton) is a regularization technique used to prevent overfitting by stochastically turning off activations in the hidden layers. This forces the network not to rely on any specific activation during training, and it learns redundant representations for everything across the input data.

Redundant representations result in a network that needs some consensus technique (such as averaging the activations) to decide the result.
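One common way to implement this is "inverted" dropout, where the surviving activations are scaled up during training so that the expected activation is unchanged and evaluation needs no extra consensus step; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob=0.5, training=True):
    """Inverted dropout: during training, stochastically zero activations and
    scale the survivors by 1/keep_prob so the expected activation is unchanged;
    at evaluation time all activations are used as-is."""
    if not training:
        return activations
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones(10_000)                    # a layer of hidden activations, all 1.0
dropped = dropout(h, keep_prob=0.5)    # roughly half become 0, the rest become 2.0
```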

Convolutional Neural Network

If the structure of the input is known, there's a chance to reduce the NN's complexity by relying on that fact. For example, if the input is a colored letter, the NN can be simplified by using grayscale instead of adding color complexity to the NN.

Features that do not change across time or space are called statistical invariants. For example, translation invariance is used to simplify the NN by not learning the location of an object in an image.

Weight sharing is used to simplify learning in a NN by utilizing statistical invariance.

CNNs (by Yann LeCun) are NNs that share their parameters across space.

Convolution is the process of sliding a filter (like an image filter) over a given image. This results in a matrix whose width, height and depth can differ from the original image's.

The layer’s parameters consist of a set of learnable filters (or kernels/patches), which have a small receptive field, but extend through the full depth of the input volume.

An RGB image has three feature maps, each corresponding to one color's intensity.

Stride is the number of pixels shifted each time the filter is moved.

Valid padding is a way to keep the filter entirely within the image, which shrinks the output by the size of the filter.

Same padding is a way to pad zeros at the edges of the image to keep the output size the same as the input.

Striding reduces the feature map size, but it's very aggressive and sometimes results in loss of information. Pooling is a technique for combining the information in an image sample instead. Two common pooling techniques are max pooling and average pooling.
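The effect of stride and padding on feature-map size can be computed directly; a small helper (the same formula applies to pooling layers):

```python
import math

def conv_output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution (also applies to pooling).
    'valid' keeps the filter fully inside the image; 'same' zero-pads the
    edges so that stride 1 preserves the input size."""
    if padding == 'valid':
        return (in_size - filter_size) // stride + 1
    if padding == 'same':
        return math.ceil(in_size / stride)
    raise ValueError(padding)

a = conv_output_size(28, 5, stride=1, padding='valid')  # -> 24
b = conv_output_size(28, 5, stride=1, padding='same')   # -> 28
c = conv_output_size(28, 2, stride=2, padding='valid')  # -> 14 (e.g. 2x2 max pooling)
```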

1×1 convolution is a simple way to reduce the dimensionality (depth) of the NN.

Recurrent Neural Network (RNN) is a NN that's used for temporal classification/prediction problems.

Long Short-Term Memory (LSTM) is a version of RNN that's widely used.

Multilayer Perceptron

  • Multilayer perceptron stands for a neural network with one or more hidden layers.
  • Properties of multilayer neural networks:
    • The model of each neuron in the network includes a nonlinear activation function that’s differentiable.
    • Network contains one or more hidden layers.
    • Network exhibits a high degree of connectivity through its synaptic weights.
  • Common deficiencies in multilayer neural networks:
    • Theoretical analysis of MLNN is difficult to undertake.
      • This comes from the nonlinearity and high connectivity of the network.
    • More effort is required to visualize the learning process.
      • This comes from the existence of several layers in the network.
  • Back propagation algorithm is used to train MLNN. The training proceeds in two phases:
    • In the forward phase, the synaptic weights of the network are fixed and the input signal is propagated through the network layer by layer until it reaches the output.
    • In the backward phase, an error signal is produced by comparing the output of the network with a desired response. The resulting error signal is propagated through the network, layer by layer but the propagation is performed in the
      backward direction. In this phase successive adjustments are applied to the synaptic weights of the network.
  • The term “back propagation” appeared after 1985, when it was popularized through the publication of the book “Parallel Distributed Processing” by Rumelhart and McClelland.

  • Two kinds of signals exist in MLP (Multilayer Perceptron):
    • Function Signals (Input Signal):
      • It’s an input signal that comes in at the input end of the network, propagates forward (neuron by neuron) through the network and emerges at the output end of the network as output signal.
      • We call them Function Signals because:
        • It’s presumed to perform a useful function at the output of the network.
        • The neuron’s signal is calculated as a function of the input signal(s) and associated weights.
    • Error Signals:
      • It originates at an output neuron of the network and propagates backward (layer by layer) through the network.
      • We call it an Error Signal because:
        • Its computation by every neuron of the network involves an error-dependent function.
  • Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
    • Computation of function signal, which is expressed as a continuous nonlinear function of the input signal and synaptic weights associated with that neuron.
    • Computation of an estimate of the gradient vector which is needed for the backward pass through the network.
      • Gradient vector: the gradients of the error surface with respect to the connected weights of the inputs of a neuron.
  • Function of Hidden Neurons:
    • Hidden neurons act as feature detectors.
    • The hidden neurons are able to discover the salient features that characterize training data.
    • They do so by performing a nonlinear transformation on the input data into a new space called the feature space.
    • In the feature space the pattern-classification task is simplified and the classes are more separated.
    • This function is the main difference between Rosenblatt’s perceptron and Multilayer Neural Networks.
  • Credit Assignment Problem:
    • The Credit-Assignment problem is the problem of assigning credit or blame for overall outcomes to each of the internal decisions made by the hidden computational units of the learning system, because, as we know, those decisions are responsible for the overall outcomes in the first place.
    • The error-correction learning algorithm is not suitable for resolving the credit-assignment problem in MLNN, because we can't judge by the output neurons alone when the hidden layers play a big role in the decision.
    • Back propagation algorithm is able to solve the credit-assignment problem in an elegant manner.

    Batch Learning

  • Before describing the algorithm, we want to introduce some equations that are found on page 157.
  • Batch Learning is a supervised learning setting in which the weight update is performed after the presentation of all N examples in the training sample, which constitutes one epoch of training.
  • Adjustments to the synaptic weights are made on an epoch-by-epoch basis.
  • With the method of gradient descent used to perform training, we have these two advantages:
    • Accurate estimation.
    • Parallelization of learning process.
  • From a practical perspective, batch learning suffers from its storage requirements.
  • In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well studied for solving nonlinear regression problems.

Online Learning

  • In the online method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed on an example-by-example basis. The cost function to be minimized is therefore the total instantaneous error energy.
  • Such an algorithm is not suitable for parallelization of the learning process.
  • Sometimes online learning is called stochastic method.
  • Advantages of online learning:
    • This stochasticity has the desirable effect of making it less likely for the learning process to be trapped in a local minimum.
    • Moreover, online learning requires less storage than batch learning.
    • Also, if the training data is redundant, the online learning benefits from this redundancy to improve its learning.
    • Finally, online learning is able to track small changes in the training data, especially if the data environment is non-stationary.
    • It’s simple to implement.
    • Provides effective solutions to large-scale and difficult classification problems.

The Back Propagation Algorithm

  • First you should read the mathematical derivation of the algorithm in pages 195-161.
  • The key factor involved in the calculation of the weight adjustment is the error signal at output neuron j. As we can see, the credit-assignment problem arises here. In this context we may identify two distinct cases:
    • Case #1: Neuron j is an output node:
      • The error signal is supplied to the neuron from its own desired response, by the equation: δj(n) = ej(n) φ′j(vj(n))
        • Where ej(n) = dj(n) − yj(n), the difference between the desired response and the actual output.
    • Case #2: Neuron j is a hidden node:
      • When a neuron j is located in a hidden layer of the network, there’s no specified desired response for that neuron.
      • Accordingly, the error signal for a hidden neuron would have to be determined recursively and working backwards in terms of the error signals of all the neurons to which that hidden neuron connected.
      • The final back-propagation formula for the local gradient is: δj(n) = φ′j(vj(n)) Σk δk(n) wkj(n)
        • Where k ranges over the neurons that are connected to hidden neuron j.
      • To know the derivation kindly refer to page 162-163.
  • As a summary, the correction Δwji(n) applied to the synaptic weight connecting neuron i to neuron j is defined by the delta rule: Δwji(n) = η δj(n) yi(n), where η is the learning-rate parameter.
  • Any activation function that is used in multilayer neural networks should be continuous and differentiable.
  • The most commonly used activation function is a sigmoidal nonlinearity, two forms of which are described here:
    • Logistic Function: φj(vj(n)) = 1 / (1 + exp(−a·vj(n))), with a > 0.
    • Hyperbolic Tangent Function: φj(vj(n)) = a·tanh(b·vj(n)), with a, b > 0.
  • Learning Parameter:
    • The smaller we make η, the smaller the changes to the synaptic weights will be from one iteration to the next, and the smoother the trajectory in the network's weight space will be.
    • On the other hand, if we make η too large in order to speed up the rate of learning, the resulting large changes in the synaptic weights can make the network unstable (i.e. oscillatory).
    • A simple method of avoiding this instability is to include a momentum term, as shown by:
      • Δwji(n) = α Δwji(n−1) + η δj(n) yi(n),
      • Where α, usually a positive number, is called the momentum constant.
      • Also, z⁻¹ is the unit-time delay operator, so Δwji(n−1) = z⁻¹[Δwji(n)].
      • The above equation is called the generalized delta rule. The ordinary delta rule is the special case α = 0.
    • The inclusion of momentum in back-propagation algorithm has stability effect in directions that oscillate in sign.
    • The momentum term may also have the benefit of preventing the learning process from terminating in a shallow local minimum on the error surface.
    • In reality, the learning-rate parameter is connection dependent, such that each connection has its own ηji.
  • Stopping Criteria:
    • In general the back-propagation algorithm can't be shown to converge, and there are no well-defined criteria for stopping its operation.
    • We may formulate a sensible convergence criterion for back-propagation learning as follows (Kramer and Sangiovanni-Vincentelli 1989):
      • The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
    • The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).
    • Another criterion:
      • The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small
        (ranges from 0.1 to 1 percent/epoch).
    • Another theoretical criterion:
      • After each learning iteration, the network is tested for its generalization performance. The learning process stops when the generalization performance is adequate or has peaked.
  • Note that in each training epoch the samples should be picked randomly.
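The two-phase algorithm above (forward pass, then backward propagation of local gradients with the generalized delta rule) can be sketched in NumPy; the XOR dataset, network size and hyperparameters are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy dataset (XOR), chosen only to exercise the algorithm.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)  # desired responses

W1 = rng.normal(0, 1, (2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1))   # hidden -> output weights
b2 = np.zeros(1)
eta, alpha = 0.3, 0.9           # learning rate and momentum constant
dW1 = db1 = dW2 = db2 = 0.0     # previous corrections (for the momentum term)

def mse():
    return float(np.mean((d - sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)) ** 2))

before = mse()
for _ in range(2000):
    # Forward phase: weights fixed, signal propagated layer by layer.
    y1 = sigmoid(X @ W1 + b1)
    y2 = sigmoid(y1 @ W2 + b2)
    # Backward phase: local gradients, output layer first.
    e = d - y2
    delta2 = e * y2 * (1 - y2)                # output node: delta = e * phi'(v)
    delta1 = (delta2 @ W2.T) * y1 * (1 - y1)  # hidden node: phi'(v) * sum_k delta_k * w_kj
    # Generalized delta rule: dw(n) = alpha * dw(n-1) + eta * delta_j * y_i.
    dW2 = alpha * dW2 + eta * (y1.T @ delta2)
    db2 = alpha * db2 + eta * delta2.sum(axis=0)
    dW1 = alpha * dW1 + eta * (X.T @ delta1)
    db1 = alpha * db1 + eta * delta1.sum(axis=0)
    W2 = W2 + dW2
    b2 = b2 + db2
    W1 = W1 + dW1
    b1 = b1 + db1
after = mse()
```

The mean squared error after training should be below its initial value.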

Designing a neural network with the back-propagation algorithm is more of an art than a science.

  • Heuristics for making back-propagation algorithm perform better:
    • Stochastic update is recommended over batch update.
    • Maximizing the information content.
      • This is done by:
        • Use an example that results in the largest training error.
        • Use an example that is radically different from all those previously worked.
    • Activation Function.
      • It’s preferred to use a sigmoid activation function that’s an odd function of its argument.
      • The hyperbolic tangent function is the recommended one.
      • 175-See the useful properties of hyperbolic sigmoid function.
    • Target values:
      • It’s recommended that target values (desired) be chosen within the range of the sigmoid activation function.
    • Normalizing the inputs.
      • Each input variable should be preprocessed so that its mean value, averaged over the entire training samples, is close or equal to zero.
      • In order to accelerate the back-propagation learning process, the normalization of the inputs should also include two other measures:
        • The input variables contained in the training set should be uncorrelated; this can be done by using principal-component analysis.
        • The decorrelated input variables should be scaled so that their covariances are approximately equal
      • Here we’ve 3 normalization steps:
        • Mean removal.
        • Decorrelation.
        • Covariance equalization.

  • Weights initialization.
  • Learning from hints.
    • This is achieved by including prior knowledge to the system.
  • Learning Rates.
    • All neurons in the multilayer perceptron should ideally learn at the same rate, except that the last layer should be assigned a smaller learning rate than the front layers.
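The three input-normalization steps listed earlier (mean removal, decorrelation via principal-component analysis, covariance equalization) can be sketched on toy correlated data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D inputs where the second variable is strongly correlated with the first.
X = rng.normal(size=(1000, 2))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1] + 5.0

# 1. Mean removal: each input variable gets (close to) zero mean.
X0 = X - X.mean(axis=0)

# 2. Decorrelation via principal-component analysis.
cov = np.cov(X0, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X1 = X0 @ eigvecs              # rotate onto the principal axes

# 3. Covariance equalization: scale each component to unit variance.
X2 = X1 / np.sqrt(eigvals)
```

After the three steps the covariance matrix of `X2` is (numerically) the identity.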

Generalization

  • A network is said to generalize well when the network input-output mapping is correct (or nearly so) for the test data.
  • The learning process may be viewed as “curve fitting” problem. Thus, generalization is performed by the interpolation made by the network.
  • Memorization in a neural network usually leads to bad generalization. “Memorization” is essentially a “look-up table”, which implies that the input-output mapping computed by the neural network is not smooth.
  • Generalization is influenced by three factors:
    • Size of training sample and how they represent the environment of interest.
    • Architecture of the neural network.
    • Physical complexity of the problem at hand.
  • In practice, good generalization is achieved if the training sample size, N, satisfies:

    N = O(W / ε)

  • Where:
    • W is the total number of free parameters (i.e. synaptic weights and biases) in the network.
    • ε denotes the fraction of classification errors permitted on test data.

Cross Validation

  • In statistics, cross-validation randomly divides the available data set into:
    • Training Data:
      • Estimate Subset: used to select the model.
      • Validation Subset: used to test or validate the model.
    • Testing Data.
  • However, this best model may be overfitting the validation data.
  • Then, to guard against this possibility, the generalization performance is measured on the test set, which is different from the validation subset.
  • Early-Stopping Method (Holdout Method):
    • Validation Steps:
      • The training is stopped periodically, i.e., after so many epochs, and the network is assessed using the validation subset where the backward mode is disabled.
      • When the validation phase is complete, the estimation (training) is resumed for another period, and the process is repeated.
      • The best model (free parameters) is that at the minimum validation error.
    • Here we should take care not to stop at a shallow local minimum of the validation error.
  • Variant of Cross-Validation (Multifold Method):
    • Validation Steps:
      • Divide the data set of N samples into K subsets, where K > 1.
      • In each trial, the network is validated on a different subset after being trained on the other subsets.
      • The performance of the model is assessed by averaging the squared error under validation over all trials.
    • A special case of this method is called the “leave-one-out method”, where N − 1 examples are used to train the model, and the model is validated by testing it on the example that was left out.
    • Disadvantage of this method is that it requires an excessive amount of computation.
    • A figure in the book demonstrates the division of the samples, where the shaded square denotes the validation set.
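The multifold (K-fold) procedure above can be sketched as follows; the mean-predictor "model" is a hypothetical stand-in for an actual network:

```python
import numpy as np

def kfold_validation_error(X, y, K, train_and_eval):
    """Multifold cross-validation: split the N samples into K subsets, validate
    on each subset after training on the other K-1, and average the squared
    error over all trials. `train_and_eval(X_tr, y_tr, X_val, y_val)` is a
    user-supplied function returning the squared error on the validation subset."""
    idx = np.arange(len(X))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        errors.append(train_and_eval(X[train], y[train], X[val], y[val]))
    return float(np.mean(errors))

# Toy "model": always predict the mean of the training targets.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)

def mean_predictor(X_tr, y_tr, X_val, y_val):
    pred = y_tr.mean()
    return float(np.mean((y_val - pred) ** 2))

err = kfold_validation_error(X, y, K=5, train_and_eval=mean_predictor)
```

Setting K = N (one sample per fold) gives the leave-one-out special case.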

  • Questions:
    • What we mean by differentiable function?
      • It’s a continuous function whose derivative exists at every point (i.e. not a discrete function).
    • What are benefits of differentiable functions?
      • We are able to use it to find the gradient of the activation function in terms of the induced local field.
    • 155- “It’s presumed to perform a useful function at the output of the network”. Does the author mean performing an activation function at the output layer? Or does he mean that the output signal will lead us to perform some function?
      • Because it gives us the network response (this is the function).
    • 155-I need graphical and mental explanation about “the gradients of the error surface with respect to the weights connected”
    • 158-description realization of the learning curve process.
      • He means the difference between Online and Batch.
    • 158-How “parallelization of learning process” exists in the batch learning process?
      • Distribute the computation of error for each sample and then combine them.
    • 158- “In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well studied for solving nonlinear regression problems“. Discussion is required here.
    • 159-Why batch learning doesn’t benefit from redundancy?
    • 159-How can we track small changes in “training data”? Should it be “synaptic weights”?
      • Because it’s sensitive whereas the batch looks for the global changes.
    • 161-What’s local gradient?
      • It’s the δ term of the current neuron: the gradient of the error with respect to that neuron’s induced local field.
    • 163- I need to know the mathematical differentiation steps of equation 4.23.
    • 169- “The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).” What’s the drawback here in calculating g(w)?!
      • Well, I think that it’s computationally expensive to compute it each epoch.
    • 178-What’s meant by saturation?
    • 177-What’s meant by uncorrelated data?

Multilayer Perceptron

  • Multilayer perception stands for a neural network with one or more hidden layer.
  • Properties of multilayer neural networks:
    • The model of each neuron in the network includes a nonlinear activation function that’s differentiable.
    • Network contains one or more hidden layer.
    • Network exhibits a high degree of connectivity through its synaptic weights.
  • Common deficiencies in multilayer neural networks:
    • Theoretical analysis of MLNN is difficult to undertake.
      • This comes from the nonlinearity and high connectivity of the network.
    • Harder efforts are required for visualizing the learning process.
      • This comes from the existence of several layers in the network.
  • Back propagation algorithm is used to train MLNN. The training proceeds in two phases:
    • In the forward phase, the synaptic weights of the network are fixed and the input signal is propagated through the network layer by layer until it reaches the output.
    • In the backward phase, an error signal is produced by comparing the output of the network with a desired response. The resulting error signal is propagated through the network, layer by layer but the propagation is performed in the
      backward direction. In this phase successive adjustments are applied to the synaptic weights of the network.
  • The term “back propagation” appeared after 1985 when the term was popularized through the publication of the book “Parallel Distribution Processing” by Rumelhard and McClelland.

  • Two kinds of signals exist in MLP (Multilayer Perceptron):
    • Function Signals (Input Signal):
      • It’s an input signal that comes in at the input end of the network, propagates forward (neuron by neuron) through the network and emerges at the output end of the network as output signal.
      • We called it Function Signals because:
        • It’s presumed to perform a useful function at the output of the network.
        • The neuron’s signal is calculated as a function of the input signal(s) and associated weights.
    • Error Signals:
      • It originates at an output neuron of the network and propagates backward (layer by layer) through the network.
      • We called it Error Signal because:
        • It’s computation by every neuron of the network involves an error-dependent function.
  • Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
    • Computation of function signal, which is expressed as a continuous nonlinear function of the input signal and synaptic weights associated with that neuron.
    • Computation of an estimate of the gradient vector which is needed for the backward pass through the network.
      • Gradient vector: the gradients of the error surface with respect to the connected weights of the inputs of a neuron.
  • Function of Hidden Neurons:
    • Hidden neurons act as feature detector.
    • The hidden neurons are able to discover the salient features that characterize training data.
    • They do so by performing a nonlinear transformation on the input data into a new space called feature space
    • In feature space pattern classification task is more simplified and the classes are more separated.
    • This function is the main difference between Rosenblatt’s perceptron and Multilayer Neural Networks.
  • Credit Assignment Problem:
    • Credit-Assignment problem is the problem of assigning credit or blame for overall outcomes to each of the internal decisions made by the hidden computational units of the learning system. Because as we know those decisions are responsible for the overall outcomes in the first place.
    • Error-correction learning algorithm is not suitable for resolving the credit-assignment problem for MLNN because we can’t just judge on the output neurons where hidden layers play a big role in its decision.
    • Back propagation algorithm is able to solve the credit-assignment problem in an elegant manner.

     

    Batch Learning

  • Before we start in describing the algorithm you want to introduce some equations that are found in page 157.
  • Batch Learning is a supervised learning algorithm. The learning algorithm is performed after the presentation of all the N examples in the training samples that constitutes one epoch of training.
  • Adjustments to the synaptic weights are made on an epoch-by-epoch basis.
  • With method of gradient descent used to perform training we’ve these 2 advantages:
    • Accurate estimation.
    • Parallelization of learning process.
  • From practical perspective, batch learning suffers from the storage requirements.
  • In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well studied for solving nonlinear regression problems.

Online Learning

  • Online method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron is performed example-by-example basis. The cost function to be minimized is therefore the total instantaneous error energy.
  • Such algorithm is not suitable for parallelization of the learning process.
  • Sometimes online learning is called stochastic method.
  • Advantages of online learning:
    • This stochasticity has the desirable effect of making it less likely for the learning process to be trapped in a local minimum.
    • Moreover, online learning requires less storage than batch learning.
    • Also, if the training data is redundant, the online learning benefits from this redundancy to improve its learning.
    • Finally, in online learning you are able to track small changes in training data especially if the data environment is non-stationary.
    • It’s simple to implement.
    • Provides effective solutions to large –scale and difficult classification problems.

The Back Propagation Algorithm

  • First you should read the mathematical derivation of the algorithm in pages 195-161.
  • The key factor involved in the calculation of the weight adjustment is the error signal at the output neuron j. As we see the credit-assignment problem arises here. In this context we may identify two distinct cases:
    • Case #1: Neuron j is an output node:
      • The error signal is supplied to the neuron by its own from equation:
        • Where .
    • Case #2: Neuron j is a hidden node:
      • When a neuron j is located in a hidden layer of the network, there’s no specified desired response for that neuron.
      • Accordingly, the error signal for a hidden neuron would have to be determined recursively and working backwards in terms of the error signals of all the neurons to which that hidden neuron connected.
      • The final back-propagation formula for the local gradient
        • Where k represents the number of neurons that are connected to hidden neuron j.
      • To know the derivation kindly refer to page 162-163.
  • As a summary the correction is applied to the synaptic weight connecting neuron I to j is defined by:
  • Any activation function that is used in multilayer neural networks should be continuous.
  • The most commonly used activation function is sigmoidal nonlinearity. Two forms of which are described here:
    • Logistic Function: .
    • Hyperbolic Tangent Function: .
  • Learning Parameter:
    • The smaller we make, the smaller changes to synaptic weights in the network will be from one iteration to the next and the smoother will be the trajectory in the network weight space.
    • On the other hand if we make to large in order to speed up the rate of learning, the resulting larges changed in synaptic weights assume such for that the network may become unstable (i.e. oscillatory).
    • A simple method of solving this problem is by including a momentum term, as shown by:
      • ,
      • Where usually positive number is called momentum constant.
      • Also, is the unit-time delay operator,
      • The above equation is called generalized delta rule. Special case here is applied when .
    • The inclusion of momentum in back-propagation algorithm has stability effect in directions that oscillate in sign.
    • The momentum term may also have the benefit of preventing the learning process from terminating in a shallow local minimum on the error surface.
    • In reality the learning rate parameter in connection dependent such that each connection has .
  • Stopping Criteria:
    • In general back-propagation algorithm can’t be shown to converge and there are no well defined criteria for stopping it operation.
    • We may formulate a sensible convergence criterion for back-propagation learning as follows (Kramer and Sangiovanni-Vincentelli 1989):
      • The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
    • The drawback of this criterion is that, for successful trails, learning times may be long. Also, it requires the computation of the gradient vector g(w).
    • Another criterion:
      • The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small
        (ranges from 0.1 to 1 percent/epoch).
    • Another theoretical criterion:
      • After each learning iteration, the network is tested for it generalization performance. The learning process stops when the generalization performance is adequate or peaked.
  • Note that in each training epoch the samples should be picked randomly.
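The two practical stopping criteria above could be combined in a routine like the following sketch; the threshold values and the function name `should_stop` are illustrative assumptions, not from the text.

```python
import numpy as np

def should_stop(grad, prev_mse, curr_mse,
                grad_threshold=1e-4, rate_threshold=0.001):
    """Stop when either:
    1. the Euclidean norm of the gradient vector is below a small threshold, or
    2. the absolute rate of change of the average squared error per epoch
       is sufficiently small (e.g. 0.1 to 1 percent per epoch)."""
    small_gradient = np.linalg.norm(grad) < grad_threshold
    change_rate = abs(prev_mse - curr_mse) / max(prev_mse, 1e-12)
    return small_gradient or change_rate < rate_threshold
```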
  • Questions:
    • What do we mean by a differentiable function?
      • It’s a continuous function whose derivative exists at every point.
    • What are the benefits of differentiable functions?
      • We are able to compute the gradient of the activation function in terms of the induced local field.
    • 155- “It’s presumed to perform a useful function at the output of the network”. Does the author mean performing an activation function at the output layer, or that the output signal will lead us to perform some function?
      • I think he means the first option.
    • 155-I need a graphical and mental explanation of “the gradients of the error surface with respect to the weights connected”.
    • 158-Description of the realization of the learning-curve process.
    • 158-How does “parallelization of the learning process” exist in the learning process?
      • Because we can distribute the classification process over several PCs and then collect their results.
    • 158- “In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well suited for solving nonlinear regression problems“. Discussion is required here.
    • 159-Why doesn’t batch learning benefit from redundancy?
    • 159-How can we track small changes in the “training data”? Shouldn’t it be in the “synaptic weights”?
    • 161-What’s the local gradient?
      • It’s the error term δ of a neuron: the error signal multiplied by the derivative of the activation function at the neuron’s induced local field.
    • 163- I need to know the mathematical differentiation steps of equation 4.23.
    • 169- “The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).” What’s the drawback here in calculating g(w)?!
      • Well, I think that it’s computationally expensive to compute it each epoch.

The Least Mean-Square Algorithm

  • The Least Mean-Square (LMS) algorithm is an online learning algorithm developed by Widrow and Hoff in 1960.
  • Rosenblatt’s perceptron was the first learning algorithm for solving linearly separable pattern-classification problems.
  • LMS algorithm was the first linear adaptive-filtering algorithm for solving problems such as prediction and communication-channel equalization.
  • Advantages behind LMS Algorithm:
    • Computationally Efficient: Its complexity is linear with respect to the number of adjustable parameters.
    • Simple to code and easy to build.
    • Robust with respect to external disturbances.
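As a sketch of why the LMS update is computationally efficient, here is the Widrow-Hoff recursion in a few lines; the synthetic data, step size, and epoch count are illustrative assumptions.

```python
import numpy as np

def lms(X, d, eta=0.05, n_epochs=20):
    """LMS (Widrow-Hoff) algorithm: for each sample,
    e(n) = d(n) - w(n)^T x(n) and w(n+1) = w(n) + eta * x(n) * e(n).
    Each update costs O(number of weights) - linear complexity."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x, target in zip(X, d):
            e = target - w @ x        # instantaneous error
            w = w + eta * e * x       # stochastic-gradient update
    return w

# Recover a known linear filter from noiseless input-output samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = lms(X, X @ w_true)
```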
  • Gauss-Newton Method makes a balance between computational complexity and convergence behavior.
  • Diagonal loading concept.
  • Stabilizer term concept.
  • A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state; that is, given the present, the future does not depend on the past. A process with this property is called a Markov process. The strong Markov property is similar, except that the meaning of “present” is defined in terms of a certain type of random variable, which might be specified in terms of the outcomes of the stochastic process itself, known as a stopping time.
  • A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved states. An HMM can be considered the simplest dynamic Bayesian network.
  • Questions:
    • 128-In the Gauss-Newton Method, why do we take the sum over squares of the error signal?
      • To make the value positive.
    • Why not just take the sum of the error itself?
    • 128- “Linearize the dependence of e(i) on w”. Does that mean trying to find a linear function that maps the dependence between e(i) and w?
    • 128-How has equation 3.18 linearized e(i) and w?
    • 129-Why do they call it the Jacobian matrix?
    • 129-How have we calculated equation 3.20?
    • 129-What’s meant by nonsingular matrix multiplication?
    • 129-Last paragraph that’s talking about the Jacobian matrix conditions.

Model Building through Regression

  • Linear regression is a special form of the function approximation, to model a given set of random variables.
  • Linear regression is about finding relationship between set of random variables to be able to predict new value.
  • In linear regression we’ve the following scenario:
    • One of the random variables is considered to be of particular interest; that random variable is referred to as a dependent variable or response.
    • The remaining random variables are called independent variables, or regressors; their role is to explain or predict the statistical behavior of the response.
    • The dependence of the response on the regressors includes an additive error term, to account for uncertainties in the manner in which this dependence is formulated. This is called expectational error or explanational error.
  • Such a model is called a regression model.
  • In mathematics and statistics, particularly in the fields of machine learning and inverse problems, regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
  • Classes of regression models:
    • Linear: The dependence of the response on the regressors is defined by linear function.
    • Non-Linear: The dependence of the response on the regressors is defined by non-linear function.
  • Methods for tracing the linear regression model:
    • Bayesian Theory:
      • To derive the maximum a posteriori estimate of the vector that parameterizes a linear regression model.
    • Method of Least Square:
      • This is the oldest parameter estimation procedure.
      • It was introduced by Gauss in the early part of the 19th century.

  • The model order is the common dimension of the regressor vector and the vector of fixed weights.
  • Problem of interest here is stated as follows:
    • Given the joint statistics of the regressor X and the corresponding response D, estimate the unknown parameter vector w.
  • Maximum a posteriori (MAP) is more profound than maximum-likelihood (ML) estimation because:
    • MAP exploits all the conceivable information about parameter vector w.
    • ML relies solely on the observation model (d, x) and may therefore lead to a nonunique solution. To enforce uniqueness and stability, the prior has to be incorporated into the formulation of the estimator.
  • Assumptions considered in parameter estimation process:
    • Samples are statistically independent and identically distributed.
      • Statistically independent means that the off-diagonal entries of the covariance matrix are zero.
      • Identically distributed means that the diagonal entries of the covariance matrix are all equal.
    • Gaussianity.
      • The environment responsible for generating the training samples is Gaussian distributed.
    • Stationary.
      • The environment is stationary, which means that the parameter vector w is fixed, but unknown, throughout the N trials of the experiment.
  • 103 ->106-Finding MAP of weight vector.
  • In improving the stability of the maximum likelihood estimator through the use of regularization (i.e. the incorporation of prior knowledge), the resulting maximum a posteriori estimator becomes biased.
  • In short, we have a tradeoff between stability and bias.
  • The ordinary least-squares estimator is defined as the sum of squared expectational errors over the experimental trials on the environment.
  • When the regularization parameter λ = 0, we have complete confidence in the observation model exemplified by the training samples. At the other extreme, λ = ∞ means we have no confidence in the observation model.
  • The Regularized Least-Squares (RLS) solution is obtained by minimizing the regularized cost function with respect to the parameter vector w.
  • This solution is identical to MAP estimate solution.
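The identity between the regularized least-squares solution and the MAP estimate can be illustrated with the closed-form ridge formula; this is a sketch, and the synthetic data below are assumptions for illustration.

```python
import numpy as np

def regularized_ls(X, d, lam):
    """Regularized least-squares estimate:
    w = (X^T X + lam * I)^{-1} X^T d.
    lam = 0 gives the ordinary (maximum-likelihood) least-squares solution;
    lam > 0 corresponds to the MAP estimate under a zero-mean Gaussian
    prior on w: biased, but more stable."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ d)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
w_true = np.array([1.0, 0.0, -1.0, 2.0])
d = X @ w_true + 0.1 * rng.standard_normal(200)
w_ml = regularized_ls(X, d, lam=0.0)   # ordinary least squares
w_map = regularized_ls(X, d, lam=1.0)  # shrunk toward the prior mean (zero)
```

The shrinkage of `w_map` toward zero is exactly the bias traded for stability mentioned in the notes.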
  • The representation of a stochastic process by a linear model may be used for synthesis or analysis.
  • In synthesis, we generate a desired time series by assigning a formulated set of values to the parameters of the model and feeding it with white noise (of zero mean and prescribed variance). The obtained model is called a generative model.
    • Note: I bet that synthesis is similar to simulation; is that right or wrong?!
  • In analysis, we estimate the parameters of the model by processing a given time series of finite length, using the Bayesian approach or regularized method of least square.
  • In estimation, we need to pick the most suitable model to process the data. This is called the model selection problem.
  • The Minimum-Description-Length (MDL) principle is a model selection method pioneered by Rissanen.
  • MDL was inspired from:
    • Kolmogorov Complexity Theory. Kolmogorov defined complexity as follows
      • “The algorithmic complexity of a data sequence is the length of the shortest binary computer program that prints out the sequence and then halts”.
    • Regularity itself may be identified with the “ability to compress”.
  • MDL states: “Given a set of hypotheses H and a data sequence d, we should try to find the particular hypothesis or some combination of hypotheses in H that compresses the data sequence d the most”.
  • The MDL principle has many versions; one of the oldest and simplest is called the simplistic two-part code MDL principle for probabilistic modeling.
    • By “simplistic” we mean that the codelengths under consideration are not determined in an optimal fashion.
    • “Code” and “codelengths” pertain to the process of encoding the data sequence in the shortest or least redundant manner.
  • 111-In MDL we are interested in identifying the model that best explains an unknown environment that is responsible for generating the training sample {(x_i, d_i)}, where x_i is the stimulus and d_i is the corresponding response.
  • The above description is called the model-order selection problem.
  • 111-Description of MDL selection process.
  • Attributes of MDL Principle:
    • The MDL principle implements a precise form of Occam’s razor, which states a preference for simple theories: “Accept the simplest explanation that fits data”.
    • The MDL principle is a consistent model selection estimator in the sense that it converges to the true model order as the sample size increases.
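A sketch of the simplistic two-part MDL idea applied to model-order selection, using polynomial order as the “model order”. The two-part codelength formula (data cost plus parameter cost) is a standard approximation, and the noise-free cubic test signal is an assumption chosen to keep the example deterministic.

```python
import numpy as np

def mdl_polynomial_order(x, d, max_order=6):
    """Simplistic two-part MDL: total description length =
    (N/2) * log(residual variance)   # codelength of data given the model
    + (k/2) * log(N)                 # codelength of the k model parameters
    The order that minimizes the sum is selected."""
    N = len(x)
    best_order, best_dl = None, np.inf
    for order in range(1, max_order + 1):
        coeffs = np.polyfit(x, d, deg=order)
        var = np.mean((d - np.polyval(coeffs, x)) ** 2)
        k = order + 1                          # number of fitted coefficients
        dl = 0.5 * N * np.log(var + 1e-12) + 0.5 * k * np.log(N)
        if dl < best_dl:
            best_order, best_dl = order, dl
    return best_order

# A cubic signal: orders above 3 cannot compress the data any further,
# so their extra parameter cost makes them lose.
x = np.linspace(-1.0, 1.0, 200)
d = 2 * x ** 3 - x
```

This illustrates the Occam’s razor attribute: the simplest hypothesis that fits (compresses) the data wins.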
  • Questions:
    • 99-What is meant by stationary environment?
      • The statistics of the system don’t change with time.
    • 100-Why in joint statistics we need correlation matrix of regressors and variances of D?
      • To exploit the dependency of the response on the variances of the regressors and variances of D.
    • 100-Why X and D means are assumed to be zero?
      • To normalize data and put them in the same space.
    • 101-What’s meant by: P(w, d|x), P(w| d, x).
      • P(w, d|x): Joint Probability of w & d given x. Where x is given and w & d are unknown.
      • P(w| d, x): probability of w given joint probability of d and x. Where both d and x are known.
    • 101-Why we’ve made the series of these equations?
    • 101-How equation 2.5 is valid?
    • 102-What’s meant by equation 2.10, 2.11
    • 103-What’s meant by identically distributed? A life example.
      • Follows same distribution type.
    • 103-What’s meant by single-shot realization?
      • Instantiation value.
    • 103-Why expectational error is with zero mean? (Assumption #2)
    • 104-Equation 2.19; why there’s no division on the number of the elements?
      • Because the data are already normalized.
    • 107-Why the ½ appeared?
      • To simplify calculations and takes average of error.
    • 107-How is OLSE the same as MLE?
      • Because they have the same cost equation.
    • 107-What’s the problem of “having distinct possibility of obtaining a solution that lacks uniqueness and stability”?
      • Because the estimation is only based on the observed data.
    • 107-How “This solution is identical to MAP estimate solution”?!
      • Because they have the same cost equation.
    • 110-Is MDL similar to MLE?!
    • 110-What we mean by “p should be encoded” (last paragraph).
    • 111-What’s Occam’s razor theorem? (in short)

Rosenblatt’s Perceptron

  • Perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly separable.
  • Linearly separable patterns are patterns that lie on opposite sides of a hyperplane.
  • In 1958, Rosenblatt was the first to propose the perceptron as the first model for learning with a teacher.
  • 79-Structure of neuron.
  • For adapting the perceptron we may use an error-correction rule known as the perceptron convergence algorithm.
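The perceptron convergence algorithm can be sketched as an error-correction loop; the toy AND-like data set and the learning rate below are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Error-correction rule: on a misclassified sample, move the weight
    vector toward (or away from) that sample. For linearly separable data
    the perceptron convergence theorem guarantees termination."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # absorb bias into weights
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(Xb, y):
            if target * (w @ x) <= 0:               # wrong side of (or on) the hyperplane
                w = w + eta * target * x            # error-correction update
                mistakes += 1
        if mistakes == 0:                           # all samples separated
            break
    return w

# AND-like linearly separable problem with +1/-1 labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
```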
  • Cauchy-Schwarz inequality:
    • Given two vectors x and y, the Cauchy-Schwarz inequality states that: |x^T y|^2 ≤ ||x||^2 ||y||^2.
  • We use the Bayes classifier when we know the parameters of the two-class classification problem. Otherwise, the perceptron is suitable for any two linearly separable classes without requiring such parameters.
  • Minsky and Papert proved that the perceptron as defined by Rosenblatt is inherently incapable of making some global generalizations on the basis of locally learned examples.
  • Key terms: perceptron convergence theorem.
  • Proof of convergence algorithm.
  • Questions:
    • In convergence algorithm proof, how equation 1.10 is valid?!

Introduction to Neural Networks and Learning Machines

  • A neural network is a massively parallel distributed processor made up of simple processing units that has a natural propensity for storing experiential knowledge and making it available for use.
  • Brain is a highly complex, nonlinear and parallel computer.
  • Brain is able to accomplish perceptual recognition tasks in 100-200 ms whereas tasks of much lesser complexity take a great deal longer on a powerful computer.
  • Much of the development of the human brain takes place during the first two years after birth! But the development continues well beyond this stage.
  • Plasticity permits the developing nervous system to adapt to its surrounding environment.
  • Neural Network is a machine that is designed to model the way in which the brain performs a particular task or function of interest.
  • A learning algorithm is a procedure used to modify the synaptic weights of the network in an orderly fashion to attain a desired design objective.
  • Properties and capabilities of NN:
    • Nonlinearity.
    • Input-Output Mapping.
    • Adaptivity.
      • The principal time constants of the system should be long enough for the system to ignore spurious disturbances, and yet short enough to respond to meaningful changes in the environment. This is known as the stability-plasticity dilemma.
    • Evidential Response.
      • Supplying each decision with confidence factor.
    • Contextual Information.
      • Every neuron in the network is affected by the global activity of all other neurons in the network.
    • Fault Tolerance.
    • VLSI Implementation.
    • Uniformity of Analysis and Design.
    • Neurobiological Analogy.
  • It’s estimated that there are approximately 10 billion neurons in the human cortex and 60 trillion synapses.
  • Synapses or nerve endings are elementary structural and functional units that mediate the interconnections between neurons.
  • Adaptivity in human brain is made by:
    • Creation of new synaptic connections between neurons or,
    • Modification of existing synapses.
  • The ANNs we are presently able to design are primitive compared with the local circuits and the interregional circuits of the brain.
  • Types of activation function:
    • Threshold Function (Heaviside Function).
    • Sigmoid Function.
  • See page 46: mathematical definition of neural network (as a directed graph) and 4 properties of it.
  • See page 47: partially complete directed graph (architectural graph) and its properties.
  • The manner in which the neurons of a NN are structured is intimately linked with the learning algorithm used to train the network.
  • Networks Architecture:
    • Single-Layer Feedforward Networks.
    • Multilayer Feedforward Networks.
      • By adding one or more hidden layers the network is enabled to extract higher-order statistics from its input.
      • We’ve two types of connected networks: fully connected and partially connected.
    • Recurrent Networks.
      • Self-feedback refers to a situation where the output of a neuron is fed back into its own input.
  • Knowledge refers to stored information or models used by a person or machine to interpret, predict, and appropriately respond to the outside world.
  • Characteristics of Knowledge Representation:
    • What information is actually made explicit?
    • How the information is physically encoded for subsequent use?
  • 55-See differences between pattern classifiers and neural networks in page.
  • Knowledge representation of the surrounding environment is defined by the values taken by the free parameters (i.e. synaptic weights and biases) of the network.
  • Knowledge Representation Rules:
    • Similar inputs from similar classes should usually produce similar representations inside the networks and should therefore be classified as belonging to the same class.
    • Items to be categorized as separate classes should be given widely different representations in network.
    • If a particular feature is important, then there should be a large number of neurons involved in the representation of that item in the network.
    • Prior information and invariances should be built into the design of a neural network whenever they are available, so as to simplify the network design by not having to learn them.
  • To find similarity for deterministic terms we use Euclidian distance. For stochastic terms we use Mahalanobis distance.
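A small sketch contrasting the two distances; the covariance matrix here is an assumed example.

```python
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def mahalanobis(x, y, cov):
    """Distance weighted by the inverse covariance of the process;
    with cov = I it reduces to the Euclidean distance."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x = np.array([1.0, 0.0])
origin = np.zeros(2)
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])   # first axis has high variance
# A deviation along the high-variance axis counts for less
# under the Mahalanobis distance than under the Euclidean one.
```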
  • Specialized Structured Neural Networks are desired for the following reasons:
    • Having a smaller number of free parameters: the network needs fewer training samples, learns faster, and often generalizes better.
    • The rate of information transmission through a specialized network (i.e. the network throughput) is accelerated.
    • The cost is reduced because of its smaller size relative to a fully connected network.
  • Ad hoc techniques to build prior information into neural network:
    • Restricting the network architecture, this is achieved through the use of local connections known as receptive fields.
    • Constraining the choice of synaptic weights, which is implemented through the use of weight sharing.
  • Receptive field of a neuron is defined as the region of the input field over which the incoming stimuli can influence the output signal produced by the neuron.
  • Techniques for rendering classifier-type neural network invariant to transformations:
    • Invariance by structure:
      • Synaptic connections between the neurons of the network are created so that transformed versions of the same input are forced to produce the same output (e.g. rotation about an image’s center).
    • Invariance by training:
      • Ability to recognize an object from different perspectives using several aspect views.
      • Disadvantages from engineering aspect:
        • Probability of misclassification.
        • High computation demand (especially with high features dimensions)
    • Invariant feature space:
      • This technique relies on the ability of extracting features that characterize the essential information content of an input data set and that are invariant to transformations.
      • Advantages of using this technique:
        • Reduced number of features.
        • Requirements of the design are relaxed
        • Invariance for all objects with respect to known transformations is assured.
  • Learning Paradigms:
    • Supervised Learning.
    • Unsupervised Learning.
    • Reinforcement Learning.
  • Learning Tasks:
    • Pattern Association.
      • Associative memory is a brain like distributed memory that learns by association.
      • Association forms:
        • Autoassociation (Unsupervised).
        • Heteroassociation (Supervised).
      • Phases of associative memory:
        • Storage phase.
        • Recall phase.
      • The challenge here is to make the storage capacity q (expressed as a percentage of the total number N of neurons used to construct the network) as large as possible.
    • Pattern Recognition.
      • Pattern recognition is the process of receiving a pattern/signal and assigning it to one of a prescribed number of classes.
      • Forms of pattern recognition machines using neural networks:
        • Machine is constructed from feature extractor and supervised classification.
          • Feature extractor applies dimensionality reduction (i.e. data compression).
        • Machine is constructed from Feedforward network using supervised learning algorithm.
          • The task of feature extraction is performed by the computational units in the hidden layers of the network.
    • Function Approximation.
      • Given a set of labeled examples, the requirement is to design a neural network that approximates the unknown function f(.) such that the function F(.) describes input-output mapping actually realized by the network, is close enough to f(.) in Euclidean sense over all inputs (i.e. for all x)
      • Ability of a NN to approximate an unknown input-output mapping is characterized by:
        • System identification: learning the input-output mapping of an unknown system from labeled examples.
        • Inverse modeling:
    • Control.
      • The primary objective of the controller is to supply appropriate inputs to the plant to make its output y track the reference signal d. In other words, the controller has to invert the plant’s input-output behavior.
      • Approaches for accounting k, j:
        • Indirect Learning.
        • Direct Learning.
    • Beamforming.
      • Beamforming is used to distinguish between the spatial properties of the target and background noise. The device used to do the Beamforming is called a beamformer.
      • Task of Beamforming is complicated according to two factors:
        • Target signal originates from an unknown direction.
        • There is no prior information available on the interfering signals.
  • Key terms: key pattern, memorized pattern, error in recall, memoryless MIMO system, neuro-beamformer, attentional neurocomputers, semisupervised learning.
  • Questions:
    • What’s linear adaptive filter theory?
    • What’s tabula rasa learning?
    • Page 33 line 19. What’s meant by this paragraph?
    • Needs more discussion about “Uniformity of Analysis and Design”.
    • Discussion about the 2 examples in page 35.
    • Last paragraph in page 37.
    • What’s logistic function?
    • 48-What do we mean by a dynamic system?
    • 48-Why are A and B operators? And what results from this?
    • 48-What’s the difference between A and ?
    • 48-What do we mean by non-commutative?
    • 48-What are the properties of non-commutative operators?
    • 49-Explanation of equation 19, 20?
    • 49-What’s binomial expansion?
    • 49-Explanation of 2 cases in bottom.
    • 53-Last paragraph
    • 59- Ad hoc techniques to build prior information into neural network.
    • 60-What are the differences between convolution network and usual networks?
    • 61-What’s meant by occlusion?
    • 65-2nd portion of page until unsupervised learning.
    • 66-Reinforcement learning paragraph
    • 68-What is meant by space of dimensionality?
    • 73-Is system identification done using the control task?
    • 73-In equation 37 how will he get the differentiation of a constant?
    • 73-What is meant by an element of a plant?
    • 73-What’s the problem of j, k?
    • 73-What’s direct and indirect learning?
    • 74-Discussion on diagram of generalized sidelobe canceller.