Multilayer Perceptron

  • A multilayer perceptron is a neural network with one or more hidden layers.
  • Properties of multilayer neural networks:
    • The model of each neuron in the network includes a nonlinear activation function that’s differentiable.
    • The network contains one or more hidden layers.
    • The network exhibits a high degree of connectivity through its synaptic weights.
  • Common deficiencies in multilayer neural networks:
    • Theoretical analysis of an MLNN is difficult to undertake.
      • This comes from the nonlinearity and high connectivity of the network.
    • The learning process is harder to visualize.
      • This comes from the existence of several layers in the network.
  • The back-propagation algorithm is used to train an MLNN. The training proceeds in two phases:
    • In the forward phase, the synaptic weights of the network are fixed and the input signal is propagated through the network, layer by layer, until it reaches the output.
    • In the backward phase, an error signal is produced by comparing the output of the network with a desired response. The resulting error signal is propagated through the network, again layer by layer, but this time in the backward direction. In this phase, successive adjustments are applied to the synaptic weights of the network.
  • The term “back propagation” came into use after 1985, when it was popularized through the publication of the book “Parallel Distributed Processing” by Rumelhart and McClelland.

  • Two kinds of signals exist in an MLP (Multilayer Perceptron):
    • Function Signals (Input Signals):
      • A function signal is an input signal that comes in at the input end of the network, propagates forward (neuron by neuron) through the network, and emerges at the output end of the network as an output signal.
      • It’s called a function signal because:
        • It’s presumed to perform a useful function at the output of the network.
        • At each neuron it passes through, the signal is calculated as a function of the inputs and associated weights applied to that neuron.
    • Error Signals:
      • An error signal originates at an output neuron of the network and propagates backward (layer by layer) through the network.
      • It’s called an error signal because:
        • Its computation by every neuron of the network involves an error-dependent function.
  • Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
    • Computation of the function signal appearing at the output of the neuron, which is expressed as a continuous nonlinear function of the input signals and synaptic weights associated with that neuron.
    • Computation of an estimate of the gradient vector, which is needed for the backward pass through the network.
      • Gradient vector: the gradients of the error surface with respect to the weights connected to the inputs of a neuron.
  • Function of Hidden Neurons:
    • Hidden neurons act as feature detectors.
    • As learning progresses, the hidden neurons discover the salient features that characterize the training data.
    • They do so by performing a nonlinear transformation of the input data into a new space called the feature space.
    • In the feature space, the pattern-classification task is simpler because the classes are more easily separated than in the original input space.
    • This feature-forming ability is the main difference between Rosenblatt’s perceptron and multilayer neural networks.
  • Credit Assignment Problem:
    • The credit-assignment problem is the problem of assigning credit or blame for overall outcomes to each of the internal decisions made by the hidden computational units of the learning system, since those decisions are responsible for the overall outcomes in the first place.
    • An error-correction learning algorithm on its own is not suitable for resolving the credit-assignment problem for an MLNN: the error can only be measured at the output neurons, yet the hidden neurons, which have no desired response of their own, play a big role in the network’s decision.
    • The back-propagation algorithm solves the credit-assignment problem in an elegant manner.
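
As a small illustration of hidden neurons acting as feature detectors, the sketch below hand-picks the weights of a 2-2-1 network of logistic neurons so that it computes XOR. The weight values and the helper names (`logistic`, `neuron`, `xor_net`) are assumptions made up for this example, not values from the book; in practice the weights would be learned by back propagation. XOR is not linearly separable in the input space, but it becomes separable in the feature space formed by the two hidden outputs.

```python
import math

def logistic(v):
    """Logistic activation: a smooth, differentiable nonlinearity."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron(inputs, weights, bias):
    """Function signal of one neuron: activation of its induced local field."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(v)

def xor_net(x1, x2):
    # Hidden layer: two hand-picked feature detectors (assumed weights).
    h_or = neuron([x1, x2], [20.0, 20.0], -10.0)   # close to 1 if x1 OR x2
    h_and = neuron([x1, x2], [20.0, 20.0], -30.0)  # close to 1 if x1 AND x2
    # Output layer: one linear boundary in the (h_or, h_and) feature space.
    y = neuron([h_or, h_and], [20.0, -20.0], -10.0)
    return (h_or, h_and), y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    features, y = xor_net(x1, x2)
    print((x1, x2), "features:", [round(h, 3) for h in features], "output:", round(y, 3))
```

The output neuron only has to separate "OR but not AND" from everything else, which a single neuron can do once the hidden layer has produced those two features.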

Batch Learning

  • Before describing the algorithm, we need to introduce some equations found on page 157.
  • Batch learning is a form of supervised learning in which adjustments to the synaptic weights are performed after the presentation of all N examples in the training sample, which constitutes one epoch of training.
  • Adjustments to the synaptic weights are therefore made on an epoch-by-epoch basis.
  • With the method of gradient descent used to perform the training, batch learning has two advantages:
    • Accurate estimation of the gradient vector.
    • Parallelization of the learning process.
  • From a practical perspective, batch learning is demanding in terms of storage requirements.
  • In a statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well suited for solving nonlinear regression problems.
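
To make the epoch-by-epoch update concrete, here is a minimal sketch of batch gradient descent. It is deliberately simplified to a single linear neuron y = w*x + b rather than a full multilayer perceptron, and the data set, learning rate, and function name `batch_epoch` are assumptions for illustration only. The cost being minimized is the squared error averaged over all N examples.

```python
def batch_epoch(w, b, samples, eta):
    """One epoch of batch learning: accumulate the gradient over all
    N examples, then adjust the weights once."""
    n = len(samples)
    grad_w = 0.0
    grad_b = 0.0
    for x, d in samples:
        e = d - (w * x + b)      # error signal for this example
        grad_w += -e * x / n     # gradient of the average error energy
        grad_b += -e / n
    return w - eta * grad_w, b - eta * grad_b

# Made-up training sample generated by d = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = 0.0, 0.0
for epoch in range(1000):
    w, b = batch_epoch(w, b, samples, eta=0.1)
print(round(w, 3), round(b, 3))
```

Because the per-example gradient terms inside the loop are independent of each other, they could be computed in parallel and summed afterwards, which is where the parallelization advantage comes from. The whole sample has to be available for every update, which is the storage cost.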

Online Learning

  • In the online method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed on an example-by-example basis. The cost function to be minimized is therefore the total instantaneous error energy.
  • Such an algorithm is not suitable for parallelization of the learning process.
  • Online learning is sometimes called a stochastic method, because the examples are presented to the network in random order.
  • Advantages of online learning:
    • This stochasticity has the desirable effect of making it less likely for the learning process to be trapped in a local minimum.
    • Moreover, online learning requires less storage than batch learning.
    • Also, if the training data is redundant, online learning benefits from this redundancy to improve its learning.
    • Finally, online learning is able to track small changes in the training data, especially if the data environment is non-stationary.
    • It’s simple to implement.
    • It provides effective solutions to large-scale and difficult classification problems.
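
The example-by-example update can be sketched as follows. As a simplification, the model is a single linear neuron y = w*x + b instead of a multilayer perceptron; the data, the learning rate, the seed, and the name `online_epoch` are assumptions made for this illustration.

```python
import random

def online_epoch(w, b, samples, eta, rng):
    """One epoch of online learning: the weights are adjusted after
    every single example, presented in random order."""
    order = list(samples)
    rng.shuffle(order)               # random presentation order (stochastic method)
    for x, d in order:
        e = d - (w * x + b)          # instantaneous error signal
        w += eta * e * x             # immediate correction, no accumulation
        b += eta * e
    return w, b

# Made-up training sample generated by d = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
rng = random.Random(0)               # fixed seed so the run is repeatable
w, b = 0.0, 0.0
for epoch in range(1000):
    w, b = online_epoch(w, b, samples, eta=0.05, rng=rng)
print(round(w, 3), round(b, 3))
```

Only the current example is needed at each step, so nothing has to be stored between updates. Each update depends on the weights left by the previous one, which is why the loop cannot be parallelized the way the batch gradient can.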

The Back Propagation Algorithm

  • First, you should read the mathematical derivation of the algorithm on pages 159-161.
  • The key factor involved in the calculation of the weight adjustment is the error signal at the output of neuron j. As we see, the credit-assignment problem arises here. In this context we may identify two distinct cases:
    • Case #1: Neuron j is an output node:
      • The neuron is supplied with a desired response of its own, so its error signal follows directly from the equation: $e_j(n) = d_j(n) - y_j(n)$
        • Where $d_j(n)$ is the desired response and $y_j(n)$ is the actual output of neuron j for example n. The local gradient is then $\delta_j(n) = e_j(n)\,\varphi_j'(v_j(n))$, with $v_j(n)$ the induced local field of neuron j.
    • Case #2: Neuron j is a hidden node:
      • When neuron j is located in a hidden layer of the network, there’s no specified desired response for that neuron.
      • Accordingly, the error signal for a hidden neuron has to be determined recursively, working backwards in terms of the error signals of all the neurons to which that hidden neuron is directly connected.
      • The final back-propagation formula for the local gradient is: $\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$
        • Where k ranges over the neurons in the next layer to which hidden neuron j is connected.
      • For the derivation, refer to pages 162-163.
  • In summary, the correction applied to the synaptic weight connecting neuron i to neuron j is defined by: $\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$
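  • Any activation function used in a multilayer neural network should be continuous and differentiable.
  • The most commonly used activation function is the sigmoidal nonlinearity, two forms of which are described here:
    • Logistic Function: $\varphi(v) = \dfrac{1}{1 + \exp(-a v)}$, with $a > 0$.
    • Hyperbolic Tangent Function: $\varphi(v) = a \tanh(b v)$, with positive constants a and b.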
  • Learning Parameter:
    • The smaller we make the learning-rate parameter $\eta$, the smaller the changes to the synaptic weights in the network will be from one iteration to the next, and the smoother the trajectory in weight space will be. This comes at the cost of a slower rate of learning.
    • On the other hand, if we make $\eta$ too large in order to speed up the rate of learning, the resulting large changes in the synaptic weights may take such a form that the network becomes unstable (i.e. oscillatory).
    • A simple method of increasing the rate of learning while avoiding the danger of instability is to include a momentum term, as shown by:
      • $\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) + \eta\, \delta_j(n)\, y_i(n)$,
      • Where $\alpha$, usually a positive number, is called the momentum constant.
      • Also, $z^{-1}$ is the unit-time delay operator, which supplies $\Delta w_{ji}(n-1)$ from $\Delta w_{ji}(n)$.
      • The above equation is called the generalized delta rule. It reduces to the ordinary delta rule as a special case when $\alpha = 0$.
    • The inclusion of momentum in the back-propagation algorithm has a stabilizing effect in directions that oscillate in sign.
    • The momentum term may also have the benefit of preventing the learning process from terminating in a shallow local minimum on the error surface.
    • In reality, the learning-rate parameter may be connection dependent, such that each connection has its own $\eta_{ji}$.
  • Stopping Criteria:
    • In general, the back-propagation algorithm can’t be shown to converge, and there are no well-defined criteria for stopping its operation.
    • We may formulate a sensible convergence criterion for back-propagation learning as follows (Kramer and Sangiovanni-Vincentelli 1989):
      • The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
    • The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).
    • Another criterion:
      • The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (typically in the range of 0.1 to 1 percent per epoch).
    • Another, theoretically supported criterion:
      • After each learning iteration, the network is tested for its generalization performance. The learning process stops when the generalization performance is adequate or has peaked.
  • Note that in each training epoch the order in which the samples are presented should be randomized.
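
The two cases for the local gradient and the delta rule can be sketched in a few lines. This is a minimal illustration, not the book's implementation: it assumes a 2-2-1 network with logistic neurons (so the derivative of the activation is y(1 - y)), made-up weights and a made-up training example, and hypothetical helper names. The last part checks the corrections against a finite-difference estimate of the gradient of the instantaneous error energy.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, W_h, W_o):
    """Forward phase. Each weight row is [bias, w_1, w_2, ...]."""
    y_in = [1.0] + list(x)                      # fixed input 1.0 carries the bias
    y_h = [logistic(sum(w * y for w, y in zip(row, y_in))) for row in W_h]
    y_hb = [1.0] + y_h
    y_o = [logistic(sum(w * y for w, y in zip(row, y_hb))) for row in W_o]
    return y_in, y_hb, y_o

def backward(d, y_hb, y_o, W_o):
    """Backward phase: local gradients of output and hidden neurons."""
    # Case 1, output neuron j: delta_j = e_j * phi'(v_j).
    delta_o = [(dj - yj) * yj * (1.0 - yj) for dj, yj in zip(d, y_o)]
    # Case 2, hidden neuron j: delta_j = phi'(v_j) * sum_k delta_k * w_kj.
    delta_h = []
    for j in range(1, len(y_hb)):               # index 0 is the bias input
        back = sum(dk * row[j] for dk, row in zip(delta_o, W_o))
        delta_h.append(y_hb[j] * (1.0 - y_hb[j]) * back)
    return delta_h, delta_o

def corrections(eta, deltas, inputs, alpha=0.0, previous=None):
    """Generalized delta rule: dw_ji(n) = alpha*dw_ji(n-1) + eta*delta_j*y_i.
    With alpha = 0 (the default) this is the plain delta rule."""
    new = [[eta * dj * yi for yi in inputs] for dj in deltas]
    if previous is not None:
        new = [[alpha * p + c for p, c in zip(p_row, c_row)]
               for p_row, c_row in zip(previous, new)]
    return new

def error_energy(x, d, W_h, W_o):
    y_o = forward(x, W_h, W_o)[2]
    return 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, y_o))

# Assumed toy values, for illustration only.
W_h = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.5]]     # 2 hidden neurons
W_o = [[0.05, -0.6, 0.7]]                       # 1 output neuron
x, d, eta = [1.0, 0.5], [1.0], 0.5

y_in, y_hb, y_o = forward(x, W_h, W_o)
delta_h, delta_o = backward(d, y_hb, y_o, W_o)
dW_h = corrections(eta, delta_h, y_in)
dW_o = corrections(eta, delta_o, y_hb)

def numeric_correction(W, row, col, h=1e-6):
    """-eta * dE/dw estimated by central finite differences."""
    old = W[row][col]
    W[row][col] = old + h
    e_plus = error_energy(x, d, W_h, W_o)
    W[row][col] = old - h
    e_minus = error_energy(x, d, W_h, W_o)
    W[row][col] = old
    return -eta * (e_plus - e_minus) / (2 * h)

print(dW_h[0][1], numeric_correction(W_h, 0, 1))
print(dW_o[0][2], numeric_correction(W_o, 0, 2))
```

The two numbers on each printed line should agree closely, which is a handy sanity check whenever you write the backward phase by hand. Passing `alpha` and the previous corrections to `corrections` adds the momentum term.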

Designing a neural network with back-propagation algorithm is more of an art than a science

  • Heuristics for making back-propagation algorithm perform better:
    • Stochastic update is recommended over batch update.
    • Maximizing the information content.
      • This is done by:
        • Using an example that results in the largest training error.
        • Using an example that is radically different from all those previously used.
    • Activation Function.
      • It’s preferred to use a sigmoid activation function that’s an odd function of its argument.
      • The hyperbolic tangent function is the recommended one.
      • See page 175 for the useful properties of the hyperbolic tangent function.
    • Target values:
      • It’s recommended that the target values (desired responses) be chosen within the range of the sigmoid activation function.
    • Normalizing the inputs.
      • Each input variable should be preprocessed so that its mean value, averaged over the entire training sample, is close or equal to zero.
      • In order to accelerate the back-propagation learning process, the normalization of the inputs should also include two other measures:
        • The input variables contained in the training set should be uncorrelated; this can be done by using principal-component analysis.
        • The decorrelated input variables should be scaled so that their covariances are approximately equal.
      • This gives us 3 normalization steps:
        • Mean removal.
        • Decorrelation.
        • Covariance equalization.

    • Weights initialization.
    • Learning from hints.
      • This is achieved by including prior knowledge in the learning process.
    • Learning Rates.
      • All neurons in the multilayer perceptron should ideally learn at the same rate. The last layers usually have larger local gradients than the front layers, so the learning rate should be assigned a smaller value in the last layers than in the front layers.
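
The three normalization steps listed under “Normalizing the inputs” (mean removal, decorrelation, covariance equalization) can be sketched for the two-variable case, where principal-component analysis reduces to a single rotation. The data values and the function names are assumptions for illustration; a real pipeline would use a linear-algebra library and handle any number of input variables.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Population covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def normalize_2d(x1, x2):
    # Step 1: mean removal.
    m1, m2 = mean(x1), mean(x2)
    c1 = [a - m1 for a in x1]
    c2 = [b - m2 for b in x2]
    # Step 2: decorrelation. For two variables, PCA is the rotation that
    # diagonalizes the 2x2 covariance matrix [[a, b], [b, c]].
    a, b, c = covariance(c1, c1), covariance(c1, c2), covariance(c2, c2)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    p1 = [cos_t * u + sin_t * v for u, v in zip(c1, c2)]
    p2 = [-sin_t * u + cos_t * v for u, v in zip(c1, c2)]
    # Step 3: covariance equalization. Scale each component to unit variance
    # (assumes neither component has zero variance).
    s1 = math.sqrt(covariance(p1, p1))
    s2 = math.sqrt(covariance(p2, p2))
    return [u / s1 for u in p1], [v / s2 for v in p2]

# Made-up, correlated input variables.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]
z1, z2 = normalize_2d(x1, x2)
print("means:", round(mean(z1), 6), round(mean(z2), 6))
print("variances:", round(covariance(z1, z1), 6), round(covariance(z2, z2), 6))
print("cross-covariance:", round(covariance(z1, z2), 6))
```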

Generalization

  • A network is said to generalize well when the input-output mapping computed by the network is correct (or nearly so) for test data not used in training.
  • The learning process may be viewed as a “curve-fitting” problem. Thus, generalization is the result of the interpolation performed by the network.
  • Memorization in a neural network usually leads to bad generalization. “Memorization” is essentially a “look-up table”, which implies that the input-output mapping computed by the neural network is not smooth.
  • Generalization is influenced by three factors:
    • The size of the training sample and how representative it is of the environment of interest.
    • Architecture of the neural network.
    • Physical complexity of the problem at hand.
  • In practice, good generalization is achieved if the training sample size, N, satisfies:

    $N = O\left(\dfrac{W}{\varepsilon}\right)$

  • Where:
    • W is the total number of free parameters (i.e. synaptic weights and biases) in the network.
    • $\varepsilon$ denotes the fraction of classification errors permitted on test data.
    • $O(\cdot)$ denotes the order of the quantity enclosed within.
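
As a quick worked example of the rule of thumb N = O(W/ε), the sketch below counts the free parameters of a fully connected network and divides by the permitted error fraction. The layer sizes, the value of ε, and the function names are assumptions for illustration, and the constant hidden in the order notation is simply taken as 1, so the result is only an order-of-magnitude guide.

```python
def count_free_parameters(layer_sizes):
    """Synaptic weights plus biases of a fully connected MLP.
    layer_sizes lists the number of nodes per layer, input layer first."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def training_size_estimate(layer_sizes, epsilon):
    """Order-of-magnitude estimate N ~ W / epsilon (constant taken as 1)."""
    return count_free_parameters(layer_sizes) / epsilon

layers = [10, 5, 1]                       # 10 inputs, 5 hidden, 1 output (assumed)
W = count_free_parameters(layers)         # (10 + 1) * 5 + (5 + 1) * 1 = 61
N = training_size_estimate(layers, epsilon=0.1)
print(W, round(N))
```

So with a 10 percent permitted error, a network with 61 free parameters calls for a training sample of roughly ten times that many examples.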

Cross Validation

  • In statistics, cross-validation starts by randomly dividing the available data set into:
    • Training Data, which is further split into:
      • Estimation Subset: used to select the model.
      • Validation Subset: used to test or validate the model.
    • Testing Data.
  • However, the model selected as best in this way may end up overfitting the validation subset.
  • To guard against this possibility, the generalization performance of the selected model is measured on the test set, which is different from the validation subset.
  • Early-Stopping Method (Holdout Method):
    • Validation Steps:
      • The training is stopped periodically, i.e., after every so many epochs, and the network is assessed using the validation subset with the backward phase disabled (forward pass only).
      • When the validation phase is complete, the estimation (training) is resumed for another period, and the process is repeated.
      • The best model (set of free parameters) is the one at the minimum validation error.
    • Here we should take care not to stop at a local minimum of the validation error, since the validation error may dip and recover several times before it starts to rise for good.
  • Variant of Cross-Validation (Multifold Method):
    • Validation Steps:
      • Divide the data set of N samples into K subsets, where K > 1.
      • In each trial, the network is trained on all the subsets except one and then validated on the subset that was left out. A different subset is used for validation in each trial.
      • The performance of the model is assessed by averaging the squared error under validation over all K trials.
    • A special case of this method is the “leave-one-out method”, where N - 1 examples are used to train the model, and the model is validated by testing it on the example that was left out. This is repeated N times, each time leaving out a different example.
    • The disadvantage of this method is that it requires an excessive amount of computation, since the model has to be trained once per trial.
    • The figure below demonstrates the division of the samples, where the shaded square denotes the validation set.
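
The multifold steps above can be sketched as follows. To keep the example short, the "model" is a placeholder that just predicts the mean of the training targets; it stands in for a trained network and is purely an assumption, as are the data and the function names. The data is assumed to be already shuffled.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k trials."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(xs, ds, k, fit, predict):
    """Average, over all k trials, of the mean squared validation error."""
    trial_errors = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ds[i] for i in train])
        squared = [(ds[i] - predict(model, xs[i])) ** 2 for i in validation]
        trial_errors.append(sum(squared) / len(squared))
    return sum(trial_errors) / len(trial_errors)

# Placeholder model (assumption): predicts the mean of the training targets.
def fit_mean(xs, ds):
    return sum(ds) / len(ds)

def predict_mean(model, x):
    return model

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print("3-fold error:", cross_validate(xs, ds, 3, fit_mean, predict_mean))
print("leave-one-out error:", cross_validate(xs, ds, len(xs), fit_mean, predict_mean))
```

Setting K equal to the number of samples gives the leave-one-out method, at the price of fitting the model once per sample.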

  • Questions:
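    • What do we mean by a differentiable function?
      • It’s a function whose derivative exists at every point of its domain. Such a function is necessarily continuous, but continuity alone isn’t enough: a function with a sharp corner is continuous yet not differentiable at the corner.
    • What are the benefits of differentiable functions?
      • We are able to compute the derivative of the activation function with respect to the induced local field, which back propagation needs in order to compute the local gradients.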
    • 155- “It’s presumed to perform a useful function at the output of the network”. Does the author mean performing an activation function at the output layer, or that the output signal will lead us to do some function?
      • The output signal gives us the network’s response (this is the function).
    • 155- I need a graphical and intuitive explanation of “the gradients of the error surface with respect to the weights connected”.
    • 158- Description of the realization of the learning-curve process.
      • He means the difference between online and batch learning.
    • 158- How does “parallelization of the learning process” arise in batch learning?
      • The computation of the error for each sample can be distributed, and the results then combined.
    • 158- “In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well studied for solving nonlinear regression problems”. Discussion is required here.
    • 159- Why doesn’t batch learning benefit from redundancy?
    • 159- How can we track small changes in the “training data”? Should it be in the “synaptic weights”?
      • Because online learning is sensitive to each example, whereas batch learning looks at the global changes.
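    • 161- What’s the local gradient?
      • It’s the quantity $\delta_j(n)$, the negative of the partial derivative of the error energy with respect to the induced local field of neuron j. For an output neuron it equals the error signal multiplied by the derivative of the activation function. It isn’t a synaptic weight itself; it determines the required change in the synaptic weights of the neuron.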
    • 163- I need to know the mathematical differentiation steps of equation 4.23.
    • 169- “The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).” What’s the drawback here in calculating g(w)?
      • Well, I think that it’s computationally expensive to compute it every epoch.
    • 178- What’s meant by saturation?
    • 177- What’s meant by uncorrelated data?
