Preface of the Algorithm Design Manual

  • Designing correct, efficient, and implementable algorithms for real-world problems requires access to two distinct bodies of knowledge:
    • Techniques:
      • Good algorithm designers understand several fundamental algorithm design techniques, including data structures, dynamic programming, depth-first search, backtracking, and heuristics. Perhaps the single most important design technique is modeling, the art of abstracting a messy real-world application into a clean problem suitable for algorithmic attack.
    • Resources:
      • Good algorithm designers stand on the shoulders of giants. Rather than laboring from scratch to produce a new algorithm for every task, they can figure out what is known about a particular problem. Rather than re-implementing popular algorithms from scratch, they seek existing implementations to serve as a starting point. They are familiar with many classic algorithmic problems, which provide sufficient source material to model most any application.
  • Additional resources for the book can be found on its website.

 

Key Management; Other Public-Key Cryptosystems

  • Uses of public-key schemes in key distribution:
    • The distribution of public keys.
      • Public announcement.
      • Publicly available directory.
      • Public-key authority.
      • Public-key certificates.
    • The use of public-key encryption to distribute secret keys.
  • Public announcement of keys has one weakness: anyone can forge such an announcement and pretend to be an authenticated participant.

  • Elements of Publicly Available Directory:
    • The authority maintains a directory with a {name, public key} entry for each participant.
    • Each participant registers a public key with the directory authority.
    • A participant may replace the existing key with a new one at any time.
    • Participants could also access the directory electronically.
      • So, secure, authenticated communication from the authority to the participant is mandatory.
  • The weakness of this scheme is that if an adversary succeeds in obtaining or computing the private key of the directory authority, the adversary could authoritatively pass out counterfeit public keys.

  • 292-Steps for successful message exchange in public-key authority.

  • A technique known as caching can be used: a user stores the public keys of its correspondents locally, and periodically requests fresh copies to ensure currency.
  • Requirements for Public-Key Certificate:
    • Any participant can read a certificate to determine the name and public key of the certificate’s owner.
    • Any participant can verify that the certificate originated from the certificate authority and is not counterfeit.
    • Only the certificate authority can create and update certificates.
    • Any participant can verify the currency of the certificate.

  • The timestamp T validates the currency of the certificate (i.e. that the certificate is B's current one).
  • Simple Secret Key Distribution Communication Steps:
    • A generates a public/private key pair {PUa, PRa} and transmits a message to B consisting of PUa and an identifier of A, IDA.
    • B generates a secret key, Ks, and transmits it to A, encrypted with A’s public key.
    • A computes D(PRa, E(PUa, Ks)) to recover the secret key. Because only A can decrypt the message, only A and B will know the identity of Ks.
    • A discards PUa and PRa and B discards PUa.
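The three steps above can be sketched with textbook-sized RSA numbers; the parameters p = 61, q = 53, e = 17 are the classic toy example and are NOT secure:

```python
# Toy sketch of the simple secret-key distribution protocol
# (illustration only; real systems use keys of 2048+ bits).

def make_keypair():
    # fixed toy parameters: p = 61, q = 53, e = 17
    p, q, e = 61, 53, 17
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse (Python 3.8+)
    return (e, n), (d, n)        # PUa, PRa

def E(key, m):                   # encryption/decryption are the same modular op
    k, n = key
    return pow(m, k, n)

D = E

# Step 1: A generates {PUa, PRa} and sends PUa (with IDa) to B
PUa, PRa = make_keypair()
# Step 2: B picks a secret key Ks and sends E(PUa, Ks)
Ks = 999
c = E(PUa, Ks)
# Step 3: A recovers Ks = D(PRa, E(PUa, Ks))
assert D(PRa, c) == Ks
```

After step 3 both sides discard the public/private pair, since it served only to transport Ks.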
  • Man-in-the-middle attack is a form of active attack in which the attacker intercepts and selectively modifies communicated data in order to masquerade as one or more of the entities involved in a communication.
  • 296-Read how an adversary can compromise the communication between two entities without being detected.
  • Steps for Secret Key Distribution with Confidentiality and Authentication:
    • A uses B’s public key to encrypt a message to B containing an identifier of A (IDA) and a nonce (N1), which is used to identify this transaction uniquely.
    • B sends a message to A encrypted with PUa and containing A's nonce (N1) as well as a new nonce generated by B (N2). Because only B could have decrypted message (1), the presence of N1 in message (2) assures A that the correspondent is B.
    • A returns N2 encrypted using B’s public key, to assure B that its correspondent is A.
    • A selects a secret key Ks and sends M = E(PUb, E(PRa, Ks)) to B. Encryption of this message with B’s public key ensures that only B can read it; encryption with A’s private key ensures that only A could have sent it.
    • B computes D(PUa, D(PRb, M)) to recover the secret key.
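Steps 4 and 5 above (sign-then-encrypt and its reversal) can be sketched the same way; the two keypairs and the session key below are hypothetical toy values, NOT secure parameters:

```python
# Toy sketch of M = E(PUb, E(PRa, Ks)) and its recovery D(PUa, D(PRb, M)).

def keypair(p, q, e=17):
    n, phi = p * q, (p - 1) * (q - 1)
    return (e, n), (pow(e, -1, phi), n)   # public, private

def apply(key, m):
    k, n = key
    return pow(m, k, n)

PUa, PRa = keypair(61, 53)        # A's pair, n = 3233
PUb, PRb = keypair(89, 97)        # B's pair, n = 8633

Ks = 999                          # secret session key chosen by A
M = apply(PUb, apply(PRa, Ks))    # encrypt with B's public, sign with A's private
recovered = apply(PUa, apply(PRb, M))   # B decrypts, then verifies A's signature
assert recovered == Ks            # only B can read it; only A could have sent it
```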

  • A hybrid scheme retains the use of a key distribution center (KDC) that shares a secret master key with each user and distributes secret session keys encrypted with the master key. A public key scheme is used to distribute the master keys.

Diffie-Hellman Key Exchange

  • The purpose of the algorithm is to enable two users to securely exchange a key that can then be used for subsequent encryption of messages. The algorithm itself is limited to the exchange of secret values.

  • The security of the Diffie-Hellman key exchange lies in the fact that, while it is relatively easy to calculate exponentials modulo a prime, it is very difficult to calculate discrete logarithms. For large primes, the latter task is considered infeasible.
  • If the exchange of YA and YB is not authenticated, a man-in-the-middle attack may occur.
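The exchange can be sketched with the small prime q = 353 and primitive root α = 3 (toy sizes only; real deployments use primes of 2048+ bits):

```python
# Minimal Diffie-Hellman sketch.
import secrets

q = 353          # public prime modulus (toy size)
alpha = 3        # public primitive root of q

xA = secrets.randbelow(q - 2) + 1   # A's private value
xB = secrets.randbelow(q - 2) + 1   # B's private value

YA = pow(alpha, xA, q)   # A sends YA to B
YB = pow(alpha, xB, q)   # B sends YB to A

K_A = pow(YB, xA, q)     # A computes the shared secret
K_B = pow(YA, xB, q)     # B computes the same secret
assert K_A == K_B
```

An eavesdropper sees only q, α, YA, and YB; recovering xA or xB from them is the discrete logarithm problem.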

Elliptic Curve Arithmetic

  • The principal attraction of ECC, compared to RSA, is that it appears to offer equal security for a far smaller key size, thereby reducing processing overhead.
  • Elliptic curves are not ellipses. They are so named because they are described by cubic equations.

y^2 + axy + by = x^3 + cx^2 + dx + e
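ECC in practice typically uses the simplified Weierstrass form y^2 = x^3 + ax + b over a prime field; a quick membership test on a hypothetical toy curve:

```python
# Check whether a point lies on y^2 = x^3 + ax + b over GF(p).

def on_curve(x, y, a, b, p):
    return (y * y - (x ** 3 + a * x + b)) % p == 0

# Toy curve y^2 = x^3 + 2x + 2 over GF(17) (small illustrative example)
assert on_curve(5, 1, 2, 2, 17)       # (5, 1) lies on the curve
assert not on_curve(5, 2, 2, 2, 17)   # (5, 2) does not
```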

Multilayer Perceptron

  • Multilayer perceptron (MLP) denotes a neural network with one or more hidden layers.
  • Properties of multilayer neural networks:
    • The model of each neuron in the network includes a nonlinear activation function that’s differentiable.
    • The network contains one or more hidden layers.
    • Network exhibits a high degree of connectivity through its synaptic weights.
  • Common deficiencies in multilayer neural networks:
    • Theoretical analysis of MLNN is difficult to undertake.
      • This comes from the nonlinearity and high connectivity of the network.
    • The learning process is harder to visualize.
      • This comes from the existence of several layers in the network.
  • Back propagation algorithm is used to train MLNN. The training proceeds in two phases:
    • In the forward phase, the synaptic weights of the network are fixed and the input signal is propagated through the network layer by layer until it reaches the output.
    • In the backward phase, an error signal is produced by comparing the output of the network with a desired response. The resulting error signal is then propagated through the network, layer by layer, but in the backward direction. In this phase, successive adjustments are applied to the synaptic weights of the network.
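The two phases can be illustrated for a tiny 2-2-1 network with logistic units; the weights, input, target, and learning rate below are arbitrary toy values:

```python
# One forward pass and one backward pass for a 2-2-1 network.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# toy weights: w_h[j][i] hidden layer, w_o[j] output neuron
w_h = [[0.15, 0.20], [0.25, 0.30]]
w_o = [0.40, 0.45]
x, d, eta = [0.05, 0.10], 0.99, 0.5

# --- forward phase: weights fixed, signal propagates layer by layer ---
v_h = [sum(w * xi for w, xi in zip(row, x)) for row in w_h]
y_h = [sigmoid(v) for v in v_h]
v_o = sum(w * y for w, y in zip(w_o, y_h))
y_o = sigmoid(v_o)

# --- backward phase: error signal propagates back, weights adjusted ---
e = d - y_o
delta_o = e * y_o * (1 - y_o)                 # output local gradient
delta_h = [y * (1 - y) * delta_o * w          # hidden local gradients
           for y, w in zip(y_h, w_o)]
w_o = [w + eta * delta_o * y for w, y in zip(w_o, y_h)]
w_h = [[w + eta * dh * xi for w, xi in zip(row, x)]
       for row, dh in zip(w_h, delta_h)]
```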
  • The term “back propagation” appeared after 1985, when it was popularized through the publication of the book “Parallel Distributed Processing” by Rumelhart and McClelland.

  • Two kinds of signals exist in MLP (Multilayer Perceptron):
    • Function Signals (Input Signal):
      • It’s an input signal that comes in at the input end of the network, propagates forward (neuron by neuron) through the network and emerges at the output end of the network as output signal.
      • It is called a function signal because:
        • It’s presumed to perform a useful function at the output of the network.
        • The neuron’s signal is calculated as a function of the input signal(s) and associated weights.
    • Error Signals:
      • It originates at an output neuron of the network and propagates backward (layer by layer) through the network.
      • It is called an error signal because:
        • Its computation by every neuron of the network involves an error-dependent function.
  • Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
    • Computation of function signal, which is expressed as a continuous nonlinear function of the input signal and synaptic weights associated with that neuron.
    • Computation of an estimate of the gradient vector which is needed for the backward pass through the network.
      • Gradient vector: the gradients of the error surface with respect to the connected weights of the inputs of a neuron.
  • Function of Hidden Neurons:
    • Hidden neurons act as feature detectors.
    • The hidden neurons are able to discover the salient features that characterize the training data.
    • They do so by performing a nonlinear transformation on the input data, mapping it into a new space called the feature space.
    • In the feature space, the pattern-classification task is simplified and the classes are more easily separated.
    • This function is the main difference between Rosenblatt’s perceptron and Multilayer Neural Networks.
  • Credit Assignment Problem:
    • The credit-assignment problem is the problem of assigning credit or blame for overall outcomes to each of the internal decisions made by the hidden computational units of the learning system, since those decisions are responsible for the overall outcomes in the first place.
    • An error-correction learning algorithm alone is not suitable for resolving the credit-assignment problem in MLNN, because we cannot judge performance from the output neurons alone when the hidden layers play a large role in the decision.
    • Back propagation algorithm is able to solve the credit-assignment problem in an elegant manner.

Batch Learning

  • Before describing the algorithm, we introduce some equations that are found on page 157.
  • Batch learning is a form of supervised learning in which weight adjustment is performed after the presentation of all N examples in the training sample, which constitutes one epoch of training.
  • Adjustments to the synaptic weights are made on an epoch-by-epoch basis.
  • With the method of gradient descent used to perform the training, batch learning has these two advantages:
    • Accurate estimation.
    • Parallelization of learning process.
  • From a practical perspective, batch learning suffers from demanding storage requirements.
  • In a statistical context, batch learning can be viewed as a form of statistical inference. It is therefore well suited for solving nonlinear regression problems.

Online Learning

  • In the online method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron are performed on an example-by-example basis. The cost function to be minimized is therefore the total instantaneous error energy.
  • Such an algorithm is not suitable for parallelization of the learning process.
  • Online learning is sometimes called the stochastic method.
  • Advantages of online learning:
    • This stochasticity has the desirable effect of making it less likely for the learning process to be trapped in a local minimum.
    • Moreover, online learning requires less storage than batch learning.
    • Also, if the training data is redundant, online learning benefits from this redundancy to improve its learning.
    • Finally, online learning is able to track small changes in the training data, especially when the data environment is non-stationary.
    • It's simple to implement.
    • It provides effective solutions to large-scale and difficult classification problems.

The Back Propagation Algorithm

  • First, you should read the mathematical derivation of the algorithm on pages 159-161.
  • The key factor involved in the calculation of the weight adjustment is the error signal at output neuron j. Here the credit-assignment problem arises. In this context we may identify two distinct cases:
    • Case #1: Neuron j is an output node:
      • The error signal is supplied to the neuron by its own desired response, from the equation δ_j(n) = e_j(n)·φ'_j(v_j(n)),
        • where e_j(n) = d_j(n) − y_j(n).
    • Case #2: Neuron j is a hidden node:
      • When a neuron j is located in a hidden layer of the network, there’s no specified desired response for that neuron.
      • Accordingly, the error signal for a hidden neuron has to be determined recursively, working backwards, in terms of the error signals of all the neurons to which that hidden neuron is directly connected.
      • The final back-propagation formula for the local gradient is δ_j(n) = φ'_j(v_j(n)) · Σ_k δ_k(n)·w_kj(n),
        • where k ranges over all the neurons to which hidden neuron j is connected in the next layer.
      • To know the derivation kindly refer to page 162-163.
  • In summary, the correction Δw_ji(n) applied to the synaptic weight connecting neuron i to neuron j is defined by the delta rule: Δw_ji(n) = η·δ_j(n)·y_i(n), i.e. (weight correction) = (learning-rate parameter) × (local gradient) × (input signal of neuron j).
  • Any activation function used in multilayer neural networks should be continuous and differentiable.
  • The most commonly used activation function is sigmoidal nonlinearity. Two forms of which are described here:
    • Logistic Function: φ_j(v_j(n)) = 1 / (1 + exp(−a·v_j(n))), with a > 0.
    • Hyperbolic Tangent Function: φ_j(v_j(n)) = a·tanh(b·v_j(n)), with a, b > 0.
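Both forms, together with the derivatives the backward phase needs, might be coded as follows; the slope parameters a and b, and the constants a = 1.7159, b = 2/3 used for the tanh form, are conventional choices rather than values mandated by the text:

```python
# The two common sigmoid activation functions and their derivatives.
import math

def logistic(v, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * v))

def logistic_deriv(v, a=1.0):
    y = logistic(v, a)
    return a * y * (1.0 - y)            # expressible via the output itself

def tanh_act(v, a=1.7159, b=2.0 / 3.0):
    return a * math.tanh(b * v)

def tanh_deriv(v, a=1.7159, b=2.0 / 3.0):
    y = tanh_act(v, a, b)
    return (b / a) * (a - y) * (a + y)  # = a*b*(1 - tanh^2(b*v))

assert abs(logistic(0.0) - 0.5) < 1e-12
assert abs(tanh_act(0.0)) < 1e-12
```

Writing each derivative in terms of the neuron's own output is what makes the backward pass cheap: no extra function evaluations are needed.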
  • Learning-Rate Parameter η:
    • The smaller we make η, the smaller the changes to the synaptic weights from one iteration to the next will be, and the smoother the trajectory in the network weight space.
    • On the other hand, if we make η too large in order to speed up the rate of learning, the resulting large changes in the synaptic weights may make the network unstable (i.e. oscillatory).
    • A simple method of avoiding this problem is to include a momentum term, as shown by:
      • Δw_ji(n) = α·Δw_ji(n−1) + η·δ_j(n)·y_i(n),
      • where α, usually a positive number, is called the momentum constant, and z⁻¹ (with Δw_ji(n−1) = z⁻¹[Δw_ji(n)]) is the unit-time delay operator.
      • The above equation is called the generalized delta rule; the special case α = 0 yields the original delta rule.
    • The inclusion of momentum in the back-propagation algorithm has a stabilizing effect in directions where the gradient oscillates in sign.
    • The momentum term may also have the benefit of preventing the learning process from terminating in a shallow local minimum on the error surface.
    • In reality, the learning-rate parameter is connection dependent, such that each connection has its own parameter η_ji.
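The generalized delta rule above can be written as a one-line update; note how same-signed gradients accumulate when the momentum constant α is close to 1 (all numbers below are arbitrary toy values):

```python
# Weight update with momentum: dw(n) = alpha*dw(n-1) + eta*delta*y.

def momentum_step(prev_delta_w, eta, alpha, local_grad, y_in):
    return alpha * prev_delta_w + eta * local_grad * y_in

w, dw, eta, alpha = 0.5, 0.0, 0.1, 0.9
for grad in [0.2, 0.2, 0.2]:      # same-signed gradients accumulate speed
    dw = momentum_step(dw, eta, alpha, grad, 1.0)
    w += dw
# after three steps dw > eta*0.2, i.e. larger than a plain delta-rule step
```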
  • Stopping Criteria:
    • In general, the back-propagation algorithm cannot be shown to converge, and there are no well-defined criteria for stopping its operation.
    • We may formulate a sensible convergence criterion for back-propagation learning as follows (Kramer and Sangiovanni-Vincentelli 1989):
      • The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
    • The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).
    • Another criterion:
      • The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small
        (ranges from 0.1 to 1 percent/epoch).
    • Another theoretical criterion:
      • After each learning iteration, the network is tested for its generalization performance. The learning process stops when the generalization performance is adequate or has peaked.
  • Note that in each training epoch the samples should be picked randomly.

Designing a neural network with the back-propagation algorithm is more of an art than a science

  • Heuristics for making back-propagation algorithm perform better:
    • Stochastic update is recommended over batch update.
    • Maximizing the information context.
      • This is done by:
        • Use an example that results in the largest training error.
        • Use an example that is radically different from all those previously worked.
    • Activation Function.
      • It’s preferred to use a sigmoid activation function that’s an odd function in its arguments.
      • The hyperbolic tangent function is the recommended one.
      • 175-See the useful properties of the hyperbolic tangent function.
    • Target values:
      • It’s recommended that target values (desired) be chosen within the range of the sigmoid activation function.
    • Normalizing the inputs.
      • Each input variable should be preprocessed so that its mean value, averaged over the entire training samples, is close or equal to zero.
      • In order to accelerate the back-propagation learning process, the normalization of the inputs should also include two other measures:
        • The input variables contained in the training set should be uncorrelated; this can be done by using principal-component analysis.
        • The decorrelated input variables should be scaled so that their covariances are approximately equal.
      • Here we have three normalization steps:
        • Mean removal.
        • Decorrelation.
        • Covariance equalization.
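Mean removal and covariance equalization can be sketched per feature as below; the decorrelation step (PCA) is omitted for brevity, so this is only a partial illustration of the three steps:

```python
# Per-feature normalization: remove the mean, then equalize variances.
import math

def normalize_features(samples):
    n, dim = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(dim)]
    centered = [[s[j] - means[j] for j in range(dim)] for s in samples]
    stds = [math.sqrt(sum(r[j] ** 2 for r in centered) / n) or 1.0
            for j in range(dim)]
    return [[r[j] / stds[j] for j in range(dim)] for r in centered]

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
out = normalize_features(data)
# each column now has mean ~0 and unit variance
```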

  • Weights initialization.
  • Learning from hints.
    • This is achieved by including prior knowledge to the system.
  • Learning Rates.
    • All neurons in the multilayer perceptron should ideally learn at the same rate; to that end, the learning rate at the last layers should be assigned a smaller value than that of the front layers.

Generalization

  • A network is said to generalize well when the network input-output mapping is correct (or nearly so) for the test data.
  • The learning process may be viewed as “curve fitting” problem. Thus, generalization is performed by the interpolation made by the network.
  • Memorization in a neural network usually leads to bad generalization. “Memorization” is essentially a “look-up table”, which implies that the input-output mapping computed by the neural network is not smooth.
  • Generalization is influenced by three factors:
    • Size of training sample and how they represent the environment of interest.
    • Architecture of the neural network.
    • Physical complexity of the problem at hand.
  • In practice, good generalization is achieved if the training sample size, N, satisfies N = O(W/ε), where:
    • W is the total number of free parameters (i.e. synaptic weights and biases) in the network.
    • ε denotes the fraction of classification errors permitted on test data.
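A worked instance of the rule of thumb, with a hypothetical network size:

```python
# N = O(W / eps): order-of-magnitude sample-size estimate.
W = 1000        # free parameters (weights and biases), hypothetical network
eps = 0.1       # permit 10% classification error on test data
N_min = W / eps
# so on the order of 10,000 training examples are needed
```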

Cross Validation

  • In statistics, cross-validation randomly divides the available data set into:
    • Training Data:
      • Estimate Subset: used to select the model.
      • Validation Subset: used to test or validate the model.
    • Testing Data.
  • However, this best model may be overfitting the validation data.
  • Then, to guard against this possibility, the generalization performance is measured on the test set, which is different from the validation subset.
  • Early-Stopping Method (Holdout Method):
    • Validation Steps:
      • The training is stopped periodically, i.e., after so many epochs, and the network is assessed using the validation subset where the backward mode is disabled.
      • When the validation phase is complete, the estimation (training) is resumed for another period, and the process is repeated.
      • The best model (free parameters) is that at the minimum validation error.
    • Here we should take care not to stop at a local minimum of the validation-error curve.
  • Variant of Cross-Validation (Multifold Method):
    • Validation Steps:
      • Divide the data set of N samples into K subsets, where K > 1.
      • In each trial, the network is trained using K - 1 of the subsets and validated on the remaining one, so each trial uses a different validation subset.
      • The performance of the model is assessed by averaging the squared error under validation over all trials.
    • A special case of this method is called the “leave-one-out method”, where N - 1 examples are used to train the model, and the model is validated by testing it on the example that was left out.
    • Disadvantage of this method is that it requires an excessive amount of computation.
    • The figure in the text demonstrates the division of the samples; the shaded square denotes the validation subset.
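The multifold procedure can be sketched as follows; the "model" here is just the training-set mean, a stand-in for an actual network:

```python
# K-fold cross-validation: K trials, each validating on a different subset.

def kfold_indices(n, k):
    fold = n // k
    for t in range(k):
        val = list(range(t * fold, (t + 1) * fold if t < k - 1 else n))
        train = [i for i in range(n) if i not in val]
        yield train, val

def cross_validate(samples, k, train_fn, error_fn):
    errs = []
    for tr, va in kfold_indices(len(samples), k):
        model = train_fn([samples[i] for i in tr])
        errs.append(error_fn(model, [samples[i] for i in va]))
    return sum(errs) / len(errs)   # average validation error over all trials

# usage with a trivial "model": predict the training mean
data = [1.0, 2.0, 3.0, 4.0]
mean_err = cross_validate(
    data, k=4,
    train_fn=lambda tr: sum(tr) / len(tr),
    error_fn=lambda m, va: sum((v - m) ** 2 for v in va) / len(va))
```

With k equal to the number of samples, as here, this reduces to the leave-one-out method.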

  • Questions:
    • What do we mean by a differentiable function?
      • It's a function whose derivative exists at every point (a stronger condition than mere continuity).
    • What are the benefits of differentiable functions?
      • We are able to use the derivative to find the gradient of the activation function in terms of the induced local field.
    • 155- “It's presumed to perform a useful function at the output of the network”. Does the author mean performing an activation function at the output layer, or that the output signal leads us to perform some function?
      • Because it gives us the network response (this is the function).
    • 155-I need graphical and mental explanation about “the gradients of the error surface with respect to the weights connected”
    • 158-description realization of the learning curve process.
      • He means the difference between Online and Batch.
    • 158-How does “parallelization of the learning process” exist in batch learning?
      • Distribute the computation of error for each sample and then combine them.
    • 158- “In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well studied for solving nonlinear regression problems“. Discussion is required here.
    • 159-Why doesn't batch learning benefit from redundancy?
    • 159-How can we track small changes in “training data”? Should it be in “synaptic weights”?
      • Because it’s sensitive whereas the batch looks for the global changes.
    • 161-What's the local gradient?
      • It's the derivative of the error energy with respect to the neuron's induced local field (δ_j), which drives the weight adjustment for that neuron.
    • 163- I need to know the mathematical differentiation steps of equation 4.23.
    • 169- “The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).” What's the drawback in calculating g(w)?
      • Well, I think it's computationally expensive to compute it each epoch.
    • 178-What’s meant by saturation?
    • 177-What’s meant by uncorrelated data?

Public Key Cryptography and RSA

  • Asymmetric encryption is a form of cryptosystem in which encryption and decryption are performed using two different keys: a public key and a private key. It is also known as public-key encryption.
  • Asymmetric encryption can be used for confidentiality, authentication, or both.
  • The difficulty of attacking RSA is based on the difficulty of finding the prime factors of a composite number.
  • In fact, the security of any encryption scheme depends on the length of the key and the computational work involved in breaking a cipher.
  • Public-key encryption techniques suffer from high computational overhead. This supports the continued need for private-key (symmetric) encryption.

Principles of Public-Key Cryptosystems

  • The concept of public-key cryptography evolved from an attempt to attack two of the most difficult problems associated with symmetric encryption:
    • Key distribution.
      • That was done using either:
        • Two communicants already share a key, which somehow has been distributed to them.
        • The use of a key distribution center.
    • Need for digital signature.
      • That is, could a method be devised that would stipulate, to the satisfaction of all parties, that a digital message had been sent by a particular person? (authentication purpose).
  • Characteristics of public-key cryptosystem:
    • It is computationally infeasible to determine the decryption key given only knowledge of the cryptographic algorithm and the encryption key.
    • In RSA, either of the two related keys can be used for encryption, with the other used for decryption.

  • 261-Essential steps for the encryption process.

  • The disadvantage of this last approach is that the public-key algorithm, which is complex, must be exercised four times rather than twice in each communication.
  • Classification of public key cryptosystem:
    • Encryption/Decryption.
    • Digital Signature.
    • Key Exchange.

  • Requirements for public-key cryptography:
    • It is computationally easy for a party B to generate a pair (public key PUb, private key PRb).
    • It is computationally easy for a sender A, knowing the public key and the message to be encrypted, M, to generate the corresponding ciphertext: C = E(PUb, M).
    • It is computationally easy for the receiver B to decrypt the resulting ciphertext using the private key to recover the original message: M = D(PRb, C) = D[PRb, E(PUb, M)].
    • It is computationally infeasible for an adversary, knowing the public key, PUb, to determine the private key, PRb.
    • It is computationally infeasible for an adversary, knowing the public key, PUb, and a ciphertext, C, to recover the original message, M.
    • The two keys can be applied in either order: M = D[PUb, E(PRb, M)] = D[PRb, E(PUb, M)].
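The either-order property in the last requirement can be checked directly with the classic toy parameters p = 61, q = 53 (illustrative only; real RSA moduli are 2048+ bits):

```python
# RSA: the two keys can be applied in either order.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))       # private exponent (Python 3.8+)
M = 65
assert pow(pow(M, e, n), d, n) == M     # D(PRb, E(PUb, M)) = M
assert pow(pow(M, d, n), e, n) == M     # D(PUb, E(PRb, M)) = M
```

This symmetry is what lets the same RSA keypair serve both encryption (apply the public key first) and digital signatures (apply the private key first).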
  • A

Multilayer Perceptron

  • Multilayer perception stands for a neural network with one or more hidden layer.
  • Properties of multilayer neural networks:
    • The model of each neuron in the network includes a nonlinear activation function that’s differentiable.
    • Network contains one or more hidden layer.
    • Network exhibits a high degree of connectivity through its synaptic weights.
  • Common deficiencies in multilayer neural networks:
    • Theoretical analysis of MLNN is difficult to undertake.
      • This comes from the nonlinearity and high connectivity of the network.
    • Harder efforts are required for visualizing the learning process.
      • This comes from the existence of several layers in the network.
  • Back propagation algorithm is used to train MLNN. The training proceeds in two phases:
    • In the forward phase, the synaptic weights of the network are fixed and the input signal is propagated through the network layer by layer until it reaches the output.
    • In the backward phase, an error signal is produced by comparing the output of the network with a desired response. The resulting error signal is propagated through the network, layer by layer but the propagation is performed in the
      backward direction. In this phase successive adjustments are applied to the synaptic weights of the network.
  • The term “back propagation” appeared after 1985 when the term was popularized through the publication of the book “Parallel Distribution Processing” by Rumelhard and McClelland.

  • Two kinds of signals exist in MLP (Multilayer Perceptron):
    • Function Signals (Input Signal):
      • It’s an input signal that comes in at the input end of the network, propagates forward (neuron by neuron) through the network and emerges at the output end of the network as output signal.
      • We called it Function Signals because:
        • It’s presumed to perform a useful function at the output of the network.
        • The neuron’s signal is calculated as a function of the input signal(s) and associated weights.
    • Error Signals:
      • It originates at an output neuron of the network and propagates backward (layer by layer) through the network.
      • We called it Error Signal because:
        • It’s computation by every neuron of the network involves an error-dependent function.
  • Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
    • Computation of function signal, which is expressed as a continuous nonlinear function of the input signal and synaptic weights associated with that neuron.
    • Computation of an estimate of the gradient vector which is needed for the backward pass through the network.
      • Gradient vector: the gradients of the error surface with respect to the connected weights of the inputs of a neuron.
  • Function of Hidden Neurons:
    • Hidden neurons act as feature detector.
    • The hidden neurons are able to discover the salient features that characterize training data.
    • They do so by performing a nonlinear transformation on the input data into a new space called feature space
    • In feature space pattern classification task is more simplified and the classes are more separated.
    • This function is the main difference between Rosenblatt’s perceptron and Multilayer Neural Networks.
  • Credit Assignment Problem:
    • Credit-Assignment problem is the problem of assigning credit or blame for overall outcomes to each of the internal decisions made by the hidden computational units of the learning system. Because as we know those decisions are responsible for the overall outcomes in the first place.
    • Error-correction learning algorithm is not suitable for resolving the credit-assignment problem for MLNN because we can’t just judge on the output neurons where hidden layers play a big role in its decision.
    • Back propagation algorithm is able to solve the credit-assignment problem in an elegant manner.

     

    Batch Learning

  • Before we start in describing the algorithm you want to introduce some equations that are found in page 157.
  • Batch Learning is a supervised learning algorithm. The learning algorithm is performed after the presentation of all the N examples in the training samples that constitutes one epoch of training.
  • Adjustments to the synaptic weights are made on an epoch-by-epoch basis.
  • With method of gradient descent used to perform training we’ve these 2 advantages:
    • Accurate estimation.
    • Parallelization of learning process.
  • From practical perspective, batch learning suffers from the storage requirements.
  • In statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well studied for solving nonlinear regression problems.

Online Learning

  • Online method of supervised learning, adjustments to the synaptic weights of the multilayer perceptron is performed example-by-example basis. The cost function to be minimized is therefore the total instantaneous error energy.
  • Such algorithm is not suitable for parallelization of the learning process.
  • Sometimes online learning is called stochastic method.
  • Advantages of online learning:
    • This stochasticity has the desirable effect of making it less likely for the learning process to be trapped in a local minimum.
    • Moreover, online learning requires less storage than batch learning.
    • Also, if the training data is redundant, the online learning benefits from this redundancy to improve its learning.
    • Finally, in online learning you are able to track small changes in training data especially if the data environment is non-stationary.
    • It’s simple to implement.
    • Provides effective solutions to large –scale and difficult classification problems.

The Back Propagation Algorithm

  • First you should read the mathematical derivation of the algorithm in pages 195-161.
  • The key factor involved in the calculation of the weight adjustment is the error signal at the output neuron j. As we see the credit-assignment problem arises here. In this context we may identify two distinct cases:
    • Case #1: Neuron j is an output node:
      • The error signal is supplied to the neuron by its own from equation:
        • Where .
    • Case #2: Neuron j is a hidden node:
      • When a neuron j is located in a hidden layer of the network, there’s no specified desired response for that neuron.
      • Accordingly, the error signal for a hidden neuron would have to be determined recursively and working backwards in terms of the error signals of all the neurons to which that hidden neuron connected.
      • The final back-propagation formula for the local gradient
        • Where k represents the number of neurons that are connected to hidden neuron j.
      • To know the derivation kindly refer to page 162-163.
  • As a summary the correction is applied to the synaptic weight connecting neuron I to j is defined by:
  • Any activation function that is used in multilayer neural networks should be continuous.
  • The most commonly used activation function is a sigmoidal nonlinearity, two forms of which are described here:
    • Logistic function: φ_j(v_j(n)) = 1 / (1 + exp(−a v_j(n))), a > 0.
    • Hyperbolic tangent function: φ_j(v_j(n)) = a tanh(b v_j(n)), a, b > 0.
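A minimal sketch of the two sigmoidal forms; the slope constants used here (a = 1 for the logistic, a = 1.7159 and b = 2/3 for the hyperbolic tangent) are a commonly recommended choice, not values taken from the text:

```python
import math

def logistic(v, a=1.0):
    """φ(v) = 1 / (1 + exp(−a·v)); output range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a * v))

def tanh_act(v, a=1.7159, b=2.0 / 3.0):
    """φ(v) = a·tanh(b·v); output range (−a, a)."""
    return a * math.tanh(b * v)
```

Both are smooth and differentiable everywhere, which is what the back-propagation formulas above require of the activation function.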
  • Learning-Rate Parameter η:
    • The smaller we make η, the smaller the changes to the synaptic weights in the network will be from one iteration to the next, and the smoother the trajectory in weight space.
    • On the other hand, if we make η too large in order to speed up the rate of learning, the resulting large changes in the synaptic weights may assume such a form that the network becomes unstable (i.e. oscillatory).
    • A simple method of increasing the rate of learning while avoiding the danger of instability is to include a momentum term, as shown by:
      • Δw_ji(n) = α Δw_ji(n − 1) + η δ_j(n) y_i(n),
      • where α, usually a positive number, is called the momentum constant.
      • Also, z⁻¹ is the unit-time delay operator, with z⁻¹[Δw_ji(n)] = Δw_ji(n − 1).
      • The above equation is called the generalized delta rule; the special case α = 0 reduces it to the standard delta rule.
    • The inclusion of momentum in the back-propagation algorithm has a stabilizing effect in directions that oscillate in sign.
    • The momentum term may also have the benefit of preventing the learning process from terminating in a shallow local minimum on the error surface.
    • In reality, the learning-rate parameter is connection-dependent, such that each connection ji has its own rate η_ji.
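The effect of the momentum term in Δw_ji(n) = α Δw_ji(n−1) + η δ_j(n) y_i(n) can be sketched with assumed values of η and α: a consistently signed gradient makes the steps grow, while an oscillating gradient keeps them small, which is the stabilizing effect noted above.

```python
def momentum_step(dw_prev, delta, y, eta=0.1, alpha=0.9):
    """Generalized delta rule: Δw(n) = α·Δw(n−1) + η·δ(n)·y(n)."""
    return alpha * dw_prev + eta * delta * y

# Consistent gradient direction: successive steps reinforce each other ...
dw = 0.0
for _ in range(100):
    dw = momentum_step(dw, delta=1.0, y=1.0)
# ... and grow toward the geometric-series limit η·δ·y / (1 − α) = 0.1 / 0.1 = 1.0

# Oscillating gradient direction: successive contributions cancel, steps stay small
dw_osc, sign = 0.0, 1.0
for _ in range(100):
    dw_osc = momentum_step(dw_osc, delta=sign, y=1.0)
    sign = -sign
```

With α = 0.9 the accelerated step is roughly ten times the plain η·δ·y step, while the oscillating case stays bounded near η/(1 + α).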
  • Stopping Criteria:
    • In general, the back-propagation algorithm cannot be shown to converge, and there are no well-defined criteria for stopping its operation.
    • We may formulate a sensible convergence criterion for back-propagation learning as follows (Kramer and Sangiovanni-Vincentelli 1989):
      • The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
    • The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).
    • Another criterion:
      • The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (typically 0.1 to 1 percent per epoch).
    • Another theoretical criterion:
      • After each learning iteration, the network is tested for its generalization performance. The learning process is stopped when the generalization performance is adequate or has peaked.
  • Note that in each training epoch the samples should be presented in random order.
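The epoch loop with per-epoch shuffling and a gradient-norm stopping test can be sketched for a single linear neuron; the toy dataset, learning rate, and threshold are assumptions for illustration:

```python
import random

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]    # toy samples, true weight 2.0
w, eta, tol = 0.0, 0.05, 1e-4

for epoch in range(10_000):
    random.shuffle(data)                       # random presentation order each epoch
    for x, d in data:
        w += eta * (d - w * x) * x             # online weight update
    grad = sum((d - w * x) * x for x, d in data) / len(data)
    if abs(grad) < tol:                        # gradient-norm criterion (scalar case)
        break
```

On this toy problem the gradient threshold is reached within a few dozen epochs; the drawback mentioned above is that `grad` must be recomputed over the whole training set at every epoch.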
  • Questions:
    • What do we mean by a differentiable function?
      • It’s a function whose derivative exists at every point of its domain; it is necessarily continuous, but not every continuous function is differentiable.
    • What are the benefits of differentiable activation functions?
      • They allow us to compute the gradient of the activation function in terms of the induced local field.
    • 155- “It’s presumed to perform a useful function at the output of the network”. Does the author mean applying an activation function at the output layer, or that the output signal will lead us to perform some function?
      • I think he means the first option.
    • 155- I need a graphical and mental explanation of “the gradients of the error surface with respect to the weights connected”.
    • 158- Describe the realization of the learning-curve process.
    • 158- How does “parallelization of the learning process” arise in learning?
      • Because we can distribute the classification process across several PCs and then collect the results back.
    • 158- “In a statistical context, batch learning can be viewed as a form of statistical inference. It’s therefore well suited for solving nonlinear regression problems.” Discussion is required here.
    • 159- Why doesn’t batch learning benefit from redundancy?
    • 159- How can we track small changes in the “training data”? Should it be in the “synaptic weights”?
    • 161- What’s the local gradient?
      • It’s the quantity δ_j(n) that points to the required changes in the synaptic weights; for an output neuron it equals the error signal multiplied by the derivative of the activation function.
    • 163- I need to know the mathematical differentiation steps of equation 4.23.
    • 169- “The drawback of this criterion is that, for successful trials, learning times may be long. Also, it requires the computation of the gradient vector g(w).” What’s the drawback here in calculating g(w)?!
      • Well, I think that it’s computationally expensive to compute it each epoch.