# Introduction

**Logistic Classifier (Linear Classifier)**: Takes an input (image pixels) and applies a linear function to them to make predictions.

**Linear Function**: *WX + b = Y*, where *W* is the weight matrix, *X* the input matrix, *b* the bias and *Y* the output matrix. *W* and *b* are the values learned during training.
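
As a minimal sketch (all sizes and values here are made up for illustration: 4 input features, 3 classes), the linear function is just a matrix multiply plus a bias:

```python
import numpy as np

# Hypothetical example: 4 input features (e.g. flattened pixel values), 3 classes.
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5,  1.3, 2.1, 0.0],
              [0.0,  0.3, 0.2, -0.3]])   # weight matrix, shape (3, 4)
b = np.array([1.1, 3.2, -1.2])           # bias, one value per class
x = np.array([0.5, 0.2, 0.1, 0.9])       # input vector

y = W @ x + b                            # output scores, one per class
```

Training adjusts `W` and `b`; the multiply itself never changes.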

**Softmax** is a way to transform scores into probabilities. Increasing the size of the training data set tends to make the classifier more confident, discriminating individual classes more clearly.
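
A common way to implement softmax (subtracting the maximum score first is a numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(scores):
    """Transform raw scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))  # shift by max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # highest score -> highest probability
```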

**One-Hot Encoding** is when the correct class has a probability of 1.0 and all other classes have 0 probability. This technique is not efficient for a large number of classes, but its benefit is that measuring **cross-entropy** is easy.
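
A sketch of both ideas together, with made-up numbers: the one-hot label zeroes out every term of the cross-entropy except the correct class's.

```python
import numpy as np

def cross_entropy(probs, one_hot):
    # D(S, L) = -sum_i L_i * log(S_i); only the correct class contributes
    return -np.sum(one_hot * np.log(probs))

one_hot = np.array([0.0, 1.0, 0.0])   # correct class is index 1
probs   = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output
loss = cross_entropy(probs, one_hot)  # reduces to -log(0.7)
```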

**Multinomial Logistic Classification** is the setting where the input is fed to a linear model, whose scores go through a softmax and are then compared against one-hot encoded labels.

**Gradient Descent** is the technique used for finding the parameter values that minimize the loss function. A correct implementation updates Theta0 and Theta1 simultaneously, not one after the other.
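
A minimal sketch of the simultaneous update on a toy 1-D linear regression (the data and learning rate are made up):

```python
# Toy data following y = 2x; gradient descent should drive theta1 toward 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta0, theta1 = 0.0, 0.0
alpha = 0.1  # learning rate

for _ in range(1000):
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # Simultaneous update: both gradients are computed from the OLD values
    # before either parameter is overwritten.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
```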

**Learning Rate** is the variable that controls how big a step gradient descent takes downhill.

At a **Local Minimum** the slope is zero, so the derivative term is 0 and the update leaves theta unchanged.

**Training Dataset** is the data set that the classifier uses to train on.

**Validation Dataset** is the data set used for validation; it is also possible to keep tuning against this dataset until the classifier appears to generalize.

**Test Dataset** is the data set used for real testing. Data samples that the classifier never saw before.

**Rule of 30** is a heuristic for judging whether an improvement in the classification model is significant or not.

Models usually use more than 30K examples for validation to assure significance. At that size, any improvement greater than 0.1% (i.e. at least 30 examples) is significant, which makes measuring improvements easy.
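
The arithmetic behind the heuristic:

```python
# "Rule of 30": a change is significant when at least 30 validation examples
# flip from incorrect to correct.  With a 30,000-example validation set:
val_size = 30_000
min_significant_change = 30 / val_size  # 0.001, i.e. a 0.1% accuracy change
```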

The main problem with Gradient Descent is that the gradient is computed over every example in the training data set, and the derivative term takes a long time from a processing standpoint. This becomes a scalability problem when the training data set is huge.

To overcome the scalability and performance issues of gradient descent, a small random sample of the training data set is used to calculate an average loss, together with a small learning rate. That technique is called **Stochastic Gradient Descent (SGD)**.
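
A minimal SGD sketch on a made-up 1-D regression problem (the batch size, learning rate and data are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x plus a little noise; SGD should drive w toward 3.
X = rng.uniform(-1.0, 1.0, size=1000)
Y = 3.0 * X + rng.normal(0.0, 0.1, size=1000)

w = 0.0
lr = 0.05        # small learning rate, as the notes suggest
batch_size = 32  # small random sample instead of the full data set

for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # random mini-batch
    xb, yb = X[idx], Y[idx]
    grad = np.mean(2.0 * (w * xb - yb) * xb)  # gradient of the batch's average loss
    w -= lr * grad
```

Each step is cheap because only 32 examples are touched, at the price of a noisier gradient estimate.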

Using a running average of the gradients, instead of relying solely on the direction from the current batch, helps reduce the number of steps needed to converge. This is called the **momentum technique**.
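
A sketch of the momentum update (the hyperparameter values are typical but arbitrary), applied here to minimizing f(w) = w², whose gradient is 2w:

```python
def momentum_step(w, grad, velocity, lr=0.05, m=0.9):
    """One momentum step: accumulate a running average of gradients
    in `velocity` and move along it instead of the raw gradient."""
    velocity = m * velocity + grad
    return w - lr * velocity, velocity

w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, 2.0 * w, v)  # gradient of w^2 is 2w
```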

Using a lower learning rate usually tends to yield a more accurate classifier.

Stochastic Gradient Descent Parameters:

- Initial Learning Rate: the initial value for learning rate
- Learning Rate Decay: how much the learning rate should decrease over time
- Momentum: the weight given to the running average of gradients
- Batch Size: the random sample used by SGD
- Weight Initialization

**ADAGRAD** is a modification of SGD that adapts the learning rate automatically, eliminating a couple of SGD hyperparameters and requiring only the batch size and weight initialization.

# Deep Neural Networks

**Rectified Linear Units** (ReLU): a non-linear function that is y = x for x > 0 and 0 otherwise.
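
The definition in one line of Python:

```python
def relu(x):
    """Rectified linear unit: identity for positive inputs, zero otherwise."""
    return x if x > 0 else 0.0
```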

**Backpropagation** is used to calculate the gradient of the cost function to update the weights.

**Deep Learning** is about adding more hidden layers to the model. This helps because of:

- Parameter efficiency: uses fewer parameters by going deeper instead of wider.
- Model abstraction: going deeper results in layers that learn progressively more general and useful features, like a geometric shape instead of a line.

Why was deep learning not used before?

- Deep learning requires large datasets, which have only recently become available
- Better regularization techniques

**Regularization** is a technique used to prevent overfitting.

Avoid overfitting:

- Early termination: stop training, based on the validation set, as soon as performance stops improving
- Using regularization techniques
- Adding an L2 penalty on the weights vector
- Using Dropout

**Dropout** (by Geoffrey Hinton) is a regularization technique that prevents overfitting by stochastically turning off activations in the hidden layers. This forces the network not to rely on any specific activation during training, so it learns redundant representations for everything across the input data.
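
A sketch of "inverted" dropout, a common formulation in which surviving activations are scaled up by 1/keep_prob so the expected activation is unchanged and nothing special is needed at test time (the keep probability of 0.5 is a typical but arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob=0.5, training=True):
    """Randomly zero activations during training; scale the survivors
    by 1/keep_prob so the expected value stays the same."""
    if not training:
        return activations  # no-op at test time
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones(10)
dropped = dropout(a)  # each entry is either 0.0 or 2.0
```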

Redundant representations result in a network that needs some consensus technique (such as averaging the activations) to decide on the result.

# Convolutional Neural Network

If the structure of the input is known, then there is a chance to reduce the NN's complexity by relying on that fact. For example, if the input is a colored letter, the NN can be simplified by using grayscale instead of adding the complexity of color to the NN.

Features that do not change across time or space are called **statistical invariants**. For example, **translation invariance** is used to simplify a NN by disregarding the location of an object in an image.

**Weight sharing** is used to simplify learning in a NN by utilizing statistical invariance.

CNNs (by Yann LeCun) are NNs that share their parameters across space.

**Convolution** is the process of sliding a filter (like an image filter) across a given image. This results in a matrix whose width, height and depth can differ from the original image's.

The layer’s parameters consist of a set of learnable filters (or **kernels**/**patches**), which have a small receptive field, but extend through the full depth of the input volume.

An RGB image has three **feature maps** (channels), each corresponding to the intensity of one color.

**Stride** is the number of pixels the filter is shifted each time it is moved.

**Valid padding** is a way to keep the filter inside the image, so the output shrinks relative to the input by the size of the filter.

**Same padding** is a way to pad zeros at the edges of the image so the output keeps the same size as the input.
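
The output sizes implied by the two padding schemes can be written down directly (these formulas follow the common TensorFlow-style convention, which is an assumption here):

```python
import math

def conv_out_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "valid":
        return (in_size - filter_size) // stride + 1  # filter stays inside the image
    if padding == "same":
        return math.ceil(in_size / stride)            # zero-pad to preserve size
    raise ValueError(f"unknown padding: {padding}")
```

For a 28-pixel-wide image and a 3-pixel filter: valid padding with stride 1 gives 26, same padding gives 28, and same padding with stride 2 gives 14.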

Striding reduces the feature map size, but it is very aggressive and sometimes results in a loss of information. **Pooling** is a technique for combining the information in a neighborhood of the feature map instead. Two common pooling techniques are **max pooling** and **average pooling**.
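
A minimal max-pooling sketch (2×2 window with stride 2, on a single feature map whose sides are even):

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2: keep the largest value in each block."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 2, 0, 1],
                 [3, 4, 1, 0],
                 [0, 0, 5, 6],
                 [0, 0, 7, 8]])
pooled = max_pool_2x2(fmap)  # shape (2, 2): one value per 2x2 block
```

Average pooling is the same idea with `.mean(...)` in place of `.max(...)`.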

**1×1 convolution** is a simple way to reduce the depth (number of channels) of a NN.
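
Seen as code, a 1×1 convolution is just a per-pixel linear map across channels (the sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

fmap = rng.normal(size=(8, 8, 64))  # H x W x 64-channel feature map
W = rng.normal(size=(64, 16))       # hypothetical 1x1 kernel: 64 -> 16 channels

# No spatial neighbours are involved: every pixel's channel vector is
# multiplied by the same small matrix, reducing the depth from 64 to 16.
out = fmap @ W
```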

**Recurrent Neural Network** (RNN) is a NN used for temporal classification/prediction problems.

**Long Short-Term Memory** (LSTM) is a variant of the RNN that is widely used.

As far as I understand, Logistic Regression and linear classifiers can produce more complex models, not just linear functions; the name "linear" comes from the fact that each classifier separates two classes (+1 and -1) with a linear boundary, and this limitation can be tackled by combining multiple linear classifiers.

Linear classifiers can solve lots of problems; the naming simply reflects the fact that the model uses a linear function of the input when learning the weights and bias.