### Introduction

- In deep learning, the **loss** tells us how poorly the model is performing at the current instant.
- We need to use this loss to **train** our network so that it performs better.
- Essentially, we need to take the loss and try to **minimize** it, because a lower loss means our model is going to perform better.
- The process of minimizing (or maximizing) any mathematical expression is called **optimization**.
- Optimizers are algorithms or methods used to change the attributes of the neural network, such as **weights** and **learning rate**, to reduce the losses.
- Optimizers are used to solve optimization problems by minimizing the loss function.

We’ll compare the following types of optimizers:

- Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini Batch Stochastic Gradient Descent (MB-SGD)
- SGD with momentum
- Nesterov Accelerated Gradient (NAG)
- Adaptive Gradient (AdaGrad)
- AdaDelta
- RMSprop
- Adaptive Moment Estimation (Adam)

### 1. **Gradient Descent**

- The most basic and most widely used optimizer.
- Used in backpropagation.
- Depends on the first-order derivative of the loss function.
- Update rule (a sketch follows): **θ = θ − α⋅∇J(θ)**
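As a concrete illustration, here is a minimal NumPy sketch of one full-batch update; the function name `gradient_descent_step` and the toy quadratic loss are illustrative, not from the original.

```python
import numpy as np

def gradient_descent_step(theta, grad_fn, lr=0.1):
    """One full-batch gradient descent update: theta <- theta - lr * grad J(theta)."""
    return theta - lr * grad_fn(theta)

# Example: minimise J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda t: 2 * t, lr=0.1)
print(theta)  # approaches [0, 0]
```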

**Advantages**:

- Easy computation.
- Easy to implement.
- Easy to understand.

**Disadvantages**:

- May get trapped at a local minimum.
- Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is too large it may take very long to converge to the minima.
- Requires a large amount of memory to calculate the gradient on the whole dataset.

### 2. **Stochastic Gradient Descent** (SGD)

- An attempt to solve the slow convergence of simple GD.
- The loss is calculated and the parameters are updated after every training observation (1,000 updates for 1,000 rows, instead of 1 in gradient descent).
- Update rule (a sketch follows): **θ = θ − α⋅∇J(θ; x(i); y(i))**, where {x(i), y(i)} are the training examples.
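A minimal sketch of one SGD epoch, assuming a hypothetical `grad_fn(theta, x_i, y_i)` that returns the gradient of the loss on a single example:

```python
def sgd_epoch(theta, grad_fn, X, y, lr=0.01):
    """One epoch of SGD: update theta after every single training example."""
    for x_i, y_i in zip(X, y):
        theta = theta - lr * grad_fn(theta, x_i, y_i)   # gradient on one example
    return theta
```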

**Advantages**:

- Frequent updates of model parameters, hence it converges in less time.
- Requires less memory, as there is no need to accumulate the loss over the whole dataset before each update.
- May find new minima.

**Disadvantages**:

- High variance in model parameters.
- May overshoot even after achieving the global minima.
- To get the same convergence as gradient descent, the learning rate needs to be slowly reduced.

### 3. Mini Batch Stochastic Gradient Descent (MB-SGD)

- An attempt to solve the slow convergence of simple GD and the high variance of SGD.
- The dataset is divided into batches, and after every batch the parameters are updated.
- Update rule (a sketch follows): **θ = θ − α⋅∇J(θ; B(i))**, where {B(i)} are the batches of training examples.
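A minimal sketch of one mini-batch epoch; the hypothetical `grad_fn`, the default `batch_size=32`, and the slicing scheme are illustrative assumptions:

```python
def minibatch_sgd_epoch(theta, grad_fn, X, y, lr=0.01, batch_size=32):
    """One epoch of mini-batch SGD: update theta after every batch B(i)."""
    for start in range(0, len(X), batch_size):
        x_b, y_b = X[start:start + batch_size], y[start:start + batch_size]
        theta = theta - lr * grad_fn(theta, x_b, y_b)   # gradient over the batch
    return theta
```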

**Advantages**:

- Considered the best of the gradient-descent-based algorithms.
- Frequently updates the model parameters and also has less variance.
- Requires a medium amount of memory.

**All types of Gradient Descent have some challenges:**

- Choosing an optimum value of the learning rate. If the learning rate is too small, then gradient descent may take ages to converge.
- A constant learning rate is used for all the parameters, but there may be some parameters which we do not want to change at the same rate.
- May get trapped at local minima.

### 4. **SGD with momentum**

- Another attempt to solve the high variance of SGD.
- One new hyperparameter, momentum (**γ**), is introduced.
- It accelerates convergence towards the relevant direction and reduces fluctuation in the irrelevant direction.
- Update rule (a sketch follows): **V(t) = γV(t−1) + α⋅∇J(θ)**, and the weights are updated by **θ = θ − V(t)**.
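A minimal sketch of one momentum update, assuming `gamma=0.9` as a typical default; the names are illustrative:

```python
def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    """V(t) = gamma * V(t-1) + lr * grad, then theta = theta - V(t)."""
    v = gamma * v + lr * grad
    theta = theta - v
    return theta, v
```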

**Advantages**:

- Reduces the oscillations and high variance of the parameters.
- Converges faster than gradient descent.

**Disadvantages**:

- One more hyper-parameter is added which needs to be selected manually and accurately.
- If the momentum is too high, the algorithm may miss the minima and continue to move past them.

### 5. Nesterov Accelerated Gradient

- Invented to solve the issue of missing the minima (when the momentum is too high) in momentum-based optimization.
- In NAG, we calculate the cost based on this future parameter rather than the current one.
- Since **γV(t−1)** will be used to modify the weights, **θ − γV(t−1)** approximately tells us the future location; it is a look-ahead method.
- Update rule (a sketch follows): **V(t) = γV(t−1) + α⋅∇J(θ − γV(t−1))**, and then the parameters are updated using **θ = θ − V(t)**.
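A minimal sketch of one NAG update; the hypothetical `grad_fn` is evaluated at the look-ahead point rather than at the current parameters:

```python
def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Evaluate the gradient at the look-ahead point theta - gamma * V(t-1)."""
    lookahead = theta - gamma * v                  # approximate future position
    v = gamma * v + lr * grad_fn(lookahead)        # V(t)
    theta = theta - v
    return theta, v
```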

**Advantages**:

- Does not miss the local minima.
- Slows down when the minima are near.

**Disadvantages**:

- Still, the hyperparameter needs to be selected manually.

### 6. Adaptive Gradient (AdaGrad)

- An attempt to solve the problem of tuning the learning rate manually.
- In AdaGrad, the learning rate **η** is changed for each parameter **θ(i)** at every time step **t**, based on the previous gradients calculated for that parameter.
- We store the sum of the squares of the gradients w.r.t. **θ(i)** up to time step **t**, while **ϵ** is a smoothing term that avoids division by zero.
- It makes big updates for less frequent parameters and small steps for frequent parameters (a sketch follows).
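A minimal NumPy sketch of one AdaGrad update; the state variable `g_sum` holds the running sum of squared gradients, and the names and defaults are illustrative:

```python
import numpy as np

def adagrad_step(theta, g_sum, grad, lr=0.01, eps=1e-8):
    """Scale each parameter's step by 1 / sqrt(sum of its past squared gradients)."""
    g_sum = g_sum + grad ** 2                           # accumulated squared gradients
    theta = theta - lr * grad / (np.sqrt(g_sum) + eps)  # per-parameter learning rate
    return theta, g_sum
```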

**Advantages**:

- Learning rate changes for each training parameter.
- Don’t need to manually tune the learning rate.
- Able to train on sparse data.

**Disadvantages**:

- Computationally expensive, as the squared gradients have to be accumulated and stored for every parameter.
- The learning rate is always decreasing, which results in slow training.

### 7. AdaDelta

- It is an extension of **AdaGrad** that tends to remove its *decaying learning rate* problem.
- Instead of accumulating all previously squared gradients, *AdaDelta* limits the window of accumulated past gradients to some fixed size **w**.
- An exponentially decaying moving average of squared gradients is used rather than the sum of all past gradients (a sketch follows).
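A minimal NumPy sketch of one AdaDelta update, assuming decay rate `rho=0.9`; `eg2` and `edx2` are the decaying averages of squared gradients and squared updates:

```python
import numpy as np

def adadelta_step(theta, eg2, edx2, grad, rho=0.9, eps=1e-6):
    """Use decaying averages of squared gradients and squared updates instead of a full sum."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                    # E[g^2](t)
    dx = -(np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps)) * grad    # update, no global learning rate
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                    # E[dx^2](t)
    return theta + dx, eg2, edx2
```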

**Advantages**:

- Now the learning rate does not decay and the training does not stop.

**Disadvantages**:

- Computationally expensive.

### 8. RMSprop

- RMSprop is in fact identical to the first update vector of AdaDelta derived above.
- RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting **γ** to 0.9, while a good default value for the learning rate **η** is 0.001 (a sketch follows).
- RMSprop and AdaDelta were developed independently around the same time, both stemming from the need to resolve AdaGrad’s radically diminishing learning rates.
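A minimal NumPy sketch of one RMSprop update, using the suggested defaults **γ** = 0.9 and **η** = 0.001; the names are illustrative:

```python
import numpy as np

def rmsprop_step(theta, eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """Divide the learning rate by an exponentially decaying average of squared gradients."""
    eg2 = gamma * eg2 + (1 - gamma) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(eg2) + eps)
    return theta, eg2
```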

### 9. Adaptive Moment Estimation (Adam)

- Adam (Adaptive Moment Estimation) works with estimates of the first and second moments of the gradients.
- In addition to storing an exponentially decaying average of past squared gradients like **AdaDelta**, *Adam* also keeps track of:
    - first moment: the **mean** of past gradients, **M(t)**
    - second moment: the uncentered **variance** of the gradients, **V(t)**

To update the parameters:
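The update equations are the standard Adam ones; the following is a minimal NumPy sketch assuming the usual defaults β1 = 0.9, β2 = 0.999, ε = 1e−8 (the function name and signature are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Track the mean M(t) and uncentered variance V(t) of the gradients,
    bias-correct both, then take a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad            # M(t): first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # V(t): second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```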

**Advantages**:

- The method is fast and converges rapidly.
- Rectifies the vanishing learning rate and high variance problems.

**Disadvantages**:

- Computationally costly.

### Conclusions

Comparing the behaviour of the optimizers above, we can draw the following points:

- The SGD algorithm tends to get stuck at saddle points, so plain SGD is mainly suitable for shallow networks.
- All the other algorithms eventually converge one after the other, with AdaDelta the fastest, followed by the momentum-based algorithms.
- The AdaGrad and AdaDelta algorithms can be used for sparse data.
- Momentum and NAG work well for most cases but are slower.
- Adam is observed to be the fastest algorithm to converge to the minima.
- Adam is considered the best algorithm amongst all the algorithms discussed above.