\n",
" \n",
" | Stochastic Gradient Descent (SGD) | \n",
" learning rate, max epoch | \n",
" \n",
" \n",
" - For large dataset, it can converge faster as parameters are updated more frequently.
\n",
" - Low Memory requirement as only one example is evaluated at a time.
\n",
" \n",
" | \n",
" \n",
" \n",
" - Gradient direction is very noisy and thereby might take longer to converge.
\n",
" - Frequent updates are computational expensive.
\n",
" - loses the advantages of vectorized computation.
\n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" | Batch Gradient Descent | \n",
" learning rate, max epoch | \n",
" \n",
" \n",
" - Less oscillations and, thereby, more stable gradient descent convergence
\n",
" - Vectorization increases the speed of processing
\n",
" \n",
" | \n",
" \n",
" \n",
" - More chances of getting stuck in local minima
\n",
" - Memory intensive as the full training dataset needs to keep in memory
\n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" | Min Batch Gradient Descent | \n",
" learning rate, max epoch, either number of batches or batch size | \n",
" \n",
" \n",
" - Less memory intensive
\n",
" - Computationally efficient as it takes advantages of vectorization
\n",
" - Stable convergence as compared to SGD
\n",
" \n",
" | \n",
" \n",
" \n",
" - Convergence is noiser than Batch Gradient Descent
\n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" | Momenutm Gradient Descent | \n",
" learning rate, max epoch, either number of batches or batch size, momentum (=0.9 by default) | \n",
" \n",
" \n",
" - has all the advantages of mini-batch but is usually faster to converge
\n",
" \n",
" | \n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" | RMSProp | \n",
" learning rate, max epoch, either number of batches or batch size, momentum (=0.9 by default), \\epsilon(=1.0e-6 by default) | \n",
" \n",
" \n",
" - has all the advantages of momentum gradient descent but is usually faster to converge
\n",
" \n",
" | \n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
"
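The table only names each optimizer's hyperparameters, so a minimal sketch of how the corresponding update rules could look is given below. It assumes a toy linear-regression problem with a mean-squared-error loss; the data, the `train` helper, and its default values are illustrative assumptions, not the notebook's own implementation. Matching the table's parameter list, `momentum` also serves as RMSProp's decay rate for the moving average of squared gradients.

```python
# A minimal sketch (assumed example, not the notebook's own code) of the update
# rules compared in the table above, applied to linear regression with an MSE loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                  # toy design matrix (assumption)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)    # noisy targets

def gradient(w, Xb, yb):
    """Gradient of the MSE loss 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def train(method="minibatch", lr=0.05, max_epochs=50, batch_size=32,
          momentum=0.9, eps=1.0e-6):
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)        # velocity (momentum) or squared-gradient average (RMSProp)
    n = len(y)
    # batch_size = 1 gives SGD, batch_size = n gives batch gradient descent
    if method == "sgd":
        batch_size = 1
    elif method == "batch":
        batch_size = n
    for _ in range(max_epochs):
        order = rng.permutation(n)              # reshuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = gradient(w, X[idx], y[idx])
            if method == "momentum":
                v = momentum * v + g            # accumulate a velocity term
                w -= lr * v
            elif method == "rmsprop":
                v = momentum * v + (1 - momentum) * g**2   # decaying average of squared gradients
                w -= lr * g / (np.sqrt(v) + eps)
            else:                               # sgd / batch / minibatch: plain update
                w -= lr * g
    return w

for m in ["sgd", "batch", "minibatch", "momentum", "rmsprop"]:
    print(m, np.round(train(method=m), 3))
```

Because SGD and batch gradient descent are just the two extremes of the batch size, the three plain variants share a single code path here: setting `batch_size=1` reduces the loop to SGD, while `batch_size=len(y)` turns it into batch gradient descent.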