Deep Learning via Hessian-Free Optimization (James Martens, 2010)

Summary

This paper introduces a fairly complex optimization algorithm for deep nets that uses approximate 2nd-order (curvature) information.

  • In Hessian-free optimization, the Hessian-vector product $Hv$ is approximated directly with finite differences, $Hv \approx [\nabla f(\theta + \epsilon v) - \nabla f(\theta)] / \epsilon$; this costs only one extra gradient evaluation (a code sketch follows this list)
  • The linear conjugate gradient (CG) algorithm solves for the search direction using only matrix-vector products with the curvature matrix, converging in at most $N$ iterations ($N$ is the number of parameters) in exact arithmetic; in practice HF truncates CG far earlier (a CG sketch follows this list)
  • Newton’s method is scale invariant: for a new parameterization $\hat{\theta} = A \theta$ with invertible $A$, the Newton direction becomes $\hat{p} = A p$ where $p$ is the original direction, i.e., it transforms the same way as the parameters, so the update is unchanged. Gradient descent is not, since the gradient transforms as $\hat{g} = A^{-\top} g$ (derivation after this list) - so many bad things about GD, but it’s so easy to implement..
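
A minimal sketch of the finite-difference Hessian-vector product from the first bullet; the function and argument names (`grad_fn`, `theta`, `v`) are mine, not the paper's:

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v via finite differences of the gradient.

    grad_fn(theta) returns the gradient of the objective at theta, so this
    costs exactly one extra gradient evaluation.
    """
    # Scale the step so the perturbation size doesn't depend on ||v||.
    step = eps / (np.linalg.norm(v) + 1e-12)
    return (grad_fn(theta + step * v) - grad_fn(theta)) / step
```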
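
A minimal sketch of linear CG using only matrix-vector products (again with my own names); in HF one would call it with `matvec` set to the Hessian- or Gauss-Newton-vector product and `b` set to the negative gradient:

```python
import numpy as np

def conjugate_gradient(matvec, b, max_iter=250, tol=1e-6):
    """Solve B x = b where B is only accessible through matvec(v) = B v.

    In exact arithmetic CG converges in at most N = len(b) iterations;
    in practice it is truncated much earlier.
    """
    x = np.zeros_like(b)
    r = b - matvec(x)     # residual
    d = r.copy()          # current search direction
    rr = r @ r
    for _ in range(max_iter):
        Bd = matvec(d)
        alpha = rr / (d @ Bd)       # step length along d
        x += alpha * d
        r -= alpha * Bd
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        d = r + (rr_new / rr) * d   # next B-conjugate direction
        rr = rr_new
    return x
```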
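
Derivation of the invariance claim (the proof the third bullet asks for): under $\hat{\theta} = A\theta$ the objective becomes $\hat{f}(\hat{\theta}) = f(A^{-1}\hat{\theta})$, and

```latex
\begin{align*}
\hat{g} &= \nabla_{\hat{\theta}} \hat{f} = A^{-\top} g, &
\hat{H} &= A^{-\top} H A^{-1}, \\
\hat{p}_{\text{Newton}} &= -\hat{H}^{-1}\hat{g}
  = -A H^{-1} A^{\top} A^{-\top} g = A\,(-H^{-1} g) = A\,p, \\
\hat{p}_{\text{GD}} &= -\hat{g} = -A^{-\top} g \neq A\,(-g) \text{ in general.}
\end{align*}
```

So the Newton step transforms exactly like the parameters (the update is equivalent in either coordinate system), while the gradient-descent step only does so for special $A$ (e.g., orthogonal).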

Considerations when applying this technique

  • Need to use an adaptive damping parameter $\lambda$ because the relative scale of the curvature changes over the course of optimization and the damped matrix $B = H(\theta) + \lambda I$ must remain positive (semi)definite. The recommended heuristic, a Levenberg-Marquardt-style adjustment, is given in Section 4.1 (sketched after this list)
  • The Gauss-Newton matrix $G$ can produce better search directions than $H$ (it is guaranteed positive semidefinite); see this blog post for a summary, and the definition sketched after this list
  • Compute the gradient on the entire dataset, but use minibatches to compute the Hessian-vector products. SGD requires tens of thousands of iterations versus ~200 for HF
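
A sketch of the Section 4.1 damping heuristic as I understand it: a Levenberg-Marquardt-style rule driven by the reduction ratio $\rho$ (the actual decrease in $f$ divided by the decrease predicted by the quadratic model). The 3/2 and 2/3 constants are the ones I recall from the paper; treat them as indicative:

```python
def update_damping(lmbda, rho):
    """Levenberg-Marquardt-style adjustment of the damping parameter lambda.

    rho = (f(theta + p) - f(theta)) / q_theta(p), the ratio of the actual
    reduction to the reduction predicted by the quadratic model q_theta.
    """
    if rho > 0.75:          # quadratic model is trustworthy: relax damping
        lmbda *= 2.0 / 3.0
    elif rho < 0.25:        # model over-promised: damp more heavily
        lmbda *= 3.0 / 2.0
    return lmbda
```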
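
For reference, the standard form of the Gauss-Newton matrix (not spelled out in these notes): with network outputs $z = f(\theta)$ and loss $L(z)$,

```latex
\begin{align*}
G &= J^{\top} H_L\, J, \qquad
J = \frac{\partial f}{\partial \theta}, \quad
H_L = \frac{\partial^2 L}{\partial z^2}, \\
G v &= J^{\top} \big( H_L\, (J v) \big),
\end{align*}
```

which is positive semidefinite whenever $H_L$ is, unlike the full Hessian $H$; the product $Gv$ needs only one Jacobian-vector (forward) and one transposed-Jacobian-vector (backward) pass.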