During neural network training, the sharpness of the Hessian matrix of the training loss rises until training is on the edge of stability. As a result, even non-stochastic gradient descent does not accurately model the underlying dynamical system defined by the gradient flow of the training loss. We treat neural network training as a system of stiff ordinary differential equations and use an exponential Euler solver to train the network without entering the edge of stability. We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned, so that a small change in the preactivations at the front of the network can cause a large change in the outputs at the back of the network. We further demonstrate that the degree of layerwise Jacobian alignment scales with the size of the dataset by a power law with a coefficient of determination between 0.74 and 0.97.
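To illustrate why an exponential integrator can remain stable where gradient descent does not, the following is a minimal sketch on a toy quadratic loss, not the implementation used in this work. It assumes a loss of the form 0.5 * theta^T H theta with a fixed symmetric Hessian H, so that the gradient flow is linear and the exponential Euler step can be computed from an eigendecomposition. The function names, the Hessian, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np


def gradient_descent_step(theta, hessian, step):
    """One explicit Euler (gradient descent) step on L = 0.5 * theta^T H theta."""
    return theta - step * (hessian @ theta)


def exponential_euler_step(theta, hessian, step):
    """One exponential Euler step on the gradient flow d(theta)/dt = -H theta.

    Update: theta + step * phi1(-step * H) @ (-H theta),
    where phi1(z) = (exp(z) - 1) / z. For this linear flow the update
    reduces to theta <- exp(-step * H) theta, the exact flow map.
    Assumes `hessian` is symmetric so that eigh applies.
    """
    eigvals, eigvecs = np.linalg.eigh(hessian)
    z = -step * eigvals
    # Evaluate phi1 with its limit phi1(0) = 1 to avoid dividing by zero.
    nonzero = np.abs(z) > 1e-12
    safe_z = np.where(nonzero, z, 1.0)
    phi1 = np.where(nonzero, np.expm1(z) / safe_z, 1.0)
    grad = hessian @ theta
    return theta - step * (eigvecs @ (phi1 * (eigvecs.T @ grad)))


def loss(theta, hessian):
    return 0.5 * theta @ hessian @ theta


# Hypothetical stiff problem: sharpness (largest Hessian eigenvalue) is 100.
hessian = np.diag([100.0, 1.0])
step = 0.03  # step * sharpness = 3 > 2, so gradient descent is unstable
theta_init = np.array([1.0, 1.0])

theta_gd = theta_init.copy()
theta_ee = theta_init.copy()
for _ in range(50):
    theta_gd = gradient_descent_step(theta_gd, hessian, step)
    theta_ee = exponential_euler_step(theta_ee, hessian, step)

loss_init = loss(theta_init, hessian)
loss_gd = loss(theta_gd, hessian)
loss_ee = loss(theta_ee, hessian)
```

In this toy setting the gradient descent iterate along the sharp direction is multiplied by 1 - 0.03 * 100 = -2 at every step and diverges, while the exponential Euler iterate follows the gradient flow exactly. For a real network the Hessian changes during training and is too large to diagonalize, so a practical solver would need some approximation of the matrix function; that part is outside the scope of this sketch.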
Bio: Mark Lowell is a mathematician specializing in artificial intelligence. Since graduating from the University of Massachusetts Amherst with a Ph.D. in mathematics, he has worked on AI for computer vision at the Department of Defense.