Adaptive gradient-based optimizers such as AdaGrad and Adam are among the
methods of choice in modern machine learning. These methods maintain
second-order statistics for each parameter, so the optimizer state they keep
is as large as the model itself. For very large models, this memory overhead
restricts both the size of the model that can be trained and the number of
examples in a mini-batch. We describe a novel, simple, and flexible adaptive
optimization
method with sublinear memory cost that retains the benefits of per-parameter
adaptivity while allowing for larger models and mini-batches. We give
convergence guarantees for our method and demonstrate its effectiveness in
training very large deep models.
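
As a rough illustration (not from the paper), the short NumPy sketch below shows where the per-parameter memory overhead described above comes from in an AdaGrad-style update: the second-moment accumulator has exactly the same shape as the parameter tensor, so the optimizer state alone matches the model in size. The variable names and the toy quadratic objective are hypothetical.

import numpy as np

# Hypothetical toy setup: one parameter matrix and the loss sum(params**2).
rng = np.random.default_rng(0)
params = rng.normal(size=(1000, 1000))   # model parameters
g2_sum = np.zeros_like(params)           # per-parameter second-order statistics:
                                         # same shape as params, so the optimizer
                                         # state is as large as the model itself
lr, eps = 0.1, 1e-8

for step in range(100):
    grad = 2.0 * params                  # gradient of the toy quadratic loss
    g2_sum += grad ** 2                  # accumulate squared gradients (AdaGrad-style)
    params -= lr * grad / (np.sqrt(g2_sum) + eps)

# Training state is roughly double the parameter memory under this scheme.
print(params.nbytes + g2_sum.nbytes, "bytes of state vs", params.nbytes, "bytes of model")

A method with sublinear memory cost, as described in the abstract, would keep asymptotically less state than one accumulator entry per parameter, which is what allows larger models and mini-batches to fit in memory.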