 916 views
 937 downloads
Nonuniform Analysis for Nonconvex Optimization in Machine Learning

 Author / Creator
 Mei, Jincheng

The optimization of nonconvex objective functions is a topic of central interest in machine learning. Remarkably, it has recently been shown that simple gradientbased optimization can achieve globally optimal solutions in important nonconvex problems that arise in machine learning, including policy gradient optimization in reinforcement learning (RL), generalized linear model training in supervised learning (SL), and overparameterized neural network training in deep learning. However, previous work generally relies on uniform
properties of the optimization landscape, ignoring relevant problem structure, which limits both the applicability and strength of the theoretical results that can be obtained.
In this thesis, motivated by fundamental problems in RL and SL, I investigate a nonuniform analysis for nonconvex optimization.
Chapter 2 studies policy gradient optimization (PG) in RL and resolves three open problems in the literature by introducing a new analysis tool called nonuniform Lojasiewicz inequality (NL). In particular, this chapter shows that (i) PG optimization with a softmax parameterization converges to a globally optimal policy at a O(1/t) rate; (ii) adding entropy regularization improves the convergence rate of PG to O(e^{−c t}) (where c > 0) to a regularized optimal policy; and (iii) an Omega(1/t) lower bound can be established on the worst case convergence of softmax PG. The separation of rates is further explained using the concept of the NL degree. These results provide a theoretical explanation of the optimization advantage of entropy regularization.
Next, Chapter 3 reconsiders a common policy parameterization used in machine learning: the softmax transform. Two negative results are established for using the softmax transform in gradient based optimization. In particular, this chapter shows that (i) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization (“softmax
gravity well”); and (ii) optimizing logprobabilities under the softmax must exhibit slow convergence (“softmax damping”). I propose an alternative escort mapping that demonstrates better optimization properties for PG and cross entropy minimization in SL. This analysis is based on the NL inequality
and a new nonuniform smoothness (NS) property. These difficulties with the softmax and the advantage of the escort transform are further explained by the concept of the NL coefficient.
Chapter 4 then introduces a nonuniform analysis that combines the nonuniform smoothness (NS) property and the NL inequality, using the combination to more accurately characterize nonconvex objective landscapes and inspire new geometryaware gradient descent methods. One interesting result for general optimization is that geometryaware firstorder methods can converge to global optimality faster than the classical Omega(1/t^2) lower bounds if one additionally considers these nonuniform properties. This chapter then applies new geometryaware firstorder methods to PG and generalized linear model training (GLM). For PG, it is shown that normalizing gradient ascent can accelerate convergence to O(e^{−c t}) for some c > 0, while incurring less overhead than existing algorithms. For GLM, it is shown that geometryaware
normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. Additionally, I show that these geometryaware gradient descent methods can escape landscape plateaus faster
than standard gradient descent.
Finally, Chapter 5 extends the analysis to stochastic policy
optimization, and shows that the preferability of optimization methods depends critically on whether stochastic versus exact gradients are used. By introducing the concept of committal rate, this chapter contributes two key findings: (i) identifying a
criterion for determining almost sure global convergence; and (ii) revealing an inherent tradeoff between exploiting geometry to accelerate convergence versus achieving almost sure global optimality. This committal rate theory is then used to explain why practical policy optimization methods are sensitive to random initialization, leading to the development of an ensemble method that can be guaranteed to achieve nearoptimal solutions with high probability. 
 Subjects / Keywords

 Graduation date
 Fall 2021

 Type of Item
 Thesis

 Degree
 Doctor of Philosophy

 License
 This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for noncommercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.