Online Learning for Linearly Parametrized Control Problems

  • Author / Creator
    Abbasi-Yadkori, Yasin
  • In a discrete-time online control problem, a learner makes an effort to control the state of an initially unknown environment so as to minimize the sum of the losses he suffers, where the losses are assumed to depend on the individual state-transitions. Various models of control problems have been topic of intensive research for more than 50 years. Despite this, theoretical results for the online variant of the problem remain limited. In this thesis, we study several online control problems, ranging from the simplest state-less problems, also known as bandits, through classical control problems where the system dynamics is linear and the cost is quadratic in the state and the control, to more complex non-linear problems. The main topic is the design of algorithms for these problems and the development of finite-time performance guarantees for the algorithms proposed. A common theme of the problems is that they assume a linear parametric uncertainty. Accordingly, our methods employ a linear-in-the-parameters predictor and construct a confidence set that contains the true unknown parameter with sufficiently high probability. In particular, following the "optimism in the face of uncertainty" principle, the algorithms always use the parameter that gives rise to the lowest expected loss. In this framework, a tighter confidence set immediately results in a better performing online learning method. The first main contribution of the thesis is the construction of smaller confidence sets for the least-squares estimate. To arrive at these confidence sets, we derive a novel tail inequality for vector-valued martingales. Based on this new confidence set, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), and Rusmevichientong and Tsitsiklis (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. The second main contribution is the introduction of a novel technique to construct confidence sets, which we call online-to-confidence-set conversion. The technique allows us to construct high-probability confidence sets for linear prediction with correlated inputs given the predictions of any algorithm (e.g., online LASSO, exponentiated gradient algorithm (Kivinen and Warmuth, 1997), online least-squares (Lai et al., 1979, Auer et al., 2002b, Vovk, 2001), p-norm algorithms (Grove et al., 2001, Gentile and Littlestone, 1999)); in general, any algorithm whose objective is to achieve low regret with respect to the quadratic loss while using linear predictors. By construction, the size of the confidence set is directly governed by the regret of the online learning algorithm. As a result of this "reductionist" approach, progress in constructing better algorithms for online prediction problems directly translates into tighter confidence sets. As a demonstration of this new approach to constructing confidence sets, we introduce the sparse variant of linear stochastic bandits and show that a recent online algorithm together with our online-to-confidence-set conversion allows one to derive algorithms that can exploit if the unknown parameter vector determining the expected loss of actions is sparse. In the second part of the thesis, we study the average loss linear quadratic (LQ) control problem with unknown model parameters, also known as the LQ adaptive control problem in the control community. We design an algorithm and prove that its regret up to time T, apart from logarithmic factors, is O(sqrt{T}). To the best of our knowledge, this is the first time that an algorithm designed for this problem is shown to enjoy a sublinear finite-time regret bound. We also show that similar techniques can be employed to design and analyze an algorithm for a more general problem with nonlinear dynamics but linear parametric uncertainty. To the best of our knowledge this is the the first time that regret bounds are derived for these classes of control problems.

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Computing Science
  • Specialization
    • Statistical Machine Learning
  • Supervisor / co-supervisor and their department(s)
    • Szepesvari, Csaba (Computing Science)
  • Examining committee members and their departments
    • Huang, Biao (Chemical and Materials Engineering)
    • Greiner, Russell (Computing Science)
    • Borkar, Vivek S. (School of Technology and Computer Science)
    • Schuurmans, Dale (Computing Science)
    • Sutton, Richard (Computing Science)