Online Learning for Linearly Parametrized Control Problems

Abbasi-Yadkori, Yasin

doi:doi:10.7939/R3H418

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

1065 views
1384 downloads

Online Learning for Linearly Parametrized Control Problems

Author / Creator

Abbasi-Yadkori, Yasin
In a discrete-time online control problem, a learner makes an effort to control the state of an initially unknown environment so as to minimize the sum of the losses he suffers, where the losses are assumed to depend on the individual state-transitions. Various models of control problems have been topic of intensive research for more than 50 years. Despite this, theoretical results for the online variant of the problem remain limited. In this thesis, we study several online control problems, ranging from the simplest state-less problems, also known as bandits, through classical control problems where the system dynamics is linear and the cost is quadratic in the state and the control, to more complex non-linear problems. The main topic is the design of algorithms for these problems and the development of finite-time performance guarantees for the algorithms proposed. A common theme of the problems is that they assume a linear parametric uncertainty. Accordingly, our methods employ a linear-in-the-parameters predictor and construct a confidence set that contains the true unknown parameter with sufficiently high probability. In particular, following the "optimism in the face of uncertainty" principle, the algorithms always use the parameter that gives rise to the lowest expected loss. In this framework, a tighter confidence set immediately results in a better performing online learning method. The first main contribution of the thesis is the construction of smaller confidence sets for the least-squares estimate. To arrive at these confidence sets, we derive a novel tail inequality for vector-valued martingales. Based on this new confidence set, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), and Rusmevichientong and Tsitsiklis (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. The second main contribution is the introduction of a novel technique to construct confidence sets, which we call online-to-confidence-set conversion. The technique allows us to construct high-probability confidence sets for linear prediction with correlated inputs given the predictions of any algorithm (e.g., online LASSO, exponentiated gradient algorithm (Kivinen and Warmuth, 1997), online least-squares (Lai et al., 1979, Auer et al., 2002b, Vovk, 2001), p-norm algorithms (Grove et al., 2001, Gentile and Littlestone, 1999)); in general, any algorithm whose objective is to achieve low regret with respect to the quadratic loss while using linear predictors. By construction, the size of the confidence set is directly governed by the regret of the online learning algorithm. As a result of this "reductionist" approach, progress in constructing better algorithms for online prediction problems directly translates into tighter confidence sets. As a demonstration of this new approach to constructing confidence sets, we introduce the sparse variant of linear stochastic bandits and show that a recent online algorithm together with our online-to-confidence-set conversion allows one to derive algorithms that can exploit if the unknown parameter vector determining the expected loss of actions is sparse. In the second part of the thesis, we study the average loss linear quadratic (LQ) control problem with unknown model parameters, also known as the LQ adaptive control problem in the control community. We design an algorithm and prove that its regret up to time T, apart from logarithmic factors, is O(sqrt{T}). To the best of our knowledge, this is the first time that an algorithm designed for this problem is shown to enjoy a sublinear finite-time regret bound. We also show that similar techniques can be employed to design and analyze an algorithm for a more general problem with nonlinear dynamics but linear parametric uncertainty. To the best of our knowledge this is the the first time that regret bounds are derived for these classes of control problems.
Subjects / Keywords
Graduation date

Spring 2013
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3H418
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Specialization
- Statistical Machine Learning
Supervisor / co-supervisor and their department(s)
- Szepesvari, Csaba (Computing Science)
Examining committee members and their departments
- Schuurmans, Dale (Computing Science)
- Borkar, Vivek S. (School of Technology and Computer Science)
- Huang, Biao (Chemical and Materials Engineering)
- Sutton, Richard (Computing Science)
- Greiner, Russell (Computing Science)