Quadratic approximate dynamic programming for input-affine systems. (English) Zbl 1285.93103

Summary: We consider the use of quadratic approximate value functions for stochastic control problems with input-affine dynamics and convex stage cost and constraints. Evaluating the approximate dynamic programming policy in such cases requires solving an explicit convex optimization problem, such as a quadratic program, which can be done efficiently. We describe a simple and general method for approximate value iteration that likewise relies on our ability to solve convex optimization problems, in this case typically a semidefinite program. Although we have no theoretical guarantee on the performance attained by our method, we observe that very good performance can be obtained in practice.
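In the special case of linear dynamics with an unconstrained quadratic stage cost, the greedy policy induced by a quadratic approximate value function has a closed-form minimizer; with input constraints it becomes a small quadratic program, as the summary describes. A minimal sketch of this unconstrained case (all matrices below are illustrative assumptions, not data from the paper):

```python
import numpy as np

# Illustrative problem data (assumed, not from the paper):
# linear dynamics x_{t+1} = A x_t + B u_t + w_t, stage cost
# x'Qx + u'Ru, and a quadratic approximate value function
# V(x) = x' P x.
rng = np.random.default_rng(0)
n, m = 4, 2
A = rng.standard_normal((n, n)) / np.sqrt(n)
B = rng.standard_normal((n, m))
R = np.eye(m)
P = np.eye(n)  # a (crude) quadratic approximate value function

def adp_policy(x):
    """Greedy ADP policy: argmin_u  u'Ru + (Ax + Bu)' P (Ax + Bu).

    The additive noise w shifts E[V] by a constant, so it drops out
    of the argmin.  Unconstrained and strictly convex in u, so the
    minimizer is available in closed form; adding input constraints
    turns this step into a small QP.
    """
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K @ x

x = rng.standard_normal(n)
u_star = adp_policy(x)

def obj(u):
    """Objective minimized by the greedy policy at state x."""
    xp = A @ x + B @ u
    return u @ R @ u + xp @ P @ xp

# u_star is the global minimizer, so perturbing it only increases cost.
assert all(obj(u_star) <= obj(u_star + 0.1 * rng.standard_normal(m))
           for _ in range(10))
```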

MSC:

93E20 Optimal stochastic control
49L20 Dynamic programming in optimal control and differential games
93E25 Computational methods in stochastic control (MSC2010)
90C25 Convex programming

Software:

fast_mpc; CVX; CVXGEN
