
A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. (English) Zbl 1104.93054

Summary: The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins-Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
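For a concrete picture of the recursion described above, the sketch below shows a fixed-point variant of recursive least-squares (Kalman-filter-style) updating on a toy problem. It is only an illustration of the idea, not the authors' implementation: the affine operator (F J)(x) = b(x) + (A J)(x) with spectral norm below one, the uniform state sampling, the step-size schedule, and all variable names are assumptions made for the example.

```python
# A minimal sketch of the fixed-point approximation scheme described in the
# summary: a Kalman-filter / recursive-least-squares style recursion whose
# targets are noisy samples of the operator applied to the *current*
# approximation, rather than noisy samples of the fixed point itself.
# Everything here (the affine operator, the sampling scheme, the variable
# names) is an illustrative assumption, not the paper's notation.
import numpy as np

rng = np.random.default_rng(0)

# Toy operator (F J)(x) = b(x) + sum_y A[x, y] J(y), rescaled so that
# ||A||_2 = 0.8 < 1, i.e. F is a Euclidean-norm contraction.
num_states, num_features = 50, 5
A = rng.standard_normal((num_states, num_states))
A *= 0.8 / np.linalg.norm(A, 2)
b = rng.standard_normal(num_states)
Phi = rng.standard_normal((num_states, num_features))  # prespecified basis functions

def noisy_operator_sample(r, x, noise_std=0.1):
    """Noisy observation of (F Phi r)(x) at a sampled state x."""
    return b[x] + A[x] @ (Phi @ r) + noise_std * rng.standard_normal()

# Kalman-filter-like recursion: maintain the inverse of the accumulated
# "information" matrix sum_t phi(x_t) phi(x_t)^T and nudge the parameters
# toward regressing the operator samples onto the basis functions.
r = np.zeros(num_features)
H_inv = 1e3 * np.eye(num_features)          # large prior covariance
for t in range(20000):
    x = rng.integers(num_states)            # state sampled uniformly
    phi = Phi[x]
    y = noisy_operator_sample(r, x)         # target generated by the operator
    Hphi = H_inv @ phi                      # Sherman-Morrison rank-one update
    H_inv -= np.outer(Hphi, Hphi) / (1.0 + phi @ Hphi)
    r += (H_inv @ phi) * (y - phi @ r)      # temporal-difference-like correction

# Under uniform sampling the recursion should track the fixed point of the
# projected operator, i.e. r* solving Phi^T (I - A) Phi r* = Phi^T b.
r_star = np.linalg.solve(Phi.T @ (np.eye(num_states) - A) @ Phi, Phi.T @ b)
print("distance to projected fixed point:", np.linalg.norm(r - r_star))
```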

MSC:

93E11 Filtering in stochastic control theory
93E24 Least squares and related methods for stochastic control systems
93E35 Stochastic learning and adaptive control
