Linear Least-Squares Algorithms for Temporal Difference Learning

@article{Bradtke2005LinearLA,
  title={Linear Least-Squares Algorithms for Temporal Difference Learning},
  author={Steven J. Bradtke and Andrew G. Barto},
  journal={Machine Learning},
  year={1996},
  volume={22},
  pages={33-57},
  url={https://api.semanticscholar.org/CorpusID:20327856}
}
Two new temporal difference algorithms based on the theory of linear least-squares function approximation, LS TD and RLS TD, are introduced, and probability-one convergence is proved when they are used with a function approximator that is linear in the adjustable parameters.
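
As a concrete reference point for the batch LS TD solution with linear features, here is a minimal numpy sketch; the ridge term, function names, and data layout are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def lstd(transitions, phi, d, gamma=0.95, ridge=1e-6):
    """Batch LS TD estimate of linear value-function weights theta.

    transitions: iterable of (s, r, s_next) samples from a fixed policy
    phi: feature map, phi(s) -> np.ndarray of length d
    ridge: small regularizer so A is invertible on short trajectories (assumption)
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)   # A = sum phi(s) (phi(s) - gamma*phi(s'))^T
        b += r * f                             # b = sum r * phi(s)
    theta = np.linalg.solve(A + ridge * np.eye(d), b)
    return theta                               # V(s) is approximated by phi(s) . theta
```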

Stochastic approximation for efficient LSTD and least squares regression

This paper considers a "big data" regime where both the dimension d of the data and the number T of training samples are large, and proposes stochastic approximation based methods with randomization of samples in two different settings: one for policy evaluation using the least squares temporal difference (LSTD) algorithm, and the other for solving the least squares regression problem.
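
For the regression setting described here, the randomized-sample stochastic approximation idea can be sketched as follows; the step-size schedule, names, and uniform sampling are illustrative assumptions, and the LSTD setting would replace the regression gradient with a TD-style update.

```python
import numpy as np

def sa_least_squares(X, y, steps=100_000, c=1.0, seed=0):
    """Stochastic approximation for least squares regression: at each step,
    draw one (x_i, y_i) uniformly at random and take a gradient step on
    0.5 * (x_i . theta - y_i)^2 with a decaying step size c / sqrt(t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)                  # randomization of samples
        grad = (X[i] @ theta - y[i]) * X[i]  # single-sample gradient
        theta -= (c / np.sqrt(t)) * grad
    return theta
```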

Least-squares methods for policy iteration

This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning, and discusses three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference learning, least-squares policy evaluation, and Bellman residual minimization.
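
As a rough sketch of how two of these policy evaluation techniques differ, the sample-based linear systems for LSTD and for Bellman residual minimization can be assembled from the same transitions; the single-sample BRM estimator below omits the double-sampling correction (and is biased for stochastic transitions), and all names are illustrative assumptions.

```python
import numpy as np

def policy_evaluation_systems(transitions, phi, d, gamma=0.95):
    """Build the LSTD and naive Bellman-residual systems from (s, r, s') samples."""
    A_lstd = np.zeros((d, d)); b_lstd = np.zeros(d)
    A_brm = np.zeros((d, d));  b_brm = np.zeros(d)
    for s, r, s_next in transitions:
        f = phi(s)
        df = f - gamma * phi(s_next)
        A_lstd += np.outer(f, df);  b_lstd += r * f    # projected TD fixed point
        A_brm  += np.outer(df, df); b_brm  += r * df   # squared Bellman residual
    theta_lstd = np.linalg.solve(A_lstd, b_lstd)       # assumes enough samples for invertibility
    theta_brm = np.linalg.solve(A_brm, b_brm)
    return theta_lstd, theta_brm
```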

Sparse Temporal Difference Learning via Alternating Direction Method of Multipliers

This paper proposes a new algorithm for approximating the fixed-point based on the Alternating Direction Method of Multipliers (ADMM), and demonstrates, with experimental results, that the proposed algorithm is more stable for policy iteration compared to prior work.
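
As a hedged illustration of the ADMM machinery, below is the standard ADMM iteration for a generic l1-regularized least-squares problem; the paper's actual TD fixed-point formulation differs, so this is only a stand-in sketch with assumed names and constants.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise shrinkage operator used in the z-update."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1_least_squares(A, b, lam=0.1, rho=1.0, iters=200):
    """ADMM for min_theta 0.5*||A theta - b||^2 + lam*||theta||_1."""
    d = A.shape[1]
    theta = np.zeros(d); z = np.zeros(d); u = np.zeros(d)
    M = np.linalg.inv(A.T @ A + rho * np.eye(d))  # cached factor for the quadratic step
    for _ in range(iters):
        theta = M @ (A.T @ b + rho * (z - u))     # quadratic subproblem
        z = soft_threshold(theta + u, lam / rho)  # sparsity-inducing shrinkage
        u = u + theta - z                         # scaled dual update
    return z
```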

Fastest Convergence for Q-learning

The Zap Q-learning algorithm introduced in this paper improves on Watkins' original algorithm and on recent competitors in several respects. It is a matrix-gain algorithm designed so that its…

Optimization methods for structured machine learning problems

This thesis attempts to solve the ℓ1-regularized fixed-point problem with the help of the Alternating Direction Method of Multipliers (ADMM) and argues that the proposed method is well suited to the structure of the aforementioned fixed-point problem.

Q-learning algorithms for optimal stopping based on least squares

This work considers the solution of discounted optimal stopping problems using linear function approximation methods, proposes alternative algorithms based on projected value iteration ideas and least squares, and proves the convergence of some of these algorithms.

Compressed Conditional Mean Embeddings for Model-Based Reinforcement Learning

It is demonstrated that the loss function for the CME model suggests a principled approach to compressing the induced (pseudo-)MDP, leading to faster planning while maintaining guarantees, and to superior performance over existing methods in this class of model-based approaches on a range of MDPs.

Off-Policy Neural Fitted Actor-Critic

A new off-policy, offline, model-free, actor-critic reinforcement learning algorithm for environments that are continuous in both states and actions is presented, which allows a trade-off between data efficiency and scalability.

Applying Q(λ)-learning in Deep Reinforcement Learning to Play Atari Games

Empirical results on a range of games show that the deep Q(λ) network significantly reduces learning time, providing faster learning than the DQN method.

The convergence of TD(λ) for general λ

Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one is adapted to demonstrate this strong form of convergence for a slightly modified version of TD.
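
For reference, a textbook tabular TD(λ) update with accumulating eligibility traces looks roughly as follows; this is an illustrative sketch, not the precise variant analyzed in the paper.

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.8):
    """Tabular TD(lambda) with accumulating traces; each episode is a list of (s, r, s_next)."""
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                     # eligibility traces
        for s, r, s_next in episode:
            delta = r + gamma * V[s_next] - V[s]   # TD error
            e[s] += 1.0                            # accumulate the trace for s
            V += alpha * delta * e                 # credit all recently visited states
            e *= gamma * lam                       # decay the traces
    return V
```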

Practical issues in temporal difference learning

It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance that is clearly better than conventional commercial programs and that surpasses comparable networks trained on a massive human expert data set.

On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

A rigorous proof of convergence of DP-based learning algorithms is provided by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem, which establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.

Incremental dynamic programming for on-line adaptive optimal control

This dissertation expands the theoretical and empirical understanding of IDP algorithms and increases their domain of practical application, and proves convergence of a DP-based reinforcement learning algorithm to the optimal policy for any continuous domain.

Q-learning

This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
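
The tabular update whose convergence is established can be sketched as follows; the exploration policy and the decaying step-size conditions required by the theorem are left outside this snippet, and the names are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step; repeated sampling of every (state, action)
    pair with suitably decaying alpha is what the convergence theorem requires."""
    target = r + gamma * np.max(Q[s_next])   # bootstrapped one-step target
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
    return Q
```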

Expectation Driven Learning with an Associative Memory

In these experiments, the automaton has expectations of the minimum future cost of actions leading to a goal state; learning occurs when expectations in the associative memory are modified and the effect on learning is noted.

Recursive estimation and time-series analysis


Learning rate schedules for faster stochastic gradient search

The authors propose a new methodology for creating the first automatically adapting learning rates that achieve the optimal rate of convergence for stochastic gradient descent. Empirical tests agree…
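
A "search-then-converge" style schedule of the kind associated with this line of work can be sketched as below; the exact functional form and constants used in the paper are assumptions for illustration.

```python
def search_then_converge(t, eta0=0.1, tau=1000.0):
    """Step size eta_t = eta0 / (1 + t / tau): roughly constant while t << tau
    (the 'search' phase), then decaying like 1/t (the 'converge' phase), which
    matches the optimal asymptotic rate for stochastic gradient descent."""
    return eta0 / (1.0 + t / tau)
```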