A Gradient Algorithm for Neural-Network-Based Reinforcement Learning
Citation: XU Xin, HE Han-Gen. A Gradient Algorithm for Neural-Network-Based Reinforcement Learning[J]. Chinese Journal of Computers, 2003, 26(2): 227-233
Authors: XU Xin, HE Han-Gen
Affiliation: Institute of Automation, National University of Defense Technology, Changsha 410073, China
Funding: Supported by the National Natural Science Foundation of China (60075020)
Abstract: For Markov decision problems with continuous state and discrete action spaces, a new gradient-descent reinforcement learning algorithm is proposed that uses multilayer feedforward neural networks for value function approximation. The algorithm employs a nearly greedy, continuously differentiable Boltzmann-distribution action selection policy, and approximates the optimal value function of the Markov decision process by minimizing a sum-of-squared-Bellman-residuals performance index under a non-stationary action policy. The convergence of the algorithm and the performance of the resulting near-optimal policy are analyzed theoretically, and simulation studies on the Mountain-Car learning control problem further verify the algorithm's learning efficiency and generalization performance.
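In standard notation (a plausible formalization assumed here, not the paper's own symbols), the construction the abstract describes combines a Boltzmann action selection policy over an approximate value function Q_w, computed by a multilayer feedforward network with weights w and temperature T,

\pi_w(a \mid s) = \frac{\exp\bigl(Q_w(s,a)/T\bigr)}{\sum_{a'} \exp\bigl(Q_w(s,a')/T\bigr)},

with a sum-of-squared-Bellman-residuals criterion over observed transitions (s_i, a_i, r_i, s_{i+1}) with discount factor \gamma,

E(w) = \sum_i \Bigl[\, r_i + \gamma \sum_{a} \pi_w(a \mid s_{i+1})\, Q_w(s_{i+1}, a) - Q_w(s_i, a_i) \Bigr]^2 .

Lowering T makes \pi_w approach the greedy policy, hence "nearly greedy"; and since \pi_w itself depends on w, the action policy is non-stationary during learning, which is exactly the setting the paper's convergence analysis targets. As the English abstract below notes, an upper bound of the residuals is taken as the objective to derive incremental gradient rules.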

Keywords: neural networks, reinforcement learning, gradient descent, Markov decision processes, value function approximation, machine learning
Revised: 2001-05-23

English abstract: To solve Markov decision problems with continuous state spaces and discrete action spaces, neural networks are commonly used as value function approximators. Since there are no teacher signals in reinforcement learning, the gradient algorithms used to train neural networks in supervised learning cannot be applied directly. Existing direct algorithms for neural-network-based reinforcement learning are not gradient descent algorithms for any objective function; their convergence is therefore hard to analyze, and divergence examples have been found. In previous work on residual gradient algorithms, the action policy is assumed to be stationary, so convergence cannot be guaranteed when the action policy is, as is usual, greedy with respect to the estimated value function. In this paper, a new gradient descent reinforcement learning algorithm is proposed in which multilayer feedforward neural networks are used as value function approximators. A nearly greedy, differentiable action policy based on the Boltzmann probability distribution is employed. The optimal value functions of Markov decision processes are approximated by minimizing Bellman residuals under non-stationary action policies. To derive incremental gradient learning rules, an upper bound of the Bellman residuals is employed as the objective function. The convergence of the proposed algorithm and the performance of the approximated optimal policy are analyzed theoretically. Simulation results on the learning control of the Mountain-Car problem illustrate the learning efficiency and generalization ability of the proposed algorithm.
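As a concrete illustration of the kind of update the abstract describes, the following minimal sketch in Python/NumPy uses a one-hidden-layer feedforward network to approximate Q(s, a), scores actions with a Boltzmann (softmax) policy, and descends the gradient of the squared Bellman residual through both the current and the next state. All names, layer sizes, and hyperparameters are illustrative assumptions, not the paper's settings; in particular, the Boltzmann weights are treated as constants in the gradient, a simplification that does not reproduce the paper's upper-bound objective.

# Illustrative sketch only: a residual-gradient step with a feedforward
# Q-network and Boltzmann action selection. Sizes and hyperparameters are
# assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS, HIDDEN = 2, 3, 20   # Mountain-Car: 2-D state, 3 actions
GAMMA, ALPHA, TEMP = 0.99, 0.01, 0.5      # discount, step size, temperature

# One hidden layer: state -> tanh hidden units -> one Q-value per action.
W1 = rng.normal(0.0, 0.1, (HIDDEN, STATE_DIM)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_ACTIONS, HIDDEN)); b2 = np.zeros(N_ACTIONS)

def q_values(s):
    """Forward pass; returns Q(s, .) and the hidden activation for backprop."""
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h

def boltzmann(q):
    """Nearly greedy, differentiable action distribution: softmax of Q / TEMP."""
    z = np.exp((q - q.max()) / TEMP)
    return z / z.sum()

def grads(g_q, h, s):
    """Gradients of the scalar g_q . Q(s, .) w.r.t. (W2, b2, W1, b1)."""
    g_h = (W2.T @ g_q) * (1.0 - h ** 2)   # backprop through tanh
    return np.outer(g_q, h), g_q, np.outer(g_h, s), g_h

def residual_gradient_step(s, a, r, s_next, done):
    """One incremental step on the squared Bellman residual.

    delta = r + gamma * sum_a' pi(a'|s') Q(s', a') - Q(s, a); the update
    differentiates delta through both Q(s, a) and Q(s', .), treating the
    Boltzmann weights pi as constants (a simplification).
    """
    global W1, b1, W2, b2
    q, h = q_values(s)
    q_next, h_next = q_values(s_next)
    pi_next = boltzmann(q_next)
    v_next = 0.0 if done else float(pi_next @ q_next)
    delta = r + GAMMA * v_next - q[a]

    # d(delta)/dw from the -Q(s, a) term ...
    g_q = np.zeros(N_ACTIONS); g_q[a] = -1.0
    dW2, db2, dW1, db1 = grads(g_q, h, s)
    # ... plus, for non-terminal transitions, the gamma * pi . Q(s', .) term.
    if not done:
        for d, dn in zip((dW2, db2, dW1, db1),
                         grads(GAMMA * pi_next, h_next, s_next)):
            d += dn
    # Gradient descent on delta^2 / 2: w <- w - alpha * delta * d(delta)/dw.
    W2 -= ALPHA * delta * dW2; b2 -= ALPHA * delta * db2
    W1 -= ALPHA * delta * dW1; b1 -= ALPHA * delta * db1
    return delta

In a full learning loop, the agent would sample actions from boltzmann(q_values(s)[0]) and call residual_gradient_step on each observed transition; annealing TEMP toward zero makes the policy increasingly greedy, matching the "nearly greedy" behavior the abstract refers to.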
This article is indexed in CNKI, VIP (Weipu), Wanfang Data, and other databases.