A Gradient Algorithm for Neural-Network-Based Reinforcement Learning
Citation: XU Xin, HE Han-Gen. A Gradient Algorithm for Neural-Network-Based Reinforcement Learning[J]. Chinese Journal of Computers, 2003, 26(2): 227-233
Authors: XU Xin, HE Han-Gen
Affiliation: Institute of Automation, National University of Defense Technology, Changsha 410073, China
Funding: Supported by the National Natural Science Foundation of China (60075020)
Abstract: For Markov decision problems with continuous state and discrete action spaces, a new gradient-descent reinforcement learning algorithm is proposed that uses multilayer feedforward neural networks for value function approximation. The algorithm employs a nearly greedy, continuously differentiable Boltzmann-distribution action selection policy, and approximates the optimal value function of the Markov decision process by minimizing a sum-of-squared-Bellman-residuals performance index under a non-stationary action policy. The convergence of the algorithm and the performance of the resulting near-optimal policy are analyzed theoretically, and simulation studies on the Mountain-Car learning control problem further verify the algorithm's learning efficiency and generalization performance.
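In standard notation (a plausible formalization assumed here, not the paper's own symbols), the construction the abstract describes combines a Boltzmann action selection policy over an approximate value function Q_w, computed by a multilayer feedforward network with weights w and temperature T,

\pi_w(a \mid s) = \frac{\exp\bigl(Q_w(s,a)/T\bigr)}{\sum_{a'} \exp\bigl(Q_w(s,a')/T\bigr)},

with a sum-of-squared-Bellman-residuals criterion over observed transitions (s_i, a_i, r_i, s_{i+1}) with discount factor \gamma,

E(w) = \sum_i \Bigl[\, r_i + \gamma \sum_{a} \pi_w(a \mid s_{i+1})\, Q_w(s_{i+1}, a) - Q_w(s_i, a_i) \Bigr]^2 .

Lowering T makes \pi_w approach the greedy policy, hence "nearly greedy"; and since \pi_w itself depends on w, the action policy is non-stationary during learning, which is exactly the setting the paper's convergence analysis targets. As the English abstract below notes, an upper bound of the residuals is taken as the objective to derive incremental gradient rules.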

Keywords: neural networks, reinforcement learning, gradient descent, Markov decision processes, value function approximation, machine learning
Revised: 2001-05-23

English abstract: To solve Markov decision problems with continuous state spaces and discrete action spaces, neural networks are commonly used as value function approximators. Since there are no teacher signals in reinforcement learning, the gradient algorithms used to train neural networks in supervised learning cannot be applied directly. Existing direct algorithms for neural-network-based reinforcement learning are not gradient descent algorithms for any objective function; their convergence is therefore hard to analyze, and divergence examples have been found. In previous work on residual gradient algorithms, the action policy is assumed to be stationary, so convergence cannot be guaranteed when the action policy is, as is usual, greedy with respect to the estimated value function. In this paper, a new gradient descent reinforcement learning algorithm is proposed in which multilayer feedforward neural networks are used as value function approximators. A nearly greedy, differentiable action policy based on the Boltzmann probability distribution is employed. The optimal value functions of Markov decision processes are approximated by minimizing Bellman residuals under non-stationary action policies. To derive incremental gradient learning rules, an upper bound of the Bellman residuals is employed as the objective function. The convergence of the proposed algorithm and the performance of the approximated optimal policy are analyzed theoretically. Simulation results on the learning control of the Mountain-Car problem illustrate the learning efficiency and generalization ability of the proposed algorithm.
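As a concrete illustration of the kind of update the abstract describes, the following minimal sketch in Python/NumPy uses a one-hidden-layer feedforward network to approximate Q(s, a), scores actions with a Boltzmann (softmax) policy, and descends the gradient of the squared Bellman residual through both the current and the next state. All names, layer sizes, and hyperparameters are illustrative assumptions, not the paper's settings; in particular, the Boltzmann weights are treated as constants in the gradient, a simplification that does not reproduce the paper's upper-bound objective.

# Illustrative sketch only: a residual-gradient step with a feedforward
# Q-network and Boltzmann action selection. Sizes and hyperparameters are
# assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS, HIDDEN = 2, 3, 20   # Mountain-Car: 2-D state, 3 actions
GAMMA, ALPHA, TEMP = 0.99, 0.01, 0.5      # discount, step size, temperature

# One hidden layer: state -> tanh hidden units -> one Q-value per action.
W1 = rng.normal(0.0, 0.1, (HIDDEN, STATE_DIM)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_ACTIONS, HIDDEN)); b2 = np.zeros(N_ACTIONS)

def q_values(s):
    """Forward pass; returns Q(s, .) and the hidden activation for backprop."""
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h

def boltzmann(q):
    """Nearly greedy, differentiable action distribution: softmax of Q / TEMP."""
    z = np.exp((q - q.max()) / TEMP)
    return z / z.sum()

def grads(g_q, h, s):
    """Gradients of the scalar g_q . Q(s, .) w.r.t. (W2, b2, W1, b1)."""
    g_h = (W2.T @ g_q) * (1.0 - h ** 2)   # backprop through tanh
    return np.outer(g_q, h), g_q, np.outer(g_h, s), g_h

def residual_gradient_step(s, a, r, s_next, done):
    """One incremental step on the squared Bellman residual.

    delta = r + gamma * sum_a' pi(a'|s') Q(s', a') - Q(s, a); the update
    differentiates delta through both Q(s, a) and Q(s', .), treating the
    Boltzmann weights pi as constants (a simplification).
    """
    global W1, b1, W2, b2
    q, h = q_values(s)
    q_next, h_next = q_values(s_next)
    pi_next = boltzmann(q_next)
    v_next = 0.0 if done else float(pi_next @ q_next)
    delta = r + GAMMA * v_next - q[a]

    # d(delta)/dw from the -Q(s, a) term ...
    g_q = np.zeros(N_ACTIONS); g_q[a] = -1.0
    dW2, db2, dW1, db1 = grads(g_q, h, s)
    # ... plus, for non-terminal transitions, the gamma * pi . Q(s', .) term.
    if not done:
        for d, dn in zip((dW2, db2, dW1, db1),
                         grads(GAMMA * pi_next, h_next, s_next)):
            d += dn
    # Gradient descent on delta^2 / 2: w <- w - alpha * delta * d(delta)/dw.
    W2 -= ALPHA * delta * dW2; b2 -= ALPHA * delta * db2
    W1 -= ALPHA * delta * dW1; b1 -= ALPHA * delta * db1
    return delta

In a full learning loop, the agent would sample actions from boltzmann(q_values(s)[0]) and call residual_gradient_step on each observed transition; annealing TEMP toward zero makes the policy increasingly greedy, matching the "nearly greedy" behavior the abstract refers to.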
This article is indexed in CNKI, VIP (Weipu), Wanfang Data, and other databases.