
Improved Speedy Q-learning Algorithm Based on Double Estimator
Citation: ZHENG Shuai, LUO Fei, GU Chun-hua, DING Wei-chao, LU Hai-feng. Improved Speedy Q-learning Algorithm Based on Double Estimator[J]. Computer Science, 2020, 47(7): 179-185.
Authors: ZHENG Shuai  LUO Fei  GU Chun-hua  DING Wei-chao  LU Hai-feng
Affiliation: School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Fund: Research Project on Education and Teaching Laws and Methods of East China University of Science and Technology; National Natural Science Foundation of China
Abstract: Q-learning is a classical reinforcement learning algorithm, but its conservative update strategy and its overestimation of action values make it converge slowly. Speedy Q-learning and Double Q-learning are two variants of Q-learning that address slow convergence and overestimation, respectively. Starting from the Q-value update rule of Speedy Q-learning and the update strategy of Monte Carlo reinforcement learning, an equivalent form of the update rule is derived through theoretical analysis and mathematical proof. This equivalent form shows that Speedy Q-learning uses the current Q-value estimate as the estimate of the historical Q value; although this improves the agent's overall convergence speed, the algorithm still suffers from overestimation, which slows convergence in the early iterations. To solve this problem, an improved algorithm named Double Speedy Q-learning is proposed, based on the observation that the double estimator of Double Q-learning can improve the agent's convergence speed. By using a double estimator, the selection of the optimal action is separated from the evaluation of the maximum Q value, which improves the learning strategy of Speedy Q-learning in the early iterations and thereby its overall convergence speed. Grid-world experiments of different scales, using both linear and polynomial learning rates, compare the early-stage and overall convergence speed of Q-learning and its improved variants. The results show that Double Speedy Q-learning converges faster than Speedy Q-learning in the early iterations, that its overall convergence speed is significantly faster than that of the comparison algorithms, and that the difference between its actual average reward and the expected reward is the smallest.
Keywords: Q-learning  Double Q-learning  Speedy Q-learning  Reinforcement learning
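
The abstract refers to the Q-value update rule of Speedy Q-learning, to the double-estimator mechanism of Double Q-learning, and to linear and polynomial learning rates, without stating them explicitly. The following Python sketch illustrates these ideas in tabular form. The discount factor, the learning-rate exponent w, the environment interface, and in particular the way the double estimator is combined with the Speedy-style update (double_speedy_q_update) are assumptions made for illustration; this is not the authors' reference implementation of Double Speedy Q-learning.

# Illustrative sketch, not the paper's code: tabular update rules for
# Q-learning, Speedy Q-learning, and a double-estimator ("Double Speedy") variant.
import numpy as np

GAMMA = 0.95  # discount factor (assumed value for the example)

def linear_lr(k):
    # Linear learning rate alpha_k = 1 / (k + 1).
    return 1.0 / (k + 1)

def polynomial_lr(k, w=0.8):
    # Polynomial learning rate alpha_k = 1 / (k + 1)^w; the exponent w = 0.8
    # is an assumed example value, not necessarily the one used in the paper.
    return 1.0 / (k + 1) ** w

def q_learning_update(Q, s, a, r, s_next, alpha):
    # Classical Q-learning: bootstrap on the maximum of a single estimator,
    # which is the source of the overestimation discussed in the abstract.
    target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def speedy_q_update(Q, Q_prev, s, a, r, s_next, alpha):
    # Speedy Q-learning: the empirical Bellman operator is applied to both
    # the current estimate Q and the previous-iteration estimate Q_prev.
    tq_prev = r + GAMMA * np.max(Q_prev[s_next])   # T_k Q_{k-1}(s, a)
    tq_curr = r + GAMMA * np.max(Q[s_next])        # T_k Q_k(s, a)
    Q[s, a] += alpha * (tq_prev - Q[s, a]) + (1.0 - alpha) * (tq_curr - tq_prev)

def double_speedy_q_update(QA, QB, QA_prev, QB_prev, s, a, r, s_next, alpha, rng):
    # One plausible double-estimator variant in the spirit of Double Speedy
    # Q-learning: the greedy action is selected with one estimator and
    # evaluated with the other (as in Double Q-learning), inside a
    # Speedy-style update. A sketch of the idea, not the authors' exact rule.
    if rng.random() < 0.5:
        QA, QB, QA_prev, QB_prev = QB, QA, QB_prev, QA_prev  # swap the roles
    a_star_prev = np.argmax(QA_prev[s_next])
    a_star_curr = np.argmax(QA[s_next])
    tq_prev = r + GAMMA * QB_prev[s_next, a_star_prev]
    tq_curr = r + GAMMA * QB[s_next, a_star_curr]
    QA[s, a] += alpha * (tq_prev - QA[s, a]) + (1.0 - alpha) * (tq_curr - tq_prev)

# Minimal usage example on a toy table with 4 states and 2 actions.
rng = np.random.default_rng(0)
Q = np.zeros((4, 2))
speedy_q_update(Q, Q.copy(), s=0, a=1, r=1.0, s_next=2, alpha=linear_lr(0))
QA, QB = np.zeros((4, 2)), np.zeros((4, 2))
double_speedy_q_update(QA, QB, QA.copy(), QB.copy(), s=0, a=1, r=1.0, s_next=2,
                       alpha=polynomial_lr(0), rng=rng)

Under this reading, the double estimator breaks the coupling between choosing the arg-max action and evaluating the maximum Q value, which is what the abstract credits for the faster convergence of Double Speedy Q-learning in the early iterations.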