Double Speedy Q-Learning Based on Successive Over Relaxation
Citation: ZHOU Qin, LUO Fei, DING Wei-chao, GU Chun-hua, ZHENG Shuai. Double Speedy Q-Learning Based on Successive Over Relaxation [J]. Computer Science, 2022, 49(3): 239-245.
Authors: ZHOU Qin  LUO Fei  DING Wei-chao  GU Chun-hua  ZHENG Shuai
Affiliation: School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Funding: Industry-University-Research Project of the Shanghai Automotive Industry Science and Technology Development Foundation; National Natural Science Foundation of China

Abstract: Q-Learning is currently a mainstream reinforcement learning algorithm, but it converges slowly in stochastic environments. A previous study addressed the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, Double Speedy Q-Learning does not account for the self-loop structures found in stochastic environments, i.e., the possibility that an executed action returns the agent to its current state. This hinders the agent's learning in stochastic environments and thus slows the algorithm's convergence. To handle these self-loop structures, the Bellman operator of Double Speedy Q-Learning is improved using the successive over-relaxation technique, yielding Double Speedy Q-Learning based on Successive Over Relaxation (DSQL-SOR), which further accelerates the convergence of Double Speedy Q-Learning. Numerical experiments compare the error between the actual and expected rewards of DSQL-SOR and other algorithms. The results show that the error of the proposed algorithm is 0.6 lower than that of the existing mainstream algorithm SQL and 0.5 lower than that of the successive over-relaxation algorithm GSQL, indicating that DSQL-SOR outperforms the other algorithms. The experiments also test the scalability of DSQL-SOR: as the state space grows from 10 to 1000 states, the average time per iteration increases slowly, remaining on the order of 10^-4, showing that DSQL-SOR scales well.

Keywords: Reinforcement learning  Q-Learning  Markov decision process (MDP)  Successive over relaxation (SOR)  Self-loop structure
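The abstract's core idea, replacing the standard Bellman target with an over-relaxed one, can be sketched as a single tabular update. The sketch below is an illustrative reconstruction based on the general SOR Q-learning literature, not the authors' exact DSQL-SOR (which additionally combines double estimators with speedy updates); the function name and all parameter values are hypothetical.

```python
import numpy as np

def sor_q_update(Q, s, a, r, s_next, alpha, gamma, w):
    """One tabular Q-learning step with an SOR-modified Bellman target.

    The usual target  r + gamma * max_b Q[s_next, b]  is blended with
    max_b Q[s, b] via the relaxation factor w.  Choosing w > 1
    over-relaxes the update, which is what exploits self-loop
    transition probability p(s|s,a) to speed up convergence.
    """
    target = w * (r + gamma * np.max(Q[s_next])) + (1.0 - w) * np.max(Q[s])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

# Toy usage on a 2-state, 2-action table (values are illustrative only).
Q = np.zeros((2, 2))
Q = sor_q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9, w=1.2)
# Q[0, 1] moves halfway toward the over-relaxed target 1.2, i.e. to 0.6.
```

In the SOR Q-learning literature, w is typically bounded using the self-loop probability, roughly w <= 1/(1 - gamma * p(s|s,a)), so that the modified operator remains a contraction.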
This article is indexed in databases including Weipu (VIP) and Wanfang Data.