Deep reinforcement learning navigation algorithm combining advantage structure and minimum target Q-value

Cite this article: ZHU Wei, HONG Li-dong, SHI Hai-dong, HE De-feng. Deep reinforcement learning navigation algorithm combining advantage structure and minimum target Q-value[J]. Control Theory & Applications, 2024, 41(4): 716-728
Authors: ZHU Wei  HONG Li-dong  SHI Hai-dong  HE De-feng
Affiliation: Zhejiang University of Technology
Funding: National Natural Science Foundation of China (62173303); Natural Science Foundation of Zhejiang Province (LY21F010009)
Abstract: Existing deep reinforcement learning methods based on policy gradients suffer from long training times and low learning efficiency when applied to robot navigation in complex indoor scenes such as offices and corridors. This paper proposes a deep reinforcement learning navigation algorithm that combines an advantage structure with minimization of the target Q-value. The algorithm introduces the advantage structure into policy-gradient-based deep reinforcement learning to distinguish the differences between actions that share the same state value, thereby improving learning efficiency; in multi-target navigation scenarios, the state value is estimated separately, using map information to provide more accurate value judgments. In addition, because the methods that mitigate target Q-value overestimation in discrete control are difficult to apply within the mainstream actor-critic framework, a minimum target Q-value method based on Gaussian smoothing is designed to reduce the influence of overestimation on training. Experimental results show that the proposed algorithm effectively speeds up learning: in both single-target and multi-target continuous navigation training, it converges faster than the soft actor-critic (SAC), twin delayed deep deterministic policy gradient (TD3), and deep deterministic policy gradient (DDPG) algorithms, keeps the mobile robot effectively away from obstacles, and yields a navigation model with good generalization ability.

Keywords: reinforcement learning  mobile robot  navigation  advantage structure  minimum target Q-value
Received: 2022-04-19
Revised: 2023-11-11
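The sketch below gives a concrete picture of the two ideas summarized in the abstract: a critic that decomposes Q(s, a) into a separately estimated state value V(s) plus an advantage A(s, a), and a target-Q computation that perturbs the target action with clipped Gaussian noise and takes the minimum over twin target critics to damp overestimation. This is a minimal illustrative sketch under our own assumptions (the network sizes, noise parameters, and the [-1, 1] action range are hypothetical); it follows the general TD3-style recipe suggested by the abstract and is not the authors' implementation.

```python
# Illustrative sketch only -- not the authors' code. Network sizes, noise
# parameters, and the assumed action range [-1, 1] are hypothetical.
import torch
import torch.nn as nn

class AdvantageCritic(nn.Module):
    """Critic that separates state value from action advantage: Q(s, a) = V(s) + A(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        self.a = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, state, action):
        # V(s) + A(s, a)
        return self.v(state) + self.a(torch.cat([state, action], dim=-1))

@torch.no_grad()
def min_target_q(reward, not_done, next_state, target_actor,
                 target_critic1, target_critic2,
                 gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Target Q with Gaussian smoothing of the target action and a minimum
    over two target critics to reduce overestimation (TD3-style)."""
    next_action = target_actor(next_state)
    noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
    next_action = (next_action + noise).clamp(-1.0, 1.0)  # assumed action bounds
    q_min = torch.min(target_critic1(next_state, next_action),
                      target_critic2(next_state, next_action))
    return reward + gamma * not_done * q_min
```

In the paper's multi-target setting, the state value is additionally estimated with the help of map information; that component is not shown in this sketch.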
