
Deep reinforcement learning framework and algorithms integrated with cognitive behavior models
Citation: CHEN Hao, LI Jia-xiang, HUANG Jian, WANG Chang, LIU Quan, ZHANG Zhong-jie. Deep reinforcement learning framework and algorithms integrated with cognitive behavior models[J]. Control and Decision, 2023, 38(11): 3209-3218.
Authors: CHEN Hao, LI Jia-xiang, HUANG Jian, WANG Chang, LIU Quan, ZHANG Zhong-jie
Affiliation: College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
Funding: National Natural Science Foundation of China (61906202)
Abstract: When facing complex tasks with high-dimensional continuous state spaces or sparse rewards, it is difficult for a deep reinforcement learning agent to learn an optimal policy from scratch, and how to represent existing knowledge in a form that both humans and the learning agent can understand, and then use it to effectively accelerate policy convergence, remains a difficult problem. To address this, a deep reinforcement learning (DRL) framework integrated with cognitive behavior models is proposed: domain prior knowledge is modeled as a belief-desire-intention (BDI) based cognitive behavior model that is used to guide the agent's policy learning. Based on this framework, a deep Q-learning algorithm with the cognitive behavior model (COG-DQN) and a proximal policy optimization algorithm with the cognitive behavior model (COG-PPO) are proposed, and the way the cognitive behavior model guides policy updates is designed quantitatively. Finally, experiments in a typical gym environment and an air combat maneuver confrontation environment verify that the proposed algorithms can efficiently exploit the cognitive behavior model to accelerate policy learning and effectively alleviate the impact of a huge state space and sparse environment rewards.

Keywords: cognitive behavior model; reinforcement learning; proximal policy optimization; deep Q-network; belief-desire-intention; GOAL; air combat maneuver decision-making
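
The abstract describes the guidance mechanism only at a high level, so the sketch below is a minimal, hypothetical illustration of how a BDI-style cognitive behavior model might bias exploration in a DQN-like agent. It is not the authors' published code: the names CognitiveModel, intended_action and guided_action, the rule format, and the mixing coefficient beta are assumptions introduced here purely for illustration.

# Illustrative sketch only, not the authors' method: one plausible way a
# BDI-style rule base could steer an epsilon-greedy DQN agent's exploration.
import random

class CognitiveModel:
    """Toy belief-desire-intention model: prioritized rules map beliefs
    (predicates over the state) to an intended action."""

    def __init__(self, rules):
        # rules: list of (condition, action) pairs checked in priority order,
        # where condition is a callable taking the state and returning a bool
        self.rules = rules

    def intended_action(self, state):
        for condition, action in self.rules:
            if condition(state):   # the belief attached to this rule holds
                return action      # commit to the corresponding intention
        return None                # no applicable intention for this state

def guided_action(q_values, state, cog_model, epsilon=0.1, beta=0.5):
    """Epsilon-greedy selection in which an exploratory step is, with
    probability beta, replaced by the cognitive model's intended action."""
    n_actions = len(q_values)
    if random.random() < epsilon:                     # exploration branch
        suggestion = cog_model.intended_action(state)
        if suggestion is not None and random.random() < beta:
            return suggestion                         # follow prior knowledge
        return random.randrange(n_actions)            # uniform random action
    return max(range(n_actions), key=lambda a: q_values[a])  # greedy branch

# Example: a CartPole-style rule base, push in the direction the pole leans.
if __name__ == "__main__":
    rules = [
        (lambda s: s[2] > 0.0, 1),   # pole angle positive -> push right
        (lambda s: s[2] <= 0.0, 0),  # pole angle non-positive -> push left
    ]
    model = CognitiveModel(rules)
    fake_state = [0.0, 0.0, 0.05, 0.0]   # [x, x_dot, theta, theta_dot]
    fake_q_values = [0.2, 0.1]
    print(guided_action(fake_q_values, fake_state, model))

Under these assumptions, exploratory steps follow the prior-knowledge rules with probability beta while greedy steps still follow the learned Q-values, which is one simple way a behavior model could accelerate early learning in sparse-reward tasks.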

