
Deep reinforcement learning framework and algorithms integrated with cognitive behavior models
Citation: CHEN Hao, LI Jia-xiang, HUANG Jian, WANG Chang, LIU Quan, ZHANG Zhong-jie. Deep reinforcement learning framework and algorithms integrated with cognitive behavior models[J]. Control and Decision, 2023, 38(11): 3209-3218.
Authors: CHEN Hao, LI Jia-xiang, HUANG Jian, WANG Chang, LIU Quan, ZHANG Zhong-jie
Affiliation: College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
Funding: National Natural Science Foundation of China (61906202)
Abstract: When facing complex tasks with high-dimensional continuous state spaces or sparse rewards, it is difficult for a deep reinforcement learning agent to learn an optimal policy from scratch, and how to represent existing knowledge in a form that both humans and the learning agent can understand, and then use it to effectively accelerate policy convergence, remains a difficult problem. To address this, a deep reinforcement learning (DRL) framework integrated with cognitive behavior models is proposed: domain prior knowledge is modeled as a belief-desire-intention (BDI) based cognitive behavior model that is used to guide the agent's policy learning. Based on this framework, a deep Q-learning algorithm with the cognitive behavior model (COG-DQN) and a proximal policy optimization algorithm with the cognitive behavior model (COG-PPO) are proposed, and the way the cognitive behavior model guides policy updates is designed quantitatively. Finally, experiments in a typical gym environment and an air combat maneuver confrontation environment verify that the proposed algorithms can efficiently exploit the cognitive behavior model to accelerate policy learning and effectively alleviate the impact of a huge state space and sparse environment rewards.

Keywords: cognitive behavior model; reinforcement learning; proximal policy optimization; deep Q-network; belief-desire-intention; GOAL; air combat maneuver decision-making
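
The abstract describes the guidance mechanism only at a high level, so the sketch below is a minimal, hypothetical illustration of how a BDI-style cognitive behavior model might bias exploration in a DQN-like agent. It is not the authors' published code: the names CognitiveModel, intended_action and guided_action, the rule format, and the mixing coefficient beta are assumptions introduced here purely for illustration.

# Illustrative sketch only, not the authors' method: one plausible way a
# BDI-style rule base could steer an epsilon-greedy DQN agent's exploration.
import random

class CognitiveModel:
    """Toy belief-desire-intention model: prioritized rules map beliefs
    (predicates over the state) to an intended action."""

    def __init__(self, rules):
        # rules: list of (condition, action) pairs checked in priority order,
        # where condition is a callable taking the state and returning a bool
        self.rules = rules

    def intended_action(self, state):
        for condition, action in self.rules:
            if condition(state):   # the belief attached to this rule holds
                return action      # commit to the corresponding intention
        return None                # no applicable intention for this state

def guided_action(q_values, state, cog_model, epsilon=0.1, beta=0.5):
    """Epsilon-greedy selection in which an exploratory step is, with
    probability beta, replaced by the cognitive model's intended action."""
    n_actions = len(q_values)
    if random.random() < epsilon:                     # exploration branch
        suggestion = cog_model.intended_action(state)
        if suggestion is not None and random.random() < beta:
            return suggestion                         # follow prior knowledge
        return random.randrange(n_actions)            # uniform random action
    return max(range(n_actions), key=lambda a: q_values[a])  # greedy branch

# Example: a CartPole-style rule base, push in the direction the pole leans.
if __name__ == "__main__":
    rules = [
        (lambda s: s[2] > 0.0, 1),   # pole angle positive -> push right
        (lambda s: s[2] <= 0.0, 0),  # pole angle non-positive -> push left
    ]
    model = CognitiveModel(rules)
    fake_state = [0.0, 0.0, 0.05, 0.0]   # [x, x_dot, theta, theta_dot]
    fake_q_values = [0.2, 0.1]
    print(guided_action(fake_q_values, fake_state, model))

Under these assumptions, exploratory steps follow the prior-knowledge rules with probability beta while greedy steps still follow the learned Q-values, which is one simple way a behavior model could accelerate early learning in sparse-reward tasks.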

