Similar Documents
1.
卞爱华  王崇骏  陈世福 《软件学报》2008,19(6):1309-1316
Point-based algorithms are a class of approximate algorithms for partially observable Markov decision processes (POMDPs). They perform the backup operation only on a set of belief points, which avoids linear programming and uses fewer intermediate variables, shifting the computational bottleneck from selecting vectors to generating them. However, these algorithms contain a large amount of repeated and meaningless computation when generating vectors. To address this, a preprocessing method for point-based POMDP algorithms (PPBA) is proposed. The method preprocesses each sampled belief point and, before generating the α-vectors, first determines which action and which α-vectors should be selected, thereby eliminating the repeated computation. PPBA also introduces the notion of basis vectors and exploits the sparsity of the problem to avoid the meaningless computation. Experiments on Perseus show that PPBA greatly improves the execution speed of the algorithm.
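For orientation, the sketch below shows a generic point-based backup at a single belief point, the operation whose vector-generation step PPBA preprocesses; it is a minimal numpy illustration under assumed array layouts, not the paper's implementation.

```python
import numpy as np

def point_based_backup(b, alphas, T, O, R, gamma):
    """One point-based backup at belief point b (a generic sketch of the
    operation the abstract discusses, not the paper's PPBA implementation).
    alphas: list of alpha-vectors of shape (|S|,); T[a]: |S|x|S| transition
    matrix; O[a]: |S|x|Z| observation matrix; R[a]: length-|S| reward vector."""
    best_val, best_vec = -np.inf, None
    for a in range(len(T)):
        g_a = R[a].astype(float).copy()
        for o in range(O[a].shape[1]):
            # Project every alpha-vector one step back through (a, o)...
            projected = [T[a] @ (O[a][:, o] * alpha) for alpha in alphas]
            # ...and keep the one that scores best at this belief point.
            # PPBA's idea is to decide the action and alpha-vector choices per
            # sampled belief before generating vectors, removing repetition.
            g_a += gamma * max(projected, key=lambda v: float(b @ v))
        if b @ g_a > best_val:
            best_val, best_vec = float(b @ g_a), g_a
    return best_vec
```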

2.
Multiple kernel learning shows clear advantages on irregular, large-scale data problems. The regularization path is a technique that solves the multiple kernel learning problem repeatedly and selects the optimal model. When the regularization-path algorithm for multiple kernel learning handles large-scale data, the kernel matrix is large and the computational cost is high, which hurts the efficiency of model optimization. To address this, a multiple kernel learning regularization path approximation algorithm based on CUR matrix decomposition (MKLRPCUR) is proposed. The algorithm first uses CUR decomposition to obtain the factor matrices of a low-rank approximation of the kernel matrix, then replaces the kernel matrix with products of these low-dimensional factors during the solution process and reorders the matrix computations, thereby simplifying the product of the kernel matrix and the Lagrange-multiplier vector. MKLRPCUR reduces the scale of the matrix computations, optimizes them, and improves the efficiency of the exact algorithm. The relative error of the low-rank approximation and the time complexity of the algorithm are analyzed theoretically, verifying the soundness of the approximation. Experiments on UCI data sets and the ORL and COIL image databases show that the proposed approximation not only preserves learning accuracy but also reduces running time and improves model efficiency.
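A minimal sketch of the idea follows: build a CUR-style low-rank factorization of the kernel matrix and evaluate kernel-vector products through the factors. Uniform sampling and the function names are assumptions made for brevity; the paper's CUR variant may differ.

```python
import numpy as np

def cur_approximation(K, c, r, seed=0):
    """Build a CUR-style factorization K ~ C @ U @ R by sampling c columns and
    r rows (uniform sampling here purely for brevity; the paper's sampling
    scheme may differ)."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(K.shape[1], size=c, replace=False)
    rows = rng.choice(K.shape[0], size=r, replace=False)
    C = K[:, cols]                                 # n x c
    R = K[rows, :]                                 # r x n
    U = np.linalg.pinv(K[np.ix_(rows, cols)])      # c x r, pseudo-inverse of the core
    return C, U, R

def fast_kernel_matvec(C, U, R, v):
    """Replace the O(n^2) product K @ v by the chain C @ (U @ (R @ v)); this
    reordering of the matrix products is the computational trick the abstract
    describes for the kernel-matrix / Lagrange-multiplier product."""
    return C @ (U @ (R @ v))
```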

3.
Learning the Similarity Matrix in Ontology Algorithms   (total citations: 1, self-citations: 0, cited by others: 1)
Computing the similarity between vertices of an ontology graph is the essence of all kinds of ontology algorithms. The similarities of all vertex pairs in the ontology graph form the ontology similarity matrix, so obtaining an optimal similarity matrix is the substance of ontology applications. This paper proposes a method that obtains the ontology similarity matrix by computing a distance matrix, focusing on sparsity in the dimensionality-reduction process and on the smoothness of the solution. From the sample set, a set S of similar vertex pairs and a set D of dissimilar vertex pairs are obtained, yielding a triplet set Γ. The information in Γ is incorporated into the computational model so that the distance matrix preserves the distance structure between vertices of the original ontology graph. Drawing on smooth approximation methods for convex min-max optimization models, a solution strategy for the distance-matrix model is obtained. Finally, two experiments show that the proposed similarity-matrix computation method is efficient for ontology similarity computation in specific application domains and for establishing mappings between different ontologies.
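The sketch below illustrates one plausible way to assemble the triplet set Γ from the pair sets S and D and to penalize a candidate distance matrix with a margin term; the construction and the margin value are illustrative assumptions, not the paper's exact formulation.

```python
def build_triplets(similar_pairs, dissimilar_pairs):
    """Assemble a triplet set Gamma: each (i, j, k) pairs a vertex i with a
    similar vertex j and a dissimilar vertex k (illustrative construction)."""
    by_anchor = {}
    for i, k in dissimilar_pairs:
        by_anchor.setdefault(i, []).append(k)
    return [(i, j, k) for i, j in similar_pairs for k in by_anchor.get(i, [])]

def triplet_violation(dist, triplets, margin=1.0):
    """Hinge-style penalty asking each similar pair to be closer than the
    paired dissimilar pair by a margin; dist is the learned distance matrix."""
    return sum(max(0.0, margin + dist[i, j] - dist[i, k]) for i, j, k in triplets)
```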

4.
王晞阳  陈继林  李猛  刘首文 《计算机工程》2022,48(7):199-205+213
In power-system simulation, solving large sparse matrices consumes a large amount of storage and computation, and failing to exploit the sparsity of the matrix leads to wasted storage and low computational efficiency. Existing research on sparse solvers mainly targets many-core accelerators and focuses on extracting the parallelism of level sets to improve parallel efficiency, but frequent cache checks and fine-grained memory accesses on many-core architectures can cause performance problems. For solving lower-triangular sparse systems on field-programmable gate arrays (FPGAs), a static-scheduling solution algorithm is proposed on top of the FPGA sparse-matrix solver hardware structure designed by 吴志勇 et al. By preprocessing the sparse matrix and designing the data-distribution and instruction-placement flow, the solution of the lower-triangular sparse system is statically mapped onto multiple on-chip processing elements of the FPGA, achieving parallel, high-speed lower-triangular solves on the FPGA. All implicit parallelism in the serial algorithm is arranged into buffers so that every compute unit can efficiently overlap computation, memory access, and inter-unit communication, making maximal use of the FPGA hardware resources. Tests on typical cases show that, compared with traditional CPU/GPU solvers, the algorithm achieves a speedup of 5 to 10 times.
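To make the scheduling idea concrete, the following CPU-side sketch groups the rows of a lower-triangular sparse matrix into dependency levels (a static schedule) and then solves level by level, where rows within a level are independent and could be assigned to separate processing elements. It only illustrates the principle, not the FPGA design described in the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_schedule(L):
    """Group the unknowns of a sparse lower-triangular matrix into levels whose
    members have no mutual dependencies (a static schedule, computed once)."""
    Lc = csr_matrix(L)
    n = Lc.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        deps = Lc.indices[Lc.indptr[i]:Lc.indptr[i + 1]]
        deps = deps[deps < i]
        level[i] = (level[deps].max() + 1) if deps.size else 0
    return [np.where(level == l)[0] for l in range(level.max() + 1)]

def solve_lower_triangular(L, b, levels):
    """Solve L x = b level by level; rows inside one level are independent and
    could run in parallel on different processing elements."""
    Lc = csr_matrix(L)
    x = np.zeros(len(b), dtype=float)
    for rows in levels:
        for i in rows:                                   # parallelizable loop
            cols = Lc.indices[Lc.indptr[i]:Lc.indptr[i + 1]]
            vals = Lc.data[Lc.indptr[i]:Lc.indptr[i + 1]]
            off_diag = sum(v * x[j] for j, v in zip(cols, vals) if j != i)
            x[i] = (b[i] - off_diag) / Lc[i, i]
    return x
```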

5.
To address fault recovery of AUV software in partially observable environments, a repair method based on a POMDP model and micro-reboot techniques is proposed, following the theory of partially observable Markov decision processes. A multi-level micro-reboot repair scheme is designed according to the layered structure of AUV software, and a POMDP model for AUV software self-repair is constructed. A point-based value iteration algorithm is used to solve the model and generate a repair policy, so that the system can execute repair actions at low repair cost in partially observable environments. Simulation experiments verify the effectiveness of the algorithm and the applicability of the model.

6.
Approximate Model Selection for Support Vector Machines Based on the Regularization Path   (total citations: 2, self-citations: 0, cited by others: 2)
Model selection is a fundamental problem for support vector machines. Based on approximate computation of the kernel matrix and the regularization path, a new SVM model selection method is proposed. First, a preliminary theory of approximate model selection is developed, including a kernel matrix approximation algorithm KMA-α, a theorem bounding the approximation error of KMA-α, and, from it, a model approximation error bound for support vector machines. Then, the approximate model selection algorithm AMSRP is proposed. AMSRP uses the low-rank kernel approximation computed by KMA-α to improve the efficiency of SVM training, and uses a regularization-path algorithm to improve the efficiency of tuning the penalty parameter C. Finally, comparative experiments on standard data sets verify the feasibility and computational efficiency of AMSRP. The results show that AMSRP significantly improves the efficiency of SVM model selection while preserving test-set accuracy. Theoretical analysis and experimental results indicate that AMSRP is a sound and efficient model selection algorithm.
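As a rough illustration of tuning the penalty parameter C, the sketch below sweeps a grid of C values and scores each model on a held-out split. This plain sweep only stands in for the regularization-path tuning in AMSRP, and KMA-α's low-rank kernel approximation is not reproduced; the function and parameter names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_C_on_grid(X, y, C_grid=np.logspace(-2, 3, 12)):
    """Pick C by sweeping a grid and scoring on a validation split (a stand-in
    for path-based tuning, not the AMSRP algorithm itself)."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
    best_C, best_acc = None, -np.inf
    for C in C_grid:
        acc = SVC(C=C, kernel="rbf", gamma="scale").fit(X_tr, y_tr).score(X_va, y_va)
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc
```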

7.
Trial-based value iteration algorithms are an effective class of algorithms for solving partially observable Markov decision process (POMDP) models, among which FSVI is currently one of the fastest. However, for larger POMDP problems, the time FSVI spends computing the MDP value function cannot be ignored. A value iteration algorithm based on the shortest Hamiltonian path (SHP-VI) is proposed. The method computes an optimal belief-state trajectory with an ant colony algorithm for the shortest Hamiltonian path problem, and then updates the value function backwards over these belief states. Experimental comparison with FSVI shows that SHP-VI greatly improves the efficiency with which trial-based algorithms compute belief-state trajectories.
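The sketch below shows the overall shape of such a sweep: order the belief points along a short Hamiltonian path, then back up the value function in reverse order. A nearest-neighbour heuristic is substituted for the paper's ant colony solver, and `backup(b, alphas)` is an assumed point-based backup routine.

```python
import numpy as np

def short_hamiltonian_order(beliefs):
    """Order belief points along a short Hamiltonian path (nearest-neighbour
    heuristic substituted for the paper's ant colony algorithm)."""
    remaining = list(range(1, len(beliefs)))
    path = [0]
    while remaining:
        last = beliefs[path[-1]]
        nxt = min(remaining, key=lambda j: float(np.abs(beliefs[j] - last).sum()))
        remaining.remove(nxt)
        path.append(nxt)
    return path

def shp_vi_sweep(beliefs, alphas, backup):
    """One SHP-VI style sweep: update the value function backwards along the
    ordered belief trajectory, so earlier beliefs see already-improved values."""
    for idx in reversed(short_hamiltonian_order(beliefs)):
        alphas.append(backup(beliefs[idx], alphas))
    return alphas
```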

9.
仵博  吴敏 《控制与决策》2007,22(12):1417-1420
Solving over the belief state space of a partially observable Markov decision process (POMDP) is an NP-hard problem. A belief state space compression (BSSC) algorithm is proposed. It compresses the high-dimensional belief state space to a low dimension and uses a dynamic Bayesian network to compress the transition function, observation function and reward function, reducing the size of the problem to be solved so that real-time decisions become possible. Comparative experiments show that the proposed algorithm can quickly compute the optimal policy and the optimal value function.
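For intuition only, the sketch below compresses a set of sampled belief vectors with a linear projection onto their leading principal directions. This PCA-style compression merely illustrates shrinking a high-dimensional belief space; the paper's BSSC uses a dynamic Bayesian network, not PCA.

```python
import numpy as np

def compress_belief_space(B, k):
    """Project a matrix of sampled belief vectors (one per row) onto their
    top-k principal directions (illustrative linear compression only)."""
    centered = B - B.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:k].T                      # |S| x k projection matrix
    return B @ W, W
```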

10.
A POMDP Model for Spoken Dialogue Systems and Its Solution   (total citations: 3, self-citations: 0, cited by others: 3)
Many spoken dialogue systems have entered practical use, but a good dialogue management model has long been lacking. Treating dialogue management as a stochastic optimization problem and modeling it with a Markov decision process (MDP) is a recent direction, but the uncertainty of the dialogue state means that an MDP cannot model dialogue well. A new spoken dialogue system model based on partially observable MDPs (POMDPs) is proposed, using partial observability to handle the uncertainty. Because exact solution algorithms have limited applicability, many heuristic approximate algorithms are examined for their suitability to this model, and some of them are improved; for example, for the grid-based approximation algorithm, two simulation-based grid-point selection methods are proposed.
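The following sketch shows one way to collect grid (belief) points by simulating the model forward, in the spirit of the simulation-based grid selection mentioned above. The array layouts and the exploration policy `policy(b)` are assumptions; the paper proposes two specific variants that are not reproduced here.

```python
import numpy as np

def simulate_grid_points(T, O, b0, policy, n_points, seed=0):
    """Collect belief (grid) points by simulating forward from the initial
    belief b0. T[a]: |S|x|S| transition matrix; O[a]: |S|x|Z| observation
    matrix; policy(b) returns an action index."""
    rng = np.random.default_rng(seed)
    grid, b = [b0.copy()], b0.copy()
    s = rng.choice(len(b0), p=b0)
    while len(grid) < n_points:
        a = policy(b)
        s = rng.choice(len(b0), p=T[a][s])          # sample next hidden state
        o = rng.choice(O[a].shape[1], p=O[a][s])    # sample an observation there
        b = O[a][:, o] * (T[a].T @ b)               # Bayesian belief update
        b /= b.sum()
        grid.append(b.copy())
    return grid
```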

11.
A partially observable Markov decision process is solved by introducing a belief state space that converts a non-Markovian problem into a Markov chain problem; its ability to describe the real world makes it an important branch of research on stochastic decision processes. This paper introduces the basic principles and decision procedure of partially observable Markov decision processes and proposes a POMDP algorithm based on policy iteration and value iteration. Using ideas from linear programming and dynamic programming, the algorithm addresses the "curse of dimensionality" that arises when the belief state space is large and obtains an approximately optimal Markov decision solution. Experimental data show that the algorithm is feasible and effective.

12.
Point-based value iteration algorithms are an effective class of algorithms for POMDP problems. PBVI is the classic point-set-based algorithm, but its efficiency is rather low. FSVI uses the optimal policy of the underlying MDP to reduce algorithmic complexity, but performs poorly on large-scale problems. To address these issues, a forward search value iteration algorithm based on optimized environment-state distributions (PBVI-OSD) is proposed: the best action is chosen by a weight-based QMDP score, the most likely state is selected from the belief state and the transition function, and, given the action and state, an observation whose probability exceeds a threshold is chosen at random. This yields a set of successor belief points with greater exploratory value and improves the quality of value-iteration convergence. Experiments on four benchmark problems show that, compared with FSVI and PBVI, PBVI-OSD guarantees convergence efficiency and, in particular, converges to better global optima on large-scale problems.
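The sketch below follows the expansion step described above: a QMDP-style action choice, the most likely state, and a randomly chosen observation whose probability exceeds a threshold, producing the successor belief. The threshold value, fallback branch, and parameter names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def expand_belief(b, Q_mdp, T, O, seed=0, obs_threshold=0.05):
    """One PBVI-OSD-style belief expansion (sketch). Q_mdp: |A|x|S| MDP
    action-value table; T[a]: |S|x|S|; O[a]: |S|x|Z|."""
    rng = np.random.default_rng(seed)
    a = int(np.argmax(Q_mdp @ b))                   # belief-weighted QMDP action
    s = int(np.argmax(b))                           # most likely current state
    s_next = int(np.argmax(T[a][s]))                # most likely successor state
    p_obs = O[a][s_next]
    candidates = np.where(p_obs > obs_threshold)[0]
    if candidates.size == 0:                        # fallback if nothing passes
        candidates = np.arange(len(p_obs))
    o = int(rng.choice(candidates))
    b_next = O[a][:, o] * (T[a].T @ b)              # successor belief
    return b_next / b_next.sum(), a, o
```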

13.
This paper describes a statistically motivated framework for performing real-time dialogue state updates and policy learning in a spoken dialogue system. The framework is based on the partially observable Markov decision process (POMDP), which provides a well-founded, statistical model of spoken dialogue management. However, exact belief state updates in a POMDP model are computationally intractable, so approximate methods must be used. This paper presents a tractable method based on the loopy belief propagation algorithm. Various simplifications are made, which improve the efficiency significantly compared to the original algorithm as well as compared to other POMDP-based dialogue state updating approaches. A second contribution of this paper is a method for learning in spoken dialogue systems which uses a component-based policy with the episodic Natural Actor Critic algorithm. The framework proposed in this paper was tested both in simulation and in a user trial. Both indicated that using Bayesian updates of the dialogue state significantly outperforms traditional definitions of the dialogue state. Policy learning worked effectively and the learned policy outperformed all others on simulations. In user trials the learned policy was also competitive, although its optimality was less conclusive. Overall, the Bayesian update of dialogue state framework was shown to be a feasible and effective approach to building real-world POMDP-based dialogue systems.

14.
仵博  吴敏 《计算机工程与设计》2007,28(9):2116-2119,2126
A partially observable Markov decision process is solved by introducing a belief state space that converts a non-Markovian problem into a Markov chain problem; its ability to describe the real world makes it an important branch of research on stochastic decision processes. This paper introduces the basic principles and decision procedure of partially observable Markov decision processes, then presents three typical algorithms: the Witness algorithm of Littman et al., the Incremental Pruning algorithm, and the point-based value iteration algorithm of Pineau et al., and compares and analyzes the three. Applications of partially observable Markov decision processes are also described.

15.
Partially observable Markov decision processes (POMDPs) are an extension of Markov decision processes (MDPs). POMDPs are commonly used to model agents making decisions in partially observable stochastic environments. To address the poor scalability of methods that solve the complete POMDP, the proposed approach decomposes a multivariate POMDP into a set of restricted POMDPs, solves each of these models independently to obtain a value function, and then combines the value functions of the restricted POMDPs to obtain a policy for the complete POMDP. The paper describes the process of identifying the state variables relevant to an individual task and how to construct a model restricted to a single task. The method is applied to two RockSample problems of different sizes, and the experimental results show that it obtains good policies.

16.
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning under stochastic and partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) using belief states. However, because the belief state space is continuous and multi-dimensional, the problem is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require a complete POMDP model of the environment, which is not always practical. This article introduces a modified memory-based reinforcement learning algorithm called modified U-Tree that is capable of learning from raw sensor experiences with minimum prior knowledge. This article describes an enhancement of the original U-Tree’s state generation process to make the generated model more compact, and also proposes a modification of the statistical test for reward estimation, which allows the algorithm to be benchmarked against some traditional model-based algorithms with a set of well known POMDP problems.

17.
Monte-Carlo tree search for Bayesian reinforcement learning   (total citations: 2, self-citations: 2, cited by others: 0)
Bayesian model-based reinforcement learning can be formulated as a partially observable Markov decision process (POMDP) to provide a principled framework for optimally balancing exploitation and exploration. Then, a POMDP solver can be used to solve the problem. If the prior distribution over the environment’s dynamics is a product of Dirichlet distributions, the POMDP’s optimal value function can be represented using a set of multivariate polynomials. Unfortunately, the size of the polynomials grows exponentially with the problem horizon. In this paper, we examine the use of an online Monte-Carlo tree search (MCTS) algorithm for large POMDPs, to solve the Bayesian reinforcement learning problem online. We will show that such an algorithm successfully searches for a near-optimal policy. In addition, we examine the use of a parameter tying method to keep the model search space small, and propose the use of nested mixture of tied models to increase robustness of the method when our prior information does not allow us to specify the structure of tied models exactly. Experiments show that the proposed methods substantially improve scalability of current Bayesian reinforcement learning methods.
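As a small illustration of the tree-search machinery such planners rely on, the sketch below implements UCB1 child selection, the usual selection rule inside MCTS. The exploration constant and the dict-based node layout are illustrative assumptions, not details from the paper.

```python
import math

def ucb_select(children, c=1.4):
    """UCB1 child selection as used inside generic MCTS planners.
    Each child is a dict with 'visits' and 'total_return'."""
    total_visits = sum(ch["visits"] for ch in children) or 1

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                      # try unvisited children first
        mean = ch["total_return"] / ch["visits"]
        return mean + c * math.sqrt(math.log(total_visits) / ch["visits"])

    return max(children, key=score)
```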

18.
We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful Alpha Zero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby at every stage the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents. For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate form, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
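A minimal sketch of the one-agent-at-a-time stage follows: each agent in turn optimizes only its own control component, with earlier agents' choices fixed and later agents following the base policy, so per-stage work grows linearly in the number of agents. The interfaces `agents[i].actions(state)`, `base_policy(state, agent)`, and `q_value(state, joint_control)` are assumptions of this sketch (e.g., `q_value` could simulate the base policy from the resulting next state).

```python
def multiagent_rollout_stage(state, agents, base_policy, q_value):
    """One stage of one-agent-at-a-time rollout (sketch under assumed
    interfaces, not the paper's implementation)."""
    chosen = []
    for i, agent in enumerate(agents):
        best_u, best_q = None, float("-inf")
        # Agents after i provisionally follow the base policy.
        tail = [base_policy(state, other) for other in agents[i + 1:]]
        for u in agent.actions(state):
            q = q_value(state, chosen + [u] + tail)
            if q > best_q:
                best_u, best_q = u, q
        chosen.append(best_u)          # fix this agent's component and move on
    return chosen
```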

19.
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, which is called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximation of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning, and previous approximate policy iteration methods such as LSPI and KLSPI.
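As a generic stand-in for the continuous-action policy search step, the sketch below maximizes an estimated Q-function over a one-dimensional bounded action interval with a standard scalar optimizer; `q_func(state, action)` is an assumed TD-learned action-value estimate, not CAPI's actual search technique.

```python
from scipy.optimize import minimize_scalar

def greedy_continuous_action(q_func, state, a_low, a_high):
    """Search a 1-D continuous action interval for the action maximizing an
    estimated Q-function (illustrative stand-in for fast policy search)."""
    result = minimize_scalar(lambda a: -q_func(state, a),
                             bounds=(a_low, a_high), method="bounded")
    return result.x
```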

20.
This communique presents an algorithm called “value set iteration” (VSI) for solving infinite horizon discounted Markov decision processes with finite state and action spaces as a simple generalization of value iteration (VI) and as a counterpart to Chang’s policy set iteration. A sequence of value functions is generated by VSI based on manipulating a set of value functions at each iteration and it converges to the optimal value function. VSI preserves convergence properties of VI while converging no slower than VI and in particular, if the set used in VSI contains the value functions of independently generated sample-policies from a given distribution and a properly defined policy switching policy, a probabilistic exponential convergence rate of VSI can be established. Because the set used in VSI can contain the value functions of any policies generated by other existing algorithms, VSI is also a general framework of combining multiple solution methods.
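The sketch below gives one plausible minimal reading of the idea for a finite MDP: at every iteration the Bellman optimality operator is applied to the current value function and to every value function in the supplied set (e.g., value functions of sample-policies), and the pointwise maximum is kept. This is an interpretation of the abstract, not the paper's exact operator.

```python
import numpy as np

def value_set_iteration(P, R, gamma, extra_values, n_iter=200):
    """VSI-style iteration sketch for a finite MDP. P[a]: |S|x|S| transition
    matrix; R[a]: length-|S| reward vector; extra_values: list of value
    functions of other policies used to accelerate the iteration."""
    V = np.zeros(R[0].shape[0])
    for _ in range(n_iter):
        candidates = [V] + list(extra_values)
        # Bellman optimality update applied to every candidate value function.
        updates = [np.max([R[a] + gamma * P[a] @ W for a in range(len(P))], axis=0)
                   for W in candidates]
        V = np.max(updates, axis=0)    # keep the pointwise maximum
    return V
```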


