共查询到19条相似文献,搜索用时 109 毫秒
1.
逻辑马尔可夫决策过程和关系马尔可夫决策过程的引入,使得人们可能简洁地、陈述地表达复杂的马尔可夫决策过程。本文首先介绍有关逻辑马尔可夫决策过程和关系马尔可夫决策过程的概念,然后重点介绍它们与普通的马尔可夫决策过程根本不同的一些算法:①依赖于基本状态空间RL的转换法;②把Bellman方程推广到抽象状态空间的方法;③利用策略偏置空间寻求近似最优策略方法。最后对它们的研究现状进行总结及其对它们发展的一些展望。 相似文献
2.
人类在处理问题中往往分为两个层次,首先在整体上把握问题,即提出大体方案,然后再具体实施.也就是说人类就是具有多分辨率智能系统的极好例子,他能够在多个层次上从底向上泛化(即看问题角度粒度变"粗",它类似于抽象),并且又能从顶向下进行实例化(即看问题角度变"细",它类似于具体化).由此构造了由在双层(理想空间即泛化和实际空间即实例化)上各自运行的马尔可夫决策过程组成的半马尔可夫决策过程,称之为双马尔可夫决策过程联合模型.然后讨论该联合模型的最优策略算法,最后给出一个实例说明双马尔可夫决策联合模型能够经济地节约"思想",是运算有效性和可行性的一个很好的折中. 相似文献
3.
近些年来,群体动画在机器人学、电影、游戏等领域得到了广泛的研究和应用,但传统的群体动画技术均涉及复杂的运动规划或碰撞避免操作,计算效率较低.本文提出了一种基于马尔可夫决策过程(MDPs)的群体动画运动轨迹生成算法,该算法无需碰撞检测即可生成各智能体的无碰撞运动轨迹.同时本文还提出了一种改进的值迭代算法用于求解马尔可夫决策过程的状态-值,利用该算法在栅格环境中进行实验,结果表明该算法的计算效率明显高于使用欧氏距离作为启发式的值迭代算法和Dijkstra算法.利用本文提出的运动轨迹生成算法在三维(3D)动画场景中进行群体动画仿真实验,结果表明该算法可实现群体无碰撞地朝向目标运动,并具有多样性. 相似文献
4.
5.
部分可观察马尔可夫决策过程是通过引入信念状态空间将非马尔可夫链问题转化为马尔可夫链问题来求解,其描述真实世界的特性使它成为研究随机决策过程的重要分支.介绍了部分可观察马尔可夫决策过程的基本原理和决策过程,然后介绍了3种典型的算法,它们分别是Littman等人的Witness算法、hcremental Pruning算法和Pineau等人的基于点的值迭代算法,对这3种算法进行了分析比较.讲述部分可观察马尔可夫决策过程的应用. 相似文献
6.
7.
8.
部分可观察马尔可夫决策过程在策略空间和状态空间上的计算复杂性,使求解其一个最优策略成为NP-hard难题.为此,提出一种动态影响图模型来建模不确定环境下的Agent动态决策问题.动态影响图模型以有向无环图表示系统变量之间的复杂关系.首先,动态影响图利用动态贝叶斯网络表示转移模型和观察模型以简化系统的状态空间;其次,效用函数以效用结点的形式清晰地表示出来,从而简化系统效用函数的表示;最后,通过决策结点表示系统的行为来简化系统的策略空间.通过实例从3个方面和POMDP模型进行了比较,研究的结果表明,动态影响图模型为大型的POMDP问题提供了一种简明的表示方式,最后在Robocup环境初步验证了该模型. 相似文献
9.
分层强化学习研究进展* 总被引:1,自引:0,他引:1
首先介绍了半马尔可夫决策过程、分层与抽象等分层强化学习的理论基础;其次,较全面地比较HAM、options、MAXQ和HEXQ四种典型的学习方法,从典型学习方法的拓展、学习分层、部分感知马尔可夫决策过程、并发和多agent合作等方面讨论分层强化学习的研究现状;最后指出分层强化学习未来的发展方向。 相似文献
10.
马尔可夫决策过程自适应决策的进展 总被引:6,自引:0,他引:6
在介绍一般马尔可夫决策过程的基础上,分析了当前主要马尔可夫过程自适应决策方法的基本思想、具体算法实现以及相应结论,总结了现有马尔可夫过程自适应决策算法的特点,并指出了需要进一步解决的问题。 相似文献
11.
Yat-wah Wan Author Vitae Author Vitae 《Automatica》2006,42(3):393-403
The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one where the optimal actions of different states can be related to each other. Then we give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP. A two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and M to M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in a recent paper [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929-943.], and the lower-level MDPs are solved by embedded Markov chains. 相似文献
12.
Weighted Markov decision processes (MDPs) have long been used to model quantitative aspects of systems in the presence of uncertainty. However, much of the literature on such MDPs takes a monolithic approach, by modelling a system as a particular MDP; properties of the system are then inferred by analysis of that particular MDP. In contrast in this paper we develop compositional methods for reasoning about weighted MDPs, as a possible basis for compositional reasoning about their quantitative behaviour. In particular we approach these systems from a process algebraic point of view. For these we define a coinductive simulation-based behavioural preorder which is compositional in the sense that it is preserved by structural operators for constructing weighted MDPs from components. 相似文献
13.
Dmitri A. Dolgov Edmund H. Durfee 《Annals of Mathematics and Artificial Intelligence》2006,47(3-4):273-293
A weakness of classical Markov decision processes (MDPs) is that they scale very poorly due to the flat state-space representation.
Factored MDPs address this representational problem by exploiting problem structure to specify the transition and reward functions
of an MDP in a compact manner. However, in general, solutions to factored MDPs do not retain the structure and compactness
of the problem representation, forcing approximate solutions, with approximate linear programming (ALP) emerging as a promising
MDP-approximation technique. To date, most ALP work has focused on the primal-LP formulation, while the dual LP, which forms
the basis for solving constrained Markov problems, has received much less attention. We show that a straightforward linear
approximation of the dual optimization variables is problematic, because some of the required computations cannot be carried
out efficiently. Nonetheless, we develop a composite approach that symmetrically approximates the primal and dual optimization
variables (effectively approximating both the objective function and the feasible region of the LP), leading to a formulation
that is computationally feasible and suitable for solving constrained MDPs. We empirically show that this new ALP formulation
also performs well on unconstrained problems.
相似文献
14.
15.
平均和折扣准则MDP基于TD(0)学习的统一NDP方法 总被引:3,自引:0,他引:3
为适应实际大规模M arkov系统的需要,讨论M arkov决策过程(MDP)基于仿真的学习优化问题.根据定义式,建立性能势在平均和折扣性能准则下统一的即时差分公式,并利用一个神经元网络来表示性能势的估计值,导出参数TD(0)学习公式和算法,进行逼近策略评估;然后,根据性能势的逼近值,通过逼近策略迭代来实现两种准则下统一的神经元动态规划(neuro-dynam ic programm ing,NDP)优化方法.研究结果适用于半M arkov决策过程,并通过一个数值例子,说明了文中的神经元策略迭代算法对两种准则都适用,验证了平均问题是折扣问题当折扣因子趋近于零时的极限情况. 相似文献
16.
Abhijit Gosavi Author Vitae 《Automatica》2006,42(8):1321-1330
While risk-sensitive (RS) approaches for designing plans of total productive maintenance are critical in manufacturing systems, there is little in the literature by way of theoretical modeling. Developing such plans often requires the solution of a discrete-time stochastic control-optimization problem. Renewal theory and Markov decision processes (MDPs) are commonly employed tools for solving the underlying problem. The literature on preventive maintenance, for the most part, focuses on minimizing the expected net cost, and disregards issues related to minimizing risks. RS maintenance managers employ safety factors to modify the risk-neutral solution in an attempt to heuristically accommodate elements of risk in their decision making. In this paper, our efforts are directed toward developing a formal theory for developing RS preventive-maintenance plans. We employ the Markowitz paradigm in which one seeks to optimize a function of the expected cost and its variance. In particular, we present (i) a result for an RS approach in the setting of renewal processes and (ii) a result for solving an RS MDP. We also provide computational results to demonstrate the efficacy of these results. Finally, the theory developed here is of sufficiently general nature that can be applied to problems in other relevant domains. 相似文献
17.
This communique provides an exact iterative search algorithm for the NP-hard problem of obtaining an optimal feasible stationary Markovian pure policy that achieves the maximum value averaged over an initial state distribution in finite constrained Markov decision processes. It is based on a novel characterization of the entire feasible policy space and takes the spirit of policy iteration (PI) in that a sequence of monotonically improving feasible policies is generated and converges to an optimal policy in iterations of the size of the policy space at the worst case. Unlike PI, an unconstrained MDP needs to be solved at iterations involved with feasible policies and the current best policy improves all feasible policies included in the union of the policy spaces associated with the unconstrained MDPs. 相似文献
18.
Cao's work shows that, by defining an α-dependent equivalent infinitesimal generator A α, a semi-Markov decision process (SMDP) with both average- and discounted-cost criteria can be treated as an α-equivalent Markov decision process (MDP), and the performance potential theory can also be developed for SMDPs. In this work, we focus on establishing error bounds for potential and A α-based iterative optimization methods. First, we introduce an α-uniformized Markov chain (UMC) for a SMDP via A α and a uniformized parameter, and show their relations. Especially, we obtain that their performance potentials, as solutions of corresponding Poisson equations, are proportional, so that the studies of a SMDP and the α-UMC based on potentials are unified. Using these relations, we derive the error bounds for a potential-based policy-iteration algorithm and a value-iteration algorithm, respectively, when there exist various calculation errors. The obtained results can be applied directly to the special models, i.e., continuous-time MDPs and Markov chains, and can be extended to some simulation-based optimization methods such as reinforcement learning and neuro-dynamic programming, where estimation errors or approximation errors are common cases. Finally, we give an application example on the look-ahead control of a conveyor-serviced production station (CSPS), and show the corresponding error bounds. 相似文献
19.
Edge-clouds provide a promising new approach to significantly reduce network operational costs by moving computation closer to the edge. A key challenge in such systems is to decide where and when services should be migrated in response to user mobility and demand variation. The objective is to optimize operational costs while providing rigorous performance guarantees. In this paper, we model this as a sequential decision making Markov Decision Problem (MDP). However, departing from traditional solution methods (such as dynamic programming) that require extensive statistical knowledge and are computationally prohibitive, we develop a novel alternate methodology. First, we establish an interesting decoupling property of the MDP that reduces it to two independent MDPs on disjoint state spaces. Then, using the technique of Lyapunov optimization over renewals, we design an online control algorithm for the decoupled problem that is provably cost-optimal. This algorithm does not require any statistical knowledge of the system parameters and can be implemented efficiently. We validate the performance of our algorithm using extensive trace-driven simulations. Our overall approach is general and can be applied to other MDPs that possess a similar decoupling property. 相似文献