Similar Documents
20 similar documents found
1.
In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers, in a unified way, the complete spectrum from average optimality and bias optimality to all high-order bias optimality. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansions nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias-optimal and high-order bias-optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is part of our effort in developing sensitivity-based learning and optimization theory.
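To make the policy-iteration machinery concrete, the following is a minimal sketch of average-reward policy iteration for a unichain MDP, in which the gain and bias of each policy come from solving the Poisson equation. It is only an illustration under simplifying assumptions (unichain structure, tabular model); it does not reproduce the paper's multichain or nth-order bias algorithms, and all names and data layouts are assumptions.

```python
# Minimal average-reward policy iteration for a *unichain* MDP (illustrative sketch,
# not the paper's multichain / nth-order bias algorithm).
import numpy as np

def evaluate(P, r):
    """Solve the Poisson equation g*1 + h = r + P h with the normalization h[0] = 0."""
    n = len(r)
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                # coefficient of the gain g
    A[:, 1:] = (np.eye(n) - P)[:, 1:]            # (I - P) acting on h, with h[0] fixed to 0
    x = np.linalg.solve(A, r)
    return x[0], np.concatenate(([0.0], x[1:]))  # gain g, bias h

def policy_iteration(P_sa, r_sa, max_iters=100):
    """P_sa: (S, A, S) transition tensor; r_sa: (S, A) reward matrix."""
    n_states = r_sa.shape[0]
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        P = P_sa[np.arange(n_states), policy]    # (S, S) transition matrix of current policy
        r = r_sa[np.arange(n_states), policy]
        g, h = evaluate(P, r)
        q = r_sa + P_sa @ h                      # r(s,a) + sum_s' P(s'|s,a) h(s')
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # average-optimal; bias optimality would
            return policy, g, h                  # further break ties using the bias h
        policy = new_policy
    return policy, g, h
```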

2.
We consider Markov decision processes (MDPs) in which the state transition probability distributions are not uniquely known but are known to belong to given intervals, the so-called "controlled Markov set-chains", under an infinite-horizon discounted reward criterion. We present formal methods for improving multiple policies for solving such controlled Markov set-chains. Our multipolicy improvement methods follow the spirit of parallel rollout and policy switching for solving MDPs. In particular, these methods are useful for online control of Markov set-chains and for designing policy iteration (PI) type algorithms. We develop a PI-type algorithm and prove that it converges to an optimal policy.
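As a rough illustration of the multipolicy improvement idea, the sketch below shows plain policy switching for a standard discounted MDP with exactly known transition probabilities; the interval-valued (set-chain) model and parallel rollout are not handled, and all names are illustrative assumptions.

```python
# Policy switching for a standard discounted MDP (illustrative sketch only; the paper's
# controlled Markov set-chains with interval transition probabilities are not modeled).
import numpy as np

def policy_value(P_sa, r_sa, policy, gamma=0.95):
    n = len(policy)
    P = P_sa[np.arange(n), policy]
    r = r_sa[np.arange(n), policy]
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def policy_switching(P_sa, r_sa, base_policies, gamma=0.95):
    """At each state, act as the base policy whose value is highest there;
    the switched policy's value dominates that of every base policy."""
    values = np.array([policy_value(P_sa, r_sa, pi, gamma) for pi in base_policies])
    best = values.argmax(axis=0)                 # index of the best base policy per state
    return np.array([base_policies[i][s] for s, i in enumerate(best)])
```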

3.
Semi-Markov decision problems and performance sensitivity analysis
Recent research indicates that Markov decision processes (MDPs) can be viewed from a sensitivity point of view, and that perturbation analysis (PA), MDPs, and reinforcement learning (RL) are three closely related areas in the optimization of discrete-event dynamic systems that can be modeled as Markov processes. The goal of this paper is twofold. First, we develop the PA theory for semi-Markov processes (SMPs); then we extend the aforementioned results on the relation among PA, MDPs, and RL to SMPs. In particular, we show that performance sensitivity formulas and policy iteration algorithms for semi-Markov decision processes can be derived from the performance potential and the realization matrix. Both the long-run average and discounted-cost problems are considered. This approach provides a unified framework for both problems, with the long-run average problem corresponding to the discount factor being zero. The results indicate that performance sensitivities and optimization depend only on first-order statistics. Single-sample-path-based implementations are discussed.

4.
A New Multi-Agent Q-Learning Algorithm
郭锐  吴敏  彭军  彭姣  曹卫华 《自动化学报》2007,33(4):367-372
For multi-agent systems in non-deterministic Markov environments, a new multi-agent Q-learning algorithm is proposed. The algorithm learns the behavior policies of the other agents from statistics of joint actions, and uses the full probability distribution over the agents' policy vectors to guarantee selection of the jointly optimal action. The convergence and learning performance of the algorithm are also analyzed. Its application to the multi-agent RoboCup system further demonstrates the algorithm's effectiveness and generalization ability.
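A compact sketch in the spirit described above: a joint-action Q-learner that estimates the other agent's policy from joint-action counts and acts greedily with respect to the expected Q-value under that empirical distribution. This is an illustrative reconstruction under assumptions (two agents, tabular Q, Laplace-smoothed counts, hashable states), not the authors' exact algorithm.

```python
# Illustrative joint-action Q-learner for a two-agent setting: the other agent's policy
# is estimated from joint-action counts (a sketch, not the paper's exact algorithm).
import numpy as np
from collections import defaultdict

class JointActionQLearner:
    def __init__(self, n_my_actions, n_other_actions, alpha=0.1, gamma=0.9):
        self.nA, self.alpha, self.gamma = n_my_actions, alpha, gamma
        self.Q = defaultdict(lambda: np.zeros((n_my_actions, n_other_actions)))
        self.counts = defaultdict(lambda: np.ones(n_other_actions))   # Laplace smoothing

    def expected_q(self, state):
        p_other = self.counts[state] / self.counts[state].sum()       # empirical other-agent policy
        return self.Q[state] @ p_other                                # E_b[Q(s, a, b)] per own action a

    def act(self, state, eps=0.1):
        if np.random.rand() < eps:
            return np.random.randint(self.nA)
        return int(self.expected_q(state).argmax())

    def update(self, state, my_action, other_action, reward, next_state):
        self.counts[state][other_action] += 1
        target = reward + self.gamma * self.expected_q(next_state).max()
        td_error = target - self.Q[state][my_action, other_action]
        self.Q[state][my_action, other_action] += self.alpha * td_error
```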

5.
Devin Schwab  Soumya Ray 《Machine Learning》2017,106(9-10):1569-1598
In this work, we build upon the observation that offline reinforcement learning (RL) is synergistic with task hierarchies that decompose large Markov decision processes (MDPs). Task hierarchies can allow more efficient sample collection from large MDPs, while offline algorithms can learn better policies than the so-called “recursively optimal” or even hierarchically optimal policies learned by standard hierarchical RL algorithms. To enable this synergy, we study sample collection strategies for offline RL that are consistent with a provided task hierarchy while still providing good exploration of the state-action space. We show that naïve extensions of uniform random sampling do not work well in this case and design a strategy that has provably good convergence properties. We also augment the initial set of samples using additional information from the task hierarchy, such as state abstraction. We use the augmented set of samples to learn a policy offline. Given a capable offline RL algorithm, this policy is then guaranteed to have a value greater than or equal to the value of the hierarchically optimal policy. We evaluate our approach on several domains and show that samples generated using a task hierarchy with a suitable strategy allow significantly more sample-efficient convergence than standard offline RL. Further, our approach also shows more sample-efficient convergence to policies with value greater than or equal to hierarchically optimal policies found through an online hierarchical RL approach.

6.
Learning Sequential Decision Rules Using Simulation Models and Competition
The problem of learning decision rules for sequential tasks is addressed, focusing on learning tactical decision rules from a simple flight simulator. The learning method relies on the notion of competition and employs genetic algorithms to search the space of decision policies. Several experiments are presented that address issues arising from differences between the simulation model on which learning occurs and the target environment in which the decision rules are ultimately tested.

7.
Certain methods for constructing embedded Markov decision processes (MDPs) lead to performance measures that are the ratio of two long-run averages. For such MDPs with finite state and action spaces and under an ergodicity assumption, this note presents algorithms for computing optimal policies based on policy iteration, linear programming, value iteration, and Q-learning.
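For an ergodic chain, a ratio-of-two-averages criterion can be handled by a Howard/Jewell-style policy iteration in which the Poisson equation carries the current ratio. The sketch below is a generic illustration of that scheme under assumptions (ergodicity, strictly positive denominator costs), not the note's specific algorithms; all names are illustrative.

```python
# Generic policy iteration for a ratio-of-two-averages criterion (Howard/Jewell style),
# assuming an ergodic chain and strictly positive denominator costs c(s,a).
import numpy as np

def evaluate_ratio(P, r, c):
    """Solve (I - P) h + g c = r with h[0] = 0; returns the ratio-gain g and bias h."""
    n = len(r)
    A = np.zeros((n, n))
    A[:, 0] = c
    A[:, 1:] = (np.eye(n) - P)[:, 1:]
    x = np.linalg.solve(A, r)
    return x[0], np.concatenate(([0.0], x[1:]))

def ratio_policy_iteration(P_sa, r_sa, c_sa, max_iters=100):
    """P_sa: (S, A, S); r_sa, c_sa: (S, A). Maximizes long-run avg reward / avg cost."""
    S = r_sa.shape[0]
    policy = np.zeros(S, dtype=int)
    for _ in range(max_iters):
        idx = (np.arange(S), policy)
        g, h = evaluate_ratio(P_sa[idx], r_sa[idx], c_sa[idx])
        q = r_sa - g * c_sa + P_sa @ h           # improvement test quantity
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, g
        policy = new_policy
    return policy, g
```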

8.
A weakness of classical Markov decision processes (MDPs) is that they scale very poorly due to the flat state-space representation. Factored MDPs address this representational problem by exploiting problem structure to specify the transition and reward functions of an MDP in a compact manner. However, in general, solutions to factored MDPs do not retain the structure and compactness of the problem representation, forcing approximate solutions, with approximate linear programming (ALP) emerging as a promising MDP-approximation technique. To date, most ALP work has focused on the primal-LP formulation, while the dual LP, which forms the basis for solving constrained Markov problems, has received much less attention. We show that a straightforward linear approximation of the dual optimization variables is problematic, because some of the required computations cannot be carried out efficiently. Nonetheless, we develop a composite approach that symmetrically approximates the primal and dual optimization variables (effectively approximating both the objective function and the feasible region of the LP), leading to a formulation that is computationally feasible and suitable for solving constrained MDPs. We empirically show that this new ALP formulation also performs well on unconstrained problems.
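To give a feel for the primal-ALP formulation mentioned above, here is a small sketch that approximates the value function as V ≈ Φw and solves the resulting LP with scipy. The paper's composite primal and dual approximation is not reproduced; the feature matrix, state weights, and all names are illustrative assumptions.

```python
# Primal approximate linear programming for a discounted MDP: V is approximated as
# Phi @ w, so the Bellman inequalities become linear constraints on the weights w
# (standard primal ALP sketch, not the paper's composite primal/dual scheme).
import numpy as np
from scipy.optimize import linprog

def primal_alp(P_sa, r_sa, Phi, alpha, gamma=0.95):
    """P_sa: (S, A, S); r_sa: (S, A); Phi: (S, K) feature matrix; alpha: (S,) state weights.
    Minimize alpha^T Phi w  s.t.  (Phi w)(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) (Phi w)(s')."""
    S, A = r_sa.shape
    rows, rhs = [], []
    for s in range(S):
        for a in range(A):
            # Constraint Phi[s] @ w - gamma * (P_sa[s, a] @ Phi) @ w >= r_sa[s, a],
            # with the sign flipped to fit linprog's A_ub x <= b_ub convention.
            rows.append(-(Phi[s] - gamma * P_sa[s, a] @ Phi))
            rhs.append(-r_sa[s, a])
    res = linprog(c=alpha @ Phi, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * Phi.shape[1], method="highs")
    return res.x    # weight vector w; approximate values are Phi @ res.x
```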

9.
We propose a unified framework for Markov decision problems and performance sensitivity analysis of multichain Markov processes with both discounted and average-cost performance criteria. With the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play a central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established using the performance-difference formulas in a simple and intuitive way, and the performance-gradient formulas, together with stochastic approximation, may lead to new optimization schemes. This sensitivity-based point of view of performance optimization provides insights that link perturbation analysis, Markov decision processes, and reinforcement learning together. The research is an extension of previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
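For reference, in the ergodic single-chain average-reward case the two kinds of formulas referred to above take the familiar form below, with $g$ the performance potential of policy $(P, f)$, $\pi$ and $\pi'$ the stationary distributions of $P$ and $P'$, and $e$ the all-ones column vector. This is the standard ergodic-chain statement; the multichain and discounted versions treated in the paper are more general.

$$\eta' - \eta \;=\; \pi'\big[(f' + P' g) - (f + P g)\big] \;=\; \pi'\big[(P' - P)\,g + (f' - f)\big], \qquad (I - P + e\pi)\, g = f,$$

and, for the parameterized family $P_\delta = P + \delta(P' - P)$, $f_\delta = f + \delta(f' - f)$,

$$\left.\frac{d\eta_\delta}{d\delta}\right|_{\delta = 0} \;=\; \pi\big[(P' - P)\,g + (f' - f)\big].$$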

10.
The paper presents an advanced modeling approach and a simulation model for supporting supply chain management. The first objective is to develop a flexible, time-efficient, and parametric supply chain simulator starting from a discrete-event simulation package; to this end we propose an advanced modeling approach. The second objective is to provide a decision-making tool for supply chain management. The simulator is a decision-making tool capable of analyzing different supply chain scenarios using an approach based on multiple performance measures and a user-defined set of input parameters. The simulator's capabilities as a decision-making tool are strongly amplified if Design of Experiments (DOE) and Analysis of Variance (ANOVA) are used for experiment planning and for the analysis of simulation results, respectively. With regard to the supply chain decision-making process, we provide an application example for a better understanding of the tool's potential. The example considers a specific supply chain scenario and analyzes the effects of inventory control policies, lead times, and customers' demand intensity and variability on three different supply chain performance measures.

11.
We assess the potential of the approximate dynamic programming (ADP) approach for process control, especially as a method to complement the model predictive control (MPC) approach. In the artificial intelligence (AI) and operations research (OR) research communities, ADP has recently seen significant activity as an effective method for solving Markov decision processes (MDPs), which represent a type of multi-stage decision problem under uncertainty. Process control problems are similar to MDPs, with the key difference being continuous state and action spaces as opposed to discrete ones. In addition, unlike in other popular ADP application areas such as robotics or games, in process control applications the first and foremost concern should be the safety and economics of the ongoing operation rather than efficient learning. We explore different options within ADP design, such as pre-decision-state vs. post-decision-state value functions, parametric vs. nonparametric value function approximators, batch-mode vs. continuous-mode learning, and exploration vs. robustness. We argue that ADP holds great potential, especially for obtaining effective control policies for stochastic constrained nonlinear or linear systems and continually improving them towards optimality.
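As one concrete point in the design space discussed above (parametric value-function approximator, batch-mode learning), the following is a minimal fitted value iteration sketch for a sampled state space with a known one-step model. It is an illustrative, assumption-laden example, not the authors' recommended ADP design; the callables and names are hypothetical.

```python
# Batch-mode ADP sketch: fitted value iteration with a linear-in-features value function
# and a known deterministic one-step model (purely illustrative; names are assumptions).
import numpy as np

def fitted_value_iteration(states, actions, step, phi, gamma=0.98, sweeps=50):
    """states: list of sampled states; actions: iterable of actions;
    step(s, a) -> (reward, next_state); phi(s) -> feature vector of the state."""
    w = np.zeros(len(phi(states[0])))
    for _ in range(sweeps):
        X = np.array([phi(s) for s in states])
        # One-step Bellman backup at each sampled state, using the current approximation.
        y = np.array([max(r + gamma * phi(s_next) @ w
                          for r, s_next in (step(s, a) for a in actions))
                      for s in states])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)   # batch least-squares projection step
    return w                                        # approximate value: V(s) = phi(s) @ w
```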

12.
The sensitivity-based optimization of Markov systems has become an increasingly important area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be directly obtained for Markov decision processes (MDPs). In this correspondence, the sensitivity-based optimization is extended to average reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas of POMDPs. On the basis of the performance-derivative formula, we present a new method to estimate the performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition without the discounted reward formulation. We also propose a policy-iteration algorithm to obtain a nearly optimal finite-state-controller policy.

13.
This article addresses reinforcement learning problems based on factored Markov decision processes (MDPs) in which the agent must choose among a set of candidate abstractions, each built up from a different combination of state components. We present and evaluate a new approach that performs effective abstraction selection and is more resource-efficient and/or more general than existing approaches. The core of the approach is to make the selection of an abstraction part of the learning agent's decision-making process by augmenting the agent's action space with internal actions that select the abstraction it uses. We prove that under certain conditions this approach results in a derived MDP whose solution yields both the optimal abstraction for the original MDP and the optimal policy under that abstraction. We examine our approach in three domains of increasing complexity: contextual bandit problems, episodic MDPs, and general MDPs with context-specific structure.

14.
In this paper, we propose a challenging research direction for Constraint Programming and optimization techniques in general. We address problems where the decisions to be taken affect, and are affected by, complex systems that exhibit phenomena emerging from a collection of interacting objects capable of self-organizing and of adapting their behaviour according to their history and feedback. Such systems are unfortunately impervious to modeling efforts via state-of-the-art combinatorial optimization techniques. We provide some hints on approaches to connect and integrate decision making and optimization technology with complex systems via machine learning, game theory, and mechanism design. In the first case, the aim is to extract modeling components that express the relation between global decisions and the observables emerging from the real system, or from an accurate predictive model (e.g., a simulator). In the second case, the idea is to exploit game theory, mechanism design, and distributed decision making to drive the process toward realistic equilibrium points, avoiding globally optimal but unrealistic configurations. We conclude by observing that dealing with the complexity of the considered problems will require greatly extending the capabilities of state-of-the-art solvers; in this context, we identify some key issues and highlight future research directions.

15.
The selection of meaningful lines for 3D line data visualization has been intensively researched in recent years. Most approaches focus on single line fields, in which one line passes through each domain point. This paper presents a selection approach for sets of line fields based on a global optimization of the opacity of candidate lines. For this, existing approaches for single line fields are modified so that significantly larger numbers of line representatives can be handled. Furthermore, time coherence is addressed for animations, making this the first approach that solves the line selection problem for 3D time-dependent flow. We apply our technique to visualize dense sets of pathlines, sets of magnetic field lines, and animated sets of pathlines, streaklines, and masslines.

16.
Live cell imaging is an important biomedical research paradigm for studying dynamic cellular behaviour. Although phenotypic data derived from images are difficult to explore and analyse, some researchers have successfully addressed this with visualization. Nonetheless, visualization methods for live cell imaging data have been reported in an ad hoc and fragmented fashion. This leads to a knowledge gap where it is difficult for biologists and visualization developers to evaluate the advantages and disadvantages of different visualization methods, and for visualization researchers to gain an overview of existing work to identify research priorities. To address this gap, we survey existing visualization methods for live cell imaging from a visualization research perspective for the first time. Based on recent visualization theory, we perform a structured qualitative analysis of visualization methods that includes characterizing the domain and data, abstracting tasks, and describing visual encoding and interaction design. Based on our survey, we identify and discuss research gaps that future work should address: the broad analytical context of live cell imaging; the importance of behavioural comparisons; links with dynamic data visualization; the consequences of different data modalities; shortcomings in interactive support; and, in addition to analysis, the value of the presentation of phenotypic data and insights to other stakeholders.

17.
Preference transitivity, characterized by ordinal consistency, is a fundamental principle for decision-making models based on pairwise comparison matrices (PCMs). However, little previous research has addressed ordinal consistency in an optimal way. Further, because ordinal consistency is not considered in the consensus reaching process, non-transitive preferences may still exist in the revised PCMs. In this paper, optimization models are proposed to obtain transitive preferences for solving individual consistency and group consensus problems. First, the conditions satisfying the ordinal consistency of PCMs are analysed, and a system of constraints is derived that allows the ordinal consistency to be explicitly controlled in the optimization model. A mixed integer linear optimization model is then proposed to help decision makers satisfy both ordinal and cardinal consistency. A second mixed integer linear optimization model is designed to ensure that the consensus level in group decision-making problems can be achieved when both the group as a whole and all individuals have acceptable cardinal and ordinal consistency. Optimization models that combine ordinal consistency with classical cardinal consistency indices remain an open problem for future work. Compared with existing methods, the proposed models provide an optimal way to minimize the modifications needed to derive transitive preferences. Finally, the feasibility and validity of the models are verified through comparisons with classic models.
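As a small illustration of the ordinal-consistency (transitivity) condition that the models build on, the sketch below checks whether a multiplicative pairwise comparison matrix contains any circular (non-transitive) preference triad. It is only a checker under an assumed PCM convention, not the paper's mixed integer optimization models.

```python
# Check ordinal consistency of a multiplicative pairwise comparison matrix A, where
# A[i, j] > 1 means alternative i is preferred to j (illustrative checker only).
import numpy as np
from itertools import permutations

def is_ordinally_consistent(A, tol=1e-9):
    n = A.shape[0]
    for i, j, k in permutations(range(n), 3):
        # A violating triad: i preferred to j, j preferred to k, but k preferred to i.
        if A[i, j] > 1 + tol and A[j, k] > 1 + tol and A[k, i] > 1 + tol:
            return False
    return True

# Example: a 3x3 PCM with a circular preference (1 > 2, 2 > 3, but 3 > 1) is not transitive.
A = np.array([[1.0, 2.0, 0.5],
              [0.5, 1.0, 3.0],
              [2.0, 1/3, 1.0]])
print(is_ordinally_consistent(A))   # False
```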

18.
Edge-clouds provide a promising new approach to significantly reducing network operational costs by moving computation closer to the edge. A key challenge in such systems is to decide where and when services should be migrated in response to user mobility and demand variation. The objective is to optimize operational costs while providing rigorous performance guarantees. In this paper, we model this as a sequential decision-making problem, formulated as a Markov decision process (MDP). However, departing from traditional solution methods (such as dynamic programming) that require extensive statistical knowledge and are computationally prohibitive, we develop a novel alternative methodology. First, we establish an interesting decoupling property of the MDP that reduces it to two independent MDPs on disjoint state spaces. Then, using the technique of Lyapunov optimization over renewals, we design an online control algorithm for the decoupled problem that is provably cost-optimal. This algorithm does not require any statistical knowledge of the system parameters and can be implemented efficiently. We validate the performance of our algorithm using extensive trace-driven simulations. Our overall approach is general and can be applied to other MDPs that possess a similar decoupling property.

19.
Robustness of policies in constrained Markov decision processes
We consider the optimization of finite-state, finite-action Markov decision processes (MDPs), under constraints. Cost and constraints are discounted. We introduce a new method for investigating the continuity, and a certain type of robustness, of the optimal cost and the optimal policy under changes in the constraints. This method is also applicable for other cost criteria such as finite horizon and infinite horizon average cost.

20.
This paper focuses on the design of time-homogeneous fully observed Markov decision processes (MDPs) with finite state and action spaces. The main objective is to obtain policies that generate the maximal set of recurrent states, subject to convex constraints on the set of invariant probability mass functions. We propose a design method that relies on a finitely parametrized convex program inspired by principles of entropy maximization. A numerical example is provided to illustrate these ideas.
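In the same spirit (though not the paper's actual program), one can maximize the entropy of a stationary state-action occupation measure subject to stationarity and additional user-supplied convex constraints with an off-the-shelf convex solver. The sketch below uses cvxpy with illustrative names and conveys the flavor of such a finitely parametrized convex program.

```python
# Entropy maximization over stationary state-action occupation measures of a finite MDP
# (a generic sketch in the spirit of the abstract, not the paper's exact design program).
import numpy as np
import cvxpy as cp

def max_entropy_occupation(P_sa, extra_constraints=None):
    """P_sa: (S, A, S) transition tensor. Returns an occupation measure mu of shape (S, A)."""
    S, A, _ = P_sa.shape
    mu = cp.Variable((S, A), nonneg=True)
    constraints = [cp.sum(mu) == 1]
    # Stationarity: total probability flow into each state equals total flow out of it.
    for s in range(S):
        inflow = cp.sum(cp.multiply(P_sa[:, :, s], mu))
        constraints.append(cp.sum(mu[s, :]) == inflow)
    if extra_constraints is not None:
        constraints += extra_constraints(mu)       # user-supplied convex constraints on mu
    problem = cp.Problem(cp.Maximize(cp.sum(cp.entr(mu))), constraints)
    problem.solve()
    return mu.value    # induced policy: mu[s, :] / mu[s, :].sum() wherever the sum is positive
```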
