Similar Articles
1.
The vehicle routing problem (VRP) is a core problem in logistics and transportation optimization; its goal is to find a minimum-cost vehicle routing plan that satisfies customer demands. As the scale of logistics operations keeps growing, however, the VRP becomes harder to solve and real-time requirements become more demanding, so existing conventional algorithms no longer meet practical needs. In recent years, reinforcement learning has become an important approach to solving the VRP. After briefly reviewing conventional VRP solution methods, this survey focuses on reinforcement-learning-based VRP algorithms, classifying them into dynamic-programming-based, value-based, and policy-based methods, and concludes with an outlook on future research.
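As a point of reference for the conventional baselines the survey mentions, the sketch below (ours, not drawn from the surveyed work) shows the routing decision loop that a learned RL policy would replace: a greedy nearest-feasible-customer construction heuristic for a capacitated VRP. All names and the toy instance are illustrative.

```python
import math

def greedy_cvrp(depot, customers, demands, capacity):
    """Greedy construction heuristic: repeatedly visit the nearest customer
    whose demand still fits in the vehicle; return to the depot when none fits.
    An RL policy would replace the 'pick nearest feasible' rule with a learned one."""
    unvisited = set(customers)
    routes, pos, load, route = [], depot, 0.0, [depot]
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    while unvisited:
        feasible = [c for c in unvisited if load + demands[c] <= capacity]
        if not feasible:                      # vehicle full: close the route
            route.append(depot)
            routes.append(route)
            pos, load, route = depot, 0.0, [depot]
            continue
        nxt = min(feasible, key=lambda c: dist(pos, c))
        route.append(nxt)
        load += demands[nxt]
        unvisited.remove(nxt)
        pos = nxt
    route.append(depot)
    routes.append(route)
    return routes

# Toy instance (illustrative only).
depot = (0.0, 0.0)
customers = [(1, 2), (3, 1), (-2, 2), (2, -3)]
demands = {c: 1.0 for c in customers}
print(greedy_cvrp(depot, customers, demands, capacity=2.0))
```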

2.
Elevator Group Control Using Multiple Reinforcement Learning Agents
Crites, Robert H.; Barto, Andrew G. Machine Learning, 1998, 33(2-3): 235-262
Recent algorithmic and theoretical advances in reinforcement learning (RL) have attracted widespread interest. RL algorithms have appeared that approximate dynamic programming on an incremental basis. They can be trained on the basis of real or simulated experiences, focusing their computation on areas of state space that are actually visited during control, making them computationally tractable on very large problems. If each member of a team of agents employs one of these algorithms, a new collective learning algorithm emerges for the team as a whole. In this paper we demonstrate that such collective RL algorithms can be powerful heuristic methods for addressing large-scale control problems. Elevator group control serves as our testbed. It is a difficult domain posing a combination of challenges not seen in most multi-agent learning research to date. We use a team of RL agents, each of which is responsible for controlling one elevator car. The team receives a global reward signal which appears noisy to each agent due to the effects of the actions of the other agents, the random nature of the arrivals and the incomplete observation of the state. In spite of these complications, we show results that in simulation surpass the best of the heuristic elevator control algorithms of which we are aware. These results demonstrate the power of multi-agent RL on a very large scale stochastic dynamic optimization problem of practical utility.
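A minimal sketch (ours, not the paper's implementation) of the core idea: several independent tabular Q-learners, each acting on its own observation but all trained on the same global team reward, which therefore looks noisy from any single agent's point of view. The two-agent toy environment below is purely illustrative.

```python
import random
from collections import defaultdict

class IndependentQLearner:
    """One tabular Q-learning agent; each agent sees only its own observation
    but is updated with the shared global reward."""
    def __init__(self, actions, alpha=0.1, gamma=0.99, eps=0.1):
        self.q = defaultdict(float)
        self.actions, self.alpha, self.gamma, self.eps = actions, alpha, gamma, eps

    def act(self, obs):
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, global_reward, next_obs):
        best_next = max(self.q[(next_obs, a)] for a in self.actions)
        td_target = global_reward + self.gamma * best_next
        self.q[(obs, action)] += self.alpha * (td_target - self.q[(obs, action)])

# Illustrative training loop: both agents receive the same team reward.
agents = [IndependentQLearner(actions=[0, 1]) for _ in range(2)]
obs = [0, 0]
for _ in range(1000):
    acts = [ag.act(o) for ag, o in zip(agents, obs)]
    team_reward = 1.0 if sum(acts) == 1 else 0.0       # toy coordination reward
    next_obs = [random.randint(0, 3) for _ in agents]  # toy observations
    for ag, o, a, o2 in zip(agents, obs, acts, next_obs):
        ag.update(o, a, team_reward, o2)
    obs = next_obs
```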

3.
A Survey of Reinforcement Learning Research
高阳, 陈世福, 陆鑫. 《自动化学报》(Acta Automatica Sinica), 2004, 30(1): 86-100
Reinforcement learning improves its policy through trial-and-error interaction with the environment; its self-learning and online learning characteristics make it an important branch of machine learning research. This paper first introduces the principles and structure of reinforcement learning. It then constructs a two-dimensional classification diagram and discusses two classes of algorithms, optimal-search-based and experience-reinforcement-based, in Markovian and non-Markovian environments respectively. It further surveys the core problems of reinforcement learning in light of recent research, including partial observability, function approximation, multi-agent reinforcement learning, and bias techniques. Finally, it briefly reviews applications of reinforcement learning and directions for future development.
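For readers new to the algorithm families surveyed here, a minimal tabular Q-learning loop (a generic textbook sketch, not code from the cited paper) illustrates the trial-and-error policy improvement the abstract describes; the tiny chain environment is invented for illustration.

```python
import random
from collections import defaultdict

# Toy 5-state chain: move left/right, reward 1 for reaching the right end.
N_STATES, ACTIONS = 5, [0, 1]          # 0 = left, 1 = right

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.9, 0.1
for episode in range(500):
    s, done = 0, False
    while not done:
        if random.random() < eps:
            a = random.choice(ACTIONS)                               # explore
        else:                                                        # exploit, random tie-break
            a = max(ACTIONS, key=lambda x: (q[(s, x)], random.random()))
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap on the best action in the next state.
        q[(s, a)] += alpha * (r + gamma * max(q[(s2, x)] for x in ACTIONS) - q[(s, a)])
        s = s2

print([round(q[(s, 1)], 2) for s in range(N_STATES)])
```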

4.
徐昕, 沈栋, 高岩青, 王凯. 《自动化学报》(Acta Automatica Sinica), 2012, 38(5): 673-687
Learning control of dynamical systems based on Markov decision processes (MDPs) has in recent years become an interdisciplinary research direction spanning machine learning, control theory, and operations research. Its main goal is data-driven multi-stage optimal control of systems whose models are complex or uncertain. This paper surveys the research frontier of MDP-based learning control theory, algorithms, and applications, focusing on progress in reinforcement learning (RL) and approximate dynamic programming (ADP), including temporal-difference learning theory, value function approximation methods for MDPs with continuous state and action spaces, direct policy search and approximate policy iteration, and adaptive critic design algorithms. Finally, applications and development trends of the related research fields are analyzed and discussed.
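One of the building blocks listed above, temporal-difference learning with linear value-function approximation, can be summarized in a few lines. This is a generic TD(0) sketch under a fixed policy with a made-up feature mapping, not the authors' algorithm.

```python
import numpy as np

def td0_linear(transitions, phi, n_features, alpha=0.05, gamma=0.95):
    """TD(0) with a linear value function v(s) = w . phi(s).
    `transitions` is an iterable of (s, r, s_next, done) tuples generated by
    the policy being evaluated; `phi` maps a state to a feature vector."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        td_error = r + gamma * v_next - v_s
        w += alpha * td_error * phi(s)       # semi-gradient TD update
    return w

# Illustrative use with a one-hot feature map over 4 states.
phi = lambda s: np.eye(4)[s]
transitions = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)] * 200
print(td0_linear(transitions, phi, n_features=4))
```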

5.
Reinforcement learning is a research hotspot in machine learning; it studies how an agent interacts with its environment, makes sequential decisions, optimizes its policy, and maximizes cumulative return. Reinforcement learning has great research value and application potential and is a key step toward general artificial intelligence. This paper surveys progress and trends in reinforcement learning algorithms and applications. It first introduces the basic principles of reinforcement learning, including Markov decision processes, value functions, and the exploration-exploitation problem. It then reviews classical reinforcement learning algorithms, including value-function-based, policy-search-based, and combined value-function-and-policy-search algorithms, and surveys frontier research, mainly multi-agent reinforcement learning and meta reinforcement learning. Finally, it reviews successful applications of reinforcement learning in game playing, robot control, urban traffic, business, and other areas, and closes with a summary and outlook.
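The exploration-exploitation trade-off mentioned above is easiest to see on a multi-armed bandit. The epsilon-greedy sketch below is a standard textbook illustration (not from the cited survey), with made-up arm means.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, eps=0.1):
    """Epsilon-greedy action selection: explore a random arm with probability
    eps, otherwise exploit the arm with the highest estimated mean reward."""
    n_arms = len(true_means)
    counts, estimates = [0] * n_arms, [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])      # exploit
        reward = random.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]     # incremental mean
        total += reward
    return estimates, total / steps

print(epsilon_greedy_bandit([0.2, 0.5, 0.8]))
```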

6.
This paper describes an explanation-based learning (EBL) system based on a version of Newell, Shaw, and Simon's LOGIC-THEORIST (LT). Results of applying this system to propositional calculus problems from Principia Mathematica are compared with results of applying several other versions of the same performance element to these problems. The primary goal of this study is to characterize and analyze differences between non-learning, rote learning (LT's original learning method), and EBL. Another aim is to provide a characterization of the performance of a simple problem solver in the context of the Principia problems, in the hope that these problems can be used as a benchmark for testing improved learning methods, just as problems like chess and the eight puzzle have been used as benchmarks in research on search methods.

7.
Online learning time is an important metric for reinforcement learning algorithms. Traditional online reinforcement learning algorithms such as Q-learning and SARSA (state-action-reward-state-action) cannot provide quantitative upper bounds on online learning time from a theoretical standpoint. This paper introduces the probably approximately correct (PAC) principle and designs data-based online reinforcement learning algorithms for continuous-time deterministic systems. These algorithms record online data efficiently, account for the state-space exploration that reinforcement learning requires, and output near-optimal control within a finite online learning time. Two implementations of the algorithm are proposed, using state discretization and kd-trees (k-dimensional trees) respectively, to store data and compute the online policy. Finally, the two proposed algorithms are applied to the motion control of a two-link manipulator, and their performance is observed and compared.
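As a rough illustration of the kd-tree variant described above, the sketch below stores state-value samples collected online in a kd-tree and answers queries by nearest-neighbour lookup. It assumes SciPy is available and is our simplified reading, not the paper's algorithm.

```python
import numpy as np
from scipy.spatial import cKDTree

class KDTreeValueStore:
    """Store (state, value) samples collected online and answer value queries
    by nearest-neighbour lookup; the tree is rebuilt lazily after insertions."""
    def __init__(self):
        self.states, self.values, self.tree = [], [], None

    def add(self, state, value):
        self.states.append(np.asarray(state, dtype=float))
        self.values.append(float(value))
        self.tree = None                      # mark the index as stale

    def query(self, state):
        if self.tree is None:                 # rebuild only when needed
            self.tree = cKDTree(np.vstack(self.states))
        _, idx = self.tree.query(np.asarray(state, dtype=float), k=1)
        return self.values[idx]

# Illustrative use on 2-D joint angles of a two-link arm.
store = KDTreeValueStore()
store.add([0.0, 0.0], 1.0)
store.add([0.5, -0.2], 0.3)
print(store.query([0.45, -0.25]))   # -> value of the closest stored state
```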

8.
Application of Reinforcement Learning to Basic Action Learning for Soccer Robots
This work studies reinforcement learning algorithms and their application to learning technical actions in robot soccer. When the state and action spaces of reinforcement learning are too large or the variables are continuous, learning tends to be slow or even fails to converge. To address this problem, a reinforcement learning method based on a Takagi-Sugeno (T-S) model fuzzy neural network is proposed, which effectively realizes the mapping from the reinforcement learning state space to the action space. In addition, the proposed method is used to design technical actions for soccer robots, and robot behavior learning without expert knowledge or an environment model is investigated. Finally, experiments demonstrate the effectiveness of the method, which can meet the needs of robot soccer competition.
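A zero-order Takagi-Sugeno fuzzy system of the kind referenced above maps a continuous state to an action by blending per-rule outputs with Gaussian memberships. The sketch below is a generic illustration with made-up rule centres, not the authors' network.

```python
import numpy as np

def ts_fuzzy_action(state, centres, widths, consequents):
    """Zero-order T-S fuzzy inference: each rule i has a Gaussian membership
    around centres[i] and a constant action consequent; the output action is
    the membership-weighted average of the consequents."""
    state = np.asarray(state, dtype=float)
    # Rule firing strengths (product of per-dimension Gaussian memberships).
    d2 = np.sum(((state - centres) / widths) ** 2, axis=1)
    w = np.exp(-0.5 * d2)
    return float(w @ consequents / (w.sum() + 1e-12))

# Illustrative 2-D state (e.g. ball distance and angle), 3 rules.
centres = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, -0.5]])
widths = np.ones_like(centres)
consequents = np.array([0.0, 0.5, -0.5])     # candidate action value per rule
print(ts_fuzzy_action([0.8, 0.4], centres, widths, consequents))
```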

9.
A Study of Optimality Criteria for Reinforcement Learning
A reinforcement learning agent solves sequential decision problems by learning and planning optimal policies, so how to define the optimality criterion of a policy is one of the core questions in reinforcement learning research. This paper discusses a series of optimality criteria drawn from dynamic programming, examines through examples the applicability, advantages, and disadvantages of each criterion for reinforcement learning, and analyzes the necessity of designing reinforcement learning algorithms for the various criteria.
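For reference, two of the dynamic-programming optimality criteria typically compared in this line of work are the discounted-return criterion and the average-reward criterion, written here in standard textbook notation (not quoted from the paper):

\[
J_\gamma^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_0 = s\right], \qquad 0 \le \gamma < 1,
\]
\[
\rho^\pi(s) \;=\; \lim_{N\to\infty} \frac{1}{N}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{N-1} r_{t} \;\middle|\; s_0 = s\right].
\]

A policy that is optimal under one criterion need not be optimal under the other, which is why the choice of criterion matters for the algorithms being compared.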

10.
Ensemble Algorithms in Reinforcement Learning
This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: $Q$-learning, Sarsa, actor–critic (AC), $QV$-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms.
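A rough sketch of two of the combination rules named in the abstract, majority voting and Boltzmann multiplication, applied to the action values produced by several RL algorithms; the values below are made up and the helper names are ours.

```python
import numpy as np

def majority_vote(action_values):
    """Each algorithm votes for its greedy action; ties broken by lowest index."""
    votes = np.zeros(action_values.shape[1])
    for row in action_values:                # one row of action values per algorithm
        votes[np.argmax(row)] += 1
    return int(np.argmax(votes))

def boltzmann_multiplication(action_values, temperature=1.0):
    """Turn each algorithm's action values into a Boltzmann distribution,
    multiply the distributions action-wise, and renormalize."""
    probs = np.exp(action_values / temperature)
    probs /= probs.sum(axis=1, keepdims=True)
    combined = probs.prod(axis=0)
    return combined / combined.sum()

# Action values for 3 algorithms (rows) over 4 actions (columns) - illustrative.
action_values = np.array([[0.1, 0.5, 0.2, 0.0],
                          [0.4, 0.3, 0.2, 0.1],
                          [0.0, 0.6, 0.1, 0.3]])
print(majority_vote(action_values))
print(boltzmann_multiplication(action_values))
```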

11.
Mahadevan, Sridhar. Machine Learning, 1996, 22(1-3): 159-195
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
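The R-learning method studied empirically here follows the standard average-reward update pattern sketched below (a minimal textbook-style illustration, not the paper's code): the action values track reward relative to the estimated average reward rho, and rho itself is adjusted only after greedy actions.

```python
import random
from collections import defaultdict

def r_learning_update(q, rho, s, a, r, s2, actions, alpha=0.1, beta=0.01):
    """One R-learning step (average-reward RL).
    q:   dict mapping (state, action) -> relative action value
    rho: current estimate of the average reward per step."""
    best_next = max(q[(s2, b)] for b in actions)
    best_here = max(q[(s, b)] for b in actions)
    was_greedy = q[(s, a)] == best_here          # checked before the value update
    q[(s, a)] += alpha * (r - rho + best_next - q[(s, a)])
    if was_greedy:                               # update rho only after greedy actions
        rho += beta * (r - rho + best_next - best_here)
    return rho

# Illustrative use on a 2-state, 2-action toy problem.
q, rho, actions = defaultdict(float), 0.0, [0, 1]
s = 0
for _ in range(1000):
    a = random.choice(actions)
    s2 = (s + a) % 2
    r = 1.0 if (s == 1 and a == 1) else 0.0
    rho = r_learning_update(q, rho, s, a, r, s2, actions)
    s = s2
print(round(rho, 3))
```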

12.
In this paper, we introduce a novel reinforcement learning (RL) scheme for linear continuous-time dynamical systems. Different from traditional batch learning algorithms, an incremental learning approach is developed, which provides a more efficient way to tackle the on-line learning problem in real-world applications. We provide concrete convergence and robustness analysis of this incremental-learning algorithm. An extension to solving robust optimal control problems is also given, and two simulation examples illustrate the effectiveness of our theoretical results.

13.
Reinforcement learning (RL) is a machine intelligence method that successfully combines dynamic programming with control problems; it brings dynamic programming and supervised learning together in a machine learning system and is usually applied to two classes of problems, prediction and control. This paper proposes an evaluation function represented in vector form. To realize multi-dimensional reinforcement learning, a dedicated neural network (the Q-network) is used to implement the critic network, and its application to mobile robot behavior planning is studied.

14.
Literature shows that reinforcement learning (RL) and the well-known optimization algorithms derived from it have been applied to assembly sequence planning (ASP); however, the way this is done, as an offline process, ends up generating optimization methods that do not exploit the full potential of RL. Today's assembly lines need to be adaptive to changes, resilient to errors and attentive to the operators' skills and needs. If all of these aspects need to evolve towards a new paradigm, called Industry 4.0, the way RL is applied to ASP needs to change as well: the RL phase has to be part of the assembly execution phase and be optimized with time and several repetitions of the process. This article presents an agile exploratory experiment in ASP to prove the effectiveness of RL techniques to execute ASP as an adaptive, online and experience-driven optimization process, directly at assembly time. The human-assembly interaction is modelled through the inputs and outputs of an assembly guidance system built as an assembly digital twin. Experimental assemblies are executed without pre-established assembly sequence plans and adapted to the operators' needs. The experiments show that precedence and transition matrices for an assembly can be generated from the statistical knowledge of several different assembly executions. When the frequency of a given subassembly reinforces its importance, statistical results obtained from the experiments prove that online RL applications are not only possible but also effective for learning, teaching, executing and improving assembly tasks at the same time. This article paves the way towards the application of online RL algorithms to ASP.
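The statistical aggregation described above, deriving transition (and implicitly precedence) statistics from repeated assembly executions, can be sketched in a few lines; the sequences and part names below are illustrative, not data from the article.

```python
import numpy as np

def transition_matrix(sequences, parts):
    """Count how often part j directly follows part i across all observed
    assembly executions, then normalize each row into transition frequencies."""
    index = {p: i for i, p in enumerate(parts)}
    counts = np.zeros((len(parts), len(parts)))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[index[a], index[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Three observed executions of a toy 4-part assembly (illustrative).
parts = ["base", "frame", "motor", "cover"]
executions = [["base", "frame", "motor", "cover"],
              ["base", "motor", "frame", "cover"],
              ["base", "frame", "motor", "cover"]]
print(transition_matrix(executions, parts))
```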

15.
This paper introduces a setting for multiclass online learning with limited feedback and its application to utterance classification. In this learning setting, a parameter k limits the number of choices presented for selection by the environment (e.g. by the user in the case of an interactive spoken system) during each trial of the online learning sequence. New versions of standard additive and multiplicative weight update algorithms for online learning are presented that are more suited to the limited feedback setting, while sharing the efficiency advantages of the standard ones. The algorithms are evaluated on an utterance classification task in two domains. In this utterance classification task, no training material for the domain is provided (for training the speech recognizer or classifier) prior to the start of online learning. We present experiments on the effect of varying k and the weight update algorithms on the learning curve for online utterance classification. In these experiments, the new online learning algorithms improve classification accuracy compared with the standard ones. The methods presented are directly relevant to applications such as building call routing systems that adapt from feedback rather than being trained in batch mode. Editors: Dan Roth and Pascale Fung. The work reported in this paper was carried out while the author was at AT&T Labs.
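To make the limited-feedback setting concrete: on each trial only the top-k classes (by current score) are shown, and only their weights are updated once the correct label among them is revealed. The sketch below is a generic multiplicative-update (Winnow-style) illustration under those assumptions, not the authors' exact algorithm, and the feature and class names are invented.

```python
from collections import defaultdict

class LimitedFeedbackClassifier:
    """Multiplicative-weight multiclass learner: present the top-k classes,
    then promote features of the selected (correct) class and demote those of
    the other presented classes."""
    def __init__(self, classes, k=3, promote=1.5, demote=0.7):
        self.k, self.promote, self.demote = k, promote, demote
        self.w = {c: defaultdict(lambda: 1.0) for c in classes}

    def top_k(self, features):
        scores = {c: sum(self.w[c][f] for f in features) for c in self.w}
        return sorted(scores, key=scores.get, reverse=True)[: self.k]

    def update(self, features, presented, selected):
        if selected is None:          # user rejected all k choices: no update
            return
        for c in presented:
            factor = self.promote if c == selected else self.demote
            for f in features:
                self.w[c][f] *= factor

# Illustrative online loop for utterance classification.
clf = LimitedFeedbackClassifier(classes=["billing", "support", "sales"], k=2)
utterance = ["my", "bill", "is", "wrong"]
choices = clf.top_k(utterance)           # show k candidate intents to the user
clf.update(utterance, choices, selected="billing" if "billing" in choices else None)
```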

16.
In this paper, we develop and assess online decision-making algorithms for call admission and routing for low Earth orbit (LEO) satellite networks. It has been shown in a recent paper that, in a LEO satellite system, a semi-Markov decision process formulation of the call admission and routing problem can achieve better performance in terms of an average revenue function than existing routing methods. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed in order to circumvent the computational burden of DP. The first method is based on an actor-critic method with temporal-difference (TD) learning. The second method is based on a critic-only method, called optimistic TD learning. The algorithms improve performance in terms of storage, computational complexity and computation time, and in terms of an overall long-term average revenue function that penalizes blocked calls. Numerical studies are carried out, and the results obtained show that the RL framework can achieve up to 56% higher average revenue over existing routing methods used in LEO satellite networks with reasonable storage and computational requirements.
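The first method the abstract mentions, an actor-critic scheme with TD learning, follows the generic pattern sketched below (tabular, softmax actor; our simplified illustration rather than the paper's semi-Markov formulation).

```python
import math
import random
from collections import defaultdict

class TabularActorCritic:
    """One-step actor-critic: the critic learns state values with TD(0), and
    the actor adjusts softmax action preferences with the TD error."""
    def __init__(self, actions, alpha_v=0.1, alpha_p=0.05, gamma=0.95):
        self.v = defaultdict(float)              # critic: state values
        self.h = defaultdict(float)              # actor: action preferences
        self.actions, self.alpha_v, self.alpha_p, self.gamma = actions, alpha_v, alpha_p, gamma

    def policy(self, s):
        prefs = [self.h[(s, a)] for a in self.actions]
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def act(self, s):
        return random.choices(self.actions, weights=self.policy(s))[0]

    def update(self, s, a, r, s2, done):
        target = r if done else r + self.gamma * self.v[s2]
        td_error = target - self.v[s]
        self.v[s] += self.alpha_v * td_error
        probs = self.policy(s)
        for a2, p in zip(self.actions, probs):   # grad of log softmax: 1{a2==a} - p
            self.h[(s, a2)] += self.alpha_p * td_error * ((1.0 if a2 == a else 0.0) - p)

# Illustrative: admit (1) or block (0) a call depending on a coarse load level.
ac = TabularActorCritic(actions=[0, 1])
for _ in range(1000):
    s = random.randint(0, 3)                     # toy load level
    a = ac.act(s)
    r = 1.0 if (a == 1 and s < 3) else 0.0       # reward admitted calls unless full
    ac.update(s, a, r, s2=random.randint(0, 3), done=False)
```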

17.
Reinforcement learning is used to solve model-free optimal decision-making problems and is one of the key technologies for realizing artificial intelligence, but traditional tabular reinforcement learning methods struggle with control problems that have large-scale or continuous spaces. Approximate reinforcement learning, inspired by function approximation, parameterizes the value function or the policy function and obtains the optimal behavior policy indirectly through parameter optimization; it has been applied with notable success to video games, board games, and robot control. On this basis, this paper reviews the current state of research and application progress of approximate reinforcement learning algorithms. It introduces the relevant fundamental theory, classifies and summarizes classical approximate reinforcement learning algorithms together with corresponding improvements, surveys research progress of approximate reinforcement learning in robot control, and summarizes several major open problems to provide a reference for subsequent research.
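Policy parameterization, one of the two approximation routes described above, can be illustrated with a minimal REINFORCE-style update for a softmax policy with linear features. This is a generic sketch with invented dimensions and data, not code from the survey.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities for a softmax policy with one weight vector per action."""
    logits = theta @ phi_s
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(theta, episode, phi, alpha=0.01, gamma=0.99):
    """REINFORCE: move the policy parameters along grad log pi(a|s) scaled by
    the discounted return observed from that step onward."""
    g, returns = 0.0, []
    for _, _, r in reversed(episode):            # compute returns backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    for (s, a, _), g_t in zip(episode, returns):
        p = softmax_policy(theta, phi(s))
        grad = -np.outer(p, phi(s))              # d log pi / d theta for all actions
        grad[a] += phi(s)
        theta += alpha * g_t * grad
    return theta

# Illustrative: 2 actions, 3-dimensional features, one made-up episode.
phi = lambda s: np.array([1.0, s, s * s])
theta = np.zeros((2, 3))
episode = [(0.1, 0, 0.0), (0.4, 1, 1.0), (0.8, 1, 2.0)]   # (state, action, reward)
print(reinforce_update(theta, episode, phi))
```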

18.
Automatic Complexity Reduction in Reinforcement Learning
High dimensionality of state representation is a major limitation for scale-up in reinforcement learning (RL). This work derives knowledge for complexity reduction from partial solutions and provides algorithms for automated dimension reduction in RL. We propose a cascading decomposition algorithm based on spectral analysis of a normalized graph Laplacian to decompose a problem into several subproblems, and then conduct parameter relevance analysis on each subproblem to perform dynamic state abstraction. The elimination of irrelevant parameters projects the original state space into one of lower dimension in which some subtasks are projected onto the same shared subtasks. The framework can identify irrelevant parameters based on performed action sequences and thus relieves the problem of high dimensionality in the learning process. We evaluate the framework with experiments and show that the dimension reduction approach can indeed make some otherwise infeasible problems learnable.
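The spectral step described above, eigen-analysis of a normalized graph Laplacian built from the state-transition graph, can be sketched as follows with NumPy; the tiny adjacency matrix is illustrative.

```python
import numpy as np

def normalized_laplacian_spectrum(adjacency):
    """Compute the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    and return its eigenvalues and eigenvectors in ascending order. The
    low-order eigenvectors reveal the coarse block structure that can be used
    to split the problem into subproblems."""
    a = np.asarray(adjacency, dtype=float)
    d = a.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    lap = np.eye(len(a)) - d_inv_sqrt @ a @ d_inv_sqrt
    return np.linalg.eigh(lap)                 # symmetric matrix -> eigh

# Two loosely connected 3-state clusters (illustrative transition graph).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
vals, vecs = normalized_laplacian_spectrum(A)
print(np.round(vals, 3))
print(np.round(vecs[:, 1], 3))   # second eigenvector separates the two clusters
```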

19.
This article proposes a reinforcement learning procedure for mobile robot navigation using a latent-like learning schema. Latent learning refers to learning that occurs in the absence of reinforcement signals and is not apparent until reinforcement is introduced. This concept considers that part of a task can be learned before the agent receives any indication of how to perform such a task. In the proposed topological reinforcement learning agent (TRLA), a topological map is used to perform the latent learning. The propagation of the reinforcement signal throughout the topological neighborhoods of the map permits the estimation of a value function which takes, on average, fewer trials and fewer updates per trial than six of the main temporal difference reinforcement learning algorithms: Q-learning, SARSA, Q(λ)-learning, SARSA(λ), Dyna-Q and fast Q(λ)-learning. The RL agents were tested in four different environments designed to consider a growing level of complexity in accomplishing navigation tasks. The tests suggested that the TRLA chooses shorter trajectories (in the number of steps) and/or requires fewer value-function updates per trial than the other six reinforcement learning (RL) algorithms.

20.
Reinforcement learning is an important branch of machine learning and artificial intelligence that has drawn wide attention from society and industry in recent years. The main problem a reinforcement learning algorithm addresses is how an agent learns a policy by interacting directly with its environment. When the dimensionality of the state space grows, however, traditional reinforcement learning methods often face the curse of dimensionality and struggle to learn well. Hierarchical reinforcement learning aims to decompose a complex reinforcement learning problem into several subproblems that are solved separately, which can work better than solving the whole problem directly. Hierarchical reinforcement learning is a potential route to solving large-scale reinforcement learning problems, yet it has received relatively little attention. This paper introduces and reviews the major classes of hierarchical reinforcement learning methods.
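One of the major families of hierarchical methods, the options framework, represents a temporally extended sub-behavior as an initiation set, an internal policy, and a termination condition. The minimal sketch below (a generic illustration, not drawn from the survey) shows how such an option is executed inside an ordinary MDP loop.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """A temporally extended action: where it can start, how it acts, when it stops."""
    initiation: Set[int]                 # states in which the option may be invoked
    policy: Callable[[int], int]         # internal policy: state -> primitive action
    terminate: Callable[[int], float]    # probability of terminating in a state

def run_option(option, state, env_step, gamma=0.95):
    """Execute an option until it terminates; return the discounted reward
    accumulated, the resulting state, and the elapsed duration."""
    assert state in option.initiation
    total, discount, steps = 0.0, 1.0, 0
    while True:
        action = option.policy(state)
        state, reward, done = env_step(state, action)
        total += discount * reward
        discount *= gamma
        steps += 1
        if done or random.random() < option.terminate(state):
            return total, state, steps

# Illustrative "go right until the end of a 5-state corridor" option.
def env_step(s, a):
    s2 = max(0, min(4, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

go_right = Option(initiation={0, 1, 2, 3},
                  policy=lambda s: 1,
                  terminate=lambda s: 1.0 if s == 4 else 0.0)
print(run_option(go_right, 0, env_step))
```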
