Deep Reinforcement Learning with Double Q-learning
Abstract
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
The goal of reinforcement learning (Sutton and Barto 1998) is to learn good policies for sequential decision problems, by optimizing a cumulative future reward signal. Q-learning (Watkins 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.
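As a concrete reference for the maximization step mentioned above, the following is a minimal sketch of the tabular Q-learning update; the array-based interface and the hyperparameter values are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update for the transition (s, a, r, s_next).

    Q is a (n_states, n_actions) array of estimated action values.
    The target takes a max over the *estimated* values of the next state,
    which is the step that tends to prefer overestimated action values.
    """
    target = r + gamma * np.max(Q[s_next])   # max over current estimates
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
    return Q
```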
In previous work, overestimations have been attributed to insufficiently flexible function approximation (Thrun and Schwartz 1993) and noise (van Hasselt 2010, 2011). In this paper, we unify these views and show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error. Of course, imprecise value estimates are the norm during learning, which indicates that overestimations may be much more common than previously appreciated.
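The point that inaccurate estimates alone are enough can be checked with a small numerical illustration: even when each individual estimate is unbiased, the maximum over the estimates is biased upward. The zero true values and Gaussian estimation noise below are assumptions chosen for the example, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# True value of every action is 0; each estimate adds zero-mean noise.
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

print(estimates.mean())              # close to 0: individual estimates are unbiased
print(estimates.max(axis=1).mean())  # well above 0: the max is biased upward
```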
It is an open question whether, if the overestimations do occur, this negatively affects performance in practice. Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values would be uniformly higher then the relative action preferences are preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique (Kaelbling et al. 1996). If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.
To test whether overestimations occur in practice and at scale, we investigate the performance of the recent DQN algorithm (Mnih et al. 2015). DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games. In some ways, this setting is a best-case scenario for Q-learning, because the deep neural network provides flexible function approximation with the potential for a low asymptotic approximation error, and the determinism of the environments prevents the harmful effects of noise. Perhaps surprisingly, we show that even in this comparatively favorable setting DQN sometimes substantially overestimates the values of the actions.
We show that the Double Q-learning algorithm (van Hasselt 2010), which was first proposed in a tabular setting, can be generalized to arbitrary function approximation, including deep neural networks. We use this to construct a new algorithm called Double DQN. This algorithm not only yields more accurate value estimates, but leads to much higher scores on several games. This demonstrates that the overestimations of DQN indeed lead to poorer policies and that it is beneficial to reduce them. In addition, by improving upon DQN we obtain state-of-the-art results on the Atari domain.
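To make the contrast concrete, here is a minimal sketch of the two target computations, with the standard DQN target network serving as the second value function as in the paper; the array names and shapes, the omission of terminal-state masking, and the use of NumPy in place of a deep learning framework are assumptions made for illustration.

```python
import numpy as np

def dqn_target(rewards, next_q_target, gamma=0.99):
    """Standard DQN target: r + gamma * max_a Q_target(s', a).

    The same set of estimates both selects the action (via the max) and
    evaluates it, which is where the upward bias enters.
    """
    return rewards + gamma * next_q_target.max(axis=1)

def double_dqn_target(rewards, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target: r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    Action selection (online network) and evaluation (target network) are
    decoupled, which reduces the overestimation.
    """
    best_actions = next_q_online.argmax(axis=1)                     # select with the online net
    batch = np.arange(len(rewards))
    return rewards + gamma * next_q_target[batch, best_actions]     # evaluate with the target net
```

Here next_q_online and next_q_target stand for the (batch, n_actions) action-value outputs of the online and target networks at the next states.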
Background
To solve sequential decision problems we can learn estimates for the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. Under a given policy $\pi$, the true value of an action $a$ in a state $s$ is
$$Q_\pi(s, a) \equiv \mathbb{E}\left[\, R_1 + \gamma R_2 + \dots \mid S_0 = s, A_0 = a, \pi \,\right],$$
where $\gamma \in [0, 1]$ is a discount factor that trades off the importance of immediate and later rewards. The optimal value is then $Q_*(s, a) = \max_\pi Q_\pi(s, a)$. An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.
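Spelled out, the greedy policy obtained from the optimal values in the sentence above is:

$$\pi_*(s) = \arg\max_{a} Q_*(s, a).$$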