Sequence to Sequence Learning with Neural Networks





Dropout (Srivastava, 2013) is a recently introduced regularization method that has been very successful with feed-forward neural networks. While much work has extended dropout in various ways (Wang & Manning, 2013; Wan et al., 2013), there has been relatively little research on applying it to RNNs. The only paper on this topic is by Bayer et al. (2013), who focus on “marginalized dropout” (Wang & Manning, 2013), a noiseless deterministic approximation to standard dropout. Bayer et al. (2013) claim that conventional dropout does not work well with RNNs because the recurrence amplifies noise, which in turn hurts learning. In this work, we show that this problem can be fixed by applying dropout to a certain subset of the RNNs’ connections. As a result, RNNs can now also benefit from dropout. Independently of our work, Pham et al. (2013) developed the very same RNN regularization method and applied it to handwriting recognition. We rediscovered this method and demonstrated strong empirical results over a wide range of problems. Other work that applied dropout to LSTMs is Pachitariu & Sahani (2013).
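The paragraph above does not say which subset of connections receives dropout. As a minimal sketch, assume the subset is the non-recurrent (input-to-hidden) connections, so that the recurrent path carries no injected noise from one timestep to the next. The function names, the tanh nonlinearity, and the weights are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if p <= 0.0:
        return list(vec)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def rnn_step(x_t, h_prev, W_hx, W_hh, p, rng):
    """One RNN step with dropout on the input connection only (assumed subset)."""
    x_dropped = dropout(x_t, p, rng)  # noise enters through the input path
    pre = [
        sum(w * x for w, x in zip(wx_row, x_dropped))
        + sum(w * h for w, h in zip(wh_row, h_prev))  # recurrent path left untouched
        for wx_row, wh_row in zip(W_hx, W_hh)
    ]
    return [math.tanh(z) for z in pre]

rng = random.Random(0)
W_hx = [[0.5, -0.3], [0.2, 0.4]]  # hypothetical weights
W_hh = [[0.1, 0.0], [0.0, 0.1]]
h = [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_hx, W_hh, p=0.5, rng=rng)
```

Because the hidden state is never masked, whatever it stores is not corrupted repeatedly as it is carried forward, which is one way to read the claim that recurrence otherwise amplifies the noise.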

1. Introduction

This paper proposes an end-to-end approach to sequence learning and applies it to English-to-French machine translation.






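Test results show that the model achieves a good BLEU score on the machine translation task, significantly outperforming the statistical machine translation (SMT) baseline.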




2. The model

The RNN is a natural generalization of feedforward neural networks.

Given an input sequence $(x_1, \ldots, x_T)$, the RNN computes its outputs by iterating the following equations:

$$h_t = \sigma\left(W^{hx} x_t + W^{hh} h_{t-1}\right)$$
$$y_t = W^{yh} h_t$$

The RNN can map sequences to sequences as long as the alignment between the inputs and the outputs is known ahead of time.
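The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration under assumed toy weights; the helper names `matvec`, `sigm` and `rnn_forward` are hypothetical, and the initial hidden state is assumed to be zero.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def sigm(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh):
    """Iterate h_t = sigm(W_hx x_t + W_hh h_{t-1}), y_t = W_yh h_t."""
    h = [0.0] * len(W_hh)  # assumed initial state h_0 = 0
    ys = []
    for x_t in xs:
        pre = [a + b for a, b in zip(matvec(W_hx, x_t), matvec(W_hh, h))]
        h = [sigm(p) for p in pre]
        ys.append(matvec(W_yh, h))
    return ys, h

# Toy dimensions: 2 inputs, 2 hidden units, 1 output, T = 3.
W_hx = [[0.5, -0.5], [0.25, 0.75]]
W_hh = [[0.0, 0.1], [0.1, 0.0]]
W_yh = [[1.0, -1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys, h_T = rnn_forward(xs, W_hx, W_hh, W_yh)
print(len(ys), len(ys[0]))  # prints "3 1"
```

The loop emits exactly one output per input, which is why this plain form needs the input-output alignment to be known in advance.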



Pattern Recognition and Artificial Intelligence, Vol. 34, No. 3, Mar. 2021

Sequence to Sequence Multi-agent Reinforcement Learning Algorithm

SHI Tengfei 1, WANG Li 1, HUANG Zirong 1
1. College of Data Science, Taiyuan University of Technology, Jinzhong 030600

ABSTRACT Multi-agent reinforcement learning algorithms have difficulty adapting to environments in which the number of agents changes dynamically. To address this problem, a sequence to sequence multi-agent reinforcement learning algorithm (SMARL) based on sequential learning and block structure is proposed. The control network of an agent is divided into an action network and a target network, built on the deep deterministic policy gradient structure and the sequence-to-sequence structure, respectively, which removes the correlation between algorithm structure and agent scale. Inputs and outputs of the algorithm are also processed to break the correlation between algorithm policy and agent scale. Agents in SMARL can quickly adapt to a new environment, take different roles in a task, and achieve fast learning. Experiments show that the adaptability, performance and training efficiency of the proposed algorithm are superior to those of baseline algorithms.

Key Words Multi-agent Reinforcement Learning, Deep Deterministic Policy Gradient (DDPG), Sequence to Sequence (Seq2Seq), Block Structure

Citation SHI T F, WANG L, HUANG Z R. Sequence to Sequence Multi-agent Reinforcement Learning Algorithm. Pattern Recognition and Artificial Intelligence, 2021, 34(3): 206-213. DOI 10.16451/ki.issn1003-6059.202103002

CLC number TP18

Manuscript received October 10, 2020; accepted November 20, 2020. Supported by National Natural Science Foundation of China (No. 61872260). Recommended by Associate Editor CHEN Enhong.

In multi-agent reinforcement learning (MARL), agents interact with the environment and with other agents to obtain rewards, and use the information in those rewards to improve their own policies. MARL is very sensitive to changes in the environment: once the environment changes, a trained policy may stop working. A change in the number of agents is a typical environmental change and can invalidate both the existing model structure and the learned policy. This calls for MARL methods that adapt to dynamic changes in agent scale. MARL is now widely applied in many fields [1], such as game artificial intelligence (AI) [2], robot control [3] and traffic control [4]. MARL research covers a broad range; the work related to this paper falls into the following three areas.

1) Multi-agent performance. How agents can cooperate well so that the team as a whole performs well is a question every MARL method must address. Lowe et al. [5] proposed the multi-agent deep deterministic policy gradient (MADDPG), applicable to both cooperative and adversarial scenarios, which uses centralized training with decentralized execution so that agents learn to cooperate well, improving overall performance. Foerster et al. [6] proposed counterfactual multi-agent policy gradients (COMA), which also uses centralized training with decentralized execution, with a single critic and multiple actors; the actor networks are gated recurrent unit (GRU) networks, which improves the cooperation of the whole team. Wei et al. [7] proposed the multi-agent soft Q-learning algorithm (MASQL), which carries soft Q-learning over to the multi-agent setting: the agents take joint actions and a global return is used to judge them, which improves team cooperation to some extent. These algorithms improve cooperative and adversarial multi-agent performance to some degree, but all of them have difficulty adapting to dynamic changes in agent scale.

2) Multi-agent transferability. Agent transfer includes transfer between different agents in the same environment and transfer of agents across different environments. Studying how to transfer agents well can improve training efficiency and the agents' adaptability to the environment.
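Brys et al. [8] transfer agent policies by reshaping rewards. Although this solves the policy transfer problem, reshaping the rewards consumes substantial resources. Taylor et al. [9] proposed bidirectional transfer of task data between the source task and the target task so that the two are learned in parallel, which speeds up agent learning and knowledge transfer; training speed is still limited, however, when the number of agents is very large. Mnih et al. [10] use multiple threads to simulate several copies of the environment; the agent network learns in these copies simultaneously, and the learned knowledge is then transferred and merged into one network. This can be regarded as a form of knowledge transfer, but it does not directly solve the scale-change problem.

3) Multi-agent scalability and adaptability. In practical applications the number of agents is usually not fixed and can be very large. The common approach is to adjust the model's network structure by hand and then retrain extensively, or even train from scratch, so that the model fits the new agent scale. This is time-consuming and laborious and cannot cope with environments in which agent scale changes dynamically. Khan et al. [11] proposed training a single policy applicable to all agents and using it (parameter sharing) to control all of them, so that the algorithm adapts to environments with any number of agents. However, this method overlooks the effect of agent scale on the model's network structure. Zhang et al. [12] proposed representing agent observations through dimensionality reduction, mapping observations from different agent scales into the same dimension and feeding that representation to the reinforcement learning algorithm. This essentially enlarges the input dimension the network can accept, but when agent scale keeps growing it still exceeds the network's maximum range and the model can no longer run. Long et al. [13] improved MADDPG by preprocessing observations with an attention mechanism, implemented with an encoder, before feeding them to MADDPG. This adapts to changes in agent scale to some extent, but every change in scale still requires adjusting the network structure and retraining.

To address the failure of MARL caused by dynamic changes in agent scale, this paper proposes the sequence to sequence multi-agent reinforcement learning algorithm (SMARL). Agents in SMARL can quickly adapt to a new environment, take different task roles, and learn quickly.

1 Sequence to sequence multi-agent reinforcement learning algorithm

The core idea of SMARL is to decouple both the model's network structure and the model's policy from agent scale; the framework is shown in Fig. 1.

Fig. 1 Framework of SMARL

Structurally, the control network of an agent is first divided into two parallel modules: the agent action network (left side of Fig. 1) and the agent target network (right side of Fig. 1). The action executed by each agent is composed of the outputs of these two networks. To fit this structure, the agents' observation data and action data are partitioned. Observations are divided into each agent's local observation and the global observation of all agents, called the individual observation and the common observation in this paper. The individual observation does not change with agent scale. Likewise, actions are divided into common actions and individual actions: the intersection of all agents' action sets is the set of common actions, and the difference between an agent's action set and the common actions is that agent's set of individual actions. A common action is the action the agent executes, and an individual action is the target of that action. Common actions do not change with agent scale. The action each agent executes is composed of a common action and an individual action.

For example, suppose a two-dimensional grid world contains three movable robotic arms that can throw a ball to one another. Their common observation is the observation of the whole map in a unified coordinate system, and the individual observation is the observation in a coordinate system with the agent itself at the origin. Their common actions are up, down, left, right, and throw. Individual actions are determined by agent ID: the individual actions of agent 0 are agents 1 and 2; those of agent 1 are agents 0 and 2; those of agent 2 are agents 0 and 1.

After this partition, the algorithm separates the content that depends on agent scale from the content that does not. Since the deep deterministic policy gradient (DDPG) network [14] performs well in single-agent reinforcement learning, this paper, after partitioning observations and actions, treats the action policies of all agents as one policy and adopts DDPG as the internal structure of the agent action network. Khan et al. [11] demonstrated the effectiveness of controlling multiple agents with a single-agent network and a single policy. Since sequence-to-sequence (Seq2Seq) networks [15-16] are insensitive to input and output length, this paper adopts Seq2Seq as the internal structure of the agent target network and treats agent scale as sequence length.

The agent action network takes an agent's individual observation as input and outputs the agent's common action; the detailed framework is shown in Fig. 2.

Fig. 2 Framework of agent action network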
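The agent action network consists of multiple DDPG networks, one per agent, where the Actor network parameters are $\theta_i$, the Critic network parameters are $Q_i$, the Actor-target network parameters are $\theta'_i$, and the Critic-target network parameters are $Q'_i$, $i = 0, 1, \ldots, N-1$. Each DDPG network receives only the local observation of its own agent, taken with that agent as the "coordinate origin". Under this condition it is meaningful to control the actions of all agents with a single policy (parameter sharing). In addition, to implement parameter sharing, this paper follows the approach of asynchronous advantage actor-critic (A3C) [10] and adds to the agent action network a central parameter network that receives no gradient updates, with Actor network parameters $\theta_N$ and Critic network parameters $Q_N$. This network receives the parameters of the other DDPG networks through soft updates (soft-update hyperparameter $\tau = 0.01$) and then updates the other DDPG networks through soft updates, so that the parameters of all DDPG networks eventually reach the same single policy.

The agent action network is updated as follows. The Critic network is updated by minimizing

$$L_i = \frac{1}{B_{DDPG}} \sum_{b} \left(y_{ib} - Q(o_{ib}, a_{ib} \mid Q_i)\right)^2,$$

where $Q_i$ denotes the parameters of the Critic network, $Q(\cdot)$ is the network's evaluation, $B_{DDPG}$ is the batch size of the algorithm, $o_{ib}$, $a_{ib}$, $r_{ib}$, $o_{ib+1}$ are the sampled data,

$$y_{ib} = r_{ib} + \gamma\, Q'\left(o_{ib+1}, \mu'(o_{ib+1} \mid \theta'_i) \mid Q'_i\right),$$

and $\gamma$ is the discount factor. The Actor network is updated as follows:

$$\nabla_{\theta_i} J \approx \frac{1}{B_{DDPG}} \sum_{b} \left( \nabla_a Q(o, a \mid Q_i)\big|_{o = o_{ib},\, a = \mu(o_{ib})}\, \nabla_{\theta_i} \mu(o \mid \theta_i)\big|_{o = o_{ib}} \right),$$

where $\theta_i$ denotes the parameters of the Actor network and $\mu(\cdot)$ is the network's policy. The central parameter network and the other networks update each other as follows:

$$\theta_N \leftarrow \tau \theta_i + (1 - \tau)\theta_N, \qquad Q_N \leftarrow \tau Q_i + (1 - \tau) Q_N,$$
$$\theta_i \leftarrow \tau \theta_N + (1 - \tau)\theta_i, \qquad Q_i \leftarrow \tau Q_N + (1 - \tau) Q_i,$$

where the central parameter network has Actor network parameters $\theta_N$ and Critic network parameters $Q_N$; the other DDPG networks have Actor network parameters $\theta_i$ and Critic network parameters $Q_i$, $i = 0, 1, \ldots, N-1$; and $\tau$ is the soft-update hyperparameter.

The agent target network takes the agents' common observation as input and outputs the agents' individual actions; its framework is shown in Fig. 3. The network consists of one Seq2Seq network and one memory, and the Seq2Seq network has parameters $\delta$. The Seq2Seq network consists of an encoder and a decoder, both of which are recurrent neural networks (RNNs) internally. The encoder maps the input sequence to a higher-dimensional representation, and the decoder decodes that representation into a new output sequence. The Seq2Seq network is responsible for learning and predicting the cooperative relationships among agents. The agent target network follows the idea of reinforcement learning: the memory plays the role of Q in reinforcement learning, recording the mapping from an observation (sequence) to an action (sequence) together with the reward obtained. The Seq2Seq part corresponds to the Actor in reinforcement learning, learning the best mapping from observation sequences to action sequences and predicting the action sequence for a new observation sequence.

Fig. 3 Framework of agent target network (global observation sequence of all agents in the overall coordinate system, memory supplying training data, Seq2Seq encoder RNNs, attention layer, decoder RNNs, agent action targets as individual actions)

The length of the input sequence of the agent target network is the agent scale, and each element of the sequence is the observation of one agent. The length of the output sequence is also the agent scale, and each element is an agent ID. Both sequences are ordered by agent ID, and whenever the agent scale changes, the agents are renumbered starting from 0. Specifically, a reward function is first defined for the Seq2Seq network; following the idea of reinforcement learning, the mapping from observation sequence to action sequence with the largest reward is selected, that mapping is treated as a kind of translation, and the Seq2Seq network then learns it. The network output represents the cooperative relationships among agents. In addition, this paper introduces an attention mechanism into the Seq2Seq network to improve its performance [17]. The core formula of Seq2Seq is as follows:

$$\max_{\delta} \sum_{n=0}^{N} \ln p\left(a_0^{s}, \ldots, a_{N-1}^{s} \mid o_0^{s}, o_1^{s}, \ldots, o_{N-1}^{s}, \delta\right),$$

where $\delta$ denotes the parameters of Seq2Seq.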
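The mutual soft update between the central parameter network and the per-agent DDPG networks described above can be sketched as follows. This is a simplified illustration: parameters are flat lists of floats, the gradient steps that a real DDPG agent takes between synchronizations are omitted, and the order of the two phases within one round (`sync_round`, a hypothetical helper) is an assumption.

```python
def soft_update(target, source, tau=0.01):
    """Return tau * source + (1 - tau) * target, element-wise."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]

def sync_round(center, agents, tau=0.01):
    """One round of mutual soft updates (assumed ordering)."""
    # The central network absorbs a little of each agent's parameters ...
    for params in agents:
        center = soft_update(center, params, tau)
    # ... and each agent is then pulled a little toward the central network.
    agents = [soft_update(params, center, tau) for params in agents]
    return center, agents

center = [1.0]
agents = [[0.0], [2.0]]
for _ in range(500):
    center, agents = sync_round(center, agents)

# The gap between agents shrinks by a factor (1 - tau) per round.
spread = abs(agents[0][0] - agents[1][0])
```

With a small `tau` such as 0.01 the agents drift toward a shared parameter vector slowly, which matches the stated goal of all DDPG networks eventually reaching one policy.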

Long short-term memory


Figure: A simple LSTM gate with only input, output, and forget gates. LSTM gates may have more gates. [1]

Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture (an artificial neural network) published [2] in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Like most RNNs, an LSTM network is universal in the sense that given enough network units it can compute anything a conventional computer can compute, provided it has the proper weight matrix, which may be viewed as its program. Unlike traditional RNNs, an LSTM network is well-suited to learn from experience to classify, process and predict time series when there are very long time lags of unknown size between important events. This is one of the main reasons why LSTM outperforms alternative RNNs and Hidden Markov Models and other sequence learning methods in numerous applications. For example, LSTM achieved the best known results in unsegmented connected handwriting recognition, [3] and in 2009 won the ICDAR handwriting competition. LSTM networks have also been used for automatic speech recognition, and were a major component of a network that in 2013 achieved a record 17.7% phoneme error rate on the classic TIMIT natural speech dataset. [4]

1 Architecture

An LSTM network is an artificial neural network that contains LSTM blocks instead of, or in addition to, regular network units. An LSTM block may be described as a “smart” network unit that can remember a value for an arbitrary length of time. An LSTM block contains gates that determine when the input is significant enough to remember, when it should continue to remember or forget the value, and when it should output the value.

Figure: A typical implementation of an LSTM block.

A typical implementation of an LSTM block is shown to the right. The four units shown at the bottom of the figure are sigmoid units $y = s(\sum_i w_i x_i)$, where $s$ is some squashing function, such as the logistic function. The left-most of these units computes a value which is conditionally fed as an input value to the block’s memory. The other three units serve as gates to determine when values are allowed to flow into or out of the block’s memory. The second unit from the left (on the bottom row) is the “input gate”. When it outputs a value close to zero, it zeros out the value from the left-most unit, effectively blocking that value from entering into the next layer. The third unit from the left is the “forget gate”. When it outputs a value close to zero, the block will effectively forget whatever value it was remembering. The right-most unit (on the bottom row) is the “output gate”. It determines when the unit should output the value in its memory. The units containing the Π symbol compute the product of their inputs ($y = \prod_i x_i$). These units have no weights. The unit with the Σ symbol computes a linear function of its inputs ($y = \sum_i w_i x_i$). The output of this unit is not squashed so that it can remember the same value for many time-steps without the value decaying. This value is fed back in so that the block can “remember” it (as long as the forget gate allows). Typically, this value is also fed into the 3 gating units to help them make gating decisions.

2 Training

To minimize LSTM’s total error on a set of training sequences, iterative gradient descent such as backpropagation through time can be used to change each weight in proportion to its derivative with respect to the error. A major problem with gradient descent for standard RNNs is that error gradients vanish exponentially quickly with the size of the time lag between important events, as first realized in 1991. [5][6] With LSTM blocks, however, when error values are back-propagated from the output, the error becomes trapped in the memory portion of the block. This is referred to as an “error carousel”, which continuously feeds error back to each of the gates until they become trained to cut off the value. Thus, regular backpropagation is effective at training an LSTM block to remember values for very long durations.
LSTM can also be trained by a combination of artificial evolution for weights to the hidden units, and pseudo-inverse or support vector machines for weights to the output units. [7] In reinforcement learning applications LSTM can be trained by policy gradient methods, evolution strategies or genetic algorithms.

3 Applications

Applications of LSTM include:
• Robot control [8]
• Time series prediction [9]
• Speech recognition [10][11][12]
• Rhythm learning [13]
• Music composition [14]
• Grammar learning [15][16][17]
• Handwriting recognition [18][19]
• Human action recognition [20]
• Protein Homology Detection [21]

4 See also

• Artificial neural network
• Prefrontal Cortex Basal Ganglia Working Memory (PBWM)
• Recurrent neural network
• Time series
• Long-term potentiation

5 References

[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber (2015). “LSTM: A Search Space Odyssey”. arXiv:1503.04069.
[2] Sepp Hochreiter and Jürgen Schmidhuber (1997). “Long short-term memory” (PDF). Neural Computation 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
[3] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
[4] Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geoffrey (2013). “Speech Recognition with Deep Recurrent Neural Networks”. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on: 6645–6649.
[5] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
[6] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[7] Schmidhuber, J.; Wierstra, D.; Gagliolo, M.; Gomez, F. (2007). “Training Recurrent Networks by Evolino”. Neural Computation 19 (3): 757–779. doi:10.1162/neco.2007.19.3.757.
[8] H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and J. Schmidhuber. A System for Robotic Heart Surgery that Learns to Tie Knots Using Recurrent Neural Networks. Advanced Robotics, 22/13–14, pp. 1521–1537, 2008.
[9] J. Schmidhuber and D. Wierstra and F. J. Gomez. Evolino: Hybrid Neuroevolution / Optimal Linear Search for Sequence Learning. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, pp. 853–858, 2005.
[10] Graves, A.; Schmidhuber, J. (2005). “Framewise phoneme classification with bidirectional LSTM and other neural network architectures”. Neural Networks 18 (5–6): 602–610. doi:10.1016/j.neunet.2005.06.042.
[11] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007.
[12] Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geoffrey (2013). “Speech Recognition with Deep Recurrent Neural Networks”. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on: 6645–6649.
[13] Gers, F.; Schraudolph, N.; Schmidhuber, J. (2002). “Learning precise timing with LSTM recurrent networks”. Journal of Machine Learning Research 3: 115–143.
[14] D. Eck and J. Schmidhuber. Learning The Long-Term Structure of the Blues. In J. Dorronsoro, ed., Proceedings of Int. Conf. on Artificial Neural Networks ICANN'02, Madrid, pages 284–289, Springer, Berlin, 2002.
[15] Schmidhuber, J.; Gers, F.; Eck, D.; Schmidhuber, J.; Gers, F. (2002). “Learning nonregular languages: A comparison of simple recurrent networks and LSTM”. Neural Computation 14 (9): 2039–2041. doi:10.1162/089976602320263980.
[16] Gers, F. A.; Schmidhuber, J. (2001). “LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages”. IEEE Transactions on Neural Networks 12 (6): 1333–1340. doi:10.1109/72.963769.
[17] Perez-Ortiz, J. A.; Gers, F. A.; Eck, D.; Schmidhuber, J. (2003). “Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets”. Neural Networks 16 (2): 241–250. doi:10.1016/s0893-6080(02)00219-8.
[18] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 22, NIPS'22, pp 545–552, Vancouver, MIT Press, 2009.
[19] A. Graves, S. Fernandez, M. Liwicki, H. Bunke, J. Schmidhuber. Unconstrained online handwriting recognition with recurrent neural networks. Advances in Neural Information Processing Systems 21, NIPS'21, pp 577–584, 2008, MIT Press, Cambridge, MA, 2008.
[20] M. Baccouche, F. Mamalet, C Wolf, C. Garcia, A. Baskurt. Sequential Deep Learning for Human Action Recognition. 2nd International Workshop on Human Behavior Understanding (HBU), A. A. Salah, B. Lepri ed. Amsterdam, Netherlands. pp. 29–39. Lecture Notes in Computer Science 7065. Springer. 2011.
[21] Hochreiter, S.; Heusel, M.; Obermayer, K. (2007). “Fast model-based protein homology detection without alignment”. Bioinformatics 23 (14): 1728–1736. doi:10.1093/bioinformatics/btm247. PMID 17488755.

6 External links

• Recurrent Neural Networks with over 30 LSTM papers by Jürgen Schmidhuber's group at IDSIA
• Gers PhD thesis on LSTM networks.
• Fraud detection paper with two chapters devoted to explaining recurrent neural networks, especially LSTM.
• Paper on a high-performing extension of LSTM that has been simplified to a single node type and can train arbitrary architectures.
• Tutorial: How to implement LSTM in python with theano
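The LSTM block described in the Architecture section above (an input unit, three gates, and an unsquashed memory value that is fed back) can be sketched for a single block with a scalar memory. This is a minimal illustration, not a reference implementation: the function name `lstm_block_step`, the weight layout, and the choice to let the gates see the previous memory value as an extra input are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def lstm_block_step(x, c_prev, w):
    """One timestep of a single LSTM block with scalar memory c."""
    gate_in = list(x) + [c_prev]               # gates also see the stored value
    candidate = sigmoid(dot(w["input"], x))    # left-most unit: value offered to memory
    i = sigmoid(dot(w["input_gate"], gate_in))
    f = sigmoid(dot(w["forget_gate"], gate_in))
    o = sigmoid(dot(w["output_gate"], gate_in))
    c = f * c_prev + i * candidate             # linear, unsquashed memory update
    y = o * c                                  # output gate controls what is exposed
    return y, c

# Hypothetical weights: input gate shut, forget gate open, output gate open.
w = {
    "input": [0.0],
    "input_gate": [-20.0, 0.0],
    "forget_gate": [20.0, 0.0],
    "output_gate": [20.0, 0.0],
}
x = [1.0]   # constant input acting as a bias
c = 0.7
for _ in range(100):
    y, c = lstm_block_step(x, c, w)
# c stays very close to 0.7: the block "remembers" across 100 steps.
```

Flipping the sign of the forget-gate weight makes the stored value collapse toward zero in a single step, which is the "forget" behavior described above.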





As a result, a new translation technology has emerged: deep learning translation based on language models (Deep Learning Translation).





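Specifically, the technology rests on the following three technical principles:

1. Sequence-to-Sequence Model. A sequence-to-sequence model is a method for transforming an input sequence into an output sequence.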





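2. Attention Mechanism. The attention mechanism is a weight-allocation mechanism that lets the model concentrate on particular positions in the input sequence so that it can translate them better.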



3. Word Embedding. Word embedding is a method of converting the words in a text into vector representations.
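Items 2 and 3 above can be illustrated together with a toy example: words are looked up in an embedding table, and an attention step turns similarity scores into weights that sum to one. Everything here is an assumption made for illustration: the vocabulary, the 2-dimensional embedding values, and the use of dot-product scoring (one common choice; the text does not specify a scoring function).

```python
import math

# Hypothetical vocabulary and embedding table.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
embeddings = [
    [0.0, 0.0],  # <unk>
    [0.1, 0.3],  # the
    [0.9, 0.2],  # cat
    [0.4, 0.8],  # sat
]

def embed(tokens):
    """Map each token to its vector; unknown words share the <unk> vector."""
    return [embeddings[vocab.get(t, vocab["<unk>"])] for t in tokens]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention: weights over positions and the weighted average."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context

source = embed(["the", "cat", "sat"])
weights, context = attend(query=[1.0, 0.0], keys=source)
# The largest weight falls on "cat", whose embedding best matches the query.
```

In a full translation model the query would come from the decoder state and the keys from encoder states rather than raw embeddings; the weighting step itself has the same shape.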

















The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM’s ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).

There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18], who were the first to map the entire input sentence to a vector, and is related to Cho et al. [5], although the latter was used only for rescoring hypotheses produced by a phrase-based system. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, but it assumes a monotonic alignment between the inputs and the outputs [11].

Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.

The main result of this work is the following. On the WMT’14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple
left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.

Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).

Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.

A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings are close to each other while different sentences’ meanings will be far. A qualitative evaluation supports this claim, showing that our model is aware of word order and is fairly invariant to the active and
passive voice.

2 The model

The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs $(x_1, \ldots, x_T)$, a standard RNN computes a sequence of outputs $(y_1, \ldots, y_T)$ by iterating the following equation:

$$h_t = \mathrm{sigm}\left(W^{hx} x_t + W^{hh} h_{t-1}\right)$$
$$y_t = W^{yh} h_t$$

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.

The simplest strategy for general sequence learning is to map the input sequence to a fixed-sized vector using one RNN, and then to map the vector to the target sequence with another RNN (this approach has also been taken by Cho et al. [5]). While it could work in principle since the RNN is provided with all the relevant information, it would be difficult to train the RNNs due to the resulting long term dependencies (figure 1) [14, 4, 16, 15]. However, the Long Short-Term Memory (LSTM) [16] is known to learn problems with long range temporal dependencies, so an LSTM may succeed in this setting.

The goal of the LSTM is to estimate the conditional probability $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$ where $(x_1, \ldots, x_T)$ is an input sequence and $y_1, \ldots, y_{T'}$ is its corresponding output sequence whose length $T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \ldots, x_T)$ given by the last hidden state of the LSTM, and then computing the probability of $y_1, \ldots, y_{T'}$ with a standard LSTM-LM formulation whose initial hidden state is set to the representation $v$ of $x_1, \ldots, x_T$:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1}) \qquad (1)$$

In this equation, each $p(y_t \mid v, y_1, \ldots, y_{t-1})$ distribution is represented with a softmax over all the words in the vocabulary. We use the LSTM formulation from Graves [10]. Note that we require that each sentence ends
with a special end-of-sentence symbol “<EOS>”, which enables the model to define a distribution over sequences of all possible lengths. The overall scheme is outlined in figure 1, where the shown LSTM computes the representation of “A”, “B”, “C”, “<EOS>” and then uses this representation to compute the probability of “W”, “X”, “Y”, “Z”, “<EOS>”.

Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to “establish communication” between the input and the output. We found this simple data transformation to greatly improve the performance of the LSTM.

3 Experiments

We applied our method to the WMT’14 English to French MT task in two ways. We used it to directly translate the input sentence without using a reference SMT system and we used it to rescore the n-best lists of an SMT baseline. We report the accuracy of these translation methods, present sample translations, and visualize the resulting sentence representation.

3.1 Dataset details

We used the WMT’14 English to French dataset. We trained our models on a subset of 12M sentences consisting of 348M French words and 304M English words, which is a clean “selected” subset from [29]. We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set
together with 1000-best lists from the baseline SMT [29].

As typical neural language models rely on a vector representation for each word, we used a fixed vocabulary for both languages. We used 160,000 of the most frequent words for the source language and 80,000 of the most frequent words for the target language. Every out-of-vocabulary word was replaced with a special “UNK” token.

3.2 Decoding and Rescoring

The core of our experiments involved training a large deep LSTM on many sentence pairs. We trained it by maximizing the log probability of a correct translation $T$ given the source sentence $S$, so the training objective is

$$\frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)$$

where $\mathcal{S}$ is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:

$$\hat{T} = \arg\max_{T} \; p(T \mid S) \qquad (2)$$

We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation. At each timestep we extend each partial hypothesis in the beam with every possible word in the vocabulary. This greatly increases the number of the hypotheses so we discard all but the B most likely hypotheses according to the model’s log probability. As soon as the “<EOS>” symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam search (Table 1).

We also used the LSTM to rescore the 1000-best lists produced by the baseline system [29]. To rescore an n-best list, we computed the log probability of every hypothesis with our LSTM and took an even average with their score and the LSTM’s score.

3.3 Reversing the Source Sentences

While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source
sentences are reversed (the target sentences are not reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.

While we do not have a complete explanation to this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag” [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.

Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences (see sec. 3.7), which suggests that reversing the input sentences results in LSTMs with better memory utilization.

3.4 Training details

We found that the LSTM models are fairly easy to train. We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. Thus the deep LSTM uses 8000 real numbers to represent a sentence. We found deep LSTMs to significantly outperform shallow LSTMs, where each additional layer reduced perplexity by nearly 10%, possibly due to their much larger hidden state. We used a
naive softmax over 80,000 words at each output. The resulting LSTM has 384M parameters of which 64M are pure recurrent connections (32M for the “encoder” LSTM and 32M for the “decoder” LSTM). The complete training details are given below:

• We initialized all of the LSTM’s parameters with the uniform distribution between -0.08 and 0.08.
• We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs.
• We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128).
• Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, we compute $s = \|g\|_2$, where $g$ is the gradient divided by 128. If $s > 5$, we set $g = \frac{5g}{s}$.
• Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences in a minibatch are roughly of the same length, yielding a 2x speedup.

3.5 Parallelization

A C++ implementation of deep LSTM with the configuration from the previous section on a single GPU processes approximately 1,700 words per second. This was too slow for our purposes, so we parallelized our model using an 8-GPU machine. Each layer of the LSTM was executed on a different GPU and communicated its activations to the next GPU/layer as soon as they were computed. Our models have 4 layers of LSTMs, each of which resides on a separate GPU. The remaining 4 GPUs were used to parallelize the softmax, so each GPU was responsible for multiplying by a 1000 × 20000 matrix. The resulting implementation
achieved a speed of 6,300 (both English and French) words per second with a minibatch size of 128. Training took about ten days with this implementation.

3.6 Experimental Results

We used the cased BLEU score [24] to evaluate the quality of our translations. We computed our BLEU scores using multi-bleu.pl¹ on the tokenized predictions and ground truth. This way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29]. However, if we evaluate the best WMT’14 system [9] (whose predictions can be downloaded from \matrix) in this manner, we get 37.0, which is greater than the 35.8 reported by \matrix.

The results are presented in tables 1 and 2. Our best results are obtained with an ensemble of LSTMs that differ in their random initializations and in the random order of minibatches. While the decoded translations of the LSTM ensemble do not outperform the best WMT’14 system, it is the first time that a pure neural translation system outperforms a phrase-based SMT baseline on a large scale MT task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the best WMT’14 result if it is used to rescore the 1000-best list of the baseline system.

¹ There are several variants of the BLEU score, and each variant is defined with a perl script.

Method                                        test BLEU score (ntst14)
Bahdanau et al. [2]                           28.45
Baseline System [29]                          33.30
Single forward LSTM, beam size 12             26.17
Single reversed LSTM, beam size 12            30.59
Ensemble of 5 reversed LSTMs, beam size 1     33.00
Ensemble of 2 reversed LSTMs, beam size 12    33.27
Ensemble of 5 reversed LSTMs, beam size 2     34.50
Ensemble of 5 reversed LSTMs, beam size 12    34.81

Table 1: The performance of the LSTM on WMT’14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

Method                                                                    test BLEU score (ntst14)
Baseline System [29]                                                      33.30
Cho et al. [5]                                                            34.54
Best WMT’14 result [9]                                                    37.0
Rescoring the baseline 1000-best with a single forward LSTM               35.61
Rescoring the baseline 1000-best with a single reversed LSTM              35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs     36.5
Oracle Rescoring of the Baseline 1000-best lists                          ∼45

Table 2: Methods that use neural networks together with an SMT system on the WMT’14 English to French test set (ntst14).

3.7 Performance on long sentences

We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

3.8 Model Analysis

Figure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained after processing the phrases in the figures. The phrases are clustered by meaning, which in these examples is primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.

One of the attractive features of our model is its ability to turn a sequence of words into a vector of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly shows that the representations are sensitive to the order of words, while being fairly insensitive to the replacement of an active voice with a passive voice. The two-dimensional projections are obtained using PCA.

Type        Sentence

Our model   Ulrich UNK , membre du conseil d’administration du constructeur automobile Audi , affirme qu’il s’agit d’une pratique courante depuis des années pour que les téléphones portables puissent être collectés avant les réunions du conseil d’administration afin qu’ils ne soient pas utilisés comme appareils d’écoute à distance .

Truth       Ulrich Hackenberg , membre du conseil d’administration du constructeur automobile Audi , déclare que la collecte des téléphones portables avant les réunions du conseil , afin qu’ils ne puissent pas être utilisés comme appareils d’écoute à distance , est une pratique courante depuis des années .

Our model   “ Les téléphones cellulaires , qui sont vraiment une question , non seulement parce qu’ils pourraient potentiellement causer des interférences avec les appareils de navigation , mais nous savons , selon la FCC , qu’ils pourraient interférer avec les tours de téléphone cellulaire lorsqu’ils sont dans l’air ” , dit UNK .

Truth       “ Les téléphones portables sont véritablement un problème , non seulement parce qu’ils pourraient éventuellement créer des interférences avec les instruments de navigation , mais parce que nous savons , d’après la FCC , qu’ils pourraient perturber les antennes-relais de téléphonie mobile s’ils sont utilisés à bord ” , a déclaré Rosenker .

Our model   Avec la crémation , il y a un “ sentiment de violence contre le corps d’un être cher ” , qui sera “ réduit à une pile de cendres ” en très peu de temps au lieu d’un processus de décomposition “ qui accompagnera les étapes du deuil ” .

Truth       Il y a , avec la crémation , “ une violence faite au corps aimé ” , qui va être “ réduit à un tas de cendres ” en très peu de temps , et non après un processus de décomposition , qui “ accompagnerait les phases du deuil ” .

Table 3: A few examples of long translations produced by the LSTM alongside the ground truth translations. The reader can verify that the translations are sensible using Google translate.

Figure 3: The left plot shows the performance of our system as a function of sentence length, where the x-axis corresponds to the test sentences sorted by their length and is marked by the actual sequence lengths. There is no degradation on sentences with fewer than 35 words, and only a minor degradation on the longest sentences. The right plot shows the LSTM’s performance on sentences with progressively more rare words, where the x-axis corresponds to the test sentences sorted by their “average word frequency rank”.

4 Related work

There is a large body of work on applications of neural networks to machine translation. So far, the simplest and
most effective way of applying an RNN-Language Model (RNNLM) [23] or a Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-best lists of a strong MT baseline [22], which reliably improves translation quality.

More recently, researchers have begun to look into ways of including information about the source language into the NNLM. Examples of this work include Auli et al. [1], who combine an NNLM with a topic model of the input sentence, which improves rescoring performance. Devlin et al. [8] followed a similar approach, but they incorporated their NNLM into the decoder of an MT system and used the decoder’s alignment information to provide the NNLM with the most useful words in the input sentence. Their approach was highly successful and it achieved large improvements over their baseline.

Our work is closely related to Kalchbrenner and Blunsom [18], who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words. Similarly to this work, Cho et al. [5] used an LSTM-like RNN architecture to map sentences into vectors and back, although their primary focus was on integrating their neural network into an SMT system. Bahdanau et al. [2] also attempted direct translations with a neural network that used an attention mechanism to overcome the poor performance on long sentences experienced by Cho et al. [5] and achieved encouraging results. Likewise, Pouget-Abadie et al. [26] attempted to address the memory problem of Cho et al. [5] by translating pieces of the source sentence in a way that produces smooth translations, which is similar to a phrase-based approach. We suspect that they could achieve similar improvements by simply training their networks on reversed source sentences.

End-to-end training is also the focus of Hermann et al. [12], whose model represents the inputs and outputs by feedforward networks, and maps them to similar points in space. However, their approach cannot generate translations directly: to get a translation, they need to do a look up for the closest vector in the pre-computed database of sentences, or to rescore a sentence.

5 Conclusion

In this work, we showed that a large deep LSTM that has a limited vocabulary and makes almost no assumptions about problem structure can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task. The success of our simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.

We were surprised by the extent of the improvement obtained by reversing the words in the source sentences. We conclude that it is important to find a problem encoding that has the greatest number of short-term dependencies, as they make the learning problem much simpler. In particular, while we were unable to train a standard RNN on the non-reversed translation problem (shown in fig. 1), we believe that a standard RNN should be easily trainable when the source sentences are reversed (although we did not verify it experimentally).

We were also surprised by the ability of the LSTM to correctly translate very long sentences. We were initially convinced that the LSTM would fail on long sentences due to its limited memory, and other researchers reported poor performance on long sentences with a model similar to ours [5, 2, 26]. And yet, LSTMs trained on the reversed dataset had little difficulty translating long sentences.

Most importantly, we demonstrated that a simple, straightforward and relatively unoptimized approach can outperform an SMT system, so further work will likely lead to even greater translation accuracies. These results suggest that our approach will likely do well on other challenging sequence to sequence problems.

6 Acknowledgments

We thank Samy Bengio, Jeff Dean, Matthieu Devin, Geoffrey Hinton, Nal Kalchbrenner, Thang Luong, Wolfgang Macherey, Rajat Monga, Vincent Vanhoucke, Peng Xu, Wojciech Zaremba, and the Google Brain team for useful comments and discussions.
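
As a supplementary illustration, the gradient scaling rule and the learning-rate schedule described in Section 3.4 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `clip_gradient` and `learning_rate` are hypothetical, the gradient is represented as a plain list of floats, and the first halving is assumed to take place exactly at epoch 5 (the text does not pin this down).

```python
import math


def clip_gradient(grad, threshold=5.0):
    """Rescale grad so that its L2 norm does not exceed threshold.

    Mirrors the rule in Section 3.4: compute s = ||g||_2 and,
    if s > threshold, replace g by threshold * g / s.
    """
    s = math.sqrt(sum(x * x for x in grad))
    if s > threshold:
        return [threshold * x / s for x in grad]
    return list(grad)


def learning_rate(epoch, base_lr=0.7, start=5.0, interval=0.5):
    """Fixed rate before `start` epochs, then halved every `interval` epochs.

    Assumption: the first halving happens exactly at epoch == start.
    `epoch` may be fractional (e.g. 5.5 means halfway through the sixth epoch).
    """
    if epoch < start:
        return base_lr
    halvings = int((epoch - start) / interval) + 1
    return base_lr * 0.5 ** halvings
```

For example, a gradient of [6.0, 8.0] has norm 10, so `clip_gradient` scales it by 5/10; a gradient whose norm is at most 5 is returned unchanged. Under the stated assumption, `learning_rate` stays at 0.7 through the first five epochs and drops to 0.35 at epoch 5.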

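
The two data-preparation steps described in Sections 3.3 and 3.4, reversing the source sentences and grouping sentences of similar length into the same minibatch, can be sketched similarly. The paper does not say how the length grouping was done; sorting by length and slicing, as below, is one simple assumption. The names `reverse_source` and `length_bucketed_batches` are hypothetical, and each sentence is assumed to be a list of tokens.

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence; targets are untouched."""
    return [(src[::-1], tgt) for src, tgt in pairs]


def length_bucketed_batches(pairs, batch_size=128):
    """Group (source, target) pairs so each minibatch has similar lengths.

    Assumption: sort by source length (ties broken by target length),
    then cut the sorted list into consecutive chunks of batch_size.
    """
    ordered = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Tiny illustrative corpus (made-up tokens, not data from the paper).
corpus = [
    (["a", "b", "c"], ["x", "y", "z"]),
    (["d"], ["w"]),
]
batches = length_bucketed_batches(reverse_source(corpus), batch_size=1)
```

In a real training loop the resulting batches would typically be shuffled before each epoch so that the model does not always see short sentences first; that step is omitted here.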