# Neural Machine Translation by Jointly Learning to Align and Translate

<https://arxiv.org/pdf/1409.0473.pdf>

<https://www.e-learn.cn/content/qita/2356479>

* [Dzmitry Bahdanau](https://www.semanticscholar.org/author/Dzmitry-Bahdanau/3335364), [Kyunghyun Cho](https://www.semanticscholar.org/author/Kyunghyun-Cho/1979489), [Yoshua Bengio](https://www.semanticscholar.org/author/Yoshua-Bengio/1751762)
* Published in ICLR 2015

## ABSTRACT

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation **aims at building a single neural network that can be jointly tuned to maximize the translation performance**. The models proposed recently for neural machine translation often **belong to a family of encoder–decoders** and encode a source sentence into **a fixed-length vector** from which a decoder generates a translation. In this paper, we conjecture that the use of **a fixed-length vector is a bottleneck in improving the performance** of this basic encoder–decoder architecture, and propose to extend this by allowing a model to **automatically (soft-)search** for parts of a source sentence that **are relevant to predicting a target word**, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

## 1 INTRODUCTION

*Neural machine translation* is a newly emerging approach to machine translation, recently proposed by Kalchbrenner and Blunsom (2013), Sutskever *et al.* (2014) and Cho *et al.* (2014b). Unlike the traditional phrase-based translation system (see, e.g., Koehn *et al.*, 2003) which consists of many small sub-components that are tuned separately, **neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation**.

Most of the proposed neural machine translation models belong to a family of *encoder–decoders* (Sutskever *et al.*, 2014; Cho *et al.*, 2014a), with an encoder and a decoder for each language, or involve a language-specific encoder applied to each sentence whose outputs are then compared (Hermann and Blunsom, 2014). **An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector**. The whole encoder–decoder system, which **consists of the encoder and the decoder for a language pair**, is jointly trained to **maximize the probability of a correct translation** given a source sentence.

A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho *et al.* (2014b) showed that indeed the performance of a basic encoder–decoder **deteriorates rapidly** as the length of an input sentence increases.

​ In order to address this issue, we introduce an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, **it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated**. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

The **most important distinguishing feature** of this approach from the basic encoder–decoder **is that it does not attempt to encode a whole input sentence into a single fixed-length vector**. Instead, it encodes the input sentence into a sequence of vectors and **chooses a subset of these vectors adaptively** while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.

In this paper, we show that the proposed approach of jointly learning to align and translate achieves significantly improved translation performance over the basic encoder–decoder approach. The improvement is **more apparent with longer sentences**, but can be observed with sentences of any length. On the task of English-to-French translation, the proposed approach achieves, with a single model, a translation performance comparable, or close, to the conventional phrase-based system. Furthermore, qualitative analysis reveals that the proposed model **finds a linguistically plausible (soft-)alignment between a source sentence and the corresponding target sentence**.

## 2 BACKGROUND: NEURAL MACHINE TRANSLATION

From a probabilistic perspective, translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e., $$\arg\max\_y p(y \mid x)$$. In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus. Once the conditional distribution is learned by a translation model, given a source sentence a corresponding translation can be generated by searching for the sentence that maximizes the conditional probability.

Recently, a number of papers have proposed the use of neural networks to directly learn this conditional distribution. This neural machine translation approach typically consists of two components, the first of which encodes a source sentence x and the second decodes to a target sentence y. For instance, **two recurrent neural networks (RNN) were used by (Cho *et al.*, 2014a) and (Sutskever *et al.*, 2014) to encode a variable-length source sentence into a fixed-length vector and to decode the vector into a variable-length target sentence**.

Despite being a quite new approach, neural machine translation has already shown promising results. Sutskever *et al.* (2014) reported that the neural machine translation based on RNNs with long short-term memory (LSTM) units achieves close to the state-of-the-art performance of the conventional phrase-based machine translation system on an English-to-French translation task. Adding neural components to existing translation systems, for instance, to score the phrase pairs in the phrase table (Cho *et al.*, 2014a) or to re-rank candidate translations (Sutskever *et al.*, 2014), has allowed to surpass the previous state-of-the-art performance level.

### 2.1 RNN ENCODER–DECODER

Here, we describe briefly the underlying framework, called *RNN Encoder–Decoder*, proposed by Cho *et al.* (2014a) and Sutskever *et al.* (2014) upon which **we build a novel architecture that learns to align and translate simultaneously**.

In the **Encoder–Decoder framework, an encoder reads the input sentence**, a sequence of vectors $$X=(x\_1,x\_2,\dots,x\_{T\_x})$$, into a vector $$c$$. The most common approach is to use an RNN such that

$$
h\_t=f(x\_t,h\_{t-1})
$$

and $$c=q(\lbrace h\_1,\dots,h\_{T\_x}\rbrace)$$,

where $$h\_t \in \mathbb{R}^n$$ is **a hidden state at time t**, and **c is a vector generated from the sequence of the hidden states**. f and q are some nonlinear functions. Sutskever *et al.* (2014) used an LSTM as f and $$q(\lbrace h\_1,\dots,h\_T\rbrace)=h\_T$$, for instance.
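The recurrence $$h\_t=f(x\_t,h\_{t-1})$$ and the summary $$c=h\_T$$ can be illustrated with a minimal sketch. It assumes a plain tanh RNN for f (not the LSTM mentioned above), an all-zero initial state, and random toy weights; `rnn_step`, `encode`, and all sizes are hypothetical names and values chosen for illustration only.

```python
import math
import random

def rnn_step(x_t, h_prev, W_x, W_h):
    """One step h_t = f(x_t, h_{t-1}), here with f = tanh(W_x x_t + W_h h_{t-1})."""
    h_t = []
    for r in range(len(h_prev)):
        total = sum(W_x[r][k] * x_t[k] for k in range(len(x_t)))
        total += sum(W_h[r][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def encode(xs, W_x, W_h):
    """Run the RNN over the source sequence; return all hidden states and c."""
    h = [0.0] * len(W_h)      # h_0: assumed all-zero initial state
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h)
        states.append(h)
    c = states[-1]            # q({h_1, ..., h_T}) = h_T
    return states, c

random.seed(0)
d, n = 3, 4                   # toy input and hidden sizes (assumptions)
W_x = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]

short_sentence = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(2)]
long_sentence = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(20)]

states_short, c_short = encode(short_sentence, W_x, W_h)
states_long, c_long = encode(long_sentence, W_x, W_h)

# Both summaries have the same size n, however long the input was.
print(len(c_short), len(c_long))
```

The 2-word and the 20-word input both end up in a vector of the same size, which is the fixed-length constraint that the rest of these notes are concerned with.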

The decoder is often trained to predict the next word $$y\_{t'}$$ given the context vector c and all the previously predicted words $$\lbrace y\_1,\dots,y\_{t'-1}\rbrace$$. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

![](/files/-Lpsm_WTyDTeg1j1Qe3R)

where $$y=(y\_1,\dots,y\_{T\_y})$$. With an RNN, each conditional probability is modeled as

$$p(y\_t \mid \lbrace y\_1,\dots,y\_{t-1}\rbrace,c)=g(y\_{t-1},s\_t,c)$$

where **g is a nonlinear, potentially multi-layered, function** that outputs the probability of $$y\_t$$, and $$s\_t$$ is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
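The decomposition into ordered conditionals can be made concrete with a small sketch. Each per-step distribution below is a hand-written table standing in for the output of g; the words and probabilities are invented for illustration, and `sequence_log_prob` is a hypothetical helper, not something from the paper.

```python
import math

def sequence_log_prob(step_distributions, target):
    """Sum log p(y_t | y_<t, c) over the target words.

    step_distributions[t] is a dict word -> probability standing in for
    g(y_{t-1}, s_t, c) at step t (made-up values, not model outputs).
    """
    return sum(math.log(dist[word]) for dist, word in zip(step_distributions, target))

# Made-up conditionals for a 3-word target sentence.
dists = [
    {"le": 0.7, "la": 0.3},
    {"chat": 0.6, "chien": 0.4},
    {"dort": 0.9, "mange": 0.1},
]
target = ["le", "chat", "dort"]

log_p = sequence_log_prob(dists, target)
p = math.exp(log_p)   # the product 0.7 * 0.6 * 0.9, up to rounding
print(log_p, p)
```

Working in log space is a common choice here because the product of many small conditionals underflows quickly for long sentences.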

## 3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence during decoding a translation (Sec. 3.1).

### 3.1 DECODER: GENERAL DESCRIPTION

​ In a new model architecture, we define each conditional probability in Eq. (2) as:

$$p(y\_i|y\_1,...,y\_{i-1},x)=g(y\_{i-1},s\_i,c\_i)$$

where $$s\_i$$ is an RNN hidden state for time i, computed by:

$$s\_i=f(s\_{i-1},y\_{i-1},c\_i)$$

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector $$c\_i$$ for each target word $$y\_i$$.

The context vector $$c\_i$$ depends on a sequence of *annotations* $$(h\_1,\dots,h\_{T\_x})$$ to which an encoder maps the input sentence (in the figure below, each box that merges the left-arrow and right-arrow h is written as h without an arrow). **Each annotation $$h\_i$$ contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence**. We explain in detail how the annotations are computed in the next section.

![](/files/-Lpsm_WZ3IEq5GjWNmgT)

Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word $$y\_t$$ given a source sentence $$(x\_1,x\_2,\dots,x\_T)$$.

The context vector $$c\_i$$ is, then, computed as a weighted sum of these annotations $$h\_i$$:

$$
c\_i=\sum\_{j=1}^{T\_x}\alpha\_{ij}h\_j
$$

The weight $$\alpha\_{ij}$$ of each annotation $$h\_j$$ is computed by a softmax over the energies (as explained below, the energies come from a small network):

$$
\alpha\_{ij}=\frac{\exp(e\_{ij})}{\sum\_{k=1}^{T\_x}\exp(e\_{ik})}
$$

where $$e\_{ij}=a(s\_{i-1},h\_j)$$ is an *alignment model* which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state $$s\_{i-1}$$ (just before emitting $$y\_i$$, Eq. (4)) and the j-th annotation $$h\_j$$ of the input sentence.
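The two formulas above, the softmax that turns energies into weights and the weighted sum that yields the context vector, can be sketched as follows. The annotations and energies are toy numbers chosen by hand, and `softmax` and `context_vector` are hypothetical helper names; subtracting the maximum energy before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """alpha_ij = exp(e_ij) / sum_k exp(e_ik), computed in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [v / total for v in exps]

def context_vector(weights, annotations):
    """c_i = sum_j alpha_ij * h_j."""
    dim = len(annotations[0])
    return [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]

# Toy annotations h_1..h_3 (dimension 2) and made-up energies e_i1..e_i3.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
energies = [2.0, 0.0, 0.0]

alpha = softmax(energies)
c_i = context_vector(alpha, annotations)
print(alpha, c_i)
```

Because the first energy is the largest, the first annotation receives the largest weight and dominates the context vector, while the other two still contribute a little. That is the "soft" part of the soft search.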

We parametrize the alignment model a as **a feedforward neural network** which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable. Instead, the alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.
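A minimal sketch of such a feedforward scorer is given below. It assumes the single-hidden-layer form $$e\_{ij}=v\_a^\top\tanh(W\_a s\_{i-1}+U\_a h\_j)$$, which is the form described in the paper's appendix rather than in this section. The sizes and the random weights are toy assumptions, and `matvec`, `random_matrix`, and `alignment_score` are hypothetical helper names.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def random_matrix(rows, cols):
    """Toy weight matrix with entries drawn uniformly from [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def alignment_score(s_prev, h_j, W_a, U_a, v_a):
    """e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j): a one-hidden-layer scorer."""
    pre = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return sum(v * math.tanh(p) for v, p in zip(v_a, pre))

random.seed(1)
n_dec, n_enc, n_att = 3, 4, 5   # decoder, annotation, hidden sizes (assumptions)
W_a = random_matrix(n_att, n_dec)
U_a = random_matrix(n_att, n_enc)
v_a = [random.uniform(-1, 1) for _ in range(n_att)]

s_prev = [random.uniform(-1, 1) for _ in range(n_dec)]          # stands in for s_{i-1}
annotations = [[random.uniform(-1, 1) for _ in range(n_enc)] for _ in range(6)]

# One energy per source position j, all scored against the same s_{i-1}.
energies = [alignment_score(s_prev, h_j, W_a, U_a, v_a) for h_j in annotations]
print(energies)
```

Every operation here is differentiable, which is what lets the gradient of the cost flow back through the alignment. In this form the term $$U\_a h\_j$$ does not depend on i, so it could be computed once per source sentence and reused at every decoding step.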

We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments. **Let $$\alpha\_{ij}$$ be a probability that the target word $$y\_i$$ is aligned to, or translated from, a source word $$x\_j$$. Then, the i-th context vector $$c\_i$$ is the expected annotation over all the annotations with probabilities $$\alpha\_{ij}$$.** (Note: a weighted sum whose weights are probabilities is, by definition, an expectation.)

The probability $$\alpha\_{ij}$$, or its associated energy $$e\_{ij}$$, reflects the importance of the annotation $$h\_j$$ with respect to the previous hidden state $$s\_{i-1}$$ in deciding the next state $$s\_i$$ and generating $$y\_i$$. **Intuitively, this implements a mechanism of attention in the decoder**. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.

### 3.2 ENCODER: BIDIRECTIONAL RNN FOR ANNOTATING SEQUENCES

