Machine Translation Flashcards

1
Q

Word ordering and Subject-Verb-Object order

A

Languages differ in the basic word order of verbs, subjects and objects in simple declarative clauses.

Examples:

  • French, English, and Mandarin have SVO (Subject-Verb-Object) order
  • Hindi and Japanese have SOV (Subject-Object-Verb) order
  • Arabic has VSO (Verb-Subject-Object) order
2
Q

Word alignment, spurious words

A

Word alignment is the correspondence between words in the source and target sentences.

Types of alignment:

  • many-to-one
  • one-to-many
  • many-to-many

Spurious words are words that have no counterpart in the other language's sentence (they are aligned to nothing).
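
A small worked example, in the spirit of the classic Jurafsky & Martin English-Spanish example (the exact alignment pairs below are my own illustrative choice):

```python
# Illustrative alignment: English -> Spanish.
source = ["Mary", "did", "not", "slap", "the", "green", "witch"]
target = ["Maria", "no", "daba", "una", "bofetada", "a", "la", "bruja", "verde"]

# Alignment as (source_index, target_index) pairs.
# "did not" -> "no" is many-to-one; "slap" -> "daba una bofetada" is one-to-many.
alignment = [(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (3, 4), (4, 6), (5, 8), (6, 7)]

aligned_src = {s for s, _ in alignment}
aligned_tgt = {t for _, t in alignment}

# Spurious words: words on one side aligned to nothing on the other.
spurious_src = [w for i, w in enumerate(source) if i not in aligned_src]  # []
spurious_tgt = [w for i, w in enumerate(target) if i not in aligned_tgt]  # ["a"]
print(spurious_src, spurious_tgt)
```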

3
Q

Statistical machine translation (SMT) using language model and translation model

A

Machine translation can be formulated as a structured prediction task. Given a source sentence x, find the most probable target sentence ŷ:

ŷ = argmax_y P(y|x)

Statistical machine translation (SMT) uses Bayes’ rule to decompose the probability model into two components that can be learned separately:

ŷ = argmax_y P(x|y) · P(y)

where:

  • P(x|y) is a translation model
  • P(y) is a language model

We have that:

  • P(x|y) assigns large probability to target strings y that contain the words needed to account for x, roughly in the right places (faithfulness)
  • P(y) assigns large probability to well-formed target strings y, regardless of their connection to x (fluency)

Together, P(x|y) and P(y) assign a large probability to translation pairs that are both faithful and well formed.
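
A minimal sketch of this decomposition, assuming the translation model and language model are available as black-box log-probability functions and that we can enumerate a small set of candidate translations (a real SMT decoder searches this space rather than enumerating it):

```python
def smt_decode(x, candidates, translation_logprob, language_logprob):
    """Noisy-channel decoding: pick y maximizing log P(x|y) + log P(y).

    x: source sentence (string)
    candidates: iterable of candidate target sentences y
    translation_logprob(x, y): log P(x|y), the translation (faithfulness) model
    language_logprob(y): log P(y), the language (fluency) model
    """
    return max(
        candidates,
        key=lambda y: translation_logprob(x, y) + language_logprob(y),
    )

# Toy usage with made-up scores (illustrative only).
candidates = ["the cat is black", "black the cat is"]
tm = lambda x, y: 0.0                                   # pretend both are equally faithful
lm = lambda y: -1.0 if y.startswith("the") else -5.0    # prefer the fluent word order
print(smt_decode("le chat est noir", candidates, tm, lm))  # "the cat is black"
```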

4
Q

Neural machine translation (NMT)

A

Neural machine translation (NMT) models the translation task through a single artificial neural network.

Let x be the source text and y = y1 … ym be the target text.
In contrast to SMT, in NMT we directly model P(y|x), using an approach similar to the one adopted for language modeling:

P(y|x) = P(y1|x) · P(y2|y1, x) · … · P(ym|y1, …, ym-1, x)
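
A short sketch of how this factorization is used to score a translation: the log-probability of y given x is the sum of the per-step conditional log-probabilities. Here step_logprob is a hypothetical stand-in for the network's softmax output:

```python
def sequence_logprob(x, y, step_logprob):
    """log P(y|x) = sum over t of log P(y_t | y_1..y_(t-1), x).

    step_logprob(x, prefix, word) is assumed to return
    log P(word | prefix, x) from the neural model.
    """
    total = 0.0
    for t, word in enumerate(y):
        total += step_logprob(x, y[:t], word)
    return total
```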

5
Q

Encoder-decoder neural architecture (seq2seq) general idea, components

A

Encoder-decoder networks, also called sequence-to-sequence (seq2seq) networks, are models capable of generating contextually appropriate sequences of arbitrary length.

The encoder-decoder model consists of two components:

  • The encoder is a neural network that produces a representation of the source sentence.
  • The decoder is an autoregressive language model that generates the target sentence, conditioned on the output of the encoder (Autoregressive = takes its own output as new input).

The key idea underlying encoder-decoder networks: the output of the encoder, called the context, drives the translation together with the decoder's previously generated output.

6
Q

Autoregressive encoder-decoder using RNN for machine translation: greedy inference algorithm

A

P(y|x) can be computed as follows:

  1. run an RNN encoder over x = x1 … xn, performing forward inference and generating hidden states h^e_t for t from 1 to n
  2. run an RNN decoder performing autoregressive generation; to generate y_t at each step t, use:
  • the final encoder hidden state h^e_n
  • the previous decoder hidden state h^d_(t-1)
  • the embedding of the previously generated word y_(t-1)

Generation stops when the end-of-sentence marker is predicted.

Draw the inference process (slide 27, PDF 12).
Write the model equations from slide 28 (hint: g and f are affine functions).
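
Below is a minimal numpy sketch of the greedy loop. The weight names (E, We, Ue, Wd, Ud, C, Wo), the tanh nonlinearity, and initializing the decoder state with the context are assumptions made for illustration; they follow the standard formulation hinted at above (g and f affine), not necessarily the exact equations on the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_decode(src_ids, params, max_len=30, eos_id=1, bos_id=0):
    """Greedy inference for an RNN encoder-decoder (illustrative weight names)."""
    E, We, Ue, Wd, Ud, C, Wo = (params[k] for k in ("E", "We", "Ue", "Wd", "Ud", "C", "Wo"))

    # Encoder: h^e_t = tanh(We x_t + Ue h^e_(t-1)), x_t = embedding of source word t
    h = np.zeros(Ue.shape[0])
    for i in src_ids:
        h = np.tanh(We @ E[i] + Ue @ h)
    context = h                              # h^e_n: the context vector

    # Decoder: h^d_t = g(y_(t-1), h^d_(t-1), c), with g affine followed by tanh
    #          y_t   = argmax softmax(f(h^d_t)), with f affine
    hd = context                             # initialize the decoder state with the context
    prev = bos_id
    output = []
    for _ in range(max_len):
        hd = np.tanh(Wd @ E[prev] + Ud @ hd + C @ context)
        prev = int(np.argmax(softmax(Wo @ hd)))   # greedy choice of the next word
        if prev == eos_id:                   # stop at the end-of-sentence marker
            break
        output.append(prev)
    return output
```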

7
Q

Encoder-decoder with RNN: training, teacher forcing

A

Given the source text and the gold translation, we compute the average loss of our next-word predictions over the translation.

At each decoding step i, the decoder computes a loss L_i that measures how far the predicted distribution is from the gold one.

The total loss is the average cross-entropy loss L = 1/T · Σ (i=1 to T) L_i.

During training, the decoder uses the gold translation words as input for the next-step prediction. This is called teacher forcing.
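
A sketch of a teacher-forced loss computation, assuming a hypothetical decoder_step(prev_word, state, context) that returns the next-word distribution and the new decoder state:

```python
import numpy as np

def teacher_forced_loss(gold_ids, context, decoder_step, bos_id=0):
    """Average cross-entropy over the gold translation y_1 .. y_T.

    decoder_step(prev_word_id, prev_state, context) is assumed to return
    (probs, new_state), where probs is the next-word probability distribution.
    """
    state = context
    prev = bos_id
    losses = []
    for gold in gold_ids:
        probs, state = decoder_step(prev, state, context)
        losses.append(-np.log(probs[gold] + 1e-12))  # L_i = -log P(gold word)
        prev = gold        # teacher forcing: feed the GOLD word, not our prediction
    return float(np.mean(losses))                    # L = (1/T) * sum_i L_i
```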

8
Q

Attention-based neural architecture idea

A

The context vector c must represent the whole source sentence in one fixed-length vector. This is called the bottleneck problem.

The attention mechanism allows the decoder to get information from all the hidden states of the encoder.

The idea is to compute a context c_i at each decoding step i, as a weighted sum of all the encoder hidden states h^e_j.

9
Q

Attention-based neural architecture: dynamic context vector

A

Attention replaces the static context c with a context c_i dynamically computed from all encoder hidden states:

h^d_i = g(c_i, h^d_(i-1), y_(i-1))

How is c_i computed?

  1. At each step i during decoding, we compute relevance scores score(h^d_(i-1), h^e_j) for each encoder hidden state h^e_j.
  2. We normalize the scores with a softmax to create weights α_ij for each j.
  3. We finally compute a fixed-length context vector that takes into account information from all of the encoder hidden states and that is dynamically updated:
    c_i = Σ (over j) α_ij · h^e_j
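
A small numpy sketch of steps 1-3 using dot-product scores (the scoring function can be swapped for the ones on the next card):

```python
import numpy as np

def attention_context(h_dec_prev, H_enc):
    """Compute the dynamic context c_i from all encoder hidden states.

    h_dec_prev: previous decoder state h^d_(i-1), shape (d,)
    H_enc: encoder hidden states stacked as rows, shape (n, d)
    """
    scores = H_enc @ h_dec_prev                       # 1. score(h^d_(i-1), h^e_j) = dot product
    scores = scores - scores.max()
    alphas = np.exp(scores) / np.exp(scores).sum()    # 2. softmax -> weights alpha_ij
    c_i = alphas @ H_enc                              # 3. c_i = sum_j alpha_ij * h^e_j
    return c_i, alphas
```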
10
Q

Attention-based neural architecture: scoring functions for creating weights

A
  • The simplest score is dot-product attention: score(h^d_(i-1), h^e_j) = h^d_(i-1) · h^e_j. It requires the encoder and decoder hidden states to have the same dimension.
  • Bilinear model: score(h^d_(i-1), h^e_j) = h^d_(i-1) · W_s · h^e_j, where W_s is a matrix of learnable parameters. This score allows the encoder and decoder to use different dimensions for their hidden states.
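
The two scoring functions side by side, as a short sketch (the shape of W_s, decoder dimension by encoder dimension, is an assumption):

```python
import numpy as np

def dot_score(h_dec_prev, h_enc_j):
    # Dot-product attention: needs encoder/decoder states of the same size.
    return h_dec_prev @ h_enc_j

def bilinear_score(h_dec_prev, h_enc_j, Ws):
    # Bilinear attention: Ws (learned, shape d_dec x d_enc) lets the encoder
    # and decoder use hidden states of different dimensions.
    return h_dec_prev @ Ws @ h_enc_j
```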
11
Q

Search tree for the decoder

A

A greedy algorithm makes choices that are locally optimal, but this may not find the highest probability translation.

We define the search tree for the decoder:

  • the branches are the actions of generating a token
  • the nodes are the states, representing the generated prefix of the translation

Unfortunately, dynamic programming is not applicable to this search tree, because of long-distance dependencies between the output decisions.

12
Q

Beam search for the decoder

A

In beam search we keep the K best hypotheses at each step; the parameter K is called the beam width.

Beam search algorithm:

  1. start with the K best initial hypotheses (the K most probable first tokens)
  2. at each step, expand each of the K hypotheses with every word in the vocabulary, resulting in V · K new hypotheses (V is the vocabulary size), which are all scored
  3. prune the V · K hypotheses down to the K best hypotheses
  4. when a complete hypothesis is found, remove it from the frontier and reduce the beam size by one, stopping when K = 0

Find this in the literature.
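
A sketch of this procedure, assuming a hypothetical step_logprobs(prefix) that returns a log-probability for every vocabulary word given the prefix and the source:

```python
import heapq

def beam_search(step_logprobs, K=4, eos="</s>", max_len=30):
    """Beam search sketch. step_logprobs(prefix) is assumed to return a
    dict {word: log P(word | prefix, x)} over the whole vocabulary."""
    width = K
    beam = [(0.0, [])]                        # one empty hypothesis to start from
    completed = []
    for _ in range(max_len):
        if width == 0 or not beam:
            break
        # Expand each hypothesis with every vocabulary word (up to V*K candidates).
        candidates = [(score + lp, prefix + [word])
                      for score, prefix in beam
                      for word, lp in step_logprobs(prefix).items()]
        # Prune down to the `width` best hypotheses.
        beam = heapq.nlargest(width, candidates)
        # Move complete hypotheses off the frontier and shrink the beam.
        still_open = []
        for score, prefix in beam:
            if prefix[-1] == eos:
                completed.append((score, prefix))
                width -= 1
            else:
                still_open.append((score, prefix))
        beam = still_open
    completed.extend(beam)                    # fall back to unfinished hypotheses
    return max(completed)[1] if completed else []
```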

13
Q

Evaluation and the BLEU metric

A

The most popular automatic metric for MT systems is called BLEU, for BiLingual Evaluation Understudy.

The N-gram precision for a candidate translation is the percentage of N-grams in the candidate that also occur in the reference translation.

BLEU combines the 1-, 2-, 3-, and 4-gram precisions by means of their geometric mean. A brevity penalty for too-short translations is also applied.
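
A simplified sketch of the computation: unclipped n-gram precisions for a single reference, combined with the geometric mean and a brevity penalty (real BLEU clips n-gram counts and is computed at the corpus level):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of 1..4-gram
    precisions times a brevity penalty (no clipping, one reference)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), Counter(ngrams(reference, n))
        if not cand:
            return 0.0
        overlap = sum(1 for g in cand if ref[g] > 0)
        precisions.append(overlap / len(cand))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean
```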
