00 Overview Flashcards

1
Q

Sequence and grammar

A

Sequences:

  • Probabilities over strings: n-gram models are linear and surface-oriented
  • Sequence classification: HMMs add one layer of abstraction, with class labels as hidden variables, but are still only linear

Grammar: adds hierarchical structure

  • Shift focus from “sequences” to “sentences”
  • Identifying underlying structure using formal rules
    • Declarative aspect: formal grammar
    • Procedural aspect: parsing strategy
  • Learn a probability distribution over the rules for scoring trees
2
Q

We have been doing data-driven learning by counting observations in what contexts?

A
  • feature vectors in semantic spaces; bag-of-words, etc.
  • previous n-1 words in n-gram models
  • previous n-1 states in HMMs
  • local sub-trees in PCFGs
3
Q

Vector space models

A
  • Data representation based on a spatial metaphor
  • Objects modeled as feature vectors positioned in a coordinate system
  • Semantic spaces = vector spaces for distributional lexical semantics
  • Some issues:
    • Usage = meaning? (The distributional hypothesis)
    • How do we define context / features? (BoW, n-grams, etc.)
    • Text normalization (lemmatization, stemming, etc.)
    • How do we measure similarity? Distance / proximity metrics. (Euclidean distance, cosine, dot-product, etc.)
    • Length-normalization (ways to deal with frequency effects / length-bias)
    • High-dimensional sparse vectors (i.e. few active features; consequences for low-level choice of data structure, etc.)
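As a minimal sketch of the similarity and sparsity points above: cosine similarity over bag-of-words vectors stored as dicts, so only active features take up space. The documents and counts below are invented for illustration.

```python
import math

def cosine(u, v):
    # The dot product only needs to touch the (usually small) set of active features.
    dot = sum(weight * v.get(feat, 0.0) for feat, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    # Dividing by the norms length-normalizes, damping raw frequency effects.
    return dot / (norm_u * norm_v)

doc1 = {"cat": 2, "sat": 1, "mat": 1}
doc2 = {"cat": 1, "sat": 1, "hat": 3}
print(cosine(doc1, doc2))
```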
4
Q

Classification

A

Supervised learning from labeled training data.

Given data annotated with predefined class labels, learn to predict membership for new/unseen objects.

5
Q

Cluster analysis

A

Unsupervised learning from unlabeled data.
  • Automatically forming groups of similar objects.
  • No predefined classes; we only specify the similarity measure.

6
Q

Issues with categorization tasks

A
  • Measuring similarity
  • Representing classes (e.g. exemplar-based vs. centroid-based)
  • Representing class membership (hard vs. soft)
7
Q

Examples of vector space classifiers

A

Rocchio vs. kNN

8
Q

Differences between Rocchio and kNN

A
  • Centroid- vs exemplar-based class representation
  • Linear vs non-linear decision boundaries
  • Assumptions about the distribution within the class
  • Complexity in training vs complexity in prediction
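A sketch of the centroid vs. exemplar contrast above, assuming cosine similarity and toy two-dimensional document vectors (the labels and data are invented). Rocchio reduces each class to its centroid, giving linear boundaries with the work done up front; kNN keeps every training example and lets the k nearest vote at prediction time, giving non-linear boundaries with the work deferred to prediction.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rocchio_predict(x, train_X, train_y):
    # Centroid-based: each class is represented by the mean of its vectors.
    # (Centroids are recomputed here for brevity; normally computed once at training time.)
    classes = sorted(set(train_y))
    centroids = {c: np.mean([v for v, y in zip(train_X, train_y) if y == c], axis=0)
                 for c in classes}
    return max(classes, key=lambda c: cosine(x, centroids[c]))

def knn_predict(x, train_X, train_y, k=3):
    # Exemplar-based: rank all training examples, let the k most similar vote.
    ranked = sorted(zip(train_X, train_y), key=lambda p: cosine(x, p[0]), reverse=True)
    votes = [y for _, y in ranked[:k]]
    return max(set(votes), key=votes.count)

X = [np.array(v, dtype=float) for v in [[1, 0], [2, 1], [0, 1], [1, 3]]]
y = ["sports", "sports", "politics", "politics"]
query = np.array([2.0, 0.5])
print(rocchio_predict(query, X, y), knn_predict(query, X, y))
```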
9
Q

Evaluation of classifiers

A
  • Accuracy, precision, recall and F-score
  • Multi-class evaluation: Micro-/macro-averaging
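A sketch of these metrics for single-label multi-class output, with invented gold and predicted label lists. Macro-averaging takes the mean of the per-class scores (all classes weighted equally); micro-averaging pools the counts first (frequent classes dominate).

```python
def prf(gold, pred, cls):
    # Per-class true positives, false positives, false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if p == cls and g == cls)
    fp = sum(1 for g, p in zip(gold, pred) if p == cls and g != cls)
    fn = sum(1 for g, p in zip(gold, pred) if p != cls and g == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, precision, recall, f1

gold = ["a", "a", "b", "c", "c", "c"]
pred = ["a", "b", "b", "c", "a", "c"]
classes = sorted(set(gold))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
macro_f1 = sum(prf(gold, pred, c)[5] for c in classes) / len(classes)

tp, fp, fn = (sum(prf(gold, pred, c)[i] for c in classes) for i in range(3))
micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
print(accuracy, macro_f1, micro_f1)
```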
10
Q

Types of clustering

A
  • Flat / partitional clustering
  • Hierarchical clustering
    • Agglomerative clustering (bottom-up)
    • Divisive clustering (top-down)
11
Q

Flat clustering algorithm

A
  • Example: k-Means.
  • Partitioning viewed as an optimization problem:
  • Minimize the within-cluster sum of squares.
  • Approximated by iteratively improving on some initial partition.
  • Issues: initialization / seeding, non-determinism, sensitivity to outliers, termination criterion, specifying k, specifying the similarity function.
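A minimal k-means sketch matching the bullets above: alternate between assigning points to the nearest centroid and recomputing centroids as cluster means, stopping when the assignment no longer changes. The data points and the seeding strategy (a random choice of k points) are just one illustrative option.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its assigned points,
        # which locally reduces the within-cluster sum of squares.
        for j in range(k):
            members = points[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assignment, centroids

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(points, k=2))
```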
12
Q

Agglomerative clustering

A
  • Bottom-up hierarchical clustering
  • Resulting in a set of nested partitions, often visualized as a dendrogram.
  • Issues:
    • Linkage criteria (how to measure inter-cluster similarity):
      • Single, Complete, Centroid, or Average Linkage
    • Cutting the tree
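A sketch of three of the linkage criteria above as inter-cluster similarity functions over an arbitrary pairwise similarity `sim`; the 1-D items and the similarity function are toy choices, and centroid linkage is omitted.

```python
def single_link(a, b, sim):
    return max(sim(x, y) for x in a for y in b)   # similarity of the closest pair

def complete_link(a, b, sim):
    return min(sim(x, y) for x in a for y in b)   # similarity of the farthest pair

def average_link(a, b, sim):
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

sim = lambda x, y: -abs(x - y)                    # higher value = more similar
cluster_a, cluster_b = [1.0, 2.0], [2.5, 6.0]
print(single_link(cluster_a, cluster_b, sim),
      complete_link(cluster_a, cluster_b, sim),
      average_link(cluster_a, cluster_b, sim))
```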
13
Q

Divisive clustering

A
  • Top-down hierarchical clustering
  • Generates the tree by iteratively applying a flat clustering method.
14
Q

Structured Probabilistic Models

A
  • Switching from a geometric view to a probability distribution view.
  • Model the probability that elements (words, labels) are in a particular configuration.
  • We applied many of the same concepts to both linear and hierarchical structures
15
Q

What have we been modelling with Structured Probabilistic Models?

A

Linear

  • Which string is more likely?
  • Which tag sequence is most likely?

Hierarchical

  • Which tree structure is most likely?
16
Q

Structured Probabilistic Models: Types

A

Linear

  • n-gram language models
    • Chain rule combines conditional probabilities to model context
    • Markov assumption allows us to limit the length of context
  • Hidden Markov Models
    • add a hidden layer of abstraction: PoS tags
    • also use the chain rule with the Markov assumption

Hierarchical

  • (Probabilistic) Context-Free Grammars (PCFGs)
    • Hidden layer of abstraction: trees
    • Chain rule over (P)CFG rules
17
Q

n-gram language models

A

n-gram language models

  • Linear Structured Probabilistic Model
  • Chain rule combines conditional probabilities to model context
  • Markov assumption allows us to limit the length of context
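Written out as a sketch for the bigram case (the notation here is illustrative rather than taken from the slides): the chain rule is exact, and the Markov assumption then truncates each history to the previous n-1 words.

```latex
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
                   \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})
```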
18
Q

Hidden Markov Model

A
  • Linear Structured Probabilistic Model
  • A hidden layer of abstraction: PoS tags
  • Uses the chain rule with the Markov assumption
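As a sketch in the same illustrative notation, the HMM scores words and tags jointly, with transition probabilities over tags and emission probabilities of words given tags (t_0 taken as a start symbol):

```latex
P(w_1, \dots, w_m, t_1, \dots, t_m) \approx \prod_{i=1}^{m} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```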
19
Q

PCFGs

A

(Probabilistic) Context-Free Grammars

  • Hierarchical Structured Probabilistic Model
  • Hidden layer of abstraction: trees
  • Chain rule over (P)CFG rules
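As a sketch, the probability of a tree T is the product of the probabilities of the rules used to build it, each rule conditioned on its left-hand side:

```latex
P(T) = \prod_{(A \rightarrow \beta) \in T} P(A \rightarrow \beta \mid A)
```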
20
Q

Maximum Likelihood Estimation

A

Estimate probabilities as relative frequencies: count how often an outcome occurs with its conditioning context in the training data and divide by the total count for that context (n-gram counts, HMM transition/emission counts, PCFG rule counts).
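For example (a sketch in illustrative notation), the MLE for a bigram probability and for a PCFG rule probability are just normalized counts:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
\qquad
P_{\mathrm{MLE}}(A \rightarrow \beta \mid A) = \frac{C(A \rightarrow \beta)}{C(A)}
```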
21
Q

HMMs can be used to:

A
  • calculate the probability of a string
  • find the most likely state sequence for a particular observation sequence
22
Q

CFGs can be used to:

A
  • A CFG can recognise whether a string belongs to the language it defines.
  • A PCFG can calculate the probability of a tree (where the sentence is encoded by the leaves).
23
Q

n-gram models can be used to

A

n-gram models can be used to calculate the probability of a string

24
Q

When to use the Viterbi algorithm?

A
  • HMM: Efficiently find the most likely state sequence
  • PCFG: Efficiently find the most likely parse tree
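A minimal Viterbi sketch for the HMM case, using plain probabilities for readability (a real implementation would work in log space and handle smoothing); the tag set, transition table, and emission table below are invented toy parameters.

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: trans[start].get(t, 1e-12) * emit[t].get(words[0], 1e-12) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Extend the best path ending in each previous tag by one transition + emission.
            prev, score = max(
                ((p, best[i - 1][p] * trans[p].get(t, 1e-12) * emit[t].get(words[i], 1e-12))
                 for p in tags),
                key=lambda pair: pair[1])
            best[i][t], back[i][t] = score, prev
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t])
    seq = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        seq.append(tag)
    return list(reversed(seq))

tags = ["DT", "NN", "VB"]
trans = {"<s>": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
         "DT": {"NN": 0.9, "VB": 0.1},
         "NN": {"NN": 0.3, "VB": 0.7},
         "VB": {"DT": 0.5, "NN": 0.5}}
emit = {"DT": {"the": 0.9}, "NN": {"dog": 0.5, "barks": 0.1}, "VB": {"barks": 0.6}}
print(viterbi(["the", "dog", "barks"], tags, trans, emit))
```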
25
Q

When to use the Forward algorithm?

A

HMM: To efficiently calculate the probability of the observation sequence
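As a sketch of the recurrence (illustrative notation): the forward algorithm has the same dynamic-programming shape as Viterbi, with the max over previous tags replaced by a sum, so the trellis accumulates total rather than best-path probability.

```latex
\alpha_i(t) = \sum_{t'} \alpha_{i-1}(t') \, P(t \mid t') \, P(w_i \mid t)
\qquad
P(w_1, \dots, w_m) = \sum_{t} \alpha_m(t)
```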
26
Q

When to use chart parsing?

A

To efficiently explore the parse tree search space
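As a sketch for a PCFG in Chomsky normal form (illustrative notation): the chart holds a cell per span and nonterminal, filled bottom-up from smaller sub-spans; taking the max gives the Viterbi parse, replacing it with a sum gives the inside probability.

```latex
\delta(i, j, A) = \max_{A \rightarrow B\,C, \; i < k < j} P(A \rightarrow B\,C) \, \delta(i, k, B) \, \delta(k, j, C)
```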
27
Q

Dynamic programming: What are our sub-problems?

A
  • HMM: Our sub-problems are prefixes of our full sequence
  • (P)CFG parsing: Our sub-problems are sub-trees which cover sub-spans of our input
28
Q

Evaluation

A

Linear

  • Tag accuracy is the most common evaluation metric for POS tagging, since usually the number of words being tagged is fixed.

Hierarchical

  • Coverage is a measure of how well a CFG models the full range of the language it is designed for.
  • The ParsEval metric evaluates parser accuracy by calculating the precision, recall and F1 score over labelled constituents.
29
Q

Tag accuracy

A

Tag accuracy is the most common evaluation metric for POS tagging, since usually the number of words being tagged is fixed.
30
Q

Coverage

A

Coverage is a measure of how well a CFG models the full range of the language it is designed for.
31
Q

ParsEval

A

The ParsEval metric evaluates parser accuracy by calculating the precision, recall and F1 score over labelled constituents.
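A sketch of ParsEval scoring with labelled constituents represented as (label, start, end) spans; the gold and predicted spans below are invented for illustration.

```python
def parseval(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)                     # labelled spans found in both trees
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(parseval(gold, pred))
```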