00 Overview Flashcards
Sequences and grammar
Sequences:
- Probabilities over strings: n-gram models; linear and surface-oriented
- Sequence classification: HMMs add one layer of abstraction, with class labels as hidden variables, but still only linear
Grammar: adds hierarchical structure
- Shift focus from “sequences” to “sentences”
- Identifying underlying structure using formal rules
- Declarative aspect: formal grammar
- Procedural aspect: parsing strategy
- Learn probability distribution over the rules for scoring trees
In what contexts have we been counting observations for data-driven learning?
- feature vectors in semantic spaces; bag-of-words, etc.
- previous n-1 words in n-gram models
- previous n-1 states in HMMs
- local sub-trees in PCFGs
Vector space models
- Data representation based on a spatial metaphor
- Objects modeled as feature vectors positioned in a coordinate system
- Semantic spaces = VS for distributional lexical semantics
- Some issues:
- Usage = meaning? (The distributional hypothesis)
- How do we define context / features? (BoW, n-grams, etc.)
- Text normalization (lemmatization, stemming, etc.)
- How do we measure similarity? Distance / proximity metrics (Euclidean distance, cosine, dot product, etc.); see the cosine sketch after this list.
- Length-normalization (ways to deal with frequency effects / length-bias)
- High-dimensional sparse vectors (i.e. few active features; consequences for low-level choice of data structure, etc.)
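As a minimal illustration of the similarity point above, a cosine sketch over sparse bag-of-words vectors (the toy vectors and their counts are invented):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (dicts/Counters)."""
    # Dot product over the features that are active in both vectors.
    dot = sum(u[f] * v[f] for f in u if f in v)
    # Length-normalisation counters frequency effects / length bias.
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy bag-of-words context vectors (invented counts).
dog = Counter({"bark": 4, "leash": 2, "pet": 5})
cat = Counter({"purr": 3, "leash": 1, "pet": 6})
print(cosine(dog, cat))
```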
Classification
Supervised learning from labeled training data.
Given data annotated with predefined class labels, learn to predict membership for new/unseen objects.
Cluster analysis
Unsupervised learning from unlabeled data.
Automatically forming groups of similar objects.
No predefined classes; we only specify the similarity measure.
Issues with categorization tasks
- Measuring similarity
- Representing classes (e.g. exemplar-based vs. centroid-based)
- Representing class membership (hard vs. soft)
Examples of vector space classifiers
Rocchio vs. kNN
Differences between Rocchio and kNN
- Centroid- vs exemplar-based class representation
- Linear vs non-linear decision boundaries
- Assumptions about the distribution within the class
- Complexity in training vs complexity in prediction
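A rough contrast in code (the toy vectors and labels below are invented): Rocchio predicts by nearest class centroid, kNN by majority vote among the k nearest training examples.

```python
import numpy as np
from collections import Counter

def rocchio_predict(x, X_train, y_train):
    """Centroid-based: assign x to the class with the nearest centroid."""
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

def knn_predict(x, X_train, y_train, k=3):
    """Exemplar-based: majority vote among the k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy 2-D data with two classes.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array(["a", "a", "b", "b"])
print(rocchio_predict(np.array([0.8, 0.9]), X, y))
print(knn_predict(np.array([0.8, 0.9]), X, y, k=3))
```
Note how training is trivial for kNN (just store the examples) while prediction touches every training point; Rocchio is the reverse.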
Evaluation of classifiers
- Accuracy, precision, recall and F-score
- Multi-class evaluation: Micro-/macro-averaging
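A small sketch of these measures, including micro-/macro-averaged F-scores (the gold and predicted label lists are invented):

```python
from collections import Counter

def evaluate(gold, pred):
    """Per-class precision/recall/F1 plus macro- and micro-averaged F1."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1 = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro_f1 = sum(f1.values()) / len(labels)          # average of per-class scores
    # Micro-averaging pools the counts over all classes before computing the scores.
    micro_p = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    micro_r = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    return f1, macro_f1, micro_f1

gold = ["pos", "pos", "neg", "neu", "neg"]
pred = ["pos", "neg", "neg", "neu", "pos"]
print(evaluate(gold, pred))
```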
Types of clustering
- Flat / partitional clustering
- Hierarchical clustering
- Agglomerative clustering (bottom-up)
- Divisive clustering (top-down)
Flat clustering algorithm
- Example: k-Means.
- Partitioning viewed as an optimization problem:
- Minimize the within-cluster sum of squares.
- Approximated by iteratively improving on some initial partition.
- Issues: initialization / seeding, non-determinism, sensitivity to outliers, termination criterion, specifying k, specifying the similarity function.
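A bare-bones k-means sketch (the toy data, random seeding, and fixed iteration cap are invented; a fuller implementation would also handle empty clusters, restarts, etc.):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Approximately minimise the within-cluster sum of squares by iterative refinement."""
    rng = np.random.default_rng(seed)
    # Seeding: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # termination criterion
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(X, k=2))
```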
Agglomerative clustering
- Bottom-up hierarchical clustering
- Resulting in a set of nested partitions, often visualized as a dendrogram.
- Issues:
- Linkage criteria (how to measure inter-cluster similarity):
- Single, Complete, Centroid, or Average Linkage
- Cutting the tree
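A naive single-linkage sketch (toy 1-D points, invented; real implementations use optimised libraries such as scipy.cluster.hierarchy):

```python
import numpy as np

def agglomerative(X, n_clusters=2):
    """Naive bottom-up clustering with single linkage."""
    clusters = [[i] for i in range(len(X))]      # start: every point is its own cluster
    while len(clusters) > n_clusters:            # "cutting the tree" at n_clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = distance of their closest members.
                # (Complete/centroid/average linkage would change only this line.)
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)           # merge the two most similar clusters
    return clusters

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(agglomerative(X, n_clusters=2))
```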
Divisive clustering
- Top-down hierarchical clustering
- Generates the tree by iteratively applying a flat clustering method.
Structured Probabilistic Models
- Switching from a geometric view to a probability distribution view.
- Model the probability that elements (words, labels) are in a particular configuration.
- We looked at many of the same concepts over both linear and hierarchical structures.
What have we been modelling with Structured Probabilistic Models?
Linear
- Which string is more likely?
- Which tag sequence is most likely?
Hierarchical
- Which tree structure is most likely?
Structured Probabilistic Models: Types
Linear
- n-gram language models
- Chain rule combines conditional probabilities to model context
- Markov assumption allows us to limit the length of context
- Hidden Markov Model
- adds a hidden layer of abstraction: PoS tags
- also uses the chain rule with the Markov assumption
Hierarchical
- (Probabilistic) Context-Free Grammars (PCFGs)
- Hidden layer of abstraction: trees
- Chain rule over (P)CFG rules
n-gram language models
n-gram language models
- Linear Structured Probabilistic Model
- Chain rule combines conditional probabilities to model context
- Markov assumption allows us to limit the length of context
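Written out (a standard formulation; the exact notation may differ from the slides), the chain rule plus an (n-1)-order Markov assumption gives:

$$P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})$$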

Hidden Markov Model
- Linear Structured Probabilistic Model
- A hidden layer of abstraction: PoS tags
- Uses the chain rule with the Markov assumption
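Written out for a first-order HMM over observations o_i with hidden states q_i (standard formulation):

$$P(o_1^m, q_1^m) \approx \prod_{i=1}^{m} P(q_i \mid q_{i-1})\, P(o_i \mid q_i)$$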

PCFGs
(Probabilistic) Context-Free Grammars
- Hierarchical Structured Probabilistic Model
- Hidden layer of abstraction: trees
- Chain rule over (P)CFG rules:
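Written out (standard formulation): the probability of a tree is the product of the probabilities of the rules used to build it, each conditioned on its left-hand side:

$$P(T) = \prod_{(A \to \beta) \in T} P(A \to \beta \mid A)$$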

Maximum Likelihood Estimation
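As a hedged sketch of the standard relative-frequency (MLE) estimates behind these models, with counts C(·) taken from the training data, e.g. for bigrams and PCFG rules:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})} \qquad P(A \to \beta \mid A) = \frac{C(A \to \beta)}{C(A)}$$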

HMMs can be used to:
- calculate the probability of a string
- find the most likely state sequence for a particular observation sequence
CFGs can be used to:
- A CFG can recognise whether a string is part of the language defined by the grammar.
- A PCFG can calculate the probability of a tree (where the sentence is encoded by the leaves).
n-gram models can be used to
n-gram models can be used to calculate the probability of a string
When to use the Viterbi algorithm?
- HMM: Efficiently find the most likely state sequence
- PCFG: Efficiently find the most likely parse tree
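A compact Viterbi sketch for a first-order HMM tagger (the tiny transition/emission tables and tag/word names are invented):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most likely state sequence for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev = max(states, key=lambda r: V[t - 1][r] * trans_p[r][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy model (invented): two tags, two words.
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.5, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.6}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
```
The same dynamic-programming idea, maximising over sub-trees instead of predecessor states, gives the most likely parse under a PCFG.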