00 Overview Flashcards
Sequences and grammar
Sequences:
- Probabilities over strings: n-gram models; linear and surface-oriented
- Sequence classification: HMMs add one layer of abstraction, with class labels as hidden variables, but still only linear
Grammar: adds hierarchical structure
- Shift focus from “sequences” to “sentences”
- Identifying underlying structure using formal rules
- Declarative aspect: formal grammar
- Procedural aspect: parsing strategy
- Learn probability distribution over the rules for scoring trees
In what contexts have we been counting observations for data-driven learning?
- feature vectors in semantic spaces; bag-of-words, etc.
- previous n-1 words in n-gram models
- previous n-1 states in HMMs
- local sub-trees in PCFGs
Vector space models
- Data representation based on a spatial metaphor
- Objects modeled as feature vectors positioned in a coordinate system
- Semantic spaces = VS for distributional lexical semantics
- Some issues:
- Usage = meaning? (The distributional hypothesis)
- How do we define context / features? (BoW, n-grams, etc.)
- Text normalization (lemmatization, stemming, etc.)
- How do we measure similarity? Distance / proximity metrics (Euclidean distance, cosine, dot product, etc.); see the cosine sketch after this list.
- Length-normalization (ways to deal with frequency effects / length-bias)
- High-dimensional sparse vectors (i.e. few active features; consequences for low-level choice of data structure, etc.)
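As a minimal illustration of the similarity point above, a cosine sketch over sparse bag-of-words vectors (the toy vectors and their counts are invented):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (dicts/Counters)."""
    # Dot product over the features that are active in both vectors.
    dot = sum(u[f] * v[f] for f in u if f in v)
    # Length-normalisation counters frequency effects / length bias.
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy bag-of-words context vectors (invented counts).
dog = Counter({"bark": 4, "leash": 2, "pet": 5})
cat = Counter({"purr": 3, "leash": 1, "pet": 6})
print(cosine(dog, cat))
```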
Classification
Supervised learning from labeled training data.
Given data annotated with predefined class labels, learn to predict membership for new/unseen objects.
Cluster analysis
Unsupervised learning from unlabeled data.
Automatically forming groups of similar objects.
No predefined classes; we only specify the similarity measure.
Issues with categorization tasks
- Measuring similarity
- Representing classes (e.g. exemplar-based vs. centroid-based)
- Representing class membership (hard vs. soft)
Examples of vector space classifiers
Rocchio vs. kNN
Differences between Rocchio and kNN
- Centroid- vs exemplar-based class representation
- Linear vs non-linear decision boundaries
- Assumptions about the distribution within the class
- Complexity in training vs complexity in prediction
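A rough contrast in code (the toy vectors and labels below are invented): Rocchio predicts by nearest class centroid, kNN by majority vote among the k nearest training examples.

```python
import numpy as np
from collections import Counter

def rocchio_predict(x, X_train, y_train):
    """Centroid-based: assign x to the class with the nearest centroid."""
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

def knn_predict(x, X_train, y_train, k=3):
    """Exemplar-based: majority vote among the k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy 2-D data with two classes.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array(["a", "a", "b", "b"])
print(rocchio_predict(np.array([0.8, 0.9]), X, y))
print(knn_predict(np.array([0.8, 0.9]), X, y, k=3))
```
Note how training is trivial for kNN (just store the examples) while prediction touches every training point; Rocchio is the reverse.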
Evaluation of classifiers
- Accuracy, precision, recall and F-score
- Multi-class evaluation: Micro-/macro-averaging
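A small sketch of these measures, including micro-/macro-averaged F-scores (the gold and predicted label lists are invented):

```python
from collections import Counter

def evaluate(gold, pred):
    """Per-class precision/recall/F1 plus macro- and micro-averaged F1."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1 = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro_f1 = sum(f1.values()) / len(labels)          # average of per-class scores
    # Micro-averaging pools the counts over all classes before computing the scores.
    micro_p = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    micro_r = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    return f1, macro_f1, micro_f1

gold = ["pos", "pos", "neg", "neu", "neg"]
pred = ["pos", "neg", "neg", "neu", "pos"]
print(evaluate(gold, pred))
```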
Types of clustering
- Flat / partitional clustering
- Hierarchical clustering
- Agglomerative clustering (bottom-up)
- Divisive clustering (top-down)
Flat clustering algorithm
- Example: k-Means.
- Partitioning viewed as an optimization problem:
- Minimize the within-cluster sum of squares.
- Approximated by iteratively improving on some initial partition.
- Issues: initialization / seeding, non-determinism, sensitivity to outliers, termination criterion, specifying k, specifying the similarity function.
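A bare-bones k-means sketch (the toy data, random seeding, and fixed iteration cap are invented; a fuller implementation would also handle empty clusters, restarts, etc.):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Approximately minimise the within-cluster sum of squares by iterative refinement."""
    rng = np.random.default_rng(seed)
    # Seeding: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # termination criterion
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(X, k=2))
```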
Agglomerative clustering
- Bottom-up hierarchical clustering
- Resulting in a set of nested partitions, often visualized as a dendrogram.
- Issues:
- Linkage criteria (how to measure inter-cluster similarity):
- Single, Complete, Centroid, or Average Linkage
- Cutting the tree
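A naive single-linkage sketch (toy 1-D points, invented; real implementations use optimised libraries such as scipy.cluster.hierarchy):

```python
import numpy as np

def agglomerative(X, n_clusters=2):
    """Naive bottom-up clustering with single linkage."""
    clusters = [[i] for i in range(len(X))]      # start: every point is its own cluster
    while len(clusters) > n_clusters:            # "cutting the tree" at n_clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = distance of their closest members.
                # (Complete/centroid/average linkage would change only this line.)
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)           # merge the two most similar clusters
    return clusters

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(agglomerative(X, n_clusters=2))
```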
Divisive clustering
- Top-down hierarchical clustering
- Generates the tree by iteratively applying a flat clustering method.
Structured Probabilistic Models
- Switching from a geometric view to a probability distribution view.
- Model the probability that elements (words, labels) are in a particular configuration.
- We looked at many of the same concepts over both linear and hierarchical structures.
What have we been modelling with Structured Probabilistic Models?
Linear
- Which string is more likely?
- Which tag sequence is most likely?
Hierarchical
- Which tree structure is most likely?
Structured Probabilistic Models: Types
Linear
- n-gram language models
- Chain rule combines conditional probabilities to model context
- Markov assumption allows us to limit the length of context
- Hidden Markov Model
- adds a hidden layer of abstraction: PoS tags
- also uses the chain rule with the Markov assumption
Hierarchical
- (Probabilistic) Context-Free Grammars (PCFGs)
- Hidden layer of abstraction: trees
- Chain rule over (P)CFG rules
n-gram language models
n-gram language models
- Linear Structured Probabilistic Model
- Chain rule combines conditional probabilities to model context
- Markov assumption allows us to limit the length of context
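Written out (a standard formulation; the exact notation may differ from the slides), the chain rule plus an (n-1)-order Markov assumption gives:

$$P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})$$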

Hidden Markov Model
- Linear Structured Probabilistic Model
- A hidden layer of abstraction: PoS tags
- Uses the chain rule with the Markov assumption
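Written out for a first-order HMM over observations o_i with hidden states q_i (standard formulation):

$$P(o_1^m, q_1^m) \approx \prod_{i=1}^{m} P(q_i \mid q_{i-1})\, P(o_i \mid q_i)$$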

PCFGs
(Probabilistic) Context-Free Grammars
- Hierarchical Structured Probabilistic Model
- Hidden layer of abstraction: trees
- Chain rule over (P)CFG rules:
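Written out (standard formulation): the probability of a tree is the product of the probabilities of the rules used to build it, each conditioned on its left-hand side:

$$P(T) = \prod_{(A \to \beta) \in T} P(A \to \beta \mid A)$$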

Maximum Likelihood Estimation
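As a hedged sketch of the standard relative-frequency (MLE) estimates behind these models, with counts C(·) taken from the training data, e.g. for bigrams and PCFG rules:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})} \qquad P(A \to \beta \mid A) = \frac{C(A \to \beta)}{C(A)}$$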

HMMs can be used to:
- calculate the probability of a string
- find the most likely state sequence for a particular observation sequence
CFGs can be used to:
- A CFG can recognise whether a string is part of the language defined by the grammar.
- A PCFG can calculate the probability of a tree (where the sentence is encoded by the leaves).
n-gram models can be used to
n-gram models can be used to calculate the probability of a string
When to use the Viterbi algorithm?
- HMM: Efficiently find the most likely state sequence
- PCFG: Efficiently find the most likely parse tree
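A compact Viterbi sketch for a first-order HMM tagger (the tiny transition/emission tables and tag/word names are invented):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most likely state sequence for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev = max(states, key=lambda r: V[t - 1][r] * trans_p[r][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy model (invented): two tags, two words.
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.5, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.6}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
```
The same dynamic-programming idea, maximising over sub-trees instead of predecessor states, gives the most likely parse under a PCFG.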