Unsupervised Flashcards

1
Q

NLP - steps

A
  1. Normalize
    - make sure all the words follow the same standard
  2. Tokenize
    - split the document into words, e.g. using whitespace or punctuation as separators
    - we can also treat whole sentences as tokens
  3. Lowercase
  4. Filter stopwords (this may not yield better results; try both with and without removal and keep whichever works better)
  5. Stemming and lemmatization
    - find the roots of words. Stemming uses rule-based algorithms; lemmatization uses dictionaries
  6. N-grams
    - look at groups of words together, where N is the number of consecutive words (see the code sketch after this list)
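A minimal sketch of steps 2-6, assuming NLTK as the toolkit (the card does not name a library, and the sample sentence is made up):

```python
# Minimal NLP preprocessing sketch with NLTK (library choice is an assumption).
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads: tokenizer model, stopword list, WordNet data
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The cats were sitting on the mats, watching the birds."

# 2. tokenize, 3. lowercase
tokens = [t.lower() for t in word_tokenize(text)]

# 4. filter stopwords and punctuation (compare results with and without this step)
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop]

# 5. stemming (rule-based) vs. lemmatization (dictionary-based)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(t) for t in tokens]
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

# 6. bigrams (N = 2): look at pairs of adjacent tokens
bigrams = list(nltk.ngrams(tokens, 2))

print(stems)    # e.g. ['cat', 'sit', 'mat', 'watch', 'bird']
print(lemmas)   # e.g. ['cat', 'sitting', 'mat', 'watching', 'bird']
print(bigrams)
```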
2
Q

NLP - indexing bag of words into a vector table

A

try to see which gives you a better result:

  1. term frequency - look at a document and count how many times a key word is repeated
    - from sklearn.feature_extraction.text import CountVectorizer
  2. document frequency - how many documents contain the keyword
  3. TF-IDF - the higher the TF-IDF, the more important the word. It looks at the term frequency in each document and at how often the word appears across the corpus; keywords that appear in every document are less important (both vectorizers are sketched below)
    - from sklearn.feature_extraction.text import TfidfVectorizer
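A short sketch of both vectorizers named above; the toy corpus is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# 1. term frequency: raw counts of each keyword per document
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)      # sparse (n_docs, n_terms) matrix
print(count_vec.get_feature_names_out())
print(counts.toarray())

# 3. TF-IDF: down-weights words that appear in every document (like "the", "sat")
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)
print(tfidf.toarray().round(2))
```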
3
Q

NLP vocabs

A

corpus - the whole collection of documents, e.g. all of Wikipedia (the whole database)

document - a single item in the corpus, e.g. each article (one sample)

4
Q

Naive Bayes

A

Useful in situations where the dimensionality of the feature space is high relative to the amount of data. Instead of calculating frequencies of keywords directly, it calculates the probability of keywords given each class, e.g. the probability of a keyword appearing in spam emails.
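A minimal sketch, assuming a bag-of-words representation and scikit-learn's MultinomialNB; the tiny spam/ham dataset is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",         # spam
    "claim your free money",        # spam
    "meeting rescheduled to noon",  # ham
    "see you at lunch tomorrow",    # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(emails)

# learns P(keyword | spam) and P(keyword | ham) rather than just raw counts
clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vec.transform(["free prize money"])))  # likely [1] (spam)
```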

5
Q

Clustering - k means

A

calculate the distances from the center of a cluster to all the points in that cluster; the smaller the overall distance, the tighter the cluster. The questions are: where is the center of each cluster, and how many clusters do we use?

  1. choose the number of clusters, k
  2. randomly pick centers (centroids) for these clusters
  3. assign each data point to the centroid it is closest to
  4. reassign each centroid to the mean of the points now in its cluster
  5. repeat steps 3-4 until the centroids stop changing (see the sketch after this list)
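A minimal sketch of this loop using scikit-learn's KMeans; the data is synthetic and the algorithm itself runs inside fit:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters = how many clusters, n_init = how many random restarts to try
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # final centroids once they stop moving
print(km.inertia_)          # total within-cluster sum of squared distances
```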
6
Q

Clustering - k means++

A

Similar to k-means but with weighted initialization. The first centroid is picked at random; each subsequent centroid is still picked randomly, but points farther away from the centroids already chosen are more likely to be selected.
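In scikit-learn this is controlled by the init parameter (and "k-means++" is already the default); a small sketch comparing the two initializations on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km_plus = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# compare total within-cluster sum of squares for the two initialization schemes
print(km_plus.inertia_, km_rand.inertia_)
```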

7
Q

clustering - how do we pick how many clusters?

A

the elbow method can be used as a rule of thumb: graph the total within-cluster sum of squares against the number of clusters and pick the point where we start to get diminishing returns.
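A sketch of the elbow method on synthetic data, assuming KMeans.inertia_ as the total within-cluster sum of squares:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

# look for the "elbow" where adding more clusters stops helping much
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("total within-cluster sum of squares")
plt.show()
```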

8
Q

silhouette scores

A

A measurement of how confidently a point is assigned to its cluster, based on the average distance from the point to the other points in its own cluster and the average distance from the point to the points in the next nearest cluster. Choose the number of clusters with the highest average silhouette score: the higher the average score, the tighter and more separated the clusters.
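A sketch using scikit-learn's silhouette_score to compare candidate values of k on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # pick the k with the highest average silhouette score
    print(k, silhouette_score(X, labels))
```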

9
Q

hierarchical clustering

A

Calculate the distances between all data points and group the closest ones together. It then repeats this step, merging the nearest clusters (or points), to reduce the number of clusters. Pro: we can graph the results as a dendrogram and take control of how many clusters we want.
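A sketch using SciPy's hierarchical clustering; the synthetic data and the choice of Ward linkage are assumptions:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

Z = linkage(X, method="ward")  # repeatedly merge the closest clusters
dendrogram(Z)                  # the graph we can inspect to decide where to cut
plt.show()

# cut the merge tree into the number of clusters we want
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```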

10
Q

k means pros and cons

A

cons:

  1. when clusters have unequal sizes or unequal densities, k-means does not give good boundaries; sometimes it divides a cluster arbitrarily.
  2. for non-linearly separable (non-convex) clusters, k-means might still split the clusters with linear boundaries (see the sketch below)
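A sketch of the second failure mode on scikit-learn's two-moons dataset (the dataset choice is an illustration, not from the card): k-means draws a straight boundary and cuts each moon in half.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# agreement with the true moon labels; well below 1.0 because the moons are cut apart
print(adjusted_rand_score(y_true, labels))
```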