NLP Flashcards
How to create a Doc object in Spacy?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is some text')
What is a span in Spacy?
A Span is a slice of a Doc object: doc[start:end]
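A minimal sketch (assuming doc was created as above):
span = doc[0:2]   # first two tokens as a Span object
print(span.text)  # 'This is'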
What are noun_chunks in Spacy?
base noun phrases
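For example (assuming nlp is already loaded):
doc = nlp(u'The quick brown fox jumped over the lazy dog.')
for chunk in doc.noun_chunks:
    print(chunk.text)   # 'The quick brown fox', 'the lazy dog'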
How to visualize in Spacy?
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})  # style='dep' or 'ent'
displacy.serve(doc, style='dep')
# then open 127.0.0.1:port in a browser
How to get a list of stopwords in Spacy?
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.Defaults.stop_words)
How to check if a word is a stop word in Spacy?
nlp.vocab['word'].is_stop
How to add a stop word in Spacy?
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True
How to remove a stop word in Spacy?
nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False
How to build a library of token patterns in Spacy?
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]
matcher.add('SolarPower', None, pattern1, pattern2)
found_matches = matcher(doc)
print(found_matches)
How to use a matcher for terminology lists in Spacy?
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('VoodooEconomics', None, *phrase_patterns)
matches = matcher(doc)
How to count POS frequency in a text in Spacy?
POS_counts = doc.count_by(spacy.attrs.POS)
for k, v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')
How to add a named entity in Spacy?
from spacy.tokens import Span
ORG = doc.vocab.strings[u'ORG']
new_ent = Span(doc, start, end, label=ORG)
doc.ents = list(doc.ents) + [new_ent]
How to add named entities to all matching spans in Spacy?
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('newproduct', None, *phrase_patterns)
matches = matcher(doc)
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents
How to add a new rule to pipeline in Spacy?
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc
nlp.add_pipe(set_custom_boundaries, before='parser')
How to change segmentation rules in Spacy?
from spacy.pipeline import SentenceSegmenter
def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):  # handles multiple occurrences
            seen_newline = True
    yield doc[start:]  # handles the last group of tokens
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
TF-IDF
term frequency (tf) multiplied by inverse document frequency (idf), where idf(t) = log(N / df(t)); terms that appear in many documents are down-weighted
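A minimal scikit-learn sketch (TfidfVectorizer is an assumption here, not taken from the card; X_train is assumed to be a list of raw text documents):
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)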
How to extract text features using scikit-learn?
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train_count = cv.fit_transform(X_train)
How to make a prediction for a new text using classification model in scikit-learn?
text_clf.predict(['some text here'])
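The card assumes text_clf is an already fitted model; a minimal sketch of building one as a pipeline (the TfidfVectorizer/LinearSVC choice is illustrative, not from the card):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf.fit(X_train, y_train)
text_clf.predict(['some text here'])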
How to remove rows with missing data from a DataFrame (pandas)?
df.dropna(inplace=True)
How to remove rows with empty (whitespace-only) strings from a DataFrame (pandas)?
blanks = []
for i, lb, rv in df.itertuples():
if rv.isspace():
blanks.append(i)
df.drop(blanks,inplace=True)
How does Word2vec train words against other words in a corpus?
- Using context to predict a target word (continuous bag of words)
- Using a word to predict a target context (skip-gram)
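A minimal gensim (>= 4.0) sketch illustrating both modes; gensim and the toy sentences are assumptions, not part of the original card. The sg flag switches between CBOW and skip-gram:
from gensim.models import Word2Vec
sentences = [['the', 'quick', 'brown', 'fox'], ['the', 'lazy', 'dog']]
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)      # continuous bag of words
skipgram_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # skip-gram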
check word similarity in Spacy
tokens = nlp(u'fox dog animal')  # similarity needs a model with word vectors, e.g. en_core_web_md or en_core_web_lg
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
vector arithmetic in Spacy (finding similar words using vectors)
from scipy import spatial
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
# Now we find the closest vector in the vocabulary to the result of "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []
for word in nlp.vocab:
    # Ignore words without vectors and mixed-case words:
    if word.has_vector:
        if word.is_lower:
            if word.is_alpha:
                similarity = cosine_similarity(new_vector, word.vector)
                computed_similarities.append((word, similarity))
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])
How to do sentiment analysis using NLTK?
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
import pandas as pd
df = pd.read_csv('../TextFiles/amazonreviews.tsv', sep='\t')
df.head()
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df['compound'] = df['scores'].apply(lambda d: d['compound'])
df['score'] = df['compound'].apply(lambda s: 'pos' if s >= 0 else 'neg')