Hugging Face ecosystem | HF NLP course | 2. Using Hugging Face Transformers | Priority Flashcards

1
Q

Code to get a tokenizer.

hugging-face tokenizers

A
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
2
Q

Code to send example input through a tokenizer with arguments.

hugging-face tokenizers

A
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
3
Q

Output format from a tokenizer.

hugging-face tokenizers

A
{'input_ids': tensor([[sentence1 IDs, …], [sentence2 IDs, …]]),
 'attention_mask': tensor([[sentence1 mask, …], [sentence2 mask, …]])}
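A quick way to confirm the structure (a minimal sketch, reusing the inputs dict from the previous card):

print(inputs.keys())              # dict_keys(['input_ids', 'attention_mask'])
print(inputs["input_ids"].shape)  # (batch size, padded sequence length)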
4
Q

Code to get a model (not for a specific task).

hugging-face transformers

A
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

5
Q

What are the output dimensions of a Transformer module?

hugging-face transformers

A

batch size, sequence length, hidden size

6
Q

Example code to feed the outputs of a tokenizer into a model.

hugging-face tokenizers transformers

A
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
7
Q

Example code to get a model that will classify text. What will the output shape be?

hugging-face transformers

A
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)  # outputs.logits has shape (batch_size, num_labels)
8
Q

[page] Tokenizers: [page section] Loading and saving: [q] Code to use the AutoTokenizer class to grab the proper tokenizer class in the library based on the checkpoint name.

A
"from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(""bert-base-cased"")"
9
Q

[page] Tokenizers:[page section] Loading and saving: [q] What are the 3 elements in the dict output of a tokenizer?

A

input_ids, token_type_ids, attention_mask
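A quick check (a minimal sketch, assuming the bert-base-cased tokenizer from the previous card):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded = tokenizer("Using a Transformer network is simple!")
print(encoded.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])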

10
Q

[page] Tokenizers: [page section] Loading and saving: [q] Saving a tokenizer.

A

tokenizer.save_pretrained("directory_on_my_computer")

11
Q

[page] Handling multiple sequences: [page section] Models expect a batch of inputs: [q] Models expect ? sentences by default.

A

multiple
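For example (a sketch along the lines of the course's "Handling multiple sequences" page, reusing the sentiment checkpoint from earlier cards): a single sequence still has to be wrapped in an outer list so the tensor gets a batch dimension.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence))

input_ids = torch.tensor([ids])  # the extra brackets make this a batch of one sequence
output = model(input_ids)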

12
Q

[page] Putting it all together: [page section]: Wrapping up: From tokenizer to model: [q] Write a code snippet that uses the tokenizer API to tokenize 2 sequences (using 3 arguments) and run them through a sequence classification model.

A
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]					
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
13
Q

[page] Processing the data: [page section]: Loading a dataset from the Hub: [q] Code to download and cache the GLUE benchmark dataset from the Hugging Face hub.

A
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")  # GLUE is a benchmark collection, so a task config (here MRPC) is required
14
Q

[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to inspect the features of a dataset?

A

dataset.features
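For example (a sketch assuming the raw_datasets DatasetDict loaded in the previous card):

raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset.features)
# for MRPC: 'sentence1' and 'sentence2' are string Values, 'label' is a ClassLabel, 'idx' an int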

15
Q

[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to tokenize all elements in a HF dataset?

A
"def tokenize_function(example):
	return tokenizer(…)
tokenized_dataset = dataset.map(tokenize_function)"

Source > video

16
Q

[page] Processing the data: [page section] Preprocessing a dataset: [q] What is the idiom to decode IDs to words?

A

tokenizer.convert_ids_to_tokens(inputs["input_ids"])

17
Q

[page] Processing the data: [page section] Dynamic padding: [q] What is a collate function? What is the default behavior?

A

The function that is responsible for putting together samples inside a batch. It’s an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them.
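For instance (a minimal sketch; tokenized_dataset and data_collator stand in for the objects built in the neighbouring cards):

from torch.utils.data import DataLoader

# collate_fn takes a list of samples and returns one batch;
# left out, the default simply converts the samples to tensors and stacks them
train_dataloader = DataLoader(tokenized_dataset, batch_size=8, collate_fn=data_collator)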

18
Q

[page] Processing the data: [page section] Dynamic padding: [q] Example of how to create a collator?

A

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = data_collator(samples)

19
Q

When not using the Trainer class, what does creating the DataLoader objects look like (using a collator for dynamic padding)?

A
"from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding 
data_collator = DataCoIIatorWithPadding(tokenizer) 
train_dataloader = DataLoader(
    tokenized_datasets[""train""], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets[""validation""], batch_size=8, collate_fn=data_collator
)
for step, batch in enumerate(train_dataloader): 
    print(batch[ ""input_ids""].shape) 
    if step > 5: 
        break "
20
Q

[page] Fine-tuning a model with the Trainer API: [page section] Training: [q] How to pass parameters to training?

A

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
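TrainingArguments is also where training hyperparameters go; a minimal sketch with a few common fields (the values here are illustrative, not from the course):

from transformers import TrainingArguments

training_args = TrainingArguments(
    "test-trainer",                  # output directory
    learning_rate=2e-5,              # illustrative values
    num_train_epochs=3,
    per_device_train_batch_size=8,
    weight_decay=0.01,
)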

21
Q

[page] Fine-tuning a model with the Trainer API: [page section] Training: [q] How to instantiate a Trainer and start training, with evaluation per epoch?

A
"from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets[""train""],
    eval_dataset=tokenized_datasets[""validation""],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()"
22
Q

[page] Fine-tuning a model with the Trainer API: [page section] Evaluation: [q] What is the output of the trainer.predict() method?

A

named tuple with three fields: predictions, label_ids, and metrics

23
Q

[page] Fine-tuning a model with the Trainer API: [page section] Evaluation: [q] Idiom to get the predictions from the output of trainer.predict()?

A

import numpy as np
predictions = trainer.predict(tokenized_datasets["validation"])
preds = np.argmax(predictions.predictions, axis=-1)

24
Q

[page] Fine-tuning a model with the Trainer API: [page section] Evaluation: [q] Example compute_metrics() function on mrpc to be passed to a Trainer? What is the eval_preds argument?

A

import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# eval_preds is an EvalPrediction object: a namedtuple with a predictions field and a label_ids field

25
Q

[q] When not using the Trainer class, write a basic training loop.

A
"from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)"
26
Q

[q] Write a basic evaluation loop using a glue metric.

A
"import evaluate
metric = evaluate.load(""glue"", ""mrpc"")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch[""labels""])
metric.compute()"