Hugging Face ecosystem | HF NLP course | 5. The Hugging Face Datasets library Flashcards

1
Q

Slicing and dicing our data: Load TSV data

A
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
2
Q

Slicing and dicing our data: Find the number of unique items in each split

A
col_name = "drugName"  # the column to check (placeholder; any column name works)

# Assert that every value in the column is unique within each split
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique(col_name))

# Or iterate over the splits directly and count unique values:
for split, subset in drug_dataset.items():
    print(split, len(subset.unique(col_name)))
3
Q

Slicing and dicing our data: Filter rows where a column is None using a lambda expression

A

drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

https://huggingface.co/course/chapter5/3?fw=pt#slicing-and-dicing-our-data

4
Q

Creating new columns: Create a new column with a function (that computes the number of words in a text column) and map()

A


def compute_review_length(example):
    return {"review_length": len(example["review"].split())}


drug_dataset = drug_dataset.map(compute_review_length)


5
Q

Creating new columns: Sort a dataset by a column

A

drug_dataset["train"].sort("review_length")

6
Q

The map() method’s superpowers: How to speed up applying a function to a column

A

.map(…, batched=True)
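
For example (a minimal sketch, assuming the tokenizer and drug_dataset from earlier cards):

def tokenize_function(examples):
    # With batched=True the function receives a dict of lists (a whole batch)
    return tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)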

7
Q

The map() method’s superpowers: How to speed up applying a function to a column using parallelization

A

.map(…, num_proc=n)
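
For example (a sketch reusing tokenize_function from the previous card; num_proc spawns that many worker processes):

tokenized_dataset = drug_dataset.map(tokenize_function, batched=True, num_proc=8)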

8
Q

The map() method’s superpowers: Truncate while tokenizing but return all chunks.

A
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

9
Q

From Datasets to DataFrames and back: Change the format of a Dataset to pandas


A
drug_dataset.set_format("pandas")

# or convert a single split to a DataFrame directly:
df = drug_dataset["train"].to_pandas()

10
Q

Creating a validation set: Create a validation set


A
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)

# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]

drug_dataset_clean

11
Q

Saving a dataset: Save dataset in multiple splits in JSON format


A
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

12
Q

What if my dataset isn’t on the Hub? Loading a local dataset: What is the basic syntax for loading a (local) dataset?

Source: What if my dataset isn’t on the Hub? Loading a local dataset

A
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")


13
Q

Display the memory usage of a huge process.

A

import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

14
Q

How to load a dataset that doesn’t fit into machine memory

A
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)
15
Q

How to access an element of a streamed dataset

A
next(iter(pubmed_dataset_streamed))
16
Q

Example of how to tokenize a streamed dataset

A
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))
17
Q

How to select elements of a streamed dataset

A
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)
18
Q

Way to combine multiple datasets that don’t fit into memory together

A
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
19
Q

Function to create an iterable to return selected elements from an iterable
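
A
The answer is itertools.islice, used with the streamed dataset in the previous card (a standard-library fact, not from the course page):

from itertools import islice

# islice(iterable, stop) or islice(iterable, start, stop[, step])
list(islice(range(10), 2, 6))  # [2, 3, 4, 5]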

20
Q

Python review: How to get data via HTTP.

Source: Creating your own dataset > Getting the data

A

import requests

response = requests.get(url)


21
Q

Pandas review: How to create a dataframe from a list of dicts.

Source: Creating your own dataset > Getting the data

A

import pandas as pd

df = pd.DataFrame.from_records(all_issues)


22
Q

Pandas review: How to write out a dataframe as line-delimited JSON.

Source: Creating your own dataset > Getting the data

A

df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)


23
Q

Idiom to remove the complement of a set of columns from a Dataset.

A
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
# Since columns_to_keep is a subset of columns, the symmetric difference
# is exactly the set of columns to drop
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
24
Q

Pandas function to take a column with list elements and repeat other column elements while flattening each list element.

A

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

25
Q

Simple way to speed up the embedding process.

A

device = torch.device("cuda")
model.to(device)

26
Q

Review: Idiom to get a tokenizer and a model.

A
from transformers import AutoModel, AutoTokenizer

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
27
Q

Idiom to do CLS pooling.

A
def cls_pooling(model_output):
    # Use the hidden state of the first ([CLS]) token as the sequence embedding
    return model_output.last_hidden_state[:, 0]
28
Q

How to index a Dataset column with FAISS.

A

embeddings_dataset.add_faiss_index(column="embeddings")

29
Q

Use the FAISS index to do nearest neighbor search and return in descending order.

A
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
30
Q

PyTorch review: Meaning of x.unsqueeze(-1)

A

x.unsqueeze(-1) inserts a new dimension of size 1 at the last position, so a tensor of shape (n,) becomes (n, 1).

Reference: https://www.educba.com/pytorch-unsqueeze/
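
A quick sketch:

import torch

x = torch.tensor([1.0, 2.0, 3.0])   # shape (3,)
y = x.unsqueeze(-1)                 # shape (3, 1)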

31
Q

PyTorch review: Idiom to sum across rows.

A

torch.sum(a, 1)  # sum over dimension 1, producing one value per row

Source: https://pytorch.org/docs/stable/generated/torch.sum.html
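
For example:

import torch

a = torch.ones(2, 3)
torch.sum(a, 1)  # tensor([3., 3.])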

32
Q

PyTorch review: How to compute the Euclidean norm.

A

torch.nn.functional.normalize divides a tensor by its Euclidean (L2) norm along the given dimension.

Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html
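
A minimal sketch:

import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
x_unit = F.normalize(x, p=2, dim=1)  # each row now has Euclidean norm 1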

33
Q

Idiom to do mean pooling.

A
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    # Expand the attention mask to the embedding size so padded tokens are zeroed out
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
# Normalize the embeddings.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(f"Sentence embeddings shape: {sentence_embeddings.size()}")