Hugging Face ecosystem | HF NLP course | 5. The Hugging Face Datasets library Flashcards

1
Q

Slicing and dicing our data: Load TSV data

A
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
2
Q

Slicing and dicing our data: Find the number of unique items in each split

A
col_name = "drugName"  # the column to check (placeholder; any column name works)

# Assert that every value in the column is unique within each split
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique(col_name))

# Or iterate over the splits directly and count unique values:
for split, subset in drug_dataset.items():
    print(split, len(subset.unique(col_name)))
3
Q

Slicing and dicing our data: Filter rows where a column is None using a lambda expression

A

drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

https://huggingface.co/course/chapter5/3?fw=pt#slicing-and-dicing-our-data

4
Q

Creating new columns: Create a new column with a function (that computes the number of words in a text column) and map()

A


def compute_review_length(example):
    return {"review_length": len(example["review"].split())}


drug_dataset = drug_dataset.map(compute_review_length)


5
Q

Creating new columns: Sort a dataset by a column

A

drug_dataset["train"].sort("review_length")

6
Q

The map() method’s superpowers: How to speed up applying a function to a column

A

.map(…, batched=True)
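
For example (a minimal sketch, assuming the tokenizer and drug_dataset from earlier cards):

def tokenize_function(examples):
    # With batched=True the function receives a dict of lists (a whole batch)
    return tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)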

7
Q

The map() method’s superpowers: How to speed up applying a function to a column using parallelization

A

.map(…, num_proc=n)
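
For example (a sketch reusing tokenize_function from the previous card; num_proc spawns that many worker processes):

tokenized_dataset = drug_dataset.map(tokenize_function, batched=True, num_proc=8)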

8
Q

The map() method’s superpowers: Truncate while tokenizing but return all chunks.

A
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

9
Q

From Datasets to DataFrames and back: Change the format of a Dataset to pandas


A
drug_dataset.set_format("pandas")

# or convert a single split to a DataFrame directly:
df = drug_dataset["train"].to_pandas()

10
Q

Creating a validation set: Create a validation set


A
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)

# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]

drug_dataset_clean

11
Q

Saving a dataset: Save dataset in multiple splits in JSON format


A
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

12
Q

What if my dataset isn’t on the Hub? Loading a local dataset: What is the basic syntax for loading a (local) dataset?

Source: What if my dataset isn’t on the Hub? Loading a local dataset

A
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")


13
Q

Display the memory usage of a huge process.

A

import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

14
Q

How to load a dataset that doesn’t fit into machine memory

A
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)
15
Q

How to access an element of a streamed dataset

A
next(iter(pubmed_dataset_streamed))
16
Q

Example of how to tokenize a streamed dataset

A
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))
17
Q

How to select elements of a streamed dataset

A
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)
18
Q

Way to combine multiple datasets that don’t fit into memory together

A
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
19
Q

Function to create an iterable to return selected elements from an iterable
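
A
The answer is itertools.islice, used with the streamed dataset in the previous card (a standard-library fact, not from the course page):

from itertools import islice

# islice(iterable, stop) or islice(iterable, start, stop[, step])
list(islice(range(10), 2, 6))  # [2, 3, 4, 5]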

20
Q

Python review: How to get data via HTTP.

Source: Creating your own dataset > Getting the data

A

import requests

response = requests.get(url)


21
Q

Pandas review: How to create a dataframe from a list of dicts.

Source: Creating your own dataset > Getting the data

A

import pandas as pd

df = pd.DataFrame.from_records(all_issues)


22
Q

Pandas review: How to write out a dataframe as line-delimited JSON.

Source: Creating your own dataset > Getting the data

A

df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)


23
Q

Idiom to remove the complement of a set of columns from a Dataset.

A
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
# Since columns_to_keep is a subset of columns, the symmetric difference
# is exactly the set of columns to drop
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
24
Q

Pandas function to take a column with list elements and repeat other column elements while flattening each list element.

A

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

25
Q

Simple way to speed up the embedding process.

A

device = torch.device("cuda")
model.to(device)

26
Q

Review: Idiom to get a tokenizer and a model.

A
from transformers import AutoModel, AutoTokenizer

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
27
Q

Idiom to do CLS pooling.

A
def cls_pooling(model_output):
    # Use the hidden state of the first ([CLS]) token as the sequence embedding
    return model_output.last_hidden_state[:, 0]
28
Q

How to index a Dataset column with FAISS.

A

embeddings_dataset.add_faiss_index(column="embeddings")

29
Q

Use the FAISS index to do nearest neighbor search and return in descending order.

A
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
30
Q

PyTorch review: Meaning of x.unsqueeze(-1)

A

x.unsqueeze(-1) inserts a new dimension of size 1 at the last position, so a tensor of shape (n,) becomes (n, 1).

Reference: https://www.educba.com/pytorch-unsqueeze/
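
A quick sketch:

import torch

x = torch.tensor([1.0, 2.0, 3.0])   # shape (3,)
y = x.unsqueeze(-1)                 # shape (3, 1)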

31
Q

PyTorch review: Idiom to sum across rows.

A

torch.sum(a, 1)  # sum over dimension 1, producing one value per row

Source: https://pytorch.org/docs/stable/generated/torch.sum.html
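
For example:

import torch

a = torch.ones(2, 3)
torch.sum(a, 1)  # tensor([3., 3.])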

32
Q

PyTorch review: How to compute the Euclidean norm.

A

torch.nn.functional.normalize divides a tensor by its Euclidean (L2) norm along the given dimension.

Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html
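
A minimal sketch:

import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
x_unit = F.normalize(x, p=2, dim=1)  # each row now has Euclidean norm 1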

33
Q

Idiom to do mean pooling.

A
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    # Expand the attention mask to the embedding size so padded tokens are zeroed out
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
# Normalize the embeddings.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(f"Sentence embeddings shape: {sentence_embeddings.size()}")