Data Management Flashcards

1
Q

2 types of sources:

A
  1. Proprietary data (owned by the company)
  2. Publicly available - no competitive advantage, but good as a starting point

This is why most companies invest in labeling their own datasets.
It is also why the data flywheel model (users generate data that improves the model, which attracts more users) is so useful.

2
Q

What is Semi-supervised?

A

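Without having labels, you train the model by reformulating the task so that the target comes from something you already have (one part of a sentence predicted from another, different crops of the same cat picture matched to each other, etc…).
This approach is more commonly called self-supervised learning.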
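
The sentence case above can be sketched in a few lines. This is only an illustration of how labels can be derived from unlabeled text; the function name and the "[MASK]" token are assumptions, not part of any particular library.

```python
def masked_word_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (input, target) training pairs.

    Each pair hides one word; the hidden word is the label, so no human
    annotation is needed.
    """
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


pairs = masked_word_examples("the cat sat")
for model_input, target in pairs:
    print(model_input, "->", target)
```

A model trained to fill in the hidden word learns from the raw text alone.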

3
Q

Image data augmentation

A

A must for training vision models.
Frameworks provide it out of the box.
It runs on the CPU in parallel with training on the GPU.

For tabular data: delete some cells to simulate missing data.

For text: no well-established techniques. Options include replacing words with synonyms, changing word order…
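
The tabular and text ideas above can be sketched with the standard library alone. The helper names, the sample row, and the tiny synonym table are all hypothetical; image augmentation is left to the framework, as the card says.

```python
import random


def drop_cells(row, drop_prob, rng):
    """Tabular augmentation: blank out cells at random to mimic missing data."""
    return {k: (None if rng.random() < drop_prob else v) for k, v in row.items()}


def swap_synonyms(text, synonyms, swap_prob, rng):
    """Text augmentation: replace known words with a synonym at random."""
    out = []
    for word in text.split():
        if word in synonyms and rng.random() < swap_prob:
            out.append(rng.choice(synonyms[word]))
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(0)  # fixed seed so runs are repeatable
row = {"age": 41, "income": 52000, "city": "Haifa"}
augmented_row = drop_cells(row, drop_prob=0.3, rng=rng)

SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad"]}  # toy table, hypothetical
augmented_text = swap_synonyms("the quick dog is happy", SYNONYMS, swap_prob=1.0, rng=rng)
print(augmented_row)
print(augmented_text)
```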

4
Q

Storage option - filesystem

A

Foundational layer of storage.
Files can be overwritten, and there is no inherent organization.
Fastest option.
Can be local on a single machine, networked, or distributed.

5
Q

Local data format

A

Binary data: just store it as files (TensorFlow has a format called TFRecord that batches files together; not important with NVMe drives)

Large tabular/text:
Parquet is widespread and recommended.
Feather (from Apache Arrow) is up and coming.
HDF5 is old and no longer relevant.

Try to use the native TensorFlow/PyTorch dataset classes.

6
Q

Object storage

A

An API over a filesystem (like Amazon S3).
You can then store objects there without worrying about where they are physically stored.
Good for versioning and redundancy.
Fast enough within the cloud.
Sometimes cheaper.

7
Q

Database - what is it?

A

Fast, persistent, scalable storage and retrieval of structured data that will be accessed repeatedly.

8
Q

Database - a good mental model for it and what should be stored

A

A mental model: everything is actually in RAM, and the software ensures it is logged to disk.
Not for the binary data itself, only for references to it.

Use Postgres.
Don’t use NoSQL.
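
A minimal sketch of "references, not binary data": the table holds labels plus a pointer to where each image lives. SQLite is used here only so the example runs without a server (the card recommends Postgres), and the table, bucket, and annotator names are made up.

```python
import sqlite3

# SQLite stands in for Postgres so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE images (
        id INTEGER PRIMARY KEY,
        object_uri TEXT NOT NULL,   -- pointer to the binary in object storage
        label TEXT,
        labeled_by TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO images (object_uri, label, labeled_by) VALUES (?, ?, ?)",
    [
        ("s3://example-bucket/images/0001.jpg", "cat", "annotator_a"),
        ("s3://example-bucket/images/0002.jpg", "dog", "annotator_b"),
        ("s3://example-bucket/images/0003.jpg", "cat", "annotator_a"),
    ],
)

# Query the metadata, then fetch only the matching binaries from object storage.
cat_uris = [
    uri
    for (uri,) in conn.execute(
        "SELECT object_uri FROM images WHERE label = ? ORDER BY id", ("cat",)
    )
]
print(cat_uris)
```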

9
Q

OLTP

A

Online transaction processing - databases

10
Q

OLAP

A

Online analytical processing = warehouse

11
Q

ETL

A

Extract, transform, load -
the process behind data warehouses.
The idea is to extract data from different sources, transform it into a common schema, and then load it into a warehouse.
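
The three steps can be sketched end to end on toy data. Both sources, their field names, and the target schema are hypothetical, and an in-memory SQLite table stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Extract: two hypothetical sources with different shapes.
crm_csv = "customer_id,signup\n1,2021-03-01\n2,2021-04-15\n"
app_json = '[{"uid": 3, "created_at": "2021-05-20"}]'

crm_rows = list(csv.DictReader(io.StringIO(crm_csv)))
app_rows = json.loads(app_json)

# Transform: map both onto one common schema (user_id, signup_date, source).
unified = [(int(r["customer_id"]), r["signup"], "crm") for r in crm_rows]
unified += [(int(r["uid"]), r["created_at"], "app") for r in app_rows]

# Load: write into the warehouse table (SQLite stands in for a real warehouse).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE users (user_id INTEGER, signup_date TEXT, source TEXT)")
warehouse.executemany("INSERT INTO users VALUES (?, ?, ?)", unified)

counts = dict(warehouse.execute("SELECT source, COUNT(*) FROM users GROUP BY source"))
print(counts)
```

Once loaded, analytical queries run against one schema regardless of where each row came from.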

12
Q

Data lake

A

ELT - extract, load, transform

Unlike a warehouse, here you first load the data, then transform it and move it to the place that needs it.

13
Q

SQL and dataframe

A

Pandas DataFrames do much of what SQL does; each has its own benefits, and I should know both.

14
Q

Lakehouse

A

This is the trend now - it’s both a data lake and a warehouse.
An open-source example is “Delta Lake”, where you can store all kinds of data: structured, semi-structured, and unstructured.
It then connects to the analytics tools, the ML tools, etc.

15
Q

Data management summary:

A

Binary data (images, sound files, compressed text, etc.) is stored as objects.

Metadata (labels, user activity) is stored in a database.

If there are features that cannot be obtained from the database (like logs), set up a data lake and a process to aggregate the needed data.

At training time, copy the data that is needed to a filesystem on a fast drive.

16
Q

Airflow and Prefect are used for:

A

Managing the different tasks and workers that we have to run for our program: which task depends on which, and when each one runs.
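
The core idea, running tasks in an order that respects their dependencies, can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not the Airflow or Prefect API; the task names are hypothetical, and real tools add scheduling, retries, and distribution across workers on top.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical.
dependencies = {
    "extract_logs": set(),
    "extract_labels": set(),
    "build_features": {"extract_logs", "extract_labels"},
    "train_model": {"build_features"},
    "evaluate": {"train_model"},
}

run_order = []


def run(task_name):
    """Stand-in for real work; a scheduler would hand this to a worker."""
    run_order.append(task_name)


sorter = TopologicalSorter(dependencies)
for task in sorter.static_order():
    run(task)

print(run_order)
```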

17
Q

Feature store - what is it and what options are out there

A

The features are created from the data lake, stored, and then used for training. But the model then runs in a different pipeline, where the data is extracted in real time and then transformed, etc…
So a feature store is a way to keep these 2 processes as similar as possible, so that we have as few bugs as possible.

Options - Tecton (a company), Feast (open source)
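
The underlying idea can be sketched without any feature-store product: define each feature once and call that single definition from both the training path and the serving path. The function and feature names below are hypothetical and this is not the Tecton or Feast API.

```python
from datetime import date


def user_features(signup_date, purchase_amounts, today):
    """Single definition of the features, shared by training and serving."""
    n = len(purchase_amounts)
    return {
        "account_age_days": (today - signup_date).days,
        "purchase_count": n,
        "mean_purchase": sum(purchase_amounts) / n if n else 0.0,
    }


# Offline path: build a training row from historical data.
training_row = user_features(date(2021, 1, 1), [10.0, 30.0], today=date(2021, 1, 31))

# Online path: the same function is called on fresh data at prediction time.
serving_row = user_features(date(2021, 1, 1), [10.0, 30.0], today=date(2021, 1, 31))

print(training_row)
```

Because both paths share one function, a change to a feature cannot silently apply to training but not to serving.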

18
Q

Tools to optimize pandas

A

Dask - if the data is too big for pandas, it keeps things fast by splitting the dataset and processing the pieces in parallel

RAPIDS - scaling data analytics to the GPU

19
Q

Data labeling

A

Best is to hire a company to do it.
If not, then label it yourself with existing tools.
And if not that, then crowdsource…

20
Q

Data versioning

A

It’s as important to version the data as it is to version the code and the model.

Using plain Git LFS works, but the repository can get really big.
DVC - an open-source tool for versioning data.
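
One way to picture data versioning: compute a content hash of the dataset and commit that small fingerprint alongside the code, while the data itself lives elsewhere. The sketch below is only an illustration of that idea, not DVC's actual file format or API; the function name and file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root):
    """Hash every file under `root` (paths and contents) into one hex digest."""
    digest = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train.csv").write_text("x,y\n1,2\n")
    v1 = dataset_fingerprint(data_dir)
    v1_again = dataset_fingerprint(data_dir)

    (data_dir / "train.csv").write_text("x,y\n1,3\n")  # one label changed
    v2 = dataset_fingerprint(data_dir)

print(v1 != v2)
```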

21
Q

How to deal with data that is private and kept within the company (hospitals and such)

A

Federated learning - the training is done on the hospitals’ own computers, and only the results come back to you, so there is no need to have access to the data.

Differential privacy - aggregating data such that individual points cannot be identified.

Learning on encrypted data.

All are still areas of research.
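
The differential privacy idea above can be sketched with the Laplace mechanism applied to a counting query. This is a textbook illustration under stated assumptions (a count has sensitivity 1, so the noise scale is 1 / epsilon); the ages are made up and the helper names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count; a counting query has sensitivity 1, so scale = 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
ages = [34, 51, 67, 45, 72, 29, 80, 63]  # made-up patient ages
noisy = private_count(ages, lambda a: a >= 60, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon means more noise and stronger privacy; the aggregate stays useful on average while any single record's presence is masked.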