Quant 2.7 Flashcards

1
Q

What is Big Data?

A

Big data includes all the data generated by financial markets, businesses, governments, individuals, sensors, the Internet of Things, and more.
Big data has grown exponentially over the past decade.
Investment managers are increasingly using big data (e.g., using financial text data to forecast stock movements).

2
Q

Why is big data different from traditional data?

A

Because of three characteristics (the 3 V's):
Volume - the quantity of data
Variety - the array of available data sources
Velocity - the speed at which data are created
When using big data for prediction or inference, there is a fourth V:
Veracity - the credibility and reliability of different data sources

3
Q

What are the steps in executing a data analysis project with structured data?

A

There are two sources of data - internal & external.
The steps to build the model are then:
1. Conceptualization - What will the model do, and who will use it?
2. Data collection
3. Data preparation & wrangling - cleaning the data, filling in missing values, etc.
4. Data exploration
5. Model training

4
Q

What are the steps in executing a data analysis project with unstructured data?

A

Unstructured data can come from many sources - news articles, social media, other documents, open data, etc.
The steps to build a model are similar to those for structured data:
1. Text problem formulation - define the text classification problem (e.g., producing a sentiment score)
2. Data curation
3. Text preparation & wrangling - cleaning and pre-processing tasks
4. Text exploration
5. Model training

5
Q

What is Data Collection?

A

Collecting data from internal or external sources, based on the trade-offs between time, financial cost, and accuracy.

6
Q

What is data preparation (cleansing)?

A

Examining, identifying, and mitigating errors in the raw data.

7
Q

What is Data Wrangling (pre-processing)?

A

Performing transformations and critical processing steps on the cleansed data to make it ready for the ML model.

8
Q

What types of errors can one expect during data preparation with structured data?

A

Incompleteness error - missing (blank) values
Invalidity error - values outside a meaningful range (e.g., a date of birth with year 1900)
Inaccuracy error - data that is not a measure of the true value (e.g., a gender/name mismatch)
Inconsistency error - conflicting data points (e.g., an entry other than Y or N in a yes/no field)
Non-uniformity error - the same data type recorded in different formats
Duplication error - duplicate records
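A minimal pandas sketch of how some of these errors can be flagged; the table and its values are hypothetical, purely for illustration:

    import pandas as pd

    # Hypothetical applicant records, for illustration only.
    df = pd.DataFrame({
        "name": ["Ann", "Bob", "Bob", None],
        "answer": ["Y", "N", "N", "Maybe"],
        "birth_year": [1985, 1900, 1900, 1992],
    })

    print(df.isnull().sum())                   # incompleteness: missing values per column
    print(df.duplicated().sum())               # duplication: exact duplicate rows
    print(df[df["birth_year"] <= 1900])        # invalidity: implausible birth years
    print(df[~df["answer"].isin(["Y", "N"])])  # inconsistency: entries other than Y or N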

9
Q

What are the common transformations performed in the data wrangling step with structured data?

A

Some common transformations you need to know:
Extraction - creating a new variable from an existing one
Aggregation - combining two or more variables into one (e.g., salary + other income = total income)
Filtration - keeping only the rows that meet specified criteria
Selection - keeping only the columns needed for analysis
Conversion - putting values on a common basis (e.g., all total income in USD)
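A short pandas sketch of these transformations; the column names and the exchange rate are assumed, for illustration only:

    import pandas as pd

    df = pd.DataFrame({
        "salary": [60000.0, 85000.0],
        "other_income": [5000.0, 0.0],
        "currency": ["EUR", "USD"],
    })

    eur_usd = 1.10                                              # assumed rate, illustration only
    df["total_income"] = df["salary"] + df["other_income"]      # aggregation
    df.loc[df["currency"] == "EUR", "total_income"] *= eur_usd  # conversion to USD
    high_earners = df[df["total_income"] > 70000]               # filtration (rows)
    features = df[["total_income"]]                             # selection (columns)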

10
Q

How are outliers handled in data wrangling?

A

By trimming (removing the extreme observations) and winsorization (replacing extreme observations with the boundary values of the acceptable range).
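A minimal NumPy sketch of both approaches, using the 5th/95th percentiles as the (assumed) outlier bounds:

    import numpy as np

    x = np.array([3, 5, 6, 7, 8, 9, 11, 250])  # 250 is an extreme outlier

    lo, hi = np.percentile(x, [5, 95])

    trimmed = x[(x >= lo) & (x <= hi)]  # trimming: drop the extreme observations
    winsorized = np.clip(x, lo, hi)     # winsorization: cap them at the bounds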

11
Q

What do you mean by scaling?

A

Normalization and standardization are the two methods used for scaling in data wrangling.
Normalization: X_normalized = (X_i - X_min) / (X_max - X_min), which rescales values to the [0, 1] range.
Standardization (relatively less sensitive to outliers): X_standardized = (X_i - μ) / σ, which centers values at zero with unit standard deviation.
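A minimal NumPy sketch of both scaling methods:

    import numpy as np

    x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

    normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
    standardized = (x - x.mean()) / x.std()           # zero mean, unit std deviation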

12
Q

What does unstructured data preparation include?

A

Involves removing unimportant elements from the raw text - HTML tags, punctuation, most numbers, and extra white space.
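A minimal regex sketch of this cleaning step; the sample string is made up for illustration:

    import re

    raw = "<p>Revenue grew 12% in Q3!</p>   See the filing for details."

    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # collapse white space
    print(text)  # "Revenue grew in Q See the filing for details"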

13
Q

What does unstructured data wrangling include?

A

We start with tokenization, which splits a given text into tokens; a token is roughly equivalent to a word.
Then we normalize the tokens. Normalization includes lowercasing, removing stop words, stemming (e.g., 'increased' and 'increasing' are both stemmed to 'increas', which raises the frequency of otherwise infrequent word forms), and finally lemmatization (which is computationally more expensive and therefore seldom used).
After normalization, we are left with a bag of words - a distinct set of tokens in which word order plays no role.
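A minimal Python sketch of this pipeline; the stop-word list is tiny and the stemmer is a naive suffix-stripper standing in for a real one (e.g., the Porter stemmer):

    import re
    from collections import Counter

    STOP_WORDS = {"the", "and", "are", "a", "in"}  # tiny illustrative list

    def stem(token):
        # Naive suffix stripping, a stand-in for a real stemmer.
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token

    sentence = "Profits increased and margins are increasing"
    tokens = re.findall(r"[a-z]+", sentence.lower())           # tokenize + lowercase
    tokens = [stem(t) for t in tokens if t not in STOP_WORDS]  # stop words + stemming
    print(Counter(tokens))  # bag of words; 'increased'/'increasing' both become 'increas'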

14
Q

What is a DTM?

A

A document term matrix is a table in which each row is a document (the source from which tokens were collected) and each column is a token/word; each cell holds the count of that token in that document.
This table is the end result of the text preparation and wrangling process, leaving us with a structured table (the final DTM) that we can use in our model.
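A minimal sketch building a DTM from already-wrangled token lists; the document names and tokens are made up for illustration:

    from collections import Counter

    docs = {
        "filing_1": ["profit", "increas", "profit"],
        "filing_2": ["loss", "increas"],
    }

    vocab = sorted({tok for tokens in docs.values() for tok in tokens})
    dtm = {doc: [Counter(tokens)[t] for t in vocab] for doc, tokens in docs.items()}

    print(vocab)  # ['increas', 'loss', 'profit']
    print(dtm)    # {'filing_1': [1, 0, 2], 'filing_2': [1, 1, 0]}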

15
Q

What are N-grams?

A

Tokens can also be sequences of words: a two-word sequence is a bigram and a three-word sequence a trigram, and the bag of words then becomes a bag of n-grams.
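A minimal sketch of n-gram generation from a token list:

    def ngrams(tokens, n):
        # Slide a window of length n across the token list.
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = ["revenue", "grew", "last", "quarter"]
    print(ngrams(tokens, 2))  # ['revenue grew', 'grew last', 'last quarter']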

16
Q

What does data exploration include?

A

EDA (exploratory data analysis) - summarizing and observing the data. EDA objectives include serving as a communication medium among project stakeholders, understanding data properties, finding patterns and relationships, and inspecting basic questions and hypotheses.
Feature selection - selecting the relevant features from the data set (fewer features = less complexity).
Feature engineering - creating new features by transforming or combining existing features.
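A minimal pandas sketch of feature engineering; the columns and the derived ratio are hypothetical, for illustration only:

    import pandas as pd

    df = pd.DataFrame({"debt": [20000, 5000], "income": [60000, 50000]})

    # Feature engineering: derive a new feature from existing ones.
    df["debt_to_income"] = df["debt"] / df["income"]
    print(df)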