The Data Science Handbook Flashcards

1
Q

What is data wrangling?

A

The nitty-gritty task of cleaning data and getting it into a standard format that is suitable for downstream analysis
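
A minimal sketch of what wrangling can look like in practice, assuming a handful of messy "name,age" strings. The sample records and the `clean` helper are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical raw records with inconsistent spacing, casing, and missing values.
raw = ["  Alice , 34", "BOB,n/a", "carol,  29 "]

def clean(line):
    """Turn one messy 'name,age' string into a record in a standard format."""
    name, age = (part.strip() for part in line.split(","))
    return {
        "name": name.title(),                       # normalize capitalization
        "age": int(age) if age.isdigit() else None  # non-numeric ages become None
    }

cleaned = [clean(line) for line in raw]
for record in cleaned:
    print(record)
```

Every record now has the same fields and types, which is what downstream analysis needs.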

2
Q

What is exploratory analysis?

A

A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory data analysis (EDA) relies heavily on visualization.

3
Q

What is a feature?

A

A feature is a small piece of data, usually a number or label, that is extracted from your data and characterizes some entity in your dataset. For example, you might extract the average word length or the number of characters from a text document.
Feature extraction means taking your raw datasets and distilling them down into a table of rows and columns (tabular data), with each row corresponding to some real-world entity and each column giving a single piece of information (generally a number) that describes the entity.
Extracting good features is the most important thing for getting your analysis to work.
Feature extraction is the most creative part of data science and the one most closely tied to domain expertise; typically a really good feature will correspond to some real-world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers.
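
A minimal sketch of the two example features named above (average word length and character count), assuming plain-text documents and naive whitespace tokenization. The `extract_features` function and the sample documents are hypothetical:

```python
def extract_features(doc_id, text):
    """Distill one raw text document into a single row of features."""
    words = text.split()  # naive whitespace tokenization (an assumption)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "doc_id": doc_id,
        "n_chars": len(text),
        "n_words": len(words),
        "avg_word_len": avg_word_len,
    }

# Hypothetical raw data: one entity (a document) per key.
raw_docs = {
    "a": "the cat sat",
    "b": "data science handbook",
}

# Tabular result: one row per document, one column per feature.
table = [extract_features(doc_id, text) for doc_id, text in raw_docs.items()]
for row in table:
    print(row)
```

Each dictionary is one row of the feature table; the keys play the role of column names.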

4
Q

What is a PRD?

A

A product requirements document is a document that specifies exactly what functionality a planned product should have.

5
Q

What is production code?

A

Software that is run repeatedly and maintained. It especially refers to the source code of a software product that is distributed to other people.

6
Q

What is SOW?

A

A statement of work is a document that specifies what work is to be done in a project, relevant timelines, and specific deliverables

7
Q

What is a target variable?

A

A feature you are trying to predict in machine learning. Sometimes it is already in your data, and other times you must construct it yourself. For example, if you want to figure out whether a client's customers will lose their brand loyalty, there may be no loyalty field in the data; it is just a log of various customer interactions and transactions, and you need to figure out how to measure “loyalty.”
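
A minimal sketch of constructing a target variable from a transaction log, assuming (purely for illustration) that a customer counts as "loyal" if they purchased before a cutoff date and again on or after it. The log, the cutoff, and the `build_target` helper are all hypothetical; a real loyalty definition would come from the business problem:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date).
log = [
    ("c1", date(2023, 1, 5)),
    ("c1", date(2023, 6, 20)),
    ("c2", date(2023, 2, 11)),
    ("c3", date(2023, 5, 30)),
]

CUTOFF = date(2023, 4, 1)  # assumed boundary between "history" and "outcome"

def build_target(log, cutoff):
    """Label 1 if a customer with pre-cutoff history bought again afterwards, else 0."""
    before, after = set(), set()
    for customer, day in log:
        (before if day < cutoff else after).add(customer)
    # Only customers with purchase history before the cutoff receive a label.
    return {c: int(c in after) for c in sorted(before)}

target = build_target(log, CUTOFF)
print(target)
```

Here `c3` gets no label because it has no history before the cutoff, which is itself a modeling decision.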

8
Q

What is the data science roadmap?

A

The data science road map:
1. Frame the problem
2. Understand the data
3. Extract features
4. Model and analyze
5a. Present results to a human (give business insights, likely in the form of a deck or report), OR
5b. Deploy code (the deliverable is a piece of software that performs some analytics work, e.g. an implementation of an algorithm)

9
Q

What is Excel best used for?

A

Simple data analysis

10
Q

What is Tableau best for?

A

Visualizing data in relational databases. It’s pretty limited in its functionality but makes beautiful graphics.

11
Q

What is Weka? What are its advantages?

A

A tool for applying pre-canned machine learning algorithms to datasets that are already well formatted and contain relevant features.

Weka's advantage is that it puts a user-friendly graphical interface (GUI) on top of powerful tools written in Java. Models you build in Weka during initial exploration can therefore be carried over into production code with little friction, especially if that code is also written in Java.

12
Q

What is a GUI?

A

Graphical user interface. It provides a more user-friendly way to interact with software compared to text-based interfaces. It allows users to perform tasks by clicking on visual elements rather than entering commands manually. GUIs are commonly used in applications, operating systems, and software tools to enhance the user experience and make it more intuitive.

13
Q

What do Excel, Tableau, and Weka all have in common?

A

They all assume data is in tabular form to begin with. Each dataset requires its own idiosyncratic data wrangling, and you need to be creative and flexible in what features you extract from raw data. That is why you need to be proficient in at least one programming language.

14
Q

What is Python? (4)

A

Best programming language for general-purpose use and a popular choice among data scientists. Balances the flexibility of a conventional scripting language with the numerical muscle of a good mathematics package.

Released in 1991.

High level scripting language with functionality similar to Perl and Ruby with a clean, self-consistent syntax.

Has open-source technical computing libraries that make it powerful for analytics

Designed for computer programmers and augmented with libraries for technical computing

15
Q

What is R?

A

Another popular programming language. While Python is designed for computer programmers and augmented with libraries for technical computing, R was designed by and for statisticians and is natively integrated with graphics capabilities and extensive statistical functions.

16
Q

What are some of the issues with R?

A

Its design dates back almost 40 years (through its predecessor, S), and it is showing its age: in places the syntax is clunky, support for strings is poor, and the type system is antiquated.

17
Q

Why would you use R over Python?

A

So many specialized libraries have been written for it, and Python hasn’t covered all the special use cases yet.

The statistics community uses R

18
Q

What are MATLAB and Octave?

A

MATLAB - best in class for numerical computing (nicer syntax than R and more powerful than Python). Octave is an open-source alternative to MATLAB that is largely compatible with it but has less functionality.

19
Q

What is SAS?

A

SAS (Statistical Analysis System) is a proprietary statistics framework that is very useful for business statistics applications. There is a lot of legacy code written in SAS, but it is not a good fit for general-purpose data science.

20
Q

What is Scala?

A

Up-and-coming language; not currently a general-purpose tool. Doesn’t have library support for analytics and visualizations. Similar to Java, but with simpler syntax and features borrowed from other languages.

21
Q

What is the “hello world!” script about?

A

A common way to learn a new programming language is to first write a “hello world!” program - a program that just prints the text “hello world!” to the screen
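
In Python, for instance, such a program can be as short as the sketch below (the variable name `message` is just an illustrative choice):

```python
# The classic first program: print a fixed greeting to the screen.
message = "hello world!"
print(message)
```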

22
Q

What is the Python Interpreter?

A

The Python interpreter is the program that executes Python code. It reads source code written in Python and carries out its instructions, translating the high-level code into a lower-level form (bytecode, in the standard implementation) that it then runs.

When you run a Python script or enter commands in the Python interactive mode, the Python interpreter processes the code line by line, executing each statement and producing the corresponding output. Python supports both an interactive mode, where you can enter commands directly, and a script mode, where you write your code in a separate file and execute it using the interpreter.

Python has a reference implementation known as CPython, which is the default and most widely used implementation of the Python programming language. There are also alternative implementations such as Jython (Python on the Java Virtual Machine), IronPython (Python on the .NET framework), and PyPy (a fast and just-in-time compiled implementation). Despite the variations in implementations, they all provide a Python interpreter to run Python code.
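
As a small sketch, the standard library can report which implementation is running the current code. The variable names here are illustrative, and the printed values depend on your installation:

```python
import platform
import sys

# Name of the implementation executing this code, e.g. "CPython" or "PyPy".
impl = platform.python_implementation()
# Version of the Python language it implements, as a string such as "3.x.y".
version = platform.python_version()

print(f"Running on {impl} {version}")
print(sys.implementation.name)  # lowercase identifier for the same implementation
```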

23
Q

What is the difference between interactive mode and script mode in Python?

A

Both modes use the Python interpreter.

Interactive mode: you enter commands directly at a prompt and see each result immediately (useful for exploring data and experimenting with what you want to do).

Script mode: you write your code in a separate file and execute it using the interpreter.
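
A minimal sketch of script mode, assuming the code below is saved in a hypothetical file named `summarize.py` and run from a shell with `python summarize.py`. In interactive mode you would instead start `python` with no arguments and type the same statements at the `>>>` prompt one at a time:

```python
# Contents of a hypothetical file named summarize.py

def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

if __name__ == "__main__":
    # This branch runs only when the file is executed as a script,
    # not when it is imported from another module.
    print(mean([1, 2, 3, 4]))
```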

24
Q

What is Big Data? What are the 3Vs?

A

Big Data refers to extremely large and complex sets of data that traditional data processing applications are inadequate to deal with. The term “Big Data” encompasses not only the size of the data but also its variety, velocity, and complexity. The three main characteristics of Big Data are often referred to as the “3Vs”:
1. Volume: Big Data involves the processing of vast amounts of data. This can range from terabytes to petabytes and beyond. The sheer volume of data is a key aspect of Big Data.
2. Velocity: Big Data is often generated at a high speed. Data streams in rapidly from various sources, such as social media, sensors, and business transactions. Real-time processing is often required to handle this continuous influx of data.
3. Variety: Big Data comes in various formats and types. It includes structured data (such as databases and spreadsheets), semi-structured data (like XML files and JSON), and unstructured data (such as text documents, images, videos, and social media posts).
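
To make "variety" concrete, here is a minimal sketch that reads one structured source (CSV) and one semi-structured source (JSON) and maps both onto a common record shape. The sample data and field names are hypothetical, and real Big Data pipelines would use distributed tools rather than in-memory lists:

```python
import csv
import io
import json

# Hypothetical structured data: fixed columns, one record per line.
csv_text = "id,amount\n1,9.99\n2,24.50\n"
# Hypothetical semi-structured data: a self-describing record with an extra field.
json_text = '{"id": 3, "amount": 5.0, "tags": ["promo"]}'

rows = list(csv.DictReader(io.StringIO(csv_text)))  # CSV values arrive as strings
record = json.loads(json_text)                      # JSON values keep their types

# Map both sources onto the same two fields.
unified = [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]
unified.append({"id": record["id"], "amount": record["amount"]})
print(unified)
```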