geelecds_notes Flashcards

1
Q

What is Data Science?

A

A multi-disciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from data.

Data Science involves data gathering, analysis, and decision-making, impacting various daily activities.

2
Q

What are the core elements of Data Science?

A
  • Computational - Algorithmic methods and code.
  • Statistical - Statistical inference for predictions.
  • Real-world Problems - Solving actual world issues, not theoretical models.
3
Q

What is the equation that represents the components of Data Science?

A

Data Science = Statistics + Data Collection + Data Preprocessing + Machine Learning + Visualization + Business Insights + Scientific Hypotheses + Big Data.

4
Q

In which industries is Data Science commonly used?

A
  • Banking
  • Consultancy
  • Healthcare
  • Manufacturing.
5
Q

What are some applications of Data Science in transport?

A
  • Route planning
  • Predictive analysis for delays
  • Driverless cars.
6
Q

What skills are required for a Data Scientist?

A
  • Machine Learning
  • Statistics
  • Programming (Python or R)
  • Mathematics
  • Databases.
7
Q

Fill in the blank: Data Science helps companies make _______.

A

[better decisions, predictive analysis, pattern discoveries].

8
Q

What was one of the key milestones in the development of Data Science in 1962?

A

John Tukey’s influential paper, ‘The Future of Data Analysis,’ which shifted the focus to a more exploratory approach in data analysis.

9
Q

What is Apache Spark?

A

An open-source data processing and analytics engine capable of handling large datasets, known for its fast data processing.

Originally developed as a faster alternative to MapReduce for Hadoop clusters.

10
Q

What is D3.js used for?

A

Creating custom data visualizations in the web browser using web standards like HTML, SVG, and CSS.

11
Q

What is IBM SPSS?

A

A suite of software tools for managing and analyzing complex statistical data.

Components include SPSS Statistics for statistical analysis and SPSS Modeler for predictive analytics.

12
Q

What are the key features of the Julia programming language?

A
  • High-performance for numerical computing
  • Combines simplicity with C/Java-like performance
  • Supports multiple dispatch for fast execution.
13
Q

What is the purpose of Jupyter Notebook?

A

To provide an open-source, web-based environment in which data scientists can create and share documents that combine live code, visualizations, and narrative text. It supports multiple programming languages.

14
Q

What is Keras?

A

A high-level deep learning API designed for easy experimentation with neural networks.

15
Q

What is the Matlab programming language used for?

A

Numerical computing and data visualization, supporting machine learning and predictive modeling.

16
Q

What does Matplotlib do?

A

A Python plotting library for creating static, animated, and interactive visualizations.

17
Q

What was the significance of the term ‘data scientist’ in 2008?

A

It became a buzzword, popularized by DJ Patil (then at LinkedIn) and Jeff Hammerbacher (then at Facebook).

18
Q

True or False: Data Science is confined to one discipline.

A

False.

19
Q

What has been a recent trend in Data Science programming?

A

A shift toward conservative programming with a focus on simpler, less risky algorithms.

20
Q

What is a major use of Data Science in healthcare?

A
  • Detecting tumors
  • Drug discoveries
  • Medical image analysis.
21
Q

Fill in the blank: Data Science is integral to business and academic research, encompassing areas like _______.

A

[machine translation, robotics, speech recognition, digital economy].

22
Q

What does predictive modeling in finance allow companies to do?

A

Predict customer lifetime value and stock market moves.

23
Q

What is the purpose of the autocomplete feature in Data Science?

A

To complete user input based on previously typed text.
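
A minimal sketch of one way this could work, assuming a plain prefix match against a history of earlier entries ranked by how often each was typed. The `history` list and the `autocomplete` helper are hypothetical names used only for illustration; production systems usually rank suggestions with statistical or learned models.

```python
from collections import Counter


def autocomplete(prefix, history, limit=3):
    """Suggest previously typed entries that start with `prefix`, most frequent first."""
    counts = Counter(entry.lower() for entry in history)
    matches = [(text, n) for text, n in counts.items() if text.startswith(prefix.lower())]
    # Sort by descending frequency, then alphabetically so ties are stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [text for text, _ in matches[:limit]]


# Hypothetical history of earlier user input.
history = ["data science", "data cleaning", "data science", "database", "deep learning"]
suggestions = autocomplete("dat", history)
print(suggestions)
```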

24
Q

What is the significance of the Knowledge Discovery in Databases workshop started in 1989?

A

It played a key role in the evolution of data science.

25
What is the importance of data normalization in Data Science?
To rescale values to a common, practical range so that features measured on different scales can be compared and analyzed accurately.
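
As an illustration, min-max scaling is one common normalization technique (the notes do not name a specific method, so this choice is an assumption). The `min_max_normalize` helper and the sample `incomes` values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical sample data.
incomes = [20_000, 35_000, 50_000, 80_000]
normalized = min_max_normalize(incomes)
print(normalized)
```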
26
What is fast experimentation?
In Keras, the ability to build and try out models quickly, supported by its Sequential and Functional APIs for model creation.
27
What is Matlab?
High-level programming language and environment for numerical computing and data visualization.
28
What are the key features of Matlab?
  • Used by engineers and scientists for algorithm design and data analysis
  • Supports machine learning, deep learning, and predictive modeling
  • Simulink tool offers model-based design and simulation
  • Good for mathematical modeling and engineering applications.
29
What is Matplotlib?
Python plotting library for creating static, animated, and interactive visualizations.
30
What are the key features of Matplotlib?
  • Generates 2D visualizations with high-level commands
  • Supports interactive plots in Jupyter Notebooks
  • Includes pyplot for simple plotting and object-oriented interfaces for complex visualizations.
31
What is NumPy?
Core Python library for scientific computing and handling multidimensional arrays.
32
What are the key features of NumPy?
  • Supports linear algebra, random number generation, and other mathematical functions
  • ndarray is the fundamental object for efficient storage and manipulation
  • Known for fast operations due to optimized C code.
33
What is Pandas?
Python library used for data manipulation and analysis.
34
What are the key features of Pandas?
  • Provides DataFrame (2D) and Series (1D) structures for data management
  • Supports various data formats like CSV, SQL, JSON
  • Features for data cleaning, reshaping, and data alignment
  • Built on top of NumPy.
35
What is Python?
High-level programming language with dynamic typing, interpreted semantics, and object-oriented design.
36
What are the key features of Python?
  • Widely used in data science, machine learning, and AI
  • Simplified syntax and readability
  • Supports multiple programming paradigms
  • Extensive libraries make it a go-to language for data analysis.
37
What is PyTorch?
Open-source framework for deep learning based on neural networks.
38
What are the key features of PyTorch?
  • Fast and flexible framework supporting GPU processing
  • Provides automatic differentiation and tools for deep learning
  • Tensors similar to NumPy arrays with GPU support
  • Popular for research and prototyping.
39
What is R?
Open-source programming language for statistical computing and data visualization.
40
What are the key features of R?
  • Widely used by data scientists, statisticians, and academics
  • Extensive package ecosystem for data manipulation and visualization
  • Known for intuitive syntax and complex statistical analysis.
41
What is SAS?
Integrated software suite for statistical analysis, predictive analytics, and data management.
42
What are the key features of SAS?
  • Offers robust tools for data cleansing, preparation, and analysis
  • Initially built for statisticians, but now supports a wide range of analytics tasks
  • Focused on enterprise use with SAS Viya, a cloud-native platform.
43
What is Scikit-learn?
Python library for machine learning, built on top of NumPy and SciPy.
44
What are the key features of Scikit-learn?
  • Supports supervised and unsupervised learning algorithms
  • Provides tools for model evaluation, data preprocessing, and model selection
  • Well suited to traditional machine learning, with only limited support for deep learning.
45
What is SciPy?
Python library for scientific and technical computing.
46
What are the key features of SciPy?
  • Extends NumPy with functions for optimization, interpolation, and differential equations
  • Adds additional mathematical and statistical functions.
47
What is TensorFlow?
Open-source machine learning platform by Google, primarily used for deep learning.
48
What are the key features of TensorFlow?
  • Uses tensors for data representation
  • Flexible platform for building neural networks
  • Includes tools for model training and deployment.
49
What is Weka?
Open-source workbench for machine learning and data mining.
50
What are the key features of Weka?
  • Includes algorithms for classification, clustering, regression, and association mining
  • Provides an intuitive GUI for users to apply machine learning.
51
What is data collection?
The process of collecting and evaluating information from multiple sources.
52
What are the types of data?
  • Primary Data
  • Secondary Data.
53
What is primary data?
Data collected directly through experiments or surveys.
54
What is secondary data?
Data pre-collected by others and easily accessible.
55
What is data cleaning?
The process of ensuring that a dataset is accurate, consistent, and ready for analysis.
56
What are common methods to gather data?
  • Public Datasets
  • Web Scraping
  • Surveys and Forms
  • Internal Data Sources
  • Data from Sensors and IoT Devices
  • Collaborations and Partnerships
  • Purchase Data
  • Crowdsourcing
  • Simulated Data.
57
What are the challenges of web scraping?
Not all websites offer structured data, and some sites restrict or block scraping.
58
What is web scraping?
An automatic method to obtain large amounts of data from websites.
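
A minimal sketch of the extraction step of a self-built scraper, using only Python's standard-library `html.parser`. To stay self-contained it parses an inline, hypothetical HTML snippet instead of downloading a real page; the `PriceParser` class and the `price` CSS class are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


# Hypothetical page content; a real scraper would fetch this over HTTP.
page = """
<ul>
  <li>Laptop <span class="price">899.00</span></li>
  <li>Mouse <span class="price">19.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(page)
prices = [float(p) for p in parser.prices]
print(prices)
```

Before scraping a live site, check its terms of use and robots.txt, since scraping can be restricted.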
59
What are the types of web scrapers?
  • Self-built Web Scrapers
  • Pre-built Web Scrapers
  • Browser Extension Web Scrapers
  • Software Web Scrapers
  • Cloud Web Scrapers
  • Local Web Scrapers.
60
What is the goal of data cleaning?
To ensure that the data is accurate, consistent, and free of errors.
61
What are the steps to perform data cleaning?
  • Understanding data structure
  • Identifying issues like missing values and duplicates
  • Removing unwanted observations.
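
The three steps can be sketched in plain Python as follows. The `records` list is hypothetical sample data, and treating exact duplicate rows as the "unwanted observations" is an assumption made to keep the example small.

```python
# Hypothetical sample records.
records = [
    {"id": 1, "age": 34, "city": "Manila"},
    {"id": 2, "age": None, "city": "Cebu"},
    {"id": 1, "age": 34, "city": "Manila"},  # exact duplicate of the first row
    {"id": 3, "age": 29, "city": None},
]

# Step 1: understand the structure (which columns exist).
columns = sorted({key for row in records for key in row})

# Step 2: identify issues (count missing values per column).
missing = {col: sum(row.get(col) is None for row in records) for col in columns}

# Step 3: remove unwanted observations (exact duplicates), keeping first occurrences.
seen = set()
unique_rows = []
for row in records:
    key = tuple(row.get(col) for col in columns)
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

duplicates_removed = len(records) - len(unique_rows)
print(columns, missing, duplicates_removed)
```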
62
What is data transformation?
The process of converting data into a suitable format for analysis.
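
For example, raw CSV text arrives as strings, and a transformation step can convert each field to a type suited to analysis. This is a sketch under assumed column names (`order_id`, `amount`, `order_date`) and hypothetical data, using only the standard library.

```python
import csv
import io
from datetime import date

# Hypothetical raw input: every field is read as a string.
raw = "order_id,amount,order_date\n1001,19.99,2024-01-15\n1002,5.00,2024-02-03\n"


def transform(row):
    """Convert one raw CSV row into typed values ready for analysis."""
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"]),
        "order_date": date.fromisoformat(row["order_date"]),
    }


orders = [transform(row) for row in csv.DictReader(io.StringIO(raw))]
total = sum(order["amount"] for order in orders)
print(orders[0], round(total, 2))
```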
63
What is the importance of data collection and data cleaning?
They ensure the quality and usability of data for effective analysis.
64
What are common uses of web scraping?
  • Price Comparison
  • Market Research
  • SEO Analysis
  • Social Media Monitoring.
65
What is a DataFrame in Pandas?
A 2D data structure provided by Pandas for data management.
66
Fill in the blank: Data cleaning is also known as _______.
data cleansing or scrubbing
67
True or False: Web scraping can only be done using pre-built tools.
False
68
What is the first step in the data cleaning process?
Thorough understanding of the data and its structure.
Footnote: This helps identify issues like missing values, duplicates, and outliers.
69
What does removal of unwanted observations involve?
Identifying and removing irrelevant or redundant observations.
Footnote: This includes analyzing data entries for duplicates and irrelevant information.
70
What are structural errors in a dataset?
Inconsistencies in data formats or variable types.
Footnote: These need to be addressed to ensure uniformity in data structure.
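
A small sketch of fixing such errors, assuming two typical cases: the same category written with inconsistent casing and whitespace, and a numeric field stored as mixed strings and numbers. The `raw_rows` data and the `fix_structure` helper are hypothetical.

```python
# Hypothetical rows with inconsistent labels and mixed value types.
raw_rows = [
    {"category": " electronics", "price": "19.50"},
    {"category": "Electronics ", "price": 19.5},
    {"category": "ELECTRONICS", "price": "19,5"},  # comma used as the decimal mark
]


def fix_structure(row):
    """Standardize label formatting and force the price into a float."""
    category = row["category"].strip().title()
    price = float(str(row["price"]).replace(",", "."))
    return {"category": category, "price": price}


clean_rows = [fix_structure(row) for row in raw_rows]
print(clean_rows)
```

After this step all three rows carry the same label and numeric type, so they can be grouped or compared reliably.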
71
What is an outlier?
A data point that deviates significantly from the other observations in the dataset.
Footnote: Managing outliers improves model accuracy.
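
One common rule of thumb for flagging outliers, assumed here and not taken from the notes, is Tukey's fences: treat a value as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The `readings` data and the `find_outliers` helper are hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one extreme reading.
readings = [10, 12, 11, 13, 12, 11, 14, 95]
outliers = find_outliers(readings)
print(outliers)
```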
72
What are some methods to handle missing data?
Imputing missing values, removing records, or using advanced imputation techniques.
Footnote: Handling missing data prevents biases and maintains integrity.
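
Two of these options, removing records and simple imputation, can be sketched as follows. Mean imputation is an assumed choice for illustration, and the `ages` list is hypothetical; advanced imputation techniques are not shown.

```python
from statistics import mean

# Hypothetical column with missing entries marked as None.
ages = [34, None, 29, 41, None, 36]

# Option 1: remove the records that have a missing value.
dropped = [a for a in ages if a is not None]

# Option 2: impute each missing value with the mean of the observed values.
fill_value = mean(dropped)
imputed = [fill_value if a is None else a for a in ages]

print(dropped)
print(imputed)
```

Removing records shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values.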
73
List some advantages of data cleaning in machine learning.
  • Improved model performance
  • Increased accuracy
  • Better representation of data
  • Improved data quality
  • Improved data security.
74
What is a disadvantage of data cleaning?
It is time-consuming.
Footnote: Especially for large and complex datasets.
75
What risk does data cleaning pose regarding data?
It can result in loss of important information.
Footnote: This highlights the importance of careful data cleaning.
76
What is a resource-intensive aspect of data cleaning?
It requires significant time, effort, and expertise.
Footnote: Specialized software tools may also be needed.
77
How can data cleaning contribute to overfitting?
By removing too much data.
Footnote: This can affect model generalization.
78
Name a data cleansing tool that supports removing duplicates.
OpenRefine.
Footnote: It is an open-source tool for cleaning and transforming messy data.
79
What is Trifacta Wrangler designed for?
Cleaning, transforming, and preparing data for analysis.
Footnote: It uses AI to suggest transformations.
80
What does TIBCO Clarity help with?
Profiling, standardizing, and enriching data.
Footnote: It ensures high-quality data and consistency across datasets.
81
What is the focus of Cloudingo?
De-duplication, data cleansing, and record management.
Footnote: It helps maintain the accuracy of data.
82
What is IBM InfoSphere QualityStage suitable for?
Large-scale and complex data.
Footnote: It provides robust data quality management.