C1 : What is Data Science? Flashcards

Understand Introductory concepts.

1
Q

What is MLOps?

A
  • Machine learning operations.
  • Tools that provide ongoing monitoring of models and automated retraining of drifted models.
2
Q

What is an Algorithm?

A

A set of step-by-step instructions to solve a problem or complete a task.

3
Q

What is a Model?

A

A representation of the relationships and patterns found in data.
* They are useful for making predictions or when analyzing complex systems.
* They retain the essential elements of the data needed for analysis.

4
Q

What’s an Outlier?

A

A data point that differs significantly from other observations, potentially indicating anomalies, errors, or unique phenomena that could impact statistical analysis or modeling.
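
One common way to flag such points is the 1.5 × IQR rule. The rule, the `find_outliers` helper, and the sample readings below are illustrative assumptions, not part of the card; this is only a minimal sketch.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
```

Whether a flagged point is an error or a genuine phenomenon still has to be judged case by case.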

5
Q

What is Structured Data?

A

Data that is organized and formatted into a predictable schema, usually related tables with rows and columns.

6
Q

What is Unstructured Data?

A
  • Unorganized data that lacks a predefined data model.
  • It is harder to analyze using traditional methods.
  • This data type often includes text, images, videos, and other content that doesn’t fit neatly into rows and columns like structured data.
7
Q

What does .CSV stand for?

A

Comma-separated values.

8
Q

What does .XLSX stand for?

A

Microsoft Excel Open XML Spreadsheet.

9
Q

What does .XML stand for?

A

Extensible Markup Language.

10
Q

What does .PDF stand for?

A

Portable Document Format (Adobe).

11
Q

What does .JSON stand for?

A

JavaScript Object Notation.

12
Q

What does .TSV stand for?

A

Tab-separated values.

13
Q

What are some of the benefits of the .JSON file format?

A
  • Language-independent data format.
  • Is considered one of the best tools for sharing data of any size and type, even audio and video.
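
To illustrate the language-independent point, here is a minimal sketch using Python's standard `json` module. The `record` contents are made up for illustration; any other language with a JSON parser could read the same `text`.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {"name": "Ada", "scores": [90, 85], "active": True}

text = json.dumps(record)      # serialize to a JSON string
restored = json.loads(text)    # parse the string back into Python objects
```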
14
Q

What are some of the benefits of the .XLSX file format?

A
  • XLSX uses the open file format.
  • It can use and save all functions available in Excel.
  • Is known to be one of the more secure file formats because it cannot store macros, a common vehicle for malicious code.
15
Q

What are some of the benefits of the .XML file format?

A
  • Readable by humans and machines.
  • It is a self-descriptive language.
  • Does not use predefined tags like .HTML does.
  • XML is platform independent.
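
A short sketch of the "self-descriptive, no predefined tags" idea, using Python's standard `xml.etree.ElementTree`. The `library`, `book`, and `title` tags are invented for this example, which is exactly what XML allows.

```python
import xml.etree.ElementTree as ET

# The tag names below are made up by the document author, not predefined by XML.
doc = "<library><book year='1999'><title>Example</title></book></library>"

root = ET.fromstring(doc)
titles = [book.find("title").text for book in root.findall("book")]
years = [book.get("year") for book in root.findall("book")]
```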
16
Q

What is a Data Visualization?

A

A visual way of representing data and its trends that is easily comprehensible.

17
Q

What defines a Delimited Text File?

A

It is a plain text file where a specific character separates the data values.
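
As a minimal sketch, the same table can be stored with a comma or a tab as the delimiter and read back with Python's standard `csv` module. The sample data and the `read_rows` helper are illustrative assumptions.

```python
import csv
import io

# The same hypothetical table, once comma-delimited and once tab-delimited.
csv_text = "name,age\nAda,36\nLin,41\n"
tsv_text = "name\tage\nAda\t36\nLin\t41\n"

def read_rows(text, delimiter):
    """Split delimited text into a list of rows, each a list of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_rows(csv_text, ",")
tsv_rows = read_rows(tsv_text, "\t")
```

Only the delimiter differs; both calls produce the same rows.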

18
Q

What is Hadoop?

A

An open-source framework designed to store and process large datasets across clusters of computers.

19
Q

What are Jupyter Notebooks?

A

An IDE and type of computational notebook that allows researchers to create and share code, equations, visualizations, and explanatory text.

(Also known as Python notebooks.)

20
Q

What is the Nearest Neighbor algorithm?

A

An algorithm that uses proximity to make classifications or predictions about how to group an individual data point.

Also known as KNN or k-NN.
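
A minimal sketch of classification by proximity, assuming Euclidean distance and a simple majority vote among the `k` closest points. The `knn_predict` helper and the training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, point, k=3):
    """Label `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: (coordinates, label).
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
label_near_a = knn_predict(train, (2, 2))
label_near_b = knn_predict(train, (8.5, 8.5))
```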

21
Q

What is a Neural Network?

A

A computational model used in deep learning that mimics the structure and functioning of the human brain’s neural pathways. It takes an input, processes it using previous learning, and produces an output.
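
The input, processing, and output steps can be sketched with a single artificial neuron. The fixed weights stand in for "previous learning" and are made-up values; a real network stacks many such units and learns the weights from data.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, assumed to come from earlier training.
output = neuron([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```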

22
Q

What is Pandas?

A
  • An open-source Python library that provides tools for working with structured data.
  • It is often used for data manipulation and analysis.
23
Q

What is R?

A

An open-source programming language used for statistical computing, data analysis, and data visualization.

24
Q

What is a recommendation engine?

A

A computer program that analyzes user input, such as behaviors or preferences, and makes personalized recommendations based on that analysis.

25
Q

What is regression?

A

A statistical model that identifies the strength and correlation of the relationship between one or more inputs and an output.
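
As a minimal sketch, simple linear regression with one input can be fitted by ordinary least squares. The `fit_line` helper and the data points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```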

26
Q

What defines Tabular Data?

A

Data that is organized into rows and columns.

27
Q

What are the five characteristics of Cloud Computing?

A
  • On-demand self-service.
  • Broad network access.
  • Resource pooling.
  • Rapid elasticity.
  • Measured service.
28
Q

What is on-demand self-service in cloud computing?

A

Access to cloud resources such as processing power, storage, and network without requiring human interaction with the service provider.

29
Q

What is broad network access in cloud computing?

A

When cloud computing resources can be accessed via the network through standard mechanisms and platforms such as mobile phones, tablets, laptops, and workstations.

30
Q

What is resource pooling in cloud computing?

A

A scheme that gives cloud providers economies of scale.
* Cloud resources are dynamically assigned and reassigned according to demand, without customers needing to know the physical location of these resources.

31
Q

What is rapid elasticity in cloud computing?

A

A characteristic of cloud computing whereby organizations are able to access more cloud resources when they need them, and scale back when they don’t.

32
Q

What is measured service in cloud computing?

A

A model by which an organization pays only for what it uses or reserves as it goes.
* Resource usage is monitored, measured, and reported transparently based on an organization’s utilization.
* If they’re not using resources, they’re not paying.

33
Q

What are the three Cloud Deployment Models?

A
  • Public Cloud.
  • Hybrid Cloud.
  • Private Cloud.
34
Q

What is a public cloud in cloud computing?

A

When an organization leverages cloud services over the open internet on hardware that is owned by the cloud provider and shared with other companies.

35
Q

What is private cloud in cloud computing?

A

Infrastructure provisioned for exclusive use by a single organization. It could run on-premises or it could be owned, managed, and operated by a service provider.

36
Q

What is hybrid cloud in cloud computing?

A

When an organization is leveraging a mix of public cloud(s) and private cloud(s) that are configured to work together seamlessly.

37
Q

What are the three cloud service models?

A
  • IaaS
  • PaaS
  • SaaS
38
Q

What does IaaS stand for?

A

Infrastructure as a Service.
A cloud computing service model that gives an organization access to the infrastructure and physical computing resources such as servers, networking, storage, and data center space without the need to manage or operate them.

39
Q

What does PaaS stand for?

A

Platform as a Service.
A cloud computing service model that gives an organization access to the platform that comprises the hardware and software tools that are usually needed to develop and deploy applications to users over the Internet.

40
Q

What does SaaS stand for?

A

Software as a Service.
A cloud computing service model that gives an organization access to a software licensing and delivery model whereby software and applications are centrally hosted on the cloud and licensed on a subscription basis.

41
Q

What are the V’s of Big Data?

A
  • Velocity - The speed at which data is accumulating.
  • Volume - The scale of the data accumulating.
  • Variety - The vast and growing number of data types.
  • Veracity - The quality and origin of data.
  • Value - Utility implicit in data.
42
Q

What is a Hadoop node?

A

A single computer.

43
Q

What is a Hadoop cluster?

A

A network of Hadoop nodes.

44
Q

What does HDFS stand for?

A

Hadoop Distributed File System.

45
Q

What is Apache Hive?

A

Data warehouse software.
It is open-source and excels at reading, writing, and managing large data set files that are stored directly in either HDFS or other data storage systems such as Apache HBase.

46
Q

What is the best use case for Apache Hive and why?

A

It is suited for data warehousing tasks such as ETL, reporting, and data analysis.
* This is because it is best suited to read-based queries with few writes, where high latency is acceptable.
* Think, “slow and meticulous tasks”.

47
Q

What is Apache Spark?

A

A distributed data analytics framework designed to perform complex data analytics in real-time.

48
Q

What is the best use case for Apache Spark and why?

A

Extracting and processing large volumes of data for a wide range of applications.

E.g:
* Interactive Analytics,
* Streams Processing,
* Machine Learning,
* Data Integration,
* ETL.

It is a general-purpose data processing engine that takes advantage of in-memory processing and only writes to disk when its memory is constrained.

49
Q

What is an in-sample forecast?

A

A test of the predictive capabilities of a model on observed data.
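
A minimal sketch of the idea, assuming a deliberately simple "persistence" model that predicts each value from the one before it. Both the model and the series are illustrative assumptions; the point is only that the predictions are scored against data that was already observed.

```python
# Hypothetical observed series.
observed = [10, 12, 13, 15, 14]

# Persistence model: the prediction for each step is the previous observation.
predictions = observed[:-1]
actuals = observed[1:]

# In-sample mean absolute error, computed only on already-observed data.
mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)
```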

50
Q

What are the seven steps of a Data Mining exercise?

A
  • Goal Setting.
  • Selecting Data.
  • Preprocessing.
  • Transforming Data.
  • Storing Data.
  • Data Mining.
  • Evaluating.
51
Q

In a data mining exercise, what is goal setting?

A

Identifying the key questions that need to be answered.
Also, performing a cost-benefit analysis of collecting the data vis-à-vis the expected level of accuracy and usefulness of the results obtained from the data mining exercise.

52
Q

In a data mining exercise, describe the process of selecting data.

A

Identifying relevant existing data, and/or collecting new data.
Costs in time and money should be kept in mind when acquiring any data.

53
Q

In a data mining exercise, what is preprocessing?

A

Developing and/or employing a formal method of dealing with missing data and determining whether the data are missing randomly or systematically.
Also, employing regular checks to ensure data integrity.

54
Q

In a data mining exercise, what is transforming?

A
  • Determining the appropriate format in which data must be stored.
  • Prioritizing reducing the number of attributes needed to explain the phenomena.
  • Using algorithms to convert the data to fit those determinations and priorities.
55
Q

In a data mining exercise, what considerations must be made when storing transformed data?

A
  • Is the format conducive to data mining?
  • Does the storage grant expedited read/write privileges to the data scientist?
  • Are you taking into account data safety and privacy concerns?
56
Q

In a data mining exercise, what is “data mining”?

A

Using data analysis methods, including parametric and non-parametric methods, and machine-learning algorithms to discover insights in the cleaned, transformed, and stored data.

57
Q

In a data mining exercise, what is evaluation?

A

The formal evaluation of data mining results.

58
Q

What language is Hadoop implemented in?

A

Java.

59
Q

In business, what is Digital Change / Digital Transformation?

A

A strategic and cultural organizational change driven by data science, especially Big Data, where digital technology is integrated across the organization, resulting in fundamental operational and value delivery changes.

60
Q

What is Data Replication in cloud computing?

A

A strategy in which data is duplicated across multiple nodes in a cluster to ensure data durability and availability, reducing the risk of data loss due to hardware failures.

61
Q

What is Commodity Hardware in cloud computing?

A

Standard, off-the-shelf hardware components that can be used in a big data cluster, offering cost-effective solutions for storage and processing without relying on specialized hardware.

62
Q

What is Data Science?

A

The process and method for extracting knowledge and insights from large volumes of disparate data.

63
Q

What is Data Mining?

A

Automatically searching and analyzing data, and discovering previously unrevealed patterns.

64
Q

What is Machine Learning?

A

A subset of AI that uses computer algorithms to analyze data and make intelligent decisions based on what it has previously learned without being explicitly programmed.

65
Q

What is Deep Learning?

A

A specialized subset of machine learning that uses layered neural networks to simulate human decision-making.
* Deep learning algorithms can label and categorize information and identify patterns.

66
Q

What is Generative AI?

A

A subset of AI that focuses on creating new data, such as images, music, text, or code, rather than just analyzing existing data.

67
Q

What does GAN stand for?

A

Generative Adversarial Networks.

68
Q

What does VAE stand for?

A

Variational Autoencoders.

69
Q

In general what do VAEs and GANs do?

A

These models create new instances of data that replicate the underlying distribution of the original data by learning patterns from enormous volumes of data.

70
Q

What is synthetic data?

A

Artificial data with properties similar to the real data, such as its distribution, clustering, and many other factors an AI learned about the real data set.

71
Q

What is an Artificial Neural Network?

A

Collections of small computing units (neurons) that process data and learn to make decisions over time.

72
Q

What is Bayesian Analysis?

A

Using Bayes’ theorem to update probabilities based on new evidence.
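
A worked sketch of one such update. The `posterior` helper and the numbers (a 1% prior, 90% likelihood of the evidence when the hypothesis is true, 5% when it is false) are made-up values chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | evidence) computed with Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: the evidence raises a 1% prior to roughly 15%.
updated = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```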

73
Q

What is Cluster Analysis?

A

Grouping similar data points together based on certain features or attributes.
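
One of many ways to do this grouping is k-means. The sketch below is a simplified one-dimensional version with fixed starting centers; the `kmeans_1d` helper, the points, and the starting centers are all assumptions for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: assign points to the nearest center, then re-average."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements that form two obvious groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, groups = kmeans_1d(points, centers=[0.0, 5.0])
```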

74
Q

What is a Decision Tree?

A

A type of machine learning algorithm used for decision-making that creates a tree-like structure of decisions.
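
The tree-like structure can be sketched as nested decisions. This tree is written by hand with invented features and labels; a real decision tree algorithm would learn the splits from data.

```python
# Hypothetical hand-written tree: inner nodes test a feature, leaves are decisions.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {"feature": "windy", "branches": {True: "stay in", False: "play"}},
        "rainy": "stay in",
    },
}

def decide(node, sample):
    """Walk from the root to a leaf by following the sample's feature values."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node

decision = decide(tree, {"outlook": "sunny", "windy": False})
```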

75
Q

Name two deep learning models.

A
  • Generative Adversarial Networks (GANs)
  • Variational Autoencoders (VAEs)
76
Q

What is NLP?

A

Natural Language Processing.
A field of AI that enables machines to understand, generate, and interact with human language.

77
Q

What is an Arithmetic Model?

A

A mathematical model to analyze data and predict outcomes.

78
Q

What is a Data Cluster?

A

A group of similar, related data points distinct from other clusters.

79
Q

What is an HPC?

A

High-performance computing cluster.
A computing technology that uses a system of networked computers designed to solve complex and computationally intensive problems in traditional environments.

80
Q

What is Stata?

A

A software package used for statistical analysis.

81
Q

What is SQL?

A

Structured Query Language.

82
Q

What is EDA?

A

Exploratory Data Analysis.

83
Q

What is Technical Metadata?

A

Technical definitions of the data structures.

84
Q

What is Process Metadata?

A

Data that describe the processes that operate behind business systems such as data warehouses, accounting systems, or CRM tools.

85
Q

What is Business Metadata?

A

It is information about the data described in readily interpretable ways.

86
Q

What is a NoSQL database?

A

A database designed to store and manage unstructured data.

87
Q

What does RDBMS mean?

A

Relational Database Management System.

88
Q

In a DB, Rows are called?

A

Records.

89
Q

In a DB, Columns are called?

A

Attributes.

90
Q

What is ACID?

A

Atomicity, Consistency, Isolation, and Durability
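
Atomicity can be sketched with Python's standard `sqlite3` module: either every statement in a transaction takes effect or none does. The `accounts` table, its `CHECK` constraint, and the amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'b'")
        # This update would make the balance negative and violates the CHECK.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'a'")
except sqlite3.IntegrityError:
    pass

# Both updates were rolled back together, so the balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```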

91
Q

What is a Data Mart?

A

A sub-section of the data warehouse, built specifically for a particular function.

92
Q

What is a Data Lake?

A

A pool of raw data, where data is simply tagged with a UID for future use.

93
Q

What does NoSQL stand for?

A

Not Only SQL.