BigData Flashcards

1
Q

The major characteristics used to define big data are

A

volume, variety and velocity.

2
Q

Distributed computing allows us to

A

process big data because it divides the data into more manageable chunks and distributes the work among computers that can process those chunks.
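
As a rough illustration of the divide-and-distribute idea, the sketch below splits a list into chunks, processes each chunk independently, and combines the partial results. It is only an analogy: worker threads on one machine stand in for separate computers, and the names `split_into_chunks`, `process_chunk`, and `distributed_sum` are hypothetical, not part of any big data framework.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide the data into more manageable chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one chunk (here: a partial sum)."""
    return sum(chunk)


def distributed_sum(records, chunk_size=4, workers=3):
    chunks = split_into_chunks(records, chunk_size)
    # Assumption: each worker thread stands in for one computer in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)


total = distributed_sum(list(range(1, 11)))
print(total)  # 55
```

Real distributed systems follow the same split, process, combine shape, but also have to move data between machines and recover from failures.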

3
Q

When we think in terms of big data processing, there are two types of data that we process - batch and streaming data.

What are the differences between batch and streaming data?

A

Batch data is data that we have in storage and that we process all at once, or in a batch.
Streaming data is data that is being continually produced by one or more sources and therefore must be processed incrementally as it arrives.
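
A minimal sketch of the contrast, assuming a simple average is the computation of interest: the batch version reads everything from storage at once, while the streaming version updates its result each time a new value arrives. The function names and the sample `readings` are illustrative only.

```python
def batch_mean(stored_values):
    """Batch: all the data is already in storage, so process it in one go."""
    return sum(stored_values) / len(stored_values)


def streaming_means(source):
    """Streaming: update the result incrementally as each value arrives."""
    count = 0
    mean = 0.0
    for value in source:
        count += 1
        mean += (value - mean) / count  # running-mean update
        yield mean


readings = [10.0, 20.0, 30.0, 40.0]

print(batch_mean(readings))  # 25.0

# Here an iterator stands in for a source that keeps producing data.
running = list(streaming_means(iter(readings)))
print(running)  # [10.0, 15.0, 20.0, 25.0]
```

The streaming version never needs the full dataset in memory, which is why it suits data that keeps being produced.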

4
Q

Data warehouses
Data lakes
Unified data platforms

A

Data warehouse technology emerged in the 1980s and provides a centralized repository for storing all of an organization’s data. Data warehouses can be on-premises or in the cloud.

Unlike data warehouses, which usually take clean data, data lakes store data in its raw format. Data lakes can store unstructured as well as structured data, and are known to be more horizontally scalable (in other words, it’s easy to keep adding more data to a data lake).

Finally, a data storage system that is quickly gaining popularity today is the unified data platform. It provides all of the benefits of data lakes, with the addition of some data warehousing capabilities, all wrapped up in a platform that your data teams can work in together.

5
Q

Data storage systems:

A

Data warehouses
Data lakes
Unified data platforms

6
Q

The whole point of working with big data is

A

to be able to extract insights that can help drive business decisions.

7
Q

Artificial intelligence, machine learning, deep learning and data science

A

Artificial intelligence (AI) is a branch of computer science in which computer systems are developed to perform tasks that would typically need human intelligence. AI is a broad field that encompasses many techniques.

Machine learning (ML) is a subset of artificial intelligence that works very well with structured data. The goal behind machine learning is for machines to learn patterns in your data without you explicitly programming them to do so. There are a few types of machine learning; the most commonly used type is called supervised machine learning.

Deep learning (DL) is a subset of machine learning that uses neural networks, which are sets of algorithms modeled on the structure of the human brain. Neural networks are much more complex than most machine learning models, and require significantly more time and effort to build. Unlike traditional machine learning models, whose performance tends to plateau after a certain amount of data, deep learning models often continue to improve as the data size increases. Deep learning performs well on complex datasets like images, sequences and natural language.

Data science is a field that combines tools and workflows from disciplines like math and statistics, computer science and business, to process, manage and analyze data. Data science is very popular in businesses today as a way to extract insights from big data to help inform business decisions.

8
Q

Data science workflow

A

The data science workflow is a series of steps that data practitioners follow to work with big data. It is a cyclical process that often starts with identifying business problems and ends with delivering business value.

9
Q

The data science workflow

A
  • Identifying business needs
  • Data ingestion
  • Data cleansing / preparation
  • Data analysis (Although machine learning and deep learning aren’t the only types of analysis that can be applied to your data, they are becoming more and more popular today, especially when it comes to big data.)
  • Sharing insights
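
The steps above can be sketched as a toy pipeline. Everything here is made up for illustration: the business need (average order value per region), the in-memory CSV, and the function names `ingest`, `clean`, `analyze`, and `share`. A real workflow would read from actual data sources and often use dedicated tooling.

```python
import csv
import io
from statistics import mean

# Identifying business needs (hypothetical): average order value per region.
RAW_CSV = """region,order_value
north,120.0
south,80.0
north,
south,100.0
"""


def ingest(raw_text):
    """Data ingestion: load raw records from a source (an in-memory CSV here)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean(rows):
    """Data cleansing: drop rows with missing values and convert types."""
    return [
        {"region": r["region"], "order_value": float(r["order_value"])}
        for r in rows
        if r["order_value"]
    ]


def analyze(rows):
    """Data analysis: compute the average order value for each region."""
    by_region = {}
    for r in rows:
        by_region.setdefault(r["region"], []).append(r["order_value"])
    return {region: mean(values) for region, values in by_region.items()}


def share(insights):
    """Sharing insights: turn the results into readable report lines."""
    return [
        f"{region}: average order value {value:.2f}"
        for region, value in sorted(insights.items())
    ]


report = share(analyze(clean(ingest(RAW_CSV))))
for line in report:
    print(line)
```

Because the workflow is cyclical, the shared report would typically prompt new business questions and another pass through the same steps.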
10
Q

Big data refers to data that …

A

is nearly impossible to process using traditional methods, like a single computer, because there’s so much of it, being generated so quickly, in many different formats.

11
Q

Velocity?

A

refers to the speed at which new data is generated and the speed at which data moves around.
