Selecting appropriate storage technologies Flashcards

Chapter 1

1
Q

Four stages of data lifecycle

A
  1. Ingest
  2. Store
  3. Process and analyze
  4. Explore and visualize
2
Q

Define ingestion stage

A

Acquiring data and bringing data into GCP

3
Q

Define storage stage

A

Persisting data to a storage system from which it can be accessed in later stages of the data lifecycle

4
Q

Define process and analyze stage

A

Transforming data into a usable format for analysis applications

5
Q

Define explore and visualize stage

A

Insights are derived from analysis and presented in tables, charts, and other visualizations for use by others.

6
Q

Three broad ingestion modes

A
  1. Application data
  2. Streaming data
  3. Batch data
7
Q

How is application data generated?

A

Generated by applications, including mobile apps, and pushed to backend services

8
Q

What does application data include?

A
  1. User-generated data (e.g., name, address)
  2. Data generated by the app (e.g., logs)
  3. Event data (e.g., clickstream)
9
Q

Examples of services that can ingest application data

A
  1. Compute Engine
  2. Kubernetes Engine
  3. App Engine
10
Q

Examples of locations that Application data can be written to:

A
  1. Stackdriver Logging
  2. Managed databases such as: Cloud SQL or Cloud Datastore
11
Q

What are examples of types of streaming data?

A
  1. Sensor data
  2. Event data
12
Q

What is the event time?

A

Streaming data often includes a timestamp indicating when the data was generated. That timestamp is the event time.

13
Q

What is the process time?

A

When streaming data, some applications will also track the time that data arrives at the beginning of the ingestion pipeline. This is the process time.
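A minimal sketch of the two timestamps, assuming a record that carries both its event time and its process time. The `Reading` class and its field names are hypothetical and not part of any GCP API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Reading:
    """A hypothetical streaming record carrying both timestamps."""
    sensor_id: str
    value: float
    event_time: datetime    # when the data was generated at the source
    process_time: datetime  # when the data arrived at the start of the pipeline


def ingestion_lag(reading: Reading) -> timedelta:
    """Return how long the record took to reach the ingestion pipeline."""
    return reading.process_time - reading.event_time


generated = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
arrived = generated + timedelta(seconds=3)
reading = Reading("sensor-1", 21.5, generated, arrived)
print(ingestion_lag(reading))
```

The gap between the two values is one rough way to see how late data is arriving.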

14
Q

Why may time-series data require some additional processing early in the ingestion process?

A

If a stream of data needs to be in time order for processing, then late arriving data will need to be inserted in the correct position in the stream. This can require buffering of data for a short period of time in case the data arrives out of order.
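One way to picture the buffering described above is a small reorder buffer keyed on event time. This is only a sketch: the `reorder` function, the `(event_time, payload)` tuple layout, and the `max_delay` tolerance are assumptions made for illustration. Events that arrive later than `max_delay` behind the newest event seen would still come out of order.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (event_time in seconds, payload)


def reorder(events: Iterable[Event], max_delay: int) -> Iterator[Event]:
    """Yield events in event-time order, tolerating lateness up to max_delay."""
    buffer: List[Event] = []  # min-heap ordered by event time
    latest_seen = None
    for event in events:
        heapq.heappush(buffer, event)
        latest_seen = event[0] if latest_seen is None else max(latest_seen, event[0])
        # Release events old enough that no tolerated late event can precede them.
        while buffer and buffer[0][0] <= latest_seen - max_delay:
            yield heapq.heappop(buffer)
    # The stream ended: flush whatever is still buffered, in order.
    while buffer:
        yield heapq.heappop(buffer)


# Events in arrival order; "b" and "d" arrive late.
arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d"), (6, "f")]
ordered = list(reorder(arrivals, max_delay=2))
print(ordered)
```

A larger `max_delay` tolerates later arrivals but holds data in the buffer longer before it is released.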

15
Q

What is Google Cloud Pub/Sub?

A

A fully managed, scalable, global, and secure messaging service that lets you send and receive messages among applications and services

16
Q

How does Cloud Pub/Sub ingestion aid streaming data?

A

Streaming data is well suited for Cloud Pub/Sub ingestion because it can buffer data while applications process the data.

17
Q

What happens during streaming data spikes when application instances cannot keep up with the rate at which data is arriving?

A

When this happens, the data can be preserved in a Cloud Pub/Sub topic and processed later, after application instances have had a chance to catch up.
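The effect can be simulated with an in-memory queue. The sketch below is not the Cloud Pub/Sub client library; the `topic` deque, the arrival counts, and the per-tick consumer capacity are made-up values chosen only to show a backlog building during a spike and draining afterward.

```python
from collections import deque

# In-memory stand-in for a topic; this is NOT the Cloud Pub/Sub client library.
topic = deque()
processed = []

arrivals_per_tick = [5, 5, 1, 0, 0]  # a spike in the first two ticks (made-up numbers)
capacity_per_tick = 3                # messages the consumer can handle per tick

next_id = 0
backlog_history = []
for arriving in arrivals_per_tick:
    # Producers publish; nothing is dropped even if consumers are behind.
    for _ in range(arriving):
        topic.append(next_id)
        next_id += 1
    # The consumer pulls only as much as it can process this tick.
    for _ in range(min(capacity_per_tick, len(topic))):
        processed.append(topic.popleft())
    backlog_history.append(len(topic))

print(backlog_history)
```

Every message published during the spike is eventually processed, in the order it was published, once the arrival rate falls below the consumer's capacity.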

18
Q

How is Cloud Pub/Sub set up in a way that is accessible and scalable?

A

Cloud Pub/Sub has global endpoints and uses GCP's global frontend load balancer to support ingestion. The messaging service scales automatically to meet the demands of the current workload.

19
Q

How is batch data ingested?

A

Batch data is ingested in bulk, typically in files.

20
Q

What GCP services are often used for batch uploads?

A

Google Cloud Storage is typically used for batch uploads. It may also be used in conjunction with Cloud Transfer Service and Transfer Appliance when uploading large volumes of data.

21
Q

What are the minimum three things that should be considered when choosing a storage system?

A
  1. How data is accessed
  2. How access controls need to be implemented
  3. How long data will be stored
22
Q

Examples of databases to use when you need to query for specific records using a set of filtering parameters

A

Cloud SQL, Cloud Datastore

23
Q

Examples of options when needing to access data in bulk

A

Cloud Storage

24
Q

Options for when you need to access files using filesystem operations

A

Cloud Filestore

25
Q

What is nearline storage?

A

Nearline storage is used for data that is accessed less than once per 30 days.

26
Q

What is Coldline storage?

A

Coldline storage is used to store data accessed less than once per year.
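The access-frequency thresholds on the Nearline and Coldline cards can be turned into a small helper. This is a sketch only: `suggest_storage_class` is a hypothetical function, and the "standard" fallback label is an assumption, since these cards do not define a class for frequently accessed data.

```python
def suggest_storage_class(accesses_per_year: float) -> str:
    """Suggest a storage class using the thresholds stated on the cards.

    Assumption: anything accessed more often than once per 30 days gets a
    generic "standard" label, which the cards themselves do not define.
    """
    if accesses_per_year < 1:
        # Accessed less than once per year.
        return "coldline"
    if accesses_per_year < 365 / 30:
        # Accessed less than once per 30 days.
        return "nearline"
    return "standard"


for rate in (0.5, 4, 52):
    print(rate, suggest_storage_class(rate))
```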

27
Q

What is a service that is suited to transform both stream and batch data?

A

Cloud Dataflow

28
Q

What are services that are useful for data analysis?

A
  1. Cloud Dataflow
  2. Cloud Dataproc
  3. BigQuery
  4. Cloud ML Engine
29
Q

What is Cloud Datalab?

A

Cloud Datalab, which is based on Jupyter Notebooks, is a GCP tool for exploring, analyzing, and visualizing data sets.

30
Q

What are the 5 technical aspects of data?

A
  1. Volume
  2. Velocity
  3. Variation
  4. Access
  5. Security
31
Q

An individual item in Cloud Storage can be up to ___ TB.

A

5 TB

32
Q

Cloud Bigtable can store up to ___ TB per node when using a hard disk drive and ___ TB per node when using SSDs.

A

8 TB, 2.5 TB
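As a worked example of the per-node limits on this card, the sketch below estimates the fewest nodes needed to hold a given data volume. `min_nodes_for_storage` is a hypothetical helper that considers storage only. Throughput or latency needs could require more nodes.

```python
import math

# Per-node storage limits as stated on the card (TB).
TB_PER_NODE = {"hdd": 8.0, "ssd": 2.5}


def min_nodes_for_storage(data_tb: float, disk: str) -> int:
    """Smallest node count whose combined storage limit covers data_tb."""
    return max(1, math.ceil(data_tb / TB_PER_NODE[disk]))


print(min_nodes_for_storage(20, "hdd"))
print(min_nodes_for_storage(20, "ssd"))
```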

33
Q

In general, Cloud SQL is a good choice for applications that need:

A
  1. A relational database
  2. To serve requests in a single region
34
Q

What is velocity of data?

A

Velocity of data is the rate at which it is sent to and processed by an application

35
Q

What are examples of low-velocity and high-velocity data?

A

Low velocity: human-entered data

High velocity: machine-generated data, such as IoT data

36
Q

What is structured data?

A

Structured data has a fixed set of attributes that can be modeled in a table of rows and columns

37
Q

What is semi-structured data?

A

Semi-structured data has attributes like structured data, but the set of attributes can vary from one instance to another.
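The difference between the two can be shown with a few sample records. The records and the `has_fixed_attributes` helper below are made up for illustration.

```python
# Structured: every record has the same fixed set of attributes.
structured_rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Lin", "city": "Taipei"},
]

# Semi-structured: attributes vary from one instance to another.
semi_structured_docs = [
    {"id": 1, "name": "Ada", "languages": ["en", "fr"]},
    {"id": 2, "name": "Lin", "address": {"city": "Taipei"}},
]


def has_fixed_attributes(records) -> bool:
    """True when all records share exactly the same attribute names."""
    return len({frozenset(r.keys()) for r in records}) <= 1


print(has_fixed_attributes(structured_rows))
print(has_fixed_attributes(semi_structured_docs))
```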

38
Q

Examples of row oriented storage

A

Cloud SQL and Cloud Spanner

39
Q

How do wide-column databases organize information?

A

Rather than using indexes to allow efficient lookup of rows with needed data, wide-column databases organize data so that rows with similar row keys are close together.
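A rough sketch of why this layout helps: if rows are kept sorted by row key, all rows sharing a key prefix form one contiguous range. The `sensor#timestamp` key layout and the `scan_prefix` helper are assumptions for illustration and do not use the Bigtable API.

```python
import bisect

# Rows kept sorted by row key, so related keys sit next to each other.
# The "sensor#timestamp" key layout is a made-up example.
rows = sorted([
    ("sensor-b#2024-01-01T00:00", 17.0),
    ("sensor-a#2024-01-01T00:05", 21.9),
    ("sensor-a#2024-01-01T00:00", 21.5),
    ("sensor-c#2024-01-01T00:00", 19.2),
])
keys = [key for key, _ in rows]


def scan_prefix(prefix: str):
    """Return all rows whose key starts with prefix, as one contiguous slice."""
    start = bisect.bisect_left(keys, prefix)
    # "\uffff" sorts after any character used in these keys.
    end = bisect.bisect_left(keys, prefix + "\uffff")
    return rows[start:end]


print(scan_prefix("sensor-a#"))
```

Because the matching rows are adjacent, the scan needs no secondary index, which is also why the choice of row key matters so much in this kind of database.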

40
Q

Wide-column databases suit use cases with the following characteristics:

A
  1. High volumes of data
  2. Need for low-latency writes
  3. More write operations than read operations
  4. Limited range of queries - in other words no ad hoc queries
  5. Look up by a single key
41
Q

What are the four types of NoSQL databases available in GCP?

A
  1. Key-value
  2. Document
  3. Wide column
  4. Graph
42
Q

What is a key-value data store?

A

Databases that use associative arrays or dictionaries as the basic datatype.

43
Q

When should you use a key-value data store, and when a document database?

A

In situations where items in the JSON structure should be searchable, a document database would be a better option.
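The contrast can be sketched with plain Python structures. The store contents and the `kv_get` and `find_by_field` helpers are hypothetical and stand in for real database calls.

```python
import json

# Key-value style: the value is an opaque blob; lookup is only by key.
kv_store = {
    "user:1": json.dumps({"name": "Ada", "city": "London"}),
    "user:2": json.dumps({"name": "Lin", "city": "Taipei"}),
}


def kv_get(key: str) -> str:
    """Fetch the stored blob for a key without looking inside it."""
    return kv_store[key]


# Document style: fields inside the JSON structure are searchable.
documents = [json.loads(value) for value in kv_store.values()]


def find_by_field(field: str, value):
    """Return every document whose field equals value."""
    return [doc for doc in documents if doc.get(field) == value]


print(kv_get("user:1"))
print(find_by_field("city", "Taipei"))
```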

44
Q

What is the most significant difference between a wide-column database and relational tables?

A

Wide column databases are often sparse, with the exception of IoT and other time series databases that have few columns that are almost always used.

45
Q

What is a Graph database?

A

A database based on modeling entities and relationships as nodes and links in a graph or network. Social networks are a good example of a use case for graph databases.
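A small sketch of the social-network case, with people as nodes and friendships as links. The `friends` data and the `within_hops` helper are made up for illustration. A real graph database would expose its own query language instead.

```python
from collections import deque

# Nodes are people; links are friendships (made-up data).
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "dee"},
    "cy": {"ana"},
    "dee": {"ben", "eli"},
    "eli": {"dee"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: everyone reachable from start in at most max_hops links."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen - {start}


print(sorted(within_hops(friends, "ana", 2)))
```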

46
Q

Which Google Cloud storage services are used with the different structure types?

A

Structured data is stored in Cloud SQL and Cloud Spanner if it is used with a transaction processing system.

BigQuery is used for analytical applications of structured data.

Semi-structured data is stored in Cloud Datastore if data access requires full indexing; otherwise, it can be stored in Bigtable.
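The rules on this card can be written as a small lookup. `suggest_service` and its parameter names are hypothetical, and the function only encodes the cases the card mentions.

```python
def suggest_service(structure: str, workload: str = "", full_indexing: bool = False) -> str:
    """Map the card's rules to a service name; other cases are not covered."""
    if structure == "structured":
        if workload == "transactional":
            return "Cloud SQL or Cloud Spanner"
        if workload == "analytical":
            return "BigQuery"
    elif structure == "semi-structured":
        # Full indexing points to Cloud Datastore; otherwise Bigtable.
        return "Cloud Datastore" if full_indexing else "Bigtable"
    raise ValueError(f"no rule on this card for {structure!r} / {workload!r}")


print(suggest_service("structured", "analytical"))
print(suggest_service("semi-structured", full_indexing=True))
```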