Adswerve Study Guide Flashcards

1
Q

What is BigQuery?

A

BigQuery is Google Cloud's petabyte-scale data warehouse

2
Q

BigQuery Formats

A

Avro, CSV, JSON (newline-delimited), ORC, Parquet, Cloud Datastore exports, Cloud Firestore exports

3
Q

Parquet

A

A columnar data format commonly used on HDFS; BQ compatible (can be loaded into BigQuery)

4
Q

ORC

A

(Optimized Row Columnar) A columnar data format commonly used on HDFS; BQ compatible

5
Q

Hadoop

A

Software framework for distributed storage and processing of big data

6
Q

HDFS

A

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

7
Q

Dataproc: HDFS or GCS

A

Google recommends using GCS for storage instead of HDFS

8
Q

Why does Google recommend PubSub over Kafka?

A

Better scaling, fully managed service

9
Q

How long can Kafka retain messages?

A

For as long as you configure it to; retention is configurable

10
Q

How long can PubSub retain messages?

A

7 days

11
Q

Is Kafka push or pull?

A

Pull

12
Q

Is PubSub push or pull?

A

Both

13
Q

Does Kafka guarantee ordering?

A

Yes, within a partition

14
Q

Does PubSub guarantee ordering?

A

No

15
Q

Kafka Delivery Guarantee

A

At most once, at least once, exactly once (limited)

16
Q

PubSub Delivery Guarantee

A

At least once for each subscription

17
Q

Spark

A

Runs on top of Hadoop. A framework that processes data in memory (RAM) rather than on disk
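
A minimal PySpark sketch of the idea, assuming a Dataproc cluster (or local Spark) with the GCS connector available; the bucket path and column name are placeholders:

```python
from pyspark.sql import SparkSession

# On Dataproc the GCS connector is preinstalled, so Spark can read gs:// paths
# directly instead of relying on cluster-local HDFS.
spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

df = spark.read.csv("gs://my-bucket/events/*.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame in executor RAM, so repeated actions avoid
# re-reading from storage -- the "uses RAM to process data" part.
df.cache()
print(df.count())
df.groupBy("event_type").count().show()
```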

18
Q

What is BQ default encoding?

A

UTF-8

19
Q

Dataproc: is HDFS data persistent?

A

No; it goes away when the Dataproc cluster is shut down

20
Q

Dataproc: is GCS data persistent?

A

Yes, it remains even when a cluster is shut down

21
Q

BigTable: What causes hotspotting?

A

Contiguous (sequential) row keys, which concentrate reads and writes on a single node. Example keys: 20190101, 20190102, 20190103, …

22
Q

BigTable: How to prevent hotspotting?

A

Make row keys non-contiguous, e.g. by adding a hashed prefix. Example keys: a93js-20190101, vomdn-20190102, odsjs-20190103
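
A minimal sketch of one common way to build non-contiguous row keys, salting the date with a short hash prefix; the field names are made up for illustration:

```python
import hashlib

def make_row_key(entity_id: str, date: str) -> str:
    # A short hash of the entity id spreads otherwise-sequential date keys
    # across Bigtable's key space, so writes don't pile onto one node.
    prefix = hashlib.md5(entity_id.encode()).hexdigest()[:5]
    return f"{prefix}-{date}"

print(make_row_key("user-42", "20190101"))  # prints something like "<hash>-20190101"
print(make_row_key("user-43", "20190102"))
```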

23
Q

What is ANSI SQL?

A

The SQL standard; BigQuery's Standard SQL dialect is ANSI-compliant

24
Q

Dataproc: when to create a cluster?

A

Clusters are recommended to be job-specific: create a separate (ephemeral) cluster for each job

25
Q

Which is simpler? cbt or hbase shell

A

cbt

26
Q

Databases: What does an index do?

A

Improves the search speed of a specific column

27
Q

What is an MID?

A

Machine-generated IDentifier. A unique identifier for an entity in Google’s Knowledge Graph

28
Q

BQ: Does LIMIT clause reduce cost?

A

No. All the data will still be queried and billed
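
A quick way to see this, assuming the google-cloud-bigquery client library and a placeholder table name: a dry run reports the bytes that would be billed, and the number is the same with or without LIMIT.

```python
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

for sql in (
    "SELECT * FROM `my-project.my_dataset.my_table`",           # placeholder table
    "SELECT * FROM `my-project.my_dataset.my_table` LIMIT 10",
):
    job = client.query(sql, job_config=config)
    # Dry runs don't execute or bill; they only report bytes that would be scanned.
    print(sql, "->", job.total_bytes_processed, "bytes")
```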

29
Q

BQ: Can you change an existing table to use partitions?

A

No. You must create a partitioned table from scratch

30
Q

BQ: which column specifies a partition?

A

_PARTITIONTIME
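
A sketch of filtering an ingestion-time-partitioned table on this pseudo-column, assuming the google-cloud-bigquery client; the table name is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT *
FROM `my-project.my_dataset.partitioned_table`  -- placeholder
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-01-01') AND TIMESTAMP('2019-01-31')
"""
# Filtering on _PARTITIONTIME prunes partitions, so only January 2019 is scanned.
for row in client.query(sql).result():
    print(row)
```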

31
Q

BQ: which column specifies a shard (wildcard)?

A

_TABLE_SUFFIX
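
A sketch of querying sharded tables with a wildcard plus _TABLE_SUFFIX, using the public Google Analytics sample dataset as the example (assuming it is still available):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT _TABLE_SUFFIX AS day, COUNT(*) AS sessions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
GROUP BY day
ORDER BY day
"""
# The wildcard matches every ga_sessions_YYYYMMDD shard; _TABLE_SUFFIX limits
# the query to the July 2017 shards.
for row in client.query(sql).result():
    print(row.day, row.sessions)
```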

32
Q

BQ: At what levels can you control access?

A

Project and Dataset

33
Q

BQ: Can you limit access to a table?

A

No. Dataset is the most granular level of access

34
Q

BQ: How long are query results cached?

A

24 hours

35
Q

BQ: Majority of time spent in wait stage

A

Queries are waiting for available slots. Options: just wait, buy more slots, query over smaller datasets, or make your queries' compute stages execute faster

36
Q

BQ: Majority of time spent in read stage

A

This is the ideal case: all other operations were less expensive than the base cost of reading the input data. You can improve it by partitioning commonly used tables, but there is not much else to do.

37
Q

BQ: Majority of time spent in compute stage

A

Filter as early as possible. Pre-calculate common calculations.

38
Q

BQ: Majority of time spent in write stage

A

Expected if we emit more data than was originally read from inputs.

39
Q

Which storage options have transactions?

A

Cloud SQL, Spanner, Datastore

40
Q

Which storage options have high throughput?

A

BigTable

41
Q

When to use datastore?

A

NoSQL. Fast reads, slower writes

42
Q

When to use bigtable?

A

NoSQL. Massive throughput

43
Q

PubSub: What is a push subscription?

A

PubSub pushes each message to a preconfigured HTTPS endpoint

44
Q

PubSub: What is a pull subscription?

A

Subscriber must ask PubSub for a message

45
Q

PubSub: How to get message ordering?

A

Include sequence information in the message
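
A minimal sketch, assuming the google-cloud-pubsub client library; project and topic names are placeholders. The publisher attaches a sequence-number attribute so the subscriber can reorder messages itself:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

events = [b"step-1", b"step-2", b"step-3"]
for seq, payload in enumerate(events):
    # Attributes are string key/value pairs; "seq" is a name chosen for this sketch.
    # The subscriber buffers messages and sorts them by this attribute.
    publisher.publish(topic_path, data=payload, seq=str(seq)).result()
```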

46
Q

PubSub: Subscriptions are auto-deleted after how many days of inactivity by default?

A

31

47
Q

Wide-column store

A

NoSQL database with rows and columns, where the names and format of columns can vary from row to row; essentially a 2D key-value store

48
Q

Pig

A

A scripting language that compiles into MapReduce jobs; runs on top of Hadoop

49
Q

Hive

A

A SQL-like data warehousing system and query language that runs on top of Hadoop

50
Q

Sqoop

A

Sqoop imports data from a relational database system or a mainframe into HDFS.

51
Q

Oozie

A

Workflow scheduler system to manage Apache Hadoop jobs

52
Q

Cassandra

A

Wide-column store based on ideas of BigTable. Has a Query Language

53
Q

HBase

A

Wide-column store based on ideas of BigTable. No Query Language

54
Q

Redis

A

Very fast in-memory data structure store

55
Q

Kafka

A

Pub/sub message queue

56
Q

Impala

A

Uses SQL to query data in HDFS or HBase; comparable to BigQuery or Spanner

57
Q

Stackdriver: Monitoring

A

Full-stack monitoring for Google Cloud Platform and Amazon Web Services

58
Q

Stackdriver: Logging

A

Real-time log management and analysis

59
Q

Stackdriver: Error Reporting

A

Identify and understand your application errors

60
Q

Stackdriver: Debugger

A

Investigate your code’s behavior in production

61
Q

Stackdriver: Trace

A

Find performance bottlenecks in production.

62
Q

Stackdriver: Profiler

A

Identify patterns of CPU, time, and memory consumption in production

63
Q

YARN

A

(Yet Another Resource Negotiator) The resource manager for Hadoop; allows arbitrary applications to be executed on a Hadoop cluster

64
Q

Giraph

A

Graph processing on Hadoop

65
Q

BigQueryML

A

Train ML models entirely within BQ
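
A sketch of what this looks like, assuming the google-cloud-bigquery client; the dataset, model, table, and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# BigQuery ML: the model is created and trained with a SQL statement,
# so the training data never leaves BigQuery.
client.query("""
CREATE OR REPLACE MODEL `my_dataset.churn_model`   -- placeholder names
OPTIONS (model_type = 'logistic_reg') AS
SELECT churned AS label, tenure_days, plan_type
FROM `my_dataset.customers`
""").result()

# Predictions are also just SQL.
rows = client.query("""
SELECT * FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                         (SELECT tenure_days, plan_type FROM `my_dataset.customers`))
""").result()
```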

66
Q

AutoML Tables

A

Beta: Automatically build and deploy ML models on structured data

67
Q

Cloud Inference API

A

Alpha: Run large-scale correlations over typed time-series datasets

68
Q

Recommendations AI

A

Beta: Build an end-to-end personalized recommendation system

69
Q

Cloud AutoML

A

Beta: Train custom ML models through a graphical interface, with minimal ML expertise required

70
Q

When to use synchronous speech recognition?

A

Audio files shorter than ~1 min

71
Q

When to use asynchronous speech recognition?

A

Audio files are long (longer than 1 min)
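
A sketch of both calls, assuming version 2.x of the google-cloud-speech Python client; file paths and the GCS URI are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(language_code="en-US")

# Short clip (< ~1 min): synchronous recognize() returns results directly.
short_audio = speech.RecognitionAudio(content=open("clip.flac", "rb").read())  # placeholder file
sync_response = client.recognize(config=config, audio=short_audio)

# Long recording: asynchronous long_running_recognize() returns an operation
# to poll; long files generally need to live in GCS.
long_audio = speech.RecognitionAudio(uri="gs://my-bucket/meeting.flac")  # placeholder URI
operation = client.long_running_recognize(config=config, audio=long_audio)
async_response = operation.result(timeout=600)
```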

72
Q

SSML

A

Speech Synthesis Markup Language. Has markup for emphasis, pitch, contour, volume, etc.

73
Q

NLU

A

Natural Language Understanding

74
Q

BigTable Minimum Nodes

A

3

75
Q

When should you consider HDDs for Bigtable?

A

Generally only when storing at least 10 TB of data and the data is not latency-sensitive

76
Q

How many clusters can you have in a Bigtable instance?

A

4

77
Q

Bigtable: recommended row size limit

A

100 MB

78
Q

Bigtable: recommended column value limit

A

10 MB

79
Q

Max size of single object in GCS

A

5 TB

80
Q

Cloud SQL: about how big should you go?

A

10 TB

81
Q

Size of Largest Datastore Entity

A

1 MiB

82
Q

Bigtable: Hard limit row size

A

256 MB

83
Q

Bigtable: max tables per instance

A

1,000 tables