Cloud Computing Concepts Flashcards

1
Q

What is Big Data?

A

Big Data refers to data sets too large or complex for traditional data-processing tools to handle, commonly characterized by volume, velocity, and variety.
2
Q

Types of Data

A
  1. Structured Data (tabular format)
  2. Semi-Structured Data (JSON, XML, HTML)
  3. Unstructured Data (images, audio, video, free text)
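The difference between the first two types can be sketched with Python's standard library, using CSV as a stand-in for structured (fixed schema, tabular) data and JSON for semi-structured (self-describing, possibly nested) data. The sample values are made up for illustration:

```python
import csv
import io
import json

# Structured: tabular rows that all share one fixed schema (id, name).
table = io.StringIO("id,name\n1,Alice\n2,Bob\n")
rows = list(csv.DictReader(table))

# Semi-structured: self-describing and nested; fields can vary per record.
record = json.loads('{"id": 1, "name": "Alice", "tags": ["admin", "eu"]}')

print(rows[0]["name"])    # Alice
print(record["tags"])     # ['admin', 'eu']
```

Unstructured data (an image or audio file) has neither a row schema nor self-describing field names, which is why it needs object storage such as Blob Storage rather than a relational database.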
3
Q

Azure Storage Services (90+)

A
  1. Azure Blob Storage (all kinds of data)
  2. Azure Data Lake Storage Gen2 (unlimited; all kinds of data)
  3. Azure SQL DB (structured data only)
  4. Azure Cosmos DB (NoSQL; structured, semi-structured)
  5. Azure Elastic Pool (group of DBs; structured)
  6. Azure Synapse Analytics (data warehouse; structured)
4
Q

ADB

A

ADB (Azure Databricks) is built for ETL workloads and supports ELT as well.

5
Q

ADB Components

A

Resource Group: like a folder; the ADB workspace resides in it
–> Workspace: provides an interface for all users to work collaboratively in a single session
————> Cluster: a group of machines working as a single machine

6
Q

ADB

A

Azure Databricks is a cloud service that provides a scalable platform for data analytics using Apache Spark.

7
Q

Cluster

A

A group of machines working as a single machine (max 100,000 worker nodes).
Types:
1. All-Purpose Compute
2. Job Compute
3. SQL Warehouse
4. Pools

8
Q

All-Purpose Cluster

A

All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster. It can be created with or without a pool.

9
Q

Job Cluster

A

It is created automatically when the job runs and terminates when the job ends, without human intervention, reducing resource usage and cost.

10
Q

Workspace

A

A collaborative environment that allows multiple users to work in a single session.

11
Q

Magic Commands

A

Magic commands switch the language of a notebook cell, overriding the notebook's default language:
1. %python, %py
2. %scala
3. %r
4. %sql

12
Q

Dynamic data masking

A

Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It is a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.
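The idea can be sketched in plain Python (illustrative only; in Azure SQL the masking rules are declared on database columns and applied by the engine in query results, not in application code, and the helper names below are made up):

```python
def mask_email(email: str) -> str:
    """Hide the local part of an email except its first letter,
    and replace the domain with a generic one."""
    local, _, _ = email.partition("@")
    return f"{local[:1]}XXX@XXXX.com"

def mask_card(number: str) -> str:
    """Partial masking: expose only the last four digits."""
    return "XXXX-XXXX-XXXX-" + number[-4:]

print(mask_email("alice@contoso.com"))   # aXXX@XXXX.com
print(mask_card("4111111111111111"))     # XXXX-XXXX-XXXX-1111
```

The key property mirrored here is that the stored value is never altered; only the representation returned to an unprivileged reader is masked.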

13
Q

Apache Spark Architecture

A

It contains 3 major components:
1. Master Node (Driver Node)
2. Cluster Manager
3. Worker Node

14
Q

Workspace Assets

A

In the ADB workspace, we can manage different assets:
* Cluster
* Notebook
* Jobs
* Libraries
* Folders
* Models
* Experiments

Workspace: a repository with a folder-like structure that contains all the Azure Databricks assets.

15
Q

On-Demand Instances

A

On-demand instances in Azure are like renting a computer whenever you need it, without any long-term commitment. You pay for the time you use it, and you can start or stop it whenever you want. These instances are flexible and can be scaled up or down based on your needs.

16
Q

Spot Instances

A

Azure’s Spot Virtual Machines offer discounted pricing by using spare capacity in Azure’s data centers. However, if there’s high demand for this capacity from paying customers or for Azure’s own needs, Azure can reclaim these Spot VMs.

When Azure reclaims a Spot VM:

* Eviction Notice: Azure gives a 30-second eviction notice to the Spot VM.
* Deallocation: The VM is gracefully deallocated, giving you a short time window to save work or perform any necessary shutdown procedures.

17
Q

What is the purpose of Spark Context?

A

Spark Context serves as the entry point to Spark, managing connections to a Spark cluster and coordinating job execution.

18
Q

In which Spark version was Spark Session introduced?

A

Spark Session was introduced in Spark 2.0 as an evolution of Spark Context and SQL Context.

19
Q

What functionalities did Spark Context primarily handle?

A

Spark Context was mainly responsible for RDD-based operations, managing resources, and interacting with the Spark cluster.

20
Q

What is the broader scope of Spark Session in comparison to Spark Context?

A

Spark Session covers a wider scope by providing a unified entry point, supporting DataFrames, Datasets, and simplifying interactions with structured data.

21
Q

How does Spark Session differ in its API evolution from Spark Context?

A

While Spark Context was the primary entry point in older Spark versions, Spark Session is the recommended entry point in newer versions, offering a higher-level API and unifying different context functionalities.

22
Q

wget

A

Similar to cURL, wget retrieves content from web servers but with a focus on downloading files. It’s capable of recursive downloads, supports resuming interrupted downloads, and works well for fetching entire websites or specific files.

23
Q

What is the primary API provided by Spark for working with structured data?

A

The primary API for working with structured data in Spark is the DataFrame API.

24
Q

Which Spark API is used for working with distributed collections of data?

A

The Resilient Distributed Dataset (RDD) API is used for working with distributed collections of data in Spark.

25
Q

Which API in Spark provides higher-level abstractions and optimizations for working with structured data?

A

DataFrames and Datasets APIs provide higher-level abstractions and optimizations for working with structured data compared to RDDs.

26
Q

What is the main programming language used with Spark APIs?

A

Spark supports APIs in multiple languages, but the primary language is Scala, followed by Java, Python, and R.

27
Q

Which Spark API provides SQL-like querying capabilities for working with data?

A

The DataFrame API offers SQL-like querying capabilities, allowing users to run SQL queries programmatically on distributed data.
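Since a Spark cluster isn't assumed here, the same idea (running SQL programmatically over in-memory data) can be sketched with Python's stdlib sqlite3; in Spark you would instead register a DataFrame as a temp view and query it with spark.sql(). The table and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100), ("EU", 50), ("US", 70)])

# SQL executed programmatically over the data, analogous to spark.sql()
# over a temp view created from a DataFrame.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 150), ('US', 70)]
```

The difference in Spark is that the query is planned by Catalyst and executed in parallel across the cluster rather than by a single local engine.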

28
Q

What are the characteristics of RDDs in Spark?

A

RDDs are immutable, fault-tolerant, and lazily evaluated. They can be rebuilt from lineage in case of failure.

29
Q

How can you create an RDD in Spark?

A

RDDs can be created by parallelizing an existing collection in memory, by loading data from external storage (like HDFS, S3), or by transforming an existing RDD using operations like map, filter, etc.

30
Q

What operations can you perform on RDDs in Spark?

A

RDDs support two types of operations: transformations (like map, filter, reduceByKey) and actions (like count, collect, saveAsTextFile) enabling data transformation and computation.
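The lazy-transformation / eager-action split can be mimicked with a Python generator (a rough analogy, not Spark itself): the generator expression plays the role of a chained filter/map transformation, and list() plays the role of an action like collect.

```python
data = range(1, 6)

evaluated = []
def trace(x):
    # Record which elements were actually processed.
    evaluated.append(x)
    return x * 10

# "Transformation": builds a lazy pipeline; no element is processed yet.
pipeline = (trace(x) for x in data if x % 2 == 0)
assert evaluated == []   # still empty: nothing has run

# "Action": forces evaluation of the whole pipeline.
result = list(pipeline)
print(result)      # [20, 40]
print(evaluated)   # [2, 4]
```

As in Spark, work happens only when an action consumes the pipeline, which is what lets the engine plan and optimize the full chain of transformations first.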

31
Q

How does RDD lineage contribute to fault tolerance?

A

RDD lineage tracks the sequence of transformations, enabling Spark to recompute lost partitions by reapplying these transformations in case of node failures.
