Chapter Ten Flashcards by Roger Dunbar

What are the classifications of NoSQL databases?

Key-value stores
Document stores
Wide-column stores
Graph-oriented databases

These classifications help in understanding the various types of NoSQL databases and their use cases.

What is a wide-column store?

A NoSQL database that distributes data based on both key values and columns, using ‘column groups/families’.

Example: Apache Cassandra.

What is a graph-oriented database?

A database that maintains information about the relationships between data items. Nodes carry properties, and the connections (edges) between nodes can carry properties as well.

Example: Neo4j.
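
The node/edge/property idea can be sketched in a few lines of plain Python. This is only an illustration of the data model, not how Neo4j stores or queries data; the names (nodes, edges, neighbors, KNOWS, WORKS_AT) are hypothetical.

```python
# Nodes and edges both carry properties (all names here are made up).
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
    "acme": {"label": "Company"},
}

# Each edge: (source node, relationship type, target node, edge properties)
edges = [
    ("alice", "KNOWS", "bob", {"since": 2019}),
    ("bob", "WORKS_AT", "acme", {"role": "engineer"}),
]

def neighbors(node_id, rel_type):
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst, _props in edges
            if src == node_id and rel == rel_type]

# Two-hop traversal: where do the people Alice knows work?
employers_of_friends = [
    company
    for friend in neighbors("alice", "KNOWS")
    for company in neighbors(friend, "WORKS_AT")
]
```

A graph database optimizes exactly this kind of hop-by-hop traversal, which would otherwise require repeated joins in a relational design.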

What does NoSQL stand for?

Not Only SQL.

A category of data storage and retrieval technologies not based on the relational model.

What are the main characteristics of NoSQL databases?

Performance: Variable
Scalability: High
Flexibility: High
Complexity: Low
Functionality: Variable

These characteristics differentiate NoSQL databases from traditional relational databases.

What are the Five V’s of Big Data?

Volume
Variety
Velocity
Veracity
Value

These characteristics define the nature and challenges of Big Data.

What is meant by ‘Schema on Read’?

A data model determined later, depending on how the data will be used, allowing flexibility in data access.

This contrasts with ‘Schema on Write’, which uses a preexisting data model.
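
A minimal sketch of the idea, assuming raw records are kept as JSON strings: nothing is validated when the data is written, and a structure is imposed only by the function that reads it. The record fields and the read_temperatures helper are hypothetical.

```python
import json

# Raw records are stored as-is; no structure is enforced at write time.
raw_store = [
    '{"sensor": "t1", "temp_c": "21.5", "site": "north"}',
    '{"sensor": "t2", "temp_c": "19.0"}',
    '{"sensor": "t3", "humidity": 40}',
]

def read_temperatures(lines):
    """Apply a 'temperature' schema only at the moment the data is read."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:  # records that do not fit this view are skipped
            rows.append((record["sensor"], float(record["temp_c"])))
    return rows

temperatures = read_temperatures(raw_store)
```

A different reader could apply a humidity schema to the same raw store without any change to how the data was written.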

What is a Data Lake?

A large integrated repository for internal and external data that does not follow a predefined schema.

It allows flexible access and captures everything.

What is Hadoop?

An open-source framework that implements MapReduce and is designed for storing and processing large files in a distributed environment.

Hadoop is widely used for big data management.

What is MapReduce?

An algorithm for massive parallel processing of various types of computing tasks, dividing a computing task so multiple nodes can work on it simultaneously.

It consists of two stages: Map and Reduce.
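
The two stages can be illustrated with a single-process word count. This is a sketch of the programming model only, not Hadoop's API: the grouping step between Map and Reduce (usually called the shuffle) is written out explicitly, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Each chunk could be mapped on a different node; here they run in a loop.
chunks = ["big data big", "data lake", "big lake"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each chunk is mapped independently and each key is reduced independently, both stages can be spread across many nodes.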

What is the function of the Hadoop Distributed File System (HDFS)?

A file system designed for managing large files in a distributed environment, breaking data into blocks and distributing them across nodes.
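
A toy sketch of block splitting and placement. It assumes a tiny block size and simple round-robin placement purely for illustration; real HDFS uses much larger blocks and a rack-aware placement policy, and the helper names here are hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
```

With several copies of every block, losing one node leaves at least two other copies of each block it held.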

HDFS allows efficient data storage and processing.

What is the role of Pig in the Hadoop ecosystem?

A tool that integrates a scripting language and an execution environment to simplify the use of MapReduce.

It is useful for developing data processing tasks.

What is Hive in the context of Hadoop?

A tool that supports management and querying of large datasets using a SQL-like language called HiveQL.

It is useful for ETL tasks.

What is the primary difference between relational and NoSQL databases?

Relational databases rely on a predefined schema, while NoSQL databases support schema on read and greater flexibility.

This affects how data is stored and retrieved.

Fill in the blank: A document-store database uses _______ for storage.

A BSON-based storage format (as in MongoDB).

BSON stands for Binary JSON.

True or False: NoSQL databases are typically ACID compliant.

False.

NoSQL databases are often not ACID compliant, which distinguishes them from traditional relational databases.

What does the acronym BASE in NoSQL refer to?

Basically Available, Soft state, Eventually consistent.

This is a model used in many NoSQL databases.

What is the purpose of the _id property in MongoDB?

To uniquely identify a document.

It serves as a primary key in the database.

What does the term ‘move computation to the data’ refer to in Hadoop?

The principle of processing data where it is stored rather than moving it to a centralized location for processing.

This approach enhances efficiency in data processing.

With HDFS it is less expensive to move the execution of computation to data than to move the:

* data to computation. *
data to hardware.
data to systems analysis.
data to processes.

Hive uses ________ to query data.

Honeyquery
* HiveQL *
BeesNest
SQL

Structured Query Language (SQL) is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful information.

False

JSON is commonly used in conjunction with the ‘document store’ NoSQL database model.

True

NoSQL includes data storage and retrieval:

based on normalized tables.
* not based on the relational model. *
based on the relational model.
not based on data.

Hive creates MapReduce jobs and executes them on a Hadoop Cluster.

True

An organization that requires a sole focus on performance with the ability for keys to include strings, hashes, lists, and sorted sets would select ________ database management system.

Access
* Redis *
Neo4j
Excel Spreadsheet

The NoSQL model that is specifically designed to maintain information regarding the relationships (often real-world instances of entities) between data items is called a:

key-value store.
* graph-oriented database. *
wide-column store.
document store.

It is true that in an HDFS cluster the NameNode is the:

* single master server. *
business intelligence.
language library.
large number of slaves.

Data in MongoDB is represented in:

* BSON. *
SON.
CSON.
JSON.

Collect everything is a characteristic of a data lake.

True

Big data requires effectively processing:

a single data type (text).
a single data type (numeric).
* many data types. *
two data types (text and numeric).

________ includes concern about data quality issues.

Velocity
* Veracity *
Variety
Vigilant

The primary use of Pig is to:

* transform raw data into a format that is useful for analysis. *
create data warehouses.
query large databases.
create large databases.

MapReduce is an algorithm for massive parallel processing utilized by Hadoop.

True

A business owner who needs carefully normalized tables would likely need a relational database instead of a NoSQL database.

True

The Hadoop framework consists of the ________ algorithm to solve large scale problems.

MapSystem
MapComponent
* MapReduce *
MapCluster

Neo4j is a wide-column NoSQL database management system developed by Oracle.

False

Hive is a(n) ________ data warehouse software.

Oracle
Macintosh
Microsoft
* Apache *

NoSQL focuses on avoidance of replication and minimizing storage space.

False

Why do modern data needs extend beyond relational DBs & data warehouses?

They must handle unstructured/semi-structured data, petabyte scale, real-time ingestion, and elastic horizontal growth.

Name two technical limits of traditional relational databases in a big-data world.

(1) Vertical scaling costs skyrocket; (2) Schema-on-write forces ETL & slows agile development.

Give an example use case poorly served by a relational DB but ideal for NoSQL.

IoT sensor streams arriving at millions of events per second.

Main NoSQL categories (list).

Key-Value
Document
Column-Family (Wide-Column)
Graph

Key-Value store—core idea?

Data = arbitrary 'value' blob referenced by a unique key.

Document store—core idea?

JSON/BSON documents with flexible nested fields.

Column-Family (Wide-Column) store—core idea?

Data stored by column families rather than rows for large sparse tables.
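
One way to picture the layout is as nested maps: row key, then column family, then individual columns. The sketch below is a simplified in-memory model with hypothetical names (table, read_family), not the storage format of any particular product.

```python
# row key -> column family -> {column name: value}
table = {
    "user:1": {
        "profile": {"name": "Ana"},
        "activity": {"last_login": "2024-01-01"},
    },
    "user:2": {
        # Sparse: this row has an extra column and no 'activity' family at all.
        "profile": {"name": "Ben", "email": "ben@example.com"},
    },
}

def read_family(row_key, family):
    """Fetch only one column family of a row; missing data costs nothing."""
    return table.get(row_key, {}).get(family, {})

ana_profile = read_family("user:1", "profile")
ben_activity = read_family("user:2", "activity")
```

Rows do not need the same columns, which is why the model suits large, sparse tables.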

Graph store—core idea?

Nodes + edges with properties; optimized for traversals.

What kind of NoSQL DB is MongoDB?

Document store.

Which on-disk format does MongoDB use?

BSON (Binary JSON).

Two features of MongoDB that illustrate NoSQL flexibility.

(1) Dynamic schema; (2) Automatic sharding for horizontal scaling.

ACID vs BASE—relational or NoSQL?

ACID = Relational transactions; BASE = many NoSQL systems.

When would you choose a relational DB over NoSQL?

Need strong transactions, complex joins, strict integrity.

When would you choose NoSQL over relational?

High write throughput, variable schema, massive scale.

Define 'Big Data' in one sentence.

Data sets whose volume, velocity, and variety exceed the capacity of conventional systems.

List the classic 3 Vs of big data.

Volume
Velocity
Variety

Two additional Vs often mentioned.

Veracity
Value

Core storage layer of Hadoop.

HDFS (Hadoop Distributed File System).

Hadoop’s resource manager.

YARN (Yet Another Resource Negotiator).

Hadoop’s original compute engine.

MapReduce.

What are 'Hadoop Common' libraries?

Shared utilities & Java APIs that support all Hadoop modules.

Name two ecosystem tools that often run on Hadoop.

Hive
Pig

Purpose of MapReduce’s 'Map' phase.

Process input splits in parallel, emitting intermediate (key, value) pairs.

Purpose of MapReduce’s 'Reduce' phase.

Aggregate or merge all values for each key.

Why is MapReduce necessary for big data?

It abstracts fault-tolerant parallel processing across thousands of nodes.

Explain 'data locality' in MapReduce.

Code is sent to the HDFS node holding the data block.

What happens if a Map task fails?

The task is re-scheduled on another node (under YARN, the MapReduce ApplicationMaster requests a new container for the retry).

HDFS replication default factor & reason.

Default 3 copies; ensures fault tolerance.

Column-family DBs achieve high write throughput by…

Using log-structured storage with sequential disk writes.
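
A heavily simplified sketch of the append-only idea: every write is added to the end of a log, and an index remembers where the latest version of each key lives. Real systems add memtables, sorted on-disk files, and compaction; the AppendOnlyStore class is hypothetical and uses a Python list in place of a disk file.

```python
class AppendOnlyStore:
    """Writes only ever append; old versions are never updated in place."""

    def __init__(self):
        self.log = []    # stands in for a sequentially written file
        self.index = {}  # key -> position of the newest entry for that key

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))  # sequential append, no seek or overwrite

    def get(self, key):
        position = self.index.get(key)
        return None if position is None else self.log[position][1]

store = AppendOnlyStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)  # an update is just another append
```

Appending avoids the random disk seeks that in-place updates require, which is where the write throughput comes from.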

Graph query language example.

Cypher (Neo4j) or Gremlin (Apache TinkerPop).

CAP theorem trade-offs—how do NoSQL DBs differ from RDBMS?

Many NoSQL systems sacrifice strong consistency for availability and partition tolerance.

Describe 'sharding.'

Horizontal partitioning of data across many nodes.
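
A minimal sketch of hash-based sharding, one common way to decide which node owns a key. The shard count, key format, and helper names are assumptions for illustration; production systems often use range-based or consistent-hashing schemes instead.

```python
import hashlib

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}  # each dict stands in for a node

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard deterministically using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def put(key, value):
    shards[shard_for(key, NUM_SHARDS)][key] = value

def get(key):
    return shards[shard_for(key, NUM_SHARDS)].get(key)

for user_id in range(100):
    put(f"user:{user_id}", {"id": user_id})
```

Every read and write touches only the shard that owns the key, so capacity grows by adding nodes rather than by buying a bigger server.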

Which NoSQL category best fits time-series metrics?

Often a specialized subset of Key-Value or Wide-Column.

Document DB vs Key-Value store—key distinction?

Document DB allows querying inside the value.
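
The distinction can be sketched with two in-memory stand-ins: a key-value store that treats the value as an opaque blob, and a document collection that can filter on a nested field. The find and get_path helpers are hypothetical and only mimic the idea of querying inside a document.

```python
import json

# Key-value store: the value is an opaque blob, reachable only by its key.
kv_store = {"user:1": json.dumps({"name": "Ana", "city": "Lima"}).encode()}
blob = kv_store["user:1"]  # the store itself cannot filter by city

# Document store: the database understands the structure of each document.
documents = [
    {"_id": 1, "name": "Ana", "address": {"city": "Lima"}},
    {"_id": 2, "name": "Ben", "address": {"city": "Oslo"}},
]

def get_path(doc, dotted_path):
    """Walk a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def find(docs, dotted_path, expected):
    """Return documents whose nested field equals the expected value."""
    return [d for d in docs if get_path(d, dotted_path) == expected]

oslo_names = [d["name"] for d in find(documents, "address.city", "Oslo")]
```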

What does BASE’s 'soft-state' imply?

System state may change over time even without input.

In Hadoop, what service copies data blocks between DataNodes?

DataNode daemons themselves.

HDFS NameNode single point of failure mitigation.

High-Availability mode with active & standby NameNodes.

Spark vs MapReduce (one advantage).

Spark keeps intermediate data in memory.

Hadoop streaming—what is it?

Allows MapReduce programs written in any language.

ACID transactions in MongoDB—supported?

Yes, multi-document ACID transactions since v4.0.

Eventual consistency—give a real-world analogy.

Family group chat: messages may appear in slightly different order.
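
The same behavior can be sketched with two in-memory replicas, assuming writes are applied to one replica immediately and copied to the other later. The write and sync functions are hypothetical; real systems replicate asynchronously over a network.

```python
primary, secondary = {}, {}
pending = []  # updates accepted by the primary but not yet replicated

def write(key, value):
    """Accept the write right away; replication happens later."""
    primary[key] = value
    pending.append((key, value))

def sync():
    """Apply queued updates to the secondary, in the order received."""
    while pending:
        key, value = pending.pop(0)
        secondary[key] = value

write("status", "online")
stale_read = secondary.get("status")  # replica has not caught up yet
sync()
fresh_read = secondary.get("status")  # replicas now agree
```

Between the write and the sync, readers of the two replicas can see different answers; once updates stop arriving and replication completes, they converge.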

(80 cards)