UNDERSTANDING BIG DATA AND NOSQL Flashcards
(83 cards)
Generally refers to either data that exceed the ability of database management tools used to capture, store, and analyze data (McKinsey, 2011), or to next-generation technologies and architectures designed to extract value from large-scale data at low cost and support the rapid collection, discovery, and analysis of data (IDC, 2011).
Big data
The characteristics of big data can be explained by the three elements (3V)
volume, velocity, and variety
Characteristics of big data
Refers to a volume of data of tens of terabytes, petabytes, or more, thus exceeding the processing limit of commonly used software when collecting, storing, and processing data.
Volume
Characteristics of big data
* ‘Big data’ is created very quickly
* Data collection, processing, storage, and analysis need to be performed in real time
Velocity
Characteristics of big data
* Diverse kinds of data
* Big data can be classified into structured, semi-structured, and unstructured data
Variety
6V of big data
- Volume
- Variety
- Velocity
- Veracity
- Visualization
- Value
Data stored in a fixed field, such as the rows and columns of a relational table.
Structured data
Data not stored in a fixed field, but which contain metadata or schema, such as XML or HTML.
Semi-structured data
- Data not stored in a fixed field
- Document, picture, video, and audio data, etc.
Unstructured data
A technology that can collect data from all devices and systems
Collection
Crawling, ETL, CEP, etc
A technology that can store and process collected large-scale data using a distributed processing system.
Storage/processing
Distributed file system, NoSQL, MapReduce processing
A method of analysis that can assist companies and the public with using big data in business and daily life.
Analysis
Natural language processing, machine learning, data mining algorithms
A technology that can visualize analyzed results effectively.
Visualization
R, graphs, drawing, etc
Web ___ copies the entire web page after collecting the URLs to be collected, or collects data with a specific tag only after analyzing the HTML code.
crawling
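The tag-based collection described on this card can be sketched with only the Python standard library. A minimal, self-contained example (the HTML is inlined rather than fetched over the network, and the choice of `<h2>` as the target tag is hypothetical):

```python
# Sketch of tag-based web crawling: after analyzing the HTML code,
# collect data with a specific tag only (here, <h2> headings).
# A real crawler would first fetch pages for the collected URLs.
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the text inside every <h2> tag."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())

page = "<html><body><h2>Big Data</h2><p>intro</p><h2>NoSQL</h2></body></html>"
parser = TitleCollector()
parser.feed(page)
print(parser.titles)  # ['Big Data', 'NoSQL']
```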
Collects data using the SQL function of the DBMS.
Collection using the DBMS
Oracle, MariaDB, MS SQL, Tibero, etc.
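Collection via the SQL function of a DBMS can be sketched with `sqlite3`, standing in for the products named above (Oracle, MariaDB, MS SQL, Tibero); the table and rows are hypothetical:

```python
# Sketch of collection using the DBMS: data is gathered by running
# SQL queries against the source system. sqlite3 is used here only
# so the example is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_log (device TEXT, value REAL)")
conn.executemany("INSERT INTO sensor_log VALUES (?, ?)",
                 [("a", 1.5), ("b", 2.0), ("a", 3.5)])

# The "collection" step is simply a SQL query.
rows = conn.execute(
    "SELECT device, SUM(value) FROM sensor_log "
    "GROUP BY device ORDER BY device"
).fetchall()
print(rows)  # [('a', 5.0), ('b', 2.0)]
```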
Collects data when a certain condition is met
Collection using sensors
CQL, Kafka
Collects data using port that can transfer files.
FTP collection
Collects data by reading HTML tags
HTTP collection
Scraper
A file system that allows access to files on multiple host computers which are shared over a computer network.
Distributed File System (DFS)
GFS (Google File System), HDFS (Hadoop Distributed File System), etc.
A new type of data storage/retrieval system that uses a less restrictive consistency model (BASE characteristics) than the traditional relational database.
NoSQL (Not Only SQL)
HBase, Cassandra, MongoDB, Couchbase, Redis, Neo4j, etc.
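The schemaless storage idea behind document-oriented NoSQL systems such as MongoDB can be sketched with a plain dictionary. This is only an illustration of the data model (the collection name and documents are hypothetical); real systems add distribution, replication, and BASE-style eventual consistency:

```python
# Sketch of a document store: records are kept under a key and need
# no fixed schema, so different documents can carry different fields.
users = {}  # in-memory "collection": key -> document

users["u1"] = {"name": "Kim", "age": 30}
users["u2"] = {"name": "Lee", "tags": ["admin"]}  # different fields

# Retrieval is by key, not by SQL over fixed columns.
print(users["u2"].get("tags"))  # ['admin']
```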
A technology that processes a large amount of data in a distributed parallel computing environment.
Distributed parallel processing
MapReduce
A file system architecture for storing and processing large-scale and unstructured data in a distributed environment.
Distributed File System (DFS)
- A programming model designed for the parallel distributed processing of big data using inexpensive machines. This model can process large amounts of data in parallel using a program composed of a map procedure and a reduce procedure.
- Allows the analysis of large-scale data by processing data that has been distributed and stored in multiple machines.
MapReduce
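The model described above can be sketched as a single-machine word count: a map step emits (key, value) pairs, a shuffle groups pairs by key, and a reduce step aggregates each group. Real frameworks such as Hadoop run these steps in parallel across many machines; the input lines here are hypothetical:

```python
# Sketch of the MapReduce programming model (word count).
from itertools import groupby
from operator import itemgetter

def map_step(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield (word, 1)

def reduce_step(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))

lines = ["big data big", "data value"]
pairs = [p for line in lines for p in map_step(line)]
pairs.sort(key=itemgetter(0))  # shuffle: bring equal keys together

result = dict(
    reduce_step(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result)  # {'big': 2, 'data': 2, 'value': 1}
```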
This technique provides insights by effectively conveying numbers, statistics, and valuable meanings; by classifying data for the user’s easy understanding; and by analyzing large-scale data.
Visualization technology