UNDERSTANDING BIG DATA AND NOSQL Flashcards

(83 cards)

1
Q

Generally refers to either data that exceed the ability of database management tools used to capture, store, and analyze data (McKinsey, 2011), or to next-generation technologies and architectures designed to extract value from large-scale data at low cost and support the rapid collection, discovery, and analysis of data (IDC, 2011).

A

Big data

2
Q

The characteristics of big data can be explained by the three elements (3V)

A

volume, velocity, and variety

3
Q

Characteristics of big data
Refers to a volume of data of tens of terabytes, petabytes, or more, thus exceeding the processing limit of commonly used software when collecting, storing, and processing data.

A

Volume

4
Q

Characteristics of big data
* ‘Big data’ is created very quickly
* Data collection, processing, storage, and analysis need to be performed in real time

A

Velocity

5
Q

Characteristics of big data
* Diverse kinds of data
* Big data can be classified into structured, semi-structured, and unstructured data

A

Variety

6
Q

6V of big data

A
  1. Volume
  2. Variety
  3. Velocity
  4. Veracity
  5. Visualization
  6. Value
7
Q

Data to be stored in a fixed field

A

Structured data

8
Q

Data not stored in a fixed field, but which contain metadata or schema, such as XML or HTML.

A

Semi-structured data

9
Q
  • Data not to be stored in a fixed field
  • Document, picture, video, and audio data, etc.
A

Unstructured data

10
Q

A technology that can collect data from all devices and systems

A

Collection
Crawling, ETL, CEP, etc

11
Q

A technology that can store and process collected large-scale data using a distributed processing system.

A

Storage/processing
Distributed file system, NoSQL, MapReduce processing

12
Q

A method of analysis that can assist companies and the public with using big data in business and daily life.

A

Analysis
Natural language processing, machine learning, data mining algorithms

13
Q

A technology that can visualize analyzed results effectively.

A

Visualization
R, graphs, drawing, etc

14
Q

Web ___ either copies entire web pages after gathering the target URLs, or analyzes the HTML code and collects only the data inside a specific tag.

A

crawling
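
The second mode described above (collecting only the data inside a specific tag) can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not a full crawler: it parses an HTML string that is assumed to have been downloaded already, and the names `TagTextCollector` and `collect_tag_text` are hypothetical helpers introduced here.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.texts = []     # one entry per matching opening tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside the tag of interest.
        if self.depth > 0:
            self.texts[-1] += data


def collect_tag_text(html, tag):
    """Return the stripped text of each <tag>...</tag> found in html."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]
```

For example, `collect_tag_text("<h1>Title</h1><p>one</p><p>two</p>", "p")` keeps the two paragraph texts and skips the heading.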

15
Q

Collects data using the SQL function of the DBMS.

A

Collection using the DBMS
Oracle, MariaDB, MS SQL, Tibero, etc.

16
Q

Collects data when a certain condition is met

A

Collection using sensors
CQL, Kafka

17
Q

Collects data using a port that can transfer files.

A

FTP collection

18
Q

Collects data by reading HTML tags

A

HTTP collection
Scraper

19
Q

A file system that allows access to files on multiple host computers which are shared over a computer network.

A

Distributed File System (DFS)
GFS (Google File System), HDFS (Hadoop Distributed File System), etc.

20
Q

A new type of data storage/retrieval system that uses a less restrictive consistency model (BASE characteristics) than the traditional relational database.

A

NoSQL (Not Only SQL)
HBase, Cassandra, MongoDB, Couchbase, Redis, Neo4j, etc.

21
Q

A technology that processes a large amount of data in a distributed parallel computing environment.

A

Distributed parallel processing
MapReduce

22
Q

A file system architecture for storing and processing large-scale and unstructured data in a distributed environment.

A

Distributed File System (DFS)

23
Q
  • A programming model designed for the parallel distributed processing of big data using inexpensive machines. This model can process large amounts of data in parallel using a program composed of a map procedure and a reduce procedure.
  • Allows the analysis of large-scale data by processing data that has been distributed and stored in multiple machines.
A

MapReduce
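
The map and reduce procedures can be illustrated with the classic word-count example. The sketch below runs in a single process and only mimics the three stages (map, shuffle, reduce); a real framework such as Hadoop would distribute each stage across machines. The function names are hypothetical and chosen for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only, both stages could run in parallel on different machines.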

24
Q

This technique provides insights by analyzing large-scale data, classifying it so that the user can understand it easily, and effectively conveying numbers, statistics, and valuable meanings.

A

Visualization technology

25
Big data visualization methods (5)
1. Time visualization
2. Distribution visualization
3. Relationship visualization
4. Comparison visualization
5. Spatial visualization
26
* Shows the passage of time.
* Continuous, segmented
Time Visualization
27
* Shows the relationship between the whole and the part, and the ratio
* Pie, treemap
Distribution visualization
28
* Shows the relationship between two or more variables
* Bubble chart, histogram
Relationship visualization
29
* Shows spaces and shadows intuitively
* Heatmap, Stars
Comparison visualization
30
* Shows information by mapping it on the map.
* Including POI data
Spatial visualization
31
Refers to the process of “discovering meaningful patterns from big data.”
Big data analytics
32
**Classification of big data analytics** The primary purpose is to find patterns that describe the given data.
**Descriptive modeling** ... *Applied technique* Association rule, clustering, database, segmentation, visualization, etc
33
**Classification of big data analytics** A model is created based on the given data, and is used to predict new input data.
**Predictive modeling** ... *Applied technique* Classification, Regression, time series analysis, Neural network, SVM
34
**Classification of big data analytics** When the target is determined
**Supervised data** ... *Applied technique* Decision Tree, Neural network, Case-based reasoning
35
**Classification of big data analytics** When there is no target. The correlation or similarity between data is analyzed with the focus on input variables
**Unsupervised data** .... *Applied technique* Association rule discovery, market basket analysis, K-means clustering
36
Main methods of big data analysis
1. Logistic regression
2. Decision tree analysis
3. Neural network analysis
4. Text mining
5. SNA (Social Network Analysis)
6. Opinion mining
7. Natural Language Processing (NLP)
37
A statistical technique used to predict the possibility of an event’s occurrence (probability of occurrence) using a linear combination of independent variables.
Logistic regression
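
As a small illustration of the definition above, the probability is obtained by passing the linear combination of the independent variables through the logistic (sigmoid) function. The sketch assumes the weights and bias have already been fitted; `predict_probability` is a hypothetical name used only here.

```python
import math


def predict_probability(weights, bias, features):
    """Return P(event) = sigmoid(bias + sum(w_i * x_i))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Use the algebraically equivalent form for negative z to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

A linear combination of zero gives a probability of exactly 0.5; larger positive values push the result toward 1 and larger negative values toward 0.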
38
A method of quantitative analysis that classifies a group of interest into several subgroups or performs prediction by drawing a decision tree chart.
Decision tree analysis
39
A method of analysis that uses the human brain itself as the model: it handles a problem with parallel, distributed, and probabilistic calculation based on the idea of a network of nerve cells, rather than processing digital information with a deterministic binary computational model.
Neural network analysis
40
A technology that extracts and processes useful information by applying natural language processing technology and document processing technology to unstructured/semi-structured data.
Text mining
41
An analysis methodology that analyzes and visualizes the relationship between objects - such as people, groups, organizations, computers and data - and the characteristics and structure of the network.
SNA (Social Network Analysis)
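
One simple measure used in this kind of analysis is degree centrality: how many of the other objects each object is directly connected to. The source does not name a specific metric, so this is only one illustrative choice. The sketch assumes an undirected network given as a list of pairs with no self-loops; `degree_centrality` is a hypothetical helper.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: share of the other nodes it is directly connected to}."""
    neighbors = defaultdict(set)
    for a, b in edges:
        # Undirected network: record the connection in both directions.
        neighbors[a].add(b)
        neighbors[b].add(a)
    node_count = len(neighbors)
    if node_count < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (node_count - 1) for node, adj in neighbors.items()}
```

In a three-person network where one person knows both others, that person scores 1.0 and the other two score 0.5.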
42
* A technology that quickly analyzes the information the user wants and intelligently infers meaningful information from a large number of unstructured reviews such as SNS posts and replies.
* It is used effectively for corporate marketing policies or public opinion analysis by extracting the hot topics of the social network service and analyzing the flow in real time.
Opinion mining
43
* An artificial intelligence technology that understands, creates, and analyzes human language using computers.
* The process of understanding natural language by analyzing human language mechanically to convert it into a form that can be understood by computers.
Natural Language Processing (NLP)
44
* A ____ is an expert who can collect, organize, investigate, analyze, and visualize data.
* The ____ provides information necessary for corporate/organizational decision-making by collecting, analyzing, and discovering the value of data, using various platform foundations and analysis infrastructures related to big data.
data scientist
45
Capabilities of the data scientist
**Management**
* Business
* Data management
* Data analysis
* Change management

**Technology capability**
* Understanding statistical analysis tools
* Programming language
* RDBMS technology
* Distributed computing
* Mathematical knowledge
46
Provides other means of processing than the tabular relations used in relational databases.
NoSQL database
47
A non-relational distributed data repository that can be expanded horizontally through data replication and distributed storage on multiple servers, with a focus on write speed, for processing unstructured and ultra-high-capacity data.
NoSQL database
48
**CHARACTERISTICS OF NOSQL** Provides a loose data structure that allows data to be processed at the petabyte level.
Processing of large-scale data
49
**CHARACTERISTICS OF NOSQL**
- Saves data relatively freely without a predefined schema.
- Saves data in a simplified form such as the key-value, graph, and document structure.
Use of flexible schemas
50
**CHARACTERISTICS OF NOSQL** Supports scale-out, data replication, and distributed storage using multiple servers composed of PC-level commercial hardware.
Inexpensive cluster configuration
51
**CHARACTERISTICS OF NOSQL**
- No query language like the SQL of existing relational databases is provided.
- A simple interface is provided through simple API or HTTP calls.
Simple CLI (Call Level Interface)
52
**CHARACTERISTICS OF NOSQL** NoSQL loads data by automatically dividing data items into the cluster environment.
High availability
53
**CHARACTERISTICS OF NOSQL** While the relational DBMS focuses on ensuring logical structure and ACID, NoSQL makes the application process some of the integrity works instead of assigning them all to the DBMS.
Allow as much integrity as needed
54
**CHARACTERISTICS OF NOSQL** The methods of saving data are largely divided into column, key-value, document, and graph, using a function that allows data storage and access using the key values, without a fixed data schema for data modeling.
Schema-less
55
**CHARACTERISTICS OF NOSQL** NoSQL has a structure that allows expansion of the system’s scale and performance and distribution of the I/O load more easily, so that large-scale data can be created, updated, and queried, while not causing downtime for any clients and application systems that access the system, even if the system fails partially.
Elasticity
56
**CHARACTERISTICS OF NOSQL** NoSQL provides a query language, related processing technology, and APIs that can efficiently search and process data according to the characteristics of the data, even in a system composed of tens or thousands of servers.
Query
57
**CHARACTERISTICS OF NOSQL** NoSQL has a structure in which memory-based caching technology is very important, and which can provide a high-performance response speed even for large-scale queries and be consistently applied to development and operation.
Caching
58
**CHARACTERISTICS OF NOSQL** Partitioning allows a gradual node increase.
High scalability
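
The card does not name a partitioning scheme. One common approach that supports gradual node increase is consistent hashing, sketched below as an assumption rather than as the method of any particular NoSQL product. When a node is added, only the keys that now belong to the new node move; every other key stays where it was. `HashRing` and its methods are hypothetical names.

```python
import bisect
import hashlib


def _hash(value):
    """Deterministic hash of a string onto the ring (MD5 used for spread, not security)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring without virtual nodes."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node_name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        bisect.insort(self._ring, (_hash(node), node))

    def node_for(self, key):
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        index = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]
```

Production systems usually add many virtual nodes per server to balance the load; that refinement is left out here to keep the sketch short.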
59
**CHARACTERISTICS OF NOSQL** There is no single point of failure, and data remain available even if a certain node is down because they are replicated.
High availability
60
**CHARACTERISTICS OF NOSQL** The result should be quickly returned based on memory instead of disk, which can be achieved by using the non-blocking write and low complexity algorithm.
High performance
61
**CHARACTERISTICS OF NOSQL** Each write operation needs to be atomic.
Atomicity
62
**CHARACTERISTICS OF NOSQL** Strong consistency is not needed; eventual consistency is sufficient (Read-Your-Writes).
Consistency
63
**CHARACTERISTICS OF NOSQL** Data should be kept on a disk, not only in volatile memory.
Persistence
64
**CHARACTERISTICS OF NOSQL** When a node is added or deleted, data should be automatically loaded without the need for data distribution or manual mediation, and there should be no constraints, such as a distributed file system or shared storage, or any need for special hardware. The system should be operable on heterogeneous hardware.
Deployment
65
**CHARACTERISTICS OF NOSQL** Data of various types such as key-value pairs, hierarchical data, and graphs should be modeled conveniently.
Modeling flexibility
66
**CHARACTERISTICS OF NOSQL** Multiple GET that obtains a set of values for the provided key from a query, and queries that obtain data based on a specific range of keys, are needed.
Query flexibility
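
The two query styles named on this card, a multiple GET and a key-range query, can be sketched over an in-memory store that keeps its keys sorted. This is an illustrative toy with hypothetical names (`SortedKeyStore`, `multi_get`, `range_query`), not the API of any real NoSQL system.

```python
import bisect


class SortedKeyStore:
    """Toy key-value store that keeps keys sorted to support range queries."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def multi_get(self, keys):
        """Multiple GET: return the values for every requested key that exists."""
        return {key: self._values[key] for key in keys if key in self._values}

    def range_query(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in key order."""
        low = bisect.bisect_left(self._keys, start)
        high = bisect.bisect_right(self._keys, end)
        return [(key, self._values[key]) for key in self._keys[low:high]]
```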
67
**Description of NoSQL’s BASE properties**
- Emphasis is placed on availability and the use of optimistic locking and queues.
- Availability is ensured even with multiple failures, and copies are stored in multiple storages.
Basically Available
68
**Description of NoSQL’s BASE properties**
- Node status is determined by the information transmitted from outside.
- Updates between distributed nodes are applied when the data reach the node.
Soft-State
69
**Description of NoSQL’s BASE properties** The property of eventually restoring consistency even though consistency is lost temporarily.
Eventually Consistent
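
A toy simulation can make this property concrete: a write is applied to one replica immediately and queued for the others, so a read from another replica may return stale data until the queued updates are delivered. Everything below (class names, the explicit `sync` step standing in for background replication) is a hypothetical simplification.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on replica 0 at once and reach the other replicas on sync()."""

    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) updates

    def write(self, key, value):
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        """Deliver all queued updates in order (stands in for background replication)."""
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()
```

Between `write` and `sync` the replicas disagree; after `sync` they all return the same value, which is the temporary loss and later recovery of consistency the card describes.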
70
**Types of NoSQL**
- The most basic NoSQL database that provides simple and fast Get, Put, and Delete functions based on the key value.
- Dynamo, Redis, MemcacheDB, etc.
Key-value based
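
The three operations named on this card can be sketched with a dictionary-backed class. It is an in-memory illustration only, with hypothetical names, and it leaves out the persistence, replication, and networking that real key-value databases provide.

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing Get, Put, and Delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store value under key, overwriting any previous value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for key, or default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key; return True if it existed, False otherwise."""
        if key in self._data:
            del self._data[key]
            return True
        return False
```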
71
**Types of NoSQL**
- A NoSQL database that expresses the entity attributes of the relational database as nodes and the relationships as edges between nodes, such as Neo4j, FlockDB, etc.
Graph based
72
**Types of NoSQL**
- A NoSQL database that stores data in rows in the column family, which corresponds to the table in the relational database.
- Cassandra, HBase, SimpleDB, etc.
Column family based
73
**Types of NoSQL**
- A NoSQL database that stores documents such as XML, JSON, BSON, etc. in the value part of the key-value database, such as NoSQL DB, MongoDB, CouchDB, etc.
Document based
74
A data modeling technique that derives logical connection points using the data composition method with generalized notation and execution procedures
ACID-based data modeling
75
Data modeling that is closer to file structure design than to general data modeling, in which a data set that can be easily processed by the program is created
BASE-based data modeling
76
This theory asserts that as it is impossible for a distributed data store to simultaneously satisfy all of data consistency, availability, and partition tolerance, only two should be strategically selected.
CAP theorem (Consistency, Availability, and Partition Tolerance)
77
**BASE ATTRIBUTES OF NOSQL- CHARACTERISTICS OF THE NOSQL DATA MODEL** All nodes should show the same data at the same time (Each user should always view the same data)
Consistency
78
**BASE ATTRIBUTES OF NOSQL- CHARACTERISTICS OF THE NOSQL DATA MODEL** Even if some nodes are down, it should not affect the other nodes. (All users should always be able to read and write data.)
Availability
79
**BASE ATTRIBUTES OF NOSQL- CHARACTERISTICS OF THE NOSQL DATA MODEL** Even if some messages are lost, the system should operate normally. (The system should work properly in a physically distributed network environment.)
Partition Tolerance
80
**BASE ATTRIBUTES OF NOSQL- CHARACTERISTICS OF THE NOSQL DATA MODEL** An exceptionally reliable type in which message loss can be prevented even if the system is down. Essential when a transaction is required. Example: General RDBMS.
C + A
81
**BASE ATTRIBUTES OF NOSQL- CHARACTERISTICS OF THE NOSQL DATA MODEL** A performance type in which all nodes must perform well together. Examples: Google's BigTable, HyperTable, HBase.
C + P
82
**BASE ATTRIBUTES OF NOSQL- CHARACTERISTICS OF THE NOSQL DATA MODEL** Essential for asynchronous store operations. Examples: Dynamo, Apache Cassandra, CouchDB, Oracle Coherence.
A + P
83
A _____ refers to the network between the components constituting a given society.
social network