UNDERSTANDING BIG DATA AND NOSQL Flashcards
(83 cards)
Generally refers to either data that exceed the ability of database management tools used to capture, store, and analyze data (McKinsey, 2011), or to next-generation technologies and architectures designed to extract value from large-scale data at low cost and support the rapid collection, discovery, and analysis of data (IDC, 2011).
Big data
The characteristics of big data can be explained by the three elements (3V)
volume, velocity, and variety
Characteristics of big data
Refers to a volume of data of tens of terabytes, petabytes, or more, thus exceeding the processing limit of commonly used software when collecting, storing, and processing data.
Volume
Characteristics of big data
* ‘Big data’ is created very quickly
* Data collection, processing, storage, and analysis need to be performed in real time
Velocity
Characteristics of big data
* Diverse kinds of data
* Big data can be classified into structured, semi-structured, and unstructured data
Variety
6V of big data
- Volume
- Variety
- Velocity
- Veracity
- Visualization
- Value
Data stored in a fixed field, such as the rows and columns of a relational table.
Structured data
Data not stored in a fixed field, but which contain metadata or schema, such as XML or HTML.
Semi-structured data
- Data not stored in a fixed field
- Document, picture, video, and audio data, etc.
Unstructured data
A technology that can collect data from all devices and systems
Collection
Crawling, ETL, CEP, etc
A technology that can store and process collected large-scale data using a distributed processing system.
Storage/processing
Distributed file system, NoSQL, MapReduce processing
A method of analysis that can assist companies and the public with using big data in business and daily life.
Analysis
Natural language processing, machine learning, data mining algorithms
A technology that can visualize analyzed results effectively.
Visualization
R, graphs, drawing, etc
Web ___ copies the entire web page after collecting the URLs to be collected, or collects data with a specific tag only after analyzing the HTML code.
crawling
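The tag-based collection described on this card can be sketched with only the Python standard library. A minimal, self-contained example (the HTML is inlined rather than fetched over the network, and the choice of `<h2>` as the target tag is hypothetical):

```python
# Sketch of tag-based web crawling: after analyzing the HTML code,
# collect data with a specific tag only (here, <h2> headings).
# A real crawler would first fetch pages for the collected URLs.
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the text inside every <h2> tag."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())

page = "<html><body><h2>Big Data</h2><p>intro</p><h2>NoSQL</h2></body></html>"
parser = TitleCollector()
parser.feed(page)
print(parser.titles)  # ['Big Data', 'NoSQL']
```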
Collects data using the SQL function of the DBMS.
Collection using the DBMS
Oracle, MariaDB, MS SQL, Tibero, etc.
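Collection via the SQL function of a DBMS can be sketched with `sqlite3`, standing in for the products named above (Oracle, MariaDB, MS SQL, Tibero); the table and rows are hypothetical:

```python
# Sketch of collection using the DBMS: data is gathered by running
# SQL queries against the source system. sqlite3 is used here only
# so the example is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_log (device TEXT, value REAL)")
conn.executemany("INSERT INTO sensor_log VALUES (?, ?)",
                 [("a", 1.5), ("b", 2.0), ("a", 3.5)])

# The "collection" step is simply a SQL query.
rows = conn.execute(
    "SELECT device, SUM(value) FROM sensor_log "
    "GROUP BY device ORDER BY device"
).fetchall()
print(rows)  # [('a', 5.0), ('b', 2.0)]
```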
Collects data when a certain condition is met
Collection using sensors
CQL, Kafka
Collects data using port that can transfer files.
FTP collection
Collects data by reading HTML tags
HTTP collection
Scraper
A file system that allows access to files on multiple host computers which are shared over a computer network.
Distributed File System (DFS)
GFS (Google File System), HDFS (Hadoop Distributed File System), etc.
A new type of data storage/retrieval system that uses a less restrictive consistency model (BASE characteristics) than the traditional relational database.
NoSQL (Not Only SQL)
HBase, Cassandra, MongoDB, Couchbase, Redis, Neo4j, etc.
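The schemaless storage idea behind document-oriented NoSQL systems such as MongoDB can be sketched with a plain dictionary. This is only an illustration of the data model (the collection name and documents are hypothetical); real systems add distribution, replication, and BASE-style eventual consistency:

```python
# Sketch of a document store: records are kept under a key and need
# no fixed schema, so different documents can carry different fields.
users = {}  # in-memory "collection": key -> document

users["u1"] = {"name": "Kim", "age": 30}
users["u2"] = {"name": "Lee", "tags": ["admin"]}  # different fields

# Retrieval is by key, not by SQL over fixed columns.
print(users["u2"].get("tags"))  # ['admin']
```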
A technology that processes a large amount of data in a distributed parallel computing environment.
Distributed parallel processing
MapReduce
A file system architecture for storing and processing large-scale and unstructured data in a distributed environment.
Distributed File System (DFS)
- A programming model designed for the parallel distributed processing of big data using inexpensive machines. This model can process large amounts of data in parallel using a program composed of a map procedure and a reduce procedure.
- Allows the analysis of large-scale data by processing data that has been distributed and stored in multiple machines.
MapReduce
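The model described above can be sketched as a single-machine word count: a map step emits (key, value) pairs, a shuffle groups pairs by key, and a reduce step aggregates each group. Real frameworks such as Hadoop run these steps in parallel across many machines; the input lines here are hypothetical:

```python
# Sketch of the MapReduce programming model (word count).
from itertools import groupby
from operator import itemgetter

def map_step(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield (word, 1)

def reduce_step(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))

lines = ["big data big", "data value"]
pairs = [p for line in lines for p in map_step(line)]
pairs.sort(key=itemgetter(0))  # shuffle: bring equal keys together

result = dict(
    reduce_step(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result)  # {'big': 2, 'data': 2, 'value': 1}
```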
This technique provides insights by effectively conveying numbers, statistics, and valuable meanings; by classifying data for the user’s easy understanding; and by analyzing large-scale data.
Visualization technology