Exam Flashcards
When does big data become too big? 3 things
1) When analysis becomes too slow or too unreliable
2) When systems become unresponsive
3) When day-to-day businesses are impacted
What are the 3 biggest changes in Big Data?
1) In the past, storage was expensive
2) Only crucial data was preserved
3) Companies only consulted historical data
4 examples of data analysis that are computationally intensive and expensive
1) Online recommender systems
2) Frequent pattern mining
3) Multi-label classification
4) Subspace clustering
3 aspects of big data
1) Volume: quantity
2) Variety: different types of data
3) Velocity: speed at which new data is coming in
2 solutions for big data
1) Invest in hardware
2) Use intelligent algorithms
Goal of parallel computing
Leveraging the full potential of your multicore or multi-computer system; the goal is to reduce computation time
Embarrassingly parallel
When an algorithm can be split into smaller parts that run simultaneously, with little or no coordination between them
Linear speedup
Executing two tasks in parallel on two cores should halve the running time compared to running them sequentially on one core; in general, the speedup grows linearly with the number of cores
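A short worked reading of this card (my addition, not part of the original deck): with p cores, linear speedup means the single-core running time is divided by p.

```latex
S(p) = \frac{T_1}{T_p} = p
\qquad \text{e.g. } T_1 = 60\,\mathrm{s} \;\Rightarrow\; T_2 = \frac{60}{2} = 30\,\mathrm{s} \text{ on two cores}
```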
Task parallelism
Multiple tasks are applied to the same data in parallel
Data parallelism
A calculation is performed in parallel on many different data chunks
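A minimal sketch of data parallelism (illustrative only; the data and function names are my own): the same calculation is applied to independent data chunks in parallel, which also makes the job embarrassingly parallel.

```python
from multiprocessing import Pool

def summarize(chunk):
    # The same calculation is applied to every chunk independently
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    # Hypothetical data already split into chunks; there are no
    # dependencies between chunks, so they can run simultaneously
    chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    with Pool(processes=4) as pool:
        means = pool.map(summarize, chunks)  # one chunk per worker
    print(means)  # [2.0, 5.0, 8.0, 11.0]
```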
Data dependencies
These prevent you from realizing a linear speedup: the input of one segment of code depends on the output of another piece of code
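An illustrative counter-example (my addition): a running total where each iteration needs the previous iteration's result, so the loop cannot simply be split across cores without restructuring the algorithm.

```python
def running_total(values):
    totals = []
    acc = 0
    for v in values:
        # Data dependency: this iteration's input (acc) is the
        # previous iteration's output, so iterations cannot run
        # independently in parallel
        acc += v
        totals.append(acc)
    return totals

print(running_total([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
```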
Distributed systems from a computational perspective
To solve a common problem, multiple systems work together using messages over the network to coordinate their actions
Distributed systems from a service perspective
Multiple systems are available over the internet to deliver services to a vast number of users. It’s used to spread the load, but also to guarantee a certain level of availability
5 characteristics of distributed hardware
1) Heterogenous hardware
2) Multiple systems work together using messages over the network to coordinate their actions
3) For the user it looks like one single system (transparent)
4) Tolerant to failures of one system
5) Scalable: it should be easy to add new nodes to increase computational capacity
Difference between parallel and distributed computing
Parallel computing typically requires one computer with multiple processors. Distributed computing involves several autonomous computer systems working on divided tasks
Definition of service
A single-task online operation with a request/response model. Availability is important. A load balancer distributes requests to different service instances, each hosted on a different node
Batch processing
An offline operation that takes a large amount of input data, runs a job to process it and produces some output data. Throughput is important: the time needed to process a dataset of a certain size
Stream processing
Offers near real-time processing somewhere between batch processing and online services. It’s not request driven, but consumes inputs and produces output. The idea is to process events shortly after they happen. Low delay is important
What does a cluster architecture look like?
CPU, memory and disk
3 challenges with large-scale computing for data mining
1) How to distribute computation?
2) How to make it easy to write distributed programs?
3) Machines fail
Distributed file system
Provides a global file namespace. Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common
4 characteristics of DFS (distributed file systems)
1) Chunk servers: file is split into contiguous chunks, each chunk is replicated for reliability, replicas are in different racks
2) Master node (name node): stores metadata about where files are stored, assigns map and reduce tasks to worker nodes, might be replicated
3) Client library for file access: talks to the master to find chunk servers, connects directly to chunk servers to access data
4) Reliable distributed file system: data is kept in chunks spread across machines, with each chunk replicated on different machines, so data can be recovered when a node fails
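A minimal in-memory sketch of the DFS read path (all class and method names are hypothetical, not a real GFS/HDFS API): the client asks the master only for metadata, then reads chunks directly from the chunk servers.

```python
class Master:
    """Name node: stores only metadata about where chunks live."""
    def __init__(self):
        # filename -> ordered list of (chunk_id, [servers holding a replica])
        self.metadata = {"log.txt": [("c1", ["s1", "s2"]), ("c2", ["s2", "s3"])]}

    def locate(self, filename):
        return self.metadata[filename]

class ChunkServer:
    """Holds the actual chunk data (replicated across servers)."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk_id -> bytes

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

servers = {
    "s1": ChunkServer({"c1": b"hello "}),
    "s2": ChunkServer({"c1": b"hello ", "c2": b"world"}),  # replicas
    "s3": ChunkServer({"c2": b"world"}),
}

def read_file(master, filename):
    data = b""
    for chunk_id, replicas in master.locate(filename):
        # Metadata comes from the master; the data itself is read
        # directly from the first listed replica
        data += servers[replicas[0]].read_chunk(chunk_id)
    return data

print(read_file(Master(), "log.txt"))  # b'hello world'
```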
Map-reduce: what does the programmer and the environment do?
Programmer: provides the Map and Reduce functions and the input files
Environment: handles partitioning, scheduling, node failures, etc.
Workflow Map-Reduce
1) Read inputs as a set of key-value pairs
2) Map transforms input kv-pairs into a new set of k’v’-pairs
3) Sorts & shuffles the k’v’-pairs to output nodes
4) All k’v’-pairs with a given k’ are sent to the same reduce task
5) Reduce processes all k’v’-pairs grouped by key into new k’’v’’-pairs
6) Write the resulting pairs to files
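A minimal word-count sketch of this workflow in plain Python (illustrative only; a real MapReduce environment handles the partitioning, shuffling and fault tolerance): map emits (word, 1) pairs, the shuffle groups them by key, and reduce sums the counts per word.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: transform an input (key, value) pair into k'v'-pairs
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: all k'v'-pairs with the same key arrive at one reduce task
    yield word, sum(counts)

lines = {1: "big data big systems", 2: "big clusters"}  # toy input

# Map phase
intermediate = defaultdict(list)
for key, value in lines.items():
    for k2, v2 in map_fn(key, value):
        intermediate[k2].append(v2)   # sort & shuffle: group by k'

# Reduce phase
result = {}
for k2, values in intermediate.items():
    for k3, v3 in reduce_fn(k2, values):
        result[k3] = v3

print(result)  # {'big': 3, 'data': 1, 'systems': 1, 'clusters': 1}
```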