Exam Flashcards

Question

3 types of Map Reduce nodes

Answer 1

1) Master node: coordination and dealing with node failures 2) Worker nodes: chunk servers, map tasks, reduce tasks 3) Dynamic load balancing: ideally a lot more Map tasks than machines, spawn copies of in-progress tasks at the end of a phase

Answer 2

Input and final output are stored in a DFS. The scheduler tries to schedule map tasks 'close' to physical storage location of input data. Intermediate results are stored on local FS of Map and Reduce workers. Output is often input to another Map-Reduce task

Answer 3

1) Make M much larger than the number of nodes in the cluster 2) One DFS chunk per map is common 3) Improves dynamic load balancing and speeds up recovery from worker failures 4) Usually R is smaller than M, because output is spread across R files

Answer 4

Slow workers significantly lengthen the job completion time. So near end of phase, spawn backup copies of tasks. This would dramatically shorten job completion time

Answer 5

Often a Map task will produce many pairs of the form (k, v1), (k, v2)... for the same key 'k', for example for popular words in the word count example. Combining can save network time by pre-aggregating values in the mapper. This only works if reduce function is commutative and associative

Answer 6

1) Communication costs 2) Elapsed communication costs 3) (Elapsed) computation costs

Answer 7

It uses the mathematical concept of a relation as the formalism for describing and representing data. Can be thought of as a table

Answer 8

Rows are called tuples, columns are called attributes, defined by attribute names

Answer 9

Set of all possible combinations of those values

Answer 10

An expression R(A1, ..., Ak) consisting of a relation name R and a sorted list (A1, ..., Ak) of attributes. The relation is a set, so the order of the tuples does not matter, but the attributes must come in the same order in each tuple

Answer 11

Set of relation schemas

Answer 12

Is based on a few basic operators that take one or two relations as input and map them to a new relation. These then can be combined into more complex mappings from database instances to relations. Core of the query language SQL

Answer 13

An operator defined by a condition C that is applied to a single relation R. Select * from Drinkers where Drink = "Beer"

Answer 14

An operator that removes columns from a table. Select Person from Drinkers

Answer 15

A set-theoretic operator applied to two relations R and S that share the same schema. Produces the set of all tuples present in at least one of the two relations

Answer 16

A set-theoretic operator applied to two relations R and S that share the same schema. Produces the set of all tuples present in both R and S

Answer 17

A set-theoretic operator applied to two relations R and S that share the same schema. Produces the set of all tuples present in R, but not in S

Answer 18

An operator applied to two relations R and S. The set of attributes of the resulting relation is an union of the sets of attributes of R and S. So Drinkers(Person, Drink) and Bars(Bar, Drink) are combined into a relation of all possible combinations of persons and bars such that the person drinks a drink that is served in that bar

Answer 19

Produces some kind of aggregate of a particular attribute for each group of values of another attribute

Answer 20

It's at the core of data analysis. Operators/concepts are perfectly applicable to more flexibly stored data. Relational algebra is also implementable in Map Reduce, it is naturally parallelizable and has massive runtime gains

Answer 21

1) Assume that main matrix does not fit in memory 2) The vector fits in memory and every worker has a copy 3) The elements of the matrix are saved in a CSV file Map > shuffle > reduce

Answer 22

Two matrices A = (aij) with dimensions p x q and B = (bjk) with dimensions q x r. C = A X b. C = (cik) with dimensions p x r

Answer 23

The range and distribution of numbers have to be known.

Answer 24

1) Inflexible: forces data processing into map and reduce steps 2) Based on acyclic data flow from Disk to Disk (HDFS), so you need to write to Disk before/after map and reduce. Not efficient for iterative tasks

Answer 25

1) It generalizes to more complex Directed Acyclic Graphs (DAGs) that are optimized by Spark as it has a global view of the process 2) It abstract data into an object called Resilient Distributed Dataset (RDD), you can cache it 3) For ML algorithms that do multiple iterations over the dataset, this is much faster (more work in memory and fewer disk accesses) 4) Spark is more flexible in workflow (not just map and reduce phases) and data format (not only key-value pairs)

Answer 26

Data containers which transform Spark - You can mix different kinds of transformations to create new RDDs - Created by parallelizing a collection or reading a file or by transforming other RDDs - Fault tolerant - Flexible - Memory instead of disk

Answer 27

Many more than just map and reduce 1) Transformations 2) Actions

Answer 28

Map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, etc.

Answer 29

Reduce, count, collect, first, foreach, etc.

Answer 30

1) Spark can process data in-memory 2) Ease of use: Spark is easier to program 3) Data processing: Spark is more general 4) Maturity: Spark maturing, Hadoop map-reduce mature

Answer 31

1) Spark can process data in-memory 2) Spark generally outperforms map-reduce, but needs a lot of memory to do so 3) Map-reduce easily runs alongside other services with minor performance differences and works well with the 1-pass jobs it was designed for

Answer 32

1) For many simple use cases, map-reduce might be a more appropriate choice 2) Spark not designed for multi-user environment 3) Spark users are required to know that the memory they have is sufficient for a dataset 4) Adding more users adds complications, since the users will have to coordinate memory usage to run code

Answer 33

Provides an easy solution for processing of Big Data. Brings a paradigm shift in programming distributed systems

Answer 34

Spark has extended map-reduce for in-memory computations. For streaming, interactive, iterative and machine learning tasks

Answer 35

The distribution of the data changes over time

Answer 36

1) Mining query streams 2) Mining click streams 3) Mining social network news feeds 4) Sensor networks 5) Telephone call records 6) IP packets monitored at a switch

Answer 37

Sample a fixed proportion of elements in the stream

Answer 38

Useful model of stream processing is that queries are about a window of length N, the N most recent elements received.

Answer 39

Method is never off by more than 50%. Idea is to summarize blocks with specific number of 1s: the block sizes increase exponentially, the size of the block is the number of 1s in the block, not it's length in time

Answer 40

1) Timestamp of its end 2) Number of 1s between its beginning and end

Answer 41

1) Either 1 or 2 buckets with the same power of 2 number of 1s 2) Buckets don't overlap in timestamps 3) Buckets are sorted by size 4) Buckets disappear when their end-time is > N time units in the past

Answer 42

1) Sum the sizes of all buckets but the last (note 'size' means the number of 1s in the bucket) 2) Add half the size of the last bucket

Answer 43

Select elements of the stream that satisfy some property. Given a list of keys S, determine which tuples of stream are in S

Answer 44

Guarantees no false negatives and uses limited memory. Great for pre-processing before more expensive checks. Suitable for hardware implementation

Answer 45

Gives an estimate that is 'almost always' close to the correct number, and picks a hash function h that maps each of the N elements to at least log2 N bits

Answer 46

Number of distinct elements

Answer 47

Count of the number of elements = length of the stream

Answer 48

Measure of how uneven the distribution is

Answer 49

Works for all moments, and gives an unbiased estimate

Answer 50

Market-basket model. Goal is to identify items that are bought together by sufficiently many customers

Answer 51

Number of baskets containing all items in I. Often expressed as a fraction of the total number of baskets

Answer 52

If/then rules about the contents of baskets. There are many rules in practice, we focus on the significant/interesting ones

Answer 53

Probability of j given I

Answer 54

Difference between its confidence and the fraction of baskets that contain j

Answer 55

Generates larger candidate itemsets from smaller frequent itemsets - can use triangular matrix approach for pairs

Answer 56

Uses idle memory in the first pass to eliminate more candidate pairs - needs table of triples for pairs

Answer 57

Find all frequent itemsets in memory on a sample of data - misses out on (potentially many) frequent itemsets. Second pass can ensure that there are no false positives

Answer 58

Find all frequent itemsets in memory on each chunk of data. Second pass to count actual frequencies of all such itemsets on full data - ideal for distributed processing

Answer 59

Read file once, counting in main memory the occurrences of each pair. Fails if (#items)2 exceeds main memory

Answer 60

Count all pairs using a matrix. Only requires 4 bytes per pair.

Answer 61

Keep a table of triples [i, j, c] = "the count of the paris of items {i, j} is c". Only uses 12 bytes per pair (but only for pairs with count > 0). This beats the Triangular Matrix approach, if less than 1/3 of possible pairs actually occur

Answer 62

A two-pass approach called apriori limits the need for main memory. The key idea is monotonicity: if a set of items I appears in at least s times, so does every subset J of I. Contrapositive for pairs: if item I does not appear in S baskets, then no pair including i can appear in s baskets

Answer 63

1) Read baskets and count in main memory the occurences of each individual item 2) Read baskets again and count in main memory only those pairs where both elements are frequent

Answer 64

In pass 1 of apriori, most memory is idle. Pass 1 of PCY: in addition to item counts, maintain a hash table with as many buckets as fit in memory. Keep a count for each bucket into which pairs of items are hashed. For each bucket just keep the count, not the actual paris that hash to the bucket

Answer 65

Take a random sample of the market baskets. Run apriori or one of its improvement in main memory. So we don't pay for disk I/O each time we increase the size of itemsets. Optionally, verify that the candidate pairs are truly frequent in the entire data set by a second pass (avoid false positives). But you don't catch sets frequent in the whole but not in the sample

Answer 66

An algorithm suited for distributed computation. Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets. An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets. Key monotonicity idea.

Exam Flashcards

(90 cards)