Intro To The Work-Span Model Flashcards

1
Q

Dynamic Multithreading Model

A

Two parts:
- A computation can be represented by a directed acyclic graph, in which each node is an operation and each edge a dependence.
- Pseudocode notation, or programming model, for writing down the algorithm. Designed so that “when you execute one of these algorithms, it generates a DAG, at least conceptually.”

2
Q

From the point of view of exploiting parallelism, what is the key property of a “good” DAG?

A

It has few dependences relative to its amount of work, so many operations can execute independently (in parallel).

3
Q

What separation does the pseudocode notation provide?

A

It separates how to produce work (algorithm) from how to schedule and execute it (hardware).

4
Q

What is a scheduling problem?

A

The problem of how to take free units of work (operations whose dependencies have all been satisfied) and assign them to processors.

5
Q

Given a complete binary tree with n leaves, how many levels does it have?

A

log_2(n) (e.g., a complete binary tree with n = 8 leaves has 3 levels of branching)

6
Q

What is work?

A

The total number of nodes/vertices in the multithreaded DAG model. Notated W(n).

7
Q

What is span?

A

The length of the longest path through the multithreaded DAG model (from start to end). Also called “depth”.

Notated D(n)

8
Q

What is another name for the longest path through a DAG?

A

Critical path

9
Q

How much time does it take to execute all operations in a DAG with one processor, assuming all operations have unit cost?

A

W(n), i.e., total work

10
Q

How much time does it take to execute all operations in a DAG with an unlimited number of processors, assuming all operations have unit cost?

A

D(n)

11
Q

Average available parallelism

A

W(n)/D(n)

Average work per critical path vertex. This is the amount of parallelism we can take advantage of at each critical path vertex, and is a good starting point for choosing an appropriate number of processors.
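
A quick sanity check with made-up numbers (purely illustrative Python):

# Hypothetical DAG: W(n) = 1,000 unit-cost operations, span D(n) = 10.
W, D = 1000, 10
avg_parallelism = W / D     # = 100
# On average ~100 operations are available per critical-path step, so ~100
# processors is a sensible starting point; many more would mostly sit idle.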

12
Q

Span Law

A

T_p(n) >= D(n)

i.e., the span is a lower bound on the execution time given p processors.

13
Q

Work Law

A

T_p(n) >= ceiling(W(n)/p)

Another lower bound on the execution time given p processors. This time it's the total work divided by the number of processors.

14
Q

Work-Span Law

A

Both the Work Law and Span Law must be true simultaneously, so we can combine them.

T_p(n) >= max{D(n), ceiling(W(n)/p)}

This gives a lower bound on execution time.
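
A worked example with made-up numbers (a hedged Python sketch, just to make the bound concrete):

import math
# Hypothetical DAG: W = 1,000 operations, D = 10, p = 4 processors.
W, D, p = 1000, 10, 4
lower_bound = max(D, math.ceil(W / p))   # max{10, 250} = 250
# No schedule on 4 processors can finish in fewer than 250 time steps.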

15
Q

How do you derive Brent’s Theorem (which provides an upper bound on execution time of a DAG)?

A

1) Break the DAG into phases, one per critical-path vertex (so there are D of them), where each phase meets three conditions:
- Each phase contains exactly one critical-path vertex.
- All vertices within a phase that are NOT on the critical path must be independent, i.e., have no dependences between them.
- Every vertex must belong to some phase.
2) Each phase k contains some number of vertices, W_k (including its critical-path vertex).
3) By condition three, the sum of W_k over all k equals W, the total work of the DAG.
4) How long does it take to execute phase k with p processors? t_k = ceiling(W_k/p).
5) If we schedule the phases one after another (which can only be slower than the best schedule), 4 implies the total time T_p <= sum from k=1 to D of ceiling(W_k/p).
6) We can rewrite each ceiling in 5 as a floor, using ceiling(W_k/p) = floor((W_k - 1)/p) + 1 (see the sanity check after this list): T_p <= sum from k=1 to D of [floor((W_k - 1)/p) + 1].
7) floor(x) is always <= x, so T_p <= sum from k=1 to D of [(W_k - 1)/p + 1].
8) Summing term by term, 7 gives T_p <= (W - D)/p + D. Brent's Theorem!
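
A small, hedged sanity check of the ceiling-to-floor identity used in step 6 (illustrative Python, not part of the proof):

import math
# ceil(a/p) == floor((a - 1)/p) + 1 for all integers a >= 1, p >= 1.
assert all(math.ceil(a / p) == (a - 1) // p + 1
           for a in range(1, 200) for p in range(1, 50))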

16
Q

What is Brent’s Theorem?

A

T_p <= (W - D)/p + D

T_p := total execution time of the DAG given p processors
W := work of the DAG, i.e., the number of vertices (operations)
D := span of the DAG, i.e., its depth
p := number of processors

17
Q

What is the intuitive interpretation of Brent’s Theorem?

A

Total time to execute the DAG is less than or equal to the time needed to execute the critical path (D) plus the time needed to execute all other nodes using p processors.

Note that the time to execute the critical path does not depend on p because you can’t execute nodes of the critical path in parallel.

18
Q

What’s the equation for speedup of parallel time over best sequential time?

A

S_p(n) := T_*(n)/T_p(n)

19
Q

What does T_p(n) depend on?

A

T_p(n) = f(W, D; n, P)

work, span, problem size, and number of processors

20
Q

What does T_*(n) depend on?

A

T_*(n) = W_*(n)

Work done by the best sequential algorithm

21
Q

What is the ideal speedup of a parallel algorithm over the best sequential algorithm?

A

Ideal speedup is linear in P, also called linear speedup, linear scaling, or ideal scaling.

S_p(n) = Theta(P)

22
Q

How does big-Theta notation work?

A

big-Theta depends on both big-O and big-Omega.

big-O provides an upper bound on the performance of an algorithm. f(n) is O(g(n)) if there exist constants C_1 and n_0 s.t. 0 <= f(n) <= C_1*g(n) for all n >= n_0.

big-Omega provides a lower bound on the performance of an algorithm. f(n) is Omega(g(n)) if there exist constants C_2 and n_0 s.t. 0 <= C_2*g(n) <= f(n) for all n >= n_0.

big-Theta provides a tight (two-sided) bound. f(n) is Theta(g(n)) if there exist constants C_1, C_2, and n_0 such that 0 <= C_2*g(n) <= f(n) <= C_1*g(n) for all n >= n_0. For example, 3n^2 + 5n is Theta(n^2).

23
Q

How do you use Brent’s Theorem to get a lower bound on speedup (and then make it easier to analyze)? Using the resulting equation, what has to be true to get ideal scaling?

A

S_p(n) = T_*(n)/T_p(n) = W_*(n)/T_p(n), because the execution time of the best sequential algorithm is essentially just a function of the work it does.

Brent's Theorem gives us an upper bound on execution time. Plugging it into this equation gives a lower bound on speedup: S_p(n) = W_*(n)/T_p(n) >= W_*(n)/[(W - D)/P + D] = (via a little algebra) P/[W/W_* + (P - 1)/(W_*/D)].

To get ideal scaling, the denominator must be a constant (since P is in the numerator and ideal scaling means linear scaling in P). For the first term, W(n) must be proportional to W_*(n), so that W/W_* is a constant. This is called "work-optimality".

For the second term, we also need P to be proportional to W_*/D, i.e., P = O(W_*/D). This can be rewritten as W_*/P = Omega(D), which says the work per processor must grow with the span, which in turn depends on n. This is called "weak scalability".
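
A hedged numerical check of the "little algebra" step above, with made-up values for W_*, W, D, and P (illustrative Python):

# Both forms of the lower bound on speedup should agree.
W_star, W, D, P = 1000.0, 1200.0, 20.0, 8.0
lhs = W_star / ((W - D) / P + D)
rhs = P / (W / W_star + (P - 1) / (W_star / D))
assert abs(lhs - rhs) < 1e-9     # ~5.97 either way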

24
Q

What is work-optimality?

A

W(n) = O(W_*(n)), i.e., the work of our parallel algorithm must be proportional to the work of the best sequential algorithm. We use this to get ideal scaling.

25
Q

What is weak-scalability?

A

P = O(W/D) implies W/P = Omega(D), i.e., the work per processor must grow with the span, which in turn depends on n, the problem size. We use this to get ideal scaling.

26
Q

Spawn

A

A keyword that tells the compiler or runtime system that the target is an independent unit of work, i.e., may be executed asynchronously. The target is either a function or procedure call.

27
Q

Sync

A

A keyword that tells the compiler or runtime system that the program can only continue after all spawns have finished their work, i.e., when everyone’s caught up!

To which spawns does sync apply? In this class we’ll assume sync always applies to any spawn that has occurred on the same stack frame.
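
A minimal sketch of the spawn/sync idea in Python, assuming concurrent.futures stands in for the pseudocode keywords (the helper functions are made up for illustration):

from concurrent.futures import ThreadPoolExecutor

def work_a():
    return sum(range(1_000_000))

def work_b():
    return sum(range(2_000_000))

with ThreadPoolExecutor() as pool:
    a = pool.submit(work_a)    # "spawn": work_a may run asynchronously
    b = work_b()               # the caller keeps doing its own work meanwhile
    total = a.result() + b     # "sync": wait for the spawned work before combining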

28
Q

Call stack

A

A stack data structure that stores information about the active subroutines of a computer program.

Its main purpose is to keep track of the point to which each active subroutine should return control when it completes execution.

A call stack is composed of stack frames.

29
Q

Stack frame

A

These are the components of the call stack. Each stack frame corresponds to an active subroutine; it is a data structure that contains state information about its corresponding subroutine.

30
Q

Returns include an implicit sync. Why might this give us incorrect results?

A

Say you spawn two tasks that put values in variables A and B, and then return A + B. The return will evaluate A + B first, store the result in a temporary variable, and only sync afterwards.

However, we need to sync BEFORE the addition. If we don’t, then we can’t guarantee that the values in A and B are the values we computed in our spawned processes! They might be junk, or they might be values previously stored to those variables.
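
A hedged illustration of the hazard using plain Python threads and shared variables (names are hypothetical):

import threading

A = None
B = None

def compute_a():
    global A
    A = 1 + 1

def compute_b():
    global B
    B = 2 + 2

ta = threading.Thread(target=compute_a); ta.start()   # ~ spawn
tb = threading.Thread(target=compute_b); tb.start()   # ~ spawn
# WRONG: reading A + B here could see None or stale values (a data race).
ta.join(); tb.join()   # ~ sync BEFORE the addition
print(A + B)           # now guaranteed to print 6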

31
Q

Within the spawn-sync framework, can you analyze work-span algorithms the same way you analyze sequential algorithms?

A

Basically, yes!

If we assume spawns and syncs are O(1) (which is often true in practice), then we can ignore them and do the usual sequential analysis.

The recurrence for work W(n) will look like the recurrence for the sequential analysis T(n)! The recurrence for span will be different, but not too bad.
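
For instance, here is a hedged sketch of a divide-and-conquer sum written so the recurrences are easy to read off (plain sequential Python; imagine the first recursive call being spawned):

def psum(A, lo, hi):
    # Reduce A[lo:hi] by addition.
    if hi - lo == 1:
        return A[lo]
    mid = (lo + hi) // 2
    left = psum(A, lo, mid)    # conceptually spawned
    right = psum(A, mid, hi)
    return left + right        # conceptually preceded by a sync

print(psum([1, 2, 3, 4, 5], 0, 5))   # 15

# Work: W(n) = 2*W(n/2) + O(1) = O(n)      (same recurrence as sequential T(n))
# Span: D(n) =   D(n/2) + O(1) = O(log n)  (the halves run in parallel, so only one term)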

32
Q

What are your goals as a parallel algorithm designer, with respect to work and span?

A

1) Work-optimality: achieve work that matches the best sequential algorithm, i.e., W(n) = O(W_*(n)) (often O(n) for the problems in this course), if at all humanly possible.

2) Poly-logarithmic span (or "low" span): D(n) = O(log^k n). Because the work goal is roughly linear in n and log n grows far more slowly than n, this means the average available parallelism W/D grows with n. Note: span is DEPTH, so basically a good parallel algorithm doesn't get much deeper as the amount of work increases.

33
Q

par-for

A

Parallel-for, also called for-any or for-all. A kind of loop in which all iterations are independent of one another, so they can execute in any order.

In DAG terms, this creates n independent sub-paths. Like executing n spawns simultaneously.

At the end of the par-for, there is an implicit sync point.

34
Q

Work and span of par-for

A

If each iteration is bounded by a constant cost, then the work is O(1) per iteration and O(n) altogether.

In theory, the span would be constant. In practice, the span depends on the implementation. If we implement par-for as a sequential loop that spawns each iteration, the span is O(n)! That's terrible!

A more realistic implementation is a procedure that spawns iterations recursively, divide-and-conquer style (see the sketch below). The standard binary-tree recursion gives a span of O(log n). Better!

Assume the O(log n) span implementation for this class.
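
A hedged sketch of the divide-and-conquer implementation described above (plain sequential Python for clarity; the recursive calls are what would be spawned):

def par_for(body, lo, hi):
    # Run body(i) for every i in [lo, hi); iterations must be independent.
    if hi - lo == 0:
        return
    if hi - lo == 1:
        body(lo)
        return
    mid = (lo + hi) // 2
    par_for(body, lo, mid)   # conceptually spawned
    par_for(body, mid, hi)   # runs alongside; implicit sync at the end

par_for(lambda i: print(i), 0, 4)   # prints 0, 1, 2, 3 (order would not matter in parallel)

# The recursion tree has depth O(log n), so the span is O(log n), while the
# total work over n constant-cost iterations is still O(n).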

35
Q

Race condition

A

When a data race leads to an error. (A data race occurs when two concurrent operations access the same memory location, at least one of them is a write, and there is no ordering enforced between them.)

NOTE: A data race does not always lead to an error! So a data race does not necessarily imply a race condition.

36
Q

What does it mean to “reduce” an array?

A

FROM WHAT I CAN TELL, it means to combine an array into a single value, most commonly by summing its elements. A lot of programming languages have a reduce() method for arrays (Swift, JavaScript, etc.).

In Swift, for example, arr.reduce(0, +) starts with a value of 0, combines it with the first element as 0 + arr[0], then combines that result with arr[1], and so on until it has summed all the elements of arr. The same pattern works with other binary operators; for a parallel reduce, the operator should be associative.
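
A hedged Python analogue of the Swift example (functools.reduce is the standard-library equivalent):

from functools import reduce
import operator

arr = [1, 2, 3, 4]
total = reduce(operator.add, arr, 0)     # ((((0 + 1) + 2) + 3) + 4) = 10
product = reduce(operator.mul, arr, 1)   # other binary operators work the same way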

37
Q

Vector notation tricks

A

In this class we will do a few things to make pseudocode more compact:

1) Slicing: instead of looping over an index variable to access an array, we can slice a la Python, like arr[1:n], or arr[:] for all elements of the array. For example, the element-wise sum of two arrays can be written z[:] = x[:] + y[:].

2) Temporary arrays are implied: instead of writing them out, we might do something like the following (see the NumPy sketch after this list).

t[:] <- A[i,:]*x[:]
y(i) <- y(i) + reduce(t)

becomes...

y(i) <- y(i) + reduce(A[i,:]*x[:])

...where we recognize that there is an implicit temporary array inside the reduce.

3) We should also recognize that something like t[:] <- A[i,:]*x[:] is easily parallelizable with par-for, so use the work and span of par-for when analyzing it.
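
A hedged NumPy rendering of the matrix-vector row update above (the array shapes are made up for illustration):

import numpy as np

A = np.arange(12.0).reshape(3, 4)   # hypothetical 3x4 matrix
x = np.ones(4)
y = np.zeros(3)
i = 1
# The temporary array A[i,:]*x[:] is implicit inside the reduce (here, np.sum).
y[i] = y[i] + np.sum(A[i, :] * x[:])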

38
Q

Good algorithm vs bad algorithm in terms of DAG height, width, and number of operations

A

A good algorithm has low "mass" (few operations) and is short and wide, i.e., lots of parallelism, little work, low span.

A bad algorithm has many operations and is tall and skinny, i.e., a lot of work, large span, and little parallelism.

39
Q

Divide and conquer: how important is it here?

A

It should usually be the first thing you try!