Week 6 (large scale data analysis using mapreduce Flashcards

1
Q

what are these examples of?

A

distance measures

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

what is the goal behind link analysis

A

understanding large problem w/ unstructured data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

how did early search engines work

A

crawled the web and listed the terms in an inverted index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

what is an inverted index in the the context of storage of terms

A

data structure which makes it easy, given a term, to find pointer to all places where that term appears

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

create an inverted index for the given texts

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

a term search for the terms Colorado State University, would give what set

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

what is term spam

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

goals behind a page rank algorithim

A

provide effective sumnmaries for search results

ordering/ranking results

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

why does a pagerank algorithim simulate random surfers

A

pages that would have a large number of surfers were considered more “important” than pages that would be barely visited

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

how is a page judged in a page rank algorithim (basic)

A

terms occuring on the page

and terms used in or near links to that page

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

what is the highest level definition of pagerank

A

function which assigns a real number to each page on the web

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

what does a higher pagerank of a page mean

A

the more important it is

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

given this graph, suppose a random surfer starts at page A, what are the probs that a surfer will be on each page in the next step

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

now suppose surfer at B, whats the prob of next step

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

for pagerank, what is the dimension of the transition matrix M

A

n columns and n rows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

what does the transition matrix of this graph look like

17
Q

what does the first and second column of this transition matrix mean

18
Q

what does this matrix mean/represent?

19
Q

for the transition matrix M, if we surf at any of the n page of web with equal prob. What will the inital vector look like and where is it located

A

1/n for each component

farthest left, first col vector (i think)

20
Q

what does the matrix mean and what does the vector mean?

and what does this equal

A

M = prob for next step from current location

Vec = probability for being in the current location

== distribution of the surfer after this step, probability for being in the next possible location

21
Q

what is v_j and m_ij ?

A

m_ij is prob that a surfer at node J will move to node i at the next step

v_j is ther prob that the surfer was at node j at prev step

22
Q

what is x_i

A

prob that a random surfer will be at node i at the next step

23
Q

what 2 conditions

A

graph is strongly connected

there are no dead ends (nodes that have no arcs out)

24
Q

how many iters to have v_j converge

A

50-75 iters

25
26
why do we need to do matrix vector multiplication in mapreduce with a sufficiently large dataset
vector v cannot fit in main memory more than 1.4 billion pages
27
in matrix-vector multiplication, what will be stored in the files
matrix M and vector V stored seperatly
28
in matrix-vector multiplication, what does the map function do
29
in matrix-vector multiplication, what does the reduce function look like
30
in matrix-vector multiplication, what do you do if the vector v cant fit in main memory
31
in matrix-vector multiplication, what does stripe-ing look like
32
what is a dead end and why is it a problem
a page with no links out, surfers reaching page will dissapear
33
what is a spider trap
groups of pages that all have outlinks but never link to any other pages
34
what would happen if dead ends were allowed
35
2 solutions for dealing with dead ends
36
in recursive deletion, on the pages that are deleted, whats their page rank
the parent, or if the parent is deleted then its the parent of that it just goes down the tree
37
with recursive deletion, the sums will eventually exceed 1. how does this change the definition of our matrix
38