week 4: big data Flashcards

(71 cards)

1
Q

what is big data

A

large amount of user-generated content (UGC)

2
Q

challenges big data

A

storing, processing, and transferring data from storage to computing nodes

3
Q

5 big data characteristics

A

volume: data size
variety: different formats
velocity: speed of change
veracity: uncertainty of data
value: turn data into value

4
Q

let the data…

A

speak

5
Q

don't be fixated on… but discover…

A

causality … patterns & correlations

6
Q

in big data, no need for..

A

sampling

7
Q

info extraction:

A

identify specific pieces of info (data) in an unstructured or semi-structured textual document

8
Q

transform unstructured info in a corpus of different documents or web pages into…

A

a structured database

9
Q

many web pages are generated automatically from…

A

an underlying database

10
Q

HTML structure of pages is..

A

fairly specific and regular (semi-structured)

11
Q

wrapper =

A

extractor for a semi-structured website

12
Q

screen scraping =

A

process of extracting data from HTML pages

13
Q

another way to extract info from HTML pages besides screen scraping

A

API

14
Q

slots in template are typically filled by…

A

substring from the document

15
Q

some slots may have…

A

a fixed set of prespecified possible fillers that may not occur in the text itself

16
Q

some slots may allow…

A

multiple fillers

17
Q

some domains may allow…

A

multiple extracted templates per document
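
A minimal sketch of how a template with slots might be represented, assuming a made-up job-posting text. The `JobTemplate` class and its slot names (`title`, `languages`) are hypothetical and only illustrate the ideas above: fillers are substrings of the document, and one slot accepts multiple fillers.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobTemplate:
    # Hypothetical slots for a job-posting domain (illustrative only).
    title: Optional[str] = None                          # single filler
    languages: List[str] = field(default_factory=list)   # multiple fillers allowed


def extract(document: str) -> JobTemplate:
    template = JobTemplate()
    # The filler is the substring after "Title:" up to the end of that line.
    match = re.search(r"Title:\s*(.+)", document)
    if match:
        template.title = match.group(1).strip()
    # findall returns every match, so this slot can hold several fillers.
    template.languages = re.findall(r"\b(Python|Java|SQL)\b", document)
    return template


doc = "Title: Data Engineer\nSkills: Python, SQL and some Java"
result = extract(doc)
print(result)
```

A domain that allows multiple templates per document could call a function like `extract` once per posting found on the page and collect the results in a list.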

18
Q

specifying an item to extract for a slot using a regular expression pattern may require

A

prefiller/preceding pattern to identify proper context & succeeding/postfiller pattern to identify the end of the filler
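
As a rough illustration of the prefiller / filler / postfiller idea, the sketch below glues three regex fragments together and captures only the filler. The HTML fragment and the variable names are assumptions made up for this example.

```python
import re

# Hypothetical auto-generated HTML fragment (illustrative only).
html = "<tr><td>Price:</td><td><b>$19.99</b></td></tr>"

prefiller = r"Price:</td><td><b>"   # context that must precede the filler
filler = r"\$\d+\.\d{2}"            # the text we actually want to extract
postfiller = r"</b>"                # marks where the filler ends

# Only the filler is wrapped in a capturing group.
pattern = re.compile(prefiller + "(" + filler + ")" + postfiller)
match = pattern.search(html)
price = match.group(1) if match else None
print(price)
```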

19
Q

regular expression operators:

A

or: |
grouping: ()
repetition zero or more: *
repetition one or more: +
repetition zero or one (optional): ?
sequencing: order of elements
cardinality {m,n}: m is the min number of reps and n is the max number of reps ({m} alone is an exact number)
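
The operators above can be tried out with Python's `re` module. This is a small sketch with made-up test strings; each pattern is checked with `re.fullmatch`, so the whole string has to match.

```python
import re

# Each entry: (operator, pattern, strings that should match, strings that should not)
examples = [
    ("or |", r"cat|dog", ["cat", "dog"], ["cow"]),
    ("grouping ()", r"(ab)+", ["ab", "abab"], ["aba"]),
    ("zero or more *", r"ab*", ["a", "abbb"], ["b"]),
    ("one or more +", r"ab+", ["ab", "abb"], ["a"]),
    ("zero or one ?", r"colou?r", ["color", "colour"], ["colouur"]),
    ("sequencing", r"abc", ["abc"], ["acb"]),
    ("cardinality {m,n}", r"a{2,3}", ["aa", "aaa"], ["a", "aaaa"]),
    ("exact count {m}", r"a{2}", ["aa"], ["a", "aaa"]),
]


def check(pattern, accepted, rejected):
    # True when every accepted string matches in full and no rejected string does.
    ok_accept = all(re.fullmatch(pattern, s) for s in accepted)
    ok_reject = not any(re.fullmatch(pattern, s) for s in rejected)
    return ok_accept and ok_reject


results = {name: check(p, acc, rej) for name, p, acc, rej in examples}
for name, ok in results.items():
    print(f"{name:20} {'ok' if ok else 'MISMATCH'}")
```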
20
Q

regular expression operators for characters

A

any character: .
word boundary: \b
any digit: \d
escape sequence: \ before a special character, e.g. \. for a literal dot
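
A short sketch of the character operators in Python's `re` module. The sample text is invented for the example.

```python
import re

text = "Room 42: cat.com, catXcom, concat"

digits = re.findall(r"\d+", text)            # \d matches any digit
whole_word = re.findall(r"\bcat\b", text)    # \b requires a word boundary on both sides
any_char = re.findall(r"cat.com", text)      # . matches any single character
literal_dot = re.findall(r"cat\.com", text)  # \. matches only a literal dot

print(digits, whole_word, any_char, literal_dot)
```

The unescaped dot also accepts `catXcom`, which is why escaping matters when the pattern is meant to match a real period.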

21
Q

extract slots in order:

A

start the search for the filler of slot n+1 where the filler of slot n ended. This assumes slots always appear in a fixed order.
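
One possible way to sketch in-order extraction in Python is to pass a start position to a compiled pattern's `search` method. The document text, slot names, and patterns below are all hypothetical.

```python
import re

# Hypothetical document whose slots appear in a fixed order: name, date, amount.
document = "Invoice for Alice Smith dated 2024-03-01 totalling 250 EUR"

slot_patterns = [
    ("name", re.compile(r"for (\w+ \w+)")),
    ("date", re.compile(r"dated (\d{4}-\d{2}-\d{2})")),
    ("amount", re.compile(r"totalling (\d+)")),
]


def extract_in_order(text):
    filled = {}
    position = 0
    for slot, pattern in slot_patterns:
        # Search only from where the previous filler ended.
        match = pattern.search(text, position)
        if match is None:
            filled[slot] = None
            continue
        filled[slot] = match.group(1)
        position = match.end(1)
    return filled


slots = extract_in_order(document)
print(slots)
```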

22
Q

Make patterns specific enough to…

A

identify each filler always starting from the beginning of the document

23
Q

if a slot has a fixed set of pre specified possible fillers…

A

text categorisation can be used to fill the slot (e.g. job category, company type). Treat each of the possible values of the slot as a category and classify the entire document to determine the correct filler.
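
A toy sketch of filling a slot by text categorisation. It assumes a tiny invented training set and uses a hand-rolled naive Bayes classifier with uniform priors as a stand-in for whatever classifier a real system would train. Note that the chosen filler ("IT") never appears in the document itself.

```python
import math
from collections import Counter

# Tiny hypothetical training set: (text, category). Categories are the slot's possible fillers.
training = [
    ("software engineer python developer code", "IT"),
    ("nurse hospital patient care", "Healthcare"),
    ("accountant audit tax finance", "Finance"),
]


def train(examples):
    word_counts = {}
    vocabulary = set()
    for text, category in examples:
        counts = word_counts.setdefault(category, Counter())
        words = text.lower().split()
        counts.update(words)
        vocabulary.update(words)
    return word_counts, vocabulary


def classify(document, word_counts, vocabulary):
    # Words never seen in training are ignored.
    words = [w for w in document.lower().split() if w in vocabulary]
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total = sum(counts.values())
        # Laplace-smoothed log-likelihood; priors are assumed uniform.
        score = sum(
            math.log((counts[w] + 1) / (total + len(vocabulary))) for w in words
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


word_counts, vocabulary = train(training)
posting = "We seek a developer who writes Python code for our hospital billing team"
category = classify(posting, word_counts, vocabulary)
print(category)
```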

24
Q

if extracting from automatically generated web pages … usually work

A

simple regex patterns

25
if extracting from more natural, unstructured, human-written text, .... may help
some natural language processing
26
Sorts of NLP:
POS tagging, syntactic parsing, semantic word categories
27
extraction patterns can use ...
POS or phrase tags (in prefillers/fillers)
28
rapier system learns 3 regex style patterns for each slot:
prefiller pattern, filler pattern and postfiller pattern
29
always evaluate IE performance on...
independent, manually annotated test data not used during system development
30
measure each test document with:
N (total # of correct extractions in the solution template); E (total # of slot/value pairs extracted by the system); C (# of extracted slot/value pairs that are correct)
31
recall
C/N
32
Precision
C/E
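
A worked sketch of the two measures, using invented slot/value pairs for a single test document. The `score` helper is hypothetical; it just computes C/N and C/E from two sets.

```python
def score(solution, extracted):
    """Both arguments are sets of (slot, value) pairs."""
    n = len(solution)               # N: correct extractions in the solution template
    e = len(extracted)              # E: pairs the system extracted
    c = len(solution & extracted)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    return recall, precision


# Hypothetical annotated solution and system output.
solution = {
    ("title", "Data Engineer"),
    ("city", "Utrecht"),
    ("language", "Python"),
    ("language", "SQL"),
}
extracted = {
    ("title", "Data Engineer"),
    ("city", "Amsterdam"),
    ("language", "Python"),
}

recall, precision = score(solution, extracted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here N = 4, E = 3 and C = 2, so recall is 2/4 and precision is 2/3.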
33
if relevant docs were all available in standardized XML format...
IE would be unnecessary
34
hard to get standardized XML format because
it is difficult to format; it is difficult to manually annotate docs with good XML tags; commercial industry might be reluctant to provide it
35
IE provides a way of...
automatically transforming semi-structured or unstructured data into an XML-compatible format
36
web extraction may be aided by...
first parsing web pages into DOM trees
37
after parsing web pages into DOM trees...
extraction patterns can be specified as paths from the root of the DOM tree to the node containing the text to extract
38
even when using DOM trees, may still need...
regex patterns to identify the proper portion of the final character data node
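
A small sketch combining both steps: a path from the root of the tree selects the node, then a regex picks the relevant part of its text. It assumes a made-up, well-formed page so that the standard-library XML parser can read it; real HTML is often not valid XML and would need a dedicated HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical well-formed page (illustrative only).
page = """<html><body>
<div class="product"><h1>Blue Kettle</h1><p>Price: 24.95 EUR (incl. VAT)</p></div>
</body></html>"""

root = ET.fromstring(page)

# Extraction pattern expressed as a path from the root to the target node.
node = root.find("./body/div/p")
text = node.text

# The node holds more than the value we want, so a regex narrows it down.
match = re.search(r"\d+\.\d{2}", text)
price = match.group(0) if match else None
print(price)
```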
39
REST =
Representational State Transfer
40
REST means
a collection of network architecture principles which outline how resources are defined and addressed. It's not a standard, but it uses several standards.
41
HTTP is a
communications protocol that allows retrieving interlinked text documents (WWW)
42
motivation for REST was...
to capture the characteristics of the web that made the web successful (make request, HTTP protocol, URI)
43
main concepts of REST
nouns (resources) -> unconstrained (full website); verbs -> constrained (GET); representations -> constrained (XML)
44
a resource is
a conceptual mapping to a set of entities, represented with a global identifier
45
verbs:
represent actions to be performed on resources
46
HTTP GET
how clients ask for the info they seek
47
HTTP PUT
updates a resource
48
HTTP POST
creates a resource
49
HTTP DELETE
removes the resource identified by the URI
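
To make the four verbs concrete, here is a sketch of an in-memory stand-in for a REST server. Nothing here touches the network: `ResourceStore`, its `handle` method, and the `/cards` URIs are hypothetical names invented for this example, and the verb semantics follow the cards above (POST creates, PUT updates).

```python
class ResourceStore:
    """Hypothetical in-memory stand-in for a REST server (no real networking)."""

    def __init__(self):
        self.resources = {}   # URI -> representation
        self.next_id = 1

    def handle(self, verb, uri, body=None):
        if verb == "POST":
            # Create a new resource under the collection URI.
            new_uri = f"{uri}/{self.next_id}"
            self.next_id += 1
            self.resources[new_uri] = body
            return 201, new_uri
        if uri not in self.resources:
            return 404, None
        if verb == "GET":
            # Return the current representation of the resource.
            return 200, self.resources[uri]
        if verb == "PUT":
            # Update the existing resource.
            self.resources[uri] = body
            return 200, body
        if verb == "DELETE":
            # Remove the resource identified by the URI.
            del self.resources[uri]
            return 204, None
        return 405, None


store = ResourceStore()
created = store.handle("POST", "/cards", {"q": "rest =", "a": "?"})
uri = created[1]
updated = store.handle("PUT", uri, {"q": "rest =", "a": "representational state transfer"})
fetched = store.handle("GET", uri)
deleted = store.handle("DELETE", uri)
missing = store.handle("GET", uri)
print(created, fetched, deleted, missing)
```

The set of verbs stays small and fixed while the set of URIs is open-ended, which mirrors the constrained verbs and unconstrained nouns of REST.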
50
representations:
how data is represented/returned to the client for presentation (javascript or xml, can be multiple)
51
REST is so named because
the client application changes (transfers) state with each resource representation
52
javascript
HTML to define the content of web pages | CSS to specify the layout of web pages | JavaScript to program the behavior of web pages
53
common uses of JavaScript
form validation, page embellishments and special effects, dynamic content manipulation, emerging web 2.0
54
ajax characteristics
increased responsiveness and interactivity of web pages. By exchanging small amounts of data with the server, the entire web page doesn't have to be reloaded each time the user performs an action
55
ajax is not
a technology in itself, but a term that refers to the use of a group of technologies
56
core and defining element of ajax is
the XMLHttpRequest object (page doesn't need to refresh)
57
elaborate characteristics of ajax:
user-driven, views defined by URLs, simple user interaction model, synchronous interaction
58
components of an ajax interaction
1) a client event occurs; 2) an XMLHttpRequest is created and configured; 3) an asynchronous request is made to the server via the XMLHttpRequest object; 4) the server processes the request and returns data; 5) the client executes a callback in the XMLHttpRequest object; 6) the HTML DOM is updated based on the response data
59
DOM:
Document Object Model, a platform- and language-independent way to represent XML
60
ajax dangers
hype; application development/maintenance cost; behavior not web-like; security issues
61
parallel/distributed processing of data
sisd, simd, misd, mimd (between control unit and processor unit)
62
sisd
single instruction stream, single data stream -> serial processor
63
simd
single instruction stream, multiple data stream -> array processor
64
mimd
multiple instruction stream multiple data stream -> multiprocessor or multicomputer
65
misd
multiple instruction stream, single data stream -> no examples
66
a distributed algo is an algo in which...
processors can't determine the state of the other processors (they need messages)
67
parallel algos are a subset of
distributed algos, but they can determine state of other processors
68
efficiency is a measure of..
the fraction of time that a processor spends on performing useful work
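
As a quick arithmetic sketch of this definition, efficiency can be computed as useful time divided by total time. The timings below are invented for four hypothetical processors.

```python
def efficiency(useful_seconds, total_seconds):
    # Fraction of the total time spent on useful work.
    return useful_seconds / total_seconds


# Hypothetical measurements per processor: (useful work, total wall-clock time)
measurements = [(8.0, 10.0), (9.0, 10.0), (7.0, 10.0), (6.0, 10.0)]

per_processor = [efficiency(u, t) for u, t in measurements]
overall = sum(u for u, _ in measurements) / sum(t for _, t in measurements)

print(per_processor, overall)
```

With 30 useful seconds out of 40 in total, the overall efficiency is 0.75; the remainder would be time lost to things like waiting or communication.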
69
what is a regular expression
a regular expression is a special string used to define string patterns
70
role of prefiller and postfiller patterns in info extraction
they are responsible for defining the context in which the filler pattern operates
71
how do you fill slots that take values only from a fixed prespecified set of fillers (which don't necessarily appear in the text)?
text categorisation is usually used. You need to treat each of the possible values of the slot as a category and classify the entire document to determine the correct filler