lecture 13 - big data + data analysis Flashcards

1
Q

The three Vs of Big Data

A
  1. Volume (scale)
  2. Velocity (speed of info. production)
  3. Variety (diversity of forms)

(Laney)

2
Q

big data - examples + research

A

e.g. social media, Smart Watches, Smart Homes, Video Doorbells, Classroom Scanners, etc.

Research = XXL large-N studies
- not possible without computer-assisted content analysis

  • Data Mining = throw data in computer, let computer process it (without any guidance), to see if it can detect patterns/relations/correlation
    = data-driven search for correlations/patterns (rather than logical theory development)
  • Machine Learning = computers ‘learn’ problem solving algorithms from data (AI approaches)
    instead of programming algorithms
    !they can also be wrong, but sometimes can come up with good strategies

data mining = less sophisticated than machine learning
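
A minimal sketch of the data-mining idea above, assuming a tiny made-up dataset: every pair of variables is correlated without any guiding theory, and the pairs are ranked by strength. The variable names and values are hypothetical, not from the lecture.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def scan_correlations(data):
    """Correlate every pair of variables and rank pairs by |r|, strongest first."""
    pairs = [(a, b, pearson(data[a], data[b])) for a, b in combinations(data, 2)]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Hypothetical toy data: one list per variable, one position per case.
data = {
    "hours_online": [1, 2, 3, 4, 5],
    "posts": [2, 4, 6, 8, 10],
    "age": [30, 25, 41, 22, 35],
}
ranking = scan_correlations(data)
for a, b, r in ranking:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

The scan only reports which pairs move together; whether a pattern means anything still needs substantive interpretation.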

3
Q

big data example: fake news on twitter

A

tweet-based analysis:

  • sample of Twitter users with voter registration
  • focus on Tweets with links to political websites
  • using list of ‘fake news’ websites to classify/code tweets
  • calculate share of ‘fake news’ links by Tweet exposure & posts

findings = (extreme) left less likely to be exposed to fake news than (extreme) right
*superconsumers + supersharers = high exposure
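
The coding steps above can be sketched roughly as follows. This is an illustration only: the domain list, the tweet records and the function names are hypothetical, and a real study would rely on a curated list of ‘fake news’ websites.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical list of 'fake news' domains; a real study would use a curated list.
FAKE_DOMAINS = {"fakenews.example"}

def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def fake_share_by_user(tweets):
    """Share of each user's political links that point to a listed domain."""
    total = defaultdict(int)
    fake = defaultdict(int)
    for tweet in tweets:
        total[tweet["user"]] += 1
        if domain_of(tweet["url"]) in FAKE_DOMAINS:
            fake[tweet["user"]] += 1
    return {user: fake[user] / total[user] for user in total}

# Hypothetical tweets that each contain one link to a political website.
tweets = [
    {"user": "a", "url": "https://www.fakenews.example/story1"},
    {"user": "a", "url": "https://news.example/report"},
    {"user": "b", "url": "https://news.example/other"},
]
shares = fake_share_by_user(tweets)
print(shares)
```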

4
Q

big data: ethical questions + recommendations

A

ownership

  • personal vs commercial (who gets the decision of informed consent)
  • public vs private information

ethical principles

  1. minimizing harm
    potential benefits (public interest) vs potential harm (rights, reputation, money)
  2. informed consent
    retroactively vs in advance
  3. protecting privacy and confidentiality
    intent and expectation of users/sources is crucial
    (rule of thumb: people don’t think about whether their information will be used)
5
Q

big data ethics recommendations
5

A
  • anonymization of personal data
    !quoting specific posts can still lead to identification
    pseudonymization: remove identifiers + store them separately in case you want to be more transparent
  • data minimization (only collect what you really need)
  • data encryption (not everyone can access)
  • secure storage
  • arrangements that enable data subjects to exercise their fundamental rights = ways for people to agree
    (e.g. direct access to their personal data and consent to its use or transfer)
    (e.g. if you remove data from the site, then it would also be removed from the data available for research)
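
A minimal sketch of pseudonymization as described above, assuming records are simple dictionaries. The field names and the function are hypothetical; the key table would have to be stored separately and securely (e.g. encrypted), which this sketch does not do.

```python
import secrets

def pseudonymize(records, id_fields=("name", "email")):
    """Split records into research data (no direct identifiers) and a key table."""
    research_data, key_table = [], {}
    for record in records:
        pseudo_id = secrets.token_hex(8)  # random code, not derived from the person
        key_table[pseudo_id] = {f: record[f] for f in id_fields}
        cleaned = {k: v for k, v in record.items() if k not in id_fields}
        cleaned["pseudo_id"] = pseudo_id
        research_data.append(cleaned)
    return research_data, key_table

# Hypothetical survey records.
records = [
    {"name": "Alex Example", "email": "alex@mail.example", "trust_in_media": 4},
    {"name": "Sam Sample", "email": "sam@mail.example", "trust_in_media": 2},
]
research_data, key_table = pseudonymize(records)
print(research_data)
```

Deleting the key table later turns the pseudonymized data into anonymized data, as long as the remaining variables cannot identify anyone.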
6
Q

secondary data

A

= data collected by other researchers/institutions & often made available at no cost, e.g. in data archives

trade-off

  • quick & convenient access to data
  • lack of control & constraints on measurement
    e.g. questions not exactly phrased the way you want

assessment of validity & reliability

  • during data analysis: researcher needs to look at reliability and validity whilst analyzing the data
    !is necessary if you do something with the data beyond what it was created for

Ethical principles still apply -> only informed consent is not required (already given)
- still ethics review necessary (esp. for privacy and confidentiality)

7
Q

data management

A
  • enter own data or import existing (numerical) data into a database/spreadsheet

typical structure =

  • columns = variables/categories/dimensions
  • rows = data for individual cases (e.g. each participant)

!!documentation and archiving = make a BACKUP
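
The typical structure above, sketched as a small CSV table. The variable names and values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical survey data: each dict is one case (row), each key one variable (column).
cases = [
    {"participant_id": 1, "age": 34, "agreement": 4},
    {"participant_id": 2, "age": 27, "agreement": 2},
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["participant_id", "age", "agreement"])
writer.writeheader()
writer.writerows(cases)

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back gives one dict per row again (all values come back as strings).
rows = list(csv.DictReader(io.StringIO(csv_text)))
```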

8
Q

levels of measurement
4

A
  • nominal = mutually exclusive, equivalent, and exhaustive categories
    e.g. gender, ethnicity, religion, countries
  • ordinal = rank ordering (with/without ties) on some dimension
    e.g. agreement-scale, evaluation, arbitrary intervals (e.g. time)
    !intervals are arbitrary, but can be ranked
  • interval = precise measurement units with arbitrary zero point
  • ratio = precise measurement units with absolute zero point

interesting: statistical analysis usually assumes interval or ratio level data, but most data is ordinal (e.g. agreement scales in surveys)
-> ordinal scales are sometimes treated as interval scales (requires heroic assumptions: assumes that the interval between the units is equal)
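
A small sketch of what the levels of measurement allow in practice, using hypothetical data: counting (mode) for nominal data, ranking (median) for ordinal data, and a mean only once the scale is treated as interval.

```python
from statistics import mean, median, mode

religion = ["A", "B", "A", "C", "A"]  # nominal: categories only
agreement = [1, 2, 2, 4, 5]           # ordinal: 1 = strongly disagree ... 5 = strongly agree

most_common = mode(religion)      # counting is all that nominal data allows
middle_rank = median(agreement)   # rank order is enough for a median
# Treating the scale as interval: only meaningful if the steps 1->2, 2->3, ... are equal.
average = mean(agreement)
print(most_common, middle_rank, average)
```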

9
Q

tables - how to do it

A
  • meaningful title
  • self-explanatory labels
  • consistent number format
  • (total) number of cases
  • data source(s)

Democratic Peace Theory
table 1: Wars by type of regime and time period
columns: wars 1800-1939, 1940-2010, totals
rows: dem-dem, dem-aut, aut-aut, total

10
Q

chart types

A
  • bar charts = best choice for cross-sectional data
  • line charts = best choice for time series data
  • pie charts = almost always a bad choice
11
Q

scatterplots

A

= two interval scales

scatterplotting data is a good idea: it visualizes numerical trends

diff. forms of scatterplots can have the same linear regression line
the same regression line can, in the scatterplot, turn out to be a

  • linear relationship
  • non-linear relationship
  • linear relationship with an outlier
  • no relationship
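
A sketch of the point above. The values are the first three datasets of Anscombe's quartet, a well-known illustration that may or may not be the one used in the lecture; the labels and the `fit_line` helper are added here for illustration. All three datasets give nearly the same least-squares line, although their scatterplots look very different.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# The three datasets share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "linear": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "non-linear": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "linear with outlier": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
}

fits = {name: fit_line(x, ys) for name, ys in datasets.items()}
for name, (intercept, slope) in fits.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.2f} * x")
```

Only plotting the points reveals which of the three situations applies.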
12
Q

inferential statistics basic idea + key assumptions

A

= making inferences from sample to population

  1. sampling from complete target population
  2. SRS (simple random sampling) with perfect response rate (or non-response completely random)
  3. no nonsampling error
13
Q

statistical significance

A

sample is never 100% representative, there is always some error, we don’t know how much -> we can make informed guesses

rule of thumb =

  • observed difference/relationship > random/sampling error
  • Statistical tests are “mechanical” tests, not a substitute for substantive interpretation

sampling error = difference between different samples
sampling distribution tells us how confident we can be about whether the observed difference/mean in the single sample is just random sampling error or represents a real, meaningful difference
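
A small simulation of sampling error, assuming a made-up population: many simple random samples are drawn from the same population, and the spread of their means approximates the sampling distribution. Population size, sample size and number of samples are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 scores (mean about 50, SD about 10).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Draw many simple random samples (n = 100) and record each sample mean.
sample_means = [mean(random.sample(population, 100)) for _ in range(1_000)]

standard_error = stdev(sample_means)
print(f"population mean: {mean(population):.2f}")
print(f"spread of the sample means (standard error): {standard_error:.2f}")
print(f"theoretical standard error: {stdev(population) / 100 ** 0.5:.2f}")
```

An observed difference that is small relative to this spread could easily be produced by sampling error alone.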

14
Q

significance testing: issues and problems
4

A
  • statistical vs practical significance
    e.g. the same statistic/correlation with a small sample -> large random error -> not significant
    the same statistic/correlation with a large sample -> small random error -> highly significant
    -> everything becomes significant in large-N
  • population (e.g. all countries) & significance testing
    (no sample -> no SE -> no inferential statistics = view 1)
    (population not static but changes over time thus (even) a single census is just one cross-sectional snapshot over time = view 2)
  • non-probability samples, e.g. internet panels
    increasingly hard to get probability samples (low response rate) -> a lot of research runs via internet panels = not representative -> needs difficult statistical corrections to fix this
  • publication bias & significance testing
    (no findings is boring -> does not get published -> what you read in journals is not representative (it ignores studies that ‘‘failed’’))
    solution: pre-registration, so that ‘‘failed studies’’ show up

a lot of significance depends on the size of the sample

discussion of results should focus on: pattern of results + substantive meaning (an effect shouldn’t be so small that it is only meaningful in statistics)
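
The statistical vs practical significance point can be illustrated with the standard t statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)). The correlation and the sample sizes below are hypothetical values chosen for illustration.

```python
from math import sqrt

def t_for_correlation(r, n):
    """t statistic for testing a Pearson correlation r from n cases against zero."""
    return r * sqrt((n - 2) / (1 - r ** 2))

r = 0.2  # the same modest correlation in both studies (hypothetical value)
t_small = t_for_correlation(r, n=20)
t_large = t_for_correlation(r, n=2000)

# |t| above roughly 2 is the usual rough cut-off for p < .05 (two-sided).
print(f"n = 20:   t = {t_small:.2f}")
print(f"n = 2000: t = {t_large:.2f}")
```

The correlation is equally weak in both cases; only the sample size changes whether it counts as statistically significant.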

15
Q

outlook: science in the ‘‘post-truth’’ era

A

contemporary threats to sciences

  • external: distrust of scientific research, accusations of ‘fake science’ (people make up data)
  • internal: wrong incentives, quantity over quality (publications), production of ‘fake science’

outlook =

  1. correct incentives
  2. counter ‘fake news’
  3. commitment to and practice of good science, based on ethical principles and scientific integrity (and through education)
16
Q

what is a confidence interval (CI)

A

= an interval drawn around an estimated sample statistic such as a mean, a difference in means, or a regression coefficient.
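
A minimal sketch of a confidence interval around a sample mean, assuming the normal approximation (estimate plus or minus z times the standard error). For small samples a t distribution would usually be used instead; the scores and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval around a sample mean."""
    m = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return m - z * standard_error, m + z * standard_error

# Hypothetical sample of agreement scores, treated as interval data.
scores = [3, 4, 2, 5, 4, 3, 4, 5, 3, 4]
low, high = mean_confidence_interval(scores)
print(f"mean = {mean(scores):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```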