Big Data Flashcards

(54 cards)

1
Q

What is big data?

A

Big data is a large or complex dataset that often needs terabytes or petabytes of storage.

2
Q

What are the 4 terms used to define the characteristics of big data?

A

Volume
Velocity
Variety
Veracity

3
Q

What are the 3 additional terms regarding data relevance?

A

Variability
Value
Visualisation

4
Q

Volume

A

The computing capacity required to store and analyse data

5
Q

Velocity

A

The speed at which data are created and analysed

6
Q

Variety

A

The types of data sources available (text, images, social media, administrative)

7
Q

Veracity

A

The accuracy and credibility of data

8
Q

Variability

A

The internal consistency of your data

9
Q

Value

A

The costs required to undertake big data analysis should pay dividends for your organisation and its patients

10
Q

Visualisation

A

The use of novel techniques to communicate patterns that would otherwise be lost in massive tables of data

11
Q

Where do big data come from?

A

1) Electronic or health records
2) The internet (IoT: Internet of Things)
3) research or data repositories
4) social media

12
Q

What is data linkage?

A

It is the process of matching records from different sources based on key information

13
Q

What is deterministic data linkage?

A

Records are matched exactly on personal information that appears in all of the datasets to be linked. N.B. the matches have to be exact.
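
A minimal sketch of deterministic linkage in Python. The records, the field names (`nhi_enc`, `dob`) and the helper `deterministic_link` are hypothetical and made up for illustration; two records link only when every key field matches exactly.

```python
def deterministic_link(left, right, keys):
    """Link records from two datasets when ALL key fields match exactly."""
    # Index the right-hand dataset by its tuple of key values.
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)

    linked = []
    for rec in left:
        # Only an identical key tuple counts as a match.
        for match in index.get(tuple(rec[k] for k in keys), []):
            linked.append({**match, **rec})
    return linked


# Hypothetical example records (not real data).
health = [
    {"nhi_enc": "a1", "dob": "1990-01-01", "gp_visits": 3},
    {"nhi_enc": "b2", "dob": "1985-06-30", "gp_visits": 0},
]
census = [
    {"nhi_enc": "a1", "dob": "1990-01-01", "region": "Auckland"},
    {"nhi_enc": "b2", "dob": "1985-06-03", "region": "Otago"},  # dob typo
]

links = deterministic_link(health, census, keys=["nhi_enc", "dob"])
```

Only the first pair links. The second pair differs by a transposed digit in the date of birth, which is the kind of record an exact-match rule misses.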

14
Q

What is probabilistic data linkage?

A

Statistical weights are used to calculate the probability that data from different sources refer to the same individual
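
One common way to turn this into numbers is a Fellegi-Sunter-style score, sketched below. This is an assumption about the method (the card does not name one), and the m/u probabilities and the prior are invented for illustration: `m` is the chance a field agrees when two records are the same person, `u` the chance it agrees when they are not.

```python
import math

# Hypothetical agreement probabilities per field (illustrative values only).
# m: P(field agrees | same person), u: P(field agrees | different people)
FIELDS = {
    "age": {"m": 0.95, "u": 0.02},
    "sex": {"m": 0.99, "u": 0.50},
    "residence": {"m": 0.90, "u": 0.01},
}


def match_weight(rec_a, rec_b, fields=FIELDS):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for name, p in fields.items():
        if rec_a[name] == rec_b[name]:
            total += math.log2(p["m"] / p["u"])              # agreement adds weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement subtracts
    return total


def match_probability(weight, prior=0.001):
    """Probability of a true match given the weight and an assumed prior."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** weight
    return odds / (1 + odds)


# Invented records sharing only age, sex and usual residence.
a = {"age": 34, "sex": "F", "residence": "Mt Eden"}
b = {"age": 34, "sex": "F", "residence": "Mt Eden"}
c = {"age": 61, "sex": "F", "residence": "Dunedin"}

w_same = match_weight(a, b)
w_diff = match_weight(a, c)
```

Pairs whose probability clears a chosen cut-off are accepted as links; the cut-off itself is a study design decision and is not specified here.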

15
Q

NHI (National Health Index)

A

It is a health number used to track your interactions with the health system.

Its main purpose is so that GPs, pharmacists and DHBs can be reimbursed for their data and services.

Increasingly, researchers are using encrypted versions of the NHI to investigate risk and protective factors associated with health outcomes.

16
Q

What is the IDI?

A

It is a large research database containing microdata about people and their households

The de-identified data come from a range of government and non-government agencies

17
Q

Benefits and limitations of the IDI

A

De-identified, linkable data accessed in a data safe haven

The resource is only as good as the data it contains:
- questions about data quality
- selection biases in data

Resident population definitions vary from study to study

18
Q

Some data the IDI has

A

Housing data
People and communities data
Education and training data
Income and work data
Benefits and social services data
Population data
Health data
Justice data

19
Q

Privacy

A

Refers to the ability of a person to control the availability of information about themselves

20
Q

Security

A

Refers to how an agency stores and controls access to the data it holds

21
Q

Confidentiality

A

Refers to protecting information from and about individuals and organisations, and ensuring that it is not made available or disclosed to unauthorised individuals or entities

22
Q

What are 3 key areas in which big data presents challenges?

A

1) Data governance
2) Data generation
3) Data output

23
Q

Data governance

A

A collection of practices and processes that help to ensure the formal management of data assets within an organisation

Includes storage, transfer, sharing and privacy

24
Q

Data generation

A

Data quality is even more important when looking at large volumes of data
The belief that larger numbers result in a more accurate picture is not necessarily true

Includes capturing, curating, updating and accuracy

25
Data output
Including analysis, querying large datasets, and generating meaningful and reliable outputs
26
Possible implications of using big data (7)
1) Data in = data out
2) Inadvertent discrimination of subpopulations
3) Possible to conduct "what if" scenarios to determine impacts of policy
4) Control over "your data"
5) Need to rethink privacy policies
6) Bias
27
IMD Employment
The extent of exclusion of the working-age population from employment. This is measured by the number of unemployment benefits being paid out in that neighbourhood
28
IMD Income
The extent of income deprivation in a data zone, measured by the amount of state-funded financial assistance given to people with insufficient income
29
IMD Crime
Measures the risk of victimisation and property damage/loss. It investigates the victim instead of the criminal because the effect on victims is more relevant to wellbeing. A shortcoming is that it does not measure the location of crime
30
IMD Housing
The proportion of houses in a neighbourhood that are overcrowded and the extent of renting
31
IMD Education
Captures youth disengagement and the proportion of the working-age population without a formal qualification
32
IMD Access
Measures the cost and inconvenience of travelling to access basic services
33
Definition of deprivation
The state of observable and demonstrable disadvantages relative to the local community, society or nation to which an individual or group belongs
34
How is neighbourhood deprivation measured?
It is measured by the census through a deficit model.
35
Ecological fallacy
Errors that arise from using information about groups to make assumptions about the individuals in those groups
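A small numeric illustration with invented figures: a neighbourhood can be labelled deprived while many of its residents are not, so assigning the group's label to each person misclassifies some of them. The 0.5 threshold and the helper names are assumptions for the sketch.

```python
# Invented data: 1 = deprived individual, 0 = not deprived.
neighbourhoods = {
    "A": [1, 1, 1, 0, 0],  # 60% deprived
    "B": [0, 0, 0, 0, 1],  # 20% deprived
}


def group_rate(people):
    """Proportion of individuals in the group who are deprived."""
    return sum(people) / len(people)


def misclassified_by_group_label(people, threshold=0.5):
    """Count individuals whose own status differs from the label
    their neighbourhood receives at the (assumed) threshold."""
    label = 1 if group_rate(people) >= threshold else 0
    return sum(1 for p in people if p != label)


errors = {name: misclassified_by_group_label(p) for name, p in neighbourhoods.items()}
```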
36
The Integrated Data Infrastructure (IDI) is a large research database containing microdata about people and households from different government agencies and other organisations across NZ. Briefly explain why the IDI has been referred to as a deficit data set.
To appear in the IDI, you have to have interacted with government agencies, such as hospitals or the police, which people may only interact with when there are problems in their lives
37
You are part of a research team interested in finding out whether people who have a wearable activity tracker are less likely to visit a doctor than those who do not. V3.com are able to provide these data to your research team, with an encrypted NHI, age, sex, ethnicity and usual residence information in health datasets. However, data from the wearable activity trackers are only available with age, sex and usual residence information. What method of data linkage discussed in class would be required to create your research data set?
Probabilistic, as not all of the key personal information is available in both datasets to be linked: the tracker data lack the encrypted NHI and, specifically, ethnicity
38
What are 2 reasons why you would recommend using IMD instead of NZDep?
1) The IMD has 28 indicators, which let us drill down into the drivers of deprivation. 2) The domains in the IMD can be used collectively or separately, which cannot be done with NZDep, so the IMD provides more flexibility in use
39
Advantages of IMD
1) Uses the IDI, which is more representative than the census
2) Explores drivers of deprivation; consists of specific indicators
3) Better small-area information, as the average IMD population size is 700
4) Forms specific solutions for small populations
40
What are the advantages of NZDep?
Weights domains; widespread and well known to policy makers and analysts
41
Disadvantages of IMD
Has not been used much; the quality of IDI data is highly variable, and the IMD inherits the other disadvantages of the IDI
42
Purpose of the IMD
https://www.fmhs.auckland.ac.nz/assets/fmhs/soph/epi/hgd/docs/Final_Brief%20report%20on%20the%20New%20Zealand%20IMD.pdf
43
Endemic disease
A disease that is constantly present in a given population
44
Endemic disease outbreak
Fluctuations in endemic diseases are to be expected. An outbreak is when occurrence exceeds expected levels
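A minimal sketch of "exceeds expected levels", assuming a simple rule that the card does not specify: flag a count above the baseline mean plus two standard deviations. The weekly counts and the function name `exceeds_expected` are invented for illustration.

```python
from statistics import mean, stdev


def exceeds_expected(baseline_counts, current_count, n_sd=2.0):
    """Flag a count above mean + n_sd * SD of the baseline (assumed rule)."""
    threshold = mean(baseline_counts) + n_sd * stdev(baseline_counts)
    return current_count > threshold


# Invented weekly case counts for an endemic disease.
baseline = [10, 12, 9, 11, 13, 10, 12]

normal_week = exceeds_expected(baseline, 13)    # within normal fluctuation
outbreak_week = exceeds_expected(baseline, 20)  # well above baseline
```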
45
Outbreak definition
The occurrence of cases of disease in a community or region where it would not normally be expected, or at a much greater level than expected. "Outbreak" is often used for smaller, localised increases in disease occurrence; epidemics usually cover larger geographic areas
46
Epidemic
Similar to an outbreak, but across a country rather than a localised event. Defined as the occurrence of disease at a level greater than would normally be expected; baseline levels are important, and there is rapid spread to many people
47
Definition of pandemic
An epidemic that has spread over several countries or continents, usually affecting a large number of people
48
Basic reproduction number
The basic reproduction number (R0) of an infection is the number of cases one case generates, on average, over the course of its infectious period. If R0 is less than 1, the infection will die out; if R0 is greater than 1, the infection can spread
49
How do you work out the basic reproductive rate?
Probability of infection being transmitted per contact × rate of contacts in the host population × duration of infectiousness
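The product above as a short Python sketch. The function name and the input values are invented for illustration; the rate of contacts and the duration must use the same time unit (days here).

```python
def basic_reproduction_number(p_transmission, contact_rate, duration):
    """R0 = transmission probability per contact
            x contacts per unit time x infectious period."""
    return p_transmission * contact_rate * duration


# Invented values: 5% transmission per contact, 10 contacts/day, infectious for 4 days.
r0_high = basic_reproduction_number(0.05, 10, 4)
# Same contacts and duration, but 2% transmission per contact.
r0_low = basic_reproduction_number(0.02, 10, 4)

can_spread = r0_high > 1  # above 1: the infection can spread
dies_out = r0_low < 1     # below 1: the infection will die out
```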
50
Limitations of the basic reproductive rate calculation
Assumes everyone is susceptible, which is not true
51
What is the SIR model?
The population is compartmentalised into different states:
1) Susceptible
2) Infected
3) Removed/Recovered
Some of the more complex models factor in exposed/not exposed states and vaccination, or use stochastic, multistate or multi-agent-based models
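A minimal discrete-time sketch of the three compartments. The daily update rule, the transmission rate `beta`, the removal rate `gamma` and all starting values are assumptions chosen for illustration, not taken from the card.

```python
def simulate_sir(s0, i0, rem0, beta, gamma, days):
    """Deterministic daily-step SIR. Returns a list of (S, I, R) tuples."""
    n = s0 + i0 + rem0
    s, i, r = float(s0), float(i0), float(rem0)
    history = [(s, i, r)]
    for _ in range(days):
        # New infections depend on contact between susceptible and infected people.
        new_infections = min(beta * s * i / n, s)
        # A fixed fraction of infected people is removed/recovers each day.
        new_removals = gamma * i
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


# Invented parameters: beta = 0.5 per day, gamma = 0.25 per day.
history = simulate_sir(s0=990, i0=10, rem0=0, beta=0.5, gamma=0.25, days=100)
peak_infected = max(i for _, i, _ in history)
final_s, final_i, final_r = history[-1]
```

People only move from Susceptible to Infected to Removed, so the total population stays constant throughout the run.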
52
Herd immunity
A form of indirect protection from infectious disease (also called a herd effect) that occurs when a large percentage of a population has become immune to an infection, thereby providing a measure of protection for individuals who are not immune
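How large that percentage must be is often estimated as 1 - 1/R0. This is a standard textbook result rather than something stated on the card, and it assumes a well-mixed population in which immunity fully blocks infection. The function name is hypothetical.

```python
def herd_immunity_threshold(r0):
    """Fraction of the population that must be immune so that each case
    infects, on average, fewer than one other person."""
    if r0 <= 1:
        return 0.0  # the infection dies out without any immunity
    return 1 - 1 / r0


# Illustrative R0 values, not estimates for any specific disease.
threshold_r0_2 = herd_immunity_threshold(2)
threshold_r0_4 = herd_immunity_threshold(4)
```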
53
Strengths and weaknesses of big data in infectious disease epidemiology
Strengths: opportunity for identifying associations, patterns and trends in data; hypothesis generating. When analysed appropriately, can improve patient care and public health and reduce health care costs. Weaknesses: difficult to manage with traditional hardware and software; data quality and inconsistency issues
54
What information is contained in big data?
Data at population, regional or local levels, or spanning different geographical areas, combining data from multiple sources to explore population health outcomes