Chapter 3 Flashcards

1
Q

what is the goal of anonymization

A

To balance data privacy and data utility: make the data less specific while retaining its usefulness

2
Q

the original database goes through _________ to become the published database

A

anonymization

3
Q

what are some anonymisation techniques

A

Attribute Suppression

Character Masking

Generalisation

Swapping

Data Perturbation

Synthetic Data

Data aggregation

K-anonymity

Pseudonymization

4
Q

what is attribute suppression

A
  • Removal of an entire part of data (column in
    databases or spreadsheets) in a dataset
  • Used when an attribute is not required in the
    anonymised dataset
  • Strongest type of anonymisation technique, because no
    information from the suppressed attribute can be recovered
5
Q

what is an example of attribute suppression

A

Example: Data consists of test scores

  • Recipient only needs to analyse test scores with respect to trainers
  • The “student” attribute is removed
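A minimal Python sketch of attribute suppression for the test-score example. The record layout, the names and the helper suppress_attributes are hypothetical and only illustrate dropping a whole column.

```python
def suppress_attributes(records, attributes):
    """Return copies of the records with the named attributes (columns) removed."""
    drop = set(attributes)
    return [{k: v for k, v in row.items() if k not in drop} for row in records]


# Made-up test-score records (assumption: one dict per row).
scores = [
    {"student": "John", "trainer": "Tina", "score": 87},
    {"student": "Yong", "trainer": "Tina", "score": 56},
    {"student": "Ming", "trainer": "Henry", "score": 92},
]

# The recipient only needs scores by trainer, so "student" is suppressed.
anonymised = suppress_attributes(scores, ["student"])
```

The original list is left untouched, so the suppressed copy can be released while the source data stays under the owner's control.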
6
Q

what is character masking

A
  • Characters of a data value are masked using a
    symbol, e.g. “*” or “x”
  • Used when hiding part of a string of characters is
    sufficient to provide the anonymity required
  • Depending on the attribute type, the mask replaces a fixed
    number of characters or a variable number of
    characters
7
Q

what is an example of character masking

A

Example: online grocery store conducting a study of its delivery demand from historical data

  • The last 4 digits of the postal codes are masked
  • leaving the first 2 digits, which correspond to the “sector code”
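The postal-code masking above could be sketched in Python as follows. The helper mask_characters and the sample postal codes are made up for illustration; the only thing taken from the example is "keep the first 2 digits, mask the last 4".

```python
def mask_characters(value, keep_prefix, mask_char="x"):
    """Keep the first `keep_prefix` characters and mask every character after them."""
    value = str(value)
    masked_length = max(len(value) - keep_prefix, 0)
    return value[:keep_prefix] + mask_char * masked_length


# Made-up 6-digit postal codes.
postal_codes = ["100111", "200222", "300333"]

# Keep the 2-digit sector code, mask the remaining 4 digits.
masked = [mask_characters(code, keep_prefix=2) for code in postal_codes]
# masked == ["10xxxx", "20xxxx", "30xxxx"]
```

Because the number of masked characters is derived from the value's length, the same helper covers both the fixed-length and the variable-length case.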
8
Q

what is generalisation

A
  • Reduction in the precision of data, e.g., converting a person’s age into a range of values
  • Used where values can be generalised into a range, and still
    be useful
  • Ranges that are too large may modify the data too much;
    ranges that are too small may make it too easy to
    re-identify individuals
9
Q

example of generalisation

A

Example: Dataset contains person name, age in years, and residential address
  • Age is generalised into ranges of 10 years, starting with a range of <20 years and ending with a range of >60 years
  • The block/house number is removed and only the road name is retained in Address
  • If an address appears in only 1 record, it is still too unique after generalisation, so that record has to be removed from the data
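A rough Python sketch of this generalisation, covering age and address only. The address format, the regular expression and the treatment of the boundary age 60 (placed in the top range here) are assumptions, since the card leaves them open.

```python
import re
from collections import Counter


def generalise_age(age):
    """Map an age in years to a 10-year range."""
    if age < 20:
        return "<20"
    if age >= 60:  # assumption: exactly 60 goes into the top range
        return ">=60"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def road_name(address):
    """Strip a leading block/house number, e.g. '12 Elm Road' -> 'Elm Road'."""
    return re.sub(r"^\s*(?:(?:blk|block)\s+)?\d+\w*\s+", "", address, flags=re.IGNORECASE)


def generalise(records):
    out = [{"age": generalise_age(r["age"]), "address": road_name(r["address"])}
           for r in records]
    # Drop records whose generalised address is still unique in the dataset.
    counts = Counter(r["address"] for r in out)
    return [r for r in out if counts[r["address"]] > 1]


# Made-up records.
people = [
    {"age": 17, "address": "12 Elm Road"},
    {"age": 34, "address": "Blk 45A Elm Road"},
    {"age": 61, "address": "8 Sunset Way"},
]
result = generalise(people)
```

In this sample the "Sunset Way" record is removed because no other record shares its road name.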
10
Q

what is swapping

A
  • Rearrangement of data in the dataset such that the individual attribute values are still represented, but no longer correspond to the original records
  • Used when subsequent analysis only needs to look at
    aggregated data, not relationships between attributes
  • Not all attributes (columns) need to be swapped; depending
    on the situation, only attributes containing values that are
    relatively identifiable need to be swapped
11
Q

what is an example of swapping

A

Example: Dataset contains information about customer records for a business organisation

  • All values for all attributes have been swapped
  • If the purpose of the anonymised dataset is to study the relationships between job profile and consumption patterns, other methods of anonymisation may be more suitable, e.g. generalisation
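One way to sketch swapping in Python is to shuffle each chosen column independently. The customer fields and the helper swap_attributes are hypothetical; the seed parameter is there only to make a run repeatable.

```python
import random


def swap_attributes(records, attributes, seed=None):
    """Shuffle the values of each listed attribute independently across records."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # copy so the originals stay intact
    for attr in attributes:
        values = [r[attr] for r in records]
        rng.shuffle(values)
        for row, value in zip(swapped, values):
            row[attr] = value
    return swapped


# Made-up customer records.
customers = [
    {"job": "teacher", "spend": 120},
    {"job": "engineer", "spend": 340},
    {"job": "nurse", "spend": 90},
    {"job": "lawyer", "spend": 510},
]
swapped = swap_attributes(customers, ["job", "spend"], seed=1)
```

Each column keeps exactly the same set of values, so totals and averages are unchanged, but the pairing between job and spend is broken. That is why swapping does not suit analysis of relationships between attributes.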
12
Q

what is Data Perturbation

A
  • The values from the original dataset are modified to be slightly different
  • Used for quasi-identifiers, typically numbers and dates; should not be used where data accuracy is crucial
  • The degree of perturbation should be proportionate to the
    range of values of the attribute
13
Q

what is an example of data perturbation

A

Rounding the values of numeric columns to the nearest multiple of a base (e.g. base 3 or base 5), with the base chosen according to the range of values of the attribute.
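Base-x rounding can be sketched in a few lines of Python. The helper round_to_base and the sample ages and heights are made up; rounding halves upward is an assumption, as the card does not say how ties are handled.

```python
import math


def round_to_base(value, base):
    """Round `value` to the nearest multiple of `base` (ties round up)."""
    return base * math.floor(value / base + 0.5)


# Made-up values: a small base for a narrow range, a larger base for a wider one.
ages = [25, 26, 31]
heights_cm = [168, 172, 181]

perturbed_ages = [round_to_base(a, 3) for a in ages]
perturbed_heights = [round_to_base(h, 5) for h in heights_cm]
# perturbed_ages == [24, 27, 30]
# perturbed_heights == [170, 170, 180]
```

A larger base hides more of the original value, which is why it should be matched to the range of the attribute.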

14
Q

what is synthetic data

A
  • Data that is artificially or programmatically created, often with the help of algorithms, rather than being generated by actual events
  • Captures the underlying structure and displays the same
    statistical distributions as the original data
  • Used for a wide range of activities, including as test data for
    new products and in AI model training, while maintaining data
    privacy
15
Q

example of synthetic data

A

Example: An office facility that provides “hot-desking” keeps records of the times at which users start and end using its facilities.

  • They would like synthetic data for 1 day, to perform simulation testing on a new facility allocation
  • Synthetic data created, based on the statistics derived from the original data
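A simplified Python sketch of generating one day of synthetic hot-desking sessions from statistics of the original data. Everything specific here is an assumption: times are minutes since midnight, start times and durations are modelled as independent normal distributions, and the sample records are made up. A real generator would need a model that fits the actual data.

```python
import random
import statistics


def synthesise_sessions(original, n, seed=None):
    """Sample n synthetic (start, end) pairs from the mean/stdev of the originals."""
    rng = random.Random(seed)
    starts = [start for start, _ in original]
    durations = [end - start for start, end in original]
    mu_s, sd_s = statistics.mean(starts), statistics.stdev(starts)
    mu_d, sd_d = statistics.mean(durations), statistics.stdev(durations)

    sessions = []
    for _ in range(n):
        # Clamp to a single day: 0..1440 minutes.
        start = min(max(rng.gauss(mu_s, sd_s), 0), 24 * 60 - 1)
        duration = max(rng.gauss(mu_d, sd_d), 1)
        end = min(start + duration, 24 * 60)
        sessions.append((int(start), int(end)))
    return sessions


# Made-up original records: (start, end) in minutes since midnight.
original = [(540, 720), (555, 1020), (600, 660), (780, 1080), (525, 990)]
synthetic = synthesise_sessions(original, n=50, seed=42)
```

No synthetic row is copied from an original record; only the summary statistics carry over.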
16
Q

what is data aggregation

A
  • Converting a dataset from a list of records to summarised values
  • Used when individual records are not required and
    aggregated data is sufficient for the purpose
  • If any category in the aggregated data contains only a single record, someone with some additional knowledge could easily identify that individual; aggregation may therefore need to be applied in combination with suppression
17
Q

what is an example of data aggregation

A

Example: A charity organisation has records of the donations made, as well as some information about the donors. Aggregated data is assessed to be sufficient for the data analysis
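A small Python sketch of aggregating donation records, combined with suppression of categories that contain too few records. The field names (income_band, amount), the sample values and the min_count parameter are hypothetical.

```python
from collections import defaultdict


def aggregate_donations(records, min_count=2):
    """Summarise donations per income band; suppress bands with too few donors."""
    amounts_by_band = defaultdict(list)
    for record in records:
        amounts_by_band[record["income_band"]].append(record["amount"])
    return {
        band: {"donors": len(amounts), "total": sum(amounts)}
        for band, amounts in amounts_by_band.items()
        if len(amounts) >= min_count  # a single-record category is suppressed
    }


# Made-up donation records.
donations = [
    {"donor": "A", "income_band": "low", "amount": 50},
    {"donor": "B", "income_band": "low", "amount": 30},
    {"donor": "C", "income_band": "mid", "amount": 100},
    {"donor": "D", "income_band": "mid", "amount": 150},
    {"donor": "E", "income_band": "mid", "amount": 200},
    {"donor": "F", "income_band": "high", "amount": 1000},
]
summary = aggregate_donations(donations)
```

The "high" band has a single donor, so publishing its total would reveal that donor's exact amount; the sketch drops it instead.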

18
Q

what is K-anonymity

A
  • A property of a dataset that is usually used to
    describe the dataset’s level of anonymity
  • Protects against re-identification, and is often described as a
    ‘hiding in the crowd’ guarantee
  • k in k-anonymity refers to the minimum number of times each
    combination of quasi-identifier values appears in a dataset
  • If k = 3, the data is said to be 3-anonymous; the higher the
    value of ‘k’, the harder it is for individuals to be identified
19
Q

what is an example of k-anonymity

A

Example: Research needs to be done on the types of disease

  • Name, Postcode, Age, and Gender are attributes that could be used to identify an individual
  • Data anonymised to achieve k-anonymity of k = 3, i.e. at most a 1/3 chance of identifying an individual
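The k value of a dataset can be measured with a short Python sketch: count how often each combination of quasi-identifier values occurs and take the smallest count. The records below are made up and assume Name has already been suppressed and the other quasi-identifiers already masked or generalised.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


# Made-up, already-anonymised patient records.
patients = [
    {"postcode": "10xxxx", "age": "21-30", "gender": "F", "disease": "flu"},
    {"postcode": "10xxxx", "age": "21-30", "gender": "F", "disease": "asthma"},
    {"postcode": "10xxxx", "age": "21-30", "gender": "F", "disease": "diabetes"},
    {"postcode": "20xxxx", "age": "41-50", "gender": "M", "disease": "flu"},
    {"postcode": "20xxxx", "age": "41-50", "gender": "M", "disease": "flu"},
    {"postcode": "20xxxx", "age": "41-50", "gender": "M", "disease": "asthma"},
]

k = k_anonymity(patients, ["postcode", "age", "gender"])
max_reidentification_chance = 1 / k  # 1/3 when k == 3
```

If the measured k is below the target, further generalisation or suppression of outlier records would be needed before release.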
20
Q

what is Pseudonymization

A
  • Replacement of identifying data with made-up values, which are unique and should have no relationship to the original values
  • Used when the data values need to be uniquely distinguished
  • Persistent pseudonyms allow linkage across different
    datasets
  • May need to follow the structure or data type of the original value, so that it resembles the original attribute
21
Q

what is an example of pseudonymization

A

Example: names of persons who obtained their driving licenses and other information

  • the names were replaced with pseudonyms

Useful for cross-dataset linking and where the original data structure is needed, but pseudonymisation applied only to explicit identifiers does not by itself comply with personal data protection regulations
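A minimal Python sketch of persistent pseudonymisation for the driving-licence example. The field names, the sample names and the helper pseudonymise are hypothetical; random hex tokens are one possible choice of pseudonym, picked here because they have no relationship to the original values.

```python
import secrets


def pseudonymise(records, attribute, mapping=None):
    """Replace `attribute` with a random, unique pseudonym; reuse `mapping` to stay persistent."""
    mapping = {} if mapping is None else mapping
    used = set(mapping.values())
    out = []
    for record in records:
        original = record[attribute]
        if original not in mapping:
            pseudonym = secrets.token_hex(4)
            while pseudonym in used:  # regenerate on the rare collision
                pseudonym = secrets.token_hex(4)
            mapping[original] = pseudonym
            used.add(pseudonym)
        out.append({**record, attribute: mapping[original]})
    return out, mapping


# Made-up licence records.
licences = [
    {"name": "Joe Phang", "licence_class": "3"},
    {"name": "Mary Lim", "licence_class": "2B"},
    {"name": "Joe Phang", "licence_class": "2B"},
]
pseudonymised, identity_table = pseudonymise(licences, "name")
```

Passing the same identity_table to a later call gives the same person the same pseudonym in another dataset, which is what enables cross-dataset linking. The table itself has to be stored securely and separately, since it reverses the pseudonyms. In this sketch two different people with the same name would share a pseudonym; keying on a unique identifier would avoid that.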

22
Q

what are the 2 phases in the anonymisation methodology

A

Anonymisation Preparation Phase
Anonymisation Execution Phase

23
Q

what are the 4 steps in the anonymisation preparation phase

A

determine the release model

determine the reidentification risk threshold

classify the data attributes

remove unused data attributes

24
Q

what does determine the release model mean?

A
  • Refers to how the anonymised dataset will be released
  • Public or Non-Public release
25
Q

what does determine the re-identification risk threshold mean?

A
  • Data anonymity increases as the risk threshold becomes stricter (i.e. less re-identification risk is accepted)
  • Data utility decreases as the risk threshold becomes stricter
26
Q

what is risk threshold

A

The risk threshold is a parameter that determines the desired level of privacy protection in a dataset, balancing the trade-off between data anonymity and data utility.

27
Q

what does classify the data attributes mean?

A
  • Classification affects how the attributes will subsequently be processed
  • Explicit/quasi identifiers, sensitive data
28
Q

why should attributes not required in the dataset be removed/suppressed ?

A

Attributes not required in the anonymized dataset should be suppressed to reduce the risk of re-identification, protect individuals’ privacy, and minimize the potential for unintended data leakage or misuse.

29
Q

what is step 4 in the Anonymization Preparation Phase

A

Remove unused data attributes: Attributes that are not required in the anonymized dataset should be suppressed

30
Q

define data anonymization

A

Data anonymization is the irreversible process of transforming a dataset to conceal individuals’ identities and sensitive information while preserving its structure and utility for research and analysis

31
Q

what are the 4 steps in the anonymization execution phase

A

Anonymise identifiers

Evaluate the solution

Determine controls required

Document anonymisation process

32
Q

what does anonymise identifiers mean?

A
  • Apply relevant anonymization techniques
  • Different techniques are applicable to different types of identifiers
33
Q

what does evaluate the solution mean

A
  • Examine the anonymised dataset to assess if there is sufficient data anonymity and utility
34
Q

what does it mean to determine the controls required

A
  • Technical controls, including access control, authentication, encryption
  • Non-technical controls, including legal measures and company processes
35
Q

what does it mean to document the anonymisation process

A
  • Details of the anonymisation process, parameters used and controls should be clearly recorded for future reference
  • Facilitates maintenance