Lecture 14, 15, 16 & 17 Flashcards

1
Q

14.1 What is the record (data) linkage problem?

A

Combining related/equivalent records across data sources

– Information relating to the same entity (e.g. a person or place) is connected

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

14.2 Why is matching a database against itself regarded as a data linkage task?

A
  1. The customer database changes over time, people move address, change their names.
  2. Because you may have duplicate records. This could be records that refer to the same entity but have been filed under different names.
  3. Business wishes to carry out an advertising campaign.– Has a large database of customers
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

14.3 Explain where record linkage is applied and why?

A

Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key), as may be the case due to differences in record shape, storage location, and/or curator style.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

14.4 Why can record linkage be difficult?

A
  1. No keys – have to make your own key to join rows on which is tricky
  2. Noisy values – the same data can be represented in different formats
  3. Missing attributes – Makes it harder to formulate good keys especially if the keys are a combination of features
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

14.5 Describe the methodology of blocking, as applied to record linkage, explain why it is useful and why it is challenging.

A

Blocking: Represent complex records as simple values (blocks); only score records with simple value (block) in common.
This makes linking a lot faster as every record doesn’t have to be compared to every other record in the table, rather to only the records in its relative block.

However generating the blocks and deciding which attributes to block over can be tricky.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

what is the record linkage pipeline?

A
preparation
blocking, 
scoring, 
matching and 
merging
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

14.7 Explain in what circumstances privacy is an important issue for data linkage.

A

If the data is being used within the organisation that it refers to, it’s not an issue because you can assume that it will not be used with a malicious intent.
Is a concern when:
– Matched data is being passed to another organisation or being made public
– Data matching is being conducted across databases from different organisations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

14.8 What is the objective of privacy preserving data linkage?

A
  1. Some data is of a sensitive nature that if leaked outside the organisation could be used with negative consequences
  2. Organisations don’t want to share all their confidential data with one another, may be competitors
  3. Only linked records should be available to the unit that requires them (shouldn’t get data that they don’t need)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

14.9 What is the use of one way hashing for exact matching in a 2 party privacy preserving data linkage protocol?

A

Each organisation
– Applies a (one way) hash function to the attribute used to join the databases
– Shares its hashed values with the other organisation. Each checks which ones match. These are the linked records.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

14.10 What are the vulnerabilities of 2 party privacy preserving data linkage protocol?

A

i) small differences in input
Single character difference results in a completely
different output.
ii) dictionary attack
An organisation could mount a dictionary attack to “invert” the hash function. An organisation could generate a dictionary of every word and it’s hash and can scan org. B’s data to see if anything matches. If so, privacy is lost for those items.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

14.11 What is the operation of the 3 party protocol for privacy preserving linkage, using hash encoding with salt for exact matching?

A

– Involve a trusted 3rd party (Organisation C)
– Organisations A and B send their hashed values to Organisation C, who then checks for matches.
• Organisation C could mount a dictionary attack and guess the hashed values. Solution: A and B perform “dictionary attack resistant” hashing
Organisations A and B concatenate a secret word to every name field in their data before hashing (known as a salt). Organisation C does not know what this word is and thus can’t perform a dictionary attack to “reverse” the hashed values it receives.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

14.12 What are the steps of the 3 party protocol for privacy preserving data linkage with approximate matching (using bloom filters)?

A
  1. Organisation A & B both generate bloom filters for each record.
  2. A & B send M & N number of bloom filters to C respectively. C then preforms M*N similarity comparisons.
  3. C then sends the ID’s of approximately similar record pairs to A and B.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

14.13 How do you compute similarity of two strings based on 2-grams?

A
Convert each string into 2-grams. Convert to lower case and list all substrings of length 2. E.g. James -> [‘ja’, ‘am’, ‘me’, ‘es’]
use the following formula 	
        sim(x,y) = 2h / (x+y)
h = no. of common 2 grams
x = no. of 2 grams in string x
y = no. of 2 grams in string y
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

14.14 How is a bloom filter created?

A

A bloom filter is an array of l bits, with all bits initially set to zero.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

14.16 How can bloom filters be used to compare two strings for approximate similarity?

A

The 2-grams of the first string are stored in bloom filter B1, the2-grams of the second string are stored in bloom filter B2. Both bloom filters are the same length and use the same hash
functions.
• If the two strings have a lot of 2-grams in common, then their bloom filters will have a large number of identical bit positions set to 1.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

14.11 What are the disadvantages of the 3 party protocol for privacy preserving linkage, using hash encoding with salt for exact matching?.

A

• This third party scheme prevents a dictionary attack, but may still be susceptible to a frequency attack.
– 3rd party compares the distribution of hashed values to some known distribution
• E.g. distribution of surname frequencies in a public database versus distribution of hash values - May be able to guess some of the hashed
values!
• Organisations A and B could prevent this by adding random records to manipulate the frequency distribution
• The hash based technique using the 3rd party, can only compute exact similarity between strings, not good for approximate similarity in a privacy preserving manner

17
Q

14.13 How is using 2-grams to compute the similarity of 2 strings a useful method?

A

Its simple and scalable. using 1 gram means top and pot would be considered equivalent and using 3 grams means cat and car would be considered to have no similarity. 2 grams meet both in the middle.

18
Q

14.16 What is the formula for comparing 2 strings for approximate similarity using bloom filters? (Dice’s coefficient)

A

Formula:
sim(B1, B2) = 2h / (b1+b2)
h = no. of bits set to 1 in both bloom filters
b1 = no. of bits set to 1 in bloom filter B1
b2 = no. of bits set to 1 in bloom filter B2

19
Q

What is the preparation step in record linkage?

A

Preparation: Clean records; General amortisation. Convert the data to have the same format

20
Q

What is the blocking step in record linkage?

A

Blocking: Represent complex records as simple values (blocks); only score records with simple value (block) in common.

21
Q

What is the scoring step in record linkage?

A

Scoring: Score record pairs for similarity. E.g. using cosine similarity or edit distance. Edit distance is bad if comparing words that are similar but not entirely e.g. “door knob” and “knob door” - will say they are vastly different, cosine similarity would be a better choice here. Cosine similarity on the other hand wouldn’t do as well with “Ghostbusters” and “ghost busters”. It would treat these as being vastly different and edit distance here would be better. No method is better than the other.

22
Q

What is the matching step in record linkage?

A

Matching: Match sufficiently similar records. Can be done via some threshold (e.g. match if similarity score > 0.5) or via graph matching techniques.

23
Q

What is the merging step in record linkage?

A

Merging: Merge matched records, resolving conflicting attributes, i.e. if a matched pair have different values for a feature, have to decide which value to pick.

24
Q

How are strings inserted into a bloom filter?

A

insertion: using hash functions H1…Hk to turn on certain bits.
– Each hash function Hi (1≤ i≤ k) maps an input string X to a value in the range [0, l-1].
– To store a string X in the bloom filter, set array index Hi(X) in the bloom filter to value ‘1’, for each Hi(X). If the bit is already on, do nothing.

25
Q

How are strings checked for membership in a bloom filter?

A

membership check:
• For any string X, hash X using the k hash functions.
– If at least one of the bits having value Hi(X) is set to 0, then X is definitely not a member of the bloom filter.
– If all of the the bits having value Hi(X) are set to 1, then X appears to be a member of the bloom filter. However, it might not really be a member (a false positive).