Information Extraction Flashcards
What is Information Extraction?
It is turning unstructured text into structured data, such as a relational database or set of extracted tuples
What is Relation Extraction (RE)?
Relation extraction finds semantic relations among entities in text, such as parent-child, part-whole, or geospatial relations
What can be used to encode relational information?
Knowledge graphs
What is Event Extraction?
It finds events in which entities participate
What is Temporal Extraction?
It is finding times, dates and durations
What is Knowledge Base Population (KBP)?
It is the task of populating knowledge bases from unstructured text using extracted information
What datasets exist to perform relational extraction?
The ACE relation extraction dataset
Wikipedia info boxes (DBpedia and Wikidata)
WordNet
TACRED dataset
SemEval
What are some relations in RE?
Examples include family relations (e.g. parent-child), part-whole relations, geospatial relations (e.g. located-in), and employment or affiliation relations
What is pattern-based RE?
It is relation extraction using hand-crafted lexico-syntactic patterns. The patterns are tailored to a specific domain's lexicon, so they only work in that domain. They achieve high precision but low recall, and are expensive to create
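As a minimal sketch, a hand-crafted lexico-syntactic pattern might look like the following. The pattern, relation name, and function name are all illustrative, not from any particular system:

```python
import re

# A hand-crafted lexico-syntactic pattern for a hypothetical
# "located-in" relation: "<CITY>, the capital of <COUNTRY>".
PATTERN = re.compile(r"(\w+), the capital of (\w+)")

def extract_located_in(sentence):
    """Return (relation, e1, e2) tuples matched by the pattern."""
    return [("located-in", m.group(1), m.group(2))
            for m in PATTERN.finditer(sentence)]
```

Note the trade-off the flashcard describes: when the pattern fires it is almost always right (high precision), but it misses every other phrasing of the same relation (low recall).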
What is Supervised RE?
It is supervised relation extraction trained on an annotated corpus: the model learns the patterns in the corpus so it can perform extraction on new text. It works by finding pairs of named entities in a sentence and classifying the relation between each pair
What is the input and output for a supervised RE model?
The input X is a feature set for the entity pair, and the output Y is a prediction of the relation for the provided pair.
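A sketch of what one (X, Y) training example could look like; the feature names are illustrative and the label follows the TACRED naming convention:

```python
# Hypothetical feature set X for one entity pair in the sentence
# "Tim Cook is the CEO of Apple." -- feature names are illustrative.
X = {
    "subj_ner": "PERSON",        # NER tag of the subject entity
    "obj_ner": "ORG",            # NER tag of the object entity
    "words_between": ["is", "the", "CEO", "of"],
    "subj_before_obj": True,     # word order of the pair
}
Y = "per:employee_of"            # relation label (TACRED-style)
```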
What type of classifier can be used in a RE model?
It can be logistic regression, random forests, or an RNN, but in this course we use a Transformer model
Why do we use a transformer model with RE?
Self-attention works well for this type of problem because it can learn which parts of the sentence to focus on for a given entity pair
What are some techniques to improve RE models?
Replace SUBJ and OBJ with NER tags to avoid overfitting to lexical terms
Use RoBERTa or SpanBERT pre-trained embeddings instead of vanilla BERT, as their pre-training uses single contiguous spans of text rather than sentence pairs joined by a separator token
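The entity-masking trick from the first technique above can be sketched as a small helper; the tag format and function name are illustrative:

```python
def mask_entities(tokens, subj_span, obj_span, subj_ner, obj_ner):
    """Replace the subject/object tokens with their NER tags so the
    model cannot overfit to specific entity strings.
    Spans are half-open (start, end) token index ranges."""
    out = list(tokens)
    # Replace the later span first so the earlier indices stay valid.
    spans = sorted(
        [(subj_span, f"[SUBJ-{subj_ner}]"), (obj_span, f"[OBJ-{obj_ner}]")],
        key=lambda p: p[0][0], reverse=True)
    for (start, end), tag in spans:
        out[start:end] = [tag]
    return out
```

For example, "Tim Cook works at Apple" with a PERSON subject and ORG object becomes ["[SUBJ-PERSON]", "works", "at", "[OBJ-ORG]"].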
Why do NEs tend to overfit in deep learning RE models?
RE labelled datasets are small, so they cannot contain examples of every possible NE phrase; the model is therefore likely to overfit to the specific entities seen in the training set
What is a semi-supervised RE method?
We can use a semi-supervised RE approach with bootstrapping
What is bootstrapping?
It is where we have a small, high-quality, hand-crafted set of seed tuples for the relations, in the form (relation, e1, e2)
The algorithm finds sentences containing instances of the seed tuples, extracts patterns from them, and uses those patterns to find new seed tuples
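The loop above can be sketched as follows. This is a deliberately naive toy (string matching instead of real parsing; all names illustrative), which also shows why unfiltered patterns invite semantic drift:

```python
def bootstrap(corpus, seeds, n_iters=2):
    """Minimal bootstrapping sketch. seeds is a set of
    (relation, e1, e2) tuples. Each iteration finds sentences containing
    a seed pair, takes the text between the two entities as a new
    pattern, then applies every pattern to harvest new tuples. Real
    systems score and filter patterns; this sketch does not, which is
    exactly how semantic drift creeps in."""
    seeds = set(seeds)
    patterns = set()
    for _ in range(n_iters):
        # Step 1: extract patterns from sentences matching known seeds.
        for rel, e1, e2 in list(seeds):
            for sent in corpus:
                if e1 in sent and e2 in sent and sent.index(e1) < sent.index(e2):
                    between = sent.split(e1, 1)[1].split(e2, 1)[0]
                    patterns.add((rel, between))
        # Step 2: apply patterns to the corpus to find new seed tuples.
        for rel, pat in patterns:
            if not pat.strip():
                continue
            for sent in corpus:
                left, hit, right = sent.partition(pat)
                if hit and left.strip() and right.strip():
                    e1 = left.strip().split()[-1]
                    e2 = right.strip().split()[0].rstrip(".,")
                    seeds.add((rel, e1, e2))
    return seeds, patterns
```

Given the seed ("capital-of", "Paris", "France") and a sentence about Berlin and Germany with the same phrasing, the loop learns the pattern " is the capital of " and harvests the new tuple.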
What happens when we run many iterations of bootstrapping?
We get semantic drift: the extracted patterns gradually start matching other kinds of relations, so the tuple set drifts away from the original target relation
What are some methods to reduce semantic drift?
Apply a confidence threshold to the extraction patterns to improve quality of tuples
Limit the dependency graph walk for new tuples
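The confidence-threshold idea can be sketched as a simple filter; the confidence measure (fraction of a pattern's matches that agree with known seed tuples) and the names here are illustrative:

```python
def filter_patterns(pattern_stats, threshold=0.8):
    """Keep only extraction patterns whose confidence exceeds a
    threshold. pattern_stats maps each pattern string to a pair
    (matches_agreeing_with_seeds, total_matches)."""
    return {pat for pat, (hits, total) in pattern_stats.items()
            if total and hits / total >= threshold}
```

A pattern that matches seed tuples 9 times out of 10 survives the filter; one that agrees only 1 time in 10 is dropped, which keeps low-quality patterns from feeding the next bootstrapping iteration.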
What is the distant supervision method for RE?
It is the use of a knowledge-base such as DBpedia as a source of seed tuples (r, e1, e2)
What does using a knowledge base avoid?
It avoids the semantic drift problems of the bootstrapping approach
What are the steps taken with Distant Supervision?
Start with a text corpus, run a NER tagger over it, match the tagged entities against the knowledge base, and look up the relation between them to form a seed tuple. Matches can then be added to the training set as a feature set together with their occurrence frequency
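The matching step can be sketched as follows (plain substring matching stands in for a real NER tagger; all names are illustrative):

```python
from collections import Counter

def distant_supervision(corpus, kb):
    """Distant-supervision labeling sketch. kb is a set of
    (relation, e1, e2) tuples from a knowledge base such as DBpedia.
    Any sentence containing both entities of a KB tuple is (noisily)
    labeled with that relation; the occurrence count can serve as a
    feature. The noise source is visible here: co-occurrence of e1 and
    e2 does not guarantee the sentence expresses the relation."""
    training = Counter()
    for rel, e1, e2 in kb:
        for sent in corpus:
            if e1 in sent and e2 in sent:
                training[(rel, e1, e2)] += 1
    return training
```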
What is an issue with using distant supervision?
The automatically generated training set is very large but very noisy, so the resulting model has low precision: a sentence containing both entities does not necessarily express the relation
What can be done to reduce noise during distant supervision RE?
GAN-based denoising or incremental training approaches can be used