Information Systems Flashcards
(23 cards)
What is an Algorithm?
A finite set of rules that gives a sequence of operations for solving a specific type of problem.
What is a Computer Program?
An instance, or concrete representation, of an algorithm in some programming language.
Name 5 features of algorithms
They are finite, definite, have 0 or more inputs, have 1 or more outputs, and are effective.
What is Data?
Raw facts, e.g. alphanumeric data, image data, etc.
What is information?
Data with some meaning
Name 7 aspects of data and information
- Storing and Processing data
- Encrypting and security of data
- Information theory and communication theory.
- Value of Information
- Frequency
- Linguistic theories
- Human cognition
What is a web search engine?
An online web information retrieval system that, given a query representing a user's information need, returns a list of web pages that match that query.
Name the 3 types of data
Structured, unstructured and semi-structured
Define Structured Data
Data that resides in a fixed field within a record or file, e.g. in a relational (or other) database.
Define Unstructured Data
Data that isn’t organised in any obviously meaningful way.
Define semi-structured Data
Data that doesn’t have a formal structure but does have tags or other information that convey the meaning of the data, e.g. XML or RDF documents with headings/sections, emails, etc.
What is the most used data type today?
Unstructured data
What is Organic Content?
Unpaid marketing content that potential and existing customers can find naturally.
What is sponsored content?
Ads with words matching the query words that are ranked above the web documents returned.
What is ranking?
It involves ordering results returned in response to a user query.
Discuss the web link distribution
Web page links are not randomly distributed. The distribution is widely reported to follow a power law, in which the total number of web pages with in-degree i is proportional to 1/i^c (c a constant).
i.e. only a small proportion of web pages have a huge number of links.
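A minimal sketch of how quickly 1/i^c falls off as the in-degree i grows. The function name and the value c = 2.0 are assumptions chosen for illustration only; the card does not give a value for c.

```python
def relative_page_count(in_degree, c):
    """Relative number of pages with the given in-degree under a 1/i^c power law."""
    return 1.0 / in_degree ** c


# c = 2.0 is an illustrative assumption, not a value from the card.
for i in (1, 10, 100):
    print(i, relative_page_count(i, c=2.0))
```

Each tenfold increase in in-degree cuts the relative page count by a factor of 10^c, which is why heavily linked pages are rare.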
What does an index do?
Associates a web page with one or more terms
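One common way to store this association is an inverted index, which maps each term to the pages that contain it. The sketch below assumes that layout; the function name and the page URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


# Hypothetical pages, for illustration only.
pages = {
    "a.example/1": "web search engine",
    "a.example/2": "search index",
}
index = build_index(pages)
print(sorted(index["search"]))
```

Looking up a query term is then a single dictionary access rather than a scan of every page.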
Explain pre-processing
- Case folding (words are changed to lowercase)
- Punctuation is removed
- “stop words” are removed
- “Stemming” is performed
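The four steps above can be sketched as one small pipeline. The stop-word list and the suffix-stripping stemmer here are toy assumptions for illustration; real systems use longer stop-word lists and more careful stemming rules.

```python
import string

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}
# Toy suffix list (an assumption; not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = text.lower()  # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return [toy_stem(t) for t in tokens]  # stemming


print(preprocess("The report consisted of three parts, consisting of notes."))
```

With this toy stemmer, "consisted" and "consisting" both reduce to "consist", so a query for one form can match documents containing the other.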
What are stop words and why are they removed?
Words that do not provide any extra information about the meaning of the document. They are removed in order to save storage space and speed up searches.
What is “stemming”?
Tries to find the “stem” of each word. A stem represents variant forms of a word which share a common meaning. eg consist, consisted and consisting have the same stem “consist”.
Describe Lemmatisation.
A lemma is the base form of a word and it is what we look up in a dictionary, e.g. walking -> walk. Lemmatisation is the conversion of a word to its lemma. It is harder than finding its stem.
What are tf and idf?
tf is the term frequency, i.e. how often a term occurs in a document.
idf is the inverse document frequency, which shows whether the term occurs often across all the documents being searched.
How do you calculate the tf-idf and what is it?
It is a real number representing a weight: the higher the weight, the more important the term is in describing the meaning of the document.
The tf is calculated as follows:
tf = (no. of times term t occurs in a document) / (no. of terms in the document)
The tf-idf is then calculated:
tf-idf = tf × log10(N/c + 1)
where N is the no. of documents and c is the no. of documents the term occurs in.
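A minimal sketch of this calculation, reading the formula as log10 applied to (N/c) + 1 as it is written on the card. The function names, the toy documents, and the choice to return 0 when the term occurs in no document are assumptions for illustration.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the number of terms."""
    return doc_tokens.count(term) / len(doc_tokens)


def tf_idf(term, doc_tokens, all_docs):
    n = len(all_docs)  # N: number of documents
    c = sum(1 for d in all_docs if term in d)  # c: documents containing the term
    if c == 0:
        return 0.0  # assumption: a term that occurs nowhere gets zero weight
    return tf(term, doc_tokens) * math.log10(n / c + 1)


# Hypothetical, already pre-processed documents.
docs = [
    ["web", "search", "engine"],
    ["web", "index"],
    ["web", "page", "rank"],
]
print(tf_idf("search", docs[0], docs))
print(tf_idf("web", docs[0], docs))
```

In the first document "search" and "web" have the same tf, but "search" gets the higher weight because it occurs in only one of the three documents, while "web" occurs in all of them.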