Lecture 10 - Content-Based Applications Flashcards

1
Q

What’s a B2B transaction?

A

A business-to-business (B2B) transaction uses the WWW as a distributed document delivery service between businesses

2
Q

What are the components of a search engine?

A

Database of references to webpages

A web crawler

An interface

Information retrieval system

3
Q

What are the elements of the search engine database?

A

It is where users' queries are matched

Contains only the essential parts of each page

Only includes indexed pages

Search engine databases tend to be out of date

4
Q

What does a web crawler do?

A

Visits web pages and records the data it finds, such as words, metadata and alt attributes

5
Q

What does the search engine interface do?

A

Gathers input from users

Presents results from the IR system

Often presents items in a ranked order

Requires user input

6
Q

What are the two main methods of search term matching?

A

Keyword searching and concept-based searching

7
Q

How does keyword searching work?

A

Matches single terms, computing the cosine similarity between the query and each document
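
As a rough illustration of the cosine computation, here is a minimal sketch that assumes plain term-count vectors and whitespace tokenization. The function name `cosine_similarity` is made up for this example, and real engines typically weight terms rather than use raw counts.

```python
import math
from collections import Counter

def cosine_similarity(query: str, document: str) -> float:
    """Cosine of the angle between two term-count vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Dot product over the query terms (terms missing from d count as 0).
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(c * c for c in q.values())) * \
           math.sqrt(sum(c * c for c in d.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("web crawler", "a web crawler visits web pages"))
```

A score near 1 means the query and document use the same terms in similar proportions; 0 means they share no terms.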

8
Q

How does concept-based searching work?

A

Examines clusters of words

Attempts to determine the meaning of a query

9
Q

What are the basic information retrieval features of a search engine?

A

Boolean Operators

Extended Operators

Stop word deletion

Stemming

Searching in fields (e.g. host)
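
To make three of these features concrete, the sketch below combines stop word deletion, a very crude suffix-stripping stemmer and a Boolean AND match. Everything here is an assumption for illustration: the stop word list, the suffix rules and the function names are invented, and a real engine would use a proper stemmer (Porter's algorithm, for example).

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # crude rules, not a real stemming algorithm

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, drop stop words, then stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def boolean_and(query, documents):
    """Return the IDs of documents containing every query term."""
    terms = set(normalize(query))
    return [doc_id for doc_id, text in documents.items()
            if terms <= set(normalize(text))]

docs = {1: "Indexing the pages", 2: "Crawling the web"}
print(boolean_and("indexed page", docs))
```

Because both the query and the documents pass through the same `normalize` step, "indexed page" matches "Indexing the pages" even though no word is spelled identically.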

10
Q

What are the rules of ranked output for most search engines?

A

Words early in the page are more important

The title is important

Frequency of occurrence matters for some engines

Infrequent words matter more

Modification date
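
A hedged sketch of how some of these rules could combine into a score: term frequency in the body, extra weight for title matches, and an inverse-document-frequency factor so that infrequent words matter more. The weights, the `title_boost` parameter and the document layout (dicts with `title` and `body`) are assumptions of this example, not how any particular engine works. Word position and modification date are left out.

```python
import math

def score(query_terms, doc, corpus, title_boost=2.0):
    """Score one document against a list of lowercase query terms."""
    total = 0.0
    for term in query_terms:
        # Number of documents in the corpus that contain the term at all.
        containing = sum(
            1 for d in corpus
            if term in (d["title"] + " " + d["body"]).lower().split()
        )
        if containing == 0:
            continue
        idf = math.log(len(corpus) / containing) + 1.0  # rarer term, larger weight
        tf = doc["body"].lower().split().count(term)    # frequency of occurrence
        in_title = term in doc["title"].lower().split()
        total += idf * (tf + (title_boost if in_title else 0.0))
    return total

a = {"title": "Web search", "body": "search engines rank pages"}
b = {"title": "Cooking", "body": "search for recipes"}
corpus = [a, b]
print(score(["search"], a, corpus), score(["search"], b, corpus))
```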

11
Q

How does Google handle searches differently than other SEs?

A

Its PageRank™ method is based on popularity, use of keywords and relevance.

Links as money: each link passes some of the linking page's rank on to the page it points to
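
The "links as money" idea can be sketched with a simplified power-iteration PageRank. This is an illustrative approximation only: it uses the normalized variant in which ranks sum to 1, the damping value 0.85 and the iteration count are conventional choices rather than anything stated in the lecture, and it ignores the keyword and relevance signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page "spends" its rank equally across its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph, page `c` ends up ranked highest because two pages link to it, and `a` outranks `b` because its single incoming link comes from the highly ranked `c`.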

12
Q

Google’s Anatomy: What does the URL server do?

A

Sends lists of URLs to be fetched to the crawlers

Fetched pages are sent to the store server

The store server compresses and stores pages into a repository

Each page has a docID

13
Q

What does Google’s Indexer do?

A

Reads repository, uncompresses and parses documents

Converts pages into stats on word occurrences, called hits

Includes info about the page, font size, capitalization

14
Q

What does Google’s sorter do?

A

Re-sorts the barrels by wordID instead of docID, producing the inverted index
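
A toy sketch of that re-sorting step: a forward index keyed by docID is turned into an inverted index keyed by wordID. The plain dictionaries here stand in for Google's barrels and are an assumption of this example; the real structures also store hit lists and are far more compact.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn {doc_id: [word_id, ...]} into {word_id: [doc_id, ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        # set() so a word repeated within one document is listed once.
        for word_id in sorted(set(forward_index[doc_id])):
            inverted[word_id].append(doc_id)
    return dict(sorted(inverted.items()))

print(invert({2: [7, 5], 1: [5]}))
```

With the index keyed by wordID, answering a query means looking up each query word and reading off the documents that contain it.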

15
Q

What does dumpLexicon do?

A

Takes the list and the lexicon produced by the indexer to generate a new lexicon

The new lexicon is used by the searcher to answer queries

The searcher uses it together with the inverted index and the PageRanks

16
Q

What is Googlebombing?

A

Specifically targeting a web page to rank in 1st position for a particular search query

17
Q

What is the Deep Web?

A

Also called the invisible web, it contains documents not indexed by Search Engines.

18
Q

How is the invisible web changing over time?

A

More search engines are parsing non-HTML content than before

Companies are making more content available by keeping URLs stable and including sitemaps

19
Q

What is Dogpile?

A

A well-known meta-crawler (a metasearch engine that queries several search engines at once)

20
Q

As a searcher, what are steps to success in search engines?

A

Use multiple search engines

Search within results

Use Boolean expressions

21
Q

As a creator, what are the steps to success in search engines?

A

Always use ALT attributes

Avoid frames

Add links between your pages

Use metadata, formal and informal

22
Q

How do you increase your page’s popularity?

A

Don’t use systematic reciprocal linking

Use a context map at the top of each page

Don’t use frames

Think through dynamic content implications

23
Q

Why is metadata important?

A
  • It’s useful for describing and locating info
  • It helps judge the relevance of information
  • It promotes good information management
  • Search tools and information gateways can use metadata when locating and describing resources
24
Q

How can we reduce inconsistencies in our metadata?

A

Clearly label attributes

Stick to formats and rules

Catalogue Rules

25
Q

What is Dublin Core (DC)?

A

A metadata standard with 15 core elements:

Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights
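
One common way to attach Dublin Core elements to a web page is through HTML `meta` tags with a `DC.` prefix. The sketch below generates such tags from a dictionary. The helper name `dublin_core_meta` and the sample record are invented for illustration, and the element names are passed through as given rather than validated against the 15 elements.

```python
from html import escape

DC_SCHEMA = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace

def dublin_core_meta(record):
    """Render {element: value} pairs as Dublin Core <meta> tags."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, value in record.items():
        # Escape the value so quotes and ampersands stay valid HTML.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)

print(dublin_core_meta({"title": "Lecture 10", "creator": "A & B", "language": "en"}))
```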

26
Q

What does RDF stand for?

A

Resource Description Framework

27
Q

What does the Resource Description Framework do?

A

It aims to provide the infrastructure to exchange metadata on the web.

Allows for a mix of metadata schemes

Enables automated processing of web resources

Interoperability between applications that exchange machine-understandable information

28
Q

What are the applications of RDF?

A
  • Resource discovery - search engines
  • Cataloguing - describe content and content relationships
  • Describing intellectual property rights
  • Intelligent software agents - info sharing
  • Content rating
  • Privacy preferences/policies
  • Collections of pages as a single “document”
29
Q

What are the disadvantages of metadata?

A

Stored in separate files

Difficult to convince information providers of its importance

Need for standardised usage and procedures

Not trusted by some search engines (because of keyword spamming)

30
Q

What is the short term disadvantage of metadata?

A

Metadata imposes a load on the server

31
Q

Metadata is becoming important. How should we handle it when creating a site?

A

Start collecting it immediately

Automate as much as possible

Ensure information providers use metadata