C.2 Searching the Web Flashcards

1
Q

C.2.2

Distinguish between the surface web and the deep web

A

Surface web:
* Pages that are reachable (and indexed) by a search engine
* Pages that can be reached through links from other sites in the surface web
* Pages that do not require special access configurations

Deep web:
* Pages not reachable by search engines
* Substantially larger than the surface web
* Examples: parts of websites that require authentication (private social media, email) and content behind paywalls (newspapers, Netflix)

2
Q

C.2.3

Outline the principles of searching algorithms used by search engines

A
  • The time a page has existed
  • The time a page takes to load
  • Dwell time (how long the user stays on the website)
  • The frequency of search keywords on the page
3
Q

C.2.3

What is the PageRank algorithm?

A

PageRank estimates how important a page is by counting the number and quality of backlinks (links from other pages) that point to it. A page with more backlinks, especially from pages that are themselves important, is considered more important.
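
The idea can be sketched as a short iterative calculation. This is a minimal illustration, not the algorithm as any search engine runs it: the `toy_web` graph, the damping value of 0.85, and the fixed iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance. `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly across its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph C collects the most backlinks, and A ranks above B and D because its single backlink comes from the highly ranked C. That is the "quality of backlinks" part of the definition.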

4
Q

C.2.3

What is the HITS Algorithm?

A

The HITS algorithm splits sites into hubs and authorities.

Authorities have a lot of inlinks. They contain valuable information that the user wants. An authority is considered good if it is linked to by a lot of high-quality hubs.

Hubs contain outlinks to authorities. A hub is considered good if it links to a lot of high-quality authorities.
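
Because hubs and authorities are defined in terms of each other, the scores are usually computed by repeated updating. The following is a rough sketch under assumed inputs: the page names in `toy_web` and the iteration count are made up for illustration.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) scores. `links` maps each page to its outlinks."""
    pages = list(links)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                auth[target] += hub[page]

        # Hub score: sum of the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in links[page]) for page in pages}

        # Normalise so the scores stay bounded between iterations.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth


# Hypothetical web: two link-list pages pointing at two content pages.
toy_web = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1"],
    "auth1": [],
    "auth2": [],
}
hub_scores, auth_scores = hits(toy_web)
```

Here `auth1` ends up as the strongest authority because both hubs link to it, and `hub1` is the strongest hub because it links to both authorities.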

5
Q

C.2.4

Describe how a web crawler functions

A

A web crawler (also called a bot or spider) works its way through the web, downloading and indexing webpages from all over the internet. For each page it indexes, it extracts all the links on that page and adds them to the list of webpages still to crawl.

The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed.
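
The download, extract links, queue loop described above might look like the sketch below. To keep it runnable without network access, the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) and the `fetch` argument stands in for a real HTTP request; both are assumptions for illustration.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> HTML source.
TOY_WEB = {
    "/home": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog": '<a href="/post-1">Post</a> <a href="/home">Home</a>',
    "/post-1": "No links here.",
    "/orphan": '<a href="/home">Home</a>',
}


def crawl(seed, fetch):
    """Breadth-first crawl starting at `seed`; returns {url: html}."""
    frontier = deque([seed])  # pages waiting to be crawled
    seen = {seed}             # avoids queueing the same page twice
    index = {}
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be downloaded
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


crawled = crawl("/home", TOY_WEB.get)
```

Note that `/orphan` is never crawled: no reachable page links to it, so the crawler has no way to discover it.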

6
Q

C.2.5

Discuss the relationship between data in a meta tag and how it is accessed by a web-crawler

A

Meta tags are tags that are not displayed to the user and are only meant for computers to read. They tell crawlers and other programs what the page is about.
The description meta tag provides the indexer with a short description of the page.
The keywords meta tag lists keywords that are relevant to the page.
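
As a rough illustration of how a crawler could read these tags, the sketch below parses a made-up page with Python's standard `html.parser` module. The `SAMPLE_PAGE` markup and the `MetaTagReader` class are hypothetical; how much weight a real search engine gives each tag is not shown here.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Stores the content of every <meta name="..." content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


# Hypothetical page source; the meta tags sit in <head> and are not displayed.
SAMPLE_PAGE = """<html><head>
<title>Revision notes</title>
<meta name="description" content="Flashcards on how search engines work">
<meta name="keywords" content="crawler, index, PageRank">
</head><body><p>Visible text</p></body></html>"""

reader = MetaTagReader()
reader.feed(SAMPLE_PAGE)
description = reader.meta.get("description")
keywords = [k.strip() for k in reader.meta.get("keywords", "").split(",")]
```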

7
Q

C.2.6

Discuss the use of parallel web-crawling

A

The web is growing at an astonishing pace. As such, it is necessary to parallelise the crawling process to speed it up.

Advantages
* Faster
* Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load

Disadvantages
* Web crawlers may overlap and index the same page more than once
* Parallel web crawlers need to communicate with each other to effectively crawl the web. This takes up communication bandwidth
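
One way to limit the overlap problem is to give each URL to exactly one crawler, decided by a hash of the URL, so the crawlers do not need to ask each other who owns a page. The sketch below assumes this static assignment scheme; the URLs, the worker count, and `fake_fetch` (a stand-in for a real download) are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def assign_worker(url, worker_count):
    """Map a URL to one worker using a stable hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


def fake_fetch(url):
    """Hypothetical stand-in for downloading a page."""
    return f"contents of {url}"


def crawl_partition(urls):
    """Work done by one crawler: fetch only the URLs assigned to it."""
    return {url: fake_fetch(url) for url in urls}


URLS = [f"https://example.com/page{i}" for i in range(10)]
WORKERS = 3

# Split the URL list so that every URL belongs to exactly one worker.
partitions = [[] for _ in range(WORKERS)]
for url in URLS:
    partitions[assign_worker(url, WORKERS)].append(url)

# Run the crawlers in parallel, one thread per partition.
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(crawl_partition, partitions))

# Combine the per-crawler results into one collection.
merged = {}
for partial in results:
    merged.update(partial)
```

The trade-off is that a crawler which discovers a link owned by another crawler still has to hand it over, which is where the communication cost listed above comes from.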

8
Q

C.2.7

Outline the purpose of web-indexing in search engines

A

Indexing websites allows search engines to quickly locate relevant information for users. Information is stored about each indexed website, such as its ranking, relevant keywords and metadata. This helps search engines rank websites and return helpful results for search queries.
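
A common data structure for this kind of lookup is an inverted index, which maps each word to the pages that contain it. The sketch below is a simplified assumption of how such an index could work; the `PAGES` data is invented, and a real index would also store ranking information and metadata.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> set of URLs containing that word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "/crawlers": "A web crawler downloads pages and follows links",
    "/ranking": "PageRank counts links to rank pages",
    "/meta": "Meta tags describe a page to the crawler",
}
index = build_index(PAGES)
```

Answering a query such as `search(index, "links pages")` only needs a couple of dictionary lookups and a set intersection, instead of rescanning every page's text.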

9
Q

C.2.8-9

Suggest how developers can create pages that appear more prominently in search engine results. Describe the different metrics used by search engines.

A
  • How many websites link to this website.
  • The clickthrough rate (how likely a user is to click on your website)
  • The bounce rate (how likely a user is to immediately leave your site after clicking)
  • Dwell time (how long a user stays on your webpage)
  • Using more semantic tags in your HTML which tell the bot what your website is about (article tags, section tags, h1 tag, h2 tag, footer tag)
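
The three user-behaviour metrics in the list above are simple ratios and averages. The sketch below shows one possible way to compute them; the sample numbers are invented, and the 10-second cut-off for counting a visit as a bounce is an assumption for illustration, not a published threshold.

```python
BOUNCE_THRESHOLD_SECONDS = 10  # hypothetical cut-off for "left immediately"


def clickthrough_rate(clicks, impressions):
    """Fraction of times the result was clicked when it was shown."""
    return clicks / impressions if impressions else 0.0


def bounce_rate(dwell_times):
    """Fraction of visits shorter than the bounce threshold."""
    if not dwell_times:
        return 0.0
    bounces = sum(1 for s in dwell_times if s < BOUNCE_THRESHOLD_SECONDS)
    return bounces / len(dwell_times)


def average_dwell_time(dwell_times):
    """Mean number of seconds a visitor stayed on the page."""
    return sum(dwell_times) / len(dwell_times) if dwell_times else 0.0


# Invented data: the result was shown 200 times and clicked 5 times;
# each number is how long (in seconds) one of those visitors stayed.
impressions = 200
dwell_times = [3, 45, 120, 5, 60]

ctr = clickthrough_rate(len(dwell_times), impressions)
bounce = bounce_rate(dwell_times)
dwell = average_dwell_time(dwell_times)
```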
10
Q

C.2.11

Discuss the use of white hat search engine optimisation

A
  • Guest blogging: Writing a blog post in someone else’s blog. At the end of the blog post you can insert a link to your site, thereby increasing the number of incoming links to your site.
  • Quality content: Writing quality content encourages users to stay longer, increasing dwell time.
  • Link baiting: Creating content that entices users to click on it and encourages other sites to link to it, increasing click-through rate and the number of incoming links.
11
Q

C.2.11

Discuss the use of black hat search engine optimisation

A
  • Keyword stuffing: Overloading a page with repeated keywords in an attempt to manipulate its ranking.
  • Link farming: Creating groups of websites with hyperlinks that all link to your own.
  • Blog comment spamming: Automated posting of hyperlinks for promotion on any kind of publicly accessible online discussion board
12
Q

C.2.12

Outline future challenges to search engines as the web continues to grow

A

As the web grows, it becomes harder for search engines to pick out the most relevant information from everything that is available, and paid results (ads) play an important role in what users see.
