1.3.4 Flashcards
Web technologies (11 cards)
Search Engines
- Search a database of web addresses to find resources based on criteria set by the user.
- Rely on an index of pages through which they search.
- Web crawlers build the index by traversing the internet and exploring every link on each page they visit.
- Crawlers collect keywords, phrases and metadata from pages (a sketch of the resulting index follows this list).
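A minimal sketch of the kind of inverted index a crawler might build, mapping each keyword to the set of pages it appears on. The page URLs and text below are invented purely for illustration.

```python
# Tiny inverted index: keyword -> set of pages containing that keyword.
# The pages and their text are invented for illustration only.
pages = {
    "http://example.com/":        "welcome to the example site",
    "http://example.com/news":    "latest news from the example site",
    "http://example.com/contact": "contact the site team",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

print(index["site"])  # every page that contains the word "site"
```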
Indexing
Computers fetch (or “crawl”) pages and other resources (e.g. PDF files) on the web. The program that does the fetching is known as a robot, bot, or spider (or, in Google’s case, Googlebot).
- The crawl process begins with a list of URLs, generated from previous crawls. When the crawler visits a page with links, it adds these pages to the list of pages to crawl. Dead links are removed from the index. Pages that are not accessible by an anonymous user cannot be crawled.
- During the crawl, the page is rendered in a browser and any scripts are run.
- The text of each page is extracted, ignoring extraneous material such as HTML tags and punctuation, and stored in the index.
- The order of words is recorded so that searching for particular phrases is possible.
- A robots.txt file can be used to give instructions to web crawlers, for example to allow Googlebot but block the crawlers of other search engines. The file is stored in the site root, e.g. http://www.example.com/robots.txt (see the sketch after this list).
- When you create a new website, you can also contact the search engine to let it know you would like the site to be crawled and added to its index.
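A short sketch of how a crawler could honour such a file using Python's built-in urllib.robotparser. The robots.txt rules and URLs below are illustrative only; they allow Googlebot everywhere and block every other crawler.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: allow Googlebot, block all other crawlers.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# A well-behaved crawler checks before fetching each page.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))  # True
print(parser.can_fetch("OtherBot", "http://www.example.com/index.html"))   # False
```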
Retrieving
A user submits a search query using Google’s search engine. Language models are used to decipher which words to look up in the index.
Google searches its index for resources with relevant content. The most basic signal that information is relevant is when a webpage contains the same keywords as the search query.
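Continuing the inverted-index idea from the earlier sketch, the look-up step can be pictured as intersecting the index entries for each keyword in the query. The index contents here are invented for illustration.

```python
# Illustrative inverted index: keyword -> set of pages containing it.
index = {
    "bristol":  {"pageA", "pageC"},
    "football": {"pageA", "pageB", "pageC"},
    "results":  {"pageB", "pageC"},
}

def lookup(query):
    """Return the pages that contain every keyword in the query."""
    keywords = query.lower().split()
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

print(lookup("Bristol football"))  # {'pageA', 'pageC'}
```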
Ranking
- Frequency of search terms - how often the search terms appear in the page and where they appear, for example in the title or in an image’s title. Term Frequency - Inverse Document Frequency (extension) measures how relevant a word is to a document within a collection of documents (see the sketch after this list).
- Context and setting - your location, past search history and search settings. Country and location are used to show content relevant for your area. For instance, if you’re in Bristol and you search ‘football’, Google will most likely show you results about English football and Bristol City first.
- Language - if language detection is enabled/possible the search engine returns the results in the user’s preferred language.
- Usability - whether the page appears correctly in different browsers, is designed for all device types and screen sizes, and loads acceptably for users on slow internet connections.
- Freshness of content - pages published more recently tend to have more accurate information.
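A small sketch of the Term Frequency - Inverse Document Frequency idea mentioned above: a word scores highly for a document when it is frequent in that document but rare across the whole collection. The documents are invented, and this is one common formulation rather than the exact calculation any particular search engine uses.

```python
import math

# Invented collection of documents.
docs = {
    "doc1": "bristol city football club",
    "doc2": "football results and football news",
    "doc3": "city council news",
}

def tf_idf(word, doc_id):
    words = docs[doc_id].split()
    tf = words.count(word) / len(words)                                    # frequency within this document
    containing = sum(1 for text in docs.values() if word in text.split())
    idf = math.log(len(docs) / containing)                                 # rarer across the collection -> larger
    return tf * idf

print(round(tf_idf("football", "doc2"), 3))  # frequent here, but found in 2 of the 3 documents
print(round(tf_idf("bristol", "doc1"), 3))   # found in only 1 of the 3 documents, so a higher idf
```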
PageRank Summary
- The PageRank algorithm involves statistical analysis of the links to a web page. It requires several iterations through the collection to adjust the approximate PageRank values (see the sketch after this list).
- More incoming links are better. Incoming links from web pages that have few outgoing links are better. Links from web pages that have high PageRank are better.
- PageRank does not rank web sites as a whole, but is determined for each page individually.
- PageRank doesn’t determine which webpages are included in the search results when a search term is entered into Google; that is determined by the relevance of titles, keywords and phrases contained within those pages.
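A minimal sketch of the iterative calculation, using a made-up four-page link graph and the commonly quoted damping factor of 0.85. Real implementations work on vastly larger graphs and handle edge cases (such as pages with no outgoing links) that are ignored here.

```python
# Made-up link graph: page -> pages it links out to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
rank = {page: 1 / len(links) for page in links}  # start with equal ranks

# Several iterations through the collection to let the approximate values settle.
for _ in range(20):
    new_rank = {}
    for page in links:
        incoming = sum(rank[other] / len(outgoing)
                       for other, outgoing in links.items() if page in outgoing)
        new_rank[page] = (1 - damping) / len(links) + damping * incoming
    rank = new_rank

for page, value in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```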
Uses of Server Side on the web
Accessing a database (see the sketch after this list)
Security and authentication
Search engines
Cloud services
Hosting APIs
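As a hedged illustration of database access and API hosting on the server side, the sketch below uses Flask (a common Python web framework); the endpoint name and the hard-coded dictionary standing in for a real database are invented for this example.

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Stand-in for a real database table; the client never sees this directly.
PRODUCTS = {
    1: {"name": "Keyboard", "price": 25.00},
    2: {"name": "Mouse", "price": 12.50},
}

@app.route("/api/products/<int:product_id>")
def get_product(product_id):
    product = PRODUCTS.get(product_id)
    if product is None:
        abort(404)            # nothing beyond the requested record is exposed
    return jsonify(product)   # only the selected record is sent to the client

if __name__ == "__main__":
    app.run()  # serves the API locally for testing
```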
Uses of Server Side in other areas
Traditional file storage (e.g. a NAS)
Infrastructure (for LANs and WANs)
Virtual machines for thin clients
Uses of Client Side on the web
Interactivity without page reloads
Validation rules
Animations, games, etc.
Responsive design
Any ‘offline’ abilities
Uses of Client Side in other areas
Traditional desktop apps
Any processing in mobile apps
Video games (apart from Stadia!)
Advantages of Server Side
Can access secure data and control what the client sees.
Should be far more secure against code manipulation.
The server is probably more powerful than the client, so it can potentially tackle heavy workloads in less time.
Advantages of Client Side
There is no need to wait for the server to respond, so delays are reduced.
The client can be customised so as to produce a different experience for each user.
The workload on the server is less, potentially reducing delays for all users.