System Design Flashcards by Dee Ena

Acronym for system design checklist

SAD Ninja FAR Read/ write S [from Cracking the coding interview]
-----------------------------------------------------
Scalability
Asynchronous operations or not?
Database 
Network metrics
Failures
Availability 
Reliability 
Read/ write heavy
Security

Also (SEARS)
-----------------------------------------------------
Scalability
Efficiency
Availability
Reliability
Serviceability or Manageability

How well did you know this?

Not at all

Perfectly

Design a web crawler

To crawler a single web page, all we need is to issue a HTTP GET request to the corresponding URL and parse the response data, which is kind of the core of a crawler. With that in mind, a basic web crawler can work like this:

Start with a URL pool that contains all the websites we want to crawl.
For each URL, issue a HTTP GET request to fetch the web page content.
Parse the content (usually HTML) and extract potential URLs that we want to crawl.
Add new URLs to the pool and keep crawling.

How well did you know this?

Not at all

Perfectly

How to ensure that the same url is not added in the pool in a distributed system?

One common approach is to use Bloom Filter. In a nutshell, a bloom filter is a space-efficient system that allows you to test if an element is in a set. However, it may have false positive. In other words, a bloom filter can tell you that , either a URL is definitely not in the pool or it is probably in the pool.

To briefly explain how bloom filter works, an empty bloom filter is a bit array of m bits (all 0). There are also k hash functions that map each element to one of the m bits. So when we add a new element (URL) into the bloom filter, we will get k bits from the hash functions and set all of them to 1. Thus, when we check the existence of an element, we first get the k bits for it and if any of them is not 1, we know immediately that the element doesn’t exist. However, if all of the k bits are 1, this can come from the combination of several other elements.

Bloom filter is a very commonly used technique and it’s the perfect solution for deduping URLs in a web crawler.

How well did you know this?

Not at all

Perfectly

Advantages and Disadvantages of SQL databases

Advantages:

Faster Query Processing
No Coding Skills
Standardized Language
Portable
Easy to learn

Disadvantages

Interface can get complex
can be costly
Need to have relationships between data

How well did you know this?

Not at all

Perfectly

Advantages and Disadvantages of No -SQL databases

Advantages

It stores as a key value system
Flexibiltiy to add/remove columns
In Sql ,you may need joins sometimes , to retrieve data from different tables , but here all your required data is contained in one block. So it’s little easier to insert and retrieve data from NoSql.
These DB’s are built for finding metrics and getting intelligent data

Disadvantages

Relations and Joins are hard to apply
Reads can be slow
In NoSQL two same id’s may have totally different data which makes updates difficult

How well did you know this?

Not at all

Perfectly

How can you do async processing?

RabbitMQ. RabbitMQ is one of many systems which help to implement async processing.

How well did you know this?

Not at all

Perfectly

No SQL DB types

Key-Value Stores: Data is stored in an array of key-value pairs. Well-known key-value stores include Redis, Voldemort, and Dynamo. Used for rapidly changing data(like in memory cache) as it gives better performance

Document Databases: In these databases, data is stored in documents and these documents are grouped together in collections. Its a key-value store with documents stored as values. Each document can have an entirely different structure. CouchDB and MongoDB. provide high flexibility and are often used for working with occasionally changing data.

Wide-Column Databases: Instead of ‘tables,’ we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase.

Graph Databases: These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph database include Neo4J and InfiniteGraph. used to model complex relations

How well did you know this?

Not at all

Perfectly

80-20 rule in system design

it means 20% of the requests generates 80% traffic

How well did you know this?

Not at all

Perfectly

How many bits does each base64 encoding digit represent

Each non-final Base64 digit represents exactly 6 bits of data

How well did you know this?

Not at all

Perfectly

Data Partitioning and Replication

Range Based Partitioning(DB Sharding): We can store URLs in separate partitions based on the hash key’s first letter. Hence we save all the URLs starting with the letter ‘A’ (and ‘a’) in one partition, save those that start with the letter ‘B’ in another partition, and so on. It should be a static partitioning scheme to always store/find a URL in a predictable manner.

The main problem with this approach is that it can lead to unbalanced partitions. For example, we decide to put all URLs starting with the letter ‘E’ into a DB partition, but later we realize that we have too many URLs that start with the letter ‘E.’

Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object.

Our hashing function will randomly distribute URLs into different partitions This approach can still lead to overloaded partitions

How well did you know this?

Not at all

Perfectly

How much memory do modern day servers have

modern-day server can have 256GB memory,

How well did you know this?

Not at all

Perfectly

XMPP

XMPP is the Extensible Messaging and Presence Protocol, a set of open technologies for instant messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight middleware, content syndication, and generalized routing of XML data.

How well did you know this?

Not at all

Perfectly

Traffic can be represented as ?

URL’s/s. Like how many redirect url’s come in for TinyURL

What would be Queries Per Second (QPS) for our system? New URLs shortenings per second: assuming we get 500 Million new url’s generated/month

500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s

How well did you know this?

Not at all

Perfectly

Bandwidth can be represented as ?

this is KB/s.

For read requests, since every second we expect ~20K URLs redirections, total outgoing data for our service would be 10MB per second:

20K * 500 bytes = ~10 MB/s

How well did you know this?

Not at all

Perfectly

DNS Servers vs ISP Servers

Internet service providers (ISPs) bring the bandwidth , from where it’s produced, to wherever you happen to be sitting, wanting to use it. They’re the freight trucking companies of the Internet. If they do a poor job, the bandwidth is slow, packet delivery is unreliable, and performance is unpredictable. Eg like comcast, AT&T

The Domain Name System (DNS) is like the telephone directory of the Internet, which your computer or mobile device use to look up the addresses of the web sites you want to visit

How well did you know this?

Not at all

Perfectly

What does it mean if you say your web service is scalable?

Study These Flashcards

It means that if we increased resources(number of servers, memory etc) there will be better performance proportional to whatever resources were added

Things to consider to make a service scalable

Study These Flashcards

Caching(Redis)
Ability to clone servers (every server contains exactly the same codebase and does not store any user-related data, like sessions or profile pictures, on local disc or memory.)
Database (if using rdbms- sharding or partitioning else use nosql db)
Ability to perform asynchronous processing(using rabbitmq)

CAP Theorem

Study These Flashcards

In a distributed computer system, you can only support TWO of the following guarantees:

Consistency - Every read receives the most recent write or an error
Availability - Every request receives a response, without guarantee that it contains the most recent version of the information
Partition Tolerance - The system continues to operate despite arbitrary partitioning due to network failures
Networks aren’t reliable, so you’ll need to support partition tolerance. You’ll need to make a software tradeoff between consistency and availability.

CP - consistency and partition tolerance
Waiting for a response from the partitioned node might result in a timeout error. CP is a good choice if your business needs require atomic reads and writes.

AP - availability and partition tolerance
Responses return the most readily available version of the data available on any node, which might not be the latest. Writes might take some time to propagate when the partition is resolved. Best used in scenarios where you can wait for writes to happen but you want other functionalities to continue(like you might not be able to add to cart but can still browse)

Different patterns to achieve consistency?

Study These Flashcards

Weak :After a write, reads may or may not see it. used in voip or video calling. example: if you lose the connection during a phone call and then rejoin you don’t hear whatever was spoken when you lost your connection

Strong : after write, reads will see it. used in systems where transactions are needed like rdbms

Eventual : after write read will eventually see it. happens asynchronously. like in email

Different patterns to achieve availability?

Study These Flashcards

Failover : you have active-passive and active-active.
in active passive heartbeats are passed across servers. if active stops sending passive takes over
in active active both servers are managing traffic.
Disadvantage :
Fail-over adds more hardware and additional complexity.
There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.

Replication
master-slave and master-master servers

What are CDN’s

Study These Flashcards

A content delivery network (CDN) is a globally distributed network of proxy servers, serving content from locations closer to the user. Generally, static files such as HTML/CSS/JS, photos, and videos are served from CDN, although some CDNs such as Amazon’s CloudFront support dynamic content. The site’s DNS resolution will tell clients which server to contact.

Advantages:

Users receive content from data centers close to them
Your servers do not have to serve requests that the CDN fulfills

Disadvantages:
Costs can be high
data can become stale

What is Reverse proxy

Study These Flashcards

It is like having a layer before your app server that serves as the public IP. Provides a unified interfaces for your services

advantages
Increased security - Hide information about backend servers, blacklist IPs, limit number of connections per client
Increased scalability and flexibility - Clients only see the reverse proxy’s IP, allowing you to scale servers or change their configuration
Decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations
Compression - Compress server responses
Caching - Return the response for cached requests
Static content - can serve photos or html directly

How to scale RDBMS in a system

Study These Flashcards

master-slave replication : reads and writes go to master that replicates its data to the slaves. Consequently read requests go to slaves.

master-master replication : multiple masters take both read/writes and continuously sync with each other

federation : separate db’s for users, products.

Sharding : partitioning based on a certain range like A-G , H-K

denormalization : Redundant copies of the data are written in multiple tables to avoid expensive joins

SQL tuning : using indexes, less joins

Problems of data partitioning

Study These Flashcards

Joins are tough to do in partitions
One way around joins is denormalize the tables i.e move data that has to be joined to their own tables. It cause downtime and add the problem of inconsistency that comes with denormalization

You cant have foreign key constraints. What if the FK lies in a different partition?

Rebalancing - If one server is overloaded , you may need to rebalance. One way you can avoid this from happening is having a look up table(directory based partitioning ) which maps the key of the entity to which server it is present in. But this introduces a single point of failure and adds complexity

Cache invalidation techniques/ when to update a cache

Cache aside : the client/ server hits the cache, there is a cache miss, loads from db and then updates cache. Memcache does this. Write-through: data is written in cache & DB. completion is confirmed only when data is written in both places Write-back/write around: data is written in cache first; I/O completion is confirmed when data is written in cache; data is written to DB asynchronously (background job) and does not block the request from being processed Refresh ahead : the db refreshes the data in cache before it expires. This needs to be accurately predicated though

When do you say your system has low performance ? | When do you say your system is not scalable?

Performance: it is slow for a single user | Scalable :It is fast for a single user but slow for heavier load

Diff mechanisms for CDN

push: receive new content whenever changes occur on your server. SO server is responsible for updating. Useful because we save bandwidth and have better performance. Good for sites with low traffic and where content doesn't change often. pull: the CDN pulls data from the servers when the FIRST user requests for it. You leave the content on your server and rewrite URLs to point to the CDN. This results in a slower request until the content is cached on the CDN. There is TTL associated with it. Causes a lot of traffic. Good for systems with heavy traffic as it gets distributed evenly and only hot content is on the CDN

Load Balancing uses and methods

1 Preventing requests from going to unhealthy servers 2 Preventing overloading resources 3 Helping to eliminate a single point of failure 4 can provide encryption and decryption to avoid BE servers from doing these expensive operations 5 maintain sessions so we can route specific requests to specific servers ``` It can be done Random Round robin/ Weighted round robin Based on load of servers Layer 4 Layer 7 ```

if we want to generate unique hashes what can we use?

MD5 or SHA

Powers of 2 and their meaning

Power | Exact value | Approx Value | Bytes 7 | 128 | | 8 | 256 | | 10 | 1024 | 1000 | 1 KB 16 | 65,536 | | 64 KB 20 | 1,048,576 | 1 million | 1 MB 30 | 1073741824 | 1 billion | 1 GB 32 | | 4 billion | 4GB 40 | | 1 trillion | 1 TB if 2 ^ 30 = 1GB then 2^29 = 2^30 / 2 = 1/2 GB

System Design Flashcards

(32 cards)