System Architecture Flashcards

(46 cards)

1
Q

Load Balancing

A

Static:
- Round robin
- Weighted round robin
- IP hash

Dynamic:
- Least connections
- Weighted least connections (assign different weights to each server)
- Weighted response time
- Resource-based (CPU, memory)

Global server load balancing (CDN)

2
Q

Estimate set membership

A

Bloom filter

  • Elements are added to the filter by hashing them multiple times and setting the bits at the resulting indices in a bit array.
  • To check if an element is present, it’s hashed in the same way, and all corresponding bits in the array are checked.
  • If all the checked bits are set, the element might be in the set (potential false positive); if any bit is not set, the element is definitely not in the set (no false negatives).
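
A minimal sketch of the mechanism described above; the bit-array size, the number of hashes, and the salted-SHA-256 trick for deriving k indices are illustrative choices, not a prescribed implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m-bit array, k salted hashes per item."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # False => definitely absent; True => possibly present (false positives allowed).
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True
print(bf.might_contain("bob"))    # False (with high probability)
```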
3
Q

Estimate set cardinality

A

HyperLogLog

  • HyperLogLog estimates the number of unique elements in a dataset by observing the longest run of leading zeros in the binary representation of hashed elements.
  • Rather than using many independent hash functions, it hashes each element once and uses the leading bits of the hash to pick one of many registers (buckets); each register keeps the longest zero-run it has seen. This stochastic averaging reduces variance and improves accuracy.
  • The final cardinality estimate is obtained by combining the register values with a harmonic mean, with bias corrections applied for small and large cardinalities.
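
A simplified, illustrative sketch of the register-based estimate (single 64-bit hash, 2^b registers, no small- or large-range corrections):

```python
import hashlib

def hll_estimate(items, b=10):
    """Simplified HyperLogLog: 2^b registers, each storing the max leading-zero run."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - b)                       # leading b bits pick the register
        rest = h & ((1 << (64 - b)) - 1)          # remaining bits determine the zero run
        rank = (64 - b) - rest.bit_length() + 1   # position of the first 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)              # bias-correction constant for large m
    return alpha * m * m / sum(2 ** -r for r in registers)

print(round(hll_estimate(range(100_000))))  # roughly 100,000 (a few percent error)
```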
4
Q

Estimate counters

A

Count-min sketch

  • It uses multiple hash functions to map each item to a set of counters in a 2D array. When an item’s count is incremented, the corresponding counters in each row are increased.
  • To estimate the count of an item, the sketch hashes the item using the same functions and retrieves the values of the corresponding counters.
  • The minimum of these retrieved counter values provides an estimate of the item’s frequency. This estimate is guaranteed to be greater than or equal to the true frequency (no underestimation), but may overestimate it.
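
A minimal sketch of the 2D counter array described above; the width, depth, and salted hash are illustrative choices:

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: d rows of w counters; estimate = min over the d rows."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Never underestimates; hash collisions can only inflate counters.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(42):
    cms.add("page:/home")
print(cms.estimate("page:/home"))  # >= 42
```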
5
Q

Consistent hashing

A
  • Distributes data across a dynamic number of servers (nodes) with minimal disruption. Unlike traditional hashing, when a node is added or removed, only a small fraction of the keys need to be remapped.
  • Uses a hash ring. Both the keys and the servers are hashed onto a circular space (the “ring”). Each key is assigned to the next server in the ring in a clockwise direction.
  • Improves scalability and fault tolerance. By minimizing data movement during scaling events and node failures, consistent hashing reduces the load on the system and improves its resilience.
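
A minimal hash-ring sketch with virtual nodes; the node names, MD5 hash, and replica count are illustrative:

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes (replicas) per server."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        # Walk clockwise: first virtual node at or after the key's hash.
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))
ring.add_node("cache-d")  # only a fraction of keys move to the new node
```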
6
Q

Design bit.ly

A
  • Read / write patterns strongly favor reads. Highly cacheable.
  • Expiration time on URLs
  • Validate the URL on creation (using a library like is-url)
  • base62 encoding (not 64); see the sketch below
  • 302 (temporary redirect) to enable analytics and give us more control. 301s cache in browsers and bypass the server.
  • Redis caching for heavy reads (1,000x faster than SSD reads, millions of reads per second) – cache invalidation can be complex
  • CDN / edge computing: Points of Presence (PoPs). Can avoid having to go back to the origin server. Cache invalidation is still a problem. Potentially higher costs. Debugging becomes more challenging.
  • Break the service out into reads and writes so they can scale separately.
  • Redis for incrementing ID, could use counter batching to reduce overhead if needed.
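
A sketch of base62-encoding a numeric ID (such as one taken from the Redis counter above); the alphabet order is an arbitrary choice:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Turn an auto-incrementing ID into a short URL slug."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

print(base62_encode(123456789))  # "8m0Kx" with this alphabet
```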
7
Q

Common non-functional requirements

A
  • Authentication / authorization
  • Observability and alerting
  • The system should be well-tested and easy to deploy (CI/CD pipelines).
  • The system should be fault-tolerant.
  • The system should have regular backups.
8
Q

Hash indexing

A

Useful for exact-match queries. O(1) average-case lookup. Supported in PostgreSQL.

9
Q

Data access times

A
  • Memory: 100 nanoseconds (0.0001 ms)
  • SSD: 0.1 ms
  • HDD: 10 ms
10
Q

CDN

A

Content delivery network
- Points of Presence (PoPs) geographically distributed around the world
- Can avoid having to go back to the origin server.

Potential problems:
- Cache invalidation is still a problem.
- Potentially higher costs.
- Debugging becomes more challenging.

11
Q

DB resilience (Postgres)

A
  • Replicas (hot backup, synchronous in same region, async across regions)
  • DB backups
12
Q

Design LeetCode

A
  • Monaco online editor library for syntax highlighting / autocomplete
  • Security and isolation when running user code (containers vs serverless). With containers you need to manage these explicitly, and they are not as secure as VMs. Take care to configure and secure the containers to prevent users from breaking out and accessing the host system. Enforce resource limits.
    • Read Only Filesystem
    • CPU and Memory Bounds
    • Explicit Timeout
    • Limit Network Access
    • No System Calls (Seccomp)
  • Serverless: cold start times are a problem, may have resource limits that can negatively impact performance.
  • Live leaderboards / competitions!
  • Only 4k problems, very small scale
  • Leaderboard: Redis sorted set, client polls every 5 seconds for updates (likely don't need WebSockets); see the sketch after this list
  • Auto scaling (ECS: elastic container service)
  • Horizontal scaling with queues: buffer submissions during peak times. Enables retries if a container fails.
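
A sketch of the Redis sorted-set leaderboard using redis-py; the key names, contest ID, and point values are made up for illustration:

```python
import redis  # pip install redis

r = redis.Redis()

def record_solve(contest_id: str, user_id: str, points: float) -> None:
    # ZINCRBY keeps the member sorted by its running score.
    r.zincrby(f"leaderboard:{contest_id}", points, user_id)

def top_n(contest_id: str, n: int = 10):
    # Highest scores first; clients can poll this every few seconds.
    return r.zrevrange(f"leaderboard:{contest_id}", 0, n - 1, withscores=True)

record_solve("weekly-431", "user:123", 4)
print(top_n("weekly-431"))
```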
13
Q

REST API returning a list

A

Make sure you paginate! ?page=1&size=100

14
Q

Secure running user code

A
  • Read-only filesystem: write output to a temp dir
  • CPU and memory bounds
  • Explicit timeout: kill process if it exceeds timeout
  • Limit network access: disable it completely from the container
  • No system calls (seccomp - secure computing) - a mode of the kernel
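
A minimal, Linux-only sketch of two of the controls above (CPU/memory bounds and an explicit timeout) using Python's resource and subprocess modules; the container, network, and seccomp setup are not shown and the limits are illustrative:

```python
import resource
import subprocess

def limit_resources():
    # Applied in the child just before exec: 2s of CPU time, 256 MB of address space.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024, 256 * 1024 * 1024))

def run_untrusted(path: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["python3", path],
        preexec_fn=limit_resources,  # resource limits inherited by the child process
        capture_output=True,
        timeout=5,                   # hard wall-clock kill
    )
```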
15
Q

Design a leaderboard:
- global top 10
- friends top 10
- relative position global
- relative position friends

A
16
Q

Auto scaling containers

A

ECS (Elastic Container Service)

17
Q

Video / audio concepts

A
  • Codecs: encoders and decoders. Compress images, audio and video.
  • Video container: MPEG, etc. A video container is a file format that stores video data (frames, audio) and metadata. A container might house information like video transcripts as well.
  • Bitrate: The bitrate of a video is the number of bits transmitted over a period of time, typically measured in kilobits per second (kbps) or megabits per second (Mbps).
  • Manifest files: Manifest files are text-based documents that give details about video streams. There are typically two types: primary manifests and media manifests.
18
Q

Design Youtube

A
  • Presigned URLs -> AWS S3
  • Store different video formats as segments. Segments must be playable on their own. They are created at i-frame (key frame) boundaries.
  • Users watch videos using adaptive bitrate streaming. Client does more work to determine the bitrate it can handle.
  • Tools like ffmpeg are used to segment and encode
  • A data pipeline DAG is used to process videos efficiently (highly parallel). Can use something like Temporal here for orchestration. Can use S3 to pass temporary data between stages of the pipeline.
  • Resumable uploads: split video into chunks on the client side. Store metadata and track status of uploads per chunk. Use S3 event notifications to track chunk upload completions.
  • CDN: for video files and manifests
  • Speed up uploads: “pipeline” the process by uploading segments of the video and start processing them immediately while other segments are still uploading.
19
Q

Design Ticketmaster

A
  • Availability for viewing. Consistency for booking.
  • Entities: Event, User, Performer, Venue, Ticket, Booking
  • Booking endpoints: 2 endpoints, (1) reserve tickets, (2) book tickets. ACID properties necessary.
  • Booking service: uses Stripe for payment processing
  • Reservations: don’t lock in DB, status with expiration is okay, distributed locking with Redis is better
  • High-demand events: SSE for real-time seat updates helps; a virtual waiting queue is best.
  • Search indexing: index + SQL query optimization << full-text indexes (via extensions) in the DB == Elasticsearch
  • Search caching: Redis (keys based on search queries) << query result caching and edge caching (CDNs)
20
Q

Circuit Breaker implementations

A

Hystrix is a circuit-breaker library that stops cascading failures and provides real-time monitoring and configuration changes, concurrency-aware request caching, and automated batching through request collapsing. If a microservice fails, serve a default (fallback) response until it has recovered.

21
Q

Row-level locking vs Optimistic Concurrency Control (OCC)

22
Q

Redis Sentinel Architecture

23
Q

Redis distributed lock

A
  • SETNX (set if not exists) with a TTL; see the sketch below
  • Redlock algorithm (set the lock on multiple nodes with a quorum)
  • Libraries: Redsync (Redlock implementation in Go), Redisson
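
A sketch of the single-node SETNX + TTL variant with redis-py; for multi-node guarantees you would reach for Redlock via Redsync or Redisson instead. Key naming is illustrative:

```python
import uuid
import redis

r = redis.Redis()

def acquire_lock(name: str, ttl_ms: int = 10_000):
    token = str(uuid.uuid4())
    # SET key value NX PX ttl: only succeeds if the key does not exist yet.
    if r.set(f"lock:{name}", token, nx=True, px=ttl_ms):
        return token
    return None

# Release only if we still hold the lock (compare-and-delete must be atomic).
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def release_lock(name: str, token: str) -> bool:
    return bool(r.eval(RELEASE_SCRIPT, 1, f"lock:{name}", token))
```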
24
Q

Lucene architecture

25
Inverted index
An inverted index is a data structure that maps from words to the documents that contain them. This allows you to quickly find documents that contain a given word.
- Used for full-text search
- Used by Lucene
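
A tiny sketch of building and querying an inverted index (no stemming, lemmatization, or stop-word removal); the documents are made up:

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
index = build_index(docs)
print(index["the"])  # {'d1', 'd2'}
print(index["fox"])  # {'d1'}
```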
26
Lemmatization
Lemmatization is the process of reducing a word to its dictionary (lemma) form.
27
Stemming
Stemming is the process of transforming a word into its root form by cutting off the ending of the word. It is similar to lemmatization, but it cannot handle irregular verbs; unlike lemmatization, it can handle words that are not in the dictionary.
28
Stop words
Stop words are the most common words in a language or dataset; they carry little or no semantic weight.
29
Design an Ad Click Aggregator
- Users can click on an ad and be redirected to the advertiser's website.
- Advertisers can query ad click metrics over time with a minimum granularity of 1 minute.
- Track clicks via redirect. If we allow redirects to happen client-side we risk not knowing about them.
- Analytics: separate DB + batch processing << stream processing
- Cassandra for raw click data
- Spark for batch processing, run every minute
- Streaming (Flink): keep counts in memory and periodically flush to an OLAP DB
- Scaling streams: shard by AdId, but can further bucket by appending a random number to the end of the AdId to spread out the load
- Durability: Kafka is durable. Flink could use checkpointing, but it's not particularly relevant when processing 1-minute intervals. Reconciliation is a good way to double-check as well.
- Prevent abuse: generate a unique impression ID and check it in a cache early in the stream (before aggregating). Sign the impression ID to avoid falsified IDs.
- Aggregate data at coarse granularity (daily, weekly) via a nightly cron job.
30
OLAP DBs
- Druid
- Redshift
- Snowflake
- BigQuery
31
Facebook news feed (Twitter)
- Users should be able to create posts.
- Users should be able to friend/follow people.
- Users should be able to view a feed of posts from people they follow, in chronological order.
- Users should be able to page through their feed.
- Use a simple table for follower -> followed.
- Lots of users results in a fan-out problem: lots of reads, will be brittle and slow (long-tail problem).
- Use something like Snowflake IDs to make sure post IDs are chronologically sortable.
- Precompute a feeds table in DynamoDB. We'll use a partition key of the userId of the feed, and its value will be a list of post IDs in order.
- Async workers process posts and write to the feeds of their followers.
- Don't fan out writes for users with many followers (celebrities, > 100k). Fan out reads and merge with precomputed feeds at read time.
- Hot keys: redundant caches (do not distribute the data across nodes; the caches can operate independently), which spreads the load evenly across the cluster.
32
Triple stores
https://en.wikipedia.org/wiki/Triplestore
A triplestore or RDF store is a purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject–predicate–object, like "Bob is 35".
33
Job scheduler (similar to scheduling Twitter posts):
- Users should be able to schedule jobs to be executed immediately, at a future date, or on a recurring schedule (e.g. "every day at 10:00 AM").
- Users should be able to monitor the status of their jobs.
- The system should execute jobs within 2s of their scheduled time.
- Entities: users, jobs, tasks, schedule, executions (we need to separate the definition of a job from its execution instances).
- By using a time bucket (Unix timestamp rounded down to the nearest hour) as our partition key, we achieve efficient querying while avoiding hot partition issues.
- When a recurring job completes, we can easily schedule its next occurrence by calculating the next execution time and creating a new entry in the Executions table.
- GSI (global secondary index) on UserID to get all jobs for a user.
- Two-layered scheduler architecture: 1) query the DB every 5 minutes, 2) publish messages to a queue in order of execution_time with delayed delivery (SQS Delivery Delay), 3) "just in time" jobs: publish directly to the queue.
- AWS default limit is 3,000 messages per second for SQS.
- ECS or Kubernetes for autoscaling.
- At-least-once execution: SQS visibility timeout + heartbeats.
- We need to ensure our task code is idempotent: design jobs to be naturally idempotent by using idempotency keys and conditional operations. For example, instead of "increment counter", the job would be "set counter to X". Instead of "send welcome email", we'd first check if the welcome-email flag is already set in the user's profile. Each job execution includes a unique identifier that downstream services can use to deduplicate requests.
34
SSTables
Sorted String Tables - Used in Cassandra
35
Instagram
- Users should be able to create posts featuring photos, videos, and a simple caption.
- Users should be able to follow other users.
- Users should be able to see a chronological feed of posts from the users they follow.
- Hybrid approach to compute the feed: fan-out on write + real-time
- Upload media: S3, multi-part upload, S3 notification to track status
- CDN + dynamic media optimization
36
Online auction
- Users should be able to post an item for auction with a starting price and end date.
- Users should be able to bid on an item, where bids are accepted only if they are higher than the current highest bid.
- Users should be able to view an auction, including the current highest bid. The system should maintain strong consistency for bids to ensure all users see the same highest bid.
- Use optimistic concurrency control (OCC) to update max_bid on the auction row: 1. read the max bid from the row, 2. conditionally update the max bid if the value has not changed (see the sketch below).
- Message queue to process bids: 1. durable storage, 2. buffer against load spikes, 3. guaranteed ordering.
- SSE for real-time client updates; more efficient than polling / long polling.
- Pub/Sub (Redis) to broadcast max-bid updates to clients watching the auction.
- Dynamic auction end times: update the end time with each bid + a cron job, or use an SQS queue with delayed delivery. Check if the winner has changed; if not, end the auction.
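
A sketch of the optimistic-concurrency step as a compare-and-set loop; InMemoryAuctionStore is a stand-in for the database, where the conditional write would be an UPDATE guarded by the previously read max_bid value:

```python
import threading

class InMemoryAuctionStore:
    """Stand-in for the database; compare_and_set mimics a conditional UPDATE."""
    def __init__(self):
        self._max_bids = {}
        self._lock = threading.Lock()

    def read_max_bid(self, auction_id):
        return self._max_bids.get(auction_id, 0.0)

    def compare_and_set_max_bid(self, auction_id, expected, new):
        with self._lock:
            if self._max_bids.get(auction_id, 0.0) != expected:
                return False
            self._max_bids[auction_id] = new
            return True

def place_bid(store, auction_id, bid, max_retries=3):
    """Optimistic concurrency: only write if max_bid hasn't changed since we read it."""
    for _ in range(max_retries):
        current = store.read_max_bid(auction_id)                     # 1. read
        if bid <= current:
            return False                                             # not the highest bid
        if store.compare_and_set_max_bid(auction_id, current, bid):  # 2. conditional write
            return True
    return False  # lost the race repeatedly; caller can retry or reject

store = InMemoryAuctionStore()
print(place_bid(store, "auction:1", 100.0))  # True
print(place_bid(store, "auction:1", 90.0))   # False (below current max)
```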
37
Dropbox
- Users should be able to upload a file from any device.
- Users should be able to download a file from any device.
- Users should be able to share a file with other users and view the files shared with them.
- Users can automatically sync files across devices.
- Availability >> consistency (perhaps not intuitive here)
- Presigned URLs to upload
- CDN: we can use a Cache-Control header to specify how long the file should be cached in the CDN.
- Sharing with users: join table mapping user -> fileId they have access to. Can use a transaction to make sure it stays consistent with the share list in the file metadata.
- Monitor files: FileSystemWatcher or FSEvents
- Conflicts are resolved using a "last write wins" strategy.
- Classify files as fresh vs stale. Can use SSE for real-time updates for fresh files and periodic polling for stale files.
- Large files: 1. progress indicator, 2. resumable uploads
- Fingerprint: SHA-256
- Compression to speed up uploads and downloads: don't try to compress already-compressed files.
- Encrypt in transit (HTTPS), encrypt at rest in S3, access control
- Generate a signed download URL with a TTL
38
Presigned URLs
Presigned URLs are a feature provided by cloud storage services, such as Amazon S3, that allow temporary access to private resources. These URLs are generated with a specific expiration time, after which they become invalid, offering a secure way to share files without altering permissions. When a presigned URL is created, it includes authentication information as part of the query string, enabling controlled access to otherwise private objects.
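
A sketch of generating presigned download and upload URLs with boto3; bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Time-limited download link for a private object; expires after 1 hour.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "reports/2024.pdf"},
    ExpiresIn=3600,
)

# Time-limited upload link: the client PUTs the file directly to S3.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-private-bucket", "Key": "uploads/photo.jpg"},
    ExpiresIn=600,
)
print(download_url)
```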
39
Compression
There are a number of compression algorithms that you can use to compress files. The most common are Gzip, Brotli, and Zstandard. Each of these algorithms has its own tradeoffs in terms of compression ratio and speed. Gzip is the most widely used and is supported by all modern web browsers. Brotli is newer and has a higher compression ratio than Gzip, but it's not supported by all web browsers. Zstandard is the newest and has the highest compression ratio and speed, but it's not supported by all web browsers. You'll need to decide which algorithm to use based on your specific use case.

One important fact about compression is that you should always compress before you encrypt in cases where encryption is necessary. This is because encryption naturally introduces randomness into the file, which makes it difficult to compress. By compressing before encrypting, you will achieve a much higher compression ratio.
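
A sketch of the compress-then-encrypt ordering using gzip and the cryptography library's Fernet; key management is omitted and the payload is made up:

```python
import gzip
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()
fernet = Fernet(key)

payload = b"some highly repetitive text " * 1000

# Compress first: the plaintext still has structure the compressor can exploit.
encrypted = fernet.encrypt(gzip.compress(payload))

# Reverse the order on the way back: decrypt, then decompress.
restored = gzip.decompress(fernet.decrypt(encrypted))
assert restored == payload

# Encrypting first would produce near-random bytes that barely compress at all.
```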
40
Saga pattern
The saga pattern is a way to manage data consistency across multiple microservices without using distributed transactions, which can be complex and unreliable. Instead, it breaks down a single business transaction into a sequence of local transactions. Each local transaction updates the database within its own service and then publishes an event to trigger the next local transaction in the saga. There are two main ways to coordinate these sagas:
- Choreography: Each service listens for events from other services and reacts accordingly. There's no central orchestrator; services communicate directly through events. This can be simpler to implement initially but can become harder to manage as the saga grows more complex, leading to potential cyclic dependencies.
- Orchestration: A central orchestrator service manages the entire saga. It tells each service when to execute its local transaction and what to do based on the outcomes. This provides better control and visibility but introduces an additional service dependency.
If a local transaction fails at any point in the saga, the pattern also requires implementing compensating transactions. These are transactions that undo the changes made by the preceding successful transactions. Imagine if one runner in the relay race drops the baton; you need a way to go back and correct the steps taken by the previous runners.
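
A sketch of an orchestrated saga where each step pairs an action with a compensating transaction, and a failure triggers compensation in reverse order; the service calls here are stand-in lambdas:

```python
def run_saga(steps):
    """steps: list of (action, compensate) callables. Roll back on failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Undo the successful steps in reverse order.
        for compensate in reversed(completed):
            compensate()
        raise

# Hypothetical order-placement saga.
run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("create shipment"),   lambda: print("cancel shipment")),
])
```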
41
Distributed cache like Redis
- Users should be able to set, get, and delete key-value pairs.
- Users should be able to configure the expiration time for key-value pairs.
- Data should be evicted according to a Least Recently Used (LRU) policy.
- Background process to clean up TTL'ed items
- LRU: hash table + doubly linked list to track access order (see the sketch below)
- Async replication / peer-to-peer replication (gossip protocols)
- Sharding: 32 GB machines, 24 GB of usable memory, consistent hashing
- Hot keys: dedicated hot-key cache == read replicas << copies of hot keys
- Heavy writes to a hot key: write batching or sharding the hot key with suffixes
- Connection pooling
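
A sketch of the LRU bookkeeping using Python's OrderedDict in place of an explicit hash table + doubly linked list (it provides the same ordered-map behavior); the capacity and keys are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    """get/set in O(1); evicts the least recently used key once capacity is hit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

cache = LRUCache(capacity=2)
cache.set("a", "1")
cache.set("b", "2")
cache.get("a")            # "a" is now most recently used
cache.set("c", "3")       # evicts "b"
print(cache.get("b"))     # None
```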
42
Gossip protocols
https://en.wikipedia.org/wiki/Gossip_protocol
43
Consistent hashing
- Hash ring & assignment: keys and servers are mapped to a circular space. Keys are assigned to the first server encountered clockwise on the ring.
- Minimal redistribution: adding or removing servers only requires remapping a small fraction of keys, unlike traditional hashing.
- Scalability & stability: this approach improves scalability and stability in distributed systems by limiting the impact of node changes.
44
Cassandra
- Last write wins
- Partition key, clustering key (optional, like a sort key)
- Cassandra Query Language (CQL) dialect
- Partitions of data are replicated to nodes on the ring, enabling it to skew extremely available (configurable)
45
Geospatial
- Geohashing
- Quadtree (in-memory): a hierarchical data structure used for spatial partitioning of two-dimensional space. It recursively subdivides space into quadrants, allowing efficient storage and retrieval of spatial data.
- R-tree: organizes spatial objects into a tree hierarchy, allowing efficient queries for nearest neighbors, range searches, and spatial joins.
- Redis geohashing
- PostGIS (Postgres extension)
- Elasticsearch, native support
46
System architecture principles
- Individual parts are coherent, cohesive, and aligned both to the domain and to business value.
- Individual parts are decoupled in the right way so that independent teams can work on the overall system together and in parallel.
- Keep half an eye on future evolutions, creating overall architectures that are sufficiently adaptable to change.