System Design Flashcards
Delivery Framework
- Requirements
- Core Entities
- API or System Interface
- Data Flow
- High Level Design
- Deep Dive
Requirements
- Functional Requirements - “Users/clients should be able to…” Top 3
- Non-functional Requirements - “System should be / should be able to…” Top 3
- Capacity Estimations
Nonfunctional Requirements Checklist (8)
- CAP theorem - in a distributed system partition tolerance is a given, so the real trade-off is consistency vs. availability
- Environment constraints, e.g. battery life or limited memory
- Scalability, unique requirements such as bursty traffic or a skewed read/write ratio
- Latency, especially for anything with meaningful computation
- Durability, how important it is that data is not lost
- Security, e.g. data protection, access control
- Fault tolerance, e.g. redundancy, failover, recovery mechanisms
- Compliance, e.g. legal or regulatory requirements or standards
Bytes to store data
ASCII - 1 byte
Unicode - 2 bytes (common estimate; UTF-8 actually uses 1-4 bytes per character)
Split seconds
Millisecond (ms) 1/1000
Microsecond (us) 1/1,000,000
Nanosecond (ns) 1/1,000,000,000
Read latency
Memory: 1 MB / 0.25 ms (~4 GB/s)
SSD (~4x slower than memory): 1 MB / 1 ms (~1 GB/s)
Disk (~20x slower than SSD): 1 MB / 20 ms (~50 MB/s)
Worldwide network round trip: ~6 per second (~170 ms each)
Request calculations per second
~2.5 million seconds per month
1 million per month = .4/s
2.5 million per month = 1/s
10 million per month = 4/s
100 million per month = 40/s
1 billion per month = 400/s
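These per-second figures fall out of the ~2.5 million seconds in a month; a quick sketch of the arithmetic in Python (the monthly volumes are just the examples above):

```python
# Back-of-envelope: average requests/second from a monthly request volume.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # ~2.6 million; ~2.5M is the usual rounding

def requests_per_second(requests_per_month: float) -> float:
    return requests_per_month / SECONDS_PER_MONTH

for monthly in (1e6, 2.5e6, 10e6, 100e6, 1e9):
    print(f"{monthly:,.0f}/month ≈ {requests_per_second(monthly):.1f}/s")
```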
Storage estimates:
2 hr movie ≈ 1 GB
Small plain text book ≈ 1 MB
High res photo ≈ 1 MB
Med res image ≈ 100 KB
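A quick worked example of using these numbers (the 10 million uploads/day figure is a hypothetical assumption for illustration):

```python
# Back-of-envelope: daily and yearly storage for medium-res image uploads.
MED_RES_IMAGE_BYTES = 100 * 1024      # ~100 KB per image, per the estimate above
DAILY_UPLOADS = 10_000_000            # hypothetical: 10M uploads per day

daily_bytes = MED_RES_IMAGE_BYTES * DAILY_UPLOADS
yearly_bytes = daily_bytes * 365
print(f"~{daily_bytes / 1024**4:.1f} TB/day, ~{yearly_bytes / 1024**5:.2f} PB/year")
```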
DB Writes vs Reads
Write is 40x more expensive than read
Core Entities
Spend ~2 minutes here.
The entities the API will exchange and the system will persist in its data model. Example: User, Tweet, Follow for Twitter.
Present as a bullet list.
API or System Interface
RESTful or GraphQL
Endpoints with path and parameters
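A minimal sketch of what those endpoints could look like for the Twitter example from Core Entities (Flask and the specific paths/parameters are illustrative assumptions, not part of the notes):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/v1/tweets")
def create_tweet():
    body = request.get_json()                   # {"user_id": ..., "text": ...}
    return jsonify({"id": "tweet_123", "text": body["text"]}), 201

@app.get("/v1/users/<user_id>/tweets")
def list_tweets(user_id):
    limit = int(request.args.get("limit", 20))  # pagination via query params
    return jsonify({"user_id": user_id, "tweets": [], "limit": limit})

@app.post("/v1/users/<user_id>/follows")
def follow(user_id):
    return "", 204                              # authenticated caller follows user_id

if __name__ == "__main__":
    app.run(port=8000)
```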
Data Flow
Actions or processes that the system performs on the input to produce the desired outputs
Core Concepts
Scaling - work distribution and data distribution
Consistency
Locking
Indexing
Communication Protocols
Security - authentication and authorization, encryption, data protection
Monitoring - infrastructure, system level, application level
Key Technologies
Core DB
Blob storage
Search optimized DB
API gateway
Load balancer
Queue
Streams / event sourcing
Distributed lock (see the sketch after this list)
Distributed cache
CDN
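For the distributed lock entry above, a minimal sketch of the common Redis approach (SET with NX and a TTL); redis-py, the key names, and the TTL are illustrative assumptions:

```python
import uuid
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

def acquire_lock(name: str, ttl_seconds: int = 10) -> str | None:
    """Returns an owner token if the lock was acquired, else None."""
    token = str(uuid.uuid4())
    # SET key value NX EX ttl: succeeds only if the key does not already exist
    if r.set(f"lock:{name}", token, nx=True, ex=ttl_seconds):
        return token
    return None

def release_lock(name: str, token: str) -> None:
    # Best-effort check-and-delete; a Lua script would make this atomic.
    if r.get(f"lock:{name}") == token.encode():
        r.delete(f"lock:{name}")

token = acquire_lock("nightly-report")
if token:
    try:
        ...  # critical section: only one worker runs this at a time
    finally:
        release_lock("nightly-report", token)
```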
Patterns
DB backed CRUD with caching
Async job worker pool (see the sketch after this list)
2 stage architecture
Event driven architecture
Durable job processing
Proximity based services
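As an illustration of the async job worker pool pattern above, a small in-process sketch (a real system would pull from SQS, RabbitMQ, or similar rather than an in-memory queue):

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()

def worker(worker_id: int) -> None:
    # Each worker pulls jobs off the shared queue until it sees the sentinel.
    while True:
        job = jobs.get()
        if job is None:
            break
        print(f"worker {worker_id} processing {job}")

# Producer enqueues work; the pool drains it asynchronously.
pool = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in pool:
    t.start()
for job_id in range(10):
    jobs.put(f"job-{job_id}")
for _ in pool:
    jobs.put(None)  # one sentinel per worker to shut the pool down
for t in pool:
    t.join()
```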
Core API - high level overview
“Our Core API uses a layered .NET architecture, deployed in EKS. Controllers
handle HTTP routing, Services handle business logic, and a Data layer interacts
with Aurora and Redis. This lets us scale the service horizontally while keeping
the codebase maintainable.”
Core API - layered architecture justification
“We wanted to separate concerns—controllers focus on HTTP requests, services
encapsulate domain rules, and our data layer deals with Aurora and caching. This
approach cuts down on coupling and makes it easier to adapt or extract
microservices down the road.”
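A toy sketch of that layering (the real service is .NET; the Python class and method names here are purely illustrative):

```python
class StatsRepository:
    """Data layer: talks to the database/cache (Aurora and Redis in the notes)."""
    def get(self, game_id: str) -> dict:
        return {"game_id": game_id, "score": "2-1"}  # stand-in for a DB/cache read

class StatsService:
    """Service layer: business rules, no HTTP or SQL details."""
    def __init__(self, repo: StatsRepository) -> None:
        self.repo = repo
    def get_game_stats(self, game_id: str) -> dict:
        stats = self.repo.get(game_id)
        # domain rules (validation, enrichment, authorization) live here
        return stats

class StatsController:
    """Controller layer: HTTP routing and serialization only; delegates to the service."""
    def __init__(self, service: StatsService) -> None:
        self.service = service
    def handle_get(self, game_id: str) -> tuple[int, dict]:
        return 200, self.service.get_game_stats(game_id)

controller = StatsController(StatsService(StatsRepository()))
print(controller.handle_get("game_42"))
```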
Core processor - explanation
“We have a central ETL pipeline—the Core Processor—which ingests data from
multiple providers, stores raw payloads in S3, and then transforms/loads it into Aurora.
Tasks run on a cron-based scheduler and retry on failure with exponential backoff, ensuring resilience even if a provider is temporarily down.”
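A minimal sketch of retry with exponential backoff as described above (the delays, attempt count, and fetch_from_provider are illustrative assumptions):

```python
import random
import time

def fetch_from_provider() -> dict:
    raise ConnectionError("provider temporarily down")  # stand-in for a real API call

def fetch_with_backoff(max_attempts: int = 5, base_delay: float = 1.0) -> dict:
    for attempt in range(max_attempts):
        try:
            return fetch_from_provider()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # exponential backoff with jitter: ~1s, 2s, 4s, 8s between attempts
            time.sleep(base_delay * (2 ** attempt) + random.random())
```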
Core API - why K8s?
“Kubernetes gave us automated scaling and rolling updates out of the box. We
can spin up more pods during major sporting events and scale back when traffic is
low, all while ensuring near-zero downtime.”
Core API - EKS rolling updates
“We use a rolling update strategy so that when deploying a new version of the
API, only one old pod goes down at a time—our system stays online, and if
something fails, we can roll back quickly.”
Core API - stateless pods
“Even though our application manages a lot of data, we designed each pod to be stateless. Any persistent data—sessions, user info, or stats—resides in Aurora, Redis, or S3.
That means losing a pod doesn’t risk losing data.”
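A minimal sketch of what "stateless pods" means in practice: session state is written to Redis keyed by session id, so any pod can serve any request (redis-py and the key names are assumptions for illustration):

```python
import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

def save_session(session_id: str, data: dict, ttl_seconds: int = 3600) -> None:
    # State goes to Redis, not pod-local memory, so losing a pod loses nothing.
    r.set(f"session:{session_id}", json.dumps(data), ex=ttl_seconds)

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None

save_session("abc123", {"user_id": 42})
print(load_session("abc123"))  # works no matter which pod handles the request
```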
Core API - Ingress and Helm templating
“We have an internal ALB that terminates TLS and checks liveness via /health.
The ALB is configured via Ingress annotations in our Helm chart, ensuring only
healthy pods receive requests. We define everything in Helm charts, from replicas
and resource limits to Ingress rules. Environment-specific overrides like
values-stage.yaml and values-prod.yaml let us run the same code in staging vs.
production with minimal overhead.”
Core API - CI/CD pipeline
“We use CircleCI to build Docker images, run tests, push the image to ECR, then
automatically update our Helm chart. If linting or validation fails, the deployment
never proceeds—meaning we catch issues before they hit production.”
Core API - automatic rollbacks
“Our pipeline can roll back a Helm release if we detect a spike in 500 errors or
failing health checks. That safety net lets us move fast and confidently ship
updates.”