SRE concepts Flashcards

Error budgets, SLXs, monitoring, observability, incident management

1
Q

Can you explain the concept of “Error Budget”?

A
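  • An error budget is the amount of unreliability or downtime a service can accumulate while still meeting its service level objectives (SLOs).
  • It helps teams balance innovation and reliability: while budget remains, there is room to take release risk; once it is spent, the focus shifts to stability.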
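A minimal sketch of the arithmetic, assuming an availability SLO measured over a 30-day window. The function name and the window length are illustrative assumptions, not part of the card.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO permits over the window."""
    budget_fraction = 1 - slo_percent / 100  # e.g. a 99.9% SLO leaves 0.1%
    return budget_fraction * window_days * 24 * 60


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))
```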
2
Q

How would you handle a critical incident?

A
  • Immediate Response: Identify the issue’s scope and impact. Assemble an incident response team. Focus on restoring service quickly.
  • Mitigation: Isolate the problem, roll back changes if needed, and implement temporary fixes to restore service stability.
  • Investigation: Analyse logs, metrics, and system behaviour to determine the root cause. Share information among teams.
  • Resolution and Post-Incident Analysis: Implement a permanent fix. Conduct a post-incident analysis to identify contributing factors and preventive measures.
3
Q

How would you approach designing and implementing a monitoring and alerting system for a complex distributed application?

A
  • Design a monitoring system with relevant metrics such as response time, error rates, resource utilisation, and custom application-specific metrics.
  • Use tools like Prometheus, Grafana, or the ELK stack.
  • Set up meaningful alerts with proper thresholds and aggregation.
  • Establish different alerting channels for different severity levels.
  • Continuously refine alerts based on false positives and false negatives.
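The threshold, aggregation, and severity-routing ideas above can be sketched in plain Python. This is an illustration only, not the API of Prometheus or any other tool; the rule names, thresholds, and channel names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float  # error-rate fraction at which the rule fires
    severity: str     # "critical" or "warning"


# Hypothetical routing table: each severity level goes to its own channel.
CHANNELS = {"critical": "pager", "warning": "chat"}


def error_rate(errors: Sequence[int], requests: Sequence[int]) -> float:
    """Aggregate over the whole window instead of alerting on a single sample."""
    total = sum(requests)
    return sum(errors) / total if total else 0.0


def evaluate(rules: Sequence[AlertRule], errors: Sequence[int],
             requests: Sequence[int]) -> Optional[Tuple[str, str]]:
    """Return (rule name, channel) for the most severe rule that fires, else None."""
    rate = error_rate(errors, requests)
    # Check the highest threshold first so a critical alert wins over a warning.
    for rule in sorted(rules, key=lambda r: r.threshold, reverse=True):
        if rate >= rule.threshold:
            return rule.name, CHANNELS[rule.severity]
    return None
```

Aggregating across the window is one simple way to reduce false positives from a single noisy sample.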
4
Q

Explain the differences between horizontal and vertical scaling.

A
  • Horizontal scaling involves adding more instances of the same component, distributing the load.
  • Vertical scaling involves upgrading the resources of an existing instance.
  • Horizontal scaling is suitable for distributing high traffic and ensuring availability.
  • Vertical scaling is useful for improving performance of individual components.
5
Q

What challenges might arise with horizontal and vertical scaling?

A
  • Horizontal scaling: synchronisation issues and keeping data consistent across instances.
  • Vertical scaling: the resource limits of a single machine.
6
Q

How would you describe a robust CI/CD pipeline?

A
  • A robust CI/CD pipeline includes stages for building, testing, deploying, and monitoring.
  • Automated tests cover unit, integration, and end-to-end scenarios.
  • The pipeline helps catch bugs early and ensures consistent deployments.
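A toy sketch of the stage ordering, assuming each stage is a callable that returns True on success. The `run_pipeline` helper is hypothetical; a real pipeline would be defined in a CI system's own configuration format.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: Sequence[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed and the name of the
    failing stage (None if every stage passed).
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return completed, name  # later stages (e.g. deploy) never run
        completed.append(name)
    return completed, None
```

Stopping at the first failed stage is what lets a failing test block a deployment.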
7
Q

Discuss the importance of automated testing and continuous integration/continuous deployment (CI/CD).

A
  • Automated testing ensures code quality and reliability.
  • CI/CD automates deployment pipelines for faster and safer releases.
8
Q

How would you troubleshoot the issue and identify the root cause of performance degradation in production?

A
  • Analyse performance metrics, logs, and resource utilisation.
  • Correlate the degradation with recent deployments to identify the specific code changes causing the issue.
  • If necessary, roll back the offending update, for example by reverting it in version control and redeploying.
  • To prevent future occurrences, improve testing practices, and consider canary deployments or feature flags.
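A percentage-based feature flag, which can also drive a canary-style rollout, might be sketched as follows. The `in_rollout` helper and its hashing scheme are assumptions for illustration, not the API of any feature-flag product.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag.

    The same user always gets the same answer for a given flag, and raising
    `percent` only adds users; it never removes those already included.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return bucket < percent
```

Starting at a small percentage and widening it while watching the metrics limits the blast radius of a bad change.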
9
Q

How do you ensure that incidents are thoroughly investigated and learnings are applied to prevent future occurrences?

A
  • Incident response involves immediate action to restore service.
  • Post-incident analysis includes a thorough review of what happened, why it happened, and how to prevent it.
  • Ensure all contributing factors are addressed.
  • Learnings are shared across teams, leading to process improvements and preventing similar incidents.
10
Q

What disaster recovery strategies and techniques would you employ to ensure high availability and data integrity for a critical application?

A
  • Redundancy: Deploy across multiple availability zones/regions.
  • Backup and Restore: Regularly back up data and test restoration procedures.
  • Failover: Automatically switch to standby systems in case of failure.
  • Chaos Engineering: Intentionally introduce failures to test the system’s resilience.
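The failover step can be sketched as picking the first healthy endpoint from an ordered list. Both `pick_endpoint` and the `is_healthy` callback are hypothetical; in practice this decision usually sits in a load balancer, DNS, or database tooling.

```python
from typing import Callable, Sequence


def pick_endpoint(endpoints: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint, preferring earlier (primary) entries."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")
```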
11
Q

Explain the concept of “circuit breakers” and their role in preventing cascading failures in a microservices architecture.

A

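  • A circuit breaker detects degradation in a service it calls.
  • It blocks further requests to the failing service instead of letting them pile up.
  • This isolates the failing component so the failure does not cascade to its callers.
  • It gives the failing service time to recover before traffic is sent to it again.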
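A minimal circuit-breaker sketch, assuming a consecutive-failure threshold and a fixed recovery timeout. The class, its parameters, and the default values are illustrative and are not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock        # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Open: block the request and give the service time to recover.
                raise CircuitOpenError("circuit open; request blocked")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to fall back or fail fast.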

12
Q

How would you implement and manage circuit breakers effectively?

A
  • Set thresholds for errors or latency.
  • Manage thresholds dynamically based on real-time performance.
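One possible way to make a latency threshold dynamic is to derive it from a rolling window of recent observations. The class below is a sketch under that assumption; the median-times-multiplier rule, window size, and floor are illustrative choices rather than a standard.

```python
from collections import deque
from statistics import median


class AdaptiveLatencyThreshold:
    def __init__(self, window: int = 100, multiplier: float = 3.0, floor_ms: float = 50.0):
        self.samples = deque(maxlen=window)  # keeps only the most recent samples
        self.multiplier = multiplier
        self.floor_ms = floor_ms             # never go below this threshold

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def threshold_ms(self) -> float:
        """Threshold tracks recent performance: multiplier x rolling median."""
        if not self.samples:
            return self.floor_ms
        return max(self.floor_ms, self.multiplier * median(self.samples))

    def is_slow(self, latency_ms: float) -> bool:
        return latency_ms > self.threshold_ms()
```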
13
Q

How do you navigate collaborating with development teams to achieve both reliability and innovation?

A
  • Emphasise the common goal of reliability and innovation.
  • Engage in open communication, share data and insights, and involve both teams in decision-making.
  • Use incident learnings to advocate for reliability improvements without hindering innovation.
  • Align on shared metrics and incentives to foster a culture of collaboration.
14
Q

How is an error budget calculated and how would you prioritise between new features and reliability?

A
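  • The error budget is the difference between 100% and the target SLO percentage; a 99.9% SLO leaves a 0.1% budget.
  • Prioritise between new features and reliability by looking at the remaining error budget: while budget remains, feature work can continue; when it is nearly or fully spent, reliability work takes priority.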
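A hedged sketch of that decision for a request-based SLO. The function names and the 25% caution threshold are assumptions made for illustration; real policies are agreed per team.

```python
def remaining_budget(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1 - slo_percent / 100) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
    return 1 - failed_requests / allowed_failures


def next_priority(remaining: float, caution_below: float = 0.25) -> str:
    """Map the remaining budget to a focus area (hypothetical policy)."""
    if remaining <= 0:
        return "reliability"  # budget exhausted: pause risky releases
    if remaining < caution_below:
        return "mixed"        # budget running low: slow down and add safeguards
    return "features"         # healthy budget: keep shipping
```

For example, with a 99.9% SLO and 1,000,000 requests, 250 failures use about a quarter of the 1,000 allowed failures, so roughly 75% of the budget remains.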