Interview Flashcards
(176 cards)
- How do you differentiate between incidents, problems, and changes?
Incident: An unplanned interruption or degradation of service.
Problem: The underlying cause of one or more incidents.
Change: A modification to an IT service or system aimed at resolving problems or improving functionality.
- How do you handle multiple simultaneous incidents?
Assess and prioritize based on impact and urgency.
Assign dedicated resources to each incident.
Use playbooks to ensure a structured response for high-priority incidents.
Communicate status updates effectively to all stakeholders.
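A minimal sketch of impact/urgency prioritization, assuming a simple three-level matrix. The level names, the 1 to 5 scale, and the `Incident` class are illustrative assumptions; real organizations define their own matrix.

```python
from dataclasses import dataclass

# Hypothetical three-level scale; lower number = more severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


@dataclass
class Incident:
    name: str
    impact: str   # "high", "medium", or "low"
    urgency: str  # "high", "medium", or "low"


def priority(incident: Incident) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last)."""
    return LEVELS[incident.impact] + LEVELS[incident.urgency] - 1


def triage(incidents):
    """Sort incidents so the highest-priority ones come first."""
    return sorted(incidents, key=priority)
```

Sorting by a single derived priority keeps the triage order explainable to stakeholders: two people given the same impact and urgency will reach the same ordering.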
- What is your experience with post-incident reviews (PIRs)?
Conducted PIRs within 48 hours of incident resolution.
Structured the review to include a timeline, root cause analysis, corrective actions, and lessons learned.
Facilitated open discussions to identify process gaps and ensure accountability without assigning blame.
- How do you ensure compliance with SLAs during incident management?
Establish a clear escalation and communication path for all teams involved.
Use global incident management tools like ServiceNow or PagerDuty.
Ensure team members are aware of time zones and organizational dependencies.
- How do you diagnose performance issues on a Linux server?
Use tools like top, htop, or vmstat for resource usage.
Analyze logs under /var/log.
Check disk I/O using iostat or iotop.
Inspect network performance using netstat, tcpdump, or iftop.
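On Linux, memory figures like those shown by top and vmstat come from /proc/meminfo. A minimal sketch of reading it directly, assuming a kernel that reports the MemAvailable field; the function names are illustrative.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token is the number; a trailing "kB" unit is ignored.
            values[key.strip()] = int(parts[0])
    return values


def used_percent(meminfo: dict) -> float:
    """Percentage of memory not available for new workloads."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total
```

On a real host, the input would be the contents of /proc/meminfo, for example `parse_meminfo(open("/proc/meminfo").read())`.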
- What is your understanding of TCP/UDP and when would you use each?
TCP: Reliable, connection-oriented protocol for use cases like file transfers and web browsing.
UDP: Lightweight, connectionless protocol with no delivery guarantee, suited to short request/response exchanges like DNS queries and latency-sensitive traffic like live video or voice.
- How do you ensure containerized services are running optimally?
Use monitoring tools like Prometheus and Grafana for resource tracking.
Analyze container logs using docker logs.
Ensure health checks and resource limits (CPU, memory) are defined in Docker or Kubernetes configurations.
Investigate inter-container network latency or misconfigurations.
- How would you troubleshoot DNS resolution failures?
Check the DNS server’s availability using nslookup or dig.
Verify DNS configurations in /etc/resolv.conf.
Investigate firewall or network settings blocking DNS traffic.
Ensure TTL values are appropriate and DNS caches are updated.
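The first two checks can be sketched in Python. This is a rough illustration, not a replacement for dig or nslookup: `parse_nameservers` and `can_resolve` are illustrative names, and `can_resolve` only tests the system resolver, not a specific DNS server.

```python
import socket


def parse_nameservers(resolv_conf_text: str) -> list:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def can_resolve(hostname: str) -> bool:
    """True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False
```

An empty list from `parse_nameservers` points at a configuration problem, while configured servers combined with a failing `can_resolve` suggests the servers are unreachable or blocked.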
- What are some common bottlenecks in CI/CD pipelines, and how do you address them?
Slow Builds: Optimize builds by caching dependencies or using parallel tasks.
Failed Tests: Ensure tests are modular and focus on critical areas.
Deployment Issues: Use automated rollback mechanisms or staged deployments.
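Dependency caching usually depends on a cache key that changes only when the dependencies change. A minimal sketch, assuming the key is derived from the lockfile contents; the `deps` prefix and the 16-character truncation are arbitrary choices for illustration.

```python
import hashlib


def cache_key(lockfile_bytes: bytes, prefix: str = "deps") -> str:
    """Build a cache key that changes whenever the lockfile changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"
```

A build can reuse the cached dependencies when the key matches a previous run and reinstall only when it does not.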
- How do you configure and use Prometheus and Grafana for monitoring?
Install Prometheus and configure scrape targets in the prometheus.yml file.
Use exporters (e.g., node_exporter for Linux systems) to gather metrics.
Query metrics using PromQL and visualize them with Grafana.
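Exporters expose metrics over HTTP in a plain-text format that Prometheus scrapes. A minimal sketch of rendering one gauge in that format, assuming label values contain no quotes, backslashes, or newlines (which would need escaping). The metric name `queue_depth` used in the usage note is hypothetical; production code would normally use an official client library instead.

```python
def render_gauge(name, help_text, samples):
    """Render a gauge as Prometheus text exposition lines.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between runs.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

For example, `render_gauge("queue_depth", "Jobs waiting", [({"queue": "email"}, 3)])` produces a HELP line, a TYPE line, and the sample line `queue_depth{queue="email"} 3`.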
- What is the difference between active and passive monitoring?
Active Monitoring: Simulates user transactions to test system performance proactively (e.g., synthetic monitoring).
Passive Monitoring: Observes live user activity to detect issues in real-time (e.g., packet sniffing).
- What do you look for in log files during incident resolution?
Errors or exceptions with timestamps matching the incident.
Patterns indicating system or user activity leading to failure.
Logs of dependent services to identify cascading issues.
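The first check can be sketched as a filter over log lines. The log layout assumed here (`YYYY-MM-DD HH:MM:SS LEVEL message`) is hypothetical; real services vary, so the parsing would need adjusting.

```python
from datetime import datetime


def errors_in_window(lines, start, end, levels=("ERROR", "CRITICAL")):
    """Return log lines at the given levels whose timestamp is in [start, end]."""
    matches = []
    for line in lines:
        parts = line.split(" ", 3)  # date, time, level, message
        if len(parts) < 4:
            continue  # not in the assumed layout
        try:
            ts = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # unparseable timestamp
        if start <= ts <= end and parts[2] in levels:
            matches.append(line)
    return matches
```

Running the same filter over the logs of dependent services with the same window helps show whether a failure started upstream.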
- How do you design an incident escalation matrix?
Define escalation tiers based on severity and impact.
Assign escalation paths to specific roles or teams.
Establish time thresholds for each tier.
Regularly review and update the matrix.
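One way to represent such a matrix is a mapping from severity to time-based tiers. The severities, thresholds, and role names below are hypothetical placeholders, not recommended values.

```python
# Hypothetical tiers: (minutes elapsed without resolution, role to notify).
ESCALATION_MATRIX = {
    "sev1": [(0, "on-call engineer"), (15, "team lead"), (30, "engineering manager")],
    "sev2": [(0, "on-call engineer"), (60, "team lead")],
}


def who_to_notify(severity: str, minutes_elapsed: int) -> list:
    """Return every role whose time threshold has been reached."""
    tiers = ESCALATION_MATRIX[severity]
    return [role for threshold, role in tiers if minutes_elapsed >= threshold]
```

Keeping the matrix as data rather than logic makes the regular review a matter of editing one table.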
- What are some key metrics you track to measure the success of incident management processes?
Mean Time to Detect (MTTD).
Mean Time to Acknowledge (MTTA).
Mean Time to Resolve (MTTR).
SLA compliance rates.
Post-incident review completion rates.
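The three time-based metrics can be computed from incident timestamps. Definitions vary between organizations; this sketch assumes MTTD is measured from start to detection, MTTA from detection to acknowledgement, and MTTR from start to resolution. The dictionary keys are illustrative.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


def incident_metrics(incidents):
    """Return MTTD, MTTA, and MTTR in minutes under the assumed definitions."""
    return {
        "MTTD": mean_minutes(incidents, "started", "detected"),
        "MTTA": mean_minutes(incidents, "detected", "acknowledged"),
        "MTTR": mean_minutes(incidents, "started", "resolved"),
    }
```

Each incident is a dictionary with `started`, `detected`, `acknowledged`, and `resolved` datetime values, typically exported from the incident management tool.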
- How do you handle emergency changes during an incident?
Assess the impact and risk of the change with key stakeholders.
Gain approval through an expedited emergency change management process.
Test the change in a controlled environment if time permits.
Monitor the results and document the change thoroughly.
- What are some strategies for mitigating risks in high-availability systems?
Implement redundancy at all levels (e.g., servers, storage, networks).
Use automated failover mechanisms.
Regularly test disaster recovery and failover scenarios.
Ensure proper monitoring and alerting to catch early signs of degradation.
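Automated failover reduces to picking the first healthy node from an ordered list. A minimal sketch, assuming the caller supplies the health check; the node names in the usage note are hypothetical.

```python
def pick_active(nodes, is_healthy):
    """Return the first healthy node in preference order, or None."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None  # nothing healthy: alert instead of routing traffic
```

For example, `pick_active(["db-primary", "db-replica"], check)` returns the replica when `check` reports the primary as unhealthy. Real failover also has to handle flapping and split-brain, which this sketch ignores.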
- A critical database server is down during peak hours. How would you handle the situation?
Notify stakeholders immediately and assemble the incident response team.
Investigate logs for errors or performance degradation.
Check for hardware issues or resource exhaustion.
Apply a temporary fix, such as restoring from a backup or scaling resources.
Document the incident thoroughly and schedule a follow-up for root cause analysis.
- A monitoring tool has flagged intermittent latency in a microservices-based application. What’s your approach?
Examine logs for specific services with high response times.
Use distributed tracing tools to identify bottlenecks.
Investigate resource usage on affected nodes.
Test inter-service communication and network latency.
- Explain the significance of ICMP in network troubleshooting.
ICMP is used for diagnostic and error-reporting purposes.
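Common tools like ping and traceroute rely on ICMP to measure connectivity and path latency: ping sends ICMP echo requests, and traceroute reads the ICMP Time Exceeded replies returned by each hop.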
- How do you ensure I/O performance optimization in a high-load application?
Use RAID for disk performance and redundancy.
Optimize database queries and indexes.
Implement caching layers (e.g., Redis).
Monitor and adjust kernel I/O schedulers.
- Define the Bot’s Purpose
Identify the problem the bot will solve (e.g., automate repetitive tasks, assist employees, or manage workflows).
Example use cases: answering FAQs, scheduling, or retrieving on-call staff information.
- How do you choose the right bot platform?
Decide where the bot will operate (e.g., Slack, Microsoft Teams, email, or a custom app).
Ensure it integrates well with corporate tools (e.g., Jira, ServiceNow, or internal APIs).
- Select Development Tools for the Bot
Bot Frameworks: Use frameworks like Microsoft Bot Framework, Dialogflow, or Botpress to streamline development.
Programming Language: Python and JavaScript (on Node.js) are commonly used due to their simplicity and libraries.
APIs: Utilize corporate APIs (like HR systems or databases) to fetch required data.
- Build Core Functionality for the Bot
Write code for the bot’s tasks. For example:
Use APIs for fetching schedules, automating queries, or retrieving documents.
Build logic for processing commands like “Show today’s on-call staff.”
Implement natural language understanding (NLU) using tools like Rasa or Dialogflow for conversational bots.
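The command-processing logic can be sketched as a small dispatcher. This is a rough illustration using exact phrase matching rather than NLU; `ON_CALL` is a hypothetical stand-in for data that would come from a scheduling API, and the names in it are made up.

```python
# Hypothetical stand-in for a scheduling API response.
ON_CALL = {"database": "alice", "network": "bob"}


def normalize(text: str) -> str:
    """Lowercase, straighten curly apostrophes, and trim punctuation."""
    cleaned = text.lower().replace("\u2019", "'").strip(" .?!")
    return " ".join(cleaned.split())


def show_on_call() -> str:
    entries = ", ".join(f"{team}: {person}" for team, person in sorted(ON_CALL.items()))
    return f"On call today: {entries}"


# Map each recognized command phrase to its handler.
COMMANDS = {"show today's on-call staff": show_on_call}


def handle_message(text: str) -> str:
    handler = COMMANDS.get(normalize(text))
    if handler is None:
        return "Sorry, I don't recognize that command."
    return handler()
```

An NLU tool would replace the exact-match lookup in `handle_message` with intent classification, while the handlers stay the same.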