Interview Flashcards

Question

5. Ensure Security and Compliance for the bot

Answer 1

Use encryption to secure sensitive data. Follow corporate policies on data storage and processing. Authenticate users (e.g., via Single Sign-On (SSO) or OAuth).

Answer 2

Test the bot in a staging environment for bugs or user flow issues. Deploy it to production on your chosen platform (e.g., Slack workspace or Teams environment).

Answer 3

Monitor usage logs and performance metrics. Gather feedback from users to improve functionality. Update the bot regularly to handle new scenarios or integrate additional features.

Answer 4

Detection: Incident identified via monitoring tools, alerts, or user reports. Classification: Assign severity level based on impact and urgency. Response: Notify stakeholders, assemble the incident response team, and assign roles. Diagnosis: Use logs, monitoring data, and tools to identify the root cause. Resolution: Apply a temporary fix (if needed) and implement a permanent solution. Communication: Regular updates to stakeholders. Post-Incident Review: Document the incident, analyze for lessons learned, and refine processes to prevent recurrence.

Answer 5

Assess the impact (e.g., number of users affected, financial loss). Analyze the urgency (e.g., SLA breaches or cascading system failures). Focus on critical systems supporting customer-facing services. Ensure effective delegation, while providing continuous communication to stakeholders.

Answer 6

Tools: Prometheus, Grafana, Splunk, Datadog, and ELK Stack. Used for tracking system performance, detecting anomalies, creating alerts, and diagnosing root causes. Example: Used Grafana dashboards to monitor application health during peak traffic and proactively mitigated risks by analyzing metrics like CPU and memory usage.

Answer 7

Start with the impact (e.g., "Service X is currently unavailable for 20% of users"). Avoid technical jargon; use analogies if helpful (e.g., "It’s like a traffic jam blocking access"). Highlight the steps being taken and the estimated time for resolution. Keep updates concise and provide frequent updates to maintain trust.

Answer 8

Be transparent but solution-focused. Clearly explain the situation, impact, and mitigation steps in progress. Reassure stakeholders by emphasizing the team’s expertise and a structured resolution plan.

Answer 9

Context: A critical payment processing service outage. Action: Coordinated with SREs, reviewed logs, and pinpointed a database deadlock issue. Solution: Rolled back the faulty code deployment and implemented additional monitoring to catch similar issues proactively. Outcome: Restored services within the SLA and conducted a root cause analysis to prevent recurrence.

Answer 10

Start with logs: Check network device logs and application-level errors. Run diagnostics: Use tools like traceroute, ping, and packet capture. Correlate data: Identify patterns like time of occurrence or specific affected regions. Isolate components: Test individual elements of the network to narrow down the root cause.

Answer 11

Jira: For tracking and documenting incidents. ServiceNow: For managing incident lifecycles. Confluence: For post-incident reporting and knowledge sharing.

Answer 12

Include essential details: incident timeline, root cause, impact, resolution steps, and lessons learned. Ensure reports are concise, structured, and accessible to both technical and non-technical audiences. Use templates to maintain consistency across reports.

Answer 13

Use a centralized communication channel (e.g., Slack or Microsoft Teams). Clearly assign roles and responsibilities to team members. Encourage open communication and quick escalation of issues. Regularly update stakeholders and keep the team focused on resolution goals.

Answer 14

Facilitate a discussion to focus on facts rather than opinions. Use data (e.g., logs, metrics) to guide decisions. If unresolved, escalate to a neutral decision-maker or a higher-level incident commander.

Answer 15

Familiar with Jenkins, GitLab CI/CD, and GitHub Actions. Implemented pipelines for automated testing, deployment, and monitoring. Example: Reduced deployment time by 30% using a well-defined CI/CD process integrated with Kubernetes.

Answer 16

Implement a change freeze during critical periods. Conduct thorough pre-deployment testing. Use canary deployments or blue-green deployments to minimize impact. Rollback immediately if adverse effects are detected.

Answer 17

Stay calm and focused by breaking the problem into smaller tasks. Use checklists and incident response playbooks to stay organized. Maintain clear communication and lean on the team for support.

Answer 18

Implemented a post-mortem framework to identify recurring incident trends. Developed an automated incident notification system using Slack and ServiceNow integration. Resulted in a 20% improvement in incident response times and better documentation quality.

Answer 19

Depends on the severity of the issue, whether there is an outage, and the number of users affected.

Answer 20

ERD, PRD, production. On-call bot schedules misaligned.

Answer 21

Through a Lark channel with the correct POCs from the affected departments, asking for updates and providing them when I can.

Answer 22

Rollbacks for bad deployments. Traffic failover to TTP2 after a TTP1 outage. A downstream service caused error spikes in a critical PSM for my team, 3C, which led to an investigation and a discussion with the team responsible.

Answer 23

In such scenarios, prioritization would depend on the urgency and impact. Suicide content would take the highest priority as it involves potential loss of life and requires immediate action to protect individuals. Next, I would address government pressure, ensuring compliance with regulations to maintain operational stability. Lastly, I would manage the LGBT-related escalation, ensuring that it is handled sensitively and in alignment with TikTok’s values of inclusivity and community support. Throughout, I would ensure clear communication and resource allocation to address all issues effectively.

Answer 24

Throughout my career I have aspired to be more of a leader who contributes to the team by driving project ideas and resolutions while also being in the trenches getting the work done.

Answer 25

During my time at TikTok, I designed a bot that updates the on-call schedules from a master schedule, regardless of region, and outputs the current on-caller. During a demo for the discovery team, a conflict between two of the SREs over the tool's impact on their team escalated to shouting. I took the reins of the conversation so we could break down the issues, since one of the discovery members didn't understand the issue the other was raising.

Answer 26

I'm interested in working at TikTok because it is an ever-evolving industry with numerous talented engineers and projects that would help me grow my career, and because I would be supporting a platform that connects millions of people worldwide.

Answer 27

One of my friends was desperate for a career change, so I suggested IT and got him started on foundational certifications such as A+, Linux+, and Network+. We then went over interview prep and found some help desk jobs. After a three-year journey, he now works for Microsoft as a pen test engineer.

Answer 28

A linear, sequential alternative to Agile that proceeds through six phases: 1. Requirements 2. Design 3. Implementation 4. Testing 5. Deployment 6. Maintenance

Answer 29

top or htop.

Answer 30

Use df -h for disk usage and du -sh for directory size.
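
The two commands can be tried on a throwaway directory. This is a minimal sketch: the scratch directory and the `sample.bin` file are hypothetical names created only for the illustration, and the exact `df`/`du` output will differ from system to system.

```shell
# Hypothetical scratch directory, created only for this illustration.
workdir=$(mktemp -d)

# Write 1 MiB of zero bytes so the directory has a measurable size.
head -c 1048576 /dev/zero > "$workdir/sample.bin"

# Used and free space on the filesystem that holds the directory.
df -h "$workdir"

# Total size of the directory itself, in human-readable units.
du -sh "$workdir"

# Record the exact byte count of the sample file, then clean up.
size_bytes=$(wc -c < "$workdir/sample.bin" | tr -d ' ')
rm -rf "$workdir"
```

`df` reports on whole filesystems, while `du` walks a directory tree, which is why the two are usually used together.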

Answer 31

A: It maps hostnames to IP addresses locally, bypassing DNS.

Answer 32

A: Tests connectivity between two devices by sending ICMP echo requests.

Answer 33

A: VMware, Hyper-V, KVM, and Oracle VM.

Answer 34

A: NAT shares the host's IP, while Bridged gives VMs direct access to the network.

Answer 35

A: Manages firewall rules for packet filtering.

Answer 36

A: To segment a network into isolated virtual networks for security and efficiency.

Answer 37

A: Orchestrates containerized applications for scaling, deployment, and management.

Answer 38

A: Use the command docker run <image>.

Answer 39

A: The smallest deployable unit in Kubernetes, containing one or more containers.

Answer 40

A: By automatically replicating and rescheduling pods on healthy nodes.

Answer 41

A: Defines the steps to build a custom Docker image.

Answer 42

A: Use kubectl get nodes.

Answer 43

A: Manages external HTTP/S access to services within the cluster.

Answer 44

A: Prometheus with Grafana or Kubernetes Dashboard.

Answer 45

A: Identification, logging, categorization, prioritization, investigation, resolution, closure.

Answer 46

A: Service Level Agreement – a commitment to resolve issues within a specified timeframe.

Answer 47

A: To restore normal service operation as quickly as possible with minimal business impact.

Answer 48

A: A Priority 1 incident with the highest severity, often causing significant business disruption.

Answer 49

A: To analyze root causes, evaluate the response, and identify process improvements.

Answer 50

A: Information Technology Infrastructure Library.

Answer 51

A: To coordinate efforts, ensure clear communication, and oversee resolution activities.

Answer 52

A: Identifying goals and the problem the process aims to solve.

Answer 53

A: ITIL (Information Technology Infrastructure Library).

Answer 54

A: Provide regular updates with clear and concise information to stakeholders.

Answer 55

A: Use post-incident reviews, gather feedback, and implement iterative changes.

Answer 56

A: Responsible, Accountable, Consulted, Informed.

Answer 57

A: It ensures consistency, provides clarity, and supports training and compliance.

Answer 58

A: A predefined set of steps and procedures to handle specific incident types.

Answer 59

A: Ansible, Rundeck, or Zapier.

Answer 60

A: It changes file permissions (e.g., chmod 755 file).
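
A short sketch of what `chmod 755` does to a file's mode bits. It assumes GNU `stat` (the `-c` format flag is not available in BSD/macOS `stat`), and the temporary file is a hypothetical stand-in for a real script.

```shell
# Hypothetical scratch file for the demonstration.
script=$(mktemp)

# 755 = read/write/execute for the owner, read/execute for group and others.
chmod 755 "$script"

# GNU stat: %a prints the octal mode, %A the symbolic form.
mode_octal=$(stat -c '%a' "$script")
mode_text=$(stat -c '%A' "$script")
echo "$mode_octal $mode_text"

rm -f "$script"
```

Each octal digit is the sum of read (4), write (2), and execute (1) for owner, group, and others in that order.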

Answer 61

A: find /path/to/dir -name "*.log" -mtime -7.
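
To see the `-mtime -7` filter in action, the sketch below builds a scratch directory with one fresh and one backdated log file. It assumes GNU `touch -d` and GNU `find -printf`; all file names are hypothetical.

```shell
# Hypothetical scratch directory with a recent log, an old log, and a non-log file.
workdir=$(mktemp -d)
touch "$workdir/recent.log"
touch -d '10 days ago' "$workdir/old.log"   # GNU touch date syntax
touch "$workdir/notes.txt"

# -mtime -7 keeps files modified less than 7 days ago;
# -printf '%f\n' prints only the file name (GNU find).
recent_logs=$(find "$workdir" -name "*.log" -mtime -7 -printf '%f\n')
echo "$recent_logs"

rm -rf "$workdir"
```

Using `-mtime +7` instead would select files older than seven days, which is the usual form for log cleanup jobs.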

Answer 62

A: Piping (|) passes the output of one command as input to another. Example: ls -l | grep "filename".
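
A self-contained sketch of a three-stage pipeline. The input lines are made up for the illustration so the result does not depend on the files in the current directory.

```shell
# Stage 1 emits three lines, stage 2 keeps those ending in ".log",
# stage 3 counts the surviving lines. tr strips padding some wc builds add.
match_count=$(printf 'app.log\nerror.log\nreadme.txt\n' | grep '\.log$' | wc -l | tr -d ' ')
echo "$match_count"
```

Each command reads the previous command's standard output on its own standard input, so no temporary files are needed.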

Answer 63

A: > overwrites a file, while >> appends to a file.
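
The difference is easiest to see by writing to the same file three times. A minimal sketch, using a hypothetical temporary file:

```shell
out=$(mktemp)

echo "first" > "$out"     # creates or truncates the file
echo "second" >> "$out"   # appends a new line
after_append=$(cat "$out")

echo "third" > "$out"     # truncates again, discarding both earlier lines
after_overwrite=$(cat "$out")

rm -f "$out"
echo "$after_append"
echo "$after_overwrite"
```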

Answer 64

A: Processes and extracts data from text. Example: awk '{print $1}' file.txt prints the first column of a file.
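
A small sketch of field extraction with `awk`, using made-up input instead of a real `file.txt`. The second command assumes comma-separated data and sets the field separator with `-F`.

```shell
# Default separator is whitespace; $1 is the first field of each line.
first_column=$(printf 'alice 30 nyc\nbob 25 sf\n' | awk '{print $1}')

# -F',' switches the separator to a comma; $2 is then the second CSV field.
second_field_csv=$(printf 'alice,30\nbob,25\n' | awk -F',' '{print $2}')

echo "$first_column"
echo "$second_field_csv"
```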

Answer 65

#!/bin/bash
ps aux

Answer 66

A: Exits the script immediately if a command returns a non-zero status.
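
Assuming this card refers to `set -e`, the behavior can be compared side by side by running two tiny inline scripts in child `bash` processes. The variable names are hypothetical.

```shell
# Without set -e, the failing `false` is ignored and the script keeps going.
without_e=$(bash -c 'false; echo "still running"')

# With set -e, bash exits at the first non-zero status, so echo never runs.
with_e_status=0
with_e=$(bash -c 'set -e; false; echo "still running"') || with_e_status=$?

echo "without set -e: '$without_e'"
echo "with set -e: '$with_e' (exit status $with_e_status)"
```

Note that `set -e` does not trigger for commands tested in an `if` condition or on the left side of `||` and `&&`, which is why the second call above can safely capture the status.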

Answer 67

0 2 * * * /path/to/script.sh

Answer 68

A: Excludes lines matching a pattern. Example: grep -v "error" file.txt.
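
A quick sketch of the inverted match on made-up log lines:

```shell
# -v inverts the match: only lines that do NOT contain "error" are kept.
kept_lines=$(printf 'ok: started\nerror: disk full\nok: finished\n' | grep -v "error")
echo "$kept_lines"
```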

Answer 69

A: sed -i 's/foo/bar/g' file.txt.
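
The in-place substitution can be checked on a scratch file. This sketch assumes GNU `sed`; BSD/macOS `sed` expects an explicit backup suffix after `-i` (for example `sed -i '' ...`). The file contents are hypothetical.

```shell
conf=$(mktemp)
printf 'foo=1\nfoo foo\nbaz\n' > "$conf"

# s/foo/bar/ replaces a match; the trailing g replaces every match on a line,
# not just the first. -i writes the result back to the file.
sed -i 's/foo/bar/g' "$conf"

replaced=$(cat "$conf")
rm -f "$conf"
echo "$replaced"
```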

Answer 70

A: Master node (API server, scheduler, etcd, controller manager) and worker nodes (kubelet, kube-proxy, container runtime).

Answer 71

A: Run kubeadm init, then configure kubectl and join worker nodes using the provided token.

Answer 72

A: Use a Service of type NodePort or LoadBalancer, or configure an Ingress.

Answer 73

docker build -t <registry>/<image>:<tag> .
docker push <registry>/<image>:<tag>

Answer 74

A: docker ps.

Answer 75

A: Use kubectl scale deployment <deployment-name> --replicas=<count>.

Answer 76

A: Stores non-sensitive configuration data as key-value pairs, which can be used by applications.

Answer 77

kubectl rollout restart deployment

Answer 78

docker-compose is for local container orchestration, while Kubernetes manages containers across distributed systems.

Answer 79

Define a YAML file specifying the storage class, access modes, and size, then apply it using kubectl apply -f.

Answer 80

A: An incident is an immediate disruption of service, while a problem is the underlying cause of one or more incidents.

Answer 81

A: To review and approve proposed changes to minimize risks.

Answer 82

A: By assessing impact (business effect) and urgency (time sensitivity).

Answer 83

A: Faster resolution, improved communication, reduced downtime, and better documentation for future prevention.

Answer 84

A: A repository of solutions, troubleshooting guides, and documentation to help resolve incidents efficiently.

Answer 85

A: Acts as the single point of contact for users to report incidents and request services.

Answer 86

A: Reactive addresses incidents after they occur, while proactive identifies and prevents potential issues.

Answer 87

A: Key metrics include Mean Time to Resolution (MTTR), First Call Resolution (FCR) rate, and SLA compliance.

Answer 88

A: To involve higher-level support or management when the current team cannot resolve the issue within SLA timelines.

Answer 89

A: ServiceNow, PagerDuty, or Jira Service Management.

Answer 90

#!/bin/bash
for i in {1..10}; do
  touch "file$i.txt"
done

Answer 91

A: netstat -tuln or ss -tuln.

Answer 92

A: Network Address Translation translates private IP addresses to a public IP address for internet communication, conserving IPv4 addresses.

Answer 93

A: Use ip link add <bridge-name> type bridge, then configure it with ip addr (or the legacy ifconfig).

Answer 94

A: Soft links (symbolic links) point to the original file's path, while hard links are direct references to the inode, unaffected by file relocation.
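
The relocation behavior can be demonstrated by linking to a file both ways and then moving the original. A minimal sketch with hypothetical file names:

```shell
workdir=$(mktemp -d)
echo "payload" > "$workdir/original.txt"

ln "$workdir/original.txt" "$workdir/hard.txt"      # hard link: second name for the same inode
ln -s "$workdir/original.txt" "$workdir/soft.txt"   # soft link: stores the path as text

# Relocate the original file.
mv "$workdir/original.txt" "$workdir/moved.txt"

# The hard link still reaches the data; the soft link now points at a missing path.
hard_content=$(cat "$workdir/hard.txt")
if [ -e "$workdir/soft.txt" ]; then soft_state="valid"; else soft_state="dangling"; fi

rm -rf "$workdir"
echo "$hard_content / $soft_state"
```

Hard links cannot span filesystems or, in general, point at directories, which is where soft links are the usual choice.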

Answer 95

A: ping tests connectivity to a host, while traceroute shows the route packets take to reach the host.

Answer 96

Generate a key pair: ssh-keygen. Copy the public key to the remote server: ssh-copy-id user@remote_host. Ensure proper permissions: chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys.

Answer 97

It converts input into arguments for a command. Example: ls | xargs rm removes files listed by ls.
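
One caveat with `ls | xargs rm` is that file names containing spaces are split into several arguments. The sketch below swaps in a commonly used safer variant, `find -print0 | xargs -0`, which separates names with NUL bytes; the file names are hypothetical.

```shell
workdir=$(mktemp -d)
touch "$workdir/a.tmp" "$workdir/b c.tmp" "$workdir/keep.txt"

# -print0 and -0 delimit names with NUL bytes, so "b c.tmp" stays one argument.
find "$workdir" -name '*.tmp' -print0 | xargs -0 rm -f

remaining=$(ls "$workdir")
rm -rf "$workdir"
echo "$remaining"
```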

Answer 98

A: command > file 2>&1.
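
A small sketch showing both streams landing in the same file. The command group in braces is a hypothetical stand-in for any command that writes to stdout and stderr.

```shell
log=$(mktemp)

# Redirections apply left to right: stdout goes to the file first,
# then stderr is pointed at wherever stdout currently goes (the file).
{ echo "to stdout"; echo "to stderr" >&2; } > "$log" 2>&1

captured=$(cat "$log")
rm -f "$log"
echo "$captured"
```

Order matters: writing `2>&1 > file` instead would send stderr to the terminal, because stderr is duplicated before stdout is redirected.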

Answer 99

Functional Escalation: Involves higher-level technical expertise. Hierarchical Escalation: Involves senior management for visibility or decision-making.

Answer 100

SLA (Service Level Agreement): Defines service delivery expectations between a provider and a customer. OLA (Operational Level Agreement): Defines responsibilities between internal teams. UC (Underpinning Contract): Defines obligations between a provider and third-party vendors.

Answer 101

Assess Impact: Identify affected systems and services. Activate Major Incident Process: Notify stakeholders and assemble an incident response team. Communicate Updates: Provide regular updates to users and management. Implement Fix: Work on resolution or mitigation. Document: Log details for post-incident review.

Answer 102

A process to determine the underlying reason for an incident or problem and identify corrective measures to prevent recurrence.

Answer 103

MTTR (Mean Time to Resolve). MTTD (Mean Time to Detect). First Call Resolution Rate. Incident Escalation Rate.

Answer 104

Coordinate response teams. Ensure SLA compliance. Communicate with stakeholders. Drive root cause analysis and post-incident reviews. Identify areas for process improvement.

Answer 105

Identify: Detect and verify the breach. Contain: Isolate affected systems. Eradicate: Remove the threat. Recover: Restore systems and data. Learn: Conduct a post-incident review.

Answer 106

DNS (Domain Name System) resolves human-readable domain names (e.g., google.com) into IP addresses that computers use to identify resources.

Answer 107

A (Address): Maps a domain to an IPv4 address. AAAA: Maps a domain to an IPv6 address. CNAME: Maps an alias to another domain name. MX (Mail Exchange): Specifies mail servers for a domain. NS (Name Server): Specifies authoritative DNS servers for a domain. PTR (Pointer): Provides reverse DNS, mapping an IP address to a hostname. TXT: Stores arbitrary text, often used for verification and policies (e.g., SPF, DKIM).

Answer 108

A: Queries DNS servers for information about domains and their records.

Answer 109

nslookup google.com

Answer 110

- `kubectl version`: Get Kubernetes client and server versions.
- `kubectl get pods`: List pods in the current namespace.
- `kubectl describe pod <pod-name>`: Get detailed info about a pod.
- `kubectl apply -f <file>.yaml`: Apply configuration from a YAML file.
- `kubectl delete pod <pod-name>`: Delete a pod.

Answer 111

Physical Layer The physical layer is responsible for the actual physical connection between devices. It carries information in the form of bits.

Answer 112

Data Link Layer (DLL) The data link layer is responsible for the node-to-node delivery of the message.

Answer 113

Network Layer The network layer works for the transmission of data from one host to the other located in different networks. It also takes care of packet routing i.e. selection of the shortest path to transmit the packet, from the number of routes available.

Answer 114

Transport Layer The transport layer provides services to the application layer and takes services from the network layer. The data in the transport layer is referred to as Segments. It is responsible for the end-to-end delivery of the complete message. The transport layer also provides the acknowledgment of the successful data transmission and re-transmits the data if an error is found. Protocols used in the Transport Layer are TCP and UDP.

Answer 115

Layer 5 – Session Layer Session Layer in the OSI Model is responsible for the establishment of connections, management of connections, terminations of sessions between two devices. It also provides authentication and security. Protocols used in the Session Layer are NetBIOS, PPTP.

Answer 116

Presentation Layer The presentation layer is also called the Translation layer. The data from the application layer is extracted here and manipulated as per the required format to transmit over the network. Protocols used in the Presentation Layer are JPEG, MPEG, GIF, TLS/SSL, etc.

Answer 117

Application Layer At the very top of the OSI Reference Model stack of layers, we find the Application layer which is implemented by the network applications. These applications produce the data to be transferred over the network. This layer also serves as a window for the application services to access the network and for displaying the received information to the user. Protocols used in the Application layer are SMTP, FTP, DNS, etc.

Answer 118

TCP: Establishes a connection before sending data so that transmission is reliable. It acknowledges receipt, retransmits lost segments, and checks for errors. UDP: Connectionless; it does not confirm receipt or retransmit, so some data may be lost during transmission.

Answer 119

An issue that needs to be addressed immediately and with as many resources as required. Such an issue causes a full outage or makes a critical function of the product unavailable for everyone, without any known workaround.

Answer 120

- Severe impact on TikTok end users: app functions are broken and severe experience issues are being encountered
- 3 or more teams impacted, such as TCE, RDS, HDFS
- Quantifiable revenue or advertiser impact
- Security impact: risk to customer data, security breach, data loss, vulnerabilities, hack/attack, etc.

Answer 121

System-wide impact vs. single-user impact, weighed against urgency.

Answer 122

Responsibilities:
1. IMT is added to a P0 incident and begins tracking the incident timeline.
2. Ensures escalation to the correct technical teams based on the systems impacted.
3. Ensures the incident is addressed in a timely manner and drives escalations to team leads and managers.
4. Opens a fatal record.
5. Starts the incident analysis template to begin the incident report.
6. Tracks the incident details and drives the incident group until the impact is mitigated.
7. Adds all relevant data to the incident report and begins the post-incident review process.
Security incidents escalate to the appropriate security channel.

Answer 123

IT Service Management

Answer 124

1. Initial triage: join or create the on-call, review chat logs, request an issue summary, request POC updates, and ensure all escalation contacts needed to investigate the issue are engaged.
2. Manage the incident: update the TTP incident thread every 15 minutes for critical issues, request regular updates from technical teams, and populate the IAT with as much data as possible.
3. Post-incident: once mitigated, lower the priority to P1, send the final message to the appropriate groups, create a Jira epic, and create the post-mortem doc.

Answer 125

1. identify - detect/log incident 2. analyze - categorize and prioritize incidents 3. respond - investigate, diagnose and resolve incidents 4. review - post mortem and improvements

Answer 126

1. identify incidents 2. document incidents 3. categorize incidents 4. assign ownership

Answer 127

1. incident coordinator 2. technical specialists 3. communication manager 4. process owner

Answer 128

1. service outage 2. security breach 3. human error 4. natural disasters

Answer 129

Responsible: Person(s) doing the task. Accountable: Person with final decision-making authority. Supporting: Person(s) providing support or resources. Consulted: Person(s) providing input or feedback. Informed: Person(s) kept updated on progress or decisions.

Answer 130

Network Traffic Drop in TTP1 OCI

Answer 131

CI/CD stands for Continuous Integration and Continuous Deployment/Delivery, automating code integration, testing, and deployment processes.

Answer 132

A: CI involves merging developer code changes into a shared repository multiple times a day, with automated builds and testing to detect integration issues early.

Answer 133

A: CD automates the release process, ensuring the application is always in a deployable state but requires manual approval to deploy to production.

Answer 134

A: Continuous Deployment automates the release process entirely, deploying every change that passes automated tests to production without manual intervention.

Answer 135

Code Commit: Developers push code to a repository. Build: The application is compiled, and dependencies are installed. Test: Automated tests validate the code. Deploy: The tested application is deployed to staging or production.

Answer 136

A: To ensure that changes do not break existing functionality or introduce bugs.

Answer 137

A: Jenkins, GitHub Actions, GitLab CI/CD, CircleCI, Travis CI, Azure DevOps, etc.

Answer 138

A: Version control (e.g., Git) tracks changes, enables collaboration, and integrates with CI/CD pipelines for automated builds and testing.

Answer 139

A: A strategy where a new version is rolled out to a small subset of users before full deployment to minimize risk.

Answer 140

A: Reverting to a previous stable version in case of deployment failures.

Answer 141

A: To define and clarify roles and responsibilities within a project or process to avoid confusion.

Answer 142

Assign clear responsibilities for incident detection, escalation, resolution, and review. Ensure accountability is established for critical decision-making. Identify key stakeholders to keep informed during incidents.

Answer 143

Responsible: Incident responder or SRE. Accountable: Incident manager or lead. Supporting: System administrator or SME. Consulted: Security team or product owner. Informed: Leadership or affected customers.

Answer 144

Overlapping roles leading to confusion. Lack of agreement on responsibilities. Not keeping the matrix up to date with organizational changes.

Answer 145

A: Incident Management focuses on restoring service as quickly as possible. Problem Management identifies and resolves the underlying cause of incidents.

Answer 146

A: A set of best practices for IT service management that aligns IT services with business needs.

Answer 147

A: To act as a single point of contact (SPOC) for users to report incidents and request services.

Answer 148

A: A group responsible for evaluating and approving proposed changes to IT systems.

Answer 149

People: Stakeholders and roles. Processes: Workflows and activities. Products: Technology and tools. Partners: Vendors and suppliers.

Answer 150

Mean Time to Resolve (MTTR). Mean Time to Detect (MTTD). Number of recurring incidents. SLA compliance rate.

Answer 151

Reactive: Focuses on resolving incidents after they occur. Proactive: Prevents incidents through monitoring, analysis, and improvement.

(176 cards)