A Critical Site Event: Flashcards
(85 cards)
What is a Critical Site Event (CSE)?
A CSE is an event with the potential for large, direct impact on customers. CSEs are highly visible, urgent situations with a high risk of load loss, and they are triggered by a loss of redundancy or resiliency in systems.
Example: A loss of supply power to the cooling systems that risks overheating without direct customer impact.
What defines a Large Scale Event (LSE)?
An LSE is an event that impacts customers’ ability to connect, is reported by a customer contact, and indicates that customers are affected by a loss of service.
Example: A similar loss of power to the cooling systems, but one where customers make contact to report impact.
What is the primary difference between a CSE and an LSE?
CSEs are triggered by events in which service remains available but customer impact is possible, while LSEs are triggered by customer contact or by events that result in a loss of service.
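Sketch: the CSE/LSE distinction can be written as a small decision function. This is only an illustration under stated assumptions: the names `EventType`, `classify_event`, and the three boolean inputs are hypothetical and do not come from any real tooling, and the `NONE` case is added only so the function is total.

```python
from enum import Enum


class EventType(Enum):
    CSE = "Critical Site Event"
    LSE = "Large Scale Event"
    NONE = "No event"  # assumption: not defined in the cards


def classify_event(redundancy_lost: bool,
                   customer_contact: bool,
                   service_lost: bool) -> EventType:
    """Hypothetical classifier based on the card definitions."""
    # LSE: customers have made contact or service has been lost.
    if customer_contact or service_lost:
        return EventType.LSE
    # CSE: redundancy/resiliency is lost but service is still available.
    if redundancy_lost:
        return EventType.CSE
    return EventType.NONE
```

For example, a loss of cooling power with no customer contact would classify as a CSE, while the same fault with customer contact would classify as an LSE.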
What is the role of the DCEO team?
The DCEO team is responsible for maintaining all critical infrastructure within data centers globally.
What should be done when a critical alarm is triggered?
DCEO needs to respond immediately by acknowledging the alarm and assessing the situation at the alarm’s location.
List the phases in the event management process.
- React: Acknowledge the alarm
- Investigate: Engage with the equipment
- Communicate: Update and contact stakeholders
- Fault Find: Identify the root cause
- Update: Keep relevant parties informed
- Stabilize and Restore: Implement mitigation steps.
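Sketch: the phases above can be modeled as an ordered enumeration. This assumes the phases run strictly in the listed order, which the cards do not state explicitly; the names `Phase` and `next_phase` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    REACT = 1                  # Acknowledge the alarm
    INVESTIGATE = 2            # Engage with the equipment
    COMMUNICATE = 3            # Update and contact stakeholders
    FAULT_FIND = 4             # Identify the root cause
    UPDATE = 5                 # Keep relevant parties informed
    STABILIZE_AND_RESTORE = 6  # Implement mitigation steps


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.STABILIZE_AND_RESTORE:
        return None
    return Phase(current + 1)
```

In practice communication and updates are likely to overlap with the other phases, so the linear ordering here is a memory aid rather than a strict workflow.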
Who is the First Responder?
The First Responder is any person onsite responsible for responding to alarms and notifying surrounding EOTs of the alarm.
What is the responsibility of the Incident Commander (IC)?
The IC is responsible for communications during the event, ensuring proper handling, providing updates, and escalating the event as needed.
Fill in the blank: The _______ monitors and prioritizes alarms globally.
[Facility Operations Center (FOC)]
What are the InfraOps Tenets?
- Safe Work Environment
- Security
- Prepare for the Improbable
- Automation
- Speed Matters
- Serviceability
- Continuous Improvement.
What should be done if there is a loss of redundancy?
The First Responder should investigate potential customer impact, assess conditions, and then escalate as needed.
True or False: The Primary On-Call is typically an On-Site Facility Manager.
True.
What steps should be taken when escalations are necessary?
Communicate early and often, assess options, redirect network traffic, and request additional resources.
What is the primary responsibility of the Call Leader?
Leads the FOC conference call during customer or redundancy/resiliency impacting events and keeps the Response Team focused on recovery efforts.
What happens if a critical alarm is received?
The First Responder EOT needs to react immediately.
What is the significance of the FOC in the event management process?
The FOC provides first-level support, monitors alarms, and helps resolve events on a 24-hour basis.
What actions should the First Responder take during an event?
- Contact another member to act as IC
- Establish communication with IC
- Validate alarms and escalate if required.
What is the escalation process for customer-impacting LSEs?
All customer-impacting LSEs must be escalated to the Cluster Manager for mitigation planning and recovery approval.
Fill in the blank: If issues require escalation to the Regional/Cluster Manager, prepare an email to Amazon Senior _______.
[Leadership]
What should be assessed before attempting any resets after a power event?
Any damage
Important to ensure safety and proper functioning before resetting systems.
What is the first action taken by FOC when they see an alarm?
Cuts a TT to the affected site
TT stands for Trouble Ticket.
If the FOC does not pick up on an alarm, what should the on-call EOT do?
Get in contact with the FOC ASAP
Ensures timely response to alarms.
What should the EOT on site do if they cannot get through to the FOC?
Create your own TT and work from that
This allows for independent action in critical situations.
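Sketch: the three TT cards above form a simple fallback chain. The function below is a hypothetical illustration only; `eot_ticket_action` and its inputs are assumed names, not part of any real FOC tooling.

```python
def eot_ticket_action(foc_cut_tt: bool, foc_reachable: bool) -> str:
    """Hypothetical summary of what the on-call EOT does about the TT."""
    # Normal path: the FOC saw the alarm and cut a TT to the site.
    if foc_cut_tt:
        return "work from FOC TT"
    # The FOC missed the alarm: get in contact with the FOC ASAP.
    if foc_reachable:
        return "contact FOC ASAP"
    # The FOC cannot be reached: create your own TT and work from that.
    return "create own TT"
```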
Which button should the EOT taking up the IC position use for events?
CSE Power or Thermal event button
Available through a Tampermonkey script.