A Critical Site Event: Flashcards

(85 cards)

1
Q

What is a Critical Site Event (CSE)?

A

A CSE is a highly visible, urgent event with the potential for large and direct impact on customers. It carries a high risk of load loss and is triggered by a loss of redundancy or resiliency in systems.

Example: A loss of supply power to the cooling systems that risks overheating without direct customer impact.

2
Q

What defines a Large Scale Event (LSE)?

A

An LSE is an event that impacts customers’ ability to connect, is reported by a customer contact, and indicates that customers are affected by a loss of service.

Example: A similar loss of power to cooling systems, but one where customers make contact.

3
Q

What is the primary difference between a CSE and an LSE?

A

CSEs are triggered by events where service remains available but may cause customer impact, while LSEs are triggered by customer contact or events that result in loss of service.

4
Q

What is the role of the DCEO team?

A

The DCEO team is responsible for maintaining all critical infrastructure within data centers globally.

5
Q

What should be done when a critical alarm is triggered?

A

DCEO needs to respond immediately by acknowledging the alarm and assessing the situation at the alarm’s location.

6
Q

List the phases in the event management process.

A
  • React: Acknowledge the alarm
  • Investigate: Engage with the equipment
  • Communicate: Update and contact stakeholders
  • Fault Find: Identify the root cause
  • Update: Keep relevant parties informed
  • Stabilize and Restore: Implement mitigation steps.
7
Q

Who is the First Responder?

A

The First Responder is any person onsite responsible for responding to alarms and notifying surrounding EOTs of the alarm.

8
Q

What is the responsibility of the Incident Commander (IC)?

A

The IC is responsible for communications during the event, ensuring proper handling, providing updates, and escalating the event as needed.

9
Q

Fill in the blank: The _______ monitors and prioritizes alarms globally.

A

[Facility Operations Center (FOC)]

10
Q

What are the InfraOps Tenets?

A
  • Safe Work Environment
  • Security
  • Prepare for the Improbable
  • Automation
  • Speed Matters
  • Serviceability
  • Continuous Improvement.
11
Q

What should be done if there is a loss of redundancy?

A

The First Responder should investigate potential customer impact, assess conditions, and then escalate as needed.

12
Q

True or False: The Primary On-Call is typically an On-Site Facility Manager.

A

True.

13
Q

What steps should be taken when escalations are necessary?

A

Communicate early and often, assess options, redirect network traffic, and request additional resources.

14
Q

What is the primary responsibility of the Call Leader?

A

Leads the FOC conference call during customer or redundancy/resiliency impacting events and keeps the Response Team focused on recovery efforts.

15
Q

What happens if a critical alarm is received?

A

The First Responder EOT needs to react immediately.

16
Q

What is the significance of the FOC in the event management process?

A

The FOC provides first-level support, monitors alarms, and helps resolve events on a 24-hour basis.

17
Q

What actions should the First Responder take during an event?

A
  • Contact another member to act as IC
  • Establish communication with IC
  • Validate alarms and escalate if required.
18
Q

What is the escalation process for customer-impacting LSEs?

A

All customer-impacting LSEs must be escalated to the Cluster Manager for mitigation planning and recovery approval.

19
Q

Fill in the blank: If issues require escalation to the Regional/Cluster Manager, prepare an email to Amazon Senior _______.

A

[Leadership]

20
Q

What should be assessed before attempting any resets after a power event?

A

Any damage

Important to ensure safety and proper functioning before resetting systems.

21
Q

What is the first action taken by FOC when they see an alarm?

A

Cuts a TT to the affected site

TT stands for Trouble Ticket.

22
Q

If the FOC does not pick up on an alarm, what should the on-call EOT do?

A

Get in contact with the FOC ASAP

Ensures timely response to alarms.

23
Q

What should the EOT on site do if they cannot get through to the FOC?

A

Create your own TT and work from that

This allows for independent action in critical situations.

24
Q

Which button should the EOT taking up the IC position use for events?

A

CSE Power or Thermal event button

Available through a TamperMonkey Script.

25
What tools are needed to address script-inserted questions?
  • Doors of Durin
  • AHA and its CSE Response Tool
  • GRC
  • BMS/EPMS
  • Holocron
  • The First Responder
  • The FOC
  • The DCO Team
## Footnote These tools aid in managing and responding to events effectively.
26
What is the main focus of the First Responder during an alarm investigation?
Fault finding and troubleshooting ## Footnote They communicate only with the IC.
27
What does the IC do when the First Responder provides updates?
  • Contacts the FOC to spin up a conference call if required
  • Escalates using the suitable escalation path
  • Supplies regular updates to the ticket and conference call
## Footnote Keeping communication clear and consistent is essential.
28
What information should the IC update in the TT regarding critical load support?
  • Impact YES/NO
  • Potential impact
  • Staff onsite and roles
  • How critical load is supported
  • UPS Autonomy Times
  • Vendor engagement status
  • Access arrangements
  • Status of affected POD/Electrical room
## Footnote This information is crucial for assessing the situation accurately.
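The checklist above can be kept as a small structured record so that no field is forgotten when updating the TT. This is a minimal sketch only: the class name `CriticalLoadUpdate`, its field names, and the text layout are hypothetical and do not reflect any real ticketing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriticalLoadUpdate:
    """Hypothetical record mirroring the eight checklist items on this card."""
    impact: bool                # Impact YES/NO
    potential_impact: str       # Potential impact
    staff_onsite: List[str]     # Staff onsite and roles
    load_support: str           # How critical load is supported
    ups_autonomy_times: str     # UPS Autonomy Times
    vendor_status: str          # Vendor engagement status
    access_arrangements: str    # Access arrangements
    pod_room_status: str        # Status of affected POD/Electrical room

    def to_ticket_text(self) -> str:
        """Render the update as one line per checklist item."""
        lines = [
            f"Impact: {'YES' if self.impact else 'NO'}",
            f"Potential impact: {self.potential_impact}",
            f"Staff onsite and roles: {', '.join(self.staff_onsite)}",
            f"Critical load supported by: {self.load_support}",
            f"UPS autonomy times: {self.ups_autonomy_times}",
            f"Vendor engagement: {self.vendor_status}",
            f"Access arrangements: {self.access_arrangements}",
            f"POD/Electrical room status: {self.pod_room_status}",
        ]
        return "\n".join(lines)


# Example values are invented for illustration.
update = CriticalLoadUpdate(
    impact=False,
    potential_impact="Loss of redundancy on one lineup",
    staff_onsite=["EOT 1 (IC)", "EOT 2 (First Responder)"],
    load_support="Generator",
    ups_autonomy_times="Not applicable while on generator",
    vendor_status="Vendor called, awaiting ETA",
    access_arrangements="Escort arranged at main gate",
    pod_room_status="Electrical room stable",
)
print(update.to_ticket_text())
```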
29
What is a CSE?
An infrastructure event impacting two or more racks ## Footnote Stands for Critical Site Event.
30
What is the purpose of the CSE Response Tool?
To diagnose a CSE by providing critical data ## Footnote It offers insights into host and rack impairment, server temperature data, and critical electronic metrics.
31
What should the IC do if they need to escalate an issue?
Make sure to speak to someone directly ## Footnote Emails, texts, or chimes do not count as escalation.
32
What is the role of the IC during a conference call?
Clearly identify themselves and remain on the call until resolved ## Footnote Important for maintaining clarity and continuity.
33
What steps should the IC take if there is a loss of redundancy?
Take immediate action if critical load/mechanical load has not transferred ## Footnote This may involve escalating the situation and conducting tests.
34
Fill in the blank: The IC is expected to use the _______ button on the TT for thermal events.
CSE Thermal Event ## Footnote This differentiates between power and thermal incidents.
35
What must be updated in the TT regarding the status of AHU/CRAH modules?
  • Temperature in Pod
  • Difference in temps over last 5-10 mins
  • Status of remaining healthy modules
  • Number of fault-free modules
  • Status of associated HSDB/MSDB
## Footnote Critical for managing thermal events effectively.
36
What is the primary function of the CSE Response Tool?
To provide high-level and detailed views into the status of racks, hosts, and electrical infrastructure in a data center during Critical Site Events
37
What type of data does the CSE Response Tool incorporate?
Hotspot data for thermal critical racks and average rack temperatures
38
List the benefits of the CSE Response Tool
  • Shows important visuals on one page
  • Improves HotSpot alarms by monitoring aggregated thermal data
  • Initiates alarms based on rate-of-change thresholds
39
Define 'Up' status for racks in the CSE Response Tool.
Fewer than 25% of the operational hosts in the rack are not reachable from the network
40
What does 'Down' status indicate for a rack?
At least 25% of the operational hosts in the rack are not reachable from the network
41
What temperature indicates a rack is 'Thermal Critical'?
The average temperature of the hosts in the rack is equal to or greater than 35°C
42
What does 'Unknown' status mean for a rack?
There is not enough information to determine the rack status
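The rack status definitions on cards 39 to 42 can be sketched as two small checks. This is an illustration under stated assumptions, not the tool's actual logic: the function names and inputs are hypothetical, and missing host counts are treated as "Unknown".

```python
from typing import Optional, Sequence

DOWN_FRACTION = 0.25        # at least 25% unreachable means Down (card 40)
THERMAL_CRITICAL_C = 35.0   # average host temperature threshold (card 41)


def rack_network_status(operational_hosts: Optional[int],
                        unreachable_hosts: Optional[int]) -> str:
    """Return 'Up', 'Down', or 'Unknown' from host reachability counts."""
    # Assumption: missing or empty counts mean there is not enough information.
    if operational_hosts is None or unreachable_hosts is None or operational_hosts <= 0:
        return "Unknown"
    fraction = unreachable_hosts / operational_hosts
    return "Down" if fraction >= DOWN_FRACTION else "Up"


def is_thermal_critical(host_temps_c: Sequence[float]) -> bool:
    """True when the average host temperature is at or above 35°C."""
    if not host_temps_c:
        return False  # assumption: no readings means no thermal verdict
    return sum(host_temps_c) / len(host_temps_c) >= THERMAL_CRITICAL_C


print(rack_network_status(40, 9))         # 22.5% unreachable
print(rack_network_status(40, 10))        # exactly 25% unreachable
print(is_thermal_critical([34.0, 36.0]))  # average is exactly 35.0
```

Note the boundary: exactly 25% unreachable counts as Down, and exactly 35°C counts as thermal critical, matching the "at least" and "equal to or greater than" wording on the cards.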
43
How does the InfraMap floor map enhance rack status monitoring?
It includes rack status and thermal status data for spatial identification of issues
44
What information does the InfraMap SOS dashboard provide?
Electrical infrastructure status data including utility, generator, or UPS power readings
45
Who receives alerts when thermal CSE conditions are met?
  • A member of the FOC monitoring a potential CSE
  • A Field Engineer remotely monitoring the site
  • A local DCEO EOT
46
What criteria create a potential thermal CSE alert?
  • Average host temperatures in a room equal to or over 35°C
  • More than 100 thermal critical racks with over 10 racks reporting inlet temperature
  • Average host temperatures diverging more than 5°C from the 30-minute trailing average
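The three criteria above can be expressed as a single check over a room snapshot. This is a sketch with several assumptions that the card does not confirm: any one criterion is treated as sufficient, the second criterion is read as "more than 100 thermal critical racks and more than 10 racks reporting inlet temperature", and divergence is measured in either direction. The `RoomThermalSnapshot` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoomThermalSnapshot:
    """Hypothetical per-room inputs; not a real API."""
    avg_host_temp_c: float
    trailing_30min_avg_c: float
    thermal_critical_racks: int
    racks_reporting_inlet: int


def thermal_cse_reasons(s: RoomThermalSnapshot) -> List[str]:
    """Return the criteria that are met; an empty list means no alert."""
    reasons = []
    if s.avg_host_temp_c >= 35.0:
        reasons.append("room average at or above 35C")
    # Assumed reading of the second criterion (see lead-in).
    if s.thermal_critical_racks > 100 and s.racks_reporting_inlet > 10:
        reasons.append("more than 100 thermal critical racks")
    # Assumption: divergence counts in both directions.
    if abs(s.avg_host_temp_c - s.trailing_30min_avg_c) > 5.0:
        reasons.append("diverged more than 5C from 30-minute average")
    return reasons


snapshot = RoomThermalSnapshot(
    avg_host_temp_c=31.0,
    trailing_30min_avg_c=24.0,
    thermal_critical_racks=0,
    racks_reporting_inlet=0,
)
print(thermal_cse_reasons(snapshot))
```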
47
What is the maximum time span for event de-duplication in thermal CSE alerts?
30 minutes
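One way to picture a 30-minute de-duplication window is to suppress repeat alerts for the same room until the window has elapsed. The card states only the time span, so everything else here is an assumption: the per-room keying, the class name `AlertDeduplicator`, and the choice to measure the window from the last alert that was actually sent.

```python
from datetime import datetime, timedelta
from typing import Dict

DEDUP_WINDOW = timedelta(minutes=30)  # time span from the card


class AlertDeduplicator:
    """Hypothetical helper that drops repeat alerts inside the window."""

    def __init__(self) -> None:
        self._last_sent: Dict[str, datetime] = {}

    def should_send(self, room: str, now: datetime) -> bool:
        last = self._last_sent.get(room)
        if last is not None and now - last < DEDUP_WINDOW:
            return False  # duplicate within 30 minutes: suppress
        self._last_sent[room] = now
        return True


dedup = AlertDeduplicator()
t0 = datetime(2024, 1, 1, 12, 0)
print(dedup.should_send("room-1", t0))                          # first alert
print(dedup.should_send("room-1", t0 + timedelta(minutes=10)))  # inside window
print(dedup.should_send("room-1", t0 + timedelta(minutes=30)))  # window elapsed
```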
48
What categories are racks separated into within the CSE Response Tool?
  • Thermal Critical
  • Thermal Impaired
  • Impaired
  • Normal
  • Unknown
49
What key statuses are displayed on the CSE Dashboard?
  • Event-Related Rack Downs
  • At-Risk Racks
  • Event Impact
  • Electrical Infrastructure Status
  • Down Rack Count
  • IT Load and ATS Monitoring Status
50
What does the Thermal Status tab show?
Rack thermal status, thermal impact count, and rack average temperatures
51
What does the Rack Detail page provide?
Details about a specific rack and the hosts it contains
52
What type of data does the CSE Response Tool display during a CSE?
Critical electrical infrastructure data alongside down racks data
53
What is the purpose of the SOS Dashboard?
To view high-level electrical utility, generator, and UPS states for every lineup in a data center
54
What does a reading of zero in the UPS column indicate?
The UPS is in use
55
What scenario indicates the data center is operating normally?
All meters are green
56
What does the term 'On Generator' indicate?
The generator is active and producing output load
57
What indicates a lineup is on UPS?
Both USB and generator readings are at zero
58
Describe an 'Edge Case' scenario.
The generator is tested with load output, showing greater than zero while the USB indicates power from the utility
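The scenarios on cards 56 to 58 can be summarized as a lookup over two meter readings. This is a simplified sketch, not the dashboard's real logic: it takes the USB reading as the utility-side input in the way the cards use it, the function name is hypothetical, the "On Utility" label for the remaining case is an assumption, and the failed-transfer and meter defect states are not modeled.

```python
def lineup_power_state(usb_reading: float, generator_reading: float) -> str:
    """Classify a lineup from its USB and generator meter readings."""
    utility_present = usb_reading > 0
    generator_active = generator_reading > 0

    if not utility_present and not generator_active:
        return "On UPS"        # both readings at zero (card 57)
    if utility_present and generator_active:
        return "Edge Case"     # generator load test while on utility (card 58)
    if generator_active:
        return "On Generator"  # generator producing output load (card 56)
    return "On Utility"        # assumed label for the normal case


# Reading values are invented for illustration.
print(lineup_power_state(0, 0))
print(lineup_power_state(0, 450))
print(lineup_power_state(500, 120))
print(lineup_power_state(500, 0))
```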
59
What happens in a 'Live Load Transfer - Failed to Transfer' scenario?
Load is shown on the generator meter and both USB inputs, with a UPS reading lower than 100%
60
What are 'Meter Defect States'?
States where lineups are not actively monitored or have faulty meters
61
What does AHA stand for?
Amazon Hardware Atlas
62
What is the main function of AHA?
Data center health monitoring during event recovery and regular operations
63
How often does AHA ping hosts in PROD and EC2 fabrics?
Every minute
64
What indicates a rack is thermal impaired?
At least 25% of hosts are impaired and the rack was thermal critical before becoming impaired
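The distinction this card draws depends on rack history: the same impairment level is labeled differently when the rack was thermal critical first. A minimal sketch, assuming that plain "Impaired" uses the same 25% threshold (the cards do not state this) and using a hypothetical function name:

```python
from typing import Optional

IMPAIRED_FRACTION = 0.25  # at least 25% of hosts impaired


def impairment_category(total_hosts: int,
                        impaired_hosts: int,
                        was_thermal_critical: bool) -> Optional[str]:
    """Return 'Thermal Impaired', 'Impaired', or None if below threshold."""
    if total_hosts <= 0:
        return None  # assumption: no hosts means no verdict
    if impaired_hosts / total_hosts < IMPAIRED_FRACTION:
        return None
    # The rack's thermal history decides which label applies.
    return "Thermal Impaired" if was_thermal_critical else "Impaired"


print(impairment_category(40, 10, True))
print(impairment_category(40, 10, False))
print(impairment_category(40, 9, True))
```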
65
What does AHA use to monitor host impairment status?
Ping data and thermal data
66
What regions is AHA available in?
  • Classic Regions
  • BJS/ZHY
  • PDT/OSU
  • DCA
  • LCK
67
What does the AHA Blast Radius feature allow operators to do?
Identify racks downstream of power topology equipment
68
What information does the downstream rack page display?
Fabric and rack type breakdown, along with topology nodes
69
What does Rack Splits provide information about?
Breakdown of racks in a Pod, including fabric type and supply
70
What must you do to create an Impact Analysis in AHA?
Be part of the security group 'aha-impact-analysis-admin' and enter the Datacenter and Date
71
What is required to create an Impact Analysis in AHA?
You must be part of the security group aha-impact-analysis-admin and in the InfraOps Central Ops org ## Footnote This ensures only authorized users can create event analyses.
72
What information must be entered to create an Impact Analysis?
Datacenter, Date, and Time (in UTC) ## Footnote This data is essential for accurate event creation.
73
How can users view different potential impacts during event creation?
By selecting a different date and time ## Footnote This feature allows analysis of various scenarios.
74
What is displayed on the next screen after selecting a date and time for Impact Analysis?
A timeline of impacted racks ## Footnote This helps identify the starting point of the analysis.
75
What must be entered in the confirmation modal when creating an event?
An associated ticket (or SIM ID) ## Footnote This links the event to existing support tickets.
76
What information is provided on the event analysis page?
Analysis of the event, currently impacted racks, and number of recovered hosts ## Footnote This gives insights into the event's impact and recovery status.
77
What action can be taken after viewing the event analysis?
Click Post to Ticket to append a list of impacted racks ## Footnote This facilitates communication with the support team.
78
What currently happens to events after 10 hours?
They close automatically ## Footnote This ensures timely event management.
79
What is Seismo in the context of AHA?
The DC Availability Anomaly Detector that detects multi-rack impairments ## Footnote It triggers alarms for quick response to significant issues.
80
What is the temperature threshold for Seismo to cut a ticket to the FOC?
Above 35 degrees Celsius ## Footnote This helps manage thermal anomalies in data centers.
81
What is the purpose of HotSpot Redux?
It aggregates and processes server-level environmental sensor readings ## Footnote This data is crucial for monitoring and predicting thermal events.
82
What teams utilize HotSpot data?
AHA/InfraMap, Seismo, and DCS Science Team ## Footnote They use this data to analyze and respond to potential thermal events.
83
What is the function of the HWMon team in relation to HotSpot?
They publish source data from servers with software-based hypervisors ## Footnote This is part of the data aggregation process.
84
What is the legacy system that HotSpot Redux redesigned?
The legacy Hotspot service ## Footnote The redesign improves data handling and analysis capabilities.
85
What action should be taken for special feature requests?
Submit a ticket to get your ideas heard ## Footnote This allows users to contribute to system improvements.