recover Flashcards
(36 cards)
what is an event
any change of state that has significance for the management of a service. Typically, they are notifications from monitoring tools
what is an incident
an unplanned interruption to a service or reduction in the quality of a service.
what is a problem
- what is a known error
a cause, or potential cause, of one or more incidents
- known errors are problems that have been analysed but not resolved
what is incident management
- purpose?
- to minimise the negative impacts of incidents by restoring normal service operation as quickly as possible
- diagnose and escalate
- reactive process
- not a proactive measure
what is problem management
- purpose?
- reduce likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors
- reactive and proactive
- same incident occurring many times; affects many users;
incident management process (4)
- identify
- log
- categorise
- prioritise
incident identification
come from:
1. users: walk-ups, self-service, emails, etc
2. alerts: application monitoring software
decide if issue is an incident OR request
incident logging
include:
- user’s name and contact information
- incident description
- date and time of incident
- date and time of incident report
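A minimal sketch of how the logged fields above could be held in one record. The class name `IncidentLog`, its field names, and the sample values are hypothetical and chosen only for illustration; real service desk tools define their own schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentLog:
    """Hypothetical record holding the fields listed on the card."""
    user_name: str
    user_contact: str
    description: str
    occurred_at: datetime   # date and time of the incident
    reported_at: datetime   # date and time of the incident report

    def reporting_delay_minutes(self) -> float:
        """Minutes between the incident occurring and being reported."""
        return (self.reported_at - self.occurred_at).total_seconds() / 60


# Sample values are invented for the example.
log = IncidentLog(
    user_name="A. User",
    user_contact="a.user@example.com",
    description="Cannot log in to the payroll application",
    occurred_at=datetime(2024, 1, 15, 9, 0),
    reported_at=datetime(2024, 1, 15, 9, 25),
)
print(log.reporting_delay_minutes())  # 25.0
```

Keeping the two timestamps separate makes the gap between occurrence and report visible.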
incident categorisation
- purpose?
- assigning a category + at least 1 subcategory
- purpose: allows incidents to be sorted and modelled, enables automatic prioritisation, and supports accurate incident tracking so that patterns can emerge
incident prioritisation
determined by:
1. impact on users and the business: measure extent of potential damage
2. urgency: how quickly a resolution is required to reduce business impact
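One common way to combine the two factors above is an impact/urgency matrix. The sketch below assumes three levels for each factor and a five-step priority scale. The level names, the `priority` function, and the formula are illustrative assumptions, not something the cards prescribe.

```python
# Assumed three-level scale; 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Return a priority from 1 (highest) to 5 (lowest).

    Uses the simple rule impact + urgency - 1, which reproduces a
    typical 3x3 priority matrix.
    """
    return LEVELS[impact] + LEVELS[urgency] - 1


print(priority("high", "high"))  # 1
print(priority("high", "low"))   # 3
print(priority("low", "low"))    # 5
```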
incident tracking status (6)
- New
- Assigned
- In progress
- On hold
- Resolved
- Closed
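The six statuses above can be treated as a small state machine. The cards list only the statuses, so the allowed transitions below (for example, reopening a resolved ticket) are assumptions made for illustration, as is the helper name `can_move`.

```python
# Assumed transitions between the six statuses; adjust to local process.
ALLOWED = {
    "New": {"Assigned"},
    "Assigned": {"In progress"},
    "In progress": {"On hold", "Resolved"},
    "On hold": {"In progress"},
    "Resolved": {"Closed", "In progress"},  # reopen if the fix did not hold
    "Closed": set(),
}


def can_move(current: str, target: str) -> bool:
    """Return True if a ticket may move from current to target status."""
    return target in ALLOWED[current]


print(can_move("New", "Assigned"))  # True
print(can_move("New", "Closed"))    # False
```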
post incident review
- check users’ perception
- check business process and infrastructure metrics
- decide if an underlying problem exists and raise a ticket if necessary (problem management)
incident communication
- find out what happened
- escalation
- updates
- reporting incident impact and resolution
- confirming the resolution with the users
user satisfaction surveys
- why?
- success?
- good method of monitoring user perception and expectations
- key points for success: scope, define, conduct, understand, publish, translate, follow through
incident report
- basic summary: ticket number, description, impact, resolution time
- causes found: technical analysis
- actions taken: short-term workarounds, improvements to avoid similar occurrences
- post-incident follow up: measurements taken after the fix, eliminate root cause/problem tickets raised, user surveys
summary report
- purpose?
- includes?
- ensure incident management effective
includes:
- number of incidents
- average resolution time
- type of incident reported
- % of incidents handled within the agreed response time
- % closed by service desk without escalation
- summarise in non-technical language and show where improvements could be made
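The figures listed above can be computed from a list of incident records. This is a rough sketch: the dictionary keys (`type`, `resolution_hours`, `within_agreed_time`, `escalated`), the function name `summarise`, and the sample data are all hypothetical.

```python
from collections import Counter

# Invented sample data for the example.
incidents = [
    {"type": "network", "resolution_hours": 2.0, "within_agreed_time": True, "escalated": False},
    {"type": "software", "resolution_hours": 6.0, "within_agreed_time": False, "escalated": True},
    {"type": "network", "resolution_hours": 1.0, "within_agreed_time": True, "escalated": False},
    {"type": "hardware", "resolution_hours": 3.0, "within_agreed_time": True, "escalated": True},
]


def summarise(records):
    """Compute the summary report figures for a non-empty list of incidents."""
    total = len(records)
    if total == 0:
        raise ValueError("no incidents to summarise")
    return {
        "count": total,
        "avg_resolution_hours": sum(r["resolution_hours"] for r in records) / total,
        "by_type": dict(Counter(r["type"] for r in records)),
        "pct_within_agreed_time": 100 * sum(r["within_agreed_time"] for r in records) / total,
        "pct_closed_without_escalation": 100 * sum(not r["escalated"] for r in records) / total,
    }


report = summarise(incidents)
print(report["avg_resolution_hours"])           # 3.0
print(report["pct_within_agreed_time"])         # 75.0
print(report["pct_closed_without_escalation"])  # 50.0
```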
security concerns
- incident may occur due to security event (unauthorised access, virus, cyber attack)
- elevated system access may need to be granted to resolve incident
- data may be lost/leaked
support team’s role in IM
- Receive and communicate all incidents
- Filter out service/change requests
- Resolve or escalate incidents as appropriate
- Confirm and close tickets
- Analyse incident logs
- Report on incident trends and suggest improvements
types of root causes
- (special cause) random root cause:
- hard to track down and fix
- log but no action unless occurs again
- (random cause) root cause will produce more incidents if not fixed:
- problems
- find and fix
risks
potential incidents that have not manifested yet
risk management
- purpose?
- how?
- any potential incident is a risk and should be considered as early as possible
- ensure reliable enterprise solutions
- avoid/mitigate/transfer/accept
risk classification
- severity (business impact)
- likelihood (probability of the event happening)
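Severity and likelihood are often multiplied to give a single risk score. The sketch below assumes a 1 to 5 scale for each and uses made-up band thresholds. The function names `risk_score` and `risk_band` are illustrative only.

```python
def risk_score(severity: int, likelihood: int) -> int:
    """Multiply severity by likelihood, each on an assumed 1-5 scale."""
    for name, value in (("severity", severity), ("likelihood", likelihood)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return severity * likelihood


def risk_band(score: int) -> str:
    """Map a score to a band; the thresholds are assumptions."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


print(risk_band(risk_score(5, 4)))  # high
print(risk_band(risk_score(3, 3)))  # medium
print(risk_band(risk_score(2, 2)))  # low
```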
RTO
recovery time objectives
- maximum agreed acceptable period of time following a service disruption that can elapse before business functions are severely impacted
- how long to recover?
RPO
recovery point objectives
- the point to which information used by a business activity must be restored to enable the activity to operate on resumption of the service
- how far back is the last point where data is in a usable format?
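A small worked example of checking one outage against both objectives. RTO is compared with the downtime, and RPO with the gap between the last usable backup and the start of the outage. The function name `meets_objectives`, its parameters, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, outage_start, service_restored, rto, rpo):
    """Check one outage against the agreed RTO and RPO."""
    downtime = service_restored - outage_start      # how long to recover
    data_loss_window = outage_start - last_backup   # how far back the data goes
    return {
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Invented timestamps: 3 hours of downtime, last backup 1.5 hours before the outage.
result = meets_objectives(
    last_backup=datetime(2024, 1, 15, 8, 0),
    outage_start=datetime(2024, 1, 15, 9, 30),
    service_restored=datetime(2024, 1, 15, 12, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result)  # {'rto_met': True, 'rpo_met': False}
```

Here the service came back within the 4-hour RTO, but 1.5 hours of data predates the outage, which exceeds the 1-hour RPO.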