recover Flashcards by I S

what is an event

any change of state that has significance for the management of a service. Typically, they are notifications from monitoring tools

what is an incident

an unplanned interruption to a service or reduction in the quality of a service.

what is a problem
- what is a known error

a cause, or potential cause, of one or more incidents

known errors are problems that have been analysed but not resolved

what is incident management
- purpose?

to minimise the negative impacts of incidents by restoring normal service operation as quickly as possible
diagnose and escalate
reactive process
not a proactive measure

what is problem management
- purpose?

reduce likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors
reactive and proactive
e.g. the same incident occurring many times, or an incident affecting many users

incident management process (4)

identify
log
categorise
prioritise

incident identification

come from:
1. users: walk-ups, self-service, emails, etc
2. alerts: application monitoring software

decide if issue is an incident OR request

incident logging

include:
- user’s name and contact information
- incident description
- date and time of incident
- date and time of incident report
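
The fields listed above could be captured in a small record type. This is only a sketch: the class name `IncidentRecord`, its field names, and the sample values are hypothetical, not taken from any particular ticketing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    """Minimum details to capture when logging an incident (hypothetical schema)."""
    user_name: str
    user_contact: str
    description: str
    occurred_at: datetime   # date and time of the incident
    reported_at: datetime   # date and time of the incident report

    def reporting_delay(self) -> timedelta:
        # Time that passed between the incident and the report.
        return self.reported_at - self.occurred_at


record = IncidentRecord(
    user_name="A. User",
    user_contact="a.user@example.com",
    description="Cannot log in to the payroll application",
    occurred_at=datetime(2024, 1, 8, 9, 0),
    reported_at=datetime(2024, 1, 8, 9, 30),
)
print(record.reporting_delay())
```

Keeping the incident time and the report time as separate fields makes it possible to see how long incidents go unreported.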

incident categorisation
- purpose?

assigning a category + at least 1 subcategory
purpose: allows incidents to be sorted and modelled, enables automatic prioritisation, and supports accurate incident tracking so that patterns can emerge

incident prioritisation

determined by:
1. impact on users and the business: measure extent of potential damage
2. urgency: how quickly a resolution is required to reduce business impact
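
One common way to combine the two factors is a priority matrix. The sketch below assumes a three-level scale and a P1 to P5 labelling; both are illustrative choices, not something the cards define.

```python
# Hypothetical three-level scale for both impact and urgency.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def priority(impact: str, urgency: str) -> str:
    """Map impact and urgency to a priority label, P1 (highest) to P5 (lowest)."""
    score = LEVELS[impact] + LEVELS[urgency]  # ranges from 2 to 6
    return f"P{7 - score}"                    # score 6 gives P1, score 2 gives P5


print(priority("high", "high"))
print(priority("high", "low"))
print(priority("low", "low"))
```

An unknown level raises `KeyError`, which is a reasonable failure mode for a sketch; a real tool would validate the input first.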

incident tracking status (6)

New
Assigned
In progress
On hold
Resolved
Closed
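
The six statuses can be modelled as a small state machine. The statuses come from the card; the allowed transitions in `ALLOWED_TRANSITIONS` are an assumption made for illustration, since the card does not say which moves are permitted.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In progress"
    ON_HOLD = "On hold"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed workflow: which status may follow which.
ALLOWED_TRANSITIONS = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.ON_HOLD, Status.RESOLVED},
    Status.ON_HOLD: {Status.IN_PROGRESS},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},  # reopen if the fix fails
    Status.CLOSED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if a ticket may move from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


print(can_transition(Status.NEW, Status.ASSIGNED))
print(can_transition(Status.NEW, Status.CLOSED))
```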

post incident review

check users’ perception
check business process and infrastructure metrics
decide if an underlying problem exists and raise a ticket if necessary (problem management)

incident communication

finding out what happened
escalation
updates
reporting incident impact and resolution
confirming the resolution with the users

user satisfaction surveys
- why?
- success?

good method of monitoring user perception and expectations
key points for success: scope, define, conduct, understand, publish, translate, follow through

incident report

basic summary: ticket number, description, impact, resolution time
causes found: technical analysis
actions taken: short-term workarounds, improvements to avoid similar occurrences
post-incident follow up: measurements taken after the fix, eliminate root cause/problem tickets raised, user surveys

summary report
- purpose?
- includes?

ensure incident management is effective

includes:
- number of incidents
- average resolution time
- type of incident reported
- % of incidents handled within the agreed response time
- % closed by service desk without escalation
- summarise in non-technical language and show where improvements could be made
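
The figures in such a report can be computed from the ticket data. The sketch below assumes each incident is a dict with hypothetical keys (`type`, `resolution_hours`, `response_minutes`, `escalated`); the sample data is invented for illustration.

```python
from collections import Counter
from statistics import mean


def summarise(incidents, response_target_minutes):
    """Compute the headline figures for a summary report (needs at least one incident)."""
    total = len(incidents)
    within_target = sum(
        1 for i in incidents if i["response_minutes"] <= response_target_minutes
    )
    not_escalated = sum(1 for i in incidents if not i["escalated"])
    return {
        "count": total,
        "avg_resolution_hours": mean(i["resolution_hours"] for i in incidents),
        "by_type": Counter(i["type"] for i in incidents),
        "pct_within_response_target": 100 * within_target / total,
        "pct_closed_without_escalation": 100 * not_escalated / total,
    }


sample = [
    {"type": "network", "resolution_hours": 2, "response_minutes": 10, "escalated": False},
    {"type": "network", "resolution_hours": 4, "response_minutes": 20, "escalated": False},
    {"type": "hardware", "resolution_hours": 6, "response_minutes": 45, "escalated": True},
    {"type": "software", "resolution_hours": 8, "response_minutes": 15, "escalated": True},
]
report = summarise(sample, response_target_minutes=30)
print(report)
```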

security concerns

incident may occur due to security event (unauthorised access, virus, cyber attack)
elevated system access may need to be granted to resolve incident
data may be lost/leaked

support team’s role in IM

Receive and communicate all incidents
Filter out service/change requests
Resolve or escalate incidents as appropriate
Confirm and close tickets
Analyse incident logs
Report on incident trends and suggest improvements

types of root causes

(special cause) random root cause:
- hard to track down and fix
- log but no action unless occurs again
(random cause) root cause will produce more incidents if not fixed:
- problems
- find and fix

risks

potential incidents that have not manifested yet

risk management
- purpose?
- how?

any potential incident is a risk and should be considered as early as possible
ensure reliable enterprise solutions
avoid/mitigate/transfer/accept

risk classification

severity (business impact)
likelihood (probability of the event happening)
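
Severity and likelihood are often combined into a single score so that risks can be ranked. The 1 to 5 scales and the band thresholds below are assumptions chosen for illustration, not values from the cards.

```python
def risk_score(severity: int, likelihood: int) -> int:
    """Multiply severity by likelihood, both on an assumed 1-5 scale."""
    if not (1 <= severity <= 5 and 1 <= likelihood <= 5):
        raise ValueError("severity and likelihood must be between 1 and 5")
    return severity * likelihood


def risk_band(score: int) -> str:
    """Bucket a score into a band; the thresholds are hypothetical."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


print(risk_band(risk_score(severity=5, likelihood=4)))
print(risk_band(risk_score(severity=2, likelihood=2)))
```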

RTO

recovery time objective
- maximum agreed acceptable period of time following a service disruption that can elapse before business functions are severely impacted
- how long to recover?

RPO

recovery point objective
- the point to which information used by a business activity must be restored to enable the activity to operate on resumption of the service
- how far back is the last point at which data is in a usable format?
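
A rough way to see how RTO and RPO differ is to check both against one outage. In this sketch the RPO is compared with the time since the last usable backup, and the RTO with the downtime; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """Data written after the last usable backup is lost; that window must fit the RPO."""
    return outage_start - last_backup <= rpo


def meets_rto(outage_start: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """The time taken to restore service must fit the RTO."""
    return restored_at - outage_start <= rto


last_backup = datetime(2024, 1, 1, 2, 0)
outage_start = datetime(2024, 1, 1, 5, 30)
restored_at = datetime(2024, 1, 1, 7, 0)

# 3.5 hours of data at risk against a 4-hour RPO.
print(meets_rpo(last_backup, outage_start, rpo=timedelta(hours=4)))
# 1.5 hours of downtime against a 1-hour RTO.
print(meets_rto(outage_start, restored_at, rto=timedelta(hours=1)))
```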

phases of problem management

1. problem identification
2. problem control
3. error control

problem identification

- detect duplicate and recurring issues
- during a major incident, identify the risk that the incident could recur
- analyse information received that may indicate problems, e.g. security risks, vendor reports, input from quality assurance teams
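
Detecting recurring issues can start as a simple count of incidents per category. The sketch below assumes incidents are dicts with hypothetical `category` and `subcategory` keys, and the threshold of three occurrences is an arbitrary choice.

```python
from collections import Counter


def recurring_incidents(incidents, threshold=3):
    """Return (category, subcategory) pairs seen at least `threshold` times."""
    counts = Counter((i["category"], i["subcategory"]) for i in incidents)
    return {key: n for key, n in counts.items() if n >= threshold}


log = [
    {"category": "network", "subcategory": "vpn"},
    {"category": "hardware", "subcategory": "laptop"},
    {"category": "network", "subcategory": "vpn"},
    {"category": "network", "subcategory": "vpn"},
]
candidates = recurring_incidents(log)
print(candidates)
```

Each pair returned is a candidate for a problem ticket, not proof of a shared root cause.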

problem control

- problem analysis (RCA) / troubleshooting
- documenting workarounds
- documenting known errors

troubleshooting process

1. define the problem statement
2. gather information, data, etc
3. determine the root cause (root cause analysis)
4. recommend solutions for eliminating or mitigating the problem

RCA

root cause analysis
- systematic process for identifying 'root causes' of problems/incidents and an approach for responding to them
- prevent problems
- pinpoint contributing factors to a problem
- creates RCI & RCR

RCA - time analysis

- understand what happened and ensure all information is available
- get data, sort by date and time, list in time order, then look for patterns
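
The sorting step in a time analysis can be sketched as merging events from several sources into one timeline. The event tuples below (timestamp, source, message) are invented sample data, and `build_timeline` is a hypothetical helper name.

```python
from datetime import datetime


def build_timeline(events):
    """Sort (timestamp, source, message) tuples into time order."""
    return sorted(events, key=lambda event: event[0])


events = [
    (datetime(2024, 3, 5, 10, 15), "service desk", "Users report slow logins"),
    (datetime(2024, 3, 5, 9, 50), "change log", "Database patch applied"),
    (datetime(2024, 3, 5, 10, 5), "monitoring", "CPU alert on database server"),
]
timeline = build_timeline(events)
for timestamp, source, message in timeline:
    print(timestamp.isoformat(), source, message)
```

Once the events sit in time order, a pattern such as a change shortly before the first alert is easier to spot.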

RCA - fishbone diagram

- helps to understand and visualise relationships between causes
- helps with troubleshooting documentation
- progressively break down potential causes of a problem
1. causes are grouped into categories
2. create possible causes under each category
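
A fishbone diagram is essentially a mapping from categories to possible causes, so it can be kept as plain data and rendered as an outline. The problem, categories, and causes below are invented examples, and `render_fishbone` is a hypothetical helper.

```python
def render_fishbone(problem, causes_by_category):
    """Render a problem and its categorised possible causes as an indented outline."""
    lines = [f"Problem: {problem}"]
    for category, causes in causes_by_category.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


diagram = render_fishbone(
    "Checkout page times out",
    {
        "People": ["On-call engineer unfamiliar with the runbook"],
        "Technology": ["Database connection pool exhausted", "Expired TLS certificate"],
    },
)
print(diagram)
```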

problem response: troubleshoot recommendation

1. design solution based on analysis
2. decide & plan implementation
3. follow change process

what is a workaround

- solution that reduces or eliminates the impact of an incident/problem for which a full resolution is not yet available

error control

- manage known errors
- identify potential permanent solution
- regularly reassess the status of known errors not yet resolved

disaster recovery

- aims to protect an org from effects of significantly negative events
- allows org to maintain or quickly resume mission-critical functions following a disaster

ESM role in disaster recovery

- Escalating if a situation looks like a potential disaster
- Help test DR plans
- Check critical business processes
- Triage incidents
- Check if back to normal

recover Flashcards

(36 cards)