recover Flashcards

(36 cards)

1
Q

what is an event

A

any change of state that has significance for the management of a service. Events are typically notifications from monitoring tools

2
Q

what is an incident

A

an unplanned interruption to a service or reduction in the quality of a service.

3
Q

what is a problem
- what is a known error

A

a cause, or potential cause, of one or more incidents

  • known errors are problems that have been analysed but not resolved
4
Q

what is incident management
- purpose?

A
  • to minimise the negative impacts of incidents by restoring normal service operation as quickly as possible
  • diagnose and escalate
  • reactive process
  • not a proactive measure
5
Q

what is problem management
- purpose?

A
  • reduce likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors
  • reactive and proactive
  • e.g. the same incident occurring many times or affecting many users
6
Q

incident management process (4)

A
  1. identify
  2. log
  3. categorise
  4. prioritise
7
Q

incident identification

A

come from:
1. users: walk-ups, self-service, emails, etc
2. alerts: application monitoring software

decide if issue is an incident OR request

8
Q

incident logging

A

include:
- user’s name and contact information
- incident description
- date and time of incident
- date and time of incident report
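
The fields above could be held in a small record. A minimal Python sketch; the class name, field names and sample values are hypothetical, not taken from any particular ticketing tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    # Field names are illustrative; they mirror the list on this card.
    user_name: str
    user_contact: str
    description: str
    occurred_at: datetime   # date and time of incident
    reported_at: datetime   # date and time of incident report

    def reporting_delay(self) -> timedelta:
        """Time between the incident happening and it being reported."""
        return self.reported_at - self.occurred_at


record = IncidentRecord(
    user_name="A. User",
    user_contact="a.user@example.com",
    description="cannot log in to email",
    occurred_at=datetime(2024, 3, 5, 9, 30),
    reported_at=datetime(2024, 3, 5, 9, 50),
)
print(record.reporting_delay())  # 0:20:00
```

Keeping both timestamps makes it possible to see how long incidents go unreported.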

9
Q

incident categorisation
- purpose?

A
  • assigning a category + at least 1 subcategory
  • purpose: allows incidents to be sorted and modelled, enables automatic prioritisation and accurate incident tracking, and lets patterns emerge
10
Q

incident prioritisation

A

determined by:
1. impact on users and the business: measure extent of potential damage
2. urgency: how quickly a resolution is required to reduce business impact
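
Impact and urgency are often combined into a single priority number. A minimal Python sketch, assuming a three-level scale and a simple additive matrix; both are illustrative choices, and real service desks define their own:

```python
# Hypothetical three-level scales: 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Return a priority from 1 (most critical) to 5 (least critical)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


print(priority("high", "high"))  # 1
print(priority("high", "low"))   # 3
print(priority("low", "low"))    # 5
```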

11
Q

incident tracking status (6)

A
  1. New
  2. Assigned
  3. In progress
  4. On hold
  5. Resolved
  6. Closed
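
The six statuses can be read as a small state machine. A Python sketch; the allowed transitions are an assumption made for illustration, since the card only lists the statuses:

```python
# Allowed moves between statuses (assumed, not from the card).
TRANSITIONS = {
    "New": {"Assigned"},
    "Assigned": {"In progress"},
    "In progress": {"On hold", "Resolved"},
    "On hold": {"In progress"},
    "Resolved": {"Closed", "In progress"},  # reopened if the fix is rejected
    "Closed": set(),
}


def move(current: str, target: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


status = "New"
for step in ("Assigned", "In progress", "On hold", "In progress", "Resolved", "Closed"):
    status = move(status, step)
print(status)  # Closed
```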
12
Q

post incident review

A
  • check users’ perception
  • check business process and infrastructure metrics
  • decide if an underlying problem exists and raise a ticket if necessary (problem management)
13
Q

incident communication

A
  • find out what happened
  • escalation
  • updates
  • reporting incident impact and resolution
  • confirming the resolution with the users
14
Q

user satisfaction surveys
- why?
- success?

A
  • good method of monitoring user perception and expectations
  • key points for success: scope, define, conduct, understand, publish, translate, follow through
15
Q

incident report

A
  • basic summary: ticket number, description, impact, resolution time
  • causes found: technical analysis
  • actions taken: short-term workarounds, improvements to avoid similar occurrences
  • post-incident follow-up: measurements taken after the fix, root cause elimination or problem tickets raised, user surveys
16
Q

summary report
- purpose?
- includes?

A
  • ensure incident management is effective

includes:
- number of incidents
- average resolution time
- type of incident reported
- % of incidents handled within the agreed response time
- % closed by service desk without escalation
- summarise in non-technical language and show where improvements could be made
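
The figures listed above could be computed from closed tickets roughly as follows. A Python sketch over made-up sample data; the field names are hypothetical:

```python
from collections import Counter
from statistics import mean

# Made-up sample tickets, for illustration only.
incidents = [
    {"type": "network", "resolution_hours": 2.0, "within_agreed_time": True, "escalated": False},
    {"type": "email", "resolution_hours": 6.0, "within_agreed_time": False, "escalated": True},
    {"type": "network", "resolution_hours": 1.0, "within_agreed_time": True, "escalated": False},
    {"type": "hardware", "resolution_hours": 3.0, "within_agreed_time": True, "escalated": True},
]


def summarise(tickets):
    """Build the summary-report figures from a list of ticket dicts."""
    total = len(tickets)
    return {
        "number_of_incidents": total,
        "average_resolution_hours": mean(t["resolution_hours"] for t in tickets),
        "incidents_by_type": dict(Counter(t["type"] for t in tickets)),
        "pct_within_agreed_time": 100 * sum(t["within_agreed_time"] for t in tickets) / total,
        "pct_closed_without_escalation": 100 * sum(not t["escalated"] for t in tickets) / total,
    }


report = summarise(incidents)
for name, value in report.items():
    print(f"{name}: {value}")
```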

17
Q

security concerns

A
  • incident may occur due to security event (unauthorised access, virus, cyber attack)
  • elevated system access may need to be granted to resolve incident
  • data may be lost/leaked
18
Q

support team’s role in IM

A
  • Receive and communicate all incidents
  • Filter out service/change requests
  • Resolve or escalate incidents as appropriate
  • Confirm and close tickets
  • Analyse incident logs
  • Report on incident trends and suggest improvements
19
Q

types of root causes

A
  1. (special cause) random root cause:
    - hard to track down and fix
    - log but no action unless occurs again
  2. (random cause) root cause will produce more incidents if not fixed:
    - problems
    - find and fix
20
Q

risks

A

potential incidents that have not manifested yet

21
Q

risk management
- purpose?
- how?

A
  • any potential incident is a risk and should be considered as early as possible
  • ensure reliable enterprise solutions
  • avoid/mitigate/transfer/accept
22
Q

risk classification

A
  1. severity (business impact)
  2. likelihood (probability of the event to happen)
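
The two dimensions are commonly multiplied into a single risk score. A Python sketch, assuming 1 to 5 scales and low/medium/high thresholds; the scales and thresholds are illustrative, not from the card:

```python
def risk_score(severity: int, likelihood: int) -> int:
    """Combine severity (business impact) and likelihood into one score."""
    for value in (severity, likelihood):
        if not 1 <= value <= 5:
            raise ValueError("severity and likelihood must be between 1 and 5")
    return severity * likelihood


def risk_level(score: int) -> str:
    """Bucket a score into low / medium / high (thresholds are assumptions)."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


print(risk_level(risk_score(5, 4)))  # high   (score 20)
print(risk_level(risk_score(3, 2)))  # medium (score 6)
print(risk_level(risk_score(1, 3)))  # low    (score 3)
```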
23
Q

RTO

A

recovery time objective
- maximum agreed acceptable period of time following a service disruption that can elapse before business functions are severely impacted
- how long to recover?

24
Q

RPO

A

recovery point objective
- the point to which information used by a business activity must be restored to enable the activity to operate on resumption of the service
- how far back is the last point where data is in a usable format?
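
RTO and RPO can both be checked with simple time arithmetic: RTO against the length of the outage, RPO against the age of the last usable backup. A Python sketch with made-up timestamps and targets; the function names are hypothetical:

```python
from datetime import datetime, timedelta


def meets_rto(outage_start: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True if the service came back within the recovery time objective."""
    return service_restored - outage_start <= rto


def meets_rpo(last_good_backup: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """True if the data lost since the last usable backup fits the recovery point objective."""
    return outage_start - last_good_backup <= rpo


outage = datetime(2024, 1, 10, 9, 0)
restored = datetime(2024, 1, 10, 12, 30)
last_backup = datetime(2024, 1, 10, 3, 0)

print(meets_rto(outage, restored, timedelta(hours=4)))     # True  (3.5 h of downtime)
print(meets_rpo(last_backup, outage, timedelta(hours=4)))  # False (6 h of data at risk)
```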

25
Q

phases of problem management

A
  1. problem identification
  2. problem control
  3. error control
26
Q

problem identification

A
  • detect duplicate and recurring issues
  • during major incident, identify risk that an incident could recur
  • analyse information received that may cause problems like security risks, vendor reports, quality assurance teams
27
Q

problem control

A
  • problem analysis (RCA) / troubleshooting
  • documenting workarounds
  • documenting known errors
28
Q

troubleshooting process

A
  1. define problem statement
  2. gather information, data, etc
  3. determine the cause (root cause analysis)
  4. recommend solutions for eliminating or mitigating the problem
29
Q

RCA

A

root cause analysis
- systematic process for identifying 'root causes' of problems/incidents and an approach for responding to them
- prevent problems
- pinpoint contributing factors to a problem
- creates RCI & RCR
30
Q

RCA - time analysis

A
  • understand what happened and ensure all information is available
  • get data, sort by date and time, list in time order == look for patterns
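
Time analysis amounts to merging records from different sources and sorting them by timestamp. A Python sketch with made-up events; the sources and messages are invented for illustration:

```python
from datetime import datetime

# (timestamp, source, what happened): made-up entries from different sources.
events = [
    ("2024-03-05 10:15", "service desk", "users report slow logins"),
    ("2024-03-05 09:58", "monitoring", "database CPU alert"),
    ("2024-03-05 09:45", "change log", "patch deployed to database server"),
    ("2024-03-05 10:40", "service desk", "logins failing"),
]

# Sort by parsed date and time so the sequence of events becomes visible.
timeline = sorted(events, key=lambda e: datetime.strptime(e[0], "%Y-%m-%d %H:%M"))

for when, source, what in timeline:
    print(when, source, what, sep=" | ")
```

Listed in time order, the patch deployment appears before the first alert, which is the kind of pattern this technique is meant to surface.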
31
Q

RCA - fishbone diagram

A
  • helps to understand and visualise relationships between causes
  • helps with troubleshooting documentation
  • progressively break down potential causes of a problem
    1. causes are grouped into categories
    2. create possible causes under each category
32
Q

problem response: troubleshoot recommendation

A
  1. design solution based on analysis
  2. decide & plan implementation
  3. follow change process
33
Q

what is a workaround

A

a solution that reduces/eliminates the impact of an incident/problem for which a full resolution is not yet available
34
Q

error control

A
  • manage known errors
  • identify potential permanent solution
  • regularly reassess the status of known errors not yet resolved
35
Q

disaster recovery

A
  • aims to protect an org from effects of significantly negative events
  • allows org to maintain or quickly resume mission-critical functions following a disaster
36
Q

ESM role in disaster recovery

A
  • Escalating if a situation looks like a potential disaster
  • Help test DR plans
  • Check critical business processes
  • Triage incidents
  • Check if back to normal