Transcription

University of California
San Francisco
Information Technology and Service

ITS MAJOR INCIDENT PROCESS
VERSION 1.1, REV. September 5, 2012

UCSF Information Technology and Service
ITS Major Incident Process

Document Version Control

Document Name: ITS Major Incident Process

Version Number  Issue Date  Prepared By     Reason for Change
1.0             8/31/12     Terrie Coleman  Initial draft
1.1             9/5/12      Terrie Coleman  Updates based on meeting with John Chin and Rebecca Nguyen

Reviewers and Approvers

Name            Approval Date
Kevin Barney
John Chin
Rebecca Nguyen

This document contains confidential, proprietary information intended for internal use only and is not to be distributed outside the University of California, San Francisco (UCSF) without an appropriate non-disclosure agreement in force. Its contents may be changed at any time and create neither obligations on UCSF’s part nor rights in any third person.

UCSF – Internal Use Only

Table of Contents

1. INTRODUCTION
2. DEFINITIONS
3. PROCESS DEFINITION
3.1. RACI CHART
3.2. ACTIVITY DIAGRAMS
4. APPENDIX
4.1. MAJOR INCIDENT CHECKLIST

1. INTRODUCTION

The purpose of this document is to define the actions, communications and escalation steps that will be used to manage a major incident.

The major incident process has 4 key phases: Detection of the major incident, Escalation to Priority 2, Escalation to Priority 1 and Closure. The major incident process can be abandoned at any point once resolution of the incident has been reached.

2. DEFINITIONS

Term / Definition

Major Incident
    Any full or partial system outage.

Technician
    Resource tasked with identifying and resolving incident. Also responsible for providing regular updates to the Service Desk Staff.

Incident Response Team
    Technical team tasked with identifying and resolving incident.

Service Desk Agent
    Point of coordination for all incoming incident information and outgoing communications.

Service Desk Manager
    Primary point of contact within the Service Desk accountable for escalations and end user notification.

Incident Commander
    Individual who is responsible for driving the major incident to closure. This role is typically held by the manager or designee of the affected system or infrastructure component or by the security manager in the event of a major incident involving a breach.

ITS Administrator On Call (AOC)
    The ITS director on-call responsible for providing enterprise perspective into the issue and making sure key leadership staff are notified of the issue, if necessary.
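The four phases named in the introduction can be sketched as a small state machine. This is only an illustration: the names `Phase` and `next_phase` are hypothetical, and because the text does not spell out how "abandoned" relates to Closure, the sketch assumes a separate terminal state for an incident that is resolved before the process reaches Closure.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "Detection"
    ESCALATION_P2 = "Escalation to Priority 2"
    ESCALATION_P1 = "Escalation to Priority 1"
    CLOSURE = "Closure"
    # Assumption: a terminal state for a process abandoned after resolution.
    ABANDONED = "Abandoned"


# Forward order of the four phases listed in the introduction.
_ORDER = [Phase.DETECTION, Phase.ESCALATION_P2, Phase.ESCALATION_P1, Phase.CLOSURE]


def next_phase(current: Phase, resolved: bool) -> Phase:
    """Return the phase that follows `current`.

    Terminal phases stay where they are. If the incident is resolved
    before Closure, the process is treated as abandoned (an assumption).
    """
    if current in (Phase.CLOSURE, Phase.ABANDONED):
        return current
    if resolved:
        return Phase.ABANDONED
    return _ORDER[_ORDER.index(current) + 1]
```

Keeping the phase order in one list means a change to the process only needs to be made in one place.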

3. PROCESS DEFINITION

3.1. RACI CHART

Major Incident RACI Chart

[Table: RACI assignment per role for each activity listed below. Legible role column headers: Technician, Analyst. Remaining column headers and cell values are not recoverable.]

Detection of Major Incident

1   Identify Potential Major Incident (Pattern of issues reported to Service Desk)
    Output: Open Ticket
2   [illegible] ticket
    Output: Assign Ticket
1a  Identify Potential Major Incident (Monitoring tools)
    Output: Open & Assign Ticket
2a  Notify IT Service Desk Potential MI
3   [illegible] necessary
4   Begin Investigation
5   Provide SD with updates on Issue
6   [illegible] Commander
7   Contact Service Desk Manager
8   Confirm system-wide issue
9   Invoke escalation to Major Incident P2

Escalation to P2 High

10  Update Incident Ticket: Priority High (P2) Symptom to Major Outage
    Output: Automated Notification to ITS Managers and Directors
11  Notify Incident Commander
12  Open Technical Bridge
13  [illegible] Record
14  [illegible] customers
15  [illegible] required
16  Escalate to Priority 1

Escalation to P1 Critical

17  Open P1 ITS Bridge
18  Update Incident Ticket: Priority Critical (P1) Symptom to Major Outage
    Output: Automated Notification to ITS Managers and Directors
19  Notify ITS AOC and relevant IT Teams as requested by Incident Commander
20  Decision: Formal Notification to CIO?
21  Decision: Cut over to Disaster Recovery, if available
    Output: DR or Continue Remediation
22  Confirm Stabilization or Resolutions, if resolved
    Output: Issue Resolved

Closure

23  Update and Resolve Incident Ticket
    Output: Automated Notification to ITS Managers and Directors
24  Remove Front-end ACD message
25  [illegible] required
26  [illegible] necessary
27  Schedule Closure Call

Legend

Responsible – People who do the work, facilitate it and/or organize it
Accountable – The one who ensures that desired outcomes are reached and has yes/no decision making authority
Consulted – People who have critical expertise to contribute before a decision is made
Informed – People who are significantly affected by the activity/decision and must be informed to ensure successful implementation
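The RACI definitions above imply simple consistency rules that a chart can be checked against. The sketch below assumes each activity should have exactly one Accountable role (the legend calls it "the one") and at least one Responsible role. The function name `check_raci` and the sample assignments are hypothetical and are not taken from the chart.

```python
def check_raci(chart: dict[str, dict[str, str]]) -> list[str]:
    """Return a list of problems found in a RACI chart.

    `chart` maps an activity name to a {role: code} mapping, where a code
    is one of "R", "A", "C", "I" or a combination such as "A/R".
    """
    problems = []
    for activity, roles in chart.items():
        accountable = [r for r, code in roles.items() if "A" in code.split("/")]
        responsible = [r for r, code in roles.items() if "R" in code.split("/")]
        if len(accountable) != 1:
            problems.append(
                f"{activity}: expected exactly one Accountable, found {len(accountable)}"
            )
        if not responsible:
            problems.append(f"{activity}: no Responsible role")
    return problems


# Hypothetical sample data, for illustration only.
sample = {
    "Open ticket": {"Service Desk Agent": "A/R", "Technician": "I"},
    "Notify Incident Commander": {"Technician": "R", "Incident Commander": "I"},
}
for problem in check_raci(sample):
    print(problem)
```

Here the second sample activity is reported because no role carries an "A".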

3.2. ACTIVITY DIAGRAMS

[Figure: ITS Major Incident Process - v1.0, Detect. Swimlanes: IT Service Desk Agent, IT Service Desk Manager, Technician, Incident Response Team. Legible steps: Start; 1 Identify Potential Major Incident; 1a Identify Potential Major Incident; 2a Notify Service Desk Agent; 3 Notify Service Desk Mgr; 4 Engage IRT; 6 Provide Updates to Service Desk; Confirm System Wide Outage; 9 Invoke P2; Go to MI P2.]

[Figure: ITS Major Incident Process - v1.0, MI P2 High Invoked through Escalated to P1. Swimlanes: IT Service Desk Manager, Technician, Incident Commander, AOC. Legible steps: 10 Update Incident ticket to P2; 12 Open Technical Bridge; 13 Begin Documenting Work Plan; 16 Escalate to P1; 17 Open P1 Bridge; 18 Update Incident ticket to P1; 19 Notify AOC as requested; 20 Notify CIO, if necessary; 21 Disaster Recovery? (Yes: Initiate Disaster Recovery Process); 22 Confirm Stabilization, if Resolved; Go to Close.]

4. APPENDIX

4.1. MAJOR INCIDENT CHECKLIST

ITS Major Incident Action Check List

Detection of Major Incident (MI)

1   Identify a Potential Major Incident
    Service Desk notes pattern of issues being reported that may warrant a Major Incident consideration.
    Action by: Service Desk Agent
    Notes:
    - Poll other SD Agents
    - Run ticket report
    - Check Change Control Calendar

2   Notify the On-Call Technician
    Action by: Service Desk Agent
    Notes:
    - Assign the ticket to the technician
    - Service Desk Agent and Technician agree on a Service Desk update plan

1a  Identify a Potential Major Incident
    IT monitoring tools signal an outage that may warrant a Major Incident consideration.
    Action by: Technician
    Notes:
    - Engage the Incident Response Team, if necessary
    - Begin Investigation of potential MI
    - Identify the potential Incident Commander

2a  Notify the Service Desk
    Action by: Technician
    Notes:
    - Technician calls IT Service Desk back line at 415-353-4444 and indicates incident is under “watch”.
    - Service Desk Agent and Technician agree on a Service Desk update plan

3   Notify the Service Desk Manager or designee
    Action by: Service Desk Agent
    Notes:
    - After hours refer to: http://oncall.ucsfmedicalcenter.org/ ITSD - Manager

4   Confirm that there is a system-wide issue
    Action by: Technician
    Notes:
    - Consult with the Service Desk Manager and Agent to confirm that there is a system-wide issue

4a  Declare P2
    Action by: Service Desk Agent
    Notes:
    - Decision: to Invoke escalation to Major Incident P2 using the following criteria:
      o More than a single unit or floor is affected
      o Received 5 more calls for the same issue within 30 minutes

5   Invoke escalation to Major Incident (MI) (P2)
    Action by: Service Desk Manager
    Notes:
    - At the direction of the Service Desk Manager: Service Desk Agent categorizes Symptom as Major Outage (automated notification to ITS Managers and Directors)
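The P2 decision criteria in the Detection checklist can be expressed as a small check. This is a minimal sketch under stated assumptions: either criterion alone is treated as sufficient, "5 more calls" is read as at least five calls for the same issue inside any 30-minute window, and the names `should_invoke_p2` and `calls_exceed_threshold` are hypothetical. The decision itself remains with the Service Desk staff named in the checklist.

```python
from datetime import datetime, timedelta

CALL_THRESHOLD = 5                # assumption: "5 more calls" read as >= 5 calls
WINDOW = timedelta(minutes=30)    # window taken from the checklist criterion


def calls_exceed_threshold(call_times, threshold=CALL_THRESHOLD, window=WINDOW) -> bool:
    """True if any `window`-long span contains at least `threshold` calls."""
    times = sorted(call_times)
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until the span fits inside the window.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


def should_invoke_p2(units_affected: int, call_times) -> bool:
    """Assumption: meeting either criterion is enough to suggest a P2 escalation."""
    return units_affected > 1 or calls_exceed_threshold(call_times)


# Usage: five calls for the same issue within 20 minutes on a single unit.
first_call = datetime(2012, 9, 5, 8, 0)
calls = [first_call + timedelta(minutes=5 * i) for i in range(5)]
print(should_invoke_p2(units_affected=1, call_times=calls))
```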

Escalation to Major Incident (P2) HIGH

6   Notify the Incident Commander there is a Major Incident P2
    Action by: Technician
    Notes:
    - AdCom On-Call Outlook Calendar
    - BA/Infrastructure Systems & DBA Pager Duty Calendar
    - Infrastructure Network Oncall

7   Open Technical Communication Bridge, if necessary
    Campus ITS MI Technical Communication Bridge: 353-8000, code: 602914
    Action by: Incident Commander
    Notes:
    - Multiple technicians involved. Bridge facilitates faster coordination of troubleshooting.

8   Incident Commander begins the Event Record and documents work plan for remediation
    Action by: Incident Commander
    Notes:
    - This Record is used to note actions taken and actions planned. It is also used to debrief ITS AOC in the event that this is required.
    - Update the incident worklog

9   Prepare front-end ACD message for inbound customer calls
    Action by: Service Desk Agent
    Notes:
    Decision Criteria/Consideration:
    - If warranted by call volume
    - Front end should not be used if calls are still needed for additional examples.

10  Initiate Service Desk Communication Process
    DECISION: Notify owners or end-users
    - Email Customer-facing Notifications
    and/or
    - Notify System Owners and/or Application Functional Owners
    Action by: Service Desk Manager
    Notes:
    Decision Criteria/Consideration:
    - Specific instruction located in the KB Support Information

11  Escalate to Priority 1 (P1)
    Action by: Incident Commander
    Notes:
    Evaluation Criteria/Consideration:
    - Is this a major incident that is affecting a large group of users or critical business processes?
    - Is there extreme impact to patient care and business operations?
    - Will the resolution of this event require additional technical resources?
    - Involves a Medical Center Tier 1 application? Note: Some Campus applications are considered Medical Center Tier 1, e.g. Exchange
    - Is greater awareness warranted?
    - Is ITS AOC awareness warranted?

Escalation to Priority 1 (P1) CRITICAL

12  Initiate campus ITS P1 Conference Bridge: 353-8000, code: 271433
    Action by: Incident Commander

13  Update Incident Ticket P1
    Action by: Service Desk Manager
    Notes:
    At the direction of the Service Desk Manager the Service Desk Agent:
    - Updates the Incident to P1 Major Outage
    - Automated Notification to ITS Managers and Directors

14  Notify ITS AOC and all relevant IT teams/personnel as requested by the Incident Commander
    If this Incident involves a Medical Center Tier 1 application then the Medical Center IT AOC must be notified and the IT911 Conference Bridge activated.
    Action by: Service Desk Agent
    Notes:
    Refer to: http://oncall.ucsfmedicalcenter.org/
    - Campus ITS AOC
    - AdCom On-Call Outlook Calendar
    - BA/Infrastructure Systems & DBA Pager Duty Calendar
    - Infrastructure Network Oncall

15  DECISION: Notify Campus CIO
    Briefed by phone every 30 minutes and/or provide option of joining ITS P1 conference bridge
    Action by: ITS Administrator On-Call Campus (AOC)
    Notes:
    Evaluation Criteria/Consideration:
    - Is greater awareness to hospital operations warranted?
    - Would executive-level leadership benefit from greater awareness?

16  DECISION: Cutover to Disaster Recovery Process (if available) or continue with remediation efforts until Incident Commander announces that issue has been stabilized
    Action by: Incident Commander
    Notes:
    Evaluation Criteria/Consideration:
    - Unresolved after 24 hours?
    - No recovery plan in sight?

17  Confirm Stabilization or Resolution with affected end-users
    Action by: Incident Commander

Closure

18  Update and resolve incident ticket
    Action by: Technician

19  Remove Front-end ACD Message
    Action by: Service Desk Agent

20  Close out Service Desk Communication Process, if necessary
    Action by: Service Desk Manager

21  Complete Event Record
    Action by: Incident Commander

22  Schedule Closure Call
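Two of the P1 items involve timing: the CIO briefing every 30 minutes and the "Unresolved after 24 hours?" consideration for Disaster Recovery. The sketch below shows one way to compute them. It assumes the 24 hours are counted from the start of the incident, which the checklist does not state, and the names `briefing_times` and `dr_criteria_met` are hypothetical. The output is only an aid; the cutover decision stays with the Incident Commander.

```python
from datetime import datetime, timedelta

BRIEFING_INTERVAL = timedelta(minutes=30)   # from the checklist: briefed every 30 minutes
DR_REVIEW_AFTER = timedelta(hours=24)       # from the checklist: unresolved after 24 hours


def briefing_times(p1_start: datetime, count: int) -> list[datetime]:
    """Return the next `count` CIO briefing times after P1 was declared."""
    return [p1_start + BRIEFING_INTERVAL * i for i in range(1, count + 1)]


def dr_criteria_met(incident_start: datetime, now: datetime,
                    has_recovery_plan: bool) -> list[str]:
    """List which Disaster Recovery evaluation criteria currently apply.

    Assumption: the 24 hours are measured from `incident_start`.
    """
    met = []
    if now - incident_start > DR_REVIEW_AFTER:
        met.append("Unresolved after 24 hours")
    if not has_recovery_plan:
        met.append("No recovery plan in sight")
    return met


# Usage: an incident that began 25 hours ago and has a recovery plan.
start = datetime(2012, 9, 5, 8, 0)
print(briefing_times(start, 3))
print(dr_criteria_met(start, start + timedelta(hours=25), has_recovery_plan=True))
```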
