Understand how Cloudflare deals with incidents impacting its production environment and the ways in which Cloudflare communicates the nature and impact of these incidents to Enterprise customers.
Cloudflare believes that openness and transparency are intrinsic to the delivery of our service, and is dedicated to establishing the trust of our customers and of the Internet community at large. Cloudflare operates a global network which impacts the lives and prosperity of hundreds of millions of people, and we are therefore extremely mindful of that responsibility.
This Standard Operating Procedure (SOP) defines how Cloudflare deals with all incidents and problems impacting its production environment, both planned and unplanned and regardless of severity, and the ways in which Cloudflare communicates the nature and impact of these incidents to Enterprise customers. This procedure specifies how these efforts are uniformly followed in order to:
- maximize environment uptime,
- minimize client impact,
- reduce the time to repair, and
- share information with our customers and the Internet community.
This SOP applies to Cloudflare customers and customer services as consumed by customers. The SOP is applicable to all customer production environments at Cloudflare including:
- Cloudflare’s public website (www.cloudflare.com)
- Cloudflare’s APIs (Application Programming Interfaces)
- Outbound third-party interfaces (e.g. credit card authorizations, etc.)
- Network infrastructure owned or managed by Cloudflare for production services
- Vendor software, hardware and services that affect any part of Cloudflare production
Cloudflare wants to build a better Internet. In order to deliver an improved experience to millions of Internet users, Cloudflare’s internal operations must follow excellent service delivery processes and procedures. Cloudflare’s procedures therefore follow many industry-standard best practices, some of which specifically follow patterns of the Information Technology Infrastructure Library (ITIL). This SOP follows the best practices of the ITIL Problem Management methodology.
Categories of key incident terms: All events are conditions which can instigate alerts; some alerts are incidents of note (and some are not); all incidents must be triaged (sometimes through automation, sometimes through human interaction); some incidents are problems; some subset of problems are "major" and instigate Status Page updates; some major incidents have a high priority (P1) which requires the creation of an Incident Report.
Any identifiable and discrete thing which can be logged by one of Cloudflare’s production applications or systems
An event of potential interest which is identified and communicated via one of Cloudflare’s monitoring systems
A report or alert which has a high probability of affecting Cloudflare’s production systems, or an alert condition which only exists for a short period of time because the affected service is restored to health before a Problem condition is identified
An identified and categorized incident which has a negative impact on the optimal health and/or performance of Cloudflare’s production systems or applications
A public report which describes the nature of a service Problem, Cloudflare’s overall response to the Problem, and efforts to reduce or eliminate future impact
Post Mortem Review
A review meeting initiated in response to a severe and/or critical Problem. All Post Mortem meetings focus on the details of an Incident Report generated by a Cloudflare engineer with skills or experience appropriately suited to address the nature of the Problem.
The Systems Reliability Engineers are the group responsible for the first-level support of all incidents
The Customer Support group is the team responsible for responding to all customer-generated requests, and for all customer communications during any identified Problem.
Cloudflare ticketing system used for the tracking of incidents, work orders and problems
Severity / Priority Level
Value of “P0, P1, P2 or P3” based on the severity of Problem impacting the Cloudflare network and customers
Service Level Agreement – internal or contractual obligation for a specific level of service (usually measured in actions per unit of time)
Service Level Objective – internal or contractual objective for a specific level of service (usually measured in actions per unit of time)
Cloudflare resource responsible for ensuring the Problem is being addressed properly, time is being kept, escalations are being made, clients are being updated, and that resources are being engaged as needed
The Internet Community
Cloudflare’s primary stakeholder group. Cloudflare secures and optimizes over 4,600,000 websites, and the average Internet user interacts with Cloudflare websites over 500 times per week.
Non-Cloudflare vendor or service provider who partners with Cloudflare in the delivery of systems or services to the client
Person, group or company that is affected by an incident either as the provider (e.g. Cloudflare person, third party) or consumer (client)
Root Cause Analysis – Thorough review of the underlying cause of a problem
All necessary steps to resolve the root cause of a problem and ensure that it will not happen again
The primary tool which Cloudflare uses to publicly share information about its service delivery and any Incidents or Problems impacting Cloudflare services: https://www.cloudflarestatus.com
The Status Page is hosted by a Third Party (Statuspage.io) which is not dependent on Cloudflare’s services for operation.
Roles and responsibilities
The following roles and responsibilities are associated with the management of incidents within Cloudflare:
Review and approve procedures. Ensure that all staff members are trained on procedures. Notify customers and third parties, as necessary, of their role in procedures. Initiate and oversee Post Mortem Reviews for critical Incident Reports.
One or more SREs who are assigned to on-call shifts to respond to all critical alerts. They identify and respond to an Incident, assess and classify the severity of the Incident, and potentially escalate an impacting Incident as a Problem. They act as the point of escalation and administration for the issue from start to finish.
On-call Network Engineers
One or more Network Engineers who are assigned to on-call shifts to respond to critical alerts. They coordinate with the SRE team, which provides the primary Incident Manager during any identified problem.
One or more CSUP engineers who are assigned for shifts to respond to all customer requests. Responsible for all customer communications during all identified Problems. Responsible for communicating all planned maintenance.
The overall Systems Reliability Engineering team who support the efforts of the on-call SREs. Assume the role of Incident Manager during an identified problem. Implement appropriate Cloudflare-supported production changes to resolve issues.
Cloudflare Engineering Teams (DBA, Network, nginx, Security, etc.)
Support the Incident Manager during problem resolution. Join bridge calls, if requested. Ensure documentation is captured while diagnosing and correcting issues and proper escalation to other responsible groups is executed. Participate in Post Mortem reviews of some Incident Reports, as requested by Cloudflare Management.
Standard Operating Procedure
This section details the procedures for incident and problem management. At a high-level, these processes relate as follows:
- Incident Management: The overall process for observing and responding to alerts, including: assessing the potential impact and severity of an Incident, classifying the Incident as a Problem, assigning a priority to the Problem, or dismissing the Incident as a non-impacting event if a problem condition cannot be identified.
- Problem Management: The process of identifying the scope and extent of a Problem, assigning an appropriate severity level (P0, P1, P2, P3), the actions to resolve the Problem and restore the optimal state for production services, and the communication of the Problem to appropriate parties.
- Resolution Management: The process of investigating the causes and conditions which led to a problem condition, reporting on the overall manner by which a problem was managed and resolved, and any subsequent analysis of how the conditions and causes of a problem may be prevented in the future.
The primary goal of Incident Management is to identify and react to potential problems as quickly as possible, and thereby minimize impact to production services and provide the best possible levels of service quality and availability. The best possible levels of service quality and availability would be that all services operated exactly as designed 100% of the time, and were available and accessible 100% of the time.
Because we accept that a combination of forces within our control and forces beyond our control will eventually impact service health, we define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to describe what degradations in service health are acceptable for various services within Cloudflare’s network. SLAs and SLOs are expressed as percentages of periods of time (monthly and annually).
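As a rough illustration of how a percentage-based objective translates into time, the sketch below converts an availability percentage into the amount of degradation it tolerates over a period. The function name and the 99.9% figure are assumptions chosen for the example; this SOP does not state specific SLO or SLA values.

```python
def downtime_budget_minutes(slo_percent: float, period_days: float) -> float:
    """Return the minutes of degradation a given SLO percentage permits in a period.

    slo_percent is a hypothetical availability objective (e.g. 99.9),
    period_days is the length of the measurement period in days.
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The budget is the fraction of the period that is allowed to fall short.
    return period_minutes * (100.0 - slo_percent) / 100.0


# Illustrative values only: a 99.9% objective over a 30-day month and a 365-day year.
monthly_budget = downtime_budget_minutes(99.9, 30)
annual_budget = downtime_budget_minutes(99.9, 365)
print(f"monthly: {monthly_budget:.1f} min, annual: {annual_budget:.1f} min")
```

Under these assumed inputs, the monthly budget is roughly 43 minutes and the annual budget is roughly 526 minutes, which shows why the same percentage can read very differently depending on the period it is measured over.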
The level of information given about an incident may vary, but the following information must be collected before an incident is classified and prioritized:
- Submitter Source (monitoring alert or alternate source)
- Customer(s) (if applicable)
- System or application (and hostname, if applicable)
- Time of alert
- Scope of impact: estimated number of systems, users, or regions impacted
- Type of impact: general scope of service impairment (e.g., loss of all access, degraded performance, dependent applications impacted, observed customer impact)
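The required intake fields listed above could be captured in a simple record before classification. The following is a minimal sketch; the class name, field names, and the `missing_fields` helper are hypothetical and do not describe Cloudflare's actual tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class IncidentIntake:
    """Hypothetical record of the information collected before classification."""
    submitter_source: str            # monitoring alert or alternate source
    system_or_application: str
    alert_time: datetime
    scope_of_impact: str             # estimated systems, users, or regions impacted
    type_of_impact: str              # e.g. loss of all access, degraded performance
    customers: List[str] = field(default_factory=list)  # only if applicable
    hostname: Optional[str] = None   # only if applicable

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that were left blank."""
        required = {
            "submitter_source": self.submitter_source,
            "system_or_application": self.system_or_application,
            "scope_of_impact": self.scope_of_impact,
            "type_of_impact": self.type_of_impact,
        }
        return [name for name, value in required.items() if not value.strip()]


intake = IncidentIntake(
    submitter_source="monitoring alert",
    system_or_application="API",
    alert_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    scope_of_impact="",              # not yet estimated
    type_of_impact="degraded performance",
)
print(intake.missing_fields())
```

An incident whose `missing_fields()` result is non-empty would, under this sketch, be held back from classification until the gaps are filled.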
All Incidents which are classified as Problems, regardless of source, which have a priority of P0 or P1, will be logged within the Cloudflare ticketing system, JIRA. Some alerts will indicate conditions which may not be immediately impacting to service levels, and as necessary, will be categorized as Problems with a P2 or P3 priority.
The JIRA system is the system of record for all incident information, and all other sources of documentation regarding a Problem (e.g. alert history, screen-shots, work logs, chat conversations) are attached to the original JIRA ticket created in response to an Incident.
After acknowledging an alert, SRE immediately triages the alert by correlating it to a category and priority level. When creating new JIRA tickets for high priority (P0 and P1) Problems, SRE will ensure that each ticket is classified correctly by including its Category and Priority.
All tickets will be categorized according to the following 4 levels of priority. The criteria listed are general guidelines. The conditions described below should explicitly define a priority level; however, at the discretion of SRE or Cloudflare management, problems may be assigned a higher level of priority as needed:
- Complete loss of access to the Cloudflare application or API.
- Degraded access to the Cloudflare application or API (≤ 98% as measured worldwide or from any major region).
- Complete loss of access to, or major performance degradation to, a Tier-1 Data Center.
- Degraded performance of any Tier-1 global transit provider (≥ 20% packet loss worldwide or 30% packet loss from any major region).
- Degraded access to or performance of any critical system.
- Intermittent or degraded site-wide performance.
- Loss of an important function such as reporting.
- Loss of access to the Cloudflare application from one of the social media or external Cloudflare websites (e.g. spaCloudflare.com, salonCloudflare.com, etc.).
- Outage to important outbound third-party interface.
- Inoperability of the site for one of the enterprise clients or distribution partners.
- Corruption or loss of customer data.
- Sporadic or localized performance issue.
- System issues with no noticeable client impact yet (e.g. high CPU).
- Single client outage/degradation.
- Operational issues, procedural problems or service requests that have little or no effect on end-users and can be handled on an as-available basis.
- The default severity assigned to all tickets that have not yet been reviewed or assigned a severity level.
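Two of the criteria above are quantified, and a check against them can be sketched as follows. This is an illustration under stated assumptions: the availability criterion is read as "at or below 98%" and the packet-loss criterion as "at or above 20% worldwide or 30% from a major region"; the function and constant names are hypothetical; and the sketch deliberately reports which criteria are triggered without mapping them to a priority level.

```python
from typing import List

# Thresholds taken from the criteria list; the constant names are illustrative.
AVAILABILITY_FLOOR_PERCENT = 98.0
WORLDWIDE_PACKET_LOSS_PERCENT = 20.0
REGIONAL_PACKET_LOSS_PERCENT = 30.0


def triggered_criteria(
    availability_percent: float,
    worldwide_packet_loss_percent: float,
    regional_packet_loss_percent: float,
) -> List[str]:
    """Return descriptions of the quantified criteria that the measurements meet."""
    triggered = []
    if availability_percent <= AVAILABILITY_FLOOR_PERCENT:
        triggered.append("degraded application or API access")
    if (worldwide_packet_loss_percent >= WORLDWIDE_PACKET_LOSS_PERCENT
            or regional_packet_loss_percent >= REGIONAL_PACKET_LOSS_PERCENT):
        triggered.append("degraded Tier-1 transit performance")
    return triggered


# Hypothetical measurements: healthy availability, heavy loss from one region.
print(triggered_criteria(99.9, 5.0, 30.0))
```

The remaining criteria are qualitative and, as the guidelines note, are left to the judgment of SRE or Cloudflare management.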
For proper tracking and communications, high priority (P0 and P1) problems will be assigned to categories. These categories (ticket labels) correspond to the publicly communicated categories which are listed on Cloudflare’s public Status Page.
Lower priority (P2 and P3) tickets may be categorized using labels and nomenclature which are specific to various Engineering and Non-engineering teams within Cloudflare. These various labels and categories are not listed on this document.
It is critical to understand that incidents that are classified under the category of Security require special handling and procedures. These incidents should be logged here and then follow the Security Incident procedures as defined by the Cloudflare Information Security team.
High Severity / Priority Incidents
P0 and P1 incidents have a greater impact on the business and therefore have special upfront requirements to ensure that they are handled as expeditiously as possible.
For all P0 and P1 issues, the on-duty Incident Manager should be contacted immediately. A schedule of incident managers will be posted to ensure that SRE knows who to contact at any given time. The incident manager is a critical resource responsible for the following:
- Validation of the severity of an issue
- Tracking of the issue from submission to resolution
- Representation of clients’ best interest
- Logging of all actions and times
- Direction of personnel toward the fastest possible resolution
- Ensuring that clients and internal management are notified of status according to pre-determined time periods (or upon change in status)
- Performing client, internal or third-party escalations when time limits are being exceeded or appropriate progress is not being made
- Ensuring that a meaningful explanation is applied to the ticket upon resolution
- Making certain that the initial submitter agrees that the issue is resolved before the ticket is closed
External communications during an incident are critical for:
- Notifying the stakeholders that Cloudflare is aware of the issue and is pursuing resolution
- Reassuring clients that the matter is under review and that Cloudflare is looking out for their best interests
- Ensuring that issues do not drag on unnecessarily and that appropriate escalations are being made
- Informing key internal stakeholders of important incidents
Major types of communications during an incident include:
A Status Page update will be created from templates by the on-call CSUP team member as soon as an incident is identified.
Cloudflare believes that no critical problem should ever recur. To that end, all P0 problems will instigate the publication of an Incident Report (IR), which includes a Root Cause Analysis (RCA) of the problem and the overall factors which led up to the Incident. All IR publications will be followed by a Post Mortem meeting, in which engineers and managers review and agree upon the details of the IR, the conclusions of the RCA, and any follow-up remediation steps which will be taken to ensure that the problem condition(s) do not recur.
Problem Management and Post-Mortem
Problem Management differs from Incident Management in that its main goal is the detection of the underlying causes of an Incident and their subsequent resolution and prevention.
Root Cause Analysis and Remediation
An RCA is a Root Cause Analysis report. A JIRA Problem ticket logs and tracks the events that may warrant an RCA. The RCA is a process by which the subject matter experts (SMEs) for an area review a P0 or P1 issue, searching for its underlying cause. Once this is determined, the SMEs need to create a remediation plan to address the cause(s). The ultimate deliverable is a well-documented ticket to track the remediation to completion and, if required, a well-written incident report to be sent to an internal team and/or client.
The above points are still applicable even if it is a third party provider or vendor supplying the RCA. When the RCA information is received from a third party, we must ensure that the Problem ticket is updated with all relevant information including outstanding remediations to be tracked.
The Incident Report (“IR”) is the primary method of communication to the client on an issue and may contain some or all parts of what is written within the ticket.
The person writing the report will vary depending on the severity of the issue and the responsible area. Upon completion of the draft report, it is critical to ensure that the report is reviewed by Cloudflare management for content, commitments and professional presentation. Once the report is approved it may be published to the client.
The above sections have detailed the handling of the incident and the root cause process for ensuring permanent remediation. The last part of the incident and problem management process is to track key metrics and trends and to report on them, in order to confirm that the process is being followed correctly, SLAs are being met and below-the-surface issues are not being missed.
The ticket criteria that need to be reported for both open and closed tickets include the following:
- Responsible Group
- Age/Days Open
Wherever possible, this data should be reported graphically to show visible trends. These reports should be published to internal Cloudflare managers and area owners.
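As one possible way to prepare this data for trend reporting, the sketch below computes ticket age in days and averages it per responsible group. The `Ticket` class, its fields, and the sample tickets are hypothetical; they are not drawn from Cloudflare's JIRA schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Ticket:
    """Hypothetical ticket record holding only the reported criteria."""
    key: str
    responsible_group: str
    opened: date
    closed: Optional[date] = None    # None while the ticket is still open

    def age_days(self, today: date) -> int:
        """Days open: up to the close date if closed, otherwise up to today."""
        end = self.closed or today
        return (end - self.opened).days


def average_age_by_group(tickets: List[Ticket], today: date) -> Dict[str, float]:
    """Average age in days of open and closed tickets per responsible group."""
    ages: Dict[str, List[int]] = defaultdict(list)
    for ticket in tickets:
        ages[ticket.responsible_group].append(ticket.age_days(today))
    return {group: sum(values) / len(values) for group, values in ages.items()}


# Made-up sample data for illustration.
sample = [
    Ticket("T-1", "SRE", date(2024, 1, 1), date(2024, 1, 4)),
    Ticket("T-2", "SRE", date(2024, 1, 5)),
    Ticket("T-3", "Network", date(2024, 1, 8)),
]
report = average_age_by_group(sample, today=date(2024, 1, 10))
print(report)
```

A per-group summary of this kind can then be fed to whatever charting tool the area owners use to show trends over time.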
Analysis and Accountability
Each area owner for tickets will be responsible for not only ensuring that their tickets are closed within prescribed or reasonable time frames, but also reviewing the reports and looking for trends, concerns and repeat issues. Based on this analysis, further Problem tickets should be opened to remediate any issues that may not have surfaced via a P0 or P1. This will allow continuous improvement and should ultimately reduce new ticket counts by further dealing with root causes.
Incident Management Review Meetings (Post-mortem)
As part of all departmental staff meetings, group managers should review the open-ticket and trend reports with the following objectives:
- Discussion of areas of success or concern
- Review of opportunities for improvement by the area owners
- Agreement on areas that warrant a new Problem ticket to be opened for remediation tracking