Production Readiness Reviews

Abstract

Software development organizations often have processes to ensure software services are ready for production, including operational readiness reviews (ORRs) and production readiness reviews (PRRs). These reviews help mitigate risks and ensure efficient service to users. They are useful for web applications, web services providing APIs, data pipelines, machine learning systems, and mobile applications relying on external services. They create written artifacts that are valuable for building a shared understanding of service architecture and functionality among authors, reviewers, and other relevant parties. They provide insight when determining the best trade-offs between risks, features, dates, and staffing. They enable gathering timely feedback from subject-matter experts on issues that could cause production problems, and they can save organizations from major outages by taking a structured, proactive approach that also promotes accountability for operational excellence. This article proposes topics these reviews should cover, along with related best practices.


Overview and Motivation

Software development organizations often have processes to ensure systems are ready to operate reliably in production. These may include a security review as well, or security may be handled through a separate process.

Terminology varies, but the processes share many concepts. At Amazon, we called these operational readiness reviews (ORRs); at Google SRE, we had a related process, called a production readiness review (PRR), for when SRE took over production support from a developer team.

Conducting such a review is pivotal for mitigating risks and confirming the system can serve its users effectively and efficiently.

These reviews are also applicable to data pipelines, machine learning systems, and mobile or desktop applications with dependencies on external, online services, though the appropriate specifics vary between types of system.

Some groups prepare these as documents, while others share them as presentations or as a tracked set of work items in something like Jira. If you use slides, be sure to link to more detailed references that others can dig into if desired, and share the slides with the audience.

Operational readiness reviews serve several purposes.

  1. Increasing the odds of a successful product by conducting a rigorous risk analysis to allow an open, objective review of the proper prioritization of resources, risk mitigations, feature scope, and available time before launch.
  2. Increasing the preparedness of the team that will support the product by familiarizing them with possible issues, mitigations, playbooks, incident management processes, and troubleshooting information in advance.
  3. Creating an accurate checkpoint of system documentation, including architecture, interdependencies, and monitoring.

Readiness reviews are also used to ensure uniformity of production-related standards such as logging formats, metric naming, and IaC coding standards. Uniformity of control surfaces makes reusing automation and monitoring much easier across teams.

These reviews also produce several other desirable outcomes beyond their immediate goals.

  1. They result in a better understanding of the architecture and functionality of the service among authors, reviewers, and other relevant parties. This shared context benefits everyone as issues arise in the future.
  2. A lot of time pressure is often placed on teams to deliver new features or products, sometimes for a hard deadline such as a major marketing event. When determining whether the best trade-offs are being made for risks, features, dates, and staffing, having a checklist evaluated by someone outside the team can introduce much-needed impartiality.
  3. These reviews are an organized method of obtaining assessments from subject matter experts on a variety of production challenges. As an illustration, not everyone developing awesome new software is a Kubernetes autoscaling expert.
  4. Checklists prevent mistakes by humans under stress. Checklists can save us from catastrophe by preventing us from missing obvious things, not just unique or interesting things. And depending on what kind of work you have been doing, you will have a different idea of what is obvious.

While these processes and the artifacts and learning produced are valuable, they are not free.

There is a significant amount of work required to prepare for these reviews. There are also follow-ups you commit to after the review itself has been conducted. The work required is reduced if a prior review has content that can be updated and reused.

These reviews should be viewed as an iterative process that is started during development. As you refine your understanding of where your gaps are, you should improve your understanding of risks and priorities, refine your project roadmap to balance operational readiness with development velocity, and tackle the highest-priority opportunities first. Prioritize the reliability-related work with the biggest return in risk reduction relative to its cost and time to implement.

This also provides valuable input into system design for observability, deployment safety, scalability, and performance, so changes can be made earlier, when they are less costly to implement.

Areas and Topics Covered

I would recommend that an ORR or PRR cover a number of specific areas of concern.

System Purpose, Use Cases, and Requirements

  • Business context of the product, feature, or launch.
  • System architecture overview, with one or more high-level diagrams. This makes it easier for those outside the team, or team members who did not work on every aspect, to understand the rest of the document.
  • Description of use cases, so it is clear what the user impact is when specific functionality or dependencies are having issues.
  • Discussion of any service level agreements (SLAs) promised to users and any service level objectives (SLOs) set for the system.
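
To make the SLO discussion concrete, here is a minimal sketch (in Python, with illustrative numbers chosen for this example) of how an availability SLO translates into an error budget over a rolling window:

    # Error budget math for an availability SLO (illustrative numbers).
    slo_target = 0.999            # 99.9% of requests should succeed
    window_days = 30
    requests_per_day = 2_000_000  # hypothetical traffic level

    total_requests = requests_per_day * window_days
    error_budget = (1 - slo_target) * total_requests  # failures we can "afford"

    failed_so_far = 21_500        # hypothetical count from monitoring
    budget_remaining = error_budget - failed_so_far

    print(f"Error budget for the window: {error_budget:,.0f} failed requests")
    print(f"Remaining: {budget_remaining:,.0f} ({budget_remaining / error_budget:.0%} of budget left)")

Tracking budget burn like this gives an objective signal for when to slow feature launches in favor of reliability work.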

System and Product Roadmap

Major changes to the system, its usage, or the product(s) it forms part of are an important context for teams developing and operating software systems. For example, if you know that there is a major marketing event or expected usage spike coming, that is important for capacity planning and operational readiness work. Also, if your organization is in the process of migrating to a new version of one of your system’s dependencies or changing how some infrastructure you depend on works, then that should be taken into consideration for risk management and planning other work.

System Monitoring and Observability

Monitoring a system for availability and performance is critically important to providing a reliable product experience.

This part of the review should include a breakdown of the dashboards, instrumentation, metrics, monitors, alarms, and alerts that are in place for the system or are planned to be put in place before launch. This should also cover any instrumentation, such as logs, emitted metrics, traces, or crash collection, for troubleshooting and providing observability into the system.

For a concise introduction to monitoring concepts and considerations, I recommend "Chapter 6 - Monitoring Distributed Systems" from the SRE Book (Beyer et al., 2016). I also recommend reading through the OpenTelemetry site's Observability Primer and exploring the AWS Observability Best Practices site.
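
As a concrete illustration, here is a minimal sketch of instrumenting a request handler with the OpenTelemetry Python API. The service, span, metric, and attribute names are hypothetical, and a real service would also configure exporters and a backend for the telemetry:

    from opentelemetry import metrics, trace

    tracer = trace.get_tracer("checkout-service")
    meter = metrics.get_meter("checkout-service")
    request_counter = meter.create_counter(
        "http.server.requests", description="Count of handled requests")

    def handle_checkout(order_id: str) -> None:
        # Wrap each request in a span so it can be traced end to end.
        with tracer.start_as_current_span("handle_checkout") as span:
            span.set_attribute("order.id", order_id)
            request_counter.add(1, {"route": "/checkout"})
            # ... business logic and error handling go here ...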

Oncall Response and Rotation

Include a discussion of the oncall, pager, business-hours, or other rotation or staffing arrangement that is in place or being put in place for dealing with production issues. This should include the process, the rotation, and the mechanics of how pages, bugs, or notifications are delivered.

Reliability and Availability

The short version: reliable means it is available, returns the correct result, and delivers it in a timely manner.

Availability is usually used to mean whether the system is reachable, but I prefer to use the term reliability, which I define as the property of a system that:

  • Can be reached (receive the inputs or trigger an operation).
  • Returns the correct outputs.
  • Performs the required operations in a timely manner.

The review should evaluate the system’s reliability, including its ability to recover from failures and maintain data integrity.

The system and its operators must also safeguard any data the user entrusts to the system in terms of integrity, durability, security, privacy, and compliance.

Capacity Management

Provide an evaluation of the current capacity of the system in terms of how much traffic, throughput, or other appropriate measure of scale it can handle. This should be coupled with what performance or scalability benchmarking or load testing has been done, or is planned before launch, to measure the system's scalability. If possible, compare this against any available historical usage or traffic data and against future projections.

This also includes considerations around the cost of the system in terms of efficiency and trade-offs between availability goals and cost efficiency.
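
A back-of-the-envelope sketch of the kind of capacity math this section should capture (all numbers below are hypothetical and would come from your load tests and traffic projections):

    import math

    peak_observed_rps = 4_200          # highest recent traffic, requests/sec
    projected_growth = 1.5             # e.g., expected launch or event multiplier
    per_instance_capacity_rps = 350    # measured by load testing a single instance
    target_utilization = 0.6           # leave headroom for spikes and failures

    projected_peak_rps = peak_observed_rps * projected_growth
    instances_needed = projected_peak_rps / (per_instance_capacity_rps * target_utilization)

    print(f"Projected peak: {projected_peak_rps:,.0f} rps")
    print(f"Instances needed at {target_utilization:.0%} target utilization: {math.ceil(instances_needed)}")

The same arithmetic, combined with cost per instance, also grounds the trade-off between availability goals and cost efficiency mentioned above.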

System Dependencies

Provide a complete list of dependencies, whether first or third party, that are used by the system. In software services, these are most often other services called over the network. In data processing pipelines or workflows, these are often jobs that provide their output as input to a later job in the workflow. Important dependencies are not necessarily limited to things you call over the network. Anything with a realistic risk of breaking due to the hardware, network, environment, bug, scale, or operator error should be included.

A Review of Operational History

For systems with an operational history, a review of the prior issues, all recent postmortems, and an overall analysis of alerts generated and handled for the system should be included. This should ideally cover:

  • Quantity of alerts by severity over time. If the system has separate components with different alert queues, break down the statistics as appropriate.
  • Detailed analysis of prior postmortems and other significant issues or outages that have occurred.
  • Root cause analysis of the leading causes of alerts: how many alerts are actionable, false positives, or repeats, and whether issues cluster into specific categories of root causes (deployed defects, system performance or capacity, dependency outages, networking issues, bad configuration deployments, etc.). This analysis should inform investments in testing, code review, deployment, and monitoring improvements to reduce pager noise and improve reliability; a small sketch of this kind of categorization follows this list.
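
A minimal sketch of the categorization described in the last item, assuming you can export alert records with a category field and an actionability field (the file name, column names, and categories are hypothetical; adapt them to whatever your alerting tool exports):

    import csv
    from collections import Counter

    def summarize_alerts(path: str) -> None:
        """Count alerts by root-cause category and actionability."""
        by_category = Counter()
        actionable = 0
        total = 0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):  # expects columns: category, actionable
                total += 1
                by_category[row["category"]] += 1
                if row["actionable"].strip().lower() == "true":
                    actionable += 1
        print(f"{actionable}/{total} alerts were actionable")
        for category, count in by_category.most_common():
            print(f"{category}: {count} ({count / total:.0%})")

    summarize_alerts("alerts_export.csv")  # hypothetical export from your alerting tool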

Known Issues

Often, we do not have the luxury of addressing all the issues with the current system while planning for new functionality or even ongoing support. Everything has an opportunity cost, and it is fine to take on technical debt that leads to known risks, operational toil, reduced observability, or other issues, as long as we do so with our eyes open and prioritize addressing it against other work to fix different issues or add new functionality or features.

This is something that should be addressed in a review, but I think it works better to keep it as a running tally updated periodically in your backlog as a specific category of work and summarize it at the time of a relevant review.

Risk Analysis

This should consist of a detailed risk analysis discussing all significant risks to the product, its users, and the company developing, maintaining, and supporting it. This should cover the likelihood of each risk, the impact if it occurs, possible mitigations to reduce the likelihood or impact of occurrence, and what mitigations are done, planned, or considered but not planned.

The system’s uptime history and incident response plans should also be scrutinized to ensure that any issues can be quickly addressed and rectified.

In many ways, this is the primary point of the exercise. Decide which risks are the highest priority by auditing the system and its supporting artifacts and processes, and determine what actions to take now, in the near future, and later in the roadmap to reduce those risks to an acceptable level for the business.

By proactively identifying and addressing risks, organizations can reduce the likelihood and impact of potential disruptions, helping to ensure the smooth operation and success of their business.

Risk Management Matrix

A risk management matrix is a tool used to identify, assess, and prioritize risks. It is valuable for organizations of all sizes because it helps them proactively manage risks and reduce the potential for losses.

The risk management matrix typically includes these columns:

  • Risk Title: This column identifies the specific risk that is being assessed.
  • Impact: This column describes the potential consequences for users or the business if the risk occurs.
  • Severity: This column assesses the potential impact of the risk on the organization, typically rated on a scale of 1 to 5, with 1 being the lowest severity and 5 the highest.
  • Likelihood: This column assesses how likely the risk is to occur, typically rated on a scale of 1 to 5, with 1 being the least likely and 5 the most likely.
  • Mitigations: This column lists possible or planned actions that can be taken to reduce the severity or likelihood of the risk, designed to prevent or minimize its potential negative impact.

Here is a short example of a risk management matrix:

Risk | Impact | Severity | Likelihood | Mitigations
Auth system dependency outage | Users cannot sign in to the product | High | Possible | Graceful degradation, cached sessions/tokens, monitoring and alerting on the dependency
Service overload | Increased latency, potential reduced availability | Medium | Likely | Load testing, capacity planning, autoscaling for services, monitoring performance and error rates
Data breach | Loss of user trust, legal liability | High | Possible | Data encryption, security audits, staff training, penetration testing
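
Since severity and likelihood each map to a small ordinal scale, a common way to force-rank risks is a simple severity x likelihood score. A minimal sketch (the risks and ratings below are illustrative, using the 1-to-5 scales described above):

    # Force-rank risks by a simple severity x likelihood score (1-5 scales).
    risks = [
        {"title": "Auth dependency outage", "severity": 4, "likelihood": 3},
        {"title": "Service overload", "severity": 3, "likelihood": 4},
        {"title": "Data breach", "severity": 5, "likelihood": 2},
    ]

    for risk in sorted(risks, key=lambda r: r["severity"] * r["likelihood"], reverse=True):
        score = risk["severity"] * risk["likelihood"]
        print(f"{score:>2}  {risk['title']}")

The score is only a starting point for discussion; reviewers should still apply judgment, especially for low-likelihood risks with catastrophic impact.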

Disaster Recovery and Business Continuity

  • The review should include backup procedures, disaster recovery plans, and failover mechanisms.
  • If this is not a new system or component, then the history of prior issues, alerts, investigations, and postmortems should be analyzed, summarized, and discussed, especially any patterns of issues or known problems that may reoccur.
  • A list of runbooks or playbooks covering how to diagnose, mitigate, and resolve each type of issue that is considered likely or serious enough to prepare for in advance. One approach is to require a specific playbook for each alarm that can page someone.

Testing, CI/CD, and change management

I talked about why CI/CD is important in one of my earlier articles, Strategy for Effective and Efficient Developer Teams, so here we will focus on how to review where a team and their systems stand and what gaps need addressing.

Test-driven development (TDD) emphasizes the creation of automated tests before writing the actual code. This approach helps ensure that the code is correct, designed for testability, and meets the requirements. TDD has several important benefits, including:

  • Improved code quality: By writing tests first, developers are forced to think about the requirements and how to test them before implementing the code. This results in more concise and testable code.
  • Reduced defects: Tests act as a safety net that catches defects early in the development process, before they can cause problems in production.
  • Facilitates Refactoring: Since tests are written first, developers can refactor code with confidence, knowing that tests will catch any regressions or errors introduced.
  • Faster development: TDD can help developers work more efficiently by providing a clear path to follow. By writing tests first, developers can avoid getting bogged down in implementation details and focus on the important task of writing correct code.

Reviewing Test Coverage

By reviewing or auditing a software system’s automated testing, you can help ensure that the tests are effective and efficient. This will help to improve the quality of the code and reduce the risk of defects in production. Several aspects should be covered to ensure the effectiveness and completeness of the tests:

  1. Test Coverage: Evaluate test coverage across the codebase. High coverage is not the only goal, but important functionality should not be left untested.
  2. Test Quality: Assess the quality of the tests themselves. Tests should be clear, concise, and focused on a single aspect of the code. They should be readable and maintainable.
  3. Test Reliability: Ensure tests are reliable and do not produce false positives or negatives. Flaky tests can undermine confidence in the testing suite.
  4. Test execution time: Ensure that the automated tests execute in a reasonable amount of time. This is important to ensure that the tests can be run regularly without impacting the development process.
  5. Integration and End-to-End Tests: Besides unit tests, review the presence and quality of integration and end-to-end tests to ensure systems work together as expected.
  6. Mocking and Stubbing Practices: Evaluate how external services or systems are mocked or stubbed in tests. Over-reliance on mocks may hide integration issues.
  7. Performance Testing: Consider whether the suite includes performance tests where relevant to ensure the system meets performance criteria under load.
  8. Security Testing: Check for the inclusion of security-focused tests, especially for applications dealing with sensitive data or operating in high-risk environments.
  9. Synthetic Testing: Check for synthetic tests covering specific user journeys in production and pre-production. See "Synthetic Testing: What It Is & How It Works" on Datadog's website for a good introduction.

Continuous integration and continuous deployment (CI/CD) are critical aspects of modern software development. The integration of CI/CD practices into software development has a significant impact on operational readiness and production reliability by reducing human error, enabling automated testing, and promoting small, frequent changes that are easier to test and troubleshoot.

High-Level Checklist for CI/CD Practices

  1. Automated Build Process: Verify that the build process is fully automated and triggered upon any new code commitment to the repository.
  2. Comprehensive Test Suite: Ensure that there is a comprehensive suite of automated tests, including unit, integration, and end-to-end tests, that are run against every build.
  3. Quality Gates: Implement quality gates that prevent the promotion of code changes to the next stage in the pipeline if they fail to meet predefined criteria such as test coverage or code quality metrics (a small sketch of such a gate follows this list).
  4. Deployment Automation: Confirm that deployments to all environments, especially production, are automated and that manual deployments are the exception, not the rule.
  5. Rollback Procedures: Check that there are automated rollback procedures in place to quickly revert to a previous state in case of a failed deployment.
  6. Monitoring and Alerting: Validate that monitoring and alerting systems are in place to detect and notify the team of issues in real-time.
  7. Environment Parity: Ensure that the development, staging, and production environments are as similar as possible to reduce the chances of environment-specific issues.
  8. Infrastructure as Code (IaC): Review the use of Infrastructure as Code (IaC) to manage and provision infrastructure, ensuring consistency and reliability across environments.
  9. Security Checks: Incorporate automated security scans and checks into the CI/CD pipeline to identify vulnerabilities early in the development process.
  10. Documentation and Knowledge Sharing: Make certain that there is up-to-date documentation for the CI/CD processes and that knowledge is shared across the team to avoid silos. Strategies like auto-generating API documentation, literate programming, and techniques from Living Documentation (Martraire, 2019) can help here.
  11. Performance Testing: Include performance testing as part of the delivery pipeline to ensure that the system meets performance benchmarks before being released to production.
  12. Feature Flags: A feature flag mechanism for separating deployment from the launch of new functionality should be in place.
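
As an illustration of the quality-gate item above, a pipeline step can be as simple as a script that fails the build when coverage drops below a threshold. A minimal sketch, assuming an earlier step wrote the overall coverage percentage to a file (the file name and threshold are hypothetical; adapt them to your coverage tool):

    import sys

    COVERAGE_THRESHOLD = 80.0  # minimum acceptable line coverage, in percent

    def main() -> int:
        # Assumes a prior pipeline step wrote a number like "83.4" to this file.
        with open("coverage_percent.txt") as f:
            coverage = float(f.read().strip())
        if coverage < COVERAGE_THRESHOLD:
            print(f"FAIL: coverage {coverage:.1f}% is below {COVERAGE_THRESHOLD:.1f}%")
            return 1
        print(f"OK: coverage {coverage:.1f}%")
        return 0

    if __name__ == "__main__":
        sys.exit(main())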

Deployment Strategies

It is important to avoid a “big-bang” or all-at-once approach to updating software. Phased or incremental deployment approaches have several benefits that can improve availability.

  1. Risk Mitigation: By introducing changes incrementally, the potential impact of any single change is reduced, allowing for quicker isolation and resolution of issues.
  2. Feedback Loops: Early feedback from a subset of users can inform further development and deployment decisions, ensuring that the service remains available and performs as expected.
  3. Load Testing: Gradual rollouts help in assessing the performance under real-world conditions without affecting the entire user base, safeguarding against downtime due to performance issues.
  4. Reversibility: If a deployment causes issues, it is much easier to roll back changes when only a small portion of the environment is affected, maintaining overall service availability.

Phased deployment, also known as incremental or staged rollout, is a deployment strategy where new features, updates, or entire applications are gradually released to a subset of users or production environments. This allows teams to test the impact of changes in a controlled manner, monitor performance, and quickly address any issues before full-scale deployment, which reduces the impact on users.

Canary deployment is a technique where a new version is rolled out to a small subset of users or servers to validate the reliability and performance of the change before being deployed to the rest of production.
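
A minimal sketch of the control loop behind a canary rollout. The traffic-shifting and error-rate functions are placeholders for whatever your load balancer, service mesh, and monitoring systems actually provide, and the step sizes and thresholds are illustrative:

    import time

    STEPS = [1, 5, 25, 50, 100]    # percent of traffic sent to the new version
    ERROR_RATE_THRESHOLD = 0.01    # abort if the canary error rate exceeds 1%
    SOAK_SECONDS = 600             # how long to observe each step

    def set_canary_traffic(percent: int) -> None:
        """Placeholder: call your load balancer or service mesh API here."""

    def canary_error_rate() -> float:
        """Placeholder: query your monitoring system for the canary's error rate."""
        return 0.0

    def rollout() -> bool:
        for percent in STEPS:
            set_canary_traffic(percent)
            time.sleep(SOAK_SECONDS)
            if canary_error_rate() > ERROR_RATE_THRESHOLD:
                set_canary_traffic(0)  # roll back: send all traffic to the old version
                return False
        return True

In practice, the comparison is usually richer than a single error-rate check, covering latency, saturation, and key business metrics.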

Blue-green deployments are another strategy: you maintain two environments and use load-balancing configuration to progressively shift traffic to the newly updated environment, which was drained of live traffic while it was being deployed to. This approach allows for very rapid rollback by simply redirecting traffic back. The downside is having to maintain two environments that can each support production traffic, but if you set this up in conjunction with auto-scaling for the service(s) involved, you can scale the inactive environment down when it is not being used for a deployment.

Backwards Compatibility: There are special considerations around user sessions with web apps, compatibility between different layers in your service stack, and also schema changes to data stores. For example, if you roll out a new version of your frontend that depends on a new feature in your API layer that requires a new column in your data store’s schema to be present, then you will likely have problems. You also have to be able to roll back changes safely, which is why having rollback testing in some pre-production phase of deployment is useful.

Feature Flags and Dark Launches: You should have at least a simple mechanism for deploying changes to your system behind feature flags, which you can change more quickly than you can do a full deployment. This lets you roll back more easily in case of problems, decouples launch from deployment, and gives you a mechanism to work around otherwise problematic backward compatibility issues (expected or otherwise). Feature flags can also be coupled with the measurement of user analytics data and performance metrics to perform A/B experiments to evaluate the effect of new features and other changes on user experience and behavior. Ideally, your feature flag mechanism should itself support gradual rollout, so you can test a change on a subset of users and allow internal users to force a flag setting to test behavior in production before enabling it for actual end users.
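
A minimal sketch of the gradual-rollout behavior just described: hash a stable user identifier into a bucket so each user gets a consistent decision, and keep an override list for internal testers. The flag name, rollout percentage, and in-code configuration are all illustrative; real systems usually load flag state from a flag service or config store so it can change without a deployment:

    import hashlib

    FLAGS = {
        "new_checkout_flow": {"rollout_percent": 10, "force_on": {"internal-tester-42"}},
    }

    def is_enabled(flag: str, user_id: str) -> bool:
        config = FLAGS.get(flag)
        if config is None:
            return False
        if user_id in config["force_on"]:  # internal users can force the flag on
            return True
        # Hash user id + flag name so each flag rolls out to a different user subset.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100     # stable bucket in [0, 100)
        return bucket < config["rollout_percent"]

    print(is_enabled("new_checkout_flow", "user-123"))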

Team training and incident management

The importance of the human element of operations should not be underestimated. How communications are managed during an incident, how team members are trained, and how follow-ups such as post-mortems are conducted play a significant role in the reliability of a system.

Training for the staff who will operate and maintain the system should be reviewed to ensure they are adequately prepared. Operational procedures, including incident response, deployment processes, change management, and support protocols, should be well-defined and tested. The training should include several areas, including:

  • Incident management (what Google calls IMAG): How to manage an incident in terms of communication, escalation, and other processes.
  • Documentation: Proper documentation is essential for the effective operation and maintenance of complex systems. The ORR should confirm that documentation is complete, up-to-date, and accessible to relevant personnel. This includes user manuals, system architecture diagrams, and operational procedures.
  • Do not trade accuracy for unnecessary detail. Less documentation requires less maintenance, which makes it easier to keep up-to-date and accurate.
  • Where to find runbook or playbook information and contact information if help or escalation is needed.
  • How to handle a security incident in particular.
  • Where to find dashboards, alerts, status, and how to query metrics, logs, traces, crash reports, and any other telemetry related to operating or troubleshooting the service.
  • SLAs for pages, incidents, and alerts: how they are tracked, and what types of updates are expected and at what frequency.

For an introduction to runbooks, see "What is a Runbook?" by PagerDuty.

For more information on incident management in practice, I recommend starting with the "Incident Response" chapter of the SRE Book (Beyer et al., 2016).

Security

Security is paramount in today’s digital landscape. If there is not a separate process being followed for security, including threat modeling and security risk management, then you should perform one as part of the ORR process.

The security assessment should cover the system’s security posture, ensuring that all data is protected against unauthorized access and breaches. This involves reviewing authentication mechanisms, access controls, encryption standards, and security protocols. Compliance with relevant regulations and standards, such as GDPR or HIPAA, must also be verified to avoid legal and financial repercussions.

Human review should be combined with automated scanning and, preferably, outside auditing and penetration testing when feasible.

We are not going to cover security or threat modeling in any detail here, not because it is unimportant, but because doing it justice would significantly increase the scope and length of this article. If you need a starting point for threat modeling and mitigation, I recommend starting with the Threat Modeling Process provided by OWASP. Another good resource is "Threat Modeling: 12 Available Methods" (SEI, 2018) from the Software Engineering Institute blog.

Customer Support

The ORR should assess the readiness of the support team to handle customer inquiries and technical issues. Service level agreements (SLAs) and support response times should be evaluated to ensure they meet business requirements.

Legal and financial considerations

The ORR should not overlook legal and financial aspects, such as licensing agreements, intellectual property rights, and budget allocations. It is crucial to ensure that the system’s launch does not expose the organization to legal vulnerabilities or unexpected costs.

Sample Questions

If you are going to structure your PRR/ORR as a document, then one approach is to create a survey-style list of questions for the service team to answer that covers the important topics.

Here is a non-exhaustive list of sample questions to cover during a readiness review. You should tailor the list to your situation, organization, and needs. A template of questions like this can serve as a checklist while being much more concise than a document (such as this article) covering the entire process.

  1. Does the service have defined SLOs and/or SLAs for availability? And for performance (e.g., latency)?
  2. Does the service have monitoring for key customer experiences that can generate automated alerts?
  3. Do you have a way to trace requests through the system?
  4. Do you have automated collection and rotation of system logs?
  5. Are your logs easily searchable?
  6. Do you have monitors for the expiration of TLS certificates? (A small sketch of such a check follows this list.)
  7. Do web endpoints (sites, apps, and APIs) have black-box (prober) monitoring?
  8. Do all your services have monitors and alerts for elevated error rates? For degraded performance or elevated latency?
  9. Do you have clear dashboards of key system health metrics?
  10. Do you have metrics and monitoring for downstream dependencies?
  11. Do you monitor the resource utilization of hosts, nodes, and pods in the system for disk usage, memory usage, and CPU utilization?
  12. Do you have synthetic tests for your system, if appropriate? These are tests that try to more realistically mimic real user traffic. This is also sometimes called “canary testing.”
  13. If the system includes a web interface, what client-side instrumentation do you have?
  14. Do you have Real User Monitoring (RUM) or other types of monitoring on client-side instrumentation?

Conclusion

An operational readiness review allows organizations to identify and address potential risks before launch and is an essential step in the software development lifecycle. Proactive review helps ensure the system performs as expected and provides a positive customer experience.

References

Books

Beck, K. (2014). Test-Driven Development by Example. Boston: Addison-Wesley.

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media.

Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly Media.

Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press.

Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley.

Kim, G., Humble, J., Debois, P., Willis, J., & Forsgren, N. (2021). The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations. IT Revolution Press.

Martraire, C. (2019). Living Documentation: Continuous Knowledge Sharing by Design. Addison-Wesley.

Articles

The 6 Pillars of the AWS Well-Architected Framework | Amazon Web Services. (2022, March 1). Amazon Web Services. https://aws.amazon.com/blogs/apn/the-6-pillars-of-the-aws-well-architected-framework/.

Amazon Web Services, (n.d.). “Using synthetic monitoring”, Amazon CloudWatch User Guide, https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatchSyntheticsCanaries.html. Accessed 25 Dec 2022.

Amazon Web Services, (n.d.). AWS Observability Best Practices. Retrieved February 4, 2024, from https://aws-observability.github.io/observability-best-practices/

Cocca, G. (2023, April 7). What is CI/CD? Learn Continuous Integration/Continuous Deployment by Building a Project. freeCodeCamp.org. https://www.freecodecamp.org/news/what-is-ci-cd/

Davidovič, Š., & Beyer, B. (2018). Canary analysis service. Communications of the ACM, 61(5), 54-62. https://dl.acm.org/doi/10.1145/3190566

DORA | DevOps Research and Assessment. (n.d.). https://dora.dev/

Dodd, R. (2023, January 30). Four common deployment strategies. LaunchDarkly. https://launchdarkly.com/blog/four-common-deployment-strategies/

Liguori, C., “My CI/CD pipeline is my release captain”, Amazon Builder’s Library, https://aws.amazon.com/builders-library/cicd-pipeline/. Accessed 23 Dec 2022.

Observability Primer. (2024, January 30). OpenTelemetry. https://opentelemetry.io/docs/concepts/observability-primer

Production Readiness Review. (2023, December 6). The GitLab Handbook. https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/

Synthetic Testing: What It Is & How It Works. (2021, June 7). Datadog. Retrieved February 13, 2024, from https://www.datadoghq.com/knowledge-center/synthetic-testing

Threat Modeling: 12 Available Methods. (2018, December 3). SEI Blog. https://insights.sei.cmu.edu/blog/threat-modeling-12-available-methods/

Threat Modeling Process | OWASP Foundation. (n.d.). https://owasp.org/www-community/Threat_Modeling_Process

What is a Runbook? | PagerDuty. (2023, March 21). PagerDuty. Retrieved February 4, 2024, from https://www.pagerduty.com/resources/learn/what-is-a-runbook/