
Engineering a culture of technology resilience

How enterprises can proactively thrive amid uncertainty

Gain insight into our Enterprise Resilience Engineering Framework and how it can help your business adapt and thrive in today's ever-evolving technology landscape.

A call to technological resilience

Technology resilience is the ability of a system to continue operating under adverse conditions and to recover swiftly from them. Resilience has become more crucial as distributed architectures, hybrid and multicloud environments, interconnected systems, and thousands of microservices make technology estates increasingly complex.

Challenges such as perpetual technical debt, rigid systems, shifting regulations, organizational complexity, technology disruptions, and delivery friction create significant resilience gaps. The shift-left culture, new regulatory requirements (for example, DORA,¹ SEC,² OCC,³ and FFIEC⁴), and a changing regulatory ethos add complexity to resilience, necessitating a balance between regulatory and market-driven solutions. Additionally, the shared responsibility model with third parties such as vendors and hyperscalers, along with heightened customer expectations for “always-on” services, further complicates matters.

Resilience failures now carry severe consequences, potentially leading to competitive disadvantages and loss of customer trust. A recent faulty software update caused a major worldwide information technology (IT) outage,⁵ disrupting operations across sectors such as banking, airlines, and hospitals. This incident highlights the vulnerability of our interconnected world and the fragility of our technological environment.

Staying ahead of the curve: Deloitte’s resiliency engineering framework

Our resiliency engineering reference framework provides valuable guidance for building robust mitigation strategies, ensuring operational continuity during challenging times. The framework fosters a proactive resilience culture—extending beyond mere availability and data protection—to emphasize preparedness and adaptability across architectural and operational pillars (figure 1).

Leaders must proactively mitigate resiliency problems to build reliable and robust systems and reduce the cost of fixing issues. Organizations can start building resilience today by assessing their current state and pursuing improvement opportunities such as implementing architectural patterns to remediate failures, executing game days, and ingraining a “resiliency-first” culture over time.

Our approach to technology resilience

Achieving proactive resilience may seem daunting, but a reference framework can help organizations navigate challenges with confidence and adaptability. Ultimately, proactive resilience flips the script, driving businesses to thrive in an unpredictable world.

Architecture fitness function

At Deloitte, we thoughtfully assess and design resilient architectures for critical systems. We recognize that resilience is the backbone of robust systems, enabling them to withstand and recover from unexpected failures.

Implementing proven resiliency patterns such as circuit breakers, time-outs, geographical redundancy, active/active failover, and traffic distribution helps prevent cascading failures in distributed systems, maintaining their stability.
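To make the circuit breaker and time-out patterns mentioned above more concrete, here is a minimal sketch in Python. It is illustrative only and not part of the framework itself: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and their default values are assumptions chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Minimal circuit breaker sketch (names and defaults are hypothetical)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast so a struggling dependency is not hammered.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a downstream service in `breaker.call(...)` means that, once the dependency has failed repeatedly, callers get an immediate error instead of queuing behind a slow or dead service, which is how the pattern limits cascading failures.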

Modern release management techniques, such as blue/green deployments and canary releases, ensure component resiliency and minimize downtime risks in production environments.

  • High-availability and fault tolerance: We advocate for redundancy across the technology stack, adopting availability zones, multiregions, and modern data protection techniques and orchestration runbooks to facilitate quick recovery from significant failure events.
  • Policy-as-code: By defining, enforcing, and codifying policies throughout the development life cycle, we help ensure application teams adhere to reliability standards.
  • Codify standards and hygiene: Adhering to defined standards governing code quality and consistency reduces the likelihood of system failures due to code-related issues.
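As one way to picture policy-as-code, the sketch below expresses a few reliability rules as data and evaluates a service configuration against them. Everything here is hypothetical: the `ServiceConfig` fields, the rule identifiers, and the thresholds are assumptions for illustration, not standards published by the framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical deployment settings that a pipeline might inspect."""
    name: str
    replicas: int
    availability_zones: int
    request_timeout_s: Optional[float]


# Each policy: (identifier, check returning True when compliant, message).
# The thresholds are illustrative assumptions, not published standards.
POLICIES: List[Tuple[str, Callable[[ServiceConfig], bool], str]] = [
    ("min-replicas", lambda s: s.replicas >= 2, "run at least 2 replicas"),
    ("multi-az", lambda s: s.availability_zones >= 2,
     "span at least 2 availability zones"),
    ("timeout-set",
     lambda s: s.request_timeout_s is not None and s.request_timeout_s > 0,
     "set a positive request timeout"),
]


def evaluate(service: ServiceConfig) -> List[str]:
    """Return one violation message per failed policy (empty if compliant)."""
    return [
        f"{service.name}: {policy_id}: {message}"
        for policy_id, check, message in POLICIES
        if not check(service)
    ]
```

A build pipeline could call `evaluate` for each service and fail the build when the returned list is not empty, so reliability standards are enforced before a change reaches production.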

Modern-day operations

Adopting site reliability engineering (SRE) practices requires patience and collaboration. Operational shift-left moves life cycle operations activities such as planning, configuration, and maintenance earlier in the development life cycle, fostering enhanced efficiency, collaboration, and reliability.

  • Resiliency practices: Developing standardized and repeatable practices is crucial for driving SRE adoption.
  • Incident and problem management: Implementing clear communication channels and standardized processes ensures that incidents are handled swiftly and effectively.
  • Release and change management: Defining checks, tollgates, and standardized procedures throughout the software delivery life cycle aims to achieve stable releases with minimal risk to production environments.
  • Root cause analysis and post-incident reviews: Establishing blameless procedures allows us to constructively identify incidents’ underlying causes.
  • Automation: Identifying opportunities for automation can significantly reduce routine tasks (for example, infrastructure provisioning).

Building robust modeling, experimenting, and testing capabilities can help address challenges preemptively.

  • Failure mode and effects analysis (FMEA): Establishing FMEA processes helps mitigate risks before they become issues.
  • Chaos testing: Designing a library of chaos recipes for common fault injection patterns helps uncover vulnerabilities in a controlled environment.
  • Functional testing: Comprehensive guidelines for functional testing are essential for maintaining a solid resiliency posture.
  • Nonfunctional testing: Rigorous processes for nonfunctional tests allow us to accurately evaluate systems’ speed, reliability, and performance levels.
  • Game day and disaster recovery (DR) testing: Regular game days and disaster recovery tests build muscle memory for real-world incidents.
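To illustrate how FMEA output can be prioritized, the sketch below ranks failure modes by a risk priority number (RPN). It assumes the conventional scoring of severity × occurrence × detection, each rated from 1 to 10; the article does not say which scoring scheme the framework uses, and the `FailureMode` class and its example entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet; each score is 1 to 10."""
    description: str
    severity: int    # impact if the failure occurs
    occurrence: int  # how likely the failure is
    detection: int   # how hard it is to detect (10 = hardest)

    def __post_init__(self):
        for field_name in ("severity", "occurrence", "detection"):
            score = getattr(self, field_name)
            if not 1 <= score <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        # Conventional risk priority number (assumed scoring scheme).
        return self.severity * self.occurrence * self.detection


def prioritize(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return failure modes with the highest RPN first."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)
```

The highest-ranked entries are natural candidates for chaos experiments and game-day scenarios, since they combine high impact with poor detectability.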

 

Observability empowers organizations to gain real-time insights into application health and performance.

  • Telemetry: Establishing a robust telemetry system provides comprehensive real-time system insights utilizing log management, time-series databases, and event correlation technologies.
  • Notification and alerting: Configuring appropriate thresholds and alerts ensures critical incidents are promptly communicated.
  • Service-level indicators (SLIs), service-level objectives (SLOs), and error budgets: Collaborating with business and technology teams helps identify applicable SLIs and determine SLOs alongside error budgets.
  • Reporting and dashboards: Integrating data from various sources into consolidated dashboards provides tailored real-time insights.

Furthermore, intelligent operations can be integrated into software development life cycle (SDLC) phases. We believe this can provide observability that transcends traditional logging, monitoring, and metrics, offering a holistic view of an IT system’s health, behavior, and performance.
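The relationship between an SLO and its error budget, listed among the observability practices above, comes down to simple arithmetic. The sketch below assumes a request-based availability SLI; the function name `error_budget` and the keys of the returned dictionary are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget use for a request-based availability SLO.

    slo_target is a fraction, for example 0.999 for a 99.9% SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }
```

For example, a 99.9% SLO over 1,000,000 requests allows about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget. Teams often use the remaining budget to decide whether to keep shipping changes or to prioritize reliability work.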

In today’s rapidly evolving digital landscape, the importance of security and compliance cannot be overstated.

  • Data privacy and protection: Establishing and implementing data privacy and protection measures to securely handle personal and sensitive information.
  • Third-party risk management: Developing a strategy to identify, assess, and mitigate risks associated with third-party relationships.
  • Regulatory and industry standards: Adapting to regulatory and industry standards involves defining a structured approach for continuous assessment and monitoring.
  • Continuous monitoring and validation: Integrating real-time insights with regular security controls, policies, and compliance requirement assessments.
  • Auditability and traceability: Maintaining auditability through robust audit trails and documentation practices enables us to demonstrate compliance effectively.

Fostering a culture of resilience requires a comprehensive cultural shift. SREs rely on metrics and data to make informed decisions and encourage continuous improvement through regular reviews and updates to processes and tools. This data-driven approach helps in identifying areas for improvement and implementing new practices that enhance overall resilience.

  • Operating model: Embedding site reliability engineer roles throughout the SDLC.
  • Tooling: Enabling and scaling the adoption of resiliency standards through practical tools and processes.
  • Communication management: Clear and consistent communication is vital for any cultural shift.
  • Key performance indicators (KPIs) and key result areas: Establishing ambitious yet measurable goals aligned with organizational objectives.
  • Training and incentivization: Providing technical training on resiliency for all levels and developing mechanisms to reward desired behaviors.

By embracing these practices, we assist organizations in transitioning to modern operations, enabling them to harness real-time insights and effectively manage error budgets. This proactive mindset is crucial in safeguarding customer trust and maintaining organizational integrity in an unpredictable world.

Resiliency engineering framework in action: A recent use case

One of our clients faced significant reliability challenges with regular outages and needed to prepare for an influx of millions of new transactions on its platform. Although the client had established incident management processes, gaps in resiliency knowledge and skills posed substantial challenges. The client sought our partnership to embark on its resiliency-bolstering journey.

By leveraging our resiliency engineering reference framework, we empowered our client to adopt engineering practices, enhance IT resiliency, and prepare for significantly increased traffic. This transformation included:

  • Establishing an SRE operating model: Setting clearly defined roles and responsibilities to ensure a structured approach to engineering site reliability.
  • Rolling out a robust resiliency framework: Implementing a comprehensive framework to guide resiliency practices across the organization.
  • Enabling precise SLIs/SLOs: Defining and monitoring SLIs/SLOs to maintain performance standards.

Through proactive measures such as FMEA and chaos testing, we identified potential faults before they became issues. We also published resiliency software standards and upskilled teams to foster a culture of resilience.

The results were significant:

Incidents were reduced by 25%, duration of major incidents was reduced by nearly 30%, more than 400 failure modes were identified, and 60 of them were remediated across more than 100 critical path applications. These efforts collectively contributed to enhancing system reliability and fostering a resilient organizational culture. This proactive approach ensured that our client was well-prepared to handle the increased customer base and maintain operational continuity.

By collaborating with us and leveraging our Resiliency Engineering Reference Framework, the company mitigated most of its stability risks and built a robust foundation for future growth and stability.

In today’s dynamic business landscape, resilience is essential. Embrace resilience engineering to ensure seamless operations and unwavering performance. Contact us to fortify your technology and lead with confidence.

Endnotes

¹ European Union, Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector, Official Journal of the European Union, December 27, 2022.

² US Securities and Exchange Commission (SEC), “Cybersecurity and resiliency observations,” Office of Compliance Inspections and Examinations (OCIE), February 27, 2025.

³ Jennie Clarke, “OCC to join regulatory rollout as it eyes operational risk requirements for banks,” Global Relay, March 14, 2024.

⁴ Federal Financial Institutions Examination Council, “Financial regulators revise Business Continuity Management booklet to stress to examiners the value of resilience to avoid disruptions to operations,” press release, November 14, 2019.

⁵ Sean Michael Kerner, “CrowdStrike outage explained: What caused it and what’s next,” TechTarget, October 29, 2024.
