Building Resilient Applications in the Cloud

What is cloud resilience?

The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload's components.”

In more generic terms, resilience can be defined as “the capacity to recover from attacks (either inadvertent through a bug, or purposeful by intention), load (more service requests), and component failures.”

Why does cloud resilience matter?

Cloud resiliency helps mitigate the impact of service disruptions on the end users of cloud-based applications and services while also minimizing cost. This results in lower overall downtime, which can in turn enhance user satisfaction and drive business performance in a cost-efficient way. It allows businesses to recover from unexpected disasters and to scale cloud infrastructure as business needs change. Cloud resiliency helps ensure that services and operations are not interrupted and that critical data and applications remain available and protected.

Why do organizations fail at building resilient applications, and how can they succeed?

  • Mission Critical Applications - Organizations lack visibility into their mission-critical applications and do not prioritize application recovery. They should identify these applications and define an RPO (Recovery Point Objective), RTO (Recovery Time Objective), and MTD (Maximum Tolerable Downtime) for each. Identifying interdependencies between applications and components is a critical part of this exercise.
  • Disaster Recovery - Organizations move applications to the cloud with a focus on infrastructure disaster recovery, but often fail to see that cyber-attacks and software vulnerabilities, not site loss, are the primary source of incidents. Infrastructure security should receive more attention from organizations.
  • Networking - Organizations do not prioritize logical segregation when they migrate to the cloud because they believe the cloud will automatically reduce risk. This raises the risk of unsafe deployments, such as putting production and test systems on the same network or account, which allows an attacker to move laterally through the network once one system is compromised. Organizations must prioritize segregation and deployment strategies.
  • Automation and Orchestration - Organizations spend time and money on manual deployments, manual policy enforcement, and similar tasks. This results in misconfiguration, inconsistent policy enforcement, security holes, and monetary loss. Organizations should adopt a DevSecOps culture. Wherever possible, policy as code and infrastructure as code should be used, and disaster recovery should be automated.
  • Immutable Infrastructure - Organizations use non-standard public images, or share images across teams, to spin up compute or container resources. This results in non-standard configurations, inconsistent security, and a larger attack surface. Organizations should establish a process for creating a golden image for every team, along with continual image assessment and patching.
  • Crisis Management and Planning - Organizations do not promote a training and development culture. They either lack an explicit incident response plan or have one but fail to implement it, which leads to ambiguous actions and outcomes. Roles and responsibilities are not clearly defined, which leads to friction among teams. Organizations must develop a clear incident response plan and define a RACI matrix (Responsible, Accountable, Consulted, and Informed), which matches actions to roles. A process for conducting periodic reviews and simulation exercises should also be established.
  • People - Organizations are understaffed and lack experienced professionals. Critical tasks are entrusted to a single person, who cannot respond effectively during an incident because the overwhelming responsibility leads to decision fatigue and poor judgement. Organizations should avoid relying on a single person and should keep all processes clearly documented and regularly evaluated so that they can be followed easily.
  • Resilience Engineering - Resilience is built around non-critical applications, and the old idea of disaster recovery (i.e., several backup zones) is still used, which makes it difficult to restore to the original site after a disaster. Organizations should focus on spreading compute services across zones and geographies instead of maintaining several backup services. Architecture should be designed for resilience and tested periodically (e.g., with Chaos Monkey, the tool engineered by Netflix).
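
The exercise described in the first item above (defining RPO, RTO, and MTD per application and mapping interdependencies) can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the application names, the minute values, and the helper names `recovery_order` and `objective_violations` are all hypothetical, and the two checks (RTO should not exceed MTD; a dependency should not take longer to restore than the application that needs it) are assumptions about what a reasonable inventory review might flag.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class App:
    """One entry in a (hypothetical) application inventory."""
    name: str
    rpo_minutes: int  # Recovery Point Objective: tolerable data-loss window
    rto_minutes: int  # Recovery Time Objective: target time to restore
    mtd_minutes: int  # Maximum Tolerable Downtime
    depends_on: list[str] = field(default_factory=list)


def recovery_order(apps):
    """Return app names so that every dependency comes before its dependents."""
    graph = {a.name: set(a.depends_on) for a in apps}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


def objective_violations(apps):
    """Flag objectives that look inconsistent (assumed rules, see lead-in)."""
    by_name = {a.name: a for a in apps}
    issues = []
    for a in apps:
        if a.rto_minutes > a.mtd_minutes:
            issues.append(f"{a.name}: RTO exceeds MTD")
        for dep in a.depends_on:
            dep_app = by_name.get(dep)
            if dep_app is None:
                issues.append(f"{a.name}: unknown dependency {dep}")
            elif dep_app.rto_minutes > a.rto_minutes:
                issues.append(f"{a.name}: dependency {dep} has a longer RTO")
    return issues


# Hypothetical inventory, for illustration only.
apps = [
    App("checkout", rpo_minutes=5, rto_minutes=30, mtd_minutes=60,
        depends_on=["payments-db", "auth"]),
    App("auth", rpo_minutes=15, rto_minutes=15, mtd_minutes=60),
    App("payments-db", rpo_minutes=1, rto_minutes=45, mtd_minutes=120),
    App("reporting", rpo_minutes=1440, rto_minutes=480, mtd_minutes=240,
        depends_on=["payments-db"]),
]

print(recovery_order(apps))
for issue in objective_violations(apps):
    print(issue)
```

With this sample data, the review would flag that `checkout` depends on a database whose RTO is longer than its own, and that `reporting` has an RTO larger than its MTD. Keeping such an inventory in code also makes it easy to re-run the checks whenever an application or dependency changes.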

Conclusion

In conclusion, cloud migration provides several benefits to organizations, but it also brings challenges that must be addressed to ensure secure and efficient operations. By addressing these challenges, organizations can enhance their security posture, reduce operational costs, and improve their overall cloud migration experience.