Unpacking Our Findings From Assessing Numerous Infrastructures – Part 2

When superior performance comes at a higher price tag, innovation makes it accessible. This is quite evident in the way AWS has been evolving its services –

  • gp3, the successor to gp2 volumes – Offers the same durability, supported volume size, max IOPS per volume, and max IOPS per instance. The main difference is that gp3 decouples IOPS, throughput, and volume size. This flexibility to configure each piece independently is where the savings come in (see the sketch after this list).
  • AWS Graviton3 processors – Offer up to 25% better compute performance, up to 2x higher floating-point performance, and up to 2x faster cryptographic performance compared to their predecessor, Graviton2. They also deliver up to 3x better machine-learning performance and support DDR5 memory, which provides 50% more bandwidth than the DDR4 used with Graviton2.
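
For illustration, here is a minimal boto3 sketch (the volume ID and the target numbers are placeholders) that moves a gp2 volume to gp3 and sets IOPS and throughput independently of the volume size:

```python
import boto3

# Hypothetical volume ID and targets; gp3 lets you raise IOPS and throughput
# without also having to grow the volume, which is where the savings come from.
ec2 = boto3.client("ec2")

ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder
    VolumeType="gp3",
    Iops=6000,        # set independently of size (gp3 baseline is 3,000)
    Throughput=250,   # MiB/s, also independent (gp3 baseline is 125)
)
```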

To be better at assessing your core infrastructure needs, knowing the AWS services is just half the battle. In my previous blog, I discussed several areas where engineering teams often falter. Do give it a read! >>> Unpacking Our Findings From Assessing Numerous Infrastructures – Part 1

What we’ll be discussing here are –

  • Are your systems truly reliable?
  • How do you respond to a security incident?
  • How do you reduce defects, ease remediation, and improve flow into production? (Operational Excellence)

Are Your Systems Truly Reliable?

Nearly 67% of teams showed high risk on the questions around resilience testing, starting with the lack of basic pre-thinking about how things might fail and of plans for what to do when they do. Of course, teams did perform root cause analysis after things actually went wrong — that we can count as learning from mistakes. But for the majority of them, there was no playbook or procedure for investigating failures and carrying out post-incident analysis.

How do you plan for disaster recovery?

Eighty percent of the workloads we reviewed scored high risk in this area. Despite disaster recovery being a vital necessity, many organizations avoid it because of its perceived complexity and cost. Other common reasons were insufficient time, inadequate resources, and an inability to prioritize it due to a lack of skilled personnel.

An easy way to begin is by noting down your —

  • Recovery point objective (RPO) – How much data are you prepared to lose? (a backup-schedule sketch follows this list)
  • Recovery time objective (RTO) – How long a downtime can your customers tolerate?
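
As a rough illustration of how the RPO drives your backup schedule, here is a hedged AWS Backup sketch, assuming an RPO of one hour, an existing vault named "Default", and a hypothetical plan name — hourly backups mean at most about an hour of data can be lost:

```python
import boto3

# Sketch only: schedule hourly backups to meet a 1-hour RPO.
backup = boto3.client("backup")

backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "hourly-rpo-plan",  # hypothetical name
        "Rules": [
            {
                "RuleName": "hourly",
                "TargetBackupVaultName": "Default",         # assumed existing vault
                "ScheduleExpression": "cron(0 * * * ? *)",  # every hour
                "StartWindowMinutes": 60,
                "Lifecycle": {"DeleteAfterDays": 7},
            }
        ],
    }
)
```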

The next important step is planning and working on the recovery strategies. Let's take an AWS Lambda function as an example. How might you think through the various error scenarios?

  • Manual deployment errors: Risk of deploying incorrect code or configuration changes.
  • Cold start delay: Lambda takes time to initialize a new execution environment (typically after a period of inactivity has let the previous one expire), so the first request takes longer to serve, resulting in a poor user experience.
  • Lambda concurrency limit: Risk of hitting the default concurrency limit; once it is exceeded, further invocations are throttled and synchronous requests fail, so those requests are effectively lost (a mitigation sketch follows this list).
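
A hedged mitigation sketch for the last two scenarios, assuming a function named "orders-api" with an alias "live": provisioned concurrency keeps execution environments warm (reducing cold starts), and reserved concurrency caps how much of the account-wide limit this function can consume:

```python
import boto3

# Function name and alias are assumptions; adjust to your own deployment.
lam = boto3.client("lambda")

# Keep 10 execution environments initialized to avoid cold starts on the alias.
lam.put_provisioned_concurrency_config(
    FunctionName="orders-api",
    Qualifier="live",
    ProvisionedConcurrentExecutions=10,
)

# Reserve (and cap) 100 concurrent executions for this function so a spike
# here cannot exhaust the account-level concurrency limit for everything else.
lam.put_function_concurrency(
    FunctionName="orders-api",
    ReservedConcurrentExecutions=100,
)
```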

Or maybe answering questions like — what happens to your application if your database goes away? Does it reconnect? Does it reconnect properly? Does it re-resolve the DNS name?
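
As a rough sketch (the host name and the open_db_connection driver call are placeholders), a reconnect loop can re-resolve the database's DNS name on every attempt, so a failover to a new IP behind the same endpoint is actually picked up:

```python
import socket
import time

DB_HOST = "db.example.internal"  # placeholder endpoint, e.g. an RDS DNS name
DB_PORT = 5432

def connect_with_retry(max_attempts=5, base_delay=1.0):
    """Reconnect with exponential backoff, re-resolving DNS on each attempt."""
    for attempt in range(max_attempts):
        try:
            # Look the name up again instead of reusing a cached IP address.
            addr = socket.getaddrinfo(DB_HOST, DB_PORT, proto=socket.IPPROTO_TCP)
            ip, port = addr[0][4][0], addr[0][4][1]
            return open_db_connection(ip, port)  # placeholder for your driver's connect()
        except OSError:
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Could not reconnect to {DB_HOST} after {max_attempts} attempts")
```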

While the cloud does take away most of your “heavy lifting” with infrastructure management, this doesn’t include managing your application and business requirements.

Some Best Practices to follow:

  • Be aware of fixed (unchangeable) service quotas, service constraints, and physical resource limits to prevent service interruptions or financial overruns (see the sketch after this list).
  • Validate your backup integrity and processes by performing recovery tests.
  • Ensure a sufficient gap exists between the current quotas and the maximum usage to accommodate failover.
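
As a hedged starting point for the first practice, the Service Quotas API can show which limits apply to a service and which of them are fixed (non-adjustable), so you know where a quota-increase request will not help and where headroom must be designed in. A minimal sketch for Lambda, assuming region and credentials come from your environment:

```python
import boto3

# List Lambda quotas and flag the fixed ones.
quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="lambda"):
    for q in page["Quotas"]:
        kind = "adjustable" if q["Adjustable"] else "FIXED"
        print(f'{q["QuotaName"]}: {q["Value"]} ({kind})')
```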

How Do You Respond To A Security Incident?

75% of technology teams are not doing a good job at responding to security incidents. They're not planning ahead for what's happening in the security landscape. Only 30% of teams knew what tooling they would use to mitigate or investigate a security incident.

Now when we’re talking about security incidents caused by exploited frameworks. Some of the common tell-tale signs observed were – 

  • Allowing untrusted code execution on your machines.
  • Failure to set up adequate access controls on storage services — for example, data leakage from an S3 bucket that was accidentally made public (see the sketch after this list).
  • Accidental exposure of API keys, such as when checked into a public Git repository.
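
A simple guardrail against the S3 case above (the bucket name is a placeholder) is to enable the bucket-level public access block, which overrides any public ACLs or bucket policies added later:

```python
import boto3

# Block every form of public access on the bucket. Bucket name is a placeholder.
s3 = boto3.client("s3")

s3.put_public_access_block(
    Bucket="my-example-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```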

Another aspect of security is understanding the health of your workload, which means monitoring and telemetry. The framework distinguishes user behavior monitoring (real user monitoring) from workload behavior monitoring. This is notable because teams are undoubtedly collecting all sorts of data but are not doing much with it.

  • More than half of them have clearly defined their KPIs, but fewer have actually established baselines of what normal looks like.
  • The number drops further when it comes to setting up alerts for those monitored items (a minimal example follows this list).
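
For instance, once a baseline exists, an alert on a monitored item is only a few lines with CloudWatch. A sketch, assuming a Lambda function named "orders-api" and a placeholder SNS topic for notifications:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the function errors more than 5 times within 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "orders-api"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder ARN
)
```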

Then comes access and granting least privilege. Although teams understood what work each role does and what access it should have, not many were enforcing it. There was an absolute absence of:

  • Role-based access control
  • Multi-factor authentication
  • Rotation of passwords, and
  • Use of secret vaults such as AWS Secrets Manager or HashiCorp Vault (instead, secrets were simply baked into application config), etc.

In short, automation of credential management is pretty much nonexistent.
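
Pulling a credential from a vault at runtime, instead of baking it into config, is usually a small change. A hedged sketch with AWS Secrets Manager, assuming a hypothetical secret named "prod/orders-api/db" that stores a JSON document with username and password keys:

```python
import json

import boto3

# Fetch the database credential at startup rather than shipping it in config;
# Secrets Manager can also rotate it on a schedule.
secrets = boto3.client("secretsmanager")

response = secrets.get_secret_value(SecretId="prod/orders-api/db")  # placeholder name
credentials = json.loads(response["SecretString"])

db_user = credentials["username"]
db_password = credentials["password"]
```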

How Do You Reduce Defects, Ease Remediation, And Enhance the Production Deployment Process?

Yes, finally talking about the pillar – operational excellence. People are familiar with version control and are (mostly) using Git. They run a lot of automated testing in their CI — mostly smoke tests and integration tests.

Operational excellence focuses on defining, executing, measuring, and improving the standard operating procedures in response to incidents and client requests. Following the DevOps philosophy is not enough if the tools and workflows don't support it. The absence of proper documentation and the sole dependence on DevOps engineers for automation have led to burnout. DevOps engineers manually stitching together solutions for every situation has resulted in slow workflow development and brittle operations.

As per Gartner, platform engineering is an emerging trend within digital transformation efforts that “improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations.”  Beyond commercial hype, an Internal Developer Platform is a curated set of tools, capabilities, and processes; packaged together for easy consumption by development teams. Reduced human dependency and standardized workflows empower engineering teams to scale efficiently.

I guess the primary takeaway for us from these reviews was that, today, people are better at building platforms than they are at securing or running them. That is the real lesson, and there's a high chance it applies to you as well.

What’s Next?

If you're planning to get your platform assessed, the AWS Well-Architected Tool is available right in your AWS console. You can begin by working through its questions and following the linked guidance to better understand your own practices. But it gets much better if you engage a third party to review it for you. Why?

  • A third-party perspective is valuable: reviewers can both identify issues and explain their significance in your context.
  • Furthermore, if they're an AWS Well-Architected Partner, they can get you $5,000 in AWS credits to fix the issues found.

Here we're not just looking for knowledge and financial gains, but to learn how to actually establish scalable, consistent, and reliable outcomes. Thanks for staying till the very end of this blog. If you have any questions or suggestions, feel free to comment below.

Blog Pundits: Sanjeev Pandey and Sandeep Rawat

OpsTree is an End-to-End DevOps Solution Provider.

Connect with Us
