Unpacking Our Findings From Assessing Numerous Infrastructures – Part 1

The AWS Well-Architected Framework can help you streamline your approach to ensure resilient, consistent, and scalable outcomes.

You can’t improve what you don’t measure. 

When infrastructure isn't analyzed, changes become reactions to hefty cloud bills or security breaches. The AWS Well-Architected Framework can help you identify issues early and streamline your approach to ensure resilient, consistent, and scalable outcomes.

But what difference does the AWS WAR Framework make?

WAR offers a distinctive and valuable perspective on common practices in cloud computing. After reviewing numerous distinct workloads, it's fascinating to learn how various cloud engineering teams operate in the cloud: how they see things, what's important to them, and why they do certain things in a certain way. For us, the best part is being able to make assertions about what tech teams genuinely love about the cloud, beyond the hype.

We've deeply embedded the AWS WAR Framework into our consulting practice, and the question we're asked most often is:

Does AWS WAR promote adding more AWS services?

Not necessarily. It makes you aware of the areas where your system falls short of best practices and recommended guidelines, and how that could pose challenges in the long run.

Granted, AWS offers a plethora of services to serve every purpose. That said, the framework doesn't prevent you from choosing open-source alternatives or a managed service of your choice.

What Is Well-Architected, And What Does It Look Like?

As an Amazon framework, Well-Architected speaks directly to AWS users. Whether you're already in AWS, on-prem, or even on another cloud provider, WAR is a systematic approach to reviewing your architecture against best practices.

The Well-Architected Framework addresses six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. These pillars cater to various organizational needs to varying extents.

For instance,

  • If you operate in banking or pharmaceuticals, security and compliance might be the higher priority.
  • Startups, conversely, often prioritize cost optimization.
  • E-commerce marketplaces tend to prioritize performance efficiency.

It's important to note that all pillars are essential for businesses across all industries. However, the order of preference varies, and in many cases customers focus on a specific pillar and ignore the rest. This can cause a lot of problems, which is what we'll be discussing.

Let’s Begin With Our Findings

We observed that most teams are doing a great job of considering compliance, customer needs, and business priorities in their decision-making processes. Undoubtedly, surviving in a competitive market means making customer-driven changes.

I'll sum up the observations in five questions:

  • How are you optimizing for performance and efficiency?
  • Can you cut costs without compromising on quality?
  • Are your systems truly reliable?
  • How do you respond to a security incident?
  • How do you reduce defects, ease remediation, and improve flow into production? (Operational Excellence)

How Are You Optimizing For Performance And Efficiency?

Performance pillar questions probe architectural choices for infrastructure components like storage, databases, compute, and networking. Decisions should be based on understanding how services, applications, and data fit together. What we noticed here is that, by and large, teams are doing "okay" at picking the right services. Here's our analysis, broken down below:

Databases –

Most of them understood the type of data they dealt with. For example:

  • Companies dealing with relational data opted for relational databases like Amazon RDS (the most preferred) and Amazon Aurora.
  • Where data was document-based, we saw teams using Amazon DocumentDB.

What teams miss, however, is collecting and storing database performance metrics. This results in unanticipated downtime, hampered capacity planning, and extended issue-resolution times. To prevent this, they need to capture and analyze transactions per second, average query rates, response times, index usage, and the number of open connections.
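The "analyze" half of that advice can start very simply: compare each collected metric against a threshold and alert on anything out of range (or not collected at all). Here's a minimal sketch; the metric names and threshold values are illustrative assumptions, not AWS or database defaults.

```python
# Minimal sketch: check collected database metrics against thresholds.
# Thresholds below are illustrative assumptions; tune them per workload.

DB_THRESHOLDS = {
    "transactions_per_sec": (None, 5000),  # (min, max) acceptable values
    "avg_query_time_ms":    (None, 200),
    "open_connections":     (None, 90),    # e.g. % of max_connections
    "index_hit_ratio":      (0.95, None),  # share of reads served by indexes
}

def check_db_metrics(sample: dict) -> list[str]:
    """Return human-readable alerts for missing or out-of-range metrics."""
    alerts = []
    for name, (lo, hi) in DB_THRESHOLDS.items():
        value = sample.get(name)
        if value is None:
            # The exact gap we saw in reviews: the metric isn't captured at all.
            alerts.append(f"{name}: not collected")
        elif lo is not None and value < lo:
            alerts.append(f"{name}: {value} below {lo}")
        elif hi is not None and value > hi:
            alerts.append(f"{name}: {value} above {hi}")
    return alerts
```

Feeding this from CloudWatch or your database's own statistics views turns the missing metrics into a routine capacity-planning signal instead of a surprise.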

Compute –

When it came to choosing compute options, teams did a good job of analyzing their workload types, whether bursty or more steady-state. They've been actively learning AWS services through AWS webinars and workshops.

This is evident in their experimentation with different AWS compute options:

  • Developers used t2.small/t3.micro instances for experimentation due to their cost-effectiveness.
  • Some wanted to run their production workloads on Graviton instances to save costs without compromising on compute. They had many questions about this, and some sought our assistance.

Where teams go wrong is:

  1. Overprovisioning compute, and
  2. Choosing the wrong instance type

Suppose you've chosen the T3 instance family, which is known to work great for bursty workloads, but in reality you're doing a lot of consistent computation. You may see lags that hurt user experience, and your first instinct might be to increase the instance size, when all that was required was a different instance choice; an M5 instance might have handled the load better.
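A quick way to catch this mismatch is to compare your sustained average CPU against the T3 size's baseline: burstable instances earn CPU credits only while below the baseline, so sustained load above it drains credits until the instance throttles (or bills extra in "unlimited" mode). A rough sketch, using per-size baselines as published by AWS (verify against current docs before relying on them):

```python
# Rough fit check for burstable instances: sustained CPU above the T3
# baseline steadily drains CPU credits. Baselines below are the published
# per-size figures; verify against current AWS documentation.

T3_BASELINE_CPU_PCT = {
    "t3.micro": 10, "t3.small": 20, "t3.medium": 20, "t3.large": 30,
}

def burstable_fits(instance_type: str, sustained_avg_cpu_pct: float) -> bool:
    """True if the workload's sustained CPU stays at or under the baseline."""
    return sustained_avg_cpu_pct <= T3_BASELINE_CPU_PCT[instance_type]

def recommend(instance_type: str, sustained_avg_cpu_pct: float) -> str:
    if burstable_fits(instance_type, sustained_avg_cpu_pct):
        return f"{instance_type} is fine for this load"
    return "consider a fixed-performance family such as M5"
```

For example, a workload averaging 65% CPU on a t3.medium (20% baseline) is a steady-computation workload wearing a burstable costume, which is exactly the scenario described above.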

Storage –

In our reviews, nearly seventy percent of the tech teams opted for Amazon S3, mostly because of their familiarity with the service. At times, teams even recognized that EFS could establish a shared storage layer within an application, rather than looking for an alternative.

While teams excelled at understanding requirements, not many could grasp the configurability of each ecosystem component. For example, when using EBS for block storage in AWS, the number of achievable IOPS can be directly tied to the volume's size in certain cases. Similarly, various configuration options for EFS can impact its performance depending on the file types it hosts.
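The EBS case is concrete enough to write down: for gp2 volumes, baseline IOPS scale at 3 IOPS per GiB, floored at 100 and capped at 16,000 (this is the published gp2 behavior; gp3 decouples IOPS from size, so it doesn't apply there):

```python
# gp2 baseline IOPS scale with volume size: 3 IOPS per GiB,
# with a floor of 100 IOPS and a cap of 16,000 IOPS.
# (Applies to gp2 only; gp3 lets you provision IOPS independently of size.)

def gp2_baseline_iops(size_gib: int) -> int:
    return min(max(3 * size_gib, 100), 16_000)
```

So a 20 GiB gp2 volume gets only the 100 IOPS floor, a 1,000 GiB volume gets 3,000 IOPS, and anything past roughly 5,334 GiB hits the 16,000 cap. Teams surprised by a "slow" small volume are usually sitting on that floor.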

Can You Cut Costs Without Compromising On Quality?

One of our favorite pillars, and one where teams couldn't score well, is the Cost pillar.

Many teams adopt new services out of curiosity: they discover them and migrate from existing EC2 instances without a structured review. Or worse, teams don't stay updated on new AWS releases.

To better understand this, we need to see cost as more than money; it includes operational time spent. Teams can save long-term engineering time by adopting a managed version of a service. Suppose you're running self-managed Elasticsearch clusters: you'd burn a lot of time and energy on configuration, scaling, backups, and troubleshooting. Using a managed Elasticsearch service from the provider instead frees that time for something more business-critical.

Also, moving from an older generation to a newer generation of EC2 instances can yield around 20% to 25% in savings, and on top of that, newer generations offer better performance than their predecessors.
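That claim is cheap to sanity-check for your own fleet. Here's a back-of-the-envelope calculator; the hourly rates in the example are placeholders, so substitute current on-demand prices for your region from the AWS pricing page:

```python
# Back-of-the-envelope monthly savings from a generation upgrade.
# Hourly rates in the example below are placeholder assumptions;
# plug in real on-demand prices for your region.

HOURS_PER_MONTH = 730

def monthly_savings(old_hourly: float, new_hourly: float, count: int) -> float:
    """Monthly on-demand savings across `count` always-on instances."""
    return (old_hourly - new_hourly) * count * HOURS_PER_MONTH

# e.g. 20 always-on instances, assumed $0.10/hr old gen vs $0.08/hr new gen
# (a 20% price cut) -> roughly $292/month saved before any performance gains.
```

Run the same arithmetic with reserved or savings-plan rates if that's how you buy compute; the proportional gap is what matters.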

Similarly, what works at a small scale may not work as things scale up. For instance, AWS Lambda is amazing at scaling and is priced for what you "use", but that can make costs unpredictable. In one case, a Lambda function intended to clean up empty log streams was overwhelmed by a growing number of log groups, resulting in frequent, slow invocations due to rate-limiting on its CloudWatch API calls. This added up to an average of 176 hours of execution every hour, which led to a spiraling cloud bill.
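It's worth seeing why that number spirals. Lambda bills per GB-second, so runaway duration multiplies straight into the invoice even at a tiny memory setting. A hedged sketch of the arithmetic, using the commonly published x86 per-GB-second rate (check current AWS pricing before quoting it):

```python
# Why 176 execution-hours per hour spirals: Lambda bills per GB-second,
# so duration multiplies directly into the bill. The rate below is the
# commonly published x86 figure; verify against current AWS pricing.

PRICE_PER_GB_SECOND = 0.0000166667
HOURS_PER_MONTH = 730

def monthly_lambda_cost(exec_hours_per_hour: float, memory_gb: float) -> float:
    """Approximate monthly duration cost, excluding per-invocation fees."""
    gb_seconds_per_hour = exec_hours_per_hour * 3600 * memory_gb
    return gb_seconds_per_hour * HOURS_PER_MONTH * PRICE_PER_GB_SECOND

# 176 exec-hours/hour at just 128 MB:
# monthly_lambda_cost(176, 0.125) -> on the order of $950-1000/month,
# before invocation charges and the CloudWatch API calls themselves.
```

For a housekeeping function that deletes empty log streams, that's a bill driven almost entirely by waiting on throttled API calls, which is exactly the kind of cost a Well-Architected review surfaces.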

In the next part of the blog we’ll be discussing —

  • Are your systems truly reliable?
  • Is your security strategy robust enough?
  • How can you enhance operational excellence?

Wrapping Up!

In the ever-evolving tech landscape, achieving an optimized ecosystem can seem like a myth. It’s an ongoing process of improvement.

Now that we’re wrapping up — What do you think about the problems we discussed? Was there anything that sounded familiar to you? Do let us know in the comments. Also if you have any suggestions, you’re most welcome. 

Blog Pundits: Sanjeev Pandey and Sandeep Rawat

OpsTree is an End-to-End DevOps Solution Provider.

Connect with Us
