“How should I configure my EC2 infrastructure? What components should I use? What instance types should I choose? t3.large? m5.large?”

That’s one of the most common questions I hear from EC2 users. And it’s the question for which I have the most annoying answer of all: “It depends.” It’s an annoying answer because it doesn’t offer an immediate solution. Finding the right AWS infrastructure is not an easy task: there are many variables that come from your application requirements, and there are many knobs in AWS. Put all those variables together and we have an overwhelming number of options. Without proper testing, we can only guess. The only way to make an informed decision is to execute performance tests, monitor metrics, identify patterns, fine-tune, rinse and repeat.

So that’s what I decided to do for this post. I will show you a sample application and the real steps and tests I followed to find optimal EC2 infrastructure. Let’s get started…

Step 1 - What business function is this all about?

Systems are useless unless they fulfill a business function. The first thing we need to be clear about is how to articulate a business function in terms that can be quantified and translated into system requirements. In this example I will use a hypothetical WordPress site that is critical to our customer. Our customer knows today’s number of visitors to her site and is worried her current infrastructure won’t keep up with future growth. She will be happy only when she is certain her site can handle the expected volume one year from now. And when her site grows further, she wants certainty that adapting her AWS infrastructure for even more visitors will be a trivial task.

The requirements:
What type of EC2 infrastructure should I set up for my system to handle these requirements?

Step 2 - What are my system requirements?

WordPress is a Content Management System that powers millions of websites and blogs around the world. In its simplest form it consists of an Apache web server running on Linux that connects to a MySQL database, like this:

My system should do the following:
The system under test

Regarding my specific WordPress setup:
Architecture
Additional test and monitoring components:
The AWS architecture I configured looks like this:

Steps 3, 4, 5… test, measure, repeat

Test 1 - t3.large - 100 concurrent users

I started with a quick, small load test just to get an idea of how the system behaves. For a high-traffic, scalable, and resilient application like the one we want, I typically start with larger instance types. My first option for this test was to configure my Auto Scaling group with a Launch Configuration based on the t3.large instance type. Here are the basic specs for the t3.large:
When working with t3 instance types, it is critical to understand CPU Credits. The t3.large gives us a generous CPU utilization baseline of 30%, which means my CPU credit balance will not decrease as long as I keep utilization below 30%. I configured my Auto Scaling group to launch instances when average CPU utilization is greater than 30% and terminate them when average CPU utilization falls below 10%. I see this as a safe range that lets my instances react to traffic spikes and prevents me from paying unnecessarily for underutilized capacity.

Test results: 100 concurrent users with a random think time between 15 and 30 seconds produced 4 transactions per second. My Auto Scaling group stabilized at 2 instances, which peaked at 17% CPU utilization. Two t3.large instances seem appropriate for 100 concurrent users.

Test 2 - t3.large - 1,000 concurrent users

Right after completing the 100-user test, I ramped up to 500 users, waited a few minutes, and then continued to 1,000 users. It is important to increase the load gradually; otherwise we could get false negatives from suddenly multiplying the load on the system by a factor of 10. I kept the same Auto Scaling policy as in the previous test: scale out when CPU utilization is >= 30% and scale in when it is < 10%.

Test results: 1,000 concurrent users resulted in 45 transactions per second. The Auto Scaling group stabilized at seven t3.large instances at 25% average CPU utilization. This is a sustainable level that accrues CPU credits I could use in case of a traffic spike. Response times: median=229ms, P90=322ms, P99=764ms, max=1131ms. Keep in mind that my Locust load generator lives in the same region as my web servers, so these response times are lower than what real users would experience from different locations worldwide.
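Those throughput numbers are easy to sanity-check. In a closed-loop load test, each simulated user issues a request, receives the response, then “thinks” before the next request, so expected throughput is roughly users ÷ (think time + response time). A minimal sketch, where the 0.25 s response time is an assumption based on the measured medians:

```python
# Closed-loop throughput estimate: users / (avg think time + response time).
def expected_tps(users, think_lo_s, think_hi_s, resp_s=0.25):
    avg_think = (think_lo_s + think_hi_s) / 2  # uniform think time 15-30s -> 22.5s
    return users / (avg_think + resp_s)

print(round(expected_tps(100, 15, 30), 1))    # close to the 4 tps seen in Test 1
print(round(expected_tps(1000, 15, 30), 1))   # close to the 45 tps seen in Test 2
```

This is why throughput scales almost linearly with the number of simulated users here: think time dominates the denominator, so response time barely matters until the servers saturate.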
As long as we compare response times from the same load generator when testing different instance types, we’ll be comparing apples to apples.

Calculating data transfer (which is a bit tricky)
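The core of the data-transfer estimate is simple multiplication: bytes out per request times requests per second times seconds in a month. A rough sketch, where the 60 KB average response size and the $0.09/GB data-transfer-out price are assumptions for illustration (check your own page weight and the current AWS pricing):

```python
# Rough monthly data-transfer-out estimate for a steady request rate.
def monthly_transfer_gb(tps, avg_response_kb):
    seconds_per_month = 730 * 3600           # ~730 hours in a month
    kb_out = tps * avg_response_kb * seconds_per_month
    return kb_out / (1024 ** 2)              # KB -> GB

gb = monthly_transfer_gb(tps=45, avg_response_kb=60)   # 60 KB/page is an assumption
cost = gb * 0.09                                       # $0.09/GB is illustrative
print(f"{gb:,.0f} GB/month, roughly ${cost:,.0f} in data transfer out")
```

The tricky part in practice is that not every byte is billed the same: traffic between the load balancer and instances in the same region, cached assets, and CDN offload all change the number, so treat this as an upper bound.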
Estimated monthly cost in the US East region:

Test 3 - m5.large - 1,000 concurrent users

The next alternative for my experiment was the m5.large EC2 instance. m5.large instances feature a 2.6GHz processor optimized for EC2. The specs are very similar to those of the t3.large, but m5 instances don’t have a baseline CPU utilization, which means we don’t have to worry about CPU credits. They have 6.5 ECUs (EC2 Compute Units), the standard unit of compute power as measured by AWS. However, I can’t directly compare ECUs against the t3.large, since ECUs are variable on t3 instance types. The m5.large is slightly more expensive than the t3.large:
Test results: At 1,000 concurrent users, the Auto Scaling group stabilized at eight m5.large instances at an average CPU utilization of 28%. Not only will we pay more for 8 instances, but the average resource utilization is also a bit higher (28% vs. 25% in the t3.large test). Response times: median=277ms, P90=873ms, P99=5168ms, max=7067ms.

…and then I noticed something interesting, thanks to New Relic’s Physical Memory metrics:

What is it with this low memory consumption? This graph is basically telling me I don’t have to pay a premium for the higher memory specs that come with the t3.large and m5.large instances! I also looked at the disk read and write operation metrics, and they are close to zero. Since my servers are stateless, they barely perform any read or write operations, so I don’t have to pay a premium for EBS optimization either. So far my test results and metrics have been telling me that my application is CPU intensive. Wouldn’t it be nice to get similar CPU performance and avoid paying extra for resources I don’t need, such as memory and high EBS throughput? That’s why I decided to execute a test with t3.medium instances…

Test 4 - t3.medium - 1,000 concurrent users

For this test, I used an Auto Scaling launch configuration with t3.medium instances. t3.medium instances have the same computing power as t3.large ones but half the memory (and exactly half the cost). They have, however, a lower CPU utilization baseline of 20%, which means I have to keep CPU utilization below 20% to avoid depleting CPU credits. Even with this caveat, I still wanted to take a closer look. I configured the Auto Scaling policy to launch one new instance when average CPU utilization is >= 20% and terminate one instance when it drops below 10%. As in my previous tests, I gradually increased the load until I reached 1,000 concurrent users with a random think time between 15 and 30 seconds.
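The CPU-credit arithmetic behind that baseline is worth making explicit. One CPU credit equals one vCPU running at 100% for one minute, and a t3 instance earns credits at its baseline rate while spending them at its actual utilization. A small sketch with t3.medium defaults (2 vCPUs, 20% baseline):

```python
# Net t3 CPU credits per hour: earned at the baseline rate, spent at the
# actual utilization. One credit = one vCPU at 100% for one minute.
def net_credits_per_hour(util, vcpus=2, baseline=0.20):  # t3.medium defaults
    earned = baseline * vcpus * 60   # t3.medium accrues 24 credits/hour
    spent = util * vcpus * 60
    return earned - spent

print(net_credits_per_hour(0.18))   # below baseline: slowly accruing credits
print(net_credits_per_hour(0.30))   # during a spike: burning 12 credits/hour
```

This shows why the 20% scale-out threshold matters: as long as the fleet averages below the baseline, the credit balance grows, and short spikes above it are paid for out of that balance rather than throttled.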
Test results: At 1,000 concurrent users, the Auto Scaling group stabilized at 11 t3.medium instances and an 18% average CPU utilization. Memory consumption was below 10% on all servers. Response times: median=232ms, P90=296ms, P99=564ms, max=852ms.

So far, t3.mediums are winning: they are giving me better response times at a lower cost. And in case you were wondering, I generated 1,000 concurrent users with a single t2.nano instance running Locust, which peaked at around 10% CPU utilization. I really like t2.nanos for testing purposes. Last but not least, some relevant RDS metrics at 1,000 concurrent users confirm that the main bottleneck in this test was CPU utilization in the WordPress web servers.

Test 5 - t3.medium - traffic spike to 2,000 concurrent users

Now that I have settled on t3.medium instances, I still want to make sure they can handle a 100% spike in traffic. I increased the number of concurrent users to 2,000 over a 15-minute period, resulting in 90 transactions per second. This is intentionally steeper than the original requirement of a 100% increase over 1 hour. Overall CPU usage rose to around 30%, and the Auto Scaling group stabilized at 18 instances with a 19% average utilization across the group. I am confident the system will handle a 100% traffic spike over 15 minutes and still have wiggle room for an even steeper one. Response times: median=350ms, P90=835ms, P99=1836ms, max=2711ms. Response times are higher during the spike, mainly due to the fast ramp-up; once additional instances launched and CPU utilization stabilized at 19%, response times returned to the values seen in the 1,000-user test. One thing to note: your account’s EC2 instance limit should be set to more than 20 instances per region, otherwise you could find yourself unable to launch new instances when you most need them.

To summarize:
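To put rough numbers on the comparison, here is a sketch using the fleet sizes each instance type stabilized at with 1,000 concurrent users. The hourly prices are illustrative us-east-1 on-demand Linux rates and should be verified against the current AWS price list before relying on them:

```python
# Illustrative us-east-1 on-demand Linux prices (verify against the current
# AWS price list) and the fleet sizes observed at 1,000 concurrent users.
HOURLY = {"t3.large": 0.0832, "m5.large": 0.0960, "t3.medium": 0.0416}
FLEET = {"t3.large": 7, "m5.large": 8, "t3.medium": 11}

for itype, count in FLEET.items():
    monthly = HOURLY[itype] * count * 730   # ~730 hours per month
    print(f"{itype}: {count} instances, about ${monthly:,.2f}/month")
```

Even though the t3.medium fleet needs the most instances, its per-instance price is low enough that it comes out cheapest, which is the same conclusion the response-time numbers pointed to.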
Are you looking for ways to save money while running fast applications on EC2 or other AWS services?

I can certainly help you with planning and managing your AWS costs. In many cases, I can save my clients at least 25%. Just click the Schedule Consultation button below and let’s have a chat.