Minecraft unterschied zwischen runtask und schedule

Inhaltsverzeichnis Show

Engineering use-case (console)
Operations use-case (CLI)
How Spot placement score works
Using Spot placement score with AWS Management Console
Availability and pricing
Launch templates vs. launch configurations
How to determine where you are using launch configurations
How to transition to launch templates today
AWS Management Console
CloudFormation and Terraform
Game overview
Solution overview
Why Agones?
Other ways than Agones on EKS?
Overview of the Database
Spot Blueprints overview
What are Spot best practices?
Example tutorial
EC2 Auto Scaling and Spot Instances – a classic love story
The next level – moving from reactive to proactive Spot Capacity Rebalancing in EC2 Auto Scaling
Example tutorial – Web application workload
Spot Instances recap
Identifying suitable instance types for your web application
Configure your Auto Scaling group
Load balancing with Application Load Balancer
Leveraging Instance Termination Notices for graceful termination
Tying it all together
Introduction
Solution Overview
Application of Amazon SQS with Spot Instances
Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application
Handling Spot Instance interruptions
Testing the solution
Prerequisites
Testing On-Demand Instances
Testing On-Demand and Spot Instances
Walkthrough
Lowest-price (diversified over N pools) allocation strategy
Capacity-optimized allocation strategy
Characteristics of game-server workloads
Why use Amazon EC2 Spot Instances?
Interruption handling
Node drainage
Optimization strategies for Kubernetes specifications
Reference architecture
What if I want my game server to be redundant?
Scheduling mechanism
April & early May — 2018 Schedule
How does the new pricing model work?
Fewer interruptions
Architecture
Walkthrough
Launch the stack
Review the details
Create a Spot Fleet request
Prerequisites
How does immersive media work?
Processing and origination stack
How it works
Solution components
Spot Instance considerations
Configuration
Origin instances
Handler scripts
Camera source
Encoding and publishing
Live to On-Demand
Delivery and playback
What’s next?
Get ready for takeoff!
Session types
Session levels
Session locations
MONDAY 11/27
Breakout sessions
Chalk talks
TUESDAY 11/28
Breakout Sessions
Chalk Talks
WEDNESDAY 11/29
Birds of a Feather (BoF)
Breakout Sessions
Chalk Talks
THURSDAY 11/30
Breakout Sessions
Chalk Talks
FRIDAY 12/1
Breakout Sessions
Chalk Talks
Know before you go
Architecture
Walkthrough
Before you begin
Implementation Steps
Creating an ECS cluster running on Spot Instances
Choosing an allocation strategy
Using AWS CloudFormation to deploy your ECS cluster on Spot Instances
Handling termination
ECS on Spot Instances in action

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/implementing-interruption-tolerance-in-amazon-ec2-spot-with-aws-fault-injection-simulator/

This post is written by Steve Cole, WW SA Leader for EC2 Spot, and David Bermeo, Senior Product Manager for EC2.

On October 20, 2021, AWS released new functionality to the Amazon Fault Injection Simulator that supports triggering the interruption of Amazon EC2 Spot Instances. This functionality lets you test the fault tolerance of your software by interrupting instances on command. The triggered interruption will be preceded with a Rebalance Recommendation (RBR) and Instance Termination Notification (ITN) so that you can fully test your applications as if an actual Spot interruption had occurred.

In this post, we’ll provide two examples of how easy it has now become to simulate Spot interruptions and validate the fault-tolerance of an application or service. We will demonstrate testing an application through the console and a service via CLI.

Engineering use-case (console)

Whether you are building a Spot-capable product or service from scratch or evaluating the Spot compatibility of existing software, the first step in testing is identifying whether or not the software is tolerant of being interrupted.

In the past, one way this was accomplished was with an AWS open-source tool called the Amazon EC2 Metadata Mock. This tool let customers simulate a Spot interruption as if it had been delivered through the Instance Metadata Service (IMDS), which then let customers test how their code responded to an RBR or an ITN. However, this model wasn’t a direct plug-and-play solution with how an actual Spot interruption would occur, since the signal wasn’t coming from AWS. In particular, the method didn’t provide the centralized notifications available through Amazon EventBridge or Amazon CloudWatch Events that enabled off-instance activities like launching AWS Lambda functions or other orchestration work when an RBR or ITN was received.

Now, Fault Injection Simulator has removed the need for custom logic, since it lets RBR and ITN signals be delivered via the standard IMDS and event services simultaneously.

Let’s walk through the process in the AWS Management Console. We’ll identify an instance that’s hosting a perpetually-running queue worker that checks the IMDS before pulling messages from Amazon Simple Queue Service (SQS). It will be part of a service stack that is scaled in and out based on the queue depth. Our goal is to make sure that the IMDS is being polled properly so that no new messages are pulled once an ITN is received. The typical processing time of a message with this example is 30 seconds, so we can wait for an ITN (which provides a two minute warning) and need not act on an RBR.

First, we go to the Fault Injection Simulator in the AWS Management Console to create an experiment.

At the experiment creation screen, we start by creating an optional name (recommended for console use) and a description, and then selecting an IAM Role. If this is the first time that you’ve used Fault Injection Simulator, then you’ll need to create an IAM Role per the directions in the FIS IAM permissions documentation. I’ve named the role that we created ‘FIS.’ After that, I’ll select an action (interrupt) and identify a target (the instance).

First, I name the action. The Action type I want is to interrupt the Spot Instance: aws:ec2:send-spot-instance-interruptions. In the Action parameters, we are given the option to set the duration. The minimum value here is two minutes, below which you will receive an error since Spot Instances will always receive a two minute warning. The advantage here is that, by setting the durationBeforeInterruption to a value above two minutes, you will get the RBR (an optional point for you to respond) and ITN (the actual two minute warning) at different points in time, and this lets you respond to one or both.

The target instance that we launched is depicted in the following screenshot. It is a Spot Instance that was launched as a persistent request with its interruption action set to ‘stop’ instead of ‘terminate.’ The option to stop a Spot Instance, introduced in 2020, will let us restart the instance, log in and retrieve logs, update code, and perform other work necessary to implement Spot interruption handling.

Now that an action has been defined, we configure the target. We have the option of naming the target, which we’ve done here to match the Name tagged on the EC2 instance ‘qWorker’. The target method we want to use here is Resource ID, and then we can either type or select the desired instance from a drop-down list. Selection mode will be ‘all’, as there is only one instance. If we were using tags, which we will in the next example, then we’d be able to select a count of instances, up to five, instead of just one.

Once you’ve saved the Action, the Target, and the Experiment, then you’ll be able to begin the experiment by selecting the ‘Start from the Action’ menu at the top right of the screen.

After the experiment starts, you’ll be able to observe its state by refreshing the screen. Generally, the process will take just seconds, and you should be greeted by the Completed state, as seen in the following screenshot.

In the following screenshot, having opened an interruption log group created in CloudWatch Event Logs, we can see the JSON of the RBR.

Two minutes later, we see the ITN in the same log group.

Another two minutes after the ITN, we can see the EC2 instance is in the process of stopping (or terminating, if you elect).

Shortly after the stop is issued by EC2, we can see the instance stopped. It would now be possible to restart the instance and view logs, make code changes, or do whatever you find necessary before testing again.

Now that our experiment succeeded in interrupting our Spot Instance, we can evaluate the performance of the code running on the instance. It should have completed the processing of any messages already retrieved at the ITN, and it should have not pulled any new messages afterward.

This experiment can be saved for later use, but it will require selecting the specific instance each time that it’s run. We can also re-use the experiment template by using tags instead of an instance ID, as we’ll show in the next example. This shouldn’t prove troublesome for infrequent experiments, and especially those run through the console. Or, as we did in our example, you can set the instance interruption behavior to stop (versus terminate) and re-use the experiment as long as that particular instance continues to exist. When the experiments get more frequent, it might be advantageous to automate the process, possibly as part of the test phase of a CI/CD pipeline. Doing this is programmatically possible through the AWS CLI or SDK.

Operations use-case (CLI)

Once the developers of our product validate the single-instance fault tolerance, indicating that the target workload is capable of running on Spot Instances, then the next logical step is to deploy the product as a service on multiple instances. This will allow for more comprehensive testing of the service as a whole, and it is a key process in collecting valuable information, such as performance data, response times, error rates, and other metrics to be used in the monitoring of the service. Once data has been collected on a non-interrupted deployment, it is then possible to use the Spot interruption action of the Fault Injection Simulator to observe how well the service can handle RBR and ITN while running, and to see how those events influence the metrics collected previously.

When testing a service, whether it is launched as instances in an Amazon EC2 Auto Scaling group, or it is part of one of the AWS container services, such as Amazon Elastic Container Service (Amazon ECS) or the Amazon Elastic Kubernetes Service (EKS), EC2 Fleet, Amazon EMR, or across any instances with descriptive tagging, you now have the ability to trigger Spot interruptions to as many as five instances in a single Fault Injection Simulator experiment.

We’ll use tags, as opposed to instance IDs, to identify candidates for interruption to interrupt multiple Spot Instances simultaneously. We can further refine the candidate targets with one or more filters in our experiment, for example targeting only running instances if you perform an action repeatedly.

In the following example, we will be interrupting three instances in an Auto Scaling group that is backing a self-managed EKS node group. We already know the software will behave as desired from our previous engineering tests. Our goal here is to see how quickly EKS can launch replacement tasks and identify how the service as a whole responds during the event. In our experiment, we will identify instances that contain the tag aws:autoscaling:groupName with a value of “spotEKS”.

The key benefit here is that we don’t need a list of instance IDs in our experiment. Therefore, this is a re-usable experiment that can be incorporated into test automation without needing to make specific selections from previous steps like collecting instance IDs from the target Auto Scaling group.

We start by creating a file that describes our experiment in JSON rather than through the console:

{ "description": "interrupt multiple random instances in ASG", "targets": { "spotEKS": { "resourceType": "aws:ec2:spot-instance", "resourceTags": { "aws:autoscaling:groupName": "spotEKS" }, "selectionMode": "COUNT(3)" } }, "actions": { "interrupt": { "actionId": "aws:ec2:send-spot-instance-interruptions", "description": "interrupt multiple instances", "parameters": { "durationBeforeInterruption": "PT4M" }, "targets": { "SpotInstances": "spotEKS" } } }, "stopConditions": [ { "source": "none" } ], "roleArn": "arn:aws:iam::xxxxxxxxxxxx:role/FIS", "tags": { "Name": "multi-instance" } }

Then we upload the experiment template to Fault Injection Simulator from the command-line.

aws fis create-experiment-template --cli-input-json file://experiment.json

The response we receive returns our template along with an ID, which we’ll need to execute the experiment.

{ "experimentTemplate": { "id": "EXT3SHtpk1N4qmsn", ... } }

We then execute the experiment from the command-line using the ID that we were given at template creation.

aws fis start-experiment --experiment-template-id EXT3SHtpk1N4qmsn

We then receive confirmation that the experiment has started.

{ "experiment": { "id": "EXPaFhEaX8GusfztyY", "experimentTemplateId": "EXT3SHtpk1N4qmsn", "state": { "status": "initiating", "reason": "Experiment is initiating." }, ... } }

To check the status of the experiment as it runs, which for interrupting Spot Instances is quite fast, we can query the experiment ID for success or failure messages as follows:

aws fis get-experiment --id EXPaFhEaX8GusfztyY

And finally, we can confirm the results of our experiment by listing our instances through EC2. Here we use the following command-line before and after execution to generate pre- and post-experiment output:

aws ec2 describe-instances --filters\ Name='tag:aws:autoscaling:groupName',Values='spotEKS'\ Name='instance-state-name',Values='running'\ | jq .Reservations[].Instances[].InstanceId | sort

We can then compare this to identify which instances were interrupted and which were launched as replacements.

< "i-003c8d95c7b6e3c63" < "i-03aa172262c16840a" < "i-02572fa37a61dc319" --- > "i-04a13406d11a38ca6" > "i-02723d957dc243981" > "i-05ced3f71736b5c95"

Summary

In the previous examples, we have demonstrated through the console and command-line how you can use the Spot interruption action in the Fault Injection Simulator to ascertain how your software and service will behave when encountering a Spot interruption. Simulating Spot interruptions will help assess the fault-tolerance of your software and can assess the impact of interruptions in a running service. The addition of events can enable more tooling, and being able to simulate both ITNs and RBRs, along with the Capacity Rebalance feature of Auto scaling groups, now matches the end-to-end experience of an actual AWS interruption. Get started on simulating Spot interruptions in the console.

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/identifying-optimal-locations-for-flexible-workloads-with-spot-placement-score/

This post is written by Jessie Xie, Solutions Architect for EC2 Spot, and Peter Manastyrny, Senior Product Manager for EC2 Auto Scaling and EC2 Fleet.

Amazon EC2 Spot Instances let you run flexible, fault-tolerant, or stateless applications in the AWS Cloud at up to a 90% discount from On-Demand prices. Since we introduced Spot Instances back in 2009, we have been building new features and integrations with a single goal – to make Spot easy and efficient to use for your flexible compute needs.

Spot Instances are spare EC2 compute capacity in the AWS Cloud available for steep discounts. In exchange for the discount, Spot Instances are interruptible and must be returned when EC2 needs the capacity back. The location and amount of spare capacity available at any given moment is dynamic and changes in real time. This is why Spot workloads should be flexible, meaning they can utilize a variety of different EC2 instance types and can be shifted in real time to where the spare capacity currently is. You can use Spot Instances with tools such as EC2 Fleet and Amazon EC2 Auto Scaling which make it easy to run workloads on multiple instance types.

The AWS Cloud spans 81 Availability Zones across 25 Regions with plans to launch 21 more Availability Zones and 7 more Regions. However, until now there was no way to find an optimal location (either a Region or Availability Zone) to fulfill your Spot capacity needs without trying to launch Spot Instances there first. Today, we are excited to announce Spot placement score, a new feature that helps you identify an optimal location to run your workloads on Spot Instances. Spot placement score recommends an optimal Region or Availability Zone based on the amount of Spot capacity you need and your instance type requirements.

Spot placement score is useful for workloads that could potentially run in a different Region. Additionally, because the score takes into account your instance type selection, it can help you determine if your request is sufficiently instance type flexible for your chosen Region or Availability Zone.

How Spot placement score works

To use Spot placement score you need to specify the amount of Spot capacity you need, what your instance type requirements are, and whether you would like a recommendation for a Region or a single Availability Zone. For instance type requirements, you can either provide a list of instance types, or the instance attributes, like the number of vCPUs and amount of memory. If you choose to use the instance attributes option, you can then use the same attribute configuration to request your Spot Instances in the recommended Region or Availability Zone with the new attribute-based instance type selection feature in EC2 Fleet or EC2 Auto Scaling.

Spot placement score provides a list of Regions or Availability Zones, each scored from 1 to 10, based on factors such as the requested instance types, target capacity, historical and current Spot usage trends, and time of the request. The score reflects the likelihood of success when provisioning Spot capacity, with a 10 meaning that the request is highly likely to succeed. Provided scores change based on the current Spot capacity situation, and the same request can yield different scores when ran at different times. It is important to note that the score serves as a guideline, and no score guarantees that your Spot request will be fully or partially fulfilled.

You can also filter your score by Regions or Availability Zones, which is useful for cases where you can use only a subset of AWS Regions, for example any Region in the United States.

Let’s see how Spot placement score works in practice through an example.

Using Spot placement score with AWS Management Console

To try Spot placement score, log into your AWS account, select EC2, Spot Requests, and click on Spot placement score to open the Spot placement score window.

Here, you need provide your target capacity and instance type requirements by clicking on Enter requirements. You can enter target capacity as a number of instances, vCPUs, or memory. vCPUs and memory options are useful for vertically scalable workloads that are sized for a total amount of compute resources and can utilize a wide range of instance sizes. Target capacity is limited and based on your recent Spot usage with accounting for potential usage growth. For accounts that do not have recent Spot usage, there is a default limit aligned with the Spot Instances limit.

For instance type requirements, there are two options. First option is to select Specify instance attributes that match your compute requirements tab and enter your compute requirements as a number of vCPUs, amount of memory, CPU architecture, and other optional attributes. Second option is to select Manually select instance types tab and select instance types from the list.

Please note that you need to select at least three different instance types (that is, different families, generations, or sizes). If you specify a smaller number of instance types, Spot placement score will always yield a low score. Spot placement score is designed to help you find an optimal location to request Spot capacity tailored to your specific workload needs, but it is not intended to be used for getting high-level Spot capacity information across all Regions and instance types.

Let’s try to find an optimal location to run a workload that can utilize r5.8xlarge, c5.9xlarge, and m5.8xlarge instance types and is sized at 2000 instances.

Once you select 2000 instances under Target capacity, select r5.8xlarge, c5.9xlarge, and m5.8xlarge instances under Select instance types, and click Load placement score button, you will get a list of Regions sorted by score in a descending order. There is also an option to filter by specific Regions if needed.

The highest rated Region for your requirements turns out to be US East (N. Virginia) with a score of 8. The second closest contender is Europe (Ireland) with a score of 5. That tells you that right now the optimal Region for your Spot requirements is US East (N. Virginia).

Let’s now see if it is possible to get a higher score. Remember, the key best practice for Spot is to be flexible and utilize as many instance types as possible. To do that, press the Edit button on the Target capacity and instance type requirements tab. For the new request, keep the same target capacity at 2000, but expand the selection of instance types by adding similarly sized instance types from a variety of instance families and generations, i.e., r5.4xlarge, r5.12xlarge, m5zn.12xlarge, m5zn.6xlarge, m5n.8xlarge, m5dn.8xlarge, m5d.8xlarge, r5n.8xlarge, r5dn.8xlarge, r5d.8xlarge, c5.12xlarge, c5.4xlarge, c5d.12xlarge, c5n.9xlarge. c5d.9xlarge, m4.4xlarge, m4.16xlarge, m4.10xlarge, r4.8xlarge, c4.8xlarge.

After requesting the scores with updated requirements, you can see that even though the score in US East (N. Virginia) stays unchanged at 8, the scores for Europe (Ireland) and US West (Oregon) improved dramatically, both raising to 9. Now, you have a choice of three high-scored Regions to request your Spot Instances, each with a high likelihood to succeed.

To request Spot Instances based on the score, you can use EC2 Fleet or EC2 Auto Scaling. Please note, that the score implies that you use capacity-optimized Spot allocation strategy when requesting the capacity. If you use other allocation strategies, such as lowest-price, the result in the recommended Region or Availability Zone will not align with the score provided.

You can also request the scores at the Availability Zone level. This is useful for running workloads that need to have all instances in the same Availability Zone, potentially to minimize inter-Availability Zone data transfer costs. Workloads such as Apache Spark, which involve transferring a high volume of data between instances, would be a good use case for this. To get scores per Availability Zone you can check the box Provide placement scores per Availability Zone.

When requesting instances based on Availability Zone recommendation, you need to make sure to configure EC2 Fleet or EC2 Auto Scaling request to only use that specific Availability Zone.

With Spot placement score, you can test different instance type combinations at different points in time, and find the most optimal Region or Availability Zone to run your workloads on Spot Instances.

Availability and pricing

You can use Spot placement score today in all public and AWS GovCloud Regions with the exception of those based in China, where we plan to release later. You can access Spot placement score using the AWS Command Line Interface (CLI), AWS SDKs, and Management Console. There is no additional charge for using Spot placement score, you will only pay EC2 standard rates if provisioning instances based on recommendation.

To learn more about using Spot placement score, visit the Spot placement score documentation page. To learn more about best practices for using Spot Instances, see Spot documentation.

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/amazon-ec2-auto-scaling-will-no-longer-add-support-for-new-ec2-features-to-launch-configurations/

This post is written by Scott Horsfield, Principal Solutions Architect, EC2 Scalability and Surabhi Agarwal, Sr. Product Manager, EC2.

In 2010, AWS released launch configurations as a way to define the parameters of instances launched by EC2 Auto Scaling groups. In 2017, AWS released launch templates, the successor of launch configurations, as a way to streamline and simplify the launch process for Auto Scaling, Spot Fleet, Amazon EC2 Spot Instances, and On-Demand Instances. Launch templates define the steps required to create an instance, by capturing instance parameters in a resource that can be used across multiple services. Launch configurations have continued to live alongside launch templates but haven’t benefitted from all of the features we’ve added to launch templates.

Today, AWS is recommending that customers using launch configurations migrate to launch templates. We will continue to support and maintain launch configurations, but we will not be adding any new features to them. We will focus on adding new EC2 features to launch templates only. You can continue using launch configurations, and AWS is committed to supporting applications you have already built using them, but in order for you to take advantage of our most recent and upcoming releases, a migration to launch templates is recommended. Additionally, we plan to no longer support new instance types with launch configurations by the end of 2022. Our goal is to have all customers moved over to launch templates by then.

Moving to launch templates is simple to accomplish and can be done easily today. In this blog, we provide more details on how you can transition from launch configurations to launch templates. If you are unable to transition to launch templates due to lack of tooling or specific functions, or have any concerns, please contact AWS Support.

Launch templates vs. launch configurations

Launch configurations have been a part of Amazon EC2 Auto Scaling Groups since 2010. Customers use launch configurations to define Auto Scaling group configurations that include AMI and instance type definition. In 2017, AWS released launch templates, which reduce the number of steps required to create an instance by capturing all launch parameters within one resource that can be used across multiple services. Since then, AWS has released many new features such as Mixed Instance Policies with Auto Scaling groups, Targeted Capacity Reservations, and unlimited mode for burstable performance instances that only work with launch templates.

Launch templates provide several key benefits to customers, when compared to launch configurations, that can improve the availability and optimization of the workloads you host in Auto Scaling groups and allow you to access the full set of EC2 features when launching instances in an Auto Scaling group.

Some of the key benefits of launch templates when used with Auto Scaling groups include:

How to determine where you are using launch configurations

Use the Launch Configuration Inventory Script to find all of the launch configurations in your account. You can use this script to generate an inventory of launch configurations across all regions in a single account or all accounts in your AWS Organization.

The script can be run with a variety of options for different levels of account access. You can learn more about these options in this GitHub post. In its simplest form it will use the default credentials profile to inventory launch configurations across all regions in a single account.

Once the script has completed, you can view the generated inventory.csv file to get a sense of how many launch configurations may need to be converted to launch templates or deleted.

How to transition to launch templates today

If you’re ready to move to launch templates now, making the transition is simple and mostly automated through the AWS Management Console. For customers who do not use the AWS Management Console, most popular Infrastructure as Code (IaC), such as CloudFormation and Terraform, already support launch templates, as do the AWS CLI and SDKs.

To perform this transition, you will need to ensure that your user has the required permissions.

Here are some examples to get you started.

AWS Management Console

Open the EC2 Launch Configuration console. You must sign in if you are not already authenticated.
From the Launch Configuration console, click on the Copy to launch template button and select Copy all.
1. Alternatively, you can select individual launch configurations, and use the Copy selected option to selectively copy certain launch configurations.

Review the list of templates and click on the Copy button when you’re ready to proceed.

Once the copy process has completed, you can close the wizard.

Your launch templates are now ready to replace launch configurations in your Auto Scaling group configuration. Navigate to the Auto Scaling group console, select your Auto Scaling group, and click on the Edit.

Next, scroll down to the Launch configuration section, and click Switch to launch template.

Select your newly created Launch template, review and confirm your configuration, and when ready scroll down to the bottom of the page and click the Update button.
Now that you’ve migrated your launch configurations to launch templates you can prevent users from creating new launch configurations by updating their IAM permissions to deny the autoscaling:CreateLaunchConfiguration action.

Instances launched by this Auto Scaling group continue to run and are not automatically be replaced by making this change. Any instance launched after making this change uses the launch template for its configuration. As your Auto Scaling group scales up and down, the older instances are replaced. If you’d like to force an update, you can use Instance Refresh to ensure that all instances are running the same launch template and version.

CloudFormation and Terraform

If you use CloudFormation to create and manage your infrastructure, you should use the AWS::EC2::LaunchTemplate resource to create launch templates. After adding a launch template resource to your CloudFormation stack template file update your Auto Scaling group resource definition by adding a LaunchTemplate property and removing the existing LaunchConfigurationName property. We have several examples available to help you get started.

Using launch templates with Terraform is a similar process. Update your template file to include a aws_launch_template resource and then update your aws_autoscaling_group resources to reference the launch template.

In addition to making these changes, you may also want to consider adding a MixedInstancesPolicy to your Auto Scaling group. A MixedInstancesPolicy allows you to configure your Auto Scaling group with multiple instance types and purchase options. This helps improve the availability and optimization of your applications. Some examples of these benefits include using Spot Instances and On-Demand Instances within the same Auto Scaling group, combining CPU architectures such as Intel, AMD, and ARM (Graviton2), and having multiple instance types configured in case of a temporary capacity issue.

You can generate and configure example templates for CloudFormation and Terraform in the AWS Management Console.

AWS CLI

If you’re using the AWS CLI to create and manage your Auto Scaling groups, these examples will show you how to accomplish common tasks when using launch templates.

SDKs

AWS SDKs already include APIs for creating launch templates. If you’re using one of our SDKs to create and configure your Auto Scaling groups, you can find more information in the SDK documentation for your language of choice.

Next steps

We’re excited to help you take advantage of the latest EC2 features by making the transition to launch templates as seamless as possible. As we make this transition together, we’re here to help and will continue to communicate our plans and timelines for this transition. If you are unable to transition to launch templates due to lack of tooling or functionalities or have any concerns, please contact AWS Support. Also, stay tuned for more information on tools to help make this transition easier for you.

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-run-massively-multiplayer-games-with-ec2-spot-using-aurora-serverless/

This post is written by Yahav Biran, Principal Solutions Architect, and Pritam Pal, Sr. EC2 Spot Specialist SA

Massively multiplayer online (MMO) game servers must dynamically scale their compute and storage to create a world-scale persistence simulation with millions of dynamic objects, such as complex AR/VR synthetic environments that match real-world fidelity. The Elastic Kubernetes Service (EKS) powered by Amazon EC2 Spot and Aurora Serverless allow customers to create world-scale persistence simulations processed by numerous cost-effective compute chipsets, such as ARM, x86, or Nvidia GPU. It also persists them On-Demand — an automatic scaling configuration of open-source database engines like MySQL or PostgreSQL without managing any database capacity. This post proposes a fully open-sourced option to build, deploy, and manage MMOs on AWS. We use a Python-base game server to demonstrate the MMOs.

Challenge

Increasing competition in the gaming industry has driven game developers to architect cost-effective game servers that can scale up to meet player demands and scale down to meet cost goals. AWS enables MMO developers to scale their game beyond the limits of a single server. The game state (world) can spatially partition across many Regions based on the requested number of sessions or simulations using Amazon EC2 compute power. As the game progresses over many sessions, you must track the simulation’s global state. Amazon Aurora maintains the global game state in memory to manage complex interactions, such as hand-offs across instances and Regions. Amazon Aurora Serverless powers all of these for PostgreSQL or MySQL.

This blog shows how to use a commodity server using an ephemeral disk and decouple the game state from the game server. We store the game state in Aurora for PostgreSQL, but you can also use DynamoDB or KeySpaces for the NoSQL case.

Game overview

We use a Minecraft clone to demonstrate a distributed persistence simulation. The game server is python-based deployed on Agones, an open-source multiplayer dedicated game-server platform on Kubernetes. The Kubernetes cluster is powered by EC2 Spot Instances and configured with EC2 instances to auto-scale that expands and shrinks the compute seamlessly upon game-server allocation. We add a git-ops-based continuous delivery system that stores the game-server binaries and config in a git repository and deploys the game in a cluster deploy in one or more Regions to allow global compute scale. The following image is a diagram of the architecture.

The game server persists every object in an Aurora Serverless PostgreSQL-compatible edition. The serverless database configuration aids automatic start-up and scales capacity up or down as per player demand. The world is divided into 32×32 block chunks in the XYZ plane (Y is up). This allows it to be “infinite” (PostgreSQL Bigint type) and eases data management. Only visible chunks must be queried from the database.

The central database table is named “block” and has the columns p, q, x, y, z, w. (p, q) identifies the chunk, (x, y, z) identifies the block position, and (w) identifies the block type. 0 represents an empty block (air).

In the game, the chunks store their blocks in a hash map. An (x, y, z) key maps to a (w) value.

The y positions of blocks are limited to 0 <= y < 256. The upper limit is essentially an artificial limitation that prevents users from building tall structures. Users cannot destroy blocks at y = 0 to avoid falling underneath the world.

Solution overview

Kubernetes allows dedicated game server scaling and orchestration without limiting the compute platform spanning across many Regions and staying closer to the player. For simplicity, we use EKS to reduce operational overhead.

Amazon EC2 runs the compute simulation, which might require different EC2 instance types. These include compute-optimized instances for compute-bound applications benefiting from high-performance processors or accelerated compute (GPU), using hardware accelerators to perform functions like graphics processing or data pattern matching. In addition, the EC2 Auto Scaling runs the game-server configured to use Amazon EC2 Spot Instances in order to allow up to 90% discount as compared to On-Demand Instance prices. However, Amazon EC2 can interrupt your Spot Instance when the demand for Spot Instances rises, when the supply of Spot Instances decreases, or when the Spot price exceeds your maximum price.

The following two mechanisms minimize the EC2 reclaim compute capacity impact:

Pull interruption notifications and notify the game-server to replicate the session to another game server.
Prioritize compute capacity based on availability.

Two Auto Scaling groups are deployed for method two. The first Auto Scaling group uses latest generation Spot Instances (C5, C5a, C5n, M5, M5a, and M5n) instances, and the second uses all generations x86-based instances (C4 and M4). We configure the cluster-autoscaler that controls the Auto Scaling group size with the Expander priority option in order to favor the latest generation Spot Auto Scaling group. The priority should be a positive value, and the highest value wins. For each priority value, a list of regular expressions should be given. The following example assumes an EKS cluster craft-us-west-2 and two ASGs. The craft-us-west-2-nodegroup-spot5 Auto Scaling group wins the priority. Therefore, new instances will be spawned from the EC2 Spot Auto Scaling group.

.… - --expander=priority - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/craft-us-west-2 …. --- apiVersion: v1 kind: ConfigMap metadata: name: cluster-autoscaler-priority-expander namespace: kube-system data: priorities: |- 10: - .*craft-us-west-2-nodegroup-spot4* 50: - .*craft-us-west-2-nodegroup-spot5* ---

The complete working spec is available in https://github.com/aws-samples/spotable-game-server/blob/master/specs/cluster_autoscaler.yml.

We propose the following two options to minimize player impact during interruption. The first is based on two-minute interruption notification, and the second on rebalance recommendations.

Notify that that the instance will be shut down within two minutes.
Notify when a Spot Instance is at elevated risk of interruption, this signal can arrive sooner than the two-minute Spot Instance interruption notice.

Choosing between the two depends on the transition you want for the player. Both cases deploy DaemonSet that listens to either notification and notifies every game server running on the EC2 instance. It also prevents new game-servers from running on this instance.

The daemon set pulls from the instance metadata every five seconds denoted by POLL_INTERVAL as follows:

while http_status=$(curl -o /dev/null -w '%{http_code}' -sL ${NOTICE_URL}); [ ${http_status} -ne 200 ]; do echo $(date): ${http_status} sleep ${POLL_INTERVAL} done

where NOTICE_URL can be either

NOTICE_URL=”http://169.254.169.254/latest/meta-data/spot/termination-time”

Or, for the second option:

NOTICE_URL=”http://169.254.169.254/latest/meta-data/events/recommendations/rebalance”

The command that notifies all the game-servers about the interruption is:

kubectl drain ${NODE_NAME} --force --ignore-daemonsets --delete-local-data

From that point, every game server that runs on the instance gets notified by the Unix signal SIGTERM.

In our example server.py, we tell the OS to signal the Python process and complete the sig_handler function. The example prompts a message to every connected player regarding the incoming interruption.

def sig_handler(signum, frameframe): log('Signal handler called with signal',signum) model.send_talk("WARN game server maintenance is pending - your universe is saved") def main(): .. signal.signal(signal.SIGTERM,sig_handler)

Why Agones?

Agones orchestrates game servers via declarative configuration in order to manage groups of ready game-servers to play. It also offers integrated SDK for managing game server lifecycle, health, and configuration. Finally, it runs on Kubernetes, so it is an all-up open-source platform that runs anywhere. The Agones SDK is easily implemented. Furthermore, combining the compute platform AWS and Agones offers the most secure, resilient, scalable, and cost-effective method for running an MMO.

In our example, we implemented the /health in agones_health and /allocate in agones_allocate calls. Then, the agones_health() called upon the server init to indicate that it is ready to assign new players.

def agones_allocate(model): url="http://localhost:"+agones_port+"/allocate" req = urllib2.Request(url) req.add_header('Content-Type','application/json') req.add_data('') r = urllib2.urlopen(req) resp=r.getcode() log('agones- Response code from agones allocate was:',resp) model.send_talk("new player joined - reporting to agones the server is allocated")

The agones_health() using the native health and relay its health to keep the game-server in a viable state.

def agones_health(model): url="http://localhost:"+agones_port+"/health" req = urllib2.Request(url) req.add_header('Content-Type','application/json') req.add_data('') while True: model.ishealthy() r = urllib2.urlopen(req) resp=r.getcode() log('agones- Response code from agones health was:',resp) time.sleep(10)

The main() function forks a new process that reports health. Agones manages the port allocation and maintains the game state, e.g., Allocated, Scheduled, Shutdown, Creating, and Unhealthy.

def main(): … server = Server((host, port), Handler) server.model = model newpid=os.fork() if newpid ==0: log('agones-in child process about to call agones_health()') agones_health() log('agones-in child process called agones_health()') else: pids = (os.getpid(), newpid) log('agones server pid and health pid',pids) log('SERV', host, port) server.serve_forever()

Other ways than Agones on EKS?

Configuring an Agones group of ready game-servers behind a load balancer is difficult. Agones game-servers endpoint must be published so that players’ clients can connect and play. Agones creates game-servers endpoints that are an IP and port pair. The IP is the public IP of the EC2 Instance. The port results from PortPolicy, which generates a non-predictable port number. Hence, make it impossible to use with load balancer such as Amazon Network Load Balancer (NLB).

Suppose you want to use a load balancer to route players via a predictable endpoint. You could use the Kubernetes Deployment construct and configure a Kubernetes Service construct with NLB and no need to implement additional SDK or install any additional components on your EKS cluster.

The following example defines a service craft-svc that creates an NLB that will route TCP connections to Pod targets carrying the selector craft and listening on port 4080.

apiVersion: v1 kind: Service metadata: name: craft-svc annotations: service.beta.kubernetes.io/aws-load-balancer-type: nlb spec: selector: app: craft ports: - protocol: TCP port: 4080 targetPort: 4080 type: LoadBalancer

The game server Deployment set the metadata label for the Service load balancer and the port.

Furthermore, the Kubernetes readinessProbe and livenessProbe offer similar features as the Agones SDK /allocate and /health implemented prior, making the Deployment option parity with the Agones option.

apiVersion: extensions/v1beta1 kind: Deployment metadata: labels: app: craft name: craft spec: … metadata: labels: app: craft spec: … readinessProbe: tcpSocket: port: 4080 initialDelaySeconds: 5 periodSeconds: 10 livenessProbe: tcpSocket: port: 4080 initialDelaySeconds: 5 periodSeconds: 10

Overview of the Database

The compute layer running the game could be reclaimed at any time within minutes, so it is imperative to continuously store the game state as players progress without impacting the experience. Furthermore, it is essential to read the game state quickly from scratch upon session recovery from an unexpected interruption. Therefore, the game requires fast reads and lazy writes. More considerations are consistency and isolation. Developers could handle inconsistency via the game in order to relax hard consistency requirements. As for isolation, in our Craft example players can build a structure without other players seeing it and publish it globally only when they choose.

Choosing the proper database for the MMO depends on the game state structure, e.g., a set of related objects such as our Craft example or a single denormalized table. The former fits the relational model used with RDS or Aurora open-source databases such as MySQL or PostgreSQL. While the latter can be used with Keyspaces, an AWS managed Cassandra, or a key-value store such as DynamoDB. Our example includes two options to store the game state with Aurora Serverless for PostgreSQL or DynamoDB. We chose those because of the ACID support. PostgreSQL offers four isolation levels: dirty read, nonrepeatable read, phantom read, and serialization anomaly. DynamoDB offers two isolation levels: serializable and read-committed. Both databases’ options allow the game developer to implement the best player experience and avoid additional implementation efforts.

Moreover, both engines offer Restful connection methods to the database. Aurora uses Data API. The Data API doesn’t require a persistent DB cluster connection. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs. Game developers can run SQL statements via the endpoint without managing connections. DynamoDB only supports a Restful connection. Finally, Aurora Serverless and DynamoDB scale the compute and storage to reduce operational overhead and pay only for what was played.

Conclusion

MMOs are unique because they require infrastructure features similar to other game types like FPS or casual games, and reliable persistence storage for the game-state. Unfortunately, this leads to expensive choices that make monetizing the game difficult. Therefore, we proposed an option with a fun game to help you, the developer, analyze what is best for your MMO. Our option is built upon open-source projects that allow you to build it anywhere, but we also show that AWS offers the most cost-effective and scalable option. We encourage you to read recent announcements about these topics, including several at AWS re:Invent 2021.

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/introducing-spot-blueprints-a-template-generator-for-frameworks-like-kubernetes-and-apache-spark/

This post is authored by Deepthi Chelupati, Senior Product Manager for Amazon EC2 Spot Instances, and Chad Schmutzer, Principal Developer Advocate for Amazon EC2

Customers have been using EC2 Spot Instances to save money and scale workloads to new levels for over a decade. Launched in late 2009, Spot Instances are spare Amazon EC2 compute capacity in the AWS Cloud available for steep discounts off On-Demand Instance prices. One thing customers love about Spot Instances is their integration across many services on AWS, the AWS Partner Network, and open source software. These integrations unlock the ability to take advantage of the deep savings and scale Spot Instances provide for interruptible workloads. Some of the most popular services used with Spot Instances include Amazon EC2 Auto Scaling, Amazon EMR, AWS Batch, Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (EKS). When we talk with customers who are early in their cost optimization journey about the advantages of using Spot Instances with their favorite workloads, they typically can’t wait to get started. They often tell us that while they’d love to start using Spot Instances right away, flipping between documentation, blog posts, and the AWS Management Console is time consuming and eats up precious development cycles. They want to know the fastest way to start saving with Spot Instances, while feeling confident they’ve applied best practices. Customers have overwhelmingly told us that the best way for them to get started quickly is to see complete workload configuration examples in infrastructure as code templates for AWS CloudFormation and Hashicorp Terraform. To address this feedback, we launched Spot Blueprints.

Spot Blueprints overview

Today we are excited to tell you about Spot Blueprints, an infrastructure code template generator that lives right in the EC2 Spot Console. Built based on customer feedback, Spot Blueprints guides you through a few short steps. These steps are designed to gather your workload requirements while explaining and configuring Spot best practices along the way. Unlike traditional wizards, Spot Blueprints generates custom infrastructure as code in real time within each step so you can easily understand the details of the configuration. Spot Blueprints makes configuring EC2 instance type agnostic workloads easy by letting you express compute capacity requirements as vCPU and memory. Then the wizard automatically expands those requirements to a flexible list of EC2 instance types available in the AWS Region you are operating. Spot Blueprints also takes high-level Spot best practices like Availability Zone flexibility, and applies them to your workload compute requirements. For example, automatically including all of the required resources and dependencies for creating a Virtual Private Cloud (VPC) with all Availability Zones configured for use. Additional workload-specific Spot best practices are also configured, such as graceful interruption handling for load balancer connection draining, automatic job retries, or container rescheduling.

Spot Blueprints makes keeping up with the latest and greatest Spot features (like the capacity-optimized allocation strategy and Capacity Rebalancing for EC2 Auto Scaling) easy. Spot Blueprints is continually updated to support new features as they become available. Today, Spot Blueprints supports generating infrastructure as code workload templates for some of the most popular services used with Spot Instances: Amazon EC2 Auto Scaling, Amazon EMR, AWS Batch, and Amazon EKS. You can tell us what blueprint you’d like to see next right in Spot Blueprints. We are excited to hear from you!

What are Spot best practices?

We’ve mentioned Spot best practices a few times so far in this blog. Let’s quickly review the best practices and how they relate to Spot Blueprints. When using Spot Instances, it is important to understand a couple of points:

Spot Instances are interruptible and must be returned when EC2 needs the capacity back
The location and amount of spare capacity available at any given moment is dynamic and continually changes in real time

For these reasons, it is important to follow best practices when using Spot Instances in your workloads. We like to call these “Spot best practices.” These best practices can be summarized as follows:

Only run workloads that are truly interruption tolerant, meaning interruptible at both the individual instance level and overall application level
Spot workloads should be flexible, meaning they can be shifted in real time to where the spare capacity currently is, or otherwise be paused until spare capacity is available again
In practice, being flexible means qualifying a workload to run on multiple EC2 instance types (think big: multiple families, sizes, and generations), and in multiple Availability Zones, at any given time

Over the last few years, we’ve focused on making it easier to follow these best practices by adding features such as the following:

As mentioned prior, in addition to the Spot best practices we reviewed, there is an additional set of Spot best practices native to each workload that has integrated support for Spot Instances. For example, the ability to implement graceful interruption handling for:

Load balancer connection draining
Automatic job retries
Container draining and rescheduling

Spot Blueprints are designed to quickly explain and generate templates with Spot best practices for each specific workload. The custom-generated workload templates can be downloaded in either AWS CloudFormation or HashiCorp Terraform format, allowing for further customization and learning before being deployed in your environment.

Next, let’s walk through configuring and deploying an example Spot blueprint.

Example tutorial

In this example tutorial, we use Spot Blueprints to configure an Apache Spark environment running on Amazon EMR, deploy the template as a CloudFormation stack, run a sample job, and then delete the CloudFormation stack.

First, we navigate to the EC2 Spot console and click on “Spot Blueprints”:

In the next screen, we select the EMR blueprint, and then click “Configure blueprint”:

A couple of notes here:

If you are in a hurry, you can grab a preconfigured template to get started quickly. The preconfigured template has default best practices in place and can be further customized as needed.
If you have a suggestion for a new blueprint you’d like to see, you can click on “Don’t see a blueprint you need?” to give us feedback. We’d love to hear from you!

In Step 1, we give the blueprint a name and configure permissions. We have the blueprint create new IAM roles to allow Amazon EMR and Amazon EC2 compute resources to make calls to the AWS APIs on your behalf:

We see a summary of resources that will be created, along with a code snippet preview. Unlike traditional wizards, you can see the code generated in every step!

In Step 2, the network is configured. We create a new VPC with public subnets in all Availability Zones in the Region. This is a Spot best practice because it increases the number of Spot capacity pools, which increases the possibility for EC2 to find and provision the required capacity:

We see a summary of resources that will be created, along with a code snippet preview. Here is the CloudFormation code for our template, where you can see the VPC creation, including all dependencies such as the internet gateway, route table, and subnets:

attachGateway: DependsOn: - vpc - internetGateway Properties: InternetGatewayId: Ref: internetGateway VpcId: Ref: vpc Type: AWS::EC2::VPCGatewayAttachment internetGateway: DependsOn: - vpc Type: AWS::EC2::InternetGateway publicRoute: DependsOn: - publicRouteTable - internetGateway - attachGateway Properties: DestinationCidrBlock: 0.0.0.0/0 GatewayId: Ref: internetGateway RouteTableId: Ref: publicRouteTable Type: AWS::EC2::Route publicRouteTable: DependsOn: - vpc - attachGateway Properties: Tags: - Key: Name Value: Public Route Table VpcId: Ref: vpc Type: AWS::EC2::RouteTable vpc: Properties: CidrBlock: Fn::FindInMap: - CidrMappings - vpc - CIDR EnableDnsHostnames: true EnableDnsSupport: true Tags: - Key: Name Value: Ref: AWS::StackName Type: AWS::EC2::VPC publicSubnet1RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet1 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet1 publicSubnet1: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1a CidrBlock: 10.0.0.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ1) publicSubnet2RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet2 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet2 publicSubnet2: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1b CidrBlock: 10.0.1.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ2) publicSubnet3RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet3 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet3 publicSubnet3: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1c CidrBlock: 10.0.2.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ3) publicSubnet4RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet4 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet4 publicSubnet4: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1d CidrBlock: 10.0.3.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ4) publicSubnet5RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet5 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet5 publicSubnet5: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1e CidrBlock: 10.0.4.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ5) publicSubnet6RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet6 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet6 publicSubnet6: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1f CidrBlock: 10.0.5.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ6)

In Step 3, we configure the compute environment. Spot Blueprints makes this easy by allowing us to simply define our capacity units and required compute capacity for each type of node in our EMR cluster. In this walkthrough we will use vCPUs. We select 4 vCPUs for the core node capacity, and 12 vCPUs for the task node capacity. Based on these configurations, Spot Blueprints will apply best practices to the EMR cluster settings. These include using On-Demand Instances for core nodes since interruptions to core nodes can cause instability in the EMR cluster, and using Spot Instances for task nodes because the EMR cluster is typically more tolerant to task node interruptions. Finally, task nodes are provisioned using the capacity-optimized allocation strategy because it helps find the most optimal Spot capacity:

You’ll notice there is no need to spend time thinking about which instance types to select – Spot Blueprints takes our application’s minimum vCPUs per instance and minimum vCPU to memory ratio requirements and automatically selects the optimal instance types. This instance type selection process applies Spot best practices by a) using instance types across different families, generations, and sizes, and b) using the maximum number of instance types possible for core (5) and task (15) nodes:

Here is the CloudFormation code for our template. You can see the EMR cluster creation, the applications being installed (Spark, Hadoop, and Ganglia), the flexible list of instance types and Availability Zones, and the capacity-optimized allocation strategy enabled (along with all dependencies):

emrSparkInstanceFleetCluster: DependsOn: - vpc - publicRoute - publicSubnet1RouteTableAssociation - publicSubnet2RouteTableAssociation - publicSubnet3RouteTableAssociation - publicSubnet4RouteTableAssociation - publicSubnet5RouteTableAssociation - publicSubnet6RouteTableAssociation - spotBlueprintsEmrRole Properties: Applications: - Name: Spark - Name: Hadoop - Name: Ganglia Instances: CoreInstanceFleet: InstanceTypeConfigs: - InstanceType: c5.xlarge WeightedCapacity: 4 - InstanceType: c4.xlarge WeightedCapacity: 4 - InstanceType: c3.xlarge WeightedCapacity: 4 - InstanceType: m5.xlarge WeightedCapacity: 4 - InstanceType: m3.xlarge WeightedCapacity: 4 Name: Ref: AWS::StackName TargetOnDemandCapacity: "4" LaunchSpecifications: OnDemandSpecification: AllocationStrategy: lowest-price Ec2SubnetIds: - Ref: publicSubnet1 - Ref: publicSubnet2 - Ref: publicSubnet3 - Ref: publicSubnet4 - Ref: publicSubnet5 - Ref: publicSubnet6 MasterInstanceFleet: InstanceTypeConfigs: - InstanceType: c5.xlarge - InstanceType: c4.xlarge - InstanceType: c3.xlarge - InstanceType: m5.xlarge - InstanceType: m3.xlarge Name: Ref: AWS::StackName TargetOnDemandCapacity: "1" JobFlowRole: Ref: spotBlueprintsEmrEc2InstanceProfile Name: Ref: AWS::StackName ReleaseLabel: emr-5.30.1 ServiceRole: Ref: spotBlueprintsEmrRole Tags: - Key: Name Value: Ref: AWS::StackName VisibleToAllUsers: true Type: AWS::EMR::Cluster emrSparkInstanceTaskFleet: DependsOn: - emrSparkInstanceFleetCluster Properties: ClusterId: Ref: emrSparkInstanceFleetCluster InstanceFleetType: TASK InstanceTypeConfigs: - InstanceType: c5.xlarge WeightedCapacity: 4 - InstanceType: c4.xlarge WeightedCapacity: 4 - InstanceType: c3.xlarge WeightedCapacity: 4 - InstanceType: m5.xlarge WeightedCapacity: 4 - InstanceType: m3.xlarge WeightedCapacity: 4 - InstanceType: m4.xlarge WeightedCapacity: 4 - InstanceType: r4.xlarge WeightedCapacity: 4 - InstanceType: r5.xlarge WeightedCapacity: 4 - InstanceType: r3.xlarge WeightedCapacity: 4 - InstanceType: c4.2xlarge WeightedCapacity: 8 - InstanceType: c3.2xlarge WeightedCapacity: 8 - InstanceType: c5.2xlarge WeightedCapacity: 8 - InstanceType: m5.2xlarge WeightedCapacity: 8 - InstanceType: m4.2xlarge WeightedCapacity: 8 - InstanceType: m3.2xlarge WeightedCapacity: 8 LaunchSpecifications: SpotSpecification: TimeoutAction: TERMINATE_CLUSTER TimeoutDurationMinutes: 60 AllocationStrategy: capacity-optimized Name: TaskFleet TargetSpotCapacity: "12" Type: AWS::EMR::InstanceFleetConfig

In Step 4, we have the option of enabling EMR managed scaling on the cluster. Enabling EMR managed scaling is a Spot best practice because this allows EMR to automatically increase or decrease the compute capacity in the cluster based on the workload, further optimizing performance and cost. EMR continuously evaluates cluster metrics to make scaling decisions that optimize the cluster for cost and speed. Spot Blueprints automatically configures minimum and maximum scaling values based on the compute requirements we defined in the previous step:

Here is the updated CloudFormation code for our template with managed scaling (ManagedScalingPolicy) enabled:

In Step 5, we can review and download the template code for further customization in either CloudFormation or Terraform format, review the instance configuration summary, and review a summary of the resources that would be created from the template. Spot Blueprints will also upload the CloudFormation template to an Amazon S3 bucket so we can deploy the template directly from the CloudFormation console or CLI. Let’s go ahead and click on “Deploy in CloudFormation,” copy the URL, and then click on “Deploy in CloudFormation” again:

Having copied the S3 URL, we go to the CloudFormation console to launch the CloudFormation stack:

We walk through the CloudFormation stack creation steps and the stack launches, creating all of the resources as configured in the blueprint. It takes roughly 15-20 minutes for our stack creation to complete:

Once the stack creation is complete, we navigate to the Amazon EMR console to view the EMR cluster configured with Spot best practices:

Next, let’s run a sample Spark application written in Python to calculate the value of pi on the cluster. We’ll do this by uploading the sample application code to an S3 bucket in our account and then adding a step to the cluster referencing the application location code with arguments:

The step runs and completes successfully:

The results of the calculation are sent to our S3 bucket as configured in the arguments:

{"tries":1600000,"hits":1256253,"pi":3.1406325}

Cleanup

Now that our job is done, we delete the CloudFormation stack in order to remove the AWS resources created. Please note that as a part of the EMR cluster creation, EMR creates some EC2 security groups that cannot be removed by CloudFormation since they were created by the EMR cluster and not by CloudFormation. As a result, the deletion of the CloudFormation stack will fail to delete the VPC on the first try. To solve this, we have the option of deleting the VPC manually by hand, or we can let the CloudFormation stack ignore the VPC and leave it (along with the security groups) in place for future use. We can then delete the CloudFormation stack a final time:

Conclusion

No matter if you are a first-time Spot user learning how to take advantage of the savings and scale offered by Spot Instances, or a veteran Spot user expanding your Spot usage into a new architecture, Spot Blueprints has you covered. We hope you enjoy using Spot Blueprints to quickly get you started generating example template frameworks for workloads like Kubernetes and Apache Spark with Spot best practices in place. Please tell us which blueprint you’d like to see next right in Spot Blueprints. We can’t wait to hear from you!

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/proactively-manage-spot-instance-lifecycle-using-the-new-capacity-rebalancing-feature-for-ec2-auto-scaling/

By Deepthi Chelupati and Chad Schmutzer

AWS now offers Capacity Rebalancing for Amazon EC2 Auto Scaling, a new feature for proactively managing the Amazon EC2 Spot Instance lifecycle in an Auto Scaling group. Capacity Rebalancing complements the capacity optimized allocation strategy (designed to help find the most optimal spare capacity) and the mixed instances policy (designed to enhance availability by deploying across multiple instance types running in multiple Availability Zones). Capacity Rebalancing increases the emphasis on availability by automatically attempting to replace Spot Instances in an Auto Scaling group before they are interrupted by Amazon EC2.

In order to proactively replace Spot Instances, Capacity Rebalancing leverages the new EC2 Instance rebalance recommendation, a signal that is sent when a Spot Instance is at elevated risk of interruption. The rebalance recommendation signal can arrive sooner than the existing two-minute Spot Instance interruption notice, providing an opportunity to proactively rebalance a workload to new or existing Spot Instances that are not at elevated risk of interruption.

Capacity Rebalancing for EC2 Auto Scaling provides a seamless and automated experience for maintaining desired capacity through the Spot Instance lifecycle. This includes monitoring for rebalance recommendations, attempting to proactively launch replacement capacity for existing Spot Instances when they are at elevated risk of interruption, detaching from Elastic Load Balancing if necessary, and running lifecycle hooks as configured. This post provides an overview of using Capacity Rebalancing in EC2 Auto Scaling to manage your Spot Instance backed workloads, and dives into an example use case for taking advantage of Capacity Rebalancing in your environment.

EC2 Auto Scaling and Spot Instances – a classic love story

First, let’s review what Spot Instances are and why EC2 Auto scaling provides an optimal platform to manage your Spot Instance backed workloads. This will help illustrate how Capacity Rebalancing can benefit these workloads.

Spot Instances are spare EC2 compute capacity in the AWS Cloud available for steep discounts off On-Demand prices. In exchange for the discount, Spot Instances come with a simple rule – they are interruptible and must be returned when EC2 needs the capacity back. Where does this spare capacity come from? Since AWS builds capacity for unpredictable demand at any given time (think all 350+ instance types across 77 Availability Zones and 24 Regions), there is often excess capacity. Rather than let that spare capacity sit idle and unused, it is made available to be purchased as Spot Instances.

As you can imagine, the location and amount of spare capacity available at any given moment is dynamic and continually changes in real time. This is why it is extremely important for Spot customers to only run workloads that are truly interruption tolerant. Additionally, Spot workloads should be flexible, meaning they can be shifted in real time to where the spare capacity currently is (or otherwise be paused until spare capacity is available again). In practice, being flexible means qualifying a workload to run on multiple EC2 instance types (think big: multiple families, sizes, and generations), and in multiple Availability Zones, at any given time.

This is where EC2 Auto Scaling comes in. EC2 Auto Scaling is designed to help you maintain application availability. It also allows you to automatically add or remove EC2 instances according to conditions you define. We’ve continued to innovate on behalf of our customers by adding new features to EC2 Auto Scaling to natively support flexible configurations for EC2 workloads. One of these innovations is the mixed instances policy (launched in 2018), which supports multiple instance types and purchase options in a single Auto Scaling group. Another innovation is the capacity optimized allocation strategy (launched in 2019), an allocation strategy designed to locate optimal spare capacity for Spot Instances backed workloads. These features are aimed at supporting flexible workload best practices, and reacting to the dynamic shifts in capacity automatically.

The next level – moving from reactive to proactive Spot Capacity Rebalancing in EC2 Auto Scaling

The default behavior for EC2 Auto Scaling is to take a reactive approach to Spot Instance interruptions. This means that EC2 Auto Scaling attempts to replace an interrupted Spot Instance with another Spot Instance only after the instance has been shut down by EC2 and the health check fails. The reactive approach to interruptions works fine for many workloads. However, we have received feedback from customers requesting that EC2 Auto Scaling take a more proactive approach to handling Spot Instance interruptions.

Capacity Rebalancing in EC2 Auto Scaling is the answer to this request. Capacity Rebalancing is designed to take a proactive approach in handling the dynamic nature of EC2 capacity. It does this by monitoring for the EC2 Instance rebalance recommendation signal in addition to the “final” two-minute Spot Instance interruption notice. When a rebalance recommendation signal is detected, it automatically attempts to get a head start in replacing Spot Instances with new Spot Instances before they are shut down. In addition to attempting to maintain desired capacity through interruptions by launching replacement Spot Instances, Capacity Rebalancing gives customers the opportunity to gracefully remove Spot Instances from an Auto Scaling group by taking Spot Instances through the normal shut down process, such as deregistering from a load balancer and running terminating lifecycle hooks.

Capacity Rebalancing in EC2 Auto Scaling works best when combined with a few best practices. Let’s quickly review them:

Be flexible. Capacity Rebalancing thrives on flexibility, and works best when using the EC2 Auto Scaling mixed instances policy and as many instance types and Availability Zones as possible. Remember to think big and qualify multiple families, sizes, and generations for your workload, and use all Availability Zones if possible.
Use the capacity optimized allocation strategy. Capacity rebalance works optimally when combined with the capacity optimized allocation strategy and a flexible list of instance types and Availability Zones, because the goal is to find the optimal spare capacity to rebalance your workload on.
Take advantage of termination lifecycle hooks (optional). Termination lifecycle hooks are powerful in case you need to perform any final tasks before shutdown.

Example tutorial – Web application workload

Now that you understand the best practices for taking advantage of Capacity Rebalancing in EC2 Auto Scaling, let’s dive into the example workload. In this scenario, we have a web application powered by 75% Spot Instances and 25% On-Demand Instances in an Auto Scaling group, running behind an Application Load Balancer. We’d like to maintain availability, and have the Auto Scaling group automatically handle Spot Instance interruptions and rebalancing of capacity.

The Auto Scaling group configuration looks like this (note the best practices of instance type and Availability Zone flexibility combined with the capacity optimized allocation strategy in the mixed instances policy):

{ "AutoScalingGroupName": "myAutoScalingGroup", "CapacityRebalance": true, "DesiredCapacity": 12, "MaxSize": 15, "MinSize": 12, "MixedInstancesPolicy": { "InstancesDistribution": { "OnDemandBaseCapacity": 0, "OnDemandPercentageAboveBaseCapacity": 25, "SpotAllocationStrategy": "capacity-optimized" }, "LaunchTemplate": { "LaunchTemplateSpecification": { "LaunchTemplateName": "myLaunchTemplate", "Version": "$Default" }, "Overrides": [ { "InstanceType": "c5.large" }, { "InstanceType": "c5a.large" }, { "InstanceType": "m5.large" }, { "InstanceType": "m5a.large" }, { "InstanceType": "c4.large" }, { "InstanceType": "m4.large" }, { "InstanceType": "c3.large" }, { "InstanceType": "m3.large" } ] } }, "TargetGroupARNs": [ "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/a1b2c3d4e5f6g7h8" ], "VPCZoneIdentifier": "mySubnet1,mySubnet2,mySubnet3" }

Next, create the Auto Scaling group as follows:

aws autoscaling create-auto-scaling-group \ --cli-input-json file://myAutoScalingGroup.json

We also use a lifecycle hook to download logs before an instance is shut down:

aws autoscaling put-lifecycle-hook \ --lifecycle-hook-name myTerminatingHook \ --auto-scaling-group-name myAutoScalingGroup \ --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \ --heartbeat-timeout 300

In this example scenario, let’s say that the above config results in nine Spot Instances and three On-Demand instances being deployed in the Auto Scaling group, three Spot Instances, and one On-Demand instance in each Availability Zone. With Capacity Rebalancing enabled, if any of the nine Spot Instances receive the EC2 Instance rebalance recommendation signal, EC2 Auto Scaling will automatically request a replacement Spot Instance according to the allocation strategy (capacity optimized), resulting in 10 running Spot Instances. When the new Spot Instance passes EC2 health checks, it is joined to the load balancer and placed into service. Upon placing the new Spot Instance in service, EC2 Auto Scaling then proceeds with the shutdown process for the Spot Instance that has received the rebalance recommendation signal. It detaches the instance from the load balancer, drains connections, and then carries out the terminating lifecycle hook. Once the terminating lifecycle hook is complete, EC2 Auto Scaling shuts down the instance, bringing capacity back to nine Spot Instances.

Conclusion

Consider using the new Capacity Rebalancing feature for EC2 Auto Scaling in your environment to proactively manage Spot Instance lifecycle. Capacity Rebalancing attempts to maintain workload availability by automatically rebalancing capacity as necessary, providing a seamless and hands-off experience for managing Spot Instance interruptions. Capacity Rebalancing works best when combined with instance type flexibility and the capacity optimized allocation strategy, and may be especially useful for workloads that can easily rebalance across shifting capacity, including:

Containerized workloads
Big data and analytics
Image and media rendering
Batch processing
Web applications

To learn more about Capacity Rebalancing for EC2 Auto Scaling, please visit the documentation.

To learn more about the new EC2 Instance rebalance recommendation, please visit the documentation.

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/running-web-applications-on-amazon-ec2-spot-instances/

Author: Isaac Vallhonrat, Sr. EC2 Spot Specialist SA

Amazon EC2 Spot Instances allow customers to save up to 90% compared to On-Demand pricing by leveraging spare EC2 capacity. Spot Instances are a perfect fit for fault tolerant workloads that are flexible to run on multiple instance types such as batch jobs, code builds, load tests, containerized workloads, web applications, big data clusters, and High Performance Computing clusters. In this blog post, I walk you through how-to and best practices to run web applications on Spot Instances, so you can benefit from the scale and savings that they provide.

As Spot Instances are interruptible, your web application should be stateless, fault tolerant, loosely coupled, and rely on external data stores for persistent data, like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, etc.

Spot Instances recap

While Spot Instances have been around since 2009, recent updates and service integrations have made them easier than ever to integrate with your workloads. Before I get into the details of how to use them to host a web application, here is a brief recap of how Spot Instances work:

First, Spot Instances are a purchasing option within EC2. There is no difference in hardware when compared to On-Demand or Reserved Instances. The only difference is that these instances can be reclaimed with a two-minute notice if EC2 needs the capacity back.

A set of unused EC2 instances with the same instance type, operating system, and Availability Zone is a Spot capacity pool. Spot Instance pricing is set by Amazon EC2 and adjusts gradually based on long-term trends in supply and demand of EC2 instances in each pool. You can expect Spot pricing to be stable over time, meaning no sudden spikes or swings. You can view historical pricing data for the last three months in both the EC2 console and via the API. The following image is an example of the pricing history for m5.xlarge instances in the N. Virginia (us-east-1).

Figure 1. Spot Instance pricing history for an m5.xlarge Linux/UNIX instance in us-east-1. Price changes independently for each Availability Zone and in small increments over time based on long term supply and demand.

As each capacity pool is independent from each other, I recommend you spread the fleet of instances that power your workload across multiple Availability Zones (AZs) and be flexible to use multiple instance types. This way, you effectively increase the amount of spare capacity you can use to launch Spot Instances and are able to replace interrupted instances with instances from other pools.

The best way to adhere to Spot Instance flexibility best practices and managing your fleet of instances is using an Amazon EC2 Auto Scaling group. This allows you to combine multiple instance types and purchase options. Once you identify a list of suitable instance types for your workload, you can use Auto Scaling features to launch a combination of On-Demand and Spot Instances from optimal Spot Instance pools in each Availability Zone.

Stateless web applications are a natural fit for Spot Instances as they are fault tolerant, used to handle interruptions through scale-in events in an Auto Scaling group, and commonly run across multiple Availability Zones. It’s also normally easy to implement instance type flexibility on web applications, so they tap on Spot Instance capacity from multiple pools.

Identifying suitable instance types for your web application

Being instance type flexible is key to being successful with Spot Instances. At the time of this writing, AWS provides more than 270 instance types, so it’s likely that your web application can run on multiple of them. For example, if your application runs on the m5.xlarge instance type, it can also run on the m5d.xlarge instance type as it is the same instance type, but with local SSDs. Also, running your application on the r5.xlarge or r5d.xlarge performs similarly since they use the same processor family. Instead, your application runs on an instance with more memory and you benefit from the savings Spot Instances provide.

You may be able to qualify additional similarly-sized instance types from other families or generations, like the AMD-based variants m5a.xlarge and r5a.xlarge, or the higher bandwidth variants m5n.xlarge and r5n.xlarge. These may present some performance differences compared to other types due to their hardware and/or the virtualization system. However, Application Load Balancer has mechanisms in place to spread the load across your instances according to the number of outstanding requests they are processing. This means, instances handling longer requests or that have lower processing power are not overloaded. This allows you to acquire capacity from even more Spot capacity pools regardless of hardware differences. Later in the blog post, I dive into the details on the load balancing mechanisms.

Configure your Auto Scaling group

Now that you have qualified multiple instance types to run your web application, you need to create an EC2 Auto Scaling group.

The first step is to create a Launch Template where you specify configuration for your fleet of instances including the AMI, Security Groups, Tags, user data scripts, etc. You configure a single instance type on the template, here you configure the instance type you commonly use to run your web application. Also, do not mark the Request Spot Instances checkbox on the template. You configure multiple instance types and the Spot Instance settings when configuring the Auto Scaling group. Here’s my Launch Template:

Figure 2. Launch templates console

Now, go to the EC2 Auto Scaling console to create an Auto Scaling group. In the wizard, select the launch template you created and then click next. Then, on the second step of the wizard, select Combine purchase options and instance types. You can see the configuration options in the following image.

Figure 3. Combine purchase options and instance types configuration.

Here is a detailed look at the configuration details:

Optional On-Demand Base: This parameter specifies a baseline of On-Demand Instances within your Auto Scaling group. This is also handy if you have Reserved Instances or Savings Plans to cover your compute baseline, since On-Demand Instances apply towards your Savings Plans or the instances matching your reservation are billed as Reserved Instances.
On-Demand percentage above base: Once the size of the Auto Scaling group grows over the (optional) On-Demand base capacity, this parameter defines the percentage of instances that are On-Demand and Spot Instances. The default settings are a good starting point, set at 70% On-Demand and 30% Spot. You can adjust these percentages as suitable for your workload.
Spot Allocation Strategy per Availability Zone: This defines the logic that Auto Scaling uses to select which instance type to launch in each Availability Zone when launching Spot Instances. The possible Spot allocation strategies are:
- Capacity Optimized: This strategy allocates your instances from the Spot Instance pool with optimal capacity for the number of instances that are launching. This is the default selection and the recommended one, as launching from capacity pools with optimal capacity can lower the chance of interruption. Every time EC2 Auto Scaling scales out, the strategy is evaluated, so as your Auto Scaling group scales out and scales in, your instances are recycled and you make use of Spot Instances in the optimal pools.
- Lowest price: This strategy allocates your instances from the number (N) of Spot Instance pools that you specify and from the pools with the lowest price per unit at the time of fulfillment.

Next, you choose your instance types from the Instance Types section of the wizard. Here, there is a primary instance type inherited from your Launch Template. The console then automatically populates a list of same size instance types across other families and generations that could also fit your web application requirements. You can add or remove instance types as you see fit.

Figure 4. Selection of instance types

Note: There is an option to combine instance types of different sizes which is very helpful for queue worker nodes and similar workloads. We won’t cover this in this blog post as it is not relevant for web applications. You can read more about this feature here.

Once you select your instance types, your Auto Scaling group uses the configured Spot Instance allocation strategy to decide which Spot Instance types to launch on each Availability Zone. By specifying multiple instance types, EC2 Auto Scaling replenishes capacity from other Spot Instance pools applying your configured allocation strategy if some of your instances get interrupted. This maintains the size of your fleet and keeps your service available.

For the On-Demand part of your Auto Scaling group, the allocation strategy is prioritized. This means that EC2 Auto Scaling launches the primary instance type, and moves to the next types in the order you selected, in the unlikely event it cannot launch the capacity. If you have Reserved Instances to cover your compute baseline, set the instance type you reserved as the primary instance type in your Auto Scaling group, so the On-Demand Instances get the reservation discounts. You can learn more about how Reserved Instances are applied here.

After completing this step, complete the Auto Scaling group creation wizard, and you can move on to set up your Application Load Balancer.

Load balancing with Application Load Balancer

To balance the load across the fleet of instances running your web application you use an Application Load Balancer (ALB). EC2 Auto Scaling groups are fully integrated with ALB and Target Groups that manage the fleet of instances fronted by ALB.

Target groups are set up by default with a deregistration delay of 300 seconds. This means that when EC2 Auto Scaling needs to remove an instance out of the fleet, it first puts the instance in draining state, informs ALB to stop sending new requests to it, and allows the configured time for in-flight requests to complete before the instance is finally deregistered. This way, removing an instance from the fleet is not perceptible to your end users.

As Spot Instances are interrupted with a 2 minute warning, you need to adjust this setting to a slightly lower time. Recommended values are 90 seconds or less. This allows time for in-flight requests to complete and gracefully close existing connections before the instance is interrupted.

At the time of this writing, EC2 Auto Scaling is not aware of Spot Instance interruptions and does not trigger draining automatically when a Spot Instance is going to be interrupted. However, you can put the instance in draining state by catching the interruption notice and calling the EC2 Auto Scaling API. I cover this in more detail in the next section.

Target Groups allow you to configure a routing algorithm, which by default is Round Robin. As with Spot Instances you combine multiple instance types in your fleet, your web application may see some performance differences if they use different hardware and in some cases a different virtualization platform (Xen for the older generation instances and AWS Nitro for the newer ones). While the impact to response time may be negligible to the end user, you need to ensure your instances handle load fairly accounting for the processing power differences of each instance type.

With the Least Outstanding Requests load balancing algorithm, instead of distributing the requests across your fleet in a round-robin fashion, as a new request comes in, the load balancer sends it to the target with the least number of outstanding requests. This way backend instances with lower processing capabilities or processing long-standing requests are not burdened with more requests and the load is spread evenly across targets. This also helps the new targets effectively take load from overloaded targets.

These settings can be configured on the Target Groups section of the EC2 Console and editing the attributes of your target group.

Figure 5. ALB Target group attribute configuration.

Your architecture will look like the following image.

Figure 6. Stateless web application hosted on a combination of On-Demand and Spot Instances

Leveraging Instance Termination Notices for graceful termination

When a Spot Instance is going to be interrupted, EC2 triggers a Spot Instance interruption notice that is presented via the EC2 Metadata endpoint on the instance. It can also be caught by an Amazon EventBridge rule, for example, trigger an AWS Lambda function. You can find the specific documentation here.

By catching the interruption notice, you can take actions to gracefully take the instance out of the fleet without impacting your end users. For example, stop receiving new requests from the Load Balancer, allowing in-flight requests to finish and launch replacement capacity.

Below is a sample Spot Instance Interruption Notice caught by Amazon EventBridge:

{ "version": "0", "id": "72d6566c-e6bd-117d-5bbc-2779c679abd5", "detail-type": "EC2 Spot Instance Interruption Warning", "source": "aws.ec2", "account": "12345678910", "time": "2020-03-06T08:25:40Z", "region": "eu-west-1", "resources": [ "arn:aws:ec2:eu-west-1a:instance/i-0cefe48f36d7c281a" ], "detail": { "instance-id": "i-0cefe48f36d7c281a", "instance-action": "terminate" } }

Upon intercepting the interruption event, you can leverage the DetachInstances API call of EC2 Auto Scaling. Invoking this API puts the instance in Draining state, which makes the configured Load Balancer stop sending new requests to the instance that is going to be interrupted. The instance remains in this state during the configured deregistration delay on the Target Group to allow time for in-flight requests to complete.

Figure 7. Target Group console showing instance draining.

Also, if you set the ShouldDecrementDesiredCapacity parameter of the API call to False, EC2 Auto Scaling attempts to launch replacement Spot Instance capacity according to your instance type selection and allocation strategy. Below is an example code snippet calling the Auto Scaling API using Python Boto3.

def detach_instance_from_asg(instance_id,as_group_name): try: # detach instance from ASG and launch replacement instance response = asgclient.detach_instances( InstanceIds=[instance_id], AutoScalingGroupName=as_group_name, ShouldDecrementDesiredCapacity=False) logger.info(response['Activities'][0]['Cause']) except ClientError as e: error_message = "Unable to detach instance {id} from AutoScaling Group {asg_name}. ".format( id=instance_id,asg_name=as_group_name) logger.error( error_message + e.response['Error']['Message']) raise e

In some cases, you may also want to take actions at the operating system level when a Spot Instance is going to be interrupted. For example, you may want to gracefully stop your application so it closes open connections to a database, deregister agents running on the instance or other clean-up activities. You can leverage AWS Systems Manager Run Command to invoke commands on the instance that is going to be interrupted to perform these actions.

You may also want to delay the execution of commands to capitalize on the two minute warning. As a first step, you can check the instance termination time on the instance metadata and wait, for example, until 30 seconds before the interruption. Then, gracefully stop your application and issue other termination commands. Note that there are no charges for using Run Command (limits apply) and the API call is asynchronous so it doesn’t hold your Lambda function running.

Note: To use Run Command you must have the AWS Systems Manager agent running on your instance and meet a set of pre-requisites, like configuring an IAM instance profile. You can find more information here.

We provide you with a sample solution that handles all the steps outlined above here that you can easily deploy on your AWS account.

The sample solution also allows you to optionally execute commands upon Spot Instance interruption by creating an AWS Systems Manager parameter in Parameter Store specific to each Auto Scaling group with the commands you want to run, so you can easily customize termination commands to the needs of each application. You can find detailed configuration instructions in the README.md file. The high-level architecture of this solution is detailed below:

Figure 8. Serverless architecture to handle Spot Instance interruptions

Tying it all together

Now, that I covered all the relevant features to run web applications on Spot Instances, let’s put them all together and summarize:

Using mixed instance types and purchase options in EC2 Auto Scaling groups you can configure a combination of On-Demand and Spot Instances. Leverage this feature to configure multiple instance types to run your web application on Spot Instances, so you can leverage unused capacity from multiple Spot capacity pools and replace interrupted instances.
Using the capacity-optimized Spot Instance allocation strategy you can launch Spot Instances from your instance type selection based on the most available Spot Instance capacity pools. This reduces the chances of interruptions. Auto Scaling then replenishes Spot Instance capacity from the optimal pools if some of your instances are interrupted as Spot Instance capacity fluctuates.
With Application Load Balancer you can adjust the deregistration delay of your Target Group to a slightly lower time than the Spot Instance termination notice time (e.g. 90 seconds), so when an instance is going to be interrupted, you stop sending new requests to it and leverage the notice time to let in-flight requests complete and avoid impact to your end users. You can also configure the Least Outstanding Requests load balancing algorithm in your ALB Target Group so the load is distributed evenly across your fleet accounting for differences in processing power for different instance families.

You can leverage Amazon EventBridge, AWS Lambda and AWS Systems Manager Run Command to take interruption handling actions like put the instance in draining state, request EC2 Auto Scaling to launch a replacement instance and run commands to gracefully shutdown your application or trigger clean-up activities. You can also subscribe an SNS topic or other Lambda functions to the interruption event for alarming, logging or other purposes.

I hope you found this blog post useful, and if you are not using Spot Instances to run your stateless web applications today, I encourage you to try them to get the cost savings and scale they provide.

Isaac Vallhonrat is a Sr. Specialist Solutions Architect for EC2 Spot at AWS. He helps customers build resilient, fault tolerant and cost-effective architectures with EC2 Spot Instances across multiple workload types like web applications, containers, big data, CI/CD, etc.

Post Syndicated from peven original https://aws.amazon.com/blogs/compute/running-cost-effective-queue-workers-with-amazon-sqs-and-amazon-ec2-spot-instances/

This post is contributed by Ran Sheinberg | Sr. Solutions Architect, EC2 Spot & Chad Schmutzer | Principal Developer Advocate, EC2 Spot | Twitter: @schmutze

Introduction

Amazon Simple Queue Service (SQS) is used by customers to run decoupled workloads in the AWS Cloud as a best practice, in order to increase their applications’ resilience. You can use a worker tier to do background processing of images, audio, documents and so on, as well as offload long-running processes from the web tier. This blog post covers the benefits of pairing Amazon SQS and Spot Instances to maximize cost savings in the worker tier, and a customer success story.

Solution Overview

Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a common best practice to use Amazon SQS with decoupled applications. Amazon SQS increases applications resilience by decoupling the direct communication between the frontend application and the worker tier that does data processing. If a worker node fails, the jobs that were running on that node return to the Amazon SQS queue for a different node to pick up.

Both the frontend and worker tier can run on Spot Instances, which offer spare compute capacity at steep discounts compared to On-Demand Instances. Spot Instances optimize your costs on the AWS Cloud and scale your application’s throughput up to 10 times for the same budget. Spot Instances can be interrupted with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. These can include analytics, containerized workloads, high performance computing (HPC), stateless web servers, rendering, CI/CD, and queue worker nodes—which is the focus of this post.

Worker tiers of a decoupled application are typically fault-tolerant. So, it is a prime candidate for running on interruptible capacity. Amazon SQS running on Spot Instances allows for more robust, cost-optimized applications.

By using EC2 Auto Scaling groups with multiple instance types that you configured as suitable for your application (for example, m4.xlarge, m5.xlarge, c5.xlarge, and c4.xlarge, in multiple Availability Zones), you can spread the worker tier’s compute capacity across many Spot capacity pools (a combination of instance type and Availability Zone). This increases the chance of achieving the scale that’s required for the worker tier to ingest messages from the queue, and of keeping that scale when Spot Instance interruptions occur, while selecting the lowest-priced Spot Instances in each availability zone.

You can also choose the capacity-optimized allocation strategy for the Spot Instances in your Auto Scaling group. This strategy automatically selects instances that have a lower chance of interruption, which decreases the chances of restarting jobs due to Spot interruptions. When Spot Instances are interrupted, your Auto Scaling group automatically replenishes the capacity from a different Spot capacity pool in order to achieve your desired capacity. Read the blog post “Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances” for more details on how to choose the suitable allocation strategy.

We focus on three main points in this blog:

Best practices for using Spot Instances with Amazon SQS
A customer example that uses these components
Example solution that can help you get you started quickly

Application of Amazon SQS with Spot Instances

Amazon SQS eliminates the complexity of managing and operating message-oriented middleware. Using Amazon SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Amazon SQS is a fully managed service which allows you to set up a queue in seconds. It also allows you to use your preferred SDK to start writing and reading to and from the queue within minutes.

In the following example, we describe an AWS architecture that brings together the Amazon SQS queue and an EC2 Auto Scaling group running Spot Instances. The architecture is used for decoupling the worker tier from the web tier by using Amazon SQS. The example uses the Protect feature (which we will explain later in this post) to ensure that an instance currently processing a job does not get terminated by the Auto Scaling group when it detects that a scale-in activity is required due to a Dynamic Scaling Policy.

AWS reference architecture used for decoupling the worker tier from the web tier by using Amazon SQS

Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application

Trax decided to run its queue worker tier exclusively on Spot Instances due to the fault-tolerant nature of its architecture and for cost-optimization purposes. The company digitizes the physical world of retail using Computer Vision. Their ‘Trax Factory’ transforms individual shelf into data and insights about retail store conditions.

Built using asynchronous event-driven architecture, Trax Factory is a cluster of microservices in which the completion of one service triggers the activation of another service. The worker tier uses Auto Scaling groups with dynamic scaling policies to increase and decrease the number of worker nodes in the worker tier.

You can create a Dynamic Scaling Policy by doing the following:

Observe a Amazon CloudWatch metric. Watch the metric for the current number of messages in the Amazon SQS queue (ApproximateNumberOfMessagesVisible).
Create a CloudWatch alarm. This alarm should be based on that metric you created in the prior step.
Use your CloudWatch alarm in a Dynamic Scaling Policy. Use this policy increase and decrease the number of EC2 Instances in the Auto Scaling group.

In Trax’s case, due to the high variability of the number of messages in the queue, they opted to enhance this approach in order to minimize the time it takes to scale, by building a service that would call the SQS API and find the current number of messages in the queue more frequently, instead of waiting for the 5 minute metric refresh interval in CloudWatch.

Trax ensures that its applications are always scaled to meet the demand by leveraging the inherent elasticity of Amazon EC2 instances. This elasticity ensures that end users are never affected and/or service-level agreements (SLA) are never violated.

With a Dynamic Scaling Policy, the Auto Scaling group can detect when the number of messages in the queue has decreased, so that it can initiate a scale-in activity. The Auto Scaling group uses its configured termination policy for selecting the instances to be terminated. However, this policy poses the risk that the Auto Scaling group might select an instance for termination while that instance is currently processing an image. That instance’s work would be lost (although the image would eventually be processed by reappearing in the queue and getting picked up by another worker node).

To decrease this risk, you can use Auto Scaling groups instance protection. This means that every time an instance fetches a job from the queue, it also sends an API call to EC2 to protect itself from scale-in. The Auto Scaling group does not select the protected, working instance for termination until the instance finishes processing the job and calls the API to remove the protection.

Handling Spot Instance interruptions

This instance-protection solution ensures that no work is lost during scale-in activities. However, protecting from scale-in does not work when an instance is marked for termination due to Spot Instance interruptions. These interruptions occur when there’s increased demand for On-Demand Instances in the same capacity pool (a combination of an instance type in an Availability Zone).

Applications can minimize the impact of a Spot Instance interruption. To do so, an application catches the two-minute interruption notification (available in the instance’s metadata), and instructs itself to stop fetching jobs from the queue. If there’s an image still being processed when the two minutes expire and the instance is terminated, the application does not delete the message from the queue after finishing the process. Instead, the message simply becomes visible again for another instance to pick up and process after the Amazon SQS visibility timeout expires.

Alternatively, you can release any ongoing job back to the queue upon receiving a Spot Instance interruption notification by setting the visibility timeout of the specific message to 0. This timeout potentially decreases the total time it takes to process the message.

Testing the solution

If you’re not currently using Spot Instances in your queue worker tier, we suggest testing the approach described in this post.

For that purpose, we built a simple solution to demonstrate the capabilities mentioned in this post, using an AWS CloudFormation template. The stack includes an Amazon Simple Storage Service (S3) bucket with a CloudWatch trigger to push notifications to an SQS queue after an image is uploaded to the Amazon S3 bucket. Once the message is in the queue, it is picked up by the application running on the EC2 instances in the Auto Scaling group. Then, the image is converted to PDF, and the instance is protected from scale-in for as long as it has an active processing job.

To see the solution in action, deploy the CloudFormation template. Then upload an image to the Amazon S3 bucket. In the Auto Scaling Groups console, check the instance protection status on the Instances tab. The protection status is shown in the following screenshot.

You can also see the application logs using CloudWatch Logs:

/usr/local/bin/convert-worker.sh: Found 1 messages in https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB

/usr/local/bin/convert-worker.sh: Found work to convert. Details: INPUT=Capture1.PNG, FNAME=capture1, FEXT=png

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --protected-from-scale-in

/usr/local/bin/convert-worker.sh: Convert done. Copying to S3 and cleaning up

/usr/local/bin/convert-worker.sh: Running: aws s3 cp /tmp/capture1.pdf s3://qtest-s3bucket-18fdpm2j17wxx

/usr/local/bin/convert-worker.sh: Running: aws sqs --output=json delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB --receipt-handle

Conclusion

This post helps you architect fault tolerant worker tiers in a cost optimized way. If your queue worker tiers are fault tolerant and use the built-in Amazon SQS features, you can increase your application’s resilience and take advantage of Spot Instances to save up to 90% on compute costs.

In this post, we emphasized several best practices to help get you started saving money using Amazon SQS and Spot Instances. The main best practices are:

Diversifying your Spot Instances using Auto Scaling groups, and selecting the right Spot allocation strategy
Protecting instances from scale-in activities while they process jobs
Using the Spot interruption notification so that the application stop polling the queue for new jobs before the instance is terminated

We hope you found this post useful. If you’re not using Spot Instances in your queue worker tier, we suggest testing the approach described here. Finally, we would like to thank the Trax team for sharing its architecture and best practices. If you want to learn more, watch the “This is my architecture” video featuring Trax and their solution.

We’d love your feedback—please comment and let me know what you think.

About the authors

Ran Sheinberg is a specialist solutions architect for EC2 Spot Instances with Amazon Web Services. He works with AWS customers on cost optimizing their compute spend by utilizing Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC and others.

As a Principal Developer Advocate for EC2 Spot at AWS, Chad’s job is to make sure our customers are saving at scale by using EC2 Spot Instances to take advantage of the most cost-effective way to purchase compute capacity. Follow him on Twitter here! @schmutze

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
An Amazon SQS queue to manage the tasks.
A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete.
Amazon EC2 Auto Scaling groups to model scenarios.
Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, using 20 c5.2xlarge instances for five hours results in a cost of $0.384 per hour for each instance. The batch total is $38.40.

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient capacity for On-Demand Instances to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty it automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning. You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as checkpointing progress to some data store. This example has the SQS visibility timeout on the SQS queue set to nine minutes, so SQS re-queues any task that doesn’t complete in that time.

To test the impact of the new configuration another 6000 tasks are submitted into the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

Instance Type	Provisioning	Host Count	Hourly Instance Cost	Total 30-minute batch cost
c5.18xlarge	Spot	15	$ 1.2367	$ 9.2753
c5.2xlarge	Spot	22	$ 0.1547	$ 1.7017
c5.4xlarge	Spot	12	$ 0.2772	$ 1.6632
c5.9xlarge	Spot	11	$ 0.6239	$ 3.4315
c5.2xlarge	On-Demand	13	$ 0.3840	$ 2.4960
c5.4xlarge	On-Demand	3	$ 0.7680	$ 1.1520
c5.9xlarge	On-Demand	4	$ 1.7280	$ 3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/

AWS announces the new capacity-optimized allocation strategy for Amazon EC2 Auto Scaling and EC2 Fleet. This new strategy automatically makes the most efficient use of spare capacity while still taking advantage of the steep discounts offered by Spot Instances. It’s a new way for you to gain easy access to extra EC2 compute capacity in the AWS Cloud.

This post compares how the capacity-optimized allocation strategy deploys capacity compared to the current lowest-price allocation strategy.

Overview

Spot Instances are spare EC2 compute capacity in the AWS Cloud available to you at savings of up to 90% off compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by EC2 with two minutes of notification when EC2 needs the capacity back.

When making requests for Spot Instances, customers can take advantage of allocation strategies within services such as EC2 Auto Scaling and EC2 Fleet. The allocation strategy determines how the Spot portion of your request is fulfilled from the possible Spot Instance pools you provide in the configuration.

The existing allocation strategy available in EC2 Auto Scaling and EC2 Fleet is called “lowest-price” (with an option to diversify across N pools). This strategy allocates capacity strictly based on the lowest-priced Spot Instance pool or pools. The “diversified” allocation strategy (available in EC2 Fleet but not in EC2 Auto Scaling) spreads your Spot Instances across all the Spot Instance pools you’ve specified as evenly as possible.

As the AWS global infrastructure has grown over time in terms of geographic Regions and Availability Zones as well as the raw number of EC2 Instance families and types, so has the amount of spare EC2 capacity. Therefore it is important that customers have access to tools to help them utilize spare EC2 capacity optimally. The new capacity-optimized strategy for both EC2 Auto Scaling and EC2 Fleet provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics.

Walkthrough

To illustrate how the capacity-optimized allocation strategy deploys capacity compared to the existing lowest-price allocation strategy, here are examples of Auto Scaling group configurations and use cases for each strategy.

Lowest-price (diversified over N pools) allocation strategy

The lowest-price allocation strategy deploys Spot Instances from the pools with the lowest price in each Availability Zone. This strategy has an optional modifier SpotInstancePools that provides the ability to diversify over the N lowest-priced pools in each Availability Zone.

Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The lowest-price strategy does not account for pool capacity depth as it deploys Spot Instances.

As a result, the lowest-price allocation strategy is a good choice for workloads with a low cost of interruption that want the lowest possible prices, such as:

Time-insensitive workloads
Extremely transient workloads
Workloads that are easily check-pointed and restarted

Example

The following example configuration shows how capacity could be allocated in an Auto Scaling group using the lowest-price allocation strategy diversified over two pools:

{ "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale", "MixedInstancesPolicy": { "LaunchTemplate": { "LaunchTemplateSpecification": { "LaunchTemplateName": "my-launch-template", "Version": "$Latest" }, "Overrides": [ { "InstanceType": "c3.large" }, { "InstanceType": "c4.large" }, { "InstanceType": "c5.large" } ] }, "InstancesDistribution": { "OnDemandPercentageAboveBaseCapacity": 0, "SpotAllocationStrategy": "lowest-price", "SpotInstancePools": 2 } }, "MinSize": 10, "MaxSize": 100, "DesiredCapacity": 60, "HealthCheckType": "EC2", "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456" }

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to lowest-price over two SpotInstancePools.

First, EC2 Auto Scaling tries to make sure that it balances the requested capacity across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the lowest-price allocation strategy allocates the Spot Instance launches to the lowest-priced pool per zone.

Using the example Spot prices shown in the following table, the resulting allocation is:

20 Spot Instances from us-east-1a (10 c3.large, 10 c4.large)
20 Spot Instances from us-east-1b (10 c3.large, 10 c4.large)
20 Spot Instances from us-east-1c (10 c3.large, 10 c4.large)

Availability Zone	Instance type	Spot Instances allocated	Spot price
us-east-1a	c3.large	10	$0.0294
us-east-1a	c4.large	10	$0.0308
us-east-1a	c5.large	0	$0.0408
us-east-1b	c3.large	10	$0.0294
us-east-1b	c4.large	10	$0.0308
us-east-1b	c5.large	0	$0.0387
us-east-1c	c3.large	10	$0.0294
us-east-1c	c4.large	10	$0.0331
us-east-1c	c5.large	0	$0.0353

The cost for this Auto Scaling group is $1.83/hour. Of course, the Spot Instances are allocated according to the lowest price and are not optimized for capacity. The Auto Scaling group could experience higher interruptions if the lowest-priced Spot Instance pools are not as deep as others, since upon interruption the Auto Scaling group will attempt to re-provision instances into the lowest-priced Spot Instance pools.

Capacity-optimized allocation strategy

There is a price associated with interruptions, restarting work, and checkpointing. While the overall hourly cost of capacity-optimized allocation strategy might be slightly higher, the possibility of having fewer interruptions can lower the overall cost of your workload.

The effectiveness of the capacity-optimized allocation strategy depends on following Spot best practices by being flexible and providing as many instance types and Availability Zones (Spot Instance pools) as possible in the configuration. It is also important to understand that as capacity demands change, the allocations provided by this strategy also change over time.

Remember that Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The capacity-optimized strategy does account for pool capacity depth as it deploys Spot Instances, but it does not account for Spot prices.

As a result, the capacity-optimized allocation strategy is a good choice for workloads with a high cost of interruption, such as:

Big data and analytics
Image and media rendering
Machine learning
High performance computing

Example

The following example configuration shows how capacity could be allocated in an Auto Scaling group using the capacity-optimized allocation strategy:

{ "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale", "MixedInstancesPolicy": { "LaunchTemplate": { "LaunchTemplateSpecification": { "LaunchTemplateName": "my-launch-template", "Version": "$Latest" }, "Overrides": [ { "InstanceType": "c3.large" }, { "InstanceType": "c4.large" }, { "InstanceType": "c5.large" } ] }, "InstancesDistribution": { "OnDemandPercentageAboveBaseCapacity": 0, "SpotAllocationStrategy": "capacity-optimized" } }, "MinSize": 10, "MaxSize": 100, "DesiredCapacity": 60, "HealthCheckType": "EC2", "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456" }

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices (especially critical when using the capacity-optimized allocation strategy) and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to capacity-optimized.

First, EC2 Auto Scaling tries to make sure that the requested capacity is evenly balanced across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the capacity-optimized allocation strategy optimizes the Spot Instance launches by analyzing capacity metrics per instance type per zone. This is because this strategy effectively optimizes by capacity instead of by the lowest price (hence its name).

Using the example Spot prices shown in the following table, the resulting allocation is:

20 Spot Instances from us-east-1a (20 c4.large)
20 Spot Instances from us-east-1b (20 c3.large)
20 Spot Instances from us-east-1c (20 c5.large)

Availability Zone	Instance type	Spot Instances allocated	Spot price
us-east-1a	c3.large	0	$0.0294
us-east-1a	c4.large	20	$0.0308
us-east-1a	c5.large	0	$0.0408
us-east-1b	c3.large	20	$0.0294
us-east-1b	c4.large	0	$0.0308
us-east-1b	c5.large	0	$0.0387
us-east-1c	c3.large	0	$0.0294
us-east-1c	c4.large	0	$0.0308
us-east-1c	c5.large	20	$0.0353

The cost for this Auto Scaling group is $1.91/hour, only 5% more than the lowest-priced example above. However, notice the distribution of the Spot Instances is different. This is because the capacity-optimized allocation strategy determined this was the most efficient distribution from an available capacity perspective.

Conclusion

Consider using the new capacity-optimized allocation strategy to make the most efficient use of spare capacity. Automatically deploy into the most available Spot Instance pools—while still taking advantage of the steep discounts provided by Spot Instances.

This allocation strategy may be especially useful for workloads with a high cost of interruption, including:

Big data and analytics
Image and media rendering
Machine learning
High performance computing

No matter which allocation strategy you choose, you still enjoy the steep discounts provided by Spot Instances. These discounts are possible thanks to the stable Spot pricing made available with the new Spot pricing model.

Chad Schmutzer is a Principal Developer Advocate for the EC2 Spot team. Follow him on twitter to get the latest updates on saving at scale with Spot Instances, to provide feedback, or just say HI.

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/running-your-game-servers-at-scale-for-up-to-90-lower-compute-cost/

This post is contributed by Yahav Biran, Chad Schmutzer, and Jeremy Cowan, Solutions Architects at AWS

Many successful video games such Fortnite: Battle Royale, Warframe, and Apex Legends use a free-to-play model, which offers players access to a portion of the game without paying. Such games are no longer low quality and require premium-like quality. The business model is constrained on cost, and Amazon EC2 Spot Instances offer a viable low-cost compute option. The casual multiplayer games naturally fit the Spot offering. With the orchestration of Amazon EKS containers and the mechanism available to minimize the player impact and optimize the cost when running multiplayer game-servers workloads, both casual and hardcore multiplayer games fit the Spot Instance offering.

Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand Instances. Spot Instances enable you to optimize your costs and scale your application’s throughput up to 10 times for the same budget. Spot Instances are best suited for fault-tolerant workloads. Multiplayer game-servers are no exception: a game-server state is updated using real-time player inputs, which makes the server state transient. Game-server workloads can be disposable and take advantage of Spot Instances to save up to 90% on compute cost. In this blog, we share how to architect your game-server workloads to handle interruptions and effectively use Spot Instances.

Characteristics of game-server workloads

Simply put, multiplayer game servers spend most of their life updating current character position and state (mostly animation). The rest of the time is spent on image updates that result from combat actions, moves, and other game-related events. More specifically, game servers’ CPUs are busy doing network I/O operations by accepting client positions, calculating the new game state, and multi-casting the game state back to the clients. That makes a game server workload a good fit for general-purpose instance types for casual multiplayer games and, preferably, compute-optimized instance types for the hardcore multiplayer games.

AWS provides a wide variety for both compute-optimized (C5 and C4) and general-purpose (M5) instance types with Amazon EC2 Spot Instances. Because capacities fluctuate independently for each instance type in an Availability Zone, you can often get more compute capacity for the same price when using a wide range of instance types. For more information on Spot Instance best practices, see Getting Started with Amazon EC2 Spot Instances

One solution that customers use for running dedicated game-servers is Amazon GameLift. This solution deploys a fleet of Amazon GameLift FleetIQ and Spot Instances in an AWS Region. FleetIQ places new sessions on game servers based on player latencies, instance prices, and Spot Instance interruption rates so that you don’t need to worry about Spot Instance interruptions. For more information, see Reduce Cost by up to 90% with Amazon GameLift FleetIQ and Spot Instances on the AWS Game Tech Blog.

In other cases, you can use game-server deployment patterns like containers-based orchestration (such as Kubernetes, Swarm, and Amazon ECS) for deploying multiplayer game servers. Those systems manage a large number of game-servers deployed as Docker containers across several Regions. The rest of this blog focuses on this containerized game-server solution. Containers fit the game-server workload because they’re lightweight, start quickly, and allow for greater utilization of the underlying instance.

Why use Amazon EC2 Spot Instances?

A Spot Instance is the choice to run a disposable game server workload because of its built-in two-minute interruption notice that provides graceful handling. The two-minute termination notification enables the game server to act upon interruption. We demonstrate two examples for notification handling through Instance Metadata and Amazon CloudWatch. For more information, see “Interruption handling” and “What if I want my game-server to be redundant?” segments later in this blog.

Spot Instances also offer a variety of EC2 instances types that fit game-server workloads, such as general-purpose and compute-optimized (C4 and C5). Finally, Spot Instances provide low termination rates. The Spot Instance Advisor can help you choose a good starting point for determining which instance types have lower historical interruption rates.

Interruption handling

Avoiding player impact is key when using Spot Instances. Here is a strategy to avoid player impact that we apply in the proposed reference architecture and code examples available at Spotable Game Server on GitHub. Specifically, for Amazon EKS, node drainage requires draining the node via the kubectl drain command. This makes the node unschedulable and evicts the pods currently running on the node with a graceful termination period (terminationGracePeriodSeconds) that might impact the player experience. As a result, pods continue to run while a signal is sent to the game to end it gracefully.

Node drainage

Node drainage requires an agent pod that runs as a DaemonSet on every Spot Instance host to pull potential spot interruption from Amazon CloudWatch or instance metadata. We’re going to use the Instance Metadata notification. The following describes how termination events are handled with node drainage:

Launch the game-server pod with a default of 120 seconds (terminationGracePeriodSeconds). As an example, see this deploy YAML file on GitHub.
Provision a worker node pool with a mixed instances policy of On-Demand and Spot Instances. It uses the Spot Instance allocation strategy with the lowest price. For example, see this AWS CloudFormation template on GitHub.
Use the Amazon EKS bootstrap tool (/etc/eks/bootstrap.sh in the recommended AMI) to label each node with its instances lifecycle, either nDemand or Spot. For example:
- OnDemand: “–kubelet-extra-args –node labels=lifecycle=ondemand,title=minecraft,region=uswest2”
- Spot: “–kubelet-extra-args –node-labels=lifecycle=spot,title=minecraft,region=uswest2”
A daemon set deployed on every node pulls the termination status from the instance metadata endpoint. When a termination notification arrives, the `kubectl drain node` command is executed, and a SIGTERM signal is sent to the game-server pod. You can see these commands in the batch file on GitHub.
The game server keeps running for the next 120 seconds to allow the game to notify the players about the incoming termination.
No new game-server is scheduled on the node to be terminated because it’s marked as unschedulable.
A notification to an external system such as a matchmaking system is sent to update the current inventory of available game servers.

Optimization strategies for Kubernetes specifications

This section describes a few recommended strategies for Kubernetes specifications that enable optimal game server placements on the provisioned worker nodes.

Use single Spot Instance Auto Scaling groups as worker nodes. To accommodate the use of multiple Auto Scaling groups, we use Kubernetes nodeSelector to control the game-server scheduling on any of the nodes in any of the Spot Instance–based Auto Scaling groups.
nodeSelector: lifecycle: spot title: your game title
The lifecycle label is populated upon node creation through the AWS CloudFormation template in the following section:
BootstrapArgumentsForSpotFleet: Description: Sets Node Labels to set lifecycle as Ec2Spot Default: "--kubelet-extra-args --node-labels=lifecycle=spot,title=minecraft,region=uswest2" Type: String
You might have a case where the incoming player actions are served by UDP and masking the interruption from the player is required. Here, the game-server allocator (a Kubernetes scheduler for us) schedules more than one game server as target upstream servers behind a UDP load balancer that multicasts any packet received to the set of game servers. After the scheduler terminates the game server upon node termination, the failover occurs seamlessly. For more information, see “What if I want my game-server to be redundant?” later in this blog.

Reference architecture

The following architecture describes an instance mix of On-Demand and Spot Instances in an Amazon EKS cluster of multiplayer game servers. Within a single VPC, control plane node pools (Master Host and Storage Host) are highly available and thus run On-Demand Instances. The game-server hosts/nodes uses a mix of Spot and On-Demand Instances. The control plane, the API server is accessible via an Amazon Elastic Load Balancing Application Load Balancer with a preconfigured allowed list.

What if I want my game server to be redundant?

A game server is a sessionful workload, but it traditionally runs as a single dedicated game server instance with no redundancy. For game servers that use TCP as the transport network layer, AWS offers Network Load Balancers as an option for distributing player traffic across multiple game servers’ targets. Currently, game servers that use UDP don’t have similar load balancer solutions that add redundancy to maintain a highly available game server.

This section proposes a solution for the case where game servers deployed as containerized Amazon EKS pods use UDP as the network transport layer and are required to be highly available. We’re using the UDP load balancer because of the Spot Instances, but the choice isn’t limited to when you’re using Spot Instances.

The following diagram shows a reference architecture for the implementation of a UDP load balancer based on Amazon EKS. It requires a setup of an Amazon EKS cluster as suggested above and a set of components that simulate architecture that supports multiplayer game services. For example, this includes game-server inventory that captures the running game-servers, their status, and allocation placement.

The Amazon EKS cluster is on the left, and the proposed UDP load-balancer system is on the right. A new game server is reported to an Amazon SQS queue that persists in an Amazon DynamoDB table. When a player’s assignment is required, the match-making service queries an API endpoint for an optimal available game server through the game-server inventory that uses the DynamoDB tables.

The solution includes the following main components:

The game server (see mockup-udp-server at GitHub). This is a simple UDP socket server that accepts a delta of a game state from connected players and multicasts the updated state based on pseudo computation back to the players. It’s a single threaded server whose goal is to prove the viability of UDP-based load balancing in dedicated game servers. The model presented here isn’t limited to this implementation. It’s deployed as a single-container Kubernetes pod that uses hostNetwork: true for network optimization.
The load balancer (udp-lb). This is a containerized NGINX server loaded with the stream module. The load balance upstream set is configured upon initialization based on the dedicated game-server state that is stored in the DynamoDB table game-server-status-by-endpoint. Available load balancer instances are also stored in a DynamoDB table, lb-status-by-endpoint, to be used by core game services such as a matchmaking service.
An Amazon SQS queue that captures the initialization and termination of game servers and load balancers instances deployed in the Kubernetes cluster.
DynamoDB tables that persist the state of the cluster with regards to the game servers and load balancer inventory.
An API operation based on AWS Lambda (game-server-inventory-api-lambda) that serves the game servers and load balancers for an updated list of resources available. The operation supports /get-available-gs needed for the load balancer to set its upstream target game servers. It also supports /set-gs-busy/{endpoint} for labeling already claimed game servers from the available game servers inventory.
A Lambda function (game-server-status-poller-lambda) that the Amazon SQS queue triggers and that populates the DynamoDB tables.

Scheduling mechanism

Our goal in this example is to reduce the chance that two game servers that serve the same load-balancer game endpoint are interrupted at the same time. Therefore, we need to prevent the scheduling of the same game servers (mockup-UDP-server) on the same host. This example uses advanced scheduling in Kubernetes where the pod affinity/anti-affinity policy is being applied.

We define two soft labels, mockup-grp1 and mockup-grp2, in the podAffinity section as follows:

affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: "app" operator: In values: - mockup-grp1 topologyKey: "kubernetes.io/hostname"

The requiredDuringSchedulingIgnoredDuringExecution tells the scheduler that the subsequent rule must be met upon pod scheduling. The rule says that pods that carry the value of key: “app” mockup-grp1 will not be scheduled on the same node as pods with key: “app” mockup-grp2 due to topologyKey: “kubernetes.io/hostname”.

When a load balancer pod (udp-lb) is scheduled, it queries the game-server-inventory-api endpoint for two game-server pods that run on different nodes. If this request isn’t fulfilled, the load balancer pod enters a crash loop until two available game servers are ready.

Try it out

We published two examples that guide you on how to build an Amazon EKS cluster that uses Spot Instances. The first example, Spotable Game Server, creates the cluster, deploys Spot Instances, Dockerizes the game server, and deploys it. The second example, Game Server Glutamate, enhances the game-server workload and enables redundancy as a mechanism for handling Spot Instance interruptions.

Conclusion

Multiplayer game servers have relatively short-lived processes that last between a few minutes to a few hours. The current average observed life span of Spot Instances in US and EU Regions ranges between a few hours to a few days, which makes Spot Instances a good fit for game servers. Amazon GameLift FleetIQ offers native and seamless support for Spot Instances, and Amazon EKS offers mechanisms to significantly minimize the probability of interrupting the player experience. This makes the Spot Instances an attractive option for not only casual multiplayer game server but also hardcore game servers. Game studios that use Spot Instances for multiplayer game server can save up to 90% of the compute cost, thus benefiting them as well as delighting their players.

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/

AWS Online Tech Talks – June 2018

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

June 18, 2018 | 11:00 AM – 11:45 AM PT – Get Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.
June 20, 2018 | 11:00 AM – 11:45 AM PT – Insights For Everyone – Deploying Data across your Organization – Learn how to deploy data at scale using AWS Analytics and QuickSight’s new reader role and usage based pricing.

AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PT – Episode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Compute

June 25, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.

June 26, 2018 | 01:00 PM – 01:45 PM PT – Ensuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

Containers
June 25, 2018 | 09:00 AM – 09:45 AM PT – Running Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS including how setup masters, networking, security, and add auto-scaling to your cluster.

Databases

June 18, 2018 | 01:00 PM – 01:45 PM PT – Oracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.
DevOps

June 20, 2018 | 09:00 AM – 09:45 AM PT – Set Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.

Enterprise & Hybrid
June 18, 2018 | 09:00 AM – 09:45 AM PT – De-risking Enterprise Migration with AWS Managed Services – Learn how enterprise customers are de-risking cloud adoption with AWS Managed Services.

June 19, 2018 | 11:00 AM – 11:45 AM PT – Launch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new

AWS Environments

June 21, 2018 | 11:00 AM – 11:45 AM PT – Leading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.

June 21, 2018 | 01:00 PM – 01:45 PM PT – Enabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.

June 28, 2018 | 01:00 PM – 01:45 PM PT – Fireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT

June 27, 2018 | 11:00 AM – 11:45 AM PT – AWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

Machine Learning

June 19, 2018 | 09:00 AM – 09:45 AM PT – Integrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.

June 21, 2018 | 09:00 AM – 09:45 AM PT – Building Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

Management Tools

June 20, 2018 | 01:00 PM – 01:45 PM PT – Optimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PT – Drive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.

Security, Identity & Compliance

June 26, 2018 | 09:00 AM – 09:45 AM PT – Understanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
June 28, 2018 | 09:00 AM – 09:45 AM PT – Using Amazon Inspector to Discover Potential Security Issues – See how Amazon Inspector can be used to discover security issues of your instances.

Serverless

June 19, 2018 | 01:00 PM – 01:45 PM PT – Productionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

Storage

June 26, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connecting your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PT – Changing the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PT – Big Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-april-early-may-2018/

We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We’ll have AWS experts online to help answer questions in real-time. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

April & early May — 2018 Schedule

Compute

April 30, 2018 | 01:00 PM – 01:45 PM PT – Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads as well as process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.

May 1, 2018 | 01:00 PM – 01:45 PM PT – How to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.

May 2, 2018 | 01:00 PM – 01:45 PM PT – Deep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS’ GPU and FGPA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.

Containers

April 23, 2018 | 11:00 AM – 11:45 AM PT – New Features for Building Powerful Containerized Microservices on AWS (300) – Learn about how this new feature works and how you can start using it to build and run modern, containerized applications on AWS.

Databases

April 23, 2018 | 01:00 PM – 01:45 PM PT – ElastiCache: Deep Dive Best Practices and Usage Patterns (200) – Learn about Redis-compatible in-memory data store and cache with Amazon ElastiCache.

April 25, 2018 | 01:00 PM – 01:45 PM PT – Intro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.

DevOps

April 25, 2018 | 09:00 AM – 09:45 AM PT – Debug your Container and Serverless Applications with AWS X-Ray in 5 Minutes (300) – Learn how AWS X-Ray makes debugging your Container and Serverless applications fun.

Enterprise & Hybrid

April 23, 2018 | 09:00 AM – 09:45 AM PT – An Overview of Best Practices of Large-Scale Migrations (300) – Learn about the tools and best practices on how to migrate to AWS at scale.

April 24, 2018 | 11:00 AM – 11:45 AM PT – Deploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0

IoT

May 2, 2018 | 11:00 AM – 11:45 AM PT – How to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.

Machine Learning

April 24, 2018 | 09:00 AM – 09:45 AM PT – Automate for Efficiency with Amazon Transcribe and Amazon Translate (200) – Learn how you can increase the efficiency and reach your operations with Amazon Translate and Amazon Transcribe.

April 26, 2018 | 09:00 AM – 09:45 AM PT – Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sagemaker (200) – Learn more about developing machine learning applications for the IoT edge.

Mobile

April 30, 2018 | 11:00 AM – 11:45 AM PT – Offline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.

Networking

May 2, 2018 | 09:00 AM – 09:45 AM PT – Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used [email protected] to reduce latency and cloud costs for their customer’s websites.

Security, Identity & Compliance

April 30, 2018 | 09:00 AM – 09:45 AM PT – Amazon GuardDuty – Let’s Attack My Account! (300) – Amazon GuardDuty Test Drive – Practical steps on generating test findings.

May 3, 2018 | 09:00 AM – 09:45 AM PT – Protect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.

Serverless

April 24, 2018 | 01:00 PM – 01:45 PM PT – Tips and Tricks for Building and Deploying Serverless Apps In Minutes (200) – Learn how to build and deploy apps in minutes.

Storage

May 1, 2018 | 11:00 AM – 11:45 AM PT – Building Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select And Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.

May 3, 2018 | 11:00 AM – 11:45 AM PT – Integrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

Contributed by Deepthi Chelupati and Roshni Pary

Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs.

At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.

How does the new pricing model work?

You don’t have to bid for Spot Instances in the new pricing model, and you just pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price.

Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console.

As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update. However, after the pricing model update, prices are more predictable and change less frequently.

In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances:

You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests:

$ aws ec2 run-instances --instance-market-options '{"MarketType":"Spot", "SpotOptions": {"SpotPrice": "0.12"}}' \ --image-id ami-1a2b3c4d --count 1 --instance-type c4.2xlarge

The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.

Fewer interruptions

Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average.

To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available.

Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer 100s of instance options. We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/taking-advantage-of-amazon-ec2-spot-instance-interruption-notices/

Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back.

Customers have been taking advantage of Spot Instance interruption notices available via the instance metadata service since January 2015 to orchestrate their workloads seamlessly around any potential interruptions. Examples include saving the state of a job, detaching from a load balancer, or draining containers. Needless to say, the two-minute Spot Instance interruption notice is a powerful tool when using Spot Instances.

In January 2018, the Spot Instance interruption notice also became available as an event in Amazon CloudWatch Events. This allows targets such as AWS Lambda functions or Amazon SNS topics to process Spot Instance interruption notices by creating a CloudWatch Events rule to monitor for the notice.

In this post, I walk through an example use case for taking advantage of Spot Instance interruption notices in CloudWatch Events to automatically deregister Spot Instances from an Elastic Load Balancing Application Load Balancer.

Architecture

In this reference architecture, you use an AWS CloudFormation template to deploy the following:

After the AWS CloudFormation stack deployment is complete, you then create an Amazon EC2 Spot Fleet request diversified across both Availability Zones and use a couple of recent Spot Fleet features: Elastic Load Balancing integration and Tagging Spot Fleet Instances.

When any of the Spot Instances receives an interruption notice, Spot Fleet sends the event to CloudWatch Events. The CloudWatch Events rule then notifies both targets, the Lambda function and SNS topic. The Lambda function detaches the Spot Instance from the Application Load Balancer target group, taking advantage of nearly a full two minutes of connection draining before the instance is interrupted. The SNS topic also receives a message, and is provided as an example for the reader to use as an exercise.

EC2 Spot Instance Interruption Notices Reference Architecture Diagram

Walkthrough

To complete this walkthrough, have the AWS CLI installed and configured, as well as the ability to launch CloudFormation stacks.

Launch the stack

Go ahead and launch the CloudFormation stack. You can check it out from GitHub, or grab the template directly. In this post, I use the stack name “spot-spin-cwe“, but feel free to use any name you like. Just remember to change it in the instructions.

$ git clone https://github.com/awslabs/ec2-spot-labs.git $ aws cloudformation create-stack --stack-name spot-spin-cwe \ --template-body file://ec2-spot-labs/ec2-spot-interruption-notice-cloudwatch-events/ec2-spot-interruption-notice-cloudwatch-events.yaml \ --capabilities CAPABILITY_IAM

You should receive a StackId value in return, confirming the stack is launching.

{ "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/spot-spin-cwe/083e7ad0-0ade-11e8-9e36-500c219ab02a" }

Review the details

Here are the details of the architecture being launched by the stack.

IAM permissions

Give permissions to a few components in the architecture:

The Lambda function
The CloudWatch Events rule
The Spot Fleet

The Lambda function needs basic Lambda function execution permissions so that it can write logs to CloudWatch Logs. You can use the AWS managed policy for this. It also needs to describe EC2 tags as well as deregister targets within Elastic Load Balancing. You can create a custom policy for these.

lambdaFunctionRole: Properties: AssumeRolePolicyDocument: Statement: - Action: - sts:AssumeRole Effect: Allow Principal: Service: - lambda.amazonaws.com Version: 2012-10-17 ManagedPolicyArns: - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole Path: / Policies: - PolicyDocument: Statement: - Action: elasticloadbalancing:DeregisterTargets Effect: Allow Resource: '*' - Action: ec2:DescribeTags Effect: Allow Resource: '*' Version: '2012-10-17' PolicyName: Fn::Join: - '-' - - Ref: AWS::StackName - lambdaFunctionRole Type: AWS::IAM::Role

Allow CloudWatch Events to call the Lambda function and publish to the SNS topic.

lambdaFunctionPermission: Properties: Action: lambda:InvokeFunction FunctionName: Fn::GetAtt: - lambdaFunction - Arn Principal: events.amazonaws.com SourceArn: Fn::GetAtt: - eventRule - Arn Type: AWS::Lambda::Permission snsTopicPolicy: DependsOn: - snsTopic Properties: PolicyDocument: Id: Fn::GetAtt: - snsTopic - TopicName Statement: - Action: sns:Publish Effect: Allow Principal: Service: - events.amazonaws.com Resource: Ref: snsTopic Version: '2012-10-17' Topics: - Ref: snsTopic Type: AWS::SNS::TopicPolicy

Finally, Spot Fleet needs permissions to request Spot Instances, tag, and register targets in Elastic Load Balancing. You can tap into an AWS managed policy for this.

spotFleetRole: Properties: AssumeRolePolicyDocument: Statement: - Action: - sts:AssumeRole Effect: Allow Principal: Service: - spotfleet.amazonaws.com Version: 2012-10-17 ManagedPolicyArns: - arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole Path: / Type: AWS::IAM::Role

Elastic Load Balancing timeout delay

Because you are taking advantage of the two-minute Spot Instance notice, you can tune the Elastic Load Balancing target group deregistration timeout delay to match. When a target is deregistered from the target group, it is put into connection draining mode for the length of the timeout delay: 120 seconds to equal the two-minute notice.

loadBalancerTargetGroup: DependsOn: - vpc Properties: HealthCheckIntervalSeconds: 5 HealthCheckPath: / HealthCheckTimeoutSeconds: 2 Port: 80 Protocol: HTTP TargetGroupAttributes: - Key: deregistration_delay.timeout_seconds Value: 120 UnhealthyThresholdCount: 2 VpcId: Ref: vpc Type: AWS::ElasticLoadBalancingV2::TargetGroup

CloudWatch Events rule

To capture the Spot Instance interruption notice being published to CloudWatch Events, create a rule with two targets: the Lambda function and the SNS topic.

eventRule: DependsOn: - snsTopic Properties: Description: Events rule for Spot Instance Interruption Notices EventPattern: detail-type: - EC2 Spot Instance Interruption Warning source: - aws.ec2 State: ENABLED Targets: - Arn: Ref: snsTopic Id: Fn::GetAtt: - snsTopic - TopicName - Arn: Fn::GetAtt: - lambdaFunction - Arn Id: Ref: lambdaFunction Type: AWS::Events::Rule

Lambda function

The Lambda function does the heavy lifting for you. The details of the CloudWatch event are published to the Lambda function, which then uses boto3 to make a couple of AWS API calls. The first call is to describe the EC2 tags for the Spot Instance, filtering on a key of “TargetGroupArn”. If this tag is found, the instance is then deregistered from the target group ARN stored as the value of the tag.

import boto3 def handler(event, context): instanceId = event['detail']['instance-id'] instanceAction = event['detail']['instance-action'] try: ec2client = boto3.client('ec2') describeTags = ec2client.describe_tags(Filters=[{'Name': 'resource-id','Values':[instanceId],'Name':'key','Values':['loadBalancerTargetGroup']}]) except: print("No action being taken. Unable to describe tags for instance id:", instanceId) return try: elbv2client = boto3.client('elbv2') deregisterTargets = elbv2client.deregister_targets(TargetGroupArn=describeTags['Tags'][0]['Value'],Targets=[{'Id':instanceId}]) except: print("No action being taken. Unable to deregister targets for instance id:", instanceId) return print("Detaching instance from target:") print(instanceId, describeTags['Tags'][0]['Value'], deregisterTargets, sep=",") return

SNS topic

Finally, you’ve created an SNS topic as an example target. For example, you could subscribe an email address to the SNS topic in order to receive email notifications when a Spot Instance interruption notice is received.

snsTopic: Properties: DisplayName: SNS Topic for EC2 Spot Instance Interruption Notices Type: AWS::SNS::Topic

Create a Spot Fleet request

To proceed to creating your Spot Fleet request, use some of the resources that the CloudFormation stack created, to populate the Spot Fleet request launch configuration. You can find the values in the outputs values of the CloudFormation stack:

$ aws cloudformation describe-stacks --stack-name spot-spin-cwe

Using the output values of the CloudFormation stack, update the following values in the Spot Fleet request configuration:

%spotFleetRole%
%publicSubnet1%
%publicSubnet2%
%loadBalancerTargetGroup% (in two places)

Be sure to also replace %amiId% with the latest Amazon Linux AMI for your region and %keyName% with your environment.

{ "AllocationStrategy": "diversified", "IamFleetRole": "%spotFleetRole%", "LaunchSpecifications": [ { "ImageId": "%amiId%", "InstanceType": "c4.large", "Monitoring": { "Enabled": true }, "KeyName": "%keyName%", "SubnetId": "%publicSubnet1%,%publicSubnet2%", "UserData": "IyEvYmluL2Jhc2gKeXVtIC15IHVwZGF0ZQp5dW0gLXkgaW5zdGFsbCBodHRwZApjaGtjb25maWcgaHR0cGQgb24KaW5zdGFuY2VpZD0kKGN1cmwgaHR0cDovLzE2OS4yNTQuMTY5LjI1NC9sYXRlc3QvbWV0YS1kYXRhL2luc3RhbmNlLWlkKQplY2hvICJoZWxsbyBmcm9tICRpbnN0YW5jZWlkIiA+IC92YXIvd3d3L2h0bWwvaW5kZXguaHRtbApzZXJ2aWNlIGh0dHBkIHN0YXJ0Cg==", "TagSpecifications": [ { "ResourceType": "instance", "Tags": [ { "Key": "loadBalancerTargetGroup", "Value": "%loadBalancerTargetGroup%" } ] } ] } ], "TargetCapacity": 2, "TerminateInstancesWithExpiration": true, "Type": "maintain", "ReplaceUnhealthyInstances": true, "InstanceInterruptionBehavior": "terminate", "LoadBalancersConfig": { "TargetGroupsConfig": { "TargetGroups": [ { "Arn": "%loadBalancerTargetGroup%" } ] } } }

Save the configuration and place the Spot Fleet request:

$ aws ec2 request-spot-fleet --spot-fleet-request-config file://sfr.json

You should receive a SpotFleetRequestId in return, confirming the request:

{ "SpotFleetRequestId": "sfr-3cec4927-9d86-4cc5-a4f0-faa996c841b7" }

You can confirm that the Spot Fleet request was fulfilled by checking that ActivityStatus is “fulfilled”, or by checking that FulfilledCapacity is greater than or equal to TargetCapacity, while describing the request:

$ aws ec2 describe-spot-fleet-requests --spot-fleet-request-id sfr-3cec4927-9d86-4cc5-a4f0-faa996c841b7 { "SpotFleetRequestConfigs": [ { "ActivityStatus": "fulfilled", "CreateTime": "2018-02-08T01:23:16.029Z", "SpotFleetRequestConfig": { "AllocationStrategy": "diversified", "ExcessCapacityTerminationPolicy": "Default", "FulfilledCapacity": 2.0, … "TargetCapacity": 2, … } ] }

Next, you can confirm that the Spot Instances have been registered with the Elastic Load Balancing target group and are in a healthy state:

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/spot-loadB-1DZUVWL720VS6/26456d12cddbf23a { "TargetHealthDescriptions": [ { "Target": { "Id": "i-056c95d9dd6fde892", "Port": 80 }, "HealthCheckPort": "80", "TargetHealth": { "State": "healthy" } }, { "Target": { "Id": "i-06c4c47228fd999b8", "Port": 80 }, "HealthCheckPort": "80", "TargetHealth": { "State": "healthy" } } ] }

Test

In order to test, you can take advantage of the fact that any interruption action that Spot Fleet takes on a Spot Instance results in a Spot Instance interruption notice being provided. Therefore, you can simply decrease the target size of your Spot Fleet from 2 to 1. The instance that is interrupted receives the interruption notice:

$ aws ec2 modify-spot-fleet-request --spot-fleet-request-id sfr-3cec4927-9d86-4cc5-a4f0-faa996c841b7 --target-capacity 1 { "Return": true }

As soon as the interruption notice is published to CloudWatch Events, the Lambda function triggers and detaches the instance from the target group, effectively putting the instance in a draining state.

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/spot-loadB-1DZUVWL720VS6/26456d12cddbf23a { "TargetHealthDescriptions": [ { "Target": { "Id": "i-0c3dcd78efb9b7e53", "Port": 80 }, "HealthCheckPort": "80", "TargetHealth": { "State": "draining", "Reason": "Target.DeregistrationInProgress", "Description": "Target deregistration is in progress" } }, { "Target": { "Id": "i-088c91a66078b4299", "Port": 80 }, "HealthCheckPort": "80", "TargetHealth": { "State": "healthy" } } ] }

Conclusion

In conclusion, Amazon EC2 Spot Instance interruption notices are an extremely powerful tool when taking advantage of Amazon EC2 Spot Instances in your workloads, for tasks such as saving state, draining connections, and much more. I’d love to hear how you are using them in your own environment!

Chad Schmutzer
Solutions Architect

Chad Schmutzer is a Solutions Architect at Amazon Web Services based in Pasadena, CA. As an extension of the Amazon EC2 Spot Instances team, Chad helps customers significantly reduce the cost of running their applications, growing their compute capacity and throughput without increasing budget, and enabling new types of cloud computing applications.

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/building-an-immersive-vr-streaming-solution-on-aws/

This post was contributed by:

Konstantin Wilms
Solutions Architect

Shawn Przybilla
Solutions Architect

Chad Schmutzer
Solutions Architect

With the explosion in virtual reality (VR) technologies over the past few years, we’ve had an increasing number of customers ask us for advice and best practices around deploying their VR-based products and service offerings on the AWS Cloud. It soon became apparent that while the VR ecosystem is large in both scope and depth of types of workloads (gaming, e-medicine, security analytics, live streaming events, etc.), many of the workloads followed repeatable patterns, with storage and delivery of live and on-demand immersive video at the top of the list.

Looking at consumer trends, the desire for live and on-demand immersive video is fairly self-explanatory. VR has ushered in convenient and low-cost access for consumers and businesses to a wide variety of options for consuming content, ranging from browser playback of live and on-demand 360º video, all the way up to positional tracking systems with a high degree of immersion. All of these scenarios contain one lowest common denominator: video.

Which brings us to the topic of this post. We set out to build a solution that could support both live and on-demand events, bring with it a high degree of scalability, be flexible enough to support transformation of video if required, run at a low cost, and use open-source software to every extent possible.

In this post, we describe the reference architecture we created to solve this challenge, using Amazon EC2 Spot Instances, Amazon S3, Elastic Load Balancing, Amazon CloudFront, AWS CloudFormation, and Amazon CloudWatch, with open-source software such as NGINX, FFMPEG, and JavaScript-based client-side playback technologies. We step you through deployment of the solution and how the components work, as well as the capture, processing, and playback of the underlying live and on-demand immersive media streams.

This GitHub repository includes the source code necessary to follow along. We’ve also provided a self-paced workshop, from AWS re:Invent 2017 that breaks down this architecture even further. If you experience any issues or would like to suggest an enhancement, please use the GitHub issue tracker.

Prerequisites

As a side note, you’ll also need a few additional components to take best advantage of the infrastructure:

A camera/capture device capable of encoding and streaming RTMP video
A browser to consume the content.

You’re going to generate HTML5-compatible video (Apple HLS to be exact), but there are many other native iOS and Android options for consuming the media that you create. It’s also worth noting that your playback device should support projection of your input stream. We’ll talk more about that in the next section.

How does immersive media work?

At its core, any flavor of media, be that audio or video, can be viewed with some level of immersion. The ability to interact passively or actively with the content brings with it a further level of immersion. When you look at VR devices with rotational and positional tracking, you naturally need more than an ability to interact with a flat plane of video. The challenge for any creative thus becomes a tradeoff between immersion features (degrees of freedom, monoscopic 2D or stereoscopic 3D, resolution, framerate) and overall complexity.

Where can you start from a simple and effective point of view, that enables you to build out a fairly modular solution and test it? There are a few areas we chose to be prescriptive with our solution.

Source capture from the Ricoh Theta S

First, monoscopic 360-degree video is currently one of the most commonly consumed formats on consumer devices. We explicitly chose to focus on this format, although the infrastructure is not limited to it. More on this later.

Second, if you look at most consumer-level cameras that provide live streaming ability, and even many professional rigs, there are at least two lenses or cameras at a minimum. The figure above illustrates a single capture from a Ricoh Theta S in monoscopic 2D. The left image captures 180 degrees of the field of view, and the right image captures the other 180 degrees.

For this post, we chose a typical midlevel camera (the Ricoh Theta S), and used a laptop with open-source software (Open Broadcaster Software) to encode and stream the content. Again, the solution infrastructure is not limited to this particular brand of camera. Any camera or encoder that outputs 360º video and encodes to H264+AAC with an RTMP transport will work.

Third, capturing and streaming multiple camera feeds brings additional requirements around stream synchronization and cost of infrastructure. There is also a requirement to stitch media in real time, which can be CPU and GPU-intensive. Many devices and platforms do this either on the device, or via outboard processing that sits close to the camera location. If you stitch and deliver a single stream, you can save the costs of infrastructure and bitrate/connectivity requirements. We chose to keep these aspects on the encoder side to save on cost and reduce infrastructure complexity.

Last, the most common delivery format that requires little to no processing on the infrastructure side is equirectangular projection, as per the above figure. By stitching and unwrapping the spherical coordinates into a flat plane, you can easily deliver the video exactly as you would with any other live or on-demand stream. The only caveat is that resolution and bit rate are of utmost importance. The higher you can push these (high bit rate @ 4K resolution), the more immersive the experience is for viewers. This is due to the increase in sharpness and reduction of compression artifacts.

Knowing that we would be transcoding potentially at 4K on the source camera, but in a format that could be transmuxed without an encoding penalty on the origin servers, we implemented a pass-through for the highest bit rate, and elected to only transcode lower bitrates. This requires some level of configuration on the source encoder, but saves on cost and infrastructure. Because you can conform the source stream, you may as well take advantage of that!

For this post, we chose not to focus on ways to optimize projection. However, the reference architecture does support this with additional open source components compiled into the FFMPEG toolchain. A number of options are available to this end, such as open source equirectangular to cubic transformation filters. There is a tradeoff, however, in that reprojection implies that all streams must be transcoded.

Processing and origination stack

To get started, we’ve provided a CloudFormation template that you can launch directly into your own AWS account. We quickly review how it works, the solution’s components, key features, processing steps, and examine the main configuration files. Following this, you launch the stack, and then proceed with camera and encoder setup.

Immersive streaming reference architecture

The event encoder publishes the RTMP source to multiple origin elastic IP addresses for packaging into the HLS adaptive bitrate.
The client requests the live stream through the CloudFront CDN.
The origin responds with the appropriate HLS stream.
The edge fleet caches media requests from clients and elastically scales across both Availability Zones to meet peak demand.
CloudFront caches media at local edge PoPs to improve performance for users and reduce the origin load.
When the live event is finished, the VOD asset is published to S3. An S3 event is then published to SQS.
The encoding fleet processes the read messages from the SQS queue, processes the VOD clips, and stores them in the S3 bucket.

How it works

A camera captures content, and with the help of a contribution encoder, publishes a live stream in equirectangular format. The stream is encoded at a high bit rate (at least 2.5 Mbps, but typically 16+ Mbps for 4K) using H264 video and AAC audio compression codecs, and delivered to a primary origin via the RTMP protocol. Streams may transit over the internet or dedicated links to the origins. Typically, for live events in the field, internet or bonded cellular are the most widely used.

The encoder is typically configured to push the live stream to a primary URI, with the ability (depending on the source encoding software/hardware) to roll over to a backup publishing point origin if the primary fails. Because you run across multiple Availability Zones, this architecture could handle an entire zone outage with minor disruption to live events. The primary and backup origins handle the ingestion of the live stream as well as transcoding to H264+AAC-based adaptive bit rate sets. After transcode, they package the streams into HLS for delivery and create a master-level manifest that references all adaptive bit rates.

The edge cache fleet pulls segments and manifests from the active origin on demand, and supports failover from primary to backup if the primary origin fails. By adding this caching tier, you effectively separate the encoding backend tier from the cache tier that responds to client or CDN requests. In addition to origin protection, this separation allows you to independently monitor, configure, and scale these components.

Viewers can use the sample HTML5 player (or compatible desktop, iOS or Android application) to view the streams. Navigation in the 360-degree view is handled either natively via device-based gyroscope, positionally via more advanced devices such as a head mount display, or via mouse drag on the desktop. Adaptive bit rate is key here, as this allows you to target multiple device types, giving the player on each device the option of selecting an optimum stream based on network conditions or device profile.

Solution components

When you deploy the CloudFormation template, all the architecture services referenced above are created and launched. This includes:

The compute tier running on Spot Instances for the corresponding components:
- the primary and backup ingest origins
- the edge cache fleet
- the transcoding fleet
- the test source
The CloudFront distribution
S3 buckets for storage of on-demand VOD assets
An Application Load Balancer for load balancing the service
An Amazon ECS cluster and container for the test source
An Amazon SQS queue

The template also provisions the underlying dependencies:

A VPC
Security groups
IAM policies and roles
Elastic network interfaces
Elastic IP addresses

The edge cache fleet instances need some way to discover the primary and backup origin locations. You use elastic network interfaces and elastic IP addresses for this purpose.

As each component of the infrastructure is provisioned, software required to transcode and process the streams across the Spot Instances is automatically deployed. This includes NGiNX-RTMP for ingest of live streams, FFMPEG for transcoding, NGINX for serving, and helper scripts to handle various tasks (potential Spot Instance interruptions, queueing, moving content to S3). Metrics and logs are available through CloudWatch and you can manage the deployment using the CloudFormation console or AWS CLI.

Key features include:

Live and video-on-demand recording

You’re supporting both live and on-demand. On-demand content is created automatically when the encoder stops publishing to the origin.

Cost-optimization and operating at scale using Spot Instances

Spot Instances are used exclusively for infrastructure to optimize cost and scale throughput.

To protect the origin servers, the midtier cache fleet pulls, caches, and delivers to downstream CDNs.

Distribution via CloudFront or multi-CDN

The Application Load Balancer endpoint allows CloudFront or any third-party CDN to source content from the edge fleet and, indirectly, the origin.

FFMPEG + NGINX + NGiNX-RTMP

These three components form the core of the stream ingest, transcode, packaging, and delivery infrastructure, as well as the VOD-processing component for creating transcoded VOD content on-demand.

Simple deployment using a CloudFormation template

All infrastructure can be easily created and modified using CloudFormation.

To provide an end-to-end experience right away, we’ve included a test player page hosted as a static site on S3. This page uses A-Frame, a cross-platform, open-source framework for building VR experiences in the browser. Though A-Frame provides many features, it’s used here to render a sphere that acts as a 3D canvas for your live stream.

Spot Instance considerations

At this stage, and before we discuss processing, it is important to understand how the architecture operates with Spot Instances.

Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. Spot Instances enables you to optimize your costs on the AWS Cloud and scale your application’s throughput up to 10X for the same budget. By selecting Spot Instances, you can save up-to 90% on On-Demand prices. This allows you to greatly reduce the cost of running the solution because, outside of S3 for storage and CloudFront for delivery, this solution is almost entirely dependent on Spot Instances for infrastructure requirements.

We also know that customers running events look to deploy streaming infrastructure at the lowest price point, so it makes sense to take advantage of it wherever possible. A potential challenge when using Spot Instances for live streaming and on-demand processing is that you need to proactively deal with potential Spot Instance interruptions. How can you best deal with this?

First, the origin is deployed in a primary/backup deployment. If a Spot Instance interruption happens on the primary origin, you can fail over to the backup with a brief interruption. Should a potential interruption not be acceptable, then either Reserved Instances or On-Demand options (or a combination) can be used at this tier.

Second, the edge cache fleet runs a job (started automatically at system boot) that periodically queries the local instance metadata to detect if an interruption is scheduled to occur. Spot Instance Interruption Notices provide a two-minute warning of a pending interruption. If you poll every 5 seconds, you have almost 2 full minutes to detach from the Load Balancer and drain or stop any traffic directed to your instance.

Lastly, use an SQS queue when transcoding. If a transcode for a Spot Instance is interrupted, the stale item falls back into the SQS queue and is eventually re-surfaced into the processing pipeline. Only remove items from the queue after the transcoded files have been successfully moved to the destination S3 bucket.

Processing

As discussed in the previous sections, you pass through the video for the highest bit rate to save on having to increase the instance size to transcode the 4K or similar high bit rate or resolution content.

We’ve selected a handful of bitrates for the adaptive bit rate stack. You can customize any of these to suit the requirements for your event. The default ABR stack includes:

2160p (4K)
1080p
540p
480p

These can be modified by editing the /etc/nginx/rtmp.d/rtmp.conf NGINX configuration file on the origin or the CloudFormation template.

It’s important to understand where and how streams are transcoded. When the source high bit rate stream enters the primary or backup origin at the /live RTMP application entry point, it is recorded on stop and start of publishing. On completion, it is moved to S3 by a cleanup script, and a message is placed in your SQS queue for workers to use. These workers transcode the media and push it to a playout location bucket.

This solution uses Spot Fleet with automatic scaling to drive the fleet size. You can customize it based on CloudWatch metrics, such as simple utilization metrics to drive the size of the fleet. Why use Spot Instances for the transcode option instead of Amazon Elastic Transcoder? This allows you to implement reprojection of the input stream via FFMPEG filters in the future.

The origins handle all the heavy live streaming work. Edges only store and forward the segments and manifests, and provide scaling plus reduction of burden on the origin. This lets you customize the origin to the right compute capacity without having to rely on a ‘high watermark’ for compute sizing, thus saving additional costs.

Loopback is an important concept for the live origins. The incoming stream entering /live is transcoded by FFMPEG to multiple bit rates, which are streamed back to the same host via RTMP, on a secondary publishing point /show. The secondary publishing point is transparent to the user and encoder, but handles HLS segment generation and cleanup, and keeps a sliding window of live segments and constantly updating manifests.

Configuration

Our solution provides two key points of configuration that can be used to customize the solution to accommodate ingest, recording, transcoding, and delivery, all controlled via origin and edge configuration files, which are described later. In addition, a number of job scripts run on the instances to provide hooks into Spot Instance interruption events and the VOD SQS-based processing queue.

Origin instances

The rtmp.conf excerpt below also shows additional parameters that can be customized, such as maximum recording file size in Kbytes, HLS Fragment length, and Playlist sizes. We’ve created these in accordance with general industry best practices to ensure the reliable streaming and delivery of your content.

rtmp { server { listen 1935; chunk_size 4000; application live { live on; record all; record_path /var/lib/nginx/rec; record_max_size 128000K; exec_record_done /usr/local/bin/record-postprocess.sh $path $basename; exec /usr/local/bin/ffmpeg <…parameters…>; } application show { live on; hls on; ... hls_type live; hls_fragment 10s; hls_playlist_length 60s; ... } } }

This exposes a few URL endpoints for debugging and general status. In production, you would most likely turn these off:

/stat provides a statistics endpoint accessible via any standard web browser.
/control enables control of RTMP streams and publishing points.

You also control the TTLs, as previously discussed. It’s important to note here that you are setting TTLs explicitly at the origin, instead of in CloudFront’s distribution configuration. While both are valid, this approach allows you to reconfigure and restart the service on the fly without having to push changes through CloudFront. This is useful for debugging any caching or playback issues.

location /stat { ... } location /stat.xsl { ... } location /control { ... } # origin controlled ttls location ~* /hls/.*\.ts$ { ... add_header Cache-Control "max-age=900"; expires 900s; } location ~* /hls/.*\.m3u8$ { ... add_header Cache-Control "max-age=5"; expires 5s; }

Handler scripts

Here is a brief overview of the scripts we use to handle the events and process of the solution. We encourage you to take some time to review them.

spot-termination-handler.sh – Provides a graceful shutdown for the transcoding and edge fleet instances.
record-postprocess.sh – Ensures that recorded files on the origin are well-formed, and transfers them to S3 for processing.
ffmpeg.sh – Transcodes content on the encoding fleet, pulling source media from your S3 ingress bucket, based on SQS queue entries, and pushing transcoded adaptive bit rate segments and manifests to your VOD playout egress bucket.

For more details, see the Delivery and Playback section later in this post.

Camera source

With the processing and origination infrastructure running, you need to configure your camera and encoder.

As discussed, we chose to use a Ricoh Theta S camera and Open Broadcaster Software (OBS) to stitch and deliver a stream into the infrastructure. Ricoh provides a free ‘blender’ driver, which allows you to transform, stitch, encode, and deliver both transformed equirectangular (used for this post) video as well as spherical (two camera) video. The Theta provides an easy way to get capturing for under $300, and OBS is a free and open-source software application for capturing and live streaming on a budget. It is quick, cheap, and enjoys wide use by the gaming community. OBS lowers the barrier to getting started with immersive streaming.

While the resolution and bit rate of the Theta may not be 4K, it still provides us with a way to test the functionality of the entire pipeline end to end, without having to invest in a more expensive camera rig. One could also use this type of model to target smaller events, which may involve mobile devices with smaller display profiles, such as phones and potentially smaller sized tablets.

Looking for a more professional solution? Nokia, GoPro, Samsung, and many others have options ranging from $500 to $50,000. This solution is based around the Theta S capabilities, but we’d encourage you to extend it to meet your specific needs.

If your device can support equirectangular RTMP, then it can deliver media through the reference architecture (dependent on instance sizing for higher bit rate sources, of course). If additional features are required such as camera stitching, mixing, or device bonding, we’d recommend exploring a commercial solution such as Teradek Sphere.

Teradek Rig (Teradek)

Ricoh Theta (CNET)

All cameras have varied PC connectivity support. We chose the Ricoh Theta S due to the real-time video connectivity that it provides through software drivers on macOS and PC. If you plan to purchase a camera to use with a PC, confirm that it supports real-time capabilities as a peripheral device.

Encoding and publishing

Now that you have a camera, encoder, and AWS stack running, you can finally publish a live stream.

To start streaming with OBS, configure the source camera and set a publishing point. Use the RTMP application name /live on port 1935 to ingest into the primary origin’s Elastic IP address provided as the CloudFormation output: primaryOriginElasticIp.

You also need to choose a stream name or stream key in OBS. You can use any stream name, but keep the naming short and lowercase, and use only alphanumeric characters. This avoids any parsing issues on client-side player frameworks. There’s no publish point protection in your deployment, so any stream key works with the default NGiNX-RTMP configuration. For more information about stream keys, publishing point security, and extending the NGiNX-RTMP module, see the NGiNX-RTMP Wiki.

You should end up with a configuration similar to the following:

OBS Stream Settings

The Output settings dialog allows us to rescale the Video canvas and encode it for delivery to our AWS infrastructure. In the dialog below, we’ve set the Theta to encode at 5 Mbps in CBR mode using a preset optimized for low CPU utilization. We chose these settings in accordance with best practices for the stream pass-through at the origin for the initial incoming bit rate. You may notice that they largely match the FFMPEG encoding settings we use on the origin – namely constant bit rate, a single audio track, and x264 encoding with the ‘veryfast’ encoding profile.

OBS Output Settings

Live to On-Demand

As you may have noticed, an on-demand component is included in the solution architecture. When talking to customers, one frequent request that we see is that they would like to record the incoming stream with as little effort as possible.

NGINX-RTMP’s recording directives provide an easy way to accomplish this. We record any newly published stream on stream start at the primary or backup origins, using the incoming source stream, which also happens to be the highest bit rate. When the encoder stops broadcasting, NGINX-RTMP executes an exec_record_done script – record-postprocess.sh (described in the Configuration section earlier), which ensures that the content is well-formed, and then moves it to an S3 ingest bucket for processing.

Transcoding of content to make it ready for VOD as adaptive bit rate is a multi-step pipeline. First, Spot Instances in the transcoding cluster periodically poll the SQS queue for new jobs. Items on the queue are pulled off on demand by processing instances, and transcoded via FFMPEG into adaptive bit rate HLS. This allows you to also extend FFMPEG using filters for cubic and other bitrate-optimizing 360-specific transforms. Finally, transcoded content is moved from the ingest bucket to an egress bucket, making them ready for playback via your CloudFront distribution.

Separate ingest and egress by bucket to provide hard security boundaries between source recordings (which are highest quality and unencrypted), and destination derivatives (which may be lower quality and potentially require encryption). Bucket separation also allows you to order and archive input and output content using different taxonomies, which is common when moving content from an asset management and archival pipeline (the ingest bucket) to a consumer-facing playback pipeline (the egress bucket, and any other attached infrastructure or services, such as CMS, Mobile applications, and so forth).

Because streams are pushed over the internet, there is always the chance that an interruption could occur in the network path, or even at the origin side of the equation (primary to backup roll-over). Both of these scenarios could result in malformed or partial recordings being created. For the best level of reliability, encoding should always be recorded locally on-site as a precaution to deal with potential stream interruptions.

Delivery and playback

With the camera turned on and OBS streaming to AWS, the final step is to play the live stream. We’ve primarily tested the prototype player on the latest Chrome and Firefox browsers on macOS, so your mileage may vary on different browsers or operating systems. For those looking to try the livestream on Google Cardboard, or similar headsets, native apps for iOS (VRPlayer) and Android exist that can play back HLS streams.

The prototype player is hosted in an S3 bucket and can be found from the CloudFormation output clientWebsiteUrl. It requires a stream URL provided as a query parameter ?url=<stream_url> to begin playback. This stream URL is determined by the RTMP stream configuration in OBS. For example, if OBS is publishing to rtmp://x.x.x.x:1935/live/foo, the resulting playback URL would be:

https://<cloudFrontDistribution>/hls/foo.m3u8

The combined player URL and playback URL results in a path like this one:

https://<clientWebsiteUrl>/?url=https://<cloudFrontDistribution>/hls/foo.m3u8

To assist in setup/debugging, we’ve provided a test source as part of the CloudFormation template. A color bar pattern with timecode and audio is being generated by FFmpeg running as an ECS task. Much like OBS, FFmpeg is streaming the test pattern to the primary origin over the RTMP protocol. The prototype player and test HLS stream can be accessed by opening the clientTestPatternUrl CloudFormation output link.

Test Stream Playback

What’s next?

In this post, we walked you through the design and implementation of a full end-to-end immersive streaming solution architecture. As you may have noticed, there are a number of areas this could expand into, and we intend to do this in follow-up posts around the topic of virtual reality media workloads in the cloud. We’ve identified a number of topics such as load testing, content protection, client-side metrics and analytics, and CI/CD infrastructure for 24/7 live streams. If you have any requests, please drop us a line.

We would like to extend extra-special thanks to Scott Malkie and Chad Neal for their help and contributions to this post and reference architecture.

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/aws-reinvent-2017-guide-to-all-things-containers/

Contributed by Tiffany Jernigan, Developer Advocate for Amazon ECS

Get ready for takeoff!

We made sure that this year’s re:Invent is chock-full of containers: there are over 40 sessions! New to containers? No problem, we have several introductory sessions for you to dip your toes. Been using containers for years and know the ins and outs? Don’t miss our technical deep-dives and interactive chalk talks led by container experts.

If you can’t make it to Las Vegas, you can catch the keynotes and session recaps from our livestream and on Twitch.

Session types

Not everyone learns the same way, so we have multiple types of breakout content:

Birds of a Feather
An interactive discussion with industry leaders about containers on AWS.
Breakout sessions
60-minute presentations about building on AWS. Sessions are delivered by both AWS experts and customers and span all content levels.
Workshops
2.5-hour, hands-on sessions that teach how to build on AWS. AWS credits are provided. Bring a laptop, and have an active AWS account.
Chalk Talks
1-hour, highly interactive sessions with a smaller audience. They begin with a short lecture delivered by an AWS expert, followed by a discussion with the audience.

Session levels

Whether you’re new to containers or you’ve been using them for years, you’ll find useful information at every level.

Introductory
Sessions are focused on providing an overview of AWS services and features, with the assumption that attendees are new to the topic.
Advanced
Sessions dive deeper into the selected topic. Presenters assume that the audience has some familiarity with the topic, but may or may not have direct experience implementing a similar solution.
Expert
Sessions are for attendees who are deeply familiar with the topic, have implemented a solution on their own already, and are comfortable with how the technology works across multiple services, architectures, and implementations.

Session locations

All container sessions are located in the Aria Resort.

MONDAY 11/27

Breakout sessions

Level 200 (Introductory)

CON202 – Getting Started with Docker and Amazon ECS
By packaging software into standardized units, Docker gives code everything it needs to run, ensuring consistency from your laptop all the way into production. But once you have your code ready to ship, how do you run and scale it in the cloud? In this session, you become comfortable running containerized services in production using Amazon ECS. We cover container deployment, cluster management, service auto-scaling, service discovery, secrets management, logging, monitoring, security, and other core concepts. We also cover integrated AWS services and supplementary services that you can take advantage of to run and scale container-based services in the cloud.

Chalk talks

Level 200 (Introductory)

CON211 – Reducing your Compute Footprint with Containers and Amazon ECS
Tomas Riha, platform architect for Volvo, shows how Volvo transitioned its WirelessCar platform from using Amazon EC2 virtual machines to containers running on Amazon ECS, significantly reducing cost. Tomas dives deep into the architecture that Volvo used to achieve the migration in under four months, including Amazon ECS, Amazon ECR, Elastic Load Balancing, and AWS CloudFormation.

CON212 – Anomaly Detection Using Amazon ECS, AWS Lambda, and Amazon EMR
Learn about the architecture that Cisco CloudLock uses to enable automated security and compliance checks throughout the entire development lifecycle, from the first line of code through runtime. It includes integration with IAM roles, Amazon VPC, and AWS KMS.

Level 400 (Expert)

CON410 – Advanced CICD with Amazon ECS Control Plane
Mohit Gupta, product and engineering lead for Clever, demonstrates how to extend the Amazon ECS control plane to optimize management of container deployments and how the control plane can be broadly applied to take advantage of new AWS services. This includes ark—an AWS CLI-based deployment to Amazon ECS, Dapple—a slack-based automation system for deployments and notifications, and Kayvee—log and event routing libraries based on Amazon Kinesis.

Workshops

Level 200 (Introductory)

CON209 – Interstella 8888: Learn How to Use Docker on AWS
Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to get hands-on experience with Docker as you containerize Interstella 8888’s aging monolithic application and deploy it using Amazon ECS.

CON213 – Hands-on Deployment of Kubernetes on AWS
In this workshop, attendees get hands-on experience using Kubernetes and Kops (Kubernetes Operations), as described in our recent blog post. Attendees learn how to provision a cluster, assign role-based permissions and security, and launch a container. If you’re interested in learning best practices for running Kubernetes on AWS, don’t miss this workshop.

TUESDAY 11/28

Breakout Sessions

Level 200 (Introductory)

CON206 – Docker on AWS
In this session, Docker Technical Staff Member Patrick Chanezon discusses how Finnish Rail, the national train system for Finland, is using Docker on Amazon Web Services to modernize their customer facing applications, from ticket sales to reservations. Patrick also shares the state of Docker development and adoption on AWS, including explaining the opportunities and implications of efforts such as Project Moby, Docker EE, and how developers can use and contribute to Docker projects.

CON208 – Building Microservices on AWS
Increasingly, organizations are turning to microservices to help them empower autonomous teams, letting them innovate and ship software faster than ever before. But implementing a microservices architecture comes with a number of new challenges that need to be dealt with. Chief among these finding an appropriate platform to help manage a growing number of independently deployable services. In this session, Sam Newman, author of Building Microservices and a renowned expert in microservices strategy, discusses strategies for building scalable and robust microservices architectures. He also tells you how to choose the right platform for building microservices, and about common challenges and mistakes organizations make when they move to microservices architectures.

Level 300 (Advanced)

CON302 – Building a CICD Pipeline for Containers on AWS
Containers can make it easier to scale applications in the cloud, but how do you set up your CICD workflow to automatically test and deploy code to containerized apps? In this session, we explore how developers can build effective CICD workflows to manage their containerized code deployments on AWS.

Ajit Zadgaonkar, Director of Engineering and Operations at Edmunds walks through best practices for CICD architectures used by his team to deploy containers. We also deep dive into topics such as how to create an accessible CICD platform and architect for safe blue/green deployments.

CON307 – Building Effective Container Images
Sick of getting paged at 2am and wondering “where did all my disk space go?” New Docker users often start with a stock image in order to get up and running quickly, but this can cause problems as your application matures and scales. Creating efficient container images is important to maximize resources, and deliver critical security benefits.

In this session, AWS Sr. Technical Evangelist Abby Fuller covers how to create effective images to run containers in production. This includes an in-depth discussion of how Docker image layers work, things you should think about when creating your images, working with Amazon ECR, and mise-en-place for install dependencies. Prakash Janakiraman, Co-Founder and Chief Architect at Nextdoor discuss high-level and language-specific best practices for with building images and how Nextdoor uses these practices to successfully scale their containerized services with a small team.

CON309 – Containerized Machine Learning on AWS
Image recognition is a field of deep learning that uses neural networks to recognize the subject and traits for a given image. In Japan, Cookpad uses Amazon ECS to run an image recognition platform on clusters of GPU-enabled EC2 instances. In this session, hear from Cookpad about the challenges they faced building and scaling this advanced, user-friendly service to ensure high-availability and low-latency for tens of millions of users.

CON320 – Monitoring, Logging, and Debugging for Containerized Services
As containers become more embedded in the platform tools, debug tools, traces, and logs become increasingly important. Nare Hayrapetyan, Senior Software Engineer and Calvin French-Owen, Senior Technical Officer for Segment discuss the principals of monitoring and debugging containers and the tools Segment has implemented and built for logging, alerting, metric collection, and debugging of containerized services running on Amazon ECS.

Chalk Talks

Level 300 (Advanced)

CON314 – Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS
Containers make it easy to deploy new code into production to update the functionality of a service, but what happens when you need to update the Amazon EC2 compute instances that your containers are running on? In this talk, we’ll deep dive into how to upgrade the Amazon EC2 infrastructure underlying a live production Amazon ECS cluster without affecting service availability. Matt Callanan, Engineering Manager at Expedia walk through Expedia’s “PRISM” project that safely relocates hundreds of tasks onto new Amazon EC2 instances with zero-downtime to applications.

CON322 – Maximizing Amazon ECS for Large-Scale Workloads
Head of Mobfox DevOps, David Spitzer, shows how Mobfox used Docker and Amazon ECS to scale the Mobfox services and development teams to achieve low-latency networking and automatic scaling. This session covers Mobfox’s ecosystem architecture. It compares 2015 and today, the challenges Mobfox faced in growing their platform, and how they overcame them.

CON323 – Microservices Architectures for the Enterprise
Salva Jung, Principle Engineer for Samsung Mobile shares how Samsung Connect is architected as microservices running on Amazon ECS to securely, stably, and efficiently handle requests from millions of mobile and IoT devices around the world.

CON324 – Windows Containers on Amazon ECS
Docker containers are commonly regarded as powerful and portable runtime environments for Linux code, but Docker also offers API and toolchain support for running Windows Servers in containers. In this talk, we discuss the various options for running windows-based applications in containers on AWS.

CON326 – Remote Sensing and Image Processing on AWS
Learn how Encirca services by DuPont Pioneer uses Amazon ECS powered by GPU-instances and Amazon EC2 Spot Instances to run proprietary image-processing algorithms against satellite imagery. Mark Lanning and Ethan Harstad, engineers at DuPont Pioneer show how this architecture has allowed them to process satellite imagery multiple times a day for each agricultural field in the United States in order to identify crop health changes.

Workshops

Level 300 (Advanced)

CON317 – Advanced Container Management at Catsndogs.lol
Catsndogs.lol is a (fictional) company that needs help deploying and scaling its container-based application. During this workshop, attendees join the new DevOps team at CatsnDogs.lol, and help the company to manage their applications using Amazon ECS, and help release new features to make our customers happier than ever.Attendees get hands-on with service and container-instance auto-scaling, spot-fleet integration, container placement strategies, service discovery, secrets management with AWS Systems Manager Parameter Store, time-based and event-based scheduling, and automated deployment pipelines. If you are a developer interested in learning more about how Amazon ECS can accelerate your application development and deployment workflows, or if you are a systems administrator or DevOps person interested in understanding how Amazon ECS can simplify the operational model associated with running containers at scale, then this workshop is for you. You should have basic familiarity with Amazon ECS, Amazon EC2, and IAM.

Additional requirements:

The AWS CLI or AWS Tools for PowerShell installed
An AWS account with administrative permissions (including the ability to create IAM roles and policies) created at least 24 hours in advance.

WEDNESDAY 11/29

Birds of a Feather (BoF)

CON01 – Birds of a Feather: Containers and Open Source at AWS
Cloud native architectures take advantage of on-demand delivery, global deployment, elasticity, and higher-level services to enable developer productivity and business agility. Open source is a core part of making cloud native possible for everyone. In this session, we welcome thought leaders from the CNCF, Docker, and AWS to discuss the cloud’s direction for growth and enablement of the open source community. We also discuss how AWS is integrating open source code into its container services and its contributions to open source projects.

Breakout Sessions

Level 300 (Advanced)

CON308 – Mastering Kubernetes on AWS
Much progress has been made on how to bootstrap a cluster since Kubernetes’ first commit and is now only a matter of minutes to go from zero to a running cluster on Amazon Web Services. However, evolving a simple Kubernetes architecture to be ready for production in a large enterprise can quickly become overwhelming with options for configuration and customization.

In this session, Arun Gupta, Open Source Strategist for AWS and Raffaele Di Fazio, software engineer at leading European fashion platform Zalando, show the common practices for running Kubernetes on AWS and share insights from experience in operating tens of Kubernetes clusters in production on AWS. We cover options and recommendations on how to install and manage clusters, configure high availability, perform rolling upgrades and handle disaster recovery, as well as continuous integration and deployment of applications, logging, and security.

CON310 – Moving to Containers: Building with Docker and Amazon ECS
If you’ve ever considered moving part of your application stack to containers, don’t miss this session. We cover best practices for containerizing your code, implementing automated service scaling and monitoring, and setting up automated CI/CD pipelines with fail-safe deployments. Manjeeva Silva and Thilina Gunasinghe show how McDonalds implemented their home delivery platform in four months using Docker containers and Amazon ECS to serve tens of thousands of customers.

Level 400 (Expert)

CON402 – Advanced Patterns in Microservices Implementation with Amazon ECS
Scaling a microservice-based infrastructure can be challenging in terms of both technical implementation and developer workflow. In this talk, AWS Solutions Architect Pierre Steckmeyer is joined by Will McCutchen, Architect at BuzzFeed, to discuss Amazon ECS as a platform for building a robust infrastructure for microservices. We look at the key attributes of microservice architectures and how Amazon ECS supports these requirements in production, from configuration to sophisticated workload scheduling to networking capabilities to resource optimization. We also examine what it takes to build an end-to-end platform on top of the wider AWS ecosystem, and what it’s like to migrate a large engineering organization from a monolithic approach to microservices.

CON404 – Deep Dive into Container Scheduling with Amazon ECS
As your application’s infrastructure grows and scales, well-managed container scheduling is critical to ensuring high availability and resource optimization. In this session, we deep dive into the challenges and opportunities around container scheduling, as well as the different tools available within Amazon ECS and AWS to carry out efficient container scheduling. We discuss patterns for container scheduling available with Amazon ECS, the Blox scheduling framework, and how you can customize and integrate third-party scheduler frameworks to manage container scheduling on Amazon ECS.

Chalk Talks

Level 300 (Advanced)

CON312 – Building a Selenium Fleet on the Cheap with Amazon ECS with Spot Fleet
Roberto Rivera and Matthew Wedgwood, engineers at RetailMeNot, give a practical overview of setting up a fleet of Selenium nodes running on Amazon ECS with Spot Fleet. Discuss the challenges of running Selenium with high availability at minimum cost using Amazon ECS container introspection to connect the Selenium Hub with its nodes.

CON315 – Virtually There: Building a Render Farm with Amazon ECS
Learn how 8i Corp scales its multi-tenanted, volumetric render farm up to thousands of instances using AWS, Docker, and an API-driven infrastructure. This render farm enables them to turn the video footage from an array of synchronized cameras into a photo-realistic hologram capable of playback on a range of devices, from mobile phones to high-end head mounted displays. Join Owen Evans, VP of Engineering for 8i, as they dive deep into how 8i’s rendering infrastructure is built and maintained by just a handful of people and powered by Amazon ECS.

CON325 – Developing Microservices – from Your Laptop to the Cloud
Wesley Chow, Staff Engineer at Adroll, shows how his team extends Amazon ECS by enabling local development capabilities. Hologram, Adroll’s local development program, brings the capabilities of the Amazon EC2 instance metadata service to non-EC2 hosts, so that developers can run the same software on local machines with the same credentials source as in production.

CON327 – Patterns and Considerations for Service Discovery
Roven Drabo, head of cloud operations at Kaplan Test Prep, illustrates Kaplan’s complete container automation solution using Amazon ECS along with how his team uses NGINX and HashiCorp Consul to provide an automated approach to service discovery and container provisioning.

CON328 – Building a Development Platform on Amazon ECS
Quinton Anderson, Head of Engineering for Commonwealth Bank of Australia, walks through how they migrated their internal development and deployment platform from Mesos/Marathon to Amazon ECS. The platform uses a custom DSL to abstract a layered application architecture, in a way that makes it easy to plug or replace new implementations into each layer in the stack.

Workshops

Level 300 (Advanced)

CON318 – Interstella 8888: Monolith to Microservices with Amazon ECS
Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to get hands-on experience deploying Docker containers as you break Interstella 8888’s aging monolithic application into containerized microservices. Using Amazon ECS and an Application Load Balancer, you create API-based microservices and deploy them leveraging integrations with other AWS services.

CON332 – Build a Java Spring Application on Amazon ECS
This workshop teaches you how to lift and shift existing Spring and Spring Cloud applications onto the AWS platform. Learn how to build a Spring application container, understand bootstrap secrets, push container images to Amazon ECR, and deploy the application to Amazon ECS. Then, learn how to configure the deployment for production.

THURSDAY 11/30

Breakout Sessions

Level 200 (Introductory)

CON201 – Containers on AWS – State of the Union
Just over four years after the first public release of Docker, and three years to the day after the launch of Amazon ECS, the use of containers has surged to run a significant percentage of production workloads at startups and enterprise organizations. Join Deepak Singh, General Manager of Amazon Container Services, as he covers the state of containerized application development and deployment trends, new container capabilities on AWS that are available now, options for running containerized applications on AWS, and how AWS customers successfully run container workloads in production.

Level 300 (Advanced)

CON304 – Batch Processing with Containers on AWS
Batch processing is useful to analyze large amounts of data. But configuring and scaling a cluster of virtual machines to process complex batch jobs can be difficult. In this talk, we show how to use containers on AWS for batch processing jobs that can scale quickly and cost-effectively. We also discuss AWS Batch, our fully managed batch-processing service. You also hear from GoPro and Here about how they use AWS to run batch processing jobs at scale including best practices for ensuring efficient scheduling, fine-grained monitoring, compute resource automatic scaling, and security for your batch jobs.

Level 400 (Expert)

CON406 – Architecting Container Infrastructure for Security and Compliance
While organizations gain agility and scalability when they migrate to containers and microservices, they also benefit from compliance and security, advantages that are often overlooked. In this session, Kelvin Zhu, lead software engineer at Okta, joins Mitch Beaumont, enterprise solutions architect at AWS, to discuss security best practices for containerized infrastructure. Learn how Okta built their development workflow with an emphasis on security through testing and automation. Dive deep into how containers enable automated security and compliance checks throughout the development lifecycle. Also understand best practices for implementing AWS security and secrets management services for any containerized service architecture.

Chalk Talks

Level 300 (Advanced)

CON329 – Full Software Lifecycle Management for Containers Running on Amazon ECS
Learn how The Washington Post uses Amazon ECS to run Arc Publishing, a digital journalism platform that powers The Washington Post and a growing number of major media websites. Amazon ECS enabled The Washington Post to containerize their existing microservices architecture, avoiding a complete rewrite that would have delayed the platform’s launch by several years. In this session, Jason Bartz, Technical Architect at The Washington Post, discusses the platform’s architecture. He addresses the challenges of optimizing Arc Publishing’s workload, and managing the application lifecycle to support 2,000 containers running on more than 50 Amazon ECS clusters.

CON330 – Running Containerized HIPAA Workloads on AWS
Nihar Pasala, Engineer at Aetion, discusses the Aetion Evidence Platform, a system for generating the real-world evidence used by healthcare decision makers to implement value-based care. This session discusses the architecture Aetion uses to run HIPAA workloads using containers on Amazon ECS, best practices, and learnings.

Level 400 (Expert)

CON408 – Building a Machine Learning Platform Using Containers on AWS
DeepLearni.ng develops and implements machine learning models for complex enterprise applications. In this session, Thomas Rogers, Engineer for DeepLearni.ng discusses how they worked with Scotiabank to leverage Amazon ECS, Amazon ECR, Docker, GPU-accelerated Amazon EC2 instances, and TensorFlow to develop a retail risk model that helps manage payment collections for millions of Canadian credit card customers.

Workshops

Level 300 (Advanced)

CON319 – Interstella 8888: CICD for Containers on AWS
Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to learn how to set up a CI/CD pipeline for containerized microservices. You get hands-on experience deploying Docker container images using Amazon ECS, AWS CloudFormation, AWS CodeBuild, and AWS CodePipeline, automating everything from code check-in to production.

FRIDAY 12/1

Breakout Sessions

Level 400 (Expert)

CON405 – Moving to Amazon ECS – the Not-So-Obvious Benefits
If you ask 10 teams why they migrated to containers, you will likely get answers like ‘developer productivity’, ‘cost reduction’, and ‘faster scaling’. But teams often find there are several other ‘hidden’ benefits to using containers for their services. In this talk, Franziska Schmidt, Platform Engineer at Mapbox and Yaniv Donenfeld from AWS will discuss the obvious, and not so obvious benefits of moving to containerized architecture. These include using Docker and Amazon ECS to achieve shared libraries for dev teams, separating private infrastructure from shareable code, and making it easier for non-ops engineers to run services.

Chalk Talks

Level 300 (Advanced)

CON331 – Deploying a Regulated Payments Application on Amazon ECS
Travelex discusses how they built an FCA-compliant international payments service using a microservices architecture on AWS. This chalk talk covers the challenges of designing and operating an Amazon ECS-based PaaS in a regulated environment using a DevOps model.

Workshops

Level 400 (Expert)

CON407 – Interstella 8888: Advanced Microservice Operations
Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. In this workshop, you help Interstella 8888 build a modern microservices-based logistics system to save the company from financial ruin. We give you the hands-on experience you need to run microservices in the real world. This includes implementing advanced container scheduling and scaling to deal with variable service requests, implementing a service mesh, issue tracing with AWS X-Ray, container and instance-level logging with Amazon CloudWatch, and load testing.

Know before you go

Want to brush up on your container knowledge before re:Invent? Here are some helpful resources to get started:

– Tiffany
@tiffanyfayj

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances. Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

Amazon ECR repository for storing the Docker image to be used in the task definition
S3 bucket that holds the transferred logs
Task definition, with image name and S3 bucket as environment variables provided via input parameter
CloudWatch Events rule
Amazon ECS cluster
Amazon ECS container instances using Spot Fleet
IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

Clone the code from GitHub that performs RDS API calls to retrieve the log files.
git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
Build and tag the image.
cd aws-ecs-scheduled-tasks/container-code/src && ls Dockerfile rdslogsshipper.py requirements.txt
docker build -t rdslogsshipper .
Sending build context to Docker daemon 9.728 kB Step 1 : FROM python:3 ---> 41397f4f2887 Step 2 : WORKDIR /usr/src/app ---> Using cache ---> 59299c020e7e Step 3 : COPY requirements.txt ./ ---> 8c017e931c3b Removing intermediate container df09e1bed9f2 Step 4 : COPY rdslogsshipper.py /usr/src/app ---> 099a49ca4325 Removing intermediate container 1b1da24a6699 Step 5 : RUN pip install --no-cache-dir -r requirements.txt ---> Running in 3ed98b30901d Collecting boto3 (from -r requirements.txt (line 1)) Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB) Collecting botocore (from -r requirements.txt (line 2)) Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB) Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1)) Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB) Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1)) Downloading jmespath-0.9.3-py2.py3-none-any.whl Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2)) Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB) Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2)) Downloading docutils-0.14-py3-none-any.whl (543kB) Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2)) Downloading six-1.10.0-py2.py3-none-any.whl Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3 Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0 ---> f892d3cb7383 Removing intermediate container 3ed98b30901d Step 6 : COPY . . ---> ea7550c04fea Removing intermediate container b558b3ebd406 Successfully built ea7550c04fea
Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule --schedule-expression "rate(15 minutes)" { "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule" }
CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
1. Create the IAM role to be assumed by CloudWatch.
  aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json { "Role": { "AssumeRolePolicyDocument": { "Version": "2012-10-17", "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "Service": "events.amazonaws.com" } } ] }, "RoleId": "AROAIRYYLDCVZCUACT7FS", "CreateDate": "2017-07-14T22:44:52.627Z", "RoleName": "Test-Role", "Path": "/", "Arn": "arn:aws:iam::12345678901:role/Test-Role" } }
  The following is an example of the event-role.json file used earlier:
  { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "events.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
  aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json { "Policy": { "PolicyName": "test-policy", "CreateDate": "2017-07-14T22:51:20.293Z", "AttachmentCount": 0, "IsAttachable": true, "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", "DefaultVersionId": "v1", "Path": "/", "Arn": "arn:aws:iam::123455678901:policy/test-policy", "UpdateDate": "2017-07-14T22:51:20.293Z" } }
  The following is an example of the event-policy.json file used earlier:
  { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecs:RunTask" ], "Resource": [ "arn:aws:ecs:*::task-definition/" ], "Condition": { "ArnLike": { "ecs:cluster": "arn:aws:ecs:*::cluster/" } } } ] }
3. Attach the IAM policy to the role.
  aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role" { "FailedEntries": [], "FailedEntryCount": 0 }

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/

My colleague Sanjay Padhi shared the guest post below in order to recognize an important milestone in the use of EC2 Spot Instances.

— Jeff;

A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region. The researchers conducted nearly half a million topic modeling experiments to study how human language is processed by computers. Topic modeling helps in discovering the underlying themes that are present across a collection of documents. Topic models are important because they are used to forecast business trends and help in making policy or funding decisions. These topic models can be run with many different parameters and the goal of the experiments is to explore how these parameters affect the model outputs.

The Experiment
Professor Amy Apon, Co-Director of the Complex Systems, Analytics and Visualization Institute at Clemson University with Professor Alexander Herzog and graduate students Brandon Posey and Christopher Gropp in collaboration with members of the AWS team as well as AWS Partner Omnibond performed the experiments. They used software infrastructure based on CloudyCluster that provisions high performance computing clusters on dynamically allocated AWS resources using Amazon EC2 Spot Fleet. Spot Fleet is a collection of biddable spot instances in EC2 responsible for maintaining a target capacity specified during the request. The SLURM scheduler was used as an overlay virtual workload manager for the data analytics workflows. The team developed additional provisioning and workflow automation software as shown below for the design and orchestration of the experiments. This setup allowed them to evaluate various topic models on different data sets with massively parallel parameter sweeps on dynamically allocated AWS resources. This framework can easily be used beyond the current study for other scientific applications that use parallel computing.

Ramping to 1.1 Million vCPUs
The figure below shows elastic, automatic expansion of resources as a function of time, in the US East (Northern Virginia) Region. At just after 21:40 (GMT-1) on Aug. 26, 2017, the number of vCPUs utilized was 1,119,196. Clemson researchers also took advantage of the new per-second billing for the EC2 instances that they launched. The vCPU count usage is comparable to the core count on the largest supercomputers in the world.

Here’s the breakdown of the EC2 instance types that they used:

Campus resources at Clemson funded by the National Science Foundation were used to determine an effective configuration for the AWS experiments as compared to campus resources, and the AWS cloud resources complement the campus resources for large-scale experiments.

Meet the Team
Here’s the team that ran the experiment (Professor Alexander Herzog, graduate students Christopher Gropp and Brandon Posey, and Professor Amy Apon):

Professor Apon said about the experiment:

I am absolutely thrilled with the outcome of this experiment. The graduate students on the project are amazing. They used resources from AWS and Omnibond and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler of these experiments.

Boyd Wilson (CEO, Omnibond, member of the AWS Partner Network) told me:

Participating in this project was exciting, seeing how the Clemson team developed a provisioning and workflow automation tool that tied into CloudyCluster to build a huge Spot Fleet supercomputer in a single region in AWS was outstanding.

About the Experiment
The experiments test parameter combinations on a range of topics and other parameters used in the topic model. The topic model outputs are stored in Amazon S3 and are currently being analyzed. The models have been applied to 17 years of computer science journal abstracts (533,560 documents and 32,551,540 words) and full text papers from the NIPS (Neural Information Processing Systems) Conference (2,484 documents and 3,280,697 words). This study allows the research team to systematically measure and analyze the impact of parameters and model selection on model convergence, topic composition and quality.

Looking Forward
This study constitutes an interaction between computer science, artificial intelligence, and high performance computing. Papers describing the full study are being submitted for peer-reviewed publication. I hope that you enjoyed this brief insight into the ways in which AWS is helping to break the boundaries in the frontiers of natural language processing!

— Sanjay Padhi, Ph.D, AWS Research and Technical Computing

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/powering-your-amazon-ecs-cluster-with-amazon-ec2-spot-instances/

This post was graciously contributed by:


Chad Schmutzer Solutions Architect	Shawn O’Connor Solutions Architect

Today we are excited to announce that Amazon EC2 Container Service (Amazon ECS) now supports the ability to launch your ECS cluster on Amazon EC2 Spot Instances directly from the ECS console.

Spot Instances allow you to bid on spare Amazon EC2 compute capacity. Spot Instances typically cost 50-90% less than On-Demand Instances. Powering your ECS cluster with Spot Instances lets you reduce the cost of running your existing containerized workloads, or increase your compute capacity by two to ten times while keeping the same budget. Or you could do a combination of both!

Using Spot Instances, you specify the price you are willing to pay per instance-hour. Your Spot Instance runs whenever your bid exceeds the current Spot price. If your instance is reclaimed due to an increase in the Spot price, you are not charged for the partial hour that your instance has run.

The ECS console uses Spot Fleet to deploy your Spot Instances. Spot Fleet attempts to deploy the target capacity you request (expressed in terms of instances or a vCPU count) for your containerized application by launching Spot Instances that result in the best prices for you. If your Spot Instances are reclaimed due to a change in Spot prices or available capacity, Spot Fleet also attempts to maintain its target capacity.

Containers are a natural fit for the diverse pool of resources that Spot Fleet thrives on. Spot Fleet enable you to provision capacity across multiple Spot Instance pools (combinations of instance types and Availability Zones), which helps improve your application’s availability and reduce operating costs of the fleet over time. Combining the extensible and flexible container placement system provided by ECS with Spot Fleet allows you to efficiently deploy containerized workloads and easily manage clusters at any scale for a fraction of the cost.

Previously, deploying your ECS cluster on Spot Instances was a manual process. In this post, we show you how to achieve high availability, scalability, and cost savings for your container workloads by using the new Spot Fleet integration in the ECS console. We also show you how to build your own ECS cluster on Spot Instances using AWS CloudFormation.

Creating an ECS cluster running on Spot Instances

You can create an ECS cluster using the AWS Management Console.

Open the Amazon ECS console at https://console.aws.amazon.com/ecs/.
In the navigation pane, choose Clusters.
On the Clusters page, choose Create Cluster.
For Cluster name, enter a name.
In Instance configuration, for Provisioning model, choose Spot.

Choosing an allocation strategy

The two available Spot Fleet allocation strategies are Diversified and Lowest price.

The allocation strategy you choose for your Spot Fleet determines how it fulfills your Spot Fleet request from the possible Spot Instance pools. When you use the diversified strategy, the Spot Instances are distributed across all pools. When you use the lowest price strategy, the Spot Instances come from the pool with the lowest price specified in your request.

Remember that each instance type (the instance size within each instance family, e.g., c4.4xlarge), in each Availability Zone, in every region, is a separate pool of capacity, and therefore a separate Spot market. By diversifying across as many different instance types and Availability Zones as possible, you can improve the availability of your fleet. You also make your fleet less sensitive to increases in the Spot price in any one pool over time.

You can select up to six EC2 instance types to use for your Spot Fleet. In this example, we’ve selected m3, m4, c3, c4, r3, and r4 instance types of size xlarge.

You need to enter a bid for your instances. Typically, bidding at or near the On-Demand Instance price is a good starting point. Your bid is the maximum price that you are willing to pay for instance types in that Spot pool. While the Spot price is at or below your bid, you pay the Spot price. Bidding lower ensures that you have lower costs, while bidding higher reduces the probability of interruption.

Configure the number of instances to have in your cluster. Spot Fleet attempts to launch the number of Spot Instances that are required to meet the target capacity specified in your request. The Spot Fleet also attempts to maintain its target capacity if your Spot Instances are reclaimed due to a change in Spot prices or available capacity.

The latest ECS–optimized AMI is used for the instances when they are launched.

Configure your storage and network settings. To enable diversification and high availability, be sure to select subnets in multiple Availability Zones. You can’t select multiple subnets in the same Availability Zone in a single Spot Fleet.

The ECS container agent makes calls to the ECS API actions on your behalf. Container instances that run the agent require the ecsInstanceRole IAM policy and role for the service to know that the agent belongs to you. If you don’t have the ecsInstanceRole already, you can create one using the ECS console.

If you create a managed compute environment that uses Spot Fleet, you must create a role that grants the Spot Fleet permission to bid on, launch, and terminate instances on your behalf. You can also create this role using the ECS console.

That’s it! In the ECS console, choose Create to spin up your new ECS cluster running on Spot Instances.

Using AWS CloudFormation to deploy your ECS cluster on Spot Instances

We have also published a reference architecture AWS CloudFormation template that demonstrates how to easily launch a CloudFormation stack and deploy your ECS cluster on Spot Instances.

The CloudFormation template includes the Spot Instance termination notice script mentioned earlier, as well as some additional logging and other example features to get you started quickly. You can find the CloudFormation template in the Amazon EC2 Spot Instances GitHub repo.

Give it a try and customize it as needed for your environment!

Handling termination

With Spot Instances, you never pay more than the price you specified. If the Spot price exceeds your bid price for a given instance, it is terminated automatically for you.

The best way to protect against Spot Instance interruption is to architect your containerized application to be fault-tolerant. In addition, you can take advantage of a feature called Spot Instance termination notices, which provides a two-minute warning before EC2 must terminate your Spot Instance.

This warning is made available to the applications on your Spot Instance using an item in the instance metadata. When you deploy your ECS cluster on Spot Instances using the console, AWS installs a script that checks every 5 seconds for the Spot Instance termination notice. If the notice is detected, the script immediately updates the container instance state to DRAINING.

A simplified version of the Spot Instance termination notice script is as follows:

#!/bin/bash while sleep 5; do if [ -z $(curl -Isf http://169.254.169.254/latest/meta-data/spot/termination-time) ]; then /bin/false else ECS_CLUSTER=$(curl -s http://localhost:51678/v1/metadata | jq .Cluster | tr -d \") CONTAINER_INSTANCE=$(curl -s http://localhost:51678/v1/metadata \ | jq .ContainerInstanceArn | tr -d \") aws ecs update-container-instances-state --cluster $ECS_CLUSTER \ --container-instances $CONTAINER_INSTANCE --status DRAINING fi done

When you set a container instance to DRAINING, ECS prevents new tasks from being scheduled for placement on the container instance. If the resources are available, replacement service tasks are started on other container instances in the cluster. Container instance draining enables you to remove a container instance from a cluster without impacting tasks in your cluster. Service tasks on the container instance that are in the PENDING state are stopped immediately.

Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service’s deployment configuration parameters, minimumHealthyPercent and maximumPercent.

ECS on Spot Instances in action

Want to see how customers are already powering their ECS clusters on Spot Instances? Our friends at Mapbox are doing just that.

Mapbox is a platform for designing and publishing custom maps. The company uses ECS to power their entire batch processing architecture to collect and process over 100 million miles of sensor data per day that they use for powering their maps. They also optimize their batch processing architecture on ECS using Spot Instances.

The Mapbox platform powers over 5,000 apps and reaches more than 200 million users each month. Its backend runs on ECS, allowing it to serve more than 1.3 billion requests per day. To learn more about their recent migration to ECS, read their recent blog post, We Switched to Amazon ECS, and You Won’t Believe What Happened Next. Then, in their follow-up blog post, Caches to Cash, learn how they are running their entire platform on Spot Instances, allowing them to save upwards of 50–90% on their EC2 costs.

Conclusion

We hope that you are as excited as we are about running your containerized applications at scale and cost effectively using Spot Instances. For more information, see the following pages:

If you have comments or suggestions, please comment below.