Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/implementing-interruption-tolerance-in-amazon-ec2-spot-with-aws-fault-injection-simulator/

This post is written by Steve Cole, WW SA Leader for EC2 Spot, and David Bermeo, Senior Product Manager for EC2.

On October 20, 2021, AWS released new functionality to AWS Fault Injection Simulator that supports triggering the interruption of Amazon EC2 Spot Instances. This functionality lets you test the fault tolerance of your software by interrupting instances on command. The triggered interruption is preceded by a Rebalance Recommendation (RBR) and an Instance Termination Notification (ITN), so that you can fully test your applications as if an actual Spot interruption had occurred.

In this post, we'll provide two examples of how easy it has now become to simulate Spot interruptions and validate the fault tolerance of an application or service. We will demonstrate testing an application through the console and a service via the CLI.

Engineering use-case (console)

Whether you are building a Spot-capable product or service from scratch or evaluating the Spot compatibility of existing software, the first step in testing is identifying whether or not the software is tolerant of being interrupted. In the past, one way this was accomplished was with an AWS open-source tool called the Amazon EC2 Metadata Mock. This tool let customers simulate a Spot interruption as if it had been delivered through the Instance Metadata Service (IMDS), which then let customers test how their code responded to an RBR or an ITN. However, this wasn't a direct plug-and-play simulation of how an actual Spot interruption would occur, since the signal wasn't coming from AWS.
In particular, the method didn't provide the centralized notifications available through Amazon EventBridge or Amazon CloudWatch Events that enable off-instance activities like launching AWS Lambda functions or other orchestration work when an RBR or ITN is received. Now, Fault Injection Simulator has removed the need for custom logic, since it lets RBR and ITN signals be delivered via the standard IMDS and event services simultaneously.

Let's walk through the process in the AWS Management Console. We'll identify an instance that's hosting a perpetually running queue worker that checks the IMDS before pulling messages from Amazon Simple Queue Service (SQS). It will be part of a service stack that is scaled in and out based on the queue depth. Our goal is to make sure that the IMDS is being polled properly so that no new messages are pulled once an ITN is received. The typical processing time of a message in this example is 30 seconds, so we can wait for an ITN (which provides a two-minute warning) and need not act on an RBR.

First, we go to Fault Injection Simulator in the AWS Management Console to create an experiment. At the experiment creation screen, we start by creating an optional name (recommended for console use) and a description, and then selecting an IAM Role. If this is the first time that you've used Fault Injection Simulator, then you'll need to create an IAM Role per the directions in the FIS IAM permissions documentation. I've named the role that we created 'FIS.'

After that, I'll select an action (interrupt) and identify a target (the instance). First, I name the action. The Action type I want is to interrupt the Spot Instance: aws:ec2:send-spot-instance-interruptions. In the Action parameters, we are given the option to set the duration. The minimum value here is two minutes, below which you will receive an error, since Spot Instances always receive a two-minute warning.
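The queue-worker behavior described above can be sketched in Python. This is a minimal illustration, not the actual code from the example: the message source and handler are injectable stand-ins, and IMDSv2 token handling is omitted for brevity. The IMDS returns HTTP 404 on the spot/instance-action path until an interruption is scheduled.

```python
import urllib.request
import urllib.error

IMDS_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending(fetch=None):
    """Return True if the IMDS reports a pending Spot interruption.

    `fetch` is injectable for testing; by default it queries the IMDS,
    which returns HTTP 404 until an interruption is scheduled.
    """
    if fetch is None:
        def fetch(url):
            try:
                with urllib.request.urlopen(url, timeout=1) as resp:
                    return resp.status
            except urllib.error.HTTPError as e:
                return e.code
    return fetch(IMDS_ACTION_URL) == 200

def worker_loop(pull_message, handle, fetch=None):
    """Pull and process messages until an ITN is observed or the queue drains."""
    processed = 0
    while not interruption_pending(fetch):
        msg = pull_message()   # e.g. a single SQS receive_message call
        if msg is None:
            break
        handle(msg)
        processed += 1
    return processed
```

Because the metadata check is injectable, the stop-on-ITN behavior can be verified without an EC2 instance by stubbing `fetch` to return 404 twice and then 200.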
The advantage here is that, by setting durationBeforeInterruption to a value above two minutes, you will get the RBR (an optional point for you to respond) and the ITN (the actual two-minute warning) at different points in time, and this lets you respond to one or both.

The target instance that we launched is depicted in the following screenshot. It is a Spot Instance that was launched as a persistent request with its interruption action set to 'stop' instead of 'terminate.' The option to stop a Spot Instance, introduced in 2020, lets us restart the instance, log in and retrieve logs, update code, and perform other work necessary to implement Spot interruption handling.

Now that an action has been defined, we configure the target. We have the option of naming the target, which we've done here to match the Name tag on the EC2 instance, 'qWorker'. The target method we want to use here is Resource ID, and then we can either type or select the desired instance from a drop-down list. Selection mode will be 'all', as there is only one instance. If we were using tags, which we will in the next example, then we'd be able to select a count of instances, up to five, instead of just one.

Once you've saved the Action, the Target, and the Experiment, you'll be able to begin the experiment by selecting 'Start' from the Action menu at the top right of the screen. After the experiment starts, you'll be able to observe its state by refreshing the screen. Generally, the process will take just seconds, and you should be greeted by the Completed state, as seen in the following screenshot.

In the following screenshot, having opened the interruption log group created in CloudWatch Logs, we can see the JSON of the RBR. Two minutes later, we see the ITN in the same log group. Another two minutes after the ITN, we can see that the EC2 instance is in the process of stopping (or terminating, if you so elect).
Shortly after the stop is issued by EC2, we can see the instance stopped. It would now be possible to restart the instance and view logs, make code changes, or do whatever you find necessary before testing again.

Now that our experiment succeeded in interrupting our Spot Instance, we can evaluate the performance of the code running on the instance. It should have completed the processing of any messages already retrieved at the ITN, and it should not have pulled any new messages afterward.

This experiment can be saved for later use, but it will require selecting the specific instance each time that it's run. We can also re-use the experiment template by using tags instead of an instance ID, as we'll show in the next example. This shouldn't prove troublesome for infrequent experiments, especially those run through the console. Or, as we did in our example, you can set the instance interruption behavior to stop (versus terminate) and re-use the experiment as long as that particular instance continues to exist. When the experiments get more frequent, it might be advantageous to automate the process, possibly as part of the test phase of a CI/CD pipeline. Doing this is possible programmatically through the AWS CLI or SDK.

Operations use-case (CLI)

Once the developers of our product validate the single-instance fault tolerance, indicating that the target workload is capable of running on Spot Instances, the next logical step is to deploy the product as a service on multiple instances. This allows for more comprehensive testing of the service as a whole, and it is a key process in collecting valuable information, such as performance data, response times, error rates, and other metrics to be used in the monitoring of the service.
Once data has been collected on a non-interrupted deployment, it is then possible to use the Spot interruption action of Fault Injection Simulator to observe how well the service handles RBR and ITN while running, and to see how those events influence the metrics collected previously.

When testing a service, whether it is launched as instances in an Amazon EC2 Auto Scaling group, as part of one of the AWS container services such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS), in EC2 Fleet or Amazon EMR, or across any instances with descriptive tagging, you now have the ability to trigger Spot interruptions on as many as five instances in a single Fault Injection Simulator experiment.

We'll use tags, as opposed to instance IDs, to identify candidates for interruption, so that we can interrupt multiple Spot Instances simultaneously. We can further refine the candidate targets with one or more filters in our experiment, for example targeting only running instances if we perform an action repeatedly.

In the following example, we will be interrupting three instances in an Auto Scaling group that is backing a self-managed EKS node group. We already know the software will behave as desired from our previous engineering tests. Our goal here is to see how quickly EKS can launch replacement tasks and to identify how the service as a whole responds during the event. In our experiment, we will identify instances that carry the tag aws:autoscaling:groupName with a value of "spotEKS". The key benefit here is that we don't need a list of instance IDs in our experiment. Therefore, this is a re-usable experiment that can be incorporated into test automation without needing to make specific selections in previous steps, such as collecting instance IDs from the target Auto Scaling group.
We start by creating a file that describes our experiment in JSON rather than through the console:

{
    "description": "interrupt multiple random instances in ASG",
    "targets": {
        "spotEKS": {
            "resourceType": "aws:ec2:spot-instance",
            "resourceTags": {
                "aws:autoscaling:groupName": "spotEKS"
            },
            "selectionMode": "COUNT(3)"
        }
    },
    "actions": {
        "interrupt": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "description": "interrupt multiple instances",
            "parameters": {
                "durationBeforeInterruption": "PT4M"
            },
            "targets": {
                "SpotInstances": "spotEKS"
            }
        }
    },
    "stopConditions": [
        {
            "source": "none"
        }
    ],
    "roleArn": "arn:aws:iam::xxxxxxxxxxxx:role/FIS",
    "tags": {
        "Name": "multi-instance"
    }
}

Then we upload the experiment template to Fault Injection Simulator from the command line:

aws fis create-experiment-template --cli-input-json file://experiment.json

The response we receive returns our template along with an ID, which we'll need to execute the experiment.

{
    "experimentTemplate": {
        "id": "EXT3SHtpk1N4qmsn",
        ...
    }
}

We then execute the experiment from the command line using the ID that we were given at template creation:

aws fis start-experiment --experiment-template-id EXT3SHtpk1N4qmsn

We then receive confirmation that the experiment has started.

{
    "experiment": {
        "id": "EXPaFhEaX8GusfztyY",
        "experimentTemplateId": "EXT3SHtpk1N4qmsn",
        "state": {
            "status": "initiating",
            "reason": "Experiment is initiating."
        },
        ...
    }
}

To check the status of the experiment as it runs, which for interrupting Spot Instances is quite fast, we can query the experiment ID for success or failure messages as follows:

aws fis get-experiment --id EXPaFhEaX8GusfztyY

And finally, we can confirm the results of our experiment by listing our instances through EC2.
Here we use the following command line before and after execution to generate pre- and post-experiment output:

aws ec2 describe-instances --filters \
    Name='tag:aws:autoscaling:groupName',Values='spotEKS' \
    Name='instance-state-name',Values='running' \
    | jq .Reservations[].Instances[].InstanceId | sort

We can then compare the two to identify which instances were interrupted and which were launched as replacements.

< "i-003c8d95c7b6e3c63"
< "i-03aa172262c16840a"
< "i-02572fa37a61dc319"
---
> "i-04a13406d11a38ca6"
> "i-02723d957dc243981"
> "i-05ced3f71736b5c95"

Summary

In the previous examples, we have demonstrated through the console and the command line how you can use the Spot interruption action in Fault Injection Simulator to ascertain how your software and service will behave when encountering a Spot interruption. Simulating Spot interruptions helps you assess the fault tolerance of your software and the impact of interruptions on a running service. The addition of events enables more tooling, and being able to simulate both ITNs and RBRs, along with the Capacity Rebalance feature of Auto Scaling groups, now matches the end-to-end experience of an actual AWS interruption. Get started simulating Spot interruptions in the console.
Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/identifying-optimal-locations-for-flexible-workloads-with-spot-placement-score/

This post is written by Jessie Xie, Solutions Architect for EC2 Spot, and Peter Manastyrny, Senior Product Manager for EC2 Auto Scaling and EC2 Fleet.

Amazon EC2 Spot Instances let you run flexible, fault-tolerant, or stateless applications in the AWS Cloud at up to a 90% discount from On-Demand prices. Since we introduced Spot Instances back in 2009, we have been building new features and integrations with a single goal: to make Spot easy and efficient to use for your flexible compute needs.

Spot Instances are spare EC2 compute capacity in the AWS Cloud available for steep discounts. In exchange for the discount, Spot Instances are interruptible and must be returned when EC2 needs the capacity back. The location and amount of spare capacity available at any given moment is dynamic and changes in real time. This is why Spot workloads should be flexible, meaning they can utilize a variety of different EC2 instance types and can be shifted in real time to where the spare capacity currently is. You can use Spot Instances with tools such as EC2 Fleet and Amazon EC2 Auto Scaling, which make it easy to run workloads on multiple instance types.

The AWS Cloud spans 81 Availability Zones across 25 Regions, with plans to launch 21 more Availability Zones and 7 more Regions. However, until now there was no way to find an optimal location (either a Region or an Availability Zone) to fulfill your Spot capacity needs without trying to launch Spot Instances there first. Today, we are excited to announce Spot placement score, a new feature that helps you identify an optimal location to run your workloads on Spot Instances. Spot placement score recommends an optimal Region or Availability Zone based on the amount of Spot capacity you need and your instance type requirements.
Spot placement score is useful for workloads that could potentially run in a different Region. Additionally, because the score takes into account your instance type selection, it can help you determine whether your request is sufficiently instance type flexible for your chosen Region or Availability Zone.

How Spot placement score works

To use Spot placement score, you need to specify the amount of Spot capacity you need, what your instance type requirements are, and whether you would like a recommendation for a Region or for a single Availability Zone. For instance type requirements, you can either provide a list of instance types or provide instance attributes, like the number of vCPUs and the amount of memory. If you choose the instance attributes option, you can then use the same attribute configuration to request your Spot Instances in the recommended Region or Availability Zone with the new attribute-based instance type selection feature in EC2 Fleet or EC2 Auto Scaling.

Spot placement score provides a list of Regions or Availability Zones, each scored from 1 to 10, based on factors such as the requested instance types, target capacity, historical and current Spot usage trends, and the time of the request. The score reflects the likelihood of success when provisioning Spot capacity, with a 10 meaning that the request is highly likely to succeed. Provided scores change based on the current Spot capacity situation, and the same request can yield different scores when run at different times. It is important to note that the score serves as a guideline, and no score guarantees that your Spot request will be fully or partially fulfilled.

You can also filter your score by Regions or Availability Zones, which is useful for cases where you can use only a subset of AWS Regions, for example any Region in the United States.

Let's see how Spot placement score works in practice through an example.
Using Spot placement score with the AWS Management Console

To try Spot placement score, log into your AWS account, select EC2, then Spot Requests, and click on Spot placement score to open the Spot placement score window.

Here, you need to provide your target capacity and instance type requirements by clicking on Enter requirements. You can enter target capacity as a number of instances, vCPUs, or amount of memory. The vCPUs and memory options are useful for vertically scalable workloads that are sized for a total amount of compute resources and can utilize a wide range of instance sizes. Target capacity is limited, based on your recent Spot usage and accounting for potential usage growth. For accounts that do not have recent Spot usage, there is a default limit aligned with the Spot Instances limit.

For instance type requirements, there are two options. The first option is to select the Specify instance attributes that match your compute requirements tab and enter your compute requirements as a number of vCPUs, amount of memory, CPU architecture, and other optional attributes. The second option is to select the Manually select instance types tab and select instance types from the list. Note that you need to select at least three different instance types (that is, different families, generations, or sizes). If you specify a smaller number of instance types, Spot placement score will always yield a low score. Spot placement score is designed to help you find an optimal location to request Spot capacity tailored to your specific workload needs; it is not intended for getting high-level Spot capacity information across all Regions and instance types.

Let's try to find an optimal location to run a workload that can utilize r5.8xlarge, c5.9xlarge, and m5.8xlarge instance types and is sized at 2000 instances.
Once you select 2000 instances under Target capacity, select r5.8xlarge, c5.9xlarge, and m5.8xlarge instances under Select instance types, and click the Load placement score button, you will get a list of Regions sorted by score in descending order. There is also an option to filter by specific Regions if needed.

The highest-rated Region for your requirements turns out to be US East (N. Virginia), with a score of 8. The second-closest contender is Europe (Ireland), with a score of 5. That tells you that right now the optimal Region for your Spot requirements is US East (N. Virginia).

Let's now see if it is possible to get a higher score. Remember, the key best practice for Spot is to be flexible and utilize as many instance types as possible. To do that, press the Edit button on the Target capacity and instance type requirements tab. For the new request, keep the same target capacity at 2000, but expand the selection of instance types by adding similarly sized instance types from a variety of instance families and generations: r5.4xlarge, r5.12xlarge, m5zn.12xlarge, m5zn.6xlarge, m5n.8xlarge, m5dn.8xlarge, m5d.8xlarge, r5n.8xlarge, r5dn.8xlarge, r5d.8xlarge, c5.12xlarge, c5.4xlarge, c5d.12xlarge, c5n.9xlarge, c5d.9xlarge, m4.4xlarge, m4.16xlarge, m4.10xlarge, r4.8xlarge, and c4.8xlarge.

After requesting the scores with the updated requirements, you can see that even though the score in US East (N. Virginia) stays unchanged at 8, the scores for Europe (Ireland) and US West (Oregon) improved dramatically, both rising to 9. Now you have a choice of three high-scoring Regions in which to request your Spot Instances, each with a high likelihood of success.

To request Spot Instances based on the score, you can use EC2 Fleet or EC2 Auto Scaling. Please note that the score assumes that you use the capacity-optimized Spot allocation strategy when requesting the capacity.
If you use other allocation strategies, such as lowest-price, the results in the recommended Region or Availability Zone will not align with the score provided.

You can also request the scores at the Availability Zone level. This is useful for workloads that need to have all instances in the same Availability Zone, potentially to minimize inter-Availability Zone data transfer costs. Workloads such as Apache Spark, which involve transferring a high volume of data between instances, would be a good use case for this. To get scores per Availability Zone, check the box Provide placement scores per Availability Zone. When requesting instances based on an Availability Zone recommendation, make sure to configure the EC2 Fleet or EC2 Auto Scaling request to use only that specific Availability Zone.

With Spot placement score, you can test different instance type combinations at different points in time and find the optimal Region or Availability Zone to run your workloads on Spot Instances.

Availability and pricing

You can use Spot placement score today in all public and AWS GovCloud Regions, with the exception of those based in China, where we plan to release it later. You can access Spot placement score using the AWS Command Line Interface (CLI), AWS SDKs, and the Management Console. There is no additional charge for using Spot placement score; you only pay standard EC2 rates when provisioning instances based on its recommendations. To learn more about using Spot placement score, visit the Spot placement score documentation page. To learn more about best practices for using Spot Instances, see the Spot documentation.
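As a sketch of how you might retrieve and rank scores programmatically, the following Python snippet calls the EC2 GetSpotPlacementScores API via boto3 and sorts the results. The `rank_scores` helper and its threshold are our own illustrative additions, and the response handling assumes the documented shape of the API's SpotPlacementScores list.

```python
def rank_scores(scores, min_score=5):
    """Sort placement-score entries highest-first, dropping low scores.

    `scores` is a list of dicts shaped like the SpotPlacementScores
    entries returned by GetSpotPlacementScores, e.g.
    {"Region": "us-east-1", "Score": 8}.
    """
    viable = [s for s in scores if s.get("Score", 0) >= min_score]
    return sorted(viable, key=lambda s: s["Score"], reverse=True)

def get_region_scores(instance_types, target_capacity):
    """Fetch Region-level placement scores (requires AWS credentials)."""
    import boto3  # imported here so the pure helper above has no AWS dependency
    ec2 = boto3.client("ec2")
    resp = ec2.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        SingleAvailabilityZone=False,
    )
    return rank_scores(resp["SpotPlacementScores"])
```

For the example in this post you would call `get_region_scores(["r5.8xlarge", "c5.9xlarge", "m5.8xlarge"], 2000)` and request capacity in the top-ranked Region.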
Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/amazon-ec2-auto-scaling-will-no-longer-add-support-for-new-ec2-features-to-launch-configurations/

This post is written by Scott Horsfield, Principal Solutions Architect, EC2 Scalability, and Surabhi Agarwal, Sr. Product Manager, EC2.

In 2010, AWS released launch configurations as a way to define the parameters of instances launched by EC2 Auto Scaling groups. In 2017, AWS released launch templates, the successor of launch configurations, as a way to streamline and simplify the launch process for Auto Scaling, Spot Fleet, Amazon EC2 Spot Instances, and On-Demand Instances. Launch templates define the steps required to create an instance by capturing instance parameters in a resource that can be used across multiple services. Launch configurations have continued to live alongside launch templates but haven't benefited from all of the features we've added to launch templates.

Today, AWS is recommending that customers using launch configurations migrate to launch templates. We will continue to support and maintain launch configurations, but we will not be adding any new features to them; we will focus on adding new EC2 features to launch templates only. You can continue using launch configurations, and AWS is committed to supporting applications you have already built using them, but in order for you to take advantage of our most recent and upcoming releases, a migration to launch templates is recommended. Additionally, we plan to no longer support new instance types with launch configurations by the end of 2022. Our goal is to have all customers moved over to launch templates by then.

Moving to launch templates is simple and can be done today. In this blog, we provide more details on how you can transition from launch configurations to launch templates.
If you are unable to transition to launch templates due to a lack of tooling or specific functionality, or if you have any concerns, please contact AWS Support.

Launch templates vs. launch configurations

Launch configurations have been a part of Amazon EC2 Auto Scaling groups since 2010. Customers use launch configurations to define Auto Scaling group configurations that include the AMI and instance type definition. In 2017, AWS released launch templates, which reduce the number of steps required to create an instance by capturing all launch parameters within one resource that can be used across multiple services. Since then, AWS has released many new features, such as Mixed Instances Policies with Auto Scaling groups, Targeted Capacity Reservations, and unlimited mode for burstable performance instances, that only work with launch templates.

Launch templates provide several key benefits when compared to launch configurations: they can improve the availability and optimization of the workloads you host in Auto Scaling groups, and they allow you to access the full set of EC2 features when launching instances in an Auto Scaling group.

How to determine where you are using launch configurations

Use the Launch Configuration Inventory Script to find all of the launch configurations in your account. You can use this script to generate an inventory of launch configurations across all Regions in a single account or all accounts in your AWS Organization. The script can be run with a variety of options for different levels of account access; you can learn more about these options in this GitHub post. In its simplest form, it will use the default credentials profile to inventory launch configurations across all Regions in a single account.
Once the script has completed, you can view the generated inventory.csv file to get a sense of how many launch configurations may need to be converted to launch templates or deleted.

How to transition to launch templates today

If you're ready to move to launch templates now, making the transition is simple and mostly automated through the AWS Management Console. For customers who do not use the AWS Management Console, most popular Infrastructure as Code (IaC) tools, such as CloudFormation and Terraform, already support launch templates, as do the AWS CLI and SDKs. To perform this transition, you will need to ensure that your user has the required permissions. Here are some examples to get you started.

AWS Management Console
Instances launched by this Auto Scaling group continue to run and are not automatically replaced by making this change. Any instance launched after making this change uses the launch template for its configuration. As your Auto Scaling group scales up and down, the older instances are replaced. If you'd like to force an update, you can use Instance Refresh to ensure that all instances are running the same launch template and version.

CloudFormation and Terraform

If you use CloudFormation to create and manage your infrastructure, you should use the AWS::EC2::LaunchTemplate resource to create launch templates. After adding a launch template resource to your CloudFormation stack template file, update your Auto Scaling group resource definition by adding a LaunchTemplate property and removing the existing LaunchConfigurationName property. We have several examples available to help you get started.

Using launch templates with Terraform is a similar process. Update your template file to include an aws_launch_template resource, and then update your aws_autoscaling_group resources to reference the launch template.

In addition to making these changes, you may also want to consider adding a MixedInstancesPolicy to your Auto Scaling group. A MixedInstancesPolicy allows you to configure your Auto Scaling group with multiple instance types and purchase options. This helps improve the availability and optimization of your applications. Some examples of these benefits include using Spot Instances and On-Demand Instances within the same Auto Scaling group, combining CPU architectures such as Intel, AMD, and Arm (Graviton2), and having multiple instance types configured in case of a temporary capacity issue. You can generate and configure example templates for CloudFormation and Terraform in the AWS Management Console.
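The migration can also be scripted. The following Python sketch shows the core of such a script: translating a launch configuration description (as returned by the Auto Scaling DescribeLaunchConfigurations API) into the LaunchTemplateData structure expected by CreateLaunchTemplate. The function and its field map are our own illustration, cover only a few common fields, and assume VPC security groups (where the launch configuration already stores group IDs).

```python
def lc_to_launch_template_data(lc):
    """Map a DescribeLaunchConfigurations entry to LaunchTemplateData.

    Covers only a few common fields; real launch configurations carry
    more (block device mappings, monitoring, IAM instance profile, and
    so on), which would need entries of their own here.
    """
    field_map = {
        "ImageId": "ImageId",
        "InstanceType": "InstanceType",
        "KeyName": "KeyName",
        "UserData": "UserData",
        # Launch templates take group IDs; this assumes the launch
        # configuration's SecurityGroups already holds sg-... IDs (VPC).
        "SecurityGroups": "SecurityGroupIds",
    }
    data = {}
    for lc_key, lt_key in field_map.items():
        if lc.get(lc_key):  # skip absent or empty values
            data[lt_key] = lc[lc_key]
    return data
```

The resulting dict could then be passed to boto3's `ec2.create_launch_template(LaunchTemplateName=..., LaunchTemplateData=data)`, after which the Auto Scaling group is updated to reference the new template.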
AWS CLI

If you're using the AWS CLI to create and manage your Auto Scaling groups, these examples will show you how to accomplish common tasks when using launch templates.

SDKs

AWS SDKs already include APIs for creating launch templates. If you're using one of our SDKs to create and configure your Auto Scaling groups, you can find more information in the SDK documentation for your language of choice.

Next steps

We're excited to help you take advantage of the latest EC2 features by making the transition to launch templates as seamless as possible. As we make this transition together, we're here to help and will continue to communicate our plans and timelines. If you are unable to transition to launch templates due to a lack of tooling or functionality, or if you have any concerns, please contact AWS Support. Also, stay tuned for more information on tools to help make this transition easier for you.
Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-run-massively-multiplayer-games-with-ec2-spot-using-aurora-serverless/

This post is written by Yahav Biran, Principal Solutions Architect, and Pritam Pal, Sr. EC2 Spot Specialist SA.

Massively multiplayer online (MMO) game servers must dynamically scale their compute and storage to create a world-scale persistence simulation with millions of dynamic objects, such as complex AR/VR synthetic environments that match real-world fidelity. Amazon Elastic Kubernetes Service (EKS) powered by Amazon EC2 Spot, together with Aurora Serverless, lets customers create world-scale persistence simulations processed by numerous cost-effective compute chipsets, such as Arm, x86, or NVIDIA GPU, and persist them on demand with an automatically scaling configuration of open-source database engines like MySQL or PostgreSQL, without managing any database capacity. This post proposes a fully open-sourced option to build, deploy, and manage MMOs on AWS. We use a Python-based game server to demonstrate the MMO.

Challenge

Increasing competition in the gaming industry has driven game developers to architect cost-effective game servers that can scale up to meet player demands and scale down to meet cost goals. AWS enables MMO developers to scale their game beyond the limits of a single server. The game state (world) can be spatially partitioned across many Regions based on the requested number of sessions or simulations using Amazon EC2 compute power. As the game progresses over many sessions, you must track the simulation's global state. Amazon Aurora maintains the global game state in memory to manage complex interactions, such as hand-offs across instances and Regions. Amazon Aurora Serverless powers all of this for PostgreSQL or MySQL.

This blog shows how to use a commodity server with an ephemeral disk and decouple the game state from the game server.
We store the game state in Aurora for PostgreSQL, but you can also use Amazon DynamoDB or Amazon Keyspaces for the NoSQL case.

Game overview

We use a Minecraft clone to demonstrate a distributed persistence simulation. The game server is Python-based and deployed on Agones, an open-source multiplayer dedicated game-server platform on Kubernetes. The Kubernetes cluster is powered by EC2 Spot Instances and configured to auto scale, expanding and shrinking compute seamlessly upon game-server allocation. We add a git-ops-based continuous delivery system that stores the game-server binaries and config in a git repository and deploys the game in a cluster deployed in one or more Regions to allow global compute scale. The following image is a diagram of the architecture.

The game server persists every object in an Aurora Serverless PostgreSQL-compatible edition. The serverless database configuration aids automatic start-up and scales capacity up or down as per player demand. The world is divided into 32x32 block chunks in the XYZ plane (Y is up). This allows it to be "infinite" (PostgreSQL bigint type) and eases data management. Only visible chunks must be queried from the database.

The central database table is named "block" and has the columns p, q, x, y, z, w. (p, q) identifies the chunk, (x, y, z) identifies the block position, and (w) identifies the block type, with 0 representing an empty block (air). In the game, the chunks store their blocks in a hash map: an (x, y, z) key maps to a (w) value. The y positions of blocks are limited to 0 <= y < 256. The upper limit is essentially an artificial limitation that prevents users from building tall structures, and users cannot destroy blocks at y = 0, to avoid falling beneath the world.

Solution overview

Kubernetes allows dedicated game-server scaling and orchestration without limiting the compute platform, spanning across many Regions and staying closer to the player. For simplicity, we use EKS to reduce operational overhead.
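The chunk scheme above maps naturally to a couple of small helpers. The following is a minimal sketch (the names are ours, not taken from the game's source): a world coordinate is reduced to its chunk index by floor division, and a chunk's blocks live in a dict keyed by position, mirroring one (p, q) slice of the "block" table.

```python
CHUNK_SIZE = 32

def chunked(coord):
    """Map a world coordinate (x or z) to its chunk index (p or q)."""
    return coord // CHUNK_SIZE  # floor division also handles negative coordinates

class Chunk:
    """In-memory mirror of one (p, q) slice of the 'block' table."""
    def __init__(self, p, q):
        self.p, self.q = p, q
        self.blocks = {}  # (x, y, z) -> w; an absent key means air (w = 0)

    def set_block(self, x, y, z, w):
        if not 0 <= y < 256:
            raise ValueError("y must satisfy 0 <= y < 256")
        if w == 0:
            self.blocks.pop((x, y, z), None)  # storing air removes the entry
        else:
            self.blocks[(x, y, z)] = w

    def get_block(self, x, y, z):
        return self.blocks.get((x, y, z), 0)
```

With this layout, rendering a player's view only requires loading the chunks whose (p, q) indices fall inside the visible range, which is exactly the "only visible chunks must be queried" property described above.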
Amazon EC2 runs the compute simulation, which might require different EC2 instance types: compute-optimized instances for compute-bound applications that benefit from high-performance processors, or accelerated-compute (GPU) instances that use hardware accelerators to perform functions like graphics processing or data pattern matching. In addition, EC2 Auto Scaling runs the game servers on Amazon EC2 Spot Instances, which offer discounts of up to 90% compared to On-Demand Instance prices. However, Amazon EC2 can interrupt your Spot Instance when the demand for Spot Instances rises, when the supply of Spot Instances decreases, or when the Spot price exceeds your maximum price. The following two mechanisms minimize the impact when EC2 reclaims compute capacity:
Two Auto Scaling groups are deployed for method two. The first Auto Scaling group uses latest-generation Spot Instances (C5, C5a, C5n, M5, M5a, and M5n), and the second uses previous-generation x86-based instances (C4 and M4). We configure the cluster-autoscaler, which controls the Auto Scaling group size, with the priority expander option in order to favor the latest-generation Spot Auto Scaling group. The priority must be a positive value, and the highest value wins; for each priority value, a list of regular expressions is given. The following example assumes an EKS cluster craft-us-west-2 and two ASGs, where the craft-us-west-2-nodegroup-spot5 Auto Scaling group wins the priority, so new instances are spawned from the EC2 Spot Auto Scaling group. The complete working spec is available at https://github.com/aws-samples/spotable-game-server/blob/master/specs/cluster_autoscaler.yml.

We propose the following two options to minimize player impact during interruption. The first is based on the two-minute interruption notification, and the second on rebalance recommendations.
Choosing between the two depends on the transition you want for the player. Both cases deploy a DaemonSet that listens for either notification and notifies every game server running on the EC2 instance; it also prevents new game servers from being scheduled on that instance. The DaemonSet polls the instance metadata every five seconds (denoted by POLL_INTERVAL), where NOTICE_URL can be either

NOTICE_URL="http://169.254.169.254/latest/meta-data/spot/termination-time"

or, for the second option:

NOTICE_URL="http://169.254.169.254/latest/meta-data/events/recommendations/rebalance"

The command that notifies all the game servers about the interruption is:

kubectl drain ${NODE_NAME} --force --ignore-daemonsets --delete-local-data

From that point, every game server running on the instance is notified via the Unix signal SIGTERM. In our example server.py, we tell the OS to signal the Python process and run the sig_handler function, which prompts a message to every connected player regarding the incoming interruption.

Why Agones?

Agones orchestrates game servers via declarative configuration in order to manage groups of game servers that are ready to play. It also offers an integrated SDK for managing game-server lifecycle, health, and configuration. Finally, it runs on Kubernetes, so it is an entirely open-source platform that runs anywhere. The Agones SDK is easily implemented. Furthermore, combining the AWS compute platform and Agones offers a secure, resilient, scalable, and cost-effective method for running an MMO. In our example, we implemented the /health call in agones_health and the /allocate call in agones_allocate. agones_health() is called upon server init to indicate that the server is ready to assign new players, and it relays the server's native health status to keep the game server in a viable state. The main() function forks a new process that reports health.
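A minimal sketch of that SIGTERM pattern follows. This is an assumed shape of the server.py logic described above, not the actual game-server code:

```python
# Trap SIGTERM so a Spot interruption drains the game server gracefully
# instead of killing the process mid-session. (Assumed shape of the
# server.py pattern described above, not the actual game-server code.)
import os
import signal

interrupted = False

def sig_handler(signum, frame):
    # The real server would broadcast a warning to every connected player
    # and stop accepting new sessions here.
    global interrupted
    interrupted = True
    print("Spot interruption incoming: notifying players")

signal.signal(signal.SIGTERM, sig_handler)

# Simulate the drain delivering SIGTERM to this process:
os.kill(os.getpid(), signal.SIGTERM)
print(interrupted)  # True
```

Because `kubectl drain` sends SIGTERM to every Pod on the node, registering this handler is all a game server needs to be told about an incoming interruption.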
Agones manages the port allocation and maintains the game-server state, e.g., Allocated, Scheduled, Shutdown, Creating, and Unhealthy.

Other ways than Agones on EKS?

Configuring a group of ready Agones game servers behind a load balancer is difficult. Agones game-server endpoints must be published so that players' clients can connect and play. Agones creates game-server endpoints as an IP and port pair, where the IP is the public IP of the EC2 instance and the port results from PortPolicy, which generates a non-predictable port number. Hence, it is impossible to use them with a load balancer such as an Amazon Network Load Balancer (NLB). Suppose you want to use a load balancer to route players via a predictable endpoint. In that case, you could use the Kubernetes Deployment construct and configure a Kubernetes Service with an NLB, with no need to implement an additional SDK or install any additional components on your EKS cluster. The following example defines a service craft-svc that creates an NLB routing TCP connections on port 4080 to Pod targets carrying the selector craft. The game-server Deployment sets the metadata label for the Service load balancer and the port. Furthermore, the Kubernetes readinessProbe and livenessProbe offer features similar to the Agones SDK /allocate and /health calls implemented earlier, giving the Deployment option parity with the Agones option.

Overview of the database

The compute layer running the game could be reclaimed at any time within minutes, so it is imperative to continuously store the game state as players progress without impacting the experience. It is equally essential to read the game state quickly from scratch upon session recovery from an unexpected interruption. Therefore, the game requires fast reads and lazy writes. Further considerations are consistency and isolation: developers could handle inconsistency in the game itself in order to relax hard consistency requirements.
As for isolation, in our Craft example players can build a structure without other players seeing it and publish it globally only when they choose. Choosing the proper database for the MMO depends on the game-state structure, e.g., a set of related objects, as in our Craft example, or a single denormalized table. The former fits the relational model used with RDS or Aurora open-source databases such as MySQL or PostgreSQL, while the latter can be used with Amazon Keyspaces, a managed Cassandra-compatible service, or a key-value store such as DynamoDB. Our example includes two options to store the game state, Aurora Serverless for PostgreSQL or DynamoDB, chosen because of their ACID support. PostgreSQL offers four isolation levels (read uncommitted, read committed, repeatable read, and serializable), which guard against phenomena such as dirty reads, nonrepeatable reads, phantom reads, and serialization anomalies. DynamoDB offers two isolation levels: serializable and read-committed. Both database options allow the game developer to implement the best player experience and avoid additional implementation effort. Moreover, both engines offer RESTful connection methods to the database. Aurora uses the Data API, which doesn't require a persistent database cluster connection; instead, it provides a secure HTTP endpoint and integration with AWS SDKs, so game developers can run SQL statements via the endpoint without managing connections. DynamoDB only supports a RESTful connection. Finally, Aurora Serverless and DynamoDB scale compute and storage to reduce operational overhead, and you pay only for what was played.

Conclusion

MMOs are unique because they require infrastructure features similar to other game types, like FPS or casual games, plus reliable persistent storage for the game state. Unfortunately, this can lead to expensive choices that make monetizing the game difficult. Therefore, we proposed an option with a fun game to help you, the developer, analyze what is best for your MMO.
Our option is built upon open-source projects that allow you to build it anywhere, but we also show that AWS offers the most cost-effective and scalable option. We encourage you to read recent announcements about these topics, including several at AWS re:Invent 2021.
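As a footnote to the database discussion above, the Data API call pattern looks roughly like the following. This is a hedged sketch: the ARNs and table name are placeholders, and a stub client stands in for boto3.client("rds-data") so the call shape can be demonstrated off-AWS:

```python
# Sketch of running SQL through the Aurora Data API: a plain HTTPS call,
# no persistent cluster connection. ARNs and the table name are placeholders.
def run_sql(client, sql, cluster_arn, secret_arn, database="craft"):
    # On AWS, client would be boto3.client("rds-data").
    return client.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database=database,
        sql=sql,
    )

class StubDataApiClient:
    """Stands in for the rds-data client when demonstrating off-AWS."""
    def execute_statement(self, **kwargs):
        return {"records": [], "sql": kwargs["sql"]}

resp = run_sql(
    StubDataApiClient(),
    "SELECT w FROM block WHERE p = 0 AND q = 0",
    cluster_arn="arn:aws:rds:us-west-2:111122223333:cluster:craft",
    secret_arn="arn:aws:secretsmanager:us-west-2:111122223333:secret:craft",
)
print(resp["sql"])  # SELECT w FROM block WHERE p = 0 AND q = 0
```

Because every statement is an independent HTTPS request authenticated via a Secrets Manager ARN, a fleet of interruptible game servers never has to manage or drain a connection pool.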
Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/introducing-spot-blueprints-a-template-generator-for-frameworks-like-kubernetes-and-apache-spark/ This post is authored by Deepthi Chelupati, Senior Product Manager for Amazon EC2 Spot Instances, and Chad Schmutzer, Principal Developer Advocate for Amazon EC2

Customers have been using EC2 Spot Instances to save money and scale workloads to new levels for over a decade. Launched in late 2009, Spot Instances are spare Amazon EC2 compute capacity in the AWS Cloud available for steep discounts off On-Demand Instance prices. One thing customers love about Spot Instances is their integration across many services on AWS, the AWS Partner Network, and open-source software. These integrations unlock the ability to take advantage of the deep savings and scale Spot Instances provide for interruptible workloads. Some of the most popular services used with Spot Instances include Amazon EC2 Auto Scaling, Amazon EMR, AWS Batch, Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (EKS). When we talk with customers who are early in their cost-optimization journey about the advantages of using Spot Instances with their favorite workloads, they typically can't wait to get started. They often tell us that while they'd love to start using Spot Instances right away, flipping between documentation, blog posts, and the AWS Management Console is time-consuming and eats up precious development cycles. They want to know the fastest way to start saving with Spot Instances while feeling confident they've applied best practices. Customers have overwhelmingly told us that the best way for them to get started quickly is to see complete workload configuration examples in infrastructure as code templates for AWS CloudFormation and HashiCorp Terraform. To address this feedback, we launched Spot Blueprints.
Spot Blueprints overview

Today we are excited to tell you about Spot Blueprints, an infrastructure as code template generator that lives right in the EC2 Spot console. Built from customer feedback, Spot Blueprints guides you through a few short steps designed to gather your workload requirements while explaining and configuring Spot best practices along the way. Unlike traditional wizards, Spot Blueprints generates custom infrastructure as code in real time within each step, so you can easily understand the details of the configuration. Spot Blueprints makes configuring instance-type-agnostic workloads easy by letting you express compute capacity requirements as vCPU and memory; the wizard then automatically expands those requirements into a flexible list of EC2 instance types available in the AWS Region you are operating in. Spot Blueprints also takes high-level Spot best practices, like Availability Zone flexibility, and applies them to your workload compute requirements. For example, it automatically includes all of the required resources and dependencies for creating a Virtual Private Cloud (VPC) with all Availability Zones configured for use. Additional workload-specific Spot best practices are also configured, such as graceful interruption handling for load balancer connection draining, automatic job retries, or container rescheduling. Spot Blueprints makes it easy to keep up with the latest Spot features (like the capacity-optimized allocation strategy and Capacity Rebalancing for EC2 Auto Scaling), and it is continually updated to support new features as they become available. Today, Spot Blueprints supports generating infrastructure as code workload templates for some of the most popular services used with Spot Instances: Amazon EC2 Auto Scaling, Amazon EMR, AWS Batch, and Amazon EKS. You can tell us which blueprint you'd like to see next right in Spot Blueprints. We are excited to hear from you!
What are Spot best practices?We’ve mentioned Spot best practices a few times so far in this blog. Let’s quickly review the best practices and how they relate to Spot Blueprints. When using Spot Instances, it is important to understand a couple of points:
For these reasons, it is important to follow best practices when using Spot Instances in your workloads. We like to call these “Spot best practices.” These best practices can be summarized as follows:
Over the last few years, we've focused on making it easier to follow these best practices by adding features such as the following:

As mentioned earlier, in addition to the Spot best practices we reviewed, each workload with integrated Spot support carries its own additional set of Spot best practices, for example, the ability to implement graceful interruption handling for:
Spot Blueprints is designed to quickly explain and generate templates with Spot best practices for each specific workload. The custom-generated workload templates can be downloaded in either AWS CloudFormation or HashiCorp Terraform format, allowing for further customization and learning before being deployed in your environment. Next, let's walk through configuring and deploying an example Spot blueprint.

Example tutorial

In this example tutorial, we use Spot Blueprints to configure an Apache Spark environment running on Amazon EMR, deploy the template as a CloudFormation stack, run a sample job, and then delete the CloudFormation stack. First, we navigate to the EC2 Spot console and click on "Spot Blueprints". In the next screen, we select the EMR blueprint, and then click "Configure blueprint". A couple of notes here:
In Step 1, we give the blueprint a name and configure permissions. We have the blueprint create new IAM roles to allow Amazon EMR and Amazon EC2 compute resources to make calls to the AWS APIs on your behalf: We see a summary of resources that will be created, along with a code snippet preview. Unlike traditional wizards, you can see the code generated in every step! In Step 2, the network is configured. We create a new VPC with public subnets in all Availability Zones in the Region. This is a Spot best practice because it increases the number of Spot capacity pools, which increases the possibility for EC2 to find and provision the required capacity: We see a summary of resources that will be created, along with a code snippet preview. Here is the CloudFormation code for our template, where you can see the VPC creation, including all dependencies such as the internet gateway, route table, and subnets: attachGateway: DependsOn: - vpc - internetGateway Properties: InternetGatewayId: Ref: internetGateway VpcId: Ref: vpc Type: AWS::EC2::VPCGatewayAttachment internetGateway: DependsOn: - vpc Type: AWS::EC2::InternetGateway publicRoute: DependsOn: - publicRouteTable - internetGateway - attachGateway Properties: DestinationCidrBlock: 0.0.0.0/0 GatewayId: Ref: internetGateway RouteTableId: Ref: publicRouteTable Type: AWS::EC2::Route publicRouteTable: DependsOn: - vpc - attachGateway Properties: Tags: - Key: Name Value: Public Route Table VpcId: Ref: vpc Type: AWS::EC2::RouteTable vpc: Properties: CidrBlock: Fn::FindInMap: - CidrMappings - vpc - CIDR EnableDnsHostnames: true EnableDnsSupport: true Tags: - Key: Name Value: Ref: AWS::StackName Type: AWS::EC2::VPC publicSubnet1RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet1 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet1 publicSubnet1: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc 
AvailabilityZone: us-east-1a CidrBlock: 10.0.0.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ1) publicSubnet2RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet2 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet2 publicSubnet2: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1b CidrBlock: 10.0.1.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ2) publicSubnet3RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet3 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet3 publicSubnet3: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1c CidrBlock: 10.0.2.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ3) publicSubnet4RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet4 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet4 publicSubnet4: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1d CidrBlock: 10.0.3.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ4) publicSubnet5RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet5 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet5 publicSubnet5: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1e CidrBlock: 10.0.4.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet 
(AZ5) publicSubnet6RouteTableAssociation: DependsOn: - publicRouteTable - publicSubnet6 - attachGateway Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: Ref: publicRouteTable SubnetId: Ref: publicSubnet6 publicSubnet6: DependsOn: - attachGateway Type: AWS::EC2::Subnet Properties: VpcId: Ref: vpc AvailabilityZone: us-east-1f CidrBlock: 10.0.5.0/24 MapPublicIpOnLaunch: true Tags: - Key: Name Value: Fn::Sub: ${EnvironmentName} Public Subnet (AZ6)In Step 3, we configure the compute environment. Spot Blueprints makes this easy by allowing us to simply define our capacity units and required compute capacity for each type of node in our EMR cluster. In this walkthrough we will use vCPUs. We select 4 vCPUs for the core node capacity, and 12 vCPUs for the task node capacity. Based on these configurations, Spot Blueprints will apply best practices to the EMR cluster settings. These include using On-Demand Instances for core nodes since interruptions to core nodes can cause instability in the EMR cluster, and using Spot Instances for task nodes because the EMR cluster is typically more tolerant to task node interruptions. Finally, task nodes are provisioned using the capacity-optimized allocation strategy because it helps find the most optimal Spot capacity: You’ll notice there is no need to spend time thinking about which instance types to select – Spot Blueprints takes our application’s minimum vCPUs per instance and minimum vCPU to memory ratio requirements and automatically selects the optimal instance types. This instance type selection process applies Spot best practices by a) using instance types across different families, generations, and sizes, and b) using the maximum number of instance types possible for core (5) and task (15) nodes: Here is the CloudFormation code for our template. 
You can see the EMR cluster creation, the applications being installed (Spark, Hadoop, and Ganglia), the flexible list of instance types and Availability Zones, and the capacity-optimized allocation strategy enabled (along with all dependencies): emrSparkInstanceFleetCluster: DependsOn: - vpc - publicRoute - publicSubnet1RouteTableAssociation - publicSubnet2RouteTableAssociation - publicSubnet3RouteTableAssociation - publicSubnet4RouteTableAssociation - publicSubnet5RouteTableAssociation - publicSubnet6RouteTableAssociation - spotBlueprintsEmrRole Properties: Applications: - Name: Spark - Name: Hadoop - Name: Ganglia Instances: CoreInstanceFleet: InstanceTypeConfigs: - InstanceType: c5.xlarge WeightedCapacity: 4 - InstanceType: c4.xlarge WeightedCapacity: 4 - InstanceType: c3.xlarge WeightedCapacity: 4 - InstanceType: m5.xlarge WeightedCapacity: 4 - InstanceType: m3.xlarge WeightedCapacity: 4 Name: Ref: AWS::StackName TargetOnDemandCapacity: "4" LaunchSpecifications: OnDemandSpecification: AllocationStrategy: lowest-price Ec2SubnetIds: - Ref: publicSubnet1 - Ref: publicSubnet2 - Ref: publicSubnet3 - Ref: publicSubnet4 - Ref: publicSubnet5 - Ref: publicSubnet6 MasterInstanceFleet: InstanceTypeConfigs: - InstanceType: c5.xlarge - InstanceType: c4.xlarge - InstanceType: c3.xlarge - InstanceType: m5.xlarge - InstanceType: m3.xlarge Name: Ref: AWS::StackName TargetOnDemandCapacity: "1" JobFlowRole: Ref: spotBlueprintsEmrEc2InstanceProfile Name: Ref: AWS::StackName ReleaseLabel: emr-5.30.1 ServiceRole: Ref: spotBlueprintsEmrRole Tags: - Key: Name Value: Ref: AWS::StackName VisibleToAllUsers: true Type: AWS::EMR::Cluster emrSparkInstanceTaskFleet: DependsOn: - emrSparkInstanceFleetCluster Properties: ClusterId: Ref: emrSparkInstanceFleetCluster InstanceFleetType: TASK InstanceTypeConfigs: - InstanceType: c5.xlarge WeightedCapacity: 4 - InstanceType: c4.xlarge WeightedCapacity: 4 - InstanceType: c3.xlarge WeightedCapacity: 4 - InstanceType: m5.xlarge WeightedCapacity: 4 - 
InstanceType: m3.xlarge WeightedCapacity: 4 - InstanceType: m4.xlarge WeightedCapacity: 4 - InstanceType: r4.xlarge WeightedCapacity: 4 - InstanceType: r5.xlarge WeightedCapacity: 4 - InstanceType: r3.xlarge WeightedCapacity: 4 - InstanceType: c4.2xlarge WeightedCapacity: 8 - InstanceType: c3.2xlarge WeightedCapacity: 8 - InstanceType: c5.2xlarge WeightedCapacity: 8 - InstanceType: m5.2xlarge WeightedCapacity: 8 - InstanceType: m4.2xlarge WeightedCapacity: 8 - InstanceType: m3.2xlarge WeightedCapacity: 8 LaunchSpecifications: SpotSpecification: TimeoutAction: TERMINATE_CLUSTER TimeoutDurationMinutes: 60 AllocationStrategy: capacity-optimized Name: TaskFleet TargetSpotCapacity: "12" Type: AWS::EMR::InstanceFleetConfigIn Step 4, we have the option of enabling EMR managed scaling on the cluster. Enabling EMR managed scaling is a Spot best practice because this allows EMR to automatically increase or decrease the compute capacity in the cluster based on the workload, further optimizing performance and cost. EMR continuously evaluates cluster metrics to make scaling decisions that optimize the cluster for cost and speed. 
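The WeightedCapacity values in the task fleet above add toward TargetSpotCapacity, and the same InstanceFleetUnits accounting drives the managed scaling limits. A quick illustration of that accounting (a simplified model, not EMR's actual provisioning logic):

```python
# Simplified model of EMR instance-fleet capacity accounting (not EMR's
# actual provisioning logic): each launched instance contributes its
# WeightedCapacity (here, its vCPU count) toward the fleet target.
def capacity_met(target_units, launched):
    """launched is a list of (instance_type, weighted_capacity) pairs."""
    return sum(weight for _itype, weight in launched) >= target_units

# The 12-vCPU task target can be met by three weight-4 xlarge instances...
print(capacity_met(12, [("c5.xlarge", 4), ("m5.xlarge", 4), ("c4.xlarge", 4)]))  # True
# ...or by one weight-8 2xlarge plus one xlarge, but not by two xlarge alone.
print(capacity_met(12, [("c5.2xlarge", 8), ("m5.xlarge", 4)]))  # True
print(capacity_met(12, [("c5.xlarge", 4), ("m5.xlarge", 4)]))   # False
```

Weighting by vCPU is what lets the fleet mix sizes freely: any combination of instance types that sums to the target delivers the same compute.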
Spot Blueprints automatically configures minimum and maximum scaling values based on the compute requirements we defined in the previous step: Here is the updated CloudFormation code for our template with managed scaling (ManagedScalingPolicy) enabled: emrSparkInstanceFleetCluster: DependsOn: - vpc - publicRoute - publicSubnet1RouteTableAssociation - publicSubnet2RouteTableAssociation - publicSubnet3RouteTableAssociation - publicSubnet4RouteTableAssociation - publicSubnet5RouteTableAssociation - publicSubnet6RouteTableAssociation - spotBlueprintsEmrRole Properties: Applications: - Name: Spark - Name: Hadoop - Name: Ganglia Instances: CoreInstanceFleet: InstanceTypeConfigs: - InstanceType: c5.xlarge WeightedCapacity: 4 - InstanceType: c4.xlarge WeightedCapacity: 4 - InstanceType: c3.xlarge WeightedCapacity: 4 - InstanceType: m5.xlarge WeightedCapacity: 4 - InstanceType: m3.xlarge WeightedCapacity: 4 Name: Ref: AWS::StackName TargetOnDemandCapacity: "4" LaunchSpecifications: OnDemandSpecification: AllocationStrategy: lowest-price Ec2SubnetIds: - Ref: publicSubnet1 - Ref: publicSubnet2 - Ref: publicSubnet3 - Ref: publicSubnet4 - Ref: publicSubnet5 - Ref: publicSubnet6 MasterInstanceFleet: InstanceTypeConfigs: - InstanceType: c5.xlarge - InstanceType: c4.xlarge - InstanceType: c3.xlarge - InstanceType: m5.xlarge - InstanceType: m3.xlarge Name: Ref: AWS::StackName TargetOnDemandCapacity: "1" JobFlowRole: Ref: spotBlueprintsEmrEc2InstanceProfile Name: Ref: AWS::StackName ReleaseLabel: emr-5.30.1 ServiceRole: Ref: spotBlueprintsEmrRole Tags: - Key: Name Value: Ref: AWS::StackName VisibleToAllUsers: true ManagedScalingPolicy: ComputeLimits: MaximumCapacityUnits: "32" MinimumCapacityUnits: "16" MaximumOnDemandCapacityUnits: "4" MaximumCoreCapacityUnits: "4" UnitType: InstanceFleetUnits Type: AWS::EMR::ClusterIn Step 5, we can review and download the template code for further customization in either CloudFormation or Terraform format, review the instance configuration 
summary, and review a summary of the resources that would be created from the template. Spot Blueprints also uploads the CloudFormation template to an Amazon S3 bucket so we can deploy the template directly from the CloudFormation console or CLI. Let's go ahead and click on "Deploy in CloudFormation," copy the URL, and then click on "Deploy in CloudFormation" again. Having copied the S3 URL, we go to the CloudFormation console to launch the CloudFormation stack. We walk through the stack creation steps and the stack launches, creating all of the resources as configured in the blueprint. It takes roughly 15-20 minutes for the stack creation to complete. Once it is complete, we navigate to the Amazon EMR console to view the EMR cluster configured with Spot best practices.

Next, let's run a sample Spark application written in Python to calculate the value of pi on the cluster. We do this by uploading the sample application code to an S3 bucket in our account and then adding a step to the cluster referencing the application code location with arguments. The step runs and completes successfully, and the results of the calculation are sent to our S3 bucket as configured in the arguments: {"tries":1600000,"hits":1256253,"pi":3.1406325}

Cleanup

Now that our job is done, we delete the CloudFormation stack in order to remove the AWS resources created. Please note that as part of the EMR cluster creation, EMR creates some EC2 security groups that cannot be removed by CloudFormation, since they were created by the EMR cluster rather than by CloudFormation. As a result, deletion of the CloudFormation stack will fail to delete the VPC on the first try. To solve this, we can either delete the VPC manually or let the CloudFormation stack ignore the VPC and leave it (along with the security groups) in place for future use.
We can then delete the CloudFormation stack a final time.

Conclusion

Whether you are a first-time Spot user learning how to take advantage of the savings and scale offered by Spot Instances, or a veteran Spot user expanding your Spot usage into a new architecture, Spot Blueprints has you covered. We hope you enjoy using Spot Blueprints to quickly generate example template frameworks for workloads like Kubernetes and Apache Spark with Spot best practices in place. Please tell us which blueprint you'd like to see next, right in Spot Blueprints. We can't wait to hear from you!
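As an aside, the sample Spark job's output above follows the standard Monte Carlo estimate pi ≈ 4 × hits / tries (4 × 1256253 / 1600000 ≈ 3.1406). A minimal plain-Python sketch of the same idea (not the actual PySpark application that was run on the cluster):

```python
# Monte Carlo estimate of pi: throw random darts at the unit square and
# count how many land inside the quarter circle. This mirrors the sample
# Spark job's output format, but is a plain-Python sketch, not the job itself.
import random

def estimate_pi(tries, seed=1):
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(tries)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return {"tries": tries, "hits": hits, "pi": 4.0 * hits / tries}

print(estimate_pi(100_000))  # pi converges toward 3.14159 as tries grows
```

Each dart is independent, which is exactly why this workload partitions cleanly across interruptible Spot task nodes.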
Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/proactively-manage-spot-instance-lifecycle-using-the-new-capacity-rebalancing-feature-for-ec2-auto-scaling/ By Deepthi Chelupati and Chad Schmutzer

AWS now offers Capacity Rebalancing for Amazon EC2 Auto Scaling, a new feature for proactively managing the Amazon EC2 Spot Instance lifecycle in an Auto Scaling group. Capacity Rebalancing complements the capacity-optimized allocation strategy (designed to help find the most optimal spare capacity) and the mixed instances policy (designed to enhance availability by deploying across multiple instance types running in multiple Availability Zones). Capacity Rebalancing increases the emphasis on availability by automatically attempting to replace Spot Instances in an Auto Scaling group before they are interrupted by Amazon EC2. To replace Spot Instances proactively, Capacity Rebalancing leverages the new EC2 instance rebalance recommendation, a signal sent when a Spot Instance is at elevated risk of interruption. The rebalance recommendation signal can arrive sooner than the existing two-minute Spot Instance interruption notice, providing an opportunity to proactively rebalance a workload to new or existing Spot Instances that are not at elevated risk of interruption. Capacity Rebalancing for EC2 Auto Scaling provides a seamless and automated experience for maintaining desired capacity through the Spot Instance lifecycle. This includes monitoring for rebalance recommendations, attempting to proactively launch replacement capacity for existing Spot Instances when they are at elevated risk of interruption, detaching from Elastic Load Balancing if necessary, and running lifecycle hooks as configured. This post provides an overview of using Capacity Rebalancing in EC2 Auto Scaling to manage your Spot Instance backed workloads, and dives into an example use case for taking advantage of Capacity Rebalancing in your environment.
EC2 Auto Scaling and Spot Instances – a classic love story

First, let's review what Spot Instances are and why EC2 Auto Scaling provides an optimal platform to manage your Spot Instance backed workloads. This will help illustrate how Capacity Rebalancing can benefit these workloads. Spot Instances are spare EC2 compute capacity in the AWS Cloud available for steep discounts off On-Demand prices. In exchange for the discount, Spot Instances come with a simple rule: they are interruptible and must be returned when EC2 needs the capacity back. Where does this spare capacity come from? Since AWS builds capacity for unpredictable demand at any given time (think all 350+ instance types across 77 Availability Zones and 24 Regions), there is often excess capacity. Rather than let that spare capacity sit idle and unused, it is made available to be purchased as Spot Instances. As you can imagine, the location and amount of spare capacity available at any given moment is dynamic and changes continually in real time. This is why it is extremely important for Spot customers to run only workloads that are truly interruption tolerant. Additionally, Spot workloads should be flexible, meaning they can be shifted in real time to where the spare capacity currently is (or otherwise be paused until spare capacity is available again). In practice, being flexible means qualifying a workload to run on multiple EC2 instance types (think big: multiple families, sizes, and generations) and in multiple Availability Zones at any given time. This is where EC2 Auto Scaling comes in. EC2 Auto Scaling is designed to help you maintain application availability, and it allows you to automatically add or remove EC2 instances according to conditions you define. We've continued to innovate on behalf of our customers by adding new features to EC2 Auto Scaling that natively support flexible configurations for EC2 workloads.
One of these innovations is the mixed instances policy (launched in 2018), which supports multiple instance types and purchase options in a single Auto Scaling group. Another is the capacity-optimized allocation strategy (launched in 2019), an allocation strategy designed to locate optimal spare capacity for Spot Instance backed workloads. These features are aimed at supporting flexible-workload best practices and reacting to dynamic shifts in capacity automatically.

The next level – moving from reactive to proactive with Capacity Rebalancing in EC2 Auto Scaling

The default behavior for EC2 Auto Scaling is to take a reactive approach to Spot Instance interruptions: it attempts to replace an interrupted Spot Instance with another Spot Instance only after the instance has been shut down by EC2 and the health check fails. The reactive approach to interruptions works fine for many workloads. However, we have received feedback from customers requesting that EC2 Auto Scaling take a more proactive approach to handling Spot Instance interruptions. Capacity Rebalancing in EC2 Auto Scaling is the answer to this request. Capacity Rebalancing is designed to take a proactive approach to the dynamic nature of EC2 capacity. It does this by monitoring for the EC2 instance rebalance recommendation signal in addition to the "final" two-minute Spot Instance interruption notice. When a rebalance recommendation signal is detected, it automatically attempts to get a head start on replacing Spot Instances with new Spot Instances before they are shut down. In addition to attempting to maintain desired capacity through interruptions by launching replacement Spot Instances, Capacity Rebalancing gives customers the opportunity to gracefully remove Spot Instances from an Auto Scaling group by taking them through the normal shutdown process, such as deregistering from a load balancer and running terminating lifecycle hooks.
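The rebalance recommendation signal also surfaces through instance metadata, so a workload can watch for it directly. A hedged sketch follows: the endpoint path is the documented one, but the polling shape is illustrative, and on instances that enforce IMDSv2 a session-token header is additionally required:

```python
# Check the instance metadata service for a rebalance recommendation.
# The endpoint returns 404 until EC2 emits the signal, then 200.
# Off an EC2 instance this URL is unreachable, so the decision logic is
# separated out and demonstrated with stubbed status codes.
import urllib.error
import urllib.request

REBALANCE_URL = ("http://169.254.169.254/latest/meta-data/"
                 "events/recommendations/rebalance")

def rebalance_recommended(get_status):
    """True once the metadata endpoint answers 200."""
    return get_status() == 200

def imds_status(url=REBALANCE_URL, timeout=1):
    # Only meaningful when run on an EC2 instance.
    try:
        return urllib.request.urlopen(url, timeout=timeout).status
    except urllib.error.HTTPError as err:
        return err.code

print(rebalance_recommended(lambda: 404))  # False: no signal yet
print(rebalance_recommended(lambda: 200))  # True: start rebalancing
```

With Capacity Rebalancing enabled, EC2 Auto Scaling performs this monitoring for you; polling it yourself is useful when the application wants to checkpoint or drain on its own schedule.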
Capacity Rebalancing in EC2 Auto Scaling works best when combined with a few best practices. Let’s quickly review them:
Example tutorial – Web application workload

Now that you understand the best practices for taking advantage of Capacity Rebalancing in EC2 Auto Scaling, let's dive into the example workload. In this scenario, we have a web application powered by 75% Spot Instances and 25% On-Demand Instances in an Auto Scaling group, running behind an Application Load Balancer. We'd like to maintain availability, and have the Auto Scaling group automatically handle Spot Instance interruptions and rebalancing of capacity. The Auto Scaling group configuration looks like this (note the best practices of instance type and Availability Zone flexibility combined with the capacity-optimized allocation strategy in the mixed instances policy):

{
  "AutoScalingGroupName": "myAutoScalingGroup",
  "CapacityRebalance": true,
  "DesiredCapacity": 12,
  "MaxSize": 15,
  "MinSize": 12,
  "MixedInstancesPolicy": {
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 0,
      "OnDemandPercentageAboveBaseCapacity": 25,
      "SpotAllocationStrategy": "capacity-optimized"
    },
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "myLaunchTemplate",
        "Version": "$Default"
      },
      "Overrides": [
        { "InstanceType": "c5.large" },
        { "InstanceType": "c5a.large" },
        { "InstanceType": "m5.large" },
        { "InstanceType": "m5a.large" },
        { "InstanceType": "c4.large" },
        { "InstanceType": "m4.large" },
        { "InstanceType": "c3.large" },
        { "InstanceType": "m3.large" }
      ]
    }
  },
  "TargetGroupARNs": [
    "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/a1b2c3d4e5f6g7h8"
  ],
  "VPCZoneIdentifier": "mySubnet1,mySubnet2,mySubnet3"
}

Next, create the Auto Scaling group as follows:

aws autoscaling create-auto-scaling-group \
  --cli-input-json file://myAutoScalingGroup.json

We also use a lifecycle hook to download logs before an instance is shut down:

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name myTerminatingHook \
  --auto-scaling-group-name myAutoScalingGroup \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300

In this example scenario, let's say that the above configuration results in nine Spot Instances and three On-Demand Instances being deployed in the Auto Scaling group: three Spot Instances and one On-Demand Instance in each Availability Zone. With Capacity Rebalancing enabled, if any of the nine Spot Instances receives the EC2 Instance rebalance recommendation signal, EC2 Auto Scaling will automatically request a replacement Spot Instance according to the allocation strategy (capacity-optimized), resulting in 10 running Spot Instances. When the new Spot Instance passes EC2 health checks, it is joined to the load balancer and placed into service. Upon placing the new Spot Instance in service, EC2 Auto Scaling then proceeds with the shutdown process for the Spot Instance that has received the rebalance recommendation signal. It detaches the instance from the load balancer, drains connections, and then carries out the terminating lifecycle hook. Once the terminating lifecycle hook is complete, EC2 Auto Scaling shuts down the instance, bringing capacity back to nine Spot Instances.

Conclusion

Consider using the new Capacity Rebalancing feature for EC2 Auto Scaling in your environment to proactively manage the Spot Instance lifecycle. Capacity Rebalancing attempts to maintain workload availability by automatically rebalancing capacity as necessary, providing a seamless and hands-off experience for managing Spot Instance interruptions. Capacity Rebalancing works best when combined with instance type flexibility and the capacity-optimized allocation strategy, and may be especially useful for workloads that can easily rebalance across shifting capacity, including:
To learn more about Capacity Rebalancing for EC2 Auto Scaling, please visit the documentation. To learn more about the new EC2 Instance rebalance recommendation, please visit the documentation.
Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/running-web-applications-on-amazon-ec2-spot-instances/ Author: Isaac Vallhonrat, Sr. EC2 Spot Specialist SA

Amazon EC2 Spot Instances allow customers to save up to 90% compared to On-Demand pricing by leveraging spare EC2 capacity. Spot Instances are a perfect fit for fault-tolerant workloads that are flexible to run on multiple instance types, such as batch jobs, code builds, load tests, containerized workloads, web applications, big data clusters, and High Performance Computing clusters. In this blog post, I walk you through the how-to and best practices for running web applications on Spot Instances, so you can benefit from the scale and savings that they provide. As Spot Instances are interruptible, your web application should be stateless, fault tolerant, loosely coupled, and rely on external data stores for persistent data, like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, etc.

Spot Instances recap

While Spot Instances have been around since 2009, recent updates and service integrations have made them easier than ever to integrate with your workloads. Before I get into the details of how to use them to host a web application, here is a brief recap of how Spot Instances work: First, Spot Instances are a purchasing option within EC2. There is no difference in hardware when compared to On-Demand or Reserved Instances. The only difference is that these instances can be reclaimed with a two-minute notice if EC2 needs the capacity back. A set of unused EC2 instances with the same instance type, operating system, and Availability Zone is a Spot capacity pool. Spot Instance pricing is set by Amazon EC2 and adjusts gradually based on long-term trends in supply and demand of EC2 instances in each pool. You can expect Spot pricing to be stable over time, meaning no sudden spikes or swings. You can view historical pricing data for the last three months in both the EC2 console and via the API.
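As a rough sketch of the API route (using boto3; the helper name is ours, not part of the SDK), recent price history for a single instance type could be pulled like this:

```python
import datetime

def recent_spot_prices(ec2, instance_type="m5.xlarge"):
    """Return (Availability Zone, price) pairs from the last day of
    Spot price history for one Linux/UNIX instance type.

    `ec2` is a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    now = datetime.datetime.utcnow()
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=now - datetime.timedelta(days=1),
        EndTime=now,
    )
    return [(h["AvailabilityZone"], float(h["SpotPrice"]))
            for h in resp["SpotPriceHistory"]]
```

Sorting the result by price per Availability Zone gives a quick picture of how stable the pools you are considering have been.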
The following image is an example of the pricing history for m5.xlarge instances in the N. Virginia Region (us-east-1). Figure 1. Spot Instance pricing history for an m5.xlarge Linux/UNIX instance in us-east-1. Price changes independently for each Availability Zone and in small increments over time based on long-term supply and demand. As each capacity pool is independent from the others, I recommend you spread the fleet of instances that power your workload across multiple Availability Zones (AZs) and be flexible to use multiple instance types. This way, you effectively increase the amount of spare capacity you can use to launch Spot Instances and are able to replace interrupted instances with instances from other pools. The best way to adhere to Spot Instance flexibility best practices and manage your fleet of instances is to use an Amazon EC2 Auto Scaling group. This allows you to combine multiple instance types and purchase options. Once you identify a list of suitable instance types for your workload, you can use Auto Scaling features to launch a combination of On-Demand and Spot Instances from optimal Spot Instance pools in each Availability Zone. Stateless web applications are a natural fit for Spot Instances as they are fault tolerant, already used to handling interruptions through scale-in events in an Auto Scaling group, and commonly run across multiple Availability Zones. It's also normally easy to implement instance type flexibility on web applications, so they can tap into Spot Instance capacity from multiple pools.

Identifying suitable instance types for your web application

Being instance type flexible is key to being successful with Spot Instances. At the time of this writing, AWS provides more than 270 instance types, so it's likely that your web application can run on many of them. For example, if your application runs on the m5.xlarge instance type, it can also run on the m5d.xlarge instance type, as it is the same instance type, but with local SSDs.
Your application will also perform similarly on the r5.xlarge or r5d.xlarge, since they use the same processor family. In that case, your application simply runs on an instance with more memory, and you still benefit from the savings Spot Instances provide. You may be able to qualify additional similarly-sized instance types from other families or generations, like the AMD-based variants m5a.xlarge and r5a.xlarge, or the higher-bandwidth variants m5n.xlarge and r5n.xlarge. These may present some performance differences compared to other types due to their hardware and/or the virtualization system. However, Application Load Balancer has mechanisms in place to spread the load across your instances according to the number of outstanding requests they are processing, so instances handling longer requests or that have lower processing power are not overloaded. This allows you to acquire capacity from even more Spot capacity pools regardless of hardware differences. Later in the blog post, I dive into the details of the load balancing mechanisms.

Configure your Auto Scaling group

Now that you have qualified multiple instance types to run your web application, you need to create an EC2 Auto Scaling group. The first step is to create a Launch Template where you specify the configuration for your fleet of instances, including the AMI, Security Groups, Tags, user data scripts, etc. You configure a single instance type on the template: the instance type you commonly use to run your web application. Also, do not mark the Request Spot Instances checkbox on the template. You configure multiple instance types and the Spot Instance settings when configuring the Auto Scaling group. Here's my Launch Template: Figure 2. Launch templates console Now, go to the EC2 Auto Scaling console to create an Auto Scaling group. In the wizard, select the launch template you created and then click next.
Then, on the second step of the wizard, select Combine purchase options and instance types. You can see the configuration options in the following image. Figure 3. Combine purchase options and instance types configuration. Here is a detailed look at the configuration details:
Next, you choose your instance types from the Instance Types section of the wizard. Here, there is a primary instance type inherited from your Launch Template. The console then automatically populates a list of same-size instance types across other families and generations that could also fit your web application requirements. You can add or remove instance types as you see fit. Figure 4. Selection of instance types Note: There is an option to combine instance types of different sizes, which is very helpful for queue worker nodes and similar workloads. We won't cover this in this blog post as it is not relevant for web applications. You can read more about this feature here. Once you select your instance types, your Auto Scaling group uses the configured Spot Instance allocation strategy to decide which Spot Instance types to launch on each Availability Zone. By specifying multiple instance types, EC2 Auto Scaling replenishes capacity from other Spot Instance pools, applying your configured allocation strategy, if some of your instances get interrupted. This maintains the size of your fleet and keeps your service available. For the On-Demand part of your Auto Scaling group, the allocation strategy is prioritized. This means that EC2 Auto Scaling launches the primary instance type, and moves to the next types in the order you selected in the unlikely event that it cannot launch that capacity. If you have Reserved Instances to cover your compute baseline, set the instance type you reserved as the primary instance type in your Auto Scaling group, so the On-Demand Instances get the reservation discounts. You can learn more about how Reserved Instances are applied here. After completing this step, complete the Auto Scaling group creation wizard, and you can move on to set up your Application Load Balancer.

Load balancing with Application Load Balancer

To balance the load across the fleet of instances running your web application, you use an Application Load Balancer (ALB).
EC2 Auto Scaling groups are fully integrated with ALB and the Target Groups that manage the fleet of instances fronted by ALB. Target groups are set up by default with a deregistration delay of 300 seconds. This means that when EC2 Auto Scaling needs to remove an instance from the fleet, it first puts the instance in draining state, informs ALB to stop sending new requests to it, and allows the configured time for in-flight requests to complete before the instance is finally deregistered. This way, removing an instance from the fleet is not perceptible to your end users. As Spot Instances are interrupted with a two-minute warning, you need to adjust this setting to a slightly lower time. Recommended values are 90 seconds or less. This allows time for in-flight requests to complete and gracefully close existing connections before the instance is interrupted. At the time of this writing, EC2 Auto Scaling is not aware of Spot Instance interruptions and does not trigger draining automatically when a Spot Instance is going to be interrupted. However, you can put the instance in draining state by catching the interruption notice and calling the EC2 Auto Scaling API. I cover this in more detail in the next section. Target Groups allow you to configure a routing algorithm, which by default is Round Robin. Because you combine multiple instance types in your fleet when using Spot Instances, your web application may see some performance differences if they use different hardware and, in some cases, a different virtualization platform (Xen for the older generation instances and AWS Nitro for the newer ones). While the impact to response time may be negligible to the end user, you need to ensure your instances handle load fairly, accounting for the processing power differences of each instance type.
With the Least Outstanding Requests load balancing algorithm, instead of distributing the requests across your fleet in a round-robin fashion, as a new request comes in, the load balancer sends it to the target with the least number of outstanding requests. This way, backend instances with lower processing capabilities or processing long-running requests are not burdened with more requests, and the load is spread evenly across targets. This also helps new targets effectively take load from overloaded targets. These settings can be configured in the Target Groups section of the EC2 Console by editing the attributes of your target group. Figure 5. ALB Target group attribute configuration. Your architecture will look like the following image. Figure 6. Stateless web application hosted on a combination of On-Demand and Spot Instances

Leveraging Instance Termination Notices for graceful termination

When a Spot Instance is going to be interrupted, EC2 triggers a Spot Instance interruption notice that is presented via the EC2 Metadata endpoint on the instance. It can also be caught by an Amazon EventBridge rule, for example, to trigger an AWS Lambda function. You can find the specific documentation here. By catching the interruption notice, you can take actions to gracefully take the instance out of the fleet without impacting your end users. For example, stop receiving new requests from the Load Balancer, allow in-flight requests to finish, and launch replacement capacity. Below is a sample Spot Instance Interruption Notice caught by Amazon EventBridge: Upon intercepting the interruption event, you can leverage the DetachInstances API call of EC2 Auto Scaling. Invoking this API puts the instance in Draining state, which makes the configured Load Balancer stop sending new requests to the instance that is going to be interrupted.
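A hedged sketch of that DetachInstances call follows; the surrounding function is hypothetical (in practice this would run in a Lambda function triggered by the interruption event), but the API call and its parameters are the standard boto3 Auto Scaling client ones.

```python
def drain_on_interruption(event, autoscaling, asg_name):
    """Put a soon-to-be-interrupted Spot Instance into Draining state.

    `event` is the Spot Instance interruption notice from EventBridge and
    `autoscaling` is a boto3 Auto Scaling client, e.g.
    boto3.client("autoscaling"). Setting ShouldDecrementDesiredCapacity
    to False asks EC2 Auto Scaling to launch replacement capacity.
    """
    instance_id = event["detail"]["instance-id"]
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
    return instance_id
```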
The instance remains in this state during the configured deregistration delay on the Target Group to allow time for in-flight requests to complete. Figure 7. Target Group console showing instance draining. Also, if you set the ShouldDecrementDesiredCapacity parameter of the API call to False, EC2 Auto Scaling attempts to launch replacement Spot Instance capacity according to your instance type selection and allocation strategy. Below is an example code snippet calling the Auto Scaling API using Python Boto3. In some cases, you may also want to take actions at the operating system level when a Spot Instance is going to be interrupted. For example, you may want to gracefully stop your application so it closes open connections to a database, deregister agents running on the instance, or perform other clean-up activities. You can leverage AWS Systems Manager Run Command to invoke commands on the instance that is going to be interrupted to perform these actions. You may also want to delay the execution of commands to capitalize on the two-minute warning. As a first step, you can check the instance termination time in the instance metadata and wait, for example, until 30 seconds before the interruption. Then, gracefully stop your application and issue other termination commands. Note that there are no charges for using Run Command (limits apply) and the API call is asynchronous, so it doesn't hold your Lambda function running. Note: To use Run Command you must have the AWS Systems Manager agent running on your instance and meet a set of prerequisites, like configuring an IAM instance profile. You can find more information here. We provide a sample solution here that handles all of the steps outlined above and that you can easily deploy on your AWS account.
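Before getting into the sample solution's details, the wait-until-30-seconds-before-termination logic mentioned above can be sketched as follows (the helper is ours; the timestamp format matches the one used in the interruption notice):

```python
import datetime

def seconds_to_wait(termination_time_iso, now, buffer_seconds=30):
    """Seconds to sleep so that termination commands run roughly
    `buffer_seconds` before the interruption time reported in the
    instance metadata (e.g. "2020-05-04T17:11:44Z")."""
    deadline = datetime.datetime.strptime(termination_time_iso, "%Y-%m-%dT%H:%M:%SZ")
    remaining = (deadline - now).total_seconds() - buffer_seconds
    # Never return a negative sleep if the deadline is already close.
    return max(0.0, remaining)
```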
The sample solution also allows you to optionally execute commands upon Spot Instance interruption by creating an AWS Systems Manager parameter in Parameter Store, specific to each Auto Scaling group, with the commands you want to run, so you can easily customize termination commands to the needs of each application. You can find detailed configuration instructions in the README.md file. The high-level architecture of this solution is detailed below: Figure 8. Serverless architecture to handle Spot Instance interruptions

Tying it all together

Now that I've covered all the relevant features to run web applications on Spot Instances, let's put them all together and summarize:
I hope you found this blog post useful, and if you are not using Spot Instances to run your stateless web applications today, I encourage you to try them to get the cost savings and scale they provide. Isaac Vallhonrat is a Sr. Specialist Solutions Architect for EC2 Spot at AWS. He helps customers build resilient, fault tolerant and cost-effective architectures with EC2 Spot Instances across multiple workload types like web applications, containers, big data, CI/CD, etc.
Post Syndicated from peven original https://aws.amazon.com/blogs/compute/running-cost-effective-queue-workers-with-amazon-sqs-and-amazon-ec2-spot-instances/ This post is contributed by Ran Sheinberg | Sr. Solutions Architect, EC2 Spot & Chad Schmutzer | Principal Developer Advocate, EC2 Spot | Twitter: @schmutze

Introduction

Amazon Simple Queue Service (SQS) is used by customers to run decoupled workloads in the AWS Cloud as a best practice, in order to increase their applications' resilience. You can use a worker tier to do background processing of images, audio, documents, and so on, as well as offload long-running processes from the web tier. This blog post covers the benefits of pairing Amazon SQS and Spot Instances to maximize cost savings in the worker tier, and a customer success story.

Solution Overview

Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a common best practice to use Amazon SQS with decoupled applications. Amazon SQS increases application resilience by decoupling the direct communication between the frontend application and the worker tier that does data processing. If a worker node fails, the jobs that were running on that node return to the Amazon SQS queue for a different node to pick up. Both the frontend and worker tier can run on Spot Instances, which offer spare compute capacity at steep discounts compared to On-Demand Instances. Spot Instances optimize your costs on the AWS Cloud and can scale your application's throughput up to 10 times for the same budget. Spot Instances can be interrupted with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. These can include analytics, containerized workloads, high performance computing (HPC), stateless web servers, rendering, CI/CD, and queue worker nodes—which is the focus of this post.
Worker tiers of a decoupled application are typically fault tolerant, so they are prime candidates for running on interruptible capacity. Pairing Amazon SQS with Spot Instances allows for more robust, cost-optimized applications. By using EC2 Auto Scaling groups with multiple instance types that you configured as suitable for your application (for example, m4.xlarge, m5.xlarge, c5.xlarge, and c4.xlarge, in multiple Availability Zones), you can spread the worker tier's compute capacity across many Spot capacity pools (a combination of instance type and Availability Zone). This increases the chance of achieving the scale that's required for the worker tier to ingest messages from the queue, and of keeping that scale when Spot Instance interruptions occur, while selecting the lowest-priced Spot Instances in each Availability Zone. You can also choose the capacity-optimized allocation strategy for the Spot Instances in your Auto Scaling group. This strategy automatically selects instances that have a lower chance of interruption, which decreases the chances of restarting jobs due to Spot interruptions. When Spot Instances are interrupted, your Auto Scaling group automatically replenishes the capacity from a different Spot capacity pool in order to achieve your desired capacity. Read the blog post "Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances" for more details on how to choose the suitable allocation strategy. We focus on three main points in this blog:
Application of Amazon SQS with Spot Instances

Amazon SQS eliminates the complexity of managing and operating message-oriented middleware. Using Amazon SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Amazon SQS is a fully managed service which allows you to set up a queue in seconds. It also allows you to use your preferred SDK to start writing and reading to and from the queue within minutes. In the following example, we describe an AWS architecture that brings together the Amazon SQS queue and an EC2 Auto Scaling group running Spot Instances. The architecture is used for decoupling the worker tier from the web tier by using Amazon SQS. The example uses the Protect feature (which we will explain later in this post) to ensure that an instance currently processing a job does not get terminated by the Auto Scaling group when it detects that a scale-in activity is required due to a Dynamic Scaling Policy. AWS reference architecture used for decoupling the worker tier from the web tier by using Amazon SQS

Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application

Trax decided to run its queue worker tier exclusively on Spot Instances due to the fault-tolerant nature of its architecture and for cost-optimization purposes. The company digitizes the physical world of retail using computer vision. Their 'Trax Factory' transforms images of individual shelves into data and insights about retail store conditions. Built using an asynchronous event-driven architecture, Trax Factory is a cluster of microservices in which the completion of one service triggers the activation of another service. The worker tier uses Auto Scaling groups with dynamic scaling policies to increase and decrease the number of worker nodes. You can create a Dynamic Scaling Policy by doing the following:
In Trax's case, due to the high variability of the number of messages in the queue, they opted to enhance this approach to minimize the time it takes to scale, by building a service that calls the SQS API to find the current number of messages in the queue more frequently, instead of waiting for the five-minute metric refresh interval in CloudWatch. Trax ensures that its applications are always scaled to meet the demand by leveraging the inherent elasticity of Amazon EC2 instances. This elasticity ensures that end users are never affected and/or service-level agreements (SLAs) are never violated. With a Dynamic Scaling Policy, the Auto Scaling group can detect when the number of messages in the queue has decreased, so that it can initiate a scale-in activity. The Auto Scaling group uses its configured termination policy for selecting the instances to be terminated. However, this policy poses the risk that the Auto Scaling group might select an instance for termination while that instance is currently processing an image. That instance's work would be lost (although the image would eventually be processed by reappearing in the queue and getting picked up by another worker node). To decrease this risk, you can use Auto Scaling group instance protection. This means that every time an instance fetches a job from the queue, it also sends an API call to EC2 to protect itself from scale-in. The Auto Scaling group does not select the protected, working instance for termination until the instance finishes processing the job and calls the API to remove the protection.

Handling Spot Instance interruptions

This instance-protection solution ensures that no work is lost during scale-in activities. However, protecting from scale-in does not work when an instance is marked for termination due to a Spot Instance interruption.
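A worker can detect a pending interruption by polling the instance metadata between jobs. A minimal sketch of the parsing side (the endpoint path is the documented spot/instance-action metadata item; the function name is ours):

```python
import json

def interruption_deadline(status_code, body):
    """Parse the response from the instance metadata endpoint
    http://169.254.169.254/latest/meta-data/spot/instance-action.

    Returns the termination time string when an interruption is
    scheduled, or None when the endpoint returns 404 (no interruption
    pending). A worker polls this between jobs and stops fetching new
    messages once it returns a value.
    """
    if status_code == 404:
        return None
    action = json.loads(body)
    return action.get("time")
```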
These interruptions occur when there's increased demand for On-Demand Instances in the same capacity pool (a combination of an instance type in an Availability Zone). Applications can minimize the impact of a Spot Instance interruption. To do so, an application catches the two-minute interruption notification (available in the instance's metadata), and instructs itself to stop fetching jobs from the queue. If there's an image still being processed when the two minutes expire and the instance is terminated, the application does not delete the message from the queue after finishing the process. Instead, the message simply becomes visible again for another instance to pick up and process after the Amazon SQS visibility timeout expires. Alternatively, you can release any ongoing job back to the queue upon receiving a Spot Instance interruption notification by setting the visibility timeout of the specific message to 0. This potentially decreases the total time it takes to process the message.

Testing the solution

If you're not currently using Spot Instances in your queue worker tier, we suggest testing the approach described in this post. For that purpose, we built a simple solution to demonstrate the capabilities mentioned in this post, using an AWS CloudFormation template. The stack includes an Amazon Simple Storage Service (S3) bucket with a CloudWatch trigger to push notifications to an SQS queue after an image is uploaded to the Amazon S3 bucket. Once the message is in the queue, it is picked up by the application running on the EC2 instances in the Auto Scaling group. Then, the image is converted to PDF, and the instance is protected from scale-in for as long as it has an active processing job. To see the solution in action, deploy the CloudFormation template. Then upload an image to the Amazon S3 bucket. In the Auto Scaling groups console, check the instance protection status on the Instances tab.
The protection status is shown in the following screenshot. You can also see the application logs using CloudWatch Logs:

/usr/local/bin/convert-worker.sh: Found 1 messages in https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB
/usr/local/bin/convert-worker.sh: Found work to convert. Details: INPUT=Capture1.PNG, FNAME=capture1, FEXT=png
/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --protected-from-scale-in
/usr/local/bin/convert-worker.sh: Convert done. Copying to S3 and cleaning up
/usr/local/bin/convert-worker.sh: Running: aws s3 cp /tmp/capture1.pdf s3://qtest-s3bucket-18fdpm2j17wxx
/usr/local/bin/convert-worker.sh: Running: aws sqs --output=json delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB --receipt-handle
/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --no-protected-from-scale-in

Conclusion

This post helps you architect fault-tolerant worker tiers in a cost-optimized way. If your queue worker tiers are fault tolerant and use the built-in Amazon SQS features, you can increase your application's resilience and take advantage of Spot Instances to save up to 90% on compute costs. In this post, we emphasized several best practices to help get you started saving money using Amazon SQS and Spot Instances. The main best practices are:
We hope you found this post useful. If you're not using Spot Instances in your queue worker tier, we suggest testing the approach described here. Finally, we would like to thank the Trax team for sharing its architecture and best practices. If you want to learn more, watch the "This is my architecture" video featuring Trax and their solution. We'd love your feedback—please comment and let me know what you think.

About the authors

Ran Sheinberg is a specialist solutions architect for EC2 Spot Instances with Amazon Web Services. He works with AWS customers on cost optimizing their compute spend by utilizing Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC, and others. As a Principal Developer Advocate for EC2 Spot at AWS, Chad's job is to make sure our customers are saving at scale by using EC2 Spot Instances to take advantage of the most cost-effective way to purchase compute capacity. Follow him on Twitter here! @schmutze
Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/ This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster. This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:
Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU. A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot. In the Ireland Region, using 20 c5.2xlarge instances for five hours results in a cost of $0.384 per hour for each instance. The batch total is $38.40.

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient On-Demand Instance capacity to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add. You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty, the group automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification. Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning.
You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as by checkpointing progress to a data store. This example sets the visibility timeout on the SQS queue to nine minutes, so SQS re-queues any task that doesn’t complete in that time. To test the impact of the new configuration another 6,000 tasks are submitted into the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances. The instances then quickly set to work on the queue. The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs. The batch used the following instances for 30 minutes.
The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.

Summary
This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.
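A quick sanity check on the quoted savings and speedup (illustrative arithmetic only):

```python
# Comparing the two runs described above.
on_demand_cost, on_demand_hours = 38.40, 5.0   # 20 On-Demand instances
mixed_cost, mixed_hours = 23.18, 0.5           # 20 On-Demand + 60 Spot

cost_ratio = mixed_cost / on_demand_cost
speedup = on_demand_hours / mixed_hours

print(round(cost_ratio, 2))  # 0.6 -> roughly 60 percent of the On-Demand cost
print(speedup)               # 10.0 -> the batch completes ten times faster
```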
Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/ AWS announces the new capacity-optimized allocation strategy for Amazon EC2 Auto Scaling and EC2 Fleet. This new strategy automatically makes the most efficient use of spare capacity while still taking advantage of the steep discounts offered by Spot Instances. It’s a new way for you to gain easy access to extra EC2 compute capacity in the AWS Cloud. This post compares how the capacity-optimized allocation strategy deploys capacity compared to the current lowest-price allocation strategy.

Overview
Spot Instances are spare EC2 compute capacity in the AWS Cloud available to you at savings of up to 90% off compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by EC2 with two minutes of notification when EC2 needs the capacity back. When making requests for Spot Instances, customers can take advantage of allocation strategies within services such as EC2 Auto Scaling and EC2 Fleet. The allocation strategy determines how the Spot portion of your request is fulfilled from the possible Spot Instance pools you provide in the configuration. The existing allocation strategy available in EC2 Auto Scaling and EC2 Fleet is called “lowest-price” (with an option to diversify across N pools). This strategy allocates capacity strictly based on the lowest-priced Spot Instance pool or pools. The “diversified” allocation strategy (available in EC2 Fleet but not in EC2 Auto Scaling) spreads your Spot Instances across all the Spot Instance pools you’ve specified as evenly as possible. As the AWS global infrastructure has grown over time in terms of geographic Regions and Availability Zones as well as the raw number of EC2 instance families and types, so has the amount of spare EC2 capacity.
Therefore, it is important that customers have access to tools to help them use spare EC2 capacity optimally. The new capacity-optimized strategy for both EC2 Auto Scaling and EC2 Fleet provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics.

Walkthrough
To illustrate how the capacity-optimized allocation strategy deploys capacity compared to the existing lowest-price allocation strategy, here are examples of Auto Scaling group configurations and use cases for each strategy.

Lowest-price (diversified over N pools) allocation strategy
The lowest-price allocation strategy deploys Spot Instances from the pools with the lowest price in each Availability Zone. This strategy has an optional modifier SpotInstancePools that provides the ability to diversify over the N lowest-priced pools in each Availability Zone. Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The lowest-price strategy does not account for pool capacity depth as it deploys Spot Instances. As a result, the lowest-price allocation strategy is a good choice for workloads with a low cost of interruption that want the lowest possible prices, such as:
Example
The following example configuration shows how capacity could be allocated in an Auto Scaling group using the lowest-price allocation strategy diversified over two pools:

{
  "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        { "InstanceType": "c3.large" },
        { "InstanceType": "c4.large" },
        { "InstanceType": "c5.large" }
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price",
      "SpotInstancePools": 2
    }
  },
  "MinSize": 10,
  "MaxSize": 100,
  "DesiredCapacity": 60,
  "HealthCheckType": "EC2",
  "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456"
}

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to lowest-price over two SpotInstancePools. First, EC2 Auto Scaling tries to make sure that it balances the requested capacity across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the lowest-price allocation strategy allocates the Spot Instance launches to the lowest-priced pool per zone. Using the example Spot prices shown in the following table, the resulting allocation is:
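The selection logic can be sketched as a small simulation. The prices below are hypothetical placeholders, not the values from the post's table:

```python
# Sketch of the lowest-price strategy diversified over two pools per zone.
# Prices are hypothetical placeholders (USD per hour).
spot_prices = {
    "us-east-1a": {"c3.large": 0.027, "c4.large": 0.028, "c5.large": 0.029},
    "us-east-1b": {"c3.large": 0.028, "c4.large": 0.027, "c5.large": 0.030},
    "us-east-1c": {"c3.large": 0.029, "c4.large": 0.030, "c5.large": 0.028},
}

def lowest_price_allocation(target, prices, pools=2):
    per_zone = target // len(prices)      # balance capacity across zones first
    allocation = {}
    for zone, zone_prices in prices.items():
        cheapest = sorted(zone_prices, key=zone_prices.get)[:pools]
        share, remainder = divmod(per_zone, pools)
        # spread the zone's share over the N lowest-priced pools
        for i, instance_type in enumerate(cheapest):
            allocation[(zone, instance_type)] = share + (1 if i < remainder else 0)
    return allocation

alloc = lowest_price_allocation(60, spot_prices)
print(sum(alloc.values()))                 # 60
print(alloc[("us-east-1a", "c3.large")])   # 10
```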
The cost for this Auto Scaling group is $1.83/hour. Of course, the Spot Instances are allocated according to the lowest price and are not optimized for capacity. The Auto Scaling group could experience higher interruptions if the lowest-priced Spot Instance pools are not as deep as others, since upon interruption the Auto Scaling group will attempt to re-provision instances into the lowest-priced Spot Instance pools.

Capacity-optimized allocation strategy
There is a price associated with interruptions, restarting work, and checkpointing. While the overall hourly cost of the capacity-optimized allocation strategy might be slightly higher, the possibility of having fewer interruptions can lower the overall cost of your workload. The effectiveness of the capacity-optimized allocation strategy depends on following Spot best practices by being flexible and providing as many instance types and Availability Zones (Spot Instance pools) as possible in the configuration. It is also important to understand that as capacity demands change, the allocations provided by this strategy also change over time. Remember that Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The capacity-optimized strategy does account for pool capacity depth as it deploys Spot Instances, but it does not account for Spot prices. As a result, the capacity-optimized allocation strategy is a good choice for workloads with a high cost of interruption, such as:
Example
The following example configuration shows how capacity could be allocated in an Auto Scaling group using the capacity-optimized allocation strategy:

{
  "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        { "InstanceType": "c3.large" },
        { "InstanceType": "c4.large" },
        { "InstanceType": "c5.large" }
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  },
  "MinSize": 10,
  "MaxSize": 100,
  "DesiredCapacity": 60,
  "HealthCheckType": "EC2",
  "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456"
}

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices (especially critical when using the capacity-optimized allocation strategy) and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to capacity-optimized. First, EC2 Auto Scaling tries to make sure that the requested capacity is evenly balanced across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the capacity-optimized allocation strategy optimizes the Spot Instance launches by analyzing capacity metrics per instance type per zone. This is because this strategy effectively optimizes by capacity instead of by the lowest price (hence its name). Using the example Spot prices shown in the following table, the resulting allocation is:
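The selection logic can likewise be sketched as a simulation. The capacity scores below are hypothetical placeholders; the real strategy uses EC2's internal capacity metrics, which are not exposed directly:

```python
# Sketch of the capacity-optimized strategy: per zone, launch into the pool
# with the most available capacity. Scores are hypothetical placeholders.
capacity_scores = {
    "us-east-1a": {"c3.large": 40, "c4.large": 85, "c5.large": 70},
    "us-east-1b": {"c3.large": 55, "c4.large": 60, "c5.large": 90},
    "us-east-1c": {"c3.large": 80, "c4.large": 45, "c5.large": 65},
}

def capacity_optimized_allocation(target, scores):
    per_zone = target // len(scores)          # balance capacity across zones first
    allocation = {}
    for zone, pools in scores.items():
        deepest = max(pools, key=pools.get)   # most-available pool in the zone
        allocation[(zone, deepest)] = per_zone
    return allocation

alloc = capacity_optimized_allocation(60, capacity_scores)
print(alloc[("us-east-1b", "c5.large")])  # 20
```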
The cost for this Auto Scaling group is $1.91/hour, only 5% more than the lowest-priced example above. However, notice that the distribution of the Spot Instances is different. This is because the capacity-optimized allocation strategy determined this was the most efficient distribution from an available capacity perspective.

Conclusion
Consider using the new capacity-optimized allocation strategy to make the most efficient use of spare capacity. It automatically deploys into the most available Spot Instance pools while still taking advantage of the steep discounts provided by Spot Instances. This allocation strategy may be especially useful for workloads with a high cost of interruption, including:
No matter which allocation strategy you choose, you still enjoy the steep discounts provided by Spot Instances. These discounts are possible thanks to the stable Spot pricing made available with the new Spot pricing model. Chad Schmutzer is a Principal Developer Advocate for the EC2 Spot team. Follow him on Twitter to get the latest updates on saving at scale with Spot Instances, to provide feedback, or just to say hi.
Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/running-your-game-servers-at-scale-for-up-to-90-lower-compute-cost/ This post is contributed by Yahav Biran, Chad Schmutzer, and Jeremy Cowan, Solutions Architects at AWS

Many successful video games such as Fortnite: Battle Royale, Warframe, and Apex Legends use a free-to-play model, which offers players access to a portion of the game without paying. Such games are no longer low quality; players expect premium-like quality. The business model is constrained on cost, and Amazon EC2 Spot Instances offer a viable low-cost compute option. Casual multiplayer games naturally fit the Spot offering. With the orchestration of Amazon EKS containers and the mechanisms available to minimize player impact and optimize cost when running multiplayer game-server workloads, both casual and hardcore multiplayer games fit the Spot Instance offering. Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand Instances. Spot Instances enable you to optimize your costs and scale your application’s throughput up to 10 times for the same budget. Spot Instances are best suited for fault-tolerant workloads. Multiplayer game servers are no exception: a game-server state is updated using real-time player inputs, which makes the server state transient. Game-server workloads can be disposable and take advantage of Spot Instances to save up to 90% on compute cost. In this blog, we share how to architect your game-server workloads to handle interruptions and effectively use Spot Instances.

Characteristics of game-server workloads
Simply put, multiplayer game servers spend most of their life updating current character position and state (mostly animation). The rest of the time is spent on image updates that result from combat actions, moves, and other game-related events.
More specifically, game servers’ CPUs are busy doing network I/O operations by accepting client positions, calculating the new game state, and multi-casting the game state back to the clients. That makes a game-server workload a good fit for general-purpose instance types for casual multiplayer games and, preferably, compute-optimized instance types for hardcore multiplayer games. AWS provides a wide variety of both compute-optimized (C5 and C4) and general-purpose (M5) instance types with Amazon EC2 Spot Instances. Because capacities fluctuate independently for each instance type in an Availability Zone, you can often get more compute capacity for the same price when using a wide range of instance types. For more information on Spot Instance best practices, see Getting Started with Amazon EC2 Spot Instances. One solution that customers use for running dedicated game servers is Amazon GameLift. This solution uses Amazon GameLift FleetIQ to deploy a fleet of Spot Instances in an AWS Region. FleetIQ places new sessions on game servers based on player latencies, instance prices, and Spot Instance interruption rates so that you don’t need to worry about Spot Instance interruptions. For more information, see Reduce Cost by up to 90% with Amazon GameLift FleetIQ and Spot Instances on the AWS Game Tech Blog. In other cases, you can use game-server deployment patterns like container-based orchestration (such as Kubernetes, Swarm, and Amazon ECS) for deploying multiplayer game servers. Those systems manage a large number of game servers deployed as Docker containers across several Regions. The rest of this blog focuses on this containerized game-server solution. Containers fit the game-server workload because they’re lightweight, start quickly, and allow for greater utilization of the underlying instance.
Why use Amazon EC2 Spot Instances?
A Spot Instance is a natural choice for running a disposable game-server workload because of its built-in two-minute interruption notice that provides graceful handling. The two-minute termination notification enables the game server to act upon interruption. We demonstrate two examples for notification handling through Instance Metadata and Amazon CloudWatch. For more information, see the “Interruption handling” and “What if I want my game-server to be redundant?” segments later in this blog. Spot Instances also offer a variety of EC2 instance types that fit game-server workloads, such as general-purpose and compute-optimized (C4 and C5). Finally, Spot Instances provide low termination rates. The Spot Instance Advisor can help you choose a good starting point for determining which instance types have lower historical interruption rates.

Interruption handling
Avoiding player impact is key when using Spot Instances. Here is a strategy to avoid player impact that we apply in the proposed reference architecture and code examples available at Spotable Game Server on GitHub. Specifically, for Amazon EKS, node drainage requires draining the node via the kubectl drain command. This makes the node unschedulable and evicts the pods currently running on the node with a graceful termination period (terminationGracePeriodSeconds), limiting the impact on the player experience: pods continue to run while a signal is sent to the game to end the session gracefully.

Node drainage
Node drainage requires an agent pod that runs as a DaemonSet on every Spot Instance host to pull potential Spot interruption notices from Amazon CloudWatch or instance metadata. We’re going to use the Instance Metadata notification. The following describes how termination events are handled with node drainage:
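A minimal sketch of such an agent's polling logic, assuming the spot/instance-action metadata endpoint (IMDSv2 token handling and the actual drain call are omitted for brevity; the agent in the referenced Spotable Game Server project may differ):

```python
import json
import urllib.error
import urllib.request

# IMDS endpoint that returns a JSON document once an interruption is
# scheduled, and 404 until then.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def http_get(url, timeout=2):
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode()

def check_interruption(fetch=http_get):
    """Return the interruption notice as a dict, or None if none is pending."""
    try:
        body = fetch(METADATA_URL)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption scheduled yet
        raise
    return json.loads(body)  # e.g. {"action": "terminate", "time": "..."}

# The DaemonSet agent would call check_interruption() in a loop and, on a
# notice, run `kubectl drain <node>` (or call the Kubernetes eviction API).
```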
Optimization strategies for Kubernetes specifications
This section describes a few recommended strategies for Kubernetes specifications that enable optimal game server placements on the provisioned worker nodes.
Reference architecture
The following architecture describes an instance mix of On-Demand and Spot Instances in an Amazon EKS cluster of multiplayer game servers. Within a single VPC, control plane node pools (Master Host and Storage Host) are highly available and thus run On-Demand Instances. The game-server hosts/nodes use a mix of Spot and On-Demand Instances. The control plane’s API server is accessible via an Elastic Load Balancing Application Load Balancer with a preconfigured allowed list.

What if I want my game server to be redundant?
A game server is a sessionful workload, but it traditionally runs as a single dedicated game server instance with no redundancy. For game servers that use TCP as the transport network layer, AWS offers Network Load Balancers as an option for distributing player traffic across multiple game servers’ targets. Currently, game servers that use UDP don’t have similar load balancer solutions that add redundancy to maintain a highly available game server. This section proposes a solution for the case where game servers deployed as containerized Amazon EKS pods use UDP as the network transport layer and are required to be highly available. We’re motivated to use a UDP load balancer because of Spot Instances, but the approach isn’t limited to Spot Instances. The following diagram shows a reference architecture for the implementation of a UDP load balancer based on Amazon EKS. It requires a setup of an Amazon EKS cluster as suggested above and a set of components that simulate architecture that supports multiplayer game services. For example, this includes game-server inventory that captures the running game-servers, their status, and allocation placement. The Amazon EKS cluster is on the left, and the proposed UDP load-balancer system is on the right. A new game server is reported to an Amazon SQS queue and persisted in an Amazon DynamoDB table.
When a player’s assignment is required, the match-making service queries an API endpoint for an optimal available game server through the game-server inventory that uses the DynamoDB tables. The solution includes the following main components:
Scheduling mechanism
Our goal in this example is to reduce the chance that two game servers that serve the same load-balancer game endpoint are interrupted at the same time. Therefore, we need to prevent the scheduling of the same game servers (mockup-UDP-server) on the same host. This example uses advanced scheduling in Kubernetes where the pod affinity/anti-affinity policy is being applied. We define two soft labels, mockup-grp1 and mockup-grp2, in the podAffinity section as follows: The requiredDuringSchedulingIgnoredDuringExecution setting tells the scheduler that the subsequent rule must be met upon pod scheduling. The rule says that pods that carry the value of key: “app” mockup-grp1 will not be scheduled on the same node as pods with key: “app” mockup-grp2 due to topologyKey: “kubernetes.io/hostname”. When a load balancer pod (udp-lb) is scheduled, it queries the game-server-inventory-api endpoint for two game-server pods that run on different nodes. If this request isn’t fulfilled, the load balancer pod enters a crash loop until two available game servers are ready.

Try it out
We published two examples that guide you on how to build an Amazon EKS cluster that uses Spot Instances. The first example, Spotable Game Server, creates the cluster, deploys Spot Instances, Dockerizes the game server, and deploys it. The second example, Game Server Glutamate, enhances the game-server workload and enables redundancy as a mechanism for handling Spot Instance interruptions.

Conclusion
Multiplayer game servers have relatively short-lived processes that last between a few minutes and a few hours. The current average observed life span of Spot Instances in US and EU Regions ranges from a few hours to a few days, which makes Spot Instances a good fit for game servers. Amazon GameLift FleetIQ offers native and seamless support for Spot Instances, and Amazon EKS offers mechanisms to significantly minimize the probability of interrupting the player experience.
This makes Spot Instances an attractive option not only for casual multiplayer game servers but also for hardcore game servers. Game studios that use Spot Instances for multiplayer game servers can save up to 90% of the compute cost, benefiting them as well as delighting their players.
Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/ AWS Online Tech Talks – June 2018

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data
June 18, 2018 | 11:00 AM – 11:45 AM PT – Get Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.

AWS re:Invent
June 25, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.
June 26, 2018 | 01:00 PM – 01:45 PM PT – Ensuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

Containers

Databases
June 18, 2018 | 01:00 PM – 01:45 PM PT – Oracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.
June 20, 2018 | 09:00 AM – 09:45 AM PT – Set Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.
Enterprise & Hybrid
June 19, 2018 | 11:00 AM – 11:45 AM PT – Launch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new AWS Environments.
June 21, 2018 | 11:00 AM – 11:45 AM PT – Leading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.
June 21, 2018 | 01:00 PM – 01:45 PM PT – Enabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.
June 28, 2018 | 01:00 PM – 01:45 PM PT – Fireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.

IoT
June 27, 2018 | 11:00 AM – 11:45 AM PT – AWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

Machine Learning
June 19, 2018 | 09:00 AM – 09:45 AM PT – Integrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.
June 21, 2018 | 09:00 AM – 09:45 AM PT – Building Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

Management Tools
June 20, 2018 | 01:00 PM – 01:45 PM PT – Optimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

Mobile

Security, Identity & Compliance
June 26, 2018 | 09:00 AM – 09:45 AM PT – Understanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
Serverless
June 19, 2018 | 01:00 PM – 01:45 PM PT – Productionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

Storage
June 26, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connect your applications to the scalable and reliable AWS storage services.
Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-april-early-may-2018/ We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We’ll have AWS experts online to help answer questions in real-time. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

April & early May — 2018 Schedule

Compute
April 30, 2018 | 01:00 PM – 01:45 PM PT – Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads as well as process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.
May 1, 2018 | 01:00 PM – 01:45 PM PT – How to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.
May 2, 2018 | 01:00 PM – 01:45 PM PT – Deep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS’ GPU and FPGA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.

Containers
April 23, 2018 | 11:00 AM – 11:45 AM PT – New Features for Building Powerful Containerized Microservices on AWS (300) – Learn about how this new feature works and how you can start using it to build and run modern, containerized applications on AWS.

Databases
April 23, 2018 | 01:00 PM – 01:45 PM PT – ElastiCache: Deep Dive Best Practices and Usage Patterns (200) – Learn about Redis-compatible in-memory data store and cache with Amazon ElastiCache.
April 25, 2018 | 01:00 PM – 01:45 PM PT – Intro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.
DevOps
April 25, 2018 | 09:00 AM – 09:45 AM PT – Debug your Container and Serverless Applications with AWS X-Ray in 5 Minutes (300) – Learn how AWS X-Ray makes debugging your Container and Serverless applications fun.

Enterprise & Hybrid
April 23, 2018 | 09:00 AM – 09:45 AM PT – An Overview of Best Practices of Large-Scale Migrations (300) – Learn about the tools and best practices on how to migrate to AWS at scale.
April 24, 2018 | 11:00 AM – 11:45 AM PT – Deploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0.

IoT
May 2, 2018 | 11:00 AM – 11:45 AM PT – How to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.

Machine Learning
April 24, 2018 | 09:00 AM – 09:45 AM PT – Automate for Efficiency with Amazon Transcribe and Amazon Translate (200) – Learn how you can increase the efficiency and reach of your operations with Amazon Translate and Amazon Transcribe.
April 26, 2018 | 09:00 AM – 09:45 AM PT – Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon SageMaker (200) – Learn more about developing machine learning applications for the IoT edge.

Mobile
April 30, 2018 | 11:00 AM – 11:45 AM PT – Offline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.

Networking
May 2, 2018 | 09:00 AM – 09:45 AM PT – Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used Lambda@Edge to reduce latency and cloud costs for their customer’s websites.

Security, Identity & Compliance
April 30, 2018 | 09:00 AM – 09:45 AM PT – Amazon GuardDuty – Let’s Attack My Account! (300) – Amazon GuardDuty Test Drive – Practical steps on generating test findings.
May 3, 2018 | 09:00 AM – 09:45 AM PT – Protect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.

Serverless
April 24, 2018 | 01:00 PM – 01:45 PM PT – Tips and Tricks for Building and Deploying Serverless Apps In Minutes (200) – Learn how to build and deploy apps in minutes.

Storage
May 1, 2018 | 11:00 AM – 11:45 AM PT – Building Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select and Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.
May 3, 2018 | 11:00 AM – 11:45 AM PT – Integrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.
Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/ Contributed by Deepthi Chelupati and Roshni Pary

Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs. At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.

How does the new pricing model work?
You don’t have to bid for Spot Instances in the new pricing model, and you just pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price. Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console. As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update.
However, after the pricing model update, prices are more predictable and change less frequently. In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances: You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests: The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.

Fewer interruptions
Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average. To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available. Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer hundreds of instance options.
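Passing a maximum price in an API call can be sketched as follows. The launch template ID is a placeholder; the dict mirrors the RunInstances InstanceMarketOptions parameters and is only constructed here, not sent:

```python
# Building RunInstances parameters for a Spot request with an explicit
# maximum price. "lt-0abc1234567890def" is a placeholder launch template ID.
def spot_run_instances_params(launch_template_id, max_price=None):
    spot_options = {"SpotInstanceType": "one-time"}
    if max_price is not None:
        # If omitted, EC2 defaults the maximum price to the On-Demand price.
        spot_options["MaxPrice"] = str(max_price)
    return {
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id},
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": spot_options,
        },
    }

params = spot_run_instances_params("lt-0abc1234567890def", max_price=0.05)
# With boto3 this would be passed as: boto3.client("ec2").run_instances(**params)
print(params["InstanceMarketOptions"]["SpotOptions"]["MaxPrice"])  # 0.05
```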
We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.
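To check the price that's in effect programmatically rather than in the console, the EC2 `describe_spot_price_history` API (available through a boto3 `ec2` client) returns recent price entries. The sketch below parses a response of that shape; the response dict is a trimmed, made-up sample for illustration only.

```python
# Trimmed, illustrative sample of a DescribeSpotPriceHistory response.
# In practice you would obtain this from:
#   boto3.client("ec2").describe_spot_price_history(...)
sample_response = {
    "SpotPriceHistory": [
        {"AvailabilityZone": "us-east-1a", "InstanceType": "c4.large",
         "SpotPrice": "0.035", "Timestamp": "2018-02-08T01:00:00Z"},
        {"AvailabilityZone": "us-east-1b", "InstanceType": "c4.large",
         "SpotPrice": "0.033", "Timestamp": "2018-02-08T01:00:00Z"},
    ]
}

def latest_price_per_az(response):
    """Return {availability_zone: spot_price} from a price-history response.

    Entries are assumed newest-first per AZ, so the first price seen
    for each Availability Zone wins.
    """
    prices = {}
    for entry in response["SpotPriceHistory"]:
        prices.setdefault(entry["AvailabilityZone"], float(entry["SpotPrice"]))
    return prices

print(latest_price_per_az(sample_response))
# {'us-east-1a': 0.035, 'us-east-1b': 0.033}
```

Because prices now change slowly, a single snapshot like this is far more useful than it was under the bidding model.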
Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/taking-advantage-of-amazon-ec2-spot-instance-interruption-notices/

Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud, available to you at steep discounts compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back. Customers have been taking advantage of Spot Instance interruption notices, available via the instance metadata service since January 2015, to orchestrate their workloads seamlessly around any potential interruptions. Examples include saving the state of a job, detaching from a load balancer, or draining containers. Needless to say, the two-minute Spot Instance interruption notice is a powerful tool when using Spot Instances.

In January 2018, the Spot Instance interruption notice also became available as an event in Amazon CloudWatch Events. This allows targets such as AWS Lambda functions or Amazon SNS topics to process Spot Instance interruption notices by creating a CloudWatch Events rule to monitor for the notice. In this post, I walk through an example use case for taking advantage of Spot Instance interruption notices in CloudWatch Events to automatically deregister Spot Instances from an Elastic Load Balancing Application Load Balancer.

Architecture

In this reference architecture, you use an AWS CloudFormation template to deploy the following:

After the AWS CloudFormation stack deployment is complete, you then create an Amazon EC2 Spot Fleet request diversified across both Availability Zones and use a couple of recent Spot Fleet features: Elastic Load Balancing integration and Tagging Spot Fleet Instances. When any of the Spot Instances receives an interruption notice, Spot Fleet sends the event to CloudWatch Events.
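The event that Spot Fleet publishes can be sketched as follows. This is an abbreviated illustration: only the `source`, `detail-type`, and `detail` fields used later in this post are shown (a real event carries additional metadata), and the instance ID is a placeholder.

```python
# Minimal sketch of the CloudWatch Events payload a target receives for a
# Spot Instance interruption notice (fields abbreviated for illustration).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {
        "instance-id": "i-0123456789abcdef0",   # placeholder instance ID
        "instance-action": "terminate",
    },
}

def parse_interruption(event):
    """Return (instance_id, action) if this is an interruption warning."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    detail = event["detail"]
    return detail["instance-id"], detail["instance-action"]

print(parse_interruption(sample_event))
# ('i-0123456789abcdef0', 'terminate')
```

These are the same `detail` fields the Lambda function shown later reads before deregistering the instance.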
The CloudWatch Events rule then notifies both targets, the Lambda function and SNS topic. The Lambda function detaches the Spot Instance from the Application Load Balancer target group, taking advantage of nearly a full two minutes of connection draining before the instance is interrupted. The SNS topic also receives a message, and is provided as an example for the reader to use as an exercise.

Walkthrough

To complete this walkthrough, have the AWS CLI installed and configured, as well as the ability to launch CloudFormation stacks.

Launch the stack

Go ahead and launch the CloudFormation stack. You can check it out from GitHub, or grab the template directly. In this post, I use the stack name "spot-spin-cwe", but feel free to use any name you like. Just remember to change it in the instructions.

$ git clone https://github.com/awslabs/ec2-spot-labs.git
$ aws cloudformation create-stack --stack-name spot-spin-cwe \
  --template-body file://ec2-spot-labs/ec2-spot-interruption-notice-cloudwatch-events/ec2-spot-interruption-notice-cloudwatch-events.yaml \
  --capabilities CAPABILITY_IAM

You should receive a StackId value in return, confirming the stack is launching.

{
  "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/spot-spin-cwe/083e7ad0-0ade-11e8-9e36-500c219ab02a"
}

Review the details

Here are the details of the architecture being launched by the stack.

IAM permissions

Give permissions to a few components in the architecture:
The Lambda function needs basic Lambda function execution permissions so that it can write logs to CloudWatch Logs. You can use the AWS managed policy for this. It also needs to describe EC2 tags as well as deregister targets within Elastic Load Balancing. You can create a custom policy for these.

lambdaFunctionRole:
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Action:
            - sts:AssumeRole
          Effect: Allow
          Principal:
            Service:
              - lambda.amazonaws.com
      Version: 2012-10-17
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    Path: /
    Policies:
      - PolicyDocument:
          Statement:
            - Action: elasticloadbalancing:DeregisterTargets
              Effect: Allow
              Resource: '*'
            - Action: ec2:DescribeTags
              Effect: Allow
              Resource: '*'
          Version: '2012-10-17'
        PolicyName:
          Fn::Join:
            - '-'
            - - Ref: AWS::StackName
              - lambdaFunctionRole
  Type: AWS::IAM::Role

Allow CloudWatch Events to call the Lambda function and publish to the SNS topic.

lambdaFunctionPermission:
  Properties:
    Action: lambda:InvokeFunction
    FunctionName:
      Fn::GetAtt:
        - lambdaFunction
        - Arn
    Principal: events.amazonaws.com
    SourceArn:
      Fn::GetAtt:
        - eventRule
        - Arn
  Type: AWS::Lambda::Permission

snsTopicPolicy:
  DependsOn:
    - snsTopic
  Properties:
    PolicyDocument:
      Id:
        Fn::GetAtt:
          - snsTopic
          - TopicName
      Statement:
        - Action: sns:Publish
          Effect: Allow
          Principal:
            Service:
              - events.amazonaws.com
          Resource:
            Ref: snsTopic
      Version: '2012-10-17'
    Topics:
      - Ref: snsTopic
  Type: AWS::SNS::TopicPolicy

Finally, Spot Fleet needs permissions to request Spot Instances, tag, and register targets in Elastic Load Balancing. You can tap into an AWS managed policy for this.
spotFleetRole:
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Action:
            - sts:AssumeRole
          Effect: Allow
          Principal:
            Service:
              - spotfleet.amazonaws.com
      Version: 2012-10-17
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole
    Path: /
  Type: AWS::IAM::Role

Elastic Load Balancing timeout delay

Because you are taking advantage of the two-minute Spot Instance notice, you can tune the Elastic Load Balancing target group deregistration timeout delay to match. When a target is deregistered from the target group, it is put into connection draining mode for the length of the timeout delay: 120 seconds to equal the two-minute notice.

loadBalancerTargetGroup:
  DependsOn:
    - vpc
  Properties:
    HealthCheckIntervalSeconds: 5
    HealthCheckPath: /
    HealthCheckTimeoutSeconds: 2
    Port: 80
    Protocol: HTTP
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: 120
    UnhealthyThresholdCount: 2
    VpcId:
      Ref: vpc
  Type: AWS::ElasticLoadBalancingV2::TargetGroup

CloudWatch Events rule

To capture the Spot Instance interruption notice being published to CloudWatch Events, create a rule with two targets: the Lambda function and the SNS topic.

eventRule:
  DependsOn:
    - snsTopic
  Properties:
    Description: Events rule for Spot Instance Interruption Notices
    EventPattern:
      detail-type:
        - EC2 Spot Instance Interruption Warning
      source:
        - aws.ec2
    State: ENABLED
    Targets:
      - Arn:
          Ref: snsTopic
        Id:
          Fn::GetAtt:
            - snsTopic
            - TopicName
      - Arn:
          Fn::GetAtt:
            - lambdaFunction
            - Arn
        Id:
          Ref: lambdaFunction
  Type: AWS::Events::Rule

Lambda function

The Lambda function does the heavy lifting for you. The details of the CloudWatch event are published to the Lambda function, which then uses boto3 to make a couple of AWS API calls. The first call is to describe the EC2 tags for the Spot Instance, filtering on a key of "loadBalancerTargetGroup". If this tag is found, the instance is then deregistered from the target group ARN stored as the value of the tag.
import boto3

def handler(event, context):
    instanceId = event['detail']['instance-id']
    instanceAction = event['detail']['instance-action']
    try:
        ec2client = boto3.client('ec2')
        # Look up the target group ARN stored as a tag on the instance
        describeTags = ec2client.describe_tags(
            Filters=[
                {'Name': 'resource-id', 'Values': [instanceId]},
                {'Name': 'key', 'Values': ['loadBalancerTargetGroup']}
            ]
        )
    except:
        print("No action being taken. Unable to describe tags for instance id:", instanceId)
        return
    try:
        elbv2client = boto3.client('elbv2')
        deregisterTargets = elbv2client.deregister_targets(
            TargetGroupArn=describeTags['Tags'][0]['Value'],
            Targets=[{'Id': instanceId}]
        )
    except:
        print("No action being taken. Unable to deregister targets for instance id:", instanceId)
        return
    print("Detaching instance from target:")
    print(instanceId, describeTags['Tags'][0]['Value'], deregisterTargets, sep=",")
    return

SNS topic

Finally, you've created an SNS topic as an example target. For example, you could subscribe an email address to the SNS topic in order to receive email notifications when a Spot Instance interruption notice is received.

snsTopic:
  Properties:
    DisplayName: SNS Topic for EC2 Spot Instance Interruption Notices
  Type: AWS::SNS::Topic

Create a Spot Fleet request

To proceed to creating your Spot Fleet request, use some of the resources that the CloudFormation stack created to populate the Spot Fleet request launch configuration. You can find the values in the outputs of the CloudFormation stack:

$ aws cloudformation describe-stacks --stack-name spot-spin-cwe

Using the output values of the CloudFormation stack, update the following values in the Spot Fleet request configuration:
Be sure to also replace %amiId% with the latest Amazon Linux AMI for your region and %keyName% with a key pair from your environment.

{
  "AllocationStrategy": "diversified",
  "IamFleetRole": "%spotFleetRole%",
  "LaunchSpecifications": [
    {
      "ImageId": "%amiId%",
      "InstanceType": "c4.large",
      "Monitoring": {
        "Enabled": true
      },
      "KeyName": "%keyName%",
      "SubnetId": "%publicSubnet1%,%publicSubnet2%",
      "UserData": "IyEvYmluL2Jhc2gKeXVtIC15IHVwZGF0ZQp5dW0gLXkgaW5zdGFsbCBodHRwZApjaGtjb25maWcgaHR0cGQgb24KaW5zdGFuY2VpZD0kKGN1cmwgaHR0cDovLzE2OS4yNTQuMTY5LjI1NC9sYXRlc3QvbWV0YS1kYXRhL2luc3RhbmNlLWlkKQplY2hvICJoZWxsbyBmcm9tICRpbnN0YW5jZWlkIiA+IC92YXIvd3d3L2h0bWwvaW5kZXguaHRtbApzZXJ2aWNlIGh0dHBkIHN0YXJ0Cg==",
      "TagSpecifications": [
        {
          "ResourceType": "instance",
          "Tags": [
            {
              "Key": "loadBalancerTargetGroup",
              "Value": "%loadBalancerTargetGroup%"
            }
          ]
        }
      ]
    }
  ],
  "TargetCapacity": 2,
  "TerminateInstancesWithExpiration": true,
  "Type": "maintain",
  "ReplaceUnhealthyInstances": true,
  "InstanceInterruptionBehavior": "terminate",
  "LoadBalancersConfig": {
    "TargetGroupsConfig": {
      "TargetGroups": [
        {
          "Arn": "%loadBalancerTargetGroup%"
        }
      ]
    }
  }
}

Save the configuration and place the Spot Fleet request:

$ aws ec2 request-spot-fleet --spot-fleet-request-config file://sfr.json

You should receive a SpotFleetRequestId in return, confirming the request:

{
  "SpotFleetRequestId": "sfr-3cec4927-9d86-4cc5-a4f0-faa996c841b7"
}

You can confirm that the Spot Fleet request was fulfilled by checking that ActivityStatus is "fulfilled", or by checking that FulfilledCapacity is greater than or equal to TargetCapacity, while describing the request:

$ aws ec2 describe-spot-fleet-requests --spot-fleet-request-id sfr-3cec4927-9d86-4cc5-a4f0-faa996c841b7
{
  "SpotFleetRequestConfigs": [
    {
      "ActivityStatus": "fulfilled",
      "CreateTime": "2018-02-08T01:23:16.029Z",
      "SpotFleetRequestConfig": {
        "AllocationStrategy": "diversified",
        "ExcessCapacityTerminationPolicy": "Default",
        "FulfilledCapacity": 2.0,
        …
      "TargetCapacity": 2,
      …
    }
  ]
}

Next, you can confirm that
the Spot Instances have been registered with the Elastic Load Balancing target group and are in a healthy state:

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/spot-loadB-1DZUVWL720VS6/26456d12cddbf23a
{
  "TargetHealthDescriptions": [
    {
      "Target": {
        "Id": "i-056c95d9dd6fde892",
        "Port": 80
      },
      "HealthCheckPort": "80",
      "TargetHealth": {
        "State": "healthy"
      }
    },
    {
      "Target": {
        "Id": "i-06c4c47228fd999b8",
        "Port": 80
      },
      "HealthCheckPort": "80",
      "TargetHealth": {
        "State": "healthy"
      }
    }
  ]
}

Test

In order to test, you can take advantage of the fact that any interruption action that Spot Fleet takes on a Spot Instance results in a Spot Instance interruption notice being provided. Therefore, you can simply decrease the target size of your Spot Fleet from 2 to 1. The instance that is interrupted receives the interruption notice:

$ aws ec2 modify-spot-fleet-request --spot-fleet-request-id sfr-3cec4927-9d86-4cc5-a4f0-faa996c841b7 --target-capacity 1
{
  "Return": true
}

As soon as the interruption notice is published to CloudWatch Events, the Lambda function triggers and detaches the instance from the target group, effectively putting the instance in a draining state.
$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/spot-loadB-1DZUVWL720VS6/26456d12cddbf23a
{
  "TargetHealthDescriptions": [
    {
      "Target": {
        "Id": "i-0c3dcd78efb9b7e53",
        "Port": 80
      },
      "HealthCheckPort": "80",
      "TargetHealth": {
        "State": "draining",
        "Reason": "Target.DeregistrationInProgress",
        "Description": "Target deregistration is in progress"
      }
    },
    {
      "Target": {
        "Id": "i-088c91a66078b4299",
        "Port": 80
      },
      "HealthCheckPort": "80",
      "TargetHealth": {
        "State": "healthy"
      }
    }
  ]
}

Conclusion

Amazon EC2 Spot Instance interruption notices are an extremely powerful tool when taking advantage of Amazon EC2 Spot Instances in your workloads, for tasks such as saving state, draining connections, and much more. I'd love to hear how you are using them in your own environment!
Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/building-an-immersive-vr-streaming-solution-on-aws/ This post was contributed by: With the explosion in virtual reality (VR) technologies over the past few years, we’ve had an increasing number of customers ask us for advice and best practices around deploying their VR-based products and service offerings on the AWS Cloud. It soon became apparent that while the VR ecosystem is large in both scope and depth of types of workloads (gaming, e-medicine, security analytics, live streaming events, etc.), many of the workloads followed repeatable patterns, with storage and delivery of live and on-demand immersive video at the top of the list. Looking at consumer trends, the desire for live and on-demand immersive video is fairly self-explanatory. VR has ushered in convenient and low-cost access for consumers and businesses to a wide variety of options for consuming content, ranging from browser playback of live and on-demand 360º video, all the way up to positional tracking systems with a high degree of immersion. All of these scenarios contain one lowest common denominator: video. Which brings us to the topic of this post. We set out to build a solution that could support both live and on-demand events, bring with it a high degree of scalability, be flexible enough to support transformation of video if required, run at a low cost, and use open-source software to every extent possible. In this post, we describe the reference architecture we created to solve this challenge, using Amazon EC2 Spot Instances, Amazon S3, Elastic Load Balancing, Amazon CloudFront, AWS CloudFormation, and Amazon CloudWatch, with open-source software such as NGINX, FFMPEG, and JavaScript-based client-side playback technologies. We step you through deployment of the solution and how the components work, as well as the capture, processing, and playback of the underlying live and on-demand immersive media streams. 
This GitHub repository includes the source code necessary to follow along. We've also provided a self-paced workshop from AWS re:Invent 2017 that breaks down this architecture even further. If you experience any issues or would like to suggest an enhancement, please use the GitHub issue tracker.

Prerequisites

As a side note, you'll also need a few additional components to take best advantage of the infrastructure:
You're going to generate HTML5-compatible video (Apple HLS to be exact), but there are many other native iOS and Android options for consuming the media that you create. It's also worth noting that your playback device should support projection of your input stream. We'll talk more about that in the next section.

How does immersive media work?

At its core, any flavor of media, be that audio or video, can be viewed with some level of immersion. The ability to interact passively or actively with the content brings with it a further level of immersion. When you look at VR devices with rotational and positional tracking, you naturally need more than an ability to interact with a flat plane of video. The challenge for any creative thus becomes a tradeoff between immersion features (degrees of freedom, monoscopic 2D or stereoscopic 3D, resolution, framerate) and overall complexity. Where can you start from a simple and effective point of view, that enables you to build out a fairly modular solution and test it?

There are a few areas we chose to be prescriptive with our solution. First, monoscopic 360-degree video is currently one of the most commonly consumed formats on consumer devices. We explicitly chose to focus on this format, although the infrastructure is not limited to it. More on this later.

Second, if you look at most consumer-level cameras that provide live streaming ability, and even many professional rigs, there are at least two lenses or cameras at a minimum. The figure above illustrates a single capture from a Ricoh Theta S in monoscopic 2D. The left image captures 180 degrees of the field of view, and the right image captures the other 180 degrees. For this post, we chose a typical midlevel camera (the Ricoh Theta S), and used a laptop with open-source software (Open Broadcaster Software) to encode and stream the content. Again, the solution infrastructure is not limited to this particular brand of camera.
Any camera or encoder that outputs 360º video and encodes to H264+AAC with an RTMP transport will work. Third, capturing and streaming multiple camera feeds brings additional requirements around stream synchronization and cost of infrastructure. There is also a requirement to stitch media in real time, which can be CPU and GPU-intensive. Many devices and platforms do this either on the device, or via outboard processing that sits close to the camera location. If you stitch and deliver a single stream, you can save the costs of infrastructure and bitrate/connectivity requirements. We chose to keep these aspects on the encoder side to save on cost and reduce infrastructure complexity. Last, the most common delivery format that requires little to no processing on the infrastructure side is equirectangular projection, as per the above figure. By stitching and unwrapping the spherical coordinates into a flat plane, you can easily deliver the video exactly as you would with any other live or on-demand stream. The only caveat is that resolution and bit rate are of utmost importance. The higher you can push these (high bit rate @ 4K resolution), the more immersive the experience is for viewers. This is due to the increase in sharpness and reduction of compression artifacts. Knowing that we would be transcoding potentially at 4K on the source camera, but in a format that could be transmuxed without an encoding penalty on the origin servers, we implemented a pass-through for the highest bit rate, and elected to only transcode lower bitrates. This requires some level of configuration on the source encoder, but saves on cost and infrastructure. Because you can conform the source stream, you may as well take advantage of that! For this post, we chose not to focus on ways to optimize projection. However, the reference architecture does support this with additional open source components compiled into the FFMPEG toolchain. 
A number of options are available to this end, such as open source equirectangular to cubic transformation filters. There is a tradeoff, however, in that reprojection implies that all streams must be transcoded.

Processing and origination stack

To get started, we've provided a CloudFormation template that you can launch directly into your own AWS account. We quickly review how it works, the solution's components, key features, and processing steps, and examine the main configuration files. Following this, you launch the stack, and then proceed with camera and encoder setup.
How it works

A camera captures content, and with the help of a contribution encoder, publishes a live stream in equirectangular format. The stream is encoded at a high bit rate (at least 2.5 Mbps, but typically 16+ Mbps for 4K) using H264 video and AAC audio compression codecs, and delivered to a primary origin via the RTMP protocol. Streams may transit over the internet or dedicated links to the origins. Typically, for live events in the field, internet or bonded cellular are the most widely used.

The encoder is typically configured to push the live stream to a primary URI, with the ability (depending on the source encoding software/hardware) to roll over to a backup publishing point origin if the primary fails. Because you run across multiple Availability Zones, this architecture could handle an entire zone outage with minor disruption to live events. The primary and backup origins handle the ingestion of the live stream as well as transcoding to H264+AAC-based adaptive bit rate sets. After transcode, they package the streams into HLS for delivery and create a master-level manifest that references all adaptive bit rates.

The edge cache fleet pulls segments and manifests from the active origin on demand, and supports failover from primary to backup if the primary origin fails. By adding this caching tier, you effectively separate the encoding backend tier from the cache tier that responds to client or CDN requests. In addition to origin protection, this separation allows you to independently monitor, configure, and scale these components.

Viewers can use the sample HTML5 player (or compatible desktop, iOS or Android application) to view the streams. Navigation in the 360-degree view is handled either natively via device-based gyroscope, positionally via more advanced devices such as a head mount display, or via mouse drag on the desktop.
Adaptive bit rate is key here, as this allows you to target multiple device types, giving the player on each device the option of selecting an optimum stream based on network conditions or device profile.

Solution components

When you deploy the CloudFormation template, all the architecture services referenced above are created and launched. This includes:
The template also provisions the underlying dependencies:
The edge cache fleet instances need some way to discover the primary and backup origin locations. You use elastic network interfaces and elastic IP addresses for this purpose. As each component of the infrastructure is provisioned, software required to transcode and process the streams across the Spot Instances is automatically deployed. This includes NGiNX-RTMP for ingest of live streams, FFMPEG for transcoding, NGINX for serving, and helper scripts to handle various tasks (potential Spot Instance interruptions, queueing, moving content to S3). Metrics and logs are available through CloudWatch and you can manage the deployment using the CloudFormation console or AWS CLI. Key features include:
You’re supporting both live and on-demand. On-demand content is created automatically when the encoder stops publishing to the origin.
Spot Instances are used exclusively for infrastructure to optimize cost and scale throughput. To protect the origin servers, the midtier cache fleet pulls, caches, and delivers to downstream CDNs.
The Application Load Balancer endpoint allows CloudFront or any third-party CDN to source content from the edge fleet and, indirectly, the origin.
These three components form the core of the stream ingest, transcode, packaging, and delivery infrastructure, as well as the VOD-processing component for creating transcoded VOD content on-demand.
All infrastructure can be easily created and modified using CloudFormation. To provide an end-to-end experience right away, we've included a test player page hosted as a static site on S3. This page uses A-Frame, a cross-platform, open-source framework for building VR experiences in the browser. Though A-Frame provides many features, it's used here to render a sphere that acts as a 3D canvas for your live stream.

Spot Instance considerations

At this stage, and before we discuss processing, it is important to understand how the architecture operates with Spot Instances. Spot Instances are spare compute capacity in the AWS Cloud, available to you at steep discounts compared to On-Demand prices. Spot Instances enable you to optimize your costs on the AWS Cloud and scale your application's throughput up to 10X for the same budget. By selecting Spot Instances, you can save up to 90% on On-Demand prices. This allows you to greatly reduce the cost of running the solution because, outside of S3 for storage and CloudFront for delivery, this solution is almost entirely dependent on Spot Instances for its infrastructure requirements. We also know that customers running events look to deploy streaming infrastructure at the lowest price point, so it makes sense to take advantage of Spot Instances wherever possible.

A potential challenge when using Spot Instances for live streaming and on-demand processing is that you need to proactively deal with potential Spot Instance interruptions. How can you best deal with this? First, the origin is deployed in a primary/backup configuration. If a Spot Instance interruption happens on the primary origin, you can fail over to the backup with a brief interruption. Should a potential interruption not be acceptable, then either Reserved Instances or On-Demand options (or a combination) can be used at this tier.
Second, the edge cache fleet runs a job (started automatically at system boot) that periodically queries the local instance metadata to detect if an interruption is scheduled to occur. Spot Instance interruption notices provide a two-minute warning of a pending interruption. If you poll every 5 seconds, you have almost 2 full minutes to detach from the Load Balancer and drain or stop any traffic directed to your instance.

Lastly, use an SQS queue when transcoding. If a transcode on a Spot Instance is interrupted, the stale item falls back into the SQS queue and is eventually re-surfaced into the processing pipeline. Only remove items from the queue after the transcoded files have been successfully moved to the destination S3 bucket.

Processing

As discussed in the previous sections, you pass through the video for the highest bit rate to save on having to increase the instance size to transcode the 4K or similar high bit rate or resolution content. We've selected a handful of bitrates for the adaptive bit rate stack. You can customize any of these to suit the requirements for your event. The default ABR stack includes:
These can be modified by editing the /etc/nginx/rtmp.d/rtmp.conf NGINX configuration file on the origin or the CloudFormation template. It's important to understand where and how streams are transcoded. When the source high bit rate stream enters the primary or backup origin at the /live RTMP application entry point, it is recorded on stop and start of publishing. On completion, it is moved to S3 by a cleanup script, and a message is placed in your SQS queue for workers to use. These workers transcode the media and push it to a playout location bucket. This solution uses Spot Fleet with automatic scaling to drive the fleet size. You can customize it based on CloudWatch metrics, such as simple utilization metrics, to drive the size of the fleet.

Why use Spot Instances for the transcode option instead of Amazon Elastic Transcoder? This allows you to implement reprojection of the input stream via FFMPEG filters in the future.

The origins handle all the heavy live streaming work. Edges only store and forward the segments and manifests, and provide scaling plus reduction of burden on the origin. This lets you customize the origin to the right compute capacity without having to rely on a 'high watermark' for compute sizing, thus saving additional costs.

Loopback is an important concept for the live origins. The incoming stream entering /live is transcoded by FFMPEG to multiple bit rates, which are streamed back to the same host via RTMP, on a secondary publishing point /show. The secondary publishing point is transparent to the user and encoder, but handles HLS segment generation and cleanup, and keeps a sliding window of live segments and constantly updating manifests.

Configuration

Our solution provides two key points of configuration that can be used to customize the solution to accommodate ingest, recording, transcoding, and delivery, all controlled via origin and edge configuration files, which are described later.
In addition, a number of job scripts run on the instances to provide hooks into Spot Instance interruption events and the VOD SQS-based processing queue.

Origin instances

The rtmp.conf excerpt below also shows additional parameters that can be customized, such as maximum recording file size in Kbytes, HLS fragment length, and playlist sizes. We've created these in accordance with general industry best practices to ensure the reliable streaming and delivery of your content.

rtmp {
    server {
        listen 1935;
        chunk_size 4000;

        application live {
            live on;
            record all;
            record_path /var/lib/nginx/rec;
            record_max_size 128000K;
            exec_record_done /usr/local/bin/record-postprocess.sh $path $basename;
            exec /usr/local/bin/ffmpeg <…parameters…>;
        }

        application show {
            live on;
            hls on;
            ...
            hls_type live;
            hls_fragment 10s;
            hls_playlist_length 60s;
            ...
        }
    }
}

This exposes a few URL endpoints for debugging and general status. In production, you would most likely turn these off:
You also control the TTLs, as previously discussed. It's important to note here that you are setting TTLs explicitly at the origin, instead of in CloudFront's distribution configuration. While both are valid, this approach allows you to reconfigure and restart the service on the fly without having to push changes through CloudFront. This is useful for debugging any caching or playback issues.

location /stat {
    ...
}
location /stat.xsl {
    ...
}
location /control {
    ...
}

# origin controlled ttls
location ~* /hls/.*\.ts$ {
    ...
    add_header Cache-Control "max-age=900";
    expires 900s;
}
location ~* /hls/.*\.m3u8$ {
    ...
    add_header Cache-Control "max-age=5";
    expires 5s;
}

Handler scripts

Here is a brief overview of the scripts we use to handle the events and processing in the solution. We encourage you to take some time to review them.
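The origin-controlled TTL policy above can be summarized in a small helper for reference. This is an illustrative sketch, not part of the deployed solution; the function name and fallback value are our own, while the 900-second segment and 5-second manifest TTLs come from the configuration shown above.

```python
def cache_control_for(path):
    """Return the Cache-Control header the origin applies to an HLS asset.

    Segments (.ts) are immutable once written, so they get a long TTL;
    manifests (.m3u8) update constantly during a live event, so they get
    a very short TTL (values taken from the NGINX config above).
    """
    if path.endswith(".ts"):
        return "max-age=900"   # 15-minute TTL for media segments
    if path.endswith(".m3u8"):
        return "max-age=5"     # 5-second TTL for playlists
    return "no-cache"          # assumption: other paths are not cached

print(cache_control_for("/hls/show_720p-001.ts"))  # max-age=900
print(cache_control_for("/hls/show.m3u8"))         # max-age=5
```

Keeping segment TTLs long and manifest TTLs short is what lets downstream caches and CDNs serve a live sliding window without stale playlists.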
For more details, see the Delivery and Playback section later in this post.

Camera source

With the processing and origination infrastructure running, you need to configure your camera and encoder. As discussed, we chose to use a Ricoh Theta S camera and Open Broadcaster Software (OBS) to stitch and deliver a stream into the infrastructure. Ricoh provides a free 'blender' driver, which allows you to transform, stitch, encode, and deliver both transformed equirectangular (used for this post) video as well as spherical (two camera) video. The Theta provides an easy way to get capturing for under $300, and OBS is a free and open-source software application for capturing and live streaming on a budget. It is quick, cheap, and enjoys wide use by the gaming community. OBS lowers the barrier to getting started with immersive streaming.

While the resolution and bit rate of the Theta may not be 4K, it still provides us with a way to test the functionality of the entire pipeline end to end, without having to invest in a more expensive camera rig. One could also use this type of model to target smaller events, which may involve mobile devices with smaller display profiles, such as phones and potentially smaller sized tablets.

Looking for a more professional solution? Nokia, GoPro, Samsung, and many others have options ranging from $500 to $50,000. This solution is based around the Theta S capabilities, but we'd encourage you to extend it to meet your specific needs. If your device can support equirectangular RTMP, then it can deliver media through the reference architecture (dependent on instance sizing for higher bit rate sources, of course). If additional features are required, such as camera stitching, mixing, or device bonding, we'd recommend exploring a commercial solution such as Teradek Sphere.

All cameras have varied PC connectivity support. We chose the Ricoh Theta S due to the real-time video connectivity that it provides through software drivers on macOS and PC.
If you plan to purchase a camera to use with a PC, confirm that it supports real-time capabilities as a peripheral device.

Encoding and publishing

Now that you have a camera, encoder, and AWS stack running, you can finally publish a live stream. To start streaming with OBS, configure the source camera and set a publishing point. Use the RTMP application name /live on port 1935 to ingest into the primary origin’s Elastic IP address provided as the CloudFormation output: primaryOriginElasticIp. You also need to choose a stream name or stream key in OBS. You can use any stream name, but keep the naming short and lowercase, and use only alphanumeric characters. This avoids any parsing issues on client-side player frameworks. There’s no publish point protection in your deployment, so any stream key works with the default NGINX-RTMP configuration. For more information about stream keys, publishing point security, and extending the NGINX-RTMP module, see the NGINX-RTMP Wiki. You should end up with a configuration similar to the following: The Output settings dialog allows us to rescale the video canvas and encode it for delivery to our AWS infrastructure. In the dialog below, we’ve set the Theta to encode at 5 Mbps in CBR mode using a preset optimized for low CPU utilization. We chose these settings in accordance with best practices for the stream pass-through at the origin for the initial incoming bit rate. You may notice that they largely match the FFmpeg encoding settings we use on the origin – namely constant bit rate, a single audio track, and x264 encoding with the ‘veryfast’ encoding profile.

Live to On-Demand

As you may have noticed, an on-demand component is included in the solution architecture. When talking to customers, one frequent request that we see is that they would like to record the incoming stream with as little effort as possible. NGINX-RTMP’s recording directives provide an easy way to accomplish this.
We record any newly published stream on stream start at the primary or backup origins, using the incoming source stream, which also happens to be the highest bit rate. When the encoder stops broadcasting, NGINX-RTMP executes an exec_record_done script – record-postprocess.sh (described in the Configuration section earlier), which ensures that the content is well-formed, and then moves it to an S3 ingest bucket for processing. Transcoding the content into adaptive bit rate renditions ready for VOD is a multi-step pipeline. First, Spot Instances in the transcoding cluster periodically poll the SQS queue for new jobs. Items on the queue are pulled off on demand by processing instances and transcoded via FFmpeg into adaptive bit rate HLS. This also allows you to extend FFmpeg using filters for cubic and other bitrate-optimizing 360-specific transforms. Finally, transcoded content is moved from the ingest bucket to an egress bucket, making it ready for playback via your CloudFront distribution. Separate ingest and egress by bucket to provide hard security boundaries between source recordings (which are highest quality and unencrypted) and destination derivatives (which may be lower quality and potentially require encryption). Bucket separation also allows you to order and archive input and output content using different taxonomies, which is common when moving content from an asset management and archival pipeline (the ingest bucket) to a consumer-facing playback pipeline (the egress bucket, and any other attached infrastructure or services, such as CMS, mobile applications, and so forth). Because streams are pushed over the internet, there is always the chance that an interruption could occur in the network path, or even at the origin side of the equation (primary to backup roll-over). Both of these scenarios could result in malformed or partial recordings being created.
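Returning to the transcoding fan-out described above, a processing instance would typically run one FFmpeg invocation per rendition. The sketch below is a hypothetical example, not the solution’s actual job script; the rendition ladder, segment length, and output naming are assumptions:

```python
def hls_rendition_cmd(source, out_dir, height, video_bitrate):
    """Build one FFmpeg argument list per rendition: scale, re-encode
    CBR x264, and segment into an HLS playlist for that quality level."""
    return [
        "ffmpeg", "-i", source,
        "-vf", f"scale=-2:{height}",           # keep aspect ratio, even width
        "-c:v", "libx264", "-preset", "veryfast",
        "-b:v", video_bitrate, "-maxrate", video_bitrate, "-bufsize", "10M",
        "-c:a", "aac",
        "-f", "hls", "-hls_time", "6", "-hls_playlist_type", "vod",
        f"{out_dir}/{height}p.m3u8",
    ]

# A hypothetical ladder for equirectangular content
LADDER = [(1080, "5M"), (720, "2.5M"), (480, "1M")]
```

Each job pulled off the SQS queue would then map to len(LADDER) invocations, after which the playlists and segments are copied to the egress bucket.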
For the best level of reliability, encoding should always be recorded locally on-site as a precaution against potential stream interruptions.

Delivery and playback

With the camera turned on and OBS streaming to AWS, the final step is to play the live stream. We’ve primarily tested the prototype player on the latest Chrome and Firefox browsers on macOS, so your mileage may vary on different browsers or operating systems. For those looking to try the livestream on Google Cardboard or similar headsets, native apps for iOS (VRPlayer) and Android exist that can play back HLS streams. The prototype player is hosted in an S3 bucket and can be found from the CloudFormation output clientWebsiteUrl. It requires a stream URL provided as a query parameter ?url=<stream_url> to begin playback. This stream URL is determined by the RTMP stream configuration in OBS. For example, if OBS is publishing to rtmp://x.x.x.x:1935/live/foo, the resulting playback URL would be:

https://<cloudFrontDistribution>/hls/foo.m3u8

The combined player URL and playback URL results in a path like this one:

https://<clientWebsiteUrl>/?url=https://<cloudFrontDistribution>/hls/foo.m3u8

To assist in setup and debugging, we’ve provided a test source as part of the CloudFormation template. A color bar pattern with timecode and audio is generated by FFmpeg running as an ECS task. Much like OBS, FFmpeg streams the test pattern to the primary origin over the RTMP protocol. The prototype player and test HLS stream can be accessed by opening the clientTestPatternUrl CloudFormation output link.

What’s next?

In this post, we walked you through the design and implementation of a full end-to-end immersive streaming solution architecture. As you may have noticed, there are a number of areas this could expand into, and we intend to do this in follow-up posts around the topic of virtual reality media workloads in the cloud.
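The publish-URL-to-playback-URL mapping described above can be captured in a few lines. This helper is a sketch (the function and parameter names are our own), but the path logic matches the examples:

```python
from urllib.parse import urlparse

def playback_urls(rtmp_url, cdn_domain, player_host):
    """Derive the HLS stream URL and the full player URL from the
    RTMP publish URL configured in OBS."""
    stream_key = urlparse(rtmp_url).path.rsplit("/", 1)[-1]  # e.g. 'foo'
    stream_url = f"https://{cdn_domain}/hls/{stream_key}.m3u8"
    return stream_url, f"https://{player_host}/?url={stream_url}"
```

For rtmp://x.x.x.x:1935/live/foo, this yields https://&lt;cloudFrontDistribution&gt;/hls/foo.m3u8 and the matching player link, exactly as in the example above.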
We’ve identified a number of topics such as load testing, content protection, client-side metrics and analytics, and CI/CD infrastructure for 24/7 live streams. If you have any requests, please drop us a line. We would like to extend extra-special thanks to Scott Malkie and Chad Neal for their help and contributions to this post and reference architecture.
Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/aws-reinvent-2017-guide-to-all-things-containers/ Contributed by Tiffany Jernigan, Developer Advocate for Amazon ECS

Get ready for takeoff!

We made sure that this year’s re:Invent is chock-full of containers: there are over 40 sessions! New to containers? No problem, we have several introductory sessions for you to dip your toes. Been using containers for years and know the ins and outs? Don’t miss our technical deep-dives and interactive chalk talks led by container experts. If you can’t make it to Las Vegas, you can catch the keynotes and session recaps from our livestream and on Twitch.

Session types

Not everyone learns the same way, so we have multiple types of breakout content:
Session levels

Whether you’re new to containers or you’ve been using them for years, you’ll find useful information at every level.
Session locations

All container sessions are located in the Aria Resort.

MONDAY 11/27

Breakout sessions
Level 200 (Introductory)
CON202 – Getting Started with Docker and Amazon ECS

Chalk talks
Level 200 (Introductory)
CON211 – Reducing your Compute Footprint with Containers and Amazon ECS
CON212 – Anomaly Detection Using Amazon ECS, AWS Lambda, and Amazon EMR
Level 400 (Expert)
CON410 – Advanced CICD with Amazon ECS Control Plane

Workshops
Level 200 (Introductory)
CON209 – Interstella 8888: Learn How to Use Docker on AWS
CON213 – Hands-on Deployment of Kubernetes on AWS

TUESDAY 11/28

Breakout Sessions
Level 200 (Introductory)
CON206 – Docker on AWS
CON208 – Building Microservices on AWS
Level 300 (Advanced)
CON302 – Building a CICD Pipeline for Containers on AWS
Ajit Zadgaonkar, Director of Engineering and Operations at Edmunds, walks through best practices for CICD architectures used by his team to deploy containers. We also deep dive into topics such as how to create an accessible CICD platform and architect for safe blue/green deployments.
CON307 – Building Effective Container Images
In this session, AWS Sr. Technical Evangelist Abby Fuller covers how to create effective images to run containers in production. This includes an in-depth discussion of how Docker image layers work, things you should think about when creating your images, working with Amazon ECR, and mise-en-place for installing dependencies. Prakash Janakiraman, Co-Founder and Chief Architect at Nextdoor, discusses high-level and language-specific best practices for building images and how Nextdoor uses these practices to successfully scale their containerized services with a small team.
CON309 – Containerized Machine Learning on AWS
CON320 – Monitoring, Logging, and Debugging for Containerized Services

Chalk Talks
Level 300 (Advanced)
CON314 – Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS
CON322 – Maximizing Amazon ECS for Large-Scale Workloads
CON323 – Microservices Architectures for the Enterprise
CON324 – Windows Containers on Amazon ECS
CON326 – Remote Sensing and Image Processing on AWS

Workshops
Level 300 (Advanced)
CON317 – Advanced Container Management at Catsndogs.lol
Additional requirements:
WEDNESDAY 11/29

Birds of a Feather (BoF)
CON01 – Birds of a Feather: Containers and Open Source at AWS

Breakout Sessions
Level 300 (Advanced)
CON308 – Mastering Kubernetes on AWS
In this session, Arun Gupta, Open Source Strategist for AWS, and Raffaele Di Fazio, software engineer at leading European fashion platform Zalando, show the common practices for running Kubernetes on AWS and share insights from experience in operating tens of Kubernetes clusters in production on AWS. We cover options and recommendations on how to install and manage clusters, configure high availability, perform rolling upgrades and handle disaster recovery, as well as continuous integration and deployment of applications, logging, and security.
CON310 – Moving to Containers: Building with Docker and Amazon ECS
Level 400 (Expert)
CON402 – Advanced Patterns in Microservices Implementation with Amazon ECS
CON404 – Deep Dive into Container Scheduling with Amazon ECS

Chalk Talks
Level 300 (Advanced)
CON312 – Building a Selenium Fleet on the Cheap with Amazon ECS with Spot Fleet
CON315 – Virtually There: Building a Render Farm with Amazon ECS
CON325 – Developing Microservices – from Your Laptop to the Cloud
CON327 – Patterns and Considerations for Service Discovery
CON328 – Building a Development Platform on Amazon ECS

Workshops
Level 300 (Advanced)
CON318 – Interstella 8888: Monolith to Microservices with Amazon ECS
CON332 – Build a Java Spring Application on Amazon ECS

THURSDAY 11/30

Breakout Sessions
Level 200 (Introductory)
CON201 – Containers on AWS – State of the Union
Level 300 (Advanced)
CON304 – Batch Processing with Containers on AWS
Level 400 (Expert)
CON406 – Architecting Container Infrastructure for Security and Compliance

Chalk Talks
Level 300 (Advanced)
CON329 – Full Software Lifecycle Management for Containers Running on Amazon ECS
CON330 – Running Containerized HIPAA Workloads on AWS
Level 400 (Expert)
CON408 – Building a Machine Learning Platform Using Containers on AWS

Workshops
Level 300 (Advanced)
CON319 – Interstella 8888: CICD for Containers on AWS

FRIDAY 12/1

Breakout Sessions
Level 400 (Expert)
CON405 – Moving to Amazon ECS – the Not-So-Obvious Benefits

Chalk Talks
Level 300 (Advanced)
CON331 – Deploying a Regulated Payments Application on Amazon ECS

Workshops
Level 400 (Expert)
CON407 – Interstella 8888: Advanced Microservice Operations

Know before you go

Want to brush up on your container knowledge before re:Invent? Here are some helpful resources to get started:

– Tiffany
Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/
When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or another tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool. One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances. Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money. Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances. In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances and deposits the logs into an S3 bucket. Amazon ECS cluster container instances use Spot Fleet, which is a perfect match for a workload that needs to run when it can. This improves cluster costs.
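The periodic retrieval described above only needs to download log files written since the previous run. The helper below sketches that filter; the input dictionaries mirror the shape of the RDS DescribeDBLogFiles response, and the marker handling (where last_written_ms is persisted between runs) is left as an assumption:

```python
def logs_to_fetch(described_log_files, last_written_ms):
    """Return the names of log files modified since the previous run,
    oldest first, so the job only downloads new data."""
    fresh = [f for f in described_log_files if f["LastWritten"] > last_written_ms]
    return [f["LogFileName"] for f in sorted(fresh, key=lambda f: f["LastWritten"])]
```

The containerized job would feed this list into the SDK’s log-download call for each instance, then upload the results to the S3 bucket.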
The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster. The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose to deliver these logs to their centralized log store. CloudWatch Events defines the schedule on which the container task is launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:
Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer. In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps
That’s it. The log retrieval task now runs on the defined schedule. To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template; the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run longer than Lambda’s execution time limit, or a continuously running EC2 instance may not be cost-effective. In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events. If you have questions or suggestions, please comment below.
Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/ My colleague Sanjay Padhi shared the guest post below in order to recognize an important milestone in the use of EC2 Spot Instances. — Jeff;

A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region. The researchers conducted nearly half a million topic modeling experiments to study how human language is processed by computers. Topic modeling helps in discovering the underlying themes that are present across a collection of documents. Topic models are important because they are used to forecast business trends and help in making policy or funding decisions. These topic models can be run with many different parameters, and the goal of the experiments is to explore how these parameters affect the model outputs.

The Experiment

Ramping to 1.1 Million vCPUs

Here’s the breakdown of the EC2 instance types that they used: Campus resources at Clemson funded by the National Science Foundation were used to determine an effective configuration for the AWS experiments as compared to campus resources, and the AWS cloud resources complement the campus resources for large-scale experiments.

Meet the Team

Professor Apon said about the experiment: I am absolutely thrilled with the outcome of this experiment. The graduate students on the project are amazing. They used resources from AWS and Omnibond and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler of these experiments.
Boyd Wilson (CEO, Omnibond, member of the AWS Partner Network) told me: Participating in this project was exciting. Seeing how the Clemson team developed a provisioning and workflow automation tool that tied into CloudyCluster to build a huge Spot Fleet supercomputer in a single AWS region was outstanding.

About the Experiment

Looking Forward

— Sanjay Padhi, Ph.D., AWS Research and Technical Computing
Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/powering-your-amazon-ecs-cluster-with-amazon-ec2-spot-instances/ This post was graciously contributed by:
Today we are excited to announce that Amazon EC2 Container Service (Amazon ECS) now supports the ability to launch your ECS cluster on Amazon EC2 Spot Instances directly from the ECS console. Spot Instances allow you to bid on spare Amazon EC2 compute capacity. Spot Instances typically cost 50-90% less than On-Demand Instances. Powering your ECS cluster with Spot Instances lets you reduce the cost of running your existing containerized workloads, or increase your compute capacity by two to ten times while keeping the same budget. Or you could do a combination of both! Using Spot Instances, you specify the price you are willing to pay per instance-hour. Your Spot Instance runs whenever your bid exceeds the current Spot price. If your instance is reclaimed due to an increase in the Spot price, you are not charged for the partial hour that your instance has run. The ECS console uses Spot Fleet to deploy your Spot Instances. Spot Fleet attempts to deploy the target capacity you request (expressed in terms of instances or a vCPU count) for your containerized application by launching Spot Instances that result in the best prices for you. If your Spot Instances are reclaimed due to a change in Spot prices or available capacity, Spot Fleet also attempts to maintain its target capacity. Containers are a natural fit for the diverse pool of resources that Spot Fleet thrives on. Spot Fleet enables you to provision capacity across multiple Spot Instance pools (combinations of instance types and Availability Zones), which helps improve your application’s availability and reduce operating costs of the fleet over time. Combining the extensible and flexible container placement system provided by ECS with Spot Fleet allows you to efficiently deploy containerized workloads and easily manage clusters at any scale for a fraction of the cost. Previously, deploying your ECS cluster on Spot Instances was a manual process.
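The billing rule described above (you pay the Spot price per hour, and pay nothing for a partial hour when AWS reclaims the instance) can be illustrated with a small calculation. This models the hourly Spot billing in effect when this post was written; it is a sketch for intuition, not a billing reference:

```python
def spot_charge(hours_run, hourly_spot_prices, reclaimed_by_aws):
    """Sum the Spot price for each full hour; charge the final partial
    hour only when the customer (not AWS) ended the instance."""
    full_hours = int(hours_run)
    charge = sum(hourly_spot_prices[:full_hours])
    partial = hours_run - full_hours
    if partial > 0 and not reclaimed_by_aws:
        charge += hourly_spot_prices[full_hours] * partial
    return round(charge, 6)
```

An instance reclaimed 2.5 hours in at a steady $0.10/hour Spot price is charged $0.20; the same run ended by the customer is charged $0.25.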
In this post, we show you how to achieve high availability, scalability, and cost savings for your container workloads by using the new Spot Fleet integration in the ECS console. We also show you how to build your own ECS cluster on Spot Instances using AWS CloudFormation.

Creating an ECS cluster running on Spot Instances

You can create an ECS cluster using the AWS Management Console.
Choosing an allocation strategy

The two available Spot Fleet allocation strategies are Diversified and Lowest price. The allocation strategy you choose for your Spot Fleet determines how it fulfills your Spot Fleet request from the possible Spot Instance pools. When you use the diversified strategy, the Spot Instances are distributed across all pools. When you use the lowest price strategy, the Spot Instances come from the pool with the lowest price specified in your request. Remember that each instance type (the instance size within each instance family, e.g., c4.4xlarge), in each Availability Zone, in every region, is a separate pool of capacity, and therefore a separate Spot market. By diversifying across as many different instance types and Availability Zones as possible, you can improve the availability of your fleet. You also make your fleet less sensitive to increases in the Spot price in any one pool over time. You can select up to six EC2 instance types to use for your Spot Fleet. In this example, we’ve selected m3, m4, c3, c4, r3, and r4 instance types of size xlarge. You need to enter a bid for your instances. Typically, bidding at or near the On-Demand Instance price is a good starting point. Your bid is the maximum price that you are willing to pay for instance types in that Spot pool. While the Spot price is at or below your bid, you pay the Spot price. Bidding lower ensures that you have lower costs, while bidding higher reduces the probability of interruption. Configure the number of instances to have in your cluster. Spot Fleet attempts to launch the number of Spot Instances that are required to meet the target capacity specified in your request. The Spot Fleet also attempts to maintain its target capacity if your Spot Instances are reclaimed due to a change in Spot prices or available capacity. The latest ECS-optimized AMI is used for the instances when they are launched. Configure your storage and network settings.
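The difference between the two allocation strategies above can be sketched in a few lines. The pool names and prices below are invented for illustration; real pools are (instance type, Availability Zone) pairs as described earlier:

```python
def allocate(pools, target_instances, strategy):
    """pools maps 'instanceType/AZ' -> current Spot price.
    Return how many instances each pool receives."""
    if strategy == "lowestPrice":
        # everything lands in the single cheapest pool
        return {min(pools, key=pools.get): target_instances}
    if strategy == "diversified":
        # spread capacity evenly across all pools
        names = sorted(pools)
        base, extra = divmod(target_instances, len(names))
        return {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}
    raise ValueError(f"unknown strategy: {strategy}")
```

With six xlarge pools and a target of 12 instances, diversified places two per pool, while lowest price places all 12 in one pool and is fully exposed to that pool’s price changes.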
To enable diversification and high availability, be sure to select subnets in multiple Availability Zones. You can’t select multiple subnets in the same Availability Zone in a single Spot Fleet. The ECS container agent makes calls to the ECS API actions on your behalf. Container instances that run the agent require the ecsInstanceRole IAM policy and role for the service to know that the agent belongs to you. If you don’t have the ecsInstanceRole already, you can create one using the ECS console. If you create a managed compute environment that uses Spot Fleet, you must create a role that grants the Spot Fleet permission to bid on, launch, and terminate instances on your behalf. You can also create this role using the ECS console. That’s it! In the ECS console, choose Create to spin up your new ECS cluster running on Spot Instances.

Using AWS CloudFormation to deploy your ECS cluster on Spot Instances

We have also published a reference architecture AWS CloudFormation template that demonstrates how to easily launch a CloudFormation stack and deploy your ECS cluster on Spot Instances. The CloudFormation template includes the Spot Instance termination notice script described later in this post, as well as some additional logging and other example features to get you started quickly. You can find the CloudFormation template in the Amazon EC2 Spot Instances GitHub repo. Give it a try and customize it as needed for your environment!

Handling termination

With Spot Instances, you never pay more than the price you specified. If the Spot price exceeds your bid price for a given instance, it is terminated automatically for you. The best way to protect against Spot Instance interruption is to architect your containerized application to be fault-tolerant. In addition, you can take advantage of a feature called Spot Instance termination notices, which provides a two-minute warning before EC2 must terminate your Spot Instance.
This warning is made available to the applications on your Spot Instance using an item in the instance metadata. When you deploy your ECS cluster on Spot Instances using the console, AWS installs a script that checks every 5 seconds for the Spot Instance termination notice. If the notice is detected, the script immediately updates the container instance state to DRAINING. A simplified version of the Spot Instance termination notice script is as follows:

    #!/bin/bash
    while sleep 5; do
      if [ -z $(curl -Isf http://169.254.169.254/latest/meta-data/spot/termination-time) ]; then
        /bin/false
      else
        ECS_CLUSTER=$(curl -s http://localhost:51678/v1/metadata | jq .Cluster | tr -d \")
        CONTAINER_INSTANCE=$(curl -s http://localhost:51678/v1/metadata \
          | jq .ContainerInstanceArn | tr -d \")
        aws ecs update-container-instances-state --cluster $ECS_CLUSTER \
          --container-instances $CONTAINER_INSTANCE --status DRAINING
      fi
    done

When you set a container instance to DRAINING, ECS prevents new tasks from being scheduled for placement on the container instance. If the resources are available, replacement service tasks are started on other container instances in the cluster. Container instance draining enables you to remove a container instance from a cluster without impacting tasks in your cluster. Service tasks on the container instance that are in the PENDING state are stopped immediately. Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service’s deployment configuration parameters, minimumHealthyPercent and maximumPercent.

ECS on Spot Instances in action

Want to see how customers are already powering their ECS clusters on Spot Instances? Our friends at Mapbox are doing just that. Mapbox is a platform for designing and publishing custom maps. The company uses ECS to power their entire batch processing architecture to collect and process over 100 million miles of sensor data per day that they use for powering their maps.
They also optimize their batch processing architecture on ECS using Spot Instances. The Mapbox platform powers over 5,000 apps and reaches more than 200 million users each month. Its backend runs on ECS, allowing it to serve more than 1.3 billion requests per day. To learn more about their recent migration to ECS, read their recent blog post, We Switched to Amazon ECS, and You Won’t Believe What Happened Next. Then, in their follow-up blog post, Caches to Cash, learn how they are running their entire platform on Spot Instances, allowing them to save upwards of 50–90% on their EC2 costs.

Conclusion

We hope that you are as excited as we are about running your containerized applications at scale and cost effectively using Spot Instances. For more information, see the following pages: If you have comments or suggestions, please comment below.