🌟 CloudProfs Issue #1: IaC starting points, Kubernetes snags

Read the full first issue of CloudProfs (July 30) in your browser here.

Top news stories

AWS announced it is to retire EC2-Classic Networking on August 15 2022. Users have just over a year to migrate, though it is worth noting that all AWS accounts created after December 4 2013 are VPC-only – customers could only have EC2-Classic by request. If you want to check that you don’t have any rogue instances, you can use various resources, from the EC2 Classic Resource Finder script to the AWSSupport-MigrateEC2ClassicToVPC runbook. This blog post goes through all the steps required.
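
If you want a quick programmatic spot-check as well, the Python sketch below is a minimal illustration (not a replacement for AWS's official tooling) that relies on the fact that instances running in EC2-Classic have no VpcId in the DescribeInstances response. The region is an assumption.

import boto3

# Minimal sketch: flag instances without a VpcId, which indicates
# EC2-Classic rather than a VPC. Use AWS's official EC2 Classic
# Resource Finder script for a thorough audit.
ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            if "VpcId" not in instance:  # no VpcId means EC2-Classic
                print("EC2-Classic instance found:", instance["InstanceId"])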

Google has launched Google Enterprise APIs, focusing more on cloud customer support and trying to ensure any changes will bring minimal disruption for users. Customers will now receive a ‘minimum of one year’s notice of an impending change, during which time the feature will continue to operate without issue.’ In the words of one pundit, it ‘makes life a lot harder for Google Cloud engineers, but better for customers – [it is] the right choice.’ You can take a look at the APIs which are subject to this new policy in the Google Cloud Marketplace. Full blog here.

A whitepaper from SUSE (pdf, download link) evaluating open source technologies found that 42% of 800 global respondents currently ran containers for production workloads. Of that number, almost three in five (57%) opt for Kubernetes. Two thirds (66%) prefer a commercially curated and supported distribution of an open source Kubernetes platform over building, maintaining and supporting their own. Perhaps unsurprisingly, the top feature users and planners look for in a Kubernetes platform is that it is 100% open source (36%), followed by multi-clustering and potential for edge deployments (34%).

Releases and product notes this week

July 29: Anthos clusters on bare metal 1.8.2 now available. Anthos clusters on AWS (aws-1.8.1-gke.1) now available; requires kubectl v1.17 or higher and Terraform v0.14.3 or higher.
July 28: Importing AWS CloudFormation stacks into a CloudFormation stack set. (See CloudFormation tutorial below!)
July 27: General availability of Azure Stream Analytics Tools for Visual Studio Code
July 27: Amazon SageMaker Pipelines now integrates with GitHub, Bitbucket and Jenkins

 

Deep dive: Four Kubernetes failure stories and what we can learn from them

Taken from this article – read it in full here.

1. Adevinta, an online portal for classified adverts, found that Kubernetes was taking an ‘eternity’ to talk to AWS. Taking the theory of ‘if the problem is not DNS, DNS will often lead you to the problem’, the team investigated.

“An analysis of tcpdump testified DNS to be innocent but there was something with the way requests were handled,” the company noted. Multiple queries on each request to the AWS Instance Metadata service were leading to spikes in resolution time. There were two calls; the first would query the IAM roles associated with the instance, while the second would request temporary credentials to access the instance. The problem was – the credentials would expire and the client would have to request a new one.

The expiration time is hardcoded in the AWS SDK and is included in the credentials themselves. When the team retrieved credentials from both containers and EC2 instances, the ones from the container had an expiration time of 15 minutes. The reason? The AWS Java SDK forcibly refreshes any credentials with less than 15 minutes left before expiry. This created a loop: each request refreshed the temporary credentials, explaining the two calls to the AWS API, and the huge latency as a result.

The fix: reconfiguring credentials with a longer expiration period.
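
Where you control the role-assumption call yourself, a longer lifetime can be requested explicitly. The Python sketch below is a hypothetical illustration of the general idea only – the role ARN and session name are placeholders, and Adevinta's actual fix was made in its SDK configuration rather than necessarily through this call.

import boto3

# Minimal sketch: ask STS for temporary credentials that live for an
# hour, rather than falling inside the SDK's 15-minute refresh window.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-role",  # placeholder
    RoleSessionName="example-session",                      # placeholder
    DurationSeconds=3600,
)["Credentials"]
print("Credentials expire at:", creds["Expiration"])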

2. Moonlight, a software developer community, ran into intermittent outage problems for five days back in January. Kubernetes monitoring reported that some nodes and pods were unresponsive. As Google Cloud was reporting networking service disruptions at the same time, the company put two and two together and assumed this was the root cause.

Yet the problem was not fully resolved, and the Moonlight website fell over a few days later. After Moonlight escalated the problem to Google Cloud's support engineering team, a pattern was identified in the nodes of Moonlight's managed Kubernetes cluster: when the nodes experienced periods of sustained CPU usage, the VM would experience a kernel panic and crash.

This was down to Kubernetes’ scheduler, which dynamically decides which pods should run on which nodes. Moonlight discovered a four-step death loop: a) the scheduler assigned multiple pods with high CPU use to the same node; b) the pods consumed 100% of CPU resources on the shared node; c) the node hit kernel panic and appeared unresponsive; d) the scheduler would reassign all crashed pods to a new node, repeating the process.

The fix: adding anti-affinity rules to all major deployments.
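
For illustration, here is a minimal sketch of that kind of rule using the official Kubernetes Python client; the app label is a placeholder, as Moonlight's actual manifests are not published in this summary.

from kubernetes import client

# Hypothetical sketch: require that no two pods carrying the same
# (placeholder) app label are scheduled onto the same node.
anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        required_during_scheduling_ignored_during_execution=[
            client.V1PodAffinityTerm(
                label_selector=client.V1LabelSelector(
                    match_labels={"app": "cpu-heavy-service"}  # placeholder
                ),
                topology_key="kubernetes.io/hostname",  # i.e. one pod per node
            )
        ]
    )
)
# The object is then attached to a deployment's pod template, e.g.
# deployment.spec.template.spec.affinity = anti_affinity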

3. Retailer Zalando hit a snag with builds on its custom continuous delivery platform (CDP). The team noticed Kubernetes builder pods were unable to find AWS IAM credentials.

“For a pod to get AWS IAM credentials, kube2iam needs the pod’s IP address, which is set by kubelet process on the associated node. So Kubelet was taking multiple minutes updating the status of the pod. In default configuration, Kubelet is slow responding to the requests to the API server. Somehow Zalando’s CDP Kubernetes cluster had only one available node to the builder pods. The rapid creation and deletion of pods on a single node was delimiting Kubelet.”

The fix: upscaling the cluster to include more than one node.

4. Ravelin, an ML-based fraud detection provider, had one last piece of the puzzle left for its Kubernetes migration on Google Kubernetes Engine (GKE): moving its API layer from the old VMs into the Kubernetes cluster. But the team hit a hitch: integration tests occasionally received 502 errors.

Digging deeper, the team realised the way the documentation tells you to remove a pod from a service or load balancer in K8s is a little removed from reality. The theory is:

1) Replication controller decides to remove a pod
2) Pod endpoint is removed from the service or load balancer, with new traffic no longer flowing to the pod
3) The pod’s pre-stop hook is invoked, or the pod receives a SIGTERM
4) The pod ‘gracefully shuts down’ and stops listening for new connections
5) The graceful shutdown completes, and the pod exits when all existing connections eventually become idle or terminate

What Ravelin found was that steps 2 and 3 happen at the same time. The pod may receive the SIGTERM quite some time before the change in endpoints is actioned at the ingress. In other words, the pod will still receive new connections, and it has to process them, otherwise the client will receive 500 errors.

The fix: The pod should exit only when its termination grace period expires, and it is killed with SIGKILL. Make sure this grace period is longer than the time taken to reprogram your load balancer.
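
As a hedged illustration of that advice, the Python sketch below keeps a server accepting connections after SIGTERM rather than exiting immediately, leaving Kubernetes to send SIGKILL once the grace period (set via terminationGracePeriodSeconds on the pod) expires. It is a minimal sketch, not Ravelin's actual code.

import http.server
import signal

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def on_sigterm(signum, frame):
    # Deliberately do NOT stop listening: new connections may still
    # arrive until the endpoint change reaches the load balancer.
    print("SIGTERM received; serving until SIGKILL ends the grace period")

signal.signal(signal.SIGTERM, on_sigterm)
http.server.HTTPServer(("", 8080), Handler).serve_forever()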

Other useful resources:

#1: Walmart Global Tech explains how to deploy an ML-based web application, containerised with Docker, to a local Kubernetes cluster (intermediate)

#2: Kubernetes deployment strategies: from rolling update, to recreate, to canary, to blue/green. Understanding the best choices for development/staging environments vs. production (beginner/intermediate)

#3: Cloud Native Computing Foundation on use cases for advanced pod scheduling in Kubernetes as well as best practices for implementing it in real-world situations (advanced)

 

IaC tutorials: Azure policies in Terraform, basics of CloudFormation

Rapidly deploy Azure policies in Terraform:

An article from Jonathan D’Aloia on the Adatis blog outlines how to import policies into your Terraform state and then assign them to an Azure resource group, securing your landing zone before you deploy any resources.

1) The first step is to create a policy definition resource without defining any parameters, so it looks like this:

resource "azurerm_policy_definition" "example" {
}

2) Initialise with the terraform init command. After initialising, you can import an Azure policy into your state by running the command below. (To find your definition ID in the Azure Portal, navigate to Policies, then select the policy you wish to import.)
terraform import azurerm_policy_definition.example <definition id>

3) Generate the Terraform code for the imported policy definition with the terraform show command, so that the definition can be assigned. The code can also be copied into the main.tf file for other tenants.

4) With the policy defined, we can assign it to the appropriate resource group through the following code, which first creates a resource group called ‘test-resources-blog-post’ in West Europe and then creates the policy assignment against it:

resource "azurerm_resource_group" "example" {
  name     = "test-resources-blog-post"
  location = "West Europe"
}

resource "azurerm_policy_assignment" "example" {
  name                 = "example-policy-assignment-blog-post"
  scope                = azurerm_resource_group.example.id
  policy_definition_id = azurerm_policy_definition.example.id
  description          = "Policy Assignment created via an Acceptance Test"
  display_name         = "My Example Policy Assignment-Blog-Post"
}

5) The assignment block references the policy definition imported in step 2 (via its definition ID) and applies the policy at the scope of the resource group.

6) Once this has been done, run terraform init, terraform plan and terraform apply, in that order.

Deploy your first AWS CloudFormation script:

This article from Ram Vegiraju explains how to deploy your first AWS CloudFormation script. The tutorial runs through an example of manually provisioning resources versus deploying a CloudFormation script to create a REST API and a serverless Lambda function on AWS. This article is designed for those who have at least some experience of AWS.

The tutorial explains how to declare and provision resources in a YAML file, which is then deployed to AWS to create a CloudFormation stack. Specifically, it uses a SAM (Serverless Application Model) template for this task.
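
For a plain CloudFormation template, that final deployment step can also be performed programmatically. The Python sketch below is a hypothetical illustration with placeholder names; note that SAM templates, which carry a Transform, are normally deployed via change sets or the SAM CLI rather than a direct CreateStack call.

import boto3

# Minimal sketch: create a stack from a local template file.
# The stack name and file name are placeholders, not from the article.
cfn = boto3.client("cloudformation")
with open("template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="my-first-stack",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # required if the stack creates IAM roles
)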

The full code can be seen here, while ideas for next steps in terms of how CloudFormation can help build CI/CD pipelines in AWS are available in this article.

Other useful resources:

#1: Learning about conditional logic in Terraform – where you will need to define parts of a Terraform configuration for a particular resource only when needed (intermediate/advanced)
#2: Bicep and Terraform compared: Deployment methods, syntax, and tooling (intermediate)

 

The week’s top podcasts

Studying for Azure certifications? Mike Pfeiffer and Tim Warner, Microsoft MVPs, talk about the new AZ-700 exam and how to become an Azure-certified network engineer (Cloudskills.fm)

Kelsey Hightower, principal developer advocate at Google Cloud Platform, and Mark Shuttleworth, CEO of Canonical, talk about the ‘quiet support’ of Canonical’s Ubuntu Linux distribution for Kubernetes – and why this is as it should be. (The New Stack Makers)

Juraj Masar, co-founder and CEO at Better Stack, talks DevOps MVPs and product motivations. “If you think about all the other monitoring tools – give me one example of a service which is so easy to set up? This comes down to: I’m not the kind of guy that enjoys tweaking my Prometheus config, right? What I like is actually running my business, building amazing software, shipping yet another feature. And I have my Prometheus in place so that the servers don’t crash. Essentially, better uptime is an insurance policy that you are purchasing so that you know that your software works.” (Software Engineering Daily)

Do devs need a product manager? Gaëlle Sharma, senior technical product manager at the New York Times, talks about what product managers do, what makes a good product manager, and why some tech companies’ engineering teams have gone without them. (Go Time)
