🌟 CloudProfs Issue #1: IaC starting points, Kubernetes snags

Read the full first issue of CloudProfs (July 30) in your browser here.

Top news stories

AWS announced it is to retire EC2-Classic Networking on August 15 2022. Users have just over a year to migrate, though it is worth noting that all AWS accounts created after December 4 2013 are VPC-only – customers could only have EC2-Classic by request. If you want to check that you don’t have any rogue instances, you can use various resources, from the EC2 Classic Resource Finder script to the AWSSupport-MigrateEC2ClassicToVPC runbook. This blog post goes through all the steps required.
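
If you want a quick programmatic spot-check as well, the Python sketch below is a minimal illustration (not a replacement for AWS's official tooling) that relies on the fact that instances running in EC2-Classic have no VpcId in the DescribeInstances response. The region is an assumption.

import boto3

# Minimal sketch: flag instances without a VpcId, which indicates
# EC2-Classic rather than a VPC. Use AWS's official EC2 Classic
# Resource Finder script for a thorough audit.
ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            if "VpcId" not in instance:  # no VpcId means EC2-Classic
                print("EC2-Classic instance found:", instance["InstanceId"])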

Google has launched Google Enterprise APIs, focusing more on cloud customer support and trying to ensure any changes will bring minimal disruption for users. Customers will now receive a ‘minimum of one year’s notice of an impending change, during which time the feature will continue to operate without issue.’ In the words of one pundit, it ‘makes life a lot harder for Google Cloud engineers, but better for customers – [it is] the right choice.’ You can take a look at the APIs which are subject to this new policy in the Google Cloud Marketplace. Full blog here.

A whitepaper from SUSE (pdf, download link) evaluating open source technologies found that 42% of 800 global respondents currently ran containers for production workloads. Of that number, almost three in five (57%) opt for Kubernetes. Two thirds (66%) prefer a commercially curated and supported distribution of an open source Kubernetes platform over building, maintaining and supporting their own. Perhaps unsurprisingly, the top feature users and planners look for in a Kubernetes platform is that it is 100% open source (36%), followed by multi-clustering and potential for edge deployments (34%).

Releases and product notes this week

July 29: Anthos clusters on bare metal 1.8.2 now available. Anthos clusters on AWS (aws-1.8.1-gke.1) now available; requires kubectl v1.17 or higher and Terraform v0.14.3 or higher.
July 28: Importing AWS CloudFormation stacks into a CloudFormation stack set. (See CloudFormation tutorial below!)
July 27: General availability of Azure Stream Analytics Tools for Visual Studio Code
July 27: Amazon SageMaker Pipelines now integrates with GitHub, Bitbucket and Jenkins

 

Deep dive: Four Kubernetes failure stories and what we can learn from them

Taken from this article – read it in full here.

1. Adevinta, an online portal for classified adverts, found that Kubernetes was taking an ‘eternity’ to talk to AWS. Taking the theory of ‘if the problem is not DNS, DNS will often lead you to the problem’, the team investigated.

“An analysis of tcpdump testified DNS to be innocent but there was something with the way requests were handled,” the company noted. Multiple queries on each request to the AWS Instance Metadata service were leading to spikes in resolution time. There were two calls; the first would query the IAM roles associated with the instance, while the second would request temporary credentials to access the instance. The problem was – the credentials would expire and the client would have to request a new one.

The expiration time is hardcoded in the AWS SDK and is included in the credentials themselves. When the team retrieved credentials from both containers and EC2 instances, the ones from the container had an expiration time of 15 minutes. The reason? The AWS Java SDK forcibly refreshes any credentials with less than 15 minutes left before expiry. This created a loop: each request refreshed the temporary credentials, explaining the two calls to the AWS API, and the huge latency as a result.

The fix: reconfiguring credentials with a longer expiration period.
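
Where you control the role-assumption call yourself, a longer lifetime can be requested explicitly. The Python sketch below is a hypothetical illustration of the general idea only – the role ARN and session name are placeholders, and Adevinta's actual fix was made in its SDK configuration rather than necessarily through this call.

import boto3

# Minimal sketch: ask STS for temporary credentials that live for an
# hour, rather than falling inside the SDK's 15-minute refresh window.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-role",  # placeholder
    RoleSessionName="example-session",                      # placeholder
    DurationSeconds=3600,
)["Credentials"]
print("Credentials expire at:", creds["Expiration"])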

2. Moonlight, a software developer community, ran into intermittent outage problems for five days back in January. Kubernetes monitoring reported that some nodes and pods were unresponsive. As Google Cloud was reporting networking service disruptions at the same time, the company put two and two together and assumed this was the root cause.

Yet the problem was not fully resolved, and the Moonlight website fell over a few days later. After Moonlight escalated the problem to Google Cloud's support engineering team, a pattern was identified in the nodes of Moonlight's managed Kubernetes cluster: when the nodes experienced periods of sustained CPU usage, the VM would experience a kernel panic and crash.

This was down to Kubernetes’ scheduler, which dynamically decides which pods should run on which nodes. Moonlight discovered a four-step death loop: a) the scheduler assigned multiple pods with high CPU use to the same node; b) the pods consumed 100% of CPU resources on the shared node; c) the node hit kernel panic and appeared unresponsive; d) the scheduler would reassign all crashed pods to a new node, repeating the process.

The fix: adding anti-affinity rules to all major deployments.
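
For illustration, here is a minimal sketch of that kind of rule using the official Kubernetes Python client; the app label is a placeholder, as Moonlight's actual manifests are not published in this summary.

from kubernetes import client

# Hypothetical sketch: require that no two pods carrying the same
# (placeholder) app label are scheduled onto the same node.
anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        required_during_scheduling_ignored_during_execution=[
            client.V1PodAffinityTerm(
                label_selector=client.V1LabelSelector(
                    match_labels={"app": "cpu-heavy-service"}  # placeholder
                ),
                topology_key="kubernetes.io/hostname",  # i.e. one pod per node
            )
        ]
    )
)
# The object is then attached to a deployment's pod template, e.g.
# deployment.spec.template.spec.affinity = anti_affinity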

3. Retailer Zalando hit a snag with builds on its custom continuous delivery platform (CDP). The team noticed Kubernetes builder pods were unable to find AWS IAM credentials.

“For a pod to get AWS IAM credentials, kube2iam needs the pod’s IP address, which is set by kubelet process on the associated node. So Kubelet was taking multiple minutes updating the status of the pod. In default configuration, Kubelet is slow responding to the requests to the API server. Somehow Zalando’s CDP Kubernetes cluster had only one available node to the builder pods. The rapid creation and deletion of pods on a single node was delimiting Kubelet.”

The fix: upscaling the cluster to include more than one node.

4. Ravelin, an ML-based fraud detection provider, had one last piece of the puzzle left for its Kubernetes migration on Google Kubernetes Engine (GKE): moving its API layer from the old VMs into the Kubernetes cluster. But the team hit a hitch: integration tests occasionally received 502 errors.

Digging deeper, the team realised the way the documentation tells you to remove a pod from a service or load balancer in K8s is a little removed from reality. The theory is:

1) Replication controller decides to remove a pod
2) Pod endpoint is removed from the service or load balancer, with new traffic no longer flowing to the pod
3) The pod’s pre-stop hook is invoked, or the pod receives a SIGTERM
4) The pod ‘gracefully shuts down’ and stops listening for new connections
5) The graceful shutdown completes, and the pod exits when all existing connections eventually become idle or terminate

What Ravelin found was that steps 2 and 3 happen at the same time. The pod may receive the SIGTERM quite some time before the change in endpoints is actioned at the ingress. In other words, the pod will still receive new connections, and it has to process them, otherwise the client will receive 500 errors.

The fix: The pod should exit only when its termination grace period expires, and it is killed with SIGKILL. Make sure this grace period is longer than the time taken to reprogram your load balancer.
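
As a hedged illustration of that advice, the Python sketch below keeps a server accepting connections after SIGTERM rather than exiting immediately, leaving Kubernetes to send SIGKILL once the grace period (set via terminationGracePeriodSeconds on the pod) expires. It is a minimal sketch, not Ravelin's actual code.

import http.server
import signal

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def on_sigterm(signum, frame):
    # Deliberately do NOT stop listening: new connections may still
    # arrive until the endpoint change reaches the load balancer.
    print("SIGTERM received; serving until SIGKILL ends the grace period")

signal.signal(signal.SIGTERM, on_sigterm)
http.server.HTTPServer(("", 8080), Handler).serve_forever()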

Other useful resources:

#1: Walmart Global Tech explains how to deploy an ML-based web application, containerised with Docker, to a local Kubernetes cluster (intermediate)

#2: Kubernetes deployment strategies: from rolling update, to recreate, to canary, to blue/green. Understanding the best choices for development/staging environments vs. production (beginner/intermediate)

#3: Cloud Native Computing Foundation on use cases for advanced pod scheduling in Kubernetes as well as best practices for implementing it in real-world situations (advanced)

 

IaC tutorials: Azure policies in Terraform, basics of CloudFormation

Rapidly deploy Azure policies in Terraform:

An article from Jonathan D’Aloia on the Adatis blog outlines how to import policies into your Terraform state and then assign them to an Azure resource group, securing your landing zone before you deploy any resources.

1) The first step is to create a policy definition resource without defining any parameters, so it looks like this:

resource "azurerm_policy_definition" "example" {
}

2) Initialise with the terraform init command. After initialising, you can import an Azure policy into your state by running the command below. (To find your definition ID in the Azure Portal, navigate to Policies, then select the policy you wish to import.)
terraform import azurerm_policy_definition.example <definition id>

3) Generate the Terraform code for the imported policy definition with the terraform show command, so that the definition can be assigned. The code can also be copied into the main.tf file for other tenants.

4) With the policy defined, we can assign it to the appropriate resource group through the following code, which first creates a resource group called ‘test-resources-blog-post’ in West Europe and then creates the policy assignment against it:

resource "azurerm_resource_group" "example" {
  name     = "test-resources-blog-post"
  location = "West Europe"
}

resource "azurerm_policy_assignment" "example" {
  name                 = "example-policy-assignment-blog-post"
  scope                = azurerm_resource_group.example.id
  policy_definition_id = azurerm_policy_definition.example.id
  description          = "Policy Assignment created via an Acceptance Test"
  display_name         = "My Example Policy Assignment-Blog-Post"
}

5) The assignment block references the policy definition imported in step 2 (via its definition ID) and applies the policy at the scope of the resource group.

6) Once this has been done, run terraform init, terraform plan and terraform apply, in that order.

Deploy your first AWS CloudFormation script:

This article from Ram Vegiraju explains how to deploy your first AWS CloudFormation script. The tutorial runs through an example of manually provisioning resources versus deploying a CloudFormation script to create a REST API and a serverless Lambda function on AWS. This article is designed for those who have at least some experience of AWS.

The tutorial explains how to declare and provision resources in a YAML file, which is then deployed to AWS to create a CloudFormation stack. Specifically, it uses a SAM (Serverless Application Model) template for this task.
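
For a plain CloudFormation template, that final deployment step can also be performed programmatically. The Python sketch below is a hypothetical illustration with placeholder names; note that SAM templates, which carry a Transform, are normally deployed via change sets or the SAM CLI rather than a direct CreateStack call.

import boto3

# Minimal sketch: create a stack from a local template file.
# The stack name and file name are placeholders, not from the article.
cfn = boto3.client("cloudformation")
with open("template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="my-first-stack",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # required if the stack creates IAM roles
)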

The full code can be seen here, while ideas for next steps in terms of how CloudFormation can help build CI/CD pipelines in AWS are available in this article.

Other useful resources:

#1: Learning about conditional logic in Terraform – where you will need to define parts of a Terraform configuration for a particular resource only when needed (intermediate/advanced)
#2: Bicep and Terraform compared: Deployment methods, syntax, and tooling (intermediate)

 

The week’s top podcasts

Studying for Azure certifications? Mike Pfeiffer and Tim Warner, Microsoft MVPs, talk about the new AZ-700 exam and how to become an Azure-certified network engineer (Cloudskills.fm)

Kelsey Hightower, principal developer advocate at Google Cloud Platform, and Mark Shuttleworth, CEO of Canonical, talk about the ‘quiet support’ of Canonical’s Ubuntu Linux distribution for Kubernetes – and why this is as it should be. (The New Stack Makers)

Juraj Masar, co-founder and CEO at Better Stack, talks DevOps MVPs and product motivations. “If you think about all the other monitoring tools – give me one example of a service which is so easy to set up? This comes down to: I’m not the kind of guy that enjoys tweaking my Prometheus config, right? What I like is actually running my business, building amazing software, shipping yet another feature. And I have my Prometheus in place so that the servers don’t crash. Essentially, better uptime is an insurance policy that you are purchasing so that you know that your software works.” (Software Engineering Daily)

Do devs need a product manager? Gaëlle Sharma, senior technical product manager at the New York Times, talks about what product managers do, what makes a good product manager, and why some tech companies’ engineering teams have gone without them. (Go Time)
