<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GKE archives - Geko Cloud</title>
	<atom:link href="https://geko.cloud/en/tag/gke-en/feed/" rel="self" type="application/rss+xml" />
	<link></link>
	<description>Cloud and DevOps consulting services</description>
	<lastBuildDate>Wed, 03 Nov 2021 10:03:27 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.7</generator>

<image>
	<url>https://geko.cloud/wp-content/uploads/2021/08/cropped-geko-fav-150x150.png</url>
	<title>GKE archives - Geko Cloud</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>GCP/GKE – New logging system</title>
		<link>https://geko.cloud/en/gcp-gke-new-logging-system/</link>
					<comments>https://geko.cloud/en/gcp-gke-new-logging-system/#respond</comments>
		
		<dc:creator><![CDATA[Geko Cloud]]></dc:creator>
		<pubDate>Mon, 07 Sep 2020 05:27:04 +0000</pubDate>
				<category><![CDATA[News]]></category>
		<category><![CDATA[GKE]]></category>
		<category><![CDATA[Google Cloud]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<guid isPermaLink="false">https://geko2.factoryfy.com/gcp-gke-new-logging-system/</guid>

					<description><![CDATA[<p>Introduction It&#8217;s very likely you have been running a GKE cluster at version v1.15 for many months without issues and then -suddenly- the logs stopped being received. You are using the default logging system, so you check the GCP status website but everything is running fine. You also know you [&#8230;]</p>
<p>The post <a href="https://geko.cloud/en/gcp-gke-new-logging-system/">GCP/GKE – New logging system</a> appeared first on <a href="https://geko.cloud/en/">Geko Cloud</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Introduction</h2>
<p>It&#8217;s very likely you have been running a <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview">GKE</a> cluster at version v1.15 for many months without issues when, suddenly, the logs stopped being received. You are using the default logging system, so you check the <a href="https://status.cloud.google.com/">GCP status website</a>, but everything is running fine. You also know you have not modified anything on the cluster before the logs stopped, yet you suspect the problem has to be on your side (in our case, because it was not happening on other projects we manage).</p>
<p>If this story sounds familiar, keep reading: this article covers everything from how to detect the problem to the solution that will get your cluster logs flowing again.</p>
<h2>1. Symptoms</h2>
<ul style="color: #3a3a3a; font-weight: 400;">
<li>The very first symptom you may notice is the sudden, complete absence of logs in <a href="https://en.wikipedia.org/wiki/Stackdriver">Stackdriver</a>, which should be arriving, as usual, from your cluster&#8217;s pods.</li>
<li>Besides that, you will find you no longer have any <a href="https://en.wikipedia.org/wiki/Fluentd">Fluentd</a> agents in your cluster.
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ kubectl get daemonset -n kube-system | grep -i fluentd

</pre>
</div>
</li>
<li>On the other hand, all the nodes of your cluster have the label <strong>beta.kubernetes.io/fluentd-ds-ready=true</strong>, which the Fluentd agents require in order to know which nodes they should be deployed to.
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ kubectl get nodes
NAME                           STATUS   ROLES    AGE     VERSION
gke-cluster-node-93c020-7rjq   Ready    none     14h     v1.15.11-gke.15
gke-cluster-node-93c020-9w22   Ready    none     23h     v1.15.11-gke.15
gke-cluster-node-93c020-jdbt   Ready    none     47h     v1.15.11-gke.15
gke-cluster-node-93c020-jvcl   Ready    none     3h17m   v1.15.11-gke.15

$ kubectl describe node gke-cluster-node-93c020-7rjq | grep -i fluentd-ds-ready
       beta.kubernetes.io/fluentd-ds-ready=true

$ # Do the same for the remaining nodes to ensure all of them are properly labeled</pre>
</div>
</li>
<li>Your GKE cluster is currently at version v1.15 or greater.</li>
</ul>
<h2>2. Diagnosis</h2>
<p>What is happening is that GCP is forcibly deprecating the legacy monitoring/logging system. The replacement is called <a href="https://cloud.google.com/stackdriver/docs/solutions/gke">Cloud Operations for GKE</a>, which (for our use case) does basically the same. That said, keep in mind there are a few differences you should watch out for (such as metric name changes), all of which are listed in the <a href="https://cloud.google.com/stackdriver/docs/solutions/gke/migration#what-is-changing">migration guide</a>.</p>
<h2>3. Treatment</h2>
<h3>Label the cluster nodes</h3>
<p>If you found in the previous steps that your cluster nodes were not labeled, do it now!</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ kubectl label node $NODE_NAME beta.kubernetes.io/fluentd-ds-ready=true</pre>
</div>
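<p>If several nodes are missing the label, a small loop saves some typing. A minimal sketch, assuming the node names come straight from <em>kubectl</em> (the <em>--overwrite</em> flag just makes the command safe to re-run):</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ for NODE in $(kubectl get nodes -o name); do
    kubectl label "$NODE" beta.kubernetes.io/fluentd-ds-ready=true --overwrite
  done</pre>
</div>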
<h3>Get your cluster&#8217;s name</h3>
<p>List the available clusters.</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ gcloud container clusters list
NAME        LOCATION        MASTER_VERSION  MASTER_IP    MACHINE_TYPE   NODE_VERSION    NUM_NODES  STATUS
my-cluster  europe-east1-a  1.15.12-gke.2   34.35.36.37  n1-standard-2  1.15.11-gke.15  4          RUNNING
</pre>
</div>
<h3>Disable the logging service for your cluster</h3>
<p>Update your cluster&#8217;s config in order to set the logging service to <em>none.</em></p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ gcloud container clusters update my-cluster --logging-service none</pre>
</div>
<h3>Re-enable the logging service for your cluster</h3>
<p>Update your cluster&#8217;s config to set the logging service back to the default one.</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ gcloud container clusters update my-cluster --logging-service logging.googleapis.com</pre>
</div>
<p><strong>WARNING:</strong> You will get an error telling you that enabling it is not possible due to the deprecation. This is how we finally realized where the problem was. Just move on, you are on the right track.</p>
<h3>Ensure the migration status is the expected one</h3>
<p>Open the <a href="https://console.cloud.google.com/monitoring">Monitoring page in the GCP Console</a> and then go to Settings. You will find a tab named &#8220;Kubernetes Migration Status&#8221;, which should look as follows.</p>
<p><img fetchpriority="high" decoding="async" class="wp-image-2435 size-full alignnone" src="https://geko2.factoryfy.com/wp-content/uploads/kubernetes-migration-dashboard.png" alt="" width="871" height="554" /></p>
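<p>You can also check which logging service the cluster is currently using from the CLI. A minimal sketch (the cluster name and zone below are placeholders for your own values):</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ gcloud container clusters describe my-cluster \
    --zone europe-east1-a \
    --format='value(loggingService)'
# The legacy service prints "logging.googleapis.com";
# the new one prints "logging.googleapis.com/kubernetes"</pre>
</div>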
<h3>Fully enable the logging &amp; monitoring service for your cluster</h3>
<p>Open the <a href="https://console.cloud.google.com/kubernetes/list">GKE clusters list</a> and click on the name of the cluster where you want to set up the new logging system. Once you see the cluster config, click on the EDIT button as shown below.</p>
<p><img decoding="async" class="size-full wp-image-2437 alignnone" src="https://geko2.factoryfy.com/wp-content/uploads/edit-cluster.png" alt="" width="731" height="213" /></p>
<p>Scroll down until you find the <strong>Kubernetes Engine Monitoring</strong> setting and then select the <strong>System and workload logging and monitoring</strong> option.</p>
<p><img decoding="async" class="alignnone size-full wp-image-2439" src="https://geko2.factoryfy.com/wp-content/uploads/monitoring-dropdown.png" alt="" width="463" height="131" /></p>
<p><strong>IMPORTANT: </strong>Don&#8217;t forget to save your changes at the bottom.</p>
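<p>Alternatively, the same setting can likely be enabled from the CLI. A sketch assuming the 1.15-era gcloud SDK (the flag may be renamed in later releases, so verify it against your gcloud version):</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ gcloud container clusters update my-cluster \
    --enable-stackdriver-kubernetes</pre>
</div>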
<h5>And that&#8217;s it! Your cluster will receive logs again as expected!</h5>
<p>&nbsp;</p>
<h2>Conclusion</h2>
<p>As you probably know, Google Cloud Platform evolves constantly, so you will often find changes breaking your already-working systems. You have two options at this point: on the one hand, you could keep yourself up to date with all the future changes and deprecations. On the other hand, you could simply fix issues as soon as they appear, but keep in mind you will always be blindly fighting against them.</p>
<p>There is a third option (which should be the first one): regularly visiting <a href="https://geko.cloud/en/blog/">Geko&#8217;s blog</a> to check whether we have already dealt with the problem you are encountering. The Geko team will always be glad to see you here <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><a href="https://geko.cloud/en/contact/" target="_blank" rel="noopener noreferrer">Contact us for further information!</a></p>
<p>The post <a href="https://geko.cloud/en/gcp-gke-new-logging-system/">GCP/GKE – New logging system</a> appeared first on <a href="https://geko.cloud/en/">Geko Cloud</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://geko.cloud/en/gcp-gke-new-logging-system/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Upgrade GKE public-cluster’s Terraform module</title>
		<link>https://geko.cloud/en/upgrade-gke-public-clusters-terraform-module/</link>
					<comments>https://geko.cloud/en/upgrade-gke-public-clusters-terraform-module/#respond</comments>
		
		<dc:creator><![CDATA[Jose Luis Sánchez]]></dc:creator>
		<pubDate>Wed, 13 May 2020 06:24:40 +0000</pubDate>
				<category><![CDATA[News]]></category>
		<category><![CDATA[GKE]]></category>
		<category><![CDATA[Google Cloud]]></category>
		<category><![CDATA[Terraform]]></category>
		<guid isPermaLink="false">https://geko2.factoryfy.com/upgrade-gke-public-clusters-terraform-module/</guid>

					<description><![CDATA[<p>Introduction From time to time Google introduces new features and changes that sometimes also force the Terraform modules to be upgraded. That was our case at Geko, where we were using the GKE module for public-cluster deployment&#38;management at version 5.x. A few days ago, when we planned to update some parameters it turned out that Google [&#8230;]</p>
<p>The post <a href="https://geko.cloud/en/upgrade-gke-public-clusters-terraform-module/">Upgrade GKE public-cluster’s Terraform module</a> appeared first on <a href="https://geko.cloud/en/">Geko Cloud</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h3>Introduction</h3>
<p>From time to time Google introduces new features and changes that sometimes also force the <a href="https://www.terraform.io/docs/configuration/modules.html">Terraform modules</a> to be upgraded. That was our case at <a href="https://geko2.factoryfy.com/">Geko</a>, where we were using the <a href="https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/5.1.1/submodules/beta-public-cluster">GKE module for public-cluster deployment &amp; management at version 5.x</a>. A few days ago, when we planned to update some parameters, it turned out that Google had removed support for the Kubernetes dashboard. It was completely deprecated and the module was failing because of it, so we were forced to upgrade the module to meet the new conditions. There were up to 3 major version upgrades available, so we decided to go for it and use the latest one. However, the upgrade alone was not enough, as it also required handling inconsistencies in the Terraform state.</p>
<p>The aim of this lab is to learn how to <strong>upgrade</strong> the official <strong>Terraform module</strong> used to deploy and manage a <strong>public GKE cluster</strong>. We will especially deal with the module&#8217;s (<a href="https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/8.1.0/submodules/beta-public-cluster">kubernetes-engine.beta-public-cluster</a>) breaking changes, and we will restore the consistent state we had before the failure that preceded the upgrade.</p>
<p><strong>Estimated time to finish this lab</strong>: ~20 minutes</p>
<h3>1. Remove the previous resources</h3>
<p><strong>It&#8217;s strongly encouraged to perform a <em>tfstate</em> file backup before continuing!</strong></p>
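<p>Since the state may live in a remote backend, pulling it to a local file is a simple way to take that backup. A minimal sketch (the file name is just an example):</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ terraform state pull &gt; "tfstate-backup-$(date +%Y%m%d-%H%M%S).json"</pre>
</div>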
<p>It&#8217;s especially important to remove all the conflicting resources from the Terraform state, since they are bound to each other through dependencies. The goal here is to remove any deprecated binding before importing the resources again from what is currently deployed.</p>
<p>The main components of a Kubernetes cluster are the networks (and subnetworks), the node pool and the cluster itself. Let&#8217;s focus on them.</p>
<div class="wp-block-codemirror-blocks code-block">
<pre class="CodeMirror" data-setting="{">terraform state rm module.gke.google_container_cluster.primary
terraform state rm module.gke.google_container_node_pool.pools[0]
terraform state rm module.vpc.google_compute_network.network
terraform state rm module.vpc.google_compute_subnetwork.subnetwork[0]
terraform state rm module.vpc.google_compute_subnetwork.subnetwork[1]</pre>
</div>
<h3>2. Upgrade versions</h3>
<p>Once the previous states have been removed, the next step is to pin the required modules to the current latest version. For the GKE module the latest is now 8.1.0, but minor upgrades will be adopted automatically (&#8220;~&gt;&#8221;).</p>
<h5>Upgrade the GKE cluster module</h5>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{"> module "gke" {
   source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-public-cluster"
-  version = "~&gt; 5.0"
+  version = "~&gt; 8.1"
</pre>
</div>
<h5>Upgrade the VPC module</h5>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{"> module "vpc" {
-  source  = "github.com/terraform-google-modules/terraform-google-network?ref=v1.1.0"
+  source  = "github.com/terraform-google-modules/terraform-google-network?ref=v2.3.0"</pre>
</div>
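<p>After changing the version constraints, Terraform has to fetch the new module sources before it can plan. A minimal sketch:</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ terraform init -upgrade</pre>
</div>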
<h5>Check the new resources</h5>
<p>To find out whether the new resources have undergone a name change (due to the module upgrades), running a <a href="https://www.terraform.io/docs/commands/plan.html"><strong>terraform plan</strong></a> is strongly encouraged.</p>
<p>In this case we found that the modules&#8217; internal hierarchy and the list indexes have changed.</p>
<div class="wp-block-codemirror-blocks code-block">
<pre class="CodeMirror" data-setting="{">  module.gke.google_container_cluster.primary
  
<b>-</b> module.gke.google_container_node_pool.pools[0]
<b>+</b> module.gke.google_container_node_pool.pools["default-node-pool"]

  module.vpc.google_compute_network.network
  
<b>-</b> module.vpc.google_compute_subnetwork.subnetwork[0]
<b>+</b> module.vpc.module.subnets.google_compute_subnetwork.subnetwork["southamerica-east1/my-cluster-public"]

<b>-</b> module.vpc.google_compute_subnetwork.subnetwork[1]
<b>+</b> module.vpc.module.subnets.google_compute_subnetwork.subnetwork["southamerica-east1/my-cluster-private"]</pre>
</div>
<h3>3. Import fresh resources</h3>
<p>Keep in mind that the zone/region depends on your kind of cluster. If it&#8217;s zonal, you must use the master zone (e.g. <em>southamerica-east1-a</em>); if it&#8217;s regional, you must use the region (e.g. <em>southamerica-east1</em>). The following example assumes a regional cluster located at <em>southamerica-east1</em>, in the project &#8220;<strong>my-project</strong>&#8220;, with a cluster named &#8220;<strong>my-cluster</strong>&#8220;. The network names were set according to the cluster&#8217;s name, adding the suffixes &#8220;private&#8221; and &#8220;public&#8221; to the subnets to differentiate them.</p>
<p><strong>Note also the new module hierarchy and indexing.</strong></p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{"># Global vars
REGION="southamerica-east1"
PROJECT="my-project"
CLUSTER="my-cluster"

# Cluster
CLUSTER_LOCAL="module.gke.google_container_cluster.primary"
CLUSTER_REMOTE="${PROJECT}/${REGION}/${CLUSTER}"
terraform import "$CLUSTER_LOCAL" "$CLUSTER_REMOTE"

# Node pool (single quotes keep the brackets and inner quotes literal)
POOL_LOCAL='module.gke.google_container_node_pool.pools["default-node-pool"]'
POOL_REMOTE="${CLUSTER_REMOTE}/default-node-pool"
terraform import "$POOL_LOCAL" "$POOL_REMOTE"

# Subnetworks
BASE_SUBNET_LOCAL="module.vpc.module.subnets.google_compute_subnetwork.subnetwork"

## Public (escaped quotes so the variables still expand)
PUBLIC_SUBNET_LOCAL="${BASE_SUBNET_LOCAL}[\"${REGION}/${CLUSTER}-public\"]"
PUBLIC_SUBNET_REMOTE="${CLUSTER_REMOTE}-public"
terraform import "$PUBLIC_SUBNET_LOCAL" "$PUBLIC_SUBNET_REMOTE"

## Private
PRIVATE_SUBNET_LOCAL="${BASE_SUBNET_LOCAL}[\"${REGION}/${CLUSTER}-private\"]"
PRIVATE_SUBNET_REMOTE="${CLUSTER_REMOTE}-private"
terraform import "$PRIVATE_SUBNET_LOCAL" "$PRIVATE_SUBNET_REMOTE"

# Network
NETWORK_LOCAL="module.vpc.module.vpc.google_compute_network.network"
NETWORK_REMOTE="${PROJECT}/${CLUSTER}"
terraform import "$NETWORK_LOCAL" "$NETWORK_REMOTE"</pre>
</div>
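<p>Once everything is imported, a plan is the quickest sanity check that the state matches the deployed infrastructure. A minimal sketch (<em>-detailed-exitcode</em> makes the result scriptable):</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{">$ terraform plan -detailed-exitcode
$ echo $?   # 0 = state in sync, 2 = changes pending, 1 = error</pre>
</div>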
<h3>4. Update parameters</h3>
<p>It&#8217;s very likely that after a Terraform plan you will find the <strong><em>google_container_cluster</em></strong> resource still needs to be updated due to a <em>subnetwork</em> parameter change. The new subnet keys have changed the index order. Just edit your GKE module to replace the <em>subnetwork</em> parameter as below.</p>
<div class="wp-block-codemirror-blocks code-block ">
<pre class="CodeMirror" data-setting="{"><b>-</b> subnetwork = module.vpc.subnets_names[<b>0</b>]
<b>+</b> subnetwork = module.vpc.subnets_names[<b>1</b>]</pre>
</div>
<h3>Conclusion</h3>
<p>As you may have read above, when relying on third parties it can happen that a breaking change is introduced and you find yourself in trouble getting the service back again. Besides this, the solution may introduce collateral damage that requires additional sub-solutions. In this particular case with Terraform, dealing with inconsistent states is neither common nor recommended, but it turns out to be the only method in your tool-set to solve them.</p>
<hr />
<p>I hope you&#8217;ve enjoyed this post and I encourage you to <a href="https://geko.cloud/en/blog/">check our blog for other posts</a> that you might find helpful. <a href="https://geko.cloud/en/contact/">Do not hesitate to contact us</a> if you would like us to help you on your projects.</p>
<p>See you on the next post!</p>
<p>The post <a href="https://geko.cloud/en/upgrade-gke-public-clusters-terraform-module/">Upgrade GKE public-cluster’s Terraform module</a> appeared first on <a href="https://geko.cloud/en/">Geko Cloud</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://geko.cloud/en/upgrade-gke-public-clusters-terraform-module/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
