Disclaimer
I previously published this post on my work blog, https://reece.tech.
Overview
I like to upgrade our Kubernetes clusters quite frequently. Recently I started the upgrade journey to 1.16.
Some upgrades are rather uneventful and completed within a few minutes (we run 5 master nodes per cluster), however, this particular upgrade was different.
Preparation
The biggest change in 1.16 is that certain (and commonly used) API versions have been removed completely. Yes, there were mentions and deprecation warnings here and there in the past but now it’s for real.
For example, you will not be able to create or upgrade deployments or daemonsets created with the extensions/v1beta1 API version without changing your resource manifests.
We upgraded the API versions of Kubernetes internal services such as Grafana, Prometheus, the dashboards and our logging services prior to upgrading our clusters to 1.16.
API version changes
Here is a list of all changes (removed APIs in Kubernetes):
| Resource | Old API version | New API version | Removed in K8s version |
|---|---|---|---|
| Deployment | extensions/v1beta1, apps/v1beta2 | apps/v1 | 1.16 |
| DaemonSet | extensions/v1beta1, apps/v1beta2 | apps/v1 | 1.16 |
| Ingress | extensions/v1beta1 | networking.k8s.io/v1beta1 | 1.20 |
| NetworkPolicy | extensions/v1beta1 | networking.k8s.io/v1 | 1.16 |
| PodSecurityPolicy | extensions/v1beta1 | policy/v1beta1 | 1.16 |
| StatefulSet | extensions/v1beta1, apps/v1beta2 | apps/v1 | 1.16 |
| ReplicaSet | extensions/v1beta1, apps/v1beta2 | apps/v1 | 1.16 |
How do we find resources that need to be updated?
Resources that already exist in the cluster are served by the API server under both the old and the new API version - so upgrading to 1.16 will not break any existing configuration.
You can explicitly query the API server using kubectl to get resources under a specific API group - for example kubectl get deployment.extensions --all-namespaces will show you all deployments served via the extensions/* API group, and kubectl get deployment.apps --all-namespaces will show you the apps/v1 deployments. The output of the two may actually differ.
Just running kubectl get deployment --all-namespaces may show the extensions/* version only - this is especially confusing when you apply a resource with the latest API version and then check it with kubectl get x -o yaml, only to get the deprecated API version returned for the same resource.
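For reference, the queries we ran looked roughly like this; the fully qualified resource.version.group form also works if you want to pin a specific group and version:

```sh
# Deployments still served via the deprecated extensions API group
kubectl get deployments.extensions --all-namespaces

# Deployments served via the new apps group
kubectl get deployments.apps --all-namespaces

# Fully qualified form: resource.version.group
kubectl get deployments.v1.apps --all-namespaces
```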
What exactly needs to be changed?
Most changes are quite straightforward and only require the apiVersion: field to be updated - for example for kind: Ingress you only change apiVersion: extensions/v1beta1 to apiVersion: networking.k8s.io/v1beta1 and you’re done.
However, other resources (specifically DaemonSet and Deployment) may require additional fields - for example spec.selector. You can use kubectl convert to migrate existing templates to the new API version. However, if you are using Helm charts, this is not overly useful and templates need to be updated manually.
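As an illustration (the name, labels and image below are made up), a Deployment migrated to apps/v1 needs an explicit spec.selector that matches the pod template labels:

```yaml
# apps/v1 makes spec.selector mandatory; it must match the template labels
apiVersion: apps/v1            # was: extensions/v1beta1
kind: Deployment
metadata:
  name: example-app            # illustrative name
spec:
  replicas: 2
  selector:                    # required in apps/v1
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: nginx:1.17    # illustrative image
```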
Third party charts
Certain third-party charts (Ingress controllers, Prometheus and friends, and various operators, to name a few) needed to be updated as well.
Ongoing
We are in the (slow) process of upgrading hundreds of Ingress API versions in various git repositories over the next few months to be ready for version 1.20 (when the old/deprecated Ingress API version gets removed).
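The change itself is trivial per manifest - a hypothetical Ingress after the switch looks like this (only the apiVersion line differs from the extensions/v1beta1 variant; host and names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1beta1   # was: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: example-service
              servicePort: 80
```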
Other Pitfalls
Other than the API version change, we also experienced problems with Flannel and Grafana.
Flannel
The network overlay Flannel broke immediately after the API server upgrade with error message “runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized” (and all master nodes were marked as NotReady). Not ideal.
The cause was the Flannel configuration (both the file and the config map): /etc/cni/net.d/10-flannel.conflist did not include the cniVersion key. Adding "cniVersion": "0.2.0" fixed the issue.
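For reference, the fixed file looked roughly like this - the plugins section is the stock kube-flannel configuration, the important part is the added cniVersion key:

```json
{
  "name": "cbr0",
  "cniVersion": "0.2.0",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
```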
Prometheus / Grafana dashboards
The upgrade also broke some of our Grafana dashboards, as the container_cpu_usage_seconds_total metric no longer carries the container_name label (it was replaced by container) - so quite an easy fix.
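A generic example of the kind of query change this required (not one of our exact dashboard queries):

```promql
# before (pre-1.16 label name)
sum(rate(container_cpu_usage_seconds_total{container_name!=""}[5m]))

# after (1.16 renamed the label to "container")
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```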
RBAC
Some permissions may need to be created or updated for the new API versions. For example, our Tiller permissions did not allow networking.k8s.io/v1beta1 resources to be updated.
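A hypothetical rule that covers the new Ingress API group, sketched from memory rather than copied from our actual Tiller role:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tiller-networking          # illustrative name
rules:
  # Allow managing Ingress resources under the new API group
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```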
Pod restarts
After starting the new kubelet, all pods on the node in question restarted simultaneously, leading to very high node load. For our production clusters we therefore decided to rebuild each worker node with 1.16 sequentially over a couple of weeks, rather than upgrading the nodes in place in a big-bang approach.
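The per-node procedure was roughly the usual drain/replace/uncordon cycle (a sketch only; the node name and the provisioning step are placeholders for your own tooling):

```sh
# Cordon and drain the node so workloads move off gracefully
kubectl drain worker-01 --ignore-daemonsets --delete-local-data

# ...rebuild or replace the node with a 1.16 image via your provisioning
# tooling and wait for it to rejoin the cluster...

# Allow scheduling on the rebuilt node again
kubectl uncordon worker-01
```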
Prometheus helm charts
We had older Helm charts for Prometheus installed using deprecated API versions. A few weeks after the 1.16 change, the Prometheus Helm chart could not be upgraded easily using helm (even though the API versions had been corrected by then) - completing the upgrade involved manually editing the Helm release config map statuses.
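For reference, with Helm 2 the release history lives in ConfigMaps in the Tiller namespace; finding the relevant revisions looked something like this (label names as used by stock Tiller, release and revision names illustrative):

```sh
# List the stored revisions for the Prometheus release (Helm 2 / Tiller)
kubectl -n kube-system get configmaps -l OWNER=TILLER,NAME=prometheus

# Inspect/edit a stuck revision (e.g. its STATUS label) before retrying "helm upgrade"
kubectl -n kube-system edit configmap prometheus.v42
```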
Conclusion
This was one of the more complicated upgrades. Reading the changelog and the provided upgrade instructions definitely helps.