Kubernetes on GCP, Stackdriver logging after update to v1.15

I have a Kubernetes cluster on GCP that hosts a Flask application and some more services.
Before upgrading the master node to version 1.15 (it was on 1.14.x), I saw every log from the Flask application under Stackdriver's "GKE Container" resource; now I don't get any logs.
Searching through the release notes I noticed that from 1.15 they:
disabled stackdriver logging agent to prevent node startup failures
I'm not entirely sure that's the reason, but the logging definitely stopped after upgrading the master and node versions to 1.15; there has been no code change in the application core.
My question is how can I reactivate the logs I saw before?

I found the solution: as stated in the release notes, the Stackdriver agent is indeed disabled by default from 1.15.
To activate it again, edit the cluster following these instructions and set "System and workload logging and monitoring" under "Stackdriver Kubernetes Engine Monitoring".
After that I could no longer use the legacy Stackdriver Monitoring, and I found my logs were no longer under the resource type "GKE Container" but under "Kubernetes Container".
I also had to update every log-based metric that had a filter on resource.type="container", changing it to resource.type="k8s_container"
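The metric update is a literal substring swap in each filter. A minimal sketch (the metric name and container name are hypothetical; the gcloud call is shown as a comment since it needs an authenticated project):

```shell
# Old filter used by a (hypothetical) log-based metric:
old='resource.type="container" AND resource.labels.container_name="flask-app"'

# Swap the legacy resource type for the new Kubernetes one:
new=$(printf '%s' "$old" | sed 's/resource.type="container"/resource.type="k8s_container"/')
echo "$new"

# Then apply it to the metric, e.g.:
#   gcloud logging metrics update flask-errors --log-filter="$new"
```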

Related

Kubernetes cluster in Azure (AKS) upgrade 1.24.9 in fail state with pods facing intermittent DNS issues

I upgraded AKS from 1.23.5 to 1.24.9 using the Azure portal. That part finished properly (or so I assumed) based on the status shown in the portal.
I then continued from 1.24.9 to 1.25.5. This time it only partly worked: the Azure portal shows 1.25.5 for the node pool with provisioning state "Failed", while the nodes are still on 1.24.9.
I found that some nodes were having trouble connecting to the network, both externally (e.g. GitHub) and to internal Services. For some reason the issue is intermittent: on the same node it sometimes works and sometimes doesn't. (I had pods running Python on each node.)
Each node has the cluster DNS IP in resolv.conf.
One of the questions on SO had a hint about ingress-nginx compatibility. I found that I had an incompatible version, so I upgraded it to 1.6.4, which is compatible with both 1.24 and 1.25.
But the network issue still persists. I am not sure whether this is because of the AKS provisioning state of "Failed". The connectivity check for this cluster in the Azure portal succeeds; the only issue reported in the portal diagnostics is the node pool provisioning state.
Is there anything I need to do after the ingress-nginx upgrade for all nodes/pods to pick up the new config?
Or is there a way to re-trigger the upgrade? I am not sure why that would help, but I am assuming it may reset the configs on all nodes.
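A sketch of follow-up steps for this situation. All resource names are placeholders, and the cluster commands are commented out because they need a live AKS cluster with az/kubectl credentials; the runnable part only checks a (hypothetical) pod resolv.conf against the AKS default cluster DNS IP:

```shell
# Re-run the node pool upgrade so a "Failed" provisioning state gets reconciled:
#   az aks nodepool upgrade --resource-group my-rg --cluster-name my-aks \
#       --name nodepool1 --kubernetes-version 1.25.5
#
# Restart ingress-nginx so every pod picks up the upgraded controller:
#   kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller
#
# Spot-check DNS resolution from a throwaway pod on an affected node:
#   kubectl run dnstest --rm -it --image=busybox:1.36 --restart=Never -- \
#       nslookup kubernetes.default.svc.cluster.local

# Sanity check on resolv.conf: the nameserver inside a pod should be the
# cluster DNS service IP (10.0.0.10 is the AKS default). Sample content:
resolv='nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5'
ns=$(printf '%s\n' "$resolv" | awk '/^nameserver/ {print $2; exit}')
echo "$ns"
```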

Kubernetes Engine: Node keeps getting unhealthy and rebooted for no apparent reason

My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before), and if I recall correctly this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason or where I look for possible causes would be appreciated.
Screenshots of the node event list were attached (happy to provide other logs; I couldn't find anything meaningful in Stackdriver Logging yet).
Posting this answer as a community wiki to give some troubleshooting tips/steps as the underlying issue wasn't found.
Feel free to expand it.
After the steps below, the issue with the node rebooting was no longer present:
Updating the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes.
As pointed out by user @aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
Above comment could be a good starting point for troubleshooting issues with the cluster.
Options to troubleshoot the cluster pointed in comment:
$ kubectl describe node, focusing on the output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could also be beneficial to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
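As a small sketch of what to look for in the describe output, this parses a hypothetical Conditions table from kubectl describe node and flags anything unhealthy; on a real cluster you would feed in kubectl describe node <name> instead of the sample text:

```shell
# Hypothetical excerpt of the Conditions section from `kubectl describe node`:
conditions='MemoryPressure   False
DiskPressure     False
PIDPressure      False
Ready            True'

# A healthy node reports Ready=True and all pressure conditions False;
# collect any condition that deviates from that.
bad=$(printf '%s\n' "$conditions" | awk '
  $1 == "Ready" && $2 != "True"  {print $1}
  $1 != "Ready" && $2 != "False" {print $1}')
echo "unhealthy conditions: ${bad:-none}"
```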
Adding to above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official docs of Istio are stating:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
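To make the quoted figure concrete: at ~0.5 vCPU per 1000 requests per second, sidecar CPU scales linearly with traffic, which matters on 1-vCPU nodes like the ones in this cluster. A back-of-the-envelope calculation in millicores (the 3000 rps figure is just an example):

```shell
rps=3000
# 0.5 vCPU per 1000 rps == 500 millicores per 1000 rps == rps/2 millicores
sidecar_millicores=$(( rps / 2 ))
echo "${sidecar_millicores}m per proxy at ${rps} rps"
```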
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster

How to go about logging in GKE without using Stackdriver

We are unable to grab logs from containers running in our GKE cluster if Stackdriver is disabled on GCP. I understand that it is proxying stderr/stdout, but it seems rather heavy-handed to block these outputs when Stackdriver is disabled.
How does one get an ELK stack going on GKE without being billed for Stackdriver, i.e. disabling it entirely? Or is it so much a part of GKE that this is not doable?
From the article linked on a similar question regarding GCP:
"Kubernetes doesn’t specify a logging agent, but two optional logging agents are packaged with the Kubernetes release: Stackdriver Logging for use with Google Cloud Platform, and Elasticsearch. You can find more information and instructions in the dedicated documents. Both use fluentd with custom configuration as an agent on the node." (https://kubernetes.io/docs/concepts/cluster-administration/logging/#exposing-logs-directly-from-the-application)
Perhaps our understanding of Stackdriver billing is wrong?
But we don't want to be billed for Stackdriver, as the 150MB of logs outside of the GCP metrics is not going to be enough, and we have some expertise in setting up ELK for logging that we'd like to use.
You can disable Stackdriver logging/monitoring on Kubernetes by editing your cluster and setting "Stackdriver Logging" and "Stackdriver Monitoring" to Disabled.
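The same change can be scripted. A sketch with a placeholder cluster name and zone, shown by building the command string rather than running it (executing it needs gcloud auth; the --logging-service/--monitoring-service flags are the legacy ones that match this answer's era):

```shell
cluster=my-cluster   # placeholder
zone=us-central1-a   # placeholder

# Build the update command that turns off the legacy Stackdriver integrations.
cmd="gcloud container clusters update $cluster --zone $zone --logging-service none --monitoring-service none"
echo "$cmd"
```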
I would still suggest sticking with GCP over AWS, as you get the whole Kubernetes-as-a-service experience. Amazon's solution is still a little way off, and last I heard they were planning to charge for the service in addition to the EC2 node prices.
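On the self-managed route the question is after: a node-level agent such as fluentd or fluent-bit can tail container logs and ship them to your own Elasticsearch, bypassing Stackdriver entirely. A minimal fluent-bit output stanza as a sketch (the Elasticsearch service name and `logging` namespace are assumptions; the agent would typically run as a DaemonSet so every node is covered):

```
[OUTPUT]
    Name            es
    Match           kube.*
    Host            elasticsearch.logging.svc.cluster.local
    Port            9200
    Logstash_Format On
```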

How do I enable audit logging for Google Container Engine?

Running a GKE cluster with 1.8.1 - when I look at /logs/kube-apiserver-audit.log, it's completely empty. I've taken actions like creating deployments and deleting pods that have been visible in audit logs for clusters I've provisioned outside of GKE.
Is there a better way to view or access these kinds of events with GKE?
That would be because the Container Engine 1.8 release does not enable the audit logging feature yet. From the Release Notes:
KNOWN ISSUE: Audit Logging, a beta feature in Kubernetes 1.8, is currently not enabled on Container Engine.
It will probably be enabled at some point in the future, I’d keep an eye on the Release Notes.

Change kubernetes master env variable on GKE

I want to enable Stackdriver logging with my Kubernetes cluster on GKE.
It's stated here: https://kubernetes.io/docs/user-guide/logging/stackdriver/
This article assumes that you have created a Kubernetes cluster with cluster-level logging support for sending logs to Stackdriver Logging. You can do this either by selecting the Enable Stackdriver Logging checkbox in the create cluster dialogue in GKE, or by setting the KUBE_LOGGING_DESTINATION flag to gcp when manually starting a cluster using kube-up.sh.
But my cluster was created without this option enabled.
How do I change the environment variable while my cluster is running?
Unfortunately, logging isn't a setting that can be enabled/disabled on a cluster once it is running. This is something that we hope to change in the near future, but in the mean time your best bet is to delete and recreate your cluster (sorry!).