What Condition Causes the Pod Log Reader to Return EOF - kubernetes

I am using client-go to continuously pull log streams from Kubernetes pods. Most of the time everything works as expected, until the job has been running for a couple of hours.
The code looks like this:
podLogOpts := corev1.PodLogOptions{Follow: true}
kubeJob, err := l.k8sclient.GetKubeJob(l.job.GetNamespace(), l.job.GetJobId())
...
podName := l.k8sclient.GetKubeJobPodNameByJobId(l.job.GetNamespace(), l.job.GetJobId())
req := l.k8sclient.GetKubeClient().CoreV1().Pods(l.job.GetNamespace()).GetLogs(podName, &podLogOpts)
podLogStream, err := req.Stream(context.TODO())
...
for {
    copied, err := podLogStream.Read(buf)
    if err == io.EOF {
        // This is where the error happens:
        // usually after many hours the podLogStream returns EOF.
        // I checked the pod status; it is still running and keeps printing
        // data to stdout. Why would this happen???
        break
    }
    ...
}
The podLogStream returns EOF about 3-4 hours later. But when I check the pod status, the pod is still running and the service inside keeps printing data to stdout. So why does this happen? How can I fix it?
UPDATE
I found that roughly every 4 hours the pod log stream's Read call returns EOF, so I make the goroutine sleep and retry a second later, recreating the podLogStream and reading logs from the new stream object. It works. But why does this happen??

When you contact the logs endpoint, the apiserver forwards your request to the kubelet hosting your pod. The kubelet server then streams the content of the container's log file to the apiserver, and from there to your client. Since it streams logs from the file and not from stdout directly, the log file may be rotated by the container log manager, in which case you receive EOF and need to reinitialize the stream.
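If you want to keep following logs across those rotations, the usual workaround is exactly what the update describes: reopen the stream whenever Read returns EOF. Here is a minimal sketch of that pattern, not the asker's exact code: followLogs is a hypothetical helper, it assumes you already have a kubernetes.Interface clientset and a recent client-go whose Stream takes a context, and SinceTime is only a best-effort way to avoid re-reading old lines after reconnecting.
import (
    "context"
    "fmt"
    "io"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// followLogs reopens the pod log stream whenever it hits EOF (for example
// after the container log file is rotated) until the context is cancelled.
func followLogs(ctx context.Context, client kubernetes.Interface, namespace, podName string) error {
    buf := make([]byte, 4096)
    since := metav1.Now()
    for {
        opts := &corev1.PodLogOptions{Follow: true, SinceTime: &since}
        stream, err := client.CoreV1().Pods(namespace).GetLogs(podName, opts).Stream(ctx)
        if err != nil {
            return err
        }
        for {
            n, readErr := stream.Read(buf)
            if n > 0 {
                since = metav1.Now()
                fmt.Print(string(buf[:n])) // process the chunk
            }
            if readErr == io.EOF {
                break // stream ended (rotation or container exit); reopen it
            }
            if readErr != nil {
                stream.Close()
                return readErr
            }
        }
        stream.Close()
        time.Sleep(time.Second) // brief pause before re-creating the stream
    }
}
Note that SinceTime only has second granularity, so a few duplicate or missing lines around the reconnect are still possible; if you need exact continuity you have to de-duplicate on your side.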

Related

How to resolve "Invalid Sequence Token" when using cloudwatch agent?

I'm seeing the following warning in the /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log:
2021-10-06T06:39:23Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49619410836690261519535138406911035003981074860446093650
But there is no mention of which file is actually the one failing, not even when I add "debug": true to /opt/aws/amazon-cloudwatch-agent/bin/config.json.
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq .agent
{
  "metrics_collection_interval": 60,
  "debug": true,
  "run_as_user": "root"
}
I have many (28) files in the .logs.logs_collected.files.collect_list section of the config.json file, so how can I find exactly which file is causing trouble?
As of 2021-11-29, a PR to improve the log messages has been merged into the cloudwatch-agent, but a new version of the cloudwatch-agent has not been released yet; the next version after v1.247349.0 will likely include a fix for this.
The fix will change the log statements to:
INFO: First time sending logs to %v/%v since startup so sequenceToken is nil, learned new token: xxxx: yyyy
This is an INFO message, as this behaviour is expected at startup, for example.
WARN: Invalid SequenceToken used (%v) while sending logs to %v/%v, will use new token and retry: xxxxxv
This, on the other hand, is not expected and may mean that someone else is writing to the log group/log stream concurrently.
If those warnings come right after a restart of the cloudwatch agent (cwagent), you can safely ignore them; it's expected behaviour. The cloudwatch agent does not save the next sequence token in its persistent state, so on restart it "learns" the correct sequence number by issuing a PutLogEvents call with no sequence token at all; that call returns an InvalidSequenceTokenException carrying the next sequence token to use. So it's expected to see those warnings at startup. In any case, I proposed a PR to amazon-cloudwatch-agent to improve those log messages.
If "Invalid SequenceToken used" shows up long after a restart, then you may have other issues.
The "Invalid SequenceToken used" error usually means that two entities/sources are trying to write to the same log group/log stream, as mentioned in [2] (which is really for the old awslogs agent but still useful):
Caught exception: An error occurred (InvalidSequenceTokenException) when calling the PutLogEvents operation: The given sequenceToken is invalid[…] -or- Multiple agents might be sending log events to log stream[…] – You can't push logs from multiple log files to a single log stream. Update your configuration to push each log to a log stream-log group combination.
It could be that the amazon cloudwatch agent itself is trying to upload the same file twice because you have duplicates in your config.json.
So first print all your log group / log stream pairs in your config.json with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort
which should give an output similar to:
/tableauserver/apigateway apigateway_node5-0.log
/tableauserver/apigateway control_apigateway_node5-0.log
/tableauserver/appzookeeper appzookeeper-discovery_node5-1.log
...
/tableauserver/vizqlserver vizqlserver_node5-3.log
Then you can use uniq -d to find the duplicates in that list with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort|uniq -d
# The list should be empty otherwise you have duplicates
If that command produces any output, it means that you have duplicates in your config.json collect_list.
I personally think that cwagent itself should print the "offending" log group/log stream in its logs, so I opened an issue on the amazon-cloudwatch-agent GitHub page.
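If you would rather do that duplicate check in Go than with jq, here is a minimal sketch. The config path and the log_group_name / log_stream_name fields come from the question; the file_path field and everything else are my assumptions about the usual collect_list layout, so adjust them to your config.
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// Only the slice of the agent config needed to detect duplicate
// log group / log stream pairs in collect_list.
type agentConfig struct {
    Logs struct {
        LogsCollected struct {
            Files struct {
                CollectList []struct {
                    FilePath      string `json:"file_path"`
                    LogGroupName  string `json:"log_group_name"`
                    LogStreamName string `json:"log_stream_name"`
                } `json:"collect_list"`
            } `json:"files"`
        } `json:"logs_collected"`
    } `json:"logs"`
}

func main() {
    raw, err := os.ReadFile("/opt/aws/amazon-cloudwatch-agent/bin/config.json")
    if err != nil {
        panic(err)
    }
    var cfg agentConfig
    if err := json.Unmarshal(raw, &cfg); err != nil {
        panic(err)
    }
    seen := map[string]string{} // "group stream" -> first file_path that used it
    for _, e := range cfg.Logs.LogsCollected.Files.CollectList {
        key := e.LogGroupName + " " + e.LogStreamName
        if first, dup := seen[key]; dup {
            fmt.Printf("duplicate target %q used by %s and %s\n", key, first, e.FilePath)
        } else {
            seen[key] = e.FilePath
        }
    }
}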

In Kubernetes, I'm not seeing SystemOOM messages in events - instead I'm getting an error message in kubelet

A process on my K8s host is being killed by the OOM killer. Occasionally I get the following logged in K8s events (I want to see this):
2021-03-16 16:55:23.000000 events WARN count = 1 type = DELETED name = boutique2.166cf7e33013b8a8 kind = Node name = boutique2 uid = boutique2 reason = SystemOOM msg = System OOM encountered
But most of the time I don't get the SystemOOM message, and instead get this in the kubelet log:
2021-03-17 08:43:06.405000 kubelet W 4081 oomparser.go:173] error reading /dev/kmsg: read /dev/kmsg: broken pipe
2021-03-17 08:43:06.405000 kubelet E 4081 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
My logs are being collected by fluentd. Any ideas why I am not always seeing a SystemOOM event being logged?
I am using minikube for this environment.
Thanks in advance.

Verifying that a kubernetes pod is deleted using client-go

I am trying to ensure that a pod is deleted before proceeding with another Kubernetes operation. So the idea I have is to call the pod Delete function and then call the pod Get function.
// Delete the pod
err := kubeClient.CoreV1().Pods(tr.namespace).Delete(podName, &metav1.DeleteOptions{})
if err != nil {
    ....
}
// Check whether the pod still exists
pod, err := kubeClient.CoreV1().Pods(tr.namespace).Get(podName, metav1.GetOptions{})
// What do I look for to confirm that the pod has been deleted?
err != nil && errors.IsNotFound(err)
Also this is silly and you shouldn't do it.
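To expand on that: a single Get right after Delete will normally still find the pod, because deletion is asynchronous and the pod lingers in Terminating until its grace period ends. If you really need to block until the pod is gone, poll (or watch) until Get returns a NotFound error. A minimal polling sketch, assuming a recent client-go whose methods take a context; waitForPodGone is just an illustrative helper name:
import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// waitForPodGone polls until the pod can no longer be found, or until ctx expires.
func waitForPodGone(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    for {
        _, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return nil // the pod is fully deleted
        }
        if err != nil {
            return err // some other API error
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Second):
            // pod still exists (probably Terminating); poll again
        }
    }
}
A watch on the pod (or the helpers in k8s.io/apimachinery/pkg/util/wait) avoids the fixed polling interval, but the idea is the same: the pod is gone only once the API returns NotFound.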

How to invoke the Pod proxy verb using the Kubernetes Go client?

The Kubernetes remote API allows HTTP access to arbitrary pod ports using the proxy verb, that is, using an API path of /api/v1/namespaces/{namespace}/pods/{name}/proxy.
The Python client offers corev1.connect_get_namespaced_pod_proxy_with_path() to invoke the above proxy verb.
Despite reading, browsing, and searching the Kubernetes client-go sources for some time, I'm still lost as to how to do with the Go client what I'm able to do with the Python client. My other impression is that I may need to dive down into the rest client of the clientset, if there's no ready-made corev1 API call available?
How do I correctly construct the GET call using the rest client and the path mentioned above?
As it turned out, after an involved dive into the Kubernetes client sources, accessing the proxy verb is only possible by going down to the level of the RESTClient and building the GET/... request by hand. The following code shows this in the form of a fully working example:
package main

import (
    "context"
    "fmt"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    clcfg, err := clientcmd.NewDefaultClientConfigLoadingRules().Load()
    if err != nil {
        panic(err.Error())
    }
    restcfg, err := clientcmd.NewNonInteractiveClientConfig(
        *clcfg, "", &clientcmd.ConfigOverrides{}, nil).ClientConfig()
    if err != nil {
        panic(err.Error())
    }
    clientset, err := kubernetes.NewForConfig(restcfg)
    if err != nil {
        panic(err.Error())
    }
    res := clientset.CoreV1().RESTClient().Get().
        Namespace("default").
        Resource("pods").
        Name("hello-world:8000").
        SubResource("proxy").
        // The server URL path, without the leading "/", goes here...
        Suffix("index.html").
        Do(context.TODO())
    rawbody, err := res.Raw()
    if err != nil {
        panic(err.Error())
    }
    fmt.Print(string(rawbody))
}
You can test this, for instance, on a local kind cluster (Kubernetes in Docker). The following commands spin up a kind cluster, prime the only node with the required hello-world webserver, and then tell Kubernetes to start the pod with said hello-world webserver.
kind create cluster
docker pull crccheck/hello-world
docker tag crccheck/hello-world crccheck/hello-world:current
kind load docker-image crccheck/hello-world:current
kubectl run hello-world --image=crccheck/hello-world:current --port=8000 --restart=Never --image-pull-policy=Never
Now run the example:
export KUBECONFIG=~/.kube/kind-config-kind; go run .
It then should show this ASCII art:
<xmp>
Hello World
                                       ##         .
                                 ## ## ##        ==
                              ## ## ## ## ##    ===
                           /""""""""""""""""\___/ ===
                      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
                           \______ o          _,/
                            \      \       _,'
                             `'--.._\..--''
</xmp>

Kubernetes kubelet error updating node status

Running a Kubernetes cluster in AWS via EKS. Everything appears to be working as expected, but I'm just checking through all the logs to verify. I hopped onto one of the worker nodes and noticed a bunch of errors when looking at the kubelet service:
Oct 09 09:42:52 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 09:42:52.335445 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
Oct 09 10:03:54 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 10:03:54.831820 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
The nodes are all showing as Ready, but I'm not sure why those errors are appearing. I have 3 worker nodes and all 3 have the same kubelet errors (the hostnames are different, obviously).
Additional information: it would appear that the error is coming from this code in kubelet_node_status.go:
node, err := kl.heartbeatClient.CoreV1().Nodes().Get(string(kl.nodeName), opts)
if err != nil {
    return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
}
From the workers I can execute get nodes using kubectl just fine:
kubectl get --kubeconfig=/var/lib/kubelet/kubeconfig nodes
NAME                           STATUS    ROLES     AGE       VERSION
ip-172-26-0-58.ec2.internal    Ready     <none>    1h        v1.10.3
ip-172-26-1-193.ec2.internal   Ready     <none>    1h        v1.10.3
Turns out this is not an issue. Official reply from AWS regarding these errors:
The kubelet will regularly report node status to the Kubernetes API. When it does so it needs an authentication token generated by the aws-iam-authenticator. The kubelet will invoke the aws-iam-authenticator and store the token in its global cache. In EKS this authentication token expires after 21 minutes.
The kubelet doesn't understand token expiry times, so it will attempt to reach the API using the token in its cache. When the API returns the Unauthorized response, there is a retry mechanism to fetch a new token from aws-iam-authenticator and retry the request.