fluentd isn't shipping logs to Stackdriver - Kubernetes

I have an application deployed on Kubernetes on GKE.
Kubernetes version: v1.7.11-gke.1
Stackdriver Logging is enabled on my cluster.
The fluentd-gcp image on my cluster (by default):
gcr.io/google-containers/fluentd-gcp:2.0.9
My logs were all OK and visible in Stackdriver, but a few days ago the logs from one deployment (let's call it my-app) stopped arriving in Stackdriver, even though the app is still writing them:
kubectl logs -f my-app-3270987706-cx0r2 --namespace=production
{"time":"2018-01-30 16:11:13.155","msg":"ignoring xml"}
{"time":"2018-01-30 16:11:14.155","msg":"success blabla"}
I see the following logs from fluentd:
2018-01-30 16:11:46 +0000 [warn]: emit transaction failed: error_class=Errno::ENOENT error="No such file or directory @ sys_fail2 - (/var/log/fluentd-buffers/kubernetes.system.buffer..b563203c1da7cb5e1.log, /var/log/fluentd-buffers/kubernetes.system.buffer..q563203c1da7cb5e1.log)" tag="docker"
2018-01-30 16:11:46 +0000 [warn]: suppressed same stacktrace
2018-01-30 16:11:46 +0000 [error]: Exception emitting record: No such file or directory @ sys_fail2 - (/var/log/fluentd-buffers/kubernetes.system.buffer..b563203c1da7cb5e1.log, /var/log/fluentd-buffers/kubernetes.system.buffer..q563203c1da7cb5e1.log)
Why aren't the logs shipped to Stackdriver? How can I fix it?
Edit:
I'll note that the logs of other apps do appear in Stackdriver.
The logs of the failing app are very large - maybe that's why they fail to ship?
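A hedged workaround sketch, assuming the buffer files under /var/log/fluentd-buffers were removed or corrupted on that node: restart the fluentd-gcp pod on the node running my-app so the DaemonSet recreates it with a fresh buffer (the label and namespace below are the GKE defaults and may differ on your cluster):
# find which node runs the affected pod
kubectl get pod my-app-3270987706-cx0r2 --namespace=production -o wide
# find the fluentd-gcp pod scheduled on that same node
kubectl get pods --namespace=kube-system -l k8s-app=fluentd-gcp -o wide
# delete it; the DaemonSet recreates it and reinitializes its buffer files
kubectl delete pod <fluentd-gcp-pod-on-that-node> --namespace=kube-system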

Related

Grafana upgrade causes error in rolling up dashboard

I upgraded Grafana from version 8.1.8 to 8.5.20 and after I did, I started seeing these errors in the logs about rolling up dashboards:
logger=analytics.summaries t=2023-02-09T21:41:06.67+0000 lvl=eror msg="error during daily rollup" err="context deadline exceeded"
logger=analytics.summaries t=2023-02-09T21:41:06.67+0000 lvl=eror msg="got error while rolling up dashboard" dashboard=8928
There isn't any other information in the logs and I'm not sure why this is occurring. What does it mean to roll up a dashboard, and how would I resolve this error?
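A hedged first step for debugging: raise Grafana's log level to debug so the analytics.summaries logger reports more than just the timeout; the "context deadline exceeded" message suggests the rollup query is simply not finishing within its deadline. The file path below assumes a default package install; in a container the GF_* environment variable form applies.
# /etc/grafana/grafana.ini
[log]
level = debug

# or, when running in a container
GF_LOG_LEVEL=debug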

How to install Eclipse Che in an Azure Kubernetes cluster

I'm trying to install Eclipse Che by following this blog: https://che.eclipseprojects.io/2022/07/25/#karatkep-installing-eclipse-che-on-aks.html,
yet after following all the steps I am not able to install Eclipse Che successfully.
1) After running this command:
kubectl logs -l app.kubernetes.io/component=che-operator -n eclipse-che -f
these are the errors I'm facing:
logs: Waited for 1.034843163s due to client-side throttling, not priority and fairness, request: GET:https://10.1.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
time="2022-09-12T14:08:29Z" level=info msg="Successfully reconciled."
2) The che-gateway pod is failing:
che-gateway-7d54ccdd59-bblw6 3/4 CrashLoopBackOff 18 (2m51s ago) 70m
Description: the oauth-proxy container keeps failing (CrashLoopBackOff).
Logs of the oauth-proxy container:
#invalid configuration:
missing setting: login-url
missing setting: redeem-url
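For context, those missing settings look like oauth2-proxy options; in a Che install they are normally derived from the identity provider configured for the cluster (e.g. via the CheCluster resource), so this error usually means the OIDC/identity-provider configuration never reached the proxy. As an illustrative sketch only (values are placeholders, and in a working install the Che operator injects them rather than you setting flags by hand), oauth2-proxy needs either explicit endpoints or an issuer it can discover them from:
# placeholder oauth2-proxy flags; a healthy Che install gets these from the operator
--provider=oidc
--oidc-issuer-url=https://<identity-provider-host>/realms/<realm>   # OIDC discovery supplies login-url and redeem-url
--client-id=<client-id>
--client-secret=<client-secret>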

How to remove the unwanted characters from fluentd logs

Currently I am sending my Kubernetes logs to CloudWatch using Fluentd, but when I check the logs in CloudWatch, they contain extra Unicode characters. I tried different approaches and regexps to remove them, but with no luck. Here is a sample of how my log appears in CloudWatch:
Log in Cloudwatch: "log": "\u001b[2m2021-10-13 20:07:10.351\u001b[0;39m \u001b[32m INFO\u001b[0;39m \u001b[35m1\u001b[0;39m \u001b[2m---\u001b[0;39m \u001b[2m[trap-executor-0]\u001b[0;39m \u001b[36mc.n.d.s.r.aws.ConfigClusterResolver \u001b[0;39m \u001b[2m:\u001b[0;39m Resolving eureka endpoints via configuration\n"
Actual log : 2021-10-13 20:07:10.351 INFO 1 --- [trap-executor-0] c.n.d.s.r.aws.ConfigClusterResolver : Resolving eureka endpoints via configuration
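Those \u001b[...m sequences are ANSI color codes from the application's console output. One possible approach, as a sketch (the filter match pattern and the "log" field name are assumptions based on the sample above): strip them in Fluentd with a record_transformer filter before the CloudWatch output.
<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    # remove ANSI escape sequences such as \e[2m and \e[0;39m from the "log" field
    log ${record["log"].to_s.gsub(/\e\[[0-9;]*m/, '')}
  </record>
</filter>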

Openshift 3.11 cloud integration fails with lookup RequestError: send request failed\\ncaused by: Post https://ec2.eu-west-.amazonaws.com

Following the docs: https://docs.openshift.com/container-platform/3.11/install_config/configuring_aws.html#aws-cluster-labeling
I'm configuring the cloud integration after the cluster build.
When the cluster services are restarted on the masters, they fail to look up the AWS instances:
22 16:32:10.112895 75995 server.go:261] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0c5cbd50923f9c6d2: "error listing AWS instances: \"Request.service: main process exited, code=exited, status=255/n/a Error: send request failed\\ncaused by: Post https://ec2.eu-west-.amazonaws.com/: dial tcp: lookup ec2.eu-west-.amazonaws.com: no such host\""
On closer inspection it seems to be due to an incorrect hostname:
https://ec2.eu-west-.amazonaws.com/ VS https://ec2.eu-west-2.amazonaws.com/
So I double checked the config, which seems to be correct:
# cat /etc/origin/cloudprovider/aws.conf
[Global]
Zone = eu-west-2
I had a Google and it seems to be a similar issue to this one:
https://github.com/kubernetes-sigs/kubespray/issues/4345
Is there a way to work around this? Moving off 3.11 isn't an option right now.
Thanks.
It looks as though it needs to be the availability zone rather than the region:
# cat /etc/origin/cloudprovider/aws.conf
[Global]
Zone = eu-west-2a
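As a quick sanity check (a hedged suggestion, assuming the masters are EC2 instances with access to the instance metadata service), the exact availability-zone string can be read from the instance itself and pasted into aws.conf:
# run on the master node; returns e.g. eu-west-2a
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone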

Kubernetes Replication Controller Integration Test Failure

I am seeing the following Kubernetes integration tests fail pretty consistently, about 90% of the time, on RHEL 7.2, Fedora 24, and CentOS 7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure. My online queries lead me to believe this may also involve an apiserver issue. My setup is simple: I install and start Docker, install Go, clone the kubernetes repo from GitHub, run hack/install-etcd.sh from the repo and add the installed etcd to PATH, get ginkgo, gomega, and go-bindata, then run 'make test-integration' (a rough sketch of these steps is included after the log excerpts below). I don't manually change anything or add any custom files/configs. Has anyone run into these issues and found a solution? The only mention of this issue I have seen online was deemed a flake and has no listed solution, but I run into it on almost every single test run. Pieces of the error are below; I can give more if needed:
Garbage Collector:
*many lines from garbagecollector.go that look good*
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
*repeats over and over, with a different resource failing to list each time*
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
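For reference, the setup described in the question corresponds roughly to the following (a sketch; the etcd location is the path hack/install-etcd.sh uses by default):
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
hack/install-etcd.sh                        # installs etcd under third_party/etcd
export PATH="$PWD/third_party/etcd:$PATH"   # make the installed etcd visible to the tests
make test-integration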
I increased the limit on the number of open file descriptors and haven't seen this issue since, so I'm going to go ahead and call this solved.
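A minimal sketch of that fix, assuming the default per-shell open-file limit was the bottleneck (the 65536 value is only an example):
# check the current soft limit for open files
ulimit -n
# raise it for the current shell before running the tests
ulimit -n 65536
# to make it persistent, add nofile entries in /etc/security/limits.conf, e.g.
#   *  soft  nofile  65536
#   *  hard  nofile  65536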