heapster uptimes randomly reset - kubernetes

For the past couple of weeks, the reported uptimes for most of my pods have been incorrect: they reset to 0 frequently, but at a random rate (sometimes after a couple of minutes or seconds, sometimes after a couple of hours).
The data is written to InfluxDB and displayed with Grafana.
Here is a screenshot of the uptime of some MongoDB nodes over a week (none of them have restarted). Only the blue line (node-2) is correct; all the others reset randomly.
Versions:
Kubernetes: 1.8.3
Heapster: 1.4.3 (amd64)
InfluxDB: 1.1.1 (amd64)
Any idea what is going wrong?
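One hedged way to narrow this down is to query the raw series that heapster's InfluxDB sink writes and check whether the resets are already present there. The names below assume heapster's default sink schema (database "k8s", measurement "uptime", field "value", tag "pod_name"); the pod-name regex is a placeholder.

```
# Query the raw uptime series heapster writes (defaults assumed; adjust to your setup).
influx -database k8s -execute \
  'SELECT "value" FROM "uptime" WHERE "pod_name" =~ /mongo/ AND time > now() - 7d LIMIT 20'
# If the raw points themselves drop back to 0, the resets come from heapster/kubelet
# rather than from the Grafana query or InfluxDB.
```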

Related

InfluxDB container high CPU / I/O / network usage

I'm running Home Assistant + InfluxDB 1.8.10 on my Raspberry Pi 4 (8 GB).
A few days ago I noticed that my InfluxDB is running high on CPU, I/O and network usage.
Earlier it was around 1-5% CPU; now the container stats show ~250%...
[screenshot: Portainer container stats]
Do you have an idea what to do?
Thank you in advance,
Chris
So first I thought my SD card was dying, so I moved all my data to a new SSD...
The load dropped from 7-8 to 4-5, but the container stats stayed the same.
[screenshot: Pi data]
Then I moved InfluxDB into its own container and removed it from Home Assistant, but still no success.
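Not an answer, but a couple of hedged first checks for a sudden InfluxDB 1.8 CPU spike; the container name and database name are placeholders for your setup.

```
# Look for long-running or piled-up queries, a common cause of sudden CPU spikes.
docker exec influxdb influx -execute 'SHOW QUERIES'
# Check series cardinality: a Home Assistant entity or tag that explodes into many
# series can drive CPU and I/O up over time.
docker exec influxdb influx -execute 'SHOW SERIES CARDINALITY ON homeassistant'
```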

Cassandra is logging URGENT_MESSAGES timeouts for a node

URGENT_MESSAGES-[no-channel] dropping message of type GOSSIP_DIGEST_SYN whose timeout expired before reaching the network
Thank you for your message. Yesterday we solved the problem.
The reason was a "dead node", apparently left over from a change in the Kubernetes deployment.
So always look out for dead nodes after changing something in the cluster deployment (see the sketch below).
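A minimal sketch of that check, assuming you can run nodetool against one of the live nodes; the host ID is a placeholder, and removenode should only be used for a node that is permanently gone.

```
# Dead nodes show up with status "DN" (Down/Normal) in the status column.
nodetool status
# Remove a stale node that no longer exists, using the Host ID from the output above.
nodetool removenode <host-id-of-the-dead-node>
```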
You didn't provide a lot of information, but I'm assuming that your cluster is running into a known issue where gossip messages are dropped during startup of a Cassandra node (CASSANDRA-16877).
The starting node sends GOSSIP_DIGEST_SYN with high priority (URGENT_MESSAGES), but in large clusters Cassandra 4.0 nodes cannot serialise the gossip state when it exceeds 128 KB, so no acknowledgement gets sent. Since the node cannot gossip with the other nodes, it fails to start.
This was urgently fixed in Cassandra 4.0.1 last year. Upgrade the binaries on the affected Cassandra 4.0 nodes and that should allow them to start successfully and join the ring. Cheers!
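If it helps, here is a hedged way to confirm which Cassandra version each pod is actually running before and after the upgrade; the label selector is a placeholder for your StatefulSet's labels.

```
# Print the Cassandra release version reported by every pod in the ring.
for p in $(kubectl get pods -l app=cassandra -o jsonpath='{.items[*].metadata.name}'); do
  echo "$p: $(kubectl exec "$p" -- nodetool version)"
done
```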

Elastic Cloud APM Server - Queue is full

I have many Java microservices running in a Kubernetes cluster. All of them have APM agents sending data to an APM server in our Elastic Cloud cluster.
Everything was working fine, but suddenly every microservice started showing the error below in its logs.
I tried restarting the cluster, increasing the hardware size and following the hints, but with no success.
Note: the disk is almost empty and memory usage is OK.
Everything is on version 7.5.2.
I deleted all the indexes related to APM and everything worked again after a few minutes.
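For reference, a hedged sketch of that cleanup; ES_URL and the credentials are placeholders, the apm-* pattern matches the default APM index names in 7.x, and this permanently deletes APM data, so use it with care.

```
# Delete all APM indices (default naming in 7.x is apm-<version>-<type>-<date>).
curl -s -u "$ES_USER:$ES_PASS" -X DELETE "$ES_URL/apm-*"
```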
For better performance you can fine-tune these fields in the apm-server.yml file:
queue.mem.events - the internal queue size; increase it to output.elasticsearch.worker * output.elasticsearch.bulk_max_size (default is 4096).
output.elasticsearch.worker - increase it (default is 1).
output.elasticsearch.bulk_max_size - increase it (the default of 50 is very low).
Example: for my use case I used the following settings for 2 apm-server nodes and 3 ES nodes (1 master, 2 data nodes):
queue.mem.events=40000
output.elasticsearch.worker=4
output.elasticsearch.bulk_max_size=10000
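A minimal sketch of what that looks like in apm-server.yml, assuming a self-managed APM server where you can edit the file directly (on Elastic Cloud these would have to go through the APM user settings, where permitted); the path and the test command are assumptions about your install.

```
# Append the tuned settings (values from the example above; adjust to your workload).
cat >> /etc/apm-server/apm-server.yml <<'EOF'
queue.mem.events: 40000
output.elasticsearch.worker: 4
output.elasticsearch.bulk_max_size: 10000
EOF
# If your build ships the Beats-style test command, confirm the file still parses.
apm-server test config -c /etc/apm-server/apm-server.yml
```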

Grafana displays No Data from Prometheus Under 6 hours time period

I have a Prometheus dashboard that displays "No Data" when it is viewed from a remote machine and the time period is set to 6 hours or less. The issue does not happen on the local machine. The Grafana and Prometheus services are both running on the same CentOS 7 server. I've tested this from Windows and Linux machines.
Edit: setting the range to 12 hours doesn't populate the data properly either, but 24 hours and more does.
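Not an answer, but a hedged way to narrow it down from the remote machine: query Prometheus's HTTP API directly over the same 6-hour window and compare clocks, since Grafana computes time ranges from the browser's local time. The hostname, port and metric are placeholders.

```
# Ask Prometheus directly for the last 6 hours of a known metric.
END=$(date -u +%s)
START=$((END - 6*3600))
curl -s "http://prometheus.example.com:9090/api/v1/query_range?query=up&start=${START}&end=${END}&step=60" | head -c 500; echo
# Compare the client clock with the server clock: a skewed remote clock can push
# the short ranges into a window where no samples exist yet.
date -u
```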

`kubectl get pods` has high latency

I am attempting to identify and fix the source of high latency when running kubectl get pods.
I am running 1.1.4 on AWS.
When running the command from the master host of the afflicted cluster, I consistently get response times of about 6s.
Other queries, such as get svc and get rc, return on the order of 20ms.
Running get pods on a mirror cluster returns in 150ms.
I've crawled through master logs and system stats, but have not identified the issue.
We sped up LIST operations in 1.2. You might be interested in reading about the updates to Kubernetes performance and scalability in 1.2.
Chris, how big is your cluster and how many pods do you have in it?
Obviously, the time it takes to return the response will be longer if the result is bigger.
Also, what do you mean by "running on a mirror cluster returns in 150ms"? What is a "mirror cluster"?
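In the meantime, a hedged sketch for narrowing down where the 6 seconds go; -v=9 is the standard kubectl verbosity flag for dumping the underlying API calls, and the curl line assumes the apiserver's insecure localhost port is enabled on the master, which was common on 1.1.x.

```
# Time the call and dump the HTTP requests kubectl makes, to see whether the time
# is spent in the apiserver round-trip or in client-side processing.
time kubectl get pods -v=9 2>&1 | tail -n 20
# Compare with a raw LIST against the apiserver from the master host.
time curl -s http://localhost:8080/api/v1/pods -o /dev/null
```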