My application is running very close to its memory limit of 1 Gi. I've done some checking ...
kubectl describe pod shows no events
htop inside the pod (through exec) shows nothing heavy running in the background
memory.stat shows this
How can I find out which process consumes most of my memory? I don't know much about memory.stat; I have already read the memory.stat documentation in the kernel docs and some Stack Overflow answers, but I am still puzzled. Could you please give me a suggestion?
htop is a good way to find relative memory utilization. The screenshot shows that only apache2 is running inside the pod. Knowing Apache, I would guess that it has big log files. Can you check with kubectl describe pod whether the pod uses emptyDir volumes?
Another approach is to run du -sh /var/log/apache2/* from inside the pod (check the log location in the config file if there are no logs there). If there are big files, truncate them with : > /var/log/apache2/[name_of_file] (a bare cat > file also truncates the file, but then waits for input) and check memory usage again. If the volume is backed by RAM, you will see memory usage decrease.
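To tell process memory apart from RAM-backed files, it can help to pull a few counters out of memory.stat. The sketch below is only an illustration: the sample numbers are made up, and the field names are the usual cgroup v1 ones (rss, cache, shmem) with the cgroup v2 names (anon, file, shmem) as a fallback. On both versions shmem is counted inside the cache/file figure, so a large shmem value would be consistent with logs sitting on a tmpfs-backed volume.

```python
# Illustrative cgroup v1 memory.stat excerpt; the numbers are made up.
SAMPLE_V1 = """\
cache 786432000
rss 41943040
shmem 734003200
mapped_file 12582912
pgfault 104857
"""


def parse_memory_stat(text):
    """Parse 'name value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(stats):
    """Report anonymous memory, page cache and shmem in MiB.

    Tries the cgroup v1 names first (rss, cache) and falls back to the
    cgroup v2 names (anon, file). shmem has the same name in both.
    """
    mib = 1024 * 1024
    return {
        "anonymous_mib": stats.get("rss", stats.get("anon", 0)) / mib,
        "page_cache_mib": stats.get("cache", stats.get("file", 0)) / mib,
        "shmem_mib": stats.get("shmem", 0) / mib,
    }


if __name__ == "__main__":
    for name, value in summarize(parse_memory_stat(SAMPLE_V1)).items():
        print(f"{name}: {value:.1f}")
```

To use it on a real pod, feed it the text of the container's memory.stat file, typically /sys/fs/cgroup/memory/memory.stat on cgroup v1 or /sys/fs/cgroup/memory.stat on cgroup v2.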
context
Our current context is the following: researchers are running HPC calculations on our Kubernetes cluster. Unfortunately, some pods cannot get scheduled because the container engine (here Docker) is not able to pull the images because the node is running out of disk space.
hypotheses
images too big
The first hypothesis is that the images are too big. This is probably the case because we know that some images are bigger than 7 GB.
datasets being decompressed locally
Our second hypothesis is that some people are downloading their datasets locally (e.g. curl ...) and inflate them locally. This would generate the behavior we are observing.
Envisioned solution
I believe that this problem is a good case for a daemon set that would have access to the node's file system. Typically, this pod would calculate the disk space used by each pod on the node and expose it as a Prometheus metric. From there it would be easy to set up alert rules that check which pods have grown a lot over a short period of time.
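A minimal sketch of what such a daemon set could run is given here, under some assumptions: it sums apparent file sizes per first-level directory of a root path and renders them in the Prometheus text exposition format. The metric name pod_disk_usage_bytes and the pod_uid label are hypothetical, and pointing it at /var/lib/kubelet/pods (the kubelet's default pod directory) would only cover what lives there, such as emptyDir volumes, not image layers or container writable layers.

```python
import os


def dir_size_bytes(path):
    """Sum the apparent size of all files below path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def pod_usage(pods_root):
    """Map each first-level directory name under pods_root to its size in bytes."""
    usage = {}
    for entry in sorted(os.listdir(pods_root)):
        full = os.path.join(pods_root, entry)
        if os.path.isdir(full):
            usage[entry] = dir_size_bytes(full)
    return usage


def to_prometheus(usage, metric="pod_disk_usage_bytes"):
    """Render the usage dict in the Prometheus text exposition format.

    The metric and label names are hypothetical, chosen for this example.
    """
    lines = [f"# TYPE {metric} gauge"]
    for pod_uid, size in sorted(usage.items()):
        lines.append(f'{metric}{{pod_uid="{pod_uid}"}} {size}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed default kubelet pod directory; adjust for your nodes.
    print(to_prometheus(pod_usage("/var/lib/kubelet/pods")), end="")
```

The output could be written to a file for the node exporter's textfile collector or served over HTTP; which one fits depends on how Prometheus is set up in the cluster.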
How to calculate the total disk space used by a pod?
The question then becomes: is there a way to calculate the total disk space used by a pod?
Does anyone have any experience with this?
Kubernetes does not track overall storage available. It only knows things about emptyDir volumes and the filesystem backing those.
To calculate total disk space you can use the command below
kubectl describe nodes
In the output of that command you can grep for ephemeral-storage, which is the size of the node's local ephemeral storage; this partition is shared and consumed by Pods via emptyDir volumes, image layers, container logs and container writable layers.
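If you want the ephemeral-storage values as plain byte counts, something like the following sketch could parse them out of the describe output. The sample text is made up to resemble the Capacity and Allocatable sections, and the suffix table only covers whole-number quantities with the common binary and decimal suffixes.

```python
import re

# Multipliers for the quantity suffixes handled by this sketch.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

# Made-up excerpt resembling `kubectl describe nodes` output.
SAMPLE = """\
Capacity:
  cpu:                4
  ephemeral-storage:  101445900Ki
  memory:             16393308Ki
Allocatable:
  cpu:                4
  ephemeral-storage:  93492541286
  memory:             16290908Ki
"""


def parse_quantity(quantity):
    """Convert a whole-number quantity such as '100Gi' or '5000' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity.strip())
    if not match or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


def ephemeral_storage_bytes(describe_output):
    """Return every 'ephemeral-storage: <quantity>' value found, in bytes."""
    pattern = re.compile(r"^[ \t]*ephemeral-storage:[ \t]+(\S+)[ \t]*$", re.MULTILINE)
    return [parse_quantity(m.group(1)) for m in pattern.finditer(describe_output)]


if __name__ == "__main__":
    for value in ephemeral_storage_bytes(SAMPLE):
        print(value)
```

In practice you would pass it the captured output of kubectl describe nodes instead of SAMPLE.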
Check whether a process is still running and holding file descriptors open, and with them perhaps some disk space (there may be other processes and file descriptors that have not been released either). Check whether that process is the kubelet.
You can verify by running $ ps -Af | grep xxxx
With Prometheus you can calculate it with the formula below
sum(node_filesystem_size_bytes)
Please go through "Get total and free disk space using Prometheus" for more information.
I have Python code that processes some data, builds neo4j queries and then commits these queries to neo4j. When I run the code on my local machine and write the output to a local neo4j, it doesn't take more than 15 minutes. However, when I run my code locally and write the output to a neo4j pod in k8s, it takes double the time, and when I build my code, deploy it to k8s, run that pod and write the output to the neo4j pod, it takes around 3 hours. Since I'm new to k8s deployment, it might be something in the pod configuration or settings, so I would appreciate some hints.
There could be a few reasons for that.
First, I would check how many resources your pod consumes while you are processing data; you can do that using kubectl top pod.
Second, I would check whether there are any limits set on the pod. You can read a great deal about them in Managing Compute Resources for Containers.
If you have a limit set, it might be too low, and that would explain the extended processing time.
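To compare usage against a limit, you could parse the kubectl top pod output and express each value as a fraction of the limit. This is just a sketch: the sample output and the 2Gi limit are made up, the pod name my-loader is hypothetical, and only the Ki/Mi/Gi memory suffixes are handled.

```python
# Made-up sample in the shape printed by `kubectl top pod`.
SAMPLE_TOP = """\
NAME        CPU(cores)   MEMORY(bytes)
my-loader   980m         1890Mi
neo4j-0     150m         900Mi
"""


def parse_cpu(value):
    """Convert '980m' (millicores) or '2' (cores) to a number of cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value):
    """Convert '1890Mi', '2Gi', '512Ki' or a plain integer to bytes."""
    for suffix, factor in (("Ki", 2**10), ("Mi", 2**20), ("Gi", 2**30)):
        if value.endswith(suffix):
            return int(value[:-2]) * factor
    return int(value)


def parse_top(output):
    """Map pod name to (cpu_cores, memory_bytes), skipping the header row."""
    rows = {}
    for line in output.strip().splitlines()[1:]:
        name, cpu, memory = line.split()[:3]
        rows[name] = (parse_cpu(cpu), parse_memory(memory))
    return rows


def fraction_of_limit(used, limit):
    """Return used/limit; values close to 1.0 suggest the limit is too tight."""
    return used / limit


if __name__ == "__main__":
    memory_limit = parse_memory("2Gi")  # assumed limit, for illustration only
    for pod, (cpu, memory) in parse_top(SAMPLE_TOP).items():
        print(pod, cpu, round(fraction_of_limit(memory, memory_limit), 2))
```

A CPU value that sits at the limit for the whole run usually means the container is being throttled, which would fit a job that is much slower in the cluster than locally.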
If limits are not set, then it might be because of how you installed minik8s. I think by default it is installed with 4G of memory; you can look at alternative methods of installing minik8s. With multipass you can specify more memory to allocate.
There can also be an issue with page cache sizing, heap sizing or the number of open files. Please read the Neo4j Performance Tuning documentation.
Our setup:
We are using kubernetes in GCP.
We have pods that write logs to a shared volume, with a sidecar container that sucks up our logs for our logging system.
We cannot just use stdout instead for this process.
Some of these pods are long lived and are filling up disk space because of no log rotation.
Question:
What is the easiest way to prevent the disk space from filling up here (without scheduling pod restarts)?
I have been attempting to install logrotate using RUN apt-get install -y logrotate in our Dockerfile and placing a logrotate config file in /etc/logrotate.d/dynamicproxy, but it doesn't seem to get run. /var/lib/logrotate/status never gets generated.
I feel like I am barking up the wrong tree or missing something integral to getting this working. Any help would be appreciated.
We ended up writing our own daemonset to properly collect the logs from the nodes instead of the container level. We then stopped writing to shared volumes from the containers and logged to stdout only.
We used fluentd to move the logs around.
https://github.com/splunk/splunk-connect-for-kubernetes/tree/master/helm-chart/splunk-kubernetes-logging
In general, you should write logs to stdout and configure a log collection tool such as the ELK stack. This is the best practice.
However, if you want to run logrotate as a separate process in your container, you may use Supervisor, which serves as a very simple init system and allows you to run as many parallel processes in a container as you want.
A simple example of using Supervisor to rotate Nginx logs can be found here: https://github.com/misho-kr/docker-appliances/tree/master/nginx-nodejs
If you write to the filesystem, the application creating the logs should be responsible for rotation. If you are running a Java application with logback or log4j, it is a simple configuration change. For other languages/frameworks it is usually similar.
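As an illustration of application-side rotation in another language, here is a minimal Python sketch using the standard library's RotatingFileHandler. The logger name, file path and size limits are assumptions for the example, not values from the question.

```python
import logging
import logging.handlers


def make_rotating_logger(path, max_bytes=10 * 1024 * 1024, backups=3,
                         name="app.rotating-example"):
    """Return a logger that writes to path and rotates it by size.

    At most `backups` old files (path.1, path.2, ...) are kept, so the
    disk space used by this log stays roughly bounded by
    max_bytes * (backups + 1).
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these records out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Hypothetical path on the shared volume.
    log = make_rotating_logger("/tmp/dynamicproxy.log", max_bytes=1024 * 1024)
    log.info("service started")
```

Whatever reads the files from the shared volume has to cope with the rename that happens on rotation, so it is worth checking how the sidecar follows files before relying on this.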
If that is not an option, you could use a specialized tool to handle the rotation and pipe the output to it. One example would be http://cr.yp.to/daemontools/multilog.html
As a last resort, you could look into logging to a named pipe (FIFO) instead of a real file and have some other process handle retrieving and writing the data, including the rotation.
I'm sending a message from Application A to Artemis but I'm getting this error from Application A:
AMQ212054: Destination address=my-service is blocked. If the system is configured to block make sure you consume messages on this configuration.
Looking at the logs of artemis starting up this is what I see which I believe is the cause:
AMQ222210: Storage usage is beyond max-disk-usage. System will start blocking producers
I've looked at the documentation here and found nothing that could help. Also have logged into the running container and changed the 'max-disk-usage' to 100 as per my google research and so far nothing has helped.
I'm running artemis using the following command:
docker run -it --rm -e ARTEMIS_USERNAME=artemis -e ARTEMIS_PASSWORD=artemis -p 8161:8161 -p 61616:61616 vromero/activemq-artemis
Any help is appreciated~ Thank you
You are receiving this message because your computer's disk is over 90% full, and Artemis blocks producers once this happens. To solve your problem you can either:
Clear up disk space on your computer so that usage is below 90%.
Increase how full your disk can be before Artemis blocks producers. To do this you need to modify the broker configuration file, which is located at:
path-to-broker\artemis\etc\broker.xml
In this file, there is a tag labeled max-disk-usage which is by default set to 90. Simply increase this to 100 (or whatever value you feel comfortable with).
Note that Artemis configures your brokers to start blocking producers once your computer's disk usage reaches 90% or more in order to avoid using up all of your disk space if a message backlog builds up.
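To see how close a disk is to that threshold, a quick check along these lines may help. It is only a sketch of the idea: it uses Python's shutil.disk_usage, the 90 default mirrors the max-disk-usage default mentioned above, and the exact comparison Artemis performs internally is an assumption here.

```python
import shutil


def disk_used_percent(path="/"):
    """Return how full the filesystem holding path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(path="/", max_disk_usage=90):
    """True if usage exceeds max_disk_usage percent.

    Whether the broker uses > or >= is an assumption of this sketch.
    """
    return disk_used_percent(path) > max_disk_usage


if __name__ == "__main__":
    print(f"{disk_used_percent('/'):.1f}% used, over threshold: {over_threshold('/')}")
```

Run it against the filesystem that holds the broker's data directory; when the broker runs in Docker, that is the disk as seen from inside the container, which is not necessarily the one you look at on your desktop.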
I've downloaded a different version and this issue hasn't occurred anymore.
I'm running v1.10 and I notice that kube-controller-manager's memory usage spikes and then it OOMs all the time. It wouldn't be so bad if the system didn't fall to a crawl before this happens, though.
I tried modifying /etc/kubernetes/manifests/kube-controller-manager.yaml to have a resource.limits.memory=1Gi, but the kube-controller-manager pod never seems to want to come back up.
Any other options?
There is a bug in kube-controller-manager, and it's fixed in https://github.com/kubernetes/kubernetes/pull/65339
First of all, you did not say how much memory you have and use per node.
Second, what do you mean by "system didn't fall to a crawl"? Do you mean the nodes are swapping?
All Kubernetes masters and nodes are expected to have swap disabled - it's recommended by the Kubernetes community, as mentioned in the Kubernetes documentation.
Support for swap is non-trivial and degrades performance.
Turn off swap on every node by:
sudo swapoff -a
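To confirm whether swap is actually off on a node, you can look at SwapTotal in /proc/meminfo. A small sketch follows; the sample text is made up, and on a real node you would read the file itself.

```python
# Made-up /proc/meminfo excerpt for a node with swap disabled.
SAMPLE_MEMINFO = """\
MemTotal:       16393308 kB
MemFree:         1203944 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""


def swap_total_kb(meminfo_text):
    """Return the SwapTotal value (in kB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    raise ValueError("SwapTotal not found")


def swap_enabled(meminfo_text):
    """True if the node has any swap configured."""
    return swap_total_kb(meminfo_text) > 0


if __name__ == "__main__":
    print("swap enabled:", swap_enabled(SAMPLE_MEMINFO))
```

Note that swapoff -a only lasts until the next reboot unless the swap entries are also removed from /etc/fstab.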
Finally,
resource.limits.memory=1Gi
is the limit applied to the pod. Limits are hard limits: a pod that reaches this amount of allocated memory can be OOM-killed, even if the node has gigabytes of unallocated memory.