I am using DSBulk to unload data into CSV from a DSE cluster installed under Kubernetes. My cluster consists of 9 Kubernetes pods, each with 120 GB of RAM.
I monitored the resources while unloading the data and observed that the more data is fetched into CSV, the more RAM is utilised, and the pods restart due to lack of memory.
If only one pod is down at a time, the DSBulk unload won't fail, but if 2 pods are down the unload fails with the exception:
Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but
only 0 replica responded).
Is there a way to avoid this memory exhaustion, or is there a way to increase the timeout duration?
The command I am using is:
dsbulk unload -maxErrors -1 -h '["< My Host >"]' -port 9042 -u < My user name >
-p < Password > -k < Key Space > -t < My Table > -url < My Table >
--dsbulk.executor.continuousPaging.enabled false --datastax-java-driver.basic.request.page-size 1000
--dsbulk.engine.maxConcurrentQueries 128 --driver.advanced.retry-policy.max-retries 100000
After a lot of trial and error, we found out the problem was that the Kubernetes Cassandra pods were using the main server's memory size to derive Max Direct Memory Size, rather than the pod's assigned RAM.
The pods were assigned 120 GB of RAM, but Cassandra on each pod was assigning 185 GB to file_cache_size, which made the unloading process fail because Kubernetes was restarting any pod that utilised more than 120 GB of RAM.
The reason is that Max Direct Memory Size is calculated as:
Max direct memory = (system memory - JVM heap size) / 2
As a result, each pod was using 325 GB as its Max Direct Memory Size, and each pod's file_cache_size is automatically set to half of the Max Direct Memory Size value. So whenever a pod requested more than 120 GB of memory, Kubernetes restarted it.
The solution was to set Max Direct Memory Size as an environment variable in the Kubernetes cluster's YAML file with a suitable default value, or to override it by setting the file_cache_size value in each pod's cassandra.yaml file.
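As a minimal sketch of the first option, assuming the image's cassandra-env.sh honours a MAX_DIRECT_MEMORY environment variable (the variable name, image tag, and 32G value here are illustrative assumptions, not taken from our cluster):

# Hypothetical excerpt from the DSE StatefulSet pod template
containers:
  - name: dse
    image: datastax/dse-server:6.8.25    # placeholder tag
    resources:
      limits:
        memory: 120Gi
    env:
      - name: MAX_DIRECT_MEMORY          # assumed to feed -XX:MaxDirectMemorySize
        value: "32G"                     # sized so heap + direct memory stay well under the 120 GB pod limit

The alternative is to set file_cache_size_in_mb (the file_cache_size value mentioned above; the exact name varies by version) directly in each pod's cassandra.yaml, sized with the same budget in mind.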
I copied the contents of an older Ceph cluster to a new Ceph cluster using rclone. Because several of the buckets had tens of millions of objects in a single directory I had to enumerate these individually and use the "rclone copyto" command to move them. After copying, the number of objects match but the space utilization on the second Ceph cluster is much higher.
Each Ceph cluster is configured with the default triple redundancy.
The older Ceph cluster has 1.4PiB of raw capacity.
The older Ceph cluster has 526TB in total bucket utilization as reported by "radosgw-admin metadata bucket stats". The "ceph -s" status on this cluster shows 360TiB of object utilization with a total capacity of 1.4PiB for 77% space utilization. The two indicated quantities of 360TiB used in the cluster and 526TB used by buckets are significantly different. There isn't enough raw capacity on this cluster to hold 526TB.
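A rough sketch of how the two figures can be compared (the jq path assumes the usual radosgw-admin JSON layout, with per-bucket sizes reported under usage."rgw.main"):

radosgw-admin bucket stats \
  | jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add / 1024 / 1024 / 1024'   # per-bucket total in TiB
ceph df detail    # raw vs stored usage per pool, including replication overhead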
After copying the contents to the new Ceph cluster, the total bucket utilization of 553TB is reflected in the "ceph -s" status as 503TiB. This is slightly higher than the bucket total of the source, which I assume is due to the larger drives' block sizes, but the status utilization matches the sum of the bucket utilization as expected. The number of objects in each bucket of the destination cluster also matches the source buckets.
Is there a setting in the first Ceph cluster that merges duplicate objects, like a simplistic compression? There isn't enough capacity in the first Ceph cluster to hold much over 500TB, so this seems like the only way this could happen. I assume that when two objects are identical, each bucket gets a symlink-like pointer to the same object. The new Ceph cluster doesn't seem to have this capability, or it's not configured to behave this way.
The first cluster is Ceph version 13.2.6 and the second is version 17.2.3.
It's not clear how the "kubectl top node" command calculates the resource consumption percentage, and which factors it takes into account.
For example, the nodes in our node pool have 2 CPUs and 7 GB of RAM. When I describe a node, only 4.6 GB is shown as allocatable.
So what is the remaining ~2.4 GB used for, and which components use this reserved cache memory?
Even though the allocatable memory is shown as 4.6 GB (out of 7 GB), when I run the "kubectl top node" command it shows 3.8 GB consumed (that is, 82% of the 4.6 GB allocatable memory).
So, in order to analyse the exact usage of the pods on the node, I calculated each pod's request memory or usage memory (whichever is higher), and the sum came to only 2.3 GB. That does not match the top node command's output: 1.5 GB is missing from this calculation relative to the "kubectl top node" output, and this difference comes on top of the ~2.4 GB already reserved. So, in total, out of 7 GB of RAM only about ~3 GB is actually usable, and the remaining ~4 GB is simply cached by AKS.
So I would like to understand the following points in detail:
For what purpose is the memory cached, i.e. the difference between actual memory and allocatable memory (7 GB - 4.6 GB = 2.4 GB)?
Which components are counted in the "kubectl top node" command's output? It is not only the pods' consumption, so which other components are included?
Even though the memory is already cached (the 2.4 GB above), why does the node show additional memory usage (3.8 GB) beyond the actual pod usage (2.3 GB)? That means:
- kubectl top node output: 3.8 GB
- sum of pods' request/usage memory: 2.3 GB
- Why is this extra 1.5 GB added to the top node command's output? (A rough way of comparing these numbers is sketched below.)
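A rough way to line up the three numbers above (the node name is a placeholder, and the awk sum assumes kubectl top reports every pod's memory in Mi):

kubectl describe node <node-name>      # Capacity vs Allocatable, plus the "Allocated resources" summary
kubectl top node <node-name>           # node usage as reported by metrics-server
kubectl top pod --all-namespaces --no-headers \
  | awk '{gsub("Mi", "", $4); sum += $4} END {print sum " Mi total pod usage"}'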
Let us assume a Kubernetes cluster with one worker node (1 core and 256 MB RAM). All pods will be scheduled on that worker node.
At first I deployed a pod with the config (request: cpu 0.4, limit: cpu 0.8), and it deployed successfully; as the machine had 1 core free, the pod could use up to 0.8 CPU.
Can I deploy another pod with the same config? If yes, will the first pod's CPU be reduced to 0.4?
Resource requests and limits are considered in two different places.
Requests are only considered when scheduling a pod. If you're scheduling two pods that each request 0.4 CPU on a node that has 1.0 CPU, then they fit and could both be scheduled there (along with other pods requesting up to a total of 0.2 CPU more).
Limits throttle CPU utilization, but are also subject to the actual physical limits of the node. If one pod tries to use 1.0 CPU but its pod spec limits it to 0.8 CPU, it will get throttled. If two of these pods run on the same hypothetical node with only 1 actual CPU, they will be subject to the kernel scheduling policy and in practice will probably each get about 0.5 CPU.
(Memory follows the same basic model, except that if a pod exceeds its limits or if the total combined memory used on a node exceeds what's available, the pod will get OOM-killed. If your node has 256 MB RAM, and each pod has a memory request of 96 MB and limit of 192 MB, they can both get scheduled [192 MB requested memory fits] but could get killed if either one individually allocates more than 192 MB RAM [its own limit] or if the total memory used by all Kubernetes and non-Kubernetes processes on that node goes over the physical memory limit.)
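For illustration, a minimal pod spec matching the numbers above (the name and image are placeholders): two copies of it fit on the 1-CPU / 256 MB node because scheduling only looks at the requests (2 x 400m CPU, 2 x 96 Mi), while the combined limits (2 x 800m CPU, 2 x 192 Mi) are allowed to exceed what the node physically has.

apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: nginx                # placeholder image
      resources:
        requests:
          cpu: 400m               # used by the scheduler to decide placement
          memory: 96Mi
        limits:
          cpu: 800m               # CPU above this is throttled
          memory: 192Mi           # memory above this gets the container OOM-killed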
Fractional requests are allowed. A Container with spec.containers[].resources.requests.cpu of 0.5 is guaranteed half as much CPU as one that asks for 1 CPU. The expression 0.1 is equivalent to the expression 100m, which can be read as “one hundred millicpu”. Some people say “one hundred millicores”, and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
From here
In your situation you are able to run 2 pods on the node.
I would like to know: does the scheduler consider resource limits when scheduling a pod?
For example, if the scheduler schedules 4 pods on a specific node with total capacity <200Mi, 400m> and the total resource limits of those pods are <300Mi, 700m>, what will happen?
Only resource requests are considered during scheduling. This can result in a node being overcommitted. (Managing Compute Resources for Containers in the Kubernetes documentation says a little more.)
In your example, say your node has 1 CPU and 2 GB of RAM, and you've scheduled 4 pods that request 0.2 CPU and 400 MB RAM each. Those all "fit" (requiring 0.8 CPU and 1.6 GB RAM total) so they get scheduled. If any individual pod exceeds its own limit, its CPU usage will be throttled or memory allocation will fail or the process will be killed. But, say all 4 of the pods try to allocate 600 MB of RAM: none individually exceeds its limits, but in aggregate it's more memory than the system has, so the underlying Linux kernel will invoke its out-of-memory killer and shut down processes to free up space. You might see this as a pod restarting for no apparent reason.
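Two places where this shows up on a live cluster (the names are placeholders): the "Allocated resources" table in kubectl describe node lists request and limit totals, and the limit percentages may legitimately exceed 100% on an overcommitted node; a container killed this way reports OOMKilled as its last termination reason.

kubectl describe node <node-name>   # "Allocated resources": limits over 100% mean the node is overcommitted
kubectl describe pod <pod-name>     # Last State: Terminated, Reason: OOMKilled for a memory-killed container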
In spark-env.sh, it's possible to configure the following environment variables:
# - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g)
export SPARK_WORKER_MEMORY=22g
[...]
# - SPARK_MEM, to change the amount of memory used per node (this should
# be in the same format as the JVM's -Xmx option, e.g. 300m or 1g)
export SPARK_MEM=3g
If I start a standalone cluster with this:
$SPARK_HOME/bin/start-all.sh
I can see at the Spark Master UI webpage that all the workers start with only 3GB RAM:
-- Workers Memory Column --
22.0 GB (3.0 GB Used)
22.0 GB (3.0 GB Used)
22.0 GB (3.0 GB Used)
[...]
However, I specified 22g as SPARK_WORKER_MEMORY in spark-env.sh
I'm somewhat confused by this. Probably I don't understand the difference between "node" and "worker".
Can someone explain the difference between the two memory settings and what I might have done wrong?
I'm using spark-0.7.0. See also here for more configuration info.
A standalone cluster can host multiple concurrent Spark "clusters", where each such cluster is tied to a particular SparkContext; i.e. you can have one cluster running kmeans, one cluster running Shark, and another one running some interactive data mining.
In this case, the 22GB is the total amount of memory you allocated to the Spark standalone cluster, and your particular instance of SparkContext is using 3GB per node. So you can create 6 more SparkContexts, using up to 21GB in total.
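A spark-env.sh sketch of that split, using the same numbers (Spark 0.7.x standalone mode):

# spark-env.sh on each worker host
export SPARK_WORKER_MEMORY=22g   # pool the worker daemon can hand out to executors on this node
export SPARK_MEM=3g              # memory each application (each SparkContext) takes per node
# With these values one worker can serve floor(22 / 3) = 7 applications at once,
# i.e. the SparkContext already running plus 6 more (7 x 3g = 21g <= 22g).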