AWS EB should create new instance once my docker reached its maximum memory limit - scala

I have deployed my dockerized micro services in AWS server using Elastic Beanstalk which is written using Akka-HTTP(https://github.com/theiterators/akka-http-microservice) and Scala.
I have allocated 512mb memory size for each docker and performance problems. I have noticed that the CPU usage increased when server getting more number of requests(like 20%, 23%, 45%...) & depends on load, then it automatically came down to the normal state (0.88%). But Memory usage keeps on increasing for every request and it failed to release unused memory even after CPU usage came to the normal stage and it reached 100% and docker killed by itself and restarted again.
I have also enabled auto scaling feature in EB to handle a huge number of requests. So it created another duplicate instance only after CPU usage of the running instance is reached its maximum.
How can I setup auto-scaling to create another instance once memory usage is reached its maximum limit(i.e 500mb out of 512mb)?
Please provide us a solution/way to resolve these problems as soon as possible as it is a very critical problem for us?

CloudWatch doesn't natively report memory statistics. But there are some scripts that Amazon provides (usually just referred to as the "CloudWatch Monitoring Scripts for Linux) that will get the statistics into CloudWatch so you can use those metrics to build a scaling policy.
The Elastic Beanstalk documentation provides some information on installing the scripts on the Linux platform at http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-cw.html.
However, this will come with another caveat in that you cannot use the native Docker deployment JSON as it won't pick up the .ebextensions folder (see Where to put ebextensions config in AWS Elastic Beanstalk Docker deploy with dockerrun source bundle?). The solution here would be to create a zip of your application that includes the JSON file and .ebextensions folder and use that as the deployment artifact.
There is also one thing I am unclear on and that is if these metrics will be available to choose from under the Configuration -> Scaling section of the application. You may need to create another .ebextensions config file to set the custom metric such as:
option_settings:
aws:elasticbeanstalk:customoption:
BreachDuration: 3
LowerBreachScaleIncrement: -1
MeasureName: MemoryUtilization
Period: 60
Statistic: Average
Threshold: 90
UpperBreachScaleIncrement: 2
Now, even if this works, if the application will not lower its memory usage after scaling and load goes down then the scaling policy would just continue to trigger and reach max instances eventually.
I'd first see if you can get some garbage collection statistics for the JVM and maybe tune the JVM to do garbage collection more often to help bring memory down faster after application load goes down.

Related

Tomcat in k8s pod and db in cloud - slow connection

I have tomcat, zookeeper and kafka deployled in local k8s(kind) cluster. The database is remote i.e. in cloud. The pages load very slowly.
But when i moved tomcat outside of the pod and started manually with zk and kafka in local k8s cluster and db in remote cloud the pages are loading fine.
Why is Tomcat very slow when inside a Kubernetes pod?
In theory, a program running in a container can run as fast as a program running on the host machine.
In practice, there are many things that can affect the performance.
When running on Windows or macOS (for instance with Docker Desktop), container doesn't run directly on the machine, but in a small Linux virtual machine. This VM will add a bit of overhead, and it might not have as much CPU and RAM as the host environment. One way to have a look at the resource usage of containers is to use docker stats; or docker run -ti --pid host alpine and then use classic UNIX tools like free, top, vmstat, ... to see the resource usage in the VM.
In most environments (at least with Docker, and with Kubernetes clusters in their most common default configurations), containers run without resource constraints and limits. However, it is fairly common (and, in fact, highly recommended!) to set resource requests and limits when running containers on Kubernetes. You can check resource limits of a pod with kubectl describe. If metrics-server is installed (which is recommended, even on dev/staging environments), you can check resource usage with kubectl top. Tools like k9s will show you resource requests, limits, and usage in a comprehensive way (as long as the data is available; i.e. you still need to install metrics-server to obtain pod metrics, for instance).
In addition to the VM overhead described above, if the container does a lot of I/O (whether it's disk or network), there might be a bit of overhead in comparison to a native process. This can become noticeable if the container writes on the container copy-on-write filesystem (instead of a volume), especially when using the device-mapper storage driver.
Applications that use "live reload" techniques (that automatically rebuild or restart when source code is edited) are particularly prone to this I/O issue, because there are unfortunately no efficient methods to watch file modifications across a virtual machine boundary. This means that many web frameworks exhibit extreme performance degradations when running in containers on Mac or Windows when the source code is mounted to the container.
In addition to these factors, there can be other subtle differences that might affect the overall performance of a containerized application. When observing performance issues, it is very helpful to use a profiler (or some kind of APM solution) to see which parts of the code take longer to execute. If no profiler or APM is available, try to execute individual portions of the code independently to compare their performance. For instance, have a small piece of code that executes a single query to the database; or executes a single task from a job queue, etc.
Good luck!

Airflow Memory Error: Task exited with return code -9

According to both of these Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The size of the 10 BigQuery tables range from 1MB to 400MB, and the total size of all 10 tables is ~1GB. My docker container has default 2GB of memory and I've increased this to 4GB, however I am still receiving this error from a few of the tasks. I am confused about this, as 4GB should be plenty of memory for this. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2GB), and I'd like to avoid these return code -9 errors at that time.
I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, and the queries / data in-memory for the DAG's tasks is necessarily fairly large then, based on the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two
Airflow processes: task execution and monitoring. Currently, each node
can take up to 6 concurrent tasks. More memory can be consumed,
depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check it with following steps:
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the The node was low on resource: memory message at the top of the window.
What are the possible ways to fix OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task - this way when your task gets -9'ed by the scheduler it will go to up_for_retry instead of failed
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running over 70% CPU at all times, or the Composer environment may become unstable during resource usage.
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any 1 task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.

Elastic Cloud APM Server - Queue is full

I have many Java microservices running in a Kubernetes Cluster. All of them are APM agents sending data to an APM server in our Elastic Cloud Cluster.
Everything was working fine but suddenly every microservice received the error below showed in the logs.
I tried to restart the cluster, increase the hardware power and I tried to follow the hints but no success.
Obs: The disk is almost empty and the memory usage is ok.
Everything is in 7.5.2 version
I deleted all the indexes related to APM and everything worked after some minutes.
for better performance u can fine tune these fields in apm-server.yml file
internal queue size increase queue.mem.events=output.elasticsearch.worker * output.elasticsearch.bulk_max_size
default is 4096
output.elasticsearch.worker (increase) default is 1
output.elasticsearch.bulk_max_size (increase) default is 50 very less
Example : for my use case i have used following stats for 2 apm-server nodes and 3 es nodes (1 master 2 data nodes )
queue.mem.events=40000
output.elasticsearch.worker=4
output.elasticsearch.bulk_max_size=10000

Writing to neo4j pod takes much more time than writing to local neo4j

I have a python code where I process some data, write neo4j queries and then commit these queries to neo4j. When I run the code on my local machine and write the output to local neo4j it doesn't take more than 15 minutes. However, when I run my code locally and write the output to noe4j pod in k8s pod it takes double the time, and when I build my code and deploy it to k8s and run that pod and write the output to neo4j pod it takes a round 3 hours. since I'm new to k8s deployment it might something in the pod configurations or settings, so I appreciate if I can get some hints
There could be few reasons of that.
I would first check how much resources does your pod consume while you are processing data, you can do that using kubectl top pod.
Second I would check if there are any limits inside pod. You can read a great deal about them on Managing Compute Resources for Containers.
If you have a limit set then it might be too low and that's causing the extended time while processing data.
If limits are not set then it might be because of how you installed minik8s. I think as default it's being installed with 4G is memory, you can look at alternative methods of installing minik8s. With multipass you can specify more memory to allocate.
There also can be a issue with Page Cache Sizing, Heap Sizing or number of open files. Please read the Neo4j Performance Tuning.

How to control various log sizes

I have cluster running in Azure.
I have multiple gigabytes of log data under D:\SvcFab\Log\Traces. Is there way to control amount of trace data that is collected/stored? Will the logs grow indefinitely?
Also the D:\SvcFab\ReplicatorLog has 8GB of preallocated data as specified by SharedLogSizeInMB parameter (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-configuration). How can I change this setting in Azure cluster or should it always be kept default?
For Azure clusters the SvcFab\Log folder will grow up to 5GB. It will also shrink if detects your disk is running out of space (<1GB). There are no controls for this in Azure.
This may be old but if you still have this issue, the solution for this is to add parameter in arm template for service fabric cluster .. there are some other ways to do this but this one is the most guaranteed one
https://techcommunity.microsoft.com/t5/azure-paas-developer-blog/reduce-log-size-on-service-fabric-node/ba-p/1017493