Having an issue with a multi-threaded Java app in Cloud Foundry - sap-cloud-platform

My Java app has a program which executes multiple threads at the same time to complete a task. It works really well on my local computer, which has 4 cores and 8 logical processors, but when I deploy the app to Cloud Foundry it doesn't allow more than 1 thread to be created. I debugged and found that the Cloud Foundry JVM has only 1 CPU allocated, so it cannot run multiple threads simultaneously.
How can I fix this problem?
Do I need to buy more CPUs, or is there a way to change the JVM configuration to assign multiple CPUs to the Java app?

In the SAP Cloud Platform Cloud Foundry environment, the number of CPUs depends on the allocated memory. You can get a maximum vertical scaling of 8 GB RAM and 2 CPUs. See the ratio in the quote below.
In the Cloud Foundry environment, applications get guaranteed CPU share of ¼ core per GB instance memory. As the maximum instance memory per application is 8 GB, this allows for vertical scaling up to 2 CPUs.
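As a quick sanity check, you can print the number of processors the JVM actually reports from inside the container. A minimal sketch (note that the value returned by availableProcessors() inside a container may reflect either the host or the container limit, depending on the JVM version and container configuration):
public class CpuCheck {
    public static void main(String[] args) {
        // Number of processors the JVM believes it can use
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("Available processors: " + cpus);
    }
}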
Here you can find more on quotas and limitations of SAP Cloud Platform.
This blog provides interesting insights on performance testing on SCP.
To update your quota, go to the Overview section of your app and follow the steps shown in the screenshot.

I am answering my own question, as I have figured out the issue. A thread deadlock was the root cause: multiple threads were calling the same methods concurrently and were getting trapped waiting for each other to release locks, so execution became unresponsive. To fix it, I made all the methods used in the call method of the Callable synchronized by adding the "synchronized" keyword to their declarations. It now works fine both locally and in the cloud environment. The only thing I still don't understand is why the deadlock only happened in the cloud environment and not locally. I guess it might be due to the different JVM configuration in the cloud.
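A minimal sketch of the change described above (class and method names here are illustrative, not from the original code): the shared methods invoked from the Callable's call method are declared synchronized, so only one thread executes them at a time.
import java.util.concurrent.Callable;

class SharedHelper {
    // Declared synchronized, as described above, so concurrent Callables
    // cannot interleave inside this method.
    public synchronized void updateSharedState() {
        // critical section touching shared state
    }
}

class Task implements Callable<Void> {
    private final SharedHelper helper;

    Task(SharedHelper helper) {
        this.helper = helper;
    }

    @Override
    public Void call() {
        helper.updateSharedState(); // only one thread at a time enters the synchronized method
        return null;
    }
}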
@Artyom Kovalyov, here is how I split the tasks for parallel processing.
// imports needed for this snippet; Lists.partition comes from Guava
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.google.common.collect.Lists;

final int numberOfWorkers = 4;
List<Callable<Void>> tasks = new ArrayList<>();
// code here to create and add tasks to the list
final List<List<Callable<Void>>> dividedTasks = Lists.partition(tasks, numberOfWorkers);
final ExecutorService executor = Executors.newFixedThreadPool(numberOfWorkers);
for (List<Callable<Void>> subsetOfTasks : dividedTasks) {
    try {
        // you can invoke all tasks in one go, but I prefer invoking batches the same size as the thread pool's maximum thread count
        executor.invokeAll(subsetOfTasks);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
// invokeAll blocks until each batch completes, so all tasks are done by this point
executor.shutdownNow();

Related

"frozen" Google compute engine instance with PostgreSQL

We run several Debian instances with PostgreSQL on Google Compute Engine, and lately we have seen several occurrences of the following problem.
The instance suddenly becomes unresponsive. We cannot SSH into it and we cannot connect to the database. Internal monitoring using telegraf also stops during that period; no monitoring data is collected.
Google monitoring of CPU activity shows very low usage during that period. GCP logs do not show any migration; in fact they do not show anything at all. All internal logs on the instance - the postgresql log, syslog, and logs from periodic cron jobs - show the same gap. It looks like the instance was somehow frozen during that time. So far we have noticed it only on PostgreSQL instances, since these are heavily used.
Instances run these variants of OS and PG:
Debian 9 with PG 11.9
Debian 9 with PG 10.13
These incidents usually last 10-15 minutes, but in one case it was 1:20 hours. At the end of the incident some PG process is killed by the OOM killer, but activity on the database immediately before the incident starts is usually relatively low, and so are CPU and memory usage. So it looks more like the instance has limited resources when it starts again - if that is even possible...
Any idea what could be causing these issues, or what we should look for? As mentioned, there is generally no information in the internal logs on Debian during the period of the incident.
UPDATE: To avoid misunderstanding - the instances in question are a data warehouse database running on an n1-highmem-8 machine (8 CPUs and 52 GB RAM) with a 5 TB SSD, and a database collecting metrics from the internet - a custom machine with 20 CPUs, 90 GB RAM and a 3 TB SSD. All software is up to date.
UPDATE 2: Neither syslog, nor kern.log, nor messages shows anything for the time intervals during which the instance was unresponsive. Immediately after the incident, telegraf recorded a huge load average but actually quite low CPU usage, and Google monitoring shows very low CPU usage during the whole incident. Also, immediately after the end of the incident, one of the postgresql processes is always killed by the OOM killer, causing the database to go into recovery mode.
As for the PG work_mem parameter - the instance collecting metrics (20 CPUs, 90 GB RAM, 3 TB SSD) uses 8MB; it only inserts data but usually runs around 500-1000 connections.
The second instance is the data warehouse analytical database and uses a work_mem of 128MB, because lower values caused very bad query plans for the majority of queries; it usually runs only around 10-30 connections.
There was no unusual number of connections on either database immediately before the incidents happened.
UPDATE 3: The analytical database had several small incidents of the same character today. During the last one we stopped the instance from the GCP GUI and started it again after a few minutes. Maybe that caused a migration to different hardware. Since this operation the instance has been running OK.
I experienced a similar issue, but with a MySQL instance in GCP. The first problem was related to the machine type of the VM instance: I was using an f1-micro machine type and suddenly I wasn't able to access it over SSH. As this type of VM instance has only 0.6 GB of memory, it soon ran out of memory; I changed it to e2-medium, which is the default, and that resolved my problem that time.
Because the instance was out of memory, the services on the instance started to fail, which is why I couldn't access it.
On another occasion I ran into similar issues, but this time the problem was the disk: I only had 10 GB and a process was filling it up, and when a partition ran out of space the instance started to fail again.
I simply resized the disk; it is now 20 GB and the instance is working fine.
Having said that, I suggest increasing your resources as needed to improve performance, because the problems you describe are a good indicator that your existing machine type is not a good fit for the workloads you run on that instance.
If your situation is the same as mine, you can change the machine type to adjust the memory. You can follow the next steps for these tasks; please visit the following link to get further information about it.
Changing a machine type
1. Go to the VM Instances page.
2. In the Name column, click your instance.
From the instance details page, complete the following steps:
a) Click the Stop button to stop the instance, if you have not stopped it yet.
b) After the instance stops, click the Edit button at the top of the page.
c) Under the Machine configuration section, select the machine type you want to use, or create a custom machine type to increase only the memory.
d) Save your changes and restart your VM instance.
You can resize your disk following this guide or with the following command:
gcloud compute disks resize DISK_NAME --size DISK_SIZE
Or with the Console:
Go to the Disks page to see a list of zonal persistent disks in your project.
Click the name of the disk that you want to resize.
On the disk details page, click Edit.
In the Size field, enter the new size for your disk.
Click Save to apply your changes to the disk.
After you resize the disk, you must resize the file system so that the operating system can access the additional space.
Note: Do not resize boot disks beyond 2 TB because this is the limit.
Edit1
You mentioned that the logs don’t show information about the issue when the instance is frozen.
Did you try the kernel logs? I think they could provide a wealth of diagnostic information about this issue.
For Debian, these logs should be at the following path:
/var/log/kern.log
The messages log could also help:
/var/log/messages
You can obtain more information about the logs in this link.
Also, I think it could be a PostgreSQL configuration problem; for example, you could take a look at "work_mem". This parameter specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The default value is four megabytes (4MB).
You can consult this URL to get more information.
I have also found a good article that explains how to configure PostgreSQL for data warehouse usage.
Another possibility involves the kernel process in charge of identifying memory that can be paged out; you could configure it to check smaller chunks more often.
This link explains this configuration in more detail.
Additionally, as far as I know, a data warehouse server consumes a lot of resources, so it could be a good idea to check whether your instance has enough resources for your workload.
Edit2
I have found an article that describes a similar problem, and it says:
When you consume more memory than is available on your machine you can start to see out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you're running, whereas the OOM killer in Linux begins killing running processes, which in some cases might even include Postgres itself.
And this is the recommendation they give.
When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance, OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: when out of memory, it's better to decrease work_mem rather than increase it, since that is the amount of memory that can be consumed by each process, and too many operations are leveraging up to that much memory.
You can see the complete explanation in the article "Configuring memory for Postgres" here; it may help you with this issue.

determine ideal number of workers and EC2 sizing for master

I have a requirement to use Locust to simulate 20,000 (and more) users in a 10-minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers and how many workers should be attached to each EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the CPU load is minimal; the master, however, suffers badly. A 4 CPU, 16 GB RAM system as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each Locust master process has grown to 12 GB of virtual RAM. OK, so the EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the steps is to download a file from CloudFront, which is randomly selected in one of the tasks. This means the more open connections to CloudFront trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile device, and there are a lot of factors affecting the download speed for each device, I decided to switch from a GET request to a HEAD request. This allows me to test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other testing happening, since with bandwidth or system resource exhaustion every other test would be negatively impacted.
Using this approach I successfully executed a 10,000 user test in a ten-minute run time. I used 4 EC2 t2.xlarge instances with 4 workers per instance. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
The answer to the question in the title is: "It depends".
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other Locust tests).
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load generators are already maxing out their bandwidth or CPU, then your test is very unrealistic anyway, and adding more users just adds to the noise; in general, start low and work your way up)
Increase the number of load generators
In general, the number of users is not an issue for Locust, but the number of requests per second or the bandwidth might be.

Multiple inputs running in parallel in the same application vs. starting a new application for each input

As I can see when running the application in debug mode, there are 4 different threads on which Drools is running. When I start the application with two threads running two inputs in parallel, the same 4 threads remain (they don't become 8, for example).
My question is: is there a benefit to running the inputs in a separate application (and with that, starting a separate Drools application), or is all of this covered and do we end up with the same results using one application and running the inputs in parallel?
Advantages of using 1 JVM:
heap pooling
JIT compilation of the Java code happens only once
JIT compilation of the DRL code happens only once, if and only if you reuse your KieBases (see the sketch after these lists)
Ability to implement queueing and round-robin solving. See the SolverManager issue and runnablePartThreadLimit's purpose in Partitioned Search (which will also be supported on SolverManager at some point).
Advantages of using multiple JVMs:
The Garbage Collector usually works more efficiently
Ability to link OS (especially on Linux) process IDs to specific cores. This allows round-robin solving through OS configuration.
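As a rough illustration of the KieBase-reuse point above (a sketch assuming the standard org.kie.api classpath setup, not taken from the original project): the KieBase is built once, so the DRL is compiled once, and each input gets its own lightweight KieSession.
import org.kie.api.KieBase;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class SharedKieBaseExample {
    public static void main(String[] args) {
        // Build the KieBase once; the DRL is compiled here and reused afterwards.
        KieServices kieServices = KieServices.Factory.get();
        KieContainer kieContainer = kieServices.getKieClasspathContainer();
        KieBase kieBase = kieContainer.getKieBase();

        // Each input gets its own session from the shared KieBase.
        for (int i = 0; i < 2; i++) {
            KieSession session = kieBase.newKieSession();
            try {
                // insert facts for this input and fire the rules
                session.fireAllRules();
            } finally {
                session.dispose();
            }
        }
    }
}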

Celery: limit memory usage (large number of django installations)

We're having a setup with a large number of separate Django installations on a single box. Each of these has its own code base and Linux user.
We're using Celery for some asynchronous tasks.
Each of the installations has its own setup for Celery, i.e. its own celeryd and worker.
The amount of asynchronous tasks per installation is limited, and they are not time-critical.
When a worker starts it takes about 30 MB of memory. When it has run for a while this amount may grow (presumably due to fragmentation).
The last bullet point has already been (somewhat) solved by setting --maxtasksperchild to a low number (say 10). This ensures a restart after 10 tasks, after which the memory at least goes back to 30 MB.
However, each celeryd is still taking up a lot of memory, since the minimum number of workers appears to be 1 rather than 0. I also imagine that running python manage.py celery worker does not lead to the smallest possible footprint for the celeryd, since the full stack is loaded even if the only thing that happens is checking for tasks.
In an ideal setup, I'd like to see the following: a process with a very small memory footprint (100k or so) watches the queue for new tasks. When such a task arrives, it spins up the (heavy) full Django stack in a separate process, and when the worker is done, the heavy process is torn down again.
Is such a setup configurable using (somewhat) standard celery? If not, what points of extension are there?
We're (currently) using Celery 3.0.17 and the associated django-celery.
Just to make sure I understand - you have a lot of different django codebases, each with their own celery, and they take up too much memory when running on a single box simultaneously, all waiting for a celery job to come down the pipe? How many celery instances are we talking about here?
In my experience, you're using django-celery in a very different way than it was designed for - all of your different Django projects should be condensed into a few projects (or a single one), composed of multiple applications. Then you set up a small number of queues to field Celery tasks from the different apps - this way, you only have as many dormant Celery threads taking up 30 MB as you have queues, and a single queue can handle multiple tasks (from multiple apps if you want). The memory issue should go away.
To reiterate - you only need one celeryd, driving multiple workers. This way your bottleneck is job concurrency, not dormant memory needs.
Why do you need so many django installations? Please let me know if I'm missing something, or if you need clarification.

NUMA awareness of JVM

My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell, and many threads concurrently accessing it from other cells is too much for the interconnects.
If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05 and Akka 2.1.4.
I've now tried the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor, with no effect. I'm not sure the configuration is having an effect at all, though, since there seems to be nothing different in the startup logs.
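One way to confirm whether flags such as -XX:+UseNUMA and -XX:+UseParallelGC actually reach the JVM is to print the runtime's input arguments from inside the application. A minimal Java sketch (the Scala equivalent is a direct translation); alternatively, starting the JVM with -XX:+PrintFlagsFinal lists the final values of all VM flags:
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmFlagsCheck {
    public static void main(String[] args) {
        // Prints the arguments the JVM was actually started with, e.g. -XX:+UseNUMA
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String arg : jvmArgs) {
            System.out.println(arg);
        }
    }
}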
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
I think if you are sure that the problem is not in your message-processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration, starting with the JVM version (the latest is better, and Oracle JDK also mostly performs better than OpenJDK), then the JVM options (including GC, memory and concurrency options etc.), then the Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the things that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
I never had the chance to run these benchmarks on a 64-core server, so any feedback will be greatly appreciated.
From my findings, which may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool increases. This is most noticeable in cases where the rate of request-response calls between actors is high, e.g. on my laptop, when increasing the pool size from 4 to 64, the message-send latency of Akka actors for such cases grows by 2-3x for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
You can check if there are any differences by running mvnAll.sh with the benchmark.parallelism system variable set to different values.