Drools performance

I have an issue with the performance of Drools on different machines.
I made a very simple JMH benchmark test:
package ge.magticom.rules.benchmark;
import ge.magticom.rules.benchmark.Subscriber
rule "bali.free.smsparty"
activation-group "main"
salience 4492
when
$subs:Subscriber(chargingProfileID == 2)
then
$subs.setResult(15);
end
rule "bali.free.smsparty5"
activation-group "main"
salience 4492
when
$subs:Subscriber(chargingProfileID == 3)
then
$subs.setResult(14);
end
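For reference, the Subscriber fact class is not shown in the question; a minimal sketch that would be consistent with the rules above and the benchmark below (the field names and getter/setter shapes are assumptions inferred from the DRL) could look like this:
package ge.magticom.rules.benchmark;

// Hypothetical minimal fact class; the real Subscriber is not shown in the question.
public class Subscriber {
    private int chargingProfileID;  // matched by the chargingProfileID == ... constraints
    private int result;             // written by setResult(...) in the rule consequences

    public int getChargingProfileID() { return chargingProfileID; }
    public void setChargingProfileID(int chargingProfileID) { this.chargingProfileID = chargingProfileID; }
    public int getResult() { return result; }
    public void setResult(int result) { this.result = result; }
}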
@Benchmark
public Subscriber send() throws Exception {
    Subscriber subscriber = new Subscriber();
    subscriber.setChargingProfileID(5); // note: with chargingProfileID = 5, neither rule above matches
    StatelessKieSession session = ruleBase.newStatelessKieSession();
    ArrayList<Object> objs = new ArrayList<Object>();
    objs.add(subscriber);
    session.execute(objs);
    return subscriber;
}
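The ruleBase field is also not shown. A sketch of one common way to build such a KieBase from a DRL string, using Drools' KieHelper utility (an assumption, since the question doesn't say how the rules are loaded):
import org.kie.api.KieBase;
import org.kie.api.io.ResourceType;
import org.kie.internal.utils.KieHelper;

// Sketch only: builds a KieBase from the DRL text above, held in the String drl.
KieBase ruleBase = new KieHelper()
        .addContent(drl, ResourceType.DRL)
        .build();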
On my home development machine:
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
(Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads), 64 GB memory, JDK 11, I get very good performance:
with 7 threads I get nearly 2M operations per second (stateless):
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 5 2154292.750 ± 149405.498 ops/s
But on the preproduction server, an Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 112 threads and 1 TB RAM, I get half the performance (even when increasing the thread count):
NAME="Oracle Linux Server"
VERSION="8.4"
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 5 1084939.195 ± 107897.663 ops/s
I'm trying to test our billing system with Java 11 and Drools 7.54.0.Final.
Our system was based on JRockit Real Time 1.6 and Drools version 4.0.3. We are moving the system from Sun Solaris SPARC to an Intel-based system.
Running the same rules with JRockit 1.6, I got an even worse performance gap between the home and preproduction environments:
Home test benchmark:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 20 692054.563 ± 3507.519 ops/s
Preproduction benchmark:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 20 382283.288 ± 6405.953 ops/s
As you can see, it's nearly half the performance even for very simple rules.
But for real rules, such as our online charging system, the performance is even worse.
In the home environment I got:
Benchmark Mode Cnt Score Error Units
WorkerBenchmark.send thrpt 5 152.846 ± 87.076 ops/s
This means one message involves nearly 100 iterations, so in 00:01:49 the benchmark processed 16287 sessions with 430590 rule-call events. A single rule call takes about 2.33 milliseconds on average, which is not great, but not as bad as on preproduction.
On the preproduction server:
Benchmark Mode Cnt Score Error Units
WorkerBenchmark.send thrpt 5 35.013 ± 9.565 ops/s
In 00:01:54 I got only 3723 sessions, containing 98571 rule-call events in total. Each call takes 10.7299 ms on average.
While running all these benchmarks nothing else was running on the preproduction system, whereas the home environment had a lot of development tools open, and the tests were run from IntelliJ IDEA.
Can you suggest anything that might cause such a difference in performance? I tried different Java versions and vendors. These results are based on oracle-jdk-11.0.8.
Here are the kernel params of the preproduction server:
fs.file-max = 6815744
kernel.sem = 2250 32000 100 128
kernel.shmmni = 4096
kernel.shmall = 1073741824
kernel.shmmax = 4398046511104
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500

This is just a very wild guess since I definitely don't have enough information, but are the 2 environments using the same garbage collectors, configured in the same way? Maybe you're using ParallelGC (which in my experience is better for pure throughput, as you're measuring) on one side and G1 on the other?
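If you want to rule that out, one option (a sketch, not taken from the question) is to pin the same collector explicitly in the JMH fork, so both machines are guaranteed to run identical GC settings:
// Illustrative only: forces ParallelGC on both machines so the collector cannot differ.
@Fork(value = 2, jvmArgs = {"-Xms10G", "-Xmx10G", "-XX:+UseParallelGC"})
@Benchmark
public Subscriber send() throws Exception {
    // ... same benchmark body as above ...
}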

Thanks for the answer.
I used several GC configurations; none of them was ParallelGC. I don't think GC is the problem. I used ZGC in the final tests and the GC pause times were not above 5 ms (tested also with Java 16, where pause times were below 100 microseconds):
@Fork(value = 2, jvmArgs = {"--illegal-access=permit", "-Xms10G", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
"-Xmx10G", "-XX:+UnlockExperimentalVMOptions", "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10", "-XX:+UseZGC", "-XX:+UsePerfData", "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M"})
java -version
java version "11.0.8" 2020-07-14 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.8+10-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.8+10-LTS, mixed mode)
Here are flame graphs generated with async-profiler.
As you can see, in the home environment the Java process is using 95% of the whole time, but on the server only 65%. The time difference is also obvious:
RulesBenchmark.send thrpt 5 1612318.098 ± 64712.672 ops/s
Home Result FlameGraph.html
RulesBenchmark.send thrpt 5 775498.081 ± 72237.890 ops/s
Server Flame Graph.html

Related

ERROR: No OpenCL platforms found, check OpenCL installation

I tried to run a MATLAB program on a GPU (CentOS 7.3).
The MATLAB program uses Caffe.
When I run it from the command line with:
matlab -nodisplay -r "demo, quit"
it runs okay.
When I run it with the LSF command:
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" matlab -nodisplay -r "demo, quit"
I get the error:
ERROR: No OpenCL platforms found, check OpenCL installation
I compared LD_LIBRARY_PATH - it is the same in both cases.
What could the problem be?
Any ideas are welcome!
clinfo output:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 8.0.0
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name Tesla K40m
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 375.26
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Available Yes
Device Profile FULL_PROFILE
Device Topology (NV) PCI-E, 09:00.0
Max compute units 15
Max clock frequency 745MHz
Compute Capability (NV) 3.5
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 11995578368 (11.17GiB)
Error Correction support Yes
Max memory allocation 2998894592 (2.793GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 245760 (240KiB)
Global Memory cache line 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max constant buffer size 65536 (64KiB)
Max number of constant args 9
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) No
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
My best guess would be that the bsub command from LSF schedules the job on another machine (a compute node) in the cluster, where OpenCL is not installed.
Having OpenCL/CUDA on the frontend but not on the compute nodes of a cluster is something I've witnessed quite a few times. Even if the parts of the filesystem holding the libraries are shared, the folder /etc/OpenCL/vendors, used by OpenCL's ICD mechanism, must be present.
You could try running clinfo via bsub (if you didn't already), or use bsub to execute ls /etc/OpenCL/vendors.
If you're not sure whether or not the LSF-submitted jobs run on the same machine or not, use the hostname command with and without bsub.
Hope that helps.

Out of memory issue in JDK but works fine in OpenJDK, Java application deployed on JBoss 5.1

I have deployed my Java application on JBoss on Linux 2.6.32.
The machine has 8 GB of memory. When I run the application on OpenJDK using
JAVA_OPTS="$JAVA_OPTS -server -Xms2048m -Xmx2048m -XX:MaxPermSize=700m -XX:NewRatio=3 -XX:+DisableExplicitGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Dofbiz.home=adasdfasdf"
it works fine.
But when I try to run the same on JDK 1.6, it gives me an out of memory error as below:
There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 444 bytes for vframeArray::allocate
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (allocation.inline.hpp:44), pid=11749, tid=707259248
#
# JRE version: 6.0_32-b05
# Java VM: Java HotSpot(TM) Server VM (20.7-b02 mixed mode linux-x86 )
--------------- T H R E A D ---------------
Current thread (0x2a5b5000): JavaThread "main" [_thread_in_Java, id=11768, stack(0x2a22e000,0x2a27f000)]
Stack: [0x2a22e000,0x2a27f000], sp=0x2a27cd50, free space=315k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x7257e0]
How can I make the application run using JDK 1.6?
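One detail worth checking: the crash header above reports a 32-bit VM ("Java HotSpot(TM) Server VM ... linux-x86"), and the report itself lists "In 32 bit mode, the process size limit was hit" as a possible reason. A quick way to confirm which data model each installed JVM uses (sun.arch.data.model is HotSpot-specific) is:
// Prints the architecture and pointer size of the running JVM.
public class JvmBitness {
    public static void main(String[] args) {
        System.out.println("os.arch = " + System.getProperty("os.arch"));                // e.g. i386 vs amd64
        System.out.println("data model = " + System.getProperty("sun.arch.data.model")); // "32" or "64"
    }
}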

Unusual sysbench results Raspberry Pi

I have 2 Raspberry Pis that I wanted to benchmark for load-balancing purposes.
Raspberry pi Model B v1.1 - running Raspbian Jessie
Raspberry pi Model B+ v1.2 - running Raspbian Jessie
I installed sysbench on both systems and ran sysbench --num-threads=1 --test=cpu --cpu-max-prime=10000 --validate run on the first, and changed --num-threads=4 on the second, as it's quad-core, and ran both.
The results are not at all what I expected (I obviously expected the multithreaded benchmark to severely outperform the single-threaded one). When I ran the command with a single thread, performance was about the same on both systems. But when I changed the number of threads to 4 on the second Pi, it still took the same amount of time, except that the per-request statistics showed that the average request took about 4 times as long. I can't seem to grasp why this is.
Here are the results:
Raspberry pi v1.1
Single thread
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1325.0229s
total number of events: 10000
total time taken by event execution: 1324.9665
per-request statistics:
min: 131.00ms
avg: 132.50ms
max: 171.58ms
approx. 95 percentile: 137.39ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 1324.9665/0.00
Raspberry pi v1.2
Four threads
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1321.0618s
total number of events: 10000
total time taken by event execution: 5283.8876
per-request statistics:
min: 486.45ms
avg: 528.39ms
max: 591.60ms
approx. 95 percentile: 553.98ms
Threads fairness:
events (avg/stddev): 2500.0000/0.00
execution time (avg/stddev): 1320.9719/0.03
"Raspberry pi Model B+ v1.2" has the same CPU as "Raspberry pi Model B v1.1". Both boards are from the first generation of Raspberry Pi and they have 1 core CPU.
For 4 CPU you need Raspberry Pi 2 Model B instead of Raspberry pi Model B+.
Yeah, the naming is a bit confusing :(

akka custom fork-join-executor dispatcher behaves differently on OSX and RHEL

When I deploy a Play Framework application, using the Akka framework, to a production machine, it behaves differently than on my development workstation.
This is a system that receives a batch of device IP addresses, performs some processing on each device, and aggregates the results after all devices in the batch have been processed. This processing isn't very CPU intensive.
I basically have 2 types of actors: a BatchActor and a DeviceActor. For the devices, I've created an actor backed by a RoundRobinPool router and a custom dispatcher. I'm attempting to process ~500 devices at a time (in parallel).
The issue is that when I run this code on my OSX machine, it runs as I would expect.
For instance, if I submit a batch of 200 device IP addresses, the application running on my workstation processes all the devices in parallel.
However, when I copy this application to the production machine, Red Hat Enterprise Linux (RHEL), and run it, submitting the same list of devices, it only processes 1 to 2 devices at a time.
What do I need to do to fix this issue?
The relevant code is as follows:
object Application extends Controller {
  ...
  val numberOfWorkers = 500
  val workers = Akka.system.actorOf(Props[DeviceActor]
    .withRouter(RoundRobinPool(nrOfInstances = numberOfWorkers))
    .withDispatcher("my-dispatcher")
  )
  def batchActor(config: BatchConfig) =
    Akka.system.actorOf(BatchActor.props(workers, config), s"batch-${config.batchId}")
  ...
  def batch = Action(parse.json) { request =>
    request.body.validate[BatchConfig] match {
      // validate returns a JsResult, so match on JsSuccess rather than the bare type
      case JsSuccess(config, _) => {
        ...
        val batch = batchActor(config)
        batch ! BatchActorProtocol.Start
        Ok(Json.toJson(status))
      }
      ...
    }
  }
The application.conf configuration section looks like the following:
my-dispatcher {
# Dispatcher is the name of the event-based dispatcher
type = Dispatcher
# What kind of ExecutionService to use
executor = "fork-join-executor"
# Configuration for the fork join pool
fork-join-executor {
# Min number of threads to cap factor-based parallelism number to
parallelism-min = 1000
# Parallelism (threads) ... ceil(available processors * factor)
parallelism-factor = 100.0
# Max number of threads to cap factor-based parallelism number to
parallelism-max = 5000
}
# Throughput defines the maximum number of messages to be
# processed per actor before the thread jumps to the next actor.
# Set to 1 for as fair as possible.
throughput = 500
}
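For context, Akka derives the pool size for a fork-join-executor roughly as the number of available processors times parallelism-factor, clamped between parallelism-min and parallelism-max. A small sketch of that arithmetic (the core count is assumed for illustration):
// Rough sketch of how the dispatcher settings above translate into threads.
int cores = Runtime.getRuntime().availableProcessors(); // e.g. 24 on the server
double factor = 100.0;
int min = 1000, max = 5000;
int parallelism = Math.min(max, Math.max(min, (int) Math.ceil(cores * factor)));
// e.g. with 24 cores: ceil(24 * 100.0) = 2400 threads, far more than the hardware can run at once.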
Inside the BatchActor I'm simply parsing the list of devices and feeding it to the worker pool:
class BatchActor(val workers: ActorRef, val config: BatchConfig) extends Actor {
  ...
  def receive = {
    case Start => start
    ...
  }
  private def start = {
    ...
    devices.map { device =>  // renamed from the shadowing name `devices` for clarity
      results(device.host) = None
      workers ! DeviceWork(self, config, device, steps)
    }
    ...
  }
}
after which the WorkerActor submits a result object back to the BatchActor.
My workstation: OS X - v10.9.3
java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
production machine: Red Hat Enterprise Linux Server release 6.5 (Santiago)
java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Software:
Scala: v2.11.2
SBT: v0.13.6
Play: v2.3.5
Akka: v2.3.4
I'm using typesafe activator/sbt to start the application. The command is as follows:
cd <project dir>
./activator run -Dhttp.port=6600
Any help appreciated. I've been stuck on this issue for a couple of days now.
I believe you have too much parallelism in your code, i.e., you are creating too many threads in your dispatcher. How many cores do you have on your Red Hat box? I've never seen such high values used. A lot of threads in an FJ pool may be resulting in a large number of context switches. Try just using the default dispatcher and see if that fixes your issue. You can also change the values of min and max parallelism to 2 or 3 times the number of cores you have:
fork-join-executor {
# Min number of threads to cap factor-based parallelism number to
parallelism-min = 1000
# Parallelism (threads) ... ceil(available processors * factor)
parallelism-factor = 100.0
# Max number of threads to cap factor-based parallelism number to
parallelism-max = 5000
}
Another thing to try is to create an uber jar using sbt-assembly, and then deploy that instead of using activator to deploy it.
Finally, you can look inside your JVMs using something like VisualVM or YourKit.
After hours spent trying different things including:
doing research on different threading implementations on linux - pthreads vs NPTL
reading through all the VM documentation on threading
ulimits
trying various changes in the Play and Akka framework configurations
and finally a complete re-write of the thread management using scala futures, etc..
Nothing seemed to work. Then I did a detailed comparison, and the only difference was that I used the Oracle HotSpot implementation on my laptop and the OpenJDK implementation on the production machine.
So I installed the Oracle VM on the production machine and that seemed to fix the issue. Even though I couldn't determine the ultimate root cause, it seems that the default installation of OpenJDK on RHEL is compiled or configured differently enough to not allow spawning of ~500 threads at a time.
I'm sure I'm missing something, but after ~ 3 days of searching I couldn't find it.

mongodb higher faults on Windows than on Linux

I am executing the below C# code (ctr and coll are declared outside the snippet):
// Assumed setup, not shown in the question:
// int ctr = 0;
// MongoCollection<BsonDocument> coll = ...;  // the target collection
for (; ; )
{
    Console.WriteLine("Doc# {0}", ctr++);
    BsonDocument log = new BsonDocument();
    log["type"] = "auth";
    BsonDateTime time = new BsonDateTime(DateTime.Now);
    log["when"] = time;
    log["user"] = "staticString";
    BsonBoolean bol = BsonBoolean.False;
    log["res"] = bol;
    coll.Insert(log);  // insert one small document per iteration, forever
}
When I run it against a MongoDB instance (version 2.0.2) running on a virtual 64-bit Linux machine with just 512 MB of RAM, I get about 5k inserts with 1-2 faults, as reported by mongostat after a few minutes.
When the same code is run against a MongoDB instance (version 2.0.2) running on a physical Windows machine with 8 GB of RAM, I get 2.5k inserts with about 80 faults, as reported by mongostat after a few minutes.
Why are more faults occurring on Windows? I can see the following message in the logs:
[DataFileSync] FlushViewOfFile failed 33 file
Journaling is disabled on both instances.
Also, are 5k inserts on a virtual machine with 1-2 faults a good enough speed, or should I be expecting better?
Looks like this is a known issue - https://jira.mongodb.org/browse/SERVER-1163. The page fault counter on Windows is in fact the total page faults, which includes both hard and soft page faults.
Process : Page Faults/sec. This is an indication of the number of page faults that
occurred due to requests from this particular process. Excessive page faults from a
particular process are an indication usually of bad coding practices. Either the
functions and DLLs are not organized correctly, or the data set that the application
is using is being called in a less than efficient manner.