Unusual sysbench results Raspberry Pi - raspberry-pi

I have two Raspberry Pis that I wanted to benchmark for load-balancing purposes.
Raspberry Pi Model B v1.1 - running Raspbian Jessie
Raspberry Pi Model B+ v1.2 - running Raspbian Jessie
I installed sysbench on both systems and ran sysbench --num-threads=1 --test=cpu --cpu-max-prime=10000 --validate run on the first, changing --num-threads=4 on the second, as it is quad-core, and ran both.
The results are not at all what I expected (I obviously expected the multithreaded benchmark to severely outperform the single-threaded one). When I ran the command with a single thread, performance was about the same on both systems. But when I changed the number of threads to 4 on the second Pi, it still took the same amount of time, except that the per-request statistics showed that the average request took about four times as long. I can't seem to grasp why this is.
Here are the results:
Raspberry pi v1.1
Single thread
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1325.0229s
total number of events: 10000
total time taken by event execution: 1324.9665
per-request statistics:
min: 131.00ms
avg: 132.50ms
max: 171.58ms
approx. 95 percentile: 137.39ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 1324.9665/0.00
Raspberry pi v1.2
Four threads
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1321.0618s
total number of events: 10000
total time taken by event execution: 5283.8876
per-request statistics:
min: 486.45ms
avg: 528.39ms
max: 591.60ms
approx. 95 percentile: 553.98ms
Threads fairness:
events (avg/stddev): 2500.0000/0.00
execution time (avg/stddev): 1320.9719/0.03

"Raspberry pi Model B+ v1.2" has the same CPU as "Raspberry pi Model B v1.1". Both boards are from the first generation of Raspberry Pi and they have 1 core CPU.
For 4 CPU you need Raspberry Pi 2 Model B instead of Raspberry pi Model B+.
Yeah, the naming is a bit confusing :(
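If you want to verify the core count directly on each board, a quick check with standard Linux tools (nproc is part of coreutils):
nproc
grep -c ^processor /proc/cpuinfo
Both commands print 1 on any first-generation Pi and 4 on a Pi 2 Model B.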


Drools performance

I have an issue regarding the performance of Drools on different machines.
I made a very simple JMH benchmark test:
package ge.magticom.rules.benchmark;
import ge.magticom.rules.benchmark.Subscriber

rule "bali.free.smsparty"
    activation-group "main"
    salience 4492
    when
        $subs:Subscriber(chargingProfileID == 2)
    then
        $subs.setResult(15);
end

rule "bali.free.smsparty5"
    activation-group "main"
    salience 4492
    when
        $subs:Subscriber(chargingProfileID == 3)
    then
        $subs.setResult(14);
end
@Benchmark
public Subscriber send() throws Exception {
    Subscriber subscriber = new Subscriber();
    subscriber.setChargingProfileID(5);
    StatelessKieSession session = ruleBase.newStatelessKieSession();
    ArrayList<Object> objs = new ArrayList<Object>();
    objs.add(subscriber);
    session.execute(objs);
    return subscriber;
}
On my home development machine:
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
(Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads), 64 GB of memory, with JDK 11, performance is very good:
With 7 threads I get nearly 2M operations per second (stateless):
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 5 2154292.750 ± 149405.498 ops/s
But on the preproduction server, an Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 112 threads and 1 TB of RAM, I get half the performance (even when increasing threads):
NAME="Oracle Linux Server"
VERSION="8.4"
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 5 1084939.195 ± 107897.663 ops/s
I'm trying to test our billing system with Java 11 and Drools 7.54.0.Final.
Our system was based on JRockit Real Time 1.6 and Drools version 4.0.3. We are moving the system from Sun Solaris SPARC to an Intel-based system.
Running the same rules with JRockit 1.6, I got an even worse performance gap between the home and preproduction environments:
Home test benchmark:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 20 692054.563 ± 3507.519 ops/s
Preproduction benchmark:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 20 382283.288 ± 6405.953 ops/s
As you can see, it's nearly half the performance even for very simple rules.
But for real rules, such as our online charging system, performance is even worse:
On home environment I got
Benchmark Mode Cnt Score Error Units
WorkerBenchmark.send thrpt 5 152.846 ± 87.076 ops/s
This means one message contains nearly 100 iterations,
so in 00:01:49 the benchmark processed 16,287 sessions with 430,590 rule-call events. A single rule call takes about 2.33 ms on average, which is not great, but not as bad as on preproduction.
On Preproduction server
Benchmark Mode Cnt Score Error Units
WorkerBenchmark.send thrpt 5 35.013 ± 9.565 ops/s
In 00:01:54 I got only 3,723 sessions containing 98,571 rule-call events in total. Each call takes 10.7299 ms on average.
While these benchmarks were running, nothing else was running on the preproduction system; on the home machine, by contrast, many development tools were open and the tests were launched from IntelliJ IDEA.
Can you suggest anything that might cause such a difference in performance? I tried different Java versions and vendors; these results are based on oracle-jdk-11.0.8.
Here are kernel params of Preproduction server:
fs.file-max = 6815744
kernel.sem = 2250 32000 100 128
kernel.shmmni = 4096
kernel.shmall = 1073741824
kernel.shmmax = 4398046511104
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
This is just a very wild guess since I definitely don't have enough information, but are the two environments using the same garbage collectors, configured in the same way? Maybe you're using ParallelGC (which in my experience is better for the pure throughput you're measuring) on one side and G1 on the other?
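One low-effort way to check: HotSpot prints its ergonomically selected flags, including the collector, with a standard option, so you can diff the output from the two machines directly:
java -XX:+PrintCommandLineFlags -version
On JDK 11 this shows the chosen GC (for example -XX:+UseG1GC) along with default heap sizing.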
Thanks for the answer.
I used several GC configurations, and none of them was ParallelGC. I don't think GC is the problem: I used ZGC in the final tests and GC pause times never exceeded 5 ms (I also tested with Java 16, where pause times were below 100 microseconds):
@Fork(value = 2, jvmArgs = {"--illegal-access=permit", "-Xms10G", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
"-Xmx10G", "-XX:+UnlockExperimentalVMOptions", "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10", "-XX:+UseZGC", "-XX:+UsePerfData", "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M"})
java -version
java version "11.0.8" 2020-07-14 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.8+10-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.8+10-LTS, mixed mode)
Here are the flame graphs generated with async-profiler.
As you can see, in the home environment the Java process uses 95% of the total time, but on the server only 65%. The throughput difference is also obvious:
RulesBenchmark.send thrpt 5 1612318.098 ± 64712.672 ops/s
Home Result FlameGraph.html
RulesBenchmark.send thrpt 5 775498.081 ± 72237.890 ops/s
Server Flame Graph.html

How to count cache-misses in mmap-ed memory (using eBPF)?

I would like to get a time series
t0, misses
...
tN, misses
where tN is a timestamp (second resolution) and misses is the number of times the kernel performed disk I/O for my PID to load a missing page of the mmap()-ed memory region when the process accessed that memory. OK, maybe the connection between disk I/O and memory access is harder to track, so let's assume my program does no disk I/O for any reason other than accessing missing mmapped memory. I THINK I need to track something called node-load-misses in the perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
I tried to use perf record for a similar purpose, but I dislike how much data perf records. As I recall, the attempt looked like this (I also don't remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF could give some facility to implement such thing in less overhead way.
Loading mmapped pages that are not present in memory is not a hardware event like perf's cache-misses, node-loads, or node-load-misses. When your program accesses a not-present memory address, a page-fault exception is generated by the hardware and handled in software by the Linux kernel. On first access to anonymous memory, a physical page is allocated and mapped at that virtual address; on access to an mmapped file, disk I/O is initiated. There are two kinds of page faults in Linux, minor and major; a fault that requires disk I/O is a major page fault.
You should try trace-cmd, ftrace, or perf trace. Support for fault tracing was planned for the perf tool in 2012, and patches were proposed at https://lwn.net/Articles/602658/
There is a tracepoint for page faults from userspace code, and this command prints some events with the memory address of each page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With a recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record, which can record both mmap syscalls and page_fault_user events into perf.data; perf script will print all the events, and they can be counted by an awk or python script.
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
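For the eBPF route specifically, bpftrace can attach to the kernel's software events for faults and aggregate them per second, which yields the time series you describe directly. A minimal sketch (assuming bpftrace is installed; 12345 is a placeholder for your PID):
bpftrace -e '
software:major-faults:1 /pid == 12345/ { @misses = count(); }        /* each major fault implies disk I/O */
interval:s:1 { time("%H:%M:%S "); print(@misses); clear(@misses); }  /* emit and reset once per second */
'
Major faults are exactly the ones that required disk I/O, so under your stated assumption they count the misses on the mmapped region.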
And for simple time-series stats you can use the perf stat -I 1000 command with the relevant software events:
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
# time counts unit events
1.000112251 413.59 msec cpu-clock # 0.414 CPUs utilized
1.000112251 5,361 page-faults # 0.013 M/sec
1.000112251 5,301 minor-faults # 0.013 M/sec
1.000112251 60 major-faults # 0.145 K/sec
2.000490561 16.32 msec cpu-clock # 0.016 CPUs utilized
2.000490561 1 page-faults # 0.005 K/sec
2.000490561 1 minor-faults # 0.005 K/sec
2.000490561 0 major-faults # 0.000 K/sec

Is there an equivalent register to Intel's MSR_SMI_COUNT on AMD architecture?

On recent Intel CPUs it's possible to count the number of SMIs that have occurred by reading MSR 0x34.
I have checked the manuals at
https://developer.amd.com/resources/developer-guides-manuals/
for an equivalent register/function, without success.
AMD Zen specifies the LsSmiRx performance counter for System Management Interrupts (SMIs):
PMCx02B [SMIs Received] (Core::X86::Pmc::Core::LsSmiRx)
Counts the number of SMIs received.
(Open-Source Register Reference for AMD Family 17h Processors, Models 00h-2Fh, Rev 3.03, 2018, page 153)
On Linux, you can monitor it like this:
# perf stat -e ls_smi_rx -I 60000
This command prints each minute a count of all newly triggered SMIs aggregated over all CPUs.
That means that for monitoring, unlike with the MSR_SMI_COUNT register available on Intel CPUs, you have to actively program a PMU register (to observe the LsSmiRx event).
NB: The above referenced AMD documentation confirms that AMD Zen doesn't support the SMI_COUNT MSR (0x34), since it isn't included in the list of available MSRs (in Chapter 2.1.10, page 77).
No, but SMI count is available as a PMC (performance counter) on AMD processors.
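For comparison, on Intel the cumulative count can be read directly from the MSR with msr-tools (a sketch; it assumes the msr kernel module is loaded and you have root):
sudo modprobe msr
sudo rdmsr -a 0x34    # prints one count per logical CPU
On AMD Zen there is no equivalent MSR, so the perf stat approach above, programming the LsSmiRx PMC, is the way to go.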

Kitura slow or low requests per second?

I've downloaded Kitura 0.20 and created a new project for a benchmark, built with swift build -c release:
import Kitura

let router = Router()
router.get("/") { request, response, next in
    response.send("Hello, World!")
    next()
}

Kitura.addHTTPServer(onPort: 8090, with: router)
Kitura.run()
and the score appears to be low compared to Zewo and Vapor, which can reportedly hit 400k+ requests/s:
MacBook-Pro:hello2 yanli$ wrk -t1 -c100 -d30 --latency http://localhost:8090
Running 30s test @ http://localhost:8090
1 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 415.36us 137.54us 3.09ms 91.11%
Req/Sec 5.80k 2.47k 7.19k 85.71%
Latency Distribution
50% 391.00us
75% 443.00us
90% 513.00us
99% 0.93ms
16229 requests in 30.01s, 1.67MB read
Socket errors: connect 0, read 342, write 55, timeout 0
Requests/sec: 540.84
Transfer/sec: 57.04KB
I suspect you are running out of ephemeral ports. Your issue is probably the same as this one: 'ab' program freezes after lots of requests, why?
Kitura currently does not support HTTP keepalive, and so every request requires a new connection. One symptom of this is that regardless of how many seconds you attempt to drive load, you'll see a similar number of completed requests (16229 in your example).
On OS X, there are 16,384 ephemeral ports available by default, and these will be rapidly exhausted unless you tune the network settings.
[1] http://danielmendel.github.io/blog/2013/04/07/benchmarkers-beware-the-ephemeral-port-limit/
[2] https://rolande.wordpress.com/2010/12/30/performance-tuning-the-network-stack-on-mac-osx-10-6/
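As a rough sanity check on the numbers: closed connections linger in TIME_WAIT for 2×MSL, which at the default 15 s MSL is 30 s, so 16,384 ephemeral ports sustain roughly 16384 / 30 ≈ 546 new connections per second without keepalive. That lines up with the 540.84 requests/sec wrk reported, and the 16,229 completed requests sit just under the port limit.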
My approach has been to reduce the Maximum Segment Lifetime tunable (which defaults to 15000, or 15 seconds) and increase the range of available ports temporarily while benchmarking, for example:
sudo sysctl -w net.inet.tcp.msl=1000
sudo sysctl -w net.inet.ip.portrange.first=32768
<run benchmark>
sudo sysctl -w net.inet.tcp.msl=15000
sudo sysctl -w net.inet.ip.portrange.first=49152
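To watch the port exhaustion happen, you can count sockets stuck in TIME_WAIT while the benchmark runs (standard macOS/BSD tooling):
netstat -an | grep -c TIME_WAIT
If that count climbs toward the ~16k port limit as the request rate collapses, the diagnosis fits.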

Can MongoDB send multiple update commands at once?

In my project, I have to update documents in MongoDB many times. I found that MongoDB supports inserting many documents in a single command with insert_many, but I can't use update_many to update many documents at once, because each has a different filter condition; I have to update them one by one.
With insert_many, I can insert more than 7,000 documents per second. In the same environment, only about 1,500 documents can be updated per second. It just seems inefficient to send thousands of commands when one would do.
Is it possible to send multiple update commands to the MongoDB server at once?
Thanks for your explanation @Blakes Seven. I have rewritten my program with the Bulk API, updating documents as "unordered" operations. Here is the speed report from my test environment:
1 thread: 12655 doc/s cpu: 150 - 200%
2 threads: 19005 doc/s cpu: 200 - 300%
3 threads: 24433 doc/s cpu: 300 - 400%
4 threads: 28957 doc/s cpu: 400 - 500%
5 threads: 35586 doc/s cpu: 500 - 600%
6 threads: 32942 doc/s cpu: 600+%
In my test environment, the test program and the MongoDB server run on the same machine, and scaling across multiple threads is not perfect. When I ran the program with a single thread, MongoDB's CPU usage was between 150% and 200%, so MongoDB did execute the operations in parallel; there just seems to be a limit on server-side parallelism per client connection.
Anyway, a single thread is enough for me; besides, fewer threads are more efficient per thread.
Another report, from the online environment, where client and server run on different machines:
1 thread: 14719 doc/s
2 threads: 26837 doc/s
3 threads: 34908 doc/s
4 threads: 46151 doc/s
5 threads: 47842 doc/s
6 threads: 52522 doc/s
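For reference, a minimal PyMongo sketch of the unordered bulk update described above (collection name, filters, and update shape are invented for illustration):
from pymongo import MongoClient, UpdateOne

client = MongoClient()  # assumes a local mongod; pass a URI otherwise
coll = client.test.docs

# One UpdateOne per document, each with its own filter condition,
# all shipped to the server in a single bulk_write round trip.
ops = [UpdateOne({"_id": i}, {"$set": {"value": i * 2}}) for i in range(1000)]
result = coll.bulk_write(ops, ordered=False)
print(result.matched_count, result.modified_count)
With ordered=False the server may execute the operations in parallel, which is where the gain over sending update commands one by one comes from.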
You can do that with db.collection.findAndModify().
Please go through the documentation:
https://docs.mongodb.com/manual/reference/method/db.collection.findAndModify/