Understanding the Phreak algorithm's performance - Drools

I'm trying to understand what is making Drools perform poorly in our use case. I read the Phreak documentation here: https://access.redhat.com/documentation/en-us/red_hat_decision_manager/7.4/html/decision_engine_in_red_hat_decision_manager/phreak-algorithm-con_decision-engine
It doesn't seem to mention anything about how node-to-node jumps are done. Here's an example to explain what I mean by that:
Let's say I have a Person object with three fields: name, lastName, SSN
I define a large number of rules this way: when lastName = 'Smith' and SSN = 'XXXXXXXXXX' then name = "Jane".
Assuming I have a large number of people with "Smith" as their last name, say 10,000, what is the complexity of getting a single name given a last name and an SSN? Would it take 10,000 comparisons, or does the "Smith" node keep some form of hash map of all the underlying SSNs?
What if instead of an SSN with an equality operator I used a range, like a date range, defining rules like this: "when lastName = 'Smith' and startedSchool >= 2005 and finishedSchool <= 2010"? Do nodes keep some fancy data structure to speed up queries with range operators?
EDIT: Per request I'm adding an example to explain how the rules are set up
We have a single class called Evidence as both input and output. We build each Evidence instance in a different activation-group and add it to a set.
We usually define a few catch-all, broad rules with low salience and add rules with more specific conditions and higher salience.
This is a sample of two rules with increasing salience. We define ~10K of these.
One can imagine a sort of tree structure where at each level the salience increases and the conditions become more stringent. The Evidence class functions as a sort of key-value pair, so many of them will have the same node in common (e.g. name = location).
To execute the rules, we would add two evidences (e.g. [name= location, stringValue='BBB'] and [name = code, stringValue = theCode]) and fire the rules.
rule "RULE 1::product"
salience 0
activation-group "product"
when
$outputs : EvidenceSet.Builder( )
Evidence( name == "location" && stringValue == "BBB" )
then
$outputs.addEvidences(Evidence.newBuilder().setName("Product").setStringValue("999"));
end
rule "RULE 2::product"
salience 1
activation-group "product"
when
$outputs : EvidenceSet.Builder( )
Evidence( name == "location" && stringValue == "BBB" )
Evidence( name == "code" && stringValue == "thecode" )
then
$outputs.addEvidences(Evidence.newBuilder().setName("Product").setStringValue("777"));
end

Here is the rule template
rule "RULE %s::product"
when
$outputs : EvidenceSet.Builder( )
Evidence( name == "location" && stringValue == "BBB" )
Evidence( name == "code" && stringValue matches ".*code.*" )
then
$outputs.addEvidences(Evidence.newBuilder().setName("Product").setStringValue("777"));
end
Here is the code that built the 10,000-rule DRL out of the template:
public static void main(String[] args) throws IOException {
    String template = Resources.toString(Resources.getResource("template.drl"), Charset.defaultCharset());
    try (PrintStream file = new PrintStream(new File("test.drl"))) {
        for (int i = 0; i < 10_000; i++)
            file.println(String.format(template, i));
    }
}
Here are the domain classes based on your DRL syntax (excluding equals, hashCode and toString); they look strange to me...
public class Evidence {
    public String name;
    public String stringValue;

    public Evidence() {
    }

    public Evidence(String name, String stringValue) {
        this.name = name;
        this.stringValue = stringValue;
    }

    public static Builder newBuilder() {
        return new Builder();
    }
}
public class EvidenceSet {
    public static Set<Evidence> evidenceSet = new HashSet<>();

    public static class Builder {
        public Evidence evidence = new Evidence();

        public Builder setName(String name) {
            evidence.name = name;
            return this;
        }

        public Builder setStringValue(String stringValue) {
            evidence.stringValue = stringValue;
            return this;
        }

        public void addEvidences(Builder builder) {
            evidenceSet.add(builder.evidence);
        }
    }
}
Here is the test which executes the 10k-rules file:
@DroolsSession("classpath:/test.drl")
public class PlaygroundTest {
    private PerfStat firstStat = new PerfStat("first");
    private PerfStat othersStat = new PerfStat("others");

    @Rule
    public DroolsAssert drools = new DroolsAssert();

    @Test
    public void testIt() {
        firstStat.start();
        drools.insertAndFire(new EvidenceSet.Builder(), new Evidence("location", "BBB"), new Evidence("code", "thecode"));
        firstStat.stop();
        for (int i = 0; i < 500; i++) {
            othersStat.start();
            drools.insertAndFire(new Evidence("code", "code" + i));
            othersStat.stop();
        }
        System.out.printf("first, ms: %s%n", firstStat.getStat().getAvgTimeMs());
        System.out.printf("others, ms: %s%n", othersStat.getStat().getAvgTimeMs());
        System.out.println(EvidenceSet.evidenceSet);
    }
}
The rule and the test should fit your requirement of matching all decision nodes in the rule. Agenda-group and salience have nothing to do with performance and are skipped here for simplicity.
The test without the 'for' section gives the following profiling result.
The 'winner' is org.drools.core.rule.JavaDialectRuntimeData.PackageClassLoader.internalDefineClass, which creates and registers the auto-generated Java class for the Drools then block; this is a heavyweight operation.
The test with the 'for' section gives a totally different picture.
Notice internalDefineClass stays the same, and the test code execution becomes notable. That is because the classes of the 10k rules were loaded, which took 12.5 seconds on my 6-core (12 logical) CPU machine. After that, all other insertions were executed in 700 ms on average. Let's analyze that time.
The then block of the rules is executed in 0.01 ms on average.
RULE 9::product - min: 0.00 avg: 0.01 max: 1.98 activations: 501
10,000 RHS * 0.01 ms = 100 ms is the time taken by the 10k rules to execute the following:
$outputs.addEvidences(Evidence.newBuilder().setName("Product").setStringValue("777"));
Dummy object creation and manipulation becomes notable because you multiply it by 10k. Most of that time was spent printing test-runner output to the console, according to the second profiling results. And a total of 100 ms of RHS execution is quite fast, taking into account your crude object creation to serve 'the builder' paradigm.
To answer your question, I would guess at and suggest the following issues:
You may not be reusing the compiled rules session (see the sketch after this list).
The then block execution time can be underestimated: if you have even a trivial String.format somewhere in the RHS, it will be notable in the overall execution time simply because format is comparatively slow and you are dealing with 10k executions.
There could be cartesian rule matches. Because you are using a set for the results, you can only guess how many 'exactly the same' results were inserted into that set, implying huge execution waste.
If you are using a stateful session, I don't see any memory cleanup and object retraction, which may cause performance issues before an OOME. Here is the memory footprint of the run...
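On the first point: if the KieBase holding the 10k compiled rules is rebuilt per request, the ~12.5 s class-loading cost is paid every time. Below is a minimal sketch of compiling once and creating cheap sessions from the shared KieBase, assuming the rules are packaged on the classpath via a kmodule.xml (the class and method names here are illustrative, not from the original test):
import org.kie.api.KieBase;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class RuleRunner {
    // Built once per JVM; this is where the expensive rule compilation / class loading happens.
    private static final KieBase KIE_BASE;
    static {
        KieServices ks = KieServices.Factory.get();
        KieContainer container = ks.getKieClasspathContainer(); // assumes a kmodule.xml on the classpath
        KIE_BASE = container.getKieBase();                       // default KieBase with the 10k rules
    }

    // Each evaluation gets a fresh, cheap session from the shared KieBase.
    public static void evaluate(Object... facts) {
        KieSession session = KIE_BASE.newKieSession();
        try {
            for (Object fact : facts) {
                session.insert(fact);
            }
            session.fireAllRules();
        } finally {
            session.dispose(); // release working memory so facts don't accumulate
        }
    }
}
With this layout, internalDefineClass should show up once per JVM in the profile instead of once per evaluation.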
Day 2... cartesian matches
I removed the logging and made a test for cartesian matches:
@Test
public void testIt() {
    firstStat.start();
    drools.insertAndFire(new EvidenceSet.Builder(), new Evidence("location", "BBB"), new Evidence("code", "thecode"));
    firstStat.stop();
    for (int i = 0; i < 40; i++) {
        othersStat.start();
        drools.insertAndFire(new Evidence("location", "BBB"), new Evidence("code", "code" + i));
        othersStat.stop();
    }
    drools.printPerformanceStatistic();
    System.out.printf("first, ms: %s%n", firstStat.getStat().getAvgTimeMs());
    System.out.printf("others, ms: %s%n", othersStat.getStat().getAvgTimeMs());
    System.out.println(EvidenceSet.evidenceSet);
}
Notice we are inserting new Evidence("location", "BBB") on each iteration. I was only able to run the test with 40 iterations, otherwise I ended up with an OOME (something new to me). Here is the CPU and memory consumption for 40 iterations:
Each rule was triggered 1681 times! The RHS average time is not notable (optimized and removed from execution?), but the when block had to be evaluated...
RULE 999::product - min: 0.00 avg: 0.00 max: 1.07 activations: 1681
RULE 99::product - min: 0.00 avg: 0.00 max: 1.65 activations: 1681
RULE 9::product - min: 0.00 avg: 0.00 max: 1.50 activations: 1681
first, ms: 10271.322
others, ms: 1614.294
After removing the cartesian match, the same test executes much faster and without memory and GC issues:
RULE 999::product - min: 0.00 avg: 0.04 max: 1.27 activations: 41
RULE 99::product - min: 0.00 avg: 0.04 max: 1.28 activations: 41
RULE 9::product - min: 0.00 avg: 0.04 max: 1.52 activations: 41
first, ms: 10847.358
others, ms: 102.015
Now I can increase the iteration count up to 1000, and the average execution time per iteration is even lower:
RULE 999::product - min: 0.00 avg: 0.00 max: 1.06 activations: 1001
RULE 99::product - min: 0.00 avg: 0.00 max: 1.20 activations: 1001
RULE 9::product - min: 0.00 avg: 0.01 max: 1.79 activations: 1001
first, ms: 10986.619
others, ms: 71.748
Inserting facts without retraction would have its limits, though.
Summary: you need to make sure you don't get cartesian matches.
A quick solution to avoid any cartesian matches is to keep Evidence facts unique. The following is not the 'common approach' to doing this, but it will not require changing the rules or adding even more complexity to their conditions.
Add a method to the Evidence class:
public boolean isDuplicate(Evidence evidence) {
    return this != evidence && hashCode() == evidence.hashCode() && equals(evidence);
}
Add a rule with the highest salience:
rule "unique known evidence"
salience 10000
when
e: Evidence()
Evidence(this.isDuplicate(e))
then
retract(e);
end
This was tested and appeared to work with the previous JUnit test which reproduced the cartesian issue. Interestingly, for that very test (1000 iterations, 1000 rules), the isDuplicate method was invoked 2008004 times, and the total time taken by all invocations was 172.543 ms on my machine. The RHS of the rule (event retraction) took at least 3 times longer than the other rules (evidence collection).
RULE 999::product - min: 0.00 avg: 0.00 max: 1.07 activations: 1001
RULE 99::product - min: 0.00 avg: 0.01 max: 1.81 activations: 1001
RULE 9::product - min: 0.00 avg: 0.00 max: 1.25 activations: 1001
unique known evidence - min: 0.01 avg: 0.03 max: 1.88 activations: 1000
first, ms: 8974.7
others, ms: 69.197
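As a footnote, the same deduplication could be done on the Java side instead of with a retraction rule, by inserting an Evidence only when an equal one is not already in working memory. This is just a sketch relying on Evidence having proper equals/hashCode; InsertHelper and insertIfAbsent are made-up names, not part of the test above:
import org.kie.api.runtime.KieSession;

public class InsertHelper {
    // Insert the candidate only if no equal Evidence is already present in working memory.
    public static void insertIfAbsent(KieSession session, Evidence candidate) {
        boolean alreadyPresent = !session.getObjects(
                obj -> obj instanceof Evidence && obj.equals(candidate)).isEmpty();
        if (!alreadyPresent) {
            session.insert(candidate);
        }
    }
}
That keeps the cartesian-match guard out of every rule's conditions, at the cost of a scan of working memory per insert.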

Related

haproxy stats: qtime,ctime,rtime,ttime?

Running a web app behind HAProxy 1.6.3-1ubuntu0.1, I'm getting haproxy stats qtime,ctime,rtime,ttime values of 0,0,0,2704.
From the docs (https://www.haproxy.org/download/1.6/doc/management.txt):
58. qtime [..BS]: the average queue time in ms over the 1024 last requests
59. ctime [..BS]: the average connect time in ms over the 1024 last requests
60. rtime [..BS]: the average response time in ms over the 1024 last requests
(0 for TCP)
61. ttime [..BS]: the average total session time in ms over the 1024 last requests
I'm expecting response times in the 0-10 ms range. A ttime of 2704 milliseconds seems unrealistically high. Is it possible the units are off and this is 2704 microseconds rather than 2704 milliseconds?
Secondly, it seems suspicious that ttime isn't even close to qtime+ctime+rtime. Is total response time not the sum of the time to queue, connect, and respond? What is the other time, that is included in total but not queue/connect/response? Why can my response times be <1ms, but my total response times be ~2704 ms?
Here is my full csv stats:
$ curl "http://localhost:9000/haproxy_stats;csv"
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
http-in,FRONTEND,,,4707,18646,50000,5284057,209236612829,42137321877,0,0,997514,,,,,OPEN,,,,,,,,,1,2,0,,,,0,4,0,2068,,,,0,578425742,0,997712,22764,1858,,1561,3922,579448076,,,0,0,0,0,,,,,,,,
servers,server1,0,0,0,4337,20000,578546476,209231794363,41950395095,,0,,22861,1754,95914,0,no check,1,1,0,,,,,,1,3,1,,578450562,,2,1561,,6773,,,,0,578425742,0,198,0,0,0,,,,29,1751,,,,,0,,,0,0,0,2704,
servers,BACKEND,0,0,0,5919,5000,578450562,209231794363,41950395095,0,0,,22861,1754,95914,0,UP,1,1,0,,0,320458,0,,1,3,0,,578450562,,1,1561,,3922,,,,0,578425742,0,198,22764,1858,,,,,29,1751,0,0,0,0,0,,,0,0,0,2704,
stats,FRONTEND,,,2,5,2000,5588,639269,8045341,0,0,29,,,,,OPEN,,,,,,,,,1,4,0,,,,0,1,0,5,,,,0,5374,0,29,196,0,,1,5,5600,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,1,200,196,639269,8045341,0,0,,196,0,0,0,UP,0,0,0,,0,320458,0,,1,4,0,,0,,1,0,,5,,,,0,0,0,0,196,0,,,,,0,0,0,0,0,0,0,,,0,0,0,0,
In haproxy >2 you now get two values, n / n, which are the max within a sliding window and the average for that window. The max value remains the max across all sample windows until a higher value is found. On 1.8 you only get the average.
Example of haproxy 2 v 1.8. Note these proxies are used very differently and with dramatically different loads.
So it looks like the average response times, at least since the last reboot, are 66m and 275ms.
The average is computed as:
(data time sum + number of samples - 1) / number of samples
where the number of samples is the cumulative number of HTTP requests, capped at the 512-sample window (see swrate_avg and srv_samples_window below).
This might not be a perfect analysis so if anyone has improvements it'd be appreciated. This is meant to show how I came to the answer above so you can use it to gather more insight into the other counters you asked about. Most of this information was gathered from reading stats.c. The counters you asked about are defined here.
unsigned int q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
unsigned int qtime_max, ctime_max, dtime_max, ttime_max; /* maximum of conn_time, queue_time, data_time, total_time observed */
The stats page values are built from this code:
if (strcmp(field_str(stats, ST_F_MODE), "http") == 0)
    chunk_appendf(out, "<tr><th>- Responses time:</th><td>%s / %s</td><td>ms</td></tr>",
                  U2H(stats[ST_F_RT_MAX].u.u32), U2H(stats[ST_F_RTIME].u.u32));
chunk_appendf(out, "<tr><th>- Total time:</th><td>%s / %s</td><td>ms</td></tr>",
              U2H(stats[ST_F_TT_MAX].u.u32), U2H(stats[ST_F_TTIME].u.u32));
You asked about all the counters, but I'll focus on one. As can be seen in the snippet above for "Responses time:", ST_F_RT_MAX and ST_F_RTIME are the values displayed on the stats page as n (rtime_max) / n (rtime) respectively. They are defined as follows:
[ST_F_RT_MAX] = { .name = "rtime_max", .desc = "Maximum observed time spent waiting for a server response, in milliseconds (backend/server)" },
[ST_F_RTIME] = { .name = "rtime", .desc = "Time spent waiting for a server response, in milliseconds, averaged over the 1024 last requests (backend/server)" },
These set a "metric" value (among other things) in a case statement further down in the code:
case ST_F_RT_MAX:
    metric = mkf_u32(FN_MAX, sv->counters.dtime_max);
    break;
case ST_F_RTIME:
    metric = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, srv_samples_window));
    break;
These metric values give us a good look at what the stats page is telling us. The first value in "Responses time: 0 / 0", ST_F_RT_MAX, is a max value of time spent waiting. The second value, ST_F_RTIME, is the average time taken for each connection. Both are taken within a window of time, i.e. however long it takes for you to get 1024 connections.
For example "Responses time: 10000 / 20":
max time spent waiting (max value ever reached including http keepalive time) over the last 1024 connections 10 seconds
average time over the last 1024 connections 20ms
So for all intents and purposes
rtime_max = dtime_max
rtime = swrate_avg(d_time, srv_samples_window)
Which begs the question: what are dtime_max, d_time and srv_samples_window? These are the data time values and window; I couldn't actually figure out how the time values are being set, but at face value it's "some time" for the last 1024 connections. As pointed out here, keepalive times are included in the max totals, which is why the numbers are high.
Now that we know ST_F_RT_MAX is a max value and ST_F_RTIME is an average, an average of what?
/* compue time values for later use */
if (selected_field == NULL || *selected_field == ST_F_QTIME ||
    *selected_field == ST_F_CTIME || *selected_field == ST_F_RTIME ||
    *selected_field == ST_F_TTIME) {
        srv_samples_counter = (px->mode == PR_MODE_HTTP) ? sv->counters.p.http.cum_req : sv->counters.cum_lbconn;
        if (srv_samples_counter < TIME_STATS_SAMPLES && srv_samples_counter > 0)
                srv_samples_window = srv_samples_counter;
}
The TIME_STATS_SAMPLES value is defined as:
#define TIME_STATS_SAMPLES 512
unsigned int srv_samples_window = TIME_STATS_SAMPLES;
In HTTP mode, srv_samples_counter is sv->counters.p.http.cum_req. http.cum_req is defined as ST_F_REQ_TOT.
[ST_F_REQ_TOT] = { .name = "req_tot", .desc = "Total number of HTTP requests processed by this object since the worker process started" },
For example, if the value of http.cum_req is 10, then srv_samples_counter will be 10. The sample count appears to be the number of successful requests for a given sample window for a given backend server. d_time (data time) is passed as "sum" and is computed as some non-negative value, or it's counted as an error. I thought I found the code for how d_time is created, but I wasn't sure, so I haven't included it.
/* Returns the average sample value for the sum <sum> over a sliding window of
* <n> samples. Better if <n> is a power of two. It must be the same <n> as the
* one used above in all additions.
*/
static inline unsigned int swrate_avg(unsigned int sum, unsigned int n)
{
    return (sum + n - 1) / n;
}
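To make the arithmetic concrete, here is a small illustrative re-implementation of swrate_avg with numbers plugged in. The 512-sample window is TIME_STATS_SAMPLES from above; the sum value is invented so that the result lands on the 2704 ms ttime reported in the question:
public class SwrateAvgDemo {
    // Mirrors haproxy's swrate_avg(): average of <sum> over a sliding window of <n> samples.
    static long swrateAvg(long sum, long n) {
        return (sum + n - 1) / n;
    }

    public static void main(String[] args) {
        long window = 512;           // TIME_STATS_SAMPLES (or fewer, if fewer requests were seen)
        long tTimeSum = 1_384_448;   // hypothetical accumulated t_time for the window, in ms
        // (1_384_448 + 511) / 512 = 2704, i.e. the ttime column from the question
        System.out.println("ttime = " + swrateAvg(tTimeSum, window) + " ms");
    }
}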

before or after within variable window?

I am trying to design a rule system where the rules themselves can be externally configured, using a rule configuration object. Specifically, I want to configure externally to the DRL rule definition what the minimum time between rule firings for a particular type of rule should be.
My approach so far is to insert the rule configuration as a fact ValueRuleSpec:
rule "Any value detected"
when
$r : ValueRuleSpec(mode == ValueRuleSpecMode.ANY(), $devices: devices)
$e : SensorEvent(deviceId memberOf $devices) from entry-point FMSensorEvents
not Activation(ruleSpec == $r, cause == $e)
then
insert(new Activation($r, $e));
end
The $r ValueRuleSpec object has a property triggerEvery that contains the minimum number of seconds between activations. I know that this can be done statically by testing for the absence of an Activation object that is inside a specific range before $e using something like:
not Activation(this before[60s, 0s] $e)
How could I do this with a configurable time window, using the $r.triggerEvery property as the number of seconds?
Answering my own question based on the advice of laune.
The before keyword's behavior is described in the manual as:
$eventA : EventA( this before[ 3m30s, 4m ] $eventB )
The previous pattern will match if and only if the temporal distance
between the time when $eventA finished and the time when $eventB
started is between ( 3 minutes and 30 seconds ) and ( 4 minutes ). In
other words:
3m30s <= $eventB.startTimestamp - $eventA.endTimeStamp <= 4m
Looking at the source code for the before evaluator, we can see the same:
@Override
protected boolean evaluate(long rightTS, long leftTS) {
    long dist = leftTS - rightTS;
    return this.getOperator().isNegated() ^ (dist >= this.initRange && dist <= this.finalRange);
}
Based on this I've modified my code accordingly, and it seems to work correctly now:
rule "Any value detected"
when
$r : ValueRuleSpec(mode == ValueRuleSpecMode.ANY(), $devices: devices)
$e : SensorEvent(deviceId memberOf $devices) from entry-point FMSensorEvents
not Activation(ruleSpec == $r, cause == $e)
// no activation within past triggerEvery seconds for same device
not Activation(
ruleSpec == $r,
deviceId == $e.deviceId,
start.time > ($e.start.time - ($r.triggerEvery * 1000))
)
then
insert(new Activation($r, $e));
end
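For completeness, this is roughly the shape of the Activation fact the rule assumes; it is a sketch inferred from the constraints (ruleSpec, cause, deviceId, start.time), not the original class:
import java.util.Date;

public class Activation {
    private final ValueRuleSpec ruleSpec; // the rule configuration that produced this activation
    private final SensorEvent cause;      // the event that triggered it
    private final String deviceId;        // copied from the cause so rules can match on it directly (type assumed)
    private final Date start;             // activation timestamp, used by the start.time window check

    public Activation(ValueRuleSpec ruleSpec, SensorEvent cause) {
        this.ruleSpec = ruleSpec;
        this.cause = cause;
        this.deviceId = cause.getDeviceId(); // assumes SensorEvent exposes getDeviceId()
        this.start = new Date();
    }

    public ValueRuleSpec getRuleSpec() { return ruleSpec; }
    public SensorEvent getCause() { return cause; }
    public String getDeviceId() { return deviceId; }
    public Date getStart() { return start; }
}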

MongoDB Concurrency Bottleneck

Too Long; Didn't Read
The question is about a concurrency bottleneck I am experiencing on MongoDB. If I make one query, it takes 1 unit of time to return; if I make 2 concurrent queries, both take 2 units of time to return; generally, if I make n concurrent queries, all of them take n units of time to return. My question is about what can be done to improve Mongo's response times when faced with concurrent queries.
The Setup
I have an m3.medium instance on AWS running a MongoDB 2.6.7 server. An m3.medium has 1 vCPU (1 core of a Xeon E5-2670 v2), 3.75 GB of RAM and a 4 GB SSD.
I have a database with a single collection named user_products. A document in this collection has the following structure:
{ user: <int>, product: <int> }
There are 1000 users and 1000 products and there's a document for every user-product pair, totaling a million documents.
The collection has an index { user: 1, product: 1 } and my results below are all indexOnly.
The Test
The test was executed from the same machine where MongoDB is running. I am using the benchRun function provided with Mongo. During the tests, no other accesses to MongoDB were being made and the tests only comprise read operations.
For each test, a number of concurrent clients is simulated, each of them making a single query as many times as possible until the test is over. Each test runs for 10 seconds. The concurrency is tested in powers of 2, from 1 to 128 simultaneous clients.
The command to run the tests:
mongo bench.js
Here's the full script (bench.js):
var
    seconds = 10,
    limit = 1000,
    USER_COUNT = 1000,
    concurrency,
    savedTime,
    res,
    timediff,
    ops,
    results,
    docsPerSecond,
    latencyRatio,
    currentLatency,
    previousLatency;

ops = [
    {
        op: "find",
        ns: "test_user_products.user_products",
        query: {
            user: { "#RAND_INT": [ 0, USER_COUNT - 1 ] }
        },
        limit: limit,
        fields: { _id: 0, user: 1, product: 1 }
    }
];

for (concurrency = 1; concurrency <= 128; concurrency *= 2) {
    savedTime = new Date();
    res = benchRun({
        parallel: concurrency,
        host: "localhost",
        seconds: seconds,
        ops: ops
    });
    timediff = new Date() - savedTime;
    docsPerSecond = res.query * limit;
    currentLatency = res.queryLatencyAverageMicros / 1000;
    if (previousLatency) {
        latencyRatio = currentLatency / previousLatency;
    }
    results = [
        savedTime.getFullYear() + '-' + (savedTime.getMonth() + 1).toFixed(2) + '-' + savedTime.getDate().toFixed(2),
        savedTime.getHours().toFixed(2) + ':' + savedTime.getMinutes().toFixed(2),
        concurrency,
        res.query,
        currentLatency,
        timediff / 1000,
        seconds,
        docsPerSecond,
        latencyRatio
    ];
    previousLatency = currentLatency;
    print(results.join('\t'));
}
Results
Results are always looking like this (some columns of the output were omitted to facilitate understanding):
concurrency queries/sec avg latency (ms) latency ratio
1 459.6 2.153609008 -
2 460.4 4.319577324 2.005738882
4 457.7 8.670418178 2.007237636
8 455.3 17.4266174 2.00989353
16 450.6 35.55693474 2.040380754
32 429 74.50149883 2.09527338
64 419.2 153.7325095 2.063482104
128 403.1 325.2151235 2.115460969
If only 1 client is active, it is capable of doing about 460 queries per second over the 10 second test. The average response time for a query is about 2 ms.
When 2 clients are concurrently sending queries, the query throughput maintains at about 460 queries per second, showing that Mongo hasn't increased its response throughput. The average latency, on the other hand, literally doubled.
For 4 clients, the pattern continues. Same query throughput, average latency doubles in relation to 2 clients running. The column latency ratio is the ratio between the current and previous test's average latency. See that it always shows the latency doubling.
Update: More CPU Power
I decided to test with different instance types, varying the number of vCPUs and the amount of available RAM. The purpose is to see what happens when you add more CPU power. Instance types tested:
Type vCPUs RAM(GB)
m3.medium 1 3.75
m3.large 2 7.5
m3.xlarge 4 15
m3.2xlarge 8 30
Here are the results:
m3.medium
concurrency queries/sec avg latency (ms) latency ratio
1 459.6 2.153609008 -
2 460.4 4.319577324 2.005738882
4 457.7 8.670418178 2.007237636
8 455.3 17.4266174 2.00989353
16 450.6 35.55693474 2.040380754
32 429 74.50149883 2.09527338
64 419.2 153.7325095 2.063482104
128 403.1 325.2151235 2.115460969
m3.large
concurrency queries/sec avg latency (ms) latency ratio
1 855.5 1.15582069 -
2 947 2.093453854 1.811227185
4 961 4.13864589 1.976946318
8 958.5 8.306435055 2.007041742
16 954.8 16.72530889 2.013536347
32 936.3 34.17121062 2.043083977
64 927.9 69.09198599 2.021935563
128 896.2 143.3052382 2.074122435
m3.xlarge
concurrency queries/sec avg latency (ms) latency ratio
1 807.5 1.226082735 -
2 1529.9 1.294211452 1.055566166
4 1810.5 2.191730848 1.693487447
8 1816.5 4.368602642 1.993220402
16 1805.3 8.791969257 2.01253581
32 1770 17.97939718 2.044979532
64 1759.2 36.2891598 2.018374668
128 1720.7 74.56586511 2.054769676
m3.2xlarge
concurrency queries/sec avg latency (ms) latency ratio
1 836.6 1.185045183 -
2 1585.3 1.250742872 1.055438974
4 2786.4 1.422254414 1.13712774
8 3524.3 2.250554777 1.58238551
16 3536.1 4.489283844 1.994745425
32 3490.7 9.121144097 2.031759277
64 3527 18.14225682 1.989033023
128 3492.9 36.9044113 2.034168718
Starting with the xlarge type, we begin to see it finally handling 2 concurrent queries while keeping the query latency virtually the same (1.29 ms). It doesn't last too long, though, and for 4 clients it again doubles the average latency.
With the 2xlarge type, Mongo is able to keep handling up to 4 concurrent clients without raising the average latency too much. After that, it starts to double again.
The question is: what could be done to improve Mongo's response times with respect to the concurrent queries being made? I expected to see a rise in query throughput, and I did not expect to see the average latency doubling. It clearly shows that Mongo is not able to parallelize the queries that are arriving.
There's some kind of bottleneck somewhere limiting Mongo, but it certainly doesn't help to keep adding more CPU power, since the cost will be prohibitive. I don't think memory is an issue here, since my entire test database fits in RAM easily. Is there something else I could try?
You're using a server with 1 core and you're using benchRun. From the benchRun page:
This benchRun command is designed as a QA baseline performance measurement tool; it is not designed to be a "benchmark".
The scaling of the latency with the concurrency numbers is suspiciously exact. Are you sure the calculation is correct? I could believe that the ops/sec/runner was staying the same, with the latency/op also staying the same, as the number of runners grew - and then if you added all the latencies, you would see results like yours.
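To make that concrete: if a single core sustains roughly 460 queries/sec in total and the runners are effectively served one at a time, each of c concurrent runners should observe a latency of about c / 460 seconds, which tracks the measured column fairly closely. A small illustrative check (the 460 figure is taken from the m3.medium results above; this is a back-of-the-envelope model, not a claim about MongoDB internals):
public class LatencyCheck {
    public static void main(String[] args) {
        double totalThroughput = 460.0; // queries/sec a single core sustained in the m3.medium test
        int[] concurrency = {1, 2, 4, 8, 16, 32, 64, 128};
        for (int c : concurrency) {
            // If requests are serialized on one core, each client waits for roughly
            // c requests' worth of service time: latency ~= c / totalThroughput.
            double expectedLatencyMs = c / totalThroughput * 1000.0;
            System.out.printf("concurrency=%3d -> expected latency ~%.1f ms%n", c, expectedLatencyMs);
        }
    }
}
The measured throughput drops slightly at high concurrency, so the measured latencies run somewhat above these estimates.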

Performance slowdown with growth of the number of operations

I have the following problem: my code's performance depends on the number of operations! How can that be? (I use gcc 4.3.2 under openSUSE 11.1.)
Here is my code:
#include <cstdio>
#include <cstdlib>
#include <ctime>

#define N_MAX 1000000
typedef unsigned int uint;

double a[N_MAX];
double b[N_MAX];
uint n;

int main() {
    // fill a[] with random values in [0, 1]
    for (uint i = 0; i < N_MAX; i++) {
        a[i] = (double)rand() / RAND_MAX;
    }
    for (uint n = 100000; n < 500000; n += 5000) {
        uint time1 = time(NULL);
        for (uint i = 0; i < n; ++i)
            for (uint j = 0; j < n; ++j)
                b[j] = a[j];
        uint time2 = time(NULL);
        double time = double(time2 - time1);
        printf("%5u ", n);
        printf("%5.2f %.3f\n", time, ((double)n * n / time) / 1e9);
    }
    return 0;
}
And here is the log of results:
n       time, s   Gflops
200000 23.00 1.739
205000 24.00 1.751
210000 25.00 1.764
215000 26.00 1.778
220000 27.00 1.793
225000 29.00 1.746
230000 30.00 1.763
235000 32.00 1.726
240000 32.00 1.800
245000 34.00 1.765
250000 36.00 1.736
255000 37.00 1.757
260000 38.00 1.779
265000 40.00 1.756
270000 42.00 1.736
275000 44.00 1.719
280000 46.00 1.704
285000 48.00 1.692
290000 49.00 1.716
295000 51.00 1.706
300000 54.00 1.667
305000 54.00 1.723
310000 59.00 1.629
315000 61.00 1.627
320000 66.00 1.552
325000 71.00 1.488
330000 76.00 1.433
335000 79.00 1.421
340000 84.00 1.376
345000 85.00 1.400
350000 89.00 1.376
355000 96.00 1.313
360000 102.00 1.271
365000 110.00 1.211
370000 121.00 1.131
375000 143.00 0.983
380000 156.00 0.926
385000 163.00 0.909
There is also an image, but I can't post it because of new-user restrictions. But here is the log plot.
What is the reason for this slowdown?
How do I get rid of it? Please help!
Your inner loops increase the number of iterations every time, so they are expected to take more time to do their job when there are more calculations to do. The first time there are 100k operations to be done, the second time 105k operations, and so on. It simply must take more and more time.
EDIT: To be clearer, I was trying to say that it looks like something Joel Spolsky called Shlemiel the painter's algorithm.
Thanks a lot for the response!
My expectation is based on the idea that the number of operations per unit of time should remain the same. So if I increase the number of operations, say twofold, then the computation time should also increase twofold, and therefore the ratio of the number of operations to the computation time should be constant. This is not what I am seeing. :(

Instrument for counting the number of method calls on iPhone

The Time Profiler can measure the amount of time spent in certain methods. Is there a similar instrument that measures the number of times a method is called?
DTrace can do this, but only in the iPhone Simulator (it's supported by Snow Leopard, but not yet by iOS). I have two writeups about this technology on MacResearch here and here where I walk through some case studies of using DTrace to look for specific methods and when they are called.
For example, I created the following DTrace script to measure the number of times methods were called on classes with the CP prefix, as well as total up the time spent in those methods:
#pragma D option quiet
#pragma D option aggsortrev

dtrace:::BEGIN
{
    printf("Sampling Core Plot methods ... Hit Ctrl-C to end.\n");
    starttime = timestamp;
}

objc$target:CP*::entry
{
    starttimeformethod[probemod,probefunc] = timestamp;
    methodhasenteredatleastonce[probemod,probefunc] = 1;
}

objc$target:CP*::return
/methodhasenteredatleastonce[probemod,probefunc] == 1/
{
    this->executiontime = (timestamp - starttimeformethod[probemod,probefunc]) / 1000;
    @overallexecutions[probemod,probefunc] = count();
    @overallexecutiontime[probemod,probefunc] = sum(this->executiontime);
    @averageexecutiontime[probemod,probefunc] = avg(this->executiontime);
}

dtrace:::END
{
    milliseconds = (timestamp - starttime) / 1000000;
    normalize(@overallexecutiontime, 1000);
    printf("Ran for %u ms\n", milliseconds);
    printf("%30s %30s %20s %20s %20s\n", "Class", "Method", "Total CPU time (ms)", "Executions", "Average CPU time (us)");
    printa("%30s %30s %20@u %20@u %20@u\n", @overallexecutiontime, @overallexecutions, @averageexecutiontime);
}
This generates the following nicely formatted output:
Class Method Total CPU time (ms) Executions Average CPU time (us)
CPLayer -drawInContext: 6995 352 19874
CPPlot -drawInContext: 5312 88 60374
CPScatterPlot -renderAsVectorInContext: 4332 44 98455
CPXYPlotSpace -viewPointForPlotPoint: 3208 4576 701
CPAxis -layoutSublayers 2050 44 46595
CPXYPlotSpace -viewCoordinateForViewLength:linearPlotRange:plotCoordinateValue: 1870 9152
...
While you can create and run DTrace scripts from the command line, probably your best bet would be to create a custom instrument in Instruments and fill in the appropriate D code within that instrument. You can then easily run that against your application in the Simulator.
Again, this won't work on the device, but if you just want statistics on the number of times something is called, and not the duration it runs for, this might do the job.