MongoDB benchmarking inserts

I am trying to benchmark MongoDB inserts with the JS benchmarking harness (benchRun), following the example on the MongoDB website.
The insert operation itself works fine, but the reported queries/sec looks wrong.
ops = [{op: "insert", ns: "benchmark.bench", safe: false, doc: {"a": 1}}]
The above works fine. I then ran the following in the mongo shell:
for ( x = 1; x<=128; x*=2){
    res = benchRun( { parallel : x ,
                      seconds : 5 ,
                      ops : ops
                    } )
    print( "threads: " + x + "\t queries/sec: " + res.query )
}
It gives out:
threads: 1 queries/sec: 0
threads: 2 queries/sec: 0
threads: 4 queries/sec: 0
threads: 8 queries/sec: 0
threads: 16 queries/sec: 0
threads: 32 queries/sec: 1.4
threads: 64 queries/sec: 0
threads: 128 queries/sec: 0
I don't understand why the queries/sec is 0 and why it looks as though not a single doc has been inserted. Is this the right way to test insert performance?

Answering because I just encountered a similar problem.
Try replacing your print statement with printjson(res).
You will see that res has the following fields:
{
    "note" : "values per second",
    "errCount" : NumberLong(0),
    "trapped" : "error: not implemented",
    "insertLatencyAverageMicros" : 8.173300153139357,
    "totalOps" : NumberLong(130600),
    "totalOps/s" : 25366.173139864142,
    "findOne" : 0,
    "insert" : 25366.173139864142,
    "delete" : 0,
    "update" : 0,
    "query" : 0,
    "command" : 0
}
As you can see, the query count is 0, hence when you print res.query it gives 0. To get the number of insert operations per second you would want to print res.insert. I believe res.query corresponds to the "find" operation.
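If you want a second opinion on the numbers outside the JS harness, a rough timing loop in a driver works too. This is only a sketch (it assumes PyMongo and a local mongod on the default port) and reuses the benchmark.bench namespace from the question:

# Rough insert-throughput check with PyMongo (sketch; assumes a local mongod
# on the default port and reuses the benchmark.bench namespace from above).
import time
from pymongo import MongoClient

coll = MongoClient()["benchmark"]["bench"]
coll.drop()

n = 100000
start = time.perf_counter()
coll.insert_many([{"a": 1} for _ in range(n)])
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} inserts/sec")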

Difference between Supply method act vs tap

In the Raku documentation about the Supply method act (vs tap), https://docs.raku.org/type/Supply#method_act, it is stated that:
the given code is guaranteed to be executed by only one thread at a time
My understanding is that one thread must finish with the specific code object before another thread can run it.
If that is the case, I stumbled upon a different behavior when I tried to implement the feature. Take a look at the following code snippet, where 2 "acts" are created and run in different threads:
#!/usr/bin/env perl6
say 'Main runs in [ thread : ', +$*THREAD, ' ]';
my $b = 1;
sub actor {
    print " Tap_$*tap : $^a ", now;
    $*tap < 2 ??
    do {
        say " - Sleep 0.1";
        sleep 0.1
    }
    !!
    do {
        say " - Sleep 0.2";
        sleep 0.2;
    }
    $b++;
    say " Tap_$*tap +1 to \$b $b ", now;
}
my $supply = supply {
    for 1..100 -> $i {
        say "For Tap_$*tap [ \$i = $i ] => About to emit : $b ", now;
        emit $b;
        say "For Tap_$*tap [ \$i = $i ] => Emitted : $b ", now;
        done if $b > 5
    }
}
start {
    my $*tap = 1;
    once say "Tap_$*tap runs in [ thread : {+$*THREAD} ]";
    $supply.act: &actor
}
start {
    my $*tap = 2;
    once say "Tap_$*tap runs in [ thread : {+$*THREAD} ]";
    $supply.act: &actor
}
sleep 1;
and the result is the following (with added time gaps and comments):
1 Main runs in [ thread : 1 ] - Main thread
2 Tap_1 runs in [ thread : 4 ] - Tap 1 thread
3 For Tap_1 [ $i = 1 ] => About to emit : 1 Instant:1603354571.198187 - Supply thread [for tap 1]
4 Tap_1 : 1 Instant:1603354571.203074 - Sleep 0.1 - Tap 1 thread
5 Tap_2 runs in [ thread : 6 ] - Tap 2 thread
6 For Tap_2 [ $i = 1 ] => About to emit : 1 Instant:1603354571.213826 - Supply thread [for tap 2]
7 Tap_2 : 1 Instant:1603354571.213826 - Sleep 0.2 - Tap 2 thread
8
9 -----------------------------------------------------------------------------------> Time +0.1 seconds
10
11 Tap_1 +1 to $b 2 Instant:1603354571.305723 - Tap 1 thread
12 For Tap_1 [ $i = 1 ] => Emitted : 2 Instant:1603354571.305723 - Supply thread [for tap 1]
13 For Tap_1 [ $i = 2 ] => About to emit : 2 Instant:1603354571.30768 - Supply thread [for tap 1]
14 Tap_1 : 2 Instant:1603354571.30768 - Sleep 0.1 - Tap 1 thread
15
16 -----------------------------------------------------------------------------------> Time +0.1 seconds
17
18 Tap_1 +1 to $b 3 Instant:1603354571.410354 - Tap 1 thread
19 For Tap_1 [ $i = 2 ] => Emitted : 4 Instant:1603354571.425018 - Supply thread [for tap 1]
20 Tap_2 +1 to $b 4 Instant:1603354571.425018 - Tap 2 thread
21 For Tap_1 [ $i = 3 ] => About to emit : 4 Instant:1603354571.425018 - Supply thread [for tap 1]
22 For Tap_2 [ $i = 1 ] => Emitted : 4 Instant:1603354571.425995 - Supply thread [for tap 2]
23 Tap_1 : 4 Instant:1603354571.425995 - Sleep 0.1 - Tap 1 thread
24 For Tap_2 [ $i = 2 ] => About to emit : 4 Instant:1603354571.425995 - Supply thread [for tap 2]
25 Tap_2 : 4 Instant:1603354571.426973 - Sleep 0.2 - Tap 2 thread
26
27 -----------------------------------------------------------------------------------> Time +0.1 seconds
28
29 Tap_1 +1 to $b 5 Instant:1603354571.528079 - Tap 1 thread
30 For Tap_1 [ $i = 3 ] => Emitted : 5 Instant:1603354571.52906 - Supply thread [for tap 1]
31 For Tap_1 [ $i = 4 ] => About to emit : 5 Instant:1603354571.52906 - Supply thread [for tap 1]
32 Tap_1 : 5 Instant:1603354571.53004 - Sleep 0.1 - Tap 1 thread
33
34 -----------------------------------------------------------------------------------> Time +0.1 seconds
35
36 Tap_2 +1 to $b 6 Instant:1603354571.62859 - Tap 2 thread
37 For Tap_2 [ $i = 2 ] => Emitted : 6 Instant:1603354571.62859 - Supply thread [for tap 2]
38 Tap_1 +1 to $b 7 Instant:1603354571.631512 - Tap 1 thread
39 For Tap_1 [ $i = 4 ] => Emitted : 7 Instant:1603354571.631512 - Supply thread [for tap 2]
One can easily observe that the code object (subroutine &actor) is running concurrently in 2 threads (for example see output lines 4 & 7).
Can somebody clarify my misunderstanding about the matter?
There's very rarely any difference between tap and act in everyday use of Raku, because almost every Supply that you encounter is a serial supply. A serial supply is one that already enforces the protocol that a value will not be emitted until the previous one has been processed. The implementation of act is:
method act(Supply:D: &actor, *%others) {
    self.sanitize.tap(&actor, |%others)
}
Where sanitize enforces the serial emission of values and in addition makes sure that events follow the grammar emit* [done | quit]. Since these properties are usually highly desirable anyway, every built-in way to obtain a Supply provides them, with the exception of being able to create a Supplier and call unsanitized-supply on it. (Historical note: a very early prototype did not enforce these properties so widely, creating more of a need for a method doing what act does. While the need for it diminished as the design evolved into what was eventually shipped in the first language release, it got to keep its nice short name.)
The misunderstanding arises from expecting the serialization of events to be per source, whereas in reality it is per subscription. Consider this example:
my $timer = Supply.interval(1);
$timer.tap: { say "A: {now}" };
$timer.tap: { say "B: {now}" };
sleep 5;
Which produces output like this:
A: Instant:1603364746.02766
B: Instant:1603364746.031255
A: Instant:1603364747.025255
B: Instant:1603364747.028305
A: Instant:1603364748.025584
B: Instant:1603364748.029797
A: Instant:1603364749.026596
B: Instant:1603364749.029643
A: Instant:1603364750.027881
B: Instant:1603364750.030851
A: Instant:1603364751.030137
There is one source of events, but we establish two subscriptions to it. Each subscription enforces the serial rule, so if we do this:
my $timer = Supply.interval(1);
$timer.tap: { sleep 1.5; say "A: {now}" };
$timer.tap: { sleep 1.5; say "B: {now}" };
sleep 5;
Then we observe the following output:
A: Instant:1603364909.442341
B: Instant:1603364909.481506
A: Instant:1603364910.950359
B: Instant:1603364910.982771
A: Instant:1603364912.451916
B: Instant:1603364912.485064
Showing that each subscription is getting one event at a time, but merely sharing an (on-demand) source doesn't create any shared backpressure.
Since the concurrency control is associated with the subscription, it is irrelevant if the same closure clone is passed to tap/act. Enforcing concurrency control across multiple subscriptions is the realm of supply/react/whenever. For example this:
my $timer = Supply.interval(1);
react {
    whenever $timer {
        sleep 1.5;
        say "A: {now}"
    }
    whenever $timer {
        sleep 1.5;
        say "B: {now}"
    }
}
Gives output like this:
A: Instant:1603365363.872672
B: Instant:1603365365.379991
A: Instant:1603365366.882114
B: Instant:1603365368.383392
A: Instant:1603365369.884608
B: Instant:1603365371.386087
Where each event is 1.5s apart, because of the concurrency control implied by the react block.

mongo: update $push failed with "Resulting document after update is larger than 16777216"

I want to extend a large array using the update(.. $push ..) operation.
Here are the details:
I have a large collection 'A' with many fields. Amongst the fields, I want to extract the values of the 'F' field, and transfer them into one large array stored inside one single field of a document in collection 'B'.
I split the process into steps (to limit the memory used).
Here is the Python program:
...
steps = 1000 # number of steps
step = 10000 # each step will handle this number of documents
start = 0
for j in range(steps):
    print('step:', j, 'start:', start)
    project = {'$project': {'_id':0, 'F':1} }
    skip = {'$skip': start}
    limit = {'$limit': step}
    cursor = A.aggregate( [ skip, limit, project ], allowDiskUse=True )
    a = []
    for i, o in enumerate(cursor):
        value = o['F']
        a.append(value)
    print('len:', len(a))
    B.update( {'_id': 1}, { '$push': {'v' : { '$each': a } } } )
    start += step
Here is the output of this program:
step: 0 start: 0
step: 1 start: 100000
step: 2 start: 200000
step: 3 start: 300000
step: 4 start: 400000
step: 5 start: 500000
step: 6 start: 600000
step: 7 start: 700000
step: 8 start: 800000
step: 9 start: 900000
step: 10 start: 1000000
Traceback (most recent call last):
  File "u_psfFlux.py", line 109, in <module>
    lsst[k].update( {'_id': 1}, { '$push': {'v' : { '$each': a } } } )
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 2503, in update
    collation=collation)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/collection.py", line 754, in _update
    _check_write_command_response([(0, result)])
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pymongo/helpers.py", line 315, in _check_write_command_response
    raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Resulting document after update is larger than 16777216
Apparently the $push operation has to fetch the complete array! (My expectation was that each update would need the same amount of memory, since we always append the same number of values to the array.)
In short, I don't understand why the update/$push operation fails with this error.
Or... is there a way to avoid this unneeded buffering?
Thanks for your suggestion
Christian
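For context, the hard ceiling here is MongoDB's 16 MB limit on a single BSON document: each $push rewrites the one target document in B, and once enough values have accumulated the resulting document would exceed 16777216 bytes, so the write is rejected. Below is a hedged sketch of the usual workaround, reusing the variables from the program above: write one new document per chunk instead of growing a single array.

# Sketch: one document per chunk stays far below the 16 MB per-document
# limit; readers can reassemble the values by iterating B in chunk order.
# A and B are the source and destination collections from the question.
steps = 1000
step = 10000
start = 0
for j in range(steps):
    cursor = A.aggregate(
        [{'$skip': start}, {'$limit': step}, {'$project': {'_id': 0, 'F': 1}}],
        allowDiskUse=True)
    a = [o['F'] for o in cursor]
    if not a:
        break
    B.insert_one({'chunk': j, 'v': a})   # separate document, no $push
    start += step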

Scoping Issue with SparkContext.sequenceFile(...).foreach in Scala

My objective is to process a series of SequenceFile folders generated by calling org.apache.spark.rdd.RDD[_].saveAsObjectFile(...). My folder structure is similar to this:
\MyRootDirectory
    \Batch0001
        _SUCCESS
        part-00000
        part-00001
        ...
        part-nnnnn
    \Batch0002
        _SUCCESS
        part-00000
        part-00001
        ...
        part-nnnnn
    ...
    \Batchnnnn
        _SUCCESS
        part-00000
        part-00001
        ...
        part-nnnnn
I need to extract some of the persisted data, however my collection (whether I use a ListBuffer, mutable.Map, or any other mutable type) loses scope and appears to be newed up on each iteration of sequenceFile(...).foreach.
The following proof of concept generates a series of "Processing directory..." messages followed by "1 : 1" repeated, never increasing as I expected counter and intList.size to do.
private def proofOfConcept(rootDirectoryName: String) = {
  val intList = ListBuffer[Int]()
  var counter: Int = 0
  val config = new SparkConf().setAppName("local").setMaster("local[1]")
  new File(rootDirectoryName).listFiles().map(_.toString).foreach { folderName =>
    println(s"Processing directory $folderName...")
    val sc = new SparkContext(config)
    sc.setLogLevel("WARN")
    sc.sequenceFile(folderName, classOf[NullWritable], classOf[BytesWritable]).foreach(f => {
      counter += 1
      intList += counter
      println(s" $counter : ${intList.size}")
    })
    sc.stop()
  }
}
Output:
"C:\Program Files\Java\jdk1.8.0_111\bin\java" ...
Processing directory C:\MyRootDirectory\Batch0001...
17/05/24 09:30:25.228 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 0) / 57] 1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
Processing directory C:\MyRootDirectory\Batch0002...
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
Processing directory C:\MyRootDirectory\Batch0003...
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
The function inside foreach runs in a Spark worker JVM, not in the client JVM where the variable is defined. That worker gets a local copy of the variable, increments it, and prints it. My guess is that you are testing this locally; if you ran this in a production, distributed Spark environment, you wouldn't even see the output of those prints.
More generally, pretty much any function you pass into one of RDD's methods will be executed remotely and will not have mutable access to any local variables; it gets an essentially immutable snapshot of them.
If you want to move data from Spark's distributed storage back to the client, use RDD's collect method. The reverse is done with sc.parallelize. Note that both of these are used sparingly, since they do not run in parallel.
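To make that concrete, here is a minimal sketch of the two usual fixes. It is written in PySpark for brevity (the mechanics are the same in Scala), and parallelize stands in for the sequenceFile call:

# Sketch: aggregate on the cluster (or collect) instead of mutating a
# driver-side variable from inside foreach.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("local").setMaster("local[1]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100))      # stand-in for sc.sequenceFile(...)

# Option 1: ship the data back to the driver and build the local list there.
int_list = rdd.collect()              # a plain Python list on the driver
print(len(int_list))                  # 100

# Option 2: for simple counters, use an accumulator, which Spark merges
# from every worker task back into the driver.
counter = sc.accumulator(0)
rdd.foreach(lambda _: counter.add(1))
print(counter.value)                  # 100

sc.stop()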

Mongo stalled with too many insertions

I am trying to use MongoDB to run a multi-agent simulation.
I have one mongo instance on the same server that runs the simulation program, but when I have too many agents (~100,000 across 10 simulation steps) mongod stalls for several seconds.
The code for inserting data into mongo is similar to:
if( mongo_client( &m_conn , m_dbhost.c_str(), m_dbport ) != MONGO_OK ) {
    cout << "failed to connect '" << m_dbhost << ":" << m_dbport << "'\n";
    cout << " mongo error: " << m_conn.err << endl;
    return;
}
bson_init( &b );
bson_append_new_oid( &b, "_id" );
bson_append_double( &b, "time", time );
bson_append_double( &b, "x", posx );
bson_append_double( &b, "y", posy );
bson_finish( &b );
if( mongo_insert( &m_conn , ns.c_str() , &b, NULL ) != MONGO_OK ){
    cout << "failed to insert in mongo\n";
}
bson_destroy( &b );
mongo_disconnect( &m_conn );
Also, during the simulation, if I try to connect with the mongo shell, I get errors:
$ mongo
MongoDB shell version: 2.4.1
connecting to: test
Wed Apr 3 10:10:24.870 JavaScript execution failed: Error: couldn't connect to server 127.0.0.1:27017 at src/mongo/shell/mongo.js:L112
exception: connect failed
After the simulation has ended, the mongo shell becomes responsive again, and I can check that there is data in the database, but it has gaps. In the example, the agent m0n999 saved only 6 of its 10 steps:
> show dbs
dB0B7F527F0FA45518712C8CB27611BD7 5.951171875GB
local 0.078125GB
> db.ins.m0n999.find()
{ "_id" : ObjectId("515bdf564c60ec1e000003e7"), "time" : 1, "x" : 1.1, "y" : 8.1 }
{ "_id" : ObjectId("515be0214c60ec1e0001075f"), "time" : 2, "x" : 1.2000000000000002, "y" : 8.2 }
{ "_id" : ObjectId("515be1c04c60ec1e0002da3a"), "time" : 4, "x" : 1.4000000000000004, "y" : 8.399999999999999 }
{ "_id" : ObjectId("515be2934c60ec1e0003b82c"), "time" : 5, "x" : 1.5000000000000004, "y" : 8.499999999999998 }
{ "_id" : ObjectId("515be3664c60ec1e000497cf"), "time" : 6, "x" : 1.6000000000000005, "y" : 8.599999999999998 }
{ "_id" : ObjectId("515be6cc4c60ec1e000824b2"), "time" : 10, "x" : 2.000000000000001, "y" : 8.999999999999996 }
>
How can I solve this problem? How can I avoid the lost connections and recover from the mongo stalls?
UPDATE
I'm seeing errors like these in the global log:
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] end connection 127.0.0.1:40748 (1 connection now open)",
I solved the problem; it had two causes:
I was creating too many collections. I changed from one collection per agent to a single collection per simulation process.
I was creating too many connections. I changed from one connection per agent iteration to a single connection per simulation step.
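The question's code uses the legacy C driver, but the shape of the fix is the same in any driver: open the connection once, keep it for the whole run, and batch the per-agent writes into one insert per step. A minimal PyMongo sketch, with illustrative field and collection names:

# Sketch (illustrative names): one client and one collection for the whole
# simulation, one batched insert per simulation step.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)       # connect once, not per agent
col = client["simulation"]["agent_positions"]  # single collection per run

def record_step(step_time, agents):
    docs = [{"agent": a.name, "time": step_time, "x": a.x, "y": a.y}
            for a in agents]
    col.insert_many(docs)                      # one round trip per step

# client.close() once the simulation has finished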

$pull operation in MongoDB not working for me

I have a document with the following key-array pair:
"home" : [
"Kevin Garnett",
"Paul Pierce",
"Rajon Rondo",
"Brandon Bass",
" 5 sec inbound",
"Kevin Seraphin"
]
I want to remove the element " 5 sec inbound" from the array and use the following command (in the MongoDB shell):
>coll.update({},{"$pull":{"home":" 5 sec inbound"}})
This is not working as verified by a query:
>coll.findOne({"home":/5 sec inbound/})
"home" : [
"Kevin Garnett",
"Paul Pierce",
"Rajon Rondo",
"Brandon Bass",
" 5 sec inbound",
"Kevin Seraphin"
]
Any help would be greatly appreciated!
That very same statement works for me:
> db.test.insert({"home" : [
... "Kevin Garnett",
... "Paul Pierce",
... "Rajon Rondo",
... "Brandon Bass",
... " 5 sec inbound",
... "Kevin Seraphin"
... ]})
> db.test.find({"home":/5 sec inbound/}).count()
1
> db.test.update({},{"$pull":{"home":" 5 sec inbound"}})
> db.test.find({"home":/5 sec inbound/}).count()
0
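For completeness, the same operation from PyMongo would look like the sketch below (update_many applies the $pull to every matching document, whereas the shell's update above only modifies the first match by default; the database and collection names are assumptions):

# Sketch: remove the " 5 sec inbound" entry from the home array in every
# matching document.
from pymongo import MongoClient

coll = MongoClient()["test"]["test"]
result = coll.update_many({}, {"$pull": {"home": " 5 sec inbound"}})
print(result.modified_count)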