Why is rust-mongodb so slow? [duplicate] - mongodb

This question already has an answer here:
Why is my Rust program slower than the equivalent Java program?
(1 answer)
Closed last month.
I've written the following simple code to test the performance difference between Rust and Python.
Here's the Rust version:
#![allow(unused)]
use mongodb::{sync::Client, options::ClientOptions, bson::doc, bson::Document};

fn cursor_iterate() -> mongodb::error::Result<()> {
    // setup
    let mongo_url = "mongodb://localhost:27017";
    let db_name = "MYDB";
    let collection_name = "MYCOLLECTION";
    let client = Client::with_uri_str(mongo_url)?;
    let database = client.database(db_name);
    let collection = database.collection::<Document>(collection_name);
    // println!("{:?}", collection.count_documents(None, None));
    let mut cursor = collection.find(None, None)?;
    let mut count = 0;
    for result in cursor {
        count += 1;
    }
    println!("Doc count: {}", count);
    Ok(())
}

fn main() {
    cursor_iterate();
}
This simple cursor iterator takes around 8 seconds with time cargo run:
Finished dev [unoptimized + debuginfo] target(s) in 0.05s
Running `target/debug/bbson`
Doc count: 14469
real 0m8.545s
user 0m8.471s
sys 0m0.067s
Here's the equivalent Python code:
import pymongo

def main():
    url = "mongodb://localhost:27017"
    db = "MYDB"
    client = pymongo.MongoClient(url)
    coll = client.get_database(db).get_collection("MYCOLLECTION")
    count = 0
    for doc in coll.find({}):
        count += 1
    print('Doc count: ', count)

if __name__ == "__main__":
    main()
It takes about a second to run with time python3 test.py:
Doc count: 14469
real 0m1.079s
user 0m0.603s
sys 0m0.116s
So what makes the Rust code this slow? Is it the sync API? The equivalent C++ code takes about 100 ms.
EDIT: After running in the --release mode, I get:
Doc count: 14469
real 0m0.928s
user 0m0.871s
sys 0m0.041s
still barely matching the Python version.

The answer is already in your output:
Finished dev [unoptimized + debuginfo] target(s) in 0.05s
Running `target/debug/bbson`
Doc count: 14469
real 0m8.545s
user 0m8.471s
sys 0m0.067s
It says unoptimized.
Use cargo run --release to enable optimizations.
Further, don't use time cargo run, because that also includes the time it takes to compile your program.
Instead, use:
cargo build --release
time target/release/bbson

Related

mongodb-rust-driver perform poorly on find and get large amount of data compare to go-driver

I have a database consisting of 85.4k documents with an average size of 4 KB.
I wrote a simple program in Go to find and fetch over 70k documents from the database using mongodb-go-driver:
package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// JSON is assumed to be shorthand for a BSON document map; the alias was not
// shown in the original post.
type JSON = map[string]interface{}

func main() {
    localC, _ := mongo.Connect(context.TODO(), options.Client().ApplyURI("mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb"))
    localDb := localC.Database("sampleDB")
    collect := localDb.Collection("sampleCollect")
    localCursor, _ := collect.Find(context.TODO(), JSON{
        "deleted": false,
    })

    log.Println("start")
    start := time.Now()
    var result []map[string]interface{} = make([]map[string]interface{}, 0)
    localCursor.All(context.TODO(), &result)
    log.Println(len(result))
    log.Println("done")
    log.Println(time.Now().Sub(start))
}
This finishes in around 20 seconds:
2021/03/21 01:36:43 start
2021/03/21 01:36:56 70922
2021/03/21 01:36:56 done
2021/03/21 01:36:56 20.0242869s
After that, I tried to implement the same thing in Rust using mongodb-rust-driver:
use mongodb::{
    bson::{doc, Document},
    error::Error,
    options::FindOptions,
    Client,
};
use std::time::Instant;
use tokio::{self, stream::StreamExt};

#[tokio::main]
async fn main() {
    let client = Client::with_uri_str("mongodb://localhost:27017/")
        .await
        .unwrap();
    let db = client.database("sampleDB");
    let coll = db.collection("sampleCollect");
    let find_options = FindOptions::builder().build();
    let cursor = coll
        .find(doc! {"deleted": false}, find_options)
        .await
        .unwrap();

    let start = Instant::now();
    println!("start");
    let results: Vec<Result<Document, Error>> = cursor.collect().await;
    let es = start.elapsed();
    println!("{}", results.iter().len());
    println!("{:?}", es);
}
But it took almost a minute to complete the same task in a release build:
$ cargo run --release
Finished release [optimized] target(s) in 0.43s
Running `target\release\rust-mongo.exe`
start
70922
51.1356069s
Is this Rust performance considered normal in this case, or did I make a mistake in my Rust code that could be improved?
EDIT: As suggested in the comments, here is the example document.
The discrepancy here was due to some known bottlenecks in the Rust driver that have since been addressed in the latest beta release (2.0.0-beta.3); so, upgrading your mongodb dependency to use that version should solve the issue.
Re-running your examples with 10k copies of the provided sample document, I now see the Rust one taking ~3.75s and the Go one ~5.75s on my machine.

Running concurrent mongoengine queries with asyncio

I have 4 functions that basically build queries and execute them. I want to make them run simultaneously using asyncio. My implementation of asyncio seems correct, as non-MongoDB tasks run as they should (for example, asyncio.sleep()). Here is the code:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

tasks = [
    service.async_get_associate_opportunity_count_by_user(me, criteria),
    service.get_new_associate_opportunity_count_by_user(me, criteria),
    service.async_get_associate_favorites_count(me, criteria=dict()),
    service.get_group_matched_opportunities_count_by_user(me, criteria)
]

available, new, favorites, group_matched = loop.run_until_complete(asyncio.gather(*tasks))

stats['opportunities']['available'] = available
stats['opportunities']['new'] = new
stats['opportunities']['favorites'] = favorites
stats['opportunities']['group_matched'] = group_matched

loop.close()
# functions written in other file
@asyncio.coroutine
def async_get_associate_opportunity_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of available: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities

@asyncio.coroutine
def get_new_associate_opportunity_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of new: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities

@asyncio.coroutine
def async_get_associate_favorites_count(self, user, criteria={}, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    favorites = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of favorites: {}".format(run_time))
    yield from asyncio.sleep(2)
    return favorites

@asyncio.coroutine
def get_group_matched_opportunities_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of group matched: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities
The yield from asyncio.sleep(2) is just to show that the functions run asynchronously. Here is the output on the terminal:
runtime of group matched: 0.11431598663330078
runtime of favorites: 0.0029871463775634766
Timestamp function run time: 0.0004897117614746094
runtime of new: 0.15225648880004883
runtime of available: 0.13006806373596191
total run time: 2403.2700061798096
From my understanding, apart from the 2000 ms added to the total run time by the sleep calls, the total shouldn't be more than about 155-160 ms, since that is the longest run time among all the functions.
I'm currently looking into motorengine (a port of mongoengine 0.9.0), which apparently enables asynchronous MongoDB queries, but I think I won't be able to use it since my models have been defined using mongoengine. Is there a workaround to this problem?
The reason your queries aren't running in parallel is that whenever you run Opportunity.objects(query).count() in your coroutines, the entire event loop blocks, because those methods do blocking IO.
So you need a MongoDB driver that can do async/non-blocking IO. You are on the correct path in trying to use motorengine, but as far as I can tell it's written for the Tornado asynchronous framework. To get it to work with asyncio you would have to hook up Tornado and asyncio. See http://tornado.readthedocs.org/en/latest/asyncio.html for how to do that.
Another option is to use asyncio-mongo, but it doesn't have a mongoengine-compatible ORM, so you might have to rewrite most of your code.
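If rewriting around an async driver isn't an option, a rough workaround sketch (assuming plain synchronous versions of your four query functions exist; the method names below are placeholders) is to push the blocking mongoengine calls onto a thread pool with loop.run_in_executor, so the event loop itself never blocks while the counts run concurrently:

import asyncio

def fetch_counts(service, me, criteria):
    # Run the four blocking count queries on the default thread pool so they
    # overlap; run_in_executor returns futures that asyncio.gather can await.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        tasks = [
            loop.run_in_executor(None, service.get_associate_opportunity_count_by_user, me, criteria),
            loop.run_in_executor(None, service.get_new_associate_opportunity_count_by_user, me, criteria),
            loop.run_in_executor(None, service.get_associate_favorites_count, me, dict()),
            loop.run_in_executor(None, service.get_group_matched_opportunities_count_by_user, me, criteria),
        ]
        # Returns (available, new, favorites, group_matched), mirroring the original gather call.
        return loop.run_until_complete(asyncio.gather(*tasks))
    finally:
        loop.close()

This keeps the mongoengine models untouched; the concurrency comes from threads rather than async IO, so the total time should approach the longest single query rather than the sum of all four.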

Avoiding duplicate tasks in celery broker

I want to create the following flow using the Celery configuration/API:
Send TaskA(argB) only if the Celery queue has no TaskA(argB) already pending.
Is it possible? how?
You can make your job aware of other tasks by some sort of memoization. If you use a cache control key (redis, memcached, /tmp, whatever is handy), you can make execution depend on that key. I'm using redis as an example.
from redis import Redis

@app.task
def run_only_one_instance(params):
    try:
        sentinel = Redis().incr("run_only_one_instance_sentinel")
        if sentinel == 1:
            # I am the legitimate running task
            perform_task()
        else:
            # Do you want to do something else on task duplicate?
            pass
        Redis().decr("run_only_one_instance_sentinel")
    except Exception as e:
        Redis().decr("run_only_one_instance_sentinel")
        # potentially log error with Sentry?
        # decrement the counter to ensure tasks can run
        # or: raise e
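A variation on the same idea (just a sketch; the key name and timeout are placeholders) is to take the sentinel as a Redis lock with an expiry, so a worker that dies before decrementing can't leave the counter stuck:

from redis import Redis

@app.task
def run_only_one_instance(params):
    redis = Redis()
    # SET with nx=True only succeeds if the key does not exist yet; ex makes the
    # lock expire on its own if the task dies before releasing it.
    got_lock = redis.set("run_only_one_instance_lock", 1, nx=True, ex=600)
    if not got_lock:
        return  # a duplicate is already running or pending
    try:
        perform_task()
    finally:
        redis.delete("run_only_one_instance_lock")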
I cannot think of a way other than to:
Retrieve all executing and scheduled tasks via celery inspect.
Iterate through them to see if your task is there.
Check this SO question to see how the first point is done.
Good luck!
I don't know whether this will help you more than the other answers, but here is my approach, following the same idea given by srj. I needed a way to prevent my server from launching tasks with the same id into the queue, so I made a general function to help me.
def is_task_active_or_registered(app, task_id):
    i = app.control.inspect()
    active_dict = i.active()
    scheduled_dict = i.scheduled()
    # union of worker names seen in either the active or the scheduled listing
    keys_set = set(active_dict.keys()) | set(scheduled_dict.keys())
    tasks_ids_set = set()

    for _dict in [active_dict, scheduled_dict]:
        for k in keys_set:
            for task in _dict.get(k, []):
                tasks_ids_set.add(task['id'])

    if task_id in tasks_ids_set:
        return True
    else:
        return False
So, I use it like this:
In the context where my celery-app object is available, I define:
def check_task_can_not_run(task_id):
    return is_task_active_or_registered(app=celery, task_id=task_id)
Then, from my client request, I call check_task_can_not_run(...) and block the task from being launched if it returns True.
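For illustration, a hypothetical sketch of that gate (my_task and the task_id scheme are placeholders, not my actual code):

def launch_task_if_absent(task_id, *args):
    # Skip submission when an identical task is already active or scheduled.
    if check_task_can_not_run(task_id):
        return None
    return my_task.apply_async(args=args, task_id=task_id)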
I was facing a similar problem: Beat was creating duplicates in my queue. I wanted to use expires, but this feature isn't working properly (https://github.com/celery/celery/issues/4300).
So here is a scheduler which checks whether the task has already been enqueued (based on the task name).
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

import json
from heapq import heappop, heappush

from celery.beat import event_t
from celery.schedules import schedstate
from django_celery_beat.schedulers import DatabaseScheduler
from typing import List, Optional
from typing import TYPE_CHECKING

from your_project import celery_app

if TYPE_CHECKING:
    from celery.beat import ScheduleEntry


def is_task_in_queue(task, queue_name=None):
    # type: (str, Optional[str]) -> bool
    queues = [queue_name] if queue_name else celery_app.amqp.queues.keys()

    for queue in queues:
        if task in get_celery_queue_tasks(queue):
            return True
    return False


def get_celery_queue_tasks(queue_name):
    # type: (str) -> List[str]
    with celery_app.pool.acquire(block=True) as conn:
        tasks = conn.default_channel.client.lrange(queue_name, 0, -1)

    decoded_tasks = []
    for task in tasks:
        j = json.loads(task)
        task = j['headers']['task']
        if task not in decoded_tasks:
            decoded_tasks.append(task)

    return decoded_tasks


class SmartScheduler(DatabaseScheduler):
    """
    Smart means it prevents duplicating tasks in queues.
    """
    def is_due(self, entry):
        # type: (ScheduleEntry) -> schedstate
        is_due, next_time_to_run = entry.is_due()

        if (
            not is_due or  # duplicate wouldn't be created
            not is_task_in_queue(entry.task)  # not in queue so let it run
        ):
            return schedstate(is_due, next_time_to_run)

        # Task should be run (is_due) and it is present in queue (is_task_in_queue)
        H = self._heap
        if not H:
            return schedstate(False, self.max_interval)

        event = H[0]
        verify = heappop(H)
        if verify is event:
            next_entry = self.reserve(entry)
            heappush(H, event_t(self._when(next_entry, next_time_to_run), event[1], next_entry))
        else:
            heappush(H, verify)
            next_time_to_run = min(verify[0], next_time_to_run)

        return schedstate(False, min(next_time_to_run, self.max_interval))
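To actually use it, Beat has to be pointed at the custom scheduler class; a minimal sketch, assuming SmartScheduler lives at your_project.schedulers (adjust the path to your project layout):

# Celery config sketch; the module path is an assumption.
celery_app.conf.beat_scheduler = "your_project.schedulers:SmartScheduler"

# or equivalently on the command line:
#   celery -A your_project beat --scheduler your_project.schedulers:SmartScheduler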

Scala CompiledScript reuse

I am trying to use the Scala ScriptEngine (2.11) to run a Scala script from Java.
The script uses the bindings provided to the engine.
The script is used multiple times with different bindings.
For this, I am using CompiledScript.
The script runs fine the first time and uses the bindings.
But when the same CompiledScript is rerun with new bindings, the result does not change.
What I observed is that the script does not actually run the second time; it just uses the stored engine state.
The following is the code snippet:
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("scala");

Bindings bindings = engine.createBindings();
bindings.put("firstVal", 209);
bindings.put("secondVal", 30);
bindings.put("sumUtil", new Sum());

CompiledScript script = ((Compilable) engine).compile(
    "var sum = sumUtil.asInstanceOf[com.myutils.Sum]\n" +
    "var firstInt = firstVal.asInstanceOf[Integer]\n" +
    "var secondInt = secondVal.asInstanceOf[Integer]\n" +
    "sum.add(firstInt, secondInt)\n"
);

Integer res1 = (Integer) script.eval(bindings);
System.out.println("Result 1: " + res1);

bindings = engine.createBindings();
bindings.put("firstVal", 2);
bindings.put("secondVal", 3);

Integer res2 = (Integer) script.eval(bindings);
System.out.println("Result 2: " + res2);
The output is:
firstVal: Object = 209
secondVal: Object = 30
sumUtil: Object = com.myutils.Sum#71136646
Result 1: 239
firstVal: Object = 2
secondVal: Object = 3
Result 2: 239
The expectation is that Result 2 is "5" instead of 239.
Am I doing anything wrong here?
Thanks in advance

Strange issue with SBT, println, and scala console application

When I run my scala code (I'm using SBT), the prompt is displayed after I enter some text as shown here:
C:\... > sbt run
[info] Loading project definition [...]
[info] Set current project to [...]
Running com[...]
test
>>
exit
>> >> >> >> >> >> [success] Total time[...]
It seems like it's stacking up the print() statements and only displaying them when it runs a different command.
If I use println() it works as it should (except that I don't want a newline)
The code:
...
def main(args: Array[String]) {
  var endSession: Boolean = false
  var cmd = ""

  def acceptInput: Any = {
    print(">> ")
    cmd = Console.readLine
    if (cmd != "exit") {
      if (cmd != "") runCommand(cmd)
      acceptInput
    }
  }

  acceptInput
}
...
What's going on here?
Output from print (and println) can be buffered. Scala sends output through java.io.PrintStream, which suggests that it will only auto-flush on newline, and then only if you ask. It might be OS dependent, though, since my print appears immediately.
If you add Console.out.flush after each print, you'll empty out the buffer to the screen (on any OS).