mongodb-rust-driver performs poorly when finding and fetching a large amount of data compared to go-driver - mongodb

I have a database consisting of 85.4k documents with an average size of 4 KB.
I wrote a simple program in Go to find and fetch over 70k documents from the database using mongodb-go-driver:
package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    localC, _ := mongo.Connect(context.TODO(), options.Client().ApplyURI("mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb"))
    localDb := localC.Database("sampleDB")
    collect := localDb.Collection("sampleCollect")
    // bson.M is the driver's map type for filter documents
    localCursor, _ := collect.Find(context.TODO(), bson.M{
        "deleted": false,
    })
    log.Println("start")
    start := time.Now()
    result := make([]map[string]interface{}, 0)
    localCursor.All(context.TODO(), &result)
    log.Println(len(result))
    log.Println("done")
    log.Println(time.Now().Sub(start))
}
This completes in around 20 seconds:
2021/03/21 01:36:43 start
2021/03/21 01:36:56 70922
2021/03/21 01:36:56 done
2021/03/21 01:36:56 20.0242869s
After that, I tried to implement the same thing in Rust using mongodb-rust-driver:
use mongodb::{
    bson::{doc, Document},
    error::Error,
    options::FindOptions,
    Client,
};
use std::time::Instant;
use tokio::{self, stream::StreamExt};

#[tokio::main]
async fn main() {
    let client = Client::with_uri_str("mongodb://localhost:27017/")
        .await
        .unwrap();
    let db = client.database("sampleDB");
    let coll = db.collection("sampleCollect");
    let find_options = FindOptions::builder().build();
    let cursor = coll
        .find(doc! {"deleted": false}, find_options)
        .await
        .unwrap();
    let start = Instant::now();
    println!("start");
    let results: Vec<Result<Document, Error>> = cursor.collect().await;
    let es = start.elapsed();
    println!("{}", results.len());
    println!("{:?}", es);
}
But it took almost a minute to complete the same task in a release build:
$ cargo run --release
Finished release [optimized] target(s) in 0.43s
Running `target\release\rust-mongo.exe`
start
70922
51.1356069s
May I know whether the Rust performance in this case is considered normal, or whether I made a mistake in my Rust code that could be improved?
EDIT
As a comment suggested, here is an example document:

The discrepancy here was due to some known bottlenecks in the Rust driver that have since been addressed in the latest beta release (2.0.0-beta.3), so upgrading your mongodb dependency to that version should solve the issue.
Re-running your examples with 10k copies of the provided sample document, I now see the Rust one taking ~3.75s and the Go one ~5.75s on my machine.
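For reference, here is the question's Rust benchmark ported to the 2.0 beta API as a minimal sketch (assumptions: mongodb = "2.0.0-beta.3" plus tokio and futures as dependencies in Cargo.toml, and the same sampleDB/sampleCollect names as the question). In 2.0 the cursor implements futures::Stream, so TryStreamExt::try_collect replaces the tokio stream adapter:

use futures::stream::TryStreamExt; // the 2.0 cursor implements futures::Stream
use mongodb::{
    bson::{doc, Document},
    Client,
};
use std::time::Instant;

#[tokio::main]
async fn main() -> mongodb::error::Result<()> {
    let client = Client::with_uri_str("mongodb://localhost:27017/").await?;
    let coll = client
        .database("sampleDB")
        .collection::<Document>("sampleCollect");

    let cursor = coll.find(doc! { "deleted": false }, None).await?;

    let start = Instant::now();
    // try_collect drains the cursor and fails fast on the first driver error
    let results: Vec<Document> = cursor.try_collect().await?;
    println!("{} documents in {:?}", results.len(), start.elapsed());
    Ok(())
}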

Related

Why is rust-mongodb so slow? [duplicate]

This question already has an answer here:
Why is my Rust program slower than the equivalent Java program?
(1 answer)
Closed last month.
I've written the following simple code to test the performance difference between Rust and Python.
Here's the Rust version:
#![allow(unused)]
use mongodb::{sync::Client, options::ClientOptions, bson::doc, bson::Document};

fn cursor_iterate() -> mongodb::error::Result<()> {
    // setup
    let mongo_url = "mongodb://localhost:27017";
    let db_name = "MYDB";
    let collection_name = "MYCOLLECTION";
    let client = Client::with_uri_str(mongo_url)?;
    let database = client.database(db_name);
    let collection = database.collection::<Document>(collection_name);
    // println!("{:?}", collection.count_documents(None, None));
    let cursor = collection.find(None, None)?;
    let mut count = 0;
    for result in cursor {
        count += 1;
    }
    println!("Doc count: {}", count);
    Ok(())
}

fn main() {
    cursor_iterate();
}
This simple cursor iterator takes around 8 seconds with time cargo run:
Finished dev [unoptimized + debuginfo] target(s) in 0.05s
Running `target/debug/bbson`
Doc count: 14469
real 0m8.545s
user 0m8.471s
sys 0m0.067s
Here's the equivalent Python code:
import pymongo

def main():
    url = "mongodb://localhost:27017"
    db = "MYDB"
    client = pymongo.MongoClient(url)
    coll = client.get_database(db).get_collection("MYCOLLECTION")
    count = 0
    for doc in coll.find({}):
        count += 1
    print('Doc count: ', count)

if __name__ == "__main__":
    main()
It takes about a second to run with time python3 test.py:
Doc count: 14469
real 0m1.079s
user 0m0.603s
sys 0m0.116s
So what makes the Rust code this slow? Is it the sync driver? The equivalent C++ code takes about 100 ms.
EDIT: After running in --release mode, I get:
Doc count: 14469
real 0m0.928s
user 0m0.871s
sys 0m0.041s
It still barely matches the Python version.
The answer is already in your output:
Finished dev [unoptimized + debuginfo] target(s) in 0.05s
Running `target/debug/bbson`
Doc count: 14469
real 0m8.545s
user 0m8.471s
sys 0m0.067s
It says unoptimized.
Use cargo run --release to enable optimizations.
Further, don't use time cargo run, because that also measures the time it takes to compile your program.
Instead, use:
cargo build --release
time target/release/bbson
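Alternatively, you can time the query inside the program itself, which sidesteps compilation entirely. A minimal sketch under the question's own setup (the sync driver and the MYDB/MYCOLLECTION names are taken from the question; Instant brackets only the cursor drain):

use mongodb::{bson::Document, sync::Client};
use std::time::Instant;

fn main() -> mongodb::error::Result<()> {
    let client = Client::with_uri_str("mongodb://localhost:27017")?;
    let collection = client.database("MYDB").collection::<Document>("MYCOLLECTION");

    let start = Instant::now();
    // the sync cursor is an Iterator over Result<Document>; count() drains it
    let count = collection.find(None, None)?.count();
    println!("Doc count: {} in {:?}", count, start.elapsed());
    Ok(())
}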

Updating data in MongoDB with Rust

I'm trying to update a field in a collection of a MongoDB database using Rust. I was using this code:
extern crate mongodb;
use mongodb::{Client, ThreadedClient};
use mongodb::db::ThreadedDatabase;

fn main() {
    let client = Client::connect("ipaddress", 27017);
    let coll = client.db("DEV").collection("DEV");
    let film_a = doc! {"DEVID" => "1"};
    let filter = film_a.clone();
    let update = doc! {"temp" => "5"};
    coll.update_one(filter, update, None).expect("failed");
}
This gives me an error saying update only works with the $ operator, which after some searching seems to mean I should use $set. I've been trying different versions of this but only get mismatched type errors and such.
coll.update_one({"DEVID": "1"},{$set:{"temp" => "5"}},None).expect("failed");
Where am I going wrong?
The DB looks like this.
db.DEVICES.find()
{ "_id" : ObjectId("59a7bb747a1a650f1814ef85"), "DEVID" : 1, "temp" : 0, "room_temp" : 0 }
{ "_id" : ObjectId("59a7bb827a1a650f1814ef86"), "DEVID" : 2, "temp" : 0, "room_temp" : 0 }
If someone is looking for the answer for a newer version of the driver, here it is, based on #PureW's answer, using the 2.x async API:

use mongodb::{
    bson::{doc, Document},
    Client,
};

#[tokio::main]
async fn main() {
    let client = Client::with_uri_str("mongodb://localhost:27017").await.unwrap();
    let coll = client.database("tmp").collection::<Document>("tmp");
    let filter = doc! {"DEVID": "1"};
    let update = doc! {"$set": {"temp": "5"}};
    coll.update_one(filter, update, None).await.unwrap();
}
You're pretty much there. The following compiles and runs for me when I try your example (hint: You haven't enclosed "$set" in quotes):
#[macro_use(bson, doc)]
extern crate bson;
extern crate mongodb;

use mongodb::{Client, ThreadedClient};
use mongodb::db::ThreadedDatabase;

fn main() {
    let client = Client::connect("localhost", 27017).unwrap();
    let coll = client.db("tmp").collection("tmp");
    let filter = doc! {"DEVID" => "1"};
    let update = doc! {"$set" => {"temp" => "5"}};
    coll.update_one(filter, update, None).unwrap();
}
Another piece of advice: Using unwrap rather than expect might give you more precise errors.
As for using the mongodb-library, I've stayed away from that as the authors explicitly say it's not production ready and even the update_one example in their documentation is broken.
Instead I've used the wrapper over the battle-tested C-library with good results.

Golang slow scan() for multiple rows

I am running a query in Golang where I select multiple rows from my PostgreSQL database.
I am using the following imports for my query:
"database/sql"
"github.com/lib/pq"
I have narrowed it down to the loop that scans the results into my struct.
// Returns about 400 rows
rows, err = db.Query("SELECT * FROM infrastructure")
if err != nil {
    return nil, err
}

var arrOfInfra []model.Infrastructure
for rows.Next() {
    obj, ptrs := model.InfrastructureInit()
    rows.Scan(ptrs...)
    arrOfInfra = append(arrOfInfra, *obj)
}
rows.Close()
The above code takes about 8 seconds to run, and while the query itself is fast, the rows.Next() loop takes up the entire 8 seconds.
Any ideas? Am I doing something wrong, or is there a better way?
This is my database configuration:
// host, port, dbname, user, password masked for obvious reasons
db, err := sql.Open("postgres", "host=... port=... dbname=... user=... password=... sslmode=require")
if err != nil {
    panic(err)
}

// I have tried using the default, or setting it to a high number (100), but it doesn't seem to help
db.SetMaxIdleConns(1)
db.SetMaxOpenConns(1)
UPDATE 1:
I placed print statements in the for loop. Below is my updated snippet
for rows.Next() {
    obj, ptrs := model.InfrastructureInit()
    rows.Scan(ptrs...)
    arrOfInfra = append(arrOfInfra, *obj)
    fmt.Println("Len: " + fmt.Sprint(len(arrOfInfra)))
    fmt.Println(obj)
}
I noticed that the loop actually pauses halfway through and continues after a short break. It looks like this:
Len: 221
Len: 222
Len: 223
Len: 224
<a short pause about 1 second, then prints Len: 225 and continues>
Len: 226
Len: 227
...
..
.
and it will happen again later on at another row count, and again after a few hundred records.
UPDATE 2:
Below is a snippet of my InfrastructureInit() method
func InfrastructureInit() (*Infrastructure, []interface{}) {
    irf := new(Infrastructure)
    var ptrs []interface{}
    ptrs = append(ptrs,
        &irf.Base.ID,
        &irf.Base.CreatedAt,
        &irf.Base.UpdatedAt,
        &irf.ListingID,
        &irf.AddressID,
        &irf.Type,
        &irf.Name,
        &irf.Description,
        &irf.Details,
        &irf.TravellingFor,
    )
    return irf, ptrs
}
I am not exactly sure what is causing this slowness, but as a quick patch I currently precache my infrastructures in a Redis database, saved as a string. It seems okay for now, but I now have to maintain both Redis and my Postgres.
I am still puzzled over this weird behavior, but I'm not exactly sure how rows.Next() works - does it make a query to the database every time I call rows.Next()?
What do you think about just doing it like this?
defer rows.Close()

var arrOfInfra []*Infrastructure
for rows.Next() {
    irf := &Infrastructure{}
    err = rows.Scan(
        &irf.Base.ID,
        &irf.Base.CreatedAt,
        &irf.Base.UpdatedAt,
        &irf.ListingID,
        &irf.AddressID,
        &irf.Type,
        &irf.Name,
        &irf.Description,
        &irf.Details,
        &irf.TravellingFor,
    )
    if err == nil {
        arrOfInfra = append(arrOfInfra, irf)
    }
}
Hope this helps.
I went down some weird paths myself while consolidating my understanding of how rows.Next() works and what might impact performance, so I thought I'd share this here for posterity (even though the question was asked a long time ago).
Related to:
I am still puzzled over this weird behavior, but I'm not exactly sure how rows.Next() works - does it make a query to the database every time I call rows.Next()?
It doesn't make a 'query', but it reads (transfers) data from the db through the driver on each iteration, which means it can be impacted by, e.g., bad network performance. This is especially true if your db is not local to where you are running your Go code.
One approach to confirm whether network performance is the issue would be to run your Go app on the same machine as your db (if possible).
Assuming the columns being scanned in the above code are not of extremely large size and have no custom conversions, reading ~400 rows should take on the order of 100 ms at most (in a local setup).
For example, I had a case where I needed to read about 100k rows at about 300 B per row, and that took ~4 s (local setup).

Unable to create MarkLogic scheduled tasks from within CPF action module

I have a MarkLogic database with the Content Processing Framework (CPF) installed, and the CPF pipeline is such that whenever a document is inserted, it grabs the value of execution-date from the document and schedules a task for that time.
Example:
Sample document:
<sample>
<execution-date>2014-10-20T12:29:10</execution-date>
</sample>
when inserted, triggers the CPF action module, which reads the value of the execution-date field and creates a scheduled task to be executed at the time read from that field.
Following is the XQuery code snippet from the CPF action module that creates the scheduled task:
let $doc := fn:doc( $cpf:document-uri )
let $releasedon := xs:string($doc/sample/execution-date/text())
let $config := admin:get-configuration()
let $group := admin:group-get-id($config, "Default")
let $new-task :=
    admin:group-one-time-scheduled-task(
        "/tasks/task.xqy",
        "/",
        xs:dateTime($releasedon),
        xdmp:database("SampleDB"),
        xdmp:database("Modules"),
        xdmp:user("admin"),
        (),
        "normal")
let $addTask := admin:group-add-scheduled-task($config, $group, $new-task)
return
    admin:save-configuration($addTask),
    xdmp:log(fn:concat("Task for document Uri: ", $cpf:document-uri, " created"))
Now, when I insert a single document, everything works as expected, that is:
the document is inserted successfully,
the CPF action module is triggered successfully,
the scheduled task is created successfully.
But, when I try to insert multiple documents using:
xdmp:document-insert("/1.xml",
<sample>
<execution-date>2014-10-21T10:00:00</execution-date>
</sample>,
xdmp:default-permissions(),
("documents"))
,
xdmp:document-insert("/2.xml",
<sample>
<execution-date>2014-10-20T11:00:00</execution-date>
</sample>,
xdmp:default-permissions(),
("documents"))
the CPF action module gets triggered successfully (the log messages can be seen in the logs), BUT
only ONE scheduled task gets created.
Looking in the MarkLogic Admin Interface, I can find only a single scheduled task, which is scheduled to run at 2014-10-20T11:00:00.
Please let me know what I am doing wrong, or whether there is any configuration I am missing.
Any suggestions are welcome.
Thanks!
The fundamental issue here is that the admin configuration manipulation APIs are not transactionally protected operations, so when you run two in parallel, each one sees the initial state of the configuration files, writes its bit to add the scheduled task, and then saves it, and only one of them wins. You can force this to behave in a transactionally protected way by forcing a lock on some URI. It doesn't matter what the URI is; it doesn't even have to be in the database. As long as everything that is doing this locks on the same URI, you are fine. xdmp:lock-for-update("my.example.uri") will do this.
The following CPF action module is now working as expected:
xquery version "1.0-ml";

import module namespace cpf = "http://marklogic.com/cpf" at "/MarkLogic/cpf/cpf.xqy";
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";

declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;

declare function local:scheduleTask()
{
    xdmp:lock-for-update("/sample.xml"),
    if (cpf:check-transition($cpf:document-uri, $cpf:transition)) then
        try {
            let $doc := fn:doc( $cpf:document-uri )
            let $releasedon := xs:string($doc/sample/execution-date/text())
            let $config := admin:get-configuration()
            let $group := admin:group-get-id($config, "Default")
            let $new-task :=
                admin:group-one-time-scheduled-task(
                    "/tasks/task.xqy",
                    "/",
                    xs:dateTime($releasedon),
                    xdmp:database("SampleDB"),
                    xdmp:database("Modules"),
                    xdmp:user("admin"),
                    (),
                    "normal")
            let $addTask := admin:group-add-scheduled-task($config, $group, $new-task)
            return
                admin:save-configuration($addTask),
                xdmp:log(fn:concat("Task for document Uri: ", $cpf:document-uri, " created"))
        }
        catch ($e) {
            cpf:failure( $cpf:document-uri, $cpf:transition, $e, () )
        }
    else ( )
};

local:scheduleTask()

Is there a cleaner way to iterate through Mongo query results in Fantom?

I'm writing a web app in the Fantom language and using afMongo to access a MongoDB instance. Following the example in the afMongo documentation, I get the results of a query that I need to iterate through. In a simplified example, the iteration looks like this:
class MapListIterator {
    Void main() {
        [Str:Obj?][] listOfMaps := [,]
        listOfMaps.add(
            ["12345": [
                "id": 12345,
                "code": "AU",
                "name": "Australia"
            ]])
        listOfMaps.each |Str:Obj? map| {
            echo(map.keys)
            keys := map.keys
            keys.each {
                echo(it)
                echo(((Str:Obj?)map[it])["code"])
                echo(((Str:Obj?)map[it])["name"])
            }
        }
    }
}
I ran this code in the Fantom online playground and it works OK, but I wonder if there is a cleaner way to iterate through the results. I don't like the casting in my code above. Also, is there a better way to write the nested it-block?
EDIT:
Turns out that I was overcomplicating things. This is how the code looks after applying Steve's suggestions:
Str:Country mapOfCountries := [:]
mapOfCountries.ordered = true

listOfMaps := ([Str:Str?][]) collection.findAll
listOfMaps.each {
    c := it["code"]
    n := it["name"]
    mapOfCountries.add(c, Country { code = c; name = n })
}
I would re-cast the result and assign the map early on... which gives:
listOfMappedMaps := ([Str:[Str:Obj?]][]) listOfMaps
listOfMappedMaps.each {
    map := it
    map.keys.each {
        echo(map[it]["code"])
        echo(map[it]["name"])
    }
}
The next step would be to use Morphia, which lets you use objects in place of maps.