How can I perform a bulkWrite in MongoDB using the Rust MongoDB driver?

I am implementing a process in Rust where I read a large number of documents from a MongoDB collection, perform some calculations on the values of each document, and then have to update the documents in MongoDB.
In my initial implementation, after the calculations are performed, I go through each of the documents and call db.collection.replace_one.
let document = bson::to_document(&item).unwrap();
let filter = doc! { "_id": item.id.as_ref().unwrap() };
let result = my_collection.replace_one(filter, document, None).await?;
Since this is quite time-consuming for large record sets, I want to implement it using db.collection.bulkWrite. In version 1.1.1 of the official Rust MongoDB driver, bulkWrite does not seem to be supported, so I want to use db.run_command. However, I am not sure how to call db.collection.bulkWrite(...) using run_command, as I cannot figure out how to pass the command name together with the set of documents whose values should be replaced in MongoDB.
What I have attempted is to build a String representing the command document, with all the documents to be updated joined into it. To create a bson::Document from that string, I convert it to bytes and then try to build the command document using Document::from_reader, but that does not work, and it is not a good solution anyway.
Is there a proper or better way to call bulkWrite using version 1.1.1 of the mongodb crate?
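One approach worth sketching here (not part of the original question): a batch of replace_one operations corresponds on the wire to the server's update command, whose updates array holds one { q, u } statement per document; a u value that is a plain document with no update operators performs a full replacement. That command document can be assembled with doc! and sent through db.run_command. A minimal sketch, assuming the 1.x driver, a collection named my_collection, and that each recalculated record is a bson::Document that still carries its _id (bulk_replace is a hypothetical helper):

use mongodb::bson::{doc, Document};
use mongodb::Database;

// Hypothetical helper: apply a batch of full-document replacements in a
// single round trip by issuing the server's `update` command directly.
async fn bulk_replace(db: &Database, items: Vec<Document>) -> mongodb::error::Result<Document> {
    let updates: Vec<Document> = items
        .into_iter()
        .map(|item| {
            // Each item is assumed to still contain its original _id.
            let id = item.get("_id").cloned().expect("document is missing _id");
            doc! {
                "q": { "_id": id }, // filter, as in replace_one
                "u": item,          // full replacement document (no update operators)
            }
        })
        .collect();

    let command = doc! {
        "update": "my_collection", // target collection name (assumed)
        "updates": updates,
        "ordered": false,          // keep applying statements past individual failures
    };

    db.run_command(command, None).await
}

Note that the server reports per-statement failures inside the reply document (fields such as n, nModified and writeErrors) rather than as a command error, so the returned Document should be inspected instead of relying on ? alone.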

Related

How to use mongodb change stream instead of periodic query?

I want to count the documents in my collection that satisfy a query, but I don't want to poll the collection. How can I do this with a MongoDB change stream?
For example, the documents in the database all have some property: {"destination": "Target1"}, and I want to know the number of documents that satisfy this requirement.
I don't want to run a query on every change of the collection, because the documents change very often.
I am looking for something similar to Oracle's CQN.
You can use a change stream and watch for changes as follows:
watchCursor = db.getSiblingDB("mydatabase").mycollection.watch()
while (!watchCursor.isExhausted()) {
    if (watchCursor.hasNext()) {
        printjson(watchCursor.next());
    }
}
changeStream docs
But perhaps you could just run a query and rely on a good index? It seems you can simply execute:
db.collection.count({ destination: "Target1" })
and if you have an index on the "destination" field it will be pretty quick.

How to avoid mongo returning duplicated documents when iterating a cursor over a constantly updated big collection?

Context
I have a big collection with millions of documents which is constantly updated with production workload. When performing a query, I have noticed that a document can be returned multiple times; my workload migrates the documents to a SQL system that enforces unique row IDs, so it crashes.
Problem
Because the collection is so big and lots of users are updating it after the query is started, iterating over the cursor's result may give me documents with the same id (old and updated version).
What I've tried
const cursor = db.collection.find(query, { snapshot: true });
while (cursor.hasNext()) {
    const doc = cursor.next();
    // do some stuff
}
Based on old documentation for the mongo driver (I'm using Node.js, but this is applicable to any official MongoDB driver), there is an option called snapshot which is said to avoid what is happening to me. Sadly, the driver returns an error indicating that this option does not exist (it was deprecated).
Question
Is there a way to iterate through the documents of a collection in a safe fashion so that I don't get the same document twice?
I only see a viable option with the aggregation pipeline, but I want to explore other options with standard queries.
Finally, I got the answer from a MongoDB changelog page:
MongoDB 3.6.1 deprecates the snapshot query option.
For MMAPv1, use hint() on the { _id: 1} index instead to prevent a cursor from returning a document more than once if an intervening write operation results in a move of the document.
For other storage engines, use hint() with { $natural : 1 } instead.
So, from my code example:
const cursor = db.collection.find(query).hint({ $natural: 1 });
while (cursor.hasNext()) {
    const doc = cursor.next();
    // do some stuff
}

How to order the fields of the documents returned by the find query in MongoDB? [duplicate]

I am using PyMongo to insert data (title, description, phone_number, ...) into MongoDB. However, when I use the mongo client to view the data, it displays the properties in a strange order: the phone_number property is displayed first, followed by title, and then description. Is there some way I can force a particular order?
The above question and answer are quite old. Anyhow, if somebody visits this, I feel I should add:
This answer is completely wrong. Actually, in Mongo, documents ARE ordered key-value pairs. However, when using PyMongo, documents are represented as Python dicts, which indeed are not ordered (as of CPython 3.6 dicts retain insertion order, but this is considered an implementation detail). This is a limitation of the PyMongo driver.
Be aware that this limitation actually impacts usability: if you query the database for a subdocument, it will only match if the order of the key-value pairs is correct.
Just try the following code yourself:
from pymongo import MongoClient
db = MongoClient().testdb
col = db.testcol
subdoc = {
    'field1': 1,
    'field2': 2,
    'field3': 3
}
document = {
    'subdoc': subdoc
}
col.insert_one(document)
print(col.find({'subdoc': subdoc}).count())
Each time this code gets executed, the 'same' document is added to the collection. Thus, each time we run this snippet, the printed value 'should' increase by one. It does not, because find only matches subdocuments with the correct ordering, while Python dicts store the subdocument's keys in arbitrary order.
See the following answer for how to use an ordered dict to overcome this: https://stackoverflow.com/a/30787769/4273834
Original answer (2013):
MongoDB documents are BSON objects, unordered dictionaries of key-value pairs. So, you can't rely on or set a specific field order. The only thing you can control is which fields to display and which not to; see the docs on find's projection argument.
Also see related questions on SO:
MongoDB field order and document position change after update
Can MongoDB and its drivers preserve the ordering of document elements
Ordering fields from find query with projection
Hope that helps.

Field's datatype of collection in mongodb

How can I get field information for a collection in MongoDB?
The information I am looking for is:
field name
data type
You will need to loop over all the documents and figure out which field names are used and which types each specific field has. MongoDB does not have a schema, so there is no shortcut for fetching this. Also be aware that a given field's values can have totally different data types across documents, which is another of MongoDB's strengths.
To figure out some statistics, such as field names, the following script can help:
mr = db.runCommand({
    "mapreduce": "things",
    "map": function() {
        for (var key in this) { emit(key, null); }
    },
    "reduce": function(key, stuff) { return null; },
    "out": "things" + "_keys"
})
Then run distinct on the resulting collection so as to find all the keys:
db[mr.result].distinct("_id");
But there is no way to also include the field types with a Map/Reduce job like this.
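For what it's worth, the name-and-type scan can also be done client-side; here is a minimal Rust sketch with the driver from the first question (infer_schema is a hypothetical helper; assumes the futures crate for TryStreamExt and the 1.x untyped Collection):

use std::collections::{BTreeMap, BTreeSet};
use futures::stream::TryStreamExt;
use mongodb::bson::Bson;
use mongodb::Collection;

// Hypothetical helper: scan every document and record, per field name,
// the set of BSON types actually observed for that field.
async fn infer_schema(
    coll: &Collection,
) -> mongodb::error::Result<BTreeMap<String, BTreeSet<&'static str>>> {
    let mut schema: BTreeMap<String, BTreeSet<&'static str>> = BTreeMap::new();
    let mut cursor = coll.find(None, None).await?;
    while let Some(document) = cursor.try_next().await? {
        for (key, value) in document.iter() {
            // Map each Bson variant to a readable type label.
            let type_name = match value {
                Bson::Double(_) => "double",
                Bson::String(_) => "string",
                Bson::Document(_) => "document",
                Bson::Array(_) => "array",
                Bson::Boolean(_) => "bool",
                Bson::Int32(_) => "int32",
                Bson::Int64(_) => "int64",
                Bson::ObjectId(_) => "objectId",
                Bson::DateTime(_) => "date",
                Bson::Null => "null",
                _ => "other",
            };
            schema.entry(key.to_string()).or_default().insert(type_name);
        }
    }
    Ok(schema)
}

Like the Map/Reduce approach, this reads the whole collection, so on a large collection you would want to restrict it to a sample, for example via a limit in FindOptions.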
You can't determine the schema of a collection. Each of the objects in a collection might have a different schema; you should be aware of this.
I asked a similar question a few months ago; in that post you can find how to retrieve the schema of an object using the Java programming language. However, to the best of my knowledge, there is no way to retrieve the data types other than trying to cast the objects (this is the way the BasicBSONObjects do it).
MongoDB supports dynamic schemas, and there is no built-in feature for schema introspection or analysis as at MongoDB 2.4.
However, it is possible to infer the schema by inspecting either a sample of documents or the entire collection with a Map/Reduce job.
There are a few open source tools which package this approach up in a helpful interface, for example:
Schema.js - extends the mongo shell with collection.schema() prototypes
Variety - runs as a standalone script
I like the approach of schema.js, and include it in my ~/mongorc.js startup file so it is available in my mongo shell sessions.
By default schema.js analyzes up to 50 documents in a collection and returns the results inline. There is a limit option to inspect more (or even all) documents in a collection, and it supports the Map/Reduce out options so results can optionally be saved or merged with an output collection.

MongoDB: range queries on insertion time with _id and ObjectID

I am trying to use MongoDB's ObjectID to do a range query on the insertion time of a given collection. I can't really find any documentation that this is possible, except for this blog entry: http://mongotips.com/b/a-few-objectid-tricks/.
I want to fetch all documents created after a given timestamp. Using the Node.js driver, this is what I have:
var timeId = ObjectId.createFromTime(timestamp);
var query = {
    localUser: userId,
    _id: { $gte: timeId }
};
var cursor = collection.find(query).sort({ _id: 1 });
I always get the same number of records (19 in a collection of 27), independent of the timestamp. I noticed that createFromTime only fills the time-related bytes of the ObjectID; the other bytes are left at 0 (like this: 4f6198be0000000000000000).
The reason I try to use an ObjectID for this is that I need the timestamp from when the document is inserted on the MongoDB server, not from when the document is passed to the MongoDB driver in Node.
Does anyone know how to make this work, or have another idea for generating and querying insertion times that are assigned on the MongoDB server?
Not sure about the Node.js driver; in Ruby you can simply apply range queries like this:
jan_id = BSON::ObjectId.from_time(Time.utc(2012, 1, 1))
feb_id = BSON::ObjectId.from_time(Time.utc(2012, 2, 1))
users.find({ '_id' => { '$gte' => jan_id, '$lt' => feb_id } })
Make sure that
var timeId = ObjectId.createFromTime(timestamp)
is actually creating an ObjectId. Also try the query without localUser.