mongolite doesn't use the index and read queries are slow - mongodb

I have created a MongoDB database using mongolite and created an index on the _row key using the following command:
collection$index(add = '{"_row" : 1}')
When I query a document via the Robo3T program with the db.getCollection('collection').find({"_row": "ENSG00000197616"}) command, the index is used and it takes less than a second to return the data.
[Robo3T screenshot — note the query time]
This is also the case when I query the data using the pymongo package in Python.
[Python screenshot — note the query time]
Surprisingly, when I perform the same query with mongolite, it takes more than 10 seconds to return the data:
system.time(collection$find(query = '{"_row": "ENSG00000197616"}'))
user system elapsed
12.221 0.005 12.269
I think this can only come from the mongolite package; otherwise, the query would be slow in the other programs as well.
Any input is highly appreciated!

I found the solution here:
https://github.com/jeroen/mongolite/issues/37
The time-consuming part is not the query itself but simplifying the result into a data frame.

Related

MongoDb update a field in a huge collection using pymongo fast

I have 13 GB of documents in a MongoDB collection in which I need to update a field, ip_address. The original values and the replacement values are given in an Excel sheet. I am looping through each value from the sheet and updating it using:
old_value={"ip_address":original_value}
new_value={"$set":{"ip_address":replacement_value}
tableConnection.update_many(old_value,new_value)
Each update is taking over 2 minutes, and I have 1500 updates to do. Is there a better way to do it?
Bulk operations won't speed up your updates by much; the best way to achieve a performance increase is to add an index. This can be as simple as:
db.collection.createIndex({'ip_address': 1})
Refer to the documentation regarding potential blocking on certain older versions of the database: https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/
The index will take up additional storage; if that is an issue, you can delete the index once you've completed the updates.
To add to the above answer given by Belly Buster, the syntax for indexing and bulk_write in PyMongo that worked for me is:
from pymongo import UpdateMany
from pymongo.errors import BulkWriteError

db.collection.create_index("ip_address")
requests = [
    UpdateMany({'ip_address': 'xx.xx.xx.xx'}, {'$set': {'ip_address': 'yy.yy.yy.yy'}}),
    UpdateMany({'ip_address': 'xx.xx.xx.xx'}, {'$set': {'ip_address': 'yy.yy.yy.yy'}}),
]
try:
    db.collection.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    print(bwe)
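Putting the two answers together, here is a minimal sketch of how the original 1500 Excel-driven updates could be issued as a single bulk_write after creating the index. The file name replacements.xlsx, the column names original and replacement, and the connection details are assumptions, not taken from the question:
import pandas as pd
from pymongo import MongoClient, UpdateMany
from pymongo.errors import BulkWriteError

# Assumed connection and database/collection names; adjust to your environment.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

# Index the filter field once so each UpdateMany does not scan the whole collection.
collection.create_index("ip_address")

# Assumed spreadsheet layout: columns 'original' and 'replacement'.
df = pd.read_excel("replacements.xlsx")
requests = [
    UpdateMany({"ip_address": row.original},
               {"$set": {"ip_address": row.replacement}})
    for row in df.itertuples(index=False)
]

try:
    result = collection.bulk_write(requests, ordered=False)
    print(result.modified_count, "documents updated")
except BulkWriteError as bwe:
    print(bwe.details)
With the index in place, one unordered bulk_write of all 1500 operations avoids the per-call round trips of the original loop.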

Performance degradation with Mongo when using bulkWrite with upsert

I am using Mongo Java driver 3.11.1 and Mongo version 4.2.0 for my development. I am still learning Mongo. My application receives data and either has to insert a new document or replace the existing one, i.e. do an upsert.
Each document is 780-1000 bytes as of now, and each collection can have more than 3 million records.
Approach 1: I tried using findOneAndReplace for each document, and it took more than 15 minutes to save the data.
Approach 2: I changed it to bulkWrite using the code below, which took ~6-7 minutes to save 20,000 records.
List<Data> dataList;  // incoming batch of records
List<WriteModel<Document>> updates = new ArrayList<>();
ReplaceOptions updateOptions = new ReplaceOptions().upsert(true);
dataList.forEach(data -> {
    Document updatedDocument = new Document(data.getFields());
    updates.add(new ReplaceOneModel<>(eq("DataId", data.getId()), updatedDocument, updateOptions));
});
final BulkWriteResult bulkWriteResult = mongoCollection.bulkWrite(updates);
Approach 3: I tried using collection.insertMany, which takes 2 seconds to store the data.
As per the driver code, insertMany also internally uses MixedBulkWriteOperation to insert the data, similar to bulkWrite.
My questions are:
a) I have to do an upsert operation. Please let me know if I am making any mistakes.
- I created an index on the DataId field, but it made less than 2 milliseconds of difference in performance.
- I tried using a writeConcern of W1, but performance is still the same.
b) Why is insertMany faster than bulkWrite? I could understand a difference of a few seconds, but I cannot figure out why insertMany takes 2-3 seconds while bulkWrite takes 5-7 minutes.
c) Are there any approaches that can be used to improve this situation?
This problem was solved to a great extent by adding an index on the DataId field. I had previously created an index on DataId, but forgot to create it again after recreating the collection.
This link, How to improve MongoDB insert performance, helped in resolving the problem.
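For readers who want the same pattern outside Java, here is a minimal pymongo sketch of bulk ReplaceOne with upsert=True plus the index on DataId that resolved the problem; the connection details and document shape are assumptions:
from pymongo import MongoClient, ReplaceOne

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["data"]  # assumed collection name

# Without an index on DataId every ReplaceOne filter is a collection scan,
# which is what stretched the bulkWrite from seconds to minutes.
collection.create_index("DataId")

incoming = [{"DataId": 1, "value": "a"}, {"DataId": 2, "value": "b"}]  # assumed shape

requests = [ReplaceOne({"DataId": doc["DataId"]}, doc, upsert=True) for doc in incoming]
result = collection.bulk_write(requests, ordered=False)
print(result.upserted_count, "inserted,", result.modified_count, "replaced")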

Bulk insert using Mongo through Lambda times out and inserts data 3 times

We need to insert data into Mongo from S3. We wrote a Lambda function which simply reads a file (a JSON file) from S3 and, using Mongoose, executes insertMany. When we execute this Lambda, our MongoDB insert takes roughly 7-10 minutes for 10K records. I need help with the following:
Improve the Mongo insert so we can insert 20K records in under 5 minutes and avoid the Lambda timeout.
I am already using ordered: false to expedite the insert in Mongo.
Using the native MongoDB client instead of Mongoose solved the problem. It seems insertMany performs better with the native MongoDB driver compared to Mongoose.

Fetching data from previous record in mongodb

How do I write a MongoDB query in iReport to fetch data from the previous record?
Example,
I have 3 columns: start, end, rest.
'rest' has to be calculated as the difference between the start of one record and the end of the previous record ('start' and 'end' are dates).
There are a few ways to do this. Basically you have to rework the data in the query so that it is in the right format by the time it is printed in JasperReports. This can be accomplished with the runCommand or mapReduce commands, given that you have sufficient permissions. Your best option is probably mapReduce. I suppose you have read this reference.
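If mapReduce is not available to you, one alternative is to compute the rest column on the client before handing the rows to the report. A minimal pymongo sketch, assuming the documents hold start and end as dates and the collection is small enough to iterate (connection and collection names are assumptions):
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["records"]  # assumed names

rows = []
previous_end = None
# Sort by start so that "previous record" is well defined.
for doc in collection.find().sort("start", 1):
    rest = doc["start"] - previous_end if previous_end is not None else None
    rows.append({"start": doc["start"], "end": doc["end"], "rest": rest})
    previous_end = doc["end"]
Since MongoDB dates come back as datetime objects, rest is a timedelta that can be formatted however the report expects.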

mongodb: how can I see the execution time for the aggregate command?

I execute the following MongoDB command in the mongo shell:
db.coll.aggregate(...)
and I see the list of results, but is it possible to see the query execution time? Is there an equivalent of the explain method for aggregation queries?
var before = new Date()
// aggregation query goes here
var after = new Date()
execution_mills = after - before
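The same client-side measurement can be done from pymongo; like the shell version, it times the application round trip rather than the server alone (the pipeline below is a placeholder):
import time
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["coll"]

start = time.perf_counter()
results = list(collection.aggregate([{"$match": {}}]))  # placeholder pipeline
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"aggregation took {elapsed_ms:.1f} ms for {len(results)} documents")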
You can add a time function to your .mongorc.js file (in your home directory):
function time(command) {
    const t1 = new Date();
    const result = command();
    const t2 = new Date();
    print("time: " + (t2 - t1) + "ms");
    return result;
}
and then you can use it like so:
time(() => db.coll.aggregate(...))
Caution
This method doesn't give relevant results for db.collection.find()
I see that in MongoDB it is possible to use these two commands:
db.setProfilingLevel(2)
After running the query, you can then use db.system.profile.find() to see the query execution time and other details.
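The profiler can also be driven from pymongo via the profile database command; here is a small sketch, with the database name and pipeline as assumptions:
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Level 2 profiles every operation; remember to set it back to 0 afterwards.
db.command("profile", 2)

db["coll"].aggregate([{"$match": {}}])  # placeholder pipeline

# Each profile entry records the operation type and its duration in 'millis'.
for entry in db["system.profile"].find().sort("ts", -1).limit(5):
    print(entry["op"], entry.get("millis"), "ms")

db.command("profile", 0)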
Or you can install the excellent mongo-hacker, which automatically times every query, pretty()fies it, colorizes the output, sorts the keys, and more.
I will write an answer to explain this better.
Basically there is no explain() functionality for the aggregation framework yet: https://jira.mongodb.org/browse/SERVER-4504
However there is a way to measure client side but not without its downsides:
You are not measuring the database
You are measuring the application
There are too many unknowns about the in-between parts to get an accurate reading; i.e. you can't say that it took 0.04 ms for the document result to be formulated by the MongoDB server, serialised, sent over the wire, de-serialised by the app and then stored into a hash, allowing you to subtract that sum from the total to get an aggregation benchmark.
That being said, you might be able to get a reasonably accurate result by running it in the MongoDB console on the same server as the mongos/mongod. This leaves very little in between, still too much, but perhaps little enough to get a reading you could roughly trust. In that position you could use @Zagorulkin's answer.