I have a 30 GB MongoDB 3.6 collection with 500k documents. The main _id field is a float timestamp (I didn't manually define an index, but I inserted on the _id field, assuming from the documentation that _id is used as an index and automatically maintained).
Now, to query the most recent data, I do this in Python 3:
cursor = cry.find().sort('_id', pymongo.DESCENDING).limit(600)
df = list(cursor)
However, just querying the last 600 records takes about 1 minute. How can this be if the index is maintained? Is there a faster way to query (such as by natural order), or do I need to re-index, even though the documentation says this is done automatically?
I also tried
cursor = cry.find().skip(cry.count() - 1000)
df = list(cursor)
but this is just as slow.
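For reference, one way to check whether the _id index is actually being used (a minimal sketch assuming pymongo 3.x and the same cry collection as above) is to look at the query plan and, if needed, force the index with a hint:

import pymongo

# Inspect the winning plan; a fast query should show an IXSCAN on the _id index.
plan = cry.find().sort('_id', pymongo.DESCENDING).limit(600).explain()
print(plan['queryPlanner']['winningPlan'])

# Force the default _id index in case the planner is choosing a collection scan.
cursor = cry.find().sort('_id', pymongo.DESCENDING).limit(600).hint('_id_')
df = list(cursor)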
Related
We have a 3 GB collection in MongoDB 4.2 and this Python 3.7 / PyMongo 3.12 function that deletes rows from the collection:
from pymongo import MongoClient

def delete_from_mongo_collection(table_name):
    # connect to the mongo cluster
    cluster = MongoClient(MONGO_URI)
    db = cluster["cbbap"]
    # remove matching rows and return
    query = { 'competitionId': { '$in': [30629, 30630] } }
    db[table_name].delete_many(query)
    return
Here is the relevant info on this collection; note that it has 360 MB worth of indexes, which are set up to speed up retrieval of data from this collection by our Node API, although they may be the problem here.
The delete_many() call is part of a daily pattern where we (a) remove stale data and (b) upload fresh data. However, given that it is taking over an hour to remove the rows matching the query { 'competitionId': { '$in': [30629, 30630] } }, we'd be better off just dropping and re-inserting the entire table. What's frustrating is that competitionId is indexed, and since it is the first field in our compound indexes, I thought deleting rows by it should be very fast. I wonder if having 360 MB of indexes is responsible for the slow deletes?
We cannot use the hint parameter because we are on MongoDB 4.2, not 4.4, and we do not want to upgrade to 4.4 yet, as we are worried about breaking changes in our pipelines and our Node API.
What else can be done here to improve the performance of delete_many()?
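If index maintenance on large deletes turns out to be the bottleneck, one option worth trying is deleting in smaller batches keyed on _id. Below is a minimal sketch assuming pymongo and re-using the MONGO_URI, "cbbap", and query from the question; the batch size of 5,000 is arbitrary.

from pymongo import MongoClient

def delete_in_batches(table_name, batch_size=5000):
    # Re-uses the same MONGO_URI and "cbbap" database as the function above.
    coll = MongoClient(MONGO_URI)["cbbap"][table_name]
    query = { 'competitionId': { '$in': [30629, 30630] } }
    while True:
        # Grab one batch of matching _id values, then delete exactly those documents.
        ids = [doc['_id'] for doc in coll.find(query, {'_id': 1}).limit(batch_size)]
        if not ids:
            break
        coll.delete_many({'_id': {'$in': ids}})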
Let's say you have a collection of 10,000 documents and I make a find query with the option limit(50). How will MongoDB choose which 50 documents to return?
Will it auto-sort them (maybe by their creation date) or not?
Will the query return the same documents every time it is called? How does the limit option work in MongoDB?
Does MongoDB apply the limit after the documents are retrieved or as it queries them? Meaning, will MongoDB scan all documents and then limit the results to 50, or will it read only the 50 documents?
The first 50 documents of the result set will be returned.
If you do not sort the documents (or if the order is not well-defined, such as sorting by a field with values that occur multiple times in the result set), the order may change from one execution to the next.
Will it auto-sort them (maybe by their creation date) or not?
No.
Will the query return the same documents every time it is called?
The query may produce the same results for a while and then start producing different results if, for example, another document is inserted into the collection.
Meaning, will MongoDB scan all documents and then limit the results to 50, or will it read only the 50 documents?
Depends on the query. If an index is used, only the needed documents will be read from the storage engine. If a blocking sort stage is part of the query execution, all matching documents will be read from storage, sorted, and then the required number returned and the rest discarded.
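As an illustration of that last point, comparing the query plans makes the difference visible (a sketch assuming a pymongo Collection object named coll; the createdAt field name is hypothetical):

import pymongo

# Without a sort, the server can stop reading as soon as 50 documents are found.
plan_limit_only = coll.find().limit(50).explain()

# With a sort on a field that has no index, every matching document is read,
# sorted in memory, and everything beyond the first 50 is discarded.
plan_sorted = coll.find().sort('createdAt', pymongo.DESCENDING).limit(50).explain()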
I am using Mongo Java driver 3.11.1 and MongoDB 4.2.0 for my development. I am still learning Mongo. My application receives data and must either insert a new document or replace an existing one, i.e., do an upsert.
Each document is 780-1000 bytes as of now, and each collection can have more than 3 million records.
Approach 1: I tried using findOneAndReplace for each document, and it was taking more than 15 minutes to save the data.
Approach 2: I changed it to bulkWrite using the code below, which resulted in ~6-7 minutes for saving 20,000 records.
List<Data> dataList;
List<WriteModel<Document>> updates = new ArrayList<>();
// Replace the matching document, inserting it if it does not exist (upsert).
ReplaceOptions updateOptions = new ReplaceOptions().upsert(true);
dataList.forEach(data -> {
    Document updatedDocument = new Document(data.getFields());
    updates.add(new ReplaceOneModel<>(eq("DataId", data.getId()), updatedDocument, updateOptions));
});
final BulkWriteResult bulkWriteResult = mongoCollection.bulkWrite(updates);
Approach 3: I tried using collection.insertMany, which takes 2 seconds to store the data.
As per the driver code, insertMany also internally uses MixedBulkWriteOperation for inserting the data, similar to bulkWrite.
My questions are:
a) I have to do an upsert operation. Please let me know if I am making any mistakes.
- I created an index on the DataId field, but it made less than a 2-millisecond difference in performance.
- I tried using a write concern of w:1, but performance is still the same.
b) Why is insertMany's performance so much faster than bulkWrite's? I could understand a difference of a few seconds, but I cannot figure out why insertMany takes 2-3 seconds while bulkWrite takes 5-7 minutes.
c) Are there any approaches that can be used to solve this situation?
This problem was solved to a great extent by adding an index on the DataId field. I had previously created an index on DataId but forgot to re-create it after re-creating the collection.
This link, How to improve MongoDB insert performance, helped in resolving the problem.
I am running a Mongo query like this:
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$exists:true, $ne:[]}}).sort({dateField:-1})
The collection has approx. 10^6 documents. I have indexes on the stringField and dateField (both ascending). This query takes ~3-4 seconds to run.
However, if I change my query to any of the variants below, it executes within 100 ms.
Remove $ne
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$exists:true}}).sort({dateField:-1})
Remove $exists
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$ne:[]}}).sort({dateField:-1})
Remove sort
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$exists:true, $ne:[]}})
Use arrayField.0
db.getCollection('foodfulfilments').find({stringField:"stringValue", "arrayField.0":{$exists:true}}).sort({dateField:-1})
The explain output of these queries does not provide any insight into why the first query is so slow.
MongoDB version 3.4.18
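For reference, two things that may be worth comparing here (a sketch in pymongo; the field and collection names mirror the question, the 'mydb' database name is hypothetical, and the compound index is only a suggestion based on the usual equality-then-sort guideline, not something verified against this data set):

from pymongo import MongoClient, ASCENDING, DESCENDING

coll = MongoClient()['mydb']['collection']  # hypothetical connection and database name

# A compound index supporting both the equality filter and the sort,
# so the sort does not have to happen in memory.
coll.create_index([('stringField', ASCENDING), ('dateField', DESCENDING)])

# executionStats shows how many index keys and documents each variant examines.
plan = coll.database.command(
    'explain',
    {'find': 'collection',
     'filter': {'stringField': 'stringValue', 'arrayField': {'$exists': True, '$ne': []}},
     'sort': {'dateField': -1}},
    verbosity='executionStats')
print(plan['executionStats']['totalDocsExamined'])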
Is sorting 2 million records using Mongo sort possible or not?
The MongoDB documentation clearly states that "when the sort operation consumes more than 32 megabytes, MongoDB returns an error."
But I have a requirement to sort a huge number of records. How do I do it?
It's possible. The documentation states that the 32 MB limit applies only when MongoDB sorts data in memory, i.e., without using an index.
When the sort operation consumes more than 32 megabytes, MongoDB returns an error. To avoid this error, either create an index to support the sort operation or use sort() in conjunction with limit(). The specified limit must result in a number of documents that fall within the 32 megabyte limit.
I suggest that you add an index on the field on which you want to sort, using the createIndex command:
db.coll.createIndex({ sortFieldName : 1 });
If you're sorting on multiple fields, you will need to add a compound index on the fields you're sorting on (the order of the fields in the index matters):
db.coll.createIndex({ sortFieldName1 : 1, sortFieldName2 : 1 });