Error when fetching big data using GridFS (MongoDB)

When I try to fetch a file (300 MB) stored in MongoDB via GridFS, I get this error:
2014-07-16T22:50:10.201+0200 [conn1139] assertion 17144 Runner error:
Overflow sort stage buffered data usage of 33563462 bytes exceeds internal limit of 33554432 bytes ns:myproject.export_data.chunks query:{ $query: { files_id: ObjectId('53c6e5485f00005f00c6bae6'), n: { $gte: 0, $lte: 1220 } }, $orderby: { n: 1 } }
I found something similar, but it's already fixed:
https://jira.mongodb.org/browse/SERVER-13611
I'm using MongoDB 2.6.3

Not sure which driver or driver version you are using, but it is clear that your implementation is issuing a "sort", and without an index you are blowing past the 32 MB in-memory sort limit when pulling in the chunks over a range.
Better driver implementations do not do this and instead "cycle" through the chunks with individual queries. But the problem here is that your collection is missing the index it needs, either because of your own setup or because of the driver implementation that created this collection.
It seems you have named your "root" space "export_data", so switch to the database containing the GridFS collections and issue the following:
db.export_data.chunks.ensureIndex( { files_id: 1, n: 1 }, { unique: true } )
Or add something in your application code that does this to ensure the index exists.

This is not a bug. As the error message says, it's clearly about the sort, not about GridFS. Read this section about the sort limitation:
MongoDB will only return sorted results on fields without an index if the sort operation uses less than 32 megabytes of memory.
Which means your sort aborts once it uses more than 32 MB of memory without an index.
It would help if you could post the statements you are executing.
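As a concrete check, here is a minimal shell sketch that reuses the chunk query and the ObjectId from the error message above (adjust them to your own file):

// The chunk range query the driver issues, with the sort it asks for.
db.export_data.chunks.find(
    { files_id: ObjectId('53c6e5485f00005f00c6bae6'), n: { $gte: 0, $lte: 1220 } }
).sort({ n: 1 }).explain()

// Without an index on { files_id: 1, n: 1 } the plan reports scanAndOrder: true,
// i.e. an in-memory sort, which is what hits the 32 MB limit. Creating the index
// suggested in the other answer lets the sort walk the index instead:
db.export_data.chunks.ensureIndex({ files_id: 1, n: 1 }, { unique: true })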

Related

MongoDB takes an hour to delete_many() 1GB of data

We have a 3 GB collection in MongoDB 4.2 and this Python 3.7 / PyMongo 3.12 function that deletes rows from the collection:
from pymongo import MongoClient

def delete_from_mongo_collection(table_name):
    # connect to the mongo cluster
    cluster = MongoClient(MONGO_URI)
    db = cluster["cbbap"]
    # remove rows matching the query and return
    query = { 'competitionId': { '$in': [30629, 30630] } }
    db[table_name].delete_many(query)
    return
Here is the relevant info on this collection. Note that it has 360 MB worth of indexes, which are there to speed up retrievals from this collection by our Node API, although they may be the problem here.
The delete_many() is part of a pattern where we (a) remove stale data and (b) upload fresh data each day. However, given that it takes over an hour to remove the rows matching the query { 'competitionId': { '$in': [30629, 30630] } }, we'd be better off just dropping and re-inserting the entire table. What's frustrating is that competitionId is indexed, and since it is the first field in our compound indexes, I thought deleting rows via that index should be very fast. I wonder if having 360 MB of indexes is responsible for the slow deletes?
We cannot use the hint parameter because we are on MongoDB 4.2, not 4.4, and we do not want to upgrade to 4.4 yet, as we are worried about major breaking changes in our pipelines and our Node API.
What else can be done here to improve the performance of delete_many()?
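One thing worth checking, as a minimal shell sketch (using a hypothetical collection name games in place of table_name), is which plan the server picks for the delete filter:

// Ask the planner how it would execute the same filter the delete uses.
// An IXSCAN on a competitionId index means the matching rows are found quickly;
// if so, the hour is more likely spent removing each deleted document from all
// of the collection's indexes rather than locating the documents.
db.games.find({ competitionId: { $in: [30629, 30630] } }).explain("executionStats")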

How to query data efficiently in large mongodb collection?

I have one big MongoDB collection (3 million docs, 50 GB), and querying the data is very slow even though I have created indexes.
db.collection.find({"C123":1, "C122":2})
e.g. the query times out or is extremely slow (10 s at least), even though I have created separate indexes for C123 and C122.
Should I create more indexes or add physical memory to speed up the query?
For such a query you should create a compound index on both fields; then it should be very efficient. Creating separate indexes won't help much, because the engine will use one of them to satisfy the first part of the query, and the second, if it is used at all, adds little (it can even slow the query down in some cases because of the extra lookup in the index and then in the documents again). You can confirm which indexes are used by running .explain() on your query in the shell.
See compound indexes:
https://docs.mongodb.com/manual/core/index-compound/
Also consider the sort directions of both fields when creating the index.
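For example, a minimal shell sketch, reusing the collection and field names from the question:

// One compound index covering both equality predicates; field order matters.
db.collection.createIndex({ C123: 1, C122: 1 })

// Verify the planner actually uses it: the winning plan should show an IXSCAN
// on C123_1_C122_1 rather than a COLLSCAN.
db.collection.find({ "C123": 1, "C122": 2 }).explain("executionStats")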
The answer is really simple.
You don't need to create more indexes, you need to create the right indexes. An index on field c124 won't help queries on field c123, so there is no point in creating it.
Use better/more hardware: more RAM, more machines (sharding).
Create the right indexes and use compound indexes carefully. (You can have at most 64 indexes per collection and 31 fields in a compound index.)
Use server-side (Mongo) pagination.
Try to find the most frequently used queries and build a compound index around them.
Compound indexes strictly follow field order, so read the documentation and experiment.
Also try covered queries for 'summary'-like queries (see the sketch below).
Learned it the hard way.
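For the covered-query point, a minimal sketch, assuming the { C123: 1, C122: 1 } compound index suggested above:

// A query is covered when both the filter and the projection touch only indexed
// fields and _id is excluded, so MongoDB can answer it from the index alone.
db.collection.find(
    { "C123": 1, "C122": 2 },
    { _id: 0, C123: 1, C122: 1 }
)
// .explain("executionStats") on this query should report totalDocsExamined: 0.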
Use skip and limit. Run a loop that processes 50000 documents at a time.
https://docs.mongodb.com/manual/reference/method/cursor.skip/
https://docs.mongodb.com/manual/reference/method/cursor.limit/
Example:

// Replace db.collection with the collection you are paginating over.
db.collection.aggregate(
    [
        {
            $group: {
                _id: "$myDoc.homepage_domain",
                count: { $sum: 1 },
                entry: {
                    $push: {
                        location_city: "$myDoc.location_city",
                        homepage_domain: "$myDoc.homepage_domain",
                        country: "$myDoc.country",
                        employee_linkedin: "$myDoc.employee_linkedin",
                        linkedin_url: "$myDoc.linkedin_url",
                        homepage_url: "$myDoc.homepage_url",
                        industry: "$myDoc.industry",
                        read_at: "$myDoc.read_at"
                    }
                }
            }
        },
        { $skip: 50000 },
        { $limit: 50000 }
    ],
    { allowDiskUse: true }
).forEach(function (myDoc) {
    // Adjust the field paths below to match the shape of the aggregation output.
    print(
        db.Or9.insert({
            "HomepageDomain": myDoc.homepage_domain,
            "location_city": myDoc.location_city
        })
    );
});

MongoDB: Index is pretty slow on 100+ mio docs

I'm doing a count on a collection with more than 100 million documents.
My query is:
{
"domain": domain,
"categories" : "buzz",
"visit.timestamp" : { "$gte": date_from, "$lt": date_to },
}
I project only _id.
I have some indexes on it, for example:
{ "visit.timestamp": -1 }
and compound index like:
{ "visit.timestamp": -1, "domain": 1, "categories" : 1 }
A count over, for example, the last 30 days gives results in ~30 seconds.
An explain() shows me that the query uses the simplest index: { "visit.timestamp": -1 }
So I tried to force the compound indexes in other orders:
{ "categories" : 1, "domain": 1, "visit.timestamp": -1 }
{ "domain": 1, "categories" : 1, "visit.timestamp": -1 }
Then the query uses one of them, but the result takes much longer: ~60 seconds in the first case, and more than 241 seconds for the other one!
Note 1: It's the same result with the aggregation framework, but that's not surprising.
Note 2: "visit.timestamp" is an ISODate. Each document is more recent than the previous one.
Note 3: The count returns ~1.4 million documents (out of ~105 million) but examined 12 million docs (see below).
Question:
1/ I don't get why the query takes longer when using an index that should cover it completely. Do you have an explanation?
2/ Do you have any hint to improve the response time of this query?
The explain() shows that the query looked at:
"totalKeysExamined": 12628476,
"totalDocsExamined": 12628476,
As far as I can understand, that is because only the visit.timestamp index is used, so all docs within the time frame have to be examined.
Second question:
Make sure the problem is in MongoDB's scope. Isolate it from your application code and I/O by connecting to (one of) your MongoDB server(s) locally and executing the query there.
Does it happen locally too? Check the CPU and disk health of your server(s).
CPU(s) and disk(s) all fine? Make sure your index fits into RAM. Citing from MongoDB's FAQ:
What happens if an index does not fit into RAM?
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
In certain cases, an index does not need to fit entirely into RAM. For details, see Indexes that Hold Only Recent Values in RAM.
First question:
Maybe your index doesn't fit into RAM, and making it compound may increase the number of I/O operations to disk. I'm no MongoDB expert, though.
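To check the does-the-index-fit-in-RAM hypothesis, a minimal shell sketch (using a hypothetical collection name visits, since the question doesn't give one):

// Combined size of all indexes on the collection, in bytes.
db.visits.totalIndexSize()

// Per-index breakdown; compare these numbers against the RAM that is actually
// available for the working set on the server.
db.visits.stats().indexSizes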

My Mongo query is too large and I'm reaching a memory issue

I'm hitting some sort of RAM limit when running this query; here's the error:
The operation: #<Moped::Protocol::Query
#length=100
#request_id=962
#response_to=0
#op_code=2004
#flags=[]
#full_collection_name="test_db.cases"
#skip=1650
#limit=150
#selector={"$query"=>{}, "$orderby"=>{"created_at"=>1}}
#fields=nil>
failed with error 17144: "Runner error: Overflow sort stage buffered data usage of 33555783 bytes exceeds internal limit of 33554432 bytes"
See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.
There are two solutions I can think of:
1) Up the buffer limit. This requires Mongo 2.8, which is an unstable release that I'd have to install manually.
2) Break apart the query? Chunk it? This is what the query looks like:
upload_set = Case.all.order_by(:created_at.asc).skip(#set_skipper).limit(150).each_slice(5).to_a
#set_skipper grows by 150 every time the method is called.
Any help?
From http://docs.mongodb.org/manual/reference/limits/
Sorted Documents
MongoDB will only return sorted results on fields without an index if the combined size of all documents in the sort operation, plus a small overhead, is less than 32 megabytes.
Did you try using an index on created_at? That should remove that limitation.
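For example, a minimal shell sketch, assuming the database and collection from the error output (test_db.cases):

use test_db
// Sorting on an indexed field lets MongoDB walk the index in order instead of
// buffering documents in memory, so the 32 MB sort limit no longer applies.
db.cases.ensureIndex({ created_at: 1 })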

retrieve large number of records with mongoDB in a reasonable time

I'm using MongoDB to store a query log and get some stats about it. The objects I store in MongoDB contain the text of the query, the date, the user, whether the user clicked on some results, etc.
Now I'm trying to retrieve, with Java, all the queries not clicked by a user on a certain day. My code is approximately this:
DBObject query = new BasicDBObject();
BasicDBObject keys = new BasicDBObject();
keys.put("Query", 1);
query.put("Date", new BasicDBObject("$gte", beginning.getTime()).append("$lte", end.getTime()));
query.put("IsClick", false);
...
DBCursor cur = mongoCollection.find(query, keys).batchSize(5000);
The output of the query contains about 20k records that I need to iterate over.
The problem is that it takes minutes :( I don't think that's normal.
From the server log I see:
Wed Nov 16 16:28:40 query db.QueryLogRecordImpl ntoreturn:5000 reslen:252403 nscanned:59260 { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false } nreturned:5000 2055ms
Wed Nov 16 16:28:40 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false } bytes:232421 nreturned:5000 170ms
Wed Nov 16 16:30:27 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false } bytes:128015 nreturned:2661 --> 106059ms
So retrieving the first chunk takes 2 seconds, the second 0.1 seconds, and the third 106 seconds! Weird.
I tried changing the batch size, creating indexes on Date and IsClick, and rebooting the machine :P but no luck. What am I doing wrong?
There are several factors here that can affect speed, and it will be necessary to gather some extra data to identify the cause.
Some potential issues:
Indexes: are you using the right indexes? You should probably be indexing on IsClick/Date. That puts the range second, which is the normal suggestion (see the sketch below). Note that this is different from indexing on Date/IsClick; order is important. Try .explain() on your query to see which indexes are being used.
Data Size: in some cases, slowness can be caused by too much data. This could be too many documents or too many large documents. It can also be caused by trying to find too many needles in a really large haystack. You are bringing back 252k in data (reslen) and 12k documents, so this is probably not the problem.
Disk IO: MongoDB uses memory-mapped files and therefore uses lots of virtual memory. If you have more data than RAM then fetching certain documents requires "going to disk". Going to disk can be a very expensive operation. You can identify "going to disk" by using tools like iostat or resmon (Windows) to monitor the disk activity.
Based on personal experience, I strongly suspect #3, with a possible exacerbation from #1. I would start with watching the IO while running a .explain() query. This should quickly narrow down the range of possible problems.
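For point 1, a minimal shell sketch, assuming the collection from the server log (QueryLogRecordImpl):

// Equality field first, range field second, as suggested above.
db.QueryLogRecordImpl.ensureIndex({ IsClick: 1, Date: 1 })

// With that index the plan should show a BtreeCursor on IsClick_1_Date_1 and an
// nscanned close to nreturned, instead of scanning far more documents than it returns.
db.QueryLogRecordImpl.find({
    Date: { $gte: 1283292000000, $lte: 1283378399999 },
    IsClick: false
}).explain()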