I want to get a large number (1 million) of documents by their object id, stored in a list obj_ids. So I use
docs = collection.find({'_id' : {'$in' : obj_ids}})
However, when trying to access the documents (e.g. list(docs)) I get
pymongo.errors.DocumentTooLarge: BSON document too large (19889042 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
which confuses me. As I understand it, the 16 MB limit applies to a single document, but I don't think I have any document exceeding that limit:
I did not get this error message when inserting any of the documents in the first place.
This error does not show up if I chunk the ObjectIds into 2 subsets, and recombine the results later.
So if there is not some too big document in my collection, what is the error message about?
Your query document {'_id' : {'$in' : obj_ids}} is what is too large, not the documents in your collection.
You'll need to refactor your approach; maybe do it in batches and join the results.
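For example, a rough sketch in PyMongo (reusing obj_ids and collection from the question; the batch size is an assumption - pick whatever keeps each query document comfortably under the 16 MB limit):
# obj_ids and collection are the same objects as in the question above
BATCH_SIZE = 100_000  # keeps each {'$in': batch} query document well under 16 MB

docs = []
for i in range(0, len(obj_ids), BATCH_SIZE):
    batch = obj_ids[i:i + BATCH_SIZE]
    docs.extend(collection.find({'_id': {'$in': batch}}))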
Related
I cannot find the answer to this seemingly basic question. The documentation says the max size for a BSON document is 16 MB.
I want to know what the maximum allowed size of a query result is. E.g., if there are 1,000 records in a collection and each document is 1 MB, will MongoDB throw an error if a query matches 400 documents (totaling 400 MB)?
My sample use case is to query data about people who have not blocked the user, and there is no limit on the number of people that can block the user.
So my query looks something like
db.collection().find( { followedPersonId: { $nin: [ blockerId1, blockerId2, blockerId3.....] } } )
So the number of array items in the $nin operator can grow to a potentially large number. Is there a limit to the size of this array in MongoDB?
The command size limit is currently set at 48 MB. If your query is bigger than that the driver should fail when trying to serialize it, and the server should fail if it was asked to parse it.
Since your query is technically a query document, I imagine the lower 16 MB BSON document size limit would also apply to it. The 48 MB limit applies to find-and-modify queries that specify a query document and an update document.
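If you want to see how close a query document gets to that limit before sending it, you can encode it to BSON yourself. A sketch using the bson package that ships with PyMongo (blocker_ids is a hypothetical stand-in for the $nin array):
import bson
from bson import ObjectId

# hypothetical list standing in for the blocker ids in the query above
blocker_ids = [ObjectId() for _ in range(100_000)]

query = {'followedPersonId': {'$nin': blocker_ids}}
print(len(bson.encode(query)), 'bytes')  # size of the query document as BSON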
Short answer: no. But when you try to do this the query will become huge, and performance will suffer.
Will the ids number in the crores (tens of millions)?
If so, I would suggest not going with this method.
I want to store my data in a collection that holds a maximum of ~10000 past records before they're deleted automatically from the database, like a FIFO queue of 10k records.
I have been looking at capped collections, but apart from the maximum number of records, they also require the maximum size of the collection. Now, I don't really know how much space 10k records are going to occupy. If I set a size that can't hold 10k records, I'll have fewer than I need. On the other hand, if I set an upper limit on the size, space is going to be wasted because Mongo allocates the space beforehand.
What I can do is get dummy records, but I don't know how to check the size of each document.
Does anyone know of a method that can set an upper limit solely on the number of documents in my collection? I'm using the latest version of Mongo out right now (v3.6.3).
I'd personally push 10,000 actual real documents into a MongoDB collection and then call stats on the collection:
db.test.stats()
Then use the size property:
> db.test.stats()
{
    "ns" : "test.test",
    "size" : 28609,
    "count" : 1018,
    "avgObjSize" : 28,
    ...
}
You may also want to add some padding, say 10% to that number:
cap col size = size * 1.1
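Putting that together, a sketch in PyMongo (the database and collection names are assumptions; the size figure is the one the stats call above reports):
from pymongo import MongoClient

db = MongoClient()["test"]  # assumes a local MongoDB instance

size = db.command("collStats", "test")["size"]  # same figure db.test.stats() reports
cap_size = int(size * 1.1)                      # ~10% padding, as above

# "max" caps the document count; "size" is still required and acts as a hard byte limit
db.create_collection("records", capped=True, size=cap_size, max=10000)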
The MongoDB collection I am working on takes sensor data from a cellphone, and it is pinged to the server roughly every 2-6 seconds.
The data is huge and the 16 MB limit is crossed after 4-5 hours; there doesn't seem to be any workaround for this?
I have tried searching for it on Stack Overflow and gone through various questions, but no one actually shared their hack.
Is there any way, maybe on the DB side, to distribute the data into chunks the way it is done for big files via GridFS?
To fix this problem you will need to make some small amendments to your data structure. By the sounds of it, for your documents to exceed the 16mb limit, you must be embedding your sensor data into an array in a single document.
I would not suggest using GridFS here, I do not believe it to be the best solution, and here is why.
There is a technique known as bucketing that you could employ which will essentially split your sensor readings out into separate documents, solving this problem for you.
The way it works is this:
Lets say I have a document with some embedded readings for a particular sensor that looks like this:
{
_id : ObjectId("xxx"),
sensor : "SensorName1",
readings : [
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" }
]
}
With the structure above there is already a major flaw: the readings array can grow without bound and exceed the 16 MB document limit.
So what we can do is change the structure slightly to look like this, to include a count property:
{
_id : ObjectId("xxx"),
sensor : "SensorName1",
readings : [
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" }
],
count : 3
}
The idea behind this is that when you $push your reading into the embedded array, you increment ($inc) the count variable for every push that is performed. And when you perform this update (push) operation, you include a filter on this "count" property, which might look something like this:
{ count : { $lt : 500} }
Then set your update options so that "upsert" is "true":
db.sensorReadings.update(
{ name: "SensorName1", count { $lt : 500} },
{
//Your update. $push your reading and $inc your count
$push: { readings: ReadingDocumentToPush },
$inc: { count: 1 }
},
{ upsert: true }
)
see here for more info on MongoDb Update and the Upsert option:
MongoDB update documentation
What will happen is, when the filter condition is not met (i.e. when there is either no existing document for this sensor, or the count is greater than or equal to 500 - because you are incrementing it every time an item is pushed), a new document will be created, and the readings will now be embedded in this new document. So you will never hit the 16 MB limit if you do this properly.
Now, when querying the database for readings of a particular sensor, you may get back multiple documents for that sensor (instead of just one with all the readings in it); for example, if you have 10,000 readings, you will get 20 documents back, each with 500 readings.
You can then use the aggregation pipeline and $unwind to filter your readings as if they were their own individual documents.
For more information on unwind see here, it's very useful
MongoDB Unwind
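As a sketch, such a pipeline in PyMongo might look like this (the database name and the cut-off date are assumptions; the sensor and readings fields come from the documents above):
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["test"]  # assumes a local instance; database name is hypothetical
start = datetime(2020, 1, 1, tzinfo=timezone.utc)  # hypothetical cut-off date

pipeline = [
    {"$match": {"sensor": "SensorName1"}},          # all bucket documents for this sensor
    {"$unwind": "$readings"},                       # one output document per reading
    {"$match": {"readings.date": {"$gte": start}}}, # optional filter on individual readings
    {"$sort": {"readings.date": 1}},
]
readings = list(db.sensorReadings.aggregate(pipeline))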
I hope this helps.
You can handle this type of situation using GridFS in MongoDB.
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
The documentation of GridFS contains almost everything you need to implement GridFS. You can follow it.
As your data is a stream, you can try something like the following...
gs.write(data, callback)
where data is a Buffer or a string, and callback gets two parameters - an error object (if an error occurred) and a result value which indicates whether the write was successful or not. While the GridStore is not closed, every write is appended to the opened GridStore.
You can follow this github page for streaming related information.
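If you are using PyMongo rather than the Node driver, a roughly equivalent streaming upload would use GridFSBucket (a sketch; the database name, file name and data source are assumptions):
import gridfs
from pymongo import MongoClient

def incoming_data():
    # hypothetical stand-in for the stream of sensor payloads
    yield b'{"reading": 1}'
    yield b'{"reading": 2}'

db = MongoClient()["test"]          # hypothetical database
bucket = gridfs.GridFSBucket(db)

stream = bucket.open_upload_stream("sensor-dump")  # hypothetical file name
for chunk in incoming_data():
    stream.write(chunk)                            # each write appends to the same GridFS file
stream.close()                                     # finalizes the file document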
Assume a mobile game that is backed by a MongoDB database containing a User collection with several million documents.
Now assume several dozen properties that must be associated with the user - e.g. an array of _id values of Friend documents, their username, photo, an array of _id values of Game documents, last_login date, count of in-game currency, etc, etc, etc..
My concern is whether creating and updating large, growing arrays on many millions of User documents will add any 'weight' to each User document, and/or slowness to the overall system.
We will likely never eclipse 16mb per document, but we can safely say our documents will be 10-20x larger if we store these growing lists directly.
Question: is this even a problem in MongoDB? Does document size even matter if your queries are properly managed using projection and indexes, etc? Should we be actively pruning document size, e.g. with references to external lists vs. embedding lists of _id values directly?
In other words: if I want a user's last_login value, will a query that projects/selects only the last_login field be any different if my User documents are 100kb vs. 5mb?
Or: if I want to find all users with a specific last_login value, will document size affect that sort of query?
One way to rephrase the question is to say, does a 1 million document query take longer if documents are 16mb vs 16kb each.
Correct me if I'm wrong, but from my own experience, the smaller the document size, the faster the query.
I've done queries on 500k documents vs 25k documents and the 25k query was noticeably faster - ranging anywhere from a few milliseconds to 1-3 seconds faster. On production the time difference is about 2x-10x more.
The one aspect where document size comes into play is in query sorting, in which case document size will affect whether the query itself will run or not. I've reached this limit numerous times trying to sort as few as 2k documents.
More references with some solutions here:
https://docs.mongodb.org/manual/reference/limits/#operations
https://docs.mongodb.org/manual/reference/operator/aggregation/sort/#sort-memory-limit
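If the sort happens inside an aggregation pipeline, one workaround (assuming you can tolerate the extra disk I/O) is to let the sort stage spill to disk. A sketch in PyMongo, with hypothetical field names:
# "collection" is assumed to be a PyMongo collection object
cursor = collection.aggregate(
    [
        {"$match": {"status": "active"}},
        {"$sort": {"createdAt": -1}},
    ],
    allowDiskUse=True,  # lets the sort stage use temporary files instead of failing
)
For a plain find().sort(), the usual fix is an index on the sort field instead.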
At the end of the day, it's the end user that suffers.
When I attempt to remedy large queries causing unacceptably slow performance, I usually find myself creating a new collection with a subset of data, and using a lot of query conditions along with a sort and a limit.
Hope this helps!
First of all you should spend a little time reading up on how MongoDB stores documents with reference to padding factors and powerof2sizes allocation:
http://docs.mongodb.org/manual/core/storage/
http://docs.mongodb.org/manual/reference/command/collStats/#collStats.paddingFactor
Put simply, MongoDB tries to allocate some additional space when storing your original document to allow for growth. PowerOf2Sizes allocation became the default approach in version 2.6, where it grows the record allocation in powers of 2.
Overall, performance will be much better if all updates fit within the original size allocation. The reason is that if they don't, the entire document needs to be moved someplace else with enough space, causing more reads and writes and in effect fragmenting your storage.
If your documents are really going to grow in size by a factor of 10x to 20x over time, that could mean multiple moves per document, which, depending on your insert, update and read frequency, could cause issues. If that is the case there are a couple of approaches you can consider:
1) Allocate enough space on initial insertion to cover most (let's say 90%) of normal documents' lifetime growth. While this will be inefficient in space usage at the beginning, efficiency will increase with time as the documents grow without any performance reduction. In effect you will pay ahead of time for storage that you will eventually use later to get good performance over time.
2) Create "overflow" documents - let's say a typical 80-20 rule applies and 80% of your documents will fit in a certain size. Allocate for that amount and add an overflow collection that your document can point to if they have more than 100 friends or 100 Game documents for example. The overflow field points to a document in this new collection and your app only looks in the new collection if the overflow field exists. Allows for normal document processing for 80% of the users, and avoids wasting a lot of storage on the 80% of user documents that won't need it, at the expense of additional application complexity.
In either case I'd consider using covered queries by building the appropriate indexes:
A covered query is a query in which:
all the fields in the query are part of an index, and
all the fields returned in the results are in the same index.
Because the index “covers” the query, MongoDB can both match the query
conditions and return the results using only the index; MongoDB does
not need to look at the documents, only the index, to fulfill the
query.
Querying only the index can be much faster than querying documents
outside of the index. Index keys are typically smaller than the
documents they catalog, and indexes are typically available in RAM or
located sequentially on disk.
More on that approach here: http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/
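For the last_login example from the question, a covered query could look like this in PyMongo (a sketch; the database/collection names and the cut-off date are assumptions):
from datetime import datetime
from pymongo import MongoClient

users = MongoClient()["game"]["users"]   # hypothetical database/collection names
cutoff = datetime(2024, 1, 1)            # hypothetical date

# index every field the query filters on and returns
users.create_index([("last_login", 1)])

# project only indexed fields and exclude _id so the query can be answered from the index alone
cursor = users.find(
    {"last_login": {"$gte": cutoff}},
    {"last_login": 1, "_id": 0},
)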
Just wanted to share my experience when dealing with large documents in MongoDB... don't do it!
We made the mistake of allowing users to include files encoded in base64 (normally images and screenshots) in documents. We ended up with a collection of ~500k documents ranging from 2 MB to 10 MB each.
Doing a simple aggregate in this collection would bring down the cluster!
Aggregate queries can be very heavy in MongoDB, especially with large documents like these. Indexes in aggregates can only be used in some conditions and since we needed to $group, indexes were not being used and MongoDB would have to scan all the documents.
The exact same query in a collection with smaller sized documents was very fast to execute and the resource consumption was not very high.
Hence, querying in MongoDB with large documents can have a big impact in performance, especially aggregates.
Also, if you know that the document will continue to grow after it is created (e.g. like including log events in a given entity (document)) consider creating a collection for these child items because the size can also become a problem in the future.
Bruno.
Short answer: yes.
Long answer: how it will affect the queries depends on many factors, like the nature of the queries, the memory available and the indices sizes.
The best you can do is testing.
The code below will generate two collections named smallDocuments and bigDocuments, with 1024 documents each, differing only by a field 'c' containing a big string (and by their _id). The bigDocuments collection will be about 2 GB, so be careful running it.
const numberOfDocuments = 1024;
// 2MB string x 1024 ~ 2GB collection
const bigString = 'a'.repeat(2 * 1024 * 1024);
// generate and insert documents in two collections: smallDocuments and
// bigDocuments;
for (let i = 0; i < numberOfDocuments; i++) {
let doc = {};
// field a: integer between 0 and 10, equal in both collections;
doc.a = ~~(Math.random() * 10);
// field b: single character between a to j, equal in both collections;
doc.b = String.fromCharCode(97 + ~~(Math.random() * 10));
//insert in smallDocuments collection
db.smallDocuments.insert(doc);
// field c: big string, present only in bigDocuments collection;
doc.c = bigString;
//insert in bigDocuments collection
db.bigDocuments.insert(doc);
}
You can put this code in a file (e.g. create-test-data.js) and run it directly in the mongo shell by typing this command:
mongo testDb < create-test-data.js
It will take a while. After that you can execute some test queries, like these ones:
const numbersToQuery = [];
// generate 100 random numbers to query documents using field 'a':
for (let i = 0; i < 100; i++) {
numbersToQuery.push(~~(Math.random() * 10));
}
const smallStart = Date.now();
numbersToQuery.forEach(number => {
// query using inequality conditions: slower than equality
const docs = db.smallDocuments
.find({ a: { $ne: number } }, { a: 1, b: 1 })
.toArray();
});
print('Small:' + (Date.now() - smallStart) + ' ms');
const bigStart = Date.now();
numbersToQuery.forEach(number => {
// repeat the same queries in the bigDocuments collection; note that the big field 'c'
// is omitted in the projection
const docs = db.bigDocuments
.find({ a: { $ne: number } }, { a: 1, b: 1 })
.toArray();
});
print('Big: ' + (Date.now() - bigStart) + ' ms');
Here I got the following results:
Without index:
Small: 1976 ms
Big: 19835 ms
After indexing field 'a' in both collections, with .createIndex({ a: 1 }):
Small: 2258 ms
Big: 4761 ms
This demonstrates that queries on big documents are slower. Even with the index, the query time for bigDocuments is more than 100% higher than for smallDocuments.
My suggestions are:
Use equality conditions in queries (https://docs.mongodb.com/manual/core/query-optimization/index.html#query-selectivity);
Use covered queries (https://docs.mongodb.com/manual/core/query-optimization/index.html#covered-query);
Use indices that fit in memory (https://docs.mongodb.com/manual/tutorial/ensure-indexes-fit-ram/);
Keep documents small;
If you need phrase queries using text indices, make sure the entire collection fits in memory (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet);
Generate test data and make test queries, simulating your app's use case; use random string generators if needed.
I had problems with text queries in big documents, using MongoDB: Autocomplete and text search memory issues in apostrophe-cms: need ideas
Here there is some code I wrote to generate sample data, in ApostropheCMS, and some test results: https://github.com/souzabrs/misc/tree/master/big-pieces.
This is more a database design issue than a MongoDB internal one. I think MongoDB was made to behave this way. But it would help a lot to have a more obvious explanation of this in its documentation.