Will MongoDB's $set operator on a non-indexed field be expensive? - mongodb

I have a collection which has multiple indexes, and I often have to push some data into an array in that collection. I have tried to go through the MongoDB docs, but the best I could find was:
For inserts and updates to un-indexed fields, the overhead for sparse indexes is less than for non-sparse indexes. Also for non-sparse indexes, updates that do not change the record size have less indexing overhead.
I am aware of the difference between sparse and non-sparse indexes, and it makes sense that the overhead for sparse indexes is lower.
But why is it that, even when I am updating just an un-indexed field in my document, all the other indexes have to be updated? Is it because every index holds the same data and all of it has to be updated?
My Document
var sample = new Schema({
    // ...
    student_list: [{ type: Schema.Types.Mixed }],
    location: [{ type: Schema.Types.Mixed }],
    // ...
});
student_list.studID will be indexed
{ studID: 1, city: "M", Time: "..." }
Now I often have to update the location field. Queries:
db.sample.find({ "student_list.studID": "studid" })
db.sample.find({ "student_list.studID": "studid", "student_list.city": "M" })
all use the student_list_studId_1 index.
Is this approach fine, or should I create a different collection with every student list entry as a separate doc? (Every sample doc will have multiple student IDs, which may be common across different sample docs.)
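For concreteness, the updates in question are roughly of this shape (a sketch only; the pushed values are invented):
db.sample.update(
    { "student_list.studID": "studid" },
    { $push: { location: { city: "M", Time: new Date() } } }
)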

The reason all the indexes may have to be updated by a write like this is connected with document size and allocation.
Let's say a document occupies 1765 bytes and we are adding another 950 bytes (data + BSON overhead). That can trigger a relocation of the document, since it no longer fits in its currently allocated data block, and the db engine then needs to update the pointers in all indexes to point to the new document location.
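To see how close a document is to needing such a move, the shell's Object.bsonsize() helper and the collection stats can be checked (a sketch; the filter value is a placeholder, and paddingFactor is only reported by the MMAPv1 storage engine):
// BSON size, in bytes, of the document about to be updated:
var doc = db.sample.findOne({ "student_list.studID": "studid" });
print(Object.bsonsize(doc));
// Collection-level allocation details (padding factor, power-of-2 sizing on MMAPv1):
db.sample.stats()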

Related

How to efficiently loop through a MongoDB collection in order to update a sequence column?

I am new to MongoDB/Mongoose and have a challenge I'm trying to solve in order to avoid a performance rabbit hole!
I have a MongoDB collection containing a numeric column called 'sequence', and after inserting a new document I need to cycle through the collection starting at the position of the inserted document and increment the value of sequence by one. In this way I maintain a collection of documents numbered from 1 to n (i.e. where n = the number of documents in the collection), and can render the collection as a table in which newly inserted records appear in the right place.
Clearly one way to do this is to loop through the collection, doing seq++ on each iteration, and then using Model.updateOne() to apply the value of seq to sequence for each document in turn.
My concern is that this involves calling updateOne() potentially hundreds of times, which might not be optimal for performance. Any suggestions on how I should approach this in a more efficient way?
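One commonly suggested alternative, shown here only as a hedged sketch (Item and insertedDoc are hypothetical names), is to shift every later document in a single updateMany() call with $inc:
// Inside an async function: shift the sequence of every document at or after the
// new document's position in one round trip, instead of many updateOne() calls.
const newSeq = insertedDoc.sequence;
await Item.updateMany(
    { _id: { $ne: insertedDoc._id }, sequence: { $gte: newSeq } },
    { $inc: { sequence: 1 } }
);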

Unique multi key hashed index in MongoDB

I have a collection with several billion documents and need to create a unique multi-key index for every attribute of my documents.
The problem is, I get an error if I try to do that because the generated keys would be too large.
pymongo.errors.OperationFailure: WiredTigerIndex::insert: key too large to index, failing
I found out MongoDB lets you create hashed indexes, which would resolve this problem, however they are not to be used for multi-key indexes.
How can I resolve this?
My first idea was to add another attribute to each of my documents containing a hash of all of its attribute values, and then create an index on that new field.
However, this would mean recalculating the hash every time I add a new attribute, plus the excessive amount of time needed to compute the hashes and build the indexes.
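A rough sketch of that hash-field idea in the legacy mongo shell (the collection and field names are hypothetical, and as noted it would have to be re-run whenever attributes change):
// Store a hash of all attribute values in a single short field and index that.
db.mycollection.find().forEach(function (doc) {
    // Build a stable string from every attribute except _id and the hash itself.
    var values = Object.keys(doc)
        .filter(function (k) { return k !== "_id" && k !== "attrHash"; })
        .sort()
        .map(function (k) { return k + "=" + JSON.stringify(doc[k]); })
        .join("|");
    db.mycollection.update({ _id: doc._id }, { $set: { attrHash: hex_md5(values) } });
});
// The short hash carries the unique constraint instead of the raw values.
db.mycollection.ensureIndex({ attrHash: 1 }, { unique: true });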
The key too large error comes from a restriction added in MongoDB 2.6 that prevents the total size of an index entry from exceeding 1024 bytes (also known as the Index Key Length Limit).
In MongoDB 2.6, if you attempt to insert or update a document so that the value of an indexed field is longer than the Index Key Length Limit, the operation will fail and return an error to the client. In previous versions of MongoDB, these operations would successfully insert or modify a document but the index or indexes would not include references to the document.
For migration purposes and other temporary scenarios you can revert to the 2.4 handling of this case, in which the error is not raised, by setting this MongoDB server flag:
db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
This however is not recommended.
Also consider that creating indexes for every attribute of your documents may not be the optimal solution at all.
Have you examined how you query your documents and which fields you key on? Have you used explain to view the query plan? It would be an exception to the rule if you told us that you query on all fields all the time.
Here are the recommended MongoDB indexing strategies.
Excessive indexing has a price as well and should be avoided.
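As a starting point for that analysis, explain() shows whether a representative query uses an index at all (the collection and field names below are placeholders):
// "executionStats" adds counters such as totalDocsExamined to the plan output.
db.mycollection.find({ someField: "someValue" }).explain("executionStats")
// Look for IXSCAN vs. COLLSCAN in winningPlan, and compare nReturned with totalDocsExamined.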

Slow creation of four-field index in MongoDB

I have a ProductRequest collection in MongoDB. It is a somewhat large collection, but does not have that many documents: the number of documents is a bit over 300,000, but the average document size is close to 1 MB, so the data footprint is large.
To speed up certain queries I am setting up index on this collection:
db.ProductRequest.ensureIndex({ processed: 1, parsed: 1, error: 1, processDate: 1 })
The first three fields are Boolean, the last one is a datetime.
The command has been running for almost 24 hours and has not come back.
I already have an index on the 'processed' and 'parsed' fields (together) and a separate one on 'error'. Why does the creation of this four-field index take forever? My understanding is that the size of an individual record should not matter in this case; am I wrong?
Additional Info:
MongoDB version 2.6.1 64-bit
Host OS Centos 6.5
Sharding: yes, the shard key is _id. Number of shards: 2; each shard is a replica set with 3 members.
I believe it's because you are putting indexes on Boolean fields.
Since there are only two values (true or false), with 300,000 rows an index on such a field still has to scan on the order of 150,000 entries to find all matching documents, and in your case you have 3 Boolean fields, which makes it even slower.
You won't see a huge benefit from an index on those three fields and processDate compared to an index just on processDate. Indexes on boolean fields aren't very useful in the presence of other index-able fields because they aren't very selective. If you give a process date, there are only 8 possibilities for the combination of the other fields to further narrow down the results via the index.
Also, you should switch the order. Put processDate first as it is much more selective than a boolean field. That should greatly simplify the index and speed up the index build.
Finally, index creation in MongoDB is sometimes unavoidably slow and expensive because it involves creating large B-trees. The payoff, which is absolutely worth it, of course, is faster queries. It's possible that more than 24 hours are needed for an index build. Have you checked what the saturated resource is? It's likely the CPU for an index build. Your best option for this case is to create the index in the background. Background index builds
don't block read and write operations for the duration, unlike foreground index builds
take longer
produce initially larger indexes that will converge to the size of an equivalent foreground index over time
You set an index build to occur in the background with an extra option to the ensureIndex call:
db.myCollection.ensureIndex({ "myField" : 1 }, { "background" : 1 })
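Putting the two suggestions together for this collection, the build might look like the following (a sketch; verify with explain that this field order actually matches your queries):
// processDate first (most selective), the Boolean flags after, built in the background
db.ProductRequest.ensureIndex(
    { processDate: 1, processed: 1, parsed: 1, error: 1 },
    { background: true }
)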

MongoDB: does document size affect query performance?

Assume a mobile game that is backed by a MongoDB database containing a User collection with several million documents.
Now assume several dozen properties that must be associated with the user - e.g. an array of _id values of Friend documents, their username, photo, an array of _id values of Game documents, last_login date, count of in-game currency, etc, etc, etc..
My concern is whether creating and updating large, growing arrays on many millions of User documents will add any 'weight' to each User document, and/or slowness to the overall system.
We will likely never eclipse 16mb per document, but we can safely say our documents will be 10-20x larger if we store these growing lists directly.
Question: is this even a problem in MongoDB? Does document size even matter if your queries are properly managed using projection and indexes, etc? Should we be actively pruning document size, e.g. with references to external lists vs. embedding lists of _id values directly?
In other words: if I want a user's last_login value, will a query that projects/selects only the last_login field be any different if my User documents are 100kb vs. 5mb?
Or: if I want to find all users with a specific last_login value, will document size affect that sort of query?
One way to rephrase the question is to ask: does a query over 1 million documents take longer if the documents are 16 MB each vs. 16 KB each?
Correct me if I'm wrong, but from my own experience, the smaller the document size, the faster the query.
I've done queries on 500k documents vs 25k documents and the 25k query was noticeably faster - ranging anywhere from a few milliseconds to 1-3 seconds faster. On production the time difference is about 2x-10x greater.
The one aspect where document size really comes into play is in query sorting, in which case document size can determine whether the query will run at all. I've hit that limit numerous times trying to sort as few as 2k documents.
More references with some solutions here:
https://docs.mongodb.org/manual/reference/limits/#operations
https://docs.mongodb.org/manual/reference/operator/aggregation/sort/#sort-memory-limit
At the end of the day, it's the end user who suffers.
When I attempt to remedy large queries causing unacceptably slow performance, I usually find myself creating a new collection with a subset of the data, and using a lot of query conditions along with a sort and a limit.
Hope this helps!
First of all you should spend a little time reading up on how MongoDB stores documents with reference to padding factors and powerof2sizes allocation:
http://docs.mongodb.org/manual/core/storage/
http://docs.mongodb.org/manual/reference/command/collStats/#collStats.paddingFactor
Put simply, MongoDB tries to allocate some additional space when storing your original document to allow for growth. powerOf2Sizes allocation became the default approach in version 2.6, where it grows the record allocation in powers of 2.
Overall, performance will be much better if all updates fit within the original size allocation. The reason is that if they don't, the entire document needs to be moved someplace else with enough space, causing more reads and writes and in effect fragmenting your storage.
If your documents are really going to grow in size by a factor of 10x to 20x over time, that could mean multiple moves per document, which, depending on your insert, update and read frequency, could cause issues. If that is the case, there are a couple of approaches you can consider:
1) Allocate enough space on initial insertion to cover most (let's say 90%) of a normal document's lifetime growth. While this will be inefficient in space usage at the beginning, efficiency will increase with time as the documents grow without any performance reduction. In effect you pay ahead of time for storage that you will eventually use later, in order to get good performance over time.
2) Create "overflow" documents - let's say a typical 80-20 rule applies and 80% of your documents will fit in a certain size. Allocate for that amount and add an overflow collection that your documents can point to if they have, say, more than 100 friends or 100 Game documents. The overflow field points to a document in this new collection, and your app only looks in the new collection if the overflow field exists (see the sketch below). This allows normal document processing for 80% of the users and avoids wasting a lot of storage on the 80% of user documents that won't need the extra space, at the expense of additional application complexity.
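A minimal sketch of the overflow idea, assuming a hypothetical friends_overflow field and user_overflow collection:
// Read path: most users are served from the embedded array alone.
var user = db.users.findOne({ _id: someUserId });   // someUserId is a placeholder
var friends = user.friends || [];
if (user.friends_overflow) {
    // Only the ~20% of oversized users pay this extra lookup.
    var extra = db.user_overflow.findOne({ _id: user.friends_overflow });
    friends = friends.concat(extra.friends);
}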
In either case I'd consider using covered queries by building the appropriate indexes:
A covered query is a query in which:
all the fields in the query are part of an index, and
all the fields returned in the results are in the same index.
Because the index "covers" the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query.
Querying only the index can be much faster than querying documents outside of the index. Index keys are typically smaller than the documents they catalog, and indexes are typically available in RAM or located sequentially on disk.
More on that approach here: http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/
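For the last_login example from the question, a covered query could look like this (a sketch; it assumes an index on last_login, and _id must be excluded so that only indexed fields are returned):
// Index that can cover the query:
db.users.ensureIndex({ last_login: 1 })
// Both the filter and the projected field are in the index and _id is excluded,
// so MongoDB can answer this from the index without touching the documents.
db.users.find({ last_login: someDate }, { last_login: 1, _id: 0 })   // someDate is a placeholder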
Just wanted to share my experience when dealing with large documents in MongoDB... don't do it!
We made the mistake of allowing users to include files encoded in base64 (normally images and screenshots) in documents. We ended up with a collection of ~500k documents ranging from 2 Mb to 10 Mb each.
Doing a simple aggregate in this collection would bring down the cluster!
Aggregate queries can be very heavy in MongoDB, especially with large documents like these. Indexes can only be used by aggregations under certain conditions, and since we needed to $group, indexes were not being used and MongoDB had to scan all the documents.
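For illustration, the aggregations in question were roughly of this shape (a sketch; the collection and field names are invented):
// With only a $group stage the indexes were not used, as described above,
// so every multi-MB document in the collection had to be read.
db.bigCollection.aggregate([
    { $group: { _id: "$category", count: { $sum: 1 } } }
])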
The exact same query in a collection with smaller sized documents was very fast to execute and the resource consumption was not very high.
Hence, querying MongoDB with large documents can have a big impact on performance, especially for aggregates.
Also, if you know that a document will continue to grow after it is created (e.g. appending log events to a given entity/document), consider creating a separate collection for these child items, because the size can also become a problem in the future.
Bruno.
Short answer: yes.
Long answer: how it will affect the queries depends on many factors, like the nature of the queries, the memory available and the sizes of the indexes.
The best you can do is testing.
The code below will generate two collections named smallDocuments and bigDocuments, with 1024 documents each, differing only in a field 'c' containing a big string (and the _id). The bigDocuments collection will take about 2 GB, so be careful running it.
const numberOfDocuments = 1024;
// 2MB string x 1024 ~ 2GB collection
const bigString = 'a'.repeat(2 * 1024 * 1024);

// generate and insert documents in two collections: smallDocuments and bigDocuments
for (let i = 0; i < numberOfDocuments; i++) {
    let doc = {};
    // field a: integer between 0 and 9, equal in both collections;
    doc.a = ~~(Math.random() * 10);
    // field b: single character between a and j, equal in both collections;
    doc.b = String.fromCharCode(97 + ~~(Math.random() * 10));
    // insert into the smallDocuments collection
    db.smallDocuments.insert(doc);
    // field c: big string, present only in the bigDocuments collection;
    doc.c = bigString;
    // insert into the bigDocuments collection
    db.bigDocuments.insert(doc);
}
You can put this code in a file (e.g. create-test-data.js) and run it directly in the mongo shell with this command:
mongo testDb < create-test-data.js
It will take a while. After that you can execute some test queries, like these ones:
const numbersToQuery = [];
// generate 100 random numbers to query documents using field 'a':
for (let i = 0; i < 100; i++) {
    numbersToQuery.push(~~(Math.random() * 10));
}

const smallStart = Date.now();
numbersToQuery.forEach(number => {
    // query using inequality conditions: slower than equality
    const docs = db.smallDocuments
        .find({ a: { $ne: number } }, { a: 1, b: 1 })
        .toArray();
});
print('Small: ' + (Date.now() - smallStart) + ' ms');

const bigStart = Date.now();
numbersToQuery.forEach(number => {
    // repeat the same queries in the bigDocuments collection; note that the big
    // field 'c' is omitted in the projection
    const docs = db.bigDocuments
        .find({ a: { $ne: number } }, { a: 1, b: 1 })
        .toArray();
});
print('Big: ' + (Date.now() - bigStart) + ' ms');
Here I got the following results:
Without index:
Small: 1976 ms
Big: 19835 ms
After indexing field 'a' in both collections, with .createIndex({ a: 1 }):
Small: 2258 ms
Big: 4761 ms
This demonstrates that queries on big documents are slower. Even with the index, the time for bigDocuments is more than 100% higher than for smallDocuments.
My suggestions are:
Use equality conditions in queries (https://docs.mongodb.com/manual/core/query-optimization/index.html#query-selectivity);
Use covered queries (https://docs.mongodb.com/manual/core/query-optimization/index.html#covered-query);
Use indices that fit in memory (https://docs.mongodb.com/manual/tutorial/ensure-indexes-fit-ram/);
Keep documents small;
If you need phrase queries using text indices, make sure the entire collection fits in memory (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet);
Generate test data and make test queries, simulating your app's use case; use random string generators if needed.
I had problems with text queries in big documents, using MongoDB: Autocomplete and text search memory issues in apostrophe-cms: need ideas
Here there is some code I wrote to generate sample data, in ApostropheCMS, and some test results: https://github.com/souzabrs/misc/tree/master/big-pieces.
This is more a database design issue than a MongoDB internal one. I think MongoDB was made to behave this way, but it would help a lot to have a more explicit explanation of it in the documentation.

MongoDB vs Columnar

Is MongoDB a good fit when there are several combinations of columns used for querying, so that creating indexes on all of the columns is not feasible? How does MongoDB perform when, say, you have no index on a column and you have millions of entries for that column?
If you have no index, a table scan is performed, as with any database system.
If the documents are in memory this will still be relatively fast, but it will still take time proportional to the number of documents in the collection, as the database must look at each one: O(n).
Is the problem that you have a small set of varying keys per document, or a large number of keys that every document must have?
Column-oriented datastores must store a large number of columns to model varying attributes, but MongoDB is more flexible because of its document data model.
If you have documents that have a small number of varying attributes (out of a large set of attributes), this is indexable and will be O(log n).
Your documents would look like this:
{
    "name": "some name",
    "attrs": [
        { "n": "subject", "v": "the subject" },
        { "n": "description", "v": "Some amazing description" },
        { "n": "comments", "v": "Comments on this thing" }
    ]
}
They would be indexable like this:
db.mycollection.ensureIndex({"attrs.n":1, "attrs.v":1})
and be queryable like this:
db.mycollection.find({attrs: {$elemMatch: {n: "subject", v: "the subject"}}})