MongoDB vs Columnar - mongodb

Is MongoDB a good fit when several combinations of columns are used for querying, so that creating indexes on all of the columns is not feasible? How does MongoDB perform when, say, you have no index on a column and you have millions of entries for that column?

If you have no index, a table scan is performed, as with any database system.
If the documents are in memory this will still be relatively fast, but the time grows with the number of documents in the collection because the database must look at each one: O(n).
Is the problem that you have a small set of varying keys per document, or a large number of keys that every document must have?
Column-oriented datastores must store a large number of columns to model varying attributes, but MongoDB is more flexible because of the document data model.
If you have documents that have a small number of varying attributes (out of a large set of attributes), this is indexable and lookups will be O(log n).
Your documents would look like this:
{
"name":"some name",
"attrs":[
{"n":"subject","v":"the subject"},
{"n":"description","v":"Some amazing description"},
{"n":"comments","v":"Comments on this thing"},
]
}
They can be indexed like this (createIndex replaces the older, deprecated ensureIndex):
db.mycollection.createIndex({"attrs.n":1, "attrs.v":1})
and queried like this:
db.mycollection.find({attrs: {$elemMatch: {n: "subject", v: "the subject"}}})
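As a rough illustration of the shape described above, here is a minimal sketch in plain JavaScript with no driver calls. The helper names toAttrs and attrQuery are hypothetical and not part of MongoDB; they only build the "attrs" array and the $elemMatch filter shown in the answer.

```javascript
// Hypothetical helpers (not part of MongoDB): convert a flat object of
// varying attributes into the {n, v} array form, and build the matching filter.
function toAttrs(flat) {
  return Object.entries(flat).map(([n, v]) => ({ n, v }));
}

function attrQuery(name, value) {
  // $elemMatch makes both conditions apply to the same array element.
  return { attrs: { $elemMatch: { n: name, v: value } } };
}

const doc = {
  name: "some name",
  attrs: toAttrs({
    subject: "the subject",
    description: "Some amazing description",
  }),
};

console.log(JSON.stringify(doc.attrs));
console.log(JSON.stringify(attrQuery("subject", "the subject")));
```

Only the attributes a document actually has end up in the array, so a single compound index on "attrs.n" and "attrs.v" serves every attribute name.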

Related

Sharding with array in Cloud Firestore with composite index

I have read in the documentation, that writes per second can be limited to 500 per second if a collection has sequential values with an index.
I can add a shard field to avoid this.
Therefore I should add the shard field before the sequential field in a composite index.
But what if my sequential field is an array?
An array must always be the first field in a composite index.
For example:
I have a Collection "users" with an array field "reminders".
The field reminders contains time strings like ["12:15", "17:45", "20:00", ...].
I think these values could result in hot spotting but maybe I am wrong.
I don't know how Firestore handles arrays in composite indexes.
Could my reminders array slow down the writes per second? And if so, how could I implement a shard field? Or is there a completely different solution?

What is the correct way to Index in MongoDB when big combination of fields exist

Consider a search panel that includes multiple filter options.
I'm working with MongoDB and created a compound index on 3-4 properties in a specific order.
But when I run different combinations of searches, I see a different execution plan in explain() each time. Sometimes it is a collection scan (bad), and sometimes it fits the index (IXSCAN).
The selective fields that should be handled by indexes are: Brand, Types, Status, Warehouse, Carries, and Search (by id only).
My question is:
Do I have to create every combination of all fields in different orders? That could be 10-20 compound indexes. Or should I create 1-3 big compound indexes, which again would not solve the ordering problem?
What is the best strategy for dealing with a large variety of field combinations?
I use queries with the same structure but different combinations of fields.
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
  '$and': [
    { 'status': { '$in': ['XXX', 'YYY'] } },
    { 'searchId': { '$in': ['3859447'] } },
    { 'origin.brand': { '$in': ['aaaa', 'bbbb', 'cccc', 'ddd', 'eee', 'bundle'] } },
    {
      '$or': [
        { 'origin.carries': 'YYY' },
        { 'origin.carries': 'ZZZ' },
        { 'origin.carries': 'WWWW' }
      ]
    }
  ]
}).sort({"timestamp":1})
// My compound index is:
{status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1}
But that covers only one combination. There could be plenty more, like:
a. {status: 1}
b. {status: 1, searchId: -1}
c. {status: 1, searchId: -1, "origin.brand": 1}
d. {status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1}
...
Additionally, what will happen to read/write performance? I think writes will slow down more than reads.
The query patterns are:
1. find(...) with '$and'/'$or' + sort
2. Aggregation with $match/$sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies with the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If a particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
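To make the selectivity argument above concrete, here is a small sketch that measures it on an in-memory sample. The selectivity function and the sample orders are hypothetical and made up for illustration; on a real collection the same ratio would come from document counts.

```javascript
// Fraction of documents whose `field` equals `value`; lower means more selective.
function selectivity(docs, field, value) {
  if (docs.length === 0) return 0;
  const matches = docs.filter((d) => d[field] === value).length;
  return matches / docs.length;
}

// Hypothetical sample: 100 orders, 90 with status "OK", 1 with brand "aaaa".
const orders = [];
for (let i = 0; i < 100; i++) {
  orders.push({
    status: i < 90 ? "OK" : "FAILED",
    brand: i === 0 ? "aaaa" : "other",
  });
}

console.log(selectivity(orders, "status", "OK"));  // 0.9  -> poor index candidate
console.log(selectivity(orders, "brand", "aaaa")); // 0.01 -> good index candidate
```

By the 1% rule of thumb above, the brand filter is worth indexing in this sample and the status filter is not.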
So you have subdocuments, ranged queries, and sorting by 1 field only.
That eliminates most of the possible permutations, assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what the man says and at least upvote.
The other thing to consider is the order of the fields in the compound index:
1. fields that have a direct match, like $eq
2. fields you sort on
3. fields with ranged queries: $in, $lt, $or etc
These are common rules for all B-trees. Now things that are specific to MongoDB:
A compound index can cover at most one array-valued (multikey) field per document, for example a field inside an array of subdocuments like "origin.brand". Again I assume origins are embedded docs, so the document's shape is like this:
{
  _id: ...,
  status: ...,
  timestamp: ....,
  origin: [
    {brand: ..., carries: ...},
    {brand: ..., carries: ...},
    {brand: ..., carries: ...}
  ]
}
For your query the best index would be
{
  searchId: 1,
  timestamp: 1,
  status: 1, /** only if it is selective enough **/
  "origin.carries" : 1 /** or brand, depending on data **/
}
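The ordering rule listed above (direct-match fields, then sort fields, then ranged fields) can be sketched as a small helper. buildIndexSpec is a hypothetical name, and classifying each field as equality, sort, or range is left to the caller; this is only an illustration of the rule, not a tool MongoDB provides.

```javascript
// Hypothetical helper: order index keys as equality fields, then sort fields,
// then range fields, following the rule of thumb listed above.
function buildIndexSpec({ equality = [], sort = {}, range = [] }) {
  const spec = {};
  for (const f of equality) spec[f] = 1;
  for (const [f, dir] of Object.entries(sort)) spec[f] = dir;
  for (const f of range) {
    if (!(f in spec)) spec[f] = 1; // keep the earlier position if already present
  }
  return spec;
}

const spec = buildIndexSpec({
  equality: ["searchId"],
  sort: { timestamp: 1 },
  range: ["status", "origin.carries"],
});

console.log(JSON.stringify(spec));
```

With the classification used in this answer, the helper reproduces the key order of the index suggested above.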
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM, otherwise queries will be really slow.
Last but not least - indexing is not a one off job but a lifestyle. Data change over time, so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, collect stats from user's queries to remove unused indexes and free up some room. Basically apply common sense.
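One way to act on the "remove unused indexes" advice is to look at per-index usage counters. The sketch below filters a sample shaped like the output of the $indexStats aggregation stage (name, key, accesses.ops); the sample values and the unusedIndexes helper are hypothetical, and in the shell the counters may come back as 64-bit number types rather than plain numbers.

```javascript
// Hypothetical sample shaped like the output of the $indexStats aggregation
// stage. The numbers are made up for illustration.
const indexStats = [
  { name: "_id_", key: { _id: 1 }, accesses: { ops: 1520 } },
  { name: "status_1", key: { status: 1 }, accesses: { ops: 0 } },
  {
    name: "searchId_1_timestamp_1",
    key: { searchId: 1, timestamp: 1 },
    accesses: { ops: 734 },
  },
];

// Indexes with no recorded use since the counters were last reset.
// The _id index is mandatory, so it is never reported.
function unusedIndexes(stats) {
  return stats
    .filter((s) => s.name !== "_id_" && Number(s.accesses.ops) === 0)
    .map((s) => s.name);
}

console.log(unusedIndexes(indexStats));
```

The counters only cover the period since the server last started or the index was created, so a zero is a hint to investigate rather than proof that the index can be dropped.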
I noticed this one-year-old topic because I am struggling with a similar issue: users can run queries on an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.
Even worse: the user should supply some value (or range) for the fields that make up the sharding key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones that make up the sharding key, we're stuck with a full-database search. Our database is some tens of TB in size...
Indexes should fit in RAM? That can only be achieved with small databases, meaning some hundreds of GB at most. How about my 37 TB database? Its indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunk in collection "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array of chunk-numbers in "Crucial_col_A" will be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion for the crucial field (say field "B") will run very quickly (because the collection is sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries are done with precise values for the sharding-keys (= the chunk numbers), so these are expected to go also very fast.
This way of working would eliminate the burden of defining many index combinations.
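The two-step lookup described above can be sketched with in-memory Maps standing in for the collections. Everything here is a toy assumption: the chunkOf function, the chunk count, and the sample documents are hypothetical, and the "parallel" second step is run as a simple loop.

```javascript
// In-memory stand-ins for the collections described above (names hypothetical).
const NUM_CHUNKS = 8;
const mainCol = new Map();     // chunk number -> array of full documents
const crucialColB = new Map(); // value of crucial field "B" -> Set of chunk numbers

// Toy shard-key function; a real one would use high-cardinality fields.
function chunkOf(doc) {
  return doc.id % NUM_CHUNKS;
}

function insert(doc) {
  const chunk = chunkOf(doc);
  if (!mainCol.has(chunk)) mainCol.set(chunk, []);
  mainCol.get(chunk).push(doc);
  // Merge the chunk number into the pointer document for this value of B.
  if (!crucialColB.has(doc.B)) crucialColB.set(doc.B, new Set());
  crucialColB.get(doc.B).add(chunk);
}

function findByB(value, extraFilter = () => true) {
  // Step 1: pointer lookup. Step 2: visit only the listed chunks.
  const chunks = crucialColB.get(value) || new Set();
  const results = [];
  for (const chunk of chunks) {
    for (const doc of mainCol.get(chunk)) {
      if (doc.B === value && extraFilter(doc)) results.push(doc);
    }
  }
  return results;
}

insert({ id: 1, B: "x", size: 10 });
insert({ id: 9, B: "x", size: 20 }); // lands in the same chunk as id 1
insert({ id: 2, B: "y", size: 30 });
insert({ id: 4, B: "x", size: 40 });

console.log(findByB("x").map((d) => d.id));
console.log(findByB("x", (d) => d.size > 15).map((d) => d.id));
```

Chunks that hold no document with the requested value of "B" are never touched, which is the point of the pointer collection.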

DB Compound indexing best practices Mongo DB

How costly is it to index some fields in MongoDB?
I have a collection where I want uniqueness across the combination of two fields. Everywhere I searched, people suggested a compound index with unique set to true. But what I was doing is appending both fields as field1_field2 and making that a key, so that field2 will always be unique for field1 (plus application logic), as I thought indexing was costly.
Also, since the MongoDB documentation advises against custom object IDs such as auto-incrementing numbers, I ended up giving big numbers to models like Classes, Students, etc. (where I could easily have used 1, 2, 3 in SQLite). I didn't think of adding a new field for numbering and indexing that field for querying.
What are the best practices for production?
The advantage of using compound indexes over your own combined indexed field is that compound indexes allow quicker sorting than a regular indexed field. Dropping the combined field also lowers the size of every document.
In your case, if you want to get the documents sorted with values in field1 ascending and in field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value contained in field1_field2, it does not really matter if you use compound indexes or a regular indexed field.
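To illustrate the mixed-direction sort mentioned above, here is a plain JavaScript sketch. The comparator mirrors the order a compound index on { field1: 1, field2: -1 } would provide, while a concatenated field1_field2 string can only be sorted in one direction as a whole. The sample documents are hypothetical.

```javascript
// Comparator equivalent to the order of a compound index on
// { field1: 1, field2: -1 }: field1 ascending, ties broken by field2 descending.
function byField1AscField2Desc(a, b) {
  if (a.field1 !== b.field1) return a.field1 < b.field1 ? -1 : 1;
  if (a.field2 !== b.field2) return a.field2 > b.field2 ? -1 : 1;
  return 0;
}

const docs = [
  { field1: "b", field2: 1 },
  { field1: "a", field2: 1 },
  { field1: "a", field2: 3 },
];

const compoundOrder = [...docs]
  .sort(byField1AscField2Desc)
  .map((d) => d.field1 + "_" + d.field2);

// Sorting the concatenated key is ascending on both parts at once.
const concatOrder = docs.map((d) => d.field1 + "_" + d.field2).sort();

console.log(compoundOrder);
console.log(concatOrder);
```

The two orders differ for the documents that share field1, which is the case where the compound index helps and the combined field does not.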
However, if you already have field1 and field2 in separate fields in the documents, and you also have a field containing field1_field2, it could be better to use a compound index on field1 and field2, and simply delete the field containing field1_field2. This could lower the size of every document and ultimately reduce the size of your database.
Regarding the cost of indexing, you almost have to index field1_field2 if you want to go down that route anyway. Queries on unindexed fields in MongoDB are really slow on large collections, because every document has to be scanned. And adding a document to a database does not take much more time when the document has an indexed field (we're talking 1 millisecond or so). Note that adding an index on many existing documents can take a few minutes. This is why you usually plan the indexing strategy before adding any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!

MongoDB schema design for unbounded growing table

I'm practicing MongoDB through a small personal project,
in which I may need to store some intermediate data that can be abstracted as an unbounded, growing table. Both rows and columns would grow without bound.
I want to be able to:
know the corresponding column for each entry in a row
know the corresponding row for each entry in a column
Or, in other words, know the index of each table entry.
That leaves two choices for modeling the table:
Make two collections:
one holds each row as a document that embeds a growing structure of row entries referencing the corresponding columns;
and similarly, another collection holds each column as a document embedding a growing structure that references the corresponding rows.
Make a single separate collection that holds each table entry as a document, so each document has a fixed size.
The first model has a problem with document growth (in fact, in my application the table grows somewhat askew, and only one collection would encounter the document growth issue). The second model seems fine to me. Is there some pitfall or other issue that I should be aware of? And what is the common practice for dealing with such a problem?
UPDATE: explain things in more detail
I am trying to do automatic summarization of an ongoing conversation. The input is a corpus of sentences, and terms are extracted from each sentence. For example, English terms are stemmed, and sentences in CJK languages are segmented. This yields a term-sentence matrix. One of the methods then needs to compute a (sparse) SVD of that term-sentence matrix.
The sentences and extracted terms would be stored in the database, but the term-sentence matrix would grow without bound.
(Or one can think of the problem of storing a mapping between tweets and hashtags.)
Two draft schemas come to mind:
choice one (hold two-way linkages between sentences and terms)
{ // sentence collection doc
"_id" : // generated by timestamp
, "text" : //
, "contained_terms" : [
// an array of "_id"s in term collection
]
}
{ // term collection doc
"_id" : // use term name
, "in_sentences" : [
// an array of "_id"s in sentence collection
]
}
choice two (make linkages into a separate collection)
{ // linkage collection doc (as matrix entries)
"_id" : // generated by timestamp
, "term" : // an "_id" in term collection
, "in_sentence" : // an "_id" in sentence collection
}
{ // sentence collection doc
"_id" : // generated by time stamp
, "text" : //
}
{ // term collection doc
"_id" : // use term name
}
Choice one runs into the document growth problem because the "in_sentences" array of a term collection doc is very likely to grow beyond the limit when sentences come in nonstop.
Choice two extracts the linkage between terms and sentences into a separate collection and hence avoids document growth. Querying "which sentences contain the term" costs more, but in the end it seems I don't need that operation much.
Currently, I think choice two better suits my needs. The linkage collection seems to conform to the input of a sparse SVD. To speed up computation, very high frequency terms can be filtered out if a term frequency field is added to each term collection doc (or kept in a separate collection when there is more than one conversation). This filtering seems fine in the case of automatic summarization.
But I still wonder:
Are there any issues or pitfalls that I should be aware of?
What is the common practice for similar situations?
My understanding of mongodb is that you need to design your schema around your queries. So how you save your data is highly dependent on what data will you be querying. So even for the same set of data, your schema can vary depending on the actual use case. Additionally, data redundancy is quite common in NoSql database design. In case you are going to need some data again and again, there is no point in saving it in a separate collection. You can duplicate it in 2 collections, and that's a fair enough cost for faster querying. Memory is cheap, processing isn't! Additionally, pre-aggregation helps in case of mongo for huge data sets. Your queries will work fine for decent number of documents, but once you go into the realm of millions of records, you may face problems with a certain class of queries like counts, aggregation, etc. Pre-aggregation helps in keeping things real time, though it may have a higher write/insertion overhead. Always avoid a full table scan, whenever you can.
Above are some broad level concepts that I find relevant to your question. I'll try and explain it in your context with some examples (as I am not sure what data you are eventually going to need, or the queries you will do).
Let's say you are going to need terms per sentence frequently, to highlight them. In that case the recommended schema will be:
{ "_id" : // sentence id - you will query on this
, "text" : // sentence text
, "terms" : ["term1", "term2", "term3"]
}
So for each new sentence, you extract all the terms and save it (not the id) along with the sentence. The advantage here being that you will not need to query for the term separately. You can get all the terms for a given sentence in a single query. Additionally, the document size doesn't grow, and hence no document relocation.
Let's say you also want to have a unique list of terms and some per term meta data. You can have a separate terms collection which has a list of all the unique terms:
{ "_id" :
, "term" : // term
, "meaning" :
, "metadata" :
, "count" : 1
}
You can have a unique index on term. Each time you extract terms from a sentence, you look up for it in this collection, and in case you don't find it, you insert it. Now let's say you also want to maintain a count of term appearance. So each time you find a term in a sentence and do a lookup in terms collection, you can increment (atomic) the count as well - pre-aggregation. If you add an index on count, you can get the top 100 terms, etc. easily on the fly.
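The pre-aggregation idea above can be sketched without a database. Here a Map stands in for the terms collection; recordTerm plays the role of the look-up, insert-if-missing, and increment steps, and topTerms plays the role of a query sorted on count. Both function names and the sample terms are hypothetical.

```javascript
// In-memory stand-in for the terms collection: term -> { term, count }.
// With a real database this would be a lookup plus an atomic increment.
const terms = new Map();

function recordTerm(term) {
  const entry = terms.get(term);
  if (entry) {
    entry.count += 1;               // term already known: bump the counter
  } else {
    terms.set(term, { term, count: 1 }); // first sighting: insert it
  }
}

function topTerms(n) {
  return [...terms.values()]
    .sort((a, b) => b.count - a.count)
    .slice(0, n);
}

for (const t of ["cat", "dog", "cat", "bird", "cat", "dog"]) {
  recordTerm(t);
}

console.log(topTerms(2));
```

The counts are maintained at write time, so reading the most frequent terms never has to scan the sentences.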
Now let's say you want to query/count all the sentences with a given term. You can add an index on terms array and directly look up for all the sentences with a given term:
Sentence.where(:terms => "term1").count # mongoid query
Again, you are achieving this with a single query, as opposed to getting a term id first in your case, and then the sentences.
Other than this it's always advisable to ensure that your working set and indexes fit into RAM for best performance.
So again, there are no right and wrong answers for schema design and it definitely depends on the queries you will be doing. I would also advise you to unlearn some of your relational DB concepts when trying to design for NoSQL databases. I learned it the hard way =) Hope some of this helps you in coming up with an efficient schema for your use case.
If you are trying to model a matrix with the whole collection representing the matrix, I think the go-to model should be to have each entry (row i, column j) as a document. If you put in a field like "index" : { "row" : i, "column" : j} and appropriate indices then it's easy and fast to do fun things like
get the entry at (i, j)
get row i
get column j
The matrix is represented sparsely so if row i only has 10 columns with values, row i is just 10 documents. If the rows/columns really do grow unboundedly to very large sizes then modeling a document as a row or column or something of "1 dimension" could hit the hard 16MB BSON document size limit.
I'm thinking the biggest drawback could be large index sizes given that every entry is its own document.
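A minimal sketch of the one-document-per-entry model described above, using an in-memory array in place of the collection. The sample entries are hypothetical; in MongoDB the three lookups would be filters on "index.row" and "index.column" backed by the indexes the answer mentions.

```javascript
// Each non-zero matrix entry is one document; absent entries are implicit zeros.
const entries = [
  { index: { row: 0, column: 2 }, value: 5 },
  { index: { row: 0, column: 7 }, value: 1 },
  { index: { row: 3, column: 2 }, value: 4 },
];

// Get the entry at (i, j), or null if it is not stored.
const getEntry = (i, j) =>
  entries.find((e) => e.index.row === i && e.index.column === j) || null;

// Get every stored entry of row i or of column j.
const getRow = (i) => entries.filter((e) => e.index.row === i);
const getColumn = (j) => entries.filter((e) => e.index.column === j);

console.log(getEntry(3, 2));
console.log(getRow(0).length);
console.log(getColumn(2).map((e) => e.value));
```

A row with two stored values is just two documents, so no single document grows as the matrix grows.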

MongoDB: does document size affect query performance?

Assume a mobile game that is backed by a MongoDB database containing a User collection with several million documents.
Now assume several dozen properties that must be associated with the user - e.g. an array of _id values of Friend documents, their username, photo, an array of _id values of Game documents, last_login date, count of in-game currency, etc, etc, etc..
My concern is whether creating and updating large, growing arrays on many millions of User documents will add any 'weight' to each User document, and/or slowness to the overall system.
We will likely never eclipse 16mb per document, but we can safely say our documents will be 10-20x larger if we store these growing lists directly.
Question: is this even a problem in MongoDB? Does document size even matter if your queries are properly managed using projection and indexes, etc? Should we be actively pruning document size, e.g. with references to external lists vs. embedding lists of _id values directly?
In other words: if I want a user's last_login value, will a query that projects/selects only the last_login field be any different if my User documents are 100kb vs. 5mb?
Or: if I want to find all users with a specific last_login value, will document size affect that sort of query?
One way to rephrase the question is to say, does a 1 million document query take longer if documents are 16mb vs 16kb each.
Correct me if I'm wrong, but from my own experience, the smaller the document size, the faster the query.
I've done queries on 500k documents vs 25k documents and the 25k query was noticeably faster - ranging anywhere from a few milliseconds to 1-3 seconds faster. On production the time difference is about 2x-10x more.
One aspect where document size comes into play is query sorting: a sort that cannot use an index is done in memory and is subject to a memory limit, so document size affects whether the query will run at all. I've reached this limit numerous times trying to sort as few as 2k documents.
More references with some solutions here:
https://docs.mongodb.org/manual/reference/limits/#operations
https://docs.mongodb.org/manual/reference/operator/aggregation/sort/#sort-memory-limit
At the end of the day, it's the end user that suffers.
When I try to remedy large queries that cause unacceptably slow performance, I usually find myself creating a new collection with a subset of the data, and using a lot of query conditions along with a sort and a limit.
Hope this helps!
First of all you should spend a little time reading up on how MongoDB stores documents, with reference to padding factors and powerOf2Sizes allocation:
http://docs.mongodb.org/manual/core/storage/
http://docs.mongodb.org/manual/reference/command/collStats/#collStats.paddingFactor
Put simply, MongoDB's MMAPv1 storage engine tries to allocate some additional space when storing your original document to allow for growth. powerOf2Sizes allocation became the default approach in version 2.6, where record sizes are allocated in powers of 2.
Overall, performance will be much better if all updates fit within the original size allocation. The reason is that if they don't, the entire document needs to be moved someplace else with enough space, causing more reads and writes and in effect fragmenting your storage.
If your documents are really going to grow in size by a factor of 10X to 20X over time, that could mean multiple moves per document, which depending on your insert, update and read frequency could cause issues. If that is the case there are a couple of approaches you can consider:
1) Allocate enough space on initial insertion to cover most (let's say 90%) of normal documents lifetime growth. While this will be inefficient in space usage at the beginning, efficiency will increase with time as the documents grow without any performance reduction. In effect you will pay ahead of time for storage that you will eventually use later to get good performance over time.
2) Create "overflow" documents - let's say a typical 80-20 rule applies and 80% of your documents will fit in a certain size. Allocate for that amount and add an overflow collection that your document can point to if they have more than 100 friends or 100 Game documents for example. The overflow field points to a document in this new collection and your app only looks in the new collection if the overflow field exists. Allows for normal document processing for 80% of the users, and avoids wasting a lot of storage on the 80% of user documents that won't need it, at the expense of additional application complexity.
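The "overflow" document idea in option 2 can be sketched as follows. The splitUser helper, the overflow id format, and the sample user are hypothetical; the threshold of 100 comes from the example above. A real implementation would also need to keep the two documents consistent on later updates.

```javascript
const MAX_EMBEDDED = 100; // threshold taken from the example above

// Hypothetical helper: keep the first MAX_EMBEDDED friends in the user document
// and move the rest into a separate overflow document referenced by `overflow`.
function splitUser(user) {
  if (user.friends.length <= MAX_EMBEDDED) {
    return { user, overflowDoc: null };
  }
  const overflowId = "overflow-" + user._id;
  return {
    user: {
      ...user,
      friends: user.friends.slice(0, MAX_EMBEDDED),
      overflow: overflowId,
    },
    overflowDoc: {
      _id: overflowId,
      friends: user.friends.slice(MAX_EMBEDDED),
    },
  };
}

const bigUser = {
  _id: "u1",
  friends: Array.from({ length: 130 }, (_, i) => i),
};
const split = splitUser(bigUser);

console.log(split.user.friends.length, split.overflowDoc.friends.length);
```

The application only reads the overflow collection when the overflow field is present, so most users are served from a single, bounded document.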
In either case I'd consider using covered queries by building the appropriate indexes:
A covered query is a query in which:
all the fields in the query are part of an index, and
all the fields returned in the results are in the same index.
Because the index “covers” the query, MongoDB can both match the query
conditions and return the results using only the index; MongoDB does
not need to look at the documents, only the index, to fulfill the
query.
Querying only the index can be much faster than querying documents
outside of the index. Index keys are typically smaller than the
documents they catalog, and indexes are typically available in RAM or
located sequentially on disk.
More on that approach here: http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/
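The two conditions quoted above can be turned into a rough check. isCovered is a hypothetical helper, not a MongoDB API: it only looks at top-level field names and ignores special cases such as array fields and query operators like $or. It also encodes the fact that _id is returned by default, so it must be excluded from the projection unless it is part of the index.

```javascript
// Hypothetical check for the two covered-query conditions quoted above.
function isCovered(indexKey, filter, projection) {
  const indexed = new Set(Object.keys(indexKey));
  // Condition 1: every field in the query is part of the index.
  const filterOk = Object.keys(filter).every((f) => indexed.has(f));
  // Condition 2: every returned field is in the same index.
  // Returned fields are those projected with 1, plus _id unless excluded.
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  if (projection._id !== 0 && !returned.includes("_id")) returned.push("_id");
  return filterOk && returned.every((f) => indexed.has(f));
}

const index = { last_login: 1 };

// Covered: filter and projection use only the indexed field, _id excluded.
console.log(isCovered(index, { last_login: 5 }, { last_login: 1, _id: 0 }));
// Not covered: _id is returned by default and is not in this index.
console.log(isCovered(index, { last_login: 5 }, { last_login: 1 }));
```

For the last_login example from the question, a covered query would never read the large user documents at all, which is why document size stops mattering for that query.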
Just wanted to share my experience when dealing with large documents in MongoDB... don't do it!
We made the mistake of allowing users to include files encoded in base64 (normally images and screenshots) in documents. We ended up with a collection of ~500k documents ranging from 2 MB to 10 MB each.
Doing a simple aggregate in this collection would bring down the cluster!
Aggregate queries can be very heavy in MongoDB, especially with large documents like these. Indexes in aggregates can only be used in some conditions and since we needed to $group, indexes were not being used and MongoDB would have to scan all the documents.
The exact same query in a collection with smaller sized documents was very fast to execute and the resource consumption was not very high.
Hence, querying in MongoDB with large documents can have a big impact in performance, especially aggregates.
Also, if you know that the document will continue to grow after it is created (e.g. like including log events in a given entity (document)) consider creating a collection for these child items because the size can also become a problem in the future.
Bruno.
Short answer: yes.
Long answer: how it will affect the queries depends on many factors, like the nature of the queries, the memory available and the indices sizes.
The best you can do is testing.
The code below will generate two collections named smallDocuments and bigDocuments, with 1024 documents each, differing only in a field 'c' containing a big string and the _id. The bigDocuments collection will be about 2 GB, so be careful running it.
const numberOfDocuments = 1024;
// 2MB string x 1024 ~ 2GB collection
const bigString = 'a'.repeat(2 * 1024 * 1024);
// generate and insert documents in two collections: smallDocuments and
// bigDocuments;
for (let i = 0; i < numberOfDocuments; i++) {
  let doc = {};
  // field a: integer between 0 and 9, equal in both collections;
  doc.a = ~~(Math.random() * 10);
  // field b: single character between a and j, equal in both collections;
  doc.b = String.fromCharCode(97 + ~~(Math.random() * 10));
  // insert in smallDocuments collection
  db.smallDocuments.insert(doc);
  // field c: big string, present only in bigDocuments collection;
  doc.c = bigString;
  // insert in bigDocuments collection
  db.bigDocuments.insert(doc);
}
You can put this code in a file (e.g. create-test-data.js) and run it directly in the mongo shell by typing this command:
mongo testDb < create-test-data.js
It will take a while. After that you can execute some test queries, like these ones:
const numbersToQuery = [];
// generate 100 random numbers to query documents using field 'a':
for (let i = 0; i < 100; i++) {
  numbersToQuery.push(~~(Math.random() * 10));
}
const smallStart = Date.now();
numbersToQuery.forEach(number => {
  // query using inequality conditions: slower than equality
  const docs = db.smallDocuments
    .find({ a: { $ne: number } }, { a: 1, b: 1 })
    .toArray();
});
print('Small:' + (Date.now() - smallStart) + ' ms');
const bigStart = Date.now();
numbersToQuery.forEach(number => {
  // repeat the same queries in the bigDocuments collection; note that the big field 'c'
  // is omitted in the projection
  const docs = db.bigDocuments
    .find({ a: { $ne: number } }, { a: 1, b: 1 })
    .toArray();
});
print('Big: ' + (Date.now() - bigStart) + ' ms');
Here I got the following results:
Without index:
Small: 1976 ms
Big: 19835 ms
After indexing field 'a' in both collections, with .createIndex({ a: 1 }):
Small: 2258 ms
Big: 4761 ms
This demonstrates that queries on big documents are slower. Even with the index, the queries on bigDocuments take more than twice as long as those on smallDocuments (more than 100% longer).
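The percentages behind that conclusion can be worked out directly from the timings reported above. This is only arithmetic on those four numbers; the percentSlower helper is a hypothetical name.

```javascript
// Timings in milliseconds, copied from the results above.
const results = {
  noIndex: { small: 1976, big: 19835 },
  indexed: { small: 2258, big: 4761 },
};

// Percentage by which the big-document time exceeds the small-document time.
function percentSlower({ small, big }) {
  return ((big - small) / small) * 100;
}

console.log(percentSlower(results.noIndex).toFixed(0) + "%");
console.log(percentSlower(results.indexed).toFixed(0) + "%");
```

Without an index the big collection is roughly ten times slower; with the index the gap shrinks but remains above 100%.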
My suggestions are:
Use equality conditions in queries (https://docs.mongodb.com/manual/core/query-optimization/index.html#query-selectivity);
Use covered queries (https://docs.mongodb.com/manual/core/query-optimization/index.html#covered-query);
Use indices that fit in memory (https://docs.mongodb.com/manual/tutorial/ensure-indexes-fit-ram/);
Keep documents small;
If you need phrase queries using text indices, make sure the entire collection fits in memory (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet);
Generate test data and make test queries, simulating your app use case; use random strings generators if needed.
I had problems with text queries in big documents, using MongoDB: Autocomplete and text search memory issues in apostrophe-cms: need ideas
Here is some code I wrote to generate sample data in ApostropheCMS, along with some test results: https://github.com/souzabrs/misc/tree/master/big-pieces.
This is more a database design issue than a MongoDB internal one. I think MongoDB was made to behave this way. But it would help a lot to have a more explicit explanation in its documentation.