How to efficiently do complex queries on MongoDB unindexed fields?

I am building filtering functionality for a web application that should work like the query filters in JIRA or TFS: users should be able to filter on field contents and combine conditions with logical operators.
The data lives in MongoDB. The main challenge is that the fields we filter on must support not only strict equality but also full-text search, and they are difficult to index because they can vary for each user.
In a nutshell, there is a nested object that contains three other nested objects. Each of those can have a different number of fields depending on the user, and the field names are also set by the user, so we don't know them in advance.
For example document structure in collection can be:
{
  _id: ObjectId(),
  storage: {
    obj_1: {},
    obj_2: {}
  }
},
{
  _id: ObjectId(),
  storage: {
    obj_1: {
      field_1: val,
      field_2: val
    },
    obj_2: {}
  }
}
I imagine queries will be something like:
find({$and:[{"storage.obj_1.field_1":{$regex: "va"}},{"storage.obj_1.field_2":"val"}]})
Unfortunately, I am not a database expert so the solutions that I see now are:
1) Use Elasticsearch as a search engine. But the question is: how do I set up an Elasticsearch index if I don't know my documents' structure?
2) Use a sparse index in Mongo. But I will need to use regex for matching; is that solution better than Elasticsearch?
So the question is: what is the best way to do filtering in such a DB structure?
p.s.
I have put this question in SO and not Software Engineering because SO has more active members, pls keep your downvotes for later

Elasticsearch and MongoDB (much like a relational database) behave differently for indexing: in MongoDB you need to explicitly index every field (for a non-$text index), whereas in Elasticsearch every field is indexed automatically. Don't go too crazy on the number of fields in Elasticsearch, since there is a little overhead for each field (in terms of disk space, though that has improved in version 6).
As soon as you have more than a test data set, regular expressions are often slow, since they can only use indexes in specific cases, and you need to define those indexes explicitly. A text index and the $text search operator may be closer to what you are looking for; a text index can also cover every field in a collection. If you need more features and a system that is fully built for search, though, Elasticsearch will be the better option.
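To make the point about regular expressions a bit more concrete: a case-sensitive pattern anchored at the start of the string (a prefix match) is the main case where MongoDB can use an index efficiently, while an unanchored "contains" pattern cannot narrow the scan. Below is a minimal sketch of turning user-supplied conditions into a filter document with that distinction in mind. The helper names (escapeRegex, buildFilter) and the operator names (eq, startsWith, contains) are made up for this illustration and are not part of any MongoDB API.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn a list of user conditions into a MongoDB filter document.
// Each condition is { field, op, value }; the op names are hypothetical.
function buildFilter(conditions) {
  const clauses = conditions.map(({ field, op, value }) => {
    switch (op) {
      case "eq":
        return { [field]: value };
      case "startsWith":
        // Anchored, case-sensitive prefix: the form that can use an index on the field.
        return { [field]: { $regex: "^" + escapeRegex(value) } };
      case "contains":
        // Unanchored: cannot narrow an index scan to a range of keys.
        return { [field]: { $regex: escapeRegex(value) } };
      default:
        throw new Error("Unsupported operator: " + op);
    }
  });
  if (clauses.length === 0) return {};
  return clauses.length === 1 ? clauses[0] : { $and: clauses };
}

const filter = buildFilter([
  { field: "storage.obj_1.field_1", op: "startsWith", value: "va" },
  { field: "storage.obj_1.field_2", op: "eq", value: "val" },
]);
console.log(JSON.stringify(filter));
```

The resulting object can be passed to find() as-is. Whether the prefix form actually helps still depends on an index existing for that particular field path, which is the hard part when the field names are user-defined.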

Related

MongoDB $all optimization of tag-based query

A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Short answer is no, this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you query the tags field, imagine that Mongo queries each tag separately and then intersects the results.
If you run "explain" you will see that after the index scan phase Mongo executes a fetch-document phase. In theory this phase would be redundant for a pure index scan, which shows that this is not one. So basically Mongo fetches ALL documents that have either of the tags, and only then performs the "$all" logic in the filtering phase.
So what can you do?
If you have prior knowledge of which tag is sparser, you could first query on that tag and only then filter on the more common one. I'm assuming this is not really the case, but it is worth considering if possible. If your tags are somewhat static, maybe you can even precalculate this.
Otherwise you will have to consider restructuring the data to allow better index usage for this use case, although I will say that for most access patterns your current structure is better.
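The "start from the sparser tag" idea can be illustrated outside the database. The sketch below is an in-memory toy model, not how MongoDB executes the query: it builds a tag-to-ids map and intersects starting from the smallest set, so only the ids of the rare tag are iterated. The function names (buildTagIndex, findWithAllTags) and the sample posts are invented for this example.

```javascript
// Build an in-memory inverted index: tag -> Set of post ids.
function buildTagIndex(posts) {
  const index = new Map();
  for (const post of posts) {
    for (const tag of post.tags) {
      if (!index.has(tag)) index.set(tag, new Set());
      index.get(tag).add(post._id);
    }
  }
  return index;
}

// Return the ids of posts carrying all given tags,
// iterating only over the ids of the rarest tag.
function findWithAllTags(index, tags) {
  const sets = tags.map((tag) => index.get(tag) || new Set());
  sets.sort((a, b) => a.size - b.size); // rarest tag first
  const [smallest, ...rest] = sets;
  if (!smallest) return [];
  return [...smallest].filter((id) => rest.every((set) => set.has(id)));
}

const posts = [
  { _id: 1, tags: ["most_posts_have_this"] },
  { _id: 2, tags: ["most_posts_have_this"] },
  { _id: 3, tags: ["most_posts_have_this", "few_posts_have_this"] },
  { _id: 4, tags: ["most_posts_have_this"] },
];
const tagIndex = buildTagIndex(posts);
console.log(findWithAllTags(tagIndex, ["most_posts_have_this", "few_posts_have_this"]));
```

In an application the same ordering decision would need tag counts from somewhere, for example a precalculated counter per tag, which is an assumption on top of what the question describes.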
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you build an index on tags2, each key of the object will be indexed separately. This lets Mongo skip the "fetch" phase, as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
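If you go down the restructuring route, the conversion and the query can both be generated. This is only a sketch under a few assumptions: the helper names (tagsToObject, allTagsFilter) are made up, the stored value 1 is arbitrary, and tag names must not contain "." or start with "$", because they become field names.

```javascript
// Convert ["a", "b"] into { a: 1, b: 1 } for storage under the tags2 field.
function tagsToObject(tags) {
  const result = {};
  for (const tag of tags) result[tag] = 1;
  return result;
}

// Build the filter that requires every given tag key to exist.
function allTagsFilter(tags, field = "tags2") {
  const filter = {};
  for (const tag of tags) filter[`${field}.${tag}`] = { $exists: true };
  return filter;
}

console.log(tagsToObject(["most_posts_have_this", "few_posts_have_this"]));
console.log(allTagsFilter(["most_posts_have_this", "few_posts_have_this"]));
```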
I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case. I can think of more "hacky" approaches, but I would say these two are the more reasonable ones to actually consider implementing, depending on performance requirements.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

mongoDB vs. elasticsearch query/aggregation performance comparison

This question is about choosing the type of database to run queries on for an application. Keeping other factors aside for the moment, and given that the choice is between mongodb and elastic, the key criterion is that the query should be resolved in near real time. The queries will be ad-hoc and as such can contain any of the fields in the JSON objects and will likely contain aggregations and subaggregations. Furthermore, there will not be nested objects and none of the fields will be containing 'descriptive' text (like movie reviews etc.), i.e., all the fields will be keyword type fields like State, Country, City, Name etc.
Now, I have read that elasticsearch performance is near real time and that elasticsearch uses inverted indices and creates them automatically for every field.
Given all the above, my questions are as follows.
(There is a similar question on Stack Overflow, "elasticsearch v.s. MongoDB for filtering application", but I do not think it answers my questions.)
1) Since the fields in the use case I mentioned do not contain descriptive text and hence would not require the full-text search capability and other additional features that elastic provides (especially for text search), what would be a better choice between elastic and mongo? How would elastic search and mongo query/aggregation performance compare if I were to create single field indices on all the available fields in mongo?
2) I am not familiar with advanced indexing, so I am assuming that it would be possible to create indices on all available fields in mongo (either using multiple single field indices or maybe compound indices?). I understand that this will come with a cost for storage and write speed which is true for elastic as well.
3) Also, in elastic the user can trade off write speed (indexing rate) with the speed with which the written document becomes available (refresh_interval) for a query. Is there a similar feature in mongo?
I think the size of your data set is also a very important aspect when choosing a DB engine. According to this benchmark (2015), if you have more than 10 million documents, Elasticsearch could be a better choice. If your data set is small, there should be no obvious difference in performance between Elasticsearch and MongoDB.

MongoDB Find performance: single compound index VS two single field indexes

I'm looking for advice about which indexing strategy to use in MongoDB 3.4.
Let's suppose we have a people collection of documents with the following shape:
{
_id: 10,
name: "Bob",
age: 32,
profession: "Hacker"
}
Let's imagine that a web API to query the collection is exposed and that the only possible filters are by name or by age.
A sample call to the api will be something like: http://myAwesomeWebSite/people?name="Bob"&age=25
Such a call will be translated in the following query: db.people.find({name: "Bob", age: 25}).
To better clarify our scenario, consider that:
the field name was already in our documents and we already have an index on that field
we are going to add the new field age due to some new features of our application
the database is only accessible via the web api mentioned above and the most important requirement is to expose a super fast web api
all the calls to the web api will apply a filter on both the fields name and age (put another way, all the calls to the web api will have the same pattern, which is the one shown above)
That said, we have to decide which of the following indexes offer the best performance:
One compound index: {name: 1, age: 1}
Two single-field indexes: {name: 1} and {age: 1}
According to some simple tests, it seems that the single compound index is much more performant than the two single-field indexes.
When executing a single query via the mongo shell, the explain() method suggests that with a single compound index you can query the database nearly ten times faster than with two single-field indexes.
This difference seems to be less dramatic in a more realistic scenario, where instead of executing a single query via the mongo shell, multiple calls are made to two different URLs of a Node.js web application. Both URLs execute a query against the database and return the fetched data as a JSON array, one using a collection with the single compound index and the other using a collection with two single-field indexes (both collections having exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.
Based on the test results, we are considering using the single compound index approach.
Does anyone have experience with this topic? Are we missing any important consideration (maybe some disadvantage of big compound indexes)?
Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (as in name and age in your example), in order to find the resulting documents, MongoDB will either:
1. Do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return/discard each document): this is super I/O intensive and hence slow.
2. Use one index that holds one of the fields (use the index tree to locate the relevant subset of documents, followed by a scan of them): depending on your data distribution/index selectivity, this can be very fast or barely provide any benefit (imagine an index on age in a dataset of millions of people between 30 and 40 years old: every lookup would still yield a huge number of documents).
3. Use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): again, depending on your data distribution, this may or may not give you great(er) performance. It should, however, in most cases be faster than #2. I would, however, be surprised if it was really 10x slower than #4 (as you mentioned).
4. Use a compound index (two subsequent key lookups immediately lead to the required documents): this will be the fastest option of all, given that it requires the fewest and cheapest operations to get to the right documents. To ensure the greatest level of reuse (not performance, which won't be affected by this), you should in general put the most selective field first, so in your case probably name and not age, given that a lot of people will have the same age (low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how to best define a compound index, taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
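The difference between the index-intersection and compound-index options above can be sketched with a toy model. Real MongoDB indexes are B-trees and the planner is far more involved, so treat this purely as an illustration of how many index entries each strategy has to touch. All function names and the sample data are invented for this example.

```javascript
// Single-field index: value -> array of document ids.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  }
  return index;
}

// Compound index on (name, age): one key per combination of both values.
function buildCompoundIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    const key = JSON.stringify([doc.name, doc.age]);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  }
  return index;
}

// Two indexes: look up both, then intersect the id lists.
function findByIntersection(nameIndex, ageIndex, name, age) {
  const byName = nameIndex.get(name) || [];
  const byAge = new Set(ageIndex.get(age) || []);
  return {
    ids: byName.filter((id) => byAge.has(id)),
    examined: byName.length + byAge.size, // index entries touched
  };
}

// Compound index: a single lookup yields exactly the matching ids.
function findByCompound(compoundIndex, name, age) {
  const ids = compoundIndex.get(JSON.stringify([name, age])) || [];
  return { ids, examined: ids.length };
}

const people = [
  { _id: 1, name: "Bob", age: 25 },
  { _id: 2, name: "Bob", age: 32 },
  { _id: 3, name: "Alice", age: 25 },
  { _id: 4, name: "Carol", age: 25 },
];
const nameIndex = buildIndex(people, "name");
const ageIndex = buildIndex(people, "age");
const compoundIndex = buildCompoundIndex(people);

console.log(findByIntersection(nameIndex, ageIndex, "Bob", 25)); // ids [1], examined 5
console.log(findByCompound(compoundIndex, "Bob", 25)); // ids [1], examined 1
```

Both strategies return the same document, but the intersection has to touch every entry for "Bob" and every entry for age 25 first. How large that gap is on real data depends on the distribution of values, which is why measuring with explain() on realistic data remains the better guide.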
Other aspects to consider are: Index updates come at a certain price. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.
And last but not least (!) the well over-used bottom line advice: Profile the hell out of your system using real data and perhaps even realistic load scenarios. And also keep measuring as your data/system changes over time.
Additional reads:
https://docs.mongodb.com/manual/core/query-optimization/index.html
https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eliminate-the-need-for-creating-compound-in
Index intersection vs. compound index?
mongodb compund index vs. index intersect
How does the order of compound indexes matter in MongoDB performance-wise?
In MongoDB, I am using a large query, how I will create compound index or single index, So My response time boost up

MongoDB Query - limit fields where name matches pattern

I've read everything I can find about Projection in MongoDB. I'm hoping this is simple and I just missed it due to the overwhelming flexibility of Mongo queries.
In our MySql database, we've adopted a business practice of having "hidden" fields be prefixed with an underscore. Our application knows how to hide these fields.
Moving some data to mongo, I need to retrieve the documents, with ALL underscore prefixed fields omitted. Of course this should be done in the query rather than document manipulation after retrieval.
All the operators like $regex, $in, $all seem to apply to values. I need to build a projection that ignores an unknown number of fields based on their name. Something like:
db.coll.find({}, {"_*": 0})
Of course that doesn't work, but explains the idea.
I should note: this is necessary because the documents are editable by our application users, so I have no idea what the schema might look like. I do know our "internal" fields are prefixed with an _, and those need to be protected by omission from the editor.
Hope it's easy...
You can keep the hidden fields in a separate sub-document, for example hidden_fields. See the following schema.
{_id: 'myid1', hidden_fields: {"_foo": "bar", "_foo2": "bar2"}, key1: value1 ...}
Now on the basis of above schema just do,
db.collection.find({ ... }, {hidden_fields: 1})
This will return only the hidden fields (plus _id); use {hidden_fields: 0} instead to omit them. You can also have indexes on fields within sub-documents, so there is no loss in terms of performance either.
There is no functionality for this for good reason. It would be a nightmare to implement this kind of functionality and it would not scale nor would it be very fast.
The best way to do this currently is to set up a key-value store like:
{
fields: [
{k: "_ghhg", v: 5},
{k: "ghg", v: 6}
]
}
You would then $regex on the k field to find which key names (fields) start with an underscore.
As a piece of advice, I would highly recommend prefix-anchored $regexes, since they are far more effective at using the indexes you create; for the query you show, that would be ^_.
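As a rough illustration of the key-value layout, the sketch below converts a flat document into the fields: [{k, v}] shape and applies a prefix match on the key (the regular expression ^_, meaning "starts with an underscore") in plain JavaScript. The helper names toKeyValue and visibleFields are made up, and only top-level fields are handled.

```javascript
// Convert a flat document into the { fields: [{ k, v }] } shape.
function toKeyValue(doc) {
  return { fields: Object.entries(doc).map(([k, v]) => ({ k, v })) };
}

// Keep only the entries whose key does not start with an underscore.
function visibleFields(kvDoc) {
  const hidden = /^_/; // prefix-anchored pattern
  return kvDoc.fields.filter(({ k }) => !hidden.test(k));
}

const kvDoc = toKeyValue({ _ghhg: 5, ghg: 6 });
console.log(JSON.stringify(kvDoc));
console.log(JSON.stringify(visibleFields(kvDoc)));
```

On the database side the same pattern would go into a filter on "fields.k", with an index on that path if key lookups need to be fast.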
Moving some data to mongo, I need to retrieve the documents, with ALL underscore prefixed fields omitted. Of course this should be done in the query rather than document manipulation after retrieval.
I would personally do this client side, it will be 100x faster than database side.
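A client-side version can be very small. This is a minimal sketch assuming only top-level fields need to be hidden and that _id should survive even though it starts with an underscore. The function name stripHiddenFields and the keep parameter are inventions for this example.

```javascript
// Return a copy of the document without underscore-prefixed top-level fields.
// Field names listed in `keep` are preserved regardless (default: _id).
function stripHiddenFields(doc, keep = ["_id"]) {
  return Object.fromEntries(
    Object.entries(doc).filter(([key]) => keep.includes(key) || !key.startsWith("_"))
  );
}

const stored = { _id: "myid1", _foo: "bar", _foo2: "bar2", key1: "value1" };
console.log(stripHiddenFields(stored));
```

The original object is left untouched, so the application can still read the internal fields where it needs them.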
As was mentioned by @Sammaye, MongoDB doesn't support this type of query in a natural/efficient way.
However, to optimize performance if you don't always need the internal data, I'd suggest you consider creating two documents, one with the always-available data, and one with just the "_internal" fields. MongoDB will read and write less, and there will be less to manipulate on the client. This would be similar to having two tables in an RDBMS (one with public and one with private data).
This could make updates to the non-internal data simple as well by just updating the entire document (if possible in your scenario).
Of course, you could then also drop the extra "_" character as well, as that just adds extra unnecessary characters to the BSON data. :)

How should I store the "tags" of a document in MongoDb?

I know this question was asked several times, but as far as my search skills go, every post is how to implement this in a SQL database, and no mention of NoSQL databases.
I have documents for which I want to implement a tagging feature.
Users will be able to tag them with whatever string they want, and then I need to be able to query the documents as fast as possible (by these generic tags)
Should I have a String array for my tags (which would allow me to support any number of tags), like this
{"_id":"aaa", "prop":"value", "tags":["tag1","tag2","tagN"]}
or limit the number of tags to, say, 5, and have them as different properties, like this
{"_id":"aaa", "prop":"value", "tag1":"value", "tag2":"value", "tag3":"value" }
Which structure would be better for fast querying, specifically in mongodb?
Using the second structure would allow me to index the collection by these fields, but are 5 indexes recommended? Should I have fewer tags?
You simply want to use an array, the first example you have. That will allow you to have a consistent model and indexes.
{"_id":"aaa", "prop":"value", "tags":["tag1","tag2","tagN"]}
If you index a field that contains an array, MongoDB indexes each value in the array separately, in a “multikey index.”[1]
[1] http://docs.mongodb.org/manual/core/indexes/#multikey-indexes