I have a collection called calls containing properties DateStarted, DateEnded, IdAccount, From, To, FromReversed, ToReversed. In other words this is how a call document looks like:
{
_id : "LKDJLDKJDLKDJDLKJDLKDJDLKDJLK",
IdAccount: 123,
DateStarted: ISODate('2020-11-05T05:00:00Z'),
DateEnded: ISODate('2020-11-05T05:20:00Z'),
From: "1234567890",
FromReversed: "0987654321",
To: "1231231234",
ToReversed: "4321321321"
}
On our website we want to give customers the option to search by custom calls. When they search for calls they must specify the DateStarted and DateEnded Those fields are required the other ones are optional. The IdAccount will be injected on our end so that the customer can only get calls that belong to his account.
Because we have about 5 million records we have created the following indexes
db.calls.ensureIndex({"IdAccount":1});
db.calls.ensureIndex({"DateStarted":1});
db.calls.ensureIndex({"DateEnded":1});
db.calls.ensureIndex({"From":1});
db.calls.ensureIndex({"FromReversed":1});
db.calls.ensureIndex({"To":1});
db.calls.ensureIndex({"ToReversed":1});
The reason why we did not created a compound index is because we want to be able to search by custom criteria. For example we may want to search by all calls with date smaller than December 11 and from a specific account.
Because of the indexes all these queries execute very fast:
db.calls.find({'DateStarted' : {'$gte': ISODate('2020-11-05T05:00:00Z')}).limit(200).explain();
db.calls.find({'DateEnded' : {'$lte': ISODate('2020-11-05T05:00:00Z')}).limit(200).explain();
db.calls.find({'IdAccount' : 123 ).limit(200).explain();
// etc...
Even queries that use regexes execute very fast. They only work fast if I use ^... meaning that it must start with a search pattern as:
db.calls.find({ 'From' : /^305/ ).limit(200).explain();
and that is the reason why we created the field FromReversed and ToReversed. If I want to search for a To phone number that ends with 3985 I will execute:
db.calls.find({ 'ToReversed' : /^5893/ ).limit(200).explain(); // note I will have to reverse the search option to
So the only queries that are slow are the ones that do not start with something such as this query:
db.calls.find({ 'ToReversed' : /1234/ ).limit(200).explain();
Question
Why is it that if I combine all the queries it is very slow? For example this query is very slow:
db.calls.find({
'DateStarted':{'$gte':ISODate('2018-11-05T05:00:00Z')},
'DateEnded':{'$lte':ISODate('2020-11-05T05:00:00Z')},
'IdAccount':123,
'ToReversed' : /^5893/
}).limit(200).explain();
The problem is the 'ToReversed' : /^5893/. If I execute that query by itself it is really fast. Even if I put something that does not give me the limit of 200 results fast. Should I add a compound index as well? just for the scenario where it is slow
I need to give our customers the option to search by phone numbers that end with or start with a specific criteria. The moment I add extra stuff to the query it becomes really slow.
Edit
By researching on the internet if I use the hint option it is faster. It goes from 20 seconds to 5 seconds.
db.calls.find({
'DateStarted':{'$gte':ISODate('2018-11-05T05:00:00Z')},
'DateEnded':{'$lte':ISODate('2020-11-05T05:00:00Z')},
'IdAccount':123,
'ToReversed' : /^5893/
}).hint({'ToReversed':1}).limit(200).explain();
This is still slow and it will be great if I can lower it to 1 second just like the simple queries take milliseconds.
For the find query you showed us involving filtering on 4 fields, ideally the optimal index would cover all 4 fields:
db.calls.createIndex( {
"DateStarted": 1,
"DateEnded": 1,
"IdAccount": 1,
"ToReversed": 1
} )
As to which columns should appear first, you should generally place the most restrictive columns first. Check the cardinality of your data to determine this.
Related
I have smallish size database with 2 million records of phone calls.
When I execute:
db.getCollection('calls').find({
'IsIncoming':true, 'DateCreated' : { '$gte':ISODate('2010-12-02T02:26:22.478Z') }, 'CallerIdNum':"2545874578"
}).limit(100).count({})
it is supper fast and it takes 95ms. Note that IsIncoming, DateCreted and CallerIdNum have indexes. Every time I search using those fields it is supper fast
The moment I search for something containing part of a text it is very slow. For example this query now takes 25 seconds:
db.getCollection('calls').find({
'IsIncoming':true, 'DateCreated' : { '$gte': ISODate('2010-12-02T02:26:22.478Z') }, 'CallerIdNum' : /2545874/
}).limit(100).count({})
I know the reason is because I am searching within CallerIdNum. If I where to know the full caller id in advance like on my first query then it will be fast.
Question
I will like the last query to execute faster. I know it is probably impossible, and the only way of getting a great performance is by searching by the whole CallerIdNum. But maybe/hopefully I am wrong and someone can help me find a way of executing my last query faster.
The problem here is that you are searching for a substring of a caller ID number /2545874/. The is not sargable, and generally can't use an index. Assuming you really want numbers which start with that prefix, then use this sargable version:
db.getCollection('calls').find({
'IsIncoming':true, 'DateCreated' : { '$gte': ISODate('2010-12-02T02:26:22.478Z') }, 'CallerIdNum' : /^2545874/
}).limit(100).count({})
You might also want to add a compound index on all three fields, though at least the version of the query I gave above can use an index involving the CallerIdNum field.
Considering I have search pannel that inculude multiple options like in the picture below:
I'm working with mongo and create compound index on 3-4 properties with specific order.
But when i run a different combinations of searches i see every time different order in execution plan (explain()). Sometime i see it on Collection scan (bad) , and sometime it fit right to the index (IXSCAN).
The selective fields that should handle by mongo indexes are:(brand,Types,Status,Warehouse,Carries ,Search - only by id)
My question is:
Do I have to create all combination with all fields with different order , it can be 10-20 compound indexes. Or 1-3 big Compound Index , but again it will not solve the order.
What is the best strategy to deal with big various of fields combinations.
I use same structure queries with different combinations of pairs
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
'$and': [
{
'status': {
'$in': [
'XXX',
'YYY'
]
}
},
{
'searchId': {
'$in': [
'3859447'
]
}
},
{
'origin.brand': {
'$in': [
'aaaa',
'bbbb',
'cccc',
'ddd',
'eee',
'bundle'
]
}
},
{
'$or': [
{
'origin.carries': 'YYY'
},
{
'origin.carries': 'ZZZ'
},
{
'origin.carries': 'WWWW'
}
]
}
]
}).sort({"timestamp":1})
// My compound index is:
{status:1 ,searchId:-1,origin.brand:1, origin.carries:1 , timestamp:1}
but it only 1 combination ...it could be plenty like
a. {status:1} {b.status:1 ,searchId:-1} {c. status:1 ,searchId:-1,origin.brand:1} {d.status:1 ,searchId:-1,origin.brand:1, origin.carries:1} ........
Additionally , What will happened with Performance write/read ? , I think write will decreased over reads ...
The queries pattern are :
1.find(...) with '$and'/'$or' + sort
2.Aggregation with Match/sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
So you have subdocuments, ranged queries, and sorting by 1 field only.
It can eliminate most of the possible permutations. Assuming there are no other surprises.
D. SM already covered selectivity - you should really listen what the man says and at least upvote.
The other things to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with ranged queries: $in, $lt, $or etc
These are common rules for all b-trees. Now things that are specific to mongo:
A compound index can have no more than 1 multikey index - the index by a field in subdocuments like "origin.brand". Again I assume origins are embedded docs, so the document's shape is like this:
{
_id: ...,
status: ...,
timestamp: ....,
origin: [
{brand: ..., carries: ...},
{brand: ..., carries: ...},
{brand: ..., carries: ...}
]
}
For your query the best index would be
{
searchId: 1,
timestamp: 1,
status: 1, /** only if it is selective enough **/
"origin.carries" : 1 /** or brand, depending on data **/
}
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM otherwise it will be really slow.
Last but not least - indexing is not a one off job but a lifestyle. Data change over time, so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, collect stats from user's queries to remove unused indexes and free up some room. Basically apply common sense.
I noticed this one-year-old topic, because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of the fields, which makes it near to impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding-key, otherwise we cannot help MongoDB to limit its search in only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessariy the ones which make up the sharding-key, then we're stuck with a full-database search. Our dbase is some 10's of TB size...
Indexes should fit in RAM ? This can only be achieved with small databases, meaning some 100's GB max. How about my 37 TB database ? Indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", Crucial_col_B", etc, one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this as a 'pointer' to the chunk in collecton "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array in "Crucial_col_A" with chunk-numbers with be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optional) any other criteria on other fields in the documents; the first criterium for the crucial field (say field "B") will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterium. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries are done with precise values for the sharding-keys (= the chunk numbers), so these are expected to go also very fast.
This way of working would eliminate the burden of defining many index combinations.
I need to tag documents in a collection, let's call it 'Contacts'.
The first idea I had was to create an attribute called "tags" for each document.
Well, in this case we have something like:
{
_id:'1',
contact_name:'Asya Kamsky',
tags:['mongodb', 'maths', 'travels']
}
Now, let's suppose that we have users that want to tag any document in 'Contacts'.
If we keep the decision to save the tags attribute for each document, as the tags are personal, we need to use the userId for each tag.
So our document would be something like that (or not):
{
_id:'1',
contact_name:'Asya Kamsky',
tags:[
{userId:'alex',tags:['mongodb', 'maths', 'travels']},
{userId:'eric',tags:['databases', 'friends', 'japan']},
]
}
Now, let's complicate it a bit. Let's imagine that we have A LOT of users and each one want to tag documents with his personal tags.
How to deal with that?
Ok, we could create thousands of tags for each document:
{
_id:'1',
contact_name:'Asya Kamsky',
tags:[
{userId:'alex',tags:['mongodb', 'maths', 'travels']},
{userId:'eric',tags:['databases', 'friends', 'japan']},
{.....................................................}
{.....................................................}
{......................................................}
]
}
But, what if we have millions of users? In this case we have a 16mg limitation for each document, as I know....
At this point, worrying about the future growth of my application, I decided
to create a nice separated collection called 'tags' that would contain documents similar to:
{
"contact_name" : "Asya Kamsky",
"useriId" : "alex",
"tags" : ['mongodb', 'maths', 'travels'],
"timestamp" : "2017-08-08 14:33:28"
},
{
"contact_name" : "Asya Kamsky",
"useriId" : "eric",
"tags" : ['databases', 'friends', 'japan'],
"timestamp" : "2017-08-08 14:33:28"
}
That's, we have a separated documents that represent a tag of each user.
Cool and clean, right?
Well, i this case, we face 2 problems:
Minor problem: We return to the SQL logic that I don't like anymore but I accept in some cases.
Big (for me) problem: how to search a contact by PERSONAL tags? In this case we have a nice 'JOIN' problem that MongoDB resolves well using $lookup.
"Resolves well" for 10000, 20000, or even 500000 documents. But as I want to ensure a good performance in the future, I think about 10000000 contacts. So, as I researched recently, the $lookup works well for a "small part" of universe and, even with indexes, this search would take a lot of time to be executed.
How to resolve this challenge?
Thanks all
If your usage is such that the number of users X number/size of tags per contact (plus whatever other data is in a contacts document) is likely to bring you near the 16MB document size limit then storing the tags ins a separate collection seems valid. But before you go down that route are you sure this is likely? Have you tried creating contact documents in a bid to see how many tags, how many users per contact would get you near the 16MB limit. If the answer implies a number of users and/or tags which you are unlikely ever to reach then maybe your concerns are strictly theoretical and you could consider sticking with the simplest solution which is to embed the user specific tags inside contacts.
The rest of this answer assumes that the size estimates and your knowledge about the likely number of tags and users per contact are such that the size constraints are valid. On this basis, you stated this specific concern about join performance ...
But as I want to ensure a good performance in the future, I think about 10000000 contacts. So, as I researched recently, the $lookup works well for a "small part" of universe and, even with indexes, this search would take a lot of time to be executed.
Have you tried measuring this performance? Generate seed documents for contacts and tags and then persist variations of these and then run queries using $lookup and measure the performance. You could do this for a few benchmarks, for example:
1,000 contacts and 10,000 tags
100,000 contacts and 1,000,000 tags
1,000,000 contacts and 10,000,000 tags
10,000,000 contacts and 100,000,000 tags
When running your benchmark tests you can additionally use explain() to understand what's going on inside MongoDB.
You might find that performance is acceptable, only you can know this since you understand what expectations the users of your system have with respect to performance.
One last point, if the use case here is that a given user wants to find all of their contacts and tags then this could be handled with a 'client side join' i.e. two queries (1) to get the tags for "userId" : "..." and (2) to find the contacts referenced by those tags. Depending on what your use cases are, this could be more performant that a server side join (aka $lookup).
I have a list of about 50 tags in an array, and want to search through my documents to find records that match these tags.
Because they're user-submitted and mongoDB is case-sensitive, I'm using /wildcard/i as a means of searching. I know this is not the fastest way to do a search but I can't think of a better solution.
I can do my query in two ways. The first is to run a for loop over my tags array, and for each result, perform:
db.collection.find({tags: /<tag[x]>/i})
Or, I can collect all of the tags and run one single lookup using $or, like so:
db.collection.find({$or:[{tags:/<tag1>/i},{tags:/<tag2>/i},{tags:/<tag3>/i}, ... {tags:/<tag50>/i}]});
I have tried both, and found using $or to be significantly faster - but because of the work-in-progress state of my application, it's very difficult to tell whether this is because it's actually faster or whether my app is causing significant overhead in other areas (it is).
So for clarification, in MongoDB is a big query performed once faster than small queries performed many times?
EDIT: Another example would be whether looking up 3 individual records based on _id is faster than doing one lookup using {$or:[{_id: ObjectId([id1])},{_id: ObjectId([id2])},{_id: ObjectId([id3])}]}. Is less more?
I recommend you adjust your schema so it keeps a normalized array of tags. When you insert a new document, do it like this:
tags : [ "business", "Computing", "PayPal" ],
lowercaseTags : [ "business", "computing", "paypal" ]
Similarly when you update the tags, update both arrays.
Create an index on lowercaseTags, and then when you want to query them, use a single query with the $in operator, and the normalized form of the search terms.
For example, to search for business iTunes YouTube, use this query:
db.collection.find( { tags : $in: [ "business", "itunes", "youtube" ] } )
This answer gives an example of this approach. It should be loads faster than what you have.
An alternate approach you can take is to create a text index and use the text command.
Both of these approaches are geared toward index optimization, and designing your schema to work well with Mongo. The payoff should be a lot higher than whatever difference there is between a single $or query and 50 simpler queries.
I have browsed through various examples but have failed to find what I am looking for.. What I want is to search for a specific document by _id and skip multiple times between a collection by using one query? Or some alternative which is fast enough to my case.
Following query would skip first one and return second in advance:
db.posts.find( { "_id" : 1 }, { comments: { $slice: [ 1, 1 ] } } )
That would be skip 0, return 1 and leaves the rest out from result..
But what If there would be like 10000 comments and I would want to use same pattern, but return that array values like this:
skip 0, return 1, skip 2, return 3, skip 4, return 5
So that would return collection which comments would be size of 5000, because half of them is skipped away. Is this possible? I applied large number like 10000 because I fear that using multiple queries to apply this would not be performance wise.. (example shown in here: multiple queries to accomplish something similar). Thnx!
I went through several resources and concluded that currently this is impossible to make with one query.. Instead, I agreed on that there are only two options to overcome this problem:
1.) Make a loop of some sort and run several slice queries while increasing the position of a slice. Similar to resource I linked:
var skip = NUMBER_OF_ITEMS * (PAGE_NUMBER - 1)
db.companies.find({}, {$slice:[skip, NUMBER_OF_ITEMS]})
However, depending on the type of a data, I would not want to run 5000 individual queries to get only half of the array contents, so I decided to use option 2.) Which seems for me relatively fast and performance wise.
2.) Make single query by _id to row you want and before returning results to client or some other part of your code, skip your unwanted array items away by using for loop and then return the results. I made this at java side since I talked to mongo via morphia. I also used query explain() to mongo and understood that returning single line with array which has 10000 items while specifying _id criteria is so fast, that speed wasn't really an issue, I bet that slice skip would only be slower.