how to build index in mongodb in this situation - mongodb

I have a mongodb database, which has following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
which is a reverse index for a log of docs(about 120 millions).
there are two kinds of queries in my system:
one of which is :
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
this query fetch the word "ipad" in date 20140113, and sort the all docs by score.
another query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
to speed up these two kinds of query, what index should I build?
Should I build two indexes like this?:
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1}
but I think build the two index use two much hard disk space.
So do you have some good ideas?

You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed with .explain().
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes, keep in mind that every collection must have an unique index on the _id field which is created implicitely when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with an uniqueness-constraint and you have no use for the _id field, you can move that unique-constraint to the _id field to save an index. Your documents would then look like this:
{ _id: {
"word":"ipad",
"date":20140113,
"docid": 324
},
"score": 98
}

Related

Mongodb searching on array / indexing

I'm using the airbnb sample set and it has a field that looks like:
"amenities": ["TV", "Cable TV", "Wifi"....
So I'm trying to do a case-INsensitive, wildcard search (on one or more values passed in).
Only thing I've found that works is:
{ amenities: { $in: [ /wi/ ] }}
Is that the best way?
So I ran it in Compass as the dataset was imported (5600 docs), and the Explain says it took ~20ms on my machine and warned there was no index. I then created an index on the amenities column and the same search jumped up to ~100ms. I just created the index through the Compass UI, so not sure why its taking 5x as long with an index? Or if there is a better way to do this?
The way to run that query is:
{ amenities: /wi/i }
//better but not always useful
{ amenities: /wi/i }, { amenities:1, _id:0 }
It already traverses the array, and to be case insensitive it must be on the options.
For multikey indexes the second query won't be a covered query. Otherwise, it would be blazing fast.
I've tested a similar search with and without index though. Exec. time is reduced 10X. (1500ms to 150ms, in a huge collection). Measure with Mongo Hacker.
As you report executionTimeMilliseconds is not that different. But still smaller.
The reason why you don't see a huge decrease in time is because the index stores each array entry separately. When it finds a match, it comes back to collection to fetch the whole array field, instead of using the indexes.
Probably indexes aren't very useful for arrays.
When querying with an unanchored regex, the query executor will have to scan every index key to see if there is a match.
You might find a collated index to be helpful.
Create an index with the appropriate collation, like:
(strength 1 and 2 are case-insensitive)
db.collection.createIndex({amenities:1},{collation:{locale:"en",strength:1}})
Then query using the same collation:
db.collection.find({amenities:"wifi"}).collation({locale:"en",strength:1})
The search will be case insensitive, and it can efficiently use the index.

What is the correct way to Index in MongoDB when big combination of fields exist

Considering I have search pannel that inculude multiple options like in the picture below:
I'm working with mongo and create compound index on 3-4 properties with specific order.
But when i run a different combinations of searches i see every time different order in execution plan (explain()). Sometime i see it on Collection scan (bad) , and sometime it fit right to the index (IXSCAN).
The selective fields that should handle by mongo indexes are:(brand,Types,Status,Warehouse,Carries ,Search - only by id)
My question is:
Do I have to create all combination with all fields with different order , it can be 10-20 compound indexes. Or 1-3 big Compound Index , but again it will not solve the order.
What is the best strategy to deal with big various of fields combinations.
I use same structure queries with different combinations of pairs
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
'$and': [
{
'status': {
'$in': [
'XXX',
'YYY'
]
}
},
{
'searchId': {
'$in': [
'3859447'
]
}
},
{
'origin.brand': {
'$in': [
'aaaa',
'bbbb',
'cccc',
'ddd',
'eee',
'bundle'
]
}
},
{
'$or': [
{
'origin.carries': 'YYY'
},
{
'origin.carries': 'ZZZ'
},
{
'origin.carries': 'WWWW'
}
]
}
]
}).sort({"timestamp":1})
// My compound index is:
{status:1 ,searchId:-1,origin.brand:1, origin.carries:1 , timestamp:1}
but it only 1 combination ...it could be plenty like
a. {status:1} {b.status:1 ,searchId:-1} {c. status:1 ,searchId:-1,origin.brand:1} {d.status:1 ,searchId:-1,origin.brand:1, origin.carries:1} ........
Additionally , What will happened with Performance write/read ? , I think write will decreased over reads ...
The queries pattern are :
1.find(...) with '$and'/'$or' + sort
2.Aggregation with Match/sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
So you have subdocuments, ranged queries, and sorting by 1 field only.
It can eliminate most of the possible permutations. Assuming there are no other surprises.
D. SM already covered selectivity - you should really listen what the man says and at least upvote.
The other things to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with ranged queries: $in, $lt, $or etc
These are common rules for all b-trees. Now things that are specific to mongo:
A compound index can have no more than 1 multikey index - the index by a field in subdocuments like "origin.brand". Again I assume origins are embedded docs, so the document's shape is like this:
{
_id: ...,
status: ...,
timestamp: ....,
origin: [
{brand: ..., carries: ...},
{brand: ..., carries: ...},
{brand: ..., carries: ...}
]
}
For your query the best index would be
{
searchId: 1,
timestamp: 1,
status: 1, /** only if it is selective enough **/
"origin.carries" : 1 /** or brand, depending on data **/
}
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM otherwise it will be really slow.
Last but not least - indexing is not a one off job but a lifestyle. Data change over time, so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, collect stats from user's queries to remove unused indexes and free up some room. Basically apply common sense.
I noticed this one-year-old topic, because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of the fields, which makes it near to impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding-key, otherwise we cannot help MongoDB to limit its search in only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessariy the ones which make up the sharding-key, then we're stuck with a full-database search. Our dbase is some 10's of TB size...
Indexes should fit in RAM ? This can only be achieved with small databases, meaning some 100's GB max. How about my 37 TB database ? Indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", Crucial_col_B", etc, one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this as a 'pointer' to the chunk in collecton "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array in "Crucial_col_A" with chunk-numbers with be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optional) any other criteria on other fields in the documents; the first criterium for the crucial field (say field "B") will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterium. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries are done with precise values for the sharding-keys (= the chunk numbers), so these are expected to go also very fast.
This way of working would eliminate the burden of defining many index combinations.

Fundamental misunderstanding of MongoDB indices

So, I read the following definition of indexes from [MongoDB Docs][1].
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
Indexes are special data structures that store a small portion of the
collection’s data set in an easy to traverse form. The index stores
the value of a specific field or set of fields, ordered by the value
of the field. The ordering of the index entries supports efficient
equality matches and range-based query operations. In addition,
MongoDB can return sorted results by using the ordering in the index.
I have a sample database with a collection called pets. Pets have the following structure.
{
"_id": ObjectId(123abc123abc)
"name": "My pet's name"
}
I created an index on the name field using the following code.
db.pets.createIndex({"name":1})
What I expect is that the documents in the collection, pets, will be indexed in ascending order based on the name field during queries. The result of this index can potentially reduce the overall query time, especially if a query is strategically structured with available indices in mind. Under that assumption, the following query should return all pets sorted by name in ascending order, but it doesn't.
db.pets.find({},{"_id":0})
Instead, it returns the pets in the order that they were inserted. My conclusion is that I lack a fundamental understanding of how indices work. Can someone please help me to understand?
Yes, it is misunderstanding about how indexes work.
Indexes don't change the output of a query but the way query is processed by the database engine. So db.pets.find({},{"_id":0}) will always return the documents in natural order irrespective of whether there is an index or not.
Indexes will be used only when you make use of them in your query. Thus,
db.pets.find({name : "My pet's name"},{"_id":0}) and db.pets.find({}, {_id : 0}).sort({name : 1}) will use the {name : 1} index.
You should run explain on your queries to check if indexes are being used or not.
You may want to refer the documentation on how indexes work.
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/

How to make distinct operation more quickly in mongodb

There are 30,000,000 records in one collection.
when I use distinct command on this collection by java, it takes about 4 minutes, the result's count is about 40,000.
Is mongodb's distinct operation so inefficiency?
and how can I make it more efficient?
Is mongodb's distinct operation so inefficiency?
At 30m records? I would say 4 minutes is actually quite good, I think that's just as fast, maybe a little faster than SQL does it.
I would probably test this in other databases before saying it is inefficient.
However, one way of looking at performance is to see if the field is indexed first and if that index is in RAM or can be loaded without page thrashing. Distinct() can use an index so long as the field has an index.
and how can I make it more efficient?
You could use a couple of methods:
Incremental map reduce to distinct the main collection once every, say, 5 mins to a unique collection
And Pre-aggregate the unique collection on save by saving to two collections, one detail and one unique
Those are the two most viable methods of getting around this performantly.
Edit
Distinct() is not outdated and if it fits your needs is actually more performant than $group since it can use an index.
The .distinct() operation is an old one, as is .group(). In general these have been superseded by .aggregate() which should be generally used in preference to these actions:
db.collection.aggregate([
{ "$group": {
"_id": "$field",
"count": { "$sum": 1 }
}
)
Substituting "$field" with whatever field you wish to get a distinct count from. The $ prefixes the field name to assign the value.
Look at the documentation and especially $group for more information.

Sorting on Multiple fields mongo DB

I have a query in mongo such that I want to give preference to the first field and then the second field.
Say I have to query such that
db.col.find({category: A}).sort({updated: -1, rating: -1}).limit(10).explain()
So I created the following index
db.col.ensureIndex({category: 1, rating: -1, updated: -1})
It worked just fined scanning as many objects as needed, i.e. 10.
But now I need to query
db.col.find({category: { $ne: A}}).sort({updated: -1, rating: -1}).limit(10)
So I created the following index:
db.col.ensureIndex({rating: -1, updated: -1})
but this leads to scanning of the whole document and when I create
db.col.ensureIndex({ updated: -1 ,rating: -1})
It scans less number of documents:
I just want to ask to be clear about sorting on multiple fields and what is the order to be preserved when doing so. By reading the MongoDB documents, it's clear that the field on which we need to perform sorting should be the last field. So that is the case I assumed in my $ne query above. Am I doing anything wrong?
The MongoDB query optimizer works by trying different plans to determine which approach works best for a given query. The winning plan for that query pattern is then cached for the next ~1,000 queries or until you do an explain().
To understand which query plans were considered, you should use explain(1), eg:
db.col.find({category:'A'}).sort({updated: -1}).explain(1)
The allPlans detail will show all plans that were compared.
If you run a query which is not very selective (for example, if many records match your criteria of {category: { $ne:'A'}}), it may be faster for MongoDB to find results using a BasicCursor (table scan) rather than matching against an index.
The order of fields in the query generally does not make a difference for the index selection (there are a few exceptions with range queries). The order of fields in a sort does affect the index selection. If your sort() criteria does not match the index order, the result data has to be re-sorted after the index is used (you should see scanAndOrder:true in the explain output if this happens).
It's also worth noting that MongoDB will only use one index per query (with the exception of $ors).
So if you are trying to optimize the query:
db.col.find({category:'A'}).sort({updated: -1, rating: -1})
You will want to include all three fields in the index:
db.col.ensureIndex({category: 1, updated: -1, rating: -1})
FYI, if you want to force a particular query to use an index (generally not needed or recommended), there is a hint() option you can try.
That is true but there are two layers of ordering you have here since you are sorting on a compound index.
As you noticed when the first field of the index matches the first field of sort it worked and the index was seen. However when working the other way around it does not.
As such by your own obersvations the order needed to be preserved is query order of fields from first to last. The mongo analyser can sometimes move around fields to match an index but normally it will just try and match the first field, if it cannot it will skip it.
try this code it will sort data first based on name then keeping the 'name' in key holder it will sort 'filter'
var cursor = db.collection('vc').find({ "name" : { $in: [ /cpu/, /memo/ ] } }, { _id: 0, }).sort( { "name":1 , "filter": 1 } );
Sort and Index Use
MongoDB can obtain the results of a sort operation from an index which
includes the sort fields. MongoDB may use multiple indexes to support
a sort operation if the sort uses the same indexes as the query
predicate. ... Sort operations that use an index often have better
performance than blocking sorts.
db.restaurants.find().sort( { "borough": 1, "_id": 1 } )
more information :
https://docs.mongodb.com/manual/reference/method/cursor.sort/