I have this collection in MongoDB. It contains
values of different types under the val key.
Also, note that I am sorting it by val ascending.
[test] 2014-02-20 08:53:11.857 >>> db.account.find().sort({val:1});
{
"_id" : ObjectId("5304d25786dd4b348bcc2b2e"),
"username" : "usr10",
"password" : "123",
"val" : [ ]
}
{
"_id" : ObjectId("5304d29986dd4b348bcc2b2f"),
"username" : "usr20",
"password" : "456",
"val" : null
}
{
"_id" : ObjectId("5304e31686dd4b348bcc2b37"),
"username" : "usr80",
"password" : "555",
"val" : 1
}
{
"_id" : ObjectId("5304d50a86dd4b348bcc2b32"),
"username" : "usr50",
"password" : "555",
"val" : [
40
]
}
{
"_id" : ObjectId("5304d4c886dd4b348bcc2b31"),
"username" : "usr40",
"password" : "777",
"val" : 200
}
{
"_id" : ObjectId("5304d2a186dd4b348bcc2b30"),
"username" : "usr30",
"password" : "888",
"val" : {
}
}
{
"_id" : ObjectId("5304d97786dd4b348bcc2b33"),
"username" : "usr50",
"password" : "555",
"val" : {
"ok" : 1
}
}
{
"_id" : ObjectId("5304e2dc86dd4b348bcc2b36"),
"username" : "usr80",
"password" : "555",
"val" : true
}
{
"_id" : ObjectId("5304e22f86dd4b348bcc2b34"),
"username" : "usr60",
"password" : "555",
"val" : ISODate("2014-02-19T16:56:15.787Z")
}
{
"_id" : ObjectId("5304e2c786dd4b348bcc2b35"),
"username" : "usr70",
"password" : "555",
"val" : /abc/
}
[test] 2014-02-20 08:53:19.357 >>>
I am reading a book which says the following.
MongoDB has a hierarchy as to how types compare. Sometimes you will have
a single key with multiple types: for instance, integers and booleans, or strings
and nulls. If you do a sort on a key with a mix of types, there is a predefined
order that they will be sorted in. From least to greatest value, this ordering
is as follows:
1. Minimum value
2. null
3. Numbers (integers, longs, doubles)
4. Strings
5. Object/document
6. Array
7. Binary data
8. Object ID
9. Boolean
10. Date
11. Timestamp
12. Regular expression
13. Maximum value
So why is my sorting order different? For example,
when I sort (see above) I see these strange things:
1) I have no idea what 'minimum value' and 'maximum value' mean.
2) An array comes before a number. And an empty
array comes even before null.
3) The number 1 comes before an array
4) The array [40] comes between numbers 1 and 200.
Could someone explain this result in some detail?
Many thanks in advance.
Your book says the same as the official documentation, but that also does not explain the seemingly obscure ordering of the two arrays. At least the two types Minimum value and Maximum value are explained there: they are internal types.
The type order is only used when there isn't another supported way of ordering elements. Array fields have their own sorting behavior: the minimum of an array's elements is used on an ascending sort, and the maximum on a descending sort. The type of that minimum or maximum value then determines how the doc is ordered relative to docs whose fields have that type.
So [40] comes after 1, but before 200, because the minimum value of that array is 40.
The empty array has no value at all, which is why it ends up with the doc where the value is null. If I reverse the sort, they stay in the same order, which implies that MongoDB considers them equal.
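A quick way to see this in the shell (a minimal sketch; the collection name t is illustrative, and insertMany assumes a reasonably modern shell):
db.t.drop()
db.t.insertMany([ { v: [] }, { v: null }, { v: 1 }, { v: [40] }, { v: 200 } ])
// ascending: each array is represented by its minimum element
db.t.find({}, { _id: 0 }).sort({ v: 1 })   // [], null, 1, [40], 200
// descending: each array is represented by its maximum element
db.t.find({}, { _id: 0 }).sort({ v: -1 })  // 200, [40], 1, [], null (tied values keep their insertion order)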
Where is the sort clause in your query? Your sort order appears to be the default order - notice the ascending ObjectIds. You mentioned you're sorting by val, so I would expect your query to be
db.account.find().sort({ val: 1 })
MongoDB type comparison
Why
MongoDB is a schemaless database: it lets you store pieces of information (documents) without defining a structure (a schema of fields) up front, unlike SQL, where we have to define a schema in the form of columns and their data types.
For sorting and data retrieval this can be problematic. In order to be predictable, MongoDB has a fixed type order for sorting values of different types, as you already mentioned:
Minimum value (MinKey)
Null
Numbers (ints, longs, doubles, decimals)
Strings
Object/document
Array
Binary data
ObjectId
Boolean
Date
Timestamp
Regex
Maximum value (MaxKey)
During a sort, values get compared in order to decide how they should be positioned. The list above defines, from lowest to highest, how these data types compare to each other. For example, take these documents:
{ "a": 0 }
{ "a": "a" }
{ "a": 1 }
When sorted by a ascending, numbers precede strings, since the list places Numbers (3) before Strings (4).
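Running that sort in the shell (c is an illustrative collection holding exactly those three documents):
db.c.find({}, { _id: 0 }).sort({ a: 1 })
// { "a" : 0 }
// { "a" : 1 }
// { "a" : "a" }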
Conversions
MongoDB effectively converts certain data types when comparing them (if you know JavaScript, this should feel familiar):
[] -> null
[[]] -> Array
// but
"" -> String
0 -> Number
...
For one-dimensional arrays, the conversion depends on the sort direction:
// ascending
[1, 2, 3, 4, 5] (Array) -> 1 (Int)
[5, 8, 10] (Array) -> 5 (Int)
// descending
[1, 2, 3, 4, 5] (Array) -> 5 (Int)
[5, 8, 10] (Array) -> 10 (Int)
MinKey / MaxKey
MinKey is the lowest possible value in every comparison and MaxKey the highest. Both are used internally; when a collection is sorted, documents holding them always land at the very beginning or the very end.
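You can observe this directly in the shell (a sketch; t is an illustrative scratch collection, and MinKey()/MaxKey() are the shell constructors for these internal types):
db.t.drop()
db.t.insertMany([ { v: MaxKey() }, { v: "x" }, { v: 5 }, { v: MinKey() } ])
db.t.find({}, { _id: 0 }).sort({ v: 1 })
// MinKey doc first, then 5, then "x", then the MaxKey doc last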
Questions
I have no idea what 'minimum value' and 'maximum value' mean.
These are MinKey and MaxKey. See above.
An array comes before a number. And an empty array comes even before null.
It does not. As already explained, [] is treated as null. When the compared values are equal, as with null and null, or 40 and [40], the documents come back in natural order, which in your case follows the timestamp embedded in the ObjectId.
Try creating the null entry first and then the empty array; the two will swap places, as in the sketch below.
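A sketch of that experiment (t2 is an illustrative collection name):
db.t2.drop()
db.t2.insertOne({ tag: "created first", v: null })
db.t2.insertOne({ tag: "created second", v: [] })
// the two values compare equal, so natural (insertion) order decides:
db.t2.find({}, { _id: 0 }).sort({ v: 1 })
// { "tag" : "created first", "v" : null }
// { "tag" : "created second", "v" : [ ] }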
The number 1 comes before an array
For comparison purposes [40] does not behave like an Array, since it's converted to the number 40. [[40]] would compare as an Array.
The array [40] comes between numbers 1 and 200.
I think you've got it. See 3: in an ascending sort [40] compares as 40, which falls between 1 and 200.
Sources and further reading
https://www.w3resource.com/mongodb/mongodb-data-types.php
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/
https://docs.mongodb.com/manual/reference/bson-types/
Related
I am new to MongoDB and while doing some practising, I came across a weird problem. A sample document:
{
"_id" : ObjectId("5c8eccc1caa187d17ca6ed29"),
"city" : "CLEVELAND",
"zip" : "35049",
"loc" : {
"y" : 33.992106,
"x" : 86.559355
},
"pop" : 2369,
"state" : "AL"
} ...
I want to find the number of cities that have a population of more than 5000 but less than 1000000.
Both these queries, this:
db.zips.find({"$nor":[{"pop":{"$lt":5000}},{"pop":{"$gt":"1000000"}}]}).count()
and this:
db.zips.find({"$nor":[{"pop":{"$gt":1000000}},{"pop":{"$lt":"5000"}}]}).count()
give different results.
The first one gives 11193 and the second one gives 29470. Since I come from a MySQL background, the two queries look identical to me: both should return the number of zip codes with a population of less than 1000000 and more than 5000. Please help me understand.
Thanks in advance.
Comparison operators such as $gt and $lt should compare values of the same data type.
Your first query quoted "1000000" and your second query quoted "5000", so the two queries ended up not being the same: each compares pop against a number in one clause and against a string in the other, and a numeric pop can never match a string bound, because BSON puts numbers and strings in different type brackets.
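With both bounds numeric the two formulations agree; here is a sketch of the straightforward version of the intended query:
// count zips with 5000 < pop < 1000000 (both bounds are numbers)
db.zips.find({ pop: { $gt: 5000, $lt: 1000000 } }).count()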
We need to create a compound index in the same order as the parameters are being queried. Does this order matter performance-wise at all?
Imagine we have a collection of all humans on earth, with an index on sex (99.9% of the time "male" or "female", but a string nonetheless, not a binary field) and an index on name.
If we would want to be able to select all people of a certain sex with a certain name, e.g. all "male"s named "John", is it better to have a compound index with sex first or name first? Why (not)?
Redsandro,
You must consider Index Cardinality and Selectivity.
1. Index Cardinality
The index cardinality refers to how many possible values there are for a field. The field sex only has two possible values, so it has a very low cardinality. Other fields such as names, usernames, phone numbers, and emails will have a nearly unique value for every document in the collection, which is considered high cardinality.
Greater Cardinality
The greater the cardinality of a field, the more helpful an index will be, because indexes narrow the search space to a much smaller set.
Suppose you have an index on sex and you are looking for men named John. Indexing by sex first would only narrow down the result space by approximately 50%. Conversely, if you indexed by name, you would immediately narrow the result set down to the minute fraction of users named John, and then check the sex of just those documents.
Rule of Thumb
Try to create indexes on high-cardinality keys or put high-cardinality keys first in the compound index. You can read more about it in the section on compound indexes in the book:
MongoDB: The Definitive Guide
2. Selectivity
Also, you want to use indexes selectively and write queries that limit the number of possible documents via the indexed field. To keep it simple, consider the following collection. With the index {name: 1}, running the query {name: "John", sex: "male"} scans only 1 document, because you allowed MongoDB to be selective.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Now consider the same collection with the index {sex: 1}. Running the query {sex: "male", name: "John"} has to scan 4 documents.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Imagine the possible differences on a larger data set.
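You can measure this yourself with explain (a sketch; people is an illustrative collection with the fields above):
db.people.createIndex({ name: 1, sex: 1 })
db.people.createIndex({ sex: 1, name: 1 })
// compare totalKeysExamined under each key order:
db.people.find({ name: "John", sex: "male" }).hint({ name: 1, sex: 1 }).explain("executionStats")
db.people.find({ name: "John", sex: "male" }).hint({ sex: 1, name: 1 }).explain("executionStats")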
A little explanation of Compound Indexes
It's easy to make the wrong assumption about compound indexes. According to the MongoDB docs on compound indexes:
MongoDB supports compound indexes, where a single index structure
holds references to multiple fields within a collection’s documents.
When you create a compound index, a single index holds multiple fields. So if we index a collection by {"sex" : 1, "name" : 1}, the index entries (stored in sorted key order) will look roughly like:
["male","Rick"] -> 0x0c965148
["male","John"] -> 0x0c965149
["male","Sean"] -> 0x0cdf7859
["male","Bro"] ->> 0x0cdf7859
...
["female","Kate"] -> 0x0c965134
["female","Katy"] -> 0x0c965126
["female","Naji"] -> 0x0c965183
["female","Joan"] -> 0x0c965191
["female","Sara"] -> 0x0c965103
If we index the collection by {"name" : 1, "sex" : 1} instead, the index will look roughly like:
["John","male"] -> 0x0c965148
["John","female"] -> 0x0c965149
["John","male"] -> 0x0cdf7859
["Rick","male"] -> 0x0cdf7859
...
["Kate","female"] -> 0x0c965134
["Katy","female"] -> 0x0c965126
["Naji","female"] -> 0x0c965183
["Joan","female"] -> 0x0c965191
["Sara","female"] -> 0x0c965103
Having {name: 1} as the prefix will serve you much better when using compound indexes. There is much more that can be read on the topic; I hope this offers some clarity.
I'm going to say I did an experiment on this myself, and found that there seems to be no performance penalty for putting the poorly distinguished index key first. (I'm using MongoDB 3.4 with WiredTiger, which may behave differently than MMAPv1.) I inserted 250 million documents into a new collection called items. Each doc looked like this:
{
field1:"bob",
field2:i + "",
field3:i + ""
}
"field1" was always equal to "bob". "field2" was equal to i (as a string), so it was completely unique. First I did a search on field2, and it took over a minute to scan 250 million documents. Then I created an index like so:
db.items.createIndex({field1:1,field2:1})
Of course field1 is "bob" on every single document, so you might expect the index to scan a number of entries before finding the desired document. However, this was not the result I got.
I did another search on the collection after the index finished creating. This time I got the results listed below. You'll see that "totalKeysExamined" is 1 each time. So perhaps WiredTiger has figured out how to do this better; I have read that WiredTiger compresses index prefixes, so that may have something to do with it.
db.items.find({field1:"bob",field2:"250888000"}).explain("executionStats")
{
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 4,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
...
"docsExamined" : 1,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
...
"indexName" : "field1_1_field2_1",
"isMultiKey" : false,
...
"indexBounds" : {
"field1" : [
"[\"bob\", \"bob\"]"
],
"field2" : [
"[\"250888000\", \"250888000\"]"
]
},
"keysExamined" : 1,
"seeks" : 1
}
}
Then I created an index on field3 (which has the same values as field2), and searched:
db.items.find({field3:"250888000"});
It took the same 4 ms as the query using the compound index. I repeated this a number of times with different values for field2 and field3 and got insignificant differences each time. This suggests that with WiredTiger there is no performance penalty for having poor differentiation on the first field of an index.
Note that multiple equality predicates do not have to be ordered from most selective to least selective. This guidance has been provided in the past, but it is erroneous due to the nature of B-tree indexes: in its leaf pages, a B-tree stores the combinations of all fields' values, so there are exactly the same number of combinations regardless of key order.
https://www.alexbevi.com/blog/2020/05/16/optimizing-mongodb-compound-indexes-the-equality-sort-range-esr-rule/
This blog article disagrees with the accepted answer. The benchmark in the other answer also shows that it doesn't matter. The author of that article is a "Senior Technical Services Engineer at MongoDB", which sounds like a credible person on this topic, so I guess the order really doesn't affect performance on equality fields after all. I'll follow the ESR rule instead.
Also consider prefixes. Filtering for { a: 1234 } won't work with an index of { b: 1, a: 1 }: https://docs.mongodb.com/manual/core/index-compound/#prefixes
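A sketch of that prefix rule in the shell (c is an illustrative collection name):
db.c.createIndex({ b: 1, a: 1 })
db.c.find({ b: 7 }).explain()           // can use the index: { b } is a prefix
db.c.find({ a: 1234 }).explain()        // collection scan: { a } alone is not a prefix
db.c.find({ b: 7, a: 1234 }).explain()  // can use both keys of the index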
I found a strange behavior in the MongoDB::Collection->find function and wanted to ask whether this is a result of my misunderstanding of the way the driver works or whether it's really odd behavior.
I want to make a query that searches on the indexed fields but also on some of the other ones. To be sure that MongoDB would use the existing index, I thought of passing the search terms as an array_ref, with the fields in the order I wanted them to be, but the explain() function shows that it uses a BasicCursor and scans all documents. If, on the other hand, I pass the search terms as a hash_ref (which doesn't guarantee any field order), MongoDB gets the fields in the correct order and uses the existing index.
I have a collection with documents as follow:
{
"customer" : NumberLong(),
"date" : ISODate(),
field_a : 1,
field_b: 2,
[...]
field_n: 1
}
There is an index called customer_1_date_1 on fields customer and date.
When I give the search terms as an array ref, the explain command states:
my $SearchTerms = [
'customer', $customer,
'date' , $date ,
'field_a' , $field_a ,
'field_b' , $field_b ,
];
cursor "BasicCursor",
[...]
nscanned 5802,
nscannedAllPlans 5802,
[...]
On the other hand, when I give the search terms as a hash ref, the explain command states:
my $SearchTerms = {
customer => $customer,
date => $date,
field_a => $field_a,
field_b => $field_b,
};
allPlans [
[0] {
cursor "BtreeCursor customer_1_date_1",
[...]
n 1,
nscanned 4,
nscannedObjects 4
},
[1] {
cursor "BasicCursor",
[...]
n 0,
nscanned 4,
nscannedObjects 4
}
],
[...]
nscanned 4,
nscannedAllPlans 8,
nscannedObjects 4,
nscannedObjectsAllPlans 8,
So, my question is: should I trust the hash ref to always get the existing index?
Is there any way, without using additional modules, to ensure this?
Thank you all,
Juanma
I have a collection of documents that looks like:
[{ id : 1, a : 123, b : 32, name : 'test'}, { id : 2, a : 23, b : 32, name : 'another'}]
I am trying to sort over column a (and similarly b) and then add another column to each document containing the rank of each value, where ties are allowed and take the average rank. It seems like I should be using the MongoDB aggregation framework, but I cannot figure out how to sort and then assign the rank to another column. In the end I should end up with:
[
{ id : 1, a : 123, b : 32, name : 'test', aRank : 1, bRank : 1.5},
{ id : 2, a : 23, b : 32, name : 'another', aRank : 2, bRank : 1.5}
]
Any help? Thank you
Since no answer was ever given, and searches kept bringing nothing up, I built an npm module that provides standard and fractional ranking for any JS array over a numeric column. I never did come up with a way of doing it with aggregation in MongoDB, but this solution works well enough for me.
npm module: https://www.npmjs.org/package/rank.js
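For reference, a minimal plain-JavaScript sketch of fractional ranking over one numeric column, independent of that module (all names here are illustrative; rank 1 goes to the highest value, and ties receive the average of their positions, matching the expected output above):
function fractionalRanks(docs, key) {
  // indices sorted descending by the column value
  var order = docs.map(function (_, i) { return i; })
                  .sort(function (x, y) { return docs[y][key] - docs[x][key]; });
  var ranks = new Array(docs.length);
  for (var i = 0; i < order.length; ) {
    var j = i;
    // extend j across a run of tied values
    while (j + 1 < order.length &&
           docs[order[j + 1]][key] === docs[order[i]][key]) j++;
    var avg = (i + j + 2) / 2; // average of the 1-based positions i+1 .. j+1
    for (var k = i; k <= j; k++) ranks[order[k]] = avg;
    i = j + 1;
  }
  return ranks;
}
var docs = [{ a: 123, b: 32 }, { a: 23, b: 32 }];
fractionalRanks(docs, 'a'); // [ 1, 2 ]
fractionalRanks(docs, 'b'); // [ 1.5, 1.5 ]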
I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
Using skip() anyway
Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : {$lt : lastDate}}, {"score" : {$lt : lastScore}}]} (written out below)
Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
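Option 2 written out as a runnable filter (a sketch; lastScore and lastDate are the values from the final document of the previous page, and pageSize is illustrative):
db.comments.find({
  $or: [
    { score: lastScore, date: { $lt: lastDate } }, // same score, strictly older
    { score: { $lt: lastScore } }                  // strictly lower score
  ]
}).sort({ score: -1, date: -1 }).limit(pageSize)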
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
"a" : { "score" : 9,
"date" : ISODate("2014-02-22T00:00:00Z") },
...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore,
date: lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
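A sketch of that field-order pitfall in the shell (foo is the collection from the example above):
db.foo.insertOne({ a: { score: 9, date: ISODate("2014-02-22T00:00:00Z") } })
db.foo.insertOne({ a: { date: ISODate("2014-02-22T00:00:00Z"), score: 9 } })
// these two sub-documents are NOT equal as BSON: keys are compared in stored
// order, so the second document sorts and matches differently than the first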
I would go with using $max to specify the upper bound, in conjunction with $hint to make sure that the database uses the index you want. The reason that $lt is in general preferred over $max is because $max selects the index using the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: $max is equivalent to $lte, not $lt, so when using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.
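Put together, a sketch of that approach in the shell (lastScore and lastDate come from the previous page, pageSize is illustrative, and the index { score: -1, date: -1 } is assumed; hint makes the bound unambiguous):
db.comments.find()
  .max({ score: lastScore, date: lastDate })
  .hint({ score: -1, date: -1 })
  .sort({ score: -1, date: -1 })
  .skip(1)   // per the note above, $max behaves like $lte, so drop the boundary document
  .limit(pageSize)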