If I have documents that look like:
{
  _id: 1,
  val1: "x",
  val2: "aa"
},
{
  _id: 2,
  val1: "y",
  val2: "bb"
},
{
  _id: 3,
  val1: "x",
  val2: "cc"
},
{
  _id: 4,
  val1: "z",
  val2: "bb"
}
Is it possible to group them in MongoDB so that docs 1 and 3 are paired and docs 2 and 4 are paired?
In essence, I'm looking to group docs if their val1 OR val2 are the same. Also, no doc can ever belong to two different groups, meaning I should be able to partition the set of docs. Is this possible in Mongo?
Ultimately, I want to partition my set of documents based on the aforementioned criteria and then count the size of each subset.
I've tried attacking this problem by grouping on the val1 field and using $addToSet to create an array of val2 values. But then I'm stuck, because I don't know of a way in Mongo to merge arrays that contain at least one common element. If I did, I could take that list of arrays and aggregate again using $in.
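For reference, the grouping stage I tried looks roughly like this (a sketch; the collection name docs is a placeholder):
db.docs.aggregate([
  // Group on val1 and collect the distinct val2 values seen in each group
  { $group: { _id: "$val1", val2s: { $addToSet: "$val2" } } }
])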
Please let me know if I can clarify my question in any way!
My MongoDB collection has this document structure:
{
  _id: 1,
  my_dict: {
    my_key: [
      {id: x, other_fields: other_values},
      ...
    ]
  },
  ...
},
I need to update the array subdocuments very often, so an index on the id field seems like a good idea. Still, I have many documents (millions), but the arrays inside them are small (max ~20 elements). Would indexing still improve performance a lot, compared to the cost of maintaining the index?
PS: I'm not using the id as a key (i.e. a dict instead of an array), because I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count, as I am using MongoDB 3.2.
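For reference, the index I have in mind would use dot notation into the array subdocuments (a sketch; db.collection stands for the actual collection):
// Indexing a field inside array elements creates a multikey index,
// with one index entry per element's id
db.collection.ensureIndex({ "my_dict.my_key.id": 1 })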
Follow-up question: If it would make a very big difference, I could instead use a dict like so:
{id: {other_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and update it myself (with possible errors, as I would need to use $inc each time I add/delete an item) instead of relying on "real" values. I would also have to handle the possibility of a key called _my_size, which would conflict with my logic. It would then look like this:
{
  _id: 1,
  my_dict: {
    my_key: {
      id: {other_fields: other_values},
      _my_size: 1
    },
  },
},
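A rough sketch of that manual bookkeeping when adding an item (newId and newItem are placeholders):
// Build the dynamic field path for the new subdocument
var setter = {};
setter["my_dict.my_key." + newId] = newItem;
db.collection.update(
  { _id: 1 },
  {
    $set: setter,                           // store the new subdocument
    $inc: { "my_dict.my_key._my_size": 1 }  // maintain the size counter by hand
  }
);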
Still not sure which is best for performance. I will need to update the subdocuments (via the id field) a lot, as well as compute the $size a lot (maybe a tenth as often as the updates).
Which schema/strategy would give me better performance? Or, even more importantly, would it actually make a big difference? (There could be thousands of calls per second.)
Update example:
db.collection.update(
  // match the document and the array element whose id equals update_data_id
  { _id: 1, "my_dict.my_key.id": update_data_id },
  // the positional operator "$" targets the matched array element,
  // so only that subdocument is replaced
  { $set: { "my_dict.my_key.$": update_data } }
)
Getting the size example:
db.collection.aggregate([
  { $match: { _id: 1 } },
  { $project: { _id: 0, nb_of_sub_documents: { $size: "$my_dict.my_key" } } }
])
Let's say I have a million entries in the db, each with 10 fields ("columns"). It seems to me that the more columns I search by, the faster the query goes - for example:
db.items.find({
  $and: [
    { field1: x },
    { field2: y },
    { field3: z }
  ]
})
is faster than:
db.items.find({
  $and: [
    { field1: x },
    { field2: y }
  ]
})
While I would love to say "Great, this makes total sense to me" - it doesn't. I just know it's happening in my particular case, and I'm wondering if this is actually always true. If so, ideally, I would like to know why.
Furthermore, when creating multi-field indexes, does it help to have the fields in any particular order? For example, let's say I add a compound index:
db.collection.ensureIndex( { field1: 1, field2: 1, field3: 1 } )
Do these have any sort of order? If yes, does the order matter? Let's say 90% of items will match the field1 criteria, but only 1% of items will match the field3 criteria. Would ordering them make some sort of difference?
It may just be that the more restrictive query returns fewer documents, since 90% of items will match the field1 criteria and only 1% of items will match the field3 criteria. Check what explain() says for both queries.
Mongo has quite a good profiler. Give it a try: play with different indexes and different queries - not on the production db, of course.
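For example, something like this (a sketch; profiling level 2 records all operations, level 0 turns it off):
// Record every operation in db.system.profile
db.setProfilingLevel(2)
// ...run the queries under test, then inspect the recorded timings...
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
// Turn profiling back off when done
db.setProfilingLevel(0)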
The order of fields in the index matters. If you have an index { field1: 1, field2: 1, field3: 1 }
and a query db.items.find({ field2: x, field3: y }), the index won't be used at all,
and for the query db.items.find({ field1: x, field3: y }) it can only be used partially, for field1.
On the other hand, the order of conditions in the query does not matter:
db.items.find({ field1: x, field2: y }) is as good as
db.items.find({ field2: y, field1: x }) and will use the index in both cases.
When choosing an indexing strategy, you should examine both the data and your typical queries. It may be that index intersection works better for you, and that instead of a single compound index you get better total performance with simple indexes like { field1: 1 }, { field2: 1 }, { field3: 1 }, rather than multiple compound indexes for different kinds of queries.
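That alternative would look something like this (a sketch; explain() will show whether the planner actually chose an index intersection):
// Three single-field indexes instead of one compound index
db.items.ensureIndex({ field1: 1 })
db.items.ensureIndex({ field2: 1 })
db.items.ensureIndex({ field3: 1 })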
It is also important to check the index size, to make sure it fits in memory - in most cases, anyway.
It's complicated... MongoDB keeps recently accessed documents in RAM, and the query plan is calculated the first time a query is executed, so the second time you run a query may be much faster than the first time.
But, putting that aside, the order of a compound index does matter. In a compound index, you can use the index in the order it was created, a bit like opening a door, walking through and finding that you have more doors to open.
So, having two overlapping indexes set up, e.g.:
{ city: 1, building: 1, room: 1 }
AND
{ city: 1, building: 1 }
Would be a waste, because you can still search for all the rooms in a particular building using the first two levels (fields) of the "{ city: 1, building: 1, room: 1 }" index.
Your intuition does make sense. If you had to find a particular room in a building, going straight to the right city, straight to the right building and then knowing the approximate place in the building will make it faster to find the room than if you didn't know the approximate place (assuming that there are lots of rooms). Take a look at levels in a B-Tree, e.g. the search visualisation here: http://visualgo.net/bst.html
It's not universally the case, though; not all data is neatly distributed in sort order - for example, English names and words tend to clump together under common letters; there aren't many words that start with the letter X.
The (free, online) MongoDB University developer courses cover indexes quite well, but the best way to find out about the performance of a query is to look at the results of the explain() method against your query to see if an index was used, or whether a collection was scanned (COLLSCAN).
db.items.find({
  $and: [
    { field1: x },
    { field2: y }
  ]
}).explain()
I want to sort a collection of documents on the backend with Meteor in a publication. I want to make sure I index the collection properly, and I couldn't find any concrete ways to do this for the fields I want.
Basically I want to find all the documents in collection "S" that have field "a" equal to a specific string value. Then after I have all those documents, I want to sort them by three fields: b, c (descending), and d.
Would this be the right way to index it:
S._ensureIndex({ a: 1 });
S._ensureIndex({ b: 1, c: -1, d: 1 });
S.find({ a: "specificString"}, { sort: { b: 1, c: -1, d: 1 } });
As a bonus question, field "d" is the time added in milliseconds and it would be highly unlikely that there is a duplicate document with all 3 fields being the same. Should I also include the unique option as well in the index?
S._ensureIndex({ b: 1, c: -1, d: 1 }, { unique: true });
Will adding the unique option help any with sorting and indexing performance?
I was also wondering what the general performance of an indexed sort in Mongo is. Their docs say indexed sorts don't need to be done in memory, because it is just a read. Does this mean the sort is instant?
Thanks for all your help :)
I was able to find my own answer here:
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/#sort-and-non-prefix-subset-of-an-index
Basically the correct way to index for my example here is:
S._ensureIndex({ a: 1, b: 1, c: -1, d: 1 });
S.find({ a: "specificString"}, { sort: { b: 1, c: -1, d: 1 } });
-- As for my other subquestions:
Because the sort uses the index here, MongoDB avoids an in-memory sort and just reads the documents in index order, so it is about as fast as the read itself. However, do keep in mind that a 4-field index can be very expensive in other ways.
Without looking this up for sure, my hunch is that including "unique: true" would not make a difference in sort performance, since the index is already doing the work.
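One way to sanity-check this from the mongo shell (assuming the underlying collection is also named S) is to run explain() and look for the absence of an in-memory SORT stage in the plan:
db.S.find({ a: "specificString" }).sort({ b: 1, c: -1, d: 1 }).explain()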
db.numbers.find().sort({ a: 1, b: 1, c: 1 })
If I execute this command, MongoDB will sort the numbers collection by property 'a'; if 'a' is the same on two docs, it will sort them by property 'b', and if that is the same too, it will go on to 'c'. I hope I got that right - correct me if not.
But how does it pick the 'a' property first when the sort specifier is just a JS object? Does it iterate over the sort object's properties using for (var prop in ...), and whichever property comes first is also the first to be sorted by?
Internally, MongoDB doesn't use JSON, it uses BSON. While JSON is technically unordered, BSON (per the specification) is ordered. This is how MongoDB knows that in {a: 1, b: 1, c: 1} the keys are ordered "a, b, c": the underlying representation is ordered as well.
As @Sammaye posted above in the comments, the JavaScript dictionary must be created with key priority in mind.
Hence, if you do something like this:
db.numbers.find().sort({
  a: 1,
  b: 1,
  c: 1
});
your results will be sorted first by a, then by b, then by c.
If you do this, however:
db.numbers.find().sort({
  c: 1,
  a: 1,
  b: 1
});
your results will be sorted first by c, then by a, then by b.
By using those keys you mentioned:
db.numbers.find().sort({
  a: 1,
  b: 1,
  c: 1
});
MongoDB sorts by property a, then b, then c.
Basically, MongoDB scans from the beginning of the datafile (correct me if I'm wrong). Hence, if MongoDB finds two or more documents with the same values for your keys (a, b, and c), it returns whichever document appears first in the datafile first.
So I have a very large set of metrics (15GB and growing) that keeps some of the data in nested hashes, like so:
{
  _id: 'abc0000',
  type: 'foo',
  data: { a: 20, b: 30, c: 3 }
},
... more data following this schema ...
{
  _id: 'abc5000',
  type: 'bar',
  data: { a: 1, b: 2, c: 4, d: 10 }
}
What are the performance implications when I run a query on the nested hashes? The data inside the hash can't be indexed... or rather, it would be pointless to index it.
I could always reshape the data into a flat style: data_a, data_b, etc...
You can create indexes on attributes in nested hashes. Take a look at Indexing with dot notation for more details. You can also create compound indexes if you need to, but be careful about the caveats with parallel arrays: if you create a compound index, only one of the indexed fields can be an array. However, that shouldn't affect you (judging from the posted schema).
So you can create indexes on data.a, data.b or data.a, data.c as per your needs.
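For instance (a sketch, assuming the collection is named metrics):
db.metrics.ensureIndex({ "data.a": 1 })
// or a compound index over two of the nested fields
db.metrics.ensureIndex({ "data.a": 1, "data.c": 1 })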