Does indexing small arrays of subdocuments in MongoDB affect performance?

My Mongodb collection has this document structure:
{
    _id: 1,
    my_dict: {
        my_key: [
            {id: x, other_fields: other_values},
            ...
        ]
    },
    ...
},
I need to update the array subdocuments very often, so an Index on the id field seems like a good idea. Still, I have many documents (millions) but my arrays inside them are small (max ~20 elements). Would it still improve performance a lot to index it, compared to the cost of indexing?
PS: I'm not using the id as a key (i.e., a dict instead of an array) because I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count, as I am on MongoDB 3.2.
Followup question: If it would make a very big difference, I could instead use a dict like so:
{id: {other_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and update it myself (possible errors, since I would need to use $inc each time I add or delete an item) instead of relying on "real" values. I would also have to handle the possibility that a key could be called _my_size, which would conflict with my logic. It would then look like this:
{
    _id: 1,
    my_dict: {
        my_key: {
            id: {other_fields: other_values},
            _my_size: 1
        }
    }
},
Still not sure which is best for performance. I will need to update the subdocument (with id field) a lot, as well as computing the $size a lot (maybe 1/10 of the calls to update).
Which Schema/Strategy would give me a better performance? Or even more important, would it actually make a big difference? (possibly thousands of calls per second)
Update example:
update(
    {_id: 1, "my_dict.my_key.id": update_data_id},
    {$set: {"my_dict.my_key": update_data}}
)
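If the intent is to replace only the matched array element rather than the whole my_key array, the positional operator would be needed; a sketch under that assumption:
update(
    {_id: 1, "my_dict.my_key.id": update_data_id},
    {$set: {"my_dict.my_key.$": update_data}}
)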
Getting the size example:
aggregate(
    {$match: {_id: 1}},
    {$project: {_id: 0, nb_of_sub_documents: {$size: "$my_dict.my_key"}}}
)
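For comparison, a sketch of adding a new item under the dict-based alternative, where the size has to be maintained by hand (new_id and new_data are placeholders for the item's key and value):
update(
    {_id: 1, "my_dict.my_key.new_id": {$exists: false}},
    {$set: {"my_dict.my_key.new_id": new_data}, $inc: {"my_dict.my_key._my_size": 1}}
)
// Reading the size then becomes a plain read of my_dict.my_key._my_size, but the $inc
// must only run when the id is actually new (which the $exists filter tries to guarantee),
// and that is exactly the bookkeeping risk described above.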

Related

MongoDB fast count of subdocuments - maybe through index

I'm using MongoDB 4.0 on a MongoDB Atlas cluster (3 replicas, 1 shard).
Assume I have a collection that contains multiple documents.
Each of these documents holds an array of subdocuments that represent cities in a certain year, with additional information. An example document looks like this (I removed unnecessary information to simplify the example):
{
    _id: 123,
    cities: [
        {name: "vienna", year: 1985},
        {name: "berlin", year: 2001},
        {name: "vienna", year: 1985}
    ]
}
I have a compound index on cities.name and cities.year. What is the fastest way to count the occurrences of name and year combinations?
I already tried the following aggregation:
[{$unwind: {
    path: '$cities'
}}, {$group: {
    _id: {
        name: '$cities.name',
        year: '$cities.year'
    },
    count: {
        $sum: 1
    }
}}, {$project: {
    count: 1,
    name: '$_id.name',
    year: '$_id.year',
    _id: 0
}}]
Another approach I tried was a map-reduce in the following form; the map-reduce performed a bit better (~30% less time needed).
map function:
function m() {
    for (var i in this.cities) {
        emit({
            name: this.cities[i].name,
            year: this.cities[i].year
        }, 1);
    }
}
reduce function (I also tried replacing the sum with the array length, but surprisingly the sum is faster):
function r(id, counts) {
    return Array.sum(counts);
}
function call in mongoshell:
db.test.mapReduce(m,r,{out:"mr_test"})
Now I was asking myself: is it possible to access the index directly? As far as I know it is a B+ tree that holds the pointers to the relevant documents on disk, so from a technical point of view I think it would be possible to iterate through all leaves of the index tree and just count the pointers. Does anybody know if this is possible?
Does anybody know another way to solve this in a highly performant way? (It is not possible to change the design because of other dependencies in the software, and we are running this on a very big dataset.) Does anybody have experience solving such a task via shards?
The index will not be very helpful in this situation.
MongoDB indexes were designed for identifying documents that match given criteria.
If you create an index on {cities.name: 1, cities.year: 1}, this document:
{
    _id: 123,
    cities: [
        {name: "vienna", year: 1985},
        {name: "berlin", year: 2001},
        {name: "vienna", year: 1985}
    ]
}
Will have 2 entries in the B-tree that refer to this document:
vienna|1985
berlin|2001
Even if it were possible to count the incidence of a specific key in the index, that count would not necessarily correspond to the number of matching array elements, since duplicate values within a single document produce only one index entry.
MongoDB does not provide a method to examine the raw entries in an index, and it explicitly refuses to use an index on a field containing an array for counting.
The MongoDB count command and helper functions all count documents, not elements inside of them. As you noticed, you can unwind the array and count the items in an aggregation pipeline, but at that point you've already loaded all of the documents into memory, so it's too late to make use of an index.
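A quick way to confirm this is to look at the plan for the unwind/group count from the question (test is the collection name used there); the initial cursor stage comes back as a COLLSCAN rather than anything using the {cities.name: 1, cities.year: 1} index:
db.test.explain().aggregate([
    { $unwind: "$cities" },
    { $group: { _id: { name: "$cities.name", year: "$cities.year" }, count: { $sum: 1 } } }
])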

Is it possible to sort, group and limit efficiently in Mongo with a pipeline?

Given Users with an index on age:
{ name: 'Bob',
age: 21 }
{ name: 'Cathy',
age: 21 }
{ name: 'Joe',
age: 33 }
To get the output:
[
{ _id: 21,
names: ['Bob', 'Cathy'] },
{ _id: 33,
names: ['Joe'] }
]
Is it possible to sort, group and limit by age?
db.users.aggregate(
    [
        {
            $sort: {
                age: 1
            }
        },
        {
            $group: {
                _id: '$age',
                names: { $push: '$name' }
            }
        },
        {
            $limit: 10
        }
    ]
)
I did some research, but it's not clear if it is possible to sort first and then group. In my testing, the group loses the sort, but I don't see why.
If the group preserves the sort, then the sort and limit can greatly reduce the required processing. It only needs to do enough work to "fill" the limit of 10 groups.
So,
Does group preserve the sort order? Or do you have to group and then sort?
Is it possible to sort, group and limit doing only enough processing to return the limit? Or does it need to process the entire collection and then limit?
To answer your first question: $group does not preserve the order. There are open change requests which also shed some light on the background, but it doesn't look like the product will be changed to preserve the input documents' order:
https://jira.mongodb.org/browse/SERVER-24799
https://jira.mongodb.org/browse/SERVER-4507
https://jira.mongodb.org/browse/SERVER-21022
Two things can be said in general: you generally want to group first and then do the sorting, because sorting fewer elements (which the grouping generally produces) is going to be faster than sorting all input documents.
Secondly, MongoDB is going to make sure to sort as efficiently and as little as possible. The documentation states:
When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
So this code gets the job done in your case:
collection.aggregate({
    $group: {
        _id: '$age',
        names: { $push: '$name' }
    }
}, {
    $sort: {
        '_id': 1
    }
}, {
    $limit: 10
})
EDIT following your comments:
I agree with what you say. And taking your logic a little further, I would go as far as saying: if $group were smart enough to use an index, then it shouldn't even require a $sort stage at the start. Unfortunately, it isn't (not yet, probably). As things stand today, $group will never use an index, and it won't take shortcuts based on the following stages ($limit in this case). Also see this link where someone ran some basic tests.
The aggregation framework is still pretty young, so I guess there is a lot of work being done to make the aggregation pipeline smarter and faster.
There are answers here on StackOverflow (e.g. here) where people suggest using an upfront $sort stage in order to "force" MongoDB to use an index somehow. This, however, slowed down my tests (1 million records of your sample shape using different random distributions) significantly.
When it comes to the performance of an aggregation pipeline, $match stages at the start are what help the most. If you can limit the total number of records that need to go through the pipeline from the beginning, then that's your best bet - obviously... ;)
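For illustration, a sketch of such an upfront $match (the age filter is made up) that can be served by the { age: 1 } index before grouping:
db.users.aggregate([
    { $match: { age: { $gte: 21 } } },   // hypothetical filter; an IXSCAN on { age: 1 } can serve this stage
    { $group: { _id: '$age', names: { $push: '$name' } } },
    { $sort: { _id: 1 } },
    { $limit: 10 }
])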

Mongo Query - Number of Constraints vs Speed (and Indexing!)

Let's say I have a million entries in the db, with 10 fields ("columns") each. It seems to me that the more columns I search by, the faster the query goes - for example:
db.items.find( {
$and: [
{ field1: x },
{ field2: y },
{ field3: z}
]
} )
is faster than:
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
} )
While I would love to say "Great, this makes total sense to me" - it doesn't. I just know it's happening in my particular case, and wondering if this is actually always true. If so, ideally, I would like to know why.
Furthermore, when creating multi-field indexes, does it help to have them in any sort of order? For example, let's say I add a compound index:
db.collection.ensureIndex( { field1: 1, field2: 1, field3: 1 } )
Do these have any sort of order? If yes, would the order matter? Let's say 90% of items will match the field1 criteria, but only 1% of items will match the field3 criteria. Would ordering them make some sort of difference?
It may simply be that the more restrictive query returns fewer documents, since 90% of items match the field1 criteria and only 1% match the field3 criteria. Check what explain() says for both queries.
Mongo has quite a good profiler. Give it a try. Play with different indexes and different queries. Not on a production db, of course.
The order of fields in the index matters. If you have an index { field1: 1, field2: 1, field3: 1 } and a query db.items.find( { field2: x, field3: y } ), the index won't be used at all, and for the query db.items.find( { field1: x, field3: y } ) it can only be used partially, for field1.
On the other hand, the order of conditions in the query does not matter: db.items.find( { field1: x, field2: y } ) is as good as db.items.find( { field2: y, field1: x } ) and will use the index in both cases.
When choosing an indexing strategy you should examine both the data and the typical queries. It may be the case that index intersection works better for you, and instead of a single compound index you get better total performance with simple indexes like { field1: 1 }, { field2: 1 }, { field3: 1 }, rather than multiple compound indexes for different kinds of queries.
It is also important to check the index size, so that it fits in memory. In most cases, anyway.
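A sketch of that single-field-index approach, and of checking whether the planner actually intersects them (an index-intersection plan shows AND_SORTED or AND_HASH stages in the explain output):
db.items.createIndex({ field1: 1 })
db.items.createIndex({ field2: 1 })
db.items.createIndex({ field3: 1 })
// Compare the candidate plans; x, y, z stand for the values from the question.
db.items.find({ field1: x, field2: y, field3: z }).explain("allPlansExecution")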
It's complicated... MongoDB keeps recently accessed documents in RAM, and the query plan is calculated the first time a query is executed, so the second time you run a query may be much faster than the first time.
But, putting that aside, the order of a compound index does matter. In a compound index, you can use the index in the order it was created, a bit like opening a door, walking through and finding that you have more doors to open.
So, having two overlapping indexes setup, e.g.:
{ city: 1, building: 1, room: 1 }
AND
{ city: 1, building: 1 }
Would be a waste, because you can still search for all the rooms in a particular building using the first two levels (fields) of the "{ city: 1, building: 1, room: 1 }" index.
Your intuition does make sense. If you had to find a particular room in a building, going straight to the right city, straight to the right building and then knowing the approximate place in the building will make it faster to find the room than if you didn't know the approximate place (assuming that there are lots of rooms). Take a look at levels in a B-Tree, e.g. the search visualisation here: http://visualgo.net/bst.html
It's not universally the case, though; not all data is neatly distributed in a sort order. For example, English names or words tend to clump together under common letters, and there aren't many words that start with the letter X.
The (free, online) MongoDB University developer courses cover indexes quite well, but the best way to find out about the performance of a query is to look at the results of the explain() method against your query to see if an index was used, or whether a collection was scanned (COLLSCAN).
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
})
.explain()

PyMongo updating array records with calculated fields via cursor

Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
    "_id" : {
        "clienta" : NumberLong(460011766),
        "clientb" : NumberLong(2886729962)
    },
    "states" : [
        [
            "fixed", "fixed.rotated", "fixed.rotated.off"
        ]
    ],
    "VBPP" : [
        244,
        182,
        184,
        11,
        299
    ],
    "PPF" : 72.4
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (the length of states and the variance of VBPP) with PyMongo, before converting to arrays, is as follows:
records_list = []
cursor = db.clientAgg.find({}, {'_id' : 0,
                               'states' : 1,
                               'VBPP' : 1,
                               'PPF': 1})
for record in cursor:
    records_list.append(record)
for dicts in records_list:
    dicts['states'] = len(dicts['states'])
    dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing in 500k dictionaries in memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)

def iter():
    for item in cursor:
        yield item

l = []
for x in iter():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id': x['_id']}, {'$set': {'x.VBPS': somefunction as above}}, upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
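As a rough sketch of that batched idea, here is the equivalent with the mongo shell's bulk API (introduced in MongoDB 2.6; the PyMongo bulk interface is analogous). The VBPS field name is taken from the commented-out update in the question, and the variance is still computed client-side:
var bulk = db.clientAgg.initializeUnorderedBulkOp();
db.clientAgg.find({}, {VBPP: 1}).forEach(function(doc) {
    // population variance, computed client-side as in the question
    var mean = Array.avg(doc.VBPP);
    var variance = Array.avg(doc.VBPP.map(function(v) {
        return Math.pow(v - mean, 2);
    }));
    bulk.find({_id: doc._id}).updateOne({$set: {VBPS: variance}});
});
bulk.execute();   // sends the queued updates in batches instead of one round-trip each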
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
    var collection = db.test_collection;
    collection.find().forEach(function(doc) {
        var b = 2 * doc.a;
        collection.update({_id: doc._id}, {$set: {b: b}});
    });
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
    $project: {
        a: '$a',
        b: {$multiply: [2, '$a']}
    }
}, {
    $out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
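Applied to the collection from the question, and assuming a server new enough for the array $avg, $map and $pow expressions (they arrived in MongoDB 3.2, after this answer was written), the same $out approach might look roughly like this; the output collection name clientAggCalc is made up:
db.clientAgg.aggregate([
    {$project: {
        states: {$size: "$states"},                 // length of the states array
        VBPP: {$avg: {$map: {                       // population variance of VBPP
            input: "$VBPP",
            as: "v",
            in: {$pow: [{$subtract: ["$$v", {$avg: "$VBPP"}]}, 2]}
        }}},
        PPF: 1
    }},
    {$out: "clientAggCalc"}
])
Like the 2 * a example above, this writes the calculated fields to a new collection in a single server-side pass.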
My final thought on this, is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.

Count fields in a MongoDB Collection

I have a collection of documents like this one:
{
    "_id" : ObjectId("..."),
    "field1": "some string",
    "field2": "another string",
    "field3": 123
}
I'd like to be able to iterate over the entire collection, and find the entire number of fields there are. In this example document there are 3 (I don't want to include _id), but it ranges from 2 to 50 fields in a document. Ultimately, I'm just looking for the average number of fields per document.
Any ideas?
Iterate over the entire collection, and find the entire number of fields there are
Now you can utilise the aggregation operator $objectToArray (SERVER-23310) to turn keys into values and count them. This operator is available in MongoDB v3.4.4+.
For example:
db.collection.aggregate([
{"$project":{"numFields":{"$size":{"$objectToArray":"$$ROOT"}}}},
{"$group":{"_id":null, "fields":{"$sum":"$numFields"}, "docs":{"$sum":1}}},
{"$project":{"total":{"$subtract":["$fields", "$docs"]}, _id:0}}
])
The first stage, $project, turns all keys into an array in order to count the fields. The second stage, $group, sums the number of keys/fields in the collection as well as the number of documents processed. The third stage, $project, subtracts the total number of documents from the total number of fields (since you don't want to count _id).
You can easily add $avg to compute the average in the last stage.
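For example, a sketch of the per-document average (still excluding _id by subtracting one key per document):
db.collection.aggregate([
    {"$project": {"numFields": {"$subtract": [{"$size": {"$objectToArray": "$$ROOT"}}, 1]}}},
    {"$group": {"_id": null, "avgFields": {"$avg": "$numFields"}}}
])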
PRIMARY> var count = 0;
PRIMARY> db.my_table.find().forEach( function(d) { for(f in d) { count++; } });
PRIMARY> count
1074942
This is the simplest way I could figure out how to do this. On really large datasets, it probably makes sense to go the map-reduce path. But while your set is small enough, this'll do.
This is O(n^2), but I'm not sure there is a better way.
You could create a map-reduce job: in the map step, iterate over the properties of each document as a JavaScript object and output the count, then reduce to get the total.
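A sketch of that map-reduce, assuming the my_table collection from the snippet above (the function names and the inline output are illustrative):
function mapFields() {
    var n = 0;
    for (var k in this) {
        if (k !== "_id") n++;   // skip _id, as in the question
    }
    emit("totalFields", n);
}
function reduceFields(key, values) {
    return Array.sum(values);
}
db.my_table.mapReduce(mapFields, reduceFields, {out: {inline: 1}})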
For a simple approach, just find() all documents and, for each returned document, get the number of its fields:
db.getCollection(<name>).find(<condition>)
Then, for each result, count the keys, e.g. with Object.keys(doc).length.