MongoDB lists - get every Nth item

I have a MongoDB schema that looks roughly like:
[
    {
        "name" : "name1",
        "instances" : [
            {
                "value" : 1,
                "date" : ISODate("2015-03-04T00:00:00.000Z")
            },
            {
                "value" : 2,
                "date" : ISODate("2015-04-01T00:00:00.000Z")
            },
            {
                "value" : 2.5,
                "date" : ISODate("2015-03-05T00:00:00.000Z")
            },
            ...
        ]
    },
    {
        "name" : "name2",
        "instances" : [
            ...
        ]
    }
]
where the number of instances for each element can be quite big.
I sometimes want to get only a sample of the data, that is, get every 3rd instance, or every 10th instance... you get the picture.
I can achieve this goal by getting all instances and filtering them in my server code, but I was wondering if there's a way to do it by using some aggregation query.
Any ideas?
Updated
Assuming the data structure was flat as @SylvainLeroux suggested below, that is:
[
{"name": "name1", "value": 1, "date": ISODate("2015-03-04T00:00:00.000Z")},
{"name": "name2", "value": 5, "date": ISODate("2015-04-04T00:00:00.000Z")},
{"name": "name1", "value": 2, "date": ISODate("2015-04-01T00:00:00.000Z")},
{"name": "name1", "value": 2.5, "date": ISODate("2015-03-05T00:00:00.000Z")},
...
]
will the task of getting every Nth item (of specific name) be easier?

Your question clearly asked to "get every nth instance", which is a pretty clear requirement, so let's address exactly that.
Query operations like .find() can really only return the document "as is" with the exception of general field "selection" in projection and operators such as the positional $ match operator or $elemMatch that allow a singular matched array element.
Of course there is $slice, but that just allows a "range selection" on the array, so again does not apply.
The "only" things that can modify a result on the server are .aggregate() and .mapReduce(). The former does not "play very well" with "slicing" arrays in any way, at least not by "n" elements. However, since the "function()" arguments of mapReduce are JavaScript-based logic, you have a little more room to play with.
For analytical processes, and for analytical purposes "only", just filter the array contents via mapReduce using .filter():
db.collection.mapReduce(
    function() {
        var id = this._id;
        delete this._id;
        // filter the content of "instances" to every 3rd item only
        this.instances = this.instances.filter(function(el, idx) {
            return ((idx + 1) % 3) == 0;
        });
        emit(id, this);
    },
    function() {},
    { "out": { "inline": 1 } } // or output to a collection as required
)
It's really just a "JavaScript runner" at this point, but if this is just for analysis/testing then there is nothing generally wrong with the concept. Of course the output is not "exactly" how your document is structured, but it's as near a facsimile as mapReduce can get.
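For reference, inline mapReduce output wraps each emitted document under a "value" key, so the result of the above would look roughly like this (a sketch based on the sample schema, not verbatim output):
{
    "results" : [
        {
            "_id" : ObjectId("..."),
            "value" : {
                "name" : "name1",
                "instances" : [
                    { "value" : 2.5, "date" : ISODate("2015-03-05T00:00:00.000Z") }
                ]
            }
        }
    ],
    "ok" : 1
}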
The other suggestion I see here requires creating a new collection with all the items "denormalized" and inserting the "index" from the array as part of the unique _id key. That may produce something you can query directly, but for the "every nth item" you would still have to do:
db.resultCollection.find({
    "_id.index": { "$in": [2,5,8,11,14] } // and so on ....
})
So you would have to work out and provide the index values of "every nth item" in order to get "every nth item", which doesn't really solve the problem that was asked.
If the output form seemed more desirable for your "testing" purposes, then a better subsequent query on those results would be to use the aggregation pipeline with $redact:
db.newCollection.aggregate([
    { "$redact": {
        "$cond": {
            "if": {
                "$eq": [
                    { "$mod": [ { "$add": [ "$_id.index", 1 ] }, 3 ] },
                    0
                ]
            },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
])
That at least uses a "logical condition" much the same as what was applied with .filter() before to just select the "nth index" items without listing all possible index values as a query argument.

No $unwind is needed here. You can use $push with $arrayElemAt to project the array value at the requested index inside a $group stage.
Something like
db.colname.aggregate([
    { "$group": {
        "_id": null,
        "valuesatNthindex": { "$push": { "$arrayElemAt": [ "$instances", N ] } }
    }},
    { "$project": { "valuesatNthindex": 1 } }
])
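For illustration, with N = 1 and the sample schema from the question, each document would contribute its second array element, so the result shape (assumed, not verbatim output) would be something like:
{
    "_id" : null,
    "valuesatNthindex" : [
        { "value" : 2, "date" : ISODate("2015-04-01T00:00:00.000Z") },
        ...
    ]
}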

You might like this approach using a $lookup aggregation, which is probably the most convenient way and avoids array index tricks entirely.
Create a Names collection with the following schema:
[
    { "_id": 1, "name": "name1" },
    { "_id": 2, "name": "name2" }
]
and then an Instances collection that references the parent id as "nameId":
[
    { "nameId": 1, "value" : 1, "date" : ISODate("2015-03-04T00:00:00.000Z") },
    { "nameId": 1, "value" : 2, "date" : ISODate("2015-04-01T00:00:00.000Z") },
    { "nameId": 1, "value" : 3, "date" : ISODate("2015-03-05T00:00:00.000Z") },
    { "nameId": 2, "value" : 7, "date" : ISODate("2015-03-04T00:00:00.000Z") },
    { "nameId": 2, "value" : 8, "date" : ISODate("2015-04-01T00:00:00.000Z") },
    { "nameId": 2, "value" : 4, "date" : ISODate("2015-03-05T00:00:00.000Z") }
]
Now, with the MongoDB 3.6 $lookup pipeline syntax, you can use $sample inside the $lookup pipeline to pull a random sample of N elements (note: this is a random sample of size N, not deterministically every Nth element).
db.Names.aggregate([
    { "$lookup": {
        "from": Instances.collection.name,
        "let": { "nameId": "$_id" },
        "pipeline": [
            { "$match": { "$expr": { "$eq": [ "$nameId", "$$nameId" ] } } },
            { "$sample": { "size": N } }
        ],
        "as": "instances"
    }}
])
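If you need deterministically every Nth element rather than a random sample, a sketch (assuming MongoDB 3.4+ for $range/$addFields, and sorting the joined array by "date") would be:
db.Names.aggregate([
    { "$lookup": {
        "from": "Instances",
        "let": { "nameId": "$_id" },
        "pipeline": [
            { "$match": { "$expr": { "$eq": [ "$nameId", "$$nameId" ] } } },
            { "$sort": { "date": 1 } }
        ],
        "as": "instances"
    }},
    // keep only every Nth element of the joined array
    { "$addFields": {
        "instances": {
            "$map": {
                "input": { "$range": [ 0, { "$size": "$instances" }, N ] },
                "as": "idx",
                "in": { "$arrayElemAt": [ "$instances", "$$idx" ] }
            }
        }
    }}
])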

Unfortunately, this is not possible with the aggregation framework, as it would require an option for $unwind to emit the array index/position, which aggregation currently can't do. There is an open JIRA ticket for this: SERVER-4588.
However, a workaround would be to use mapReduce, but this comes at a huge performance cost, since the actual calculations of getting the array index are performed using the embedded JavaScript engine (which is slow), and there is still a single global JavaScript lock, which only allows one JavaScript thread to run at a time.
With mapReduce, you could try something like this:
Mapping function:
var map = function() {
    for (var i = 0; i < this.instances.length; i++) {
        emit(
            { "_id": this._id, "index": i },
            { "index": i, "value": this.instances[i] }
        );
    }
};
Reduce function:
var reduce = function(){}
You can then run the following mapReduce function on your collection:
db.collection.mapReduce( map, reduce, { out : "resultCollection" } );
You can then query the result collection to get a list/array of every Nth item of the instances array by using the map() cursor method:
var thirdInstances = db.resultCollection.find({ "_id.index": N })
                                        .map(function(doc){ return doc.value.value })
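Note that the query above picks the single item at index N from each document. To select every Nth item instead, you could filter the emitted index with the $mod query operator (a sketch, using 0-based indexes):
// matches indexes 0, N, 2N, ...; adjust the remainder to shift the offset
var everyNth = db.resultCollection.find({ "_id.index": { "$mod": [ N, 0 ] } })
                                  .map(function(doc){ return doc.value.value })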

You can use the aggregation below:
db.col.aggregate([
    {
        $project: {
            instances: {
                $map: {
                    input: { $range: [ 0, { $size: "$instances" }, N ] },
                    as: "index",
                    in: { $arrayElemAt: [ "$instances", "$$index" ] }
                }
            }
        }
    }
])
$range generates a list of indexes. The third parameter represents a non-zero step. For N = 2 it will be [0,2,4,6...], for N = 3 it will return [0,3,6,9...], and so on. Then you can use $map to get the corresponding items from the instances array.
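For example, with N = 2 against the sample document from the question, the generated indexes [0, 2] would keep the 1st and 3rd elements, so the result (an assumed shape, not verbatim output) would be:
{
    "_id" : ObjectId("..."),
    "instances" : [
        { "value" : 1, "date" : ISODate("2015-03-04T00:00:00.000Z") },
        { "value" : 2.5, "date" : ISODate("2015-03-05T00:00:00.000Z") }
    ]
}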

Or with just a find block:
// assuming the Node.js driver, where the cursor is resolved with .toArray()
db.Collection.find({}).toArray().then(function(data) {
    var ret = [];
    for (var i = 0, len = data.length; i < len; i++) {
        if (i % 3 === 0) {
            ret.push(data[i]);
        }
    }
    return ret;
});
This returns a promise whose then() you can invoke to fetch the data filtered to every Nth (here every 3rd) document.

Related

Most efficient way to change a string field value to its substring

I have a collection filled with documents that look like this:
{
    data: 11,
    version: "0.0.32"
}
and some have a test suffix to version:
{
    data: 55,
    version: "0.0.42-test"
}
The version field has different values but it always conforms to the pattern: 0.0.XXX. I would like to update all the documents to look like this:
{
    data: 11,
    version: 32
}
and the suffixed version (for test documents - version should be negative):
{
    data: 55,
    version: -42
}
The collection with these documents is used by our critical system, which needs to be turned off while updating the data - so I want the update/change to be as fast as possible. There are about 66,000,000 documents in this collection, and it's about 100GB in size.
Which type of mongodb operation would be the most efficient one?
The most efficient way to do this is with the upcoming release of MongoDB (as of this writing), using the $split operator to split our string, then assigning the last element of the resulting array to a variable using the $let variable operator and the $arrayElemAt operator.
Next, we use the $switch operator to perform a case statement against that variable.
The condition here is $gt, which returns true if the value contains "test"; in that case, in the in expression, we split that string on "-" and return the first element of the newly computed array $concatenated with a leading "-". If the condition evaluates to false, we just return the variable.
Of course in our case statement, we use $indexOfCP, which returns -1 if there are no occurrences of "test".
let cursor = db.collection.aggregate([
    { "$project": {
        "data": 1,
        "version": {
            "$let": {
                "vars": {
                    "v": {
                        "$arrayElemAt": [
                            { "$split": [ "$version", "." ] },
                            -1
                        ]
                    }
                },
                "in": {
                    "$switch": {
                        "branches": [
                            {
                                "case": {
                                    "$gt": [
                                        { "$indexOfCP": [ "$$v", "test" ] },
                                        -1
                                    ]
                                },
                                "then": {
                                    "$concat": [
                                        "-",
                                        { "$arrayElemAt": [
                                            { "$split": [ "$$v", "-" ] },
                                            0
                                        ]}
                                    ]
                                }
                            }
                        ],
                        "default": "$$v"
                    }
                }
            }
        }
    }}
])
The aggregation query produces something like this:
{ "_id" : ObjectId("57a98773cbbd42a2156260d8"), "data" : 11, "version" : "32" }
{ "_id" : ObjectId("57a98773cbbd42a2156260d9"), "data" : 55, "version" : "-42" }
As you can see, the "version" field data are strings. If the data type for that field does not matter, then you can simply add the $out aggregation pipeline stage to write the result into a new collection or replace your collection:
{ "$out": "collection" }
If you need to convert your data to a floating point number then, simply because MongoDB does not provide a way to do type conversion out of the box (except for integer to string), the only way is to iterate the aggregation Cursor object, convert each value using parseFloat or Number, then update your documents using the $set operator and the bulkWrite() method for maximum efficiency.
let requests = [];
cursor.forEach(doc => {
    requests.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": {
                "$set": {
                    "data": doc.data,
                    "version": parseFloat(doc.version)
                }
            }
        }
    });
    if (requests.length === 1000) {
        // Execute per 1000 ops and re-init
        db.collection.bulkWrite(requests);
        requests = [];
    }
});
// Clean up queues
if (requests.length > 0) {
    db.collection.bulkWrite(requests);
}
While the aggregation query will work perfectly in MongoDB 3.4 or newer, our best bet from MongoDB 3.2 backwards is mapReduce with the bulkWrite() method.
var results = db.collection.mapReduce(
    function() {
        var v = this.version.split(".")[2];
        emit(this._id, v.indexOf("-") > -1 ? "-" + v.replace(/\D+/g, '') : v)
    },
    function(key, value) {},
    { "out": { "inline": 1 } }
)["results"];
results looks like this:
[
    {
        "_id" : ObjectId("57a98773cbbd42a2156260d8"),
        "value" : "32"
    },
    {
        "_id" : ObjectId("57a98773cbbd42a2156260d9"),
        "value" : "-42"
    }
]
From here you use the previous .forEach loop to update your documents.
From MongoDB 2.6 to 3.0 you will need to use the now deprecated Bulk() API and its associated methods, as shown in my answer here.
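For reference, a minimal sketch of that deprecated Bulk() API applied to the same transformation (assuming the shell and the same collection; not verbatim from the linked answer):
var bulk = db.collection.initializeUnorderedBulkOp();
var count = 0;

db.collection.find().forEach(function(doc) {
    var v = doc.version.split(".")[2];
    // negative number for "-test" suffixed versions, plain number otherwise
    var newVersion = v.indexOf("-") > -1
        ? -parseFloat(v.replace(/\D+/g, ""))
        : parseFloat(v);

    bulk.find({ "_id": doc._id }).updateOne({ "$set": { "version": newVersion } });
    count++;

    // Execute per 1000 ops and re-init
    if (count % 1000 === 0) {
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});

// Flush any remaining operations
if (count % 1000 !== 0) bulk.execute();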

MongoDB : limit query to a field and array projection

I have a collection that contains the following information:
{
    "_id" : 1,
    "info" : { "createdby" : "xyz" },
    "states" : [ 11, 10, 9, 3, 2, 1 ]
}
I project only states by using the query
db.jobs.find({},{states:1})
Then I get only states (and the whole array of state values)! Or I can select only one state in that array by
db.jobs.find({},{states : {$slice : 1} })
And then I get only one state value, but along with all the other fields in the document as well.
Is there a way to select only the "states" field and at the same time slice only one element of the array? Of course, I can exclude fields, but I would like a solution in which I can specify both conditions.
You can do this in two ways:
1. Using mongo projection, where
<field>: <1 or true> specifies the inclusion of a field
and
<field>: <0 or false> specifies the suppression of a field,
so your query becomes:
db.jobs.find({}, { states: { $slice: 1 }, "info": 0, "_id": 0 })
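With that projection, the result against the sample document should contain only the sliced array (a sketch):
{ "states" : [ 11 ] }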
2. The other way, using mongo aggregation:
db.jobs.aggregate([
    { "$unwind": "$states" },
    { "$match": { "states": 11 } }, // match states (optional)
    { "$group": {
        "_id": "$_id",
        "states": { "$first": "$states" }
    }},
    { "$project": {
        "_id": 0,
        "states": 1
    }}
])

Compare a date of two elements

My problem is difficult to explain:
On my website I save every action of my visitors (view, click, buy, etc.).
I have a simple collection named "flow" where my data is stored:
{
    "_id" : ObjectId("534d4a9a37e4fbfc0bf20483"),
    "profile" : ObjectId("534bebc32939ffd316a34641"),
    "activities" : [
        {
            "id" : ObjectId("534bebc42939ffd316a3af62"),
            "date" : ISODate("2013-12-13T22:39:45.808Z"),
            "verb" : "like",
            "product" : "5"
        },
        {
            "id" : ObjectId("534bebc52939ffd316a3f480"),
            "date" : ISODate("2013-12-20T19:19:10.098Z"),
            "verb" : "view",
            "product" : "6"
        },
        {
            "id" : ObjectId("534bebc32939ffd316a3690f"),
            "date" : ISODate("2014-01-01T07:11:44.902Z"),
            "verb" : "buy",
            "product" : "5"
        },
        {
            "id" : ObjectId("534bebc42939ffd316a3741b"),
            "date" : ISODate("2014-01-11T08:49:02.684Z"),
            "verb" : "favorite",
            "product" : "26"
        }
    ]
}
I would like to aggregate these data to retrieve the number of people who made one action (for example "view") and then another one later in time (for example "buy"). To do that I need to compare the "date" values inside my "activities" array...
I tried to use the aggregation framework to do that, but I do not see how to make this request.
This is my starting point:
db.flows.aggregate([
    { $project: { profile: 1, activities: 1, _id: 0 } },
    { $match: { $and: [ {'activities.verb': 'view'}, {'activities.verb': 'buy'} ] } }, // First verb + second verb
    { $unwind: '$activities' },
    { $match: { 'activities.verb': { $in: ['view', 'buy'] } } }, // First verb + second verb
    { $group: {
        _id: '$profile',
        view: { $push: { $cond: [ { $eq: [ "$activities.verb", "view" ] }, "$activities.date", null ] } },
        buy:  { $push: { $cond: [ { $eq: [ "$activities.verb", "buy" ] },  "$activities.date", null ] } }
    }}
])
Maybe the format of my collection "flow" is not the best for what I want to do... If you have any better idea, don't hesitate.
Thank you for your help!
Here is the aggregation that will give you the total number of buyers who viewed first and then bought (though not necessarily the same product that they viewed).
db.flow.aggregate([
    { $match: { "activities.verb": { $all: ["view", "buy"] } } },
    { $unwind: "$activities" },
    { $match: { "activities.verb": { $in: ["view", "buy"] } } },
    { $group: {
        _id: "$_id",
        firstViewed: { $min: { $cond: {
            if: { $eq: ["$activities.verb", "view"] },
            then: "$activities.date",
            else: new Date(9999, 0, 1)
        }}},
        lastBought: { $max: { $cond: {
            if: { $eq: ["$activities.verb", "buy"] },
            then: "$activities.date",
            else: new Date(1900, 0, 1)
        }}}
    }},
    { $project: { viewedThenBought: { $cond: {
        if: { $gt: ["$lastBought", "$firstViewed"] },
        then: 1,
        else: 0
    }}}},
    { $group: { _id: null, totalViewedThenBought: { $sum: "$viewedThenBought" } } }
])
Here you first pass through the pipeline only the documents that have all the "verbs" you are interested in. When you group the first time, you want to use the earliest "view" and the last "buy" and the next project compares them to see if they viewed before they bought.
The last step gives you the count of all the people who satisfied your criteria.
Be careful to leave out all $project phases that don't actually compute any new fields (like your very first $project). The aggregation framework is smart enough to never pass through any fields that it sees are not used in any later stages, so there is never a need to $project just to "eliminate" fields, as that happens automatically.
For your query:
I would like to aggregate these data to retrieve the number of people who made an action
Try this:
db.flows.aggregate([
    // De-normalize the array into individual documents
    { "$unwind": "$activities" },
    // Match the verbs you are interested in
    { "$match": { "activities.verb": { $in: ["buy", "view"] } } },
    // Group by verb to get the count
    { "$group": { _id: "$activities.verb", count: { $sum: 1 } } }
])
The above query would produce an output like:
{
    "result" : [
        {
            "_id" : "buy",
            "count" : 1
        },
        {
            "_id" : "view",
            "count" : 1
        }
    ],
    "ok" : 1
}
Note: The $and operator in your query ({ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}) is not required, as that's the default when you specify multiple conditions. The $or operator is only required if you need a logical OR.
If you want to use the date in the aggregation query to do queries like how many "views by day", etc.. the Date Aggregation Operators will come in handy.
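For example, a "views per day" count might look like this (a sketch using the date aggregation operators against the "flow" schema above):
db.flows.aggregate([
    { "$unwind": "$activities" },
    { "$match": { "activities.verb": "view" } },
    // Group by calendar day of the activity date
    { "$group": {
        "_id": {
            "year": { "$year": "$activities.date" },
            "month": { "$month": "$activities.date" },
            "day": { "$dayOfMonth": "$activities.date" }
        },
        "views": { "$sum": 1 }
    }}
])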
I see where you are going with this, and I think you are basically on the right track. So here it is more or less unaltered (except for formatting preferences) with a few tweaks at the end:
db.flows.aggregate([
    // Try to $match "first" always to make sure you can get an index
    { "$match": {
        "$and": [
            { "activities.verb": "view" },
            { "activities.verb": "buy" }
        ]
    }},
    // Don't worry, the optimizer "sees" this and will sort of "blend" it
    // with the first stage.
    { "$project": {
        "profile": 1,
        "activities": 1,
        "_id": 0
    }},
    { "$unwind": "$activities" },
    { "$match": {
        "activities.verb": { "$in": ["view", "buy"] }
    }},
    { "$group": {
        "_id": "$profile",
        "view": { "$min": { "$cond": [
            { "$eq": [ "$activities.verb", "view" ] },
            "$activities.date",
            null
        ]}},
        "buy": { "$max": { "$cond": [
            { "$eq": [ "$activities.verb", "buy" ] },
            "$activities.date",
            null
        ]}}
    }},
    { "$project": {
        "viewFirst": { "$lt": [ "$view", "$buy" ] }
    }}
])
So essentially the $min and $max operators should be self-explanatory in this context: you are looking for the "first" view to correspond with the "last" purchase. If it were up to me (and it would make sense), you would actually match these by product (hint: "grouping"), but I'll leave that part up to you.
The other advantage here is that the null values will always lose out if there is an actual date to match the "verb"; otherwise the comparison comes through as false, and that turns out to be okay.
That is because the next thing you do is $project to "compare" the values and ask the question "Did the 'view' happen before the 'buy'?", which is a logical evaluation using the "less than" $lt operator.
As for the schema itself: if you are storing a lot of these "events" then you are probably better off flattening things out into separate documents and finding some way to mark each with the same "session" identifier, if that is separate from "profile"; a sketch of such a flattened document follows.
Getting away from large arrays (which this schema seems to lead to) is likely going to help performance, and with care it makes little difference to the aggregation process.
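A minimal sketch of such a flattened event document (the "session" field here is a hypothetical identifier, not part of the original schema):
{
    "_id" : ObjectId("534d4a9a37e4fbfc0bf20484"),
    "profile" : ObjectId("534bebc32939ffd316a34641"),
    "session" : "7f3b2c", // hypothetical session identifier
    "verb" : "view",
    "product" : "6",
    "date" : ISODate("2013-12-20T19:19:10.098Z")
}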

Mongo. Narrowing down results of nested array

If I have a document like this:
{
    "name" : "Foo",
    "words" : [
        "lorem",
        "ipsum",
        "dolor",
        "sit",
        "amet",
        ...
    ]
}
Let's say this words array is pretty big. Now I need a query that would fetch that document:
db.docs.find({'name':'Foo'}) // that will get the whole document
but what I want, instead of fetching the entire words array (because it's too big), is to retrieve only the elements that meet some criteria. Let's say I want to see only words that start with "a" or have a length of at least 3 characters.
You know maybe something like this:
// this won't work!
db.docs.find({
    "$where": "(this.words.map(function(e){ if (e.length >= 3) { return e } }))"
})
You cannot filter array contents using find; you can only match that the array contains an element meeting the condition. So in order to filter the contents of the array you need to make use of aggregate:
db.docs.aggregate([
    // Still makes sense to match the documents that meet the condition
    { "$match": {
        "name": "Foo",
        "words": { "$regex": "^[A-Za-z0-9_]{4,}" }
    }},
    // Unwind the array to "de-normalize"
    { "$unwind": "$words" },
    // Actually "filter" the array elements
    { "$match": { "words": { "$regex": "^[A-Za-z0-9_]{4,}" } } },
    // Group back the document with the "filtered" array
    { "$group": {
        "_id": "$_id",
        "name": { "$first": "$name" },
        "words": { "$push": "$words" }
    }}
])
That makes use of a regular expression condition that will match at least 4 characters from the start of the string. The ^ anchor is quite important here, as it allows an index to be used, which is much more optimal than anything else you could do.
The result returned will look like this:
{
    "result" : [
        {
            "_id" : ObjectId("5341f0476cbcc02b995092ac"),
            "name" : "Foo",
            "words" : [
                "lorem",
                "ipsum",
                "dolor"
            ]
        }
    ],
    "ok" : 1
}
You can also throw a lot of arbitrary JavaScript at mapReduce and test the length of elements in the array, but that will take considerably longer to execute.
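A minimal sketch of that mapReduce approach, assuming the same "docs" collection and a 4-character minimum:
db.docs.mapReduce(
    function() {
        // keep only words of at least 4 characters
        emit(this._id, {
            "name": this.name,
            "words": this.words.filter(function(w) { return w.length >= 4; })
        });
    },
    function() {},
    { "query": { "name": "Foo" }, "out": { "inline": 1 } }
)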
The terms are quite simple: you just add the additional operator to the query document, like so:
db.docs.find({ "name": "Foo", "$where": "(this.words.length > 3)" })
You really should not be using the $where operator unless absolutely necessary, and even then you really should think about what you are doing. Heed the warnings that are given in that document.
As stated in the manual page for $size, probably the best way to deal with detecting array length for a given range (rather than exact) is to create a "counter" field in your document that is updated as elements are added/removed from the array. This makes a very simple and efficient query:
db.docs.find({ "name": "Foo", "counter": { "$gt": 3 } })
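Maintaining such a counter is a single atomic update alongside each array modification (a sketch, assuming the "counter" field name used above):
db.docs.update(
    { "name": "Foo" },
    {
        "$push": { "words": "consectetur" }, // add the element
        "$inc": { "counter": 1 } // keep the length counter in sync
    }
)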
Of course from MongoDB versions 2.6 and upwards you can also do this:
db.docs.aggregate([
    { "$project": {
        "name": 1,
        "words": 1,
        "count": { "$size": "$words" }
    }},
    { "$match": {
        "count": { "$gt": 3 }
    }}
])
Either of those forms is going to perform a lot better than using something that is going to remove the use of an index and then invoke the JavaScript interpreter over each resulting document. Or even just use the $size operator for an exact size of the array.

How to sort by 'value' of a specific key within a property stored as an array with k-v pairs in mongodb

I have a mongodb collection, let's call it rows, containing documents with the following general structure:
{
    "setid" : 154421,
    "date" : ISODate("2014-02-22T14:06:48.229Z"),
    "version" : 2,
    "data" : [
        {
            "k" : "name",
            "v" : "ryan"
        },
        {
            "k" : "points",
            "v" : "375"
        },
        {
            "k" : "email",
            "v" : "ryan@123.com"
        }
    ]
}
There is no guarantee what values of k and v might populate the "data" property for any particular document (e.g. other documents might have 5 k-v pairs with different key names). The only rule is that documents with the same setid have the same k-v pairs (i.e. the rows collection might hold 100 other documents with setid = 154421 that have the same set of 3 keys in the data property: "name", "points", "email", each with their own respective values).
How would one, with this setup, construct a query to retrieve all rows with a particular setid, sorted by points? I need, in effect, some way of saying 'sort by the field data.v where the value of k == points' or something like that...?
Something like this:
db.rows.find({setid:154421},{$sort:{'data.v',-1}, {$where: k:'points'}}})
I know this is the incorrect syntax, but I'm just taking a stab at it to illustrate my point.
Is it possible?
Assuming that what you want would be all the documents that have the "points" value as a "key" in the array, and then sort on the "value" for that "key", then this is a little out of scope for the .find() method.
The reason being that if you did something like this:
db.collection.find(
    { "setid": 154421, "data.k": "points" }
).sort({ "data.v": -1 })
The problem is that even though the matched elements do have the matching key of "points", there is no way of telling which data.v you are referring to for the sort. Also, a sort within .find() results cannot do something like this:
db.collection.find(
    { "setid": 154421, "data.k": "points" }
).sort({ "data.$.v": -1 })
That would be trying to use a positional operator within a sort, essentially telling which element to take the value of v from. But this is not supported and not likely to be, for the most likely explanation that the "index" value would likely be different in every document.
But this kind of selective sorting can be done with the use of .aggregate().
db.collection.aggregate([
    // Actually shouldn't need the setid
    { "$match": { "data": { "$elemMatch": { "k": "points" } } } },
    // Saving the original document before you filter
    { "$project": {
        "doc": {
            "_id": "$_id",
            "setid": "$setid",
            "date": "$date",
            "version": "$version",
            "data": "$data"
        },
        "data": "$data"
    }},
    // Unwind the array
    { "$unwind": "$data" },
    // Match the "points" entries, so filtering to only these
    { "$match": { "data.k": "points" } },
    // Sort on the value, presuming you want the highest
    { "$sort": { "data.v": -1 } },
    // Restore the document
    { "$project": {
        "setid": "$doc.setid",
        "date": "$doc.date",
        "version": "$doc.version",
        "data": "$doc.data"
    }}
])
Of course that presumes the data array only has one element with the key points. If there were more than one, you would need to $group before the sort, like this:
// Group to remove the duplicates and get the highest value
{ "$group": {
    "_id": "$doc",
    "value": { "$max": "$data.v" }
}},
// Sort on the value
{ "$sort": { "value": -1 } },
// Restore the document
{ "$project": {
    "_id": "$_id._id",
    "setid": "$_id.setid",
    "date": "$_id.date",
    "version": "$_id.version",
    "data": "$_id.data"
}}
So there is one usage of .aggregate() in order to do some complex sorting on documents and still return the original document result in full.
Do some more reading on aggregation operators and the general framework. It's a useful tool to learn that takes you beyond .find().