Mongodb distinct aggregation of 3 billion documents

I have a huge collection with 3 billion documents. Each document looks like the following:
"_id" : ObjectId("54c1a013715faf2cc0047c77"),
"service_type" : "JE",
"receiver_id" : NumberLong("865438083645"),
"time" : ISODate("2012-12-05T23:07:36Z"),
"duration" : 24,
"service_description" : "NQ",
"receiver_cell_id" : null,
"location_id" : "658_55525",
"caller_id" : NumberLong("475035504705")
I would like to get the list of distinct users (they should at least appear once as a caller 'caller_id'), their counts (how many times each user appeared in the collection as either caller or receiver) and the count of locations if they are callers (i.e., the count for each location_id per user).
I want to end up with the following:
"number_of_records" : 20,
"locations" : [{location_id: "658_55525", count: 5}, {location_id: "840_5425", count: 15}],
"user" : NumberLong("475035504705")
I tried the solution described here and here but they are not efficient enough (extremely slow). What would be an efficient way to achieve this?

Use aggregation for your result:
db.<collection>.aggregate([
{ $group : { _id : { user: "$caller_id", location: '$location_id'} , count : { $sum : 1} } },
{ $project : { _id : '$_id.user', location : '$_id.location', count : '$count' } },
{ $group : { _id : '$_id', 'locations' : { $push : { location_id : '$location', count : '$count' } }, number_of_records : {$sum : '$count'} } },
{ $project : { _id : 0, user : '$_id', locations : '$locations', number_of_records : '$number_of_records'} },
{ $out : 'outputCollection'},
])
The output will be:
{
"0" : {
"locations" : [
{
"location_id" : "840_5425",
"count" : 8
},
{
"location_id" : "658_55525",
"count" : 5
}
],
"number_of_records" : 13,
"user" : NumberLong(475035504705)
}
}
Update using allowDiskUse:
var pipe = [
{ $group : { _id : { user: "$caller_id", location: '$location_id'} , count : { $sum : 1} } },
{ $project : { _id : '$_id.user', location : '$_id.location', count : '$count' } },
{ $group : { _id : '$_id', 'locations' : { $push : { location_id : '$location', count : '$count' } }, number_of_records : {$sum : '$count'} } },
{ $project : { _id : 0, user : '$_id', locations : '$locations', number_of_records : '$number_of_records'} },
{ $out : 'outputCollection'},
];
db.runCommand(
{ aggregate: "collection",
pipeline: pipe,
allowDiskUse: true
}
)
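To make it easier to see what the two $group stages compute, here is a minimal plain JavaScript sketch of the same logic over an in-memory array. The helper name summarizeCallers and the sample records are made up for illustration. Like the pipeline above, it counts appearances as a caller only; appearances as a receiver are not included.

```javascript
// Hypothetical helper: models the two $group stages on an in-memory array.
function summarizeCallers(records) {
  // Stage 1: count records per (caller_id, location_id) pair.
  const perUser = new Map();
  for (const r of records) {
    const key = String(r.caller_id);
    if (!perUser.has(key)) perUser.set(key, new Map());
    const locs = perUser.get(key);
    locs.set(r.location_id, (locs.get(r.location_id) || 0) + 1);
  }
  // Stage 2: collect the per-location counts for each user and sum them.
  return [...perUser].map(([user, locs]) => {
    const locations = [...locs].map(([location_id, count]) => ({ location_id, count }));
    const number_of_records = locations.reduce((n, l) => n + l.count, 0);
    return { user, locations, number_of_records };
  });
}

// Made-up sample records (caller ids kept as strings for simplicity).
const sample = [
  { caller_id: "475035504705", location_id: "658_55525" },
  { caller_id: "475035504705", location_id: "840_5425" },
  { caller_id: "475035504705", location_id: "840_5425" }
];
const summary = summarizeCallers(sample);
console.log(JSON.stringify(summary, null, 2));
```

This is only a model for checking the expected shape on a handful of documents; on 3 billion documents the work has to stay inside the database.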

A map-reduce solution may be more suitable here than an aggregation pipeline, simply because it avoids two unwinds. If you can come up with an aggregation that needs only a single unwind, use that instead. The map-reduce solution below is one way to do it, but you need to measure its running time on your full data set to see whether it works for you.
The map function:
var map = function(){
emit(this.caller_id,
{locs:[{"location_id":this.location_id,"count":1}]});
}
The reduce function:
var reduce = function(key,values){
  var result = {locs:[]};
  var locations = {};
  values.forEach(function(value){
    value.locs.forEach(function(loc){
      // Add the carried count instead of incrementing by 1:
      // reduce can be called again with its own earlier output.
      if(!locations[loc.location_id]){
        locations[loc.location_id] = loc.count;
      }
      else{
        locations[loc.location_id] += loc.count;
      }
    })
  })
  Object.keys(locations).forEach(function(k){
    result.locs.push({"location_id":k,"count":locations[k]});
  })
  return result;
}
The finalize function:
var finalize = function(key,value){
var total = 0;
value.locs.forEach(function(loc){
total += loc.count;
})
return {"total":total,"locs":value.locs};
}
Invoking map-reduce:
db.collection.mapReduce(map,reduce,{"out":"t1","finalize":finalize});
Aggregate the result once the map-reduce has produced its output:
db.t1.aggregate([
{$project:{"_id":0,
"number_of_records":"$value.total",
"locations":"$value.locs","user":"$_id"}}
])
Sample output:
{
"number_of_records" : 3,
"locations" : [
{
"location_id" : "658_55525",
"count" : 1
},
{
"location_id" : "658_55525213",
"count" : 2
}
],
"user" : 2
}
{
"number_of_records" : 1,
"locations" : [
{
"location_id" : "658_55525",
"count" : 1
}
],
"user" : NumberLong("475035504705")
}
The map-reduce JavaScript code should be self-explanatory.
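One detail worth checking in any reduce function: MongoDB may invoke reduce more than once for the same key and pass an earlier reduce result back in as one of the values. The plain JavaScript sketch below (the name reduceLocs and the sample values are illustrative) shows a reduce that sums the carried counts, so it gives the same answer whether the values arrive all at once or partly pre-reduced.

```javascript
// Values have the shape emitted by the map function: { locs: [{ location_id, count }] }.
function reduceLocs(key, values) {
  const counts = {};
  values.forEach(function (value) {
    value.locs.forEach(function (loc) {
      // Add the carried count, not 1: a value may already be a partial result.
      counts[loc.location_id] = (counts[loc.location_id] || 0) + loc.count;
    });
  });
  return {
    locs: Object.keys(counts).map(function (k) {
      return { location_id: k, count: counts[k] };
    })
  };
}

const a = { locs: [{ location_id: "658_55525", count: 1 }] };
const b = { locs: [{ location_id: "658_55525", count: 1 }] };
const c = { locs: [{ location_id: "658_55525", count: 1 }] };

// Reduce everything in one call, then in two calls with a partial result fed back in.
const allAtOnce = reduceLocs(1, [a, b, c]);
const inTwoSteps = reduceLocs(1, [c, reduceLocs(1, [a, b])]);
console.log(allAtOnce.locs[0].count, inTwoSteps.locs[0].count); // 3 3
```

A reduce that increments by 1 for every repeated location would report 2 for the second call, because the partial result with count 2 would be treated as a single occurrence.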

Related

Mongodb embedded document - aggregation query

I have got the below documents in Mongo database:
db.totaldemands.insert({ "data" : "UKToChina", "demandPerCountry" :
{ "from" : "UK" , to: "China" ,
"demandPerItem" : [ { "item" : "apples" , "demand" : 200 },
{ "item" : "plums" , "demand" : 100 }
] } });
db.totaldemands.insert({ "data" : "UKToSingapore",
"demandPerCountry" : { "from" : "UK" , to: "Singapore" ,
"demandPerItem" : [ { "item" : "apples" , "demand" : 100 },
{ "item" : "plums" , "demand" : 50 }
] } });
I need to write a query to find the count of apples exported from UK to any country.
I have tried the following query:
db.totaldemands.aggregate(
{ $match : { "demandPerCountry.from" : "UK" ,
"demandPerCountry.demandPerItem.item" : "apples" } },
{ $unwind : "$demandPerCountry.demandPerItem" },
{ $group : { "_id" : "$demandPerCountry.demandPerItem.item",
"total" : { $sum : "$demandPerCountry.demandPerItem.demand"
} } }
);
But it gives me the output with both apples and plums like below:
{ "_id" : "apples", "total" : 300 }
{ "_id" : "plums", "total" : 150 }
But, my expected output is:
{ "_id" : "apples", "total" : 300 }
How can I modify the above query to return only the count of apples exported from the UK?
Also, is there a better way to get this output without unwinding?

You can add another $match stage to keep only apples.
Because the items live in an embedded array and you are aggregating over its elements, $unwind is required here. Map-reduce would be an alternative, but $unwind is the most suitable option.
As for performance, $unwind shouldn't cause an issue here.
db.totaldemands.aggregate(
{ $match : { "demandPerCountry.from" : "UK" ,
"demandPerCountry.demandPerItem.item" : "apples" } },
{ $unwind : "$demandPerCountry.demandPerItem" },
{ $group : { "_id" : "$demandPerCountry.demandPerItem.item",
"total" : { $sum : "$demandPerCountry.demandPerItem.demand"
} } },
{$match : {"_id" : "apples"}}
);
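As a variation (an assumption on my part, not something the answer above states), the second $match can also sit directly after $unwind, so that plums are dropped before grouping. The sketch below writes that pipeline as a plain array and then models the same steps in plain JavaScript on the two sample documents; the function name totalDemand is made up.

```javascript
// Pipeline variant: filter the unwound items before $group.
// In the mongo shell this would be passed as db.totaldemands.aggregate(pipeline).
const pipeline = [
  { $match: { "demandPerCountry.from": "UK", "demandPerCountry.demandPerItem.item": "apples" } },
  { $unwind: "$demandPerCountry.demandPerItem" },
  { $match: { "demandPerCountry.demandPerItem.item": "apples" } },
  { $group: { _id: "$demandPerCountry.demandPerItem.item",
              total: { $sum: "$demandPerCountry.demandPerItem.demand" } } }
];

// The two documents from the question.
const docs = [
  { data: "UKToChina", demandPerCountry: { from: "UK", to: "China",
      demandPerItem: [{ item: "apples", demand: 200 }, { item: "plums", demand: 100 }] } },
  { data: "UKToSingapore", demandPerCountry: { from: "UK", to: "Singapore",
      demandPerItem: [{ item: "apples", demand: 100 }, { item: "plums", demand: 50 }] } }
];

// Hypothetical in-memory model of the same steps.
function totalDemand(docs, from, item) {
  return docs
    .filter(d => d.demandPerCountry.from === from)     // first $match (country)
    .flatMap(d => d.demandPerCountry.demandPerItem)    // $unwind
    .filter(e => e.item === item)                      // second $match (item)
    .reduce((sum, e) => sum + e.demand, 0);            // $group with $sum
}

const appleTotal = totalDemand(docs, "UK", "apples");
console.log(appleTotal); // 300
```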

How can I split a MongoDB collection into 3 and assign a new field?

I have a json collection with 300 records like this:
{
salesNumber: 23839,
batch: null
},
{
salesNumber: 389230,
batch: null
}
...etc.
I need to divide this collection into 3 different batches. So, when sorted by salesNumber, the first 100 would be in batch 1, the next 100 would be batch 2, and the last 100 would be batch 3. How do I do this?
I wrote a script to select the first 100, but when I tried to turn it into an array to use in an update, the result was 0 records.
var firstBatchCompleteRecords = db.properties.find(
{
"auction": ObjectId("50")
}
).sort("saleNumber").limit(100);
// This returned 174 records as expected with all the fields
var firstBatch = firstBatchCompleteRecords.distinct( "saleNumber", {});
// This returned 0 records
I was going to take the results of that last query and use it in an update statement:
db.properties.update(
{
"saleNumber":
{
"$in": firstBatch
}
}
,
{
$set:
{
batch: "1"
}
}
,
{
multi: true
}
);
...then I would have created an array using distinct of the next 100 and update those, but I never got that far.

It is possible to get the result using the aggregation framework and store it in a new collection. You can then use this answer to iterate over that collection and update the fields in the source collection.
Have fun!
db.sn.aggregate([{
$sort : {
salesNumber : 1
}
}, {
$group : {
_id : null,
arrayOfData : {
$push : "$$ROOT"
},
}
}, {
$project : {
_id : 0,
firstHundred : {
$slice : ["$arrayOfData", 0, 100]
},
secondHundred : {
$slice : ["$arrayOfData", 100, 100]
},
thirdHundred : {
$slice : ["$arrayOfData", 200, 100]
},
}
}, {
$project : {
"firstHundred.batch" : {
$literal : 1
},
"firstHundred.salesNumber" : 1,
"firstHundred._id" : 1,
"secondHundred.batch" : {
$literal : 2
},
"secondHundred.salesNumber" : 1,
"secondHundred._id" : 1,
"thirdHundred.batch" : {
$literal : 3
},
"thirdHundred.salesNumber" : 1,
"thirdHundred._id" : 1,
}
}, {
$project : {
allValues : {
$setUnion : ["$firstHundred", "$secondHundred", "$thirdHundred"]
}
}
}, {
$unwind : "$allValues"
}, {
$project : {
_id : "$allValues._id",
salesNumber : "$allValues.salesNumber",
batch : "$allValues.batch",
}
}, {
$out : "collectionName"
}
])
db.collectionName.find()
and the output generated for 6 documents divided into batches of 2:
{
"_id" : ObjectId("5733ade7eeeccba2bd546121"),
"salesNumber" : 389230,
"batch" : 2
}, {
"_id" : ObjectId("5733ade7eeeccba2bd546120"),
"salesNumber" : 23839,
"batch" : 1
}, {
"_id" : ObjectId("5733ade7eeeccba2bd546122"),
"salesNumber" : 43839,
"batch" : 1
}, {
"_id" : ObjectId("5733ade7eeeccba2bd546124"),
"salesNumber" : 63839,
"batch" : 2
}, {
"_id" : ObjectId("5733ade7eeeccba2bd546123"),
"salesNumber" : 589230,
"batch" : 3
}, {
"_id" : ObjectId("5733ade7eeeccba2bd546125"),
"salesNumber" : 789230,
"batch" : 3
}
Any comments welcome!
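The batching rule itself is small: after sorting by salesNumber, the document at zero-based position i belongs to batch floor(i / batchSize) + 1. A minimal plain JavaScript sketch of that rule follows; the helper name assignBatches is made up, and the six sales numbers are the ones from the sample output above with a batch size of 2.

```javascript
// Hypothetical helper: returns sorted copies of the documents with `batch` set.
function assignBatches(docs, batchSize) {
  return docs
    .slice()                                         // leave the input array untouched
    .sort((a, b) => a.salesNumber - b.salesNumber)   // ascending by salesNumber
    .map((doc, i) => ({ ...doc, batch: Math.floor(i / batchSize) + 1 }));
}

const sales = [389230, 23839, 43839, 63839, 589230, 789230]
  .map(n => ({ salesNumber: n, batch: null }));

const batched = assignBatches(sales, 2);
console.log(batched.map(d => d.salesNumber + ":" + d.batch).join(" "));
```

With 300 records and a batch size of 100, the same rule gives batches 1, 2, and 3.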

MongoDB $sum and $avg of sub documents

I need to get the $sum and $avg of subdocuments. I would like to get the $sum and $avg of Channels[0] and of the other channels as well.
My data structure looks like this:
{
_id : ...,
Location : 1,
Channels : [
{ _id: ...,
Value: 25
},
{
_id: ... ,
Value: 39
},
{
_id: ..,
Value: 12
}
]
}

To get the sum and average of the Channels.Value elements for each document in your collection, you need to use MongoDB's aggregation framework. Since Channels is an array, you also need the $unwind operator to deconstruct it.
Assuming that your collection is called example, here's how you could get both the per-document sum and average of Channels.Value:
db.example.aggregate( [
{
"$unwind" : "$Channels"
},
{
"$group" : {
"_id" : "$_id",
"documentSum" : { "$sum" : "$Channels.Value" },
"documentAvg" : { "$avg" : "$Channels.Value" }
}
}
] )
The output from your post's data would be:
{
"_id" : SomeObjectIdValue,
"documentSum" : 76,
"documentAvg" : 25.333333333333332
}
If you have more than one document in your collection then you will see a result row for each document containing a Channels array.
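For a quick sanity check of the numbers, the same per-document sum and average can be computed in plain JavaScript. This is only an in-memory sketch; the helper name sumAndAvg is made up and the document holds just the fields needed from the question.

```javascript
// Hypothetical helper: per-document sum and average of Channels.Value.
function sumAndAvg(doc) {
  const values = doc.Channels.map(c => c.Value);
  const documentSum = values.reduce((s, v) => s + v, 0);
  // Return null for an empty array rather than dividing by zero.
  const documentAvg = values.length ? documentSum / values.length : null;
  return { documentSum, documentAvg };
}

const doc = { Location: 1, Channels: [{ Value: 25 }, { Value: 39 }, { Value: 12 }] };
const stats = sumAndAvg(doc);
console.log(stats.documentSum, stats.documentAvg);
```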

Solution 1: Using two groups, based on the example in a previous question:
db.records.aggregate(
[
{ $unwind: "$Channels" },
{ $group: {
_id: {
"loc" : "$Location",
"cId" : "$Channels.Id"
},
"value" : {$sum : "$Channels.Value" },
"average" : {$avg : "$Channels.Value"},
"maximum" : {$max : "$Channels.Value"},
"minimum" : {$min : "$Channels.Value"}
}},
{ $group: {
_id : "$_id.loc",
"ChannelsSummary" : { $push :
{ "channelId" : '$_id.cId',
"value" :'$value',
"average" : '$average',
"maximum" : '$maximum',
"minimum" : '$minimum'
}}
}
}
]
)
Solution 2:
There is a property I didn't show in my original question that might help: "Channels.Id", which is independent of "Channels._Id".
db.records.aggregate( [
{
"$unwind" : "$Channels"
},
{
"$group" : {
"_id" : "$Channels.Id",
"documentSum" : { "$sum" : "$Channels.Value" },
"documentAvg" : { "$avg" : "$Channels.Value" }
}
}
] )

mongodb aggregation find min value and other fields in nested array

Is it possible to find the max date in a nested array and show its price, along with a parent field such as the actual price?
I want the result to look like this:
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"actualPrice":19500,
"lastModifDate" :ISODate("2015-05-04T22:53:50.583Z"),
"price":"16000"
}
The data :
db.adds.findOne()
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"addTitle" : "Clio pack luxe",
"actualPrice" : 19500,
"fistModificationDate" : ISODate("2015-05-03T22:00:00Z"),
"addID" : "1746540",
"history" : [
{
"price" : 18000,
"modifDate" : ISODate("2015-05-04T22:01:47.272Z"),
"_id" : ObjectId("5547ec4bfeb20b0414e8e51b")
},
{
"price" : 16000,
"modifDate" : ISODate("2015-05-04T22:53:50.583Z"),
"_id" : ObjectId("5547f87e83a1dae00bc033fa")
},
{
"price" : 19000,
"modifDate" : ISODate("2015-04-04T22:53:50.583Z"),
"_id" : ObjectId("5547f87e83a1dae00bc033fe")
}
],
"__v" : 1
}
My query:
db.adds.aggregate(
[
{ $match:{addID:"1746540"}},
{ $unwind:"$history"},
{ $group:{
_id:0,
lastModifDate:{$max:"$history.modifDate"}
}
}
])
I don't know how to include the other fields. I tried $project but I get errors.
Thanks for helping.

You could try the following aggregation pipeline which does not need to make use of the $group operator stage as the $project operator takes care of the fields projection:
db.adds.aggregate([
{
"$match": {"addID": "1746540"}
},
{
"$unwind": "$history"
},
{
"$project": {
"actualPrice": 1,
"lastModifDate": "$history.modifDate",
"price": "$history.price"
}
},
{
"$sort": { "lastModifDate": -1 }
},
{
"$limit": 1
}
])
Output
/* 1 */
{
"result" : [
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"actualPrice" : 19500,
"lastModifDate" : ISODate("2015-05-04T22:53:50.583Z"),
"price" : 16000
}
],
"ok" : 1
}
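The unwind, sort descending, limit 1 sequence amounts to picking the history entry with the greatest modifDate and carrying the parent's actualPrice along. A plain JavaScript sketch of that selection on the sample document follows; the helper name latestHistory is made up, and the _id is kept as a plain string here.

```javascript
// Hypothetical helper: pick the history entry with the latest modifDate.
function latestHistory(ad) {
  // Date objects compare by their timestamp when used with ">".
  const latest = ad.history.reduce((best, h) => (h.modifDate > best.modifDate ? h : best));
  return {
    _id: ad._id,
    actualPrice: ad.actualPrice,
    lastModifDate: latest.modifDate,
    price: latest.price
  };
}

const ad = {
  _id: "5547e45c97d8b2c816c994c8",
  actualPrice: 19500,
  history: [
    { price: 18000, modifDate: new Date("2015-05-04T22:01:47.272Z") },
    { price: 16000, modifDate: new Date("2015-05-04T22:53:50.583Z") },
    { price: 19000, modifDate: new Date("2015-04-04T22:53:50.583Z") }
  ]
};

const latestResult = latestHistory(ad);
console.log(latestResult.price, latestResult.lastModifDate.toISOString());
```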

$avg in mongodb aggregation

Document looks like this:
{
"_id" : ObjectId("361de42f1938e89b179dda42"),
"user_id" : "u1",
"evaluator_id" : "e1",
"candidate_id" : ObjectId("54f65356294160421ead3ca1"),
"OVERALL_SCORE" : 150,
"SCORES" : [
{ "NAME" : "asd", "OBTAINED_SCORE" : 30}, { "NAME" : "acd", "OBTAINED_SCORE" : 36}
]
}
Aggregation function:
db.coll.aggregate([ {$unwind:"$SCORES"}, {$group : { _id : { user_id : "$user_id", evaluator_id : "$evaluator_id"}, AVG_SCORE : { $avg : "$SCORES.OBTAINED_SCORE" }}} ])
Suppose there are two documents with the same "user_id" (say u1) and different "evaluator_id" values (say e1 and e2).
For example:
1) The average works like this: (30 + 20) / 2 = 25. This is working for me.
2) But if the score for { "NAME" : "asd" } is 30 in the { evaluator_id : "e1" } document and 0 in the { evaluator_id : "e2" } document, I want AVG_SCORE to be 30, not (30 + 0) / 2 = 15.
Is this possible through aggregation?
Could anyone help me out?

It's possible by placing a $match stage between the $unwind and $group stages. The $match filters the unwound SCORES elements so that only those with a non-zero obtained score, "SCORES.OBTAINED_SCORE" : { $ne : 0 }, are included in the average:
db.coll.aggregate([
{
$unwind: "$SCORES"
},
{
$match : {
"SCORES.OBTAINED_SCORE" : { $ne : 0 }
}
},
{
$group : {
_id : {
user_id : "$user_id",
evaluator_id : "$evaluator_id"
},
AVG_SCORE : {
$avg : "$SCORES.OBTAINED_SCORE"
}
}
}
])
For example, the aggregation result for this document:
{
"_id" : ObjectId("5500aaeaa7ef65c7460fa3d9"),
"user_id" : "u1",
"evaluator_id" : "e1",
"candidate_id" : ObjectId("54f65356294160421ead3ca1"),
"OVERALL_SCORE" : 150,
"SCORES" : [
{
"NAME" : "asd",
"OBTAINED_SCORE" : 0
},
{
"NAME" : "acd",
"OBTAINED_SCORE" : 36
}
]
}
will yield:
{
"result" : [
{
"_id" : {
"user_id" : "u1",
"evaluator_id" : "e1"
},
"AVG_SCORE" : 36
}
],
"ok" : 1
}
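The effect of the extra $match can be checked with a few lines of plain JavaScript: drop the zero scores first, then average what is left. This is an in-memory sketch only, and the helper name averageNonZero is made up.

```javascript
// Hypothetical helper: average of OBTAINED_SCORE, ignoring scores equal to 0.
function averageNonZero(scores) {
  const nonZero = scores
    .map(s => s.OBTAINED_SCORE)
    .filter(v => v !== 0);                  // models { $ne : 0 }
  if (nonZero.length === 0) return null;    // nothing left to average
  return nonZero.reduce((sum, v) => sum + v, 0) / nonZero.length;
}

const scores = [
  { NAME: "asd", OBTAINED_SCORE: 0 },
  { NAME: "acd", OBTAINED_SCORE: 36 }
];

const avgScore = averageNonZero(scores);
console.log(avgScore); // 36
```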
