MongoDB: count number of distinct values per field/key

Is there a query for calculating how many distinct values a field contains in the database?
For example, I have a field for country and there are 8 distinct country values (Spain, England, France, etc.).
If someone adds more documents with a new country, I would like the query to return 9.
Is there an easier way than group and count?

MongoDB has a distinct command which returns an array of distinct values for a field; you can check the length of the array for a count.
There is a shell db.collection.distinct() helper as well:
> db.countries.distinct('country');
[ "Spain", "England", "France", "Australia" ]
> db.countries.distinct('country').length
4
As noted in the MongoDB documentation:
Results must not be larger than the maximum BSON size (16MB). If your results exceed the maximum BSON size, use the aggregation pipeline to retrieve distinct values using the $group operator, as described in Retrieve Distinct Values with the Aggregation Pipeline.
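To make the idea concrete, here is a minimal in-memory sketch in plain JavaScript (no MongoDB connection involved). The names docs, distinctValues and countDistinct are made up for illustration, and the sketch assumes every document has the field. It only mirrors what distinct(...).length computes; it is not how the server implements it.

```javascript
// In-memory stand-in for a collection (hypothetical sample data).
const docs = [
  { country: "Spain" },
  { country: "England" },
  { country: "France" },
  { country: "Spain" },
  { country: "Australia" },
];

// Keep each value once, similar in spirit to db.collection.distinct(field).
function distinctValues(documents, field) {
  return [...new Set(documents.map((d) => d[field]))];
}

// The distinct count is the number of values that survive de-duplication.
function countDistinct(documents, field) {
  return distinctValues(documents, field).length;
}

console.log(distinctValues(docs, "country")); // [ 'Spain', 'England', 'France', 'Australia' ]
console.log(countDistinct(docs, "country")); // 4
```

Adding a document with a new country raises the count by one, while adding another "Spain" document leaves it unchanged.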

Here is an example using the aggregation API. To complicate the case, we group by case-insensitive words from an array property of the document.
db.articles.aggregate([
{
$match: {
keywords: { $not: {$size: 0} }
}
},
{ $unwind: "$keywords" },
{
$group: {
_id: {$toLower: '$keywords'},
count: { $sum: 1 }
}
},
{
$match: {
count: { $gte: 2 }
}
},
{ $sort : { count : -1} },
{ $limit : 100 }
]);
which gives results such as:
{ "_id" : "inflammation", "count" : 765 }
{ "_id" : "obesity", "count" : 641 }
{ "_id" : "epidemiology", "count" : 617 }
{ "_id" : "cancer", "count" : 604 }
{ "_id" : "breast cancer", "count" : 596 }
{ "_id" : "apoptosis", "count" : 570 }
{ "_id" : "children", "count" : 487 }
{ "_id" : "depression", "count" : 474 }
{ "_id" : "hiv", "count" : 468 }
{ "_id" : "prognosis", "count" : 428 }
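If it helps to see what each stage contributes, the pipeline above can be mimicked on a plain array. This is only a rough in-memory sketch: articles and topKeywords are hypothetical names and the sample data is made up.

```javascript
// Hypothetical sample documents; the third one has an empty keywords array.
const articles = [
  { keywords: ["Cancer", "HIV"] },
  { keywords: ["cancer", "Obesity"] },
  { keywords: [] },
  { keywords: ["hiv", "CANCER"] },
];

function topKeywords(documents, minCount, limit) {
  const counts = new Map();
  for (const doc of documents) {
    // Like $unwind: visit every array element (empty arrays add nothing).
    for (const keyword of doc.keywords) {
      const key = keyword.toLowerCase(); // like $toLower in the $group _id
      counts.set(key, (counts.get(key) || 0) + 1); // like { $sum: 1 }
    }
  }
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .filter((row) => row.count >= minCount) // like the second $match
    .sort((a, b) => b.count - a.count) // like { $sort: { count: -1 } }
    .slice(0, limit); // like $limit
}

console.log(topKeywords(articles, 2, 100));
// [ { _id: 'cancer', count: 3 }, { _id: 'hiv', count: 2 } ]
```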

With MongoDB 3.4.4 and newer, you can use the $arrayToObject operator and a $replaceRoot stage to get the counts.
For example, suppose you have a collection of users with different roles and you would like to calculate the distinct counts of the roles. You would need to run the following aggregate pipeline:
db.users.aggregate([
{ "$group": {
"_id": { "$toLower": "$role" },
"count": { "$sum": 1 }
} },
{ "$group": {
"_id": null,
"counts": {
"$push": { "k": "$_id", "v": "$count" }
}
} },
{ "$replaceRoot": {
"newRoot": { "$arrayToObject": "$counts" }
} }
])
Example Output
{
"user" : 67,
"superuser" : 5,
"admin" : 4,
"moderator" : 12
}
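The k/v trick is easier to follow on a small array. Below is an in-memory sketch of the same three steps, assuming made-up sample data; users and countsByRole are illustrative names, and Object.fromEntries plays the part that $arrayToObject plays in the pipeline.

```javascript
// Hypothetical sample data with mixed-case roles.
const users = [
  { role: "Admin" },
  { role: "user" },
  { role: "User" },
  { role: "admin" },
  { role: "moderator" },
];

function countsByRole(documents) {
  // First $group: count documents per lower-cased role.
  const counts = new Map();
  for (const doc of documents) {
    const key = doc.role.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Second $group: collect { k, v } pairs into one array.
  const pairs = [...counts].map(([k, v]) => ({ k, v }));
  // $arrayToObject + $replaceRoot: turn the pairs into a single object.
  return Object.fromEntries(pairs.map(({ k, v }) => [k, v]));
}

console.log(countsByRole(users)); // { admin: 2, user: 2, moderator: 1 }
```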

I wanted a more concise answer and came up with the following, using the documentation on aggregates and group:
db.countries.aggregate([{"$group": {"_id": "$country", "count":{"$sum": 1}}}])

You can use Mongo Shell Extensions. It's a single .js import that you can append to your $HOME/.mongorc.js, or load programmatically if you're coding in Node.js/io.js too.
Sample
For each distinct value of a field, it counts the occurrences in documents, optionally filtered by a query:
> db.users.distinctAndCount('name', {name: /^a/i})
{
"Abagail": 1,
"Abbey": 3,
"Abbie": 1,
...
}
The field parameter could be an array of fields
> db.users.distinctAndCount(['name','job'], {name: /^a/i})
{
"Austin,Educator" : 1,
"Aurelia,Educator" : 1,
"Augustine,Carpenter" : 1,
...
}

To find the distinct values of field_1 in a collection subject to a WHERE-style condition, pass a query document as the second argument:
db.your_collection_name.distinct('field_1', { /* query document, as you would pass to find() */ })
So, finding the distinct names in a collection where age > 25 looks like this (append .length to get the number of distinct names):
db.your_collection_name.distinct('names', {'age': {"$gt": 25}})
Hope it helps!
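For what it's worth, here is the filter-then-distinct idea as a tiny in-memory sketch. The people array and the distinctWhere helper are hypothetical; the predicate stands in for the query document.

```javascript
// Hypothetical sample data.
const people = [
  { names: "Ana", age: 31 },
  { names: "Bob", age: 22 },
  { names: "Ana", age: 40 },
  { names: "Cleo", age: 27 },
];

// Apply the condition first, then keep each remaining value once.
function distinctWhere(documents, field, predicate) {
  return [...new Set(documents.filter(predicate).map((d) => d[field]))];
}

const names = distinctWhere(people, "names", (d) => d.age > 25);
console.log(names); // [ 'Ana', 'Cleo' ]
console.log(names.length); // 2
```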

I use this query:
var collection = "countries"; var field = "country";
db[collection].distinct(field).forEach(function(value){print(field + ", " + value + ": " + db[collection].count({[field]: value}))})
Output:
countries, England: 3536
countries, France: 238
countries, Australia: 1044
countries, Spain: 16
This query first retrieves all the distinct values and then counts the number of occurrences of each one.

If you're on MongoDB 3.4+, you can use $count in an aggregation pipeline:
db.users.aggregate([
{ $group: { _id: '$country' } },
{ $count: 'countOfUniqueCountries' }
]);

Related

MongoDB two groups Aggregate

Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single purpose aggregation methods.
I would like to transform that :
{
"_id" : ObjectId("5836b919885383034437d4a7"),
"Identificador" : "G-3474",
"Miembros" : [
{
"_id" : ObjectId("5836b916885383034437d238"),
"Nombre" : "Pilar",
"Email" : "pcarrillocasa#gmail.es",
"Edad" : 24,
"País" : "España",
"Tipo" : "Usuario individual",
"Apellidos" : "Carrillo Casa",
"Teléfono" : 637567234,
"Ciudad" : "Santander",
"Identificador" : "U-3486",
"Información_creación" : {
"Fecha_creación" : {
"Mes" : 4,
"Día" : 22,
"Año" : 2016
},
"Hora_creación" : {
"Hora" : 15,
"Minutos" : 34,
"Segundos" : 20
}
}
}
]
}
into that
{
"Nombre_Grupo" : "Amigo invisible",
"Ciudades" : [
{
"Ciudad" : "Madrid",
"Miembros": 30
},
{
"Ciudad" : "Almería",
"Miembros": 10
},
{
"Ciudad" : "Badajoz",
"Miembros": 20
}
]
}
with MongoDB.
I tried with that:
db.Grupos_usuarios.aggregate([
{ $group: { _id: "$Nombre_Grupo",total: { $sum: "$amount" } },
$group: { _id: "$Ciudad",total: { $sum: "$amount" } } }
])
but I could not get what I needed.
Could somebody help me understand what I am doing wrong?
The following aggregation gets the output you are looking for.
The $unwind stage deconstructs an array field from the input documents to output one document per element. These documents are then grouped by Miembros.Ciudad to get the total number of Miembros for each Ciudad. In the second $group stage we pivot the data, collecting all the Ciudades from the previous grouping into an array. The last $project stage formats the output.
db.test.aggregate( [
{
$unwind: "$Miembros"
},
{
$group: {
_id: "$Miembros.Ciudad",
total: { $sum: 1 }
}
},
{
$group: {
_id: "Amigo invisible",
Ciudades: { $push: { Ciudad: "$_id", Miembros: "$total"} }
}
},
{
$project: {
Nombre_Grupo: "$_id",
Ciudades: 1,
_id: 0
}
}
] )
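To see what the stages do to the data, here is a rough in-memory equivalent in plain JavaScript. The grupos array is made-up sample data (only the Ciudad field of each member is kept) and miembrosPorCiudad is a hypothetical helper name.

```javascript
// Hypothetical sample data: two groups with members in a few cities.
const grupos = [
  { Identificador: "G-1", Miembros: [{ Ciudad: "Madrid" }, { Ciudad: "Almería" }] },
  { Identificador: "G-2", Miembros: [{ Ciudad: "Madrid" }] },
];

function miembrosPorCiudad(documents, nombreGrupo) {
  const totals = new Map();
  for (const doc of documents) {
    // Like $unwind: one step per array element.
    for (const miembro of doc.Miembros) {
      // Like the first $group: count members per Ciudad.
      totals.set(miembro.Ciudad, (totals.get(miembro.Ciudad) || 0) + 1);
    }
  }
  // Like the second $group plus $project: push every city into one array.
  return {
    Nombre_Grupo: nombreGrupo,
    Ciudades: [...totals].map(([Ciudad, Miembros]) => ({ Ciudad, Miembros })),
  };
}

console.log(JSON.stringify(miembrosPorCiudad(grupos, "Amigo invisible")));
// {"Nombre_Grupo":"Amigo invisible","Ciudades":[{"Ciudad":"Madrid","Miembros":2},{"Ciudad":"Almería","Miembros":1}]}
```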

Mongodb splitting aggregation result

I'm currently trying to split an aggregation result into two different arrays using only MongoDB.
My main goal is to create two subsets of users with the same distribution regarding the number of interactions they have made. For this I'm currently running this request:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $sort : { count : -1 }},
{ $group : { _id :{$mod : [_rand() * 2, 2]}, ids : { $push: "$_id"}}}
])
My main issue is that the _rand() function is called only once during the aggregation, so all my results end up in a single array.
Also, a random distribution is not so good. Is there a way to use the index of each result?
Edit 1 :
After @dnickless's answer I still have an issue with the distribution in the groupBy part. Ideally I would like to do something like this:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $sort : { count : -1 }},
{ $bucket: {
groupBy: { $mod: [ { $indexOfArray : ??? }, 2 ] },
boundaries: [ 0, 1 ],
default: 2,
output: {
"users": { $push: "$_id"}
}
}
}
],
{ allowDiskUse: true })
That would split the even and odd indexes into two separate arrays. But I would like to apply $indexOfArray to the current aggregation result.
To give you more context, here is my Interaction object model:
{ "_id" : ObjectId("5af01..."), "name" : "WATCH", "date" : ISODate("2018-05-07T09:32:53.219Z") }
Without the bucket part I have this result :
{ "_id" : "5b1e7f...", "count" : 43.0 }
{ "_id" : "5b1e75...", "count" : 41.0 }
{ "_id" : "5b1e7a...", "count" : 40.0 }
...
I would like my answer to look like this :
{
{ "_id" : 0, "users" : [ "5b1e7f...", "5b1e7a...", ... ] }, // even index results
{ "_id" : 1, "users" : [ "5b1e75...", ... ] } // odd index results
}
My end goal is to split my users in 2 groups with evenly distributed numbers of interactions.
Edit 2 :
I finally found a solution to my problem:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $sort : { count : -1 }},
{ $group : { _id : "whatever" , user : { $push : { _id : "$_id" , count : "$count"}}}},
{ $unwind : { path : "$user" , "includeArrayIndex" : "rank"}},
{ $bucket: {
groupBy: { $mod: [ "$rank" , 2 ] },
boundaries: [ 0, 1 ],
default: 2,
output: {
"users": { $push: "$user._id"}
}
}
}
],
{ allowDiskUse: true })
Probably not the most optimized solution, but it does the job :)
If you have any advice to improve it, I'm still interested.
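For anyone who wants to check the logic outside the database, the rank-parity split can be sketched on a plain array. This is an in-memory illustration only: the interactions data is made up and splitByRank is a hypothetical name.

```javascript
// Hypothetical sample data: a has 4 interactions, b 3, c 2, d 1.
const interactions = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "d"].map(
  (userId) => ({ userId })
);

function splitByRank(documents) {
  // Like the first $group: interactions per user.
  const counts = new Map();
  for (const { userId } of documents) {
    counts.set(userId, (counts.get(userId) || 0) + 1);
  }
  // Like $sort: { count: -1 }.
  const ranked = [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((x, y) => y.count - x.count);
  // The array index plays the role of "rank" from includeArrayIndex;
  // rank % 2 picks the bucket, as in the $bucket stage.
  const buckets = [
    { _id: 0, users: [] },
    { _id: 1, users: [] },
  ];
  ranked.forEach((user, rank) => buckets[rank % 2].users.push(user._id));
  return buckets;
}

console.log(JSON.stringify(splitByRank(interactions)));
// [{"_id":0,"users":["a","c"]},{"_id":1,"users":["b","d"]}]
```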
I don't fully understand what exactly you are trying to achieve here without seeing some sample input and output. However, have you tried using $bucketAuto? Something like this:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $bucketAuto : {
groupBy : "$count",
buckets : 2, // number of buckets goes here
output : {
ids : { $push : "$_id" }
}
}
}])
If you want to go more sophisticated regarding the distribution aspect you could perhaps try something like this which would throw all even counts into one pot and all odd ones into another:
$bucket: {
groupBy: { $mod: [ "$count", 2 ] },
boundaries: [ 0, 1 ],
default: 2,
output: {
"docs": { $push: "$$ROOT" }
}
}
Depending on the type of your userId field you could perhaps come up with a more "random" distribution.
Lastly, I am not sure what exactly you mean by
"Is there a way to use the index of each result ?"
Perhaps something like $size, $arrayElemAt and/or $indexOfArray...?
Alternatively, you could perhaps $slice the sorted array into two equally sized parts (using $size and $divide by 2), $reverseArray one of them, and then $zip both arrays together again, which should result in something like shuffling a deck of playing cards. After that, you would need to flatten the nested array into a single one again (using $reduce and $concatArrays or so) and then slice that array into two parts, which should be what you are looking for, if I am not too tired by now to think through the statistical parts here.
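The slice/reverse/zip idea is easier to judge on a small example. Here is a plain JavaScript sketch of those steps, assuming an array already sorted by count in descending order and an even number of elements; sorted and interleavedSplit are hypothetical names, and this is not aggregation syntax.

```javascript
// Hypothetical input: users already sorted by count, descending.
const sorted = [
  { _id: "a", count: 4 },
  { _id: "b", count: 3 },
  { _id: "c", count: 2 },
  { _id: "d", count: 1 },
];

function interleavedSplit(items) {
  const half = Math.floor(items.length / 2); // like $size divided by 2
  const top = items.slice(0, half); // like $slice, first part
  const bottom = items.slice(half).reverse(); // $slice + $reverseArray
  // Like $zip: pair the i-th largest with the i-th smallest.
  const zipped = top.map((item, i) => [item, bottom[i]]);
  // Like $reduce with $concatArrays: flatten the pairs again.
  const flat = zipped.reduce((acc, pair) => acc.concat(pair), []);
  // Slice the flattened array into two parts.
  return [flat.slice(0, half), flat.slice(half)];
}

const [first, second] = interleavedSplit(sorted);
console.log(first.map((u) => u._id)); // [ 'a', 'd' ]
console.log(second.map((u) => u._id)); // [ 'b', 'c' ]
```

With this sample both halves add up to 5 interactions, which is the kind of balance the question is after; with skewed real data the balance will only be approximate.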

Combining group and project in mongoDB aggregation framework

My document looks like this:
{
"_id" : ObjectId("5748d1e2498ea908d588b65e"),
"some_item" : {
"_id" : ObjectId("5693afb1b49eb7d5ed97de14"),
"item_property_1" : 1.0,
"item_property_2" : 2.0,
},
"timestamp" : "2016-05-28",
"price_information" : {
"arbitrary_value" : 111,
"hourly_rates" : [
{
"price" : 74.45,
"hour" : "0"
},
{
"price" : 74.45,
"hour" : "1"
},
{
"price" : 74.45,
"hour" : "2"
},
]
}
}
I averaged the price per day via:
db.hourly.aggregate([
{$match: {timestamp : "2016-05-28"}},
{$unwind: "$price_information.hourly_rates"},
{$group: { _id: "$unique_item_identifier", total_price: { $avg: "$price_information.hourly_rates.price"}}}
]);
I am struggling with bringing (projecting) other fields into the result set. I would also like to have some_item and timestamp in the result set. I tried to use a $project: {some_item: 1, total_price: 1, ...} within the query, but that wasn't right.
My desired output would be like:
{
"_id" : ObjectId("5693afb1b49eb7d5ed97de27"),
"someItem" : {
"_id" : ObjectId("5693afb1b49eb7d5ed97de14"),
"item_property_1" : 1.0,
"item_property_2" : 2.0,
},
"timestamp" : "2016-05-28",
"price_information" : {
"avg_price": 34
}
}
If somebody could give me a hint, how to project the grouping and the other params into the result set, I would be thankful.
Best
Rob
If you are using MongoDB 3.2 or newer, you can use $avg in the $project stage, since there it returns the average of the specified expression or list of expressions for each document, e.g.:
db.hourly.aggregate([
{ "$match": { "timestamp": "2016-05-28" } },
{
"$project": {
"price_information": {
"avg_price": { "$avg": "$price_information.hourly_rates.price" }
},
"someItem": "$some_item",
"timestamp": 1,
}
}
]);
In previous versions of MongoDB, $avg is available in the $group stage only. So to include the other fields, use the $first operator in your grouping:
db.hourly.aggregate([
{ "$match": { "timestamp": "2016-05-28" } },
{ "$unwind": "$price_information.hourly_rates" },
{
"$group": {
"_id": "$_id",
"avg_price": { "$avg": "$price_information.hourly_rates.price" },
"someItem": { "$first": "$some_item" },
"timestamp": { "$first": "$timestamp" },
}
},
{
"$project": {
"price_information": { "avg_price": "$avg_price" },
"someItem": 1,
"timestamp": 1
}
}
]);
Note: The result of the $first operator in a $group stage depends on the order in which documents reach that stage, as well as on the group-by key. Because $first returns the value from the first document in each group of documents sharing the same key, the $group stage should logically be preceded by a $sort stage so that the input documents arrive in a defined order. $first is only sensible to use when you know the order in which the data is processed.
However, since the pipeline above groups by the main document's _id, every document in a group comes from the same original document, so $first applied to the top-level fields (as opposed to the unwound price_information.hourly_rates fields) is guaranteed to return the original value. Hence there is no need for a preceding $sort stage in this case.
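As a sanity check on the expected numbers, the per-document average can be reproduced in plain JavaScript. This is a minimal in-memory sketch with made-up prices; hourly and averagePrices are illustrative names, and returning null for an empty array is meant to mirror what $avg does when there is nothing numeric to average.

```javascript
// Hypothetical sample document shaped like the one in the question.
const hourly = [
  {
    _id: 1,
    some_item: { item_property_1: 1.0, item_property_2: 2.0 },
    timestamp: "2016-05-28",
    price_information: {
      hourly_rates: [
        { price: 70, hour: "0" },
        { price: 80, hour: "1" },
        { price: 90, hour: "2" },
      ],
    },
  },
];

function averagePrices(documents, day) {
  return documents
    .filter((d) => d.timestamp === day) // like $match
    .map((d) => {
      const prices = d.price_information.hourly_rates.map((r) => r.price);
      const avg = prices.length
        ? prices.reduce((sum, p) => sum + p, 0) / prices.length
        : null;
      // Carry the other fields along, as the $project / $first steps do.
      return {
        _id: d._id,
        someItem: d.some_item,
        timestamp: d.timestamp,
        price_information: { avg_price: avg },
      };
    });
}

console.log(averagePrices(hourly, "2016-05-28")[0].price_information); // { avg_price: 80 }
```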

Mongo $group with $project

I am trying to get the keyword count along with parentId, categoryId and llcId.
My document looks like this:
{
"_id" : ObjectId("5673f5b1e4b0822f6f0a5b89"),
"keyword" : "electronic content management system",
"llcId" : "CL1K9B",
"categoryId" : "CL1K8V",
"parentId" : "CL1K8V",
}
I tried $project with $group
db.keyword.aggregate([
{
$group: {
_id: "$llcId",
total: {$sum: 1},
}
},
{
$project: {
categoryId: 1, total: 1
}
}
])
And it gives me a result like
{ "_id" : "CL1KJQ", "total" : 17 }
{ "_id" : "CL1KKW", "total" : 30 }
But I also need the actual data in the result, e.g. llcId, categoryId, keyword, total. I tried to display categoryId and keyword using $project, but it displays only _id and total. What am I missing?
To get the keyword count you'd need to group the documents by the keyword field, then use the accumulator operator $sum to get the documents count. As for the other field values, since you are grouping all the documents by the keyword value, the best you can do to get the other fields is use the $first operator which returns a value from the first document for each group. Otherwise you may have to use the $push operator to return an array of the field values for each group:
var pipeline = [
{
"$group": {
"_id": "$keyword",
"total": { "$sum": 1 },
"llcId": { "$first": "$llcId"},
"categoryId": { "$first": "$categoryId"},
"parentId": { "$first": "$parentId"}
}
}
];
db.keyword.aggregate(pipeline)
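To illustrate what $first keeps and what it drops, here is a small in-memory sketch of the same grouping. The keywordDocs array and its ID values are made up, and groupByKeyword is a hypothetical name.

```javascript
// Hypothetical sample data: "cms" appears under two different llcIds.
const keywordDocs = [
  { keyword: "cms", llcId: "L1", categoryId: "C1" },
  { keyword: "cms", llcId: "L2", categoryId: "C2" },
  { keyword: "erp", llcId: "L1", categoryId: "C1" },
];

function groupByKeyword(documents) {
  const groups = new Map();
  for (const doc of documents) {
    const group = groups.get(doc.keyword);
    if (group) {
      group.total += 1; // like { $sum: 1 }
    } else {
      // Like $first: only the first document's values are kept.
      groups.set(doc.keyword, {
        _id: doc.keyword,
        total: 1,
        llcId: doc.llcId,
        categoryId: doc.categoryId,
      });
    }
  }
  return [...groups.values()];
}

console.log(groupByKeyword(keywordDocs));
// [ { _id: 'cms', total: 2, llcId: 'L1', categoryId: 'C1' },
//   { _id: 'erp', total: 1, llcId: 'L1', categoryId: 'C1' } ]
```

Note that the second "cms" document's llcId ("L2") does not appear anywhere in the output, which is the trade-off between $first and $push mentioned above.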
You are grouping by llcId so it will give more than one categoryId per llcId.
If you want categoryId as in your result, you have to write that in your group query. For example:
db.keyword.aggregate([
{
$group: {
_id: "$llcId",
total: {$sum: 1},
categoryId:{$max:"$categoryId"}
}
},
{
$project: {
categoryId: 1, total: 1
}
}])

Limiting Query result in MongoDB

I have 20,000+ documents in my MongoDB database. I just learnt that you cannot query them all in one go.
So my question is this:
I want to get my documents using find(query), then limit the results to 3 documents only, and be able to choose where those documents start from.
For example, if my find() query resulted in 8 documents:
[{doc1}, {doc2}, {doc3}, {doc4}, {doc5}, {doc6}, {doc7}, {doc8}]
a command like limit(2, 3) would give [doc3, doc4, doc5].
I also need the total count for the whole result (without the limit); for example, length() would give 8 (the total number of documents returned by find()).
Any suggestions? Thanks
Add .skip(2).limit(3) to the end of your query.
I suppose you have the following documents in your collection.
{ "_id" : ObjectId("56801243fb940e32f3221bc2"), "a" : 0 }
{ "_id" : ObjectId("56801243fb940e32f3221bc3"), "a" : 1 }
{ "_id" : ObjectId("56801243fb940e32f3221bc4"), "a" : 2 }
{ "_id" : ObjectId("56801243fb940e32f3221bc5"), "a" : 3 }
{ "_id" : ObjectId("56801243fb940e32f3221bc6"), "a" : 4 }
{ "_id" : ObjectId("56801243fb940e32f3221bc7"), "a" : 5 }
{ "_id" : ObjectId("56801243fb940e32f3221bc8"), "a" : 6 }
{ "_id" : ObjectId("56801243fb940e32f3221bc9"), "a" : 7 }
From MongoDB 3.2 you can use the .aggregate() method and the $slice operator.
db.collection.aggregate([
{ "$group": {
"_id": null,
"count": { "$sum": 1 },
"docs": { "$push": "$$ROOT" }
}},
{ "$project": {
"count": 1,
"_id": 0,
"docs": { "$slice": [ "$docs", 2, 3 ] }
}}
])
Which returns:
{
"count" : 8,
"docs" : [
{
"_id" : ObjectId("56801243fb940e32f3221bc4"),
"a" : 2
},
{
"_id" : ObjectId("56801243fb940e32f3221bc5"),
"a" : 3
},
{
"_id" : ObjectId("56801243fb940e32f3221bc6"),
"a" : 4
}
]
}
You may want to sort your document before grouping using the $sort operator.
In MongoDB 3.0 and earlier, you first need to $group your documents and use the $sum accumulator operator to return the "count" of documents; in that same $group stage, use $push with the $$ROOT variable to return an array of all your documents. The next stage in the pipeline is $unwind, which denormalizes that array. From there, use the $skip and $limit stages to skip the first 2 documents and pass 3 documents on to the next stage, which is another $group stage.
db.collection.aggregate([
{ "$group": {
"_id": null,
"count": { "$sum": 1 },
"docs": { "$push": "$$ROOT" }
}},
{ "$unwind": "$docs" },
{ "$skip": 2 },
{ "$limit": 3 },
{ "$group": {
"_id": "$_id",
"count": { "$first": "$count" },
"docs": { "$push": "$docs" }
}}
])
As @JohnnyHK pointed out in this comment:
$group is going to read all documents and build a 20k element array with them just to get three docs.
You should then run two queries using find()
db.collection.find().skip(2).limit(3)
and
db.collection.count()
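In case the skip/limit arithmetic is unclear, here is the same paging expressed on a plain array. It is an in-memory sketch only: collection and page are hypothetical names, and the sample mirrors the eight documents with "a" from 0 to 7.

```javascript
// Hypothetical stand-in for the eight sample documents.
const collection = Array.from({ length: 8 }, (_, a) => ({ a }));

function page(documents, skip, limit) {
  return {
    // The total is independent of paging, like a separate count() query.
    count: documents.length,
    // skip(n).limit(m) corresponds to slice(n, n + m).
    docs: documents.slice(skip, skip + limit),
  };
}

console.log(JSON.stringify(page(collection, 2, 3)));
// {"count":8,"docs":[{"a":2},{"a":3},{"a":4}]}
```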