How to count usage of items in mongodb with a single query? - mongodb

Suppose I have the following data:
{_id: 1, tags: ['foo', 'bar']}
{_id: 2, tags: ['bar']}
{_id: 3, tags: ['foo']}
{_id: 4, tags: ['bar', 'foo']}
{_id: 5, tags: ['foo']}
I would like a query to return the number of times each tag is used. In this case the tag "foo" was used 4 times and "bar" was used 3 times. I'm guessing the aggregate functions would help me here but not sure how. Please help me with an example!
Thanks gurus!

Figured it out :)
db.test.aggregate([{$unwind: "$tags"}, {$group: {_id: "$tags", total: {$sum: 1}}}]);
I wouldn't mind if someone knew a more efficient way though :)
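To see what the pipeline above computes, here is a minimal plain-JavaScript sketch of the same unwind-then-group logic on an in-memory array. It does not talk to MongoDB; `docs` and the `countTags` helper are illustrative names I made up, not part of any driver API.

```javascript
// Hypothetical in-memory copy of the sample documents from the question.
const docs = [
  { _id: 1, tags: ['foo', 'bar'] },
  { _id: 2, tags: ['bar'] },
  { _id: 3, tags: ['foo'] },
  { _id: 4, tags: ['bar', 'foo'] },
  { _id: 5, tags: ['foo'] },
];

// countTags is an illustrative helper, not a MongoDB function.
function countTags(documents) {
  const totals = new Map();
  for (const doc of documents) {
    // "Unwind": visit every tag of every document on its own.
    for (const tag of doc.tags) {
      // "Group" by tag and add 1 per occurrence, like {$sum: 1}.
      totals.set(tag, (totals.get(tag) || 0) + 1);
    }
  }
  // Shape the output like the aggregation result: {_id: tag, total: n}.
  return [...totals].map(([tag, total]) => ({ _id: tag, total }));
}

const tagTotals = countTags(docs);
console.log(tagTotals);
```

For the sample data this yields a total of 4 for "foo" and 3 for "bar", matching the counts stated in the question.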

Related

MongoDB Query generating same result set for multiple pages

Sample data in the MongoDB collection is attached as an image.
I am running a query with sort and limit.
The same data shows up in the result sets of multiple pages.
Page 1 ->Query
db.user_profile.find({},{ email: 1}).skip(0).limit(3).sort( {isEnabled:-1,firstName:1} )
Page 2 ->Query
db.user_profile.find({},{ email: 1}).skip(3).limit(3).sort( {isEnabled:-1,firstName:1} )
Page 3 ->Query
db.user_profile.find({},{ email: 1}).skip(6).limit(3).sort( {isEnabled:-1,firstName:1} )
You need to project what you are sorting. One way to do it is:
db.user_profile.find({},{ email: 1, isEnabled: 1, firstName:1}).skip(3).limit(3).sort( {isEnabled:-1,firstName:1} )
Notice the projection of isEnabled and firstName that were missing before. This is a simple query but will result in adding two more fields to the output.
If you want to return only the email, you can use an aggregation pipeline that removes the extra fields in a later stage:
db.user_profile.aggregate([
{
$sort: {isEnabled: -1, firstName: 1}
},
{$skip: 0},
{$limit: 3},
{$project: {email: 1}} // or {$project: {email: 1, _id: 0}}
])
You can see the aggregation example working on the playground
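Independently of the projection, it may be worth checking whether several documents share the same isEnabled and firstName values. MongoDB does not guarantee a consistent order among documents whose sort keys are equal, so skip/limit pages can overlap; adding a unique field such as _id as the last sort key makes the order deterministic. The sketch below mirrors that idea in plain JavaScript on made-up data (the `profiles` array, `compareProfiles` and `getPage` are hypothetical names, and no database is involved).

```javascript
// Hypothetical user profiles; several share the same isEnabled/firstName pair.
const profiles = [
  { _id: 1, email: 'a@example.com', isEnabled: true, firstName: 'Ann' },
  { _id: 2, email: 'b@example.com', isEnabled: true, firstName: 'Ann' },
  { _id: 3, email: 'c@example.com', isEnabled: true, firstName: 'Bob' },
  { _id: 4, email: 'd@example.com', isEnabled: false, firstName: 'Ann' },
  { _id: 5, email: 'e@example.com', isEnabled: false, firstName: 'Ann' },
  { _id: 6, email: 'f@example.com', isEnabled: true, firstName: 'Ann' },
];

// Mirrors sort({isEnabled: -1, firstName: 1, _id: 1}): the trailing _id
// comparison breaks ties, so the order is fully determined.
function compareProfiles(a, b) {
  if (a.isEnabled !== b.isEnabled) return a.isEnabled ? -1 : 1; // true first
  if (a.firstName !== b.firstName) return a.firstName < b.firstName ? -1 : 1;
  return a._id - b._id; // unique tie-breaker
}

// getPage is an illustrative helper: sort, skip, limit, then project email.
function getPage(documents, skip, limit) {
  return [...documents]
    .sort(compareProfiles)
    .slice(skip, skip + limit)
    .map((doc) => doc.email);
}

const page1 = getPage(profiles, 0, 3);
const page2 = getPage(profiles, 3, 3);
console.log(page1, page2);
```

In the shell, the corresponding change would be to sort with {isEnabled: -1, firstName: 1, _id: 1} on every page query.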

MongoDB aggregation: Get samples at specific intervals

I have a MongoDB collection containing timestamped documents. The important part of their shape is:
{
receivedOn: {
date: ISODate("2018-10-01T07:50:06.836Z")
}
}
They are indexed on the date.
These documents relate to and contain data from UDP packets constantly arriving at a server. The packet rate varies, but there are usually around 20 per second.
I'm trying to take samples from this collection. I have a list of timestamps, and I want to get the documents closest to these timestamps in the past.
For example, if I have the following documents
{_id: 1, "receivedOn.date": ISODate("2018-10-01T00:00:00.000Z")}
{_id: 2, "receivedOn.date": ISODate("2018-10-01T00:00:02.000Z")}
{_id: 3, "receivedOn.date": ISODate("2018-10-01T00:00:04.673Z")}
{_id: 4, "receivedOn.date": ISODate("2018-10-01T00:00:05.001Z")}
{_id: 5, "receivedOn.date": ISODate("2018-10-01T00:00:09.012Z")}
{_id: 6, "receivedOn.date": ISODate("2018-10-01T00:00:10.065Z")}
and the timestamps
new Date("2018-10-01T00:00:05.000Z")
new Date("2018-10-01T00:00:10.000Z")
I want the result to be
[
{_id: 3, "receivedOn.date": ISODate("2018-10-01T00:00:04.673Z")},
{_id: 5, "receivedOn.date": ISODate("2018-10-01T00:00:09.012Z")}
]
Using aggregation, I made this work. The following code gives the correct result, but it is slow and appears to have complexity O(n*m), where n is the number of matched documents and m is the number of timestamps:
const timestamps = [
new Date("2018-10-01T00:00:00.000Z"),
new Date("2018-10-01T00:00:05.000Z"),
new Date("2018-10-01T00:00:10.000Z")
];
collection.aggregate([
{$match: {
$and: [
{"receivedOn.date": {$lte: new Date("2018-10-01T00:00:10.000Z")}},
{"receivedOn.date": {$gte: new Date("2018-10-01T00:00:00.000Z")}}
]
}},
{$project: ...},
{$sort: {"receivedOn.date": -1}},
{$bucket: {
groupBy: "$receivedOn.date",
boundaries: timestamps,
output: {
docs: {$push: "$$CURRENT"}
}
}},
// The buckets contain sorted arrays. The first element is the newest
{$project: {
doc: {
$arrayElemAt: ["$docs", 0]
}
}},
// Lift the document out of its bucket wrapper
{$replaceRoot: {newRoot: "$doc"}}
]);
Is there a way to make this faster? Like somehow telling $bucket that the data is sorted? I assume what is taking most time here is $bucket trying to figure out which bucket to put the document in. Or is there another, better way to do this?
I've also tried running one findOne query per timestamp in parallel. That also gives the correct result, and is much faster, but having a few thousand timestamps is not uncommon. I don't want to do thousands of queries each time I need to do this.
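To make the desired result concrete, here is a minimal client-side sketch in plain JavaScript. It is not a server-side replacement for $bucket; it assumes the matched documents can be read in ascending date order (for example via the existing index) and that the timestamps are sorted too. The names `samples`, `sampleTimes` and `latestAtOrBefore` are hypothetical, and "closest in the past" is interpreted as "latest document at or before the timestamp".

```javascript
// Hypothetical in-memory documents, already sorted ascending by date.
const samples = [
  { _id: 1, receivedOn: { date: new Date('2018-10-01T00:00:00.000Z') } },
  { _id: 2, receivedOn: { date: new Date('2018-10-01T00:00:02.000Z') } },
  { _id: 3, receivedOn: { date: new Date('2018-10-01T00:00:04.673Z') } },
  { _id: 4, receivedOn: { date: new Date('2018-10-01T00:00:05.001Z') } },
  { _id: 5, receivedOn: { date: new Date('2018-10-01T00:00:09.012Z') } },
  { _id: 6, receivedOn: { date: new Date('2018-10-01T00:00:10.065Z') } },
];

const sampleTimes = [
  new Date('2018-10-01T00:00:05.000Z'),
  new Date('2018-10-01T00:00:10.000Z'),
];

// latestAtOrBefore is an illustrative helper. Both inputs must be sorted
// ascending; each document is visited at most once, so the cost is O(n + m).
function latestAtOrBefore(sortedDocs, sortedTimes) {
  const result = [];
  let i = 0;
  let candidate = null;
  for (const t of sortedTimes) {
    // Advance while the next document is not after the current timestamp.
    while (i < sortedDocs.length && sortedDocs[i].receivedOn.date <= t) {
      candidate = sortedDocs[i];
      i += 1;
    }
    // If no document lies between two timestamps, the same candidate
    // is emitted again; timestamps before the first document are skipped.
    if (candidate !== null) result.push(candidate);
  }
  return result;
}

const picked = latestAtOrBefore(samples, sampleTimes);
console.log(picked.map((doc) => doc._id));
```

For the example documents and the two timestamps from the question, this picks the documents with _id 3 and _id 5.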

MongoDB aggregate query for values in an array

So I have data that looks like this:
{
_id: 1,
ranking: 5,
tags: ['Good service', 'Clean room']
}
Each of these stands for a review. There can be multiple reviews with a ranking of 5. The tags field can be filled with up to 4 different tags.
The 4 tags are: 'Good service', 'Good food', 'Clean room', 'Need improvement'
I want to make a MongoDB aggregate query where I say 'for each ranking (1-5), give me the number of times each tag occurred for that ranking'.
So an example result might look like this, with _id being the ranking:
[
{ _id: 5,
totalCount: 5,
tags: {
goodService: 1,
goodFood: 3,
cleanRoom: 1,
needImprovement: 0
}
},
{ _id: 4,
totalCount: 7,
tags: {
goodService: 0,
goodFood: 2,
cleanRoom: 3,
needImprovement: 0
}
},
...
]
I'm having trouble counting the occurrences of each tag. Any help would be appreciated.
You can try the aggregation below.
db.colname.aggregate([
{"$unwind":"$tags"},
{"$group":{
"_id":{
"ranking":"$ranking",
"tags":"$tags"
},
"count":{"$sum":1}
}},
{"$group":{
"_id":"$_id.ranking",
"totalCount":{"$sum":"$count"},
"tags":{"$push":{"tags":"$_id.tags","count":"$count"}}
}}
])
To get key-value pairs instead of an array, you can replace $push with $mergeObjects (available from version 3.6).
"tags":{"$mergeObjects":{"$arrayToObject":[[["$_id.tags","$count"]]]}}
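As a rough illustration of what the two $group stages compute, here is a plain-JavaScript sketch over an in-memory array. The `reviews` data and the `countTagsByRanking` helper are made up for this example. As in the pipeline, totalCount is the sum of tag occurrences for a ranking, and tags that never occur for a ranking are simply absent instead of being reported as 0.

```javascript
// Hypothetical sample reviews in the shape shown in the question.
const reviews = [
  { _id: 1, ranking: 5, tags: ['Good service', 'Clean room'] },
  { _id: 2, ranking: 5, tags: ['Good food'] },
  { _id: 3, ranking: 4, tags: ['Clean room', 'Good food'] },
];

// countTagsByRanking is an illustrative helper, not a MongoDB function.
function countTagsByRanking(documents) {
  const byRanking = new Map();
  for (const review of documents) {
    if (!byRanking.has(review.ranking)) {
      byRanking.set(review.ranking, { _id: review.ranking, totalCount: 0, tags: {} });
    }
    const entry = byRanking.get(review.ranking);
    // "Unwind" the tags and count each (ranking, tag) pair.
    for (const tag of review.tags) {
      entry.tags[tag] = (entry.tags[tag] || 0) + 1;
      entry.totalCount += 1; // same as summing the per-tag counts
    }
  }
  return [...byRanking.values()];
}

const perRanking = countTagsByRanking(reviews);
console.log(JSON.stringify(perRanking, null, 2));
```

If zero counts are needed for missing tags, one option is to start each entry's tags object with all four known tags set to 0.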

Mongo query: array of objects where a key's value is repeated

I am new to Mongo. I am posting this question because I am not sure how to search for this on Google.
I have a book document like the one below
{
bookId: 1,
title: 'some title',
publicationDate: DD-MM-YYYY,
editions: [{
editionId: 1
},{
editionId: 2
}]
}
and another one like this
{
bookId: 2,
title: 'some title 2',
publicationDate: DD-MM-YYYY,
editions: [{
editionId: 1
},{
editionId: 1
}]
}
I want to write a query db.books.find({}) which would return only those books where editions.editionId has been duplicated for a book.
So in this example, for bookId: 2 there are two editions with the editionId:1.
Any suggestions?
You can use the aggregation framework; specifically, you can use the $group stage to group the records by book and edition id and count how many times each pair occurs. If the count is greater than 1, you've found a duplicate.
Here is an example:
db.books.aggregate([
{$unwind: "$editions"},
{$group: {"_id": {"_id": "$_id", "editionId": "$editions.editionId"}, "count": {$sum: 1}}},
{$match: {"count" : {"$gt": 1}}}
])
Note that this does not return the entire book records, but it does return their identifiers; you can then use these in a subsequent query to fetch the entire records, or do some de-duplication for example.
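The same grouping idea can be sketched in plain JavaScript on in-memory data, which may help when checking the result by hand. The `books` array (including its string _id values) and the `findDuplicateEditions` helper are hypothetical and only mirror the shape of the aggregation output.

```javascript
// Hypothetical in-memory books; the _id values are invented for this sketch.
const books = [
  { _id: 'a', bookId: 1, title: 'some title', editions: [{ editionId: 1 }, { editionId: 2 }] },
  { _id: 'b', bookId: 2, title: 'some title 2', editions: [{ editionId: 1 }, { editionId: 1 }] },
];

// findDuplicateEditions is an illustrative helper, not a MongoDB function.
function findDuplicateEditions(documents) {
  const counts = new Map();
  for (const book of documents) {
    // "Unwind" editions and count each (book _id, editionId) pair.
    for (const edition of book.editions) {
      const key = JSON.stringify([book._id, edition.editionId]);
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  // Keep only pairs seen more than once, like {$match: {count: {$gt: 1}}}.
  const duplicates = [];
  for (const [key, count] of counts) {
    if (count > 1) {
      const [_id, editionId] = JSON.parse(key);
      duplicates.push({ _id: { _id, editionId }, count });
    }
  }
  return duplicates;
}

const duplicateEditions = findDuplicateEditions(books);
console.log(duplicateEditions);
```

Only the second book is reported, because editionId 1 appears twice in its editions array.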

MongoDB - distinct with query doesn't use indexes

Using Mongo 3.2.
Let's say I have a collection with this schema:
{ _id: 1, type: "a", source: "x" },
{ _id: 2, type: "a", source: "y" },
{ _id: 3, type: "b", source: "x" },
{ _id: 4, type: "b", source: "y" }
Of course, my db is much larger, with many more types and sources.
I have created 4 indexes on combinations of type and source (even though 1 should be enough):
{type: 1},
{source: 1},
{type: 1, source: 1},
{source: 1, type: 1}
Now, I am running this distinct query:
db.test.distinct("source", {type: "a"})
The problem is that this query takes much more time than it should.
If I run it with runCommand:
db.runCommand({distinct: 'test', key: "source", query: {type: "a"}})
this is the result I get:
{
"waitedMS": 0,
"values": [
"x",
"y"
],
"stats": {
"n": 19400840,
"nscanned": 19400840,
"nscannedObjects": 19400840,
"timems": 14821,
"planSummary": "IXSCAN { type: 1 }"
},
"ok": 1
}
For some reason, mongo uses only the {type: 1} index for the query stage.
It should also use the index for the distinct stage.
Why is that? Using the {type: 1, source: 1} index would be much better, no? Right now it is scanning all the type: "a" documents even though it has an index for them.
Am I doing something wrong? Do I have a better option for this kind of distinct?
As Alex mentioned, apparently MongoDB doesn't support this right now.
There is an open issue for it:
https://jira.mongodb.org/browse/SERVER-19507
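Just drop the first 2 indexes. You don't need them. Mongo can use {type: 1, source: 1} in any query that may need the {type: 1} index, and likewise {source: 1, type: 1} in place of {source: 1}, because a compound index also serves queries on its leading field.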