MongoDB - Count and group on key

I'm quite new to MongoDB and I'm trying to get my head around the grouping/counting functions. I want to retrieve a list of votes per track ID, ordered by their votes. Can you help me make sense of this?
List schema:
{
"title" : "Bit of everything",
"tracks" : [
ObjectId("5310af6518668c271d52aa8d"),
ObjectId("53122ffdc974dd3c48c4b74e"),
ObjectId("53123045c974dd3c48c4b74f"),
ObjectId("5312309cc974dd3c48c4b750")
],
"votes" : [
{
"track_id" : "5310af6518668c271d52aa8d",
"user_id" : "5312551c92d49d6119481c88"
},
{
"track_id" : "53122ffdc974dd3c48c4b74e",
"user_id" : "5310f488000e4aa02abcec8e"
},
{
"track_id" : "53123045c974dd3c48c4b74f",
"user_id" : "5312551c92d49d6119481c88"
}
]
}
I'm looking to generate the following result (ideally ordered by the number of votes, and ideally including tracks with no entries in the votes array, using tracks as a reference):
[
{
track_id: 5310af6518668c271d52aa8d,
track_votes: 1
},
{
track_id: 5312309cc974dd3c48c4b750,
track_votes: 0
}
...
]
In MySQL, I would execute the following
SELECT COUNT(*) AS track_votes, track_id FROM list_votes GROUP BY track_id ORDER BY track_votes DESC
I've been looking into the documentation for the various aggregation/reduce functions, but I just can't seem to get anything working for my particular case.
Can anyone help me get started with this query? Once I know how these are structured I'll be able to apply that knowledge elsewhere!
I'm using MongoDB version 2.0.4 on Ubuntu 12, so I don't think I have access to the aggregation functions provided in later releases. I'd be willing to upgrade, if that's the general consensus.
Many thanks in advance!

I recommend that you upgrade your MongoDB to version 2.2 and use the Aggregation Framework as follows:
db.collection.aggregate(
{ $unwind:"$votes"},
{ $group : { _id : "$votes.track_id", number : { $sum : 1 } } }
)
MongoDB 2.2 introduced a new aggregation framework, modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.
The output will look like this:
{
"result":[
{
"_id" : "5310af6518668c271d52aa8d", <- ObjectId
"number" : 2
},
...
],
"ok" : 1
}
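If you also want the result ordered by the number of votes, as in the SQL query, a $sort stage can be appended (a small sketch based on the pipeline above; note that tracks with no entries in votes still won't appear and would have to be filled in with a zero count on the client side):
db.collection.aggregate(
    { $unwind : "$votes" },
    { $group  : { _id : "$votes.track_id", number : { $sum : 1 } } },
    { $sort   : { number : -1 } }
)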
If upgrading is not possible, I recommend doing it programmatically instead.
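For example, a rough client-side sketch in the mongo shell (the collection name lists is a placeholder, not taken from the question):
// count votes per track_id on the client; tracks with no votes would still
// need to be added from the tracks array afterwards
var counts = {};
db.lists.find({}, { votes : 1 }).forEach(function (doc) {
    (doc.votes || []).forEach(function (v) {
        counts[v.track_id] = (counts[v.track_id] || 0) + 1;
    });
});
printjson(counts);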

Related

find() return the latest value only on MongoDB

I have this collection in MongoDB that contains the following entries. I'm using Robo3T to run the query.
{
    "_id" : ObjectId("xxx1"),
    "Evaluation Date" : "2021-09-09",
    "Results" : [
        {
            "Name" : "ABCD",
            "Version" : "3.2.x"
        }
    ]
},
{
    "_id" : ObjectId("xxx2"),
    "Evaluation Date" : "2022-09-09",
    "Results" : [
        {
            "Name" : "ABxD",
            "Version" : "5.2.x"
        }
    ]
}
The collection contains multiple entries of a similar format. Now I need to extract the latest value for "Version".
Expected output:
5.2.x
Measures I've taken so far:
(1) I've tried findOne(), and while I was able to extract the value of "Version" with: db.getCollection('TestCollectionName').findOne().Results[0].Version
...only the oldest entry was returned:
3.2.x
(2) Using find().sort().limit() like below returns the entire document for the latest entry and not just the value that I wanted: db.getCollection('TestCollectionName').find({}).sort({"Results.Version":-1}).limit(1)
Results below:
"_id" : ObjectId("xxx2"),
"Evaluation Date" : "2022-09-09",
"Results" : [
{
"Name" : "ABxD",
"Version" : "5.2.x"
}
]
(3) I've tried to use sort() and limit() alongside findOne(), but I've read that findOne() may be deprecated and is not compatible with sort(), and thus it results in an error.
(4) Finally, if I try to use sort and limit on find like this: db.getCollection('LD_exit_Evaluation_Result_MFC525').find({"Results.New"}).sort({_id:-1}).limit(1) I get an unexpected token error.
What would be a good approach for this?
Did I simply misplace or miss a bracket, or do I need to reorder the syntax?
Thanks in advance.
I'm not sure if I understood well, but maybe this could be what you are looking for:
db.collection.aggregate([
{
"$project": {
lastResult: {
"$last": "$Results"
},
},
},
{
"$project": {
version: "$lastResult.Version",
_id: 0
}
}
])
It uses aggregate with a couple of operators: the first $project calculates a new field called lastResult containing the last element of each array, using the $last operator. The second $project just cleans up the output. If you need the _id reference, remove _id: 0 or change its value to 1.
You can check how it works here: https://mongoplayground.net/p/jwqulFtCh6b
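If what you need is the version from the newest document (by "Evaluation Date") rather than the last element of each document, a variation of the same idea could work (a sketch, assuming the date strings sort chronologically, as "YYYY-MM-DD" strings do):
db.collection.aggregate([
    { "$sort" : { "Evaluation Date" : -1 } },   // newest document first
    { "$limit" : 1 },
    { "$project" : { version : { "$last" : "$Results.Version" }, _id : 0 } }
])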
Hope I helped

MongoDB aggregate query performance improvement

I recently started shifting data from Microsoft SQL Server to MongoDB to obtain scalability. All good in terms of migration.
Each document has 2 important fields: customer, timestamphash (year month day).
We imported only 75 million documents into an Azure Linux VM where we installed MongoDB.
After adding a compound index on both fields, we are having the following problem:
On 3 million documents (after filtering) it takes 24 seconds to finish an aggregate group-by count by customerId. SQL Server gives the result in less than 1 second on the same data.
Do you think Cassandra would be a better solution? We need query performance on large amounts of data.
I tried disk writes and giving the VM more RAM. Nothing worked.
Query:
db.collection.aggregate([
{ "$match" : { "Customer" : 2 } },
{ "$match" : { "TimestampHash" : { "$gte" : 20160710 } } },
{ "$match" : { "TimestampHash" : { "$lte" : 20190909 } } },
{ "$group" : { "_id" : { "Device" : "$Device" }, "__agg0" : { "$sum" : 1 } } },
{ "$project" : { "Device" : "$_id.Device", "Count" : "$__agg0", "_id" : 0 } },
{ "$skip" : 0 },
{ "$limit" : 10 }])
Update:
I used 'allowDiskUse: true' and the problem was solved. The time was reduced to 4 seconds for the 3M filtered documents.
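For reference, the option is passed as the second argument to aggregate(), roughly like this (the pipeline from above is elided):
db.collection.aggregate(
    [ /* ...the pipeline above... */ ],
    { allowDiskUse : true }
)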
I have encountered a similar problem before, during this question, and to be honest, I guess Cassandra is better in your particular case, but the question was about Mongo aggregation query optimization, right?
As of now, one of my collections has more than 3M docs, and aggregation queries shouldn't take 24s if you build your indexes correctly.
First of all, check the index usage via MongoDB Compass. Is Mongo actually using it? If your app spams queries to the DB and your index has 0 usage, then, as you already guessed, something is wrong with your index.
The second thing is to use the explain method (this doc will help you out) to get more info about your query.
And third: index field order matters. For example, if you have a $match stage on 3 fields and you request docs by those fields:
{ $match: { a_field: a, b_field: b, c_field: c } }
then you should build a compound index on the a, b, c fields in exactly the same order.
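Applied to the query in this question, that would mean a compound index matching the $match fields, plus an explain to verify it is actually used (a sketch using the field names from the question):
db.collection.createIndex({ Customer : 1, TimestampHash : 1 })

// check whether the pipeline actually picks up the index
db.collection.explain().aggregate([
    { $match : { Customer : 2, TimestampHash : { $gte : 20160710, $lte : 20190909 } } },
    { $group : { _id : "$Device", count : { $sum : 1 } } }
])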
There is always some kind of DB architecture problem. I highly recommend that you not stockpile all data inside one collection. Use {timestamps: true} on insertion (it creates two fields, createdAt and updatedAt):
{
timestamps: true
}
in your schema, store old/outdated data in a different collection, and use the $lookup aggregation stage for it when you really need to operate on it.
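A rough sketch of that idea, assuming the outdated documents were moved to a hypothetical collection_archive collection (the names are placeholders, not from the question):
db.collection.aggregate([
    { $match : { Customer : 2 } },
    { $lookup : {
        from : "collection_archive",   // hypothetical archive collection
        localField : "Customer",
        foreignField : "Customer",
        as : "archivedDocs"            // archived documents joined onto the current ones
    } }
])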
Hope you'll find something useful in my answer.

Optimize Mongodb documents versioning

In my application I need to load a lot of data, compare it to the existing documents inside a specific collection, and version them.
In order to do that, for every new document I have to insert, I simply run a query that searches for the last version using a specific key (not _id), groups the data together and returns the latest version.
Example of data:
{
    "_id" : ObjectId("5c73a643f9bc1c2fg4ca6ef5"),
    "data" : {
        the data
    },
    "key" : {
        "value1" : "545454344",
        "value2" : "123212321",
        "value3" : "123123211"
    },
    "version" : NumberLong("1")
}
As you can see, key is composed of three values related to the data, and my query to find the last version is the following:
db.collection.aggregate([
    {
        "$sort" : {
            "version" : NumberInt("-1")
        }
    },
    {
        "$group" : {
            "_id" : "$key",
            "content" : { "$push" : "$data" },
            "version" : { "$push" : "$version" },
            "_oid" : { "$push" : "$_id" }
        }
    },
    {
        "$project" : {
            "data" : { "$arrayElemAt" : [ "$content", NumberInt("0") ] },
            "version" : { "$arrayElemAt" : [ "$version", NumberInt("0") ] },
            "_id" : { "$arrayElemAt" : [ "$_oid", NumberInt("0") ] }
        }
    }
])
To improve performance (from exponential to linear), I built an index that holds key and version:
db.getCollection("collection").createIndex({ "key": 1, "version" : 1})
So my question is: are there other capabilities/strategies to optimize this search?
Notes
in this collection there are some other fields I already use to filter data with $match, omitted for brevity
my prerequisite is to load a lot of data and process it one by one before inserting: if there is a better approach to calculating the version, I can also consider changing this
I'm not sure if a unique index on key could do the same job as my query. I mean, if I create a unique index on key and version, I would have uniqueness on that pair and could iterate on it, for example:
no data in the collection: just insert the first version
inserting a new document: try to insert version 1, get an error, and iterate on it; this should hit the unique index, right? (a short sketch of this idea follows)
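A short sketch of that idea (hypothetical, just to illustrate the failure that would drive the iteration):
// a unique compound index on the (key, version) pair
db.collection.createIndex({ "key" : 1, "version" : 1 }, { unique : true })

// inserting a document whose (key, version) pair already exists fails with a
// duplicate key error (code 11000), which is the signal to retry with version + 1
db.collection.insertOne({
    "key" : { "value1" : "545454344", "value2" : "123212321", "value3" : "123123211" },
    "version" : NumberLong("1"),
    "data" : { }
})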
I had a similar situation and this is how I solved it.
Create a separate collection that will hold each Key and its corresponding latest version, say KeyVersionCollection
Make this collection "InMemory" for faster response
Store the Key in the "_id" field
When inserting a document into your versioned collection, say EntityVersionedCollection:
Query the latest version from KeyVersionCollection
Increment the version number by 1, or insert a new document with version 0, in KeyVersionCollection
You can even combine the above 2 operations into 1 (https://docs.mongodb.com/manual/reference/method/db.collection.findAndModify/#db.collection.findAndModify); see the sketch after these steps
Use the new version number to insert the document in EntityVersionedCollection
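A minimal sketch of the last few steps combined, assuming the collection names from this answer (KeyVersionCollection as the counter collection, EntityVersionedCollection as the versioned data):
// atomically bump (or create) the latest version for this key
var latest = db.KeyVersionCollection.findAndModify({
    query  : { _id : { value1 : "545454344", value2 : "123212321", value3 : "123123211" } },
    update : { $inc : { version : NumberLong(1) } },
    new    : true,   // return the document after the increment
    upsert : true    // the first insert for a new key creates it with version 1
});

// use the freshly reserved version number for the new document
db.EntityVersionedCollection.insertOne({
    key     : latest._id,
    version : latest.version,
    data    : { /* the data */ }
});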
This will save the time spent on aggregation and sorting. On a side note, I would keep the latest versions in a separate collection - EntityCollection. In this case, for each entity, insert a new version in EntityVersionedCollection and upsert it in EntityCollection.
In corner cases where the process is interrupted between getting the new version number and using it while inserting the entity, you might see that a version is skipped in EntityVersionedCollection; but that should be OK. Use timestamps to track inserts/updates so that they can be used to correlate/audit in the future.
Hope that helps.
You can simply pass an array into the MongoDB insert function, and it should insert an entire JSON payload without any memory deficiencies.
You're welcome

query to retrieve multiple objects in an array in mongodb

Suppose I have an array of objects as below.
"array" : [
{
"id" : 1
},
{
"id" : 2
},
{
"id" : 2
},
{
"id" : 4
}
]
If I want to retrieve multiple objects ({id : 2}) from this array, the aggregation query goes like this.
db.coll.aggregate([{ $match : {"_id" : ObjectId("5492690f72ae469b0e37b61c")}}, { $unwind : "$array"}, { $match : { "array.id" : 2}}, { $group : { _id : "$_id", array : { $push : { id : "$array.id"}}}} ])
The output of above aggregation is
{
"_id" : ObjectId("5492690f72ae469b0e37b61c"),
"array" : [
{
"id" : 2
},
{
"id" : 2
}
]
}
Now the question is:
1) Is retrieving multiple objects from an array possible using find() in MongoDB?
2) With respect to performance, is aggregation the correct way to do this? (Because we need to use four pipeline operators.)
3) Can we use Java manipulation (looping over the array and keeping only the {id : 2} objects) to do this after a find({"_id" : ObjectId("5492690f72ae469b0e37b61c")}) query? Because find will retrieve the document once and keep it in RAM, whereas with aggregation four operations need to be performed in RAM to get the output.
The reason I ask question 3) is: suppose thousands of clients are accessing at the same time; then RAM will be overloaded. If it is done using Java, there is less work for RAM.
4) For how long will the working set stay in RAM?
Is my understanding correct?
Please correct me if I am wrong.
Please suggest how I can get the right insight on this.
1) No. You project the first matching element with $, you project all of them, or you project none of them (see the sketch after these answers).
2) No-ish. If you have to work with this array, aggregation is what will allow you to extract multiple matching elements, but the correct solution, conceptually and for performance, is to design your document structure so this problem does not arise, or arises only for rare queries whose performance is not particularly important.
3) Yes.
4) We have no information that would allow us to give a reasonable answer to this question. This is also out of scope relative to the rest of the question and should be a separate question.
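A sketch of what 1) means in practice, plus the newer alternative to the $unwind/$group pipeline ($filter requires MongoDB 3.2+, which may be newer than what the question assumes):
// positional projection: returns only the FIRST array element matching the query condition
db.coll.find(
    { "_id" : ObjectId("5492690f72ae469b0e37b61c"), "array.id" : 2 },
    { "array.$" : 1 }
)
// -> "array" : [ { "id" : 2 } ]   (the second { "id" : 2 } element is not returned)

// MongoDB 3.2+: $filter keeps all matching elements without unwinding
db.coll.aggregate([
    { $match : { "_id" : ObjectId("5492690f72ae469b0e37b61c") } },
    { $project : { array : { $filter : {
        input : "$array",
        as : "el",
        cond : { $eq : [ "$$el.id", 2 ] }
    } } } }
])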

group in aggregate framework stopped working properly

I hate this kind of question, but maybe you can point me to the obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents which has a string field called username on which I have an index. The index was non-unique, but recently I made it unique. Suddenly the following query gives me false alarms that I have duplicates.
db.users.aggregate(
{ $group : {_id : "$username", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} } );
which returns
{
"result" : [
{
"_id" : "davidbeges",
"total" : 2
},
{
"_id" : "jesusantonio",
"total" : 2
},
{
"_id" : "elesitasweet",
"total" : 2
},
{
"_id" : "theschoolofbmx",
"total" : 2
},
{
"_id" : "longflight",
"total" : 2
},
{
"_id" : "thenotoriouscma",
"total" : 2
}
],
"ok" : 1
}
I tested this query on a sample collection with a few documents and it works as expected.
Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.
I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match $sort $limit $skip
And they will work if placed before:
$project $unwind $group
However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.
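For example, the original pipeline with a $sort on username placed first (a sketch based on the JIRA suggestion above; the leading $sort can use the username index and ensures each document is seen only once):
db.users.aggregate(
    { $sort : { username : 1 } },
    { $group : { _id : "$username", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } }
);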