MongoDB Aggregate multiple count and latest date - mongodb

I'm trying to write a Mongo 3.0 query that is beyond my depth and was hoping for a bit of help. Basically, my database has transcription records, each with a user_name, project_id, expedition_id and finished_date. Those are the fields I'm interested in. A project has multiple expeditions, and each expedition has multiple transcriptions.
I would like to display information for a given user on a stats page for a given project. The display would show the User Name, the Total Project Transcriptions that user submitted for the whole project, the Total Participated Expeditions (the number of distinct expeditions the user participated in across the project), and the last date the user actually performed a transcription.
So far, it's easy enough to get the Total Project Transcriptions by matching the project_id and grouping on user_name:
db.transcriptions.aggregate([
  { "$match": { "projectId": 13 } },
  { "$group": {
      "_id": "$user_name",
      "transcriptionCount": { "$sum": 1 }
  }}
])
Each transcription document has an expedition_id field (4, 7, 9, 10, etc.) and a finished_date. So if a user performed 100 transcriptions but only participated in expeditions 7 and 10, the Total Participated Expeditions would be 2.
The last_date is the most recent finished_date, i.e. the last time the user performed a transcription. Example of a returned record:
user_name: john smith
transcriptionCount: 100
expeditionCount: 2
last_date: 2017-08-15
Hope I explained that well enough. Would appreciate any help.

You can try the below aggregation.
db.transcriptions.aggregate([
  { "$match": { "projectId": 13 } },
  { "$sort": { "finished_date": -1 } },
  { "$group": {
      "_id": "$user_name",
      "transcriptionCount": { "$sum": 1 },
      "expedition": { "$addToSet": "$expedition_id" },
      "last_date": { "$first": "$finished_date" }
  }},
  { "$project": {
      "_id": 0,
      "user_name": "$_id",
      "transcriptionCount": 1,
      "expeditionCount": { "$size": "$expedition" },
      "last_date": 1
  }}
])
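To sanity-check what the pipeline computes, here is a plain-JavaScript sketch of the same $sort / $group / $project logic over a handful of invented sample documents (the data and the statsPerUser helper are illustrative, not part of the original question):

```javascript
// Invented sample transcription documents.
const docs = [
  { user_name: "john smith", projectId: 13, expedition_id: 7,  finished_date: "2017-08-15" },
  { user_name: "john smith", projectId: 13, expedition_id: 10, finished_date: "2017-08-01" },
  { user_name: "john smith", projectId: 13, expedition_id: 7,  finished_date: "2017-07-20" },
  { user_name: "jane doe",   projectId: 13, expedition_id: 4,  finished_date: "2017-06-30" },
];

function statsPerUser(transcriptions, projectId) {
  // $match + $sort (descending finished_date)
  const sorted = transcriptions
    .filter(d => d.projectId === projectId)
    .sort((a, b) => (a.finished_date < b.finished_date ? 1 : -1));

  // $group: $sum, $addToSet (the Set), $first (taken at group creation,
  // which sees the newest document because of the descending sort)
  const groups = new Map();
  for (const d of sorted) {
    if (!groups.has(d.user_name)) {
      groups.set(d.user_name, {
        transcriptionCount: 0,
        expeditions: new Set(),
        last_date: d.finished_date,
      });
    }
    const g = groups.get(d.user_name);
    g.transcriptionCount += 1;
    g.expeditions.add(d.expedition_id);
  }

  // $project: expeditionCount = $size of the set
  return [...groups].map(([user_name, g]) => ({
    user_name,
    transcriptionCount: g.transcriptionCount,
    expeditionCount: g.expeditions.size,
    last_date: g.last_date,
  }));
}
```

The Set mirrors $addToSet (duplicate expedition ids collapse), and taking the first document per user after the descending sort mirrors $first for last_date.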

Related

Windowing function in MongoDB

I have a collection that is made up of companies. Each company has a "number_of_employees" as well as an "offices" array of subdocuments, each including "state_code" and "country_code". For example:
{
'_id': ObjectId('52cdef7c4bab8bd675297da5'),
'name': 'Technorati',
'number_of_employees': 35,
'offices': [
{'description': '',
'address1': '360 Post St. Ste. 1100',
'address2': '',
'zip_code': '94108',
'city': 'San Francisco',
'state_code': 'CA',
'country_code': 'USA',
'latitude': 37.779558,
'longitude': -122.393041}
]
}
I'm trying to get the number of employees per state across all companies. My latest attempt looks like:
db.research.aggregate([
{ "$match": {"offices.country_code": "USA" } },
{ "$unwind": "$offices" },
{ "$project": { "_id": 1, "number_of_employees": 1, "offices.state_code": 1 } }
])
But now I'm stuck on how to do the $group. Because the num_of_employees is at the company level and not the office level I want to split them evenly across the offices. For example, if Technorati has 5 offices in 5 different states then each state would be allocated 7 employees.
In SQL I could do this easily enough using a windowed function to get average employees across offices by company and then summing those while grouping by state. I can't seem to find any clear examples of similar functionality in MongoDB though.
Note, this is for a school assignment, so the use of third-party libraries isn't feasible. Also, I'm hoping that this can all be done in a simple snippet of code, possibly even one call. I could certainly create new intermediate collections or do this in Python and process data there, but that's probably outside of the scope of the homework.
Anything to point me in the right direction would be greatly appreciated!
You are actually on the right track. You just need to derive an extra field, numOfEmpPerOffice, using $divide, and $sum it when grouping by state.
db.collection.aggregate([
  { "$match": { "offices.country_code": "USA" } },
  { "$addFields": {
      "numOfEmpPerOffice": {
        "$divide": [ "$number_of_employees", { "$size": "$offices" } ]
      }
  }},
  { "$unwind": "$offices" },
  { "$group": {
      "_id": "$offices.state_code",
      "totalEmp": { "$sum": "$numOfEmpPerOffice" }
  }}
])
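As a quick sanity check of the allocation logic, here is a plain-JavaScript sketch of the same computation over invented sample companies (employeesPerState and the data are illustrative only). For simplicity it filters non-USA offices per office, whereas the real $match keeps whole documents that have at least one USA office:

```javascript
// Invented sample companies.
const companies = [
  { name: "Technorati", number_of_employees: 35,
    offices: [{ state_code: "CA", country_code: "USA" }] },
  { name: "ExampleCo", number_of_employees: 10,
    offices: [{ state_code: "CA", country_code: "USA" },
              { state_code: "NY", country_code: "USA" }] },
];

function employeesPerState(docs) {
  const totals = {};
  for (const c of docs) {
    // $addFields: numOfEmpPerOffice = number_of_employees / $size(offices)
    const perOffice = c.number_of_employees / c.offices.length;
    // $unwind + $group by state_code with $sum
    for (const o of c.offices) {
      if (o.country_code !== "USA") continue; // simplified $match
      totals[o.state_code] = (totals[o.state_code] || 0) + perOffice;
    }
  }
  return totals;
}
```

Here Technorati contributes 35 to CA, and ExampleCo contributes 5 each to CA and NY, matching the "split evenly across offices" requirement.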

MongoDB query is slow and not using Indexes

I have a collection with structure like this:
{
"_id" : ObjectId("59d7cd63dc2c91e740afcdb"),
"enrollment" : [
{ "month":-10, "enroll":'00'},
{ "month":-9, "enroll":'00'},
{ "month":-8, "enroll":'01'},
//other months
{ "month":8, "enroll":'11'},
{ "month":9, "enroll":'11'},
{ "month":10, "enroll":'00'}
]
}
I am trying to run the following query:
db.getCollection('collection').find({
"enrollment": {
"$not": {
"$elemMatch": { "month": { "$gte": -2, "$lte": 9 }, "enroll": "00" }
}
}
}).count()
This query takes 1.6 to 1.9 seconds. I need to get this down as low as possible, to milliseconds if that is possible.
I tried creating multikey indexes on the month and enroll fields. I tried various combinations, but the query does not use any of the indexes.
I tried all these combinations:
1. { 'enrollment.month':1 }
2. { 'enrollment.month':1 }, { 'enrollment.enroll':1 } -- two separate indexes
3. { 'enrollment.month':1, 'enrollment.enroll':1}
4. { 'enrollment.enroll':1, 'enrollment.month':1}
Parsed Query:
Query Plan:
Any suggestions to improve the performance are highly appreciated.
I am fairly confident that the hardware is not an issues but open for any suggestions.
My data size is not huge: it's just under 1 GB. The total number of documents is 41K, and the subdocument count is approximately 13 million.
Note: I have posted a couple of questions on this in the last few days, but with this one I am trying to narrow down the problem area. Please do not take this as a duplicate of my earlier questions.
Try inverting the query. A single $elemMatch cannot express the strict complement directly (negating the conditions inside it changes the meaning from "no element matches" to "some element fails"), but the complement set itself is indexable: count the documents that do have an element with month in [-2, 9] and enroll "00" (a plain $elemMatch, which can use a multikey index on enrollment.month and enrollment.enroll), and subtract that from the total:
var total = db.getCollection('collection').count();
var matching = db.getCollection('collection').count({
  "enrollment": {
    "$elemMatch": { "month": { "$gte": -2, "$lte": 9 }, "enroll": "00" }
  }
});
var result = total - matching; // documents with no such element
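As a sanity check on the predicate logic, here is a plain-JavaScript sketch (the sample documents are invented): a document passes the original filter exactly when no enrollment element has a month in [-2, 9] with enroll "00", so the same count can be obtained as the total minus the documents that do have such an element.

```javascript
// Invented sample documents.
const docs = [
  { enrollment: [{ month: -8, enroll: "01" }, { month: 9, enroll: "11" }] }, // passes
  { enrollment: [{ month: -1, enroll: "00" }] },                             // filtered out
  { enrollment: [{ month: 10, enroll: "00" }] },                             // passes (month out of range)
];

// The $elemMatch predicate that sits inside the $not.
const hasElem = d =>
  d.enrollment.some(e => e.month >= -2 && e.month <= 9 && e.enroll === "00");

// The original $not/$elemMatch query: documents with NO matching element.
const notCount = docs.filter(d => !hasElem(d)).length;

// The complement approach: total minus documents that DO have such an element.
const viaComplement = docs.length - docs.filter(hasElem).length;
```

By De Morgan, "no element matches P" is the complement of "some element matches P", which is why counting the plain $elemMatch and subtracting gives the same answer.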

MongoDB aggregate query extremely slow

I have a MongoDB query here which runs extremely slowly without an index, but the query fields are too big to index, so I'm looking for advice on how to optimise this query or create a valid index for it:
collection.aggregate([{
$match: {
article_id: {
$nin: read_article_ids
},
author_id: {
$in: liked_authors,
$nin: disliked_authors
},
word_count: {
$gte: 1000,
$lte: 10000
},
article_sentiment: {
$elemMatch: {
sentiments: mood
}
}
}
}, {
$sample: {
size: 4
}
}])
The collection in this case is a collection of articles with article_id, author_id, word_count, and article_sentiment. There are around 1.6 million documents in the collection, and a query like this takes upwards of 10 seconds without an index. The box has 56 GB of memory and is all around pretty well specced.
The query's function is to retrieve a batch of 4 articles by authors the user likes, that they've not read, and that match a given sentiment (the article_sentiment key holds a nested array of key:value pairs).
So is this query incorrect for what I'm trying to achieve? Is there a way to improve it?
EDIT: Here is a sample document for this collection.
{
"_id": ObjectId("57f7dd597a1026d326fc02c4"),
"publication_name": "National News Inc",
"author_name": "John Hardwell",
"title": "How Shifting Policy Has Stunted Cultural Growth",
"article_id": "2f0896cd47c9423cb5a309c7277dd90d",
"author_id": "51b7f46f6c0f46f2949608c9ec2624d4",
"word_count": 1202,
"article_sentiment": [{
"sentiments": "happy",
"weight": 0.528596282005
}, {
"sentiments": "serious",
"weight": 0.569274544716
}, {
"sentiments": "relaxed",
"weight": 0.825395524502
}]
}

mongoDB find document greatest date and check value

I have a Conversation collection that looks like this:
[
{
"_id": "QzuTQYkGDBkgGnHrZ",
"participants": [
{
"id": "YymyFZ27NKtuLyP2C"
},
{
"id": "d3y7uSA2aKCQfLySw",
"lastVisited": "2016-02-04T02:59:10.056Z",
"lastMessage": "2016-02-04T02:59:10.056Z"
}
]
},
{
"_id": "e4iRefrkqrhnokH7Y",
"participants": [
{
"id": "d3y7uSA2aKCQfLySw",
"lastVisited": "2016-02-04T03:26:33.953Z",
"lastMessage": "2016-02-04T03:26:53.509Z"
},
{
"id": "SRobpwtjBANPe9hXg",
"lastVisited": "2016-02-04T03:26:35.210Z",
"lastMessage": "2016-02-04T03:15:05.779Z"
}
]
},
{
"_id": "twXPHb76MMxQ3MQor",
"participants": [
{
"id": "d3y7uSA2aKCQfLySw"
},
{
"id": "SRobpwtjBANPe9hXg",
"lastMessage": "2016-02-04T03:27:35.281Z",
"lastVisited": "2016-02-04T03:57:51.036Z"
}
]
}
]
Each conversation (document) can have a participant object with the properties of id, lastMessage, lastVisited.
Sometimes, depending on how new the conversation is, some of these values don't exist just yet (such as lastMessage, lastVisited).
What I'm trying to do is compare the participants within each individual conversation (document) and see if, out of all the participants, the greatest lastMessage value belongs to the logged-in user. Otherwise, I'm assuming that the conversation has messages that the logged-in user hasn't seen yet. I want the count of conversations the user possibly hasn't seen.
In the example above, say we're logged in as d3y7uSA2aKCQfLySw. We can see that he was the last person to send a message in conversations 1 and 2, BUT not 3. So the count of updated conversations that d3y7uSA2aKCQfLySw hasn't seen should be 1.
Can someone point me in the right direction? I haven't the slightest clue as to how to approach the issue. My apologies for the lengthy question.
It is always advisable to store dates as ISODate rather than strings to leverage the flexibility provided by various date operators in the aggregation framework.
One way of getting the count is to:
$match the conversations in which the user is involved.
$unwind the participants field.
$sort by the lastMessage field in descending order
$group by the _id to get back the original conversations intact, and get the latest message per group (conversation) using the $first operator.
$project a field with value 0 for each group whose top record belongs to the user we are looking for, and 1 for the others.
$group again to get the total count of the conversations in which he has not been the last one to send a message.
sample code:
var userId = "d3y7uSA2aKCQfLySw";
db.t.aggregate([
{
$match:{"participants.id":userId}
},
{
$unwind:"$participants"
},
{
$sort:{"participants.lastMessage":-1}
},
{
$group:{"_id":"$_id","lastParticipant":{$first:"$$ROOT.participants"}}
},
{
$project:{
"hasNotSeen":{$cond:[
{$eq:["$lastParticipant.id",userId]},
0,
1
]},
"_id":0}
},
{
$group:{"_id":null,"count":{$sum:"$hasNotSeen"}}
},
{
$project:{"_id":0,"numberOfConversationsNotSeen":"$count"}
}
])
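To see what the pipeline computes, here is a plain-JavaScript sketch of the same steps over a condensed version of the sample conversations (countUnseen and the user id "me" are illustrative helpers, not part of the answer's pipeline):

```javascript
// Condensed, invented version of the sample conversations.
const conversations = [
  { _id: "c1", participants: [
      { id: "other1" },
      { id: "me", lastMessage: "2016-02-04T02:59:10.056Z" }] },
  { _id: "c2", participants: [
      { id: "me",     lastMessage: "2016-02-04T03:26:53.509Z" },
      { id: "other2", lastMessage: "2016-02-04T03:15:05.779Z" }] },
  { _id: "c3", participants: [
      { id: "me" },
      { id: "other2", lastMessage: "2016-02-04T03:27:35.281Z" }] },
];

function countUnseen(convs, userId) {
  return convs
    // $match: conversations the user is involved in
    .filter(c => c.participants.some(p => p.id === userId))
    // $unwind + $sort + $first: participant with the greatest lastMessage
    // (a missing lastMessage sorts last, like a missing field in MongoDB)
    .map(c => [...c.participants]
      .sort((a, b) => ((a.lastMessage || "") < (b.lastMessage || "") ? 1 : -1))[0])
    // $project + $group: count conversations whose latest sender isn't the user
    .filter(last => last.id !== userId)
    .length;
}
```

For these three conversations, "me" was the latest sender in c1 and c2 but not c3, so the unseen count is 1, matching the expected result in the question.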
Alternatively, you could try this function:
function findUnseen(uId) {
var numMessages = db.demo.aggregate(
[
{
$project: {
"participants.lastMessage": 1,
"participants.id": 1
}
},
{$unwind: "$participants"},
{$sort: {"participants.lastMessage": -1}},
{
$group: {
_id: "$_id",
participantsId: {$first: "$participants.id"},
lastMessage: {$max: "$participants.lastMessage"}
}
},
{$match: {participantsId: {$ne: uId}}},
]
).toArray().length;
return numMessages;
}
Calling findUnseen("d3y7uSA2aKCQfLySw") will return 1.
I have written this function just to return the count, but as you can see it's easy to tweak it to return all the unseen message metadata too.

How to return index of array item in Mongodb?

The document is like below.
{
  "title": "Book1",
  "dailyactiviescores": [
    { "date": "2013-06-05", "score": 10 },
    { "date": "2013-06-06", "score": 21 }
  ]
}
The daily activity score is intended to increase each time the book is opened by a reader. The first solution that comes to mind is to use "$" to find whether the target date already has a score, and deal with each case:
err = bookCollection.Update(
  {"title": "Book1", "dailyactiviescores.date": "2013-06-06"},
  {"$inc": {"dailyactiviescores.$.score": 1}})
if err == ErrNotFound {
  bookCollection.Update({"title": "Book1"}, {"$push": ...})
}
But I can't help wondering: is there any way to return the index of an item inside an array? If so, I could use one query to do the job rather than two. Like this:
index = bookCollection.Find(
  {"title": "Book1", "dailyactiviescores.date": "2013-06-06"}).Select({"$index"})
if index != -1 {
  incTarget = FormatString("dailyactiviescores.%d.score", index)
  bookCollection.Update(..., {"$inc": {incTarget: 1}})
} else {
  // push here
}
Incrementing a field that's not present isn't the issue, since $inc: 1 on a missing field just creates it and sets it to 1. The issue is when you don't have an array item corresponding to the date you want to increment.
There are several possible solutions here (that don't involve multiple steps to increment).
One is to pre-create all the dates in the array elements with scores:0 like so:
{
  "title": "Book1",
  "dailyactiviescores": [
    { "date": "2013-06-01", "score": 0 },
    { "date": "2013-06-02", "score": 0 },
    { "date": "2013-06-03", "score": 0 },
    { "date": "2013-06-04", "score": 0 },
    { "date": "2013-06-05", "score": 0 },
    { "date": "2013-06-06", "score": 0 },
    { etc ... }
  ]
}
But how far into the future to go? So one option here is to "bucket" - for example, have an activities document "per month" and before the start of a month have a job that creates the new documents for next month. Slightly yucky. But it'll work.
Other options involve slight changes in schema.
You can use a collection with book, date, activity_scores. Then you can use a simple upsert to increment a score:
db.books.update({title: "Book1", date: "2013-06-02"}, {$inc: {score: 1}}, {upsert: true})
This will increment the score or insert new record with score:1 for this book and date and your collection will look like this:
{
  "title": "Book1",
  "date": "2013-06-01",
  "score": 10
},
{
  "title": "Book1",
  "date": "2013-06-02",
  "score": 1
}, ...
Depending on how much you simplified your example from your real use case, this might work well.
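The upsert pattern above can be sketched in plain JavaScript (incScore and the in-memory array are illustrative stand-ins for the collection): if a (title, date) document exists, its score is incremented; otherwise a new one is inserted with score 1, which is what the matched and upserted branches of the update do.

```javascript
// In-memory stand-in for the book-date-score collection.
const scores = [];

function incScore(coll, title, date) {
  const doc = coll.find(d => d.title === title && d.date === date);
  if (doc) {
    doc.score += 1;                         // matched: $inc the existing doc
  } else {
    coll.push({ title, date, score: 1 });   // upsert: insert with score 1
  }
}
```

Two increments on the same (title, date) yield one document with score 2; a new date yields a second document with score 1.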
Another option is to stick with the array but switch to using the date string as a key that you increment:
Schema:
{
  "title": "Book1",
  "dailyactiviescores": {
    "2013-06-01": 10,
    "2013-06-02": 8
  }
}
Note that it's now a subdocument, not an array, and you can do:
db.books.update({title: "Book1"}, {$inc: {"dailyactiviescores.2013-06-03": 1}})
and it will add the new date key to the subdocument and increment it, resulting in:
{
  "title": "Book1",
  "dailyactiviescores": {
    "2013-06-01": 10,
    "2013-06-02": 8,
    "2013-06-03": 1
  }
}
Note that it's now harder to add up the scores for the book, so you may want to atomically update a "subtotal" in the same update statement, whether for all time or just for the month.
But here it is once again problematic to keep adding days to this subdocument: what happens when you're still around in a few years and these book documents have grown huge?
I suspect that unless you will only keep activity scores for the last N days (which you can do with the capped-array feature in 2.4), it will be simpler to track book activity scores in a separate collection, where each book-day is its own document, than to embed the scores for each day into documents in a collection of books.
According to the docs:
The $inc operator increments a value of a field by a specified amount.
If the field does not exist, $inc sets the field to the specified
amount.
So, if there is no score field in the array item, $inc will set it to 1 in your case. Given this document:
{
  "title": "Book1",
  "dailyactiviescores": [
    { "date": "2013-06-05", "score": 10 },
    { "date": "2013-06-06" }
  ]
}
bookCollection.Update(
  {"title": "Book1", "dailyactiviescores.date": "2013-06-06"},
  {"$inc": {"dailyactiviescores.$.score": 1}})
will result in:
{
  "title": "Book1",
  "dailyactiviescores": [
    { "date": "2013-06-05", "score": 10 },
    { "date": "2013-06-06", "score": 1 }
  ]
}
Hope that helps.
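The quoted $inc behaviour can be illustrated in plain JavaScript (inc is a hypothetical helper, not a MongoDB API): a missing field is treated as 0, so incrementing it creates it with the increment amount.

```javascript
// Illustrative stand-in for $inc: a missing field is treated as 0,
// so incrementing it creates it with the increment amount.
function inc(obj, field, amount) {
  obj[field] = (obj[field] || 0) + amount;
  return obj;
}
```

Incrementing a document without a score field yields score 1, and incrementing an existing score 10 yields 11, mirroring the before/after documents above.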