How do I remove http:// or https:// from the beginning, and '/' from the end, of tags.Domain in a MongoDB aggregation?
Sample document:
{
    "_id" : ObjectId("5d9f074f5833c8cd1f685e05"),
    "tags" : [
        {
            "Domain" : "http://www.google.com",
            "rank" : 1
        },
        {
            "Domain" : "https://www.stackoverflow.com/",
            "rank" : 2
        }
    ]
}
Assuming the Domain field in tags contains valid URLs with the usual prefixes and suffixes (https, http, //, /, com/, org/, /in):
The $trim operator is used to remove https://, http://, and / from tags.Domain.
NOTE: $trim treats chars as a set of individual characters to strip from both ends, not as a literal string, so this breaks URLs that don't actually begin or end with those characters. Example: 'hello.com' would become 'ello.com', 'xyz.ins' would become 'xyz.in', etc.
Aggregation Query:
db.collection.aggregate([
    {
        $addFields: {
            "tags": {
                $map: {
                    "input": "$tags",
                    "as": "tag",
                    "in": {
                        $mergeObjects: [
                            "$$tag",
                            {
                                "Domain": {
                                    $trim: {
                                        "input": "$$tag.Domain",
                                        "chars": "https://"
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
]).pretty()
Output (demo):
{
    "_id" : 2, //ObjectId
    "tags" : [
        {
            "rank" : 1,
            "Domain" : "www.google.com"
        },
        {
            "rank" : 2,
            "Domain" : "www.stackoverflow.com"
        }
    ]
}
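If the caveat above matters for your data, a hedged variant (a sketch, not the original answer's approach) strips the scheme only when the string actually starts with it and then trims trailing slashes, so values like 'hello.com' pass through untouched. $substrCP and $strLenCP need MongoDB 3.4+, and $rtrim, like $trim above, needs 4.0+:
db.collection.aggregate([
    {
        $addFields: {
            "tags": {
                $map: {
                    "input": "$tags",
                    "as": "tag",
                    "in": {
                        $mergeObjects: [
                            "$$tag",
                            {
                                "Domain": {
                                    $rtrim: {
                                        // drop trailing "/" characters only
                                        input: {
                                            $cond: [
                                                { $eq: [{ $substrCP: ["$$tag.Domain", 0, 8] }, "https://"] },
                                                { $substrCP: ["$$tag.Domain", 8, { $strLenCP: "$$tag.Domain" }] },
                                                {
                                                    $cond: [
                                                        { $eq: [{ $substrCP: ["$$tag.Domain", 0, 7] }, "http://"] },
                                                        { $substrCP: ["$$tag.Domain", 7, { $strLenCP: "$$tag.Domain" }] },
                                                        "$$tag.Domain"
                                                    ]
                                                }
                                            ]
                                        },
                                        chars: "/"
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
])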
The solution ended up being longer than I expected it to be (I hope someone can find a more concise solution), but here you go:
db.test.aggregate([
{$unwind:"$tags"}, //unwind tags so that we can separately deal with http and https
{
$facet: {
"https": [{ // the first stage will...
$match: { // only contain documents...
"tags.Domain": /^https.*/ // that are allowed by the match the regex /^https.*/
}
}, {
$addFields: { // for all matching documents...
"tags.Domain": {"$substr": ["$tags.Domain",8,-1]} // we change the tags.Domain field to required substring (skip 8 characters and go on till the last character)
}
}],
"http": [{ // similar as above except we're doing the inverse filter using $not
$match: {
"tags.Domain": { $not: /^https.*/ }
}
}, {
$addFields: { // for all matching documents...
"tags.Domain": {"$substr": ["$tags.Domain",7,-1]} // we change the tags.Domain field to required substring (skip 7 characters and go on till the last character)
}
}
]
}
},
{ $project: { all: { $concatArrays: [ "$https", "$http" ] } } }, //we have two arrays at this point, so we just concatenate them both to have one array called "all"
//unwind and group the array by _id to get the document back in the original format
{$unwind: "$all"},
{$group: {
_id: "$all._id",
tags: {$push: "$all.tags"}
}}
])
To remove the / from the end, you can have another facet with a regex that matches such URLs (something like /.*\/$/ should work), and use that facet in the $concatArrays as well.
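If you would rather not add a third facet for that, a hedged alternative (not part of the original answer) is a single extra stage placed right after the first {$unwind:"$tags"}, before the $facet, so both branches already see slash-free domains ($rtrim needs MongoDB 4.0+):
{ $addFields: { "tags.Domain": { $rtrim: { input: "$tags.Domain", chars: "/" } } } }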
With help from: https://stackoverflow.com/a/49660098/5530229 and https://stackoverflow.com/a/44729563/5530229
As dnickless said in the first answer referred to above, as always with the aggregation framework, it can help to remove individual stages from the end of the pipeline and run the partial query in order to get an understanding of what each individual stage does.
Related
I got a question that I would expect to be pretty simple, but I cannot figure it out. What I want to do is this:
Find all documents in a collection and:
sort the documents by a certain date field
apply distinct on one of its other fields, but return the whole document
Best shown in an example.
This is a mock input:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("1998-11-04T18:46:14.000Z")
},
{
"commandName" : "migration_a",
"executionDate" : ISODate("1970-05-09T20:16:37.000Z")
},
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
The expected output is:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
Or, in other words:
Group the input data by the commandName field
Inside each group sort the documents
Return the newest document from each group
My attempts to write this query have failed:
The distinct() function will only return the value of the field I am distinct-ing on, not the whole document. That makes it unsuitable for my case.
I tried writing an aggregate query, but ran into the issue of how to sort and select a single document from inside each group. The $sort aggregation stage sorts the groups among one another, which is not what I want.
I am not too well-versed in Mongo and this is where I hit a wall. Any ideas on how to continue?
For reference, this is the work-in-progress aggregation query I am trying to expand on:
db.getCollection('some_collection').aggregate([
{ $group: { '_id': '$commandName', 'docs': {$addToSet: '$$ROOT'} } },
{ $sort: {'_id.docs.???': 1}}
])
Post-resolved edit
Thank you for the answers. I got what I needed. For future reference, this is the full query that will do what was requested and also return a list of the filtered documents, not groups.
db.getCollection('some_collection').aggregate([
{ $sort: {'executionDate': 1}},
{ $group: { '_id': '$commandName', 'result': { $last: '$$ROOT'} } },
{ $replaceRoot: {newRoot: '$result'} }
])
The query result without the $replaceRoot stage would be:
[
{
"_id": "migration_a",
"result": {
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
}
},
{
"_id": "migration_b",
"result": {
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
}
]
The outer _id and result are just "group wrappers" around the actual document I want, which is nested under the result key. Moving the nested document to the root of the result is done using the $replaceRoot stage. The query result when using that stage is:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
Try this:
db.getCollection('some_collection').aggregate([
{ $sort: {'executionDate': -1}},
{ $group: { '_id': '$commandName', 'doc': {$first: '$$ROOT'} } }
])
I believe this will result in what you're looking for:
db.collection.aggregate([
{
$group: {
"_id": "$commandName",
"executionDate": {
"$last": "$executionDate"
}
}
}
])
You can check it out here
Of course, if you want to match your expected output exactly, you can add a sort (this may not be necessary since your goal is to simply return the newest document from each group):
{
$sort: {
"executionDate": 1
}
}
You can check this version out here.
The use case the question presents is nearly covered in the $last aggregation operator documentation, which summarises:
the $group stage should follow a $sort stage so that the input documents are in a defined order, since $last simply picks the last document from a group.
Query:
db.collection.aggregate([
{
$sort: {
executionDate: 1
}
},
{
$group: {
_id: "$commandName",
executionDate: {
$last: "$executionDate"
}
}
}
]);
Match documents where a value in an array of sub-documents is greater than some value, but only if the same sub-document contains a field that is equal to some other value
I have a collection that contains documents with an array of sub-documents. This array of sub-documents contains a field that dictates whether or not I can filter the documents in the collection based on another field in the sub-document. This'll make more sense when you see an example of the document.
{
"_id":"ObjectId('XXX')",
"Data":{
"A":"",
"B":"-25.78562 ; 28.35629",
"C":"165"
},
"SubDocuments":[
{
"_id":"ObjectId('XXX')",
"Data":{
"Value":"XXX",
"DataFieldId":"B"
}
},
{
"_id":"ObjectId('XXX')",
"Data":{
"Value":"",
"DataFieldId":"A"
}
},
{
"_id":"ObjectId('XXX')",
"Data":{
"Value":"105",
"DataFieldId":"Z"
}
}
]
}
I only want to match documents that contain sub-documents whose DataFieldId is equal to "Z", and filter for Values greater than 105 only when the DataFieldId is equal to "Z".
Try the query below:
db.collection.aggregate([
    {
        $project: {
            _id: 1,
            Data: 1,
            filteredSubDocuments: {
                $filter: {
                    input: "$SubDocuments",
                    as: "subDoc",
                    cond: {
                        $and: [
                            { $eq: ["$$subDoc.Data.DataFieldId", "Z"] },
                            { $gte: ["$$subDoc.Data.Value", 105] }
                        ]
                    }
                }
            }
        }
    }
])
The resulting response will be:
{
"_id" : ObjectId("5cb09659952e3a179190d998"),
"Data" : {
"A" : "",
"B" : "-25.78562 ; 28.35629",
"C" : "165"
},
"filteredSubDocuments" : [
{
"_id" : "ObjectId('XXX')",
"Data" : {
"Value" : 105,
"DataFieldId" : "Z"
}
}
]
}
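One caveat not covered above: in the sample document Data.Value is a string ("105"), and an aggregation $gte between a string and the number 105 compares by BSON type order (numbers sort before strings), not numerically, so any string value would pass the check. If the values really are stored as strings, a hedged variant of the same query converts them first ($convert needs MongoDB 4.0+; values that cannot be converted fall back to null and are excluded):
db.collection.aggregate([
    {
        $project: {
            _id: 1,
            Data: 1,
            filteredSubDocuments: {
                $filter: {
                    input: "$SubDocuments",
                    as: "subDoc",
                    cond: {
                        $and: [
                            { $eq: ["$$subDoc.Data.DataFieldId", "Z"] },
                            { $gte: [
                                { $convert: { input: "$$subDoc.Data.Value", to: "int", onError: null } },
                                105
                            ] }
                        ]
                    }
                }
            }
        }
    }
])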
This can be done with the $elemMatch operator on sub-documents (see the $elemMatch documentation for details). For your problem you can try the query below using $elemMatch, which is much simpler than aggregation:
db.collectionName.find({
"SubDocuments": {
$elemMatch: {
"Data.DataFieldId": "Z" ,
"Data.Value" : {$gte: 105}
}
} })
It's working fine, I have verified it locally. The one modification you need is to store the value of SubDocuments.Data.Value as a Number or Long, as per your requirements, since the $gte comparison is numeric.
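If existing documents already hold those values as strings, a hedged one-off migration sketch (it assumes a MongoDB 4.2+ pipeline update; unconvertible values are left as they are) could convert them in place:
db.collectionName.updateMany(
    { "SubDocuments.Data.Value": { $type: "string" } },
    [
        { $set: {
            SubDocuments: {
                $map: {
                    input: "$SubDocuments",
                    as: "sd",
                    in: {
                        $mergeObjects: [
                            "$$sd",
                            { Data: { $mergeObjects: [
                                "$$sd.Data",
                                { Value: { $convert: { input: "$$sd.Data.Value", to: "int", onError: "$$sd.Data.Value" } } }
                            ] } }
                        ]
                    }
                }
            }
        } }
    ]
)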
WHAT I WANT TO ACHIEVE
Let us say I have this object:
{
"_id" : ObjectId("5aec063380a7490014e88792"),
"personal_info" : {
"dialing_code": "+44",
"phone_number": "67467885664"
}
}
I need to concat the two values personal_info.dialing_code (+44) and personal_info.phone_number (67467885664) into one, +4467467885664, and compare it to a value. I need to retrieve the specific record from the database that matches that value.
PROBLEM
I am having trouble concatenating two fields inside a subdocument, and I am receiving this error:
{
"name": "MongoError",
"message": "$concat only supports strings, not object",
"ok": 0,
"errmsg": "$concat only supports strings, not object",
"code": 16702,
"codeName": "Location16702"
}
ATTEMPT #1
I have tried this:
UserModel.aggregate([
{ $unwind: '$personal_info' },
{$project: {
concat_p: {$concat: [
'$personal_info.dialing_code',
'$personal_info.phone_number'
]}
}}
])
It gives me the error mentioned above, and as a result I cannot do a $match right after.
ATTEMPT #2
I also tried this:
UserModel.aggregate([
{ $unwind: '$personal_info' },
{$project: {
p_dialing_code: '$personal_info.dialing_code',
p_phone_number: '$personal_info.phone_number',
concat_p: {$concat: [
'$p_dialing_code',
'$p_phone_number'
]}
}}
])
I successfully pulled the subdocument values out one level; however, when I tried concatenating them, it produced null values. This is the result I am getting:
{
"_id": "5af0998036daa90014129d6e",
"p_dialing_code": "+44",
"p_phone_number": "13231213213244",
"concat_p": null
}
I know how to do it in the $match pipeline stage, but I have had no luck concatenating the values inside the subdocument. Clearly, I need to do this first before I can compare. Thanks
It seems like you have different types under the personal_info.dialing_code and personal_info.phone_number fields. In your example, $concat is applied to every document in your collection, and that's why you're getting an exception: $concat strictly expects its parameters to be strings.
So it will work fine for the document posted in your question, but it will throw an exception for something like this:
{
"_id" : ObjectId("5aec063380a7490014e88792"),
"personal_info" : {
"dialing_code": {},
"phone_number": "67467885664"
}
}
One way to fix this is to add a $match condition before the $project and use the $type operator to keep only the documents that have strings in the fields you want to concatenate.
db.UserModel.aggregate([
{
$match: {
$expr: {
$and: [
{ $eq: [ { $type: "$personal_info.dialing_code" }, "string" ] },
{ $eq: [ { $type: "$personal_info.phone_number" }, "string" ] }
]
}
}
},
{$project: {
concat_p: {$concat: [
"$personal_info.dialing_code",
"$personal_info.phone_number"
]}
}}
])
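Alternatively, a hedged sketch (not from the answer above): instead of filtering the offending documents out, coerce both fields to strings so every document still appears, then $match against the target value from the question. $convert needs MongoDB 4.0+; values that cannot be converted become empty strings:
db.UserModel.aggregate([
    { $project: {
        concat_p: { $concat: [
            { $convert: { input: "$personal_info.dialing_code", to: "string", onError: "", onNull: "" } },
            { $convert: { input: "$personal_info.phone_number", to: "string", onError: "", onNull: "" } }
        ] }
    } },
    { $match: { concat_p: "+4467467885664" } }
])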
My problem is difficult to explain:
On my website I save every action of my visitors (view, click, buy, etc.).
I have a simple collection named "flow" where my data is recorded:
{
"_id" : ObjectId("534d4a9a37e4fbfc0bf20483"),
"profile" : ObjectId("534bebc32939ffd316a34641"),
"activities" : [
{
"id" : ObjectId("534bebc42939ffd316a3af62"),
"date" : ISODate("2013-12-13T22:39:45.808Z"),
"verb" : "like",
"product" : "5"
},
{
"id" : ObjectId("534bebc52939ffd316a3f480"),
"date" : ISODate("2013-12-20T19:19:10.098Z"),
"verb" : "view",
"product" : "6"
},
{
"id" : ObjectId("534bebc32939ffd316a3690f"),
"date" : ISODate("2014-01-01T07:11:44.902Z"),
"verb" : "buy",
"product" : "5"
},
{
"id" : ObjectId("534bebc42939ffd316a3741b"),
"date" : ISODate("2014-01-11T08:49:02.684Z"),
"verb" : "favorite",
"product" : "26"
}
]
}
I would like to aggregate this data to retrieve the number of people who performed one action (for example "view") and then another one later in time (for example "buy"). To do that I need to compare the "date" values inside my "activities" array...
I tried to use the aggregation framework for this, but I do not see how to build the query.
This is my starting point:
db.flows.aggregate([
{ $project: { profile: 1, activities: 1, _id: 0 } },
{ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}, //First verb + second verb
{ $unwind: '$activities' },
{ $match: { 'activities.verb': {$in:['view', 'buy']} } }, //First verb + second verb,
{
$group: {
_id: '$profile',
view: { $push: { $cond: [ { $eq: [ "$activities.verb", "view" ] } , "$activities.date", null ] } },
buy: { $push: { $cond: [ { $eq: [ "$activities.verb", "buy" ] } , "$activities.date", null ] } }
}
}
])
Maybe the format of my "flow" collection is not the best for what I want to do. If you have a better idea, don't hesitate.
Thank you for your help!
Here is the aggregation that will give you the total number of buyers who viewed first and then bought (though not necessarily the same product that they viewed).
db.flow.aggregate(
    { $match: { "activities.verb": { $all: ["view", "buy"] } } },
    { $unwind: "$activities" },
    { $match: { "activities.verb": { $in: ["view", "buy"] } } },
    { $group: {
        _id: "$_id",
        firstViewed: { $min: { $cond: {
            if: { $eq: ["$activities.verb", "view"] },
            then: "$activities.date",
            else: new Date(9999, 0, 1)
        } } },
        lastBought: { $max: { $cond: {
            if: { $eq: ["$activities.verb", "buy"] },
            then: "$activities.date",
            else: new Date(1900, 0, 1)
        } } }
    } },
    { $project: { viewedThenBought: { $cond: {
        if: { $gt: ["$lastBought", "$firstViewed"] },
        then: 1,
        else: 0
    } } } },
    { $group: { _id: null, totalViewedThenBought: { $sum: "$viewedThenBought" } } }
)
Here you first pass through the pipeline only the documents that have all the "verbs" you are interested in. When you group the first time, you want to use the earliest "view" and the last "buy" and the next project compares them to see if they viewed before they bought.
The last step gives you the count of all the people who satisfied your criteria.
Be careful to leave out all $project phases that don't actually compute any new fields (like your very first $project). The aggregation framework is smart enough to never pass through any fields that it sees are not used in any later stages, so there is never a need to $project just to "eliminate" fields; that happens automatically.
For your query:
I would like to aggregate these data to retrieve the number of people who made an action
Try this:
db.flows.aggregate([
// De-normalize the array into individual documents
{"$unwind" : "$activities"},
// Match for the verbs you are interested in
{"$match" : {"activities.verb":{$in:["buy", "view"]}}},
// Group by verb to get the count
{"$group" : {_id:"$activities.verb", count:{$sum:1}}}
])
The above query would produce an output like:
{
"result" : [
{
"_id" : "buy",
"count" : 1
},
{
"_id" : "view",
"count" : 1
}
],
"ok" : 1
}
Note: MongoDB applies a logical AND by default when you specify multiple conditions on different fields, so an explicit $and is usually not needed. Here, however, both conditions are on the same activities.verb field, so the explicit $and in your query ({ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}), or $all as used in the other answer, is actually required, because a plain query document cannot repeat the same key. If you need a logical OR, the $or operator is required.
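For reference, two equivalent ways to require both verbs on the same field:
{ $match: { "activities.verb": { $all: ["view", "buy"] } } } // $all form, as in the other answer
{ $match: { $and: [ { "activities.verb": "view" }, { "activities.verb": "buy" } ] } } // explicit $and form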
If you want to use the date in the aggregation query to do queries like how many "views by day", etc.. the Date Aggregation Operators will come in handy.
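As a quick, hedged illustration of that "views by day" idea (not something asked in the question), $dateToString can serve as the group key:
db.flows.aggregate([
    // one document per activity
    { "$unwind": "$activities" },
    // keep only the views
    { "$match": { "activities.verb": "view" } },
    // count views per calendar day
    { "$group": {
        "_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$activities.date" } },
        "views": { "$sum": 1 }
    } },
    { "$sort": { "_id": 1 } }
])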
I see where you are going with this and I think you are basically on the right track. So, more or less unaltered (apart from a formatting preference) and with a few tweaks at the end:
db.flows.aggregate([
// Try to $match "first" always to make sure you can get an index
{ "$match": {
"$and": [
{"activities.verb": "view"},
{"activities.verb": "buy"}
]
}},
// Don't worry, the optimizer "sees" this and will sort of "blend" it
// with the first stage.
{ "$project": {
"profile": 1,
"activities": 1,
"_id": 0
}},
{ "$unwind": "$activities" },
{ "$match": {
"activities.verb": { "$in":["view", "buy"] }
}},
{ "$group": {
"_id": "$profile",
"view": { "$min": { "$cond": [
{ "$eq": [ "$activities.verb", "view" ] },
"$activities.date",
null
]}},
"buy": { "$max": { "$cond": [
{ "$eq": [ "$activities.verb", "buy" ] },
"$activities.date",
null
]}}
}},
{ "$project": {
"viewFirst": { "$lt": [ "$view", "$buy" ] }
}}
])
So essentially the $min and $max operators should be self-explanatory in this context: you are looking for the "first" view to correspond with the "last" purchase. Ideally, and it would make sense, you would actually match these up by product (hint: grouping), but I'll leave that part up to you (see the sketch after the next paragraph).
The other advantage here is that the false values will always be negated if there is an actual date to match the "verb". Otherwise this goes through as false, and that turns out to be okay.
That is because the next thing you do is $project to "compare" the values and ask the question "Did the 'view' happen before the 'buy'?", which is a logical evaluation with the "less than" $lt operator.
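A hedged sketch of that per-product matching (not part of the pipeline above): group on both the profile and the product, keeping the other stages unchanged, so a "view" is only paired with a "buy" of the same product:
{ "$group": {
    "_id": { "profile": "$profile", "product": "$activities.product" },
    "view": { "$min": { "$cond": [
        { "$eq": [ "$activities.verb", "view" ] },
        "$activities.date",
        null
    ]}},
    "buy": { "$max": { "$cond": [
        { "$eq": [ "$activities.verb", "buy" ] },
        "$activities.date",
        null
    ]}}
}}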
As for the schema itself: if you are storing a lot of these "events" then you are probably better off flattening things out into separate documents and finding some way to mark each with the same "session" identifier, if that is separate from "profile".
Getting away from large arrays (which this design seems to lead to) is likely to help performance, and, done with care, it makes little difference to the aggregation process.
Let's say I have mongo documents like this:
Question 1
{
answers:[
{content: 'answer1'},
{content: '2nd answer'}
]
}
Question 2
{
answers:[
{content: 'answer1'},
{content: '2nd answer'},
{content: 'The third answer'}
]
}
Is there a way to order the collection by size of answers?
After a little research I saw suggestions to add another field that would contain the number of answers and use that as a reference, but maybe there is a native way to do it?
I thought you might be able to use $size, but that only finds arrays of a certain size; it does not help with ordering.
From the mongo documentation:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24size
You cannot use $size to find a range of sizes (for example: arrays with more than 1 element). If you need to query for a range, create an extra size field that you increment when you add elements. Indexes cannot be used for the $size portion of a query, although if other query expressions are included indexes may be used to search for matches on that portion of the query expression.
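A hedged sketch of that suggestion (the answers_count field name and the someQuestionId placeholder are my own): keep the counter in sync whenever an answer is pushed, then sort on it normally, optionally backed by an index:
// increment the counter together with the push
db.test.updateOne(
    { _id: someQuestionId },
    { $push: { answers: { content: 'a new answer' } }, $inc: { answers_count: 1 } }
)
// an index makes the sort cheap
db.test.createIndex({ answers_count: -1 })
db.test.find().sort({ answers_count: -1 })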
Looks like you can probably fairly easily do this with the new aggregation framework, edit: which isn't out yet.
http://www.mongodb.org/display/DOCS/Aggregation+Framework
Update: Now the Aggregation Framework is out...
> db.test.aggregate([
{$unwind: "$answers"},
{$group: {_id:"$_id", answers: {$push:"$answers"}, size: {$sum:1}}},
{$sort:{size:1}}]);
{
"result" : [
{
"_id" : ObjectId("5053b4547d820880c3469365"),
"answers" : [
{
"content" : "answer1"
},
{
"content" : "2nd answer"
}
],
"size" : 2
},
{
"_id" : ObjectId("5053b46d7d820880c3469366"),
"answers" : [
{
"content" : "answer1"
},
{
"content" : "2nd answer"
},
{
"content" : "The third answer"
}
],
"size" : 3
}
],
"ok" : 1
}
I use $project for this:
db.test.aggregate([
{
$project : { answers_count: {$size: { "$ifNull": [ "$answers", [] ] } } }
},
{
$sort: {"answers_count":1}
}
])
It also allows you to include documents with empty answers.
But it also has a disadvantage (or sometimes an advantage): you have to manually add all the fields you need in the $project stage.
You can use the MongoDB aggregation stage $addFields, which adds an extra field to store the count, followed by a $sort stage.
db.test.aggregate([
{
$addFields: { answers_count: {$size: { "$ifNull": [ "$answers", [] ] } } }
},
{
$sort: {"answers_count":1}
}
])
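If the helper field should not show up in the output, a small optional final stage can drop it again ($unset needs MongoDB 4.2; a { $project: { answers_count: 0 } } stage works on older versions):
db.test.aggregate([
    { $addFields: { answers_count: { $size: { "$ifNull": [ "$answers", [] ] } } } },
    { $sort: { "answers_count": 1 } },
    // remove the helper field from the final output
    { $unset: "answers_count" }
])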
You can use the $size operator to order by array length.
db.getCollection('test').aggregate([
{$project: { "answers": 1, "answer_count": { $size: "$answers" } }},
{$sort: {"answer_count": -1}}])