How to do SQL INTERSECT OPERATION IN MONGODB - mongodb

SELECT SOME_COLUMN
FROM TABLE
WHERE SOME_COLUMN_NAME = 'VALUE'
INTERSECT
SELECT SOME_COLUMN
FROM TABLE
WHERE SOME_COLUMN_NAME_VALUE = 'NEW_VALUE'
How to get the common or intersection values for the 2 queries (using INTERSECT operator in SQL) in MongoDB?
INTERSECT is a keyword for SQL, how is it done for MongoDB?

As with so many things from SQL, there is no exact counterpart for SQL INTERSECT in MongoDB, but depending on the actual problem there might be an alternative solution.
At the time of writing, MongoDB has no operations that affect more than one collection, so an intersection between two collections cannot be done entirely in the database.
When both queries run against the same collection, aggregation may be able to do it. The exact approach depends on what you actually want to achieve.
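For the single-collection case in the question, one possible shape is to group on the selected column and record which of the two conditions produced each value. This is only a sketch: `someColumn`, `colA` and `colB` are hypothetical field names standing in for the columns in the SQL statement.

```javascript
// Hypothetical field names: `someColumn` is the selected column, while
// `colA` and `colB` stand in for the two WHERE conditions in the question.
const intersectPipeline = [
  // Keep only documents that satisfy at least one of the two conditions
  { $match: { $or: [{ colA: "VALUE" }, { colB: "NEW_VALUE" }] } },
  // One group per distinct value; flag which condition(s) produced it
  { $group: {
      _id: "$someColumn",
      inFirst: { $max: { $cond: [{ $eq: ["$colA", "VALUE"] }, 1, 0] } },
      inSecond: { $max: { $cond: [{ $eq: ["$colB", "NEW_VALUE"] }, 1, 0] } }
  } },
  // INTERSECT: the value must have been produced by both conditions
  { $match: { inFirst: 1, inSecond: 1 } },
  { $project: { _id: 0, someColumn: "$_id" } }
];

// In the mongo shell: db.collection.aggregate(intersectPipeline)
```

Grouping also removes duplicates, which matches the behavior of INTERSECT in SQL.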

Your question seems a little off, with the different values "VALUE" and "NEW_VALUE" in each sub-query. The point of INTERSECT is matching on the column(s) with the "same" value.
But as long as you are talking about the same collection, you can get the intersection of two columns using the aggregation framework like so:
db.collection.aggregate([
// Get the "sets" for each field
{ "$group": {
"_id": null,
"field1": { "$addToSet": "$field1" },
"field2": { "$addToSet": "$field2" }
}},
// Intersect the "sets"
{ "$project": {
"same": { "$setIntersection": [ "$field1", "$field2" ] }
}},
// Unwind the result set
{ "$unwind": "$same" },
// Just project the wanted field
{ "$project": { "_id": 0, "same": 1 } }
])
That does make use of the $setIntersection operator introduced in MongoDB 2.6 in order to return a "set" with the common elements from the two "sets" being compared. The $addToSet operation constructs the two sets from the "unique" values in each field.
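To make the two steps concrete, here is a plain JavaScript sketch of what the pipeline above computes. It runs in memory on made-up sample documents and only illustrates the logic; it is not how the server implements it.

```javascript
// Made-up sample documents; the field names match the pipeline above
const docs = [
  { field1: "a", field2: "b" },
  { field1: "b", field2: "c" },
  { field1: "c", field2: "c" },
  { field1: "d", field2: "e" }
];

// $group with $addToSet: collect the distinct values of each field
const field1 = new Set(docs.map(d => d.field1));
const field2 = new Set(docs.map(d => d.field2));

// $setIntersection: keep the values present in both sets
const same = [...field1].filter(v => field2.has(v));

console.log(same); // [ 'b', 'c' ]
```

The order here follows JavaScript's insertion order; MongoDB makes no promise about the order of elements in a set result.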
You can essentially do the same thing if your available MongoDB version is prior to 2.6, but just with a little more work:
db.collection.aggregate([
// Group each "set"
{ "$group": {
"_id": null,
"field1": { "$addToSet": "$field1" },
"field2": { "$addToSet": "$field2" }
}},
// Unwind each set
{ "$unwind": "$field1" },
{ "$unwind": "$field2" },
// Group on the compared values
{ "$group": {
"_id": null,
"same": {
"$addToSet": {
"$cond": [
{ "$eq": [ "$field1", "$field2" ] },
"$field1",
false
]
}
}
}},
// Unwind again, should be compacted now
{ "$unwind": "$same" },
// Filter out the "false" values
{ "$match": { "same": { "$ne": false } } },
// Just project the wanted field
{ "$project": { "_id": 0, "same": 1 } }
])
Lacking support for the "set operators" in earlier versions, you emulate the behavior by comparing the values of the two "sets". This works because $unwind produces a new document for each array element, so "unwinding" one array on top of another yields a document for every pair of elements, and each pair can then be compared.
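The pairing that the double $unwind creates can be sketched in plain JavaScript. The arrays below are made-up stand-ins for the two sets after the first $group stage.

```javascript
const field1 = ["a", "b", "c"];
const field2 = ["b", "c", "e"];

// Two consecutive $unwind stages: one document per (field1, field2) pair
const pairs = field1.flatMap(f1 => field2.map(f2 => ({ field1: f1, field2: f2 })));

// $group with $addToSet over $cond: the value on a match, otherwise false
const collected = new Set(pairs.map(p => (p.field1 === p.field2 ? p.field1 : false)));

// $unwind followed by $match { same: { $ne: false } }
const same = [...collected].filter(v => v !== false);

console.log(pairs.length, same); // 9 [ 'b', 'c' ]
```

The number of pairs is the product of the two set sizes, which is why this form needs more memory. A genuine `false` value in the data would also be filtered out by the final step.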
So with the single collection form this is a perfectly valid operation in order to get the "intersection". Like all things in MongoDB though, the general gearing is towards working with a single collection at a time. The general onus is on your design to structure the data so that comparisons are made on a single collection.
Similar results can be obtained with an incremental mapReduce process over multiple collections, but as your general question seems to refer to a single table source then this would in fact be a different question to the one you appear to be asking. Also of course, it is not a single operation and involves multiple processing steps.
You would generally be advised to take a good look at the manual section on SQL to aggregation mapping. This gives many common examples and is getting better over time to add additional use cases.

Related

MongoDB Aggregation: Dedupe by array in subdocuments

I have an aggregation query which calculates records by tag combinations. The query works well, but it has one issue: it produces duplicate documents for tag combinations that are in a different order. For example, I could have one document with the tags ['one', 'two'] and a second document with ['two', 'one'], while the rest of the data would be exactly the same.
My first thought was to use a $group stage and find out how to order the arrays in a $project stage, but I cannot find anywhere how to do this. I did see that for update queries you can use '$push'; however, this feature doesn't seem to exist for $project.
An example document at this phase looks something like this:
_id: "sadasdsad"
tags: ['one', 'two'],
total_count:37,
second_count:14,
What would be the best approach to solving this issue?
You can sort your array using $unwind, $sort and finally $group, so that all your tags are in the same order before grouping. Example: https://mongoplayground.net/p/EZi04LfY1ff
However, I would try to store those tags already sorted, so you can avoid these steps.
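db.collection.aggregate([
// One document per tag
{
"$unwind": "$tag"
},
// Order the tags within each original document
{
"$sort": {
key: 1,
tag: 1
}
},
// Rebuild each tag array, now in sorted order
{
"$group": {
"_id": "$key",
"tag": {
"$push": "$tag"
}
}
},
// Documents whose sorted tag arrays are identical fall into one group
{
"$group": {
"_id": "$tag",
"field": {
"$push": "$$ROOT"
}
}
}
])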
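The suggestion to store the tags already sorted could look like the following on the application side. `normalizeTags` is a hypothetical helper name, and the sketch assumes the tags are plain strings.

```javascript
// Hypothetical helper: call it on every document before insert or update
function normalizeTags(tags) {
  // Copy first so the caller's array is not mutated, then sort
  return [...tags].sort();
}

const a = normalizeTags(["two", "one"]);
const b = normalizeTags(["one", "two"]);

console.log(a, b); // [ 'one', 'two' ] [ 'one', 'two' ]
```

With every stored array in the same order, a plain $group on the tags field treats both documents as the same combination.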

MapReduce: aggregate in map function?

Suppose you have a DB where every document is a tweet from Twitter, and you want, with MapReduce, to generate another document that contains:
Number of tweets published on every country
List of words contained in those tweets, with a counter that counts the total hits of that word. This, for every country too.
My question: is it fine to aggregate and count the words on the map function, and then again on the reduce function? Doing it like this, the output of the map function represents the information of a single tweet, and the reduce function aggregates the info from several tweets, all from the same country, but I don't know if this is a good practice with the MapReduce algorithm...
Thank you in advance!
In MongoDB 3.4 you can do this with the aggregation framework.
For the first bullet, you just use the $group operator on the country field and count the tweets.
For the second bullet, you use the $split operator (new in 3.4) on the tweet text field, then $unwind the generated array, and finally $group with the word as _id, or country + word as _id.
If you have an older version of MongoDB then you have to use map-reduce, but keep in mind that the aggregation framework is much faster than map-reduce in MongoDB.
$split: https://docs.mongodb.com/manual/reference/operator/aggregation/split/#exp._S_split
$unwind: https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/
$group: https://docs.mongodb.com/manual/reference/operator/aggregation/group/
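On the original question of counting in both map and reduce: that pattern is fine as long as reduce returns a value of the same shape that map emits, because reduce can be applied again to its own output for the same key. The sketch below illustrates this in plain JavaScript with an in-memory driver instead of the database. `mapTweet` and `reduceCountry` are hypothetical names, and the `text` and `user.country` fields are assumed from the answers here.

```javascript
// Map: one emit per tweet, already holding that tweet's own word counts
function mapTweet(tweet) {
  const words = new Map();
  for (const w of tweet.text.split(" ")) {
    words.set(w, (words.get(w) || 0) + 1);
  }
  return { key: tweet.user.country, value: { tweets: 1, words } };
}

// Reduce: merges values of the same shape that map emits, so it can safely
// be re-applied to its own output
function reduceCountry(key, values) {
  const out = { tweets: 0, words: new Map() };
  for (const v of values) {
    out.tweets += v.tweets;
    for (const [w, n] of v.words) {
      out.words.set(w, (out.words.get(w) || 0) + n);
    }
  }
  return out;
}

// Tiny in-memory driver standing in for the database (made-up tweets)
const tweets = [
  { text: "hello world", user: { country: "ES" } },
  { text: "hello again hello", user: { country: "ES" } },
  { text: "bonjour", user: { country: "FR" } }
];
const grouped = {};
for (const { key, value } of tweets.map(mapTweet)) {
  (grouped[key] = grouped[key] || []).push(value);
}
const result = {};
for (const [key, values] of Object.entries(grouped)) {
  result[key] = reduceCountry(key, values);
}

console.log(result.ES.tweets, result.ES.words.get("hello")); // 2 3
```

Because the tweet counter travels inside the emitted value, this form yields both the number of tweets and the word totals per country in one pass.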
Building from the great answer above by Moi Syme, you ideally would run the following aggregate operation to get the desired result:
db.tweets.aggregate([
{ "$project": { "wordList": { "$split": [ "$text", " " ] }, "user.country": 1 } },
{ "$unwind": "$wordList" },
{
"$group": {
"_id": {
"country": "$user.country",
"word": "$wordList"
},
"count": { "$sum": 1 }
}
},
{
"$group": {
"_id": "$_id.country",
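// Note: this sums one per (country, word) group, so it yields the number of distinct words per country rather than the number of tweets
"numberOfTweets": { "$sum": 1 },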
"counts": {
"$push": {
"word": "$_id.word",
"count": "$count"
}
}
}
}
])

How do I use mongodb to count only collections that match two fields

I have some documents that have a newOne field set to true or false and that have an owner tag on them. I want to count all documents that have both newOne: true and the owner field equal to "MSlaton". How do I go about this in MongoDB?
Thank you!
You could use the count() method as
db.collection.count( { "newOne": true, "owner": "MSlaton" } )
which is equivalent to
db.collection.find( { "newOne": true, "owner": "MSlaton" } ).count()
Another route, albeit slower, would be via the aggregation framework where you run the following aggregation operation to get the count:
db.collection.aggregate([
{ "$match" : { "newOne": true, "owner": "MSlaton" } },
{ "$group": { "_id": null, "count": { "$sum": 1 } } }
]);
The aggregation operation is slower because every matching document has to be read and passed through the pipeline; it only comes within the same order of magnitude as count() on a significantly large collection.

Mongodb Aggregate a $slice to get an element in exact position from a nested array

I would like to retrieve a value from a nested array where it exists at an exact position within the array.
I want to create name value pairs by doing $slice[0,1] for the name and then $slice[1,1] for the value.
Before I attempt to use aggregate, I want to attempt a find within a nested array. I can do what I want on a single depth array in a document as shown below:
{
"_id" : ObjectId("565cc5261506995581569439"),
"a" : [
4,
2,
8,
71,
21
]
}
I apply the following: db.getCollection('anothertest').find({},{_id:0, a: {$slice:[0,1]}})
and I get:
{
"a" : [
4
]
}
This is fantastic. However, what if the array I want to $slice [0,1] is located within the document at objectRawOriginData.Reports.Rows.Rows.Cells?
If I can first of all FIND then I want to apply the same as an AGGREGATE.
Your best bet here, especially if your application is not yet ready for release, is to hold off until MongoDB 3.2 for deployment, or at least start working with a release candidate in the interim. The main reason is that the "projection" $slice does not work with the aggregation framework, and neither do the other forms of array matching projection. But this has been addressed for the upcoming release.
This is going to give you a couple of new operators, $slice and $arrayElemAt, which can be used to address array elements by position in the aggregation pipeline.
Either:
db.getCollection('anothertest').aggregate([
{ "$project": {
"_id": 0,
"a": { "$slice": ["$a",0,1] }
}}
])
Which returns the familiar:
{ "a" : [ 4 ] }
Or:
db.getCollection('anothertest').aggregate([
{ "$project": {
"_id": 0,
"a": { "$arrayElemAt": ["$a", 0] }
}}
])
Which is just the element and not an array:
{ "a" : 4 }
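For the name/value pairs mentioned in the question, the two positions can be read in a single $project stage. This is a sketch against the sample document, with `name` and `value` as hypothetical output field names.

```javascript
const nameValuePipeline = [
  { $project: {
      _id: 0,
      name: { $arrayElemAt: ["$a", 0] },  // element at position 0
      value: { $arrayElemAt: ["$a", 1] }  // element at position 1
  } }
];

// In the shell: db.getCollection('anothertest').aggregate(nameValuePipeline)
```

For the sample document this would be expected to give a name of 4 and a value of 2.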
Until that release becomes available other than in release candidate form, the currently available operators make it quite easy for the "first" element of the array:
db.getCollection('anothertest').aggregate([
{ "$unwind": "$a" },
{ "$group": {
"_id": "$_id",
"a": { "$first": "$a" }
}}
])
Through use of the $first operator after $unwind. But getting another indexed position becomes horribly iterative:
db.getCollection('anothertest').aggregate([
{ "$unwind": "$a" },
// Keeps the first element
{ "$group": {
"_id": "$_id",
"first": { "$first": "$a" },
"a": { "$push": "$a" }
}},
{ "$unwind": "$a" },
// Removes the first element
{ "$redact": {
"$cond": {
"if": { "$ne": [ "$first", "$a" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
// Top is now the second element
{ "$group": {
"_id": "$_id",
"second": { "$first": "$a" }
}}
])
And so on, with a lot of extra handling needed to deal with arrays that might be shorter than the "nth" element you are looking for. So "possible", but ugly and not performant.
Also note that this is "not really" working with "indexed positions", and is purely matching on values. So duplicate values would easily be removed, unless there was another unique identifier per array element to work with. The future $unwind also has the ability to project an array index, which is handy for other purposes, but the other operators are more useful for this specific case than that feature.
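For completeness, the array index option of $unwind in 3.2 is `includeArrayIndex`. The sketch below shows how it could select the second element by position rather than by value. `idx` and `second` are hypothetical field names, and the small in-memory part only illustrates why duplicates survive.

```javascript
const secondElementPipeline = [
  // `idx` is a hypothetical field name receiving each element's position
  { $unwind: { path: "$a", includeArrayIndex: "idx" } },
  // Positions are zero-based, so 1 selects the second element
  { $match: { idx: 1 } },
  { $project: { _id: 1, second: "$a" } }
];

// In-memory illustration with a duplicated first value
const a = [4, 4, 8];
const unwound = a.map((value, idx) => ({ a: value, idx }));
const second = unwound.find(d => d.idx === 1).a;

console.log(second); // 4
```

Since the match is on the position, the duplicate 4 is still returned as the second element, which the value-based $redact approach above could not do.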
So for my money I would wait till you had the feature available to be able to integrate this in an aggregation pipeline, or at least re-consider why you believe you need it and possibly design around it.

MongoDB Nested Array Intersection Query

Thank you in advance for your help.
I have a mongoDB database structured like this:
{
'_id' : objectID(...),
'userID' : id,
'movies' : [{
'movieID' : movieID,
'rating' : rating
}]
}
My question is:
I want to search for a specific user, for example the one that has 'userID' : 3, and get all his movies. Then I want to get all the other users that have at least 15 movies with the same 'movieID'. Finally, from that group I want to select only the users that share those 15 movies and also have one extra 'movieID' that I choose.
I already tried aggregation, but failed. If I do single queries, such as getting all the movies from a user and then cycling through every user's movies and comparing them, it takes a lot of time.
Any ideas?
Thank you
There are a couple of ways to do this using the aggregation framework
Just a simple set of data for example:
{
"_id" : ObjectId("538181738d6bd23253654690"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 2, "rating": 6 },
{ "_id": 3, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654691"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 4, "rating": 6 },
{ "_id": 2, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654692"),
"movies": [
{ "_id": 2, "rating": 5 },
{ "_id": 5, "rating": 6 },
{ "_id": 6, "rating": 7 }
]
}
Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.
For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document if you want to keep more than `_id`
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
}},
// Unwind the array
{ "$unwind": "$movies" },
// Build the array back with just `_id` values
{ "$group": {
"_id": "$_id",
"movies": { "$push": "$movies._id" }
}},
// Find the "set intersection" of the two arrays
{ "$project": {
"movies": {
"$size": {
"$setIntersection": [
[ 1, 2, 3 ],
"$movies"
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document along with the "set" to match
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
"set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
}},
// Unwind both those arrays
{ "$unwind": "$movies" },
{ "$unwind": "$set" },
// Group back the count where both `_id` values are equal
{ "$group": {
"_id": "$_id",
"movies": {
"$sum": {
"$cond":[
{ "$eq": [ "$movies._id", "$set" ] },
1,
0
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
In Detail
That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.
$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".
The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:
"$and": [
{ "movies": { "$not": { "$size": 1 } } },
{ "movies": { "$not": { "$size": 2 } } },
{ "movies": { "$not": { "$size": 3 } } }
]
So you basically "rule out" arrays that cannot possibly be long enough to have n matches. Note that this $size operator in the query form is different from $size in the aggregation framework. There is no way, for example, to use it with an inequality operator such as $gt, as its purpose is to match exactly the requested "size". Hence this query form, which lists all of the possible sizes that are too small.
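Writing those clauses by hand gets tedious for larger n, so they can be generated. The helpers below are hypothetical. The second one uses a different technique from the answer: testing that the element at index n-1 exists, which holds only when the array has at least n elements.

```javascript
// Build the $and clauses that exclude arrays of size 1 .. n-1.
// For n < 2 the result is empty, in which case $and should be left out.
function sizeExclusions(field, n) {
  const clauses = [];
  for (let size = 1; size < n; size++) {
    clauses.push({ [field]: { $not: { $size: size } } });
  }
  return clauses;
}

// Alternative idiom: the element at index n-1 exists only if length >= n
function minLength(field, n) {
  return { [`${field}.${n - 1}`]: { $exists: true } };
}

console.log(JSON.stringify(sizeExclusions("movies", 4)));
console.log(JSON.stringify(minLength("movies", 4)));
```

For the question's requirement of 15 shared movies, `sizeExclusions("movies", 15)` would produce the fourteen clauses needed in the $match stage.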
$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.
What is also happening in the version presented for pre 2.6 versions is there is an additional array representing the _id values for the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funny enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this without the funny way we are using $cond right here.
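With $literal available, the same $project stage could be written as in this sketch, which simply swaps the $cond workaround for $literal and keeps everything else from the pre 2.6 listing:

```javascript
const projectWithLiteral = {
  $project: {
    _id: { _id: "$_id", movies: "$movies" },
    movies: 1,
    // $literal passes the array through as-is instead of evaluating it
    set: { $literal: [1, 2, 3] }
  }
};
```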
$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.
$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.
Pre 2.6, since all values are presented "side by side" (and with lots of duplication), you compare the two values to see if they are the same. The $cond operator returns 1 where they are equal and 0 where they are not. This is passed directly to $sum to total up the number of elements in the array that match the required "set".
$project : The part that differs for MongoDB 2.6 and greater is that, since you have pushed back an array of the "movies" _id values, you then use $setIntersection to compare those arrays directly. As the result is an array containing the elements that are the same, it is wrapped in a $size operator to determine how many elements were returned in that matching set.
$match : The final stage, which keeps only those documents whose count of intersecting elements is greater than or equal to the required number.
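As a check on the logic, the stages can be traced in plain JavaScript against the other two sample users. This in-memory sketch uses placeholder `_id` strings instead of the ObjectId values and is only meant to show what each stage contributes.

```javascript
const wanted = [1, 2, 3]; // movie ids of the first user
const n = 2;              // required number of shared movies

// The two other users from the sample data, with placeholder ids
const users = [
  { _id: "user2", movies: [{ _id: 1, rating: 5 }, { _id: 4, rating: 6 }, { _id: 2, rating: 7 }] },
  { _id: "user3", movies: [{ _id: 2, rating: 5 }, { _id: 5, rating: 6 }, { _id: 6, rating: 7 }] }
];

const matches = users
  .map(u => {
    // $unwind + $group/$push: reduce the array to just the movie ids
    const ids = new Set(u.movies.map(m => m._id));
    // $size of $setIntersection: count the ids shared with the wanted set
    const common = wanted.filter(id => ids.has(id)).length;
    return { _id: u._id, movies: common };
  })
  // Final $match: keep users with at least n shared movies
  .filter(u => u.movies >= n);

console.log(matches); // [ { _id: 'user2', movies: 2 } ]
```

Only the second user shares two movies (1 and 2) with the first, so it is the only one to pass the final filter.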
Final
That is basically how you do it. The approach prior to 2.6 is a bit clunkier and requires more memory, because unwinding both arrays duplicates each array member for every value in the set, but it is still a valid way to do this.
All you need to do is apply this with the greater n matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise just generate this on n-1 from the length of the "user's" array of "movies".