MapReduce: aggregate in map function? - mongodb

Suppose you have a database where every document is a tweet from Twitter, and you want to use MapReduce to generate another document that contains:
The number of tweets published in each country.
The list of words contained in those tweets, with a counter for the total occurrences of each word, also per country.
My question: is it fine to aggregate and count the words in the map function, and then again in the reduce function? Done this way, the output of the map function represents the information of a single tweet, and the reduce function aggregates the information from several tweets, all from the same country, but I don't know whether this is good practice with the MapReduce algorithm.
Thank you in advance!

In MongoDB 3.4 and later you can do this with the aggregation framework.
For the first bullet, you just have to use the $group stage on the country field and count the tweets.
For the second bullet, use the $split operator (new in 3.4) on the tweet text field, then $unwind the generated array, and finally $group with the word, or country + word, as _id.
If you have an older version of MongoDB then you have to use map-reduce, but keep in mind that the aggregation framework is much faster than map-reduce in MongoDB.
$split: https://docs.mongodb.com/manual/reference/operator/aggregation/split/#exp._S_split
$unwind: https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/
$group: https://docs.mongodb.com/manual/reference/operator/aggregation/group/
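For the map-reduce route the question asks about, counting words inside the map function is fine as long as the value emitted by map has the same shape as the value returned by reduce, because MongoDB may call reduce more than once for the same key and feed earlier reduce output back in. The following is a minimal plain-JavaScript sketch of that idea, not a mapReduce call: mapTweet and reduceCountry are hypothetical names, and the text and user.country fields are assumed from the pipeline in the next answer.

```javascript
// Hypothetical helpers; inside mapReduce the map function would call
// emit(key, value) instead of returning an object.
function mapTweet(tweet) {
  const words = Object.create(null); // no inherited keys such as "constructor"
  for (const word of tweet.text.split(" ")) {
    if (word) words[word] = (words[word] || 0) + 1;
  }
  // The value already has the shape that reduceCountry returns
  return { key: tweet.user.country, value: { tweets: 1, words } };
}

function reduceCountry(key, values) {
  const total = { tweets: 0, words: Object.create(null) };
  for (const value of values) {
    total.tweets += value.tweets;
    for (const word of Object.keys(value.words)) {
      total.words[word] = (total.words[word] || 0) + value.words[word];
    }
  }
  return total;
}

const first = mapTweet({ text: "hello big world", user: { country: "ES" } });
const second = mapTweet({ text: "hello hello again", user: { country: "ES" } });
const third = mapTweet({ text: "world again", user: { country: "ES" } });

// Reducing in two steps gives the same totals as reducing everything at once
const partial = reduceCountry("ES", [first.value, second.value]);
const combined = reduceCountry("ES", [partial, third.value]);
const oneShot = reduceCountry("ES", [first.value, second.value, third.value]);
```

Because a partial result can be passed back into reduceCountry unchanged, the per-tweet aggregation in the map step does not conflict with the aggregation in the reduce step.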

Building on the great answer above by Moi Syme, you would ideally run the following aggregate operation to get the desired result:
db.tweets.aggregate([
{ "$project": { "wordList": { "$split": [ "$text", " " ] }, "user.country": 1 } },
{ "$unwind": "$wordList" },
{
"$group": {
"_id": {
"country": "$user.country",
"word": "$wordList"
},
"count": { "$sum": 1 }
}
},
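{
"$group": {
"_id": "$_id.country",
// Caution: each input document here is one (country, word) pair, so this counts distinct words per country, not tweets
"numberOfTweets": { "$sum": 1 },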
"counts": {
"$push": {
"word": "$_id.word",
"count": "$count"
}
}
}
}
])
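Note that the last $group above runs after $unwind and after the per-word grouping, so its { "$sum": 1 } counts distinct words per country rather than tweets. A minimal sketch of a separate pipeline for the tweet count follows; the variable name tweetsPerCountry is hypothetical, and the collection and field names are taken from the pipeline above.

```javascript
// One $group on the country field, applied to the raw tweets (no $unwind),
// so every input document is exactly one tweet.
const tweetsPerCountry = [
  { $group: { _id: "$user.country", numberOfTweets: { $sum: 1 } } },
];

// In the mongo shell this would be run as:
//   db.tweets.aggregate(tweetsPerCountry)
```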

Related

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format shown above, you can split the _id into two parts on the dot character and use aggregation to find the maximum ts_utc for each value of the first part (the numeric prefix).
That way you can do it in one shot, instead of iterating over each group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
As @danh mentioned in the comments, the best approach is probably to add an auxiliary field that indicates the grouping. You can then index the auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
Here is the Mongo playground for your reference.
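To check what either pipeline should return for the sample data, the same logic can be sketched in plain JavaScript. This is only an illustration run on an in-memory array, and latestPerGroup is a hypothetical name. It relies on the fact that ISO-8601 timestamps in one uniform format, as in the sample, compare chronologically when compared as strings.

```javascript
// Keep, for each prefix before the first dot in _id, the document with the
// greatest ts_utc string.
function latestPerGroup(docs) {
  const latest = new Map();
  for (const doc of docs) {
    const group = doc._id.split(".")[0];
    const best = latest.get(group);
    if (!best || doc.ts_utc > best.ts_utc) latest.set(group, doc);
  }
  return latest;
}

const sample = [
  { _id: "42.abc", ts_utc: "2019-05-27T23:43:16.963Z" },
  { _id: "42.def", ts_utc: "2019-05-27T23:43:17.055Z" },
  { _id: "69.abc", ts_utc: "2019-05-27T23:43:17.147Z" },
  { _id: "69.def", ts_utc: "2019-05-27T23:44:02.427Z" },
];

const newest = latestPerGroup(sample);
console.log(newest.get("42")._id); // 42.def
console.log(newest.get("69")._id); // 69.def
```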

MongoDB Aggregation: Dedupe by array in subdocuments

I have an aggregation query that calculates records by tag combination. The query works well, but it has one issue: it duplicates documents for tag combinations that appear in different orders. For example, I could have one document with the tags ['one', 'two'] and a second document with ['two', 'one'], while the rest of the data is exactly the same.
My first thought was to do a $group aggregation and to look up how to sort the arrays in a $project stage, but I cannot find how to do this anywhere. I did see that update queries can use '$push', but this feature doesn't seem to exist for $project.
An example document at this phase looks something like this:
_id: "sadasdsad",
tags: ['one', 'two'],
total_count:37,
second_count:14,
What would be the best approach to solving this issue?
You can sort your array using $unwind, $sort and finally $group, so that all your tag arrays are in the same order before grouping. Example: https://mongoplayground.net/p/EZi04LfY1ff
However, I would try to store those tags already sorted, so you can avoid these steps.
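db.collection.aggregate([
// Note: this uses fields named "key" and "tag"; adapt them to your documents (e.g. "_id" and "tags" in the question)
// Emit one document per tag
{
"$unwind": "$tag"
},
// Order the tags alphabetically within each original document
{
"$sort": {
key: 1,
tag: 1
}
},
// Rebuild each document's tag array, now in sorted order
{
"$group": {
"_id": "$key",
"tag": {
"$push": "$tag"
}
}
},
// Group the documents that share the same sorted tag array
{
"$group": {
"_id": "$tag",
"field": {
"$push": "$$ROOT"
}
}
}
])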
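The idea behind the pipeline (bring every tag array into one canonical order, then group on it) can also be sketched in plain JavaScript, for instance to normalize tags before storing them. This is a minimal illustration on an in-memory array; canonicalTagKey and dedupeByTags are hypothetical names, and the field names are taken from the question's example document.

```javascript
// Build an order-independent key for a tag array.
function canonicalTagKey(tags) {
  // Sort a copy so the caller's array is left untouched
  return JSON.stringify([...tags].sort());
}

// Keep the first document seen for each distinct set of tags.
function dedupeByTags(docs) {
  const seen = new Map();
  for (const doc of docs) {
    const key = canonicalTagKey(doc.tags);
    if (!seen.has(key)) seen.set(key, doc);
  }
  return [...seen.values()];
}

const docs = [
  { _id: "a", tags: ["one", "two"], total_count: 37, second_count: 14 },
  { _id: "b", tags: ["two", "one"], total_count: 37, second_count: 14 },
  { _id: "c", tags: ["one", "three"], total_count: 5, second_count: 2 },
];

console.log(dedupeByTags(docs).map((d) => d._id)); // [ 'a', 'c' ]
```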

How to group documents of a collection to a map with unique field values as key and count of documents as mapped value in mongodb?

I need a MongoDB query that returns a list or map whose keys are the unique values of a field (f) in the collection and whose values are the counts of documents having that value in the field. How can I achieve this?
Example:
Document1: {"id":"1","name":"n1","city":"c1"}
Document2: {"id":"2","name":"n2","city":"c2"}
Document3: {"id":"3","name":"n1","city":"c3"}
Document4: {"id":"4","name":"n1","city":"c5"}
Document5: {"id":"5","name":"n2","city":"c2"}
Document6: {"id":"6","name":"n1","city":"c8"}
Document7: {"id":"7","name":"n3","city":"c9"}
Document8: {"id":"8","name":"n2","city":"c6"}
Query result should be something like this if group by field is "name":
{"n1":"4",
"n2":"3",
"n3":"1"}
It would be nice if the list is also sorted in the descending order.
It's worth noting that using data points as field names (keys) is generally considered an anti-pattern and makes tooling difficult. Nonetheless, if you insist on having data points as field names, you can use this somewhat complicated aggregation to produce the output you desire:
Aggregation
db.collection.aggregate([
{
$group: { _id: "$name", "count": { "$sum": 1} }
},
{
$sort: { "count": -1 }
},
{
$group: { _id: null, "values": { "$push": { "name": "$_id", "count": "$count" } } }
},
{
$project:
{
_id: 0,
results:
{
$arrayToObject:
{
$map:
{
input: "$values",
as: "pair",
in: ["$$pair.name", "$$pair.count"]
}
}
}
}
},
{
$replaceRoot: { newRoot: "$results" }
}
])
Aggregation Explanation
This is a 5-stage aggregation consisting of the following:
$group - get the count of the data as required by name.
$sort - sort the results with count descending.
$group - place results into an array for the next stage.
$project - use $arrayToObject and $map to pivot the data such that a data point can be a field name.
$replaceRoot - make results the top-level fields.
Sample Results
{ "n1" : 4, "n2" : 3, "n3" : 1 }
For whatever reason, you show desired results having count as a string, but my results show the count as an integer. I assume that is not an issue, and may actually be preferred.
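If the pivot does not have to happen inside the database, one alternative is to run only the first two stages ($group and $sort) and build the map in application code. The sketch below assumes that approach; rows is a hand-written stand-in for what those two stages would return for the sample documents, and countsByName is a hypothetical name.

```javascript
// Stand-in for the output of the $group + $sort stages on the sample data
const rows = [
  { _id: "n1", count: 4 },
  { _id: "n2", count: 3 },
  { _id: "n3", count: 1 },
];

// Pivot the (name, count) pairs into a single object
const countsByName = Object.fromEntries(rows.map((r) => [r._id, r.count]));

console.log(countsByName); // { n1: 4, n2: 3, n3: 1 }
```

JavaScript objects keep insertion order for string keys like these, but keys that look like non-negative integers are listed first in numeric order, so an array of pairs is the safer structure when the sort order matters.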

MongooseJS: Best approach for a derived/calculated value

I am creating a college football betting app for my family.
Here are my schemas:
const GameSchema = new mongoose.Schema({
home: {
type: String,
required: true
},
opponent: {
type: String,
required: true
},
homeScore: Number,
opponentScore: Number,
week:{
type: Number,
required: true
},
winner: String,
userPicks: [
{
user: {
type: mongoose.Schema.Types.ObjectId,
ref: 'User'
},
chosenTeam: String
}
]
});
const UserSchema = new mongoose.Schema({
name: String
});
I need to be able to calculate each user's weekly score (i.e. the number of football games they predict correctly each week) and their cumulative score (i.e. the number of games each user predicts correctly overall).
I am still very new to MongoDB and Mongoose, so I am unsure how to handle this. Since the Game collection will never grow beyond 200 documents, I think both scores should be derived or calculated from the data stored in the database.
Here are the possible solutions that I have thought of so far:
Make both scores virtual attributes, not sure how this would work for the multiple users
Persist the attributes to the document, but use middleware to re-calculate the scores, when the results for the week's games are saved to the database.
Use a static method to calculate the scores.
Any advice would be appreciated.
You could use the aggregation framework for calculating the aggregates. This is a faster alternative to Map/Reduce for common aggregation operations.
In MongoDB, a pipeline consists of a series of special operators applied to a collection to process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. For more details, please consult the documentation.
Consider running the following pipeline to get the desired result:
var pipeline = [
{ "$unwind": "$userPicks" },
{
"$group": {
"_id": {
"week": "$week",
"user": "$userPicks.user"
},
"weeklyScore": {
"$sum": {
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
}
}
}
},
{
"$group": {
"_id": "$_id.user",
"weeklyScores": {
"$push": {
"week": "$_id.week",
"score": "$weeklyScore"
}
},
"totalScores": { "$sum": "$weeklyScore" }
}
}
];
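Game.aggregate(pipeline, function(err, results) {
// Stop early if the aggregation itself failed
if (err) throw err;
// Replace each _id (a user ObjectId) with the matching User document
User.populate(results, { "path": "_id" }, function(err, results) {
if (err) throw err;
console.log(JSON.stringify(results, undefined, 4));
});
});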
In the above pipeline, the first step is the $unwind operator
{ "$unwind": "$userPicks" }
which comes in quite handy when the data is stored as an array. When the unwind operator is applied on a list data field, it will generate a new record for each and every element of the list data field on which unwind is applied. It basically flattens the data.
This is a necessary operation for the next pipeline stage, the $group step where you group the flattened documents by the fields week and the "userPicks.user"
{
"$group": {
"_id": {
"week": "$week",
"user": "$userPicks.user"
},
"weeklyScore": {
"$sum": {
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
}
}
}
}
The $group pipeline operator is similar to SQL's GROUP BY clause. In SQL, GROUP BY is normally used together with aggregate functions; in the same way, every field other than _id in a MongoDB $group stage must be computed with an accumulator operator. You can read more about the aggregation functions here.
In this $group operation, the logic to calculate each user's weekly score (i.e. the number of football games they predict correctly each week) is done through the ternary operator $cond, which takes a logical condition as its first argument (if) and returns the second argument when the condition is true (then) or the third argument when it is false (else). This turns true/false results into 1 and 0, respectively, to feed to $sum:
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
So, if within the document being processed the "$userPicks.chosenTeam" field is the same as the "$winner" field, the $cond operator feeds the value 1 to the sum; otherwise it adds zero.
The second group pipeline:
{
"$group": {
"_id": "$_id.user",
"weeklyScores": {
"$push": {
"week": "$_id.week",
"score": "$weeklyScore"
}
},
"totalScores": { "$sum": "$weeklyScore" }
}
}
takes the documents from the previous pipeline and groups them further by the user field and calculates another aggregate i.e. the total score, using the $sum accumulator operator. Within the same pipeline, you can aggregate a list of the weekly scores by using the $push operator which returns an array of expression values for each group.
One thing to note here is when executing a pipeline, MongoDB pipes operators into each other. "Pipe" here takes the Linux meaning: the output of an operator becomes the input of the following operator. The result of each operator is a new collection of documents. So Mongo executes the above pipeline as follows:
collection | $unwind | $group | $group => result
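To make the pipe analogy concrete, here is a minimal plain-JavaScript sketch that mimics the three stages on an in-memory array. It is an illustration of the data flow only, not how MongoDB executes the pipeline; the function names and the sample games are hypothetical, and user ids are shown as short strings instead of ObjectIds.

```javascript
// Each "stage" is a function from an array of documents to a new array.

// $unwind: one output document per element of userPicks
const unwindUserPicks = (docs) =>
  docs.flatMap((d) => d.userPicks.map((pick) => ({ ...d, userPicks: pick })));

// First $group: weekly score per (week, user), adding 1 for each correct pick
const groupByWeekAndUser = (docs) => {
  const groups = new Map();
  for (const d of docs) {
    const key = `${d.week}|${d.userPicks.user}`;
    const g = groups.get(key) ||
      { _id: { week: d.week, user: d.userPicks.user }, weeklyScore: 0 };
    g.weeklyScore += d.userPicks.chosenTeam === d.winner ? 1 : 0;
    groups.set(key, g);
  }
  return [...groups.values()];
};

// Second $group: collect weekly scores and total them per user
const groupByUser = (docs) => {
  const groups = new Map();
  for (const d of docs) {
    const g = groups.get(d._id.user) ||
      { _id: d._id.user, weeklyScores: [], totalScores: 0 };
    g.weeklyScores.push({ week: d._id.week, score: d.weeklyScore });
    g.totalScores += d.weeklyScore;
    groups.set(d._id.user, g);
  }
  return [...groups.values()];
};

// The output of each stage becomes the input of the next
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

const games = [
  { week: 1, winner: "A", userPicks: [{ user: "u1", chosenTeam: "A" }, { user: "u2", chosenTeam: "B" }] },
  { week: 1, winner: "C", userPicks: [{ user: "u1", chosenTeam: "C" }, { user: "u2", chosenTeam: "C" }] },
  { week: 2, winner: "A", userPicks: [{ user: "u1", chosenTeam: "B" }, { user: "u2", chosenTeam: "A" }] },
];

const result = runPipeline(games, [unwindUserPicks, groupByWeekAndUser, groupByUser]);
console.log(JSON.stringify(result, undefined, 4));
```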
Now, when you run the aggregation pipeline in Mongoose, the results will have an _id key which is the user id and you need to populate the results on this field i.e. Mongoose will perform a "join" on the users collection and return the documents with the user schema in the results.
As a side note, to help with understanding the pipeline or to debug it should you get unexpected results, run the aggregation with just the first pipeline operator. For example, run the aggregation in mongo shell as:
db.games.aggregate([
{ "$unwind": "$userPicks" }
])
Check the result to see if the userPicks array is deconstructed properly. If that gives the expected result, add the next:
db.games.aggregate([
{ "$unwind": "$userPicks" },
{
"$group": {
"_id": {
"week": "$week",
"user": "$userPicks.user"
},
"weeklyScore": {
"$sum": {
"$cond": [
{ "$eq": ["$userPicks.chosenTeam", "$winner"] },
1, 0
]
}
}
}
}
])
Repeat the steps till you get to the final pipeline step.

How to do SQL INTERSECT OPERATION IN MONGODB

SELECT SOME_COLUMN
FROM TABLE
WHERE SOME_COLUMN_NAME = 'VALUE'
INTERSECT
SELECT SOME_COLUMN
FROM TABLE
WHERE SOME_COLUMN_NAME_VALUE = 'NEW_VALUE'
How to get the common or intersection values for the 2 queries (using INTERSECT operator in SQL) in MongoDB?
INTERSECT is a keyword for SQL, how is it done for MongoDB?
As with so many things from SQL, there is no exact counterpart for SQL INTERSECT in MongoDB, but depending on the actual problem there might be an alternative solution.
MongoDB has no operations that affect more than one collection, so creating an intersection between two collections can't be done completely in the database.
When both queries come from the same collection, you could maybe do something with aggregation. What you could do would depend on what you actually want to do.
Your question seems a little off with the values 'VALUE' and 'NEW_VALUE' in each sub-query. The point of INTERSECT is matching on the column(s) with the "same" value.
But as long as you are talking about the same collection, you can get the intersection of two columns using the aggregation framework like so:
db.collection.aggregate([
// Get the "sets" for each field
{ "$group": {
"_id": null,
"field1": { "$addToSet": "$field1" },
"field2": { "$addToSet": "$field2" }
}},
// Intersect the "sets"
{ "$project": {
"same": { "$setIntersection": [ "$field1", "$field2" ] }
}},
// Unwind the result set
{ "$unwind": "$same" },
// Just project the wanted field
{ "$project": { "_id": 0, "same": 1 } }
])
That does make use of the $setIntersection operator introduced in MongoDB 2.6 in order to return a "set" with the common elements from the two "sets" being compared. The $addToSet operation constructs the two sets from the "unique" values in each field.
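What $addToSet and $setIntersection compute here can be sketched in plain JavaScript with Set. This is only an illustration on a hypothetical in-memory sample, with field1 and field2 named as in the pipeline above. MongoDB does not guarantee the order of elements in the resulting set, whereas this sketch happens to keep insertion order.

```javascript
// Hypothetical sample documents
const docs = [
  { field1: "a", field2: "b" },
  { field1: "b", field2: "c" },
  { field1: "c", field2: "c" },
];

// $addToSet: the unique values of each field
const field1 = new Set(docs.map((d) => d.field1)); // a, b, c
const field2 = new Set(docs.map((d) => d.field2)); // b, c

// $setIntersection: values present in both sets
const same = [...field1].filter((value) => field2.has(value));

console.log(same); // [ 'b', 'c' ]
```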
You can essentially do the same thing if your available MongoDB version is prior to 2.6, but just with a little more work:
db.collection.aggregate([
// Group each "set"
{ "$group": {
"_id": null,
"field1": { "$addToSet": "$field1" },
"field2": { "$addToSet": "$field2" }
}},
// Unwind each set
{ "$unwind": "$field1" },
{ "$unwind": "$field2" },
// Group on the compared values
{ "$group": {
"_id": null,
"same": {
"$addToSet": {
"$cond": [
{ "$eq": [ "$field1", "$field2" ] },
"$field1",
false
]
}
}
}},
// Unwind again, should be compacted now
{ "$unwind": "$same" },
// Filter out the "false" values
{ "$match": { "same": { "$ne": false } } },
// Just project the wanted field
{ "$project": { "_id": 0, "same": 1 } }
])
Lacking support for the "set operators" in earlier versions, you just emulate the behavior by comparing the values of the two "sets". This works because when you $unwind an array, what is produced is essentially a new document for each of its values. So "unwinding" one array on top of another results in one document per pair of values, where each element can be compared against the other.
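The double $unwind amounts to building every pair of values from the two sets and keeping the pairs that match. A minimal plain-JavaScript sketch of that emulation follows; intersectByCrossProduct is a hypothetical name, and like the pipeline it uses false as the placeholder for non-matching pairs, so a genuine false value in the data would be dropped as well.

```javascript
function intersectByCrossProduct(set1, set2) {
  const same = new Set();
  // Two nested loops play the role of the two $unwind stages
  for (const a of set1) {
    for (const b of set2) {
      // Mirrors $cond: keep the value on a match, otherwise the placeholder
      same.add(a === b ? a : false);
    }
  }
  // Mirrors the final $match that filters out the placeholder
  same.delete(false);
  return [...same];
}

console.log(intersectByCrossProduct(["a", "b", "c"], ["b", "c"])); // [ 'b', 'c' ]
```

The number of compared pairs is the product of the two set sizes, which is one reason the $setIntersection form is preferable when it is available.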
So with the single collection form this is a perfectly valid operation in order to get the "intersection". Like all things in MongoDB though, the general gearing is towards working with a single collection at a time. The general onus is on your design to structure the data so that comparisons are made on a single collection.
Similar results can be obtained with an incremental mapReduce process over multiple collections, but as your general question seems to refer to a single table source then this would in fact be a different question to the one you appear to be asking. Also of course, it is not a single operation and involves multiple processing steps.
You would generally be advised to take a good look at the manual section on SQL to aggregation mapping. It gives many common examples and is being extended over time with additional use cases.