How to convert an SQL data retrieval into MongoDB?

Imagine there are two collections in MongoDB, User and History. Below is the SQL query for this data retrieval, as if they were tables in a relational DB. I want to write an equivalent query in MongoDB.
(The History collection contains many records for a particular User.)
SELECT U.id FROM User U
WHERE EXISTS (SELECT * FROM History H
              WHERE H.userId = U.id AND H.usage > 25 AND H.balance < 100)
  AND U.category = 'VIP' AND U.area = 'XXX';

Based on my understanding of the question, I have written a query using $lookup. Hope this is what you are looking for.
db.user.aggregate([
  {
    "$lookup": {
      "from": "history",
      "localField": "id",
      "foreignField": "userId",
      "as": "history"
    }
  },
  {
    "$match": {
      "category": "VIP",
      "area": "XXX",
      // $elemMatch tests usage and balance against the same history
      // record, mirroring the SQL EXISTS subquery (separate dotted
      // conditions could match two different records)
      "history": {
        "$elemMatch": {
          "usage": { "$gt": 25 },
          "balance": { "$lt": 100 }
        }
      }
    }
  },
  {
    "$project": {
      "id": 1
    }
  }
])

There is no way to retrieve users from one collection based on a condition on a different collection in a single find(). But the following query filters the users first and then brings in the data you asked for; if nothing is found, an empty array comes back, so we filter those documents out.
I'll show you a way with $graphLookup:
// from users
var pipeline = [
  // get a subset of users first
  { $match: { category: "VIP", area: "XXX" } },
  // include docs from History as "myNewData"
  { $graphLookup: {
      from: "History",
      as: "myNewData",
      startWith: "$id",          // match id ...
      connectToField: "userId",  // ... with userId in History
      connectFromField: "id",    // irrelevant here, only used for depth > 0
      maxDepth: 0,               // no recursion
      restrictSearchWithMatch: { usage: { $gt: 25 }, balance: { $lt: 100 } }
  } },
  // keep only users with at least one matching History doc
  { $match: { "myNewData": { $elemMatch: { $exists: true } } } }
]
db.Users.aggregate(pipeline)
Collections have to be in the same database.
No data is permanently moved unless you use $out or another such stage.
This should perform better than a plain $lookup, since only a subset of docs is brought over.
(I just tweaked the example provided by #zac786.)

Related

Mongodb aggregation that returns all documents where $lookup foreign doc DOESN'T exist

I'm working with a CMS system right now where, if pages are deleted, the associated content isn't. In the case of one of my clients this has become burdensome: we have accumulated millions of content docs over time, and it's making routine tasks like restoring and backing up dbs prohibitively slow.
Consider this structure:
Page document:
{
  _id: pageId,
  contentDocumentId: someContentDocId
}
Content document:
{
  _id: someContentDocId,
  page_id: pageId,
  content: [someContent, ...etc]
}
Is there a way to craft a MongoDB aggregation where we aggregate Content docs based on checking page_id, and if our check for page_id returns null, then we aggregate that doc? It's not something as simple as foreignField in a $lookup being set to null, is it?
This should do the trick:
db.content.aggregate([
  {
    "$lookup": {
      "from": "pages",
      "localField": "page_id",
      "foreignField": "_id",
      "as": "pages"
    }
  },
  {
    // store the number of matched pages in a temporary field
    "$addFields": {
      "pages_length": { "$size": "$pages" }
    }
  },
  {
    // keep only content docs whose page no longer exists
    "$match": { "pages_length": 0 }
  },
  {
    // drop the temporary fields again
    "$unset": ["pages", "pages_length"]
  }
])
We create an aggregation on the content collection and do a normal $lookup against the pages collection. When no matching page is found, the pages array will be [], so we just keep every document where that array is empty.
The aggregation $size expression can't be used inside $match directly (a $match stage takes query syntax unless wrapped in $expr), so we create a temporary field pages_length to hold the length of the array.
At the end we remove the temporary fields with $unset (available since MongoDB 4.2).
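Incidentally, the query-level $size operator does match an exact array length, so under the same assumptions the two middle stages could be collapsed into one:
db.content.aggregate([
  {
    "$lookup": {
      "from": "pages",
      "localField": "page_id",
      "foreignField": "_id",
      "as": "pages"
    }
  },
  // query-level $size matches an exact length, here "no matching page"
  { "$match": { "pages": { "$size": 0 } } },
  { "$unset": "pages" }
])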

MongoDB: select all the questions which have no user record and delete them

On my development server I have two collections, users and questions; a user_id is stored in each question.
I unintentionally deleted some users, and now I want to delete the questions whose user no longer exists.
DELETE FROM questions WHERE id NOT IN (SELECT q.id FROM questions q INNER JOIN users u ON u.id = q.user_id);
I think the above query does that in MySQL, but I want to do the same in MongoDB.
I am new to MongoDB. I know the $lookup aggregation stage does the join, but I don't know how to express the above query with it.
I believe you will have to use $lookup in an aggregation pipeline to find all the questions where the user does not exist, and then delete those questions.
Try this:
var pipeline = [
  {
    "$lookup": {
      "from": "users",
      "localField": "user_id",
      "foreignField": "_id",
      "as": "user_id"
    }
  },
  {
    "$match": {
      "user_id": { "$size": 0 }
    }
  }
]
var cursor = db.questions.aggregate(pipeline);
// collect the _id of every question whose user doesn't exist
var ids = cursor.map(function (doc) { return doc._id; });
// remove all those questions
db.questions.remove({ "_id": { "$in": ids } });
A $lookup is indeed needed for this, but you need another pipeline stage, $match, to keep only those documents where the userId array has length 0 (we do not want documents in the result set where the length of userId is > 0, as that means the user_id exists in users).
With the result set from the aggregation, you can do a simple iteration and remove all docs that remain in the set.
Something like this should do it (I cannot test it right now, so maybe give it a quick test run):
db.getCollection('questions').aggregate([
  {
    "$lookup": {
      "from": "users",
      "localField": "user_id",
      // assuming question.user_id refers to the _id of a user
      "foreignField": "_id",
      "as": "userId"
    }
  },
  {
    "$match": {
      "userId": { "$size": 0 }
    }
  }
]).forEach((doc) => {
  db.getCollection("questions").remove({ "_id": doc._id });
});
Instead of the forEach in the last part, you can also collect all the ids and remove them in one single remove query, like Ravi did.
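For completeness, that variant could look like this in the shell, reusing Ravi's pipeline variable from above (deleteMany is the non-deprecated counterpart of remove):
var ids = db.getCollection("questions").aggregate(pipeline).map(function (doc) {
  return doc._id;
});
db.getCollection("questions").deleteMany({ "_id": { "$in": ids } });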

pymongo: returning a huge collection as a list

As you know, there is no join in MongoDB, so I execute a join-like query like this.
users = user_collection.find(
    {"region": "US"},  # plus some conditions here
    projection={"user_id": 1},
)
user_list = [user["user_id"] for user in users]
posts = post_collection.find({"user_id": {"$in": user_list}})  # plus some conditions here
(To avoid bringing back unnecessary fields, the projection option is used in find() as well.)
Collection and list size
users = 2000000
user_list = 100000
posts = 2000000
When I execute the query, it takes almost 4 seconds; of that, building user_list takes almost 3 seconds.
Question
How can I efficiently build a list that contains only user_id values?
Is there any way to improve performance here?
Thanks.
First, make sure that the fields you query on are properly indexed. If that's already done, you can try this:
1. Use distinct()
You could use distinct() to get the user_list in one single query, something like this:
user_list = user_collection.distinct("user_id", {"region": "US", ...})
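If region and user_id are covered by one compound index, distinct() can often be answered from the index alone (a covered query), which should help with the three seconds spent building the list. A sketch in the shell, assuming the collection is really named user_collection:
db.user_collection.createIndex({ "region": 1, "user_id": 1 })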
2. Aggregation with a $lookup
The second option is to retrieve the posts in a single query by performing a $lookup from the user_collection:
user_collection.aggregate([
    {
        "$match": {"region": "US", ...}
    },
    {
        "$lookup": {
            "from": "post_collection",
            "localField": "user_id",
            "foreignField": "user_id",
            "as": "post"
        }
    },
    ...
])
and then filter the posts with a $unwind and a $match stage, for example:
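A sketch of the complete pipeline (the post.status condition is just a hypothetical placeholder for whatever post-side filter you actually need):
user_collection.aggregate([
    {"$match": {"region": "US"}},
    {"$lookup": {
        "from": "post_collection",
        "localField": "user_id",
        "foreignField": "user_id",
        "as": "post"
    }},
    {"$unwind": "$post"},
    {"$match": {"post.status": "published"}},
    {"$replaceRoot": {"newRoot": "$post"}}
])
Here $unwind produces one document per (user, post) pair, the second $match applies the post-side condition, and $replaceRoot returns the post documents themselves.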

Incredibly slow query performance with $lookup and "sub" aggregation pipeline

Let's say I have two collections, tasks and customers.
Customers have a 1:n relation with tasks via a customerId field stored on each task.
I now have a view where I need to display tasks with customer names, AND I also need to be able to filter and sort by customer name. That means I can't do the $limit or $match stage before the $lookup in the following query.
So here is my example query:
db.task.aggregate([
  {
    "$match": {
      "_deleted": false
    }
  },
  {
    "$lookup": {
      "from": "customer",
      "let": {
        "foreignId": "$customerId"
      },
      "pipeline": [
        {
          "$match": {
            "$expr": {
              "$and": [
                { "$eq": ["$_id", "$$foreignId"] },
                { "$eq": ["$_deleted", false] }
              ]
            }
          }
        }
      ],
      "as": "customer"
    }
  },
  {
    "$unwind": {
      "path": "$customer",
      "preserveNullAndEmptyArrays": true
    }
  },
  {
    "$match": {
      "customer.name": "some_search_string"
    }
  },
  {
    "$sort": {
      "customer.name": -1
    }
  },
  {
    "$limit": 35
  },
  {
    "$project": {
      "_id": 1,
      "customer._id": 1,
      "customer.name": 1,
      "description": 1,
      "end": 1,
      "start": 1,
      "title": 1
    }
  }
])
This query is getting incredibly slow as the collections grow in size. With 1000 tasks and 20 customers it already takes about 500 ms to deliver a result.
I'm aware that this happens because the $lookup stage has to do a collection scan for each document that enters the pipeline's lookup stage.
I have tried to set indexes as described here: Poor lookup aggregation performance, but that doesn't seem to have any impact.
My next guess was that the "sub"-pipeline in the $lookup stage is not capable of using indexes, so I replaced it with a simple
"$lookup": {
"from": "customer",
"localField": "customerId",
"foreignField": "_id",
"as": "customer"
}
But still the indexes are not used, or they don't have any impact on performance. (To be honest, I don't know which of the two is the case, since I haven't managed to read .explain() output for the aggregation.)
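For reference, the shell can produce an explain plan for a pipeline via the explain() helper, e.g.:
db.task.explain("executionStats").aggregate([
  { "$match": { "_deleted": false } }
  // ... remaining stages from the query above
])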
I have tried the following indexes:
Ascending, descending, hashed, and text indexes on customerId
Ascending, descending, hashed, and text indexes on customer.name
I'm grateful for any ideas on what I'm doing wrong or how I could achieve the same thing with a better aggregation pipeline.
Additional info:
I'm using a three member replica set. I'm on MongoDB 4.0.
Please note: I'm aware that I'm using a non-relational database to achieve highly relational objectives, but in this project MongoDB was our choice due to its Change Streams feature. If anybody knows a different database with a comparable feature (realtime push notifications on changes) which can be run on-premises (so Firebase drops out), I would love to hear about it!
Thanks in advance!
I found out why my indexes weren't used.
I queried the collection using a different collation than the collection's own collation.
But the _id index on a collection is always built with the collection's default collation, so a query with a different collation cannot use it.
I changed the collection's collation to match the one used by the queries, and now the query takes just a fraction of the time (but is still slow :)).
(Yes, you have to recreate the collection to change its collation; no on-the-fly change is possible.)
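A minimal sketch of the recreate step, assuming a collection named task and an example locale (pick whatever your queries actually use):
// create the new collection with an explicit default collation
db.createCollection("taskWithCollation", { collation: { locale: "en", strength: 2 } })
// copy the documents over (fine for modest sizes; use mongodump/mongorestore for big collections)
db.task.find().forEach(function (doc) { db.taskWithCollation.insertOne(doc); })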
Have you considered having a single collection for customers, with tasks as an embedded array in each document? That way you would be able to index and search on both customer and task fields.
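Sketched out with field names borrowed from the question, such a document and a matching index could look like this:
// one customer document with its tasks embedded
db.customer.insertOne({
  "name": "Some customer",
  "tasks": [
    { "title": "First task", "start": new Date(), "end": new Date(), "_deleted": false }
  ]
})
// a multikey index then makes both levels searchable
db.customer.createIndex({ "name": 1, "tasks.title": 1 })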

Can I use populate before aggregate in mongoose?

I have two models. One is the user:
userSchema = new Schema({
  userID: String,
  age: Number
});
and the other is the score, recorded several times every day for all users:
ScoreSchema = new Schema({
  userID: { type: String, ref: 'User' },
  score: Number,
  created_date: Date,
  ....
})
I would like to do some query/calculation on the scores for users meeting a specific requirement; say, I would like to calculate the average score, day by day, for all users older than 20.
My thought is to first populate the users' ages onto Scores and then do the aggregate after that.
Something like
Score.
  populate('userID', 'age').
  aggregate([
    { $match: { 'userID.age': { $gt: 20 } } },
    { $group: ... },
    { $group: ... }
  ], function (err, data) {});
Is it OK to use populate before aggregate? Or should I first find all the userIDs meeting the requirement, save them in an array, and then use $in to match the score documents?
No you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side", where the underlying code actually performs additional queries (or more accurately, an $in query) to "look up" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
A better approach here is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline stage. It's also probably best to handle this from the User collection in this case, in order to narrow down the selection:
User.aggregate(
  [
    // Filter first
    { "$match": {
      "age": { "$gt": 20 }
    }},
    // Then join
    { "$lookup": {
      "from": "scores",
      "localField": "userID",
      "foreignField": "userID",
      "as": "score"
    }},
    // More stages
  ],
  function (err, results) {
  }
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
  "userID": "abc",
  "age": 21,
  "score": [{
    "userID": "abc",
    "score": 42,
    // other fields
  }]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } }, function (err, users) {
  // Get the id list
  var userList = users.map(function (user) {
    return user.userID;
  });

  Score.aggregate(
    [
      // use the id list to select items
      { "$match": {
        "userID": { "$in": userList }
      }},
      // more stages
    ],
    function (err, results) {
    }
  );
});
So getting the list of valid users from the other collection to the client, and then feeding it into the query on the second collection, is the only way to make this happen in earlier releases.