MongoDB aggregate with 'join' - mongodb

I have a 'reports' collection in mongodb. The collection holds the ObjectId of the user that the report belongs to. I am currently doing an aggregate so that I can group the reports by user.
db.reports.aggregate([
    {
        $group: {
            _id: "$user",
            stuff: {
                $push: {
                    things: {
                        properties: "$properties"
                    }
                }
            }
        }
    },
    {
        $project: {
            _id: 0,
            user: "$_id",
            stuff: "$stuff"
        }
    }
])
This gives me an array of user IDs and the report 'stuff'. I am wondering if I can form the aggregate such that, instead of just the userId, I could hit the users collection and return more information about the user.
Is that possible? I am using mongoose as an ORM, but looking at the mongoose docs, aggregate looks to be a straight pass-through to mongodb's aggregate function. I don't think I can take advantage of mongoose's populate, as aggregate is not dealing with a schema, but I could be wrong.
Lastly, reports are computer generated; each user could have millions. This prevents me from storing the reports in an array within the users collection.

No, you can't do that as there are no joins in MongoDB (not in normal queries, MapReduce, or the Aggregation Framework).
Only one collection can be accessed at a time from the aggregation functions.
Mongoose won't directly help, or necessarily make the query for additional user information any more efficient than doing an $in on a large batch of Users at a time (an array of userIds).
There really aren't work-arounds for this, as the lack of joins is currently an intentional design of MongoDB (i.e., it's how it works). Two paths you might consider:
You may find that another database platform would be better suited to the types of queries that you're trying to run.
You might try using $in as suggested above after the aggregation results are returned to your client code (it's one of the ways Mongoose handles fetching related documents). Just grab the userIds, and in batches request the associated User documents. While it's a bit more network traffic, depending on how you want to display the results, you may consider it an optimization to only fetch extra User information as it's shown to a user (if that's possible), by incrementally loading the extra data.
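For illustration, a minimal sketch of that batching pattern with Mongoose (the Report and User model names are assumptions based on the question, and handleError is a hypothetical placeholder):

    // Group the reports by user first (same pipeline as in the question).
    Report.aggregate([
        { $group: { _id: "$user", stuff: { $push: { properties: "$properties" } } } }
    ], function (err, results) {
        if (err) return handleError(err); // handleError is hypothetical
        // Collect the grouped user ids and fetch the users in one batch.
        var userIds = results.map(function (r) { return r._id; });
        User.find({ _id: { $in: userIds } }, function (err, users) {
            if (err) return handleError(err);
            // Merge the user details back into the aggregation results here.
        });
    });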

You could combine the two collections into one and do the aggregation on that. You don't need to have the same top-level document structure in a collection. It wouldn't be too difficult to mark each document as type A or B and adapt your queries involving type-A documents to check that the document is type A.
You might not like how it looks, but it would set you up for the "join".
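A rough sketch of that idea in the shell (the collection name, type markers, and fields are all made up for illustration):

    // One collection holds both kinds of documents, distinguished by a type marker.
    var userId = ObjectId();
    db.combined.insert({ _id: userId, type: "user", name: "Alice" });
    db.combined.insert({ type: "report", user: userId, properties: { source: "generator" } });

    // Queries that used to target the reports collection now filter on the marker.
    db.combined.aggregate([
        { $match: { type: "report" } },
        { $group: { _id: "$user", stuff: { $push: { properties: "$properties" } } } }
    ]);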

Related

How to efficiently do complex queries on MongoDB unindexed fields?

I am building filtering functionality for a web application; it should look like a JIRA or TFS filtering query, so the user should be able to filter on field contents and use logical operators in the filter query.
The data lives in MongoDB, and the main challenge is that the fields we will filter on should support not only strict equality but also full-text search, and they are difficult to index because they can vary for each user.
In a nutshell, there is a nested object, which has three other nested objects that can have a different number of fields depending on the user; the field names are also set by the user, so we don't know them in advance.
For example document structure in collection can be:
{
    _id: ObjectId(),
    storage: {
        obj_1: {},
        obj_2: {}
    }
},
{
    _id: ObjectId(),
    storage: {
        obj_1: {
            field_1: val,
            field_2: val
        },
        obj_2: {}
    }
}
I imagine queries will be something like:
find({ $and: [ { "storage.obj_1.field_1": { $regex: "va" } }, { "storage.obj_1.field_2": "val" } ] })
Unfortunately, I am not a database expert so the solutions that I see now are:
1) Use Elasticsearch as a search engine. But the question is: how do I set up an Elasticsearch index if I don't know my document structure?
2) Use a sparse index in Mongo. But I will need to use regexes for matching; is that solution better than Elasticsearch?
So the question is: what is the best way to do filtering in such a DB structure?
p.s.
I have put this question on SO and not Software Engineering because SO has more active members; please keep your downvotes for later
Elasticsearch and MongoDB behave differently when it comes to indexing: in MongoDB (much like in a relational database) you need to explicitly index every field (for a non-$text index); in Elasticsearch every field is automatically indexed. Don't go too crazy on the number of fields in Elasticsearch, since there is a little overhead for each field (in terms of disk space, though that has improved in version 6).
Once you have more than a test data set, regular expressions are often slow, since they can use indexes only in specific cases, and you need to define those indexes explicitly. Maybe the $text index and search operator are more what you are looking for. That one can index every field in a collection as well. If you need more features and a system that is fully built for search, then Elasticsearch will be the better option, though.
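If the $text route sounds interesting, MongoDB supports a wildcard text index that covers every string field, including ones whose names you don't know up front. A minimal sketch (the collection name is made up):

    // Index every string field in the collection, whatever its name.
    db.userdata.createIndex({ "$**": "text" });

    // Full-text search across all indexed string fields.
    db.userdata.find({ $text: { $search: "val" } });

Note that $text matches whole terms rather than arbitrary substrings, so it is not a drop-in replacement for $regex queries like the one above.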

Get all usernames from an item's user_id? (MongoDB query)

I am having some difficulty with my leaderboard for my application.
I have a database with two collections.
Users
Fish
In my Fish collection I also have user_id. When I fetch all fish, I get everything including user_id.
However, the user_id doesn't really help me, I want to display the username belonging to that user_id.
This is how my query looks.
Fish.find().sort({weight: -1}).limit(10).exec(function(err, leaderboard) {
    if (err) return res.json(500, {errorMsg: 'Could not get leaderboard'});
    res.json(leaderboard);
});
I feel like I need to make another query to get all the usernames belonging to the user_ids I get from the first query. Perhaps use a loop somehow?
MongoDB is pretty new to me and I don't really know what to look for.
Any advice, tips, or links are much appreciated.
You may find useful information on MongoDB's Database References documentation.
The first thing to consider when using fields from different collections in MongoDB is:
MongoDB does not support joins. In MongoDB some data is denormalized, or stored with related data in documents to remove the need for joins. However, in some cases it makes sense to store related information in separate documents, typically in different collections or databases.
In your case, you might want to consider storing the information from the Fish collection as embedded documents within the users from the User collection.
If this is not an option, then you might want to use Manual References or loop over the user_ids provided in the result from your query over the Fish collection.
With the second option you may use a query to obtain the corresponding usernames from the User collection such as:
Users.find({user_id:<USER_ID>},{username:1})
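Rather than looping with one query per entry, you could batch the lookup with $in. A sketch building on the leaderboard code above (assuming the Users documents carry user_id and username fields, as in the query just shown):

    Fish.find().sort({ weight: -1 }).limit(10).exec(function (err, leaderboard) {
        if (err) return res.json(500, { errorMsg: 'Could not get leaderboard' });
        // Fetch all matching usernames in a single batched query.
        var userIds = leaderboard.map(function (fish) { return fish.user_id; });
        Users.find({ user_id: { $in: userIds } }, { user_id: 1, username: 1 }, function (err, users) {
            if (err) return res.json(500, { errorMsg: 'Could not get usernames' });
            // Build a lookup table and attach usernames to the leaderboard entries.
            var byId = {};
            users.forEach(function (u) { byId[String(u.user_id)] = u.username; });
            res.json(leaderboard.map(function (fish) {
                var entry = fish.toObject();
                entry.username = byId[String(fish.user_id)];
                return entry;
            }));
        });
    });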

Two mongodb collections in one query

I have a big collection of clients and a huge collection of client data. The collections are separate, and I don't want to combine them into a single collection (because of the other, already-working servlets), but now I need to "join" data from both collections into a single result.
Since the query will return a large number of results, I don't want to query the server once and then use the result to query again. I'm also concerned about the traffic between the server and the DB, and the memory the result set will occupy in the server's RAM.
The way it works now is that I get the relevant client list from the 'clients' collection and send this list to the query of the 'client data' collection, and only then I get the aggregated results.
I want to cut out fetching the client list and sending it right back to the server; I want the database to ask itself, i.e. let the query on the client data collection ask the clients collection for the relevant client list.
How can I use a stored procedure (JavaScript function) to do the query in the DB and return only the relevant clients out of the collection?
Alternatively, Is there a way to write a query that joins result from another collection ?
"Good news everyone", this aggregation query work just fine in the mongo shell as a join query
db.clientData.aggregate([
    {
        $match: {
            id: {
                $in: db.clients.distinct("_id", { "tag": "qa" })
            }
        }
    },
    {
        $group: {
            _id: "$computerId",
            total_usage: {
                $sum: "$workingTime"
            }
        }
    }
]);
The key idea with MongoDB data modelling is to be write-heavy, not read-heavy: store the data in the format that you need for reading, not in some format that minimizes/avoids redundancy (i.e. use a de-normalized data model).
I don't want to combine them to a single collection
That's not a good argument
I'm also concerned about the traffic between the server and the DB [...]
If you need the data, you need the data. How does the way it is queried make a difference here?
[...] and the memory that the result set will occupy in the server RAM.
Is the amount of data so large that you want to stream it from the server to the client, such that is transferred in chunks? How much data are we talking, and why does the client read it all?
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection
There are no stored procedures in MongoDB, but you can use server-side map/reduce to 'join' collections. Generally, code that is stored in and run by the database violates the separation of concerns of a layered architecture. I consider it one of the ugliest hacks of all time - but that's debatable.
Also, less debatable, keep in mind that M/R has huge overhead in MongoDB and is not geared towards real-time queries made e.g. in a web server call. These calls will take hundreds of milliseconds.
Is there a way to write a query that joins result from another collection ?
No, operations are constrained to a single collection. You can perform a second query and use the $in operator there, however, which is similar to a subselect and reasonably fast, but of course requires two round-trips.
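Spelled out with the collections from the question, the two round-trips look something like this (a sketch, not a drop-in solution):

    // Round-trip 1: fetch the relevant client ids.
    var clientIds = db.clients.distinct("_id", { "tag": "qa" });

    // Round-trip 2: use them in an $in, which acts like a subselect.
    db.clientData.find({ id: { $in: clientIds } });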
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection. Alternatively
There are no stored procedures in MongoDB.
Alternatively, Is there a way to write a query that joins result from another collection ?
You normally don't need to do any joins in MongoDB, and there is no such thing. The flexibility of the document model already handles the typical need for joins. You should think about your document model; asking how to design the joins out of your schema should always be your first port of call. As an alternative, you may need to use aggregation or map-reduce on the server side to handle this.
First of all, mnemosyn and Michael9 are right. But if I were in your shoes, also assuming that the client data collection holds one document per client, I would store the document ID of the client data document in the client document to make the "join" (still no joins in Mongo) easier.
If you have more than one client data document per client, then store an array of document IDs.
But none of this saves you from having to implement the "join" in your application code; if it's a Rails app, then probably in your controller.

Which mongo document schema/structure is correct?

I have two document formats and I can't decide which is the Mongo way of doing things. Are the two examples equivalent? The idea is to search by userId and have userId be indexed. It seems to me the performance will be equal for either schema.
multiple bookmarks as separate documents in a collection:
{
    userId: 123,
    bookmarkName: "google",
    bookmarkUrl: "www.google.com"
},
{
    userId: 123,
    bookmarkName: "yahoo",
    bookmarkUrl: "www.yahoo.com"
},
{
    userId: 456,
    bookmarkName: "google",
    bookmarkUrl: "www.google.com"
}
multiple bookmarks within one document per user.
{
    userId: 123,
    bookmarks: [
        {
            bookmarkName: "google",
            bookmarkUrl: "www.google.com"
        },
        {
            bookmarkName: "yahoo",
            bookmarkUrl: "www.yahoo.com"
        }
    ]
},
{
    userId: 456,
    bookmarks: [
        {
            bookmarkName: "google",
            bookmarkUrl: "www.google.com"
        }
    ]
}
The problem with the second option is that it causes growing documents. Growing documents are bad for write performance, because the database will have to constantly move them around the database files.
To improve write performance, MongoDB always writes each document as a consecutive sequence to the database files with little padding between each document. When a document is changed and the change results in the document growing beyond the current padding, the document needs to be deleted and moved to the end of the current file. This is a quite slow operation.
Also, MongoDB has a hardcoded limit of 16MB per document (mostly to discourage growing documents). In your illustrated use-case this might not be a problem, but I assume that this is just a simplified example and your actual data will have a lot more fields per bookmark entry. When you store a lot of meta-data with each entry, that 16MB limit could become a problem.
So I would recommend you to pick the first option.
I would go with option 2 - multiple bookmarks within one document per user - because this schema takes advantage of MongoDB's rich documents, also known as "denormalized" models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
There are two tools that allow applications to represent these relationships: references and embedded documents.
When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
The second type of structure is an embedded type.
Generally, an embedded structure should be chosen when your application needs:
a) better performance for read operations;
b) the ability to request and retrieve related data in a single database operation;
c) data consistency, i.e. updating related data in a single atomic write operation (see the sketch after this list);
In MongoDB, operations are atomic at the document level. No single write operation can change more than one document. Operations that modify more than a single document in a collection still operate on one document at a time. Ensure that your application stores all fields with atomic dependency requirements in the same document. If the application can tolerate non-atomic updates for two pieces of data, you can store these data in separate documents. A data model that embeds related data in a single document facilitates these kinds of atomic operations.
d) to issue fewer queries and updates to complete common operations.
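Point (c) deserves a concrete example: with the embedded schema, adding or changing a bookmark touches exactly one document, so the write is atomic without any extra machinery. A sketch using the structure from the question:

    // Adding a bookmark for user 123 is a single-document, atomic write.
    db.collection.update(
        { userId: 123 },
        { $push: { bookmarks: { bookmarkName: "bing", bookmarkUrl: "www.bing.com" } } }
    );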
When not to choose it: embedding related data in documents may lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation (remember the limit of 16 MB per document).
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation to be applied on all the documents.
The minimum set of stages required to get the aggregated result is $match and $group (with the $push operator):
db.collection.aggregate([
    { $match: { "userId": 123 } },
    {
        $group: {
            "_id": "$userId",
            "bookmarkNames": { $push: "$bookmarkName" },
            "bookmarkUrls": { $push: "$bookmarkUrl" }
        }
    }
])
or a find() which returns multiple documents to be iterated.
Whereas the embedded type allows us to fetch all of it with a simple find query:
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. We can view the first type as an unwound form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection, is normally used for logging, where the log entries are huge and have a TTL (time to live). Collections in that case would be capped collections (or use a TTL index), where documents are automatically deleted after a particular period of time.
Bottom line: if your document size will not grow beyond 16 MB at any particular time, opt for the embedded type; it saves development effort as well.
See Also: MongoDB relationships: embed or reference?

Atomicity of Model.create in Mongoose when passing an array of documents

So I understand that MongoDB (and by proxy Mongoose) does not support transactions, but that operations involving a single document are always atomic. In looking over the Mongoose docs, I ran into Model.create which allows one to pass an array of documents and store them in a single action, like so:
var array = [{ type: 'jelly bean' }, { type: 'snickers' }];
Candy.create(array, function (err, jellybean, snickers) {
    // ...
});
Is this action atomic? Does Mongo save all the documents at once, or does the Mongoose ODM loop through the array, saving one document at a time? Sources (or source code) would be greatly appreciated. (Also, I'm new, so please, don't shoot!)
The MongoDB Wire Protocol accepts either a single document or multiple documents with OP_INSERT. However, on the server they are still inserted one at a time.
In other words, if the server were to crash part-way through the insert, some documents would be inserted and others would not be. Within each document you are guaranteed a consistent view: either it's all inserted or it's not. But for multiple documents no such guarantee exists.
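As a hedged illustration with today's shell API (insertMany postdates this answer, but the semantics it exposes are the same): with a deliberately duplicated _id, the first document is stored and the second is rejected, showing there is no all-or-nothing guarantee across documents.

    try {
        db.candy.insertMany(
            [{ _id: 1, type: 'jelly bean' }, { _id: 1, type: 'snickers' }],
            { ordered: false }
        );
    } catch (e) {
        // The duplicate _id makes the second insert fail, but the first
        // document is already stored: inserts are atomic per document only.
        printjson(e);
    }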