Two MongoDB collections in one query

I have a big collection of clients and a huge collection of the clients' data. The collections are separated and I don't want to combine them into a single collection (because of the other servlets that already work against them), but now I need to "join" data from both collections into a single result.
Since the query should return a large number of results, I don't want to query the server once and then use the result to query again. I'm also concerned about the traffic between the server and the DB, and the memory that the result set will occupy in the server's RAM.
The way it works now is that I get the relevant client list from the 'clients' collection, send this list to the query on the 'client data' collection, and only then get the aggregated results.
I want to cut out fetching the client list and sending it right back to the server; I want the database to do this itself, i.e. let the query on the client data collection ask the clients collection for the relevant client list.
How can I use a stored procedure (JavaScript function) to run the query in the DB and return only the relevant clients from the collection?
Alternatively, is there a way to write a query that joins results from another collection?

"Good news everyone", this aggregation query work just fine in the mongo shell as a join query
db.clientData.aggregate([
    {
        $match: {
            id: {
                $in: db.clients.distinct("_id", { "tag": "qa" })
            }
        }
    },
    {
        $group: {
            _id: "$computerId",
            total_usage: {
                $sum: "$workingTime"
            }
        }
    }
]);

The key idea with MongoDB data modelling is to be write-heavy, not read-heavy: store the data in the format that you need for reading, not in some format that minimizes/avoids redundancy (i.e. use a de-normalized data model).
I don't want to combine them to a single collection
That's not a good argument
I'm also concerned about the traffic between the server and the DB [...]
If you need the data, you need the data. How does the way it is queried make a difference here?
[...] and the memory that the result set will occupy in the server RAM.
Is the amount of data so large that you want to stream it from the server to the client, such that it is transferred in chunks? How much data are we talking about, and why does the client read it all?
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection
There are no stored procedures in MongoDB, but you can use server-side map/reduce to 'join' collections. Generally, code that is stored in and run by the database is a violation of the separation of concerns between layers. I consider it one of the ugliest hacks of all time - but that's debatable.
Also, less debatable, keep in mind that M/R has huge overhead in MongoDB and is not geared towards real-time queries made e.g. in a web server call. These calls will take hundreds of milliseconds.
Is there a way to write a query that joins result from another collection ?
No, operations are constrained to a single collection. You can perform a second query and use the $in operator there, however, which is similar to a subselect and reasonably fast, but of course requires two round-trips.
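For illustration, here is a sketch of those two round-trips in the mongo shell, reusing the collections and fields from the question:
// Round-trip 1: fetch the matching client ids on the application side.
var qaClientIds = db.clients.distinct("_id", { "tag": "qa" });

// Round-trip 2: feed them into $in, similar to a subselect.
db.clientData.aggregate([
    { $match: { id: { $in: qaClientIds } } },
    { $group: { _id: "$computerId", total_usage: { $sum: "$workingTime" } } }
]);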

How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection?
There are no stored procedures in MongoDB.
Alternatively, is there a way to write a query that joins results from another collection?
You normally don't need joins in MongoDB; there is no such thing. The flexibility of the document model already covers the typical need for joins. Thinking about your document model, and designing joins out of your schema, should always be your first port of call. As an alternative, you may need to use aggregation or server-side Map-Reduce to handle this.

First of all, mnemosyn and Michael9 are right. But if I were in your shoes, also assuming that the client data collection holds one document per client, I would store the document ID of the client data document in the client document to make the "join" (still no joins in Mongo) easier.
If you have several client data documents per client, then store an array of document IDs.
But none of this saves you from having to implement the "join" in your application code; if it's a Rails app, then probably in your controller.
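A minimal sketch of that reference layout in the mongo shell; the field and collection names here are made up for the example:
// Client document referencing its client-data documents by id.
db.clients.insert({
    _id: "client1",
    name: "Some client",
    clientDataIds: ["cd1", "cd2"]   // an array if there are several
});

// The "join" happens in application code: read the client first,
// then fetch all referenced documents in a single $in query.
var client = db.clients.findOne({ _id: "client1" });
db.clientData.find({ _id: { $in: client.clientDataIds } });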

Related

Is it encouraged to query a MongoDB database in a loop?

In an SQL database, if I wanted to access some sort of nested data, such as a list of tags or categories for each item in a table, I'd have to use some obscure form of joining in order to send the SQL query once and then only loop through the result cursor.
My question is, in a NoSQL database such as MongoDB, is it OK to query the database repeatedly such that I can do the previous task as follows:
cursor = query for all items
for each item in cursor do
    tags = query for item's tags
I know that I can store the tags in an array in the item's document, but I'm assuming that it is somehow not possible to store everything inside the same document. If that is the case, would it be expensive to requery the database repeatedly or is it designed to be used that way?
No: neither in Mongo, nor in any other database, should you query the database in a loop. And one good reason for this is performance: in most web apps the database is the bottleneck, and devs try to make as few db calls as possible, whereas here you are trying to make as many as possible.
In Mongo you can do what you want in many ways. Some of them are:
putting your tags inside the document {itemName : 'item', tags : [1, 2, 3]}
knowing the list of elements, you do not need a loop to find information about them. You can fetch all results in one query with $in, as sketched below: db.tags.find({ field: { $in: [<value1>, <value2>, ... <valueN> ] }})
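Here is a sketch of that second approach in the mongo shell, assuming items store their tag ids in a tags array:
// Instead of one tags query per item, collect the tag ids first...
var tagIds = [];
db.items.find().forEach(function (item) {
    item.tags.forEach(function (t) {
        if (tagIds.indexOf(t) === -1) tagIds.push(t);   // keep ids unique
    });
});

// ...then fetch every tag in a single round-trip with $in.
db.tags.find({ _id: { $in: tagIds } });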
You should always try to fulfill a request with as few queries as possible. Keep in mind that each query, even when the database can answer it entirely from cache, requires a network roundtrip between application server, database and back.
Even when you assume that both servers are in the same datacenter and only have a latency of microseconds, these latency times will add up when you query for a large number of documents.
Relational databases solve this issue with the JOIN command. Unfortunately, MongoDB has no support for joins. For that reason you should try to build your documents in such a way that the most common queries can be answered from a single document. That means you should denormalize your data. When you have a 1:n relation, you should consider embedding the referencing documents as an array in the main document. Having redundancies in your database is usually not as unacceptable in MongoDB as it is in relational databases.
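For example, a minimal sketch of the embedded variant, with made-up field names; one query returns the item together with its tags:
// 1:n relation embedded as an array in the main document; the
// redundancy buys you a single-query, single-round-trip read.
db.items.insert({ name: "item", tags: ["news", "tech", "mongodb"] });
db.items.findOne({ name: "item" });   // tags come back with the item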
When you still have good reasons to keep the child-documents as separate documents, you should use a query with the $in operator to query them all at once, as Salvador Dali suggested in his answer.

What if I need to sort grouped data on the server?

MongoDB 2.0, C# driver 1.6rc: is there any way to sort the results of data aggregations (group or map-reduce) on the server side? Let's say that as a result of grouping we have many thousands of records, which would be much faster to sort on the server side. All I found on the official MongoDB web site is this comment: "To order the grouped data, simply sort it client-side upon return." (Aggregation). Does it mean server-side sorting is not supported for such cases?
The results of group() are returned as a single BSON object, so sorting must take place client-side. The output of map-reduce, on the other hand, can be placed into a collection, which you can subsequently query and sort server-side.
Output options for MR:
http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Outputoptions
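A hypothetical shell sketch of that approach; the map/reduce pair and field names are made up:
// Placeholder map/reduce pair: sum one stat per record.
var mapFn = function () { emit(this.record_id, this.stat_value); };
var reduceFn = function (key, values) { return Array.sum(values); };

// Write the output into a collection instead of returning it inline...
db.records.mapReduce(mapFn, reduceFn, { out: "mr_results" });

// ...then query and sort that collection server-side.
db.mr_results.find().sort({ value: -1 });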

SQL view in MongoDB

I am currently evaluating MongoDB for a project I have started, but I can't find any information on what the equivalent of an SQL view in MongoDB would be. What I need, which an SQL view provides, is to lump together data from different tables (collections) into a single collection.
I want nothing more than to clump some documents together and label them as a single document. Here's an example:
I have the following documents:
cc_address
us_address
billing_address
shipping_address
But in my application, I'd like to see all of my addresses and be able to manage them in a single document.
In other cases, I may just want a couple of fields from collections:
I have the following documents:
fb_contact
twitter_contact
google_contact
reddit_contact
Each of these documents has fields that align, like firstname, lastname and email, but they also have fields that don't align. I'd like to be able to compile them into a single document that contains only the fields that align.
This can be accomplished with views in SQL, correct? Can I accomplish this kind of functionality in MongoDB?
The question is quite old already. However, since MongoDB v3.2 you can use $lookup to join data from different collections together, as long as the collections are unsharded.
Since MongoDB v3.4 you can also create read-only views.
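For illustration, here is a $lookup stage and a read-only view in the mongo shell; the collection and field names are made up:
// Join each order with its customer (available since 3.2; the
// joined collection must be unsharded).
db.orders.aggregate([
    { $lookup: {
        from: "customers",          // collection to join against
        localField: "customerId",   // field in orders
        foreignField: "_id",        // field in customers
        as: "customer"              // joined docs land in this array
    } }
]);

// Read-only view over the same pipeline (available since 3.4).
db.createView("ordersWithCustomers", "orders", [
    { $lookup: { from: "customers", localField: "customerId",
                 foreignField: "_id", as: "customer" } }
]);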
There are no "joins" in MongoDB. As said by JonnyHK, you can either enormalize your data or you use embedded documents or you perform multiple queries
However, you could also use Map-Reduce.
Or, if you're prepared to use the development branch, you could test the new aggregation framework, though maybe that's too much. This new framework will be in the soon-to-be-released 2.2, which is production-ready, unlike 2.1.x.
Here's the SQL-to-Mongo mapping chart as well, which may be of some help in your learning.
Update: Based on your re-edit, you don't need Map-Reduce or the Aggregation Framework because you're just querying.
You're essentially doing joins, querying multiple documents and merging the results. The place to do this is within your application on the client-side.
MongoDB queries never span more than a single collection as there is no support for joins. So if you have related data you need available in the results of a query you must either add that related data to the collection you're querying (i.e. denormalize your data), or make a separate query for it from another collection.
I am currently evaluating MongoDB for a project I have started, but I can't find any information on what the equivalent of an SQL view in MongoDB would be
In addition to this answer, MongoDB now has on-demand materialized views. In a nutshell, this feature lets you use aggregate and $merge (in 4.2) to create/update a quick view collection that you can query faster. The strategy is to update the quick view collection whenever the main collection has a record change. Unlike SQL, this has the side effect of increasing your data storage size, but the benefits can be huge depending on your querying needs.
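A sketch of such an on-demand materialized view for the addresses example above; all names here are made up:
// Re-run this pipeline whenever the main collection changes: it
// groups all address documents per user and upserts the result
// into a quick view collection that can be queried directly.
db.addresses.aggregate([
    { $group: { _id: "$userId", addresses: { $push: "$$ROOT" } } },
    { $merge: {                      // available since 4.2
        into: "userAddressView",
        whenMatched: "replace",
        whenNotMatched: "insert"
    } }
]);

db.userAddressView.find({ _id: "user123" });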

MongoDB data storage performance - one doc with items in an array vs multiple docs per item

I have statistical data in a Mongodb collection saved for each record per day.
For example my collection looks roughly like
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance goes, is this strategy superior to storing one document per record_id containing an array of statistical data, just like the above dict:
{ _id: record_id, stats: [
    { date: Date(2011,12,11), stat_value_1: 39884, stat_value_2: 98765 },
    { date: Date(2011,12,12), stat_value_1: 38555, stat_value_2: 4665 },
    { date: Date(2011,12,13), stat_value_1: 12345, stat_value_2: 265 },
]}
On the pro side, I will need one query to get the entire stat history of a record without resorting to the slower map-reduce method; on the con side, I'll have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise, there's some disk reallocation that will go on.
I think this depends on the usage scenario. If the data set for a single aggregation is small, like those 700 records, and you want to do this in real time, I think it's best to choose yet another option: query all individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain, and it does not suffer from reallocation or size limits. Index use should be efficient, and connection-wise I doubt there's much of a difference: most drivers batch transfers anyway.
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per record would go down in the subdocument approach. It's also generally easier to work with top-level documents than with subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.
I think you can refer to the discussion here, and also see how foursquare solved this kind of problem here; both are valuable.

Moving messaging schema to MongoDB

I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to the Message table; a record is added to the MessageSent table and a record per recipient is added to the MessageInbox table. MessageCount is used to keep track of the number of messages in the inbox/sent folders and is maintained by insert/delete triggers on MessageInbox/MessageSent; this way I always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query a member's messages, I join to the Member table to get the member's FirstName/LastName.
Now I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as the sender in some messages and as part of the recipients in others; how do I update it in an optimal way?
How do I get message counts? There will be tons of requests at the same time, so it has to perform well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema, just using it as an example of my approach)
{
    _id: "msg1234",
    from: "user1234",
    to: "user5678",
    subject: "This is the subject",
    body: "This is the body"
}
I would query the database to get all the messages I need, then in my application I would iterate over the results and build an array of user IDs. I would filter this array to be unique and then query the database a second time, using the $in operator to find any user in the given array.
Then in my application, I would join the results back to the object.
It requires two queries to the database (or potentially more if you want to join other collections), but this illustrates something that many people have been advocating for a long time: do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers more quickly and cheaply than your database anyway.
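A sketch of that application-layer join in mongo-shell JavaScript, using the example message object from above (the users collection name is assumed):
// 1) Fetch the messages.
var messages = db.messages.find({ to: "user5678" }).toArray();

// 2) Build a unique array of the referenced user ids.
var userIds = [];
messages.forEach(function (m) {
    if (userIds.indexOf(m.from) === -1) userIds.push(m.from);
});

// 3) One $in query for all users, then merge in the application.
var usersById = {};
db.users.find({ _id: { $in: userIds } }).forEach(function (u) {
    usersById[u._id] = u;
});
messages.forEach(function (m) {
    m.fromUser = usersById[m.from];   // attach the joined user doc
});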
I am using this pattern to create real-time activity feeds in my application, and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database MongoDB may need to rewrite the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, it would be a disaster.
Additionally, writes in MongoDB are blocking, so if a scenario like the one I've just described were to happen, all reads and writes would be blocked until the write operation completed. I believe this is scheduled to be addressed in some capacity in the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.