MongoDB $lookup: Limitations & Usage

With the new aggregation pipeline stage $lookup, we are now able to perform 'left outer joins'.
At first glance, I want to immediately replace one of our denormalised collections with two separate collections and use $lookup to join them at query time. This would solve the problem of having to update a huge number of documents when necessary; we could then update just one document.
But surely this is too good to be true? This is a NoSQL, document database after all!
MongoDB's CTO also highlights his concerns:
We’re still concerned that $lookup can be misused to treat MongoDB
like a relational database. But instead of limiting its availability,
we’re going to help developers know when its use is appropriate, and
when it’s an anti-pattern. In the coming months, we will go beyond the
existing documentation to provide clear, strong guidance in this area.
What are the limitations of $lookup? Can I use them in real-time, operational querying of our data or should they be left for reporting, offline situations?

I share your enthusiasm for $lookup.
I think there are trade-offs. One of the major concerns with SQL databases (and one of the reasons for the genesis of NoSQL) is that at large scale, joins can take a lot of time (relatively speaking).
It definitely helps in giving you a declarative model for your data, but if you start to model your entire NoSQL database as though it's a database of rows and tables (just using refs, for example), then you begin modeling it as though it's simply a SQL database (to a degree). Even MongoDB mentions this (as you quoted in your question):
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database.
You mentioned:
This will solve the problem of having, when necessary, to update a huge number of documents. Now we can update just one document.
I'm not sure what your collections look like exactly, but that definitely sounds like it could be a good use for $lookup.
Can I use them in real-time, operational querying
I would say, again, it depends on your use-case. You'll have to compare:
Desired semantics of your queries (declarative vs imperative)
Whether modeling your data as more relational (and thus using $lookup) in certain circumstances is worth the potential trade-off in computational time (that's assuming that querying across collections is even something to be concerned about, computationally speaking)
etc...
I'm sure in the coming months we'll see perf tests of the "left outer joins" and perhaps MongoDB will start writing some posts about when $lookup is an antipattern.
Hope this answer helps add to the discussion.

First of all, MongoDB is a document-based database and always will be. So the $lookup aggregation pipeline stage, new in version 3.2, doesn't turn MongoDB into a relational database (RDBMS). As MongoDB's CTO mentioned:
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database.
The first limitation of $lookup as mentioned in the documentation is that it:
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing.
Which means that you can't use it with a sharded collection.
Also, the $lookup operator doesn't work directly with an array, so you will need a preliminary $unwind stage to denormalize the localField if it is an array.
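For illustration, here is a minimal sketch of such a pipeline, using hypothetical orders and products collections (the collection and field names are assumptions, not taken from the question):

    // $unwind flattens the "items" array so each element can serve as the localField,
    // then $lookup pulls in the matching product document from the joined collection.
    db.orders.aggregate([
      { $unwind: "$items" },
      { $lookup: {
          from: "products",        // must be an unsharded collection in the same database (as of 3.2)
          localField: "items",
          foreignField: "_id",
          as: "itemDetails"
      } }
    ])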
Now you said:
This will solve the problem of having, when necessary, to update a huge number of documents.
This is a good idea if your data are updated more often than they are read, as mentioned in 6 Rules of Thumb for MongoDB Schema Design: Part 3, especially if you have large hierarchical data sets.
Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
I believe that with careful schema design you probably will not need the $lookup operator.

Related

Are there technical downsides to using a single collection over multiple collections in MongoDB?

Since MongoDB is schemaless, I could just drop all my documents into a single collection, with a 'collection' key and an index on that key.
For example this:
db.getCollection('dogs').find()
db.getCollection('cars').find()
Would become this:
db.getCollection('all').find({'collection': 'dogs'})
db.getCollection('all').find({'collection': 'cars'})
Is there any technical downside to doing this?
There are multiple reasons to have different collections; maybe the two most important are:
Performance: even though MongoDB has been designed to be flexible, that doesn't remove the need for indexes on the fields that will be used during searches. You would see dramatically worse response times if the collection is too heterogeneous (see the sketch after this list).
Maintainability/evolvability: the design should be driven by the use cases (usually you'll store the data as it's received by the application), and it should be explicit to anyone looking at the database collections.
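To illustrate the performance point, here is a sketch of the index and query shapes under each approach; the 'all' and 'dogs' collections come from the question, while the 'name' field is purely a made-up example:

    // Single heterogeneous collection: every query must filter on the discriminator,
    // so indexes become compound and per-"type" query patterns multiply.
    db.getCollection('all').createIndex({ collection: 1, name: 1 })
    db.getCollection('all').find({ collection: 'dogs', name: 'Rex' })

    // Separate collections: the index stays simple and only covers dog documents.
    db.getCollection('dogs').createIndex({ name: 1 })
    db.getCollection('dogs').find({ name: 'Rex' })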
MongoDB University is a great e-learning platform, it is free and there is in particular this course:
M320: Data Modeling
Schema questions are often better understood by working backwards from the queries you'll rely on and how the data will get written. If you are going to query Field1 and Field2 together in one query statement, you do want them in the same collection. Dogs and cars don't sound very related, while dogs and cats do, so really look at how you're going to want to query. Joining collections is not really ideal; it's doable via $lookup, but not ideal.

What if there are a lot of collections?

Do many MongoDB collections have a big impact on MongoDB performance, memory and capacity? I am designing an API with the MVC pattern, and a collection is being created for each model. I'm questioning the way I am doing this now.
MongoDB with the WiredTiger engine supports an unlimited number of collections, so you are not going to run into any hard technical limitations.
When you wonder if something should be in one collection or in multiple collections, these are some of the considerations you need to keep in mind:
More collections = more maintenance work. Sharding is configured at the collection level, so having a large number of collections will make shard configuration a lot more work. You also need to set up indexes for each collection separately, but this is quite easy to automate, because calling createIndex on an index which already exists does nothing (see the sketch after these considerations).
The MongoDB API is designed in a way that every database query operates on one collection at a time. That means when you need to search for a document in n different collections, you need to perform n queries. When you need to aggregate data stored in multiple collections, you run into even more problems. So any data which is queried together should be stored together in the same collection.
Creating one collection for each class in your model is usually a good rule of thumb, but it is not a golden-hammer solution. There are situations where you want to embed objects in their parent object's document instead of putting them into a separate collection. There are also cases where you want to put all objects with the same base class in the same collection to benefit from MongoDB's ability to handle heterogeneous collections. But that goes beyond the scope of this question.
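As a sketch of that index automation (the index spec on 'createdAt' is purely illustrative), a script like this can simply be re-run on every deploy:

    // createIndex is idempotent: calling it for an index that already exists is a no-op,
    // so index setup can be repeated safely across all collections.
    db.getCollectionNames().forEach(function (name) {
      db.getCollection(name).createIndex({ createdAt: 1 });
    });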
Why don't you use this and test your application?
https://docs.mongodb.com/manual/tutorial/evaluate-operation-performance/
By the way, your question is not completely clear... it reads more like a "discussion" than a question. And you're asking others to evaluate your work instead of searching the web for the right approach.

How to merge two collections in mongodb

I have two collections called Company_Details and Company_Ranks; Comp_ID is common to the two collections. How do I merge these two collections to get the complete details of a company?
To make a long story short, you either do that on the client side or consider the benefits of embedding those documents.
MongoDB does not support joins, as opposed to relational databases. This is both a pro and a con: it has helped MongoDB's developers focus on scalability, which is much harder to implement when you have joins and transactions.
You can follow the DBRef specification. Lots of drivers support DBRef and do the composition seamlessly for you. You can even do that manually. But most importantly, you can take advantage of embedding documents.
Embedding documents is a unique ability MongoDB has over relational databases. It means you can create one collection consisting of compound documents. You'll enjoy atomicity, as there is no "partial success", and data locality: spinning disks are better at accessing data in sequence.
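As a rough sketch of what embedding could look like here (Comp_ID is taken from the question; the remaining field names and values are assumptions):

    // One compound document per company: the ranking data lives inside the company document.
    // A single read returns everything, and a single update stays atomic.
    db.Company_Details.insertOne({
      Comp_ID: 101,                      // hypothetical value
      name: "Acme Ltd",                  // hypothetical field
      ranks: [
        { year: 2015, rank: 3 },
        { year: 2016, rank: 1 }
      ]
    })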
If querying is your motive and you don't want to change your schema, then try Apache Drill, which allows you to query with SQL. Then perform a full join, inner join, etc., whatever you want. You can check out Drill with MongoDB.
With MongoDB version 3.2 and higher we now have the $lookup stage, which is the "same" as a join in an RDBMS.
With it you can easily query between your two collections and get the information you want.
For further details, check out the documentation:
https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
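For example, a minimal pipeline along these lines should work, assuming Comp_ID is stored under that name in both collections (as described in the question):

    // Left outer join: each Company_Details document gets a "ranks" array holding
    // the matching Company_Ranks documents (empty if there is no match).
    db.Company_Details.aggregate([
      { $lookup: {
          from: "Company_Ranks",
          localField: "Comp_ID",
          foreignField: "Comp_ID",
          as: "ranks"
      } }
    ])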

Denormalization vs Parent Referencing vs MapReduce

I have a highly normalized data model with me. Currently I'm using manual referencing by storing the _id and running sequential queries to fetch details from the deepest collection.
The referencing is one-way and the flow has around 5-6 collections. For one particular use case, I'm having to query down to the deepest collection by querying subsequent "_id" from the higher level collections. So technically I'm hitting the database every time I run a
db.collection_name.find({ _id: **** })
My prime goal is to optimize the read without hugely affecting the atomicity of the other collections. I have read about de-normalization and it does not make sense to me because I want to keep an option for changing the cardinality down the line and hence want to maintain a separate collection altogether.
I was initially thinking of using MapReduce to do an aggregation from the back and have a collection primarily for the particular use-case. But well even that does not sound that good.
In a relational db, I would be breaking the query in sub-queries and performing a join to get the data sets that intersect from the initial results. Since mongodb does not support joins, I'm having a tough time figuring anything out.
Please help if you have faced anything like this earlier or have any idea how to resolve it.
Denormalize your data.
MongoDB does not do JOINs, period.
There is no operation on the database which gets data from more than one collection. Not find(), not aggregate() and not MapReduce. When you need to puzzle your data together from more than one collection, there is no other way than doing it on the application layer. For that reason you should organize your data in a way that any common and performance-relevant query can be resolved by querying just a single collection.
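A minimal sketch of that application-layer composition in the shell, with placeholder collection and field names (parents, children and childIds are assumptions for illustration):

    // "Join" done by the application: fetch the parent document, then fetch its
    // children with a second query against the referenced _ids.
    var parent = db.parents.findOne({ _id: parentId });   // parentId: placeholder value
    var children = db.children.find({ _id: { $in: parent.childIds } }).toArray();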
In order to organize your data that way, you might have to create redundancies and transitive dependencies. This is normal in MongoDB.
When this feels "dirty" to you, then you should either accept the fact that your performance will be sub-optimal or use a different kind of database, like a classic relational database or a graph database.

Aggregate, Find, Group confusion?

I am building a web-based system for my organization using MongoDB. I have gone through the documentation provided by MongoDB and came to the following conclusions:
find: Cannot pull data from a sub-array.
group: Cannot work in a sharded environment.
aggregate: Best for sub-arrays, but has performance issues when the data set is large.
Map Reduce: Too risky to write the map and reduce functions.
So, can someone help me out with the best approach to work with sub-array documents in a production environment with a sharded cluster?
Example:
{"testdata":{"studdet":[{"id","name":"xxxx","marks",80}.....]}}
Now my "studdet" array is huge, with more than 1000 rows for each document.
So suppose my query is:
"Find all the "name" from "studdet" where marks is greater than 80"
It's definitely going to be an aggregation query. So is it feasible to go with aggregate in this case, given that "find" cannot do this and "group" will not work in a sharded environment? If I go with aggregate, what will be the performance impact? I need to run this query most of the time.
Please have a look at:
http://docs.mongodb.org/manual/core/data-modeling/
and
http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
These documents describe the decisions in creating a good document schema in MongoDB. That is one of the hardest things to do in MongoDB, and one of the most important. It will affect your performance etc.
In your case running a database that has a student collection with an array of grades looks to be the best bet.
{ _id: ..., ..., grades: [ { type: "test", grade: 80 }, ... ] }
In general, and given your sample data set, the aggregation framework is the best choice. The aggregation framework is faster than map-reduce in most cases (certainly in execution speed; it is C++ vs. JavaScript for map-reduce).
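As a sketch against the suggested grades-array schema (the collection name students is an assumption), the query from the question would look something like this:

    // Unwind the grades array, keep only grades above 80, and project the student names.
    db.students.aggregate([
      { $unwind: "$grades" },
      { $match: { "grades.grade": { $gt: 80 } } },
      { $project: { _id: 0, name: 1, "grades.grade": 1 } }
    ])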
If your data's working set becomes so large you have to shard, then aggregation, and everything else, will be slower. Not, however, slower than putting everything on a single machine that has a lot of page faults. Generally you need a working set larger than the RAM available on a modern computer for sharding to be the correct way to go, such that you can keep everything in RAM. (At this point a commercial support contract for MongoDB is going to cost less than the hardware, and that includes extensive help with schema design.)
If you need anything else please don’t hesitate to ask.