How to model very-many to very-many relations in MongoDB

I'm working on porting a database to MongoDB and have run into some problems with the document size limit.
My understanding is that if you're always going to view one entity in the context of another entity, embedding is the way to go.
However, the data (genomic) has so many entities of each type that even storing just the _id field in the embedded array puts me over the 16 MB document size limit:
Genome
{
...
has_reactions:[id1, id2, ... idn] // Where n is really large
}
I've also tried modelling it the other way, but hit the same limitation:
Reaction
{
...
in_genomes:[id1, id2, ... idn] // Still really large
}
The MongoDB documentation gives great examples for one-to-one and one-to-many relations, but doesn't have much to say about many-to-many.
In traditional SQL, I'd model this with a Genome, Reaction, and GenomeReaction set of tables. Is that the only way to go here as well?
Edit:
As for the background, reaction is a metabolic reaction, though it doesn't really matter what genomes and reactions mean in this context. It could just as well be a relationship between the types of gaskets in each of my widgets. It's a standard many-to-many relationship where both instances of "many" can be a very large number.
I'm aware that Mongo doesn't allow joins, but that's easily solved by using multiple queries, which is the recommended way of handling document references in Mongo.
We haven't chosen Mongo as a solution, we're just evaluating it as a possible solution. It looked attractive because it is billed as being able to handle "huMONGOus datasets", so I was a bit surprised by this limitation.
In all of our other use cases, Mongo has worked well. It's just this particular relationship that I'm unable to port from MySQL to Mongo without using a Genome, Reaction, and GenomeReaction set of collections. I can easily do this, but I was hoping that there was a more mongoy way to handle it.
Perhaps mongo doesn't handle many-to-many relationships well, which would explain its conspicuous absence from the list of data model scenarios in its docs.

After asking about this on the official MongoDB mailing list, I discovered that the recommended way to handle scenarios like this is either to use the three-collection mapping I mentioned in my original post, or to use a "hybrid schema design" where one of the collections is stored in buckets.
So you'd have something like:
// genomes collection
{
_id: 1,
genome_thingee: 'blah blah'
...
}
// reaction_buckets collection
{
_id: ObjectId(...),
genome_id: 1,
count: 100,
reactions: [
{ reaction_key1: value, reaction_key2: value },
{ reaction_key1: value, reaction_key2: value },
{ reaction_key1: value, reaction_key2: value },
{ reaction_key1: value, reaction_key2: value },
{ reaction_key1: value, reaction_key2: value },
...]
}
As you might imagine, there are all kinds of implications to this kind of model that your application has to take into account when adding or querying data.
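To make the write and read side of the bucket idea a bit more concrete, here is a minimal in-memory sketch in plain JavaScript. It is not driver code: the array stands in for the reaction_buckets collection, and the names `BUCKET_SIZE`, `addReaction`, and `reactionsForGenome` are assumptions made up for illustration.

```javascript
// Hypothetical in-memory stand-in for the reaction_buckets collection.
const BUCKET_SIZE = 3; // assumed cap per bucket; a real cap would be far larger

function addReaction(buckets, genomeId, reaction) {
  // Look for a bucket of this genome that still has room.
  let bucket = buckets.find(
    (b) => b.genome_id === genomeId && b.count < BUCKET_SIZE
  );
  if (!bucket) {
    // No open bucket: start a new one, so no single document grows without bound.
    bucket = { genome_id: genomeId, count: 0, reactions: [] };
    buckets.push(bucket);
  }
  bucket.reactions.push(reaction);
  bucket.count += 1;
  return bucket;
}

// Reading a genome's reactions means collecting them across all of its buckets.
function reactionsForGenome(buckets, genomeId) {
  return buckets
    .filter((b) => b.genome_id === genomeId)
    .flatMap((b) => b.reactions);
}

const buckets = [];
for (let i = 0; i < 7; i++) {
  addReaction(buckets, 1, { reaction_id: i });
}
console.log(buckets.length); // 3 buckets holding 3, 3, and 1 reactions
```

The point of the sketch is that both inserting and querying have to be bucket-aware, which is the extra application logic this schema asks for.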
While in the end this approach doesn't really appeal to me (and thus I decided to look at Neo4j at @Philipp's suggestion), I thought I'd post the solution in case anyone else needs to solve a similar problem and doesn't mind the hybrid/bucket approach.

Related

why does MongoDB recommend two-way referencing? Isn't it just circular referencing?

Reference material:
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-2
db.person.findOne()
{
_id: ObjectID("AAF1"),
name: "Kate Monster",
tasks: [ // array of references to Task documents
ObjectID("ADF9"),
ObjectID("AE02"),
ObjectID("AE73")
// etc
]
}
db.tasks.findOne()
{
_id: ObjectID("ADF9"),
description: "Write lesson plan",
due_date: ISODate("2014-04-01"),
owner: ObjectID("AAF1") // Reference to Person document
}
This tutorial post from MongoDB specifically encourages two-way referencing. As you can see, the Person document references the Task documents and vice versa.
I thought this was circular referencing, which should be avoided in most cases. The site didn't really explain why it's not a problem for MongoDB, though. Could someone please help me understand why this is possible in MongoDB when it is such a big no-no in SQL? I know it's more of a theoretical question, but I would like to implement this type of design in the database I'm working on if there is a compelling reason to.
It's only a circular reference if you make one out of it.
Meaning: let's say you want to serialize your Mongo document to a JSON string to display it in your browser. Instead of printing a bunch of IDs under the tasks section, you want to print the actual names. In this case you have to follow each ID and print the name.
However, if you now go into each of those objects and resolve the ID behind its owner field, you'll be printing your Person again. This could go on indefinitely if you program it that way. If you don't, it's just a bunch of IDs either way.
EDIT: depending on your implementation, IDs are not resolved automatically and thus cause no headache.
One more thing: depending on your data structure and performance considerations, it's sometimes easier to put the object directly into the parent document. Referencing the ID on both sides only makes sense in a many-to-many relationship.
HTH
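As a rough illustration of "it's only circular if you resolve it that way", here is a small plain-JavaScript sketch that resolves references exactly one level deep. The two Maps stand in for the person and tasks collections, `personWithTasks` is a hypothetical helper, and the description for task AE02 is invented for the example.

```javascript
// Hypothetical in-memory collections keyed by _id, mirroring the Person/Task example.
const people = new Map([
  ["AAF1", { _id: "AAF1", name: "Kate Monster", tasks: ["ADF9", "AE02"] }],
]);
const tasks = new Map([
  ["ADF9", { _id: "ADF9", description: "Write lesson plan", owner: "AAF1" }],
  ["AE02", { _id: "AE02", description: "Grade homework", owner: "AAF1" }], // made-up task
]);

// Resolve a person's task IDs into task documents, but stop there:
// each task's `owner` stays a plain ID, so the cycle is never followed.
function personWithTasks(personId) {
  const person = people.get(personId);
  return {
    ...person,
    tasks: person.tasks.map((taskId) => tasks.get(taskId)),
  };
}

const kate = personWithTasks("AAF1");
// Serializes without trouble because `owner` is still the string "AAF1".
console.log(JSON.stringify(kate));
```

Because the stored documents only hold IDs, nothing is circular until application code chooses to keep following references in both directions.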

Mongo for Meteor data design: opposite of normalizing?

I'm new to Meteor and Mongo. Really digging both, but want to get feedback on something. I am porting an app I made with Django over to Meteor and want to handle certain kinds of relations in a way that makes sense in Meteor. Granted, I am more used to thinking about things in a Postgres way. So here goes.
Let's say I have three related collections: Locations, Beverages and Inventories. For this question though, I will only focus on the Locations and the Inventories. Here are the models as I've currently defined them:
Location:
_id: "someID"
beverages:
_id: "someID"
fillTo: "87"
name: "Beer"
orderWhen: "87"
startUnits: "87"
name: "Second"
number: "102"
organization: "The Second One"
Inventories:
_id: "someID"
beverages:
0: Object
name: "Diet Coke"
units: "88"
location: "someID"
timestamp: 1397622495615
user_id: "someID"
But here is my dilemma: I often need to retrieve one or many Inventories documents and need to render the "fillTo", "orderWhen" and "startUnits" per beverage. Doing things the MongoDB way, it looks like I should actually be embedding these properties as I store each Inventory. But that feels really non-DRY (and dirty).
On the other hand, it seems like a lot of effort & querying to render a table for each Inventory taken. I would need to go get each Inventory, then lookup "fillTo", "orderWhen" and "startUnits" per beverage per location then render these in a table (I'm not even sure how I'd do that well).
TIA for the feedback!
If you only need this for rendering purposes (i.e. no further queries), then you can use the transform hook like this:
var myAwesomeCursor = Inventories.find({ /* selector */ }, {
  // transform is applied to each document before it is returned
  transform: function (doc) {
    _.each(doc.beverages, function (bev) {
      // use whatever method you want to receive these data,
      // possibly from some cache or even another collection
      // bev.fillTo = ...
      // bev.orderWhen = ...
      // bev.startUnits = ...
    });
    return doc; // the transform must return the (modified) document
  }
});
Now the myAwesomeCursor can be passed to each helper, and you're done.
In your case you might find denormalizing the inventories so they are a property of locations could be the best option, especially since they are a one-to-many relationship. In MongoDB and several other document databases, denormalizing is often preferred because it requires fewer queries and updates. As you've noticed, joins are not supported and must be done manually. As apendua mentions, Meteor's transform callback is probably the best place for the joins to happen.
However, the inventories may contain many beverage records and could cause the location records to grow too large over time. I highly recommend reading this page in the MongoDB docs (and the rest of the docs, of course). Essentially, this is a complex decision that could eventually have important performance implications for your application. Both normalized and denormalized data models are valid options in MongoDB, and both have their pros and cons.

Server side paging and grouping of large dataset

I'll try to explain the issue as best I can. I need to implement a grid with server-side paging. On a request for N entities, the DB should return a set of data that is grouped, or better said transformed, in such a way that the transformation phase results in those N entities.
The best way I can see is something like this:
Query_all_data() => Result; (10000000 documents)
Transform(Result) => Transformed (100 groups)
Transformed.Skip(N).Take(N)
Transformation phase should be something like this:
Result = [d0, d1, d2..., dN]
Transformed = [
{ info: "foo", docs: [d0, d2, d21, d67, d100042] },
{ info: "bar", docs: [d3, d28, d121, d6271, d100042] },
{ info: "baz", docs: [d41, d26, d221, d567, d100043] },
{ info: "waz", docs: [d22, d24, d241, d167, d1000324] }
]
Every object in Transformed is an entity in grid.
I'm not sure if it's important, but the DB in question is MongoDB and all documents are stored in one collection. Now, the huge pitfall of this approach is that it's way too slow on a large dataset, which will most certainly be the case.
Is there a better approach? Maybe a different DB design?
@dakt, you can store your data in a couple of different ways, depending on how you are going to use it. It may also be useful to store the data in denormalized form, in which case some duplication may occur.
1. Store the data as individual documents, as described in your problem statement.
2. Store the data in the transformed format from your problem statement. It looks like you have a consistent way of mapping the docs to some tag. If so, why not maintain documents such that they are always embedded under those tags? This is certainly limited in the number of docs you can hold per tag by the 16 MB document limit.
I would suggest looking at the MongoDB use-cases - http://docs.mongodb.org/ecosystem/use-cases/ and see if any of those are similar to what you are trying to achieve.
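A further option, not part of the answer above, is to let the database do the grouping and paging with an aggregation pipeline instead of pulling all documents into the application. The sketch below only builds the pipeline as plain JavaScript data; the grouping field `info` is taken from the Transformed example, but whether the source documents actually carry such a field is an assumption, as is the helper name `buildGroupedPagePipeline`.

```javascript
// Build an aggregation pipeline that groups first and then pages over the groups.
// `groupField` is the (assumed) field each document is grouped by.
function buildGroupedPagePipeline(groupField, page, pageSize) {
  return [
    // Collect only the _id of each member to keep the grouped documents small.
    { $group: { _id: "$" + groupField, docs: { $push: "$_id" } } },
    { $sort: { _id: 1 } }, // deterministic order so pages do not overlap
    { $skip: page * pageSize },
    { $limit: pageSize },
  ];
}

const pipeline = buildGroupedPagePipeline("info", 2, 25);
console.log(JSON.stringify(pipeline[2])); // {"$skip":50}
// In the shell or a driver, the array would be passed to aggregate() on the
// collection; the collection name is not given in the question.
```

Each grouped result is still a document and therefore still bound by the 16 MB limit, which is why the sketch pushes only IDs rather than whole documents.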

Ensure data coherence across documents in MongoDB

I'm working on a Rails app that implements some social network features such as relationships, following, etc. Everything was fine until I ran into a problem with many-to-many relations. As you know, Mongo lacks joins, so the recommended workaround is to store the relation as an array of ids on both related documents. OK, it's a bit redundant, but it should work. Let's say:
field :followers, type: Array, default: []
field :following, type: Array, default: []

# self starts following `who`: two documents change, so two saves are needed
def follow!(who)
  self.following << who.id
  who.followers << self.id
  self.save
  who.save
end
That works pretty well, but this is one of those cases where we would need a transaction, uh, but mongo doesn't support transactions. What if the id is added to the 'followed' followers list but not to the 'follower' following list? I mean, if the first document is modified properly but the second for some reason can't be updated.
Maybe I'm too pessimistic, but isn't there a better solution?
I would recommend storing relationships in only one direction: store the users someone follows in their user document as "following". Then, if you need all followers of user U1, you can query the users collection with { following: "U1" }. Since you can have a multikey index on an array field, this query will be fast if you index this field.
The other reason to go in that direction only is that a single user has a practical limit on how many different users they may be following, but the number of followers that a really popular user may have could be close to the total number of users in your system. You want to avoid creating an array in a document that could be that large.
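The one-direction idea can be sketched in plain JavaScript as follows. The `users` array is a hypothetical stand-in for the users collection, and `follow` and `followersOf` are illustrative helper names; in MongoDB itself the lookup in `followersOf` corresponds to the query { following: "U1" } on an indexed array field.

```javascript
// Hypothetical user documents: each stores only the IDs it follows.
const users = [
  { _id: "U1", following: [] },
  { _id: "U2", following: ["U1"] },
  { _id: "U3", following: ["U1", "U2"] },
];

// Following someone touches exactly one document, so there is no second
// write that could fail and leave the two sides out of sync.
function follow(users, followerId, followedId) {
  const follower = users.find((u) => u._id === followerId);
  if (!follower.following.includes(followedId)) {
    follower.following.push(followedId);
  }
}

// Followers are derived by searching the `following` arrays.
function followersOf(users, userId) {
  return users.filter((u) => u.following.includes(userId)).map((u) => u._id);
}

follow(users, "U1", "U3");
console.log(followersOf(users, "U1")); // [ 'U2', 'U3' ]
console.log(followersOf(users, "U3")); // [ 'U1' ]
```

The trade-off is that the follower list is computed on read instead of being stored, which is what removes the need for a two-document update.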

When to embed documents in MongoDB

I'm trying to figure out how best to design MongoDB schemas. The MongoDB documentation recommends relying heavily on embedded documents for improved querying, but I'm wondering if my use case actually justifies referenced documents.
A very basic version of my current schema is basically:
(Apologies for the pseudo-format, I'm not sure how to express Mongo schemas)
users {
email (string)
}
games {
user (reference user document)
date_started (timestamp)
date_finished (timestamp)
mode (string)
score: {
total_points (integer)
time_elapsed (integer)
}
}
Games are short (about 60 seconds long) and I expect a lot of concurrent writes to be taking place.
At some point, I'm going to want to calculate a high score list, and possibly in a segregated fashion (e.g., high score list for a particular game.mode or date)
Are embedded documents the best approach here? Or is this truly a problem that relations solve better? How would these use cases best be solved in MongoDB?
... is this truly a problem that relations solve better?
The key here is less about "is this a relation?" and more about "how am I going to access this?"
MongoDB is not "anti-reference". MongoDB does not have the benefits of joins, but it does have the benefit of embedded documents.
As long as you understand these trade-offs then it's perfectly fair to use references in MongoDB. It's really about how you plan to query these objects.
Are embedded documents the best approach here?
Maybe. Some things to consider.
Do games have value outside of the context of the user?
How many games will a single user have?
Are games transactional in nature?
How are you going to access games? Do you always need all of a user's games?
If you're planning to build leaderboards and a user can generate hundreds of game documents, then it's probably fair to have games in their own collection. Storing ten thousand instances of "game" inside each user document isn't particularly useful.
But depending on your answers to the above, you could really go either way. As the litmus test, I would try running some Map / Reduce jobs (i.e. build a simple leaderboard) to see how you feel about the structure of your data.
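As a quick way to try that litmus test without a database, here is a plain-JavaScript sketch of a per-mode leaderboard over game documents kept in their own collection. The sample documents and the `leaderboard` helper are made up for illustration and follow the schema from the question; against MongoDB the same result would typically come from a query filtered on mode, sorted by score.total_points descending, and limited.

```javascript
// Hypothetical game documents following the schema in the question.
const games = [
  { user: "u1", mode: "classic", score: { total_points: 120, time_elapsed: 58 } },
  { user: "u2", mode: "classic", score: { total_points: 300, time_elapsed: 60 } },
  { user: "u1", mode: "blitz", score: { total_points: 90, time_elapsed: 30 } },
  { user: "u3", mode: "classic", score: { total_points: 210, time_elapsed: 55 } },
];

// Top `limit` games for one mode, highest score first.
function leaderboard(games, mode, limit) {
  return games
    .filter((g) => g.mode === mode) // filter returns a copy, so sort is safe
    .sort((a, b) => b.score.total_points - a.score.total_points)
    .slice(0, limit)
    .map((g) => ({ user: g.user, points: g.score.total_points }));
}

const top = leaderboard(games, "classic", 2);
console.log(top); // [ { user: 'u2', points: 300 }, { user: 'u3', points: 210 } ]
```

With games embedded inside users, the same leaderboard would first have to unpack every user's games, which is one way to feel the difference between the two layouts.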
Why would you use a relation here? If 'email' is the only user property, then denormalizing and using an embedded document would be perfectly fine. If the user object contains other information, I would go for a reference.
I think you should use the "entity" and "value object" definitions from DDD: use a reference for an entity, but embed a value object.
You can also denormalize your object, meaning that you duplicate the part of the data you need, e.g.:
// root document
game
{
//duplicate part that you need of root user
user: { FirstName: "Some name", Id: "some ID"}
}
// root document
user
{
Id:"ID",
FirstName:"someName",
LastName:"last name",
...
}