Mongo for Meteor data design: opposite of normalizing?

Mongo for Meteor data design: opposite of normalizing? - mongodb

I'm new to Meteor and Mongo. Really digging both, but want to get feedback on something. I am digging into porting an app I made with Django over to Meteor and want to handle certain kinds of relations in a way that makes sense in Meteor. Given, I am more used to thinking about things in a Postgres way. So here goes.
Let's say I have three related collections: Locations, Beverages and Inventories. For this question though, I will only focus on the Locations and the Inventories. Here are the models as I've currently defined them:
Location:
_id: "someID"
beverages:
_id: "someID"
fillTo: "87"
name: "Beer"
orderWhen: "87"
startUnits: "87"
name: "Second"
number: "102"
organization: "The Second One"
Inventories:
_id: "someID"
beverages:
0: Object
name: "Diet Coke"
units: "88"
location: "someID"
timestamp: 1397622495615
user_id: "someID"
But here is my dilemma, I often need to retrieve one or many Inventories documents and need to render the "fillTo", "orderWhen" and "startUnits" per beverage. Doing things the Mongodb way it looks like I should actually be embedding these properties as I store each Inventory. But that feels really non-DRY (and dirty).
On the other hand, it seems like a lot of effort & querying to render a table for each Inventory taken. I would need to go get each Inventory, then lookup "fillTo", "orderWhen" and "startUnits" per beverage per location then render these in a table (I'm not even sure how I'd do that well).
TIA for the feedback!

If you only need this for rendering purposes (i.e. no further queries), then you can use the transform hook like this:
var myAwesomeCursor = Inventories.find(/* selector */, {
transform: function (doc) {
_.each(doc.beverages, function (bev) {
// use whatever method you want to receive these data,
// possibly from some cache or even another collection
// bev.fillTo = ...
// bev.orderWhen = ...
// bev.startUnits = ...
}
}
});
Now the myAwesomeCursor can be passed to each helper, and you're done.

In your case you might find denormalizing the inventories so they are a property of locations could be the best option, especially since they are a one-to-many relationship. In MongoDB and several other document databases, denormalizing is often preferred because it requires fewer queries and updates. As you've noticed, joins are not supported and must be done manually. As apendua mentions, Meteor's transform callback is probably the best place for the joins to happen.
However, the inventories may contain many beverage records and could cause the location records to grow too large over time. I highly recommend reading this page in the MongoDB docs (and the rest of the docs, of course). Essentially, this is a complex decision that could eventually have important performance implications for your application. Both normalized and denormalized data models are valid options in MongoDB, and both have their pros and cons.

Related

How to avoid inconsistent embedded documents

Having a bit of trouble understanding when and why to use embedded documents in a mongo database.
Imagine we have three collections: users, rooms and bookings.
I have a few questions about a situation like this:
1) How would you update the embedded document? Would it be the responsibility of the application developer to find all instances of kevin as a embedded document and update it?
2) If the solution is to use document references, is that as heavy as a relational db join? Is this just a case of the example not being a good fit for Mongo?
As always let me know if I'm being a complete idiot.

Imho, you overdid it. Given the question from you use cases are
For a given reservation, what room is booked by which user?
For a given user, what are his or her details?
How many beds does a given room provide?
I would go with the following model for rooms
{
_id: 1001,
beds: 2
}
for users
{
_id: new ObjectId(),
username: "Kevin",
mobile:"12345678"
}
and for reservations
{
_id: new ObjectId(),
date: new ISODate(),
user: "Kevin",
room: 1001
}
Now in a reservation overview, you can have all relevant information ("who", "when" and "which") by simply querying reservations, without any overhead to answer the first question from you use cases. In a reservation details view, admittedly you would have to do two queries, but they are lightning fast with proper indexing and depending on your technology can be done asynchronously, too. Note that I saved an index by using the room number as id. How to answer the remaining questions should be obvious.
So as per your original question: embedding is not necessary here, imho.

Many to many relationship on Mongodb based e-learning webapp?

I am relatively new to No-SQL databases. I am designing a data structure for an e-learning web app. There would be X quantity of courses and Y quantity of users.
Every user will be able to take any number of courses.
Every course will be compound of many sections (each section may be a video or a quiz).
I will need to keep track of every section a user takes, so I think the whole course should be part of the user set (for each user), like so:
{
_id: "ed",
name: "Eduardo Ibarra",
courses: [
{
name: "Node JS",
progress: "100%",
section: [
{name: "Introdiction", passed:"100%", field3:"x", field4:""},
{name: "Quiz 1", passed:"75%", questions:[...], field3:"x", field4:""},
]
},
{
name: "MongoDB",
progress: "65%",
...
}
]
}
Is this the best way to do it?

I would say that design your database depending upon your queries. One thing is for sure.. You will have to do some embedding.
If you are going to perform more queries on what a user is doing, then make user as the primary entity and embed the courses within it. You don't need to embed the entire course info. The info about a course is static. For ex: the data about Node JS course - i.e. the content, author of the course, exercise files etc - will not change. So you can keep the courses' info separately in another collection. But how much of the course a user has completed is dependent on the individual user. So you should only keep the id of the course (which is stored in the separate 'course' collection) and for each user you can store the information that is related to that (User, Course) pair embedded in the user collection itself.
Now the most important question - what to do if you have to perform queries which require 'join' of user and course collections? For this you can use javascript to first get the courses (and maybe store them in an array or list etc) and then fetch the user for each of those courses from the courses collection or vice-versa. There are a few drivers available online to help you accomplish this. One is UnityJDBC which is available here.
From my experience, I understand that knowing what you are going to query from MongoDB is very helpful in designing your database because the NoSQL nature of MongoDB implies that you have no correct way for designing. Every way is incorrect if it does not allow you in accomplishing your task. So clearly, knowing beforehand what you will do (i.e. what you will query) with the database is the only guide.

How to model very-many to very-many relations in MongoDB

I'm working on porting a database to MongoDB and have run into some problems with the document size limit.
My understanding is that if you're going to always view one entity in the context of another entity, that embedding is the way to go.
However the data (genomic) has so many entities of each type, that even just storing the _id field in the embedded document puts me over the 16 MB size limit:
Genome
{
...
has_reactions:[id1, id2, ... idn] // Where n is really large
}
I've also tried modelling it the other way, but hit the same limitation:
Reaction
{
...
in_genomes:[id1, id2, ... idn] // Still really large
}
The MongoDB documentation gives great examples for one-to-one, and one-to-many relations, but doesn't have much to say on many-to-many.
In traditional SQL, I'd model this with a Genome, Reaction, and GenomeReaction set of tables. Is that the only way to go here as well?
Edit:
As for the background, reaction is a metabolic reaction, though it doesn't really matter what genomes and reactions mean in this context. It could just as well be a relationship between the types of gaskets in each of my widgets. It's a standard many-to-many relationship where both instances of "many" can be a very large number.
I'm aware that Mongo doesn't allow joins, but that's easily solved with using multiple queries, which is the recommended way of handling document references in Mongo.
We haven't chosen Mongo as a solution, we're just evaluating it as a possible solution. It looked attractive because it is billed as being able to handle "huMONGOus datasets", so I was a bit surprised by this limitation.
In all of our other use cases, Mongo has worked well. It's just this particular relationship that I'm unable to port from mysql to mongo without using a Genome, Reaction, and GenomeReaction set of collections. I can easily do this, but I was hoping that there was a more mongoy way to handle it.
Perhaps mongo doesn't handle many-to-many relationships well, which would explain its conspicuous absence from the list of data model scenarios in its docs.

After asking about this on the official mongo-db mailing list, I discovered that the recommended way to handle scenarios like this is to either use the three collection mapping I mentioned in my original post, or to use a "hybrid schema design" where one of the collections is stored in buckets.
So you'd have something like:
// genomes collection
{
_id: 1,
genome_thingee: 'blah blah'
...
}
// reaction_buckets collection
{
_id: ObjectId(...),
genome_id: 1,
count: 100,
reactions: [
{ reaction-key1: value, reaction-key 2: value},
{ reaction-key1: value, reaction-key 2: value},
{ reaction-key1: value, reaction-key 2: value},
{ reaction-key1: value, reaction-key 2: value},
{ reaction-key1: value, reaction-key 2: value},
...]
}
As you might imagine, there are all kinds of implications to this kind of model that your application has to take into account when adding or querying data.
While in the end this approach doesn't really appeal to me (and thus I decided to look at Neo4j at #Philipp's suggestion), I thought I'd post the solution in case anyone else needs to solve a similar problem and doesn't mind the hybrid/bucket approach.

MongoDB database schema design

I have a website with 500k users (running on sql server 2008). I want to now include activity streams of users and their friends. After testing a few things on SQL Server it becomes apparent that RDMS is not a good choice for this kind of feature. it's slow (even when I heavily de-normalized my data). So after looking at other NoSQL solutions, I've figured that I can use MongoDB for this. I'll be following data structure based on activitystrea.ms
json specifications for activity stream
So my question is: what would be the best schema design for activity stream in MongoDB (with this many users you can pretty much predict that it will be very heavy on writes, hence my choice of MongoDB - it has great "writes" performance. I've thought about 3 types of structures, please tell me if this makes sense or I should use other schema patterns.
1 - Store each activity with all friends/followers in this pattern:
{
_id:'activ123',
actor:{
id:person1
},
verb:'follow',
object:{
objecttype:'person',
id:'person2'
},
updatedon:Date(),
consumers:[
person3, person4, person5, person6, ... so on
]
}
2 - Second design: Collection name- activity_stream_fanout
{
_id:'activ_fanout_123',
personId:person3,
activities:[
{
_id:'activ123',
actor:{
id:person1
},
verb:'follow',
object:{
objecttype:'person',
id:'person2'
},
updatedon:Date(),
}
],[
//activity feed 2
]
}
3 - This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:
{ _id: "123",
actor: { person: "UserABC" },
verb: "follow",
object: { person: "someone_else" },
updatedOn: Date(...)
}
And then, for followers, I would have the following "notifications" documents:
{ activityId: "123", consumer: "someguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "thirdguy", updatedOn: Date(...) }
Your answers are greatly appreciated.

I'd go with the following structure:
Use one collection for all actions that happend, Actions
Use another collection for who follows whom, Subscribers
Use a third collection, Newsfeed for a certain user's news feed, items are fanned-out from the Actions collection.
The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions. Therefore, news feeds won't populate in real-time. I disagree with Geert-Jan in that real-time is important; I believe most users don't care for even a minute of delay in most (not all) applications (for real time, I'd choose a completely different architecture).
If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won't work with very large follower counts either, and it will create overly large objects that take up a lot of index space.
Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.
Speaking of flexibility, I'd be careful about that activitystrea.ms spec. It seems to make sense as a specification for interop between different providers, but I wouldn't store all that verbose information in my database as long as you don't intend to aggregate activities from various applications.

I believe you should look at your access patterns: what queries are you likely to perform most on this data, etc.
To me The use-case that needs to be fastest is to be able to push a certain activity to the 'wall' (in fb terms) of each of the 'activity consumers' and do it immediately when the activity comes in.
From this standpoint (I haven't given it much thought) I'd go with 1, since 2. seems to batch activities for a certain user before processing them? Thereby if fails the 'immediate' need of updates. Moreover, I don't see the advantage of 3. over 1 for this use-case.
Some enhancements on 1? Ask yourself if you really need the flexibility of defining an array of consumers for every activity. Is there really a need to specify this on this fine-grained scale? instead wouldn't a reference to the 'friends' of the 'actor' suffice? (This would a lot of space in the long run, since I see the consumers-array being the bulk of the entire message for each activity when consumers typically range in the hundreds (?).
on a somewhat related note: depending on how you might want to implement realtime notifications for these activity streams, it might be worth looking at Pusher - http://pusher.com/ and similar solutions.
hth

When to embed documents in Mongo DB

I'm trying to figure out how to best design Mongo DB schemas. The Mongo DB documentation recommends relying heavily on embedded documents for improved querying, but I'm wondering if my use case actually justifies referenced documents.
A very basic version of my current schema is basically:
(Apologies for the psuedo-format, I'm not sure how to express Mongo schemas)
users {
email (string)
}
games {
user (reference user document)
date_started (timestamp)
date_finished (timestamp)
mode (string)
score: {
total_points (integer)
time_elapsed (integer)
}
}
Games are short (about 60 seconds long) and I expect a lot of concurrent writes to be taking place.
At some point, I'm going to want to calculate a high score list, and possibly in a segregated fashion (e.g., high score list for a particular game.mode or date)
Is embedded documents the best approach here? Or is this truly a problem that relations solves better? How would these use cases best be solved in Mongo DB?

... is this truly a problem that relations solves better?
The key here is less about "is this a relation?" and more about "how am I going to access this?"
MongoDB is not "anti-reference". MongoDB does not have the benefits of joins, but it does have the benefit of embedded documents.
As long as you understand these trade-offs then it's perfectly fair to use references in MongoDB. It's really about how you plan to query these objects.
Is embedded documents the best approach here?
Maybe. Some things to consider.
Do games have value outside of the context of the user?
How many games will a single user have?
Is games transactional in nature?
How are you going to access games? Do you always need all of a user's games?
If you're planning to build leaderboards and a user can generate hundreds of game documents, then it's probably fair to have games in their own collection. Storing ten thousand instances of "game" inside of each users isn't particularly useful.
But depending on your answers to the above, you could really go either way. As the litmus test, I would try running some Map / Reduce jobs (i.e. build a simple leaderboard) to see how you feel about the structure of your data.

Why would you use a relation here? If the 'email' is the only user property than denormalization and using an embedded document would be perfectly fine. If the user object contains other information I would go for a reference.

I think that you should to use "entity-object" and "object-value" definitions from DDD. For entity use reference,but for "object-value" use embed document.
Also you can use denormalization of your object. i mean that you can duplicate your data. e.g.
// root document
game
{
//duplicate part that you need of root user
user: { FirstName: "Some name", Id: "some ID"}
}
// root document
user
{
Id:"ID",
FirstName:"someName",
LastName:"last name",
...
}