One to many Mongo strategy with query on both collections - mongodb

I've got two collections: one with ~7,600,000 documents containing information about available trips, and one with ~5,000 documents containing information about hotels, with region, city and country data. Each document in the trips collection has a field holding the id of its hotel.
My problem is that I have to query both collections to get the information about a given trip: location information from the hotels collection, and other information such as price, number of people, etc. from the trips collection.
I've read about the map/reduce strategy for merging two collections, but I think it won't fit my case, because it will create only 5,000 documents if I link them using the hotel id. Is it possible?
Another approach is to embed the hotel information in the trips collection, but I'm worried about updating hotel information in that case.
Please give me some advice and tell me which approach would be best.

You have many options. It's all about deciding where to "join" the data. The options:
Join on the front end. Maybe bring back all trips first and then use AJAX calls to lazily load the hotel information. (Assuming a web application). Point being, two calls might not be the worst thing!
Use map/reduce in Mongo to output the data as you want it. It won't work in real time, but it will give you the right results. It wouldn't be limited to 5,000 documents. You could start with the bigger trip collection and bring in what you need. It's very flexible.
Embed the hotel information. As a note, you only want to embed hotel information if it's not changing all that often. If it changes constantly, I would consider leaving things as is.
For a lot of the work I do with Mongo, I've found that two calls aren't so bad, especially when dealing with fast-changing data.
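The "two calls" option can be sketched roughly as follows. This is only an illustration: the two Python lists stand in for the trips and hotels collections, and the field names (hotel_id, price, people) and function names are assumptions, not taken from the question. With a real driver, each helper would be one query.

```python
# Stand-in data: in a real application these would be two MongoDB collections.
hotels = [
    {"_id": 1, "city": "Rome", "country": "Italy"},
    {"_id": 2, "city": "Lisbon", "country": "Portugal"},
]
trips = [
    {"_id": 100, "hotel_id": 1, "price": 450, "people": 2},
    {"_id": 101, "hotel_id": 2, "price": 300, "people": 1},
    {"_id": 102, "hotel_id": 1, "price": 520, "people": 4},
]


def find_trips(max_price):
    """First call: filter the large trips collection."""
    return [t for t in trips if t["price"] <= max_price]


def find_hotels(hotel_ids):
    """Second call: fetch only the hotels referenced by the matched trips."""
    return {h["_id"]: h for h in hotels if h["_id"] in hotel_ids}


def trips_with_location(max_price):
    """Join the two result sets in the application."""
    matched = find_trips(max_price)
    hotel_by_id = find_hotels({t["hotel_id"] for t in matched})
    # Hotel data is copied into the result only, so it still lives in one place.
    return [{**t, "hotel": hotel_by_id[t["hotel_id"]]} for t in matched]
```

Because the hotel documents are never copied into the trips, updating a hotel remains a single write.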

Related

MongoDB - MANY documents or MANY collections

I am currently working on a service that stores products owned by multiple stores, and I am trying to figure how is the best way of structuring the DB. The only problem is that products are from different domains, like, clothes, toys, electronics and sold by different sellers.
First idea that came to mind was having different collections for each seller, but I see this as a headache, having to manage different DB connections.
The idea of storing all products in the same document seems bad to me, because they are from different domains.
The only idea that I believe could work in this case is like this:
Let's say we have 3 stores: Store One, Store Two, Store Three. I would create a document for each store like this: products_storeone, products_storetwo, products_storethree and access them based on an identifier each store has. Now, each store will have multiple documents to store different things like products_identifier, users_identifier, orders_identifier.
Do you consider this is a good idea?
Please tell me your opinion on what could be the best way of achieving a structure for storing items for each store independently, without mixing them.
After some calculation, there will be a maximum of 50 documents for each store. I don't see this as way too big a number to handle for, say, 1,000 stores. Do you think 50,000 documents are too many? Would it impact performance?
Any tips in order to achieve high performance for queries are more than welcome.
Thanks!
You can make a single "products" collection, effectively an array of objects in which each object is a single product. Each product will have "store" and "category" attributes, and with indexes on those you can query efficiently.
Hope this can be of help.
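A rough sketch of that layout, assuming hypothetical field values and helper names. The dictionary built by build_index only imitates what a compound index on (store, category) would give you in the database; it is not how MongoDB implements indexes.

```python
from collections import defaultdict

# One collection for every store and every product domain.
products = [
    {"_id": 1, "store": "storeone", "category": "toys", "name": "Robot"},
    {"_id": 2, "store": "storeone", "category": "clothes", "name": "Scarf"},
    {"_id": 3, "store": "storetwo", "category": "toys", "name": "Kite"},
    {"_id": 4, "store": "storeone", "category": "toys", "name": "Puzzle"},
]


def build_index(docs, fields):
    """Group documents by the values of the given fields."""
    index = defaultdict(list)
    for doc in docs:
        index[tuple(doc[f] for f in fields)].append(doc)
    return index


by_store_category = build_index(products, ("store", "category"))


def find_products(store, category):
    """Look up one store's products in one category without scanning the rest."""
    return by_store_category.get((store, category), [])
```

Each store still only ever sees its own items, because every lookup includes the store value.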

MongoDB database design - contest application

I'm building a contest application, which has 4 collections so far:
contest
questions
matches
users
I want to store every user's score for every match they are assigned to, but I really can't find a proper way to achieve this.
All I've come up with is to replace matches in users with an array in which each element contains a reference to the matches collection and a score field. But I think this is not very efficient.
EDIT
I was thinking about another solution. A separate collection called scores that contains three fields user, match and score.
Here's my schema structure:
Contests:
Questions:
Matches:
Users:
Note: Any recommended adjustments to the current design are welcome too.
Since MongoDB is not designed to support relationships between collections, you might end up with some duplicated work. I would suggest finding a way of storing as much data as you can in a single document.
Your scores would go in each match document; the users array would probably have this structure: {'users':[{user_id:'xxx',score:xxx},{user_id:'xxx',score:xxx}]}
The other solution would be what you describe: to have, in each user document, a matches array with a structure like this: {'matches':[{match_id:'xxx',score:xxx},{match_id:'xxx',score:xxx}]}
You can also have both; this might be more efficient depending on the kind of queries you need to run. You can also have a field in the subdocuments that stores the user/match name/title.
Note: As you can see, there are two directions: either you optimize for document size (so you can store more) or you optimize for performance (so you can read faster and with fewer resources).
Hope this is of some help.
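As a small sketch of the "have both" option, assuming plain Python dicts as stand-ins for a match document and a user document (the helper names are made up for illustration): every score write touches both documents, which is the duplicated work mentioned above.

```python
def _upsert(entries, key, value, score):
    """Update the entry whose `key` equals `value`, or append a new one."""
    for entry in entries:
        if entry[key] == value:
            entry["score"] = score
            return
    entries.append({key: value, "score": score})


def set_score(match, user, score):
    """Store the score in the match document and mirror it in the user document."""
    _upsert(match["users"], "user_id", user["_id"], score)
    _upsert(user["matches"], "match_id", match["_id"], score)


match = {"_id": "m1", "users": []}
user = {"_id": "u1", "matches": []}
set_score(match, user, 7)
set_score(match, user, 9)  # a second write replaces the score, it does not duplicate it
```

Reading a match's scores or a user's scores is then a single document fetch, at the cost of two writes per score.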

Where should I put activities timeline in mongodb, embedded in user or separately?

I am building an e-learning app, and showing student activities as a timeline, should I embed them in the user collection, or create a separate collection with an userId.
Constraints:
One to many relationship.
User activities are detailed and numerous
For 90% of the time, we only need to see one user at a time; the other case is where a supervisor (teacher) needs to see a summary of the activities of users (maybe another collection?)
I haven't thought of the use case of searching for activities and finding students, maybe I'll have a use for this later on? (eg. see who finished some particular activity first? But that changes the relationship to be Many to many and is a completely different question)
I have found different schemas for the related problem in these two questions:
MongoDB schema design -- Choose two collection approach or embedded document recommends to try and embed as much as possible
MongoDB schema for storing user location history reminds don't bloat a collection, because querying the elements deep below might be hard, especially if you're going to use lists
Both of those articles are right and both are wrong.
To embed or not to embed? This is always the key question, and it comes down to your needs, your querying and storage, and even your working set.
At the end of the day we can only give pointers; we can't actually tell you which is best.
However, considering the size of an activities feed, I personally would not embed it, since it could easily grow past 16 MB (per user). For the speed and power of querying, though, you could aggregate, say, the last 20 activities of a user and embed that into the user's document (since the last 20 is normally what is queried the most).
But then, whether to embed an aggregate depends: sharding can take care of querying huge horizontally scaled collections, and using the right queries means that you don't gain any real benefit from embedding and could potentially make your life harder by having to maintain the indexes, storage and queries required to maintain that subdocument.
As for embedding to the point of death: a lot of MongoDB's querying at the moment relies mostly upon one or two levels of embedding, which is why it could get hard to maintain, say, 12 nested tables, at which point you start to see questions on here and the Google group about how to maintain such a huge document (the answer is client side if you really want to).
For 90% of the time, we only need to see one user at an time, the other case is where a supervisor(teacher) needs to see an summary of the activities of users(maybe another collection?)
Considering this, I would house an aggregate on the user, which means the user can see their own or another user's activity individually with one round trip.
However, considering that a teacher would most likely have paged results from all users, I would house a separate activities collection and query on that for them. Paging an aggregate of subdocuments requires a few queries, and in this case it would be better to just do it this way.
Hopefully that should get you started.
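A minimal sketch of that hybrid, assuming made-up field names (kind, at, recent) and a Python list standing in for the separate activities collection. In MongoDB itself, a bounded array like this is usually maintained with $push combined with the $each and $slice modifiers; the slicing below only imitates that.

```python
RECENT_LIMIT = 20  # how many activities are cached on the user document

activities = []  # stand-in for the separate activities collection


def add_activity(user, kind, timestamp):
    """Append to the full history and refresh the small cache on the user."""
    activities.append({"user_id": user["_id"], "kind": kind, "at": timestamp})
    # Newest first, trimmed so the user document never keeps growing.
    user["recent"] = ([{"kind": kind, "at": timestamp}] + user["recent"])[:RECENT_LIMIT]


user = {"_id": "u1", "recent": []}
for t in range(25):
    add_activity(user, "quiz", t)
```

The student's own timeline is served from user["recent"] in one round trip, while the teacher's paged summary queries the full collection.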
You should not embed activities into student document.
The reason I'm pretty confident of this is the following statements:
"User activities are detailed and numerous"
"showing student activities as a timeline"
"teacher needs to see an summary of the activities of users"
It is a bad practice to design schema that has ever-growing documents - so having a student document that keeps growing every time they complete/add another activity is a recipe for poor performance.
If you want to sort student's activities, it's a lot simpler if each is a separate document in an activity collection than if it's an array within a student document.
When you need to query about activities across multiple students, having all activities in a single collection makes it trivial, but having activities embedded in student documents makes it difficult (you will need aggregation framework, most likely, which will make it slower).
You also say you might have need in the future to "see who finished some particular activity first? But that changes the relationship to be Many to many and is a completely different question" - this is not the case. You do not need to treat this as a many-to-many relationship - you can still store multiple activities associated with a single user and then query for all records matching activity "X" sorting by time finished (or whatever) and seeing which student has lowest time.
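To illustrate the last point, here is a sketch with one document per activity. The field names (student, activity, finished_at) and the sample values are assumptions for the example; the list stands in for an activity collection.

```python
activities = [
    {"student": "ana", "activity": "quiz-1", "finished_at": 1700000300},
    {"student": "ben", "activity": "quiz-1", "finished_at": 1700000100},
    {"student": "ana", "activity": "essay-2", "finished_at": 1700000500},
    {"student": "cy", "activity": "quiz-1", "finished_at": 1700000200},
]


def finishers(activity_name):
    """Students who completed the activity, earliest finisher first."""
    matching = [a for a in activities if a["activity"] == activity_name]
    return [a["student"] for a in sorted(matching, key=lambda a: a["finished_at"])]


def timeline(student):
    """One student's activities, newest first."""
    own = [a for a in activities if a["student"] == student]
    return [a["activity"] for a in sorted(own, key=lambda a: a["finished_at"], reverse=True)]
```

Both questions are a filter plus a sort over the same collection, so the relationship stays one-to-many.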

MongoDB - One Collection Using Indexes

OK, so the more I develop in MongoDB, the more I wonder about the need for multiple collections versus having one large collection with indexes (since fields can be different for each document, unlike tabular data). If I am trying to develop in the most efficient way possible (meaning less code and reusable code), can I use one collection for all documents and just index on a field? By having all documents in one collection with indexes, I can reuse all my form processing code and other code, since it will all be inserting into the same collection.
For Example:
Let's say I am developing a contact manager and I have two types of contacts, "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses. But that was because I'm used to developing in SQL, where this would be appropriate since the columns would be different for each table. The more I thought about the flexibility of document databases, the more I wondered, "do I really need two collections for this?" If I just add a field called "contact type" to each document and index on that, do I really need two collections? Since the fields in each document do not have to be the same for all (as in SQL), each document can have its own fields, as long as I have a "document type" field and an index on that field.
So then I took that concept further: if I only need one collection for "individuals" and "businesses", do I even need a separate collection for "Users" or "Contact History" or any other data? In theory, couldn't I build the entire solution in one collection and just have a field in each document that specifies the "type" and index on it, such as "Users", "Individual Contact", "Business Contacts", "Contact History", etc.? And if a document is related to another document, I can index on the "parent key/foreign" id field...
This would allow me to code the front end dynamically, since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding, but I want to make sure that, by using indexes and secondary indexes, the db would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands or even millions of documents in it as the user base grows, but it would have indexes and secondary indexes to optimize performance.
My question is: Is this a common method mongodb developers use? Why or why not? What are the downfalls, if any? If this is a commonly used method, please also give any positives to using this method. thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seems like businesses have enough metadata associated with them that it could bulk up a lot. (Also, the individual-business relationship seems like a many-to-many.) However, an individual might have a Name object (with first and last properties). It would be a bad idea to make Name into a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions, in the form of atomic aggregates. When you insert an object into Mongo, the entire object is either inserted or not inserted. So if your application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted to have a screen that displayed all contacts, in alphabetical order. If you have one single collection for contacts, then its really easy, but if you have two collections it becomes a more complicated proposition.
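As a sketch of the single-collection case, assuming a hypothetical "type" field and sample contacts (the list stands in for one contacts collection): the alphabetical screen is one sort over one collection, and filtering by type is still available when needed.

```python
contacts = [
    {"type": "individual", "name": "Dana Ruiz", "phone": "555-0101"},
    {"type": "business", "name": "Acme Ltd", "vat_number": "X123"},
    {"type": "individual", "name": "Bo Chen"},
]


def all_contacts_sorted():
    """Every contact, regardless of type, in alphabetical order."""
    return sorted(contacts, key=lambda c: c["name"].lower())


def contacts_of_type(contact_type):
    """Only individuals or only businesses, via the type field."""
    return [c for c in contacts if c["type"] == contact_type]
```

With two collections, the same screen would need two queries followed by a merge in application code.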
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it easy to extract that user's contacts.

mongo db design of following and feeds, where should I embed?

I have a basic question about where I should embed a collection of followers/following in a mongo db. It makes sense to have an embedded collection of following in a user object, but does it also make sense to also embed the converse followers collection as well? That would mean I would have to update and embed in the profile record of both the:
following embedded list in the follower
And the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth it embedding in both entities or should I just update #1, embed following in the follower's profile and, put an index on it so that I can query for the converse- followers across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId ?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers shows up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, and so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split out following/followed-by relationship into a separate collection of records each having two fields, e.g., { _id : , oid : }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
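A sketch of the edge-collection idea with separate counters. The names here (edges, counters, follow, unfollow) are illustrative assumptions; the set stands in for a collection with one small record per relationship, and the counter update is a second step after the edge write, so it carries the consistency caveat described above.

```python
edges = set()   # one (follower_id, followee_id) pair per relationship
counters = {}   # user_id -> {"followers": n, "following": n}


def _bump(user_id, field, delta):
    counters.setdefault(user_id, {"followers": 0, "following": 0})[field] += delta


def follow(follower, followee):
    """Add one edge, then update both counters. Returns False if it already exists."""
    if (follower, followee) in edges:
        return False
    edges.add((follower, followee))
    _bump(follower, "following", 1)
    _bump(followee, "followers", 1)
    return True


def unfollow(follower, followee):
    """Remove one edge, then update both counters."""
    if (follower, followee) not in edges:
        return False
    edges.remove((follower, followee))
    _bump(follower, "following", -1)
    _bump(followee, "followers", -1)
    return True


def following_of(user):
    """The "who am I following?" query."""
    return sorted(b for a, b in edges if a == user)


def followers_of(user):
    """The "who's following me?" query."""
    return sorted(a for a, b in edges if b == user)
```

Profile pages can show the totals from counters without enumerating any edges.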
You can embed them all, but create a new document when you reach a certain limit. For example, you can limit a document to an array of 500 elements and then create a new one. Also, if it is about a feed, you don't have to keep publications that have already been viewed; you can replace them with new ones, so you don't have to create a new document for additional publication storage.
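The "new document after a limit" idea could look roughly like this. The bucket layout (user_id, seq, followers) and the function name are assumptions for the sketch, and the list stands in for a collection of bucket documents.

```python
BUCKET_SIZE = 500  # the per-document array limit from the example above

buckets = []  # each bucket: {"user_id": ..., "seq": ..., "followers": [...]}


def add_follower(user_id, follower_id):
    """Append to the user's newest bucket, starting a new one when it is full."""
    user_buckets = [b for b in buckets if b["user_id"] == user_id]
    last = max(user_buckets, key=lambda b: b["seq"], default=None)
    if last is None or len(last["followers"]) >= BUCKET_SIZE:
        last = {"user_id": user_id, "seq": len(user_buckets), "followers": []}
        buckets.append(last)
    last["followers"].append(follower_id)


for n in range(1201):
    add_follower("popular_user", "follower-%d" % n)
```

No single document grows past the limit, so a very popular user simply ends up with more buckets.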
To maintain your performance, I'd advise you to make a collection that can be used with the $graphLookup aggregation stage, where you store your following relationships. A user can reach millions of followers, so you should store what people follow instead of who follows them.
I hope it helps you.