I have a collection Users, with fields such as name, password, email, etc.
I also have a collection Groups; every group has its members, an array of users.
How should I design my database? I clearly see 2 ways of doing so:
Way 1 (MySQL-like): every user has an _id, so I just put that _id into the members array and leave it at that.
Way 2: copy the whole user document into the group and add some fields.
The MongoDB site says that duplicate data is nothing to worry about because storage is cheap. It also says that we should avoid JOINs when reading data.
"duplicate data is nothing to worry about"
This is something to worry about when it comes to updating. Suppose you have user details nested and duplicated in every document. What happens when a user changes their name? You'll have to update every instance of that user in every document.
Be careful to differentiate between data and entities. A user is an entity; think carefully before duplicating entities, as fixing it later could be hard work.
Personally, I'd split them unless you find yourself in a situation where performance is too slow to do the joining in real time. Then, and only then, consider merging.
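To make the update cost concrete, here is a minimal sketch in plain Python. Dicts stand in for documents and nothing talks to a real server; the sample data and the helper names `rename_referenced` and `rename_embedded` are made up for illustration.

```python
# Plain dicts stand in for MongoDB documents; no database is involved.
users = {1: {"_id": 1, "name": "Ann"}, 2: {"_id": 2, "name": "Bob"}}

# Embedded design: every group holds a full copy of each member.
groups_embedded = [
    {"_id": "g1", "members": [{"_id": 1, "name": "Ann"}, {"_id": 2, "name": "Bob"}]},
    {"_id": "g2", "members": [{"_id": 1, "name": "Ann"}]},
]


def rename_referenced(user_id, new_name):
    """Referenced design: only the single user document is written."""
    users[user_id]["name"] = new_name
    return 1  # number of documents written


def rename_embedded(user_id, new_name):
    """Embedded design: every group holding a copy must be rewritten."""
    written = 0
    for group in groups_embedded:
        changed = False
        for member in group["members"]:
            if member["_id"] == user_id:
                member["name"] = new_name
                changed = True
        if changed:
            written += 1
    return written


writes_ref = rename_referenced(1, "Anna")
writes_emb = rename_embedded(1, "Anna")
print(writes_ref, writes_emb)
```

The number of writes in the embedded case grows with the number of groups that hold a copy of the user, which is the cost described above.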
Actually, the answer to this question depends on what kind of screens you are designing and what kind of queries you are going to make to fetch data. Let's go through the pros and cons of each option, which will help you weigh them.
Way 1 :- Putting an array of user_ids in the group collection
Pros
1) If you have a screen which shows group details of a particular group and list of all members (users_ids) belonging to that group, then one query can fetch all the details needed for this screen and it would be faster too.
Cons
1) If the group detail screen has to show user details along with group details, then, since a plain MongoDB query does not join collections, you would fetch the user details in a separate query and join the two on the client side. This can have an impact on performance.
2) If you have a screen which shows user details and all the groups he/she belongs to, then you will be searching for the user_id in the user array of every document in the group collection. If you expect the number of members in a group to be very high (millions), then searching inside the array can have a huge performance impact.
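The two access patterns for Way 1 can be sketched in plain Python as follows. Lists of dicts stand in for the two collections, and the sample data and function names are assumptions for illustration. In MongoDB the second step of the group lookup would typically be a query using `$in` on `_id`.

```python
# Lists of dicts stand in for the "users" and "groups" collections.
users = [
    {"_id": 1, "name": "Ann"},
    {"_id": 2, "name": "Bob"},
    {"_id": 3, "name": "Cy"},
]
groups = [
    {"_id": "g1", "title": "Admins", "members": [1, 2]},
    {"_id": "g2", "title": "Guests", "members": [2, 3]},
]


def group_with_members(group_id):
    """Client-side join: one lookup for the group, one for its users."""
    # Query 1: fetch the group document.
    group = next(g for g in groups if g["_id"] == group_id)
    # Query 2: fetch every user whose _id is in the members array.
    wanted = set(group["members"])
    members = [u for u in users if u["_id"] in wanted]
    return {**group, "members": members}


def groups_of(user_id):
    """Reverse lookup: scan the members array of every group."""
    return [g["_id"] for g in groups if user_id in g["members"]]


print(group_with_members("g1"))
print(groups_of(2))
```

The reverse lookup is the one that gets expensive when the member arrays are very large, as noted in con 2.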
Way 2 :- Copy the user document inside the group collection
Duplicating data is not a problem in MongoDB, but you should have a really good reason for it. As a rule of thumb, duplicate data when the relationship is 1:few and not 1:many.
Pros
1) This approach will save you from joining group and user collection at client side as one query can fetch all the details of group along with its users.
Cons
1) Suppose you have a million groups and user_id_1 belongs to 100,000 groups, then whenever you have an update on user_id_1, you will have to update 100,000 documents. This can again lead to huge performance impact.
2) Also, if a large number of users subscribe to 1 group, the document size of this group keeps increasing. In MongoDB the maximum BSON document size is 16 megabytes, which means you cannot have a document larger than 16MB, so you cannot keep adding users to a group indefinitely. This will limit your functionality.
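A rough back-of-the-envelope check of that limit, as a sketch. The 16MB figure comes from the text above; the per-member sizes (about 200 bytes for an embedded user copy, about 20 bytes for a bare id reference) and the 1KB reserved for the group's other fields are assumptions for illustration, not measurements.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # 16MB document size limit


def max_members(bytes_per_member, fixed_overhead=1024):
    """Estimate how many array entries fit before the document hits the limit.

    bytes_per_member and fixed_overhead are rough guesses; real BSON sizes
    depend on field names and value types.
    """
    return (MAX_BSON_BYTES - fixed_overhead) // bytes_per_member


embedded_capacity = max_members(200)  # assumed size of an embedded user copy
reference_capacity = max_members(20)  # assumed size of an id-only entry
print(embedded_capacity, reference_capacity)
```

Under these assumptions an embedded design runs out of room roughly ten times sooner than an id-only array, but both have a hard ceiling.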
Way 3 :- Embed group details in user collection
Pros
1) One query can fetch user details along with all the details of all the groups this user belongs to.
2) If you are expecting few users in a group, then you will have few group arrays in a user document, which will not exceed the 16MB limit.
Cons
1) If you expect that a user can subscribe to a very large number of groups (millions), then the user document may exceed the 16MB limit.
2) Also, if group details are updated very frequently, you will have to apply each update to many user documents.
You can also go through the following link to get more details about data model design :-
https://docs.mongodb.org/manual/core/data-model-design/
It depends on how you will use data in your application.
If you have more than 2 groups and you will have to search for a user across all of the groups, embedding the user document within the group (way 2) is not a good idea. So in this case I suggest using way 1.
If you have only 2 groups, or the user's group will be known to your application before it runs the query, then use way 2.
I guess that separating the data is the way to go, since it makes it easier to update, get and delete user data directly.
I have a server storing 5,000 content documents. Let's say I have 1 million users who all query for 50 new documents at their own pace, until all content has been seen.
I want to make sure that each user only sees and interacts with the content once and never again, like Tinder.
My first thought was to tag each document with a list of the user-ids of the users who have seen it. However, this list would get really long, up to 1 million user-ids per document, which sounds like it would really kill query performance.
Does anyone have any better ideas of how I can return content to users just once and never again?
P.S. I am planning to build this with MongoDB.
P.P.S. I thought about making a list of 'document-ids-seen' and attaching that to the user's document, and then with every query made by that user filtering out results that match 'document-ids-seen'. But the same challenge applies here: the query length would grow linearly as the user keeps interacting and bringing in new content.
The solution depends on the exact meaning of "at their own pace".
Your second post suggests that the time schedule is up to the user, but she will be presented with the documents in an order determined by your application, like e.g. getting news items in the order of the timestamp of news creation. In that case, your timestamp or auto increment solution will work, and it has only a small impact on data volume and query complexity.
If, however, the user may also choose which documents to view, this won't work any more, as the documents already viewed may be scattered across the entire document set. A solution to handle this efficiently consists of two design ideas:
(a) Imagine whether most users, at a given point of time, will have viewed a small or a large part of the entire document set. If only a small selection of documents is expected to be of interest to a particular user, then the count of documents the user has viewed will be rather small. (E.g. assume the documents are about IT and one user only wants to look at MongoDB docs, another mainly at Linux docs.) If all users will be interested in most or all of documents, then the count of documents a particular user has not viewed will be small. (E.g. a set of news that everyone tries to follow.) Depending on which is the case, store only a small list of viewed/not viewed document ids with each user, which will also simplify the query for the documents still to be viewed.
(b) With each user, don't store a list of single document ids (viewed or not viewed), but a list of intervals of such ids. E.g., if you store ids of documents not yet viewed, and some documents get added to the database, then, when a user is opened, her highest interval will be updated from (someLowerId, formerHighestId) to (someLowerId, currentHighestId). When a user views a document, the interval containing its id gets split from (lowId, highId) to (lowId, viewedId - 1), (viewedId + 1, highId), where one or both of these intervals may be empty. Including or excluding intervals like these will also simplify the queries as opposed to listing single ids.
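The interval bookkeeping from (b) can be sketched in plain Python like this. It assumes integer document ids and stores the not-yet-viewed ranges as sorted, inclusive `(low, high)` tuples; the function names `mark_viewed` and `extend_unviewed` are hypothetical.

```python
def mark_viewed(intervals, doc_id):
    """Remove doc_id from a sorted list of inclusive (low, high) intervals."""
    result = []
    for low, high in intervals:
        if low <= doc_id <= high:
            # Split the interval around the viewed id, dropping empty halves.
            if low <= doc_id - 1:
                result.append((low, doc_id - 1))
            if doc_id + 1 <= high:
                result.append((doc_id + 1, high))
        else:
            result.append((low, high))
    return result


def extend_unviewed(intervals, former_highest_id, current_highest_id):
    """Account for documents added since the user was last opened."""
    if current_highest_id <= former_highest_id:
        return list(intervals)
    if intervals and intervals[-1][1] == former_highest_id:
        # The top interval reached the old maximum: stretch it.
        low, _ = intervals[-1]
        return intervals[:-1] + [(low, current_highest_id)]
    # Otherwise the new ids form an interval of their own.
    return list(intervals) + [(former_highest_id + 1, current_highest_id)]


unviewed = [(1, 10)]
unviewed = mark_viewed(unviewed, 5)
unviewed = extend_unviewed(unviewed, 10, 12)
print(unviewed)
```

A query for documents still to be viewed then needs one range condition per interval instead of one entry per document id.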
I just had the idea that I could avoid the many-to-many relationship of content-to-users' interaction altogether, if I put a time-stamp on each document, and therefore only queried for more documents after a particular time-stamp 'X'.
Where 'X' could be stored in my 'users' table.
So when opening the app, I would sync my 'users' table, then issue queries after time-stamp 'X', then when results are returned, I'd update my 'users' table again with my new time-stamp X.
Or 'X' need not be a time-stamp; it could just be an auto-incrementing id.
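A minimal sketch of that idea in plain Python, using an auto-incrementing id as 'X'. The field name `last_seen_id` and the function `next_batch` are assumptions for illustration, and lists and dicts stand in for the collections.

```python
# Seven documents with auto-incrementing ids; one user with a stored cursor.
documents = [{"_id": i, "title": f"doc {i}"} for i in range(1, 8)]
users = {"u1": {"last_seen_id": 0}}


def next_batch(user_id, batch_size):
    """Return the next unseen documents and advance the user's cursor."""
    cursor = users[user_id]["last_seen_id"]
    ordered = sorted(documents, key=lambda d: d["_id"])
    batch = [d for d in ordered if d["_id"] > cursor][:batch_size]
    if batch:
        # Store the new 'X' so the same documents are never returned again.
        users[user_id]["last_seen_id"] = batch[-1]["_id"]
    return batch


first = [d["_id"] for d in next_batch("u1", 3)]
second = [d["_id"] for d in next_batch("u1", 3)]
print(first, second)
```

Only one value per user is stored, so the query does not grow as the user sees more content.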
I am building an e-learning app and showing student activities as a timeline. Should I embed them in the user collection, or create a separate collection with a userId?
Constraints:
One to many relationship.
User activities are detailed and numerous
For 90% of the time, we only need to see one user at a time; the other case is where a supervisor (teacher) needs to see a summary of the activities of users (maybe another collection?)
I haven't thought of the use case of searching for activities and finding students, maybe I'll have a use for this later on? (eg. see who finished some particular activity first? But that changes the relationship to be Many to many and is a completely different question)
I have found different schemas for the related problem in these two questions:
"MongoDB schema design -- Choose two collection approach or embedded document" recommends trying to embed as much as possible
"MongoDB schema for storing user location history" warns against bloating a collection, because querying elements nested deep inside might be hard, especially if you're going to use lists
Both of those articles are right and both are wrong.
To embed or not to embed? This is always the key question, and it comes down to your needs, querying and storage, and even your working set.
At the end of the day we can only give pointers; we can't actually tell you which is best.
However, considering the size of an activities feed, I personally would not embed it, since it could easily grow past 16MB (per user). For the speed and power of querying, though, you could aggregate, say, the last 20 activities of a user and embed that into the user's document (since the last 20 is normally what is queried the most).
But whether to embed an aggregate depends: sharding can take care of querying huge, horizontally scaled collections, and using the right queries means that you may not gain any real benefit from embedding and could potentially make your life harder by having to maintain the indexes, storage and queries required to keep that subdocument up to date.
As for embedding to the point of death: a lot of MongoDB's querying at the moment relies mostly upon one or two levels of embedding, which is why it could get hard to maintain, say, 12 nested tables. At that point you start to see questions on here and on the Google group about how to maintain such a huge document (the answer is client side, if you really want to).
"For 90% of the time, we only need to see one user at a time; the other case is where a supervisor (teacher) needs to see a summary of the activities of users (maybe another collection?)"
Considering this, I would house an aggregate on the user, which means users can see their own or another single user's activity with one round trip.
However, considering that a teacher would most likely need paged results from all users, I would house a separate activities collection and query on that for them. Paging an aggregate of subdocuments requires a few queries, and in this case it would be better to just do it this way.
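A minimal sketch of this hybrid, in plain Python with dicts and lists standing in for the two collections. The field name `recent`, the constant `RECENT_LIMIT` and the function `record_activity` are assumptions for illustration. In MongoDB the capped array could be maintained with `$push` using the `$each` and `$slice` modifiers.

```python
RECENT_LIMIT = 20  # size of the embedded "last N activities" aggregate

activities = []  # stands in for the separate activities collection
users = {"s1": {"_id": "s1", "recent": []}}


def record_activity(user_id, activity):
    """Write the full record, then refresh the small embedded aggregate."""
    # Full history lives in its own collection, keyed by user_id.
    activities.append({"user_id": user_id, **activity})
    # The user document keeps only the newest RECENT_LIMIT entries.
    recent = users[user_id]["recent"]
    recent.append(activity)
    del recent[:-RECENT_LIMIT]


for n in range(25):
    record_activity("s1", {"n": n, "kind": "lesson-viewed"})

print(len(activities), len(users["s1"]["recent"]))
```

The user's own timeline page reads `recent` in one round trip, while the teacher's paged summary queries the separate collection.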
Hopefully that should get you started.
You should not embed activities into the student document.
The reason I'm pretty confident of this is the following statements:
"User activities are detailed and numerous"
"showing student activities as a timeline"
"teacher needs to see a summary of the activities of users"
It is bad practice to design a schema with ever-growing documents, so having a student document that keeps growing every time they complete/add another activity is a recipe for poor performance.
If you want to sort student's activities, it's a lot simpler if each is a separate document in an activity collection than if it's an array within a student document.
When you need to query activities across multiple students, having all activities in a single collection makes it trivial, but having activities embedded in student documents makes it difficult (you will most likely need the aggregation framework, which will make it slower).
You also say you might have need in the future to "see who finished some particular activity first? But that changes the relationship to be Many to many and is a completely different question" - this is not the case. You do not need to treat this as a many-to-many relationship: you can still store multiple activities associated with a single user, then query for all records matching activity "X", sort by time finished (or whatever), and see which student has the lowest time.
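Both queries can be sketched over a single activities collection in plain Python. The list of dicts stands in for the collection, and the field names (`student`, `activity`, `finished_at`) and sample values are assumptions for illustration.

```python
# One document per completed activity, all in a single collection.
activities = [
    {"student": "ann", "activity": "quiz-1", "finished_at": 30},
    {"student": "bob", "activity": "quiz-1", "finished_at": 12},
    {"student": "cy", "activity": "quiz-1", "finished_at": 45},
    {"student": "ann", "activity": "quiz-2", "finished_at": 50},
]


def first_finishers(activity_name, limit=3):
    """Who finished a given activity first: filter, sort by time, take top N."""
    matching = [a for a in activities if a["activity"] == activity_name]
    matching.sort(key=lambda a: a["finished_at"])
    return [a["student"] for a in matching[:limit]]


def timeline(student):
    """One student's activities, newest first."""
    own = [a for a in activities if a["student"] == student]
    own.sort(key=lambda a: a["finished_at"], reverse=True)
    return [a["activity"] for a in own]


print(first_finishers("quiz-1"))
print(timeline("ann"))
```

Both are a plain filter plus sort on the same collection, so no many-to-many modelling is needed.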
I have a basic question about where I should embed a collection of followers/following in a MongoDB database. It makes sense to have an embedded collection of following in a user object, but does it also make sense to embed the converse followers collection as well? That would mean I would have to update and embed in the profile record of both:
1) the following embedded list in the follower
2) the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth embedding in both entities, or should I just update #1 (embed following in the follower's profile) and put an index on it so that I can query for the converse, the followers, across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers show up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split out following/followed-by relationship into a separate collection of records each having two fields, e.g., { _id : , oid : }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
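A minimal sketch of the edge collection with separate counters, in plain Python. A set of tuples stands in for the edge collection, and all names here (`follow`, `unfollow`, `counts`, the `(follower_id, followee_id)` layout) are assumptions for illustration rather than the field names used above.

```python
edges = set()  # each edge is (follower_id, followee_id)
counts = {}    # user_id -> {"following": n, "followers": n}


def _counter(user_id):
    return counts.setdefault(user_id, {"following": 0, "followers": 0})


def follow(follower_id, followee_id):
    """Insert one edge, then bump both counters. Returns False if it existed."""
    edge = (follower_id, followee_id)
    if edge in edges:
        return False
    edges.add(edge)
    _counter(follower_id)["following"] += 1
    _counter(followee_id)["followers"] += 1
    return True


def unfollow(follower_id, followee_id):
    """Remove one edge, then decrement both counters."""
    edge = (follower_id, followee_id)
    if edge not in edges:
        return False
    edges.remove(edge)
    _counter(follower_id)["following"] -= 1
    _counter(followee_id)["followers"] -= 1
    return True


def following_of(user_id):
    return sorted(b for (a, b) in edges if a == user_id)


def followers_of(user_id):
    return sorted(a for (a, b) in edges if b == user_id)


follow("ann", "bob")
follow("cy", "bob")
follow("ann", "cy")
print(followers_of("bob"), counts["bob"]["followers"])
```

In a real deployment the edge write and the counter updates are separate operations, so the counters can briefly disagree with the edges; that is the consistency trade-off discussed above.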
You can embed them all but create a new document when you reach a certain limit. For example, you can limit a document to an array of 500 elements and then create a new one. Also, for the feed, you don't have to keep publications once they have been viewed; you can replace them with new ones, so you don't have to create a new document for additional publication storage.
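The bucketing idea can be sketched in plain Python as follows. The limit of 500 is from the text; the bucket layout (`user_id`, `seq`, `followers`) and the function names are assumptions for illustration.

```python
BUCKET_SIZE = 500  # maximum array length per bucket document

buckets = []  # each bucket: {"user_id": ..., "seq": n, "followers": [...]}


def add_follower(user_id, follower_id):
    """Append to the user's newest bucket, or start a new one when it is full."""
    user_buckets = [b for b in buckets if b["user_id"] == user_id]
    if user_buckets and len(user_buckets[-1]["followers"]) < BUCKET_SIZE:
        user_buckets[-1]["followers"].append(follower_id)
    else:
        buckets.append(
            {"user_id": user_id, "seq": len(user_buckets), "followers": [follower_id]}
        )


def all_followers(user_id):
    """Read the buckets back in order and concatenate their arrays."""
    user_buckets = sorted(
        (b for b in buckets if b["user_id"] == user_id), key=lambda b: b["seq"]
    )
    return [f for b in user_buckets for f in b["followers"]]


for i in range(1201):
    add_follower("bob", i)

print([len(b["followers"]) for b in buckets])
```

Each document stays bounded in size, at the cost of reading several documents to list everything.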
To maintain your performance, I'd advise you to make a collection that can be used with the $graphLookup aggregation stage, where you store who each user follows. A popular user can reach millions of followers, so you have to store what people follow instead of who follows them.
I hope it helps you.
I have a chatroom system, and I want to use MongoDB as the backend database. The following are the entities:
Room - a chatroom (room_id)
User - a chatting user in chatroom (room_id, user_name)
Msg - a message in chatroom (room_id, user_name, message)
For designing the schema, I have some ideas: First, 3 collections - room, user and msgs - and there is a parent reference in user and msg documents.
Another idea is to create collections for each room, such as:
db.chatroom.victor
db.chatroom.victor.users
db.chatroom.victor.msgs
db.chatroom.john
db.chatroom.john.users
db.chatroom.john.msgs
db.chatroom.tom
db.chatroom.tom.users
db.chatroom.tom.msgs
...
I think that if I can divide the documents into different collections, it would be much more efficient to query. Also, I can use capped collections to limit the count of messages in each room. However, I am not familiar with MongoDB. I'm not sure if there is any side effect to doing that. Is there any performance problem with creating lots of collections? Is there any guideline for designing a MongoDB schema?
Thanks.
You should always design your schema by answering 2 questions:
What data should I store? (temporarily/permanently)
How will I access that data? (lots of reads on this, lots of writes on that, random reads and writes here)
You don't want to embed data with a high access rate into a document (like chat messages that are accessed by every user every second or so); it's better to keep it in a separate collection.
On the other hand, the collection of users in a chat room changes rather rarely, so you can definitely embed that.
Just design using common sense and you'll be fine.
You definitely do not want to embed messages inside of other documents. They need to be stored as individual documents.
I say this because MongoDB allocates a certain amount of space for every document it writes. When it writes a document, it takes its current size and adds some empty space (padding) to the document so that if it is actually 1k large, it may become 1.5k large to leave space for the document to grow in size.
Chat messages will almost definitely each be larger than the allocated free space. Multiple messages will absolutely be larger than the free space.
The problem is that when a document doesn't fit in its current location on disk or in memory when you try to embed another document inside of it (via an update), the database must read that document from disk or memory and rewrite the entire thing at the tail end of the data file.
This causes a lot of disk activity that should otherwise not exist; that added I/O will destroy the performance of the database.
Think about all of the use cases. What kind of queries do you want to execute?
"Show me the rooms where a user is chatting". This won't work with the current schema; you have to add a list of rooms to the user, or keep one in a separate collection, to make it work.
"Show me all the messages a user sent in all of the rooms". This, again, won't work.
"Delete all the idle users from all the rooms". With the proposed schema you have to run this query on every room.users collection.
Also, do some approximation for the size of your collections. If you have 100 rooms with max 1000 users, that's 100000 entries for a collection where you store all these mappings. With an index on room and user that shouldn't be an issue, you don't need separate collections.
With 100 users you can even embed this in the room object as an array. Just make sure there is an index.
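To illustrate, the three queries listed above become simple filters once all rooms share one mapping collection and one message collection. This is a plain Python sketch in which lists of dicts stand in for the collections; the field names (`room`, `user`, `last_active`, `message`) and the helper names are assumptions.

```python
# One mapping collection and one message collection for all rooms.
memberships = [
    {"room": "victor", "user": "ann", "last_active": 100},
    {"room": "victor", "user": "bob", "last_active": 10},
    {"room": "john", "user": "ann", "last_active": 95},
    {"room": "tom", "user": "cy", "last_active": 5},
]
messages = [
    {"room": "victor", "user": "ann", "message": "hi"},
    {"room": "john", "user": "ann", "message": "hello"},
    {"room": "victor", "user": "bob", "message": "hey"},
]


def rooms_of(user):
    """Show me the rooms where a user is chatting."""
    return sorted({m["room"] for m in memberships if m["user"] == user})


def messages_by(user):
    """Show me all the messages a user sent in all of the rooms."""
    return [m["message"] for m in messages if m["user"] == user]


def remove_idle(cutoff):
    """Delete idle users from all rooms in one pass; returns how many were removed."""
    before = len(memberships)
    memberships[:] = [m for m in memberships if m["last_active"] >= cutoff]
    return before - len(memberships)


print(rooms_of("ann"), messages_by("ann"))
```

With per-room collections, each of these would have to be repeated once per room.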
We have two very similar data types that are both "users". The first one consists of active users and the other has users that are automatically extracted and pulled into our system and have a much lower priority (in terms of speed of access) than active users.
Every active user has the potential to bring in at least 1000 data-mined users. We'll be using the active users much more frequently and performance is our primary concern. With the data-mined users, performance is secondary but we will be storing large quantities of them.
Any input on how we should be handling this? Either one collection for every user (both active and data-mined), or two collections (one for active, one for data-mined users)?
Mongo is great for storing similar, but different, objects in the same collection as long as your app can handle it.
Are the data-mined users children of the active users? If so, then you would probably want to keep them embedded in the active users' documents. You don't need to access them all the time; MongoDB allows you to fetch parts of a document if you don't need the whole thing.
Will you be querying them differently? If so, you may want to keep them separate so that your indexes do not become bloated.
Will you be querying either of them with queries that will not hit indexes? If so, you will want to separate them so that you don't need to do full collection scans every time.