MongoDB database schema design - mongodb

I have a website with 500k users (running on SQL Server 2008). I now want to add activity streams for users and their friends. After testing a few things on SQL Server, it became apparent that an RDBMS is not a good choice for this kind of feature: it's slow, even after I heavily denormalized my data. After looking at NoSQL solutions, I've decided to use MongoDB. I'll be following a data structure based on the activitystrea.ms JSON specification for activity streams.
So my question is: what would be the best schema design for an activity stream in MongoDB? With this many users you can pretty much predict that it will be very write-heavy, hence my choice of MongoDB, which has great write performance. I've thought about three types of structures; please tell me whether they make sense or whether I should use other schema patterns.
1 - Store each activity with all friends/followers in this pattern:
{
  _id: 'activ123',
  actor: {
    id: 'person1'
  },
  verb: 'follow',
  object: {
    objecttype: 'person',
    id: 'person2'
  },
  updatedon: Date(),
  consumers: [
    'person3', 'person4', 'person5', 'person6' // ... and so on
  ]
}
2 - Second design: Collection name- activity_stream_fanout
{
  _id: 'activ_fanout_123',
  personId: 'person3',
  activities: [
    {
      _id: 'activ123',
      actor: {
        id: 'person1'
      },
      verb: 'follow',
      object: {
        objecttype: 'person',
        id: 'person2'
      },
      updatedon: Date()
    },
    // activity feed 2, and so on
  ]
}
3 - This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:
{ _id: "123",
actor: { person: "UserABC" },
verb: "follow",
object: { person: "someone_else" },
updatedOn: Date(...)
}
And then, for followers, I would have the following "notifications" documents:
{ activityId: "123", consumer: "someguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "thirdguy", updatedOn: Date(...) }
Your answers are greatly appreciated.

I'd go with the following structure:
Use one collection for all actions that happened, Actions
Use another collection for who follows whom, Subscribers
Use a third collection, Newsfeed for a certain user's news feed, items are fanned-out from the Actions collection.
The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions. Therefore, news feeds won't populate in real-time. I disagree with Geert-Jan in that real-time is important; I believe most users don't care for even a minute of delay in most (not all) applications (for real time, I'd choose a completely different architecture).
If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won't work with very large follower counts either, and it will create overly large objects that take up a lot of index space.
Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.
Speaking of flexibility, I'd be careful about that activitystrea.ms spec. It seems to make sense as a specification for interop between different providers, but I wouldn't store all that verbose information in my database as long as you don't intend to aggregate activities from various applications.
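To make the three-collection idea a bit more concrete, here is a rough sketch in plain JavaScript. It uses in-memory arrays as stand-ins for the Actions, Subscribers and Newsfeed collections, so it runs without a database. The field names (`follower`, `followee`, `fannedOut`) and the function names are assumptions made up for this illustration, not part of any MongoDB API.

```javascript
// In-memory stand-ins for the three collections (hypothetical sample data).
const actions = [];                 // Actions: everything that happened
const subscribers = [               // Subscribers: who follows whom
  { follower: "person3", followee: "person1" },
  { follower: "person4", followee: "person1" },
  { follower: "person5", followee: "person2" },
];
const newsfeed = [];                // Newsfeed: one entry per (consumer, action)

// Recording an action is a single cheap write; nothing is fanned out yet.
function recordAction(actor, verb, object) {
  const action = {
    _id: "activ" + (actions.length + 1),
    actor, verb, object,
    updatedOn: new Date(),
    fannedOut: false,               // assumed flag so the worker can find pending work
  };
  actions.push(action);
  return action;
}

// The worker runs later (asynchronously in a real system) and copies each
// pending action into the feed of every follower of the actor.
function fanOutPending() {
  for (const action of actions.filter((a) => !a.fannedOut)) {
    for (const s of subscribers.filter((sub) => sub.followee === action.actor)) {
      newsfeed.push({ consumer: s.follower, actionId: action._id, updatedOn: action.updatedOn });
    }
    action.fannedOut = true;
  }
}

// Reading a feed only touches that user's own entries.
function feedFor(consumer) {
  return newsfeed.filter((n) => n.consumer === consumer).map((n) => n.actionId);
}

recordAction("person1", "follow", "person2");
fanOutPending();
console.log(feedFor("person3"));
```

Because the feed entries are separate records, the worker is also the natural place to add relevancy scoring or filtering before an entry is written.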

I believe you should look at your access patterns: what queries are you likely to perform most on this data, etc.
To me, the use case that needs to be fastest is pushing a certain activity to the 'wall' (in fb terms) of each of the 'activity consumers', and doing it immediately when the activity comes in.
From this standpoint (I haven't given it much thought) I'd go with 1, since 2 seems to batch activities for a certain user before processing them, and thereby fails the 'immediate' need for updates. Moreover, I don't see the advantage of 3 over 1 for this use case.
Some enhancements on 1? Ask yourself whether you really need the flexibility of defining an array of consumers for every activity. Is there really a need to specify this on such a fine-grained scale? Wouldn't a reference to the 'friends' of the 'actor' suffice? This would save a lot of space in the long run, since I see the consumers array being the bulk of the entire message for each activity when consumers typically range in the hundreds (?).
On a somewhat related note: depending on how you might want to implement realtime notifications for these activity streams, it might be worth looking at Pusher - http://pusher.com/ and similar solutions.
hth

Related

Nosql database design - MongoDB

I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way of designing the database structure (documents). First I thought to put everything in its own schema like above: 3 collections. But since this is a NoSQL db, I should actually try to eliminate the need for joins. And now I am seriously thinking of putting everything in a single document, which also sounds crazy.
These are my actual queries from the UI:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any join, but I will also store, for example, the topic title multiple times: once in every comment.
I just could not decide.
I appreciate any help.
You can do the second design you suggested but it all comes down to how you want to use the data. I assume you’re going to be using it for a website.
If you want the comments to be clickable, so that clicking on the topic name redirects to the topic's page, or clicking the username redirects to the user's page where you can see all of their comments, I suggest you keep them as IDs. You can later use .populate("field1 field2") and select the fields you would like to get from that ID.
Alternatively you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections with one field each, are optional, to maintain uniqueness.
Note the group-by queries will be aggregation queries, for example, the main query will be like this:
db.comments.aggregate( [
{ $match: { date: { $gte: ISODate("2019-11-15"), $lt: ISODate("2019-11-16") } } },
{ $group: { _id: "$topic", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
] )
This will give you the names of all topics commented on today, with the most-commented topics first.
You could also take a slightly different approach. Storing information redundantly is not a bad thing in all cases.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this as two extra fields in your Topic entity: one holding the last date a comment was added, and the second counting the comments added on that day. By doing so you do not need a join, and can write a query that only looks at the Topic collection.
You could also store these statistics independently of the other data and update them when required. Think of this as having a document that describes the current state of your database (at least those parts relevant to you).
This might give you a time penalty on storing information but it improves reading times.
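As a rough illustration of the two extra fields, the sketch below keeps a per-topic "last comment day" and "comments that day" counter in an in-memory Map, so it runs without a database. The field names `lastCommentDay` and `commentsThatDay` and both function names are assumptions for this example, and it assumes comments arrive in chronological order.

```javascript
// title -> { title, lastCommentDay, commentsThatDay } (stand-in for the Topic collection)
const topics = new Map();

// Reduce a Date to a UTC day string such as "2019-11-15".
function dayOf(date) {
  return date.toISOString().slice(0, 10);
}

// Called on every new comment: the extra write is the "time penalty on storing".
function addComment(topicTitle, createdDate) {
  const day = dayOf(createdDate);
  const topic = topics.get(topicTitle) ||
    { title: topicTitle, lastCommentDay: day, commentsThatDay: 0 };
  if (topic.lastCommentDay !== day) {
    // First comment of a new day: reset the counter.
    topic.lastCommentDay = day;
    topic.commentsThatDay = 0;
  }
  topic.commentsThatDay += 1;
  topics.set(topicTitle, topic);
}

// Homepage query: only the topic records are read, no join and no grouping.
function hottestTopics(today) {
  const day = dayOf(today);
  return [...topics.values()]
    .filter((t) => t.lastCommentDay === day)
    .sort((a, b) => b.commentsThatDay - a.commentsThatDay)
    .map((t) => t.title);
}

addComment("A", new Date("2019-11-14T09:00:00Z"));
addComment("A", new Date("2019-11-14T10:00:00Z"));
addComment("A", new Date("2019-11-15T08:00:00Z"));
addComment("B", new Date("2019-11-15T08:30:00Z"));
addComment("B", new Date("2019-11-15T09:30:00Z"));
console.log(hottestTopics(new Date("2019-11-15T12:00:00Z")));
```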
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
As far as I understand this one, you only need the topic title, meaning you can query the database once and retrieve all titles. If the collection grows so big that this becomes slow, you could instead re-run a retrieval query that only returns a subset (a user is not likely to go through 100 possible topics).
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what you want to do, then you are most likely best off storing all the data in one document. However, I would ask you: what is the problem with making more than one query? I doubt you will be showing all comments at once when there are thousands (as you say). Instead of storing each comment in a separate document or throwing them all into one document, you could also bucket them and retrieve only the 20 most recent ones (if you create buckets of size 20). Read more about the bucket pattern here and update the ones shown when required.
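The bucketing idea can be sketched roughly as follows, again with an in-memory array instead of a real collection. The bucket shape (`topic`, `seq`, `comments`) and the function names are assumptions for illustration only. Note that the newest bucket holds at most 20 comments, so it may contain fewer right after a new bucket is opened.

```javascript
const BUCKET_SIZE = 20;
const buckets = []; // each bucket: { topic, seq, comments: [...] }

// Find the newest bucket of a topic, or null if the topic has none yet.
function newestBucket(topic) {
  let newest = null;
  for (const b of buckets) {
    if (b.topic === topic && (newest === null || b.seq > newest.seq)) newest = b;
  }
  return newest;
}

// Append a comment, opening a new bucket when the current one is full.
function appendComment(topic, comment) {
  let bucket = newestBucket(topic);
  if (bucket === null || bucket.comments.length >= BUCKET_SIZE) {
    bucket = { topic, seq: bucket === null ? 0 : bucket.seq + 1, comments: [] };
    buckets.push(bucket);
  }
  bucket.comments.push(comment);
}

// The topic page reads a single bucket instead of thousands of comments.
function latestComments(topic) {
  const bucket = newestBucket(topic);
  return bucket === null ? [] : bucket.comments;
}

for (let i = 1; i <= 45; i++) {
  appendComment("t1", { text: "comment " + i, username: "someone" });
}
console.log(buckets.length, latestComments("t1").length);
```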
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I'll make an argument from a 'domain driven design' point of view.
Given that all your data exists within the same bounded context (business domain), it is acceptable to encapsulate it all within the same document!

Many to many relationship on Mongodb based e-learning webapp?

I am relatively new to No-SQL databases. I am designing a data structure for an e-learning web app. There would be X quantity of courses and Y quantity of users.
Every user will be able to take any number of courses.
Every course will be composed of many sections (each section may be a video or a quiz).
I will need to keep track of every section a user takes, so I think the whole course should be part of the user set (for each user), like so:
{
_id: "ed",
name: "Eduardo Ibarra",
courses: [
{
name: "Node JS",
progress: "100%",
section: [
{name: "Introduction", passed:"100%", field3:"x", field4:""},
{name: "Quiz 1", passed:"75%", questions:[...], field3:"x", field4:""},
]
},
{
name: "MongoDB",
progress: "65%",
...
}
]
}
Is this the best way to do it?
I would say: design your database around your queries. One thing is for sure: you will have to do some embedding.
If you are going to perform more queries on what a user is doing, then make user as the primary entity and embed the courses within it. You don't need to embed the entire course info. The info about a course is static. For ex: the data about Node JS course - i.e. the content, author of the course, exercise files etc - will not change. So you can keep the courses' info separately in another collection. But how much of the course a user has completed is dependent on the individual user. So you should only keep the id of the course (which is stored in the separate 'course' collection) and for each user you can store the information that is related to that (User, Course) pair embedded in the user collection itself.
Now the most important question - what to do if you have to perform queries which require a 'join' of the user and course collections? For this you can use JavaScript to first get the courses (and maybe store them in an array or list etc.) and then fetch the users for each of those courses from the users collection, or vice versa. There are a few drivers available online to help you accomplish this. One is UnityJDBC which is available here.
From my experience, knowing what you are going to query from MongoDB is very helpful in designing your database, because the NoSQL nature of MongoDB means there is no single correct design. Any design is wrong if it does not let you accomplish your task. So clearly, knowing beforehand what you will do with the database (i.e. what you will query) is the only guide.
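As a rough sketch of the split described above, the example below keeps static course data in one in-memory "collection" and stores only a course id plus per-user progress inside each user. The manual join is done in application code. The field name `courseId`, the numeric progress values and both function names are assumptions for this illustration.

```javascript
// Static course information lives in its own collection.
const courses = [
  { _id: "nodejs", name: "Node JS", sections: ["Introduction", "Quiz 1"] },
  { _id: "mongodb", name: "MongoDB", sections: ["Introduction"] },
];

// Each user embeds only what is specific to the (user, course) pair.
const users = [
  {
    _id: "ed",
    name: "Eduardo Ibarra",
    courses: [
      { courseId: "nodejs", progress: 100,
        sections: [{ name: "Introduction", passed: 100 }, { name: "Quiz 1", passed: 75 }] },
      { courseId: "mongodb", progress: 65, sections: [] },
    ],
  },
];

// Application-side join: user -> course names with that user's progress.
function coursesOfUser(userId) {
  const user = users.find((u) => u._id === userId);
  if (!user) return [];
  const byId = new Map(courses.map((c) => [c._id, c]));
  return user.courses
    .filter((uc) => byId.has(uc.courseId))
    .map((uc) => ({ name: byId.get(uc.courseId).name, progress: uc.progress }));
}

// The reverse direction: which users are enrolled in a given course.
function usersOfCourse(courseId) {
  return users
    .filter((u) => u.courses.some((uc) => uc.courseId === courseId))
    .map((u) => u._id);
}

console.log(coursesOfUser("ed"), usersOfCourse("mongodb"));
```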

Mongo for Meteor data design: opposite of normalizing?

I'm new to Meteor and Mongo. Really digging both, but want to get feedback on something. I am digging into porting an app I made with Django over to Meteor and want to handle certain kinds of relations in a way that makes sense in Meteor. Given, I am more used to thinking about things in a Postgres way. So here goes.
Let's say I have three related collections: Locations, Beverages and Inventories. For this question though, I will only focus on the Locations and the Inventories. Here are the models as I've currently defined them:
Location:
_id: "someID"
beverages:
_id: "someID"
fillTo: "87"
name: "Beer"
orderWhen: "87"
startUnits: "87"
name: "Second"
number: "102"
organization: "The Second One"
Inventories:
_id: "someID"
beverages:
0: Object
name: "Diet Coke"
units: "88"
location: "someID"
timestamp: 1397622495615
user_id: "someID"
But here is my dilemma: I often need to retrieve one or many Inventories documents and render the "fillTo", "orderWhen" and "startUnits" per beverage. Doing things the MongoDB way, it looks like I should actually be embedding these properties as I store each Inventory. But that feels really non-DRY (and dirty).
On the other hand, it seems like a lot of effort & querying to render a table for each Inventory taken. I would need to go get each Inventory, then lookup "fillTo", "orderWhen" and "startUnits" per beverage per location then render these in a table (I'm not even sure how I'd do that well).
TIA for the feedback!
If you only need this for rendering purposes (i.e. no further queries), then you can use the transform hook like this:
var myAwesomeCursor = Inventories.find({ /* selector */ }, {
  transform: function (doc) {
    _.each(doc.beverages, function (bev) {
      // use whatever method you want to receive these data,
      // possibly from some cache or even another collection
      // bev.fillTo = ...
      // bev.orderWhen = ...
      // bev.startUnits = ...
    });
    // transform has to return the (modified) document
    return doc;
  }
});
Now the myAwesomeCursor can be passed to each helper, and you're done.
In your case you might find denormalizing the inventories so they are a property of locations could be the best option, especially since they are a one-to-many relationship. In MongoDB and several other document databases, denormalizing is often preferred because it requires fewer queries and updates. As you've noticed, joins are not supported and must be done manually. As apendua mentions, Meteor's transform callback is probably the best place for the joins to happen.
However, the inventories may contain many beverage records and could cause the location records to grow too large over time. I highly recommend reading this page in the MongoDB docs (and the rest of the docs, of course). Essentially, this is a complex decision that could eventually have important performance implications for your application. Both normalized and denormalized data models are valid options in MongoDB, and both have their pros and cons.
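For the normalized variant, the manual join mentioned above might look roughly like the sketch below. It is plain JavaScript over in-memory objects rather than Meteor collections, and it assumes that beverages can be matched by `name` between a location and an inventory, which the question does not state explicitly. The sample values and the function name `withLocationSettings` are made up for illustration.

```javascript
// Stand-in for the Locations collection (sample values are hypothetical).
const locations = [
  {
    _id: "loc1",
    name: "Second",
    beverages: [
      { name: "Beer", fillTo: 87, orderWhen: 20, startUnits: 87 },
      { name: "Diet Coke", fillTo: 100, orderWhen: 30, startUnits: 90 },
    ],
  },
];

// One Inventories document, referencing its location by id.
const inventory = {
  _id: "inv1",
  location: "loc1",
  timestamp: 1397622495615,
  beverages: [{ name: "Diet Coke", units: 88 }],
};

// Returns a copy of the inventory whose beverages carry the location's
// fillTo / orderWhen / startUnits, ready to render as table rows.
function withLocationSettings(inv) {
  const loc = locations.find((l) => l._id === inv.location);
  const settings = new Map((loc ? loc.beverages : []).map((b) => [b.name, b]));
  return {
    ...inv,
    beverages: inv.beverages.map((bev) => {
      const s = settings.get(bev.name) || {};
      return { ...bev, fillTo: s.fillTo, orderWhen: s.orderWhen, startUnits: s.startUnits };
    }),
  };
}

console.log(withLocationSettings(inventory).beverages);
```

The same lookup could be placed inside a transform function, with the settings map cached per location so it is not rebuilt for every inventory.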

How can I efficiently use MongoDB to create real-time analytics with pivots?

So I'm getting a ton of data continuously that's getting put into a processedData collection. The data looks like:
{
date: "2011-12-4",
time: 2243,
gender: {
males: 1231,
females: 322
},
age: 32
}
So I'll get lots and lots of data objects like this continually. I want to be able to see all "males" that are above 40 years old. This is not an efficient query it seems because of the sheer size of the data.
Any tips?
Generally speaking, you can't.
However, there may be some shortcuts, depending on actual requirements. Do you want to count 'males above 40' across the whole dataset, or just for one day?
1 day: split your data into daily collections (processedData-20111121, ...), this will help your queries. Also you can cache results of such query.
whole dataset: pre-aggregate data. That is, upon insertion of new data entry, do something like this:
db.preaggregated.update({_id : 'male_40'},
{$set : {gender : 'm', age : 40}, $inc : {count : 1231}},
true);
Similarly, if you know all your queries beforehand, you can just precalculate them (and not keep raw data).
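The pre-aggregation idea can be sketched without a database as follows. A Map stands in for the preaggregated collection, and `upsertInc` imitates an upsert combined with `$inc`. The key format, the `"m"` / `"f"` codes and the function names are assumptions for this example.

```javascript
// key -> { _id, gender, age, count } (stand-in for the preaggregated collection)
const preaggregated = new Map();

// Imitates an upsert with $inc: create the counter if missing, then add to it.
function upsertInc(gender, age, by) {
  const key = gender + "_" + age;
  const doc = preaggregated.get(key) || { _id: key, gender, age, count: 0 };
  doc.count += by;
  preaggregated.set(key, doc);
}

// Called once per incoming data entry, at insertion time.
function ingest(entry) {
  upsertInc("m", entry.age, entry.gender.males);
  upsertInc("f", entry.age, entry.gender.females);
}

// The query now scans one small counter per (gender, age), not the raw data.
function countAbove(gender, minAge) {
  let total = 0;
  for (const doc of preaggregated.values()) {
    if (doc.gender === gender && doc.age > minAge) total += doc.count;
  }
  return total;
}

ingest({ date: "2011-12-4", time: 2243, gender: { males: 1231, females: 322 }, age: 32 });
ingest({ date: "2011-12-4", time: 2244, gender: { males: 10, females: 5 }, age: 45 });
ingest({ date: "2011-12-4", time: 2245, gender: { males: 7, females: 0 }, age: 50 });
console.log(countAbove("m", 40));
```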
It also depends on how you define "real-time" and how big a query load you will have. In some cases it is ok to just fire ad-hoc map-reduces.
My guess is that your target GUI is a website? In that case you are looking for something called comet. You should build a layer which processes all the data and broadcasts new mutations to your client or event bus (more on that below). Mongo doesn't enable real-time data as it doesn't emit anything on a mutation. So you can use any data store which suits you.
Depending on the language you'll use you have different options (for comet):
Socket.io (nodejs) - Javascript
Cometd - Java
SignalR - C#
Libwebsocket - C++
Most of the times you'll need an event bus or message queue to put the mutation events on. Take a look at JMS, Redis or NServiceBus (depending on what you'll use).

When to embed documents in Mongo DB

I'm trying to figure out how to best design Mongo DB schemas. The Mongo DB documentation recommends relying heavily on embedded documents for improved querying, but I'm wondering if my use case actually justifies referenced documents.
A very basic version of my current schema is basically:
(Apologies for the pseudo-format, I'm not sure how to express Mongo schemas)
users {
email (string)
}
games {
user (reference user document)
date_started (timestamp)
date_finished (timestamp)
mode (string)
score: {
total_points (integer)
time_elapsed (integer)
}
}
Games are short (about 60 seconds long) and I expect a lot of concurrent writes to be taking place.
At some point, I'm going to want to calculate a high score list, and possibly in a segregated fashion (e.g., high score list for a particular game.mode or date)
Is embedded documents the best approach here? Or is this truly a problem that relations solves better? How would these use cases best be solved in Mongo DB?
... is this truly a problem that relations solves better?
The key here is less about "is this a relation?" and more about "how am I going to access this?"
MongoDB is not "anti-reference". MongoDB does not have the benefits of joins, but it does have the benefit of embedded documents.
As long as you understand these trade-offs then it's perfectly fair to use references in MongoDB. It's really about how you plan to query these objects.
Is embedded documents the best approach here?
Maybe. Some things to consider.
Do games have value outside of the context of the user?
How many games will a single user have?
Are games transactional in nature?
How are you going to access games? Do you always need all of a user's games?
If you're planning to build leaderboards and a user can generate hundreds of game documents, then it's probably fair to have games in their own collection. Storing ten thousand instances of "game" inside each user isn't particularly useful.
But depending on your answers to the above, you could really go either way. As the litmus test, I would try running some Map / Reduce jobs (i.e. build a simple leaderboard) to see how you feel about the structure of your data.
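To get a feel for how a leaderboard reads when games live in their own collection, here is a small sketch over an in-memory array. It is not a Map / Reduce job, just plain JavaScript that keeps each user's best score and sorts the result. The sample games and the function name `leaderboard` are hypothetical.

```javascript
// Stand-in for the games collection; each game references its user by id.
const games = [
  { user: "u1", mode: "classic", score: { total_points: 120, time_elapsed: 58 } },
  { user: "u2", mode: "classic", score: { total_points: 200, time_elapsed: 60 } },
  { user: "u1", mode: "classic", score: { total_points: 150, time_elapsed: 59 } },
  { user: "u3", mode: "timed", score: { total_points: 300, time_elapsed: 45 } },
  { user: "u2", mode: "timed", score: { total_points: 90, time_elapsed: 61 } },
];

// Best score per user, optionally restricted to one mode, highest first.
function leaderboard(mode, limit) {
  const best = new Map(); // user -> best total_points
  for (const g of games) {
    if (mode && g.mode !== mode) continue;
    const prev = best.get(g.user);
    if (prev === undefined || g.score.total_points > prev) {
      best.set(g.user, g.score.total_points);
    }
  }
  return [...best.entries()]
    .map(([user, points]) => ({ user, points }))
    .sort((a, b) => b.points - a.points)
    .slice(0, limit);
}

console.log(leaderboard("classic", 10));
console.log(leaderboard(null, 2));
```

Filtering by date would work the same way as filtering by mode, given a date field on each game.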
Why would you use a relation here? If the 'email' is the only user property, then denormalization and using an embedded document would be perfectly fine. If the user object contains other information, I would go for a reference.
I think you should use the "entity" and "value object" definitions from DDD: use a reference for an entity, and an embedded document for a value object.
You can also denormalize your objects, meaning you duplicate the data you need, e.g.:
// root document
game
{
//duplicate part that you need of root user
user: { FirstName: "Some name", Id: "some ID"}
}
// root document
user
{
Id:"ID",
FirstName:"someName",
LastName:"last name",
...
}