Let's say I have a service with thousands of users, and I want to post news alerts that they can view. Once they view one, it's marked as seen (for just that user, obviously).
I think I know the answer to this, but is it better to store on the news item a list of users who have seen it? Or is it better to store on the user document a list of all news items they've seen?
I'm assuming the latter is better, mostly because if I have 20,000 users, that means if all of them have seen a particular news alert, then I've got an array of 20,000 IDs stored in that news alert document, which probably isn't good. But this structure seems better:
{
email: 'person#person.net',
name: 'Person',
seenNews: [
'TTJGGiPsTqqLio4sf',
'vhePmuShra3MSzYsu',
'JKFqqCKDmtuuoQBXu',
'gCFyzu8BAihj8NnXB'
]
}
I probably won't have more than a few hundred news items, plus I can always go back and delete old ones anyway.
Or is there an even better way to handle this?
Given you have news
{
_id: "Fubar2.0",
title: "Fubar 2.0 released"
}
and users
{
_id: "12345",
name: "CoolName"
}
storing what has been seen in either of the above models can sooner or later run into the BSON document size limit of 16MB. Furthermore, documents that keep growing in size aren't handled efficiently by the mmapv1 storage engine, which was the default before MongoDB 3.2.
Conclusion: you need to store the news read in separate documents in a seen collection:
{
_id: {
newsitem: "Fubar2.0",
user:"12345"
}
}
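Since we have a compound _id for seen, which is automatically indexed (and held in RAM as long as possible), lookups by the full _id are quite efficient. Note that the _id index covers the embedded document as a whole, so a query on "_id.user" alone most likely needs its own index on that field.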
The problem is obvious: you need two queries to get news unseen by a user
var seen = [];
db.seen.find({"_id.user":"12345"},{_id:1}).forEach(
  function(doc){
    // the field is named "newsitem" in the seen documents above
    seen.push(doc._id.newsitem);
  }
)
var unseen = db.news.find({_id: {$nin: seen}})
While this works and imho is the proper solution for the situation described, the "unseen" query isn't very efficient.
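As a rough sketch of the two-step lookup above, the following plain JavaScript builds the documents and the filter involved without talking to a database. The helper names seenDoc and unseenFilter and the second news id "Fubar2.1" are made up for illustration; only the field names come from the seen documents shown earlier.

```javascript
// Hypothetical helpers; only the field names (newsitem, user) come from the answer.

// Document to insert into the `seen` collection when a user views an item.
const seenDoc = (newsitemId, userId) => ({
  _id: { newsitem: newsitemId, user: userId }
});

// Filter for the second query: every news item whose _id is not in the seen list.
const unseenFilter = (seenDocs) => ({
  _id: { $nin: seenDocs.map((doc) => doc._id.newsitem) }
});

// "Fubar2.1" is an invented second news item, used only for this example.
const seenByUser = [seenDoc("Fubar2.0", "12345"), seenDoc("Fubar2.1", "12345")];
const filter = unseenFilter(seenByUser);
// In the shell, this filter would be passed along as db.news.find(filter).
```

The list of seen ids still has to be shipped to the server with every "unseen" query, which is the inefficiency mentioned above.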
Depending on the use case, you could rather go with something like this for users
{
_id:"12345",
name: "CoolName",
lastSeen: ISODate("2015-05-05T03:26:36Z")
}
and news like this
{
_id:{
title:"FuBar 2.0 released",
date: ISODate("2015-05-05T03:46:00Z")
}
}
So when a user logs in, you already loaded the user document, right? With this, you can get all the news he or she presumably hasn't seen with
db.news.find({"_id.date":{$gte: user.lastSeen} })
Admittedly you can not really check which user has seen which news item, but if the goal is to make sure the user is presented with all the news since his or her last visit, the latter solution is efficient and easy to implement (and scale).
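A minimal in-memory sketch of the lastSeen idea, assuming plain arrays stand in for the collections. The function names, the "FuBar 1.9 released" item and the point at which lastSeen gets moved forward are assumptions of this example, not something the answer specifies.

```javascript
// In-memory stand-ins for the news collection and an already loaded user document.
// "FuBar 1.9 released" is an invented older item, used only for this example.
const news = [
  { _id: { title: "FuBar 1.9 released", date: new Date("2015-05-01T00:00:00Z") } },
  { _id: { title: "FuBar 2.0 released", date: new Date("2015-05-05T03:46:00Z") } }
];
const user = { _id: "12345", name: "CoolName", lastSeen: new Date("2015-05-05T03:26:36Z") };

// Same condition as {"_id.date": {$gte: user.lastSeen}}, applied to the array.
const newsSince = (items, lastSeen) =>
  items.filter((item) => item._id.date >= lastSeen);

// Collect the presumably unseen items, then move lastSeen forward.
// (Hypothetical step: the answer does not say when lastSeen is updated.)
const showNews = (items, userDoc, now) => {
  const fresh = newsSince(items, userDoc.lastSeen);
  userDoc.lastSeen = now;
  return fresh;
};

const fresh = showNews(news, user, new Date("2015-05-06T00:00:00Z"));
```

After the call, a second lookup with the updated lastSeen returns nothing until newer items are posted.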
Related
What would be a good approach to designing the following swipe-card-style app with skip functionality?
Core functionality of the app I'm working on is as follows.
On the main page, a user first makes a query for the list of posts.
The list should be sorted by date in reverse chronological order, or by some kind of internal score that determines the active posts (those with a large number of votes, comments, etc.).
Each post is shown to the user one by one in the form of a card, like a Tinder- or Jelly-style feed.
For each card, the user can either skip it or vote for it.
When the user has consumed all the fetched cards and queries again for the next items, cards that the current user has already skipped or voted on should not appear again.
The point here is that a user could have a huge number of skipped or voted posts, since skipping or voting is the only thing a user can do with a post on the main page. (The user can browse these already processed items on his/her profile.)
The approaches I simply thought about are
1. Store the list of skipped or voted post ids for each user somewhere and use them in the query with the $nin operator.
db.posts.find({ _id: {$nin: [postid1,...,postid999]} }).sort({ date: -1 })
2. Embed the userIds of all users that voted on or skipped the post in an array and query using the $ne operator.
{
_id: 'postid',
skipOrVoteUser: ['user1', 'user2' ...... 'user999'],
date: 1429286816366
}
db.posts.find({ skipOrVoteUser: {$ne: 'user1'} }).sort({ date: -1 })
3. Maintain a feedCache for each user and fan out on write.
FeedCache
{
userId: 'user1',
posts: [{id:1, data: {..}}, {id:2, data: {...}},.... {id:3, data: {...}}]
}
Operations:
- When a user creates a post, write a copy of the post to every user's feed cache in the system.
- Fetch posts from the user's feed cache.
- When the user votes on or skips a post, delete the post from his/her feed cache.
But the list of posts that a user has skipped or voted on is ever growing and could become really large over time, so I'm concerned that with approach 1 the query would be too slow with a large $nin list.
With approach 2, since all users of the system (or many, depending on the filtering) could either vote on or skip a post, the embedded user array of each post could become really large (at most the number of all users), and the performance of the $ne query will be poor.
With approach 3, every created post causes too many write operations, so it won't be efficient.
What would be a good approach to designing a schema that supports this kind of functionality? I've tried to come up with a good solution and could not think of a better one. Please help me solve this problem. Thanks!
On a relational database I would use approach 1. It's an obvious choice, as you have good SQL operators for the task and you can easily optimize the query.
With document databases I would choose approach 2. In this case there is a good chance that the vote/skip list remains relatively small as the system grows.
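For what it's worth, approach 2 could be sketched roughly like this. The helper names are hypothetical; the skipOrVoteUser field is the one from the question, and matchesNe only mimics in plain JavaScript how $ne behaves against an array field.

```javascript
// Filter for cards the given user has neither skipped nor voted on.
// Shell usage would look like: db.posts.find(unseenPostsFilter('user1')).sort({ date: -1 })
const unseenPostsFilter = (userId) => ({ skipOrVoteUser: { $ne: userId } });

// Update that records a skip or vote; $addToSet avoids duplicate entries.
// Shell usage would look like: db.posts.update({ _id: 'postid' }, markProcessedUpdate('user1'))
const markProcessedUpdate = (userId) => ({ $addToSet: { skipOrVoteUser: userId } });

// In-memory imitation of $ne on an array field: a post matches when the array
// does not contain the user id (a missing array also matches).
const matchesNe = (post, userId) => !(post.skipOrVoteUser || []).includes(userId);

const posts = [
  { _id: "postid1", skipOrVoteUser: ["user1", "user2"] },
  { _id: "postid2", skipOrVoteUser: ["user2"] },
  { _id: "postid3" }
];
const nextCards = posts.filter((post) => matchesNe(post, "user1"));
```

Whether the per-post array really stays small depends on the app, so it is probably worth measuring with realistic data before committing to this layout.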
What I'm talking about is:
Meteor.users.findOne() =
{
_id: "..."
...
followers: {
users: Array[], // ["someUserId1", "someUserId2"]
pages: Array[] // ["somePageId1", "somePageId2"]
}
}
vs.
Followings.findOne() =
{
_id: "..."
followeeId: "..."
followeeType: "user"
followerId: "..."
}
I found the second one totally inefficient because I need to use smartPublish to publish a user's followers.
Meteor.smartPublish('userFollowers', function(userId) {
  var cursors = [],
      followings = Followings.find({followeeId: userId});
  // one users cursor per follower of the given user
  followings.forEach(function(following) {
    cursors.push(Meteor.users.find({_id: following.followerId}));
  });
  return cursors;
});
And I can't filter users inside the iron-router. I cache subscriptions so there may be more users than I need.
I want to do something like this:
data: function() {
return {
users: Meteor.users.find({_id: {$in: Meteor.user().followers.users}})
};
},
A bad thing about using nested arrays inside the Document is that if I've added an item to followers.users[], the whole array will be sent back to the client.
So what do you think? Is it better to keep such data inside the user Document, so that it becomes fat? Maybe that's the 'Meteor way' of solving such problems.
I think it's a better idea to keep it nested inside the user document. Storing it in a separate collection leads to a lot of unnecessary duplication, and every time the publish function is run you have to scan the entire collection again. If you're worrying about the arrays growing too large, in most cases, don't (generally, a full-text novel only takes a few hundred kb). Plus, if you're publishing your user document already, you don't have to pull any new documents into memory; you already have everything you need.
This MongoDB blog post seems to advocate a similar approach (see one-to-many section). It might be worth checking out.
You seem to be aware of the pros and cons of each option. Unfortunately, your question is mostly opinion based.
Generally, if your follower arrays will be small in size and don't change often, keep them embedded.
Otherwise a dedicated collection is the way to go.
For that case, you might want to take a look at https://atmospherejs.com/cottz/publish which seems very efficient in what it does and very easy to implement syntactically.
I am relatively new to No-SQL databases. I am designing a data structure for an e-learning web app. There would be X quantity of courses and Y quantity of users.
Every user will be able to take any number of courses.
Every course will be composed of many sections (each section may be a video or a quiz).
I will need to keep track of every section a user takes, so I think the whole course should be part of the user set (for each user), like so:
{
_id: "ed",
name: "Eduardo Ibarra",
courses: [
{
name: "Node JS",
progress: "100%",
section: [
{name: "Introduction", passed:"100%", field3:"x", field4:""},
{name: "Quiz 1", passed:"75%", questions:[...], field3:"x", field4:""},
]
},
{
name: "MongoDB",
progress: "65%",
...
}
]
}
Is this the best way to do it?
I would say: design your database around your queries. One thing is for sure: you will have to do some embedding.
If you are going to perform more queries on what a user is doing, then make the user the primary entity and embed the courses within it. You don't need to embed the entire course info. The info about a course is static. For example, the data about the Node JS course (i.e. the content, the author of the course, exercise files etc.) will not change. So you can keep the courses' info separately in another collection. But how much of the course a user has completed depends on the individual user. So you should only keep the id of the course (which is stored in the separate 'course' collection), and for each user you can store the information that is related to that (User, Course) pair embedded in the user collection itself.
Now the most important question: what do you do if you have to perform queries which require a 'join' of the user and course collections? For this you can use JavaScript to first get the courses (and maybe store them in an array or list) and then fetch the users for each of those courses from the users collection, or vice versa. There are a few drivers available online to help you accomplish this. One is UnityJDBC which is available here.
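To illustrate the idea, here is a small application-side 'join' in plain JavaScript, with arrays standing in for the two collections. The courseId field, the course ids and the joinCourses helper are assumptions made for this sketch; they are not part of the question's schema.

```javascript
// Static course info, kept in its own collection (here: a plain array).
const courses = [
  { _id: "nodejs", name: "Node JS", sections: ["Introduction", "Quiz 1"] },
  { _id: "mongodb", name: "MongoDB", sections: ["Introduction"] }
];

// The user document only stores the course id plus per-user progress.
const user = {
  _id: "ed",
  name: "Eduardo Ibarra",
  courses: [
    { courseId: "nodejs", progress: "100%" },
    { courseId: "mongodb", progress: "65%" }
  ]
};

// Application-side join: look up each referenced course by id and combine it
// with the user's progress. Entries pointing at unknown courses are skipped.
const joinCourses = (userDoc, courseDocs) => {
  const byId = new Map(courseDocs.map((course) => [course._id, course]));
  return userDoc.courses
    .filter((entry) => byId.has(entry.courseId))
    .map((entry) => ({
      name: byId.get(entry.courseId).name,
      progress: entry.progress
    }));
};

const joined = joinCourses(user, courses);
```

Against a real database, the lookup table would be filled from a single query for all referenced course ids rather than one query per course.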
From my experience, knowing what you are going to query from MongoDB is very helpful in designing your database, because the NoSQL nature of MongoDB means there is no single correct design. Every design is incorrect if it does not let you accomplish your task. So knowing beforehand what you will do with the database (i.e. what you will query) is the best guide.
I wanted to know whether you can overuse embedding in MongoDB. I'm not talking about something like 100 levels deep, but in my application the average document size can get pretty large; simple tests have shown documents of 177kb.
The application is for logging, so for example I take the Apache access log and extract lots of things from it, like a list of all the pages that were called, a list of all the IP addresses, and so on. These are grouped by minute.
It is unlikely that I would ever have a document at the MongoDB document size limit, but I wanted to know whether keeping each of the sub-lists as its own document would give better performance when returning subset information (e.g. querying for all the IP addresses seen over 5 minutes).
When I run the query I filter to only show the IP addresses. Am I wasting the database's performance if I group each minute into one document, or am I wasting it if I split each list into its own document?
You want to structure your collections and documents in a way that reflects how you intend to use the data. If you're going to do a lot of complex queries, especially with subdocuments, you might find it easier to split your documents up into separate collections. An example of this would be splitting comments from blog posts.
Your comments could be stored as an array of subdocuments:
# Example post document with comment subdocuments
{
title: 'How to Mongo!',
content: 'So I want to talk about MongoDB.',
comments: [
{
author: 'Renold',
content: "This post, it's amazing."
},
...
]
}
This might cause problems, though, if you want to do complex queries on just comments (e.g. picking the most recent comments from all posts or getting all comments by one author.) If you plan on making these complex queries, you'd be better off creating two collections: one for comments and the other for posts.
# Example post document with "ForeignKeys" to comment documents
{
_id: ObjectId("50c21579c5f2c80000000000"),
title: 'How to Mongo!',
content: 'So I want to talk about MongoDB.',
comments: [
ObjectId("50c21579c5f2c80000000001"),
ObjectId("50c21579c5f2c80000000002"),
...
]
}
# Example comment document with a "ForeignKey" to a post document
{
_id: ObjectId("50c21579c5f2c80000000001"),
post_id: ObjectId("50c21579c5f2c80000000000"),
author: 'Renold',
content: "This post, it's amazing."
}
This is similar to how you'd store "ForeignKeys" in a relational database. Normalizing your documents like this makes querying both comments and posts easy. Also, since you're breaking up your documents, each document will take up less memory. The trade-off, though, is that you have to maintain the ObjectId references whenever there's a change to either document (e.g. when you insert/update/delete a comment or post). And since there are no event hooks in Mongo, you have to do all this maintenance in your application.
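Here is a rough idea of what that application-side maintenance could look like, using in-memory Maps in place of the two collections and short string ids in place of ObjectIds. The addComment and removeComment helpers are hypothetical names for this sketch.

```javascript
// In-memory stand-ins for the posts and comments collections.
const posts = new Map();
const comments = new Map();

posts.set("p1", { _id: "p1", title: "How to Mongo!", comments: [] });

// Insert a comment and record its id on the parent post.
const addComment = (comment) => {
  const post = posts.get(comment.post_id);
  if (!post) throw new Error("unknown post: " + comment.post_id);
  comments.set(comment._id, comment);
  post.comments.push(comment._id);
};

// Delete a comment and remove the dangling reference from its post.
const removeComment = (commentId) => {
  const comment = comments.get(commentId);
  if (!comment) return;
  const post = posts.get(comment.post_id);
  post.comments = post.comments.filter((id) => id !== commentId);
  comments.delete(commentId);
};

addComment({ _id: "c1", post_id: "p1", author: "Renold", content: "This post, it's amazing." });
addComment({ _id: "c2", post_id: "p1", author: "Howard", content: "Agreed." });
removeComment("c1");
```

With a real database each helper would turn into two writes (one per collection), so the application also has to cope with the second write failing.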
On the other hand, if you don't plan on doing any complex queries on a document's subdocuments, you might benefit from storing monolithic objects. For instance, a user's address or preferences aren't something you're likely to query for on their own:
# Example user document with address subdocument
{
_id: ObjectId("50c21579c5f2c80000000421"),
name: 'Howard',
password: 'naughtysecret',
address: {
state: 'FL',
city: 'Gainesville',
zip: 32608
}
}
I have a website with 500k users (running on SQL Server 2008). I now want to include activity streams of users and their friends. After testing a few things on SQL Server, it became apparent that an RDBMS is not a good choice for this kind of feature: it's slow (even when I heavily de-normalized my data). So after looking at other NoSQL solutions, I've figured that I can use MongoDB for this. I'll be following a data structure based on activitystrea.ms
json specifications for activity stream
So my question is: what would be the best schema design for an activity stream in MongoDB? With this many users you can pretty much predict that it will be very heavy on writes, hence my choice of MongoDB, which has great write performance. I've thought about 3 types of structures; please tell me if they make sense or whether I should use other schema patterns.
1 - Store each activity with all friends/followers in this pattern:
{
_id:'activ123',
actor:{
id:person1
},
verb:'follow',
object:{
objecttype:'person',
id:'person2'
},
updatedon:Date(),
consumers:[
person3, person4, person5, person6, ... so on
]
}
2 - Second design: Collection name- activity_stream_fanout
{
_id:'activ_fanout_123',
personId:person3,
activities:[
{
_id:'activ123',
actor:{
id:person1
},
verb:'follow',
object:{
objecttype:'person',
id:'person2'
},
updatedon:Date()
},
{
//activity feed 2
}
]
}
3 - This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:
{ _id: "123",
actor: { person: "UserABC" },
verb: "follow",
object: { person: "someone_else" },
updatedOn: Date(...)
}
And then, for followers, I would have the following "notifications" documents:
{ activityId: "123", consumer: "someguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "thirdguy", updatedOn: Date(...) }
Your answers are greatly appreciated.
I'd go with the following structure:
Use one collection, Actions, for all actions that happened.
Use another collection, Subscribers, for who follows whom.
Use a third collection, Newsfeed, for each user's news feed; its items are fanned out from the Actions collection.
The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions. Therefore, news feeds won't populate in real-time. I disagree with Geert-Jan in that real-time is important; I believe most users don't care for even a minute of delay in most (not all) applications (for real time, I'd choose a completely different architecture).
If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won't work with very large follower counts either, and it will create overly large objects that take up a lot of index space.
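A minimal sketch of the fan-out step such a worker might perform, with arrays standing in for the collections. The field names (follower, followee, userId, actionId, createdAt) and the fanOut helper are assumptions for illustration; the answer does not specify a schema.

```javascript
// Who follows whom (stand-in for the Subscribers collection).
const subscribers = [
  { follower: "person3", followee: "person1" },
  { follower: "person4", followee: "person1" },
  { follower: "person3", followee: "person2" }
];

// One new action picked up by the worker (stand-in for an Actions document).
const action = {
  _id: "activ123",
  actor: "person1",
  verb: "follow",
  object: "person2",
  createdAt: new Date("2015-01-01T00:00:00Z")
};

// Fan-out: create one Newsfeed entry per follower of the action's actor.
const fanOut = (actionDoc, subscriberDocs) =>
  subscriberDocs
    .filter((sub) => sub.followee === actionDoc.actor)
    .map((sub) => ({
      userId: sub.follower,
      actionId: actionDoc._id,
      createdAt: actionDoc.createdAt
    }));

const newsfeedEntries = fanOut(action, subscribers);
```

Relevancy scoring or filtering could be slotted in between the filter and map steps, which is where this design gets its flexibility.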
Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.
Speaking of flexibility, I'd be careful about that activitystrea.ms spec. It seems to make sense as a specification for interop between different providers, but I wouldn't store all that verbose information in my database as long as you don't intend to aggregate activities from various applications.
I believe you should look at your access patterns: what queries are you likely to perform most on this data, etc.
To me, the use case that needs to be fastest is pushing a certain activity to the 'wall' (in fb terms) of each of the 'activity consumers', and doing so immediately when the activity comes in.
From this standpoint (I haven't given it much thought) I'd go with 1, since 2 seems to batch activities for a certain user before processing them, and thereby fails the 'immediate' need for updates. Moreover, I don't see the advantage of 3 over 1 for this use case.
Some enhancements on 1? Ask yourself if you really need the flexibility of defining an array of consumers for every activity. Is there really a need to specify this on such a fine-grained scale? Wouldn't a reference to the 'friends' of the 'actor' suffice instead? (This would save a lot of space in the long run, since I see the consumers array being the bulk of the entire message for each activity when consumers typically range in the hundreds (?).)
On a somewhat related note: depending on how you might want to implement realtime notifications for these activity streams, it might be worth looking at Pusher - http://pusher.com/ and similar solutions.
hth