How to implement unread/new posts/comments in a NoSQL document store like MongoDB?

I've searched but couldn't find a definitive answer to this common problem.
I would like to show users their new/unread posts, for example as a list of topics that contain unread posts.
If a user opens one of those topics, it is automatically marked as read and no longer appears in the list the next time they click on unread posts. There should also be a way to mark everything as read.
I was thinking of showing unread posts only from the last 30 days, so the data set would not grow too large.
The obvious solution, and seemingly the best one, would be to embed an array of objects in each topic document, where each element holds a userid and the timestamp of that user's last view of the topic. I would then just compare the timestamp of the last post in the thread with the timestamp of the user's last view of that thread.
Only 2-3 queries would then be needed to show the results to the user.
So for example it would look like this:
{
  _id: uniqueObjectid,
  id: topic_id,
  topic: topic of the thread,
  last_update: timestamp of last reply to that topic,
  reads: [
    {id: userid, last_view: timestamp of last view on this topic by user},
    ...
  ]
}
I would delete all threads from this collection whose last_update field is older than 30 days.
Showing unread posts to a user would then be very easy: just compare last_update with last_view for that userid.
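For illustration, the comparison can be sketched in plain JavaScript, independent of any driver. The function name unreadTopics and the sample data are made up; the document shape follows the example above, and a user with no element in reads is assumed to have never opened the topic.

```javascript
// Hypothetical helper: returns the topics that have replies newer than the
// user's last view. Topics the user never opened count as unread (assumption).
function unreadTopics(topics, userId) {
  return topics.filter((topic) => {
    const read = topic.reads.find((r) => r.id === userId);
    return !read || read.last_view < topic.last_update;
  });
}

// Sample documents shaped like the schema above; timestamps are epoch millis.
const topics = [
  { id: 1, topic: "Indexes", last_update: 2000, reads: [{ id: "u1", last_view: 2500 }] },
  { id: 2, topic: "Sharding", last_update: 3000, reads: [{ id: "u1", last_view: 1000 }] },
  { id: 3, topic: "Backups", last_update: 1500, reads: [] },
];

const unreadIds = unreadTopics(topics, "u1").map((t) => t.id);
console.log(unreadIds); // topics 2 and 3 are unread for u1
```

This only shows the logic of the comparison; it says nothing about how well the embedded array performs at scale.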
But it's not a good solution. From what I've read, the way arrays are implemented in MongoDB makes this approach very slow. Imagine storing the last view of a topic for 1000 users: that means 1000 indexed array elements.
So it can't be done with arrays.
Here Asya from MongoDB describes why big embedded arrays should not be used (link).
I am having difficulty thinking of any other efficient way to solve this issue.

Related

Nosql database design - MongoDB

I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way to design the database structure (documents). At first I thought of giving everything its own schema, as above: 3 documents. But since this is a NoSQL db, I should actually try to eliminate the need for joins. Now I am seriously considering putting everything into a single document, which also sounds crazy.
These are my actual queries from the UI:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any joins, but I will also store, for example, the topic title multiple times, once in every comment.
I just could not decide.
I appreciate any help.
You can use the second design you suggested, but it all comes down to how you want to use the data. I assume you're going to be using it for a website.
If you want the comments to be clickable, so that clicking the topic name redirects to the topic's page and clicking the username redirects to the user's page where you can see all of their comments, I suggest you keep them as IDs. You can then use .populate("field1 field2") and select the fields you would like to get from that ID.
Alternatively you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections with one field each, are optional, to maintain uniqueness.
Note the group-by queries will be aggregation queries, for example, the main query will be like this:
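db.comments.aggregate( [
  // Match the whole day; an equality test would only match midnight exactly.
  { $match: { date: { $gte: ISODate("2019-11-15"), $lt: ISODate("2019-11-16") } } },
  // Count comments per topic.
  { $group: { _id: "$topic", count: { $sum: 1 } } },
  // Most-commented topics first.
  { $sort: { count: -1 } }
] )
This returns the names of all topics commented on today, with the most-commented topics first.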
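To make the result shape concrete, here is a plain JavaScript sketch that performs the same three steps on an in-memory array. The function topCommentedTopics and the sample comments are made up for illustration; it assumes date is a JavaScript Date and treats "today" as one UTC day.

```javascript
// Hypothetical in-memory equivalent of the $match / $group / $sort pipeline.
function topCommentedTopics(comments, dayStart) {
  const dayEnd = new Date(dayStart.getTime() + 24 * 60 * 60 * 1000);
  const counts = new Map();
  for (const c of comments) {
    // $match: keep comments whose date falls inside [dayStart, dayEnd)
    if (c.date >= dayStart && c.date < dayEnd) {
      // $group: count comments per topic
      counts.set(c.topic, (counts.get(c.topic) || 0) + 1);
    }
  }
  // $sort: highest count first
  return [...counts.entries()]
    .map(([topic, count]) => ({ _id: topic, count }))
    .sort((a, b) => b.count - a.count);
}

const comments = [
  { topic: "mongodb", date: new Date("2019-11-15T08:00:00Z") },
  { topic: "cassandra", date: new Date("2019-11-15T09:30:00Z") },
  { topic: "mongodb", date: new Date("2019-11-15T17:45:00Z") },
  { topic: "mongodb", date: new Date("2019-11-14T23:59:00Z") }, // previous day, excluded
];

const ranking = topCommentedTopics(comments, new Date("2019-11-15T00:00:00Z"));
console.log(ranking);
```

Each result element has the same _id and count fields that the $group stage produces.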
You could also take a slightly different approach. Storing information redundantly is not always a bad thing.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this as two extra fields in your Topic entity: one holding the date a comment was last added, and one counting the number of comments added that day. That way you do not need a join and can write a query that only looks at the Topic collection.
You could also store these statistics independently of the other data and update them when required. Think of this as having a document that describes the current state of your database (at least the parts relevant to you).
This may slow down writes, but it speeds up reads.
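A rough sketch of the two-field idea in plain JavaScript follows. The field names last_comment_day and comments_today and both helper functions are hypothetical, and days are compared as UTC date strings, which is an assumption rather than something the question specifies.

```javascript
// Hypothetical field names: last_comment_day and comments_today.
// Days are compared as UTC "YYYY-MM-DD" strings (assumption).
function recordComment(topic, when) {
  const day = when.toISOString().slice(0, 10);
  if (topic.last_comment_day !== day) {
    // First comment of a new day: restart the daily counter.
    topic.last_comment_day = day;
    topic.comments_today = 0;
  }
  topic.comments_today += 1;
  return topic;
}

// Homepage query: topics commented on today, busiest first.
function hottestTopics(topics, today) {
  const day = today.toISOString().slice(0, 10);
  return topics
    .filter((t) => t.last_comment_day === day)
    .sort((a, b) => b.comments_today - a.comments_today);
}

const topicA = { title: "A", last_comment_day: null, comments_today: 0 };
const topicB = { title: "B", last_comment_day: null, comments_today: 0 };
recordComment(topicA, new Date("2019-11-14T10:00:00Z"));
recordComment(topicA, new Date("2019-11-15T09:00:00Z"));
recordComment(topicB, new Date("2019-11-15T09:05:00Z"));
recordComment(topicB, new Date("2019-11-15T09:10:00Z"));

const hot = hottestTopics([topicA, topicB], new Date("2019-11-15T12:00:00Z"));
console.log(hot.map((t) => `${t.title}:${t.comments_today}`));
```

The extra work happens on every comment write, which is the write penalty mentioned above; the homepage read only touches topics.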
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
As far as I understand, this one only needs the topic title, meaning you can query the database once and retrieve all titles. If the collection grows so large that this becomes slow, you could trigger a refresh of the retrieval query that only returns a subset (a user is not likely to go through 100 possible topics).
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what you want to do, then you are most likely best off storing all the data in one document. However, I would ask: what is the problem with making more than one query? I doubt you will show all comments at once when there are thousands (as you say). Instead of storing each comment in a separate document or putting them all in one document, you could also bucket them and retrieve only the 20 most recent ones (if you create buckets of size 20). Read more about the bucket pattern here and update the comments shown when required.
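As a rough illustration of bucketing, the sketch below keeps comments in bucket objects of at most 20 entries, using plain JavaScript arrays in place of a collection. The bucket shape (topic_id, seq, comments) and the helper names are assumptions made for this example.

```javascript
const BUCKET_SIZE = 20; // bucket size taken from the example in the text

// Hypothetical bucket shape: { topic_id, seq, comments: [...] }.
function addComment(buckets, topicId, comment) {
  const own = buckets.filter((b) => b.topic_id === topicId);
  let last = own[own.length - 1];
  if (!last || last.comments.length >= BUCKET_SIZE) {
    // Current bucket is full (or none exists yet): start a new one.
    last = { topic_id: topicId, seq: own.length, comments: [] };
    buckets.push(last);
  }
  last.comments.push(comment);
}

// Reading the newest page touches only the highest-numbered bucket.
function latestBucket(buckets, topicId) {
  const own = buckets.filter((b) => b.topic_id === topicId);
  return own.length ? own[own.length - 1] : null;
}

const buckets = [];
for (let i = 1; i <= 45; i++) {
  addComment(buckets, "t1", { text: `comment ${i}`, author: "alice" });
}

console.log(buckets.length); // 45 comments fit in 3 buckets (20 + 20 + 5)
console.log(latestBucket(buckets, "t1").comments.length);
```

Note that the newest bucket may hold fewer than 20 comments, so a page of exactly 20 may need the previous bucket as well.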
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I'll make an argument from a 'domain driven design' point of view.
Given that all your data exists within the same bounded context (business domain), it is acceptable to encapsulate it all within the same document!

Mongodb schema design for swipe card style application

What would be a good approach to designing the following swipe-card-style app with skip functionality?
The core functionality of the app I'm working on is as follows.
On the main page, the user first queries for a list of posts.
The list should be sorted by date in reverse chronological order, or by some kind of internal score that determines how active a post is (a large number of votes, comments, etc.).
Each post is shown to the user one at a time in the form of a card, like a Tinder- or Jelly-style feed.
For each card, the user can either skip it or vote for it.
When the user has consumed all fetched cards and queries again for the next items, cards the current user has already skipped or voted on should not appear again.
The point here is that a user could have a huge number of skipped or voted posts, since the main page is the only place where a user can skip or vote on a post. (Users can browse these already-processed items on their profile.)
The approaches I have thought of so far are:
1. Store the list of skipped or voted post ids for each user somewhere and use them in the query with the $nin operator.
db.posts.find({ _id: {$nin: [postid1,...,postid999]} }).sort({ date: -1 })
2. Embed the userId of every user who voted on or skipped the post in an array and query using the $ne operator.
{
  _id: 'postid',
  skipOrVoteUser: ['user1', 'user2' ...... 'user999'],
  date: 1429286816366
}
db.posts.find({ skipOrVoteUser: {$ne: 'user1'} }).sort({ date: -1 })
3. Maintain a feedCache for each user and fan out on write.
FeedCache
{
  userId: 'user1',
  posts: [{id:1, data: {..}}, {id:2, data: {...}},.... {id:3, data: {...}}]
}
Operations:
- When a user creates a post, write a copy of the post to the feed cache of every user in the system.
- Fetch posts from the user's feed cache.
- When the user votes on or skips a post, delete the post from his/her feed cache.
But the list of posts that a user has skipped or voted on is ever growing and could become really large over time, so I'm concerned that the query in approach 1 would be too slow with a very long list for $nin.
With approach 2, since every user on the system (or many, depending on the filtering) could either vote on or skip a post, the embedded user array of each post could become really large (at most the number of all users), and the performance of the $ne query will be poor.
With approach 3, every created post causes too many write operations, so it won't be efficient.
What would be a good approach to designing a schema that supports this kind of functionality? I've tried to come up with a good solution and could not think of anything better. Please help me solve this problem. Thanks!
On a relational database I would use approach 1. It's an obvious choice, as you have good SQL operators for the task and you can easily optimize the query.
With document databases I would choose approach 2. In this case there is a good chance that the vote/skip list will remain relatively small as the system grows.
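To show what approach 2 does, here is a small in-memory sketch in plain JavaScript. The helpers nextCards and markProcessed are made-up names; they mimic the $ne query from the question and an update that adds the user to the array only once (in MongoDB that would typically be done with $addToSet).

```javascript
// In-memory stand-in for the posts collection from approach 2.
const posts = [
  { _id: "p1", skipOrVoteUser: ["user1", "user2"], date: 1429286816366 },
  { _id: "p2", skipOrVoteUser: [], date: 1429286900000 },
  { _id: "p3", skipOrVoteUser: ["user2"], date: 1429286850000 },
];

// Mirrors find({ skipOrVoteUser: { $ne: userId } }).sort({ date: -1 }):
// on an array field, $ne keeps documents where no element equals the value.
function nextCards(allPosts, userId) {
  return allPosts
    .filter((p) => !p.skipOrVoteUser.includes(userId))
    .sort((a, b) => b.date - a.date);
}

// Mirrors an update with $addToSet: record the user once, without duplicates.
function markProcessed(post, userId) {
  if (!post.skipOrVoteUser.includes(userId)) {
    post.skipOrVoteUser.push(userId);
  }
}

const before = nextCards(posts, "user1").map((p) => p._id);
markProcessed(posts[1], "user1"); // user1 skips or votes on p2
const after = nextCards(posts, "user1").map((p) => p._id);
console.log(before, after);
```

Whether the array really stays small depends on how many users see each post, which the sketch does not address.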

Statistical Datacollection with MongoDB

I'm building an application with Node.js and MongoDB to scan Stack Overflow for new content and find hot and trending topics. I need to know the right way to do this, because I'm not sure I'm doing it correctly: I come from MySQL, and my gut feeling tells me something is different here.
I'm not actually scanning Stack Overflow; it's just an easy analogy. Nonetheless, I have Posts, Comments, and Users who posted the thread (disregarding users who posted comments for now).
My initial solution was to create three tables (collections):
Posts - where I store all the information about the post
Post Stats - where I store all the dynamic information about a post (number of comments, overall score, etc.) once every X minutes
Users - where I store information about the users who have posted the Posts
Essentially I want to be able to ask the database "give me the top Users of today" and "give me the history of this post", to create a sort of graph of how this post behaved (ranked, scored, commented, etc.) over time.
What's the correct way of doing something like this with MongoDB? Should I store the Post Stats as part of the Posts documents?
I would personally go for a hybrid solution here.
It is inevitable that you want some kind of aggregated data on the post for all time. So within the post I would house an extra subdocument that contains stats for all time:
stats: {
  views: 456, // Just an example
  vote_ups: 5,
  vote_downs: 4,
  rank: 1, // vote ups minus vote downs
  comments: 5,
  answers: 6
}
Then for individual periods of time I would use post_stats the way you explain, creating a document like:
{
  post_id: 45,
  // etcetera, for minute-by-minute changes
  time: ISODate()
}
Use the post_id (or rather _id) to query for the graph you wish to make. Since MongoDB is good at scaling horizontally, you will be taking full advantage of it here.
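The hybrid layout can be sketched in plain JavaScript as follows. The helper names applyVote, takeSnapshot and historyOf are hypothetical, the snapshot fields beyond post_id and time are an assumption, and plain arrays and objects stand in for the two collections.

```javascript
// Hypothetical helpers for the hybrid layout: all-time counters live on the
// post, periodic snapshots go to a separate post_stats list.
function applyVote(post, direction) {
  if (direction === "up") post.stats.vote_ups += 1;
  else post.stats.vote_downs += 1;
  post.stats.rank = post.stats.vote_ups - post.stats.vote_downs;
}

function takeSnapshot(postStats, post, time) {
  // Copy the counters so later votes do not alter past snapshots.
  postStats.push({ post_id: post._id, time, ...post.stats });
}

function historyOf(postStats, postId) {
  return postStats
    .filter((s) => s.post_id === postId)
    .sort((a, b) => a.time - b.time);
}

const post = { _id: 45, stats: { vote_ups: 5, vote_downs: 4, rank: 1 } };
const postStats = [];

takeSnapshot(postStats, post, new Date("2013-01-01T10:00:00Z"));
applyVote(post, "up");
applyVote(post, "up");
takeSnapshot(postStats, post, new Date("2013-01-01T10:05:00Z"));

const ranks = historyOf(postStats, 45).map((s) => s.rank);
console.log(ranks); // rank series for the graph
```

The post itself always carries the current totals, while the snapshot list gives the time series for the graph.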

Structuring cassandra database

I don't understand one thing about Cassandra. Say, I have similar website to Facebook, where people can share, like, comment, upload images and so on.
Now, let's say, I want to get all of the things my friends did:
Username1 liked your comment
Username2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new Column Family for every single thing, for example: user_likes, user_comments, user_shares. Basically, anything you can think of. And even after I do that, would I still need to create secondary indexes for most of the columns just so I could search for data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' ids and then search through all of those Column Families for each user id?
EDIT
OK, so I did some more reading and now I understand things a little better, but I still can't really figure out how to structure my tables. So I will set a bounty, and I want a clear example of what my tables should look like if I want to store and retrieve data in these categories:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the ten most recently uploaded files of all my friends or the people I follow. This is how it would look:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything else, like comments and shares, would be similar to that...
Now probably the biggest challenge would be to retrieve the last 10 things of all categories together, so the list would be a mix of all the things...
I don't need an answer with fully detailed tables; I just need a really clear example of how I would structure and retrieve data the way I would in MySQL with joins.
With SQL, you structure your tables to normalize your data, and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query items which your friends uploaded. One way to do this is to have a single row per user, and write to this row whenever a friend of that user uploads something.
friendUploads { #column family
  userid { #column
    timestamp-upload-id : null #key : no value
  }
}
As an example:
friendUploads {
  userA {
    12313-upload5 : null
    12512-upload6 : null
    13512-upload8 : null
  }
}
friendUploads {
  userB {
    11313-upload3 : null
    12512-upload6 : null
  }
}
Note that upload6 is duplicated in two different columns, as whoever uploaded upload6 is a friend of both User A and User B.
To display the friends' uploads for a user, do a getSlice with a limit of 10 on that user's userid column. This returns the first 10 items, sorted by key.
To put the newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
The drawback of this scheme is that when User A uploads a song, you have to do N writes to update the friendUploads columns, where N is the number of people who are friends of User A.
For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a JSON blob), or you can store nothing and fetch the upload information using the uploadid.
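The write-time fan-out can be illustrated with a small in-memory model in plain JavaScript. This is not Cassandra client code: the Map stands in for the friendUploads column family, and recordUpload, latestUploads and the friendsOf lookup are hypothetical names.

```javascript
// Hypothetical in-memory model of the friendUploads column family:
// one row per user, holding [timestamp, uploadId] entries.
const friendUploads = new Map();

// Fan-out on write: one write per friend of the uploader.
function recordUpload(friendsOf, uploader, timestamp, uploadId) {
  for (const friend of friendsOf[uploader] || []) {
    if (!friendUploads.has(friend)) friendUploads.set(friend, []);
    friendUploads.get(friend).push([timestamp, uploadId]);
  }
}

// Stand-in for getSlice with a reverse comparator: newest first, limited.
function latestUploads(userId, limit) {
  const row = friendUploads.get(userId) || [];
  return [...row].sort((a, b) => b[0] - a[0]).slice(0, limit);
}

// userC is a friend of both userA and userB (sample data).
const friendsOf = { userC: ["userA", "userB"], userD: ["userA"] };
recordUpload(friendsOf, "userD", 12313, "upload5");
recordUpload(friendsOf, "userC", 12512, "upload6");

const feedA = latestUploads("userA", 10).map(([, id]) => id);
const feedB = latestUploads("userB", 10).map(([, id]) => id);
console.log(feedA, feedB);
```

Reading a feed is a single row lookup here, while each upload costs as many writes as the uploader has friends.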
To avoid duplicating writes, you can use a structure like,
userUploads { #column family
  userid { #column
    timestamp-upload-id : null #key : no value
  }
}
This stores the uploads of a particular user. Now when you want to display the uploads of User B's friends, you have to do N queries, one for each friend of User B, and merge the results in your application. This is slower to query, but faster to write.
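The read-time merge can be sketched in the same in-memory style. Again this is plain JavaScript rather than Cassandra client code, and userUploads, friendsFeed and the sample rows are made up for illustration.

```javascript
// Hypothetical in-memory model of userUploads: one row per uploader.
const userUploads = {
  userC: [[12512, "upload6"], [11313, "upload3"]],
  userD: [[12313, "upload5"]],
  userE: [[13512, "upload8"]],
};

// Fan-out on read: one lookup per friend, merged in the application.
function friendsFeed(friends, limit) {
  const merged = [];
  for (const friend of friends) {
    merged.push(...(userUploads[friend] || []));
  }
  return merged.sort((a, b) => b[0] - a[0]).slice(0, limit);
}

const feed = friendsFeed(["userC", "userD", "userE"], 3).map(([, id]) => id);
console.log(feed);
```

Each upload is written exactly once, but building a feed costs one lookup per friend plus the merge.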
Most likely, if users can have thousands of friends, you would use the first scheme, and do more writes rather than more queries, as you can do the writes in the background after the user uploads, but the queries have to happen while the user is waiting.
As an example of denormalization, look at how many writes Twitter's Rainbird does when a single click occurs. Each write is used to support a single query.
In some regards, you "can" treat noSQL as a relational store. In others, you can denormalize to make things faster. For instance, PlayOrm's #OneToMany stored the many like so
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide row approach, so when you find your user, you have all the foreign keys to his friends. Each row can have a different length. You may also store a reverse reference, so the user might have references to the people who marked him as a friend but whom he did not mark back (let's call them buddies), so you might have
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm, you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so it can scale to trillions of rows).
A row can have millions of columns in it or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the noSQL patterns this month, so if you keep an eye on that, you can also learn more about general noSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests (because each request requires a network round trip).
Here comes the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose (to fetch all related data with a single query).
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
And take a look at the twissandra sample project: https://github.com/twissandra/twissandra
It is a nice collection of optimization techniques for the kind of project you described.

MongoDB storing user-specific data on shared collection objects

I'm designing an application that processes RSS feeds using MongoDB. Currently my collections are as follows:
Entry
fields: content, feed_id, title, publish_date, url
Feed
fields: description, title, url
User
fields: email_address
subscriptions (embedded collection; fields: feed_id, tags)
A user can subscribe to feeds which are linked from the embedded subscription collection. From the subscriptions I can get a list of all the feeds a user should see and also the corresponding entries.
How should I store entry status information (isRead, isStarred, etc.) that is specific to a user? When a user views an entry I need to record isRead = 1. Two common queries I need to be able to perform are:
Find all entries for a specific feed where isRead = 0 or no status exists currently
For a specific user, mark all entries prior to a publish date with isRead = 1 (this could be hundreds or even thousands of records so it must be efficient)
Hmm, this is a tricky one!
It makes sense to me to store a record for entries that are unread, and delete them when they're read. I'm basing this on the assumption that there will be more read posts than unread for each individual user, so you might as well not have documents for all of those already-read entries sitting around in your DB forever. It also makes it easier to not have to worry about the 16MB document size limit if you're not having to drag around years of history with you everywhere.
For starred entries, I would simply add an array of Entry ObjectIds to User. No need to make these subscription-specific; it'll be much easier to pull a list of items a User has starred that way.
For unread entries, it's a little more complex. I'd still add it as an array, but to satisfy your requirement of being able to quickly mark as-read entries before a specific date, I would denormalize and save the publish-date alongside the Entry ObjectId, in a new 'UnreadEntry' document.
User
fields: email_address, starred_entries[]
subscriptions (embedded collection; fields: feed_id, tags, unread_entries[])
UnreadEntry
fields: id is Entry ObjectId, publish_date
You need to be conscious of the document limit, but 16MB is one hell of a lot of unread entries/feeds, so be realistic about whether that's a limit you really need to worry about. (If it is, it should be fairly straightforward to break out User.subscriptions to its own document.)
Both of your queries now become fairly easy to write:
All entries for a specific feed that are unread:
user.subscriptions.find(feedID).unread_entries
Mark all entries prior to a publish date read:
user.subscriptions.find(feedID).unread_entries.where(publish_date.lte => my_date).delete_all
And, of course, if you simply need to mark all entries in a feed as read, that's very easy:
user.subscriptions.find(feedID).unread_entries.delete_all
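The same operations can be written out as a runnable plain JavaScript sketch on an in-memory user object. The sample data and the helper names subscription and markReadUpTo are made up; the structure follows the revised schema above, and "prior to" is treated as less than or equal to the cutoff, as in the lte query.

```javascript
// Hypothetical in-memory user following the revised schema above.
const user = {
  email_address: "reader@example.com",
  subscriptions: [
    {
      feed_id: "feedA",
      unread_entries: [
        { id: "e1", publish_date: new Date("2012-01-01T00:00:00Z") },
        { id: "e2", publish_date: new Date("2012-02-01T00:00:00Z") },
        { id: "e3", publish_date: new Date("2012-03-01T00:00:00Z") },
      ],
    },
  ],
};

function subscription(u, feedId) {
  return u.subscriptions.find((s) => s.feed_id === feedId);
}

// Marking as read means deleting the unread record.
function markReadUpTo(u, feedId, cutoff) {
  const sub = subscription(u, feedId);
  sub.unread_entries = sub.unread_entries.filter((e) => e.publish_date > cutoff);
}

markReadUpTo(user, "feedA", new Date("2012-02-01T00:00:00Z"));
const unread = subscription(user, "feedA").unread_entries.map((e) => e.id);
console.log(unread);
```

Because read entries are simply removed, the unread list only ever shrinks until new entries arrive for the feed.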