MongoDB Feed Design and Query - mongodb

I am designing a news feed for a blog site. I want blogs with recent activity from your friends to stay at the top of your feed, while blogs you have no participation in fall towards the bottom of the list. Basically, think of your Facebook feed, but for blogs.
Here is the current design I have but I'm open to suggestions to make this easier to select from:
{
_id: 1,
author: {first: "John", last: "Doe", id: 123},
title: "This is a test post.",
body: "This is the body of my post.",
date: new Date("Feb 1, 2013"),
edited: new Date("Feb 2, 2013"),
comments: [
{
author: {first: "Jane", last: "Doe", id: 124},
date: new Date("Feb 2, 2013"),
comment: "Awesome post."
},
],
likes: [
{
who: {first: "Black", last: "Smith", id: 125},
when: new Date("Feb 3, 2013")
}
],
tagged: [
{
who: {first: "Black", last: "Smith", id: 126},
when: new Date("Feb 4, 2013")
}
]}
Question 1: Assuming my friends have the ids 124 and 125, how do I select the feed so that this post is ordered by their activity, and not by the activity of user 126, who was tagged in the post later?
Question 2: Is this single collection of blogs a good design or should I normalize actions into a separate collection?

So this document you show represents one blog post and those are the comments, tags, likes, etc? If that's the case this isn't too bad.
1.
var friendIds = [124, 125]; // ids of your friends, as given in the question
db.posts.find({$or: [{'comments.author.id': {$in: friendIds}}, {'likes.who.id': {$in: friendIds}}, {'tagged.who.id': {$in: friendIds}}]}).sort({date: -1})
This will give you the posts that any of your friends have activity on, sorted by the post's date descending. I don't think MongoDB yet supports advanced sorting (like the min/max of the dates in comments, likes or tags), so sorting by one of comments, likes or tags, or sorting on the post date, is your best bet with this model.
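For Question 1, the ordering asked for (latest activity by a friend, ignoring activity by anyone else) can at least be computed in application code once the query above has returned. What follows is a minimal in-memory sketch rather than part of the original answer: `latestFriendActivity` and `sortByFriendActivity` are hypothetical helper names, and the code assumes the document shape shown in the question.

```javascript
// Timestamp (ms) of the most recent comment, like or tag made by one of
// the given friends, or -Infinity if no friend has touched the post.
function latestFriendActivity(post, friendIds) {
  const friends = new Set(friendIds);
  const times = [
    ...(post.comments || [])
      .filter((c) => friends.has(c.author.id))
      .map((c) => c.date.getTime()),
    ...(post.likes || [])
      .filter((l) => friends.has(l.who.id))
      .map((l) => l.when.getTime()),
    ...(post.tagged || [])
      .filter((t) => friends.has(t.who.id))
      .map((t) => t.when.getTime()),
  ];
  return times.length ? Math.max(...times) : -Infinity;
}

// Returns a new array, most recent friend activity first.
function sortByFriendActivity(posts, friendIds) {
  return posts
    .map((post) => ({ post, key: latestFriendActivity(post, friendIds) }))
    .sort((a, b) => (a.key < b.key ? 1 : a.key > b.key ? -1 : 0))
    .map((entry) => entry.post);
}

// Example: user 126 (not a friend) tagged post 1 last, but post 2 has the
// more recent activity by a friend, so post 2 comes first.
const examplePosts = [
  {
    _id: 1,
    likes: [{ who: { id: 125 }, when: new Date("2013-02-03") }],
    tagged: [{ who: { id: 126 }, when: new Date("2013-02-04") }],
  },
  {
    _id: 2,
    comments: [{ author: { id: 124 }, date: new Date("2013-02-03T12:00:00Z") }],
  },
];
console.log(sortByFriendActivity(examplePosts, [124, 125]).map((p) => p._id)); // [ 2, 1 ]
```

This only orders the posts that were fetched, so it is best suited to a page of results that is already bounded by the query.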
2.
Personally, I would set up a separate collection for dumping a user's feed events into. Then, as events happen, just push each event into the array of events in that user's document.
They will automatically be sorted and you can just slice the array and cap it as needed.
However with documents that grow like that you need to be careful and allocate an initial sizable amount of memory or you will encounter slow document moves on disk.
See the blurb on updates
Edit additional comments:
There are two ways to do it: either a collection where every document is a single feed event, or one where every document is a user's entire feed. Each has advantages and disadvantages. If you are OK with capping it at, say, 1000 recent feed events, I would use the one-document-per-feed approach.
So I would create a document structure like
{userid:1, feed:[(feed objects)]}
where feed is an array of feed event objects. These should be subdocuments like
{id:(a users id), name:(a users name), type:(an int for like/comment/tag), date:(some iso date), postName:(the name of the post acted on), postId:(the id of the post acted on)}
To update this feed you just need to push a new feed document onto the feed array when the feed event happens. So if user A likes a post, push the feed document onto all of user A's friends feeds.
This works well for small feeds. If you need a very large feed, I would recommend using a document per feed entry, sharding on the recipient user's id, and indexing the date field. This is getting closer to how the very large feeds at Twitter/Facebook work, but they use MySQL, which is arguably better than MongoDB for this specific use case.
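As a rough illustration of the capped per-user feed document, the push can carry its own sort and cap by combining `$push` with the `$each`, `$sort` and `$slice` modifiers. This is a sketch under assumptions: `buildFeedPush` is a hypothetical helper, the `feeds` collection name is made up, and the field names follow the `{userid, feed: [...]}` structure described above.

```javascript
// Builds the filter and update documents for something like
//   db.feeds.updateMany(filter, update)
// so that one event is fanned out to the feeds of all given friends.
function buildFeedPush(friendIds, event, cap = 1000) {
  return {
    filter: { userid: { $in: friendIds } },
    update: {
      $push: {
        feed: {
          $each: [event], // $each is required when using $sort / $slice
          $sort: { date: -1 }, // newest event first
          $slice: cap, // keep only the first `cap` entries after sorting
        },
      },
    },
  };
}

// Example: user 123 liked post 1, notify friends 124 and 125.
const likeEvent = {
  id: 123,
  name: "John Doe",
  type: 1, // assumed encoding: 1 = like
  date: new Date("2013-02-03"),
  postName: "This is a test post.",
  postId: 1,
};
const feedPush = buildFeedPush([124, 125], likeEvent);
console.log(JSON.stringify(feedPush.update, null, 2));
```

Keeping the cap inside the update means the array is trimmed in the same write that adds the event, so no separate cleanup pass is needed.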

Related

Good DB-design to reference different collections in MongoDB

I regularly face a similar problem: how to reference several different collections from the same property in MongoDB (or any other NoSQL database). I usually use Meteor.js for my projects.
Let's take an example for a notes collection that includes some tagIds:
{
_id: "XXXXXXXXXXXXXXXXXXXXXXXX",
message: "This is an important message",
dateTime: "2018-03-01T00:00:00.000Z",
tagIds: [
"123456789012345678901234",
"abcdefabcdefabcdefabcdef"
]
}
So a certain id referenced in tagIds might either be a person, a product or even another note.
Of course, the most obvious solution for this, in my opinion, is to save the type as well:
...
tagIds: [
{
type: "note",
id: "123456789012345678901234",
},
{
type: "person",
id: "abcdefabcdefabcdefabcdef",
}
]
...
Another solution I'm also thinking about is to use several fields for each collection, but I'm not sure if this has any other benefits (apart from the clear separation):
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me, as they need a lot of extra information (it would be nice to have this information implicit), so I wanted to ask whether this is the way to go, or whether you know of any other solution.
If you use Meteor Methods to pull this data, you have a chance to run some code, get from DB, run some mappings, pull again from DB etc and return a result. However, if you use pub/sub, things are different, you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed (and how much), should I not embed and build relations instead (only keep the id of a tag in the message object and use aggregations later), or should I denormalize (duplicate data)? See http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All these are ok in Mongo depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a Tags collection indexed by messageId and optionally date (for sorting). When you have a message, you get all of its tags by querying the Tags collection, rather than mapping over the tags in your message object and sending 3 different queries to 3 different collections (person, product, note).
If you embed your tags data in the message object, let's say in your UX you want to show there are 3 tags and on click you get those 3 tags. You can basically pull those tags when you pulled the message (and might not need that data) or pull the tags on an action such as click. So, you might want to consider what data you need in your view and only pull that. You could keep an Integer as number of tags on the message object and save the tags in either a tags Collection or embed in your message object.
Following the principles of NoSQL it is ok and advisable to save some data multiple times in different collections to make your queries super fast.
So in a Tags Collection you could save as well things related to your original objects. Let's say
// Tags
{
...
messageId: 'xxx',
createdAt: Date,
person: {
firstName: 'John',
lastName: 'Smith',
userId: 'yyyy',
...etc
}
},
{
...
messageId: 'xxy',
createdAt: Date,
product: {
name: 'product_name',
productId: 'yyzz',
...etc
},
}
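To make the single-query idea concrete, here is a small in-memory sketch of reading all tags for one message from such a Tags collection. It is only an illustration: `tagsForMessage` and `tagKind` are hypothetical helpers, the `tags` collection name is assumed, and the comments show the shell query the first function stands in for.

```javascript
// In-memory stand-in for:
//   db.tags.find({ messageId: messageId }).sort({ createdAt: -1 })
// which an index on { messageId: 1, createdAt: -1 } would serve.
function tagsForMessage(tags, messageId) {
  return tags
    .filter((tag) => tag.messageId === messageId)
    .sort((a, b) => b.createdAt - a.createdAt); // newest first
}

// Tells which kind of object a tag document embeds, based on the
// person / product / note sub-document it carries.
function tagKind(tag) {
  return ["person", "product", "note"].find((kind) => kind in tag) || null;
}

const exampleTags = [
  {
    messageId: "xxx",
    createdAt: new Date("2018-03-01"),
    person: { firstName: "John", lastName: "Smith", userId: "yyyy" },
  },
  {
    messageId: "xxy",
    createdAt: new Date("2018-03-02"),
    product: { name: "product_name", productId: "yyzz" },
  },
];

for (const tag of tagsForMessage(exampleTags, "xxx")) {
  console.log(tagKind(tag), tag[tagKind(tag)]);
}
```

Because each tag document already carries a copy of the data needed for display, rendering a message's tags needs no follow-up query against the person, product or note collections.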

Embedding duplicate "friendship" subdocuments to model mutual friendship edges in MongoDB

I'm trying to model bidirectional friendships in MongoDB. "Bidirectional" means that, like Facebook but unlike Twitter, if you're friends with Sam then Sam must also be friends with you. In MongoDB, the usually-recommended solution (example) seems to be something like this:
Create a User collection (aka nodes to use the proper graph theory term) and a Friendship collection (aka edges)
Each User document contains an embedded friends array, each element of which contains the ObjectID of each friend. Each element can also cache read-only info about each friend (e.g. name, photo URL) that can avoid cross-document queries for common use-cases like "display my friend list".
Adding friendships involves inserting a new Friendship document and then $push-ing the Friendship's Object ID into a friends array in both users, as well as read-only cached info about friends (e.g. name) to avoid multi-document queries when displaying friend lists.
I'm considering a different design where, instead of a separate Friendship collection, edge data will be stored (duplicated) in both nodes of the bidirectional relationship. Like this:
{
_id: new ObjectID("111111111111111111111111"),
name: "Joe",
pictureUrl: "https://foo.com/joe.jpg",
invites: [
... // similar schema to friends array below
],
friends: [
{
friendshipId: new ObjectID("123456789012345678901234"),
lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"),
user1: {
userId: new ObjectID("111111111111111111111111"),
name: "Joe", // cached, read-only data to avoid multi-doc reads
pictureUrl: "https://foo.com/joe.jpg",
},
user2: {
userId: new ObjectID("222222222222222222222222"),
name: "Bill", // cached, read-only data to avoid multi-doc reads
pictureUrl: "https://foo.com/bill.jpg",
},
}
]
},
{
_id: new ObjectID("222222222222222222222222"),
name: "Bill",
pictureUrl: "https://foo.com/bill.jpg",
invites: [
... // similar schema to friends array below
],
friends: [
{
friendshipId: new ObjectID("123456789012345678901234"),
lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"), // shared data about the edge
user1: { // data specific to each friend
userId: new ObjectID("111111111111111111111111"),
name: "Joe", // cached, read-only data to avoid multi-doc reads
pictureUrl: "https://foo.com/joe.jpg",
},
user2: { // data specific to each friend
userId: new ObjectID("222222222222222222222222"),
name: "Bill", // cached, read-only data to avoid multi-doc reads
pictureUrl: "https://foo.com/bill.jpg",
},
}
]
}
Here's how I'm planning to deal with the following:
Reads - all reads for common operations happen only from individual User documents. High-level info about friends (e.g. name, picture URL) is cached inside the friends array.
Inviting a friend - add a new document to an invites array embedded in both users (not shown above) that's similar in structure and functionality to the friends array shown above
Invitation accepted - using updateMany, add a new identical embedded document to the friends array of both users, and $pull an element from invites array of both users. Initially I'll use multi-document transactions for these updates, but because adding friendships isn't time-critical, this could be adapted to use eventual consistency.
Friendship revoked - use updateMany with a filter for {'friends.friendshipId': new ObjectID("123456789012345678901234")} to $pull the friendship subdocument from both users' friends arrays. Like above, this could use multi-document transactions initially, and eventual consistency later if needed for scale.
Updates to cached data - if a user changes info cached in friends (e.g. name or picture URL), this is an uncommon operation that can proceed slowly and one-document-at-a-time, so eventual consistency is fine.
I have two basic concerns that I'd like your advice about:
What are the problems and pitfalls with the approach described above? I know about the obvious things: extra storage, slower updates, need to add queuing and retry logic to support eventual consistency, and the risk of edge data getting out-of-sync between its two copies. I think I'm OK with these problems. But are there other, non-obvious problems that I will likely run into?
Instead of having a user1 and user2 field for each node of the edge, would it be better to use a 2-element array instead? Why or why not? Here's an example of what I mean:
friends: [
{
friendshipId: new ObjectID("123456789012345678901234"),
lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"),
users: [
{
userId: new ObjectID("111111111111111111111111"),
name: "Joe", // cached, read-only data to avoid multi-doc reads
pictureUrl: "https://foo.com/joe.jpg",
},
{
userId: new ObjectID("222222222222222222222222"),
name: "Bill", // cached, read-only data to avoid multi-doc reads
pictureUrl: "https://foo.com/bill.jpg",
},
],
}
]
BTW, I know that graph databases and even relational databases are better at modeling relationships compared to MongoDB. But for a variety of reasons I've settled on MongoDB for now, so please limit answers to MongoDB solutions rather than pointing me to using a graph or relational database. Thanks!
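For reference, the two multi-document updates described above ("invitation accepted" and "friendship revoked") could be expressed roughly as follows. This is a sketch under stated assumptions: `buildAcceptInvite` and `buildRevokeFriendship` are hypothetical helpers that only build the arguments for `updateMany`, plain strings stand in for ObjectIDs, and elements of the `invites` array are assumed to carry a `friendshipId` field like the `friends` elements do.

```javascript
// Arguments for db.users.updateMany(filter, update) when an invitation is
// accepted: add the identical friendship sub-document to both users and
// remove the matching invite from both.
function buildAcceptInvite(friendship) {
  const userIds = [friendship.user1.userId, friendship.user2.userId];
  return {
    filter: { _id: { $in: userIds } },
    update: {
      $push: { friends: friendship },
      $pull: { invites: { friendshipId: friendship.friendshipId } },
    },
  };
}

// Arguments for db.users.updateMany(filter, update) when a friendship is
// revoked: pull the sub-document from every user that holds a copy.
function buildRevokeFriendship(friendshipId) {
  return {
    filter: { "friends.friendshipId": friendshipId },
    update: { $pull: { friends: { friendshipId: friendshipId } } },
  };
}

// Placeholder ids; real code would use ObjectID values.
const exampleFriendship = {
  friendshipId: "123456789012345678901234",
  lastMeeting: new Date("2019-02-07T20:35:55.256Z"),
  user1: { userId: "111111111111111111111111", name: "Joe" },
  user2: { userId: "222222222222222222222222", name: "Bill" },
};
console.log(JSON.stringify(buildAcceptInvite(exampleFriendship), null, 2));
console.log(JSON.stringify(buildRevokeFriendship(exampleFriendship.friendshipId), null, 2));
```

Note that `updateMany` modifies each matched document atomically but not the set as a whole, which is why the question pairs it with multi-document transactions or retry logic.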

Mongo: Two collections with pagination (in single list in html)

Currently in our system we have two separate collections, of invites, and users. So we can send an invite to someone, and that invite will have some information attached to it and is stored in the invites collection. If the user registers his account information is stored in the users collection.
Not every user has to have an invite, and not every invite has to have a user. We check whether a user has an invite (or vice versa) using the email address, which in that case is stored in both collections.
Originally, our dashboard had a user overview with a page where you can see the current users and paginate between them.
Now we want to have one single page (and single table) in which we can view both the invites and the users and paginate through them.
Let's say our data looks like this:
invites: [
{ _id: "5af42e75583c25300caf5e5b", email: "john@doe.com", name: "John" },
{ _id: "53fbd269bde85f02007023a1", email: "jane@doe.com", name: "Jane" },
...
]
users: [
{ _id: "53fe288be081540200733892", email: "john@doe.com", firstName: "John" },
{ _id: "53fd103de08154020073388d", email: "steve@doe.com", firstName: "Steve" },
...
]
Points to note.
Some users can be matched with an invite based on the email (but that is not required)
Some invites never register
The field names are not always exactly the same
Is it possible to make a paginated list of all emails and sort on them? So if there is an email that starts with an a in collection invites, that is picked before the email that starts with a b in collection users etc. And then use offset / limit to paginate through it.
Basically, I want to "merge" the two collections in something that would be akin to a MySQL view and be able to query on that as if the entire thing was one collection.
I'm preferably looking for a solution that doesn't change the data structure in the collections (a projected view or something similar is fine), because parts of the code already rely on the given structure, even though in light of this new requirement it might not be the best approach.
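One possible direction, assuming MongoDB 4.4 or newer, is the `$unionWith` aggregation stage, which appends the documents of a second collection to the pipeline so the combined stream can be sorted and paginated as one. The sketch below only builds the pipeline; `buildMergedEmailPage` is a hypothetical helper name, and the `source` field is an addition for telling the two kinds of row apart.

```javascript
// Pipeline for db.invites.aggregate(...): normalise both collections to
// { email, name, source }, merge them, sort by email, then paginate.
function buildMergedEmailPage(offset, limit) {
  return [
    { $project: { email: 1, name: 1, source: { $literal: "invite" } } },
    {
      $unionWith: {
        coll: "users",
        pipeline: [
          {
            $project: {
              email: 1,
              name: "$firstName", // users store the name as firstName
              source: { $literal: "user" },
            },
          },
        ],
      },
    },
    { $sort: { email: 1, _id: 1 } }, // _id as tie-breaker for stable pages
    { $skip: offset },
    { $limit: limit },
  ];
}

// Usage in the shell (second page of 20 rows):
//   db.invites.aggregate(buildMergedEmailPage(20, 20))
console.log(JSON.stringify(buildMergedEmailPage(20, 20), null, 2));
```

With this pipeline an email that exists in both collections shows up twice, once per source; collapsing such pairs into one row would need an extra `$group` on `email` before the sort.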

How to efficiently count one-to-squillion relationships in Mongoose / MongoDB? (e.g. the "like" count for a Tweet on Twitter)

Let's imagine I'm building an app similar to Twitter just for an easy example (and using Mongoose / MongoDB).
I'd have a collection for "tweets" and my question is: how can I manage the "like count" for a tweet without putting unnecessary strain on the database?
My first instinct was to have another collection named "likes" and each document would store the id of the user who liked the tweet, and the id of the tweet they liked.
But then I realized if I want show 20 tweets on the front-end it would take me 21 queries (this is where I think I'm misunderstanding something basic and it shouldn't take me this many queries). One query to find the 20 most recent tweets, and another query-per tweet to count how many related "like" documents there are for it. Is there a more efficient way of handling that in MongoDB? Or is this where I'd need to turn to some sort of caching solution in my app?
My next thought was to instead embed a "usersWhoHaveLiked" array within each tweet document like this:
{
_id: ObjectId("abc123abc123"),
title: "My first tweet",
author: 3,
usersWhoHaveLiked: [3, 20, 17, 5]
}
But if hundreds of thousands of users can "like" a tweet, that array could become incredibly large, and I'm worried that modifying an array of that size could be CPU-expensive / slow, or outright overflow the 16 MB allowed per document.
I realize there are many different ways of architecting this solution, so I'm not looking for a best way, which I know would be highly subjective... what makes this question at least a little bit objective is that we want to minimize stress put on the db & server, which is measurable.
I'm a database rookie so if there's a Mongoose / MongoDB flavored way of handling this please feel free to point out things that might be painfully obvious to others :)
Thanks!
Referring to the three types of references stated by Mongo's blog on this topic:
One-to-Few
Generally less than a few hundred items but other factors do have an impact.
A data object for your example might look like:
{
_id: ObjectId("abc123abc123"),
title: "My first tweet",
author: 3,
usersWhoHaveLiked: [
{ name: 'Foo' },
{ name: 'Bar' }
]
}
To get the tweet and like count would be one query to mongo and then getting the length of the usersWhoHaveLiked array:
Tweets.findById('abc123abc123').exec().then((tweet) => {
const likeCount = tweet.usersWhoHaveLiked.length;
// do something with tweet and likeCount
});
One-to-Many
Generally "up to several hundred [items], but never more than a couple thousand or so".
A data object for your example might look like:
{
_id: ObjectId("abc123abc123"),
title: "My first tweet",
author: 3,
usersWhoHaveLiked: [3, 20, 17, 5]
}
To get the tweet and like count would be the same as one-to-few:
Tweets.findById('abc123abc123').exec().then((tweet) => {
const likeCount = tweet.usersWhoHaveLiked.length;
// do something with tweet and likeCount
});
One-to-Squillions
Generally "more than a couple thousand or so".
A data object for your example might look like:
// tweet
{
_id: ObjectId("abc123abc123"),
title: "My first tweet",
author: 3
}
// likes
{
_id: ObjectId("abc123abc124"),
tweet: ObjectId("abc123abc123"),
author: 4 // or could be embedded info as well or a mix
}
To get the tweet and like count would be two queries:
Promise.all([
Tweets.findById('abc123abc123').exec(),
Likes.count({ tweet: 'abc123abc123' }).exec()
]).then(([tweet, likeCount]) => {
// do something with tweet and likeCount
});
There are some ways to simplify this and I will leave them up to you to explore:
In the first two examples, create a virtual getter that will get the array length for you (i.e. tweet.likeCount)
For the last example, create a post save hook from likes that will update a property on tweets (e.g. likeCount).
A final note: which of the three strategies to use depends on more than just the number of items. A couple of other key concerns are whether the data needs to stand on its own and how quickly the array changes.
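The `likeCount` idea for the one-to-squillions case can be sketched in memory as follows. This is an illustration and not Mongoose code: `LikeCounter` is a hypothetical class in which a Set stands in for the likes collection (with a unique index on tweet and user assumed) and a Map stands in for a `likeCount` field on each tweet. The comments name the updates the real hooks would issue.

```javascript
class LikeCounter {
  constructor() {
    this.likes = new Set(); // stand-in for the likes collection
    this.likeCount = new Map(); // stand-in for tweet.likeCount
  }

  like(tweetId, userId) {
    const key = `${tweetId}:${userId}`;
    if (this.likes.has(key)) return false; // a unique index would reject this
    this.likes.add(key);
    // Real update: Tweets.updateOne({ _id: tweetId }, { $inc: { likeCount: 1 } })
    this.likeCount.set(tweetId, this.count(tweetId) + 1);
    return true;
  }

  unlike(tweetId, userId) {
    if (!this.likes.delete(`${tweetId}:${userId}`)) return false;
    // Real update: Tweets.updateOne({ _id: tweetId }, { $inc: { likeCount: -1 } })
    this.likeCount.set(tweetId, this.count(tweetId) - 1);
    return true;
  }

  // Reading the count never touches the likes collection.
  count(tweetId) {
    return this.likeCount.get(tweetId) || 0;
  }
}

const counter = new LikeCounter();
counter.like("abc123abc123", 3);
counter.like("abc123abc123", 20);
counter.like("abc123abc123", 3); // duplicate, ignored
console.log(counter.count("abc123abc123")); // 2
```

With the counter stored on the tweet, showing 20 tweets with their like counts is a single query on the tweets collection, at the cost of one extra small write per like.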

Is this structure possible to implement with MongoDB?

So I am coding an app in Meteor which is a mini social network for my college. The issue I am facing right now is that my data is relational.
This web app allows people to write posts and post images and links. People who follow a user see that user's posts in their feed. People can share these posts. So the data is interconnected.
So basically
Users have followers
Followers get the posts from the people they follow
They can comment and share
The shared post appears in the feed of the people who follow the sharer
Every post should be tagged from a predefined set of tags
People who follow a tag should get the posts with that tag, whether or not they follow the person who wrote the post
You made the first – and correct – step of defining your use cases before you start to model your data.
However, you have the misconception that interrelated data needs an RDBMS.
There are several ways to model relationships between documents.
Note: The following examples are heavily simplified for brevity and comprehensibility
Embedding
A 1:1 relationship can be modeled simply by embedding:
{
_id: "joe",
name: "Joe Bookreader",
address: {
street: "123 Fake Street",
city: "Faketon",
state: "MA",
zip: "12345"
}
}
A 1:many relationship can be modeled by embedding, too:
{
_id: "joe",
name: "Joe Bookreader",
phone:[
{ type:"mobile", number:"+1 2345 67890"},
{ type:"home", number:"+1 2345 987654"}
]
}
References
The major difference in comparison to RDBMS is that you resolve the references in your application code, as shown below.
Implicit references
Let's say we have publisher and books. A publisher doc may look like this
{
_id: "acme",
name: "Acme Publishing, Inc."
}
and a book doc may look like this
{
_id:"9788636505700",
name: "The Book of Foo",
publisher: "acme"
}
Now here comes the important part: We have basically two use cases we can cover with this data. The first one being
For "The Book of Foo", get the details for the publisher
Easy enough, since we already have "The Book of Foo" and its values:
db.publishers.find({_id:"acme"})
The other use case would be
Which books have been published by Acme Publishing, Inc. ?
Since we have Acme Publishing, Inc's data, again, that is easy enough:
db.books.find({publisher:"acme"})
Explicit References
MongoDB has a notion of references, commonly referred to as DBRefs.
However, these references are resolved by the driver and not by the MongoDB server. Personally, I have never used or needed them, since implicit references work perfectly most of the time.
Example modeling for "Users make posts"
Let's say we have a user document like
{
_id: "joe",
name: "Joe Bookreader",
joined: ISODate("2015-05-05T06:31:00Z"),
…
}
When doing it naively, we would simply embed the posts into the user document.
However, that would limit the number of posts one can make, since there is a hardcoded size limit for BSON documents of 16MB.
So, it is pretty obvious that every post should have its own document with a reference to the author:
{
_id: someObjectId,
title: "My first post",
text: "Some text",
author: "joe"
}
However, this comes with a problem: we want to show the author's name, not their id.
For a single post, we could simply do a lookup in the users collection to get the name. But what if we want to display a list of posts? That would require a lot of queries. So instead, we use redundancy to save those queries and optimize our application:
{
_id: someObjectId,
title: "My first post",
text: "Some text",
author: { id:"joe", name: "Joe Bookreader"}
}
So for a list of posts, we can display them with the poster's name without additional queries. Only when a user wants to get details about the poster would you look up the poster by id. With that, we have saved a lot of queries for a common use case. You may say "Stop! What if a user changes their name?" Well, for starters, it is a relatively rare use case. And even then, it is not much of a problem. First, of course, we'd have to update the user document:
db.users.update({"_id":"joe"},{$set:{name:"Joe A. Bookreader"}})
And then, we have to take an additional step:
db.posts.update(
{ "author.id": "joe" },
{ $set:{ "author.name": "Joe A. Bookreader" }},
{ multi: true}
)
Of course, this is kind of costly. But what have we done here? We optimized a common use case at the expense of a rather rare use case. A good bargain in my book.
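In more recent shells and drivers, the same two steps are usually written with `updateOne` and `updateMany` instead of `update` with `multi: true`. The sketch below only builds the two operations so they can be kept together in one place; `buildRenameOps` is a hypothetical helper, and the collection and field names are the ones used in the example above.

```javascript
// Returns the two writes needed when a user changes their name:
// one for the canonical user document, one for the copies in posts.
function buildRenameOps(userId, newName) {
  return [
    {
      collection: "users",
      method: "updateOne",
      filter: { _id: userId },
      update: { $set: { name: newName } },
    },
    {
      collection: "posts",
      method: "updateMany", // touches every post by this author
      filter: { "author.id": userId },
      update: { $set: { "author.name": newName } },
    },
  ];
}

// Usage in the shell:
//   buildRenameOps("joe", "Joe A. Bookreader").forEach(function (op) {
//     db[op.collection][op.method](op.filter, op.update);
//   });
console.log(JSON.stringify(buildRenameOps("joe", "Joe A. Bookreader"), null, 2));
```

Running the user update first means the canonical name is correct immediately, while the cached copies in posts catch up afterwards.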
I hope this simple example helps you better understand how you can approach your application's use cases with MongoDB.