Embedding duplicate "friendship" subdocuments to model mutual friendship edges in MongoDB

I'm trying to model bidirectional friendships in MongoDB. "Bidirectional" means that, like Facebook but unlike Twitter, if you're friends with Sam then Sam must also be friends with you. In MongoDB, the usually-recommended solution (example) seems to be something like this:
Create a User collection (aka nodes, to use the proper graph theory term) and a Friendship collection (aka edges)
Each User document contains an embedded friends array, each element of which contains the ObjectID of each friend. Each element can also cache read-only info about each friend (e.g. name, photo URL) that can avoid cross-document queries for common use-cases like "display my friend list".
Adding friendships involves inserting a new Friendship document and then $push-ing the Friendship's Object ID into a friends array in both users, as well as read-only cached info about friends (e.g. name) to avoid multi-document queries when displaying friend lists.
I'm considering a different design where, instead of a separate Friendship collection, edge data will be stored (duplicated) in both nodes of the bidirectional relationship. Like this:
{
  _id: new ObjectID("111111111111111111111111"),
  name: "Joe",
  pictureUrl: "https://foo.com/joe.jpg",
  invites: [
    ... // similar schema to friends array below
  ],
  friends: [
    {
      friendshipId: new ObjectID("123456789012345678901234"),
      lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"),
      user1: {
        userId: new ObjectID("111111111111111111111111"),
        name: "Joe", // cached, read-only data to avoid multi-doc reads
        pictureUrl: "https://foo.com/joe.jpg",
      },
      user2: {
        userId: new ObjectID("222222222222222222222222"),
        name: "Bill", // cached, read-only data to avoid multi-doc reads
        pictureUrl: "https://foo.com/bill.jpg",
      },
    }
  ]
},
{
  _id: new ObjectID("222222222222222222222222"),
  name: "Bill",
  pictureUrl: "https://foo.com/bill.jpg",
  invites: [
    ... // similar schema to friends array below
  ],
  friends: [
    {
      friendshipId: new ObjectID("123456789012345678901234"),
      lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"), // shared data about the edge
      user1: { // data specific to each friend
        userId: new ObjectID("111111111111111111111111"),
        name: "Joe", // cached, read-only data to avoid multi-doc reads
        pictureUrl: "https://foo.com/joe.jpg",
      },
      user2: { // data specific to each friend
        userId: new ObjectID("222222222222222222222222"),
        name: "Bill", // cached, read-only data to avoid multi-doc reads
        pictureUrl: "https://foo.com/bill.jpg",
      },
    }
  ]
}
Here's how I'm planning to deal with the following:
Reads - all reads for common operations happen only from individual User documents. High-level info about friends (e.g. name, picture URL) is cached inside the friends array.
Inviting a friend - add a new document to an invites array embedded in both users (not shown above) that's similar in structure and functionality to the friends array shown above
Invitation accepted - using updateMany, add a new identical embedded document to the friends array of both users, and $pull an element from invites array of both users. Initially I'll use multi-document transactions for these updates, but because adding friendships isn't time-critical, this could be adapted to use eventual consistency.
Friendship revoked - use updateMany with a filter for {'friends.friendshipId': new ObjectID("123456789012345678901234")} to $pull the friendship subdocument from both users' friends arrays. Like above, this could use multi-document transactions initially, and eventual consistency later if needed for scale.
Updates to cached data - if a user changes info cached in friends (e.g. name or picture URL), this is an uncommon operation that can proceed slowly and one-document-at-a-time, so eventual consistency is fine.
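The "invitation accepted" and "friendship revoked" steps above could be sketched as plain filter/update documents for updateMany. This is only a sketch under assumptions: the helper names buildAcceptInvite and buildRevokeFriendship are hypothetical, the inviteId field on invites elements is assumed (the invites schema is elided above), and ids are plain strings rather than ObjectIDs so the snippet runs without a driver.

```javascript
// Hypothetical helpers: they only build the arguments for updateMany
// and never talk to the database.

// Invitation accepted: add the same friendship subdocument to both users
// and remove the matching invite from both.
function buildAcceptInvite(friendship, inviteId) {
  return {
    filter: {
      _id: { $in: [friendship.user1.userId, friendship.user2.userId] },
    },
    update: {
      $push: { friends: friendship },
      // Assumption: each invites element carries an inviteId field.
      $pull: { invites: { inviteId: inviteId } },
    },
  };
}

// Friendship revoked: pull the subdocument from every user that holds it.
function buildRevokeFriendship(friendshipId) {
  return {
    filter: { "friends.friendshipId": friendshipId },
    update: { $pull: { friends: { friendshipId: friendshipId } } },
  };
}

// Placeholder string ids stand in for ObjectIDs.
const sampleFriendship = {
  friendshipId: "f1",
  lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"),
  user1: { userId: "u1", name: "Joe", pictureUrl: "https://foo.com/joe.jpg" },
  user2: { userId: "u2", name: "Bill", pictureUrl: "https://foo.com/bill.jpg" },
};

const acceptOp = buildAcceptInvite(sampleFriendship, "i1");
const revokeOp = buildRevokeFriendship(sampleFriendship.friendshipId);
console.log(JSON.stringify(acceptOp, null, 2));
console.log(JSON.stringify(revokeOp, null, 2));
```

Each result would then be passed as the filter and update arguments of updateMany on the users collection, inside a multi-document transaction at first, as described above.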
I have two basic concerns that I'd like your advice about:
What are the problems and pitfalls with the approach described above? I know about the obvious things: extra storage, slower updates, need to add queuing and retry logic to support eventual consistency, and the risk of edge data getting out-of-sync between its two copies. I think I'm OK with these problems. But are there other, non-obvious problems that I will likely run into?
Instead of having a user1 and user2 field for each node of the edge, would it be better to use a 2-element array instead? Why or why not? Here's an example of what I mean:
friends: [
  {
    friendshipId: new ObjectID("123456789012345678901234"),
    lastMeeting: new Date("2019-02-07T20:35:55.256+00:00"),
    users: [
      {
        userId: new ObjectID("111111111111111111111111"),
        name: "Joe", // cached, read-only data to avoid multi-doc reads
        pictureUrl: "https://foo.com/joe.jpg",
      },
      {
        userId: new ObjectID("222222222222222222222222"),
        name: "Bill", // cached, read-only data to avoid multi-doc reads
        pictureUrl: "https://foo.com/bill.jpg",
      },
    ],
  }
]
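To make the user1/user2 versus two-element-array comparison concrete, here is a rough sketch of how "find the other user" and "find friendships involving a user" might look under each shape. All function names are hypothetical, and ids are plain strings here (real ObjectIDs would need an equality method rather than ===).

```javascript
// Design A: fixed user1/user2 fields.
function otherUserFromFields(friendship, myId) {
  return friendship.user1.userId === myId ? friendship.user2 : friendship.user1;
}

// Design B: two-element users array.
function otherUserFromArray(friendship, myId) {
  return friendship.users.find((u) => u.userId !== myId);
}

// Filter for "users whose friends array mentions userId".
// Design A has to check both named slots.
function filterByFields(userId) {
  return {
    $or: [
      { "friends.user1.userId": userId },
      { "friends.user2.userId": userId },
    ],
  };
}

// Design B can use a single dotted path through the array.
function filterByArray(userId) {
  return { "friends.users.userId": userId };
}

const joe = { userId: "u1", name: "Joe" };
const bill = { userId: "u2", name: "Bill" };
const edgeA = { friendshipId: "f1", user1: joe, user2: bill };
const edgeB = { friendshipId: "f1", users: [joe, bill] };

console.log(otherUserFromFields(edgeA, "u1").name);
console.log(otherUserFromArray(edgeB, "u1").name);
console.log(JSON.stringify(filterByFields("u2")));
console.log(JSON.stringify(filterByArray("u2")));
```

Under these assumptions the array shape needs only one path per query, while the named-field shape needs an $or over two paths.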
BTW, I know that graph databases and even relational databases are better at modeling relationships compared to MongoDB. But for a variety of reasons I've settled on MongoDB for now, so please limit answers to MongoDB solutions rather than pointing me to using a graph or relational database. Thanks!

Related

Good DB-design to reference different collections in MongoDB

I regularly face a similar problem: how to reference several different collections in the same property in MongoDB (or any other NoSQL database). Usually I use Meteor.js for my projects.
Let's take an example for a notes collection that includes some tagIds:
{
_id: "XXXXXXXXXXXXXXXXXXXXXXXX",
message: "This is an important message",
dateTime: "2018-03-01T00:00:00.000Z",
tagIds: [
"123456789012345678901234",
"abcdefabcdefabcdefabcdef"
]
}
So a certain id referenced in tagIds might either be a person, a product or even another note.
Of course, the most obvious solution for this, imo, is to save the type as well:
...
tagIds: [
{
type: "note",
id: "123456789012345678901234",
},
{
type: "person",
id: "abcdefabcdefabcdefabcdef",
}
]
...
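With the type stored next to each id, the references can be grouped per type so that each target collection is queried at most once. A minimal sketch, assuming a hypothetical helper named groupTagIdsByType:

```javascript
// Group typed references by their type so each collection can be
// queried once with an $in filter.
function groupTagIdsByType(tagIds) {
  const byType = {};
  for (const tag of tagIds) {
    if (!byType[tag.type]) {
      byType[tag.type] = [];
    }
    byType[tag.type].push(tag.id);
  }
  return byType;
}

const groupedTags = groupTagIdsByType([
  { type: "note", id: "123456789012345678901234" },
  { type: "person", id: "abcdefabcdefabcdefabcdef" },
]);

// Build one filter per type; which collection each type maps to is up to the app.
const tagFilters = Object.entries(groupedTags).map(([type, ids]) => ({
  type: type,
  filter: { _id: { $in: ids } },
}));
console.log(JSON.stringify(tagFilters, null, 2));
```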
Another solution I'm also thinking about is to use several fields for each collection, but I'm not sure if this has any other benefits (apart from the clear separation):
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me as they need a lot of extra information (it would be nice to have this information implicit) and so I wanted to ask, if this is the way to go, or if you know any other solution for this?
If you use Meteor Methods to pull this data, you have a chance to run some code, get from DB, run some mappings, pull again from DB etc and return a result. However, if you use pub/sub, things are different, you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed (and how much), should I not embed and instead build relations (keeping only the id of a tag in the message object) and use aggregations later, or should I denormalize (duplicate data)? See http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All these are ok in Mongo depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a tags Collection indexed by messageId and eventually date (for sorting). When you have a message, you get all tags by querying the Tags Collection rather than mapping over your tags in your message object and send 3 different queries to 3 different Collections (person, product, note).
If you embed your tags data in the message object, let's say in your UX you want to show there are 3 tags and on click you get those 3 tags. You can basically pull those tags when you pulled the message (and might not need that data) or pull the tags on an action such as click. So, you might want to consider what data you need in your view and only pull that. You could keep an Integer as number of tags on the message object and save the tags in either a tags Collection or embed in your message object.
Following the principles of NoSQL it is ok and advisable to save some data multiple times in different collections to make your queries super fast.
So in a Tags Collection you could save as well things related to your original objects. Let's say
// Tags
{
  ...
  messageId: 'xxx',
  createdAt: Date,
  person: {
    firstName: 'John',
    lastName: 'Smith',
    userId: 'yyyy',
    ...etc
  }
},
{
  ...
  messageId: 'xxy',
  createdAt: Date,
  product: {
    name: 'product_name',
    productId: 'yyzz',
    ...etc
  }
}

Mongo: Two collections with pagination (in single list in html)

Currently in our system we have two separate collections, of invites, and users. So we can send an invite to someone, and that invite will have some information attached to it and is stored in the invites collection. If the user registers his account information is stored in the users collection.
Not every user has to have an invite, and not every invite has to have a user. We check whether a user has an invite (or vice versa) by email address, which in that case is stored in both collections.
Originally in our dashboard we have had a user overview, in which there is a page where you can see the current users and paginate between them.
Now we want to have one single page (and single table) in which we can view both the invites and the users and paginate through them.
Let's say our data looks like this:
invites: [
{ _id: "5af42e75583c25300caf5e5b", email: "john#doe.com", name: "John" },
{ _id: "53fbd269bde85f02007023a1", email: "jane#doe.com", name: "Jane" },
...
]
users: [
{ _id: "53fe288be081540200733892", email: "john#doe.com", firstName: "John" },
{ _id: "53fd103de08154020073388d", email: "steve#doe.com", firstName: "Steve" },
...
]
Points to note.
Some users can be matched with an invite based on the email (but that is not required)
Some invites never register
The field names are not always exactly the same
Is it possible to make a paginated list of all emails and sort on them? So if there is an email that starts with an a in collection invites, that is picked before the email that starts with a b in collection users etc. And then use offset / limit to paginate through it.
Basically, I want to "merge" the two collections in something that would be akin to a MySQL view and be able to query on that as if the entire thing was one collection.
I'm preferably looking for a solution that doesn't change the data structure in the collections (a projected view or something similar would be fine), because parts of the code already rely on the given structure, even though that structure might not be the best approach in light of this new requirement.
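One possible direction that leaves the stored documents untouched is an aggregation that unions both collections, sorts by email, and then applies offset/limit. The sketch below only builds the pipeline. It assumes MongoDB 4.4 or later, where the $unionWith stage is available; the function name buildMergedEmailPage and the source field are hypothetical, and the users' firstName is mapped onto name purely for illustration.

```javascript
// Build an aggregation pipeline meant to run on the invites collection.
function buildMergedEmailPage(offset, limit) {
  return [
    // Normalize invites to a common shape.
    { $project: { _id: 0, email: 1, name: 1, source: { $literal: "invite" } } },
    // Append users, normalized to the same shape.
    {
      $unionWith: {
        coll: "users",
        pipeline: [
          {
            $project: {
              _id: 0,
              email: 1,
              name: "$firstName",
              source: { $literal: "user" },
            },
          },
        ],
      },
    },
    // Sort on email, with source as a tie-breaker so pages stay stable.
    { $sort: { email: 1, source: 1 } },
    { $skip: offset },
    { $limit: limit },
  ];
}

const mergedPipeline = buildMergedEmailPage(20, 10);
console.log(JSON.stringify(mergedPipeline, null, 2));
```

The result would be passed to aggregate on the invites collection. An email that exists in both collections appears twice in this sketch; collapsing such pairs would need an extra grouping stage.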

mongodb data model design - embedded document only?

I'm not sure if I have to use only embedded documents for this example:
I have a basic model for a user
Every user has multiple pages
Every page has multiple sessions
Every session has multiple actions
So it might look like this:
user = {
  'email': 'test#test.com',
  'pages': [
    {
      'name': 'best page',
      'sessions': [
        {
          session_name: 'abc',
          actions: [
            {abc: 'def'},
            {abc: 'def'}
          ]
        },
      ]
    }, ..
  ]
};
Basically there are 3 nested arrays. Sessions data will be used just for reading (no update operation). I was thinking about making sessions as another model with reference on page. Is it a good idea?
You are doing absolutely fine with your architecture. This is a case of a one-to-many relationship. You should keep the sessions data in nested form, as illustrated in your example. If sessions were stored separately, then even to query just the sessions data you might need to issue multiple queries to resolve the references in future. So, as a better practice, you should not keep sessions data separately. For more information, see Mongo: Model One-to-Many Relationships

How to model recurency and hasMany relationship in MongoDB?

In my app there are users. Each user may have many friends (other users). If user A has a friend B then user B has a friend A, always. I will have to query the collection of users to get, for example, all friends of user A. I will also have to use a geospatial index for this query to get all friends of user A within a given radius of user A.
I have some problem when trying to "model" this structure in MongoDB.
For now I have this (in Mongoose):
{
created: { type: Date, default: Date.now },
phone_number: { type: String, unique: true },
location: { type: [Number], index: '2dsphere' },
friends: [{ phone_number: String }]
}
So each user contains an array of other users' phone numbers (the phone number identifies each user). But I don't think it's a good idea, as one user may have zero or many friends, so the friends array will be mutable and may grow significantly.
What would be the best option for modeling this structure?
Two approaches:
Join Collection
Similar to the relational approach, where there is a collection whose documents represent friendships (essentially two object ids and possibly metadata about the relationship).
Arrays on each user
Create an array and push the object ids of the friends onto the array.
When a friendship is created you would need to modify both friends (push each friend onto the other's friend array). It would be the same for friendship dissolution.
Which one?
The join collection approach is slower, as it requires multiple queries to get the friendship data, as opposed to having it persisted with the users themselves (taking advantage of data locality). However, if the number of relationships grows in an unbounded fashion, the array approach is not feasible. MongoDB documents have a 16 MB size limit, and in practice arrays become slow and unwieldy to work with beyond roughly 1000 items.
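With the array-on-each-user approach and the schema from the question, the "friends of user A within a given radius" query might be sketched as a single filter that combines $in on the friends' phone numbers with $nearSphere on the indexed location field. The helper name buildNearbyFriendsFilter is hypothetical, and location is assumed to be stored as [longitude, latitude].

```javascript
// Build a filter for "friends of `user` within `radiusMeters` of the user".
function buildNearbyFriendsFilter(user, radiusMeters) {
  return {
    // Restrict candidates to the user's friends.
    phone_number: { $in: user.friends.map((f) => f.phone_number) },
    // Uses the 2dsphere index on location; distance is in meters.
    location: {
      $nearSphere: {
        $geometry: { type: "Point", coordinates: user.location },
        $maxDistance: radiusMeters,
      },
    },
  };
}

// Made-up sample user.
const sampleUser = {
  phone_number: "555-0100",
  location: [21.01, 52.23], // [longitude, latitude]
  friends: [{ phone_number: "555-0101" }, { phone_number: "555-0102" }],
};

const nearbyFilter = buildNearbyFriendsFilter(sampleUser, 5000);
console.log(JSON.stringify(nearbyFilter, null, 2));
```

The resulting object would be passed to find on the users collection (or the Mongoose model).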

MongoDB Feed Design and Query

I am designing a news feed for a blog site. I am trying to design the feed so that blogs with recent activity from your friends stay at the top of your feed, while blogs you have no participation in fall towards the bottom of the list. Basically, think of your Facebook feed but for blogs.
Here is the current design I have but I'm open to suggestions to make this easier to select from:
{
  _id: 1,
  author: {first: "John", last: "Doe", id: 123},
  title: "This is a test post.",
  body: "This is the body of my post.",
  date: new Date("Feb 1, 2013"),
  edited: new Date("Feb 2, 2013"),
  comments: [
    {
      author: {first: "Jane", last: "Doe", id: 124},
      date: new Date("Feb 2, 2013"),
      comment: "Awesome post."
    },
  ],
  likes: [
    {
      who: {first: "Black", last: "Smith", id: 125},
      when: new Date("Feb 3, 2013")
    }
  ],
  tagged: [
    {
      who: {first: "Black", last: "Smith", id: 126},
      when: new Date("Feb 4, 2013")
    }
  ]
}
Question 1: Assuming my friends have the ids 124 and 125, how do I select the feed so that the order of this post in the results is determined by their activity, not by user 126, who was tagged in the post later?
Question 2: Is this single collection of blogs a good design or should I normalize actions into a separate collection?
So this document you show represents one blog post and those are the comments, tags, likes, etc? If that's the case this isn't too bad.
1.
db.posts.find({'$or':[{'comments.author.id':{$in:[some list of friends]}}, {'likes.who.id':{$in:[some list of friends]}}, {'tagged.who.id':{$in:[some list of friends]}}]}).sort({date:-1})
This will give you the posts all your friends have activity on sorted by the post's date descending. I don't think mongodb yet supports advanced sorting (like the min/max of the dates in comments, likes or tags) so sorting by either one of comments, likes or tags or sorting on post date is your best bet with this model.
2.
Personally, I would set up a separate collection for dumping a user's feed events into. Then as events happen, just push the event into the array of events in the document.
They will automatically be sorted and you can just slice the array and cap it as needed.
However with documents that grow like that you need to be careful and allocate an initial sizable amount of memory or you will encounter slow document moves on disk.
See the blurb on updates
Edit additional comments:
There are two ways to do it: either a collection where every document is a feed event, or one where every document is a user's entire feed. Each has advantages and disadvantages. If you are OK with capping it at, say, 1000 recent feed events, I would use the method where one document represents an entire feed.
So I would create a document structure like
{userid:1, feed:[(feed objects)]}
where feed is an array of feed event objects. These should be subdocuments like
{id:(a users id), name:(a users name), type:(an int for like/comment/tag), date:(some iso date), postName:(the name of the post acted on), postId:(the id of the post acted on)}
To update this feed you just need to push a new feed document onto the feed array when the feed event happens. So if user A likes a post, push the feed document onto all of user A's friends feeds.
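The push-and-cap step described above might be sketched with the $each, $sort and $slice modifiers of $push, so that each feed array keeps only the newest events. The helper name buildFeedPush is hypothetical; the field names userid, feed and date follow the structure shown above.

```javascript
// Build filter/update documents that push one event onto the feeds of all
// friends and keep only the newest `maxEvents` entries.
function buildFeedPush(friendIds, event, maxEvents) {
  return {
    filter: { userid: { $in: friendIds } },
    update: {
      $push: {
        feed: {
          $each: [event],
          $sort: { date: -1 }, // newest first
          $slice: maxEvents, // keep the first maxEvents after sorting
        },
      },
    },
  };
}

// Made-up sample event using the subdocument fields listed above.
const likeEvent = {
  id: 124,
  name: "Jane Doe",
  type: 1,
  date: new Date("2013-02-03T00:00:00Z"),
  postName: "This is a test post.",
  postId: 1,
};

const feedOp = buildFeedPush([125, 126], likeEvent, 1000);
console.log(JSON.stringify(feedOp, null, 2));
```

The filter and update would be passed to updateMany on whichever collection holds the per-user feed documents.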
This works well for small feeds. If you need a very large feed, I would recommend using a document per feed entry, sharding on the recipient user's id, and indexing the date field. This is getting closer to how the very large feeds at Twitter/FB work, but they use MySQL, which is arguably better than MongoDB for this specific use case.