MongoDB data structure with a large number of internal documents

I am relatively new to MongoDB, and so far am really impressed. I am struggling with the best way to set up my document stores, though. I am trying to do some summary analytics using Twitter data, and I am not sure whether to put the tweets into the user document or to keep them as a separate collection. It seems like putting the tweets inside the user document would quickly hit the limit with regard to size. If that is the case, then what is a good way to run MapReduce across a group of users' tweets?
I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.
As I am sure you are all bored of hearing, I am used to RDBMS land, where I would lay out my schema like:
| USER  |
---------
| ID
| Name
| Etc.

| TWEET |
---------
| ID
| UserID
| Etc.
It seems like the logical schema in Mongo would be:
User
|- Tweet (0..3000)
|    |- Entities
|    |    |- Hashtags (0..10+)
|    |    |- urls (0..5)
|    |    |- user_mentions (0..12)
|    |- GeoData (0..20)
|- somegroupID
but wouldn't that quickly bloat the User document beyond capacity? On the other hand, I would like to run analysis on tweets belonging to users with a similar somegroupID. It conceptually makes sense to lay the model out as above, but at what point does that become too unwieldy? And what are viable alternatives?

You're right that you'll probably run into the 16MB MongoDB document limit here. You don't say what sort of analysis you'd like to run, so it is difficult to recommend a schema; MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
Instead of putting your tweets in a user document, you can of course quite easily do the opposite: add a user-id and group-id to the tweet documents themselves. Then, if you need additional fields from the user, you can always pull those in a second query upon display.
That would mean a design for a tweet document like:
{
    'hashtags': [ '#foo', '#bar' ],
    'urls': [ 'http://url1.example.com', 'http://url2.example.com' ],
    'user_mentions': [ 'queen_uk' ],
    'geodata': { ... },
    'userid': 'derickr',
    'somegroupid': 40
}
And then for a user collection, the documents could look like:
{
    'userid': 'derickr',
    'realname': 'Derick Rethans',
    ...
}
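With the tweets in their own collection, group-level analysis becomes a single-collection operation. As a hedged sketch (my addition, using the field names from the documents above and the aggregation pipeline rather than the MapReduce the question mentions), counting hashtag usage across one group's tweets:

db.tweets.aggregate([
    // keep only tweets belonging to the group under analysis
    { $match: { somegroupid: 40 } },
    // emit one document per hashtag occurrence
    { $unwind: '$hashtags' },
    // count occurrences per hashtag, most used first
    { $group: { _id: '$hashtags', count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])

An index on somegroupid would keep the $match stage from scanning the whole collection.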

All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ
Chris Winslett @ MongoHQ
You will find this video interesting:
http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale
Essentially, in one document, store one day's tweets for one person. The reasoning: querying typically consists of days and users. Therefore, you can have the following index:

{user_id: 1, date: 1}  # Date needs to be last because you will range and sort on the date
Have fun!
Chris MongoHQ
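A hedged sketch of that one-user-one-day bucket pattern (the collection name, document shape, and upsert call are my assumptions, not from the talk):

// one bucket document per user per day, created on the first tweet of the day
db.tweet_buckets.updateOne(
    { user_id: 'derickr', date: ISODate('2012-02-20T00:00:00Z') },
    { $push: { tweets: { text: 'MongoDB is pretty sweet', at: new Date() } } },
    { upsert: true }
)

// the index the answer recommends: date last, so you can range and sort on it
db.tweet_buckets.createIndex({ user_id: 1, date: 1 })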
I think it makes the most sense to implement the following:
user
{
    user_id: 123123,
    screen_name: 'cledwyn',
    misc_bits: { ... },
    groups: ['123123_group_tall_people', '123123_group_techies'],
    groups_in: ['123123_group_tall_people']
}
tweet
{
    tweet_id: '98798798798987987987987',
    user_id: 123123,
    tweet_date: 20120220,
    text: 'MongoDB is pretty sweet',
    misc_bits: { ... },
    groups_in: ['123123_group_tall_people']
}
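A hedged usage note (my addition): since groups_in is an array, a multikey index on it makes the group-level pulls cheap:

// MongoDB indexes each element of the array (a multikey index)
db.tweets.createIndex({ groups_in: 1 })
// all tweets in one group, newest first
db.tweets.find({ groups_in: '123123_group_tall_people' }).sort({ tweet_date: -1 })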

Related

MongoDB $nin otherDocument.array without downloading otherDocument

I've been fighting with a query for a while now, in which I have to download a document from the database just to construct my query, and it's honestly slowing things down quite a bit as my service starts to get more and more requests. I was wondering if someone could help me optimize the query, or make it so I don't have to download the initial document.
I'm going to use Tinder as an example here, as it simulates the number of items that could potentially be in this array, and why having a "swipedBy: []" array on each client to remove the initial download seems like it would be inefficient: the array could end up being hundreds of thousands of elements long and will only grow over time.
So let's say I have a field in my user documents called "swipes", which is an array of Firebase ids (strings) of the users they have interacted with. An example can be found below:
{
    _id: 'firebaseUid',
    swipes: [ 'firebaseUid_1', 'firebaseUid_2', 'firebaseUid_3' ]
}
I have a query that is supposed to select a user from the database who is not already in MY swipes array; currently this is how I do it (JavaScript):
database.collection('users').findOne({ _id: myUserId }).then((document) => {
    const query = {
        ...,
        $and: [
            { _id: { $ne: myUserId } },
            { _id: { $nin: document.swipes } }
        ]
    };
});
This requires me to download the user document from the database and then pass the whole array back in as a query, which seems highly inefficient when we are talking about tens, if not hundreds, of thousands of array elements.
While the above query DOES work, I feel like there's a way this can be sped up, and my lack of knowledge of MongoDB is really hurting me here. I know for a fact I can do this in MySQL, but I don't know of any good (and affordable) MySQL services like mLab.
I should add: my MongoDB database is remote, so this document and its massive array are being downloaded (per request) to my Google Cloud Functions call and then sent back to the server over the network. That means the data has to be downloaded and then uploaded over the network again, and considering I'm charged by the millisecond, I'd like to minimize that.
You should refactor swipes out of the user document and into a separate collection, where each document points to the user who swiped and the user who was swiped. This would also let you store additional data, such as whether the swipe was left or right, a timestamp, and so on.
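A hedged sketch of that restructuring (the collection and field names are my assumptions): each swipe becomes its own small document, and the "not yet swiped" candidate can then be found server-side with a $lookup (MongoDB 3.6+) instead of shipping the whole array over the network:

// one document per swipe
db.swipes.insertOne({
    swiper: myUserId,
    swipee: otherUserId,
    direction: 'right',
    at: new Date()
})

// pick a candidate the current user has not swiped yet, entirely on the server
db.users.aggregate([
    { $match: { _id: { $ne: myUserId } } },
    { $lookup: {
        from: 'swipes',
        let: { candidate: '$_id' },
        pipeline: [
            { $match: { $expr: { $and: [
                { $eq: ['$swiper', myUserId] },
                { $eq: ['$swipee', '$$candidate'] }
            ] } } }
        ],
        as: 'alreadySwiped'
    } },
    { $match: { alreadySwiped: { $size: 0 } } },
    { $limit: 1 }
])

A compound index on { swiper: 1, swipee: 1 } keeps the $lookup stage cheap.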

MongoDB many-to-many with big relations

I've read a lot of documentation and examples here on Stack Overflow, but I'm not really sure about my conclusions, so this is why I'm asking for help.
Imagine we have a collection Films and a collection Users, and we want to know which users have seen a film, and which films a user has seen.
One way to design this in MongoDB is:
User:
{
    "name": "User1",
    "films": [filmId1, filmId2, filmId3, filmId4] // ObjectIds from Films
}
Film:
{
    "name": "The incredible MongoDb Developer",
    "watched_by": [userId1, userId2, userId3] // ObjectIds from User
}
Ok, this may work if the number of users/films is low, but if, for example, we expect that one film will be watched by 800k users, the size of the array will be close to 800k * 12 bytes ~ 9.5MB, which is near the 16MB maximum for a BSON document.
In this case, is there an approach other than the typical relational-world way, which is to create an intermediate collection for the relations?
Also, I don't know whether reading and parsing a JSON document of about 10MB will perform better than the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
Putting the films a user has seen into an array is a viable way, depending on your use cases. Especially if you want to have relations with attributes (say, date and place of viewing), doing updates and statistical analysis becomes less performant (you would need to $unwind your docs first, subsequent $matches become more costly, and so on).
If your relations have or may have attributes, I'd go with what you describe as the classical relational way, since it answers your most likely use cases as well as embedding does and, in my experience, allows for higher performance:
Given a collection with a structure like
{
    _id: someObjectId,
    date: ISODate("2016-05-05T03:42:00Z"),
    movie: "nameOfMovie",
    user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies has he seen in the last 3 months, in descending order of date?
db.views.aggregate([
    { $match: { user: userName, date: { $gte: threeMonthAgo } } },
    { $sort: { date: -1 } },
    { $group: { _id: "$user", viewed: { $push: { movie: "$movie", date: "$date" } } } }
])
or, if you are ok with an iterator, even easier with:
db.views.find({ user: username, date: { $gte: threeMonthAgo } }).sort({ date: -1 })
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
    { $match: {
        movie: movieName,
        date: {
            $gte: ISODate("2016-05-30T00:00:00"),
            $lt: ISODate("2016-05-31T00:00:00")
        }
    } },
    { $group: {
        _id: "$movie",
        views: { $sum: 1 }
    } }
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645
For a given movie, show all users who have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
One thing to note: since we used the usernames and movie names, respectively, we do not need a JOIN (or something similar), which should give us good performance. Plus, we do not have to do rather costly update operations when adding entries. Instead of an update, we simply insert the data.
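As a hedged addition (not part of the original answer), the sample queries above would typically be backed by indexes along these lines:

// user-centric queries: equality on user, range and sort on date
db.views.createIndex({ user: 1, date: -1 })
// movie-centric queries: equality on movie, range on date
db.views.createIndex({ movie: 1, date: 1 })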

Query too large issue with MongoDB

Let's say we have a collection of users, and each user can be followed by other users. If I want to find the users that are NOT following me, I need to do something like:
db.users.find({ _id: { $nin: followers_ids } });
If the number of followers_ids is huge, say 100k users, mongodb will start saying the query is too large; plus, sending that much data over the network just to make the query isn't good either. What are the best practices for accomplishing this query without sending all of these ids over the network?
I recommend that you limit the number of query results to reduce network demand. According to the docs:
MongoDB cursors return results in groups of multiple documents. If you know the number of results you want, you can reduce the demand on network resources by issuing the limit() method. This is typically used in conjunction with sort operations. For example, if you need only 50 results from your query to the users collection, you would issue the following command:
db.users.find({ _id: { $nin: followers_ids } }).sort({ timestamp: -1 }).limit(50)
You can then use the cursor to retrieve more user documents as needed.
Recommendation to Restructure Followers Schema
I would recommend that you restructure your user documents if the number of followers will grow large. Currently your user schema may look like this:
{
    _id: ObjectId("123"),
    username: "jobs",
    email: "stevej@apple.com",
    followers: [
        ObjectId("12345"),
        ObjectId("12375"),
        ObjectId("12395")
    ]
}
The good thing about this schema is that whenever the user does anything, all of the users you need to notify are right there inside the document. The downside is that if you need to find everyone a user is following, you will have to query the entire users collection. Also, your user document will become larger and more volatile as the followers grow.
You may want to further normalize your followers. You can keep a collection that matches followee to followers, with documents that look like this:
{
    _id: ObjectId("123"), // followee's "_id"
    followers: [
        ObjectId("12345"),
        ObjectId("12375"),
        ObjectId("12395")
    ]
}
This will keep your user documents slender, but will take an extra query to get the followers. As the "followers" array changes in size, you can enable the usePowerOf2Sizes allocation strategy to reduce fragmentation and moves.
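A hedged sketch of enabling that strategy (this applies to the old MMAPv1 storage engine; from MongoDB 3.0 on, power-of-2 allocation is the default and the flag is deprecated):

// enable power-of-2 record allocation for the followers collection
db.runCommand({ collMod: "followers", usePowerOf2Sizes: true })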

mongodb: Embedded only id or both id and name

I'm new to mongodb; please suggest how to design a correct schema for a situation like the one below:
I have a User collection and a Product collection. Product contains info like id, title, description, price... A User can bookmark or like a Product. Currently, in the User collection, I store one array for liked products and one array for bookmarked products. So when I need to view info about a user, I have to read out these two arrays and then search the Product collection to get the titles of the liked and bookmarked products.
// User collection
{
    _id: 12345,
    name: "John",
    liked: [123, 456, 789],
    bkmark: [123, 125]
}
// Product collection
{
    _id: 123,
    title: "computer",
    desc: "awesome computer",
    price: 12
}
Now I think I can speed this process up by embedding both the product id and title in the User collection, so that I don't have to search the Product collection, just read the arrays out and display them. But if I choose this way, whenever a Product's title gets updated, I have to search and update the User collection too. I can't evaluate the update cost of the second way, so I don't know which way is correct. Please help me choose between them.
Thanks & Regards.
You should consider what happens more often: a product gets renamed, or the information of a user is requested.
You should also consider which is the bigger problem: some time lag in which users see an outdated product name (we are talking about seconds, maybe minutes when you have a really large number of users), or an always longer response time when requesting a user profile.
Without knowing your actual usage patterns and requirements, I would guess that it's the latter in both cases, so you should rather optimize for that situation.
In general it is not recommended to normalize a MongoDB database as radically as you would normalize a relational database. The reason is that MongoDB cannot perform JOINs. So it's usually not such a bad idea to duplicate some relevant information in multiple documents, while accepting a higher cost for updates and a potential risk of inconsistencies.
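A hedged sketch of what the duplicated-title variant costs on a rename (this assumes the liked array is changed to hold { pid, title } pairs; arrayFilters requires MongoDB 3.6+):

// user documents now embed the title alongside the product id:
// liked: [ { pid: 123, title: "computer" }, ... ]

// on rename, fan the new title out to every user who liked the product
db.users.updateMany(
    { "liked.pid": 123 },
    { $set: { "liked.$[el].title": "laptop" } },
    { arrayFilters: [ { "el.pid": 123 } ] }
)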

How to get a distinct list of values and, below each value, all the documents of that value?

Getting my head around MongoDB document design and trying to figure out how to do something, or whether I'm barking up the wrong tree.
I'm creating a mini CMS. The site will contain either documents or urls that are grouped by a category. For example, there's a group called 'shop' that has a list of links to items on another site, and there's a category called 'art' that has a list of works of art, each of which has a title, summary and images for a slideshow.
So one possible way to do this would be to have a collection that would look something like:
[{
    category: 'Products',
    title: 'Thong',
    href: 'http://www.thongs.com'
}, {
    category: 'Products',
    title: 'Incredible Sulk',
    href: 'http://www.sulk.com'
}, {
    category: 'Art',
    title: 'Cool art',
    summary: 'This is a summary to display',
    images: [...]
}]
But, and here's the question... when I'm building the webpage, this structure isn't much use to me. The homepage contains lists of 'things' grouped by their category: lists, menus, stuff like that. To be able to do that easily, I need something that looks more like:
[
    { 'Products': [
        { title: 'thong', href: 'http://www.thongs.com' },
        { title: 'Incredible Sulk' }
    ] },
    { 'Art': [
        { title: 'Cool art', summary: 'This is a summary to display', images: [...] }
    ] }
]
So the question is, can I somehow do this transformation in MongoDB? If I can't, then is it bad to do this in my app-server layer (I'd get a grouped list of unique categories and then loop through them, querying Mongo for documents of each category)? I'm guessing the app-server layer is bad; after all, mongodb has it all in memory if I'm lucky. If neither of these is good, then am I doing it all wrong, and should I actually store the structure like this in the first place?
I need to make it easy for the user to create categories on the fly. I also have to consider what happens if they start to add lots of documents: I either need to restrict how many documents I pull back for each category, or somehow limit the fields returned, so that when I query mongodb it doesn't return a relatively big chunk of data, which is slow and wasteful, but instead returns the minimum I need to build the desired page.
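A hedged sketch of that per-category fetch (the collection name follows the answer below; the projection and limit values are my assumptions):

db.things.find(
    { category: 'Products' },      // one category at a time
    { _id: 0, title: 1, href: 1 }  // only the fields the template needs
).limit(10)                        // cap how many documents each category pulls back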
I figured out a group query that gives me almost the structure I want; close enough to use for templates.
db.things.group({
    key: { category: true },
    initial: { articles: [] },
    reduce: function(doc, aggregator) {
        aggregator.articles.push(doc);
    }
})
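A hedged addendum (my addition): group() was deprecated and eventually removed in MongoDB 4.2; the same shape can be produced with the aggregation pipeline:

db.things.aggregate([
    // one bucket per category, pushing each full document into it
    { $group: { _id: '$category', articles: { $push: '$$ROOT' } } }
])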