mongodb limit in the embedded document - mongodb

I need to create a message system, where a person can have a conversation with many users.
For example, I start a conversation with user2, user3 and user4, and each of them can see the whole conversation. If the conversation is not private, any participant can add another person to it at any time.
Here is my idea of how to do this.
I am using Mongo, and my idea is to store each dialog as a document instead of each message.
The schema is as follows:
{
  _id : ...., // dialog id
  'private' : 0, // is the conversation private
  'participants' : [1, 3, 5, 6], // people who are in the conversation
  'msgs' : [
    {
      'mid' : ..., // id of a message
      'pid' : 1, // person who wrote the message
      'msg' : 'tafasd' // message text
    },
    ....
    {
      'mid' : ..., // id of a message
      'pid' : 1, // person who wrote the message
      'msg' : 'tafasd' // message text
    }
  ]
}
I can see some pros for this approach:
- in a big database it will be easy to find the messages for a particular conversation.
- it will be easy to add people to the conversation.
But here is a problem for which I can't find a solution:
when a conversation becomes too long (take Skype as an example), the client does not show the whole conversation; it shows the most recent part and loads additional messages afterwards.
In other situations skip and limit solve this, but how can I do it here?
If this is impossible, what would you suggest?

The MongoDB docs explain how to select a subrange of an array's elements with $slice.
db.dialogs.find({"_id": dialogId}, {msgs: {$slice: 5}}) // first 5 messages
db.dialogs.find({"_id": dialogId}, {msgs: {$slice: -5}}) // last 5 messages
db.dialogs.find({"_id": dialogId}, {msgs: {$slice: [20, 10]}}) // skip 20, limit 10
db.dialogs.find({"_id": dialogId}, {msgs: {$slice: [-20, 10]}}) // start 20 from the end, limit 10
You can use this technique to only select the messages that are relevant to your UI. However, I'm not sure that this is a good schema design. You may want to consider separating out "visible" messages from "archived" messages. It might make the querying a bit easier/faster.
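As a rough illustration of how the [skip, limit] form could drive "load older messages" paging, the sketch below builds the projection for a given page, counting pages from the newest messages. The helper name sliceProjection and the page-numbering convention are assumptions made for this example; they are not part of MongoDB.

```javascript
// Hypothetical helper (not a MongoDB API): builds a $slice projection for
// page `page` of the msgs array, where page 0 holds the newest messages.
function sliceProjection(page, pageSize) {
  // A negative skip counts from the end of the array.
  const skipFromEnd = -(page + 1) * pageSize;
  return { msgs: { $slice: [skipFromEnd, pageSize] } };
}

// Intended use in the mongo shell (dialogId is assumed to be defined):
//   db.dialogs.find({ _id: dialogId }, sliceProjection(1, 10))
// which asks for the 10 messages that precede the newest 10.
```

If the skip reaches past the start of the array, the slice should begin at the first element, so the oldest page may overlap the one before it; the application would have to account for that.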

There are caveats if your conversations will have very many messages:
- You will notice a significant performance drop when slicing the msgs array, because MongoDB loads the whole document and slices the array only just before returning it to the driver.
- There is a document size limit (16MB for now) that this approach could reach.
My suggestions are:
- Use two collections: one for conversations and one for messages.
- Use a dbref in each message pointing to its conversation (index this field together with the message timestamp to be able to select older ranges on user request).
- Additionally, use a separate capped collection for every conversation. It will be easy to find by name if you build the name like "conversation_"
Result:
- You will have to write every message twice, but into separate collections, which is normal.
- When you want to show a conversation, you just select all the data from one collection in natural sort order, which is very fast.
- Your capped collections will automatically keep the latest messages and delete old ones.
- You can show older messages on user request by querying the main messages collection.
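A minimal sketch of the "main messages collection" part of this suggestion, under stated assumptions: the field names conversationId and createdAt, the helper olderMessagesQuery, and the use of a plain id reference instead of a DBRef are all choices made for this example.

```javascript
// Assumed compound index on the messages collection (mongo shell):
//   db.messages.createIndex({ conversationId: 1, createdAt: -1 })

// Hypothetical helper: describes the query for one page of messages
// older than `before` in a conversation, newest first.
function olderMessagesQuery(conversationId, before, pageSize) {
  return {
    filter: { conversationId: conversationId, createdAt: { $lt: before } },
    sort: { createdAt: -1 },
    limit: pageSize
  };
}

// Intended use in the mongo shell (dialogId and oldestShownDate are assumed):
//   var q = olderMessagesQuery(dialogId, oldestShownDate, 20);
//   db.messages.find(q.filter).sort(q.sort).limit(q.limit)
```

Paging by the timestamp of the oldest message already shown, rather than by a skip count, keeps each request an index range scan no matter how far back the user scrolls.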

Related

Apply limit to each condition in an $or query?

I have a chatroom application written in Meteor, using MongoDB. Each chatroom contains many messages, and a user can join multiple chatrooms. I'd like to create a query that fetches the 200 most recent messages from each chatroom for all the chatrooms that a given user is in. I'm doing something like this:
// These are the ids of the chatrooms the user is currently in
var conditions = [{chatroomId: 1}, {chatroomId: 2}];
Messages.find({$or: conditions}, {sort: {createdAt: -1}, limit: 200});
However, naturally, this limit applies to the entire query; so the user might end up with 180 messages from one room and 20 from another. Worse, as new messages are added, it's inconsistent which room has old messages culled away.
I could increase the limit as the user joins more chatrooms, but that could lead to them having 5 messages in two chatrooms and 590 in the third.
Is there a way to apply a slice or limit to each condition in the or? I'm publishing this as a Meteor publication, so I need to return a single cursor.
You can't apply a limit to only a portion of the result set, so the only way to accomplish this is with multiple subscriptions. Here's some sample code using template subscriptions. You'll need to modify it for your specific use case.
server
// Returns a limited set of messages for one specific chat room.
Meteor.publish('messages', function(chatroomId, limit) {
  check(chatroomId, String);
  check(limit, Number);
  var options = {sort: {createdAt: -1}, limit: limit};
  return Messages.find({chatroomId: chatroomId}, options);
});
client
Template.myTemplate.helpers({
  messages: function() {
    // Read all of the messages (you may want to restrict these by room),
    // sort them, and return the first 100 across all.
    return _.chain(Messages.find().fetch())
      .sortBy(function(message){return -message.createdAt.getTime();})
      .first(100)
      .value();
  }
});
Template.myTemplate.onCreated(function() {
  // Subscribe to multiple chat rooms - here we are assuming the ids are
  // stored in the current context - modify this as needed for your use case.
  this.subscribe('messages', this.chatroomId1, 100);
  this.subscribe('messages', this.chatroomId2, 100);
});
I believe that simply having a dynamic number of .find()s, one for each room, would address the situation.
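To make the goal concrete, the following plain-JavaScript sketch shows the per-room result that one limited query per chatroom is meant to produce. It works on an in-memory array and is not Meteor or MongoDB code; the function name latestPerRoom is made up for this illustration.

```javascript
// Illustration only: keeps the `limit` newest messages of every chatroom
// from an in-memory array, mirroring what one limited query per room returns.
function latestPerRoom(messages, limit) {
  const byRoom = {};
  messages
    .slice() // copy, so the caller's array is not reordered
    .sort(function (a, b) { return b.createdAt - a.createdAt; }) // newest first
    .forEach(function (message) {
      const room = byRoom[message.chatroomId] || (byRoom[message.chatroomId] = []);
      if (room.length < limit) room.push(message);
    });
  return byRoom;
}
```

Unlike a single query with one global limit, a busy room cannot crowd out a quiet one here, because the cap is applied to each room separately.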

mongodb: Embedded only id or both id and name

I'm new to MongoDB; please suggest how to design the schema correctly for the situation below:
I have a User collection and a Product collection. A Product contains info like id, title, description, price... A User can bookmark or like a Product. Currently, in the User collection, I store one array for liked products and one array for bookmarked products. So when I need to view info about one user, I have to read these two arrays, then search the Product collection to get the titles of the liked and bookmarked products.
// User collection
{
  _id : 12345,
  name: "John",
  liked: [123, 456, 789],
  bkmark: [123, 125]
}
// Product collection
{
  _id : 123,
  title: "computer",
  desc: "awesome computer",
  price: 12
}
Now I think I can speed up this process by embedding both the product id and title in the User collection, so that I don't have to search the Product collection and can just read the data out and display it. But if I choose this way, whenever a Product's title gets updated, I have to find and update it in the User collection too. I can't evaluate the update cost of the second way, so I don't know which way is correct. Please help me choose between them.
Thanks & Regards.
You should consider what happens more often: A product gets renamed or the information of a user is requested.
You should also consider what's a bigger problem: Some time lag in which users see an outdated product name (we are talking about seconds, maybe minutes when you have a really large number of users) or always a longer response time when requesting a user profile.
Without knowing your actual usage patterns and requirements, I would guess that it's the latter in both cases, so you should rather optimize for this situation.
In general, it is not recommended to normalize a MongoDB database as radically as you would normalize a relational database. The reason is that MongoDB queries cannot perform JOINs the way SQL does (the $lookup aggregation stage, added in version 3.2, covers only some cases). So it's usually not such a bad idea to duplicate some relevant information in multiple documents, while accepting a higher cost for updates and a potential risk of inconsistencies.
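To give a feel for the update cost of the denormalized variant, here is a sketch of how a product rename could be pushed into the user documents. It assumes the liked and bkmark arrays hold embedded entries of the form { _id: productId, title: title }; that shape, the helper name titleUpdates and the collection name users are assumptions for this example.

```javascript
// Hypothetical helper: for a renamed product, returns one update description
// per denormalized array in the user documents.
// Assumes entries are embedded as { _id: <productId>, title: <title> }.
function titleUpdates(productId, newTitle) {
  return ['liked', 'bkmark'].map(function (field) {
    const filter = {};
    filter[field + '._id'] = productId;
    const set = {};
    set[field + '.$.title'] = newTitle; // $ = the array entry matched by the filter
    return { filter: filter, update: { $set: set } };
  });
}

// Intended use in the mongo shell:
//   titleUpdates(123, "laptop").forEach(function (u) {
//     db.users.update(u.filter, u.update, { multi: true });
//   });
```

A rename therefore costs one multi-document update per array, and each of those would benefit from an index on the embedded id (for example on liked._id).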

mongodb - add column to one collection find based on value in another collection

I have a posts collection which stores posts related info and author information. This is a nested tree.
Then I have a postrating collection which stores which user has rated a particular post up or down.
When a request is made to get the nested tree for a particular post, I also need to return whether the current user has voted and, if so, whether up or down, on each of the posts being returned.
In SQL this would be something like "posts.*, postrating.vote from posts join postrating on postID and postrating.memberID=currentUser".
I know MongoDB does not support joins. What are my options with MongoDB?
use map reduce - performance for a simple query?
in the post document store the ratings - BSON size limit?
Get list of all required posts. Get list of all votes by current user. Loop on posts and if user has voted add that to output?
Is there any other way? Can this be done using aggregation?
NOTE: I started on MongoDB last week.
In MongoDB, the simplest way is probably to handle this with application-side logic and not to try this in a single query. There are many ways to structure your data, but here's one possibility:
user_document = {
name : "User1",
postsIhaveLiked : [ "post1", "post2" ... ]
}
post_document = {
postID : "post1",
content : "my awesome blog post"
}
With this structure, you would first query for the user's user_document. Then, for each post returned, you could check if the post's postID is in that user's "postsIhaveLiked" list.
The main idea with this is that you get your data in two steps, not one. This is different from a join, but based on the same underlying idea of using one key (in this case, the postID) to relate two different pieces of data.
In general, try to avoid using map-reduce for performance reasons. And for this simple use case, aggregation is not what you want.
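The two-step approach described above can be sketched as a small application-side function. It assumes both query results are already loaded as plain objects shaped like user_document and post_document; the function name annotatePosts and the output field likedByCurrentUser are made up for this example.

```javascript
// Marks each post with whether its postID appears in the user's liked list.
// Both arguments are plain objects already fetched by two separate queries.
function annotatePosts(posts, userDocument) {
  const liked = new Set(userDocument.postsIhaveLiked);
  return posts.map(function (post) {
    // Copy the post so the fetched objects are left unchanged.
    return Object.assign({}, post, { likedByCurrentUser: liked.has(post.postID) });
  });
}
```

Building the Set once keeps the check cheap per post, so the cost grows with the number of posts plus the number of likes rather than their product.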

MongoDB data structure with large number internal documents

I am relatively new to MongoDB, and so far am really impressed. I am struggling with the best way to setup my document stores though. I am trying to do some summary analytics using twitter data and I am not sure whether to put the tweets into the user document, or to keep those as a separate collection. It seems like putting the tweets inside the user model would quickly hit the limit with regards to size. If that is the case then what is a good way to be able to run MapReduce across a group of user's tweets?
I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.
As I am sure you are all bored of hearing, I am used to RDB land where I would lay out my schema like
| USER |
--------
|ID
|Name
|Etc.
|TWEET__|
---------
|ID
|UserID
|Etc
It seems like the logical schema in Mongo would be
User
|-Tweet (0..3000)
|-Entities
|-Hashtags (0..10+)
|-urls (0..5)
|-user_mentions (0..12)
|-GeoData (0..20)
|-somegroupID
but wouldn't that quickly bloat the User document beyond capacity? I would still like to run analysis on tweets belonging to users with a similar somegroupID. It conceptually makes sense to lay out the model as above, but at what point does that become too unwieldy? And what are viable alternatives?
You're right that you'll probably run into the 16MB MongoDB document limit here. You are not saying what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
Instead of putting your tweets in a user, you can of course quite easily do the opposite: add a user id and group id to the tweet documents themselves. Then, if you need additional fields from the user, you can always pull them in with a second query upon display.
I mean a design for a tweet document as:
{
'hashtags': [ '#foo', '#bar' ],
'urls': [ "http://url1.example.com", 'http://url2.example.com' ],
'user_mentions' : [ 'queen_uk' ],
'geodata': { ... },
'userid': 'derickr',
'somegroupid' : 40
}
And then for a user collection, the documents could look like:
{
'userid' : 'derickr',
'realname' : 'Derick Rethans',
...
}
All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ
Chris Winslett @ MongoHQ
You will find this video interesting:
http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale
Essentially, in one document, store one day of tweets for one person. The reasoning:
Querying typically consists of days and users.
Therefore, you can have the following index:
{user_id: 1, date: 1} # Date needs to be last because you will range and sort on the date
Have fun!
Chris MongoHQ
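One way the day-per-document idea could look in practice is an upsert that appends each incoming tweet to the bucket for its user and day. This is only a sketch: the tweet field created_at, the array field tweets, the string date format, the helper name dailyBucketUpsert and the collection name tweet_days are all assumptions, not something stated in the quoted reply.

```javascript
// Hypothetical helper: describes an upsert that appends a tweet to the
// bucket document for its user and calendar day (UTC).
function dailyBucketUpsert(userId, tweet) {
  // "YYYY-MM-DD" strings sort and range-compare in date order.
  const day = tweet.created_at.toISOString().slice(0, 10);
  return {
    filter: { user_id: userId, date: day },
    update: { $push: { tweets: tweet } },
    options: { upsert: true } // create the day's bucket on its first tweet
  };
}

// Intended use in the mongo shell:
//   var u = dailyBucketUpsert(123123, tweet);
//   db.tweet_days.update(u.filter, u.update, u.options)
```

The filter uses exactly the two fields of the {user_id: 1, date: 1} index mentioned above, so both the upsert and later per-user date-range reads can use it.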
I think it makes the most sense to implement the following:
user
{ user_id: 123123,
  screen_name: 'cledwyn',
  misc_bits: {...},
  groups: ['123123_group_tall_people', '123123_group_techies'],
  groups_in: ['123123_group_tall_people']
}
tweet
{ tweet_id: 98798798798987987987987,
  user_id: 123123,
  tweet_date: 20120220,
  text: 'MongoDB is pretty sweet',
  misc_bits: {...},
  groups_in: ['123123_group_tall_people']
}

MongoDB: Calling Count() vs tracking counts in a collection

I am moving our messaging system to MongoDB and am curious what approach to take with respect to various stats, like the number of messages per user. In our MS SQL database I have a table that holds different counts per user; they get updated by triggers on the corresponding tables, so I can, for example, know how many unread messages UserA has without calling an expensive SELECT Count(*) operation.
Is the count function in MongoDB also expensive?
I started reading about map/reduce, but my site is under high load, so statistics have to update in real time, and my understanding is that map/reduce is a time-consuming operation.
What would be the best (performance-wise) approach on gathering various aggregate counts in MongoDB?
If you've got a lot of data, then I'd stick with the same approach and increment an aggregate counter whenever a new message is added for a user, using a collection something like this:
counts
{
userid: 123,
messages: 10
}
Unfortunately (or fortunately?) there are no triggers in MongoDB, so you'd increment the counter from your application logic:
db.counts.update( { userid: 123 }, { $inc: { messages: 1 } } )
This'll give you the best performance, and you'd probably also put an index on the userid field for fast lookups:
db.counts.createIndex( { userid: 1 } ) // ensureIndex is the older, deprecated name for this call
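A small sketch of how the increment might be wrapped so that a user's counter document is created on the first message instead of having to exist beforehand. The helper name incrementMessageCount is an assumption for this example; $inc and the upsert option are standard MongoDB features.

```javascript
// Hypothetical helper: arguments for adjusting a user's message counter.
// upsert: true creates the counter document on the user's first message.
function incrementMessageCount(userId, by) {
  return {
    filter: { userid: userId },
    update: { $inc: { messages: by } },
    options: { upsert: true }
  };
}

// Intended use in the mongo shell:
//   var c = incrementMessageCount(123, 1);
//   db.counts.update(c.filter, c.update, c.options)
// A negative `by` decrements, for example when a message is deleted.
```

Because $inc is applied atomically on the server, concurrent writers do not need to read the current value first.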
MongoDB is a good fit for data denormalization. And if your site is under high load, you need to precalculate almost everything, so use $inc to increment the message count, no doubt.