MongoDB / NOSQL: Best approach to handling read/unread status on messages

Suppose you have a large number of users (M) and a large number of documents (N) and you want each user to be able to mark each document as read or unread (just like any email system). What's the best way to represent this in MongoDB? Or any other document database?
There are several questions on StackOverflow asking this question for relational databases but I didn't see any with recommendations for document databases:
What's the most efficient way to remember read/unread status across multiple items?
Implementing an efficient system of "unread comments" counters
Typically the answers involve a table listing everything a user has read (i.e. tuples of user id, document id), with a possible optimization of a cut-off date, allowing mark-all-as-read to wipe the table and start again, knowing that anything prior to that date counts as 'read'.
So, MongoDB / NOSQL experts, what approaches have you seen in practice to this problem and how did they perform?

{
    _id: messagePrefs_uniqueId,
    type: 'prefs',
    timestamp: unix_timestamp,
    ownerId: recipientId,
    messageId: messageId,
    read: true / false
}
{
    _id: message_uniqueId,
    timestamp: unix_timestamp,
    type: 'message',
    contents: 'this is the message',
    senderId: senderId,
    recipients: [recipientId1, recipientId2]
}
Say you have 3 messages you want to retrieve preferences for; you can get them via something like:
db.messages.find({
    messageId: { $in: [messageId1, messageId2, messageId3] },
    ownerId: recipientId,
    type: 'prefs'
})
If all you need is read/unread, you can combine this with MongoDB's upsert capability, so you are not creating prefs for each message unless the user actually reads it: you create the prefs object with your own unique id and upsert it into MongoDB (see the sketch at the end of this answer). If you want more flexibility (say, tags or folders), you'll probably want to create a pref for each recipient of the message. For example, you could add:
tags: ['inbox','tech stuff']
to the prefs object; then, to get the prefs of all messages tagged 'tech stuff', you'd run something like:
db.messages.find({type: 'prefs', ownerId: recipientId, tags: 'tech stuff'})
You could then use the messageIds you find within the prefs to query and find all the messages that correspond:
db.messages.find({ type: 'message', _id: { $in: [array of messageIds from prefs] } })
It might get a little tricky if you want to do something like efficiently counting how many messages each 'tag' contains. If it's only a handful of tags, you can just add .count() to the end of each query. If it's hundreds or thousands, you might do better with a server-side map/reduce script, or with an object that keeps track of message counts per tag per user.
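For illustration, the mark-as-read upsert described above might look something like this (a rough sketch; the _id scheme combining owner and message is an assumption, not something the answer specifies):
db.messages.update(
    // one prefs doc per (owner, message) pair; the key format is hypothetical
    { _id: 'prefs:' + recipientId + ':' + messageId },
    {
        $set: {
            type: 'prefs',
            ownerId: recipientId,
            messageId: messageId,
            read: true,
            timestamp: Math.floor(Date.now() / 1000)
        }
    },
    { upsert: true }
)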

If you're only storing a simple boolean value, like read/unread, another method is to embed an array in each Document containing a list of the Users who have read it.
{
    _id: 'document#42',
    ...
    read_by: ['user#83', 'user#2702']
}
You should then be able to index that field, making for fast queries for Documents-read-by-User and Users-who-read-Document.
db.documents.find({read_by: 'user#83'})
db.documents.find({_id: 'document#42'}, {read_by: 1})
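For reference, the index suggested above could be created like so (a sketch; MongoDB builds a multikey index over the array's elements, which serves both queries):
// index the read_by array; one multikey index covers both lookups
db.documents.ensureIndex({ read_by: 1 })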
However, I find that I'm usually querying for all Documents that have not been read by a particular User, and I can't think of any solution that can make use of the index in this case. I suspect it's not possible to make this fast without having both read_by and unread_by arrays, so that every User is included in every Document (or join table), but that would have a large storage cost.

Related

query too large issue with mongodb

Let's say we have a collection of users, and each user can be followed by other users. If I want to find the users that are NOT following me, I need to do something like:
db.users.find({ _id: { $nin: followers_ids } });
If the number of followers_ids is huge, say 100k users, MongoDB will start saying the query is too large; besides, sending that much data over the network just to run the query isn't good either. What are the best practices for accomplishing this query without sending all these ids over the network?
I recommend that you limit the number of query results to reduce network demand. According to the docs:
MongoDB cursors return results in groups of multiple documents. If you know the number of results you want, you can reduce the demand on network resources by issuing the limit() method.
This is typically used in conjunction with sort operations. For example, if you need only 50 results from your query to the users collection, you would issue the following command:
db.users.find({ _id: { $nin: followers_ids } }).sort({ timestamp: -1 }).limit(50)
You can then use the cursor to retrieve more user documents as needed.
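One simple way to retrieve those further results is skip/limit paging, e.g. (a sketch, and not the most scalable pagination technique for deep pages):
// second page of 50 results
db.users.find({ _id: { $nin: followers_ids } })
    .sort({ timestamp: -1 })
    .skip(50)
    .limit(50)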
Recommendation to Restructure Followers Schema
I would recommend that you restructure your user documents if the number of followers will grow large. Currently your user schema may look like this:
{
    _id: ObjectId("123"),
    username: "jobs",
    email: "stevej@apple.com",
    followers: [
        ObjectId("12345"),
        ObjectId("12375"),
        ObjectId("12395")
    ]
}
The good thing about this schema is that whenever a user does anything, all of the users you need to notify are right there inside the document. The downside is that if you need to find everyone a user is following, you will have to query the entire users collection. Also, your user document becomes larger and more volatile as the followers grow.
You may want to further normalize your followers. You can keep a collection that matches followee to followers with documents that look like this:
{
    _id: ObjectId("123"), // Followee's "_id"
    followers: [
        ObjectId("12345"),
        ObjectId("12375"),
        ObjectId("12395")
    ]
}
This will keep your user documents slender, but will cost an extra query to get the followers. As the "followers" array changes in size, you can enable the usePowerOf2Sizes allocation strategy to reduce fragmentation and moves.
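Enabling that strategy is a one-off command per collection; a sketch, assuming the new collection is named 'followers':
// usePowerOf2Sizes pads record allocations so growing arrays can often
// be updated in place instead of moved (MongoDB 2.x allocation strategy)
db.runCommand({ collMod: 'followers', usePowerOf2Sizes: true })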

MongoDB document model size/performance limits? A collection with an object that possibly houses 100k+ names?

I'm trying to build an event website that will host videos and such. I've set up a collection with the event name, event description, and an object holding some friendly info about the people "attending". If things go well, there might be 100-200k people attending, and those people should have access to whoever else is in the event (clicking on the friendly name looks up the user's id and subsequently their full profile). Is that asking too much of Mongo? Or is there a better way to go about something like this? It seems like it could get rather large rather quickly.
{
    _id : ...., // event Id
    'name' : // event name
    'description' : // event description
    'attendees' : [
        {'username': user's friendly name, 'avatarlink': avatar url},
        {'username': user's friendly name, 'avatarlink': avatar url},
        {'username': user's friendly name, 'avatarlink': avatar url},
        {'username': user's friendly name, 'avatarlink': avatar url}
    ]
}
Thanks for the suggestions!
In MongoDB, for many-to-many (or one-to-many) modeling in general, you should take a different approach depending on whether the "many" are few (up to a few dozen, usually) or "really" many, as in your case.
It will be better for you not to use embedding here, and to normalize instead. If you embed users in your events collection, adding attendees to an event will grow the array. Since documents are updated in place, once a document no longer fits in its allocated space on disk it has to be moved, a very expensive operation that also causes fragmentation. There are a few techniques to deal with moves, but none is ideal.
Having an array of ObjectIds as attendees is better, in that documents grow far less dramatically, but it still poses a few problems. How will you find all the events a user has participated in? You can have a multikey index on attendees, but once a document moves, the index has to be updated for each user entry (the index contains a pointer to the document's place on disk). In your case, with up to 200K users planned, that will be very painful.
Embedding is a very cool feature of MongoDB, or of any other document-oriented database, but it's naive to think it never comes at a price.
I think you should really rethink your schema: have an events collection, a users collection, and a user_event collection with a structure similar to this:
{
    _id : ObjectId(),
    user_id : ObjectId(),
    event_id : ObjectId()
}
Normalization is not a dirty word
Perhaps you should consider modeling your data in two collections, with the attendees field in an event document being an array of user ids.
Here's a sample of the schema:
db.events
{
    _id : ...., // event Id
    'name' : // event name
    'description' : // event description
    'attendees' : [ObjectId('userId1'), ObjectId('userId2') ...]
}
db.users
{
    _id : ObjectId('userId1'),
    username: 'user friendly name',
    avatarLink: 'url to avatar'
}
Then you could do two separate queries:
db.events.find({_id: ObjectId('eventId')});
db.users.find({_id: {$in: [ObjectId('userId1'), ObjectId('userId2')]}});

best possible schema design for log analysis database in mongodb

I have to store the following data in MongoDB: uid, gender, country, city, date_of_visit, url_of_visit.
I would like to store uid, gender, country and city in one collection, because this information will never change for a particular user.
In the other collection I would like to store uid, date_of_visit, url_of_visit.
I want to know the best practice for storing uid, date_of_visit and url_of_visit. There are two things in my mind:
(a) { uid: 100, date: xxxxxxxxxxxxxxx, url: abc.php }
    { uid: 100, date: xxxxxx, url: ref.php }
    { uid: 200, date: xxxxxxxxx, url: ref.php }
(b) { uid: 100, visit: [{date: xxxxxxx, url: abc.php},
                        {date: xxxx, url: def.php},
                        {.........................}]}
I want to have the following index: date:1, uid:1, url:1. The problem with approach (a) is that with each row inserted the database size and index size will grow, and there will come a point when the index no longer fits in RAM.
The problem with approach (b) is that at some point each document will exceed the 16 MB limit, and the approach will fail then.
Please suggest the best schema design for this scenario. I would also have queries involving uid, gender, country, date_of_visit and url_of_visit.
I know this thread is a bit old, but I'm wondering if you've decided on a structure and whether it works well.
My idea was, instead of risking overly large documents, to structure it similarly to your second approach, but to include the date in the main document. Each document then holds one user's activity within a single day. It is indexed by user and date, easy to update and query, and keeps things organized.
Something like:
{ uid: 100, date: xxxxxxx, event: [{time: xxxxxxx, url: abc.php},
                                   {time: xxxx, url: def.php},
                                   {.........................}]}
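Appending a visit to such a per-user, per-day bucket could then be a single upsert, sketched below (the collection name 'visits' and the truncate-to-midnight date are assumptions):
// upsert the day's bucket for this user and push the new visit onto it
var day = new Date(); day.setHours(0, 0, 0, 0);
db.visits.update(
    { uid: 100, date: day },
    { $push: { event: { time: new Date(), url: 'abc.php' } } },
    { upsert: true }
)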
I think the second approach is better than the first because it corresponds to the idea of grouping similar data together. As for exceeding the 16 MB document limit: it can be reached, but the user would have to be very active. :)
You can also pull some data out into another collection and reference it using an ObjectId or a DBRef.
For more info see http://www.mongodb.org/display/DOCS/Database+References#DatabaseReferences-DBRef
Your second approach will force you to fetch a huge amount of data from the embedded document, which cannot be filtered by Mongo. In other words, if you have a million entries stored inside the "event" field for a particular user, then when you fetch those embedded documents with dot notation, the entire document, including the parent, is returned. There is no way to filter the results.
I would recommend the first approach, which makes the data easier to retrieve and work with.
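Under approach (a), the index mentioned in the question and a query that can use its prefix might look like this (a sketch; 'visits' is a hypothetical collection name and start/end are placeholder dates):
// the compound index from the question
db.visits.ensureIndex({ date: 1, uid: 1, url: 1 })
// one user's visits in a date range, scanning only that range of the index
db.visits.find({ date: { $gte: start, $lt: end }, uid: 100 })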

Many to many in MongoDB

I decided to give MongoDB a try and see how well we get along. I do have some questions though.
Premise
I have users (id, name, address, password, email, etc.)
I have stamps (id, type, value, price, etc.)
Users browse through a stamp archive and filter it in various ways (pagination, filter by price, type, name, etc.), select a stamp, then add it to their collection.
Users can add more than one stamp to their collection (one piece of mint and one used, or just two pieces of used).
Users can flag some of their stamps for sale or trade, and perhaps specify a price.
So far
Here's what I have so far:
{
    _id : objectid,
    Name: "bob",
    Email: "bob@bob.com",
    ...
    Stamps: [stampid-1, stampid-543,...,stampid-23]
}
Questions
How should I add the state of the owned stamp, the quantity and condition?
What would be some sample queries for the situations described earlier?
As far as I know, ensureIndex makes it so you reduce the number of "scanned" entries.
The accepted answer here keeps changing the index. Is that just for the purpose of explaining it, or is that the way to do it? I mean, it does make sense somehow, but I keep thinking of it in SQL terms and... it does not make ANY sense...
The only change I would make is to how you store the stamps that a user owns. I would store an array of objects representing the stamps, duplicating the values that are accessed most often.
For example something like that :
{
    _id : objectid,
    Name: "bob",
    Email: "bob@bob.com",
    ...
    Stamps : [
        {
            _id: id,
            type: 'type',
            price: 20,
            forSale: true/false,
            quantity: 2
        },
        {
            _id: id2,
            type: 'type2',
            price: 5,
            forSale: false,
            quantity: 10
        }
    ]
}
You can see that some data is duplicated between the stamps collection and the Stamps array in the user collection. You do that for the properties you access most often; otherwise you would have to do a findOne for each stamp, and in MongoDB it is better to read the data directly than to do that. This way you can also add other properties such as quantity and forSale.
The goal of the duplication here is to avoid running a query for each stamp in the array.
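For illustration, with this schema a user's for-sale stamps come out of a single document read, sketched here with the filter done in shell-side JavaScript (userId is a placeholder):
// fetch the embedded array once, then filter in the client
var user = db.users.findOne({ _id: userId }, { Stamps: 1 })
var forSale = user.Stamps.filter(function (s) { return s.forSale })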
Here is a video that discusses MongoDB design and also explains what I tried to explain here:
http://lacantine.ubicast.eu/videos/3-mongodb-deployment-strategies/
Coming from a SQL background, I'm struggling with NoSQL too. It seems to me that a lot hinges on how unchanging particular types of data may or may not be. One thing that puzzles me in RDBMS systems is why it is not possible to mark a particular column/field as "immutable". If you know a field is immutable (or nearly so), it seems to me more acceptable to duplicate the info in a NoSQL context. Is it complete heresy to suggest that in many contexts you might actually want a combination of SQL and NoSQL structures?

MongoDB - Query embedded documents

I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now my question: is there a way to query an Event and get only the Participants with, e.g., Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside the event document called "Over18" or something. Insert them into that document (and possibly the other as well, if you want), and when you query the collection, instruct the database to return only the "Over18" subdocument. The downside is that you store your participants in two different subdocuments and you have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check arbitrary ages (i.e. sometimes it's 18, but sometimes it's 21 or 25, etc.), this will not work.
Query the collection, retrieve the Participants subdocument, and then filter it in your application code (a sketch follows below). Despite what some people may believe, this isn't terrible, because you don't want your database to be doing too much work all the time. Offloading the computation to your application can actually benefit your database: it can now spend more time querying and less time filtering, which leads to better scalability in the long run.
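A sketch of that second option, assuming the field names from the question (Participants, Age) and a placeholder eventId:
// fetch only the embedded Participants array, then filter in application code
var event = db.events.findOne({ _id: eventId }, { Participants: 1 })
var adults = event.Participants.filter(function (p) { return p.Age > 18 })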
Short answer: no. I tried to do the same a couple of months back, but MongoDB does not support it (at least in versions <= 1.8). The same question has been asked in their Google Group, for sure. You can either store the participants as a separate collection or fetch the whole documents and filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
    { $unwind: '$participants' },
    { $match: { 'participants.age': { $gte: 18 } } },
    { $project: { participants: 1 } }
)
This will return a list of n documents, where n is the number of participants aged 18 or over, and each entry looks like this (note that the "participants" field now holds a single embedded document instead of an array):
{
    _id: objectIdOfTheEvent,
    participants: { firstName: 'only one', lastName: 'participant' }
}
It could probably even be flattened on the server to return a plain list of participants. See the official documentation for more information.
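For instance, the flattening could be done by reshaping in the final $project (a sketch along the same 2.2 pipeline; the participant field names are assumptions):
db.events.aggregate(
    { $unwind: '$participants' },
    { $match: { 'participants.age': { $gte: 18 } } },
    // promote the embedded fields so each result is a bare participant
    { $project: { _id: 0,
                  firstName: '$participants.firstName',
                  lastName: '$participants.lastName' } }
)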