best possible schema design for log analysis database in mongodb - mongodb

i have to store the following data in mongodb uid, gender ,country, city, date_of_visit, url_of_visit
I would like to store uid, gender, country and city in one collection because these information will never change for particular user.
in the other collection i would like to store uid, date_of_visit, url_of_visit
i want to know which is best practice to store uid, date_of_visit and url_of_visit.there are two things in my mind..
(a) { uid: 100, date: xxxxxxxxxxxxxxx, url: abc.php }
{ uid: 100, date: xxxxxx, url: ref.php }
{ uid: 200, date: xxxxxxxxx, url: ref.php }
(b) { uid:100, visit:[{date:xxxxxxx, url:abc.php},
{date:xxxx, url:def.php},
{.........................}]}
i want to have following index date:1, uid:1 ,url:1 ...the problem with approach (a) is with each row inserted in database the database side and index size will grow and there will come a point when index size will not fit into RAM
problem with approach (b) is at some point each document will exceed the 16 MB limit and this approach will fail that time..
please suggest me what should be the best schema design for this scenario. i would also have the query which include uid, gender, country, date_of_visit, url_of_visit

I know this thread is a bit older but I'm wondering if you've decided on a structure and if it works well.
My idea was, instead of risking to create too large documents, to structure it similar to your second approach but include the date in the main collection. This way each document would be the user's activity within one day. It would be indexed by user and date, easy to update and query and keep things organized.
Something like:
{ uid:100, date:xxxxxxx, event:[{time:xxxxxxx, url:abc.php},
{time:xxxx, url:def.php},
{.........................}]}

I think the second approach is better than one because it corresponds to idea of grouping similar data together. About exceeding 16M of document you can reach this limit but he should be a very active user. :)
Also you can pull out some data to another collection and make reference using ObjectId or DBRef.
See more info http://www.mongodb.org/display/DOCS/Database+References#DatabaseReferences-DBRef

Your second approach will force you to fetch a huge amount of data from the embedded document, which cannot be filtered by Mongo. In other words, if you have a million documents stored inside the "event" field for a particular user, then when you fetch those embedded documents with dot notation, then the entire document including the parent will be returned. There's no way you can filter the results.
I would recommend the first approach which makes the data easier to retrieve and work with.

Related

how to join a collection and sort it, while limiting results in MongoDB

lets say I have 2 collections wherein each document may look like this:
Collection 1:
target:
_id,
comments:
[
{ _id,
message,
full_name
},
...
]
Collection 2:
user:
_id,
full_name,
username
I am paging through comments via $slice, let's say I take the first 25 entries.
From these entries I need the according usernames, which I receive from the second collection. What I want is to get the comments sorted by their reference username. The problem is I can't add the username to the comments because they may change often and if so, I would need to update all target documents, where the old username was in.
I can only imagine one way to solve this. Read out the entire full_names and query them in the user collection. The result would be sortable but it is not paged and so it takes a lot of resources to do that with large documents.
Is there anything I am missing with this problem?
Thanks in advance
If comments are an embedded array, you will have to do work on the client side to sort the comments array unless you store it in sorted order. Your application requirements for username force you to either read out all of the usernames of the users who commented to do the sort, or to store the username in the comments and have (much) more difficult and expensive updates.
Sorting and pagination don't work unless you can return the documents in sorted order. You should consider a different schema where comments form a separate collection so that you can return them in sorted order and paginate them. Store the username in each comment to facilitate the sort on the MongoDB side. Depending on your application's usage pattern this might work better for you.
It also seems strange to sort on usernames and expect/allow usernames to change frequently. If you could drop these requirements it'd make your life easier :D

mongodb: Embedded only id or both id and name

I'm new to mongodb, please suggest me how to correct design schema for situation like below:
I have User collection and Product collection. Product contain info like id, title, description, price... User can bookmark or like Product. Currently, in User collection, I'm store 1 array for liked products, and 1 array for bookmarked products. So when I need to view info about 1 user, I have to read out these 2 array, then search in Product collection to get title of liked and bookmarked products.
//User collection
{
_id : 12345,
name: "John",
liked: [123, 456, 789],
bkmark: [123, 125]
}
//Product collection
{
_id : 123,
title: "computer",
desc: "awesome computer",
price: 12
}
Now I think I can speed up this process by embedded both product id and title in User collection, so that I don't have to search in Product collection, just read it out and display. But if I choose this way, whenever Product's title get updated, I have to search and update in User collection too. I can't evaluate update cost in 2nd way, so I don't know which way is correct. Please help me to choose between them.
Thanks & Regards.
You should consider what happens more often: A product gets renamed or the information of a user is requested.
You should also consider what's a bigger problem: Some time lag in which users see an outdated product name (we are talking about seconds, maybe minutes when you have a really large number of users) or always a longer response time when requesting a user profile.
Without knowing your actual usage patterns and requirements, I would guess that it's the latter in both cases, so you should rather optimize for this situation.
In general it is not recommended to normalize a MongoDB as radical as you would normalize a relational database. The reason is that MongoDB can not perform JOINs. So it's usually not such a bad idea to duplicate some relevant information in multiple documents, while accepting a higher cost for updates and a potential risk of inconsistencies.

What is a good way to handle "complex" mongodb documents in meteor

I want to store "carpool_debts" which is basically going to hold the number of days owed to other users. It looks like this:
carpool_debts{
_id,
owner,
owner_id,
creditors:[{name,
id,
amount},
{name,
id,
amount}
]}
Does that data structure look reasonable for what I want to store? Also implementing that data structure seemed cumbersome to maintain. I found it cumbersome mainly because there isn't an upsert type of function available in meteor yet. Instead of creditors being a list of sub documents would I be better off storing the creditors as a delimited string? I would like to know if I am on the right path or if I am missing something? Thanks.
You can structure mongo documents just like you would in a relational database, for example, having separate collections for creditors and owners and using carpool_debts as a link table with the amount attached:
carpool_debts{
_id,
owner_id,
creditor_id,
amount}
creditors{
_id,
name}
owners{
_id,
name}
However, this is not using mongodb to its full potential. Especially if this is a database with masses of data, you may want to optimise it for the most used queries, otherwise it'll be slow. For example, to optimise for looking up an owner's debt, you can add the data needed right there in the owners collection, using sub documents for creditors, and sub documents again for individual debts, similar to what you've already done:
owners{
_id,
name,
creditors: {id,
name,
debts: {
amount,
due_date}
}
}
and similarly, add the debt information on the creditors collection if you often look up the outstanding debt of creditors:
creditors{
_id,
name,
debtors: {
owner_id,
owner_name,
debts: {
aount,
due_date
}
}
}
This way, you only need to look up one record to get all the information you need. Of course, there are catches. First of all, this is not very DRY, but that's intentional. But you have to remember to update the other table(s) when something changes. If you change the name of a creditor for example, you'll need to update every owner document that has debts with this creditor (make sure you index that). This of course makes updates much slower (and the database bigger), but if you don't update very often, and look up much more often, this is not going to be a problem.
Also if for example creditors can have thousands of individual outstanding debts, you may have to separate that into a link table, or rather, link collection, like this, so you don't exceed mongodb's maximum document size:
creditors{
_id,
name,
}
debtors: {
owner_id,
creditor_id,
debts: {
amount,
due_date
}
}
Then you have one document for each creditor-owner connection. This means more documents to look up when looking at a creditor, but still just one for looking up an owner.
This looks fine, but you could also consider separating creditors into its own collection and just storing an array of creditor_id's in the debts collection. That would reduce complexity and make finding and filtering information easier. And it would be more DRY since if there are multiple debts with the same creditor, you only have the creditor stored in a single place.
You could also consider just having each document in the debts collection be a single debt by an owner to a single creditor. Then you'd just have id, owner_id and creditor_id - like a link table in a relational database.

MongoDB - Query embbeded documents

I've a collection named Events. Each Eventdocument have a collection of Participants as embbeded documents.
Now is my question.. is there a way to query an Event and get all Participants thats ex. Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that document (and possibly the other if you want) and then when you query the collection, you can instruct the database to only return the "Over18" subdocument. The downside to this is that you store your participants in two different subdocuments and you will have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check on arbitrary ages (i.e. sometimes its 18 but sometimes its 21 or 25, etc) then this will not work.
Query the collection and retreive the Participants subdocument and then filter it in your application code. Despite what some people may believe, this isnt terrible because you dont want your database to be doing too much work all the time. Offloading the computations to your application could actually benefit your database because it now can spend more time querying and less time filtering. It leads to better scalability in the long run.
Short answer: no. I tried to do the same a couple of months back, but mongoDB does not support it (at least in version <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
{ $unwind: '$participants' },
{ $match: {'age': {$gte: 18}}},
{ $project: {participants: 1}
)
This will return a list of n documents where n is the number of participants > 18 where each entry looks like this (note that the "participants" array field now holds a single entry instead):
{
_id: objectIdOfTheEvent,
participants: { firstName: 'only one', lastName: 'participant'}
}
It could probably even be flattened on the server to return a list of participants. See the officcial documentation for more information.

MongoDB / NOSQL: Best approach to handling read/unread status on messages

Suppose you have a large number of users (M) and a large number of documents (N) and you want each user to be able to mark each document as read or unread (just like any email system). What's the best way to represent this in MongoDB? Or any other document database?
There are several questions on StackOverflow asking this question for relational databases but I didn't see any with recommendations for document databases:
What's the most efficient way to remember read/unread status across multiple items?
Implementing an efficient system of "unread comments" counters
Typically the answers involve a table listing everything a user has read: (i.e. tuples of user id, document id) with some possible optimizations for a cut off date allowing mark-all-as-read to wipe the database and start again knowing that anything prior to that date is 'read'.
So, MongoDB / NOSQL experts, what approaches have you seen in practice to this problem and how did they perform?
{
_id: messagePrefs_uniqueId,
type: 'prefs',
timestamp: unix_timestamp
ownerId: receipientId,
messageId: messageId,
read: true / false,
}
{
_id: message_uniqueId,
timestamp: unix_timestamp
type: 'message',
contents: 'this is the message',
senderId: senderId,
recipients: [receipientId1,receipientId2]
}
Say you have 3 messages you want to retrieve preferences for, you can get them via something like:
db.messages.find({
messageId : { $in : [messageId1,messageId2,messageId3]},
ownerId: receipientId,
type:'prefs'
})
If all you need is read/unread you could use this with MongoDB's upsert capabilities, so you are not creating prefs for each message unless the user actually reads it, then basically you create the prefs object with your own unique id and upsert it into MongoDB. If you want more flexibility(like say tags or folders) you'll probably want to make the pref for each recipient of the message. For example you could add:
tags: ['inbox','tech stuff']
to the prefs object and then to get all the prefs of all the messages tagged with 'tech stuff' you'd go something like:
db.messages.find({type: 'prefs', ownerId: recipientId, tags: 'tech stuff'})
You could then use the messageIds you find within the prefs to query and find all the messages that correspond:
db.messages.find((type:'message', _id: { $in : [array of messageIds from prefs]}})
It might be a little tricky if you want to do something like counting how many messages each 'tag' contains efficiently. If it's only a handful of tags you can just add .count() to the end of your query for each query. If it's hundreds or thousands then you might do better with a map/reduce server side script or maybe an object that keeps track of message counts per tag per user.
If you're only storing a simple boolean value, like read/unread, another method is to embedded an array in each Document that contains a list of the Users who have read it.
{
_id: 'document#42',
...
read_by: ['user#83', 'user#2702']
}
You should then be able to index that field, making for fast queries for Documents-read-by-User and Users-who-read-Document.
db.documents.find({read_by: 'user#83'})
db.documents.find({_id: 'document#42}, {read_by: 1})
However, I find that I'm usually querying for all Documents that have not been read by a particular User, and I can't think of any solution that can make use of the index in this case. I suspect it's not possible to make this fast without having both read_by and unread_by arrays, so that every User is included in every Document (or join table), but that would have a large storage cost.