mongodb user logs storage best practice - mongodb

I'm new to mongoDB and trying to figure out what would be the best way to store user logs. I identified two main solutions, but can't figure out which might be the best. If others come to mind, please feel free to share them ;)
1)The first is about storing a log in all the collections that I have. For instance, If I have the 'post', 'friends', 'sports' and 'music' collections, then I could create a log field in each document in each collection with all the logging info that I want to store.
2)The second way is to create an entire 'log' collection, each document having a type ('post', 'friends' ...) to identify the kind of log I'm storing along with the id of the document that is refered to.
What I really need is to be able to store and retrieve data (that is, everything but logs) as fast as possible. (so if I go with (1), I would have to always remove the logs from my selection queries since they would be useless most of the time)
Logs will only be accessed periodicaly (for reporting and stats mostly), yet will require to be mapped to their initial document (in case of (2)).
I will be creating logs for almost all the non log data to store (so storing logs inside each collection might be faster : one insert vs two).
Logging could also be done asynchronously to ease the load on the server.
With all that in mind, I can't really manage to find which is the best for my needs. Would anyone have any idea / comments to share ?
Thanks a lot !

How you want to access your logs will play a big part in your design decision. By what criteria will you access your log documents? Will you have to query by type (e.g. post, friends) AND id (object id of the document)? Is there some other identifying feature? This can be extra overhead as you would have to read your 'type' collection first, get the id you're after, then query your logs collection. This creates a lot more read overhead.
What I would recommend is a separate logs collection as this keeps all related data in the one place. Then, for each log document, have a 1:1 mapping between document ids for your type collections, and your log collection. e.g. If you have a friend document, use the friend document's _id field as the _id field for your document in your logs collection. That way you can directly look up your log document without a second read. If you will have multiple log records for each type document, use an array in the log document and append each log record to it using mongo's $push. This would be a very efficient log architecture in terms of storage, write ($push requires no read - 'set and forget') and lookup time (smart 1:1 mapping - no more than one query needed if you have the _id).

Related

Is a good idea to store chat messages in a mongodb collection?

I'm developing a chat app with node.js, redis, socket.io and mongodb. MongoDB comes the last and for persisting the messages.
My question is what would be the best approach for this last step?
I'm afraid a collection with all the messages like
{
id,
from,
to,
datetime,
message
}
can get too big too soon, and is going to get very slow for reading purposes, what do you think?
Is there a better approach you already worked with?
In MongoDB, you store your data in the format you will want to read them later.
If what you read from the database is a list of messages filtered on the 'to' field and with a dynamic datetime filter, then this schema is the perfect fit.
Don't forget to add an index on the fields you will be querying on, then it will be reasonable fast to query them, even over millions of records.
If you would, for example, always show a full history of a full day, you would store all messages for a single day in one document. If both types of queries occur a lot, you would even store your messages in both formats.
If storage is an issue, you could also use capped collection, which will automatically delete messages of e.g. over 1 year old.
I think the db structure is fine, the way you mentioned in your question.
You may assign some unique id for chat between each pair and keep it in each record of chat. Retrieve based on that when you want to show it.
Say 12 is the unique id for chat between A and B, retrieve should be based on 12 when you want to show chat for A and B.
So your db structure can be like:-
{
id,
from,
to,
datetime,
message,
uid
}
Remember, you can optimize your retrieve, if you will give some limit(say 100 at a time) for retrieve. If user is scrolling beyond 100 retrieve more 100 chats. Which will solve lots of retrieve.
When using limit, retrieve based on date created and use sort with find query as well.
Just a thought here, are the messages plain text or are you allowed to share images and videos as well ?
If it's the latter then storing all the chats for a single day in one collection might not work out.
Actually if you have images and videos shares allowed then you need to take into account the. 16mb document restriction as well.

mongo db design of following and feeds, where should I embed?

I have a basic question about where I should embed a collection of followers/following in a mongo db. It makes sense to have an embedded collection of following in a user object, but does it also make sense to also embed the converse followers collection as well? That would mean I would have to update and embed in the profile record of both the:
following embedded list in the follower
And the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth it embedding in both entities or should I just update #1, embed following in the follower's profile and, put an index on it so that I can query for the converse- followers across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId ?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers shows up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, and so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially is documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split out following/followed-by relationship into a separate collection of records each having two fields, e.g., { _id : , oid : }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
You can embed them all but create a new document when you reach a certain limit. For example you can limit a document to an array of 500 elements then create a new one. Also, if it is about feed, when viewed, you dont have to keep the viewed publications you can replace by new ones so you don't have to create new document for additional publication storage.
To maintain your performance, I'd advice you to make a collection that can use graphlookup aggregation, where you store your following. Being followed can reach millions of followers, so you have to store what pwople follow instead of who follows them.
I hope it helps you.

Moving messaging schema to MongoDB

I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to Message table; a record is added to MessageSent table and a record per recipient is added to MessageInbox table. MessageCount is being used to keep track of number of messages in the inbox/send folders and is filled using insert/delete triggers on MessageInbox/MessageSent - this way I can always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query member's messages, I join to Member table to get member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what should be the collection schema. Because there are no joins available in MongoDB, I have to completely denormalize it, so I woudl have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as sender in some messages, as a part of Recipients in other messages - how do I update it in optimal ways?
How do I get message counts? There will be tons of requests at the same time, so it has to be performing well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(Im not suggesting this be your schema, just using it as an example of my approach)
{
_id: "msg1234",
from: "user1234",
to: "user5678",
subject: "This is the subject",
body: "This is the body"
}
I would query the database to get all the messages I need then in my application I would iterate the results and build an array of user IDs. I would filter this array to be unique and then query the database a second time using the $in operator to find any user in the given array.
Then in my application, I would join the results back to the object.
It requires two queries to the database (or potentially more if you want to join other collections) but this illustrates something that many people have been advocating for a long time: Do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers quicker and cheaper than your database anyway.
I am using this pattern to create real time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change like user information because when writing to the database, MongoDB may need to re-write the entire object if the new data doesnt fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, then it would be a disaster.
Additionally, writes on MongoDB are blocking so if a scenario like I've just described were to happen, all reads and writes would be blocked until the write operation is complete. I believe this is scheduled to be addressed in some capacity for the 2.x series but its still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but am having hard times thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments in a blog entry can be embedded, because usually they're bound to the entry itself and I can't think about a useful query made globally on all comments. On the other side, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example making a list of trending topics).
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
The beauty of mongodb and other "NoSQL" product is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So to answer your two questions.
1 - If you create multiple documents, you'll need make two calls to the DB. Not saying it's a bad thing but if you can throw everything into one document, why not? I recall when I used to use MySQL, I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
The schema design in Document-oriented DBs can seems difficult at first, but building my startup with Symfony2 and MongoDB I've found that the 80% of the time is just like with a relational DB.
At first, think it like a normal db:
To start, just create your schema as you would with a relational Db:
Each Entity should have his own Collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clearness (nested documents like: addressInfo, billingInfo in the User collection)
to store tags/categories ( eg: [ name: "Sport", parent: "Hobby", page: "/sport"
] )
to store simple multiple values (for eg. in User collection: list of specialties, list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve his own collection
Duplicate values across collection and precompute counts:
Duplicate some columns/attributes values from a Collection to another if you need to do a query with each values in the where conditions. (remember there aren't joins)
eg: In the Ticket collection put also the author name (not only the ID)
Also if you need a counter (number of tickets opened by user, by category, ecc), precompute them.
Embed references:
When you have a One-to-Many or Many-to-Many reference, use an embedded array with the list of the referenced document ids (see MongoDB DB Ref).
You'll need to use an Event again to remove an id if the referenced document get deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
This kind of references are directly managed by Doctrine ODM: Reference Many
Its easy to fix errors:
If you find late that you have made a mistake in the schema design, its quite simply to fix it with few lines of Javascript to run directly in the Mongo console.
(stored procedures made easy: no need of complex migration scripts)
Waring: don't use Doctrine ODM Migrations, you'll regret that later.
Redid this answer since the original answer took the relation the wrong way round due to reading incorrectly.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
As to whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema and that is duplication of changing repeating data (Normalised Form stuff).
Let's take an example. Imagine you hit the real world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update all rows where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource consuming query for MongoDB.
To contradict myself again, if you were to only have maybe 100 tickets (aka something manageable) per user then the update operation would likely not be too bad and would, in fact, by manageable for the database quite easily; plus due to the lack of movement (hopefully) of the documents this would be a completely in place update.
So whether this should be embedded or not depends heavily upn your querying and documents etc, however, I would say this schema isn't a good idea; specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed:
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}})
So a single document schema does make CRUD easier in this case.
One thing to note, stemming from your English, you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
First thing to consider is the size of a issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Is not very big, and since you no longer need the reporter information (that would be on the root document) it could be smaller, however, issues are never that simple. If you take a look at the MongoDB JIRA for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point) the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL user information in a single 16 MB block of contigious sotrage which is the maximum size of a BSON document (as imposed by the mongod currently).
I don't think you would be able to store all tickets under a single user.
Even if you was to shrink the ticket to, maybe, a code, title and a description you could still suffer from the "swiss cheese" problem caused by regular updates and changes to documents in MongoDB, as ever this: http://www.10gen.com/presentations/storage-engine-internals is a good reference for what I mean.
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:asdf-1,title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, B & C or A & C). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.

Best practices for combining Lucene.NET and a relational database?

I'm working on a project where I will have a LOT of data, and it will be searchable by several forms that are very efficiently expressed as SQL Queries, but it also needs to be searched via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is that if I do this, and perform a search, Lucene will then return the ID's of matching documents in the index, I then have to lookup these entities from the relational database.
This could be done in two ways (That I can think of so far):
N amount of queries (Horrible)
Pass all the ID's to a stored procedure at once (Perhaps as a comma delimited parameter). This has the downside of being limited to the max parameter size, and the slow performance of a UDF to split the string into a temporary table.
I'm almost tempted to mirror everything into lucenes index, so that I can periodicly generate the index from the backing store, but only need to access it for the frontend.
Advice?
I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.
When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built in ft support, with stemming and thesaurus support). This way the database can query using both SQL and ft commands. The downside is that you need a DB that has full-text-search capabilities, and these capabilities might be inferior to what lucene can do.
I guess the answer depends on what you are going to do with the results, if you are going to display the results in a grid and let the user choose the exact document he wants to access then you may want to add to the index enough text to help the user identify the document, like a blurb of say 200 characters and then once the member selects a document hit the DB to retrieve the whole thing.
This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.
Probably not an option depending on how much stuff is in your database, but what I have done is store the db id's in the search index along with the properties I wanted indexed. Then in my service classes I cache all the data needed to display search results for all the objects (e.g., name, db id, image url's, description blurbs, social media info). The service class returns a Dictionary that can look up objects by db id, and I use the id's returned by Lucene.NET to pull data from the in-memory cache.
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).