How to store user logs on a large server (best practice) - MongoDB

From my experience, this is what I have come up with.
I'm currently saving Users and Statistics classes into MongoDB and everything works great.
But how do I save the log each user generates?
I was thinking of using the Logback SiftingAppender to delegate the log information
to separate MongoDB collections, so that every MongoDB collection is named with the id of the user.
That way I don't have to create advanced map-reduce queries, since each user's logs are neatly stacked together.
Or I could use SiftingAppender with a FileAppender so each user has a separate log file.
Is there a problem with this if the MongoDB has one million log collections, each one named with the user id? (Is that even possible, by the way?)
If everything is stored in MongoDB, the MongoDB master-slave replication makes
it easy to recover if a master node dies.
What about the FileAppender approach? It feels like there would be a whole lot of log files
to administer. One could maybe save them in folders alphabetically: folder A
for users/ids with names/ids starting with A.
What are other options to make this work?

On your question of 1M collections: the default namespace (.ns) file for a database is 16MB, which allows about 24,000 namespaces (roughly 12,000 collections plus their _id indexes).
You can also raise the .ns (namespace) file size to a maximum of 2GB with the --nssize option, which allows roughly 3,072,000 namespaces.

Make use of embedded documents and have one document for each user with an array of embedded documents containing the log entries. You can also benefit from sharding if collections get large.
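A minimal sketch of that layout, assuming pymongo and names of my own choosing (a "user_logs" collection holding one document per user with an embedded "logs" array):

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

def append_log(user_id, message):
    # One document per user; $push appends a log entry to the embedded array,
    # and upsert=True creates the user's log document on the first write.
    db.user_logs.update_one(
        {"_id": user_id},
        {"$push": {"logs": {"ts": datetime.now(timezone.utc), "msg": message}}},
        upsert=True,
    )

append_log(42, "user logged in")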

Related

How to choose the best mechanism for deleting logs saved to MongoDB

I'm implementing a logger using MongoDB and I'm quite new to the concept.
The logger is supposed to log each request and its response.
I'm facing the question of using Mongo's TTL index or just running a delete query overnight.
I think the first method might bring some overhead, since it uses a background thread and probably rebuilds the index after each deletion, but it frees space as soon as the documents expire, which might be beneficial.
The second approach, on the other hand, does not have this kind of overhead, but it only frees up space at the end of each day.
It seems to me that the second approach suits my case better: my server is unlikely to be right on the edge of running out of disk space, whereas reducing the load on the server is always a concern.
I'm wondering if there are aspects of the subject that I'm missing, and I'm also not sure about the typical applications of MongoDB's TTL indexes.
Just my opinion:
It seems best to store logs in monthly, daily or hourly collections, depending on your application's write load, and at the end of each period to just drop() the oldest collections with a custom script. From experience, TTL indexes do not work well when there is a heavy write load on your collection, since they add additional write load based on the expiration time.
For example, imagine you insert log events at 06:00h at 100k/sec and your TTL index lifetime is set to 3h; this means that after 3h, at 09:00h, you will have those 100k/sec deletes applied to your collection, and they are also stored in the oplog. The solution in such cases is to add more shards, but that becomes kind of expensive; it is far easier to just drop the expired collection.
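A rough sketch of the two options (pymongo; the database, collection and field names are assumptions):

from datetime import datetime, timedelta, timezone
from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["logs"]

# Option A: TTL index -- MongoDB's background task deletes each document
# roughly 3 hours after its "ts" value, and those deletes also hit the oplog.
db.events.create_index([("ts", ASCENDING)], expireAfterSeconds=3 * 3600)

# Option B: write to a per-day collection and drop whole collections later,
# which avoids the per-document delete load entirely.
now = datetime.now(timezone.utc)
db[f"events_{now:%Y%m%d}"].insert_one({"ts": now, "msg": "request handled"})

cutoff = f"events_{now - timedelta(days=7):%Y%m%d}"
for name in db.list_collection_names():
    if name.startswith("events_") and name < cutoff:
        db.drop_collection(name)  # dropping is cheap compared to mass deletes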
Moreover, depending on your project size, for bigger collections you can speed up searches further by sharding and pre-splitting the collections on a compound index of a hashed datetime field (every log entry contains a timestamp) together with another field that you will search on often; this gives you scalable search across multiple distributed shards.
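A hedged sketch of that sharding setup (pymongo commands issued against a mongos router; the names are placeholders, and a hashed field inside a compound shard key needs MongoDB 4.4+):

from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")  # hypothetical router address

mongos.admin.command("enableSharding", "logs")
mongos.admin.command(
    "shardCollection",
    "logs.events",
    key={"ts": "hashed", "service": 1},  # hashed timestamp + frequently searched field
    numInitialChunks=8,                  # pre-split chunks for the hashed key prefix
)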
Also note that MongoDB is a general-purpose document database and its full-text search is largely limited to expensive regex expressions, so if you need to do fast raw full-text search over your logs, an inverted-index search engine like Elasticsearch on top of your MongoDB backend may be a good solution to cover this functionality.

MongoDB user logs storage best practice

I'm new to mongoDB and trying to figure out what would be the best way to store user logs. I identified two main solutions, but can't figure out which might be the best. If others come to mind, please feel free to share them ;)
1) The first is to store a log in each of the collections that I have. For instance, if I have the 'post', 'friends', 'sports' and 'music' collections, then I could create a log field in each document in each collection with all the logging info that I want to store.
2) The second way is to create an entire 'log' collection, each document having a type ('post', 'friends' ...) to identify the kind of log I'm storing, along with the id of the document that is referred to.
What I really need is to be able to store and retrieve data (that is, everything but logs) as fast as possible. (So if I go with (1), I would always have to exclude the logs from my selection queries, since they would be useless most of the time.)
Logs will only be accessed periodically (for reporting and stats mostly), yet will need to be mapped to their initial document (in the case of (2)).
I will be creating logs for almost all the non log data to store (so storing logs inside each collection might be faster : one insert vs two).
Logging could also be done asynchronously to ease the load on the server.
With all that in mind, I can't really figure out which is the best for my needs. Would anyone have any ideas / comments to share?
Thanks a lot!
How you want to access your logs will play a big part in your design decision. By what criteria will you access your log documents? Will you have to query by type (e.g. post, friends) AND id (object id of the document)? Is there some other identifying feature? Such a scheme can carry extra overhead, since you would have to read your 'type' collection first, get the id you're after, and only then query your logs collection, which creates a lot more read overhead.
What I would recommend is a separate logs collection as this keeps all related data in the one place. Then, for each log document, have a 1:1 mapping between document ids for your type collections, and your log collection. e.g. If you have a friend document, use the friend document's _id field as the _id field for your document in your logs collection. That way you can directly look up your log document without a second read. If you will have multiple log records for each type document, use an array in the log document and append each log record to it using mongo's $push. This would be a very efficient log architecture in terms of storage, write ($push requires no read - 'set and forget') and lookup time (smart 1:1 mapping - no more than one query needed if you have the _id).
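An illustrative sketch of that 1:1 mapping (pymongo; the 'friends' and 'logs' collection names and the 'entries' array are assumptions):

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Create the type document and reuse its _id as the _id of its log document.
friend_id = db.friends.insert_one({"name": "Alice"}).inserted_id

# Append log records with $push; upsert=True means no prior read is needed
# ("set and forget").
db.logs.update_one(
    {"_id": friend_id},
    {"$push": {"entries": {"ts": datetime.now(timezone.utc), "event": "friend added"}}},
    upsert=True,
)

# Later, a single direct lookup returns the friend's full log history.
log_doc = db.logs.find_one({"_id": friend_id})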

What is the typical usage of ElasticSearch in conjunction with other storage?

It is not recommended to use ElasticSearch as the only storage, for some obvious reasons like security, transactions, etc. So how is it usually used together with another database?
Say I want to store some documents in MongoDB and be able to search effectively by some of their properties. What I'd do is store the full document in Mongo as usual and then trigger an insertion into ElasticSearch, but I'd insert only the searchable properties plus the MongoDB ObjectID there. Then I can search using ElasticSearch and, having found the ObjectID, go to Mongo and fetch the whole documents.
Is this a correct usage of ElasticSearch? I don't want to duplicate the whole data set as I already have it in Mongo.
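A sketch of the flow described above, assuming pymongo plus the official elasticsearch Python client (7.x-style calls); the index and field names are made up:

from bson import ObjectId
from elasticsearch import Elasticsearch
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")["mydb"]
es = Elasticsearch("http://localhost:9200")

# Store the full document in MongoDB.
doc = {"title": "El Condor Pasa", "body": "...full text...", "tags": ["folk"]}
object_id = mongo.articles.insert_one(doc).inserted_id

# Index only the searchable properties in ES, keyed by the Mongo ObjectID.
es.index(index="articles", id=str(object_id),
         body={"title": doc["title"], "tags": doc["tags"]})

# Search in ES, then fetch the full documents from MongoDB by _id.
hits = es.search(index="articles",
                 body={"query": {"match": {"title": "condor"}}})["hits"]["hits"]
full_docs = [mongo.articles.find_one({"_id": ObjectId(h["_id"])}) for h in hits]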
The best practice for now is to duplicate documents in ES.
The cool thing here is that when you search, you don't have to go back to your database to fetch the content, as ES provides it in a single call.
You have everything with ES Search Response to display results to your user.
My 2 cents.
You may like to use the MongoDB river; take a look at this post.
There are more issues than just the size of the data you store or index: you might like to have MongoDB as a backup with "near real time" queries for inserted data, and as a queue for the data to be indexed (you may like to run MongoDB as a cluster with the relevant write concern suited to your application).

Unknown Database Content

I was given the data files for MongoDB without being told much about the content. Is there a way to probe the contents of the database to identify commonalities within it?
There are a few different mongo shell helpers that will give you an idea of the general structure of collections (e.g. field/data types and their frequency of use in a collection):
schema.js
variety
To get a better idea of the actual content, you could try one of the Admin UIs.
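A quick-and-dirty alternative to those helpers, sampling each collection with pymongo ($sample needs MongoDB 3.2+; the connection string and sample size are placeholders):

from collections import Counter
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["unknown_db"]

for coll_name in db.list_collection_names():
    field_counts = Counter()
    # Sample up to 1000 documents and tally which top-level fields appear.
    for doc in db[coll_name].aggregate([{"$sample": {"size": 1000}}]):
        field_counts.update(doc.keys())
    print(coll_name, field_counts.most_common(10))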

How to reduce the number of documents to be synced from a MongoDB

In my current project, I am using two databases.
A MongoDB instance gathering data from different data providers (about 15M documents)
Another (relational) database instance holding only the data which is needed for the application, i.e. a subset of the data in the MongoDB instance (about 5M rows)
As part of the synchronisation process, I need to regularly check for new entries in the MongoDB depending on data in the relational DB.
Let's say, this is about songs and artists, a document in the MongoDB might look like this:
{ "_id": 1, "artists": ["Simon", "Garfunkel"], "name": "El Condor Pasa" }
Part of the sync process is to import/update all songs from those artists that already exist in the relational DB, which are currently about 1M artists.
So how do I retrieve all songs of 1M named artists from MongoDB for import?
My first thought (and try) was to loop over all artists and query all songs for each artist (of course, there's an index on the "artists" field). But this takes several minutes for each batch of 1,000 artists, which would make this a very long-running process.
My second thought was to write all existing artists to a separate MongoDB collection and have one super query which only retrieves songs of artists that are stored in there. But so far I have not been able to retrieve data based on two collections.
Is this a good use case for map/reduce? If yes, can someone please give me a hint on how to achieve this? (I am not completely new to NoSQL, but sort of a newbie when it comes to map/reduce.)
Or is this idea just crazy and I have to stick with a process that's running for several days?
Thanks in advance for any hints.
If you regularly need to check for changes, then add a timestamp to your data, and incorporate that timestamp into your query. For example, if you add a "created_ts" attribute, then you can look for records that were created since the last time your batch ran.
Here are a few ideas for making the mongo interaction more efficient:
Reduce network overhead by using an "in" query. Play around with the size of the array of artist IDs in order to determine what works best for your case (see the sketch after this list).
Reduce network overhead by only selecting or reading the attributes that you need.
Make sure that your documents are indexed by artist.
On the Mongo server, make sure that as much of your data fits into memory as possible. Retrieving data from disk is going to be slow no matter what else you do. If it doesn't fit into memory, then you have a few options -- buy more memory; shrink your data set (ex. drop attributes that you don't actually need); shard; etc.
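A sketch of the batched "in" approach (pymongo); the batch size of 1,000 and the optional "created_ts" field from the first suggestion are assumptions to tune for your data:

from pymongo import MongoClient

songs = MongoClient("mongodb://localhost:27017")["music"]["songs"]

def fetch_songs(artist_names, last_sync_ts=None, batch_size=1000):
    for i in range(0, len(artist_names), batch_size):
        query = {"artists": {"$in": artist_names[i:i + batch_size]}}
        if last_sync_ts is not None:
            # Only pull records created since the last batch run.
            query["created_ts"] = {"$gt": last_sync_ts}
        # Project only the attributes needed for the import to cut network traffic;
        # the multikey index on "artists" keeps the $in lookup cheap.
        for song in songs.find(query, {"artists": 1, "name": 1}):
            yield song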