Advantages of storing big logs in MongoDB or Hadoop for analytics vs. zip files on a filer?

At the moment, we store a huge amount of logs (30 GB/day x 3 machines = roughly 100 GB) on a filer. The logs are zipped.
The current search tool finds the corresponding logs (according to a time range), copies them locally, unzips them, then searches the XML for information and displays it.
We are studying the possibility of building a Splunk-like tool to search those logs (they are the output of the message bus: XML messages sent to other systems).
What are the advantages of relying on a Mongo-like DB instead of querying the zipped log files directly?
We could also index some data in a DB and let the program search only the targeted zip files...
What more would MongoDB... or Hadoop bring?

I have worked on MongoDB and am currently working on Hadoop, so I can list some differences that you might find interesting.
MongoDB requires you to store your logs as documents (instead of raw text data). HDFS can store them as plain files and lets you use custom MapReduce programs to process them.
MongoDB requires you to choose a good shard key in order to distribute the load efficiently across the cluster. Since you are storing log files, that can be difficult.
If you can store the logs formatted as documents in MongoDB, it will let you query the data with very low latency across huge amounts of logs. My last project had built-in logging based on MongoDB, and analysis was extremely fast compared to MapReduce analysis of raw text logs. But the logging has to be designed that way from the ground up.
In Hadoop you have technologies like Hive, HBase and Impala which will help you analyze text-format logs, but the latency of MapReduce needs to be kept in mind (there are ways to optimize it, though).
To summarize: if you can implement MongoDB-based logging across the entire stack, go for MongoDB; but if you already have text-format logs, go for Hadoop. If you can convert your XML data into MongoDB documents in real time, you can get a very efficient solution.
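As a minimal sketch of what MongoDB-based logging "from the ground up" could look like, assuming Python with pymongo (collection and field names are purely illustrative):

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
events = client.logging.events

# Emit structured log documents instead of text lines,
# so later analysis can filter on indexed fields directly.
events.create_index([("ts", ASCENDING), ("level", ASCENDING)])

def log_event(level, message, **fields):
    """Write one structured log event; extra fields become queryable keys."""
    doc = {"ts": datetime.now(timezone.utc), "level": level, "msg": message}
    doc.update(fields)
    events.insert_one(doc)

log_event("ERROR", "message bus delivery failed", system="billing", retries=3)

# Low-latency analysis: count errors per target system in the last hour.
pipeline = [
    {"$match": {"level": "ERROR",
                "ts": {"$gte": datetime.now(timezone.utc) - timedelta(hours=1)}}},
    {"$group": {"_id": "$system", "n": {"$sum": 1}}},
]
print(list(events.aggregate(pipeline)))
```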

My knowledge of Hadoop is limited, so I will focus on MongoDB.
You could store each log entry in MongoDB. With an index on the time field, you can easily fetch a specific time range. MongoDB will gain full-text search support in version 2.4, which would certainly be an interesting feature for your use case, but it isn't production-ready yet. Until then, searching for substrings is a very slow operation, so you would have to convert the XML trees that are relevant for your searches into MongoDB documents and create indexes on the most-searched fields.
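A hedged sketch of that pattern, assuming Python with pymongo (field names such as ts and msg_type are illustrative, not prescribed by the answer):

```python
from datetime import datetime
from pymongo import MongoClient, ASCENDING

logs = MongoClient()["logs"]["entries"]

# Index the timestamp plus the XML fields you actually search on;
# substring search without an index would scan the whole collection.
logs.create_index([("ts", ASCENDING)])
logs.create_index([("msg_type", ASCENDING), ("destination", ASCENDING)])

# Each log entry stores the extracted fields next to the raw XML payload.
logs.insert_one({
    "ts": datetime(2013, 2, 1, 12, 30, 5),
    "msg_type": "OrderCreated",
    "destination": "billing",
    "raw_xml": "<message>...</message>",
})

# Time-range query, served by the ts index.
cursor = logs.find({
    "ts": {"$gte": datetime(2013, 2, 1), "$lt": datetime(2013, 2, 2)},
    "msg_type": "OrderCreated",
})
for entry in cursor:
    print(entry["ts"], entry["destination"])
```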
But be aware that storing your logs in MongoDB means you will need a lot more hard drive space. MongoDB does not compress the payload data and adds its own metadata overhead, so it will require even more disk space than the unzipped logs. If you use the new text search feature, it will take even more disk space: in a presentation I saw, the text index was twice as large as the data it was indexing. Sure, this feature is still a work in progress, but I wouldn't bet on that getting much smaller in the final version.

Related

Is it efficient to store images inside MongoDB using GridFS?

I know how to do it, but I wonder whether it's effective. As far as I know, MongoDB clusters are very efficient, and I can flexibly control the collections and the servers they reside on. The only concern is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop, or, if I cluster MongoDB intelligently, will I get similar access speeds?
GridFS is provided for convenience; it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then, the actual blob is split into chunks (255 KB each by default, well under the 16 MB document limit) and uploaded as a sequence of documents into the chunks collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata.
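For illustration, a minimal sketch of those driver helpers using Python's pymongo and its gridfs module (the file name is made up):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["media"]
fs = gridfs.GridFS(db)  # backed by the fs.files and fs.chunks collections

# Write: the driver creates the metadata document and splits the blob into chunks.
with open("avatar.png", "rb") as f:
    file_id = fs.put(f, filename="avatar.png")

# Read: the driver reassembles the chunks back into one stream.
blob = fs.get(file_id).read()
print(len(blob), "bytes")
```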
Thus, on first glance, the problem is solved - the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such they compete for cache space with the actual documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transferred together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
Good practice is to upload the image somewhere (your own server or the cloud) and then store only the image URL in MongoDB.
Anyway, I did a little investigating. The short conclusion: if you need to store user avatars you can use MongoDB, but only if it's a single avatar (you can't store many blobs inside MongoDB); if you need to store videos or just many large files, then you need something like CephFS.
Why do I think so? When I was testing MongoDB with media files on a slow instance, files of up to 10 MB (usually about 1 MB) were taking up to 3000 milliseconds to come back. That's an unacceptably long time. When there were a lot of files (100+), it could turn into a pain. A real pain.
Ceph is designed precisely for storing files, up to petabytes of information. That's what's needed.
How do you implement this in a real project? If you use the OOP wrapper for MongoDB (Mongoose), you can simply add methods to the model objects that access Ceph and do what you need. You can write methods such as "load file", "delete file", "count files" and so on, and then use them together as usual. Don't forget to maintain Ceph and add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly, i.e. the web server forwards a request to Ceph when the user needs a file and returns Ceph's response to the user.
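A rough sketch of that split, written here in Python rather than Mongoose: the file bytes go to Ceph (via its S3-compatible RADOS Gateway, using boto3), while MongoDB keeps only the metadata and the object key. The endpoint, bucket, and field names are assumptions:

```python
import boto3
from pymongo import MongoClient

# Ceph's RADOS Gateway speaks the S3 protocol, so boto3 can talk to it.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.internal:7480",  # assumed RGW endpoint
    aws_access_key_id="ACCESS",
    aws_secret_access_key="SECRET",
)
files = MongoClient()["app"]["files"]

def save_file(user_id, name, data: bytes):
    """Store the bytes in Ceph; keep only metadata and the object key in MongoDB."""
    key = f"{user_id}/{name}"
    s3.put_object(Bucket="uploads", Key=key, Body=data)
    files.insert_one({"user_id": user_id, "name": name, "key": key, "size": len(data)})

def load_file(user_id, name) -> bytes:
    """The web server fetches from Ceph and returns the bytes to the client."""
    meta = files.find_one({"user_id": user_id, "name": name})
    return s3.get_object(Bucket="uploads", Key=meta["key"])["Body"].read()
```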
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!
GridFS
Ceph File System
More Ceph

Using MongoDB to store immutable data?

We are investigating options to store and read a lot of immutable data (events), and I'd like some feedback on whether MongoDB would be a good fit.
Requirements:
We'll need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB. Would it be fine to store all of these events in the same collection?
A really important requirement is that we need to be able to replay all events in order. I've read here that MongoDB has a limit of 32 MB when sorting documents using cursors. For us it would be fine to read all data in insertion order (like a table scan), so an explicit sort might not be necessary. Are cursors the way to go, and would they be able to fulfill this requirement?
If MongoDB would be a good fit for this, is there some configuration or setting one can tune to increase performance or reliability for immutable data?
This is very similar to storing logs: lots of writes, and the data is read back in order. Luckily, the MongoDB site has a recipe for this:
https://docs.mongodb.org/ecosystem/use-cases/storing-log-data/
Regarding immutability of the data, that's not a problem for MongoDB.
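As an illustrative sketch (Python with pymongo; the collection name and the indexed ts field are assumptions, not part of the original answer), events can be appended and replayed in insertion order through a cursor, with an index-backed sort to stay clear of the in-memory sort limit:

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

events = MongoClient()["eventstore"]["events"]
events.create_index([("ts", ASCENDING)])  # index-backed sort avoids the 32 MB in-memory sort limit

def append(event_type, payload):
    events.insert_one({"ts": datetime.now(timezone.utc), "type": event_type, "payload": payload})

def replay():
    # A cursor streams the collection in batches; nothing is loaded all at once.
    for ev in events.find(sort=[("ts", ASCENDING)]):
        yield ev

append("UserRegistered", {"user": "alice"})
for ev in replay():
    print(ev["ts"], ev["type"])
```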
Edit 2022-02-19:
Replacement link:
https://web.archive.org/web/20150917095005/docs.mongodb.org/ecosystem/use-cases/storing-log-data/
Snippet of content from page:
This document outlines the basic patterns and principles for using MongoDB as a persistent storage engine for log data from servers and other machine data.

Problem
Servers generate a large number of events (i.e. logging) that contain useful information about their operation, including errors, warnings, and user behavior. By default, most servers store these data in plain-text log files on their local file systems.
While plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without holistic systems for aggregating and storing these data.

Solution
The solution described below assumes that each server generates events, also consumes event data, and that each server can access the MongoDB instance. Furthermore, this design assumes that the query rate for this logging data is substantially lower than is common for logging applications with a high-bandwidth event stream.

NOTE
This case assumes that you're using a standard uncapped collection for this event data, unless otherwise noted. See the section on capped collections.

Schema Design
The schema for storing log data in MongoDB depends on the format of the event data that you're storing. For a simple example, consider standard request logs in the combined format from the Apache HTTP Server. A line from these logs may resemble the following:
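(The example line itself is cut off in the snippet above.) Purely as an illustrative sketch, not taken from the archived page, here is one way a combined-format Apache log line could be parsed into a MongoDB document in Python:

```python
import re
from datetime import datetime
from pymongo import MongoClient

# Regex for the Apache "combined" format: host, identity, user, time, request,
# status, size, referer, and user agent.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def line_to_doc(line: str) -> dict:
    d = COMBINED.match(line).groupdict()
    d["time"] = datetime.strptime(d["time"], "%d/%b/%Y:%H:%M:%S %z")
    d["status"] = int(d["status"])
    d["size"] = 0 if d["size"] == "-" else int(d["size"])
    return d

# The sample line below is the canonical one from the Apache documentation, not from the archived page.
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" '
        '200 2326 "http://example.com/start.html" "Mozilla/4.08"')
MongoClient()["logs"]["apache"].insert_one(line_to_doc(line))
```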

XML versus MongoDB

I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated Cassandra from consideration (due to its complex setup) and CouchDB (due to the lack of familiar features such as indexing, which I will need initially because of my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look at native XML databases, which are not as mature as MongoDB, or should I bite the bullet, convert all the XML to JSON as it arrives, and just use MongoDB?
You may have a look at BaseX (basex.org), with a built-in XQuery processor and Lucene text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially when dealing with small amounts of data like 4 GB, the overhead of distributing work can easily exceed the actual evaluation effort.
4 GB / 60k nodes is not large for XML databases, either. After spending some time with it, you will come to appreciate XQuery as a great tool for XML document analysis.
Is it Really?
Or do you get 4 GB daily and have to evaluate both that and all the data you have already stored? Then you will reach an amount you cannot store and process on one machine any more, and distributing the work will become necessary. Not within days or weeks, but a year will already bring you over 1 TB.
Converting to JSON
What does your input look like? Does it adhere to any schema or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are far weaker than what XML databases provide. On the other hand, if you only want to pull a few fields at well-defined paths and you can analyze one input file after the other, MongoDB probably will not suffer much.
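A small sketch of the "pull a few fields at well-defined paths" route, assuming Python with the standard library's xml.etree and pymongo (the element paths are invented); the untouched XML is stored alongside the extracted fields so loader bugs never cause permanent damage:

```python
import xml.etree.ElementTree as ET
from pymongo import MongoClient

docs = MongoClient()["feeds"]["messages"]

def ingest(xml_text: str):
    root = ET.fromstring(xml_text)
    doc = {
        # Extract just the fields you query on; the paths are illustrative.
        "order_id": root.findtext("header/orderId"),
        "created": root.findtext("header/created"),
        "customer": root.findtext("body/customer/name"),
        # Keep the original XML untouched, so a buggy loader never loses information.
        "raw_xml": xml_text,
    }
    docs.insert_one(doc)

ingest("<message><header><orderId>42</orderId><created>2013-02-01</created></header>"
       "<body><customer><name>ACME</name></customer></body></message>")
```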
Carrying XML into the Cloud
If you want to use both an XML database's capabilities for analyzing the data and a NoSQL system's capabilities for distributing the work, you could run the database from within that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.

Log viewing utility database choice

I will be implementing a log viewing utility soon, but I am stuck on the DB choice. My requirements are as follows:
Store 5 GB data daily
Total size of 5 TB data
Search this log data in less than 10 seconds
I know that PostgreSQL will work if I partition the tables. But will I be able to get the performance described above? As I understand it, NoSQL is a better choice for storing logs, since logs are not very structured. I saw a promising example using Hadoop-HBase-Lucene:
http://blog.mgm-tp.com/2010/03/hadoop-log-management-part1/
But before deciding, I wanted to ask whether anybody has made a choice like this before and could give me some insight. Which DBMS will fit this task best?
My logs are very structured :)
I would say you don't need a database, you need a search engine:
Solr: based on Lucene, and it packages everything you need together
ElasticSearch: another Lucene-based search engine
Sphinx: a nice thing is that you can use multiple sources per search index -- enrich your raw logs with other events
Scribe: Facebook's way to collect and search logs
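For a concrete flavour of the search-engine route, here is a hedged sketch using ElasticSearch's Python client (assuming the 8.x elasticsearch package; the index name and fields are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one parsed log line; an inverted index is built per field automatically.
es.index(index="logs", document={
    "ts": "2013-02-01T12:30:05",
    "host": "app-03",
    "level": "ERROR",
    "message": "connection reset by peer while flushing queue",
})

# Full-text search restricted to a time window; this is the sub-10-second query path.
hits = es.search(index="logs", query={
    "bool": {
        "must": {"match": {"message": "connection reset"}},
        "filter": {"range": {"ts": {"gte": "2013-02-01", "lt": "2013-02-02"}}},
    },
})
print(hits["hits"]["total"])
```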
Update for #JustBob:
Most of the mentioned solutions can work with flat files without affecting performance. All of them need an inverted index, which is the hardest part to build and maintain. You can update the index in batch mode or online. The index can be stored in an RDBMS, in NoSQL, or in a custom "flat file" storage format (custom meaning maintained by the search engine application).
You can find a lot of information here:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
See which fits your needs.
Anyway, for such a task NoSQL is the right choice.
You should also consider the learning curve: MongoDB / CouchDB, even though they don't perform as well as Cassandra or Hadoop, are easier to learn.
MongoDB is used by Craigslist to store old archives: http://www.10gen.com/presentations/mongodb-craigslist-one-year-later

Storing millions of log files - Approx 25 TB a year

As part of my work we get approximately 25 TB of log files annually; currently they are saved on an NFS-based filesystem. Some are archived (zipped/tar.gz) while others reside in plain text format.
I am looking for alternatives to the NFS-based system. I have looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, something I am not willing to do. I need to retain the log file content as-is.
As for usage, we intend to put a small REST API in front and allow people to get file listings, the latest files, and the ability to fetch a file.
The proposed solutions/ideas need to be some form of distributed database or application-level filesystem where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at Gluster? It is scalable, provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large and the access pattern will be a linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
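As a rough illustration of that approach, here is a sketch assuming Python with pyarrow (which needs libhdfs available) and a reachable namenode; host, paths, and sizes are placeholders:

```python
import pyarrow.fs as pafs

# Connect to the cluster; replication factor and block size are normally
# configured cluster-wide in hdfs-site.xml (e.g. 2-3x replication, 64 MB blocks).
hdfs = pafs.HadoopFileSystem(host="namenode.internal", port=8020)

# Upload one archived log file in exactly the format it is in now.
with open("/nfs/logs/2013-02-01/app.log.gz", "rb") as src:
    with hdfs.open_output_stream("/logs/2013-02-01/app.log.gz") as dst:
        dst.write(src.read())

# Linear scan later: stream the file back without converting anything to JSON.
with hdfs.open_input_stream("/logs/2013-02-01/app.log.gz") as f:
    data = f.read()
```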
If you are to choose a document database:
On CouchDB you can use the attachment API to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you get a REST API for the documents and the attachments (see the sketch after these options).
A similar approach is possible with Mongo's GridFS, but you would have to build the API yourself.
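A hedged sketch of the CouchDB attachment route (Python with requests against CouchDB's standard HTTP API; the database name, document ID, and metadata fields are made up):

```python
import requests

BASE = "http://localhost:5984/logfiles"
doc_id = "app-2013-02-01"

# Create the metadata document used for indexing and listings.
meta = {"timestamp": "2013-02-01T00:00:00Z", "host": "app-03", "format": "gzip"}
rev = requests.put(f"{BASE}/{doc_id}", json=meta).json()["rev"]

# Attach the raw log file as-is; CouchDB stores it untouched next to the document.
with open("app.log.gz", "rb") as f:
    requests.put(
        f"{BASE}/{doc_id}/app.log.gz",
        params={"rev": rev},
        data=f,
        headers={"Content-Type": "application/gzip"},
    )

# The same REST API serves the file back.
blob = requests.get(f"{BASE}/{doc_id}/app.log.gz").content
```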
HDFS is also a very nice choice.