Best data store w/full text search for lots of small documents? (e.g. a Splunk-like system) - mongodb

We are spec'ing out a system that will index and store zillions of Syslog messages. These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
We generate 2 to 10 GB of these messages per day, and need to retain at least 30 days of them.
The splunk system has a really great indexing and document compression system.
What to use?
I thought of mongodb, but it seems inappropriate for documents of this small size.
SQL Server is a possibility, but seems perhaps not super efficient for this purpose.
Text files with lucene?
-- The windows file system doesn't always like dirs with zillions of files
Suggestions ?
Thanks!

I thought of mongodb, but it seems inappropriate for documents of this small size
There's a company called Boxed Ice that actually builds a server monitoring system using MongoDB. I would argue that it's definitely appropriate.
These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
From a MongoDB perspective, we would say that you are storing lots of small documents with a few attributes. In a case like this, MongoDB has several benefits:
It can handle changing attributes seamlessly.
It will flexibly handle different types.
We generate 2 to 10 gb of these messages per day, and need to retain at least 30 days of them.
This is well within the data range that MongoDB can handle. There are several different methods of handling the 30-day retention period. These will depend on your reporting needs. I would poke around on the groups for ideas here.
Based on the people I've worked with, this type of insert-heavy logging is one of the places where Mongo tends to be a very good fit.
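As a rough sketch of the small-document shape this implies (the parse helper and field names are my assumptions, and the pymongo calls are only indicated in comments since they need a running server):

```python
# Model each syslog message as one small MongoDB document.
# Field names here are illustrative assumptions, not a fixed schema.
from datetime import datetime, timezone

def parse_syslog(system, msg_type, body):
    """Build one document per message: a few attributes plus the text body."""
    return {
        "host": system,
        "ts": datetime.now(timezone.utc),
        "type": msg_type,
        "msg": body,
    }

doc = parse_syslog("web01", "auth", "Failed password for root")

# With pymongo you would then batch the inserts for this insert-heavy
# workload, and index the attributes you report on, e.g.:
#   db.logs.insert_many(batch)
#   db.logs.create_index([("host", 1), ("ts", -1)])
print(doc["host"])
```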

Graylog2 is an open-source log management tool built on top of MongoDB. I believe Loggly, a logging-as-a-service provider, also uses MongoDB as their backend store. So there are quite a few products using MongoDB for logging.
It should be possible to store the n-grams returned by a Lucene analyzer for better text searching. I'm not sure about the feasibility, though, given the large number of documents. What is the primary reporting use case?

It seems that you want something like a full-text search server for MongoDB, which would let you search on different attributes without losing performance. You may try MongoLantern: http://sourceforge.net/projects/mongolantern/. It's still in alpha, but it has given me very good results with 5M records.
Let me know whether this serves your purpose.

I would strongly consider using something like Lucene or Solr.
Lucene is built specifically for full-text search and provides a ton of additional helpful features that you may find useful in your application. As a bonus, Solr is dead simple to set up and configure. (And it's super fast for searching.)
They do not keep a file per entry, so you shouldn't have to worry much about zillions of files.
None of the free database options specialize in full-text search - don't try to force them to do what you want.
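To see why a full-text engine doesn't need a file per entry, here is a toy, pure-Python illustration of the inverted-index idea that Lucene implements (the sample messages are invented; real Lucene adds tokenization/analysis, relevance scoring, and compressed on-disk segments):

```python
# Toy inverted index: map each token to the set of document ids
# containing it, so a term lookup never scans the raw messages.
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {token: set(doc_ids)}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {
    1: "kernel panic on host alpha",
    2: "disk failure on host beta",
    3: "kernel upgrade completed",
}
index = build_index(docs)
print(sorted(index["kernel"]))  # ids of messages mentioning "kernel"
```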

I think you should deploy your own (intranet-wide) stack of Grafana, Logstash + ElasticSearch.
Once set up, you have a flexible schema, retention, and a wonderful UI for your data with Grafana.
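The Logstash piece of that stack is driven by a small config file; a minimal sketch (assuming syslog input on port 5514 and an Elasticsearch node on the default local port - adjust both to your environment) might look like:

```
input {
  syslog { port => 5514 }
}
filter {
  # parse / enrich messages here if needed
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```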

Related

ElasticSearch vs MongoDB vs Cassandra for mailer logs

I have a mailer system through which we send 1-2 lakh mails every day, and then we store all the click / open actions on those mails.
This is currently working fine in MySQL.
But now, with increasing traffic, we are facing some performance issues with MySQL.
So we are thinking of shifting to Elastic / Cassandra / Mongo.
My possible queries include
a) Finding users who have (or have not) opened / clicked a specific mail.
b) Calculating the open rate / click rate for a mail.
I think Cassandra might not fit here perfectly, as it is well suited to applications with high concurrent writes but fewer read queries.
Here there can be many types of read queries, so it will be difficult to decide on a partition key / clustering key, and too many aggregations would be running on Cassandra.
What should we use in this case and why?
We are anyhow working on both elastic / mongo to design the data model for both and then run some benchmarks around it.
The ELK stack (Elasticsearch, Logstash, Kibana) is the best solution for this. As far as I have used the ELK stack, it is fast for log processing.
Cassandra is definitely not the right option.
You can use MongoDB since most of the queries are GET queries.
But I have a few points on why Elasticsearch gains power over Mongo for log processing.
Full-text search: Elasticsearch implements a lot of features, such as customized splitting of text into words, customized stemming, faceted search, etc.
Fuzzy searching: A fuzzy search is good for spelling errors - you can find what you are searching for even though you have a spelling mistake.
Speed: Elasticsearch is able to execute complex queries extremely fast.
As the name itself suggests, Elasticsearch is made for searching. And searching in Mongo is not as fast as in Elasticsearch.
But maintaining Elasticsearch also has its own problems.
refer:
https://apiumhub.com/tech-blog-barcelona/elastic-search-advantages-books/
https://interviewbubble.com/elasticsearch-pros-and-cons-advantages-and-disadvantages-of-elasticsearch/
Thanks, I think this will help.
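To make the fuzzy-search point concrete, here is what such a query body looks like as a plain Python dict (the index and field names "mails" / "body" are assumptions for illustration); you would send it with the official client, e.g. `es.search(index="mails", body=query)`:

```python
# Elasticsearch "match" query with fuzziness enabled: the misspelled
# search term still finds documents containing "password reset".
query = {
    "query": {
        "match": {
            "body": {
                "query": "pasword reset",  # note the deliberate typo
                "fuzziness": "AUTO",       # tolerate small edit distances
            }
        }
    }
}
print(query["query"]["match"]["body"]["fuzziness"])
```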
If I look at your data structure and data access pattern, it looks like you'll have a message id for each message, its contents, and then a lot of counters which get updated each time a person opens it, maybe plus some information like the user id/email of people who have opened it.
Since these records are updated on each open of an email, I believe the number of writes is reasonably high. Assuming each mail gets opened an average of 10 times/day, that's 10-20 lakh writes per day with 1-2 lakh emails.
Comparing this with reads: I am not sure of your read pattern, but if it's being used for analytics, or to show in some dashboard, it'll be read a few times a day, maybe. Basically, reads are significantly low compared to writes.
That being said, if your read query pattern is of the form where you query always with a message id, then Cassandra/Hbase are the best choices that you have.
If that's not the case and you have different kinds of queries, or you want to do a lot of analytics, then I would prefer Mongo DB.
Elasticsearch is not really a database; it's more of a query engine. And there are a lot of instances where data loss happens in ES. If you are planning to keep this as your primary data store, then Elasticsearch/ELK is not a good choice.
You could look at this video to help come to a conclusion on which DB is best given what scenarios.
Alternatively, a summary is available on CodeKarle's website.

XML versus MongoDB

I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated Cassandra from consideration (due to complex setup) and CouchDB (due to a lack of familiar features such as indexing, which I will need initially due to my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look for a native XML database, which are not as mature as MongoDB, or should I bite the bullet and convert all the XML to JSON as it arrives and just use MongoDB?
You may have a look at BaseX (basex.org), with a built-in XQuery processor and Lucene text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially when dealing with small amounts of data like 4GB, the overhead of distributing work can easily get larger than the actual evaluation effort.
4GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will come to appreciate XQuery as a great tool for XML document analysis.
Is it Really?
Or do you get 4GB daily and have to evaluate that plus all the data you have already stored? Then you will eventually reach an amount which you cannot store and process on one machine any more, and distributing the work will become necessary. Not within days or weeks, but a year will already bring you about 1TB.
Converting to JSON
What does your input look like? Does it adhere to any schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are much weaker than what XML databases provide. On the other hand, if you only want to pull a few fields on well-defined paths, and you can analyze one input file after the other, MongoDB probably will not suffer much.
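If you do bite the bullet, the conversion step itself can be tiny for flat records; a sketch with only the standard library (the sample record and its fields are invented):

```python
# Minimal XML -> JSON-ish conversion for flat records; nested elements
# and attributes would need a richer mapping than this sketch.
import json
import xml.etree.ElementTree as ET

xml_doc = "<record><id>42</id><name>example</name></record>"

def xml_to_dict(xml_text):
    """Turn one flat XML record into a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

doc = xml_to_dict(xml_doc)
print(json.dumps(doc))
```

Note that this flat conversion drops attributes and nesting; one way to stay safe, as the question suggests, is to store the original XML as a plain string field next to the extracted metadata so the untouched source is always recoverable.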
Carrying XML into the Cloud
If you want to use both an XML database's capabilities for analyzing the data and some NoSQL system's capabilities for distributing the work, you could run the database from within that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.

messaging service: redis or mongodb?

I am working on a messaging system that is a bit more advanced than simply sending and receiving messages; it is something like Facebook chat/messaging: it has chat aspects but also messaging ones, like group messages, read/unread messages, and more.
On redis, I would simply use lists to store received messages, for example like this:
myID = [ "amy|how are you?", "frank|long time no see!" ]
amyID = [ "john|I'm good! you?" ]
(I have simplified it all a lot for easier reading.)
But this way I would not be able to keep track of single conversations, as they will always be flushed once the messages are received (so basically no "inbox" feature).
On the other hand, if I use mongodb, I could use something like this: How to keep track of a private messaging system using MongoDB?
I thought of the following advantages/disadvantages:
MONGODB
advantages:
can see inbox view
can check read/unread messages on each conversation
disadvantages:
not as fast as redis
storage size increases a lot
REDIS
advantages:
easy to pick up new messages
no storage problems (messages are flushed)
disadvantages:
once messages are sent to the client they are lost, so no read/unread feature and no inbox
Any ideas?
Thanks in advance.
I cannot answer for Redis because I don't use it and never have, so I won't pretend otherwise.
However, if for some reason you are not using an XMPP (aka Jabber) client like Facebook does: http://www.ibm.com/developerworks/xml/tutorials/x-realtimeXMPPtut/section3.html, then I will describe a pure MongoDB solution for this situation.
MongoDB uses the OS's LRU as a means to cache documents and queries. Fair enough, it provides no direct query cache; however, if you are smart, you will not need one - instead you just read all your queries directly from RAM. With this in mind, MongoDB can be just as fast as Redis, since Redis uses the computer's RAM too.
The speed difference between the two on an optimised query is negligible, I would think. The true measure of speed comes from your schema, indexes, cluster setup, and the queries you perform.
A note about storage size here, taking your comment into consideration:
the problem with flushing mongodb is bigger than I initially thought: apparently when you delete something on mongo you only delete its reference, so if you delete 4mb of documents, it won't free up that much space. the only way to actually free up that memory is to run a dbRepair (or something along those lines) that basically blocks the db while running...
You seem to have some misconceptions about exactly how MongoDB works.
This link will be of help to you: http://www.10gen.com/presentations/storage-engine-internals - it describes some of the reasons why excessive disk space is used, and also clears up some of the misconceptions you have about how a computer works and how MongoDB frees and reuses space.
MongoDB does not free space at the record level. Instead, it will take that "empty" record (record and document are two different things, as the presentation will tell you), shove it into a deleted-bucket list, and then reuse that space when a new document (or an updated document that has been moved) comes along and fits in that space.
It is true that if you are not careful about understanding how MongoDB works at this level, you will probably be forced to run repairDatabase fairly regularly to keep any sort of performance in the face of fragmentation.
As for memory handling: the OS handles this, as I said. A good explanation of when the OS will free memory is on Wikipedia: http://en.wikipedia.org/wiki/Paging
Until there is not enough RAM to store all the data needed, the process of obtaining an empty page frame does not involve removing another page from RAM.
So the OS will handle removing pages for you, and you shouldn't concern yourself with that part; instead you should be concerned with making your working set fit into RAM.
If you are worried about storing messages and don't really want to, i.e. you want them to be "flushed", you can actually use the TTL feature that comes with more recent MongoDB releases: http://docs.mongodb.org/manual/tutorial/expire-data/ which will basically allow you to set a time-out after which a message is deleted from the collection.
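As a rough sketch of that TTL idea (the collection name, field names, and one-week retention value are my assumptions, and the pymongo call is shown as a comment since it needs a running server):

```python
# MongoDB TTL expiry sketch: a background task deletes documents once
# the indexed date field is older than expireAfterSeconds.
from datetime import datetime, timezone

retention = 7 * 24 * 60 * 60  # e.g. expire messages after one week

# With pymongo you would create the TTL index once:
#   db.messages.create_index("createdAt", expireAfterSeconds=retention)
# After that, every message just needs a date field for the index to key on:
message = {
    "from": "amy",
    "to": "john",
    "text": "how are you?",
    "createdAt": datetime.now(timezone.utc),
}
print(retention)
```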
So personally, if set up right, MongoDB could do messaging and chat the way Facebook does it; of course, they use the XMPP protocol and then archive messages into Cassandra for search, but you don't have to do it like they do - that is just one way to achieve the same goal.
Hope this makes sense and I haven't gone round in circles, it is a bit of a long answer.
I think the big point here is the storage problem. You would need a lot of machines, or a good system for flushing some conversations, for MongoDB to work. Despite wanting a sort of "inbox" system... I think Redis would be more conducive to a well-working chat system - you just need to come up with some very creative workarounds... or give up that design goal.
We use a mixed design: when we need snappy performance, as with messages, queues, and caches, it's on Redis, and when we need to search on secondary indexes or update whole documents, we use MongoDB.
You can also try Riak, which can grow more linearly and smoothly than MongoDB.

Frequent large, multi-record updates in MongoDB, Lucene, etc

I am working on the high-level design of a web application with the following characteristics:
Millions of records
Heavily indexed/searchable by various criteria
Variable document schema
Regular updates in blocks of 10K - 200K records at a time
Data needs to remain highly available during updates
Must scale horizontally effectively
Today, this application exists in MySQL and we suffer from a few huge problems, particularly that it is challenging to adapt to flexible schema, and that large bulk updates lock the data for 10 - 15 seconds at a time, which is unacceptable. Some of these things can be tackled by better database design within the context of MySQL, however, I am looking for a better "next generation" solution.
I have never used MongoDB, but its feature set seemed to most closely match what I am looking for, so that was my first area of interest. It has some things I am excited about, such as data sharding, the ability to find-update-return in a single statement, and of course the schema flexibility of NoSQL.
There are two things I am not sure about, though, with MongoDB:
I can't seem to find solid information about the concurrency of updates with large data sets (see my use case above), so I have a hard time understanding how it might perform.
I do need open text search
That second requirement brought me to Lucene (or possibly to Solr if I kept it external) as a search store. I did read a few cases where Lucene was being used in place of a NoSQL database like MongoDB entirely, which made me wonder if I am over-complicating things by trying to use both in a single app -- perhaps I should just store everything directly in Lucene and run it like that?
Given the requirements above, does it seem like a combination of MongoDB and Lucene would make this work effectively? If not, might it be better to attempt to tackle it entirely in Lucene?
Currently with MongoDB, updates lock at the server level. There are a few JIRAs open that address this, planned for v1.9-2.0. I believe the current plan is to yield writes so that reads can perform better.
With that said, there are plenty of great ways to scale MongoDB for super-high concurrency - many of which are similar to those for MySQL. One such example is to use RAID 10. Another is to use master-slave replication, where you write to the master and read from the slaves.
You also need to consider whether your written data needs to be 1) durable and 2) accessible via slaves immediately. The MongoDB drivers allow you to specify whether you want the data written to disk immediately (or left hanging in memory for the next fsync), and allow you to specify how many slaves the data should be written to. Both of these will slow down MongoDB's writes, which as noted above can affect read performance.
MongoDB also does not have nearly the full-text search capability that Solr/Lucene has, and you will likely want to use both together. I am currently using both Solr and MongoDB together and am happy with it.

Storing millions of log files - Approx 25 TB a year

As part of my work we get approx 25TB worth of log files annually; currently they are saved over an NFS-based filesystem. Some are archived as zipped/tar.gz files, while others reside in pure text format.
I am looking for alternatives to using an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be changed to JSON to be stored in the DB, something I am not willing to do. I need to retain the log file content as-is.
As for usage we intend to put a small REST API and allow people to get file listing, latest files, and ability to get the file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at application level where one can store log files and can scale horizontally effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see lots of huge-storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
If you are to choose a document database:
On CouchDB you can use the _attachments API to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFs, but you would build the API yourself.
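As a sketch of that CouchDB attachment flow (the host, database name, document id, filename, and revision below are placeholders, not from the question):

```python
# CouchDB stores a raw file via PUT to /<db>/<doc_id>/<filename>,
# keyed by the document's current revision; all names here are invented.
base = "http://localhost:5984"
db, doc_id, filename, rev = "logs", "host1-2024-01-01", "syslog.txt", "1-abc"
url = f"{base}/{db}/{doc_id}/{filename}?rev={rev}"

# e.g. upload the unmodified log bytes with urllib.request:
#   req = urllib.request.Request(url, data=log_bytes, method="PUT",
#                                headers={"Content-Type": "text/plain"})
#   urllib.request.urlopen(req)
print(url)
```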
Also HDFS is a very nice choice.