ElasticSearch vs MongoDB vs Cassandra for mailer logs - mongodb

I have a mailer system where in we send 1-2 lakhs mail everyday and then we store all the clicks / opens actions of those mail.
This is currently working fine in MySQL.
But now with increasing traffic, we are facing some performance issue with Mysql.
So we are thinking of shifting to Elastic / Cassandra / Mongo.
My possible queries include
a) Getting user which have opened / clicked a specific mail or not.
b) Calculating open rate / click rate for mail
I think cassandra might not fit here perfectly as it is well suited for applications with high concurrent writes but with less read queries.
Here there can be many types of read queries so it will be difficult to decide on partitioning key / clustering, so too mzny aggregations will be running on cassandra.
What should we use in this case and why?
We are anyhow working on both elastic / mongo to design the data model for both and then run some benchmarks around it.

ELK stack (Elastic Search, LogStash, Kibana) is the best solution for this. As far as I have used ELK stack, it is fast for log processing.
Cassandra is definitely not the right option.
You can use MongoDB since most of the queries are GET queries.
But I have a few points why Elastic search gains power over Mongo for Log Processing.
Full-text search : Elastic Search implements a lot of features, such as customized splitting text into words, customized stemming, facetted search, etc.
Fuzzy Searching : A fuzzy search is good for spelling errors. You can find what you are searching for even though you have a spelling mistake.
Speed : Elastic search is able to execute complex queries extremely fast.
As the name itself suggests Elastic search is made for searching purpose. And Searching in mongo is not as fast as Elastic Search.
But Maintaining Elastic Search also has its own problems.
refer:
https://apiumhub.com/tech-blog-barcelona/elastic-search-advantages-books/
https://interviewbubble.com/elasticsearch-pros-and-cons-advantages-and-disadvantages-of-elasticsearch/
Thanks, I think this will help.

If I try to look at your Data Structure and Data Access pattern, it looks like you'll have a message Id for each message, it's contents, and then along with it, a lot of counters which get updated each time a person opens it, maybe some information like user id/email of people who have opened it.
Since these records are updated on each open of an email, I believe the number of writes are reasonably high. Assuming each mail gets opened on an Average of 10 times/day, it'll have 10-20 Lakh writes per day with 1-2 Lakh emails.
Comparing this with reads, I am not sure of your read pattern, but if it's being used for analytics purpose, or to show in some dashboard it'll be read a few times a day maybe. Basically Reads are significantly low compared to writes.
That being said, if your read query pattern is of the form where you query always with a message id, then Cassandra/Hbase are the best choices that you have.
If that's not the case and you have different kinds of queries, or you want to do a lot of analytics, then I would prefer Mongo DB.
Elastic search is not really a Database, it's more of a query engine. And there are a lot of instances where the data loss happens in ES. If you are planning to keep this as your primary data store then Elastic Search/ELK is not a good choice.
You could look at this video to help come to a conclusion on which DB is best given what scenarios.
Alternatively, a summary is # CodeKarle's website

Related

How can I handle a very large database and do not miss the performance?

if i want to develop an application, I'm worried about its performance after the number of users and stored data increases.
actually I don't know what is the best way to implement a program that it works with a really large data and do some things like search in it, find and receive user information, search text and so on in real time without any delay !
Let's me explain the problem more
for example i have chosen 'Mongodb' as a database and suppose we have at least five million users and a user want to log in into the system, the user has sent the username and password
The first thing that we should do is to find the user with that username and then check the password, in mongodb we should use something like 'find' method to get the user's information, something like below:
Users.find({ username: entered_username })
then get the user information and we check the password
but the 'find' method should search the username between million users and it's a large number and if any person request for authentication, this method should be run for each of them and it cause a heavy processing on the system
but unfortunately this problem is only for something like finding a user, if we decide to search a text when we have a lot of texts and posts on the database the problem is more bigger
i don't know how big companies like facebook and linkedin search through millions of data in such a short span of time. actually i don't want to create something like facebook or more but i have a large amount of data and i'm looking for a good way to handle it
is there any framework or something else that help me to handle large data on the databases or is there exist a method to implement data on database so that we search and find data fast and quickly? should i use a particular data structure?
i founded an opensource project elasticsearch that it help us to search faster but i don't know if i found something with elastic how can i find it on mongodb too for doing something like updating data and if i use elastic search i should use mongodb too or not!? can i use elastic as a database and as a search engine simultaneous !?
if i use elasticsearch and mongodb together then i should have two copies of my data, one in mongodb and one in elasticsearch!? and this two copies of the data that are separated :( i wish elasticsearch search in the mongodb that does not have to create two copies of the data
thank you if you help me to find out a good way and understand what should i do.
When you talk about performance, it usually boils down to three things:
Your design
Your definition of "quick", and
How much you're willing to pay
Your design
MongoDB is great if you want to iterate on your data model, can scale horizontally, and very quick if used properly. Elasticsearch on the other hand, is not a database. However, it is very quick for searching. A traditional relational database will be useful if you know exactly how your data looks like, and don't expect it to change much, or is relational by nature.
You can, for example, use a relational database for user login, use MongoDB for everything else, and use Elastic for textual, searchable data. There is no rule that tells you to keep everything within a single database.
Make sure you understand indexing, and know how to utilize it to its fullest potential. The fastest hardware will not help you if you don't design your database properly.
Conclusion: use any tool you need, combine if necessary, but understand their strengths and weaknesses.
Your definition of "quick"
How "quick" is quick enough for your application? Is 100ms quick enough? Is 10ms quick enough? Remember that more performance you ask of the machine, more expensive it will be. You can get more performance with a better design, but design can only go so far.
Usually this boils down to what is acceptable for you and your client. Not every application needs a sub-10ms response time. There's plenty of applications that can tolerate queries that return in seconds.
Conclusion: determine what is acceptable, and design accordingly.
How much you're willing to pay
Of course, it all depends on how much you're willing to pay for all the hardware that need to host all that stuff. MongoDB might be open source, but you need some place to host it. Also, you cannot expect magic. You can't throw thousands of queries and updates per second, and expect it to be blazing fast when you only give it 1 GB of RAM.
Conclusion: never under-provision to save money if you want your application to be successful.

ElasticSearch vs. MongoDB for Caching User Data

Up to this point, I have been using MongoDB (Node.js + Mongoose) to save posts which belong to a user, so that I can later retrieve them to display in a stream (just like Facebook, Twitter, etc.)
It recently became necessary to allow the user to deeply search his stream; MongoDB's search was insufficient, so I implemented ElasticSearch on my servers (Amazon EC2 m1.large instances running CentOS, FWIW).
My question: I'm now in a position that I'm duplicating the data between MongoDB (where the user's stream is cached) and ElasticSearch (where it is searched).
Is there any disadvantage to moving my cache ENTIRELY into ElasticSearch, getting rid of the MongoDB all together? It seems a waste to double the storage, and there's no other place that I'm accessing this data (it is only used when presenting/searching the stream of posts).
Specifically, I want to make sure I'm not overlooking anything re: performance. I like the idea of reducing MongoDB as a bottleneck, yet I worry about the memory overhead of ElasticSearch. MongoDB runs on its own server in my cloud setup, whereas ElasticSearch is running on the same instances as node.js. This means I would have MORE ElasticSearch servers (the node.js servers are in an auto-scaling array), but they each are not DEDICATED servers (unlike MongoDB).
The only big obstacle to using ES as a "primary datasource" is that there isn't a good backup mechanism right now. The ES team is working on it and expect it to be out by the end of the year, but in the mean time, you'll have to implement your own backup scripts.
As far as performance, it's really hard to say because almost every situation is unique. ES benefits from memory - so more is always better. In particular, sorts/filters/facets/geo all like to eat memory. If you aren't doing much in the way of faceting, for example, you may be fine with less memory.
ES doesn't need to run on a dedicated node...but it will happily use as many resources as you give it.
Another option is to use just the elastic search indexes. You can choose to not save data in a readable format, so you search in ES and then retrieve documents from MongoDB to your user as needed.
The question bellow comments exactly on that.
Storing only selected fields and not storing _all in pyes/elasticsearch

Log viewing utility database choice

I will be implementing log viewing utility soon. But I stuck with DB choice. My requirements are like below:
Store 5 GB data daily
Total size of 5 TB data
Search in this log data in less than 10 sec
I know that PostgreSQL will work if I fragment tables. But will I able to get this performance written above. As I understood NoSQL is better choice for log storing, since logs are not very structured. I saw an example like below and it seems promising using hadoop-hbase-lucene:
http://blog.mgm-tp.com/2010/03/hadoop-log-management-part1/
But before deciding I wanted to ask if anybody did a choice like this before and could give me an idea. Which DBMS will fit this task best?
My logs are very structured :)
I would say you don't need database you need search engine:
Solr based on Lucene and it packages everything what you need together
ElasticSearch another Lucene based search engine
Sphinx nice thing is that you can use multiple sources per search index -- enrich your raw logs with other events
Scribe Facebook way to search and collect logs
Update for #JustBob:
Most of the mentioned solutions can work with flat file w/o affecting performance. All of then need inverted index which is the hardest part to build or maintain. You can update index in batch mode or on-line. Index can be stored in RDBMS, NoSQL, or custom "flat file" storage format (custom - maintained by search engine application)
You can find a lot of information here:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
See which fits your needs.
Anyway for such a task NoSQL is the right choice.
You should also consider the learning curve, MongoDB / CouchDB, even though they don't perform such as Cassandra or Hadoop, they are easier to learn.
MongoDB being used by Craigslist to store old archives: http://www.10gen.com/presentations/mongodb-craigslist-one-year-later

Frequent large, multi-record updates in MongoDB, Lucene, etc

I am working on the high-level design of a web application with the following characteristics:
Millions of records
Heavily indexed/searchable by various criteria
Variable document schema
Regular updates in blocks of 10K - 200K records at a time
Data needs to remain highly available during updates
Must scale horizontally effectively
Today, this application exists in MySQL and we suffer from a few huge problems, particularly that it is challenging to adapt to flexible schema, and that large bulk updates lock the data for 10 - 15 seconds at a time, which is unacceptable. Some of these things can be tackled by better database design within the context of MySQL, however, I am looking for a better "next generation" solution.
I have never used MongoDB, but its feature set seemed to most closely match what I am looking for, so that was my first area of interest. It has some things I am excited about, such as data sharding, the ability to find-update-return in a single statement, and of course the schema flexibility of NoSQL.
There are two things I am not sure about, though, with MongoDB:
I can't seem to find solid
information about the concurrency of
updates with large data sets (see my
use case above) so I have a hard
time understanding how it might
perform.
I do need open text search
That second requirement brought me to Lucene (or possibly to Solr if I kept it external) as a search store. I did read a few cases where Lucene was being used in place of a NoSQL database like MongoDB entirely, which made me wonder if I am over-complicating things by trying to use both in a single app -- perhaps I should just store everything directly in Lucene and run it like that?
Given the requirements above, does it seem like a combination of MongoDB and Lucene would make this work effectively? If not, might it be better to attempt to tackle it entirely in Lucene?
Currently with MongoDB, updates are locking at the server-level. There are a few JIRAs open that address this, planned for v1.9-2.0. I believe the current plan is to yield writes to allow reads to perform better.
With that said, there are plenty of great ways to scale MongoDB for super high concurrency - many of which are simiar for MySQL. One such example is to use RAID 10. Another is to use master-slave where you write to master and read from slave.
You also need to consider if your "written" data needs to be 1) durable and 2) accessible via slaves immediately. The mongodb drivers allow you to specify if you want the data to be written to disk immediately (or hang in memory for the next fsync) and allow you to specify how many slaves the data should be written to. Both of these will slow down MongoDB writing, which as noted above can affect read performance.
MongoDB also does not have nearly the capability for full-text search that Solr\Lucene have and you will likely want to use both together. I am currently using both Solr and MongoDB together and am happy with it.

Best data store w/full text search for lots of small documents? (e.g. a Splunk-like system)

We are specing out a system that will index and store zillions of Syslog messages. These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
We generate 2 to 10 gb of these messages per day, and need to retain at least 30 days of them.
The splunk system has a really great indexing and document compression system.
What to use?
I thought of mongodb, but it seems inappropriate for documents of this small size.
SQL Server is a possibility, but seems perhaps not super efficient for this purpose.
Text files with lucene?
-- The windows file system doesn't always like dirs with zillions of files
Suggestions ?
Thanks!
I thought of mongodb, but it seems inappropriate for documents of this small size
There's a company called Boxed Ice that actually builds a server monitoring system using MongoDB. I would argue that it's definitely appropriate.
These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
From a MongoDB perspective, we would say that you are storing lots of small documents with a few attributes. In a case like this MongoDB has several benefits here.
It can handle changing attributes seamlessly.
It will flexibly handle different types.
We generate 2 to 10 gb of these messages per day, and need to retain at least 30 days of them.
This is well within the type of data range that MongoDB can handle. There are several different methods of handling the 30 day retention periods. These will depend on your reporting needs. I would poke around on the groups for ideas here.
Based on the people I've worked with, this type of insert-heavy logging is one of the places where Mongo tends to be a very good fit.
Graylog2 is an open-source log management tool that is built on top of MongoDB. I believe Loggy, a logging-as-a-service provider, also uses MongoDB as their backend store. So there are quiet a few products using MongoDB for logging.
It should be possible to store the ngrams returned by a Lucene analyzer for better text searching. Not sure about the feasibility though given the large amount of documents. What is primary reporting use case?
It seems that you would want something like mongodb full-text search server, which will enable you to search on different attributes without losing performance. You may try MongoLantern: http://sourceforge.net/projects/mongolantern/. Though it's in alpha stage but gives very best result for me for 5M records.
Let me know whether this serves your purpose.
I would strongly consider using something Lucene or Solr.
Lucene is built specifically for full text search and provides a ton of additional helpful features that you may find useful in your application. As a bonus, Solr is dead simple to setup and configure. (And its super fast for searching)
They do not keep a file per entry, so you shouldnt have to worry much about zillions of files.
None of the free database options specialize in full text search - dont try to force them to do what you want.
I think you should deploy your own (intranet-wide) stack of Grafana, Logstash + ElasticSearch
When setup once you have a flexibel schema, retention and a wonderful UI for your data with Grafana.