I'm having a system that collects real-time Apache log data from about 90-100 Web Servers. I had also defined some url patterns.
Now I want to build another system that updates the time of occurrence of each pattern based on those logs.
I had thought about using MySQL to store statistic data, update them by statement:
"Update table set count=count+1 where ....",
but i'm afraid that MySQL will be slow for data from such amount of servers. Moreover, I'm looking for some database/storage solutions that more scalable and simple. (As a RDBMS, MySQL supports too much things that I don't need in this situation) . Do you have any idea ?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but will have no problem handling large amounts of data.
A more simple solution would be a key-value store, like Redis. It's easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes. It has no single point of failure nor any bottlenecks, so it's easier to scale out.
Key value storage seems to be an appropriate solution for my system. After taking a quick look on those storages, I'm concerning about race-condition issue, as there will be a lot of clients trying to do these steps on the same key:
count = storage.get(key)
storage.set(key,count+1)
I had worked with Tokyo Cabinet before, and they have 'addint' method which perfectly matched with my case, I wonder if other storages have similar feature? I didn't choose Tokyo Cabinet/Tyrant cause I had experienced some issues about its scalability and data stability (e.g. repair corrupted data, ...)
Related
As part of my university curriculum I ended up with a real project which consists in helping a company shifting from their relational data warehouse into a NoSQL data warehouse. The thing is that what they are looking for is better performance in large jobs but so far they have used a single machine and if they indeed migrate to NoSQL they still wish to keep using a single machine for cost reasons.
As far as I know the whole point of NoSQL is to run it in a large distributed system of several machines. So I don't see the point of this migration, specially since I am pretty sure (but not entirely) that if they do install NoSQL, they will probably end having even worst performance.
But still I am not comfortable telling them this since I am still new to this area (less than a month), so I wonder, is there are any situation where using NoSQL in a single machine for a datawarehouse would be justifiable performance wise? Or is it just a plain bad idea?
The answer to your question, like the answer to so many questions, is "it depends."
Ignoring the commentary on the question, I think there may be legitimacy to your client's question. Both relational and non-relational databases ultimately hold data in key-value tuples, with indexes and such to ensure quick and speedy access to the data. The difference is that SQL/relational databases contain an incredible amount of overhead to attempt the optimal way to retrieve results given an unknown set of queries, as well as ensure stable concurrency. This overhead is both computationally expensive and rarely results in the optimal solution. As a result, SQL databases often perform significantly slower for simple repetitive queries.
No-sql databases, on the other hand, are more of a "bare-bones" database, relying on programmers and intelligent design to achieve success. They are optimized to retrieve a value for a given key very quickly, often sub-millisecond. As a result, increased up front investment in the design results in superior and near-optimal performance. It will be necessary to determine the cost-benefit of doing this up-front design, but it is all but guaranteed that the no-sql approach will perform better regardless of the number of machines involved (in fact, SQL databases are very difficult or impossible to cluster together and is one of the main reasons why NoSql was developed).
Eventually we will see relational-like solutions implemented on a no-sql platform. In fact, Mongo, Elasticsearch, and Couchbase (probably others) already have SQL-like query functionality. But right now, you are faced with this dilemma.
For a single machine if the load is write heavy e.g. your logging a lot of events you could do for cassandra. Also a good alternative is hbase but its heavy and not suggested for single node. If they expose api in json you could look into document based dbs such as couchbase, mongo db. If you have an idea about the load then selecting a nosql data store is much easier
If you're in a position where you need to pick one, I think you should look first at MongoDB. If you've never tried it, I really recommend you visit their live demo with tutorial and give it a try. If you like, download and follow the installation guide on their site. It's free, runs well on a single machine, and is incredibly easy to use.
In addition to MongoDB, I've used Oracle, SQL Server, MySQL, SQLite, and HBase. I understand Cassandra should be in the list but I've not tried it. With MongoDB, I was fully deployed and executing reads and writes from an application in like two hours. I attribute most of that to their website's clear and concise instructional content. The biggest learning curve was figuring out how the queries work for things like updating a record or deleting a record without deleting the entire set of similar records.
Regarding NoSQL vs RDBMS, some points to consider:
Adding a new column to RDBMS table can lock the database in or degrade performance in another
MongoDB is schema-less so adding a new field, does not effect old documents and will be instant (think how flexible that really is - throw any dimension of data into this system without maintenance overhead)
You're less likely to require a DBA to solve your schema problems when an application changes
I think problems related to table size are irrelevant, so you won't run into a scaling problem - just a disk space problem on single machine
Suggestions for a NoSQL datastore so that we can push data and generate real time Qlikview reports easily?
Easily means:
1. Qlikview support for reads (mongodb connector available, otherwise maybe can write a JDBC connector, otherwise maybe can write a custom QVX connector to the datastore)
Easily adaptable to changes in schema, or schemaless. We change our schema quite frequently ...
Java support for writes
Super fast reads - real time incremental access, as well as batch access for old data within a time range. I read that Cassandra excels in ranges.
Reasonably fast writes
Reasonably big data storage - 20 million rows stored per day, about 200 bytes each
Would be nice if it can scale for a years worth of data, elasticity not so important.
Easy to use, install, and run. Looking at minimal setup and configuration time.
Matlabe support for adhoc querying
Initially I don't think we need a distributed system however a cluster is a possibility.
I've looked at Mongodb, Cassandra and Hbase. I don't think going over REST is a good idea due to (theoretically) slower performance.
I'm leaning towards MongoDB at the moment due to its ease of use, matlab support, totally schema less, Qlikview support (beta connector is available). However if anyone can suggest something better that would be great!
Depending on the server infrastructure you will use, I guess the best choice is amazon's NoSQL service, avalaible in aws.amazon.com.
The fact is any DB will have a poor performance in cloud infrastructure due to the way it stores data, amazon EC2 with EBS for instance is VERY slow for this task, requiring you to connect up to 20 EBS volumes in raid to acquire a decent speed. They solved this issue creating this NoSQL service, which I never used, but seems nice.
We're designing an OLTP financial system. it should be able to support 10.000 transactions per second and have reporting features.
So we have come to the idea of using:
a NoSQL DB as our main storage
a MySQL DB (Percona server actually) making some ETLs from the NoSQL DB for store reporting data
We're considering MongoDB and Riak for the NoSQL job. we have read that Riak scales more smoothly than MongoDB. And we would like to listen your opinion.
Which NoSQL DB would you use for a
OLTP financial system?
How has been
your experience scaling MongoDB/Riak?
There is no conceivable circumstance where I would use a NOSQl database for anything to do with finance. You simply don't have the data integrity needed or the internal controls. Dow Jones uses SQL Server to do its transactions and if they can properly design a high performance, high transaction Relational datbase so can you. You will have to invest in some people who know what they are doing though.
One has to think about the problem differently. The notion of transaction consistency stems from the UD (update) in CRUD (Create, Read, Update, Delete). noSQL DBs are CRAP (Create, Replicate, Append, Process) oriented, working by accretion of time-stamped data. With the right domain model, there is no reason that auditability and the equivalent of referential integrity can't be achieved.
The global-storage based NoSQL databases - Cache from InterSystems and GT.M from FIS - are used extensively in financial services and have been for many years. Cache in particular is used for both the core database and for OLTP.
I can answer regarding my experience with scaling Riak.
Riak scales smoothly to the extreme. Scaling is as easy as adding nodes to the cluster, which is a very simple operation in itself. You can achieve near linear scalability by simply adding nodes. Our experience with Riak as far as scaling is concerned has been amazing.
The flip side is that it is lacking in many respects. Some examples:
You can't do something like count(*) or list keys on a production cluster. That would require a work around if you want to do ETL from Riak into MySQL - or how would you know what to (E)xtract?
(One possible work around would be to maintain a bucket with a known key sequence that map to values that contain the keys you inserted into your other buckets).
The free version of Riak comes with no management console that lets you know what's going on, and the one that's included in the Enterprise version isn't much of an improvement.
You'll need the Enterprise version of you're looking to replicate your data over WAN (e.g. for DR / high availability). That's alright if you don't mind paying, but keep in mind that Basho pricing is very high.
I work with the Starcounter (so I’m biased), but I think I can safely say that for a system processing financial transactions you have to worry about transaction consistency. Unfortunately, this is what the engines used for Facebook and Twitter had to give up allow their scale-out strategy to offer performance. This is not because engines such as MongoDb or Cassandra are poorly designed; rather, it follows naturally from the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem). Simply put, changes you make in your database will overwrite other changes if they occur close in time. Ok for status updates and new tweets, but disastrous if you deal with money or other quantities. The amounts will simply end up wrong when many reads and writes are being done in parallel. So for the throughput you need, a memory centric NoSQL database with ACID support is probably the way to go.
You can use some NoSQL databases (Cassandra, EventStore) as a storage for financial service if you implement your app using event sourcing and concepts from DDD. I recommend you to read this minibook http://www.oreilly.com/programming/free/reactive-microservices-architecture.html
OLTP can be achieved using NoSQL with a custom implementation,
there are two things,
1. How are you going to achieve ACID properties that an RDBMS gives.
2. Provide a custom blocking or non blocking concurrency and transaction handling mechanism.
To take you closer to solution,
Apache Phoenix,apache trafodion or Splice machine.
Trafodion has full ACID support over HBase, you should take a look.
Cassandra can be used for both OLTP and OLAP. Good replication and eventual data consistency gives you the choice in your hand. Need to design the system properly. And after all it's free of cost but not free of developer, give it a try
I'm looking at NoSQL for extremely high volumes of data. We're storing cached versions of web page text in MySQL at the moment, but it seems like the database will get huge very quickly.
My requirements are:
Durability, must not lose data on flushes/writes
Very fast read, reasonably fast write
Fully consistent replication
Preferably, in-memory plus an eventual disk write
I'm looking at: MongoDB, Redis, Raik, and Cassandra right now.
Which best fits my requirements?
I have experience with Redis and MongoDB, but would not recommend either for your use case. Redis is awesome in every regard, but since it's RAM-only and has no clustering features (yet, they are in development), it doesn't scale very well. MongoDB I wouldn't ever use again for anything that needs anything but a small replica set.
Basically, MongoDB is immature and completely unsuitable for any kind of high volume, high performance requirements. It has a global write lock which is held during disk flushes, which means that performance can vary wildly depending on what you do. In practice it makes updates that grow documents impossible, and you need to be very careful with deletes, too. Speaking of deletes, they fragment the database severely, so if you do a lot of deletes your performance is going to suffer.
Sharding in 1.8.0 through 1.8.1 was a disaster. There were complete show stopper bugs that should never have made it into a stable release. Configuration wasn't flushed properly and it was very easy to get your database into a bad state so that chunks never moved off of the primary shard. 1.8.2 solves most of them and seems more stable, but I don't trust the sharding implementation one bit. Add to this that sharding is hard even when everything works, it's not always easy to select a natural shard key, and if you don't sharding will cause you much grief.
MongoDB is really easy to work with and the feature set is really nice. The documentation, the drivers and the community are all great. MongoDB works super as a replacement for MySQL, but don't use it for anything that needs to scale out.
We're currently looking at moving to Cassandra. I find the dynamo model (e.g. no master nodes; write and read anywhere; simply add nodes to grow the cluster) compelling and the features are more or less right for us. The data model is schema less just like MongoDB, although a little more limited (you can choose between one or two level hashes, basically). I'm sure the community is good once you get into it, but so far I find it hard to find good information on how to solve common problems, and the documentation is lacking. Most of the information you find on blogs is a year old, and a lot of things have happened since then (0.7 and 0.8 seem to be really significant updates both, but most things you find are about 0.6). The drivers are also not very mature or well documented, from what I've seen so far, and everyone seems to be squabbling about whether Thrift, Avro or CQL is what should be used (and that has changed from 0.6 to 0.7 to 0.8).
Riak is interesting, for the same reasons as Cassandra, but for us a pure key-value-store is not enough, we need to be able to update without first doing a read. With Riak this isn't possible since the values are just blobs. This sounds like it wouldn't be an issue for you though.
HBase is another contender. It seems like a pain to set up and run because of the many different pieces, ZooKeeper, HDFS, etc. But the data model is similar to Cassandra (columnar, i.e. one level hashes), which works well for us, but may not be important for you. It seems tried and true, but as with MongoDB you have to watch out for sharding issues, you must put some thought into your keys or you get into trouble.
There is also CouchDB, Project Voldemort and countless other possible choices. I think that if you are serious about "extremely high volumes of data" then it's between Cassandra, Riak and HBase. Strike Riak if pure key-value-storage isn't enough. Depending on what you mean by "fully consistent replication" then Cassandra and Riak are out, because there is a possibility (not necessarily big, and tunable) of reading a stale value.
In the end you obviously have to try it out on your particular use case, so all you really should take home from this answer is: don't bother with MongoDB.
Store the cached versions in MemCache instead of MySQL. It will eliminate most writes. Writing to MySQL is bad, because it kills the query cache. When you cache the pages in MemCache, you will have far less writes to the database, and you'll have less reading pressure too. You can cache the result of complex queries, or cache entire pages as you like.
Maybe it won't be as fast as Cassandra, but it will give you an enormous boost compared to your current situation with only MySQL. And you won't have to rewrite your entire application.
memcachedb - memcached protocol, BDB storage, replication etc
Handlersocket - MySql InnoDB plugin.
Oracle memcached InnoDB plugin
RavenDB can store up to 16TB of data per node, and you can have several nodes per machine acting as one database using its built-in sharding support. Thats as huge as it gets.
Durability, fastness, replication is all there, and running in memory is supported too (but not recommended if you want to scale to 16TB per node).
For extremely high volume data, it's clear that Cassandra and hadoop/hbase are far superior than all others for this task. Cassandra proved itself on large clusters like 400 nodes. rdms dbs cannot scale easily, also mongo has some problems when node counts start to increase http://www.nosqlbenchmarking.com/2011/05/paper-on-elasticity-and-scalability-for-acm-socc-2011/
Serdar
I just wanted to know if there is a fundamental difference between hbase, cassandra, couchdb and monogodb ? In other words, are they all competing in the exact same market and trying to solve the exact same problems. Or they fit best in different scenarios?
All this comes to the question, what should I chose when. Matter of taste?
Thanks,
Federico
Those are some long answers from #Bohzo. (but they are good links)
The truth is, they're "kind of" competing. But they definitely have different strengths and weaknesses and they definitely don't all solve the same problems.
For example Couch and Mongo both provide Map-Reduce engines as part of the main package. HBase is (basically) a layer over top of Hadoop, so you also get M-R via Hadoop. Cassandra is highly focused on being a Key-Value store and has plug-ins to "layer" Hadoop over top (so you can map-reduce).
Some of the DBs provide MVCC (Multi-version concurrency control). Mongo does not.
All of these DBs are intended to scale horizontally, but they do it in different ways. All of these DBs are also trying to provide flexibility in different ways. Flexible document sizes or REST APIs or high redundancy or ease of use, they're all making different trade-offs.
So to your question: In other words, are they all competing in the exact same market and trying to solve the exact same problems?
Yes: they're all trying to solve the issue of database-scalability and performance.
No: they're definitely making different sets of trade-offs.
What should you start with?
Man, that's a tough question. I work for a large company pushing tons of data and we've been through a few years. We tried Cassandra at one point a couple of years ago and it couldn't handle the load. We're using Hadoop everywhere, but it definitely has a steep learning curve and it hasn't worked out in some of our environments. More recently we've tried to do Cassandra + Hadoop, but it turned out to be a lot of configuration work.
Personally, my department is moving several things to MongoDB. Our reasons for this are honestly just simplicity.
Setting up Mongo on a linux box takes minutes and doesn't require root access or a change to the file system or anything fancy. There are no crazy config files or java recompiles required. So from that perspective, Mongo has been the easiest "gateway drug" for getting people on to KV/Document stores.
CouchDB and MongoDB are document stores
Cassandra and HBase are key-value based
Here is a detailed comparison between HBase and Cassandra
Here is a (biased) comparison between MongoDB and CouchDB
Short answer: test before you use in production.
I can offer my experience with both HBase (extensive) and MongoDB (just starting).
Even though they are not the same kind of stores, they solve the same problems:
scalable storage of data
random access to the data
low latency access
We were very enthusiastic about HBase at first. It is built on Hadoop (which is rock-solid), it is under Apache, it is active... what more could you want? Our experience:
HBase is fragile
administrator's nightmare (full of configuration settings where default ones are less than perfect, nontransparent configuration, changes from version to version,...)
loses data (unless you have set the X configuration and changed Y to... you get the point :) - we found that out when HBase crashed and we lost 2 hours (!!!) of data because WAL was not setup properly
lacks secondary indexes
lacks any way to perform a backup of database without shutting it down
All in all, HBase was a nightmare. Wouldn't recommend it to anyone except to our direct competitors. :)
MongoDB solves all these problems and many more. It is a delight to setup, it makes administrating it a simple and transparent job and the default configuration settings actually make sense. You can perform (hot) backups, you can have secondary indexes. From what I read, I wouldn't recommend MapReduce on MongoDB (JavaScript, 1 thread per node only), but you can use Hadoop for that.
And it is also VERY active when compared to HBase.
Also:
http://www.google.com/trends?q=HBase%2CMongoDB
Need I say more? :)
UPDATE: many months later I must say MongoDB delivered on all accounts and more. The only real downside is that hosting companies do not offer it the way they offer MySQL. ;)
It also looks like MapReduce is bound to become multi-threaded in 2.2. Still, I wouldn't use MR this way. YMMV.
Cassandra is good for writing the data. it has advantage of "writes never fail". It has no single point failure.
HBase is very good for data processing. HBase is based on Hadoop File System (HDFS) so HBase dosen't need to worry for data replication, data consistency. HBase has the single point of failure. I am not really sure that what does it's mean if it has single point of failure then it is somhow similar to RDBMS where we have single point of failure. I might be wrong in sense since I am quite new.
How abou RIAK ? Does someone has experience using RIAK. I red some where that you need to pay, I am not sure. Need explanation.
One more thing which one you will prefer to use when you are only concern to reading a lot of data. You don't have any concern with writing. Just imagine you have database with pitabyte and you want to make fast search which NOSQL database would you prefer ?