Gearman persistent storage solutions - persistence

What is the optimal way to introduce a data persistence system in Gearman, with performance in mind?
I'm asking because we are thinking of moving away from our MySQL-based queue system and moving to Gearman. It seems rather odd to use a relational database again for persisting data in the queue, so we are looking for other possibilities.
I know of libdrizzle, libsqlite, etc., but I'm thinking more along the lines of NoSQL. What are good, proven, and stable solutions?

If you run
$ gearmand -h
it should show you which queue options are available.
I think the only NoSQL option available by default is memcached.
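As a rough sketch of why the queue backend matters: below is how a client might submit a background job with the python-gearman library (the library, server address, and task name are assumptions for illustration, not part of the answer above). With a persistent queue configured on gearmand (e.g. the memcached backend), a job like this survives a daemon restart instead of living only in memory.

import json
import gearman  # python-gearman client library (assumed available)

# Connect to gearmand; 4730 is the default gearmand port.
client = gearman.GearmanClient(['localhost:4730'])

# Submit a background job: the client does not wait for the result, so until
# a worker picks it up the job exists only in gearmand's queue. With a
# persistent queue backend the job is also written to that store and can be
# replayed after a restart.
client.submit_job('resize_image',
                  json.dumps({'path': '/tmp/example.jpg'}),
                  background=True,
                  wait_until_complete=False)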

Related

Provide sync between PostgreSQL and NoSQL database

I've got a PostgreSQL DB with very normalized data, so many requests involve a lot of joins and the DB is slow. I want to denormalize the data from PostgreSQL and store it in a NoSQL DB for read-only access. For that I must provide synchronization between PostgreSQL and the NoSQL store (a small amount of latency is acceptable). I want to consider different approaches so I can choose the most suitable one.
I could use events from the models when there are changes and put them into a queue; a worker would then process the events and add the necessary data to the NoSQL store. However, I've got a lot of low-quality legacy code and I don't want to change it much. Alternatively, I could denormalize the data and keep it in PostgreSQL, but I don't know whether that is a suitable solution.
What solutions exist for such tasks?
I did some research on this topic, and here are my results.
There are several ways to solve this task; I'll describe three general approaches.
1) You can use signals (ORM signals, for example) in your app to get notified about changes.
Push them onto a queue: RabbitMQ if the volume of changes is small, Kafka if it is large. It's a simple solution for uncomplicated, well-written apps (a minimal sketch of this approach follows this list).
If you have a complex architecture and a lot of legacy code, then you should choose one of the following approaches.
The general idea behind these approaches is described here.
2) Use PostgreSQL logical decoding to get events about changes; it's a very powerful feature. I found two solutions that use it: 1. bottledwater with Kafka - it works, but it is no longer developed. 2. Debezium - it works and has an active community.
3) Use PostgreSQL logical decoding to get events about changes and write your own tool to consume them.
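A minimal sketch of approach 1, assuming a Django app publishing to RabbitMQ via the pika library (the model name, queue name, and connection settings are made up for illustration):

import json
import pika
from django.db.models.signals import post_save
from django.dispatch import receiver
from myapp.models import Order  # hypothetical model

@receiver(post_save, sender=Order)
def publish_change(sender, instance, created, **kwargs):
    # Push a small change event onto a durable RabbitMQ queue; a separate
    # worker consumes these events and writes the denormalized document
    # to the NoSQL store.
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='order_changes', durable=True)
    channel.basic_publish(exchange='',
                          routing_key='order_changes',
                          body=json.dumps({'id': instance.pk, 'created': created}))
    connection.close()

Opening a connection per save keeps the sketch short; a real worker would reuse a connection or hand publishing off to a background task.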

no single point of failure with traditional RDBMS

I am working on a trading application that depends on an Oracle DB.
The DB has crashed twice, and the business owner wants a solution in which the application keeps working even if the DB crashes.
My team leader proposed Cassandra (NoSQL) as a solution since it has no single point of failure, but this option would move us from the traditional relational model to the NoSQL model, which I consider a drawback.
My question here: is there a way to avoid a single point of DB failure with a traditional relational DBMS like MySQL, PostgreSQL, etc.?
Sounds like you just need a cluster of Oracle database instances, such as Oracle RAC, rather than a single instance.
If your solution for the Oracle server being offline is to use Cassandra, what happens if the Cassandra cluster goes down? And are you really in a situation where it makes sense to rewrite and re-architect your entire application to use a different type of data store, just to avoid downtime from Oracle? I would suspect this only makes sense for applications with huge usage and load numbers, where any downtime is going to cost serious money (and not just cause embarrassment to the business folks in front of their bosses).
Is there a way to avoid a single point of DB failure with traditional relational DBMS
No, that's not possible with a single node: when that node dies, it is gone.
Any fault-tolerant system will use several nodes that replicate each other. You can still use traditional RDBMS, but you will need to configure mirroring in order for the system to tolerate a node failure.
NoSQL isn't the only possible solution. You can set up replication with MySQL:
http://dev.mysql.com/doc/refman/5.0/en/replication-solutions.html
and
http://mysql-mmm.org/
and, concerning failover, see this discussion:
http://serverfault.com/questions/274094/automated-failover-strategy-for-master-slave-mysql-replication-why-would-this
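Whatever replication setup you choose, the application still has to pick a live node. A very rough client-side sketch of that idea in Python with the PyMySQL library (host names and credentials are placeholders; in practice a proxy or a tool like MMM usually handles this):

import pymysql  # assumed MySQL client library

# Hypothetical primary and replica; a failover tool or proxy would normally
# decide which node is currently writable.
DB_HOSTS = ['db-primary.example.com', 'db-replica.example.com']

def get_connection():
    """Try each configured host in order and return the first that answers."""
    last_error = None
    for host in DB_HOSTS:
        try:
            return pymysql.connect(host=host, user='app', password='secret',
                                   database='trading', connect_timeout=2)
        except pymysql.MySQLError as exc:
            last_error = exc  # this node is down, try the next one
    raise last_error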

Which NoSQL database for extremely high volumes of data

I'm looking at NoSQL for extremely high volumes of data. We're storing cached versions of web page text in MySQL at the moment, but it seems like the database will get huge very quickly.
My requirements are:
Durability, must not lose data on flushes/writes
Very fast read, reasonably fast write
Fully consistent replication
Preferably, in-memory plus an eventual disk write
I'm looking at MongoDB, Redis, Riak, and Cassandra right now.
Which best fits my requirements?
I have experience with Redis and MongoDB, but would not recommend either for your use case. Redis is awesome in every regard, but since it's RAM-only and has no clustering features (yet; they are in development), it doesn't scale very well. MongoDB I wouldn't ever use again for anything bigger than a small replica set.
Basically, MongoDB is immature and completely unsuitable for any kind of high volume, high performance requirements. It has a global write lock which is held during disk flushes, which means that performance can vary wildly depending on what you do. In practice it makes updates that grow documents impossible, and you need to be very careful with deletes, too. Speaking of deletes, they fragment the database severely, so if you do a lot of deletes your performance is going to suffer.
Sharding in 1.8.0 through 1.8.1 was a disaster. There were complete show stopper bugs that should never have made it into a stable release. Configuration wasn't flushed properly and it was very easy to get your database into a bad state so that chunks never moved off of the primary shard. 1.8.2 solves most of them and seems more stable, but I don't trust the sharding implementation one bit. Add to this that sharding is hard even when everything works, it's not always easy to select a natural shard key, and if you don't sharding will cause you much grief.
MongoDB is really easy to work with and the feature set is really nice. The documentation, the drivers and the community are all great. MongoDB works super as a replacement for MySQL, but don't use it for anything that needs to scale out.
We're currently looking at moving to Cassandra. I find the dynamo model (e.g. no master nodes; write and read anywhere; simply add nodes to grow the cluster) compelling and the features are more or less right for us. The data model is schema less just like MongoDB, although a little more limited (you can choose between one or two level hashes, basically). I'm sure the community is good once you get into it, but so far I find it hard to find good information on how to solve common problems, and the documentation is lacking. Most of the information you find on blogs is a year old, and a lot of things have happened since then (0.7 and 0.8 seem to be really significant updates both, but most things you find are about 0.6). The drivers are also not very mature or well documented, from what I've seen so far, and everyone seems to be squabbling about whether Thrift, Avro or CQL is what should be used (and that has changed from 0.6 to 0.7 to 0.8).
Riak is interesting, for the same reasons as Cassandra, but for us a pure key-value-store is not enough, we need to be able to update without first doing a read. With Riak this isn't possible since the values are just blobs. This sounds like it wouldn't be an issue for you though.
HBase is another contender. It seems like a pain to set up and run because of the many different pieces, ZooKeeper, HDFS, etc. But the data model is similar to Cassandra (columnar, i.e. one level hashes), which works well for us, but may not be important for you. It seems tried and true, but as with MongoDB you have to watch out for sharding issues, you must put some thought into your keys or you get into trouble.
There is also CouchDB, Project Voldemort and countless other possible choices. I think that if you are serious about "extremely high volumes of data" then it's between Cassandra, Riak and HBase. Strike Riak if pure key-value-storage isn't enough. Depending on what you mean by "fully consistent replication" then Cassandra and Riak are out, because there is a possibility (not necessarily big, and tunable) of reading a stale value.
In the end you obviously have to try it out on your particular use case, so all you really should take home from this answer is: don't bother with MongoDB.
Store the cached versions in MemCache instead of MySQL. It will eliminate most writes. Writing to MySQL is bad, because it kills the query cache. When you cache the pages in MemCache, you will have far less writes to the database, and you'll have less reading pressure too. You can cache the result of complex queries, or cache entire pages as you like.
Maybe it won't be as fast as Cassandra, but it will give you an enormous boost compared to your current situation with only MySQL. And you won't have to rewrite your entire application.
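A minimal cache-aside sketch of that suggestion, using the python-memcached client (the module, key scheme, and TTL are assumptions for illustration):

import hashlib
import memcache  # python-memcached client (assumed available)

mc = memcache.Client(['127.0.0.1:11211'])

def get_page_text(url, fetch_from_source):
    # Hash the URL so the key stays within memcached's key-length limits.
    key = 'page:' + hashlib.md5(url.encode('utf-8')).hexdigest()
    text = mc.get(key)
    if text is None:                      # cache miss
        text = fetch_from_source(url)     # e.g. query MySQL or re-render
        mc.set(key, text, time=3600)      # cache for an hour
    return text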
memcachedb - memcached protocol, BDB storage, replication, etc.
HandlerSocket - MySQL InnoDB plugin.
Oracle memcached InnoDB plugin.
RavenDB can store up to 16TB of data per node, and you can have several nodes per machine acting as one database using its built-in sharding support. That's as huge as it gets.
Durability, speed, and replication are all there, and running in memory is supported too (but not recommended if you want to scale to 16TB per node).
For extremely high-volume data, it's clear that Cassandra and Hadoop/HBase are far superior to all the others for this task. Cassandra has proved itself on large clusters of around 400 nodes. RDBMSs cannot scale out easily, and Mongo has some problems when node counts start to increase: http://www.nosqlbenchmarking.com/2011/05/paper-on-elasticity-and-scalability-for-acm-socc-2011/
Serdar

What is the best database/storage to store statistic data?

I have a system that collects real-time Apache log data from about 90-100 web servers. I have also defined some URL patterns.
Now I want to build another system that updates the number of times each pattern occurs, based on those logs.
I thought about using MySQL to store the statistics and updating them with a statement like:
"Update table set count=count+1 where ....",
but I'm afraid that MySQL will be slow for data from that many servers. Moreover, I'm looking for database/storage solutions that are more scalable and simpler. (As an RDBMS, MySQL supports too many things that I don't need in this situation.) Do you have any ideas?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but it will have no problem handling large amounts of data.
A more simple solution would be a key-value store, like Redis. It's easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes. It has no single point of failure nor any bottlenecks, so it's easier to scale out.
Key-value storage seems to be an appropriate solution for my system. After taking a quick look at those stores, I'm concerned about race conditions, as there will be a lot of clients trying to perform these steps on the same key:
count = storage.get(key)
storage.set(key,count+1)
I have worked with Tokyo Cabinet before, and it has an 'addint' method that matched my case perfectly; I wonder if other stores have a similar feature? I didn't choose Tokyo Cabinet/Tyrant because I had experienced some issues with its scalability and data stability (e.g. repairing corrupted data, ...)
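Several stores do offer a server-side atomic increment comparable to Tokyo Cabinet's 'addint'. For example, Redis has INCR; here is a sketch with the redis-py client (the key naming is made up for illustration):

import redis  # redis-py client (assumed available)

r = redis.Redis(host='localhost', port=6379)

def count_hit(pattern_id):
    # INCR is executed atomically on the server, so concurrent clients never
    # lose updates the way the get-then-set sequence above can.
    return r.incr('pattern:%s:count' % pattern_id)

memcached also has an incr command, exposed by most clients, though the key must be created before it can be incremented.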

Are MongoDB and CouchDB perfect substitutes?

I haven't gotten my hands dirty with either CouchDB or MongoDB yet, but I would like to do so soon... I have also read a bit about both systems, and it looks to me like they cover the same use cases... Or am I missing a key distinguishing feature?
I would like to use document-based storage instead of a traditional RDBMS in my next project. I also need the datastore to
handle large binary objects (images and videos)
automatically replicate itself to physically separate nodes
render the need for an additional RDBMS superfluous
Are both equally well suited for these requirements?
Thanks!
I've actually used both pretty extensively, for very different projects.
I'd say they are equally well suited for the requirements you list, however there are quite a lot of differences between the two. IMO the biggest is their query-ability. CouchDB doesn't have 'queries' in the RDBMS sense (select * from ...) but instead uses 'views' which are more like stored procedures (essentially, static queries defined in the database (1)). MongoDB has much more 'usual' querying.
Essentially it comes down to your application requirements. If you give more information I might be able to shed some more light on what might matter in that situation.
(1): you can have temporary, non-static queries in CouchDB, but they aren't recommended for production use
Mongo uses more "traditional" queries. You turn on indexing on a per-key basis and use a SQLish query syntax.
CouchDB's views can do much deeper indexing and relationships, but they require a little more work and an understanding of how key sorting works when querying.
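To make the query difference concrete, here is a rough sketch using the pymongo and couchdb-python client libraries (the database, collection, view, and field names are made up for illustration):

from pymongo import MongoClient
import couchdb

# MongoDB: an ad hoc query, with an index turned on for the queried key.
mongo_db = MongoClient()['example']
mongo_db.pages.create_index('site')
mongo_results = mongo_db.pages.find({'site': 'example.com'})

# CouchDB: define a static "view" (a map function stored in a design
# document), then query the view by key.
couch_db = couchdb.Server()['example']
couch_db.save({
    '_id': '_design/pages',
    'views': {'by_site': {'map': "function(doc) { emit(doc.site, null); }"}},
})
couch_results = couch_db.view('pages/by_site', key='example.com')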
There is a big difference in the replication systems as well. Mongo's replication looks a lot like most RDBMS solutions with masters and slaves and all that. CouchDB's replication is more peer to peer, no master/slave, every CouchDB is a node.
CouchDB's replication is made for keeping geographically apart sites in sync. It handles network- and other errors gracefully by restarting replication where it left off. Participating nodes can even be put offline deliberately.
Before using MongoDB, I'd recommend that you take a look at the following: http://groups.google.com/group/mongodb-user/browse_thread/thread/460dbd49a5b6b267. MongoDB has a small chance of corrupting data due to its lack of fsyncs with each write.
http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
From a developer's point of view, the biggest difference is Mongo's live (ad hoc) queries vs Couch's views (which must be "compiled").
From an operational point of view, Couch works entirely over HTTP/REST. If you're able to configure HTTP servers, you know how to set up Couch. With Mongo, instead, you have to learn how to set up config servers, replica sets, and mongos (a kind of balancer).