Key/Value distributed database for caching binary data - nosql

I am looking for a distributed KV database for caching small binary objects, such as images, with a TTL. Size limits are not a problem, as I am planning to split each object anyway to minimize latency. I need C# and Java drivers, and in the very near future I will also need a C++ driver. Databases like CouchDB and Redis seem to be document based. Mongo supports binary data and is well documented, but it is persistent and I am not sure it scales in terms of throughput. Cassandra is also persistent, and I am not sure about the quality of its C++/C# drivers, plus it needs constant repair because of deletions.
Aerospike is commercial and also document based. Maybe Riak with a memory or LevelDB backend (has anyone worked with its C++ client?)

Aerospike would be a perfect solution for you, for the reasons below:
Serves all your use cases
Key Value based.
Open source since version 3.0. Earlier, an Aerospike cluster of up to 2 nodes was open source, and a cluster of 3 or more nodes was paid.
Can be used in Caching mode with no persistence.
Supports LRU and TTL.
Can save binary data.
Reasons for choosing Aerospike
Throughput: better than Mongo, Couchbase, or other NoSQL solutions according to Aerospike's published benchmarks. See http://www.aerospike.com/benchmark/.
I have personally seen it work fine with more than 300k read TPS and 100k write TPS concurrently.
Automatic and efficient data sharding, data re-balancing and data distribution using RIPEMD160.
Highly available in case of failover and/or network partitions.
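As a rough illustration of the pattern discussed in this thread (splitting a binary object into chunks and caching each chunk with a TTL), here is a minimal sketch. It uses a plain in-memory dictionary as a stand-in for the KV store and is not the API of Aerospike or any other client. The class name `TTLChunkCache`, the `key:index` naming scheme, and the `key:meta` record are assumptions made for this example.

```python
import time


class TTLChunkCache:
    """In-memory stand-in for a KV store with per-record TTL (hypothetical, not a real client)."""

    def __init__(self, chunk_size=64 * 1024, clock=time.monotonic):
        self.chunk_size = chunk_size
        self.clock = clock
        self._store = {}  # key -> (expires_at, bytes)

    def _put(self, key, value, ttl):
        self._store[key] = (self.clock() + ttl, value)

    def _get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Lazily evict expired records on read.
            del self._store[key]
            return None
        return value

    def put_blob(self, key, blob, ttl):
        """Split blob into fixed-size chunks and store each one with the same TTL."""
        chunks = [blob[i:i + self.chunk_size]
                  for i in range(0, len(blob), self.chunk_size)] or [b""]
        # The meta record holds the chunk count so the reader knows what to fetch.
        self._put(f"{key}:meta", str(len(chunks)).encode(), ttl)
        for i, chunk in enumerate(chunks):
            self._put(f"{key}:{i}", chunk, ttl)

    def get_blob(self, key):
        """Reassemble the blob, or return None if any part is missing or expired."""
        meta = self._get(f"{key}:meta")
        if meta is None:
            return None
        parts = []
        for i in range(int(meta)):
            part = self._get(f"{key}:{i}")
            if part is None:
                return None
            parts.append(part)
        return b"".join(parts)
```

With a real store, the chunks could be fetched in parallel, which is where the latency benefit of splitting would come from; the sketch fetches them sequentially for simplicity.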

Couchbase (not CouchDB) is a great option for you. It is highly scalable and easy to understand and use. It's a KV document database evolved from memcached that also offers secondary indexes through Map/Reduce, with many new things coming soon. You can still use the memcached protocol and libraries, or speed things up with the Couchbase SDKs.

Have you looked at Pivotal GemFire? Pivotal GemFire is a distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
Pivotal GemFire also has client drivers in C++, C#, and Java.

Related

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, quite a lot of data is duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query until the restore finishes.
Any database with persistent storage will not have this problem and can always be queried immediately.
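To make the cold-start point concrete, here is a minimal sketch of rebuilding a key-value state store from a changelog. It is a plain Python model, not the Kafka Streams API: the changelog is a list of `(key, value)` records, `None` stands for a tombstone, and the function name `restore` and the `checkpoint` parameter are assumptions made for this example.

```python
def restore(changelog, store=None, checkpoint=0):
    """Replay changelog records from `checkpoint` into `store`.

    Returns the store and the next offset to read from.
    """
    store = {} if store is None else store
    for offset in range(checkpoint, len(changelog)):
        key, value = changelog[offset]
        if value is None:
            # Tombstone: the key was deleted.
            store.pop(key, None)
        else:
            store[key] = value
    return store, len(changelog)


# Cold start: no local state survived, so every record is replayed.
log = [("a", 1), ("b", 2), ("a", None), ("b", 3)]
store, next_offset = restore(log)

# Warm start: local state and its offset survived (for example on a
# persistent volume), so only the new tail of the changelog is replayed.
log.append(("c", 4))
store, next_offset = restore(log, store=store, checkpoint=next_offset)
```

The cost of the cold start grows with the length of the changelog, while the warm start only pays for records written since the last checkpoint.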
I'm not sure I would suggest Cassandra for strictly KV data, though.

Apache Ignite with PostgreSQL

Objective: To scale existing application where PostgreSQL is used as a data store.
How can Apache Ignite help: We have an application which has many modules, and all the modules use some shared tables. So we have only one PostgreSQL master database, and it is already on large AWS SSD machines. We already use Redis for caching, but a known limitation of Redis is that partial updates and queries on secondary indexes are not easy.
Our use case:
We have two big tables: one is member and the other is subscription. It is a many-to-many relation where one member is subscribed to multiple groups, and we maintain the subscriptions in the subscription table.
The member table has around 40 million rows, and its size is around 40M x 1.5 KB plus overhead ~= 60 GB.
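The size estimate above can be written out as a quick check. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and counts only raw row data; indexes and other overhead are not included. The function name `table_size_gb` is made up for this example.

```python
ROWS = 40_000_000
BYTES_PER_ROW = 1_500  # 1.5 KB per row, decimal units (assumption)


def table_size_gb(rows, bytes_per_row):
    """Raw row data in decimal gigabytes, excluding indexes and overhead."""
    return rows * bytes_per_row / 1e9


print(table_size_gb(ROWS, BYTES_PER_ROW))  # 60.0
```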
Challenge
The challenge is that we can't archive this data, since every member is active and there are frequent updates and reads on this table.
My thought:
As far as I understand from the documentation, Apache Ignite can provide a caching layer on top of a PostgreSQL table.
Now, I have a couple of questions from an implementation point of view.
Will Apache Ignite fit our use case? If yes, then:
Will Apache Ignite keep all 60 GB of data in RAM? Or can we distribute the RAM load across multiple machines?
To update the PostgreSQL table, we use Python and SQLAlchemy (ORM). Will there be a separate call to Apache Ignite to update the same record in memory, or is there any way for Apache Ignite to sync it immediately from the database?
Is there enough support for Python?
Is there REST API support for interacting with Apache Ignite, so that I can avoid an ODBC connection?
What if this load doubles in the next year?
A quick answer is much appreciated. Thanks in advance.
Yes, it should fit your case.
Apache Ignite has persistence, meaning it can optionally store the data on disk, but if you employ it for caching only, it will happily store everything in RAM.
There are two approaches. You can do your updates through Apache Ignite (which will propagate them to PostgreSQL), or you can do your updates in PostgreSQL and have Apache Ignite fetch them on first use (pull from PostgreSQL). As you can imagine, the latter only works for new records. There is no support for propagating data from PostgreSQL to Apache Ignite. I guess you could do something like that using triggers, but that is untested.
There is a third-party client; I haven't tried it. For now, Apache Ignite only has built-in native clients for C++, C#, and Java; other platforms can only connect through JDBC/ODBC/REST and use only a fraction of the functionality.
There is a REST API, and it has improved recently.
120 GB doesn't sound like anything scary as far as Apache Ignite is concerned.
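The two update approaches described in this answer can be sketched as follows. This is a toy model, not Ignite's API: a dictionary stands in for the PostgreSQL table, and the class name `ReadWriteThroughCache` and its methods are assumptions made for the example.

```python
class ReadWriteThroughCache:
    """Toy cache in front of a backing store (a dict standing in for a database table)."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def put(self, key, value):
        # Write-through: the update goes to the backing store and the cache together.
        self.db[key] = value
        self.cache[key] = value

    def get(self, key):
        # Read-through: a record is pulled from the backing store on first use only.
        if key not in self.cache:
            if key not in self.db:
                return None
            self.cache[key] = self.db[key]
        return self.cache[key]


db = {1: "alice"}
cache = ReadWriteThroughCache(db)
cache.get(1)                 # loaded from the backing store on first use
db[1] = "alice-v2"           # direct database update that bypasses the cache
stale = cache.get(1)         # still the old value: the cache never saw the change
db[2] = "bob"
fresh = cache.get(2)         # a new record is picked up, since it was never cached
```

This shows why updating the database directly only works for records the cache has not loaded yet, while updates made through the cache keep both sides in step.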
In addition to alamar's answer:
You can store your data in memory on many machines, as Ignite supports partitioned caches that are divided into parts and distributed between machines. You can configure data collocation and the number of backups.
There is an interesting memory model in Apache Ignite that allows you to persist data to disk quickly. As the Ignite developers have said, a database behind the cluster will be slower than Ignite persistence, because communication goes through external protocols.
In our company we have a huge Ignite cluster that keeps much more data in RAM.
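The idea of a partitioned cache with backups can be sketched in a few lines. This is a simplified model and not Ignite's actual affinity function: the CRC32 hash, the partition count of 1024, the consecutive-node placement, and the function names are all assumptions made for illustration.

```python
import zlib


def partition_for(key, partitions=1024):
    """Map a key to a partition with a deterministic hash."""
    return zlib.crc32(str(key).encode()) % partitions


def owners(partition, nodes, backups=1):
    """Return the primary node followed by backup nodes for a partition.

    Copies are placed on consecutive nodes, so no node holds two copies
    of the same partition.
    """
    count = min(backups + 1, len(nodes))
    return [nodes[(partition + i) % len(nodes)] for i in range(count)]


nodes = ["node-1", "node-2", "node-3"]
placement = owners(partition_for("member:42"), nodes, backups=1)
# placement[0] is the primary, placement[1] holds the backup copy.
```

Each node ends up holding roughly one third of the primary partitions here, so the RAM load is spread across machines, and each backup costs one extra copy of the data.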

Propagate change in distributed in-memory cache

I have an application deployed on a cluster of 1000 commodity boxes. While starting, each instance of the application loads a non-trivial amount of data from a database and uses it as a cache. During a day, around 20% of this cached data needs to be updated.
What are efficient ways to update the in-memory data of the entire cluster near-simultaneously? I thought of JMX or ZooKeeper, but I am not sure whether they would be efficient and fast enough.
Well, assuming you're using Memcached's consistent hashing, go a step further and have each cache replicate to its closest successor. This lessens the problem without entirely eliminating it, but it's a simple solution. Gossip plus CRDTs is another solution; Dynamo and Riak use a combination of gossip, consistent hashing, and CRDTs.
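The "replicate to the closest successor" idea can be sketched with a small consistent-hash ring. This is a simplified illustration, not Memcached's or Riak's implementation: it uses one ring point per node (no virtual nodes) and MD5 purely as a convenient hash, and the `Ring` class is made up for this example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a point on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes):
        self._points = sorted((_hash(node), node) for node in nodes)
        self._hashes = [point for point, _ in self._points]

    def owners(self, key, replicas=2):
        """Return the primary node for key, followed by its closest successors."""
        start = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        count = min(replicas, len(self._points))
        return [self._points[(start + i) % len(self._points)][1]
                for i in range(count)]


nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
ring = Ring(nodes)
primary, successor = ring.owners("user:17")

# If the primary leaves the ring, the key now maps to the old successor,
# which already holds a replica.
smaller_ring = Ring([node for node in nodes if node != primary])
new_primary = smaller_ring.owners("user:17")[0]
```

Because the successor is exactly the node that takes over a key when its primary disappears, replicating there means a node failure does not turn into a burst of cache misses for that node's keys.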

I am looking for a key-value datastore which has the following (preferable) properties

I am trying to build a distributed task queue, and I am wondering if there is any data store that has some or all of the following properties. I am looking for a completely decentralized, multi-node/multi-master, self-replicating datastore cluster to avoid any single point of failure.
Essential
Supports Python pickled object as Value.
Persistent.
The more the better, in decreasing order of importance (I do not expect any datastore to meet all the criteria. :-))
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
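For the first essential requirement, note that a pickled object is just a byte string, so any store that accepts binary values can hold it. A minimal sketch, with a dictionary standing in for the datastore; the helper names `put_task` and `get_task` are made up for this example.

```python
import pickle

store = {}  # stand-in for any key-value store that accepts bytes values


def put_task(task_id, task):
    """Serialize the task with pickle and store the resulting bytes."""
    store[task_id] = pickle.dumps(task, protocol=pickle.HIGHEST_PROTOCOL)


def get_task(task_id):
    """Load the bytes back and rebuild the original object."""
    return pickle.loads(store[task_id])


put_task("task-1", {"func": "resize_image", "args": (640, 480), "retries": 3})
task = get_task("task-1")
```

Unpickling can execute arbitrary code, so this only makes sense when every writer to the datastore is trusted.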
Cassandra (open-sourced by Facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.
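One common way to reason about tunable consistency is the quorum overlap rule: with N replicas, a read that contacts R replicas is guaranteed to see the latest acknowledged write to W replicas when R + W > N. The sketch below only illustrates that arithmetic. It does not use any Cassandra client, and the function names are made up for this example.

```python
def quorum(replication_factor):
    """Size of a majority quorum for the given number of replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


n = 3
strong = read_sees_latest_write(n, quorum(n), quorum(n))  # quorum writes and reads
weak = read_sees_latest_write(n, 1, 1)                    # single-replica writes and reads
```

Waiting for fewer acknowledgements lowers latency at the cost of this overlap guarantee, which is the trade the consistency levels expose.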

What is the major difference between Redis and Membase?

What are the major differences between Redis and Membase?
Scalability:
Membase offers a distributed key/value store (just like Memcache), so writes and reads will always be performed in predictably constant time regardless of how large your data set is. Redis, on the other hand, offers just master-slave replication, which speeds up reads but does not speed up writes.
Data Redundancy
With Membase, it's simple to set up a cluster with a set number of replicated copies of each key-value pair, which allows the cluster to fail over an inoperative node without losing data. Redis' master-slave replication doesn't offer this same type of data redundancy, however.
Data Type:
Redis offers the ability to handle lists in an atomic fashion out of the box, but one can implement similar functionality in the application logic layer with Membase.
Adoption:
Currently Redis is more widely adopted and a bit more mature than Membase. Membase does have a few high-profile use cases, such as Zynga and their slew of social games.
Membase has recently merged with CouchOne to form Couchbase, and the next major release (scheduled around early 2011) will be a version of Membase that offers CouchDB's Map/Reduce and query/index abilities.
Membase is a massive key-value store with persistence and replication for failover. The data stored in Membase is not subject to "modification" (besides increment). You get or set it.
Redis is more of a key-data store. Redis allows the manipulation of sets, lists, sorted sets, hashes, and a few other data types. While Redis has replication, it is more of a master/slave type of replication.
I'm adding some points to Manto's answer:
Redis has a built-in transaction mechanism, while Membase does not. Depending on your workload, this may be critical.
Master-master replication has some drawbacks compared to master-slave: looser consistency (lazy objects, async...) and more complexity (hence some added latency).
The current version of Redis (2.x) does not support clustering. You'll need to shard the database manually (check http://antirez.com/post/redis-presharding.html), while Membase supports clustering out of the box and has a pretty nice monitoring GUI.
(Benchmark may be ** but people just love dirty things) Redis seems to have a slight performance edge in heavily concurrent cases. (http://coder.cl/2011/06/concurrency-in-redis-and-memcache/)
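The manual sharding mentioned above can be sketched as client-side key routing. This is a simplified illustration of the presharding idea: keys map to a fixed number of logical shards, and a separate table maps shards to servers. It does not use any Redis client, and the shard count, the CRC32 hash, and the server names are assumptions made for this example.

```python
import zlib

LOGICAL_SHARDS = 128  # fixed up front so keys never change shard (assumption)


def shard_for(key):
    """Map a key to a logical shard with a deterministic hash."""
    return zlib.crc32(key.encode()) % LOGICAL_SHARDS


def server_for(key, shard_map):
    """Look up which server currently hosts the key's logical shard."""
    return shard_map[shard_for(key)]


# Start with two servers, each hosting half of the logical shards.
shard_map = {shard: f"redis-{shard % 2}" for shard in range(LOGICAL_SHARDS)}
before = server_for("user:1001", shard_map)

# Later, move one logical shard to a new server by editing only the map;
# the key-to-shard hashing stays the same.
shard_map[shard_for("user:1001")] = "redis-2"
after = server_for("user:1001", shard_map)
```

Keeping the number of logical shards fixed means adding capacity only requires moving whole shards between servers, rather than rehashing every key.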