What are the major differences between Redis and Membase?
Scalability:
Membase offers a distributed key/value store (just like Memcached), so writes and reads will always be performed in predictably constant time regardless of how large your data set is. Redis, on the other hand, offers just master-slave replication, which speeds up reads but does not speed up writes.
Data Redundancy:
With Membase it's simple to set up a cluster with a set number of replicated copies of each key-value pair, allowing the cluster to fail over an inoperative node without losing data. Redis' master-slave replication doesn't offer this same type of data redundancy, however.
Data Type:
Redis offers the ability to handle lists in an atomic fashion out of the box; with Membase, one would have to implement similar functionality in the application logic layer.
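To make the atomic-list point concrete, here is a minimal sketch using the redis-py client (host, port, and key names are assumptions, not from the original answer):

```python
import redis

# Connect to a local Redis instance (host/port are assumptions).
r = redis.Redis(host="localhost", port=6379)

# LPUSH and RPOP are each atomic on the server, so multiple
# producers and consumers can safely share one list.
r.lpush("jobs", "job-1", "job-2")
job = r.rpop("jobs")  # b"job-1"
```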
Adoption:
Currently Redis is more widely adopted and a bit more mature than Membase. Membase does have a few high-profile use cases, such as Zynga and its slew of social games.
Membase has recently merged with Couchbase, and the next major release of Membase (scheduled around early 2011) will offer CouchDB's Map/Reduce and query/index abilities.
Membase is a massive key-value store with persistence and replication for failover. The data stored in Membase is not subject to "modification" (besides increment); you get or set it.
Redis is more of a key-data store. Redis allows the manipulation of sets, lists, sorted sets, hashes, and a few other data types. While Redis has replication, it is a master/slave type of replication.
I'm adding some points to Manto's answer:
Redis has a built-in transaction mechanism (MULTI/EXEC), while Membase does not. Depending on your workload, this may be critical; see the first sketch after this list.
Master-master replication has some cons compared to master-slave: looser consistency (lazy objects, async propagation, ...) and more complexity (hence some added latency).
The current version of Redis (2.x) does not support clustering; you'll need to shard the database manually (see http://antirez.com/post/redis-presharding.html and the second sketch after this list), while Membase supports clustering out of the box and has a pretty nice monitoring GUI.
(Benchmarks can be misleading, but people just love them.) Redis seems to have a slight performance edge in heavily concurrent cases. (http://coder.cl/2011/06/concurrency-in-redis-and-memcache/)
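On the transaction point, a minimal sketch of MULTI/EXEC through redis-py's pipeline API (the key names are made up for illustration):

```python
import redis

r = redis.Redis()

# MULTI/EXEC via a transactional pipeline: both commands are
# queued client-side and applied atomically on EXEC.
with r.pipeline(transaction=True) as pipe:
    pipe.incr("orders:count")
    pipe.lpush("orders:recent", "order-42")
    pipe.execute()
```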
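And on the clustering point, a toy sketch of the kind of client-side sharding Redis 2.x requires; a real setup would use consistent hashing as in the presharding article (node addresses are assumptions):

```python
import hashlib
import redis

# A fixed pool of Redis nodes; ports are illustrative.
nodes = [redis.Redis(port=p) for p in (6379, 6380, 6381)]

def node_for(key: str) -> redis.Redis:
    # Naive hash-based sharding; production setups would use
    # consistent hashing so nodes can be added without mass rehashing.
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

node_for("user:1001").set("user:1001", "alice")
```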
Related
I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, there is quite a lot of data duplicated that way: Many of the data exchanged via Kafka would also need to be stored to Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested in figuring out the real-world pros and cons to be expected when using, e.g., ksqlDB's pull queries feature to use Kafka as a distributed database [1].
I'm a bit wary, though, about what to expect in terms of performance and scalability, especially when compared to Cassandra, as well as about unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem during which you will be unable to query.
Using any database (with persistent storage) will not have this problem, and it will always be able to serve queries immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.
Disclaimer: I'm quite new to both the etcd and ZooKeeper projects.
I've recently become interested in distributed open-source products.
I found that they seem to require coordination systems: ZooKeeper for Presto DB and Hive, and etcd for Kubernetes. I think that understanding the role of etcd and ZooKeeper is the first step to understanding distributed systems.
But now I feel like I'm getting lost... I could not yet understand what the good and unique points of etcd and ZooKeeper are. They look to me like well-distributed key-value stores or file systems.
Here is the impression I have of the products. I know these impressions don't fully reflect their features, but I don't know what the remaining features are that I should know about.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.
Sequential consistency and atomicity are the unique features not supported by most file systems; the others are common among file systems.
etcd: According to the README of etcd, it focuses on:
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of these seem common with Amazon S3 (though S3 doesn't support such fast access).
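To make the "curl'able" point concrete, a minimal sketch hitting the etcd v2 keys API from Python (the endpoint and key path are assumptions):

```python
import requests

# etcd v2 exposes keys over plain HTTP+JSON (endpoint is an assumption).
base = "http://127.0.0.1:2379/v2/keys"

# Set a key, then read it back.
requests.put(f"{base}/config/feature-flag", data={"value": "on"})
resp = requests.get(f"{base}/config/feature-flag")
print(resp.json()["node"]["value"])  # "on"
```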
I know those products are very good ones, because most distributed open-source products depend on them. But what is the key, unique feature for which distributed open-source products choose them?
I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
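As a minimal sketch of the watch-plus-lock pattern using the kazoo Python client (the znode paths and host are assumptions):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # host is an assumption
zk.start()

# Watch a znode: the callback fires whenever its data changes,
# which is how clients observe cluster state.
@zk.DataWatch("/app/config")
def on_change(data, stat):
    print("config changed:", data)

# Distributed lock built on ZooKeeper's ephemeral sequential znodes.
lock = zk.Lock("/app/locks/resource", "client-1")
with lock:
    pass  # only one client at a time runs this section

zk.stop()
```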
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.
I am using CouchDB as my NoSQL database for a CRM solution.
CouchDB uses a master-master replication.
Compared to this, MongoDB uses master-slave replication.
Being a newcomer to NoSQL, I would like to clearly understand the benefits of master-master replication over master-slave replication.
In a master-master architecture, you can distribute the power to the places it's most needed. In a CRM, you may want a single point of authority (the head office), but authoritative content may be created by anyone (sales reps, VPs, tech support agents). Master-master lets you bring the canonical data source as close as possible to every content owner/creator in that scenario.
In master-slave architectures, everyone must be able to reach the canonical authoritative source or risk (at least) having their content overwritten, or not being able to write at all.
Apache CouchDB is particularly well suited to master-master replication and, coupled with PouchDB, can provide applications that work offline-first, cloud optional. These apps can then synchronize their changes when an Internet connection is again available.
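For a concrete sense of how that looks, a sketch triggering CouchDB's built-in replication over HTTP from Python (server URLs and database names are assumptions):

```python
import requests

# Two CouchDB servers; URLs are illustrative.
office = "http://head-office:5984"
laptop = "http://localhost:5984"

# Continuous replication in both directions makes the setup
# master-master: each side accepts writes and syncs the other.
for source, target in [(office, laptop), (laptop, office)]:
    requests.post(
        f"{source}/_replicate",
        json={"source": f"{source}/crm", "target": f"{target}/crm",
              "continuous": True},
    )
```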
In master-master replication, clients can both read and write on any of the instances. It has its use case where we want to write independently to two databases. Imagine you have one database in Asia and one in North America: clients in Asia will write to the DB in Asia, and clients in North America will write to the database in North America. This gives low latency.
If one of the instances goes down, clients can write to the other instance. This provides high availability.
In master-master replication there is the possibility of a write conflict: if you make a change on db1 and on db2 at about the same time, and both modify the same record, reconciling the data between the two DBs will result in a write conflict (see the sketch below). In master-slave replication, by contrast, there is only one source of truth, because we write ONLY on the master.
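As a sketch of what that conflict looks like in CouchDB specifically, you can ask for a document's conflicting revisions and resolve them yourself (the URL and document ID are assumptions):

```python
import requests

db = "http://localhost:5984/crm"  # URL is an assumption

# conflicts=true lists the losing revisions CouchDB kept around
# after both replicas modified the same document.
doc = requests.get(f"{db}/customer-42", params={"conflicts": "true"}).json()
for rev in doc.get("_conflicts", []):
    # Application-level resolution: here we simply delete the losers.
    requests.delete(f"{db}/customer-42", params={"rev": rev})
```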
I am looking for a distributed KV database for caching small binary objects, like images, with TTL. Size limitation is not a problem, as I am planning to split each object anyway to minimize latency. I need C# and Java drivers, and in the very near future I will also need a C++ driver. Databases like CouchDB and Redis seem to be document-based. Mongo supports binary data and is well documented, but it is persistent and I am not sure it is scalable in terms of throughput; Cassandra is also persistent, and I am not sure about the quality of its C++/C# drivers, plus there is the need for constant repair because of deletions.
Aerospike is commercial and also document-based. Maybe Riak with the memory or LevelDB backend (has anyone worked with its C++ client?).
Aerospike would be a perfect solution for you, for the reasons below:
Serves all your use cases.
Key-value based.
Open sourced from version 3.0. Earlier, Aerospike clusters of up to 2 nodes were open sourced, and clusters of 3 or more nodes were paid.
Can be used in caching mode with no persistence.
Supports LRU eviction and TTL.
Can store binary data.
Reasons for choosing Aerospike:
Throughput: better than Mongo/Couchbase or any other NoSQL solution. See http://www.aerospike.com/benchmark/.
I have personally seen it work fine with more than 300k read TPS and 100k write TPS concurrently.
Automatic and efficient data sharding, re-balancing, and distribution using RIPEMD-160.
Highly available in case of failover and/or network partitions.
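A minimal sketch of the caching-with-TTL use case using the aerospike Python client (the host, namespace, and set names are assumptions):

```python
import aerospike

# Host and namespace are assumptions; "cache" would be configured
# as an in-memory namespace for pure caching.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Aerospike keys are (namespace, set, primary key) tuples.
key = ("cache", "images", "thumb:42")
client.put(key, {"blob": b"\x89PNG..."}, meta={"ttl": 300})  # expires in 300 s

(_, _, record) = client.get(key)
client.close()
```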
Couchbase (not CouchDB) is a great option for you: highly scalable, easy to understand, use, and scale. It's a KV document database evolved from memcached that also offers secondary indexes through Map/Reduce, with many new things coming soon. You can still use the memcached protocol/libraries, or speed it up with the Couchbase SDKs.
Have you looked at Pivotal GemFire? Pivotal GemFire is a distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
Pivotal GemFire also has client drivers in C++, C#, and Java.
I am trying to build a distributed task queue, and I am wondering if there is any data store which has some or all of the following properties. I am looking for a completely decentralized, multi-node/multi-master, self-replicating datastore cluster, to avoid any single point of failure.
Essential
Supports Python pickled objects as values.
Persistent.
The more, the better; in decreasing order of importance (I do not expect any datastore to meet all the criteria :-)):
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
Cassandra (open-sourced by Facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.
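A hedged sketch with pycassa, covering the pickled-value and tunable-consistency points from the question (the keyspace, column family, and server address are assumptions, and must already exist):

```python
import pickle
import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

# Keyspace and column family names are assumptions.
pool = pycassa.ConnectionPool("TaskQueue", server_list=["node1:9160"])
tasks = pycassa.ColumnFamily(pool, "Tasks")

# Tunable consistency: QUORUM makes writes effectively synchronous
# across a majority of replicas.
tasks.write_consistency_level = ConsistencyLevel.QUORUM

# Column values are opaque bytes, so pickled Python objects work as-is.
tasks.insert("task:1", {"payload": pickle.dumps({"op": "resize", "id": 7})})
task = pickle.loads(tasks.get("task:1")["payload"])
```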