I am looking for a key-value datastore which has the following (preferable) properties - nosql

I am trying to build a distributed task queue, and I am wondering if there is any data store, which has some or all of the following properties. I am looking to have a completely decentralized, multinode/multi-master self replicating datastore cluster to avoid any single point of failure.
Essential
Supports Python pickled object as Value.
Persistent.
The more the better, in decreasing order of importance (I do not expect any datastore to meet all the criteria. :-))
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.

Cassandra (open-sourced by Facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.
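To make "tunable consistency" a little more concrete, here is a minimal sketch of the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names are made up for illustration and are not part of pycassa or any other client.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_acks: int,
                           read_acks: int) -> bool:
    """True when every set of read replicas must overlap every set of
    written replicas, i.e. read_acks + write_acks > replication_factor."""
    return read_acks + write_acks > replication_factor


rf = 3
q = quorum(rf)
# Quorum writes + quorum reads overlap; single-replica writes + reads do not.
print(q, read_sees_latest_write(rf, q, q), read_sees_latest_write(rf, 1, 1))
```

Waiting for fewer acknowledgements makes each operation faster but gives up the overlap guarantee, which is the trade-off the consistency level lets you tune per request.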

Related

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, quite a lot of data is duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query until the restore finishes.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.
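A toy illustration of the cold-start point, in plain Python rather than the Kafka Streams API: the changelog is modelled as a list of (key, value) records where a value of None is a tombstone, and `restore` is a hypothetical helper that replays records from a checkpoint offset on top of whatever local state survived.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); value None means "deleted"


def restore(changelog: List[Record],
            local_state: Dict[str, str],
            checkpoint: int) -> Tuple[Dict[str, str], int]:
    """Replay changelog records from offset `checkpoint` onto a copy of
    `local_state`. Returns the rebuilt store and the number of records read."""
    store = dict(local_state)
    replayed = 0
    for key, value in changelog[checkpoint:]:
        if value is None:
            store.pop(key, None)   # tombstone removes the key
        else:
            store[key] = value     # later records overwrite earlier ones
        replayed += 1
    return store, replayed


changelog = [("a", "1"), ("b", "2"), ("a", "3"), ("b", None), ("c", "4")]

# Fresh host: no local state, so the whole topic must be replayed.
cold_store, cold_reads = restore(changelog, {}, 0)

# Persistent volume: state up to offset 3 survived, only the tail is replayed.
warm_store, warm_reads = restore(changelog, {"a": "3", "b": "2"}, 3)

print(cold_store == warm_store, cold_reads, warm_reads)
```

Both paths end up with the same store, but the cold path has to read every record before it can answer a lookup, and that cost grows with the size of the changelog.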

Are all distributed databases designed to process data in parallel?

I am learning about the characteristics of distributed databases and I came across this website that describes some of their advantages:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed databases are as follows:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (I mean not concurrent writes, but processing data in parallel on each node) is not on the list. This makes me wonder: are all distributed databases (e.g. MongoDB, Cassandra, HBase) designed to process data in parallel? If not, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution

In database terms, what is the difference between replication and decentralisation?

I am currently researching different databases to use for my next project. I was wanting to use a decentralized database. For example Apache Cassandra claims to be decentralized. MongoDB however says it uses replication. From what I can see, as far as these databases are concerned, replication and decentralization are basically the same thing. Is that correct or is there some difference/feature between decentralization and replication that I'm missing?
Short answer: no, replication and decentralization are two different things. As a simple example, let's say you have three instances (i1, i2 and i3) that replicate the same data. You also have a client that fetches data from only i1. If i1 goes down you will still have the data replicated to i2 and i3 as a backup. But since i1 is down the client has no way of getting the data. This is an example of a centralized database with a single point of failure.
A centralized database has a centralized location that the majority of requests go through. It could, as in MongoDB's case, be instances that route queries to instances that can handle the query.
A decentralized database is obviously the opposite. In Cassandra any node in a cluster can handle any request. This node is called the coordinator for the request. The node then reads/writes data from/to the nodes that are responsible for that data before returning a result to the client.
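Here is a rough, in-memory sketch of the coordinator idea, reusing the i1/i2/i3 example. It is only an illustration: the `Ring` class, MD5 tokens and replication factor of 2 are assumptions made for the example and do not reflect Cassandra's real partitioner or internals.

```python
import hashlib
from bisect import bisect_right
from typing import Dict, Iterable, List, Optional


class Ring:
    """Toy cluster: keys hash onto a ring and are stored on `rf` nodes."""

    def __init__(self, nodes: List[str], rf: int = 2) -> None:
        self.rf = rf
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [token for token, _ in self._ring]
        self.data: Dict[str, Dict[str, str]] = {n: {} for n in nodes}

    @staticmethod
    def _token(text: str) -> int:
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key: str) -> List[str]:
        """Nodes responsible for `key`: next `rf` nodes clockwise on the ring."""
        start = bisect_right(self._tokens, self._token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.rf)]

    def write(self, coordinator: str, key: str, value: str) -> None:
        """Any member may coordinate; it forwards to the responsible nodes."""
        if coordinator not in self.data:
            raise KeyError(coordinator)
        for node in self.replicas(key):
            self.data[node][key] = value

    def read(self, coordinator: str, key: str,
             down: Iterable[str] = ()) -> Optional[str]:
        """Coordinator asks the live replicas and returns the first answer."""
        down = set(down)
        if coordinator not in self.data or coordinator in down:
            raise KeyError(coordinator)
        for node in self.replicas(key):
            if node not in down and key in self.data[node]:
                return self.data[node][key]
        return None


ring = Ring(["i1", "i2", "i3"], rf=2)
ring.write("i1", "task:42", "pending")
# The client can send the read to any node, not just the one it wrote through.
print([ring.read(node, "task:42") for node in ("i1", "i2", "i3")])
```

Because every node can act as coordinator, losing i1 no longer cuts the client off from the data the way it did in the centralized example.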
Decentralization means that there should be no single point of failure in your application architecture. Such systems provide a deployment scheme where no leader (or master) is elected during the service life-cycle. They often deliver services in a peer-to-peer fashion.
Replication simply means that your data is copied over to another server instance to ensure redundancy and failure tolerance. Client requests can still be served from copies, but your system should ensure some level of "consistency" when making copies.
Cassandra serves requests in a peer-to-peer fashion, meaning that clients can initiate requests to any node participating in the cluster. It also provides replication and tunable consistency.
MongoDB offers a master/slave deployment, so it is not considered decentralized. You can deploy a replica set, which elects a new master automatically, so that requests can still be served if the master node goes down. It also provides replication out of the box.

How good are ZooKeeper and Etcd?

Disclaimer: I'm quite new to the etcd and ZooKeeper projects.
I've recently become interested in distributed open source products.
I found that they seem to require configuration (coordination?) systems, such as ZooKeeper for Presto DB and Hive, and etcd for Kubernetes, and I think that understanding the role of etcd and ZooKeeper is the first step to understanding distributed systems.
But now I feel lost... I cannot yet understand what the good and unique points of etcd and ZooKeeper are. To me they look like well-distributed key-value stores or file systems.
Here are my impressions of the products. I know these impressions don't reflect all the features of the products, but I don't know what remaining features I should know about.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
Sequential consistency and atomicity are the unique features, which most file systems do not support; the others are common among file systems.
Etcd: According to the README of etcd, it focuses on being:
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of these seem to be shared with Amazon S3 (although S3 doesn't support such fast access).
I know these products are very good, because most distributed open source products depend on them. But what is the key, unique feature that makes distributed open source products choose them?
I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.
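To give a feel for how a lock can be built from these primitives, here is a single-process toy model of the well-known lock recipe: each client creates a sequentially numbered node, the lowest number holds the lock, and every other client watches the node just ahead of it. `ToyLockService` and its methods are invented for this sketch and only mimic the idea; it is not the ZooKeeper or etcd client API, and it ignores sessions, networking and failures.

```python
from typing import Callable, Dict, Optional


class ToyLockService:
    """In-memory stand-in for a consistent store with sequential nodes."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._nodes: Dict[int, str] = {}                    # seq -> client id
        self._watches: Dict[int, Callable[[], None]] = {}   # seq -> callback

    def create_sequential(self, client: str) -> int:
        """Register a contender; sequence numbers only ever increase."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = client
        return seq

    def holder(self) -> Optional[str]:
        """The client owning the lowest sequence number holds the lock."""
        return self._nodes[min(self._nodes)] if self._nodes else None

    def watch_predecessor(self, seq: int, callback: Callable[[], None]) -> bool:
        """Set a one-shot watch on the node just before `seq`.
        Returns False if there is none, meaning `seq` already holds the lock."""
        earlier = [s for s in self._nodes if s < seq]
        if not earlier:
            return False
        self._watches[max(earlier)] = callback
        return True

    def delete(self, seq: int) -> None:
        """Release the lock (or drop out of the queue) and fire any watch."""
        self._nodes.pop(seq, None)
        callback = self._watches.pop(seq, None)
        if callback is not None:
            callback()


service = ToyLockService()
a = service.create_sequential("client-A")
b = service.create_sequential("client-B")

events = []
service.watch_predecessor(b, lambda: events.append("client-B woke up"))

first_holder = service.holder()
service.delete(a)                      # client-A releases the lock
second_holder = service.holder()
print(first_holder, events, second_holder)
```

Watching only the immediate predecessor means a release wakes up one waiter instead of all of them, and the consistent ordering of sequence numbers is what makes every client agree on who holds the lock.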

What is the major difference between Redis and Membase?

What are the major differences between Redis and Membase?
Scalability:
Membase offers a distributed key/value store (just like Memcache), so writes and reads will always be performed in predictably constant time regardless of how large your data set is. Redis, on the other hand, offers just master-slave replication, which speeds up reads but does not speed up writes.
Data Redundancy
With Membase, it's simple to set up a cluster with a set number of replicated copies for each key-value pair, allowing servers to fail over an inoperative node in a cluster without losing data. Redis' master-slave replication doesn't offer this same type of data redundancy, however.
Data Type:
Redis offers the ability to handle lists in an atomic fashion out of the box, but one can implement similar functionality in the application logic layer with Membase.
Adoption:
Currently Redis is more widely adopted and a bit more mature than Membase. Membase does have a few high-profile use cases, such as Zynga and their slew of social games.
Membase has recently merged with Couchbase and they will have a version of Membase that will offer CouchDB's Map/Reduce and query/index ability in the next major release (scheduled around early 2011).
Membase is a massive key-value store with persistence and replication for failover. The data stored in Membase is not subject to "modification" (besides increment). You get or set it.
Redis is more of a key-data store. Redis allows the manipulation of sets, lists, sorted sets, hashes and a few other data types. While Redis has replication, it is more of a master/slave type of replication.
I'm adding some points to Manto's answer:
Redis has a built-in transaction mechanism, while Membase does not. Depending on your workload, this may be critical.
Master-master replication has some cons compared to master-slave: looser consistency (lazy object, async...) and more complexity (hence some added latency).
The current version of Redis (2.x) does not support clustering. You'll need to shard the database manually (check http://antirez.com/post/redis-presharding.html), while Membase supports clustering out of the box and has a pretty nice monitoring GUI.
(Benchmarks may be ** but people just love dirty things) Redis seems to have a slight performance edge in heavily concurrent cases. (http://coder.cl/2011/06/concurrency-in-redis-and-memcache/)
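On the manual sharding point above, a minimal sketch of the general idea might look like the following: fix the number of shards up front, hash each key to a shard, and keep a separate shard-to-host table so shards can later be moved without rehashing keys. The shard count, CRC32 hash and host names are all assumptions for illustration; this is not what any particular Redis client does.

```python
import zlib
from typing import Dict

NUM_SHARDS = 8  # chosen once, up front; changing it later would remap keys


def shard_for(key: str) -> int:
    """Map a key to a shard number that never changes for that key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def host_for(key: str, placement: Dict[int, str]) -> str:
    """Look up which host currently serves the key's shard."""
    return placement[shard_for(key)]


# Start with every shard on one (hypothetical) host...
placement = {shard: "redis-a:6379" for shard in range(NUM_SHARDS)}
before = host_for("user:1001", placement)

# ...and later move the upper half of the shards to a second host.
for shard in range(NUM_SHARDS // 2, NUM_SHARDS):
    placement[shard] = "redis-b:6379"
after = host_for("user:1001", placement)

print(shard_for("user:1001"), before, after)
```

Only the placement table changes when capacity is added, so the application keeps using the same key-to-shard function throughout.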