What is the consistency of a PostgreSQL HA cluster with Patroni?

My understanding is that because the failover uses a consensus store (etcd or ZooKeeper), the system will stay consistent under a network partition.
Does this mean that transactions running under the Serializable isolation level will also provide linearizability?
If not, which consistency will I get: sequential consistency, causal consistency, ...?

You shouldn't mix up consistency between the primary and the replicas and consistency within the database.
A PostgreSQL database running in a Patroni cluster is a normal database with streaming replicas, so it provides the eventual consistency of streaming replication (all replicas will eventually show the same values as the primary).
Serializability guarantees that you can establish an order of the database transactions that ran against the primary such that the outcome of a serialized execution in that order is the same as the outcome the actual workload produced.
If I read the definition right, that is just the same as “linearizability”.
Since only one of the nodes in the Patroni cluster can be written to (the primary), this stays true, no matter if the database is in a Patroni cluster or not.

In a distributed context, where we have multiple replicas of an object's state, a schedule is linearizable if it behaves as if all replicas were updated at once at a single point in time.
Once a write completes, all later reads (wall-clock time) from any replica should see the value of that write or the value of a later write.
Since PostgreSQL 9.6 it is possible to have multiple synchronous standby nodes. This means that if we have 3 servers and use num_sync = 2, the primary will always wait for the write to be confirmed by the 2 standbys before acknowledging the commit.
This should satisfy the constraint of a linearizable schedule even across a failover.
Since version 1.2 of Patroni, when synchronous mode is enabled, Patroni will automatically fail over only to a standby that was synchronously replicating at the time of the master failure.
This effectively means that no user visible transaction gets lost in such a case.
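For illustration, a minimal sketch of that setup, assuming a three-node cluster with hosts node1 (primary), node2 and node3; the parameter names come from the PostgreSQL and Patroni documentation, but check the exact keys for your versions:

    # postgresql.conf (plain PostgreSQL, 9.6+ syntax): wait for 2 synchronous standbys
    # synchronous_standby_names = '2 (node2, node3)'

    # patroni.yml excerpt: let Patroni manage synchronous replication instead
    bootstrap:
      dcs:
        synchronous_mode: true          # fail over only to a synchronously replicating standby
        postgresql:
          parameters:
            synchronous_commit: "on"    # commits are acknowledged only after the sync standby confirms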

Related

Prevent data loss while upgrading Kafka with a single broker

I have a Kafka server which runs on a single node. There is only 1 node because it's a test server. But even for a test server, I need to be sure that no data loss will occur while the upgrade is in progress.
I upgrade Kafka as:
Stop Kafka, Zookeeper, Kafka Connect and Schema Registry.
Upgrade all the components.
Start upgraded services.
Data loss may occur in the first step, where Kafka is not running. I guess you can do a rolling update (?) with multiple brokers to prevent data loss, but in my case that is not possible. How can I do something similar with a single broker? Is it possible? If not, what is the best approach for upgrading?
I have to say, obviously, you are always vulnerable to data loss if you are using only one node.
If you can't have more nodes, your only option is to:
Stop producing;
Stop consuming;
Enable the controlled.shutdown.enable parameter - this ensures the broker saves its state cleanly on shutdown.
I guess the first 2 steps are quite tricky.
Unfortunately, there is not much to play with - Kafka was not designed to be fault-tolerant with only one node.
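For completeness, controlled shutdown is a broker-side setting in server.properties (it has been enabled by default for a long time, so it may already be on). A sketch:

    # server.properties - make the broker sync its logs and shut down cleanly before the upgrade
    controlled.shutdown.enable=true
    controlled.shutdown.max.retries=3
    controlled.shutdown.retry.backoff.ms=5000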
The process of a rolling upgrade is still the same for a single broker.
Existing data during the upgrade shouldn't be lost.
Obviously, if producers are still running, all their requests will be denied while the broker is down. That is why you need not only multiple brokers to prevent data loss, but also a balanced cluster (with unclean leader election disabled) whose restart cycles don't take an entire set of topics offline.
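If producers have to keep running across the restart, the standard durability settings at least make them buffer and retry instead of silently dropping records. A sketch with a made-up bootstrap address and topic name; the configs themselves are ordinary Kafka producer settings:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DurableProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");                    // broker must ack the write
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");     // retries don't create duplicates
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "300000");  // keep retrying for up to 5 minutes
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Sends issued while the single broker is down are buffered and retried;
                // once delivery.timeout.ms expires they fail and that data is lost.
                producer.send(new ProducerRecord<>("test-topic", "key", "value")); // made-up topic
            }
        }
    }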

Kafka Connect MongoDB Source Connector failure scenario

I need to use Kafka Connect to monitor changes to a MongoDB cluster with one primary and 2 replicas.
I see there is the official MongoDB connector, and I want to understand what the connector's behaviour would be in case the primary replica fails. Will it automatically read from one of the secondary replicas, which will become the new primary? I couldn't find information about this in the official docs.
I've seen this post related to the tasks.max configuration, which I thought might be related to this scenario, but the answer implies that it always defaults to 1.
I've also looked at Debezium's implementation of the connector, which seems to support this scenario automatically:
The MongoDB connector is also quite tolerant of changes in membership
and leadership of the replica sets, of additions or removals of shards
within a sharded cluster, and network problems that might cause
communication failures. The connector always uses the replica set’s
primary node to stream changes, so when the replica set undergoes an
election and a different node becomes primary, the connector will
immediately stop streaming changes, connect to the new primary, and
start streaming changes using the new primary node.
Also, Debezium's version of the tasks.max configuration property states that:
The maximum number of tasks that should be created for this connector.
The MongoDB connector will attempt to use a separate task for each
replica set, [...] so that the work for each replica set can be
distributed by Kafka Connect.
The question is - can I get the same default behaviour with the default connector - as advertised for the Debezium one? Because of external reasons, I can't use the Debezium one for now.
In a PSS deployment:
If one node is not available, the other two nodes can elect a primary
If two nodes are not available, there can be no primary
The quote you referenced suggests the connector may be using the primary read preference, which means that as long as two nodes are up it will keep working, and if only one node is up it will not retrieve any data.
Therefore, bring down two of the three nodes and observe whether you are able to query.
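As an illustration only (the host names, database and collection below are placeholders, and the property names are the commonly documented ones for the official MongoDB source connector, so verify them against your connector version): listing all replica-set members in the connection string lets the underlying MongoDB driver rediscover the new primary after an election, independently of tasks.max.

    {
      "name": "mongo-source",
      "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
        "connection.uri": "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0",
        "database": "mydb",
        "collection": "mycollection",
        "tasks.max": "1"
      }
    }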

How to deploy Kafka Stream applications on Kubernetes?

My application has some aggregation/window operations, so it has some state store that is stored in the state.dir. AFAIK, it also writes the changelog of the state store to the broker,
so is it OK to consider the Kafka Streams application as a stateless pod?
Stateless pod and data safety (= no data loss): Yes, you can consider the application as a stateless pod as far as data safety is concerned; i.e. regardless of what happens to the pod, Kafka and Kafka Streams guarantee that you will not lose data (and if you have enabled exactly-once processing, they also guarantee exactly-once semantics).
That's because, as you already said, state changes in your application are always continuously backed up to Kafka (brokers) via changelogs of the respective state stores -- unless you explicitly disabled this changelog functionality (it is enabled by default).
Note: The above is even true when you are not using Kafka's Streams default storage engine (RocksDB) but the alternative in-memory storage engine. Many people don't realize this because they read "in-memory" and (falsely) conclude "data will be lost when a machine crashes, restarts, etc.".
Stateless pod and application restoration/recovery time: The above being said, you should understand how having vs. not-having local state available after pod restarts will impact restoration/recovery time of your application (or rather: application instance) until it is fully operational again.
Imagine that one instance of your stateful application runs on a machine. It will store its local state under state.dir, and it will also continuously backup any changes to its local state to the remote Kafka cluster (brokers).
If the app instance is being restarted and does not have access to its previous state.dir (probably because it is restarted on a different machine), it will fully reconstruct its state by restoring from the associated changelog(s) in Kafka. Depending on the size of your state this may take milliseconds, seconds, minutes, or more. Only once its state is fully restored will it begin processing new data.
If the app instance is being restarted and does have access to its previous state.dir (probably because it is restarted on the same, original machine), it can recover much more quickly because it can re-use all or most of the existing local state, so only a small delta needs to be restored from the associated changelog(s). Only once its state is fully restored will it begin processing new data.
In other words, if your application is able to re-use existing local state then this is good because it will minimize application recovery time.
Standby replicas to the rescue in stateless environments: But even if you are running stateless pods you have options to minimize application recovery times by configuring your application to use standby replicas via the num.standby.replicas setting:
num.standby.replicas
The number of standby replicas. Standby replicas are shadow copies of local state stores. Kafka Streams attempts to create the specified number of replicas and keep them up to date as long as there are enough instances running. Standby replicas are used to minimize the latency of task failover. A task that was previously running on a failed instance is preferred to restart on an instance that has standby replicas so that the local state store restoration process from its changelog can be minimized.
See also the documentation section State restoration during workload rebalance
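A minimal sketch of how those settings are passed to a Streams application (application id, bootstrap address, paths and the topology are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class StandbyConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams"); // mount a volume here to reuse local state
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);             // keep one shadow copy of each state store

            StreamsBuilder builder = new StreamsBuilder();
            // ... build the actual topology (aggregations, windows, ...) here ...

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }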
Update 2018-08-29: Arguably the most convenient option to run Kafka/Kafka Streams/KSQL on Kubernetes is to use Confluent Operator or the Helm Charts provided by Confluent, see https://www.confluent.io/confluent-operator/. (Disclaimer: I work for Confluent.)
Update 2019-01-10: There's also a YouTube video that demos how to scale Kafka Streams with Kubernetes.
I think so. RocksDB is there for saving state so that operations that need the state itself can be executed quickly. As you already mentioned, the state changes are also stored in a Kafka topic, so that if the current Streams application instance fails, another instance (on another node) can use the topic to rebuild the local state and continue processing the stream where the previous one left off.
Kafka Streams uses the underlying state.dir for local storage. If the pod gets restarted on the same machine, and the volume is mounted, it will pick up from where it was immediately.
If the pod starts up on another machine where the local state is not available, Kafka Streams will rebuild the state by re-reading the backing Kafka topics.
A short video at https://www.youtube.com/watch?v=oikZg7_vy6A shows Lenses - for Apache Kafka - deploying and scaling KStream applications on Kubernetes.
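To make "if the volume is mounted" concrete: one common way is to run the application as a Kubernetes StatefulSet with a persistent volume mounted at state.dir, so each pod gets its previous local state back after a restart. A rough sketch; the name, image, storage size and path are placeholders:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: my-streams-app                             # placeholder
    spec:
      serviceName: my-streams-app
      replicas: 2
      selector:
        matchLabels:
          app: my-streams-app
      template:
        metadata:
          labels:
            app: my-streams-app
        spec:
          containers:
            - name: streams
              image: my-registry/my-streams-app:latest # placeholder image
              volumeMounts:
                - name: state
                  mountPath: /var/lib/kafka-streams    # must match the application's state.dir
      volumeClaimTemplates:
        - metadata:
            name: state
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi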

Running zookeeper on a cluster of 2 nodes

I am currently trying to use ZooKeeper in a two-node cluster. I have my own cluster-formation algorithm running on the nodes based on configuration. We only need ZooKeeper's distributed DB functionality.
Is it possible to use ZooKeeper in a two-node cluster? Do you know of any solutions where this has been done?
Can we still retain ZooKeeper's DB functionality without forming a quorum?
Note: fault tolerance is not the main concern in this project. If one of the nodes goes down, we have enough code logic to run without the ZooKeeper service. We use ZooKeeper to share data when both nodes are alive.
Would greatly appreciate any help.
ZooKeeper is a coordination system which is basically used to coordinate among nodes. When writes occur in such a distributed system, all writes go through the master (aka leader) so that the nodes can coordinate and agree upon the values being stored. Reads can be served by any node. ZooKeeper requires a leader to be elected by a quorum in order to serve write requests consistently, and it uses the ZAB protocol as its consensus algorithm.
In order to elect a leader, a majority (quorum) of the ensemble must agree, which is why an ensemble should ideally have an odd number of nodes. In your case, with two nodes, the quorum is two out of two: the ensemble can elect a leader while both nodes are up, but as soon as either node goes down (or a network partition separates them), no quorum can be formed and the ensemble stops serving requests, so two nodes actually give you worse availability than one.
As I said, ZooKeeper is not a distributed storage system. If you need to use it in a distributed manner (more than one node), it needs to form a quorum.
As I see it, what you need is a distributed database, not a distributed coordination system.
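To see the quorum math concretely, a two-server ensemble would look like the sketch below (host names are placeholders). Its quorum is 2 out of 2, so both servers must be up for ZooKeeper to serve requests at all:

    # zoo.cfg (sketch) - two-server ensemble, quorum = 2
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=node1:2888:3888
    server.2=node2:2888:3888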

In master-master/multi-master replication, who is the secondary?

Silly question, when we talk about secondaries in the context of failover behaviour, with regards to master-master/multi-master, is that basically any node other than the one that we are currently reading from or writing to?
In master-master replication both nodes act as primary and secondary at the same time. In multi-master replication every node acts as a secondary, and some or all of them also act as primaries.
Multi-master means there are many database servers to which writes can be performed. In order to stay in sync with the other data nodes, each server has to apply all of the others' writes, and in that respect it behaves as a secondary. In master-slave replication we have only one master and many slaves: the master is the only write-enabled node, so it has nothing to replicate from anyone else and behaves as a primary only.
For example, MySQL 5.6 replication supports master-master replication but not multi-master replication, whereas MySQL 5.7 also supports multi-master replication. MongoDB only supports a single-primary (master-slave style) replication model.