Apache Ignite cluster replication to another cluster

I have a use case where I have to replicate Apache Ignite data (persistent + non-persistent) from one cluster (hosted on AWS) to another (hosted on GCP), and eventually shut down the original cluster.
I have come across GridGain, which offers data center replication and a Kafka Connect integration. Though this looks promising, both are available only in the Enterprise Edition.
I am more inclined towards using open source, so I need guidance on whether there is a good way of doing this replication. Please point me to the relevant resources.
Also, guidance on which GridGain replication method should be considered would be appreciated.

Related

Implementing multi-datacenter Cassandra with Phantom driver

I'm working with Cassandra 3.x and the Phantom driver (Scala),
and modifying my Cassandra deployment from a simple three-node cluster to a multi-datacenter deployment that consists of two datacenters:
Transactional - the "main" datacenter, to which all reads/writes occur (except for reads/writes done by some analytics job).
Analytics - a datacenter used for analytics purposes only. The analytics job should operate (i.e. read/write to) on this datacenter.
Both datacenters are configured with the proper snitch and replication factor strategies.
Based on this article ("Workload Separation" section), I'm supposed to be able to read/write from the "Transactional" datacenter and run analytics jobs on the "Analytics" datacenter; however, I'm not sure how to get this to work with the Phantom driver.
How can I configure the driver to read/write from the proper datacenter?
Will setting the hosts in the ContactPoints class to nodes from the Transactional datacenter only do the trick?
By default, the Java driver 3.x uses the so-called DCAware load balancing policy combined with the TokenAware policy. The local data center can be configured explicitly with the withLocalDc function of the policy builder; if it is omitted, the driver uses the datacenter of the first contact point reached at initialization. So you can point Phantom only at servers in the transactional DC, and it will work only with that DC (unless you're using non-local consistency levels such as QUORUM, SERIAL, EACH_QUORUM, etc.).
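For illustration, here is a minimal sketch of that setup against the raw Java driver 3.x; the contact-point addresses and the "Transactional" DC name are placeholders, and Phantom's ContactPoints builder can, as far as I know, be handed an equivalently configured cluster:

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    object LocalDcCluster {
      // Pin the driver to the "Transactional" DC; TokenAware additionally
      // routes each request to a replica of the partition being touched.
      def build(): Cluster =
        Cluster.builder()
          .addContactPoints("10.0.0.1", "10.0.0.2") // nodes in the Transactional DC
          .withLoadBalancingPolicy(
            new TokenAwarePolicy(
              DCAwareRoundRobinPolicy.builder()
                .withLocalDc("Transactional")
                .build()))
          .build()
    }

With this policy in place, reads and writes at LOCAL_ONE/LOCAL_QUORUM stay in the Transactional DC, while the Analytics DC still receives the data through normal replication.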

Microservices master/slave pattern

There are scenarios where you want to run a cluster of microservices for high availability, but you would like just one of them to execute a specific operation (consuming from a queue, polling a database).
What are the best practices for this use case? Should one use ZooKeeper as a registry, or are there other suitable technologies?
There are a couple of technologies for service registration and discovery. Please see if the following articles help:
StackShare's comparison of Consul vs. ZooKeeper vs. Eureka
A nice paper on service discovery, and a guide on how to make the choice
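For the "exactly one instance does the work" part specifically, the usual answer is leader election rather than plain registration/discovery. Below is a minimal sketch using Apache Curator's LeaderSelector recipe on ZooKeeper; the ensemble address, ZNode path, and the polling loop body are placeholders:

    import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
    import org.apache.curator.framework.recipes.leader.{LeaderSelector, LeaderSelectorListenerAdapter}
    import org.apache.curator.retry.ExponentialBackoffRetry

    object SingletonWorker extends App {
      val client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",           // placeholder ZooKeeper ensemble
        new ExponentialBackoffRetry(1000, 3))
      client.start()

      val selector = new LeaderSelector(client, "/myservice/leader",
        new LeaderSelectorListenerAdapter {
          // Runs only on the instance currently holding leadership;
          // returning from this method relinquishes it.
          override def takeLeadership(c: CuratorFramework): Unit = {
            while (!Thread.currentThread().isInterrupted) {
              // consume from the queue / poll the database here
              Thread.sleep(1000)
            }
          }
        })
      selector.autoRequeue() // rejoin the election if leadership is ever lost
      selector.start()
    }

Every instance runs this same code; ZooKeeper guarantees that only one of them holds leadership at a time, and another instance takes over if the leader dies.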

Citus: is a 2 node PGSQL cluster doable and if yes how?

I am thinking of using the Citus open-source edition for a two-node cluster. My questions are basically two:
- If this kind of clustering is available: in the case of a failover, is the slave node promoted to master? If yes, how? Does it use the WAL?
- If this kind of clustering is not possible, what is an alternative apart from pgpool?
Thank you.
Citus isn't a high-availability solution for single-node PostgreSQL. Citus shards/partitions your data across multiple servers and can thus use multiple CPU cores in parallel for your queries or transactions.
Citus is suitable for a variety of use cases, and you can find more information on those here.
For high availability, Citus can replicate data across multiple nodes, or you can set up streaming replication for each worker node. Citus Cloud uses streaming replication for each node, and you can find more information on how Citus Cloud manages HA in our documentation.
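To make the sharding point concrete, here is a minimal sketch, assuming a Citus coordinator at a placeholder address and a pre-existing events table; create_distributed_table is the Citus function that spreads a table's rows across the worker nodes:

    import java.sql.DriverManager

    object CitusShardingExample extends App {
      // Connection details are placeholders for your coordinator node.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://coordinator.example.com:5432/mydb", "postgres", "secret")
      try {
        // Shard the existing events table by user_id across the workers.
        conn.createStatement()
          .execute("SELECT create_distributed_table('events', 'user_id')")
      } finally conn.close()
    }

After this call, queries against events are routed (and parallelized) across the workers by the coordinator.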

How to use kafka and storm on cloudfoundry?

I want to know whether it is possible to run Kafka as a cloud-native application, and whether I can create a Kafka cluster as a service on Pivotal Web Services. I don't want only client integration; I want to run the Kafka cluster/service itself.
Thanks,
Anil
I can point you at a few starting points; there would be some work involved to go from those starting points to something fully functional.
One option is to deploy the Kafka cluster on Cloud Foundry (e.g. Pivotal Web Services) using Docker images. Spotify has Dockerized Kafka and kafka-proxy (including ZooKeeper). One thing to keep in mind is that PWS currently doesn't support apps with persistence (although this work is starting), so if you were to go this route right now, you would lose the data in Kafka when the application is rolled. Looking at that Spotify repo, the Docker images are generally run without any mounted volumes, so this persistence-less Kafka seems like it may be a valid use case (I don't know enough about Kafka to say).
The other option is to deploy Kafka directly on some IaaS (e.g. AWS) using BOSH. BOSH can be hard if you're seeing it for the first time, but it is the ideal way to deploy any distributed software that you want running on VMs. You will also be able to have persistent volumes attached to your Kafka VMs if necessary. Here is a Kafka BOSH release which may work.
Once you have your cluster running, you have two ways to integrate your Cloud Foundry applications with it. The simplest is to provide it to your applications as a "user-provided service", which lets you flow Kafka cluster access info to your apps. The alternative would be to put a service broker in front of your cluster, which would be especially useful if you have many different people pushing apps that need to talk to the Kafka cluster. Rather than you having to manually tell people the access info each time, they can do something simple like cf bind-service SOME_APP YOUR_KAFKA_SERVICE. Here is a Kafka service broker, along with more info about service brokers in general.
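As a hedged sketch of the user-provided-service route: you would create the service with something like cf create-user-provided-service my-kafka -p '{"brokers":"..."}' and then read the credentials out of VCAP_SERVICES inside the app. The service name and the "brokers" key below are assumptions, not a fixed convention, and a real app should use a proper JSON library instead of the dependency-free regex used here:

    object KafkaBinding extends App {
      val vcap = sys.env.getOrElse("VCAP_SERVICES", "{}")
      // Naive extraction of the assumed "brokers" credential; use a real
      // JSON parser in production code.
      val brokersPattern = "\"brokers\"\\s*:\\s*\"([^\"]+)\"".r
      val brokers = brokersPattern
        .findFirstMatchIn(vcap)
        .map(_.group(1))
        .getOrElse(sys.error("no Kafka binding found in VCAP_SERVICES"))
      println(s"bootstrap.servers=$brokers") // feed this into your Kafka client config
    }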
According to the 12-factor app description (https://12factor.net/processes), Kafka should not run as an application on top of Cloud Foundry:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Kafka is often described as a "distributed commit log" and as such carries a large amount of state. Many companies use it to keep all events flowing through their distributed system of microservices for a long (sometimes unlimited) amount of time.
Therefore I would strongly recommend going for the second option in the accepted answer: Kafka topics should be bound to your applications in the form of stateful services.

Is my RabbitMQ cluster Active Active or Active Passive?

I have created a cluster consisting of three RabbitMQ nodes using the join_cluster command,
i.e.
rabbitmqctl -n rabbit2@MYPC1 join_cluster rabbit2@MYPC1
(currently the cluster runs on a single computer)
Questions:
In the documentation it says there is one implementation for active/passive and one for active/active.
What did I configure?
How do I know?
How can it be changed?
Is there a big performance trade off between Active Active & Active Passive?
What is the best practice to interact with active/active?
i.e. install a load balancer? Apache doing round-robin?
What is the best practice to interact with active/passive?
If I interact only with the active node, this is a single point of failure.
Thanks.
I have been doing some research into availability options with RabbitMQ and while I am still fairly new, I'll attempt to answer your questions with the knowledge I do have. Please understand that these answers are not intended to be comprehensive.
Before getting to the questions and answers, I think it's worth pointing out that I think using the terms Active/Active and Active/Passive in the context of a cluster running on a single computer does not really apply. Active/Active and Active/Passive are typically terms used to describe highly available clusters where you have a system of more than one logical server (in your case, multiple RabbitMQ clusters), shared/redundant storage, network capabilities, power, etc.
What did I configure?
Without any load balancing for the nodes in your cluster, or queue mirroring, you have neither; that is, you do not have a highly available cluster.
How do I know?
RabbitMQ does not provide any connection management so traffic with a failed node will not automatically be passed on to a different node, which is required for an active/active cluster. Without queue mirroring you do not have fully redundant nodes in your cluster, which is required for active/passive.
How can it be changed?
Even if you implement load balancing and/or queue mirroring you are missing a number of requirements to offer a highly-available RabbitMQ cluster. Primarily, with a RabbitMQ cluster you only have a single logical broker (at least two are required for an HA cluster).
Is there a big performance trade off between Active Active & Active Passive?
I think you will start seeing performance penalties as you introduce data replication and/or redundancy, which would affect both Active/Active and Active/Passive. Synchronous data replication carries a bigger performance hit than asynchronous replication. There's a lot more to it, but my sense is that Active/Active may carry the bigger hit, though this depends heavily on how fast all of the pieces work together. In Active/Passive, where you may be replicating asynchronously across servers, your performance may appear better, but in a failover situation you would need to wait for that replication to complete before you can switch to your secondary server.
What is the best practice to interact with active/active? i.e. install a load balancer? apache that will round robin
RabbitMQ recommends using a load balancer so that you do not have to leak details about the nodes in your cluster to the clients.
What is the best practice to interact with active/passive? if I interact with only the active - this is a single point of failure
It is a point of failure but with Active/Passive you can implement a failure strategy to retry the next available server or all remaining servers. With these strategies in place you can establish a scenario where the capabilities of your cluster are merely degraded while a failover is happening instead of totally unavailable. Also, you can interact with the passive side but the types of interactions may be very different (i.e. read-only access) since there may be fewer resources available on the passive side and there may be delays in data replication.
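A hedged sketch of that retry strategy with the RabbitMQ Java client (host names are placeholders): the client walks the address list until a node accepts the connection, and automatic recovery re-establishes it elsewhere if that node later fails.

    import com.rabbitmq.client.{Address, ConnectionFactory}

    object FailoverConnection extends App {
      val factory = new ConnectionFactory()
      factory.setAutomaticRecoveryEnabled(true) // reconnect if the current node dies

      val nodes = Array(                        // placeholder cluster nodes
        new Address("rabbit1.example.com", 5672),
        new Address("rabbit2.example.com", 5672),
        new Address("rabbit3.example.com", 5672))

      val conn = factory.newConnection(nodes)   // tries each address in order
      val channel = conn.createChannel()
      // ... declare queues / publish / consume as usual ...
    }

Note that without queue mirroring this failover only restores connectivity; queue contents that lived on the failed node are still unavailable.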
Here are some references used to gather this information:
High-Availability Cluster on Wikipedia
Clustering with RabbitMQ
Highly Available Queues in a RabbitMQ Cluster
High Availability in RabbitMQ