Is it possible to use dolphinscheduler without zookeeper? - workflow

Zookeeper plays several roles in the open-source workflow framework dolphinscheduler, such as heartbeat detection among masters and workers, task queue,event listener and distributed lock.
dolphin-sche framework
Is it possible to replace it by using database (mysql)? The main reason is to simplify the project structure .

zookeeper in DS is mainly used as:
Task queue, for master sending tasks to worker
Lock, for the communication between host(masters and workers)
Event watcher. Master listens the event that worker added or removed
it costs to replace zk as mysql.

zk mainly assumes the responsibility of the registry and monitors the application status. zk is very mature in this area and is a recognized solution in the industry. If MySQL wants to do this, the technical implementation cost will be larger, and may not achieve the desired effect.
BTW, their team is currently working on the SPI development for the registry, and in later versions, perhaps you can use other components, such as etcd, to achieve similar functionality.

for now, MasterServer and the WorkerServer nodes in the system all use the Zookeeper for cluster management and fault tolerance. In addition, the system also performs event monitoring and distributed locking based on ZooKeeper. We have also implemented queues based on Redis, but we hope that DolphinScheduler relies on as few components as possible, so we finally removed the Redis implementation.
so now DolphinScheduler can't work fine without Zookeeper, maybe in the future.
DolphinScheduler System Architecture:
For more documents please refer: Official Document.

Related

Uber Cadence production setup

I was looking for a microservice orchestrator and came across Uber Cadence. I have gone through the documentation and also used it in the development setup.
I had a few questions for production scenarios:
Is it recommended to have a dedicated tasklist for the workflow and the different activities used by it? Or, should we use a single tasklist for all? Does this decision impact the scalability or performance?
When we add a new worker machine, is it a common practice to run all the workers for different activities/workflows in the same machine? Example:
Worker.Factory factory = new Worker.Factory("samples-domain");
Worker helloWorkflowWorker = factory.newWorker("HelloWorkflowTaskList");
helloWorkflowWorker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
Worker helloActivityWorker = factory.newWorker("HelloActivityTaskList");
helloActivityWorker.registerActivitiesImplementations(new HelloActivityImpl());
Worker upperCaseActivityWorker = factory.newWorker("UpperCaseActivityTaskList");
upperCaseActivityWorker.registerActivitiesImplementations(new UpperCaseActivityImpl());
factory.start();
Or should we run each activity/workflow worker in a dedicated machine?
In a single worker machine, how many workers can we create for a given activity? For example, if we have activity HelloActivityImpl, should we create multiple workers for it in the same worker machine?
I have not found any documentation for production set up. For example, how to install and configure the Cadence Service in production? It will be great if someone can direct me to the right material for this.
In some of the video tutorials, it was mentioned that, for High Availability, we can setup Cadence Service across multiple data centers. How do I configure Cadence service for that?
Unless you need to have separate flow control and rate limiting for a set of activities there is no reason to use more than one task queue per worker process.
As I mentioned in 1 I would rewrite your code as:
Worker.Factory factory = new Worker.Factory("samples-domain");
Worker worker = factory.newWorker("HelloWorkflow");
worker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
worker.registerActivitiesImplementations(new HelloActivityImpl(), new UpperCaseActivityImpl());
factory.start();
There is no reason to create more than one worker for the same activity.
Not sure about Cadence. Here is the Temporal documentation that shows how to deploy to Kubernetes.
This documentation is not yet available. We at Temporal are working on it.
You can also use Cadence helmchart https://hub.helm.sh/charts/banzaicloud-stable/cadence
I am actively working with Cadence team to have operation documentation for the community. It will be useful for those don't want to run on K8s, like myself. I will come back later as we make progress.
Current draft version: https://docs.google.com/document/d/1tQyLv2gEMDOjzFibKeuVYAA4fucjUFlxpojkOMAIwnA
will be published to cadence-docs soon.

How to use kafka and storm on cloudfoundry?

I want to know if it is possible to run kafka as a cloud-native application, and can I create a kafka cluster as a service on Pivotal Web Services. I don't want only client integration, I want to run the kafka cluster/service itself?
Thanks,
Anil
I can point you at a few starting points, there would be some work involved to go from those starting points to something fully functional.
One option is to deploy the kafka cluster on Cloud Foundry (e.g. Pivotal Web Services) using docker images. Spotify has Dockerized kafka and kafka-proxy (including Zookeeper). One thing to keep in mind is that PWS currently doesn't support apps with persistence (although this work is starting) so if you were to go this route right now, you would lose the data in kafka when the application is rolled. Looking at that Spotify repo, it looks like the docker images are generally run without any mounted volumes, so this persistence-less kafka seems like it may be a valid use case (I don't know enough about kafka to say).
The other option is to deploy kafka directly on some IaaS (e.g. AWS) using BOSH. BOSH can be hard if you're seeing it for the first time, but it is the ideal way to deploy any distributed software that you want running on VMs. You will also be able to have persistent volumes attached to your kafka VMs if necessary. Here is a kafka BOSH release which may work.
Once you have your cluster running, you have two ways to integrate your Cloud Foundry applications with it. The simplest is just to provide it to your applications as a "user-provided service", which lets you flow kafka cluster access info to your apps. The alternative would to put a service broker in front of your cluster, which would be especially useful if you have many different people who will be pushing apps that need to talk to the kafka cluster. Rather than you having to manually tell people the access info each time, they can do something simple like cf bind-service SOME_APP YOUR_KAFKA_SERVICE. Here is a kafka service broker along with more info about service brokers in general.
According to the 12-factor app description (https://12factor.net/processes), Kafka should not run as an application on top of Cloud Foundry:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Kafka is often considered a "distributed commit log" and as such carries a large amount of state. Many companies use it to keep all events flowing through their distributed system of micro services for a long (sometimes unlimited) amount of time.
Therefore I would strongly recommend to go for the second option in the accepted answer: Kafka topics should be bound to your applications in the form of stateful services.

Persistent storage for Apache Mesos

Recently I've discovered such a thing as a Apache Mesos.
It all looks amazingly in all that demos and examples. I could easily imagine how one would run for stateless jobs - that fits to the whole idea naturally.
Bot how to deal with long running jobs that are stateful?
Say, I have a cluster that consists of N machines (and that is scheduled via Marathon). And I want to run a postgresql server there.
That's it - at first I don't even want it to be highly available, but just simply a single job (actually Dockerized) that hosts a postgresql server.
1- How would one organize it? Constraint a server to a particular cluster node? Use some distributed FS?
2- DRBD, MooseFS, GlusterFS, NFS, CephFS, which one of those play well with Mesos and services like postgres? (I'm thinking here on the possibility that Mesos/marathon could relocate the service if goes down)
3- Please tell if my approach is wrong in terms of philosophy (DFS for data servers and some kind of switchover for servers like postgres on the top of Mesos)
Question largely copied from Persistent storage for Apache Mesos, asked by zerkms on Programmers Stack Exchange.
Excellent question. Here are a few upcoming features in Mesos to improve support for stateful services, and corresponding current workarounds.
Persistent volumes (0.23): When launching a task, you can create a volume that exists outside of the task's sandbox and will persist on the node even after the task dies/completes. When the task exits, its resources -- including the persistent volume -- can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task's output as its input.
Current workaround: Persist your state in some known location outside the sandbox, and have your tasks try to recover it manually. Maybe persist it in a distributed filesystem/database, so that it can be accessed from any node.
Disk Isolation (0.22): Enforce disk quota limits on sandboxes as well as persistent volumes. This ensures that your storage-heavy framework won't be able to clog up the disk and prevent other tasks from running.
Current workaround: Monitor disk usage out of band, and run periodic cleanup jobs.
Dynamic Reservations (0.23): Upon launching a task, you can reserve the resources your task uses (including persistent volumes) to guarantee that they are offered back to you upon task exit, instead of going to whichever framework is furthest below its fair share.
Current workaround: Use the slave's --resources flag to statically reserve resources for your framework upon slave startup.
As for your specific use case and questions:
1a) How would one organize it? You could do this with Marathon, perhaps creating a separate Marathon instance for your stateful services, so that you can create static reservations for the 'stateful' role, such that only the stateful Marathon will be guaranteed those resources.
1b) Constraint a server to a particular cluster node? You can do this easily in Marathon, constraining an application to a specific hostname, or any node with a specific attribute value (e.g. NFS_Access=true). See Marathon Constraints. If you only wanted to run your tasks on a specific set of nodes, you would only need to create the static reservations on those nodes. And if you need discoverability of those nodes, you should check out Mesos-DNS and/or Marathon's HAProxy integration.
1c) Use some distributed FS? The data replication provided by many distributed filesystems would guarantee that your data can survive the failure of any single node. Persisting to a DFS would also provide more flexibility in where you can schedule your tasks, although at the cost of the difference in latency between network and local disk. Mesos has built-in support for fetching binaries from HDFS uris, and many customers use HDFS for passing executor binaries, config files, and input data to the slaves where their tasks will run.
2) DRBD, MooseFS, GlusterFS, NFS, CephFS? I've heard of customers using CephFS, HDFS, and MapRFS with Mesos. NFS would seem an easy fit too. It really doesn't matter to Mesos what you use as long as your task knows how to access it from whatever node where it's placed.
Hope that helps!

Is my RabbitMQ cluster Active Active or Active Passive?

I have created a cluster consists of three RabbitMQ nodes using join_cluster command.
i.e.
rabbitmqctl –n rabbit2#MYPC1 join_cluster rabbit2#MYPC1
(currently the cluster runs on a single computer)
Questions:
In the documents it says there is one implemetation for active passive and one for active active.
What did I configure?
How do I know?
How can it be changed?
Is there a big performance trade off between Active Active & Active Passive?
What is the best practice to interact with active/active?
i.e. install a load balancer? apache that will round robin
What is the best practice to interact with active/passive?
if I interact with only the active - this is a single point f failure
Thanks.
I have been doing some research into availability options with RabbitMQ and while I am still fairly new, I'll attempt to answer your questions with the knowledge I do have. Please understand that these answers are not intended to be comprehensive.
Before getting to the questions and answers, I think it's worth pointing out that I think using the terms Active/Active and Active/Passive in the context of a cluster running on a single computer does not really apply. Active/Active and Active/Passive are typically terms used to describe highly available clusters where you have a system of more than one logical server (in your case, multiple RabbitMQ clusters), shared/redundant storage, network capabilities, power, etc.
What did I configure?
Without any load balancing for the nodes in your cluster or queue mirroring you have neither, meaning you do not have a highly available cluster.
How do I know?
RabbitMQ does not provide any connection management so traffic with a failed node will not automatically be passed on to a different node, which is required for an active/active cluster. Without queue mirroring you do not have fully redundant nodes in your cluster, which is required for active/passive.
How can it be changed?
Even if you implement load balancing and/or queue mirroring you are missing a number of requirements to offer a highly-available RabbitMQ cluster. Primarily, with a RabbitMQ cluster you only have a single logical broker (at least two are required for an HA cluster).
Is there a big performance trade off between Active Active & Active Passive?
I think you will start seeing performance penalties as you start introducing data replication and/or redundancy, which would affect both Active/Active and Active/Passive. If you are using synchronous data replication then you will see a bigger performance hit than if you replicate data asynchronously. There's a lot more to it, but to me this feels like there may be a bigger performance hit by using Active/Active but this depends heavily on how fast all of the pieces are working together. In Active/Passive where you may be using asynchronous replication across servers your performance may appear better but in a failover situation you would need to wait for that replication to complete before you can switch to your secondary server.
What is the best practice to interact with active/active? i.e. install a load balancer? apache that will round robin
RabbitMQ recommends using a load balancer so that you do not have to leak details about the nodes in your cluster to the clients.
What is the best practice to interact with active/passive? if I interact with only the active - this is a single point of failure
It is a point of failure but with Active/Passive you can implement a failure strategy to retry the next available server or all remaining servers. With these strategies in place you can establish a scenario where the capabilities of your cluster are merely degraded while a failover is happening instead of totally unavailable. Also, you can interact with the passive side but the types of interactions may be very different (i.e. read-only access) since there may be fewer resources available on the passive side and there may be delays in data replication.
Here are some references used to gather this information:
High-Availability Cluster on Wikipedia
Clustering with RabbitMQ
Highly Available Queues in a RabbitMQ Cluster
High Availability in RabbitMQ

Distributed Actors in Akka

I'm fairly new to Akka and new to distributed programming in general. Using Akka's Mist component, I've created supervised actors to handle HTTP requests asynchronously. Everything is currently running on one physical machine with local actors. What I don't understand is how to build a truly fault-tolerant system with more than one box. As stated in the Akka docs:
Also, you (usually) need to know if one box is down and/or the service you are talking to on the other box is down. Here actor supervision/linking is a critical tool for not only monitoring the health of remote services, but to actually manage the service, do something about the problem if the actor or node is down. Such as restarting actors on the same node or on another node.
How do I do this? I'm looking for an example or pointers on how to begin making my application distributed. Other services in our group use Apache gateways in front of multiple Tomcat instances, so the event of a Tomcat server going down is transparent to the user. I'm deploying my service to the Akka microkernel and need to achieve a similar level of high availability across more than one physical box.
I'm using Akka 1.1.3.
Remote supervision works only with client-managed remote actors for the Akka 1.x series.
Akka 2.0 that is currently under development will support transparent clustering, cluster-wide supervision and cluster-wide lifecycle monitoring.
You might consider putting an HTTP load balancer in front of Akka Microkernel instances running Mist, this would match what your group does with 'Apache gateways'.
Another approach would be to expose remote actors on a number of instances and then use Akka's LoadBalancer or Actor Pool to send messages around, see here
The second approach is a bit of a pain if you have a dynamic pool of machines, because the pool of devices wants to be specified programatically. Akka 2.0 addresses this with cluster support that is setup in the akka.conf file.
As far as the release date of 2.0, for what its worth 1.2 was just recently released on 2011-Sept-19.