Using celery flower with AWS Ecs Fargate - celery

We have our production running on AWS ECS with fargate, where we are using multiple celery workers.
We have integrated flower to monitor our celery tasks with EFS as persistent db. Everything works fine unless we trigger a new deployment, once we do a new deployment, a new task will be up with new workers and flower consider them as different from the existing workers and existing workers will be considered as offline. Due to this we loose existing data after every deployment.
We tried hard coding worker names, But even after that it works the same way, the only difference is now it is not showing any offline workers after we trigger a deployment.
Please let me know your thoughts on this, is it okay to use flower to monitor celery? or is there any other tools we can use where we will not be facing this kind of issues? if flower is fine please let me know how we can fix this.
Thank you.

Related

Airflow fault tolerance

I have 2 questions:
first, what does it mean that the Kubernetes executor is fault tolerance, in other words, what happens if one worker nodes gets down?
Second question, is it possible that the whole Airflow server gets down? if yes, is there a backup that runs automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that faced me while learning apache airflow, I have read the documentation
but it did not mention how fault tolerance is handled
what does it mean that the Kubernetes executor is fault tolerance?
Airflow scheduler use a Kubernetes API watcher to watch the state of the workers (tasks) on each change in order to discover failed pods. When a worker pod gets down, the scheduler detect this failure and change the state of the failed tasks in the Metadata, then these tasks can be rescheduled and executed based on the retry configurations.
is it possible that the whole Airflow server gets down?
yes it is possible for different reasons, and you have some different solutions/tips for each one:
problem in the Metadata: the most important part in Airflow is the Metadata where it's the central point used to communicate between the different schedulers and workers, and it is used to save the state of all the dag runs and tasks, and to share messages between tasks, and to store variables and connections, so when it gets down, everything will fail:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8S cluster but in HA mode (doc for postgresql)
problem with the scheduler: the scheduler is running as a pod, and the is a possibility to lose depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problem
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) for the scheduler to activate HA mode, in this case if a scheduler gets down, there will be other schedulers up
problem with webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problem
It's a stateless service, so you can create multiple replicas without any problem, if one gets down, you will access the UI/API using the other replicas

Architecture a service on Kubernetes

I have a UI where I can start machine learning jobs. When a job is requested, a message is added to a PubSub (kafka) and pulled by the service that will run the job.
I have a problem with this service design. I was thinking about creating the main service on Kubernetes that will pull messages from PubSub then this main service would create pods (or rather jobs) to run the actual ML work.
However, I don't know how to make the main service monitor the "worker" jobs it creates. Do I have to do it manually by persisting the ID of the job somewhere and monitoring it? Also how to deal with the "main" service potential failure?
I feel like this is a "classic" use case but I can't find much about how to solve this.
Thanks for your help

Uber Cadence production setup

I was looking for a microservice orchestrator and came across Uber Cadence. I have gone through the documentation and also used it in the development setup.
I had a few questions for production scenarios:
Is it recommended to have a dedicated tasklist for the workflow and the different activities used by it? Or, should we use a single tasklist for all? Does this decision impact the scalability or performance?
When we add a new worker machine, is it a common practice to run all the workers for different activities/workflows in the same machine? Example:
Worker.Factory factory = new Worker.Factory("samples-domain");
Worker helloWorkflowWorker = factory.newWorker("HelloWorkflowTaskList");
helloWorkflowWorker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
Worker helloActivityWorker = factory.newWorker("HelloActivityTaskList");
helloActivityWorker.registerActivitiesImplementations(new HelloActivityImpl());
Worker upperCaseActivityWorker = factory.newWorker("UpperCaseActivityTaskList");
upperCaseActivityWorker.registerActivitiesImplementations(new UpperCaseActivityImpl());
factory.start();
Or should we run each activity/workflow worker in a dedicated machine?
In a single worker machine, how many workers can we create for a given activity? For example, if we have activity HelloActivityImpl, should we create multiple workers for it in the same worker machine?
I have not found any documentation for production set up. For example, how to install and configure the Cadence Service in production? It will be great if someone can direct me to the right material for this.
In some of the video tutorials, it was mentioned that, for High Availability, we can setup Cadence Service across multiple data centers. How do I configure Cadence service for that?
Unless you need to have separate flow control and rate limiting for a set of activities there is no reason to use more than one task queue per worker process.
As I mentioned in 1 I would rewrite your code as:
Worker.Factory factory = new Worker.Factory("samples-domain");
Worker worker = factory.newWorker("HelloWorkflow");
worker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
worker.registerActivitiesImplementations(new HelloActivityImpl(), new UpperCaseActivityImpl());
factory.start();
There is no reason to create more than one worker for the same activity.
Not sure about Cadence. Here is the Temporal documentation that shows how to deploy to Kubernetes.
This documentation is not yet available. We at Temporal are working on it.
You can also use Cadence helmchart https://hub.helm.sh/charts/banzaicloud-stable/cadence
I am actively working with Cadence team to have operation documentation for the community. It will be useful for those don't want to run on K8s, like myself. I will come back later as we make progress.
Current draft version: https://docs.google.com/document/d/1tQyLv2gEMDOjzFibKeuVYAA4fucjUFlxpojkOMAIwnA
will be published to cadence-docs soon.

Is it possible to run a single container Flink cluster in Kubernetes with high-availability, checkpointing, and savepointing?

I am currently running a Flink session cluster (Kubernetes, 1 JobManager, 1 TaskManager, Zookeeper, S3) in which multiple jobs run.
As we are working on adding more jobs, we are looking to improve our deployment and cluster management strategies. We are considering migrating to using job clusters, however there is reservation about the number of containers which will be spawned. One container per job is not an issue, but two containers (1 JM and 1 TM) per job raises concerns about memory consumption. Several of the jobs need high-availability and the ability to use checkpoints and restore from/take savepoints as they aggregate events over a window.
From my reading of the documentation and spending time on Google, I haven't found anything that seems to state whether or not what is being considered is really possible.
Is it possible to do any of these three things:
run both the JobManager and TaskManager as separate processes in the same container and have that serve as the Flink cluster, or
run the JobManager and TaskManager as literally the same process, or
run the job as a standalone JAR with the ability to recover from/take checkpoints and the ability to take a savepoint and restore from that savepoint?
(If anyone has any better ideas, I'm all ears.)
One of the responsibilities of the job manager is to monitor the task manager(s), and initiate restarts when failures have occurred. That works nicely in containerized environments when the JM and TMs are in separate containers; otherwise it seems like you're asking for trouble. Keeping the TMs separate also makes sense if you are ever going to scale up, though that may moot in your case.
What might be workable, though, would be to run the job using a LocalExecutionEnvironment (so that everything is in one process -- this is sometimes called a Flink minicluster). This path strikes me as feasible, if you're willing to work at it, but I can't recommend it. You'll have to somehow keep track of the checkpoints, and arrange for the container to be restarted from a checkpoint when things fail. And there are other things that may not work very well -- see this question for details. The LocalExecutionEnvironment wasn't designed with production deployments in mind.
What I'd suggest you explore instead is to see how far you can go toward making the standard, separate container solution affordable. For starters, you should be able to run the JM with minimal resources, since it doesn't have much to do.
Check this operator which automates the lifecycle of deploying and managing Flink in Kubernetes. The project is in beta but you can still get some idea about how to do it or directly use this operator if it fits your requirement. Here Job Manager and Task manager is separate kubernetes deployment.

Airflow: When to use CeleryExecutor and when to use MesosExecutor

I am pretty new to Airflow and trying to understand how should we set it up in our environment(on aws).
I read the Airflow uses Celery with redis broker.
How is it different from Mesos? I have not used Celery before but I tried to set up celery-redis on my dev machine and it worked with ease. But adding new components means, add more monitoring.
Since we already use mesos for our cluster management, I am trying to think what am I missing if I dont chose celery and go with MesosExecutor instead?
Using Celery is the more proven/stable approach at the moment.
For us, managing dependencies using containers is more convenient than managing dependencies on the Mesos instances, which is the case if you choose MesosExecutor. As such we are finding Celery more flexible.
We are currently using Celery + RabbitMQ but we will switch to MesosExecutor in the future though, as our codebase stabilises.
Airflow with the CeleryExecuter doesn't necessarily need to use the Redis Broker. Any broker that celery can use is compatible with airflow, though it is recommended to either use the RabbitMQ broker or the Redis Broker.
Celery is quite different from Mesos. While airflow supports the MesosExecutor too, it is recommended to use the CeleryExecutor if you are planning to distribute the workers. From what I know, Airbnb uses the CeleryExecutor and actively maintains it.
For us, the MesosExecutor cannot be used. We need an abstraction level to handle dependencies for job, we cannot (and shouldn't) rely on any dependencied being installed on the mesos slaves. When Docker container and/or Mesos Container will be supported by MesosExecutor we can turn to it. Also, I like seeing the allocated workers inside Marathon. I am working on how to autoscale workers with Marathon.
The MesosExecutor is still experimental at this stage and does not support running Docker containers, having different resource limits per task and probably many other limitations.
I plan to work on this though, it's a community effort and having spent some effort to deploy a Mesos cluster, I feel that adding Celery and another MQ broker is a waste of resources.