Is there any CPU limit for Service Workers? - progressive-web-apps

What happens if a Service Worker accidentally enters an infinite loop, or what is stopping a Service Worker of a website to be using my computer for crypto-mining for example?

Related

Airflow fault tolerance

I have 2 questions:
first, what does it mean that the Kubernetes executor is fault tolerance, in other words, what happens if one worker nodes gets down?
Second question, is it possible that the whole Airflow server gets down? if yes, is there a backup that runs automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that faced me while learning apache airflow, I have read the documentation
but it did not mention how fault tolerance is handled
what does it mean that the Kubernetes executor is fault tolerance?
Airflow scheduler use a Kubernetes API watcher to watch the state of the workers (tasks) on each change in order to discover failed pods. When a worker pod gets down, the scheduler detect this failure and change the state of the failed tasks in the Metadata, then these tasks can be rescheduled and executed based on the retry configurations.
is it possible that the whole Airflow server gets down?
yes it is possible for different reasons, and you have some different solutions/tips for each one:
problem in the Metadata: the most important part in Airflow is the Metadata where it's the central point used to communicate between the different schedulers and workers, and it is used to save the state of all the dag runs and tasks, and to share messages between tasks, and to store variables and connections, so when it gets down, everything will fail:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8S cluster but in HA mode (doc for postgresql)
problem with the scheduler: the scheduler is running as a pod, and the is a possibility to lose depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problem
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) for the scheduler to activate HA mode, in this case if a scheduler gets down, there will be other schedulers up
problem with webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problem
It's a stateless service, so you can create multiple replicas without any problem, if one gets down, you will access the UI/API using the other replicas

Airflow tasks failing with SIGTERM when worker pod downscaling

I am running an airflow cluster on EKS on AWS. I have setup some scaling config for worker setup. If CPU/Mem > 70% then airflow spins up new worker pod. However I am facing an issue when these worker pods are scaling down. When worker pods start scaling down, two things happen:
If no tasks is running on a worker pod, it terminates within 40sec.
If any task is running on a worker pod, it terminates in about 8min, and after one more minute, I find the task failing on UI.
I have setup below two properties in helm chart for worker pod termiantion.
celery:
## if celery worker Pods are gracefully terminated
## - consider defining a `workers.podDisruptionBudget` to prevent there not being
## enough available workers during graceful termination waiting periods
##
## graceful termination process:
## 1. prevent worker accepting new tasks
## 2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
## 3. send SIGTERM to worker
## 4. wait AT MOST `workers.terminationPeriod` for kill to finish
## 5. send SIGKILL to worker
##
gracefullTermination: true
## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
##
gracefullTerminationPeriod: 180
## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
## to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
I can see that worker pod should shutdown after 5 mins or irrespective task running or not. So I am not sure why I see total of 8 min for worker pod termination. And my main issue is there any way I can setup config so that worker pod only terminates when task running on it finishes execution. Since tasks in my dags can run anywhere between few minutes to few hours so I don't want to put a large value for gracefullTerminationPeriod. I Would appreciate any solution around this.
Some more info: Generally the long running task is a python operator which runs either a presto sql query or Databricks job via Prestohook or DatabricksOperator respectively. And I don't want these to recivie SIGTERM before they complete their execution on worker pod scaling down.
This is not possible due to limitations from K8 end. More details are available here. However by using a large value of "gracefulTerminationPeriod" works, although this is not what I intended to do but it works better than I originally thought. When large value of gracefulTerminationPeriod is set, workers doesn't wait around for gracefulTerminationPeriod time to terminate. If a worker pod is marked for termination it terminates as soon as tasks running on it reaches zero.
Until K8 accept proposed changes and new community helm chart is released, I think this is the best solution without incurring costs of keeping worker up.

Stateless Worker service in Service Fabric restarted in the same process

I have a stateless service that pulls messages from an Azure queue and processes them. The service also starts some threads in charge of cleanup operations. We recently ran into an issue where these threads which ideally should have been killed when the service shuts down continue to remain active (definitely a bug in our service shutdown process).
Further looking at logs, it seemed that, the RunAsync methods cancellation token received a cancellation request, and later within the same process a new instance of the stateless service that was registered in ServiceRuntime.RegisterServiceAsync was created.
Is this expected behavior that service fabric can re-use the same process to start a new instance of the stateless service after shutting down the current instance.
The service fabric documentation https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-hosting-model does seem to suggest that different services can be hosted on the same process, but I could not find the above behavior being mentioned there.
In the shared process model, there's one process per ServicePackage instance on every node. Adding or replacing services will reuse the existing process.
This is done to save some (process-level) overhead on the node that runs the service, so hosting is more cost-efficient. Also, it enables port sharing for services.
You can change this (default) behavior, by configuring the 'exclusive process' mode in the application manifest, so every replica/instance will run in a separate process.
<Service Name="VotingWeb" ServicePackageActivationMode="ExclusiveProcess">
As you mentioned, you can monitor the CancellationToken to abort the separate worker threads, so they can stop when the service stops.
More info about the app model here and configuring process activation mode.

Kubernetes Handling a Sudden Request of Processing Power (Such as a Python Script using 5 Processes)

I have a little scenario that I am running in my mind with the following setup:
A Django Web Server running in Kubernetes with the ability to autoscale resources (Google Kubernetes Engine), and I set the resource values to be requesting nodes with 8 Processing Units (8 Cores) and 16 GB Ram.
Because it is a web server, I have my frontend that can call a Python script that executes with 5 Processes, and here's what I am worried about:
I know that If I run this script twice on my webserver (located in the same container as my Django code), I am going to be using (to keep it simple) 10 Processes/CPUs to execute this code.
So what would happen?
Would the first Python script be ran on Pod 1 and the second Python script (since we used 5 out of the 8 processing units) trigger a Pod 2 and another Node, then run on that new replica with full access to 5 new processes?
Or, would the first Python script be ran on Replica 1, and then the second Python script be throttled to 3 processing units because, perhaps, Kubernetes is allocating based on CPU usage in the Replica, not how much processes I called the script with?
If your system has a Django application that launches scripts with subprocess or a similar mechanism, those will always be in the same container as the main server, in the same pod, on the same node. You'll never trigger any of the Kubernetes autoscaling capabilities. If the pod has resource limits set, you could get CPU utilization throttled, and if you exceed the configured memory limit, the pod could get killed off (with Django and all of its subprocesses together).
If you want to take better advantage of Kubernetes scheduling and resource management, you may need to restructure this application. Ideally you could run the Django server and each of the supporting tasks in a separate pod. You would then need a way to trigger the tasks and collect the results.
I'd generally build this by introducing a job queue system such as RabbitMQ or Celery into the mix. The Django application adds items to the queue, but doesn't directly do the work itself. Then you have a worker for each of the processes that reads the queue and does work. This is not directly tied to Kubernetes, and you could run this setup with a RabbitMQ or Redis installation and a local virtual environment.
If you deploy this setup to Kubernetes, then each of the tasks can run in its own deployment, fed by the work queue. Each task could run up to its own configured memory and CPU limits, and they could run on different nodes. With a little extra work you can connect a horizontal pod autoscaler to scale the workers based on the length of the job queue, so if you're running behind processing one of the tasks, the HPA can cause more workers to get launched, which will create more pods, which can potentially allocate more nodes.
The important detail here, though, is that a pod is the key unit of scaling; if all of your work stays within a single pod then you'll never trigger the horizontal pod autoscaler or the cluster autoscaler.

Marathon how to health check a background worker

I got a standard rails app with 2 process types, the web process and the worker process. both running as tasks on marathon.
Is there a way to define health check to the worker process, the process does not listen on any port, whats recommended here?