Cloud Run high latency when invoked from Cloud Tasks - latency

I'm experimenting with Cloud Run(fully managed) and Cloud Tasks and I have been seeing weird results in terms of latency.
I have a queue in Cloud Tasks that invokes an API in Cloud Run. The tasks are created with the following values:
{
'http_request': {
'http_method': 'GET',
'url': 'https://<app>.run.app/<endpoint>'
}
}
I have tried many queue configurations, same results.
Cloud Run is a python server that processes the request(nothing fancy) and returns a response.
The problem is that latency is so high (~15 min) for the requests coming from the queue however if I curl the endpoint curl https://<app>.run.app/<endpoint> it only takes a couple of seconds.
The 15 min. is the value I get for Cloud Run's Request Latency it does not include the delay in the queue.
I also found this in the known issues but it refers to custom domains, which I'm not using, so I'm not sure it is the same problem.
Has anyone faced(and hopefully solved 😊) something similar? What could I be doing wrong?

Related

NestJS schedualers are not working in production

I have a BE service in NestJS that is deployed in Vercel.
I need several schedulers, so I have used #nestjs/schedule lib, which is super easy to use.
Locally, everything works perfectly.
For some reason, the only thing that is not working in my production environment is those schedulers. Everything else is working - endpoints, data base access..
Does anyone has an idea why? is it something with my deployment? maybe Vercel has some issue with that? maybe this schedule library requires something the Vercel doesn't have?
I am clueless..
Cold boot is the process of starting a computer from shutdown or a powerless state and setting it to normal working condition.
Which means that the code you deployed in a serveless manner, will run when the endpoint is called. The platform you are using spins up a virtual machine, to execute your code. And keeps the machine running for a certain period of time, incase you get another API hit, it's cheaper and easier on them to keep the machine running for lets say 5 minutes or 60 seconds, than to redeploy it on every call after shutting the machine when function execution ends.
So in your case, most likely what is happening is that the machine that you are setting the cron on, is killed after a period of time. Crons are system specific tasks which run in the kernel. But if the machine is shutdown, the cron dies with it. The only case where the cron would run, is if the cron was triggered at a point of time, before the machine was shut down.
Certain cloud providers give you the option to keep the machines alive. I remember google cloud used to follow the path of that if a serveless function is called frequently, it shifts from cold boot to hot start, which doesn't kill the machine entirely, and if you have traffic the machines stay alive.
From quick research, vercel isn't the best to handle crons, due to the nature of the infrastructure, and this is what you are looking for. In general, crons aren't for serveless functions. You can deploy the crons using queues for example or another third party service, check out this link by vercel.

GKE and Task Queues

I am working on a cloud service platform that consists of getting tasks from users, executing them, and giving back the results.
TL;DR
Is there a way to have a "task queue", where tasks can be inserted via a REST API, and extracted automatically by the Google Kubernetes Engine cluster by guaranteeing an automatic scaling?
Long description
Users can send tasks in parallel, and each task is time consuming and need to be performed on a GPU. So, setting up an auto-scaling GPU cluster is what I thought of.
More in particular, in my idea, users could send tasks/data through a REST API, the REST API provides in filling a task queue, and the task queue itself will feed tasks to workers on the GPU auto-scaling cluster. Of course, there are other details (authentication, database, storage, etc.) that have to be addressed but are not the point of my question.
For reasons I don't specify here, the project is already started on the Google Cloud Platform, so switching to AWS or other providers is not an option.
For what I understood, things seem a bit different from standard Docker-only clusters in AWS, that is, we have to use the Google Kubernetes Engine (GKE) to setup the auto-scaling cluster, even for "simple" GPU-enabled Docker containers.
By looking at the not-so-exhaustive documentation, I know that queues are used, but what I don't know is whether feeding of tasks to the cluster is automatically handled. Also, the so-called "Task Queue" service has been deprecated.
Thank you!
First I thought Cloud Tasks queues may be the answer to your troubles, but more this post seems to promote Cloud Pub/Sub as a better alternative.
After a quick chat with batch developers, the current solution (before the batch service become public) is to adopt a third-party queue system like Slurm.

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?
We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...

Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
#yliaog this is an issue for us, as we are running kubernetes pods as
batch processes and tracking the state of the pods with a static
client. Once the client object is initialized, it does no refresh, and
therefore any job that takes longer than 60 minutes will fail. Looking
through python-base, it seems like we could make a wrapper class that
generates a new client (or refreshes the config) every n minutes, or
checks status prior to every call (as #mvle suggested). The best fix
would be in swagger-codegen, but a temporary solution would probably
be very useful for a lot of people.
- #flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the kubernetes python client.

Issues with postgres_operator in Airflow dag

I am currently using Airflow 1.8.2 to schedule some EMR tasks and then execute some long running queries on our Redshift cluster. For that purpose I am using the postgres_operator. The queries take about 30 minutes to run. However, once they are done, the connection never closes and the operator runs for an hour and a half more till its terminated at the 2 hour mark every time. The message on termination is that the server closed the connection unexpectedly.
I've checked the logs on Redshift's end and it shows the queries have run and the connection has been closed. Somehow, that is never communicated back to Airflow. Any directions of what more I could check would be helpful. To give some more info, my Airflow installation is an extension of the https://github.com/puckel/docker-airflow docker image, is run in an ECS cluster and has SQLite as backend since I am still testing Airflow out. Also, I'm using the sequential executor for the backend. I would appreciate any help in this matter.
We had similar issue before but I am using SQLAlchemy to Redshift, if you are using postgres_operator, it should be very similar. It seems Redshift will close the connection if it doesn't see any activity for a long running query, in your case, 30 mins are pretty long query.
Check https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html
you have three settings, tcp_keepalives_idle, tcp_keepalives_idle, tcp_keepalives_count, that sends a live message to redshift to indicate "Hey, I am still alive.
You can pass the following as argument, so something like this: connect_args={'keepalives': 1, 'keepalives_idle':60, 'keepalives_interval': 60}