Issues with postgres_operator in Airflow dag - postgresql

I am currently using Airflow 1.8.2 to schedule some EMR tasks and then execute some long running queries on our Redshift cluster. For that purpose I am using the postgres_operator. The queries take about 30 minutes to run. However, once they are done, the connection never closes and the operator runs for an hour and a half more till its terminated at the 2 hour mark every time. The message on termination is that the server closed the connection unexpectedly.
I've checked the logs on Redshift's end and it shows the queries have run and the connection has been closed. Somehow, that is never communicated back to Airflow. Any directions of what more I could check would be helpful. To give some more info, my Airflow installation is an extension of the https://github.com/puckel/docker-airflow docker image, is run in an ECS cluster and has SQLite as backend since I am still testing Airflow out. Also, I'm using the sequential executor for the backend. I would appreciate any help in this matter.

We had similar issue before but I am using SQLAlchemy to Redshift, if you are using postgres_operator, it should be very similar. It seems Redshift will close the connection if it doesn't see any activity for a long running query, in your case, 30 mins are pretty long query.
Check https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html
you have three settings, tcp_keepalives_idle, tcp_keepalives_idle, tcp_keepalives_count, that sends a live message to redshift to indicate "Hey, I am still alive.
You can pass the following as argument, so something like this: connect_args={'keepalives': 1, 'keepalives_idle':60, 'keepalives_interval': 60}

Related

How to check about Redshift maintenance windows

As usual we set Redshift maintenance windows on Saturday morning, and we got several errors during that maintenance windows time.
* Query Processing Error AM5:07:01
[Amazon](500051) ERROR processing query/statement. Error: Query execution failed
[SQL State=HY000, DB Errorcode=500051]
* Connection Error AM5:07:27.79
[Amazon](500150) Error setting/closing connection: Connection refused: connect.
I guess that's due to Redshift internal maintenance.
May I ask how to check any evidence to prove that on Redshift? I checked the svl_qlog with aborted=1, but couldn't find perfect one.
And is there any way to set maintenance window to skip when the user session is running on?
--
Thanks to useful information from Schepo and Bill, we could prove that connection error was due to reboot on Redshift Maintenance Window.
Also, we checked Redshift Event at Console, exactly what time Redshift reboot started and ended.
Probably the best way to check if the connection errors were due to Redshift maintenance would be to check the Maintenance tab in your cluster configuration. In the example screenshot below, it's some time between 06:30 and 07:00 am every Wednesday.
There's no way to stop it happening while user sessions are connected. Although you do have the option of deferring all maintenance for up to 45 days if you need (follow the Edit button on the same screen).
For evidence to prove, you can check the audit log of past maintenance events by looking in the AWS Config service under the "timeline" of your cluster. Follow the View Config Timeline button to open AWS Config for that cluster. In the below example screenshot you can see the exact time (08:49:20) of one maintenance window in the past.
Another way to document that the maintenance window was used is to check the "healthy" dashboard metric on the console or in CloudWatch. If the cluster went unhealthy then returned to healthy during the maintenance window is very likely that AWS performed an update on the systems.

Batch account node restarted unexpectedly

I am using an Azure batch account to run sqlpackage.exe in order to move databases from a server to another. A task that has started 6 days ago has suddenly been restarted and started from the beginning after 4 days of running (extremely large databases). The task run uninterruptedly up until then and should have continued to run for about 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during the execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostics settings configured and I could only look at the metrics and there was a short period when there wasn't any node.
What can be the causes for this behavior? Restarting while the node is still running
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL postgres server to a bigger machine and the upgrade is not terminating. In our experience, this usually takes less than 5 minutes, but we'ven been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down(except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a CloudSQL server, the instance is rebooted. It can happen occasionally that rebooting takes more than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, on the GCP console, go to SQL, click on an instance, and "Edit maintenance schedule"->"Set maintenance schedule". Then choose a window.

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?
We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...

Google Cloud SQL: Periodic Read Spikes Associated With Loss of Connectivity

I have noticed that my Google Cloud SQL instance is losing connectivity periodically and it seems to be associated with some read spikes on the Cloud SQL instance. See the attached screenshot for examples.
The instance is mostly idle, but my application recycles connections in the connection pool every 60 seconds so this is not a wait_timeout issue. I have verified that the connection are recycled Also, it occurred twice in 30 minutes and the wait_timeout is 8 hours.
I would suspect a backup process but you can see from the screenshot that no backups have run.
The first instance lasted 17 seconds from the time the connection loss was detected until it was reestablished. The second was only 5 seconds, but given that my connections are idle for 60 seconds the actual downtime could be up to 1:17 and 1:05 respectively. They occurred at 2014-06-05 15:29:08 UTC and 2014-06-05 16:05:32 UTC respectively. The read spikes are not initiated by me. My app continued to be idle during the issue so this is some sort of internal GCS process.
This is not a big deal for my idle app, but it will become a big deal when the app is not idle.
Has anyone else run into this issue? Is this a known issue with Google Cloud SQL? Is there a known fix?
Any help would be appreciated.
****Update****
The root cause of the symptoms above has been identified as a restart of the MySQL instance. I did not restart the instance and the operations section of the web console does not list any events at that time, so now the question becomes, what would cause the instance to restart twice in 30 minutes? Why would a production database instance restart period?
That was caused by one of our regular release. Because of the way the update takes place an instance might be restarted more than once during the push to production.
Was your instance restarted ? During the restart the spinning down/up of an instance will trigger read/write.
That may be one reason why you are seeing the activity for read/write.