How can I schedule Postgres queries to run on Amazon RDS?

I tried to install pgAgent, but since it is not supported on Amazon RDS, I don't know how to schedule Postgres jobs without going with cron jobs and psql directly. Here is what I got on Amazon RDS:
The following command gave the same result:
CREATE EXTENSION pg_cron;

I have three options off the top of my head for this:
1.) AWS Lambda
2.) AWS Glue
3.) Any small EC2 instance (Linux/Windows)
1.) AWS Lambda:
You can use a Postgres connectivity module for Python, such as pg8000 or psycopg2, to connect and open a cursor to your target RDS instance, and you can pass your SQL job code / SQL statements to the Lambda as input. If there are only a few statements, you can code the whole job inside the Lambda; if not, you can pass it to the Lambda as input, for example from DynamoDB.
You can set up a cron schedule using a CloudWatch Events rule, so that it triggers the Lambda whenever you need (a sketch of such a handler follows the tool list).
Required tools: DynamoDB, AWS Lambda, Python, a Postgres Python connectivity module.
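A minimal sketch of such a handler, assuming pg8000 is bundled into the deployment package and the connection details are supplied through environment variables; the table name and SQL here are hypothetical placeholders:

```python
import os
import pg8000

def lambda_handler(event, context):
    # Connect to the RDS instance using settings from environment variables.
    conn = pg8000.connect(
        host=os.environ["DB_HOST"],  # e.g. mydb.xxxx.us-east-1.rds.amazonaws.com
        port=int(os.environ.get("DB_PORT", "5432")),
        database=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        cur = conn.cursor()
        # The SQL can be hard-coded, or passed in via the event
        # (e.g. from the CloudWatch Events rule or a DynamoDB lookup).
        sql = event.get(
            "sql",
            "DELETE FROM audit_log WHERE created_at < now() - interval '30 days'",
        )
        cur.execute(sql)
        conn.commit()
        return {"rows_affected": cur.rowcount}
    finally:
        conn.close()
```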
2.) AWS Glue:
AWS Glue works in much the same way. It gives you the option to connect directly to your RDS database, and you can schedule your jobs there (a scheduling sketch follows).
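If the Glue job already exists, attaching a cron schedule to it can be scripted with boto3; a rough sketch, with hypothetical job and trigger names (the job itself would hold the connection to your RDS database):

```python
import boto3

glue = boto3.client("glue")

# Create a scheduled trigger that starts the job every day at 02:00 UTC.
glue.create_trigger(
    Name="nightly-postgres-job-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "my-postgres-maintenance-job"}],
    StartOnCreation=True,
)
```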
3.) EC2 instance:
Create any small EC2 instance, either Windows or Linux, and set up your cron/bat jobs on it (see the sketch below).
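For the EC2 route, the job can be a small script invoked by cron; a sketch reusing the same hypothetical environment variables and a hypothetical table:

```python
# run_job.py - a tiny script for an EC2 cron job.
# Example crontab entry to run it every night at 02:00:
#   0 2 * * * /usr/bin/python3 /home/ec2-user/run_job.py >> /var/log/run_job.log 2>&1
import os
import pg8000

conn = pg8000.connect(
    host=os.environ["DB_HOST"],
    database=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
cur = conn.cursor()
# Whatever maintenance SQL you need; this table is a placeholder.
cur.execute("DELETE FROM stale_sessions WHERE last_seen < now() - interval '1 day'")
conn.commit()
conn.close()
```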

On October 10th, 2018, AWS Lambda launched support for long running functions. Customers can now configure their AWS Lambda functions to run up to 15 minutes per execution. Previously, the maximum execution time (timeout) for a Lambda function was 5 minutes. Using longer running functions, a highly requested feature, customers can perform big data analysis, bulk data transformation, batch event processing, and statistical computations more efficiently.

You could use Amazon CloudWatch Events to trigger a Lambda function on a schedule, but it can only run for a maximum of 15 minutes (https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/?nc1=h_ls).
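Wiring such a schedule to a Lambda can be done in the console or scripted; a sketch with boto3, where the function name and ARN are hypothetical placeholders:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Create (or update) a scheduled rule; cron() expressions also work here.
rule = events.put_rule(
    Name="run-postgres-job",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the function, then attach it as a target.
lambda_client.add_permission(
    FunctionName="my-postgres-job",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="run-postgres-job",
    Targets=[{"Id": "1", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-postgres-job"}],
)
```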
You could also run a t2.nano Amazon EC2 instance (about $50/year On-Demand, or $34/year as a Reserved Instance) to run regular cron jobs.

Related

Best way to set up a Jupyter notebook project in AWS

My current project has the following structure:
It starts with a script in a Jupyter notebook which downloads data from a CRM API and puts it in a local PostgreSQL database I run with pgAdmin. After that it runs a cluster analysis, returns some scoring values, creates a table in the database with the results, and updates these values in the CRM with another API call. This process will take between 10 and 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects the last update, runs an API call to update the database since the last call, runs a k-means analysis to cluster the data, compares the results with the previous call, and updates the new ones and the CRM via the API. This second process takes less than 2 hours by my estimate, and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this into production on AWS. I understand that for the notebooks I need SageMaker, and from what I have seen it is not that complicated; my only doubt here is whether I can call the API without implementing additional code or some extra configuration. My second problem is the database. I don't understand the difference between RDS, which is the one I think I have to use for this, and Aurora or S3. My goal is to write as little code as possible. I have tried some RDS tutorials like this one: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand it connects my local Postgres to AWS, but I can't find the data in the Amazon console; it only creates an instance? And how do I connect to it to analyze this data from SageMaker? My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud. Some orientation on how to use these tools would be appreciated.
I don't understand the difference between RDS, which is the one I think I have to use for this, and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS lets you launch the existing popular databases, such as MySQL, PostgreSQL and others, the same ones you could run at home or at work.
Aurora is AWS's in-house, cloud-native database implementation, compatible with MySQL and PostgreSQL. It can store the same data as RDS MySQL or PostgreSQL, but provides a number of features not available in RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database but an object store, where you can keep files such as images, CSVs and Excel sheets, much as you would store them on your own computer.
I understand this connects my local Postgres to AWS but I can't find the data in the Amazon console, it only creates an instance?
You can migrate your data from your local Postgres to RDS or Aurora if you wish. But neither RDS nor Aurora will connect to your existing local database, as they are databases themselves.
My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database (a minimal connection sketch follows). You can try to make it work, and if you encounter difficulties you can ask a new question on SO with your RDS/Aurora setup details.
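As a starting point, here is a minimal connectivity check you could run from a SageMaker notebook; the RDS endpoint and credentials are hypothetical placeholders, and the notebook's VPC/security group must allow traffic to port 5432:

```python
import psycopg2

# Connect to the RDS/Aurora PostgreSQL instance (placeholder values).
conn = psycopg2.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="crm",
    user="analyst",
    password="...",
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")  # simple round-trip sanity check
    print(cur.fetchone())
conn.close()
```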

Populate RDS on creation

I am currently creating one RDS instance per account for several different AWS accounts, using CloudFormation scripts.
When creating these databases I would like them to have a similar structure. I created a SQL script which I can run successfully by hand after the CloudFormation script has finished; I would like to execute it automatically as part of running the script.
My solution so far is to create an EC2 instance with a dependency on the RDS instance, run it once, and then manually delete it later, but this is not a suitable solution. I couldn't find any other way, though.
Is it possible to run a query as part of a CloudFormation script?
FYI: I'm creating a PostgreSQL 11.5 instance.
The proper way is to use custom resources, but that requires some new development (a sketch follows the steps below). If you already have an EC2 instance that populates the RDS instance from its UserData, you can automate its termination as follows:
Set InstanceInitiatedShutdownBehavior to terminate.
At the end of the UserData, execute shutdown -h now to shut down the instance.
Since the shutdown behavior is terminate, the instance will be terminated automatically.
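For the custom-resource route, the backing Lambda might look roughly like this; it assumes pg8000 is available to the function (e.g. via a layer), receives connection details as resource properties (all names hypothetical), and reports back with the cfnresponse helper that AWS provides for template-inlined functions:

```python
import pg8000
import cfnresponse

def handler(event, context):
    try:
        if event["RequestType"] == "Create":
            props = event["ResourceProperties"]
            conn = pg8000.connect(
                host=props["DbHost"],
                database=props["DbName"],
                user=props["DbUser"],
                password=props["DbPassword"],
            )
            cur = conn.cursor()
            # Seed the schema; this DDL is a placeholder for your SQL script.
            cur.execute(
                "CREATE TABLE IF NOT EXISTS app_settings (key text PRIMARY KEY, value text)"
            )
            conn.commit()
            conn.close()
        # Updates and deletes are no-ops here; always signal CloudFormation.
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception as e:
        cfnresponse.send(event, context, cfnresponse.FAILED, {"Error": str(e)})
```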

A way to automatically start and stop the SQL database in GCP

I want to run a job in Cloud Scheduler in GCP to start and stop the SQL database on weekdays during working hours.
I have tried triggering a Cloud Function via Pub/Sub, but I haven't found the proper way to do it.
You can use the Cloud SQL Admin API to start or stop an instance. Depending on your language, there are client libraries available to help you do this. This page contains examples using curl.
Once you've created two Cloud Functions (one to start and one to stop), you can configure Cloud Scheduler to send a Pub/Sub trigger to your function. Check out this tutorial, which walks you through the process.
In order to achieve this you can use a Cloud Function to call the Cloud SQL Admin API to start and stop your Cloud SQL instance (you will need two Cloud Functions). You can see my code on how to use a Cloud Function to start a Cloud SQL instance and stop a Cloud SQL instance; the core of both is sketched below.
After creating your Cloud Functions you can configure Cloud Scheduler to trigger the HTTP address of each Cloud Function.
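The core of both functions is a single PATCH call that flips the instance's activation policy ("ALWAYS" starts it, "NEVER" stops it); a sketch using the google-api-python-client library, with hypothetical project and instance names:

```python
from googleapiclient import discovery

def set_activation_policy(policy):
    # Uses the function's default service account credentials.
    service = discovery.build("sqladmin", "v1beta4")
    body = {"settings": {"activationPolicy": policy}}
    service.instances().patch(
        project="my-project", instance="my-postgres-instance", body=body
    ).execute()

def start_instance(request):  # HTTP-triggered Cloud Function entry point
    set_activation_policy("ALWAYS")
    return "started"

def stop_instance(request):
    set_activation_policy("NEVER")
    return "stopped"
```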

Scheduling a function in Google Cloud SQL - PostgreSQL DB

I'm trying to schedule a function to run periodically and delete records from my Google Cloud SQL (PostgreSQL) database. I want this to run a couple of times a day, and each run will take under 10 minutes. What options do I have to schedule this function?
Thanks
Ravi
Your best option will be to use Cloud Scheduler to schedule a job that publishes to a Pub/Sub topic. Then have a Cloud Function subscribed to this topic so it gets triggered by the message sent (a minimal sketch follows).
You can configure this job to run as a daily routine x times a day.
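A minimal sketch of such a Pub/Sub-triggered function, assuming pg8000 and connecting over the Unix socket Cloud Functions exposes for Cloud SQL; the instance connection name, table, and credentials are hypothetical:

```python
import pg8000

def cleanup(event, context):  # background-function signature for Pub/Sub
    conn = pg8000.connect(
        unix_sock="/cloudsql/my-project:us-central1:my-instance/.s.PGSQL.5432",
        database="appdb",
        user="cleaner",
        password="...",
    )
    cur = conn.cursor()
    # Delete the stale records; the retention window is a placeholder.
    cur.execute("DELETE FROM events WHERE created_at < now() - interval '7 days'")
    conn.commit()
    conn.close()
```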
Try pgAgent
pgAgent is a job scheduling agent for Postgres databases, capable of running multi-step batch or shell scripts and SQL tasks on complex schedules.
pgAgent is distributed independently of pgAdmin. You can download pgAgent from the download area of the pgAdmin website.

Why is AWS Lambda slow when calling a service on EC2?

I am using a trial AWS account and have a free EC2 server with 1 GB of RAM.
I have MongoDB installed on the EC2 instance and have written one simple AWS Lambda function in Java (I tried Node.js too). Both are in the same region.
When I try to save one simple JSON document to the database on EC2 from the Lambda function, it takes a very long time, around 30 seconds, and then times out.
I also tried to put a message on a Kafka queue installed on a similar EC2 server using a Lambda function; it takes almost a minute to put the message on the queue.
Why is this so slow? Am I missing something, or should I blame the 1 GB Ubuntu/Linux EC2 server? Or is something off with AWS Lambda? I tried both Java and Node.js.
It worked when I created a security group in the EC2 console for my DB and allowed the DB port in the port range. Check the attached image.
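If you prefer to script that fix rather than use the console, the same ingress rule can be added with boto3; the group ID, CIDR, and port below are hypothetical (27017 is MongoDB's default port):

```python
import boto3

ec2 = boto3.client("ec2")
# Open the DB port to the Lambda's network range on the DB's security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 27017,
        "ToPort": 27017,
        # Restrict to the VPC CIDR rather than opening to the world.
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
    }],
)
```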