Running automated jobs in PostgreSQL - postgresql

I have setup a PostgreSQL server and am using PgAdmin 4 for managing the databases/clusters. I have a bunch of SQL validation scripts (.sql) which I run on the databases every time some data is added to the database.
My current requirement is to automatically run these .sql scripts and generate some results/statistics every time new data is added to any of the tables in the database.
I have explored the use of pg_cron (https://www.citusdata.com/blog/2016/09/09/pgcron-run-periodic-jobs-in-postgres/) and pgAgent (https://www.pgadmin.org/docs/pgadmin4/dev/pgagent_jobs.html)
Before I proceed to integrate any of these tools into my application, I wanted to know if it is advisable to proceed using these utilities or if I should employ the service of a full-fledged CI framework like Jenkins?

Related

How to change default location for storing Rundeck Projects

I am using Rundeck Docker Container. It was running well for 2 months and suddenly it crashed. I lost all the data wrt a project that I had created using CLI. Is there a way to change the default path to store all project related data including job definitions, resources etc?
Out of the box, Rundeck stores the project/jobs data on their internal H2 database, this database is only for testing purposes and probably will crash with a lot of data (storing projects at filesystem is deprecated right now), the best approach is to use a "real" database like MySQL, PostgreSQL, or Oracle, in that way Rundeck stores all project/jobs data on a robust backend.
Check this MySQL, PostgreSQL and Oracle Docker environment examples.
Of course, having a backup policy for your instance would be ideal to keep safe all your instance data.

Running concurrent/parallel Db Functions/Procedures in Postgres plpgsql?

I am mostly a Java programmer and we can easily run different methods or functions multithreaded or in parallel (simultaneously) by creating new/different Threads.
I recently was writing many Functions and Procedures for my Postgres database and utilizing the Pg_Cron extension, which lets you schedule "Jobs" (basically plpgsql Functions or Procedures you write) to run based on a Cron expression.
With these Jobs, as I understand it, you can have the scripts run essentially in Parallel/Concurrent.
Now, I am curious, without using Pg_cron to run db maintenance tasks, is there anyone at all in Postgres to write "concurrent" logic or scripts that run parallel, without using 3rd party extensions/libraries?
Yes, that is trivial: just open several database connections and run statements in each of them concurrently.

Best way to set up jupyter notebook project in AWS

My current project have the following structure:
Starts with a script in jupyter notebook which dowloads data from a CRM API to put in a local PostgressSql database I run with PgAdmin. After that it runs cluster analysis, return some scoring values, creates a table in database with the results and updates this values in the CRM with another API call. This process will take between 10 to 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects last update, runs api call to update database since the last call, runs kmeans analysis to cluster the data, compare results with the previous call, updates the new ones and the CRM via API. This second process takes less than 2 hours in my estimation and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this in production in AWS. I understand for the notebooks I need Sagemaker and from I have seen is not that complicated, my only doubt here is if I can call the API without implementing aditional code or need some configuration. My second problem is database. I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3. My goal is to write the less code as possible, but a have try some tutorial of RDS like this one: [1]: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand this connect my local postgress to AWS but I can't find the data in the amazon page, only creates an instance?? and how to connect to it to analysis this data from SageMaker. My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud. Just some orientation about how to use this tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS allows you to launch the existing popular databases such as MySQL, PostgreSQSL and other which you can launch at home/work as well.
Aurora is in-house, cloud-native implementation databases compatible with MySQL and PosrgreSQL. It can store the same data as RDS MySQL or PosrgreSQL, but provides a number of features not available for RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database, but an object storage, where you can store your files, such as images, csv, excels, similarly like you would store them on your computer.
I understand this connect my local postgress to AWS but I can't find the data in the amazon page, only creates an instance??
You can migrate your data from your local postgress to RDS or Aurora if you wish. But RDS nor Aurora will not connect to your existing local database, as they are databases themselfs.
My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you encounter difficulties you can make new question on SO with RDS/Aurora setup details.

for db2 on cloud, are things like runstats and reorgchk/reorg done automatically?

I am seeing some slow performance on a couple of my queries that run against my db2 on cloud instance. When I had a local db2, I would try these tools to see if I could improve performance. Now, with db2 on cloud, I believe I can run them using admin_cmd, however, if they are already being run automatically on my db objects, there is no point, but I am not sure how to tell.
Yes, Db2 on Cloud does auto reorgs and runstats automatic. We do recommend running them manually, if you are running a lot of data loads to better the performance.
As you stated, Db2 on Cloud is a managed (as a Service) database offering. But this is for the general part, not for application-specific stuff. Backup / restore can be done without any application insights, but creating indexes, running runstats or performing reorgs is application-specific.
Runstats can be invoked using admin_cmd. The same is true for running reorg on tables and indexes.

Trigger an airflow DAG asynchronous by a database trigger

I want to consolidate a couple of historically grown scripts (Python, Bash and Powershell) which purpose is to sync data between a lot of different database backends (mostly postgres, but also oracle and sqlserver) and on different sites. There isn't really a master, its more like a loose couple of partner companies working on the same domain specific use cases, everyone with its own data silo and its my job to hold all this together as good as I can.
Currently those scripts I mentioned are cron scheduled and need to run on the origin server where a dataset gets initially written, to sync it to every partner over night.
I am also familiar with and use Apache Airflow in another project. So my idea was to use an workflow management tool like Airflow to streamline the sync process and get it more centralized. But also with Airflow there is only a time interval scheduler available to trigger a DAG.
As most writes come in over postgres databases, I'd like to make use of the NOTIFY/LISTEN feature and already have a python daemon based on this listening to any database change (via triggers) and calling an event handler then.
The last missing piece is how its probably best done to trigger an airflow DAG with this handler and how to keep all this running reliably?
Perhaps there is a better solution?