Is there a way to run the DataStage jobs based on different dates? - datastage

I have a table containing the dates for the ETL jobs to be run.
I do know that, using the schedule function in DataStage Director, I am able to schedule jobs to run on a specific date or recurring weekly/monthly. However, in my case, the date will change.
For example, Job A needs to run in the middle of February, May, and August.
Is there any way I can achieve this?

One option could be a DataStage sequence that runs regularly (e.g. daily) and checks whether one of your run dates has been reached. This could be checked within the sequence, and if the condition is fulfilled, the job is run.
If you choose to try it, you need a job that selects from the dates table - you could compare the dates with the current date already in the SQL and then send the result to a file or any other flag. Read the file within the sequence and, if your check condition is true, run whatever job you need to run.
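A minimal sketch of what that date-check job could select, assuming a hypothetical RUN_DATES table with a RUN_DATE column; the job writes the result to the flag file that the sequence evaluates:
-- Returns a flag row only when today is one of the scheduled run dates
-- (RUN_DATES and RUN_DATE are illustrative names)
SELECT 'RUN' AS run_flag
FROM RUN_DATES
WHERE RUN_DATE = CURRENT_DATE;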

Related

Postgres: Count all INSERT queries executed in the past 1 minute

I can currently count the active INSERT queries running on the PostgreSQL server like this:
SELECT count(*) FROM pg_stat_activity where query like 'INSERT%'
But is there a way to count all INSERT queries executed on the server in a given period of time? E.g. in the past minute?
I have a bunch of tables into which I send a lot of inserts, and I would like to somehow aggregate how many rows I am inserting per minute. I could code a solution for this, but it would be so much easier if it were possible to extract this directly from the server.
Any kind of stats like this over a given period of time would be very helpful: the average time it takes a query to process, the bandwidth going through per minute, etc.
Note: I am using PostgreSQL 12
If not already done, install the pg_stat_statements extension and take some snapshots of the pg_stat_statements view: the diff will give the number of queries executed between two snapshots.
Note: It doesn’t save each individual query, rather it parameterizes them and then saves the aggregated result.
See https://www.citusdata.com/blog/2019/02/08/the-most-useful-postgres-extension-pg-stat-statements/
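A minimal sketch of the snapshot idea, assuming a snapshot table you create yourself (the stmt_snapshots name is illustrative):
-- Snapshot table for the counters
CREATE TABLE stmt_snapshots (
    snapped_at    timestamptz NOT NULL DEFAULT now(),
    queryid       bigint,
    calls         bigint,
    rows_inserted bigint
);
-- Take a snapshot (e.g. once a minute); the difference in calls / rows_inserted
-- between two snapshots gives the INSERTs executed in that interval
INSERT INTO stmt_snapshots (queryid, calls, rows_inserted)
SELECT queryid, calls, rows
FROM pg_stat_statements
WHERE query ILIKE 'INSERT%';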
I believe that you can use an audit trigger.
This audit will create a table that registers INSERT, UPDATE and DELETE actions, which you can adapt. Every time your database runs one of those commands, the audit table registers the action, the table and the time of the action. It is then easy to do a COUNT() on the audit table with a WHERE clause covering the last minute, as sketched below.
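A minimal sketch of such an audit trigger; all table, column and function names are illustrative:
-- Audit table registering the action, the table and the time of the action
CREATE TABLE audit_log (
    action     text        NOT NULL,
    table_name text        NOT NULL,
    logged_at  timestamptz NOT NULL DEFAULT now()
);
CREATE OR REPLACE FUNCTION log_dml() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (action, table_name) VALUES (TG_OP, TG_TABLE_NAME);
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;
-- Attach it to each table you want to track
CREATE TRIGGER my_table_audit
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH ROW EXECUTE FUNCTION log_dml();
-- Count the inserts from the past minute
SELECT count(*)
FROM audit_log
WHERE action = 'INSERT'
  AND logged_at >= now() - interval '1 minute';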
I couldn't come across anything solid, so I have created a table where I log the number of insert transactions using a script that runs as a cron job. It was simple enough to implement, and instead of estimations I get the real values: I actually count all new rows inserted into the tables in a given interval. A sketch of the idea follows.
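One way to implement such a logging script is to snapshot the per-table insert counters from pg_stat_user_tables; the log table name below is illustrative:
-- Log table populated by the cron script
CREATE TABLE insert_log (
    snapped_at timestamptz NOT NULL DEFAULT now(),
    table_name text        NOT NULL,
    n_tup_ins  bigint      NOT NULL
);
-- Run from cron at the desired interval; the difference in n_tup_ins between
-- two consecutive snapshots is the number of rows inserted in that interval
INSERT INTO insert_log (table_name, n_tup_ins)
SELECT relname, n_tup_ins
FROM pg_stat_user_tables;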

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and each time the DB rows are exported they must not be exported again in the next export task.
I'm currently using a POST request to perform the export task using Cloud Scheduler. The problem here (or at least as far as I know) is that it won't be able to export and delete (or update the rows to mark them as exported) in a single HTTP export request.
Is there any possibility to delete (or update) the rows which have been exported automatically with any Cloud SQL parameter in the http export request?
If not, I assume it should be done in a Cloud Function triggered by a Pub/Sub message (using Scheduler to send data once a day to Pub/Sub), but is there any optimal way to take all the IDs of the rows retrieved from the select statement (which will be used in the export) to delete (or update) them later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
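The update variant of the same idea could look like the sketch below, assuming a hypothetical exported boolean column used to mark rows; the same caveat about crashes applies, since the file write is not transactional:
\copy (UPDATE pgbench_accounts SET exported = true WHERE NOT exported RETURNING *) to foo.txt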
You could use the following setup to achieve what you are looking for:
1. Create a Cloud Function that extracts the information from the database and subscribes to a Pub/Sub topic.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that invokes the Pub/Sub trigger.
4. Run the Cloud Scheduler job.
5. Then create a trigger that activates another Cloud Function to delete all the required data from the database once the CSV has been created.
Here I leave you some documents which could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage Tutorial: https://cloud.google.com/functions/docs/tutorials/storage
Another method, aside from #jjanes's, would be to partition your database by date. This would allow you to create an index on the date, making exporting or deleting a day's entries very easy. With this implementation, you could also create a cron job that deletes all tables (partitions) older than X days.
The documentation provided will walk you through setting up a ranged partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
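A minimal sketch of such a range partition by insertion date, with illustrative table and column names:
CREATE TABLE events (
    id          bigint NOT NULL,
    payload     text,
    inserted_on date   NOT NULL
) PARTITION BY RANGE (inserted_on);
-- One partition per day: exporting or dropping a day's entries becomes a
-- single-partition operation
CREATE TABLE events_2020_01_01 PARTITION OF events
    FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');
-- Removing old data is just dropping the partition
DROP TABLE events_2020_01_01;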
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
I used a cloud scheduler with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler will be triggered once a day and it will export only the data of the previous day.
Also, note that in the query used in Cloud Scheduler you can choose which columns you want to export; doing this you can avoid exporting the Insertion_Date column and use it only as an auxiliary column.
Finally, Cloud Scheduler will automatically create the CSV file in the bucket.

Running a Postgresql Query at a specific time

Scenario
I have a table which contains tasks that need to be completed by a specific datetime. If the task is not completed by this datetime (+- variable interval) then it will run a script to 'escalate' this task.
This variable interval can be as small as 2 seconds or as large as 2 years
Thoughts so far
Running a cron job every second, via pg_cron or similar, will technically allow me to check the database every second, but there is a lot of wasted processing and database overhead here, and I'd rather not do this if possible.
Triggers can be fired on row insert/update/delete, so the worst-case scenario is an external script watching for these triggers being fired.
Question
Is there a way to schedule a query to run at a specific time, ideally within PostgreSQL itself rather than via a bash/cron script? i.e.:
at 2017-09-30 09:32:00 - select * from table where datetime <= now
Edit
As it came up in the comments PGAgent is a possibility and the scenario for such would be:
The task is created by the user in the application and the due date is set (e.g. 2017-09-28 13:00:00). The user has an interval before/after this due date at which the task is escalated (e.g. one hour before), so at 12:00:00 on 2017-09-28 I want PGAgent (or another option) to run my SQL script that does the escalation.
The script to escalate is already written and works, and the date and time at which this PGAgent script should run are already calculated by another script.
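For context, a minimal sketch of the check such a scheduled run would perform, assuming a hypothetical tasks table with due_at, escalation_offset and escalated columns:
-- Tasks whose escalation time has passed and which have not yet been escalated
SELECT id
FROM tasks
WHERE NOT escalated
  AND now() >= due_at - escalation_offset;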

Set up delta load in azure data factory

I have an SQL database on prem that I want to get data from. In the database there is a column called last_update that has information about when a row was last updated. The first time I run my pipeline I want it to copy everything from the on-prem database to an Azure database. The next time, I want to copy only the rows that have been updated since the last run, i.e. everything where last_update is higher than the time of the last run. Is there a way of using information about the time of the last run in a pipeline? Is there any other good way of achieving what I want?
I think you can do this by developing a custom copy activity. You can add your own transformation/processing logic and use the activity in a pipeline.
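A minimal sketch of the watermark-style source query such an activity could issue against the on-prem database; the table name is illustrative and @last_run_time is a placeholder the pipeline would supply (e.g. from a stored watermark):
-- Delta load: only rows changed since the last successful run
SELECT *
FROM dbo.my_table
WHERE last_update > @last_run_time;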

DB2: How to timely delete records

I have a table in DB2 database that has a few columns, one of which is L_TIMESTAMP. The need is to delete records where difference between L_TIMESTAMP and CURRENT TIMESTAMP is greater than 5 minutes. This check needs to happen every hour. Please let me know if there is an approach to accomplish this at the DB end rather than scheduling a cron job at the appserver end.
The administrative task scheduler in DB2 would be a good way to accomplish this. You need to wrap the DELETE statement in a stored procedure, then submit it to the scheduler. The syntax for defining the schedule is based on cron but it is all handled inside DB2.
http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.gui.doc%2Fdoc%2Fc0054380.html
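A minimal sketch of that setup, with illustrative schema, table and procedure names; SYSPROC.ADMIN_TASK_ADD registers the hourly task with the administrative task scheduler:
-- Stored procedure wrapping the DELETE (run with a non-default statement terminator, e.g. @)
CREATE PROCEDURE MYSCHEMA.PURGE_OLD_ROWS ()
LANGUAGE SQL
BEGIN
    DELETE FROM MYSCHEMA.MYTABLE
    WHERE L_TIMESTAMP < CURRENT TIMESTAMP - 5 MINUTES;
END
-- Submit it to the administrative task scheduler, firing at the top of every hour
CALL SYSPROC.ADMIN_TASK_ADD(
    'PURGE_OLD_ROWS_HOURLY',   -- task name
    NULL, NULL, NULL,          -- begin timestamp, end timestamp, max invocations
    '0 * * * *',               -- cron-style schedule: every hour
    'MYSCHEMA', 'PURGE_OLD_ROWS',
    NULL, NULL, NULL);         -- procedure input, options, remarks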