Azure Data Factory Stop 2 triggers from executing at same time

Azure Data Factory Stop 2 triggers from executing at same time - triggers

I have two ADFv2 triggers.
One is set to execute every 3 mins and another every 20 mins.
They execute different pipelines but there is an overlap as both touch the same database table which I want to prevent.
Is there a way to set them up so if one is already running and the other is scheduled to start, it is instead queued until the running trigger is finished?

Not natively AFAIK. You can use the pipeline's concurrency property setting to get this behaviour but only for a single pipeline.
Instead you could (we have):
Use Validation activity to block if a sentinel blob exists and have your other pipeline write and delete the blob when it starts/ends.
Likewise have one pipeline set a flag in a control table on the database that you can examine
If you can tolerate changing your frequencies to have a common factor, create a master pipeline that Execute Pipeline's your current two pipelines; make the longer one only called every n-th run using MOD. Then you can use the concurrency setting on the outer pipeline to make sure the next trigger gets queued until the current run ends.
Use REST API https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically#rest-api in one pipeline to check if the other is running

Jason's post gave me an idea for a more simple solution.
I have two triggers. Each executes at different schedules and different pipelines.
On occasion the schedule on these triggers can overlap. In this circumstance the trigger that fires while the other is running should not run. Only one to be running at any one time.
I did this using the following.
Create a control table with a IsJobRunning BIT (flag) column
When a trigger fires, the pipeline associated with it will execute an SP that will check the Control table.
If the value in the IsJobRunning is 0 then UPDATE the IsJobRunning column to 1 and continue executing,
if 1 then RAISEERROR - a dummy error - stop executing.
IF (SELECT J.IsJobRunning FROM '[[Control table ]]' ) = 1
BEGIN
SET #ERRMSG = N'**INFORMATIONAL ONLY** Other ETL trigger job is running - so stop this attempt ' ;
SET #ErrorSeverity = 16 ;
-- Note: this is only a INFORMATIONAL message and not an actual error.
RAISERROR (#ERRMSG,#ErrorSeverity,1 ) WITH NOWAIT;
RETURN 1;
END ;
ELSE
BEGIN
-- set #IsJobRunning to RUNNING
EXEC '[[ UPDATE IsJobRunning on COntrol table]] ' ;
END ;
This looks like this in the pipeline.
This logic is in both Pipelines.

Related

Postgresql sequence: lock strategy to prevent record skipping on a table queue

I have a table that acts like a queue (let's call it queue) and has a sequence from 1..N.
Some triggers inserts on this queue (the triggers are inside transactions).
Then external machines have the sequence number and asks the remote database: give me sequences greater than 10 (for example).
The problem:
In some cases transaction 1 and 2 begins (numbers are examples). But transaction 2 ends before transaction 1. And in between host have asked queue for sequences greater than N and transaction 1 sequences are skipped.
How to prevent this?

I would proceed like this:
add a column state to the table that you change as soon as you process an entry
get the next entry with
SELECT ... FROM queuetab
WHERE state = 'new'
ORDER BY seq
LIMIT 1
FOR UPDATE SKIP LOCKED;
update state in the row you found and process it
As long as you do the last two actions in a single transaction, that will make sure that you are never blocked, get the first available entry and never skip an entry.

How to execute an update after each item writing in Spring batch?

I am doing a database read and database write as spring task. It's running fine. The after job method also is getting executed fine. But my requirement is after each insert of an entry I need to update a flag in the source database. How can we achieve this?

Consider using a CompositeItemWriter - that has 2 delegate writers
Delegate writer 1 - performs the insert into the target database
Delegate writer 2 - update the status in the source database
If you really need to commit after each insert - you will need to set the commit-interval for the step to 1. Do remember that setting the commit interval 1 means very low performance - so unless there is a compelling reason do not set the commit interval to 1

if the inserted data contains some data to identify the insert happened (insert date, status flag, etc.) you could run a simple Taskletstep which executes an update statement like
update ....
set flag = flag.value
where insert.date = ....

How can I ensure that a materialized view is always up to date?

I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right? I'm surprised to not find much discussion of this on the web.
How should I go about doing this?
I think the top half of the answer here is what I'm looking for: https://stackoverflow.com/a/23963969/168143
Are there any dangers to this? If updating the view fails, will the transaction on the invoking update, insert, etc. be rolled back? (this is what I want... I think)

I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right?
Yes, PostgreSQL by itself will never call it automatically, you need to do it some way.
How should I go about doing this?
Many ways to achieve this. Before giving some examples, keep in mind that REFRESH MATERIALIZED VIEW command does block the view in AccessExclusive mode, so while it is working, you can't even do SELECT on the table.
Although, if you are in version 9.4 or newer, you can give it the CONCURRENTLY option:
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
This will acquire an ExclusiveLock, and will not block SELECT queries, but may have a bigger overhead (depends on the amount of data changed, if few rows have changed, then it might be faster). Although you still can't run two REFRESH commands concurrently.
Refresh manually
It is an option to consider. Specially in cases of data loading or batch updates (e.g. a system that only loads tons of information/data after long periods of time) it is common to have operations at end to modify or process the data, so you can simple include a REFRESH operation in the end of it.
Scheduling the REFRESH operation
The first and widely used option is to use some scheduling system to invoke the refresh, for instance, you could configure the like in a cron job:
*/30 * * * * psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv"
And then your materialized view will be refreshed at each 30 minutes.
Considerations
This option is really good, specially with CONCURRENTLY option, but only if you can accept the data not being 100% up to date all the time. Keep in mind, that even with or without CONCURRENTLY, the REFRESH command does need to run the entire query, so you have to take the time needed to run the inner query before considering the time to schedule the REFRESH.
Refreshing with a trigger
Another option is to call the REFRESH MATERIALIZED VIEW in a trigger function, like this:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
RETURN NULL;
END;
$$;
Then, in any table that involves changes on the view, you do:
CREATE TRIGGER tg_refresh_my_mv AFTER INSERT OR UPDATE OR DELETE
ON table_name
FOR EACH STATEMENT EXECUTE PROCEDURE tg_refresh_my_mv();
Considerations
It has some critical pitfalls for performance and concurrency:
Any INSERT/UPDATE/DELETE operation will have to execute the query (which is possible slow if you are considering MV);
Even with CONCURRENTLY, one REFRESH still blocks another one, so any INSERT/UPDATE/DELETE on the involved tables will be serialized.
The only situation I can think that as a good idea is if the changes are really rare.
Refresh using LISTEN/NOTIFY
The problem with the previous option is that it is synchronous and impose a big overhead at each operation. To ameliorate that, you can use a trigger like before, but that only calls a NOTIFY operation:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, 'my_mv';
RETURN NULL;
END;
$$;
So then you can build an application that keep connected and uses LISTEN operation to identify the need to call REFRESH. One nice project that you can use to test this is pgsidekick, with this project you can use shell script to do LISTEN, so you can schedule the REFRESH as:
pglisten --listen=refresh_mv --print0 | xargs -0 -n1 -I? psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY ?;"
Or use pglater (also inside pgsidekick) to make sure you don't call REFRESH very often. For example, you can use the following trigger to make it REFRESH, but within 1 minute (60 seconds):
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, '60 REFRESH MATERIALIZED VIEW CONCURRENLTY my_mv';
RETURN NULL;
END;
$$;
So it will not call REFRESH in less the 60 seconds apart, and also if you NOTIFY many times in less than 60 seconds, the REFRESH will be triggered only once.
Considerations
As the cron option, this one also is good only if you can bare with a little stale data, but this has the advantage that the REFRESH is called only when really needed, so you have less overhead, and also the data is updated more closer to when needed.
OBS: I haven't really tried the codes and examples yet, so if someone finds a mistake, typo or tries it and works (or not), please let me know.

Now there is a PostgreSQL extension to keep materialized views updated: pg_ivm.
It only computes and applies the incremental changes, rather than recomputing the contents fully as REFRESH MATERIALIZED VIEW does. It has 2 approaches, IMMEDIATE and DEFERRED:
For IMMEDIATE, the views are updated in the same transaction that its base table is modified.
For DEFERRED, the views are updated after the transaction is committed.
Version 1.0 has been released on 2022-04-28.

Let me point out three things on the previous answer by MatheusOl - the pglater technology.
As the last element of long_options array it should include "{0, 0, 0, 0}" element as pointed at https://linux.die.net/man/3/getopt_long by the phrase "The last element of the array has to be filled with zeros." So, it should read -
static struct option long_options[] = {
//......
{"help", no_argument, NULL, '?'},
{0, 0, 0, 0}
};
On the malloc/free thing -- one free(for char listen = malloc(...);) is missing. Anyhow, malloc caused pglater process to crash on CentOS (but not on Ubuntu - I don't know why). So, I recommend using char array and assign the array name to the char pointer(to both char and char**). You many need to force type conversion while you do that(pointer assignment).
char block4[100];
...
password_prompt = block4;
...
char block1[500];
const char **keywords = (const char **)&block1;
...
char block3[300];
char *listen = block3;
sprintf(listen, "listen %s", id);
PQfreemem(id);
res = PQexec(db, listen);
Use below table to calculate timeout where md is mature_duration which is the time difference between the latest refresh(lr) time point and current time.
when md >= callback_delay(cd) ==> timeout: 0
when md + PING_INTERVAL >= cd ==> timeout: cd-md[=cd-(now-lr)]
when md + PING_INTERVAL < cd ==> timeout: PI
To implement this algorithm(3rd point), you should init 'lr' as follows -
res = PQexec(db, command);
latest_refresh = time(0);
if (PQresultStatus(res) == PGRES_COMMAND_OK) {

multiple cron jobs on the same postgres table

I have a cron job that runs every 2 mins it takes a 10 records from a postgres table and working on them then it set a flag when it is finished. i want to make sure if the fist cron runs and takes more than 2 min the other one will run on different data on DBs not on the same data.
is there any why to handle this case?

This can be solved using a Database Transaction.
BEGIN;
SELECT
id,status,server
FROM
table_log
WHERE
(direction = '2' AND status_log = '1')
LIMIT 100
FOR UPDATE SKIP LOCKED;
what are we doing?
We are Selecting all rows available (not locked) from other cron-jobs that might be running. And selecting them for update. So this means all this Query grabs its unlocked and all results will be locked for this cron-job only.
how to update my locked rows?
Simple use a for loop on your processor language (Python, Ruby, PHP) and do a concatenation for each update remember we are building 1 single update.
UPDATE table_log SET status_log = '6' ,server = '1' WHERE id = '1';
Finally we use
COMMIT;
And all rows locked will be updated. This prevents other Queries from touching the same data at the same time. Hope it helps.

Turn your "finished" flag from binary to ternary ("needs work", "in process", "finished"). You also might want to store the pid of the "in process" process, in case it dies and you need to clean it up, and a timestamp for when it started.
Or use a queueing system that someone already wrote and debugged for you.

postgres trigger to check for data

I need to write a trigger that will check a table column to see if data is there or not. The trigger needs to run all the time and log msg every hour.
Basically it will run a select statement if result found then sleep for an hour else log and sleep for an hour

What you want is a scheduled job. pgAgent : http://www.pgadmin.org/docs/1.4/pgagent.html create an hourly job that checks for that line and then logs as required.
Edit to add:
Curious if you've considered writing a SQL script that generates the log on the fly by reading the table instead of a job. If you have a timestamp field, it is quite possible to have a script that returns all hourly periods that don't have a corresponding entry within that time frame (assuming the time stamp isn't updated). Why store a second log when you can generate it directly against the data?

Triggers (in pg and in every dbms i know) can execute before or after events such insert, update or delete. What you probably want here is a script launched via something like cron (if you are using a unix system) every hour, redirecting your output to the log file.
I used something like this many times and it sounded like this (written in python):
#!/usr/bin/python
import psycopg2
try:
conn = psycopg2.connect("dbname='"+dbmane+"' user='"+user+"' host='"+hostname+"' password='"+passwd+"'")
except:
# Get the most recent exception
exceptionType, exceptionValue, exceptionTraceback = sys.exc_info()
# Exit the script and print an error telling what happened.
logging.debug("I am unable to connect to the database!\n ->%s" % (exceptionValue))
exit(2)
cur = conn.cursor()
query = "SELECT whatever FROM wherever WHERE yourconditions"
try:
cur.execute(query)
except ProgrammingError:
print "Programming error, no result produced"
result = cur.fetchone()
if(result == None):
#do whatever you need, if result == None the data is not in your table column
I used to launch my script via cron every 10 minutes, you can easily configure it to launch the script every hour redirecting its output to the log file of your choice.
If your working in a windows environment, than you'll be looking for something like cron.
I don't think that a trigger can help you with this, they fire only after some events (you can use a trigger to check after every insert if the inserted data is the one you want to check every hour, but it's not the same, doing it via script is the best solution in my experience)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Azure Data Factory Stop 2 triggers from executing at same time - triggers

Related

Postgresql sequence: lock strategy to prevent record skipping on a table queue

How to execute an update after each item writing in Spring batch?

How can I ensure that a materialized view is always up to date?

multiple cron jobs on the same postgres table

postgres trigger to check for data

Categories

Resources