multiple cron jobs on the same postgres table - postgresql

I have a cron job that runs every 2 minutes. It takes 10 records from a Postgres table, works on them, and then sets a flag when it is finished. I want to make sure that if the first cron run takes more than 2 minutes, the next one will work on different rows in the database, not on the same data.
Is there any way to handle this case?

This can be solved using a Database Transaction.
BEGIN;

SELECT id, status, server
FROM table_log
WHERE direction = '2' AND status_log = '1'
LIMIT 100
FOR UPDATE SKIP LOCKED;
What are we doing?
We are selecting all rows that are not already locked by other cron jobs that might be running, and locking them for update. This means everything the query grabs was unlocked, and all the returned rows are now locked for this cron job only.
How do I update my locked rows?
Simply use a for loop in your processing language (Python, Ruby, PHP) and build an UPDATE for each row (or concatenate them into one batch); remember, it all stays inside this one transaction.
UPDATE table_log SET status_log = '6' ,server = '1' WHERE id = '1';
Finally we use
COMMIT;
And all the locked rows will be updated and released. This prevents other queries from touching the same data at the same time. Hope it helps.
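For illustration, here is a minimal sketch of the whole cycle in Python, assuming psycopg2 and the table/column names from the query above (FOR UPDATE SKIP LOCKED requires PostgreSQL 9.5 or newer):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # connection details are illustrative

with conn:  # the transaction commits (and releases the row locks) when this block exits
    with conn.cursor() as cur:
        # Claim up to 100 unprocessed rows; rows already locked by another cron run are skipped.
        cur.execute("""
            SELECT id, status, server
            FROM table_log
            WHERE direction = '2' AND status_log = '1'
            LIMIT 100
            FOR UPDATE SKIP LOCKED
        """)
        for row_id, status, server in cur.fetchall():
            # ... do the actual work for this record here ...
            cur.execute(
                "UPDATE table_log SET status_log = '6', server = '1' WHERE id = %s",
                (row_id,),
            )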

Turn your "finished" flag from binary to ternary ("needs work", "in process", "finished"). You also might want to store the pid of the "in process" process, in case it dies and you need to clean it up, and a timestamp for when it started.
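For illustration, the claim step could be a single atomic UPDATE that also records the pid and start time. A sketch, assuming psycopg2 and a hypothetical jobs table (it combines the ternary flag with the SKIP LOCKED trick from the previous answer):

import os
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are illustrative

with conn, conn.cursor() as cur:
    # Atomically move a batch of rows from 'needs_work' to 'in_process',
    # recording who took them and when, so a dead worker can be cleaned up later.
    cur.execute("""
        UPDATE jobs
        SET status = 'in_process', worker_pid = %s, started_at = now()
        WHERE id IN (
            SELECT id FROM jobs
            WHERE status = 'needs_work'
            LIMIT 10
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id
    """, (os.getpid(),))
    claimed_ids = [row[0] for row in cur.fetchall()]
# ... process claimed_ids, then set status = 'finished' for each ...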
Or use a queueing system that someone already wrote and debugged for you.

Related

Azure Data Factory Stop 2 triggers from executing at same time

I have two ADFv2 triggers.
One is set to execute every 3 mins and another every 20 mins.
They execute different pipelines, but there is an overlap, as both touch the same database table, and I want to prevent that.
Is there a way to set them up so if one is already running and the other is scheduled to start, it is instead queued until the running trigger is finished?
Not natively AFAIK. You can use the pipeline's concurrency property setting to get this behaviour but only for a single pipeline.
Instead you could (we have):
Use a Validation activity to block if a sentinel blob exists, and have your other pipeline write and delete the blob when it starts/ends.
Likewise, have one pipeline set a flag in a control table in the database that you can examine.
If you can tolerate changing your frequencies to have a common factor, create a master pipeline that calls your current two pipelines via Execute Pipeline activities; have the longer-running one called only every n-th run using MOD. Then you can use the concurrency setting on the outer pipeline to make sure the next trigger gets queued until the current run ends.
Use the REST API (https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically#rest-api) in one pipeline to check whether the other is running.
Jason's post gave me an idea for a simpler solution.
I have two triggers. Each executes at different schedules and different pipelines.
On occasion the schedules of these triggers can overlap. In this circumstance the trigger that fires while the other is running should not run; only one should be running at any one time.
I did this using the following.
Create a control table with an IsJobRunning BIT (flag) column.
When a trigger fires, the pipeline associated with it will execute an SP that will check the Control table.
If the value of IsJobRunning is 0, then UPDATE the IsJobRunning column to 1 and continue executing;
if it is 1, then RAISERROR (a dummy error) and stop executing.
IF (SELECT IsJobRunning FROM [[Control table]]) = 1
BEGIN
    SET @ERRMSG = N'**INFORMATIONAL ONLY** Other ETL trigger job is running - so stop this attempt ';
    SET @ErrorSeverity = 16;
    -- Note: this is only an INFORMATIONAL message and not an actual error.
    RAISERROR (@ERRMSG, @ErrorSeverity, 1) WITH NOWAIT;
    RETURN 1;
END;
ELSE
BEGIN
    -- set IsJobRunning to RUNNING
    EXEC [[UPDATE IsJobRunning on Control table]];
END;
This is what it looks like in the pipeline.
This logic is in both pipelines.

Postgres 'if not exists' fails because the sequence exists

I have several counters in an application I am building and am trying to get them created dynamically by the application as required.
For a simplistic example, if someone types a word into a script it should return the number of times that word has been entered previously. Here is an example of the SQL that might be executed if they typed the word example:
CREATE SEQUENCE IF NOT EXISTS example START WITH 1;
SELECT nextval('example')
This would return 1 the first time it ran, 2 the second time, etc.
The problem is when 2 people click the button at the same time.
First, please note that a lot more is happening in my application than just these statements, so the chances of them overlapping is much more significant than it would be if this was all that was happening.
1> BEGIN;
2> BEGIN;
1> CREATE SEQUENCE IF NOT EXISTS example START WITH 1;
2> CREATE SEQUENCE IF NOT EXISTS example START WITH 1; -- is blocked by previous statement
1> SELECT nextval('example') -- returns 1 to user.
1> COMMIT; -- unblocks second connection
2> ERROR: duplicate key value violates unique constraint
"pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(example, 109649) already exists.
I was under the impression that by using "IF NOT EXISTS", the statement should just be a no-op if it does exist, but it seems to have this race condition where that is not the case. I say race condition because if these two are not executed at the same time, it works as one would expect.
I have noticed that IF NOT EXISTS is fairly new to postgres, so maybe they haven't worked out all of the kinks yet?
EDIT:
The main reason we were considering doing things this way was to avoid excess locking. The thought being that if two people were to increment at the same time, using a sequence would mean that neither user should have to wait for the other (except, as in this example, for the initial creation of that sequence)
Sequences are part of the database schema. If you find yourself modifying the schema dynamically based on the data stored in the database, you are probably doing something wrong. This is especially true for sequences, which have special properties, e.g. regarding their behavior with respect to transactions. Specifically, if you increment a sequence (with the help of nextval) in the middle of a transaction and then roll back that transaction, the value of the sequence will not be rolled back. So most likely, this kind of behavior is something you don't want for your data.
In your example, imagine that a user tries to add a word. This results in the corresponding sequence being incremented. Now imagine that the transaction does not complete for some reason (e.g. the computer crashes) and gets rolled back. You would end up with the word not being added to the database, but with the sequence incremented anyway.
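To see this non-transactional behavior concretely, here is a small sketch (assuming psycopg2 and that the example sequence already exists; connection details are illustrative):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are illustrative
cur = conn.cursor()

cur.execute("SELECT nextval('example')")
print(cur.fetchone()[0])   # say it prints 1
conn.rollback()            # the transaction is rolled back ...

cur.execute("SELECT nextval('example')")
print(cur.fetchone()[0])   # ... but the sequence is not: this prints 2, not 1 again
conn.commit()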
For the particular example that you mentioned, there is an easy solution: create an ordinary table to store all the "sequences". Something like this would do it:
CREATE TABLE word_frequency (
    word text NOT NULL UNIQUE,
    frequency integer NOT NULL
);
Now I understand that this is just an example, but if this approach doesn't work for your actual use case, let us know and we can adjust it to your needs.
Edit: Here's how the above solution works. If a new word is added, run the following query ("UPSERT" syntax, Postgres 9.5+ only):
INSERT INTO word_frequency(word,frequency)
VALUES ('foo',1)
ON CONFLICT (word)
DO UPDATE
SET frequency = word_frequency.frequency + excluded.frequency
RETURNING frequency;
This query will insert a new word into word_frequency with frequency 1, or, if the word already exists, it will increment the existing frequency by 1. Now what happens if two transactions try to do that at the same time? Consider the following scenario:
client 1                    client 2
--------                    --------
BEGIN
                            BEGIN
UPSERT ('foo',1)
                            UPSERT ('foo',1)   <====
COMMIT
                            COMMIT
What will happen is that as soon as client 2 tries to increment the frequency for foo (marked with the arrow above), that operation will block because the row was modified by a different transaction. When client 1 commits, client 2 will get unblocked and continue without any errors. This is exactly how we wanted it to work. Also note that PostgreSQL uses row-level locking to implement this behavior, so other insertions will not be blocked.
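For completeness, calling this from application code could look roughly like the following; a sketch assuming psycopg2 (the increment_word helper name is just illustrative):

import psycopg2

UPSERT = """
    INSERT INTO word_frequency (word, frequency)
    VALUES (%s, 1)
    ON CONFLICT (word)
    DO UPDATE SET frequency = word_frequency.frequency + excluded.frequency
    RETURNING frequency
"""

def increment_word(conn, word):
    """Insert the word with frequency 1, or bump the existing count; return the new count."""
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute(UPSERT, (word,))
        return cur.fetchone()[0]

conn = psycopg2.connect("dbname=mydb")  # connection details are illustrative
print(increment_word(conn, "foo"))      # 1 the first time, 2 the second time, ...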
EDIT: The main reason we were considering doing things this way was to avoid excess locking. The thought being that if two people were to increment at the same time, using a sequence would mean that neither user should have to wait for the other (except, as in this example, for the initial creation of that sequence)
It sounds like you're optimizing for a problem that likely does not exist. Sure, if you have 100,000 simultaneous users all inserting rows (normally the only time a sequence is used), there is the possibility of some contention on the sequence, but realistically there will be other bottlenecks long before the sequence gets in the way.
I'd advise you to first prove that the sequence is an issue. With a proper database design (which dynamic DDL is not) the sequence will not be the bottleneck.
As a reference, DDL is not transaction safe in most databases.

How to execute an update after each item writing in Spring batch?

I am doing a database read and a database write as a Spring Batch task. It's running fine, and the afterJob method is also executed fine. But my requirement is that after each insert of an entry I need to update a flag in the source database. How can I achieve this?
Consider using a CompositeItemWriter - that has 2 delegate writers
Delegate writer 1 - performs the insert into the target database
Delegate writer 2 - updates the status in the source database
If you really need to commit after each insert, you will need to set the commit-interval for the step to 1. Do remember that a commit interval of 1 means very low performance, so unless there is a compelling reason, do not set it to 1.
If the inserted data contains something that identifies that the insert happened (insert date, status flag, etc.), you could run a simple TaskletStep which executes an update statement like
update ....
set flag = flag.value
where insert.date = ....

How can I ensure that a materialized view is always up to date?

I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right? I'm surprised to not find much discussion of this on the web.
How should I go about doing this?
I think the top half of the answer here is what I'm looking for: https://stackoverflow.com/a/23963969/168143
Are there any dangers to this? If updating the view fails, will the transaction on the invoking update, insert, etc. be rolled back? (this is what I want... I think)
I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right?
Yes, PostgreSQL by itself will never call it automatically; you need to trigger it in some way.
How should I go about doing this?
There are many ways to achieve this. Before giving some examples, keep in mind that the REFRESH MATERIALIZED VIEW command locks the view in ACCESS EXCLUSIVE mode, so while it is working, you can't even do a SELECT on the materialized view.
However, if you are on version 9.4 or newer, you can give it the CONCURRENTLY option:
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
This acquires an EXCLUSIVE lock and will not block SELECT queries, but it may have a bigger overhead (it depends on the amount of data changed; if few rows have changed, it might be faster). Note that CONCURRENTLY requires a unique index on the materialized view, and you still can't run two REFRESH commands on the same view concurrently.
Refresh manually
It is an option to consider. Especially in cases of data loading or batch updates (e.g. a system that only loads tons of data after long periods of time), it is common to have operations at the end to modify or process the data, so you can simply include a REFRESH operation at the end of it.
Scheduling the REFRESH operation
The first and widely used option is to use some scheduling system to invoke the refresh; for instance, you could configure something like the following in a cron job:
*/30 * * * * psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv"
And then your materialized view will be refreshed every 30 minutes.
Considerations
This option is really good, especially with the CONCURRENTLY option, but only if you can accept the data not being 100% up to date all the time. Keep in mind that, with or without CONCURRENTLY, the REFRESH command still needs to run the entire underlying query, so you have to account for the time that query takes before deciding how often to schedule the REFRESH.
Refreshing with a trigger
Another option is to call the REFRESH MATERIALIZED VIEW in a trigger function, like this:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
RETURN NULL;
END;
$$;
Then, on any table whose changes affect the view, you do:
CREATE TRIGGER tg_refresh_my_mv AFTER INSERT OR UPDATE OR DELETE
ON table_name
FOR EACH STATEMENT EXECUTE PROCEDURE tg_refresh_my_mv();
Considerations
It has some critical pitfalls for performance and concurrency:
Any INSERT/UPDATE/DELETE operation will have to execute the query (which is probably slow, given that you wanted a materialized view in the first place);
Even with CONCURRENTLY, one REFRESH still blocks another one, so any INSERT/UPDATE/DELETE on the involved tables will be serialized.
The only situation in which I can see this as a good idea is if the changes are really rare.
Refresh using LISTEN/NOTIFY
The problem with the previous option is that it is synchronous and imposes a big overhead on each operation. To ameliorate that, you can use a trigger like before, but one that only calls a NOTIFY operation:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, 'my_mv';
RETURN NULL;
END;
$$;
You can then build an application that stays connected and uses the LISTEN operation to identify the need to call REFRESH. One nice project you can use to test this is pgsidekick; with it you can do the LISTEN from a shell script, so you can schedule the REFRESH as:
pglisten --listen=refresh_mv --print0 | xargs -0 -n1 -I? psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY ?;"
Or use pglater (also part of pgsidekick) to make sure you don't call REFRESH too often. For example, you can use the following trigger to request a REFRESH, but at most once per minute (60 seconds):
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, '60 REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv';
RETURN NULL;
END;
$$;
So it will not issue REFRESHes less than 60 seconds apart, and if you NOTIFY many times within 60 seconds, the REFRESH will be triggered only once.
Considerations
As with the cron option, this one is only good if you can bear a little stale data, but it has the advantage that the REFRESH is called only when really needed, so you have less overhead and the data is updated closer to when it is needed.
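If you would rather not depend on pgsidekick, a bare-bones listener is also easy to write yourself. Here is a sketch using psycopg2 and the refresh_mv channel from the trigger above (database and view names are illustrative):

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=your_database")  # connection details are illustrative
# Autocommit so LISTEN takes effect immediately and REFRESH runs outside a transaction block.
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN refresh_mv;")

while True:
    # Block for up to 60 seconds waiting for a notification.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timeout, nothing to do
    conn.poll()
    if conn.notifies:
        conn.notifies.clear()  # collapse a burst of notifications into a single refresh
        cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;")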
Note: I haven't really tried the code and examples yet, so if someone finds a mistake or typo, or tries them and they work (or don't), please let me know.
Now there is a PostgreSQL extension to keep materialized views updated: pg_ivm.
It only computes and applies the incremental changes, rather than recomputing the contents fully as REFRESH MATERIALIZED VIEW does. It has 2 approaches, IMMEDIATE and DEFERRED:
For IMMEDIATE, the views are updated in the same transaction that its base table is modified.
For DEFERRED, the views are updated after the transaction is committed.
Version 1.0 was released on 2022-04-28.
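A minimal usage sketch from Python, based on the create_immv function described in the extension's documentation (table and view names here are illustrative, and this is untested):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are illustrative
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pg_ivm;")

# create_immv defines an incrementally maintained materialized view: later
# INSERT/UPDATE/DELETE statements on base_table update my_imv automatically,
# instead of requiring a full REFRESH MATERIALIZED VIEW.
cur.execute("SELECT create_immv('my_imv', 'SELECT id, value FROM base_table');")

cur.execute("SELECT * FROM my_imv;")
print(cur.fetchall())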
Let me point out three things about the previous answer by MatheusOl regarding the pglater tool.
The long_options array should include {0, 0, 0, 0} as its last element, as pointed out at https://linux.die.net/man/3/getopt_long by the phrase "The last element of the array has to be filled with zeros." So it should read:
static struct option long_options[] = {
//......
{"help", no_argument, NULL, '?'},
{0, 0, 0, 0}
};
On the malloc/free issue: one free() (for char *listen = malloc(...);) is missing. Anyhow, malloc caused the pglater process to crash on CentOS (but not on Ubuntu; I don't know why). So I recommend using char arrays and assigning the array name to the char pointer (for both the char * and char ** cases). You may need to force a type conversion when you do that (pointer assignment).
char block4[100];
...
password_prompt = block4;
...
char block1[500];
const char **keywords = (const char **)&block1;
...
char block3[300];
char *listen = block3;
sprintf(listen, "listen %s", id);
PQfreemem(id);
res = PQexec(db, listen);
Use the table below to calculate the timeout, where md is mature_duration, the time difference between the latest refresh (lr) time point and the current time:
when md >= callback_delay (cd)    => timeout: 0
when md + PING_INTERVAL >= cd     => timeout: cd - md  [= cd - (now - lr)]
when md + PING_INTERVAL < cd      => timeout: PING_INTERVAL
To implement this algorithm (the 3rd case), you should initialize lr as follows:
res = PQexec(db, command);
latest_refresh = time(0);
if (PQresultStatus(res) == PGRES_COMMAND_OK) {

postgres trigger to check for data

I need to write a trigger that will check a table column to see if data is there or not. The trigger needs to run all the time and log a message every hour.
Basically it would run a select statement; if a result is found, sleep for an hour, else log a message and sleep for an hour.
What you want is a scheduled job. With pgAgent (http://www.pgadmin.org/docs/1.4/pgagent.html) you can create an hourly job that checks for that row and then logs as required.
Edit to add:
Curious if you've considered writing a SQL script that generates the log on the fly by reading the table instead of a job. If you have a timestamp field, it is quite possible to have a script that returns all hourly periods that don't have a corresponding entry within that time frame (assuming the time stamp isn't updated). Why store a second log when you can generate it directly against the data?
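For example, assuming a created_at timestamp column, a query along these lines would list every hour in the last day that has no matching row (table and column names are hypothetical):

import psycopg2

# List every hour in the last 24 hours that has no row in my_table,
# i.e. the periods an hourly job would have logged as "no data".
GAPS = """
    SELECT h.hour_start
    FROM generate_series(
             date_trunc('hour', now()) - interval '24 hours',
             date_trunc('hour', now()),
             interval '1 hour'
         ) AS h(hour_start)
    LEFT JOIN my_table t
           ON t.created_at >= h.hour_start
          AND t.created_at <  h.hour_start + interval '1 hour'
    WHERE t.created_at IS NULL
    ORDER BY h.hour_start
"""

conn = psycopg2.connect("dbname=mydb")  # connection details are illustrative
with conn, conn.cursor() as cur:
    cur.execute(GAPS)
    for (hour_start,) in cur.fetchall():
        print("no data in the hour starting at", hour_start)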
Triggers (in Postgres and in every DBMS I know) execute before or after events such as insert, update or delete. What you probably want here is a script launched every hour via something like cron (if you are using a Unix system), redirecting its output to the log file.
I have used something like this many times, and it looked like this (written in Python):
#!/usr/bin/python
import sys
import logging
import psycopg2

# dbname, user, hostname and passwd are defined elsewhere (e.g. read from a config file)
try:
    conn = psycopg2.connect("dbname='" + dbname + "' user='" + user + "' host='" + hostname + "' password='" + passwd + "'")
except Exception:
    # Get the most recent exception
    exceptionType, exceptionValue, exceptionTraceback = sys.exc_info()
    # Exit the script and log an error telling what happened.
    logging.debug("I am unable to connect to the database!\n ->%s" % (exceptionValue))
    sys.exit(2)

cur = conn.cursor()
query = "SELECT whatever FROM wherever WHERE yourconditions"

try:
    cur.execute(query)
except psycopg2.ProgrammingError:
    print("Programming error, no result produced")

result = cur.fetchone()
if result is None:
    # do whatever you need; if result is None the data is not in your table column
    pass
I used to launch my script via cron every 10 minutes; you can easily configure it to launch the script every hour, redirecting its output to the log file of your choice.
If you're working in a Windows environment, then you'll be looking for an equivalent of cron (such as the Task Scheduler).
I don't think a trigger can help you with this; triggers fire only on certain events (you could use a trigger to check after every insert whether the inserted data is what you want to check for every hour, but it's not the same; doing it via a script is the best solution in my experience).