I have a scenario with a “pre-cook” procedure by date: if the data for that date is already cooked, the procedure just returns it; otherwise it cooks it.
The problem is that if the cooking process takes too long, there is a chance that the data will be duplicated.
I would expect this workflow:
User A opens a session from the web-app and requests data for 2018-June; a procedure called proc_A will check the data for that month and cook it if it does not yet exist.
User B opens another session from the desktop-app and requests the same data for 2018-June; they should get a message saying that the data is cooking, please wait.
Is it possible to achieve that by only making changes in the PostgreSQL DB, rather than changing the web-app and the desktop-app?
I would add a state column to the data table:
ready boolean DEFAULT FALSE
The workflow would be as follows:
INSERT INTO data (month, value, ready)
VALUES (date_trunc('month', current_timestamp)::date, NULL, FALSE)
ON CONFLICT (month) DO NOTHING;
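Note that ON CONFLICT (month) requires a unique constraint or index on month; if the data table doesn't have one yet, something like this (the constraint name is made up) is needed first:
ALTER TABLE data ADD CONSTRAINT data_month_key UNIQUE (month);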
If a row gets inserted, proceed to cook the value, then run
UPDATE data SET
value = 42, ready = TRUE
WHERE month = date_trunc('month', current_date)::date;
If no row gets inserted by the first statement, run
SELECT value, ready
FROM data
WHERE month = date_trunc('month', current_date)::date;
If ready is true, return the data; if not, tell the client to please wait.
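Since the question asks for a DB-only change, these steps can be combined into one function that both apps call before anything else. A minimal PL/pgSQL sketch, assuming the data table above (the function name claim_month is made up):
CREATE OR REPLACE FUNCTION claim_month(p_month date)
RETURNS text
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO data (month, value, ready)
    VALUES (p_month, NULL, FALSE)
    ON CONFLICT (month) DO NOTHING;

    IF FOUND THEN
        RETURN 'cook';     -- this session won the race: cook, then run the UPDATE above
    ELSIF EXISTS (SELECT 1 FROM data WHERE month = p_month AND ready) THEN
        RETURN 'ready';    -- already cooked: safe to SELECT the value
    ELSE
        RETURN 'cooking';  -- another session is cooking: tell the client to wait
    END IF;
END;
$$;
proc_A would cook and run the UPDATE only when this function returns 'cook'.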
I have a set of data that looks like this:
CREATE TABLE trades ( -- illustrative name
instrument varchar(20) NOT NULL,
ts timestamp without time zone NOT NULL,
price float8 NOT NULL,
quantity float8 NOT NULL,
direction int NOT NULL
);
and I'd like to keep the last hour of data in my app; so upon startup, I query everything where ts >= now - 1h, and then keep a loop where I query from the last row received up to 'now'.
We're talking about roughly 1.5M rows to fetch at startup.
The issue is that the timestamp is not unique: you can have multiple rows with the same timestamp.
I am requesting an update every second with a limit of 50k, and it usually produces 200-500 rows; at startup, however, it returns batches of 50k rows until it catches up with the new data.
Should I:
add an id, find the id of now - 1h, and request records with an id higher than the last one received?
roll back the last timestamp received by 1s and deal with duplicates
something better I didn't think about (I'm not very knowledgeable with SQL DBs)
You have the right idea.
If there is a single writer (which is to say, a single client INSERTing to this table), you can use an autoincrementing ID to avoid having to deal with duplicates. Add an ID column, query for the last hour, keep your largest-seen ID in memory, and then run your loop of id > largest_seen_id every second. This is pretty standard, but it does rely on IDs always becoming visible to the querier in increasing order, which isn't something the DB can guarantee in the general case.
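A minimal sketch of that loop, assuming the table is called trades as in the DDL above and the new column is a bigserial id (:largest_seen_id stands for a bind parameter held by your app):
-- on startup: fetch the last hour and remember the largest id seen
SELECT * FROM trades WHERE ts >= now() - interval '1 hour' ORDER BY id;

-- every second afterwards: only rows beyond the largest id seen so far
SELECT * FROM trades WHERE id > :largest_seen_id ORDER BY id LIMIT 50000;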
If there are multiple writers, it's possible for this approach to skip rows, as a higher ID might be committed while another client still has its lower-ID row in an open transaction, not yet visible to your query. Your next iteration would use the higher ID as the lower bound, so you never end up seeing the committed lower-ID row. The easiest way of dealing with that is what you're already thinking of with your second option: query for extra rows and ignore the ones you already have. If network transfer is an issue, you can do the ignoring in the WHERE of the loop query you're running, so you're only "wasting" the size of the already-seen row IDs, instead of all the row data:
SELECT *
FROM my_appending_table
WHERE ts > (timestamp '2021-08-24 00:10:07' - interval '5 seconds')
  AND id <> ALL (ARRAY[99101, 99105, 99106])
ORDER BY id ASC
LIMIT 50000;
Where 2021-08-24 00:10:07 is the max timestamp you currently have (probably fine to use the timestamp of the max-ID row; not guaranteed to be the same but basically always will be), 5 seconds is a fuzz to ensure all other writers have committed, and 99101,99105,99106 are the IDs you've already seen for rows with timestamp 2021-08-24 00:10:02 (:07 - 5) or later.
I have a script that runs a function on every item in my database to extract academic citations. The database is large, so the script takes about a week to run.
During that time, items are added and removed from the database.
The database is too large to pull completely into memory, so I have to chunk through it to process all the items.
Is there a way to ensure that when the script finishes, all of the items have been processed? Is this a pattern with a simple solution? So far my research hasn't revealed anything useful.
PS: Locking the table for a week isn't an option!
I would add a timestamp column "modified_at" to the table, defaulting to NULL, so any new item can be identified.
Your script can then pick the chunks to work on based on that column.
update items
set modified_at = current_timestamp
from (
select id
from items
where modified_at is null
limit 1000 --<<< this defines the size of each "chunk" that you work on
) t
where t.id = items.id
returning items.*;
This will mark 1000 not-yet-processed rows as processed and return those rows, all in a single statement. Your job can then work on the returned items.
New rows need to be added with modified_at = null and your script will pick them up based on the where modified_at is null condition the next time you run it.
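If you ever run several of these workers in parallel, a common PostgreSQL refinement (9.5+) is to lock the claimed rows so that other workers skip them; it is the same statement, just with FOR UPDATE SKIP LOCKED in the subquery:
update items
set modified_at = current_timestamp
from (
select id
from items
where modified_at is null
limit 1000
for update skip locked --<<< concurrent workers skip rows already claimed
) t
where t.id = items.id
returning items.*;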
If you also change items while processing them, you need to update modified_at accordingly. Your script will then need to store the start time of each run somewhere; the next run can then select the items to be processed using
where modified_at is null
or modified_at < (last script start time)
If you process each item only once (and then never again), you don't really need a timestamp; a simple boolean (e.g. is_processed) would do as well.
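For that simpler case, the column change would just be (is_processed is an assumed name):
alter table items add column is_processed boolean not null default false;
The claiming UPDATE above would then filter on where not is_processed and set is_processed = true.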
I am trying to create an indexed view using the following code (so that I can publish it for replication as a table):
CREATE VIEW lc.vw_dates
WITH SCHEMABINDING
AS
SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), number) AS SettingDate
FROM lc.numbers
WHERE number<8
GO
CREATE UNIQUE CLUSTERED INDEX
idx_LCDates ON lc.vw_dates(SettingDate)
lc.numbers is simply a table with one column (number) containing the values 1-100.
However, I keep getting the error:
Column 'SettingDate' in view 'lc.vw_dates' cannot be used in an index or statistics or as a partition key because it is non-deterministic.
I realize that GETDATE() is non-deterministic. But, is there a way to make this work?
I am using MS SQL 2012.
Edit: The hope was to be able to convert GETDATE() into something deterministic (it seems like it should be, once the time portion is stripped off). If nobody knows of a method to do this, I will close this question and mark the suggestion to create a calendar table as correct.
The definition of a deterministic function (from MSDN) is:
Deterministic functions always return the same result any time they are called with a specific set of input values and given the same state of the database. Nondeterministic functions may return different results each time they are called with a specific set of input values even if the database state that they access remains the same.
Note that this definition does not involve any particular span of time over which the result must remain the same. It must be the same result always, for a given input.
Any function you can imagine that always returns the date at the point it is called will, by definition, return a different result if you run it one day and then again the next day (regardless of the state of the database).
Therefore, it is impossible for a function that returns the current date to be deterministic.
The only possible interpretation of this question that could enable a deterministic function is if you were happy to pass the function some information about what day it is as input.
Something like:
select fn_myDeterministicGetDate('2015-11-25')
But I think that would defeat the point as far as you're concerned.
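Regarding the calendar-table suggestion in the question's edit, here is a minimal sketch of that alternative (the lc.Calendar name and the daily refresh job are assumptions). Since it is a real table, it can be published for replication directly, and no determinism is involved:
CREATE TABLE lc.Calendar (
    SettingDate date NOT NULL PRIMARY KEY
);
GO

-- refresh once a day, e.g. from a SQL Agent job
-- (DELETE rather than TRUNCATE, since TRUNCATE isn't allowed on published tables)
DELETE FROM lc.Calendar;

INSERT INTO lc.Calendar (SettingDate)
SELECT DATEADD(day, number, CAST(GETDATE() AS date))
FROM lc.numbers
WHERE number < 8;
GO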
I have a BPM application where I am polling rows from a DB2 database every 5 minutes with a scheduler R1, using the query below:
SELECT * FROM Table WHERE STATUS = 'New'
Based on the rows returned, I do some processing and then change the status of those rows to 'Read'.
But this processing takes more than 5 minutes, and in the meantime scheduler R1 runs again and picks up some of the cases already picked up in the last run.
How can I ensure that every scheduler run picks up only the rows that were not selected in the last run? What changes do I need to make to my SELECT statement? Please help.
How can I ensure that every scheduler run picks up only the rows that were not selected in the last run?
You will need to make every scheduler aware of what was selected by other schedulers. You can do this, for example, by locking the selected rows (SELECT ... FOR UPDATE). Of course, you will then need to handle lock timeouts.
Another option, allowing for better concurrency, would be to update the record status before processing the records. You might introduce an intermediary status, something like 'In progress', and include the status in the query condition.
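A sketch of that second option: FINAL TABLE lets you claim a batch and read back exactly the rows you claimed in one atomic statement, assuming your DB2 version supports selecting from a data-change statement (Table is the placeholder name from the question, and id stands for whatever your key column is):
-- claim a batch and fetch it atomically
SELECT *
FROM FINAL TABLE (
    UPDATE Table
    SET STATUS = 'In progress'
    WHERE STATUS = 'New'
);

-- after processing each row:
UPDATE Table SET STATUS = 'Read' WHERE id = ?;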
I am trying to get the maximum ID from a database table and want to show it when my WinForms form loads.
I am using the following query to get the maximum ID.
SELECT ISNULL(MAX(ID),0)+1 FROM StockMain WHERE VRDATE = '2013-01-30'
The above should return the next ID for today; e.g., if this statement executes for the first time, it will return the value 1. After saving the first record with ID = 1, it should give me MAX(ID) + 1 = 2. But it keeps returning 1.
Any suggestions or solutions?
Wild guess... but what data type is VRDATE? Does it include a time component, or is it just a date?
If it includes a time component indicating when you saved the record, it won't pass the VRDATE = '2013-01-30' check, since this defaults to a time of midnight. Since the times aren't identical, they are not equal.
Instead, try:
SELECT ISNULL(MAX(ID), 0) + 1
FROM StockMain
WHERE VRDATE >= '2013-01-30' AND VRDATE < '2013-01-31'
Next question... have you considered using an IDENTITY column instead of managing the ID values manually?
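For illustration, a sketch of that alternative (remaining columns omitted); note that an IDENTITY value is global, not per-day, so it only fits if the per-day restart of IDs isn't actually required:
CREATE TABLE StockMain (
    ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    VRDATE datetime NOT NULL
    -- other columns here
);

-- the database assigns ID itself, so there is no MAX(ID) + 1 race between users
INSERT INTO StockMain (VRDATE) VALUES (GETDATE());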