How to process a huge result set without missing any items - PostgreSQL

I have a script that runs a function on every item in my database to extract academic citations. The database is large, so the script takes about a week to run.
During that time, items are added to and removed from the database.
The database is too large to pull completely into memory, so I have to chunk through it to process all the items.
Is there a way to ensure that when the script finishes, all of the items have been processed? Is this a pattern with a simple solution? So far my research hasn't revealed anything useful.
PS: Locking the table for a week isn't an option!

I would add a timestamp column "modified_at" to the table that defaults to null, so any new item can be identified.
Your script can then pick the chunks to work on based on that column.
update items
   set modified_at = current_timestamp
from (
    select id
    from items
    where modified_at is null
    limit 1000 -- <<< this defines the size of each "chunk" that you work on
) t
where t.id = items.id
returning items.*;
This updates 1000 rows that have not yet been processed, marks them as processed, and returns those rows, all in a single statement. Your job can then work on the returned items.
New rows need to be added with modified_at = null and your script will pick them up based on the where modified_at is null condition the next time you run it.
If you also change items while processing them, you need to update modified_at accordingly. Your script will then need to store the time of its last start somewhere; the next run can select the items to be processed using
where modified_at is null
or modified_at < (last script start time)
If you process each item only once (and then never again), you don't really need a timestamp; a simple boolean (e.g. is_processed) would do as well.
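A minimal sketch of that boolean variant (the is_processed column and the chunk size follow the answer above; the for update skip locked clause is an addition that also lets several workers run concurrently without claiming the same rows):

```sql
-- Claim the next chunk of unprocessed rows; concurrent workers
-- skip over rows already locked by another worker.
update items
   set is_processed = true
from (
    select id
    from items
    where not is_processed
    limit 1000
    for update skip locked
) t
where items.id = t.id
returning items.*;
```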

Related

how can I stream rows, in a loop, from Postgres? (F#/.NET but any other language is fine)

I have a set of data that looks like this:
(
instrument varchar(20) NOT NULL,
ts timestamp without time zone NOT NULL,
price float8 NOT NULL,
quantity float8 NOT NULL,
direction int NOT NULL
);
and I'd like to keep in my app the last hour of data; so upon startup, query everything where ts >= now - 1h, and then keep a loop where I query from the last row received to 'now'.
We're talking about roughly 1.5M rows to fetch at startup.
The issue is that the timestamp is not unique: you can have multiple rows with the same timestamp.
I am requesting an update every second with a limit of 50k, and it usually produces 200-500 rows; but the startup is providing batches of 50k rows until it catches up with the new data.
Should I:
add an id, find the id of now - 1h, and request records with an id higher than the last one received?
roll back the last timestamp received by 1s and deal with duplicates
something better I didn't think about (I'm not very knowledgeable with SQL DBs)
You have the right idea.
If there is a single writer (which is to say, a single client INSERTing to this table), you can use an autoincrementing ID to avoid having to deal with duplicates. Add an ID column, query for the last hour, keep your largest-seen ID in memory, and then run your loop of id > largest_seen_id every second. This is pretty standard, but does rely on IDs always becoming visible to the querier in increasing order, which isn't something the DB can guarantee in the general case.
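As a sketch of that loop, with illustrative names (an autoincrementing id column on the table from the question, and :largest_seen_id being a parameter the app keeps in memory):

```sql
-- Startup: load the last hour, remembering the highest id seen.
select *
from my_appending_table
where ts >= now() - interval '1 hour'
order by id;

-- Every second afterwards: only rows newer than the last one seen.
select *
from my_appending_table
where id > :largest_seen_id
order by id
limit 50000;
```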
If there are multiple writers, it's possible for this approach to skip rows, as a higher ID might be committed while another client still has its lower-ID row in an open transaction, not yet visible to your query. Your next iteration would use the higher ID as the lower bound, so you never end up seeing the committed lower-ID row. The easiest way of dealing with that is what you're already thinking of with your second option: query for extra rows and ignore the ones you already have. If network transfer is an issue, you can do the ignoring in the WHERE of the loop query you're running, so you're only "wasting" the size of the already-seen row IDs, instead of all the row data:
SELECT *
FROM my_appending_table
WHERE timestamp > (timestamp '2021-08-24 00:10:07' - interval '5 seconds')
  AND NOT (id = ANY ('{99101,99105,99106}'::int[]))
ORDER BY id ASC
LIMIT 50000;
Where 2021-08-24 00:10:07 is the max timestamp you currently have (probably fine to use the timestamp of the max-ID row; not guaranteed to be the same but basically always will be), 5 seconds is a fuzz to ensure all other writers have committed, and 99101,99105,99106 are the IDs you've already seen for rows with timestamp 2021-08-24 00:10:02 (:07 - 5) or later.

optimal way to stream trades data out of a postgres database

I have a table with a very simple schema:
(
instrument varchar(20) not null,
ts timestamp not null,
price double precision not null,
quantity double precision not null,
direction integer not null,
id serial
constraint trades_pkey
primary key
);
It stores a list of trades done on various instruments.
You can have multiple trades on a single timestamp and also the timestamps are not regular; it's possible to have 10 entries on the same millisecond and then nothing for 2 seconds, etc.
When the client starts, I would like to accomplish two things:
Load the last hour of data.
Stream all the new updates.
The client processes the trades one by one, as if they were coming from a queue. They are sorted by instrument and each instrument has its own queue, expecting each trade to be the one following the previous one.
Solution A:
I did a query to find the id at now - 1 hour, then queried all rows with id >= start id, and then looped to get all rows with id > last id.
This does not work:
the row ids and timestamps do not match: sometimes an older timestamp gets a higher row id. I guess this is due to writes being done on multiple threads. Getting data by id doesn't guarantee I will get the trades in order, and while I can sort one batch I receive, I can't be sure that the next batch will not contain an older row.
Solution B:
I can make a query loop that takes the last timestamp received, subtracts 1 second and queries again, etc. I can sort the data in the client and, for each instrument, discard all rows older than the last one processed.
Not very efficient, but that will work.
Solution C:
I can make a query per instrument (there are 22 of them), ordered by timestamp. Can 22 subqueries be grouped into a single one?
Or, is there another solution?
You could try a bigserial (autoincrementing) column to ensure each row is numbered in order as it is inserted.
Since this number is handled by Postgres you should be fine to get a guaranteed ordering on your data.
On the client side you store (maybe in a separate metadata table) the latest serial number you have seen, then query everything larger than that and keep the metadata table up to date.
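A sketch of that bookkeeping, with illustrative names (a trades table with a serial id, a one-row reader_state table, and :max_id_in_batch standing for the largest id of the batch the client just processed):

```sql
-- One-row bookkeeping table remembering the last serial number processed.
create table reader_state (last_seen_id bigint not null);
insert into reader_state values (0);

-- Each poll: fetch everything newer than the bookmark...
select *
from trades
where id > (select last_seen_id from reader_state)
order by id;

-- ...then, after processing, advance the bookmark to the
-- largest id of the batch that was actually processed.
update reader_state set last_seen_id = :max_id_in_batch;
```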

Keep table synced with another but with accumulated / grouped data

If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
    ts        TIMESTAMPTZ NOT NULL,
    value     FLOAT8 NOT NULL DEFAULT 'NaN',
    sensor_id INT4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often though I need to extract and work with daily values over years (On a web frontend). To aid this I would like a sensor_values_days table that only has the daily sums for each point and then I can use this for faster queries over longer timespans.
I don't want a trigger on every write to the db, as I am afraid that would further slow down the writes, which are already the bottleneck.
Is there a way to trigger only after so many rows have been inserted ?
Or perhaps an index that maintains a sum of entries per day? I don't think that is possible.
What would be the best way to do this. It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this.
Install clickhouse and use AggregatingMergeTree table type.
With postgres:
Create per-period aggregate table. You can have several with different granularity, like hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for periods that came after the last available one. This process will also work if the per-period table is empty, or if it missed the last update then it will catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamp of the rows that were aggregated, so later if you check the table you see it did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, that should help!
Then, repeat the same process for the "day" and "month" table.
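The scheduled aggregation step described above might look like this, assuming a sensor_values_days table keyed by (sensor_id, day) with a value_sum column (these names are illustrative):

```sql
-- Aggregate all complete days not yet present in the daily table.
-- Works on an empty daily table and catches up after missed runs.
insert into sensor_values_days (sensor_id, day, value_sum)
select sensor_id,
       date_trunc('day', ts) as day,
       sum(value)            as value_sum
from sensor_values
where ts >= coalesce((select max(day) + interval '1 day'
                      from sensor_values_days),
                     timestamptz '-infinity')
  and ts < date_trunc('day', now())   -- only complete days
group by sensor_id, date_trunc('day', ts);
```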
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day out of the live table, since all the previous days' data has been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
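That up-to-date variant could look like this, assuming a sensor_values_days table with sensor_id, day, and value_sum columns (names illustrative):

```sql
-- Summarized history, plus only the current (incomplete) day
-- aggregated on the fly from the live table.
select sensor_id, day, value_sum
from sensor_values_days
union all
select sensor_id,
       date_trunc('day', ts) as day,
       sum(value)            as value_sum
from sensor_values
where ts >= date_trunc('day', now())
group by sensor_id, date_trunc('day', ts);
```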
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized Views and a Cron every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
In PG14, we will have INCREMENTAL MATERIALIZED VIEW, but for the moment it is still in development.
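Until then, a plain materialized view refreshed from cron sketches the same idea (names assumed from the question; the unique index is required for refresh ... concurrently, which avoids blocking readers during the refresh):

```sql
create materialized view sensor_values_days as
select sensor_id,
       date_trunc('day', ts) as day,
       sum(value)            as value_sum
from sensor_values
group by sensor_id, date_trunc('day', ts);

create unique index on sensor_values_days (sensor_id, day);

-- Run from cron every 5 minutes:
refresh materialized view concurrently sensor_values_days;
```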

long running queries and new data

I'm looking at a Postgres system with tables containing tens or hundreds of millions of rows, being fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause based on a range (each row contains a timestamp; that's what I'll use for ranges). The range may be "closed", with a start and an end I know are contained in the table and into which no new data will fall, or "open": one of the range boundaries might not be "in the table yet", so rows being fed into the table might still fall into that range.
Since the response will itself contain millions of rows, and the processing per row can take some time (tens of ms), I'm fully aware I'll use a cursor and fetch, say, a few thousand rows at a time. My question is:
If I run an "open range" query: will I only get the result as it was when I started the query, or will new rows being inserted in the table that fall in the range while I run my fetch show up ?
(I tend to think that no I won't see new rows, but I'd like a confirmation...)
Updated:
It should not happen under any isolation level:
https://www.postgresql.org/docs/current/static/transaction-iso.html
A single query (or a cursor opened on it) always sees the snapshot taken when it started, even in Read Committed; Postgres extends that guarantee across multiple statements only in Repeatable Read or Serializable isolation.
Well, I think when you make a query, you create a new transaction, and it will not see data from any other transaction until it commits.
So, basically "you only get the result as it was when you started the query"
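A sketch of that cursor pattern: the result set is fixed by the snapshot taken when the cursor is opened, so later inserts will not show up in the fetches (table and column names illustrative):

```sql
begin;

declare c cursor for
    select *
    from events                  -- illustrative table name
    where ts >= '2024-01-01'     -- "open range": no upper bound
    order by ts;

-- Repeat until no rows come back; rows inserted after the
-- cursor was opened will not appear in any fetch.
fetch forward 1000 from c;

close c;
commit;
```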

Get unretrieved rows only in a DB2 select

I have a BPM application where I poll rows from a DB2 database every 5 minutes with a scheduler R1, using the query below:
select * from Table where STATUS = 'New'
Based on the rows returned I do some processing and then change the status of these rows to 'Read'.
But this processing takes more than 5 minutes to complete, and in the meantime scheduler R1 runs again and picks up some of the cases already picked up in the last run.
How can I ensure that every scheduler run picks up only the rows which were not selected in the last run? What changes do I need to make in my select statement? Please help.
How can I ensure that every scheduler picks up the rows which were not selected in last run
You will need to make every scheduler aware of what was selected by other schedulers. You can do this, for example, by locking the selected rows (SELECT ... FOR UPDATE). Of course, you will then need to handle lock timeouts.
Another option, allowing for better concurrency, would be to update the record status before processing the records. You might introduce an intermediary status, something like 'In progress', and include the status in the query condition.
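A sketch of that intermediate-status approach in DB2 SQL (MY_TABLE and the 'InProgress' status value are illustrative; select ... from final table is DB2's way of returning the rows a data-change statement modified):

```sql
-- Claim and read in one statement: this run sees exactly
-- the rows it moved from 'New' to 'InProgress', and the next
-- scheduler run will no longer match them.
select * from final table (
    update MY_TABLE
    set STATUS = 'InProgress'
    where STATUS = 'New'
);

-- After processing succeeds, mark the claimed rows done
-- (in practice, restrict this by the ids returned above):
update MY_TABLE
set STATUS = 'Read'
where STATUS = 'InProgress';
```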