I have a BPM application where I poll rows from a DB2 database every 5 minutes with a scheduler R1, using the query below:
- select * from Table where STATUS = 'New'
Based on the rows returned, I do some processing and then change the status of these rows to 'Read'.
But this processing sometimes takes more than 5 minutes, and in the meantime scheduler R1 runs again and picks up some of the cases already picked up in the last run.
How can I ensure that each scheduler run picks up only the rows that were not selected in the last run? What changes do I need to make to my select statement? Please help.
How can I ensure that every scheduler picks up the rows which were not selected in last run
You will need to make every scheduler aware of what was selected by other schedulers. You can do this, for example, by locking the selected rows (SELECT ... FOR UPDATE). Of course, you will then need to handle lock timeouts.
Another option, allowing for better concurrency, would be to update the record status before processing the records. You might introduce an intermediary status, something like 'In progress', and include the status in the query condition.
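As a sketch of that intermediate-status approach (using the placeholder names Table and STATUS from the question, and assuming your DB2 version supports data-change table references): each run can claim its batch atomically in one statement, so a second run querying for 'New' never sees the same rows.

```sql
-- Claim a batch of new rows atomically and read them back in one
-- statement. A concurrent run querying STATUS = 'New' skips them.
SELECT *
FROM FINAL TABLE (
    UPDATE Table
    SET STATUS = 'In progress'
    WHERE STATUS = 'New'
);

-- ... process the returned rows ...

-- Afterwards, mark the processed rows as done. (If several runs can
-- overlap, track the claimed row IDs instead of matching on status.)
UPDATE Table
SET STATUS = 'Read'
WHERE STATUS = 'In progress';
```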
Related
If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
    ts        timestamptz NOT NULL,
    value     float8 NOT NULL DEFAULT 'NaN',
    sensor_id int4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often though I need to extract and work with daily values over years (On a web frontend). To aid this I would like a sensor_values_days table that only has the daily sums for each point and then I can use this for faster queries over longer timespans.
I don't want a trigger for every write to the DB, as I am afraid that would slow down writes, which are already the bottleneck.
Is there a way to fire a trigger only after so many rows have been inserted?
Or perhaps an index that maintains a sum of entries over days? I don't think that is possible.
What would be the best way to do this. It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this.
Install ClickHouse and use the AggregatingMergeTree table engine.
With postgres:
Create per-period aggregate table. You can have several with different granularity, like hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for periods that came after the last available one. This process will also work if the per-period table is empty, or if it missed the last update then it will catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamp of the rows that were aggregated, so later if you check the table you see it did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, that should help!
Then, repeat the same process for the "day" and "month" tables.
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day out of the live table, since all the previous days' worth of data have been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
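A minimal sketch of the catch-up step for the "day" table, assuming a sensor_values_days table with columns day, sensor_id, and sum_value (these names are illustrative, not from the question):

```sql
-- Aggregate every complete day that is not yet in the per-day table.
-- Works on an empty per-day table and catches up after missed runs.
INSERT INTO sensor_values_days (day, sensor_id, sum_value)
SELECT date_trunc('day', ts) AS day,
       sensor_id,
       sum(value)            AS sum_value
FROM sensor_values
WHERE ts >= coalesce((SELECT max(day) + interval '1 day'
                      FROM sensor_values_days),
                     '-infinity'::timestamptz)
  AND ts < date_trunc('day', now())   -- only complete days
GROUP BY 1, 2;
```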
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized views and a cron job every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
In PG14 we will have INCREMENTAL MATERIALIZED VIEW, but for the moment it is still in development.
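Until incremental view maintenance lands, a plain materialized view refreshed from cron is a common stand-in. A sketch, using the table and columns from the question (the view name and the daily-sum shape are illustrative):

```sql
CREATE MATERIALIZED VIEW sensor_values_days AS
SELECT date_trunc('day', ts) AS day,
       sensor_id,
       sum(value) AS sum_value
FROM sensor_values
GROUP BY 1, 2;

-- A unique index is required for CONCURRENTLY, which refreshes
-- without locking out readers of the view.
CREATE UNIQUE INDEX ON sensor_values_days (day, sensor_id);

-- Run this from cron every 5 minutes:
REFRESH MATERIALIZED VIEW CONCURRENTLY sensor_values_days;
```

Note this recomputes the whole aggregate on every refresh, so it trades simplicity for more work per run than the incremental cron approach above.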
I have a table with tumbling window, e.g.
CREATE TABLE total_transactions_per_1_days AS
  SELECT
    sender,
    count(*)             AS count,
    sum(amount)          AS total_amount,
    histogram(recipient) AS recipients
  FROM completed_transactions
  WINDOW TUMBLING (SIZE 1 DAYS)
  GROUP BY sender;
Now I need to select only data from the current window, i.e. windowstart <= current time and windowend >= current time. Is it possible? I could not find any example.
Depends what you mean when you say 'select data' ;)
ksqlDB supports two main query types, (see https://docs.ksqldb.io/en/latest/concepts/queries/).
If what you want is a pull query, i.e. a traditional SQL query where you pull back the current window as a one-time result, then what you want may be possible, though pull queries are a recent feature and not fully featured yet. As of version 0.10 you can only look up a known key. For example, if sender is the key of the table, you could run a query like:
SELECT * FROM total_transactions_per_1_days
WHERE sender = some_value
AND WindowStart <= UNIX_TIMESTAMP()
AND WindowEnd >= UNIX_TIMESTAMP();
This would require the table to have processed data with a timestamp close to the current wall clock time for it to pull back data, i.e. if the system was lagging, or if you were processing historic or delayed data, this would not work.
Note: the above query will work on ksqlDB v0.10. Your success on older versions may vary.
There are plans to extend the functionality of pull queries, so keep an eye out for updates to ksqlDB.
I'm looking at a Postgres system with tables containing tens or hundreds of millions of rows, being fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause based on a range (each row contains a timestamp; that's what I'll use for ranges). It may be a "closed range", with a start and an end I know are contained in the table, into which I know no new data will fall; or an "open range", where one of the range boundaries might not be "in the table yet", so rows being fed into the table might still fall into that range.
Since the response will itself contain millions of rows, and the processing per row can take some time (tens of ms), I'm fully aware I'll use a cursor and fetch, say, a few thousand rows at a time. My question is:
If I run an "open range" query: will I only get the result as it was when I started the query, or will new rows that fall in the range and are inserted while I run my fetches also show up?
(I tend to think that no, I won't see new rows, but I'd like confirmation...)
updated
It should not happen under any isolation level:
https://www.postgresql.org/docs/current/static/transaction-iso.html
but it is only guaranteed under the Serializable isolation level.
Well, I think when you run a query, you create a new transaction, and it will not see data from any other transaction until that transaction commits.
So, basically, "you only get the result as it was when you started the query".
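To make the cursor pattern concrete, here is a sketch in Postgres, assuming a placeholder table big_table with a ts timestamp column (names are illustrative, not from the question). The cursor reads from the snapshot taken when its query starts, so later inserts by other sessions do not appear in the fetches:

```sql
BEGIN;

DECLARE range_cur CURSOR FOR
    SELECT *
    FROM big_table
    WHERE ts >= '2020-01-01';   -- "open range": no upper bound

-- Rows inserted by other sessions after this point are not visible
-- through the cursor, even once those sessions commit.
FETCH 1000 FROM range_cur;
-- ... process the batch, then FETCH again until no rows remain ...

CLOSE range_cur;
COMMIT;
```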
I have a script that runs a function on every item in my database to extract academic citations. The database is large, so the script takes about a week to run.
During that time, items are added and removed from the database.
The database is too large to pull completely into memory, so I have to chunk through it to process all the items.
Is there a way to ensure that when the script finishes, all of the items have been processed? Is this a pattern with a simple solution? So far my research hasn't revealed anything useful.
PS: Locking the table for a week isn't an option!
I would add a timestamp column "modified_at" to the table which defaults to null. So any new item can be identified.
Your script can then pick the chunks to work on based on that column.
update items
  set modified_at = current_timestamp
from (
  select id
  from items
  where modified_at is null
  limit 1000 -- <<< this defines the size of each "chunk" that you work on
) t
where t.id = items.id
returning items.*;
This will update 1000 rows that have not been processed as being processed and will return those rows in one single statement. Your job can then work on the returned items.
New rows need to be added with modified_at = null and your script will pick them up based on the where modified_at is null condition the next time you run it.
If you also change items while processing them, you need to update the modified_at accordingly. In your script you will then need to store the last start of your processing somewhere. The next run of your script can then select items to be processed using
where modified_at is null
or modified_at < (last script start time)
If you process each item only once (and then never again), you don't really need a timestamp; a simple boolean (e.g. is_processed) would do as well.
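A sketch of that boolean variant, reusing the items table from above (the is_processed column name is illustrative; the FOR UPDATE SKIP LOCKED clause is an optional hardening in case several workers ever chunk through the table concurrently):

```sql
ALTER TABLE items
    ADD COLUMN is_processed boolean NOT NULL DEFAULT false;

-- Claim and return one chunk of unprocessed rows.
UPDATE items
SET is_processed = true
FROM (
    SELECT id
    FROM items
    WHERE is_processed = false
    LIMIT 1000
    FOR UPDATE SKIP LOCKED   -- concurrent workers skip claimed rows
) t
WHERE t.id = items.id
RETURNING items.*;
```

New rows get is_processed = false by default, so the next run picks them up automatically.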
When I execute a query and right click in the results area, I get a pop-up menu with the following options:
Save Grid as Report ...
Single Record View ...
Count Rows ...
Find/Highlight ...
Export ...
If I select "Count Rows", is there a way to interrupt the operation if it starts taking too long?
No, you don't seem to be able to.
When you select Count Rows from the context menu, it runs the count on the main UI thread, hanging the whole UI, potentially for minutes or hours.
It's best not to use that feature - just run select count(*) from (<your query here>), which executes properly on a separate thread and can be cancelled.
You can open a new instance of SQL Developer and kill the session counting the rows.
I do suggest using the select count(*) query though, as it is less painful in the long run.