optimal way to stream trades data out of a postgres database - postgresql

I have a table with a very simple schema:
(
instrument varchar(20) not null,
ts timestamp not null,
price double precision not null,
quantity double precision not null,
direction integer not null,
id serial
constraint trades_pkey
primary key
);
It stores a list of trades done on various instruments.
You can have multiple trades on a single timestamp and also the timestamps are not regular; it's possible to have 10 entries on the same millisecond and then nothing for 2 seconds, etc.
When the client starts, I would like to accomplish two things:
Load the last hour of data.
Stream all the new updates.
The client processes the trades one by one, as if they were coming from a queue. They are sorted by instrument, and each instrument has its own queue, which expects each trade to be the one following the previous one.
Solution A:
I did a query to find the id at now - 1 hour, then queried all rows with id >= start id, and then looped to get all rows with id > last id.
This does not work:
the row ids and timestamps do not match: sometimes an older timestamp gets a higher row id, etc. I guess this is due to writes being done on multiple threads. Getting data by id doesn't guarantee I will get the trades in order, and while I can sort one batch I receive, I can't be sure that the next batch will not contain an older row.
Solution B:
I can make a query loop that takes the last timestamp received, subtracts 1 second and queries again, etc. I can sort the data in the client and, for each instrument, discard all rows older than the last one processed.
Not very efficient, but that will work.
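A minimal sketch of that loop query (the table name trades is assumed here; :last_seen_ts stands for the most recent timestamp the client has received):

-- Solution B, illustrative only: re-query with a 1-second overlap and let the client
-- drop, per instrument, the rows it has already processed
SELECT instrument, ts, price, quantity, direction
FROM trades
WHERE ts >= :last_seen_ts - interval '1 second'
ORDER BY instrument, ts;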
Solution C:
I can make a query per instrument (there are 22 of them), ordered by timestamp. Can 22 subqueries be grouped into a single one?
Or, is there another solution?

You could try a bigserial (auto-incrementing) column to ensure each row is numbered in order as it is inserted.
Since this number is handled by Postgres, you should be fine to get a guaranteed ordering on your data.
On the client side you just store (maybe in a separate metadata table) the latest serial number you have seen, then query everything larger than that and keep your metadata table up to date.
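A minimal sketch of that approach; the trades table name, the client_progress metadata table and the client key are all illustrative:

-- metadata table remembering the last serial number this client has processed
CREATE TABLE client_progress (client text PRIMARY KEY, last_seen_id bigint NOT NULL);

-- polling loop: fetch everything newer than the stored high-water mark
SELECT t.*
FROM trades t
WHERE t.id > (SELECT last_seen_id FROM client_progress WHERE client = 'queue-worker')
ORDER BY t.id;

-- after processing a batch, record the new high-water mark
UPDATE client_progress SET last_seen_id = $1 WHERE client = 'queue-worker';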

Related

Scalable approach to calculating balances over thousands of records in PostgreSQL

I'm facing the challenge of generating 'balance' values from thousands of entries in a PG table.
The rows in the table have many different columns, each useful in calculating that row's contribution to the balance. Each row/entry belongs to some profile. I need to calculate the balance value for some profile, from all entries belonging to that profile, according to some set of rules. Complexity should be O(N), N being the number of entries that belong to the profile.
The different approaches I took:
Fetching the rows and calculating the balances on the backend. This doesn't scale well and degrades quickly, depending on the number of entries that belong to the profile. While fetching the entries is initially fast, once a profile has over 10,000 entries it becomes prohibitively slow.
I figured that a lot of time is being spent on transport; additionally, we don't really need the rows, only the balances. Since we already do the work of finding the entries, we can calculate the balance in the database and save time on backend calculations as well as the transport of thousands of rows, which led to the second approach:
The second approach was creating a PG query that iterates over the rows and calculates the balance. This has proven to be more scalable when there are many entries per profile. However, probably due to the complexity of the PG query, this approach puts a lot of load on the database. It's enough to run 3-4 of these queries concurrently to max out the database CPU.
The third approach is to create a PL/pgSQL function to loop over the relevant entries and return the rows, hoping to reduce the impact on the database. This is the next thing I want to try.
The main question is: what would be the most efficient way to achieve this while being 'database friendly'?
Additionally:
Do you think these approaches are sane?
Am I missing another obvious solution?
Is it unlikely that I can improve on the performance of the query with a function looping over the same rows, or is it worth trying?
I realize I haven't provided a lot of concrete data, but I figured that since this is probably a common problem, maybe the issue can be understood from a general description.
EDIT:
To be a bit more specific, I'm dealing with the following data:
CREATE TABLE entries (
profileid bigint NOT NULL,
programid bigint NOT NULL,
ledgerid text NOT NULL, -- this provides further granularity, on top of 'programid'
startdate timestamptz,
enddate timestamptz,
amount numeric NOT NULL
);
What I want to get is the balances for a certain profileid, separated by (programid, ledgerid).
The desired form is:
RETURNS TABLE (
programid bigint,
ledgerid text,
currentbalance numeric,
pendingbalance numeric,
expiredbalance numeric,
spentbalance numeric
)
The four balance values are produced by applying arithmetic to certain entries. For example, a negative amount would only add to spentbalance, expiredbalance is generated from entries that have a positive amount and an enddate after now(), etc...
While I did manage to create a very large aggregate query with many calls to COALESCE(SUM(CASE WHEN ... amount), 0), I was wondering if I would gain anything from porting that logic into a PL/pgSQL function. However, when trying to implement this function, I realized I don't know how to iterate over the result of one query and return a result set with different columns and rows. Should I use a temp table for this? It seems like overkill, as this query is expected to execute tens of times every second...
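For reference, the grouped aggregate could take roughly this shape; only the spent/expired rules mentioned above are reflected, and the other CASE conditions are illustrative placeholders, not the actual business logic:

-- a minimal sketch of the single-pass grouped aggregate for one profile
SELECT programid,
       ledgerid,
       COALESCE(SUM(CASE WHEN amount > 0 AND startdate <= now() THEN amount END), 0) AS currentbalance, -- placeholder rule
       COALESCE(SUM(CASE WHEN startdate > now() THEN amount END), 0)                 AS pendingbalance, -- placeholder rule
       COALESCE(SUM(CASE WHEN amount > 0 AND enddate > now() THEN amount END), 0)    AS expiredbalance,
       COALESCE(SUM(CASE WHEN amount < 0 THEN amount END), 0)                        AS spentbalance
FROM entries
WHERE profileid = $1
GROUP BY programid, ledgerid;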

Can I build a blob in Postgres?

I have a scenario where a db has a few hundred million rows, keeping a history of a few weeks only.
In this database are different products (tradable instruments); each has about 72k rows of data per hour, and there are roughly 30 products.
The data is always requested by blocks of 1h, aligned on 1h. For example, "I want data for X from 2pm to 3pm".
This data is processed by several tools and the requests are very demanding for the database.
Each tool does its own disk caching, building a binary blob for each hour.
But I was wondering if it would be possible to build these directly in Postgres?
Data is indexed by timestamp and is written in a linear fashion as the writes represent live data.
So it would be possible to detect with a trigger that we just crossed an hour.
Would it be possible, when we detect this, to get all this data, build a binary blob out of it and save it in its own table? The data is simply all the columns one after another in binary form. They're all numbers, no strings, etc., so the alignment / format is very simple and rigid.
In practice the rows are like this:
instrument VARCHAR NOT NULL,
ts TIMESTAMP WITHOUT TIME ZONE NOT NULL,
quantity FLOAT8 NOT NULL,
price FLOAT8 NOT NULL,
direction INTEGER NOT NULL
and I would like, at the end of each hour to build a byte array that's like this:
0 4 12 20 24
|ts|quantity|price|direction|ts|quantity|price|direction...
with every row of the hour. Build one blob per instrument and write it in a table like this:
instrument VARCHAR
ts TIMESTAMP
blob BYTEA
My questions are:
Is this possible? Or would it be very inefficient to fetch 30 (products) * 72k rows each, aggregate them and save them every hour?
Is there anywhere I could find an example leading me toward this?
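Sketching one possibility, hedged: PostgreSQL's binary send functions (float8send, int4send, int8send) return bytea, and string_agg can concatenate bytea, so a scheduled job (or a trigger that notices the hour boundary has been crossed) could assemble the blobs server-side. The trades and hourly_blobs table names below are assumptions, and the timestamp is packed as 8-byte epoch milliseconds rather than the 4-byte slot sketched above:

-- a hedged sketch: at the end of each hour, build one blob per instrument for the hour
-- that just ended; note the *send functions emit big-endian bytes, so the reader must match
INSERT INTO hourly_blobs (instrument, ts, blob)
SELECT instrument,
       date_trunc('hour', localtimestamp - interval '1 hour'),
       string_agg(int8send((extract(epoch FROM ts) * 1000)::bigint)
                  || float8send(quantity)
                  || float8send(price)
                  || int4send(direction),
                  ''::bytea ORDER BY ts)
FROM trades
WHERE ts >= date_trunc('hour', localtimestamp - interval '1 hour')
  AND ts <  date_trunc('hour', localtimestamp)
GROUP BY instrument;

Whether this is efficient enough for 30 * 72k rows per hour is something to measure; with an index on ts the aggregation is essentially one pass over the hour's rows.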

What is the best way to move millions of data from one postgres database to another?

So we have a task at the moment where I need to move millions of records from one database to another.
To complicate things slightly I need to change an id on each record before inserting the data.
How it works is we have 100 stations in database a.
Each station contains 30+ sensors.
Each sensor contains readings for about the last 10 years.
These readings are anywhere from a 15-minute interval to a daily interval.
So each station can have at least 5m records.
database b has the same structure as database a.
The reading table contains the following fields
id: primary key
sensor_id: int4
value: numeric(12)
time: timestamp
What I have done so far for one station is:
Connect to database a and select all readings for station 1
Find all corresponding sensors in database b
Change the sensor_id from database a to its new sensor_id from database b
Chunk the updated sensor_id data to groups of about 5000 parameters
Loop over the chunks and do a mass insert
In theory, this should work.
However, I am getting errors saying "duplicate key violates unique constraint".
If I query the database on those records that are failing, the data doesn't exist.
The weird thing about this is that if I run the script 4 or 5 times in a row, all the data eventually gets in there. So I am at a loss as to why I would be receiving this error, because it doesn't seem accurate.
Is there a way I can get around this error from happening?
Is there a more efficient way of doing this?
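For the insert step itself, one server-side way to express the remap is sketched below. It assumes the source readings are visible in database b (for example through postgres_fdw as readings_a) and that a sensor_map(old_sensor_id, new_sensor_id) table holds the id translation; all of these names are made up:

-- illustrative only: copy readings for one station, translating sensor ids on the way in
INSERT INTO readings (sensor_id, value, time)
SELECT m.new_sensor_id, r.value, r.time
FROM readings_a r
JOIN sensor_map m ON m.old_sensor_id = r.sensor_id
ON CONFLICT DO NOTHING;  -- if a unique constraint is what trips the duplicate-key error, this makes re-runs harmless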

how can I stream rows, in a loop, from Postgres? (F#/.NET but any other language is fine)

I have a set of data that looks like this:
(
instrument varchar(20) NOT NULL,
ts timestamp without time zone NOT NULL,
price float8 NOT NULL,
quantity float8 NOT NULL,
direction int NOT NULL
);
and I'd like to keep in my app the last hour of data; so upon startup, query everything where ts >= now - 1h, and then keep a loop where I query from the last row received to 'now'.
We're talking about roughly 1.5M rows to fetch at startup.
The issue is that the timestamp is not unique: you can have multiple rows with the same timestamp.
I am requesting an update every second with a limit of 50k, and it usually produces 200-500 rows; but the startup is providing batches of 50k rows until it catches up with the new data.
Should I:
add an id, find the id of now - 1h, and request records with an id higher than the last one received?
roll back the last timestamp received by 1s and deal with duplicates
something better I didn't think about (I'm not very knowledgeable with SQL DBs)
You have the right idea.
If there is a single writer (which is to say, a single client INSERTing to this table), you can use an autoincrementing ID to avoid having to deal with duplicates. Add an ID column, query for the last hour, keep your largest-seen ID in memory, and then run your loop of id > largest_seen_id every second. This is pretty standard, but does rely on IDs always becoming visible to the querier in increasing order, which isn't something the DB can guarantee in the general case.
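A sketch of that single-writer loop, assuming the table is called trades and an id column has been added:

-- startup: load the last hour and remember the largest id seen so far
SELECT * FROM trades WHERE ts >= now() - interval '1 hour' ORDER BY id;

-- then every second (largest_seen_id is tracked client-side)
SELECT * FROM trades WHERE id > :largest_seen_id ORDER BY id LIMIT 50000;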
If there are multiple writers, it's possible for this approach to skip rows, as a higher ID might be committed while another client still has its lower-ID row in an open transaction, not yet visible to your query. Your next iteration would use the higher ID as the lower bound, so you never end up seeing the committed lower-ID row. The easiest way of dealing with that is what you're already thinking of with your second option: query for extra rows and ignore the ones you already have. If network transfer is an issue, you can do the ignoring in the WHERE clause of the loop query, so you're only "wasting" the size of the already-seen row IDs instead of all the row data:
SELECT *
FROM my_appending_table
WHERE timestamp > ('2021-08-24 00:10:07'::timestamp - interval '5 seconds')
  AND NOT id = ANY (ARRAY[99101, 99105, 99106])
ORDER BY id ASC
LIMIT 50000
Where 2021-08-24 00:10:07 is the max timestamp you currently have (probably fine to use the timestamp of the max-ID row; not guaranteed to be the same but basically always will be), 5 seconds is a fuzz to ensure all other writers have committed, and 99101,99105,99106 are the IDs you've already seen for rows with timestamp 2021-08-24 00:10:02 (:07 - 5) or later.

Keep table synced with another but with accumulated / grouped data

If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
    ts TIMESTAMPTZ NOT NULL,
    value FLOAT8 DEFAULT 'NaN' NOT NULL,
    sensor_id INT4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often, though, I need to extract and work with daily values over years (on a web frontend). To aid this, I would like a sensor_values_days table that only has the daily sums for each point; then I can use this for faster queries over longer timespans.
I don't want a trigger for every write to the db, as I am afraid that would slow down writes to the db, which are already the bottleneck.
Is there a way to trigger only after so many rows have been inserted?
Or perhaps an index that maintains a sum of entries over days? I don't think that is possible.
What would be the best way to do this? It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this.
Install ClickHouse and use the AggregatingMergeTree table type.
With postgres:
Create per-period aggregate table. You can have several with different granularity, like hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for periods that came after the last available one. This process also works if the per-period table is empty, or if it missed the last update; it will simply catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamp of the rows that were aggregated, so later if you check the table you see it did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, that should help!
Then, repeat the same process for the "day" and "month" table.
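A sketch of such a job for the daily granularity; the sensor_values_days table and its columns are assumed, and the query is meant to run shortly after midnight:

-- aggregate every complete day that is not yet in the per-day table
INSERT INTO sensor_values_days (day, sensor_id, value_sum)
SELECT date_trunc('day', ts) AS day, sensor_id, sum(value)
FROM sensor_values
WHERE ts >= COALESCE((SELECT max(day) + interval '1 day' FROM sensor_values_days), '-infinity')
  AND ts <  date_trunc('day', now())   -- only complete days, so this job only ever inserts
GROUP BY 1, 2;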
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) to the results of the live table, but only pull the current day out of the live table, since all the previous days' worth of data have been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
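Illustratively, with the same assumed per-day table:

-- per-day history from the aggregate table, plus today's live rows
SELECT day, sensor_id, value_sum
FROM sensor_values_days
UNION ALL
SELECT date_trunc('day', ts), sensor_id, sum(value)
FROM sensor_values
WHERE ts >= date_trunc('day', now())
GROUP BY 1, 2;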
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized views and a cron job every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
In PG14, we will have INCREMENTAL MATERIALIZED VIEW, but for the moment it is in development.