Efficient way to reconstruct base table from changes - postgresql

I have a table of products (with IDs, ~15k records) and another table, price_changes (~88m records), which records each change in the price of a given productID at a given changedate.
I'm now interested in the price of each product at given points in time (say every 2 hours for a year, so ~4,300 points, resulting in ~64m data points of interest). While it's very straightforward to determine the price for a given product at a given time, determining all 64m data points is quite time-consuming.
My approach is to pre-populate a new target table fullprices with the data points of interest:
insert into fullprices(obsdate,productID)
select obsdate, productID from targetdates, products
and then update each price observation in this new table like this:
update fullprices f set price = (select price from price_changes where
productID = f.productID and date < f.obsdate
order by date desc
limit 1)
which should give me the most recent price change in each point in time.
Unfortunately, this takes ... well, ages. Is there any better way to do it?
== Edit: My tables are created as follows: ==
CREATE TABLE products
(
productID uuid NOT NULL,
name text NOT NULL,
CONSTRAINT products_pkey PRIMARY KEY (productID )
);
CREATE TABLE price_changes
(
id integer NOT NULL,
productID uuid NOT NULL,
price smallint,
date timestamp NOT NULL
);
CREATE INDEX idx_pc_date
ON price_changes USING btree
(date);
CREATE INDEX idx_pc_productID
ON price_changes USING btree
(productID);
CREATE TABLE targetdates
(
obsdate timestamp
);
CREATE TABLE fullprices
(
obsdate timestamp NOT NULL,
productID uuid NOT NULL,
price smallint
);
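One possible direction, offered as an untested sketch rather than a measured solution: the correlated subquery can only use the single-column indexes shown above, so each lookup has to sort or filter all changes for a product. A composite index on (productID, date) would let each lookup jump directly to the latest change before obsdate, and the insert and update can be merged into one statement with a LATERAL join. The index name idx_pc_product_date is hypothetical, and whether this is fast enough for ~64m rows would need to be checked with EXPLAIN ANALYZE on the real data.

```sql
-- Hypothetical index name; supports "latest change per product before a date".
CREATE INDEX idx_pc_product_date
    ON price_changes (productID, date DESC);

-- Fill fullprices in a single pass instead of INSERT followed by UPDATE.
INSERT INTO fullprices (obsdate, productID, price)
SELECT t.obsdate, p.productID, pc.price
FROM targetdates AS t
CROSS JOIN products AS p
LEFT JOIN LATERAL (
    -- Most recent price change strictly before the observation date.
    SELECT c.price
    FROM price_changes AS c
    WHERE c.productID = p.productID
      AND c.date < t.obsdate
    ORDER BY c.date DESC
    LIMIT 1
) AS pc ON true;  -- LEFT JOIN keeps points with no earlier change (price is NULL)
```

Writing each row once avoids the dead tuples that updating all 64m freshly inserted rows would leave behind.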

Related

PostgreSQL: ON CONFLICT DO UPDATE command cannot affect row a second time

I have two PostgreSQL tables:
CREATE TABLE source.staticprompts (
id INT,
projectid BIGINT,
scriptid INT,
promptnum INT,
prompttype VARCHAR(20),
inputs VARCHAR(2000),
attributes VARCHAR(2000),
text VARCHAR(2000),
corpuscode VARCHAR(2000),
comment VARCHAR(2000),
created TIMESTAMP,
modified TIMESTAMP
);
and
CREATE TABLE target.dim_collect_user_inp_configs (
collect_project_id BIGINT NOT NULL,
prompt_type VARCHAR(20),
prompt_input_desc VARCHAR(3000),
prompt_input_name VARCHAR(1000),
no_of_prompt_count BIGINT,
prompt_input_value VARCHAR(100),
prompt_input_value_id BIGSERIAL PRIMARY KEY,
script_id BIGINT,
corpuscode VARCHAR(20),
min_recordings VARCHAR(2000),
max_recordings VARCHAR(2000),
recordings_count VARCHAR(2000),
lease_duration VARCHAR(2000),
date_created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),
date_updated TIMESTAMP WITHOUT TIME ZONE,
CONSTRAINT must_be_unique UNIQUE (prompt_input_value, collect_project_id)
);
I need to copy data from source to target under these conditions:
Each value needs to be stored as one row in the dim_collect_user_inp_configs table. For example, Indoor-Loud is stored as one row with its own unique identifier in prompt_input_value_id, Indoor-Normal as another row with its own prompt_input_value_id, and so on through Semi-Outdoor-Whisper.
There can be multiple input “name” entries in one inputs column. Each name and its value need to be stored separately.
prompt_input_value_id: generate a unique sequence number for each combination of prompt_input_value and collect_project_id.
The source table has this data:
20030,input,m66,,null,"[{""desc"": ""Select the setting that you will do the recordings under."", ""name"": ""ambient"", ""type"": ""dropdown"", ""values"": ["""", ""Indoors + High + Loud"", ""Indoors + High + Normal"", ""Indoors + Low + Normal"", ""Indoors + Low + LowVolume"", ""Outdoors + High + Normal"", ""Outdoors + Low + Loud"", ""Outdoors + Low + Normal"", ""Outdoors + Low + LowVolume""]}, {""desc"": ""Select the noise type that you will do the recordings under."", ""name"": ""Noise type"", ""type"": ""dropdown"", ""values"": ["""", ""Human Speech"", ""Ambient Speech"", ""Non-Speech""]}]",,2018-12-13 13:49:24.408933,1,5,5906,2021-08-26 12:43:54.061000
I tried to do this with the following query:
INSERT INTO target.dim_collect_user_inp_configs AS t (
collect_project_id,
prompt_type,
prompt_input_desc,
prompt_input_name,
prompt_input_value,
script_id,
corpuscode)
SELECT
s.projectid,
s.prompttype,
el.inputs->>'name' AS name,
el.inputs->>'desc' AS description,
jsonb_array_elements(el.inputs->'values') AS value,
s.scriptid,
s.corpuscode
FROM source.staticprompts AS s,
jsonb_array_elements(s.inputs::jsonb) el(inputs)
ON CONFLICT
(prompt_input_value, collect_project_id)
DO UPDATE SET
(prompt_input_desc, prompt_input_name, date_updated) =
(EXCLUDED.prompt_input_desc,
EXCLUDED.prompt_input_name,
NOW())
WHERE t.prompt_input_desc != EXCLUDED.prompt_input_desc
OR t.prompt_input_name != EXCLUDED.prompt_input_name
RETURNING *;
But I get an error:
ON CONFLICT DO UPDATE command cannot affect row a second time Hint: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
Can you help me find the mistake?
Change the SELECT so that all rows with the same prompt_input_value and collect_project_id are grouped together; then each target row will be updated at most once. Use aggregate functions for all other columns.
Something like
SELECT s.projectid,
max(s.prompttype),
max(el.inputs->>'name') AS name,
max(el.inputs->>'desc') AS description,
v.value,
max(s.scriptid),
max(s.corpuscode)
FROM source.staticprompts AS s
CROSS JOIN LATERAL jsonb_array_elements(s.inputs::jsonb) AS el(inputs)
CROSS JOIN LATERAL jsonb_array_elements(el.inputs->'values') AS v(value)
GROUP BY s.projectid, v.value
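Put together with the original INSERT, the grouped SELECT might look like the sketch below. Two choices here are assumptions rather than part of the answer above: desc is mapped to prompt_input_desc and name to prompt_input_name (the question's query had them the other way round), and jsonb_array_elements_text is used so that values are stored without surrounding JSON quotes.

```sql
INSERT INTO target.dim_collect_user_inp_configs AS t (
    collect_project_id,
    prompt_type,
    prompt_input_desc,
    prompt_input_name,
    prompt_input_value,
    script_id,
    corpuscode)
SELECT s.projectid,
       max(s.prompttype),
       max(el.inputs->>'desc'),   -- assumption: desc goes to prompt_input_desc
       max(el.inputs->>'name'),   -- assumption: name goes to prompt_input_name
       v.value,
       max(s.scriptid),
       max(s.corpuscode)
FROM source.staticprompts AS s
CROSS JOIN LATERAL jsonb_array_elements(s.inputs::jsonb) AS el(inputs)
-- _text variant returns plain text instead of quoted jsonb strings
CROSS JOIN LATERAL jsonb_array_elements_text(el.inputs->'values') AS v(value)
-- one row per (project, value), so ON CONFLICT touches each target row at most once
GROUP BY s.projectid, v.value
ON CONFLICT (prompt_input_value, collect_project_id)
DO UPDATE SET
    (prompt_input_desc, prompt_input_name, date_updated) =
    (EXCLUDED.prompt_input_desc, EXCLUDED.prompt_input_name, NOW())
WHERE t.prompt_input_desc != EXCLUDED.prompt_input_desc
   OR t.prompt_input_name != EXCLUDED.prompt_input_name
RETURNING *;
```

Note that max() is applied to each column independently, so when the same value appears under several inputs, the stored desc and name may come from different inputs.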

postgresql group by datetime in join query

I have two tables in my PostgreSQL TimescaleDB database (version 12.06) that I am trying to query with an inner join.
Tables' structure:
CREATE TABLE currency(
id serial PRIMARY KEY,
symbol TEXT NOT NULL,
name TEXT NOT NULL,
quote_asset TEXT
);
CREATE TABLE currency_price (
currency_id integer NOT NULL,
dt timestamp WITHOUT time ZONE NOT NULL,
open NUMERIC NOT NULL,
high NUMERIC NOT NULL,
low NUMERIC NOT NULL,
close NUMERIC,
volume NUMERIC NOT NULL,
PRIMARY KEY (
currency_id,
dt
),
CONSTRAINT fk_currency FOREIGN KEY (currency_id) REFERENCES currency(id)
);
The query I'm trying to make is:
SELECT currency_id AS id, symbol, MAX(close) AS close, DATE(dt) AS date
FROM currency_price
JOIN currency ON
currency.id = currency_price.currency_id
GROUP BY currency_id, symbol, date
LIMIT 100;
Basically, it returns all the rows that exist in the currency_price table. I know that Postgres doesn't allow selecting columns unless they are wrapped in an aggregate function or listed in the GROUP BY clause. So if I don't include the dt column in my SELECT, I receive the expected results, but if I include it, the output shows a row for every single day of each currency, while I only want the max value for each currency and to filter by various dates afterwards.
I'm very inexperienced with SQL in general.
Any suggestions to solve this would be much appreciated.
There are several ways to do it; the easiest one that comes to mind is using window functions.
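select *
from (
    SELECT currency_id, symbol, close, dt,
           -- rank rows per currency: highest close first, newest dt breaks ties;
           -- close is nullable, so NULLs are pushed to the end
           row_number() over (partition by currency_id, symbol
                              order by close desc nulls last, dt desc) as rr
    FROM currency_price
    JOIN currency ON currency.id = currency_price.currency_id
    where dt::date = '2021-06-07'
) q1
where rr = 1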
General window functions:
https://www.postgresql.org/docs/9.5/functions-window.html
Window functions also work with standard aggregate functions such as SUM, AVG, MAX, MIN and others.
Some examples: https://www.postgresqltutorial.com/postgresql-window-function/
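As another of the "several ways", PostgreSQL's DISTINCT ON can return the row with the highest close per currency together with the date on which it occurred. This is a sketch only: the date range is a made-up example, and NULLS LAST is added because close is nullable in the table definition.

```sql
SELECT DISTINCT ON (cp.currency_id)
       cp.currency_id AS id,
       c.symbol,
       cp.close,
       cp.dt::date AS date
FROM currency_price AS cp
JOIN currency AS c ON c.id = cp.currency_id
-- hypothetical filter: restrict to the period of interest before picking the max
WHERE cp.dt >= DATE '2021-06-01'
  AND cp.dt <  DATE '2021-07-01'
-- DISTINCT ON keeps the first row per currency_id in this order:
-- highest close first, most recent dt as the tie-breaker
ORDER BY cp.currency_id, cp.close DESC NULLS LAST, cp.dt DESC;
```

The leading ORDER BY expression must match the DISTINCT ON expression, which is why currency_id comes first.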

Update and insert performance with partial indexes

I have different queries for fetching data from a large table (about 100-200M rows). Because I know each query in advance, I've created partial indexes on the table with predicates tailored to each one.
For example, the table is similar to this:
CREATE TABLE public.contacts (
id int8 NOT NULL DEFAULT ssng_generate_id(8::bigint),
created timestamp NOT NULL DEFAULT timezone('UTC'::text, now()),
contact_pool_id int8 NOT NULL,
project_id int8 NOT NULL,
state_id int4 NOT NULL DEFAULT 10,
order_x int4 NOT NULL,
next_attempt_date timestamp NULL,
CONSTRAINT contacts_pkey PRIMARY KEY (id)
);
And there are two types of query:
SELECT * FROM contacts WHERE contact_pool_id = X AND state_id = 10 ORDER BY order_x LIMIT 1;
and
SELECT * FROM contacts WHERE contact_pool_id = X AND state_id = 20 AND next_attempt_date <= NOW() ORDER BY next_attempt_date LIMIT 1;
For those queries I've created partial indexes:
For state_id = 10 (new contacts)
CREATE INDEX ix_contacts_cpid_orderx_id_for_new ON contacts USING btree (contact_pool_id, order_x, id) WHERE state_id = 10;
For state_id = 20 (available contacts)
CREATE INDEX ix_contacts_cpid_nextattepmdate_id_for_available ON contacts USING btree (contact_pool_id, next_attempt_date, id) WHERE state_id = 20;
For me, those partial indexes are faster than a single index.
What about update and insert performance? If I change a row with state_id = 20, will it affect only index 2 (for available contacts), or will both indexes be affected?
Partial indexes which are not relevant to the tuple will not get updated.
If PostgreSQL can do a HOT update (none of the changed columns is referenced by any index, including in a partial index's WHERE predicate, and there is room on the same page for the new tuple), then even the relevant index doesn't need to get updated.
Yes, with a partial index you only pay the overhead of modifying the index for rows that meet the WHERE condition, so any single write touches at most one of the two indexes (unless you change state_id from 10 to 20 or vice versa).
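To see how this plays out on a real table, the statistics views can be queried as in the sketch below. The views and columns are standard PostgreSQL; the fillfactor value is only an illustrative assumption, not a recommendation.

```sql
-- Share of updates on contacts that were HOT (no index maintenance needed).
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'contacts';

-- Size and usage of each index on contacts, including the partial ones.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'contacts';

-- Assumed example value: leaving free space in each page makes it more likely
-- that the new tuple version fits on the same page, a precondition for HOT.
ALTER TABLE contacts SET (fillfactor = 90);
```

The fillfactor change only applies to pages written afterwards; existing pages keep their current layout until the table is rewritten.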

Unique constraint on single field of a custom type in postgres

I have an entity price in my schema. It has an attribute amount, which is of a custom type money_with_currency.
The money_with_currency type is basically (amount bigint, currency char(3)).
The price entity belongs to a product. What I want to do is create a unique constraint on the combination of product_id (foreign key) + currency. How can I do this?
Referencing a single field of a record type is a bit tricky:
CREATE TYPE money_with_currency AS (amount bigint, currency char(3));
CREATE TABLE product_price
(
product_id integer not null references product,
price money_with_currency not null
);
CREATE UNIQUE INDEX ON product_price(product_id, ((price).currency));
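A short usage sketch of the index above, assuming a product table exists and contains a row with id 1 (neither is shown in the question):

```sql
-- Two prices for the same product in different currencies: both accepted.
INSERT INTO product_price VALUES (1, ROW(1999, 'USD')::money_with_currency);
INSERT INTO product_price VALUES (1, ROW(1799, 'EUR')::money_with_currency);

-- A second USD price for product 1 is rejected with a unique violation,
-- because (product_id, (price).currency) already contains (1, 'USD').
INSERT INTO product_price VALUES (1, ROW(2099, 'USD')::money_with_currency);

-- The same parenthesized syntax is needed to read a single field.
SELECT product_id, (price).amount, (price).currency
FROM product_price;
```

The parentheses around price are required so that the parser treats it as a column of composite type rather than as a table name.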

Postgres sort expression

I have a table with goods:
CREATE TABLE public.goods (
"id" bigserial NOT NULL,
title varchar(250) NOT NULL,
cost numeric(10,2),
PRIMARY KEY ("id")
);
Now I want to sort this table by title but put all goods with cost 0 at the end of the list. Is this possible?
If I try to use:
ORDER BY
cost DESC,
title ASC
the rows are sorted by cost first, so the titles come out in the wrong order.
One way to do this is to use a CASE expression when ordering which places the block of records having a zero cost at the bottom. Then, within each block (either zero cost or non-zero cost), the records can be sorted alphabetically by the title.
SELECT cost, title
FROM public.goods
ORDER BY CASE WHEN cost = 0 THEN 1 ELSE 0 END,
title
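A PostgreSQL-specific shorthand for the same idea is to sort on the boolean expression itself, since false sorts before true. The sample rows below are invented for illustration.

```sql
-- Invented sample data for the goods table from the question.
INSERT INTO public.goods (title, cost) VALUES
    ('Apple',   5.00),
    ('Banana',  0.00),
    ('Cherry',  2.50),
    ('Apricot', 0.00);

-- (cost = 0) is false for priced goods and true for free ones,
-- so priced goods come first; title orders rows within each group.
SELECT cost, title
FROM public.goods
ORDER BY (cost = 0), title;
-- Order of titles: Apple, Cherry, Apricot, Banana
```

One difference from the CASE version: cost is nullable, and for a NULL cost the expression (cost = 0) is NULL, which sorts after both groups, whereas the CASE expression places such rows with the non-zero goods.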