PostgreSQL: ON CONFLICT DO UPDATE command cannot affect row a second time

I have two PostgreSQL tables:
CREATE TABLE source.staticprompts (
id INT,
projectid BIGINT,
scriptid INT,
promptnum INT,
prompttype VARCHAR(20),
inputs VARCHAR(2000),
attributes VARCHAR(2000),
text VARCHAR(2000),
corpuscode VARCHAR(2000),
comment VARCHAR(2000),
created TIMESTAMP,
modified TIMESTAMP
);
and
CREATE TABLE target.dim_collect_user_inp_configs (
collect_project_id BIGINT NOT NULL,
prompt_type VARCHAR(20),
prompt_input_desc VARCHAR(3000),
prompt_input_name VARCHAR(1000),
no_of_prompt_count BIGINT,
prompt_input_value VARCHAR(100),
prompt_input_value_id BIGSERIAL PRIMARY KEY,
script_id BIGINT,
corpuscode VARCHAR(20),
min_recordings VARCHAR(2000),
max_recordings VARCHAR(2000),
recordings_count VARCHAR(2000),
lease_duration VARCHAR(2000),
date_created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),
date_updated TIMESTAMP WITHOUT TIME ZONE,
CONSTRAINT must_be_unique UNIQUE (prompt_input_value, collect_project_id)
);
I need to copy data from source to target under these conditions:
Each value needs to be stored as one row in the dim_collect_user_inp_configs table. For example, Indoor-Loud is one row with its own unique identifier as prompt_input_value_id, Indoor-Normal is another row with its own unique identifier as prompt_input_value_id, and so on through Semi-Outdoor-Whisper.
There can be multiple input "name" entries in one inputs column. Each name and its values need to be stored separately.
prompt_input_value_id - generate a unique sequence number for each combination of prompt_input_value and collect_project_id.
The source table has this data:
20030,input,m66,,null,"[{""desc"": ""Select the setting that you will do the recordings under."", ""name"": ""ambient"", ""type"": ""dropdown"", ""values"": ["""", ""Indoors + High + Loud"", ""Indoors + High + Normal"", ""Indoors + Low + Normal"", ""Indoors + Low + LowVolume"", ""Outdoors + High + Normal"", ""Outdoors + Low + Loud"", ""Outdoors + Low + Normal"", ""Outdoors + Low + LowVolume""]}, {""desc"": ""Select the noise type that you will do the recordings under."", ""name"": ""Noise type"", ""type"": ""dropdown"", ""values"": ["""", ""Human Speech"", ""Ambient Speech"", ""Non-Speech""]}]",,2018-12-13 13:49:24.408933,1,5,5906,2021-08-26 12:43:54.061000
I tried to do this with this query:
INSERT INTO target.dim_collect_user_inp_configs AS t (
collect_project_id,
prompt_type,
prompt_input_desc,
prompt_input_name,
prompt_input_value,
script_id,
corpuscode)
SELECT
s.projectid,
s.prompttype,
el.inputs->>'name' AS name,
el.inputs->>'desc' AS description,
jsonb_array_elements(el.inputs->'values') AS value,
s.scriptid,
s.corpuscode
FROM source.staticprompts AS s,
jsonb_array_elements(s.inputs::jsonb) el(inputs)
ON CONFLICT
(prompt_input_value, collect_project_id)
DO UPDATE SET
(prompt_input_desc, prompt_input_name, date_updated) =
(EXCLUDED.prompt_input_desc,
EXCLUDED.prompt_input_name,
NOW())
WHERE t.prompt_input_desc != EXCLUDED.prompt_input_desc
OR t.prompt_input_name != EXCLUDED.prompt_input_name
RETURNING *;
But I get an error:
ON CONFLICT DO UPDATE command cannot affect row a second time Hint: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
Can you help me find the mistake?

Change the SELECT so that all rows with the same prompt_input_value and collect_project_id are grouped together; then each target row will be updated at most once. Use aggregate functions for all other columns.
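For background, this error appears whenever a single INSERT proposes two or more rows that collide on the constrained columns, which is exactly what happens here when several JSON array elements expand to the same (prompt_input_value, collect_project_id). A contrived reproduction (t_demo is a hypothetical table used only for illustration):
CREATE TEMP TABLE t_demo (k int UNIQUE, v text);
-- Both proposed rows collide on k = 1, so ON CONFLICT would have to
-- update the same just-inserted row twice, which Postgres refuses to do.
INSERT INTO t_demo (k, v) VALUES (1, 'a'), (1, 'b')
ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
-- ERROR: ON CONFLICT DO UPDATE command cannot affect row a second time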
Something like
SELECT s.projectid,
max(s.prompttype),
max(el.inputs->>'name') AS name,
max(el.inputs->>'desc') AS description,
v.value,
max(s.scriptid),
max(s.corpuscode)
FROM source.staticprompts AS s
CROSS JOIN LATERAL jsonb_array_elements(s.inputs::jsonb) AS el(inputs)
CROSS JOIN LATERAL jsonb_array_elements(el.inputs->'values') AS v(value)
GROUP BY s.projectid, v.value
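Put together with the original INSERT and ON CONFLICT clause, the whole statement could look like this (an untested sketch; #>> '{}' is used to unwrap each jsonb scalar to plain text before it goes into the varchar column):
INSERT INTO target.dim_collect_user_inp_configs AS t (
    collect_project_id, prompt_type, prompt_input_desc,
    prompt_input_name, prompt_input_value, script_id, corpuscode)
SELECT s.projectid,
       max(s.prompttype),
       max(el.inputs->>'name'),   -- feeds prompt_input_desc, as in the original query;
       max(el.inputs->>'desc'),   -- swap these two if that mapping was unintended
       v.value #>> '{}',          -- jsonb scalar unwrapped to text
       max(s.scriptid),
       max(s.corpuscode)
FROM source.staticprompts AS s
CROSS JOIN LATERAL jsonb_array_elements(s.inputs::jsonb) AS el(inputs)
CROSS JOIN LATERAL jsonb_array_elements(el.inputs->'values') AS v(value)
GROUP BY s.projectid, v.value
ON CONFLICT (prompt_input_value, collect_project_id)
DO UPDATE SET
    (prompt_input_desc, prompt_input_name, date_updated) =
    (EXCLUDED.prompt_input_desc, EXCLUDED.prompt_input_name, NOW())
WHERE t.prompt_input_desc != EXCLUDED.prompt_input_desc
   OR t.prompt_input_name != EXCLUDED.prompt_input_name
RETURNING *;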

Related

Is it possible to look at the output of the previous row of a PostgreSQL query?

This is the question: is it possible to look at the output, i.e. what has been selected, from the previous row of a running SQL query in Postgres?
I know that lag exists to look at the inputs, the "from" of the query. I also know that a CTE, subquery or lateral join can solve most issues of this kind. But I think the problem I'm facing genuinely requires a peek at the output of the previous row. Why? Because the output of the current row depends on a constant from a lookup table, and the value used to look up that constant is an aggregate of all the previous rows. And if that lookup returns the wrong constant, all subsequent rows will be increasingly off from the expected value.
The whole rest of this text is a simplified example based on the problem I'm facing. It should be possible to paste it into PostgreSQL 12 or above and play around. I'm terribly sorry that it is as complicated as it is, but I think it is the simplest I can make it while still retaining the core issue: a lookup in a lookup table based on an aggregate of all previous rows, plus the fact that the "inventory" being tracked is modeled as a series of transactions of two discrete types.
The database itself exists to keep track of multiple fish farms, or cages full of fish. Fish can be moved/transferred between these farms, and the farms are fed roughly daily. Why not just carry the aggregate as a field in the table? Because it should be possible to switch out the lookup table after the season is over, to adjust it to better match reality.
-- A listing of all groups of fish ever grown.
create table farms (
id bigserial primary key,
start timestamp not null,
stop timestamp
);
insert into farms
(id, start)
values (
1, '2021-02-01T13:37'
);
-- A transfer of fish from one farm to another.
-- If the source is null the fish is transferred from another fishery outside our system.
-- If the destination is null the fish is being slaughtered, removed from the system.
create table transfers (
source bigint references farms(id),
destination bigint references farms(id),
timestamp timestamp not null default current_timestamp,
total_weight_g bigint not null constraint positive_nonzero_total_weight_g check (total_weight_g > 0),
average_weight_g bigint not null constraint positive_nonzero_average_weight_g check (average_weight_g > 0),
number_fish bigint generated always as (total_weight_g / average_weight_g) stored
);
insert into transfers
(source, destination, timestamp, total_weight_g, average_weight_g)
values
(null, 1, '2021-02-01T16:38', 5, 5),
(null, 1, '2021-02-15T16:38', 500, 500);
-- Transactions of fish feed into a farm.
create table feedings (
id bigserial primary key,
growth_table bigint not null,
farm bigint not null references farms(id),
amount_g bigint not null constraint positive_nonzero_amunt_g check (amount_g > 0),
timestamp timestamp not null
);
insert into feedings
(farm, growth_table, amount_g, timestamp)
values
(1, 1, 1, '2021-02-02T13:37'),
(1, 1, 1, '2021-02-03T13:37'),
(1, 1, 1, '2021-02-04T13:37'),
(1, 1, 1, '2021-02-05T13:37'),
(1, 1, 1, '2021-02-06T13:37'),
(1, 1, 1, '2021-02-07T13:37');
create view combined_feed_and_transfer_history as
with transfer_history as (
select timestamp, destination as farm, total_weight_g, average_weight_g, number_fish
from transfers as deposits
where deposits.destination = 1 -- TODO: This view only works for one farm, fix that.
union all
select timestamp, source as farm, -total_weight_g, -average_weight_g, -number_fish
from transfers as withdrawals
where withdrawals.source = 1
)
select timestamp, farm, total_weight_g, number_fish, average_weight_g, null as growth_table
from transfer_history
union all
select timestamp, farm, amount_g, 0 as number_fish, 0 as average_weight_g, growth_table
from feedings
order by timestamp;
-- Conversion tables from feed to gained weight.
create table growth_coefficients (
growth_table bigserial not null,
average_weight_g bigint not null constraint positive_nonzero_weight check (average_weight_g > 0),
feed_conversion_rate double precision not null constraint positive_foderkonverteringsfaktor check (feed_conversion_rate >= 0),
primary key(growth_table, average_weight_g)
);
insert into growth_coefficients
(average_weight_g, feed_conversion_rate, growth_table)
values
(5.00,0.10,1),
(10.00,10.00,1),
(20.00,1.30,1),
(50.00,1.31,1),
(100.00,1.32,1),
(300.00,1.36,1),
(600.00,1.42,1),
(1000.00,1.50,1),
(1500.00,1.60,1),
(2000.00,1.70,1),
(2500.00,1.80,1),
(3000.00,1.90,1),
(4000.00,2.10,1),
(5000.00,2.30,1);
-- My current solution is a bad one. It does a CTE that sums over all events but does not account
-- for the feed conversion rate. That means that the average weight used to look up the feed
-- conversion rate will diverge more and more from reality the further into the season time goes.
-- This is why it is important to look at the output, the average weight, of the previous row.
-- We start by summing up all the transfer and feed events to get a rough average_weight_g.
with estimate as (
select
timestamp,
farm,
total_weight_g as transaction_size_g,
growth_table,
sum(total_weight_g) over (order by timestamp) as sum_weight_g,
sum(number_fish) over (order by timestamp) as sum_number_fish,
sum(total_weight_g) over (order by timestamp) / sum(number_fish) over (order by timestamp) as average_weight_g
from
combined_feed_and_transfer_history
)
select
timestamp,
sum_number_fish,
transaction_size_g as trans_g,
sum_weight_g,
closest_lookup_table_weight.average_weight_g as lookup_g,
converted_weight_g as conv_g,
sum(converted_weight_g) over (order by timestamp) as sum_conv_g,
sum(converted_weight_g) over (order by timestamp) / sum_number_fish as sum_average_g
from
estimate
join lateral ( -- We then use this estimated_average_weight to look up the closest constant in the growth coefficient table.
    (select gc.average_weight_g - estimate.average_weight_g as diff, gc.average_weight_g
     from growth_coefficients gc
     where gc.average_weight_g >= estimate.average_weight_g
     order by gc.average_weight_g asc limit 1)
    union all
    (select estimate.average_weight_g - gc.average_weight_g as diff, gc.average_weight_g
     from growth_coefficients gc
     where gc.average_weight_g <= estimate.average_weight_g
     order by gc.average_weight_g desc limit 1)
    order by diff
    limit 1
) as closest_lookup_table_weight
on true
join lateral ( -- If the historical event is a feeding we need to lookup the feed conversion rate.
select case when growth_table is null then 1
else (select feed_conversion_rate
from growth_coefficients gc
where gc.growth_table = growth_table
and gc.average_weight_g = closest_lookup_table_weight.average_weight_g)
end
) as growth_coefficient
on true
join lateral (
select feed_conversion_rate * transaction_size_g as converted_weight_g
) as converted_weight_g
on true;
At the very bottom is my current "solution". With the above example data the sum_conv_g should end up being 5.6, but because the aggregate used for the lookup does not account for the conversion rate, sum_conv_g ends up at 45.2 instead.
One idea I had was whether there is perhaps something like query-local variables one could use to store sum_average_g between rows? There's always the escape hatch of pulling the transactions out into a general-purpose language (Clojure, in my case) and solving it there, but it would be neat if it could be solved entirely within the database.
You have to formulate a recursive subquery. I posted a simplified version of this question over at the DBA SE and got the answer there. The answer to that question can be found here and can be expanded to this more complicated question, though I would wager that no one will ever have the interest to do that.
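For reference, the general shape of such a recursive query over the data above might look like the following simplified sketch. It only carries a running total between rows; the real solution would also perform the growth_coefficients lookup inside the recursive step, using the value computed for the previous row.
-- Simplified sketch of the recursive pattern (not the full solution):
-- number the events, then let each recursion step read the previous
-- step's computed output instead of a plain running aggregate.
WITH RECURSIVE events AS (
    SELECT row_number() OVER (ORDER BY timestamp) AS rn, *
    FROM combined_feed_and_transfer_history
),
running AS (
    SELECT rn, timestamp, total_weight_g::double precision AS running_weight_g
    FROM events
    WHERE rn = 1
    UNION ALL
    SELECT e.rn, e.timestamp,
           -- the growth_coefficients lookup would go here and could use
           -- r.running_weight_g, i.e. the output computed for the previous row
           r.running_weight_g + e.total_weight_g
    FROM running r
    JOIN events e ON e.rn = r.rn + 1
)
SELECT * FROM running ORDER BY rn;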

postgresql group by datetime in join query

I have 2 tables in my PostgreSQL TimescaleDB database (version 12.06) that I am trying to query with an inner join.
Tables' structure:
CREATE TABLE currency(
id serial PRIMARY KEY,
symbol TEXT NOT NULL,
name TEXT NOT NULL,
quote_asset TEXT
);
CREATE TABLE currency_price (
currency_id integer NOT NULL,
dt timestamp WITHOUT time ZONE NOT NULL,
open NUMERIC NOT NULL,
high NUMERIC NOT NULL,
low NUMERIC NOT NULL,
close NUMERIC,
volume NUMERIC NOT NULL,
PRIMARY KEY (
currency_id,
dt
),
CONSTRAINT fk_currency FOREIGN KEY (currency_id) REFERENCES currency(id)
);
The query I'm trying to make is:
SELECT currency_id AS id, symbol, MAX(close) AS close, DATE(dt) AS date
FROM currency_price
JOIN currency ON
currency.id = currency_price.currency_id
GROUP BY currency_id, symbol, date
LIMIT 100;
Basically, it returns all the rows that exist in the currency_price table. I know that Postgres doesn't allow selecting columns without an aggregate function or without including them in the GROUP BY clause. So if I don't include the dt column in my SELECT, I receive the expected results, but if I include it, the output shows a row for every single day of each currency, while I only want the max value per currency and to filter the results by various dates afterwards.
I'm very inexperienced with SQL in general.
Any suggestions to solve this would be very appreciated.
There are several ways to do it; the easiest that comes to mind is using window functions.
select *
from (
    SELECT currency_id, symbol, close, dt,
           row_number() over (partition by currency_id, symbol
                              order by close desc, dt desc) as rr
    FROM currency_price
    JOIN currency ON currency.id = currency_price.currency_id
    WHERE dt::date = '2021-06-07'
) q1
where rr = 1
General window functions:
https://www.postgresql.org/docs/9.5/functions-window.html
Window functions also work with standard aggregate functions like SUM, AVG, MAX, MIN and others.
Some examples: https://www.postgresqltutorial.com/postgresql-window-function/
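Another option that often fits this kind of "top row per group" requirement is DISTINCT ON. A sketch against the same tables, keeping the highest close per currency and calendar day (untested, shown only to illustrate the alternative):
SELECT DISTINCT ON (currency_id, dt::date)
       currency_id AS id, symbol, close, dt::date AS date
FROM currency_price
JOIN currency ON currency.id = currency_price.currency_id
ORDER BY currency_id, dt::date, close DESC;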

Efficient way to reconstruct base table from changes

I have a table of products (with IDs, ~15k records) and another table price_changes (~88m records) recording a change in the price of a given productID at a given change date.
I'm now interested in the price of each product at given points in time (say every 2 hours for a year, so ~4,300 points in time, resulting in ~64m data points of interest altogether). While it's very straightforward to determine the price of a given product at a given time, it seems to be quite time-consuming to determine all 64m data points.
My approach is to pre-populate a new target table fullprices with the data points of interest:
insert into fullprices(obsdate,productID)
select obsdate, productID from targetdates, products
and then update each price observation in this new table like this:
update fullprices f set price = (select price from price_changes where
productID = f.productID and date < f.obsdate
order by date desc
limit 1)
which should give me the most recent price change at each point in time.
Unfortunately, this takes ... well, ages. Is there any better way to do it?
== Edit: My tables are created as follows: ==
CREATE TABLE products
(
productID uuid NOT NULL,
name text NOT NULL,
CONSTRAINT products_pkey PRIMARY KEY (productID )
);
CREATE TABLE price_changes
(
id integer NOT NULL,
productID uuid NOT NULL,
price smallint,
date timestamp NOT NULL
);
CREATE INDEX idx_pc_date
ON price_changes USING btree
(date);
CREATE INDEX idx_pc_productID
ON price_changes USING btree
(productID);
CREATE TABLE targetdates
(
obsdate timestamp
);
CREATE TABLE fullprices
(
obsdate timestamp NOT NULL,
productID uuid NOT NULL,
price smallint
);
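One direction that is usually much faster than the row-by-row UPDATE above is to compute everything in a single INSERT with a LATERAL subquery, backed by a composite index so each lookup is one index descent. A sketch against the schema above (the index name is just illustrative, and this is untested against the real data volumes):
CREATE INDEX idx_pc_product_date ON price_changes (productID, date DESC);

INSERT INTO fullprices (obsdate, productID, price)
SELECT t.obsdate, p.productID, pc.price
FROM targetdates t
CROSS JOIN products p
LEFT JOIN LATERAL (
    -- latest price change strictly before the observation date
    SELECT c.price
    FROM price_changes c
    WHERE c.productID = p.productID
      AND c.date < t.obsdate
    ORDER BY c.date DESC
    LIMIT 1
) pc ON true;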

CSV file data into a PostgreSQL table

I am trying to create a database for MovieLens (http://grouplens.org/datasets/movielens/). We've got movies and ratings. Movies have multiple genres. I split those out into a separate table since it's a 1:many relationship. There's a many:many relationship as well, users to movies. I need to be able to query this table in multiple ways.
So I created:
CREATE TABLE genre (
genre_id serial NOT NULL,
genre_name char(20) DEFAULT NULL,
PRIMARY KEY (genre_id)
)
.
INSERT INTO genre VALUES
(1,'Action'),(2,'Adventure'),(3,'Animation'),(4,'Children\s'),(5,'Comedy'),(6,'Crime'),
(7,'Documentary'),(8,'Drama'),(9,'Fantasy'),(10,'Film-Noir'),(11,'Horror'),(12,'Musical'),
(13,'Mystery'),(14,'Romance'),(15,'Sci-Fi'),(16,'Thriller'),(17,'War'),(18,'Western');
.
CREATE TABLE movie (
movie_id int NOT NULL DEFAULT '0',
movie_name char(75) DEFAULT NULL,
movie_year smallint DEFAULT NULL,
PRIMARY KEY (movie_id)
);
.
CREATE TABLE moviegenre (
movie_id int NOT NULL DEFAULT '0',
genre_id tinyint NOT NULL DEFAULT '0',
PRIMARY KEY (movie_id, genre_id)
);
I don't know how to import my movies.csv, which has columns movie_id, movie_name and movie_genre. For example, the first row is (1;Toy Story (1995);Animation|Children's|Comedy).
If I INSERT manually, it would look like:
INSERT INTO moviegenre VALUES (1,3),(1,4),(1,5)
because 3 is Animation, 4 is Children's and 5 is Comedy.
How can I import the whole data set this way?
You should first create a table that can ingest the data from the CSV file:
CREATE TABLE movies_csv (
movie_id integer,
movie_name varchar,
movie_genre varchar
);
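One way to load the file into this staging table, assuming the semicolon delimiter from the sample row and a server-side file path (the path is a placeholder; from psql you could use \copy instead if the file lives on the client):
COPY movies_csv (movie_id, movie_name, movie_genre)
FROM '/path/to/movies.csv'
WITH (FORMAT csv, DELIMITER ';');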
Note that any single quotes (Children's) should be doubled (Children''s). Once the data is in this staging table you can copy the data over to the movie table, which should have the following structure:
CREATE TABLE movie (
movie_id integer, -- A primary key has implicit NOT NULL and should not have default
movie_name varchar NOT NULL, -- Movie should have a name, varchar more flexible
movie_year integer, -- Regular integer is more efficient
PRIMARY KEY (movie_id)
);
Sanitize your other tables likewise.
Now copy the data over, extracting the unadorned name and the year from the CSV name:
INSERT INTO movie (movie_id, movie_name, movie_year)
SELECT movie_id, parts[1], parts[2]::integer
FROM movies_csv, regexp_matches(movie_name, '([[:ascii:]]*)\s\(([\d]*)\)$') p(parts);
Here the regular expression says:
([[:ascii:]]*) - Capture all characters until the matches below
\s - Read past a space
\( - Read past an opening parenthesis
([\d]*) - Capture any digits
\) - Read past a closing parenthesis
$ - Match from the end of the string
So on input "Die Hard 17 (John lives forever) (2074)" it creates a string array with {'Die Hard 17 (John lives forever)', '2074'}. The scanning has to be from the end $, assuming all movie titles end with the year of publication in parentheses, in order to preserve parentheses and numbers in movie titles.
Now you can work on the movie genres. You have to split the string on the bar | using the regexp_split_to_table() function and then join to the genre table on the genre name:
INSERT INTO moviegenre
SELECT movie_id, genre_id
FROM movies_csv, regexp_split_to_table(movie_genre, '\|') p(genre) -- escape the |
JOIN genre ON genre.genre_name = p.genre;
After all is done and dusted you can delete the movies_csv table.
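That is, once movie and moviegenre are populated:
DROP TABLE movies_csv;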

an empty row with null-like values in not-null field

I'm using postgresql 9.0 beta 4.
After inserting a lot of data into a partitioned table, I found a weird thing. When I query the table, I can see an empty row with null-like values in 'not-null' fields.
That weird query result is like below.
The 689th row is empty. The first 3 fields, (stid, d, ticker), compose the primary key, so they should not be null. The query I used is this:
select * from st_daily2 where stid=267408 order by d
I can even do the group by on this data.
select stid, date_trunc('month', d) ym, count(*) from st_daily2
where stid=267408 group by stid, date_trunc('month', d)
The GROUP BY result still has the empty row.
The 1st row is empty.
But if I query where stid or d is null, it returns nothing.
Is this a bug in PostgreSQL 9.0 beta 4, or some data corruption?
EDIT :
I added my table definition.
CREATE TABLE st_daily
(
stid integer NOT NULL,
d date NOT NULL,
ticker character varying(15) NOT NULL,
mp integer NOT NULL,
settlep double precision NOT NULL,
prft integer NOT NULL,
atr20 double precision NOT NULL,
upd timestamp with time zone,
ntrds double precision
)
WITH (
OIDS=FALSE
);
CREATE TABLE st_daily2
(
CONSTRAINT st_daily2_pk PRIMARY KEY (stid, d, ticker),
CONSTRAINT st_daily2_strgs_fk FOREIGN KEY (stid)
REFERENCES strgs (stid) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT st_daily2_ck CHECK (stid >= 200000 AND stid < 300000)
)
INHERITS (st_daily)
WITH (
OIDS=FALSE
);
The data in this table consists of simulation results. Multiple multithreaded simulation engines written in C# insert data into the database using Npgsql.
psql also shows the empty row.
You'd better file a report at http://www.postgresql.org/support/submitbug
Some questions:
Could you show us the table definitions and constraints for the partitions?
How did you load your data?
Do you get the same result when using another tool, like psql?
The answer to your problem may very well lie in your first sentence:
I'm using postgresql 9.0 beta 4.
Why would you do that? Upgrade to a stable release. Preferably the latest point-release of the current version.
This is 9.1.4 as of today.
I got to the same point: "what in the heck is that blank value?"
No, it's not a NULL, it's a -infinity.
To filter for such a row use:
WHERE
  CASE WHEN mytestcolumn = '-infinity'::timestamp
         OR mytestcolumn = 'infinity'::timestamp
       THEN NULL
       ELSE mytestcolumn
  END IS NULL
instead of:
WHERE mytestcolumn IS NULL
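A shorter filter that should be equivalent (a sketch, assuming mytestcolumn is a plain timestamp column):
WHERE mytestcolumn IS NULL
   OR mytestcolumn IN ('infinity'::timestamp, '-infinity'::timestamp)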