Aggregate result of sub query to comma separated values - postgresql

How do I get the result of a subquery as comma-separated values?
I have three tables: location, stock_location_type and location_label.
I am joining location and stock_location_type, and based on the value of SLT.inventory_location_cd I am querying another table, location_label.
To do that I am writing the following query.
select L.stock_catalogue_id, SLT.inventory_location_cd,
case
when inventory_location_cd = 'base location' then (select related_location_id from location_label where base_location_id = location_id)
when inventory_location_cd != 'base location' then (select base_location_id from location_label where related_location_id = location_id)
end as "Current Location",
* from location L
join stock_location_type SLT on L.stock_location_type_id = SLT.stock_location_type_id;
These subqueries return multiple rows.
I tried using string_agg and casting related_location_id and base_location_id (as they are UUIDs). But then it complains about group by.
If I use group by, it errors out with 'multiple rows returned by subquery'.
What am I missing?

I recreated your set of tables with
A location_label table
create table location_label(location_id int, location_label varchar);
insert into location_label values(1, 'Home');
insert into location_label values(2, 'Office');
insert into location_label values(3, 'Garage');
insert into location_label values(4, 'Bar');
A stock_location_type table
create table stock_location_type (stock_location_type_id int, inventory_location_cd varchar);
insert into stock_location_type values(1, 'base location');
insert into stock_location_type values(2, 'Not base location');
A location table
create table location (stock_catalogue_id int, base_location_id int, related_location_id int, stock_location_type_id int);
insert into location values(1,1,2,1);
insert into location values(1,2,1,2);
insert into location values(1,3,3,1);
insert into location values(2,4,3,1);
insert into location values(2,3,1,1);
insert into location values(2,2,4,2);
If I understand your statement correctly, you are trying to join the location and location_label tables, matching either base_location_id or related_location_id against location_id depending on the inventory_location_cd value.
If this is what you're trying to achieve, the following query should do it by moving that condition into the join itself:
select L.stock_catalogue_id,
SLT.inventory_location_cd,
location_id "Current Location Id",
location_label "Current Location Name"
from location L join stock_location_type SLT
on L.stock_location_type_id = SLT.stock_location_type_id
left outer join location_label
on (
case when
inventory_location_cd = 'base location'
then base_location_id
else related_location_id
end) = location_id
;
result is
stock_catalogue_id | inventory_location_cd | Current Location Id | Current Location Name
--------------------+-----------------------+---------------------+-----------------------
1 | base location | 1 | Home
1 | Not base location | 1 | Home
1 | base location | 3 | Garage
2 | base location | 3 | Garage
2 | base location | 4 | Bar
2 | Not base location | 4 | Bar
(6 rows)
If you need to aggregate it by stock_catalogue_id and inventory_location_cd, that can be achieved with
select L.stock_catalogue_id,
SLT.inventory_location_cd,
string_agg(location_id::text, ',') "Current Location Id",
string_agg(location_label::text, ',') "Current Location Name"
from location L join stock_location_type SLT
on L.stock_location_type_id = SLT.stock_location_type_id
left outer join location_label
on (case when inventory_location_cd = 'base location' then base_location_id else related_location_id end) = location_id
group by L.stock_catalogue_id,
SLT.inventory_location_cd;
with the result being
stock_catalogue_id | inventory_location_cd | Current Location Id | Current Location Name
--------------------+-----------------------+---------------------+-----------------------
1 | base location | 1,3 | Home,Garage
1 | Not base location | 1 | Home
2 | base location | 3,4 | Garage,Bar
2 | Not base location | 4 | Bar
(4 rows)

You can use the string_agg function to aggregate the values into a comma-separated string. Since the ids are UUIDs, they also need to be cast to text, because string_agg expects text. Your subqueries would then be rewritten as
select string_agg(related_location_id::text, ', ') from location_label where base_location_id = location_id
and
select string_agg(base_location_id::text, ', ') from location_label where related_location_id = location_id
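Plugged back into your query, this would look roughly like the sketch below (assuming location_id is a column of the location table, as your original correlation suggests). Each aggregate subquery now returns exactly one row, so the "multiple rows returned by subquery" error goes away and no outer GROUP BY is needed:
select L.stock_catalogue_id, SLT.inventory_location_cd,
case
    when SLT.inventory_location_cd = 'base location'
        then (select string_agg(related_location_id::text, ', ')
                from location_label
               where base_location_id = L.location_id)
    else (select string_agg(base_location_id::text, ', ')
            from location_label
           where related_location_id = L.location_id)
end as "Current Location"
from location L
join stock_location_type SLT on L.stock_location_type_id = SLT.stock_location_type_id;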

Related

Replacement for materialized view on PostgreSQL

I have a table with three columns: creationTime, number, id. It is populated every 15 seconds or so. I have been using a materialized view to track duplicates, like so:
SELECT number, id, count(*) AS dups_count
FROM my_table
GROUP BY number, id HAVING count(*) > 1;
The table contains thousands of records covering the last 1.5 years. Refreshing this materialized view currently takes about 2 minutes. I would like a better solution, but PostgreSQL has no fast-refresh materialized views.
At first I thought a trigger on the table that refreshes the materialized view could be a solution, but with records coming in every 15 seconds and a refresh taking over 2 minutes, that would not work. In any case, I don't like the idea of recalculating the same data over and over again.
Is there a better solution to it?
A trigger that increments a duplicate count might be a solution:
create table duplicates
(
number int,
id int,
dups_count int,
primary key (number, id)
);
The primary key will allow an efficient "UPSERT" that increments the dups_count in case of duplicates.
Then create a trigger that updates that table each time a row is inserted into the base table:
create function increment_dupes()
returns trigger
as
$$
begin
insert into duplicates (number, id, dups_count)
values (new.number, new.id, 1)
on conflict (number,id)
do update
set dups_count = duplicates.dups_count + 1;
return new;
end
$$
language plpgsql;
create trigger update_dups_count
after insert on my_table
for each row
execute function increment_dupes();
Each time you insert into my_table either a new row will be created in duplicates, or the current dups_count will be incremented. If you delete or update rows from my_table you will also need a trigger for that. However updating the count for UPDATEs or DELETEs is not entirely safe for concurrent operations. The INSERT ON CONFLICT is however.
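If deletes do have to be handled, a minimal sketch could look like the following (function and trigger names are just examples, and the same concurrency caveat applies):
create function decrement_dupes()
returns trigger
as
$$
begin
-- reduce the count for the deleted row; as noted above, this is not
-- safe under concurrent updates/deletes
update duplicates
set dups_count = dups_count - 1
where number = old.number
and id = old.id;
return old;
end
$$
language plpgsql;
create trigger update_dups_count_on_delete
after delete on my_table
for each row
execute function decrement_dupes();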
A trigger does have some performance overhead, so you will have to test if the overhead is too big for your requirements.
Whenever there is scope for growth, the best way to scale is to find a way to repeat the process on incremental data only.
To explain this, let's call the table from the question 'Tab':
Tab (Number, ID, CreationTime), with an index on the creationtime column.
Key to applying the incremental method is to have a monotonically increasing value.
Here we have 'creationtime' for that.
(a) Create another table, Tab_duplicate, with an additional column last_compute_timestamp, say:
Tab_duplicate (Number, ID, Duplicate_count, last_compute_timestamp)
(b) Create an index on column 'last_compute_timestamp'.
(c) Run an INSERT that finds the duplicate records and writes them into Tab_duplicate along with the last_compute_timestamp.
(d) For repeated execution, either:
1. Install the pg_cron extension (if it's not there) and automate the execution of the insert; a scheduling sketch follows below. See
https://github.com/citusdata/pg_cron
https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/
or
2. Use a shell/Python script to execute it on the DB through the OS crontab.
Because last_compute_timestamp is recorded on every iteration and reused in the next one, each run only processes the increment and stays fast.
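For example, with pg_cron installed, the capture INSERT from Step 4 below could be scheduled directly in the database (the job name and the 5-minute interval are just placeholders):
SELECT cron.schedule(
  'capture-duplicates',
  '*/5 * * * *',
  $$
  INSERT INTO tab_duplicate
  SELECT a.id, a.number, a.duplicate_count, b.last_compute_timestamp
  FROM (SELECT id, number, Count(*) duplicate_count
        FROM tab,
             (SELECT Max(last_compute_timestamp) lct FROM tab_duplicate) max_date
        WHERE creationtime > max_date.lct
        GROUP BY id, number) a,
       (SELECT Max(creationtime) last_compute_timestamp
        FROM tab,
             (SELECT Max(last_compute_timestamp) lct FROM tab_duplicate) max_date
        WHERE creationtime > max_date.lct) b
  $$
);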
DEMONSTRATION:
Step 1: Production table
create table tab
(
id int,
number int,
creationtime timestamp
);
create index tab_id on tab(creationtime);
Step 2: Duplicate capture table, with a one-time priming record (this can be removed after the first execution)
create table tab_duplicate
(
id int,
number int,
duplicate_count int,
last_compute_timestamp timestamp);
create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
insert into tab_duplicate values(0,0,0,current_timestamp);
Step 3: Some duplicate entries into the production table
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(1,10,current_timestamp);
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(3,30,current_timestamp);
insert into tab values(3,30,current_timestamp);
select pg_sleep(1);
insert into tab values(4,40,current_timestamp);
Verify records:
postgres=# select * from tab;
id | number | creationtime
----+--------+----------------------------
1 | 10 | 2022-01-23 19:00:37.238865
1 | 10 | 2022-01-23 19:00:38.248574
1 | 10 | 2022-01-23 19:00:38.252622
2 | 20 | 2022-01-23 19:00:39.259584
2 | 20 | 2022-01-23 19:00:40.26655
3 | 30 | 2022-01-23 19:00:41.274673
3 | 30 | 2022-01-23 19:00:41.279298
4 | 40 | 2022-01-23 19:00:52.697257
(8 rows)
Step 4: Duplicates captured and verified.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Executing it returns:
INSERT 0 4
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
(5 rows)
Step 5: Some more duplicates into the production table
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
Step 6: The same duplicate-capture SQL, executed again, will capture only the incremental records in the production table.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Executing it again returns:
INSERT 0 2
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
5 | 50 | 3 | 2022-01-23 19:02:37.884417
6 | 60 | 2 | 2022-01-23 19:02:37.884417
(7 rows)
This duplicate capture will always be fast because of two things:
It works only on the incremental data that arrived since the previous scheduled run.
Finding the maximum timestamp is done on a single-column index (an index-only scan).
From execution plan:
-> Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2 (cost=0.15..77.76 rows=1692 width=8)
CAVEAT: If duplicates of the same key can span several runs, you can dedupe the records in tab_duplicate periodically, say at the end of the day. That will be fast anyway, because tab_duplicate is a small aggregated table and is offline to your application, whereas tab is the production table with the huge accumulated data.
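A sketch of such a periodic dedupe, assuming the per-run counts for the same (id, number) should simply be summed:
BEGIN;
-- collapse per-run rows into one row per (id, number)
CREATE TEMP TABLE tab_duplicate_rollup AS
SELECT id,
       number,
       SUM(duplicate_count) AS duplicate_count,
       MAX(last_compute_timestamp) AS last_compute_timestamp
FROM tab_duplicate
GROUP BY id, number;
TRUNCATE tab_duplicate;
INSERT INTO tab_duplicate
SELECT id, number, duplicate_count, last_compute_timestamp
FROM tab_duplicate_rollup;
DROP TABLE tab_duplicate_rollup;
COMMIT;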
Also, a trigger on the production table is a viable solution, but it adds overhead to transactions on the production table, since the trigger runs on every insert.
Two approaches come to mind:
Create a secondary table with (number, id) columns. Add a trigger so that whenever a duplicate row is about to be inserted into my_table, it is also inserted into this secondary table. That way you'll have the data you need in the secondary table as soon as it comes in, and it won't take up too much space unless you have a ton of these duplicates.
Add a new column to my_table, perhaps a timestamp, to differentiate the duplicates. Add a unique constraint to my_table over the (number, id) columns where the new column is null. Then, you can change your insert to include an ON CONFLICT clause, so that if a duplicate is being inserted you set its timestamp to now. When you want to search for duplicates, you can then just query using the new column.
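A rough sketch of the second idea (the column name duplicated_at, the index name, and the sample values are made up):
ALTER TABLE my_table ADD COLUMN duplicated_at timestamptz;
-- only rows not yet flagged as duplicated take part in the uniqueness check
CREATE UNIQUE INDEX my_table_number_id_uniq
    ON my_table (number, id)
    WHERE duplicated_at IS NULL;
-- on a duplicate, flag the existing row instead of inserting a second copy
INSERT INTO my_table (creationTime, number, id)
VALUES (now(), 42, 7)
ON CONFLICT (number, id) WHERE duplicated_at IS NULL
DO UPDATE SET duplicated_at = now();
-- finding the duplicates later is then a plain filter
SELECT number, id, duplicated_at
FROM my_table
WHERE duplicated_at IS NOT NULL;
Note that with DO UPDATE the incoming duplicate row itself is not stored; the pre-existing row is flagged instead.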

Select row position in filtered and ordered row list PostgreSQL

I got this query,
SELECT s.pos
FROM (SELECT t.guild_id, t.user_id,
ROW_NUMBER() OVER(ORDER BY t.reputation DESC) AS pos
FROM users t) s
WHERE (s.guild_id, s.user_id) = ($2, $3)
that gets a user's "rank" in a guild, but I want to filter the results by entries that are in an array of t.user_id values (like {'1', '64', '83'}) and have this affect the resulting pos value. I found FILTER and WITHIN GROUP, but I'm not sure how to fit one of those into this query. How would I do that?
Here's the full table if that helps at all:
Table "public.users"
Column | Type | Collation | Nullable | Default
------------+-----------------------+-----------+----------+---------
guild_id | character varying(20) | | not null |
user_id | character varying(20) | | not null |
reputation | real | | not null | 0
Indexes:
"users_pkey" PRIMARY KEY, btree (guild_id, user_id)
Why not select on those first?
WITH UsersWeCareAbout AS (
SELECT * FROM users u WHERE u.user_id = ANY(subgroup_array)
), RepUsers AS (
SELECT t.guild_id, t.user_id, ROW_NUMBER() OVER(ORDER BY t.reputation DESC) AS pos
FROM UsersWeCareAbout t
) SELECT s.pos FROM RepUsers s WHERE (s.guild_id, s.user_id) = ($2, $3)
(untested if only because I didn't really have enough context to test with)
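Also untested, but if the allowed ids arrive as another bind parameter (assumed here to be $1, e.g. '{1,64,83}'::varchar[]), the first CTE would simply become:
SELECT * FROM users u WHERE u.user_id = ANY($1::varchar[])
and the rest of the query stays the same.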

Postgres insert trigger fills id

I have a BEFORE trigger which should fill a record's root ID, which would point to the rootmost entry. I.e.:
id | parent_id | root_id
-------------------------
a | null | a
a.1 | a | a
a.1.1 | a.1 | a
b | null | b
If an entry's parent_id is null, root_id should point to the record itself.
The question is: inside the BEFORE INSERT trigger, if parent_id is null, can I (or should I) fetch the next sequence value and fill both id and root_id, so that I can avoid filling root_id in an AFTER trigger?
According to your own definition:
if entry's parent_id is null, it would point to record itself
then you have to do:
if new.parent_id is null then
new.root_id = new.id ;
else
WITH RECURSIVE p (parent_id, level) AS
(
-- Base case: start from the parent of the row being inserted
-- (inside a BEFORE INSERT trigger the new row is not yet visible in t)
SELECT
new.parent_id, 0 as level
UNION ALL
SELECT
t.parent_id, level + 1
FROM
t JOIN p ON t.id = p.parent_id
WHERE
t.parent_id IS NOT NULL
)
SELECT
parent_id
INTO
new.root_id
FROM
p
ORDER BY
level DESC
LIMIT
1 ;
end if ;
RETURN new ;
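For completeness, the whole trigger could look like the sketch below (the table name t and the sequence name t_id_seq are assumptions). Since the parent row already exists, and already has its root_id populated, by the time a child is inserted, the child can also simply copy it, which avoids the recursion entirely:
create or replace function fill_root_id()
returns trigger
language plpgsql
as
$$
begin
    -- answering the sequence question: yes, the id can be filled here
    if new.id is null then
        new.id := nextval('t_id_seq');
    end if;
    if new.parent_id is null then
        new.root_id := new.id;
    else
        -- the parent already carries the root of its tree
        select root_id into new.root_id from t where id = new.parent_id;
    end if;
    return new;
end
$$;
create trigger fill_root_id
before insert on t
for each row
execute function fill_root_id();  -- use EXECUTE PROCEDURE on PostgreSQL 10 or older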

Microsoft TSQL change text row to column

I want to change 20 rows with one column into 1 row with 20 columns, in order to insert it later into a second database:
Name
----------
- Frank
- Dora
- ...
- Michael
to
Name1 | Name2 | ... | Name20
Frank | Dora | ... | Michael
I tried
SELECT *
FROM (SELECT TOP 20 firstname AS NAME
FROM database) AS d
PIVOT (Min(NAME)
FOR NAME IN (name1,
name2,
name3,
name4,
name5,
name6,
name7,
name8,
name9,
name10,
name11,
name12,
name13,
name14,
name15,
name16,
name18,
name19,
name20) ) AS f
But all names are NULL.
You were close... But your inner select must carry the new column name. Try it like this:
DECLARE @tbl TABLE(Name VARCHAR(100));
INSERT INTO @tbl VALUES('Frank'),('Dora'),('Michael');
SELECT p.*
FROM
(
SELECT 'Name' + CAST(ROW_NUMBER() OVER(ORDER BY Name) AS VARCHAR(150)) AS ColumnName
,Name
From @tbl
) AS tbl
PIVOT
(
MIN(Name) FOR ColumnName IN(Name1,Name2,Name3)
) AS p
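With the three sample rows this should return a single row along the lines of:
Name1 | Name2 | Name3
------+-------+--------
Dora  | Frank | Michael
Dora ends up in Name1 because ROW_NUMBER() OVER(ORDER BY Name) numbers the names alphabetically.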

PostgreSQL: duplicate key value violates unique constraint on UPDATE command

When doing an UPDATE query, we got the following error message:
ERROR: duplicate key value violates unique constraint "tableA_pkey"
DETAIL: Key (id)=(47470) already exists.
However, our UPDATE query does not affect the primary key. Here is a simplified version:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = b.id
GROUP BY b.name
),
','
)
)
WHERE
a.end_at BETWEEN now() AND now() - interval '1 day';
We ensured the primary key sequence was already synced:
\d tableA_id_seq
Which produces:
Column | Type | Value
---------------+---------+--------------------------
sequence_name | name | tableA_id_seq
last_value | bigint | 50364
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | t
Looking for the maximum id in the table:
select max(id) from tableA;
We got a lower value:
max
-------
50363
(1 row)
Do you have any idea why this happens? If we exclude the problematic id, it works.
Another strange point: replacing the previous UPDATE with:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = b.id
GROUP BY b.name
),
','
)
)
WHERE a.id = 47470;
It works well. Are we missing something?
EDIT: triggers
I have no user-defined triggers on this table:
SELECT t.tgname, c.relname
FROM pg_trigger t
JOIN pg_class c ON t.tgrelid = c.oid
WHERE
c.relname = 'tableA'
AND
t.tgisinternal = false
;
Which returns no row.
Note: I am using PostgreSQL 9.3.4.
Not really sure what the cause was. However, deleting the two (non-vital) records that corresponded to the already existing ids solved the issue.
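For anyone else hitting this, one way to look for rows that share an id despite the unique constraint is a query along these lines (forcing a sequential scan, in case the unique index itself is the inconsistent part):
-- disable index access so the scan does not rely on the possibly inconsistent index
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT id, count(*) AS copies, array_agg(ctid::text) AS row_locations
FROM tableA
GROUP BY id
HAVING count(*) > 1;
RESET enable_indexscan;
RESET enable_bitmapscan;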