PostgreSQL: self join with array

My question is about writing a PostgreSQL query for the use case below.
Approach#1
I have a table like the one below, where the same uuid is shared across different types (a, b, c, d) in order to map rows of different types to each other.
+----+------+-------------+
| id | type | master_guid |
+----+------+-------------+
| 1 | a | uuid-1 |
| 2 | a | uuid-2 |
| 3 | a | uuid-3 |
| 4 | a | uuid-4 |
| 5 | a | uuid-5 |
| 6 | b | uuid-1 |
| 7 | b | uuid-2 |
| 8 | b | uuid-3 |
| 9 | b | uuid-6 |
| 10 | c | uuid-1 |
| 11 | c | uuid-2 |
| 12 | c | uuid-3 |
| 13 | c | uuid-6 |
| 14 | c | uuid-7 |
| 15 | d | uuid-6 |
| 16 | d | uuid-2 |
+----+------+-------------+
Approach#2
I have created two tables, one mapping id to type and one mapping id to master_guid, like below:
table1:
+----+------+
| id | type |
+----+------+
| 1 | a |
| 2 | a |
| 3 | a |
| 4 | a |
| 5 | a |
| 6 | b |
| 7 | b |
| 8 | b |
| 9 | b |
| 10 | c |
| 11 | c |
| 12 | c |
| 13 | c |
| 14 | c |
| 15 | d |
| 16 | d |
+----+------+
table2
+----+-------------+
| id | master_guid |
+----+-------------+
| 1 | uuid-1 |
| 2 | uuid-2 |
| 3 | uuid-3 |
| 4 | uuid-4 |
| 5 | uuid-5 |
| 6 | uuid-1 |
| 7 | uuid-2 |
| 8 | uuid-3 |
| 9 | uuid-6 |
| 10 | uuid-1 |
| 11 | uuid-2 |
| 12 | uuid-3 |
| 13 | uuid-6 |
| 14 | uuid-7 |
| 15 | uuid-6 |
| 16 | uuid-2 |
+----+-------------+
I want to get output like below with both approaches:
+----+------+--------+------------+
| id | type | uuid | mapped_ids |
+----+------+--------+------------+
| 1 | a | uuid-1 | [6,10] |
| 2 | a | uuid-2 | [7,11] |
| 3 | a | uuid-3 | [8,12] |
| 4 | a | uuid-4 | null |
| 5 | a | uuid-5 | null |
+----+------+--------+------------+
I have tried self-joins with array_agg on the ids, grouping by uuid, but have not been able to get the desired output.
Use below query to populate data:
Approach#1
insert into table1 values
(1,'a','uuid-1'),
(2,'a','uuid-2'),
(3,'a','uuid-3'),
(4,'a','uuid-4'),
(5,'a','uuid-5'),
(6,'b','uuid-1'),
(7,'b','uuid-2'),
(8,'b','uuid-3'),
(9,'b','uuid-6'),
(10,'c','uuid-1'),
(11,'c','uuid-2'),
(12,'c','uuid-3'),
(13,'c','uuid-6'),
(14,'c','uuid-7'),
(15,'d','uuid-6'),
(16,'d','uuid-2')
Approach#2
insert into table1 values
(1,'a'),
(2,'a'),
(3,'a'),
(4,'a'),
(5,'a'),
(6,'b'),
(7,'b'),
(8,'b'),
(9,'b'),
(10,'c'),
(11,'c'),
(12,'c'),
(13,'c'),
(14,'c'),
(15,'d'),
(16,'d')
insert into table2 values
(1,'uuid-1'),
(2,'uuid-2'),
(3,'uuid-3'),
(4,'uuid-4'),
(5,'uuid-5'),
(6,'uuid-1'),
(7,'uuid-2'),
(8,'uuid-3'),
(9,'uuid-6'),
(10,'uuid-1'),
(11,'uuid-2'),
(12,'uuid-3'),
(13,'uuid-6'),
(14,'uuid-7'),
(15,'uuid-6'),
(16,'uuid-2')

demo: db<>fiddle
Using array_agg as a window function lets you aggregate the ids per group (in your case, the groups are your uuids):
SELECT
id, type, master_guid as uuid,
array_agg(id) OVER (PARTITION BY master_guid) as mapped_ids
FROM table1
ORDER BY id
Result:
| id | type | uuid | mapped_ids |
|----|------|--------|------------|
| 1 | a | uuid-1 | 10,6,1 |
| 2 | a | uuid-2 | 16,2,7,11 |
| 3 | a | uuid-3 | 8,3,12 |
| 4 | a | uuid-4 | 4 |
| 5 | a | uuid-5 | 5 |
| 6 | b | uuid-1 | 10,6,1 |
| 7 | b | uuid-2 | 16,2,7,11 |
| 8 | b | uuid-3 | 8,3,12 |
| 9 | b | uuid-6 | 15,13,9 |
| 10 | c | uuid-1 | 10,6,1 |
| 11 | c | uuid-2 | 16,2,7,11 |
| 12 | c | uuid-3 | 8,3,12 |
| 13 | c | uuid-6 | 15,13,9 |
| 14 | c | uuid-7 | 14 |
| 15 | d | uuid-6 | 15,13,9 |
| 16 | d | uuid-2 | 16,2,7,11 |
These arrays currently also contain the id of the current row (the mapped_ids for id = 1 contain 1). This can be corrected by removing that element with array_remove:
SELECT
id, type, master_guid as uuid,
array_remove(array_agg(id) OVER (PARTITION BY master_guid), id) as mapped_ids
FROM table1
ORDER BY id
Result:
| id | type | uuid | mapped_ids |
|----|------|--------|------------|
| 1 | a | uuid-1 | 10,6 |
| 2 | a | uuid-2 | 16,7,11 |
| 3 | a | uuid-3 | 8,12 |
| 4 | a | uuid-4 | |
| 5 | a | uuid-5 | |
| 6 | b | uuid-1 | 10,1 |
| 7 | b | uuid-2 | 16,2,11 |
| 8 | b | uuid-3 | 3,12 |
| 9 | b | uuid-6 | 15,13 |
| 10 | c | uuid-1 | 6,1 |
| 11 | c | uuid-2 | 16,2,7 |
| 12 | c | uuid-3 | 8,3 |
| 13 | c | uuid-6 | 15,9 |
| 14 | c | uuid-7 | |
| 15 | d | uuid-6 | 13,9 |
| 16 | d | uuid-2 | 2,7,11 |
Now, for example, id = 4 has an empty array instead of a NULL value. To get NULL instead, use the NULLIF function: it returns NULL if both arguments are equal, and the first argument otherwise.
SELECT
id, type, master_guid as uuid,
NULLIF(
array_remove(array_agg(id) OVER (PARTITION BY master_guid), id),
'{}'::int[]
) as mapped_ids
FROM table1
ORDER BY id
Result:
| id | type | uuid | mapped_ids |
|----|------|--------|------------|
| 1 | a | uuid-1 | 10,6 |
| 2 | a | uuid-2 | 16,7,11 |
| 3 | a | uuid-3 | 8,12 |
| 4 | a | uuid-4 | (null) |
| 5 | a | uuid-5 | (null) |
| 6 | b | uuid-1 | 10,1 |
| 7 | b | uuid-2 | 16,2,11 |
| 8 | b | uuid-3 | 3,12 |
| 9 | b | uuid-6 | 15,13 |
| 10 | c | uuid-1 | 6,1 |
| 11 | c | uuid-2 | 16,2,7 |
| 12 | c | uuid-3 | 8,3 |
| 13 | c | uuid-6 | 15,9 |
| 14 | c | uuid-7 | (null) |
| 15 | d | uuid-6 | 13,9 |
| 16 | d | uuid-2 | 2,7,11 |
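For readers who want to trace the three steps outside the database, here is a minimal Python sketch of the same logic. It is not PostgreSQL code: the function name mapped_ids is made up for illustration, and the ids come out in insertion order, whereas the order inside the window aggregate above is not guaranteed.

```python
from collections import defaultdict

# Rows of table1 from the question: (id, type, master_guid).
rows = [
    (1, "a", "uuid-1"), (2, "a", "uuid-2"), (3, "a", "uuid-3"),
    (4, "a", "uuid-4"), (5, "a", "uuid-5"), (6, "b", "uuid-1"),
    (7, "b", "uuid-2"), (8, "b", "uuid-3"), (9, "b", "uuid-6"),
    (10, "c", "uuid-1"), (11, "c", "uuid-2"), (12, "c", "uuid-3"),
    (13, "c", "uuid-6"), (14, "c", "uuid-7"), (15, "d", "uuid-6"),
    (16, "d", "uuid-2"),
]


def mapped_ids(rows):
    """Hypothetical helper mirroring the final query, keyed by row id."""
    # Step 1, array_agg(id) OVER (PARTITION BY master_guid):
    # collect every id that shares a master_guid.
    ids_by_guid = defaultdict(list)
    for row_id, _type, guid in rows:
        ids_by_guid[guid].append(row_id)

    result = {}
    for row_id, _type, guid in rows:
        # Step 2, array_remove(..., id): drop the current row's own id.
        others = [i for i in ids_by_guid[guid] if i != row_id]
        # Step 3, NULLIF(..., '{}'): an empty list becomes None.
        result[row_id] = others or None
    return result


result = mapped_ids(rows)
print(result[1], result[2], result[4])
```

Note that id 2 maps to 7, 11 and 16 here, just as in the query result, because id 16 (type d) also carries uuid-2.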

Try this:
select
t1.id, t1.type, t1.master_guid, array_agg (distinct t2.id)
from
table1 t1
left join table1 t2 on
t1.master_guid = t2.master_guid and
t1.id != t2.id
group by
t1.id, t1.type, t1.master_guid
I don't come up with exactly the same results you listed, but I thought it was close enough that maybe there was a mistaken expectation on your side or only a small error on mine. Either way, it is a potential starting point.
-- EDIT --
For approach #2, I think you just need to add an inner join to Table2 to get the GUID:
select
t1.id, t1.type, t2.master_guid,
array_agg (t2a.id)
from
table1 t1
join table2 t2 on t1.id = t2.id
left join table2 t2a on
t2.master_guid = t2a.master_guid and
t2a.id != t1.id
where
t1.type = 'a'
group by
t1.id, t1.type, t2.master_guid

Related

Serial Number in logical order without gaps

I'm trying to generate a serial number based on a few conditions.
My dataset:
+--------+------------+------------+---------+--------+
| Client | Start_Date | End_date | Product | Ser_No |
+--------+------------+------------+---------+--------+
| 44 | 22-01-2018 | 31-12-2018 | A | |
+--------+------------+------------+---------+--------+
| 44 | 24-02-2018 | 01-01-2019 | B | |
+--------+------------+------------+---------+--------+
| 44 | 12-03-2018 | 01-01-2019 | C | |
+--------+------------+------------+---------+--------+
| 100 | 24-01-2018 | 30-11-2018 | A | |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 15-12-2018 | D | |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 01-02-2019 | E | |
+--------+------------+------------+---------+--------+
| 100 | 01-03-2018 | 31-01-2019 | F | |
+--------+------------+------------+---------+--------+
What I did to configure my serial number:
RANK() OVER(PARTITION BY Client ORDER BY Client, Start_date ASC)
So now it generates a serial number for me, which looks like this:
+--------+------------+------------+---------+--------+
| Client | Start_Date | End_date | Product | Ser_No |
+--------+------------+------------+---------+--------+
| 44 | 22-01-2018 | 31-12-2018 | A | 1 |
+--------+------------+------------+---------+--------+
| 44 | 24-02-2018 | 01-01-2019 | B | 2 |
+--------+------------+------------+---------+--------+
| 44 | 12-03-2018 | 01-01-2019 | C | 3 |
+--------+------------+------------+---------+--------+
| 100 | 24-01-2018 | 30-11-2018 | A | 1 |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 15-12-2018 | D | 2 |
+--------+------------+------------+---------+--------+
| 100 | 26-01-2018 | 01-02-2019 | E | 2 |
+--------+------------+------------+---------+--------+
| 100 | 01-03-2018 | 31-01-2019 | F | 4 |
+--------+------------+------------+---------+--------+
What goes wrong for my analysis is the last line: it gets serial number 4, but it has to be 3.
Can anyone help me generate it in this order?
Thanks in advance!
Extra
In addition to my question from yesterday, there is something extra that I need to do. The Ser_No has to be the same when the Start_Date is the same, but it also has to be the same when the following record has the same product (even when it has a different Start_Date).
So this is what I get right now (Ser_No) and what I expect (Ser_No New):
+--------+------------+------------+---------+--------+------------+
| Client | Start_Date | End_date | Product | Ser_No | Ser_No New |
+--------+------------+------------+---------+--------+------------+
| 44 | 22-01-2018 | 31-12-2018 | A | 1 | 1 |
+--------+------------+------------+---------+--------+------------+
| 44 | 24-02-2018 | 01-01-2019 | B | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 44 | 12-03-2018 | 01-01-2019 | C | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 100 | 24-01-2018 | 30-11-2018 | A | 1 | 1 |
+--------+------------+------------+---------+--------+------------+
| 100 | 26-01-2018 | 15-12-2018 | D | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 100 | 26-01-2018 | 01-02-2019 | E | 2 | 2 |
+--------+------------+------------+---------+--------+------------+
| 100 | 01-03-2018 | 31-01-2019 | F | 3 | 3 |
+--------+------------+------------+---------+--------+------------+
| 100 | 11-04-2018 | 31-03-2019 | F | 4 | 3 |
+--------+------------+------------+---------+--------+------------+
| 100 | 20-04-2018 | 31-01-2019 | G | 5 | 4 |
+--------+------------+------------+---------+--------+------------+
| 100 | 21-04-2018 | 31-01-2019 | A | 6 | 5 |
+--------+------------+------------+---------+--------+------------+
| 100 | 21-04-2018 | 31-01-2019 | B | 6 | 5 |
+--------+------------+------------+---------+--------+------------+
| 100 | 01-05-2018 | 31-01-2019 | B | 7 | 5 |
+--------+------------+------------+---------+--------+------------+
Any idea how to achieve this? I can't work it out.
You need to use DENSE_RANK instead:
This function returns the rank of each row within a result set partition, with no gaps in the ranking values.
DENSE_RANK() OVER(PARTITION BY Client ORDER BY Start_date) AS Ser_no
Additionally, Client in the ORDER BY has no effect, because it has the same value within each partition.
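The difference between the two functions can be sketched in a few lines of Python. This is only an illustration of the numbering rules, using the four Start_Date values of client 100 from the question (written in ISO format so that they sort correctly as strings). The function names are chosen to mirror the SQL ones.

```python
def rank(values):
    # RANK(): 1 + number of rows that sort strictly before this one,
    # so a tie leaves a gap after it.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def dense_rank(values):
    # DENSE_RANK(): 1 + number of distinct values that sort before this one,
    # so the numbering has no gaps.
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]


# Start dates of client 100 (one partition).
start_dates = ["2018-01-24", "2018-01-26", "2018-01-26", "2018-03-01"]

print(rank(start_dates))        # [1, 2, 2, 4]
print(dense_rank(start_dates))  # [1, 2, 2, 3]
```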

Summarize the cost by groups in org table

Suppose I have a spreadsheet like this in an org table:
|------------+-------+------------+--------+--------+------------|
| Date | Items | Unit Price | Amount | Amount | Categories |
|------------+-------+------------+--------+--------+------------|
| 2019/09/17 | A | 2.64 | 1 | 2.64 | materials |
| | B | 52.67 | 2 | 105.34 | diagnosis |
| | C | 3.08 | 1 | 3.08 | materials |
| | D | 3.85 | 2 | 7.7 | materials |
| | E | 33.66 | 2 | 67.32 | materials |
| | F | 40 | 1 | 40 | treatments |
| | G | 16.5 | 1 | 16.5 | materials |
| | H | 4 | 3 | 12 | treatments |
| | I | 40 | 1 | 40 | bed |
| x | M | 6 | 13 | 78 | treatments |
|------------+-------+------------+--------+--------+------------|
#+TBLFM: $5=$3*$4
I want to sum up the material fees.
Is it possible to calculate it by grouping like vsum(where Categories == materials)?
One way to do this with an elisp expression will be:
|------------+-------+------------+--------+--------+------------|
| Date | Items | Unit Price | Amount | Amount | Categories |
|------------+-------+------------+--------+--------+------------|
| 2019/09/17 | A | 2.64 | 1 | 2.64 | materials |
| | B | 52.67 | 2 | 105.34 | diagnosis |
| | C | 3.08 | 1 | 3.08 | materials |
| | D | 3.85 | 2 | 7.7 | materials |
| | E | 33.66 | 2 | 67.32 | materials |
| | F | 40 | 1 | 40 | treatments |
| | G | 16.5 | 1 | 16.5 | materials |
| | H | 4 | 3 | 12 | treatments |
| | I | 40 | 1 | 40 | bed |
| x | M | 6 | 13 | 78 | treatments |
|------------+-------+------------+--------+--------+------------|
| TOTAL: | | | | 97.24 | |
|------------+-------+------------+--------+--------+------------|
#+TBLFM: $5=$3*$4
#+TBLFM: @12$5='(apply #'+ (cl-mapcar (lambda (num category) (if (eq category 'materials) num 0)) '(@II$5..@III$5) '(@II$6..@III$6)));L
cl-mapcar walks column 5 and column 6 in parallel, keeping the amount when the category is the symbol 'materials and 0 otherwise; apply #'+ then sums the resulting list into cell @12$5.
This solution and a calc-based solution can be found on emacs.SE.
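As a cross-check of the 97.24 total, the same conditional sum can be sketched in Python. This is not org-mode code: the function name category_total is made up, and Decimal is used only to avoid floating-point noise in the result.

```python
from decimal import Decimal

# (item, unit price, quantity, category) taken from the table above.
rows = [
    ("A", "2.64", 1, "materials"),
    ("B", "52.67", 2, "diagnosis"),
    ("C", "3.08", 1, "materials"),
    ("D", "3.85", 2, "materials"),
    ("E", "33.66", 2, "materials"),
    ("F", "40", 1, "treatments"),
    ("G", "16.5", 1, "materials"),
    ("H", "4", 3, "treatments"),
    ("I", "40", 1, "bed"),
    ("M", "6", 13, "treatments"),
]


def category_total(rows, category):
    # Sum price * quantity over the rows whose category matches,
    # like the (if (eq category 'materials) num 0) test in the formula.
    return sum(
        (Decimal(price) * qty for _item, price, qty, cat in rows if cat == category),
        Decimal("0"),
    )


print(category_total(rows, "materials"))  # 97.24
```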

Replace null by negative id number in not consecutive rows in hive

I have this table in my database:
| id | desc |
|-------------|
| 1 | A |
| 2 | B |
| NULL | C |
| 3 | D |
| NULL | D |
| NULL | E |
| 4 | F |
---------------
And I want to transform this table into one that replaces the nulls with consecutive negative ids:
| id | desc |
|-------------|
| 1 | A |
| 2 | B |
| -1 | C |
| 3 | D |
| -2 | D |
| -3 | E |
| 4 | F |
---------------
Does anyone know how I can do this in Hive?
The approach below works:
select coalesce(id,concat('-',ROW_NUMBER() OVER (partition by id))) as id,desc from database_name.table_name;
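The intended transformation can be sketched in Python as follows. This is an illustration of the expected result, not Hive code: the function name fill_null_ids is made up, and the sketch assumes the rows arrive in the order shown in the question, which a Hive table does not guarantee without an explicit ordering column.

```python
from itertools import count


def fill_null_ids(rows):
    # Hand out -1, -2, -3, ... to rows whose id is None, in input order;
    # rows that already have an id are left untouched.
    next_negative = count(-1, -1)
    return [
        (next(next_negative) if row_id is None else row_id, desc)
        for row_id, desc in rows
    ]


rows = [(1, "A"), (2, "B"), (None, "C"), (3, "D"), (None, "D"), (None, "E"), (4, "F")]
filled = fill_null_ids(rows)
print([row_id for row_id, _desc in filled])  # [1, 2, -1, 3, -2, -3, 4]
```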

PostgreSQL aggregate function for each row across multiple unknown number of columns

I looked through similar questions like this one, but they all seem to assume a fixed number of columns. I would like to handle an input table whose number of columns I do not know in advance.
Question:
How to calculate aggregate functions (e.g. avg() or sum() ) for each row across several columns if number of columns is not known in advance?
I have put the input table panel_stats_rnd as a CSV and the DDL to create it here.
For each row, I would like to calculate rnd_avg_parcelcount as the average of all columns c_1_avg_parcelcount, c_2_avg_parcelcount, ..., where the input table can have any number (say 100) of _avg_parcelcount columns. For the column rnd_sum_parcelcount, I would like to calculate the sum() of all columns that start with c_ and end with _sum_parcelcount.
The table looks like this:
SELECT * FROM panel_stats_rnd;
gid | d | dist_from | dist_to | distlabel | rnd_avg_parcelcount | rnd_sum_parcelcount | rnd_avg_callcount | rnd_sum_callcount | rnd_avg_perccalled | called_avg_parcelcount | called_sum_parcelcount | called_avg_callcount | called_sum_callcount | called_avg_perccalled | c_1_avg_parcelcount | c_1_sum_parcelcount | c_1_avg_callcount | c_1_sum_callcount | c_1_avg_perccalled | c_2_avg_parcelcount | c_2_sum_parcelcount | c_2_avg_callcount | c_2_sum_callcount | c_2_avg_perccalled
-----+----+-----------+---------+-----------+---------------------+---------------------+-------------------+-------------------+--------------------+------------------------+------------------------+----------------------+----------------------+-----------------------+---------------------+---------------------+-------------------+-------------------+----------------------+---------------------+---------------------+-------------------+-------------------+----------------------
1 | 0 | 0 | 100 | 0-100 | | | | | | 119045 | 119045 | 119045 | 23 | 0.000193204250493511 | 119045 | 119045 | 119045 | 16 | 0.000134402956865051 | 119045 | 119045 | 119045 | 16 | 0.000134402956865051
2 | 1 | 100 | 200 | 100-200 | | | | | | 163140 | 163140 | 163140 | 22 | 0.000134853500061297 | 163140 | 163140 | 163140 | 17 | 0.000104204977320093 | 163140 | 163140 | 163140 | 18 | 0.000110334681868334
3 | 2 | 200 | 300 | 200-300 | | | | | | 135934 | 135934 | 135934 | 10 | 7.3565112481057e-05 | 135934 | 135934 | 135934 | 18 | 0.000132417202465903 | 135934 | 135934 | 135934 | 15 | 0.000110347668721585
4 | 3 | 300 | 400 | 300-400 | | | | | | 116874 | 116874 | 116874 | 13 | 0.000111230898232284 | 116874 | 116874 | 116874 | 11 | 9.41184523503944e-05 | 116874 | 116874 | 116874 | 18 | 0.000154012012937009
5 | 4 | 400 | 500 | 400-500 | | | | | | 93216 | 93216 | 93216 | 12 | 0.000128733264675592 | 93216 | 93216 | 93216 | 10 | 0.000107277720562993 | 93216 | 93216 | 93216 | 12 | 0.000128733264675592
6 | 5 | 500 | 600 | 500-600 | | | | | | 69992 | 69992 | 69992 | 7 | 0.0001000114298777 | 69992 | 69992 | 69992 | 10 | 0.000142873471253858 | 69992 | 69992 | 69992 | 7 | 0.0001000114298777
7 | 6 | 600 | 700 | 600-700 | | | | | | 50816 | 50816 | 50816 | 10 | 0.000196788413098237 | 50816 | 50816 | 50816 | 6 | 0.000118073047858942 | 50816 | 50816 | 50816 | 0 | 0
8 | 7 | 700 | 800 | 700-800 | | | | | | 34814 | 34814 | 34814 | 0 | 0 | 34814 | 34814 | 34814 | 6 | 0.000172344459125639 | 34814 | 34814 | 34814 | 4 | 0.000114896306083759
9 | 8 | 800 | 900 | 800-900 | | | | | | 23023 | 23023 | 23023 | 1 | 4.34348260435217e-05 | 23023 | 23023 | 23023 | 4 | 0.000173739304174087 | 23023 | 23023 | 23023 | 1 | 4.34348260435217e-05
10 | 9 | 900 | 1000 | 900-1000 | | | | | | 14215 | 14215 | 14215 | 1 | 7.03482237073514e-05 | 14215 | 14215 | 14215 | 1 | 7.03482237073514e-05 | 14215 | 14215 | 14215 | 5 | 0.000351741118536757
11 | 10 | 1000 | 5000 | 1000-5000 | | | | | | 23527 | 23527 | 23527 | 0 | 0 | 23527 | 23527 | 23527 | 0 | 0 | 23527 | 23527 | 23527 | 3 | 0.000127513070089684
(11 rows)
I tried the following for 2 columns (it works, but I'd rather not write it 5 times for 100 columns; besides, the number of columns has to be a parameter):
SELECT d,c_1_avg_parcelcount,c_2_avg_parcelcount,
(SELECT avg(c) FROM (VALUES (c_1_avg_parcelcount) , (c_2_avg_parcelcount) ) T (c)) AS Avg_,
(SELECT sum(c) FROM (VALUES (c_1_avg_parcelcount) , (c_2_avg_parcelcount) ) T (c)) AS sum_
FROM panel_stats_rnd;
I also tried the following, but it doesn't work:
WITH cols AS (
select value(column_name) from information_schema.columns
where table_name = 'panel_stats_rnd'
AND column_name SIMILAR TO 'c_%avg_parcelcount'
AND column_name != 'called_avg_parcelcount'
)
SELECT *, (SELECT avg(Col) FROM cols V(Col) ) AS col_average
FROM panel_stats_rnd;
I am almost there but something is missing...
select
*,
(select avg(v::numeric)
from json_each_text(row_to_json(panel_stats_rnd.*)) as j(k,v)
where k like 'c\_%\_avg\_parcelcount') as rnd_avg_parcelcount,
(select sum(v::numeric)
from json_each_text(row_to_json(panel_stats_rnd.*)) as j(k,v)
where k like 'c\_%\_sum\_parcelcount') as rnd_sum_parcelcount
from
panel_stats_rnd;
See the documentation for the functions involved.
The underscores are escaped (\_) because in a LIKE pattern an unescaped underscore matches any single character; for example, select 'a' like '_'; is true.
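To show why the escaping matters for this table, here is a small Python sketch of the same idea: turn one row into key/value pairs, keep the keys that match a LIKE-style pattern, and aggregate the values. The helpers like and aggregate are made up for illustration and cover only a subset of LIKE semantics. The sample values are taken from the gid = 2 row above.

```python
import re
from statistics import mean


def like(text, pattern):
    """Minimal LIKE matcher: % = any string, _ = any one character,
    and a backslash makes the next character literal."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None


def aggregate(row, pattern, func):
    # Rough counterpart of json_each_text(row_to_json(...)) plus a WHERE on the key.
    values = [v for k, v in row.items() if like(k, pattern)]
    return func(values) if values else None


# A few columns of the row with gid = 2.
row = {
    "gid": 2,
    "called_avg_parcelcount": 163140,
    "called_sum_callcount": 22,
    "c_1_avg_parcelcount": 163140,
    "c_1_sum_callcount": 17,
    "c_2_avg_parcelcount": 163140,
    "c_2_sum_callcount": 18,
}

print(aggregate(row, r"c\_%\_avg\_parcelcount", mean))  # 163140
print(aggregate(row, r"c\_%\_sum\_callcount", sum))     # 35
print(aggregate(row, "c_%_sum_callcount", sum))         # 57
```

With the unescaped pattern, called_sum_callcount also matches (the first underscore matches the "a"), which is why the last total is 57 instead of 35.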

left join 2 tables not working

I have 2 tables:
Table1: 'op_ats'
| ID1 | numero |id_cofre | id_chave | estadoAT
| 1 | 111 | 1 | 3 | 1
| 2 | 222 | 3 | 3 | 2
| 3 | 333 | 1 | 4 | 2
| 4 | 444 | 1 | 2 | 3
Table_2: 'op_ats_cofres_chaves'
| ID2 | num_chave |
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
| 5 | E |
I have this SQL:
SELECT chaves.*, ats.numero numAT, ats.estadoAT
FROM op_ats_cofres_chaves chaves
LEFT JOIN op_ats ats ON ats.id_chave_cofre = chaves.id AND ats.id_cofre = 1
With this I get the following result:
| ID2 | num_chave | numAT | estadoAT |
| 1 | A | 444 | 3 |
| 2 | B | NULL | NULL |
| 3 | C | 111 | 1 |
| 4 | D | 333 | 2 |
| 5 | E | NULL | NULL |
Now the problem is that I want to join only the rows from Table1 that have the column 'estadoAT' with the value 1 or 2. I've tried adding the line
WHERE op_ats.estadoAT = 1 OR op_ats.estadoAT = 2
But this gives the following result:
| ID2 | num_chave | numAT | estadoAT |
| 1 | A | 444 | 3 |
| 3 | C | 111 | 1 |
| 4 | D | 333 | 2 |
To summarize:
My intention is to get ALL rows in Table2 and join to them only the Table1 rows that have 'id_cofre = 1' and '(estadoAT = 1 OR estadoAT = 2)'.
Any help is appreciated.
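You have to move the condition into the JOIN clause instead of WHERE. A filter on the right-hand table in WHERE discards the rows where that table is NULL, which effectively turns the LEFT JOIN into an inner join. Note also that the condition must use the alias ats, and that the OR needs parentheses (or IN), because AND binds more tightly than OR.
SELECT chaves.*, ats.numero numAT, ats.estadoAT
FROM op_ats_cofres_chaves chaves
LEFT JOIN op_ats ats ON ats.id_chave_cofre = chaves.id AND ats.id_cofre = 1
AND (ats.estadoAT = 1 OR ats.estadoAT = 2);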
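The effect of where the filter is placed can be checked with a small sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module. SQLite only stands in for the asker's database here. The join columns (id on the keys table, id_chave on op_ats) are assumptions, since the question spells them in more than one way, so the output follows the sample rows rather than the result tables quoted in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE op_ats (id INTEGER, numero INTEGER, id_cofre INTEGER,
                         id_chave INTEGER, estadoAT INTEGER);
    INSERT INTO op_ats VALUES
        (1, 111, 1, 3, 1), (2, 222, 3, 3, 2),
        (3, 333, 1, 4, 2), (4, 444, 1, 2, 3);
    CREATE TABLE op_ats_cofres_chaves (id INTEGER, num_chave TEXT);
    INSERT INTO op_ats_cofres_chaves VALUES
        (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E');
""")

# Filter in the ON clause: every key row survives, unmatched ones get NULLs.
on_rows = conn.execute("""
    SELECT chaves.id, chaves.num_chave, ats.numero, ats.estadoAT
    FROM op_ats_cofres_chaves chaves
    LEFT JOIN op_ats ats
           ON ats.id_chave = chaves.id
          AND ats.id_cofre = 1
          AND ats.estadoAT IN (1, 2)
    ORDER BY chaves.id
""").fetchall()

# Same filter in WHERE: rows with NULL estadoAT are discarded after the join.
where_rows = conn.execute("""
    SELECT chaves.id, chaves.num_chave, ats.numero, ats.estadoAT
    FROM op_ats_cofres_chaves chaves
    LEFT JOIN op_ats ats
           ON ats.id_chave = chaves.id
          AND ats.id_cofre = 1
    WHERE ats.estadoAT IN (1, 2)
    ORDER BY chaves.id
""").fetchall()

for row in on_rows:
    print(row)
print(len(on_rows), len(where_rows))  # 5 2
```

With the filter in ON, keys A, B and E come back with NULLs; with the filter in WHERE, only C and D remain.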