PostgreSQL: Exclude duplicates as sorted by another key

Consider the following table that stores update history of some attributes of certain objects, organized by effective and published dates:
create table update_history(
obj_id integer,
effective date,
published date,
attr1 text,
attr2 integer,
attr3 boolean,
primary key(obj_id, effective, published)
);
insert into update_history values
(1, '2021-01-01', '2021-01-01', 'foo', null, null),
(1, '2021-01-01', '2021-01-02', null, 1, false),
(1, '2021-01-02', '2021-01-01', 'foo', 1, false),
(1, '2021-01-02', '2021-01-02', 'bar', 1, false),
(1, '2021-01-03', '2021-01-01', 'bar', 1, true),
(1, '2021-01-04', '2021-01-01', 'bar', 1, true),
(1, '2021-01-05', '2021-01-01', 'bar', 2, true),
(1, '2021-01-05', '2021-01-02', 'bar', 1, true),
(1, '2021-01-05', '2021-01-03', 'bar', 1, true),
(1, '2021-01-06', '2021-01-04', 'bar', 1, true)
;
I need to write a PostgreSQL query that simplifies the history view for a given obj_id by excluding those update records that did not change any attribute compared with the immediately preceding update, as ordered by the effective and published columns. In essence, those would be rows 6, 9 and 10 in the table below:
#   obj_id  effective   published   attr1   attr2   attr3
1   1       2021-01-01  2021-01-01  foo     (null)  (null)
2   1       2021-01-01  2021-01-02  (null)  1       false
3   1       2021-01-02  2021-01-01  foo     1       false
4   1       2021-01-02  2021-01-02  bar     1       false
5   1       2021-01-03  2021-01-01  bar     1       true
6   1       2021-01-04  2021-01-01  bar     1       true
7   1       2021-01-05  2021-01-01  bar     2       true
8   1       2021-01-05  2021-01-02  bar     1       true
9   1       2021-01-05  2021-01-03  bar     1       true
10  1       2021-01-06  2021-01-04  bar     1       true
Keep in mind that in the real life case there are way more attributes to deal with and I don't want the query to get too messy.
The closest I got to the desired result was using the rank window function:
select
obj_id, effective, published,
attr1, attr2, attr3
from (
select *,
rank() over (
partition by attr1, attr2, attr3
order by effective, published
) as rank
from update_history
where obj_id = 1) as d
where rank = 1
order by effective, published;
That results in this:
obj_id  effective   published   attr1   attr2   attr3
1       2021-01-01  2021-01-01  foo     (null)  (null)
1       2021-01-01  2021-01-02  (null)  1       false
1       2021-01-02  2021-01-01  foo     1       false
1       2021-01-02  2021-01-02  bar     1       false
1       2021-01-03  2021-01-01  bar     1       true
1       2021-01-05  2021-01-01  bar     2       true
As you can see, row #8 from the original table is erroneously excluded, although it changed attr2 compared with its previous row, #7. Apparently, the problem is that partitioning is applied before sorting in the window definition, so rows with equal attributes land in the same partition even when they are not adjacent in the effective/published order.
I wonder if there is another way to accomplish this with a single PostgreSQL query.

I would use lag() for this:
select *
from (
select obj_id, effective, published,
attr1, attr2, attr3,
(attr1, attr2, attr3) is distinct from lag( (attr1,attr2,attr3) ) over (partition by obj_id order by effective, published) as is_different
from update_history
) t
where is_different
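
As a quick check of the lag() idea against the sample data, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, which is an assumption made only so that the example is self-contained. SQLite's lag() cannot return a row value, so each attribute is lagged separately and compared with the null-safe IS NOT operator (SQLite's counterpart of IS DISTINCT FROM). The prev_id column is a helper introduced here to always keep the first row of an object.

```python
import sqlite3

# SQLite stands in for PostgreSQL here (assumption); window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table update_history(
        obj_id integer, effective text, published text,
        attr1 text, attr2 integer, attr3 integer,   -- attr3: 0/1 instead of boolean
        primary key (obj_id, effective, published))
""")
conn.executemany("insert into update_history values (?, ?, ?, ?, ?, ?)", [
    (1, "2021-01-01", "2021-01-01", "foo", None, None),
    (1, "2021-01-01", "2021-01-02", None, 1, False),
    (1, "2021-01-02", "2021-01-01", "foo", 1, False),
    (1, "2021-01-02", "2021-01-02", "bar", 1, False),
    (1, "2021-01-03", "2021-01-01", "bar", 1, True),
    (1, "2021-01-04", "2021-01-01", "bar", 1, True),
    (1, "2021-01-05", "2021-01-01", "bar", 2, True),
    (1, "2021-01-05", "2021-01-02", "bar", 1, True),
    (1, "2021-01-05", "2021-01-03", "bar", 1, True),
    (1, "2021-01-06", "2021-01-04", "bar", 1, True),
])

query = """
select effective, published, attr1, attr2, attr3
from (
    select *,
           lag(obj_id) over w as prev_id,   -- null only for the first row
           lag(attr1)  over w as prev1,
           lag(attr2)  over w as prev2,
           lag(attr3)  over w as prev3
    from update_history
    where obj_id = ?
    window w as (partition by obj_id order by effective, published)
) t
where prev_id is null          -- always keep the first row
   or attr1 is not prev1       -- null-safe "differs from previous row"
   or attr2 is not prev2
   or attr3 is not prev3
order by effective, published
"""
kept = conn.execute(query, (1,)).fetchall()
for row in kept:
    print(row)
```

With the sample data this keeps rows 1 to 5, 7 and 8, and drops rows 6, 9 and 10, which is the result asked for in the question.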

Related

DB2 count distinct on multiple columns

I am trying to count distinct combinations of multiple columns, but it's not working in DB2:
select count(distinct col1, col2) from table
It throws a syntax error saying that count has multiple columns.
Is there any way to achieve this?
column 1 column 2 date
1 a 2022-12-01
1 a 2022-12-01
2 a 2022-11-30
2 b 2022-11-30
1 b 2022-12-01
I want this output:
column1 column2 date count
1 a 2022-12-01 2
2 a 2022-11-30 1
2 b 2022-11-30 1
1 b 2022-12-01 1

The following query returns exactly what you want.
WITH MYTAB (column1, column2, date) AS
(
VALUES
(1, 'a', '2022-12-01')
, (1, 'a', '2022-12-01')
, (2, 'a', '2022-11-30')
, (2, 'b', '2022-11-30')
, (1, 'b', '2022-12-01')
)
SELECT
column1
, column2
, date
, COUNT (*) AS CNT
FROM MYTAB
GROUP BY
column1
, column2
, date
COLUMN1  COLUMN2  DATE        CNT
1        a        2022-12-01  2
1        b        2022-12-01  1
2        a        2022-11-30  1
2        b        2022-11-30  1

Not exactly sure of what you are looking for...
but
select count(distinct col1), count(distinct col2) from table
or
select count(distinct col1 CONCAT col2) from table
Are how I would interpret "distinct count of multiple values" in a table..
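
The two readings of the question can be compared side by side. The sketch below is an illustration, not DB2 code: it uses SQLite through Python's sqlite3 module as a stand-in (an assumption), and the table name mytab is taken from the first answer. The second query counts distinct (column1, column2) pairs through a derived table, which avoids the multi-argument count(distinct ...) that triggered the syntax error.

```python
import sqlite3

# SQLite stands in for DB2 here (assumption); the SQL itself is plain and portable.
conn = sqlite3.connect(":memory:")
conn.execute("create table mytab (column1 integer, column2 text, date text)")
conn.executemany("insert into mytab values (?, ?, ?)", [
    (1, "a", "2022-12-01"),
    (1, "a", "2022-12-01"),
    (2, "a", "2022-11-30"),
    (2, "b", "2022-11-30"),
    (1, "b", "2022-12-01"),
])

# Reading 1: how many rows share each (column1, column2, date) combination.
per_group = conn.execute("""
    select column1, column2, date, count(*) as cnt
    from mytab
    group by column1, column2, date
    order by column1, column2, date
""").fetchall()

# Reading 2: how many distinct (column1, column2) pairs exist at all.
distinct_pairs = conn.execute("""
    select count(*)
    from (select distinct column1, column2 from mytab) as t
""").fetchone()[0]

print(per_group)
print(distinct_pairs)
```

Unlike concatenating the columns, the derived table cannot confuse pairs such as ('ab', 'c') and ('a', 'bc'), which would concatenate to the same string.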

Select previous different value PostgreSQL

I have a table:
id  date        value
1   2022-01-01  1
1   2022-01-02  1
1   2022-01-03  2
1   2022-01-04  2
1   2022-01-05  3
1   2022-01-06  3
I want to detect changing of value column by date:
id  date        value  diff
1   2022-01-01  1      null
1   2022-01-02  1      null
1   2022-01-03  2      1
1   2022-01-04  2      1
1   2022-01-05  3      2
1   2022-01-06  3      2
I tried the window function lag(), but all I got was:
id  date        value  diff
1   2022-01-01  1      null
1   2022-01-02  1      1
1   2022-01-03  2      1
1   2022-01-04  2      2
1   2022-01-05  3      2
1   2022-01-06  3      3

I am pretty sure you have to do a gaps-and-islands to "group" your changes.
There may be a more concise way to get the result you want, but this is how I would solve this:
with changes as ( -- mark the changes and lag values
select id, date, value,
coalesce((value != lag(value) over w)::int, 1) as changed_flag,
lag(value) over w as last_value
from a_table
window w as (partition by id order by date)
), groupnums as ( -- number the groups, carrying the lag values forward
select id, date, value,
sum(changed_flag) over w as group_num,
last_value
from changes
window w as (partition by id order by date)
) -- final query that uses group numbering to return the correct lag value
select id, date, value,
first_value(last_value) over (partition by id, group_num
order by date) as diff
from groupnums;
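
To see the three steps produce the requested diff column, here is a minimal sketch of the same gaps-and-islands query. It runs on SQLite through Python's sqlite3 module instead of PostgreSQL (an assumption made for self-containment), so the ::int cast is dropped because SQLite comparisons already yield 0 or 1. The alias prev_value replaces last_value from the answer to avoid clashing with the window function of that name.

```python
import sqlite3

# SQLite stands in for PostgreSQL here (assumption); window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("create table a_table (id integer, date text, value integer)")
conn.executemany("insert into a_table values (?, ?, ?)", [
    (1, "2022-01-01", 1),
    (1, "2022-01-02", 1),
    (1, "2022-01-03", 2),
    (1, "2022-01-04", 2),
    (1, "2022-01-05", 3),
    (1, "2022-01-06", 3),
])

query = """
with changes as (      -- step 1: flag rows whose value differs from the previous row
    select id, date, value,
           coalesce((value != lag(value) over w), 1) as changed_flag,
           lag(value) over w as prev_value
    from a_table
    window w as (partition by id order by date)
), groupnums as (      -- step 2: a running sum of the flags numbers the islands
    select id, date, value, prev_value,
           sum(changed_flag) over (partition by id order by date) as group_num
    from changes
)
select id, date, value,   -- step 3: each island reuses the lag value of its first row
       first_value(prev_value) over (partition by id, group_num
                                     order by date) as diff
from groupnums
order by id, date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The diff column comes out as null, null, 1, 1, 2, 2, matching the desired table in the question.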

Improve performance on CTE with sub-queries

I have a table with this structure:
WorkerID Value GroupID Sequence Validity
1 '20%' 1 1 2018-01-01
1 '10%' 1 1 2017-06-01
1 'Yes' 1 2 2017-06-01
1 '2018-01-01' 2 1 2017-06-01
1 '17.2' 2 2 2017-06-01
2 '10%' 1 1 2017-06-01
2 'No' 1 2 2017-06-01
2 '2016-03-01' 2 1 2017-06-01
2 '15.9' 2 2 2017-06-01
This structure was created so that the client can create customized data for a worker. For example, Group 1 can be something like "Salary", and Sequence is one value that belongs to that Group, like "Overtime Compensation". The column Value is a VARCHAR(150) field, and the correct validation and conversion is done in another part of the application.
The Validity column exists mainly for historical reasons.
Now I would like to show, for the different workers, the information in a grid where each row should be one worker (displaying the one with the most recent Validity):
Worker 1_1 1_2 2_1 2_2
1 20% Yes 2018-01-01 17.2
2 10% No 2016-03-01 15.9
To accomplish this I created a CTE that looks like this:
WITH CTE_worker_grid
AS
(
SELECT
worker,
/* 1 */
(
SELECT top 1 w.Value
FROM worker_values AS w
WHERE w.GroupID = 1
AND w.Sequence = 1
ORDER BY w.Validity DESC
) AS 1_1,
(
SELECT top 1 w.Value
FROM worker_values AS w
WHERE w.GroupID = 1
AND w.Sequence = 2
ORDER BY w.Validity DESC
) AS 1_2,
/* 2 */
(
SELECT top 1 w.Value
FROM worker_values AS w
WHERE w.GroupID = 2
AND w.Sequence = 1
ORDER BY w.Validity DESC
) AS 2_1,
(
SELECT top 1 w.Value
FROM worker_values AS w
WHERE w.GroupID = 2
AND w.Sequence = 2
ORDER BY w.Validity DESC
) AS 2_2
)
GO
This produces the correct result, but it's very slow, as it creates this grid for over 18,000 workers with almost 30 Groups and up to 20 Sequences in each Group.
How could one speed up the process of a CTE of this magnitude? Should CTE even be used? Can the sub-queries be changed or re-factored out to speed up the execution?

Use a PIVOT!
+----------+---------+---------+------------+---------+
| WorkerId | 001_001 | 001_002 | 002_001 | 002_002 |
+----------+---------+---------+------------+---------+
| 1 | 20% | Yes | 2018-01-01 | 17.2 |
| 2 | 10% | No | 2016-03-01 | 15.9 |
+----------+---------+---------+------------+---------+
SQL Fiddle: http://sqlfiddle.com/#!18/6e768/1
CREATE TABLE WorkerAttributes
(
WorkerID INT NOT NULL
, [Value] VARCHAR(50) NOT NULL
, GroupID INT NOT NULL
, [Sequence] INT NOT NULL
, Validity DATE NOT NULL
)
INSERT INTO WorkerAttributes
(WorkerID, Value, GroupID, Sequence, Validity)
VALUES
(1, '20%', 1, 1, '2018-01-01')
, (1, '10%', 1, 1, '2017-06-01')
, (1, 'Yes', 1, 2, '2017-06-01')
, (1, '2018-01-01', 2, 1, '2017-06-01')
, (1, '17.2', 2, 2, '2017-06-01')
, (2, '10%', 1, 1, '2017-06-01')
, (2, 'No', 1, 2, '2017-06-01')
, (2, '2016-03-01', 2, 1, '2017-06-01')
, (2, '15.9', 2, 2, '2017-06-01')
;WITH CTE_WA_RANK
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY WorkerID, GroupID, [Sequence] ORDER BY Validity DESC) AS VersionNumber
, WA.WorkerID
, WA.GroupID
, WA.[Sequence]
, WA.[Value]
FROM
WorkerAttributes AS WA
),
CTE_WA
AS
(
SELECT
WA_RANK.WorkerID
, RIGHT('000' + CAST(WA_RANK.GroupID AS VARCHAR(3)), 3)
+ '_'
+ RIGHT('000' + CAST(WA_RANK.[Sequence] AS VARCHAR(3)), 3) AS SMART_KEY
, WA_RANK.[Value]
FROM
CTE_WA_RANK AS WA_RANK
WHERE
WA_RANK.VersionNumber = 1
)
SELECT
WorkerId
, [001_001] AS [001_001]
, [001_002] AS [001_002]
, [002_001] AS [002_001]
, [002_002] AS [002_002]
FROM
(
SELECT
CTE_WA.WorkerId
, CTE_WA.SMART_KEY
, CTE_WA.[Value]
FROM
CTE_WA
) AS WA
PIVOT
(
MAX([Value])
FOR
SMART_KEY IN
(
[001_001]
, [001_002]
, [002_001]
, [002_002]
)
) AS PVT
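
Where PIVOT is not available, the same grid can be built with conditional aggregation, which is a different technique from the PIVOT operator used above. The sketch below is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module rather than SQL Server, and the short column aliases 1_1 to 2_2 follow the question rather than the padded keys of the answer. Like the answer, it scans the table once and picks the latest Validity per worker, group and sequence with ROW_NUMBER().

```python
import sqlite3

# SQLite stands in for SQL Server here (assumption); window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table WorkerAttributes (
        WorkerID integer not null, Value text not null,
        GroupID integer not null, Sequence integer not null,
        Validity text not null)
""")
conn.executemany("insert into WorkerAttributes values (?, ?, ?, ?, ?)", [
    (1, "20%", 1, 1, "2018-01-01"),
    (1, "10%", 1, 1, "2017-06-01"),
    (1, "Yes", 1, 2, "2017-06-01"),
    (1, "2018-01-01", 2, 1, "2017-06-01"),
    (1, "17.2", 2, 2, "2017-06-01"),
    (2, "10%", 1, 1, "2017-06-01"),
    (2, "No", 1, 2, "2017-06-01"),
    (2, "2016-03-01", 2, 1, "2017-06-01"),
    (2, "15.9", 2, 2, "2017-06-01"),
])

query = """
with latest as (   -- rank the versions of each attribute, newest first
    select WorkerID, GroupID, Sequence, Value,
           row_number() over (partition by WorkerID, GroupID, Sequence
                              order by Validity desc) as VersionNumber
    from WorkerAttributes
)
select WorkerID,   -- one output column per (GroupID, Sequence) pair
       max(case when GroupID = 1 and Sequence = 1 then Value end) as "1_1",
       max(case when GroupID = 1 and Sequence = 2 then Value end) as "1_2",
       max(case when GroupID = 2 and Sequence = 1 then Value end) as "2_1",
       max(case when GroupID = 2 and Sequence = 2 then Value end) as "2_2"
from latest
where VersionNumber = 1
group by WorkerID
order by WorkerID
"""
grid = conn.execute(query).fetchall()
for row in grid:
    print(row)
```

With roughly 30 groups and up to 20 sequences each, the list of CASE expressions would normally be generated by the application rather than written by hand.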

Selecting specific row from a sub query depending on lowest priority

I have a table with Clients and their Insurance Providers. There is a column called Priority that ranges from 1 to 8. I want to select the lowest-priority insurance for each client into my 'master table'. I have a main query that provides Fees, Dates, Doctors, etc., and I need a subquery that I can join to it on Client_ID. The priority doesn't always start with 1. The Insurance table is the "many" side of the relationship.
Row# Client_id Insurance_id Priority active?
1 333 A 1 Y
2 333 B 2 Y
3 333 C 1 N
4 222 D 6 Y
5 222 A 8 Y
6 444 C 4 Y
7 444 A 5 Y
8 444 B 6 Y
Answer should be
Client_id Insurance_id Priority
333 A 1
222 D 6
444 C 4

I was able to achieve the results I think you're asking for pretty easily utilizing SQL's ROW_NUMBER() function:
declare @tbl table
(
Id int identity,
ClientId int,
InsuranceId char(1),
[Priority] int,
Active bit
)
insert into @tbl (ClientId, InsuranceId, [Priority], Active)
values (1, 'A', 1, 1),
(1, 'A', 2, 1),
(1, 'B', 3, 1),
(1, 'B', 4, 1),
(1, 'C', 1, 1),
(1, 'C', 2, 0),
(2, 'C', 1, 1),
(2, 'C', 2, 1)
select Id, ClientId, InsuranceId, [Priority]
from
(
select Id,
ClientId,
InsuranceId,
[Priority],
ROW_NUMBER() OVER (PARTITION BY ClientId, InsuranceId ORDER BY [Priority] desc) as RowNum
from @tbl
where Active = 1
) x
where x.RowNum = 1
Results:
(8 row(s) affected)
Id ClientId InsuranceId Priority
----------- ----------- ----------- -----------
2 1 A 2
4 1 B 4
5 1 C 1
8 2 C 2
(4 row(s) affected)
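
Applied to the data from the question itself, the same ROW_NUMBER() idea can be sketched as follows. Several things here are assumptions: SQLite (through Python's sqlite3 module) stands in for SQL Server, the table name client_insurance is hypothetical, inactive rows are assumed to be ignored, and the window is partitioned by client only and ordered by ascending priority so that the lowest priority wins.

```python
import sqlite3

# SQLite stands in for SQL Server here (assumption); window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
# client_insurance is a hypothetical name for the question's insurance table.
conn.execute("""
    create table client_insurance (
        client_id integer, insurance_id text, priority integer, active text)
""")
conn.executemany("insert into client_insurance values (?, ?, ?, ?)", [
    (333, "A", 1, "Y"),
    (333, "B", 2, "Y"),
    (333, "C", 1, "N"),
    (222, "D", 6, "Y"),
    (222, "A", 8, "Y"),
    (444, "C", 4, "Y"),
    (444, "A", 5, "Y"),
    (444, "B", 6, "Y"),
])

query = """
select client_id, insurance_id, priority
from (
    select client_id, insurance_id, priority,
           row_number() over (partition by client_id   -- one winner per client
                              order by priority) as rn -- lowest priority first
    from client_insurance
    where active = 'Y'                                 -- assumption: skip inactive rows
) x
where rn = 1
order by client_id
"""
lowest = conn.execute(query).fetchall()
for row in lowest:
    print(row)
```

This returns D for client 222, A for client 333 and C for client 444, the same rows the question lists as the expected answer. The derived table can then be joined to the main query on client_id.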

question with a query

Table1
sub-id ref-id Name
1 1 Project 1
2 1 Project 2
3 2 Project 3
4 2 Project 4
Table2
sub-id ref-id log_stamp Recepient log_type
----------------------------------------------------
1 1 06/06/2011 person A 1
1 1 06/14/2011 person B 2
1 1 06/16/2011 person C 2
1 1 06/17/2011 person D 3
2 1 06/18/2011 person E 2
2 1 06/19/2011 person F 2
3 2 06/20/2011 person G 1
4 2 06/23/2011 person H 3
Result
Name       ref-id  start_date  Recepient  latest_comment  Recepient  completion_date  Recepient
Project 1  1       06/06/2011  person A   06/19/2011      person F   06/17/2011       person D
Project 3  2       06/20/2011  person G   NULL            NULL       06/23/2011       person H
log_type of 1 stands for start_date
log_type of 2 stands for latest_comment
log_type of 3 stands for completion_date
The Name of the project is the name of the top-most row in the same ref-id group.
I have tried this so far:
;with T as (select
log.[ref-id],
log.log_stamp,
case log.log_type
when 1 then '1'
when 2 then '2'
when 3 then '3'
end as title
from
Submission sb inner join submission_log log on sb.[sub-id] = log.[sub-id]
)
select * from T
pivot (
max(log_stamp)
for title IN ([1],[2],[3],[5],[6],[9],[11])
) as p

I was unable to do it as a pivot; I don't think it is possible as described.
DECLARE @table1 TABLE (sub_id INT, ref_id INT, name VARCHAR(50))
INSERT @table1 VALUES (1, 1, 'Project 1')
INSERT @table1 VALUES (2, 1, 'Project 2')
INSERT @table1 VALUES (3, 2, 'Project 3' )
INSERT @table1 VALUES (4, 2, 'Project 4')
DECLARE @table2 TABLE (sub_id INT, ref_id INT, log_stamp DATETIME, recepient VARCHAR(10), logtype INT)
INSERT @table2 VALUES(1,1,'06/06/2011','person A',1)
INSERT @table2 VALUES(1,1,'06/14/2011','person B',2)
INSERT @table2 VALUES(1,1,'06/16/2011','person C',2)
INSERT @table2 VALUES(1,1,'06/17/2011','person D',3)
INSERT @table2 VALUES(2,1,'06/18/2011','person E',2)
INSERT @table2 VALUES(2,1,'06/19/2011','person F',2)
INSERT @table2 VALUES(3,2,'06/20/2011','person G',1)
INSERT @table2 VALUES(3,2,'06/23/2011','person H',3)
;WITH a as (
SELECT RN = ROW_NUMBER() OVER (PARTITION BY t1.sub_id, t1.ref_id, t1.name, t2.logtype ORDER BY log_stamp DESC), t1.sub_id, t1.ref_id, t1.name, t2.Recepient , t2.logtype ,log_stamp
FROM @table1 t1 JOIN @table2 t2 ON t1.ref_id = t2.ref_id AND
t1.sub_id = t2.sub_id),
b as (SELECT * FROM a WHERE RN = 1)
SELECT b1.name, b1.ref_id,b1.log_stamp start_date , b1.Recepient, b2.log_stamp latest_comment , b2.Recepient, b3.log_stamp completion_date , b3.Recepient
FROM b b1
LEFT JOIN b b2 ON b1.sub_id=b2.sub_id AND b1.ref_id = b2.ref_id AND b2.logtype = 2
LEFT JOIN b b3 ON b1.sub_id=b3.sub_id AND b1.ref_id = b3.ref_id AND b3.logtype = 3
WHERE b1.logtype = 1
Result:
name ref_id start_date Recepient latest_comment Recepient completion_date Recepient
------------ ----------- ----------------------- ---------- ----------------------- ---------- ----------------------- ----------
Project 1 1 2011-06-06 00:00:00.000 person A 2011-06-16 00:00:00.000 person C 2011-06-17 00:00:00.000 person D
Project 3 2 2011-06-20 00:00:00.000 person G NULL NULL 2011-06-23 00:00:00.000 person H
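
The result above picks the latest comment per sub_id (person C), while the question's expected output takes it across the whole ref_id group (person F). A variant that groups by ref_id and uses conditional aggregation instead of self-joins might look like the sketch below. It is an illustration under assumptions: SQLite (through Python's sqlite3 module) stands in for SQL Server, the rows are the ones from the question (person H logged under sub-id 4), the dates are rewritten in ISO format so that they sort as text, and the output column names such as start_recepient are made up for this example.

```python
import sqlite3

# SQLite stands in for SQL Server here (assumption); window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (sub_id integer, ref_id integer, name text)")
conn.execute("""
    create table table2 (
        sub_id integer, ref_id integer, log_stamp text,
        recepient text, log_type integer)
""")
conn.executemany("insert into table1 values (?, ?, ?)", [
    (1, 1, "Project 1"), (2, 1, "Project 2"),
    (3, 2, "Project 3"), (4, 2, "Project 4"),
])
conn.executemany("insert into table2 values (?, ?, ?, ?, ?)", [
    (1, 1, "2011-06-06", "person A", 1),
    (1, 1, "2011-06-14", "person B", 2),
    (1, 1, "2011-06-16", "person C", 2),
    (1, 1, "2011-06-17", "person D", 3),
    (2, 1, "2011-06-18", "person E", 2),
    (2, 1, "2011-06-19", "person F", 2),
    (3, 2, "2011-06-20", "person G", 1),
    (4, 2, "2011-06-23", "person H", 3),
])

query = """
with latest as (   -- newest log entry per ref_id and log_type
    select ref_id, log_type, log_stamp, recepient,
           row_number() over (partition by ref_id, log_type
                              order by log_stamp desc) as rn
    from table2
), names as (      -- top-most project name per ref_id
    select ref_id, name,
           row_number() over (partition by ref_id order by sub_id) as rn
    from table1
)
select n.name, n.ref_id,
       max(case when l.log_type = 1 then l.log_stamp end) as start_date,
       max(case when l.log_type = 1 then l.recepient end) as start_recepient,
       max(case when l.log_type = 2 then l.log_stamp end) as latest_comment,
       max(case when l.log_type = 2 then l.recepient end) as comment_recepient,
       max(case when l.log_type = 3 then l.log_stamp end) as completion_date,
       max(case when l.log_type = 3 then l.recepient end) as completion_recepient
from names n
left join latest l on l.ref_id = n.ref_id and l.rn = 1
where n.rn = 1
group by n.name, n.ref_id
order by n.ref_id
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

For ref_id 1 this yields person A, person F and person D, and for ref_id 2 it yields person G, no comment, and person H, in line with the Result table in the question.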
