Find unique entities with multiple UUID identifiers in redshift - amazon-redshift

Having an event table with multiple types of UUIDs per user, we would like to come up with a way to stitch all those UUIDs together to get the highest possible definition of a single user.
For example:
UUID1 | UUID2
------+------
  1   |  a
  1   |  a
  2   |  a
  2   |  b
  3   |  c
  4   |  c
There are 2 users here, the first one with uuid1={1,2} and uuid2={a,b}, the second one with uuid1={3,4} and uuid2={c}. These chains could potentially be very long. There are no intersections (i.e. 1c doesn't exist) and all rows are timestamp ordered.
Is there a way in Redshift to generate these unique "guest" identifiers without creating an immense query with many joins?
Thanks in advance!

Create test data table
-- DROP TABLE uuid_test;
CREATE TEMP TABLE uuid_test AS
SELECT 1 row_id, 1::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT 2 row_id, 1::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT 3 row_id, 2::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT 4 row_id, 2::int uuid1, 'b'::char(1) uuid2
UNION ALL SELECT 5 row_id, 3::int uuid1, 'c'::char(1) uuid2
UNION ALL SELECT 6 row_id, 4::int uuid1, 'c'::char(1) uuid2
UNION ALL SELECT 7 row_id, 4::int uuid1, 'd'::char(1) uuid2
UNION ALL SELECT 8 row_id, 5::int uuid1, 'e'::char(1) uuid2
UNION ALL SELECT 9 row_id, 6::int uuid1, 'e'::char(1) uuid2
UNION ALL SELECT 10 row_id, 6::int uuid1, 'f'::char(1) uuid2
UNION ALL SELECT 11 row_id, 7::int uuid1, 'f'::char(1) uuid2
UNION ALL SELECT 12 row_id, 8::int uuid1, 'g'::char(1) uuid2
UNION ALL SELECT 13 row_id, 8::int uuid1, 'h'::char(1) uuid2
;
The actual problem is solved by using strict ordering to find every place where the unique user changes, capturing that as a lookup table and then applying it to the original data.
-- Create lookup table with a from-to range of IDs for each unique user
WITH unique_user AS (
-- Calculate the end of the id range using LEAD() to look ahead
-- Use an inline MAX() to find the ending ID for the last entry
SELECT row_id AS from_id
, NVL(LEAD(row_id,1) OVER (ORDER BY row_id)-1, (SELECT MAX(row_id) FROM uuid_test) ) AS to_id
, unique_uuid
-- Mark unique user change when there is discontinuity in either UUID
FROM (SELECT row_id
,CASE WHEN NVL(LAG(uuid1,1) OVER (ORDER BY row_id), 0) <> uuid1
AND NVL(LAG(uuid2,1) OVER (ORDER BY row_id), '') <> uuid2
THEN MD5(uuid1||uuid2)
ELSE NULL END unique_uuid
FROM uuid_test) t
WHERE unique_uuid IS NOT NULL
ORDER BY row_id
)
-- Apply the unique user value to each row using a range join to the lookup table
SELECT a.row_id, a.uuid1, a.uuid2, b.unique_uuid
FROM uuid_test AS a
JOIN unique_user AS b
ON a.row_id BETWEEN b.from_id AND b.to_id
ORDER BY a.row_id
;
Here's the output
row_id | uuid1 | uuid2 | unique_uuid
--------+-------+-------+----------------------------------
1 | 1 | a | efaa153b0f682ae5170a3184fa0df28c
2 | 1 | a | efaa153b0f682ae5170a3184fa0df28c
3 | 2 | a | efaa153b0f682ae5170a3184fa0df28c
4 | 2 | b | efaa153b0f682ae5170a3184fa0df28c
5 | 3 | c | 5fcfcb7df376059d0075cb892b2cc37f
6 | 4 | c | 5fcfcb7df376059d0075cb892b2cc37f
7 | 4 | d | 5fcfcb7df376059d0075cb892b2cc37f
8 | 5 | e | 18a368e1052b5aa0388ef020dd9a1e20
9 | 6 | e | 18a368e1052b5aa0388ef020dd9a1e20
10 | 6 | f | 18a368e1052b5aa0388ef020dd9a1e20
11 | 7 | f | 18a368e1052b5aa0388ef020dd9a1e20
12 | 8 | g | 321fcc2447163a81d470b9353e394121
13 | 8 | h | 321fcc2447163a81d470b9353e394121
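For comparison, the same stitching can also be treated as a connected-components problem, which works even when the rows are not neatly timestamp-ordered. A minimal union-find sketch over the sample data (pure Python, illustration only, not something that runs inside Redshift):

```python
# Hypothetical union-find check: treat each (uuid1, uuid2) pair as an edge
# between the two ID spaces and find connected components; all rows in one
# component belong to one user.
rows = [(1, 'a'), (1, 'a'), (2, 'a'), (2, 'b'), (3, 'c'), (4, 'c'), (4, 'd'),
        (5, 'e'), (6, 'e'), (6, 'f'), (7, 'f'), (8, 'g'), (8, 'h')]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for u1, u2 in rows:
    union(('u1', u1), ('u2', u2))       # namespace the two ID spaces

# Assign each row the root of its component, then relabel 0..n for readability
groups = [find(('u1', u1)) for u1, u2 in rows]
labels = {g: i for i, g in enumerate(dict.fromkeys(groups))}
user_per_row = [labels[g] for g in groups]
```

On this data it finds the same four users as the SQL above (rows 1-4, 5-7, 8-11, and 12-13).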

Related

Get unique values across two columns

I have a table that looks like this:
id | col_1 | col_2
------+------------------+-------------------
1 | 12 | 15
2 | 12 | 16
3 | 12 | 17
4 | 13 | 18
5 | 14 | 18
6 | 14 | 19
7 | 15 | 19
8 | 16 | 20
I know if I do something like this, it will return all unique values from col_1:
select distinct(col_1) from table;
Is there a way I can get the distinct values across two columns? So my output would only be:
12
13
14
15
16
17
18
19
20
That is, it would take the distinct values from col_1, add col_2's distinct values, and collapse any value that appears in both lists to a single entry (such as 15, which appears in both col_1 and col_2).
You can use a UNION
select col_1
from the_table
union
select col_2
from the_table;
union implies a distinct operation; the above is the same as:
select distinct col
from (
select col_1 as col
from the_table
union all
select col_2 as col
from the_table
) x
You will need to use union
select col1 from table
union
select col2 from table;
You will not need distinct here because a union automatically does that for you (as opposed to a union all).
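As a quick sanity check, the same UNION runs unchanged on an in-memory SQLite database (a convenient stand-in here, not the asker's engine); table and column names follow the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE the_table (id INTEGER, col_1 INTEGER, col_2 INTEGER)")
data = [(1, 12, 15), (2, 12, 16), (3, 12, 17), (4, 13, 18),
        (5, 14, 18), (6, 14, 19), (7, 15, 19), (8, 16, 20)]
con.executemany("INSERT INTO the_table VALUES (?,?,?)", data)

# UNION (without ALL) deduplicates across both columns
rows = con.execute(
    "SELECT col_1 FROM the_table UNION SELECT col_2 FROM the_table ORDER BY 1"
).fetchall()
values = [r[0] for r in rows]
```

This yields each value once, including 15, which appears in both columns.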

Compare every row to all other rows

Assume I have a table like this:
CREATE TABLE events (
event_id INTEGER,
begin_date DATE,
end_date DATE,
PRIMARY KEY (event_id));
With data like this:
INSERT INTO events SELECT 1 AS event_id,'2017-01-01'::DATE AS begin_date, '2017-01-07'::DATE AS end_date;
INSERT INTO events SELECT 2 AS event_id,'2017-01-04'::DATE AS begin_date, '2017-01-05'::DATE AS end_date;
INSERT INTO events SELECT 3 AS event_id,'2017-01-02'::DATE AS begin_date, '2017-01-03'::DATE AS end_date;
INSERT INTO events SELECT 4 AS event_id,'2017-01-03'::DATE AS begin_date, '2017-01-08'::DATE AS end_date;
INSERT INTO events SELECT 5 AS event_id,'2017-01-02'::DATE AS begin_date, '2017-01-09'::DATE AS end_date;
INSERT INTO events SELECT 6 AS event_id,'2017-01-03'::DATE AS begin_date, '2017-01-06'::DATE AS end_date;
INSERT INTO events SELECT 7 AS event_id,'2017-01-08'::DATE AS begin_date, '2017-01-09'::DATE AS end_date;
I would like to be able to do this:
SELECT a.event_id,
COUNT (*) AS COUNT
FROM events AS a
LEFT JOIN events AS b
ON a.begin_date < b.begin_date
AND a.end_date > b.end_date
GROUP BY a.event_id
ORDER BY a.event_id ASC
With results like this:
*----------*--------*
| event_id | count |
*-------------------*
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 3 |
| 6 | 1 |
| 7 | 1 |
*----------*-------*
But with a window function (because it's much faster than the inequality join). Something like this, where I can compare the outer row to the inner rows.
SELECT a.event_id,
COUNT(*) OVER (a.begin_date < b.begin_date AND a.end_date > b.end_date) AS count
FROM events AS a
ORDER BY a.event_id ASC
Ideally this would work on both Postgres and Redshift.
I don't think you will get out of using a JOIN here. Even the window function idea would require a sort of implicit join since the window being compared would have to be a set of all other records.
Instead, consider using Postgres' daterange type. You can cast your begin and end dates into a range and check to see if the exclusive range from your left table contains the inclusive range of your right table:
SELECT t1.event_id, count(*)
FROM events t1
LEFT OUTER JOIN events t2
ON daterange(t1.begin_date, t1.end_date, '()') @> daterange(t2.begin_date, t2.end_date, '[]') AND t1.event_id <> t2.event_id
GROUP BY 1
ORDER BY 1;
event_id | count
----------+-------
1 | 3
2 | 1
3 | 1
4 | 1
5 | 3
6 | 1
7 | 1
The real question (and I don't know the answer) is whether all of this casting and "range includes" @> logic is more efficient than your inequality version. While the Nested Loop Left Join used here has a lower estimated row count, the total cost is about 33% higher.
I have a hunch that if your data were stored as an inclusive daterange type, the cost would be lower because no cast would be needed (although since we are comparing an inclusive range to an exclusive one, it may be a wash).
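As a cross-check of the counts, the same strictly-contains rule can be evaluated in plain Python over the sample data; the max(1, ...) mimics the LEFT JOIN's COUNT(*), which still counts 1 when nothing matches:

```python
from datetime import date

# Sample events from the question: event_id -> (begin_date, end_date)
events = {
    1: (date(2017, 1, 1), date(2017, 1, 7)),
    2: (date(2017, 1, 4), date(2017, 1, 5)),
    3: (date(2017, 1, 2), date(2017, 1, 3)),
    4: (date(2017, 1, 3), date(2017, 1, 8)),
    5: (date(2017, 1, 2), date(2017, 1, 9)),
    6: (date(2017, 1, 3), date(2017, 1, 6)),
    7: (date(2017, 1, 8), date(2017, 1, 9)),
}

def contains(outer, inner):
    # outer strictly contains inner on both ends, matching the
    # exclusive-contains-inclusive range comparison above
    return outer[0] < inner[0] and outer[1] > inner[1]

counts = {eid: max(1, sum(contains(ev, other)
                          for oid, other in events.items() if oid != eid))
          for eid, ev in events.items()}
```

This reproduces the table above: events 1 and 5 contain three others; everything else counts 1.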

T-SQL query to remove duplicates from large tables using join

I am new to T-SQL queries and have been trying different solutions to remove duplicate rows from a fairly large table (over 270,000 rows).
The table looks something like:
TableA
-----------
RowID int not null identity(1,1) primary key,
Col1 varchar(50) not null,
Col2 int not null,
Col3 varchar(50) not null
The rows for this table are not perfect duplicates because of the existence of the RowID identity field.
The second table that I need to join with:
TableB
-----------
RowID int not null identity(1,1) primary key,
Col1 int not null,
Col2 varchar(50) not null
In TableA I have something like:
1 | gray | 4 | Angela
2 | red | 6 | Diana
3 | black| 6 | Alina
4 | black| 11 | Dana
5 | gray | 4 | Angela
6 | red | 12 | Dana
7 | red | 6 | Diana
8 | black| 11 | Dana
And in TableB:
1 | 6 | klm
2 | 11 | lmi
The second column of TableB (Col1) is referenced by TableA's Col2 (a foreign key).
I need to remove ONLY the duplicates from TableA that have Col2 = 6, ignoring the other duplicates. The result should be:
1 | gray | 4 | Angela
2 | red | 6 | Diana
4 | black| 6 | Alina
5 | black| 11 | Dana
6 | gray | 4 | Angela
7 | red | 12 | Dana
8 | black| 11 | Dana
I tried using
DELETE FROM TableA a inner join TableB b on a.Col2=b.Col1
WHERE a.RowId NOT IN (SELECT MIN(RowId) FROM TableA GROUP BY RowId, Col1, Col2, Col3) and b.Col2="klm"
but I still get some of the duplicates that I need to remove.
What is the best way to remove not perfect duplicate rows using join?
MIN() only ever returns one value per group, and grouping by the PK (RowID) makes every row its own group, so the subquery returns every RowID and nothing is filtered out. Also, the RowIDs in your expected output don't match the sample data.
DELETE FROM TableA a
inner join TableB b
on a.Col2=b.Col1
WHERE a.RowId NOT IN (SELECT MIN(RowId)
FROM TableA GROUP BY RowId, Col1, Col2, Col3)
and b.Col2="klm"
These would be the rows to delete:
select *
from
( select a.*
       , row_number() over (partition by Col1, Col3 order by RowID) as rn
  from TableA a
  where a.Col2 = 6
) tt
where tt.rn > 1
Another solution is:
WITH CTE AS (
SELECT t.RowID, t.Col1, t.Col2, t.Col3,
RN = ROW_NUMBER() OVER (PARTITION BY t.Col1, t.Col2, t.Col3 ORDER BY t.RowID)
FROM TableA t
)
DELETE FROM CTE WHERE RN > 1 AND Col2 = 6;
regards.
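DELETE FROM a CTE is T-SQL-specific, but the keep-the-minimum-RowID idea can be sketched portably. Here it runs against an in-memory SQLite database (a stand-in, not SQL Server) using the question's sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TableA (RowID INTEGER PRIMARY KEY, "
            "Col1 TEXT, Col2 INTEGER, Col3 TEXT)")
rows = [(1, 'gray', 4, 'Angela'), (2, 'red', 6, 'Diana'),
        (3, 'black', 6, 'Alina'), (4, 'black', 11, 'Dana'),
        (5, 'gray', 4, 'Angela'), (6, 'red', 12, 'Dana'),
        (7, 'red', 6, 'Diana'), (8, 'black', 11, 'Dana')]
con.executemany("INSERT INTO TableA VALUES (?,?,?,?)", rows)

# Delete every Col2 = 6 row that is not the minimum RowID of its
# (Col1, Col2, Col3) duplicate group
con.execute("""
    DELETE FROM TableA
    WHERE Col2 = 6
      AND RowID NOT IN (SELECT MIN(RowID)
                        FROM TableA
                        WHERE Col2 = 6
                        GROUP BY Col1, Col2, Col3)
""")
remaining = [r[0] for r in con.execute("SELECT RowID FROM TableA ORDER BY RowID")]
```

Only row 7 (the second red/6/Diana) is removed; duplicates with other Col2 values are left alone.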

Can window function LAG reference the column which value is being calculated?

I need to calculate the value of some column X based on some other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement a query of the form
SELECT <some fields>,
<some expression using LAG(X) OVER (PARTITION BY ... ORDER BY ...)> AS X
FROM <table>
This is not possible because only existing columns can be used in a window function, so I'm looking for a way to overcome this.
Here is an example. I have a table with events. Each event has type and time_stamp.
create table event (id serial, type integer, time_stamp integer);
I want to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events of a given type by time_stamp ascending. Then:
the first event is not a duplicate
all events that follow a non-duplicate and are within some time frame after it (that is, their time_stamp is not greater than the time_stamp of the previous non-duplicate plus some constant TIMEFRAME) are duplicates
the next event whose time_stamp is greater than the previous non-duplicate's by more than TIMEFRAME is not a duplicate
and so on
For this data
insert into event (type, time_stamp)
values
(1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10),
(1,15), (1, 21), (2,13),
(1, 40);
and TIMEFRAME=10 result should be
time_stamp | type | duplicate
-----------------------------
1 | 1 | false
2 | 1 | true
3 | 1 | true
10 | 1 | true
15 | 1 | false
21 | 1 | true
40 | 1 | false
2 | 2 | false
10 | 2 | true
13 | 2 | false
I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:
WITH evt AS (
SELECT
time_stamp,
CASE WHEN
time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
THEN
time_stamp
ELSE
LAG(current_non_dupl_time_stamp) OVER w
END AS current_non_dupl_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
FROM evt
But this does not work because the field which is calculated cannot be referenced in LAG:
ERROR: column "current_non_dupl_time_stamp" does not exist.
So the question: can I rewrite this query to achieve the effect I need?
Naive recursive chain knitter:
-- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
SELECT e.type,e.time_stamp
, ROW_NUMBER() OVER www as rn -- number the records
, FIRST_VALUE(e.time_stamp) OVER www as fst -- the "group leader"
, EXISTS (SELECT * FROM event x
WHERE x.type = e.type
AND x.time_stamp < e.time_stamp) AS is_dup
FROM event e
WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
;
WITH RECURSIVE ttt AS (
SELECT d0.*
FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
UNION ALL
SELECT d1.type, d1.time_stamp, d1.rn
, CASE WHEN d1.time_stamp - ttt.fst > 10 THEN d1.time_stamp
ELSE ttt.fst END AS fst -- new "group leader" (TIMEFRAME = 10)
, CASE WHEN d1.time_stamp - ttt.fst > 10 THEN False
ELSE True END AS is_dup
FROM drag d1
JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
)
SELECT * FROM ttt
ORDER BY type, time_stamp
;
Results:
CREATE TABLE
INSERT 0 10
CREATE VIEW
type | time_stamp | rn | fst | is_dup
------+------------+----+-----+--------
1 | 1 | 1 | 1 | f
1 | 2 | 2 | 1 | t
1 | 3 | 3 | 1 | t
1 | 10 | 4 | 1 | t
1 | 15 | 5 | 15 | f
1 | 21 | 6 | 15 | t
1 | 40 | 7 | 40 | f
2 | 2 | 1 | 2 | f
2 | 10 | 2 | 2 | t
2 | 13 | 3 | 13 | f
(10 rows)
An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.
State transition function:
create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
if st is null or st[1] + timeframe <= time_stamp
then
st[1] := time_stamp;
end if;
st[2] := time_stamp;
return st;
end $$;
Final function:
create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
select st[1] <> st[2];
$$;
Aggregate:
create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
sfunc = is_duplicate,
stype = int[],
finalfunc = is_duplicate_final
);
Query:
select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
id | type | time_stamp | is_duplicate_agg
----+------+------------+------------------
1 | 1 | 1 | f
2 | 1 | 2 | t
4 | 1 | 3 | t
5 | 1 | 10 | t
7 | 1 | 15 | f
8 | 1 | 21 | t
10 | 1 | 40 | f
3 | 2 | 2 | f
6 | 2 | 10 | t
9 | 2 | 13 | f
(10 rows)
Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
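The aggregate's state machine is small enough to mirror in a few lines of Python, which makes it easy to verify the duplicate flags for the sample data (same boundary convention as the transition function above):

```python
TIMEFRAME = 10
# (type, time_stamp) pairs from the question's INSERT
events = [(1, 1), (1, 2), (2, 2), (1, 3), (1, 10), (2, 10),
          (1, 15), (1, 21), (2, 13), (1, 40)]

flags = {}    # (type, time_stamp) -> is_duplicate
leaders = {}  # current non-duplicate time_stamp per type
for typ, ts in sorted(events):
    leader = leaders.get(typ)
    if leader is None or leader + TIMEFRAME <= ts:
        leaders[typ] = ts          # new "group leader": not a duplicate
        flags[(typ, ts)] = False
    else:
        flags[(typ, ts)] = True    # within TIMEFRAME of the leader
```

This reproduces the aggregate's output: per type, only the group leaders (1, 15, 40 for type 1; 2, 13 for type 2) are non-duplicates.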
This feels more like a recursive problem than windowing function. The following query obtained the desired results:
WITH RECURSIVE base(type, time_stamp, next_time_stamp) AS (
-- 3. base of recursive query
SELECT x.type, x.time_stamp, y.next_time_stamp
FROM
-- 1. start with the initial records of each type
( SELECT type, min(time_stamp) AS time_stamp
FROM event
GROUP BY type
) x
LEFT JOIN LATERAL
-- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = x.type
AND time_stamp > (x.time_stamp + 10)
) y ON true
UNION ALL
-- 4. recursive join, same logic as base
SELECT e.type, e.time_stamp, z.next_time_stamp
FROM event e
JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
LEFT JOIN LATERAL
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = e.type
AND time_stamp > (e.time_stamp + 10)
) z ON true
)
-- The actual query:
-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false
FROM base
UNION
-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
FROM event
WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base)
ORDER BY type, time_stamp
There are a lot of caveats with this. It assumes no duplicate time_stamp for a given type. Really the joins should be based on a unique id rather than type and time_stamp. I didn't test this much, but it may at least suggest an approach.
This is my first time trying a LATERAL join, so there may be a way to simplify it more. Really what I wanted to do was a recursive CTE with the recursive part using MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in CTEs in that manner. It seems the lateral join can be used inside the CTE instead.

pl sql query recursive looping

I have only one table, "tbl_test", whose fields are given below:
tbl_test table
trx_id | proj_num | parent_num|
1 | 14 | 0 |
2 | 14 | 1 |
3 | 14 | 2 |
4 | 14 | 0 |
5 | 14 | 3 |
6 | 15 | 0 |
The result I want when trx_id value 5 is fetched:
It's a parent-child relationship, so:
trx_id -> parent_num
5 -> 3
3 -> 2
2 -> 1
That means the output values are:
3
2
1
That is, getting the whole parent chain.
The query I used:
SELECT * FROM (
WITH RECURSIVE tree_data(project_num, task_num, parent_task_num) AS(
SELECT project_num, task_num, parent_task_num
FROM tb_task
WHERE project_num = 14 and task_num = 5
UNION ALL
SELECT child.project_num, child.task_num, child.parent_task_num
FROM tree_data parent Join tb_task child
ON parent.task_num = child.task_num AND parent.task_num = child.parent_task_num
)
SELECT project_num, task_num, parent_task_num
FROM tree_data
) AS tree_list ;
Can anybody help me ?
There's no need to do this with pl/pgsql. You can do it straight in SQL. Consider:
WITH RECURSIVE my_tree AS (
SELECT trx_id as id, parent_num as parent, trx_id::text as path, 1 as level
FROM tbl_test
WHERE trx_id = 5 -- start value
UNION ALL
SELECT t.trx_id, t.parent_num, p.path || ',' || t.trx_id::text, p.level + 1
FROM my_tree p
JOIN tbl_test t ON t.trx_id = p.parent
)
select * from my_tree;
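Stripped of the ::text casts and the path column (omitted for brevity), essentially the same recursive CTE runs on SQLite, which makes it easy to try against the question's sample rows (SQLite is just a convenient stand-in here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_test (trx_id INTEGER, proj_num INTEGER, "
            "parent_num INTEGER)")
con.executemany("INSERT INTO tbl_test VALUES (?,?,?)",
                [(1, 14, 0), (2, 14, 1), (3, 14, 2),
                 (4, 14, 0), (5, 14, 3), (6, 15, 0)])

# Walk up the parent chain starting from trx_id = 5
rows = con.execute("""
    WITH RECURSIVE my_tree AS (
        SELECT trx_id AS id, parent_num AS parent, 1 AS level
        FROM tbl_test WHERE trx_id = 5          -- start value
        UNION ALL
        SELECT t.trx_id, t.parent_num, p.level + 1
        FROM my_tree p JOIN tbl_test t ON t.trx_id = p.parent
    )
    SELECT id, parent FROM my_tree ORDER BY level
""").fetchall()
```

The walk visits 5 → 3 → 2 → 1 and stops when parent 0 has no matching row, giving exactly the 3, 2, 1 parent chain the asker wants.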
If you are using PostgreSQL, try using a WITH clause:
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
product,
SUM(quantity) AS product_units,
SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;