Check in PGSQL if two ids match in two ways - postgresql

I have two user ids: ID_User1 and ID_User2. I would like to check in my table whether there are duplicate entries where ID_User1 is matched with ID_User2, and also entries where ID_User2 is matched with ID_User1, like this:
1 match with 2
1 match with 2
but also
2 match with 1
I currently do this:
SELECT * FROM (SELECT *, COUNT(*) OVER (PARTITION BY "IDWU_User1", "IDWU_User2") AS COUNT FROM "WU_MatchingUsers") tableWithCount WHERE tableWithCount.count > 1;
and here is the result :
id | IDWU_User1 | IDWU_User2 | MatchingScore | count
----+------------+------------+---------------+-------
1 | 1 | 2 | 39 | 2
46 | 1 | 2 | 35 | 2
2 | 1 | 3 | 41 | 2
47 | 1 | 3 | 35 | 2
But I would like this result:
id | IDWU_User1 | IDWU_User2 | MatchingScore | count
----+------------+------------+---------------+-------
1 | 1 | 2 | 39 | 3
46 | 1 | 2 | 35 | 3
48 | 2 | 1 | 35 | 3
2 | 1 | 3 | 41 | 2
47 | 1 | 3 | 35 | 2
I also want the row in the middle. In other words, the check should find a duplicate row / a matching in the direction B > A as well, not only A > B.
Kind regards !

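Normalize each pair so that the smaller id always comes first, whichever column it is stored in, and partition by that. A CASE expression on whether id1 is greater than id2 works, and so does concatenating the ordered ids as strings with a delimiter; LEAST and GREATEST express the same thing more compactly.
Do not collapse the pair into a single number with a weighted sum such as 2 * larger + 3 * smaller: that key is not unique (the pairs (5, 4) and (8, 2) both give 22).
SELECT * FROM (SELECT *, COUNT(*) OVER (PARTITION BY LEAST("IDWU_User1", "IDWU_User2"), GREATEST("IDWU_User1", "IDWU_User2"))
AS count FROM "WU_MatchingUsers") tableWithCount WHERE tableWithCount.count > 1;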
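To see why the partition key has to identify the unordered pair exactly, here is a minimal Python sketch of the same logic. The function names are made up for illustration, and the rows are the ones from the expected output in the question. It compares a weighted-sum key of the form 2 * larger + 3 * smaller with a (smaller, larger) tuple, then counts rows per unordered pair the way COUNT(*) OVER (PARTITION BY ...) would.

```python
from collections import Counter

# Rows copied from the question's expected output: (id, user1, user2).
rows = [(1, 1, 2), (46, 1, 2), (48, 2, 1), (2, 1, 3), (47, 1, 3)]


def weighted_key(a, b):
    """Single-number key: 2 * larger + 3 * smaller (not unique, see below)."""
    return 2 * max(a, b) + 3 * min(a, b)


def pair_key(a, b):
    """Order-independent key: (smaller id, larger id)."""
    return (min(a, b), max(a, b))


# Two different pairs share one weighted key (both evaluate to 22) ...
collision = weighted_key(5, 4) == weighted_key(8, 2)
# ... but they keep distinct normalized keys.
distinct = pair_key(5, 4) != pair_key(8, 2)

# Equivalent of COUNT(*) OVER (PARTITION BY smaller, larger), then count > 1.
counts = Counter(pair_key(u1, u2) for _, u1, u2 in rows)
duplicates = [
    (rid, u1, u2, counts[pair_key(u1, u2)])
    for rid, u1, u2 in rows
    if counts[pair_key(u1, u2)] > 1
]

print(collision, distinct)
for row in duplicates:
    print(row)
```

With the sample rows, the three rows for users 1 and 2 (in either order) each get a count of 3, and the two rows for users 1 and 3 get a count of 2.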

How can you filter for only the max value from a queried table in Postgresql?

I'm fairly new to Postgresql and my problem can be simplified to the following:
Suppose that I have 2 tables:
Table A:
id | join_value | filter_data1 | filter_data2
---------------------------------------------
1 | 1 | "Yes" | 1
2 | 1 | "Yes" | 3
3 | 2 | "No" | 0
Table B:
id | join_value | filter_data1 | filter_data2 | date
---------------------------------------------------------
1 | 3 | "Yes" | 0 | 1/3/2021
2 | 1 | "Yes" | 17 | 1/3/2021
3 | 1 | "No" | -1 | 1/2/2021
4 | 1 | "Yes" | 32 | 1/2/2021
5 | 1 | "Yes" | 40 | 1/3/2021
I would like to filter these tables on the filter data and then join them on the join value. The catch is that I would then like to only grab the values that have a date == MAX(date). Here is an example of a query that I have attempted.
SELECT * FROM
(SELECT * FROM A
WHERE filter_data1 = 'Yes'
AND filter_data2 > 2)
AS a_tab
JOIN
(SELECT * FROM B
WHERE filter_data1 = 'Yes'
AND filter_data2 > 16)
AS b_tab
ON a_tab.join_value = b_tab.join_value;
This would give me the following table:
id | join_value | filter_data1 | filter_data2 | id | filter_data1 | filter_data2 | date
------------------------------------------------------------------------------------------
2 | 1 | "Yes" | 3 | 2 | "Yes" | 17 | 1/3/2021
2 | 1 | "Yes" | 3 | 4 | "Yes" | 32 | 1/2/2021
2 | 1 | "Yes" | 3 | 5 | "Yes" | 40 | 1/3/2021
But the problem is, I would like to also do a 'WHERE date = MAX(date)'
The resulting table would be this:
id | join_value | filter_data1 | filter_data2 | id | filter_data1 | filter_data2 | date
------------------------------------------------------------------------------------------
2 | 1 | "Yes" | 3 | 2 | "Yes" | 17 | 1/3/2021
2 | 1 | "Yes" | 3 | 5 | "Yes" | 40 | 1/3/2021
Does anyone have any ideas how to accomplish this?
First, here is a more readable way to write your existing query:
SELECT
a.*, b.*
FROM a
INNER JOIN b ON b.join_value = a.join_value
WHERE a.filter_data1 = 'Yes' AND a.filter_data2 > 2
AND b.filter_data1 = 'Yes' AND b.filter_data2 > 16
Next, add a column that holds the maximum value of the date column over all rows of this result. A window function does that:
SELECT
a.*, b.*, MAX(b.date) OVER ()
FROM a
INNER JOIN b ON b.join_value = a.join_value
WHERE a.filter_data1 = 'Yes' AND a.filter_data2 > 2
AND b.filter_data1 = 'Yes' AND b.filter_data2 > 16
Window functions are evaluated after the WHERE clause, so the new column cannot be used in a condition at the same query level. Use this query as a subquery and add the condition to the top-level query:
SELECT
*
FROM (
SELECT
a.*, b.*, MAX(b.date) OVER () AS max_date
FROM a
INNER JOIN b ON b.join_value = a.join_value
WHERE a.filter_data1 = 'Yes' AND a.filter_data2 > 2
AND b.filter_data1 = 'Yes' AND b.filter_data2 > 16
) t
WHERE t.date = t.max_date
This should give you the required result.
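As a cross-check of what the query is meant to return, the filter, join and max-date steps can be sketched in plain Python on the sample rows from the question. This is only a model of the logic, not a replacement for the SQL; the dates are read as month/day/year, which is an assumption, and the variable names are illustrative.

```python
from datetime import date

# Sample rows from the question. Dates are read as month/day/year (assumption).
table_a = [
    {"id": 1, "join_value": 1, "filter_data1": "Yes", "filter_data2": 1},
    {"id": 2, "join_value": 1, "filter_data1": "Yes", "filter_data2": 3},
    {"id": 3, "join_value": 2, "filter_data1": "No", "filter_data2": 0},
]
table_b = [
    {"id": 1, "join_value": 3, "filter_data1": "Yes", "filter_data2": 0, "date": date(2021, 1, 3)},
    {"id": 2, "join_value": 1, "filter_data1": "Yes", "filter_data2": 17, "date": date(2021, 1, 3)},
    {"id": 3, "join_value": 1, "filter_data1": "No", "filter_data2": -1, "date": date(2021, 1, 2)},
    {"id": 4, "join_value": 1, "filter_data1": "Yes", "filter_data2": 32, "date": date(2021, 1, 2)},
    {"id": 5, "join_value": 1, "filter_data1": "Yes", "filter_data2": 40, "date": date(2021, 1, 3)},
]

# WHERE clauses on both sides.
a_rows = [a for a in table_a if a["filter_data1"] == "Yes" and a["filter_data2"] > 2]
b_rows = [b for b in table_b if b["filter_data1"] == "Yes" and b["filter_data2"] > 16]

# INNER JOIN on join_value.
joined = [(a, b) for a in a_rows for b in b_rows if a["join_value"] == b["join_value"]]

# MAX(b.date) OVER (): one maximum over every row of the joined result.
max_date = max((b["date"] for _, b in joined), default=None)

# Outer WHERE t.date = t.max_date.
result = [(a["id"], b["id"]) for a, b in joined if b["date"] == max_date]
print(result)
```

The maximum is taken after filtering and joining, so rows that were filtered out never influence it. That is the same reason the window function sits inside the subquery.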

Postgres - Find pairs of neighbours polygons

I have a list of IDs (polygons) and in a table (i.e. table zones) I've all the possible permutations of these IDs. In another table (i.e. zonesid) I've their corresponding geometries [geometry(MultiPolygon,4326)].
table zones:
index | zone1 | zone2
-------+--------+--------
0 | 100 | 100
1 | 100 | 101
2 | 100 | 102
3 | 101 | 100
4 | 101 | 101
5 | 101 | 102
6 | 102 | 100
7 | 102 | 101
8 | 102 | 102
table zonesid:
index | zone_id | geom
-------+--------+--------
0 | 100 | geom100
1 | 101 | geom101
2 | 102 | geom102
Now I'd need to find which areas are adjacent and write a 1 next to the pair.
I've read the question Finding neighbouring polygons - postgis query and I think I need something similar, even if in this case I need to make it indicating the exact pair.
In the above example let's say that just 100 and 102 are adjacent. It should be:
table zones:
index | zone1 | zone2 | adjacent
-------+--------+---------+--------
0 | 100 | 100 | 0
1 | 100 | 101 | 0
2 | 100 | 102 | 1
3 | 101 | 100 | 0
4 | 101 | 101 | 0
5 | 101 | 102 | 0
6 | 102 | 100 | 1
7 | 102 | 101 | 0
8 | 102 | 102 | 0
I've started with:
ALTER TABLE zones
ADD COLUMN adjacent bigint;
UPDATE zones set adjacent=1, time=2
FROM (
SELECT (*)
FROM zonesid as a,
zonesid as b,
zones as c,
zones as d
WHERE ST_Touches(a.geom, b.geom)
AND c.zone1 != d.zone2
) as subquery
WHERE c.zone1 = subquery.zoneid
But I'm struggling to work out how to refer to the table zonesid correctly, so that I can compare the pairs and find out which ones are adjacent.
A colleague of mine helped me (Thanks again!). I post the answer that works for me, in case it could be useful to someone else:
with adjacent_pairs as (
select
a.zone_id zone_id_1,
q.zone_id zone_id_2
from zonesid a
cross join lateral (
select zone_id
from zonesid b
where
st_dwithin(a.geom, b.geom, 0)
and a.zone_id != b.zone_id
) q
)
update zones a
set adjacent = 1
from adjacent_pairs b
where
a.zone1 = b.zone_id_1
and a.zone2 = b.zone_id_2;
update zones
set adjacent = 0
where adjacent is null;
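The shape of that solution (build the set of adjacent pairs first, then flag the matching rows and default the rest to 0) can be sketched in Python. The geometries here are hypothetical axis-aligned boxes, not the real MultiPolygons, and within_zero is an illustrative stand-in for st_dwithin(a.geom, b.geom, 0), which is true when two geometries touch or overlap.

```python
from itertools import product

# Hypothetical boxes (xmin, ymin, xmax, ymax) standing in for the real
# geometries; only 100 and 102 share an edge in this made-up layout.
zones_geom = {
    100: (0, 0, 1, 1),
    101: (5, 5, 6, 6),
    102: (1, 0, 2, 1),
}


def within_zero(a, b):
    """True when two boxes touch or overlap, i.e. their distance is 0."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Counterpart of the adjacent_pairs CTE: every ordered pair of distinct zones
# whose geometries are at distance 0.
adjacent_pairs = {
    (i, j)
    for i, j in product(zones_geom, repeat=2)
    if i != j and within_zero(zones_geom[i], zones_geom[j])
}

# Counterpart of the two UPDATE statements: 1 for adjacent pairs, else 0.
zones = [
    (z1, z2, 1 if (z1, z2) in adjacent_pairs else 0)
    for z1, z2 in product(sorted(zones_geom), repeat=2)
]

for row in zones:
    print(row)
```

Note that a distance of 0 also matches overlapping geometries, whereas ST_Touches from the question only matches geometries that share a boundary without overlapping interiors.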

Filter a sum of values until a certain threshold is reached

Stuck. Need SO :)
Consider the following distribution of values.
ID CNT SEC SHOW(Bool)
1 10 1
2 1 1
3 25 1
4 1 1
5 2 1
6 10 1
7 50 2
8 90 2
My goal is to filter by sec and then
sort by cnt ascending,
sort by id ascending
and then flag all rows where cnt is < 5 as show = false, and keep hiding further rows until the sum of cnt over all hidden rows (show = false) is >= 5.
So the sum of all "hidden" rows may never be < 5.
Expected outcome for sec=1:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 2 | 1 | 1 | false |
| 4 | 1 | 2 | false |
| 5 | 2 | 4 | false |
| 1 | 10 | 14 | false | -- The sum of all hidden rows before this point is 4
| 6 | 10 | 24 | true | -- The total of all hidden rows is now >= 5.
| 3 | 25 | 49 | true |
Expected outcome for sec=2:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 7 | 50 | 50 | true |
| 8 | 90 | 140 | true |
I can already sort the values and create the sums etc. What I have not figured out is how to determine the cutoff point after which hiding is no longer necessary.
I am already doing this in "client code" and I want to migrate it to sql.
Here LAG() will help to achieve what you want. You can write your query like below:
with cte as (
SELECT
id, cnt, sec,
sum(cnt) over (partition by sec order by cnt,id) sum_
FROM
tbl )
select
id, cnt, sum_,
case
when sum_<5 or lag(sum_) over (partition by sec order by cnt,id) <5 then 'false'
else
'true'
end as "show"
from cte
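The rule in that CASE expression can be traced in a few lines of Python: a row is hidden while the running total is still below 5, or while the running total before it was below 5. This is a sketch with illustrative names, using the sample rows from the question. For the first row of each sec there is no previous total (LAG() returns NULL, and NULL < 5 is not true), which the sketch models with None.

```python
from itertools import groupby

# (id, cnt, sec) rows from the question.
rows = [
    (1, 10, 1), (2, 1, 1), (3, 25, 1), (4, 1, 1),
    (5, 2, 1), (6, 10, 1), (7, 50, 2), (8, 90, 2),
]


def flag_rows(rows, threshold=5):
    """Return (sec, id, cnt, cnt_sum, show) per row, ordered by sec, cnt, id."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[2], r[1], r[0]))
    for sec, group in groupby(ordered, key=lambda r: r[2]):
        running = 0
        previous = None  # LAG(sum_) is NULL on the first row of a partition
        for rid, cnt, _ in group:
            running += cnt
            hidden = running < threshold or (
                previous is not None and previous < threshold
            )
            result.append((sec, rid, cnt, running, not hidden))
            previous = running
    return result


flagged = flag_rows(rows)
for row in flagged:
    print(row)
```

For sec = 1 this hides ids 2, 4, 5 and 1 and shows ids 6 and 3; for sec = 2 both rows are shown, matching the expected outcome tables in the question.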

Generate a histogram of values grouped by a column

I have the following data in a reviews table for certain set of items, using a score system that ranges from 0 to 100
+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1 | 1 | 90 |
+-----------+---------+-------+
| 2 | 1 | 40 |
+-----------+---------+-------+
| 3 | 1 | 10 |
+-----------+---------+-------+
| 4 | 2 | 90 |
+-----------+---------+-------+
| 5 | 2 | 90 |
+-----------+---------+-------+
| 6 | 2 | 70 |
+-----------+---------+-------+
| 7 | 3 | 80 |
+-----------+---------+-------+
| 8 | 3 | 80 |
+-----------+---------+-------+
| 9 | 3 | 80 |
+-----------+---------+-------+
| 10 | 3 | 80 |
+-----------+---------+-------+
| 11 | 4 | 10 |
+-----------+---------+-------+
| 12 | 4 | 30 |
+-----------+---------+-------+
| 13 | 4 | 50 |
+-----------+---------+-------+
| 14 | 4 | 80 |
+-----------+---------+-------+
I am trying to create a histogram of the score values with five bins. My goal is to generate one histogram per item. For the entire table this can be done with width_bucket, and the query can also be tuned to operate on a per-item basis:
SELECT item_id, g.n as bucket, COUNT(m.score) as count
FROM generate_series(1, 5) g(n) LEFT JOIN
review as m
ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;
However, the result looks like this:
+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1 | 5 | 1 |
+---------+--------+-------+
| 1 | 3 | 1 |
+---------+--------+-------+
| 1 | 1 | 1 |
+---------+--------+-------+
| 2 | 5 | 2 |
+---------+--------+-------+
| 2 | 4 | 2 |
+---------+--------+-------+
| 3 | 4 | 4 |
+---------+--------+-------+
| 4 | 1 | 1 |
+---------+--------+-------+
| 4 | 2 | 1 |
+---------+--------+-------+
| 4 | 3 | 1 |
+---------+--------+-------+
| 4 | 4 | 1 |
+---------+--------+-------+
That is, bins with no entries are not included. While I do not find this a bad solution, I would rather have all buckets, with 0 for those with no entries. Even better would be this structure:
+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1 | 1 | 0 | 1 | 0 | 1 |
+---------+----------+----------+----------+----------+----------+
| 2 | 0 | 0 | 0 | 2 | 2 |
+---------+----------+----------+----------+----------+----------+
| 3 | 0 | 0 | 0 | 4 | 0 |
+---------+----------+----------+----------+----------+----------+
| 4 | 1 | 1 | 1 | 1 | 0 |
+---------+----------+----------+----------+----------+----------+
I prefer this solution as it uses a row per item (instead of 5n), which is simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows:
select item_id,
(sum(case when score >= 0 and score <= 19 then 1 else 0 end)) as bucket_1,
(sum(case when score >= 20 and score <= 39 then 1 else 0 end)) as bucket_2,
(sum(case when score >= 40 and score <= 59 then 1 else 0 end)) as bucket_3,
(sum(case when score >= 60 and score <= 79 then 1 else 0 end)) as bucket_4,
(sum(case when score >= 80 and score <= 100 then 1 else 0 end)) as bucket_5
from review
group by item_id;
Even though this query satisfies my requirements, I am curious to see if there might be a more elegant approach. So many case expressions are not easy to read, and changes in the bin criteria might require updating every sum. I am also curious about any performance concerns this query might have.
The second query can be rewritten to use ranges to make editing and writing the query a bit easier:
with buckets (b1, b2, b3, b4, b5) as (
values (
int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 101)
)
)
select item_id,
count(*) filter (where b1 @> score) as bucket_1,
count(*) filter (where b2 @> score) as bucket_2,
count(*) filter (where b3 @> score) as bucket_3,
count(*) filter (where b4 @> score) as bucket_4,
count(*) filter (where b5 @> score) as bucket_5
from review
cross join buckets
group by item_id
order by item_id;
A range constructed with int4range(0, 20) includes the lower end and excludes the upper end, so the last bucket is written as int4range(80, 101) to keep a score of 100 in bucket_5.
The CTE named buckets only creates a single row, so the cross join does not change the number of rows from the review table.
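The half-open behaviour of the ranges is easy to check with a small Python model of the per-item counting. The names are illustrative and the input is just the three reviews of item 1 from the question. Each bucket is a (lower, upper) pair with the lower bound included and the upper bound excluded, like int4range(lower, upper).

```python
from collections import defaultdict

# (item_id, score) for item 1 in the question's sample data.
reviews = [(1, 90), (1, 40), (1, 10)]

# Half-open bounds [lower, upper); the last one ends at 101 so that a score
# of exactly 100 still falls into the fifth bucket.
bounds = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 101)]


def histogram(reviews, bounds):
    """Count scores per item and bucket, one list of counts per item."""
    counts = defaultdict(lambda: [0] * len(bounds))
    for item_id, score in reviews:
        for i, (lower, upper) in enumerate(bounds):
            if lower <= score < upper:
                counts[item_id][i] += 1
    return dict(counts)


per_item = histogram(reviews, bounds)
print(per_item)

# A last bucket ending at 100 would silently drop a score of exactly 100.
closed_at_100 = histogram([(9, 100)], bounds[:-1] + [(80, 100)])
closed_at_101 = histogram([(9, 100)], bounds)
print(closed_at_100, closed_at_101)
```

Changing a bin boundary then means editing one entry in the list of bounds, which is the same convenience the buckets CTE gives the SQL version.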
I found this post useful
CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
WITH
source AS (
SELECT * FROM %s
),
min_max AS (
SELECT min(%s) AS min, max(%s) AS max FROM source
),
temp_histogram AS (
SELECT
width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
count(%s) AS freq
FROM source, min_max
WHERE %s IS NOT NULL
GROUP BY bucket
ORDER BY bucket
)
SELECT
bucket,
"range",
freq::bigint,
repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
FROM temp_histogram',
table_name_or_subquery,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name
);
END
$func$ LANGUAGE plpgsql;
Adjust the number of buckets (100 in the script above) to suit your data.
Invoke it like this:
SELECT * FROM temp_histogram($table_name_or_subquery, $column_name);
Example:
SELECT * FROM temp_histogram('transactions_tbl', 'amount_colm');
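When reading the output it helps to know how width_bucket numbers its buckets: values in [low, high) go to buckets 1 to count, values below low go to bucket 0, and values at or above high go to bucket count + 1. Since the function passes the column maximum as the upper bound, the largest value ends up in bucket 101. The following Python function is a rough model of that numbering for low < high; it is a sketch for illustration, not PostgreSQL's implementation.

```python
def width_bucket(operand, low, high, count):
    """Model of equal-width bucket numbering for low < high.

    Returns 0 below `low`, count + 1 at or above `high`,
    otherwise a bucket number from 1 to count.
    """
    if operand < low:
        return 0
    if operand >= high:
        return count + 1
    return int((operand - low) * count // (high - low)) + 1


# Scores on a 0..100 scale split into five buckets of width 20.
print([width_bucket(s, 0, 100, 5) for s in (10, 40, 90)])

# The upper bound itself falls into the overflow bucket.
print(width_bucket(100, 0, 100, 5))
```

If the overflow bucket is unwanted, one option is to pass an upper bound slightly above the maximum value.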

Postgresql: Select sum with different conditions

I have two tables:
I. Table 1 like this:
------------------------------------------
codeid | pos | neg | category
-----------------------------------------
1 | 10 | 3 | begin2016
1 | 3 | 5 | justhere
3 | 7 | 7 | justthere
4 | 1 | 1 | else
4 | 12 | 0 | begin2015
4 | 5 | 12 | begin2013
1 | 2 | 50 | now
2 | 5 | 33 | now
5 | 33 | 0 | Begin2011
5 | 11 | 7 | begin2000
II. Table 2 like this:
------------------------------------------
codeid | codedesc | codegroupid
-----------------------------------------
1 | road runner | 1
2 | bike warrior | 2
3 | lazy driver | 4
4 | clever runner | 1
5 | worker | 3
6 | smarty | 1
7 | sweety | 3
8 | sweeper | 1
I want one result that satisfies two (or more) conditions:
sum pos and neg where codegroupid IN('1', '2', '3')
but do not sum pos and neg if category is like 'begin%'
So the result will look like this:
------------------------------------------
codeid | codedesc | sumpos | sumneg
-----------------------------------------
1 | road runner | 5 | 55 => (sumpos = 3+2; the 10 has a category like 'begin%', so it is not summed)
2 | bike warrior | 5 | 33
4 | clever runner | 1 | 1
5 | worker | 0 | 0 => (sumpos = sumneg = 0, because every category for codeid 5 is ilike 'begin%')
Group by codeid, codedesc;
Sumpos is sum(pos) where category NOT ILIKE 'begin%'; if category ILIKE 'begin%', that row's pos value counts as zero (0).
Sumneg is sum(neg) where category NOT ILIKE 'begin%'; if category ILIKE 'begin%', that row's neg value counts as zero.
Any ideas how to do it?
Try:
SELECT
b.codeid,
b.codedesc,
sum(CASE WHEN a.category ILIKE 'begin%' THEN 0 ELSE a.pos END) AS sumpos,
sum(CASE WHEN a.category ILIKE 'begin%' THEN 0 ELSE a.neg END) AS sumneg
FROM
table1 AS a
JOIN
table2 AS b ON a.codeid = b.codeid
WHERE b.codegroupid IN (1, 2, 3)
GROUP BY
b.codeid,
b.codedesc;
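The effect of the conditional sum can be checked against the expected result with a short Python model of the query. The data is copied from the two tables in the question and the function name is illustrative. The prefix test is case-insensitive, like ILIKE, which matters here because one category is spelled 'Begin2011'.

```python
# (codeid, pos, neg, category) rows from table 1.
table1 = [
    (1, 10, 3, "begin2016"), (1, 3, 5, "justhere"), (3, 7, 7, "justthere"),
    (4, 1, 1, "else"), (4, 12, 0, "begin2015"), (4, 5, 12, "begin2013"),
    (1, 2, 50, "now"), (2, 5, 33, "now"),
    (5, 33, 0, "Begin2011"), (5, 11, 7, "begin2000"),
]

# codeid -> (codedesc, codegroupid) from table 2.
table2 = {
    1: ("road runner", 1), 2: ("bike warrior", 2), 3: ("lazy driver", 4),
    4: ("clever runner", 1), 5: ("worker", 3), 6: ("smarty", 1),
    7: ("sweety", 3), 8: ("sweeper", 1),
}


def conditional_sums(table1, table2, groups=(1, 2, 3), prefix="begin"):
    """Sum pos/neg per codeid, counting rows whose category starts with
    `prefix` (case-insensitively) as zero."""
    sums = {}
    for codeid, pos, neg, category in table1:
        if codeid not in table2:
            continue  # inner join: skip rows without a match in table 2
        codedesc, groupid = table2[codeid]
        if groupid not in groups:
            continue  # WHERE codegroupid IN (1, 2, 3)
        entry = sums.setdefault(codeid, [codedesc, 0, 0])
        if not category.lower().startswith(prefix):
            entry[1] += pos
            entry[2] += neg
    return {codeid: tuple(v) for codeid, v in sorted(sums.items())}


totals = conditional_sums(table1, table2)
for codeid, row in totals.items():
    print(codeid, *row)
```

Rows with a 'begin...' category still create the group, so codeid 5 appears with 0 and 0 rather than disappearing, which is what putting the condition in CASE (instead of in WHERE) achieves.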