Postgres - Find pairs of neighbouring polygons

I have a list of IDs (polygons), and a table (zones) contains all possible permutations of these IDs. Another table (zonesid) holds their corresponding geometries [geometry(MultiPolygon,4326)].
table zones:
 index | zone1 | zone2
-------+-------+-------
     0 |   100 |   100
     1 |   100 |   101
     2 |   100 |   102
     3 |   101 |   100
     4 |   101 |   101
     5 |   101 |   102
     6 |   102 |   100
     7 |   102 |   101
     8 |   102 |   102
table zonesid:
 index | zone_id | geom
-------+---------+---------
     0 |     100 | geom100
     1 |     101 | geom101
     2 |     102 | geom102
Now I need to find which zones are adjacent and write a 1 next to each such pair.
I've read the question Finding neighbouring polygons - postgis query and I think I need something similar, although in my case the result must indicate the exact pair.
Suppose that in the example above only 100 and 102 are adjacent. The result should be:
table zones:
 index | zone1 | zone2 | adjacent
-------+-------+-------+----------
     0 |   100 |   100 |        0
     1 |   100 |   101 |        0
     2 |   100 |   102 |        1
     3 |   101 |   100 |        0
     4 |   101 |   101 |        0
     5 |   101 |   102 |        0
     6 |   102 |   100 |        1
     7 |   102 |   101 |        0
     8 |   102 |   102 |        0
I've started with:
ALTER TABLE zones
    ADD COLUMN adjacent bigint;

UPDATE zones SET adjacent = 1, time = 2
FROM (
    SELECT (*)
    FROM zonesid AS a,
         zonesid AS b,
         zones AS c,
         zones AS d
    WHERE ST_Touches(a.geom, b.geom)
      AND c.zone1 != d.zone2
) AS subquery
WHERE c.zone1 = subquery.zoneid
But I'm struggling to work out how to refer correctly to the table zonesid so that I can compare the pairs and then identify which ones they are.

A colleague of mine helped me (thanks again!). I am posting the answer that works for me, in case it is useful to someone else:
with adjacent_pairs as (
    select
        a.zone_id zone_id_1,
        q.zone_id zone_id_2
    from zonesid a
    cross join lateral (
        select zone_id
        from zonesid b
        where
            st_dwithin(a.geom, b.geom, 0)
            and a.zone_id != b.zone_id
    ) q
)
update zones a
set adjacent = 1
from adjacent_pairs b
where
    a.zone1 = b.zone_id_1
    and a.zone2 = b.zone_id_2;

update zones
set adjacent = 0
where adjacent is null;
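For reference, a more compact single-statement variant (a sketch only, assuming the zones/zonesid schema above; note that ST_Touches, unlike ST_DWithin(…, 0), returns false when the interiors overlap, which also keeps a zone from being marked adjacent to itself):
UPDATE zones z
SET adjacent = CASE
    WHEN EXISTS (
        -- look up the geometries of both zones in the pair
        SELECT 1
        FROM zonesid a
        JOIN zonesid b ON ST_Touches(a.geom, b.geom)
        WHERE a.zone_id = z.zone1
          AND b.zone_id = z.zone2
    ) THEN 1
    ELSE 0
END;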

Related

Check in PGSQL if two ids match in two ways

I have two user ids: ID_User1 and ID_User2. I would like to check in my table whether there is a duplicate entry where ID_User1 matches ID_User2 and ID_User2 matches ID_User1, like this:
1 matches with 2
1 matches with 2
but also
2 matches with 1
I currently do this:
SELECT *
FROM (
    SELECT *,
           COUNT(*) OVER (PARTITION BY "IDWU_User1", "IDWU_User2") AS count
    FROM "WU_MatchingUsers"
) tableWithCount
WHERE tableWithCount.count > 1;
and here is the result :
 id | IDWU_User1 | IDWU_User2 | MatchingScore | count
----+------------+------------+---------------+-------
  1 |          1 |          2 |            39 |     2
 46 |          1 |          2 |            35 |     2
  2 |          1 |          3 |            41 |     2
 47 |          1 |          3 |            35 |     2
But I would like these results:
 id | IDWU_User1 | IDWU_User2 | MatchingScore | count
----+------------+------------+---------------+-------
  1 |          1 |          2 |            39 |     3
 46 |          1 |          2 |            35 |     3
 48 |          2 |          1 |            35 |     3
  2 |          1 |          3 |            41 |     2
 47 |          1 |          3 |            35 |     2
I want the row in the middle to be counted as well, i.e. also check whether there is a duplicate row matching in the direction B > A, and not only A > B.
Kind regards!
Make it a CASE expression on whether id1 is greater than id2.
I used primes to combine the ids; string concatenation with a delimiter would do, too:
SELECT *
FROM (
    SELECT *,
           COUNT(*) OVER (
               PARTITION BY CASE
                   WHEN IDWU_User1 > IDWU_User2 THEN IDWU_User1*2 + IDWU_User2*3
                   ELSE IDWU_User1*3 + IDWU_User2*2
               END
           ) AS count
    FROM WU_MatchingUsers
) tableWithCount
WHERE tableWithCount.count > 1
Result: db-fiddle
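As an alternative sketch that avoids encoding the pair arithmetically (the prime trick above can collide for some id combinations), you can partition by the pair normalized with LEAST/GREATEST, assuming the quoted column names from the question:
SELECT *
FROM (
    SELECT *,
           -- (LEAST, GREATEST) is identical for (1,2) and (2,1),
           -- so both directions land in the same partition
           COUNT(*) OVER (
               PARTITION BY LEAST("IDWU_User1", "IDWU_User2"),
                            GREATEST("IDWU_User1", "IDWU_User2")
           ) AS count
    FROM "WU_MatchingUsers"
) t
WHERE t.count > 1;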

Aggregate at either of two levels

In Tableau, I am joining two tables where a header can have multiple details:
Work Order Header
Work Order Details
The joined data looks like this:
Header.ID | Header.ManualTotal | Details.ID | Details.LineTotal
----------+--------------------+------------+-------------------
A         |               1000 |          1 |               550
A         |               1000 |          2 |                35
A         |               1000 |          3 |               100
B         |                335 |          1 |               250
B         |                335 |          2 |               300
C         |               null |          1 |                50
C         |               null |          2 |                25
C         |               null |          3 |                 5
C         |               null |          4 |                 5
Where there is a manual total, use that; if there is no manual total, use the sum of the line totals. The desired result:
ID | Total
---+------
A  | 1000
B  |  335
C  |   85
I tried something like this:
ifnull( sum({fixed [Header ID] : [Manual Total] }), sum([Line Total]) )
Basically I need to use the ifnull, then take the manual total if it exists, or sum the line totals if it doesn't.
Please advise on how to use LODs or some other solution to get the correct answer.
Here is a solution that does not require a level-of-detail calculation. Just try this:
1. Use an inner join on the id of the two tables.
2. Create this calculation: ifnull(median([Manual Total]), sum([Line Total]))
3. Insert agg(your_calculation) into your sheet.
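For comparison, the same fallback logic expressed in plain SQL rather than Tableau (a sketch only; the table and column names work_order_header, work_order_details, manual_total, and line_total are hypothetical):
-- COALESCE picks the manual total when present,
-- otherwise falls back to the summed line totals
SELECT h.id,
       COALESCE(h.manual_total, SUM(d.line_total)) AS total
FROM work_order_header h
JOIN work_order_details d ON d.header_id = h.id
GROUP BY h.id, h.manual_total;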

How to Calculate Median Price Per Unit Using PERCENTILE_CONT and GROUP BY id

I'm using Postgres 9.5 and trying to calculate the median and average price per unit with a GROUP BY id. Here is the query in DBFIDDLE.
Here is the data:
 id | price | units
----+-------+-------
  1 |   100 |    15
  1 |    90 |    10
  1 |    50 |     8
  1 |    40 |     8
  1 |    30 |     7
  2 |   110 |    22
  2 |    60 |     8
  2 |    50 |    11
Using percentile_cont this is my query:
SELECT id,
       ceil(avg(price)) AS avg_price,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS median_price,
       ceil(sum(price) / sum(units)) AS avg_pp_unit,
       ceil(percentile_cont(0.5) WITHIN GROUP (ORDER BY price) /
            percentile_cont(0.5) WITHIN GROUP (ORDER BY units)) AS median_pp_unit
FROM t
GROUP BY id
This query returns:
 id | avg_price | median_price | avg_pp_unit | median_pp_unit
----+-----------+--------------+-------------+----------------
  1 |        62 |           50 |           6 |              7
  2 |        74 |           60 |           5 |              5
I'm pretty sure the average calculation is correct. Is this the correct way to calculate median price per unit?
This post suggests it is correct (although performance is poor), but I'm curious whether the division in the median calculation could skew the result.
Calculating median with PERCENTILE_CONT and grouping
The median is the value separating the higher half from the lower half of a data sample (a population or a probability distribution). For a data set, it may be thought of as the "middle" value.
https://en.wikipedia.org/wiki/Median
So your median price is 55, and the median units is 9
Sort by price                        Sort by units
 id | price | units                   id | price | units
----+-------+-------                 ----+-------+-------
  1 |    30 |     7                    1 |    30 |     7
  1 |    40 |     8                    1 |    40 |     8
  1 |    50 |     8                    1 |    50 |     8
  2 |    50 |    11  <<<               2 |    60 |     8  <<<
  2 |    60 |     8  <<<               1 |    90 |    10  <<<
  1 |    90 |    10                    2 |    50 |    11
  1 |   100 |    15                    1 |   100 |    15
  2 |   110 |    22                    2 |   110 |    22

median price = (50+60)/2 = 55        median units = (8+10)/2 = 9
I'm unsure what you intend for "median price per unit":
CREATE TABLE t (
    id    INTEGER NOT NULL,
    price INTEGER NOT NULL,
    units INTEGER NOT NULL
);

INSERT INTO t(id, price, units) VALUES (1,  30,  7);
INSERT INTO t(id, price, units) VALUES (1,  40,  8);
INSERT INTO t(id, price, units) VALUES (1,  50,  8);
INSERT INTO t(id, price, units) VALUES (2,  50, 11);
INSERT INTO t(id, price, units) VALUES (2,  60,  8);
INSERT INTO t(id, price, units) VALUES (1,  90, 10);
INSERT INTO t(id, price, units) VALUES (1, 100, 15);
INSERT INTO t(id, price, units) VALUES (2, 110, 22);

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS med_price,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY units) AS med_units
FROM t;
 med_price | med_units
-----------+-----------
        55 |         9
If column "price" represents a "unit price" then you don't need to divide 55 by 9, but if "price" is an "order total" then you would need to divide by units: 55/9 = 6.11

Groupwise select nth row Postgres

I have a problem that falls into the "greatest-n-per-group" category, but with a slight twist. I have a table along the lines of the following:
| t_id | t_amount | b_id | b_amount |
|------|----------|------|----------|
|    1 |       50 |    7 |       50 |
|    1 |       50 |   15 |       50 |
|    1 |       50 |   80 |       50 |
|    3 |       50 |    7 |       50 |
|    3 |       50 |   15 |       50 |
|    3 |       50 |   80 |       50 |
|   17 |       50 |    7 |       50 |
|   17 |       50 |   15 |       50 |
|   17 |       50 |   80 |       50 |
What I'd like to do is essentially partition this table by t_id and then select the first row of the first partition, the second row of the second partition, and the third row of the third partition, with the results looking like this:
| t_id | t_amount | b_id | b_amount |
|------|----------|------|----------|
|    1 |       50 |    7 |       50 |
|    3 |       50 |   15 |       50 |
|   17 |       50 |   80 |       50 |
It seems like a window function or something with distinct on might do the trick, but I haven't yet put it together.
I'm using Postgres 10 on a *nix system.
Using the window functions dense_rank and row_number would do it:
https://www.postgresql.org/docs/10/static/functions-window.html
Solution: db<>fiddle
SELECT t_id,
       t_amount,
       b_id,
       b_amount
FROM (
    SELECT *,
           dense_rank() OVER (ORDER BY t_id) AS group_number,           -- A
           row_number() OVER (PARTITION BY t_id ORDER BY t_id, b_id)
               AS row_number_in_group                                   -- B
    FROM test_data
) s
WHERE group_number = row_number_in_group
A: dense_rank assigns one number per group (the groups being defined by t_id), so every t_id gets its own group number.
B: row_number counts the rows within a given partition.
I illustrate the result of the subquery here:
t_id  t_amount  b_id  b_amount  dense_rank  row_number
----  --------  ----  --------  ----------  ----------
   1        50     7        50           1           1
   1        50    15        50           1           2
   1        50    80        50           1           3
   3        50     7        50           2           1
   3        50    15        50           2           2
   3        50    80        50           2           3
  17        50     7        50           3           1
  17        50    15        50           3           2
  17        50    80        50           3           3
Now you have to filter where group number equals row number within the group and you get your expected result.
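If you want to try this locally, a minimal setup reproducing the sample data might look like this (the table name test_data is taken from the query above; the column types are assumptions):
CREATE TABLE test_data (t_id int, t_amount int, b_id int, b_amount int);

-- Every combination of the three t_ids and the three b_ids,
-- with constant amounts as in the question
INSERT INTO test_data
SELECT t.t_id, 50, b.b_id, 50
FROM (VALUES (1), (3), (17)) AS t(t_id)
CROSS JOIN (VALUES (7), (15), (80)) AS b(b_id);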

Generate a histogram of values grouped by a column

I have the following data in a reviews table for a certain set of items, using a score system that ranges from 0 to 100:
+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
|         1 |       1 |    90 |
|         2 |       1 |    40 |
|         3 |       1 |    10 |
|         4 |       2 |    90 |
|         5 |       2 |    90 |
|         6 |       2 |    70 |
|         7 |       3 |    80 |
|         8 |       3 |    80 |
|         9 |       3 |    80 |
|        10 |       3 |    80 |
|        11 |       4 |    10 |
|        12 |       4 |    30 |
|        13 |       4 |    50 |
|        14 |       4 |    80 |
+-----------+---------+-------+
I am trying to create a histogram of the score values with five bins. My goal is to generate a histogram per item. To create a histogram of the entire table, it is possible to use the width_bucket function, and this can also be tuned to operate on a per-item basis:
SELECT item_id, g.n AS bucket, COUNT(m.score) AS count
FROM generate_series(1, 5) g(n)
LEFT JOIN review AS m
       ON width_bucket(score, 0, 100, 5) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;
However, the result looks like this:
+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
|       1 |      5 |     1 |
|       1 |      3 |     1 |
|       1 |      1 |     1 |
|       2 |      5 |     2 |
|       2 |      4 |     2 |
|       3 |      4 |     4 |
|       4 |      1 |     1 |
|       4 |      2 |     1 |
|       4 |      3 |     1 |
|       4 |      4 |     1 |
+---------+--------+-------+
That is, bins with no entries are not included. While this is not a bad solution, I would rather have all buckets present, with 0 for those with no entries. Even better would be this structure:
+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
|       1 |        1 |        0 |        1 |        0 |        1 |
|       2 |        0 |        0 |        0 |        2 |        2 |
|       3 |        0 |        0 |        0 |        4 |        0 |
|       4 |        1 |        1 |        1 |        1 |        0 |
+---------+----------+----------+----------+----------+----------+
I prefer this structure, as it uses one row per item (instead of 5n rows), which is simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows:
select item_id,
       sum(case when score >= 0  and score <= 19  then 1 else 0 end) as bucket_1,
       sum(case when score >= 20 and score <= 39  then 1 else 0 end) as bucket_2,
       sum(case when score >= 40 and score <= 59  then 1 else 0 end) as bucket_3,
       sum(case when score >= 60 and score <= 79  then 1 else 0 end) as bucket_4,
       sum(case when score >= 80 and score <= 100 then 1 else 0 end) as bucket_5
from review
group by item_id;
Even though this query satisfies my requirements, I am curious to see whether there might be a more elegant approach: so many CASE statements are not easy to read, and changes in the bin criteria would require updating every sum. I am also curious about the potential performance concerns of this query.
The second query can be rewritten to use ranges to make editing and writing the query a bit easier:
with buckets (b1, b2, b3, b4, b5) as (
    values (
        int4range(0, 20), int4range(20, 40), int4range(40, 60),
        int4range(60, 80), int4range(80, 100)
    )
)
select item_id,
       count(*) filter (where b1 @> score) as bucket_1,
       count(*) filter (where b2 @> score) as bucket_2,
       count(*) filter (where b3 @> score) as bucket_3,
       count(*) filter (where b4 @> score) as bucket_4,
       count(*) filter (where b5 @> score) as bucket_5
from review
cross join buckets
group by item_id
order by item_id;
A range constructed with int4range(0,20) includes the lower end and excludes the upper end.
The CTE named buckets only creates a single row, so the cross join does not change the number of rows from the review table.
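If you prefer the long format but want empty bins kept, one sketch is to generate the item/bucket grid first and left join the reviews onto it (this assumes 5 bins of width 20 as in the question; note that width_bucket puts a score of exactly 100 into the overflow bucket 6, hence the least(...) clamp):
SELECT i.item_id,
       g.bucket,
       COUNT(r.score) AS count
FROM (SELECT DISTINCT item_id FROM review) i
CROSS JOIN generate_series(1, 5) AS g(bucket)
LEFT JOIN review r
       ON r.item_id = i.item_id
      -- clamp the overflow bucket so score = 100 counts as bucket 5
      AND least(width_bucket(r.score, 0, 100, 5), 5) = g.bucket
GROUP BY i.item_id, g.bucket
ORDER BY i.item_id, g.bucket;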
I found this post useful:
CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
    RETURN QUERY EXECUTE format('
        WITH
        source AS (
            SELECT * FROM %s
        ),
        min_max AS (
            SELECT min(%s) AS min, max(%s) AS max FROM source
        ),
        temp_histogram AS (
            SELECT
                width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
                numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
                count(%s) AS freq
            FROM source, min_max
            WHERE %s IS NOT NULL
            GROUP BY bucket
            ORDER BY bucket
        )
        SELECT
            bucket,
            "range",
            freq::bigint,
            repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
        FROM temp_histogram',
        table_name_or_subquery,
        column_name, column_name, column_name, column_name,
        column_name, column_name, column_name
    );
END
$func$ LANGUAGE plpgsql;
Adjust the number of buckets (100 in the script above) to your needs.
Invoke it like this:
SELECT * FROM temp_histogram($table_name_or_subquery, $column_name);
Example:
SELECT * FROM temp_histogram('transactions_tbl', 'amount_colm');