PostGIS minimum distance of two sets including other variables from both tables - postgresql

I have two tables (table1 and table2), each with three columns: id, value and geometry. The geometries are point features.
I want to join both tables so that the resulting table contains, for each point of table1, the minimum distance to a point of table2, together with the value of table1 and the value of the corresponding point of table2.
I tried the following code, but logically it gives, for each point of table1, the distance to every point of table2. However, I cannot leave v2 out of the GROUP BY clause. How can I get the table I want?
SELECT t1.value AS v1,
       t2.value AS v2,
       MIN(st_distance(t1.geometry, t2.geometry)) AS dis
FROM table1 t1, table2 t2
GROUP BY v1, v2

For simplicity I took integer values and their differences instead of the distance between points (but it works exactly the same way: just swap the subtraction for the st_distance function):
demo:db<>fiddle
SELECT DISTINCT ON (v1.point)
    v1.point,
    v2.point,
    abs(v1.point - v2.point)
FROM table1 v1
CROSS JOIN table2 v2
ORDER BY v1.point, abs(v1.point - v2.point)
My tables:
table1.point: 1, 2, 4, 8, 16
table2.point: 2, 3, 5, 7, 11, 13
The result:
| v1.point | v2.point | abs |
|----------|----------|-----|
| 1        | 2        | 1   |
| 2        | 2        | 0   |
| 4        | 3        | 1   |
| 8        | 7        | 1   |
| 16       | 13       | 3   |
Explanation:
You have to calculate all differences to know which one is the smallest; that's the reason for the CROSS JOIN. Then you can ORDER BY the points of table1 and the differences (or distances). Notice the abs() function: it makes all negative values positive. Otherwise a difference of -42 would be taken instead of +1.
DISTINCT ON (v1.point) takes the first ordered row for each v1.point.
Notice:
Because of the CROSS JOIN and the heavy mathematics in st_distance it could be really slow for huge data sets!
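Translated back into PostGIS terms, the same pattern should look roughly like this (a sketch, assuming the id, value and geometry columns from the question):
SELECT DISTINCT ON (t1.id)
    t1.value AS v1,
    t2.value AS v2,
    ST_Distance(t1.geometry, t2.geometry) AS dist
FROM table1 t1
CROSS JOIN table2 t2
ORDER BY t1.id, ST_Distance(t1.geometry, t2.geometry);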

Related

PostgreSQL how to generate a partition row_number() with certain numbers overridden

I have an unusual problem I'm trying to solve with SQL: I need to generate sequential numbers for partitioned rows but override specific numbers with values from the data, without breaking the sequence (unless the override causes a number to be used that is greater than the number of rows present).
I feel I might be able to achieve this by selecting the rows where I need to override the generated sequence value and the rows where I don't, unioning them together and somehow using coalesce to get the desired dynamically generated sequence value, or maybe there's some way to utilise a recursive CTE.
I've not been able to solve this problem yet, but I've put together a SQL Fiddle which provides a simplified version:
http://sqlfiddle.com/#!17/236b5/5
The desired_dynamic_number is what I'm trying to generate and the generated_dynamic_number is my current work-in-progress attempt.
Any pointers around the best way to achieve the desired_dynamic_number values dynamically?
Update:
I'm almost there using lag:
http://sqlfiddle.com/#!17/236b5/24
step-by-step demo:db<>fiddle
SELECT
    *,
    COALESCE(                                   -- 3
        first_value(override_as_number) OVER w  -- 2
        , 1
    )
    + row_number() OVER w - 1                   -- 4, 5
FROM (
    SELECT
        *,
        SUM(                                    -- 1
            CASE WHEN override_as_number IS NOT NULL THEN 1 ELSE 0 END
        ) OVER (PARTITION BY grouped_by ORDER BY secondary_order_by)
        AS grouped
    FROM sample
) s
WINDOW w AS (PARTITION BY grouped_by, grouped ORDER BY secondary_order_by)
Create a new subpartition within your partitions: this cumulative sum creates a unique group id for every group of records that starts with a non-NULL override_as_number followed by NULL records. So, for instance, your rows (AAA, d) to (AAA, f) belong to the same subpartition/group.
first_value() gives the first value of such a subpartition.
The COALESCE ensures a non-NULL result from the first_value() function if your partition starts with a NULL record.
row_number() - 1 creates a row count within a subpartition, starting with 0.
Adding the row count to the first_value() of a subpartition creates your result: the one non-NULL record at the start of a subpartition gets row count 0 and keeps its value, the first following NULL record gets +1, and so forth.
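For illustration, tracing the AAA partition of the sample data (as in the result table further down), where override_as_number is 1, NULL, 3, 3, NULL, NULL, 999 for rows a to g: the cumulative sum yields grouped = 1, 1, 2, 3, 3, 3, 4, so {a,b}, {c}, {d,e,f} and {g} are the subpartitions. Within {d,e,f}, first_value() is 3 and row_number() - 1 is 0, 1, 2, which gives 3, 4, 5.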
The query below gives the exact result, but you should verify it with all combinations:
select c.*, COALESCE(c.override_as_number, c.act) as final
from (
    select b.*,
           dense_rank() over (partition by grouped_by order by grouped_by, actual) as act
    from (
        select a.*,
               COALESCE(override_as_number, row_num) as actual
        from (
            select grouped_by, secondary_order_by,
                   dense_rank() over (partition by grouped_by order by grouped_by, secondary_order_by) as row_num,
                   override_as_number, desired_dynamic_number
            from fiddle
        ) a
    ) b
) c;
column "final" is the result
grouped_by | secondary_order_by | row_num | override_as_number | desired_dynamic_number | actual | act | final
------------+--------------------+---------+--------------------+------------------------+--------+-----+-------
AAA | a | 1 | 1 | 1 | 1 | 1 | 1
AAA | b | 2 | | 2 | 2 | 2 | 2
AAA | c | 3 | 3 | 3 | 3 | 3 | 3
AAA | d | 4 | 3 | 3 | 3 | 3 | 3
AAA | e | 5 | | 4 | 5 | 4 | 4
AAA | f | 6 | | 5 | 6 | 5 | 5
AAA | g | 7 | 999 | 999 | 999 | 6 | 999
XYZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | b | 2 | | 2 | 2 | 2 | 2
(10 rows)
Hope this helps!
The real world problem I was trying to solve did not have a nicely ordered secondary_order_by column, instead it would be something a bit more randomised (a created timestamp).
For the benefit of people who stumble across this question with a similar problem to solve, a colleague solved it using a cartesian join, and I'm posting their solution below. The solution is Snowflake SQL, which should be possible to adapt to Postgres. It does fall down on higher override_as_number values, though, unless the 1000 in from table(generator(rowcount => 1000)) is increased to something suitably high.
The SQL:
with tally_table as (
    select row_number() over (order by seq4()) as gen_list
    from table(generator(rowcount => 1000))
),
base as (
    select *,
           IFF(override_as_number IS NULL,
               row_number() OVER (PARTITION BY grouped_by, override_as_number order by random),
               override_as_number) as rownum
    from "SANDPIT"."TEST"."SAMPLEDATA"
    order by grouped_by, override_as_number, random
), --select * from base order by grouped_by,random;
cart_product as (
    select *
    from tally_table
    cross join (select distinct grouped_by from base) as distinct_grouped_by
), --select * from cart_product;
filter_product as (
    select *,
           row_number() OVER (partition by cart_product.grouped_by
                              order by cart_product.grouped_by, gen_list) as seq_order
    from cart_product
    where CONCAT(grouped_by, '~', gen_list) NOT IN
          (select concat(grouped_by, '~', override_as_number)
           from base
           where override_as_number is not null)
) --select * from try2 order by 2,3 ;
select base.grouped_by,
       base.random,
       base.override_as_number,
       base.answer, -- This is hard coded as test data
       IFF(override_as_number is null, gen_list, seq_order) as computed_answer
from base
inner join filter_product
        on base.rownum = filter_product.seq_order
       and base.grouped_by = filter_product.grouped_by
order by base.grouped_by,
         random;
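As a sketch of the Postgres adaptation mentioned above (an assumption on my part, not tested against the original data): Snowflake's table(generator(...)) with seq4() can be replaced by generate_series, and IFF(cond, a, b) by CASE WHEN cond THEN a ELSE b END:
-- tally_table in Postgres; 1000 is the same assumed upper bound as above
with tally_table as (
    select g as gen_list
    from generate_series(1, 1000) as g
)
select * from tally_table;
-- and e.g. IFF(override_as_number is null, gen_list, seq_order)
-- becomes: CASE WHEN override_as_number IS NULL THEN gen_list ELSE seq_order END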
In the end I went for a simpler solution using a temporary table and cursor to inject override_as_number values and shuffle other numbers.

How to calculate the nearest neighbor distance for 10000 points in a table

I am using PostgreSQL and I am using PostGIS extension.
I am able to compare one point with this query:
SELECT st_distance(geom, 'SRID=4326;POINT(12.601828337172 50.5173393068512)'::geometry) as d
FROM pointst1
ORDER BY d
but I want to compare not to one fixed point but to a column of points. And I want to do this with some sort of indexing so that it is computationally cheap and not 10000x10000 like a cross join within that table.
Create table:
create table pointst1
(
    id integer not null
        constraint pointst1_id_pk
            primary key,
    geom geometry(Point, 4326)
);

create unique index pointst1_id_uindex
    on pointst1 (id);

create index geomidx
    on pointst1 (geom);
Edit:
Refined query (it compares the 10000 points with their nearest neighbour, but it returns each point itself, with distance 0, instead of the next nearest point):
select points.*,
       p1.id as p1_id,
       ST_Distance(geography(p1.geom), geography(points.geom)) as distance
from (select distinct on (p2.geom) *
      from pointst1 p2
      where p2.id is not null) as points
cross join lateral
    (select id, geom
     from pointst1
     order by points.geom <-> geom
     limit 1) as p1;
Your query is already calculating the distance from the given geometry to all records in the table pointst1.
Considering these values ..
INSERT INTO pointst1 VALUES (1,'SRID=4326;POINT(16.19 48.21)'),
(2,'SRID=4326;POINT(18.96 47.50)'),
(3,'SRID=4326;POINT(13.47 52.52)'),
(4,'SRID=4326;POINT(-3.70 40.39)');
... if you run your query, it will already calculate the distance from all points in the table:
SELECT ST_Distance(geom, 'SRID=4326;POINT(12.6018 50.5173)'::geometry) as d
FROM pointst1
ORDER BY d
d
------------------
2.1827914536208
4.26600662563949
7.03781262396208
19.1914274750473
(4 rows)
Change your index to GIST, which is the most suitable for geometry data:
create index geomidx on pointst1 using GIST (geom);
Just note that an index won't speed up this query of yours, since you're doing a full scan. But as soon as you start playing more in the where clause, you might see some improvement.
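One query shape where the GiST index does help is a KNN-style nearest-neighbour search ordered by the <-> distance operator (a sketch, reusing the point from the question):
SELECT id,
       geom <-> 'SRID=4326;POINT(12.6018 50.5173)'::geometry AS d
FROM pointst1
ORDER BY geom <-> 'SRID=4326;POINT(12.6018 50.5173)'::geometry
LIMIT 1;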
EDIT:
WITH j AS (SELECT id AS id2, geom AS geom2 FROM pointst1)
SELECT id,j.id2,ST_Distance(geom, j.geom2) AS d
FROM pointst1,j
WHERE id <> j.id2
ORDER BY id,id2
id | id2 | d
----+-----+------------------
1 | 2 | 2.85954541841881
1 | 3 | 5.0965184194703
1 | 4 | 21.3720495039666
2 | 1 | 2.85954541841881
2 | 3 | 7.43911957156222
2 | 4 | 23.7492673571207
3 | 1 | 5.0965184194703
3 | 2 | 7.43911957156222
3 | 4 | 21.0225069865609
4 | 1 | 21.3720495039666
4 | 2 | 23.7492673571207
4 | 3 | 21.0225069865609
(12 rows)
Removing duplicate distances:
SELECT DISTINCT ON(d) * FROM (
WITH j AS (SELECT id AS id2, geom AS geom2 FROM pointst1)
SELECT id,j.id2,ST_Distance(geom, j.geom2) AS d
FROM pointst1,j
WHERE id <> j.id2
ORDER BY id,id2) AS j
id | id2 | d
----+-----+------------------
1 | 2 | 2.85954541841881
3 | 1 | 5.0965184194703
3 | 2 | 7.43911957156222
4 | 3 | 21.0225069865609
4 | 1 | 21.3720495039666
2 | 4 | 23.7492673571207
(6 rows)
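As for the refined query in the question returning each point itself: a minimal sketch of a fix (my assumption, not part of the original answer) is to exclude the point's own id inside the lateral subquery:
select p2.id,
       p1.id as nearest_id,
       ST_Distance(geography(p1.geom), geography(p2.geom)) as distance
from pointst1 p2
cross join lateral
    (select id, geom
     from pointst1
     where id <> p2.id  -- skip the point itself
     order by p2.geom <-> geom
     limit 1) as p1;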

Efficiently selecting from a large table using floor() in Postgres

I have two tables: one with grid squares, with columns x and y over the natural numbers, and another with points on the grid created by the first table. Example schema:
Grid Table
id | x | y
------------
123 | 1 | 1
234 | 1 | 2
345 | 2 | 1
456 | 2 | 2
Then, the points table:
id | x | y
----------------
12 | 1.23 | 1.23
23 | 2.89 | 1.55
Currently, using this query:
SELECT g.* FROM grid as g, points as p
WHERE p.id=23 AND floor(p.x)=g.x AND floor(p.y)=g.y;
I get the expected result, which is the grid square in which the point with id 23 resides (the grid row with id 345). However, when the grid table has 10,000,000 rows (the situation I'm currently in), this query is incredibly slow, on the order of a few seconds.
I've found a workaround for this, but it's ugly:
SELECT g.* FROM grid as g, points as p
WHERE p.id=23 AND (p.x-.5)::integer=g.x AND (p.y-.5)::integer=g.y;
I get the expected result again, and in 11ms, but this feels hacky. Are there cleaner ways to do this? Any help is appreciated!
You can use a CTE, as it is evaluated once only.
WITH p2 AS (
    select floor(p.x) as x,
           floor(p.y) as y
    from points p
    where p.id = 23
)
SELECT g.*
FROM grid g
INNER JOIN p2
    ON p2.x = g.x and p2.y = g.y
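Assuming grid.x and grid.y are integer columns (which the ::integer hack suggests), another option worth trying is to keep floor() but cast its result to integer, so the comparison is integer-to-integer and an index on grid (x, y) can be used. This is a sketch, not verified against the original data:
SELECT g.*
FROM grid AS g, points AS p
WHERE p.id = 23
  AND floor(p.x)::integer = g.x
  AND floor(p.y)::integer = g.y;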

PostgreSQL - How to get the previous(lag) calculated value

I would like to get the previous (lag) calculated value.
id | value
-------|-------
1 | 1
2 | 3
3 | 5
4 | 7
5 | 9
What I am trying to achieve is this:
id | value | new value
-------|-------|-----------
1 | 1 | 10 <-- 1 * lag(new_value)
2 | 3 | 30 <-- 3 * lag(new_value)
3 | 5 | 150 <-- 5 * lag(new_value)
4 | 7 | 1050 <-- 7 * lag(new_value)
5 | 9 | 9450 <-- 9 * lag(new_value)
What I have tried:
SELECT value,
COALESCE(lag(new_value) OVER () * value, 10) AS new_value
FROM table
Error:
ERROR: column "new_value" does not exist
Similar to Juan's answer but I thought I'd post it anyway. It at least avoids the need for the ID column and doesn't have the empty row at the end:
with recursive all_data as (
    select value, value * 10 as new_value
    from data
    where value = 1

    union all

    select c.value,
           c.value * p.new_value
    from data c
    join all_data p on p.value < c.value
    where c.value = (select min(d.value)
                     from data d
                     where d.value > p.value)
)
select *
from all_data
order by value;
The idea is to join exactly one row in the recursive part to exactly one "parent" row. The "exactly one parent" can be done with a derived table and a lateral join (which, surprisingly, does allow a limit), but the "exactly one row" from the "child" in the recursive part can unfortunately only be done using the sub-select with min().
The where c.value = (...) wouldn't be necessary if it were possible to use an order by and limit in the recursive part as well, but unfortunately that is not supported in the current Postgres version.
Online example: http://rextester.com/WFBVM53545
My bad, this isn't as easy as I thought. I got a very close result but it still needs some tuning.
DEMO
WITH RECURSIVE t(n, v) AS (
    SELECT MIN(value), 10
    FROM Table1

    UNION ALL

    SELECT (SELECT min(value) FROM Table1 WHERE value > n),
           (SELECT min(value) FROM Table1 WHERE value > n) * v
    FROM t
    JOIN Table1 ON t.n = Table1.value
)
SELECT n, v
FROM t;
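One possible bit of that tuning (my assumption, not part of the original answer): the last iteration produces a row where both sub-selects return NULL, so filtering it out in the final SELECT removes the trailing empty row:
SELECT n, v
FROM t
WHERE n IS NOT NULL;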

Calculate length of a series of line segments

I have a table like the following:
X | Y | Z | node
----------------
1 | 2 | 3 | 100
2 | 2 | 3 |
2 | 2 | 4 |
2 | 2 | 5 | 200
3 | 2 | 5 |
4 | 2 | 5 |
5 | 2 | 5 | 300
X, Y and Z are the 3D coordinates of some points; a curve passes through all the corresponding points from the first row to the last row. I need to calculate the curve length between two adjacent points whose "node" column isn't null.
It would be great if I could directly insert the result into another table that has three columns: "first_node", "second_node", "curve_length".
I don't need to interpolate extra points into the curve, I just need to accumulate the lengths of all the straight lines. For example, to calculate the curve length between nodes 100 and 200, I need to sum the lengths of 3 straight lines: (1,2,3)<->(2,2,3), (2,2,3)<->(2,2,4), (2,2,4)<->(2,2,5).
EDIT
The table has an ID column, which is in increasing order from the first row to the last row.
To get a previous value in SQL, use the lag window function, e.g.
SELECT
x,
lag(x) OVER (ORDER BY id) as prev_x, ...
FROM ...
ORDER BY id;
That lets you get the previous and next points in 3-D space for a given segment. From there you can trivially calculate the line segment length using regular geometric maths.
You'll now have the lengths of each segment (sqlfiddle query). You can use this as input into other queries, using SELECT ... FROM (SELECT ...) subqueries or a CTE (WITH ....) term.
It turns out to be pretty awkward to go from the per-segment lengths to node-to-node lengths. You need to group rows across the null entries, using a recursive CTE or a window function.
I landed up with this monstrosity:
SELECT
    array_agg(from_id) AS seg_ids,
    -- 'max' is used here like 'coalesce' for an aggregate,
    -- since non-null is greater than null
    max(from_node) AS from_node,
    max(to_node) AS to_node,
    sum(seg_length) AS seg_length
FROM (
    -- lengths of all sub-segments with the null last segment
    -- removed and a partition counter added
    SELECT
        *,
        -- A running counter that increments when the
        -- node ID changes. Allows us to group by series
        -- of nodes in the outer query.
        sum(CASE WHEN from_node IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY from_id) AS partition_id
    FROM (
        -- lengths of all sub-segments
        SELECT
            id AS from_id,
            lead(id, 1) OVER (ORDER BY id) AS to_id,
            -- length of sub-segment
            sqrt(
                (x - lead(x, 1) OVER (ORDER BY id)) ^ 2 +
                (y - lead(y, 1) OVER (ORDER BY id)) ^ 2 +
                (z - lead(z, 1) OVER (ORDER BY id)) ^ 2
            ) AS seg_length,
            node AS from_node,
            lead(node, 1) OVER (ORDER BY id) AS to_node
        FROM
            Table1
    ) sub
    -- filter out the last row
    WHERE to_id IS NOT NULL
) seglengths
-- Group into series of sub-segments between two nodes
GROUP BY partition_id;
Credit to How do I efficiently select the previous non-null value? for the partition trick.
Result:
seg_ids | from_node | to_node | seg_length
---------+-----------+---------+------------
{1,2,3} |       100 |     200 |          3
{4,5,6} |       200 |     300 |          3
(2 rows)
To insert directly into another table, use INSERT INTO ... SELECT ....
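A sketch of that last step, using a hypothetical target table named node_lengths with the columns from the question:
-- 'node_lengths' is an assumed name for the target table
INSERT INTO node_lengths (first_node, second_node, curve_length)
SELECT from_node, to_node, seg_length
FROM (
    -- paste the full GROUP BY query from above here
    ...
) AS node_segments;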