ST_ClusterDBSCAN function and minpoints parameter definition - postgresql

I've spent the last 2 days trying to figure out what's wrong with my clustering query. It seemed to work correctly at first, but after some more detailed testing I was confused to see that some clusters were not created even though they evidently should have been.
So initially I was assuming that:
ST_ClusterDBSCAN(geom, eps := 1, minpoints := 4) OVER(PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type))
would cluster points that share the same "PARTITION BY" attributes, and that a minimum group would need 4 points (including the core point) within eps distance. According to the docs:
https://postgis.net/docs/ST_ClusterDBSCAN.html
An input geometry will be added to a cluster if it is either:
A "core" geometry, that is within eps distance of at least minpoints input geometries (including itself) or
A "border" geometry, that is within eps distance of a core geometry. its surrounding area with radius eps.
But that does not seem to be exactly true. It seems that in order to cluster 4 points (as set by the minpoints parameter), the grouping query:
OVER(PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type))
needs to return at least five objects for a clst_id to be created for the other four. Here is an example:
CREATE TABLE IF NOT EXISTS public.point_table
(
point_sys_id serial primary key,--System generated Primary Key - Asset Cache
point_id bigint,
geom geometry(Point,4326),--Geometry Field
country_code varchar(4),--Country Code
elevation_ft numeric(7,2),--Elevation in Feet
height_ft numeric(7,2),--Height in Feet
obstacle_type varchar(50)--Obstacle Type
);
INSERT INTO point_table(point_id, geom, country_code, elevation_ft, height_ft, obstacle_type)
VALUES
(1, '0101000020E6100000E4141DC9E5934B40D235936FB6193940', 'ARE', 100, 50, 'BUILDING'),
(2, '0101000020E6100000C746205ED7934B40191C25AFCE193940', 'ARE', 100, 50, 'BUILDING'),
(3, '0101000020E6100000C780ECF5EE934B40B6BE4868CB193940', 'ARE', 100, 50, 'BUILDING'),
(4, '0101000020E6100000A97A358FA5AF4B4074A0C65B724C3940', 'ARE', 100, 50, 'BUILDING'), -- this point is outside of the cluster distance (eps)
(5, '0101000020E6100000ABB2EF8AE0934B404451A04FE4193940', 'ARE', 100, 50, 'BUILDING');
select ST_ClusterDBSCAN(geom, eps := 0.000906495804256269, minpoints := 4) OVER(PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type)) as clst_id,
point_id, geom, country_code, elevation_ft, height_ft, obstacle_type
from point_table
--where point_id != 4
Running the clustering query against all five points works fine. But once you exclude the seemingly irrelevant point_id = 4 (which is outside of the eps distance anyway), the clustering stops working (clst_id becomes null), even though the 4 points needed (according to the docs) are still in place.
Once I change the minpoints parameter to 3, clustering works fine for those 4 neighboring points.
Can someone confirm my conclusion that ST_ClusterDBSCAN does not behave correctly, or give a good explanation for this behavior?
EDIT:
I've submitted a ticket to PostGIS directly: https://trac.osgeo.org/postgis/ticket/4853
and it seems that this has been fixed as of version 3.1 :)
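For anyone hitting this on an older install, a quick check (a hedged sketch reusing the example table above) is to look at the library version and rerun the query without point_id = 4; on PostGIS 3.1+ the four remaining points should receive a clst_id with minpoints := 4, per the linked ticket:
-- which PostGIS library version is actually loaded
SELECT PostGIS_Lib_Version();

-- the failing case from above; expected to cluster on 3.1+
SELECT point_id,
       ST_ClusterDBSCAN(geom, eps := 0.000906495804256269, minpoints := 4)
           OVER (PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type)) AS clst_id
FROM point_table
WHERE point_id != 4;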

Related

Calculating a score for each road in OpenStreetMap produces unexpected results. What am I missing?

I have a Postgres database with the PostGIS extension installed, filled with OpenStreetMap data.
With the following SQL statement:
SELECT
l.osm_id,
sum(
st_area(st_intersection(ST_Buffer(l.way, 30), p.way))
/
st_area(ST_Buffer(l.way, 30))
) as green_fraction
FROM planet_osm_line AS l
INNER JOIN planet_osm_polygon AS p ON ST_Intersects(l.way, ST_Buffer(p.way,30))
WHERE p.natural in ('water') or p.landuse in ('forest') GROUP BY l.osm_id;
I calculate a "green" score.
My goal is to create a "green" score for each osm_id.
Which means; how much of a road is near a water, forrest or something similar.
For example a road that is enclosed by a park would have a score of 1.
A road that only runs by a river for a short period of time would have a score of for example 0.4
OR so is my expectation.
But when inspecting the result of this calculation I sometimes get values like
212.11701212511463 for the road with OSM ID -647522
and 82 for the road with OSM ID -6497265.
I do get values between 0 and 1 too, but I don't understand why I also get such huge values.
What am I missing?
I was expecting values between 0 and 1.
The sum in your query adds one fraction per intersecting polygon, so wherever several qualifying polygons overlap the same stretch of buffer, that area is counted more than once, which is how the score can exceed 1. Using a custom unique ID that you must populate, the query can instead union the possibly overlapping polygons before measuring the area:
SELECT
l.uid,
st_area(
ST_UNION(
st_intersection(ST_Buffer(l.way, 30), p.way))
) / st_area(ST_Buffer(l.way, 30)) as green_fraction
FROM planet_osm_line AS l
INNER JOIN planet_osm_polygon AS p
ON st_dwithin(l.way, p.way,30)
WHERE p.natural in ('water') or p.landuse in ('forest')
GROUP BY l.uid;
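In case it helps: osm_id is typically not unique in planet_osm_line (osm2pgsql can split long ways), which is presumably why the answer relies on a custom unique ID. One possible way to populate such a uid column (an assumption for illustration, not part of the original answer) is:
-- add an auto-filled identifier column; bigserial creates the sequence,
-- sets it as the default and fills the existing rows
ALTER TABLE planet_osm_line ADD COLUMN uid bigserial;
-- optional, but makes the grouping key explicit and fast to look up
CREATE UNIQUE INDEX planet_osm_line_uid_idx ON planet_osm_line (uid);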

Why does ST_Transform fail when transforming to 4326?

I have a PostGIS table with some data which looks like this:
Row 1:
distance: 2.088918198132633
id: 9105
nr: 1
X: 7.59833269418573
Y: 50.3171094720011
description: valide with result
strecke: 3507
point: POINT (3400245.168425543 5576618.697108934)
gsal_strecke: 3507
rikz: 2.0
gsal_km_anf_p: 123905.310
gsal_km_end_p: 123945.537
gsal_geo_compound: LINESTRING (3400253.52199817 5576605.02429315, 3400251.48999817 5576609.59229315, 3400247.37399817 5576618.87129315, 3400243.28599817 5576628.16129316, 3400239.24399817 5576637.47229316, 3400237.27599817 5576642.06929316)
rk: 1
kilometer: 123.92110139140328
basepoint: POINT (3400247.0804134225 5576619.538465925)

Row 2:
distance: 4.601389947759623
id: 9106
nr: 611171
X: 8.83478109
Y: 54.7923646
description: crash
strecke: 1201
point: POINT (3489442.4653895213 6073687.653162317)
gsal_strecke: 1201
rikz: 0.0
gsal_km_anf_p: 162291.691
gsal_km_end_p: 162329.922
gsal_geo_compound: LINESTRING (3489446.77287361 6073662.83069441, 3489447.226 6073701.05844041)
rk: 1
kilometer: 162.31646103819043
basepoint: POINT (3489447.066456252 6073687.598624319)
This table holds the end result of some calculations. In short, a collection of points is inserted into the database and, if they are within a certain perimeter of a given rail, a basepoint on that rail is calculated. This calculation takes place in SRID 5683.
The data is fetched by a service, which returns it as GeoJSON. According to the specification, GeoJSON works best when used with WGS84 coordinates.
So when fetching the data, I have to transform the coordinates.
The query I use looks like this:
select *, ST_X(ST_Transform(point, 4326)) as x, ST_Y(ST_Transform(point, 4326)) as y from bp40.bp40_punktlage;
The first of these two example rows yields the following result:
distance: 2.088918198132633
id: 9105
nr: 1
X: 7.59833269418573
Y: 50.3171094720011
description: valide with result
strecke: 3507
point: POINT (3400245.168425543 5576618.697108934)
gsal_strecke: 3507
rikz: 2.0
gsal_km_anf_p: 123905.310
gsal_km_end_p: 123945.537
gsal_geo_compound: LINESTRING (3400253.52199817 5576605.02429315, 3400251.48999817 5576609.59229315, 3400247.37399817 5576618.87129315, 3400243.28599817 5576628.16129316, 3400239.24399817 5576637.47229316, 3400237.27599817 5576642.06929316)
rk: 1
kilometer: 123.92110139140328
basepoint: POINT (3400247.0804134225 5576619.538465925)
x: 7.598332691520598
y: 50.317109473199004
Now, for some reason I cannot explain, the second row just crashes, yielding the following error message:
SQL Error [XX000]: ERROR: transform: Invalid argument (22)
According to the Postgres documentation this is an internal error, which does not really help me understand what is wrong here.
I have checked the geometry for validity (st_isvalid) and both rows contain valid geometry.
Also the initial X,Y coordinates are valid and pinpoint the location I want them to be in.
EDIT 1
Out of curiosity, I tried the following queries:
select st_transform(point,5682) from bp40_punktlage
--works just fine with both rows
select st_transform(st_transform(point, 5682), 4326) from bp40_punktlage
-- crashes with the same error
PostgreSQL 13.4, compiled by Visual C++ build 1914, 64-bit
Postgis Version : 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
Edit 2
As requested, here is the CREATE TABLE statement:
CREATE TABLE bp40.bp40_punktlage (
distance float8 NULL,
id int4 NULL,
nr varchar NULL,
"X" float8 NULL,
"Y" float8 NULL,
"description" varchar NULL,
strecke varchar NULL,
point geometry NULL,
gsal_strecke varchar(4) NULL,
rikz float8 NULL,
gsal_km_anf_p numeric(19, 3) NULL,
gsal_km_end_p numeric(19, 3) NULL,
gsal_geo_compound geometry NULL,
rk int8 NULL,
kilometer float8 NULL,
basepoint geometry NULL
);
Inserts are a bit more complicated, since the data goes through several tables and processing steps until it arrives at the table shown in this post. I will try to add them over the course of the day.
Edit 3
The geometries are valid.
st_srid() returns 5683 for both points, as expected. I tried transforming them into 5682 just for the sake of trying, but it fails when transforming to 4326, no matter which SRID I start from.
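Not an answer from the thread, but one way to narrow this down is to loop over the rows and catch the failing transform per geometry, so the offending coordinates can be inspected (coordinates far outside the valid bounds of the source CRS are a common cause of "transform: Invalid argument"); a hedged PL/pgSQL sketch:
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT id, point FROM bp40.bp40_punktlage LOOP
        BEGIN
            -- try the same transform the service performs
            PERFORM ST_Transform(r.point, 4326);
        EXCEPTION WHEN OTHERS THEN
            -- report which row fails and what its raw geometry looks like
            RAISE NOTICE 'id % fails: % -- %', r.id, SQLERRM, ST_AsText(r.point);
        END;
    END LOOP;
END
$$;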

Is it possible to look at the output of the previous row of a PostgreSQL query?

This is the question: Is it possible to look at the outputs, what has been selected, from the previous row of a running SQL query in Postgres?
I know that lag exists to look at the inputs, the "from" of the query. I also know that a CTE, subquery or lateral join can solve most issues of this kind. But I think the problem I'm facing genuinely requires a peek at the output of the previous row. Why? Because the output of the current row depends on a constant from a lookup table, and the value used to look up that constant is an aggregate of all the previous rows. And if that lookup returns the wrong constant, all subsequent rows will be increasingly off from the expected value.
The whole rest of this text is a simplified example based on the problem I'm facing. It should be possible to paste it into PostgreSQL 12 or above and play around. I'm terribly sorry that it is as complicated as it is, but I think it is the simplest I can make it while still retaining the core issue: a lookup in a lookup table based on an aggregate of all previous rows, plus the fact that the "inventory" being tracked is modeled as a series of transactions of two discrete types.
The database itself exists to keep track of multiple fish farms, i.e. cages full of fish. Fish can be moved/transferred between these farms, and the farms are fed roughly daily. Why not just carry the aggregate as a field in the table? Because it should be possible to switch out the lookup table after the season is over, to adjust it to better match reality.
-- A listing of all groups of fish ever grown.
create table farms (
id bigserial primary key,
start timestamp not null,
stop timestamp
);
insert into farms
(id, start)
values (
1, '2021-02-01T13:37'
);
-- A transfer of fish from one farm to another.
-- If the source is null the fish is transferred from another fishery outside our system.
-- If the destination is null the fish is being slaughtered, removed from the system.
create table transfers (
source bigint references farms(id),
destination bigint references farms(id),
timestamp timestamp not null default current_timestamp,
total_weight_g bigint not null constraint positive_nonzero_total_weight_g check (total_weight_g > 0),
average_weight_g bigint not null constraint positive_nonzero_average_weight_g check (average_weight_g > 0),
number_fish bigint generated always as (total_weight_g / average_weight_g) stored
);
insert into transfers
(source, destination, timestamp, total_weight_g, average_weight_g)
values
(null, 1, '2021-02-01T16:38', 5, 5),
(null, 1, '2021-02-15T16:38', 500, 500);
-- Transactions of fish feed into a farm.
create table feedings (
id bigserial primary key,
growth_table bigint not null,
farm bigint not null references farms(id),
amount_g bigint not null constraint positive_nonzero_amunt_g check (amount_g > 0),
timestamp timestamp not null
);
insert into feedings
(farm, growth_table, amount_g, timestamp)
values
(1, 1, 1, '2021-02-02T13:37'),
(1, 1, 1, '2021-02-03T13:37'),
(1, 1, 1, '2021-02-04T13:37'),
(1, 1, 1, '2021-02-05T13:37'),
(1, 1, 1, '2021-02-06T13:37'),
(1, 1, 1, '2021-02-07T13:37');
create view combined_feed_and_transfer_history as
with transfer_history as (
select timestamp, destination as farm, total_weight_g, average_weight_g, number_fish
from transfers as deposits
where deposits.destination = 1 -- TODO: This view only works for one farm, fix that.
union all
select timestamp, source as farm, -total_weight_g, -average_weight_g, -number_fish
from transfers as withdrawals
where withdrawals.source = 1
)
select timestamp, farm, total_weight_g, number_fish, average_weight_g, null as growth_table
from transfer_history
union all
select timestamp, farm, amount_g, 0 as number_fish, 0 as average_weight_g, growth_table
from feedings
order by timestamp;
-- Conversion tables from feed to gained weight.
create table growth_coefficients (
growth_table bigserial not null,
average_weight_g bigint not null constraint positive_nonzero_weight check (average_weight_g > 0),
feed_conversion_rate double precision not null constraint positive_foderkonverteringsfaktor check (feed_conversion_rate >= 0),
primary key(growth_table, average_weight_g)
);
insert into growth_coefficients
(average_weight_g, feed_conversion_rate, growth_table)
values
(5.00,0.10,1),
(10.00,10.00,1),
(20.00,1.30,1),
(50.00,1.31,1),
(100.00,1.32,1),
(300.00,1.36,1),
(600.00,1.42,1),
(1000.00,1.50,1),
(1500.00,1.60,1),
(2000.00,1.70,1),
(2500.00,1.80,1),
(3000.00,1.90,1),
(4000.00,2.10,1),
(5000.00,2.30,1);
-- My current solution is a bad one. It does a CTE that sums over all events but does not account
-- for the feed conversion rate. That means that the average weight used to look up the feed
-- conversion rate will diverge more and more from reality the further into the season time goes.
-- This is why it is important to look at the output, the average weight, of the previous row.
-- We start by summing up all the transfer and feed events to get a rough average_weight_g.
with estimate as (
select
timestamp,
farm,
total_weight_g as transaction_size_g,
growth_table,
sum(total_weight_g) over (order by timestamp) as sum_weight_g,
sum(number_fish) over (order by timestamp) as sum_number_fish,
sum(total_weight_g) over (order by timestamp) / sum(number_fish) over (order by timestamp) as average_weight_g
from
combined_feed_and_transfer_history
)
select
timestamp,
sum_number_fish,
transaction_size_g as trans_g,
sum_weight_g,
closest_lookup_table_weight.average_weight_g as lookup_g,
converted_weight_g as conv_g,
sum(converted_weight_g) over (order by timestamp) as sum_conv_g,
sum(converted_weight_g) over (order by timestamp) / sum_number_fish as sum_average_g
from
estimate
join lateral ( -- We then use this estimated_average_weight to look up the closest constant in the growth coefficient table.
(select gc.average_weight_g - estimate.average_weight_g as diff, gc.average_weight_g from growth_coefficients gc where gc.average_weight_g >= estimate.average_weight_g order by gc.average_weight_g asc limit 1)
union all
(select estimate.average_weight_g - gc.average_weight_g as diff, gc.average_weight_g from growth_coefficients gc where gc.average_weight_g <= estimate.average_weight_g order by gc.average_weight_g desc limit 1)
order by diff
limit 1
) as closest_lookup_table_weight
on true
join lateral ( -- If the historical event is a feeding we need to lookup the feed conversion rate.
select case when growth_table is null then 1
else (select feed_conversion_rate
from growth_coefficients gc
where gc.growth_table = growth_table
and gc.average_weight_g = closest_lookup_table_weight.average_weight_g)
end
) as growth_coefficient
on true
join lateral (
select feed_conversion_rate * transaction_size_g as converted_weight_g
) as converted_weight_g
on true;
At the very bottom is my current "solution". With the above example data the sum_conv_g should end up being 5.6, but due to the aggregate being used as the lookup not accounting for the conversion rate the sum_conv_g ends up 45.2 instead.
One idea I had was whether there is perhaps something like query-local variables one could use to store the sum_average_g between rows. There's always the escape hatch of querying the transactions out into my general-purpose programming language, Clojure, and solving it there, but it would be neat if it could be solved entirely within the database.
You have to formulate a recursive subquery. I posted a simplified version of this question over at the DBA SE and got the answer there. The answer to that question can be found here and can be expanded to this more complicated question, though I would wager that no one will ever have the interest to do that.
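For completeness, here is a hedged sketch of that idea (my own adaptation, not the linked DBA SE answer): number the rows of the combined history, then let each recursion step look up the feed conversion rate using the running average weight produced by the previous row. CTE and column names beyond the tables above are illustrative, and the anchor row is assumed to be a transfer.
with recursive events as (
    -- give every event a position so the recursion can walk them in order
    select row_number() over (order by timestamp) as rn,
           timestamp, total_weight_g, number_fish, growth_table
    from combined_feed_and_transfer_history
),
running as (
    -- anchor: the first event (assumed to be a transfer) enters the sums as-is
    select rn, timestamp,
           total_weight_g::double precision as sum_conv_g,
           number_fish as sum_fish
    from events
    where rn = 1
    union all
    -- recursive step: convert a feeding with the coefficient looked up from the
    -- previous row's running average weight, then extend the running totals
    select e.rn, e.timestamp,
           r.sum_conv_g
             + case
                 when e.growth_table is null then e.total_weight_g
                 else e.total_weight_g * (
                       select gc.feed_conversion_rate
                       from growth_coefficients gc
                       where gc.growth_table = e.growth_table
                       order by abs(gc.average_weight_g - r.sum_conv_g / nullif(r.sum_fish, 0))
                       limit 1)
               end,
           r.sum_fish + e.number_fish
    from running r
    join events e on e.rn = r.rn + 1
)
select timestamp, sum_fish, sum_conv_g,
       sum_conv_g / nullif(sum_fish, 0) as sum_average_g
from running
order by timestamp;
The correlated subquery simply picks the growth_coefficients row whose average_weight_g is closest to the running average, mirroring the closest_lookup_table_weight lateral join in the original query.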

Repeat same data certain number of times

I've run into an issue that I can't seem to solve without a lot of changes deep in the code, and I think there must be a simpler solution that I'm simply not aware of.
I have a table of product names, product locations and various statuses (from 1 to 10). I have data for all products and locations but only some of the statuses (for example product X in city XX has data for categories 1 and 3, and product Y for city YY has data for categories 1 to 6).
I'd like to always display 10 repetitions of each product/location, with corresponding data (if there is any) or nulls. This makes a report I'm planning on creating much easier to read and understand.
I'm using SSMS2017, on SQL Server 2016.
SELECT
[Product],
[Location],
[Category],
[Week1],
[Week2],
[Week3]
FROM MyView
Naturally it will only return data that I have, but I'd like to always return all 10 rows for each product/location combination (with nulls in Week columns if I have no data there).
Your question is not very clear, but I think my magic crystal ball gave me a good guess:
I think you are looking for LEFT JOIN and CROSS JOIN:
--Next time please create a stand-alone sample like this yourself
--I create 3 dummy tables with sample data
DECLARE @tblStatus TABLE(ID INT IDENTITY,StatusName VARCHAR(100));
INSERT INTO @tblStatus VALUES('Status 1')
,('Status 2')
,('Status 3')
,('Status 4')
,('Status 5');
DECLARE @tblGroup TABLE(ID INT IDENTITY,GroupName VARCHAR(100));
INSERT INTO @tblGroup VALUES ('Group 1')
,('Group 2')
,('Group 3')
,('Group 4')
,('Group 5');
DECLARE @tblProduct TABLE(ID INT IDENTITY,ProductName VARCHAR(100),StatusID INT, GroupID INT);
INSERT INTO @tblProduct VALUES ('Product 1, Status 1, Group 2',1,2)
,('Product 2, Status 1, Group 3',1,3)
,('Product 3, Status 3, Group 4',3,4)
,('Product 4, Status 3, Group 3',3,3)
,('Product 5, Status 1, Group 5',1,5);
--This will return each status (independent of product values), together with the products (if there is a corresponding line)
SELECT s.StatusName
,p.*
FROM @tblStatus s
LEFT JOIN @tblProduct p ON s.ID=p.StatusID
--This will first use CROSS JOIN to create an each-with-each cartesian product.
--The LEFT JOIN works as above
SELECT s.StatusName
,g.GroupName
,p.*
FROM @tblStatus s
CROSS JOIN @tblGroup g
LEFT JOIN @tblProduct p ON s.ID=p.StatusID AND g.ID=p.GroupID;
If this is not what you need, please try to set up an example like mine and provide the expected output.
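Applied to the view from the question, the same CROSS JOIN / LEFT JOIN pattern could look roughly like this (a sketch under the assumption that the categories are simply the fixed numbers 1 through 10; MyView and its columns are taken from the question):
-- every product/location combination, crossed with the 10 fixed categories,
-- left-joined back to the view so missing weeks stay NULL
SELECT pl.[Product],
       pl.[Location],
       c.[Category],
       v.[Week1],
       v.[Week2],
       v.[Week3]
FROM (SELECT DISTINCT [Product], [Location] FROM MyView) AS pl
CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) AS c([Category])
LEFT JOIN MyView AS v
       ON  v.[Product]  = pl.[Product]
       AND v.[Location] = pl.[Location]
       AND v.[Category] = c.[Category];
Rows without data keep NULL in the Week columns, which matches the requested output of always showing 10 rows per product/location.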

Aggregating connected sets of nodes / edges

I have a connected set of edges with unique nodes. They are connected using a parent node. Consider the following example code:
CREATE TABLE network (
node integer PRIMARY KEY,
parent integer REFERENCES network(node),
length numeric NOT NULL
);
CREATE INDEX ON network (parent);
INSERT INTO network (node, parent, length) VALUES
(1, NULL, 1.3),
(2, 1, 1.2),
(3, 2, 0.9),
(4, 3, 1.4),
(5, 4, 1.6),
(6, 2, 1.5),
(7, NULL, 1.0);
Visually, two groups of edges can be identified: nodes 1 through 6 form one connected group, while node 7 stands on its own. How can the two groups be identified using PostgreSQL 9.1, and the lengths summed? The expected result is shown:
edges_in_group | total_edges | total_length
----------------+-------------+--------------
{1,2,3,4,5,6} | 6 | 7.9
{7} | 1 | 1.0
(2 rows)
I don't even know where to begin. Do I need a custom aggregate or window function? Could I use WITH RECURSIVE to iteratively collect edges that connect? My real world case is a stream network of 245,000 edges. I expect the maximum number of edges_in_group to be less than 200, and a couple hundred aggregated groups (rows).
A recursive query is the way to go:
with recursive tree as (
select node, parent, length, node as root_id
from network
where parent is null
union all
select c.node, c.parent, c.length, p.root_id
from network c
join tree p on p.node = c.parent
)
select root_id, array_agg(node) as edges_in_group, count(*) as total_edges, sum(length) as total_length
from tree
group by root_id;
The important thing is to keep the id of the root node in each recursion, so that you can group by that id in the final result.