Aggregating connected sets of nodes / edges - postgresql

I have a connected set of edges with unique nodes. They are connected using a parent node. Consider the following example code and illustration:
CREATE TABLE network (
    node integer PRIMARY KEY,
    parent integer REFERENCES network(node),
    length numeric NOT NULL
);
CREATE INDEX ON network (parent);
INSERT INTO network (node, parent, length) VALUES
(1, NULL, 1.3),
(2, 1, 1.2),
(3, 2, 0.9),
(4, 3, 1.4),
(5, 4, 1.6),
(6, 2, 1.5),
(7, NULL, 1.0);
Visually, two groups of edges can be identified. How can the two groups be identified using PostgreSQL 9.1, and their lengths summed? The expected result is shown below:
edges_in_group | total_edges | total_length
----------------+-------------+--------------
{1,2,3,4,5,6} | 6 | 7.9
{7} | 1 | 1.0
(2 rows)
I don't even know where to begin. Do I need a custom aggregate or window function? Could I use WITH RECURSIVE to iteratively collect edges that connect? My real-world case is a stream network of 245,000 edges. I expect the maximum number of edges_in_group to be less than 200, and a couple of hundred aggregated groups (rows).

A recursive query is the way to go:
with recursive tree as (
    select node, parent, length, node as root_id
    from network
    where parent is null
    union all
    select c.node, c.parent, c.length, p.root_id
    from network c
    join tree p on p.node = c.parent
)
select root_id,
       array_agg(node) as edges_in_group,
       count(*) as total_edges,
       sum(length) as total_length
from tree
group by root_id;
The important thing is to keep the id of the root node in each recursion, so that you can group by that id in the final result.
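With the sample data above, this should return something like the following (the element order inside edges_in_group is not guaranteed unless you use array_agg(node ORDER BY node)):
 root_id | edges_in_group | total_edges | total_length
---------+----------------+-------------+--------------
       1 | {1,2,3,4,5,6}  |           6 |          7.9
       7 | {7}            |           1 |          1.0
(2 rows)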

Related

Create `has_($value)` column for table values

I would like to perform an operation similar to a dynamic pivot table. I use PostgreSQL as the database. The table t has a column with the values 10, 20, and 30. I want to create n columns (here n = 3), one per distinct value, holding a flag has_($value) that is 1 if the value exists for the respective group, or 0 if not. I tried to understand tablefunc and crosstab without success.
CREATE TABLE IF NOT EXISTS t (
    id INTEGER NOT NULL,
    value INT NOT NULL
);
INSERT INTO t (id, value) VALUES
    (1, 10),
    (1, 20),
    (2, 10),
    (3, 30),
    (3, 20);
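One way to get that shape, as a minimal sketch assuming the three values are known in advance (the has_10/has_20/has_30 column names are just illustrative), is conditional aggregation rather than crosstab:
SELECT id,
       MAX((value = 10)::int) AS has_10,
       MAX((value = 20)::int) AS has_20,
       MAX((value = 30)::int) AS has_30
FROM t
GROUP BY id
ORDER BY id;
With the sample rows above this yields has_10 = 1, has_20 = 1, has_30 = 0 for id 1, and so on. A truly dynamic set of columns still needs crosstab or dynamically generated SQL, since SQL requires the output columns to be known at parse time.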

ST_ClusterDBSCAN function and minpoints parameter definition

I've spent the last two days trying to figure out what's wrong with my clustering query. It seemed to work correctly, but after some more detailed testing I was confused to see that some clusters were not created even though they evidently should have been.
So initially I was assuming that:
ST_ClusterDBSCAN(geom, eps := 1, minpoints := 4) OVER(PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type))
would cluster points that have the same "PARTITION BY" attributes, with the minimum group needing 4 points (including the core point) within eps distance. According to the docs:
https://postgis.net/docs/ST_ClusterDBSCAN.html
An input geometry will be added to a cluster if it is either:
A "core" geometry, that is within eps distance of at least minpoints input geometries (including itself) or
A "border" geometry, that is within eps distance of a core geometry. its surrounding area with radius eps.
But it seems that that's not exactly true. It seems that in order to cluster 4 points (as the minpoints parameter is set), the partition defined by:
OVER(PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type))
needs to contain at least five objects for a clst_id to be assigned to the other four. Here is an example:
CREATE TABLE IF NOT EXISTS public.point_table
(
    point_sys_id serial primary key, -- System generated Primary Key - Asset Cache
    point_id bigint,
    geom geometry(Point,4326), -- Geometry Field
    country_code varchar(4), -- Country Code
    elevation_ft numeric(7,2), -- Elevation in Feet
    height_ft numeric(7,2), -- Height in Feet
    obstacle_type varchar(50) -- Obstacle Type
);
INSERT INTO point_table(point_id, geom, country_code, elevation_ft, height_ft, obstacle_type)
VALUES
(1, '0101000020E6100000E4141DC9E5934B40D235936FB6193940', 'ARE', 100, 50, 'BUILDING'),
(2, '0101000020E6100000C746205ED7934B40191C25AFCE193940', 'ARE', 100, 50, 'BUILDING'),
(3, '0101000020E6100000C780ECF5EE934B40B6BE4868CB193940', 'ARE', 100, 50, 'BUILDING'),
(4, '0101000020E6100000A97A358FA5AF4B4074A0C65B724C3940', 'ARE', 100, 50, 'BUILDING'), -- this point is outside of the cluster distance (eps)
(5, '0101000020E6100000ABB2EF8AE0934B404451A04FE4193940', 'ARE', 100, 50, 'BUILDING');
select ST_ClusterDBSCAN(geom, eps := 0.000906495804256269, minpoints := 4) OVER(PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type)) as clst_id,
point_id, geom, country_code, elevation_ft, height_ft, obstacle_type
from point_table
--where point_id != 4
Running the clustering query against all five points works fine. But once you exclude the seemingly irrelevant point_id = 4 (which is outside of the eps distance anyway), the clustering stops working (clst_id becomes null), even though the 4 needed points (according to the docs) are theoretically still in place.
Once I change the minpoints parameter to 3, clustering works fine for those 4 neighboring points.
Can someone confirm my conclusion that ST_ClusterDBSCAN is not behaving correctly, or give a good explanation for this behavior?
EDIT:
I've submitted a ticket to PostGIS directly: https://trac.osgeo.org/postgis/ticket/4853
and it seems that this has been fixed as of version 3.1 :)
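As a minimal sketch of how to verify this on a given installation (using the point_table sample above), you can check the PostGIS version and re-run the query without point_id = 4; per the ticket, on PostGIS >= 3.1 the four remaining neighbors should receive a clst_id with minpoints := 4:
SELECT postgis_full_version(); -- confirm the installed PostGIS version

SELECT point_id,
       ST_ClusterDBSCAN(geom, eps := 0.000906495804256269, minpoints := 4)
           OVER (PARTITION BY CONCAT(country_code, elevation_ft, height_ft, obstacle_type)) AS clst_id
FROM point_table
WHERE point_id != 4;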

PostgreSQL multicolumn index not fully used

I have a large (~110 million rows) table on PostgreSQL 12.3 whose relevant fields can be described by the following DDL:
CREATE TABLE tbl
(
    item1_id integer,
    item2_id integer,
    item3_id integer,
    item4_id integer,
    type_id integer
);
One of the queries we execute often is:
SELECT type_id, item1_id, item2_id, item3_id, item4_id
FROM tbl
WHERE
type_id IS NOT NULL
AND item1_id IN (1, 2, 3)
AND (
item2_id IN (4, 5, 6)
OR item2_id IS NULL
)
AND (
item3_id IN (7, 8, 9)
OR item3_id IS NULL
)
AND (
item4_id IN (10, 11, 12)
OR item4_id IS NULL
)
Although we have indexes for each of the individual columns, the query is still relatively slow (a couple of seconds). Hoping to optimize this, I created the following index:
CREATE INDEX tbl_item_ids
ON public.tbl USING btree
(item1_id ASC, item2_id ASC, item3_id ASC, item4_id ASC)
WHERE type_id IS NOT NULL;
Unfortunately the query performance barely improved - EXPLAIN tells me this is because although an index scan is done with this newly created index, only item1_id is used as an Index Cond, whereas all the other filters are applied at table level (i.e. plain Filter).
I'm not sure why the index is not used in its entirety (or at least for more than the item1_id column). Is there an obvious reason for this? Is there a way I can restructure the index or the query itself to help with performance?
A multi-column index can only be used for columns beyond the first if the condition on the first column is an equality comparison (=). IN or = ANY does not qualify.
So you will be better off with individual indexes for each column, which can be combined with a bitmap OR.
You should try to avoid OR in the WHERE condition, perhaps with
WHERE coalesce(item2_id, -1) IN (-1, 4, 5, 6)
where -1 is a value that doesn't occur. Then you could use an index on the coalesce expression.
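A minimal sketch of what that could look like, assuming -1 never appears in any of the item columns and that the same rewrite is applied to item3_id and item4_id (the index names are illustrative):
-- one expression index per nullable column, plus a plain index on item1_id
CREATE INDEX tbl_item2_coalesce ON tbl (coalesce(item2_id, -1));
CREATE INDEX tbl_item3_coalesce ON tbl (coalesce(item3_id, -1));
CREATE INDEX tbl_item4_coalesce ON tbl (coalesce(item4_id, -1));

SELECT type_id, item1_id, item2_id, item3_id, item4_id
FROM tbl
WHERE type_id IS NOT NULL
  AND item1_id IN (1, 2, 3)
  AND coalesce(item2_id, -1) IN (-1, 4, 5, 6)
  AND coalesce(item3_id, -1) IN (-1, 7, 8, 9)
  AND coalesce(item4_id, -1) IN (-1, 10, 11, 12);
The rewritten conditions match the expression indexes, so the planner has the option of bitmap-combining several of them instead of filtering every row that matches item1_id at the table level.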

Multiply rows by difference of numbers in columns, with sequence list

I need to create a table using Postgres that multiplies a row by the difference of the numbers in 2 columns, and provides the corresponding sequence. It's hard to explain; I'll leave a picture to save us a thousand words:
I have found a partial answer to this question in SQL, but it only multiplies by one column, and I'm having trouble using it in PostgreSQL:
How to multiply a single row with a number from column in sql.
You can use the generate_series function: https://www.postgresql.org/docs/current/static/functions-srf.html
create table table_a (
    a integer primary key,
    start_a integer,
    end_a integer
);
insert into table_a values
    (1, 1, 3),
    (2, 2, 5);
create table table_b as
select a, start_a, end_a, g as start_b, g + 1 as end_b
from table_a, lateral generate_series(start_a, end_a - 1) g;
select * from table_b;
You can try it here: http://rextester.com/RTZWK4070
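With the sample data above, table_b should come out as:
 a | start_a | end_a | start_b | end_b
---+---------+-------+---------+-------
 1 |       1 |     3 |       1 |     2
 1 |       1 |     3 |       2 |     3
 2 |       2 |     5 |       2 |     3
 2 |       2 |     5 |       3 |     4
 2 |       2 |     5 |       4 |     5
(5 rows)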

Constraint on sum from rows

I've got a table in PostgreSQL 9.4:
user_votes (
    user_id int,
    portfolio_id int,
    car_id int,
    vote int
)
Is it possible to put a constraint on the table so that a user can have at most 99 points to vote with in each portfolio?
This means that a user can have multiple rows with the same user_id and portfolio_id, but different car_id and vote. The sum of the votes should never exceed 99, but the votes can be spread among different cars.
So doing:
INSERT INTO user_votes (user_id, portfolio_id, car_id, vote) VALUES
(1, 1, 1, 20),
(1, 1, 7, 40),
(1, 1, 9, 25);
would all be allowed, but trying to add another row that pushes the total over 99 votes should fail, for example:
INSERT INTO user_votes (user_id, portfolio_id, car_id, vote) VALUES
(1, 1, 21, 40);
Unfortunately no; if you try to create such a constraint you will see this error message:
ERROR: aggregate functions are not allowed in check constraints
But the wonderful thing about PostgreSQL is that there is always more than one way to skin a cat. You can use a BEFORE trigger to check that the data you are trying to insert fulfills the requirement. As the documentation on trigger behavior puts it:
Row-level triggers fired BEFORE can return null to signal the trigger
manager to skip the rest of the operation for this row (i.e.,
subsequent triggers are not fired, and the INSERT/UPDATE/DELETE does
not occur for this row). If a nonnull value is returned then the
operation proceeds with that row value.
Inside your trigger you would sum the votes the user has already placed in that portfolio:
SELECT coalesce(sum(vote), 0) INTO vote_sum FROM user_votes WHERE user_id = NEW.user_id AND portfolio_id = NEW.portfolio_id;
Now if vote_sum + NEW.vote exceeds 99 you return NULL and the row will not be inserted.
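A minimal sketch of such a trigger, assuming the table definition above (the function and trigger names are illustrative, not from the original post):
CREATE OR REPLACE FUNCTION check_vote_limit() RETURNS trigger AS $$
DECLARE
    vote_sum integer;
BEGIN
    -- total votes this user has already placed in this portfolio
    SELECT coalesce(sum(vote), 0) INTO vote_sum
    FROM user_votes
    WHERE user_id = NEW.user_id
      AND portfolio_id = NEW.portfolio_id;

    IF vote_sum + NEW.vote > 99 THEN
        RETURN NULL; -- skip the insert for this row
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER user_votes_limit
BEFORE INSERT ON user_votes
FOR EACH ROW EXECUTE PROCEDURE check_vote_limit();
Two caveats worth noting: returning NULL skips the row silently, so you may prefer RAISE EXCEPTION to surface an error to the client, and concurrent inserts for the same user and portfolio can still slip past the check unless you take a lock or use a serializable transaction.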