How to check if a multilinestring really is a multilinestring? - postgresql

I have a huge database with a road network, and the geometry type is MULTILINESTRING. I would like to filter out the MULTILINESTRINGs with topological errors. Both the geometry on the left side and the one on the right side are single records, each made up of two lines. On the right side the two lines connect, so that case doesn't really bother me; I can merge them later without a topological error. On the left side, however, the lines don't connect, yet they are still one record.
What I've tried so far:
SELECT gid
FROM myschema.roads
WHERE (
    NOT ST_Equals(ST_EndPoint(ST_GeometryN(the_geom, 1)), ST_StartPoint(ST_GeometryN(the_geom, 2)))
    AND NOT ST_Equals(ST_EndPoint(ST_GeometryN(the_geom, 2)), ST_StartPoint(ST_GeometryN(the_geom, 1)))
)
If I could assume that the MULTILINESTRINGs are made up of at most two lines, this would work. Unfortunately some of them are made up of 10-20 lines, and I cannot be sure that the line parts follow each other in ascending or descending order, so extending my SQL script is not an option in my opinion.
(I'm using QGIS with a PostGIS database, but I also possess ArcMap.)

If you're simply looking for a way to identify which MultiLineStrings contain more than one line, you can use ST_LineMerge, then ST_Dump, and count the returned LineStrings. If a geometry contains non-continuous lines, the query will return a count bigger than 1, e.g.
WITH j (geom) AS (
VALUES ('MULTILINESTRING((10 10, 20 20, 10 40),(40 40, 30 30, 40 20, 30 10))'),
('MULTILINESTRING((10 10, 20 20, 10 40),(10 40, 30 30, 40 20, 30 10))'))
SELECT geom,(SELECT count(*) FROM ST_Dump(ST_LineMerge(geom)))
FROM j;
geom | count
---------------------------------------------------------------------+-------
MULTILINESTRING((10 10, 20 20, 10 40),(40 40, 30 30, 40 20, 30 10)) | 2
MULTILINESTRING((10 10, 20 20, 10 40),(10 40, 30 30, 40 20, 30 10)) | 1
(2 rows)
Another alternative is to use ST_NumGeometries after applying ST_LineMerge, e.g.
WITH j (geom) AS (
VALUES ('MULTILINESTRING((10 10, 20 20, 10 40),(40 40, 30 30, 40 20, 30 10))'),
('MULTILINESTRING((10 10, 20 20, 10 40),(10 40, 30 30, 40 20, 30 10))'))
SELECT geom,ST_NumGeometries(ST_LineMerge(geom)) AS count
FROM j;
geom | count
---------------------------------------------------------------------+-------
MULTILINESTRING((10 10, 20 20, 10 40),(40 40, 30 30, 40 20, 30 10)) | 2
MULTILINESTRING((10 10, 20 20, 10 40),(10 40, 30 30, 40 20, 30 10)) | 1
(2 rows)
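Applied to the table from the question, the disconnected records could be listed directly with the same idea (a sketch reusing the asker's table and column names):
SELECT gid
FROM myschema.roads
WHERE ST_NumGeometries(ST_LineMerge(the_geom)) > 1;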

You could use this function to check if the multilinestring is connected:
CREATE OR REPLACE FUNCTION is_connected(g geometry(MultiLineString)) RETURNS boolean
    LANGUAGE plpgsql AS
$$DECLARE
    i integer;
    point geometry := NULL;
    part geometry;
BEGIN
    -- Walk the parts in storage order and check that each part
    -- starts where the previous one ended.
    FOR i IN 1..ST_NumGeometries(g) LOOP
        part := ST_GeometryN(g, i);
        -- On the first iteration "point" is NULL, so ST_Equals yields NULL
        -- and the IF is skipped; the check kicks in from the second part on.
        IF NOT ST_Equals(point, ST_StartPoint(part)) THEN
            RETURN FALSE;
        END IF;
        point := ST_EndPoint(part);
    END LOOP;
    RETURN TRUE;
END;$$;
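Note that the function checks the parts in the order they are stored, so it assumes consecutive parts are meant to connect end-to-start. Against the question's table it would be used like this (a sketch with the asker's names):
SELECT gid
FROM myschema.roads
WHERE NOT is_connected(the_geom);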

Related

Is there a PostGIS function to conditionally merge linestring geometries to the neighboring ones?

I have a lines (MultiLineString) table in my PostGIS database (Postgres 11), which I have converted to LineStrings, also checking the validity (ST_IsValid()) of the new LineString geometries.
create table my_line_tbl as
select
    gid gid_multi,
    adm_code, t_count,
    st_length((st_dump(st_linemerge(geom))).geom)::int len,
    (st_dump(st_linemerge(geom))).geom geom
from
    my_multiline_tbl
order by gid;
alter table my_line_tbl add column id serial primary key not null;
The first 10 rows look like this:
id, gid_multi, adm_code, t_count, len, geom
1, 1, 30, 5242, 407, LINESTRING(...)
2, 1, 30, 3421, 561, LINESTRING(...)
3, 2, 50, 5248, 3, LINESTRING(...)
4, 2, 50, 1458, 3, LINESTRING(...)
5, 2, 60, 2541, 28, LINESTRING(...)
6, 2, 30, 3325, 4, LINESTRING(...)
7, 2, 20, 1142, 5, LINESTRING(...)
8, 2, 30, 1425, 7, LINESTRING(...)
9, 3, 30, 2254, 4, LINESTRING(...)
10, 3, 50, 2254, 50, LINESTRING(...)
I am trying to develop the following logic:
1. Find all <= 10 m segments and merge them into a neighboring (previous or next) geometry that is > 10 m.
2. If there are many <= 10 m segments next to each other, merge them together to make > 10 m segments (minimum length: > 10 m).
3. In case of intersections, merge any <= 10 m segment into the longest neighboring geometry.
I thought of using SQL window functions to check the length (st_length()) of succeeding geometries (lead(id) over ()) and then merging them, but the problem with this approach is that segments with successive IDs are not necessarily next to each other (they may not intersect, st_intersects()).
My code attempt (dynamic SQL) is below, where I try to separate the <= 10 m and > 10 m geometries.
with lt10mseg as (
    select
        id, gid_multi,
        len, geom lt10m_geom
    from
        my_line_tbl
    where len <= 10
    order by id
), gt10mseg as (
    select
        id, gid_multi,
        len, geom gt10m_geom
    from
        my_line_tbl
    where len > 10
    order by id
)
select
    st_intersects(lt10m_geom, gt10m_geom)
from
    lt10mseg, gt10mseg
order by lt10mseg.id
Any help/suggestions (dynamic SQL/PL/pgSQL) on how to continue developing the above logic? The ultimate goal is to get rid of the <= 10 m segments by merging them into their neighbors.
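One possible starting point, not a full solution: instead of lead()/lag() over IDs, find neighbors by actual intersection with a LATERAL join. A sketch, assuming the table and columns shown above, pairing each <= 10 m segment with its longest touching > 10 m neighbor:
select s.id as short_id, n.id as neighbor_id
from my_line_tbl s
cross join lateral (
    -- longest > 10 m segment that touches the short one
    select t.id
    from my_line_tbl t
    where t.id <> s.id
      and t.len > 10
      and st_intersects(s.geom, t.geom)
    order by t.len desc
    limit 1
) n
where s.len <= 10;
A left join lateral ... on true would also keep short segments that have no long neighbor (chains of short segments, your rule 2). The merge itself (e.g. st_linemerge(st_union(...)) of each pair, then deleting the short row) would have to be repeated, for instance in a PL/pgSQL loop, until no <= 10 m segments remain, because each merge can create new neighbors.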

Multiple Cases For Same Result Column

I have a table where one of the columns has all of the info I need for a report.
I want to substring certain portions of this column into a column of this report, but the problem is that the values in this column come in 3 varying character lengths.
Example:
Row1: 20180101_ABC_12
Row2: 20180102_DEFG_23
Row3: 20180103_HIJKL_45
In this particular example I want the middle portion (e.g. ABC) to be a column called 'Initials'; the problem is that I am using separate CASE logic for each LEN. Not sure how else to achieve this.
My sample query is below. It pulls all of the possible options, but as separate columns. What would I need to do to have these 3 options pull into one column, let's call it 'Initials'?
Thanks
SELECT
    FileName
    , CASE WHEN LEN(FileName) = 15 THEN SUBSTRING(FileName, 10, 3) ELSE NULL END
    , CASE WHEN LEN(FileName) = 16 THEN SUBSTRING(FileName, 10, 4) ELSE NULL END
    , CASE WHEN LEN(FileName) = 17 THEN SUBSTRING(FileName, 10, 5) ELSE NULL END
FROM File
In Tableau, you would accomplish this using a calculated field.
Initials:
CASE LEN(FileName)
    WHEN 15 THEN SUBSTRING(FileName, 10, 3)
    WHEN 16 THEN SUBSTRING(FileName, 10, 4)
    WHEN 17 THEN SUBSTRING(FileName, 10, 5)
END
Or maybe:
SUBSTRING(FileName
    , 10
    , CASE LEN(FileName)
        WHEN 15 THEN 3
        WHEN 16 THEN 4
        WHEN 17 THEN 5
      END
)
But barring the more technical aspect, this can be solved with math (assuming your data is limited to those three lengths, or that the pattern holds):
SUBSTRING(FileName
    , 10
    , LEN(FileName) - 12
)
You need one CASE expression covering every possible case, not 3 separate ones, because each CASE creates its own column:
SELECT
FileName
, CASE LEN(FileName)
    WHEN 15 THEN SUBSTRING(FileName, 10, 3)
    WHEN 16 THEN SUBSTRING(FileName, 10, 4)
    WHEN 17 THEN SUBSTRING(FileName, 10, 5)
    ELSE NULL
END AS Initials
FROM File
Another way is to get everything between the 2 _ characters:
SELECT
FileName
, substring(
left(FileName, len(FileName) - charindex('_', reverse(FileName) + '_')),
charindex('_', FileName) + 1,
len(FileName)
) AS Initials
FROM File
but from your logic I assume that the values in the FileName column all have the same pattern:
<8 digits>_<Initials>_<2 digits>
If this is the case then you can get what you want like this:
SELECT
FileName
, substring(FileName, 10, len(FileName) - 12) AS Initials
FROM File
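For completeness, a common SQL Server trick that avoids length arithmetic entirely is PARSENAME after swapping the underscores for dots. This is only a sketch and assumes FileName never contains a dot and always has exactly two underscores:
SELECT
    FileName
    -- PARSENAME numbers dot-separated parts from the right: 1 = '12', 2 = 'ABC', 3 = '20180101'
    , PARSENAME(REPLACE(FileName, '_', '.'), 2) AS Initials
FROM File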

Creating a Bin column in Postgres to check an integer and return a string

I have a large data set in a Postgres db and need to generate a field that groups rows into respective bins: "0-100", "101-200", "201-300", etc., all the way up to nearly 5000. I am aware that I could update the rows manually, with one statement per bin, like this:
update test
set testgroup = '0-100' where testint >= 1 and testint <= 100;
I would really like to figure out a more efficient way to do this; I'm open to anything and everything! The main goal is to look at the integer in the 'testint' column and, if it is between 1 and 100, return "0-100" in the testgroup column.
Use the width_bucket function. See the docs, but here is a short version of the syntax:
width_bucket(a, LBound, UBound, num_bins)
To get it to work properly for your bins, I have to add 1 to UBound. Some examples:
select width_bucket( 1, 0, 5001, 50) gives 1
select width_bucket(100, 0, 5001, 50) gives 1
select width_bucket(101, 0, 5001, 50) gives 2
select width_bucket(4900, 0, 5001, 50) gives 49
select width_bucket(4901, 0, 5001, 50) gives 50
So that works as expected. Next we need to generate the proper string. The pseudo-format is
(width_bucket - 1)*100 || '-' || (width_bucket)*100
where || is the SQL concatenation operator. Using the first example from before:
select (width_bucket(1, 0, 5001, 50)-1)*100 || ' - ' || width_bucket(1, 0, 5001, 50)*100
gives '0 - 100'
Sweet. Now putting it all together. First make a sandbox table you can use for testing. This will be a copy or partial copy of your data:
CREATE TABLE test
AS
SELECT *
FROM original_table
Then add the new column to the table:
ALTER TABLE test
ADD COLUMN testgroup text
Now the UPDATE statement:
UPDATE test
SET testgroup = (width_bucket(testint, 0, 5001, 50)-1)*100 || ' - ' ||
                width_bucket(testint, 0, 5001, 50)*100;
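Before running the UPDATE, it may be worth previewing the labels on a few rows (a sketch using the sandbox table and column from the question):
SELECT testint,
       (width_bucket(testint, 0, 5001, 50)-1)*100 || ' - ' ||
       width_bucket(testint, 0, 5001, 50)*100 AS testgroup
FROM test
LIMIT 10;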
You can make use of generate_series to generate numbers from 0 to 50, and then to select the data between the generated values * 100 and the next generated value * 100. The same principle is used to build the bin name.
UPDATE test
SET testgroup = (x*100)+1 || '-' || (x+1)*100
FROM generate_series(0,50) f(x)
WHERE testint > (x*100)
AND testint <= ((x+1)*100);
http://rextester.com/FXIS37706

Aggregation on fixed size JSONB array in PostgreSQL

I'm struggling with aggregations on a JSONB field in a PostgreSQL database. This is probably easier to explain with an example, so I'll create and populate a table called analysis with 2 columns (id and analysis) as follows:
create table analysis (
id serial primary key,
analysis jsonb
);
insert into analysis
(id, analysis) values
(1, '{"category" : "news", "results" : [1, 2, 3, 4, 5 , 6, 7, 8, 9, 10, 11, 12, 13, 14, null, null]}'),
(2, '{"category" : "news", "results" : [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, null, 26]}'),
(3, '{"category" : "news", "results" : [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]}'),
(4, '{"category" : "sport", "results" : [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66]}'),
(5, '{"category" : "sport", "results" : [71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]}'),
(6, '{"category" : "weather", "results" : [91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106]}');
As you can see, the analysis JSONB field always contains 2 attributes, category and results. The results attribute will always contain a fixed-length array of size 16. I've used various functions such as jsonb_array_elements, but what I'm trying to do is the following:
Group by analysis->'category'
Average of each array element
What I want is a statement that returns 3 rows grouped by category (i.e. news, sport and weather), each with a fixed-length 16-element array containing the averages. To further complicate things, if there are nulls in the array then we should ignore them (i.e. we are not simply summing and dividing by the number of rows). The result should look something like the following:
category | analysis_average
-----------+--------------------------------------------------------------------------------------------------------------
"news" | [14.33, 15.33, 16.33, 17.33, 18.33, 19.33, 20.33, 21.33, 22.33, 23.33, 24.33, 25.33, 26.33, 27.33, 45, 36]
"sport" | [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76]
"weather" | [91, 92, 93, 94, 95, 96, 97, 98, 99, 00, 101, 102, 103, 104, 105, 106]
NOTE: Notice the 45 and 36 in the last 2 array items of the 1st row, which illustrates ignoring the nulls.
I had considered creating a view which exploded the array into 16 columns i.e.
create view analysis_view as
select a.*,
(a.analysis->'results'->>0)::int as result0,
(a.analysis->'results'->>1)::int as result1
/* ... etc for all 16 array entries .. */
from analysis a;
This seems extremely inelegant to me and removes the advantages of using an array in the first place, but I could probably hack something together using that approach.
Any pointers or tips will be most appreciated!
Also performance is really important here so the higher the performance the better!
This will work for any array length:
select category, array_agg(average order by subscript) as average
from (
    select
        a.analysis->>'category' category,
        subscript,
        avg(v)::numeric(5,2) as average
    from
        analysis a,
        lateral unnest(
            array(select jsonb_array_elements_text(analysis->'results')::int)
        ) with ordinality s(v, subscript)
    group by 1, 2
) s
group by category;
category | average
----------+----------------------------------------------------------------------------------------------------------
news | {14.33,15.33,16.33,17.33,18.33,19.33,20.33,21.33,22.33,23.33,24.33,25.33,26.33,27.33,45.00,36.00}
sport | {61.00,62.00,63.00,64.00,65.00,66.00,67.00,68.00,69.00,70.00,71.00,72.00,73.00,74.00,75.00,76.00}
weather | {91.00,92.00,93.00,94.00,95.00,96.00,97.00,98.00,99.00,100.00,101.00,102.00,103.00,104.00,105.00,106.00}
See the PostgreSQL documentation on table functions WITH ORDINALITY and on LATERAL.
Because the array is always of the same length, you can use generate_series instead of typing out the index of every array element yourself. You CROSS JOIN with that generated series so the index is applied to every row, letting you get the element at position s from the array. Then it is just a matter of aggregating the data with GROUP BY.
The query then becomes:
SELECT category, array_agg(val ORDER BY s) analysis_average
FROM (
    SELECT analysis->'category' category, s, AVG((analysis->'results'->>s)::numeric) val
    FROM analysis
    CROSS JOIN generate_series(0, 15) s
    GROUP BY category, s
) q
GROUP BY category
15 is in this case the last index of the array (16-1).
It can be done in a more traditional way, like:
select
    (t.analysis->'category')::varchar,
    array_math_avg(array(select jsonb_array_elements_text(t.analysis->'results')::int))::numeric(9,2)[]
from
    analysis t
group by 1 order by 1;
but we need to do some preparation:
create type t_array_math_agg as(
    c int[],      -- per-position count of non-null values seen
    a numeric[]   -- per-position running sum
);

create or replace function array_math_sum_f(in t_array_math_agg, in numeric[]) returns t_array_math_agg as $$
declare
    r t_array_math_agg;
    i int;
begin
    if $2 is null then
        return $1;
    end if;
    r := $1;
    for i in array_lower($2,1)..array_upper($2,1) loop
        if coalesce(r.a[i], $2[i]) is null then
            -- nothing but nulls seen at this position so far
            r.a[i] := null;
        else
            r.a[i] := coalesce(r.a[i],0) + coalesce($2[i],0);
            -- count only non-null inputs, so the average ignores nulls
            -- regardless of the order in which rows arrive
            if $2[i] is not null then
                r.c[i] := coalesce(r.c[i],0) + 1;
            end if;
        end if;
    end loop;
    return r;
end; $$ immutable language plpgsql;

create or replace function array_math_avg_final(in t_array_math_agg) returns numeric[] as $$
declare
    r numeric[];
    i int;
begin
    if array_lower($1.a, 1) is null then
        return null;
    end if;
    for i in array_lower($1.a,1)..array_upper($1.a,1) loop
        -- divide each running sum by its own non-null count
        r[i] := $1.a[i] / $1.c[i];
    end loop;
    return r;
end; $$ immutable language plpgsql;

create aggregate array_math_avg(numeric[]) (
    sfunc=array_math_sum_f,
    finalfunc=array_math_avg_final,
    stype=t_array_math_agg,
    initcond='({},{})'
);
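A quick sanity check of the aggregate on ad-hoc arrays (the values are made up purely for illustration):
select array_math_avg(a)::numeric(9,2)[] as avg
from (values
    (array[1, 2, null]::numeric[]),
    (array[3, 4, 5]::numeric[])
) v(a);
-- returns {2.00,3.00,5.00}: the null is ignored rather than treated as 0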

pg_upgrade: "saving database definition" taking a long time

I am in the process of a pg_upgrade from 8.4 to 9.3.
I am using this technique:
http://momjian.us/main/writings/pgsql/pg_upgrade.pdf
The upgrade has been running for 250 hours, and it has been on the "saving database definition" step for 160 hours. These are the last few lines of the current strace output:
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5, "T\0\0\0F\0\2reltoastrelid\0\0\0\4\353\0\n\0\0\0\32\0"..., 16384, 0, NULL, NULL) = 122
sendto(5, "Q\0\0\0\221SELECT attname, attacl FROM"..., 146, MSG_NOSIGNAL, NULL, 0) = 146
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1
Is there a way to estimate how much time it will take?
There are ~3,300,000 objects in pg_class and the database has around 765,000 tables. There are around 3-5 columns in most tables, and around 2,000,000 records in total.
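Not an answer on its own, but one way to at least confirm the dump step is still making progress is to watch which catalog query pg_dump is currently running on the old cluster. A sketch using the 8.4 column names of pg_stat_activity (procpid/current_query; from 9.2 on they are pid/query), filtering on the query visible in the strace output:
SELECT procpid,
       now() - query_start AS running_for,
       current_query
FROM pg_stat_activity
WHERE current_query LIKE 'SELECT attname%';
If that query keeps moving on to new relations, the dump is progressing, just slowly; with ~765,000 tables there are a lot of per-table catalog queries to get through.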