PostGIS DB Structure: Identical tables for multiple years?

I have multiple datasets for different years as shapefiles and converted them to Postgres tables, which leaves me with the following situation:
I now have tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema; for now let's focus on the following columns (the example is one row out of the boris2018 table). The rows represent actual PostGIS geometries with certain properties.
 brw | brwznr |  gema   | entw | nuta
-----+--------+---------+------+------
 290 | 285034 | Sieglar | B    | W
The 'brwznr' column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
 brw18 | brw17 | brw16 | brwznr |  gema   | entw | nuta
-------+-------+-------+--------+---------+------+------
   290 |   260 |   250 | 285034 | Sieglar | B    | W
This led to some data getting lost (because no matching brwznr was found), some data being wrongly matched (due to inconsistencies in the data), and it didn't feel right.
What I actually want to achieve is fast queries that get me the different brw values for a certain coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong/has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope someone can help; your time and effort are highly appreciated!
Tim

Have you tried using a CTE?
WITH j AS (
    SELECT ortst, brw, gema, gena, geom FROM boris2016  -- geom must be selected here so the outer WHERE can use it
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2017
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest one when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index on the geom field. Let me know in the comments if that's your situation.
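In case it helps, here is a minimal sketch of that materialized approach, reusing the column names from the question and the boris table name from the second desired query; the yr column and the index name are my own additions. It keeps one row per year instead of the wide brw18/brw17/brw16 layout:

CREATE TABLE boris AS
SELECT 2016 AS yr, ortst, brw, gema, gena, geom FROM boris2016
UNION ALL
SELECT 2017, ortst, brw, gema, gena, geom FROM boris2017
UNION ALL
SELECT 2018, ortst, brw, gema, gena, geom FROM boris2018;

-- spatial index so the point lookup does not scan the whole table
CREATE INDEX boris_geom_idx ON boris USING gist (geom);

-- one row per year that intersects the coordinate
SELECT yr, ortst, brw, gema, gena
FROM boris
WHERE ST_Intersects(geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));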

Related

Optimise a simple update query for large dataset

I have some data migration that has to occur between a parent and child table. For the sake of simplicity, the schemas are as follows:
+-------+     +-----------+
| event |     | parameter |
+-------+     +-----------+
| id    |     | id        |
| order |     | eventId   |
+-------+     | order     |
              +-----------+
Because of an oversight in business logic that still needs to be performed, we need to update parameter.order to match the parent event.order. I have come up with the following SQL to do that:
UPDATE "parameter"
SET "order" = e."order"
FROM "event" e
WHERE "eventId" = e.id
The problem is that this query still hadn't finished after more than 4 hours and I had to clock out, so I cancelled it.
There are 11 million rows on parameter and 4 million rows on event. I've run EXPLAIN on the query and it tells me this:
Update on parameter  (cost=706691.80..1706622.39 rows=11217313 width=155)
  ->  Hash Join  (cost=706691.80..1706622.39 rows=11217313 width=155)
        Hash Cond: (parameter."eventId" = e.id)
        ->  Seq Scan on parameter  (cost=0.00..435684.13 rows=11217313 width=145)
        ->  Hash  (cost=557324.91..557324.91 rows=7724791 width=26)
              ->  Seq Scan on event e  (cost=0.00..557324.91 rows=7724791 width=26)
Based on this article, the "cost" referenced by EXPLAIN is an "arbitrary unit of computation".
Ultimately, this update needs to be performed, but I would accept it happening in one of two ways:
I am advised of a better way to do this query that executes in a timely manner (I'm open to all suggestions, including updating schemas, indexing, etc.)
The query remains the same but I can somehow get an accurate prediction of execution time (even if it's hours long). This way, at least, I can manage the expectations of the team. I understand that the actual time can't be known without running the query, but is there an easy way to "convert" these arbitrary units into an approximate execution time in milliseconds?
Edit for Jim Jones' comment:
I executed the following query:
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
I got 9 identical rows like the following:
  pid  | locktype |      mode       |    query    |     query_start     | state
-------+----------+-----------------+-------------+---------------------+--------
 23192 | relation | AccessShareLock | <see below> | 2021-10-26 14:10:01 | active
query column:
--update parameter
--set "order" = e."order"
--from "event" e
--where "eventId" = e.id
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
Edit 2: I think I've been stupid here... The query produced by checking these locks is just the commented query. I think that means there's actually nothing to report.
If some rows already have the target value, you can skip those empty updates (which would otherwise be performed at full cost). Like:
UPDATE parameter p
SET "order" = e."order"
FROM event e
WHERE p."eventId" = e.id
AND p."order" IS DISTINCT FROM e."order"; -- this
If both "order" columns are defined NOT NULL, simplify to:
...
AND p."order" <> e."order";
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you have to update all or most rows - and can afford it! - writing a new table may be cheaper overall, like Mike already mentioned. But concurrency and dependent objects may stand in the way.
Aside: use legal, lower-case identifiers, so you don't have to double-quote. Makes your life with Postgres easier.
The query will be slow because for each UPDATE operation, it has to look up the index by id. Even with an index, on a large table, this is a per-row read/write so it is slow.
I'm not sure how to get a good estimate, maybe do 1% of the table and multiply?
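One way to try that (a sketch of my own, not from the question; it assumes parameter.id is an integer key with a roughly uniform distribution, and note that even a rolled-back UPDATE leaves dead row versions behind):

BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE "parameter" p
SET    "order" = e."order"
FROM   "event" e
WHERE  p."eventId" = e.id
AND    p.id % 100 = 0;   -- roughly a 1% sample of the rows

ROLLBACK;  -- discard the sample update

The extrapolation is rough, because the hash over event is built in full even for the sample, but it gives an order of magnitude.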
I suggest creating a new table, then dropping the old one and renaming the new table.
CREATE TABLE parameter_new AS
SELECT parameter.id,
       parameter."eventId",
       e."order"
FROM   parameter
JOIN   event e ON e.id = parameter."eventId";
Later, once you verify things:
ALTER TABLE parameter RENAME TO parameter_old;
ALTER TABLE parameter_new RENAME TO parameter;
Later, once you're completely certain:
DROP TABLE parameter_old;

Gist index on Postgres/PostGIS still slow

I am not an expert on Postgres/GIS subjects and I have an issue with a large database (over 20 million records) of geometries. First of all, my setup looks like this:
mmt=# select version();
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------
version | PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
mmt=# select PostGIS_Version();
-[ RECORD 1 ]---+--------------------------------------
postgis_version | 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
The table that I am querying contains the following columns:
mmt=# \d titles
Table "public.titles"
Column | Type | Collation | Nullable | Default
----------------------+--------------------------+-----------+----------+-----------------------------------------
ogc_fid | integer | | not null | nextval('titles_ogc_fid_seq'::regclass)
wkb_geometry | bytea | | |
timestamp | timestamp with time zone | | |
end | timestamp with time zone | | |
gml_id | character varying | | |
validfrom | character varying | | |
beginlifespanversion | character varying | | |
geom_bounding_box | geometry(Geometry,4326) | | |
Indexes:
"titles_pkey" PRIMARY KEY, btree (ogc_fid)
"geom_idx" gist (geom_bounding_box)
The geom_bounding_box column holds the bounding box of the wkb_geometry. I created that bounding box column because the wkb geometries exceed the default size limits for items in a GiST index. Some of them are quite complex geometries with several dozen points making up a polygon. Using a bounding box instead meant I was able to put an index on that column as a way of speeding up the search... at least that's the theory.
My search aims to find geometries that fall within 100 metres of a given point, as follows. However, this takes well over two minutes to return, and I want to get that under one second:
select ogc_fid, wkb_geometry from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
Below is a basic explain output. What can I do to speed this thing up?
Thank you!
mmt=# explain select ogc_fid from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Gather (cost=1000.00..243806855.33 rows=2307 width=4)
-[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Workers Planned: 2
-[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Parallel Seq Scan on titles (cost=0.00..243805624.63 rows=961 width=4)
-[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Filter: st_dwithin((geom_bounding_box)::geography, '0101000020E61000006878B306EFABC2BF6308008E3DC14940'::geography, '100'::double precision, true)
-[ RECORD 5 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | JIT:
-[ RECORD 6 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Functions: 4
-[ RECORD 7 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Options: Inlining true, Optimization true, Expressions true, Deforming true
The problem is that you are mixing geometry and geography, and PostgreSQL casts geom_bounding_box to geography so that they match.
Now you have indexed geom_bounding_box, but not geom_bounding_box::geography, which is something different.
Either use 'SRID=4326;POINT(-0.145872 51.509691)'::geometry as second operand or create the GiST index on ((geom_bounding_box::geography)) (note the double parentheses).
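For illustration, the second option could look like this (the index name is my own); with this expression index in place, the original geography-based ST_DWithin query can use it:

CREATE INDEX titles_geog_idx ON titles USING gist ((geom_bounding_box::geography));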
EDIT:
As pointed out by mlinth, my answer below is not really valid. It raises a danger though: beware of the arguments given to the ST_DWithin function, because the unit of the distance argument is inferred differently depending on whether you pass geographies (meters) or geometries (SRID units).
According to the ST_DWithin doc, the distance is specified in SRID units. In your case, the spatial reference system is a geographic one, so your value of 100 means a 100 degree radius, not 100 meters. That covers approximately the entire world. In such a case, efficiently using the index is impossible.
If you want to find geometries within a 100 meter radius, you must convert 100 meters into degrees, and that conversion depends on latitude (if you want to be accurate).
To start, I'd recommend using a (very) approximate shortcut: 100 meters at the equator is (very) approximately equal to 0.001 degrees. So replace your distance value with that, and if it speeds things up (and I'm pretty convinced it will), then you will be able to refine your query to be more accurate.
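Put concretely, a sketch of that shortcut with geometry operands on both sides (0.001 degrees is only a rough stand-in for 100 m, and the east-west scale at 51° latitude is noticeably smaller than at the equator):

SELECT ogc_fid
FROM   titles
WHERE  ST_DWithin(geom_bounding_box,
                  'SRID=4326;POINT(-0.145872 51.509691)'::geometry,
                  0.001);  -- approximate degree radius, not meters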
I did resolve this and it was a combination of all of the above things, although not any one of them alone. As a quick summary:
Laurenz Albe was right in spotting the mix of geography and geometry types, which was easy to fix by removing the cast.
Ian Turton was also right in spotting that dozens of points shouldn't be an issue for a GiST index, so I abandoned the bounding-box approximation and went back to exploring the index issues. What I found was that the geometry column was defined with a data type of byte array (bytea), which prevents creation of an SP-GiST index with the error 'no default operator class for access method "spgist"'. This was resolved by changing the column type as follows:
mmt=# ALTER TABLE titles
ALTER COLUMN wkb_geometry
TYPE geometry
USING wkb_geometry::geometry;
The index then creates successfully (either GiST or SP-GiST) and I have been able to benchmark the two side by side, finding GiST to be slightly more efficient in my use case.
Amanin was also right to point out the differences between meters and radial degrees according to the spatial reference system. In some of my tests I was erroneously using the latter, but on very large radii. Since I'm indexing and searching with geometry types, that radius value needs to be very small in radial degrees in order to cover quite large areas. Fixed!
All put together, and searches across 26 million records consistently complete in 200ms to 500ms, with occasional spikes up to 1.1s. This is pretty good.
Thanks all who contributed input, ideas and discussion.

How to combine PostgreSQL text search with fuzzystrmatch

I'd like to be able to query words from a column of type tsvector, but everything that has a Levenshtein distance below X should be considered a match.
Something like this where my_table is:
 id |       my_ts_vector_column       |   sentence_as_text
----+---------------------------------+----------------------
  1 | 'bananna':3 'tasty':2 'very':1  | Very tasty bananna
  2 | 'banaana':2 'yellow':1          | Yellow banaana
  3 | 'banana':2 'usual':1            | Usual banana
  4 | 'baaaanaaaanaaa':2 'black':1    | Black baaaanaaaanaaa
I want to query something like "Give me the IDs of all rows that contain the word banana or words similar to banana, where similar means a Levenshtein distance of less than 4". So the result should be 1, 2 and 3.
I know I can do something like select id from my_table where my_ts_vector_column @@ to_tsquery('banana');, but this would only get me exact matches.
I also know I could do something like select id from my_table where levenshtein(sentence_as_text, 'banana') < 4;, but this works only on a text column and only if the sentence contained nothing but the word banana.
But I don't know if or how I could combine the two.
P.S. The table I want to execute this on contains about 2 million records, and the query should be blazing fast (less than 100 ms for sure).
P.P.S. - I have full control over the table's schema, so changing data types, creating new columns, etc. would be totally feasible.
2 million short sentences presumably contain far fewer distinct words than that. But if all your sentences have "creative" spellings, maybe not.
So you can perhaps create a table of distinct words to search relatively quickly with the unindexed distance function:
create materialized view words as
select distinct unnest(string_to_array(lower(sentence_as_text),' ')) word from my_table;
And create an exact index into the larger table:
create index on my_table using gin (string_to_array(lower(sentence_as_text),' '));
And then join them together:
select * from my_table join words
ON (ARRAY[word] <@ string_to_array(lower(sentence_as_text),' '))
WHERE levenshtein(word,'banana') < 4;
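One caveat of my own to add: a materialized view is a snapshot, so if my_table changes you would need to refresh it before the word list is current again, for example:

REFRESH MATERIALIZED VIEW words;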

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both null, that's OK). This is a sample of the result from the join:
sql | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
1880040001 | | PA 10
1880040001 | | PA 10
0863690003 | | PA 4
0850840001 | | PA 4
3090500003 | | PA 4
1330090001 | | PA 10
1201410001 | | PA 9
0550620002 | | PA 6
0430790001 | | PA 1
1340180002 | | PA 9
I used QGIS to visualize the results and could not find any clues as to why it is happening. The sirgas_lotes_centroid table is derived from the other table; its geometry is the centroid of the polygon. I used the centroid to perform faster spatial joins and now need to place the information into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Then results are not visible to other transactions before the transaction is actually committed with COMMIT. It might hang indefinitely (which creates additional problems), or be rolled back by some later interaction.
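For illustration only, a minimal sketch using the query from the question (nothing here beyond standard BEGIN/COMMIT):

BEGIN;

UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;

COMMIT;  -- without this, other sessions never see the 647 updated rows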
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example of how to do that (chapter "Multiple matches..."):
Updating the value of a column
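A minimal sketch of one way to make that pick deterministic (the ORDER BY tiebreaker is an assumption of mine, not something defined in the question):

UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql)
          sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental   -- hypothetical tiebreaker: smallest value wins
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;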
Always include exact table definitions (CREATE TABLE statements) with any such question. That would save a lot of time otherwise wasted on speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.

Does PostgreSQL have a way of creating metadata about the data in a particular table?

I'm dealing with a lot of unique data that has the same type of columns, but each group of rows has different attributes, and I'm trying to see whether PostgreSQL has a way of storing metadata about groups of rows in a database, or whether I would be better off adding custom columns to my current list of columns to track these different attributes. Microsoft Excel, for instance, lets you merge multiple columns into a super-column to group them, but I don't know how this would translate to a PostgreSQL database. Thoughts, anyone?
Right, can't upload files. Hope this turns out well.
  Section 1  |  Section 2  |  Section 3
=========================================
 Num1 | Num2 | Num1 | Num2 | Num1 | Num2
=========================================
  132 |  163 |  334 | 1345 |  343 |  433
  ...
have a "super group" of columns (In SQL in general, not just postgreSQL), the easiest approach is to use multiple tables.
Example:
A person table can have columns of
person_id, first_name, last_name
An employee table can have columns of
person_id, department, manager_person_id, salary
A customer table can have columns of
person_id, addr, city, state, zip
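A rough DDL sketch of that layout (types and constraints are my assumptions, not part of the original answer):

CREATE TABLE person (
    person_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    first_name text,
    last_name  text
);

CREATE TABLE employee (
    person_id         bigint PRIMARY KEY REFERENCES person,  -- one employee row per person
    department        text,
    manager_person_id bigint REFERENCES person,
    salary            numeric
);

CREATE TABLE customer (
    person_id bigint PRIMARY KEY REFERENCES person,          -- one customer row per person
    addr      text,
    city      text,
    state     text,
    zip       text
);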
That way, you can join them together to do whatever you like.
Example:
select *
from person p
left outer join customer c on c.person_id = p.person_id
left outer join employee e on e.person_id = p.person_id
Or any variation, while separating the data into different types and PERHAPS saving a little disk space in the process (for example, if most "people" are "customers", they don't need a bunch of employee data floating around or a pile of nullable columns).
That's how I normally handle this type of situation, but without a practical example, it's hard to say what's best in your scenario.