How to optimize a query that takes too long - PostgreSQL

I have a query that computes, for each person, what share of the locations are up to 100 meters away (relative to all person/location distances):
select person_tbl.tdm,
       sum((st_distance(person_tbl.geo, location_tbl.geo) < 100)::int)::float / count(*)
from persons as person_tbl, locations as location_tbl
where person_tbl.geo is not null
group by person_tbl.tdm
Both tables have GiST indexes on the geometry column:
create index persons_geo_idx on persons using gist(geo)
create index locations_geo_idx on locations using gist(geo)
In the first table (persons) the geo values are POLYGON geometries.
In the second table (locations) the geo values are POINT Z, POLYGON Z, or MULTIPOLYGON Z geometries.
The persons table contains ~2M rows and the locations table contains ~500 rows.
The query takes too long (~2 hours).
The values of max_parallel_processes and max_parallel_workers are 8.
Is there something I can do to reduce the query's runtime (2 hours seems too long)?
Is there a better way to write the query, or do I need to define the indexes differently?
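One rewrite worth trying (a sketch, assuming geo is stored in a metric SRID so that 100 really means 100 meters) is to replace the exact st_distance calculation with ST_DWithin, which can reject far-apart pairs with a cheap bounding-box test instead of computing an exact distance for every one of the roughly one billion person/location pairs:
select p.tdm,
       count(*) filter (where st_dwithin(p.geo, l.geo, 100))::float / count(*)
from persons as p
cross join locations as l
where p.geo is not null
group by p.tdm
Note that ST_DWithin uses <= where the original query used a strict <, so pairs at exactly 100 meters are counted; with only ~500 locations, letting the scan of persons run in parallel workers is usually the bigger win.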

What is wrong with my ST_Within query - query result contains point twice although it exists only once

I have two point tables, tab_1 and tab_2. I want to query all points from the first table that are probably the same points as in table 2, so I give the points from table 2 a buffer and select the points from table 1 that fall within a 30 m buffer around the points from table 2. My problem is that I get the point pairs from table 1 and table 2 twice, even though point 1 from table 1 exists only once and point 1 from table 2 also exists only once.
My query is:
with points1 as (
    select id, geom from tab_1
),
points2 as (
    select id, geom from tab_2
)
select points1.*, points2.*
from points1, points2
where st_within(st_transform(points1.geom, 31468), st_buffer(st_transform(points2.geom, 31468), 30)) = true;
id_tab1   | geom                         | id_tab2 | geom                      | st_distance
767074270 | POINT (11.6968379 48.132722) | 16455   | POINT (11.69707 48.13265) | 19.041083533921977
767074270 | POINT (11.6968379 48.132722) | 16455   | POINT (11.69707 48.13265) | 19.041083533921977
The query should give only one result:
id_tab1   | geom                         | id_tab2 | geom                      | st_distance
767074270 | POINT (11.6968379 48.132722) | 16455   | POINT (11.69707 48.13265) | 19.041083533921977
Is my query wrong?
STEP 1. Query
SELECT *
FROM tab_1
JOIN tab_2
ON ST_DWithin
( ST_Transform(tab_1.geom, 31468)
, ST_Transform(tab_2.geom, 31468)
, 30
)
STEP 2. Spatial index
Most likely the query cannot use an existing spatial index together with ST_DWithin(), because wrapping the geometry column in ST_Transform() prevents the planner from using an index built on the raw column.
Solution - create new spatial indexes for EPSG:31468
CREATE
INDEX tab_1_geom_31468_idx
ON tab_1
USING GIST (ST_Transform(geom, 31468))
;
CREATE
INDEX tab_2_geom_31468_idx
ON tab_2
USING GIST (ST_Transform(geom, 31468))
;
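With those functional indexes in place, the STEP 1 query can use them, because its ST_DWithin() arguments match the indexed expressions exactly. A quick way to confirm (a sketch, reusing the id columns from the question) is to look at the plan:
EXPLAIN (ANALYZE)
SELECT tab_1.id, tab_2.id
FROM tab_1
JOIN tab_2
  ON ST_DWithin(
       ST_Transform(tab_1.geom, 31468),
       ST_Transform(tab_2.geom, 31468),
       30
     );
If the indexes are picked up, the plan shows an index scan on tab_1_geom_31468_idx or tab_2_geom_31468_idx instead of sequential scans on both tables.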

PostGIS: Query z and m dimensions (linestringzm)

Question
I have a system with multiple linestringzm geometries, where the data is structured the following way: [x, y, speed:int, time:int]. The data is structured this way to be able to use ST_SimplifyVW on the x, y and z dimensions, but I still want to be able to query the linestrings based on the m dimension, e.g. get all linestrings within a time interval.
Is this possible with PostGIS or am I structuring the data incorrectly for my use case?
Example
z = speed e.g. km/h
m = Unix epoch time
CREATE TABLE t (id int NOT NULL, geom geometry(LineStringZM,4326), CONSTRAINT t_pkey PRIMARY KEY (id));
INSERT INTO t VALUES (1, 'SRID=4326;LINESTRING ZM(30 10 5 1620980688, 30 15 10 1618388688, 30 20 15 1615710288, 30 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (2, 'SRID=4326;LINESTRING ZM(50 10 5 1620980688, 50 15 10 1618388688, 50 20 15 1615710288, 50 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (3, 'SRID=4326;LINESTRING ZM(20 10 5 1620980688, 20 15 10 1618388688, 20 20 15 1615710288, 20 25 20 1620980688)'::geometry);
Use case A: Simplify the geometry based on x, y and z
This can be accomplished by e.g. ST_SimplifyVW, which keeps the m dimension after simplification.
Use case B: Query geometry based on the m dimension
I have a set of linestringzm which I want to query based on my time dimension (m). The result is either the full geometry, if every m value lies between e.g. 1618388000 and 1618388700, or the part of the geometry that satisfies the predicate. What is the most efficient way to query the data?
If you want to check every single point of your LineString you could dump them with ST_DumpPoints and read the M dimension with ST_M. After that, extract the subset of points whose M values fall in the interval and rebuild the LineString by applying ST_MakeLine with a GROUP BY:
WITH j AS (
  SELECT id, geom, (ST_DumpPoints(geom)).geom AS p
  FROM t
)
SELECT id, ST_AsText(ST_MakeLine(p))
FROM j
WHERE ST_M(p) BETWEEN 1618388000 AND 1618388700
GROUP BY id;
Demo: db<>fiddle
Note: Depending on your table and LineString sizes this query may become pretty slow, as the M values are parsed at query time and therefore aren't indexed. IMHO a more elegant alternative would be ..
.. 1) to create a tstzrange column
ALTER TABLE t ADD COLUMN line_interval tstzrange;
.. 2) to properly index it
CREATE INDEX idx_t_line_interval ON t USING gist (line_interval);
.. and 3) to populate it with the time of geom's first and last points:
UPDATE t SET line_interval =
tstzrange(
to_timestamp(ST_M(ST_PointN(geom,1))),
to_timestamp(ST_M(ST_PointN(geom,ST_NPoints(geom)))));
After that you can speed things up by checking whether the indexed column overlaps with a given interval. This will significantly improve query time:
SELECT * FROM t
WHERE line_interval && tstzrange(
to_timestamp(1618138148),
to_timestamp(1618388700));
Demo: db<>fiddle
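The two approaches can also be combined: let the indexed range column prefilter the candidate rows, then refine with the per-point ST_M check from the first query. A sketch, assuming line_interval has been populated as above so that the first and last points bound the M values:
WITH candidates AS (
  SELECT id, geom
  FROM t
  WHERE line_interval && tstzrange(to_timestamp(1618388000), to_timestamp(1618388700))
), j AS (
  SELECT id, (ST_DumpPoints(geom)).geom AS p
  FROM candidates
)
SELECT id, ST_AsText(ST_MakeLine(p))
FROM j
WHERE ST_M(p) BETWEEN 1618388000 AND 1618388700
GROUP BY id;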
Further reading:
ST_M
ST_PointN
ST_NPoints
PostgreSQL Built-in Range Types

Indexing issue in postgres

I cannot manage to speed up the database with indexing.
I create a table:
CREATE TABLE IF NOT EXISTS coordinate( Id serial primary key,
Lat DECIMAL(9,6),
Lon DECIMAL(9,6));
After that I add indexes:
CREATE INDEX indeLat ON coordinate(Lat);
CREATE INDEX indeLon ON coordinate(Lon);
Then the table is filled in:
INSERT INTO coordinate (Lat, Lon) VALUES(48.685444, 44.474254);
I fill in 100k random coordinates like this.
Now I need to return all coordinates that lie within a radius of N km of a given coordinate.
SELECT id, Lat, Lon
FROM coordinate
WHERE acos(sin(radians(48.704578)) * sin(radians(Lat))
         + cos(radians(48.704578)) * cos(radians(Lat)) * cos(radians(Lon) - radians(44.507112))) * 6371 < 50;
The test execution time is approximately 0.2 seconds, and if I do not run CREATE INDEX at all, the time does not change. I suspect there is an error in the query; maybe it needs to be rewritten somehow?
Sorry for my English.
An index can only be used if the indexed expression is exactly what you have on the non-constant side of the operator. That is obviously not the case here.
For operations like this, you need to use the PostGIS extension. Then you can define a table like:
CREATE TABLE coordinate (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
p geography NOT NULL
);
and query like this:
SELECT id, p
FROM coordinate
-- WKT points are (lon lat), and geography distances are in meters, so 50 km = 50000
WHERE ST_DWithin(p, 'POINT(44.507112 48.704578)'::geography, 50000);
This index would speed up the query:
CREATE INDEX ON coordinate USING gist (p);
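If the existing Lat/Lon columns should be migrated rather than re-imported, the geography column can be filled from them; a rough sketch (note that ST_MakePoint takes longitude first):
ALTER TABLE coordinate ADD COLUMN p geography;
UPDATE coordinate
SET p = ST_SetSRID(ST_MakePoint(Lon, Lat), 4326)::geography;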

PostgreSQL hierarchical nested set huge database

I have a database that must store thousands of scenarios (each scenario with a single unix_timestamp value). Each scenario has 1,800,000 records organized in a nested set structure.
The general table structure is given by:
table_skeleton:
- unix_timestamp integer
- lft integer
- rgt integer
- value
Usually my SELECTs fetch all nested values within a specific scenario, for example:
SELECT * FROM table_skeleton WHERE unix_timestamp = 123 AND lft >= 10 AND rgt <= 53
So I divided my table hierarchically into a master table and child tables grouped by date, for example:
table_skeleton_201303 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
and
table_skeleton_201304 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
And I also created an index on each child table matching the usual searches I expect, for example:
Create Index idx_201303
on table_skeleton_201303
using btree(unix_timestamp, lft, rgt)
It improved retrieval, but each select still takes about 1 minute.
I imagined this was because the index was too big to always fit in memory, so I tried to create a partial index for each timestamp, for example:
Create Index idx_201303_1362981600
on table_skeleton_201303
using btree(lft, rgt)
WHERE unix_timestamp = 1362981600
And in fact this second type of index is much, much smaller than the general one. However, when I run EXPLAIN ANALYZE for the SELECT I showed above, the query planner ignores my new partial index and keeps using the giant old one.
Is there a reason for that?
Is there any new approach to optimize such type of huge nested set hierarchical database?
When you filter a table by field_a > x and field_b > y, an index on (field_a, field_b) may (depending on the data distribution and the fraction of rows with field_a > x, as per the collected statistics) be used only for field_a > x, with field_b > y then checked row by row among the matches.
In that case, having two indexes, one per field, lets each be used and the two results combined, the internal equivalent of:
SELECT *
FROM my_table t
JOIN (SELECT id FROM my_table WHERE field_a > x) ta ON (ta.id = t.id)
JOIN (SELECT id FROM my_table WHERE field_b > y) tb ON (tb.id = t.id);
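On the question's child tables, that would mean something like the following (a sketch; whether the planner actually combines the two indexes with a bitmap AND depends on selectivity):
CREATE INDEX ON table_skeleton_201303 USING btree (lft);
CREATE INDEX ON table_skeleton_201303 USING btree (rgt);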
There is a chance you could benefit from a GiST index, treating your lft and rgt fields as a point:
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed to include the integer column in a GiST index
CREATE INDEX ON table_skeleton_201303 USING GIST (unix_timestamp, point(lft, rgt));
SELECT *
FROM table_skeleton
WHERE unix_timestamp = 123 AND
      point(lft, rgt) <@ box(point(10, '-inf'), point('inf', 53));

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database with one particular table that has many rows. One column in this table, called data, is a float array (REAL[]) and gets filled with ~4500 elements per row. I want to access this table through queries via SQLAlchemy and the ORM.
How do I select all rows where a subset of this column satisfies some condition, e.g. contains values in a given range? For example, I want to select all rows where data contains values >= 10, or values between 10 and 20 inclusive.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods or setup when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}

    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the comment below, here is the code I'm using to try to select all rows (out of 1000) whose data column contains values >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct number of rows: len(rows) = 537. However, I don't understand the logic of this operator: to select data >= 1.0, I use the le operator? Along the same lines, there should be 234 rows that have data values between 1.0 and 1.2, but this statement fails to give the correct subset:
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database table with a few rows; pk is an integer and data is a real[].
db: datadb
table: Table
pk  data
0   [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1   [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2   [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3   [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4   [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows; 4 of them have data values >= 1.0, while just 2 have values in the range 1.0 to 1.2 inclusive. In the first case, the query I would use to grab the rows is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows at pk=0,1,2,3, and this query does what I expect. The second case is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
and should return the 2 rows at pk=1,3. However, this query just returns the same 4 rows as the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following. Note that Table.data.any(10, operator=operators.le) renders as 10 <= ANY (data), with the bound value on the left-hand side of the operator, which is why le is the one that matches elements >= 10:
from sqlalchemy.sql import operators

rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))
        .filter(Table.data.any(20, operator=operators.ge))
        ).all()
EDIT:
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]

for item in items:
    print(item)
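Since each of the two ANY predicates can be satisfied by a different array element, chaining them cannot express "some element lies between 1.0 and 1.2". If filtering in memory becomes too slow, one alternative (a sketch; the schema, table and column names are taken from the question's mapping) is to push the range check into PostgreSQL with unnest and run the raw SQL through SQLAlchemy's text() construct:
-- returns the rows whose data array contains at least one value in [1.0, 1.2]
SELECT pk, data
FROM testdb."table" AS t
WHERE EXISTS (
    SELECT 1
    FROM unnest(t.data) AS elem
    WHERE elem BETWEEN 1.0 AND 1.2
);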