ST_contains taking too much time - postgresql

I am trying to match latitude/longitude to a particular neighbor location using below query
create table address_classification as (
select distinct buildingid,street,city,state,neighborhood,borough
from master_data
join
Borough_GEOM
on st_contains(st_astext(geom),coordinates) = 'true'
);
In this, coordinates is of below format
ST_GeometryFromText('POINT('||longitude||' '||latitude||')') as coordinates
and geom is of column type geometry.
i have already created indexes as below
CREATE INDEX coordinates_gix ON master_data USING GIST (coordinates);
CREATE INDEX boro_geom_indx ON Borough_GEOM USING gist(geom);
I have almost 3 million records in the main table and 200 geometric information in the GEOM table. Explain analyze of the query is taking so much time (2 hrs).
Please let me know, how can i optimize this query.
Thanks in advance.

As mentioned in the comments, don't use ST_AsText(): that doesn't belong there. It's casting the geom to text, and then going back to geom. But, more importantly, that process is likely to fumble the index.
If you're unique on only column then use DISTINCT ON, no need to compare the others.
If you're unique on the ID column and your only joining to add selectivity then consider using EXISTS. Do any of these columns come from the borough_GEOM other than geom?
I'd start with something like this,
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid),
buildingid,
street,
city,
state,
neighborhood,
borough
FROM master_data
JOIN borough_GEOM
ON ST_Contains(geom,coordinates);

Related

What is the best approach?

At work we have a SQL Server 2019 instance. There are two big tables in the same database that have to be joined to obtain specific data: one contains GPS data taken at 4 minutes interval, but there could be in between records as well. The important thing here is that there is a non-key attribute called file_id, a timestamp (DATE_TIME column), latitude and longitude. The other attributes are not relevant, and the key is autogenerated (identity column), so it's of no use to me.
The other table contains transaction records that have among other attributes a timestamp (FECHATRX column), and the same non-key file ID attribute the GPS table has, and also an autogenerated key with no relation at all with the other key.
For each file ID there are several records in both tables that have to be somewhat joined in order to obtain for a given file ID and transaction record both its latitude and longitude. The tables aren't ordered at all.
My idea is to pair records of the same file ID and I imagine it to be this way (I haven't done it yet because it was explained to me earlier today):
Order both tables by file ID and timestamp
For the same file ID all the transaction table records who have a timestamp equal or greater than the first timestamp from the GPS table and lower than the following timestamp from the same GPS table will be given both latitude and longitude values from that first record, for they are considered to belong to that latitude-longitude pair (actually they probably are somewhere in the middle, but this is an assumption and everybody agrees with this)
When a transaction record has a timestamp equal or greater than the second timestamp, then the third timestamp will act as an end point, all the records in between from the transaction table will obtain the same coordinates from the second record until one timestamp equals or be greater than the third, and so on until a new file ID is reached or there are no records left in one or both tables.
To me this sounds like nested cursors and several variables to save the first GPS record's values while we are also saving the second GPS record's timestamp for comparison purposes, and of course the file ID itself as a control variable, but is this the best way to obtain the latitude / longitude data for each and every transaction record from the GPS table?
Are other approaches better than using nested cursors?
As I said I haven't done anything yet, the only thing I can do is to show you some data from both tables, I just wanted to know if there is another (and simpler) way of doing this than using nested cursors.
Thank you.
Alejandro
No need to reorder tables or use a complex cursor loop. A properly constructed index can provide an efficient join, and a CROSS APPLY or OUTER_APPLY can be used to handle the complex "select closest prior GPS coordinate" lookup logic.
Assuming your table structure is something like:
GPS(gps_id, file_id, timestamp, latitude, longitude, ...)
Transaction(transaction_id, timestamp, file_id, ...)
First create an index on the GPS table to allow efficient lookup by file_id and timestamp.
CREATE INDEX IX_GPS_FileId_Timestamp
ON GPS(file_id, timestamp)
INCLUDE(latitude, longitude)
The INCLUDE clause is optional, but it allows the index to serve up lat/long without the need to access the primary table.
You can then use a query something like:
SELECT *
FROM Transaction T
OUTER APPLY (
SELECT TOP 1 *
FROM GPS G
WHERE G.file_id = T.file_id
AND G.timestamp <= T.timestamp
ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
SELECT TOP 1 *
FROM GPS G
WHERE G.file_id = T.file_id
AND G.timestamp >= T.timestamp
ORDER BY G.timestamp
) G2
CROSS APPLY and OUTER APPLY are like INNER JOIN and LEFT JOIN, but have more flexibility to define a subquery with complex conditions to handle cases like this.
The G1 subquery will efficiently select the immediately prior or equal GPS timestamp record with the same file_id. G2 does the same for equal or immediately following. Per your requirements, you only need G1, but having both might give you the opportunity to interpolate between the two points or to handle cases where there is no preceding matching record.
See this fiddle for a demo.

ST_Intersects() query took too long

I'm working on a query using the PostGIS extension that implements a 'spatial join' work. Running the query took an incredibly long time and failed in the end. The query is as follows:
CREATE INDEX buffer_table_geom_idx ON buffer_table USING GIST (geom);
CREATE INDEX point_table_geom_idx ON point_table USING GIST (geom);
SELECT
point_table.*,
buffer_table.something
FROM
point_table
LEFT JOIN buffer_table ON ST_Intersects (buffer_table.geom, point_table.geom);
where the point_table stands for a table that contains over 10 million rows of point records; the buffer_table stands for a table that contains only one multi-polygon geometry.
I would want to know if there is anything wrong with my code and ways to adjust. Thanks in advance.
With a LEFT JOIN you're going through every single record of point_table and therefore ignoring the index. Try this and see the difference:
SELECT point_table.*
FROM point_table
JOIN buffer_table ON ST_Contains(buffer_table.geom, point_table.geom);
Divide and conquer with ST_SubDivide
Considering the size of your multipolygon (see comments), it might be interesting to divide it into smaller pieces, so that the number of vertices for each containment/intersection calculation also gets reduced, consequently making the query less expensive.
First divide the large geometry into smaller pieces and store in another table (you can also use a CTE/Subquery)
CREATE TABLE buffer_table_divided AS
SELECT ST_SubDivide(geom) AS geom FROM buffer_table
CREATE INDEX buffer_table_geom_divided_idx ON buffer_table_divided USING GIST (geom);
.. and perform your query once again against this new table:
SELECT point_table.*
FROM point_table
JOIN buffer_table_divided d ON ST_Contains (d.geom, point_table.geom);
Demo: db<>fiddle

Postgres 9.4 which type of index would be ideal for a float column

I was in mysql and now have joined postgres and I have a table that is getting up to 300,000 new records a day but also has many reads. I have 2 columns that I think would be ideal for indexes: latitudes and longitudes and I Know that postgres has different types of indexes and my question is which type of index would be best for a table that has many writes and reads? This is the query for the reads
SELECT p.fullname,s.post,to_char(s.created_on, 'MON DD,YYYY'),last_reply,s.id,
r.my_id,s.comments,s.city,s.state,p.reputation,s.profile_id
FROM profiles as p INNER JOIN streams as s ON (s.profile_id=p.id) Left JOIN
reputation as r ON (r.stream_id=s.id and r.my_id=?) where s.latitudes >=?
AND ?>= s.latitudes AND s.longitudes>=? AND ?>=s.longitudes order by
s.last_reply desc limit ?"
As you can see the 2 columns in the where clause are latitudes and longitudes
PostgreSQL has the point data type with many operators that have good support from the gist index. So if at all possible change your table definition to use a point rather than 2 floats.
Inserting point data is very easy, just use point(longitudes, latitudes) for the column, instead of putting the two values in separate columns. Same with getting data out: lnglat[0] is the longitude and lnglat[1] is the latitude.
The index would be something like this:
CREATE INDEX idx_mytable_lnglat ON streams USING gist (lnglat pointops);
There is also the box data type, which would be great for grouping all your parameters and finding a point in a box is highly optimized in the gist index.
With a point in the table and a box to search on, your query reduces to this:
SELECT p.fullname, s.post, to_char(s.created_on, 'MON DD,YYYY'), last_reply, s.id,
r.my_id, s.comments, s.city, s.state, p.reputation, s.profile_id
FROM profiles AS p
JOIN streams AS s ON (s.profile_id = p.id)
LEFT JOIN reputation AS r ON r.stream_id = s.id AND r.my_id = ?
WHERE s.lnglat && box(?, ?, ?, ?)
ORDER BY s.last_reply DESC
LIMIT ?;
The phrase s.lnglat && box(?, ?, ?, ?) means "the value of column lnglat overlaps with (meaning: is inside) the box".
If the latitude or longitude columns are sorted, you would probably want to use a B-tree index.
From the Postgres documentation page on indices:
B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of the [greater than / lesser than-type operators]
You can read more about indices here.
Edit: Some of the G* indices look like they might be of use if you need to index on both latitude and longitude, since they appear to allow multi-dimensional (e.g. 2d) indexing.
Edit2: In order to actually create the index, you'd want to do something along the lines of (although you may need to change the table name to suite your needs):
CREATE INDEX idx_lat ON s(latitudes);
Take note that B-tree indices are default so you don't need to specify the type.
Read more about index creation here.

number of points within a radius of another set of points

I have two tables. One is a list of stores (with lat/long). The other is a list of customer addresses (with lat/long). What I want is a query that will return the number of customers within a certain radius for each store in my table. This gives me the total number of customers within 10,000 meters of ANY store, but I'm not sure how to loop it to return one row for each store with a count.
Note that I'm doing this queries using cartoDB, where the_geom is basically long/lat.
SELECT COUNT(*) as customer_count FROM customer_table
WHERE EXISTS(
SELECT 1 FROM store_table
WHERE ST_Distance_Sphere(store_table.the_geom, customer_table.the_geom) < 10000
)
This results in a single row :
customer_count
4009
Suggestions on how to make this work against my problem? I'm open to doing this other ways that might be more efficient (faster).
For reference, the column with store names, which would be in one column is store_identifier.store_table
I'll assume that you use the_geom to represent the coordinate (lat/lon) of store and customer. I will also assume that the_geom is of geography type. Your query will be something like this
select s.id, count(*) as customer_count
from customers c
inner join stores s
on st_dwithin(c.the_geom, s.the_geom, 10000)
group by s.id
This should give you neat table with a store id and count of customers within 10,000 meters from the store.
If the_geom is of type geometry, you query will be very similar but you should use st_distance_sphere() instead in order to express distance in kilometers (not degrees).

How do I use ST_Contains in following case?

I have two tables. First with points, and second with polygons. I need to find out which points are in required polygon according to the attribute gid.
Using query: SELECT table1.* FROM table1, table2 WHERE table2.gid=1 AND ST_Contains(table2.geom2, table1.geom1);
What I get is empty table (only columns without data)...
Tnx
Are you sure there are any intersecting points? Try
SELECT COUNT(*) FROM table2 WHERE table2.gid=1
It should return 1.
Another thing you could try is using ST_Intersects instead of ST_Contains.
Otherwise, you might need to post some data dumps of data you think should match.