I'm creating an edge table, how to prevent duplicate edges - PostgreSQL

The query is like so:
CREATE TABLE Edge_Table AS
SELECT a.gid, nextval('ty') AS edge_gid,
ST_SetSRID(ST_MakeLine(a.geom, getcentroids(a.gid)),4326) AS geom_line
FROM Points_table a;
where my getcentroids function returns the 8 nearest points to each point, creating an edge to each of them. The problem is duplicates: the same edge is created from 1->2 and from 2->1. How do I optimise this query itself, since a large amount of data has to be processed? Could an index or a UNIQUE constraint help?
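One common fix is to normalise each pair with LEAST()/GREATEST() so that 1->2 and 2->1 collapse to the same key, then deduplicate on that key before assigning edge ids. A minimal sketch, assuming a hypothetical set-returning helper get_neighbours(gid) that yields each neighbour's gid and geom (adapt it to your getcentroids function):
CREATE TABLE edge_table AS
SELECT nextval('ty') AS edge_gid,
       e.gid_a, e.gid_b, e.geom_line
FROM (
    SELECT DISTINCT ON (LEAST(a.gid, n.gid), GREATEST(a.gid, n.gid))
           LEAST(a.gid, n.gid)    AS gid_a,
           GREATEST(a.gid, n.gid) AS gid_b,
           ST_SetSRID(ST_MakeLine(a.geom, n.geom), 4326) AS geom_line
    FROM points_table a
    CROSS JOIN LATERAL get_neighbours(a.gid) AS n(gid, geom) -- hypothetical helper
    ORDER BY LEAST(a.gid, n.gid), GREATEST(a.gid, n.gid)
) e;
Because each pair is stored only one way, a UNIQUE index on (gid_a, gid_b) can then guard later inserts (e.g. with INSERT ... ON CONFLICT DO NOTHING).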

SELECT query returns more records than exist

Background
I have a table with raster data (grib_data) created by using raster2pgsql.
I have created a second table (turb_mod) with a subset of the points in grib_data that has a value above a certain threshold.
This subset table (turb_mod) has been created with the following query
WITH turb AS (SELECT rid, rast, (ST_PixelAsPoints(rast)).val AS val
FROM grib_data
)
SELECT rid, rast INTO turb_mod
FROM turb WHERE val > 0.5;
The response when creating the table is "SELECT 53", indicating that turb_mod now holds 53 rows.
Problem
If I now try to return the raster data from turb_mod using the query below, it returns all records from the original table, not the 53 I am expecting:
SELECT (ST_PixelAsPoints(rast)).x AS x FROM turb_mod;
Questions
Why does my query not return only the 53 records?
Is there a better way to create a table with a selection of raster points from the original table? I want to use the subset to apply further geospatial functions like spatial clustering.
In your final SELECT, you're calling the function ST_PixelAsPoints, which is a set-returning function. This results in an output row [being] generated for each element of the function's result set (reference), and can thus result in a different row count to that of your source table, turb_mod.
Your query is functionally equivalent to this (preferred) syntax:
SELECT points.x
FROM turb_mod
JOIN LATERAL ST_PixelAsPoints(rast) points ON TRUE;
This syntax better shows what's happening, and also shows how you might choose to include more columns from the function's output, which may help to answer your second point.
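For the second point, one option is to materialise the thresholded pixels as point rows instead of keeping whole rasters, so later geospatial functions work on points directly. A sketch (turb_points is a name chosen here; x, y, val and geom are the columns ST_PixelAsPoints returns):
CREATE TABLE turb_points AS
SELECT grib_data.rid, p.x, p.y, p.val, p.geom
FROM grib_data
CROSS JOIN LATERAL ST_PixelAsPoints(rast) AS p
WHERE p.val > 0.5;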

ST_Intersects() query took too long

I'm working on a query using the PostGIS extension that performs a spatial join. Running the query took an incredibly long time and failed in the end. The query is as follows:
CREATE INDEX buffer_table_geom_idx ON buffer_table USING GIST (geom);
CREATE INDEX point_table_geom_idx ON point_table USING GIST (geom);
SELECT
point_table.*,
buffer_table.something
FROM
point_table
LEFT JOIN buffer_table ON ST_Intersects(buffer_table.geom, point_table.geom);
where point_table contains over 10 million rows of point records, and buffer_table contains only one multi-polygon geometry.
I would like to know if there is anything wrong with my code and how to adjust it. Thanks in advance.
With a LEFT JOIN you're going through every single record of point_table and therefore ignoring the index. Try this and see the difference:
SELECT point_table.*
FROM point_table
JOIN buffer_table ON ST_Contains(buffer_table.geom, point_table.geom);
Divide and conquer with ST_SubDivide
Considering the size of your multipolygon (see comments), it might be worth dividing it into smaller pieces, so that the number of vertices involved in each containment/intersection calculation is reduced, making the query less expensive.
First, divide the large geometry into smaller pieces and store them in another table (you can also use a CTE/subquery):
CREATE TABLE buffer_table_divided AS
SELECT ST_SubDivide(geom) AS geom FROM buffer_table;
CREATE INDEX buffer_table_geom_divided_idx ON buffer_table_divided USING GIST (geom);
.. and perform your query once again against this new table:
SELECT point_table.*
FROM point_table
JOIN buffer_table_divided d ON ST_Contains (d.geom, point_table.geom);
Demo: db<>fiddle
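Note that ST_SubDivide also takes a max_vertices parameter (default 256), so you can tune how finely the geometry is split; for example:
SELECT ST_SubDivide(geom, 128) AS geom FROM buffer_table;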

Search for all records within X distance of any records in a table

I have a table with a bunch of geographies of hospitals (roughly 100 rows), and another table with a bunch of geographies of something else (tens of thousands of rows). How do I select ALL of the latter records that are within X radius of ANY of the former records?
Use ST_DWithin() from PostGIS:
SELECT *
FROM whatever w
WHERE EXISTS (
SELECT FROM hospital h
WHERE ST_DWithin(h.the_geog, w.the_geog, $distance_in_meters)
);
The EXISTS semi-join is not only (probably) fastest, it also avoids duplicates that might come out of similar queries with a plain (OUTER) JOIN.
You should at least have this spatial GiST index:
CREATE INDEX ON hospital USING gist (the_geog);
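Depending on which side the planner drives the semi-join from, a matching index on the other table may help as well (a sketch, reusing the placeholder table name from the query above):
CREATE INDEX ON whatever USING gist (the_geog);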
Related:
PostGIS radius query

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all hold exactly 20 bytes of data; C sometimes contains NULLs.
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while index_B and index_C each took over 10 hours, at which point I aborted them. I ran every CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date (for example by running ANALYZE transactions;).
Then execute the following query:
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = '<your table name here>';
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (i.e. whether values in the column are physically stored in order) is close to 0.
I suspect you will see field A is different in terms of distinct values and/or a higher correlation than the other 2 fields.
Edit: Basically, creating an index means a FULL SCAN of the table, creating entries in the index as you progress. With the stats you have shared, that means:
Column A: it was detected as unique.
A single scan is enough, as the DB knows 1 record = 1 index entry.
Columns B & C: they were detected as having very few distinct values, and abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a low correlation means the indexes will probably not be used (an index is useful only when entries are not scattered across all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order AS SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B, C, A;
DROP TABLE Transactions_order;
The tricky part comes next: as records are inserted/updated/deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
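As a variation on the same idea, PostgreSQL's CLUSTER command rewrites the table in the order of an existing btree index; note it is a one-off rewrite and is not maintained on later writes. A sketch, assuming a btree index on B (the index name is illustrative):
CREATE INDEX transactions_b_btree ON transactions (B);
CLUSTER transactions USING transactions_b_btree;
ANALYZE transactions;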
Solution 3:
Create partitions and enjoy partition pruning.
There has been quite a lot of work on partitioning in PostgreSQL recently. It could be worth having a look into it, as sketched below.
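A minimal sketch of what that could look like with declarative hash partitioning on B (PostgreSQL 11+; names are illustrative):
CREATE TABLE transactions_part (
    A bytea,
    B bytea,
    C bytea
) PARTITION BY HASH (B);
CREATE TABLE transactions_part_0 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ...repeat for remainders 1 through 7, then load the data.
Equality lookups on B can then be pruned to a single partition, and per-partition indexes are much smaller to build.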

ST_Contains taking too much time

I am trying to match latitude/longitude to a particular neighborhood using the query below:
create table address_classification as (
select distinct buildingid,street,city,state,neighborhood,borough
from master_data
join
Borough_GEOM
on st_contains(st_astext(geom),coordinates) = 'true'
);
Here, coordinates has the following format:
ST_GeometryFromText('POINT('||longitude||' '||latitude||')') as coordinates
and geom is a column of type geometry.
I have already created indexes as below:
CREATE INDEX coordinates_gix ON master_data USING GIST (coordinates);
CREATE INDEX boro_geom_indx ON Borough_GEOM USING gist(geom);
I have almost 3 million records in the main table and 200 geometries in the GEOM table. Running the query (with EXPLAIN ANALYZE) takes a very long time (2 hrs).
Please let me know how I can optimize this query.
Thanks in advance.
As mentioned in the comments, don't use ST_AsText(): that doesn't belong there. It's casting the geom to text, and then going back to geom. But, more importantly, that process is likely to fumble the index.
If you're unique on only one column, use DISTINCT ON; there's no need to compare the others.
If you're unique on the ID column and you're only joining to add selectivity, consider using EXISTS (see the sketch after the query below). Do any of these columns other than geom come from borough_GEOM?
I'd start with something like this:
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid)
buildingid,
street,
city,
state,
neighborhood,
borough
FROM master_data
JOIN borough_GEOM
ON ST_Contains(geom,coordinates);
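If all of the selected columns live in master_data, the EXISTS form mentioned above could look like this (a sketch under that assumption):
CREATE TABLE address_classification AS
SELECT buildingid, street, city, state, neighborhood, borough
FROM master_data m
WHERE EXISTS (
    SELECT FROM borough_GEOM b
    WHERE ST_Contains(b.geom, m.coordinates)
);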