Find cluster given node in PostgreSQL

I am representing a graph in Postgres 9.1 (happens to be bidirectional and cyclic):
CREATE TABLE nodes (
  id SERIAL PRIMARY KEY,
  name text
);

CREATE TABLE edges (
  id SERIAL PRIMARY KEY,
  node1_id int REFERENCES nodes(id),
  node2_id int REFERENCES nodes(id)
);
Given a particular node ID, I want to retrieve all other nodes in that cluster. I started with the "Paths from a single node" example here, and this is as far as I got:
WITH RECURSIVE search_graph(id, path) AS (
  SELECT id, ARRAY[id]
  FROM nodes
  UNION
  SELECT e.node2_id, sg.path || e.node2_id
  FROM search_graph sg
  JOIN edges e ON e.node1_id = sg.id
)
-- find all nodes connected to node 3
SELECT DISTINCT id FROM search_graph WHERE path @> ARRAY[3];
I can't figure out a) if there is a simpler way to write this since I don't care about collecting the full path, and b) how to make it traverse in both directions (node1->node2 and node2->node1 for each edge). Shedding any light on a good approach would be appreciated. Thanks!

A couple of points.
First, you really want to make sure your path traversal is not going to go into a loop. Secondly, handling both sides is not too bad. Finally, depending on what you are doing, you may want to push the WHERE clause into the CTE somehow, so you avoid generating every possible graph network and then picking the one you want.
Traversing in both directions itself is not too hard. I haven't tested this, but it should be possible with something like:
WITH RECURSIVE search_graph(path, last_node1, last_node2) AS (
  SELECT ARRAY[id], id, id
  FROM nodes
  WHERE id = 3  -- start where we want to start!
  UNION ALL
  SELECT sg.path || e.node2_id || e.node1_id, e.node1_id, e.node2_id
  FROM search_graph sg
  JOIN edges e
    ON (e.node1_id = sg.last_node2 AND NOT sg.path @> ARRAY[e.node2_id])
    OR (e.node2_id = sg.last_node1 AND NOT sg.path @> ARRAY[e.node1_id])
)
-- WHERE clause moved to the start of the graph search
SELECT DISTINCT unnest(path) FROM search_graph;  -- duplicates possible
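If carrying the whole path around feels heavy, a variant I have not tested is to symmetrize the edges in a plain CTE first and keep only a visited array for cycle detection (same nodes/edges tables as above):

-- List every edge in both directions, then walk outward from node 3,
-- carrying only the set of visited ids to stop cycles.
WITH RECURSIVE both_ways AS (
  SELECT node1_id AS src, node2_id AS dst FROM edges
  UNION ALL
  SELECT node2_id, node1_id FROM edges
),
walk(id, visited) AS (
  SELECT 3, ARRAY[3]
  UNION ALL
  SELECT b.dst, w.visited || b.dst
  FROM walk w
  JOIN both_ways b ON b.src = w.id
  WHERE NOT w.visited @> ARRAY[b.dst]
)
SELECT DISTINCT id FROM walk;  -- duplicates still possible, hence DISTINCT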
Hope this helps.

Related

Get nearest neighbor point from a table closest to the point in a given row of another table

I have two tables: table_a has polygons and the centroids of those polygons, and table_b has another set of points overlapping the geometries in table_a.
For each row of table_a I need to find the point from table_b closest to the centroid of that row.
INSERT INTO nearest_node (nearest_drive_node)
SELECT osmid FROM london_drive_nodes
ORDER BY london_drive_nodes.geom <-> nearest_node.lsoa_centroid
LIMIT 1;
This returns
SQL Error [42P01]: ERROR: invalid reference to FROM-clause entry for table "nearest_node"
Hint: There is an entry for table "nearest_node", but it cannot be referenced from
this part of the query.
I'm not sure exactly how to use the value from table_a as the point in the ORDER BY part of the query. The examples I've found find the nearest neighbor of a single point given as a text string, rather than a column of points.
Inserting the closest node as a new row in the table, without any other attribute, seems wrong. You most certainly want to update the existing records.
You must compute the closest node for each row of the input table, which can be achieved with a subquery:
UPDATE nearest_node
SET nearest_drive_node = (
  SELECT london_drive_nodes.osmid
  FROM london_drive_nodes
  ORDER BY nearest_node.geom <-> london_drive_nodes.geom
  LIMIT 1
);
If you were to just select (and possibly insert this information into another table), you would rely on a lateral join:
SELECT a.osmid, closest_pt.osmid, closest_pt.dist
FROM tablea a
CROSS JOIN LATERAL (
  SELECT osmid,
         a.geom <-> b.geom AS dist
  FROM tableb b
  ORDER BY a.geom <-> b.geom
  LIMIT 1
) AS closest_pt;
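Either way, for this to stay fast on larger tables the <-> ordering needs a spatial index on the point geometry. Assuming the column really is called geom, something along these lines should do:

-- GiST index so ORDER BY ... <-> ... LIMIT 1 can do an index-assisted KNN search
CREATE INDEX ON london_drive_nodes USING gist (geom);

Without it, each LIMIT 1 lookup has to compute the distance to every row of london_drive_nodes.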
The problem seems to be that you reference nearest_node in the query but not in the FROM clause. In general, though, your query wouldn't work "for each row" anyway. Try combining st_distance with a regular min and GROUP BY to get the minimum distance, then wrap it in a CTE or subquery to identify which node it actually is:
WITH distances AS (
  SELECT nn.id, nn.lsoa_centroid, min(st_distance(ldn.geom, nn.lsoa_centroid)) AS min_distance
  FROM london_drive_nodes ldn, nearest_node nn
  GROUP BY nn.id, nn.lsoa_centroid
)
INSERT INTO nearest_node (nearest_drive_node)
SELECT ldn.osmid
FROM distances
JOIN london_drive_nodes ldn
  ON distances.min_distance = st_distance(distances.lsoa_centroid, ldn.geom)

How to minimize the use of CTEs in a Postgresql query

The following code works well with a small number of vector features. However, when I run the query using a larger table (~35,000 rows), my memory use goes to 100% (32GB) and then I get a "Connection to the server has been lost" in pgAdmin. I am running on localhost, so the issue is not network related. I'm guessing it's because I am using too many CTEs (WITH queries). I was thinking of nesting the query in a PL/pgSQL loop and updating a table with the results, thereby discarding the intermediate results after each iteration. That seems like an inelegant solution, and I was hoping someone might be able to show me how I can minimize the use of CTEs in the query below.
CREATE TABLE dem_stats AS
WITH
-- Select Features using lookup table and determine the raster tiles said features are intersecting
feat AS
(SELECT title_no,
a.grid_tile_name || '.asc' AS tile_name,
a.wkb_geometry as geom
FROM test_polygons a, parcels_all_shapefile_lookup_osgb_grid_5km b
WHERE a.title_no = b.olp_title_no
),
-- Merge rasters tiles from main raster file that intersect features
merged_rast AS
(SELECT ST_Union(rast,1) AS rast
FROM dem, feat
WHERE filename
IN (tile_name)
),
-- As the tiles are now merged duplicates are not required
feat_temp AS
(SELECT DISTINCT ON (title_no) * FROM feat
),
-- Clip merged raster and obtain pixel statistics
b_stats AS
(SELECT title_no, (stats).*
FROM (SELECT title_no, ST_SummaryStats(ST_Clip(a.rast,1,b.geom,-9999,true)) AS stats
FROM merged_rast a
INNER JOIN feat_temp b
ON ST_Intersects(b.geom,a.rast)
) AS foo
)
-- Summarise statistics for each title number
SELECT title_no,
count As pixel_val_count,
min AS pixel_val_min,
max AS pixel_val_max,
mean AS pixel_val_mean,
stddev AS pixel_val_stddev
FROM b_stats
WHERE count > 0;
Though it is not exactly readable, you can always inline the CTEs. This is sometimes a good idea, because CTEs are optimization fences in PostgreSQL (prior to version 12): https://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/
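For example (an untested sketch, keeping the original table and column names), the feat, merged_rast, feat_temp and b_stats CTEs can be written as inline subqueries so the planner is free to push conditions into them:

-- Same result as the CTE version, but every step is an inline subquery.
-- The tile filter becomes a semi-join, which also avoids unioning the same
-- raster tile more than once.
CREATE TABLE dem_stats AS
SELECT title_no,
       count  AS pixel_val_count,
       min    AS pixel_val_min,
       max    AS pixel_val_max,
       mean   AS pixel_val_mean,
       stddev AS pixel_val_stddev
FROM (
    SELECT title_no, (stats).*
    FROM (
        SELECT ft.title_no,
               ST_SummaryStats(ST_Clip(mr.rast, 1, ft.geom, -9999, true)) AS stats
        FROM (
            -- was: merged_rast
            SELECT ST_Union(rast, 1) AS rast
            FROM dem
            WHERE filename IN (
                SELECT a.grid_tile_name || '.asc'
                FROM test_polygons a
                JOIN parcels_all_shapefile_lookup_osgb_grid_5km b
                  ON a.title_no = b.olp_title_no
            )
        ) mr
        JOIN (
            -- was: feat_temp (duplicates removed)
            SELECT DISTINCT ON (a.title_no)
                   a.title_no,
                   a.wkb_geometry AS geom
            FROM test_polygons a
            JOIN parcels_all_shapefile_lookup_osgb_grid_5km b
              ON a.title_no = b.olp_title_no
        ) ft ON ST_Intersects(ft.geom, mr.rast)
    ) AS foo
) AS b_stats
WHERE count > 0;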

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL Parent_Id, it is parented to itself).
The question link is here: Creating a recursive CTE with no root record
The answer has been provided to that question and I now have the data I require; however, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. This looked like below:
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Now, these two scripts are run on the same server; however, the temp table approach yields the results in approximately 15 seconds.
The multiple-CTE approach takes upwards of 5 minutes (so long, in fact, that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth, I believe it is to do with the record counts. The base table has 200k records in it, and from memory CTE performance is severely degraded when dealing with large data sets, but I cannot seem to prove that, so I thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer for this, some further research into the general subject threw up a number of other threads with similar problems.
This one seems to cover many of the variations between temp tables and CTEs, so it is most useful for people looking to read around the issue:
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs caused the issue: a CTE's result set is not cached or materialized anywhere, so recreating it each time it is referenced later has a large impact.
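If you go down the temp table route, it is probably also worth giving the temp table an index on the column the recursive join seeks on, since each recursion step looks up #Parties by Act_Parent_Id. A sketch only (the index name is made up, and I have not measured it at scale):

-- Hypothetical helper index for the temp-table version from the question
CREATE NONCLUSTERED INDEX IX_Parties_ActParent
  ON #Parties (Act_Parent_Id)
  INCLUDE (Party_Id, PARTY_CODE, PARTY_NAME);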
This might not be exactly the same issue you experienced, but I came across a similar one just a few days ago, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear, we are using SQL Server 2008 R2.
The pattern that I identified, and which seems to throw the SQL Server optimizer off the rails, is using CTEs that are joined with temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
select DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in the end you can say that other database systems have their own quirks.

Fully matching sets of records of two many-to-many tables

I have Users, Positions and Licenses.
Relations are:
users may have many licenses
positions may require many licenses
So I can easily get license requirements per position(s) as well as effective licenses per user(s).
But I wonder what would be the best way to match the two sets? As logic goes user needs at least those licenses that are required by a certain position. May have more, but remaining are not relevant.
I would like to get results with users and eligible positions.
PersonID PositionID
1 1 -> user 1 is eligible to work on position 1
1 2 -> user 1 is eligible to work on position 2
2 1 -> user 2 is eligible to work on position 1
3 2 -> user 3 is eligible to work on position 2
4 ...
As you can see I need a result for all users, not a single one per call, which would make things much much easier.
There are actually 5 tables here:
create table Person ( PersonID, ...)
create table Position (PositionID, ...)
create table License (LicenseID, ...)
and relations
create table PersonLicense (PersonID, LicenseID, ...)
create table PositionLicense (PositionID, LicenseID, ...)
So basically I need to find positions that a particular person is licensed to work on. There's of course a much more complex problem here, because there are other factors, but the main objective is the same:
How do I match multiple records of one relational table to multiple records of the other? This could also be described as an inner join per set of records, rather than per single record as it's usually done in T-SQL.
I'm thinking of T-SQL language constructs:
rowsets, but I've never used them before and don't know how to use them anyway
INTERSECT statements, maybe, although these probably only work over whole sets and not groups
Final solution (for future reference)
In the meantime, while you fellow developers answered my question, this is something I came up with that uses CTEs and partitioning, which can of course be used on SQL Server 2008 R2. I've never used result partitioning before, so I had to learn something new (which is a plus altogether). Here's the code:
with CTEPositionLicense as (
select
PositionID,
LicenseID,
checksum_agg(LicenseID) over (partition by PositionID) as RequiredHash
from PositionLicense
)
select per.PersonID, pos.PositionID
from CTEPositionLicense pos
join PersonLicense per
on (per.LicenseID = pos.LicenseID)
group by pos.PositionID, pos.RequiredHash, per.PersonID
having pos.RequiredHash = checksum_agg(per.LicenseID)
order by per.PersonID, pos.PositionID;
So I made a comparison between these three techniques that I named as:
Cross join (by Andriy M)
Table variable (by Petar Ivanov)
Checksum - this one here (by Robert Koritnik, me)
Mine already orders results per person and position, so I added the same to the other two to make them return identical results.
Resulting estimated execution plan
Checksum: 7%
Table variable: 2% (table creation) + 9% (execution) = 11%
Cross join: 82%
I also changed the Table variable version into a CTE version (a CTE was used instead of the table variable), removed the ORDER BY at the end, and compared their estimated execution plans. Just for reference, the CTE version came in at 43% while the original version had 53% (10% + 43%).
One way to write this efficiently is to do a join of PositionLicense with PersonLicense on the LicenseId. Then count the matches grouped by position and person and compare with the count of all licenses for the position - if they are equal, then that person qualifies:
DECLARE @tmp TABLE(PositionId INT, LicenseCount INT)

INSERT INTO @tmp
SELECT PositionId,
       COUNT(1) AS LicenseCount
FROM PositionLicense
GROUP BY PositionId

SELECT per.PersonID, pos.PositionId
FROM PositionLicense AS pos
INNER JOIN PersonLicense AS per ON (pos.LicenseId = per.LicenseId)
GROUP BY pos.PositionId, per.PersonID
HAVING COUNT(1) = (
    SELECT LicenseCount FROM @tmp WHERE PositionId = pos.PositionId
)
I would approach the problem like this:
Get all the (distinct) users from PersonLicense.
Cross join them with PositionLicense.
Left join the resulting set with PersonLicense using PersonID and LicenseID.
Group the results by PersonID and PositionID.
Filter out those (PersonID, PositionID) pairs where the number of licenses in PositionLicense does not match the number of those in PersonLicense.
And here's my implementation:
SELECT
u.PersonID,
pl.PositionID
FROM (SELECT DISTINCT PersonID FROM PersonLicense) u
CROSS JOIN PositionLicense pl
LEFT JOIN PersonLicense ul ON u.PersonID = ul.PersonID
AND pl.LicenseID = ul.LicenseID
GROUP BY
u.PersonID,
pl.PositionID
HAVING COUNT(pl.LicenseID) = COUNT(ul.LicenseID)
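As a quick sanity check, hypothetical sample data like the following (any extra columns ignored) should make the query return (1, 1), (1, 2) and (2, 1): person 1 holds both licenses, person 2 holds only license 10, and position 2 requires both.

-- Made-up test rows: person 1 holds licenses 10 and 20, person 2 holds only 10;
-- position 1 requires license 10, position 2 requires licenses 10 and 20.
INSERT INTO PersonLicense (PersonID, LicenseID) VALUES (1, 10), (1, 20), (2, 10);
INSERT INTO PositionLicense (PositionID, LicenseID) VALUES (1, 10), (2, 10), (2, 20);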

Finding duplicates between two tables

I've got two SQL 2008 tables: one is an "Import" table containing new data and the other a "Destination" table with the live data. Both tables are similar but not identical (there are more columns in the Destination table, updated by a CRM system), but both tables have three "phone number" fields - Tel1, Tel2 and Tel3. I need to remove all records from the Import table where any of the phone numbers already exist in the Destination table.
I've tried knocking together a simple query (just a SELECT to test with for now):
select t2.account_id
from ImportData t2, Destination t1
where
(t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
... but I'm aware this is almost certainly Not The Way To Do Things, especially as it's very slow. Can anyone point me in the right direction?
This query requires a little more than this information. If you want to write it in an efficient way, we need to know whether each load has more duplicates or more new records. I assume that account_id is the primary key and has a clustered index.
I would use the temporary table approach: that is, create a normalized table #tmp with an index on the phone number and account_id, like
SELECT Phone, account_id INTO #tmp
FROM
  (SELECT account_id, Tel1, Tel2, Tel3
   FROM Destination) p
UNPIVOT
  (Phone FOR TelColumn IN
     (Tel1, Tel2, Tel3)
  ) AS unpvt;
Create a nonclustered index on this table with the phone number as the first column and the account number as the second. You can't escape one full table scan, so I assume you can scan the import table (probably the smaller one). Then just join with this table and use the NOT EXISTS qualifier as explained, and of course drop the table after the processing.
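A sketch of those remaining steps (untested; the index name IX_tmp_phone is just a placeholder):

-- Index the unpivoted numbers on (phone, account), as described above
CREATE NONCLUSTERED INDEX IX_tmp_phone ON #tmp (Phone, account_id);

-- Keep only the import rows whose numbers never appear in Destination
SELECT t2.*
FROM ImportData t2
WHERE NOT EXISTS (
    SELECT 1
    FROM #tmp d
    WHERE d.Phone <> ''
      AND d.Phone IN (t2.Tel1, t2.Tel2, t2.Tel3)
);

DROP TABLE #tmp;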
luke
I am not sure about the performance of this query, but since I made the effort of writing it I will post it anyway...
;with aaa(tel)
as
(
select Tel1
from Destination
union
select Tel2
from Destination
union
select Tel3
from Destination
)
,bbb(tel, id)
as
(
select Tel1, account_id
from ImportData
union
select Tel2, account_id
from ImportData
union
select Tel3, account_id
from ImportData
)
select distinct b.id
from bbb b
where b.tel in
(
select a.tel
from aaa a
intersect
select b2.tel
from bbb b2
)
EXISTS will short-circuit the query and not do a full traversal of the table like a join would. You could refactor the WHERE clause as well if this still doesn't perform the way you want.
SELECT *
FROM ImportData t2
WHERE NOT EXISTS (
select 1
from Destination t1
where (t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
)