Is this a good candidate for plpgsql? - postgresql

Several of the introductory tutorials I've read on using for loops in plpgsql have said unless I really need a for loop, I should find another way.
As a programming novice I can't figure out a better way than a for loop.
I would like to go through each parcel gid, calculate the nearest street intersection by bounding box, then assign the intersection id to the nearest_intersection column in parcels.
All the parts of this function work nicely. But I can't figure out how to put them together.
BEGIN
    FOR gid IN SELECT gid FROM parcels
    LOOP
        UPDATE parcels SET nearest_intersection =
            (select intersection.osm_id
             from intersections
             order by
                 intersections.geom <-> (select geom from parcels where gid = parcel_id)
             limit 1;)
    end loop;
end;
Thank you!

In your current code, the loop indeed doesn't make sense, because the UPDATE alone already processes every row of the parcels table.
What you probably want is this (without a loop):
UPDATE parcels R SET nearest_intersection =
    (select intersections.osm_id
     from intersections
     order by intersections.geom <-> R.geom
     limit 1);
which in procedural thinking would be the equivalent of:
for every row R of parcels, find the row in intersections
whose geom is nearest to R.geom, and copy its osm_id into
R.nearest_intersection
On the other hand, if it had to be done with a loop, it would look like this:
FOR var_gid IN SELECT gid FROM parcels
LOOP
    UPDATE parcels SET nearest_intersection =
        (select intersections.osm_id
         from intersections
         order by
             intersections.geom <-> parcels.geom
         limit 1)
    WHERE parcels.gid = var_gid;
end loop;

Don't be a hostage to SQL purism. Write functions with loops. When you are a Postgres expert you'll change them to queries. Or not.
You probably missed the WHERE clause for the UPDATE.

Related

Is there any way to update this column faster using PostgreSQL

I have about 200,000,000 rows and I am trying to update one of the columns, and this query seems particularly slow, so I am not sure what exactly is wrong or if it is just slow.
UPDATE table1 p
SET location = a.location
FROM table2 a
WHERE p.idv = a.idv;
I currently have indexes on idv for both of the tables. Is there some way to make this faster?
I encountered the same problem several weeks ago, and in the end I used the following strategies to drastically improve the speed. I guess it is not the best approach, but it may serve as a reference.
Write a simple function which accepts a range of IDs. The function executes the update SQL, but only for that range of IDs.
Also add 'location != a.location' to the WHERE clause, so rows that already hold the right value are not rewritten. I heard this can help keep the table from becoming bloated; bloat hurts query performance and requires a VACUUM to restore it.
I executed the function concurrently from about 30 threads, which intuitively should reduce the total time required by a factor of roughly 30. You can use an even higher number of threads if you are ambitious enough.
So it executes something likes below concurrently :
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 1 and 100000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 100001 and 200000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 200001 and 300000;
.....
.....
It also has another advantage: by printing simple timing statistics in each function call, I can track the update progress and estimate the remaining time.
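For reference, a minimal sketch of such a batching function (the function name update_location_range is hypothetical; table and column names are the question's):
CREATE OR REPLACE FUNCTION update_location_range(p_from bigint, p_to bigint)
RETURNS void AS $$
BEGIN
    UPDATE table1 p
    SET    location = a.location
    FROM   table2 a
    WHERE  p.idv = a.idv
      AND  p.location != a.location  -- skip rows that are already correct
      AND  p.id BETWEEN p_from AND p_to;
END;
$$ LANGUAGE plpgsql;
-- each worker thread then runs its own slice, e.g.:
-- SELECT update_location_range(1, 100000);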
Creating a new table can be faster than updating existing data. So you can try the following:
CREATE TABLE new_table AS
SELECT
    a.*, -- here you can list all the fields you need
    COALESCE(b.location, a.location) location -- updated location field from table2
FROM table1 a
LEFT JOIN table2 b ON b.idv = a.idv;
After creation you will be able to drop the old table and rename the new one.
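The swap could look like this (a sketch, assuming nothing else such as views or foreign keys depends on the old table):
DROP TABLE table1;
ALTER TABLE new_table RENAME TO table1;
-- recreate any indexes, constraints, and privileges the old table had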

Get nearest neighbor point from a table closest to the point in a given row of another table

I have two tables: table_a has polygons and the centroids of those polygons; table_b has another set of points overlapping the geometries in table_a.
For each row of table_a I need to find the point from table_b closest to the centroid of that row.
INSERT INTO nearest_node (nearest_drive_node)
SELECT osmid FROM london_drive_nodes
ORDER BY london_drive_nodes.geom <-> nearest_node.lsoa_centroid
LIMIT 1;
This returns
SQL Error [42P01]: ERROR: invalid reference to FROM-clause entry for table "nearest_node"
Hint: There is an entry for table "nearest_node", but it cannot be referenced from
this part of the query.
I'm not sure exactly how to use the value from table_a as the point in the ORDER BY part of the query. The examples I've found find the nearest neighbor of a single point given as a text string, rather than of a column of points.
Inserting the closest node as a new row in the table, without any other attribute, seems wrong. You most certainly want to update the existing records instead.
You must compute the closest node for each row of the input table, which can be achieved with a subquery:
UPDATE nearest_node
SET nearest_drive_node = (
    SELECT london_drive_nodes.osmid
    FROM london_drive_nodes
    ORDER BY nearest_node.geom <-> london_drive_nodes.geom
    LIMIT 1
);
If you wanted to just select (and possibly insert this information into another table), you would rely on a lateral join:
select a.osmid, closest_pt.osmid, closest_pt.dist
from tablea a
CROSS JOIN LATERAL
    (SELECT osmid,
            a.geom <-> b.geom as dist
     FROM tableb b
     ORDER BY a.geom <-> b.geom
     LIMIT 1) AS closest_pt;
The problem seems to be that you reference nearest_node in the query but not in the FROM clause; in general, though, your query wouldn't work "for each row" anyway. Try combining st_distance and a regular min with GROUP BY to get the minimum distance, then wrap it in a CTE or subquery to identify which node it actually is:
WITH distances AS (
    SELECT nn.id, nn.lsoa_centroid,
           min(st_distance(ldn.geom, nn.lsoa_centroid)) AS min_distance
    FROM london_drive_nodes ldn, nearest_node nn
    GROUP BY nn.id, nn.lsoa_centroid
)
INSERT INTO nearest_node (nearest_drive_node)
SELECT ldn.osmid
FROM distances
JOIN london_drive_nodes ldn
  ON distances.min_distance = st_distance(distances.lsoa_centroid, ldn.geom)

I need to use the ith value from a column

In a loop, I want to get the ith value from the table each time. I wrote it like this:
FOR i IN 1..(select count(*) from table1) LOOP
    INSERT INTO TABLE2
    select id from table1 where column_nam in (select column_nam[i] from table1);
end loop;
end
For example, column_nam[1] = HPPC003, but it doesn't work as expected. How should I do this?
The comment by @a_horse_with_no_name is correct in that your ideas seem to be on shaky ground: relational database rows have no inherent order, so there is no "ith row" without a column to base that order on.
While I agree that you should revisit the basics, I would also like to propose a solution to this particular problem. If you had another column in the table, you could use it as the "ith" counter reference, i.e. columns Reference and Data, with 1, 2, 3, etc. in Reference and HPPC003, HPFC002, etc. in Data, where you could then SELECT Data ... WHERE Reference = 1 to get HPPC003.
I hope this helps; Elmasri & Navathe is a very good reference for learning the foundations of databases.
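A minimal sketch of that idea using row_number() in place of a physical Reference column (assuming id is both the ordering key and the column table2 expects; both are assumptions):
INSERT INTO table2
SELECT id
FROM (
    SELECT id,
           row_number() OVER (ORDER BY id) AS ref  -- synthetic "ith" reference
    FROM table1
) numbered
WHERE ref = 1;  -- e.g. the row whose column_nam is HPPC003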

Generate result set of a certain length

I need to insert a certain amount of rows into some table with values taken from variables. I certainly can do a loop inserting a single row at a time, but that's too straightforward. I am looking for a more elegant solution. My current thoughts are around an INSERT INTO ... SELECT ... statement, but now I need a query that will generate the required amount of rows. I tried to write a recursive CTE to do it:
CREATE FUNCTION ufGenerateRows(@numRows INT = 1)
RETURNS @RtnValue TABLE
(
    RowID INT NOT NULL
)
AS
BEGIN
    WITH numbers AS
    (
        SELECT 1 AS N
        UNION ALL
        SELECT N + 1
        FROM numbers
        WHERE N + 1 <= @numRows
    )
    INSERT INTO @RtnValue
    SELECT N
    FROM numbers
    RETURN
END
GO
It works, but has a recursion depth limit of 100, which is not enough for me. Can you suggest alternatives?
Always use the dbo. schema prefix when creating or referencing objects, especially functions.
You should strive to create inline table-valued functions, as opposed to multi-statement table-valued functions, when possible.
Recursive CTEs are about the least efficient way to generate a set (see this three-part series for much better examples):
http://www.sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
http://www.sqlperformance.com/2013/01/t-sql-queries/generate-a-set-2
http://www.sqlperformance.com/2013/01/t-sql-queries/generate-a-set-3
Here is one example:
CREATE FUNCTION dbo.GenerateRows(@numRows INT = 1)
RETURNS TABLE
AS
RETURN
(
    SELECT TOP (@numRows) RowID = ROW_NUMBER() OVER (ORDER BY s1.[number])
    FROM master.dbo.spt_values AS s1
    -- CROSS JOIN master.dbo.spt_values AS s2
    ORDER BY s1.[number]
);
If you need more than ~2,500 rows, you can cross join with itself, or another table.
Even better would be to create your own numbers table (again, see the links above for examples).
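For instance, a sketch of a persistent numbers table (the name dbo.Numbers is an assumption):
CREATE TABLE dbo.Numbers (n INT PRIMARY KEY);

INSERT INTO dbo.Numbers (n)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY s1.[number])
FROM master.dbo.spt_values AS s1
CROSS JOIN master.dbo.spt_values AS s2;  -- self cross join provides enough source rows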
Don't think iteratively - looping - but set-based - all at once.
An INSERT INTO...SELECT TOP x… should do what you need without repeated inserts.
I will follow with an example when I'm not bound to my phone.
UPDATE:
What @AaronBertrand said. :} A CROSS JOIN in the SELECT is spot-on.
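A sketch of what that might look like, with a variable-driven row count and value (table and variable names are hypothetical):
DECLARE @numRows INT = 100000;
DECLARE @someValue VARCHAR(10) = 'x';

INSERT INTO dbo.TargetTable (ID, Val)
SELECT TOP (@numRows)
       ROW_NUMBER() OVER (ORDER BY s1.[number]),  -- sequential ID
       @someValue                                 -- variable value repeated per row
FROM master.dbo.spt_values AS s1
CROSS JOIN master.dbo.spt_values AS s2;           -- cross join covers counts beyond ~2,500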

PostgreSQL statistical mode value

I am using the SQL query
SELECT round(avg(int_value)) AS modal_value FROM t;
to obtain the modal value, which, of course, is not correct, but is a first attempt at showing some result.
So, my question is: "How do I do this right?"
With PostgreSQL 8.3+ we can use this user-defined aggregate to define mode:
CREATE FUNCTION _final_mode(anyarray) RETURNS anyelement AS $f$
    SELECT a FROM unnest($1) a
    GROUP BY 1 ORDER BY COUNT(1) DESC, 1
    LIMIT 1;
$f$ LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE mode(anyelement) (
    SFUNC=array_append, STYPE=anyarray,
    FINALFUNC=_final_mode, INITCOND='{}'
);
but, like any user-defined aggregate, it can be slow with big tables (compare sum/count with the built-in AVG function). With PostgreSQL 9+, is there no direct (built-in) function for calculating the statistical mode? Perhaps using pg_stats... How would I do something like
SELECT (most_common_vals(int_value))[1] AS modal_value FROM t;
Can the pg_stats view be used for this kind of task (even just once, by hand)?
Since PostgreSQL 9.4 there is a built-in aggregate function mode. It is used like
SELECT mode() WITHIN GROUP (ORDER BY some_value) AS modal_value FROM tbl;
Read more about ordered-set aggregate functions here:
36.10.3. Ordered-Set Aggregates
Built-in Ordered-Set Aggregate Functions
See other answers for dealing with older versions of Postgres.
You can try something like:
SELECT int_value, count(*)
FROM t
GROUP BY int_value
ORDER BY count(*) DESC
LIMIT 1;
The idea behind it: you get the count for every int_value, then order them (so that the biggest count comes first), then LIMIT the query to the first row only, to get the int_value with the highest count.
If you want to do it by groups:
select
int_value * 10 / (select max(int_value) from t) g,
min(int_value) "from",
max(int_value) "to",
count(*) total
from t
group by 1
order by 4 desc
In the question's introduction I cited this link, which has a good SQL-coded solution (and @IgorRomanchenko used the same algorithm in his answer). @ClodoaldoNeto shows a "new solution", but it is for scalars and measures, as I commented; it is not an answer to the current question.
Two months and ~40 views have passed with no new input...
Conclusions
Conclusions using only the information (and evidence of the absence of further info) on this page and the cited links. Summary:
The user-defined aggregate mode() is enough; we do not need a built-in (compiled) version.
There is no infrastructure for optimization; a built-in would do the same thing as the user-defined one.
I tested the cited SQL aggregate function in contexts like
SELECT mode(some_value) AS modal_value FROM t;
and, in my tests, it was fast... So a built-in function (like Oracle's STATS_MODE) is not justified, except where there is demand for a "statistical package" -- but if you are going to spend time and memory installing something, I suggest the R language.
Another implicit question was whether a statistical package could "prepare" or make use of some PostgreSQL infrastructure (like pg_stats)... A good clue for a "canonical answer" is in the comment of @IgorRomanchenko: "pg_stat (...) contains only estimates, not the exact value". So the mode function cannot make use of that infrastructure, as I had supposed.
NOTE: we must remember that, for "modal intervals", we can use another function; see @ClodoaldoNeto's answer.
The mode is the value that occurs most often, so I rewrote the function I found here and came up with this:
CREATE OR REPLACE FUNCTION _final_mode(anyarray)
    RETURNS anyelement AS
$BODY$
    SELECT
        CASE
            -- return the most frequent value, or NULL when there is a tie
            WHEN t2.cnt IS NULL OR t1.cnt <> t2.cnt THEN t1.a
            ELSE NULL
        END
    FROM
        -- the most frequent value
        (SELECT a, COUNT(*) AS cnt
         FROM unnest($1) a
         WHERE a IS NOT NULL
         GROUP BY 1
         ORDER BY COUNT(*) DESC, 1
         LIMIT 1
        ) AS t1
    LEFT JOIN
        -- the second most frequent value (absent when there is only one distinct value)
        (SELECT a, COUNT(*) AS cnt
         FROM unnest($1) a
         WHERE a IS NOT NULL
         GROUP BY 1
         ORDER BY COUNT(*) DESC, 1
         LIMIT 1 OFFSET 1
        ) AS t2 ON true
$BODY$
LANGUAGE 'sql' IMMUTABLE;
-- Tell Postgres how to use our aggregate
CREATE AGGREGATE mode(anyelement) (
    SFUNC=array_append,    -- function to call for each row; just builds the array
    STYPE=anyarray,
    FINALFUNC=_final_mode, -- function to call after everything has been added to the array
    INITCOND='{}'          -- initialize an empty array when starting
);
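Usage is the same as before; this version returns NULL when two values tie for the highest count:
SELECT mode(int_value) AS modal_value FROM t;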