Postgres/Postgis database - Query running infinitely with ST_MakeLine - postgresql

I currently have a database table called cardata which, among others, has these columns:
+-------------+-----------------------------+---------------------------------+
| tripid(int) | point geometry(point, 4326) | line geometry(linestring, 4326) |
+-------------+-----------------------------+---------------------------------+
| 1           | <data>                      |                                 |
| 1           | <data>                      |                                 |
The point column contains sequential GPS measurements, and as such I would like to transform them into linestrings. The whole trip can contain thousands of points, but I want a linestring between every pair of consecutive points.
I have tried to formulate this as an update to my table which looks like this:
UPDATE cardata
SET line = ST_MakeLine(foo.point, lead)
FROM (
    SELECT point, LEAD(point, 1) OVER w
    FROM cardata
    GROUP BY point
    WINDOW w AS (ORDER BY point)
) AS foo
WHERE lead IS NOT NULL
The idea is that for each row I use that point and the next one to make a linestring (ST_MakeLine), saved in the first of the two rows (in the line column). This should continue until the trip ends, ignoring the last entry, for which LEAD should be null.
For now I do not know how to formulate a single query that distinguishes between different tripid, but that is for another time. For now I just want linestrings between all points in the entire table.
The problem, however, is that my query seems to run forever and does not change anything in the table. I do not understand why. I tried testing whether the inner SELECT query behaves as expected: it returns 46561 rows out of 47055 total rows. This is also odd, since I believe it should return 47054 rows, i.e. exclude only the last entry, for which LEAD is null.
Lastly, I tried running ST_MakeLine on two random points, which seems to work fine.
What makes this query run forever?

Another guy I am working with finally found a working solution, although neither of us quite understands what exactly was wrong with the original query.
As such I am still interested in the answer to my original question, namely what made the query run forever?
This is the working query, which works entirely as intended:
UPDATE cardata
SET line = resline
FROM (
    SELECT tripid,
           id,
           ST_MakeLine(point, next_point) AS resline
    FROM (
        SELECT tripid,
               id,
               point,
               lead(point) OVER w AS next_point,
               lead(tripid) OVER w AS next_tripid
        FROM cardata
        WINDOW w AS (ORDER BY id)
    ) AS allcalclines
    WHERE tripid = 1 AND next_tripid = tripid
) AS calclines
WHERE cardata.id = calclines.id
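For reference, since the inner query already computes next_tripid, dropping the fixed tripid = 1 filter should presumably cover every trip in one pass. This is an untested sketch of that variant, assuming (as in the query above) that a trip's rows are consecutive when ordered by id:
UPDATE cardata
SET line = resline
FROM (
    SELECT id,
           ST_MakeLine(point, next_point) AS resline
    FROM (
        SELECT id,
               tripid,
               point,
               lead(point)  OVER w AS next_point,
               lead(tripid) OVER w AS next_tripid
        FROM cardata
        WINDOW w AS (ORDER BY id)
    ) AS allcalclines
    WHERE next_tripid = tripid  -- keep only the trip-boundary check, no fixed tripid
) AS calclines
WHERE cardata.id = calclines.id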

Related

What is the best approach?

At work we have a SQL Server 2019 instance. There are two big tables in the same database that have to be joined to obtain specific data: one contains GPS data taken at 4-minute intervals, though there can be in-between records as well. The important thing here is that there is a non-key attribute called file_id, a timestamp (DATE_TIME column), latitude and longitude. The other attributes are not relevant, and the key is autogenerated (identity column), so it's of no use to me.
The other table contains transaction records that have, among other attributes, a timestamp (FECHATRX column) and the same non-key file ID attribute the GPS table has, and also an autogenerated key with no relation at all to the other key.
For each file ID there are several records in both tables that have to be joined somehow, so that for a given file ID and transaction record I obtain its latitude and longitude. The tables aren't ordered at all.
My idea is to pair records with the same file ID, and I imagine it this way (I haven't done it yet because it was only explained to me earlier today):
Order both tables by file ID and timestamp
For the same file ID, all transaction records whose timestamp is equal to or greater than the first timestamp from the GPS table and lower than the following timestamp from the same GPS table will be given the latitude and longitude values from that first record, because they are considered to belong to that latitude-longitude pair (in reality they are probably somewhere in between, but this is an assumption everybody agrees with)
When a transaction record has a timestamp equal to or greater than the second timestamp, the third timestamp will act as the new end point: all the transaction records in between will receive the coordinates from the second GPS record, until a timestamp equals or exceeds the third, and so on, until a new file ID is reached or there are no records left in one or both tables
To me this sounds like nested cursors and several variables: some to save the first GPS record's values, another for the second GPS record's timestamp for comparison purposes, and of course the file ID itself as a control variable. But is this the best way to obtain the latitude / longitude data from the GPS table for each and every transaction record?
Are other approaches better than using nested cursors?
As I said, I haven't done anything yet; the only thing I can do is show you some data from both tables. I just wanted to know whether there is another (and simpler) way of doing this than nested cursors.
Thank you.
Alejandro
No need to reorder tables or use a complex cursor loop. A properly constructed index can provide an efficient join, and a CROSS APPLY or OUTER APPLY can be used to handle the complex "select closest prior GPS coordinate" lookup logic.
Assuming your table structure is something like:
GPS(gps_id, file_id, timestamp, latitude, longitude, ...)
Transaction(transaction_id, timestamp, file_id, ...)
First create an index on the GPS table to allow efficient lookup by file_id and timestamp.
CREATE INDEX IX_GPS_FileId_Timestamp
ON GPS(file_id, timestamp)
INCLUDE(latitude, longitude)
The INCLUDE clause is optional, but it allows the index to serve up lat/long without the need to access the primary table.
You can then use a query something like:
SELECT *
FROM Transaction T
OUTER APPLY (
SELECT TOP 1 *
FROM GPS G
WHERE G.file_id = T.file_id
AND G.timestamp <= T.timestamp
ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
SELECT TOP 1 *
FROM GPS G
WHERE G.file_id = T.file_id
AND G.timestamp >= T.timestamp
ORDER BY G.timestamp
) G2
CROSS APPLY and OUTER APPLY are like INNER JOIN and LEFT JOIN, but have more flexibility to define a subquery with complex conditions to handle cases like this.
The G1 subquery will efficiently select the immediately prior or equal GPS timestamp record with the same file_id. G2 does the same for equal or immediately following. Per your requirements, you only need G1, but having both might give you the opportunity to interpolate between the two points or to handle cases where there is no preceding matching record.
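For instance, falling back from the prior fix to the following one could look like the sketch below. It is only an illustration built on the assumed table layout above, not a drop-in solution:
SELECT T.*,
       COALESCE(G1.latitude,  G2.latitude)  AS latitude,   -- prefer the prior fix,
       COALESCE(G1.longitude, G2.longitude) AS longitude   -- fall back to the following one
FROM Transaction T
OUTER APPLY (
    SELECT TOP 1 latitude, longitude
    FROM GPS G
    WHERE G.file_id = T.file_id AND G.timestamp <= T.timestamp
    ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
    SELECT TOP 1 latitude, longitude
    FROM GPS G
    WHERE G.file_id = T.file_id AND G.timestamp >= T.timestamp
    ORDER BY G.timestamp
) G2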
See this fiddle for a demo.

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are null on both sides, that's ok). This is a sample from the result of the join:
    sql     | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
 1880040001 |                 | PA 10
 1880040001 |                 | PA 10
 0863690003 |                 | PA 4
 0850840001 |                 | PA 4
 3090500003 |                 | PA 4
 1330090001 |                 | PA 10
 1201410001 |                 | PA 9
 0550620002 |                 | PA 6
 0430790001 |                 | PA 1
 1340180002 |                 | PA 9
I used QGIS to visualize the results, but could not find any clue as to why this is happening. The sirgas_lotes_centroid table comes from the other table, its geometry being the centroid of the polygon. I used the centroids to perform faster spatial joins and now need to place the information into the table with the original polygons.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Then results are not visible to other transactions before the transaction is actually committed with COMMIT. It might hang indefinitely (which creates additional problems), or be rolled back by some later interaction.
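For illustration, in psql the commit step would look like this (just a sketch around your own statement):
BEGIN;   -- explicit transaction: auto-commit is off from here on

UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;

COMMIT;  -- without this, no other session ever sees the change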
That said, you mention: "some are both null, that's ok". You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as @jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example how to do that (chapter "Multiple matches..."):
Updating the value of a column
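A minimal sketch of such a definition, using DISTINCT ON and assuming (purely for illustration) that the smallest non-null quali_ambiental per sql is the one you want:
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
    SELECT DISTINCT ON (sql)
           sql, quali_ambiental
    FROM   sirgas_lotes_centroid
    WHERE  quali_ambiental IS NOT NULL
    ORDER  BY sql, quali_ambiental   -- defines which candidate row wins per sql
    ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;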
Always include exact table definitions (CREATE TABLE statements) with any such question. This would save a lot of time wasted on speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.
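If so, the conversion itself is a one-liner per table; a sketch only, and only valid if the leading zeros in values like 0863690003 carry no meaning (if they do, text is the right choice after all):
ALTER TABLE lotes_infos           ALTER COLUMN sql TYPE bigint USING sql::bigint;
ALTER TABLE sirgas_lotes_centroid ALTER COLUMN sql TYPE bigint USING sql::bigint;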

Postgis DB Structure: Identical tables for multiple years?

I have multiple datasets for different years as a shapefile and converted them to postgres tables, which confronts me with the following situation:
I got tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema; for now, let's focus on the following columns (the example is one row out of the boris2018 table). The rows represent actual PostGIS geometries with certain properties.
 brw | brwznr |  gema   | entw | nuta
-----+--------+---------+------+------
 290 | 285034 | Sieglar | B    | W
The 'brwznr' column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
 brw18 | brw17 | brw16 | brwznr |  gema   | entw | nuta
-------+-------+-------+--------+---------+------+------
   290 |   260 |   250 | 285034 | Sieglar | B    | W
This led to some data getting lost (because no matching brwznr was found), some data being matched incorrectly (due to inconsistencies in the data), and it didn't feel right.
What I actually want to achieve is fast queries that get me the different brw values for a certain coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong / has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope anyone can help, your time and effort is highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
    SELECT ortst, brw, gema, gena, geom FROM boris2016
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2017
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest one when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index on the geom field. Let me know in the comments if that is the case.
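A rough sketch of that merge, assuming each yearly table has a geom column and adding a year marker so the origin of each row is kept:
CREATE TABLE boris AS
    SELECT 2016 AS year, ortst, brw, gema, gena, geom FROM boris2016
    UNION ALL
    SELECT 2017, ortst, brw, gema, gena, geom FROM boris2017
    UNION ALL
    SELECT 2018, ortst, brw, gema, gena, geom FROM boris2018;

CREATE INDEX boris_geom_idx ON boris USING gist (geom);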

Select until row matches in postgresql?

Is there a way to select rows until some condition is met? I.e. a kind of LIMIT, but not limited to N rows; rather, all the rows until the first non-matching row?
For example, say I have the table:
CREATE TABLE t (id SERIAL PRIMARY KEY, rank INTEGER, value INTEGER);
INSERT INTO t (rank, value) VALUES (1, 1), (2, 1), (2, 2), (3, 1);
that is:
test=# SELECT * FROM t;
 id | rank | value
----+------+-------
  1 |    1 |     1
  2 |    2 |     1
  3 |    2 |     2
  4 |    3 |     1
(4 rows)
I want to order by rank, and select up until the first row that is over 1.
I.e. SELECT * FROM t ORDER BY rank UNTIL value>1
and I want the first 2 rows back?
One solution is to use a subquery and bool_and:
SELECT *
FROM (
    SELECT id, rank, value,
           bool_and(value < 2) OVER (ORDER BY rank, id) AS ok
    FROM t
    ORDER BY rank
) t2
WHERE ok = true
But won't that end up going through all rows, even if I only want a handful?
(Real-world context: I have timestamped events in a table. I can use a lead/lag window query to get the time between two events, and I want all events from now going back, as long as they happened less than 10 minutes apart. The lead/lag window query complicates things, hence the simplified example here.)
Edit: made the window function order by rank, id.
What you want is a sort of stop condition. As far as I am aware, there is no such thing in SQL, at least not in PostgreSQL's dialect.
What you can do is use a PL/PgSQL procedure to read rows from a cursor and return them until the stop condition is met. It won't be super fast, but it'll be alright. It's just a FOR loop over a query with an IF expression THEN exit; ELSE return next; END IF;. No explicit cursor is required because PL/PgSQL will use one internally if you FOR loop over a query.
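A rough sketch of such a function, using the simplified table t from the question (the function name and parameter are made up):
CREATE OR REPLACE FUNCTION rows_until_stop(p_max integer)
  RETURNS SETOF t
  LANGUAGE plpgsql AS
$func$
DECLARE
    r t%ROWTYPE;
BEGIN
    FOR r IN SELECT * FROM t ORDER BY rank, id LOOP
        IF r.value > p_max THEN
            EXIT;            -- stop condition met: stop fetching more rows
        ELSE
            RETURN NEXT r;   -- emit this row and keep going
        END IF;
    END LOOP;
END
$func$;

-- SELECT * FROM rows_until_stop(1);  -- returns the first two rows of the example data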
Another option is to create a cursor and read chunks of rows from it in the application, then discard part of the last chunk once the stop condition is met.
Either way, a cursor is going to be what you want.
A stop expression wouldn't actually be too hard to implement in PostgreSQL by the way. You'd have to implement a new executor node type, but the new CustomScan support would make that practical to do in an extension. Then you'd just evaluate an expression to decide whether or not to carry on fetching rows.
You can try something such as:
select * from t, (
select rank from t where value = 1 order by "rank" limit 1) x
where t.rank <= x.rank order by rank;
It will make two passes through the first part of the table (which you might be able to cut by creating an index on (rank, value = 1)) but shouldn't evaluate the rest of the table if you have an index on rank.
[If you could have window expressions in WHERE clauses, you could use a window expression to make sure any previous rows didn't have value = 1.. but even if this were possible, getting the query evaluator to use it to limit the search would be yet another challenge.]
This may be no better than your solution, since you already raised the question, "won't that end up going through all rows?"
I can tell you this: the explain plan is different from your solution's. I don't know how the guts of PostgreSQL work, but if I were writing a "min" function, I would think it would always be O(n). By contrast, you had an ORDER BY, which is average case O(n log n), worst case O(n^2).
That said, I cannot deny that this will go through all rows:
select * from sandbox.t
where id < (select min (id) from sandbox.t where value > 1)
One thing to clarify, though, is that unless you scan all rows, I'm not sure how you could determine the minimum value. Any time you invoke an aggregate concept across all records, doesn't that mean that you must read all rows?

lock the rows until next select postgres

Is there a way in Postgres to lock rows until the next SELECT query execution from the same system? And one more thing: there will be no update process on the locked rows.
The scenario is something like this:
If table1 contains data like
id | txt
-------------------
 1 | World
 2 | Text
 3 | Crawler
 4 | Solution
 5 | Nation
 6 | Under
 7 | Padding
 8 | Settle
 9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then it should lock rows with id 1 to 5 against other systems that are executing select statements concurrently.
Later, if sys1 executes another select query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block read-only access to a table (unless that select is done FOR UPDATE).
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from
(
  select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
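For example, if the IDs 1 through 5 were stored, one way to release them in a single statement would be (just a sketch):
select id, pg_advisory_unlock(id)
from unnest(ARRAY[1, 2, 3, 4, 5]::bigint[]) AS id;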
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
It looks like you really have a transaction which transcends the borders of your database, and all the change happens in another system.
My idea is SELECT ... FOR UPDATE NOWAIT to lock the relevant rows, then offload the data into another system, then ROLLBACK to unlock the rows. No two SELECT ... FOR UPDATE queries will select the same row, and the second select will fail immediately rather than wait and proceed.
But you don't seem to mark offloaded records in any way; I don't see why two non-consecutive selects wouldn't happily select overlapping ranges. So I'd still update the records with a flag and/or a target user name, and would only select records with the flag unset.
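A sketch of that flag-based variant, with a hypothetical claimed_by column added to table1 (names are illustrative only):
-- claim the next five unclaimed rows for this system
UPDATE table1
SET    claimed_by = 'sys1'                 -- hypothetical flag / user-name column
WHERE  claimed_by IS NULL
AND    id IN (SELECT id
              FROM   table1
              WHERE  claimed_by IS NULL
              ORDER  BY id
              LIMIT  5)
RETURNING *;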
I tried both select ... for update and pg_try_advisory_lock and managed to get close to my requirement.
/* rows are locked, but the limit is the problem */
select * from table1 where pg_try_advisory_lock(id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // number of rows to process
.
.
/* after each word is processed */
$_SESSION['rows'] -= 1;
.
.
/* and finally unlock the locked rows */
if ($_SESSION['rows'] === 0)
    select pg_advisory_unlock_all() from table1
But there are two problems with this:
1. As the LIMIT is applied before the lock, different instances end up trying to lock the same rows every time.
2. Not sure whether pg_advisory_unlock_all will unlock only the rows locked by the current instance or those of all instances.