PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both null, that's ok). Here is a sample from the result of the join:
sql | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
1880040001 | | PA 10
1880040001 | | PA 10
0863690003 | | PA 4
0850840001 | | PA 4
3090500003 | | PA 4
1330090001 | | PA 10
1201410001 | | PA 9
0550620002 | | PA 6
0430790001 | | PA 1
1340180002 | | PA 9
I used QGIS to visualize the results, but could not find any clue as to why this is happening. The sirgas_lotes_centroid table is derived from the other table, with the geometry being the centroid of the polygon. I used the centroid to perform faster spatial joins and now need to copy the information back into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';

If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or one of its syntax variants) to suspend auto-commit. Results are then not visible to other transactions before the transaction is actually committed with COMMIT. Such an open transaction might hang around indefinitely (which creates additional problems), or be rolled back by some later interaction.
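For illustration, a minimal psql sequence using the statement from the question; nothing is visible to other sessions before the final COMMIT:
BEGIN;  -- explicit transaction: auto-commit is suspended from here on

UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;

-- other sessions still see the old values at this point

COMMIT; -- only now do the changes become visible outside this transaction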
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as @jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, that can explain what you observed.
My suggested query fixes the observed problem only superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example of how to do that (chapter "Multiple matches..."):
Updating the value of a column
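A rough sketch of the idea, assuming (for lack of a specified rule) that the alphabetically first non-null quali_ambiental per sql is an acceptable pick; substitute your own ORDER BY:
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql)
          sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental  -- defines which candidate row wins per sql
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;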
Always include exact table definitions (CREATE TABLE statements) with any such question. It would save a lot of time otherwise wasted on speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.
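If so, a possible migration, with the obvious caveat that values like '0863690003' in your sample carry leading zeros, which a cast to bigint would silently drop:
-- only if leading zeros carry no meaning and all values are purely numeric
ALTER TABLE lotes_infos           ALTER COLUMN sql TYPE bigint USING sql::bigint;
ALTER TABLE sirgas_lotes_centroid ALTER COLUMN sql TYPE bigint USING sql::bigint;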

Unexpected sort order on postgres left outer join

Background
I'm using Postgres 11 and pgAdmin4 v5.2. The problem I describe below is on my dev machine which has both the postgres server and pgAdmin client.
Questions I've looked at on SO that deal with incorrect ordering seem to involve collation issues when ordering text fields, whereas my problem is with an integer field.
Setup
I have a table norm_plans that contains ~5k records.
Column   | Type
---------+------------------------
canon_id | integer
name     | character varying(200)
...      | (other fields)
canon_id is autopopulated using a sequence.
I've created a new table norm_plans_cmp as a copy of norm_plans (CREATE TABLE norm_plans_cmp AS TABLE norm_plans WITH DATA;)
I next insert some new records into norm_plans and update some existing records (fields other than canon_id).
The new records increment the sequence and are assigned canon_id values as expected.
I now want to compare norm_plans against norm_plans_cmp so I perform a left outer join:
select a.*, b.*
from norm_plans a
left outer join norm_plans_cmp b
on a.canon_id = b.canon_id
order by a.canon_id
Problem
I would expect records to be sorted by canon_id. This holds true from 1 to 2,000, but after 2,000 I get canon_ids from 5,001 to 5,111 (which is the last canon_id), and then it picks up again from 2,001. I'm viewing this data from pgAdmin; see screenshot 1 below showing the shift from 2,000 to 5,001, and screenshot 2 showing the transition again from 5,111 back to 2,001.
Additional observations
While incorrect, the ordering seems consistent. Running the query multiple times results in the same (incorrect) ordering.
Despite my question title, I'm not totally sure the left join has anything to do with this.
Running SELECT * ... ORDER BY canon_id on norm_plans or norm_plans_cmp alone also results in incorrect ordering, albeit at different points in the order.
Answers to this SO question suggest index corruption may be a contributing problem, but I have no indexes on either norm_plans or norm_plans_cmp (canon_id is not defined as a PK).
At this point, I'm stumped!

Optimise a simple update query for large dataset

I have some data migration that has to occur between a parent and child table. For the sake of simplicity, the schemas are as follows:
 ---------       -------------
 | event |       | parameter |
 ---------       -------------
 | id    |       | id        |
 | order |       | eventId   |
 ---------       | order     |
                 -------------
Because of an oversight in the business logic, we need to update parameter.order to match the parent event.order. I have come up with the following SQL to do that:
UPDATE "parameter"
SET "order" = e."order"
FROM "event" e
WHERE "eventId" = e.id
The problem is that this query hadn't finished after more than 4 hours and I had to clock out, so I cancelled it.
There are 11 million rows on parameter and 4 million rows on event. I've run EXPLAIN on the query and it tells me this:
Update on parameter (cost=706691.80..1706622.39 rows=11217313 width=155)
-> Hash Join (cost=706691.80..1706622.39 rows=11217313 width=155)
Hash Cond: (parameter."eventId" = e.id)
-> Seq Scan on parameter (cost=0.00..435684.13 rows=11217313 width=145)
-> Hash (cost=557324.91..557324.91 rows=7724791 width=26)
-> Seq Scan on event e (cost=0.00..557324.91 rows=7724791 width=26)
Based on this article, the "cost" referenced by EXPLAIN is an "arbitrary unit of computation".
Ultimately, this update needs to be performed, but I would accept it happening in one of two ways:
I am advised of a better way to do this query that executes in a timely manner (I'm open to all suggestions, including updating schemas, indexing, etc.)
The query remains the same but I can somehow get an accurate prediction of execution time (even if it's hours long). This way, at least, I can manage the expectations of the team. I understand that the exact time can't be known without actually running the query, but is there an easy way to "convert" these arbitrary units into some millisecond execution time?
Edit for Jim Jones' comment:
I executed the following query:
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
I got 9 identical rows like the following:
pid | locktype | mode | query | query_start | state
-------------------------------------------------------------------------
23192 | relation | AccessShareLock | <see below> | 2021-10-26 14:10:01 | active
query column:
--update parameter
--set "order" = e."order"
--from "event" e
--where "eventId" = e.id
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
Edit 2: I think I've been stupid here... The query produced by checking these locks is just the commented query. I think that means there's actually nothing to report.
If some rows already have the target value, you can skip those empty updates (which would otherwise be performed at full cost). Like:
UPDATE parameter p
SET "order" = e."order"
FROM event e
WHERE p."eventId" = e.id
AND p."order" IS DISTINCT FROM e."order"; -- this
If both "order" columns are defined NOT NULL, simplify to:
...
AND p."order" <> e."order";
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you have to update all or most rows - and can afford it! - writing a new table may be cheaper overall, like Mike already mentioned. But concurrency and dependent objects may stand in the way.
Aside: use legal, lower-case identifiers, so you don't have to double-quote. Makes your life with Postgres easier.
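For example (a hypothetical one-off rename, only worthwhile if the application code can follow suit):
ALTER TABLE parameter RENAME COLUMN "eventId" TO event_id;
-- "order" is a reserved word, so it needs double quotes even in lower case;
-- a name like sort_order avoids the quoting entirely:
ALTER TABLE parameter RENAME COLUMN "order" TO sort_order;
ALTER TABLE event     RENAME COLUMN "order" TO sort_order;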
The query will be slow because for each UPDATE operation, it has to look up the index by id. Even with an index, on a large table, this is a per-row read/write so it is slow.
I'm not sure how to get a good estimate, maybe do 1% of the table and multiply?
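One hedged way to act on that idea: run the same update on roughly 1% of the target rows inside a transaction that you roll back, time it, and extrapolate. The modulo filter below is just one arbitrary way to pick a sample; any cheap predicate that keeps the same plan shape will do:
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE "parameter" p
SET "order" = e."order"
FROM "event" e
WHERE p."eventId" = e.id
AND p.id % 100 = 0;  -- roughly 1% of rows, assuming ids are evenly distributed

ROLLBACK;  -- discard the sample update; only the timing matters
Keep in mind the hash on event is built in full either way, so treat the scaled-up number as a rough estimate rather than an exact prediction.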
I suggest creating a new table, then dropping the old one and renaming the new table.
CREATE TABLE parameter_new AS
SELECT
    parameter.id,
    parameter."eventId",
    e."order"
FROM
    parameter
    JOIN event AS "e" ON "e".id = parameter."eventId";
Later, once you verify things:
ALTER TABLE parameter RENAME TO parameter_old;
ALTER TABLE parameter_new RENAME TO parameter;
Later, once you're completely certain:
DROP TABLE parameter_old;
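One follow-up note on this approach: CREATE TABLE ... AS copies no indexes, constraints, defaults or privileges, so those need to be recreated on the new table before (or right after) the swap. A sketch, with assumed names:
ALTER TABLE parameter_new ADD PRIMARY KEY (id);
CREATE INDEX parameter_new_event_id_idx ON parameter_new ("eventId");
-- plus any foreign keys, defaults and grants the old table had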

Postgis DB Structure: Identical tables for multiple years?

I have multiple datasets for different years as shapefiles and converted them to Postgres tables, which confronts me with the following situation:
I got tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema; for now, let's focus on the following columns (the example is one row out of the boris2018 table). The rows represent actual PostGIS geometries with certain properties.
brw | brwznr | gema | entw | nuta
-----+--------+---------+------+------
290 | 285034 | Sieglar | B | W
The brwznr column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
brw18 | brw17 | brw16 | brwznr | gema | entw | nuta
-------+-------+-------+--------+---------+------+------
290 | 260 | 250 | 285034 | Sieglar | B | W
This led to some data getting lost (because no matching brwznr was found), some data being wrongly matched (due to inconsistencies in the data), and it didn't feel right.
What I actually want to achieve is having fast queries that get me the different brw values for a certain coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong / has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope anyone can help, your time and effort is highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
    SELECT ortst, brw, gema, gena, geom FROM boris2016
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2017
    UNION
    SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest one when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index using the geom field. Let me know in the comments if that is the case.
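A sketch of that merge-and-index step, assuming the yearly tables all carry a geom column alongside the attributes used above:
CREATE TABLE boris AS
SELECT 2016 AS year, ortst, brw, gema, gena, geom FROM boris2016
UNION ALL
SELECT 2017, ortst, brw, gema, gena, geom FROM boris2017
UNION ALL
SELECT 2018, ortst, brw, gema, gena, geom FROM boris2018;

CREATE INDEX boris_geom_idx ON boris USING gist (geom);

-- the point lookup then touches a single, indexed table:
SELECT year, ortst, brw, gema, gena
FROM boris
WHERE ST_Intersects(geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));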

Postgres/Postgis database - Query running infinitely with ST_MakeLine

I currently have a table called cardata, which amongst others has these columns:
+-------------+-----------------------------+---------------------------------+
| tripid(int) | point geometry(point, 4326) | line geometry(linestring, 4326) |
+-------------+-----------------------------+---------------------------------+
| 1           | <data>                      |                                 |
| 1           | <data>                      |                                 |
The point column contains sequential GPS measurements, and as such I would like to transform them into linestrings. A whole trip can contain thousands of points, and I want a linestring between every pair of consecutive points.
I have tried to formulate this as an update to my table which looks like this:
UPDATE cardata
SET line = ST_MakeLine(foo.point, lead)
FROM (
    SELECT point, LEAD(point, 1) OVER w
    FROM cardata
    GROUP BY point
    WINDOW w AS (ORDER BY point)
) AS foo
WHERE lead IS NOT NULL
The idea is that for each row I use that point and the next to make a linestring (ST_MakeLine) saved in the first row (line). This should continue until the trip ends, ignoring the last entry in which LEAD should be null.
For now I do not know how to formulate a single query that distinguishes between different tripid, but that is for another time. For now I just want linestrings between all points in the entire table.
The problem, however, is that my query seems to run forever and does not change anything in the table. I do not understand why. I tried testing that the inner SELECT query behaves as expected: it returns 46,561 rows out of 47,055 total rows. This is also odd, since I believe it should return 47,054 rows, i.e. only the last entry should have a NULL LEAD.
Lastly I tried running the ST_MakeLine on two random points, which seems to work fine.
What makes this query run forever?
A colleague of mine finally found a working solution, although neither of us quite understands what exactly was wrong with the original query.
As such I am still interested in the answer to my original question, namely what made the query run forever?
This is the working query, which works entirely as intended:
UPDATE cardata
SET line = resline
FROM (
    SELECT tripid,
           id,
           ST_MakeLine(point, next_point) AS resline
    FROM (
        SELECT tripid,
               id,
               point,
               lead(point)  OVER w AS next_point,
               lead(tripid) OVER w AS next_tripid
        FROM cardata
        WINDOW w AS (ORDER BY id)
    ) AS allcalclines
    WHERE tripid = 1 AND next_tripid = tripid
) AS calclines
WHERE cardata.id = calclines.id
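Since the question mentions that handling all tripids at once is still open, here is an untested sketch that partitions the window by tripid, which removes the need for the next_tripid comparison:
UPDATE cardata
SET line = ST_MakeLine(c.point, c.next_point)
FROM (
    SELECT id,
           point,
           lead(point) OVER (PARTITION BY tripid ORDER BY id) AS next_point
    FROM cardata
) AS c
WHERE cardata.id = c.id
AND c.next_point IS NOT NULL;  -- skip the last point of each trip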

optimizing postgres view for timestamps and aggregation of fields from another table

I've greatly simplified the examples to hopefully produce a clear enough question that can be answered:
Consider a table of events
CREATE TABLE alertable_events
(
unique_id text NOT NULL DEFAULT ''::text,
generated_on timestamp without time zone NOT NULL DEFAULT now(),
message_text text NOT NULL DEFAULT ''::text,
CONSTRAINT pk_alertable_events PRIMARY KEY (unique_id)
);
with the following data:
COPY alertable_events (unique_id,message_text,generated_on) FROM stdin;
one message one 2014-03-20 06:00:00.000000
two message two 2014-03-21 06:00:00.000000
three message three 2014-03-22 06:00:00.000000
four message four 2014-03-23 06:00:00.000000
five message five 2014-03-24 06:00:00.000000
\.
And for each event, there is a list of fields
CREATE TABLE alertable_event_fields
(
unique_id text NOT NULL DEFAULT ''::text,
field_name text NOT NULL,
field_value text NOT NULL DEFAULT ''::text,
CONSTRAINT pk_alertable_event_fields PRIMARY KEY (unique_id, field_name),
CONSTRAINT fk_alertable_event_fields_0 FOREIGN KEY (unique_id)
REFERENCES alertable_events (unique_id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
);
with the following data:
COPY alertable_event_fields (unique_id,field_name,field_value) FROM stdin;
one field1 a
one field2 b
two field1 z
two field2 y
three field1 a
three field2 m
four field1 a
four field2 b
five field1 z
five field2 y
\.
I want to define a view that produces the following:
| unique_id | fields | message_text | generated_on | updated_on | count |
| five | z|y | message five | 2014-03-21 06:00:00.000000 | 2014-03-24 06:00:00.000000 | 2 |
| four | a|b | message four | 2014-03-20 06:00:00.000000 | 2014-03-23 06:00:00.000000 | 2 |
| three | a|m | message three | 2014-03-22 06:00:00.000000 | 2014-03-22 06:00:00.000000 | 1 |
Notably:
fields is a pipe-delimited string (or any serialization) of the field values (a json encoding of field_name:field_value pairs would be even better ... but I can work with pipe-delimited for now)
the output is grouped by matching fields. Update 3/30 12:45am The values are ordered by their field_names alphabetically, therefore a|b would not match b|a
a count is produced of the events that match that field set. Update 3/30 12:45am There can be a different number of fields per unique_id; a match requires matching all fields, not a subset of the fields.
generated_on is the timestamp of the first event
updated_on is the timestamp of the most recent event
message_text is the message_text of the most recent event
I've produced this view, and it works for small data sets, however, as the alertable_events table grows, it becomes exceptionally slow. I can only assume I'm doing something wrong in the view because I have never dealt with anything quite so ugly.
Update 3/30 12:15PM EDT It looks like I may have server tuning problems causing these high run-times; see the added explain for more info. If you see a glaring issue there, I'd be greatly interested in tweaking the server's configuration.
Can anyone piece together a view that handles large datasets well and has a significantly better run time than this? Perhaps using hstore? (I'm running 9.2 preferably, though 9.3 if I can have a nice json encoding of the fields.)
Updated 3/30 11:30AM I'm beginning to think my issue may be server tuning (which means I'll need to talk to the SA). Here's a very simple explain (analyze,buffers) which shows a ridiculous run-time for as few as 8k rows in the unduplicated_event_fields.
Update 3/30 7:20PM I bumped work_mem to 5MB using SET work_mem='5MB' (which is plenty for the query below). Strangely, even though the planner switched to an in-memory quicksort, it actually took on average 100ms longer!
explain (analyze,buffers)
SELECT a.unique_id,
array_to_string(array_agg(a.field_value order by a.field_name),'|') AS "values"
FROM alertable_event_fields a
GROUP BY a.unique_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=771.11..892.79 rows=4056 width=80) (actual time=588.679..630.989 rows=4056 loops=1)
Buffers: shared hit=143, temp read=90 written=90
-> Sort (cost=771.11..791.39 rows=8112 width=80) (actual time=588.591..592.622 rows=8112 loops=1)
Sort Key: unique_id
Sort Method: external merge Disk: 712kB
Buffers: shared hit=143, temp read=90 written=90
-> Seq Scan on alertable_event_fields a (cost=0.00..244.40 rows=8112 width=80) (actual time=0.018..5.478 rows=8112 loops=1)
Filter: (message_name = 'LIMIT_STATUS'::text)
Buffers: shared hit=143
Total runtime: 632.323 ms
(10 rows)
Update 3/30 4:10AM EDT I'm still not completely satisfied and would be interested in any further optimization. I have a requirement to support 500msgs/sec steady state, and although most of those should not be "events", I get a little backlogged right now when stress testing.
Update 3/30 12:00PM EDT Here's my most readable iteration yet; unfortunately, for 4000 rows I'm still looking at 600ms runtimes! ... (see above, as it's mostly contained in the innermost query) Any help here would be greatly appreciated
CREATE OR REPLACE VIEW views.unduplicated_events AS
SELECT a.unique_id, a.message_text,
       b."values", b.generated_on, b.updated_on, b.count
FROM alertable_events a
JOIN (
    SELECT b."values",
           min(a.generated_on) AS generated_on,
           max(a.generated_on) AS updated_on,
           count(*) AS count
    FROM alertable_events a
    JOIN (
        SELECT a.unique_id,
               array_to_string(array_agg(a.field_value order by a.field_name), '|') AS "values"
        FROM alertable_event_fields a
        GROUP BY a.unique_id
    ) b USING (unique_id)
    GROUP BY b."values"
) b ON a.generated_on = b.updated_on
ORDER BY updated_on DESC;
Update 3/30 12:00PM EDT removed old stuff as this is getting too long
Some pointers
Invalid query
Your current query is incorrect unless generated_on is unique, which is not declared in the question and probably is not the case:
CREATE OR REPLACE VIEW views.unduplicated_events AS
SELECT ...
FROM alertable_events a
JOIN ( ... ) b ON a.generated_on=b.updated_on -- !! unreliable
Possibly faster
SELECT DISTINCT ON (f.fields)
       unique_id                                                 -- most recent
     , f.fields
     , e.message_text                                            -- most recent
     , min(e.generated_on) OVER (PARTITION BY f.fields) AS generated_on  -- "first"
     , e.generated_on AS updated_on                              -- most recent
     , count(*) OVER (PARTITION BY f.fields) AS ct
FROM   alertable_events e
JOIN  (
   SELECT unique_id, array_to_string(array_agg(field_value), '|') AS fields
   FROM  (
      SELECT unique_id, field_value
      FROM   alertable_event_fields
      ORDER  BY 1, field_name  -- a bit of a hack, but much faster
      ) f
   GROUP  BY 1
   ) f USING (unique_id)
ORDER  BY f.fields, e.generated_on DESC;
SQL Fiddle.
The result is currently sorted by fields. If you need a different sort order, you'd need to wrap it in another subquery ...
Major points
The output column name generated_on conflicts with the input column generated_on. You have to table-qualify the column e.generated_on to refer to the input column. I added table-qualification everywhere to make it clear, but it is only actually necessary in the ORDER BY clause. The manual:
If an ORDER BY expression is a simple name that matches both an
output column name and an input column name, ORDER BY will interpret
it as the output column name. This is the opposite of the choice that
GROUP BY will make in the same situation. This inconsistency is made
to be compatible with the SQL standard.
The updated query should also be faster (as intended all along). Run EXPLAIN ANALYZE again.
For the whole query, indexes will hardly be of use. Only if you select specific rows ... One possible exception: a covering index for alertable_event_fields:
CREATE INDEX f_idx1
ON alertable_event_fields (unique_id, field_name, field_value);
Lots of write operations might void the benefit, though.
array_agg(field_value ORDER BY ...) tends to be slower for big sets than pre-sorting in a subquery.
DISTINCT ON is convenient here. Not sure, whether it's actually faster, though, since ct and generated_on have to be computed in separate window functions, which requires another sort step.
work_mem: setting it too high can actually harm performance. More in the Postgres Wiki or in "Craig's list".
Generally this is hard to optimize. Indexes fail because the sort order depends on two tables. If you can work with a snapshot, consider a MATERIALIZED VIEW.
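If that route is open (materialized views exist since Postgres 9.3, so not on 9.2), a minimal sketch, with a hypothetical name for the materialized copy:
CREATE MATERIALIZED VIEW views.unduplicated_events_mv AS
SELECT * FROM views.unduplicated_events;

-- refresh as often as the data may go stale, e.g. from cron:
REFRESH MATERIALIZED VIEW views.unduplicated_events_mv;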