How to update a large number of rows in PostgreSQL?

I need to update thousands of rows in a table. For example, I have 1000 rows with ids - 1, 2.. 1000:
mytable:
| id   | value1 | value2 |
|------|--------|--------|
| 1    | Null   | Null   |
| 2    | Null   | Null   |
| ...  |        |        |
| 1000 | Null   | Null   |
Now I need to change first 10 rows. I can do it like this:
UPDATE mytable SET value1=42, value2=111 WHERE id=1
...
UPDATE mytable SET value1=42, value2=111 WHERE id=10
This requires too many requests and is not very fast, so I decided to try this optimization:
UPDATE mytable SET value1=42 WHERE id in (1, 2, 3.. 10)
UPDATE mytable SET value2=111 WHERE id in (1, 2, 3.. 10)
Note: In this case I could actually write SET value1=42, value2=111, but in real-world applications these sets of ids are not the same: for some rows I need to set value1, for others value2, and for some subset of rows I need to set both. Because of that I need two queries.
The problem is that I have a very large number of ids. These queries end up being about 1 MB each!
Q1: Is this the right way to optimize these updates?
Q2: Is it OK to send queries that are this large? Can I get a faster update by dividing the query into several smaller parts?
I can't describe the rows with a WHERE condition; I just have lots of row ids in my program.

Create a TEMPORARY TABLE and populate it with your target ids and new values. Then use UPDATE with a FROM clause to join to that target table and do it in a single command.
In general, whenever you have a large number of ids/values like this, life gets easier if you move them into the database first.
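For illustration, here is a minimal sketch of that approach, using the table and column names from the question (the temp-table name, its layout, and the COALESCE handling of "only set one column" are assumptions, not part of the original answer):
CREATE TEMPORARY TABLE updates (
    id         integer PRIMARY KEY,
    new_value1 integer,   -- NULL means "leave value1 alone"
    new_value2 integer    -- NULL means "leave value2 alone"
);

INSERT INTO updates (id, new_value1, new_value2) VALUES
    (1, 42, 111),
    (2, 42, NULL),     -- only value1 changes for this row
    (3, NULL, 111);    -- only value2 changes for this row

-- One UPDATE joined to the staging table; COALESCE keeps the old value
-- where no new one was supplied.
UPDATE mytable m
SET    value1 = COALESCE(u.new_value1, m.value1),
       value2 = COALESCE(u.new_value2, m.value2)
FROM   updates u
WHERE  m.id = u.id;
In practice you would fill the temporary table with COPY or a multi-row INSERT from your program, which is far cheaper than sending thousands of individual UPDATE statements.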

Q1: Is this the right way to optimize these updates?
It should still be possible to write it as one single query using the CASE ... WHEN construct:
UPDATE mytable SET
    value1 = CASE
                 WHEN id IN ( 1,  2,  3, 10) THEN 42
                 WHEN id IN (11, 12, 13, 20) THEN 43
                 ELSE value1
             END,
    value2 = CASE
                 WHEN id IN ( 1,  2,  3, 10) THEN 42
                 WHEN id IN (11, 12, 13, 20) THEN 43
                 ELSE value2
             END;
etc.
You mentioned that you may have to update rows in multiple spots, and the above lets you do that without problem in one single query.
Update: I overlooked the fact that speed was your main concern (you said "optimize"), and my answer is not correct in that regard. Using a temporary table as explained in the chosen answer leads to much better performance.
Q2: Is it OK to send queries that are this large? Can I get a faster update by dividing the query into several smaller parts?
I don't think PostgreSQL should have much of a problem handling a large query (even one much larger than 1 MB). Remember that SQL database initialization scripts can be way larger than 1 MB.

Related

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected; the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both null, that's OK); this is a sample of the result of the join:
    sql     | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
 1880040001 |                 | PA 10
 1880040001 |                 | PA 10
 0863690003 |                 | PA 4
 0850840001 |                 | PA 4
 3090500003 |                 | PA 4
 1330090001 |                 | PA 10
 1201410001 |                 | PA 9
 0550620002 |                 | PA 6
 0430790001 |                 | PA 1
 1340180002 |                 | PA 9
I used QGIS to visualize the results and could not find any clues as to why this is happening. The sirgas_lotes_centroid table comes from the other table, its geometry being the centroid of the polygon. I used the centroid to perform faster spatial joins and now need to place the information into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Then results are not visible to other transactions before the transaction is actually committed with COMMIT. It might hang indefinitely (which creates additional problems), or be rolled back by some later interaction.
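As a minimal illustration in psql, using the UPDATE from the question:
BEGIN;                                     -- explicit transaction; auto-commit is off from here

UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;

COMMIT;                                    -- only now do other sessions see the 647 updated rows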
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as #jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example of how to do that (chapter "Multiple matches..."):
Updating the value of a column
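For illustration only, a hedged sketch of one way to make the pick deterministic with DISTINCT ON; the rule "smallest quali_ambiental per sql wins" is an assumption, substitute whatever rule fits your data:
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql)
          sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental   -- defines which duplicate wins
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;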
Always include exact table definitions (CREATE TABLE statements) with any such question. This would save a lot of time wasted on speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.

Counting entries depending on foreign key

I have 2 tables, let's call them T_FATHER and T_CHILD, where each father can have multiple children, like so:
T_FATHER
--------------------------
ID - BIGINT, from Generator
T_CHILD
-------------------------------
ID - BIGINT, from Generator
FATHER_ID - BIGINT, Foreign Key
Now I want to add a counter to the T_CHILD table that starts at 1 and increments by 1 for every new child, not globally but per father, like:
ID | FATHER_ID | COUNTER
---+-----------+--------
 1 |         1 |       1
 2 |         1 |       2
 3 |         2 |       1
My initial thought was creating a before-insert trigger that counts how many children are present for the given father and adds 1 for the counter. This should work fine unless there are 2 inserts at the same time, which would end up with the same counter. Chances are high that this never actually happens - but better safe than sorry.
I don't know if it is possible to use a generator, but I don't think so, as there would have to be a generator per father.
My current approach is using the aforementioned trigger and adding a unique index on FATHER_ID + COUNTER, so that only one of the simultaneous inserts goes through. I will have to handle the exception client-side (and reattempt the failed insert).
Is there a better way to handle this directly in Firebird?
PS: There won't be any deletes on any of the two tables, so this is not an issue.
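For reference, the trigger-plus-unique-index approach described in the question might look roughly like this in Firebird; the object names are made up and the trigger body is an untested sketch, assuming a plain COUNTER column has been added to T_CHILD:
-- Unique index so that two concurrent inserts cannot end up with the same counter.
CREATE UNIQUE INDEX UQ_CHILD_FATHER_COUNTER ON T_CHILD (FATHER_ID, COUNTER);

SET TERM ^ ;
CREATE TRIGGER T_CHILD_BI FOR T_CHILD
ACTIVE BEFORE INSERT POSITION 0
AS
  DECLARE VARIABLE CNT INTEGER;
BEGIN
  -- Count existing children of the same father, then add 1.
  SELECT COUNT(*)
  FROM T_CHILD
  WHERE FATHER_ID = NEW.FATHER_ID
  INTO :CNT;

  NEW.COUNTER = CNT + 1;
END^
SET TERM ; ^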
Even with a generator per FATHER_ID you couldn't use them for this, because generators are not transaction safe. If your transaction is rolled back for whatever reason, the generator will have increased anyway, causing gaps.
If there are no deletes, I think your approach with a unique constraint is valid. I would consider an alternative however.
You could decide not to store the counter as such – storing counters in a database is often a bad idea. Instead, only track the insertion order. For that, a generator is usable, because every new record will have a higher value and gaps won't matter. In fact, you don't need anything but the ID you already have. Determine the numbering when selecting; for every child you want to know how many children there are with the same father but a lower ID. As a bonus, deletes would work normally.
Here's a proof of concept using a nested query:
SELECT ID, FATHER_ID,
       (SELECT 1 + COUNT(*)
        FROM   T_CHILD AS OTHERS
        WHERE  OTHERS.FATHER_ID = C.FATHER_ID
        AND    OTHERS.ID < C.ID) AS COUNTER
FROM   T_CHILD AS C
There's also the option of a window function. It has to have a derived table to also count any rows that are ultimately not being selected:
SELECT * FROM (
    SELECT ID, FATHER_ID,
           ROW_NUMBER() OVER (PARTITION BY FATHER_ID ORDER BY ID) AS COUNTER
    FROM   T_CHILD
    -- Filtering that wouldn't affect COUNTER (e.g. WHERE FATHER_ID ... AND ID < ...)
)
-- Filtering that would affect COUNTER (e.g. WHERE ID > ...)
These two options have completely different performance characteristics. Which one, if either at all, is suitable for you depends on your data size and access patterns.
What about using a computed field together with the SELECT solution from Thijs van Dien?
CREATE TABLE T_CHILD (
    ID INTEGER,
    FATHER_ID INTEGER,
    COUNTER COMPUTED BY (
        (SELECT 1 + COUNT(*)
         FROM   T_CHILD AS OTHERS
         WHERE  OTHERS.FATHER_ID = T_CHILD.FATHER_ID
         AND    OTHERS.ID < T_CHILD.ID)
    )
);
During the insert, you should just do a "Select...count + 1" directly inside that field.
But I would probably reconsider adding that field in the first place. It feels like redundant information that could easily be deduced at the moment you need it. (For example, by using DENSE_RANK: http://www.firebirdfaq.org/faq343/)

Select until row matches in postgresql?

Is there a way to select rows until some condition is met? I.e. a type of limit, but not limited to N rows, but to all the rows until the first non-matching row?
For example, say I have the table:
CREATE TABLE t (id SERIAL PRIMARY KEY, rank INTEGER, value INTEGER);
INSERT INTO t (rank, value) VALUES ( 1, 1), (2, 1), (2,2),(3,1);
that is:
test=# SELECT * FROM t;
id | rank | value
----+------+-------
1 | 1 | 1
2 | 2 | 1
3 | 2 | 2
4 | 3 | 1
(4 rows)
I want to order by rank, and select rows up until the first row whose value is over 1.
I.e. SELECT * FROM t ORDER BY rank UNTIL value>1
and I want the first 2 rows back?
One solution is to use a subquery and bool_and:
SELECT * FROM
    (SELECT id, rank, value, bool_and(value < 2) OVER (ORDER BY rank, id) AS ok
     FROM t ORDER BY rank) t2
WHERE ok = true
But won't that end up going through all rows, even if I only want a handful?
(Real-world context: I have timestamped events in a table. I can use a lead/lag window query to get the time between two events; I want all events from now going back as long as they happened less than 10 minutes apart. The lead/lag window query complicates things, so this is a simplified example.)
edit: made window-function order by rank, id
What you want is a sort of stop-condition. As far as I am aware there is no such thing in SQL, at least PostgreSQL's dialect.
What you can do is use a PL/PgSQL procedure to read rows from a cursor and return them until the stop condition is met. It won't be super fast, but it'll be alright. It's just a FOR loop over a query with an IF expression THEN exit; ELSE return next; END IF;. No explicit cursor is required because PL/PgSQL will use one internally if you FOR loop over a query.
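A minimal sketch of such a function for the example table t (the function name and the hard-coded stop condition value > 1 are assumptions for this example):
CREATE OR REPLACE FUNCTION rows_until_stop()
RETURNS SETOF t
LANGUAGE plpgsql AS
$$
DECLARE
    r t%ROWTYPE;
BEGIN
    -- PL/pgSQL opens an implicit cursor for this FOR loop.
    FOR r IN SELECT * FROM t ORDER BY rank, id LOOP
        IF r.value > 1 THEN
            EXIT;            -- stop condition met: stop fetching rows
        ELSE
            RETURN NEXT r;   -- emit the row and keep going
        END IF;
    END LOOP;
END;
$$;

-- SELECT * FROM rows_until_stop();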
Another option is to create a cursor and read chunks of rows from it in the application, then discard part of the last chunk once the stop condition is met.
Either way, a cursor is going to be what you want.
A stop expression wouldn't actually be too hard to implement in PostgreSQL by the way. You'd have to implement a new executor node type, but the new CustomScan support would make that practical to do in an extension. Then you'd just evaluate an expression to decide whether or not to carry on fetching rows.
You can try something such as:
select * from t, (
select rank from t where value = 1 order by "rank" limit 1) x
where t.rank <= x.rank order by rank;
It will make two passes through the first part of the table (which you might be able to cut by creating an index on (rank, value = 1)) but shouldn't evaluate the rest of the table if you have an index on rank.
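If it helps, indexes matching that hint might look like this (a sketch; the index names are made up):
-- Serves the subquery's "WHERE value = 1 ORDER BY rank LIMIT 1" lookup.
CREATE INDEX t_rank_where_value_1_idx ON t (rank) WHERE value = 1;

-- Serves the outer "t.rank <= x.rank ORDER BY rank" scan.
CREATE INDEX t_rank_idx ON t (rank);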
[If you could have window expressions in WHERE clauses you could use a window expression to make sure no previous row had value = 1.. but even if this were possible, getting the query evaluator to use it to limit the search would be yet another challenge.]
This may be no better than your solution, since you yourself asked, "won't that end up going through all rows?"
I can tell you this -- the explain plan is different from that of your solution. I don't know how the guts of PostgreSQL work, but if I were writing a "max" function, I would think it would always be O(n). By contrast, you had an ORDER BY, which is average case O(n log n), worst case O(n^2).
That said, I cannot deny that this will go through all rows:
select * from sandbox.t
where id < (select min (id) from sandbox.t where value > 1)
One thing to clarify, though, is that unless you scan all rows, I'm not sure how you could determine the minimum value. Any time you invoke an aggregate concept across all records, doesn't that mean that you must read all rows?

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo
As far as I understand, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync - writing to foo_rows for every write to foo (also a performance penalty).
So is there any way to query for distinct row(partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read: full table scan) and then filter the unique values for the result set. In some cases, this query will simply time out. Given that, it's probably not what you were looking for.
You mentioned that you were worried about a performance hit from inserting into multiple tables. Multiple-table inserts are a perfectly valid data modeling technique. Cassandra can do an enormous amount of writes. As for it being a pain to sync, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think about partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application queries. If I were using IP addresses as the row key, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
    first_quad int,
    last_quads ascii,
    PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
According to the documentation, from CQL version 3.11 Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
#edofic
Partition (row) keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+----------+-------------------+------------------+
|  row key |   column1/value   |  column2/value   |
+----------+-------------------+------------------+
|     1    |      1/'1-1'      |      2/'1-2'     |
|     2    |      1/'2-1'      |                  |
+----------+-------------------+------------------+

lock the rows until next select postgres

Is there a way in Postgres to lock rows until the next select query execution from the same system? And one more thing: there will be no update process on the locked rows.
The scenario is something like this:
If table1 contains data like
id | txt
-------------------
1 | World
2 | Text
3 | Crawler
4 | Solution
5 | Nation
6 | Under
7 | Padding
8 | Settle
9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then it should lock rows with id 1 to 5 for other systems that are executing select statements concurrently.
Later, if sys1 executes another select query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block read-only access to a table (unless that select is done FOR UPDATE).
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from
(
  select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
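As a small illustration of the unlock step (the id values are hypothetical, standing in for whatever ids the earlier query returned and you stored):
-- Release the five advisory locks obtained earlier, one per stored id.
SELECT pg_advisory_unlock(id)
FROM   (VALUES (1), (2), (3), (4), (5)) AS v(id);

-- Or, if this session holds no other advisory locks:
SELECT pg_advisory_unlock_all();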
It looks like you really have a transaction which transcends the borders of your database, and all the change happens in another system.
My idea is to SELECT ... FOR UPDATE NOWAIT to lock the relevant rows, then offload the data into the other system, then ROLLBACK to unlock the rows. No two SELECT ... FOR UPDATE queries will select the same row, and the second select will fail immediately rather than wait and proceed.
But you don't seem to mark offloaded records in any way; I don't see why two non-consecutive selects won't happily select an overlapping range. So I'd still update the records with a flag and/or a target user name and would only select records with the flag unset, as sketched below.
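A rough sketch of that flow, assuming a processed flag column has been added to table1 (it is not in the original schema):
BEGIN;

-- Lock the next batch; a concurrent transaction hitting the same rows
-- fails immediately instead of waiting.
SELECT *
FROM   table1
WHERE  processed IS NOT TRUE
ORDER  BY id
LIMIT  5
FOR UPDATE NOWAIT;

-- ... offload the selected rows to the other system here ...

-- Mark them so later batches skip them (the ids are whatever was just offloaded).
UPDATE table1 SET processed = true WHERE id IN (1, 2, 3, 4, 5);

COMMIT;   -- or ROLLBACK to release the locks without marking anything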
I tried both select...for update and pg_try_advisory_lock and managed to get near my requirement.
/* rows are locking, but limit is the problem */
select * from table1 where pg_try_advisory_lock(id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // no. of rows to process
.
.
/* after each processed word */
$_SESSION['rows'] -= 1;
.
.
/* and finally unlock the locked rows */
if ($_SESSION['rows'] === 0)
    select pg_advisory_unlock_all() from table1;
But there are two problems with this:
1. As the LIMIT applies before the lock, every time the same rows end up being attempted for locking by the different instances.
2. Not sure whether pg_advisory_unlock_all will unlock only the rows locked by the current instance, or those of all instances.