I have to update multiple records in a table in the most efficient way possible: with the lowest latency and without heavy CPU utilisation. The number of records to update at a time can range from 1 to 1000.
We do not want to lock the database while this update occurs, because other services are using it.
Note: There are no dependencies generated from this table towards any other table in the system.
After looking in many places, I've narrowed it down to a few ways to do the task:
simple-update: A plain UPDATE on the table using the already known ids, as either
multiple UPDATE queries (one query per individual record), or
a single UPDATE ... FROM query, as mentioned here (one query for all records; a sketch follows the constraints below)
delete-then-insert: First delete the outdated data, then insert the updated data with new ids (since there is no dependency on the records, new ids are acceptable)
insert-then-delete: First insert the updated records with new ids, then delete the outdated data using the old ids (since there is no dependency on the records, new ids are acceptable)
temp-table: First insert the updated records into a temporary table, then update the original table from the temporary table, and finally drop the temporary table
We must not drop the existing table and create a new one in its place.
We must not truncate the existing table, because it holds a huge number of records that we cannot keep in buffer memory.
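For reference, the single-query variant of simple-update would look roughly like this (a sketch only; my_table, id and value are placeholder names, not my real schema):

-- Sketch: update many known ids in a single statement.
-- my_table, id and value are placeholder names.
UPDATE my_table AS t
SET    value = v.value
FROM   (VALUES
           (1, 'new value 1'),
           (2, 'new value 2'),
           (3, 'new value 3')
       ) AS v(id, value)
WHERE  t.id = v.id;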
I'm open to any more suggestions.
Also, what will be the impact of doing the update all at once versus doing it in batches of 100, 200 or 500?
References:
https://dba.stackexchange.com/questions/75631
https://dba.stackexchange.com/questions/230257
As Frank Heikens mentioned in the comments, I'm sure that different people will see different statistics based on their system design. I did some checks and found some insights to share for one of my development systems.
Configuration of the system used:
AWS
Engine: PostgreSQL
Engine version: 12.8
Instance class: db.m6g.xlarge
Instance vCPU: 4
Instance RAM: 16GB
Storage: 1000 GiB
I used a Lambda function and the pg package to write data into a table (default FILLFACTOR) that contains 3,409,304 records.
Both the Lambda function and the database were in the same region.
UPDATE 1000 records into the database with a single query
Run | Time taken
1   | 143.78ms
2   | 115.277ms
3   | 98.358ms
4   | 98.065ms
5   | 114.78ms
6   | 111.261ms
7   | 107.883ms
8   | 89.091ms
9   | 88.42ms
10  | 88.95ms
UPDATE 1000 records in the database in 2 concurrent batches of 500 records (one query per batch)
Run | Time taken
1   | 43.786ms
2   | 48.099ms
3   | 45.677ms
4   | 40.578ms
5   | 41.424ms
6   | 44.052ms
7   | 42.155ms
8   | 37.231ms
9   | 38.875ms
10  | 39.231ms
DELETE + INSERT 1000 records into the database
Run | Time taken
1   | 230.961ms
2   | 153.159ms
3   | 157.534ms
4   | 151.055ms
5   | 132.865ms
6   | 153.485ms
7   | 131.588ms
8   | 135.99ms
9   | 287.143ms
10  | 175.562ms
I did not go on to test updating records via a buffer (temporary) table because I had found my answer.
Looking at the database metrics graphs provided by AWS, it was clear that DELETE + INSERT was more CPU intensive, and from the statistics shared above it also took more time than UPDATE.
If the updates are done concurrently in batches, they will indeed be faster, depending on the number of connections (a connection pool is recommended).
Using a buffer table, TRUNCATE, and other methods might be more suitable approaches if you need to update almost all the records in a giant table, though I currently do not have metrics to support this. For a limited number of records, however, UPDATE is a fine choice to proceed with.
Also, be mindful that if a DELETE + INSERT is not executed properly and fails partway, you might lose records, and if an INSERT + DELETE fails you might end up with duplicate records.
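One way to guard against that (a sketch, not something I benchmarked; my_table, id and value are placeholder names) is to wrap both statements in a single transaction so they either both succeed or both roll back:

-- Sketch: DELETE + INSERT as one atomic unit.
BEGIN;
DELETE FROM my_table WHERE id IN (1, 2, 3);
INSERT INTO my_table (id, value)
VALUES (101, 'new value 1'),
       (102, 'new value 2'),
       (103, 'new value 3');
COMMIT;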
Since Nov 08 2022, 16h UTC, we sometimes get the following DatastoreException with code: UNAVAILABLE, and message:
Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.
I want to get all Keys of a certain kind of entity. These are returned in batches together with a new cursor. When using the cursor to get the next batch, the above error happens. I would not expect the query to time out so quickly. (It might take up to a few seconds until I request the next batch of Keys using the returned cursor, but this never used to be a problem in the past.)
There was no problem before the automatic upgrade to Firestore. Also, counting entities of a kind often results in the error DatastoreException: "The datastore operation timed out, or the data was temporarily unavailable."
I am wondering whether I have to make any changes on my side. Does anybody else encounter these problems with Firestore in Datastore mode?
What is meant by "an updated index configuration"?
Thanks
Stefan
I just wanted to follow up here since we were able to do a detailed analysis and come up with a workaround. I wanted to record our findings here for posterity's sake.
The root of the problem is queries over large ranges of deleted keys. Given a schema like:
Kind: ExampleKind
Data:
Key              | lastUpdatedMillis
ExampleKind/1040 | 5
ExampleKind/1052 | 0
ExampleKind/1064 | 12
ExampleKind/1065 | 100
ExampleKind/1070 | 42
Datastore automatically generates both an ASC and a DESC index on the lastUpdatedMillis field. The lastUpdatedMillis ASC index table would have the following logical entries:
Index Key | Entity Key
0         | ExampleKind/1052
5         | ExampleKind/1040
12        | ExampleKind/1064
42        | ExampleKind/1070
100       | ExampleKind/1065
In the workload you've described, there was an operation that did the following:
1. SELECT * FROM ExampleKind WHERE lastUpdatedMillis <= nowMillis()
2. For every ExampleKind entity returned by the query, perform some operation that updates lastUpdatedMillis.
3. Some of the updates may fail, so the query from step 1 is repeated to catch any remaining entities.
When the operation completes, there are large key ranges in the index tables that are deleted, but in the storage system these rows still exist with special deletion markers. They are visible internally to queries, but are filtered out of the results:
Index Key | Entity Key
x         | xxxx
x         | xxxx
x         | xxxx
42        | ExampleKind/1070
...       | and so on ...
x         | xxxx
When we repeat the query over this data, if the number of deleted rows is very large (100,000 to 1,000,000), the storage system may spend the entire operation looking for non-deleted data in this range. Eventually the garbage collection and compaction mechanisms will remove the deleted rows, and querying this key range becomes fast again.
A reasonable workaround is to reduce the amount of work the query has to do by restricting the time range of the lastUpdatedMillis field.
For example, instead of scanning the entire range of lastUpdatedMillis < now, we could break the query up into:
(now - 60 minutes) <= lastUpdatedMillis < now
(now - 120 minutes) <= lastUpdatedMillis < (now - 60 minutes)
(now - 180 minutes) <= lastUpdatedMillis < (now - 120 minutes)
This example uses 60-minute ranges; the specific "chunk size" can be tuned to the shape of your data. These smaller queries will either succeed and find some results, or scan the entire key range and return 0 results, but in both scenarios they will complete within the RPC deadline.
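Concretely, each chunk would look something like this (a sketch in the same pseudo-GQL style as the query above; nowMillis() and the 60-minute window are placeholders):

-- One chunk of the original scan, restricted to a 60-minute window.
SELECT * FROM ExampleKind
WHERE lastUpdatedMillis >= (nowMillis() - 60 * 60 * 1000)
  AND lastUpdatedMillis < nowMillis()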
Thank you again for reaching out about this!
A couple of notes:
This deadlining query problem could occur with any kind of query over the index (projection, keys-only, full entity, etc.).
Despite what the error message says, no extra index is needed here, nor would one speed up the operation. Datastore's built-in ASC/DESC index on each field already exists and is serving this query.
A very simple update to reset one column in a table with approximately 5 million rows:
UPDATE t_Daily
SET Price = NULL;
Price is not part of any of the indexes on that table.
Running this without indexes takes 45s.
Running this with one or more indexes takes at least 20 mins (I keep having to stop it).
I fully understand why maintaining indexes affects the performance of INSERT and UPDATE statements, but this update makes no changes to any indexed column, so why does it have such a terrible effect on performance?
Any ideas much appreciated.
That is normal and expected: updating an index can be about ten times as expensive as updating the table itself, because the table has no ordering to maintain, while the index does.
If price is not indexed, you can use HOT updates that avoid updating the indexes. To make use of that, the table has to be defined with a fillfactor under 100 so that updated rows can find room in the same block as the original rows.
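For example, something along these lines (a sketch; t_Daily is the table from the question and 70 is just an illustrative fillfactor):

-- Leave ~30% free space in each block so updated row versions can stay
-- in the same block, allowing HOT updates that skip index maintenance.
ALTER TABLE t_Daily SET (fillfactor = 70);

-- The new fillfactor only applies to newly written blocks; VACUUM FULL
-- rewrites the table to honour it (note: it takes an exclusive lock).
VACUUM FULL t_Daily;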
Found some further info (thanks to Laurenz-Albe for the HOT tip).
This link https://malisper.me/postgres-heap-only-tuples/ states that
Due to MVCC, an update in Postgres consists of finding the row being updated, and inserting a new version of the row back into the database. The main downside to doing this is the need to readd the row to every index
So it is re-writing the indexes even though the update only touches a column that is not in any of them.
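To check whether updates are actually taking the HOT path, one option (a sketch; t_Daily as above, stored in lower case by Postgres) is to compare the update counters in pg_stat_user_tables:

-- n_tup_hot_upd counts updates that avoided index maintenance (HOT);
-- n_tup_upd counts all updates.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM   pg_stat_user_tables
WHERE  relname = 't_daily';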
We have around 90 million rows in a new PostgreSQL table in an RDS instance. It contains two numbers, start_num and end_num (bigint, mostly finance related), and details related to those numbers. The PK is on (start_num, end_num) and the table is CLUSTERed on it. The query will always be a range query: the input is a number, and the output is the range in which this number falls, along with the details. For example, there is a row with start_num = 112233443322 and end_num = 112233543322. The input comes in as 112233443645, so the row containing (112233443322, 112233543322) needs to be returned.
select start_num, end_num from ipinfo.ipv4 where input_value between start_num and end_num;
This always results in a sequential scan and the PK is not used. I have tried creating separate indexes on start_num and end_num DESC, but there is not much change in execution time. We are looking for a response time of less than 300 ms. Now I am wondering whether that is even possible in PostgreSQL for range queries on large data sets, or whether this is due to PostgreSQL running on AWS RDS.
Looking forward to some advice on steps to improve the performance.
I have a Redshift cluster with a single dc1.large node. I've got data writing into it, on the order of 50 million records a day, in the format of a timestamp, a user ID and an item ID. The item ID (varchar) is unique, the user ID (varchar) is not, and the timestamp (timestamp) is not.
In my redshift DB of about 110m records, if I have a table with no sort key, it takes about 30 seconds to search for a single item ID.
If I have a table with a sort key on item ID, I get a single item ID search time of about 14-16 seconds.
If I have a table with an interleaved sort key on all three columns, the single item ID search time is still 14-16 seconds.
What I'm hoping to achieve is the ability to query for the records of thousands or tens of thousands of item IDs on order of a second.
The query just looks like
select count(*) from rs_table where itemid = 'id123';
or
select count(*) from rs_table where itemid in ('id123','id124','id125');
This query comes back in 541ms
select count(*) from rs_table;
AWS documentation suggests that there is a compile time for queries the first time they're run, but I don't think that's what I'm seeing (and it would not be ideal if it were, since each unique set of 10,000 IDs might never be queried in exactly the same order again).
I have to assume I'm doing something wrong with either the sort key design, the query, or some combination of the two. For only ~10 GB of table space, something like Redshift shouldn't take this long to query, right?
Josh,
We probably need a few additional pieces of information to give you a good recommendation.
Here are some things to start thinking about.
Are most of your queries record lookups as you describe above?
What is your distribution key?
Do you join this table with other large fact tables?
If you load 50M records per day and you only have 110M records in the table, does that mean that you only store 2 days?
Do you do massive deletes and then load another 50M records per day?
Do you run ANALYZE after your loads?
If you deleted a large number of records, did you run VACUUM?
If all of your queries are similar to the ones that you describe, why are you using Redshift? Amazon DynamoDB or MongoDB (even Cassandra) would be great database choices for the types of queries that you describe.
If you run analytical workloads, Redshift is an excellent platform. If you are more interested in "record lookups", the NoSQL options, as well as MySQL or MariaDB, might give you better performance.
Also, if this is a dev/test environment and you have loaded and deleted large amounts of data without ever running a VACUUM you would see significant performance degradation.
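For reference, the maintenance steps mentioned above would look roughly like this (a sketch, using the rs_table name from the question):

-- Re-sort rows and reclaim space from deleted rows, then refresh
-- the planner statistics.
VACUUM rs_table;
ANALYZE rs_table;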
We have a Postgres process that works as follows:
1) A CSV file is split into a table with a row per record.
2) pgAgent runs a Postgres function that reads each record and writes it to a new table, as either a new record or an update.
3) A trigger runs on the new table and, depending on the record value, runs a plv8 function to update its data (there's a fair bit of JSON processing involved and plv8 was the easiest way to code it). The second update comes from plv8, and we've used the pattern below:
query = plv8.prepare('...');
query.execute(<params>);
query.free();
When we monitor this, we see that processing 5000 records uses 14 GB of virtual memory. So something is awry, as each CSV record is < 1 KB in size. This became acute after we added a new index to the table.
Where should we look for solutions to this? Is this normal, and is it linked to the indexes being updated in the transaction, or to another factor?