Postgres: updating unchanged rows

Say I have the following query:
UPDATE table_name
SET column_name1 = column_value1, ..., column_nameN = column_valueN
WHERE id = M
The thing is that column_value1, ..., column_valueN have not changed. Will this query actually be executed, and how does its performance compare to an update with data that really changed? What if I have about 50 such queries per page with unchanged data?

You need to help PostgreSQL here by specifying only the changed columns and rows. It will go ahead and perform the update on whatever you specify, without checking whether the data has changed.
P.S. This is where an ORM comes in handy.
EDIT: You may also be interested in "How can I speed up update/replace operations in PostgreSQL?", where the OP went to great lengths to speed up UPDATE performance, when the best performance can be achieved by updating only the changed data.
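As a minimal sketch of "updating only the changed data" in plain SQL, assuming a hypothetical table table_name with two columns and made-up values, the WHERE clause can filter out rows whose values would stay the same. The second statement shows PostgreSQL's built-in suppress_redundant_updates_trigger() as an alternative; the trigger name is an assumption.

```sql
-- Hypothetical table, columns and values, for illustration only.
UPDATE table_name
SET column_name1 = 'value1',
    column_name2 = 'value2'
WHERE id = 42
  -- Touch the row only if at least one value would actually change.
  -- IS DISTINCT FROM compares NULLs like ordinary values, unlike <>.
  AND (column_name1 IS DISTINCT FROM 'value1'
       OR column_name2 IS DISTINCT FROM 'value2');

-- Alternative: a built-in trigger function that silently skips no-op updates.
-- The name starts with "z" so that it fires after other BEFORE UPDATE triggers.
-- On PostgreSQL 10 and older, write EXECUTE PROCEDURE instead of EXECUTE FUNCTION.
CREATE TRIGGER z_skip_redundant_updates
BEFORE UPDATE ON table_name
FOR EACH ROW EXECUTE FUNCTION suppress_redundant_updates_trigger();
```

The trigger adds a small cost to every update, so it probably only pays off when a noticeable share of the updates really are no-ops.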

Related

Querying timestamp column In q

I want to count the number of records inserted into a kdb+ database using a q query.
Currently I am using the query below:
count select from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
It works, but it is not very performant. Any recommendations to make it more efficient would be highly appreciated.
Thank you for your help.
If you only want the count, use 'count i' inside the select, as below:
q) select count i from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
This returns only the count instead of fetching the full data, which is what your query does and is one of the reasons it takes longer.
And if it is a partitioned database, add 'date' to the filter, as @Callum Biggs mentioned.
Given the information you have provided, I'm assuming you're querying on-disk data, likely saved in a standard date-partitioned structure. In that case you should specify a date clause before the time clause; this prevents searching all the date directories.
select from executionTable where date=2019.09.07, ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
I'd suggest reading through the whitepaper on query optimization; it gives guidance on good query structure and on how to take advantage of map-reduce in kdb.

Are idx_scan statistics reset automatically (default)?

I was looking at the tables (pg_stat_user_indexes and pg_stat_user_tables) and discovered many indices that are not being used.
But before I think about removing these indexes, I need to understand what period the idx_scan data covers. Has it been collected since the database was created?
In the pg_stat_database view (column stats_reset) there is a date that is normally today or up to 15 days ago. Does that reset also affect the views I mentioned above?
No pg_stat_reset() command was executed.
Does the pg_stat_reset() command clear the tables (pg_stat_user_indexes and pg_stat_user_tables)?
My goal is to understand the period of data collected so that I can make a decision.
Statistics are cumulative and are kept from the time of cluster creation on.
So if you see the pg_stat_database.stats_reset change regularly, there must be somebody or something doing that explicitly with the pg_stat_reset() function.
Doing so is somewhat problematic, because this resets all statistics, including those in pg_stat_user_tables which govern when autovacuum and autoanalyze take place. So after a reset these will be a little out of whack until autoanalyze has collected new statistics.
The better way is to take regular snapshots and calculate the difference.
You are right that you should collect data over a longer time before you determine if an index can be canned or not. For example, some activity may only take place once a month, but require certain indexes.
Before dropping indexes, consider that indexes also serve other purposes besides being scanned:
They can be UNIQUE or back a constraint, in which case they serve a purpose even when they are never scanned.
Indexes on expressions make PostgreSQL collect statistics on the distribution of the indexed expression, which can have a notable effect on query planning and the quality of your execution plans.
You could use the query in this blog to find all the indexes that serve no purpose at all.
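The blog's query is not reproduced here. As a rough sketch of the same idea, and not the blog author's actual query, the following lists indexes that were never scanned since the last statistics reset and that are neither unique, nor backing a constraint, nor defined on an expression. It only uses the standard catalogs pg_stat_user_indexes, pg_index and pg_constraint.

```sql
-- Candidate indexes for removal: never scanned and serving none of the
-- "other purposes" listed above. Review each one before dropping anything.
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       s.idx_scan,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes AS s
JOIN pg_index AS i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0                 -- no scans since the last reset
  AND NOT i.indisunique              -- not enforcing uniqueness
  AND i.indexprs IS NULL             -- not an expression index
  AND NOT EXISTS (                   -- not backing any constraint
        SELECT 1
        FROM pg_constraint AS c
        WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;
```

The result is only as meaningful as the observation period, so it is worth checking pg_stat_database.stats_reset first.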
By default, only a superuser is allowed to reset statistics. The query planner depends on them indirectly, because autoanalyze is triggered from these counters.
Use snapshots:
CREATE TABLE stat_idx_snap_m10_d29_16_12 AS SELECT * FROM pg_stat_user_indexes;
CREATE TABLE stat_idx_snap_m10_d29_16_20 AS SELECT * FROM pg_stat_user_indexes;
Analyze the difference at any later time:
SELECT
s2.relid, s2.indexrelid, s2.schemaname, s2.relname, s2.indexrelname,
s2.idx_scan - s1.idx_scan as idx_scan,
s2.idx_tup_read - s1.idx_tup_read as idx_tup_read,
s2.idx_tup_fetch - s1.idx_tup_fetch as idx_tup_fetch
FROM stat_idx_snap_m10_d29_16_20 s2
FULL OUTER JOIN stat_idx_snap_m10_d29_16_12 s1
ON s2.relid = s1.relid AND s2.indexrelid = s1.indexrelid
ORDER BY s2.idx_scan - s1.idx_scan ASC;

PostgreSQL slow update with index

A very simple update to reset one column in a table with approximately 5 million rows:
UPDATE t_Daily
SET Price= NULL
Price is not part of any of the indexes on that table.
Running this without indexes takes 45s.
Running this with one or more indexes takes at least 20 mins (I keep having to stop it).
I fully understand why maintaining indexes affects the performance of insert and update statements, but this update makes no changes to the table indexes so why does it have this terrible effect on performance?
Any ideas much appreciated.
That is normal and expected: updating an index can be about ten times as expensive as updating the table itself. The table has no ordering, whereas an index has to keep its entries sorted!
If price is not indexed, you can use HOT updates that avoid updating the indexes. To make use of that, the table has to be defined with a fillfactor under 100 so that updated rows can find room in the same block as the original rows.
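A minimal sketch of that setup, assuming the table from the question (the unquoted name t_Daily folds to t_daily) and an assumed fillfactor of 50, chosen so that a new version of every row could fit on the same page during a full-table update:

```sql
-- Leave half of every page free for new row versions.
-- 50 is an assumed value for this full-table update, not a general recommendation.
ALTER TABLE t_daily SET (fillfactor = 50);

-- fillfactor only applies to pages written from now on;
-- rewriting the table repacks the existing rows accordingly.
VACUUM FULL t_daily;

UPDATE t_daily SET price = NULL;

-- Compare all updates with HOT updates to see how often the indexes were skipped.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 't_daily';
```

Note that VACUUM FULL takes an exclusive lock and roughly doubles the table's size on disk at this fillfactor, so whether the trade-off is worth it depends on how often such updates run.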
Found some further info (thanks to Laurenz-Albe for the HOT tip).
This link https://malisper.me/postgres-heap-only-tuples/ states that
Due to MVCC, an update in Postgres consists of finding the row being updated, and inserting a new version of the row back into the database. The main downside to doing this is the need to readd the row to every index
So every index gets a new entry for the new row version, even though the updated column is not part of any index.

Partial index not being used in psql 8.2

I would like to run a query on a large table along the lines of:
SELECT DISTINCT user FROM tasks
WHERE ctime >= '2012-01-01' AND ctime < '2013-01-01' AND parent IS NULL;
There is already an index on tasks(ctime), but most (75%) of rows have a non-NULL parent, so that's not very effective.
I attempted to create a partial index for those rows:
CREATE INDEX CONCURRENTLY task_ctu_np ON tasks (ctime, user)
WHERE parent IS NULL;
but the query planner continues to choose the tasks(ctime) index instead of my partial index.
I'm using postgresql 8.2 on the server, and my psql client is 8.1.
First, I second Richard's suggestion that upgrading should be at the top of your priorities. Partial indexes and related areas have, as I understand it, improved significantly since 8.2.
The second thing is you really need the actual query plans with timing information (EXPLAIN ANALYZE) because without these we can't talk about selectivity, etc.
So my order of business if I were you would be to upgrade first and then tune after that.
Now, I understand that 8.3 is a big upgrade (it is the only one that caused us issues in LedgerSMB). You may need some time to address that, but the alternative is to fall further behind and keep asking questions about a version that fewer and fewer people remember well.
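For reference, getting the plan with timing information for the query in the question could look like the sketch below. It assumes the column is literally named user, which has to be double-quoted because an unquoted user is a reserved word in PostgreSQL that evaluates to the current user name.

```sql
-- EXPLAIN ANALYZE actually runs the query and reports real row counts and timings.
EXPLAIN ANALYZE
SELECT DISTINCT "user"
FROM tasks
WHERE ctime >= '2012-01-01'
  AND ctime <  '2013-01-01'
  AND parent IS NULL;
```

Comparing the estimated and actual row counts in that output is usually the first step in seeing why the planner prefers the tasks(ctime) index.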

Efficient way to insert millions of rows, convert the data, and work with it in PostgreSQL + PostGIS

I have a large collection of data that I want to use for user searches later.
Currently I have 200 million resources (~50 GB). For each one I have a latitude and a longitude. The goal is to create a spatial index so that I can run spatial queries on it.
So the plan is to use PostgreSQL + PostGIS.
My data is in a CSV file. I tried using a custom function to avoid inserting duplicates, but after days of processing I gave up. I then found a way to load it into the database quickly: with COPY it takes less than 2 hours.
Then I need to convert latitude and longitude to the geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
After some checking, I saw that the 200 million resources share 50 million distinct points. So I think the best approach is to have a table "TABLE_POINTS" that stores all the points and a table "TABLE_RESOURCES" that stores the resources with a point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from the temporary table "TABLE_TEMP" without keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
FROM TABLE_TEMP
I don't remember how long it took, but I think it was a matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again it takes days, and there is no way to see how far the query has progressed ...
Also important: the number of resources will keep growing, currently by about 100K per day, so storage should be optimized to keep access to the data fast.
So if you have any ideas for the loading or for optimizing the storage, they are welcome.
Look into optimizing Postgres first (e.g. search for Postgres unlogged tables and the WAL and fsync options). Second, do you really need the points to be unique? Maybe just have one table with resources and points combined and don't worry about duplicate points, since your duplicate lookup may be what's slow.
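As a sketch of the unlogged-table idea, assuming a hypothetical column layout for the staging table and a placeholder CSV path (neither is given in the question), the bulk load could skip WAL writes like this:

```sql
-- UNLOGGED tables are not written to the WAL, which speeds up bulk loads,
-- but their contents are truncated after a crash. Fine for a staging table
-- that can be reloaded from the CSV.
CREATE UNLOGGED TABLE table_temp (
    resource_id bigint,  -- hypothetical column
    longi       text,    -- kept as text, cast to double precision later
    lat         text
);

-- Placeholder path; the file must be readable by the database server process.
COPY table_temp (resource_id, longi, lat)
FROM '/path/to/resources.csv'
WITH (FORMAT csv, HEADER true);
```

Turning fsync off is a different kind of trade-off: it risks corrupting the whole cluster on a crash, so it is only reasonable on a throwaway instance.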
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster than creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.
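Put together, that order of operations might look like the following sketch. It assumes the staging table is table_temp with the longi and lat columns from the question; the index name is made up. It also deduplicates on the raw columns before building the geometry, which is an assumption on my part rather than something the question does.

```sql
-- 1. The data is already loaded into table_temp (for example with COPY).

-- 2. Build the index once, after loading, and refresh planner statistics.
CREATE INDEX table_temp_longi_lat_idx ON table_temp (longi, lat);
ANALYZE table_temp;

-- 3. Check whether DISTINCT can now use the index.
EXPLAIN
SELECT DISTINCT longi, lat
FROM table_temp;

-- 4. Deduplicate on the raw columns, then build one geometry per distinct pair.
--    Caveat: if the columns are text, '1.0' and '1.00' count as different values.
INSERT INTO table_points (point)
SELECT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM (SELECT DISTINCT longi, lat FROM table_temp) AS d;
```

Whether the planner actually picks the index depends on the table statistics and the PostgreSQL version, so the EXPLAIN output is the thing to trust here.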