We are running a migration job in which creating a GIN index on a JSONB column takes too long. After investigating a bit, we think that increasing maintenance_work_mem (currently 120MB) would speed things up. But we are not sure whether this would interrupt the ongoing index creation or restart the instance. We are running PostgreSQL on GCP.
You can change maintenance_work_mem any time without disturbing active database sessions, but it won't have any effect on CREATE INDEX statements that are already running.
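A minimal sketch of how the setting might be raised for a single session before starting a fresh index build. The table name events, the column payload, the index name, and the value 1GB are placeholders, not taken from the question. On a managed service the instance-wide value is normally changed through the provider's configuration flags rather than in SQL.

```sql
-- Check the value the current session would use.
SHOW maintenance_work_mem;

-- Raise it for this session only. Other sessions, and any CREATE INDEX
-- that is already running, are unaffected.
SET maintenance_work_mem = '1GB';

-- Hypothetical table and column: an index build started in this
-- session after the SET uses the raised limit.
CREATE INDEX events_payload_gin ON events USING gin (payload);

-- Return to the configured default.
RESET maintenance_work_mem;
```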
I see that index cache hit rate (=80%) and table cache hit rate (=93%) are lower than they should be (>99%).
All frequent queries are using indexes and have execution time ~1ms.
Would removing unused indexes help increase the hit rates? What about adding more indexes?
Or is it time to upgrade the DB (especially by adding more RAM)?
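For reference, hit rates like the ones quoted above are commonly derived from the statistics views. The question does not say how its numbers were obtained, so the queries below are only one plausible way to measure them, together with a check for indexes that have never been scanned since the statistics were last reset.

```sql
-- Table (heap) cache hit rate across all user tables, in percent.
SELECT round(100.0 * sum(heap_blks_hit)
             / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2)
       AS table_hit_pct
FROM pg_statio_user_tables;

-- Index cache hit rate across all user indexes, in percent.
SELECT round(100.0 * sum(idx_blks_hit)
             / nullif(sum(idx_blks_hit) + sum(idx_blks_read), 0), 2)
       AS index_hit_pct
FROM pg_statio_user_indexes;

-- Candidate unused indexes: never scanned since the last stats reset.
-- Indexes backing primary key or unique constraints also show up here
-- and should not be dropped just because idx_scan is 0.
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```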
I just want to check that my understanding of these two things is correct. If it's relevant, I am using Postgres 9.4.
I believe that one should vacuum a database when looking to reclaim space from the filesystem, e.g. periodically after deleting tables or large numbers of rows.
I believe that one should analyse a database after creating new indexes, or (periodically) after adding or deleting large numbers of rows from a table, so that the query planner can make good calls.
Does that sound right?
vacuum analyze;
collects statistics and should be run more often the more dynamic the data is (especially after bulk inserts). It does not take an exclusive lock on objects. It puts load on the system, but it is worth it. It does not reduce the size of the table, but marks scattered freed-up space (e.g. from deleted rows) for reuse.
vacuum full;
reorganises the table by creating a copy of it and switching to it. This vacuum requires additional space to run, but reclaims all unused space of the object. It therefore requires an exclusive lock on the object (other sessions must wait for it to complete). It should be run as often as the data changes (deletes, updates) and when you can afford to make others wait.
Both are very important on a dynamic database.
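The difference between the two can be observed on a single table. This is a rough sketch; the table name orders is a placeholder, and the sizes reported depend entirely on how much dead space the table holds.

```sql
-- Plain VACUUM plus ANALYZE: does not block normal reads and writes,
-- marks dead space for reuse and refreshes planner statistics.
VACUUM (VERBOSE, ANALYZE) orders;

-- Size on disk (table, indexes, TOAST) before the rewrite.
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS size_before;

-- Rewrites the table under an exclusive lock and returns the
-- unused space to the operating system.
VACUUM FULL orders;

-- Size on disk after the rewrite.
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS size_after;
```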
Correct.
I would add that you can change the value of the default_statistics_target parameter (the default is 100) in the postgresql.conf file to a higher number. After that, reload the server configuration (a full restart is not required for this parameter) and run analyze to obtain more accurate statistics.
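The statistics target can also be raised without editing postgresql.conf. The sketch below shows two alternatives; the table orders, the column customer_id, and the value 500 are illustrative assumptions.

```sql
-- Alternative 1: raise the target for the current session only,
-- then refresh the statistics of one table.
SET default_statistics_target = 500;
ANALYZE orders;
RESET default_statistics_target;

-- Alternative 2: raise it permanently for one column that the planner
-- misestimates. This is stored with the table and used by later
-- manual and automatic ANALYZE runs.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;
```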
In my application I need to massively improve insert performance. Example: a file with about 21K records takes over 100 min to insert. There are reasons it can take some time, like 20 min or so, but over 100 min is just too long.
Data is inserted into 3 tables (many-to-many). IDs are generated from a sequence, but I have already googled and set hibernate.id.new_generator_mappings = true and allocationSize + sequence increment to 1000.
Also, the amount of data is nothing extraordinary; the file is 90 MB.
I have verified with VisualVM that most of the time is spent in the JDBC driver (PostgreSQL) and Hibernate. I think the issue is related to a unique constraint in the child table. The service layer makes a manual check (a SELECT) before inserting. If the record already exists, it reuses it instead of waiting for a constraint exception.
To sum up: for this specific file there will be one insert per table for each record (this can differ, but not for this file, which is the ideal, fastest case). That means 60k inserts + 20k selects in total. Still, over 100 min seems very long (yes, hardware counts, and this is a simple PC with a 7200 rpm drive, no SSD or RAID). However, this is an improved version of a previous application (plain JDBC), for which the same insert on this hardware took about 15 min. Considering that in both cases about 4-5 min is spent on "pre-processing", the increase is massive.
Any tips on how this could be improved? Is there any batch loading functionality?
see
spring-data JPA: manual commit transaction and restart new one
Add entityManager.flush() and entityManager.clear() after every n-th call to the save() method. If you use Hibernate, add hibernate.jdbc.batch_size=100, which seems like a reasonable choice.
Performance increase was > 10x, probably close to 100x.
Sounds like a database problem. Check whether your tables use InnoDB or MyISAM; the latter is, in my experience, very slow with inserts and is the default for new DBs. Remove foreign keys as far as you can.
If your problem really is related to a single unique index, InnoDB could do the trick.
The whole question is in the title:
if we kill a CLUSTER query on a 100-million-row table, will it be dangerous for the database?
The query has been running for 2 hours now, and I need to access the table tomorrow morning (hopefully 12 hours from now).
I thought it would be far quicker; my database is running on RAID SSDs and a Bi-Xeon processor.
Thanks for your wise advice.
Sid
No, you can kill the CLUSTER operation without any risk. Until the operation is done, nothing has changed in the original table and index files. From the manual:
When an index scan is used, a temporary copy of the table is created
that contains the table data in the index order. Temporary copies of
each index on the table are created as well. Therefore, you need free
space on disk at least equal to the sum of the table size and the
index sizes.
When a sequential scan and sort is used, a temporary sort file is also
created, so that the peak temporary space requirement is as much as
double the table size, plus the index sizes.
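A rough sketch of how the running CLUSTER could be located and cancelled from another session. It assumes a server version where pg_stat_activity exposes the pid and query columns (9.2 or later); the pid 12345 is a placeholder to be replaced with the value returned by the first query.

```sql
-- Find the backend that is running the CLUSTER and how long it has run.
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE query ILIKE 'cluster%';

-- Ask that backend to cancel its current statement. The transaction
-- rolls back and the original table and index files stay in place.
SELECT pg_cancel_backend(12345);

-- If the cancel request is not honoured, terminate the session instead.
SELECT pg_terminate_backend(12345);
```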
As @Frank points out, it is perfectly fine to do so.
Assuming you want to run this query in the future and assuming you have the luxury of a service window and can afford some downtime, I'd tweak some settings to boost the performance a bit.
In your configuration:
Turn off fsync, for higher throughput to the file system.
Fsync stands for file system sync. With fsync on, the database waits for the file system to commit on every page flush. Turn it back on before returning to production: with fsync off, a crash or power loss can corrupt the database beyond recovery.
Maximize your maintenance_work_mem.
It's OK to just take all the memory available, as it will not be allocated during production hours. I don't know how big your table and the index you are working on are, but things will run faster when they can be fully loaded into main memory.
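As a sketch of the memory part of this advice, the sizes can be checked first and the limit raised only for the session that runs the operation. The names big_table and big_table_idx and the value 8GB are placeholders; fsync is left out because it is changed in the server configuration, not per session.

```sql
-- How large are the table and its indexes? This gives a rough idea of
-- how much memory would be needed to keep the sort in RAM.
SELECT pg_size_pretty(pg_table_size('big_table'))   AS table_size,
       pg_size_pretty(pg_indexes_size('big_table')) AS indexes_size;

-- Raise the limit for this session only (placeholder value).
SET maintenance_work_mem = '8GB';

-- Rewrite the table in the order of the chosen index.
CLUSTER big_table USING big_table_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE big_table;

RESET maintenance_work_mem;
```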
I have a table with a few million tuples.
I perform updates on most of them.
The first update takes about a minute. The second takes two minutes. The third takes four minutes.
After that, I execute a VACUUM FULL.
Then, I execute the update again, which takes two minutes.
If I dump the database and recreate it, the first update will take one minute.
Why doesn't PostgreSQL performance get back to its maximum after a VACUUM FULL?
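VACUUM FULL does not compact the indexes. In fact, indexes can be in worse shape after performing a VACUUM FULL. After a VACUUM FULL, you should REINDEX the table. (This describes versions before 9.0; from 9.0 on, VACUUM FULL rewrites the table and rebuilds its indexes.)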
However, VACUUM FULL+REINDEX is quite slow. You can achieve the same effect of compacting the table and the indexes using the CLUSTER command, which takes a fraction of the time. It has the added benefit that it will order your table based on the index you choose to CLUSTER on. This can improve query performance. The downside of CLUSTER compared with VACUUM FULL+REINDEX is that it requires approximately twice the disk space while running. Also, be very careful with this command if you are running a version older than 8.3. In those versions it is not MVCC-safe and you can lose data.
Also, you can do a no-op ALTER TABLE ... ALTER COLUMN statement to get rid of the table and index bloat; this is the quickest solution.
Finally, any VACUUM FULL question should also address why you need to do this in the first place. It is almost always caused by incorrect vacuuming. You should be running autovacuum and tuning it properly so that you never have to run a VACUUM FULL.
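One way to check whether autovacuum is keeping up, and to make it more aggressive for a single heavily updated table, is sketched below. The table name orders and the threshold values are assumptions for illustration, not recommendations from the answer.

```sql
-- Tables with the most dead rows, and when they were last vacuumed.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Trigger autovacuum on this table after roughly 2% of its rows have
-- changed (plus a fixed 1000 rows) instead of the global default.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 1000
);
```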
The order of the tuples might be different, which results in different query plans. If you want a fixed order, use CLUSTER. Lower the FILLFACTOR as well and turn on autovacuum. And did you ANALYZE as well?
Use EXPLAIN to see how a query is executed.
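A short sketch of the last two suggestions. The table orders, its columns, and the fillfactor of 80 are hypothetical. Because EXPLAIN ANALYZE actually executes the statement, the update is wrapped in a transaction that is rolled back.

```sql
-- Leave 20% of each page free so that updated row versions can stay
-- on the same page. This applies to newly written pages; existing
-- pages are repacked only when the table is rewritten.
ALTER TABLE orders SET (fillfactor = 80);

-- Show the plan together with real timings and buffer usage,
-- without keeping the changes.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'archived'
WHERE created_at < DATE '2020-01-01';
ROLLBACK;
```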