I have db table which has around 5-6 Mn entries and it is taking around 20 minutes to perform vacuuming. Since, one field of this table is updated very frequently, thereare a lot of dead rows to deal with.
For an estimate, with our current user base it can have 2 Million dead tuples on daily basis. So, vacuuming of this table requires both:
Read IO: as the whole table is not present in shared memory.
Write IO: as there are a lot of entries to update.
What should be an ideal way to vacuum this table? Should I increase the autovacuum_cost_limit to allow more operations per autovacuum run? But as i can see, it will increase IOPS, which again might hinder the performance. Currently, I have autovacuum_scale_factor = 0.2. Should I decrease it? If I decrease it it will run more often, although write IO will decrease but it will lead to more number of time period with high read IO.
Also, as the user base will increase it will take more and more time as the size of table with increase and vacuum will have to read a lot from disk. So, what should I do?
One of the solution I have thought of:
Separate the highly updated column and make a separate table.
Tweaking the parameter to make it run more often to decrease write IO(as discussed above). How to handle more Read IO, as vacuum will now run more often?
Combine point 2 along with increasing RAM to reduce Read IO as well.
In general what is the approach that people takes, because I assume people must have very big table 10GB or more, that needs to be vacuumed.
Separating the column is a viable strategy but would be a last resort to me. PostgreSQL already has a high per-row overhead, and doing this would double it (which might also remove most of the benefit). Plus, it would make your queries uglier, harder to read, harder to maintain, easier to introduce bugs. Where splitting it would be most attractive is if index-only-scans on a set of columns not including this is are important to you, and splitting it out lets you keep the visibility map for those remaining columns in a better state.
Why do you care that it takes 20 minutes? Is that causing something bad to happen? At that rate, you could vacuum this table 72 times a day, which seems to be way more often than it actually needs to be vacuumed. In v12, the default value for autovacuum_vacuum_cost_delay was dropped 10 fold, to 2ms. This change in default was not driven by changes in the code in v12, but rather by the realization that the old default was just out of date with modern hardware in most cases. I would have no trouble pushing that change into v11 config; but I don't think doing so would address your main concern, either.
Do you actually have a problem with the amount of IO you are generating, or is it just conjecture? The IO done is mostly sequential, but how important that is would depend on your storage hardware. Do you see latency spikes while the vacuum is happening? Are you charged per IO and your bill is too high? High IO is not inherently a problem, it is only a problem if it causes a problem.
Currently, I have autovacuum_scale_factor = 0.2. Should I decrease it?
If I decrease it it will run more often, although write IO will
decrease but it will lead to more number of time period with high read
IO.
Running more often probably won't decrease your write IO by much if any. Every table/index page with at least one obsolete tuple needs to get written, during every vacuum. Writing one page just to remove one obsolete tuple will cause more writing than waiting until there are a lot of obsolete tuples that can all be removed by one write. You might be writing somewhat less per vacuum, but doing more vacuums will make up for that, and probably far more than make up for it.
There are two approaches:
Reduce autovacuum_vacuum_cost_delay for that table so that autovacuum becomes faster. It will still consume I/O, CPU and RAM.
Set the fillfactor for the table to a value less than 100 and make sure that the column you update frequently is not indexed. Then you could get HOT updates which don't require VACUUM.
Related
On RDS PostgreSQL instance, below are the vacuum parameters set.
autovacuum_vacuum_cost_delay 5
autovacuum_vacuum_cost_limit 3000
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 0
From what I understand from this blog - https://www.2ndquadrant.com/en/blog/autovacuum-tuning-basics/
There is a cost associated with if the page being vacuumed is in shared buffer or not. If the vacuum_cost_page_miss is set to zero, I am thinking its going to assume the cost of reading from disk is free and since the cost limit is set to 3000, the autovacuum will be performing lots of IO until it reaches the cost limit. Is my understanding correct? Would it mean something else by setting this parameter to 0?
0 is not a special value here, it means the same thing it does in ordinary arithmetic. So, no throttling will get applied on the basis of page misses.
This is my preferred setting for vacuum_cost_page_miss. Page misses are inherently self-limiting. Once a page is needed and not found, then the process stalls until the page is read. No more read requests will get issued by that process while it is waiting. This is in contrast to page dirtying. There is nothing other than vacuum_cost_page_dirty-driven throttling to prevent the vacuum process from dirtying pages far faster than they can be written to disk, leading to IO constipation which will then disturb everyone else on the system.
If you are going to reduce vacuum_cost_page_miss to zero, you should also set vacuum_cost_page_hit to zero. Having the latter high than the former is weird. Maybe whoever came up with those setting just figured that 1 was already low enough, so there was no point in changing yet another setting.
vacuum_cost_page_miss throttling could be particularly bad before v9.6 (when freeze map was introduced) when freezing large tables which have hit autovacuum_freeze_max_age, but have seen few changes since the last freeze. PostgreSQL will charge vacuum_cost_page_miss for every page, even though most of them will be found already in the kernel page cache (but not in shared_buffers) through the magic of readahead. So it will slow-walk the table as if it were doing random reads, while doing no useful work and holding the table lock hostage. This might be the exact thing that lead your predecessor to make the changes he made.
the autovacuum will be performing lots of IO until it reaches the cost limit.
Autovacuum once begun has a mostly fixed task to do, and will do the amount of IO it needs to do to get it done. At stake is not how much IO it will do, but over how much time it does it.
These settings are silly in my opinion.
Setting the cost for a page found in shared buffers higher than the cost for a page read from disk does not make any sense. Also, if you want to discount I/O costs, why leave vacuum_cost_page_dirty at 20?
Finally, increasing the cost limit and leaving the cost delay at a high value like 5 (the default from v12 on is 2) can only be explained if RDS is based on a PostgreSQL version older than v12.
It feels like whoever dabbled with the autovacuum settings had a vague idea to make autovacuum more aggressive, but didn't understand it well enough to do it right. I think the default settings of v12 are better.
In the following link, the creator of a tool I use (Airflow) suggests to create partitions for daily snapshots of dimension tables. I am wondering about the overhead of doing something like this in Postgres.
I am using the Postgres 10 built in partitioning for several tables, but mostly at a monthly or yearly level for facts. I never tried implementing a daily partition for dimensions before and it seems scary. It would simplify things though in several areas for me in case I need to rerun old tasks.
https://medium.com/#maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
Simple. With dimension snapshots where a new partition is appended at
each ETL schedule. The dimension table becomes a collection of
dimension snapshots where each partition contains the full dimension
as-of a point in time. “But only a small percentage of the data
changes every day, that’s a lot of data duplication!”. That’s right,
though typically dimension tables are negligible in size in proportion
to facts. It’s also an elegant way to solve SCD-type problematic by
its simplicity and reproducibility. Now that storage and compute are
dirt cheap compared to engineering time, snapshoting dimensions make
sense in most cases.
While the traditional type-2 slowly changing dimension approach is
conceptually sound and may be more computationally efficient overall,
it’s cumbersome to manage. The processes around this approach, like
managing surrogate keys on dimensions and performing surrogate key
lookup when loading facts, are error-prone, full of mutations and
hardly reproducible.
I have worked with systems with different levels of partitioning.
Generally any partitioning is OK as long as you have check constrains on partitions which allow query planner to find adequate partitions for query. Or you will have to query specific partition directly for some special cases. Otherwise you will see sequential scans over all partitions even for simple queries.
Daily partitions are completely OK do not worry. And I worked event with data collector based on PG which needed to have partitions for every 5 minutes of data because it collected several TBs per day.
Number of partitions can become a bigger problem only when you have several thousands or dozens of thousands of partitions - with this amount of partitions everything goes to different level of problems.
You will have to set proper max_locks_per_transaction for example to be able to work with them. Because even simple select over parent table places SharedAccessLock over all partitions which is not exactly nice but PG inheritance works this way.
Plus higher planing time for query - in our data warehouse we sometimes see planning times for queries like several minutes and queries taking only seconds - which is a bit craped... But it is hard to do anything with it because current PG planner works this way.
But PROs still overweight CONs so I highly recommend to use any partitioning granularity you need.
Quick question from a PostgreSQL (relative) newb:
We run a batch process that, as its final step, deletes most of the previous batches.
Disk space is a concern, so we need to ensure that PostgreSQL cleans up after itself.
Other than forcing PostgreSQL to garbage-collect faster, is there any difference between explicitly calling VACUUM at the end of the batch vs. letting the auto-VACUUM daemon handle it? Is there any reason to recommend one approach vs. the other?
Thanks!
Way back when there was one vacuum, and it was full and blocking. Then PostgreSQL guys added non-blocking vacuum. But you still had to schedule it yourself.
Then some genius made a daemon that ran vacuum automatically for you when the tables needed it. It uses the exact same vacuum command you or I would use, but has a lot of settings, especially default ones, that make it run slower and less intrusively. Primarily these settings are for worker threads (default 3), delay cost (20ms for autovac, 0ms for regular vac) and autovacuum cost delay limit (-1, i.e. use system setting which is 200).
Therefore, regular vacuum is VERY aggressive with no cost delay, and will run as hard and fast as your IO subsystem will let it. It basically competes with your regular workload for IO bandwidth.
Generally you can do one of two things in your situation:
One: Make autovacuum more aggressive. By lowering the autovacuum_vacuum_cost_delay from 20 to something in the 2 to 5 range it will run much faster but still not get in the way too much.
Two: Run regular vacuums by hand. Since regular vacuums, by default, have no cost_delay, this will be the fastest but also the most distruptive.
Decision is yours based on usage patterns etc.
I have to process many millions of data records. A data record has a record-type string at the beginning of a record. Processing is record-type-dependent but does not require to 'if'/'elsif' the type, just selecting an array-slice mask from a hash.
However, on the order of once-per-million I might encounter a record type that require a totally different kind of processing.
I hate to insert an 'if' testing for this record type that will return 'true' so rarely.
Any suggestions?
Thanks
Meir
The answer is: Don't worry about it.
The speed of your CPU is considerably higher than that of your disk IO, so an if test is just not going to make a lot of difference - even if you ignored e.g. branch prediction algorithms.
An SSD will do about 1500 IO operations per second, and to quote Borodin from the comments:
A reasonable average disk read speed is 100MB per second. Say your records are 100 bytes each, that means you can read 1 million records per second, or 1μs per record. A 2011 Intel Core i5 processor runs at 83,000 MIPS, and so can
execute 83,000 instructions in the time taken to read one record. It is pointless to avoid a few test and branch instructions amongst all that.
Basically this is true in any code - your IO to storage is almost always your limiting factor, because CPUs have followed Moore's law, but the actual rotational speed of a spinning disk hasn't really changed in 15+ years. SSDs are something of a revolutionary change, but they're still too expensive to use as bulk storage options (and even if that wasn't true, they're still going to be the bottleneck on a sustained data transfer/processing operation).
I am currently configuring a server that will run a kdb+ tickerplant with several subscription processes. Is there an optimal physical memory type for realtime kdb data?
Checkout the type sizes at http://code.kx.com/q/ref/card/#datatypes
Answer depends on what you mean by "efficient" - by the far the largest hit you take in latency is memory allocation, so the less you have to allocate the better. That means smaller types.
But of course you have to weigh that up against your use cases.
For your realtime always make sure the tickerplant inserts the time column so that #s is maintained on the time column for efficient querying.
The tickerplant itself publishes on a timer - the longer the timer the less hit on cpu, but then the tp is collecting data for a while before publishing. Again, weigh up against use cases. BTW make sure your tickerplant is writing the log file to a fast local disk so as to decrease pub delay and iowait.
If you're operating high load from multiple sources, consider OS tweaks too like tcp quickack ( http://www.techrepublic.com/article/take-advantage-of-tcp-ip-options-to-optimize-data-transmission/). There's similar tweaks for memory allocation and disk i/o.