WAL usage by vacuum - postgresql

In the PostgreSQL log file we can find some statistics about how vacuum works on a table.
Why does vacuum need to write to the WAL when cleaning up a table?
pages: 0 removed, 1640 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 99960 remain, 0 are dead but not yet removable, oldest xmin: 825
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 8.878 MB/s, avg write rate: 0.000 MB/s
buffer usage: 48 hits, 1 misses, 0 dirtied
**WAL usage: 1 records**, 0 full page images, 229 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

VACUUM modifies the table by deleting old row versions, freezing rows and truncating empty pages at the end of the table.
All modifications to a table are logged in WAL, and that applies to VACUUM as well. After all, you need to replay these activities after a crash.
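For example, here is a minimal sketch for reproducing such a log entry with a throwaway table (the name wal_demo is made up); on recent PostgreSQL versions, VACUUM (VERBOSE) reports a similar "WAL usage" line for the work it performed:
CREATE TABLE wal_demo (id int PRIMARY KEY, val text);
INSERT INTO wal_demo SELECT g, 'x' FROM generate_series(1, 100000) g;
DELETE FROM wal_demo WHERE id % 10 = 0;   -- create dead row versions
VACUUM (VERBOSE) wal_demo;                -- removing them generates WAL records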

Related

n_dead_tup grows despite pretty aggressive autovacuum settings, drops only after index scan

I have a relatively big (~2 TB, ~20 billion rows) database with just two tables:
autovacuum_max_workers = 6
autovacuum_vacuum_threshold = 40000
autovacuum_vacuum_insert_threshold = 100000
autovacuum_analyze_threshold = 40000
autovacuum_vacuum_scale_factor = 0
autovacuum_vacuum_insert_scale_factor = 0
autovacuum_vacuum_cost_delay = 2
autovacuum_vacuum_cost_limit = 3000
maintenance_work_mem = 4GB
With the current load, autovacuum starts every minute or two (which I thought was good, since more vacuuming = better).
But the problem is that n_dead_tup steadily increases over time, and vacuum time grows exponentially as well.
Then after about an hour or two, an index scan happens, with a huge spike in IO, and it sometimes hangs the system for long enough for the app to crash.
Is there anything I can do to prevent these spikes? Maybe force index scans more often?
Here's the graph of n_dead_tup steadily growing, despite autovacuums happening.
And when the index scan finally happens, the spike in IO is enormous, making the rest of the chart look like a flat line.
And in the logs, I can see that each autovacuum takes about twice as much time as the previous one did, until an index scan happens:
2022-11-12 19:51:05.036 UTC [67209] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 5.01 s, system: 3.98 s, elapsed: 16.70 s
2022-11-12 19:52:16.269 UTC [67782] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 6.38 s, system: 6.32 s, elapsed: 27.73 s
2022-11-12 19:55:27.550 UTC [68810] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 7.22 s, system: 10.33 s, elapsed: 38.50 s
and after some time
2022-11-12 21:30:38.277 UTC [96988] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 29.21 s, system: 66.48 s, elapsed: 261.49 s
2022-11-12 22:05:23.651 UTC [98593] LOG: automatic vacuum of table "public.initial_scores": index scans: 1
system usage: CPU: user: 1278.22 s, system: 239.80 s, elapsed: 2062.45 s
There are no long-running queries which could prevent vacuum from cleaning up properly, and there are no prepared queries.
Per request, adding the verbose output of a manual VACUUM:
vacuuming "ryd-db.public.initial_scores"
launched 4 parallel vacuum workers for index vacuuming (planned: 4)
finished vacuuming "ryd-db.public.initial_scores": index scans: 1
pages: 0 removed, 122251463 remain, 13228190 scanned (10.82% of total)
tuples: 275918 removed, 1578025058 remain, 241296 are dead but not yet removable
removable cutoff: 3503893034, which was 1596319 XIDs old when operation ended
index scan needed: 2593283 pages from table (2.12% of total) had 3961159 dead item identifiers removed
index "pk_initial_scores": pages: 7896774 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_date_created": pages: 18729021 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_channel_id": pages: 3291717 in total, 12 newly deleted, 13 currently deleted, 10 reusable
index "ix_initial_scores_date_published": pages: 2625645 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_category": pages: 2164691 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 84.367 MB/s, avg write rate: 15.163 MB/s
buffer usage: 15364625 hits, 21630601 misses, 3887689 dirtied
WAL usage: 9888050 records, 6215601 full page images, 30970053477 bytes
system usage: CPU: user: 636.32 s, system: 120.53 s, elapsed: 2003.03 s
vacuuming "ryd-db.pg_toast.pg_toast_16418"
finished vacuuming "ryd-db.pg_toast.pg_toast_16418": index scans: 1
pages: 0 removed, 17770543 remain, 890550 scanned (5.01% of total)
tuples: 20012 removed, 87435806 remain, 793 are dead but not yet removable
removable cutoff: 3505488713, which was 94878 XIDs old when operation ended
index scan needed: 179949 pages from table (1.01% of total) had 283573 dead item identifiers removed
index "pg_toast_16418_index": pages: 2735506 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 257.567 MB/s, avg write rate: 13.215 MB/s
buffer usage: 945149 hits, 3758561 misses, 192836 dirtied
WAL usage: 413262 records, 98383 full page images, 448496268 bytes
system usage: CPU: user: 10.31 s, system: 9.47 s, elapsed: 114.00 s
analyzing "public.initial_scores"
"initial_scores": scanned 300000 of 122251463 pages, containing 3959953 live rows and 3253 dead rows; 300000 rows in sample, 1613700159 estimated total rows
Autovacuum run when index scan is triggered
2022-11-14 00:56:44.889 UTC [640096] LOG: automatic vacuum of table "ryd-db.public.initial_scores": index scans: 1
pages: 0 removed, 122251463 remain, 14421445 scanned (11.80% of total)
tuples: 325666 removed, 1577912706 remain, 309680 are dead but not yet removable
removable cutoff: 3496504042, which was 3600494 XIDs old when operation ended
index scan needed: 2469439 pages from table (2.02% of total) had 3578028 dead item identifiers removed
index "pk_initial_scores": pages: 7893180 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_date_created": pages: 18721360 in total, 1 newly deleted, 1 currently deleted, 1 reusable
index "ix_initial_scores_channel_id": pages: 3291207 in total, 6 newly deleted, 7 currently deleted, 6 reusable
index "ix_initial_scores_date_published": pages: 2625593 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_category": pages: 2164594 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 86.157 MB/s, avg write rate: 11.919 MB/s
buffer usage: 16609082 hits, 49454603 misses, 6841583 dirtied
WAL usage: 9554520 records, 5682531 full page images, 28223936896 bytes
system usage: CPU: user: 1262.91 s, system: 218.72 s, elapsed: 4484.40 s
2022-11-14 00:57:01.086 UTC [640096] LOG: automatic analyze of table "ryd-db.public.initial_scores"
avg read rate: 143.711 MB/s, avg write rate: 1.044 MB/s
buffer usage: 8007 hits, 297889 misses, 2165 dirtied
system usage: CPU: user: 9.14 s, system: 1.71 s, elapsed: 16.19 s
And without index scan
2022-11-14 01:09:21.788 UTC [664482] LOG: automatic vacuum of table "ryd-db.public.initial_scores": index scans: 0
pages: 0 removed, 122251463 remain, 7751023 scanned (6.34% of total)
tuples: 1346523 removed, 1596376142 remain, 248208 are dead but not yet removable
removable cutoff: 3500120771, which was 583770 XIDs old when operation ended
index scan bypassed: 1420208 pages from table (1.16% of total) have 1892380 dead item identifiers
avg read rate: 75.417 MB/s, avg write rate: 14.280 MB/s
buffer usage: 7023840 hits, 7103108 misses, 1344965 dirtied
WAL usage: 2307792 records, 585139 full page images, 3890305134 bytes
system usage: CPU: user: 82.42 s, system: 40.27 s, elapsed: 735.81 s
When vacuum does an index cleanup, it needs to read the entire index (or indexes). This inevitably means a lot of IO if the indexes don't fit in cache. This is in contrast to the table scan, which only needs to scan the part of the table not already marked all-visible. Depending on usage, that can be a small part of the entire table (which I assume is the case here; otherwise every autovacuum would be consuming huge amounts of IO).
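To check how much of the table a plain vacuum pass can skip, one option is the pg_visibility extension. A small sketch (assuming you can create extensions, and using the table name from your logs):
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) FILTER (WHERE all_visible) AS all_visible_pages,
       count(*)                            AS total_pages
FROM pg_visibility_map('initial_scores');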
You can force the index cleanup to occur on every vacuum by setting the vacuum_index_cleanup storage parameter on the table. But it is hard to see how this would be a good thing, since the cleanup is what is causing the problem in the first place.
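For reference, that would look like the following (a sketch; PostgreSQL 12 or later, table name taken from your logs):
ALTER TABLE initial_scores SET (vacuum_index_cleanup = on);   -- force the index cleanup phase on every vacuum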
Throttling the autovacuum so it can run the index cleanup without consuming all your IO capacity is probably the right solution, which it seems you already hit on. I would have done this by lowering cost_limit rather than by increasing cost_delay because that would be moving the settings more toward their defaults, but it probably doesn't make much of a difference.
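Per table, that could look like this sketch (the value 1000 is only illustrative; your global autovacuum_vacuum_cost_limit is 3000):
ALTER TABLE initial_scores SET (autovacuum_vacuum_cost_limit = 1000);   -- throttle autovacuum on this table only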
Also, if you partitioned the table by date (or something like that) such that most of the activity is concentrated in one or two of the partitions at a time, that could improve the situation. The cleanup scan would only need to be done on the indexes of the actively changing partitions, greatly decreasing the IO consumption.
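A very rough sketch of date-based declarative partitioning follows; the column names are guessed from the index names in your logs and are not the real schema:
CREATE TABLE initial_scores_part (
    id             bigint NOT NULL,
    channel_id     bigint,
    category       text,
    date_published date,
    date_created   timestamptz NOT NULL
) PARTITION BY RANGE (date_created);
CREATE TABLE initial_scores_2022_11 PARTITION OF initial_scores_part
    FOR VALUES FROM ('2022-11-01') TO ('2022-12-01');
CREATE INDEX ON initial_scores_2022_11 (channel_id);   -- per-partition indexes stay much smaller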
There probably isn't much point in vacuuming the table so aggressively only to bail out and not do the index cleanup scan. So I would also increase autovacuum_vacuum_threshold and/or autovacuum_vacuum_scale_factor.
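For example (again a sketch with illustrative numbers, set per table so the rest of the cluster keeps its current behaviour):
ALTER TABLE initial_scores SET (
    autovacuum_vacuum_threshold = 1000000,
    autovacuum_vacuum_scale_factor = 0.005
);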
However, one thing I can't explain here is the pattern in the n_dead_tup graph. I can readily reproduce the instant spike down at the end of each autovac (whether it did the index-cleanup or not), but I don't understand the instant spike up. In my hands it just ramps up, then spikes down, over and over. I am assuming you are using a version after 11, but which version is it?

Attributes applied on column "i" in a table in q kdb - Performance

When I run the query below to get the count of a table, the memory used and the time taken to run it are almost always the same. / table t has 97029 records and 124 cols
Q.1. Does column i in the query below use the unique attribute internally to return the output in constant time using a hash function?
\ts select last i from t where date=.z.d-5 / 3j, 1313248j
/ time taken to run the query and memory used are always the same, no matter how many times we run the same query
When I run the query below:
The first time, the time and memory required are very high, but from the next run onwards they are much lower.
Q.2. Does kdb cache the output when we run the query for the first time and serve it from the cache on subsequent runs?
Q.3. Is there an attribute applied to column i while running the query below? If so, which one?
\ts select count i from t where date=.z.d-5 / 1512j, 67292448j
\ts select count i from t where date=.z.d-5 / 0j, 2160j
When running the query below:
Q.4. Is any attribute applied to column i when running the query below?
\ts count select from t where date=.z.d-5 / 184j, 37292448j
/ time taken to run the query and memory used are always the same, no matter how many times we run it
Q.5. Which of these queries should be used to get the count of tables with a very high number of records? Is there any other query that is faster and consumes less memory for the same result?
There isn't a u# attribute applied to the i column; to see this:
q)n:100000
q)t:([]a:`u#til n)
q)
q)\t:1000 select count distinct a from t
2
q)\t:1000 select count distinct i from t
536
The timing of those queries isn't constant; there are just not enough significant figures to see the variability. Using
\ts:100 select last i from t where date=.z.d-5
will run the query 100 times and highlight that the timing is not constant.
The first query will request more memory be allocated to the q process, which will remain allocated to the process unless garbage collection is called (.Q.gc[]). The memory usage stats can be viewed with .Q.w[]. For example, in a new session:
q).Q.w[]
used| 542704
heap| 67108864
peak| 67108864
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)\t b: til 500000000
6569
q)
q).Q.w[]
used| 4295510048
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)b:0
q)
q).Q.w[]
used| 542768
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)\t b: til 500000000
877
q)
q).Q.w[]
used| 4295510048
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
Also, assuming the table in question is partitioned, the query shown will populate .Q.pn which can then be used to get the count afterwards, for example
q).Q.pn
quotes|
trades|
q)\ts select count i from quotes where date=2014.04.25
0 2656
q).Q.pn
quotes| 85204 100761 81724 88753 115685 125120 121458 97826 99577 82763
trades| ()
In more detail, .Q.ps does some of the select operations under the hood. If you look at the 3rd line:
if[$[#c;0;(g:(. a)~,pf)|(. a)~,(#:;`i)];f:!a;j:dt[d]t;...
this checks the "a" (select) part of the query, and if it is
(#:;`i)
(which is count i) it ends up running .Q.dt, which runs .Q.cn, which gets the partition counts. So the first time it is running this, it runs .Q.cn, getting the count for all partitions. The next time .Q.cn is run, it can just look up the values in the dictionary .Q.pn which is a lot faster.
See above.
See above about attributes on i. count is a separate operation, not part of the query, and won't be affected by attributes on columns; it will see the table as a list.
For tables on disk, each column should contain a header where the count of the vector is available at very little expense:
q)`:q set til 123
`:q
q)read1 `:q
0xfe200700000000007b000000000000000000000000000000010000000000000002000000000..
q)9#read1 `:q
0xfe200700000000007b
q)`int$last 9#read1 `:q
123i
q)
q)`:q set til 124
`:q
q)9#read1 `:q
0xfe200700000000007c
q)`int$last 9#read1 `:q
124i
Still, reading any file usually takes ~1ms at least, so the counts are cached as mentioned above.

Sphinx composite (distributed) big indexes

I am experiencing a problem with indexing a lot of content data and am searching for a suitable solution.
The logic is the following:
A robot uploads content to the database every day.
The Sphinx index must reindex only the new (daily) data, i.e. the previous content is never changed.
Sphinx delta indexing is exactly the solution for this, but with too much content this error is raised: too many string attributes (current index format allows up to 4 GB).
Distributed indexing seems usable, but how can the indexed data be added and split dynamically (without dirty hacks)?
I.e.: on day 1 there are 10000 rows in total, on day 2 - 20000 rows, etc. The index throws the >4 GB error at about 60000 rows.
The expected index flow: on days 1-5 there is 1 index (no matter whether distributed or not), on days 6-10 - 1 distributed (composite) index (50000 + 50000 rows), and so on.
The question is: how to fill the distributed index dynamically?
Daily iteration sample:
main index
chunk1 - 50000 rows
chunk2 - 50000 rows
chunk3 - 35000 rows
delta index
10000 new rows
rotate "delta"
merge "delta" into "main"
Please advise.
Thanks to #barryhunter
RT indexes are the solution here.
A good manual is here: https://www.sphinxconnector.net/Tutorial/IntroductionToRealTimeIndexes
I've tested match queries on 3 000 000 000 letters. The speed is close to the same as for the "plain" index type. The total index size on HDD is about 2 GB.
Populating the Sphinx RT index:
CPU usage: ~50% of 1 core / 8 cores
RAM usage: ~0.5% / 32 GB
Speed: as quick as a usual select/insert (mostly depends on using batch inserts or row-by-row)
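As an illustration, populating an RT index over SphinxQL might look like the sketch below; the index name content_rt and its fields/attributes are assumptions that would have to be declared in sphinx.conf, and you connect with any MySQL client on the SphinxQL port (usually 9306). Batch inserts like this are noticeably faster than inserting row by row:
INSERT INTO content_rt (id, title, body, post_date) VALUES
    (1, 'first post',  'some content ...', 1669852800),
    (2, 'second post', 'more content ...', 1669939200);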
NOTE:
"SELECT MAX(id) FROM sphinx_index_name" will produce the error "fullscan requires extern docinfo". Setting docinfo = extern will not solve this, so simply keep the counter in a MySQL table (as with the Sphinx delta index approach: http://sphinxsearch.com/docs/current.html#delta-updates).

curr_item doesn't seem to be right?

I'm using a simple Lua program to do a batch insert; the expiry time of each item is set to 86400 seconds, so items won't expire for a whole day.
Right now I have about 1,000,000 curr_items, but if I dump them with memcached-tool and grep for '^add', I get only 27235 items:
%> memcached-tool 127.0.0.1:11211 display
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
8 480B 4s 464 1012907 no 0 0 0
%> memcached-tool 127.0.0.1:11211 dump | grep ^add -c
Dumping memcache contents
Number of buckets: 1
Number of items : 1012907
Dumping bucket 8 - 1012907 total items
27235
%> memcached-tool 127.0.0.1:11211 stats | egrep '(curr|bytes)'
bytes 447704894
bytes_read 407765187
bytes_written 78574999
curr_connections 10
curr_items 1012907
limit_maxbytes 2147483648
I need this to estimate the possible memory needs of my system, but now I'm not sure about the item count. Which one is correct?
Okay, there's a hard limit on cachedump, which is 2 MB, as mentioned on the mailing list.
That's what stopped the tool from dumping more keys.

Optimizing MongoDB

I have around 105 million records similar to this:
{
"post_id": 1314131221,
"date": 1309187001,
"upvotes": 2342
}
in a MongoDB collection.
I also have an index on "post_id" and "date".
Then I need to do this:
db.fb_pages_fans.find({
post_id: 1314131221,
date: {"$gt": 1309117001, "$lta": 1309187001}
}).sort({date: 1});
If i set "date" on the specific date:
when it returns 30 records, it took ~130ms
when it returns 90 records, it took ~700ms
when it returns 180 records, it took ~1200ms
Of course i'm talking about first query, second and more queries are very fast, but i need to have first queries fast.
It's much more slower from 90 records than PostgreSQL, which I use now. Why is that so slow?
btw. creating index on mentioned two "cols" on 105mil records took around 24 hours.
It runs on one machine with 12GB RAM, here is a log from mongostats when i was executing the query:
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 23.9g 24.1g 8m 0 0 0 0|0 0|0 62b 1k 1 18:34:04
0 1 0 0 0 1 0 23.9g 24.1g 8m 21 0 0 0|0 0|0 215b 3k 1 18:34:05
If your first query is slow and all consecutive, similar queries are fast, then Mongo is moving the queried data from disk into memory. That's relatively hard to avoid with data sets of that size. Use mongostat and check the faults statistic to see if you're getting page faults during your queries. Alternatively, it might be that your indexes do not fit into memory, in which case you can try to right-balance them so that the relevant, high-throughput parts are consistently in physical memory.
Also, are we talking about a single physical database or a sharded setup?