I have a relatively big (~2 TB, ~20 billion rows) database with just two tables, and these autovacuum settings:
autovacuum_max_workers = 6
autovacuum_vacuum_threshold = 40000
autovacuum_vacuum_insert_threshold = 100000
autovacuum_analyze_threshold = 40000
autovacuum_vacuum_scale_factor = 0
autovacuum_vacuum_insert_scale_factor = 0
autovacuum_vacuum_cost_delay = 2
autovacuum_vacuum_cost_limit = 3000
maintenance_work_mem = 4GB
With the current load autovacuum starts every minute or two (which I thought was good, since more vacuuming = better)
But the problem is that n_dead_tup steadily increases over time, and vacuum time grows exponentially as well.
Then after about an hour or two, an index scan happens, with a huge spike in IO, and it sometimes hangs the system for long enough for the app to crash.
Is there anything I can do to prevent these spikes? Maybe force index scans more often?
Here's the graph of n_dead_tup steadily growing, despite autovacuums happening.
And when the index scan finally happens, the spike in IO is enormous, making the rest of the chart look like a flat line.
And in the logs, I can see that each autovacuum takes about twice as long as the previous one did, until an index scan happens:
2022-11-12 19:51:05.036 UTC [67209] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 5.01 s, system: 3.98 s, elapsed: 16.70 s
2022-11-12 19:52:16.269 UTC [67782] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 6.38 s, system: 6.32 s, elapsed: 27.73 s
2022-11-12 19:55:27.550 UTC [68810] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 7.22 s, system: 10.33 s, elapsed: 38.50 s
and after some time
2022-11-12 21:30:38.277 UTC [96988] LOG: automatic vacuum of table "public.initial_scores": index scans: 0
system usage: CPU: user: 29.21 s, system: 66.48 s, elapsed: 261.49 s
2022-11-12 22:05:23.651 UTC [98593] LOG: automatic vacuum of table "public.initial_scores": index scans: 1
system usage: CPU: user: 1278.22 s, system: 239.80 s, elapsed: 2062.45 s
There are no long-running queries which could prevent vacuum from cleaning up properly, and there are no prepared queries.
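A query along these lines can be used to check for such blockers (just a sketch using the standard catalog views, not the exact check that was run):
-- sessions whose transactions hold back the xmin horizon, plus any prepared transactions
SELECT pid, state, xact_start, backend_xmin, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;
SELECT * FROM pg_prepared_xacts;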
Per request, adding the VACUUM VERBOSE output:
vacuuming "ryd-db.public.initial_scores"
launched 4 parallel vacuum workers for index vacuuming (planned: 4)
finished vacuuming "ryd-db.public.initial_scores": index scans: 1
pages: 0 removed, 122251463 remain, 13228190 scanned (10.82% of total)
tuples: 275918 removed, 1578025058 remain, 241296 are dead but not yet removable
removable cutoff: 3503893034, which was 1596319 XIDs old when operation ended
index scan needed: 2593283 pages from table (2.12% of total) had 3961159 dead item identifiers removed
index "pk_initial_scores": pages: 7896774 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_date_created": pages: 18729021 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_channel_id": pages: 3291717 in total, 12 newly deleted, 13 currently deleted, 10 reusable
index "ix_initial_scores_date_published": pages: 2625645 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_category": pages: 2164691 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 84.367 MB/s, avg write rate: 15.163 MB/s
buffer usage: 15364625 hits, 21630601 misses, 3887689 dirtied
WAL usage: 9888050 records, 6215601 full page images, 30970053477 bytes
system usage: CPU: user: 636.32 s, system: 120.53 s, elapsed: 2003.03 s
vacuuming "ryd-db.pg_toast.pg_toast_16418"
finished vacuuming "ryd-db.pg_toast.pg_toast_16418": index scans: 1
pages: 0 removed, 17770543 remain, 890550 scanned (5.01% of total)
tuples: 20012 removed, 87435806 remain, 793 are dead but not yet removable
removable cutoff: 3505488713, which was 94878 XIDs old when operation ended
index scan needed: 179949 pages from table (1.01% of total) had 283573 dead item identifiers removed
index "pg_toast_16418_index": pages: 2735506 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 257.567 MB/s, avg write rate: 13.215 MB/s
buffer usage: 945149 hits, 3758561 misses, 192836 dirtied
WAL usage: 413262 records, 98383 full page images, 448496268 bytes
system usage: CPU: user: 10.31 s, system: 9.47 s, elapsed: 114.00 s
analyzing "public.initial_scores"
"initial_scores": scanned 300000 of 122251463 pages, containing 3959953 live rows and 3253 dead rows; 300000 rows in sample, 1613700159 estimated total rows
Autovacuum run when an index scan is triggered:
2022-11-14 00:56:44.889 UTC [640096] LOG: automatic vacuum of table "ryd-db.public.initial_scores": index scans: 1
pages: 0 removed, 122251463 remain, 14421445 scanned (11.80% of total)
tuples: 325666 removed, 1577912706 remain, 309680 are dead but not yet removable
removable cutoff: 3496504042, which was 3600494 XIDs old when operation ended
index scan needed: 2469439 pages from table (2.02% of total) had 3578028 dead item identifiers removed
index "pk_initial_scores": pages: 7893180 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_date_created": pages: 18721360 in total, 1 newly deleted, 1 currently deleted, 1 reusable
index "ix_initial_scores_channel_id": pages: 3291207 in total, 6 newly deleted, 7 currently deleted, 6 reusable
index "ix_initial_scores_date_published": pages: 2625593 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "ix_initial_scores_category": pages: 2164594 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 86.157 MB/s, avg write rate: 11.919 MB/s
buffer usage: 16609082 hits, 49454603 misses, 6841583 dirtied
WAL usage: 9554520 records, 5682531 full page images, 28223936896 bytes
system usage: CPU: user: 1262.91 s, system: 218.72 s, elapsed: 4484.40 s
2022-11-14 00:57:01.086 UTC [640096] LOG: automatic analyze of table "ryd-db.public.initial_scores"
avg read rate: 143.711 MB/s, avg write rate: 1.044 MB/s
buffer usage: 8007 hits, 297889 misses, 2165 dirtied
system usage: CPU: user: 9.14 s, system: 1.71 s, elapsed: 16.19 s
And without an index scan:
2022-11-14 01:09:21.788 UTC [664482] LOG: automatic vacuum of table "ryd-db.public.initial_scores": index scans: 0
pages: 0 removed, 122251463 remain, 7751023 scanned (6.34% of total)
tuples: 1346523 removed, 1596376142 remain, 248208 are dead but not yet removable
removable cutoff: 3500120771, which was 583770 XIDs old when operation ended
index scan bypassed: 1420208 pages from table (1.16% of total) have 1892380 dead item identifiers
avg read rate: 75.417 MB/s, avg write rate: 14.280 MB/s
buffer usage: 7023840 hits, 7103108 misses, 1344965 dirtied
WAL usage: 2307792 records, 585139 full page images, 3890305134 bytes
system usage: CPU: user: 82.42 s, system: 40.27 s, elapsed: 735.81 s
When vacuum does an index cleanup, it needs to read the entire index(es). This inevitably means a lot of IO if the indexes don't fit in cache. This is in contrast to the table scan, which only needs to scan the part of the table not already marked all-visible. Depending on usage, that can be a small part of the entire table (which I assume is the case here, otherwise every autovacuum would be consuming huge amounts of IO).
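One rough way to see how much of the table is currently all-visible (a sketch; relallvisible is only an estimate maintained by vacuum/analyze):
SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / relpages, 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'initial_scores';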
You can force the index cleanup to occur on every vacuum by setting the vacuum_index_cleanup storage parameter for the tables. But it is hard to see how this would be a good thing, since the cleanup is what is causing the problem in the first place.
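For completeness, on recent PostgreSQL versions that would look something like this (table name taken from the logs above):
-- force the index-cleanup phase on every vacuum of this table
-- (as noted, this is unlikely to help here)
ALTER TABLE public.initial_scores SET (vacuum_index_cleanup = on);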
Throttling the autovacuum so it can run the index cleanup without consuming all your IO capacity is probably the right solution, which it seems you already hit on. I would have done this by lowering cost_limit rather than by increasing cost_delay because that would be moving the settings more toward their defaults, but it probably doesn't make much of a difference.
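For example (the value is only illustrative, not a recommendation):
-- throttle autovacuum by lowering the cost limit back toward its default
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;  -- currently 3000
SELECT pg_reload_conf();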
Also, if you partitioned the table by date (or something like that) such that most of the activity is concentrated in one or two of the partitions at a time, that could improve this situation. The cleanup scan would only need to be done on the indexes of the actively changing partitions, greatly decreasing the IO consumption.
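A very rough sketch of what that could look like; the column names here are guesses based on the index names in the logs, not the real schema:
CREATE TABLE initial_scores_part (
    id             bigint NOT NULL,
    channel_id     bigint,
    category       text,
    date_created   timestamptz NOT NULL,
    date_published timestamptz,
    PRIMARY KEY (id, date_created)   -- the partition key must be part of the PK
) PARTITION BY RANGE (date_created);

CREATE TABLE initial_scores_2022_11 PARTITION OF initial_scores_part
    FOR VALUES FROM ('2022-11-01') TO ('2022-12-01');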
There probably isn't much point in vacuuming the table so aggressively only to bail out and not do the index cleanup scan. So I would also increase autovacuum_vacuum_threshold and/or autovacuum_vacuum_scale_factor.
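Those can also be set per table, so the rest of the system keeps its current settings (the numbers are illustrative):
ALTER TABLE public.initial_scores SET (
    autovacuum_vacuum_threshold = 1000000,
    autovacuum_vacuum_scale_factor = 0.005
);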
However, one thing I can't explain here is the pattern in the n_dead_tup graph. I can readily reproduce the instant spike down at the end of each autovac (whether it did the index-cleanup or not), but I don't understand the instant spike up. In my hands it just ramps up, then spikes down, over and over. I am assuming you are using a version after 11, but which version is it?
In the PostgreSQL log file we can find some statistics about how vacuum works on a table.
Why does vacuum need to write WAL when cleaning up a table?
pages: 0 removed, 1640 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 99960 remain, 0 are dead but not yet removable, oldest xmin: 825
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 8.878 MB/s, avg write rate: 0.000 MB/s
buffer usage: 48 hits, 1 misses, 0 dirtied
**WAL usage: 1 records**, 0 full page images, 229 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM modifies the table by deleting old row versions, freezing rows and truncating empty pages at the end of the table.
All modifications to a table are logged in WAL, and that applies to VACUUM as well. After all, you need to replay these activities after a crash.
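A tiny way to see this for yourself on PostgreSQL 14 or later, where VACUUM (VERBOSE) reports WAL usage directly (table and numbers are just an illustration):
CREATE TABLE wal_demo AS SELECT g AS id FROM generate_series(1, 100000) g;
DELETE FROM wal_demo WHERE id % 10 = 0;     -- create some dead tuples
VACUUM (VERBOSE) wal_demo;                  -- the output includes a "WAL usage: ..." line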
When I run the query below to get the count of a table, the time taken and memory used are almost always the same. / table t has 97029 records and 124 cols
Q.1. Does column i in the query below use the unique attribute internally to return the output in constant time using a hash function?
\ts select last i from t where date=.z.d-5 / 3j, 1313248j
/ time taken to run the query and memory used are always the same, no matter how many times we run the same query
When I run the query below, the time and memory required are very high the first time, but much lower from the next run onwards.
Q.2. Does kdb+ cache the output when we run the query for the first time and serve it from the cache on subsequent runs?
Q.3. Is there an attribute applied on column i while running the query below, and if so, which one?
\ts select count i from t where date=.z.d-5 / 1512j, 67292448j
\ts select count i from t where date=.z.d-5 / 0j, 2160j
When running the query below:
Q.4. Is any attribute applied on column i when running this query?
\ts count select from t where date=.z.d-5 / 184j, 37292448j
/ time taken to run the query and memory used are always the same, no matter how many times we run it
Q.5. Which of the above queries should be used to get the count of tables with a very high number of records? Is there any other query that is faster and uses less memory to get the same result?
There isn't a u# attribute applied to the i column; to see this:
q)n:100000
q)t:([]a:`u#til n)
q)
q)\t:1000 select count distinct a from t
2
q)\t:1000 select count distinct i from t
536
The timing of those queries isn't constant; there are just not enough significant figures to see the variability. Using
\ts:100 select last i from t where date=.z.d-5
will run the query 100 times and highlight that the timing is not constant.
The first query will request more memory be allocated to the q process, which will remain allocated to the process unless garbage collection is called (.Q.gc[]). The memory usage stats can be viewed with .Q.w[]. For example, in a new session:
q).Q.w[]
used| 542704
heap| 67108864
peak| 67108864
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)\t b: til 500000000
6569
q)
q).Q.w[]
used| 4295510048
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)b:0
q)
q).Q.w[]
used| 542768
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)\t b: til 500000000
877
q)
q).Q.w[]
used| 4295510048
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
Also, assuming the table in question is partitioned, the query shown will populate .Q.pn which can then be used to get the count afterwards, for example
q).Q.pn
quotes|
trades|
q)\ts select count i from quotes where date=2014.04.25
0 2656
q).Q.pn
quotes| 85204 100761 81724 88753 115685 125120 121458 97826 99577 82763
trades| ()
In more detail, .Q.ps does some of the select operations under the hood. If you look at the 3rd line:
if[$[#c;0;(g:(. a)~,pf)|(. a)~,(#:;`i)];f:!a;j:dt[d]t;...
this checks the "a" (select) part of the query, and if it is
(#:;`i)
(which is count i) it ends up running .Q.dt, which runs .Q.cn, which gets the partition counts. So the first time it is running this, it runs .Q.cn, getting the count for all partitions. The next time .Q.cn is run, it can just look up the values in the dictionary .Q.pn which is a lot faster.
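As a small illustration, once .Q.pn has been populated by that first query, a total count can be taken straight from the cached dictionary (assuming the quotes counts are already loaded, as in the .Q.pn output shown above):
q)sum .Q.pn`quotes    / sums the per-partition counts cached in .Q.pn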
See above.
See above about attributes on i. count is a separate operation, not part of the query, and won't be affected by attributes on columns; it will see the table as a list.
For tables on disk, each column should contain a header where the count of the vector is available at very little expense:
q)`:q set til 123
`:q
q)read1 `:q
0xfe200700000000007b000000000000000000000000000000010000000000000002000000000..
q)9#read1 `:q
0xfe200700000000007b
q)`int$last 9#read1 `:q
123i
q)
q)`:q set til 124
`:q
q)9#read1 `:q
0xfe200700000000007c
q)`int$last 9#read1 `:q
124i
Still, reading any file usually takes ~1ms at least, so the counts are cached as mentioned above.
I'm modifying the charset mapping for a Sphinx cluster and I've run into a bit of an oddity, one which the documentation does not cover. The previous author of the charset_table and ngram_chars definitions put the CJK Unicode ranges in both the charset mapping and the ngrams.
Is this necessary?
If not, what is the purpose of this duplication?
I am going to answer my own question after doing some extensive testing. As it turns out, charset_table and ngram_chars complement each other rather than one being a subset of the other.
Testing run
Docset
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="foo"/>
</sphinx:schema>
<sphinx:document id="123">
<foo><![CDATA[ぇえぉおかがきぎく]]></foo>
</sphinx:document>
</sphinx:docset>
Just charset_table
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇ ': returned 0 matches of 0 total in 0.000 sec
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=123, weight=1500
words:
1. 'ぇえぉおかがきぎく': 1 documents, 1 hits
Just ngram_chars
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=123, weight=9500
words:
1. 'ぇ': 1 documents, 1 hits
2. 'え': 1 documents, 1 hits
3. 'ぉ': 1 documents, 1 hits
4. 'お': 1 documents, 1 hits
5. 'か': 1 documents, 1 hits
6. 'が': 1 documents, 1 hits
7. 'き': 1 documents, 1 hits
8. 'ぎ': 1 documents, 1 hits
9. 'く': 1 documents, 1 hits
So, the presence of a character in charset_table does not in any way affect the indexing if the character is present in ngram_chars. They do not depend on one another.
I admit I have never used ngram_chars, but I think the chars listed in ngram_chars also need to be in charset_table.
charset_table defines all the chars that get indexed; ngram_chars then defines the ones that get segmented.
If a char is only in charset_table, it will be indexed as a normal word.
If it is only in ngram_chars, it has no effect.
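For reference, the kind of index definition being compared looks roughly like this (the source, path and character ranges are illustrative, not the actual config):
index i_blah
{
    source        = src_blah
    path          = /var/lib/sphinx/i_blah
    charset_table = 0..9, A..Z->a..z, a..z, U+3041..U+30FF
    ngram_len     = 1
    ngram_chars   = U+3041..U+30FF
}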
I have around 105 million records similar to this:
{
"post_id": 1314131221,
"date": 1309187001,
"upvotes": 2342
}
in a MongoDB collection.
I also have an index on "post_id" and "date".
Then I need to run this:
db.fb_pages_fans.find({
    post_id: 1314131221,
    date: {"$gt": 1309117001, "$lte": 1309187001}
}).sort({date: 1});
If I set the "date" range to a specific window:
when it returns 30 records, it takes ~130 ms
when it returns 90 records, it takes ~700 ms
when it returns 180 records, it takes ~1200 ms
Of course I'm talking about the first query; the second and subsequent queries are very fast, but I need the first queries to be fast as well.
From about 90 records it's much slower than PostgreSQL, which I use now. Why is it so slow?
By the way, creating the index on the two mentioned "cols" over 105 million records took around 24 hours.
It runs on one machine with 12 GB RAM; here is the output of mongostat while I was executing the query:
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 23.9g 24.1g 8m 0 0 0 0|0 0|0 62b 1k 1 18:34:04
0 1 0 0 0 1 0 23.9g 24.1g 8m 21 0 0 0|0 0|0 215b 3k 1 18:34:05
If your first query is slow and all subsequent, similar queries are fast, then Mongo is moving the queried data from disk into memory. That's relatively hard to avoid with data sets of that size. Use mongostat and check the faults statistic to see if you're getting page faults during your queries. Alternatively it might be that your index(es) do not fit into memory, in which case you can try to balance them so that the relevant, high-throughput parts are consistently in physical memory.
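A couple of ways to check those two suspicions from the shell (a sketch; the collection name is taken from the question):
// how much RAM the indexes on the collection need
db.fb_pages_fans.stats().indexSizes
db.fb_pages_fans.totalIndexSize()

// confirm which index the query actually uses and how much it scans
db.fb_pages_fans.find({
    post_id: 1314131221,
    date: { $gt: 1309117001, $lte: 1309187001 }
}).sort({ date: 1 }).explain()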
Also, are we talking a single physical database or a sharded setup?