Update 2022-11-10: I have opened a case with AWS for this one and will let you know here once they have responded.
Postgres 12.9, AWS-managed (RDS), on db.r5.4xlarge, which has 64 GB RAM.
autovacuum_vacuum_cost_limit is at 1800:
select setting from pg_settings where name = 'autovacuum_vacuum_cost_limit'; -- 1800
Parameter group in AWS Console:
autovacuum_vacuum_cost_limit: GREATEST({log(DBInstanceClassMemory/21474836480)*600},200)
rds.adaptive_autovacuum: 1
Calculation for autovacuum_vacuum_cost_limit IMHO:
64 Gigabytes = 68,719,476,736 Bytes
GREATEST({log(68719476736/21474836480)*600},200)
GREATEST({log(3.2)*600},200)
GREATEST({0.50514997832*600},200)
GREATEST(303.089986992,200)
CloudWatch metric MaximumUsedTransactionIDs hovers around 200 million. Many tables are close to 200 million.
So autovacuum_vacuum_cost_limit should be 303, in my opinion? Why is it at 1800 instead of the 303 I would expect? I think I can rule out manual intervention. Thank you.
Here is the answer from AWS. In short, the explanation is twofold:
by "log" they mean log base 2, not log base 10
they round the log value at one point
The autovacuum_vacuum_cost_limit formula is GREATEST({log(DBInstanceClassMemory/21474836480)*600},200).
The log in the above formula is log base 2, and the log value is rounded off before it is multiplied by 600.
As your instance type is r5.2xlarge, the instance class memory is 64 GB.
DBInstanceClassMemory = 64 GiB = 68719476736 bytes
Therefore, the following calculation is used to calculate the autovacuum_vacuum_cost_limit value:
GREATEST({log(68719476736/21474836480)*600},200)
= GREATEST({log(3.2)*600},200) --> log base 2 of 3.2 is 1.678, which is rounded off to 2
= GREATEST({2*600},200)
= GREATEST({1200},200)
= 1200
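For reference, here is a minimal sketch reproducing both readings of the formula (plain Python arithmetic, not RDS's actual implementation; 21474836480 bytes is the 20 GiB constant from the formula):
import math

mem_bytes = 64 * 1024**3          # DBInstanceClassMemory for a 64 GiB instance
ratio = mem_bytes / 21474836480   # = 3.2

# Reading 1: log base 10, no rounding -> about 303 (the question's expectation)
print(max(math.log10(ratio) * 600, 200))        # 303.08...

# Reading 2: log base 2, rounded before multiplying -> 1200 (AWS's explanation)
print(max(round(math.log2(ratio)) * 600, 200))  # round(1.678) == 2, so 2 * 600 = 1200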
I need to translate the Prometheus irate function into SQL, and I cannot really find the calculation logic anywhere.
I have the following PromQL query:
100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[40s])) * 100)
Let's say I have the following data for one CPU:
v:    20     50    100    200    201    230
t:    10     20     30     40     50     60
         |<------ range = 40s ------>|
My question is not really related to Postgres; I could solve this in SQL if I knew the formula I should implement.
I understand that I have to take the difference of the last two data points and divide value_diff by time_diff:
(201-200)/(50-40), but how does the 40s window come into the picture?
((201-200)/(50-40))/40 ?
What would be the proper mathematical calculation for the above Prometheus query?
And how should I do the same if I have data for 8 CPUs?
I tried to search for documentation but could not find a proper explanation of what is going on behind the scenes.
Thanks
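For what it's worth, here is a minimal sketch of the irate logic as described in the Prometheus documentation, applied to the sample data above (counter resets are ignored for simplicity): the [40s] window only decides which samples are eligible; the rate itself is always the value difference of the last two samples divided by their time difference, so nothing is divided by 40.
def irate(samples, eval_time, window=40):
    # samples: list of (timestamp, value) pairs, sorted by timestamp
    in_window = [(t, v) for t, v in samples if eval_time - window < t <= eval_time]
    if len(in_window) < 2:
        return None
    (t1, v1), (t2, v2) = in_window[-2], in_window[-1]
    return (v2 - v1) / (t2 - t1)   # the 40s window is NOT a divisor

# The data from the question: t = 10..60, v = 20, 50, 100, 200, 201, 230
series = [(10, 20), (20, 50), (30, 100), (40, 200), (50, 201), (60, 230)]
print(irate(series, eval_time=52))   # (201 - 200) / (50 - 40) = 0.1

# With 8 CPUs: compute irate per CPU, average the idle rates per instance,
# then 100 - avg * 100 gives the "busy %" of the original query.
cpus = [series] * 8                  # pretend all 8 CPUs look identical
idle_rates = [irate(s, eval_time=52) for s in cpus]
print(100 - (sum(idle_rates) / len(idle_rates)) * 100)   # 90.0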
I just realized that the performance of Frequent Itemsets is strongly correlated with the number of items per basket. I ran the following code:
from orangecontrib.associate.fpgrowth import frequent_itemsets

# (run in its own Jupyter cell; %%time is an IPython cell magic)
%%time
T = [[1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]]
itemsets = frequent_itemsets(T, 1)  # returns a generator; absolute support threshold of 1
a = list(itemsets)                  # materializing the generator is what takes the time
As I increased the number of items in T, the running time increased as follows:
Items   Running time
21      3.39 s
22      9.14 s
23      15.8 s
24      37.4 s
25      1.2 min
26      10 min
27      35 min
28      95 min
For 31 items it ran for 10 hours without returning any result. I am wondering if there is any way to run it for more than 31 items in a reasonable time? In this case I just need pairwise itemsets (A --> B), while my understanding is that frequent_itemsets counts all possible combinations, which is probably why its running time is so strongly correlated with the number of items. Is there any way to tell the method to limit the itemset size, e.g. to count only pairs instead of all combinations?
You could use other software that allows specifying constraints on the itemsets, such as a length constraint. For example, you can consider the SPMF data mining library (disclosure: I am the founder), which offers about 120 algorithms for itemset and pattern mining. It will let you use FPGrowth with a length constraint, so you could, for example, mine only the patterns with 2 items or 3 items. You could also try other features such as mining association rules. That software works on text files, can be called from the command line, and is quite fast.
A database of a single transaction with 21 items results in 2^21 - 1 = 2,097,151 itemsets.
>>> T = [list(range(21))]
>>> len(list(frequent_itemsets(T, 1)))
2097151
Perhaps, instead of an absolute support as low as a single transaction (1), choose a relative support of e.g. 5% of all transactions (.05).
You should also limit the returned itemsets to contain exactly two items (antecedent and consequent for later association rule discovery), but the runtime will still be high due to, as you understand, sheer combinatorics.
len([itemset
for itemset, support in frequent_itemsets(T, 1)
if len(itemset) == 2])
At the moment, there is no such filtering available inside the algorithm, but the source is open to tinkering.
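If only pairwise counts are needed, one plain-Python workaround (outside of orangecontrib.associate, so just a sketch) is to count 2-item combinations directly, which avoids enumerating all 2**n - 1 itemsets:
from collections import Counter
from itertools import combinations

transactions = [list(range(21))]   # same shape as T in the question
min_support = 1                    # absolute support, as in the question

pair_counts = Counter()
for basket in transactions:
    # 21 items -> only C(21, 2) = 210 pairs per basket, not 2**21 - 1 itemsets
    pair_counts.update(combinations(sorted(set(basket)), 2))

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(len(frequent_pairs))         # 210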
I'm new to PG and trying to run this simple query, and it's crashing Postgres. The query works in a few seconds if I only calculate r1, but it says "Out of memory" if I try to calculate r2 through r6 in addition to r1, as below. Poor query design? I'm going to reference the calculated fields r1...r6 in other calculations, so I was thinking about making this query a view. My key config file parameters are below. Windows 10, PG 9.6, 40 GB RAM, 64-bit. Any ideas on what to do? Thanks!
Edited: I tried adding LIMIT 500 to the end and that worked, but if I run a query on top of this query, i.e. to use the calculated r1, r2, r3... in another query, will the new query see all the records or will it be limited to just the 500?
SELECT
public.psda.price_y1,
public.psda.price_y2,
public.psda.price_y3,
public.psda.price_y4,
public.psda.price_y5,
public.psda.price_y6,
public.psda.price_y7,
(price_y1 - price_y2) / nullif(price_y2, 0) AS r1,
(price_y2 - price_y3) / nullif(price_y3, 0) AS r2,
(price_y3 - price_y4) / nullif(price_y4, 0) AS r3,
(price_y4 - price_y5) / nullif(price_y5, 0) AS r4,
(price_y5 - price_y6) / nullif(price_y6, 0) AS r5,
(price_y6 - price_y7) / nullif(price_y7, 0) AS r6
FROM
public.psda
My config file parameters:
max_connections = 50
shared_buffers = 1GB
effective_cache_size = 20GB
work_mem = 400MB
maintenance_work_mem = 1GB
wal_buffers = 16MB
max_wal_size = 2GB
min_wal_size = 1GB
checkpoint_completion_target = 0.7
default_statistics_target = 100
Use this service: http://pgtune.leopard.in.ua/
When calculating the total RAM available to PostgreSQL, use total RAM minus the RAM needed by the OS.
PGTune calculates a PostgreSQL configuration for maximum performance on a given hardware configuration.
It isn't a silver bullet for PostgreSQL's optimization settings.
Many settings depend not only on the hardware configuration, but also on the size of the database, the number of clients, and the complexity of queries, so the database can only be configured optimally when all of these parameters are taken into account.
I am experiencing a problem with indexing a lot of content data and am searching for a suitable solution.
The logic is the following:
A robot uploads content to the database every day.
The Sphinx index must reindex only the new (daily) data, i.e. the previous content is never changed.
Sphinx delta indexing is an exact fit for this, but with too much content the error appears: too many string attributes (current index format allows up to 4 GB).
Distributed indexing seems usable, but how can the indexed data be added and split dynamically (without dirty hacks)?
I.e.: on day 1 there are 10000 rows in total, on day 2 there are 20000 rows, etc. The index throws the >4 GB error at about 60000 rows.
The expected index flow: on days 1-5 there is 1 index (no matter whether distributed or not), on days 6-10 there is 1 distributed (composite) index (50000 + 50000 rows), and so on.
The question is: how do I fill the distributed index dynamically?
Daily iteration sample:
main index
chunk1 - 50000 rows
chunk2 - 50000 rows
chunk3 - 35000 rows
delta index
10000 new rows
rotate "delta"
merge "delta" into "main"
Please advise.
Thanks to @barryhunter.
RT indexes are the solution here.
A good manual is here: https://www.sphinxconnector.net/Tutorial/IntroductionToRealTimeIndexes
I've tested match queries on 3,000,000,000 characters. The speed is close to the same as for the "plain" index type. The total index size on HDD is about 2 GB.
Populating the Sphinx RT index:
CPU usage: ~50% of 1 core (out of 8 cores)
RAM usage: ~0.5% of 32 GB
Speed: as quick as a usual select/insert (mostly depends on using batch inserts vs. row-by-row)
NOTE:
"SELECT MAX(id) FROM sphinx_index_name" will produce error "fullscan requires extern docinfo". Setting docinfo = extern will not solve this. So keep counter simply in mysql table (like for sphinx delta index: http://sphinxsearch.com/docs/current.html#delta-updates).
I was working on an old project the other day over VPN for a client and found an issue where I was purging data using the wrong PK; as a result their database was huge and slow to return info, which was causing our software to freak out.
I got to thinking that I would just like to know when I am approaching the maximum size. I know how to set up SQL Server for email notification, but I've only sent test messages. I looked at my database's properties hoping I would see some options related to email, but I saw nothing.
I've seen that you can send out an email after a job, so I'm hoping you can do this too. Does anyone know how I can achieve this?
sys.database_files has a size column which stores the number of pages. A page is 8 KB, so multiplying the page count by 8 * 1.024 = 8.192 and dividing by 1000 gives the size of the files on disk in MB. Just replace [database name] with the actual name of your database, and adjust the size check if you want something other than 2 GB as the warning threshold.
DECLARE @size DECIMAL(20,2);
SELECT @size = SUM(size * 8.192)/1000 FROM [database name].sys.database_files;
IF @size >= 2000 -- this is in MB
BEGIN
    -- send e-mail, e.g. via msdb.dbo.sp_send_dbmail
END
If you want to do it for all databases, you can do this without going into each individual database's sys.database_files view by using master.sys.sysaltfiles. I have observed that the size column there is not always in sync with the size column in sys.database_files; I would trust the latter first.