Observation of DBMS_STATS global preferences for autonomous database; and is there any risk in changing to desired values or overriding at table level - oracle-autonomous-db

I am interested in understanding the rationale behind two DBMS_STATS global preferences in the autonomous database / data warehouse, and what the risk/downside is in changing them, as compared with the risk in a non-autonomous database.
In the autonomous database:
SELECT
DBMS_STATS
.GET_PREFS
( PNAME => 'METHOD_OPT'
)
AS METHOD_OPT
FROM DUAL;
Yields: FOR ALL COLUMNS SIZE 254
And
SELECT
DBMS_STATS
.GET_PREFS
( PNAME => 'INCREMENTAL_LEVEL'
)
AS INCREMENTAL_LEVEL
FROM DUAL
;
Yields: TABLE
In the non-autonomous database those two queries yield
FOR ALL COLUMNS SIZE AUTO and PARTITION.
I would like to understand the rationale for the difference and understand the negatives of either changing the global setting or overriding it at the table level for the autonomous database.
With respect to METHOD_OPT, the autonomous database setting seems to be wasting resources (time, cpu, and space) for no gain (unless one is talking about when data is loaded before there is any usage).
With respect to INCREMENTAL_LEVEL, the autonomous setting seems to be beneficial for non-partitioned tables that take part in partition exchanges. But for partitioned tables, it seems to force work on the entire table, because the TABLE setting requests that a table-level synopsis be created even when only a single partition is modified. The following command is used to gather the statistics:
DBMS_STATS.GATHER_SCHEMA_STATS
( USER
, CASCADE => TRUE
, OPTIONS => 'GATHER AUTO'
, DEGREE => DBMS_STATS.AUTO_DEGREE
);
The objective was that the schema-level gather would collect partition-level statistics only for stale partitions and use that information to derive the global table statistics incrementally. But the observed behavior seems to be full table scans, due to the INCREMENTAL_LEVEL setting.
And yes, the following query does return TRUE, and the preference is not overridden at the schema/table level.
SELECT
DBMS_STATS
.GET_PREFS
( PNAME => 'INCREMENTAL'
)
AS INCREMENTAL
FROM DUAL
;
So to cut to the questions at hand:
Why might the autonomous database set these global DBMS_STATS preferences in this manner?
Is there any prohibition on either changing them as indicated or overriding them at the table level?
What are the possible downsides?
Any insights are appreciated. Thanks in advance.
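
For readers less familiar with the mechanism the question relies on, here is a minimal sketch of the idea behind incremental statistics: each partition keeps a synopsis, and the global number of distinct values is derived by merging synopses instead of rescanning the whole table. This is a toy model only. A plain Python set stands in for the synopsis (the real structure is a compact hash-based summary), and the partition names and values are hypothetical.

```python
# Toy model of incremental statistics. A Python set stands in for the
# per-partition synopsis; this is an illustration, not Oracle's implementation.

def build_synopsis(rows):
    """Scan one partition and return its synopsis (distinct values seen)."""
    return set(rows)

def global_ndv(synopses):
    """Merge all partition synopses to get the table-level distinct count."""
    merged = set()
    for synopsis in synopses.values():
        merged |= synopsis
    return len(merged)

# Hypothetical partitioned table: partition name -> values of one column.
partitions = {
    "p2021_09": [1, 2, 3, 3],
    "p2021_10": [3, 4, 5],
    "p2021_11": [5, 6],
}

scans = 0          # counts partition scans, the expensive part
synopses = {}
for name, rows in partitions.items():
    synopses[name] = build_synopsis(rows)
    scans += 1
initial_ndv = global_ndv(synopses)

# Only one partition becomes stale, so only that partition is rescanned;
# the other synopses are reused as they are.
partitions["p2021_11"].extend([7, 8])
synopses["p2021_11"] = build_synopsis(partitions["p2021_11"])
scans += 1
updated_ndv = global_ndv(synopses)

print(initial_ndv, updated_ndv, scans)
```

In this model the second gather costs one partition scan rather than three, which is the behavior the question expected to see for partitioned tables.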

Related

Azure Postgres AUTOVACUUM AND ANALYZE THRESHOLD - How to change it?

I am coming again with another Postgres question. We are using the Managed Service from Azure that uses autovacuum. Both vacuum and statistics are automatic.
The problem I am getting is that a specific query gets a bad plan when it runs at certain hours. I noticed that after collecting statistics manually, the plan is good again.
From the documentation of Azure I got the following:
The vacuum process reads physical pages and checks for dead tuples.
Every page in shared_buffers is considered to have a cost of 1
(vacuum_cost_page_hit). All other pages are considered to have a cost
of 20 (vacuum_cost_page_dirty), if dead tuples exist, or 10
(vacuum_cost_page_miss), if no dead tuples exist. The vacuum operation
stops when the process exceeds the autovacuum_vacuum_cost_limit.
After the limit is reached, the process sleeps for the duration
specified by the autovacuum_vacuum_cost_delay parameter before it
starts again. If the limit isn't reached, autovacuum starts after the
value specified by the autovacuum_naptime parameter.
In summary, the autovacuum_vacuum_cost_delay and
autovacuum_vacuum_cost_limit parameters control how much data cleanup
is allowed per unit of time. Note that the default values are too low
for most pricing tiers. The optimal values for these parameters are
pricing tier-dependent and should be configured accordingly.
The autovacuum_max_workers parameter determines the maximum number of
autovacuum processes that can run simultaneously.
With PostgreSQL, you can set these parameters at the table level or
instance level. Today, you can set these parameters at the table level
only in Azure Database for PostgreSQL.
Let's imagine that I want to override the default values for specific tables, as currently all of them use the database-wide defaults.
Keeping that in mind, I could try to use the following (where X is what I don't know):
ALTER TABLE tablename SET (autovacuum_vacuum_threshold = X );
ALTER TABLE tablename SET (autovacuum_vacuum_scale_factor = X);
ALTER TABLE tablename SET (autovacuum_vacuum_cost_limit = X );
ALTER TABLE tablename SET (autovacuum_vacuum_cost_delay = X );
Currently I have these values in pg_stat_all_tables
SELECT schemaname,relname,n_tup_ins,n_tup_upd,n_tup_del,last_analyze,last_autovacuum,last_autoanalyze,analyze_count,autoanalyze_count
FROM pg_stat_all_tables where schemaname = 'swp_am_hcbe_pro'
and relname in ( 'submissions','applications' )
"swp_am_hcbe_pro" "applications" "264615" "11688533" "18278" "2021-11-11 08:45:45.878654+00" "2021-11-11 13:50:27.498745+00" "2021-11-10 12:02:04.690082+00" "1" "152"
"swp_am_hcbe_pro" "submissions" "663107" "687757" "51603" "2021-11-11 08:46:48.054731+00" "2021-11-07 04:41:30.205468+00" "2021-11-04 15:25:45.758618+00" "1" "20"
Those two tables are by far the ones getting most of the DML activity.
Questions
How can I determine which values of those autovacuum parameters are best for tables with heavy DML activity?
How can I force Postgres to run the automatic analyze more often for these tables, so that I get more up-to-date statistics? According to the documentation:
autovacuum_analyze_threshold
Specifies the minimum number of inserted, updated or deleted tuples
needed to trigger an ANALYZE in any one table. The default is 50
tuples. This parameter can only be set in the postgresql.conf file or
on the server command line; but the setting can be overridden for
individual tables by changing table storage parameters.
Does it mean that an auto analyze is triggered as soon as the deletes, updates or inserts reach 50? Because I am not seeing this behaviour.
If I change the values for the tables, should I do the same for their indexes ? Is there any option like cascade or similar that changing the table makes the values also affect the corresponding indexes ?
Thank you in advance for any advice. If you need any further details, let me know.
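
On the question of whether 50 changed rows are enough: in PostgreSQL the auto-analyze trigger is not the threshold alone. It is autovacuum_analyze_threshold plus autovacuum_analyze_scale_factor times the table's row count, compared against the rows inserted, updated or deleted since the last ANALYZE. A small sketch of that arithmetic follows, using the stock defaults (50 and 0.1); the table sizes and the tuned values are hypothetical.

```python
def autoanalyze_trigger(reltuples, threshold=50, scale_factor=0.1):
    """Number of changed rows that must be exceeded before autovacuum
    schedules an ANALYZE: threshold + scale_factor * reltuples."""
    return threshold + scale_factor * reltuples

# Hypothetical table sizes with the default settings.
for reltuples in (0, 1_000, 1_000_000):
    print(f"{reltuples:>9} rows -> analyze after more than "
          f"{autoanalyze_trigger(reltuples):,.0f} changed rows")

# Hypothetical per-table tuning: drop the scale factor and rely on a
# fixed threshold so a large table is analyzed more often.
default_large = autoanalyze_trigger(1_000_000)
tuned_large = autoanalyze_trigger(1_000_000, threshold=1_000, scale_factor=0.0)
print(default_large, tuned_large)
```

So with default settings only an empty table is analyzed after about 50 changes; for a large table the scale factor dominates. Both autovacuum_analyze_threshold and autovacuum_analyze_scale_factor can be set per table as storage parameters, in the same ALTER TABLE ... SET (...) form shown in the question.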

PostgreSQL Database size is not equal to sum of size of all tables

I am using an AWS RDS PostgreSQL instance. I am using below query to get size of all databases.
SELECT datname, pg_size_pretty(pg_database_size(datname))
from pg_database
order by pg_database_size(datname) desc
One database's size is 23 GB and when I ran below query to get sum of size of all individual tables in this particular database, it was around 8 GB.
select pg_size_pretty(sum(pg_total_relation_size(table_schema || '.' || table_name)))
from information_schema.tables
As it is an AWS RDS instance, I don't have rights on pg_toast schema.
How can I find out which database object is consuming size?
Thanks in advance.
The documentation says:
pg_total_relation_size ( regclass ) → bigint
Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.
So TOAST tables are covered, and so are indexes.
One simple explanation could be that you are connected to a different database than the one that is shown to be 23GB in size.
Another likely explanation would be materialized views, which consume space, but do not show up in information_schema.tables.
Yet another explanation could be that there have been crashes that left some garbage files behind, for example after an out-of-space condition during the rewrite of a table or index.
This is of course harder to debug on a hosted platform, where you don't have shell access...

How to avoid skewing in redshift for Big Tables?

I want to load a table of more than 1 TB from S3 to Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, where it causes performance issues.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So, I tried distribution key on a workday but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How can the skew be avoided in such cases?
How do we avoid skewing the table when the row counts per unique key value are uneven (let's say hour1 has 1 million rows, hour2 has 1.5 million rows, hour3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised to choose the column that causes the least amount of skew as the DISTKEY. A column that has many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
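
To make the cardinality point concrete, here is a rough simulation of hash distribution. It is a sketch under stated assumptions, not a model of Redshift internals: MD5 is only a stand-in for the real hash function, the slice count assumes 2 slices on each of the 20 nodes, and the row count is made up.

```python
import hashlib
from collections import Counter

def slice_for(value, num_slices):
    """Map a distribution-key value to a slice with a stable hash.
    MD5 is a stand-in here; it is not Redshift's hash function."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_slices

def skew_ratio(key_values, num_slices):
    """Return (rows on the fullest slice / ideal rows per slice,
    number of slices that received any rows)."""
    counts = Counter(slice_for(v, num_slices) for v in key_values)
    ideal = len(key_values) / num_slices
    return max(counts.values()) / ideal, len(counts)

NUM_SLICES = 40   # assumption: 20 nodes x 2 slices per node
ROWS = 70_000     # hypothetical row count

workday_keys = [i % 7 for i in range(ROWS)]   # only 7 distinct values
id_keys = list(range(ROWS))                   # one distinct value per row

workday_skew, workday_slices = skew_ratio(workday_keys, NUM_SLICES)
id_skew, id_slices = skew_ratio(id_keys, NUM_SLICES)

print(f"workday: {workday_slices} slices used, fullest slice {workday_skew:.1f}x ideal")
print(f"id:      {id_slices} slices used, fullest slice {id_skew:.2f}x ideal")
```

With 7 distinct key values at most 7 slices can ever receive rows, no matter how many nodes the cluster has, which is why a workday key skews badly while a high-cardinality key spreads out.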
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table
2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.
3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.
It's a bit complex to do this, so you may need some professional help. There may be specifics to your use case.
As of writing this (just after re:Invent 2018), Redshift has Automatic Distribution available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, try a few combinations by replicating the same table with different DIST keys if you don't like what Automatic DIST is doing. After the tables are created, run the admin utility from the Git repo (preferably create a view on the SQL script in the Redshift DB).
Also, if you have good clarity on your query usage pattern, you can use the following queries to check how well the sort keys are performing.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
Of course, there is always room for improvement in the SQL above, depending on the specific stats that you may want to look at or drill down into.
Hope this helps.

PostgreSQL performance tuning with table partitions

I am solving a performance issue on a PostgreSQL 9.6 dbo based system. Intro:
A 12-year-old system, similar to a banking system, whose most queried primary table is called transactions.
CREATE TABLE jrn.transactions (
ID BIGSERIAL,
type_id VARCHAR(200),
account_id INT NOT NULL,
date_issued DATE,
date_accounted DATE,
amount NUMERIC,
..
)
In the table transactions we store all transactions within a bank account. Field type_id determines the type of a transaction. It also serves as the C# EntityFramework discriminator column. Values are like:
card_payment, cash_withdrawl, cash_in, ...
14 types of transaction are known.
In general, there are 4 types of queries (no. 3 and no. 4 are by far the most frequent):
1) select a single transaction, like: SELECT * FROM jrn.transactions WHERE id = 3748734
2) select a single transaction with a JOIN to another transaction, like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734
3) select transactions 0-100, 100-200, ... of a given type, like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100
4) several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'
In the last few months we had unexpected row count growth; the table now has 120M rows.
We are thinking of table partitioning, following to PostgreSQL doc: https://www.postgresql.org/docs/10/static/ddl-partitioning.html
Options:
partition table by type_id into 14 partitions
add column year and partition table by year (or year_month) into 12 (or 144) partitions.
I am now restoring data into our test environment, and I am going to test both options.
What do you consider the most appropriate partitioning rule for such situation? Any other options?
Thanks for any feedback / advice etc.
Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.
The only good reason I see for partitioning would be if you want to delete old rows efficiently; then partitioning by date would be best.
Based on your queries, you should have these indexes (apart from the primary key index):
CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);
The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):
CREATE INDEX ON jrn.transactions (account_id, type_id);
What you have here is almost a perfect case for column-based storage as you may get it using a SAP HANA Database. However, as you explicitly have asked for a Postgres answer and I doubt that a HANA database will be within the budget limit, we will have to stick with Postgres.
Your two queries no. 3 and 4 go quite into different directions, so there won't be "the single answer" to your problem - you will always have to balance somehow between these two use cases. Yet, I would try to use two different techniques to approach each of them individually.
From my perspective, the biggest problem is the query no. 4, which creates quite a high load on your postgres server just because it is summing up values. Moreover, you are just summing up values over and over again, which most likely won't change often (or even at all), as you have said that UPDATEs nearly do not happen at all. I furthermore assume two more things:
transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
The values of column date_issued when INSERTing typically are somewhere "close to today" - so you usually won't INSERT stuff way in the past.
Out of this, to prevent aggregating values over and over again unnecessarily, I would introduce yet another table: let's call it transactions_aggr, which is built up like this:
create table transactions_aggr (
account_id INT NOT NULL,
date_issued DATE,
sumamount NUMERIC,
primary key (account_id, date_issued)
)
which will give you a table of per-day preaggregated values.
To determine which values are already preaggregated, I would add another boolean-typed column to transactions, which indicates to me, which of the rows are contained in transactions_aggr and which are not (yet). The query no. 4 then would have to be changed in such a way that it reads only non-preaggregated rows from transactions, whilst the rest could come from transactions_aggr. To facilitate that you could define a view like this:
-- combine the pre-aggregated daily sums with the rows not yet folded into transactions_aggr
select account_id, date_issued, sum(amount) as sumamount from
(
select account_id, date_issued, sumamount as amount from transactions_aggr as aggr
union all
-- raw rows only; the outer query does the summing
select account_id, date_issued, amount from transactions as t where t.aggregated = false
) as combined
group by account_id, date_issued
Needless to say that putting an index on transactions.aggregated (perhaps in conjunction with the account_id) could greatly help to improve the performance here.
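
The pre-aggregation idea can be tried out end to end with a small script. The following is a sketch only, using Python's built-in sqlite3 module as a stand-in for PostgreSQL (so types and the boolean flag differ slightly), with a cut-down transactions table and made-up rows. It checks that the combined query returns the same sums as aggregating the raw table directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        date_issued TEXT,
        amount NUMERIC,
        aggregated INTEGER NOT NULL DEFAULT 0  -- flag column (0 = not yet aggregated)
    );
    CREATE TABLE transactions_aggr (
        account_id INTEGER NOT NULL,
        date_issued TEXT,
        sumamount NUMERIC,
        PRIMARY KEY (account_id, date_issued)
    );
""")

# Made-up sample rows.
with conn:
    conn.executemany(
        "INSERT INTO transactions (account_id, date_issued, amount) VALUES (?, ?, ?)",
        [(1, "2017-01-01", 10), (1, "2017-01-01", 5),
         (1, "2017-01-02", 7), (2, "2017-01-01", 3)],
    )

# Batch job: fold all not-yet-aggregated rows into transactions_aggr and
# flag them, both inside one transaction.
with conn:
    conn.execute("""
        INSERT INTO transactions_aggr (account_id, date_issued, sumamount)
        SELECT account_id, date_issued, SUM(amount)
        FROM transactions
        WHERE aggregated = 0
        GROUP BY account_id, date_issued
    """)
    conn.execute("UPDATE transactions SET aggregated = 1 WHERE aggregated = 0")

# A new row arrives after the batch job; it is not pre-aggregated yet.
with conn:
    conn.execute(
        "INSERT INTO transactions (account_id, date_issued, amount) "
        "VALUES (1, '2017-01-02', 4)"
    )

# Pre-aggregated sums plus the raw rows that are still unflagged.
combined = conn.execute("""
    SELECT account_id, date_issued, SUM(amount) AS sumamount
    FROM (
        SELECT account_id, date_issued, sumamount AS amount FROM transactions_aggr
        UNION ALL
        SELECT account_id, date_issued, amount FROM transactions WHERE aggregated = 0
    ) AS combined
    GROUP BY account_id, date_issued
    ORDER BY account_id, date_issued
""").fetchall()

# Reference result: aggregate the raw table directly.
direct = conn.execute("""
    SELECT account_id, date_issued, SUM(amount)
    FROM transactions
    GROUP BY account_id, date_issued
    ORDER BY account_id, date_issued
""").fetchall()

print(combined == direct)
```

Note that this batch job only inserts; a repeated run that hits a day already present in transactions_aggr would need to add to the existing row instead.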
Updating transactions_aggr can be done using multiple approaches:
You could use this as a one-time activity and only pre-aggregate the current set of ~120m rows once. This would at least reduce the load on your machine doing aggregations significantly. However, over time you will run into the same problem again. Then you may just re-execute the entire procedure, simply dropping transactions_aggr as a whole and re-create it from scratch (all the original data still is there in transactions).
You have a quiet period somewhere during the week/month/in the night, when few or no queries are coming in. Then you can open a transaction, read all transactions WHERE aggregated = false and add them with UPDATEs to transactions_aggr. Keep in mind to then toggle aggregated to true (this should be done in the same transaction). The tricky part of this, however, is that you must pay attention to what reading queries will "see" of this transaction: depending on your accuracy requirements during the timeframe of this "update job", you may have to consider switching the transaction isolation level to "READ COMMITTED" to prevent ghost reads.
On the matter of your query no. 3, you could then try the approach of partitioning based on type_id. However, I perceive your query as a little strange, as you are performing a LIMIT/OFFSET without having specified an ordering (i.e. there is no ORDER BY clause in place). (NB: You are not saying that you would be using database cursors.) This may lead to the effect that the implicit order, which is currently used, changes if you enable partitioning on the table. So be careful about side effects this may cause in your program.
And one more thing: Before really doing the partition split, I would first check on the data distribution concerning type_id by issuing
select type_id, count(*) from transactions group by type_id
Otherwise it may turn out that, for example, 90% of your data is card_payment, so that you would have a heavily uneven distribution amongst your partitions, and the biggest performance-hogging queries would still go into this single "large partition".
Hope this helps a little - and good luck!

Is using Table variables faster than temp tables

Am I safe to assume that where I have stored procedures using the tempdb to write a temporary table, I'd be better off switching these to table variables to get better performance?
Temp tables are better in performance. If you use a Table Variable and the Data in the Variable gets too big, the SQL Server converts the Variable automatically into a temp table.
It depends, like almost every Database related question, on what you try to do. So it is hard to answer without more information.
So my answer is, try it and have a look at the execution plan. Use the fastest way with the lowest costs.
MSDN - Displaying Graphical Execution Plans (SQL Server Management Studio)
@Table can be faster as there is less "setup time" since the object is in memory only.
@Tables have a lot of catches though.
You can have a primary key on a @Table but that's about it. Other indexes (clustered or nonclustered) on combinations of columns are not possible.
Also, if your table is going to contain any real data volumes (more than about 200, maybe 1000 rows) then accessing the table will be slower, especially as you will probably not have a useful index on it.
#Tables are a pain in procs as they need to be dropped when debugging. They take longer to create, and they take longer to set up as you need to add indexes as a second step. But if you have lots of data then it's #Tables every time.
Even in cases where you have less than 100 rows of data in a table you may still want to use #Tables as you can create a useful index on the table.
In summary, I use @Tables most of the time for the ease when doing simple procs etc. But anything that needs to perform should be a #Table.
@Tables have no statistics so the execution plan entails more guesswork. Hence the recommended upper limit of 1000-ish rows. #Tables have statistics but these can be cached between invocations. If your cardinalities differ significantly each time the SP runs you'd want to REBUILD and RECOMPILE each time. This is an overhead, of course, but one which must be balanced against the cost of a rubbish plan.
Both types will do IO to TempDB.
So no, @Tables are not a panacea.
Table variables can perform very poorly as the number of rows in them increases.
Why is this?
Table variables don’t have distribution statistics and don’t trigger recompiles. Because of this, SQL Server is not able to estimate the number of rows in a table variable like it does for normal tables. When the optimiser compiles code that contains a table variable, it assumes a table is empty and uses an expected row count of 1 for the cardinality estimate. Because the optimiser only thinks a table variable contains a single row, it picks operators for the execution plan that work well with a small set of records, like the NESTED LOOPS operator for a JOIN operation.
As an example, I have just fixed a stored procedure which was performing poorly. The code was populating a table variable and using it in a join to filter the number of rows to accounts which were relevant:
FROM dbo.DimInvestorAccount
INNER JOIN @accounts acclist
ON acclist.AccountNumber = DimInvestorAccount.investorAccountNumber
+ 9 additional tables joined...
When run for a list of 1700 accounts, the query was taking 1m17s. Just changing the filter table definition from:
DECLARE @accounts TABLE (AccountNumber VARCHAR(20) COLLATE Latin1_General_BIN INDEX idx NONCLUSTERED)
to
CREATE TABLE #accounts (AccountNumber VARCHAR(20) COLLATE Latin1_General_BIN INDEX idx NONCLUSTERED)
brought the query time down to 800ms. Note that with 5 rows in the table, there was no significant difference - both temp table and table variable run in +/-400ms.
Microsoft's recommendation is to use Table Variables if the number of rows is <100.
Note that Microsoft have made changes in SQL Server 2019 to improve this (v15.x/Compatibility level 150)