HOT and FILLFACTOR Results
I've got some high-UPDATE tables where I've adjusted the FILLFACTOR to 95%, and I'm checking back in on them. I don't think that I've got the settings right, and am unclear how to tune them intelligently. I took another pass through Laurenz Albe's helpful blog post on HOT updates
https://www.cybertec-postgresql.com/en/hot-updates-in-postgresql-for-better-performance/
... and the clear source-code README:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT
Below is a query, adapted from the blog post, to check the status of the tables in the system, along with some sample output:
SELECT relname,
n_tup_upd as total_update_count,
n_tup_hot_upd as hot_update_count,
coalesce(div_safe(n_tup_upd, n_tup_hot_upd),0) as total_by_hot,
coalesce(div_safe(n_tup_hot_upd, n_tup_upd),0) as hot_by_total
FROM pg_stat_user_tables
order by 4 desc;
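(div_safe isn't a built-in; it's a small helper defined locally. A minimal sketch of what it might look like, assuming it should return NULL instead of raising an error on division by zero:)
CREATE OR REPLACE FUNCTION div_safe(numerator numeric, divisor numeric)
RETURNS numeric
LANGUAGE sql
IMMUTABLE
AS $$
    -- Return NULL rather than raising division_by_zero; callers wrap it in coalesce(..., 0).
    SELECT CASE WHEN divisor = 0 THEN NULL ELSE numerator / divisor END;
$$;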
A few results:
 relname           | total_update_count | hot_update_count | total_by_hot | hot_by_total
-------------------+--------------------+------------------+--------------+---------------
 rollups           |             369418 |              128 |    2886.0781 | 0.00034649097
 q_event           |              71781 |              541 |    132.68207 | 0.007536813
 analytic_scan     |            2104727 |            34304 |     61.35515 | 0.016298551
 clinic            |               4424 |               77 |    57.454544 | 0.017405063
 facility_location |             179636 |             6489 |    27.683157 | 0.03612305
 target_snapshot   |                494 |               18 |    27.444445 | 0.036437247
 inv               |            1733021 |            78234 |    22.151762 | 0.045143135
I'm unsure what ratio(s) I'm looking for here. Can anyone advise me how to read these results, or what to read to figure out how to interpret them?
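For what it's worth, the same numbers can also be expressed as a plain percentage without the helper function; a variant of the query above:
SELECT relname,
       n_tup_upd     AS total_update_count,
       n_tup_hot_upd AS hot_update_count,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 2) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY 4 DESC NULLS LAST;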
Are These UPDATEs HOTable?
I didn't address this basic point in the original draft of this question. I went back and checked my patch from a few months ago: I ran ALTER TABLE ... SET (fillfactor = 95) and then VACUUM (FULL, VERBOSE, ANALYZE) on 13 of my tables. (The VERBOSE is in there because I had some tables that couldn't be vacuumed due to a months-old process that needed clearing out, and that's how I found the problem. pg_stat_activity is my friend.)
However, most of these UPDATEs do touch an indexed column, but with an identical value. Like 1 = 1, so no change to the value. I've been thinking that that is still HOTable. If I'm wrong about that, bummer. Either way, I'm mostly hoping to clarify what exactly the goal is for the relationship amongst fillfactor, n_tup_upd, and n_tup_hot_upd.
SELECT relname,
n_tup_upd as total_update_count,
n_tup_hot_upd as hot_update_count,
coalesce(div_safe(n_tup_upd, n_tup_hot_upd),0) as total_by_hot,
coalesce(div_safe(n_tup_hot_upd, n_tup_upd),0) as hot_by_total,
(select value::integer from table_get_options('data',relname) where option = 'fillfactor') as fillfactor_setting
FROM pg_stat_user_tables
WHERE relname IN (
'activity',
'analytic_productivity',
'analytic_scan',
'analytic_sterilizer_load',
'analytic_sterilizer_loadinv',
'analytic_work',
'assembly',
'data_file_info',
'inv',
'item',
'print_job',
'q_event')
order by 4 desc;
Results:
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| relname | total_update_count | hot_update_count | total_divided_by_hot | hot_divided_by_total | fillfactor_setting |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| q_event | 71810 | 553 | 129.85533 | 0.0077008773 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_scan | 2109206 | 34536 | 61.072678 | 0.016373934 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| inv | 1733176 | 78387 | 22.110502 | 0.045227375 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| item | 630586 | 32110 | 19.638306 | 0.05092089 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_sterilizer_loadinv | 76976539 | 5206806 | 14.783831 | 0.06764147 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_work | 8117050 | 608847 | 13.331839 | 0.07500841 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| assembly | 90580 | 7281 | 12.4405985 | 0.08038198 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_sterilizer_load | 19249 | 2997 | 6.422756 | 0.1556964 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| activity | 3795 | 711 | 5.3375525 | 0.18735178 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_productivity | 106486 | 25899 | 4.1115875 | 0.24321507 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| print_job | 1414 | 388 | 3.6443298 | 0.27439886 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| data_file_info | 402086 | 285663 | 1.4075537 | 0.7104525 | 90 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
(I just looked for and found an on-line table generator to help out with this kind of example at https://www.tablesgenerator.com/text_tables. It's a bit awkward to use, but faster than building out monospaced aligned text manually.)
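Since a lower FILLFACTOR only helps HOT updates if the reserved space on each page is actually still free, it may also be worth inspecting the pages directly. A sketch using the pgstattuple extension (assuming it can be installed, and using the inv table from the results above):
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- free_percent is space still available on pages (what fillfactor reserves for HOT updates);
-- dead_tuple_percent is space occupied by dead tuples awaiting pruning/VACUUM.
SELECT free_percent, dead_tuple_percent
FROM pgstattuple('inv');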
FILLFACTOR and HOT update ratio
Figured I could sort this out a bit by adapting Laurenz Albe's code from https://www.cybertec-postgresql.com/en/hot-updates-in-postgresql-for-better-performance/. All I've done here is make a script that builds out a table with a FILLFACTOR of 10, 20, 30, ... 100 and then updates it in the same way for each setting. Every time the table is created, it is populated with 235 records, which are then updated 10 times each. The update sets a non-indexed field back to itself, so no value actually changes:
UPDATE mytable SET val = val;
Below are the results:
+------------+---------------+-------------+--------------+
| FILLFACTOR | total_updates | hot_updates | total_to_hot |
+------------+---------------+-------------+--------------+
| 10 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 20 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 30 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 40 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 50 | 2350 | 2223 | 1.06 |
+------------+---------------+-------------+--------------+
| 60 | 2350 | 2188 | 1.07 |
+------------+---------------+-------------+--------------+
| 70 | 2350 | 1883 | 1.25 |
+------------+---------------+-------------+--------------+
| 80 | 2350 | 1574 | 1.49 |
+------------+---------------+-------------+--------------+
| 90 | 2350 | 1336 | 1.76 |
+------------+---------------+-------------+--------------+
| 100 | 2350 | 987 | 2.38 |
+------------+---------------+-------------+--------------+
From this, it seems that when the total_to_hot ratio rises, there may be a benefit to reducing the FILLFACTOR, leaving more free space per page for HOT updates.
https://www.postgresql.org/docs/13/monitoring-stats.html
n_tup_upd counts all updates, including HOT updates, and n_tup_hot_upd counts HOT updates only. But there doesn't seem to be a counter for "could have been a HOT update, if we hadn't run out of room on the page." That would be great, but it also seems like a lot to ask for. (And maybe more expensive to keep track of than can be justified?)
Here is the script. I edited and re-ran the test with each FILLFACTOR.
-- Set up the table for the test
DROP TABLE IF EXISTS mytable;
CREATE TABLE mytable (
id integer PRIMARY KEY,
val integer NOT NULL
) WITH (autovacuum_enabled = off);
-- Change the FILLFACTOR. The default is 100.
ALTER TABLE mytable SET (fillfactor = 10); -- The only part that changes between runs.
-- Seed the data
INSERT INTO mytable
SELECT *, 0
FROM generate_series(1, 235) AS n;
-- Thrash the data
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
-- How does it look?
SELECT n_tup_upd as total_updates,
n_tup_hot_upd as hot_updates,
div_safe(n_tup_upd, n_tup_hot_upd) as total_to_hot
FROM pg_stat_user_tables
WHERE relname = 'mytable';
Checking the FILLFACTOR Setting
As a side-note, I wanted a quick call to check the FILLFACTOR setting on a table, and it turned out to be more involved than I thought. I wrote up a function that works, but could likely see some improvements...if anyone has suggestions. I call it like this:
select * from table_get_options('foo', 'bar');
or
select * from table_get_options('foo','bar') where option = 'fillfactor';
Here's the code, if anyone has improvements to offer:
CREATE OR REPLACE FUNCTION dba.table_get_options(text,text)
RETURNS TABLE (
schema_name text,
table_name text,
option text,
value text
)
LANGUAGE SQL AS
$BODY$
WITH
packed_options AS (
select pg_class.relname as table_name,
btrim(pg_options_to_table(pg_class.reloptions)::text, '()') as option_kvp -- Convert to text (fillfactor,95), then strip off ( and )
from pg_class
join pg_namespace
on pg_namespace.oid = pg_class.relnamespace
where pg_namespace.nspname = $1
and relname = $2
and reloptions is not null
),
unpacked_options AS (
select $1 as schema_name,
$2 as table_name,
split_part(option_kvp, ',', 1) as option,
split_part(option_kvp, ',', 2) as value
from packed_options
)
select * from unpacked_options;
$BODY$;
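One possible simplification, offered only as a sketch: pg_options_to_table() can be unnested with a LATERAL join instead of being cast to text and split apart (the schema 'data' and table 'inv' below are just the examples from above):
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       o.option_name,
       o.option_value
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
CROSS JOIN LATERAL pg_options_to_table(c.reloptions) AS o(option_name, option_value)
WHERE n.nspname = 'data'
  AND c.relname = 'inv';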
The numbers show that your strategy is not working, and the overwhelming majority of updates are not HOT. You also show the reason: Even if you update an indexed column to the original value, you won't get a HOT update.
The solution would be to differentiate by including the indexed column in the UPDATE statement only if it is really modified.
A fillfactor of 95 is also pretty high, unless you have tables with really small rows. Perhaps you would get better results with a setting like 90 or 85.
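For example (table and column names here are purely illustrative), the application would issue one of two statements depending on whether the indexed column really changes:
-- Indexed column unchanged: leave it out of the SET list entirely, so the UPDATE can be HOT.
UPDATE inv SET qty = 42 WHERE id = 1;

-- Indexed column really changes: include it, accepting that this UPDATE cannot be HOT.
UPDATE inv SET qty = 42, serial_no = 'B-0042' WHERE id = 1;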
Related
We did a VACUUM FULL on our table and its TOAST table. The dead tuples dropped drastically; however, the max transaction ID stays pretty much the same. My question is: why did the max transaction ID not go down when the dead tuples dropped so drastically?
Before
select relname,last_autovacuum ,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,vacuum_count,autovacuum_count from pg_stat_all_tables where relname in ('examples','pg_toast_16450');
relname | last_autovacuum | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | vacuum_count | autovacuum_count
----------------+-------------------------------+-----------+------------+---------------+------------+------------+---------------------+--------------+------------------
examples | 2022-01-18 23:26:52.432808+00 | 57712813 | 9818 | 48386674 | 3601588 | 306558 | 42208 | 0 | 44
pg_toast_16450 | 2022-01-17 23:14:42.516933+00 | 0 | 5735566377 | 0 | 3763818 | 805501171 | 11472355929 | 0 | 51
SELECT max(age(datfrozenxid)) FROM pg_database;
max
-----------
199857797
After
select relname,last_autovacuum ,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,vacuum_count,autovacuum_count from pg_stat_all_tables where relname in ('examples','pg_toast_16450');
relname | last_autovacuum | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | vacuum_count | autovacuum_count
----------------+-------------------------------+-----------+-------------+--------------+------------+------------+---------------------+--------------+------------------
examples | 2022-02-01 15:41:17.722575+00 | 120692014 | 9818 | 98148003 | 4172134 | 17666 | 150566 | 1 | 4064
pg_toast_16450 | 2022-02-01 20:49:30.552251+00 | 0 | 16169731895 | 0 | 5557218 | 33365 | 32342853690 | 0 | 15281
SELECT max(age(datfrozenxid)) FROM pg_database;
max
-----------
183888023
Yes, that is as expected. You need VACUUM to freeze tuples. VACUUM (FULL) doesn't.
Users tend to be confused, because both are triggered by the VACUUM statement, but VACUUM (FULL) is actually something entirely different from VACUUM. It is not just “a more thorough VACUUM”. The only thing they have in common is that they get rid of dead tuples. VACUUM (FULL) does not modify tuples the way freezing has to; it just copies them around (or doesn't, if they are dead).
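So to actually move datfrozenxid forward, run a plain VACUUM (optionally with FREEZE) rather than VACUUM (FULL). A sketch using the examples table from above:
-- Plain VACUUM with FREEZE scans aggressively and freezes every tuple it can:
VACUUM (FREEZE, VERBOSE) examples;

-- Check the effect on the table and on the database as a whole:
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relname = 'examples';
SELECT max(age(datfrozenxid)) FROM pg_database;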
MY SITUATION:
I have written a piece of code that returns a dataset containing each web user's aggregated activity for the previous 90 days and, after some calculation, returns a score. Essentially, it is like RFV.
A (VERY) simplified version of the code can be seen below:
WITH start_data AS (
SELECT user_id
,COUNT(web_visits) AS count_web_visits
,COUNT(button_clicks) AS count_button_clicks
,COUNT(login) AS count_log_in
,SUM(time_on_site) AS total_time_on_site
,CURRENT_DATE AS run_date
FROM web.table
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
AND some_flag = 1
AND some_other_flag = 2
GROUP BY user_id
ORDER BY user_id DESC
)
The output might look something like the below:
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
This output is then stored in its own AWS/Redshift table and will form the base table for the task.
SELECT *
into myschema.base_table
FROM start_data
DESIRED OUTPUT:
What I need to be able to do is iteratively run this code such that I append new data to myschema.base_table, every day, for the previous 90 days' aggregation.
The way I see it, I can either go forwards or backwards, it doesn't matter.
That is to say, I can either:
Starting from today, run the code, every day, for the preceding 90 days, going BACK to the (first date in the table + 90 days)
OR
Starting from the (first date in the table + 90 days), run the code for the preceding 90 days, every day, going FORWARD to today.
Option 2 seems the best option to me and the desired output looks like this (PARTITION FOR ILLUSTRATION ONLY):
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 412 | 339 |180 | 3600 | 20-01-20 |
| 2391823 | 417 | 6253 |863 | 2400 | 20-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 20-01-20 |
| 5561296 | 281 | 679 |262 | 4200 | 20-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 331 | 204 |83 | 3200 | 21-01-20 |
| 2391823 | 652 | 1222 |409 | 7200 | 21-01-20 |
| 3729128 | 71 | 248 |71 | 720 | 21-01-20 |
| 5561296 | 366 | 722 |519 | 3600 | 21-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 213 | 808 |57 | 3600 | 22-01-20 |
| 2391823 | 817 | 4265 |476 | 1200 | 22-01-20 |
| 3729128 | 33 | 128 |62 | 120 | 22-01-20 |
| 5561296 | 623 | 411 |283 | 2400 | 22-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
WHAT I HAVE TRIED:
I have successfully created a WHILE loop to sequentially increment the date as follows:
CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE
start_date DATE := '2020-11-20' ;
BEGIN
WHILE CURRENT_DATE > start_date
LOOP
RAISE INFO 'Date: %', start_date;
start_date = start_date + 1;
END LOOP;
RAISE INFO 'Loop Statement Executed Successfully';
END;
$$;
CALL retrospective_data();
Thus producing the dates as follows:
INFO: Date: 2020-11-20
INFO: Date: 2020-11-21
INFO: Date: 2020-11-22
INFO: Date: 2020-11-23
INFO: Date: 2020-11-24
INFO: Date: 2020-11-25
INFO: Date: 2020-11-26
INFO: Loop Statement Executed Successfully
Query 1 OK: CALL
WHAT I NEED HELP WITH:
I need to be able to apply the WHILE loop to the initial code such that the WHERE clause becomes:
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, start_date) AND start_date
But where start_date is the result of each incremental loop. Additionally, the result of each execution needs to be appended to the previous.
Any help appreciated.
It is fairly clear that you come from a procedural programming background, and the first recommendation is to stop thinking in terms of loops. Databases are giant and powerful data-filtering machines, and thinking in terms of 'do step 1, then step 2' often leads to missing out on all this power.
You want to look into window functions which allow you to look over ranges of other rows for each row you are evaluating. This is exactly what you are trying to do.
Also you shouldn't cast a date to a string just to compare it to other dates (WHERE clause). This is just extra casting and defeats Redshift's table scan optimizations. Redshift uses block metadata that optimizes what data is needed to be read from disk but this cannot work if the column is being cast to another data type.
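Concretely, the WHERE clause from the question could compare dates directly; a minimal rewrite of that one line:
-- Instead of TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN ..., compare dates as dates:
WHERE visit_date BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE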
Now to your code (an off-the-cuff rewrite, and for just the first column). Be aware that GROUP BY clauses run BEFORE window functions, and that I'm assuming not all users have a visit every day. And since Redshift doesn't support RANGE in window functions, we will need to make sure all dates are represented for all user_ids. This is done by UNIONing with a sufficient number of rows to cover the date range. You may have a table like this or may want to create one, but I'll just generate something on the fly to show the process (and this process assumes that there are fewer distinct dates than rows in the table - likely, but not iron-clad).
SELECT user_id
      ,COUNT(web_visits) AS count_web_visits_by_day
      ,SUM(COUNT(web_visits)) OVER (partition by user_id order by visit_date
                                    rows between 90 preceding and current row) AS count_web_visits_90d
      ...
      ,visit_date
FROM (
      SELECT visit_date, user_id, web_visits, ...
      FROM web.table
      WHERE some_flag = 1 AND some_other_flag = 2
      UNION ALL -- this is where I union with a full set of dates for every user_id
      SELECT d.visit_date, u.user_id, NULL AS web_visits, ...
      FROM (SELECT DISTINCT user_id FROM web.table) u
      CROSS JOIN
           (SELECT CURRENT_DATE + 1 - row_number() OVER (ORDER BY visit_date) AS visit_date
            FROM web.table) d
     ) src
GROUP BY visit_date, user_id
ORDER BY visit_date ASC, user_id DESC;
The idea here is to set up your data to ensure that you have at least one row for each user_id for each date. Then the window functions can operate on the "grouped by date and user_id" information to sum and count over the past 90 row (which is the same as past 90 days). You now have all the information you want for all dates where each is looking back over 90 days. One query to give you all the information, no while loop, no stored procedures.
Untested but should give you the pattern. You may want to massage the output to give you the range you are looking for and clean up NULL result rows.
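As a rough, hypothetical sketch of the final step (only myschema.base_table and run_date come from the question; the other names are assumed from the query above), the whole backfill then becomes a single INSERT instead of a daily loop:
INSERT INTO myschema.base_table (user_id, count_web_visits, run_date)
SELECT user_id,
       count_web_visits_90d,   -- the windowed 90-day sum from the query above
       visit_date AS run_date
FROM   windowed_activity       -- hypothetical: the query above wrapped in a view or CTE
WHERE  visit_date >= DATEADD(DAY, 90, (SELECT MIN(visit_date) FROM web.table));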
I want to create a function that can create a table, in which part of the columns is derived from the other two tables.
input table1:
This is a static table for each loan. Each loan has only one row with information related to that loan. For example, original unpaid balance, original interest rate...
| id | loan_age | ori_upb | ori_rate | ltv |
| --- | -------- | ------- | -------- | --- |
| 1 | 360 | 1500 | 4.5 | 0.6 |
| 2 | 360 | 2000 | 3.8 | 0.5 |
input table2:
This is a dynamic table for each loan. Each loan has several rows showing the loan's performance in each month. For example, current unpaid balance, current interest rate, delinquency status...
| id | month| cur_upb | cur_rate |status|
| ---| --- | ------- | -------- | --- |
| 1 | 01 | 1400 | 4.5 | 0 |
| 1 | 02 | 1300 | 4.5 | 0 |
| 1 | 03 | 1200 | 4.5 | 1 |
| 2 | 01 | 2000 | 3.8 | 0 |
| 2 | 02 | 1900 | 3.8 | 0 |
| 2 | 03 | 1900 | 3.8 | 1 |
| 2 | 04 | 1900 | 3.8 | 2 |
output table:
The output table contains information from table1 and table2. Payoffupb is the last record of cur_upb in table2. This table is built for model development.
| id | loan_age | ori_upb | ori_rate | ltv | payoffmonth| payoffupb | payoffrate |lastStatus | modification |
| ---| -------- | ------- | -------- | --- | ---------- | --------- | ---------- |---------- | ------------ |
| 1 | 360 | 1500 | 4.5 | 0.6 | 03 | 1200 | 4.5 | 1 | null |
| 2 | 360 | 2000 | 3.8 | 0.5 | 04 | 1900 | 3.8 | 2 | null |
Most columns in the output table can be taken directly or derived from columns in the two input tables, but some cannot, so they are left blank.
My main question is how to write a function to take two tables as inputs and output another table?
I already wrote the feature transformation part for data files in 2018, but I need to do the same thing again for data files in some other years. That's why I want to create a function to make things easier.
As you want to insert the latest entry of table2 against each entry of table1, try this:
insert into table3 (id, loan_age, ori_upb, ori_rate, ltv,
payoffmonth, payoffupb, payoffrate, lastStatus )
select distinct on (t1.id)
t1.id, t1.loan_age, t1.ori_upb, t1.ori_rate, t1.ltv, t2.month, t2.cur_upb,
t2.cur_rate, t2.status
from
table1 t1
inner join
table2 t2 on t1.id=t2.id
order by t1.id , t2.month desc
DEMO1
EDIT for your updated question:
Here is a function to do the above, assuming the structures of table1, table2, and table3 will always be identical.
create or replace function insert_values(table1 varchar, table2 varchar, table3 varchar)
returns int as $$
declare
count_ int;
begin
execute format('insert into %I (id, loan_age, ori_upb, ori_rate, ltv, payoffmonth, payoffupb, payoffrate, lastStatus )
select distinct on (t1.id) t1.id, t1.loan_age, t1.ori_upb,
t1.ori_rate,t1.ltv,t2.month,t2.cur_upb, t2.cur_rate, t2.status
from %I t1 inner join %I t2 on t1.id=t2.id order by t1.id , t2.month desc',table3,table1,table2);
GET DIAGNOSTICS count_ = ROW_COUNT;
return count_;
end;
$$
language plpgsql;
and call the above function like below, which will return the number of inserted rows:
select * from insert_values('table1','table2','table3');
DEMO2
I am trying to create a view in Redshift to enable us to see the latest data in each table.
We have datasets that update on various schedules, and every table has a column "updated" that contains a timestamp of the row's last update.
What I want to achieve is the view at the bottom (built from these two example tables):
other.bigtable
+-----+--------+------------------+
| id | stat | updated |
+-----+--------+------------------+
| A2 | rgerhg | 03/05/2020 05:00 |
| F5 | bdfb | 03/05/2020 05:00 |
| GF5 | bb | 03/05/2020 05:00 |
+-----+--------+------------------+
default.test
+----+------+------------------+
| id | name | updated |
+----+------+------------------+
| 1 | A | 02/02/2008 00:00 |
| 2 | B | 02/02/2008 00:00 |
| 3 | C | 02/02/2008 00:00 |
| 4 | F | 02/02/2008 00:00 |
| 5 | T | 02/02/2010 00:00 |
+----+------+------------------+
default.view_updates
+---------+------------+------------------+
| schema | table_name | max_update |
+---------+------------+------------------+
| default | test | 02/02/2010 00:00 |
| other | big_table | 03/05/2020 05:00 |
+---------+------------+------------------+
So far I have got as far as listing the tables and schemas, but I have no idea where to start on the dates. Redshift seems a bit more limited here.
EDIT:
Utilising some code borrowed from the web, I was hoping to use this to find the tables that have the "updated" column and then build the view from them:
select t.table_schema,
t.table_name
from information_schema.tables t
inner join information_schema.columns c
on c.table_name = t.table_name
and c.table_schema = t.table_schema
where c.column_name = 'updated'
and t.table_schema not in ('information_schema', 'pg_catalog')
and t.table_type = 'BASE TABLE'
order by t.table_schema;
[Source: https://dataedo.com/kb/query/amazon-redshift/find-tables-with-specific-column-name]
You can select the most recent date from each table and union the results together (and put them in a view if you like).
Select * from (select top 1 'test' as table_name, updated from test order by updated desc) t
union all
Select * from (select top 1 'big_table', updated from big_table order by updated desc) b;
You can have a long list of "union all"s up to some limit. This hard codes the tables into the view - I assume this is what you are looking for.
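A sketch of how the final view from the question might then look, keeping the same TOP 1 pattern (the table names are the ones from the example; adjust schema qualification to suit):
CREATE VIEW view_updates AS
SELECT 'default' AS schema, 'test' AS table_name, updated AS max_update
FROM (SELECT TOP 1 updated FROM test ORDER BY updated DESC) t
UNION ALL
SELECT 'other', 'bigtable', updated
FROM (SELECT TOP 1 updated FROM other.bigtable ORDER BY updated DESC) b;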
To put it more simply than the detail below: if one has one or more query parameters, e.g. x_id (or report / table function parameters), that are performance-critical (e.g. some primary-key index could be used) and that may be, depending on the use case/report filters applied, one of
null
an exact match (e.g. some unique id)
a like expression
or even a regexp expression
then, if all these possibilities are coded in a single query, as far as I can tell the optimizer will
generate a single static plan, independent of the actual runtime value of the parameter
and thus cannot be relied on to use an index on x_id, even when the filter is e.g. an exact match
Are there other ways to handle this than to
let some PL/SQL code choose from n predefined, use-case-optimized queries/views?
(the number of which can grow quite large the more such flexible parameters one has)
or use some manually string-constructed and dynamically compiled query?
Basically I have two slightly different use cases/questions as documented and executable below:
A - select * from tf_sel
B - select * from data_union
which could potentially be solved via SQL hints or using some other trick.
To speed these queries up I am currently separating the "merged queries" at a certain implementation level (a table function), which is quite cumbersome and harder to maintain, but ensures the queries run quite fast thanks to their better execution plans.
As I see it, the main problem is the static nature of the optimizer's SQL plan, which is always the same even though it could be much more efficient if it took some "query-time-constant" filter parameters into account.
with
-- Question A: What would be a good strategy to make tf_sel with tf_params nearly as fast as query_use_case_1_eq
-- which actually provides the same result?
--
-- - a complex query should be used in various reports with filters
-- - we want to keep as much as possible filter functionality on the db side (not the report engine side)
-- to be able to utilize the fast and efficient db engine and for loosely coupled software design
complex_query as ( -- just some imaginable complex query with a lot of table/view joins, aggregation/analytical functions etc.
select 1 as id, 'ab12' as indexed_val, 'asdfasdf' x from dual
union all select 2, 'ab34', 'a uiop345' from dual
union all select 3, 'xy34', 'asdf 0u0duaf' from dual
union all select 4, 'xy55', ' asdja´sf asd' from dual
)
-- <<< comment the following lines in to test it with the above
-- , query_use_case_1_eq as ( -- quite fast and maybe the 95% use case
-- select * from complex_query where indexed_val = 'ab12'
-- )
--select * from query_use_case_1_eq
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- --------
-- 1 ab12 asdfasdf
-- <<< comment the following lines in to test it with the above
-- , query_use_case_2_all as ( -- significantly slower due to a lot of underlying calculations
-- select * from complex_query
-- )
--select * from query_use_case_2_all
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- -------------
-- 1 ab12 asdfasdf
-- 2 ab34 a uiop345
-- 3 xy34 asdf 0u0duaf
-- 4 xy55 asdja´sf asd
-- <<< comment the following lines in to test it with the above
-- , query_use_case_3_like as (
-- select * from complex_query where indexed_val like 'ab%'
-- )
--select * from query_use_case_3_like
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- ---------
-- 1 ab12 asdfasdf
-- 2 ab34 a uiop345
-- <<< comment the following lines to simulate the table function
, tf_params as ( -- table function params: imagine we have a table function where these are passed depending on the report
select 'ab12' p_indexed_val, 'eq' p_filter_type from dual
)
, tf_sel as ( -- table function select: nicely integrating all query possibilities, but being veeery slow :-(
select q.*
from
tf_params p -- just here so this example works without the need for the actual function
join complex_query q on (1=1)
where
p_filter_type = 'all'
or (p_filter_type = 'eq' and indexed_val = p_indexed_val)
or (p_filter_type = 'like' and indexed_val like p_indexed_val)
or (p_filter_type = 'regexp' and regexp_like(indexed_val, p_indexed_val))
)
-- actually we would pass the tf_params above if it were a real table function
select * from tf_sel
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- --------
-- 1 ab12 asdfasdf
-- Question B: How can we speed up data_union with dg_filter to be as fast as the data_group1 query which
-- actually provides the same result?
--
-- A very similar approach is considered in other scenarios where we like to join the results of
-- different queries (>5) returning joinable data and being filtered based on the same parameters.
-- <<< comment the following lines to simulate the union problem
-- , data_group1 as ( -- may run quite fast
-- select 'dg1' dg_id, q.* from complex_query q where x < 'a' -- just an example returning some special rows that should be filtered later on!
-- )
--
-- , data_group2 as ( -- may run quite fast
-- select 'dg2' dg_id, q.* from complex_query q where instr(x,'p') >= 0 -- just an example returning some special rows that should be filtered later on!
-- )
--
--
-- , dg_filter as ( -- may be set by a report or indirectly by user filters
-- select 'dg1' dg_id from dual
-- )
--
-- , data_union as ( -- runs much slower due to another execution plan
-- select * from (
-- select * from data_group1
-- union all select * from data_group2
-- )
-- where dg_id in (select dg_id from dg_filter)
-- )
--
--select * from data_union
-- >>>
-- DG_ID ID INDEXED_VAL X
-- ----- -- ----------- -------------
-- dg1 4 xy55 asdja´sf asd
This is a comment on the sample code and answer provided by jonearles.
Actually, your answer mixed up my use cases A and B (which are unrelated, although they occur together in certain scenarios). It is nevertheless essential that you mentioned that the optimizer has dynamic FILTER operations and perhaps other capabilities.
use case B ("data partition/group union")
Actually use case B (based on your sample table) looks more like this, but I still have to check for the performance issue in the real scenario. Maybe you can see some problems with it already?
select * from (
select 'dg1' data_group, x.* from sample_table x
where mod(to_number(some_other_column1), 100000) = 0 -- just some example restriction
--and indexed_val = '3635' -- commenting this in and executing this standalone returns:
----------------------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 1 | 23 | 2 (0)|
--| 1 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 2 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
----------------------------------------------------------------------------------------
union all
select 'dg2', x.* from sample_table x
where mod(to_number(some_other_column2), 9999) = 0 -- just some example restriction
union all
select 'dg3', x.* from sample_table x
where mod(to_number(some_other_column3), 3635) = 0 -- just some example restriction
)
where data_group in ('dg1') and indexed_val = '35'
-------------------------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 3 | 639 | 2 (0)|
--| 1 | VIEW | | 3 | 639 | 2 (0)|
--| 2 | UNION-ALL | | | | |
--| 3 | TABLE ACCESS BY INDEX ROWID | SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 4 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
--| 5 | FILTER | | | | |
--| 6 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 7 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
--| 8 | FILTER | | | | |
--| 9 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 10 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
-------------------------------------------------------------------------------------------
use case A (filtering by column query type)
Based on your sample table this is more like what I wanna do.
As you can see, the query with just the fast where p.ft_id = 'eq' and x.indexed_val = p.val shows index usage, but having all the different filter options in the where clause causes the plan to switch to always using a full table scan :-/
(Even if I use the :p_filter_type and :p_indexed_val_filter bind variables everywhere in the SQL rather than just in the one spot I put them, it won't change.)
with
filter_type as (
select 'all' as id from dual
union all select 'eq' as id from dual
union all select 'like' as id from dual
union all select 'regexp' as id from dual
)
, params as (
select
(select * from filter_type where id = :p_filter_type) as ft_id,
:p_indexed_val_filter as val
from dual
)
select *
from params p
join sample_table x on (1=1)
-- the following with the above would show the 'eq' use case with a fast index scan (plan id 14/15)
--where p.ft_id = 'eq' and x.indexed_val = p.val
------------------------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 1 | 23 | 12 (0)|
--| 1 | VIEW | | 4 | 20 | 8 (0)|
--| 2 | UNION-ALL | | | | |
--| 3 | FILTER | | | | |
--| 4 | FAST DUAL | | 1 | | 2 (0)|
--| 5 | FILTER | | | | |
--| 6 | FAST DUAL | | 1 | | 2 (0)|
--| 7 | FILTER | | | | |
--| 8 | FAST DUAL | | 1 | | 2 (0)|
--| 9 | FILTER | | | | |
--| 10 | FAST DUAL | | 1 | | 2 (0)|
--| 11 | FILTER | | | | |
--| 12 | NESTED LOOPS | | 1 | 23 | 4 (0)|
--| 13 | FAST DUAL | | 1 | | 2 (0)|
--| 14 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 15 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
--| 16 | VIEW | | 4 | 20 | 8 (0)|
--| 17 | UNION-ALL | | | | |
--| 18 | FILTER | | | | |
--| 19 | FAST DUAL | | 1 | | 2 (0)|
--| 20 | FILTER | | | | |
--| 21 | FAST DUAL | | 1 | | 2 (0)|
--| 22 | FILTER | | | | |
--| 23 | FAST DUAL | | 1 | | 2 (0)|
--| 24 | FILTER | | | | |
--| 25 | FAST DUAL | | 1 | | 2 (0)|
------------------------------------------------------------------------------------------
where
--mod(to_number(some_other_column1), 3000) = 0 and -- just some example restriction
(
p.ft_id = 'all'
or
p.ft_id = 'eq' and x.indexed_val = p.val
or
p.ft_id = 'like' and x.indexed_val like p.val
or
p.ft_id = 'regexp' and regexp_like(x.indexed_val, p.val)
)
-- with the full flexibility of the filter the plan shows a full table scan (plan id 13) :-(
--------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 1099 | 25277 | 115 (3)|
--| 1 | VIEW | | 4 | 20 | 8 (0)|
--| 2 | UNION-ALL | | | | |
--| 3 | FILTER | | | | |
--| 4 | FAST DUAL | | 1 | | 2 (0)|
--| 5 | FILTER | | | | |
--| 6 | FAST DUAL | | 1 | | 2 (0)|
--| 7 | FILTER | | | | |
--| 8 | FAST DUAL | | 1 | | 2 (0)|
--| 9 | FILTER | | | | |
--| 10 | FAST DUAL | | 1 | | 2 (0)|
--| 11 | NESTED LOOPS | | 1099 | 25277 | 115 (3)|
--| 12 | FAST DUAL | | 1 | | 2 (0)|
--| 13 | TABLE ACCESS FULL| SAMPLE_TABLE | 1099 | 25277 | 113 (3)|
--| 14 | VIEW | | 4 | 20 | 8 (0)|
--| 15 | UNION-ALL | | | | |
--| 16 | FILTER | | | | |
--| 17 | FAST DUAL | | 1 | | 2 (0)|
--| 18 | FILTER | | | | |
--| 19 | FAST DUAL | | 1 | | 2 (0)|
--| 20 | FILTER | | | | |
--| 21 | FAST DUAL | | 1 | | 2 (0)|
--| 22 | FILTER | | | | |
--| 23 | FAST DUAL | | 1 | | 2 (0)|
--------------------------------------------------------------------------
Several features enable the optimizer to produce dynamic plans. The most common feature is FILTER operations, which should not be confused with filter predicates. A FILTER operation allows Oracle to enable or disable part of the plan at runtime based on a dynamic value. This feature normally works with bind variables; other types of dynamic queries may not use it.
Sample schema
create table sample_table
(
indexed_val varchar2(100),
some_other_column1 varchar2(100),
some_other_column2 varchar2(100),
some_other_column3 varchar2(100)
);
insert into sample_table
select level, level, level, level
from dual
connect by level <= 100000;
create index sample_table_idx1 on sample_table(indexed_val);
begin
dbms_stats.gather_table_stats(user, 'sample_table');
end;
/
Sample query using bind variables
explain plan for
select * from sample_table where :p_filter_type = 'all'
union all
select * from sample_table where :p_filter_type = 'eq' and indexed_val = :p_indexed_val
union all
select * from sample_table where :p_filter_type = 'like' and indexed_val like :p_indexed_val
union all
select * from sample_table where :p_filter_type = 'regexp' and regexp_like(indexed_val, :p_indexed_val);
select * from table(dbms_xplan.display(format => '-cost -bytes -rows'));
Sample plan
This demonstrates vastly different plans being used depending on the input. An equality predicate will use an INDEX RANGE SCAN; no predicate will use a TABLE ACCESS FULL. The regular expression also uses a full table scan, since there is no general way to index regular expressions, although depending on the exact type of expression it may be possible to enable useful indexing through function-based indexes or Oracle Text indexes.
Plan hash value: 100704550
------------------------------------------------------------------------------
| Id | Operation | Name | Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 00:00:01 |
| 1 | UNION-ALL | | |
|* 2 | FILTER | | |
| 3 | TABLE ACCESS FULL | SAMPLE_TABLE | 00:00:01 |
|* 4 | FILTER | | |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| SAMPLE_TABLE | 00:00:01 |
|* 6 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 00:00:01 |
|* 7 | FILTER | | |
| 8 | TABLE ACCESS BY INDEX ROWID BATCHED| SAMPLE_TABLE | 00:00:01 |
|* 9 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 00:00:01 |
|* 10 | FILTER | | |
|* 11 | TABLE ACCESS FULL | SAMPLE_TABLE | 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(:P_FILTER_TYPE='all')
4 - filter(:P_FILTER_TYPE='eq')
6 - access("INDEXED_VAL"=:P_INDEXED_VAL)
7 - filter(:P_FILTER_TYPE='like')
9 - access("INDEXED_VAL" LIKE :P_INDEXED_VAL)
filter("INDEXED_VAL" LIKE :P_INDEXED_VAL)
10 - filter(:P_FILTER_TYPE='regexp')
11 - filter( REGEXP_LIKE ("INDEXED_VAL",:P_INDEXED_VAL))
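As mentioned above, whether an index can help the like/regexp cases depends on the expression; a purely hypothetical sketch, assuming the pattern always anchors on a two-character prefix of indexed_val:
-- Hypothetical function-based index for a prefix-style match on the sample schema:
create index sample_table_idx2 on sample_table (substr(indexed_val, 1, 2));

-- A query written against the same expression can then use an index range scan:
select * from sample_table where substr(indexed_val, 1, 2) = '12';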
(This is more for situation A, but it is also applicable to B in this way ...)
I am now using a hybrid approach (a combination of points 1 and 2 in my question) and actually quite like it, because it also provides good debugging and encapsulation possibilities, and the optimizer does not have to work out a single strategy for what are basically logically separate queries packed into one bigger query (e.g. via internal FILTER rules), which may work well or, at worst, be dramatically less efficient:
using this in the report
select *
from table(my_report_data_func_sql(
:val1,
:val1_filter_type,
:val2
))
where the table function is defined like this
create or replace function my_report_data_func_sql(
p_val1 integer default 1234,
p_val1_filter_type varchar2 default 'eq',
p_val2 varchar2 default null
) return varchar2 is
query varchar2(4000) := '
with params as ( -- *: default param
select
'||p_val1||' p_val1, -- eq*
'''||p_val1_filter_type||''' p_val1_filter_type, -- [eq, all*, like, regexp]
'''||p_val2||''' p_val2 -- null*
from dual
)
select x.*
from
params p -- workaround for standalone-sql-debugging using "with" statement above
join my_report_data_base_view x on (1=1)
where 1=1 -- ease of filter expression adding below
'
-- #### FILTER CRITERIAS are appended here ####
-- val1-filter
||case p_val1_filter_type
when 'eq' then '
and val1 = p_val1
' when 'like' then '
and val1 like p_val1
' when 'regexp' then '
and regexp_like(val1, p_val1)
' else '' end -- all
;
begin
return query;
end;
/
and would produce the following by example:
select *
from table(my_report_data_func_sql(
1234,
'eq',
'someval2'
))
/*
with params as ( -- *: default param
select
1 p_val1, -- eq*
'eq' p_val1_filter_type, -- [eq, all*, like, regexp]
'someval2' p_val2 -- null*
from dual
)
select x.*
from
params p -- workaround for standalone-sql-debugging using "with" statement above
join my_report_data_base_view x on (1=1)
where 1=1 -- ease of filter expression adding below
and val1 = p_val1
*/