There are lots of questions on this topic, but all of them seem to involve more complex cases than the one I'm looking at, and the answers don't seem applicable.
OHDSI=> \d record_counts
Table "results2.record_counts"
Column | Type | Modifiers
------------------------+-----------------------+-----------
concept_id | integer |
schema | text |
table_name | text |
column_name | text |
column_type | text |
descendant_concept_ids | bigint |
rc | numeric |
drc | numeric |
domain_id | character varying(20) |
vocabulary_id | character varying(20) |
concept_class_id | character varying(20) |
standard_concept | character varying(1) |
Indexes:
"rc_dom" btree (domain_id, concept_id)
"rcdom" btree (domain_id)
"rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept)
The table has 3,133,778 records, so Postgres shouldn't be ignoring the index because of small table size.
I filter on domain_id, which is indexed, and the index is ignored:
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on record_counts (cost=0.00..76744.81 rows=2079187 width=87)
Filter: ((domain_id)::text = 'Drug'::text)
I turn off seqscan and:
OHDSI=> set enable_seqscan=false;
SET
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on record_counts (cost=42042.13..105605.97 rows=2079187 width=87)
Recheck Cond: ((domain_id)::text = 'Drug'::text)
-> Bitmap Index Scan on rcdom (cost=0.00..41522.33 rows=2079187 width=0)
Index Cond: ((domain_id)::text = 'Drug'::text)
Indeed, the plan says it's going to be more expensive to use the index than not, but why? If the index lets it handle many fewer records, shouldn't it be quicker to use it?
OK, it looks like Postgres knew what it was doing. The particular value of the indexed column I was using ('Drug') happens to account for 66% of the rows in the table. So, yes, the filter makes the row set significantly smaller, but since those rows are scattered across pages, the index doesn't allow them to be retrieved any faster.
OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc;
domain_id | rows | pct
---------------------+---------+---------
Drug | 2074991 | 66.2137
Condition | 466882 | 14.8984
Observation | 217807 | 6.9503
Procedure | 165800 | 5.2907
Measurement | 127239 | 4.0602
Device | 29410 | 0.9385
Spec Anatomic Site | 28783 | 0.9185
Meas Value | 10415 | 0.3323
Unit | 2350 | 0.0750
Type Concept | 2170 | 0.0692
Provider Specialty | 1957 | 0.0624
Specimen | 1767 | 0.0564
Metadata | 1689 | 0.0539
Revenue Code | 538 | 0.0172
Place of Service | 480 | 0.0153
Race | 467 | 0.0149
Relationship | 242 | 0.0077
Condition/Obs | 182 | 0.0058
Currency | 180 | 0.0057
Condition/Meas | 115 | 0.0037
Route | 81 | 0.0026
Obs/Procedure | 78 | 0.0025
Condition/Device | 52 | 0.0017
Condition/Procedure | 25 | 0.0008
Meas/Procedure | 25 | 0.0008
Gender | 19 | 0.0006
Device/Procedure | 9 | 0.0003
Meas Value Operator | 9 | 0.0003
Visit | 8 | 0.0003
Drug/Procedure | 3 | 0.0001
Spec Disease Status | 3 | 0.0001
Ethnicity | 2 | 0.0001
When I use any other value in the where clause (including 'Condition', with 15% of the rows), Postgres does use the index.
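One way to check this against the planner's statistics (a sketch; it assumes ANALYZE has populated pg_stats recently) is to look at the common values and the physical correlation of the column:
SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'record_counts' AND attname = 'domain_id';
A correlation near zero would confirm that the 'Drug' rows are spread over nearly every page, so an index scan would still touch most of the table, just in random order.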
(Somewhat surprisingly, even after I cluster the table on the domain_id index, it still doesn't use the index when I filter on 'Drug'; but the potential performance improvement from filtering out 34% of the rows doesn't seem worth pursuing further.)
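For reference, the clustering experiment was along these lines (a sketch using the rcdom index from above); note that the planner only sees the improved physical ordering once ANALYZE refreshes the correlation statistic, so it is worth running afterwards:
CLUSTER record_counts USING rcdom;
ANALYZE record_counts;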
I have a PostgreSQL table that stores OHLCV data from an exchange. The table has recently become large and I'm trying to reduce its size by understanding how the different table components contribute to the total size.
This is the table's schema:
+-----------+-----------------------------+-------------------------+
| Column | Type | Modifiers |
|-----------+-----------------------------+-------------------------|
| ts | timestamp without time zone | not null default now() |
| millisecs | bigint | not null |
| open | numeric | not null |
| high | numeric | not null |
| low | numeric | not null |
| close | numeric | not null |
| volume | numeric | not null |
| symbol_id | integer | not null |
+-----------+-----------------------------+-------------------------+
The size of the table without indexes and TOAST data is:
SELECT pg_size_pretty(pg_relation_size('ohlcv'));
+----------------+
| pg_size_pretty |
|----------------|
| 6871 MB |
+----------------+
However, when I get the size of each column and add up the results, I get:
SELECT
pg_size_pretty(
sum(pg_column_size(ts)) +
sum(pg_column_size(millisecs)) +
sum(pg_column_size(open)) +
sum(pg_column_size(high)) +
sum(pg_column_size(low)) +
sum(pg_column_size(close)) +
sum(pg_column_size(volume)) +
sum(pg_column_size(symbol_id))
) FROM ohlcv;
+----------------+
| pg_size_pretty |
|----------------|
| 3769 MB |
+----------------+
This is a fairly large difference in size. I realized NUMERIC is a variable-size type if the precision and scale are not specified, so I figured the difference in sizes is due to column padding. I ran the following query to test my theory:
SELECT
pg_size_pretty((
max(pg_column_size(ts)) +
max(pg_column_size(millisecs)) +
max(pg_column_size(open)) +
max(pg_column_size(high)) +
max(pg_column_size(low)) +
max(pg_column_size(close)) +
max(pg_column_size(volume)) +
max(pg_column_size(symbol_id))
) * count(*)) FROM ohlcv;
+----------------+
| pg_size_pretty |
|----------------|
| 5797 MB |
+----------------+
This query gets the maximum number of bytes for each column and then multiplies that by the number of rows.
The result is closer to the table size but still more than 1 GB smaller.
Finally, there's this query that returns a number somewhere between the two:
SELECT pg_size_pretty(sum(pg_column_size(ohlcv.*))) FROM ohlcv;
+----------------+
| pg_size_pretty |
|----------------|
| 6240 MB |
+----------------+
Can someone help me understand in detail the differences between these sizes? I understand there's overhead per-table, but what accounts for the difference between the results of the last 2 queries?
By the way, I have already vacuumed the table with
VACUUM FULL VERBOSE ohlcv;
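For anyone trying to reconcile the numbers: pg_column_size(ohlcv.*) includes the roughly 24-byte tuple header of each row, which is why the last query returns more than the per-column sum. What it still leaves out is the per-row line pointer and the per-page header. A rough accounting sketch, assuming the default 8 kB block size:
SELECT pg_size_pretty(
         sum(pg_column_size(ohlcv.*))              -- tuples incl. headers and padding
       + count(*) * 4                              -- one 4-byte line pointer per row
       + (pg_relation_size('ohlcv') / 8192) * 24   -- one 24-byte header per page
       ) AS accounted_for
FROM ohlcv;
Whatever is still unaccounted for after that is mostly free space inside the pages.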
HOT and FILLFACTOR Results
I've got some high-UPDATE tables where I've adjusted the FILLFACTOR to 95%, and I'm checking back in on them. I don't think that I've got the settings right, and am unclear how to tune them intelligently. I took another pass through Laurenz Albe's helpful blog post on HOT updates
https://www.cybertec-postgresql.com/en/hot-updates-in-postgresql-for-better-performance/
... and the clear source code README:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT
Below is a query, adapted from the blog post, to check the status of the tables in the system, along with some sample output:
SELECT relname,
n_tup_upd as total_update_count,
n_tup_hot_upd as hot_update_count,
coalesce(div_safe(n_tup_upd, n_tup_hot_upd),0) as total_by_hot,
coalesce(div_safe(n_tup_hot_upd, n_tup_upd),0) as hot_by_total
FROM pg_stat_user_tables
order by 4 desc;
A few results:
relname total_update_count hot_update_count total_by_hot hot_by_total
rollups 369418 128 2886.0781 0.00034649097
q_event 71781 541 132.68207 0.007536813
analytic_scan 2104727 34304 61.35515 0.016298551
clinic 4424 77 57.454544 0.017405063
facility_location 179636 6489 27.683157 0.03612305
target_snapshot 494 18 27.444445 0.036437247
inv 1733021 78234 22.151762 0.045143135
I'm unsure what ratio(s) I'm looking for here. Can anyone advise me how to read these results, or what to read to figure out how to interpret them?
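(div_safe in the query above isn't a built-in; presumably it's a small helper along these lines, sketched here to return NULL instead of erroring on division by zero:)
CREATE OR REPLACE FUNCTION div_safe(num numeric, den numeric)
RETURNS numeric LANGUAGE sql IMMUTABLE AS
$$ SELECT CASE WHEN den = 0 THEN NULL ELSE num / den END $$;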
Are These UPDATEs HOTable?
I didn't address this basic point in the original draft of this question. I checked my patch from a few months back, and I ran ALTER TABLE ... SET (fillfactor = 95) and then VACUUM (FULL, VERBOSE, ANALYZE) on 13 of my tables. (The VERBOSE is in there because I had some tables that couldn't VACUUM due to a months-old process that needed clearing out, and that's how I found the problem. pg_stat_activity is my friend.)
However, at least most of the UPDATEs touch an indexed column...but with an identical value, like 1 = 1, so no change to the value. I've been thinking that that is HOTable. If I'm wrong about that, bummer. If not, I'm mostly hoping to clarify what exactly the goal is for the relationships amongst fillfactor, n_tup_upd, and n_tup_hot_upd.
SELECT relname,
n_tup_upd as total_update_count,
n_tup_hot_upd as hot_update_count,
coalesce(div_safe(n_tup_upd, n_tup_hot_upd),0) as total_by_hot,
coalesce(div_safe(n_tup_hot_upd, n_tup_upd),0) as hot_by_total,
(select value::integer from table_get_options('data',relname) where option = 'fillfactor') as fillfactor_setting
FROM pg_stat_user_tables
WHERE relname IN (
'activity',
'analytic_productivity',
'analytic_scan',
'analytic_sterilizer_load',
'analytic_sterilizer_loadinv',
'analytic_work',
'assembly',
'data_file_info',
'inv',
'item',
'print_job',
'q_event')
order by 4 desc;
Results:
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| relname | total_update_count | hot_update_count | total_divided_by_hot | hot_divided_by_total | fillfactor_setting |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| q_event | 71810 | 553 | 129.85533 | 0.0077008773 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_scan | 2109206 | 34536 | 61.072678 | 0.016373934 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| inv | 1733176 | 78387 | 22.110502 | 0.045227375 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| item | 630586 | 32110 | 19.638306 | 0.05092089 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_sterilizer_loadinv | 76976539 | 5206806 | 14.783831 | 0.06764147 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_work | 8117050 | 608847 | 13.331839 | 0.07500841 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| assembly | 90580 | 7281 | 12.4405985 | 0.08038198 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_sterilizer_load | 19249 | 2997 | 6.422756 | 0.1556964 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| activity | 3795 | 711 | 5.3375525 | 0.18735178 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| analytic_productivity | 106486 | 25899 | 4.1115875 | 0.24321507 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| print_job | 1414 | 388 | 3.6443298 | 0.27439886 | 95 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
| data_file_info | 402086 | 285663 | 1.4075537 | 0.7104525 | 90 |
+-----------------------------+--------------------+------------------+----------------------+----------------------+--------------------+
(I just looked for and found an on-line table generator to help out with this kind of example at https://www.tablesgenerator.com/text_tables. It's a bit awkward to use, but faster than building out monospaced aligned text manually.)
FILLFACTOR and HOT update ratio
Figured I could sort this out a bit by adapting Laurenz Albe's code from https://www.cybertec-postgresql.com/en/hot-updates-in-postgresql-for-better-performance/. All I've done here is make a script that builds out a table with a FILLFACTOR of 10, 20, 30.....100% and then updates it in the same way for each percentage. Every time the table is created, it is populated with 235 records, which are then updated 10 times each. The update sets a non-indexed field back to itself, so no value actually changes:
UPDATE mytable SET val = val;
Below are the results:
+------------+---------------+-------------+--------------+
| FILLFACTOR | total_updates | hot_updates | total_to_hot |
+------------+---------------+-------------+--------------+
| 10 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 20 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 30 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 40 | 2350 | 2350 | 1.00 |
+------------+---------------+-------------+--------------+
| 50 | 2350 | 2223 | 1.06 |
+------------+---------------+-------------+--------------+
| 60 | 2350 | 2188 | 1.07 |
+------------+---------------+-------------+--------------+
| 70 | 2350 | 1883 | 1.25 |
+------------+---------------+-------------+--------------+
| 80 | 2350 | 1574 | 1.49 |
+------------+---------------+-------------+--------------+
| 90 | 2350 | 1336 | 1.76 |
+------------+---------------+-------------+--------------+
| 100 | 2350 | 987 | 2.38 |
+------------+---------------+-------------+--------------+
From this, it seems that when the total_to_hot ratio rises, there may be a benefit to lowering the FILLFACTOR.
Per https://www.postgresql.org/docs/13/monitoring-stats.html, n_tup_upd counts all updates, including HOT updates, and n_tup_hot_upd counts HOT updates only. But there doesn't seem to be a count of "could have been a HOT update, if we hadn't run out of room on the page." That would be great, but it also seems like a lot to ask for. (And maybe more expensive to track than can be justified?)
Here is the script. I edited and re-ran the test with each FILLFACTOR.
-- Set up the table for the test
DROP TABLE IF EXISTS mytable;
CREATE TABLE mytable (
id integer PRIMARY KEY,
val integer NOT NULL
) WITH (autovacuum_enabled = off);
-- Change the FILLFACTOR. The default is 100.
ALTER TABLE mytable SET (fillfactor = 10); -- The only part that changes between runs.
-- Seed the data
INSERT INTO mytable
SELECT *, 0
FROM generate_series(1, 235) AS n;
-- Thrash the data
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
UPDATE mytable SET val = val;
SELECT pg_sleep(1);
-- How does it look?
SELECT n_tup_upd as total_updates,
n_tup_hot_upd as hot_updates,
div_safe(n_tup_upd, n_tup_hot_upd) as total_to_hot
FROM pg_stat_user_tables
WHERE relname = 'mytable';
Checking the FILLFACTOR Setting
As a side-note, I wanted a quick call to check the FILLFACTOR setting on a table, and it turned out to be more involved than I thought. I wrote up a function that works, but could likely see some improvements...if anyone has suggestions. I call it like this:
select * from table_get_options('foo', 'bar');
or
select * from table_get_options('foo','bar') where option = 'fillfactor';
Here's the code, if anyone has improvements to offer:
CREATE OR REPLACE FUNCTION dba.table_get_options(text,text)
RETURNS TABLE (
schema_name text,
table_name text,
option text,
value text
)
LANGUAGE SQL AS
$BODY$
WITH
packed_options AS (
select pg_class.relname as table_name,
btrim(pg_options_to_table(pg_class.reloptions)::text, '()') as option_kvp -- Convert to text (fillfactor,95), then strip off ( and )
from pg_class
join pg_namespace
on pg_namespace.oid = pg_class.relnamespace
where pg_namespace.nspname = $1
and relname = $2
and reloptions is not null
),
unpacked_options AS (
select $1 as schema_name,
$2 as table_name,
split_part(option_kvp, ',', 1) as option,
split_part(option_kvp, ',', 2) as value
from packed_options
)
select * from unpacked_options;
$BODY$;
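One possible simplification (a sketch producing the same output shape; 'data' and 'mytable' are placeholder names): pg_options_to_table already returns option_name and option_value columns, so the cast-to-text-and-btrim step can be dropped with a LATERAL join:
SELECT n.nspname      AS schema_name,
       c.relname      AS table_name,
       o.option_name  AS option,
       o.option_value AS value
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
CROSS JOIN LATERAL pg_options_to_table(c.reloptions) o
WHERE n.nspname = 'data'      -- placeholder schema
  AND c.relname = 'mytable';  -- placeholder table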
The numbers show that your strategy is not working, and the overwhelming majority of updates are not HOT. You also show the reason: Even if you update an indexed column to the original value, you won't get a HOT update.
The solution would be to differentiate by including the indexed column in the UPDATE statement only if it is really modified.
A fillfactor of 95 is also pretty high, unless you have tables with really small rows. Perhaps you would get better results with a setting like 90 or 85.
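To illustrate the suggested differentiation (a sketch with hypothetical column names; indexed_col stands in for a column that is indexed, and inv is one of the tables above):
-- always writing the indexed column back, even unchanged, prevents HOT:
UPDATE inv SET status = 'done', indexed_col = indexed_col WHERE id = 42;
-- leaving it out of the statement when it does not change keeps HOT possible:
UPDATE inv SET status = 'done' WHERE id = 42;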
I am building a Sphinx index on a table with ~90,000,000 rows. Full-text search must be done on a varchar field called email. I also set parent_id as an attribute.
When I run queries searching for emails that match words with a small number of hits, they return immediately:
mysql> SELECT count(*) FROM users WHERE MATCH('diedsmiling');
+----------+
| count(*) |
+----------+
| 26 |
+----------+
1 row in set (0.00 sec)
mysql> show meta;
+---------------+-------------+
| Variable_name | Value |
+---------------+-------------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | diedsmiling |
| docs[0] | 26 |
| hits[0] | 26 |
+---------------+-------------+
6 rows in set (0.00 sec)
Things get complicated when I search for emails that match words with a large number of hits:
mysql> SELECT count(*) FROM users WHERE MATCH('mail');
+----------+
| count(*) |
+----------+
| 33237994 |
+----------+
1 row in set (9.21 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| total | 1 |
| total_found | 1 |
| time | 9.210 |
| keyword[0] | mail |
| docs[0] | 33237994 |
| hits[0] | 33253762 |
+---------------+----------+
6 rows in set (0.00 sec)
Filtering on the parent_id attribute doesn't help either:
mysql> SELECT count(*) FROM users WHERE MATCH('mail') AND parent_id = 62003;
+----------+
| count(*) |
+----------+
| 21404 |
+----------+
1 row in set (8.66 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| total | 1 |
| total_found | 1 |
| time | 8.666 |
| keyword[0] | mail |
| docs[0] | 33237994 |
| hits[0] | 33253762 |
Here are my Sphinx configs:
source src1
{
type = mysql
sql_host = HOST
sql_user = USER
sql_pass = PASS
sql_db = DATABASE
sql_port = 3306 # optional, default is 3306
sql_query = \
SELECT id, parent_id, email \
FROM users
sql_attr_uint = parent_id
}
index test1
{
source = src1
path = /var/lib/sphinx/test1
}
The query that I need to run looks like:
SELECT * FROM users WHERE MATCH('mail') AND parent_id = 62003;
I need to get all emails that match a certain word and have a certain parent_id.
My questions are:
Is there a way to optimize the situation described above? Maybe there is a more suitable matching mode for this type of query?
If I migrate to a server with SSD disks, will the performance gain be significant?
To get just the count, you can run:
SELECT id FROM index WHERE MATCH(...) LIMIT 0 OPTION ranker=none; SHOW META;
and read the count from total_found. That will be much more efficient than count(*), which invokes a GROUP BY. Or, if you are only searching single words, even CALL KEYWORDS('word', 'index', 1);
In a PostgreSQL 9.4.0 database I have a busy table with 22 indexes which are larger than the actual data in the table.
Since most of these indexes are for columns which are almost entirely NULL, I've been trying to replace some of them with partial indexes.
One of the columns is: auto_decline_at timestamp without time zone. This has 5,453,085 NULLs out of a total of 5,457,088 rows.
The partial index replacement is being used, but according to the stats, the old index is also still in use, so I am afraid to drop it.
Selecting from pg_tables I see:
tablename | indexname | num_rows | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched
-----------+---------------------------------------+-------------+------------+------------+--------+-----------------+-------------+----------------
jobs | index_jobs_on_auto_decline_at | 5.45496e+06 | 1916 MB | 3123 MB | N | 17056009 | 26506058607 | 26232155810
jobs | index_jobs_on_auto_decline_at_partial | 5.45496e+06 | 1916 MB | 120 kB | N | 6677 | 26850779 | 26679802
And a few minutes later:
tablename | indexname | num_rows | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched
-----------+---------------------------------------+-------------+------------+------------+--------+-----------------+-------------+----------------
jobs | index_jobs_on_auto_decline_at | 5.45496e+06 | 1916 MB | 3124 MB | N | 17056099 | 26506058697 | 26232155900
jobs | index_jobs_on_auto_decline_at_partial | 5.45496e+06 | 1916 MB | 120 kB | N | 6767 | 27210639 | 27039623
So number_of_scans is increasing for both of them.
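(For reference, the raw counters can be sampled directly from pg_stat_user_indexes, e.g.:)
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'jobs'
  AND indexrelname IN ('index_jobs_on_auto_decline_at',
                       'index_jobs_on_auto_decline_at_partial');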
The index definitions:
"index_jobs_on_auto_decline_at" btree (auto_decline_at)
"index_jobs_on_auto_decline_at_partial" btree (auto_decline_at) WHERE auto_decline_at IS NOT NULL
The only relevant query I can see in my logs follows this pattern:
SELECT "jobs".* FROM "jobs" WHERE (jobs.pending_destroy IS NULL OR jobs.pending_destroy = FALSE) AND "jobs"."deleted_at" IS NULL AND (state = 'assigned' AND auto_decline_at IS NOT NULL AND auto_decline_at < '2015-08-17 06:57:22.325324')
EXPLAIN ANALYSE gives me the following plan, which uses the partial index as expected:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using index_jobs_on_auto_decline_at_partial on jobs (cost=0.28..12.27 rows=1 width=648) (actual time=22.143..22.143 rows=0 loops=1)
Index Cond: ((auto_decline_at IS NOT NULL) AND (auto_decline_at < '2015-08-17 06:57:22.325324'::timestamp without time zone))
Filter: (((pending_destroy IS NULL) OR (NOT pending_destroy)) AND (deleted_at IS NULL) AND ((state)::text = 'assigned'::text))
Rows Removed by Filter: 3982
Planning time: 2.731 ms
Execution time: 22.179 ms
(6 rows)
My questions:
Why is index_jobs_on_auto_decline_at still being used?
Could this same query sometimes use index_jobs_on_auto_decline_at, or is there likely to be another query I am missing?
Is there a way to log the queries which are using index_jobs_on_auto_decline_at?
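One low-tech way to approach the last question (a sketch; it assumes the pg_stat_statements extension is installed and preloaded via shared_preload_libraries) is to list the workload's most frequent statements touching the column and EXPLAIN the candidates one by one:
SELECT calls, query
FROM pg_stat_statements
WHERE query ILIKE '%auto_decline_at%'
ORDER BY calls DESC
LIMIT 20;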
More simply put than the details below: suppose one has one or more query parameters (or report / table function parameters), e.g. x_id, that are performance-crucial (e.g. some primary key index can be used) and that, depending on the use case, report filters applied, etc., may be one of
null
an exact match (e.g. some unique id)
a like expression
or even a regexp expression
then, if all these possibilities are coded into a single query, as far as I can see the optimizer will
generate a single static plan, independent of the actual runtime parameter values
and thus cannot be relied upon to use some index on x_id, even though e.g. an exact match would allow it
Are there other ways to handle this than to
let some PL/SQL code choose from n predefined, use-case-optimized queries/views? (The number of these can grow quite large, the more such flexible parameters one has.)
or build some manually string-constructed and dynamically compiled query?
Basically I have two slightly different use cases/questions as documented and executable below:
A - select * from tf_sel
B - select * from data_union
which could potentially be solved via SQL hints or using some other trick.
To speed these queries up, I am currently separating the "merged queries" at a certain implementation level (a table function), which is quite cumbersome and harder to maintain, but ensures the queries run fast thanks to their better execution plans.
As I see it, the main problem is the static nature of the optimizer's SQL plan: it is always the same, although it could be much more efficient if it considered some "query-time-constant" filter parameters.
with
-- Question A: What would be a good strategy to make tf_sel with tf_params nearly as fast as query_use_case_1_eq
-- which actually provides the same result?
--
-- - a complex query should be used in various reports with filters
-- - we want to keep as much as possible filter functionality on the db side (not the report engine side)
-- to be able to utilize the fast and efficient db engine and for loosely coupled software design
complex_query as ( -- just some imaginable complex query with a lot of table/view joins, aggregation/analytical functions etc.
select 1 as id, 'ab12' as indexed_val, 'asdfasdf' x from dual
union all select 2, 'ab34', 'a uiop345' from dual
union all select 3, 'xy34', 'asdf 0u0duaf' from dual
union all select 4, 'xy55', ' asdja´sf asd' from dual
)
-- <<< comment the following lines in to test it with the above
-- , query_use_case_1_eq as ( -- quite fast and maybe the 95% use case
-- select * from complex_query where indexed_val = 'ab12'
-- )
--select * from query_use_case_1_eq
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- --------
-- 1 ab12 asdfasdf
-- <<< comment the following lines in to test it with the above
-- , query_use_case_2_all as ( -- significantly slower due to a lot of underlying calculations
-- select * from complex_query
-- )
--select * from query_use_case_2_all
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- -------------
-- 1 ab12 asdfasdf
-- 2 ab34 a uiop345
-- 3 xy34 asdf 0u0duaf
-- 4 xy55 asdja´sf asd
-- <<< comment the following lines in to test it with the above
-- , query_use_case_3_like as (
-- select * from complex_query where indexed_val like 'ab%'
-- )
--select * from query_use_case_3_like
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- ---------
-- 1 ab12 asdfasdf
-- 2 ab34 a uiop345
-- <<< comment the following lines to simulate the table function
, tf_params as ( -- table function params: imagine we have a table function where these are passed depending on the report
select 'ab12' p_indexed_val, 'eq' p_filter_type from dual
)
, tf_sel as ( -- table function select: nicely integrating all query possibilities, but being veeery slow :-(
select q.*
from
tf_params p -- just here so this example works without the need for the actual function
join complex_query q on (1=1)
where
p_filter_type = 'all'
or (p_filter_type = 'eq' and indexed_val = p_indexed_val)
or (p_filter_type = 'like' and indexed_val like p_indexed_val)
or (p_filter_type = 'regexp' and regexp_like(indexed_val, p_indexed_val))
)
-- actually we would pass the tf_params above if it were a real table function
select * from tf_sel
-- >>>
-- ID INDEXED_VAL X
-- -- ----------- --------
-- 1 ab12 asdfasdf
-- Question B: How can we speed up data_union with dg_filter to be as fast as the data_group1 query which
-- actually provides the same result?
--
-- A very similar approach is considered in other scenarios where we like to join the results of
-- different queries (>5) returning joinable data and being filtered based on the same parameters.
-- <<< comment the following lines to simulate the union problem
-- , data_group1 as ( -- may run quite fast
-- select 'dg1' dg_id, q.* from complex_query q where x < 'a' -- just an example returning some special rows that should be filtered later on!
-- )
--
-- , data_group2 as ( -- may run quite fast
-- select 'dg2' dg_id, q.* from complex_query q where instr(x,'p') >= 0 -- just an example returning some special rows that should be filtered later on!
-- )
--
--
-- , dg_filter as ( -- may be set by a report or indirectly by user filters
-- select 'dg1' dg_id from dual
-- )
--
-- , data_union as ( -- runs much slower due to another execution plan
-- select * from (
-- select * from data_group1
-- union all select * from data_group2
-- )
-- where dg_id in (select dg_id from dg_filter)
-- )
--
--select * from data_union
-- >>>
-- DG_ID ID INDEXED_VAL X
-- ----- -- ----------- -------------
-- dg1 4 xy55 asdja´sf asd
This is a comment on the sample code and answer provided by jonearles.
Actually, your answer mixed up my use cases A and B (which are unrelated, although they occur together in certain scenarios). It's nevertheless essential that you mentioned the optimizer has dynamic FILTER and maybe other capabilities.
use case B ("data partition/group union")
Actually use case B (based on your sample table) looks more like this, but I still have to check for the performance issue in the real scenario. Maybe you can see some problems with it already?
select * from (
select 'dg1' data_group, x.* from sample_table x
where mod(to_number(some_other_column1), 100000) = 0 -- just some example restriction
--and indexed_val = '3635' -- commenting this in and executing this standalone returns:
----------------------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 1 | 23 | 2 (0)|
--| 1 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 2 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
----------------------------------------------------------------------------------------
union all
select 'dg2', x.* from sample_table x
where mod(to_number(some_other_column2), 9999) = 0 -- just some example restriction
union all
select 'dg3', x.* from sample_table x
where mod(to_number(some_other_column3), 3635) = 0 -- just some example restriction
)
where data_group in ('dg1') and indexed_val = '35'
-------------------------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 3 | 639 | 2 (0)|
--| 1 | VIEW | | 3 | 639 | 2 (0)|
--| 2 | UNION-ALL | | | | |
--| 3 | TABLE ACCESS BY INDEX ROWID | SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 4 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
--| 5 | FILTER | | | | |
--| 6 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 7 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
--| 8 | FILTER | | | | |
--| 9 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 10 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
-------------------------------------------------------------------------------------------
use case A (filtering by column query type)
Based on your sample table, this is more like what I want to do.
As you can see, the query with just the fast where p.ft_id = 'eq' and x.indexed_val = p.val shows the index usage, but having all the different filter options in the where clause causes the plan to switch to always using a full table scan :-/
(Even if I use :p_filter_type and :p_indexed_val_filter everywhere in the SQL rather than just in the one spot I put them, it doesn't change.)
with
filter_type as (
select 'all' as id from dual
union all select 'eq' as id from dual
union all select 'like' as id from dual
union all select 'regexp' as id from dual
)
, params as (
select
(select * from filter_type where id = :p_filter_type) as ft_id,
:p_indexed_val_filter as val
from dual
)
select *
from params p
join sample_table x on (1=1)
-- the following with the above would show the 'eq' use case with a fast index scan (plan id 14/15)
--where p.ft_id = 'eq' and x.indexed_val = p.val
------------------------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 1 | 23 | 12 (0)|
--| 1 | VIEW | | 4 | 20 | 8 (0)|
--| 2 | UNION-ALL | | | | |
--| 3 | FILTER | | | | |
--| 4 | FAST DUAL | | 1 | | 2 (0)|
--| 5 | FILTER | | | | |
--| 6 | FAST DUAL | | 1 | | 2 (0)|
--| 7 | FILTER | | | | |
--| 8 | FAST DUAL | | 1 | | 2 (0)|
--| 9 | FILTER | | | | |
--| 10 | FAST DUAL | | 1 | | 2 (0)|
--| 11 | FILTER | | | | |
--| 12 | NESTED LOOPS | | 1 | 23 | 4 (0)|
--| 13 | FAST DUAL | | 1 | | 2 (0)|
--| 14 | TABLE ACCESS BY INDEX ROWID| SAMPLE_TABLE | 1 | 23 | 2 (0)|
--| 15 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 1 | | 1 (0)|
--| 16 | VIEW | | 4 | 20 | 8 (0)|
--| 17 | UNION-ALL | | | | |
--| 18 | FILTER | | | | |
--| 19 | FAST DUAL | | 1 | | 2 (0)|
--| 20 | FILTER | | | | |
--| 21 | FAST DUAL | | 1 | | 2 (0)|
--| 22 | FILTER | | | | |
--| 23 | FAST DUAL | | 1 | | 2 (0)|
--| 24 | FILTER | | | | |
--| 25 | FAST DUAL | | 1 | | 2 (0)|
------------------------------------------------------------------------------------------
where
--mod(to_number(some_other_column1), 3000) = 0 and -- just some example restriction
(
p.ft_id = 'all'
or
p.ft_id = 'eq' and x.indexed_val = p.val
or
p.ft_id = 'like' and x.indexed_val like p.val
or
p.ft_id = 'regexp' and regexp_like(x.indexed_val, p.val)
)
-- with the full flexibility of the filter the plan shows a full table scan (plan id 13) :-(
--------------------------------------------------------------------------
--| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------
--| 0 | SELECT STATEMENT | | 1099 | 25277 | 115 (3)|
--| 1 | VIEW | | 4 | 20 | 8 (0)|
--| 2 | UNION-ALL | | | | |
--| 3 | FILTER | | | | |
--| 4 | FAST DUAL | | 1 | | 2 (0)|
--| 5 | FILTER | | | | |
--| 6 | FAST DUAL | | 1 | | 2 (0)|
--| 7 | FILTER | | | | |
--| 8 | FAST DUAL | | 1 | | 2 (0)|
--| 9 | FILTER | | | | |
--| 10 | FAST DUAL | | 1 | | 2 (0)|
--| 11 | NESTED LOOPS | | 1099 | 25277 | 115 (3)|
--| 12 | FAST DUAL | | 1 | | 2 (0)|
--| 13 | TABLE ACCESS FULL| SAMPLE_TABLE | 1099 | 25277 | 113 (3)|
--| 14 | VIEW | | 4 | 20 | 8 (0)|
--| 15 | UNION-ALL | | | | |
--| 16 | FILTER | | | | |
--| 17 | FAST DUAL | | 1 | | 2 (0)|
--| 18 | FILTER | | | | |
--| 19 | FAST DUAL | | 1 | | 2 (0)|
--| 20 | FILTER | | | | |
--| 21 | FAST DUAL | | 1 | | 2 (0)|
--| 22 | FILTER | | | | |
--| 23 | FAST DUAL | | 1 | | 2 (0)|
--------------------------------------------------------------------------
Several features enable the optimizer to produce dynamic plans. The most common is the FILTER operation, which should not be confused with filter predicates. A FILTER operation allows Oracle to enable or disable part of the plan at runtime based on a dynamic value. This feature normally works with bind variables; other types of dynamic queries may not use it.
Sample schema
create table sample_table
(
indexed_val varchar2(100),
some_other_column1 varchar2(100),
some_other_column2 varchar2(100),
some_other_column3 varchar2(100)
);
insert into sample_table
select level, level, level, level
from dual
connect by level <= 100000;
create index sample_table_idx1 on sample_table(indexed_val);
begin
dbms_stats.gather_table_stats(user, 'sample_table');
end;
/
Sample query using bind variables
explain plan for
select * from sample_table where :p_filter_type = 'all'
union all
select * from sample_table where :p_filter_type = 'eq' and indexed_val = :p_indexed_val
union all
select * from sample_table where :p_filter_type = 'like' and indexed_val like :p_indexed_val
union all
select * from sample_table where :p_filter_type = 'regexp' and regexp_like(indexed_val, :p_indexed_val);
select * from table(dbms_xplan.display(format => '-cost -bytes -rows'));
Sample plan
This demonstrates vastly different plans being used depending on input. A single = uses an INDEX RANGE SCAN; no predicate uses a TABLE ACCESS FULL. The regular expression also uses a full table scan, since there is no way to index regular expressions, although depending on the exact type of expressions it may be possible to enable useful indexing through function-based indexes or Oracle Text indexes.
Plan hash value: 100704550
------------------------------------------------------------------------------
| Id | Operation | Name | Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 00:00:01 |
| 1 | UNION-ALL | | |
|* 2 | FILTER | | |
| 3 | TABLE ACCESS FULL | SAMPLE_TABLE | 00:00:01 |
|* 4 | FILTER | | |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| SAMPLE_TABLE | 00:00:01 |
|* 6 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 00:00:01 |
|* 7 | FILTER | | |
| 8 | TABLE ACCESS BY INDEX ROWID BATCHED| SAMPLE_TABLE | 00:00:01 |
|* 9 | INDEX RANGE SCAN | SAMPLE_TABLE_IDX1 | 00:00:01 |
|* 10 | FILTER | | |
|* 11 | TABLE ACCESS FULL | SAMPLE_TABLE | 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(:P_FILTER_TYPE='all')
4 - filter(:P_FILTER_TYPE='eq')
6 - access("INDEXED_VAL"=:P_INDEXED_VAL)
7 - filter(:P_FILTER_TYPE='like')
9 - access("INDEXED_VAL" LIKE :P_INDEXED_VAL)
filter("INDEXED_VAL" LIKE :P_INDEXED_VAL)
10 - filter(:P_FILTER_TYPE='regexp')
11 - filter( REGEXP_LIKE ("INDEXED_VAL",:P_INDEXED_VAL))
(This applies more to situation A, but is applicable to B in the same way.)
I am now using a hybrid approach (a combination of points 1 and 2 in my question) and actually quite like it, because it also provides good debugging and encapsulation possibilities, and the optimizer does not have to find the best strategy for what are basically logically separate queries merged into one bigger query (e.g. via internal FILTER rules), which may work out fine or, at worst, be incredibly inefficient:
Using this in the report:
select *
from table(my_report_data_func_sql(
:val1,
:val1_filter_type,
:val2
))
where the table function is defined like this
create or replace function my_report_data_func_sql(
p_val1 integer default 1234,
p_val1_filter_type varchar2 default 'eq',
p_val2 varchar2 default null
) return varchar2 is
query varchar2(4000) := '
with params as ( -- *: default param
select
'||p_val1||' p_val1, -- eq*
'''||p_val1_filter_type||''' p_val1_filter_type, -- [eq, all*, like, regexp]
'''||p_val2||''' p_val2 -- null*
from dual
)
select x.*
from
params p -- workaround for standalone-sql-debugging using "with" statement above
join my_report_data_base_view x on (1=1)
where 1=1 -- ease of filter expression adding below
'
-- #### FILTER CRITERIAS are appended here ####
-- val1-filter
||case p_val1_filter_type
when 'eq' then '
and val1 = p_val1
' when 'like' then '
and val1 like p_val1
' when 'regexp' then '
and regexp_like(val1, p_val1)
' else '' end -- all
;
begin
return query;
end;
/
and would produce the following by example:
select *
from table(my_report_data_func_sql(
1234,
'eq',
'someval2'
))
/*
with params as ( -- *: default param
select
1234 p_val1, -- eq*
'eq' p_val1_filter_type, -- [eq, all*, like, regexp]
'someval2' p_val2 -- null*
from dual
)
select x.*
from
params p -- workaround for standalone-sql-debugging using "with" statement above
join my_report_data_base_view x on (1=1)
where 1=1 -- ease of filter expression adding below
and val1 = p_val1
*/
*/