Basically, I have a temp table and I am populating the table with the same data using different techniques in order to find the fastest one in my case. The three methods are:
1. select inserted information using joins
2. select inserted information where most of the tables/logic/calculations are included in an inline function
3. select inserted information where most of the tables/logic/calculations are included in a table-valued function
With each method the table is populated with the same data, and I get the best performance using the table-valued function. But here things get strange.
After the temp table is populated, a simple SELECT is run over it, with GROUP BY and ORDER BY on all columns. Because the data is the same, I expected the same execution plan, but I get this:
Where the first row is the execution plan for the table-valued function and the second one is the execution plan for the first and the second method.
Why do I get two different execution plans for a table with the same data? Why is it not always using the first one, as it is faster than the second one?
Note: as this is connected with the data being sorted and grouped, I supposed that with the table-valued function the data might already be sorted, but a simple select of the results shows that the data is sorted in the same way in each case.
There is more to execution plans than data
Take a look at Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code
Something like this is not SARGable, and a sub-optimal plan will be created
where year(payment_dt) = year(getDate())
and month(payment_dt) = month(getDate())
The optimizer will create an optimal plan for
where payment_dt >= dateadd(mm, datediff(mm, 0, getdate())+0, 0)
and payment_dt < dateadd(mm, datediff(mm, 0, getdate())+1, 0)
Those two will return the same rows, but the second will use a seek and the first a scan, because the first wraps the column in functions.
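The same effect can be reproduced in a small, self-contained sketch. This uses Python's built-in sqlite3 module as a stand-in, not SQL Server, so the plan wording differs (SQLite reports SCAN and SEARCH rather than scan and seek). The payments table, the ix_payment_dt index and the sample dates are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, payment_dt TEXT)")
conn.execute("CREATE INDEX ix_payment_dt ON payments (payment_dt)")
conn.executemany(
    "INSERT INTO payments (payment_dt) VALUES (?)",
    [(f"2015-{m:02d}-{d:02d}",) for m in range(1, 13) for d in range(1, 29)],
)

def plan(sql, params):
    # The last column of EXPLAIN QUERY PLAN holds the human-readable step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

# Function wrapped around the column: the index cannot be used for a range lookup.
wrapped_sql = "SELECT id FROM payments WHERE strftime('%Y-%m', payment_dt) = ?"
# Bare column compared against a half-open range: the index can be searched.
ranged_sql = "SELECT id FROM payments WHERE payment_dt >= ? AND payment_dt < ?"
wrapped_args = ("2015-03",)
ranged_args = ("2015-03-01", "2015-04-01")

wrapped_plan = plan(wrapped_sql, wrapped_args)
ranged_plan = plan(ranged_sql, ranged_args)
wrapped_ids = sorted(r[0] for r in conn.execute(wrapped_sql, wrapped_args))
ranged_ids = sorted(r[0] for r in conn.execute(ranged_sql, ranged_args))

print(wrapped_plan)  # a SCAN step
print(ranged_plan)   # a SEARCH step using ix_payment_dt
```

Both queries return the same ids; only the access path differs.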
Related
I have a multi tenant environment where each tenant (customer) has its own schema to isolate their data. Not ideal I know, but it was a quick port of a legacy system.
Each tenant has a "reading" table, with a composite index of 4 columns:
site_code char(8), location_no int, sensor_no int, reading_dtm timestamptz.
When a new reading is added, a function is called which first checks if there has already been a reading in the last minute (for the same site_code.location_no.sensor_no):
IF EXISTS (
SELECT
FROM reading r
WHERE r.site_code = p_site_code
AND r.location_no = p_location_no
AND r.sensor_no = p_sensor_no
AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
)
THEN
RETURN;
END IF;
Now, bear in mind there are many tenants, all behaving fine except one. In one of the tenants, the call is taking nearly half a second rather than the usual few milliseconds because it is doing a sequential scan on a table with nearly 2 million rows instead of an index scan.
My random_page_cost is set to 1.5.
I could understand a sequential scan if the query were possibly returning many rows, but it is only checking for the existence of any.
I've tried ANALYZE on the table, VACUUM FULL, etc but it makes no difference.
If I put "SET LOCAL enable_seqscan = off" before the query, it works perfectly. It feels wrong, but it will have to be a temporary solution as this is a live system and it needs to work.
What else can I do to help Postgres make what is clearly the better decision of using the index?
EDIT: If I do a similar query manually (outside of a function) it chooses an index.
My guess is that the engine is evaluating the predicate and considers it not selective enough (thinks too many rows will be returned), so it decides to use a table scan instead.
I would do two things:
Make sure you have the correct index in place:
create index ix1 on reading (site_code, location_no,
sensor_no, reading_dtm);
Trick the optimizer by making the selectivity look better. You can do that by adding the extra [redundant] predicate and r.reading_dtm < :p_reading_dtm:
select 1
from reading r
where r.site_code = :p_site_code
and r.location_no = :p_location_no
and r.sensor_no = :p_sensor_no
and r.reading_dtm > :p_reading_dtm - interval '1 minute'
and r.reading_dtm < :p_reading_dtm
I have created a materialized view for the purposes of feeding into a dashboard.
My goal is to make this table selectable in the fastest way possible and I'm not sure how to approach it. I was hoping that if I describe the table and how it will be used, someone could offer some direction.
The context is a website with funnel steps. Each row is an instance of a user triggering a funnel step such as add to cart, checkout, payment details and then finally transaction.
Since the table is for the purposes of analytics, it will be refreshed automatically with cron once a day only, in the morning, so I'm not worried about real time update speed, only select speed with various where clauses.
Suppose I have the fields described below:
(N = ~13M and expected to be ~20M by January, growing several million per month)
Table is unique with the combination of session id, user id and funnel step.
- Session Id (Id, so some duplication but generally very very granular - Varchar)
- User Id (Id, so some duplication but generally very very granular - Varchar)
- Date (Date)
- Funnel Step (10 distinct value - Varchar)
- Device Category (3 distinct values - Varchar)
- Country (~ 100 distinct values - varchar)
- City (~1000+ distinct values - varchar)
- Source (several thousand distinct values, nevertheless, stakeholder would like a filter - varchar)
Would I index each field individually? Or, should I index all fields in a oner? Per the documentation, I think I can index up to 32 fields at once. But would that be advisable here given my primary goal of select query speed over everything else?
The table will feed into dashboard that reads the table and dynamically translates filter inputs into where clauses. Each time the user adjusts a filter, the table will be read and grouped and aggregated based on the filter / where clause inputs.
Example query:
select
event_action,
count(distinct user_id) as users
from website_data.ecom_funnel
where date >= $input_start_date
and date <= $input_end_date
and device_category in ($mobile, $desktop, $tablet)
and country in ($list of all countries minus any not selected)
and source in ($list of all sources minus any not selected)
group by 1 order by users desc
This will result in a funnel shaped table of data.
I cannot aggregate beforehand because the primary metric of concern is users, not sessions. These must be de-duped from the underlying table. Classic example: suppose a person visits a website once a day for a week. Then the number of unique visitors for that week is 1, but if I summed visitors by day I would get 7. It is similar with my table: some users take multiple sessions to complete the funnel. This is why I cannot pre-aggregate the table, since I need to apply filters to the underlying data and then count(distinct user id).
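The arithmetic in that example can be checked with a tiny sketch, here using Python's built-in sqlite3 module. The visits table and its two columns are hypothetical, standing in for the real funnel table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, visit_date TEXT)")
# One user visiting once a day for seven days.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("u1", f"2024-01-{d:02d}") for d in range(1, 8)],
)

# De-duplicating over the whole week.
weekly_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM visits"
).fetchone()[0]

# Summing a per-day pre-aggregate counts the same user once per day.
summed_daily = conn.execute(
    "SELECT SUM(daily_users) FROM ("
    " SELECT COUNT(DISTINCT user_id) AS daily_users"
    " FROM visits GROUP BY visit_date)"
).fetchone()[0]

print(weekly_users, summed_daily)  # 1 7
```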
Here's explain on a subset of fields if it is useful:
QUERY PLAN
Sort (cost=862194.66..862194.68 rows=9 width=24)
Sort Key: (count(DISTINCT client_id)) DESC
-> GroupAggregate (cost=847955.01..862194.51 rows=9 width=24)
Group Key: event_action
-> Sort (cost=847955.01..852701.48 rows=1898589 width=37)
Sort Key: event_action
-> Seq Scan on ecom_funnel (cost=0.00..589150.14 rows=1898589 width=37)
Filter: ((device_category = ANY ('{mobile,desktop}'::text[])) AND (source = 'google'::text))
My overarching, specific question is, given my use case, should I index each field individually or should I create one single index? Does it matter?
On top of that, any tips for optimising this materialized view to run a select query faster would be appreciated.
Looking at your filter conditions, you should check the cardinality of the device_category field by running
select device_category, count(*) from website_data.ecom_funnel group by device_category
and looking at the values to determine if an index should firstly include this column. Possible index here (without knowing the cardinality) would be multicolumn and include:
(device_category, date)
That said, there's no benefit from creating indexes on each separate column, as your query wouldn't use them all, so it does matter. You would also slow down the CRUD operations that aren't reads.
Creating an index on all columns probably won't speed it up much either, but that depends on the data in the table and on how your filters compare to the overall query without them (cardinality of values in the columns being filtered). It would most likely create a huge overhead of going through the index tree and then obtaining rowids to return the data you need.
Summing up, I would try to narrow the index down to the columns that matter most in your filtering, meaning those that cut out most of the data being retrieved. If your query is meant to return the majority of rows from the table, then unfortunately you would need to pre-aggregate, as an index wouldn't speed things up.
Hope it helps.
Edit: I've just read that you already posted count of distinct values among your table. I'm not sure what Funnel Step is bound to in your table, but assuming it's a column named event_action, it might be beneficial to instead create an index that would help in grouping as well by doing:
(date, event_action)
It seems like you have omitted the GROUP BY clause at all, which should be included and it should be grouping by event_action, since that's what your select part is doing.
If you narrow the date down to several days/months every time you perform a select query, it might be a huge benefit to create an index with date as the first column.
Remember that the position of a column in an index matters.
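A rough way to see why column position matters is to compare plans for a filter on the leading column against a filter on the second column only. The sketch below uses Python's built-in sqlite3 module as a stand-in (the question is about a different database, whose planner may behave differently). The column name event_date and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ecom_funnel (event_date TEXT, event_action TEXT, user_id TEXT)"
)
# Composite index with the date column first.
conn.execute(
    "CREATE INDEX ix_date_action ON ecom_funnel (event_date, event_action)"
)
conn.executemany(
    "INSERT INTO ecom_funnel VALUES (?, ?, ?)",
    [
        (f"2024-01-{d:02d}", action, f"u{d}")
        for d in range(1, 29)
        for action in ("add_to_cart", "checkout", "transaction")
    ],
)

def first_step(sql, params):
    # Return the human-readable text of the first plan step.
    return conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchone()[-1]

# Filter on the leading column: the index can be searched by range.
by_date = first_step(
    "SELECT event_action FROM ecom_funnel"
    " WHERE event_date >= ? AND event_date <= ?",
    ("2024-01-05", "2024-01-10"),
)
# Filter on the second column only: every index entry has to be visited.
by_action = first_step(
    "SELECT event_date FROM ecom_funnel WHERE event_action = ?",
    ("checkout",),
)

print(by_date)    # a SEARCH step on ix_date_action
print(by_action)  # a SCAN step
```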
If you look for values from several months let's say, you should preaggregate and store precalculated values from each month in another table and then UNION ALL that data to the current query which would only select data from current (still being updated) time.
I am new to using CTEs but I work with a humongous database and think they would cause less stress to the system than subqueries. I'm not sure if what I want to do is possible.
I have 2 CTEs with different columns from different tables but each CTE has the same sample_num (same data type of int) in them that could be used to join them if possible. I use the first CTE to limit the data for samples. I want the second CTE to look into the first and if the sample numbers match, include that sample number data in the second CTE. The reason I have the second CTE is because I use its data to create a pivot table.
Ultimately what I want to do in my outer query is to use the fields from the first CTE and add the pivot table columns from the second CTE to the left. Basically marry the two CTEs side by side in the final outer query.
Is this possible, or am I making this a lot harder than it needs to be? Remember, I work on a huge database with thousands of users.
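For what it's worth, the shape being described (a second CTE that reads from the first, then an outer query joining both on sample_num) can be sketched as below. This is only an illustration run through Python's built-in sqlite3 module: the samples and results tables, their columns and the data are all hypothetical, and conditional aggregation stands in for a PIVOT operator, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample_num INTEGER, site TEXT)")
conn.execute("CREATE TABLE results (sample_num INTEGER, analyte TEXT, value REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "A")])
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(1, "pH", 7.1), (1, "temp", 20.5), (2, "pH", 6.8), (3, "pH", 7.4)],
)

query = """
WITH limited AS (
    -- First CTE: restrict the set of samples.
    SELECT sample_num, site
    FROM samples
    WHERE site = 'A'
),
pivoted AS (
    -- Second CTE: only rows whose sample_num appears in the first CTE,
    -- turned into one column per analyte.
    SELECT r.sample_num,
           MAX(CASE WHEN r.analyte = 'pH'   THEN r.value END) AS ph,
           MAX(CASE WHEN r.analyte = 'temp' THEN r.value END) AS temp
    FROM results r
    JOIN limited l ON l.sample_num = r.sample_num
    GROUP BY r.sample_num
)
-- Outer query: the two CTEs side by side.
SELECT l.sample_num, l.site, p.ph, p.temp
FROM limited l
LEFT JOIN pivoted p ON p.sample_num = l.sample_num
ORDER BY l.sample_num
"""

combined = conn.execute(query).fetchall()
print(combined)
```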
I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables most of which have a significant number of their columns as numbers, these numbers correspond to a table with the real values. We are talking 9,500 different values (e.g '502=yes' or '1413= Graduate Student')
In any situation, I would just do WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine.......but I manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this AND some need DECODE statements that would be 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment I would stay away from using DECODE as described in your post. I would start by doing it as usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (may help the optimizer creating a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license permits, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option if your lookup table is fairly static is to cache the lookup table in the application. Read the TEST_TABLE from the database, and lookup descriptions in the application. Further improvements may be to add triggers that invalidate the cache when lookup table is modified.
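A minimal sketch of that third option, using Python's built-in sqlite3 module in place of a DB2 connection. The table layouts, the CODED_COLUMNS tuple and the sample rows are assumptions made for illustration; cache invalidation is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO lookup VALUES (?, ?)",
    [(5111, "Approved"), (5112, "In Progress"), (502, "yes")],
)
conn.execute(
    "CREATE TABLE test_table"
    " (test_id INTEGER, test_status INTEGER, another_status INTEGER)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?)",
    [(1, 5111, 502), (2, 5112, None)],
)

# Read the whole lookup table once and keep it in memory.
lookup_cache = dict(conn.execute("SELECT id, text FROM lookup"))

# Columns whose numeric codes should be replaced by their descriptions.
CODED_COLUMNS = ("test_status", "another_status")

cursor = conn.execute(
    "SELECT test_id, test_status, another_status FROM test_table ORDER BY test_id"
)
names = [col[0] for col in cursor.description]

decoded = []
for row in cursor:
    record = dict(zip(names, row))
    for col in CODED_COLUMNS:
        # Unknown or NULL codes decode to None.
        record[col] = lookup_cache.get(record[col])
    decoded.append(record)

print(decoded)
```

The query against TEST_TABLE needs no joins at all; the cost moves to keeping lookup_cache fresh.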
If you don't want to do all these joins, you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if not found.
With your mentioned 10k id/text pairs and an index on the ID field this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always used dynamic SQL to capture a Start Date and End Date, find out which partitions those fall in, and then loop through each of those partitions ... but it's starting to become such a hassle, and I've learned that this is probably not the best way to do it in terms of maintenance, troubleshooting, and performance.
I decided to build a view that would UNION ALL of my sales partitions together. However, I don't want selecting from the view to have to scan all of the partitions on execution, it would take away the whole purpose of partitioning tables out. Because of this, I added check constraints on date to each of my sales tables. This way when I selected from the view, it would know which tables to access from instead of scanning every table.
Here are some examples:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query has the execution plan of only pulling from the partitions that I need.
The problem I'm facing right now is that when my team writes stored procedures, they will more than likely write their queries with a date variable passed into the WHERE clause.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan now scans ALL of the partitions in the view, making the query take way longer than when I hard-coded the date.
I suppose I could do dynamic SQL again and insert the date string into the SELECT statement, but it would bring me back to the beginning of trying to get rid of dynamic SQL in the first place for this simple sales query.
So my question is, am I setting this up wrong? Am I on the right track? It seems that the view can't take in a variable for the check constraint and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it only touches the "partitions" that it needs. I figured you could take this concept and write some dynamic SQL to create all the IF statements.
Now of course new tables are added every day so how does that tie in. Well I think the best way would be is that every day you alter the function adding the next sales table. That way querying it is simple. And you could use the same dynamic sql you used to create the function to alter it which should be relatively simple.
Note: I added default values that are the min and max of the data type DATE. That way you could query something like everything from 20140101 and onward or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
    datesVal DATE
)
AS
BEGIN
    -- One IF block per daily table: a table is read only when its date falls in range.
    IF(CAST('20150101' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150101
    END
    IF(CAST('20150102' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150102
    END
    IF(CAST('20150103' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150103
    END
    RETURN;
END
GO
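The repetitive IF blocks lend themselves to generation. The answer suggests dynamic SQL for this; as an alternative illustration only, here is a small Python sketch that builds the ALTER FUNCTION text for a range of days. The helper name build_alter_function is made up, and the generated text simply mirrors the function above, with T-SQL variables written with a leading @.

```python
from datetime import date, timedelta

def build_alter_function(first_day, last_day):
    """Return T-SQL text with one IF block per daily table in the range."""
    lines = [
        "ALTER FUNCTION ufn_factTotalSales"
        " (@Start DATE = '17530101', @End DATE = '99991231')",
        "RETURNS @factTotalSales TABLE (datesVal DATE)",
        "AS",
        "BEGIN",
    ]
    day = first_day
    while day <= last_day:
        stamp = day.strftime("%Y%m%d")
        lines += [
            f"    IF (CAST('{stamp}' AS DATE) BETWEEN @Start AND @End)",
            "    BEGIN",
            "        INSERT INTO @factTotalSales",
            "        SELECT datesVal",
            f"        FROM factDailySales_{stamp}",
            "    END",
        ]
        day += timedelta(days=1)
    lines += ["    RETURN;", "END"]
    return "\n".join(lines)

sql_text = build_alter_function(date(2015, 1, 1), date(2015, 1, 3))
print(sql_text)
```

The printed text would still have to be reviewed and executed against the server by whatever scheduled job adds the new daily table.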
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all tables into one with good indexes. I know you said that wouldn't work because of the way other code has been written. Hear me out. Perhaps this is off the wall, but let's say you do combine the tables, and obviously there are old scripts looking for specific daily sales tables. Maybe you could create views with the dailySales names that access factTotalSales. Or you could create synonyms for factTotalSales that would correspond to each factDailySales.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed is a better way of doing it instead of forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
5 years later: option(recompile).
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
While this means the query plan is larger as it has to include all tables, it does not mean that all tables are actually scanned: look at the IO stats, as table scan elimination occurs even if such shows in the query plan.
The 'Number Of Executions' in the query plan will be 0 when the tables are not scanned: unfortunately, these branches are still reported as a non-zero percentage cost "Table Scan" node in the query plan & UI, which will appear high proportionally if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the actually used base tables increases.
This same optimization/elimination occurs when the view is not a Partitioned View (eg. base tables are missing partition column in PK), yet the underlying tables have a suitable Check Constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query or recompiled plan the tables will be eliminated entirely. With a variable the tables will still not actually be scanned and thus eliminated logically during query execution.
The use of a proper Partitioned View is only really beneficial to allow a direct Insert & Update, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (making a Partitioned View largely useless IMHO). SQL Server handles the optimizations very similarly for other quasi-Partitioned View cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)