TSQL: left join on view very slow

In a complex stored procedure we use a view and LEFT JOIN on that view. The query takes 40 seconds to execute.
Now, if we create a TABLE variable, store the view's result in it, and do the LEFT JOIN against that variable instead of the view, it takes 3 seconds...
What can explain this behavior?
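For reference, the workaround being described looks roughly like this (all object names are hypothetical):
DECLARE @ViewResult TABLE (CustomerID INT, Total MONEY);
-- Materialize the view once, then join against the table variable
INSERT INTO @ViewResult (CustomerID, Total)
SELECT CustomerID, Total
FROM dbo.vComplexSalesView;
SELECT o.OrderID, v.Total
FROM dbo.Orders o
LEFT JOIN @ViewResult v ON v.CustomerID = o.CustomerID;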

The view expands into the main query. So if you have 5 tables in the view, these expand, together with the extra table, into one big query plan with 6 tables. The performance difference is most likely caused by the added complexity and permutations introduced by the extra table you LEFT JOIN with.
Another potential issue: are you LEFT JOINing on a column that has some processing applied to it? That will further kill performance.
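For example, a sketch of the difference (table, view, and column names are made up):
-- Processing on the join column defeats index seeks on the view's base tables
SELECT t.ID
FROM dbo.MainTable t
LEFT JOIN dbo.vSomeView v ON LTRIM(RTRIM(v.Code)) = t.Code;
-- A plain column-to-column predicate is sargable and lets the optimizer seek
SELECT t.ID
FROM dbo.MainTable t
LEFT JOIN dbo.vSomeView v ON v.Code = t.Code;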

Courtesy of Alden W., who gave this answer to a question of mine similar to yours.
You may be able to get better performance by specifying the view ALGORITHM as MERGE. With MERGE, MySQL combines the view definition with your outer SELECT's WHERE clause and then comes up with an optimized execution plan.
To do this, however, you would have to remove the GROUP BY from your view. As it is, if a GROUP BY is included in the view, MySQL will choose the TEMPTABLE algorithm: a temporary table of the entire view is created first, before being filtered by your WHERE clause.
If the MERGE algorithm cannot be used, a temporary table must be used instead. MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Refers only to literal values (in this case, there is no underlying table)
Here is the link with more info. http://dev.mysql.com/doc/refman/5.0/en/view-algorithms.html
If you can change your view to not include the GROUP BY statement, the syntax to specify the view's algorithm is:
CREATE ALGORITHM = MERGE VIEW...
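For illustration, a fuller sketch with made-up view and table names:
CREATE ALGORITHM = MERGE VIEW v_shipped_orders AS
SELECT o.id, o.customer_id, o.total
FROM orders o
WHERE o.status = 'shipped';
-- The outer query's WHERE clause is merged into the view's definition,
-- so MySQL can plan the whole thing as one statement
SELECT * FROM v_shipped_orders WHERE customer_id = 42;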

Related

Is it possible to use one field in a cte to limit the data in another cte

I am new to using CTEs, but I work with a humongous database and think they would put less stress on the system than subqueries. I'm not sure if what I want to do is possible.
I have 2 CTEs with different columns from different tables, but each CTE has the same sample_num (same data type of int) that could be used to join them. I use the first CTE to limit the data to certain samples. I want the second CTE to look at the first and, if the sample numbers match, include that sample number's data in the second CTE. The reason I have the second CTE is that I use its data to create a pivot table.
Ultimately, in my outer query, I want to use the fields from the first CTE and add the pivot table columns from the second CTE alongside them; basically marry the two CTEs side by side in the final outer query.
Is this possible, or am I making this a lot harder than it needs to be? Remember, I work on a huge database with thousands of users.
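A minimal sketch of the shape being described; sample_num comes from the question, all other table and column names are made up:
WITH limited_samples AS
(
    SELECT sample_num, collected_date
    FROM dbo.samples
    WHERE collected_date >= '2015-01-01'
),
pivot_source AS
(
    SELECT r.sample_num, r.test_code, r.result_value
    FROM dbo.results r
    WHERE r.sample_num IN (SELECT sample_num FROM limited_samples)  -- second CTE limited by the first
)
SELECT ls.sample_num, ls.collected_date, p.[NA], p.[K]
FROM limited_samples ls
LEFT JOIN
(
    SELECT sample_num, [NA], [K]
    FROM pivot_source
    PIVOT (MAX(result_value) FOR test_code IN ([NA], [K])) pvt
) p ON p.sample_num = ls.sample_num;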

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always used dynamic SQL to capture a start date and an end date, figure out which partitions those cover, and then loop through each of those partitions ... but it's starting to become such a hassle, and I've learned that this is probably not the best approach in terms of maintenance, troubleshooting, and performance.
I decided to build a view that UNION ALLs my sales partitions together. However, I don't want selecting from the view to scan all of the partitions on execution; that would defeat the whole purpose of partitioning the tables out. Because of this, I added check constraints on the date column to each of my sales tables, so that when I select from the view it knows which tables to read instead of scanning every table.
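A rough sketch of that setup (the constraint name is made up; the table, view, and column names are taken from the description):
ALTER TABLE factdailysales_201503
ADD CONSTRAINT ck_factdailysales_201503_date
CHECK ([Date] >= '2015-03-01' AND [Date] < '2015-04-01');
CREATE VIEW Sales_Orig
AS
SELECT [Date], [retail] FROM factdailysales_201501
UNION ALL
SELECT [Date], [retail] FROM factdailysales_201502
UNION ALL
SELECT [Date], [retail] FROM factdailysales_201503;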
Here are some example queries:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query has the execution plan of only pulling from the partitions that I need.
The problem I'm facing right now is that most of the time, when my team writes stored procedures, they will more than likely write their queries so that a date variable is passed into the WHERE clause:
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan scans ALL of the partitions in the view, causing the query to take far longer than when I hard-coded the date.
I suppose I could go back to dynamic SQL and insert the date string into the SELECT statement, but that would put me back at square one, since the whole point was to get rid of dynamic SQL for this simple sales query.
So my question is: am I setting this up wrong? Am I on the right track? It seems the view can't take a variable into account against the check constraints and ends up scanning every table. Is there another approach anyone would recommend? Or is my original solution of just looping through partitions via dynamic SQL the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it only reads the "partitions" it needs. You could take this concept and write some dynamic SQL to generate all the IF statements.
Now of course new tables are added every day, so how does that tie in? I think the best approach is to alter the function every day, adding the next sales table; that way querying it stays simple. You could use the same dynamic SQL you used to create the function to alter it, which should be relatively straightforward.
Note: I gave the parameters default values spanning the full date range ('17530101' to '99991231'), so you can query, for example, everything from 20140101 onward, or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
    datesVal DATE
)
AS
BEGIN
    -- Only read the tables whose date falls inside the requested range
    IF (CAST('20150101' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150101
    END

    IF (CAST('20150102' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150102
    END

    IF (CAST('20150103' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150103
    END

    RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all the tables into one and have good indexes. I know you said that wouldn't work because of the way other code has been written, but hear me out. This may be off the wall, but let's say you do combine the tables; obviously there are old scripts looking for the specific daily sales tables. Maybe you could create views with the dailySales names that select from factTotalSales, or create synonyms for factTotalSales corresponding to each factDailySales table.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed to is a better approach than forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
5 years later: option(recompile).
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
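For example, applied to the query from the question (a sketch):
DECLARE @SD DATE = '2015-03-01';
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
OPTION (RECOMPILE);  -- the plan is compiled with the variable's current value, so unneeded tables can be eliminated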
Without the hint, the query plan is larger, since it has to include all the tables, but that does not mean all the tables are actually scanned: look at the IO stats, because table elimination still occurs at execution time even if the tables show up in the query plan.
The "Number of Executions" for a table in the plan will be 0 when that table is not scanned. Unfortunately, those branches are still reported as "Table Scan" nodes with a non-zero percentage cost in the plan and the UI, which can look disproportionately high if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the base tables actually used increases.
This same optimization/elimination occurs when the view is not a true partitioned view (e.g. the base tables are missing the partition column in their PK) but the underlying tables have a suitable check constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query, or a recompiled plan, the tables are eliminated from the plan entirely; with a variable, the tables are still not actually scanned and are effectively eliminated at execution time.
A proper partitioned view is really only beneficial for allowing direct INSERTs and UPDATEs through the view, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (which, IMHO, makes partitioned views largely useless). SQL Server handles the optimizations very similarly for other quasi-partitioned-view cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

distinct performance on large table

Suppose I have a table with a column that has repeats, e.g.
Column1
---------
a
a
a
a
b
a
c
d
e
... so on
Maybe it has hundreds of thousands of rows. Then say I need to pull the distinct values from this column. I can do that easily with SELECT DISTINCT, but what about performance?
I could also give each item in Column1 an id and create a new table referenced by Column1 (to normalize this more appropriately). That adds extra complexity to inserts, though, and adds joins for other possible queries.
Is there some way to index just the distinct values in a column, or is normalization the only way to go?
An index on Column1 will considerably speed up the DISTINCT, but if you are willing to trade some space and a little time on every insert/update/delete, you can resort to a materialized view. In SQL Server this is an indexed view, which you can think of as a dynamic table produced and maintained according to the view definition.
create view view1
with schemabinding
as
select column1,
       count_big(*) cnt
from dbo.theTable          -- schemabinding requires two-part object names
group by column1
-- create unique clustered index ix_view1 on view1(column1)
(Do not forget to execute the commented-out CREATE INDEX command. I usually do it this way so that the view definition contains the index definition, reminding me to apply it if I ever need to change the view.)
When you want to use it, be sure to add the NOEXPAND hint to force the use of the materialized data (this part will remain a mystery to me: something created as a performance enhancement is not used by default, but must be activated on the spot).
select *
from view1 (noexpand)

select * or individual columns

Is there any difference in performance between
select *
from table name
and
select [col1]
,[col2]
......
,[coln]
from table name
It is a SQL antipattern to use SELECT *. It is faster for the database when you specify columns (it doesn't have to look them up), and more importantly, you should never specify more columns than you actually need. If you have a join in your query, you have at least one column you don't need (the join column), and thus SELECT * is always slower at returning records in a query with a join, since it returns more information than necessary.
Now all this sounds like a small improvement and in a small system it might be, but as the database grows and gets busy, the performance implications become bigger. There is no excuse for using SELECT *.
SELECT * is also bad for maintenance, especially if you use it to insert records: always specify both the columns you are inserting into and the columns in the SELECT of an INSERT statement that uses a SELECT. Otherwise it will break if you change the table structure. You may also end up showing the user columns you don't want them to see, such as the GUID added for replication.
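For instance, a sketch with made-up table and column names:
-- Fragile: breaks or silently misaligns if either table's structure changes
INSERT INTO dbo.CustomerStaging
SELECT * FROM dbo.Customer;
-- Robust: both column lists are explicit
INSERT INTO dbo.CustomerStaging (CustomerID, FirstName, LastName)
SELECT CustomerID, FirstName, LastName
FROM dbo.Customer;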
If you use SELECT * in a view (at least in SQL Server), the view will not automatically update when the underlying tables change. If someone gets the silly idea to rearrange the column order in the table (yes, I know you shouldn't, but people do this on occasion), using SELECT * might mean data shows up in the wrong columns in reports or inserts, which can cause problems where things get misinterpreted. I can think of one case where two columns were swapped in a staging table and the social security number became the amount we intended to pay the person giving a speech. You can see how that might really muck up the accounting, except we didn't use SELECT *, so we were safe because the columns were referenced by name.
I want to note that you don't even save much development time by using SELECT *. It takes me at most about 15 seconds more to list the column names, even for a large table, since I can drag them over from the object browser in SQL Server (you can get all the columns in one step).
I suppose SELECT * takes a few CPU cycles less when SQL Server parses your statement, but I don't think it will make a noticeable difference.

Suggestion needed to avoid table scan in large view w/many duplicate values

Could anyone offer a solution to speed up one of our processes? We have a view used for reporting that is a UNION ALL of 10 tables. The view has 180 million rows. We would like to generate a list of the distinct values of individual columns. The SQL currently generated by the reporting tool does a SELECT DISTINCT on the view, which takes 10 minutes. Preferably the solution would update automatically. We have been trying to create an MQT in DB2 UDB V8 as a UNION ALL with REFRESH IMMEDIATE, with little success. Any suggestions would be greatly appreciated.
Charles.
There are a lot of restrictions in DB2 8.2 for refresh immediate MQTs, and they can have a significant performance impact on applications that write to the base tables. That said, you may be able to use an MQT. However, instead of using SELECT DISTINCT, try making the query look something like:
select yourcolumn, count(*) as ignore
from union_all_view
group by yourcolumn
The column (yourcolumn) must be defined as NOT NULL for this to work (in DB2 8.2). The optimizer may not select this MQT if you still issue SELECT DISTINCT against the UNION ALL view, so you may need to query the MQT (or a view defined on top of it) directly. Ignore the column "ignore" in the MQT; it is there only for DB2, and if you really don't want to see it, you can create a view on top of the MQT.
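A rough sketch of the MQT DDL shape being suggested (the MQT name is made up, and DB2 8.2's REFRESH IMMEDIATE restrictions may still force adjustments, such as referencing the base tables directly instead of the view):
CREATE TABLE distinct_vals AS
(
  SELECT yourcolumn, COUNT(*) AS ignore
  FROM union_all_view
  GROUP BY yourcolumn
)
DATA INITIALLY DEFERRED REFRESH IMMEDIATE;
-- Populate the MQT and bring it out of set-integrity-pending state
REFRESH TABLE distinct_vals;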
However, this is really a database design issue. Why do you need to scan 180 million rows of data to find the unique values in a particular column? Why don't these values already reside in their own table, with foreign keys defined against it from each of the 10 base tables?