TSQL view select optimization when function is present - tsql

I have this simple SQL as a source in a SSIS task:
Select * from budgetview
the source is:
CREATE VIEW [dbo].[BudgetView] AS
SELECT DISTINCT Country,
SDCO AS Company,
SDAN8 AS Customer,
SDLITM AS PrintableItemNumber,
dbo.fn_DateFromJulian(SDIVD) AS Date,
SDPQOR/100.0 AS Quantity,
SDAEXP/100.0 AS Value,
SDITWT/10000.0 AS Weight
FROM dbo.F553460
There are NO advices for indexes, every thing seems optimized.
The function fn_DateFromJulian source is:
CREATE FUNCTION [dbo].[fn_DateFromJulian]
(
#JulianDate numeric(6,0)
)
RETURNS date
AS
BEGIN
declare #resultdate date=dateadd(year,#JulianDate/1000,'1900-01-01')
set #resultdate=dateadd(day,#JulianDate%1000 -1,#resultdate)
return #resultdate
END
The problem is that i am waiting around 20 minutes just to get the rows going in SSIS....
I am waiting there 20mins BEFORE it gets started
Are there any suggestions to find the culprit?

My assumption is that the time spent on the view is consumed by calculating the Julian date value. Without seeing the actual query plan, it seems a fair guess based on the articles below.
Rewrite the original function as a table valued function below (I've simply mashed your code together, there are likely opportunities for improvement)
CREATE FUNCTION dbo.fn_DateFromJulianTVF
(
#JulianDate numeric(6,0)
)
RETURNS TABLE AS
RETURN
(
SELECT dateadd(day,#JulianDate%1000 -1,dateadd(year,#JulianDate/1000,CAST('1900-01-01' AS date))) AS JDEDate
)
Usage would be
CREATE VIEW [dbo].[BudgetView] AS
SELECT DISTINCT Country,
SDCO AS Company,
SDAN8 AS Customer,
SDLITM AS PrintableItemNumber,
J.JDEDate AS [Date],
SDPQOR/100.0 AS Quantity,
SDAEXP/100.0 AS Value,
SDITWT/10000.0 AS Weight
FROM dbo.F553460 AS T
CROSS APPLY
dbo.fn_DateFromJulianTVF(T.SDIVD) AS J
Scalar valued function, smell like code reuse, performs like a reused disposable diaper
https://sql.kiwi/2012/09/compute-scalars-expressions-and-execution-plan-performance.html
http://blogs.lobsterpot.com.au/2011/11/08/when-is-a-sql-function-not-a-function/

Just checking, but am I right to understand that for every unique value of T.SDIVD there will be just one unique result value of the function ? In other words, no two different T.SDIVD will return the same value from the function?
In that case what is happening here (IMHO) is that you first do scan over the entire table, for each and every record calculate the f(SDIVD) value and then send that entire resultset through an aggregation (DISTINCT).
Since functions are far from optimal in MSSQL I'd suggest to limit their use by turning around the chain of events and doing it like this:
CREATE VIEW [dbo].[BudgetView] AS
SELECT /* DISTINCT */
Country,
Company,
Customer,
PrintableItemNumber,
dbo.fn_DateFromJulian(SDIVD) AS Date,
Quantity,
Value,
Weight
FROM (
SELECT DISTINCT Country,
SDCO AS Company,
SDAN8 AS Customer,
SDLITM AS PrintableItemNumber,
SDIVD,
SDPQOR/100.0 AS Quantity,
SDAEXP/100.0 AS Value,
SDITWT/10000.0 AS Weight
FROM dbo.F553460 ) dist_F553460
)
If you had lots of double records this should improve performance, if you only had few of them it won't make much of a difference, if any. If you know you have no doubles at all you should get rid of the DISTINCT in the first place as that is what causing the delay!
Anyway, regarding the function you can add the following trick:
CREATE FUNCTION [dbo].[fn_DateFromJulian]
(
#JulianDate numeric(6,0)
)
RETURNS date
WITH SCHEMABINDING
AS
BEGIN
declare #resultdate date=dateadd(year,#JulianDate/1000,'1900-01-01')
set #resultdate=dateadd(day,#JulianDate%1000 -1,#resultdate)
return #resultdate
END
The WITH SCHEMABINDING causes some internal optimisations that will make its execution slightly faster, YMMV. There are limitations to it, but here it will work nicely.
Edit: removed the 'outer' DISTINCT since it's (likely, cf my first assumption) not needed.

Related

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always performed dynamic SQL to capture a Start Date, End Date, find out what partitions those are, and then loop through each of those partitions ... but its starting to become such a hassle and I've learned that this is probably not the best way to do it in terms of just maintenance, trouble shooting, and performance.
I decided to build a view that would UNION ALL of my sales partitions together. However, I don't want selecting from the view to have to scan all of the partitions on execution, it would take away the whole purpose of partitioning tables out. Because of this, I added check constraints on date to each of my sales tables. This way when I selected from the view, it would know which tables to access from instead of scanning every table.
Here are the following examples below:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query has the execution plan of only pulling from the partitions that I need.
My problem that i'm facing right now is that most of the time when my team will be writing stored procedures, they would more than likely write their queries where a date variable is passed into the where statement.
DECLARE #SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= #SD
However, when a variable is being passed in, the execution plan now scans ALL of the partitions in the view, causing the performance to take wayyy longer than when I hard coded in the date
I suppose I could do dynamic SQL again and insert the date string into the SELECT statement, but it would bring me back to the beginning of trying to get rid of dynamic SQL in the first place for this simple sales query.
So my question is, am I setting this up wrong? Am I on the right track? It seems that the view can't take in a variable for the check constraint and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay this might work. It's a table-valued function that only access tables according to your #start and #end parameters so only accessing your "partitions" that it needs. I figured you could take this concept and write some dynamic SQL to create all the if statements.
Now of course new tables are added every day so how does that tie in. Well I think the best way would be is that every day you alter the function adding the next sales table. That way querying it is simple. And you could use the same dynamic sql you used to create the function to alter it which should be relatively simple.
Note: I added default values that are the min and max of the data type DATE. That way you could query something like everything from 20140101 and onward or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (#Start DATE = '17530101', #End DATE = '99991231')
RETURNS #factTotalSales TABLE
(
datesVal DATE
)
AS
BEGIN
IF(CAST('20150101' AS DATE) BETWEEN #Start AND #End)
BEGIN
INSERT INTO #factTotalSales
SELECT datesVal
FROM factDailySales_20150101
END
IF(CAST('20150102' AS DATE) BETWEEN #Start AND #End)
BEGIN
INSERT INTO #factTotalSales
SELECT datesVal
FROM factDailySales_20150102
END
IF(CAST('20150103' AS DATE) BETWEEN #Start AND #End)
BEGIN
INSERT INTO #factTotalSales
SELECT datesVal
FROM factDailySales_20150103
END
RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
**All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all tables into one and having good indexes. I know you said that wouldn't work because of the way other code has been written. Hear me out. Now perhaps this is off the wall, lets say you do combine the tables but obviously there are old scripts looking for specific daily sales tables. Maybe you could create views with the dailySales names that access the factTotalSales. OR You could create synonyms for the factTotalSales that would correspond to each factDailySales.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed is a better way of doing it instead of forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
5 years later: option(recompile).
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
While this means the query plan is larger as it has to include all tables, it does not mean that all tables are actually scanned: look at the IO stats, as table scan elimination occurs even if such shows in the query plan.
The 'Number Of Executions' in the query plan will be 0 when the tables are not scanned: unfortunately, these branches are still reported as a non-zero percentage cost "Table Scan" node in the query plan & UI, which will appear high proportionally if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the actually used base tables increases.
This same optimization/elimination occurs when the view is not a Partitioned View (eg. base tables are missing partition column in PK), yet the underlying tables have a suitable Check Constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query or recompiled plan the tables will be eliminated entirely. With a variable the tables will still not actually be scanned and thus eliminated logically during query execution.
The use of a proper Partitioned View is only really beneficial to allow a direct Insert & Update, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (making a Partitioned View largely useless IMOHO). SQL Server handles the optimizations very similarly for other quasi-Partitioned View cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

Dynamic FROM clause in Postgres

Using PostgreSQL 9.1.13 I've written the followed query to calculate some data:
WITH windowed AS (
SELECT a.person_id, a.category_id,
CAST(dense_rank() OVER w AS float) / COUNT(*) OVER (ORDER BY category_id) * 100.0 AS percentile
FROM (
SELECT DISTINCT ON (person_id, category_id) *
FROM performances s
-- Want to insert a FROM clause here
INNER JOIN person p ON s.person_id = p.ident
ORDER BY person_id, category_id, created DESC
) a
WINDOW w AS (PARTITION BY category_id ORDER BY score)
)
SELECT category_id,percentile FROM windowed
WHERE person_id = 1;
I now want to turn this into a stored procedure but my issue is that in the middle there, where I showed the comment, I need to place a dynamic WHERE clause. For example, I'd like to add something like:
WHERE p.weight > 110 OR p.weight IS NULL
The calling application let's people pick filters and so I want to be able to pass the appropriate filters into the query. There could be 0 or many filters, depending on the caller, but I could pass it all in as a properly formatted where clause as a string parameter, for example.
The calling application just sends values to a webservice, which then builds the string and calls the stored procedure, so SQL injection attacks won't really be an issue.
The calling application just sends values to a webservice, which then
builds the string and calls the stored procedure, so SQL injection
attacks won't really be an issue.
Too many cooks spoil the broth.
Either let your webserive build the SQL statement or let Postgres do it. Don't use both on the same query. That leaves two possible weak spots for SQL injection attacks and makes debugging and maintenance a lot harder.
Here is full code example for a plpgsql function that builds and executes an SQL statement dynamically while making SQL injection impossible (just from two days ago):
Robust approach for building SQL queries programmatically
Details heavily depend on exact requirements.

TSQL Keyword Previous or Last or something similar

This question is geared for those who have more SQL experience than me.
I am writing a query(that will eventually be a Stored Procedure but this should be irrelevant) where I want to select the count of rows if the most recent entry's is equivalent to the one that was just entered before. And i want to continue to do this until it hits an entry that has a different value. (Poorly explained so I will show the example)
In my table I have a column 'Product_Id' and when this query is run i want it take the product_id and compare it to the previously entered product Id, if its the same I want to add one, and I want it to keep checking the previously entered product_id until it runs into a different product_id
I'm hoping it sounds more complicated than it is, and the query would look something like
Select count(Product_ID)
FROM dbo.myTable
Where Product_Id = previous(Product_Id)
Now, i know that previous isn't a keyword in TSQL, and neither was Last, but I'm hoping of someone who knows a keyword that does what I am asking.
Edit for Sam
USE DbName;
GO
WITH OrderedCount as
(
select ROW_NUMBER() OVER (Order by dbo.Line_Production.Run_Date DESC) as RowNumber,
Line_Production.Product_ID
From dbo.Line_Production
)
Select RowNumber, COUNT(OrderedCount.Product_ID) as PalletCount
From OrderedCount
WHERE OrderedCount.RowNumber + 1 = RowNumber
and Product_ID = Product_ID
Group by RowNumber
The OrderedCount portion works, and it returns the data back how I want it, I'm now having trouble comparing the Product_ID's for different RowNumbers
my Where Clause is wrong
There's no keyword. That would be a nice magic solution, but it doesn't exist, at least in part because there is no guaranteed ordering (okay, you could have the keyword only if there is an ORDER BY...). I can write you a query, but that'll take time, so for now I'll give you a few steps and I'll come back and see if you still need help in a bit.
Figure out an ORDER BY, otherwise no order is guaranteed. If there is a time entered field, that's a good choice, or an index, that works too.
Learn to use Row_Number.
Compare the table (with Row_Number) to itself where instance1.row - 1 = instance2.row.
If product_id is an identity column, couldn't you just do product_id - 1? In other words, if it's sequential, it's the same as using ROW_NUMBER mentioned in the previous comment.

Calling a function on distinct values from table

I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows. and tried again. This changed the query plan so it was doing a hash match to get distinctness rather than the former sort, and now the second version which forced it to select for distinctness first executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex which was a builtin function and considered deterministic. This gave me the same results with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, its possible that further testing would find situations that would coax the optimizer to do something more sophisticated, but it appears that if you want to apply the distinct before the function is called then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from
(select distinct a,b,c FROM users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.

SQL DateTime Conversion Fails when No Conversion Should be Taking Place

I'm modifying an existing query for a client, and I've encountered a somewhat baffling issue.
Our client uses SQL Server 2008 R2 and the database in question provides the user the ability to specify custom fields for one of its tables by making use of an EAV structure. All of the values stored in this structure are varchar(255), and several of the fields are intended to store dates. The query in question is being modified to use two of these fields and compare them (one is a start, the other is an end) against the current date to determine which row is "current".
The issue I'm having is that part of the query does a CONVERT(DateTime, eav.Value) in order to turn the varchar into a DateTime. The conversions themselves all succedd and I can include the value as part of the SELECT clause, but part of the question is giving me a conversion error:
Conversion failed when converting date and/or time from character string.
The real kicker is this: if I define the base for this query (getting a list of entities with the two custom field values flattened into a single row) as a view and select against the view and filter the view by getdate(), then it works correctly, but it fails if I add a join to a second table using one of the (non-date) fields from the view. I realize that this might be somewhat hard to follow, so I can post an example query if desired, but this question is already getting a little long.
I've tried recreating the basic structure in another database and including sample data, but the new database behaves as expected, so I'm at a loss here.
EDIT In case it's useful, here's the statement for the view:
create view Festival as
select
e.EntityId as FestivalId,
e.LookupAs as FestivalName,
convert(Date, nvs.Value) as ActivityStart,
convert(Date, nve.Value) as ActivityEnd
from tblEntity e
left join CustomControl ccs on ccs.ShortName = 'Activity Start Date'
left join CustomControl cce on cce.ShortName = 'Activity End Date'
left join tblEntityNameValue nvs on nvs.CustomControlId = ccs.IdCustomControl and nvs.EntityId = e.EntityId
left join tblEntityNameValue nve on nve.CustomControlId = cce.IdCustomControl and nve.EntityId = e.EntityId
where e.EntityType = 'Festival'
The failing query is this:
select *
from Festival f
join FestivalAttendeeAll fa on fa.FestivalId = f.FestivalId
where getdate() between f.ActivityStart and f.ActivityEnd
Yet this works:
select *
from Festival f
where getdate() between f.ActivityStart and f.ActivityEnd
(EntityId/FestivalId are int columns)
I've encountered this type of error before, it's due to the "order of operations" performed by the execution plan.
You are getting that error message because the execution plan for your statement (generated by the optimizer) is performing the CONVERT() operation on rows that contain string values that can't be converted to DATETIME.
Basically, you do not have control over which rows the optimizer performs that conversion on. You know that you only need that conversion done on certain rows, and you have predicates (WHERE or ON clauses) that exclude those rows (limit the rows to those that need the conversion), but your execution plan is performing the CONVERT() operation on rows BEFORE those rows are excluded.
(For example, the optimizer may be electing to a do a table scan, and performing that conversion on every row, before any predicate is being applied.)
I can't give a specific answer, without a specific question and specific SQL that is generating the error.
One simple approach to addressing the problem would be to use the ISDATE() function to test whether the string value can be converted to a date.
That is, replace:
CONVERT(DATETIME,eav.Value)
with:
CASE WHEN ISDATE(eav.Value) > 0 THEN CONVERT(DATETIME, eav.Value) ELSE NULL END
or:
CONVERT(DATETIME, CASE WHEN ISDATE(eav.Value) > 0 THEN eav.Value ELSE NULL END)
Note that the ISDATE() function is subject to some significant limitations, such as being affected by the DATEFORMAT and LANGUAGE settings of the session.
If there is some other indication on the eav row, you could use some other test, to conditionally perform the conversion.
CASE WHEN eav.ValueIsDateTime=1 THEN CONVERT(DATETIME, eav.Value) ELSE NULL END
The other approach I've used is to try to gain some modicum of control over the order of operations of the optimizer, using inline views or Common Table Expressions, with operations that force the optimizer to materialize them and apply predicates, so that happens BEFORE any conversion in the outer query.