STRING_split performance Issue - tsql

I have two queries which split comma separated list into rows and insert into table variable.
For first query I have used custom function which is:
USER defined Function for Spilt.
Create FUNCTION [dbo].[Split_S]
(
#sInputList VARCHAR(MAX)
,#sDelimiter VARCHAR(8)
)
RETURNS #List TABLE ([item] VARCHAR(8000))
AS
BEGIN
DECLARE #sItem VARCHAR(MAX)
WHILE CHARINDEX(#sDelimiter,#sInputList,0) <> 0
BEGIN
SELECT
#sItem=RTRIM(LTRIM(SUBSTRING(#sInputList,1,CHARINDEX(#sDelimiter,#sInputList,0)-1)))
,#sInputList=RTRIM(LTRIM(SUBSTRING(#sInputList,CHARINDEX(#sDelimiter,#sInputList,0)+LEN(#sDelimiter),LEN(#sInputList))))
IF LEN(#sItem) > 0
INSERT INTO #List SELECT #sItem
END
IF LEN(#sInputList) > 0
INSERT INTO #List SELECT #sInputList-- Put the last item in
RETURN
END
Query 1 :
DECLARE #F TABLE(F BIGINT)
INSERT INTO #F
SELECT [item] FROM [dbo].[Split_S]
(N'82,13,51,68,6',',')
Query 2 :
DECLARE #F2 TABLE(F BIGINT)
INSERT INTO #F2
SELECT Value
from
STRING_SPLIT(N'82,13,51,68,6',',')
Query Plan of Both Query
Why 37% and using STRING_SPLIT Its 63% .
but if i only compare select statement then query cost of STRING_SPLIT is 1%.
Which query has better performance and why?

If you will check only the part of the query that include the select query, then you will get that using STRING_SPLIT gives much better performance according too execution plan (EP). the result will be 99% vs 1%.
But when we use the data that returned by the STRING_SPLIT function (for example "select... into" or like in your case "insert...select'), then you might notice that the server uses "table spool (Eager Spool)" which make the difference. This operator takes the rows and stores them in a hidden temporary object stored in the tempdb database (the idea of using this logic is that the spooled data can be reused later in the execution plan). The "eager" spool takes ALL rows from the previously operator at one time, which means that this is "blocking operator".

Related

How to declare Array variable in Azure Data Warehouse?

I have a list of strings I use in a sql query similar to this:
select count(*) from sometable where somefield in ('val1','val2',...'valn')
I use this pattern in several queries in a single stored proc. I want to reuse the stored proc, changing the values in the array periodically. Using normal SQL databases, you can declare a table variable type but that is not supported in SQL Data Warehouse. You can use a temp table, but these and table variables require more editing when the values change (requiring insert statements or unions to populate the table). How can I declare an array variable?
Create a varchar variable and use the STRING_SPLIT function in a select statement:
DECLARE #ids varchar(8000)
set #ids = 'Val1,Val2,...ValN'
select count(*) from sometable
where somefield in (SELECT value FROM STRING_SPLIT(#ids, ','))
While this works, I'm not sure how well it scales; for performance reasons, you can fall back to using a temp table - then use VSCode to edit the insert statements (Ctrl+Shift+L is your friend).

Postgres using functions inside queries

I have a table with common word values to match against brands - so when someone types in "coke" I want to match any possible brand names associated with it as well as the original term.
CREATE TABLE word_association ( commonterm TEXT, assocterm TEXT);
INSERT INTO word_association ('coke', 'coca-cola'), ('coke', 'cocacola'), ('coke', 'coca-cola');
I have a function to create a list of these values in a pipe-delim string for pattern matching:
CREATE OR REPLACE FUNCTION usp_get_search_terms(userterm text)
RETURNS text AS
$BODY$DECLARE
returnstr TEXT DEFAULT '';
BEGIN
SET DATESTYLE TO DMY;
returnstr := userterm;
IF EXISTS (SELECT 1 FROM word_association WHERE LOWER(commonterm) = LOWER(userterm)) THEN
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE commonterm = userterm;
END IF;
RETURN returnstr;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION usp_get_search_terms(text)
OWNER TO customer_role;
If you call SELECT * FROM usp_get_search_terms('coke') you end up with
coke|coca-cola|cocacola|coca cola
EDIT: this function runs <100ms so it works fine.
I want to run a query with this text inserted e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE LOWER(X.online_description) % usp_get_search_terms ('coke');
This takes approx 56s to run against my table of ~500K records.
If I get the raw text and use it in the query it takes ~300ms e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE X.online_description % '(coke|coca-cola|cocacola|coca cola)';
The result sets are identical.
I've tried modifying what the output string from the function to e.g. enclose it in quotes and parentheses but it doesn't seem to make a difference.
Can someone please advise why there is a difference here? Is it the data type or something about calling functions inside queries? Thanks.
Your function might take 100ms, but it's not calling your function once; it's calling it 500,000 times.
It's because your function is declared VOLATILE. This tells Postgres that either the function returns different values when called multiple times within a query (like clock_timestamp() or random()), or that it alters the state of the database in some way (for example, by inserting records).
If your function contains only SELECTs, with no INSERTs, calls to other VOLATILE functions, or other side-effects, then you can declare it STABLE instead. This tells the planner that it can call the function just once and reuse the result without affecting the outcome of the query.
But your function does have side-effects, due to the SET DATESTYLE statement, which takes effect for the rest of the session. I doubt this was the intention, however. You may be able to remove it, as it doesn't look like date formatting is relevant to anything in there. But if it is necessary, the correct approach is to use the SET clause of the CREATE FUNCTION statement to change it only for the duration of the function call:
...
$BODY$
LANGUAGE plpgsql STABLE
SET DATESTYLE TO DMY
COST 100;
The other issue with the slow version of the query is the call to LOWER(X.online_description), which will prevent the query from utilising the index (since online_description is indexed, but LOWER(online_description) is not).
With these changes, the performance of both queries is the same; see this SQLFiddle.
So the answer came to me about dawn this morning - CTEs to the rescue!
Particularly as this is the "simple" version of a very large query, it helps to get this defined once in isolation, then do the matching against it. The alternative (given I'm calling this from a NodeJS platform) is to have one request retrieve the string of terms, then make another request to pass the string back. Not elegant.
WITH matches AS
( SELECT * FROM usp_get_search_terms('coke') )
, main AS
( SELECT X.article_number, X.online_description
FROM articles X
JOIN matches M ON X.online_description % M.usp_get_search_terms )
SELECT * FROM main
Execution time is somewhere around 300-500ms depending on term searched and articles returned.
Thanks for all your input guys - I've learned a few things about PostGres that my MS-SQL background didn't necessarily prepare me for :)
Have you tried removing the IF EXISTS() and simply using:
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE LOWER(commonterm) = LOWER(userterm)
In instead of calling the function for each row call it once:
select x.article_number, x.online_description
from
woolworths.articles x
cross join
woolworths.usp_get_search_terms ('coke') c (s)
where lower(x.online_description) % s

Removing empty tables from TSQL sproc output data sets?

I have a TSQL sproc that does three loops in order to find relevant data. If the first loop renders no results, then the second one normally does. I append another table that has multiple values that I can use later on.
So at most I should only have two tables returned in the dataset from the sproc.
The issue is that if the first loop is blank, I then end up with three data tables in my data set.
In my C# code, I can remove this empty table, but would rather not have it returned at all from the sproc.
Is there a way to remove the empty table from within the sproc, given the following:
EXEC (#sqlTop + #sqlBody + #sqlBottom)
SET #NumberOfResultsReturned = ##ROWCOUNT;
.
.
.
IF #NumberOfResultsReturned = 0
BEGIN
SET #searchLoopCount = #searchLoopCount + 1
END
ELSE
BEGIN
-- we have data, so no need to run again
BREAK
END
The process goes as follows: On the first loop there could be no results. Thus the rowcount will be zero because the EXEC executes a dynamically created SQL query. That's one table.
In the next iteration, results are returned, making that two data tables in the dataset output, plus my third one added on the end.
I didn't want to do a COUNT(*) then if > 0 then perform the query as I want to minimize the queries.
Thanks.
You can put the result for your SP in a table variable and then check if the table variable has any data in it.
Something like this with a SP named GetData that returns one integer column.
declare #T table(ID int)
declare #SQL varchar(25)
-- Create dynamic SQL
set #SQL = 'select 1'
-- Insert result from #SQL to #T
insert into #T
exec (#SQL)
-- Check for data
if not exists(select * from #T)
begin
-- No data continue loop
set #searchLoopCount = #searchLoopCount + 1
end
else
begin
-- Have data so wee need to query the data
select *
from #T
-- Terminate loop
break
end

UPDATE-no-op in SQL MERGE statement

I have a table with some persistent data in it. Now when I query it, I also have a pretty complex CTE which computes the values required for the result and I need to insert missing rows into the persistent table. In the end I want to select the result consisting of all the rows identified by the CTE but with the data from the table if they were already in the table, and I need the information whether a row has been just inserted or not.
Simplified this works like this (the following code runs as a normal query if you like to try it):
-- Set-up of test data, this would be the persisted table
DECLARE #target TABLE (id int NOT NULL PRIMARY KEY) ;
INSERT INTO #target (id) SELECT v.id FROM (VALUES (1), (2)) v(id);
-- START OF THE CODE IN QUESTION
-- The result table variable (will be several columns in the end)
DECLARE #result TABLE (id int NOT NULL, new bit NOT NULL) ;
WITH Source AS (
-- Imagine a fairly expensive, recursive CTE here
SELECT * FROM (VALUES (1), (3)) AS Source (id)
)
MERGE INTO #target AS Target
USING Source
ON Target.id = Source.id
-- Perform a no-op on the match to get the output record
WHEN MATCHED THEN
UPDATE SET Target.id=Target.id
WHEN NOT MATCHED BY TARGET THEN
INSERT (id) VALUES (SOURCE.id)
-- select the data to be returned - will be more columns
OUTPUT source.id, CASE WHEN $action='INSERT' THEN CONVERT(bit, 1) ELSE CONVERT(bit, 0) END
INTO #result ;
-- Select the result
SELECT * FROM #result;
I don't like the WHEN MATCHED THEN UPDATE part, I'd rather leave the redundant update away but then I don't get the result row in the OUTPUT clause.
Is this the most efficient way to do this kind of completing and returning data?
Or would there be a more efficient solution without MERGE, for instance by pre-computing the result with a SELECT and then perform an INSERT of the rows which are new=0? I have difficulties interpreting the query plan since it basically boils down to a "Clustered Index Merge" which is pretty vague to me performance-wise compared to the separate SELECT followed by INSERT variant. And I wonder if SQL Server (2008 R2 with CU1) is actually smart enough to see that the UPDATE is a no-op (e.g. no write required).
You could declare a dummy variable and set its value in the WHEN MATCHED clause.
DECLARE #dummy int;
...
MERGE
...
WHEN MATCHED THEN
UPDATE SET #dummy = 0
...
I believe it should be less expensive than the actual table update.

Navigating the results of a stored procedure via a cursor using T-SQL

Due to a legacy report generation system, I need to use a cursor to traverse the result set from a stored procedure. The system generates report output by PRINTing data from each row in the result set. Refactoring the report system is way beyond scope for this problem.
As far as I can tell, the DECLARE CURSOR syntax requires that its source be a SELECT clause. However, the query I need to use lives in a 1000+ line stored procedure that generates and executes dynamic sql.
Does anyone know of a way to get the result set from a stored procedure into a cursor?
I tried the obvious:
Declare Cursor c_Data For my_stored_proc #p1='foo', #p2='bar'
As a last resort, I can modify the stored procedure to return the dynamic sql it generates instead of executing it and I can then embed this returned sql into another string and, finally, execute that. Something like:
Exec my_stored_proc #p1='foo', #p2='bar', #query='' OUTPUT
Set #sql = '
Declare Cursor c_Data For ' + #query + '
Open c_Data
-- etc. - cursor processing loop etc. goes here '
Exec #sql
Any thoughts? Does anyone know of any other way to traverse the result set from a stored proc via a cursor?
Thanks.
You could drop the results from the stored proc into a temp table and select from that for your cursor.
CREATE TABLE #myResults
(
Col1 INT,
Col2 INT
)
INSERT INTO #myResults(Col1,Col2)
EXEC my_Sp
DECLARE sample_cursor CURSOR
FOR
SELECT
Col1,
Col2
FROM
#myResults
Another option may be to convert your stored procedure into a table valued function.
DECLARE sample_cursor CURSOR
FOR
SELECT
Col1,
Col2
FROM
dbo.NewFunction('foo', 'bar')
You use INSERT ... EXEC to push the result of the procedure into a table (can be a temp #table or a #table variable), the you open the cursor over this table. The article in the link discusses the problems that may occur with this technique: it cannot be nested and it forces a transaction around the procedure.
You could execute your SP into a temporary table and then iterate over the temporary table with the cursor
create table #temp (columns)
insert into #temp exec my_stored_proc ....
perform cursor work
drop table #temp