Proc is running slow with NOT EXISTS - tsql

I'm working on creating a stored procedure, but I'm running into an issue where it runs for over 5 minutes against close to 50k records.
The process seems pretty straight forward, I'm just not sure why it is taking so long.
Essentially I have two tables:
Table_1
ApptDate     ApptName   ApptDoc    ApptReason   ApptType
---------------------------------------------------------
03/15/2021   Physical   Dr Smith   Yearly       Day
03/15/2021   Check In   Dr Doe     Check In     Day
03/15/2021   Appt oth   Dr Dee     Check In     Monthly
Table_2 - this table has the exact same structure as Table_1; what I am trying to achieve is simply to archive the data from Table_1:
DECLARE @Date_1 AS DATETIME
SET @Date_1 = GETDATE() - 1

INSERT INTO Table_2 (ApptDate, ApptName, ApptDoc, ApptReason)
SELECT ApptDate, ApptName, ApptDoc, ApptReason
FROM Table_1
WHERE ApptType = 'Day' AND ApptDate = @Date_1
AND NOT EXISTS (SELECT 1 FROM Table_2
                WHERE ApptType = 'Day' AND ApptDate = @Date_1)
So this stored procedure seems pretty straightforward; however, the NOT EXISTS is making it really slow.
The reason for the NOT EXISTS is that this stored procedure is part of a bigger process that runs multiple times a day (morning, afternoon, night). I'm trying to make sure that I only keep one copy of the '03/15/2021' data. I'm basically running an archive process on the previous day's data (@Date_1).
Any thoughts on how this can be sped up?

For this query:
INSERT INTO Table_2 (ApptDate, ApptName, ApptDoc, ApptReason)
SELECT ApptDate, ApptName, ApptDoc, ApptReason
FROM Table_1 t1
WHERE ApptType = 'Day' AND
      ApptDate = @Date_1 AND
      NOT EXISTS (SELECT 1
                  FROM Table_2 t2
                  WHERE t2.ApptType = t1.ApptType AND
                        t2.ApptDate = t1.ApptDate
                 );
You want indexes on Table_1(ApptType) and, more importantly, Table_2(ApptType, ApptDate) or Table_2(ApptDate, ApptType).
Note: I changed the correlation clause to just refer to the values in the outer query. This seems more general than your version, but should have the same performance (in this case).
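For reference, those indexes could be created along these lines (the index names here are placeholders, not part of the original answer):
-- Placeholder index names; the first supports the outer filter on Table_1
CREATE INDEX IX_Table_1_ApptType ON Table_1 (ApptType);
-- The second supports the NOT EXISTS probe into Table_2
CREATE INDEX IX_Table_2_ApptType_ApptDate ON Table_2 (ApptType, ApptDate);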

Related

Is there any way to update this column faster using PostgreSQL

I have about 200,000,000 rows and I am trying to update one of the columns. This query seems particularly slow, so I am not sure what exactly is wrong or whether it is just inherently slow.
UPDATE table1 p
SET location = a.location
FROM table2 a
WHERE p.idv = a.idv;
I currently have an index on idv for both of the tables. Is there some way to make this faster?
I encountered the same problem several weeks ago; in the end I used the following strategies to drastically improve the speed. It is probably not the best approach, but it may serve as a reference.
Write a simple function which accepts a range of ids. The function executes the update SQL but only updates that range of ids (see the sketch at the end of this answer).
Also add 'location != a.location' to the WHERE clause. I have heard that this helps keep the table from becoming bloated, which would hurt query performance and require a vacuum to restore it.
I execute the function concurrently using about 30 threads, which intuitively should cut the total time required by roughly a factor of 30. You can use an even higher number of threads if you are ambitious enough.
So it executes something like the following concurrently:
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 1 and 100000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 100001 and 200000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 200001 and 300000;
.....
It also has another advantage: by printing some simple timing statistics in each function call, I can see the update progress and estimate the remaining time.
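For reference, the range-based function described above might look roughly like this as a plain SQL function (a sketch only; the function name and the id column used for slicing are assumptions, while the table and column names come from the question):
-- Sketch: update only the rows whose id falls in the given slice.
-- update_location_range and the id column are illustrative assumptions.
CREATE OR REPLACE FUNCTION update_location_range(start_id bigint, end_id bigint)
RETURNS void
LANGUAGE sql
AS $$
    UPDATE table1 p
    SET    location = a.location
    FROM   table2 a
    WHERE  p.idv = a.idv
      AND  p.location != a.location           -- skip rows that would not change
      AND  p.id BETWEEN start_id AND end_id;  -- only touch this slice of ids
$$;
Each worker connection then runs its own slice, e.g. SELECT update_location_range(1, 100000); in one session and SELECT update_location_range(100001, 200000); in another.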
Creating a new table can be faster than updating the existing data, so you can try the following:
CREATE TABLE new_table AS
SELECT
    a.*,  -- here you can list all the fields you need (omit a.location to avoid a duplicate column)
    COALESCE(b.location, a.location) AS location  -- location updated from table2 where a match exists
FROM table1 a
LEFT JOIN table2 b ON b.idv = a.idv;
After creation you will be able to drop the old table and rename the new one.
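A rough sketch of that final swap, assuming the names above (indexes, constraints, and privileges on the old table would need to be recreated separately):
-- Verify new_table first, then swap it in place of the original table
BEGIN;
DROP TABLE table1;
ALTER TABLE new_table RENAME TO table1;
COMMIT;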

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that there are many rows for each calendar day and no fixed number of rows per day.
Based on this dataset I would need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3, I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old fashioned way and tried with a scalar subquery:
SELECT
    xx.*,
    (
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Here lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time, so I need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on exact speed, but it should take a lot less time than your current query. Hopefully you have indexes on both of these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
I tested it with a lot of data. The above works but is slow. It is much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
    SELECT calday, COUNT(DISTINCT userid) AS daily
    FROM data_table
    GROUP BY calday
) t1
JOIN data_table t2
    ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive intermediate result of all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. This keeps the join much smaller and returns quickly (just under 1 second for 45,000 rows in the source table on my system).
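For the index suggestion above, a minimal sketch (the composite column order is my assumption, not something stated in the answer):
-- One composite index covers the range scan on calday and the DISTINCT count on userid
CREATE INDEX idx_data_table_calday_userid ON data_table (calday, userid);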

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and data with no true root records (i.e. instead of the root record having a NULL Parent_Id, it is parented to itself).
The question link is here: Creating a recursive CTE with no rootrecord
That question has been answered and I now have the data I require; however, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. This looked like the below:
SELECT CASE
           WHEN Parent_Id = Party_Id THEN NULL
           ELSE Parent_Id
       END AS Act_Parent_Id
     , Party_Id
     , PARTY_CODE
     , PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;

WITH linkedParties
AS
(
    SELECT Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
    FROM #Parties
    WHERE Act_Parent_Id IS NULL
    UNION ALL
    SELECT p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
    FROM #Parties p
    INNER JOIN linkedParties t ON p.Act_Parent_Id = t.Party_Id
)
SELECT *
FROM linkedParties
ORDER BY Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(
    SELECT CASE
               WHEN Parent_Id = Party_Id THEN NULL
               ELSE Parent_Id
           END AS Act_Parent_Id
         , Party_Id
         , PARTY_CODE
         , PARTY_NAME
    FROM DIMENSION_PARTIES
    WHERE CURRENT_RECORD = 1
),
linkedParties
AS
(
    SELECT Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
    FROM Parties
    WHERE Act_Parent_Id IS NULL
    UNION ALL
    SELECT p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
    FROM Parties p
    INNER JOIN linkedParties t ON p.Act_Parent_Id = t.Party_Id
)
SELECT *
FROM linkedParties
ORDER BY Level
These two scripts are run on the same server; however, the temp table approach yields the results in approximately 15 seconds.
The multiple CTE approach takes upwards of 5 minutes (so long in fact that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth, I believe it is to do with the record counts. The base table has 200k records in it, and from memory CTE performance degrades severely on large data sets, but I cannot seem to prove that, so I thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer to this, some further research into the general subject turned up a number of other threads with similar problems.
This one seems to cover many of the differences between temp tables and CTEs, so it is the most useful for people looking to read around the issue:
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs causes the issue: a CTE is not materialized anywhere, so it is re-evaluated each time it is referenced later, which has a large impact.
This might not be exactly the same issue you experienced, but a few days ago I came across a similar one, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear we are using SQL Server 2008 R2.
The pattern that I identified, and which seems to throw the SQL Server optimizer off the rails, is using temporary tables in CTEs that are then joined with other temporary tables in the main SELECT statement.
In my case I ended up creating an extra temporary table.
Here is a sample of what I did:
SELECT DISTINCT st.field1, st.field2
INTO #Temp1
FROM SomeTable st
WHERE st.field3 <> 0

SELECT x.field1, x.field2
FROM #Temp1 x
INNER JOIN #Temp2 o ON x.field1 = o.field1
ORDER BY 1, 2
I tried the following query but it was a lot slower, if you can believe it.
WITH temp1 AS (
    SELECT DISTINCT st.field1, st.field2
    FROM SomeTable st
    WHERE st.field3 <> 0
)
SELECT x.field1, x.field2
FROM temp1 x
INNER JOIN #Temp2 o ON x.field1 = o.field1
ORDER BY 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in fairness other database systems have their own quirks.

Querying Missing rows in TSQL

We have a table that is populated from information on multiple computers every day. The problem is sometimes it doesn't pull information from certain computers.
So for a rough example, the table columns would read computer_name, information_pulled, qty_pulled, date_pulled.
So let's say it pulled every day in a week except the 15th. A query will return:
Computer_name   Information_pulled   qty_pulled   date_pulled
--------------------------------------------------------------
computer1       infopulled           2            2014-06-14
computer2       infopulled           3            2014-06-14
computer3       infopulled           2            2014-06-14
computer1       infopulled           2            2014-06-15
computer3       infopulled           1            2014-06-15
computer1       infopulled           3            2014-06-16
computer2       infopulled           2            2014-06-16
computer3       infopulled           4            2014-06-16
As you can see, nothing pulled in for computer 2 on the 15th. I am looking to write a query that pulls up missing rows for a specific date.
For example, after running it, it would return something like:
computer2    null    null    2014-06-15
or anything close to this. We're trying to catch it each morning when this table isn't fully populated so that we can be proactive, and I am not positive I can even query for missing data without searching for null values.
You need to have a master list of all your computers somewhere, so that you know when a computer is not accounted for in your table. Say that you have a table called Computer that holds this.
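A minimal sketch of such a master table (the table and column names follow the answer and the question; the sample rows are just the computers from the example above):
-- Master list of computers that are expected to report every day
CREATE TABLE Computer (
    Computer_name varchar(100) NOT NULL PRIMARY KEY
);
INSERT INTO Computer (Computer_name)
VALUES ('computer1'), ('computer2'), ('computer3');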
Declare a variable to store the date you want to check:
declare @date date
set @date = '6/15/2014'
Then you can query for missing rows like this:
select c.Computer_name, null, null, @date
from Computer c
where not exists(select 1
                 from myTable t
                 where t.Computer_name = c.Computer_name
                   and t.date_pulled = @date)
SQL Fiddle
If you are certain that every computer_name already exists in your table at least once, you could skip creating a separate Computer table, and modify the query like this:
select c.Computer_name, null, null, @date
from (select distinct Computer_name from myTable) c
where not exists(select 1
                 from myTable t
                 where t.Computer_name = c.Computer_name
                   and t.date_pulled = @date)
This query isn't as robust because it will not show computers that do not already have a row in your table (e.g. a new computer, or a problematic computer that has never had its information pulled).
I think a cross-join will answer your problem.
For the query below to work, every computer must have successfully uploaded at least once, and every date must have at least one upload from some computer.
This way you'll get every missing computer/date pair.
select Compare.*
from Table_1 T1
right join (
    select *
    from
        (select Computer_name from Table_1 group by Computer_name) CPUS,
        (select date_pulled from Table_1 group by date_pulled) DAYs
) Compare
    on T1.Computer_name = Compare.Computer_name
    and T1.date_pulled = Compare.date_pulled
where T1.Computer_name is null
Hope this helps.
If you join the table to itself by date and computer_name like the following, you should get a list of missing dates:
SELECT t1.computer_name, null as information_pulled, null as qty_pulled,
DATEADD(day,1,t1.date_pulled) as missing_date
FROM computer_info t1
LEFT JOIN computer_info t2 ON t2.date_pulled = DATEADD(day,1,t1.date_pulled)
AND t2.computer_name = t1.computer_name
WHERE t1.date_pulled >= '2014-06-14'
AND t2.date_pulled IS NULL
This will also get the next date that hasn't been pulled yet, but that should be clear and you could add an additional condition to filter it out.
AND DATEADD(day,1,t1.date_pulled) < '2014-06-17'
Of course, this only works if you know each of the computer names already exists in the table for previous days. If not, @Jerrad's suggestion to create a separate Computer table would help.
EDIT: if the gap is larger than a single day, you may want to see the next pulled date as well:
SELECT t1.computer_name, null as info, null as qty_pulled,
DATEADD(day,1,t1.date_pulled) as missing_date,
t3.date_pulled AS next_pulled_date
FROM computer_info t1
LEFT JOIN computer_info t2 ON t2.date_pulled = DATEADD(day,1,t1.date_pulled)
AND t2.computer_name = t1.computer_name
LEFT JOIN computer_info t3 ON t3.date_pulled > t1.date_pulled
AND t3.computer_name = t1.computer_name
LEFT JOIN computer_info t4 ON t4.date_pulled > t1.date_pulled
AND t4.date_pulled < t3.date_pulled
AND t4.computer_name = t1.computer_name
WHERE t1.date_pulled >= '2014-06-14'
AND t2.date_pulled IS NULL
AND t4.date_pulled IS NULL
AND DATEADD(day,1,t1.date_pulled) < '2014-06-17'
The 't3' join brings in all pulled dates after the gap, and the 't4' join, together with t4.date_pulled IS NULL, excludes all but the lowest of those dates.
You could do this with subqueries as well, but excluding joins have served me well in the past.

SQL query problem when upgrading from SQL Server 2000 to SQL Server 2008 R2

I am currently upgrading a database server from SQL Server 2000 to SQL Server 2008 R2. One of my queries used to take under a second to run and now takes in excess of 3 minutes (running on a faster machine).
I think I have located where it is going wrong but not why it is going wrong. Could somebody explain what the problem is and how I might resolve it?
The abridged code is as follows:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
...
FROM
Registrar reg
JOIN EnabledType et ON et.enabledTypeCode = reg.enabled
LEFT JOIN [Transaction] txn ON txn.registrarId = reg.registrarId
WHERE
txn.transactionid IS NULL OR
txn.transactionid IN
(
SELECT MAX(transactionid)
FROM [Transaction]
GROUP BY registrarid
)
I believe the issue is located on the "txn.transactionid IS NULL OR" line. If I remove this condition it runs as fast as it used to (less than a second) and returns all the records minus the 3 rows that that statement would have included. If I remove the second part of the OR statement it returns the 3 rows that I would expect in less than a second.
Could anybody point me in the right direction as to why this is happening and when this change occurred?
Many thanks in advance
Jonathan
I have accepted Alex's solution and included the new version of the code below. It seems we have found one of the 0.1% of queries that the new query optimiser runs slower.
WITH txn AS (
SELECT registrarId, balance , ROW_NUMBER() OVER (PARTITION BY registrarid ORDER BY transactionid DESC) AS RowNum
FROM [Transaction]
)
SELECT
reg.registrarId,
reg.ianaId,
reg.registrarName,
reg.clientId,
reg.enabled,
ISNULL(txn.balance, 0.00) AS [balance],
reg.alertBalance,
reg.disableBalance,
et.enabledTypeName
FROM
Registrar reg
JOIN EnabledType et
ON et.enabledTypeCode = reg.enabled
LEFT JOIN txn
ON txn.registrarId = reg.registrarId
WHERE
ISNULL(txn.RowNum,1)=1
ORDER BY
registrarName ASC
Try restructuring the query using a CTE and ROW_NUMBER...
WITH txn AS (
SELECT registrarId, transactionid, ...
, ROW_NUMBER() OVER (PARTITION BY registrarid ORDER BY transactionid DESC) AS RowNum
FROM [Transaction]
)
SELECT
...
FROM
Registrar reg
JOIN EnabledType et ON et.enabledTypeCode = reg.enabled
LEFT JOIN txn ON txn.registrarId = reg.registrarId
AND txn.RowNum=1