my table contains 1 billion records. It is also partitioned by month.Id and datetime is the primary key for the table. When I select
select col1,col2,..col8
from mytable t
inner join cte on t.Id=cte.id and dtime>'2020-01-01' and dtime<'2020-10-01'
It uses index scan, but takes more than 5 minutes to select.
Please suggest me.
Note: I have set work_mem to 1GB. cte table results comes with in 3 seconds.
Well it's the nature of join and it is usually known as a time consuming operation.
First of all, I recommend to use in rather than join. Of course they have got different meanings, but in some cases technically you can use them interchangeably. Check this question out.
Secondly, according to the relation algebra whenever you use join each rows of mytable table is combined with each rows from the second table, and DBMS needs to make a huge temporary table, and finally igonre unsuitable rows. Undoubtedly all the steps and the result would take much time. Before using the Join opeation, it's better to filter your tables (for example mytable based date) and make them smaller, and then use the join operations.
I am new to using CTEs but I work with a humongous database and think they would cause less stress to the system that subqueries. I'm not sure if what I want to do is possible.
I have 2 CTEs with different columns from different tables but each CTE has the same sample_num (same data type of int) in them that could be used to join them if possible. I use the first CTE to limit the data for samples. I want the second CTE to look into the first and if the sample numbers match, include that sample number data in the second CTE. The reason I have the second CTE is because I use it's data to create a pivot table.
Ultimately what I want to do in my outer query is to use the fields from the first CTE and add the pivot table columns from the second CTE to the left. Basically marry the two CTEs side by side in the final outer query.
Is this possible or am I making this a lot harder than it needs to be. Remember, I work on a huge database with thousands of users.
I am running two sql queries say,
select obname from table1 where obid = 12
select modname from table2 where modid = 12
Both are taking very less time, say 300 ms each.
But when I am running:
select obname, modname
from (select obname from table1 where obid = 12) as alias1,
(select modname from table2 where modid = 12) as alias2
It is taking 3500ms. Why is it so?
In general, putting two scalar queries in the from clause is not going to affect performance. In fact, from an application perspective, one query may be faster because there is less overhead going back and forth to the database. A scalar query returns one column and one row.
However, if the queries are returning multiple rows, then your little comma is doing a massive Cartesian product (which is why I always use CROSS JOIN rather than a comma in a FROM clause). In that case, all bets are off, because the data has to be processed after the results start returning.
This is in a stored procedure..This if statement, then I do a little work. The #AsOfDate is a passed in variable of date datatype. The question I have is Why do I get better performance by removing the inner-most exists, but ONLY when the entire statement is in an IF EXISTS?
The two tables:
dbo.TXXX_InventoryDetail -- 1.3 billion records..stats up to date
dbo.TXXX_InventoryFull -- 9.8 million records..stats up to date
Statement:
if exists (select 1
from dbo.TXXX_InventoryDetail o
where exists (select 1
from dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=#AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut))
and o.C001_AsOfDate=#AsOfDate)
io output:
Table 'TXXX_InventoryFull'. Scan count 9240262, logical reads 29548864
Table 'T001_InventoryDetail'. Scan count 1, logical reads 17259
If I remove the second where exists and do a join:
if exists (select 1
from dbo.TXXX_InventoryDetail o,
dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=#AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut)
and o.C001_AsOfDate=#AsOfDate)
io output:
Table 'TXXX_InventoryDetail'. Scan count 0, logical reads 333952
Table 'TXXX_InventoryFull'. Scan count 1, logical reads 630
Now..the reason I think it is the if exists is that if I remove it and do a select count(*) like this:
select COUNT(*)
from dbo.T001_InventoryDetail o
where exists (select 1
from dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=#AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut))
and o.C001_AsOfDate=#AsOfDate
TXXX_InventoryFull'. Scan count 41, logical reads 692
T001_InventoryDetail'. Scan count 65, logical reads 17477
Worktable'. Scan count 0, logical reads 0
It is generally said that one should avoid doing coordinated subqueries in the predicate, as these tend to force nested loop joins. When querying large datasets, especially where one's trying to discover a difference between the sets, it's important to allow the query optimizer to choose dynamically between hash, merge and nested loop algorhithms, which may not be possible if the query is structured using a coordinated subquery. Better to create these as derived tables in the FROM clause.
I have found similar issues using the EXISTS statement on a SQL 08 R2 server, where the exact same statement runs fine on SQL 08 and SQL 05.
I found that changing something like
WHILE EXISTS(SELECT * FROM X)
Would be super slow, but:
WHILE ISNULL((SELECT TOP 1 ID FROM X), 0) <> 0
Runs perfectly fast again.
To me, it seems like an R2 issue...
I would guess that the plan you get is quite different when you use the join. Perhaps the imbalance in the number of rows (very large outer table, smaller inner table) is giving the optimizer fits, but it can probably eliminate rows much easier with the join (you'll probably see additional loop operators with the worse query). Tough to really guess without seeing the plans or being able to reproduce, but you should always aim at eliminating the most rows as early in the plan as possible. Pulling back millions of rows through several operators / subqueries only to eliminate most of them later in the plan is almost certainly going to yield worse performance.
SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, blJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When i try to execute the query it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way is to truncate (substring())the "tblJobAnnouncement.JobClassDesc" in the query which has column size of around 8000.
Do we have any work around so that i need not truncate the values. Or Can this query be optimised? Any setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan "fist" produces the list of qualifying rows (records) (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
An unrelated hint, is to use table aliases, using a short string to avoid repeating these long table names. For example (only did a few tables, but you get the idea...)
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0,[dbo.TableName])
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000