optimize sql "With ismember Select From query" - tsql

I have a query which run extremely slow when "checking is_member" in comparison to just loading the whole dataset. This view acts as a security check, it checks if you are a member of a particular group - ie group 1, then the next column will state what access it has - ie division 2.
This view then is joined with the Fact table, so that it will only retrieve division 2 rows.
The question is, does the is_member execute for each line of Fact data? Just my theory because it runs 1000 times faster without this view. And if anyone can suggest an alternative structure?
WITH group_security AS (SELECT DISTINCT division_cod FROM dbo.dim_group_security_division AS gsd
WHERE (IS_MEMBER(group_name) = 1))
SELECT dbo.dim_division.dim_division_key, dbo.dim_division.division_ID, dbo.dim_division.division_code, dbo.dim_division.division_name
FROM dbo.dim_division INNER JOIN
group_security ON dbo.dim_division.division_code = group_security.division_code OR group_security.division_code = 'ALL'

Since you JOIN on dbo.dim_division.division_code, do you have an index on this column?
Alternatively you could give this a try:
SELECT dim.dim_division_key,
dim.division_ID,
dim.division_code,
dim.division_name
FROM dim_division dim
WHERE EXISTS ( SELECT *
FROM dbo.dim_group_security_division gsd
WHERE gsd.division_code IN ('ALL', dim.division_code)
AND IS_MEMBER(gsd.group_name) = 1 )
This way the system can stop at the first 'match' in dim_group_security_division instead of having to find all and then aggregate the result because of the DISTINCT.
In this case, it might also be useful to have an index on gsd.division_code to speed things up a bit.

Related

postgres chooses an aweful query plan , how can that be fixed

I'm trying to optimize this query :
EXPLAIN ANALYZE
select
dtt.matching_protein_seq_ids
from detected_transcript_translation dtt
join peptide_spectrum_match psm
on psm.detected_transcript_translation_id =
dtt.detected_transcript_translation_id
join peptide_spectrum_match_sequence psms
on psm.peptide_spectrum_match_sequence_id =
psms.peptide_spectrum_match_sequence_id
WHERE
dtt.matching_protein_seq_ids && ARRAY[654819, 294711]
;
When seq_scan are allowed (set enable_seqscan = on), the optimizer chooses a pretty awful plan that runs in 49.85 seconds :
https://explain.depesz.com/s/WKbew
With set enable_seqscan = off, the plan chosen uses proper indexes and the query runs instantely.
https://explain.depesz.com/s/ISHV
note that I did run a ANALYZE on all three tables...
Your problem is that PostgreSQL cannot estimate the WHERE condition well, so it estimates it as a certain percentage of the estimated total rows, which is way too much.
If you know that there will always few result rows for a query like this, you could cheat by defining a function
CREATE OR REPLACE FUNCTION matching_transcript_translations(integer[])
RETURNS SETOF detected_transcript_translation
LANGUAGE SQL
STABLE STRICT
ROWS 2 /* pretend there are always exactly two matching rows */
AS
'SELECT * FROM detected_transcript_translation
WHERE matching_protein_seq_ids && $1';
You could use that like
select
dtt.matching_protein_seq_ids
from matching_transcript_translations(ARRAY[654819, 294711]) dtt
join peptide_spectrum_match psm
on psm.detected_transcript_translation_id =
dtt.detected_transcript_translation_id
join peptide_spectrum_match_sequence psms
on psm.peptide_spectrum_match_sequence_id =
psms.peptide_spectrum_match_sequence_id;
Then PostgreSQL should be cheated into thinking that there will be exactly one matching row.
However, if there are a lot of matching rows, the resulting plan will be even worse than your current plan is…

Inner join versus doing a where in clause

Say I have about 10-30 rows returned from my query:
select * from users where location=10;
Is there any difference between the following:
select *
from users u
inner join location l on u.location = l.id
where location=10;
versus:
select * from users where location=10; # say this returns 1,2,3,4,5
select * from location where id IN (1,2,3,4,5)
Basically I want to know if there are any performance differences between doing an inner join versus doing a WHERE IN clause.
Is there a difference between issuing one query and issuing two queries? Well, I certainly hope so. The SQL engine is doing work, and it does twice as much work (from a certain perspective) for two queries.
In general, parsing a single query is going to be faster than parsing one query, returning an intermediate result set, and then feeding it back to another query. There is overhead in query compilation and in passing data back and forth.
For this query:
select *
from users u inner join
location l
on u.location = l.id
where u.location = 10;
You want an index on users(location) and location(id).
I do want to point something else out. The queries are not equivalent. The real comparison query is:
select l.*
from location l
where l.id = 10;
You are using the same column for the where and the on. Hence, this would be the most efficient version and you want an index on location(id).
One way to compare the performance of different queries is to use postgresql's EXPLAIN command, for instance:
EXPLAIN select *
from users u inner join
location l
on u.location = l.id
where u.location = 10;
Which will tell you how the database will get the data. Watch for things like sequential scans, which indicate you could benefit from an index. It also gives estimates on the cost of each operation and how many rows it might be operating on. Some operations can yield a lot more rows than you would expect, which the database then reduces to the set it returns to you.
You can also use EXPLAIN ANALYZE [query], which will actually run the query and give you timing information.
See the postgresql documentation for more information.

SQL NESTED Query Optimization

I am running two sql queries say,
select obname from table1 where obid = 12
select modname from table2 where modid = 12
Both are taking very less time, say 300 ms each.
But when I am running:
select obname, modname
from (select obname from table1 where obid = 12) as alias1,
(select modname from table2 where modid = 12) as alias2
It is taking 3500ms. Why is it so?
In general, putting two scalar queries in the from clause is not going to affect performance. In fact, from an application perspective, one query may be faster because there is less overhead going back and forth to the database. A scalar query returns one column and one row.
However, if the queries are returning multiple rows, then your little comma is doing a massive Cartesian product (which is why I always use CROSS JOIN rather than a comma in a FROM clause). In that case, all bets are off, because the data has to be processed after the results start returning.

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, blJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When i try to execute the query it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way is to truncate (substring())the "tblJobAnnouncement.JobClassDesc" in the query which has column size of around 8000.
Do we have any work around so that i need not truncate the values. Or Can this query be optimised? Any setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan "fist" produces the list of qualifying rows (records) (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
An unrelated hint, is to use table aliases, using a short string to avoid repeating these long table names. For example (only did a few tables, but you get the idea...)
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0,[dbo.TableName])
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000