Using a UNION or UNION ALL on two SELECT statements makes them incredibly slow - tsql

I have two queries, let's call them Query A and Query B.
Both of these queries run in under a second for the scenario I'm testing; Query A returns 1 result and Query B returns 0 results.
If I UNION (or UNION ALL) these two queries, it takes over a minute to return the (expected) 1 result.
Both queries select the same columns from the same tables. I could potentially rewrite this entire thing without a UNION by using a highly conditional WHERE clause, but I was trying to get away from doing that.
Any ideas? I'm not sure how much of the exact query and schema I can get away with sharing, but I'm happy to provide what I can.
This is on MSSQL 2008 if it matters to anyone's response.

I would try looking at the execution plans within Management Studio for the individual queries, and then compare that to the execution plan for the query containing the UNION.
If there's that drastic of a difference in the execution times, I would imagine that there's something wrong with the execution plan for the UNION'd query. Identifying what's different will help point you (and maybe us) in the right direction on what the underlying problem is.
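For example (a minimal sketch - the SELECTs below are placeholders for your actual Query A and Query B), you can collect timing and I/O numbers alongside the actual plans:

-- In Management Studio, turn on "Include Actual Execution Plan" (Ctrl+M) first
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT ...;  -- Query A on its own
SELECT ...;  -- Query B on its own

SELECT ...   -- Query A
UNION ALL
SELECT ...;  -- Query B, combined with A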

Separate clauses in a UNION that are very similar and hit the same tables can be merged into one query by the optimiser. You can see this by the lack of a UNION operator in the query plan. I've seen similar things before, but rarely.
What you can do is a SELECT ... INTO #temp ... for the first query, followed by an INSERT INTO #temp ... for the second, as sketched below.
Now, where did I read this...
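Something like this (a rough sketch - the column list and the two queries are placeholders for yours):

-- Materialize the first query's results into a temp table...
SELECT col1, col2
INTO #results
FROM ...;  -- Query A's FROM/WHERE goes here

-- ...then append the second query's results
INSERT INTO #results (col1, col2)
SELECT col1, col2
FROM ...;  -- Query B's FROM/WHERE goes here

-- Use SELECT DISTINCT here if you needed UNION rather than UNION ALL semantics
SELECT col1, col2 FROM #results;
DROP TABLE #results;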

Are they both doing table scans? It sounds like the combined query might be exceeding cache capacity and spilling to disk.
Even if they are reading from the same table, the records would probably lock independently.

Related

How to: Change actual execution method from "row" to "batch" - Azure SQL Server

I am having some major issues. When inserting data into my database, I am using an INSTEAD OF INSERT trigger which performs a query.
On my TEST database, this query takes much less than 1 second for insert of a single row. In production however, this query takes MUCH longer (> 30 seconds for 1 row).
When comparing the Execution plans for both of them, there seems to be some CLEAR differences:
Test has: "Actual Execution Method: Batch"
Prod has: "Actual Execution Method: Row"
Test has: "Actual number of rows: 1"
Prod has: "Actual number of rows 92.000.000"
Less than a week ago, production was performing similarly to test. But not anymore - sadly.
Can any of you help me figure out why?
I believe, if I can just get the same execution plan for both, it should be no problem.
Sometimes the query hint OPTION (HASH JOIN) helps force a query plan that uses batch processing mode. The following query, which uses the AdventureWorks2012 sample database, demonstrates this.
SELECT s.OrderDate, s.ShipDate,
       SUM(d.OrderQty), AVG(d.UnitPrice), AVG(d.UnitPriceDiscount)
FROM Demo d
JOIN Sales.SalesOrderHeader s
    ON d.SalesOrderID = s.SalesOrderID
WHERE d.OrderQty > 500
GROUP BY s.OrderDate, s.ShipDate;
The above query runs in row mode. With the query hint added, it uses batch mode:
SELECT s.OrderDate, s.ShipDate,
       SUM(d.OrderQty), AVG(d.UnitPrice), AVG(d.UnitPriceDiscount)
FROM Demo d
JOIN Sales.SalesOrderHeader s
    ON d.SalesOrderID = s.SalesOrderID
WHERE d.OrderQty > 500
GROUP BY s.OrderDate, s.ShipDate
OPTION (HASH JOIN);
You don't get to force row vs. batch processing directly in SQL Server. It is a cost-based decision in the optimizer. You can (as you have noticed) force a previously generated plan that uses batch mode. However, there is deliberately no "only use batch mode" option, because batch mode is not always the fastest. Batch mode execution is like a turbo on a car engine - it works best when you are working with larger sets of rows, and it can be slower on small-cardinality OLTP queries.
If you have a case where you sometimes process 1 row and sometimes 92M rows, then the bigger problem is that high variance in the number of rows processed by the query. That can make it very hard to find a plan that is optimal for all scenarios, whether the variance comes from parameter sensitivity or from the shape of the query plan itself. Ultimately, the solutions for this kind of problem are either to use OPTION (RECOMPILE), if the cost of compiling is far less than the cost of running a bad plan, or (as you have done) to find a specific plan in the Query Store that works well enough for all cases and force it.
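As a sketch, the recompile route looks like this (the table, column, and parameter names are made up for illustration):

-- Compile a fresh plan on every execution so it fits the current parameter
-- value, at the cost of a compile each time the query runs
SELECT o.OrderID, o.OrderTotal
FROM dbo.Orders AS o
WHERE o.CustomerID = @CustomerID
OPTION (RECOMPILE);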
Hope that helps explain what is happening under the hood.
I have found a somewhat satisfying solution to my problem.
By going into the Query Store of the database, using Microsoft SQL Server Management Studio, I was able to force a specific plan for a specific query - but only if that plan had already been generated for the query.
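For reference, the same forcing can be scripted instead of clicked through in Management Studio (the IDs below are placeholders - look up your own in the Query Store catalog views):

-- Find the query and its available plans
SELECT q.query_id, p.plan_id, p.query_plan
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id;

-- Pin the plan that uses batch mode
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;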

Old vs New Style Joins

SQL gets processed in this order:
FROM,
WHERE,
GROUP BY,
HAVING,
SELECT,
ORDER BY
With the new style of join syntax (explicitly using the word JOIN), why isn't it faster than the old style (listing tables and then using a WHERE clause)?
FROM gets processed before WHERE, so why wouldn't the newer style of join be faster?
The way that I imagine it is like this:
If you use the old style syntax, you are looking at entire tables and then filtering out the results.
If you use the new style syntax, you are filtering out your results first before moving to a 2nd step.
Am I missing something?
When you send a query to PostgreSQL, it doesn't always do scanning, filtering, etc. in the same order. It examines the query, the tables involved, and any constraints or indexes that might be involved, and comes up with an execution plan. If you want to see the execution plan for a query, you can use EXPLAIN, which will invoke the planner without actually executing the query. Here's some documentation for EXPLAIN.
You tagged your question for postgresql, but other RDBMSes have similar facilities for examining the query plan.
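To see this for yourself, compare the plans for the two syntaxes (t1 and t2 are placeholder tables):

-- Old style: tables listed, join condition in the WHERE clause
EXPLAIN SELECT * FROM t1, t2 WHERE t1.id = t2.t1_id;

-- New style: explicit JOIN ... ON
EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.t1_id;

Both produce the identical plan, because the planner normalizes the two forms before choosing an execution strategy.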

PostgreSQL Results Same Explanation on Different Queries

I have some complex queries that produce the same result. The only difference is the execution order. For example, one query performs selection before the join, while the other performs the join first, then the selection. However, when I read the explanation (on the explain tab, using PgAdmin III), both queries have the same diagram.
Why?
I'm not a pro at explaining this with all the correct terminology, but essentially the planner attempts to find the most efficient way to execute the statement. It does this by breaking statements down into simpler sub-statements; just because you write it one way doesn't mean that is the order in which the planner will execute it. Kind of like precedence in arithmetic (brackets, multiply, divide, etc.).
Certain operations will influence the statement's order of execution, enabling you to "tune" your queries to make them more efficient: http://www.postgresql.org/docs/current/interactive/performance-tips.html
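A quick way to convince yourself (orders and customers are placeholder tables): spell out the "selection first" version as a subquery and compare its plan against the "join first" version.

-- Selection before the join, written explicitly as a subquery...
EXPLAIN SELECT *
FROM (SELECT * FROM orders WHERE total > 100) o
JOIN customers c ON c.id = o.customer_id;

-- ...versus the join with the filter applied afterwards
EXPLAIN SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.total > 100;

The planner pushes the predicate down in both cases, so the two plans come out the same - exactly what you are seeing in PgAdmin.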

IBM DB2 select query for millions of rows

I am new to DB2. I want to select around 2 million rows with a single query, like this: it should select and display the first 5,000 rows, then in a background process select the next 5,000, and keep going until the end of all the data. Help me out with how to write this query or which function to use.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
You can use one of the newer features, which brings Oracle/MySQL-style paging to DB2: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by indicating OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to specify the "FOR READ ONLY" clause in the query; this will increase concurrency, and the cursor will not be updatable. Also, choose a good isolation level; for this case you could probably use "uncommitted read" (WITH UR). Issuing a LOCK TABLE beforehand could also be good.
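A paging sketch combining those clauses (the schema, table, and column names are placeholders; OFFSET needs a reasonably recent DB2 LUW, or the compatibility feature described in the link above):

-- Fetch one 5000-row page per round trip
SELECT id, col1, col2
FROM myschema.mytable
ORDER BY id
OFFSET 5000 ROWS            -- skip the pages already displayed
FETCH FIRST 5000 ROWS ONLY  -- one page at a time
FOR READ ONLY               -- read-only cursor, better concurrency
OPTIMIZE FOR 5000 ROWS      -- tell the optimizer the fetch size
WITH UR;                    -- uncommitted read isolation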
Do not forget the common practices: an index (or clustering index), retrieving only the necessary columns, etc., and always analyze the access plan via the Explain facility.

"Order by" degraded performance in sql

Hi, I have a problem while executing SQL in PostgreSQL.
I have a similar query like this:
SELECT A, B, lower(C) FROM myTable ORDER BY A, B;
Without the ORDER BY clause, I get the result in 11 ms, but with ORDER BY it takes more than 4 minutes to retrieve the same results.
These columns contain lots of data (1,000,000 rows or more) and have lots of duplicates.
Can anyone suggest a solution?
Thank you
but with ORDER BY it takes more than 4 minutes to retrieve the same results.
udo already explained how indexes can be used to speed up sorting, this is probably the way you want to go.
But another solution (probably) is increasing the work_mem variable. This is almost always beneficial, unless you have many queries running at the same time.
When sorting large result sets, which don't fit in your work_mem setting, PostgreSQL resorts to a slow disk-based sort. If you allow it to use more memory, it will do fast in-memory sorts instead.
NB! Whenever you ask questions about PostgreSQL performance, you should post the EXPLAIN ANALYZE output for your query, and also the version of Postgres.
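As a sketch, using the query from the question (the 256MB figure is just a number to experiment with, not a recommendation):

-- If the plan shows "Sort Method: external merge  Disk: ...", the sort spilled to disk
EXPLAIN ANALYZE SELECT A, B, lower(C) FROM myTable ORDER BY A, B;

-- Give this session a bigger sort budget and compare
SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT A, B, lower(C) FROM myTable ORDER BY A, B;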
Have you tried putting an index on A,B?
That should speed things up.
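Something like this (the index name is arbitrary):

-- A composite index matching the ORDER BY lets Postgres read the rows already sorted
CREATE INDEX mytable_a_b_idx ON myTable (A, B);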
Did you try using DISTINCT to eliminate the duplicates? That should be more efficient than an ORDER BY statement.