Why does this Oracle 10g SQL run slow only when I query a subquery with a where clause? - oracle10g

I can't paste in the entire SQL for various reasons, so consider this example:
select *
from
(select nvl(get_quantity(1), 10) available_qty
from dual)
where available_qty > 30;
get_quantity is a function that makes a calculation based on the ID of a record that's passed through it. If it returns null, I use nvl() to force it to 10.
The query runs very slow when I use the WHERE clause in the parent query. When I comment out the WHERE clause, however, it runs very fast. What I don't get is why it can display the data very fast, but it can't query it just as fast. I am querying the results of a subquery, too. I was under the impression that subqueries return a "rendered" dataset. It's almost as if querying the available_qty identifier is causing it to reference something within the subquery.
This is why I don't think the contents of the get_quantity function are relevant here, so I didn't bother posting it. Instead, I think it's a misunderstanding on my part of how Oracle handles subqueries and whatnot.
Do any of you Oracle gurus have any idea what I am doing wrong?
Afterthought: as I was entering tags for this question, the tag "correlated subquery" came up. In doing some quick research, it seems that this type of subquery somewhat depends on the outer query. Could this be related to my problem?

Let's try an experiment. First we'll run the following query:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
(select dbms_random.value() rnd from dual) b;
The "a" subquery will return 5 rows with values from 1 to 5. The "b" subquery will return one row with a random value. If the function is run before the two tables are join (by Cartesian), the same random value will be returned for each row. The actual results:
LVL RND
---------- ----------
1 .417932089
2 .963531718
3 .617016889
4 .128395638
5 .069405568
5 rows selected.
Clearly the function was run for each of the joined rows, not for the subquery before the join. This is a result of Oracle's optimizer deciding that the best path for the query is to do things in that order. To prevent this, we have to add something to the second subquery that will make Oracle run the subquery in it's entirety before performing the join. We'll add rownum to the subquery, since Oracle knows rownum will change if it's run after the join. The following query demonstrates this:
select lvl, rnd from (
select level as lvl from dual connect by level <= 5) a,
(select dbms_random.value() rnd, rownum from dual) b;
As you can see from the results, the function was only run once in this case:
LVL RND
---------- ----------
1 .028513902
2 .028513902
3 .028513902
4 .028513902
5 .028513902
5 rows selected.
In your case, it seems likely that the filter provided by the where clause is making the optimizer take a different path, where it's running the function repeatedly, rather than once. By making Oracle run the subquery as written, you should get more consistent run-times.

Related

When is it better to use CTE or temp table postgres

I am doing a query on a very large data set and i am using WITH (CTE) syntax.. this seems to take a while and i was reading online that temp tables could be faster to use in these cases can someone advise me in which direction to go. In the CTE we join to a lot of tables then we filter on the CTE result..
Only interesting in postgres answers
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences (outer query restrictions are not passed on to CTEs) and the database evaluates the query inside the CTE and caches the results (i.e., materialized results) and outer WHERE clauses are applied later when the outer query is processed, which means either a full table scan or a full index seek is performed and results in horrible performance for large tables. To avoid this, apply as much filters in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints to enable us to control if the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
My follow up comment is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times where I cannot explain why that the query runs slower than I would expect. Some times, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
Temp tables may persist across sessions, but to my knowledge the data in them will be session-based. I'm honestly not even sure if the structures persist, which is why to be safe I always drop:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.

pg 9.4 vs 10 differences on order by random()

Following query is not randomizing array in postgres 10. Is this expected behaviour?
select array(select generate_series(1,10) order by random());
v9.4.15
array
------------------------
{7,1,10,6,2,8,9,4,5,3}
v10.4
array
------------------------
{1,2,3,4,5,6,7,8,9,10}
This is a consequence of commit 69f4b9c85f168ae006929eec44fc44d569e846b9 that changes how set-returning functions in the SELECT list are handled.
Tim's answer and your comment show how to deal with the problem.
I think the issue here is the newer version of Postgres has an optimizer which is getting smarter, and is caching away the value of random() after a single call to that function.
One workaround is to force a new random value to be calculated for each record. We can add a dummy WHERE clause to force this:
WITH cte AS (
select generate_series(1,10) AS col
)
SELECT col
FROM cte
WHERE col IS NOT NULL
ORDER BY random();
Demo
You may observe in the demo that the order is in fact random. However, in the same demo if you run your orignal query the order won't be random.
Edit:
The reason why this tricks works is that the WHERE clause convinces the optimizer that you really care about the values being used in each record. Therefore, it calls the function in ORDER BY once for each record rather than caching it.

Does inner join effect order by?

I have a function a() which gives result in a specific order.
I want to do:
select final.*,tablex.name
from a() as final
inner join tablex on (a.key=tablex.key2)
My question is, can I guarantee that the join won't effect the order of rows as a() set it?
a() is:
select ....
from....
joins...
order by x,y,z
The short version:
The order of rows returned by a SQL query is not guaranteed in any way unless you use an order by
Any order you see without an order by is pure coincidence and can not be relied upon.
So how did I always get the correct order so far? when I did Select * from a()
If your function is a SQL function, then the query inside the function is executed "as is" (it's essentially "inlined") so you only run a single query that does have an order by. If it's a PL/pgSQL function and the only thing it does is a RETURN QUERY ... then you again only have a single query that is executed which does have an order by.
Assuming you do use a SQL function, then running:
select final.*,tablex.name
from a() as final
join tablex on a.key=tablex.key2
is equivalent to:
select final.*,tablex.name
from (
-- this is your query inside the function
select ...
from ...
join ...
order by x,y,z
) as final
join tablex on a.key=tablex.key2;
In this case the order by inside the derived table doesn't make sense as it might be "overruled" by an overall order by statement. In fact some databases would outright reject this query (and I sometime wish Postgres would do as well).
Without an order by on the **overall* query, the database is free to choose any order of rows that it wants.
So to get back to the initial question:
can I guarantee that the join won't effect the order of rows as a() set it?
The answer to that is a clear: NO - the order of the rows for that query is in no way guaranteed. If you need an order that you can rely on, you have to specify an order by.
I would even go so far to remove the order by from the function - what if someone runs: select * from a() order by z,y,x - I don't think Postgres will be smart enough to remove the order by inside the function.

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id

CASE statement in FROM clause

Can you use CASE in the FROM clause of a SELECT statement to determine from which table to retrieve data?
My database has multiple versions of a table. The value of an input parameter in a procedure will tell the procedure whether to retrieve data from version 1, 2 or 3. The syntax I am trying to use is similar to:
SELECT * FROM (CASE input_parameter WHEN 1 THEN version1 WHEN 2 THEN version 2 WHEN 3 THEN version3 END) WHERE ...
Can this be done? If so, am I using the correct syntax?
It can't be done in the SQL statement itself. You'll need to construct the SQL statement dynamically in order to achieve this kind of result.
You can't dynamically select the table like that in straight-up SQL. You would need a stored procedure to do exactly what you are wanting. There are some workarounds though.
You could do something janky in your FROM clause like:
SELECT *
FROM
(SELECT null as "whatever") as fakeTable
LEFT OUTER JOIN version1 on input_parameter = 1
LEFT OUTER JOIN version2 on input_parameter = 2
LEFT OUTER JOIN version3 on input_parameter = 3
This will work since the input_parameter can only be one value at a time. Should you decide you want both version1 and version2 joined if the input_parameter is 2 then you will end up with a cross join and may god have mercy on your soul.
You could do something janky with a UNION:
SELECT * FROM version1 WHERE input_paramter=1
UNION ALL
SELECT * FROM version2 WHERE input_paramter=2
UNION ALL
SELECT * FROM version3 WHERE input_paramter=3
This is a bit nicer since a screw up will only bring back 2 or 3 times as many results instead of the screw up in example 1 where you get n^2 or n^3 results.
I'm not sure which one is going to cause more trouble from a CPU-I/O standpoint, but I would guess that they are pretty close from an execution path standpoint, and if the data is small, it probably won't matter anyway.