Redshift - Parameters in IN clause making query recompilation

Redshift - Parameters in IN clause making query recompilation - amazon-redshift

I have a filter with IN clause that would get CustomerIDs from user selection in my query. I am benchmarking the query execution and it seems by changing the no. of parameters in the IN clause in redshift causes the query to be recompiled. The subsequent query run after parameter change in IN clause takes more time than previous run.
select * -- some aggregations here
from FactCustomer c
inner join FactSales s on c.CustomerID = s.CustomerId
inner join DimDate d on d.DateID = s.DateID
inner join DimTime t on t.TimeID = d.timeID
where d.SalesDate between '01-01-2018' and '12-31-2018'
and c.CustomerID in ( 1,3,7,9, 11, 13, 15, 17, 19, 21, 23, 25, 27 )
-- there are over 200 customers and the user can select as much as they want in the filter
-- the customerIDs comes from the user dyamically thru the dashboard
-- if the no. of parameters in customerID change to less or more the subsequent queries take time
My question
What would be an alternative in Redshift so that I can get customerIds from the user that can be used in my query without using the IN clause?
P.S. Using Temp tables is not an option with my case.
Any insights appreciated.

Related

Postgres SQL query group by get most recent record instead of an aggregate

This is a current postgres query I have:
sql = """
SELECT
vms.campaign_id,
avg(vms.open_rate_uplift) as open_rate_average,
avg(vms.click_rate_uplift) as click_rate_average,
avg(vms.conversion_rate_uplift) as conversion_rate_average,
avg(cms.incremental_opens),
avg(cms.incremental_clicks),
avg(cms.incremental_conversions)
FROM
experiments.variant_metric_snapshot vms
INNER JOIN experiments.campaign_metric_snapshot cms ON vms.campaign_id = cms.campaign_id
WHERE
vms.campaign_id IN %(campaign_ids)s
GROUP BY
vms.campaign_id
"""
whereby I get the average incremental_opens, incremental_clicks, and incremental_conversions per campaign group from the cms table. However, instead of the average, I want the most recent values for the 3 fields. See the cms table screenshot below - I want the values from the record with the greatest (i.e. most recent) event_id (instead of an average for all records) for a given group).
How can I do this? Thanks

It sounds like you want a lateral join.
FROM
experiments.variant_metric_snapshot vms
CROSS JOIN LATERAL (select * from experiments.campaign_metric_snapshot cms where vms.campaign_id = cms.campaign_id order by event_id desc LIMIT 1) cms
WHERE...

If you are after a quick and dirty solution you can use array_agg function with minimal change to your query.
SELECT
vms.campaign_id,
avg(vms.open_rate_uplift) as open_rate_average,
avg(vms.click_rate_uplift) as click_rate_average,
avg(vms.conversion_rate_uplift) as conversion_rate_average,
(array_agg(cms.incremental_opens ORDER BY cms.event_id DESC))[1] AS incremental_opens,
..
FROM
experiments.variant_metric_snapshot vms
INNER JOIN experiments.campaign_metric_snapshot cms ON vms.campaign_id = cms.campaign_id
WHERE
vms.campaign_id IN %(campaign_ids)s
GROUP BY
vms.campaign_id;

Difference between subquery and correlated subquery for given code

Example for correlated subquery given in a book is as follows;
Customers who placed orders on February 12, 2007
SELECT custid, companyname
FROM Sales.Customers AS C
WHERE EXISTS
(SELECT *
FROM Sales.Orders AS O
WHERE O.custid = C.custid
AND O.orderdate = '20070212');
But, I wrote following code for the same purpose using simple subquery
SELECT custid, companyname
FROM Sales.Customers
WHERE custid IN
(SELECT [custid] FROM [Sales].[Orders]
WHERE [orderdate] ='20070212')
Both gives identical output. Which method is better? and why? and I do not understand the use of EXISTS here in the first set of codes

I tried similar queries on my own data on SQL Server 2016 SP!:
select
*
from EXT.dbo_CUSTTABLE
where ACCOUNTNUM in
(select CUSTACCOUNT from EXT.dbo_SALESLINE b
where b.CREATEDDATETIME between '20170101 00:00' and '20170102 23:59');
select
*
from EXT.dbo_CUSTTABLE a
where exists
(select * from EXT.dbo_SALESLINE b
where a.ACCOUNTNUM=b.CUSTACCOUNT
and b.CREATEDDATETIME between '20170101 00:00' and '20170102 23:59');
Look at the execution plans, they are identical!
If I add a clustered index on the customer table, and an index on the salesline, we get a more efficient query, with index seek and inner join, in stead of table scans and hash joins, but still identical!:
Now if you are using another version of SQL server youre results may vary, since the query optimizer changes between versions.

Left outer join using 2 of 3 tables in Postgresql

I need to show all clients entered into the system for a date range.
All clients are assigned to a group, but not necessarily to a staff.
When I run the query as such:
SELECT
clients.name_lastfirst_cs,
to_char (clients.date_intake,'MM/DD/YY')AS Date_Created,
clients.client_id,
clients.display_intake,
staff.staff_name_cs,
groups.name
FROM
public.clients,
public.groups,
public.staff,
public.link_group
WHERE
clients.zrud_staff = staff.zzud_staff AND
clients.zzud_client = link_group.zrud_client AND
groups.zzud_group = link_group.zrud_group AND
clients.date_intake BETWEEN (now() - '8 days'::interval)::timestamp AND now()
ORDER BY
groups.name ASC,
clients.client_id ASC,
staff.staff_name_cs ASC
I get 121 entries
if I comment out:
SELECT
clients.name_lastfirst_cs,
to_char (clients.date_intake,'MM/DD/YY')AS Date_Created,
clients.client_id,
clients.display_intake,
-- staff.staff_name_cs, -- Line Commented out
groups.name
FROM
public.clients,
public.groups,
public.staff,
public.link_group
WHERE
-- clients.zrud_staff = staff.zzud_staff AND --Line commented out
clients.zzud_client = link_group.zrud_client AND
groups.zzud_group = link_group.zrud_group AND
clients.date_intake BETWEEN (now() - '8 days'::interval)::timestamp AND now()
ORDER BY
groups.name ASC,
clients.client_id ASC,
staff.staff_name_cs ASC
I get 173 entries
I know I need to do an outer join to capture all clients regardless of if there
is a staff assigned, but each attempt has failed. I have done outer joins with
two tables, but adding a third has twisted my brain.
Thanks for any suggestions

I have no way of testing this (or of knowing that it is right) but what I read in your query is that you want something similar to this:
SELECT --I just used short aliases. I choose something other than the table name so I know it is an alias "c" for client etc...
c.name_lastfirst_cs,
to_char (c.date_intake,'MM/DD/YY')AS Date_Created,
c.client_id,
c.display_intake,
s.staff_name_cs,
g.name,
l.zrud_client AS "link_client",--I'm selecting some data here so that I can debug later, you can just filter this out with another select if you need to
l.zzud_group AS "link_group" --Again, so I can see these relationships
FROM
public.clients c
LEFT OUTER JOIN staff s ON --is staff required? If it isn't then outer join (optional)
s.zzud_staff = c.zrud_staff --so we linked staff to clients here
LEFT OUTER JOIN public.link_group l ON --this looks like a lookup table to me so we select the lookup record
l.zrud_client = c.zzud_client -- this is how I define the lookup, a client id
LEFT OUTER JOIN public.groups g ON --then we use that to lookup a group
g.zzup_group = l.zrud_group --which is defined by this data here
WHERE -- the following must be true
c.date_intake BETWEEN (now() - '8 days'::interval)::timestamp AND now()
Now for the why: I've basically moved your where clause to JOIN x ON y=z syntax. In my experience this is a better way to write an maintain queries as it allows you to specify relationships between tables rather than doing a big-ol'-join and trying to filter that data with the where clause. Keep in mind each condition is REQUIRED not optional so when you say you want records with the following conditions you're going to get them (and if I read this right--I probably don't as I don't have a schema in-front of me) if a record is missing a link-table record OR a staff member you're going to filter it out.
Alternatively (possibly significantly slower) You can SELECT anything so you can chain it like:
SELECT
*
FROM
(
SELECT
*
FROM
public.clients
WHERE
x condition
)
WHERE
y condition
OR
SELECT * FROM x WHERE x.condition IN (SELECT * FROM y)
In your case this tactic probably won't be easier than a standard join syntax.
^And some serious opinion here: I recommend you use the join syntax I outlined above here. It is functionally the same as joining and specifying a where clause, but as you noted, if you don't understand the relationships it can cause a Cartesian join. http://www.tutorialspoint.com/sql/sql-cartesian-joins.htm . Lastly, I tend to specify what type of join I want. I write INNER JOIN and OUTER JOIN a lot in my queries because it helps the next person (usually me) figure out what the heck I meant. If it is optional use an outer join, if it is required use an inner join (default).
Good luck! There are much better SQL developers out there and there's probably another way to do it.

INNER JOIN, LEFT/RIGHT OUTER JOIN

Apology in advance for a long question, but doing this just for the sake of learning:
i'm new to SQL and researching on JOIN for now. I'm getting two different behaviors when using INNER and OUTER JOIN. What I know is, INNER JOIN gives an intersection kind of result while returning only common rows among tables, and (LEFT/RIGHT) OUTER JOIN is outputting what is common and remaining rows in LEFT or RIGHT tables, depending upon LEFT/RIGHT clause respectively.
While working with MS Training Kit and trying to solve this practice: "Practice 2: In this practice, you identify rows that appear in one table but have no matches in another. You are given a task to return the IDs of employees from the HR.Employees table who did not handle orders (in the Sales.Orders table) on February 12, 2008. Write three different solutions using the following: joins, subqueries, and set
operators. To verify the validity of your solution, you are supposed to return employeeIDs: 1, 2, 3, 5, 7, and 9."
I'm successful doing this with subqueries and set operators but with JOIN is returning something not expected. I've written the following query:
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
I'm expecting results as: 1, 2, 3, 5, 7, and 9 (6 rows)
But what i'm getting is: 1,1,1,2,2,2,3,3,3,4,4,5,5,5,6,6,7,7,7,8,8,9,9,9 (24 rows)
I tried some videos but could not understand this side of INNER/OUTER JOIN. I'll be grateful if someone could help this side of JOIN, why is it so and what should I try to understand while working with JOIN.

you can also use left outer join to get not matching
*** The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
SELECT
H.empid
FROM
HR.Employees AS H
LEFT OUTER JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
WHERE O.empid IS NULL
Above script will return emp id who did not handle orders on specify date

here you can see all kind of join
Diagram taken from: http://dsin.wordpress.com/2013/03/16/sql-join-cheat-sheet/
adjust your query to be like this
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
where O.orderdate = '2008-02-12' AND O.empid IN null
ORDER BY
E.empid
;

USE TSQL2012;
SELECT
distinct E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;

Primary things to always remind yourself when working with SQL JOINs:
INNER JOINs require a match in the join in order for result set rows produced prior to the INNER JOIN to remain in the result set. When no match is found for a row, the row is discarded from the result set.
For a row fed to an INNER JOIN that matches to ONLY one row, only one copy of that row fed to the result set is delivered.
For a row fed to an INNER JOIN that matches to multiple rows, the row will be delivered multiple times, once for each row match from the INNER JOIN table.
OUTER JOINs will not discard rows fed to them in the result set, whether or not the OUTER JOIN results in a match or not.
Just like INNER JOINs, if an OUTER JOIN matches to more than one row, it will increase the number of rows in the result set by duplicating rows equal to the number of rows matched from the OUTER JOIN table.
Ask yourself "if I get NO match on the JOIN, do I want the row discarded or not?" If the answer is NO, use an OUTER JOIN. If the answer is YES, use an INNER JOIN.
If you don't need to reference any of the columns from a JOIN table, don't perform a JOIN at all. Instead, use a WHERE EXISTS, WHERE NOT EXISTS, WHERE IN, WHERE NOT IN, etc. or similar, depending on your database engine in use. Don't rely on the database engine to be smart enough to discard unreferenced columns resulting from JOINs from the result set. Some databases may be smart enough to do that, some not. There's no reason to pull columns into a result set only to not reference them. Doing so increases chance of reduced performance.
Your JOIN of:
JOIN HR.Employees AS E
ON E.empid <> H.empid
...is matching to all Employees rows with a DIFFERENT EMPID to all rows fed to that join. Use of NOT EQUAL on an INNER JOIN is a very rare thing to do or need, especially if the JOIN predicate is testing only ONE condition. That is why your getting duplicate rows in the result set.
On DB2, we could perform an EXCEPTION JOIN to accomplish that using a JOIN alone. Normally, on DB2, I would use a WHERE NOT EXISTS for that. On SQL Server you could do a JOIN to a query where the query set is all employees without orders in SALES.ORDERS on the specified date, but I don't know if that violates the rules of your tutorial.
Naveen posted the solution it appears your tutorial is looking for!

DB2 Query Structure Using User-Defined Function as a Table

I'm a little new to DB2, and am having trouble developing a query. I have created a user-defined function that returns a table of data which I want to then join and select from in larger select statement. I'm working on a sensitive db, so the query below isn't what I'm literally running, but it's almost exactly like it (without the other 10 joins I have to do lol).
select
A.customerId,
A.firstname,
A.lastname,
B.orderId,
B.orderDate,
F.currentLocationDate,
F.currentLocation
from
customer A
INNER JOIN order B
on A.customerId = B.customerId
INNER JOIN table(getShippingHistory(B.customerId)) as F
on B.orderId = F.orderId
where B.orderId = 35
This works great if I run this query without the where clause (or some other where clause that doesn't check for an ID). When I include the where clause, I get the following error:
Error during Prepare 58004(-901)[IBM][CLI Driver][DB2/LINUXX8664]
SQL0901N The SQL statement failed because of a non-severe system
error. Subsequent SQL statements can be processed. (Reason "Bad Plan;
Unresolved QNC found".) SQLSTATE=58004
I have tracked the issue down to fact that I'm using one of join criteria for the parameters (B.customerId). I have validated this fact by replacing B.customerId with a valid customerId, and the query works great. Problem is, I don't know the customerId when calling this query. I know only the orderId (in this example).
Any thoughts on how to restructure this so I can make only 1 call to get all the info? I know the plan is the problem b/c the customerId isn't getting resolved before the function is called.

So if I understand correctly, the function getShippingHistory(customerId) returns a table.
And if you call it with a single customer Id that table gets joined in your query above no problem at all.
But the way you have the query written above, you are asking db2 to call the function for every row returned by your query (i.e. every b.customerId that matches your join and where conditions).
So I'm not sure what behaviour you are expecting, because what you're asking for is a table back for every row in your query, and db2 (nor I) can figure out what the result is supposed to look like.
So in terms of restructuring your query, think about how you can change the getShippingHistory logic when multiple customer Ids are involved.

i found the best solution (given the current query structure) is to use a LEFT join instead of an INNER join in order force the LEFT part of the join to happen which will resolve the customerId to a value by the time it gets to the function call.
select
A.customerId,
A.firstname,
A.lastname,
B.orderId,
B.orderDate,
F.currentLocationDate,
F.currentLocation
from
customer A
INNER JOIN order B
on A.customerId = B.customerId
LEFT JOIN table(getShippingHistory(B.customerId)) as F
on B.orderId = F.orderId
where B.orderId = 35

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse