Redshift Cross join ignoring where clause - amazon-redshift

I have the following query:
WITH MY_CTE as
(
select
....
.....
)
SELECT
MY_CTE.*
,tt.currency as most_used_currency
from MY_CTE
cross join
(select t.currency
from My_CTE t
group by t.currency
order by count(*) desc
limit 1
) tt
where MY_CTE.currency = 'EUR'
but the cross join is ignoring my where clause.
How can I enforce that it processes the where clause before working on the cross join please?
Sample data returned:
This is obviously wrong because I said not to include currency SEK, and yet it is saying it's the most popular currency.
I cannot put the where clause inside of the cross join because I will be using this in tableau and need the users to be able to filter on certain criteria, e.g. currency.
The most popular currency should be EUR if the MY_CTE is filtered to show only EUR currency

The WHERE condition in this case has nothing to do with the cross join; it simply filters rows after the join has already been performed. If you need to report only a single currency, there are two simple options for where to add the currency filter (shown as comments in the SQL below):
1) Option 1 - add the filter inside the CTE itself
2) Option 2 - add the filter at the end (as already done) and also inside the tt subquery.
WITH MY_CTE as
(
select
....
.....
/* OPTION 1*/
)
SELECT
MY_CTE.*
,tt.currency as most_used_currency
from MY_CTE
cross join
(select t.currency
from My_CTE t
/* OPTION 2 first place*/
group by t.currency
order by count(*) desc
limit 1
) tt
where MY_CTE.currency = 'EUR' /* OPTION 2 second place */
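For example, Option 2 fully applied would repeat the same filter inside the tt subquery, roughly like this (a minimal sketch using the question's EUR filter):
(select t.currency
from MY_CTE t
where t.currency = 'EUR' /* OPTION 2: same filter as the outer WHERE */
group by t.currency
order by count(*) desc
limit 1
) tt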

The subquery aliased tt returns the most popular currency overall, which is SEK. If you want to filter to particular currencies, you'll need to put the filter in the inner query as well as the outer one. If that isn't an option, you'll want to return all currencies with their popularity and then filter on the most popular one you allow.
....
....
SELECT
LAST_VALUE(MY_CTE.customer_id)
OVER (partition by customer_id
ORDER BY tt.popularity
rows between unbounded preceding and unbounded following)
.... /* rest of your columns */
, LAST_VALUE(tt.currency)
OVER (partition by customer_id
ORDER BY tt.popularity
rows between unbounded preceding and unbounded following)
from MY_CTE
cross join
(select t.currency,
count(*) popularity
from My_CTE t
group by t.currency
order by count(*) desc
/* removed limit 1 */
) tt
where MY_CTE.currency = 'EUR'
AND tt.currency IN ('EUR') /* Added tt.currency filter */

Related

Find the next oldest row in Redshift

I have a table called user_activity in Redshift that has department, user_id, activity_type, activity_id, activity_date.
I'd like to query a daily report of how many days since the last event (of any type). Using CROSS APPLY (SQL Server) or LATERAL JOIN (Postgres 9+), I'd do something like...
SELECT d.date, a.last_activity_date
FROM date_table d
CROSS JOIN (
SELECT DISTINCT user_id FROM activity_table
) u
CROSS APPLY (
SELECT TOP 1 activity_date as last_activity_date
FROM activity_table
WHERE user_id = u.user_id AND activity_date <= d.date
ORDER BY activity_date DESC
) a
For now, I write it like the query below, but it is a bit slow and I am afraid it will only get slower.
with user_activity as (
select distinct activity_date, user_id from activity_table
)
select
d.date, u.user_id,
max(u.activity_date) as last_activity_date
from date_table d
inner join user_activity u on u.activity_date <= d.date
where d.date between '2020-01-01' and current_date
group by 1, 2
Can someone suggest a good alternative for my needs, or for CROSS APPLY / LATERAL JOIN?
As you are seeing, cross joins and inequality joins will slow down as your data grows, and they are generally not the approach you want in Redshift. This is due to the increase in data size that comes with this type of operation when applied to the large tables that are typical in Redshift.
You want to use window functions to perform this type of analysis, but you will need to step back and rethink how you structure the SQL. A MAX(activity_date) window function, partitioned by user_id, ordered by date, and with a frame clause of all preceding rows, will find the most recent activity as of any given row.
Now, this will only produce rows for user_ids and dates that exist in the data table, and it looks like you want 1 row for each date for each user_id, right? To do this you need to UNION in a framework of data that has 1 row per date per user_id ahead of the window function. You will need NULLs for the other columns so that the column lists match, and you will also want the calendar date in a separate column from activity_date. Now all dates for all user_ids will be in the source, and the window function will give you the result you want.
You also ask "how is this better than the joins?" Well, in the joins you are replicating all the data records by the number of dates, which can get really big. In this approach you just have the original data records plus one row per user_id per date (which is the size of your output), and as the number of records per user_id grows, this approach doesn't blow up the same way.
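Just to illustrate the idea (a rough, untested sketch; combined is a hypothetical name for the unioned source described above, with a cal_date calendar-date column and NULL activity_date on the framework rows):
select user_id, cal_date,
max(activity_date) over (
partition by user_id
order by cal_date
rows between unbounded preceding and current row
) as last_activity_date
from combined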
——— Request to modify asker’s code per comments made to their approach ———
Your code is definitely on the right track as you have removed the massive inequality join of your original. I made 2 comments about it. The first is that I believe you need GROUP BY user_id, date to prevent multiple rows per user_id per date that would result if there are records for the same user_id on a single date with differing activity_types. This is a simple oversight.
The second is to state that I intended for you to use UNION ALL, not LEFT JOIN, in combining the actual data and the user_id/date framework. Your approach works fine, but I have found that unioning very large amounts of data is generally faster than joining; you do need to make sure the columns match up. Either way we end up with a data segment with 3 columns - 2 date columns (one with NULLs for the framework rows) and 1 user_id column. Your approach is fine and the difference in performance is likely very small unless you have huge tables.
Since you asked for a rewrite, here it is with both changes. (NOTE: my laptop is in the shop so I don’t have ready access to Redshift at the moment and this SQL is untested. If the intent is not clear from this and you need me to debug it will be delayed by a few days. I’m keeping your setup methods and SQL structure.)
with date_table as (
select '2000-01-01'::date as date
union all
select '2000-01-02'::date
union all
select '2000-01-03'::date
union all
select '2000-01-04'::date
union all
select '2000-01-05'::date
union all
select '2000-01-06'::date
),
users as (
select 1 as user_id
union all
select 2
union all
select 3
),
user_activity as (
select 1 as user_id, '2000-01-01'::date as activity_date
union all
select 1 as user_id, '2000-01-04'::date as activity_date
union all
select 3 as user_id, '2000-01-03'::date as activity_date
union all
select 1 as user_id, '2000-01-05'::date as activity_date
union all
select 1 as user_id, '2000-01-06'::date as activity_date
),
user_dates as (
select d.date, u.user_id
from date_table d
cross join users u
),
user_date_activity as (
select cal_date, user_id,
lag(max(activity_date), 1) ignore nulls over (partition by user_id order by cal_date) as last_activity_date
from (
select user_id, date as cal_date, NULL::date as activity_date from user_dates /* explicit cast keeps the union types aligned */
union all
select user_id, activity_date as cal_date, activity_date from user_activity
) combined /* Redshift requires an alias on the derived table */
group by user_id, cal_date
)
select * from user_date_activity
order by user_id, cal_date
This was my query based on Bill's answer.
with date_table as (
select '2000-01-01'::date as date
union all
select '2000-01-02'::date
union all
select '2000-01-03'::date
union all
select '2000-01-04'::date
union all
select '2000-01-05'::date
union all
select '2000-01-06'::date
),
users as (
select 1 as user_id
union all
select 2
union all
select 3
),
user_activity as (
select 1 as user_id, '2000-01-01'::date as activity_date
union all
select 1 as user_id, '2000-01-04'::date as activity_date
union all
select 3 as user_id, '2000-01-03'::date as activity_date
union all
select 1 as user_id, '2000-01-05'::date as activity_date
union all
select 1 as user_id, '2000-01-06'::date as activity_date
),
user_dates as (
select d.date, u.user_id
from date_table d
cross join users u
),
user_date_activity as (
select ud.date, ud.user_id,
lag(ua.activity_date, 1) ignore nulls over (partition by ud.user_id order by ud.date) as last_activity_date
from user_dates ud
left join user_activity ua on ud.date = ua.activity_date and ud.user_id = ua.user_id
)
select * from user_date_activity
order by user_id, date

How to get the MAX(SUM of values) to find the category with the biggest total? PostgreSQL

I have two tables. One is Transactions and the other is Tickets. In Tickets I have the Ticket_Number, the name of the Category (Theater, Cinema, Concert), and the Price of the Ticket. In Transactions I also have the Ticket_Number. What I want to do is get a SUM of money for each Category, and then with that data select the Category with the most money.
I already managed to get the SUM for each category, but I am stuck here:
SELECT category, SUM (Tickets.Price) AS Price
FROM Tickets,Transactions
WHERE Tickets.ticket_num=Transactions.ticket_num
GROUP BY Category
ORDER BY Price DESC;
I know I can add LIMIT 1, but I know it's not correct because 2 or more values can be the same.
Use ROW_NUMBER to generate a sequence based on the sum of the price, then restrict to only the aggregated row with the highest total price.
WITH cte AS (
SELECT category, SUM(t1.Price) AS Price,
ROW_NUMBER() OVER (ORDER BY SUM(t1.Price) DESC) rn
FROM Tickets t1
INNER JOIN Transactions t2
ON t1.ticket_num = t2.ticket_num
GROUP BY Category
)
SELECT category, Price
FROM cte
WHERE rn = 1
ORDER BY Price DESC;
Note that if you want to capture all categories tied for the highest price, should a tie occur, then replace ROW_NUMBER in the above CTE with RANK, keeping everything else the same.
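For instance, only the ranking expression in the CTE changes, roughly:
RANK() OVER (ORDER BY SUM(t1.Price) DESC) rn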
What you are looking for is a window function DENSE_RANK() which will handle ties properly.
RANK() will also work for your case, but if you would like to extend it to get TOP N places with ties (where N > 1), dense rank is the way to go.
SELECT Category, Price
FROM (
SELECT
Category,
SUM(ti.Price) AS Price,
DENSE_RANK() OVER (ORDER BY SUM(ti.Price) DESC) AS rnk
FROM Tickets ti
INNER JOIN Transactions tr ON
ti.ticket_num = tr.ticket_num
GROUP BY Category
) t
WHERE rnk = 1
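For example, to get the top 3 price totals including ties (a small variation on the query above), only the outer filter needs to change:
WHERE rnk <= 3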
I've also replaced the old-style (and not recommended) comma-separated table list in the FROM clause with a proper INNER JOIN clause, and assigned aliases to the tables.
You can use rank() to rank the sums of the prices, more expensive first.
SELECT category,
price
FROM (SELECT category,
sum(tickets.price) price,
rank() OVER (ORDER BY sum(tickets.price) DESC) r
FROM tickets
INNER JOIN transactions
ON transactions.ticket_num = tickets.ticket_num
GROUP BY category) x
WHERE r = 1;
I also took the liberty of rewriting your join from the ancient comma style to a modern, clearer version.

How to translate SQL to DAX, Need to add FILTER

I want to create a calculated table that will summarize In_Force Premium from the existing table fact_Premium.
How can I filter the result by saying:
TODAY() has to be between `fact_Premium[EffectiveDate]` and (SELECT TOP 1 fact_Premium[ExpirationDate] ORDER BY QuoteID DESC)
In SQL I'd do that like this:
WHERE CONVERT(date, getdate()) between CONVERT(date, tblQuotes.EffectiveDate)
and (
select top 1 q2.ExpirationDate
from Table2 Q2
where q2.ControlNo = Table1.controlno
order by quoteid desc
)
Here is my DAX statement so far:
In_Force Premium =
FILTER(
ADDCOLUMNS(
SUMMARIZE(
//Grouping necessary columns
fact_Premium,
fact_Premium[QuoteID],
fact_Premium[Division],
fact_Premium[Office],
dim_Company[CompanyGUID],
fact_Premium[LineGUID],
fact_Premium[ProducerGUID],
fact_Premium[StateID],
fact_Premium[ExpirationDate]
),
"Premium", CALCULATE(
SUM(fact_Premium[Premium])
),
"ControlNo", CALCULATE(
DISTINCTCOUNT(fact_Premium[ControlNo])
)
), // Here I need to make sure TODAY() falls between fact_Premium[EffectiveDate] and (SELECT TOP 1 fact_Premium[ExpirationDate] ORDER BY QuoteID DESC)
)
Also, what would be the more efficient way: to create a calculated table from fact_Premium, or to create the same table using a SQL statement (Get Data --> SQL Server)?
There are 2 potential ways in T-SQL to get the next effective date. One is to use LEAD() and the other is to use an APPLY operator. As there are few facts to work with, here are samples:
select *
from (
select *
, lead(EffectiveDate) over(partition by CompanyGUID order by quoteid desc) as NextEffectiveDate
from Table1
join Table2 on ...
) d
or
select table1.*, oa.NextEffectiveDate
from Table1
outer apply (
select top(1) q2.ExpirationDate AS NextEffectiveDate
from Table2 Q2
where q2.ControlNo = Table1.controlno
order by quoteid desc
) oa
NB: an outer apply is a little similar to a left join in that it will allow rows with NULLs to be returned by the query; if that is not needed, then use cross apply instead.
In both these approaches you may refer to NextEffectiveDate in a final where clause, but I would prefer to avoid using the convert function if that is feasible (this depends on the data).
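Spelled out with the outer apply variant, that final filter might look roughly like this (a hedged sketch; column names follow the question, and you may still need date conversions depending on your data types):
select Table1.*, oa.NextEffectiveDate
from Table1
outer apply (
select top (1) q2.ExpirationDate as NextEffectiveDate
from Table2 Q2
where q2.ControlNo = Table1.controlno
order by q2.quoteid desc
) oa
where getdate() between Table1.EffectiveDate and oa.NextEffectiveDate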

Unable to get Percentile_Cont() to work in Postgresql

I am trying to calculate a percentile using the percentile_cont() function in PostgreSQL using common table expressions. The goal is to find the top 1% of accounts with regard to their balances (called amount here). My logic is to find the 99th percentile, which will return those whose account balances are greater than 99% of their peers (and thus find the 1 percenters).
Here is my query:
--ranking subquery works fine
with ranking as(
select a.lname,sum(c.amount) as networth from customer a
inner join
account b on a.customerid=b.customerid
inner join
transaction c on b.accountid=c.accountid
group by a.lname order by sum(c.amount)
)
select lname, networth, percentile_cont(0.99) within group
order by networth over (partition by lname) from ranking ;
I keep getting the following error.
ERROR: syntax error at or near "order"
LINE 2: ...ame, networth, percentile_cont(0.99) within group order by n..
I am thinking that perhaps I forgot a closing parenthesis or something, but I can't seem to figure out where. I know it could be something to do with the order keyword, but I am not sure what to do. Can you please help me fix this error?
This tripped me up, too.
It turns out percentile_cont is not supported in postgres 9.3, only in 9.4+.
https://www.postgresql.org/docs/9.4/static/release-9-4.html
So you have to use something like this:
with ordered_purchases as (
select
price,
row_number() over (order by price) as row_id,
(select count(1) from purchases) as ct
from purchases
)
select avg(price) as median
from ordered_purchases
where row_id between ct/2.0 and ct/2.0 + 1
That query is courtesy of https://www.periscopedata.com/blog/medians-in-sql (section: "Median on Postgres").
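If you are stuck before 9.4, the same row_number pattern can be adapted to the question's 99th-percentile goal (a rough, untested sketch; it assumes the ranking CTE from the question is defined in the same WITH clause, and it picks the rows at or above the 99th-percentile position rather than computing an interpolated value):
ordered_networth as (
select lname, networth,
row_number() over (order by networth) as row_id,
(select count(*) from ranking) as ct
from ranking
)
select lname, networth
from ordered_networth
where row_id >= ceil(ct * 0.99)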
You are missing the brackets in the within group (order by x) part.
Try this:
with ranking
as (
select a.lname,
sum(c.amount) as networth
from customer a
inner join account b on a.customerid = b.customerid
inner join transaction c on b.accountid = c.accountid
group by a.lname
order by networth
)
select lname,
networth,
percentile_cont(0.99) within group (
order by networth
) over (partition by lname)
from ranking;
I want to point out that you don't need a subquery for this:
select c.lname, sum(t.amount) as networth,
percentile_cont(0.99) within group (order by sum(t.amount)) over (partition by lname)
from customer c inner join
account a
on c.customerid = a.customerid inner join
transaction t
on a.accountid = t.accountid
group by c.lname
order by networth;
Also, when using table aliases (which you should always do), table abbreviations are much easier to follow than arbitrary letters.

multiple extract() with WHERE clause possible?

So far I have come up with the below:
WHERE (extract(month FROM orderdate)) =
(SELECT min(extract(month from orderdate))
FROM orders)
However, that will consequently return zero to many rows; in my case many, because many orders exist within that same earliest (minimum) month, e.g. 4th February, 9th February, 15th Feb, ...
I know that a WHERE clause can contain multiple columns, so why wouldn't the below work?
WHERE (extract(day FROM orderdate)), (extract(month FROM orderdate)) =
(SELECT min(extract(day from orderdate)), min(extract(month FROM orderdate))
FROM orders)
I simply get: SQL Error: ORA-00920: invalid relational operator
Any help would be great, thank you!
Sample data:
02-Feb-2012
14-Feb-2012
22-Dec-2012
09-Feb-2013
18-Jul-2013
01-Jan-2014
Output:
02-Feb-2012
14-Feb-2012
Desired output:
02-Feb-2012
I recreated your table and found out you just messed up the brackets a bit. The following works for me:
where
(extract(day from OrderDate),extract(month from OrderDate))
=
(select
min(extract(day from OrderDate)),
min(extract(month from OrderDate))
from orders
)
Use something like this:
with cte1 as (
select
extract(month from OrderDate) date_month,
extract(day from OrderDate) date_day,
OrderNo
from tablename
), cte2 as (
select min(date_month) min_date_month, min(date_day) min_date_day
from cte1
)
select cte1.*
from cte1
where (date_month, date_day) = (select min_date_month, min_date_day from cte2)
A common table expression lets you restructure your data and then use that data in your select. The first CTE block (cte1) selects the month and the day for each of your table rows. cte2 then selects min(month) and min(day). The last select then combines both CTEs to select all rows from cte1 that have the desired month and day.
There is probably a shorter solution, but I like common table expressions because they are almost always easier to understand than the "optimal, shortest" query.
If that is really what you want, as bizarre as it seems, then as a different approach you could forget the extracts and the subquery against the table to get the minimums, and use an analytic approach instead:
select orderdate
from (
select o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
from orders o
)
where rn = 1;
ORDERDATE
---------
01-JAN-14
The row_number() effectively adds a pseudo-column to every row in your original table, based on the month and day in the order date. The rn values are unique, so there will be exactly one row marked as 1, which will be the one from the earliest day in the earliest month. If you have multiple orders with the same day/month, say 01-Jan-2013 and 01-Jan-2014, then you'll still get exactly one row with rn = 1, but which one is picked is indeterminate. You'd need to add further order by conditions to make it deterministic, but I have no idea what you might want.
That is done in the inner query; the outer query then filters so that only the record marked with rn = 1 is returned, so you get exactly one row back from the overall query.
This also avoids the situation where the earliest day number is not in the earliest month number - say if you only had 01-Jan-2014 and 02-Feb-2014; comparing the day and month separately would look for 01-Feb-2014, which doesn't exist.
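For example, if you wanted ties to resolve to the earliest year, you could extend the ordering (just an illustration; use whatever tie-breaker fits your requirement):
row_number() over (order by to_char(orderdate, 'MMDD'), orderdate) as rn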
SQL Fiddle (with Thomas Tschernich's answer thrown in too, giving the same result for this data).
To join the result against your invoice table, you don't need to join to the orders table again - especially not with a cross join, which is skewing your results. You can do the join (at least) two ways:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
) o, invoices i
WHERE i.invno = o.invno
AND rn = 1;
Or:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT orderno, orderdate, invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
)
WHERE rn = 1
) o, invoices i
WHERE i.invno = o.invno;
The first looks like it does more work but the execution plans are the same.
SQL Fiddle with your pastebin-supplied query that gets two rows back, and these two that get one.