Accomplishing what I need without a CROSS JOIN - postgresql

I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.
SELECT
b.date,
a.id,
SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) sales,
SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) revenue
FROM
table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1,2
table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:
+----------+--------+-------+---------+
| date | id | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18 | 113972 | 0 | 0 |
| 11/4/18 | 111218 | 0 | 0 |
| 11/3/18 | 111218 | 0 | 0 |
| 11/3/18 | 113972 | 0 | 0 |
| 11/2/18 | 111218 | 0 | 0 |
| 11/2/18 | 113972 | 0 | 0 |
| 11/1/18 | 111218 | 89 | 2405.77 |
| 11/1/18 | 113972 | 265 | 3000.39 |
| 10/31/18 | 111218 | 64 | 2957.71 |
| 10/31/18 | 113972 | 120 | 5650.91 |
+----------+--------+-------+---------+
Now there's more to the query after this where I get into the projections and what not, but for the purposes of this question, this is all you need, as it's where the CROSS JOIN exists.
How can I recreate these results without using a CROSS JOIN? In reality, this query covers a much larger date range with far more data; it takes hours and a lot of resources to run, and I know CROSS JOINs should be avoided if possible.

Use the table of all dates as the "from table" and LEFT JOIN the data to it; this still returns every date.
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
d.date
, t.id
Another alternative (to avoid the cross join) could be to use generate_series, but for this - in Redshift - I suggest this former answer. I'm a fan of generate_series, but if you already have a dates table I would probably stay with that (this is based on what little I know about your query).
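As a rough illustration of the generate_series idea in plain PostgreSQL, the dates table can be replaced by a generated set of days. This is only a sketch: it assumes the table_data columns used in the answer above (date, id, sales, revenue) and the date range from the question.

```sql
-- Sketch only: table_data and its columns are taken from the answer above.
SELECT
    d.date,
    t.id,
    COALESCE(SUM(t.sales), 0)   AS sales,
    COALESCE(SUM(t.revenue), 0) AS revenue
FROM (
    -- One row per day in the range; the timestamp series is cast back to date.
    SELECT gs::date AS date
    FROM generate_series(timestamp '2018-10-31',
                         timestamp '2018-11-04',
                         interval '1 day') AS gs
) d
LEFT JOIN table_data t
       ON t.date = d.date
GROUP BY d.date, t.id
ORDER BY d.date, t.id;
```

As with the LEFT JOIN above, a date that has no matching rows comes back once with a NULL id rather than once per id.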

Related

Postgres Hierarchy output

I'm struggling to get the correct output using a hierarchical query.
I have one table that is loaded every day with all products and their prices. Over time, a product can be cancelled and then activated again.
I believe in Oracle we could use CONNECT BY.
WITH RECURSIVE cte AS (
select min(event_date) event_date, item_code,sum(price::numeric)/1024/1024 price, 1 AS level
from rdpidevdat.raid_r_cbs_offer_accttype_map where product_type='cars' and item_code in ('Renault')
group by item_code
UNION ALL
SELECT e.event_date, e.item_code, e.price, cte.level + 1
from (select event_date, item_code,sum(price::numeric)/1024/1024 price
from rdpidevdat.raid_r_cbs_offer_accttype_map where product_type='cars' and item_code in ('9859')
group by event_date,item_code) e join cte ON e.event_date = cte.event_date and e.item_code = cte.item_code
)
SELECT *
FROM cte where item_code in ('Renault') ;
How do I produce an output that shows the date range of each product over time?
If we have the data:
EVENT_DATE | ITEM_COD| PRICE
20210910 | Renaut | 2500
20210915 | Renaut | 2500
20210920 | Renaut | 2600
20211020 | Renaut | 2900
20220101 | Renaut | 2500
The expected output should be:
-------------------------------------------------
FROM_EVENT_DATE | TO_EVENT_DATE | ITEM_COD| PRICE
20210910 | 20210915 | Renaut | 2500
20210915 | 20210920 | Renaut | 2600
20210920 | 20211020 | Renaut | 2900
20211020 | 20220101 | Renaut | 2500
Thanks in Advance and Regards!
I already found the solution: using the LAG and LAST_VALUE functions. There is no need for a hierarchical query.
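The asker did not post the final query, so the following is only a guess at its shape. It uses LAG alone, on the sample rows from the question; the CTE names prices and ranged are hypothetical.

```sql
-- Sample rows copied from the question; "prices" is a hypothetical name.
WITH prices (event_date, item_code, price) AS (
    VALUES
        ('20210910', 'Renaut', 2500),
        ('20210915', 'Renaut', 2500),
        ('20210920', 'Renaut', 2600),
        ('20211020', 'Renaut', 2900),
        ('20220101', 'Renaut', 2500)
),
ranged AS (
    SELECT
        -- The previous event date for the same item becomes the start of the range.
        LAG(event_date) OVER (PARTITION BY item_code
                              ORDER BY event_date) AS from_event_date,
        event_date                                  AS to_event_date,
        item_code,
        price
    FROM prices
)
SELECT from_event_date, to_event_date, item_code, price
FROM ranged
WHERE from_event_date IS NOT NULL   -- the first row has no predecessor
ORDER BY to_event_date;
```

On this sample it should return the four ranges listed in the expected output, with each price taken from the row that closes the range.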

Distinct Count Dates by timeframe

I am trying to find the daily count of frequent visitors from a very large data-set. Frequent visitors in this case are visitor IDs used on 2 distinct days in a rolling 3 day period.
My data set looks like the below:
ID | Date | Location | State | Brand |
1 | 2020-01-02 | A | CA | XYZ |
1 | 2020-01-03 | A | CA | BCA |
1 | 2020-01-04 | A | CA | XYZ |
1 | 2020-01-06 | A | CA | YQR |
1 | 2020-01-06 | A | WA | XYZ |
2 | 2020-01-02 | A | CA | XYZ |
2 | 2020-01-05 | A | CA | XYZ |
This is the result I am going for. The count in the Visits column is the number of distinct days in the Date column within the current day and the 2 days before it, for each ID. So for ID 1 on 2020-01-05, there was a visit on the 3rd and 4th, so the count is 2.
Date | ID | Visits | Frequent Prior 3 Days
2020-01-01 |Null| Null | Null
2020-01-02 | 1 | 1 | No
2020-01-02 | 2 | 1 | No
2020-01-03 | 1 | 2 | Yes
2020-01-03 | 2 | 1 | No
2020-01-04 | 1 | 3 | Yes
2020-01-04 | 2 | 1 | No
2020-01-05 | 1 | 2 | Yes
2020-01-05 | 2 | 1 | No
2020-01-06 | 1 | 2 | Yes
2020-01-06 | 2 | 1 | No
2020-01-07 | 1 | 1 | No
2020-01-07 | 2 | 1 | No
2020-01-08 | 1 | 1 | No
2020-01-09 | 1 | null | Null
I originally tried to use the following line to get the result for the visits column, but I end up with 3 in every successive row from whichever date it first reached 3 for that ID.
,
count(ID) over (Partition by ID order by Date ASC rows between 3 preceding and current row) as visits
I've scoured the forum, but every somewhat similar question seems to involve counting the values rather than the dates, and I haven't been able to figure out how to tweak them to get what I need. Any help is much appreciated.
You can aggregate the dataset by user and date, then use a window function with a range frame to look back over the three preceding days.
You did not say which database you are running, and not all databases support range frames or share the same syntax for interval literals. In standard SQL, you would go:
select
id,
date,
count(*) cnt_visits,
case
when sum(count(*)) over(
partition by id
order by date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from mytable
group by id, date
On the other hand, if you want a record for every user and every day (even when there is no visit), then it is a bit different. You can generate the dataset first, then bring in the table with a left join:
select
i.id,
d.date,
count(t.id) cnt_visits,
case
when sum(count(t.id)) over(
partition by i.id
order by d.date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from (select distinct id from mytable) i
cross join (select distinct date from mytable) d
left join mytable t
on t.date = d.date
and t.id = i.id
group by i.id, d.date
I would be inclined to approach this by expanding out the days and visitors using a cross join and then just window functions. Assuming you have all dates in the data:
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits,
(case when count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes' else 'No'
end) as is_frequent_visitor
from (select distinct id from t) i cross join
(select distinct date from t) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;
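For comparison, here is a minimal sketch that counts distinct visit days directly, using the sample rows from the question. It assumes PostgreSQL 11 or later (needed for a RANGE frame with an interval offset), uses the hypothetical column name visit_date, and only returns rows for days on which the ID actually visited.

```sql
-- Sample rows from the question, reduced to the two columns that matter.
WITH visits (id, visit_date) AS (
    VALUES
        (1, DATE '2020-01-02'),
        (1, DATE '2020-01-03'),
        (1, DATE '2020-01-04'),
        (1, DATE '2020-01-06'),
        (1, DATE '2020-01-06'),
        (2, DATE '2020-01-02'),
        (2, DATE '2020-01-05')
),
daily AS (
    -- Collapse multiple visits on the same day into one row per id and day.
    SELECT DISTINCT id, visit_date
    FROM visits
)
SELECT
    id,
    visit_date,
    COUNT(*) OVER w AS distinct_days,
    CASE WHEN COUNT(*) OVER w >= 2 THEN 'Yes' ELSE 'No' END AS is_frequent_visitor
FROM daily
-- Current day plus the two days before it: a rolling 3-day window.
WINDOW w AS (PARTITION BY id
             ORDER BY visit_date
             RANGE BETWEEN INTERVAL '2 days' PRECEDING AND CURRENT ROW)
ORDER BY visit_date, id;
```

Deduplicating to one row per ID and day first means two visits on the same day (ID 1 on 2020-01-06) count as a single day.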

How to fill Null with the previous value in PostgreSQL?

I have a table which contains Null values. I need to replace them with a previous non-Null value.
This is an example of data which I have:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | Null |
2018-01-03 | A | 0 | Null |
2018-01-04 | A | 0 | Null |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | Null |
2018-01-07 | B | 0 | Null |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | Null |
2018-01-10 | A | 0 | Null |
The result should look like this:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | 1 |
2018-01-03 | A | 0 | 1 |
2018-01-04 | A | 0 | 1 |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | 2 |
2018-01-07 | B | 0 | 2 |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | 3 |
2018-01-10 | A | 0 | 3 |
I tried the following query, but in this case, only the first Null value will be replaced.
select
date,
category,
start_period,
case
when period_number isnull then lag(period_number) over()
else period_number
end as period_number
from period_table;
Also, I tried to use first_value() window function, but I don't know how to set up the correct window.
Any help is highly appreciated.
You can join the table with itself to get the desired value, assuming your date column is the primary key or otherwise unique.
update your_table upd set period_number = tbl.period_number
from
(
select b.date, max(b2.date) as d2 from your_table b
inner join your_table b2 on b2.date< b.date and b2.period_number is not null
group by b.date
)t
inner join your_table tbl on tbl.date = t.d2
where t.date= upd.date
If you don't need to update the table and only want a SELECT statement, then:
select yt.date, yt.category, yt.start_period, tbl.period_number
from your_table yt
inner join
(
select b.date, max(b2.date) as d2 from your_table b
inner join your_table b2 on b2.date< b.date and b2.period_number is not null
group by b.date
)t on yt.date = t.date
inner join your_table tbl on tbl.date = t.d2
If you replace your case statement with:
(
select
_.period_number
from
period_table as _
where
_.period_number is not null
and _.category = period_table.category
and _.date <= period_table.date
order by
_.date desc
limit 1
) as period_number
Then it should have the intended effect. It's nowhere near as elegant as a window function, but I don't think window functions are quite flexible enough for your specific use case here (or at least, if they are, I don't know how to flex them that much).
Examples of window functions with a frame clause:
select
date,category,score
,FIRST_VALUE(score) OVER (
PARTITION BY category
ORDER BY date RANGE BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW
) as first_score
from testing.rec_test
order by date, category;
select
date,category,score
,LAST_VALUE(score) OVER (
PARTITION BY category
ORDER BY date RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) as last_score
from testing.rec_test
order by date, category;
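One window-function approach that is often used for this kind of gap filling is sketched below; it is not taken from any of the answers above. A running COUNT of the non-NULL values puts each NULL row in the same group as the last non-NULL row before it, and MAX over that group then carries the value forward. The sample rows are a subset of the question's data, and the names grouped and grp are hypothetical.

```sql
-- A subset of the question's rows, inlined so the query is self-contained.
WITH period_table (date, category, start_period, period_number) AS (
    VALUES
        (DATE '2018-01-01', 'A', 1, 1),
        (DATE '2018-01-02', 'A', 0, NULL),
        (DATE '2018-01-03', 'A', 0, NULL),
        (DATE '2018-01-05', 'B', 1, 2),
        (DATE '2018-01-06', 'B', 0, NULL),
        (DATE '2018-01-08', 'A', 1, 3),
        (DATE '2018-01-09', 'A', 0, NULL)
),
grouped AS (
    SELECT
        *,
        -- COUNT ignores NULLs, so the counter only advances on filled rows.
        COUNT(period_number) OVER (ORDER BY date) AS grp
    FROM period_table
)
SELECT
    date,
    category,
    start_period,
    -- Each group holds exactly one non-NULL value; MAX picks it up.
    MAX(period_number) OVER (PARTITION BY grp) AS period_number
FROM grouped
ORDER BY date;
```

Rows that come before the first non-NULL value would fall into group 0 and stay NULL.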

Join tables and count instances of different values

user
---------------------------
| ID | Name |
---------------------------
| 1 | Jim Rice |
| 2 | Wade Boggs |
| 3 | Bill Buckner |
---------------------------
at_bats
----------------------
| ID | User | Bases |
----------------------
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 4 | 3 | 0 |
| 5 | 1 | 3 |
----------------------
What I want my query to do is get the count of the different base values in a join table like:
count_of_hits
---------------------
| ID | 1B | 2B | 3B |
---------------------
| 1 | 0 | 2 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 |
---------------------
I had a query where I was able to get the bases individually, but not them all unless I did some complicated Joins and I'd imagine there is a better way. This was the foundational query though:
SELECT id, COUNT(ab.*)
FROM user
LEFT OUTER JOIN (SELECT * FROM at_bats WHERE at_bats.bases=2) ab ON ab.user=user.id
PostgreSQL 9.4+ provides a much cleaner way to do this:
SELECT
users,
count(*) FILTER (WHERE bases=1) As B1,
count(*) FILTER (WHERE bases=2) As B2,
count(*) FILTER (WHERE bases=3) As B3
FROM at_bats
GROUP BY users
ORDER BY users;
I think the following query would solve your problem. However, I am not sure if it is the best approach:
select distinct a.users, coalesce(b.B1, 0) As B1, coalesce(c.B2, 0) As B2 ,coalesce(d.B3, 0) As B3
FROM at_bats a
LEFT JOIN (SELECT users, count(bases) As B1 FROM at_bats WHERE bases = 1 GROUP BY users) as b ON a.users=b.users
LEFT JOIN (SELECT users, count(bases) As B2 FROM at_bats WHERE bases = 2 GROUP BY users) as c ON a.users=c.users
LEFT JOIN (SELECT users, count(bases) As B3 FROM at_bats WHERE bases = 3 GROUP BY users) as d ON a.users=d.users
Order by users
The coalesce() function is just there to replace the nulls with zeros. I hope this query helps you :D
UPDATE 1
I found a better way to do it; look at the following:
SELECT users,
count(case bases when 1 then 1 else null end) As B1,
count(case bases when 2 then 1 else null end) As B2,
count(case bases when 3 then 1 else null end) As B3
FROM at_bats
GROUP BY users
ORDER BY users;
It is more efficient than my first query. You can check the performance by putting EXPLAIN ANALYSE before the query.
Thanks to Guffa from this post: https://stackoverflow.com/a/1400115/4453190
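The queries above read only from at_bats, so a user with no at-bats at all would not appear in the result. A minimal sketch that starts from the user table and LEFT JOINs the at-bats is shown below. The sample data is inlined from the question, the table and column names follow the answers (users), and the output column names b1, b2 and b3 are chosen for illustration.

```sql
-- Sample data from the question, inlined as CTEs so the query is self-contained.
WITH users (id, name) AS (
    VALUES (1, 'Jim Rice'), (2, 'Wade Boggs'), (3, 'Bill Buckner')
),
at_bats (id, users, bases) AS (
    VALUES (1, 1, 2), (2, 2, 1), (3, 1, 2), (4, 3, 0), (5, 1, 3)
)
SELECT
    u.id,
    -- For users without a matching at-bat, ab.bases is NULL and every filter yields 0.
    COUNT(*) FILTER (WHERE ab.bases = 1) AS b1,
    COUNT(*) FILTER (WHERE ab.bases = 2) AS b2,
    COUNT(*) FILTER (WHERE ab.bases = 3) AS b3
FROM users u
LEFT JOIN at_bats ab
       ON ab.users = u.id
GROUP BY u.id
ORDER BY u.id;
```

On the sample data this should give 0/2/1 for user 1, 1/0/0 for user 2 and 0/0/0 for user 3, matching the count_of_hits table in the question.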

How to list the train operators that use the second oldest trains (PostgreSQL)

train_operators:
| train_operator_id | name |
------------------------------
| 1 | Virgin |
| 2 | First |
journeys:
| journey_id | train_operator | train_type |
--------------------------------------------
| 1 | 2 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 3 |
| 4 | 1 | 2 |
train_types:
| train_type_id | date_made |
------------------------------
| 1 | 1999-02-15 |
| 2 | 2001-03-11 |
| 3 | 2000-12-05 |
How would you write a query to find all the train operators that use the second oldest type of train?
With the given schema the query should return just Virgin, since it is the only train operator that uses the second oldest train type.
Try this:
select distinct train_operator from journeys
inner join (Select * from train_types order by date_made LIMIT 1 OFFSET 1) sectrain
on sectrain.train_type_id = journeys.train_type
You're into the UK Rail Network are you? I used to work for Funkwerk IT, who in turn used to provide the timetable planning software for Network Rail...
It can be pretty easy using the power of window functions in pg
SELECT DISTINCT train_operator_id,
name
FROM (SELECT t.train_operator_id,
t.name,
Dense_rank() OVER (ORDER BY tt.date_made) AS rank
FROM train_operators AS t
JOIN journeys AS j
ON j.train_operator = t.train_operator_id
JOIN train_types AS tt
ON tt.train_type_id = j.train_type) AS q
WHERE rank = 2;
http://sqlfiddle.com/#!12/98816/8
select op.name
from
train_operators op
inner join
journeys j on op.train_operator_id = j.train_operator
where
j.train_type = (
select train_type_id
from train_types
order by date_made
limit 1 offset 1
)
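A variation on the window-function answer is to rank the train types themselves before joining to journeys, so the rank does not depend on how many journeys use each type. The sketch below inlines the sample data from the question; the names ranked_types and age_rank are hypothetical.

```sql
-- Sample data from the question, inlined as CTEs so the query is self-contained.
WITH train_operators (train_operator_id, name) AS (
    VALUES (1, 'Virgin'), (2, 'First')
),
journeys (journey_id, train_operator, train_type) AS (
    VALUES (1, 2, 2), (2, 2, 1), (3, 1, 3), (4, 1, 2)
),
train_types (train_type_id, date_made) AS (
    VALUES (1, DATE '1999-02-15'), (2, DATE '2001-03-11'), (3, DATE '2000-12-05')
),
ranked_types AS (
    -- DENSE_RANK leaves no gaps, so rank 2 exists even if two types share the oldest date.
    SELECT train_type_id,
           DENSE_RANK() OVER (ORDER BY date_made) AS age_rank
    FROM train_types
)
SELECT DISTINCT op.name
FROM train_operators op
JOIN journeys j
  ON j.train_operator = op.train_operator_id
JOIN ranked_types rt
  ON rt.train_type_id = j.train_type
WHERE rt.age_rank = 2;
```

With the sample data, type 3 (made 2000-12-05) is the second oldest and only journey 3 uses it, so the query should return Virgin.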