Distinct Count Dates by timeframe - date

I am trying to find the daily count of frequent visitors from a very large data-set. Frequent visitors in this case are visitor IDs used on 2 distinct days in a rolling 3 day period.
My data set looks like the below:
ID | Date | Location | State | Brand |
1 | 2020-01-02 | A | CA | XYZ |
1 | 2020-01-03 | A | CA | BCA |
1 | 2020-01-04 | A | CA | XYZ |
1 | 2020-01-06 | A | CA | YQR |
1 | 2020-01-06 | A | WA | XYZ |
2 | 2020-01-02 | A | CA | XYZ |
2 | 2020-01-05 | A | CA | XYZ |
This is the result I am going for. The count in the Visits column is the number of distinct days from the Date column within the current day and the two preceding days, for each ID. So for ID 1 on 2020-01-05, there was a visit on the 3rd and the 4th, so the count is 2.
Date | ID | Visits | Frequent Prior 3 Days
2020-01-01 |Null| Null | Null
2020-01-02 | 1 | 1 | No
2020-01-02 | 2 | 1 | No
2020-01-03 | 1 | 2 | Yes
2020-01-03 | 2 | 1 | No
2020-01-04 | 1 | 3 | Yes
2020-01-04 | 2 | 1 | No
2020-01-05 | 1 | 2 | Yes
2020-01-05 | 2 | 1 | No
2020-01-06 | 1 | 2 | Yes
2020-01-06 | 2 | 1 | No
2020-01-07 | 1 | 1 | No
2020-01-07 | 2 | 1 | No
2020-01-08 | 1 | 1 | No
2020-01-09 | 1 | null | Null
I originally tried the following expression to get the result for the visits column, but it gets stuck at 3 in every subsequent row once the running count first reaches 3 for that ID.
, count(ID) over (partition by ID order by Date ASC rows between 3 preceding and current row) as visits
I've scoured the forum, but every somewhat similar question seems to involve counting the values rather than the dates, and I haven't been able to figure out how to tweak them to get what I need. Any help is much appreciated.

You can aggregate the dataset by user and date, then use window functions with a range frame that looks back over the preceding days.
You did not say which database you are running on, and not all databases support window range frames or share the same syntax for interval literals. In standard SQL, you would go:
select
id,
date,
count(*) cnt_visits,
case
when sum(count(*)) over(
partition by id
order by date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from mytable
group by id, date
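If your database does not support range frames with interval bounds, one portable fallback is a correlated subquery over the distinct visit days. This is only a sketch against the same mytable, keeping the three-day interval used above; the exact date arithmetic syntax may still need adjusting for your product:
select v.id,
v.date,
(
-- count the distinct visit days in the window ending on v.date
select count(distinct p.date)
from mytable p
where p.id = v.id
and p.date >= v.date - interval '3' day
and p.date <= v.date
) as cnt_visits
from (select distinct id, date from mytable) v
order by v.id, v.date
The is_frequent_visitor flag can then be derived from cnt_visits with the same case expression as above.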
On the other hand, if you want a record for every user and every day (even when there is no visit), then it is a bit different. You can generate that dataset first with a cross join, then bring in the original table with a left join:
select
i.id,
d.date,
count(t.id) cnt_visits,
case
when sum(count(t.id)) over(
partition by i.id
order by d.date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from (select distinct id from mytable) i
cross join (select distinct date from mytable) d
left join mytable t
on t.date = d.date
and t.id = i.id
group by i.id, d.date

I would be inclined to approach this by expanding out the days and visitors using a cross join and then just using window functions. Assuming you have all dates in the data:
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits,
(case when count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes' else 'No'
end) as is_frequent_visitor
from (select distinct id from t) i cross join
(select distinct date from t) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;
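One caveat with building the date list from select distinct date: calendar days with no visits at all (such as 2020-01-01 or 2020-01-07 through 2020-01-09 in the expected output) will not appear. If you happen to be on PostgreSQL, a generated calendar can stand in for the distinct-date derived table; this is just a sketch, and the hard-coded reporting range is an assumption:
-- PostgreSQL sketch: generate the calendar so days with no visits still get a row
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits
from (select distinct id from t) i cross join
(select generate_series(date '2020-01-01', date '2020-01-09', interval '1 day')::date as date) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;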

Related

Grouping with different timespans

Currently I am struggling to achieve an aggregation over timespans that overlap.
The current structure of my table is:
|ymd |id|costs|
|--------|--|-----|
|20200101|a |10 |
|20200102|a |12 |
|20200101|b |13 |
|20200101|c |15 |
|20200102|c |1 |
However, I'd like to group it in a way that gives me a different timespan per item. Considering that I am running this query on 20200103, the result I am trying to achieve is:
| timespan | id | costs |
|------------|----|-------|
| last 2 days| a | 22 |
| last 1 day | a | 12 |
| last 2 days| b | 13 |
| last 1 day | b | 0 |
| last 2 days| c | 16 |
| last 1 day | c | 1 |
I have tried many things, but so far I haven't been able to achieve what I need. This is the query I tried, which does not return the correct results:
SELECT
CASE
WHEN ymd BETWEEN date_add(current_date(),-2) AND to_date(current_date()) THEN '2 days'
WHEN ymd BETWEEN date_add(current_date(),-1) AND to_date(current_date()) THEN '1 day'
END AS timespan,
id,
sum(costs) AS costs
FROM `table`
GROUP BY
CASE
WHEN ymd BETWEEN date_add(current_date(),-2) AND to_date(current_date()) THEN '2 days'
WHEN ymd BETWEEN date_add(current_date(),-1) AND to_date(current_date()) THEN '1 day'
END,
id
You can build a derived table that stores the timespans, cross join it with the list of distinct users to generate all possible combinations, then bring in the original table with a left join and aggregate:
select d.timespan, i.id, coalesce(sum(t.costs), 0) costs
from (select distinct id from mytable) i
cross join (
select 1 n, 'last 1 day' timespan
union all select 2, 'last 2 days'
) d
left join mytable t
on t.id = i.id
and t.ymd between date_add(current_date(), - d.n) and current_date()
group by d.n, d.timespan, i.id
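If one row per id with one column per timespan is also acceptable, the overlapping windows can be written as conditional aggregation instead, which avoids the cross join. This is a sketch that reuses the date_add()/current_date() functions from your attempt and assumes ymd compares correctly against their results; note that an id with no rows at all in the last 2 days would simply not appear here:
select
id,
-- each sum() looks at its own (overlapping) window
sum(case when ymd >= date_add(current_date(), -2) then costs else 0 end) as last_2_days,
sum(case when ymd >= date_add(current_date(), -1) then costs else 0 end) as last_1_day
from `table`
group by id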

Using PostgreSQL, how can I count the number of individuals that opened a message in the previous 30 days from the Monday of each week?

Scenario:
I have a table, events_table, that consists of records that are inserted by a webhook based on messages I send to my users:
"column_name" (type)
- "time_stamp" (timestamp with time zone)
- "username" (varchar)
- "delivered" (int)
- "action" (int)
Sample Data:
| time_stamp | username | delivered | action |
|:----------------|:---------|:----------|:-------|
|1349733421.460000| user1 | 1 | null |
|1549345346.460000| user3 | 1 | 1 |
|1524544421.460000| user1 | 1 | 1 |
|1345444421.570000| user7 | 1 | null |
|1756756761.980000| user9 | 1 | null |
|1234343421.460000| user171 | 1 | 1 |
|1843455621.460000| user5 | 1 | 1 |
| ... | ... | ... | ... |
The "delivered" column is null by default and 1 when delivered. The "action" column is null by default and is 1 when opened.
Problem:
Using PostgreSQL, how can I count the number of individuals that opened an email in the previous 30 days from the Monday of each week?
Ideal query results:
| date | count |
|:----------------|:----------|
| 02/24/2020 | 1,234,123 |
| 02/17/2020 | 234,123 |
| 02/10/2020 | 1,234,123 |
| 02/03/2020 |12,341,213 |
| ... | ... |
My attempt:
This is the extent of what I've tried, which gives me the count for the previous week:
SELECT
date_trunc('week', to_timestamp("time_stamp")) as date,
count("username") as count,
lag(count(1), 1) over (order by "date") as "count_previous_week"
FROM events_table
WHERE "delivered" = 1
and "action" = 1
GROUP BY 1 order by 1 desc
This is my attempt at writing this query.
First I get the lowest and highest dates from the data set. I add 7 days to the highest date to make sure I include data up to today.
I then run generate_series across these two values with an interval of 7 days, which gives me every single Monday between the two points (we can't rely on just the Mondays within your data set in case there is an empty week).
Then I simply subquery and aggregate the data based on our generate_series output.
select
__weeks.week_begins,
(
select
count(distinct "username")
from
events_table
where
to_timestamp("time_stamp")::date between week_begins - '30 days'::interval and week_begins
and "delivered" = 1
and "action" = 1
)
from
(
select
generate_series(_.min_date, _.max_date, '7 days'::interval)::date as week_begins
from
(
select
min(date_trunc('week', to_timestamp("time_stamp"))::date) as min_date,
max(date_trunc('week', to_timestamp("time_stamp"))::date) + 7 as max_date
from
events_table
where
"delivered" = 1
and "action" = 1
) as _
) as __weeks
order by
__weeks.week_begins
I'm not particularly keen on this query because the query planner visits the same table twice, but I can't think of another way to structure it.
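One way to avoid the second visit, which may or may not change the plan in practice, is to pre-filter the qualifying opens into a CTE once and reuse it for both the Monday series and the rolling 30-day count. A sketch only; the CTE name opens is made up:
with opens as (
select "username", to_timestamp("time_stamp")::date as open_date
from events_table
where "delivered" = 1
and "action" = 1
)
select
w.week_begins,
(
select count(distinct o."username")
from opens o
where o.open_date between w.week_begins - '30 days'::interval and w.week_begins
) as "count"
from
(
select generate_series(min(date_trunc('week', open_date))::date,
max(date_trunc('week', open_date))::date + 7,
'7 days'::interval)::date as week_begins
from opens
) as w
order by
w.week_begins
Whether this is actually cheaper depends on your PostgreSQL version and how the CTE gets materialized.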

How to fill Null with the previous value in PostgreSQL?

I have a table which contains Null values. I need to replace them with the previous non-Null value.
This is an example of data which I have:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | Null |
2018-01-03 | A | 0 | Null |
2018-01-04 | A | 0 | Null |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | Null |
2018-01-07 | B | 0 | Null |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | Null |
2018-01-10 | A | 0 | Null |
The result should look like this:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | 1 |
2018-01-03 | A | 0 | 1 |
2018-01-04 | A | 0 | 1 |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | 2 |
2018-01-07 | B | 0 | 2 |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | 3 |
2018-01-10 | A | 0 | 3 |
I tried the following query, but in this case, only the first Null value will be replaced.
select
date,
category,
start_period,
case
when period_number isnull then lag(period_number) over()
else period_number
end as period_number
from period_table;
Also, I tried to use first_value() window function, but I don't know how to set up the correct window.
Any help is highly appreciated.
You can join the table with itself and get the desired value, assuming your date column is the primary key or unique.
update your_table upd set period_number = tbl.period_number
from
(
select b.date, max(b2.date) as d2 from your_table b
inner join your_table b2 on b2.date < b.date and b2.period_number is not null
where b.period_number is null
group by b.date
) t
inner join your_table tbl on tbl.date = t.d2
where t.date = upd.date
If you don't need to update the table but only a select statement then
select yt.date, yt.category, yt.start_period, coalesce(yt.period_number, tbl.period_number) as period_number
from your_table yt
left join
(
select b.date, max(b2.date) as d2 from your_table b
inner join your_table b2 on b2.date < b.date and b2.period_number is not null
where b.period_number is null
group by b.date
) t on yt.date = t.date
left join your_table tbl on tbl.date = t.d2
If you replace your case statement with:
(
select
_.period_number
from
period_table as _
where
_.period_number is not null
and _.category = period_table.category
and _.date <= period_table.date
order by
_.date desc
limit 1
) as period_number
Then it should have the intended effect. It's nowhere near as elegant as a window function but I don't think window functions are quite flexible enough for your specific use case here (Or at least, if they are, I don't know how to flex them that much)
Examples of window functions and frame clauses:
select
date, category, score
,FIRST_VALUE(score) OVER (
PARTITION BY category
ORDER BY date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as first_score
from testing.rec_test
order by date, category
select
date,category,score
,LAST_VALUE(score) OVER (
PARTITION BY category
ORDER BY date RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) as last_score
from testing.rec_test
order by date, category
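For comparison, one window-based pattern that does handle this kind of fill-forward is a two-step query: count() ignores NULLs, so a running count of period_number labels each run of rows that starts at a non-null value, and the maximum period_number within each run is the value to carry forward. A sketch against the question's period_table:
select date, category, start_period,
max(period_number) over (partition by grp) as period_number
from (
select p.*,
-- running count of non-null values marks each fill-forward group
count(period_number) over (order by date) as grp
from period_table p
) s
order by date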

Accomplishing what I need without a CROSS JOIN

I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.
SELECT
b.date,
a.id,
SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) sales,
SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) revenue
FROM
table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1,2
table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:
+----------+--------+-------+---------+
| date | id | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18 | 113972 | 0 | 0 |
| 11/4/18 | 111218 | 0 | 0 |
| 11/3/18 | 111218 | 0 | 0 |
| 11/3/18 | 113972 | 0 | 0 |
| 11/2/18 | 111218 | 0 | 0 |
| 11/2/18 | 113972 | 0 | 0 |
| 11/1/18 | 111218 | 89 | 2405.77 |
| 11/1/18 | 113972 | 265 | 3000.39 |
| 10/31/18 | 111218 | 64 | 2957.71 |
| 10/31/18 | 113972 | 120 | 5650.91 |
+----------+--------+-------+---------+
Now there's more to the query after this where I get into the projections and whatnot, but for the purposes of this question, this is all you need, as it's where the CROSS JOIN exists.
How can I recreate these results without using a CROSS JOIN? In reality, this query covers a much larger date range with way more data, it takes hours and a lot of resources to run, and I know CROSS JOINs should be avoided if possible.
Use the table of all dates as the "from" table and left join the data; this still returns each date.
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
d.date
, t.id
Another alternative (to avoid the cross join) could be to use generate_series, but for Redshift in particular I suggest this former answer. I'm a fan of generate_series, but if you already have a table I would probably stay with that (though this is based on what little I know about your query, etc.).
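For reference, this is roughly what the generate_series version would look like in PostgreSQL syntax (a sketch only; as noted, Redshift needs a different approach), with the series standing in for the all-dates table:
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM (
-- one row per day in the reporting range
SELECT generate_series(DATE '2018-10-31', DATE '2018-11-04', INTERVAL '1 day')::date AS date
) d
LEFT JOIN table_data t
ON d.date = t.date
GROUP BY
d.date
, t.id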

Join tables and count instances of different values

user
---------------------------
| ID | Name |
---------------------------
| 1 | Jim Rice |
| 2 | Wade Boggs |
| 3 | Bill Buckner |
---------------------------
at_bats
----------------------
| ID | User | Bases |
----------------------
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 4 | 3 | 0 |
| 5 | 1 | 3 |
----------------------
What I want my query to do is get the count of the different base values in a join table like:
count_of_hits
---------------------
| ID | 1B | 2B | 3B |
---------------------
| 1 | 0 | 2 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 |
---------------------
I had a query where I was able to get the bases individually, but not all of them unless I did some complicated joins, and I'd imagine there is a better way. This was the foundational query though:
SELECT id, COUNT(ab.*)
FROM user
LEFT OUTER JOIN (SELECT * FROM at_bats WHERE at_bats.bases=2) ab ON ab.user=user.id
GROUP BY id
PostgreSQL 9.4+ provides a much cleaner way to do this:
SELECT
users,
count(*) FILTER (WHERE bases=1) As B1,
count(*) FILTER (WHERE bases=2) As B2,
count(*) FILTER (WHERE bases=3) As B3
FROM at_bats
GROUP BY users
ORDER BY users;
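If you also want players who have no at_bats rows at all to show up with zeros, you can drive the query from the user table and left join before applying the filters. A sketch along the same lines, keeping the column names used above, so adjust them to your actual schema:
SELECT
u.id,
-- count(ab.id) ignores the NULLs produced by the left join
count(ab.id) FILTER (WHERE ab.bases=1) As B1,
count(ab.id) FILTER (WHERE ab.bases=2) As B2,
count(ab.id) FILTER (WHERE ab.bases=3) As B3
FROM "user" u -- "user" is quoted because it is a reserved word in PostgreSQL
LEFT JOIN at_bats ab ON ab.users = u.id
GROUP BY u.id
ORDER BY u.id;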
I think the following query would solve your problem. However, I am not sure if it is the best approach:
select distinct a.users, coalesce(b.B1, 0) As B1, coalesce(c.B2, 0) As B2 ,coalesce(d.B3, 0) As B3
FROM at_bats a
LEFT JOIN (SELECT users, count(bases) As B1 FROM at_bats WHERE bases = 1 GROUP BY users) as b ON a.users=b.users
LEFT JOIN (SELECT users, count(bases) As B2 FROM at_bats WHERE bases = 2 GROUP BY users) as c ON a.users=c.users
LEFT JOIN (SELECT users, count(bases) As B3 FROM at_bats WHERE bases = 3 GROUP BY users) as d ON a.users=d.users
Order by users
The coalesce() function is just there to replace the nulls with zeros. I hope this query helps you :D
UPDATE 1
I found a better way to do it; have a look at the following:
SELECT users,
count(case bases when 1 then 1 else null end) As B1,
count(case bases when 2 then 1 else null end) As B2,
count(case bases when 3 then 1 else null end) As B3
FROM at_bats
GROUP BY users
ORDER BY users;
It is more efficient compared to my first query. You can check the performance by using EXPLAIN ANALYZE before the query.
Thanks to Guffa from this post: https://stackoverflow.com/a/1400115/4453190