Counting Differences within a Case-When - postgresql

I'm trying to do some analytics on user activity, specifically how many users are still active, or at least logging in, over a period of time. However, I'm getting conflicting numbers for the first month's count, which should just be the count of users that signed up during that month. To figure that out, my simple query is this.
SELECT count(user_id)
FROM users
WHERE date_part('year', member_since) = 2013
AND date_part('month', member_since) = 01
Hypothetically this returns '1,000' which I believe to be correct because of the simplicity. But if I do this...
SELECT
COUNT(CASE WHEN date_part('day', last_login - member_since) >= 0
THEN user_id END) days_0
FROM users
WHERE date_part('year', member_since) = 2013
AND date_part('month', member_since) = 01
...It will return a number less than 1,000. Theoretically this should return the same number as above because even if last_login is the same day as member_since that would be zero and should count those users. Both member_since and last_login are 'timestamp' types. I have a hunch that the difference could be users where last_login is the exact same as member_since, meaning that they signed up and never came back, but I'm not sure how I would test this. Is this a NULL issue? If so, how could I include that to get back to the count of '1,000'?

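I would bet dollars to donuts that your NULLs are what is causing the problem: any comparison involving NULL evaluates to NULL rather than true, so those rows always fall through to the implicit ELSE branch, which returns NULL and is skipped by COUNT.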
To correct for that, try this:
SELECT
COUNT(CASE WHEN last_login is null or date_part('day', last_login - member_since) >= 0
THEN user_id END) days_0
FROM users
WHERE date_part('year', member_since) = 2013
AND date_part('month', member_since) = 01;
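To see why the NULL check matters, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The sample rows are hypothetical, and julianday() is an assumption that replaces the date_part arithmetic, which SQLite lacks. The way CASE and COUNT treat NULL here is standard SQL behavior.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, member_since TEXT, last_login TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "2013-01-05 10:00:00", "2013-01-09 08:00:00"),  # came back later
        (2, "2013-01-12 09:30:00", "2013-01-12 09:30:00"),  # same timestamp
        (3, "2013-01-20 14:00:00", None),                   # never logged in
    ],
)

def count_days_0(condition):
    """Count users for whom the CASE condition evaluates to true."""
    sql = f"SELECT COUNT(CASE WHEN {condition} THEN user_id END) FROM users"
    return conn.execute(sql).fetchone()[0]

# SQLite substitute for date_part('day', last_login - member_since) >= 0
diff = "julianday(last_login) - julianday(member_since) >= 0"

total = conn.execute("SELECT COUNT(user_id) FROM users").fetchone()[0]
without_null_check = count_days_0(diff)
with_null_check = count_days_0(f"last_login IS NULL OR {diff}")

print(total, without_null_check, with_null_check)
```

In this sketch, the row with a NULL last_login is the only one dropped by the unguarded CASE, while the row whose two timestamps are identical is still counted.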

Related

How to subtract a separate count from one grouping

I have a postgres query like this
select application.status as status, count(*) as "current_month" from application
where to_char(application.created, 'mon') = to_char('now'::timestamp - '1 month'::interval, 'mon')
and date_part('year',application.created) = date_part('year', CURRENT_DATE)
and application.job_status != 'expired'
group by application.status
It returns the table below, which has the number of applications grouped by status for the current month. However, I want to subtract the total count from a separate but related query from the internal_review number only. I want to count the number of rows with type = abc within the same table and for the same date range, and then subtract that amount from the internal_review number (type is a separate field). current_month_desired is how it should look.
status           current_month  current_month_desired
fail             22             22
internal_review  95             22
pass             146            146
UNTESTED: but maybe...
The intent here is to use a conditional aggregate (a CASE expression inside sum) so that only the rows you need are counted. This way, the subtraction is not needed in the first place.
SELECT application.status as status
-- rows that are both type 'abc' and internal_review contribute 0, all others 1
, sum(case when application.type = 'abc'
and application.status = 'internal_review' then 0
else 1 end) as "current_month"
FROM application
WHERE to_char(application.created, 'mon') = to_char('now'::timestamp - '1 month'::interval, 'mon')
and date_part('year',application.created) = date_part('year', CURRENT_DATE)
and application.job_status != 'expired'
GROUP BY application.status
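As a rough check of the conditional-sum idea, here is a minimal sketch that runs the same kind of CASE expression against a tiny in-memory table, using Python's sqlite3 module as a stand-in for PostgreSQL. The rows are hypothetical and the date and job_status filters are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE application (status TEXT, type TEXT)")

# Hypothetical sample: 4 internal_review rows, 3 of which have type 'abc'.
rows = (
    [("fail", "xyz")] * 2
    + [("internal_review", "abc")] * 3
    + [("internal_review", "xyz")] * 1
    + [("pass", "abc")] * 2
)
conn.executemany("INSERT INTO application VALUES (?, ?)", rows)

query = """
SELECT status,
       count(*) AS current_month,
       sum(CASE WHEN type = 'abc' AND status = 'internal_review'
                THEN 0 ELSE 1 END) AS current_month_desired
FROM application
GROUP BY status
ORDER BY status
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Only the internal_review group is reduced here; 'abc' rows with any other status still count, which matches the requirement that the subtraction apply to that one status.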

If today's results are blank, show the totals from yesterday

My code is an accumulated total of revenue over a period of time. If a single day is blank (no revenue for that day) I need it to show the totals from the day before. CASE WHEN (today is blank), Yesterday's data ELSE Today's Total
I am not sure what the syntax is on this one.
select distinct
date_trunc('day',admit_date) as admit_date,
revenue,
sum(revenue) over(order by admit_date) as running_rev
from dailyrev
order by admit_date
Expected Results:
Day 1: $100
Day 2: $200
Day 3: (no data so show Day 2 data) $200
Maybe this is what you need:
SELECT admit_date,
prev_revs[cardinality(prev_revs)] AS adj_revenue,
sum(prev_revs[cardinality(prev_revs)])
OVER (ORDER BY admit_date) AS running_sum
FROM (SELECT date_trunc('day', admit_date) AS admit_date,
array_remove(array_agg(revenue)
OVER (order by admit_date),
NULL) AS prev_revs
FROM dailyrev) AS q
ORDER BY admit_date;
Unfortunately, PostgreSQL doesn't yet support the IGNORE NULLS clause; with it, this would have been simpler.
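The array_remove / cardinality trick above amounts to "take the most recent non-NULL revenue seen so far". A minimal Python sketch of that logic, assuming rows already sorted by date and using hypothetical values, might look like this (last_non_null is an illustrative helper, not part of any library):

```python
from itertools import accumulate

def last_non_null(values):
    """For each position, return the most recent non-None value seen so far."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical daily revenue, ordered by date; None marks a day with no data.
daily_revenue = [100, 200, None]

adj_revenue = last_non_null(daily_revenue)

# Running sum over the adjusted values; a leading None is treated as 0,
# in the same spirit as SQL's sum() ignoring NULLs.
running_sum = list(accumulate(v if v is not None else 0 for v in adj_revenue))

print(adj_revenue, running_sum)
```

Whether the running total should grow on a blank day or stay flat depends on what "show the totals from yesterday" means for your report, so treat the second list as one possible reading.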
I am not sure if this is what you want, but try this:
SELECT
gs.date::date AS admit_date,
(SELECT revenue FROM dailyrev WHERE admit_date::date = gs.date) AS revenue,
(SELECT SUM(revenue) FROM dailyrev WHERE admit_date::date <= gs.date) AS accumulated_total
FROM
generate_series(
(SELECT MIN(admit_date::date) FROM dailyrev),
(SELECT MAX(admit_date::date) FROM dailyrev),
INTERVAL '1 day'
) AS gs(date)
ORDER BY gs.date::date;
Yes, it does not look that nice, but..

Using Lag() function to retrieve values across dates

I am trying to use the LAG() and LEAD() functions in postgres to retrieve values from other rows/records in a table and I am running into some difficulty. The functionality works as intended as long as the LAG or LEAD function is only looking at dates within the same month (i.e. June 2nd can look back to June 1st, but when I try to look back to May 31st, I retrieve a NULL value).
Here's what the table looks like
_date count_daily_active_users count_new_users day1_users users_arriving_today_who_returned_tomrrow day_retained_users
5/27/2013 1742 335 266 207 0.617910448
5/28/2013 1768 241 207 146 0.605809129
5/29/2013 1860 272 146 161 0.591911765
5/30/2013 2596 841 161 499 0.59334126
5/31/2013 2837 703 499 NULL NULL
6/1/2013 12881 10372 0 5446 0.525067489
6/2/2013 14340 6584 5446 2781 0.422387606
6/3/2013 12222 3690 2781 1494 0.404878049
6/4/2013 25861 17254 1494 8912 0.516517909
From that table you can see that on May 31st when I try to 'look ahead' to June 1st to retrieve the number of users who arrived for the first time on May 31st and then returned again on June 1st I get a NULL value. This happens at every month boundary and it happens regardless of the number of days I try to 'look ahead'. So if I look ahead two days, then I'd have NULLs for May 30th and May 31st.
Here's the SQL I wrote
SELECT
timestamp_session::date AS _date
, COUNT(DISTINCT dim_player_key) AS count_daily_active_users
, COUNT(DISTINCT CASE WHEN days_since_birth = 0 THEN dim_player_key ELSE NULL END) AS count_new_users
, COUNT(DISTINCT CASE WHEN days_since_birth != 0 THEN dim_player_key ELSE NULL END) AS count_returning_users
, COUNT(DISTINCT CASE WHEN days_since_birth = 1 THEN dim_player_key ELSE NULL END) AS day1_users -- note: the function is a LAG function instead of a LEAD function because of the sort order
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 0 THEN dim_player_key ELSE NULL END), 1) OVER (order by _date)::float, 0)) as AA
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 1 THEN dim_player_key ELSE NULL END), 1) OVER (order by _date)::float, 0)) as AB
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 0 THEN dim_player_key ELSE NULL END), 0) OVER (order by _date)::float, 0)) as BB
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 1 THEN dim_player_key ELSE NULL END), 0) OVER (order by _date)::float, 0)) as BA
FROM ( SELECT sessions_table.account_id AS dim_player_key,
sessions_table.session_id AS dim_session_key,
sessions_table.title_id AS dim_title_id,
sessions_table.appid AS dim_app_id,
sessions_table.loginip AS login_ip,
sessions_table.logindate AS timestamp_session,
birthdate_table.birthdate AS timestamp_birthdate,
EXTRACT(EPOCH FROM (sessions_table.logindate - birthdate_table.birthdate)) AS count_age_in_seconds,
(date_part('day', sessions_table.logindate)- date_part('day', birthdate_table.birthdate)) AS days_since_birth
FROM
dataset.tablename1 AS sessions_table
JOIN (
SELECT
account_id,
MIN(logindate) AS birthdate
FROM
dataset.tablename1
GROUP BY
account_id )
-- call this sub-table the birthdate_table
birthdate_table ON
sessions_table.account_id = birthdate_table.account_id
-- call this table the outer_sessions_table
) AS outer_sessions_table
GROUP BY
_date
ORDER BY
_date ASC
I think that what I probably need to do is add an additional field in the inner select that reports the date as an integer value, something like the epoch time for that date at midnight. But when I tried that (adding a per-day epoch time), it changed all of the values in the output table to 1, and I don't understand why.
Can anyone help me out?
Thanks,
Brad
The problem was with the days_since_birth calculation. I was using
(date_part('day',
sessions_table.logindate)- date_part('day',
birthdate_table.birthdate)) AS days_since_birth
as though it were subtracting the absolute dates to give me the difference between them in days, but it just converts each date to its day of the month and subtracts those, so at the month rollover it returns -27, -29, or -30 (depending on the month). Wrapping it in ABS only turns -30 into 30, not 1; subtracting the dates themselves, e.g. sessions_table.logindate::date - birthdate_table.birthdate::date, gives the actual difference in days.
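The arithmetic behind this can be sketched in a few lines of Python, with hypothetical dates on either side of a month boundary. It compares the day-of-month subtraction against a true date difference, and suggests that ABS alone would not recover the real value.

```python
from datetime import date

# Hypothetical player: first seen on May 31st, returns on June 1st.
birthdate = date(2013, 5, 31)
login = date(2013, 6, 1)

# Mirrors date_part('day', logindate) - date_part('day', birthdate):
# only the day-of-month numbers are subtracted.
day_of_month_diff = login.day - birthdate.day

# Wrapping the result in ABS just flips the sign.
abs_diff = abs(day_of_month_diff)

# Subtracting the dates themselves yields the real number of elapsed days.
true_diff = (login - birthdate).days

print(day_of_month_diff, abs_diff, true_diff)
```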

Update Redshift table from query

I'm trying to update a table in Redshift from query:
update mr_usage_au au
inner join(select mr.UserId,
date(mr.ActionDate) as ActionDate,
count(case when mr.EventId in (32) then mr.UserId end) as Moods,
count(case when mr.EventId in (33) then mr.UserId end) as Activities,
sum(case when mr.EventId in (10) then mr.Duration end) as Duration
from mr_session_log mr
where mr.EventTime >= current_date - interval '1 days' and mr.EventTime < current_date
Group By mr.UserId,
date(mr.ActionDate)) slog on slog.UserId=au.UserId
and slog.ActionDate=au.Date
set au.Moods = slog.Moods,
au.Activities=slog.Activities,
au.Durarion=slog.Duration
But I receive the following error:
ERROR: syntax error at or near "au".
This is completely invalid syntax for Redshift (or Postgres). Reminds me of SQL Server ...
Should work like this (at least on current Postgres):
UPDATE mr_usage_au
SET Moods = slog.Moods
, Activities = slog.Activities
, Durarion = slog.Duration
FROM (
select UserId
, ActionDate::date
, count(CASE WHEN EventId = 32 THEN UserId END) AS Moods
, count(CASE WHEN EventId = 33 THEN UserId END) AS Activities
, sum(CASE WHEN EventId = 10 THEN Duration END) AS Duration
FROM mr_session_log
WHERE EventTime >= current_date - 1 -- just subtract integer from a date
AND EventTime < current_date
GROUP BY UserId, ActionDate::date
) slog
WHERE slog.UserId = mr_usage_au.UserId
AND slog.ActionDate = mr_usage_au.Date;
This is generally the case for Postgres and Redshift:
Use a FROM clause to join in additional tables.
You cannot table-qualify target columns in the SET clause.
Also, Redshift was forked from PostgreSQL 8.0.2, which is very long ago. Only some later updates to Postgres were applied.
For instance, Postgres 8.0 did not yet allow a table alias in an UPDATE statement, which is the reason behind the error you see.
I simplified some other details.

count data in current month - not 30 days back Postgres statement

I have this query, which returns data for 30 days back from the current date. I need to modify it to return data for the current month only, not 30 days back from the current date.
SELECT count(1) AS counter FROM users.logged WHERE createddate >=
date_trunc('month', CURRENT_DATE);
Any tips on how to tweak this query? It is based on Postgres.
regards
Something like this should work.
SELECT count(1) AS counter
FROM users.logged
WHERE date_trunc('month', createddate) = date_trunc('month', current_date);
It is already supposed to return the values in current month. Truncation does the conversion 10 Nov 2013 14:16 -> 01 Nov 2013 00:00 and it will return the data since the beginning of this month. The problem seems to be something else.
"I have this query, which returns data for 30 days back from the current date. I need to modify it to return data for the current month only, not 30 days back from the current date."
That's incorrect. Your query:
SELECT count(1) AS counter FROM users.logged WHERE createddate >= date_trunc('month', CURRENT_DATE);
returns all rows with createddate >= Nov 1st 00:00:00, in other words exactly what you say you want. Otherwise, you've simplified your query and left out the more important bits, namely the ones that are broken. If not:
It might be that you have dates in the future and that you're getting incorrect counts as a result. If so, add an additional criterion to the WHERE clause:
AND createddate < date_trunc('month', CURRENT_DATE) + interval '1 month'
It might also be that your sample data has a bizarre row with a time zone such that the timestamp looks like it is from this month, but the date/time arithmetic lands it in last month.
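The half-open range described above (from the first instant of this month up to, but not including, the first instant of next month) can be sketched in Python as follows. The helper names month_bounds and in_current_month are hypothetical, and time zones are deliberately left out.

```python
from datetime import datetime

def month_bounds(now):
    """Return (start, end) of the month containing `now`; end is exclusive."""
    # Equivalent in spirit to date_trunc('month', now)
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Equivalent in spirit to date_trunc('month', now) + interval '1 month'
    if start.month == 12:
        end = start.replace(year=start.year + 1, month=1)
    else:
        end = start.replace(month=start.month + 1)
    return start, end

def in_current_month(ts, now):
    """True if `ts` falls inside the month containing `now`."""
    start, end = month_bounds(now)
    return start <= ts < end

now = datetime(2013, 11, 10, 14, 16)
print(month_bounds(now))
print(in_current_month(datetime(2013, 11, 1), now))        # first instant of the month
print(in_current_month(datetime(2013, 10, 31, 23, 59), now))  # last month
print(in_current_month(datetime(2013, 12, 1), now))        # future row, excluded
```

Comparing the column against two fixed bounds, rather than wrapping the column in a function, also tends to let a plain index on the timestamp column be used.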
This will give you data for the current month only. It extracts the month and year from the created date and compares each against the month and year of the current date.
SELECT count(1) AS counter
FROM users.logged
WHERE
EXTRACT(MONTH FROM createddate) = EXTRACT(MONTH FROM current_date)
AND EXTRACT(YEAR FROM createddate) = EXTRACT(YEAR FROM current_date);