User Sessions | Months Since Last Active Using SQL - postgresql

UserID | CalMonth | ActiveFlag | Months_since_last_active
A      | 1/1/2021 | 1          | 0
A      | 2/1/2021 | 0          | 1
A      | 3/1/2021 | 0          | 2
A      | 4/1/2021 | 1          | 0
B      | 1/1/2021 | 1          | 0
B      | 2/1/2021 | 0          | 1
B      | 3/1/2021 | 1          | 0
Problem --> The first 3 columns are given. Generate the last one, 'Months_since_last_active', by adding 1 each month until the user is active again.
My solution is as below:
With active_sessions as (
    Select
        User_Id
        , CalMonth
        , ActiveFlag as current_flag
        , LAG(ActiveFlag, 1) over (partition by User_Id order by CalMonth) as previous_flag
)
Select User_Id, CalMonth, current_flag,
       sum(case when current_flag = 1 then 0
                when current_flag IS NULL then Months_since_last_active + 1
           END) as Months_since_last_active
from active_sessions
order by 1, 2
I was asked the above question in an interview and was told that my proposed solution would not work because:
When it comes to 3/1/2021 and beyond, the previous values of 'Months_since_last_active' are not in the table yet -- they are only in the code
If I wanted to use the LAG function, it would take innumerable LAG functions to achieve what I was trying to do
I would appreciate it if someone could comment on my solution.

Your solution has 3 major problems, 2 of which may just be copy/paste errors. The active_sessions CTE is missing its FROM clause, so there is no data source. The main query then uses the aggregate function SUM, yet has no GROUP BY, which an aggregate requires. These two are easily corrected; a sketch with just those mechanical fixes applied follows (the source table name active_month is assumed, and the SUM expression is left out because its deeper problem is discussed next).
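With active_sessions as (
    Select
        User_Id
        , CalMonth
        , ActiveFlag as current_flag
        , LAG(ActiveFlag, 1) over (partition by User_Id order by CalMonth) as previous_flag
    from active_month  -- the missing FROM clause
)
Select User_Id, CalMonth, current_flag, previous_flag
from active_sessions
order by 1, 2;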
The third problem concerns the LAG function and your use of it. First off, in the CTE you alias the result as previous_flag, but in the main query you reference Months_since_last_active, which does not exist yet. I think this is the source of the interviewer's first point.
The interviewer's second point also stems from the LAG function. As written it always looks back exactly 1 row from the current row, yet it needs to look back 2 rows for (userid, calmonth) = ('A', 2021-03-01), 3 rows for the next month of a longer gap, and so on. Basically you need to look back to the last row with active_flag = 1. This leads directly to the "it'd take innumerable LAG functions" remark, as you do not know how far back you need to look. Suppose you had 30-40 or more inactive rows between active rows: you would need a LAG(activeflag, n) ... for each possibility.
A solution. I dislike the problem statement; it should not contain "by adding 1 until the user is active again" (is it yours or theirs?). Either way this is an XY problem. If theirs, they should be telling you what to solve, i.e. find the number of months since last active. If yours, you have created the problem for yourself. The problem statement should not say anything about how to solve it. I will ignore that portion of the problem (and in a real interview I would have ignored it, but be prepared to explain why).
What you have is a version of Gaps and Islands (google it, you will find plenty more to think about). In this version, let's consider each row with activeflag = 1 an island, and anything else a gap. Now what you are looking for is the length of the gap since the last island. In the following, the island_num CTE does 2 things: it assigns a sequence number to each row per userid (ordered by calmonth) and generates a boolean marking each island. The gap_points CTE then joins the result with itself, selecting the row number assigned to the latest island whose calmonth is less than the current row's calmonth. In the main query, Months_since_last_active is assigned 0 if the current row is an island, and the difference between the generated row numbers if it is a gap. (see demo)
with island_num (userid, cal_month, active_flag, is_island, row_num) as
     ( select am.*
            , case when am.activeflag = 1 then true else false end is_island
            , row_number() over (partition by am.userid order by am.calmonth) rn
         from active_month am
     ) -- select * from island_num
   , gap_points (userid, cal_month, active_flag, is_island, row_num, island_row) as
     ( select *
         from island_num i1
         join lateral
              ( select max(row_num)
                  from island_num i2
                 where i1.userid = i2.userid
                   and i2.cal_month < i1.cal_month
                   and i2.is_island
              ) s0
           on true
     ) -- select * from gap_points;
select userid      "User Id"
     , cal_month   "Cal Month"
     , active_flag "Active Flag"
     , case when is_island then 0
            else row_num - island_row
       end         "Months_since_last_active"
  from gap_points;
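For completeness, a window-function-only sketch of the same Gaps and Islands idea (same assumed active_month table; note it presumes each user's first month is active, so leading gaps would need extra handling). A running count of active rows labels each island together with the gap that follows it, and the row number within that label, minus 1, is the months since last active:
with grp as
( select userid, calmonth, activeflag
       , count(*) filter (where activeflag = 1)
           over (partition by userid order by calmonth) as island_id  -- running count of islands seen so far
    from active_month
)
select userid, calmonth, activeflag
     , row_number() over (partition by userid, island_id
                          order by calmonth) - 1 as months_since_last_active
  from grp
order by userid, calmonth;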

Related

Select until row matches in postgresql?

Is there a way to select rows until some condition is met? I.e. a kind of LIMIT, but not limited to N rows; rather, all the rows up to the first non-matching row?
For example, say I have the table:
CREATE TABLE t (id SERIAL PRIMARY KEY, rank INTEGER, value INTEGER);
INSERT INTO t (rank, value) VALUES ( 1, 1), (2, 1), (2,2),(3,1);
that is:
test=# SELECT * FROM t;
 id | rank | value
----+------+-------
  1 |    1 |     1
  2 |    2 |     1
  3 |    2 |     2
  4 |    3 |     1
(4 rows)
I want to order by rank, and select up until the first row that is over 1.
I.e. SELECT * FROM t ORDER BY rank UNTIL value>1
and I want the first 2 rows back?
One solution is to use a subquery and the bool_and window aggregate:
SELECT * FROM
( SELECT id, rank, value, bool_and(value<2) OVER (order by rank, id) AS ok FROM t ORDER BY rank) t2
WHERE ok=true
But won't that end up going through all rows, even if I only want a handful?
(real world context: I have timestamped events in a table. I can use a lead/lag window query to select the time between two events, and I want all events from now going back as long as they happened less than 10 minutes apart; the lead/lag window query complicates things, hence the simplified example here)
edit: made window-function order by rank, id
What you want is a sort of stop condition. As far as I am aware there is no such thing in SQL, at least not in PostgreSQL's dialect.
What you can do is use a PL/PgSQL procedure to read rows from a cursor and return them until the stop condition is met. It won't be super fast, but it'll be alright. It's just a FOR loop over a query with an IF expression THEN exit; ELSE return next; END IF;. No explicit cursor is required because PL/PgSQL will use one internally if you FOR loop over a query.
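A minimal sketch of that loop, assuming the t table from the question and value > 1 as the stop condition (the function name is made up):
create or replace function rows_until_stop()
returns setof t
language plpgsql as
$$
declare
    r t%rowtype;
begin
    -- PL/pgSQL opens an internal cursor for this FOR loop
    for r in select * from t order by rank, id loop
        if r.value > 1 then
            exit;         -- stop condition met: fetch no further rows
        end if;
        return next r;    -- otherwise emit the row and continue
    end loop;
end;
$$;
-- usage: select * from rows_until_stop();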
Another option is to create a cursor and read chunks of rows from it in the application, then discard part of the last chunk once the stop condition is met.
Either way, a cursor is going to be what you want.
A stop expression wouldn't actually be too hard to implement in PostgreSQL by the way. You'd have to implement a new executor node type, but the new CustomScan support would make that practical to do in an extension. Then you'd just evaluate an expression to decide whether or not to carry on fetching rows.
You can try something such as:
select * from t, (
select rank from t where value = 1 order by "rank" limit 1) x
where t.rank <= x.rank order by rank;
It will make two passes through the first part of the table (which you might be able to cut by creating an index on (rank, value = 1)) but shouldn't evaluate the rest of the table if you have an index on rank.
[If you could have window expressions in where clauses you could use a window expression to make sure any previous rows didn't have value = 1.. but even if this were possible, then getting the query evaluator to use to limit search would be yet another challenge.]
This may be no better than your solution, since you yourself asked, "won't that end up going through all rows?"
I can tell you this -- the explain plan is different from your solution's. I don't know how the guts of PostgreSQL work, but if I were writing a "max" function, I would think it would always be O(n). By contrast, you had an ORDER BY, which is average case O(n log n), worst case O(n^2).
That said, I cannot deny that this will go through all rows:
select * from sandbox.t
where id < (select min (id) from sandbox.t where value > 1)
One thing to clarify, though, is that unless you scan all rows, I'm not sure how you could determine the minimum value. Any time you invoke an aggregate concept across all records, doesn't that mean that you must read all rows?

Select unique values sorted by date

I am trying to solve an interesting problem. I have a table that has, among other data, these columns (dates in this sample are shown in European format - dd/mm/yyyy):
n_place_id   dt_visit_date
(integer)    (date)
==========   =============
 1           10/02/2012
 3           11/03/2012
 4           11/05/2012
13           14/06/2012
 3           04/10/2012
 3           03/11/2012
 5           05/09/2012
13           18/08/2012
Basically, each place may be visited multiple times - and the dates may be in the past (completed visits) or in the future (planned visits). For the sake of simplicity, today's visits are part of planned future visits.
Now, I need to run a select on this table, which would pull unique place IDs from this table (without date) sorted in the following order:
Future visits go before past visits
Future visits take precedence in sorting over past visits for the same place
For future visits, the earliest date must take precedence in sorting for the same place
For past visits, the latest date must take precedence in sorting for the same place.
For example, for the sample data shown above, the result I need is:
5 (earliest future visit)
3 (next future visit into the future)
13 (latest past visit)
4 (previous past visit)
1 (earlier visit in the past)
Now, I can achieve the desired sorting using case when in the order by clause like so:
select
n_place_id
from
place_visit
order by
(case when dt_visit_date >= now()::date then 1 else 2 end),
(case when dt_visit_date >= now():: date then 1 else -1 end) * extract(epoch from dt_visit_date)
This sort of does what I need, but it contains repeated IDs, whereas I need unique place IDs. If I try to add DISTINCT to the select statement, Postgres complains that the ORDER BY expressions must appear in the select list -- but then the distinct result won't be sensible any more, as I'd have dates in there.
Somehow I feel that there should be a way to get the result I need in one select statement, but I can't get my head around how to do it.
If this can't be done, then, of course, I'll have to do the whole thing in the code, but I'd prefer to have this in one SQL statement.
P.S. I am not worried about the performance, because the dataset I will be sorting is not large. After the where clause will be applied, it will rarely contain more than about 10 records.
With DISTINCT ON you can easily show additional columns of the row with the resulting n_place_id:
SELECT n_place_id, dt_visit_date
FROM (
SELECT DISTINCT ON (n_place_id) *
,dt_visit_date < now()::date AS prio -- future first
, @ (now()::date - dt_visit_date) AS diff -- closest first
FROM place_visit
ORDER BY n_place_id, prio, diff
) x
ORDER BY prio, diff;
Effectively I pick the row with the earliest future date (including "today") per n_place_id - or latest date in the past, failing that.
Then the resulting unique rows are sorted by the same criteria.
FALSE sorts before TRUE
The "absolute value" # helps to sort "closest first"
More on the Postgres specific DISTINCT ON in this related answer.
Result:
 n_place_id | dt_visit_date
------------+---------------
          5 | 2012-09-05
          3 | 2012-10-04
         13 | 2012-08-18
          4 | 2012-05-11
          1 | 2012-02-10
Try this
select n_place_id
from
(
  select *,
         extract(epoch from (dt_visit_date - now())) as seconds,
         -- 0 for future visits, 2 for past visits
         1 - SIGN(extract(epoch from (dt_visit_date - now()))) as futurepast
  from place_visit
) v
group by n_place_id
order by min(futurepast), min(abs(seconds))

tsql random records extraction returns duplicates

I have to extract two random records from a table.
I have implemented something like this inside a stored procedure:
with tmpTable as(
SELECT top 500 [columns I need]
, row_number() over(order by [myColumn]) as rown
FROM SourceTable
JOIN [myJoin]
WHERE [myCondition]
)
-- here I extract with an interval of 10 records: 10, 20, 30, ..., 400, 410, ...
select * from tmpTable where rown = (1 + FLOOR(50*RAND()))*10
It works great, it extracts a random record among the first 500 records from my source.
But, when the sp is called from the presentation layer (ASP.NET 4.0, SqlClient ADO.NET), the same record is occasionally returned twice. Note that the two calls are independent from each other.
I guess this happens because the sp is called twice within a few milliseconds and the random generator produces the same number. No duplication occurs while debugging: the manual F10 steps, I guess, take more than a few milliseconds.
How may I obtain two different records?
EDIT
Lamak's answer calls for some more details. The source table is made up of records of products. Groups of about 10 records differ from each other only in some characteristics (e.g. color). The records are distributed in this way:
1 to 10: product 1
11 to 20: product 2
... and so on
So if I just took the first two random records, it would be highly likely that both records concern the same product. This is why I'm using an interval of 10 records in the random extraction.
If you are using SQL Server you can just do the following:
SELECT TOP 1 [columns I need]
FROM SourceTable
JOIN [myJoin]
ON [Something]
WHERE [MyCondition]
ORDER BY NEWID()
If you still want to first isolate the records to each product, you can try this:
SELECT TOP 1 *
FROM ( SELECT [columns I need], ROW_NUMBER() OVER(PARTITION BY Product ORDER BY NEWID()) Corr
FROM SourceTable
JOIN [myJoin]
ON [Something]
WHERE [MyCondition]) A
WHERE Corr = 1
ORDER BY NEWID()
Though you still can choose a record that has the same product as the first one.
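If the goal is exactly two records from two different products, a variation on the query above (a sketch, reusing the same placeholders) is to take both rows in a single statement so the two picks cannot collide:
SELECT TOP 2 *
FROM ( SELECT [columns I need]
            , ROW_NUMBER() OVER(PARTITION BY Product ORDER BY NEWID()) Corr
       FROM SourceTable
       JOIN [myJoin]
         ON [Something]
       WHERE [MyCondition]) A
WHERE Corr = 1       -- one random row per product
ORDER BY NEWID()     -- then two random products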

Crystal reports zero values

Hey guys,
So I have this report that I am grouping into different age buckets. I want the count of an age bucket to be zero if there are no rows associated with that age bucket. So I did an outer join in my database select and that works fine. However, I need to add a group based on another column in my database.
When I add this group, the age buckets that had no rows associated with them disappear. I thought it might be because the column that I was trying to group by was null for those rows, so I added a row number to my select and grouped by that (I basically just need to group by each row and I can't just put it in the details... I can explain more about this if necessary). But after adding the row number, the age buckets that have no data are still missing! When I remove this added group, I get all age buckets.
Any ideas? Thanks!!
It's because the outer join to age group is not also an outer join to whatever your other group is - you are only guaranteed to have one of each age group per data set, not one of each age group per [other group].
So if, for example, your other group is Region, you need a Cartesian / Cross join from your age range table to a Region table (so that you get every possible combination of age range and region), before outer joining to the rest of your dataset.
EDIT - based on the comments, a query like the following should work:
select date_helper.date_description, c.case_number, e.event_number
from
(select 0 range_start, 11 range_end, '0-10 days' date_description from dual union
select 11, 21, '11-20 days' from dual union
select 21, 31, '21-30 days' from dual union
select 31, 99999, '31+ days' from dual) date_helper
cross join case_table c
left outer join event_table e
on e.event_date <= date_helper.range_start*-1 + sysdate
and e.event_date > date_helper.range_end*-1 + sysdate
and c.case_number = e.case_number
(assuming that it's the event_date that needs to be grouped into buckets.)
I had trouble understanding your question.
I do know that Crystal Reports' NULL support is lacking in some pretty fundamental ways. So I usually try not to depend on it.
One way to approach this problem is to hard-code age ranges in the database query, e.g.:
SELECT p.person_type
     , SUM(CASE WHEN p.age <= 2 THEN 1 ELSE 0 END) AS "0-2"
     , SUM(CASE WHEN p.age BETWEEN 3 AND 17 THEN 1 ELSE 0 END) AS "3-17"
     , SUM(CASE WHEN p.age >= 18 THEN 1 ELSE 0 END) AS "18_and_over"
FROM person p
GROUP BY p.person_type
This way you are sure to get zeros where you want zeros.
I realize that this is not a direct answer to your question. Best of luck.

what is count(*) % 2 = 1

I see a query like
select *
from Table1
group by Step
having count(*) % 2 = 1
What is the trick behind having count(*) % 2 = 1?
Can anyone explain?
edit: What are the common usage areas?
Well, % is the modulo operator, which gives the remainder of a division, so it yields 0 when the count is exactly divisible by 2 (even) and 1 when it is not (odd). So the query selects the groups whose row count is odd (as said above).
Would that not be checking if you have an odd number of entries per step?
It will return all the steps which had an odd number of rows.
Just test it:
declare #t1 table (step char(1))
insert into #t1(step)
select 'a'
union all select 'b'
union all select 'b'
union all select 'c'
union all select 'c'
union all select 'c'
union all select 'd'
union all select 'd'
union all select 'd'
union all select 'd'
select * from @t1
group by step
having count(*)%2 = 1
That will return the values of column step that occur an odd number of times.
In this example it will return:
'a'
'c'
the select * is confusing here though and I would rather write it as
select step from @t1
group by step
having count(*)%2 = 1
or even for more visibility
select step, count(*) from @t1
group by step
having count(*)%2 = 1
A reason to do this:
Say you want to separate the odd and even entries into two columns. You could use the even counts for one of them and the odd counts for the other.
I also put this in a comment but wasn't getting a response.
The COUNT(*) will count all the rows in the database. The % is the modulo operator, which gives you the remainder of a division problem. So this is dividing all rows by two and returning those which have a remainder of 1 (meaning an odd number of rows).
As Erik pointed out, that would not be all the rows, but rather the ones grouped by step, meaning this is all the odd rows per step.
It's impossible for us to answer your question without knowing what the tables are used for.
For a given "Step" it might be that it is required to have an equal amount of "something" and that this will produce a list of elements to be displayed in some interface where this is not the case.
Example:
Let's forget "Steps" for a moment and assume this was a table of students and that "Step" was instead the "Group" the students are divided into. A requirement for a group is that there is an even number of students, because the students will work in pairs. For an administrative tool you could write a query like this to see a list of groups where this is not true.
Group, Count
A, 10
B, 9
C, 17
D, 8
E, 4
F, 5
And the query will return groups B, C, F
Thanks to everybody. All of you said the query returns the groups that have an odd row count.
But this is not the point! I will keep inspecting this case and try to work out the reasoning in the programmer's mind (if I can find who wrote this).
Lessons learned: programmers must write comments about stupid logic like that...