Find last occurring value within record in PostgreSQL - postgresql

I'm not new to SQL, but I am new to PostgreSQL and am really struggling to adapt my current knowledge in a different environment.
I am trying to create a variable that captures whether someone stays active, skips, or churns, based on a 0/1 time series variable. For example, in the data below, my dataset would include the variables id, time, and voted, and I would create the variable "skipped":
id time voted skipped
1 1 1 active
1 2 0 skipped
1 3 1 active
2 1 1 active
2 2 0 churned
2 3 0 churned
3 1 1 active
3 2 1 active
3 3 0 churned
The rule for coding "skipped" is pretty simple: If 1 is the last record, the person is "active" and any zeroes count as "skipped", but if 0 is the last record, the person is "churned".
The record with id = 1 is a skip because voted is non-zero at time 3 after being 0 at time 2. In the other two cases, 0 is the final value, so they are "churned". Can anyone help? I've been noodling on it all day, and am hitting a wall.

This isn't particularly elegant, but it should meet your needs:
with votes as (
select
id, time, voted,
max(time) over (partition by id) as max_time
from voter_data
)
select
v1.id, v1.time, v1.voted,
case
when v1.voted = 1 then 'active'
when v2.voted = 1 then 'skipped'
else 'churned'
end as skipped
from
votes v1
join votes v2 on
v1.id = v2.id and
v1.max_time = v2.time
In a nutshell, we first figure out which is the last record for each voter id, and then we do a self-join on the resulting table to pick out only that last record.
There is a chance this could produce multiple results if the same ID can vote twice at the same time. If that's the case, you want row_number() instead of max().
Results on your data:
1 1 1 'active'
1 2 0 'skipped'
1 3 1 'active'
2 1 1 'active'
2 2 0 'churned'
2 3 0 'churned'
3 1 1 'active'
3 2 1 'active'
3 3 0 'churned'
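The row_number() variant mentioned above can be sketched as follows. This is a minimal sketch, not the answer's exact query: it runs on an in-memory SQLite database from Python as a stand-in for PostgreSQL (an assumption made so the example is self-contained), and the column alias rn is a hypothetical name.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the query uses only
# standard window-function syntax that both databases accept.
conn = sqlite3.connect(":memory:")
conn.execute("create table voter_data (id int, time int, voted int)")
conn.executemany(
    "insert into voter_data values (?, ?, ?)",
    [(1, 1, 1), (1, 2, 0), (1, 3, 1),
     (2, 1, 1), (2, 2, 0), (2, 3, 0),
     (3, 1, 1), (3, 2, 1), (3, 3, 0)],
)

query = """
with votes as (
  select id, time, voted,
         -- rn = 1 marks exactly one "last" row per id, even if times tie
         row_number() over (partition by id order by time desc) as rn
  from voter_data
)
select v1.id, v1.time, v1.voted,
       case
         when v1.voted = 1 then 'active'
         when v2.voted = 1 then 'skipped'
         else 'churned'
       end as skipped
from votes v1
join votes v2 on v1.id = v2.id and v2.rn = 1
order by v1.id, v1.time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because row_number() numbers rows 1, 2, 3, ... within each id, the join always matches a single row per voter, so no input row is duplicated.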

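Window functions can help readability by avoiding the self-join. LAST_VALUE needs an explicit frame here, because with ORDER BY the default frame stops at the current row.
WITH
add_last_voted_status AS (
SELECT
*
, LAST_VALUE(voted) OVER (
PARTITION BY id
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_voted_status
FROM voter_data
)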
SELECT
id
, time
, voted
, CASE
WHEN last_voted_status = 0
THEN 'churned'
WHEN last_voted_status = 1 AND voted = 1
THEN 'active'
WHEN last_voted_status = 1 AND voted = 0
THEN 'skipped'
ELSE '?'
END AS skipped
FROM add_last_voted_status
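One thing to verify when using LAST_VALUE: if the window has an ORDER BY and no explicit frame clause, the frame ends at the current row, so the function returns the current row's value rather than the last value in the partition. The sketch below compares the two frames side by side for id = 1 from the question. It assumes an in-memory SQLite database as a stand-in for PostgreSQL (both use the same default frame), and the aliases default_frame and full_frame are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table voter_data (id int, time int, voted int)")
conn.executemany(
    "insert into voter_data values (?, ?, ?)",
    [(1, 1, 1), (1, 2, 0), (1, 3, 1)],
)

rows = conn.execute("""
select time, voted,
       -- default frame: unbounded preceding .. current row
       last_value(voted) over (partition by id order by time) as default_frame,
       -- explicit frame covering the whole partition
       last_value(voted) over (
         partition by id order by time
         rows between unbounded preceding and unbounded following
       ) as full_frame
from voter_data
order by time
""").fetchall()

for row in rows:
    print(row)
```

At time 2 the default frame reports 0 (the row's own value), while the full frame reports 1, the voter's actual last value.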

Related

PostgreSQL group by and count on specific condition

I have the following tables (example)
Analyze_Line
id game_id bet_result game_type
1 1 WIN 0
2 2 LOSE 0
3 3 WIN 0
4 4 LOSE 0
5 5 LOSE 0
6 6 WIN 0
Game
id league_id home_team_id away_team_id
1 1 1 2
2 2 2 3
3 3 3 4
4 1 1 2
5 2 2 3
6 3 3 4
Required Data:
league_id WIN LOSE GameCnt
1 1 1 2
2 0 2 2
3 2 0 2
The Analyze_Line table is joined with the Game table, and I can easily get GameCnt by grouping by league_id, but I am not sure how to calculate the WIN and LOSE counts from bet_result.
You can use a conditional expression inside an aggregate function to count win and lose results per league.
select
g.league_id,
sum(case when a.bet_result = 'WIN' then 1 else 0 end) as win,
sum(case when a.bet_result = 'LOSE' then 1 else 0 end) as lose,
count(*) as gamecnt
from
game g
inner join analyze_line a on
g.id = a.game_id
group by
g.league_id
Since you don't mention your PostgreSQL version, I can't recommend the FILTER clause, since it isn't available in older versions and might not work for you.
Adding to Kamil's answer - PostgreSQL introduced the filter clause in PostgreSQL 9.4, released about eight years ago (December 2014). At this point, I think it's safe enough to use in answers. IMHO, it's a tad more elegant than summing over a case expression, but it does have the drawback that, although it is part of the SQL standard, few other databases support it, so it is less portable:
SELECT g.league_id,
COUNT(*) FILTER (WHERE a.bet_result = 'WIN') AS win,
COUNT(*) FILTER (WHERE a.bet_result = 'LOSE') AS lose,
COUNT(*) AS gamecnt
FROM game g
JOIN analyze_line a ON g.id = a.game_id
GROUP BY g.league_id
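One detail worth checking with the SUM(CASE ...) form is the ELSE branch: without it, a league with no matching rows yields NULL instead of 0. The sketch below loads the sample tables into an in-memory SQLite database (an assumption, used here as a stand-in for PostgreSQL) and shows both forms; the alias win_without_else is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table analyze_line (id int, game_id int, bet_result text, game_type int)")
conn.execute("create table game (id int, league_id int, home_team_id int, away_team_id int)")
conn.executemany(
    "insert into analyze_line values (?, ?, ?, ?)",
    [(1, 1, 'WIN', 0), (2, 2, 'LOSE', 0), (3, 3, 'WIN', 0),
     (4, 4, 'LOSE', 0), (5, 5, 'LOSE', 0), (6, 6, 'WIN', 0)],
)
conn.executemany(
    "insert into game values (?, ?, ?, ?)",
    [(1, 1, 1, 2), (2, 2, 2, 3), (3, 3, 3, 4),
     (4, 1, 1, 2), (5, 2, 2, 3), (6, 3, 3, 4)],
)

rows = conn.execute("""
select g.league_id,
       -- no ELSE: every non-matching row contributes NULL
       sum(case when a.bet_result = 'WIN' then 1 end) as win_without_else,
       sum(case when a.bet_result = 'WIN' then 1 else 0 end) as win,
       sum(case when a.bet_result = 'LOSE' then 1 else 0 end) as lose,
       count(*) as gamecnt
from game g
join analyze_line a on g.id = a.game_id
group by g.league_id
order by g.league_id
""").fetchall()

for row in rows:
    print(row)
```

League 2 has no wins, so win_without_else comes back as None in Python (NULL in SQL), while the ELSE 0 form gives the 0 shown in the required output.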

Postgresql query first and last in every range

I have table
id machineid reset
1 1 false
2 1 false
3 1 false
4 1 true
5 1 false
15 1 true
17 1 false
20 2 false
21 2 false
25 2 false
30 2 false
I can't figure out how to find the first and last id of every range for each machine. A reset starts a new range for the following rows. The result should look like:
machineid startid endid
1 1 3
1 4 5
1 15 17
2 20 30
You can start by grouping your records into ranges. Since the order of your records matters, you can make use of window functions. You have to decide how to uniquely identify these groups. I suggest using the number of resets seen so far for the machine, up to and including the current record. This results in the following statement:
SELECT *
, SUM(case when reset then 1 else 0 end) over (partition by machineid order by id) as reset_group
FROM
test;
After that finding the start and end ids is a simple GROUP BY statement:
SELECT
machineid, MIN(id) as startid, MAX(id) as endid
FROM (
SELECT machineid, id
, SUM(case when reset then 1 else 0 end) over (partition by machineid order by id) as reset_group
FROM
test
) as grouped
GROUP BY
machineid, reset_group
ORDER BY
machineid, startid;
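The two steps above can be checked against the sample data with a short script. This is a sketch under assumptions: it uses an in-memory SQLite database in place of PostgreSQL and stores reset as 0/1 integers, since SQLite has no separate boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, machineid int, reset int)")
conn.executemany(
    "insert into test values (?, ?, ?)",
    [(1, 1, 0), (2, 1, 0), (3, 1, 0), (4, 1, 1), (5, 1, 0),
     (15, 1, 1), (17, 1, 0),
     (20, 2, 0), (21, 2, 0), (25, 2, 0), (30, 2, 0)],
)

# Step 1: running count of resets per machine labels each range.
groups = conn.execute("""
select id, machineid,
       sum(case when reset then 1 else 0 end)
         over (partition by machineid order by id) as reset_group
from test
order by machineid, id
""").fetchall()

# Step 2: first and last id within each (machineid, reset_group).
ranges = conn.execute("""
select machineid, min(id) as startid, max(id) as endid
from (
  select machineid, id,
         sum(case when reset then 1 else 0 end)
           over (partition by machineid order by id) as reset_group
  from test
) as grouped
group by machineid, reset_group
order by machineid, startid
""").fetchall()

for row in ranges:
    print(row)
```

The running sum increases by one on every reset row, so the reset row itself becomes the first row of its new range, which matches the expected startid values of 4 and 15.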

Remove duplicates based on only 1 column

My data is in the following format:
rep_id user_id other non-duplicated data
1 1 ...
1 2 ...
2 3 ...
3 4 ...
3 5 ...
I am trying to create a 0/1 column deduped_rep such that only the first row for each rep_id across its associated users gets a 1 and the rest get 0.
Expected result:
rep_id user_id deduped_rep
1 1 1
1 2 0
2 3 1
3 4 1
3 5 0
For reference, in Excel, I would use the following formula:
IF(SUMPRODUCT(($A$2:$A2=A2)*($A$2:$A2=A2))>1,0,1)
I know there is the FIXED() LoD calculation http://kb.tableau.com/articles/howto/removing-duplicate-data-with-lod-calculations, but I only see use cases of it deduplicating based on another column. However, mine are distinct.
Define a field first_reg_date_per_rep_id as
{ fixed rep_id : min(registration_date) }
Then define a field is_first_reg_date? as
registration_date = first_reg_date_per_rep_id
You can use that last Boolean field to distinguish the first record for each rep_id from later ones.
Try this query (replace my_table with your table name):
select
rep_id,
user_id,
case when row_number() over (partition by rep_id order by user_id) = 1 then 1 else 0 end as deduped_rep
from
my_table
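To get the 0/1 flag from the expected result, the row number can be compared with 1, since row_number() on its own returns 1, 2, 3, ... within each rep_id. The sketch below is a minimal example on the sample data, assuming an in-memory SQLite database and a hypothetical table name reps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "reps" is a hypothetical table name; the question does not give one.
conn.execute("create table reps (rep_id int, user_id int)")
conn.executemany(
    "insert into reps values (?, ?)",
    [(1, 1), (1, 2), (2, 3), (3, 4), (3, 5)],
)

rows = conn.execute("""
select rep_id, user_id,
       -- 1 for the first user of each rep_id, 0 for every later one
       case when row_number() over (partition by rep_id order by user_id) = 1
            then 1 else 0 end as deduped_rep
from reps
order by rep_id, user_id
""").fetchall()

for row in rows:
    print(row)
```

Ordering by user_id inside the window is one possible definition of "first"; any other column that fixes the order within a rep_id would work the same way.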

how to combine multiple query into one single query

I have three queries as below and I need to combine them into one. Does anybody know how to do that?
select COUNT(*) from dbo.VWAnswer where questionId =2 and answer =1
select COUNT(*) from dbo.VWAnswer where questionId =3 and answer =4
select COUNT(*) from dbo.VWAnswer where questionId =5 and answer =2
I want to find the total count of people whose gender = 1, education = 4, and marital status = 2.
These are the columns of the table I am querying, with some example rows:
questionId questionText anwser AnserSheetID
1 Gender 1 1
2 Qualification 4 1
3 Marital Status 2 1
1 Gender 2 2
2 Qualification 1 2
3 Marital Status 2 2
1 Gender 1 3
2 Qualification 3 3
3 Marital Status 1 3
Basically, these are questions answered by different people whose answers are stored in this table.
So if we consider above table entries I should get 1 as total count based upon above 3 conditions i.e. gender = 1 and Education = 4 and marital status = 2
Can someone tell me what I need to do to get this to work?
If you want to combine your three count queries, you can try the below SQL to get it done.
select
sum(case when questionId =2 and anwser=1 then 1 else 0 end) as FCount,
sum(case when questionId =3 and anwser=4 then 1 else 0 end) as SCount,
sum(case when questionId =5 and anwser=2 then 1 else 0 end) as TCount
from dbo.VWAnswer
Update 1:
select
Sum(case when questionText='Gender' and anwser='1' then 1 else 0 end) as GenderCount,
Sum(case when questionText='Qualification' and anwser='4' then 1 else 0 end) as EducationCount,
Sum(case when questionText='Marital Status' and anwser='2' then 1 else 0 end) as MaritalCount
from VWAnswer
These counts are computed row by row, so each condition can only test the values within a single row.
You might use a self-join that meets your conditions and select the count of the rows that fit them.
Select COUNT(*) as cnt from
(
Select a.AnserSheetID
from VWAnswer a
Join VWAnswer b on a.AnserSheetID=b.AnserSheetID and b.questionId = 2 and b.anwser=4
Join VWAnswer c on a.AnserSheetID=c.AnserSheetID and c.questionId = 3 and c.anwser=2
where a.questionId=1 and a.anwser=1
) hlp
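The self-join can be checked against the sample rows, where only answer sheet 1 meets all three conditions. The sketch below assumes an in-memory SQLite database in place of SQL Server and creates VWAnswer as a plain table. It also shows a GROUP BY / HAVING variant, which is a different technique from the join above and is included only for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table VWAnswer (questionId int, questionText text, anwser int, AnserSheetID int)"
)
conn.executemany(
    "insert into VWAnswer values (?, ?, ?, ?)",
    [(1, 'Gender', 1, 1), (2, 'Qualification', 4, 1), (3, 'Marital Status', 2, 1),
     (1, 'Gender', 2, 2), (2, 'Qualification', 1, 2), (3, 'Marital Status', 2, 2),
     (1, 'Gender', 1, 3), (2, 'Qualification', 3, 3), (3, 'Marital Status', 1, 3)],
)

# Self-join: one alias per condition, matched on the same answer sheet.
join_count = conn.execute("""
select count(*) from (
  select a.AnserSheetID
  from VWAnswer a
  join VWAnswer b on a.AnserSheetID = b.AnserSheetID and b.questionId = 2 and b.anwser = 4
  join VWAnswer c on a.AnserSheetID = c.AnserSheetID and c.questionId = 3 and c.anwser = 2
  where a.questionId = 1 and a.anwser = 1
) hlp
""").fetchone()[0]

# Variant: keep rows that satisfy any one condition, then require
# that a sheet satisfied all three distinct questions.
having_count = conn.execute("""
select count(*) from (
  select AnserSheetID
  from VWAnswer
  where (questionId = 1 and anwser = 1)
     or (questionId = 2 and anwser = 4)
     or (questionId = 3 and anwser = 2)
  group by AnserSheetID
  having count(distinct questionId) = 3
) hlp
""").fetchone()[0]

print(join_count, having_count)
```

Both queries return the same count here; the HAVING form may be easier to extend when the number of conditions grows, since it needs no extra join per condition.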

CRM Reports: Grouping by a related Entity

There is an N:N relationship between Contacts and Complaints.
My report currently looks like this:
Status 1 Status 2 Status 3 Status 4
3 4 32 34
With the following query:
SELECT
SUM(case WHEN status = 1 then 1 else 0 end) Status1,
SUM(case WHEN status = 2 then 1 else 0 end) Status2,
SUM(case WHEN status = 3 then 1 else 0 end) Status3,
SUM(case WHEN status = 4 then 1 else 0 end) Status4,
SUM(case WHEN status = 5 then 1 else 0 end) Status5
FROM [DB].[dbo].[Contact]
This is listing the number of contacts in each status. I'm now trying to GROUP BY a field in a related entity in CRM - complaints.
Status 1 Status 2 Status 3 Status 4
Contact.Complaints.CreatedBy[1] 3 4 32 34
Contact.Complaints.CreatedBy[2] 3 4 32 34
Contact.Complaints.CreatedBy[3] 3 4 32 34
Contact.Complaints.CreatedBy[4] 3 4 32 34
I'm not sure where to get started in my GROUP BY statement - any pointers would be awesome. I feel like I have to have another FROM statement pointing to the NN relationship, or at least Complaints.
It should be as easy as adding a JOIN to Complaints (through the N:N table). I completely agree with James: just make sure you execute the report as a CRM user, otherwise the Filtered views return 0 rows.
SELECT
MyComplaintType,
-- ...existing Sum(Case) stuff
FROM
FilteredContacts c
JOIN
Filterednew_Contacts_new_Complaint_new_complaints r1 -- whatever your N:N is
ON c.contactId = r1.contactId
JOIN
Filterednew_Complaint comp
ON r1.new_complaintId = comp.new_complaintId
GROUP BY
MyComplaintType
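The shape of that join can be tried out on toy data. Everything in the sketch below is hypothetical: the table names contact, contact_complaint, and complaint, the createdby values, and the rows themselves are invented for illustration and are not the real CRM filtered views. It runs on an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for the contact view, the N:N view, and the complaint view.
conn.execute("create table contact (contactid int, status int)")
conn.execute("create table contact_complaint (contactid int, complaintid int)")
conn.execute("create table complaint (complaintid int, createdby text)")
conn.executemany("insert into contact values (?, ?)", [(1, 1), (2, 2), (3, 1)])
conn.executemany("insert into complaint values (?, ?)", [(10, 'alice'), (11, 'bob')])
conn.executemany(
    "insert into contact_complaint values (?, ?)",
    [(1, 10), (2, 10), (3, 11), (1, 11)],
)

rows = conn.execute("""
select comp.createdby,
       sum(case when c.status = 1 then 1 else 0 end) as status1,
       sum(case when c.status = 2 then 1 else 0 end) as status2
from contact c
join contact_complaint r1 on c.contactid = r1.contactid
join complaint comp on r1.complaintid = comp.complaintid
group by comp.createdby
order by comp.createdby
""").fetchall()

for row in rows:
    print(row)
```

Note that contact 1 is linked to complaints from both creators and is therefore counted once under each. A contact linked to several complaints by the same creator would be counted several times in that creator's row, so the sums count contact-complaint links rather than distinct contacts.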