I have a dataframe with the following format...
id , name, start_date, end_date , active
1 , albert , 2019-08-14, 3499-12-31, 1
1 , albert , 2019-08-13, 2019-08-14, 0
1 , albert , 2019-06-26, 2019-08-13, 0
1 , brian , 2018-01-17, 2019-06-26, 0
1 , brian , 2017-07-31, 2018-01-17, 0
1 , albert , 2017-03-31, 2018-07-31, 0
2 , diane , 2019-07-14, 3499-12-31, 1
2 , diane , 2019-06-13, 2019-07-14, 0
2 , ethel , 2019-03-20, 2019-06-13, 0
2 , ethel , 2018-01-17, 2019-03-20, 0
2 , frank , 2017-07-31, 2018-01-17, 0
2 , frank , 2015-03-21, 2018-07-31, 0
And I want to merge consecutive rows where name is the same as the previous row, but maintain the correct start and end dates in the final output dataframe. So the correct output would be...
id , name, start_date, end_date , active
1 , albert , 2019-06-26, 3499-12-31, 1
1 , brian , 2017-07-31, 2019-06-26, 0
1 , albert , 2017-03-31, 2018-07-31, 0
2 , diane , 2019-06-13, 3499-12-31, 1
2 , ethel , 2018-01-17, 2019-06-13, 0
2 , frank , 2017-03-31, 2018-01-17, 0
The number of entries per id varies as does the number of different names per id.
How could this be achieved in pyspark?
Thanks
Are you looking for df.groupby(["name", "start_date", "end_date"]).sum("active")?
If I understood your question right, the above code will do the job.
So after a bit of thinking I figured out how to do this. There may be a better way, but this works.
First create a window, partitioned by id and ordered by start_date descending, and capture the name from the neighbouring row (next_name is null on the first row of each partition).
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when, sum
frame = Window.partitionBy('id').orderBy(col('start_date').desc())
df = df.select('*', lag(col('name')).over(frame).alias('next_name'))
Then set countrr to 0 if the current and next names match, and 1 otherwise (the null next_name on the first row also falls through to 1)...
df = df.withColumn('countrr', when(col('name') == col('next_name'), 0).otherwise(1))
Next, create an extension of the frame that takes the rows between the start of the partition and the current row, and sum the countrr column over that frame...
frame2 = Window.partitionBy('id').orderBy(col('start_date').desc()).rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn('sumrr', sum('countrr').over(frame2))
This effectively creates a column that increases by one each time the name changes. Finally, you can group by this new sumrr column together with the other columns, taking the min and max dates as required...
gb_df = df.groupby(['id', 'name', 'sumrr'])
result = gb_df.agg({'start_date':'min', 'end_date':'max'})
Then you have to join the active flag back on id, name and end_date.
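A minimal sketch of that last join, assuming the default column names Spark generates for the dict-style agg above (min(start_date) and max(end_date)):
# Rename the aggregated columns back, then re-attach the active flag
# by matching each merged row against the original rows.
result = result.withColumnRenamed('min(start_date)', 'start_date') \
               .withColumnRenamed('max(end_date)', 'end_date')
final = result.join(df.select('id', 'name', 'end_date', 'active'),
                    on=['id', 'name', 'end_date'], how='left') \
              .drop('sumrr')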
Gives the correct output...
I have a table like this
item_id date number
1 2000-01-01 100
1 2003-03-08 50
1 2004-04-21 10
1 2004-12-11 10
1 2010-03-03 10
2 2000-06-29 1
2 2002-05-22 2
2 2002-07-06 3
2 2008-10-20 4
I'm trying to get the average for each unique item_id over its last 3 dates.
It's difficult because there are missing dates in between, so a hardcoded date range doesn't always work.
I expect a result like :
item_id MyAverage
1 10
2 3
I don't really know how to do this. Currently I manage to do it for one item, but I have trouble extending it to multiple items:
SELECT AVG(MyAverage.number) FROM (
SELECT date,number
FROM item_list
where item_id = 1
ORDER BY date DESC limit 3
) as MyAverage;
My main problem is generalising the "DESC limit 3" over a group by id.
My attempt:
SELECT item_id,AVG(MyAverage.number)
FROM (
SELECT item_id,date,number
FROM item_list
ORDER BY date DESC limit 3) as MyAverage
GROUP BY item_id;
The limit is messing things up there.
I have made it "work" using between date and date, but that's not what I want because I need a limit, not a hardcoded date range.
Can anybody help?
You can use row_number() to assign 1 to 3 to the records with the latest dates for an id and then filter on that.
SELECT x.item_id,
avg(x.number)
FROM (SELECT il.item_id,
il.number,
row_number() OVER (PARTITION BY il.item_id
ORDER BY il.date DESC) rn
FROM item_list il) x
WHERE x.rn BETWEEN 1 AND 3
GROUP BY x.item_id;
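For anyone doing the same in PySpark (as in the first question above), here is a sketch of the equivalent window approach, assuming a DataFrame df with item_id, date and number columns:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank each item's rows from newest to oldest, keep the newest three,
# then average the number column per item.
w = Window.partitionBy('item_id').orderBy(F.col('date').desc())
result = (df.withColumn('rn', F.row_number().over(w))
            .filter(F.col('rn') <= 3)
            .groupBy('item_id')
            .agg(F.avg('number').alias('MyAverage')))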
My data is in the following format:
rep_id user_id other non-duplicated data
1 1 ...
1 2 ...
2 3 ...
3 4 ...
3 5 ...
I am trying to add a deduped_rep column with 0/1 values such that only the first row for each rep_id (across its associated users) gets a 1 and the rest get 0.
Expected result:
rep_id user_id deduped_rep
1 1 1
1 2 0
2 3 1
3 4 1
3 5 0
For reference, in Excel, I would use the following formula:
IF(SUMPRODUCT(($A$2:$A2=A2)*($A$2:$A2=A2))>1,0,1)
I know there is the FIXED() LoD calculation http://kb.tableau.com/articles/howto/removing-duplicate-data-with-lod-calculations, but I have only seen use cases of it deduplicating based on another column, whereas my rows are distinct.
Define a field first_reg_date_per_rep_id as
{ fixed rep_id : min(registration_date) }
Then define a field is_first_reg_date? as
registration_date = first_reg_date_per_rep_id
You can use that last Boolean field to distinguish the first record for each rep_id from later ones.
Try this query, which flags the first user_id per rep_id with 1 and the rest with 0:
select
    rep_id,
    user_id,
    case when row_number() over (partition by rep_id order by user_id) = 1
         then 1
         else 0
    end as deduped_rep
from
    table
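The same idea in PySpark, for comparison (a sketch, assuming a DataFrame df with rep_id and user_id columns):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Number the users within each rep_id; only the first row gets a 1.
w = Window.partitionBy('rep_id').orderBy('user_id')
df = df.withColumn('deduped_rep',
                   F.when(F.row_number().over(w) == 1, 1).otherwise(0))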
Sample Data:
customer txn_date tag
A 1-Jan-17 1
A 2-Jan-17 1
A 4-Jan-17 1
A 5-Jan-17 0
B 3-Jan-17 1
B 5-Jan-17 0
I need to fill in every missing txn_date in the date range 1-Jan-17 to 5-Jan-17. The output should be:
customer txn_date tag
A 1-Jan-17 1
A 2-Jan-17 1
A 3-Jan-17 0 (inserted)
A 4-Jan-17 1
A 5-Jan-17 0
B 1-Jan-17 0 (inserted)
B 2-Jan-17 0 (inserted)
B 3-Jan-17 1
B 4-Jan-17 0 (inserted)
B 5-Jan-17 0
select  c.customer
       ,d.txn_date
       ,coalesce(t.tag,0) as tag            -- generated rows get tag 0
from    (select date_add(from_date,i) as txn_date   -- d: every date in the range
         from   (select date '2017-01-01' as from_date
                       ,date '2017-01-05' as to_date
                ) p
                lateral view
                posexplode(split(space(datediff(p.to_date,p.from_date)),' ')) pe as i,x
        ) d
cross join
        (select distinct customer            -- c: every customer
         from   t
        ) c
left join t                                  -- pull in the real tags where they exist
        on  t.customer = c.customer
        and t.txn_date = d.txn_date
;
c.customer d.txn_date tag
A 2017-01-01 1
A 2017-01-02 1
A 2017-01-03 0
A 2017-01-04 1
A 2017-01-05 0
B 2017-01-01 0
B 2017-01-02 0
B 2017-01-03 1
B 2017-01-04 0
B 2017-01-05 0
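If you are on Spark rather than Hive, the same fill can be sketched with sequence and explode (Spark 2.4+; assumes a DataFrame df with customer, txn_date and tag columns, and a SparkSession named spark):
from pyspark.sql import functions as F

# One row per calendar date in the range.
dates = spark.sql("""
    select explode(sequence(to_date('2017-01-01'),
                            to_date('2017-01-05'),
                            interval 1 day)) as txn_date
""")
# Cross join every customer with every date, then pull in the real tags;
# dates with no transaction get tag 0.
filled = (df.select('customer').distinct()
            .crossJoin(dates)
            .join(df, ['customer', 'txn_date'], 'left')
            .fillna({'tag': 0}))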
Just put the delta content, i.e. the missing rows, in a file (input.txt) delimited with the same delimiter you specified when you created the table.
Then use the load data command to insert these records into the table.
load data local inpath '/tmp/input.txt' into table tablename;
Your data won't be in the order you have shown; it will get appended at the end. You can restore the order by adding order by txn_date to the select query.
I'm not new to SQL, but I am new to PostgreSQL and am really struggling to adapt my existing knowledge to a different environment.
I am trying to create a variable that captures whether or not someone stays active, skips, or churns within a 0/1 time series variable. For example, in the data below, my dataset would include the variables id,time, and voted, and I would create the variable "skipped":
id time voted skipped
1 1 1 active
1 2 0 skipped
1 3 1 active
2 1 1 active
2 2 0 churned
2 3 0 churned
3 1 1 active
3 2 1 active
3 3 0 churned
The rule for coding "skipped" is pretty simple: if 1 is the last record, the person is "active" and any zeroes count as "skipped", but if 0 is the last record, the person is "churned".
The record with id = 1 at time 2 is a skip because voted is 1 at time 3 after being 0 at time 2. In the other two cases 0 is the final value, so they are "churned". Can anyone help? I've been noodling on it all day and am hitting a wall.
This isn't particularly elegant, but it should meet your needs:
with votes as (
select
id, time, voted,
max(time) over (partition by id) as max_time
from voter_data
)
select
v1.id, v1.time, v1.voted,
case
when v1.voted = 1 then 'active'
when v2.voted = 1 then 'skipped'
else 'churned'
end as skipped
from
votes v1
join votes v2 on
v1.id = v2.id and
v1.max_time = v2.time
In a nutshell, we first figure out which is the last record for each voter id, and then we do a self-join on the resulting table to isolate only that last id.
There is a chance this could produce multiple results -- if it's possible to have the same ID vote twice at the same time. If that's the case, you want row_number() instead of max().
Results on your data:
1 1 1 'active'
1 2 0 'skipped'
1 3 1 'active'
2 1 1 'active'
2 2 0 'churned'
2 3 0 'churned'
3 1 1 'active'
3 2 1 'active'
3 3 0 'churned'
Window functions can help with readability where you would otherwise need a self-referential join.
WITH
add_last_voted_status AS (
SELECT
*
, LAST_VALUE(voted) OVER (
PARTITION BY id
ORDER BY time
-- an explicit frame is needed: the default frame stops at the current
-- row, which would make LAST_VALUE just return the current row's value
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_voted_status
FROM voter_data
)
SELECT
id
, time
, voted
, CASE
WHEN last_voted_status = 0
THEN 'churned'
WHEN last_voted_status = 1 AND voted = 1
THEN 'active'
WHEN last_voted_status = 1 AND voted = 0
THEN 'skipped'
ELSE '?'
END AS skipped
FROM add_last_voted_status
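The same approach carries over to PySpark almost verbatim, if you ever need it there (a sketch, assuming the same id/time/voted columns):
from pyspark.sql import Window
from pyspark.sql import functions as F

# last() over an unbounded frame mirrors LAST_VALUE above.
w = (Window.partitionBy('id').orderBy('time')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df = (df.withColumn('last_voted_status', F.last('voted').over(w))
        .withColumn('skipped',
                    F.when(F.col('last_voted_status') == 0, 'churned')
                     .when(F.col('voted') == 1, 'active')
                     .otherwise('skipped')))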
I asked a question some days back. Here is the link:
Count() corresponding to max()
Now, with the same set of tables (SQL Fiddle), I would like to check a different condition.
Where the first question was about a count related to the max of a status, this question is about showing the count based on the next status of every project.
Explanation
As you can see in the table user_approval, appr_prjt_id = 1 has 3 different statuses, namely 10, 20 and 30. The next status will be 40 (with every approval the status increases by 10), and so on. So is it possible to show that there is a project waiting to reach status 40? Its count must only appear under status 40 in the output (not under statuses 10, 20, 30, etc.).
Desired Output:
10 | 20 | 30 | 40
location1 0 | 0 | 0 | 1
Not sure what "the next status will be 40" means. But assuming that the status increases by 10 with every approval, the following should work:
SELECT *
FROM user_projects pr
WHERE EXISTS (
SELECT * FROM user_approval ex
WHERE ex.appr_prjt_id = pr.proj_id
AND ex.appr_status = 30
)
AND NOT EXISTS (
SELECT * FROM user_approval nx
WHERE nx.appr_prjt_id = pr.proj_id
AND nx.appr_status >= 40
);
You can get the counts for each of the next status requirements with a query that looks more like:
select
sum(case when ua.appr_status = 10 then 1 else 0 end) as app_waiting_20,
sum(case when ua.appr_status = 20 then 1 else 0 end) as app_waiting_30,
sum(case when ua.appr_status = 30 then 1 else 0 end) as app_waiting_40
from
user_approval ua;
The nice thing about this solution is that it needs only one table scan, and you can add all kinds of other counts/sums to the query result as well.
select * from user_approval
where appr_status = (select max(appr_status) from user_approval where appr_status < 40);
SQL Fiddle: http://www.sqlfiddle.com/#!11/f5243/10