Spark: restart counting on specific value - scala

I have a DataFrame with Boolean records and I want to restart the counting when goal = False/Null.
How can I get the Score column?
The Score column is a running count of True values that resets on False/Null values.
My df:
Goals
Null
True
False
True
True
True
True
False
False
True
True
Expected Result:
Goals Score
Null 0
True 1
False 0
True 1
True 2
True 3
True 4
False 0
False 0
True 1
True 2
EDIT: Adding more info
Actually my full dataset is:
Player Goals Date Score
1 Null 2017-08-18 10:30:00 0
1 True 2017-08-18 11:30:00 1
1 False 2017-08-18 12:30:00 0
1 True 2017-08-18 13:30:00 1
1 True 2017-08-18 14:30:00 2
1 True 2017-08-18 15:30:00 3
1 True 2017-08-18 16:30:00 4
1 False 2017-08-18 17:30:00 0
1 False 2017-08-18 18:30:00 0
1 True 2017-08-18 19:30:00 1
1 True 2017-08-18 20:30:00 2
2 False 2017-08-18 10:30:00 0
2 False 2017-08-18 11:30:00 0
2 True 2017-08-18 12:30:00 1
2 True 2017-08-18 13:30:00 2
2 False 2017-08-18 15:30:00 0
I've created a window to calculate the score by player on a certain date
val w = Window.partitionBy("Player","Goals").orderBy("date")
I've tried with the lag function and comparing the values, but I can't reset the count.
EDIT2: Add unique Date per player
Thank you.

I finally solved the problem by grouping the goals that occur together.
I used a count over a partition built from the difference between the row index of the whole table and the row_number within the partitioned window.
First, declare the window with the columns that will be created later:
val w = Window.partitionBy("player","goals","countPartition").orderBy("date")
Then initialize the columns "countPartition" and "goals" with 1 so that the row number stays neutral:
val list1 = dataList.withColumn("countPartition", lit(1)).withColumn("goals", lit(1)).withColumn("index", rowNumber over w)
The UDF:
def div = udf((countInit: Int, countP: Int) => countInit-countP)
And finally, calculate the score:
val score = list1
  .withColumn("goals", goals)
  .withColumn("countPartition", div(col("index"), rowNumber over w))
  .withColumn("Score", when(col("goals") === true, count("goals") over w)
    .otherwise(when(col("goals") isNull, "null").otherwise(0)))
  .orderBy("date")
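For reference, the same "group the goals that occur together" idea is often written as an island label: flag every row that breaks the streak, take a running sum of those flags to get a streak id, then count the True rows inside each streak. A minimal PySpark sketch of that variant (the question is in Scala, but the DataFrame API maps one to one; the column names Player, Goals and Date follow the question, and df stands for the input DataFrame):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Player").orderBy("Date")
cum = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

scored = (
    df
    # a streak resets on every row where Goals is not True (False or null)
    .withColumn("reset", F.when(F.col("Goals") == True, F.lit(0)).otherwise(F.lit(1)))
    # running sum of the resets gives each streak its own id ("island" label)
    .withColumn("streak_id", F.sum("reset").over(cum))
    # Score = number of True rows seen so far inside the current streak
    .withColumn(
        "Score",
        F.when(
            F.col("Goals") == True,
            F.count(F.when(F.col("Goals") == True, 1)).over(
                Window.partitionBy("Player", "streak_id")
                      .orderBy("Date")
                      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
            ),
        ).otherwise(F.lit(0)),
    )
    .drop("reset", "streak_id")
)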

Related

Making different group of sequential null data in pyspark

I have data that contains sequential nulls, and I want to split those sequential null rows into different groups.
I have data like below:
group_num days time useage
1 20200101 1 10
1 20200101 2 10
1 20200101 3 null
2 20200102 1 30
2 20200102 2 null
2 20200102 3 null
2 20200102 4 50
2 20200102 5 null
3 20200105 10 null
3 20200105 11 null
3 20200105 12 5
What I want to do with this data is to label the null rows of useage with a null_group. Null rows that are sequential should get the same null group, and null rows that are not sequential or that belong to a different group_num should get different null groups.
group_num days time useage null_group
1 20200101 1 10
1 20200101 2 10
1 20200101 3 null group1
2 20200102 1 30
2 20200102 2 null group2
2 20200102 3 null group2
2 20200102 4 50
2 20200102 5 null group3
3 20200105 10 null group4
3 20200105 11 null group4
3 20200105 12 5
Or maybe create new data that only contains the null rows, each with its group:
group_num days time useage null_group
1 20200101 3 null group1
2 20200102 2 null group2
2 20200102 3 null group2
2 20200102 5 null group3
3 20200105 10 null group4
3 20200105 11 null group4
null_group can be changed to numeric like below:
group_num days time useage null_group
1 20200101 3 null 1
2 20200102 2 null 2
2 20200102 3 null 2
2 20200102 5 null 3
3 20200105 10 null 4
3 20200105 11 null 4
Can anyone help with this problem? I thought I could do this with PySpark's window functions, but it didn't work very well. I think I have to use PySpark because the original data is too large to handle in plain Python.
This looks a bit complicated, but the last 2 parts are just for displaying correctly. The main logic goes like this:
calculate "time" difference "dt" between rows (needs to be 1 for the same "null_group")
generate "key" from "usage" and "dt" columns
use the trick to label consecutive rows (originally for pandas https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/)
rename and manipulate labels to get desired result
Full solution:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('group_num').orderBy('time')
w_cumsum = w.rowsBetween(Window.unboundedPreceding, 0)

# main logic
df_tmp = (
    df
    .withColumn('dt', F.coalesce(F.col('time') - F.lag('time').over(w), F.lit(1)))
    .withColumn('key', F.concat_ws('-', 'usage', 'dt'))
    .withColumn('prev_key', F.lag('key').over(w))
    .withColumn('diff', F.coalesce((F.col('key') != F.col('prev_key')).cast('int'), F.lit(1)))
    .withColumn('cumsum', F.sum('diff').over(w_cumsum))
    .withColumn('null_group_key',
                F.when(F.isnull('usage'), F.concat_ws('-', 'group_num', 'cumsum')).otherwise(None))
)

# map to generate required group names
df_map = (
    df_tmp
    .select('null_group_key')
    .distinct()
    .dropna()
    .sort('null_group_key')
    .withColumn('null_group', F.concat(F.lit('group'), F.monotonically_increasing_id() + F.lit(1)))
)

# rename and display as needed
(
    df_tmp
    .join(df_map, 'null_group_key', 'left')
    .fillna('', 'null_group')
    .select('group_num', 'days', 'time', 'usage', 'null_group')
    .sort('group_num', 'time')
    .show()
)
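One caveat: F.monotonically_increasing_id() only guarantees increasing IDs, not consecutive ones, so on a multi-partition DataFrame the generated labels can skip numbers (group1, group8589934593, ...). If consecutive group1, group2, ... names matter, a possible variant of df_map numbers the keys with row_number over an ordered window instead:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# number the distinct null-group keys 1, 2, 3, ... regardless of partitioning
df_map = (
    df_tmp
    .select('null_group_key')
    .distinct()
    .dropna()
    .withColumn('null_group',
                F.concat(F.lit('group'),
                         F.row_number().over(Window.orderBy('null_group_key'))))
)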

MySQL query to fill out the time gaps between the records

I want to write an optimized query to fill out the time gaps between the records with the stock value that is most recent as of each date.
The requirement is to have the latest stock value for every combination of id_warehouse, id_stock, and date. The table is quite large (2 million records) and keeps growing, so I would like to optimize the query that I have added below.
daily_stock_levels:
date id_stock id_warehouse new_stock is_stock_avaible
2022-01-01 1 1 24 1
2022-01-01 1 1 25 1
2022-01-01 1 1 29 1
2022-01-02 1 1 30 1
2022-01-06 1 1 27 1
2022-01-09 1 1 26 1
Result:
date id_stock id_warehouse closest_date_with_stock_value most_recent_stock_value
2022-01-01 1 1 29 1
2022-01-02 1 1 30 1
2022-01-03 1 1 30 1
2022-01-04 1 1 30 1
2022-01-05 1 1 30 1
2022-01-06 1 1 27 1
2022-01-07 1 1 27 1
2022-01-08 1 1 27 1
2022-01-09 1 1 26 1
2022-01-10 1 1 26 1
.. .. .. .. ..
2022-08-08 1 1 26 1
SELECT
    sl.date,
    sl.id_warehouse,
    sl.id_item,
    (SELECT s.date
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_item = s.id_item
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS closest_date_with_stock_value,
    (SELECT s.new_stock
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_item = s.id_item
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS most_recent_stock_value
FROM daily_stock_levels sl
GROUP BY sl.id_warehouse,
         sl.id_item,
         sl.date
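Not a MySQL answer, but a sanity check of the intended semantics: both correlated subqueries perform an "as-of" lookup (latest snapshot at or before each date, per warehouse and stock). The same lookup expressed with pandas merge_asof, on made-up miniature data, shows what the query has to produce:

import pandas as pd

# simplified snapshots, one per day a stock level was recorded (illustrative values)
stock = pd.DataFrame({
    "closest_date_with_stock_value": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-06", "2022-01-09"]),
    "id_stock": 1,
    "id_warehouse": 1,
    "most_recent_stock_value": [29, 30, 27, 26],
})

# one row per (stock, warehouse, calendar day) that needs a value
calendar = pd.DataFrame({"date": pd.date_range("2022-01-01", "2022-01-10")})
calendar["id_stock"] = 1
calendar["id_warehouse"] = 1

# as-of join: for every calendar day, pick the latest snapshot on or before it
filled = pd.merge_asof(
    calendar.sort_values("date"),
    stock.sort_values("closest_date_with_stock_value"),
    left_on="date",
    right_on="closest_date_with_stock_value",
    by=["id_stock", "id_warehouse"],
    direction="backward",
)
print(filled)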

PySpark UDF does not return expected result

I have a Databricks dataframe with multiple columns and a UDF that generates the contents of a new column, based on values from other columns.
A sample of the original dataset is:
interval_group_id control_value pulse_value device_timestamp
2797895314 5 5 2020-09-12 09:08:44
0 5 5 2020-09-12 09:08:45
0 6 5 2020-09-12 09:08:46
0 0 5 2020-09-12 09:08:47
Now I am trying to add a new column, called group_id, based on some logic with the columns above. My UDF code is:
@udf('integer')
def udf_calculate_group_id_new(interval_group_id, prev_interval_group_id, control_val, pulse_val):
    if interval_group_id != 0:
        return interval_group_id
    elif control_val >= pulse_val and prev_interval_group_id != 0:
        return prev_interval_group_id
    else:
        return -1
And the new column being added to my dataframe is done with:
df = df.withColumn('group_id'
, udf_calculate_group_id_new(
df.interval_group_id
, lag(col('interval_group_id')).over(Window.orderBy('device_timestamp'))
, df.control_value
, df.pulse_value)
)
My expected results are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 2797895314
0 5 5 2020-09-12 09:08:45 2797895314
0 6 5 2020-09-12 09:08:46 2797895314
0 0 5 2020-09-12 09:08:47 -1
However, the results of adding the new group_id column are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 null
0 5 5 2020-09-12 09:08:45 null
0 6 5 2020-09-12 09:08:46 -1
0 0 5 2020-09-12 09:08:47 -1
My goal is to propagate the value 2797895314 down the group_id column, based on the conditions mentioned above, but somehow this doesn't happen and the results are populated with null and -1 incorrectly.
Is this a bug with UDFs, or is my expectation of how the UDF should work incorrect? Or am I just bad at coding?
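Two things seem worth noting here. First, the declared return type 'integer' is a 32-bit int and 2797895314 does not fit in 32 bits, which is consistent with those rows coming back as null ('long' would match the data). Second, lag only sees the previous row's original interval_group_id, never the value the UDF has just computed, so a row-by-row UDF cannot carry the id further down. A window-only sketch that reproduces the expected sample output, assuming the intended rule is "carry the last non-zero interval_group_id forward while control_value >= pulse_value":

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# all rows seen so far, in timestamp order (no partition key, as in the question)
w = Window.orderBy("device_timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# last non-zero interval_group_id up to and including the current row
last_nonzero = F.last(
    F.when(F.col("interval_group_id") != 0, F.col("interval_group_id")),
    ignorenulls=True,
).over(w)

df = df.withColumn(
    "group_id",
    F.when(F.col("interval_group_id") != 0, F.col("interval_group_id"))
     .when(F.col("control_value") >= F.col("pulse_value"), last_nonzero)
     .otherwise(F.lit(-1)),
)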

Pandas edit dataframe

I am querying a MongoDB collection with two queries and appending the results to get a single data frame (the keys are: status, date, uniqueid).
for record in results:
    query1 = (record["sensordata"]["user"])
    df1 = pd.DataFrame(query1.items())
    query2 = (record["created_date"])
    df2 = pd.DataFrame(query2.items())
    index = "status"
    result = df1.append(df2, index)
    b = result.transpose()
    print b
    b.to_csv(q)
output is :
0 1 2
0 status uniqueid date
1 0 191b117fcf5c 2017-03-01 17:51:08.263000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 17:51:17.216000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 17:51:23.269000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 18:26:17.216000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 18:26:21.130000
0 1 2
0 status uniqueid date
1 0 191b117fcf5c 2017-03-01 18:26:28.217000
How do I remove these extra 0, 1, 2 and 0, 1 in the rows and columns?
Also, I don't want status, uniqueid and date to repeat every time.
My desired output should be like this:
status uniqueid date
0 191b117fcf5c 2017-03-01 18:26:28.217000
1 191b117fcf5c 2017-03-01 19:26:28.192000
1 191b117fcf5c 2017-04-01 11:16:28.222000
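A minimal sketch of one way to get that shape: collect each record into a plain dict, build a single DataFrame at the end, and write it once without the integer index. The names results, record["sensordata"]["user"] and record["created_date"] follow the question; the assumption that the user sub-document holds status and uniqueid, and the output file name, are illustrative:

import pandas as pd

rows = []
for record in results:
    row = dict(record["sensordata"]["user"])      # assumed to contain status and uniqueid
    row["date"] = record["created_date"]
    rows.append(row)

b = pd.DataFrame(rows, columns=["status", "uniqueid", "date"])
b.to_csv("output.csv", index=False)               # header written once, no 0/1/2 index column
print(b)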

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of vals as of a given date: if a user has a more recent row up to that date, use that row, but if not, keep the older one.
For example. If I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce this off all of you, as I have been trying to think of a simple way of doing this.
try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
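Wrapping that last expression in a small helper gives the sum the question asks for; the function name is illustrative:

def sum_as_of(df, cutoff):
    # keep rows on or before the cutoff, take each user's chronologically last row, then sum val
    latest = (df.loc[df.date <= cutoff]
                .sort_values('date')
                .groupby('user_id')
                .last()
                .reset_index())
    return latest['val'].sum()

sum_as_of(df, '2015-03-15')   # 36
sum_as_of(df, '2015-04-15')   # 37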