How can I group my data frame based on conditions on a column - pyspark

I have a data frame like this:
Date   Version  Value  Name
Jan 1  123.1    3      A
Jan 2  123.23   5      A
Jan 1  223.1    6      B
Jan 2  623.23   7      B
I want to group the table by the 'Version' prefix (everything from the first character up to the '.'). For 'Value', it selects the value from the row with the longest Version string. And for the 'Name' column, it uses any of the rows with the same prefix.
Version Prefix  Value  Name
123             5      A
223             6      B
623             7      B
Meaning versions 123.1 and 123.23 have the same prefix '123', so both rows become one row in the result. And 'Value' equals 5 since the row with Version 123.23 (the row with the longest Version) has 5 as its Value.

from pyspark.sql import Window
from pyspark.sql.functions import split, size, max, col

(df.withColumn('Version Prefix', split('Version', r'\.')[0])  # create the prefix column
 .withColumn('size', size(split(split('Version', r'\.')[1], '(?!$)')))  # length of the suffix after the '.'
 .withColumn('max', max('size').over(Window.partitionBy('Version Prefix', 'Name')))  # longest suffix per group
 .where(col('size') == col('max'))  # keep only the rows with the longest suffix
 .drop('Date', 'size', 'max', 'Version')  # drop unwanted columns
).show()
+-----+----+--------------+
|Value|Name|Version Prefix|
+-----+----+--------------+
| 5| A| 123|
| 6| B| 223|
| 7| B| 623|
+-----+----+--------------+
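An alternative sketch, not from the original answer and assuming the same df: group by the prefix and take the max of a struct whose first field is the Version length, so the row with the longest Version wins without a separate window-plus-filter pass.
from pyspark.sql import functions as F

(df.withColumn('Version Prefix', F.split('Version', r'\.')[0])
 .groupBy('Version Prefix')
 .agg(F.max(F.struct(F.length('Version').alias('len'),
                     F.col('Value').alias('Value'),
                     F.col('Name').alias('Name'))).alias('m'))
 .select('Version Prefix', 'm.Value', 'm.Name')
).show()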

Related

Sum values of specific rows if fields are the same

Hi, I'm trying to sum the values of one column whenever 'ID' matches, for all rows in a dataframe.
For example:
ID  Gender  value
1   Male    5
1   Male    6
2   Female  3
3   Female  0
3   Female  9
4   Male    10
How do I get the following table?
ID  Gender  value
1   Male    11
2   Female  3
3   Female  9
4   Male    10
In the example above, ID 1 is now shown just once and its values have been summed up (same for ID 3).
Thanks
I'm new to PySpark and still learning. I've tried count(), select and groupby(), but nothing has resulted in what I'm trying to do.
try this:
from pyspark.sql import Window
from pyspark.sql import functions as f

df = (
    df
    # add the per-ID sum of 'value' as a column (rows are not collapsed)
    .withColumn('value', f.sum(f.col('value')).over(Window.partitionBy(f.col('ID'))))
)
Link to the documentation about Window operations: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html
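Note that this window approach keeps every original row, just with the summed value repeated. If you want exactly one row per ID as in the expected output, a minimal follow-up sketch (assuming the df above) is to drop the duplicates afterwards:
df = df.dropDuplicates(['ID'])  # keep one row per ID; 'value' already holds the per-ID sum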
You can use a simple groupBy, with the sum function:
from pyspark.sql import functions as F
(
    df
    .groupby("ID", 'Gender')  # sum rows with same ID and Gender
    # .groupby("ID")  # use this line instead if you want to sum rows with the same ID, even if they have different Gender
    .agg(F.sum('value').alias('value'))
)
The result is:
+---+------+-----+
| ID|Gender|value|
+---+------+-----+
| 1| Male| 11|
| 2|Female| 3|
| 3|Female| 9|
| 4| Male| 10|
+---+------+-----+

Spark aggregation with window functions

I have a spark df which I need to use to identify the last active record for each primary key based on a snapshot date. An example of what I have is:
A  B  C  Snap
1  2  3  2019-12-29
1  2  4  2019-12-31
where the primary key is formed by fields A and B. I need to create a new field to indicate which record is active (the last Snap for each set of rows with the same PK). So I need something like this:
A  B  C  Snap        activity
1  2  3  2019-12-29  false
1  2  4  2019-12-31  true
I have done this by creating an auxiliary df and then joining it back to the first one to bring in the active indicator, but my original df is very big and I need something better in terms of performance. I have been thinking about window functions but I don't know how to implement them.
Once I have this, I need to create a new field with the end date of the record: it is filled only when the activity field is false, by subtracting 1 day from the Snap date of the latest row in each set of rows with the same PK. I would need something like this:
A  B  C  Snap        activity  end
1  2  3  2019-12-29  false     2019-12-30
1  2  4  2019-12-31  true
You can use row_number ordered by Snap in descending order; the 1st row in each group is the last active snap:
df.selectExpr(
    '*',
    'row_number() over (partition by A, B order by Snap desc) = 1 as activity'
).show()
+---+---+---+----------+--------+
| A| B| C| Snap|activity|
+---+---+---+----------+--------+
| 1| 2| 4|2019-12-31| true|
| 1| 2| 3|2019-12-29| false|
+---+---+---+----------+--------+
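For reference, a sketch of the same flag with the DataFrame API rather than a SQL expression (assuming the same df; the window name w is introduced here):
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('A', 'B').orderBy(f.col('Snap').desc())
df.withColumn('activity', f.row_number().over(w) == 1).show()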
Edit: to get the end date for each group, use the max window function on Snap:
import pyspark.sql.functions as f

df.withColumn(
    'activity',
    f.expr('row_number() over (partition by A, B order by Snap desc) = 1')
).withColumn(
    'end',
    f.expr('case when activity then null else max(date_add(to_date(Snap), -1)) over (partition by A, B) end')
).show()
+---+---+---+----------+--------+----------+
| A| B| C| Snap|activity| end|
+---+---+---+----------+--------+----------+
| 1| 2| 4|2019-12-31| true| null|
| 1| 2| 3|2019-12-29| false|2019-12-30|
+---+---+---+----------+--------+----------+
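And a sketch of the end-date step with the DataFrame API, continuing the sketch above (same df and w; when/otherwise stands in for the SQL case expression):
(df
 .withColumn('activity', f.row_number().over(w) == 1)
 .withColumn('end', f.when(f.col('activity'), None)
                     .otherwise(f.max(f.date_add(f.to_date('Snap'), -1))
                                 .over(Window.partitionBy('A', 'B'))))
 .show())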

Restrict table columns, preserving keys

I've found in "Q Tips" a technique to preserve keys in a table. This is useful, for example, for restricting columns in the right table of an lj without re-applying a key. Using each-right (/:):
q)show t:(`c1`c2!1 2;`c1`c2!3 4)!(`c3`c4`c5!30 40 50;`c3`c4`c5!31 41 51)
c1 c2| c3 c4 c5
-----| --------
1 2 | 30 40 50
3 4 | 31 41 51
q)`c3`c4#/:t
c1 c2| c3 c4
-----| -----
1 2 | 30 40
3 4 | 31 41
I'm trying to understand why it preserves the key part of the table t:
q){-3!x}/:t
'/:
[0] {-3!x}/:t
^
But in this case q doesn’t show how it treats each row of the keyed table.
So why does the syntax #/:t work this way for a keyed table? Is it mentioned anywhere in the code.kx.com docs?
Upd1: I've found a case with # and a keyed table on code.kx.com, but it is about selecting rows, not columns.
If you view the keyed table as a dictionary (which it is) then it's no different to:
q)2*/:`a`b!1 2
a| 2
b| 4
or
q){x+1} each `a`b!1 2
a| 2
b| 3
The keys are retained when applying a function to each element of a dictionary. In your example, the function being applied is take on a dictionary, e.g.:
q)`c3`c4#first t
c3| 30
c4| 40
Doing that for each row returns a list of dictionaries, which is itself a table.
Also, your other attempt would work as:
{-3!x}#/:t
so it's not unique to take (#).
{-3!x}/:t
won't work because each-right needs two arguments.
Since the table is keyed, it is treated as a dictionary. The each-right iterates over the dictionary values and therefore ignores the keys of the main dictionary (i.e. the keyed columns). To see what is happening, it might help to look at what happens when using each:
q){-3!x} each t
c1 c2|
-----| --------------------
1 2 | "`c3`c4`c5!30 40 50"
3 4 | "`c3`c4`c5!31 41 51"

How to take values for the same ID answered more than once and create one column per value

I have data like below; I want to take the data for the same ID from one column and put each answer into a different new column:
Actual:
ID  Brandid
1   234
1   122
1   134
2   122
3   234
3   122
Expected:
ID  BRANDID_1  BRANDID_2  BRANDID_3
1   234        122        134
2   122        -          -
3   234        122        -
You can use pivot after a groupBy, but first you can create a column with the future column name, using row_number over a Window to get a monotonically increasing number per ID. Here is one way:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# create the window on ID; since an orderBy is required,
# use the constant F.lit(1) to keep the original order
w = Window.partitionBy('ID').orderBy(F.lit(1))

# create the column with the future column names to pivot on
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')))
           # group by the ID and pivot on the created column
           .groupBy('ID').pivot('pv')
           # in aggregation, you need a function, so we use first
           .agg(F.first('Brandid')))
and you get
pv_df.show()
+---+---------+---------+---------+
| ID|Brandid_1|Brandid_2|Brandid_3|
+---+---------+---------+---------+
| 1| 234| 122| 134|
| 3| 234| 122| null|
| 2| 122| null| null|
+---+---------+---------+---------+
EDIT: to get the columns in order as the OP requested, you can use lpad. First define the padded length you want:
nb_pad = 3
and in the method above replace F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')) with
F.concat(F.lit('Brandid_'), F.lpad(F.row_number().over(w).cast('string'), nb_pad, "0"))
and if you don't know how many zeros you need to add (here the overall length was 3), you can compute it with:
nb_pad = len(str(df.groupBy('ID').count().select(F.max('count')).collect()[0][0]))
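Putting the edit together, a sketch of the padded variant (same df, F, w and nb_pad as above; only the pv expression changes):
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'),
                                      F.lpad(F.row_number().over(w).cast('string'), nb_pad, '0')))
           .groupBy('ID').pivot('pv')
           .agg(F.first('Brandid')))
pv_df.show()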

How to get the name of the group with maximum value of parameter? [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have a DataFrame df like this one:
df =
name  group  influence
A     1      2
B     1      3
C     1      0
A     2      5
D     2      1
For each distinct value of group, I want to extract the value of name that has the maximum value of influence.
The expected result is this one:
group  max_name  max_influence
1      B         3
2      A         5
I know how to get the max value, but I don't know how to get max_name.
df.groupBy("group").agg(max("influence").as("max_influence"))
There is a good alternative to groupBy with structs: window functions, which are sometimes really faster.
For your example I would try the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
.filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
| A| 2| 5| 5|
| B| 1| 3| 3|
+----+-----+---------+-------------+
Now all you need is to drop the useless columns and rename the remaining ones.
Hope it helps.