Select Data from multiple rows to one row - pyspark

I'm pretty new to functional programming and PySpark, and I currently struggle to condense the data I want from my source data.
Let's say I have two tables as DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# if not already created automatically, instantiate a SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['Id', 'JoinId', 'Name']
vals = [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')]
persons = spark.createDataFrame(vals,columns)
columns = ['Id', 'JoinId', 'Specification', 'Date', 'Destination']
vals = [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'), (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'), (5, 11, 'U', '20070806', 'Lissabon')]
movements = spark.createDataFrame(vals,columns)
persons.show()
+---+------+----------+
| Id|JoinId|      Name|
+---+------+----------+
|  1|    11| FirstName|
|  2|    12|SecondName|
|  3|    13| ThirdName|
+---+------+----------+
movements.show()
+---+------+-------------+--------+-------------+
| Id|JoinId|Specification|    Date|  Destination|
+---+------+-------------+--------+-------------+
|  1|    10|            I|20051205|New York City|
|  2|    11|            I|19991112|       Berlin|
|  3|    11|            O|20030101|       Madrid|
|  4|    13|            I|20200113|        Paris|
|  5|    11|            U|20070806|     Lissabon|
+---+------+-------------+--------+-------------+
What I want to create is:
+--------+----------+---------+---------+-----------+
|PersonId|PersonName|    IDate|    ODate|Destination|
+--------+----------+---------+---------+-----------+
|       1| FirstName| 19991112| 20030101|     Berlin|
|       3| ThirdName| 20200113|         |      Paris|
+--------+----------+---------+---------+-----------+
The rules would be:
PersonId is the Id of the Person
IDate is the Date saved in the Movements DataFrame where Specification is I
ODate is the Date saved in the Movements DataFrame where Specification is O
The Destination is the Destination of the joined entry where the Specification was I
I already joined the DataFrames on JoinId:
joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(movements, col('P_JoinId') == movements.JoinId, how='inner')
joined.show()
+---+--------+---------+---+------+-------------+--------+-----------+
| Id|P_JoinId|     Name| Id|JoinId|Specification|    Date|Destination|
+---+--------+---------+---+------+-------------+--------+-----------+
|  1|      11|FirstName|  2|    11|            I|19991112|     Berlin|
|  1|      11|FirstName|  3|    11|            O|20030101|     Madrid|
|  1|      11|FirstName|  5|    11|            U|20070806|   Lissabon|
|  3|      13|ThirdName|  4|    13|            I|20200113|      Paris|
+---+--------+---------+---+------+-------------+--------+-----------+
But I'm struggling to select data from multiple rows and combine it, following the rules above, into a single row...
Thank you for your help.

Note: I have renamed the Id column in movements to Id_movements to avoid confusion when grouping later.
You can pivot your joined data on the Specification column and aggregate Date and Destination. Then you get the date and destination per specification.
import pyspark.sql.functions as F
persons = sqlContext.createDataFrame([(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')], schema=['Id', 'JoinId', 'Name'])
movements = sqlContext.createDataFrame([(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'), (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'), (5, 11, 'U', '20070806', 'Lissabon')], schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])
df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(movements, F.col('P_JoinId') == movements.JoinId, how='inner')
df_pivot = df_joined.groupby(['Id', 'Name']).pivot('Specification').agg(F.min('Date').alias('date'), F.min('Destination').alias('destination'))
Here I have chosen the min aggregation, but you can choose whichever one suits your need and drop the irrelevant columns.
Result:
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| Id|     Name|  I_date|I_destination|  O_date|O_destination|  U_date|U_destination|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
|  1|FirstName|19991112|       Berlin|20030101|       Madrid|20070806|     Lissabon|
|  3|ThirdName|20200113|        Paris|    null|         null|    null|         null|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
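From there, getting exactly the desired layout is just a matter of selecting and renaming the pivoted columns. A minimal sketch on top of df_pivot above (Destination is taken from the I entry, per the rules in the question; F is pyspark.sql.functions, imported above):
# Select and rename the pivoted columns to match the desired output;
# the destination comes from the 'I' movement, per the rules above.
result = df_pivot.select(
    F.col('Id').alias('PersonId'),
    F.col('Name').alias('PersonName'),
    F.col('I_date').alias('IDate'),
    F.col('O_date').alias('ODate'),
    F.col('I_destination').alias('Destination'))
result.show()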

Related

PySpark: Create incrementing group column counter

How can I generate the expected value, ExpectedGroup, so that the same value is shared within a consecutive block of rows where cond1 is True, and the value increments by 1 for each new block (i.e. after we run into a False in cond1)?
Consider:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame(sc.parallelize([
['A', '2019-01-01', 'P', 'O', 2, None],
['A', '2019-01-02', 'O', 'O', 5, 1],
['A', '2019-01-03', 'O', 'O', 10, 1],
['A', '2019-01-04', 'O', 'P', 4, None],
['A', '2019-01-05', 'P', 'P', 300, None],
['A', '2019-01-06', 'P', 'O', 2, None],
['A', '2019-01-07', 'O', 'O', 5, 2],
['A', '2019-01-08', 'O', 'O', 10, 2],
['A', '2019-01-09', 'O', 'P', 4, None],
['A', '2019-01-10', 'P', 'P', 300, None],
['B', '2019-01-01', 'P', 'O', 2, None],
['B', '2019-01-02', 'O', 'O', 5, 3],
['B', '2019-01-03', 'O', 'O', 10, 3],
['B', '2019-01-04', 'O', 'P', 4, None],
['B', '2019-01-05', 'P', 'P', 300, None],
]),
['ID', 'Time', 'FromState', 'ToState', 'Hours', 'ExpectedGroup'])
# condition statement
cond1 = (df.FromState == 'O') & (df.ToState == 'O')
df = df.withColumn('condition', cond1.cast("int"))
df = df.withColumn('conditionLead', F.lead('condition').over(Window.orderBy('ID', 'Time')))
df = df.na.fill(value=0, subset=["conditionLead"])
df = df.withColumn('finalCondition', ( (F.col('condition') == 1) & (F.col('conditionLead') == 1)).cast('int'))
# working pandas option:
# cond1 = ( (df.FromState == 'O') & (df.ToState == 'O') )
# df['ExpectedGroup'] = (cond1.shift(-1) & cond1).cumsum().mask(~cond1)
# other working option:
# cond1 = ( (df.FromState == 'O') & (df.ToState == 'O') )
# df['ExpectedGroup'] = (cond1.diff()&cond1).cumsum().where(cond1)
# failing here
windowval = (Window.partitionBy('ID').orderBy('Time').rowsBetween(Window.unboundedPreceding, 0))
df = df.withColumn('ExpectedGroup2', F.sum(F.when(cond1, F.col('finalCondition'))).over(windowval))
Just use the same logic as in your Pandas code: use the Window lag function to get the previous value of cond1, set the flag to 1 only when the current cond1 is true and the previous cond1 is false, and then do the cumsum gated on cond1; see the code below. (BTW, you probably want to add ID to the partitionBy clause of the WindowSpec; in that case the last ExpectedGroup1 would be 1 instead of 3.)
from pyspark.sql import functions as F, Window
w = Window.partitionBy().orderBy('ID', 'time')
df_new = (df.withColumn('cond1', (F.col('FromState')=='O') & (F.col('ToState')=='O'))
.withColumn('f', F.when(F.col('cond1') & (~F.lag(F.col('cond1')).over(w)),1).otherwise(0))
.withColumn('ExpectedGroup1', F.when(F.col('cond1'), F.sum('f').over(w)))
)
df_new.show()
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
| ID|      Time|FromState|ToState|Hours|ExpectedGroup|cond1|  f|ExpectedGroup1|
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
|  A|2019-01-01|        P|      O|    2|         null|false|  0|          null|
|  A|2019-01-02|        O|      O|    5|            1| true|  1|             1|
|  A|2019-01-03|        O|      O|   10|            1| true|  0|             1|
|  A|2019-01-04|        O|      P|    4|         null|false|  0|          null|
|  A|2019-01-05|        P|      P|  300|         null|false|  0|          null|
|  A|2019-01-06|        P|      O|    2|         null|false|  0|          null|
|  A|2019-01-07|        O|      O|    5|            2| true|  1|             2|
|  A|2019-01-08|        O|      O|   10|            2| true|  0|             2|
|  A|2019-01-09|        O|      P|    4|         null|false|  0|          null|
|  A|2019-01-10|        P|      P|  300|         null|false|  0|          null|
|  B|2019-01-01|        P|      O|    2|         null|false|  0|          null|
|  B|2019-01-02|        O|      O|    5|            3| true|  1|             3|
|  B|2019-01-03|        O|      O|   10|            3| true|  0|             3|
|  B|2019-01-04|        O|      P|    4|         null|false|  0|          null|
|  B|2019-01-05|        P|      P|  300|         null|false|  0|          null|
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
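For reference, here is a minimal sketch of the partitioned variant mentioned in the aside above: adding ID to partitionBy makes the counter restart for each ID (coalesce guards the first row of each partition, where lag returns null), so the last ExpectedGroup1 becomes 1 instead of 3.
from pyspark.sql import functions as F, Window
# Same logic as above, but partitioned by ID so the counter restarts per ID.
w = Window.partitionBy('ID').orderBy('Time')
df_new = (df.withColumn('cond1', (F.col('FromState') == 'O') & (F.col('ToState') == 'O'))
    .withColumn('f', F.when(F.col('cond1') & ~F.coalesce(F.lag('cond1').over(w), F.lit(False)), 1).otherwise(0))
    .withColumn('ExpectedGroup1', F.when(F.col('cond1'), F.sum('f').over(w)))
)
df_new.show()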
How to create a group column counter in PySpark?
To create a group column counter in PySpark, we can use a window specification together with the row_number function. The window defines the partitioning and ordering of the rows, and row_number returns the position of each row within its partition.
For example, suppose we have a PySpark DataFrame called df with the following data:
+---+----+--------+
| id|name|category|
+---+----+--------+
|  1|   A|       X|
|  2|   B|       X|
|  3|   C|       Y|
|  4|   D|       Y|
|  5|   E|       Z|
+---+----+--------+
We want to create a new column called group_counter that assigns a number to each row within the same category, starting from 1. To do this, we can use the following code:
# Import the required modules
from pyspark.sql import Window
from pyspark.sql.functions import row_number
# Define the window specification
window = Window.partitionBy("category").orderBy("id")
# Create the group column counter
df = df.withColumn("group_counter", row_number().over(window))
# Show the result
df.show()
The output of the code is:
+---+----+--------+-------------+
| id|name|category|group_counter|
+---+----+--------+-------------+
|  1|   A|       X|            1|
|  2|   B|       X|            2|
|  3|   C|       Y|            1|
|  4|   D|       Y|            2|
|  5|   E|       Z|            1|
+---+----+--------+-------------+
As we can see, the group_counter column increments by 1 for each row within the same category, and resets to 1 when the category changes.
Why is creating a group column counter useful?
Creating a group column counter can be useful for various purposes, such as:
Ranking the rows within a group based on some criteria, such as sales, ratings, or popularity (a minimal top-N filter is sketched after this list).
Assigning labels or identifiers to the rows within a group, such as customer segments, product categories, or order numbers.
Performing calculations or aggregations based on the group column counter, such as cumulative sums, averages, or percentages.
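As a small illustration of the ranking use case above, the group_counter column can be filtered directly; this is a minimal sketch reusing the df from the example:
# The counter doubles as a per-category rank, so a top-N-per-group filter
# is just a comparison; N = 1 keeps the first row of each category.
N = 1
top_n = df.filter(df.group_counter <= N)
top_n.show()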

How to move a specific column of a pyspark dataframe in the start of the dataframe

I have a pyspark dataframe as follows (this is just a simplified example, my actual dataframe has hundreds of columns):
col1,col2,......,col_with_fix_header
1,2,.......,3
4,5,.......,6
2,3,........,4
and I want to move col_with_fix_header in the start, so that the output comes as follows:
col_with_fix_header,col1,col2,............
3,1,2,..........
6,4,5,....
4,2,3,.......
I don't want to list all the columns in the solution.
In case you don't want to list all the columns of your DataFrame, you can use the DataFrame property columns. This property gives you a Python list of column names, and you can simply slice it:
df = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
df.select([df.columns[-1]] + df.columns[:-1]).show()
Output:
+---+---+-------+
|age| id| name|
+---+---+-------+
| 34| a| Alice|
| 36| b| Bob|
| 30| c|Charlie|
| 29| d| David|
| 32| e| Esther|
| 36| f| Fanny|
| 60| g| Gabby|
+---+---+-------+
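If the column you want to move is not necessarily the last one, the same idea works by removing it from the list by name instead of by position. A minimal sketch, using the col_with_fix_header name from the question (it assumes df is the question's DataFrame, which contains that column):
# Move a column to the front by name, regardless of its current position.
target = 'col_with_fix_header'  # column name taken from the question
df.select([target] + [c for c in df.columns if c != target]).show()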

How to apply conditional counts (with reset) to grouped data in PySpark?

I have PySpark code that effectively groups up rows numerically, and increments when a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups.
Take this sample dataframe df
df = sqlContext.createDataFrame(
[
(33, [], '2017-01-01'),
(33, ['apple', 'orange'], '2017-01-02'),
(33, [], '2017-01-03'),
(33, ['banana'], '2017-01-04')
],
('ID', 'X', 'date')
)
This code achieves what I want for this sample df, which is to order by date and to create groups ('grp') that increment when the size column goes back to 0.
from pyspark.sql import Window
from pyspark.sql.functions import col, size, sum
df \
.withColumn('size', size(col('X'))) \
.withColumn(
"grp",
sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
).show()
This is partly based on Pyspark - Cumulative sum with reset condition
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs - achieving a result that looks like
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2)
],
('ID', 'X', 'date', 'size', 'group')
)
edit for clarity
1) For the first date of each ID - the group should be 1 - regardless of what shows up in any other column.
2) However, for each subsequent date, I need to check the size column. If the size column is 0, then I increment the group number. If it is any non-zero, positive integer, then I continue the previous group number.
I've seen a few way to handle this in pandas, but I'm having difficulty understanding the applications in pyspark and the ways in which grouped data is different in pandas vs spark (e.g. do I need to use something called UADFs?)
Create a column zero_or_first by checking whether the size is zero or the row is the first row of its ID, then take a cumulative sum over the window.
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2),
(55, ['banana'], '2017-01-01', 1, 1)
],
('ID', 'X', 'date', 'size', 'group')
)
from pyspark.sql import functions as F, Window
w = Window.partitionBy('ID').orderBy('date')
df2 = df2.withColumn('row', F.row_number().over(w))
df2 = df2.withColumn('zero_or_first', F.when((F.col('size')==0)|(F.col('row')==1), 1).otherwise(0))
df2 = df2.withColumn('grp', F.sum('zero_or_first').over(w))
df2.orderBy('ID').show()
Here's the output. You can see that column grp matches group, where group is the expected result.
+---+---------------+----------+----+-----+---+-------------+---+
| ID|              X|      date|size|group|row|zero_or_first|grp|
+---+---------------+----------+----+-----+---+-------------+---+
| 33|             []|2017-01-01|   0|    1|  1|            1|  1|
| 33|       [banana]|2017-01-04|   1|    2|  4|            0|  2|
| 33|[apple, orange]|2017-01-02|   2|    1|  2|            0|  1|
| 33|             []|2017-01-03|   0|    2|  3|            1|  2|
| 55|       [coffee]|2017-01-01|   1|    1|  1|            1|  1|
| 55|       [banana]|2017-01-01|   1|    1|  2|            0|  1|
| 55|             []|2017-01-03|   0|    2|  3|            1|  2|
+---+---------------+----------+----+-----+---+-------------+---+
I added a window function and created an index within each ID. Then I expanded the conditional statement to also reference that index. The following seems to produce my desired output dataframe, but I am interested in knowing if there is a more efficient way to do this.
from pyspark.sql.functions import rank
window = Window.partitionBy('ID').orderBy('date')
df \
.withColumn('size', size(col('X'))) \
.withColumn('index', rank().over(window).alias('index')) \
.withColumn(
"grp",
sum(((col('size') == 0) | (col('index') == 1)).cast("int")).over(window)
).show()
which yields
+---+---------------+----------+----+-----+---+
| ID|              X|      date|size|index|grp|
+---+---------------+----------+----+-----+---+
| 33|             []|2017-01-01|   0|    1|  1|
| 33|[apple, orange]|2017-01-02|   2|    2|  1|
| 33|             []|2017-01-03|   0|    3|  2|
| 33|       [banana]|2017-01-04|   1|    4|  2|
| 55|       [coffee]|2017-01-01|   1|    1|  1|
| 55|             []|2017-01-03|   0|    2|  2|
+---+---------------+----------+----+-----+---+
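To finish with exactly the target schema from the question (ID, X, date, size, group), a small cleanup step can follow either approach; here is a sketch based on the second version above:
from pyspark.sql import Window
from pyspark.sql.functions import col, rank, size, sum
# Recompute the grouping, then drop the helper index and rename grp to group
# to match the desired ('ID', 'X', 'date', 'size', 'group') schema.
window = Window.partitionBy('ID').orderBy('date')
result = (
    df.withColumn('size', size(col('X')))
      .withColumn('index', rank().over(window))
      .withColumn('grp', sum(((col('size') == 0) | (col('index') == 1)).cast('int')).over(window))
      .drop('index')
      .withColumnRenamed('grp', 'group')
)
result.show()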

Understanding pivot and agg

I have the following columns in DataFrame df:
c_id p_id type values
278230 57371100 11 1
278230 57371100 12 1
...
I execute the following code and expect to see columns 11_total and 12_total:
df
.groupBy($"c_id",$"p_id")
.pivot("type")
.agg(sum("values") as "total")
.na.fill(0)
.show()
Instead, I get columns 11 and 12:
+-----------+----------+---+---+
|       c_id|      p_id| 11| 12|
+-----------+----------+---+---+
|     278230|  57371100|  0|  1|
|     337790|  72031970|  3|  0|
|     320710|  71904400|  0|  1|
Why?
That's because Spark appends the aggregation alias to the pivoted column names only when there are multiple aggregations, for clarity:
val df = Seq(
(278230, 57371100, 11, 1),
(278230, 57371100, 12, 2),
(337790, 72031970, 11, 1),
(337790, 72031970, 11, 2),
(337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total")).
show
// +------+--------+---+---+
// |  c_id|    p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970|  3|  3|
// |278230|57371100|  1|  2|
// +------+--------+---+---+
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total"), max("values").as("max")).
show
// +------+--------+--------+------+--------+------+
// |  c_id|    p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970|       3|     2|       3|     3|
// |278230|57371100|       1|     1|       2|     2|
// +------+--------+--------+------+--------+------+
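If you want the _total suffix even with a single aggregation, one simple option is to rename the pivoted columns afterwards. A minimal sketch, written in PySpark for consistency with the rest of this page and assuming a DataFrame df with the same c_id, p_id, type and values columns:
from pyspark.sql import functions as F
# Pivot with a single aggregation, then append the '_total' suffix manually.
pivoted = df.groupBy('c_id', 'p_id').pivot('type').agg(F.sum('values')).na.fill(0)
keys = ['c_id', 'p_id']
renamed = pivoted.select(*keys, *[F.col(c).alias(c + '_total') for c in pivoted.columns if c not in keys])
renamed.show()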

How to calculate connections of the node in Spark 2

I have the following DataFrame df:
val df = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
+----+---+----+---------+-------+
|from| to|attr|type_from|type_to|
+----+---+----+---------+-------+
|   1|  0|   1|        0|      0|
|   1|  4|   1|        0|      4|
|   2|  2|   1|        2|      2|
|   4|  3|   1|        4|      4|
|   4|  5|   1|        4|      4|
+----+---+----+---------+-------+
I want to count the number of ingoing and outgoing links for each node only when the type of from node is the same as the type of to node (i.e. the values of type_from and type_to).
The cases when to and from are equal should be excluded.
This is how I calculate the number of outgoing links based on this answer proposed by user8371915.
val dfOutgoing = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"from" as "nodeId", $"type_from" as "type")
  .agg(count("*") as "numLinks")
  .na.fill(0)
dfOutgoing.show()
Of course, I can repeat the same calculation for the incoming links and then join the results. But is there any shorter solution?
val dfIncoming = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"to" as "nodeId", $"type_to" as "type")
  .agg(count("*") as "numLinks")
  .na.fill(0)
val df_result = dfOutgoing.join(dfIncoming, Seq("nodeId", "type"), "rightouter")