Pivot dataframe in pyspark using column for suffix - pyspark

This question is similar to one I've asked before (Pandas pivot using column as suffix), but this time I need to do it in PySpark instead of Pandas. The problem is as follows.
I have a dataframe like the following example:
Id   | Type | Value_1 | Value_2
1234 | A    | 1       | 2
1234 | B    | 1       | 2
789  | A    | 1       | 2
789  | B    | 1       | 2
567  | A    | 1       | 2
And I want to transform to get the following:
Id   | Value_1_A | Value_1_B | Value_2_A | Value_2_B
1234 | 1         | 1         | 2         | 2
789  | 1         | 1         | 2         | 2
567  | 1         |           | 2         |
In summary: I want to replicate the value columns using the 'Type' column as a suffix and convert the dataframe to a wide format.
One solution I can think of is creating the columns with the suffix manually and then aggregating.
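For reference, a rough sketch of that manual approach (using the example df constructed below, and assuming the distinct Type values A and B are known up front) could build each suffixed column with when and then collapse the rows with an aggregate that ignores nulls:
import pyspark.sql.functions as F

value_cols = ["Value_1", "Value_2"]
types = ["A", "B"]  # assumes the distinct Type values are known up front

# Build one suffixed column per (value column, Type) pair; rows of other Types stay null
manual = df
for t in types:
    for v in value_cols:
        manual = manual.withColumn(f"{v}_{t}", F.when(F.col("Type") == t, F.col(v)))

# Collapse to one row per Id, keeping the first non-null value of each suffixed column
df_wide = manual.groupBy("Id").agg(
    *[F.first(f"{v}_{t}", ignorenulls=True).alias(f"{v}_{t}")
      for t in types for v in value_cols]
)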
Other solutions I've tried use the PySpark GroupedData pivot function, as follows:
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({'Id': {0: 1234, 1: 1234, 2: 789, 3: 789, 4: 567},
                                         'Type': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A'},
                                         'Value_1': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
                                         'Value_2': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}}))
df.groupBy("Id").pivot("Type").avg().show()
The issue with this solution is that the resulting dataframe effectively contains the Id column three times (the group key plus an avg(Id) per Type), and that I can't name the columns with Type as a suffix, since they end up named like this:
['Id',
'A_avg(Id)',
'A_avg(Value_1)',
'A_avg(Value_2)',
'B_avg(Id)',
'B_avg(Value_1)',
'B_avg(Value_2)']
I also tried specifying the value columns to the pivot function, as follows:
df.groupBy("Id").pivot("Type", values=["Value_1", "Value_2"]).avg().show()
This removes the extra Id columns, but the rest of the columns only have null values.
Is there any elegant way to do the transformation I'm attempting on pyspark?

Option 1:
If you don't mind having your Type values as column prefixes rather than suffixes, you can use a combination of agg, avg, and alias:
import pyspark.sql.functions as F

df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1").alias("Value_1"), F.avg("Value_2").alias("Value_2"))

df_pivot.show(truncate=False)
+----+---------+---------+---------+---------+
|Id |A_Value_1|A_Value_2|B_Value_1|B_Value_2|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
Separately, it's worth noting here that the values argument in the pivot method is used to limit which values you want to retain from your pivot (i.e., Type) column. For example, if you only wanted A and not B in your output, you would specify pivot("Type", values=["A"]).
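As a concrete illustration (reusing the df and the F import from above), restricting the pivot to A only would look like this:
df.groupBy("Id").pivot("Type", values=["A"]).agg(
    F.avg("Value_1").alias("Value_1"), F.avg("Value_2").alias("Value_2")
).show()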
Option 2:
If you do still want them as suffixes, you'll likely have to use some regex and withColumnRenamed, which could look something like this:
import pyspark.sql.functions as F
import re

df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1"), F.avg("Value_2"))

for col in df_pivot.columns:
    if "avg(" in col:
        suffix = re.findall(r"^.*(?=_avg\()|$", col)[0]
        base_name = re.findall(r"(?<=\().*(?=\)$)|$", col)[0]
        df_pivot = df_pivot.withColumnRenamed(col, "_".join([base_name, suffix]))

df_pivot.show(truncate=False)
+----+---------+---------+---------+---------+
|Id |Value_1_A|Value_2_A|Value_1_B|Value_2_B|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+

Related

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

I'm trying to populate the grouper column like below. In the table below, X signifies the start of a new record, so each X, Y, Z run needs to be grouped together. In MySQL, I would accomplish this like:
select @x:=1;
update table set grouper=if(column_1='X',@x:=@x+1,@x);
I am trying to see if there is a way to do this without using a loop, using .withColumn or something similar.
what I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
A simple window function combined with the row_number() built-in function should get you your desired output:
val df = Seq(
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z"),
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the final orderBy is only there to match the expected output, for visualization. On a real cluster with real processing, an orderBy like that doesn't make much sense.

How to create pairs of nodes in Spark?

I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can use a window function with the lead built-in function to create the pairs, and the to_utc_timestamp built-in function to convert the date string to a timestamp. Finally, you have to filter out the unpaired rows, since you don't require them in the output.
The following program implements the explanation above; I have used comments for clarity.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// window spec partitioning by group and ordering by date
def windowSpec = Window.partitionBy("group").orderBy("date")

df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) // convert the date string to a timestamp; choose another timezone as required
  .withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) // use the window to create the pairs
  .filter(col("nodeId_2").isNotNull) // filter out the unpaired rows
  .select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) // select the final dataframe as required
  .show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful
Note: to get the correct converted date I have used Asia/Kathmandu as the timezone.
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
  (1, 1, "2016-10-12T12:10:00.000Z"),
  (1, 2, "2016-10-12T12:00:00.000Z"),
  (1, 3, "2016-10-12T12:05:00.000Z"),
  (2, 1, "2016-10-12T12:30:00.000Z"),
  (2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+

Scala: how to match two dfs, and if they match, update the key in the first df

I have data in two dataframes:
selectedPersonDF:
ID key
1
2
3
4
5
selectedDetailsDF:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 10 10 765
10 12 19 909
11 2 20 708
Code :
val personDF = spark.read.option("header", "true").option("inferSchema", "false").csv("person.csv")
val detailsDF = spark.read.option("header", "true").option("inferSchema", "false").csv("details.csv")
val selectedPersonDF = personDF.select(col("ID"), col("key"))
val selectedDetailsDF = detailsDF.select(col("first"), col("second"), col("third"), col("key"))
I have to match the selectedPersonDF ID column against all of the selectedDetailsDF columns (first, second, third). If any of those columns matches a person's ID, then we have to take the key value from selectedDetailsDF and update the key column in selectedPersonDF.
Expected output (in selectedPersonDF):
ID key
1 777
2 708
3
4
5
And after removing the matched rows from the persons df (since they matched with the details df), the remaining data should be stored in another df.
You can use a left join with an || condition check as follows:
val finalDF = selectedPersonDF.join(selectedDetailsDF.withColumnRenamed("key", "key2"), $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
  .select($"ID", $"key2".as("key"))

finalDF.show(false)
so finalDF should give you
+---+----+
|ID |key |
+---+----+
|1 |777 |
|2 |708 |
|3 |null|
|4 |null|
|5 |null|
+---+----+
We can call .na.fill("") on the above dataframe (the key column has to be StringType) to get:
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
After that you can use filter to separate the filled dataframe into non-matching and matching rows, using the key column (an empty string means no match):
val filledDF = finalDF.na.fill("")
val notMatchingDF = filledDF.filter($"key" === "")
val matchingDF = filledDF.except(notMatchingDF)
Update: if the column names of selectedDetailsDF are unknown, except for the key column
If the column names of the second dataframe are unknown, then you will have to combine the unknown columns into an array column:
import org.apache.spark.sql.functions._

val columnsToCheck = (selectedDetailsDF.columns.toSet - "key").toList
val tempSelectedDetailsDF = selectedDetailsDF.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
Now the tempSelectedDetailsDF dataframe has two columns: all of the unknown columns combined into a single array column, and the key column renamed to key2.
After that you would need a udf function for checking the condition while joining
val arrayContains = udf((array: collection.mutable.WrappedArray[String], value: String) => array.contains(value))
Then you join the dataframes using a call to the defined udf function:
val finalDF = selectedPersonDF.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
.select($"ID", $"key2".as("key"))
.na.fill("")
Rest of the process is already defined above.
I hope the answer is helpful and understandable.

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
+----+---------------------------+
| ID | words                     |
+----+---------------------------+
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
+----+---------------------------+
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
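If you want the ID to start at 1, as in your expected output, you can pass a start value to enumerate:
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data, 1)]).show()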
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.extend([(i, my_data[i])])

df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import Row
from datetime import datetime

utc = datetime.utcnow()  # example value; the original snippet assumed utc was already defined
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+

Checking datetime format of a column in dataframe

I have an input dataframe, which has the data below:
id date_column
1 2011-07-09 11:29:31+0000
2 2011-07-09T11:29:31+0000
3 2011-07-09T11:29:31
4 2011-07-09T11:29:31+0000
I want to check whether the format of date_column matches the format "%Y-%m-%dT%H:%M:%S+0000". If the format matches, I want to add a column with value 1, otherwise 0.
Currently, I have defined a UDF to do this operation:
from datetime import datetime

def date_pattern_matching(value, pattern):
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except:
        return "0"
It generates the output dataframe below:
id date_column output
1 2011-07-09 11:29:31+0000 0
2 2011-07-09T11:29:31+0000 1
3 2011-07-09T11:29:31 0
4 2011-07-09T11:29:31+0000 1
Execution through the UDF takes a lot of time; is there an alternative way to achieve this?
Try the regex pyspark.sql.Column.rlike operator with a when/otherwise block:
from pyspark.sql import functions as F

data = [[1, "2011-07-09 11:29:31+0000"],
        [1, "2011-07-09 11:29:31+0000"],
        [2, "2011-07-09T11:29:31+0000"],
        [3, "2011-07-09T11:29:31"],
        [4, "2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])

regex = r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+?\-?[0-9]{4})"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))
df_w_output.show()
Output
+---+------------------------+------+
|id |date_column |output|
+---+------------------------+------+
|1 |2011-07-09 11:29:31+0000|0 |
|1 |2011-07-09 11:29:31+0000|0 |
|2 |2011-07-09T11:29:31+0000|1 |
|3 |2011-07-09T11:29:31 |0 |
|4 |2011-07-09T11:29:31+0000|1 |
+---+------------------------+------+