How to transform the table below into the required format? - scala

I have the following table loaded as a dataframe:
Id Name customCount Custom1 Custom1value custom2 custom2Value custom3 custom3Value
1 qwerty 2 Height 171 Age 76 Null Null
2 asdfg 2 Weight 78 Height 166 Null Null
3 zxcvb 3 Age 28 SkinColor white Height 67
4 tyuio 1 Height 177 Null Null Null Null
5 asdfgh 2 SkinColor brown Age 34 Null Null
I need to change this table into the format below:
Id Name customCount Height Weight Age SkinColor
1 qwerty 2 171 Null 76 Null
2 asdfg 2 166 78 Null Null
3 zxcvb 3 67 Null 28 white
4 tyuio 1 177 Null Null Null
5 asdfgh 2 Null Null 34 brown
I tried this for two custom-field columns:
val rawDf= spark.read.option("Header",false).options(Map("sep"->"|")).csv("/sample/data.csv")
rawDf.createOrReplaceTempView("Table")
val dataframe=spark.sql("select distinct * from (select `_c3` from Table union select `_c5` from Table)")
val dfWithDistinctColumns=dataframe.toDF("colNames")
val list=dfWithDistinctColumns.select("colNames").map(x=>x.getString(0)).collect().toList
val rawDfWithSchema=rawDf.toDF("Id","Name","customCount","h1","v1","h2","v2")
val expectedDf=list.foldLeft(rawDfWithSchema)((df1,c)=>(df1.withColumn(c, when(col("h1")===c,col("v1")).when(col("h2")===c,col("v2")).otherwise(null)))).drop("h1","h2","v1","v2")
But I am not able to do a union on multiple columns when I try it with 3 custom fields.
Can you please give any idea/solution for this?

You can do a pivot, but you also need to clean up the format of the dataframe first:
import org.apache.spark.sql.functions._
import spark.implicits._

// Unpivot the three custom/value pairs into key/value rows, then pivot back
// into one column per key. drop("null") removes the column produced by the
// empty (Null) custom slots.
val df2 = df.select(
    $"Id", $"Name", $"customCount",
    explode(array(
      array($"Custom1", $"Custom1value"),
      array($"custom2", $"custom2Value"),
      array($"custom3", $"custom3Value")
    )).alias("custom")
  ).select(
    $"Id", $"Name", $"customCount",
    $"custom"(0).alias("key"),
    $"custom"(1).alias("value")
  ).groupBy(
    "Id", "Name", "customCount"
  ).pivot("key").agg(first("value")).drop("null").orderBy("Id")

df2.show
+---+------+-----------+----+------+---------+------+
| Id| Name|customCount| Age|Height|SkinColor|Weight|
+---+------+-----------+----+------+---------+------+
| 1|qwerty| 2| 76| 171| null| null|
| 2| asdfg| 2|null| 166| null| 78|
| 3| zxcvb| 3| 28| 67| white| null|
| 4| tyuio| 1|null| 177| null| null|
| 5|asdfgh| 2| 34| null| brown| null|
+---+------+-----------+----+------+---------+------+
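If the number of custom/value pairs grows (the three-custom-field case from the question, or more), here is a minimal sketch that builds the array pairs programmatically instead of listing them by hand; the column names are taken from the question, and the df / drop("null") pieces follow the answer above:

import org.apache.spark.sql.functions._

// Assumed generalization: one array(key, value) column per custom pair,
// using the CustomN / CustomNvalue naming shown in the question.
val customPairs = Seq(
  ("Custom1", "Custom1value"),
  ("custom2", "custom2Value"),
  ("custom3", "custom3Value")
)
val pairCols = customPairs.map { case (k, v) => array(col(k), col(v)) }

val pivoted = df
  .select(col("Id"), col("Name"), col("customCount"),
          explode(array(pairCols: _*)).alias("custom"))
  .select(col("Id"), col("Name"), col("customCount"),
          col("custom")(0).alias("key"),
          col("custom")(1).alias("value"))
  .groupBy("Id", "Name", "customCount")
  .pivot("key")
  .agg(first("value"))
  .drop("null")
  .orderBy("Id")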

Related

Pyspark pivot table and create new columns based on values of another column

I am trying to create new versions of existing columns based on the values of another column. E.g. from the input below I create new columns for 'var1', 'var2', 'var3' for each value the 'split' column can take.
Input:
time student  split var1 var2 var3
t1   Student1 A     1    3    7
t1   Student1 B     2    5    6
t1   Student1 C     3    1    9
t2   Student1 A     5    3    7
t2   Student1 B     9    6    3
t2   Student1 C     3    5    3
t1   Student2 A     1    2    8
t1   Student2 C     7    4    0
Output:
time student  splitA_var1 splitA_var2 splitA_var3 splitB_var1 splitB_var2 splitB_var3 splitC_var1 splitC_var2 splitC_var3
t1   Student1 1           3           7           2           5           6           3           1           9
t2   Student1 5           3           7           9           6           3           3           5           3
t1   Student2 1           2           8           null        null        null        7           4           0
This is an easy pivot with multiple aggregations (within agg()). See the example below.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('pivot_col', func.concat(func.lit('split'), 'split')). \
    groupBy('time', 'student'). \
    pivot('pivot_col'). \
    agg(func.first('var1').alias('var1'),
        func.first('var2').alias('var2'),
        func.first('var3').alias('var3')
        ). \
    show()
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |time| student|splitA_var1|splitA_var2|splitA_var3|splitB_var1|splitB_var2|splitB_var3|splitC_var1|splitC_var2|splitC_var3|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# | t1|Student2| 1| 2| 8| null| null| null| 7| 4| 0|
# | t2|Student1| 5| 3| 7| 9| 6| 3| 3| 5| 3|
# | t1|Student1| 1| 3| 7| 2| 5| 6| 3| 1| 9|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Spark will create the new columns with the following nomenclature: <pivot column value>_<aggregation alias>.
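For reference, here is a hedged Scala sketch of the same multi-aggregation pivot (the dataframe and column names are assumed from the example above):

import org.apache.spark.sql.functions._

// Assumed: dataSdf holds the same data as data_sdf above,
// with columns time, student, split, var1, var2, var3.
val pivoted = dataSdf
  .withColumn("pivot_col", concat(lit("split"), col("split")))
  .groupBy("time", "student")
  .pivot("pivot_col")
  .agg(
    first("var1").alias("var1"),
    first("var2").alias("var2"),
    first("var3").alias("var3")
  )
pivoted.show()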

Is there any easier way to combine 100+ PySpark dataframes with different columns together (not merge, but append)?

Suppose I have a lot of dataframes with a similar structure but different columns. I want to combine all of them together; how can I do this in an easier way?
For example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
Currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
df_comb1=customUnion(df1,df2)
df_comb2=customUnion(df_comb1,df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+), my code becomes messy.
Is there a way to code this in an easier way?
Thanks in advance
You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
import pyspark.sql.functions as f

def add_missing_cols(df, cols):
    v = df
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v
completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]

res = completed_dfs[0]
for df in completed_dfs[1:]:
    res = res.unionAll(df)

res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
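If you are on Spark 3.1 or later, a simpler option is to let unionByName fill in the missing columns for you (shown here as a Scala sketch; PySpark has the same unionByName(..., allowMissingColumns=True) option):

// Spark 3.1+: unionByName with allowMissingColumns fills absent columns with
// nulls, so no manual column alignment is needed.
val dataframes = Seq(df1, df2, df3) // the list of input dataframes
val combined = dataframes.reduce((a, b) => a.unionByName(b, allowMissingColumns = true))
combined.show()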

Modify DataFrame values against a particular value in Spark Scala

See my code:
val spark: SparkSession = SparkSession.builder().appName("ReadFiles").master("local[*]").getOrCreate()
import spark.implicits._
val data: DataFrame = spark.read.option("header", "true")
.option("inferschema", "true")
.csv("Resources/atom.csv")
data.show()
Data looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz null null
New list of values:
val nums: List[Int] = List(10,20)
I want to add these values where ID=4, so that the DataFrame looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz 10 20
I wonder if it is possible. Any help will be appreciated. Thanks
It's possible. Use when/otherwise statements for this case.
import org.apache.spark.sql.functions._
df.withColumn("newCol1", when(col("id") === 4, lit(nums(0))).otherwise(col("newCol1")))
  .withColumn("newCol2", when(col("id") === 4, lit(nums(1))).otherwise(col("newCol2")))
  .show()
//+---+----+----+-------+-------+
//| ID|Name|City|newCol1|newCol2|
//+---+----+----+-------+-------+
//| 1| Ali| lhr| null| null|
//| 2|Ahad| swl| 1| 10|
//| 3|Sana| khi| null| null|
//| 4| ABC| xyz| 10| 20|
//+---+----+----+-------+-------+
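If more than two columns need to be filled, here is a hedged sketch that zips the values in nums with the target column names (names taken from the example) and folds the same when/otherwise update over them:

import org.apache.spark.sql.functions._

// Pair each target column with its new value, then apply the update
// for ID = 4 column by column.
val targetCols = Seq("newCol1", "newCol2").zip(nums)

val updated = targetCols.foldLeft(df) { case (acc, (c, v)) =>
  acc.withColumn(c, when(col("ID") === 4, lit(v)).otherwise(col(c)))
}
updated.show()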

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to compute the cumulative sum of C1 to C100 based on ID and date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1-C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but it was done manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)

// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))

results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+

How to transpose/pivot the rows data to different column based on the value in Spark Scala?

I have a dataframe like this:
person_id ar_id new_value
101 5 Y
102 6 N
103 7 Full Time
104 8 Training
When I am executing:
val ar_id = Seq("5","6","7","8")
df.groupBy("person_id").pivot("ar_id",ar_id).agg(expr("coalesce(first(new_value), \"null\")"))
The output I am getting is:
person_id 5 6 7 8
101 Y null null null
102 null N null null
103 null null Full Time null
104 null null null Training
But my requirement is to have a different column name for each value: say 5 is status, 6 is manager, 7 is availability, and 8 is role, like below:
person_id status manager availability role
101 Y null null null
102 null N null null
103 null null Full Time null
104 null null null Training
Please help. Thanks
If you want to rename columns 5,6,7,8 to status, manager, availability, role, you can do the following:
val renames = Map("5"->"status", "6"->"manager", "7"->"availability", "8"->"role")
val newDF = renames.foldLeft(df){ case (d, (key, value)) => d.withColumnRenamed(key, value) }
Spark 2.4.3
scala> var df= spark.createDataFrame(Seq((101,5,"Y"),(102,6,"N"),(103,7,"Full Time"),(104,8,"Training"))).toDF("person_id", "ar_id" ,"new_value")
scala> var df_v1 = df.groupBy("person_id").pivot($"ar_id").agg(expr("coalesce(first(new_value), \"null\")"))
scala> df_v1.show
+---------+----+----+---------+--------+
|person_id| 5| 6| 7| 8|
+---------+----+----+---------+--------+
| 101| Y|null| null| null|
| 103|null|null|Full Time| null|
| 102|null| N| null| null|
| 104|null|null| null|Training|
+---------+----+----+---------+--------+
1. Create a Map for the columns to be renamed:
scala> val lookup = Map("5" -> "status", "6" -> "manager","7" -> "availability","8" -> "role")
2. Then use the map function to rename the columns:
scala> df_v1.select(df_v1.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*).show()
+---------+------+-------+------------+--------+
|person_id|status|manager|availability| role|
+---------+------+-------+------------+--------+
| 101| Y| null| null| null|
| 103| null| null| Full Time| null|
| 102| null| N| null| null|
| 104| null| null| null|Training|
+---------+------+-------+------------+--------+
Hope this helps you retrieve your desired output. Let me know if you have any questions related to the same, and if it solves your purpose, don't forget to accept the answer. Thanks, Happy Hadoop!