How to map spark DataFrame row values to columns? - scala

I am trying to map values in rows to columns in another dataframe.
I have the following DataFrame, the values in "id" are known to be unique:
sqlContext.createDataFrame(Seq(("a", 1),("b",2))).toDF("id","number")
And:
sqlContext.createDataFrame(Seq(("jane",10),("John",12))).toDF("mcid", "age")
And I wish to produce a DataFrame with the schema:
| mcid | age | a | b |

I am not sure what you are trying to do, but assuming you have this:
val df1 = sqlContext.createDataFrame(Seq(("a", 1),("b",2))).toDF("id","number")
val df2 = sqlContext.createDataFrame(Seq(("jane",10),("John",12))).toDF("mcid", "age")
This will get you a DataFrame with the schema you are looking for:
df2.join(df1).groupBy($"mcid", $"age").pivot("id").sum("number")
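For reference, a minimal end-to-end version of that snippet (assuming the $"col" syntax is available via the SQL implicits import); on the sample data above it produces something like the output shown in the comment:
import sqlContext.implicits._  // for the $"col" syntax

// join with no condition = cross join, then pivot the distinct ids into columns
val result = df2.join(df1)
  .groupBy($"mcid", $"age")
  .pivot("id")
  .sum("number")

result.show()
// +----+---+---+---+
// |mcid|age|  a|  b|
// +----+---+---+---+
// |jane| 10|  1|  2|
// |John| 12|  1|  2|
// +----+---+---+---+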

Related

Explode or pivot spark scala dataframe horizontally to create a big flat dataframe

I have a dataframe with following schema:
UserID | StartDate | endDate | orderId | OrderCost| OrderItems| OrderLocation| Rank
Where Rank is 1 to 10.
I need to transpose this dataframe on rank and create dataframe in the below format:
UserID| StartDate_1 | endDate_1 | orderId_1 | OrderCost_1| OrderItems_1| OrderLocation_1|start_2 |endDate_2| orderId_2 | OrderCost_2| OrderItems_2| OrderLocation_2 |............| startDate_N|endDate_N | orderId_N | OrderCost_N| OrderItems_N| OrderLocation_N
If a user has only two records, with rank 3 and 10, then the requirement is to populate the columns with suffixes _3 and _10; the rest of the cell values for that user will be null.
I have tried 2 brute force approaches
Filter the DF for a rank, and rename the columns with suffix and do self join back to DF.
Grouped by UserID, collect as list and pass it to map function where I populate a array based on rank and then return the seq of string. Create the DF by passing the required schema
Both seem to work (I am unsure if either is the right approach), but they are not generic enough to reuse for the different use cases I have.
In this example I used the auto-mpg sample data from https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/_data/auto-mpg.csv (loaded as df2, with a Rank column added).
Spark by default puts the rank in front, so the column names are "reversed" from what you specified, but this takes only a few steps. The key is that exprs should be created dynamically, and that agg requires the expressions to be split into a head and a tail (which is why there is .agg(exprs(0), exprs.slice(1, exprs.length):_*) below).
scala> df2.columns
res39: Array[String] = Array(mpg, cyl, displ, hp, weight, accel, yr, origin, name, Rank)
// note: here you would use columns.slice with the indices for the
// columns you need, i.e. (1, 7); expr comes from org.apache.spark.sql.functions
val exprs = for (c <- df2.columns.slice(0, 8)) yield expr(s"first($c) as $c")
exprs: Array[org.apache.spark.sql.Column] = Array(first(mpg, false) AS `mpg`, first(cyl, false) AS `cyl`, first(displ, false) AS `displ`, first(hp, false) AS `hp`, first(weight, false) AS `weight`, first(accel, false) AS `accel`, first(yr, false) AS `yr`, first(origin, false) AS `origin`)
scala> val resultDF = df2.groupBy("name").pivot("Rank").agg(exprs(0), exprs.slice(1, exprs.length):_*)
scala> resultDF.columns
res40: Array[String] = Array(name, 1_mpg, 1_cyl, 1_displ, 1_hp, 1_weight, 1_accel, 1_yr, 1_origin, 2_mpg, 2_cyl, 2_displ, 2_hp, 2_weight, 2_accel, 2_yr, 2_origin, 3_mpg, 3_cyl, 3_displ, 3_hp, 3_weight, 3_accel, 3_yr, 3_origin, 4_mpg, 4_cyl, 4_displ, 4_hp, 4_weight, 4_accel, 4_yr, 4_origin, 5_mpg, 5_cyl, 5_displ, 5_hp, 5_weight, 5_accel, 5_yr, 5_origin, 6_mpg, 6_cyl, 6_displ, 6_hp, 6_weight, 6_accel, 6_yr, 6_origin, 7_mpg, 7_cyl, 7_displ, 7_hp, 7_weight, 7_accel, 7_yr, 7_origin, 8_mpg, 8_cyl, 8_displ, 8_hp, 8_weight, 8_accel, 8_yr, 8_origin, 9_mpg, 9_cyl, 9_displ, 9_hp, 9_weight, 9_accel, 9_yr, 9_origin, 10_mpg, 10_cyl, 10_displ, 10_hp, 10_weight, 10_accel, 10_yr, 10_origin)
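If you need the "<column>_<rank>" names from the question rather than Spark's default "<rank>_<column>", a small rename pass can be appended. This is my own sketch (not part of the original answer) and assumes the pivoted rank is always a leading integer in the column name:
// rename e.g. "3_mpg" to "mpg_3"; columns without a numeric prefix (like "name") are left alone
val renamed = resultDF.columns.foldLeft(resultDF) { (df, c) =>
  c.split("_", 2) match {
    case Array(rank, rest) if rank.forall(_.isDigit) => df.withColumnRenamed(c, s"${rest}_$rank")
    case _ => df
  }
}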

add new column in a dataframe depending on another dataframe's row values

I need to add a new column to dataframe DF1 but the new column's value should be calculated using other columns' value present in that DF. Which of the other columns to be used will be given in another dataframe DF2.
eg. DF1
|protocolNo|serialNum|testMethod  |testProperty|
+----------+---------+------------+------------+
|Product1  | AB      |testMethod1 | TP1        |
|Product2  | CD      |testMethod2 | TP2        |
DF2-
|action| type | value                    | exploded     |
+------+------+--------------------------+--------------+
|append| hash | [protocolNo]             | protocolNo   |
|append| text | _                        | _            |
|append| hash | [serialNum,testProperty] | serialNum    |
|append| hash | [serialNum,testProperty] | testProperty |
Now, the value of the exploded column in DF2 will be a column name of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1. Its value should be calculated as below:
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here, in place of the column names, their corresponding row values should be used.
eg. for Row1 of DF1, col value should be
hash[Product1]_hash[ABTP1]
After hashing, this will result in something like abc-df_egh-45e.
The above procedure should be followed for each and every row of DF1.
I've tried using the map and withColumn functions with a UDF on DF1. But inside the UDF the outer dataframe's values are not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
newColumn value is after hashing
Instead of working with DF2 directly, you can translate DF2 into a case class of specifications, e.g.
import org.apache.spark.sql.DataFrame

// `type` is a reserved word in Scala, so the varargs parameter is named `types`
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the columns (DF1 being the dataframe from the question):
val transformed = specifications
  .foldLeft(DF1)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))
def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.types.foldLeft(df)((df: DataFrame, t: String) => {
    t match {
      case "append" =>
        // match on spec.action here and do that, then append with df.withColumn
        df
      case _ => df
    }
  })
}
The syntax may not be exactly correct.
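As a concrete illustration (my own sketch, combining the Spec idea above with the hashing requirement from the question), the "hash"/"append" case of transformColumn could look like the following. Note that Spark's built-in hash is an integer Murmur3 hash, not the string digest shown in the question's example, so substitute your own hashing expression if needed:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, hash}

// hash the concatenation of the input columns (e.g. hash of "AB" + "TP1")
// and append the result as the new column named in the spec
def transformColumn(spec: Spec)(df: DataFrame): DataFrame =
  spec.types.foldLeft(df) { (acc, t) =>
    t match {
      case "append" if spec.action == "hash" =>
        acc.withColumn(spec.columnName, hash(concat(spec.inputColumns.map(col): _*)))
      case _ => acc
    }
  }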
Since DF2 has the column names that will be used to calculate the new column from DF1, I have made the assumption that DF2 will not be a huge DataFrame.
First step would be to filter DF2 and get the column names that we want to pick from DF1.
import org.apache.spark.sql.functions._

val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
Now, hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array[Row]; we need it to become a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
The above line converts each Row into a Column with the hash function applied to it, and reduces them by concatenating with "_". Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
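One caveat (my note, not from the original answer): Spark's hash function returns an integer Murmur3 hash, while the question's sample output looks like a string digest. Under that assumption, a variant using sha2 instead would be:
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// same reduction as above, but producing a SHA-256 string digest per column
val newColumnSha = hashColumns
  .map(f => sha2(col(f.getString(0)).cast("string"), 256))
  .reduce(concat_ws("_", _, _))

DF1.withColumn("newColumn", newColumnSha).show(false)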
Hope this helps!

Casting a column to string in a pyspark dataframe throws an error

I have a pyspark dataframe with two columns, with datatypes
[('area', 'int'), ('customer_play_id', 'int')]
+----+----------------+
|area|customer_play_id|
+----+----------------+
| 100| 8606738 |
| 110| 8601843 |
| 130| 8602984 |
+----+----------------+
I want to cast the column area to str using pyspark commands, but I am getting the errors below.
I tried:
str(df['area']) : but it didn't change the datatype to str
df.area.astype(str) : gave "TypeError: unexpected type: "
df['area'].cast(str) : same error as above
Any help will be appreciated
I want datatype of area as string using pyspark dataframe operation
You can simply do any of these:
Option1:
df1 = df.select('*',df.area.cast("string"))
select - All the columns you want in df1 should be mentioned in select
Option2:
df1 = df.selectExpr("*","cast(area as string) AS new_area")
selectExpr - All the columns you want in df1 should be mentioned in selectExpr
Option3:
df1 = df.withColumn("new_area", df.area.cast("string"))
withColumn will add new column (additional to existing columns of df)
"*" in select and selectExpr represent all the columns.
Use the withColumn function to change the data type or the values of a field in Spark, e.g. as shown below:
import pyspark.sql.functions as F
df = df.withColumn("area",F.col("area").cast("string"))
You can also define a UDF, or cast the column directly (note the udf import):
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

tofloatfunc = udf(lambda x: x, FloatType())
changedTypedf = df.withColumn("Column_name", df["Column_name"].cast(FloatType()))

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark data frames, df1 and df2. Is there a way to select output columns dynamically while joining these two dataframes? The definition below outputs all columns from df1 and df2 in the case of an inner join.
def joinDF (df1: DataFrame, df2: DataFrame , joinExprs: Column, joinType: String): DataFrame = {
val dfJoinResult = df1.join(df2, joinExprs, joinType)
dfJoinResult
//.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
.join(df2, df1("id") === df2("id"), "inner")
.select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema():
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*) but it does not allow selecting columns from both DFs.
Is there a way to pass selectExpr columns dynamically along with dataframe details that we want to select it from in my def? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame , joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
val dfJoinResult = df1.join(df2, joinExpr, joinType)
dfJoinResult.select(selectExpr:_*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
| type|account|value|
+------+-------+-----+
| new|current| 7|
|closed| saving| 5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame , joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
val dfJoinResult = df1.join(df2, joinExpr, joinType)
dfJoinResult.select(selectExpr.head, selectExpr.tail:_*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String], the column names of the resulting dataframe will be different compared to using an expression. When there are columns with the same name present, there will be no way to select them separately afterwards.
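One way around that limitation (my addition, not from the original answer) is to alias the dataframes before the join and qualify the selected columns:
import org.apache.spark.sql.functions.col

val joined = df1.as("l").join(df2.as("r"), col("l.id") === col("r.id"), "inner")

// both sides stay addressable even though they share the column name "id"
val picked = joined.select(col("l.type"), col("l.account"), col("r.value"))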
A slightly modified solution from the one given above is to select the required columns from the DataFrames before performing the join; this has a little less overhead, since fewer columns take part in the JOIN operation.
val dfJoinResult = df1.select("column1", "column2").join(df2.select("col1"), joinExpr, joinType)
But remember to also select the columns on which you will be performing the join, since the select happens first and the join is then performed on the resulting data.
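With the example dataframes from the question, that would look something like:
// keep the join key "id" in both selects, otherwise the join expression has nothing to resolve
val dfJoinResult = df1.select("id", "type", "account")
  .join(df2.select("id", "value"), Seq("id"), "inner")
  .select("type", "account", "value")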

Reshape spark data frame of key-value pairs with keys as new columns

I am new to Spark and Scala. Let's say I have a data frame of lists that are key-value pairs. Is there a way to map the id values in the ids column to new columns?
df.show()
+-------------+----------------------+
| ids         | vals                 |
+-------------+----------------------+
|[id1,id2,id3]| null                 |
|[id2,id5,id6]|[WrappedArray(0,2,4)] |
|[id2,id4,id7]|[WrappedArray(6,8,10)]|
Expected output:
+----+----+
|id1 | id2| ...
+----+----+
|null| 0 | ...
|null| 6 | ...
A possible way would be to compute the columns of the new DataFrame and use those columns to construct the rows.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // for toDF and the $"..." syntax

val data = List((Seq("id1","id2","id3"),None),(Seq("id2","id4","id5"),Some(Seq(2,4,5))),(Seq("id3","id5","id6"),Some(Seq(3,5,6))))
val df = sparkContext.parallelize(data).toDF("ids","values")

val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[Int]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
// get the unique names of the columns across the original data
val ids = df.select(explode($"ids")).distinct.collect.map(_.getString(0))
// map the values to the new columns (the value, or null when the id is absent);
// plain nulls are used instead of Option so the Rows match the schema below
val transposed = values.map(entry => Row.fromSeq(ids.map(id => entry.getOrElse(id, null))))
// programmatically recreate the target schema with the columns we found in the data
import org.apache.spark.sql.types._
val schema = StructType(ids.map(id => StructField(id, IntegerType, nullable=true)))
// Create the new DataFrame
val transposedDf = sqlContext.createDataFrame(transposed, schema)
This process will pass through the data 2 times, although depending on the backing data source, calculating the column names can be rather cheap.
Also, this goes back and forth between DataFrames and RDD. I would be interested in seeing a "pure" DataFrame process.
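For what it's worth, here is a sketch of a DataFrame-only version (my addition, assuming Spark 2.4+ for map_from_arrays). Like the flatMap above, it drops rows whose values are null, so ids that appear only in such rows will not get a column:
import org.apache.spark.sql.functions._

// give each input row a key so the pivot can rebuild one output row per input row
val withKey = df.withColumn("rowId", monotonically_increasing_id())

val transposedDf = withKey
  .select(col("rowId"), explode(map_from_arrays(col("ids"), col("values"))))
  .groupBy("rowId")
  .pivot("key")                 // the map keys (id2, id3, ...) become columns
  .agg(first(col("value")))
  .drop("rowId")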