Add a new column in a dataframe depending on another dataframe's row values - Scala

I need to add a new column to dataframe DF1, and its value should be calculated using other columns already present in DF1. Which of those columns to use is given in another dataframe, DF2.
e.g. DF1 -
+----------+---------+-----------+------------+
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  |AB       |testMethod1|TP1         |
|Product2  |CD       |testMethod2|TP2         |
+----------+---------+-----------+------------+
DF2 -
+------+----+------------------------+------------+
|action|type|value                   |exploded    |
+------+----+------------------------+------------+
|append|hash|[protocolNo]            |protocolNo  |
|append|text|_                       |_           |
|append|hash|[serialNum,testProperty]|serialNum   |
|append|hash|[serialNum,testProperty]|testProperty|
+------+----+------------------------+------------+
The exploded column in DF2 contains DF1 column names whenever the value of the type column is hash.
Required -
A new column should be created in DF1, and its value should be calculated as below:
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here, in place of each column name, its corresponding row value should be used.
e.g. for Row 1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
which will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and the withColumn function with a UDF on DF1, but inside the UDF the outer dataframe's values are not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF -
+----------+---------+-----------+------------+--------------+
|protocolNo|serialNum|testMethod |testProperty|newColumn     |
+----------+---------+-----------+------------+--------------+
|Product1  |AB       |testMethod1|TP1         |abc-df_egh-4je|
|Product2  |CD       |testMethod2|TP2         |dfg-df_ijk-r56|
+----------+---------+-----------+------------+--------------+
The newColumn values shown are the results after hashing.

Instead of working with DF2 directly, you can translate DF2 into a case class of specifications, e.g. (type is a reserved word in Scala, so the varargs parameter is called types here):
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the specifications like below:
val transformed = specifications
  .foldLeft(DF1)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame =
  spec.types.foldLeft(df) { (acc: DataFrame, t: String) =>
    t match {
      case "append" =>
        // match on spec.action here, build the new column expression,
        // then append it with acc.withColumn(spec.columnName, ...)
        acc
      case _ => acc
    }
  }
This is only a sketch of the approach; fill in the action handling for your case.
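For illustration, a hypothetical fleshed-out version of transformColumn for the hash/append case could look like the sketch below; it hashes each column in spec.inputColumns and joins the results with "_", which is an assumption about the intended semantics borrowed from the other answer rather than the asker's exact spec:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def transformColumn(spec: Spec)(df: DataFrame): DataFrame =
  spec.types.foldLeft(df) { (acc, t) =>
    t match {
      case "append" =>
        // for the "hash" action: hash every input column and concatenate with "_"
        val newCol = spec.action match {
          case "hash" => concat_ws("_", spec.inputColumns.map(c => hash(col(c))): _*)
          case _      => lit(null) // other actions left to fill in
        }
        acc.withColumn(spec.columnName, newCol)
      case _ => acc
    }
  }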

Since DF2 holds the column names that will be used to calculate the new column in DF1, I'm assuming that DF2 will not be a huge DataFrame.
The first step would be to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
Now, hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array of Row; we need to turn it into a Column that can be applied when creating newColumn in DF1.
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
The above line converts each Row into a Column with the hash function applied, and reduces them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
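Putting the pieces together, a minimal end-to-end sketch on the sample data from the question might look like this (assuming a SparkSession named spark, with spark.implicits._ in scope for toDF):
import org.apache.spark.sql.functions._
import spark.implicits._

val DF1 = Seq(
  ("Product1", "AB", "testMethod1", "TP1"),
  ("Product2", "CD", "testMethod2", "TP2")
).toDF("protocolNo", "serialNum", "testMethod", "testProperty")

val DF2 = Seq(
  ("append", "hash", "[protocolNo]", "protocolNo"),
  ("append", "text", "_", "_"),
  ("append", "hash", "[serialNum,testProperty]", "serialNum"),
  ("append", "hash", "[serialNum,testProperty]", "testProperty")
).toDF("action", "type", "value", "exploded")

// columns of DF1 that participate in the hash
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect

// hash(protocolNo)_hash(serialNum)_hash(testProperty) as a single Column
val newColumnHash = hashColumns
  .map(r => hash(col(r.getString(0))))
  .reduce(concat_ws("_", _, _))

DF1.withColumn("newColumn", newColumnHash).show(false)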
Hope this helps!

Related

Explode or pivot spark scala dataframe horizontally to create a big flat dataframe

I have a dataframe with the following schema:
UserID | StartDate | endDate | orderId | OrderCost| OrderItems| OrderLocation| Rank
Where Rank is 1 to 10.
I need to transpose this dataframe on Rank and create a dataframe in the below format:
UserID| StartDate_1 | endDate_1 | orderId_1 | OrderCost_1| OrderItems_1| OrderLocation_1| StartDate_2 |endDate_2| orderId_2 | OrderCost_2| OrderItems_2| OrderLocation_2 |............| StartDate_N|endDate_N | orderId_N | OrderCost_N| OrderItems_N| OrderLocation_N
If a user has only two records, with rank 3 and 10, then the requirement is to populate the columns with suffixes _3 and _10; the rest of the cell values for that user will be null.
I have tried 2 brute-force approaches:
1. Filter the DF for a rank, rename the columns with the suffix, and self-join back to the DF.
2. Group by UserID, collect as a list, and pass it to a map function where I populate an array based on rank and then return a seq of strings. Create the DF by passing the required schema.
Both seem to work (unsure if it's the right approach), but they are not generic enough for me to reuse for the different use cases I have.
In this example I used https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/_data/auto-mpg.csv
Spark by default puts the rank in front, so the column names are "reversed" from what you specified, but this can be done in only a few steps. The key is that exprs should be created dynamically, and that agg requires the sequence to be split into a head and tail (which is why there is .agg(exprs(0), exprs.slice(1, exprs.length):_*) below).
scala> df2.columns
res39: Array[String] = Array(mpg, cyl, displ, hp, weight, accel, yr, origin, name, Rank)
// note here, you would use columns.slice with the indices for
// the columns you need, i.e. (1, 7)
scala> val exprs = for (col <- df2.columns.slice(0, 8)) yield expr(s"first(${col}) as ${col}")
exprs: Array[org.apache.spark.sql.Column] = Array(first(mpg, false) AS `mpg`, first(cyl, false) AS `cyl`, first(displ, false) AS `displ`, first(hp, false) AS `hp`, first(weight, false) AS `weight`, first(accel, false) AS `accel`, first(yr, false) AS `yr`, first(origin, false) AS `origin`)
scala> val resultDF = df2.groupBy("name").pivot("Rank").agg(exprs(0), exprs.slice(1, exprs.length):_*)
scala> resultDF.columns
res40: Array[String] = Array(name, 1_mpg, 1_cyl, 1_displ, 1_hp, 1_weight, 1_accel, 1_yr, 1_origin, 2_mpg, 2_cyl, 2_displ, 2_hp, 2_weight, 2_accel, 2_yr, 2_origin, 3_mpg, 3_cyl, 3_displ, 3_hp, 3_weight, 3_accel, 3_yr, 3_origin, 4_mpg, 4_cyl, 4_displ, 4_hp, 4_weight, 4_accel, 4_yr, 4_origin, 5_mpg, 5_cyl, 5_displ, 5_hp, 5_weight, 5_accel, 5_yr, 5_origin, 6_mpg, 6_cyl, 6_displ, 6_hp, 6_weight, 6_accel, 6_yr, 6_origin, 7_mpg, 7_cyl, 7_displ, 7_hp, 7_weight, 7_accel, 7_yr, 7_origin, 8_mpg, 8_cyl, 8_displ, 8_hp, 8_weight, 8_accel, 8_yr, 8_origin, 9_mpg, 9_cyl, 9_displ, 9_hp, 9_weight, 9_accel, 9_yr, 9_origin, 10_mpg, 10_cyl, 10_displ, 10_hp, 10_weight, 10_accel, 10_yr, 10_origin)
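If you prefer the suffix-style names from the question (e.g. mpg_1 instead of 1_mpg), a small rename pass like the sketch below could be applied afterwards; the digit-prefix check is my assumption about the generated names:
// rename e.g. "1_mpg" to "mpg_1"
val renamed = resultDF.columns.foldLeft(resultDF) { (df, c) =>
  c.split("_", 2) match {
    case Array(rank, base) if rank.forall(_.isDigit) =>
      df.withColumnRenamed(c, s"${base}_${rank}")
    case _ => df
  }
}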

How to merge two or more columns into one?

I have a streaming DataFrame over which I want to calculate min and avg for some columns.
Instead of getting separate resulting columns for min and avg after applying the operations, I want to merge the min and avg output into a single column.
The dataframe looks like this:
+-----+-----+
|  1  |  2  |
+-----+-----+
| 24  | 55  |
| 20  | 51  |
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical operations to be:
+---------+---------+
|result(1)|result(2)|
+---------+---------+
|20, 22   |51, 53   |
+---------+---------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
  struct(min(col(name)), avg(col(name))) as s"result($name)")
  ^^^^^^ HERE
The power of struct shows when you want to reference one field in the struct, since you can use its name (not an index).
q.select("structCol.name")
What you want to do is merge the values of multiple columns into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you :
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+
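For completeness, plugging the array-based res back into the aggregation from the question would look like this (processedDf, xyz and timestamp as in the original code):
val res = List("1", "2").map(name =>
  array(min(col(name)), avg(col(name))).as(s"result($name)"))

val groupedByTimeWindowDF1 = processedDf
  .groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)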

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vector values are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires a Spark context to use createDataFrame. I only want to transform the existing data frame. I also know of .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show(false)
//+---+---------------------+
//|id |features             |
//+---+---------------------+
//|1  |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// The size of `elements` should equal the number of entries in the `features` column.
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
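As a side note, if you are on Spark 3.0 or later (which may be newer than the setup in the question), the built-in vector_to_array function can replace the hand-rolled UDF; the getItem steps stay the same:
import org.apache.spark.ml.functions.vector_to_array

// Spark 3.0+: add the ArrayType column without a UDF
val dfArr = df.withColumn("featuresArr", vector_to_array($"features"))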

Dynamically select column content based on other column from the same row

I am using Spark 1.6.1. Let's say my data frame looks like:
+------------+-----+----+
|categoryName|catA |catB|
+------------+-----+----+
|catA        |0.25 |0.75|
|catB        |0.5  |0.5 |
+------------+-----+----+
Where categoryName has String type and the cat* columns are Double. I would like to add a column that will contain the value from the column whose name is in the categoryName column:
+------------+-----+----+-------+
|categoryName|catA |catB| score |
+------------+-----+----+-------+
|catA        |0.25 |0.75| 0.25  |  ('score' has the value from column 'catA')
|catB        |0.5  |0.5 | 0.5   |  ('score' has the value from column 'catB')
+------------+-----+----+-------+
I need such extraction for some later calculations. Any ideas?
Important: I don't know the names of the category columns. The solution needs to be dynamic.
Spark 2.0:
You can do this (for any number of category columns) by creating a temporary column which holds a map of categoryName -> categoryValue, and then selecting from it:
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// create a map of category -> value, and then select from that map using categoryName:
input
  .withColumn("asMap", map(catCols.flatMap(c => Seq(lit(c), col(c))): _*))
  .withColumn("score", $"asMap".apply($"categoryName"))
  .drop("asMap")
Spark 1.6: Similar idea, but using an array and a UDF to select from it:
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// UDF to select from the array by the index of colName in catCols
import scala.collection.mutable
val getByColName = udf[Double, String, mutable.WrappedArray[Double]] {
  case (colName, colValues) =>
    val index = catCols.zipWithIndex.find(_._1 == colName).map(_._2)
    index.map(colValues.apply).getOrElse(0.0)
}
// create an array of category values and select from it using the UDF:
input
  .withColumn("asArray", array(catCols.map(col): _*))
  .withColumn("score", getByColName($"categoryName", $"asArray"))
  .drop("asArray")
You have several options:
If you are using Scala you can use the Dataset API, in which case you would simply create a map which does the calculation.
You can move from the dataframe to an RDD and use a map.
You can create a UDF which receives all relevant columns as input and does the calculation inside.
You can use a bunch of when/otherwise clauses to do the search (e.g. when($"categoryName" === "catA", $"catA").otherwise($"catB")); a dynamic version is sketched below.
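A minimal sketch of that last option, built dynamically over the category columns (catCols and input as named in the earlier answer, spark implicits in scope):
import org.apache.spark.sql.functions._

val catCols = input.columns.filterNot(_ == "categoryName")

// chain when(categoryName === <column name>, <column value>) over all category columns
val scoreCol = catCols.foldLeft(lit(null).cast("double")) { (acc, c) =>
  when($"categoryName" === c, col(c)).otherwise(acc)
}

input.withColumn("score", scoreCol)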

Reshape spark data frame of key-value pairs with keys as new columns

I am new to Spark and Scala. Let's say I have a data frame of lists that are key-value pairs. Is there a way to map the ids in the ids column to new columns?
df.show()
+-------------+----------------------+
|ids          |vals                  |
+-------------+----------------------+
|[id1,id2,id3]|null                  |
|[id2,id5,id6]|[WrappedArray(0,2,4)] |
|[id2,id4,id7]|[WrappedArray(6,8,10)]|
+-------------+----------------------+
Expected output:
+----+----+
|id1 | id2| ...
+----+----+
|null| 0 | ...
|null| 6 | ...
A possible way would be to compute the columns of the new DataFrame and use those columns to construct the rows.
import org.apache.spark.sql.functions._
val data = List((Seq("id1","id2","id3"),None),(Seq("id2","id4","id5"),Some(Seq(2,4,5))),(Seq("id3","id5","id6"),Some(Seq(3,5,6))))
val df = sparkContext.parallelize(data).toDF("ids","values")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[Int]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
// get the unique names of the columns across the original data
val ids = df.select(explode($"ids")).distinct.collect.map(_.getString(0))
// map the values to the new columns (to Some value or None)
val transposed = values.map(entry => Row.fromSeq(ids.map(id => entry.get(id))))
// programmatically recreate the target schema with the columns we found in the data
import org.apache.spark.sql.types._
val schema = StructType(ids.map(id => StructField(id, IntegerType, nullable=true)))
// Create the new DataFrame
val transposedDf = sqlContext.createDataFrame(transposed, schema)
This process passes over the data twice, although depending on the backing data source, calculating the column names can be rather cheap.
Also, this goes back and forth between DataFrames and RDDs. I would be interested in seeing a "pure" DataFrame process.
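On that note, a rough sketch of a DataFrame-only variant is given below; it assumes a newer Spark version (2.1+ for posexplode) than the sqlContext-era code above, and the synthetic rowId is my addition so that each input row maps to one pivoted output row:
import org.apache.spark.sql.functions._

// keep one output row per input row
val withId = df.withColumn("rowId", monotonically_increasing_id())

// pair every id with the value at the same position (null when values is null)
val exploded = withId
  .select($"rowId", posexplode($"ids").as(Seq("pos", "id")), $"values")
  .withColumn("value", expr("`values`[pos]"))

// pivot the ids into columns, one column per distinct id
val pivotedDf = exploded
  .groupBy("rowId")
  .pivot("id")
  .agg(first($"value"))
  .drop("rowId")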