How to dynamically add columns to a DataFrame? - scala

I am trying to dynamically add columns to a DataFrame from a Seq of String.
Here's an example : the source dataframe is like:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I also have a Seq of String which contains name of columns I want to add. If a column already exists in the source DataFrame, it must do some kind of difference like below :
The Seq looks like :
val columns = Seq("A", "B", "F", "G", "H")
The expectation is:
+-----+---+----+---+---+---+---+---+
|id | A | B | C | D | F | G | H |
+-----+---+----+---+---+---+---+---+
|1 |toto|tata|titi|tutu|null|null|null
|2 |bla |blo | | |null|null|null|
|3 |b | c | a | d |null|null|null|
+-----+---+----+---+---+---+---+---+
What I've done so far is something like this :
val difference = columns diff sourceDF.columns
val finalDF = difference.foldLeft(sourceDF)((df, field) => if (!sourceDF.columns.contains(field)) df.withColumn(field, lit(null))) else df)
.select(columns.head, columns.tail:_*)
But I can't figure how to do this using Spark efficiently in a more simpler and easier way to read ...
Thanks in advance

Here is another way using Seq.diff, single select and map to generate your final column list:
import org.apache.spark.sql.functions.{lit, col}
val newCols = Seq("A", "B", "F", "G", "H")
val updatedCols = newCols.diff(df.columns).map{ c => lit(null).as(c)}
val selectExpr = df.columns.map(col) ++ updatedCols
df.select(selectExpr:_*).show
// +---+----+----+----+----+----+----+----+
// | id| A| B| C| D| F| G| H|
// +---+----+----+----+----+----+----+----+
// | 1|toto|tata|titi|null|null|null|null|
// | 2| bla| blo|null|null|null|null|null|
// | 3| b| c| a| d|null|null|null|
// +---+----+----+----+----+----+----+----+
First we find the diff between newCols and df.columns this gives us: F, G, H. Next we transform each element of the list to lit(null).as(c) via map function. Finally, we concatenate the existing and the new list together to produce selectExpr which is used for the select.

Below will be optimised way with your logic.
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> val df1 = newCol.foldLeft(df)((df,name) => df.withColumn(name, lit(null)))
scala> df1.show()
+---+----+----+----+----+----+----+----+
| id| A| B| C| D| F| G| H|
+---+----+----+----+----+----+----+----+
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
+---+----+----+----+----+----+----+----+
If you do not want to use foldLeft then you can use RunTimeMirror which will be faster. Check Below Code.
scala> import scala.reflect.runtime.universe.runtimeMirror
scala> import scala.tools.reflect.ToolBox
scala> import org.apache.spark.sql.DataFrame
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> def compile[A](code: String): DataFrame => A = {
| val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()
| val tree = tb.parse(
| s"""
| |import org.elasticsearch.spark.sql._
| |import org.apache.spark.sql.DataFrame
| |def wrapper(context:DataFrame): Any = {
| | $code
| |}
| |wrapper _
| """.stripMargin)
|
| val fun = tb.compile(tree)
| val wrapper = fun()
| wrapper.asInstanceOf[DataFrame => A]
| }
scala> def AddColumns(df:DataFrame,withColumnsString:String):DataFrame = {
| val code =
| s"""
| |import org.apache.spark.sql.functions._
| |import org.elasticsearch.spark.sql._
| |import org.apache.spark.sql.DataFrame
| |var data = context.asInstanceOf[DataFrame]
| |data = data
| """ + withColumnsString +
| """
| |
| |data
| """.stripMargin
|
| val fun = compile[DataFrame](code)
| val res = fun(df)
| res
| }
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> var cols = ""
scala> newCol.foreach{ name =>
| cols = ".withColumn(\""+ name + "\" , lit(null))" + cols
| }
scala> val df1 = AddColumns(df,cols)
scala> df1.show
+---+----+----+----+----+----+----+----+
| id| A| B| C| D| H| G| F|
+---+----+----+----+----+----+----+----+
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
+---+----+----+----+----+----+----+----+

Related

Building up a dataframe

I am trying to build a dataframe of 10k records to then save to a parquet file on Spark 2.4.3 standalone
The following works in a small scale up to 1000 records but takes forever when ramping up to 10k
scala> import spark.implicits._
import spark.implicits._
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> for ( i <- 1 to 1000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
scala> someDF.show
+---+------+
| x| y|
+---+------+
| 0| item0|
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
+---+------+
only showing top 20 rows
[Stage 2:=========================================================(20 + 0) / 20]
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> someDF.show
+---+-----+
| x| y|
+---+-----+
| 0|item0|
+---+-----+
scala> for ( i <- 1 to 10000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
Just want to save someDF to a parquet file to then load into Impala
//declare Range that you want
scala> val r = 1 to 10000
//create DataFrame with range
scala> val df = sc.parallelize(r).toDF("x")
//Add new column "y"
scala> val final_df = df.select(col("x"),concat(lit("item"),col("x")).alias("y"))
scala> final_df.show
+---+------+
| x| y|
+---+------+
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
| 20|item20|
+---+------+
scala> final_df.count
res17: Long = 10000
//Write final_df to path in parquet format
scala> final_df.write.format("parquet").save(<path to write>)

Replicating rows in Spark dataframe according values in a column

I would like to replicate rows according to their value for a given column. For example, I got this DataFrame:
+-----+
|count|
+-----+
| 3|
| 1|
| 4|
+-----+
I would like to get:
+-----+
|count|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
I tried to use withColumn method, according to this answer.
val replicateDf = originalDf
.withColumn("replicating", explode(array((1 until $"count").map(lit): _*)))
.select("count")
But $"count" is a ColumnName and cannot be used to represent its values in the above expression.
(I also tried with explode(Array.fill($"count"){1}) but same problem here.)
What do I need to change? Is there a cleaner way?
array_repeat is available from 2.4 onwards. If you need the solution in lower versions, you can use udf() or rdd. For Rdd, check this out
import scala.collection.mutable._
val df = Seq(3,1,4).toDF("count")
val rdd1 = df.rdd.flatMap( x=> { val y = x.getAs[Int]("count"); for ( p <- 0 until y ) yield Row(y) } )
spark.createDataFrame(rdd1,df.schema).show(false)
Results:
+-----+
|count|
+-----+
|3 |
|3 |
|3 |
|1 |
|4 |
|4 |
|4 |
|4 |
+-----+
With df() alone
scala> df.flatMap( r=> { (0 until r.getInt(0)).map( i => r.getInt(0)) } ).show
+-----+
|value|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
For udf(), below would work
val df = Seq(3,1,4).toDF("count")
def array_repeat(x:Int):Array[Int]={
val y = for ( p <- 0 until x )yield x
y.toArray
}
val udf_array_repeat = udf (array_repeat(_:Int):Array[Int] )
df.withColumn("count2", explode(udf_array_repeat('count))).select("count2").show(false)
EDIT :
Check #user10465355's answer below for more information about array_repeat.
You can use array_repeat function:
import org.apache.spark.sql.functions.{array_repeat, explode}
val df = Seq(1, 2, 3).toDF
df.select(explode(array_repeat($"value", $"value"))).show()
+---+
|col|
+---+
| 1|
| 2|
| 2|
| 3|
| 3|
| 3|
+---+

spark - conditinal statements inside select

I am selecting two Columns from Dataframe col1 and col2.
df.select((col("a")+col("b")).as("sum_col")
now user wants this sum_col to be fixed spaces to 4.
so length of a and b is 2 hence max value can come less than 100 (two) or more than 100(three) so need to do it conditionally to add 1 or 2 spaces with it.
can anyone tell me how to handle within select block with cinditional logic to cast the Column to concat and decide one or two spaces to be added
Just use format_string function
import org.apache.spark.sql.functions.format_string
val df = Seq(1, 10, 100).toDF("sum_col")
val result = df.withColumn("sum_col_fmt", format_string("%4d", $"sum_col"))
And proof it works:
result.withColumn("proof", concat(lit("'"), $"sum_col_fmt", lit("'"))).show
// +-------+-----------+------+
// |sum_col|sum_col_fmt| proof|
// +-------+-----------+------+
// | 1| 1|' 1'|
//| 10| 10|' 10'|
// | 100| 100|' 100'|
// +-------+-----------+------+
UDF with String.format:
val df = List((1, 2)).toDF("a", "b")
val leadingZeroes = (value: Integer) => String.format("%04d", value)
val leadingZeroesUDF = udf(leadingZeroes)
val result = df.withColumn("sum_col", leadingZeroesUDF($"a" + $"b"))
result.show(false)
Output:
+---+---+-------+
|a |b |sum_col|
+---+---+-------+
|1 |2 |0003 |
+---+---+-------+
Define a UDF and then register it. I added a dot in front of the format so that it can be shown in the output. Check this out
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = spark.range(1,20).toDF("col1")
df: org.apache.spark.sql.DataFrame = [col1: bigint]
scala> val df2 = df.withColumn("newcol", 'col1 + 'col1)
df2: org.apache.spark.sql.DataFrame = [col1: bigint, newcol: bigint]
scala> def myPadding(a:String):String =
| return ".%4s".format(a)
myPadding: (a: String)String
scala> val myUDFPad = udf( myPadding(_:String):String)
myUDFPad: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df2.select(myUDFPad(df2("newcol"))).show
+-----------+
|UDF(newcol)|
+-----------+
| . 2|
| . 4|
| . 6|
| . 8|
| . 10|
| . 12|
| . 14|
| . 16|
| . 18|
| . 20|
| . 22|
| . 24|
| . 26|
| . 28|
| . 30|
| . 32|
| . 34|
| . 36|
| . 38|
+-----------+
scala>

How to use NOT IN clause in filter condition in spark

I want to filter a column of an RDD source :
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val src = source_primary_key.subtractByKey(destination_primary_key)
I want to use IN clause in filter condition to filter out only the values present in src from source, something like below(EDITED):
val source = spark.read.csv(inputPath + "/source").rdd.map(_.mkString(","))
val destination = spark.read.csv(inputPath + "/destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
equivalent SQL code is
SELECT * FROM SOURCE WHERE ID IN (select ID from src)
Thank you
Since your code isn't reproducible, here is a small example using spark-sql on how to select * from t where id in (...) :
// create a DataFrame for a range 'id' from 1 to 9.
scala> val df = spark.range(1,10).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
// values to exclude
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
// select * from df where id is not in the values to exclude
scala> df.filter(!col("id").isin(f : _*)).show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 8|
| 9|
+---+
// select * from df where id is in the values to exclude
scala> df.filter(col("id").isin(f : _*)).show
Here is the RDD version of the not isin :
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
scala> val rdd2 = rdd.filter(x => !f.contains(x))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at filter at <console>:28
Nevertheless, I still believe this is an overkill since you are already using spark-sql.
It seems in your case that you are actually dealing with DataFrames, thus the solutions mentioned above don't work.
You can use the left anti join approach :
scala> val source = spark.read.format("csv").load("source.file")
source: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> val destination = spark.read.format("csv").load("destination.file")
destination: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> source.show
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Ravi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2|Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
scala> destination.show
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi1 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi2 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2| Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam1| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair1|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
You'll just need to do the following :
scala> val res1 = source.join(destination, Seq("_c0"), "leftanti")
scala> val res2 = destination.join(source, Seq("_c0"), "leftanti")
It's the same logic I mentioned in my answer here.
You can try like--
df.filter(~df.Dept.isin("30","20")).show()
//This will list all the columns of df where Dept NOT IN 30 or 20
You can try something similar in Java,
ds = ds.filter(functions.not(functions.col(COLUMN_NAME).isin(exclusionSet)));
where exclusionSet is a set of objects that needs to be removed from your dataset.

Depth Search in Spark Scala

I've a following data which has user and supervisor relationship.
user |supervisor |id
-----|-----------|----
a | b | 1
b | c | 2
c | d | 3
e | b | 4
I want to explode the relationship hierarchy between the user and supervisor as below.
user |supervisor |id
-----|-----------|----
a | b | 1
a | c | 1
a | d | 1
b | c | 2
b | d | 2
c | d | 3
e | b | 4
e | c | 4
e | d | 4
As you see, for the user 'a', the immediate supervisor is 'b' but again 'b' has 'c' as its supervisor. So indirectly 'c' is supervisor for 'a' as well and so on. Such as, my aim is to explode the hierarchy at any level for a given user. What is the best way to implement this in spark-scala ?
Here is an approach using dataframes. I show doing one level of hierarchy, but it can be done multiple times by repeating the step below:
val df = sc.parallelize(Array(("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("e", "b", 4))).toDF("user", "supervisor", "id")
scala> df.show
+----+----------+---+
|user|supervisor| id|
+----+----------+---+
| a| b| 1|
| b| c| 2|
| c| d| 3|
| e| b| 4|
+----+----------+---+
Let's enable cross joins:
spark.conf.set("spark.sql.crossJoin.enabled", true)
Then join the same table:
val dfjoin = df.as("df1").join(df.as("df2"), $"df1.supervisor" === $"df2.user", "left").select($"df1.user", $"df1.supervisor".as("s1"), $"df1.id", $"df2.supervisor".as("s2"))
I use an udf to combine two columns into an array:
import org.apache.spark.sql.functions.udf
val combineUdf = udf((x: String, y: String) => Seq(x, y))
val dfcombined = dfjoin.withColumn("combined", combineUdf($"s1", $"s2")).select($"user", $"combined", $"id")
Then the last step is to flatten the array to separate rows and filter rows that did not join:
val dfexplode = dfcombined.withColumn("supervisor", explode($"combined")).select($"user", $"id", $"supervisor").filter($"supervisor".isNotNull)
The first level hierarchy looks like this:
scala> dfexplode.show
+----+---+----------+
|user| id|supervisor|
+----+---+----------+
| c| 3| d|
| b| 2| c|
| b| 2| d|
| a| 1| b|
| a| 1| c|
| e| 4| b|
| e| 4| c|
+----+---+----------+