Passing data frame as optional function parameter in Scala - scala

Is there a way that I can pass a data frame as an optional input function parameter in Scala?
Ex:
def test(sampleDF: DataFrame = df.sqlContext.emptyDataFrame): DataFrame = {
}
df.test(sampleDF)
Though I am passing a valid data frame here, it is always assigned an empty data frame. How can I avoid this?

Yes, you can pass a dataframe as a parameter to a function.
Let's say you have a dataframe such as
import sqlContext.implicits._
val df = Seq(
  (1, 2, 3),
  (1, 2, 3)
).toDF("col1", "col2", "col3")
which is
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |2 |3 |
|1 |2 |3 |
+----+----+----+
You can pass it to a function as below:
import org.apache.spark.sql.DataFrame
def test(sampleDF: DataFrame): DataFrame = {
  sampleDF.select("col1", "col2") // doing some operation in dataframe
}
val testdf = test(df)
testdf would be
+----+----+
|col1|col2|
+----+----+
|1 |2 |
|1 |2 |
+----+----+
Edited
As eliasah pointed out, @Garipaso wanted an optional argument. This can be done by defining the function as
def test(sampleDF: DataFrame = sqlContext.emptyDataFrame): DataFrame = {
  if (sampleDF.count() > 0) sampleDF.select("col1", "col2") // doing some operation in dataframe
  else sqlContext.emptyDataFrame
}
If we pass a valid dataframe as
test(df).show(false)
It will give output as
+----+----+
|col1|col2|
+----+----+
|1 |2 |
|1 |2 |
+----+----+
But if we don't pass an argument, as in
test().show(false)
we get an empty dataframe:
++
||
++
++
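As a side note, an alternative that avoids the count() call (which triggers a Spark job) is to default the parameter to Option[DataFrame] = None. The following is only a minimal sketch, not part of the original answer; the names test2, spark and df are illustrative and assumed to be in scope:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: default the parameter to None instead of an empty dataframe
def test2(spark: SparkSession, sampleDF: Option[DataFrame] = None): DataFrame =
  sampleDF match {
    case Some(data) => data.select("col1", "col2") // doing some operation in dataframe
    case None       => spark.emptyDataFrame        // nothing was passed in
  }

// test2(spark, Some(df)).show(false)  // selects col1, col2
// test2(spark).show(false)            // empty dataframe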
I hope the answer is helpful.


How to efficiently map over DF and use combination of outputs?

Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method1", col("num1") / col("num2"))
}

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method2", col("num1") * col("num2"))
}

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method3", col("num1") + col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is the more efficient way to do this?
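For context, here is a minimal sketch of the setup described above, with illustrative trait and object names (they are not from the original post), assuming the df defined earlier:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: each "class with an addCol method" behind a common trait
trait ColumnAction { def addCol(df: DataFrame): DataFrame }

object Method1 extends ColumnAction {
  def addCol(df: DataFrame): DataFrame = df.withColumn("method1", col("num1") / col("num2"))
}
object Method2 extends ColumnAction {
  def addCol(df: DataFrame): DataFrame = df.withColumn("method2", col("num1") * col("num2"))
}
object Method3 extends ColumnAction {
  def addCol(df: DataFrame): DataFrame = df.withColumn("method3", col("num1") + col("num2"))
}

val actions: Seq[ColumnAction] = Seq(Method1, Method2, Method3)
val result = actions.foldLeft(df)((acc, action) => action.addCol(acc))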
An efficient way to do this is to use select.
select is faster than foldLeft if you have very huge data - check this post.
You can build the required expressions and use them inside select; check the code below.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
  $"num1",
  $"num2",
  ($"num1" / $"num2").as("method1"),
  ($"num1" * $"num2").as("method2"),
  ($"num1" + $"num2").as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Try using higher-order functions; all three of your functions can be replaced with the single function below.
scala> def add(
  num1: Column, // maybe you could use variable args here if you want
  num2: Column,
  f: (Column, Column) => Column
): Column = f(num1, num2)
For example, with varargs; when invoking this method you need to pass the required columns at the end.
def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)
Invoking the add function (the three-argument version above):
scala> val colExpr = Seq(
  $"num1",
  $"num2",
  add($"num1", $"num2", (_ / _)).as("method1"),
  add($"num1", $"num2", (_ * _)).as("method2"),
  add($"num1", $"num2", (_ + _)).as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
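For completeness, here is a minimal sketch of how the varargs variant defined earlier could be invoked; the function argument comes first and the columns follow. It is renamed to addVarargs here only to avoid clashing with the earlier definition, and it assumes the same df with spark.implicits._ in scope, as in the spark shell:

import org.apache.spark.sql.Column

// varargs variant from the answer above: the combining function first, then the columns
def addVarargs(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)

val colExprVarargs = Seq(
  $"num1",
  $"num2",
  addVarargs((_ / _), $"num1", $"num2").as("method1"),
  addVarargs((_ * _), $"num1", $"num2").as("method2"),
  addVarargs((_ + _), $"num1", $"num2").as("method3")
)

df.select(colExprVarargs: _*).show(false)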

How to create a Spark DF as below

I need to create a Scala Spark DF as below. This question may be silly, but I need to know the best approach to create small structures for testing purposes:
For creating a minimal DF.
For creating a minimal RDD.
I've tried the following code so far without success:
val rdd2 = sc.parallelize(Seq("7","8","9"))
and then creating the DF by
val dfSchema = Seq("col1", "col2", "col3")
and
rdd2.toDF(dfSchema: _*)
Here's a sample Dataframe I'd like to obtain:
c1 c2 c3
1 2 3
4 5 6
abc_spark, here's a sample you can use to easily create Dataframes and RDDs for testing:
import spark.implicits._

val df = Seq(
  (1, 2, 3),
  (4, 5, 6)
).toDF("c1", "c2", "c3")
df.show(false)
+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1 |2 |3 |
|4 |5 |6 |
+---+---+---+
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rdd: RDD[Row] = df.rdd
rdd.map { _.getAs[Int]("c2") }.foreach { println }
Gives
5
2
You are missing one pair of parentheses in the Seq: Seq("7","8","9") is a sequence of three Strings, whereas Seq(("7","8","9")) is a sequence containing one tuple with three fields. Use it as below:
scala> val df = sc.parallelize(Seq(("7","8","9"))).toDF("col1", "col2", "col3")
scala> df.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 7| 8| 9|
+----+----+----+
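If you specifically want to go through an RDD, as in the attempt in the question, here is a minimal sketch using an explicit schema. It assumes a SparkSession named spark and is just one way to do it:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Build an RDD of Rows plus an explicit schema, then create the DataFrame from them
val rowRdd = spark.sparkContext.parallelize(Seq(Row(1, 2, 3), Row(4, 5, 6)))

val schema = StructType(Seq(
  StructField("c1", IntegerType, nullable = false),
  StructField("c2", IntegerType, nullable = false),
  StructField("c3", IntegerType, nullable = false)
))

val dfFromRdd = spark.createDataFrame(rowRdd, schema)
dfFromRdd.show(false)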

spark 2.3.1 insertInto removes values of fields and changes them to null

I just upgraded my Spark cluster from 2.2.1 to 2.3.1 in order to use the feature of overwriting specific partitions (see link).
But ....
For some reason, when I test it I get very strange behavior; see the code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

case class MyRow(partitionField: Int, someId: Int, someText: String)

object ExampleForStack2 extends App {

  val sparkConf = new SparkConf()
  sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  sparkConf.setMaster(s"local[2]")

  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  val list1 = List(
    MyRow(1, 1, "someText"),
    MyRow(2, 2, "someText2")
  )

  val list2 = List(
    MyRow(1, 1, "someText modified"),
    MyRow(3, 3, "someText3")
  )

  val df = spark.createDataFrame(list1)
  val df2 = spark.createDataFrame(list2)

  df2.show(false)

  df.write.partitionBy("partitionField").option("path", "/tmp/tables/").saveAsTable("my_table")
  df2.write.mode(SaveMode.Overwrite).insertInto("my_table")

  spark.sql("select * from my_table").show(false)
}
And output:
+--------------+------+-----------------+
|partitionField|someId|someText |
+--------------+------+-----------------+
|1 |1 |someText modified|
|3 |3 |someText3 |
+--------------+------+-----------------+
+------+---------+--------------+
|someId|someText |partitionField|
+------+---------+--------------+
|2 |someText2|2 |
|1 |someText |1 |
|3 |3 |null |
|1 |1 |null |
+------+---------+--------------+
Why do I get those nulls?
It seems that the fields were moved, but why?
Thanks
OK, found it: insertInto is based on field position. See the documentation:
Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
| i| j|
+---+---+
| 5| 6|
| 3| 4|
| 1| 2|
+---+---+
Because it inserts data to an existing table, format or options will
be ignored.
Moreover, I am using dynamic partitioning, and the partition column should appear as the last field. So the solution is to move the partition column to the end of the dataframe, which in my case means:
df2.select("someId", "someText","partitionField").write.mode(SaveMode.Overwrite).insertInto("my_table")
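A slightly more general variant of the same fix is to derive the column order from the target table instead of hard-coding it. This is only a sketch, assuming my_table already exists and SaveMode is imported as in the snippet above:

import org.apache.spark.sql.functions.col

// Align df2's column order with the target table's schema before the position-based insertInto
val targetCols = spark.table("my_table").columns.map(col)

df2.select(targetCols: _*)
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("my_table")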

Pass Array[Seq[String]] to a UDF in Spark Scala

I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to do pattern matching on a dataframe column.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
// returns 2, as "abs,abc,dfg" exists in the 2nd and 3rd rows
Now I want to do this pattern matching for every row in column $text and add a new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a UDF passing the $text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert the column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => txt.count(_ == pattern))
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated.
You will have to know all the elements of the text column, which can be done using collect_list by grouping all the rows of your dataframe into one group. Then just check whether each element of the collected array contains the value in the text column, and count them, as in the following code.
import scala.collection.mutable
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")

val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
.withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
.withColumn("count", valsum($"text", $"array"))
.drop("grouping", "array")
.show(false)
You should get the following output:
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
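If collecting everything into a single window partition is a concern, a minimal alternative sketch (assuming the text column is small enough to collect to the driver, as in this example) is to pass the collected values into the UDF as a literal array with typedLit:

import org.apache.spark.sql.functions.{typedLit, udf}

// Collect the text values once on the driver and ship them as a constant array column.
// Assumes no null values in the text column, as in the sample data above.
val allTexts: Seq[String] = df.select("text").collect().map(_.getString(0)).toSeq

val countMatches = udf((pattern: String, texts: Seq[String]) => texts.count(_.contains(pattern)))

df.withColumn("count", countMatches($"text", typedLit(allTexts))).show(false)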
I hope this is helpful.

Spark Dataframe Random UUID changes after every transformation/action

I have a Spark dataframe with a column that includes a generated UUID.
However, each time I do an action or transformation on the dataframe, it changes the UUID at each stage.
How do I generate the UUID only once and have it remain static thereafter?
Some sample code to reproduce my issue is below:
import java.util.UUID
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

def process(spark: SparkSession): Unit = {
  import spark.implicits._
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  sc.setLogLevel("OFF")

  // create dataframe
  val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
  df.createOrReplaceTempView("df")
  df.show(false)

  // register a UDF that creates a random UUID
  val generateUUID = udf(() => UUID.randomUUID().toString)

  // generate UUID for new column
  val dfWithUuid = df.withColumn("new_uuid", generateUUID())
  dfWithUuid.show(false)
  dfWithUuid.show(false) // uuid is different

  // new transformations also change the uuid
  val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2") + 1)
  dfWithUuidWithNewCol.show(false)
}
The output is:
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |a414e73b-24b8-4f64-8d21-f0bc56d3d290|
|b |2 |f37935e5-0bfc-4863-b6dc-897662307e0a|
|c |3 |e3aaf655-5a48-45fb-8ab5-22f78cdeaf26|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |1c6597bf-f257-4e5f-be81-34a0efa0f6be|
|b |2 |6efe4453-29a8-4b7f-9fa1-7982d2670bd6|
|c |3 |2f7ddc1c-3e8c-4118-8e2c-8a6f526bee7e|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |00b85af8-711e-4b59-82e1-8d8e59d4c512|2.0 |
|b |2 |94c3f2c6-9234-4fb3-b1c4-273a37171131|3.0 |
|c |3 |1059fff2-b8f9-4cec-907d-ea181d5003a2|4.0 |
+----+----+------------------------------------+----+
Note that the UUID is different at each step.
This is expected behavior. User-defined functions have to be deterministic:
The user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be
invoked more times than it is present in the query.
If you want to include a non-deterministic function and preserve the output, you should write the intermediate data to persistent storage and read it back. Checkpointing or caching may work in some simple cases, but it won't be reliable in general.
If the upstream process is deterministic (for starters there is shuffle) you could try to use the rand function with a seed, convert it to a byte array and pass it to UUID.nameUUIDFromBytes.
See also: About how to add a new column to an existing DataFrame with random values in Scala
Note: SPARK-20586 introduced a deterministic flag, which can disable certain optimizations, but it is not clear how it behaves when data is persisted and an executor is lost.
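To make the UUID.nameUUIDFromBytes suggestion above concrete, here is a minimal sketch that derives a stable UUID from deterministic column values (hashing the row's columns rather than using a seeded rand; the column names follow the example df, and spark.implicits._ is not required since col is used):

import java.util.UUID
import org.apache.spark.sql.functions.{col, concat_ws, udf}

// The UDF itself is deterministic: the same input always yields the same UUID,
// so re-evaluation after optimizations or recomputation does not change the value.
// Note that rows with identical key values will receive identical UUIDs.
val stableUuid = udf((key: String) => UUID.nameUUIDFromBytes(key.getBytes("UTF-8")).toString)

val dfWithStableUuid = df.withColumn("new_uuid", stableUuid(concat_ws("|", col("col1"), col("col2"))))
dfWithStableUuid.show(false)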
This is a very old question, but I'm letting people know what worked for me; it might help someone.
You could use the expr function as below to generate unique UUIDs which do not change across transformations.
import org.apache.spark.sql.functions._
// create dataframe
val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)
// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", expr("uuid()"))
dfWithUuid.show(false)
dfWithUuid.show(false)
// new transformations
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2")+1)
dfWithUuidWithNewCol.show(false)
The output is as below:
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|2.0 |
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|3.0 |
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|4.0 |
+----+----+------------------------------------+----+
I have a pyspark version:
from pyspark.sql import functions as f
pdataDF = dataDF.withColumn("uuid_column", f.expr("uuid()"))
display(pdataDF)
pdataDF.write.mode("overwrite").saveAsTable("tempUuidCheck")
Try this one:
df.withColumn("XXXID", lit(java.util.UUID.randomUUID().toString))
It works differently from:
val generateUUID = udf(() => java.util.UUID.randomUUID().toString)
df.withColumn("XXXCID", generateUUID())
because the lit version calls randomUUID() once on the driver, so every row gets the same, stable value, whereas the UDF produces a new UUID per row on every evaluation.
I hope this helps.
Pawel