Spark Dataframe Random UUID changes after every transformation/action - scala

I have a Spark dataframe with a column that includes a generated UUID.
However, each time I do an action or transformation on the dataframe, it changes the UUID at each stage.
How do I generate the UUID only once and have the UUID remain static thereafter.
Some sample code to re-produce my issue is below:
def process(spark: SparkSession): Unit = {
import spark.implicits._
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
sc.setLogLevel("OFF")
// create dataframe
val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)
// register an UDF that creates a random UUID
val generateUUID = udf(() => UUID.randomUUID().toString)
// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", generateUUID())
dfWithUuid.show(false)
dfWithUuid.show(false) // uuid is different
// new transformations also change the uuid
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2")+1)
dfWithUuidWithNewCol.show(false)
}
The output is:
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |a414e73b-24b8-4f64-8d21-f0bc56d3d290|
|b |2 |f37935e5-0bfc-4863-b6dc-897662307e0a|
|c |3 |e3aaf655-5a48-45fb-8ab5-22f78cdeaf26|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |1c6597bf-f257-4e5f-be81-34a0efa0f6be|
|b |2 |6efe4453-29a8-4b7f-9fa1-7982d2670bd6|
|c |3 |2f7ddc1c-3e8c-4118-8e2c-8a6f526bee7e|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |00b85af8-711e-4b59-82e1-8d8e59d4c512|2.0 |
|b |2 |94c3f2c6-9234-4fb3-b1c4-273a37171131|3.0 |
|c |3 |1059fff2-b8f9-4cec-907d-ea181d5003a2|4.0 |
+----+----+------------------------------------+----+
Note that the UUID is different at each step.

It is an expected behavior. User defined functions have to be deterministic:
The user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be
invoked more times than it is present in the query.
If you want to include non-deterministic function and preserve the output you should write intermediate data to a persistent storage and read it back. Checkpointing or caching may work in some simple cases but it won't be reliable in general.
If upstream process is deterministic (for starters there is shuffle) you could try to use rand function with seed, convert to byte array and pass to UUID.nameUUIDFromBytes.
See also: About how to add a new column to an existing DataFrame with random values in Scala
Note: SPARK-20586 introduced deterministic flag, which can disable certain optimization, but it is not clear how it behaves when data is persisted and a loss of executor occurs.

it is very old question but letting the people know what worked for me. It might help someone.
You could use the expr function as below to generate unique GUIDs which does not change on transformations.
import org.apache.spark.sql.functions._
// create dataframe
val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)
// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", expr("uuid()"))
dfWithUuid.show(false)
dfWithUuid.show(false)
// new transformations
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2")+1)
dfWithUuidWithNewCol.show(false)
Output is as below :
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|2.0 |
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|3.0 |
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|4.0 |
+----+----+------------------------------------+----+

I have a pyspark version:
from pyspark.sql import functions as f
pdataDF=dataDF.withColumn("uuid_column",f.expr("uuid()"))
display(pdataDF)
pdataDF.write.mode("overwrite").saveAsTable("tempUuidCheck")

Try this one:
df.withColumn("XXXID", lit(java.util.UUID.randomUUID().toString))
it works different vs:
val generateUUID = udf(() => java.util.UUID.randomUUID().toString)
df.withColumn("XXXCID", generateUUID() )
I hope this helps.
Pawel

Related

Scala -- apply a custom if-then on a dataframe

I have this kind of dataset:
val cols = Seq("col_1","col_2")
val data = List(("a",1),
("b",1),
("a",2),
("c",3),
("a",3))
val df = spark.createDataFrame(data).toDF(cols:_*)
+-----+-----+
|col_1|col_2|
+-----+-----+
|a |1 |
|b |1 |
|a |2 |
|c |3 |
|a |3 |
+-----+-----+
I want to add an if-then column based on the existing columns.
df
.withColumn("col_new",
when(col("col_2").isin(2, 5), "str_1")
.when(col("col_2").isin(4, 6), "str_2")
.when(col("col_2").isin(1) && col("col_1").contains("a"), "str_3")
.when(col("col_2").isin(3) && col("col_1").contains("b"), "str_1")
.when(col("col_2").isin(1,2,3), "str_4")
.otherwise(lit("other")))
Instead of the list of when-then statements, I would prefer to apply a custom function. In Python I would run a lambda & map.
thank you!

How to efficiently map over DF and use combination of outputs?

Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2")
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method1", col("num1")/col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method2", col("num1")*col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method3", col("num1")+col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df))}. The end result is the DF I want -- with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is the more efficient way to do this?
Efficient way to do this is using select.
select is faster than the foldLeft if you have very huge data - Check this post
You can build required expressions & use that inside select, check below code.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
$"num1",
$"num2",
($"num1"/$"num2").as("method1"),
($"num1" * $"num2").as("method2"),
($"num1" + $"num2").as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return Column instead of DataFrame. Try using higher order functions, Your all three function can be replaced with below one function.
scala> def add(
num1:Column, // May be you can try to use variable args here if you want.
num2:Column,
f: (Column,Column) => Column
): Column = f(num1,num2)
For Example, varargs & while invoking this method you need to pass required columns at the end.
def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
Invoking add function.
scala> val colExpr = Seq(
$"num1",
$"num2",
add($"num1",$"num2",(_ / _)).as("method1"),
add($"num1", $"num2",(_ * _)).as("method2"),
add($"num1", $"num2",(_ + _)).as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How to get total counts of each words, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (e.g. no minDF or VocabSize limitations). In these cases you can use Summarizer to directly sum each Vector. Note: this requires Spark 2.3+ for Summarizer.
import org.apache.spark.ml.stat.Summarizer.metrics
// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
.select(metrics("normL1", "mean").summary($"features").as("summary"))
.select("summary.normL1", "summary.mean")
.as[(Vector, Vector)]
.first()
._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
You can simply explode and groupBy to get the count of each word
cvModel.transform(df).withColumn("words", explode($"words"))
.groupBy($"words")
.agg(count($"words").as("counts"))
.withColumn("id", row_number().over(Window.orderBy("words")) -1)
.show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a |3 |1 |
|b |3 |2 |
|c |2 |3 |
+-----+------+---+

Consider items of the same value when deciding rank

In spark, I would like to count how values are less or equal to other values. I tried to accomplish this via ranking but rank produces
[1,2,2,2,3,4] -> [1,2,2,2,5,6]
while what I would like is
[1,2,2,2,3,4] -> [1,4,4,4,5,6]
I can accomplish this by ranking, grouping by the rank and then modifying the rank value based on how many items are in the group. But this is kind of clunky and it's inefficient. Is there a better way to do this?
Edit: Added minimal example of what I'm trying to accomplish
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
object Question extends App {
val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()
import spark.implicits._
val win = Window.orderBy($"nums".asc)
Seq(1, 2, 2, 2, 3, 4)
.toDF("nums")
.select($"nums", rank.over(win).alias("rank"))
.as[(Int, Int)]
.groupByKey(_._2)
.mapGroups((rank, nums) => (rank, nums.toList.map(_._1)))
.map(x => (x._1 + x._2.length - 1, x._2))
.flatMap(x => x._2.map(num => (num, x._1)))
.toDF("nums", "rank")
.show(false)
}
Output:
+----+----+
|nums|rank|
+----+----+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+----+
Use window functions
scala> val df = Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]
scala> df.createOrReplaceTempView("tbl")
scala> spark.sql(" with tab1(select nums, rank() over(order by nums) rk, count(*) over(partition by nums) cn from tbl) select nums, rk+cn-1 as rk2 from tab1 ").show(false)
18/11/28 02:20:55 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+---+
scala>
Note that the df doesn't partition on any column, so spark complains of moving all data to single partition.
EDIT1:
scala> spark.sql(" select nums, rank() over(order by nums) + count(*) over(partition by nums) -1 as rk2 from tbl ").show
18/11/28 23:20:09 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
| 1| 1|
| 2| 4|
| 2| 4|
| 2| 4|
| 3| 5|
| 4| 6|
+----+---+
scala>
EDIT2:
The equivalent df version
scala> val df = Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> df.withColumn("rk2", rank().over(Window orderBy 'nums)+ count(lit(1)).over(Window.partitionBy('nums)) - 1 ).show(false)
2018-12-01 11:10:26 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+---+
scala>
So, a friend pointed out that if I just calculate the rank in descending order and then for each rank do (max_rank + 1) - current_rank. This is a much more efficient implementation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
object Question extends App {
val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()
import spark.implicits._
val win = Window.orderBy($"nums".desc)
val rankings = Seq(1, 2, 2, 2, 3, 4)
.toDF("nums")
.select($"nums", rank.over(win).alias("rank"))
.as[(Int, Int)]
val maxElement = rankings.select("rank").as[Int].reduce((a, b) => if (a > b) a else b)
rankings
.map(x => x.copy(_2 = maxElement - x._2 + 1))
.toDF("nums", "rank")
.orderBy("rank")
.show(false)
}
Output
+----+----+
|nums|rank|
+----+----+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+----+

append two dataframes and update data

Hello guys I want to update an old dataframe based on pos_id and article_id field.
If the tuple (pos_id,article_id) exist , I will add each column to the old one, if it doesn't exist I will add the new one. It worked fine. But I don't know how to deal with the case , when the dataframe is intially empty , in this case , I will add the new rows in the second dataframe to the old one. Here it is what I did
val histocaisse = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
val hist = histocaisse
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution , i found
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This case doesn't work when hist1 is empty .Any help please ?
Thanks a lot
Not sure if I understood correctly, but if the problem is sometimes the second dataframe is empty, and that makes the join crash, something you can try is this:
val checkHist1Empty = Try(hist1.first)
val df = checkHist1Empty match {
case Success(df) => {
hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
}
case Failure(e) => {
hist2.select($"pos_id", $"article_id",
coalesce(hist2("date")).alias("date"),
coalesce(hist2("qte"), lit(0)).alias("qte"),
coalesce(hist2("ca"), lit(0)).alias("ca"))
.orderBy("pos_id", "article_id")
}
}
This basically checks if the hist1 is empty before performing the join. In case it is empty it generates the df based on the same logic but applied only to the hist2 dataframe. In case it contains information it applies the logic you had, which you said that works.
instead of doing a join, why don't you do a union of the two dataframes and then groupBy (pos_id,article_id) and apply udf to each column sum for qte and ca.
val df3 = df1.unionAll(df2)
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))