append two dataframes and update data - scala

Hello guys, I want to update an old dataframe based on the pos_id and article_id fields.
If the tuple (pos_id, article_id) exists, I add each column's value to the old row; if it doesn't exist, I insert the new row. That worked fine. But I don't know how to handle the case where the old dataframe is initially empty; in that case I want to add the rows of the second dataframe to the old one. Here is what I did:
val histocaisse = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
val hist1 = histocaisse
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
hist1.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
The expected merged result is:
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution I found:
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This doesn't work when hist1 is empty. Any help please?
Thanks a lot

Not sure if I understood correctly, but if the problem is that one of the dataframes (hist1) is sometimes empty, and that makes the join fail, something you can try is this:
import scala.util.{Try, Success, Failure}
val checkHist1Empty = Try(hist1.first)
val df = checkHist1Empty match {
case Success(df) => {
hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
}
case Failure(e) => {
hist2.select($"pos_id", $"article_id",
coalesce(hist2("date")).alias("date"),
coalesce(hist2("qte"), lit(0)).alias("qte"),
coalesce(hist2("ca"), lit(0)).alias("ca"))
.orderBy("pos_id", "article_id")
}
}
This basically checks whether hist1 is empty before performing the join. If it is empty, it generates the df based on the same logic but applied only to the hist2 dataframe. If it contains data, it applies the logic you had, which you said works.
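As a side note, a minimal sketch of the same idea with a simpler guard, assuming Spark 2.4+ for Dataset.isEmpty (on older versions hist1.head(1).isEmpty does the same job):
val df = if (hist1.isEmpty) {
  // no history yet: take hist2 as-is, defaulting the numeric columns to 0
  hist2.select($"pos_id", $"article_id", $"date",
    coalesce($"qte", lit(0)).alias("qte"),
    coalesce($"ca", lit(0)).alias("ca"))
    .orderBy("pos_id", "article_id")
} else {
  // same join/merge logic as above
  hist2.join(hist1, Seq("article_id", "pos_id"), "left")
    .select($"pos_id", $"article_id",
      coalesce(hist2("date"), hist1("date")).alias("date"),
      (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
      (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
    .orderBy("pos_id", "article_id")
}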

Instead of doing a join, why don't you do a union of the two dataframes, then groupBy (pos_id, article_id) and aggregate each column: sum for qte and ca, max for date.
val df3 = df1.union(df2) // unionAll is deprecated since Spark 2.0
val df4 = df3.groupBy("pos_id", "article_id").agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))

Related

Align multiple dataframes in pyspark

I have these 4 spark dataframes:
order,device,count_1
101,201,2
102,202,4
order,device,count_2
101,201,10
103,203,100
order,device,count_3
104,204,111
103,203,10
order,device,count_4
101,201,4
104,204,11
I want to create a resultant dataframe as:
order,device,count_1,count_2,count_3,count_4
101,201,2,10,,4
102,202,4,,,
103,203,,100,10,
104,204,,,111,11
Is this a case of UNION or JOIN or APPEND? How to get the final resultant df?
You can think of UNION as combining tables by rows, so the number of rows will likely increase. JOIN combines tables by columns. I'm not sure what you mean by APPEND, but in this case, you would want JOIN.
Try:
val df1 = Seq((101,201,2), (102,202,4)).toDF("order" ,"device", "count_1")
val df2 = Seq((101,201,10), (103,203,100)).toDF("order" ,"device", "count_2")
val df3 = Seq((104,204,111), (103,203,10)).toDF("order" ,"device", "count_3")
val df4 = Seq((101,201,4), (104,204,11)).toDF("order" ,"device", "count_4")
val df12 = df1.join(df2, Seq("order", "device"),"fullouter")
df12.show(false)
val df123 = df12.join(df3, Seq("order", "device"),"fullouter")
df123.show(false)
val df1234 = df123.join(df4, Seq("order", "device"),"fullouter")
df1234.show(false)
returns:
+-----+------+-------+-------+-------+-------+
|order|device|count_1|count_2|count_3|count_4|
+-----+------+-------+-------+-------+-------+
|101 |201 |2 |10 |null |4 |
|102 |202 |4 |null |null |null |
|103 |203 |null |100 |10 |null |
|104 |204 |null |null |111 |11 |
+-----+------+-------+-------+-------+-------+
I did this in Scala; it should be easy to do in PySpark.
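If the number of dataframes grows beyond four, the chained joins above can be generalized with a fold; a quick sketch under the same column layout (dfs is just an illustrative name):
// fold all frames into one, repeatedly full-outer-joining on the shared key columns
val dfs = Seq(df1, df2, df3, df4)
val result = dfs.reduce((left, right) => left.join(right, Seq("order", "device"), "fullouter"))
result.orderBy("order", "device").show(false)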

how to get oldest month data using datetime column in spark dataframe(Scala)?

I have a spark dataframe which has two columns: one is id and the second is col_datetime, as you can see in the dataframe given below. How can I filter the dataframe based on col_datetime to get the oldest month's data? I want to achieve the result dynamically because I have 20-odd dataframes.
INPUT DF:-
import spark.implicits._
val data = Seq((1 , "2020-07-02 00:00:00.0"),(2 , "2020-08-02 00:00:00.0"),(3 , "2020-09-02 00:00:00.0"),(4 , "2020-10-02 00:00:00.0"),(5 , "2020-11-02 00:00:00.0"),(6 , "2020-12-02 00:00:00.0"),(7 , "2021-01-02 00:00:00.0"),(8 , "2021-02-02 00:00:00.0"),(9 , "2021-03-02 00:00:00.0"),(10, "2021-04-02 00:00:00.0"),(11, "2021-05-02 00:00:00.0"),(12, "2021-06-02 00:00:00.0"),(13, "2021-07-22 00:00:00.0"))
val dfFromData1 = data.toDF("ID","COL_DATETIME").withColumn("COL_DATETIME",to_timestamp(col("COL_DATETIME")))
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
OUTPUT:-
DF1 : - Oldest month data
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
+------+---------------------+
DF2:- latest months' data after removing the oldest month's data from the original DF.
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
logic/approach:-
step1:- calculate the minimum datetime of the col_datetime column for the given dataframe and assign it to a mindate variable.
Let's assume I will get
var mindate = "2020-07-02 00:00:00.0"
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
print(mindate)
result:-
mindate : org.apache.spark.sql.Row = [2020-07-02 00:00:00.0]
[2020-07-02 00:00:00.0]
Step2:- get the end date of the month using mindate. I haven't written the code for this part yet; assume it gives
val enddatemonth = "2020-07-31 00:00:00.0"
Step3:- now I can use the enddatemonth variable to filter the spark dataframe into DF1 and DF2 based on conditions.
Even when I tried to filter the dataframe based on mindate I got an error:
val DF1 = dfFromData1.where(col("COL_DATETIME") <= enddatemonth)
val DF2 = dfFromData1.where(col("COL_DATETIME") > enddatemonth)
Error : <console>:166: error: type mismatch;
found : org.apache.spark.sql.Row
required: org.apache.spark.sql.Column val DF1 = dfFromData1.where(col("COL_DATETIME" )<= mindate)
Thanks...!!
A similar approach, but I find it cleaner to just deal with months.
Idea: just like we have an epoch for seconds, compute one for months.
val dfWithEpochMonth = dfFromData1.
withColumn("year",year($"COL_DATETIME")).
withColumn("month",month($"COL_DATETIME")).
withColumn("epochMonth", (($"year" - 1970 - 1) * 12) + $"month")
Now your df will look like :
+---+-------------------+----+-----+----------+
| ID| COL_DATETIME|year|month|epochMonth|
+---+-------------------+----+-----+----------+
| 1|2020-07-02 00:00:00|2020| 7| 595|
| 2|2020-08-02 00:00:00|2020| 8| 596|
| 3|2020-09-02 00:00:00|2020| 9| 597|
| 4|2020-10-02 00:00:00|2020| 10| 598|
Now, you can calculate min epochMonth and filter directly.
val minEpochMonth = dfWithEpochMonth.select(min("epochMonth")).first().apply(0).toString().toInt
val df1 = dfWithEpochMonth.where($"epochMonth" <= minEpochMonth)
val df2 = dfWithEpochMonth.where($"epochMonth" > minEpochMonth)
You can drop unnecessary columns.
To address your error message :
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
val mindateString = mindate.apply(0).toString()
Now you can use mindateString to filter.
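Alternatively, if you would rather stay with dates instead of computing an epoch month, a sketch using the built-in trunc function to bucket each timestamp to the first day of its month (assuming Spark 2.x and the column names from the question):
import org.apache.spark.sql.functions._

// first day of the oldest month, as a java.sql.Date
val minMonth = dfFromData1
  .agg(trunc(min(col("COL_DATETIME")).cast("date"), "month").as("min_month"))
  .first()
  .getDate(0)

// rows that belong to the oldest month vs everything after it
val DF1 = dfFromData1.where(trunc(col("COL_DATETIME").cast("date"), "month") === lit(minMonth))
val DF2 = dfFromData1.where(trunc(col("COL_DATETIME").cast("date"), "month") > lit(minMonth))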

How to efficiently map over DF and use combination of outputs?

Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2")
// (each addCol below lives in a different class; shown together here for brevity)
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method1", col("num1")/col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method2", col("num1")*col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method3", col("num1")+col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want -- with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is the more efficient way to do this?
The efficient way to do this is using select.
select is faster than foldLeft if you have very large data - check this post.
You can build the required expressions and use them inside select; check the code below.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
$"num1",
$"num2",
($"num1"/$"num2").as("method1"),
($"num1" * $"num2").as("method2"),
($"num1" + $"num2").as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Try using higher-order functions; all three of your functions can be replaced with the single function below.
scala> def add(
num1:Column, // maybe you can use variable args here if you want
num2:Column,
f: (Column,Column) => Column
): Column = f(num1,num2)
For example, with varargs; when invoking this method you need to pass the required columns at the end.
def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
Invoking the add function:
scala> val colExpr = Seq(
$"num1",
$"num2",
add($"num1",$"num2",(_ / _)).as("method1"),
add($"num1", $"num2",(_ * _)).as("method2"),
add($"num1", $"num2",(_ + _)).as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
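For what it's worth, the varargs variant defined above would be invoked like this (a small sketch; sum_all is just an illustrative alias, reducing over as many columns as you pass):
scala> val totalExpr = add((_ + _), $"num1", $"num2", $"num2").as("sum_all")
scala> df.select($"num1", $"num2", totalExpr).show(false)
+----+----+-------+
|num1|num2|sum_all|
+----+----+-------+
|1   |2   |5      |
|2   |5   |12     |
|3   |7   |17     |
+----+----+-------+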

Spark scala join RDD between 2 datasets

Suppose I have two datasets as follows:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I want to join the two datasets above and count the number of times each person's name from dataset1 appears in dataset2. The result should be:
Bill, 4
Ramond, 2
..
..
I managed to join both of them together but I'm not sure how to count how many times each name appears.
Any suggestion would be appreciated.
Edited:
my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Using RDDs, achieving the solution you want would be complex. Not so much using dataframes.
The first step would be to read the two files you have into dataframes, as below
val df1 = sqlContext.read.format("com.databricks.spark.csv")
.option("header", true)
.load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
.option("header", true)
.load("dataset2")
so that you should have
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
A join, groupBy and count should give your desired output:
df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().show(false)
Final output would be
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+
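For completeness, if you do want to stay at the RDD level, one rough sketch (assuming the header lines have already been filtered out of rdd and rdd2) is to broadcast the small list of names and count matches in the messages:
// name column of dataset1 and message column of dataset2
val names = rdd.map(_.split(",")(1).trim)
val messages = rdd2.map(_.split(",")(1).trim)

// broadcast the (small) name list, then count how many messages mention each name
val namesBc = sc.broadcast(names.collect())
val counts = messages
  .flatMap(msg => namesBc.value.filter(name => msg.contains(name)))
  .map(name => (name, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)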

Spark Dataframe Random UUID changes after every transformation/action

I have a Spark dataframe with a column that includes a generated UUID.
However, each time I do an action or transformation on the dataframe, the UUID changes at each stage.
How do I generate the UUID only once and have it remain static thereafter?
Some sample code to reproduce my issue is below:
def process(spark: SparkSession): Unit = {
import java.util.UUID
import spark.implicits._
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
sc.setLogLevel("OFF")
// create dataframe
val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)
// register a UDF that creates a random UUID
val generateUUID = udf(() => UUID.randomUUID().toString)
// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", generateUUID())
dfWithUuid.show(false)
dfWithUuid.show(false) // uuid is different
// new transformations also change the uuid
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2")+1)
dfWithUuidWithNewCol.show(false)
}
The output is:
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |a414e73b-24b8-4f64-8d21-f0bc56d3d290|
|b |2 |f37935e5-0bfc-4863-b6dc-897662307e0a|
|c |3 |e3aaf655-5a48-45fb-8ab5-22f78cdeaf26|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |1c6597bf-f257-4e5f-be81-34a0efa0f6be|
|b |2 |6efe4453-29a8-4b7f-9fa1-7982d2670bd6|
|c |3 |2f7ddc1c-3e8c-4118-8e2c-8a6f526bee7e|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |00b85af8-711e-4b59-82e1-8d8e59d4c512|2.0 |
|b |2 |94c3f2c6-9234-4fb3-b1c4-273a37171131|3.0 |
|c |3 |1059fff2-b8f9-4cec-907d-ea181d5003a2|4.0 |
+----+----+------------------------------------+----+
Note that the UUID is different at each step.
This is expected behavior. User-defined functions have to be deterministic:
The user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be
invoked more times than it is present in the query.
If you want to include a non-deterministic function and preserve the output, you should write the intermediate data to persistent storage and read it back. Checkpointing or caching may work in some simple cases, but it won't be reliable in general.
If the upstream process is deterministic (for starters, assuming there is no shuffle) you could try to use the rand function with a seed, convert it to a byte array and pass it to UUID.nameUUIDFromBytes.
See also: About how to add a new column to an existing DataFrame with random values in Scala
Note: SPARK-20586 introduced a deterministic flag, which can disable certain optimizations, but it is not clear how it behaves when data is persisted and an executor is lost.
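A minimal sketch of that rand-with-seed idea (the UDF below is deterministic because it only hashes its input; the values stay stable only as long as the upstream plan and partitioning do):
import java.util.UUID
import java.nio.ByteBuffer
import org.apache.spark.sql.functions._

// deterministic UDF: the same input double always maps to the same name-based UUID
val uuidFromDouble = udf((d: Double) => {
  val bytes = ByteBuffer.allocate(java.lang.Double.BYTES).putDouble(d).array()
  UUID.nameUUIDFromBytes(bytes).toString
})

// rand with a fixed seed is reproducible for a deterministic, stable plan
val dfWithStableUuid = df.withColumn("new_uuid", uuidFromDouble(rand(42)))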
This is a very old question, but I'm letting people know what worked for me; it might help someone.
You can use the expr function as below to generate unique UUIDs which do not change across transformations.
import org.apache.spark.sql.functions._
// create dataframe
val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)
// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", expr("uuid()"))
dfWithUuid.show(false)
dfWithUuid.show(false)
// new transformations
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2")+1)
dfWithUuidWithNewCol.show(false)
Output is as below :
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|2.0 |
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|3.0 |
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|4.0 |
+----+----+------------------------------------+----+
I have a pyspark version:
from pyspark.sql import functions as f
pdataDF=dataDF.withColumn("uuid_column",f.expr("uuid()"))
display(pdataDF)
pdataDF.write.mode("overwrite").saveAsTable("tempUuidCheck")
Try this one:
df.withColumn("XXXID", lit(java.util.UUID.randomUUID().toString))
It works differently from:
val generateUUID = udf(() => java.util.UUID.randomUUID().toString)
df.withColumn("XXXCID", generateUUID() )
I hope this helps.
Pawel