Split the timestamp interval based on hours in spark - scala

Split the timestamp based on hours in spark
1,2019-04-01 04:00:21,12
1,2019-04-01 06:01:22,34
1,2019-04-01 09:21:23,10
1,2019-04-01 11:23:09,15
1,2019-04-01 12:02:10,15
1,2019-04-01 15:00:21,10
1,2019-04-01 18:00:22,10
1,2019-04-01 19:30:22,30
1,2019-04-01 20:22:30,30
1,2019-04-01 22:20:30,30
1,2019-04-01 23:59:00,10
Split the timestamp based on hours, by every 6 hours, into 4 parts in a day and sum the values.
Here I'm splitting like 0-6 AM, 6 AM-12 PM, etc. Expected output:
1,2019-04-01,12
1,2019-04-01,59
1,2019-04-01,25
1,2019-04-01,110
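(For reference, a minimal sketch of one way to express this bucketing; the names df, id, time and count are assumptions, matching the window-based answer further down.)
// Sketch only: bucket each row into one of four 6-hour slots per day and sum `count`.
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"id", to_date($"time").as("day"), floor(hour($"time") / 6).as("bucket"))
  .agg(sum($"count").as("count"))
  .orderBy("day", "bucket")
  .show(false)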

Try this-
Load the test data
import org.apache.spark.sql.functions._
import spark.implicits._

spark.conf.set("spark.sql.session.timeZone", "UTC")
val data =
"""
|c1,c2,c3
|1,2019-04-01 04:00:21,12
|1,2019-04-01 06:01:22,34
|1,2019-04-01 09:21:23,10
|1,2019-04-01 11:23:09,15
|1,2019-04-01 12:02:10,15
|1,2019-04-01 15:00:21,10
|1,2019-04-01 18:00:22,10
|1,2019-04-01 19:30:22,30
|1,2019-04-01 20:22:30,30
|1,2019-04-01 22:20:30,30
|1,2019-04-01 23:59:00,10
""".stripMargin
val stringDS2 = data.split(System.lineSeparator())
.map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df2 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS2)
df2.show(false)
df2.printSchema()
/**
* +---+-------------------+---+
* |c1 |c2 |c3 |
* +---+-------------------+---+
* |1 |2019-03-31 22:30:21|12 |
* |1 |2019-04-01 00:31:22|34 |
* |1 |2019-04-01 03:51:23|10 |
* |1 |2019-04-01 05:53:09|15 |
* |1 |2019-04-01 06:32:10|15 |
* |1 |2019-04-01 09:30:21|10 |
* |1 |2019-04-01 12:30:22|10 |
* |1 |2019-04-01 14:00:22|30 |
* |1 |2019-04-01 14:52:30|30 |
* |1 |2019-04-01 16:50:30|30 |
* |1 |2019-04-01 18:29:00|10 |
* +---+-------------------+---+
*
* root
* |-- c1: integer (nullable = true)
* |-- c2: timestamp (nullable = true)
* |-- c3: integer (nullable = true)
*/
Truncate the timestamp down to 6-hour buckets, then groupBy().sum:
val seconds = 21600 // 6 hrs
df2.withColumn("c2_long", expr(s"floor(cast(c2 as long) / $seconds) * $seconds"))
.groupBy("c1", "c2_long")
.agg(sum($"c3").as("c3"))
.withColumn("c2", to_date(to_timestamp($"c2_long")))
.withColumn("c2_time", to_timestamp($"c2_long"))
.orderBy("c2")
.show(false)
/**
* +---+----------+---+----------+-------------------+
* |c1 |c2_long |c3 |c2 |c2_time |
* +---+----------+---+----------+-------------------+
* |1 |1554055200|12 |2019-03-31|2019-03-31 18:00:00|
* |1 |1554120000|100|2019-04-01|2019-04-01 12:00:00|
* |1 |1554076800|59 |2019-04-01|2019-04-01 00:00:00|
* |1 |1554141600|10 |2019-04-01|2019-04-01 18:00:00|
* |1 |1554098400|25 |2019-04-01|2019-04-01 06:00:00|
* +---+----------+---+----------+-------------------+
*/
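To get exactly the (c1, c2, c3) shape asked for in the question, the same bucketed aggregation can drop the helper columns; a small follow-up sketch reusing df2 and seconds from above:
// Follow-up sketch: keep only the original column names after bucketing.
df2.withColumn("c2_long", expr(s"floor(cast(c2 as long) / $seconds) * $seconds"))
  .groupBy("c1", "c2_long")
  .agg(sum($"c3").as("c3"))
  .orderBy("c2_long")
  .select($"c1", to_date(to_timestamp($"c2_long")).as("c2"), $"c3")
  .show(false)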

SCALA: The approach from the answer I commented on works very well:
df.groupBy($"id", window($"time", "6 hours").as("time"))
.agg(sum("count").as("count"))
.orderBy("time.start")
.select($"id", to_date($"time.start").as("time"), $"count")
.show(false)
+---+----------+-----+
|id |time |count|
+---+----------+-----+
|1 |2019-04-01|12 |
|1 |2019-04-01|59 |
|1 |2019-04-01|25 |
|1 |2019-04-01|110 |
+---+----------+-----+
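If the 6-hour bucket boundaries themselves are needed rather than just the date, the same window column exposes them; a hedged variant of the query above:
// Variant sketch: keep the window start/end instead of collapsing to the date.
df.groupBy($"id", window($"time", "6 hours").as("w"))
  .agg(sum("count").as("count"))
  .orderBy("w.start")
  .select($"id", $"w.start".as("bucket_start"), $"w.end".as("bucket_end"), $"count")
  .show(false)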

Related

spark-scala: Transform the dataframe to generate new column gender and vice versa [closed]

Table1:
class male female
1 2 1
2 0 2
3 2 0
table2:
class gender
1 m
1 f
1 m
2 f
2 f
3 m
3 m
Using spark-scala, take the data from table1 and dump it into another table in the format of table2 as given. Also please do the vice versa (table2 back to table1).
Please help me with this, guys.
Thanks in advance.
You can use a udf and the explode function like below.
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq((1,2,1),(2,0,2),(3,2,0)).toDF("class","male","female")
//Input Df
+-----+----+------+
|class|male|female|
+-----+----+------+
| 1| 2| 1|
| 2| 0| 2|
| 3| 2| 0|
+-----+----+------+
val getGenderUdf = udf((male: Int, female: Int) => List.fill(male)("m") ++ List.fill(female)("f"))
val df1 = df.withColumn("gender", getGenderUdf(df.col("male"), df.col("female")))
  .drop("male", "female")
  .withColumn("gender", explode($"gender"))
df1.show()
+-----+------+
|class|gender|
+-----+------+
| 1| m|
| 1| m|
| 1| f|
| 2| f|
| 2| f|
| 3| m|
| 3| m|
+-----+------+
Reverse of df1
val df2 = df1.groupBy("class")
  .pivot("gender")
  .agg(count("gender"))
  .na.fill(0)
  .withColumnRenamed("m", "male")
  .withColumnRenamed("f", "female")
df2.show()
//Sample Output:
+-----+------+----+
|class|female|male|
+-----+------+----+
| 1| 1| 2|
| 3| 0| 2|
| 2| 2| 0|
+-----+------+----+
val inDF = Seq((1, 2, 1),
  (2, 0, 2),
  (3, 2, 0)).toDF("class", "male", "female")

val testUdf = udf((m: Int, f: Int) => {
  val ml = 1.to(m).map(_ => "m")
  val fml = 1.to(f).map(_ => "f")
  ml ++ fml
})

val df1 = inDF.withColumn("mf", testUdf('male, 'female))
  .drop("male", "female")
  .select('class, explode('mf).alias("gender"))
Perhaps this is helpful - without UDF
spark>=2.4
Load the test data provided
val data =
"""
|class | male | female
|1 | 2 | 1
|2 | 0 | 2
|3 | 2 | 0
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df1 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +-----+----+------+
* |class|male|female|
* +-----+----+------+
* |1 |2 |1 |
* |2 |0 |2 |
* |3 |2 |0 |
* +-----+----+------+
*
* root
* |-- class: integer (nullable = true)
* |-- male: integer (nullable = true)
* |-- female: integer (nullable = true)
*/
Compute the gender array and explode it:
val df2 = df1.select($"class",
  when($"male" >= 1, sequence(lit(1), col("male"))).otherwise(array()).as("male"),
  when($"female" >= 1, sequence(lit(1), col("female"))).otherwise(array()).as("female")
).withColumn("male", expr("TRANSFORM(male, x -> 'm')"))
  .withColumn("female", expr("TRANSFORM(female, x -> 'f')"))
  .withColumn("gender", explode(concat($"male", $"female")))
  .select("class", "gender")
df2.show(false)
/**
* +-----+------+
* |class|gender|
* +-----+------+
* |1 |m |
* |1 |m |
* |1 |f |
* |2 |f |
* |2 |f |
* |3 |m |
* |3 |m |
* +-----+------+
*/
Vice versa (rebuild table1 from the exploded df2):
df2.groupBy("class").agg(collect_list("gender").as("gender"))
.withColumn("male", expr("size(FILTER(gender, x -> x='m'))"))
.withColumn("female", expr("size(FILTER(gender, x -> x='f'))"))
.select("class", "male", "female")
.orderBy("class")
.show(false)
/**
* +-----+----+------+
* |class|male|female|
* +-----+----+------+
* |1 |2 |1 |
* |2 |0 |2 |
* |3 |2 |0 |
* +-----+----+------+
*/
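The vice-versa step can also be written without the FILTER higher-order function (so it works on Spark < 2.4 as well) by using conditional sums; a sketch assuming the same df2 with (class, gender):
// Alternative sketch: count genders with conditional sums instead of FILTER.
df2.groupBy("class")
  .agg(
    sum(when($"gender" === "m", 1).otherwise(0)).as("male"),
    sum(when($"gender" === "f", 1).otherwise(0)).as("female"))
  .orderBy("class")
  .show(false)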

How to transpose data in pyspark for multiple different columns

I am trying to transpose data in pyspark. I was able to transpose using a single column. However, with multiple columns I am not sure how to pass parameters to the explode function.
Input format:
Output format:
Can someone please give me a hint with an example or a reference? Thanks in advance.
Use stack to transpose as below (spark>=2.4) -
Load the test data
val data =
"""
|PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
| 1 | xyz | MS | abc | Phd | pqr | BS
| 2 | POR | MS | ABC | Phd | null | null
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df1 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |1 |xyz |MS |abc |Phd |pqr |BS |
* |2 |POR |MS |ABC |Phd |null |null |
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
*
* root
* |-- PersonId: integer (nullable = true)
* |-- Education1CollegeName: string (nullable = true)
* |-- Education1Degree: string (nullable = true)
* |-- Education2CollegeName: string (nullable = true)
* |-- Education2Degree: string (nullable = true)
* |-- Education3CollegeName: string (nullable = true)
* |-- Education3Degree: string (nullable = true)
*/
Un-pivot the table using stack
df1.selectExpr("PersonId",
"stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
"Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
.where("CollegeName is not null and EducationDegree is not null")
.show(false)
/**
* +--------+-----------+---------------+
* |PersonId|CollegeName|EducationDegree|
* +--------+-----------+---------------+
* |1 |xyz |MS |
* |1 |abc |Phd |
* |1 |pqr |BS |
* |2 |POR |MS |
* |2 |ABC |Phd |
* +--------+-----------+---------------+
*/
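Since the column names follow the Education<N>... pattern, the stack expression can also be built programmatically instead of being typed out; a sketch assuming the same df1:
// Sketch: generate the stack() arguments from the numbered column pairs.
val n = 3
val pairs = (1 to n).map(i => s"Education${i}CollegeName, Education${i}Degree").mkString(", ")
df1.selectExpr("PersonId", s"stack($n, $pairs) as (CollegeName, EducationDegree)")
  .where("CollegeName is not null and EducationDegree is not null")
  .show(false)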
A cleaned-up PySpark version of this:
from pyspark.sql import functions as F
df_a = spark.createDataFrame([(1,'xyz','MS','abc','Phd','pqr','BS'),(2,"POR","MS","ABC","Phd","","")],[
"id","Education1CollegeName","Education1Degree","Education2CollegeName","Education2Degree","Education3CollegeName","Education3Degree"])
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| 1| xyz| MS| abc| Phd| pqr| BS|
| 2| POR| MS| ABC| Phd| | |
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
Code -
df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree,Education2CollegeName, Education2Degree,Education3CollegeName, Education3Degree) as (B, C)")
+---+---+---+
| id| B| C|
+---+---+---+
| 1|xyz| MS|
| 1|abc|Phd|
| 1|pqr| BS|
| 2|POR| MS|
| 2|ABC|Phd|
| 2| | |
+---+---+---+

Finding most common non-null prefix per group in spark

I need to write a structured query that finds the most common non-null PREFIX (by number of occurrences) per UNIQUE_GUEST_ID.
There is input data:
val inputDf = Seq(
(1, "Mr"),
(1, "Mme"),
(1, "Mr"),
(1, null),
(1, null),
(1, null),
(2, "Mr"),
(3, null)).toDF("UNIQUE_GUEST_ID", "PREFIX")
println("Input:")
inputDf.show(false)
My solution was:
inputDf
.groupBy($"UNIQUE_GUEST_ID")
.agg(collect_list($"PREFIX").alias("PREFIX"))
But that is not what I need:
Expected:
+---------------+------+
|UNIQUE_GUEST_ID|PREFIX|
+---------------+------+
|1 |Mr |
|2 |Mr |
|3 |null |
+---------------+------+
Actual:
+---------------+-------------+
|UNIQUE_GUEST_ID|PREFIX |
+---------------+-------------+
|1 |[Mr, Mme, Mr]|
|3 |[] |
|2 |[Mr] |
+---------------+-------------+
Try this-
val inputDf = Seq(
(1, "Mr"),
(1, "Mme"),
(1, "Mr"),
(1, null),
(1, null),
(1, null),
(2, "Mr"),
(3, null)).toDF("UNIQUE_GUEST_ID", "PREFIX")
println("Input:")
inputDf.show(false)
/**
* Input:
* +---------------+------+
* |UNIQUE_GUEST_ID|PREFIX|
* +---------------+------+
* |1 |Mr |
* |1 |Mme |
* |1 |Mr |
* |1 |null |
* |1 |null |
* |1 |null |
* |2 |Mr |
* |3 |null |
* +---------------+------+
*/
inputDf
.groupBy($"UNIQUE_GUEST_ID", $"PREFIX").agg(count($"PREFIX").as("count"))
.groupBy($"UNIQUE_GUEST_ID")
.agg(max( struct( $"count", $"PREFIX")).as("max"))
.selectExpr("UNIQUE_GUEST_ID", "max.PREFIX")
.show(false)
/**
* +---------------+------+
* |UNIQUE_GUEST_ID|PREFIX|
* +---------------+------+
* |2 |Mr |
* |1 |Mr |
* |3 |null |
* +---------------+------+
*/
val df2 = inputDf.groupBy('UNIQUE_GUEST_ID, 'PREFIX).agg(count('PREFIX).as("ct"))
val df3 = df2.groupBy('UNIQUE_GUEST_ID).agg(max('ct).as("ct"))
df2.join(df3, Seq("ct", "UNIQUE_GUEST_ID")).show()
Output:
+---+---------------+------+
| ct|UNIQUE_GUEST_ID|PREFIX|
+---+---------------+------+
| 1| 2| Mr|
| 0| 3| null|
| 2| 1| Mr|
+---+---------------+------+
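A window-function variant of the same idea (count per prefix, then keep the top row per guest) avoids the self-join; a sketch:
// Sketch: rank prefix counts per guest and keep the most frequent one.
import org.apache.spark.sql.expressions.Window
val counted = inputDf.groupBy($"UNIQUE_GUEST_ID", $"PREFIX").agg(count($"PREFIX").as("ct"))
val w = Window.partitionBy($"UNIQUE_GUEST_ID").orderBy($"ct".desc)
counted.withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .select("UNIQUE_GUEST_ID", "PREFIX")
  .show(false)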

Get the values from nested structure dataframe in spark using scala

I have a dataframe with a nested structure (an array of structs):
StructField("Games", ArrayType(StructType(Array(
StructField("Team", StringType, true),
StructField("Amount", StringType, true),
StructField("Game", StringType, true)))), true),
For this I get values like below (Team, Amount, Game follow this order):
[[A,160,Chess], [B,100,Hockey], [C,1200,Football], [D,900,Cricket]]
[[E,700,Cricket], [F,1000,Chess]]
[[G,1900,Basketball], [I,1000,Cricket], [H,9000,Football]]
Now I have to get the values from this dataframe: for the first row,
if Game === 'Football' then TeamFootball = C and Amount = 1200, and
if Game === 'Cricket' then TeamCricket = D and Amount = 900.
I tried like this
.withColumn("TeamFootball", when($"Games.Game".getItem(2)==="Football",$"Games.Team".getItem(0).cast(StringType)).otherwise(lit("NA")))
.withColumn("TeamCricket", when($"Games.Game".getItem(2)==="Cricket", $"Games.Team".getItem(0).cast(StringType)).otherwise(lit("NA")))
.withColumn("TeamFootballAmount", when($"Games.Game".getItem(2)==="Football",$"Games.Amount".getItem(1).cast(StringType)).otherwise(lit("NA")))
.withColumn("TeamCricketAmount", when($"Games.Game".getItem(2)==="Cricket",$"Games.Amount".getItem(1).cast(StringType)).otherwise(lit("NA")))
I need all these columns in the same row, which is why I am not using explode.
I am unable to handle the array index here. Could you please help?
"Explode" and then "pivot" can help, please check "result" in output:
val data = List(
(1, "A", 160, "Chess"), (1, "B", 100, "Hockey"), (1, "C", 1200, "Football"), (1, "D", 900, "Cricket"),
(2, "E", 700, "Cricket"), (2, "F", 1000, "Chess"),
(3, "G", 1900, "Basketball"), (3, "I", 1000, "Cricket"), (3, "H", 9000, "Football")
)
val unstructured = data.toDF("id", "Team", "Amount", "Game")
unstructured.show(false)
val original = unstructured.groupBy("id").agg(collect_list(struct($"Team", $"Amount", $"Game")).alias("Games"))
println("--- Original ----")
original.printSchema()
original.show(false)
val exploded = original.withColumn("Games", explode($"Games")).select("id", "Games.*")
println("--- Exploded ----")
exploded.show(false)
println("--- Result ----")
exploded.groupBy("id").pivot("Game").agg(max($"Amount").alias("Amount"), max("Team").alias("Team")).orderBy("id").show(false)
Output is:
+---+----+------+----------+
|id |Team|Amount|Game |
+---+----+------+----------+
|1 |A |160 |Chess |
|1 |B |100 |Hockey |
|1 |C |1200 |Football |
|1 |D |900 |Cricket |
|2 |E |700 |Cricket |
|2 |F |1000 |Chess |
|3 |G |1900 |Basketball|
|3 |I |1000 |Cricket |
|3 |H |9000 |Football |
+---+----+------+----------+
--- Original ----
root
|-- id: integer (nullable = false)
|-- Games: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Team: string (nullable = true)
| | |-- Amount: integer (nullable = false)
| | |-- Game: string (nullable = true)
+---+-------------------------------------------------------------------+
|id |Games |
+---+-------------------------------------------------------------------+
|3 |[[G,1900,Basketball], [I,1000,Cricket], [H,9000,Football]] |
|1 |[[A,160,Chess], [B,100,Hockey], [C,1200,Football], [D,900,Cricket]]|
|2 |[[E,700,Cricket], [F,1000,Chess]] |
+---+-------------------------------------------------------------------+
--- Exploded ----
+---+----+------+----------+
|id |Team|Amount|Game |
+---+----+------+----------+
|3 |G |1900 |Basketball|
|3 |I |1000 |Cricket |
|3 |H |9000 |Football |
|1 |A |160 |Chess |
|1 |B |100 |Hockey |
|1 |C |1200 |Football |
|1 |D |900 |Cricket |
|2 |E |700 |Cricket |
|2 |F |1000 |Chess |
+---+----+------+----------+
--- Result ----
+---+-----------------+---------------+------------+----------+--------------+------------+---------------+-------------+-------------+-----------+
|id |Basketball_Amount|Basketball_Team|Chess_Amount|Chess_Team|Cricket_Amount|Cricket_Team|Football_Amount|Football_Team|Hockey_Amount|Hockey_Team|
+---+-----------------+---------------+------------+----------+--------------+------------+---------------+-------------+-------------+-----------+
|1 |null |null |160 |A |900 |D |1200 |C |100 |B |
|2 |null |null |1000 |F |700 |E |null |null |null |null |
|3 |1900 |G |null |null |1000 |I |9000 |H |null |null |
+---+-----------------+---------------+------------+----------+--------------+------------+---------------+-------------+-------------+-----------+
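If exploding really must be avoided, as the question asks, the matching struct can also be picked straight out of the array with the FILTER higher-order function (Spark >= 2.4); a sketch, assuming a dataframe df with the Games column described in the question:
// Sketch: pick the matching struct without exploding.
// element_at(..., 1) yields null when nothing matches, so the columns stay nullable.
df.withColumn("football", expr("element_at(filter(Games, g -> g.Game = 'Football'), 1)"))
  .withColumn("cricket", expr("element_at(filter(Games, g -> g.Game = 'Cricket'), 1)"))
  .select(
    $"football.Team".as("TeamFootball"), $"football.Amount".as("TeamFootballAmount"),
    $"cricket.Team".as("TeamCricket"), $"cricket.Amount".as("TeamCricketAmount"))
  .show(false)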

join dataframes and perform operation

Hello, I have a dataframe that is updated each day. Each day I need to add the new qte and the new ca to the old ones and update the date.
So I need to update the rows that already exist and add the new ones. Here is an example of what I would like to have at the end:
val histocaisse = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
val hist = histocaisse
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is what I did:
val histoCombinaison2=hist2.join(hist,Seq("article_id","pos_id"),"left")
.groupBy("article_id","pos_id").agg((hist2("qte")+hist("qte")) as ("qte"),(hist2("ca")+hist("ca")) as ("ca"),hist2("date"))
histoCombinaison2.show()
and I got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`qte`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:218)
// import functions
import org.apache.spark.sql.functions.{coalesce, lit}
// we might not need groupBy,
// since after join, all the information will be in the same row
// so instead of using aggregate function, we simply combine the related fields as a new column.
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
// df.show()
|pos_id|article_id| date| qte| ca|
+------+----------+----------+----+----+
| 1| 1|2000-01-08| 5.0| 7.0|
| 2| 2|2000-01-08|29.4|24.0|
| 3| 3|2000-01-08| 7.0| 2.4|
| 4| 4|2000-01-08| 3.5| 1.2|
| 5| 5|2000-01-08|14.5| 1.2|
| 6| 6|2000-01-08| 2.0|1.25|
+------+----------+----------+----+----+
Thanks.
As I mentioned in the comments, you should define your schema and use it when reading the csv into a dataframe, as below:
import sqlContext.implicits._
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("pos_id", LongType, true),
StructField("article_id", LongType, true),
StructField("date", DateType, true),
StructField("qte", LongType, true),
StructField("ca", DoubleType, true)
))
val hist1 = sqlContext.read
.format("csv")
.option("header", "true")
.schema(schema)
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
hist1.show
val hist2 = sqlContext.read
.format("csv")
.option("header", "true") //reading the headers
.schema(schema)
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
hist2.show
Then you should use the when function to define the logic you need to implement, as below:
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
when(hist2("date").isNotNull, hist2("date")).otherwise(when(hist1("date").isNotNull, hist1("date")).otherwise(lit(null))).alias("date"),
(when(hist2("qte").isNotNull, hist2("qte")).otherwise(lit(0)) + when(hist1("qte").isNotNull, hist1("qte")).otherwise(lit(0))).alias("qte"),
(when(hist2("ca").isNotNull, hist2("ca")).otherwise(lit(0)) + when(hist1("ca").isNotNull, hist1("ca")).otherwise(lit(0))).alias("ca"))
I hope the answer is helpful
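One extra consideration, not covered above: both snippets use a left join from hist2, which assumes every old (article_id, pos_id) also appears in the new file. If old rows can be missing from the new snapshot, a full outer join with the same coalesce logic keeps both sides; a sketch:
// Sketch: full outer join so rows present only in hist1 are not dropped.
import org.apache.spark.sql.functions.{coalesce, lit}
val merged = hist2.join(hist1, Seq("article_id", "pos_id"), "full_outer")
  .select($"pos_id", $"article_id",
    coalesce(hist2("date"), hist1("date")).alias("date"),
    (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
    (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
  .orderBy("pos_id", "article_id")
merged.show(false)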