Spark joining DataFrames and aggregation - scala

This is a somewhat contrived example, but it captures what I am trying to do using Spark/Scala.
Pet Types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val pets = Array(Row(1,"Cat"),Row(2,"Dog"))
val petsRDD = sc.parallelize(pets)
val petSchema = StructType(Array(StructField("id",IntegerType),StructField("type",StringType)))
val petsDF = sqlContext.createDataFrame(petsRDD,petSchema)
Pet Names
val petNames = Array(Row(1,1,"Tigger","M"),Row(2,1,"Winston","M"),Row(3,1,"Snowball","F"),Row(4,2,"Spot","M"),Row(5,2,"Barf","M"),Row(6,2,"Snoopy","M"))
val petNamesRDD = sc.parallelize(petNames)
val petNameSchema = StructType(Array(StructField("id",IntegerType),StructField("pet_id",IntegerType),StructField("name",StringType),StructField("gender",StringType)))
val petNamesDF = sqlContext.createDataFrame(petNamesRDD,petNameSchema)
From here I can join the DataFrames ...
val join = petsDF.join(petNamesDF, petsDF("id") === petNamesDF("pet_id"), "leftouter")
Results
+---+----+---+------+--------+------+
| id|type| id|pet_id|    name|gender|
+---+----+---+------+--------+------+
|  1| Cat|  1|     1|  Tigger|     M|
|  1| Cat|  2|     1| Winston|     M|
|  1| Cat|  3|     1|Snowball|     F|
|  2| Dog|  4|     2|    Spot|     M|
|  2| Dog|  5|     2|    Barf|     M|
|  2| Dog|  6|     2|  Snoopy|     M|
+---+----+---+------+--------+------+
I would like to flatten the results so they look something like this, so I can map them into something for more processing.
((1,"Cat"),(1,"Tigger","M"),(2,"Winston","M"),(3,"Snowball","F"))
((2,"Dog"),(1,"Spot","M"),(2,"Barf","M"),(3,"Snoopy","M"))
I started looking at UserDefinedAggregateFunctions but I could not really get it to work. I did not try that hard, but it seems like this was not a good fit.
I also looked at using a map to transform each petsDF row into a list of pet names, but nested DataFrames are not allowed.
I am hoping that I am missing something built into Spark or for an idea to get this to work. I am new to Spark/Scala.
Thanks
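One possible built-in route (a sketch only, and it assumes Spark 2.x, where collect_list accepts struct columns) is to group the joined DataFrame by the pet columns and gather the name columns into a list of structs:

import org.apache.spark.sql.functions.{collect_list, struct}

// Group the joined rows by pet and collect the name rows into a list of structs
val grouped = join
  .groupBy(petsDF("id"), petsDF("type"))
  .agg(collect_list(struct(petNamesDF("id"), petNamesDF("name"), petNamesDF("gender"))).as("names"))

grouped.show(false)

Each resulting row then carries the pet id and type plus a sequence of (id, name, gender) structs, which can be mapped into plain Scala tuples for further processing.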

Related

how to explode a spark dataframe

I exploded a nested schema but I am not getting what I want.
Before exploding, it looks like this:
df.show()
+----------+----------------------------------------------------------+
|CaseNumber|SourceId                                                  |
+----------+----------------------------------------------------------+
|         0|[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]|
|         1|[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]|
+----------+----------------------------------------------------------+
I want it to be like this:
+----------+---+----------+
|CaseNumber|Sku|ContractId|
+----------+---+----------+
|         0|  1|        22|
|         1|  3|        24|
+----------+---+----------+
Here is one way using the built-in get_json_object function:
import org.apache.spark.sql.functions.get_json_object
val df = Seq(
(0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
(1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")
df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
.withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
.drop("SourceId")
.show
// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// | 0| 1| 22|
// | 1| 3| 24|
// +----------+---+----------+
UPDATE
After our discussion we realised that the mentioned data is of array<struct<id:string,type:string>> type and not a simple string. Below is the solution for the new schema:
df.withColumn("sku", $"SourceIds".getItem(0).getField("id"))
.withColumn("ContractId", $"SourceIds".getItem(1).getField("id"))

How to aggregate sum on multi columns on Dataframe using reduce function and not groupby?

How can I aggregate a sum over multiple columns of a DataFrame using a reduce function and not groupBy? Since the groupBy sum is taking a lot of time, I am now thinking of using a reduce function. Any lead will be helpful.
Input:
| A | B | C | D |
| x | 1 | 2 | 3 |
| x | 2 | 3 | 4 |
CODE:
dataFrame.groupBy("A").sum()
Output:
| A | B | C | D |
| x | 3 | 5 | 7 |
You will have to convert the DataFrame to an RDD to perform the reduceByKey operation.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rows: RDD[Row] = df.rdd
Once you create your RDD you can use reduceByKey to add the values of multiple columns:
val input = sc.parallelize(List(("X",1,2,3),("X",2,3,4)))
val final_rdd = input.map { case (a, b, c, d) => (a, (b, c, d)) }
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))
spark.createDataFrame(final_rdd).toDF("M","N")
  .select($"M", $"N._1".as("X"), $"N._2".as("Y"), $"N._3".as("Z"))
  .show(10)
+---+---+---+---+
| M| X| Y| Z|
+---+---+---+---+
| X| 3| 5| 7|
+---+---+---+---+
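If the data already sits in a DataFrame like the input above, here is a sketch of the same idea starting from df.rdd (assuming, as in the sample, that column A is a string and B, C, D are integers):

// Key by column A, then sum columns B, C and D per key
val final_rdd = df.rdd
  .map(r => (r.getString(0), (r.getInt(1), r.getInt(2), r.getInt(3))))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))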

how to apply joins in spark scala when we have multiple values in the join column

I have data in two text files as
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
file 2 (diagnosis code, diagnosis description) at time T1
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
The data in file 2 is not fixed and keeps changing: at one point in time diagnosis code y can have the diagnosis description yen, and at another point in time it can have the diagnosis description ten. For example:
file 2 at time T2
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read the data from these two files in Spark and want only those patient ids who are diagnosed with uen.
It can be done using either Spark SQL or Scala.
I tried to read file1 in spark-shell. The two columns in file1 are pipe delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But as the mapping from diagnosis code to diagnosis description is not constant in file2, I will have to use a join condition. However, I don't know how to apply a join when the diag_cd column in file1 has multiple values.
Any help would be appreciated.
Please find the answer below
//Read the file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file1PATH")
//Read the file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file2path")
//get the patient id dataframe for the diag_desc as uen
file1DF.join(file2DF,file1DF.col("diag_cd").contains(file2DF.col("diag_cd")),"inner")
.filter(file2DF.col("diag_desc") === "uen")
.select("patient_id").show
Convert the table from format 1 to format 2 using the explode method.
Format1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> import org.apache.spark.sql.functions.{col, explode, split}
import org.apache.spark.sql.functions.{col, explode, split}
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how we are exploding the particular column above using:
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using:
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement.
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd")).filter(df2.col("diag_desc") === "uen").select(df1.col("patient_id")).collect()
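Since the question mentions that Spark SQL is also an option, here is a rough equivalent (a sketch, assuming Spark 2.x and that the exploded df1 and the df2 above are registered as temp views):

df1.createOrReplaceTempView("patient_diag")
df2.createOrReplaceTempView("diag_codes")

spark.sql("""
  SELECT DISTINCT p.patient_id
  FROM patient_diag p
  JOIN diag_codes c ON p.diag_cd = c.diag_cd
  WHERE c.diag_desc = 'uen'
""").show()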

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two dataframes, let's say df1 and df2, in Spark Scala.
df1 has two fields, 'ID' and 'Text', where 'Text' has some description (multiple words). I have already removed all special characters and numeric characters from the 'Text' field, leaving only alphabets and spaces.
df1 Sample
+---+----------------+
| ID|            Text|
+---+----------------+
|  1|helo how are you|
|  2|      hai haiden|
|  3|    hw are u uma|
+---+----------------+
df2 contains a list of words and corresponding replacement words
df2 Sample
+----+-------+
|Word|Replace|
+----+-------+
|helo|  hello|
| hai|     hi|
|  hw|    how|
|   u|    you|
+----+-------+
I would need to find all occurrences of the words in df2("Word") within df1("Text") and replace them with df2("Replace").
With the sample dataframes above, I would expect a resulting dataframe, df3, as given below.
df3 Sample
+---+-----------------+
| ID|             Text|
+---+-----------------+
|  1|hello how are you|
|  2|        hi haiden|
|  3|  how are you uma|
+---+-----------------+
Your help is greatly appreciated in doing the same in Spark using Scala.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following:
val keyVal = df2.map(r => (r(0).toString, r(1).toString)).collect.toMap
This will give you a Map to refer to:
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can use a UDF to create a function that will use the keyVal Map to replace values:
import org.apache.spark.sql.functions.udf
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
I will demonstrate only for the first id and assume that you cannot do a collect action on your df2. First you need to be sure that the schema of your dataframe has an array type for the text column of your df1:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with a schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replace dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select, coalescing the null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements.
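One way to make that ordering explicit (a sketch only, assuming Spark 2.1+ for posexplode and the ID/Text and Word/Replace schemas from the question) is to carry the word position through the join and sort on it before re-assembling the text:

import org.apache.spark.sql.functions._

// Explode Text into (position, word) pairs so the original order is not lost
val exploded = df1.select($"ID", posexplode(split($"Text", " ")).as(Seq("pos", "w0")))

val replaced = exploded
  .join(df2, $"w0" === $"Word", "left_outer")
  .select($"ID", $"pos", coalesce($"Replace", $"w0").as("w"))

// sort_array orders the (pos, w) structs by pos, restoring the word order before concatenating
replaced.groupBy("ID")
  .agg(sort_array(collect_list(struct($"pos", $"w"))).as("arr"))
  .select($"ID", concat_ws(" ", $"arr.w").as("Text"))
  .show(false)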
I suppose the code below should solve your problem.
I have solved this by using RDDs.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}

// Split each Text into words, keep the ID, and zip with an index to remember the word order
val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map(word => Row.fromTuple((row.getAs[Int]("ID"), word)))
}.zipWithIndex()
val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))
// Left-join against df2, fall back to the original word when there is no replacement,
// and rebuild each Text in its original word order
val opRdd = wordDf.join(df2, wordDf("word") === df2("Word"), "left_outer").drop(df2("Word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map { case (id, rows) =>
    val text = rows.toList.sortBy(_.getAs[Long]("index"))
      .map(row => if (row.getAs[String]("Replace") != null) row.getAs[String]("Replace") else row.getAs[String]("word"))
      .mkString(" ")
    Row.fromTuple((id, text))
  }
val opDF = sqlContext.createDataFrame(opRdd, StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))

How to perform merge operation on spark Dataframe?

I have Spark DataFrames mainDF and deltaDF, both with a matching schema.
Content of the mainDF is as follows:
id | name | age
1 | abc | 23
2 | xyz | 34
3 | pqr | 45
Content of deltaDF is as follows:
id | name | age
1 | lmn | 56
4 | efg | 37
I want to merge deltaDF with mainDF based on value of id. So if my id already exists in mainDF then the record should be updated and if id doesn't exist then the new record should be added. So the resulting data frame should be like this:
id | name | age
1 | lmn | 56
2 | xyz | 34
3 | pqr | 45
4 | efg | 37
This is my current code and it is working:
val updatedDF = mainDF.as("main").join(deltaDF.as("delta"), $"main.id" === $"delta.id", "inner").select($"main.id", $"main.name", $"main.age")
mainDF = mainDF.except(updatedDF).unionAll(deltaDF)
However, here I need to explicitly provide the list of columns again in the select function, which feels like overhead to me. Is there any other better/cleaner approach to achieve the same?
If you don't want to provide the list of columns explicitly, you can map over the original DF's columns, something like:
.select(mainDF.columns.map(c => $"main.$c" as c): _*)
BTW you can do this without a union after the join: you can use an outer join to get records that don't exist in both DFs, and then use coalesce to "choose" the non-null value, preferring deltaDF's values. So the complete solution would be something like:
val updatedDF = mainDF.as("main")
.join(deltaDF.as("delta"), $"main.id" === $"delta.id", "outer")
.select(mainDF.columns.map(c => coalesce($"delta.$c", $"main.$c") as c): _*)
updatedDF.show
// +---+----+---+
// | id|name|age|
// +---+----+---+
// | 1| lmn| 56|
// | 3| pqr| 45|
// | 4| efg| 37|
// | 2| xyz| 34|
// +---+----+---+
You can achieve this by using dropDuplicates and specifying on which column you don't want any duplicates.
Here's working code:
val a = (1,"lmn",56)::(2,"abc",23)::(3,"pqr",45)::Nil
val b = (1,"opq",12)::(5,"dfg",78)::Nil
val df1 = sc.parallelize(a).toDF
val df2 = sc.parallelize(b).toDF
df1.unionAll(df2).dropDuplicates("_1"::Nil).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1|lmn| 56|
| 2|abc| 23|
| 3|pqr| 45|
| 5|dfg| 78|
+---+---+---+
Another way of doing so: a PySpark implementation
from pyspark.sql.functions import col

updatedDF = mainDF.alias("main").join(deltaDF.alias("delta"), col("main.id") == col("delta.id"), "full_outer")
upsertDF = updatedDF.where("delta.id IS NOT NULL").select("delta.*")
unchangedDF = updatedDF.where("delta.id IS NULL").select("main.*")
finalDF = upsertDF.union(unchangedDF)