Constructing distinction matrix in Spark - scala

I am trying to construct a distinction matrix using Spark and am not sure how to do it optimally. I am new to Spark. Below is a small example of what I'm trying to do.
Example of distinction matrix construction:
Given Dataset D:
| id | a1 | a2 | a3 |
| 1 | yes | high | on |
| 2 | no | high | off |
| 3 | yes | low | off |
and my distinction table is
| id,id | a1 | a2 | a3 |
| 1,2 | 1 | 0 | 1 |
| 1,3 | 0 | 1 | 1 |
| 2,3 | 1 | 1 | 0 |
i.e. whenever an attribute ai helps distinguish a pair of tuples, the distinction table has a 1 for it, otherwise a 0.
My datasets are huge, so I am trying to do this in Spark. These are the approaches that came to mind:
1. Use a nested for loop to iterate over all members of the RDD (of the dataset).
2. Use the cartesian() transformation on the original RDD and iterate over all members of the resulting RDD to build the distinction table.
My questions are:
1. In the 1st approach, does Spark automatically optimize the nested for loop internally for parallel processing?
2. In the 2nd approach, cartesian() adds storage overhead for the intermediate RDD. Is there a way to avoid this overhead and still get the final distinction table?
3. Which of these approaches is better, and is there any other approach that constructs the distinction matrix efficiently (in both space and time)?

For this dataframe:
scala> val df = List((1, "yes", "high", "on" ), (2, "no", "high", "off"), (3, "yes", "low", "off") ).toDF("id", "a1", "a2", "a3")
df: org.apache.spark.sql.DataFrame = [id: int, a1: string ... 2 more fields]
| id| a1| a2| a3|
| 1|yes|high| on|
| 2| no|high|off|
| 3|yes| low|off|
We can build a cartesian product by using crossJoin with itself. However, the column names will be ambiguous (I don't really know how to easily deal with that). To prepare for that, let's create a second dataframe:
scala> val df2 = df.toDF("id_2", "a1_2", "a2_2", "a3_2")
df2: org.apache.spark.sql.DataFrame = [id_2: int, a1_2: string ... 2 more fields]
|id_2|a1_2|a2_2|a3_2|
| 1| yes|high| on|
| 2| no|high| off|
| 3| yes| low| off|
In this example we can get combinations by filtering using id < id_2.
scala> val xp = df.crossJoin(df2)
xp: org.apache.spark.sql.DataFrame = [id: int, a1: string ... 6 more fields]
| id| a1| a2| a3|id_2|a1_2|a2_2|a3_2|
| 1|yes|high| on| 1| yes|high| on|
| 1|yes|high| on| 2| no|high| off|
| 1|yes|high| on| 3| yes| low| off|
| 2| no|high|off| 1| yes|high| on|
| 2| no|high|off| 2| no|high| off|
| 2| no|high|off| 3| yes| low| off|
| 3|yes| low|off| 1| yes|high| on|
| 3|yes| low|off| 2| no|high| off|
| 3|yes| low|off| 3| yes| low| off|
scala> val filtered = xp.filter($"id" < $"id_2")
filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, a1: string ... 6 more fields]
| id| a1| a2| a3|id_2|a1_2|a2_2|a3_2|
| 1|yes|high| on| 2| no|high| off|
| 1|yes|high| on| 3| yes| low| off|
| 2| no|high|off| 3| yes| low| off|
At this point the problem is basically solved. To get the final table we can use a when().otherwise() statement on each column pair, or a UDF as I have done here:
scala> val dist = udf((a:String, b: String) => if (a != b) 1 else 0)
dist: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(StringType, StringType)))
scala> val distinction = filtered.select($"id", $"id_2", dist($"a1", $"a1_2").as("a1"), dist($"a2", $"a2_2").as("a2"), dist($"a3", $"a3_2").as("a3"))
distinction: org.apache.spark.sql.DataFrame = [id: int, id_2: int ... 3 more fields]
| id|id_2| a1| a2| a3|
| 1| 2| 1| 0| 1|
| 1| 3| 0| 1| 1|
| 2| 3| 1| 1| 0|
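The when().otherwise() alternative mentioned above can be written once for all attribute columns instead of once per pair. What follows is a minimal sketch, not the answer's own code: it assumes the key column is called id and that every other column is an attribute, and the names attrs, right and flags are made up for illustration. It can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local[*]").appName("distinction-sketch").getOrCreate()
import spark.implicits._

val df = List((1, "yes", "high", "on"), (2, "no", "high", "off"), (3, "yes", "low", "off"))
  .toDF("id", "a1", "a2", "a3")

// Assumption: every column except the key "id" is an attribute to compare.
val attrs = df.columns.filter(_ != "id")

// Copy of df with suffixed column names, so the self cross join is unambiguous.
val right = df.toDF(df.columns.map(_ + "_2"): _*)

// Keep each unordered pair of rows exactly once.
val pairs = df.crossJoin(right).filter(col("id") < col("id_2"))

// 1 when the two values differ, otherwise 0. Built-in expressions, no UDF.
val flags = attrs.map(a => when(col(a) =!= col(a + "_2"), 1).otherwise(0).as(a))

val distinction = pairs.select(Seq(col("id"), col("id_2")) ++ flags: _*)
distinction.orderBy("id", "id_2").show()
```

With =!=, a comparison involving a null evaluates to null, so such a pair ends up with 0; if nulls should count as a difference, the condition would need an explicit null check.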


Filter one data frame using other data frame in spark scala

I am going to demonstrate my question using following two data frames.
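val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")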
| ID| token|value|
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
| token1|val2|
| life |0.71|
|learning|0.75|
I want to filter the ID and value columns of datF1 based on token1 of dataF2. For each word in token1 of dataF2, if an ID has a matching token, the output should carry that token's value from datF1; otherwise the output should be zero.
In other words my desired output should be like this
| ID| val|val2|
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
Since learning is not present for ID 2, val is zero there. Similarly, since life is not present for ID 3, val2 is zero.
I did it manually as follows:
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2 = newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3 = newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
| ID| val|val2|
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
Instead of doing this manually, is there a more efficient way to do it by looking up the values of one data frame from within the other? In real-life situations there can be more than 2 words, so handling each word by hand quickly becomes impractical.
Thank you
When I use a leftsemi join, my output looks like this:
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
| ID| token|value|
| 1|learning| 0.69|
| 3|learning| 0.69|
I believe a left outer join and then pivoting on token can work here:
val ans = df1.join(df2, $"token" === $"token1", "LEFT_OUTER")
The result (without the null handling):
| ID|learning|life|
| 1| 0.69|0.69|
| 3| 0.69|0.0 |
| 2| 0.0 |0.69|
UPDATE: as the answer by Lamanus suggests, an inner join is possibly a better approach than an outer join + filter.
I think the inner join is enough. Btw, I found the typo in your test case, which makes the result wrong.
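val dataF1= Seq((1,"everlasting",1.39),
(1,"game", 2.7),
(1,"life",0.69),(1,"learning",0.69),(2,"living",1.38),(2,"worth",1.38),
(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")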
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
("learning",0.75)).toDF("token1","val2")
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
Grouping this by ID and pivoting on token then gives a result such as
| ID|learning|life|
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
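The step from the joined rows to the final table is not spelled out in the answers above. One way it could be done is sketched below, assuming the corrected test data (no trailing space in "life") and using groupBy, pivot and na.fill; the name result is illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first}

val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val dataF1 = Seq(
  (1, "everlasting", 1.39), (1, "game", 2.7), (1, "life", 0.69), (1, "learning", 0.69),
  (2, "living", 1.38), (2, "worth", 1.38), (2, "life", 0.69),
  (3, "learning", 0.69), (3, "never", 1.38)
).toDF("ID", "token", "value")
val dataF2 = Seq(("life", 0.71), ("learning", 0.75)).toDF("token1", "val2")

val result = dataF1
  .join(dataF2, col("token") === col("token1"), "inner") // keep only tokens listed in dataF2
  .groupBy("ID")
  .pivot("token")        // one output column per remaining token
  .agg(first("value"))   // each (ID, token) pair has a single value in this data
  .na.fill(0.0)          // an ID without that token gets 0.0
  .orderBy("ID")
result.show()
```

Two caveats: an ID that matches none of the tokens disappears with an inner join, and passing the token list explicitly, as in pivot("token", Seq("learning", "life")), saves Spark the extra pass it otherwise needs to discover the distinct values.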
Seems like you need "left semi-join". It will filter one dataframe, based on another one.
Try using it like
datF1.join(dataF2, $"token"===$"token1", "leftsemi")

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
Field1 Field2
       AA
12     BB
This command does not provide the expected result: df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1 Field2
Anonymous AA
12 BB
You can also try this.
It should handle blank, empty, and null values:
|Field1|Field2|
| | AA|
| 12| BB|
| 12| null|
df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)
|Field1 |Field2 |
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
An empty string is not null or NaN, so you'll have to use a case statement for that.
Fill seems to not work well when giving a text value into a numeric column.
Failing Null Replace with Fill / Text:
| f1| f2|
|null| AA|
| 12| BB|
scala> df.na.fill("Anonymous", Seq("f1")).show
| f1| f2|
|null| AA|
| 12| BB|
Working Example - Using Null With All Numbers:
| f1| f2|
|null| AA|
| 12| BB|
scala> df.na.fill(1, Seq("f1")).show
| f1| f2|
| 1| AA|
| 12| BB|
Failing Example (Empty String instead of Null):
| f1| f2|
| | AA|
| 12| BB|
scala> df.na.fill(1, Seq("f1")).show
| f1| f2|
| | AA|
| 12| BB|
Case Statement Fix Example:
| f1| f2|
| | AA|
| 12| BB|
scala> df.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
| f1| f2|
|Anonymous| AA|
| 12| BB|
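The case-statement fix can be extended to cover nulls and whitespace-only strings in one expression. This is only a sketch, assuming Field1 is a string column; the sample rows and the name cleaned are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim, when}

val spark = SparkSession.builder().master("local[*]").appName("empty-values-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: an empty string, a normal value, a null and a blank string.
val df = Seq[(String, String)](("", "AA"), ("12", "BB"), (null, "CC"), ("  ", "DD"))
  .toDF("Field1", "Field2")

// Treat null, "" and whitespace-only values the same way.
val cleaned = df.withColumn(
  "Field1",
  when(col("Field1").isNull || trim(col("Field1")) === "", "Anonymous")
    .otherwise(col("Field1"))
)
cleaned.orderBy("Field2").show()
```

Because the replacement is a string, this only makes sense for string columns; a numeric column would have to be cast to string first or given a numeric default.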
You can try the code below when your DataFrame has any number of columns.
Note: when writing to formats like Parquet, columns of null type are not supported, so such a column has to be cast to a concrete type.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
val df = Seq(
(1, ""),
(2, "Ram"),
(3, "Sam"),
(4, "")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
| ID|Name|NulType|
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
// Replace every empty string in the DataFrame with the string "null"
val colName = inputDf.columns // this gives you an Array[String] of column names
val data = inputDf.na.replace(colName.toSeq, Map("" -> "null"))
| ID|Name|NulType|
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|

Spark - How to apply rules defined in a dataframe to another dataframe

I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
| id | COUNTRY | MONTH |
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
And a dataframe B :
| COLUMN | VALUE | PRIO |
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
The idea is to apply the "rules" of dataframe B to dataframe A in order to get this result :
dataframe A' :
| id | COUNTRY | MONTH | PRIO |
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
I tried something like this:
dfB.collect.foreach( r => {
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
})
But I'm sure it's not the right way.
What is the right strategy for solving this problem in Spark?
Assuming the set of rules is reasonably small (the concerns are the size of the data and the size of the generated expression, which in the worst case can crash the planner), the simplest solution is to collect the rules into a local collection and map it to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
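val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
// collect the rules locally and turn each one into a when(...) expression;
// coalesce picks the first rule that matches, with 20 as the default
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)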
df.withColumn("PRIO", prio)
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
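With a larger set of rules you could:
1. Melt the data to convert it to a long format (melt here is a helper that unpivots the value columns into COLUMN/VALUE pairs; it is not part of the DataFrame API in older Spark versions):
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
2. Join with the rules by column and value.
3. Aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id").agg(min("PRIO").alias("PRIO"))
4. Outer join the output with df by id:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)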
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
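The join-based approach for larger rule sets can be sketched end to end as follows. The unpivot step here is an assumption swapped in for melt: it uses explode over an array of structs, which should work on Spark 2.x and later, and the names kv and result are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, min, struct}

val spark = SparkSession.builder().master("local[*]").appName("rules-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"), (5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
  ("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")

// Long format: one (id, COLUMN, VALUE) row per non-key column.
// Values are cast to string so all structs in the array share one type.
val kv = df.columns.tail.map(c => struct(lit(c).as("COLUMN"), col(c).cast("string").as("VALUE")))
val dfLong = df
  .select(col("id"), explode(array(kv: _*)).as("kv"))
  .select(col("id"), col("kv.COLUMN").as("COLUMN"), col("kv.VALUE").as("VALUE"))

// Match rules by column name and value, then keep the smallest priority per id.
val priorities = dfLong
  .join(rules, Seq("COLUMN", "VALUE"))
  .groupBy("id")
  .agg(min("PRIO").as("PRIO"))

// Rows without any matching rule get the default priority 20.
val result = df.join(priorities, Seq("id"), "leftouter").na.fill(20, Seq("PRIO"))
result.orderBy("id").show()
```

Unlike the coalesce version, the rules never leave the cluster here, so the size of the rule set does not affect the size of the query plan.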
Let's assume the set of rules in dataframe B is small.
I have created a dataframe "df" for the table below
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
By using UDF
val code = udf{(x:String,y:Int)=>if(x=="US") "5" else if (x=="FR") "15" else if (y==3) "2" else "20"}
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|

How to use NOT IN clause in filter condition in spark

I want to filter a column of an RDD source:
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val src = source_primary_key.subtractByKey(destination_primary_key)
I want to use an IN clause in a filter condition to keep only the rows of source whose values are present in src, something like below (EDITED):
val source = + "/source")","))
val destination = + "/destination")","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
equivalent SQL code is
Thank you
Since your code isn't reproducible, here is a small example using spark-sql on how to select * from t where id in (...) :
// create a DataFrame for a range 'id' from 1 to 9.
scala> val df = spark.range(1,10).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
// values to exclude
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
// select * from df where id is not in the values to exclude
scala> df.filter(!col("id").isin(f : _*)).show
| id|
| 1|
| 2|
| 3|
| 4|
| 8|
| 9|
// select * from df where id is in the values to exclude
scala> df.filter(col("id").isin(f : _*)).show
Here is the RDD version of the not isin :
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
scala> val rdd2 = rdd.filter(x => !f.contains(x))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at filter at <console>:28
Nevertheless, I still believe this is an overkill since you are already using spark-sql.
It seems in your case that you are actually dealing with DataFrames, thus the solutions mentioned above don't work.
You can use the left anti join approach :
scala> val source = spark.read.format("csv").load("source.file")
source: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> val destination = spark.read.format("csv").load("destination.file")
destination: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
| 1| Ravi kumar| Ravi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2|Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
| 1| Ravi kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi1 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi2 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2| Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam1| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair1|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
You'll just need to do the following :
scala> val res1 = source.join(destination, Seq("_c0"), "leftanti")
scala> val res2 = destination.join(source, Seq("_c0"), "leftanti")
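To make the effect of the left anti join easier to see, here is a small sketch on made-up data, assuming a key column called id. The view names and the SQL form are added for illustration and are not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("leftanti-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: id 1 exists only in source.
val source = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "payload")
val destination = Seq((2, "b"), (3, "x")).toDF("id", "payload")

// Rows of source whose id has no match in destination (NOT IN / NOT EXISTS semantics).
val onlyInSource = source.join(destination, Seq("id"), "leftanti")
onlyInSource.show()

// The same query through Spark SQL, using temporary views.
source.createOrReplaceTempView("source")
destination.createOrReplaceTempView("destination")
val onlyInSourceSql = spark.sql(
  "SELECT s.* FROM source s LEFT ANTI JOIN destination d ON s.id = d.id")
onlyInSourceSql.show()
```

Only the key is compared, so id 3 is dropped even though its payload differs between the two tables.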
You can try a NOT IN style filter, which lists all rows of df where Dept is not 30 or 20.
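The snippet for this answer is not present in the text, so the following is only a sketch of what such a filter might look like. It assumes a DataFrame with an integer Dept column; the sample rows and the names kept and keptSql are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("not-in-sketch").getOrCreate()
import spark.implicits._

// Illustrative data.
val df = Seq(("Ann", 10), ("Bob", 20), ("Cid", 30), ("Dee", 40)).toDF("Name", "Dept")

// Column API: negate isin to get NOT IN.
val kept = df.filter(!col("Dept").isin(20, 30))
kept.show()

// The same condition written as a SQL expression string.
val keptSql = df.filter("Dept NOT IN (20, 30)")
keptSql.show()
```

Both forms keep every column and drop only the rows whose Dept is 20 or 30.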
You can try something similar in Java,
ds = ds.filter(functions.not(functions.col(COLUMN_NAME).isin(exclusionSet.toArray())));
where exclusionSet is a set of objects that needs to be removed from your dataset.

Spark SQL DataFrame transformation involving partitioning and lagging

I want to transform a Spark SQL DataFrame like this:
animal value
cat 8
cat 5
cat 6
dog 2
dog 4
dog 3
rat 7
rat 4
rat 9
into a DataFrame like this:
animal value previous-value
cat 8 0
cat 5 8
cat 6 5
dog 2 0
dog 4 2
dog 3 4
rat 7 0
rat 4 7
rat 9 4
I sort of want to partition by animal, and then, for each animal, previous-value lags one row behind value (with a default value of 0), and then put the partitions back together again.
This can be accomplished using a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._
val df = sc.parallelize(Seq(("cat", 8, "01:00"),("cat", 5, "02:00"),("cat", 6, "03:00"),("dog", 2, "02:00"),("dog", 4, "04:00"),("dog", 3, "06:00"),("rat", 7, "01:00"),("rat", 4, "03:00"),("rat", 9, "05:00"))).toDF("animal", "value", "time")
|animal|value| time|
| cat| 8|01:00|
| cat| 5|02:00|
| cat| 6|03:00|
| dog| 2|02:00|
| dog| 4|04:00|
| dog| 3|06:00|
| rat| 7|01:00|
| rat| 4|03:00|
| rat| 9|05:00|
I've added a "time" field to illustrate orderBy.
val w1 = Window.partitionBy($"animal").orderBy($"time")
val previous_value = lag($"value", 1).over(w1)
val df1 = df.withColumn("previous", previous_value)
|animal|value| time|previous|
| dog| 2|02:00| null|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| null|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| null|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
If you want to replace nulls with 0:
val df2 = df1.na.fill(0)
|animal|value| time|previous|
| dog| 2|02:00| 0|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| 0|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| 0|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
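The null replacement can also be folded into the window expression, since lag accepts a default value as its third argument. A minimal sketch, reusing the same sample data and assuming the time column defines the order; the names w and result are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val spark = SparkSession.builder().master("local[*]").appName("lag-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("cat", 8, "01:00"), ("cat", 5, "02:00"), ("cat", 6, "03:00"),
  ("dog", 2, "02:00"), ("dog", 4, "04:00"), ("dog", 3, "06:00"),
  ("rat", 7, "01:00"), ("rat", 4, "03:00"), ("rat", 9, "05:00")
).toDF("animal", "value", "time")

// One window per animal, ordered by time.
val w = Window.partitionBy("animal").orderBy("time")

// Third argument of lag: the value to use when there is no previous row.
val result = df.withColumn("previous", lag(col("value"), 1, 0).over(w))
result.orderBy("animal", "time").show()
```

A window needs an explicit ordering for lag, which is why the original two-column data cannot be used as is: rows in a DataFrame have no guaranteed order.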
This piece of code would work:
val df = spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/foo.csv")
val df2 = df.groupBy("animal").agg(collect_list("value") as "listValue")
val desiredDF = df2.rdd.flatMap{row=>
val animal=row.getAs[String]("animal")
val valueList=row.getAs[Seq[String]]("listValue").toList
// pair each value with the one before it; "0" stands in for the first row
val newlist=valueList zip "0"::valueList
newlist.map(a=>(animal,a._1,a._2))
}.toDF("animal","value","previousValue")
On the Spark shell:
scala> val df=spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/foo.csv")
df: org.apache.spark.sql.DataFrame = [animal: string, value: string]
| cat| 8|
| cat| 5|
| cat| 6|
| dog| 2|
| dog| 4|
| dog| 3|
| rat| 7|
| rat| 4 |
| rat| 9|
scala> val df2=df.groupBy("animal").agg(collect_list("value") as "listValue")
df2: org.apache.spark.sql.DataFrame = [animal: string, listValue: array<string>]
|animal| listValue|
| rat|[7, 4 , 9]|
| dog| [2, 4, 3]|
| cat| [8, 5, 6]|
scala> val desiredDF=df2.rdd.flatMap{row=>
| val animal=row.getAs[String]("animal")
| val valueList=row.getAs[Seq[String]]("listValue").toList
| val newlist=valueList zip "0"::valueList
| newlist.map(a=>(animal,a._1,a._2))
| }.toDF("animal","value","previousValue")
desiredDF: org.apache.spark.sql.DataFrame = [animal: string, value: string ... 1 more field]
| rat| 7| 0|
| rat| 4 | 7|
| rat| 9| 4 |
| dog| 2| 0|
| dog| 4| 2|
| dog| 3| 4|
| cat| 8| 0|
| cat| 5| 8|
| cat| 6| 5|