How to do equality check of two DataFrames? - scala

I have the below scenario:
I have two DataFrames, each containing only one column.
Let's say
DF1 = (1, 2, 3, 4, 5)
DF2 = (3, 6, 7, 8, 9, 10)
Basically those values are keys, and I create a parquet file from DF1 only if all the keys of DF1 are present in DF2 (in the current example the check should return false). My current way of achieving this requirement is:
val df1count = DF1.count
val df2count = DF2.count
val diffDF = DF2.except(DF1)
val diffCount = diffDF.count
if (diffCount == (df2count - df1count)) true
else false
The problem with this approach is that I am calling actions four times, which is surely not the best way. Can someone suggest the most effective way of achieving this?

You can use intersect to get the values common to both DataFrames, and then check if it's empty:
DF1.intersect(DF2).take(1).isEmpty
That uses only one action (take(1)), and a fairly quick one at that.
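If the actual requirement is that every key of DF1 must also be present in DF2 (which is what the counting check in the question computes), a single-action sketch could use except instead, assuming distinct keys as in the example data:
// Rows of DF1 that do not appear in DF2; an empty result means all DF1 keys are in DF2.
val allDf1KeysInDf2 = DF1.except(DF2).take(1).isEmpty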

Here is a check whether Dataset first is equal to Dataset second: the union of the two one-sided differences is empty exactly when the contents match.
if (first.except(second).union(second.except(first)).count() == 0)
  println("first == second")
else
  println("first != second")
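A minimal sketch of the same symmetric-difference idea wrapped in a helper (assuming both sides have compatible schemas; take(1) lets Spark stop early instead of counting everything, and except is distinct-based, so duplicate rows are ignored):
import org.apache.spark.sql.DataFrame
// True when neither side contains rows that are missing from the other.
def sameContents(first: DataFrame, second: DataFrame): Boolean =
  first.except(second).union(second.except(first)).take(1).isEmpty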

Try an intersection combined with counts. This assures that the contents are the same and that the number of values in both is the same, and yields true in that case:
val intersectcount = DF1.intersect(DF2).count()
val check = (intersectcount == DF1.count()) && (intersectcount == DF2.count())
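As a quick sanity check with the sample data from the question (a spark-shell style sketch; assumes spark.implicits._ is in scope for toDF):
import spark.implicits._
val DF1 = Seq(1, 2, 3, 4, 5).toDF("key")
val DF2 = Seq(3, 6, 7, 8, 9, 10).toDF("key")
val intersectcount = DF1.intersect(DF2).count()   // 1, only the key 3 is shared
val check = (intersectcount == DF1.count()) && (intersectcount == DF2.count())   // false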

Related

spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file or not, I came up with the two approaches below.
Approach1:
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach2:
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one maps and then reduces, whereas the second one filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()
Benefit: the search can be stopped once one occurrence of the keyword has been found.
see also Spark: Efficient way to test if an RDD is empty
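Putting the suggestion into a small spark-shell style sketch (the path, keyword and sparkContext name are taken from the question):
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
// isEmpty only has to find a single element, so the scan can stop at the first matching line.
val wordFound = !rdd.filter(_.contains(keyword)).isEmpty()
println(if (wordFound) "Found" else "Not Found")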

Iterating through Seq[Row] till a particular condition is met using Scala

I need to iterate over a Scala Seq of Row type until a particular condition is met; I don't need to process anything after the condition.
I have a Seq[Row] r -> WrappedArray([1/1/2020,abc,1],[1/2/2020,pqr,1],[1/3/2020,stu,0],[1/4/2020,opq,1],[1/6/2020,lmn,0])
I want to iterate through this collection on r.getInt(2) until I encounter 0. As soon as I encounter 0, I need to break the iteration and collect r.getString(1) up to that point. I don't need to look at any other data after that.
My output should be: Array(abc,pqr,stu)
I am new to Scala programming. This Seq was actually a DataFrame. I know how to handle this using Spark DataFrames, but due to some restrictions put forth by my organization, window functions and the createDataFrame function are not available/working in our environment. Hence I have to resort to plain Scala to achieve the same.
All I could come up with was something like the below, but it's not really working:
breakable {
  for (i <- r) {
    var temp = i.getInt(3) === 0
    if (temp == true) {
      val x = i.getInt(2)
      break()
    }
  }
}
Can someone please help me here!
You can use the takeWhile method to grab the elements while their value is 1:
s.takeWhile(_.getInt(2) == 1).map(_.getString(1))
That will give you
List(abc, pqr)
So you still need the first element where the int value is 0, which you can get as follows:
s.find(_.getInt(2) == 0).map(_.getString(1)).get
Putting it all together (and handling possible nil values):
s.takeWhile(_.getInt(2) == 1).map(_.getString(1)) ++ s.find(_.getInt(2) == 0).map(r => List(r.getString(1))).getOrElse(Nil)
Result:
Seq[String] = List(abc, pqr, stu)
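A minimal alternative sketch using span, which splits the sequence at the first 0 in a single pass (assuming the same Row layout as in the question):
// ones: rows while the flag is 1; rest: everything from the first 0 onward
val (ones, rest) = s.span(_.getInt(2) == 1)
val names = ones.map(_.getString(1)) ++ rest.headOption.map(_.getString(1)).toList
// names: List(abc, pqr, stu)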

Using getOrElse to return nothing if option type is None (Scala)

I am using RDDs to create a left outer join, and so far I have the following results:
scala> LeftJoinedDataset.foreach(println)
(300000004,Trevor,Parr,Some((35 Jedburgh Road,PL23 6BA)))
(300000006,Ava,Coleman,None)
(200000008,Lisa,Knox,None)
(100000007,Dorothy,Thomson,None)
(400000002,Jasmine,Miller,Some((68 High Street,LE16 3PH)))
(300000009,Ruth,Campbell,None)
(100000005,Deirdre,Pullman,Some((63 Crown Street,SW99 2HY)))
(100000010,Dominic,Parr,None)
(100000001,Simon,Walsh,Some((99 Newgate Street,PA5 9UY)))
(100000003,Liam,Brown,Some((9 Earls Avenue,ML12 2DY)))
To remove the None and Some wrappers, I have so far used the getOrElse code below:
scala> val LeftJoinedDataset = LeftJoin.map(x=>(x._1,x._2._1._1,x._2._1._2,x._2._2.getOrElse(None)))
This prints out:
scala> LeftJoinedDataset.foreach(println)
(300000004,Trevor,Parr,(35 Jedburgh Road,PL23 6BA))
(300000006,Ava,Coleman,None)
(200000008,Lisa,Knox,None)
(100000007,Dorothy,Thomson,None)
(400000002,Jasmine,Miller,(68 High Street,LE16 3PH))
(300000009,Ruth,Campbell,None)
(100000005,Deirdre,Pullman,(63 Crown Street,SW99 2HY))
(100000010,Dominic,Parr,None)
(100000001,Simon,Walsh,(99 Newgate Street,PA5 9UY))
(100000003,Liam,Brown,(9 Earls Avenue,ML12 2DY))
Although the Some has gone, I still want to remove the None and return no data, e.g.
(300000006,Ava,Coleman) instead of (300000006,Ava,Coleman,None)
How can I do this?
Many Thanks
You can't have a different number of columns in different rows of the same dataset, so you'll have to either drop that column altogether, deal with the Option values, or fill them with something else (e.g. empty strings).
But just keeping an Option in that column seems like the best way: it shows the consumer that this data may be absent.
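If you would rather keep a fixed number of fields and replace the missing address with an empty string, here is a minimal sketch (the tuple layout is taken from the map above; the field names are made up for illustration):
val cleaned = LeftJoin.map { case (id, ((firstName, lastName), addressOpt)) =>
  // Fall back to an empty string when there is no joined address.
  val address = addressOpt.map { case (street, postcode) => s"$street, $postcode" }.getOrElse("")
  (id, firstName, lastName, address)
}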

Difference between these two count methods in Spark

I have been doing a count of "games" using spark-sql. The first way is like so:
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
val gamesDf = dataframe.
  groupBy($"hero_id", $"position", $"game_version", $"server").count().
  withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes dataframe just has the columns hero_id, position, game_version and server.
However games_count1 ends up being about 10, and games_count2 ends up being 50. Obviously these two counting methods are not equivalent or something else is going on, but I am trying to figure out: what is the reason for the difference between these?
I guess it is because in the first query you group by only 2 columns, and in the second by 4 columns. Therefore, you may have fewer distinct groups on just two columns.
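One way to check that guess directly is to compare how many distinct groups each version produces (a quick inspection sketch; the column names are taken from the question):
// Number of distinct (game_version, server) groups vs (hero_id, position, game_version, server) groups
println(dataframe.groupBy("game_version", "server").count().count())
println(dataframe.groupBy("hero_id", "position", "game_version", "server").count().count())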

Spark Iterating RDD over another RDD with filter conditions Scala

I want to iterate over one big RDD against a small RDD with some additional filter conditions. The code below is working fine, but the process runs only on the driver and is not spread across the nodes. So please suggest any other approach?
val cross = titlesRDD.cartesian(brRDD).cache()
val matching = cross.filter { case (x, br) =>
  (br._1 == "0") &&
  (br._2 == x._4) &&
  ((br._3 exists (x._5)) || br._3.head == "")
}
Thanks,
madhu
You probably don't want to cache cross. Not caching it will, I believe, let the cartesian product happen "on the fly" as needed for the filter, instead of instantiating the potentially large number of combinations resulting from the cartesian product in memory.
Also, you can do brRDD.filter(_._1 == "0") before doing the cartesian product with titlesRDD, e.g.
val cross = titlesRDD.cartesian(brRDD.filter(_._1 == "0"))
and then modify the filter used to create matching appropriately.
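In code, the suggested restructuring might look like this sketch (tuple positions and the predicate on br._3 are kept exactly as in the question):
// Filter the small RDD first so the cartesian product is built against far fewer rows,
// and skip cache() so the cross product is not materialised in memory.
val cross = titlesRDD.cartesian(brRDD.filter(_._1 == "0"))
val matching = cross.filter { case (x, br) =>
  (br._2 == x._4) && ((br._3 exists (x._5)) || br._3.head == "")
}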