To find out whether a given keyword exists in a huge text file, I came up with the two approaches below.
Approach 1:

def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach 2:

val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count()
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first approach maps and then reduces, whereas the second filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()
Benefit: the search can be stopped as soon as one occurrence of the keyword is found.
See also: Spark: Efficient way to test if an RDD is empty.
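Putting it together with the variables from Approach 2 (a minimal sketch; keyword and the file path are taken from the question):

val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()
println(if (wordFound) "Found" else "Not Found")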
I need to iterate over a Scala Seq of Row type until a particular condition is met; I don't need to process anything after the condition.
I have a Seq[Row] r -> WrappedArray([1/1/2020,abc,1],[1/2/2020,pqr,1],[1/3/2020,stu,0],[1/4/2020,opq,1],[1/6/2020,lmn,0])
I want to iterate through this collection on r.getInt(2) until I encounter 0. As soon as I encounter 0, I need to break the iteration and collect r.getString(1) for everything up to and including that row. I don't need to look at any data after that.
My output should be: Array(abc,pqr,stu)
I am new to Scala programming. This Seq was actually a DataFrame. I know how to handle this using Spark DataFrames, but due to some restrictions put forth by my organization, window functions and the createDataFrame function are not available/working in our environment. Hence I have to resort to plain Scala to achieve the same.
All I could come up with was something like the code below, but it's not really working!
import scala.util.control.Breaks._

breakable {
  for (i <- r) {
    var temp = i.getInt(2) == 0
    if (temp == true) {
      val value = i.getString(1)
      break()
    }
  }
}
Can someone please help me here!
You can use the takeWhile method to grab the elements while the value is 1:
s.takeWhile(_.getInt(2) == 1).map(_.getString(1))
That will give you:
List(abc, pqr)
So you still need to get the first element where the int value is 0, which you can do as follows:
s.find(_.getInt(2) == 0).map(_.getString(1)).get
Putting it all together (and handling the case where no 0 element exists):
s.takeWhile(_.getInt(2) == 1).map(_.getString(1)) ++ s.find(_.getInt(2) == 0).map(r => List(r.getString(1))).getOrElse(Nil)
Result:
Seq[String] = List(abc, pqr, stu)
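If you would rather avoid scanning the prefix twice, a span-based variant gives the same result (same s as above; purely an alternative sketch):

val (prefix, rest) = s.span(_.getInt(2) == 1)
val result = prefix.map(_.getString(1)) ++ rest.headOption.map(_.getString(1)).toList
// result: Seq[String] = List(abc, pqr, stu)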
I am using RDDs to create a left outer join, and so far I have the following results:
scala> LeftJoinedDataset.foreach(println)
(300000004,Trevor,Parr,Some((35 Jedburgh Road,PL23 6BA)))
(300000006,Ava,Coleman,None)
(200000008,Lisa,Knox,None)
(100000007,Dorothy,Thomson,None)
(400000002,Jasmine,Miller,Some((68 High Street,LE16 3PH)))
(300000009,Ruth,Campbell,None)
(100000005,Deirdre,Pullman,Some((63 Crown Street,SW99 2HY)))
(100000010,Dominic,Parr,None)
(100000001,Simon,Walsh,Some((99 Newgate Street,PA5 9UY)))
(100000003,Liam,Brown,Some((9 Earls Avenue,ML12 2DY)))
To remove the None and Some wrappers, I have so far used the getOrElse code below:
scala> val LeftJoinedDataset = LeftJoin.map(x=>(x._1,x._2._1._1,x._2._1._2,x._2._2.getOrElse(None)))
This prints out:
scala> LeftJoinedDataset.foreach(println)
(300000004,Trevor,Parr,(35 Jedburgh Road,PL23 6BA))
(300000006,Ava,Coleman,None)
(200000008,Lisa,Knox,None)
(100000007,Dorothy,Thomson,None)
(400000002,Jasmine,Miller,(68 High Street,LE16 3PH))
(300000009,Ruth,Campbell,None)
(100000005,Deirdre,Pullman,(63 Crown Street,SW99 2HY))
(100000010,Dominic,Parr,None)
(100000001,Simon,Walsh,(99 Newgate Street,PA5 9UY))
(100000003,Liam,Brown,(9 Earls Avenue,ML12 2DY))
Although the Some has gone, I still want to remove the None and return no data instead, e.g.
(300000006,Ava,Coleman) instead of (300000006,Ava,Coleman,None)
How can I do this?
Many Thanks
You can't have a different number of columns in different rows of the same dataset, so you'll have to either drop that column altogether, deal with the Option values, or fill them with something else (e.g. empty strings).
But just having an Option in that column seems like the best way: it shows the consumer that this data may be absent.
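For example, if you do want a fixed number of columns with empty-string placeholders, here is a minimal sketch (assuming LeftJoin has the shape (id, ((firstName, lastName), Option[(street, postcode)])) implied by the map above):

val flattened = LeftJoin.map { case (id, ((first, last), addressOpt)) =>
  // fill a missing address with empty strings so every row keeps the same arity
  (id, first, last, addressOpt.getOrElse(("", "")))
}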
I have been doing a count of "games" using spark-sql. The first way is like so:
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
val gamesDf = dataframe.
groupBy($"hero_id", $"position", $"game_version", $"server").count().
withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes dataframe just has the columns hero_id, position, game_version and server.
However games_count1 ends up being about 10, and games_count2 ends up being 50. Obviously these two counting methods are not equivalent or something else is going on, but I am trying to figure out: what is the reason for the difference between these?
I guess it's because in the first query you group by only two columns and in the second by four. Therefore, you may have fewer distinct groups on just two columns.
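A toy illustration of that point (column names come from the question, the data is made up, and a SparkSession named spark is assumed):

import spark.implicits._

val df = Seq(
  (10, "mid", 1, 1),
  (20, "top", 1, 1),
  (20, "mid", 1, 1)
).toDF("hero_id", "position", "game_version", "server")

// one coarse group with count = 3
df.groupBy("game_version", "server").count().show()
// three fine-grained groups with count = 1 each
df.groupBy("hero_id", "position", "game_version", "server").count().show()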
I want to iterate over one BIG RDD together with a small RDD, applying some additional filter conditions. The code below is working fine, but the process runs only on the driver and is not spread across the nodes. Could you please suggest another approach?
val cross = titlesRDD.cartesian(brRDD).cache()
val matching = cross.filter { case (x, br) =>
  (br._1 == "0") &&
  (br._2 == x._4) &&
  ((br._3 exists x._5) || br._3.head == "")
}
Thanks,
madhu
You probably don't want to cache cross. Not caching it will, I believe, let the cartesian product happen "on the fly" as needed for the filter, instead of instantiating the potentially large number of combinations resulting from the cartesian product in memory.
Also, you can do brRDD.filter(_._1 == "0") before doing the cartesian product with titlesRDD, e.g.
val cross = titlesRDD.cartesian(brRDD.filter(_._1 == "0"))
and then modify the filter used to create matching appropriately.
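A minimal sketch of the rewritten pipeline (the remaining predicate is carried over from the question as-is; what br._3 and x._5 actually contain is an assumption here):

val cross = titlesRDD.cartesian(brRDD.filter(_._1 == "0"))
val matching = cross.filter { case (x, br) =>
  // br._1 == "0" is already guaranteed by the pre-filter above
  (br._2 == x._4) &&
  ((br._3 exists x._5) || br._3.head == "")
}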