How to get the number of lines from an RDD that contain any digits - scala

The lines of the document are as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
Two lines of the document contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Scala Spark.
val lineswithnum=linesRdd.filter(line => (line.contains([^0-9]))).count()
I expect the output to be 2, but I am getting 0.

You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()

In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* is added so the pattern can match a digit anywhere within the line, since matches requires the whole string to match the regex.
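Putting both suggestions together, a minimal runnable sketch, assuming a SparkContext sc and the sample document saved at /FileStore/tables/so99.txt (the path used in the answer above):
val linesRdd = sc.textFile("/FileStore/tables/so99.txt")
// Test each character of the line for a digit ...
val countWithExists = linesRdd.filter(line => line.exists(_.isDigit)).count()
// ... or match the whole line against a regex that allows any prefix and suffix.
val countWithRegex = linesRdd.filter(_.matches(".*[0-9].*")).count()
// Both should return 2 for the sample document above.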

Related

spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file or not, I came up with the two approaches below.
Approach 1:
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0
lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach 2:
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one maps and then reduces, whereas the second one filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line=>line.contains(keyword)).isEmpty()
Benefit: the search can be stopped once one occurrence of the keyword has been found.
See also Spark: Efficient way to test if an RDD is empty
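As a minimal sketch of the suggested approach, assuming a SparkContext sc and the file path from the question:
val keyword = "my_keyword"
val rdd = sc.textFile("test_file.txt")
// isEmpty only needs to find one element, so Spark can stop scanning partitions
// as soon as a line containing the keyword is found.
val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()
println(if (wordFound) "Found" else "Not Found")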

Filtering data from RDD using value sequence spark

I need the help on the below use case:
Question 1: My RDD is in the format below. From this RDD, I want to exclude the rows where airport.code is in ("PUN","HAR","KAS").
case class airport(code:String,city:String,airportname:String)
val airportRdd=sparkSession.sparkContext.textFile("src/main/resources/airport_data.csv").
map(x=>x.split(","))
val airPortRddTransformed=airportRdd.map(x=>airport(x(0),x(1),x(2)))
val trasnformedRdd=airPortRddTransformed.filter(air=>!(air.code.contains(seqValues:_*)))
But ! is not working. It says "cannot resolve symbol !". Can someone please help me? How do I negate this in an RDD? I have to use the RDD approach only.
Also another question:
Question 2: The data file has 70 columns. I have a sequence of column names:
val seqColumns=List("lat","longi","height","country")
I want to exclude these columns while loading the RDD. How can I do it? My production RDD has 70 columns, and I only know the names of the columns to exclude, not the index of every column. Again, I am looking for an RDD approach; I am aware of how to do it with the DataFrame approach.
Question 1
Use a broadcast variable to pass the list of values to the filter function. It seems _* in filter is not working, so I changed the condition to !seqValues.value.contains(air.code).
Data sample: airport_data.csv
C001,Pune,Pune Airport
C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport
C003,New York,New York Airport
C004,Delhi,Delhi Airport
Code snippet
case class airport(code:String,city:String,airportname:String)
val seqValues=spark.sparkContext.broadcast(List("C001","C003"))
val airportRdd = spark.sparkContext.textFile("D:\\DataAnalysis\\airport_data.csv").map(x=>x.split(","))
val airPortRddTransformed = airportRdd.map(x=>airport(x(0),x(1),x(2)))
//airPortRddTransformed.foreach(println)
val trasnformedRdd = airPortRddTransformed.filter(air => !seqValues.value.contains(air.code))
trasnformedRdd.foreach(println)
Output ->
airport(C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport)
airport(C004,Delhi,Delhi Airport)
Things I would change:
1- You are reading a .csv as a text file and then splitting the lines on ','. You can save this step by reading the file like:
val df = spark.read.csv("src/main/resources/airport_data.csv")
2- Reverse the contains check, so you ask whether the sequence contains the code rather than the other way around:
val trasnformedRdd = airPortRddTransformed.filter(air => !(seqValues.contains(air.code)))
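Putting both suggestions together, a hedged sketch of the DataFrame route (the column names are assumed from the case class; the RDD-only requirement is still covered by the broadcast approach above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("airports").getOrCreate()
import spark.implicits._

val seqValues = List("C001", "C003")
// Read the CSV directly and name the columns after the case class fields.
val df = spark.read.csv("src/main/resources/airport_data.csv")
  .toDF("code", "city", "airportname")
// Keep only the rows whose code is NOT in the exclusion list.
val filtered = df.filter(!$"code".isin(seqValues: _*))
filtered.show(false)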

Difference between these two count methods in Spark

I have been doing a count of "games" using spark-sql. The first way is like so:
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
val gamesDf = dataframe.
groupBy($"hero_id", $"position", $"game_version", $"server").count().
withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes dataframe just has the columns hero_id, position, game_version and server.
However games_count1 ends up being about 10, and games_count2 ends up being 50. Obviously these two counting methods are not equivalent or something else is going on, but I am trying to figure out: what is the reason for the difference between these?
I guess it is because in the first query you group by only two columns and in the second by four columns. Therefore, you may have fewer distinct groups when grouping on just two columns.
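For example, a tiny sketch with made-up rows (column names taken from the question, assuming a SparkSession spark) shows how the number of groups differs:
import spark.implicits._

val dataframe = Seq(
  (1, 1, 1, 1),  // hero_id, position, game_version, server
  (2, 1, 1, 1),
  (3, 2, 1, 1)
).toDF("hero_id", "position", "game_version", "server")

// One group (game_version=1, server=1) with count = 3 ...
dataframe.groupBy("game_version", "server").count().show()
// ... versus three groups, each with count = 1.
dataframe.groupBy($"hero_id", $"position", $"game_version", $"server").count().show()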

Iterate through mixed-Type scala Lists

Using Spark 2.1.1, I have an N-row CSV as 'fileInput':
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now, I have a mixed-type ListBuffer[Any] containing lists of different types.
How do I iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take the N lists generated by the inputFile's definitions, then save them to a csv file. Final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colnames's, of any 'datatype' (I have scripts for that), with each type appearing 1::n times and any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' and 'end', but those columns are not relevant to this question.
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list-of-lists instead of a list of Tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
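A hedged end-to-end sketch on plain Scala collections, using shortened versions of the two generated lists and java.nio purely for illustration:
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val list: List[List[Any]] = List(
  List(0.24455154, 0.108798146, 0.111522496),
  List(1, 9, 3)
)
val header = "ColA,ColB"
val rows: Seq[String] = list.transpose.map(_.mkString(","))
// Write the header plus one comma-separated line per transposed "row".
Files.write(Paths.get("output.csv"), (header +: rows).asJava)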
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can do what you want.
Please refer to the official documentation for more information. Hope this helps.
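For completeness, a short sketch of what those methods return, assuming a SparkContext sc:
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.zipWithIndex().collect()     // Array((a,0), (b,1), (c,2))
rdd.zipWithUniqueId().collect()  // pairs each element with a unique (not necessarily consecutive) id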

Key/Value pair RDD

I have a question on key/value pair RDD.
I have five files in the C:/download/input folder whose contents are the dialogs of the films, as follows:
movie_horror_Conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_Interstellar.txt
movie_horror_evildead.txt
I am trying to read the files in the input folder using sc.wholeTextFiles(), where I get key/value pairs as follows:
(C:/download/input/movie_horror_Conjuring.txt,values)
I am trying to do an operation where I have to group the input files of each genre together using groupByKey(): the values of all the horror movies together, the comedy movies together, and so on.
Is there any way I can generate the key/value pairs as (horror, values) instead of (C:/download/input/movie_horror_Conjuring.txt,values)?
val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))
The above code gives me the following output:
(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_comedy_eurotrip.txt,values)
(C:/download/input/movie_horror_insidious.txt,values)
(C:/download/input/movie_sci-fi_Interstellar.txt,values)
(C:/download/input/movie_horror_evildead.txt,values)
whereas I need the output as follows:
(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))
I also tried some map and split operations to remove the folder path from the key and keep only the file name, but I'm not able to append the corresponding values to the files.
Also, I would like to know how I can get the line counts of values1, values2, values3, etc.
My final output should be like
(horror, 100)
where 100 is the sum of the line counts: values1 = 40 lines, values2 = 30 lines and values3 = 30 lines, and so on.
Try this:
val output = ipfile.map{case (k, v) => (k.split("_")(1),v)}.groupByKey()
output.collect
Let me know if this works for you!
Update:
To get output in the format of (horror, 100):
val output = ipfile.map{case (k, v) => (k.split("_")(1),v.count(_ == '\n'))}.reduceByKey(_ + _)
output.collect
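Note that counting '\n' characters can undercount by one when a file's last line has no trailing newline; a sketch of an alternative under the same key layout:
val output = ipfile.map{case (k, v) => (k.split("_")(1), v.split("\n").length)}.reduceByKey(_ + _)
output.collect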