Key/Value pair RDD - scala

I have a question about key/value pair RDDs.
I have five files in the C:/download/input folder, each containing film dialogue as its content:
movie_horror_Conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_Interstellar.txt
movie_horror_evildead.txt
I am trying to read the files in the input folder using sc.wholeTextFiles(), which gives me key/value pairs like:
(C:/download/input/movie_horror_Conjuring.txt,values)
I am trying to group the input files of each genre together using groupByKey(), so that the values of all the horror movies end up together, all the comedy movies together, and so on.
Is there any way I can generate the key/value pairs as (horror, values) instead of (C:/download/input/movie_horror_Conjuring.txt,values)?
val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))
The above code gives me the following output:
(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_comedy_eurotrip.txt,values)
(C:/download/input/movie_horror_insidious.txt,values)
(C:/download/input/movie_sci-fi_Interstellar.txt,values)
(C:/download/input/movie_horror_evildead.txt,values)
whereas I need the output as follows:
(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))
I also tried some map and split operations to strip the folder path from the key and keep only the file name, but I was not able to keep the corresponding values attached to the files.
I would also like to know how to get the line count of values1, values2, values3, etc.
My final output should be like
(horror, 100)
where 100 is the total line count for the genre (values1 = 40 lines, values2 = 30 lines, values3 = 30 lines), and so on.

Try this:
val output = ipfile.map{case (k, v) => (k.split("_")(1),v)}.groupByKey()
output.collect
Let me know if this works for you!
Update:
To get output in the format of (horror, 100):
val output = ipfile.map{case (k, v) => (k.split("_")(1),v.count(_ == '\n'))}.reduceByKey(_ + _)
output.collect
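For reference, a commented sketch of the whole pipeline under the same assumption (that the genre is always the second underscore-separated token of the full path, with no other underscores before it):
val ipfile = sc.wholeTextFiles("C:/download/input")
val linesPerGenre = ipfile
  .map { case (path, content) =>
    // e.g. "C:/download/input/movie_horror_Conjuring.txt" -> "horror"
    val genre = path.split("_")(1)
    // counting '\n' approximates the line count, as in the update above
    (genre, content.count(_ == '\n'))
  }
  .reduceByKey(_ + _)
linesPerGenre.collect.foreach(println) // e.g. (horror,100)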

Filtering data from RDD using value sequence spark

I need the help on the below use case:
Question 1: My RDD is in the format below. From this RDD I want to exclude the rows where airport.code is in ("PUN","HAR","KAS").
case class airport(code:String,city:String,airportname:String)
val airportRdd=sparkSession.sparkContext.textFile("src/main/resources/airport_data.csv").
map(x=>x.split(","))
val airPortRddTransformed=airportRdd.map(x=>airport(x(0),x(1),x(2)))
val trasnformedRdd=airPortRddTransformed.filter(air=>!(air.code.contains(seqValues:_*)))
But ! is not working; it says "cannot resolve symbol !". Can someone please help me? How do I negate in an RDD? I have to use the RDD approach only.
Also another question:
Question 2: The data file has 70 columns. I have a column sequence:
val seqColumns=List("lat","longi","height","country")
I want to exclude these columns while loading the RDD. How can I do it? My production RDD has 70 columns, and I only know the names of the columns to exclude, not the index of every column. Again, I am looking for an RDD approach; I am aware of how to do it with the DataFrame approach.
Question 1
Use a broadcast variable to pass the list of values to the filter function. It seems _* in filter does not work here; I changed the condition to !seqValues.value.contains(air.code).
Data sample: airport_data.csv
C001,Pune,Pune Airport
C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport
C003,New York,New York Airport
C004,Delhi,Delhi Airport
Code snippet
case class airport(code:String,city:String,airportname:String)
val seqValues=spark.sparkContext.broadcast(List("C001","C003"))
val airportRdd = spark.sparkContext.textFile("D:\\DataAnalysis\\airport_data.csv").map(x=>x.split(","))
val airPortRddTransformed = airportRdd.map(x=>airport(x(0),x(1),x(2)))
//airPortRddTransformed.foreach(println)
val trasnformedRdd = airPortRddTransformed.filter(air => !seqValues.value.contains(air.code))
trasnformedRdd.foreach(println)
Output ->
airport(C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport)
airport(C004,Delhi,Delhi Airport)
Things I would change:
1- You are reading a .csv as a text file and then splitting the lines on ','. You can save this step by reading the file like:
val df = spark.read.csv("src/main/resources/airport_data.csv")
2- Reverse the order of the contains call:
val trasnformedRdd = airPortRddTransformed.filter(air => !(seqValues.contains(air.code)))
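For Question 2, a minimal RDD-only sketch, assuming the first row of the file is a header naming all 70 columns (file path and column names reused from the question for illustration):
val raw = spark.sparkContext.textFile("src/main/resources/airport_data.csv")
val header = raw.first()
val seqColumns = List("lat","longi","height","country")
// indices of the columns to keep, computed once on the driver from the header
val keepIdx = header.split(",").zipWithIndex
  .collect { case (name, i) if !seqColumns.contains(name) => i }
val trimmed = raw
  .filter(_ != header)               // drop the header row
  .map(_.split(",", -1))
  .map(cols => keepIdx.map(cols(_))) // keep only the wanted columns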

Split string twice and reduceByKey in Scala

I have a .csv file that I am trying to analyse using Spark. The .csv file contains, amongst others, a list of topics and their counts. Each topic and its count are separated by ',' and all of these topic+count pairs sit in the same string, separated by ';', like so:
"topic_1,10;topic_2,12;topic_1,3"
As you can see, some topics are in the string multiple times.
I have an RDD containing key/value pairs of a date and the topic string: [date, topicstring]
What I want to do is split the string at the ';' to get all the separate topics, then split those topics at the ',' and create key/value pairs of topic name and count, which can be reduced by key. For the example above this would be:
[date, ((topic_1, 13), (topic_2, 12))]
I have been playing around in Spark a lot, as I am new to Scala. What I tried to do is:
val separateTopicsByDate = topicsByDate
.mapValues(_.split(";").map({case(str) => (str)}))
.mapValues({case(arr) => arr
.filter(str => str.split(",").length > 1)
.map({case(str) => (str.split(",")(0), str.split(",")(1).toInt)})
})
The problem is that this returns an array of tuples, which I cannot reduceByKey. When I split the string at ';' it returns an array. I tried mapping it to a tuple (as you can see from the map operation), but this does not work.
The complete code I used is:
val rdd = sc.textFile("./data/segment/*.csv")
val topicsByDate = rdd
.filter(line => line.split("\t").length > 23)
.map({case(str) => (str.split("\t")(1), str.split("\t")(23))})
.reduceByKey(_ + _)
val separateTopicsByDate = topicsByDate
.mapValues(_.split(";").map({case(str) => (str)}))
.mapValues({case(arr) => arr
.filter(str => str.split(",").length > 1)
.map({case(str) => (str.split(",")(0), str.split(",")(1).toInt)})
})
separateTopicsByDate.take(2)
This returns
res42: Array[(String, Array[(String, Int)])] = Array((20150219001500,Array((Cecilia Pedraza,91), (Mexico City,110), (Soviet Union,1019), (Dutch Warmbloods,1236), (Jose Luis Vaquero,1413), (National Equestrian Club,1636), (Lenin Park,1776), (Royal Dutch Sport Horse,2075), (North American,2104), (Western Hemisphere,2246), (Maydet Vega,2800), (Mirach Capital Group,110), (Subrata Roy,403), (New York,820), (Saransh Sharma,945), (Federal Bureau,1440), (San Francisco,1482), (Gregory Wuthrich,1635), (San Francisco,1652), (Dan Levine,2309), (Emily Flitter,2327), (K...
As you can see this is an array of tuples which I cannot use .reduceByKey(_ + _) on.
Is there a way to split the string in such a way that it can be reduced by key?
In case your RDD has rows like:
[date, "topic1,10;topic2,12;topic1,3"]
you can split the values and explode the row using flatMap into:
[date, ["topic1,10", "topic2,12", "topic1,3"]] ->
[date, "topic1,10"]
[date, "topic2,12"]
[date, "topic1,3"]
Then convert each row into a [String,Integer] tuple (rdd1 in the code):
["date_topic1",10]
["date_topic2",12]
["date_topic1",3]
and reduce by key using addition (rdd2 in the code):
["date_topic1",13]
["date_topic2",12]
Then you separate the dates from the topics and combine the topics with their values, getting [String,String] tuples like:
["date", "topic1,13"]
["date", "topic2,12"]
Finally you split the values into [topic,count] tuples, prepare ["date", [(topic,count)]] pairs (rdd3 in the code) and reduce by key (rdd4 in the code), getting:
["date", [(topic1, 13), (topic2, 12)]]
===
Below is a Java implementation as a sequence of four intermediate RDDs; you may refer to it for Scala development:
JavaPairRDD<String, String> rdd; //original data. contains [date, "topic1,10;topic2,12;topic1,3"]
JavaPairRDD<String, Integer> rdd1 = //contains
//["date_topic1",10]
//["date_topic2",12]
//["date_topic1",3]
rdd.flatMapToPair(
pair -> //pair=[date, "topic1,10;topic2,12;topic1,3"]
{
List<Tuple2<String,Integer>> list = new ArrayList<Tuple2<String,Integer>>();
String k = pair._1; //date
String v = pair._2; //"topic,count;topic,count;topic,count"
String[] v_splits = v.split(";");
for(int i=0; i<v_splits.length; i++)
{
String[] v_split_topic_count = v_splits[i].split(","); //"topic,count"
list.add(new Tuple2<String,Integer>(k + "_" + v_split_topic_count[0], Integer.parseInt(v_split_topic_count[1]))); //"date_topic,count"
}
return list.iterator();
}//end call
);
JavaPairRDD<String,Integer> rdd2 = //contains
//["date_topic1",13]
//["date_topic2",12]
rdd1.reduceByKey((Integer i1, Integer i2) -> i1+i2);
JavaPairRDD<String,List<Tuple2<String,Integer>>> rdd3 = //contains
//["date", [(topic1,13)]]
//["date", [(topic2,12)]]
//List is used as the value type rather than Iterator: iterators are not
//serializable and are consumed on traversal, so they cannot be shuffled or reduced
rdd2.mapToPair(
pair -> //["date_topic1",13]
{
String k = pair._1; //date_topic1
Integer v = pair._2; //13
String[] dateTopicSplits = k.split("_"); //assumes the date itself contains no '_'
String new_k = dateTopicSplits[0]; //date
List<Tuple2<String,Integer>> list = new ArrayList<Tuple2<String,Integer>>();
list.add(new Tuple2<String,Integer>(dateTopicSplits[1], v)); //[(topic1,13)]
return new Tuple2<String,List<Tuple2<String,Integer>>>(new_k, list);
}
);
JavaPairRDD<String,List<Tuple2<String,Integer>>> rdd4 = //contains
//["date", [(topic1, 13), (topic2, 12)]]
rdd3.reduceByKey(
(List<Tuple2<String,Integer>> list1, List<Tuple2<String,Integer>> list2) ->
{
List<Tuple2<String,Integer>> merged = new ArrayList<Tuple2<String,Integer>>(list1);
merged.addAll(list2);
return merged;
}
);
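A rough Scala equivalent of the four steps above (a sketch, assuming topicsByDate: RDD[(String, String)] as in the question, and that the date itself contains no '_'):
val rdd1 = topicsByDate.flatMap { case (date, topics) =>
  topics.split(";").map { t =>
    val Array(topic, count) = t.split(",")
    (date + "_" + topic, count.toInt) // ("date_topic1", 10)
  }
}
val rdd2 = rdd1.reduceByKey(_ + _)   // ("date_topic1", 13)
val rdd4 = rdd2
  .map { case (k, v) =>
    val Array(date, topic) = k.split("_", 2)
    (date, List((topic, v)))
  }
  .reduceByKey(_ ++ _)               // ("date", List((topic1,13), (topic2,12)))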
UPD. This problem can actually be solved with a single map: you split the row value (i.e. the topicstring) by ';', which gives you [key,value] pairs of the form [topic,count], and you populate a hashmap from those pairs, adding up the counts. Finally you output the date key together with all the distinct topic keys accumulated in the hashmap and their values.
This way also seems more efficient, because the hashmap can never grow larger than the original row, so the memory consumed by the mapper is bounded by the size of the largest row, whereas in the flatMap solution memory must be large enough to hold all the expanded rows.
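In Scala, that single-map variant could look like this (a sketch under the same assumptions):
val perDate = topicsByDate.mapValues { topics =>
  topics.split(";")
    .map { t => val Array(topic, count) = t.split(","); (topic, count.toInt) }
    .groupBy(_._1) // local hashmap keyed by topic
    .map { case (topic, pairs) => (topic, pairs.map(_._2).sum) }
    .toList        // e.g. List((topic1,13), (topic2,12))
}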

How to get number of lines from RDD which contain any digits

The lines of the document are as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
There are two lines of the document that contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Spark with Scala.
val lineswithnum=linesRdd.filter(line => (line.contains([^0-9]))).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* is added so that the pattern matches a digit anywhere within the line.
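A quick check against the four sample lines (a minimal sketch):
val linesRdd = sc.parallelize(Seq(
  "I am 12 year old.", "I go to school.", "I am playing.", "Its 4 pm."))
linesRdd.filter(_.exists(_.isDigit)).count // res: Long = 2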

Create multiple RDDs from single file based on row value ( header record in sample file) using Spark scala

I am trying to create multiple RDDs, to be processed independently, from the file below, based on rows sharing a similar data format.
Please find the file with the different data formats below:
custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
custid,relstatus
fg3-03,R
dfsdf4-01,V
56fbfg,R
devid,reg,hold,devbrn,lname,lcon
CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370
The above file contains three different data formats, and I want to split the file into three different RDDs, one per format.
Could you please suggest how to implement this using Spark (Scala)?
Your file looks like it has 3 different csv files in it.
You can read it as a single file and extract 3 RDDs from it based on the number of fields you have in each row.
// Caching because you'll be filtering it thrice
val topRdd = sc.textFile("file").cache
topRdd.count
//res0: Long = 10
val rdd1 = topRdd.filter(_.split(",", -1).length == 10 )
val rdd2 = topRdd.filter(_.split(",", -1).length == 2 )
val rdd3 = topRdd.filter(_.split(",", -1).length == 6 )
rdd1.collect.foreach(println)
// custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
// 4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
// X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
// XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
rdd2.collect.foreach(println)
// custid,relstatus
// fg3-03,R
// dfsdf4-01,V
// 56fbfg,R
rdd3.collect.foreach(println)
// devid,reg,hold,devbrn,lname,lcon
// CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370
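If the header rows should not travel with the data, each RDD can drop its own header afterwards; a minimal sketch (assuming the header text appears nowhere else in the file):
val header1 = "custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil"
val data1 = rdd1.filter(_ != header1) // same idea for rdd2 and rdd3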

Split rdd and Select elements

I am trying to capture a stream, transform the data, and then save it locally.
So far, streaming and writing work fine. However, the transformation only works halfway.
The stream I receive consists of 9 columns separated by "|". I want to split it and, let's say, select columns 1, 3, and 5. What I have tried looks like this, but nothing really led to a result:
val indices = List(1,3,5)
linesFilter.window(Seconds(EVENT_PERIOD_SECONDS*WRITE_EVERY_N_SECONDS), Seconds(EVENT_PERIOD_SECONDS*WRITE_EVERY_N_SECONDS)).foreachRDD { (rdd, time) =>
if (rdd.count() > 0) {
rdd
.map(_.split("\\|").slice(1,2))
//.map(arr => (arr(0), arr(2))))
//filter(x=> indices.contains(_(x)))) //selec(indices)
//.zipWithIndex
.coalesce(1,true)
//the replacement is is used so that I get a csv file at the end
//.map(_.replace(DELIMITER_STREAM, DELIMITER_OUTPUT))
//.map{_.mkString(DELIMITER_OUTPUT) }
.saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
}
}
Does anyone have a tip on how to split an RDD and then grab only specific elements out of it?
Edit: Input:
val lines = streamingContext.socketTextStream(HOST, PORT)
val linesFilter = lines
.map(_.toLowerCase)
.filter(_.split(DELIMITER_STREAM).length == 9)
The input stream looks like this:
536365|71053|white metal lantern|6|01-12-10 8:26|3,39|17850|united kingdom|2017-11-17 14:52:22
Thank you very much, everyone.
As you recommended, I modified my code like this:
private val DELIMITER_STREAM = "\\|"
val linesFilter = lines
.map(_.toLowerCase)
.filter(_.split(DELIMITER_STREAM).length == 9)
.map(x =>{
val y = x.split(DELIMITER_STREAM)
(y(0),y(1),y(3),y(4),y(5),y(6),y(7))})
and then in the foreachRDD:
if (rdd.count() > 0) {
rdd
.map(_.productIterator.mkString(DELIMITER_OUTPUT))
.coalesce(1,true)
.saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
}
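If the selected columns change often, the hardcoded tuple can be replaced by the indices list from the original attempt; a minimal sketch (assuming each row splits into 9 fields, as filtered above):
val indices = List(1, 3, 5)
val selected = lines
  .map(_.toLowerCase)
  .map(_.split(DELIMITER_STREAM))
  .filter(_.length == 9)
  .map(arr => indices.map(arr(_)).mkString(DELIMITER_OUTPUT))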