Split string twice and reduceByKey in Scala

I have a .csv file that I am trying to analyse using Spark. The .csv file contains, amongst other things, a list of topics and their counts. Each topic and its count are separated by a ',' and all these topic+count pairs are in the same string, separated by ';', like so:
"topic_1,10;topic_2,12;topic_1,3"
As you can see, some topics are in the string multiple times.
I have an RDD containing key-value pairs of a date and the topic string: [date, topicstring]
What I want to do is split the string at the ';' to get the separate topics, then split each topic at the ',' and create a key-value pair of the topic name and its count, which can be reduced by key. For the example above this would be
[date, ((topic_1, 13), (topic_2, 12))]
So I have been playing around in Spark a lot, as I am new to Scala. What I tried to do is:
val separateTopicsByDate = topicsByDate
  .mapValues(_.split(";").map({case(str) => (str)}))
  .mapValues({case(arr) => arr
    .filter(str => str.split(",").length > 1)
    .map({case(str) => (str.split(",")(0), str.split(",")(1).toInt)})
  })
The problem is that this returns an Array of tuples, which I cannot call reduceByKey on. When I split the string at ';' this returns an array. I tried mapping this to a tuple (as you can see from the map operation) but this does not work.
The complete code I used is:
val rdd = sc.textFile("./data/segment/*.csv")
val topicsByDate = rdd
  .filter(line => line.split("\t").length > 23)
  .map({case(str) => (str.split("\t")(1), str.split("\t")(23))})
  .reduceByKey(_ + _)
val separateTopicsByDate = topicsByDate
  .mapValues(_.split(";").map({case(str) => (str)}))
  .mapValues({case(arr) => arr
    .filter(str => str.split(",").length > 1)
    .map({case(str) => (str.split(",")(0), str.split(",")(1).toInt)})
  })
separateTopicsByDate.take(2)
This returns
res42: Array[(String, Array[(String, Int)])] = Array((20150219001500,Array((Cecilia Pedraza,91), (Mexico City,110), (Soviet Union,1019), (Dutch Warmbloods,1236), (Jose Luis Vaquero,1413), (National Equestrian Club,1636), (Lenin Park,1776), (Royal Dutch Sport Horse,2075), (North American,2104), (Western Hemisphere,2246), (Maydet Vega,2800), (Mirach Capital Group,110), (Subrata Roy,403), (New York,820), (Saransh Sharma,945), (Federal Bureau,1440), (San Francisco,1482), (Gregory Wuthrich,1635), (San Francisco,1652), (Dan Levine,2309), (Emily Flitter,2327), (K...
As you can see this is an array of tuples which I cannot use .reduceByKey(_ + _) on.
Is there a way to split the string in such a way that it can be reduced by key?

In case your RDD has rows like:
[date, "topic1,10;topic2,12;topic1,3"]
you can split the values and explode the row using flatMap into:
[date, ["topic1,10", "topic2,12", "topic1,3"]] ->
[date, "topic1,10"]
[date, "topic2,12"]
[date, "topic1,3"]
Then convert each row into a [String, Integer] Tuple (rdd1 in the code):
["date_topic1",10]
["date_topic2",12]
["date_topic1",3]
and reduce by Key using addition (rdd2 in the code):
["date_topic1",13]
["date_topic2",12]
Then you separate dates from topics and combine topics with values, getting [String,String] Tuples like:
["date", "topic1,13"]
["date", "topic2,12"]
Finally you split the values into [topic,count] Tuples, prepare ["date", [(topic,count)]] pairs (rdd3 in the code) and reduce by Key (rdd4 in the code), getting:
["date", [(topic1, 13), (topic2, 12)]]
===
Below is a Java implementation as a sequence of four intermediate RDDs; you may refer to it for Scala development:
JavaPairRDD<String, String> rdd; //original data. contains [date, "topic1,10;topic2,12;topic1,3"]

JavaPairRDD<String, Integer> rdd1 = //contains
    //["date_topic1",10]
    //["date_topic2",12]
    //["date_topic1",3]
    rdd.flatMapToPair(
        pair -> //pair=[date, "topic1,10;topic2,12;topic1,3"]
        {
            List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
            String k = pair._1; //date
            String v = pair._2; //"topic,count;topic,count;topic,count"
            String[] v_splits = v.split(";");
            for (int i = 0; i < v_splits.length; i++)
            {
                String[] v_split_topic_count = v_splits[i].split(","); //"topic,count"
                list.add(new Tuple2<String, Integer>(k + "_" + v_split_topic_count[0], Integer.parseInt(v_split_topic_count[1]))); //"date_topic,count"
            }
            return list.iterator();
        } //end call
    );

JavaPairRDD<String, Integer> rdd2 = //contains
    //["date_topic1",13]
    //["date_topic2",12]
    rdd1.reduceByKey((Integer i1, Integer i2) -> i1 + i2);

JavaPairRDD<String, Iterator<Tuple2<String, Integer>>> rdd3 = //contains
    //["date", [(topic1,13)]]
    //["date", [(topic2,12)]]
    rdd2.mapToPair(
        pair -> //["date_topic1",13]
        {
            String k = pair._1;  //date_topic1
            Integer v = pair._2; //13
            String[] dateTopicSplits = k.split("_");
            String new_k = dateTopicSplits[0]; //date
            List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
            list.add(new Tuple2<String, Integer>(dateTopicSplits[1], v)); //[(topic1,13)]
            return new Tuple2<String, Iterator<Tuple2<String, Integer>>>(new_k, list.iterator());
        }
    );

JavaPairRDD<String, Iterator<Tuple2<String, Integer>>> rdd4 = //contains
    //["date", [(topic1, 13), (topic2, 12)]]
    rdd3.reduceByKey(
        (Iterator<Tuple2<String, Integer>> itr1, Iterator<Tuple2<String, Integer>> itr2) ->
        {
            List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
            while (itr1.hasNext())
                list.add(itr1.next());
            while (itr2.hasNext())
                list.add(itr2.next());
            return list.iterator();
        }
    );
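For the Scala side of the question, a rough, untested sketch of the same four-step pipeline could look as follows (it assumes topicsByDate is the RDD[(String, String)] of [date, topicstring] pairs from the question; the variable names are only illustrative):
val separateTopicsByDate = topicsByDate
  .flatMap { case (date, topics) =>                // explode "topic,count;..." into (date_topic, count) pairs
    topics.split(";")
      .map(_.split(","))
      .filter(_.length > 1)
      .map(arr => (date + "_" + arr(0), arr(1).toInt))
  }
  .reduceByKey(_ + _)                              // sum the counts per (date, topic)
  .map { case (dateTopic, count) =>                // split the composite key back into date and topic
    val Array(date, topic) = dateTopic.split("_", 2)
    (date, List((topic, count)))
  }
  .reduceByKey(_ ++ _)                             // collect all (topic, count) pairs per date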
UPD. This problem can actually be solved with a single map only: you split the row value (i.e. the topic string) by ';', which gives you [topic, count] pairs, and you populate a hashmap with those pairs, adding up the counts. Finally you output the date key together with all the distinct topics accumulated in the hashmap and their summed counts.
This way also seems to be more efficient, because the hashmap is never going to be larger than the original row, so the memory consumed by the mapper is bounded by the size of the largest row, whereas in the flatMap solution the memory should be large enough to fit all of the expanded rows.
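As a hedged illustration of that single-map idea (reusing the hypothetical topicsByDate from above, and assuming each date occurs in exactly one row), the per-row hashmap could be sketched in Scala roughly like this:
import scala.collection.mutable

val aggregatedTopicsByDate = topicsByDate.mapValues { topics =>
  val counts = mutable.Map.empty[String, Int]                       // the per-row hashmap
  topics.split(";").map(_.split(",")).filter(_.length > 1).foreach { arr =>
    counts(arr(0)) = counts.getOrElse(arr(0), 0) + arr(1).toInt     // add up counts per topic within the row
  }
  counts.toList                                                     // (topic, summedCount) pairs for this date
}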

Related

Using reduceByKey instead of GroupBy

How do I use reduceByKey instead of GroupBy for data stored as an RDD?
The purpose is to group by key and then sum the values.
I have a working Scala process to find the Odds Ratio.
Problem:
The data we are ingesting into the script has grown drastically and the job started failing because of memory/disk issues. The main problem here is the large amount of shuffling caused by the "GROUP BY".
Sample Data:
(543040000711860,543040000839322,0,0,0,0)
(543040000711860,543040000938728,0,0,1,1)
(543040000711860,543040000984046,0,0,1,1)
(543040000711860,543040001071137,0,0,1,1)
(543040000711860,543040001121115,0,0,1,1)
(543040000711860,543040001281239,0,0,0,0)
(543040000711860,543040001332995,0,0,1,1)
(543040000711860,543040001333073,0,0,1,1)
(543040000839322,543040000938728,0,1,0,0)
(543040000839322,543040000984046,0,1,0,0)
(543040000839322,543040001071137,0,1,0,0)
(543040000839322,543040001121115,0,1,0,0)
(543040000839322,543040001281239,1,0,0,0)
(543040000839322,543040001332995,0,1,0,0)
(543040000839322,543040001333073,0,1,0,0)
(543040000938728,543040000984046,0,0,1,1)
(543040000938728,543040001071137,0,0,1,1)
(543040000938728,543040001121115,0,0,1,1)
(543040000938728,543040001281239,0,0,0,0)
(543040000938728,543040001332995,0,0,1,1)
Here is the code to transform my data:
var groupby = flags.groupBy(item =>(item._1, item._2) )
var counted_group = groupby.map(item => (item._1, item._2.map(_._3).sum, item._2.map(_._4).sum, item._2.map(_._5).sum, item._2.map(_._6).sum))
Result:
((3900001339662,3900002247644),6,12,38,38)
((543040001332995,543040001352893),112,29,57,57)
((3900001572602,543040001071137),1,0,1,1)
((3900001640810,543040001281239),2,1,0,0)
((3900001295323,3900002247644),8,21,8,8)
I need to convert this to "REDUCE BY KEY" so that the data is reduced in each partition before being sent back. I am using an RDD, so there is no direct method to do a REDUCE BY.
I think I solved the problem by using aggregateByKey.
Remapped the RDD to generate key-value pairs:
val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))
Then applied the aggregateByKey function on the result. Now each partition returns the aggregated result rather than the grouped result:
rddPair.aggregateByKey((0, 0, 0, 0))(
  (iTotal, oisubtotal) => (iTotal._1 + oisubtotal._1, iTotal._2 + oisubtotal._2, iTotal._3 + oisubtotal._3, iTotal._4 + oisubtotal._4),
  (fTotal, iTotal) => (fTotal._1 + iTotal._1, fTotal._2 + iTotal._2, fTotal._3 + iTotal._3, fTotal._4 + iTotal._4)
)
reduceByKey would require an RDD[(K, V)], i.e. key-value pairs, so you should create a pair RDD first:
val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))
Then you can use reduceByKey on the rddPair above as:
rddPair.reduceByKey((x, y)=> (x._1+y._1, x._2+y._2, x._3+y._3, x._4+y._4))
I hope the answer is helpful

Split rdd and Select elements

I am trying to capture a stream, transform the data, and then save it locally.
So far, streaming and writing work fine. However, the transformation only works halfway.
The stream I receive consists of 9 columns separated by "|". I want to split it and, let's say, select columns 1, 3, and 5. What I have tried looks like this, but nothing really led to a result:
val indices = List(1,3,5)
linesFilter.window(Seconds(EVENT_PERIOD_SECONDS*WRITE_EVERY_N_SECONDS), Seconds(EVENT_PERIOD_SECONDS*WRITE_EVERY_N_SECONDS)).foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    rdd
      .map(_.split("\\|").slice(1,2))
      //.map(arr => (arr(0), arr(2))))
      //filter(x=> indices.contains(_(x)))) //selec(indices)
      //.zipWithIndex
      .coalesce(1,true)
      //the replacement is used so that I get a csv file at the end
      //.map(_.replace(DELIMITER_STREAM, DELIMITER_OUTPUT))
      //.map{_.mkString(DELIMITER_OUTPUT) }
      .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
  }
}
Does anyone have a tip on how to split an RDD and then grab only specific elements out of it?
Edit: Input:
val lines = streamingContext.socketTextStream(HOST, PORT)
val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
The input stream looks like this:
536365|71053|white metal lantern|6|01-12-10 8:26|3,39|17850|united kingdom|2017-11-17 14:52:22
Thank you very much, everyone.
As you recommended, I modified my code like this:
private val DELIMITER_STREAM = "\\|"
val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map(x => {
    val y = x.split(DELIMITER_STREAM)
    (y(0), y(1), y(3), y(4), y(5), y(6), y(7))
  })
and then inside the foreachRDD:
if (rdd.count() > 0) {
  rdd
    .map(_.productIterator.mkString(DELIMITER_OUTPUT))
    .coalesce(1, true)
    .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
}
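If you would rather keep the indices list from the original attempt instead of hard-coding the tuple, an untested variation could index into the split array directly (DELIMITER_OUTPUT is the output separator from the commented-out lines above):
val indices = List(1, 3, 5)

val selectedColumns = lines
  .map(_.toLowerCase)
  .map(_.split(DELIMITER_STREAM))
  .filter(_.length == 9)
  .map(cols => indices.map(i => cols(i)).mkString(DELIMITER_OUTPUT))  // keep only columns 1, 3 and 5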

Scala: Creating a HBase table with pre splitting region based on Row Key

I have three RegionServers. I want to evenly distribute an HBase table onto the three RegionServers based on row keys which I have already identified (say, rowkey_100 and rowkey_200). It can be done from the HBase shell using:
create 'tableName', 'columnFamily', {SPLITS => ['rowkey_100','rowkey_200']}
If I am not mistaken, these 2 split points will create 3 regions, and the first 100 rows will go to the 1st RegionServer, the next 100 rows to the 2nd RegionServer, and the remaining rows to the last RegionServer. I want to do the same thing using Scala code. How can I specify this in Scala code to split the table into regions?
Below is a Scala snippet for creating an HBase table with splits:
val admin = new HBaseAdmin(conf)
if (!admin.tableExists(myTable)) {
  val htd = new HTableDescriptor(myTable)
  val hcd = new HColumnDescriptor(myCF)
  val splits = Array[Array[Byte]](splitPoint1.getBytes, splitPoint2.getBytes)
  htd.addFamily(hcd)
  admin.createTable(htd, splits)
}
There are some predefined region split policies, but in case you want to create your own way of setting split points that span your rowkey range, you can create a simple function like the following:
def autoSplits(n: Int, range: Int = 256) = {
  val splitPoints = new Array[Array[Byte]](n)
  for (i <- 0 to n-1) {
    splitPoints(i) = Array[Byte](((range / (n + 1)) * (i + 1)).asInstanceOf[Byte])
  }
  splitPoints
}
Just comment out the val splits = ... line and replace createTable's splits parameter with autoSplits(2) or autoSplits(4, 128), etc.
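For example, a hypothetical call site, reusing conf, myTable and myCF from the first snippet, might look like:
val admin = new HBaseAdmin(conf)
if (!admin.tableExists(myTable)) {
  val htd = new HTableDescriptor(myTable)
  htd.addFamily(new HColumnDescriptor(myCF))
  admin.createTable(htd, autoSplits(2))  // 2 generated split points -> 3 regions over the 0..255 byte range
}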
This Java code can help:
HTableDescriptor td = new HTableDescriptor(TableName.valueOf("tableName"));
HColumnDescriptor cf = new HColumnDescriptor("cf".getBytes());
td.addFamily(cf);
byte[][] splitKeys = new byte[][] {key1.getBytes(), key2.getBytes()};
HBaseAdmin dbAdmin = new HBaseAdmin(conf);
dbAdmin.createTable(td, splitKeys);

Key/Value pair RDD

I have a question on key/value pair RDD.
I have five files in the C:/download/input folder; the content of each file is the dialogue of a film, as follows:
movie_horror_Conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_Interstellar.txt
movie_horror_evildead.txt
I am trying to read the files in the input folder using sc.wholeTextFiles(), where I get key/value pairs as follows:
(C:/download/input/movie_horror_Conjuring.txt,values)
I am trying to do an operation where I have to group the input files of each genre together using groupByKey(): the values of all the horror movies together, all the comedy movies together, and so on.
Is there any way I can generate the key/value pairs as (horror, values) instead of (C:/download/input/movie_horror_Conjuring.txt, values)?
val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))
The above code is giving me the output as follows
(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_comedy_eurotrip.txt,values)
(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_sci-fi_Interstellar.txt,values)
(C:/download/input/movie_horror_evildead.txt,values)
whereas I need the output as follows:
(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))
I also tried to do some map and split operations to remove the folder path from the key and keep only the file name, but I'm not able to attach the corresponding values to the file names.
I would also like to know how I can get the line counts of values1, values2, values3, etc.
My final output should be like
(horror, 100)
where 100 is the sum of the line counts of values1 = 40 lines, values2 = 30 lines, values3 = 30 lines, and so on.
Try this:
val output = ipfile.map{case (k, v) => (k.split("_")(1),v)}.groupByKey()
output.collect
Let me know if this works for you!
Update:
To get output in the format of (horror, 100):
val output = ipfile.map{case (k, v) => (k.split("_")(1),v.count(_ == '\n'))}.reduceByKey(_ + _)
output.collect
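A slightly more defensive variant (illustrative only, not part of the original answer) first takes the file's base name, so that underscores elsewhere in the path cannot shift the genre position:
import java.io.File

val output = ipfile
  .map { case (path, contents) =>
    val genre = new File(path).getName.split("_")(1)  // movie_<genre>_<title>.txt
    (genre, contents.count(_ == '\n'))                // line count of this file
  }
  .reduceByKey(_ + _)
output.collect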

Imputing the dataset with mean of class label causing crash in filter operation

I have a csv file that contains numeric values.
val row = withoutHeader.map {
  line => {
    val arr = line.split(',')
    for (h <- 0 until arr.length) {
      if (arr(h).trim == "") {
        val abc = avgrdd.filter {case ((x,y),z) => x == h && y == arr(dependent_col_index).toDouble} //crashing here
        arr(h) = //imputing with the value above
      }
    }
    arr.mkString(",")
  }
}
This is a snippet of the code where I am trying to impute the missing values with the mean for the corresponding class label.
avgrdd contains the averages for key-value pairs, where the key is the (column index, class label) pair. This avgrdd is calculated using combiners, which I can see are computing the results correctly.
dependent_col_index is the column containing the class labels.
The line with the filter is crashing with a null pointer exception.
On removing this line, the original array is output (comma separated).
I am confused about why the filter operation is causing a crash.
Please suggest how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
The mean for class 1 is 84/4 = 21 and for class 0 it is 29/2 = 14.5.
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks !!
You are trying to execute an RDD transformation inside another RDD transformation. Remember that you cannot use an RDD inside another RDD transformation; this will cause an error.
The way to proceed is the following:
1. Transform the source RDD withoutHeader to an RDD of pairs <Class, Value> of the correct type (Long in your case). Cache it.
2. Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>.
3. Join the withoutHeader RDD and avgrdd together; this way, for each row you would have a structure <Class, <Value, AvgValue>>.
4. Execute a map on top of the result to replace the missing Value with AvgValue.
Another option might be to split the RDD into two parts at step 3 (one part: the RDD with missing values; the second: the RDD with non-missing values), join avgrdd only with the RDD containing the missing values, and after that make a union of the two parts. This would be faster if you have a small fraction of missing values.
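A minimal Scala sketch of steps 1-4 for the two-column example above (the names and the single imputed column are illustrative assumptions, and the join does not preserve the original row order):
// withoutHeader: RDD[String] with rows like "4,1" or ",1" (value, class label)
val rows = withoutHeader.map(_.split(",", -1))                     // -1 keeps empty fields

// step 1: (classLabel, rawValue) pairs; cached because they are used twice
val byClass = rows.map(arr => (arr(1), arr(0))).cache()

// step 2: per-class mean of the non-missing values
val avgByClass = byClass
  .filter { case (_, v) => v.trim.nonEmpty }
  .mapValues(v => (v.toDouble, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, cnt) => sum / cnt }

// steps 3 and 4: join and substitute the class mean where the value is missing
val imputed = byClass.join(avgByClass).map { case (cls, (v, avg)) =>
  val filled = if (v.trim.isEmpty) avg.toString else v
  s"$filled,$cls"
}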