Split rdd and Select elements - scala

I am trying to capture a stream, transform the data, and then save it locally.
So far, streaming and writing work fine; however, the transformation only works halfway.
The stream I receive consists of 9 columns separated by "|". I want to split it and select, say, columns 1, 3, and 5. What I have tried looks like this, but nothing has really led to a result:
val indices = List(1, 3, 5)

linesFilter
  .window(Seconds(EVENT_PERIOD_SECONDS * WRITE_EVERY_N_SECONDS), Seconds(EVENT_PERIOD_SECONDS * WRITE_EVERY_N_SECONDS))
  .foreachRDD { (rdd, time) =>
    if (rdd.count() > 0) {
      rdd
        .map(_.split("\\|").slice(1, 2))
        //.map(arr => (arr(0), arr(2)))
        //.filter(x => indices.contains(_(x))) //select(indices)
        //.zipWithIndex
        .coalesce(1, true)
        //the replacement is used so that I get a csv file at the end
        //.map(_.replace(DELIMITER_STREAM, DELIMITER_OUTPUT))
        //.map { _.mkString(DELIMITER_OUTPUT) }
        .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
    }
  }
Does anyone have a tip on how to split an RDD and then grab only specific elements out of it?
Edit - input:
val lines = streamingContext.socketTextStream(HOST, PORT)

val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
The input stream looks like this:
536365|71053|white metal lantern|6|01-12-10 8:26|3,39|17850|united kingdom|2017-11-17 14:52:22

Thank you very much, everyone.
As you recommended, I modified my code like this:
private val DELIMITER_STREAM = "\\|"

val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map(x => {
    val y = x.split(DELIMITER_STREAM)
    (y(0), y(1), y(3), y(4), y(5), y(6), y(7))
  })
and then on the RDD:
if (rdd.count() > 0) {
  rdd
    .map(_.productIterator.mkString(DELIMITER_OUTPUT))
    .coalesce(1, true)
    .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
}
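For reference, the selection can also stay driven by the indices list from the original attempt, so the field positions are not hard-coded into a tuple. A minimal sketch, assuming the same lines DStream and the DELIMITER_STREAM / DELIMITER_OUTPUT constants used above:
val indices = List(1, 3, 5) // 0-based positions of the fields to keep

val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map { x =>
    val fields = x.split(DELIMITER_STREAM)
    indices.map(i => fields(i)).mkString(DELIMITER_OUTPUT) // keep only the selected columns, already joined as one CSV line
  }
With this shape, the foreachRDD body only needs the coalesce and saveAsTextFile calls, since each element is already a finished output line.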

Related

Split string twice and reduceByKey in Scala

I have a .csv file that I am trying to analyse using Spark. The .csv file contains, amongst other things, a list of topics and their counts. The topics and their counts are separated by a ',' and all these topic+count pairs are in the same string, separated by ';', like so:
"topic_1,10;topic_2,12;topic_1,3"
As you can see, some topics are in the string multiple times.
I have an RDD containing key-value pairs of a date and the topic string: [date, topicstring]
What I want to do is split the string at the ';' to get all the separate topics, then split those topics at the ',' and create a key-value pair of the topic name and count, which can be reduced by key. For the example above this would be:
[date, ((topic_1, 13), (topic_2, 12))]
I have been playing around in Spark a lot, as I am new to Scala. What I tried to do is:
val separateTopicsByDate = topicsByDate
  .mapValues(_.split(";").map({ case (str) => (str) }))
  .mapValues({ case (arr) => arr
    .filter(str => str.split(",").length > 1)
    .map({ case (str) => (str.split(",")(0), str.split(",")(1).toInt) })
  })
The problem is that this returns an Array of tuples inside the value, which I cannot reduceByKey. When I split the string at ';' it returns an array. I tried mapping this to a tuple (as you can see from the map operation), but this does not work.
The complete code I used is
val rdd = sc.textFile("./data/segment/*.csv")

val topicsByDate = rdd
  .filter(line => line.split("\t").length > 23)
  .map({ case (str) => (str.split("\t")(1), str.split("\t")(23)) })
  .reduceByKey(_ + _)

val separateTopicsByDate = topicsByDate
  .mapValues(_.split(";").map({ case (str) => (str) }))
  .mapValues({ case (arr) => arr
    .filter(str => str.split(",").length > 1)
    .map({ case (str) => (str.split(",")(0), str.split(",")(1).toInt) })
  })

separateTopicsByDate.take(2)
This returns
res42: Array[(String, Array[(String, Int)])] = Array((20150219001500,Array((Cecilia Pedraza,91), (Mexico City,110), (Soviet Union,1019), (Dutch Warmbloods,1236), (Jose Luis Vaquero,1413), (National Equestrian Club,1636), (Lenin Park,1776), (Royal Dutch Sport Horse,2075), (North American,2104), (Western Hemisphere,2246), (Maydet Vega,2800), (Mirach Capital Group,110), (Subrata Roy,403), (New York,820), (Saransh Sharma,945), (Federal Bureau,1440), (San Francisco,1482), (Gregory Wuthrich,1635), (San Francisco,1652), (Dan Levine,2309), (Emily Flitter,2327), (K...
As you can see, this is an array of tuples, which I cannot use .reduceByKey(_ + _) on.
Is there a way to split the string in such a way that it can be reduced by key?
If your RDD has rows like:
[date, "topic1,10;topic2,12;topic1,3"]
you can split the values and explode the row using flatMap into:
[date, ["topic1,10", "topic2,12", "topic1,3"]] ->
[date, "topic1,10"]
[date, "topic2,12"]
[date, "topic1,3"]
Then convert each row into a [String, Integer] tuple (rdd1 in the code):
["date_topic1",10]
["date_topic2",12]
["date_topic1",3]
and reduce by key using addition (rdd2 in the code):
["date_topic1",13]
["date_topic2",12]
Then you separate dates from topics and combine topics with values, getting [String,String] Tuples like:
["date", "topic1,13"]
["date", "topic2,12"]
Finally, you split the values into [topic, count] tuples, prepare ["date", [(topic,count)]] pairs (rdd3 in the code), and reduce by key (rdd4 in the code), getting:
["date", [(topic1, 13), (topic2, 12)]]
===
Below is a Java implementation as a sequence of four intermediate RDDs; you may refer to it for the Scala development:
JavaPairRDD<String, String> rdd; //original data. contains [date, "topic1,10;topic2,12;topic1,3"]

JavaPairRDD<String, Integer> rdd1 = //contains
    //["date_topic1",10]
    //["date_topic2",12]
    //["date_topic1",3]
    rdd.flatMapToPair(
        pair -> //pair=[date, "topic1,10;topic2,12;topic1,3"]
        {
            List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
            String k = pair._1; //date
            String v = pair._2; //"topic,count;topic,count;topic,count"
            String[] v_splits = v.split(";");
            for (int i = 0; i < v_splits.length; i++)
            {
                String[] v_split_topic_count = v_splits[i].split(","); //"topic,count"
                list.add(new Tuple2<String, Integer>(k + "_" + v_split_topic_count[0],
                        Integer.parseInt(v_split_topic_count[1]))); //"date_topic,count"
            }
            return list.iterator();
        } //end call
    );

JavaPairRDD<String, Integer> rdd2 = //contains
    //["date_topic1",13]
    //["date_topic2",12]
    rdd1.reduceByKey((Integer i1, Integer i2) -> i1 + i2);

JavaPairRDD<String, Iterator<Tuple2<String, Integer>>> rdd3 = //contains
    //["date", [(topic1,13)]]
    //["date", [(topic2,12)]]
    rdd2.mapToPair(
        pair -> //["date_topic1",13]
        {
            String k = pair._1; //date_topic1
            Integer v = pair._2; //13
            String[] dateTopicSplits = k.split("_");
            String new_k = dateTopicSplits[0]; //date
            List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
            list.add(new Tuple2<String, Integer>(dateTopicSplits[1], v)); //[(topic1,13)]
            return new Tuple2<String, Iterator<Tuple2<String, Integer>>>(new_k, list.iterator());
        }
    );

JavaPairRDD<String, Iterator<Tuple2<String, Integer>>> rdd4 = //contains
    //["date", [(topic1, 13), (topic2, 12)]]
    rdd3.reduceByKey(
        (Iterator<Tuple2<String, Integer>> itr1, Iterator<Tuple2<String, Integer>> itr2) ->
        {
            List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
            while (itr1.hasNext())
                list.add(itr1.next());
            while (itr2.hasNext())
                list.add(itr2.next());
            return list.iterator();
        }
    );
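For the Scala side, a rough sketch of the same four-step pipeline could look like this (a sketch only, not tested; it uses split("_", 2) when re-splitting the composite key so topic names that themselves contain underscores survive the round trip):
val rdd1 = topicsByDate.flatMap { case (date, topicString) =>    // explode "topic,count;..." entries
  topicString.split(";")
    .map(_.split(","))
    .collect { case Array(topic, count) => (s"${date}_$topic", count.toInt) } // "date_topic" -> count
}

val rdd2 = rdd1.reduceByKey(_ + _)                               // sum counts per date_topic

val rdd3 = rdd2.map { case (dateTopic, count) =>
  val Array(date, topic) = dateTopic.split("_", 2)               // split only on the first underscore
  (date, Seq((topic, count)))
}

val rdd4 = rdd3.reduceByKey(_ ++ _)                              // ["date", [(topic1, 13), (topic2, 12)]]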
UPD. This problem can actually be solved with a single map only: you split the row value (i.e. the topic string) by ';', which gives you [key, value] pairs as [topic, count], and you populate a hashmap with those pairs, adding up the counts. Finally, you output the date key together with all the distinct keys accumulated in the hashmap and their values.
This way also seems to be more efficient, because the hashmap is never larger than the original row, so the memory consumed by the mapper is bounded by the size of the largest row, whereas in the flatMap solution the memory has to be large enough to hold all the expanded rows.
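A hedged Scala sketch of that single-map idea, aggregating the counts per topic inside one mapValues call (using groupBy as a stand-in for the hashmap described above):
val separateTopicsByDate = topicsByDate.mapValues { topicString =>
  topicString
    .split(";")
    .map(_.split(","))
    .collect { case Array(topic, count) => (topic, count.toInt) } // skip malformed entries
    .groupBy(_._1)
    .map { case (topic, pairs) => (topic, pairs.map(_._2).sum) }  // per-row aggregation of counts
    .toSeq
}
// yields e.g. (date, Seq((topic_1, 13), (topic_2, 12)))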

Scala - How to use filter on Map

If I have a map of filenames and their sizes and I want to filter it based on size, I can do something like this.
val fileCount = Map("a.txt"->10, "b.txt"->0)
First Way
val zeroSizeFiles = fileCount.filter(t => t._2 == 0)
or
val zeroSizeFiles = fileCount.filter(_._2 == 0)
I realized that I can also do something like this, which is more verbose.
Second Way
val zeroSizeFiles = fileCount.filter { case (fileName, count) => count == 0 }
So, is there any disadvantage to using the second way compared to the first, or any advantage apart from it being more readable?
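Functionally the two are equivalent; the case form is just a pattern-matching function literal, so you can name the tuple parts instead of using _1 and _2. A small sketch to illustrate that both produce the same result:
// A minimal sketch comparing the two styles; both produce the same Map.
val fileCount = Map("a.txt" -> 10, "b.txt" -> 0)

val viaTuple = fileCount.filter(_._2 == 0)                        // positional tuple access
val viaCase  = fileCount.filter { case (_, count) => count == 0 } // destructuring by name

assert(viaTuple == viaCase)  // Map("b.txt" -> 0) in both cases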

Split one row into multiple rows of dataframe

I want to convert one row of a dataframe into multiple rows. If the hour is the same, the row will not be split, but if the hours differ, the row will be split into multiple rows according to the difference between the hours. I am fine with a solution using dataframe functions or a Hive query.
Input Table or Dataframe
Expected Output Table or Dataframe
Please help me find a way to produce the expected output.
The easiest solution for such a simple schema is to use Dataset.flatMap after defining case classes for the input and output schema.
A simple UDF solution would return a sequence, and then you could use functions.explode. Far less clean and efficient than using flatMap.
Last but not least, you could create your own table-generating UDF but that would be extreme overkill for this problem.
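A hedged sketch of what the Dataset.flatMap variant could look like, assuming a hypothetical helper splitByHour that implements the same hour-boundary logic as the map-based answer below, and the df DataFrame defined there:
import sparkSession.implicits._

// Input and output share the same simple schema here.
case class Session(UserName: String, Date: String, start_time: String, end_time: String)

// `splitByHour(s: Session): Seq[Session]` is a hypothetical helper containing
// the hour-boundary logic shown in the RDD-based answer below.
val exploded = df.as[Session].flatMap(splitByHour)

exploded.show(false)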
You can implement your own logic inside the map operation and use flatMap to achieve this.
The following is a crude way in which I have implemented the solution; you can improve it as needed.
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.time.{Duration, LocalDateTime}
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer
import sparkSession.sqlContext.implicits._
val df = Seq(("john", "2/9/2018", "2/9/2018 5:02", "2/9/2018 5:12"),
("smit", "3/9/2018", "3/9/2018 6:12", "3/9/2018 8:52"),
("rick", "4/9/2018", "4/9/2018 23:02", "5/9/2018 2:12")
).toDF("UserName", "Date", "start_time", "end_time")
val rdd = df.rdd.map(row => {
val result = new ArrayBuffer[Row]()
val formatter1 = DateTimeFormatter.ofPattern("d/M/yyyy H:m")
val formatter2 = DateTimeFormatter.ofPattern("d/M/yyyy H:mm")
val d1 = LocalDateTime.parse(row.getAs[String]("start_time"), formatter1)
val d2 = LocalDateTime.parse(row.getAs[String]("end_time"), formatter1)
if (d1.getHour == d2.getHour) result += row
else {
val hoursDiff = Duration.between(d1, d2).toHours.toInt
result += Row.fromSeq(Seq(
row.getAs[String]("UserName"),
row.getAs[String]("Date"),
row.getAs[String]("start_time"),
d1.plus(1, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
for (index <- 1 until hoursDiff) {
result += Row.fromSeq(Seq(
row.getAs[String]("UserName"),
row.getAs[String]("Date"),
d1.plus(index, ChronoUnit.HOURS).withMinute(0).format(formatter1),
d1.plus(1 + index, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
}
result += Row.fromSeq(Seq(
row.getAs[String]("UserName"),
row.getAs[String]("Date"),
d2.withMinute(0).format(formatter2),
row.getAs[String]("end_time")))
}
result
}).flatMap(_.toIterator)
rdd.collect.foreach(println)
and finally, your result is as follows:
[john,2/9/2018,2/9/2018 5:02,2/9/2018 5:12]
[smit,3/9/2018,3/9/2018 6:12,3/9/2018 7:00]
[smit,3/9/2018,3/9/2018 7:0,3/9/2018 8:00]
[smit,3/9/2018,3/9/2018 8:00,3/9/2018 8:52]
[rick,4/9/2018,4/9/2018 23:02,5/9/2018 0:00]
[rick,4/9/2018,5/9/2018 0:0,5/9/2018 1:00]
[rick,4/9/2018,5/9/2018 1:0,5/9/2018 2:00]
[rick,4/9/2018,5/9/2018 2:00,5/9/2018 2:12]

Spark: convert large input to rdd

I read a lot of lines from an iterator and I need to convert them to an RDD.
I have seen some answers like doing sc.parallelize(YourIterable.toList), but the toList will raise a memory exception.
I have also read posts saying that it goes against the Spark model, but I think there should be another solution.
I have two ideas; please tell me which is best, or whether you have any other ideas to solve this.
Solution 1: I store the lines 100,000 at a time in an ArrayBuffer, and when the iterator is empty I convert the array to an RDD with parallelize:
val result = ArrayBuffer[String]()
var counter = 0
var resultRDD: RDD[String] = sc.sparkContext.emptyRDD[String]

while (resultSet.next()) {
  // Do some stuff on resultSet
  result.append(stuff)
  counter = counter + 1
  if (counter % 100000 == 0) {
    val tmp = sc.sparkContext.parallelize(result)
    tmp.count // Need to run an action because result will be cleared
    resultRDD = resultRDD union tmp
    result.clear
  }
}
// Same for the last lines
// use resultRDD
With this method, having to use count to force an action on the lazy union before the ArrayBuffer is cleared is a bit annoying.
Solution 2: read in the same batches, but write them to files in HDFS and then load them back with sc.textFile.
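A hedged sketch of Solution 2 (the HDFS staging path and the rowToString helper are placeholders, and sc is assumed to be the SparkSession used in Solution 1):
import org.apache.hadoop.fs.{FileSystem, Path}
import java.io.PrintWriter

val outputPath = new Path("hdfs:///tmp/resultset_dump/part-00000")  // assumed staging location
val fs = FileSystem.get(sc.sparkContext.hadoopConfiguration)
val writer = new PrintWriter(fs.create(outputPath))

try {
  while (resultSet.next()) {
    writer.println(rowToString(resultSet))  // hypothetical helper; only one row is held in driver memory at a time
  }
} finally {
  writer.close()
}

val resultRDD = sc.sparkContext.textFile("hdfs:///tmp/resultset_dump")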

Scala: Creating a HBase table with pre splitting region based on Row Key

I have three RegionServers. I want to evenly distribute an HBase table across the three RegionServers based on rowkeys that I have already identified (say, rowkey_100 and rowkey_200). It can be done from the HBase shell using:
create 'tableName', 'columnFamily', {SPLITS => ['rowkey_100','rowkey_200']}
If I am not mistaken, these 2 split points will create 3 regions: the first 100 rows will go to the 1st RegionServer, the next 100 rows to the 2nd, and the remaining rows to the last. I want to do the same thing in Scala code. How can I specify the split points in Scala so that the table is split into regions?
Below is a Scala snippet for creating an HBase table with splits:
val admin = new HBaseAdmin(conf)

if (!admin.tableExists(myTable)) {
  val htd = new HTableDescriptor(myTable)
  val hcd = new HColumnDescriptor(myCF)
  val splits = Array[Array[Byte]](splitPoint1.getBytes, splitPoint2.getBytes)
  htd.addFamily(hcd)
  admin.createTable(htd, splits)
}
There are some predefined region split policies, but in case you want to create your own way of setting split points that span your rowkey range, you can create a simple function like the following:
def autoSplits(n: Int, range: Int = 256) = {
  val splitPoints = new Array[Array[Byte]](n)
  for (i <- 0 to n - 1) {
    splitPoints(i) = Array[Byte](((range / (n + 1)) * (i + 1)).asInstanceOf[Byte])
  }
  splitPoints
}
Just comment out the val splits = ... line and replace createTable's splits parameter with autoSplits(2) or autoSplits(4, 128), etc.
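For example, sketched against the admin and htd values from the snippet above:
admin.createTable(htd, autoSplits(2))        // 2 evenly spaced split points over 0..255 -> 3 regions
// admin.createTable(htd, autoSplits(4, 128)) // 4 split points if the first rowkey byte only spans 0..127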
This Java code can help:
HTableDescriptor td = new HTableDescriptor(TableName.valueOf("tableName"));
HColumnDescriptor cf = new HColumnDescriptor("cf".getBytes());
td.addFamily(cf);
byte[][] splitKeys = new byte[][] { key1.getBytes(), key2.getBytes() };
HBaseAdmin dbAdmin = new HBaseAdmin(conf);
dbAdmin.createTable(td, splitKeys);