Scala: flattening an embedded list of lists

I have created a Twitter datastream that displays the hashtags, author, and mentioned users in the format below.
(List(timetofly, hellocake),Shera_Eyra,List(blxcknicotine, kimtheskimm))
I can't do analysis on this format because of the embedded lists. How can I create another datastream that displays the data in this format?
timetofly, Shera_Eyra, blxcknicotine
timetofly, Shera_Eyra, kimtheskimm
hellocake, Shera_Eyra, blxcknicotine
hellocake, Shera_Eyra, kimtheskimm
Here is my code to produce the data:
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(sampleInterval))
val stream = TwitterUtils.createStream(ssc, None)
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}

In your code snippet, data is a DStream[(Array[String], String, List[String])]. To get a DStream[String] in your desired format, you can use flatMap and map:
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}

val data2 = data.flatMap(a => a._1.flatMap(b => a._3.map(c => (b, a._2, c))))
  .map { case (hash, user, mention) => s"$hash, $user, $mention" }
The flatMap produces a DStream[(String, String, String)] in which each tuple consists of a hashtag entity, the user, and a mention entity. The subsequent map with pattern matching creates a DStream[String] in which each String contains the elements of a tuple, separated by a comma and a space.
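To see the flattening in isolation, here is a minimal local sketch using plain Scala collections instead of a DStream (the sample tuple mirrors the data in the question):
val sample = (Array("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))
val flattened = sample._1.flatMap(hash => sample._3.map(mention => (hash, sample._2, mention)))
  .map { case (hash, user, mention) => s"$hash, $user, $mention" }
flattened.foreach(println)
// timetofly, Shera_Eyra, blxcknicotine
// timetofly, Shera_Eyra, kimtheskimm
// hellocake, Shera_Eyra, blxcknicotine
// hellocake, Shera_Eyra, kimtheskimm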

I would use a for comprehension for this:
val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))
val result = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield (hashtag, user, mentionedUser)
result.foreach(println)
Output:
(timetofly,Shera_Eyra,blxcknicotine)
(timetofly,Shera_Eyra,kimtheskimm)
(hellocake,Shera_Eyra,blxcknicotine)
(hellocake,Shera_Eyra,kimtheskimm)
If you would prefer a Seq of Lists of Strings rather than a Seq of tuples, change the yield to give you a List instead: yield List(hashtag, user, mentionedUser), as shown below.
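For example, using the same sample data, the List-yielding version looks like this:
val listResult = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield List(hashtag, user, mentionedUser)
listResult.foreach(println)
// List(timetofly, Shera_Eyra, blxcknicotine)
// List(timetofly, Shera_Eyra, kimtheskimm)
// List(hellocake, Shera_Eyra, blxcknicotine)
// List(hellocake, Shera_Eyra, kimtheskimm)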

Related

How to add the elements into the Map where key is String and Value is List[String] in Scala

I have a text file which contains information about senders and messages. The format is sender, message.
I've loaded the file into an RDD, split each line by ",", and created key-value pairs where the key is the sender and the value is the message: RDD[(String, String)].
Then I did a groupByKey() to group the messages by sender and got an RDD[(String, Iterable[String])]:
Array[(String, Iterable[String])] = Array((Key,CompactBuffer(value1, value2, value3, ...)))
Now I want to iterate over the value part and store the values one by one in a List, so I've created an empty Map where the key is a String and the value is a List[String].
First I should check whether the Map is empty; if it is, I should add the first value to the List held inside the Map.
Below is what I've tried, but I could not get it to work; when I check the Map it shows None.
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    val inputFile = "D:\\MyData.txt"
    val data = sc.textFile(inputFile)
    val data2 = data.map(line => { val arr = line.split(","); (arr(0), arr(1)) })
    val grpData = data2.groupByKey()
    val myMap = scala.collection.mutable.Map.empty[String, List[String]]
    for (value <- grpData.values) {
      val list = ListBuffer[String]()
      if (myMap.isEmpty) {
        list += value
        myMap.put("G1", list.toList)
      }
    }
  }
}
In the for loop I use grpData.values because I need only the value part; I don't want the sender keys from my file. I only used them to group the messages by sender. In the Map[String, List[String]], my keys should be "Group1", "Group2", and so on, and the values are the messages I get one by one from the CompactBuffer.
First I should check whether the Map is empty. If it is, I should add the first message to the List inside the Map; the key should be "Group1" and the value should be that message stored in a List[String].
On the second iteration the Map will not be empty, so the condition will go to the else part. In the else part I should compare the messages using the Levenshtein distance algorithm: the first message was already added to the List, so now I should get the first message from the Map and compare it with the second message using the Levenshtein distance with a threshold of 70%. If the two messages meet the threshold, I should add the second message to the same List; if not, I should add the second message to a separate list under the key "G2", and so on.
You can use aggregateByKey to get a combined list of strings for each key:
val data = sc.textFile(inputFile)
val data2 = data.map(line => {val arr = line.split(","); (arr(0),arr(1))})
val result = data2.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
// or to prepend rather than append,
// val result = data2.aggregateByKey(List[String]())((x, y) => y :: x, _ ++ _)
If you want the result as a local Map, you can collect it:
val resultMap = result.collectAsMap()
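For instance, here is a quick sketch with made-up local data (the sample records are purely illustrative):
val sample = sc.parallelize(Seq(("alice", "hi"), ("alice", "hello"), ("bob", "hey")))
val combined = sample.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
combined.collect().foreach(println)
// (alice,List(hi, hello))   element order may vary with partitioning
// (bob,List(hey))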
I'm assuming that you are trying to cluster based on some distance function, in which case this might be what you are looking for:
def isWithinThreshold(s1: String, s2: String): Boolean = ???

// two groups are merged when there exist elements in both that are close to each other
def combine(acc: Vector[Vector[String]], s: Vector[String]) = {
  val (near, far) = acc.partition(_.exists(str => s.exists(isWithinThreshold(str, _))))
  near.fold(s)(_ ++ _) +: far
}

val preClusteringGroups = grpData.values.map(_.toVector) // already pre-grouped by the key from data2 (`arr(0)`)

val res = preClusteringGroups.aggregate(Vector.empty[Vector[String]])(combine, { case (v1, v2) =>
  (v1 ++ v2).foldLeft(Vector.empty[Vector[String]])(combine)
}).zipWithIndex.map { case (v, i) => s"G$i" -> v }.toMap // .mapValues(_.toList) if you actually need a List

preClusteringGroups is based on grpData, which is already pre-grouped by the original key and might not fulfill your distance requirements. If that's the case, redefine preClusteringGroups with:
val preClusteringGroups = data2.values.map(Vector(_))
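As an illustration of how isWithinThreshold could be filled in, here is a sketch that treats the "70% threshold" as meaning the Levenshtein distance is at most 30% of the longer string's length; that interpretation is an assumption, not something stated in the question:
// Classic dynamic-programming Levenshtein edit distance.
def levenshtein(s1: String, s2: String): Int = {
  val dist = Array.tabulate(s2.length + 1, s1.length + 1) { (j, i) =>
    if (j == 0) i else if (i == 0) j else 0
  }
  for (j <- 1 to s2.length; i <- 1 to s1.length)
    dist(j)(i) =
      if (s2(j - 1) == s1(i - 1)) dist(j - 1)(i - 1)
      else math.min(math.min(dist(j - 1)(i) + 1, dist(j)(i - 1) + 1), dist(j - 1)(i - 1) + 1)
  dist(s2.length)(s1.length)
}

// Assumed interpretation of the 70% threshold: the edit distance is at most
// 30% of the length of the longer message.
def isWithinThreshold(s1: String, s2: String): Boolean = {
  val maxLen = math.max(s1.length, s2.length)
  maxLen == 0 || levenshtein(s1, s2).toDouble / maxLen <= 0.3
}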

Filtering One RDD based on another RDD using regex

I have two RDD's of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements whose regex matches an element of RDD2.
The 1,2 in the example above represents UserID,MovID. Since it's present in the test set, I want the new RDD to have that element removed from RDD1.
I have asked a similar question before, but the answer there requires an unnecessary split of the RDD.
I am trying to do something of this sort but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 based on that broadcast variable. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    case (x) => rdd2array.value.filter(y => x.matches(y)).length == 0
  }
  training_set.collect().toList
}
Also, with Scala and Spark I recommend avoiding foreach where possible and using a more functional style with map, flatMap, and filter.
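For example, the same filter can be written with exists instead of counting matches; this is just a sketch reusing the broadcast setup from the snippet above:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  // keep only the rows that match none of the test patterns
  data_wo_header
    .filter(x => !rdd2array.value.exists(y => x.matches(y)))
    .collect()
    .toList
}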

Modifying List of String in scala

I have an input file that I would like to read as a Scala stream, then modify each record and output the file.
My input is as follows:
Name,id,phone-number
abc,1,234567
dcf,2,345334
I want to change the above input as follows:
Name,id,phone-number
testabc,test1,test234567
testdcf,test2,test345334
I am trying to read the file as a Scala stream as follows:
val inputList = Source.fromFile("/test.csv")("ISO-8859-1").getLines
After the above step I get an Iterator[String].
val newList = inputList.map { line =>
  line.split(',').map { s =>
    "test" + s
  }.mkString(",")
}.toList
but the new list is empty.
I am not sure if I can define an empty list and an empty array and then append the modified records to the list.
Any suggestions?
You might want to transform the iterator into a stream:
val l = Source.fromFile("test.csv")
  .getLines()
  .toStream
  .tail
  .map { row =>
    row.split(',')
      .map { col =>
        s"test$col"
      }.mkString(",")
  }
l.foreach(println)
testabc,test1,test234567
testdcf,test2,test345334
Here's a similar approach that returns a List[Array[String]]. You can use mkString, toString, or similar if you want a String returned.
scala> scala.io.Source.fromFile("data.txt")
.getLines.drop(1)
.map(l => l.split(",").map(x => "test" + x)).toList
res3: List[Array[String]] = List(
Array(testabc, test1, test234567),
Array(testdcf, test2, test345334)
)
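For example, adding mkString inside the map yields a List[String] instead (the res number below is just illustrative):
scala> scala.io.Source.fromFile("data.txt")
         .getLines.drop(1)
         .map(l => l.split(",").map(x => "test" + x).mkString(","))
         .toList
res4: List[String] = List(testabc,test1,test234567, testdcf,test2,test345334)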

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
  // execute myFunction()
  (new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray) {
  newRDD = sc.union(newRDD, (new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
  (new MyClass(word)).myFunction().collect()
}
You cannot create an RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input where one letter is removed, into another function modifiedFunction: String => Seq[String] such that it can be used from within an RDD. That way, it will also be executed in parallel on your cluster. Having the modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)

  val input = sc.parallelize(Seq("apple", "ananas", "banana"))
  // RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
  val result = input.flatMap(modifiedFunction)
}

def modifiedFunction(word: String): Seq[String] = {
  word.indices map {
    index => word.substring(0, index) + word.substring(index + 1)
  }
}
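For a quick sanity check on a single word (output shown for illustration):
modifiedFunction("apple")
// Vector(pple, aple, aple, appe, appl)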

How to group by a select number of fields in an RDD looking for duplicates based on those fields

I am new to Scala and Spark. I am working in the Spark Shell.
I need to group by and sort on the first three fields of this file, looking for duplicates. If I find duplicates within a group, I need to append a counter to the third field, starting at "1" and incrementing by "1" for each record in the duplicate group, resetting the counter back to "1" when a new group starts. When no duplicates are found, I just append the counter, which would be "1".
CSV File contains the following:
("00111","00111651","4444","PY","MA")
("00111","00111651","4444","XX","MA")
("00112","00112P11","5555","TA","MA")
val csv = sc.textFile("file.csv")
val recs = csv.map(line => line.split(","))
If I apply the logic properly on the above example, the resulting RDD of recs would look like this:
("00111","00111651","44441","PY","MA")
("00111","00111651","44442","XX","MA")
("00112","00112P11","55551","TA","MA")
How about grouping the data, changing it, and putting it back:
val csv = sc.parallelize(List(
  "00111,00111651,4444,PY,MA",
  "00111,00111651,4444,XX,MA",
  "00112,00112P11,5555,TA,MA"
))
val recs = csv.map(_.split(","))
val grouped = recs.groupBy(line => (line(0), line(1), line(2)))
val numbered = grouped.mapValues(dataList =>
  dataList.zipWithIndex.map { case (data, idx) => data match {
    case Array(fst, scd, thd, rest @ _*) => Array(fst, scd, thd + (idx + 1)) ++ rest
  }}
)
numbered.flatMap { case (key, values) => values }
Also grouping the data, changing it, putting it back.
val lists = List(("00111","00111651","4444","PY","MA"),
  ("00111","00111651","4444","XX","MA"),
  ("00112","00112P11","5555","TA","MA"))

val grouped = lists.groupBy { case (a, b, c, d, e) => (a, b, c) }

val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case ((a, b, c, d, e), idx) => (a, b, c + (idx + 1).toString, d, e) })

val unwrapped = indexed.flatMap(_._2)
// List((00112,00112P11,55551,TA,MA),
//      (00111,00111651,44442,XX,MA),
//      (00111,00111651,44441,PY,MA))
Version working on Arrays (of arbitrary length >= 3):
val lists = List(Array("00111","00111651","4444","PY","MA"),
  Array("00111","00111651","4444","XX","MA"),
  Array("00112","00112P11","5555","TA","MA"))

val grouped = lists.groupBy(_.take(3))

val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case (Array(a, b, c, rest @ _*), idx) => Array(a, b, c + (idx + 1).toString) ++ rest })

val unwrapped = indexed.flatMap(_._2)
// List(Array(00112, 00112P11, 55551, TA, MA),
//      Array(00111, 00111651, 44441, XX, MA),
//      Array(00111, 00111651, 44441, PY, MA))