Modifying a List of Strings in Scala

I have an input file that I would like to read as a Scala stream, modify each record, and then write the result out.
My input is as follows:
Name,id,phone-number
abc,1,234567
dcf,2,345334
I want to change the above input as follows:
Name,id,phone-number
testabc,test1,test234567
testdcf,test2,test345334
I am trying to read the file as a Scala stream as follows:
val inputList = Source.fromFile("/test.csv")("ISO-8859-1").getLines
After the above step I get an Iterator[String].
val newList = inputList.map { line =>
  line.split(',').map { s =>
    "test" + s
  }.mkString(",")
}.toList
But the new list is empty.
I am not sure if I can define an empty list and an empty array and then append each modified record to the list.
Any suggestions?

You might want to transform the iterator into a stream. getLines returns an Iterator[String], which can only be traversed once; if it has already been consumed (for example by inspecting it in the REPL), mapping it again produces an empty result. A Stream memoizes its elements, so it can be reused:
val l = Source.fromFile("test.csv")
  .getLines()
  .toStream
  .tail
  .map { row =>
    row.split(',')
      .map { col =>
        s"test$col"
      }.mkString(",")
  }
l.foreach(println)
testabc,test1,test234567
testdcf,test2,test345334
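If you also need the header line kept unchanged, as in the desired output, one option (a small sketch, assuming the file fits in memory) is to split the header off before transforming the remaining rows:
import scala.io.Source

val lines  = Source.fromFile("test.csv").getLines().toList
val header = lines.head
val body   = lines.tail.map(_.split(',').map("test" + _).mkString(","))

(header :: body).foreach(println)
// Name,id,phone-number
// testabc,test1,test234567
// testdcf,test2,test345334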

Here's a similar approach that returns a List[Array[String]]. You can use mkString, toString, or similar if you want a String returned.
scala> scala.io.Source.fromFile("data.txt")
.getLines.drop(1)
.map(l => l.split(",").map(x => "test" + x)).toList
res3: List[Array[String]] = List(
Array(testabc, test1, test234567),
Array(testdcf, test2, test345334)
)
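For example (a quick sketch along the same lines), mapping each array back to a comma-separated String gives a List[String] instead:
scala.io.Source.fromFile("data.txt")
  .getLines.drop(1)
  .map(_.split(",").map("test" + _).mkString(","))
  .toList
// List(testabc,test1,test234567, testdcf,test2,test345334)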

Related

How to add the elements into the Map where key is String and Value is List[String] in Scala

I have a text file which contains information about senders and messages. The format is sender,message.
I've loaded the file into an RDD, split each line on ",", and created a key-value pair RDD[(String, String)], where the key is the sender and the value is the message.
Then I did a groupByKey() to group the messages by sender, which gave me an RDD[(String, Iterable[String])]:
Array[(String, Iterable[String])] = Array((Key, CompactBuffer(value1, value2, value3, ...)))
Now I want to iterate over the value part and store the values one by one in a List, so I've created an empty Map where the key is a String and the value is a List[String].
First I should check whether the Map is empty; if it is, I should add the first value to the List that lives inside the Map.
Below is what I've tried, but I could not make it work; when I checked the Map it showed None.
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    val inputFile = "D:\\MyData.txt"
    val data = sc.textFile(inputFile)
    val data2 = data.map { line => val arr = line.split(","); (arr(0), arr(1)) }
    val grpData = data2.groupByKey()
    val myMap = scala.collection.mutable.Map.empty[String, List[String]]
    for (value <- grpData.values) {
      val list = ListBuffer[String]()
      if (myMap.isEmpty) {
        list += value
        myMap.put("G1", list.toList)
      }
    }
  }
}
In the for loop I used grpData.values because I only need the value part; I don't want the sender keys from my file. I just used them to group the messages by sender, but in the Map[String, List[String]] my keys should be Group1, Group2, and so on, and the values are the messages I get one by one from the CompactBuffer.
First I should check whether the Map is empty. If it is, I should add the first message to the List inside the Map; the key should be "Group1" and the value should be the message stored in a List[String].
For the second iteration the Map will not be empty, so control goes to the else part. There I should use the Levenshtein distance algorithm to compare the messages: the first message was already added to the List, so now I should take the first message from the Map and compare it with the second message using Levenshtein distance with a threshold of 70%. If the two messages meet the threshold, I should add the second message to the same List; if not, I should add the second message to a separate list under the key "G2", and so on.
You can use aggregateByKey to get a combined list of strings for each key:
val data = sc.textFile(inputFile)
val data2 = data.map(line => {val arr = line.split(","); (arr(0),arr(1))})
val result = data2.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
// or to prepend rather than append,
// val result = data2.aggregateByKey(List[String]())((x, y) => y :: x, _ ++ _)
If you want the result as a local Map on the driver, you can collect it:
val resultMap = result.collectAsMap()
I'm assuming that you are trying to cluster based on some distance function, in which case this might be what you are looking for:
def isWithinThreshold(s1: String, s2: String): Boolean = ???

// two sets are grouped when there exist elements in both sets that are close to each other
def combine(acc: Vector[Vector[String]], s: Vector[String]) = {
  val (near, far) = acc.partition(_.exists(str => s.exists(isWithinThreshold(str, _))))
  near.fold(s)(_ ++ _) +: far
}
val preClusteringGroups = grpData.values.map(_.toVector) // this is already pre-grouped by the key from data2 (arr(0))

val res = preClusteringGroups.aggregate(Vector.empty[Vector[String]])(combine, { case (v1, v2) =>
  (v1 ++ v2).foldLeft(Vector.empty[Vector[String]])(combine)
}).zipWithIndex.map { case (v, i) => s"G$i" -> v }.toMap // .mapValues(_.toList) if you actually need a list

preClusteringGroups is based on grpData, which is already pre-grouped by the original key and might not fulfill your distance requirements. If that's the case, redefine preClusteringGroups with:
val preClusteringGroups = data2.values.map(Vector(_))
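Since the question asks for Levenshtein distance with a 70% threshold, here is one possible sketch of the isWithinThreshold function left as ??? above (normalizing by the longer string's length is an assumption; adjust to your definition of similarity):
def levenshtein(a: String, b: String): Int = {
  // classic dynamic-programming edit distance
  val dist = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (j == 0) i else if (i == 0) j else 0)
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dist(i)(j) = math.min(math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1), dist(i - 1)(j - 1) + cost)
  }
  dist(a.length)(b.length)
}

def isWithinThreshold(s1: String, s2: String): Boolean = {
  val maxLen = math.max(s1.length, s2.length)
  // "70% similar" read as: the edit distance is at most 30% of the longer string's length
  maxLen == 0 || (1.0 - levenshtein(s1, s2).toDouble / maxLen) >= 0.7
}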

How to filter elements from a tuple using a list as a filter

I had a filter that parsed through a list, but then realised I needed to zip the original list to number each line before filtering, and now I'm not sure how to apply the same filter to the _._2 element of each tuple.
val list = List("def", "var", "val")
val source = Source.fromFile("..\\scala.file").getLines.toList

val filtered = source filter (line => list.exists(word => list.contains(word)))
// before

val filtered = (1 to source.length) zip source
filter (line => list.exists(word => list.contains(word)))
// after
I cannot get the function working with the tuple. It is supposed to filter out each tuple whose text doesn't contain any instances of the elements from the list.
val list = List("def", "var", "val")
val matcher = list.mkString(".*(", "|", ").*")

io.Source
  .fromFile("..\\scala.file")
  .getLines
  .zipWithIndex
  .filter(_._1 matches matcher)
  .map { case (txt, idx) => (idx + 1, txt) } // optional
  .toList
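If you'd rather keep your original contains-based check instead of a regex, the same zipWithIndex approach works; the key point is destructuring the tuple so the filter only looks at the text part (a sketch):
io.Source
  .fromFile("..\\scala.file")
  .getLines
  .zipWithIndex
  .filter { case (line, _) => list.exists(word => line.contains(word)) }
  .map { case (txt, idx) => (idx + 1, txt) }
  .toList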

Scala flattening embedded list of lists

I have created a Twitter data stream that displays the hashtags, author, and mentioned users in the format below.
(List(timetofly, hellocake),Shera_Eyra,List(blxcknicotine, kimtheskimm))
I can't do analysis on this format because of the embedded lists. How can I create another data stream that displays the data in this format?
timetofly, Shera_Eyra, blxcknicotine
timetofly, Shera_Eyra, kimtheskimm
hellocake, Shera_Eyra, blxcknicotine
hellocake, Shera_Eyra, kimtheskimm
Here is my code to produce the data:
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(sampleInterval))
val stream = TwitterUtils.createStream(ssc, None)
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}
In your code snippet, data is a DStream[(Array[String], String, List[String])]. To get a DStream[String] in your desired format, you can use flatMap and map:
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}

val data2 = data.flatMap(a => a._1.flatMap(b => a._3.map(c => (b, a._2, c))))
  .map { case (hash, user, mention) => s"$hash, $user, $mention" }
The flatMap results in a DStream[(String, String, String)] in which each tuple consists of a hash tag entity, user, and mention entity. The subsequent call to map with the pattern matching creates a DStream[String] in which each String consists of the elements in each tuple, separated by a comma and space.
I would use a for comprehension for this:
val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))

val result = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield (hashtag, user, mentionedUser)
result.foreach(println)
Output:
(timetofly,Shera_Eyra,blxcknicotine)
(timetofly,Shera_Eyra,kimtheskimm)
(hellocake,Shera_Eyra,blxcknicotine)
(hellocake,Shera_Eyra,kimtheskimm)
If you would prefer a seq of lists of strings, rather than a seq of tuples of strings, then change the yield to give you a list instead: yield List(hashtag, user, mentionedUser)
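The same comprehension also works inside flatMap on the DStream from the question (a sketch, not run against live Twitter data; data here is the DStream[(Array[String], String, List[String])] built in the question, not the single tuple above):
val flattened = data.flatMap { case (hashtags, user, mentions) =>
  for {
    hashtag <- hashtags.toSeq // hashtags is an Array; toSeq keeps the result a plain Seq[String]
    mentionedUser <- mentions
  } yield s"$hashtag, $user, $mentionedUser"
}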

How to use reduceByKey on key of value in tuple set up as (String, (String, Int))?

I am attempting to loop through an RDD built from a text file, tally each unique word in the file, and then accumulate all of the words that follow each unique word, along with their counts. So far, this is what I have:
// connecting to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val spark = new SparkContext(conf) // creates a new SparkContext object

// loads the specified file into an RDD
val lines = spark.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// splits each line into (word, next word, 1) triples
val words = lines.flatMap { line =>
  val wordList = line.split(" ")
  for { i <- 0 until wordList.length - 1 }
    yield (wordList(i), wordList(i + 1), 1)
}
If I haven't been clear thus far, what I am trying to do is to accumulate the set of words that follow each word in the file, along with the number of times said words follow.
If I understand you correctly, we have something like this:
val lines: Seq[String] = ...
val words: Seq[(String, String, Int)] = ...
And we want something like this:
val frequencies: Map[String, Seq[(String, Int)]] = {
  words
    .groupBy(_._1) // word -> [(w, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (w, n, cc) => (n, cc) }
        .groupBy(_._1)              // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum) // next -> sum
        .toSeq
    }
}
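Since the question title asks about reduceByKey, roughly the same aggregation can be done directly on the RDD (a sketch; here words is the RDD[(String, String, Int)] from the question's code, not the Seq above):
val followerCounts = words
  .map { case (w, next, cc) => ((w, next), cc) }   // key by the (word, next-word) pair
  .reduceByKey(_ + _)                              // total count for each pair
  .map { case ((w, next), cc) => (w, (next, cc)) } // re-key by the first word
  .groupByKey()                                    // word -> all (next word, count) pairs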

How to group by a select number of fields in an RDD looking for duplicates based on those fields

I am new to Scala and Spark. I am working in the Spark Shell.
I need to group by and sort on the first three fields of this file, looking for duplicates. If I find duplicates within a group, I need to append a counter to the third field, starting at "1" and incrementing by "1" for each record in the duplicate group, and resetting the counter back to "1" when a new group starts. When no duplicates are found, I just append the counter, which would be "1".
CSV File contains the following:
("00111","00111651","4444","PY","MA")
("00111","00111651","4444","XX","MA")
("00112","00112P11","5555","TA","MA")
val csv = sc.textFile("file.csv")
val recs = csv.map(line => line.split(","))
If I apply the logic properly on the above example, the resulting RDD of recs would look like this:
("00111","00111651","44441","PY","MA")
("00111","00111651","44442","XX","MA")
("00112","00112P11","55551","TA","MA")
How about grouping the data, changing it, and putting it back:
val csv = sc.parallelize(List(
  "00111,00111651,4444,PY,MA",
  "00111,00111651,4444,XX,MA",
  "00112,00112P11,5555,TA,MA"
))
val recs = csv.map(_.split(","))
val grouped = recs.groupBy(line => (line(0), line(1), line(2)))
val numbered = grouped.mapValues(dataList =>
  dataList.zipWithIndex.map { case (data, idx) => data match {
    case Array(fst, scd, thd, rest @ _*) => Array(fst, scd, thd + (idx + 1)) ++ rest
  }}
)
numbered.flatMap { case (key, values) => values }
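To sanity-check the result locally you can collect and print it (a small sketch; the order of rows coming back from groupBy is not guaranteed):
numbered.flatMap { case (_, values) => values }
  .collect()
  .foreach(arr => println(arr.mkString(",")))
// 00111,00111651,44441,PY,MA
// 00111,00111651,44442,XX,MA
// 00112,00112P11,55551,TA,MA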
Also grouping the data, changing it, putting it back.
val lists = List(("00111","00111651","4444","PY","MA"),
                 ("00111","00111651","4444","XX","MA"),
                 ("00112","00112P11","5555","TA","MA"))
val grouped = lists.groupBy { case (a, b, c, d, e) => (a, b, c) }
val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case ((a, b, c, d, e), idx) => (a, b, c + (idx + 1).toString, d, e) })
val unwrapped = indexed.flatMap(_._2)
// List((00112,00112P11,55551,TA,MA),
//      (00111,00111651,44442,XX,MA),
//      (00111,00111651,44441,PY,MA))
A version working on Arrays (of arbitrary length >= 3):
val lists = List(Array("00111","00111651","4444","PY","MA"),
                 Array("00111","00111651","4444","XX","MA"),
                 Array("00112","00112P11","5555","TA","MA"))
// note: Arrays compare by reference, so group on a List copy of the first three fields
val grouped = lists.groupBy(_.take(3).toList)
val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case (Array(a, b, c, rest @ _*), idx) => Array(a, b, c + (idx + 1).toString) ++ rest })
val unwrapped = indexed.flatMap(_._2)
// List(Array(00112, 00112P11, 55551, TA, MA),
//      Array(00111, 00111651, 44442, XX, MA),
//      Array(00111, 00111651, 44441, PY, MA))