I would like to use getLines on a Source and convert each line to a tuple (Int, Int). I've done it using foreach.
val values = collection.mutable.ListBuffer[(Int, Int)]()
Source.fromFile(invitationFile.ref.file).getLines().filter(line => !line.isEmpty).foreach(line => {
val value = line.split("\\s")
values += ((value(0).toInt, value(1).toInt))
})
What's the best way to write the same code without using foreach?
Use map; it builds a new list for you:
Source.fromFile(invitationFile.ref.file)
.getLines()
.filter(line => !line.isEmpty)
.map(line => {
val value = line.split("\\s")
(value(0).toInt, value(1).toInt)
})
.toList
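Note that getLines() returns an Iterator, so the filter and map stages above are lazy; nothing is actually read from the file until toList forces the traversal.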
foreach should be a final operation, not a transformation.
In your case, you want to use the function map:
val values = Source.fromFile(invitationFile.ref.file).getLines()
.filter(line => !line.isEmpty)
.map(line => line.split("\\s"))
.map(line => (line(0).toInt, line(1).toInt))
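foreach then still has its place at the very end of the pipeline, purely for side effects, for example to print the resulting pairs (a minimal sketch):
values.foreach { case (a, b) => println(s"$a $b") }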
Using a for comprehension:
val values = for {
  line <- Source.fromFile(invitationFile.ref.file).getLines()
  if !line.isEmpty
  splits = line.split("\\s")
} yield (splits(0).toInt, splits(1).toInt)
Under the hood this desugars into essentially the same withFilter/map chain as the previous answers.
How to convert one var to two var List?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want my result should be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))
First of all, input is not valid JSON:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to turn it into an RDD of valid JSON strings (since you are going to use Apache Spark):
val validJsonRdd = sc.parallelize(Seq(input))
  .flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"")
    .replace("[", "{\"").replace("]", "\"}")
    .replace("}{", "}&{")  // mark the boundary between records, then split on the marker
    .split("&"))
// each element is now valid JSON, e.g. {"level":"1","var1":"name","var2":"id"}
Once you have a valid JSON RDD, you can easily convert it to a dataframe and then apply your logic:
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
.groupBy("level")
.agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
.select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get the desired output in the dataframe:
+-----------------------------------------------+-------------------------------------------+
|var1                                           |var2                                       |
+-----------------------------------------------+-------------------------------------------+
|[WrappedArray(name, name1), WrappedArray(add1)]|[WrappedArray(id, id1), WrappedArray(city)]|
+-----------------------------------------------+-------------------------------------------+
And you can convert the arrays to lists if required. To get the values as in the question, you can do the following:
val rows = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rows(0)._1.map(_.toList).toList
//first: List[List[String]] = List(List(name, name1), List(add1))
val second = rows(0)._2.map(_.toList).toList
//second: List[List[String]] = List(List(id, id1), List(city))
I hope the answer is helpful
reduceByKey is the important function for achieving your required output (see a step-by-step reduceByKey explanation for more background).
You can do the following:
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
(values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))
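Note that reduceByKey makes no guarantee about key order (the lists above came back with level 2 first). If the original order matters, a sortByKey before collecting should restore it, a minimal sketch:
val first = groupedrdd.sortByKey().map(_._2._1).collect().toList
// expected: List(List(name1, name2), List(add1))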
I have an input file that I would like to read as a Scala stream, then modify each record, and then output the file.
My input is as follows:
Name,id,phone-number
abc,1,234567
dcf,2,345334
I want to change the above input as follows:
Name,id,phone-number
testabc,test1,test234567
testdcf,test2,test345334
I am trying to read the file as a Scala stream as follows:
val inputList = Source.fromFile("/test.csv")("ISO-8859-1").getLines
After the above step I get an Iterator[String].
val newList = inputList.map{line =>
line.split(',').map{s =>
"test" + s
}.mkString (",")
}.toList
But the new list is empty. I am not sure if I should define an empty list and an empty array and then append each modified record to the list.
Any suggestions?
You might want to transform the iterator into a stream:
val l = Source.fromFile("test.csv")
.getLines()
.toStream
.tail
.map { row =>
row.split(',')
.map { col =>
s"test$col"
}.mkString (",")
}
l.foreach(println)
testabc,test1,test234567
testdcf,test2,test345334
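As a side note, Stream is deprecated since Scala 2.13 in favor of LazyList; assuming Scala 2.13+, the same approach would be:
val l = Source.fromFile("test.csv")
  .getLines()
  .to(LazyList)  // LazyList is the 2.13 replacement for Stream
  .tail
  .map(row => row.split(',').map(col => s"test$col").mkString(","))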
Here's a similar approach that returns a List[Array[String]]. You can use mkString, toString, or similar if you want a String returned; see the sketch after the REPL output.
scala> scala.io.Source.fromFile("data.txt")
.getLines.drop(1)
.map(l => l.split(",").map(x => "test" + x)).toList
res3: List[Array[String]] = List(
Array(testabc, test1, test234567),
Array(testdcf, test2, test345334)
)
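For example, to get Strings back instead of arrays, a small variation of the same pipeline (a sketch) moves mkString inside the map:
scala.io.Source.fromFile("data.txt")
  .getLines.drop(1)
  .map(l => l.split(",").map(x => "test" + x).mkString(","))
  .toList
// List(testabc,test1,test234567, testdcf,test2,test345334)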
I have the below raw string and I want to convert it to a List, a List of tuples, or a List of maps; basically I need to iterate through it with foreach.
val rawStr = "(foo,bar), (foo1,bar1), (foo3,bar3)"
How would I go about it?
Split the string on any of the characters (, `,` and ), filter out the blank tokens, and then group the rest in pairs:
rawStr.split("""[(|,|)]""")
  .filterNot(s => s.trim.isEmpty)
  .grouped(2)
  .map(pair => (pair(0), pair(1)))
  .toList
Scala REPL
scala> val rawStr = "(foo,bar), (foo1,bar1), (foo3,bar3)"
rawStr: String = "(foo,bar), (foo1,bar1), (foo3,bar3)"
scala> rawStr.split(s"""[(|,|)]""").filterNot(s => s.isEmpty || s.trim.isEmpty).grouped(2).toList.map(pair => (pair(0), pair(1))).toList
res13: List[(String, String)] = List(("foo", "bar"), ("foo1", "bar1"), ("foo3", "bar3"))
This one can also deal with invalid input:
"\\(([^,]+{1})\\s*,\\s*([^,]+{1})\\)".r
.findAllMatchIn(rawStr)
.map(m => m.group(1) -> m.group(2)).toMap
You can give it
val rawStr = "(foo,bar,baz), (foo1,bar1), (foo3,bar3)"
or
val rawStr = "(foo), (foo1,bar1), (foo3,bar3)"
and it won't crash
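For instance, with the first malformed input the three-element group (foo,bar,baz) simply never matches, because [^,]+ cannot span a comma, so it is skipped and the result should be Map(foo1 -> bar1, foo3 -> bar3); the bare (foo) group is skipped the same way.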
I am writing some basic programs in Scala.
import scala.io.Source
/* records.txt
USA,Surender
USA,Raja
CHINA,Yen
CHINA,Chen
INDIA,Adam
INDIA,Edward
*/
object ReadingFile {
  def main(args: Array[String]) {
    val fileLoc = "D:\\inputfiles\\records.txt"
    val lines = Source.fromFile(fileLoc).getLines().toList
    val linesSplit = lines.map(x => x.split(","))
    val linesMap = linesSplit.map(x => (x(0), x(1)))
  }
}
I don't know how to apply an aggregation to linesMap. What do I need to add to my code to get the below output?
USA,2
CHINA,2
INDIA,2
Source.fromFile(fileLoc)
  .getLines()
  .toList  // groupBy is not defined on Iterator, so materialize the lines first
  .map(_.split(","))
  .groupBy(_(0))
  .map(i => (i._1, i._2.size))
You can also use mapValues:
Source.fromFile(fileLoc)
  .getLines()
  .toList
  .map(_.split(","))
  .groupBy(_(0))
  .mapValues(_.size)
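Either way, if you bind the result to a val, say counts (a name used here just for illustration), printing it in the requested format is a final foreach:
counts.foreach { case (country, n) => println(s"$country,$n") }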
How can I traverse the following RDD using Spark Scala? I want to print every value present in the Seq with its associated key.
res1: org.apache.spark.rdd.RDD[(java.lang.String, Seq[java.lang.String])] = MapPartitionsRDD[6] at groupByKey at <console>:14
I tried the following code for it:
val ss=mapfile.map(x=>{
val key=x._1
val value=x._2.sorted
var i=0
while (i < value.length) {
(key,value(i))
i += 1
}
}
)
ss.top(20).foreach(println)
I tried converting your code as follows:
val ss = mapfile.flatMap {
case (key, value) => value.sorted.map((key, _))
}
ss.top(20).foreach(println)
Is it what you want?
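For intuition, here is the same flattening applied to a plain Scala collection (a small sketch, outside Spark):
Seq(("k", Seq("b", "a"))).flatMap { case (key, value) => value.sorted.map((key, _)) }
// => List((k,a), (k,b))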
I tried this and it works for the return type as mentioned:
val ss = mapfile.flatMap { case (key, value) => value.sorted.map((key, _)) }
  .groupByKey()
  .map(x => (x._1, x._2.toSeq))
ss.top(20).foreach(println)
Note: ss is of type org.apache.spark.rdd.RDD[(java.lang.String, Seq[java.lang.String])]