Read file contents and create Map of filename and contents - scala

I want to read the files in a given directory, read the contents of each file, and create a Map with the filename as key and its contents as value.
I have not had any success so far, but this is what I have tried:
import java.io.File
import scala.io.Source

def getFileLists(): List[File] = {
  val directory = "./input"
  // print(new File(directory).listFiles().toList)
  new File(directory).listFiles().toList
}

val contents = getFileLists().map(file => Source.fromFile(file).getLines())
print(contents)

You can change the following line
val contents = getFileLists().map(file => Source.fromFile(file).getLines())
to
val contents = getFileLists().map(file => (file.getName, Source.fromFile(file).getLines()))
which would give you
contents: List[(String, Iterator[String])]
Furthermore, you can add a .toMap method call:
val contents = getFileLists().map(file => (file.getName, Source.fromFile(file).getLines())).toMap
which would give you
contents: scala.collection.immutable.Map[String,Iterator[String]]

What you are doing is transforming the list of files into a list of their contents. You want a Map[File, List[String]] instead. The easiest way to do that is to map to tuples of file and content, and then call toMap on the result:
getFileLists().map(file => file -> Source.fromFile(file).getLines().toList).toMap
toMap works when the input sequence has Tuple2 as its element type; file -> contents is such a tuple, of type (File, List[String]).
Or in two steps:
val xs: Seq[(File, List[String])] =
  getFileLists().map(file => file -> Source.fromFile(file).getLines().toList)
val m: Map[File, List[String]] = xs.toMap

You can try this:
getFileLists().map(file => (file.getName, Source.fromFile(file).getLines().toList)).toMap
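Putting the pieces together, here is a minimal self-contained sketch (assuming the ./input directory from the question); unlike the snippets above, it also closes each Source after reading:

import java.io.File
import scala.io.Source

def getFileLists(): List[File] =
  new File("./input").listFiles().toList

val contents: Map[String, List[String]] =
  getFileLists().map { file =>
    val source = Source.fromFile(file)
    try file.getName -> source.getLines().toList   // (filename, lines) pair
    finally source.close()                          // avoid leaking file handles
  }.toMap

Materialising the lines with toList matters here: an Iterator[String] can only be traversed once and would be empty after the Source is closed.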

Related

I want to save a Map[String, String] to disk and later read it back as the same type. Somehow I cannot find a collectAsMap method on my sparkContext.

I am working with Spark in Scala, and there is a requirement to save a Map[String, String] to disk so that a different Spark application can read it. The data looks like:
(x,1),(y,2)...
To Save:
sc.parallelize(itemMap.toSeq).coalesce(1).saveAsTextFile(fileName)
I am doing a coalesce as the data is only 450 rows.
But to read it back, I am not able to convert it back to Map[String, String]
val myMap = sc.textFile(fileName).zipWithUniqueId().collect.toMap
the data comes back as
((x,1),0),((y,2),1)...
What is the possible solution?
Thanks.
Loading a text file results in RDD[String], so you will have to deserialize your string representations of the tuples.
You can either change your save operation to add a delimiter between the two tuple values, or parse the string representation "(v1,v2)" yourself.
val d = spark.sparkContext.textFile(fileName)
val myMap = d.map { s =>
  // strip the surrounding parentheses of "(v1,v2)", then split on the comma
  val parsedVals = s.substring(1, s.length - 1).split(",")
  (parsedVals(0), parsedVals(1))
}.collect.toMap
Alternatively, you can change your save operation to write a delimited string (for example, comma-separated) and parse that structure when reading:
sc.parallelize(itemMap.toSeq.map(kv => kv._1 + "," + kv._2)).coalesce(1).saveAsTextFile(fileName)
val myMap = spark.sparkContext.textFile("trash3.txt")
  .map(_.split(","))
  .map(d => (d(0), d(1)))
  .collect.toMap
Method "collectAsMap" exists in "PairRDDFunctions" class, means, applicable only for RDD with two values RDD[(K, V)].
If this function call is required, can be organized with code below. Dataframe is used for store in csv format ant avoid hand-made parsing
import spark.implicits._   // needed for .toDF on an RDD

val originalMap = Map("x" -> 1, "y" -> 2)

// write
sparkContext.parallelize(originalMap.toSeq).coalesce(1).toDF("k", "v").write.csv(path)

// read (CSV columns come back as strings)
val restoredDF = spark.read.csv(path)
val restoredMap = restoredDF.rdd.map(r => (r.getString(0), r.getString(1))).collectAsMap()

println("restored map: " + restoredMap)
Output:
restored map: Map(y -> 2, x -> 1)
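As noted above, collectAsMap lives in PairRDDFunctions, so it also becomes available directly on a text-file RDD once each line has been parsed into a pair. A hedged sketch, assuming the comma-delimited save format from the second approach above:

val restored: scala.collection.Map[String, String] =
  sc.textFile(fileName)
    .map { s =>
      val Array(k, v) = s.split(",", 2)   // limit = 2 in case values themselves contain commas
      (k, v)
    }
    .collectAsMap()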

scala/spark: Read in RDD[(String,Int)]

I have the following text file (previously output from an RDD[(String,Int)] )
(ARCHITECTURE,50)
(BUSINESS,17)
(CHEMICAL ENGINEERING,6)
(CHILD DEVELOPMENT,43)
(CIVIL ENGINEERING,26)
etc
I can read in as RDD[String] like this:
spark.sparkContext.textFile(path + s"$path\\${fileName}_labelNames")
But how can I read in as RDD[String,Int]? Is it possible?
EDITED:
Fixed error in RDD type above
There is no RDD[String, Int]; it is not a legal type.
Maybe what you mean is RDD[(String, Int)].
Here is how you can transform it from the original data.
val data = original.map { record =>
  val a = record.stripPrefix("(").stripSuffix(")").split(",")
  val k = a(0)
  val v = a(1).toInt
  (k, v)
}
Here the original variable is of type RDD[String], as read from the source.
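For completeness, a sketch of how original might be obtained and the result inspected; labelNamesPath is a hypothetical variable pointing at the file shown above:

import org.apache.spark.rdd.RDD

val original: RDD[String] = spark.sparkContext.textFile(labelNamesPath)

val data: RDD[(String, Int)] = original.map { record =>
  val a = record.stripPrefix("(").stripSuffix(")").split(",")
  (a(0), a(1).toInt)
}

data.take(3).foreach(println)   // e.g. (ARCHITECTURE,50)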

SCALA : Read the text file and create tuple of it

How can I create tuples from the existing RDD below?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you are intending to have an Array of Tuple4, then you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each field of the tuples:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have a variable-sized input file such as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can use pattern matching with match/case as
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As @eliasah pointed out, the above procedure is bad practice, since mixing tuple arities yields the unhelpful element type Product with Serializable. As he suggested, we should know the maximum number of elements in the input data and use the following logic, assigning default values for missing elements (this requires import scala.util.Try):
val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect
And as @philantrovert pointed out, if we are not using the REPL we can verify the output as follows:
arrayTuples.foreach(println)
which results in
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
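An alternative sketch (my own suggestion, not from the answers above) that keeps all four fields as String by padding the split array to a fixed width before matching, so the result stays a uniformly typed Array[(String, String, String, String)]:

val arrayTuples = rdd
  .map(_.split(","))
  .map(_.padTo(4, "Empty"))                          // pad missing columns with a default value
  .map { case Array(a, b, c, d) => (a, b, c, d) }
  .collect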

Scala flattening embedded list of lists

I have created a Twitter data stream that displays the hashtags, author, and mentioned users in the format below.
(List(timetofly, hellocake),Shera_Eyra,List(blxcknicotine, kimtheskimm))
I can't do analysis on this format because of the embedded lists. How can I create another data stream that displays the data in this format?
timetofly, Shera_Eyra, blxcknicotine
timetofly, Shera_Eyra, kimtheskimm
hellocake, Shera_Eyra, blxcknicotine
hellocake, Shera_Eyra, kimtheskimm
Here is my code to produce the data:
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(sampleInterval))
val stream = TwitterUtils.createStream(ssc, None)
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}
In your code snippet, data is a DStream[(Array[String], String, List[String])]. To get a DStream[String] in your desired format, you can use flatMap and map:
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}
val data2 = data.flatMap(a => a._1.flatMap(b => a._3.map(c => (b, a._2, c))))
  .map { case (hash, user, mention) => s"$hash, $user, $mention" }
The flatMap results in a DStream[(String, String, String)] in which each tuple consists of a hashtag entity, the user, and a mention entity. The subsequent call to map with pattern matching creates a DStream[String] in which each String consists of the elements of each tuple, separated by a comma and a space.
I would use a for comprehension for this:
val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))
val result = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield (hashtag, user, mentionedUser)
result.foreach(println)
Output:
(timetofly,Shera_Eyra,blxcknicotine)
(timetofly,Shera_Eyra,kimtheskimm)
(hellocake,Shera_Eyra,blxcknicotine)
(hellocake,Shera_Eyra,kimtheskimm)
If you would prefer a seq of lists of strings, rather than a seq of tuples of strings, then change the yield to give you a list instead: yield List(hashtag, user, mentionedUser)
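If you want to apply the same for comprehension inside the stream itself, a hedged sketch (assuming data is the DStream built in the question) is:

// Flatten the cross product of hashtags and mentions per tweet into a DStream[String].
val flattened = data.flatMap { case (hashtags, user, mentions) =>
  for {
    hashtag <- hashtags
    mention <- mentions
  } yield s"$hashtag, $user, $mention"
}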

Modifying List of String in scala

I have an input file that I would like to read as a Scala stream, then modify each record, and then output the result.
My input is as follows -
Name,id,phone-number
abc,1,234567
dcf,2,345334
I want to change the above input as follows -
Name,id,phone-number
testabc,test1,test234567
testdcf,test2,test345334
I am trying to read the file as a Scala stream as follows:
val inputList = Source.fromFile("/test.csv")("ISO-8859-1").getLines
After the above step I get an Iterator[String].
val newList = inputList.map { line =>
  line.split(',').map { s =>
    "test" + s
  }.mkString(",")
}.toList
But the new list is empty.
I am not sure if I can define an empty list and an empty array and then append the modified records to the list.
Any suggestions?
You might want to transform the iterator into a stream:
val l = Source.fromFile("test.csv")
  .getLines()
  .toStream
  .tail
  .map { row =>
    row.split(',')
      .map { col =>
        s"test$col"
      }.mkString(",")
  }
l.foreach(println)
testabc,test1,test234567
testdcf,test2,test345334
Here's a similar approach that returns a List[Array[String]]. You can use mkString, toString, or similar if you want a String returned.
scala> scala.io.Source.fromFile("data.txt")
         .getLines.drop(1)
         .map(l => l.split(",").map(x => "test" + x)).toList
res3: List[Array[String]] = List(
  Array(testabc, test1, test234567),
  Array(testdcf, test2, test345334)
)
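Both answers above drop the header line, while the desired output in the question keeps it unchanged. A minimal sketch (assuming the same file and encoding as in the question) that preserves the header and also closes the underlying source:

import scala.io.Source

val source = Source.fromFile("/test.csv")("ISO-8859-1")
val lines = source.getLines().toList   // materialise before closing the file
source.close()

val output = lines match {
  case header :: rows =>
    header :: rows.map(_.split(',').map("test" + _).mkString(","))
  case Nil => Nil
}

output.foreach(println)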