How the map() method works with multiple text files in Scala Spark (IntelliJ)

I want to perform some operations on text that I read from multiple text files, but the map() method treats every file separately. For example I do:
val text = sc.wholeTextFiles("src/folder").map(a => a._2)
.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
and the result is:
(hi , 1) //from the first file
(hi , 1) // from the second file
I want the result to be: (hi,2)
I'm thinking of a for loop, but that does not seem flexible because I don't know the number of text files.

I tried your code in spark-shell and these are my findings:
I have 2 files:
csv1 -> hi
csv2 -> hi hi hi
The result was correct after I removed the line endings:
val text = sc.wholeTextFiles("testSO/").map(a => a._2).flatMap(line => line.split(" ")).map(line => line.replace("\n","")).map(word => (word,1)).reduceByKey(_+_).foreach(println)
Output:
(hi,4)
This was the result without removing the line endings:
scala> val text = sc.wholeTextFiles("testSO/").map(a => a._2).flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_).foreach(println)
Output:
(hi,2)
(hi
,2)
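Alternatively (a sketch, not from the original answer), splitting on a whitespace regex handles spaces and newlines in one step, so the separate replace("\n","") is not needed:

// Splitting on the regex \s+ treats spaces and newlines alike.
val counts = sc.wholeTextFiles("testSO/")
  .map(_._2)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.foreach(println) // should print (hi,4) for the two sample files above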

Related

Filename usage with Spark RDD

I am getting a filename in my RDD when I use:
val file=sc.wholeTextFiles("file:///c://samples//finalout.txt",0)
However, right after I flatten it, I lose the first element of the tuple. How do I make sure that the filename is carried forward to my map function?
My code:
val res = file.flatMap{ e => e._2.split("\n") }.map{ line => line.split(",") }.map(elem => {
  // ...I want to use the filename here
})
A slightly different approach. It should all be possible with just an RDD, but I use a DataFrame-to-RDD approach here.
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._ // needed for .as[(String, String)] outside a shell/notebook

val inputPath: String = "/FileStore/tables/sample_text.txt" // also works for a directory, of course
val rdd = spark.read.text(inputPath)
  .select(input_file_name, $"value")   // attach the source file name to every line
  .as[(String, String)]
  .rdd
val rdd2 = rdd.map(line => (line._1, line._2.split("\n")))
//rdd2.collect
val rdd3 = rdd2.map(line => (line._1, line._2.flatMap(x => x.split(" ")).map(x => (x, 1))))
rdd3.collect
returns in this case:
res61: Array[(String, Array[(String, Int)])] = Array((dbfs:/FileStore/tables/sample_text.txt,Array((Hi,1), (how,1), (are,1), (you,1), (today,1), (ZZZ,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((I,1), (am,1), (fine,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((I,1), (am,1), (also,1), (tired,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((You,1), (look,1), (good,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((Can,1), (I,1), (stay,1), (with,1), (you?,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((Bob,1), (will,1), (pop,1), (in,1), (later,1), (ZZZ,1))), (dbfs:/FileStore/tables/sample_text.txt,Array((Oh,1), (really?,1), (Nice,,1), (cool,1))))
You can modify accordingly, stripping out what you don't need; line._1 is your file name. For plain RDDs, use a case-statement approach.
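If the end goal is a per-file word count, one possible continuation (a sketch, not part of the original answer) is to flatten rdd3 and reduce by the (file, word) pair:

// Hypothetical continuation: key by (fileName, word) and sum the per-line counts.
val perFileCounts = rdd3
  .flatMap { case (fileName, pairs) => pairs.map { case (word, n) => ((fileName, word), n) } }
  .reduceByKey(_ + _)

perFileCounts.collect.foreach(println) // e.g. ((dbfs:/FileStore/tables/sample_text.txt,I),3)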
UPD
Using your approach, and correcting it, since you cannot do exactly what you want, here is some illustrative code, as I am not sure what you are trying to do:
val file = sc.wholeTextFiles("/FileStore/tables/sample_text.txt",0)
val res = file.map(line => (line._1, line._2.split("\n").flatMap(x => x.split(" "))))
  .map(elem => { (elem._1, elem._2, "This is also possible") })
res.collect
returns:
res: org.apache.spark.rdd.RDD[(String, Array[String], String)] =
MapPartitionsRDD[69] at map at command-1038477736158514:15
res9: Array[(String, Array[String], String)] =
Array((dbfs:/FileStore/tables/sample_text.txt,Array(Hi, how, are, you, today, "ZZZ
", I, am, "fine
", I, am, also, "tired
", You, look, "good
", Can, I, stay, with, "you?
", Bob, will, pop, in, later, "ZZZ
", Oh, really?, Nice,, "cool
"),This is also possible))
Your approach is not possible with flatMap; you need to follow the approach above.
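For completeness, the case-statement style mentioned earlier could look like this (a sketch equivalent to the snippet above):

// Pattern matching on the (fileName, content) pair keeps the file name in scope
// while the content is split into words.
val res = file.map { case (fileName, content) =>
  (fileName, content.split("\n").flatMap(_.split(" ")), "This is also possible")
}
res.collect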

Scala: read a text file and create tuples from it

How can I create a tuple from the existing RDD below?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you intend to have an Array of Tuple4, then you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each field of the tuples:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have a variable-sized input file such as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can use match-case pattern matching as follows:
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As @eliasah pointed out, the procedure above is bad practice because the result type degrades to Product with Serializable. As he suggested, we should know the maximum number of elements in the input data and use the following logic, where we assign default values for missing elements:
val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect
And as @philantrovert pointed out, if we are not using the REPL we can verify the output in the following way:
arrayTuples.foreach(println)
which results in:
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
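An alternative to the Try-with-default approach (a sketch, not from the original answers) is to pad each split array to the known maximum width before building the tuple:

// padTo appends "Empty" until the array has 4 elements, so every row can be
// destructured the same way.
val arrayTuples = rdd.map(_.split(",").padTo(4, "Empty"))
  .map(a => (a(0), a(1), a(2), a(3)))
  .collect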

How to perform a join on two files within the same RDD loaded using wholeTextFiles()

I am fairly new to spark-scala so please don't mind if this is a beginner question.
I have a directory test which contains two files, input1.txt and input2.txt.
Now, let's say I create an RDD called inputRDD using
val inputRDD = sc.wholeTextFiles("/home/hduser/test")
which loads both files into the pair RDD (inputRDD).
Based on my understanding, inputRDD contains the file name as the key and the contents as the value,
something like this
(input1.txt,contents of input1.txt)
(input2.txt,contents of input2.txt)
Now, let's say I have to perform a join on both files (which are in the same RDD) based on the first column.
contents of input1.txt
----------------------
1 a
1 b
2 c
2 d
contents of input2.txt
----------------------
1 e
2 f
3 g
How can I do that?
You first need to split your content, then do a reduceByKey to format your join. Something like below:
val outputRDD = inputRDD.mapPartitions(iter => {
  iter.flatMap(path_content => {
    // split the file content into lines, then each line into columns
    path_content._2.split("\n").map { line =>
      val splittedStr = line.split(" ")
      // outputs (1, a) (1, b) (2, c) ...
      (splittedStr(0), splittedStr(1))
    }
  })
}).reduceByKey(_ + _) // this outputs (1, abe)
If you have only two files in your test directory and the filenames are known, then you can separate the texts of the two files into two RDDs and use join as below:
val rdd1 = inputRDD.filter(tuple => tuple._1.contains("input1.txt"))
.flatMap(tuple => tuple._2.split("\n"))
.map(line => line.split(" "))
.map(array => (array(0), array(1)))
val rdd2 = inputRDD.filter(tuple => tuple._1.contains("input2.txt"))
.flatMap(tuple => tuple._2.split("\n"))
.map(line => line.split(" "))
.map(array => (array(0), array(1)))
rdd1.join(rdd2).foreach(println)
You should get output like this:
(2,(c,f))
(2,(d,f))
(1,(a,e))
(1,(b,e))
I hope this is what you desire
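As a small refactor of the snippet above (a sketch; the helper name keyedRdd is made up), the duplicated filter/split chain can be factored into a function:

// Hypothetical helper: turns one file from the wholeTextFiles pair RDD into a keyed RDD.
def keyedRdd(fileName: String) =
  inputRDD.filter(_._1.contains(fileName))
    .flatMap(_._2.split("\n"))
    .map(_.split(" "))
    .map(cols => (cols(0), cols(1)))

keyedRdd("input1.txt").join(keyedRdd("input2.txt")).foreach(println)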
Updated
If there are two files in the test directory whose names are unknown, then you can avoid the wholeTextFiles API and use the textFile API to read them as separate RDDs and join them as above. But for that you will have to write a function to list the files.
import java.io.File

def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
val fileList = getListOfFiles("/home/hduser/test")
val rdd1 = sc.textFile(fileList(0).getPath)
.map(line => line.split(" "))
.map(array => (array(0), array(1)))
val rdd2 = sc.textFile(fileList(1).getPath)
.map(line => line.split(" "))
.map(array => (array(0), array(1)))
rdd1.join(rdd2).foreach(println)

Modifying a List of Strings in Scala

I have an input file that I would like to read as a Scala stream, then modify each record, and then output the file.
My input is as follows -
Name,id,phone-number
abc,1,234567
dcf,2,345334
I want to change the above input as follows -
Name,id,phone-number
testabc,test1,test234567
testdcf,test2,test345334
I am trying to read the file as a Scala stream as follows:
val inputList = Source.fromFile("/test.csv")("ISO-8859-1").getLines
After the above step I get an Iterator[String].
val newList = inputList.map{line =>
line.split(',').map{s =>
"test" + s
}.mkString (",")
}.toList
but the new list is empty.
I am not sure if I can define an empty list and an empty array and then append the modified records to the list.
Any suggestions?
You might want to transform the iterator into a stream:
import scala.io.Source

val l = Source.fromFile("test.csv")
  .getLines()
  .toStream
  .tail              // drop the header line
  .map { row =>
    row.split(',')
      .map { col =>
        s"test$col"
      }.mkString(",")
  }
l.foreach(println)
testabc,test1,test234567
testdcf,test2,test345334
Here's a similar approach that returns a List[Array[String]]. You can use mkString, toString, or similar if you want a String returned.
scala> scala.io.Source.fromFile("data.txt")
.getLines.drop(1)
.map(l => l.split(",").map(x => "test" + x)).toList
res3: List[Array[String]] = List(
Array(testabc, test1, test234567),
Array(testdcf, test2, test345334)
)
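Since the question also asks to write the modified records back out, here is a minimal sketch (file names are placeholders; the header line is kept as-is):

import java.io.PrintWriter
import scala.io.Source

// Read all lines, keep the header, prefix every field of the data rows with "test",
// and write the result to a new file.
val lines    = Source.fromFile("test.csv").getLines().toList
val header   = lines.head
val modified = lines.tail.map(_.split(',').map("test" + _).mkString(","))

val out = new PrintWriter("test_out.csv")
out.println((header +: modified).mkString("\n"))
out.close()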

mkString and sortByKey are not working with Arrays in Spark

I have a log file (accounts) with data like the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
1- I got the log file using:
val accountsdata = sc.textFile("C:/Users/Sam/Downloads/account1.txt")
2- I wanted to key accounts by postal/zip code, so I did the following:
val accountsByPCode = accountsdata.keyBy(line => line.split(',') (8)).mapValues(line => line.split(",")) ---> This works fine.
3- Then I wanted to map accountsByPCode to lastname,firstname as the values, and I did it using the following:
val namesByPCode = accountsByPCode.mapValues(fields => (fields(3), fields(4))).collect() --> This works fine too, but when I tried to print it using:
println(s"======= namesByPCode, style1 =======")
for (pair <- namesByPCode.take(5)) {
printf("%s, [%s] \n",pair._1,pair._2.mkString(","))
}
I got this error:
error: value mkString is not a member of (String, String)
printf("%s, [%s] \n",pair._1,pair._2.mkString(","))
^
Also when I tried to sortByKey using:
println(s"======= namesByPCode, style2 =======")
for (pair <- namesByPCode.sortByKey().take(5)) {
println("---" + pair._1)
pair._2.take(3)foreach(println)
}
I got the following error:
error: value sortByKey is not a member of Array[(String, (String,String))]
for (pair <- namesByPCode.sortByKey().take(5)) {
^
Can someone advise what is wrong with my code?
You get an error in
error: value sortByKey is not a member of Array[(String, (String,String))]
for (pair <- namesByPCode.sortByKey().take(5)) {
simply because you called collect in the previous step:
val namesByPCode = accountsByPCode.mapValues(fields => (fields(3), fields(4))).collect()
Since you called collect(), you don't have an RDD anymore; you are working with an array. You just need to sort the data by key before collecting:
val namesByPCode = accountsByPCode.mapValues(fields => (fields(3), fields(4))).sortByKey().collect()
Now you have a sorted array. And if you don't need the whole array, you should replace collect() with take(5).
It's because you are creating a Tuple2[String, String], not an Array[String]. Try:
val namesByPCode = accountsByPCode.mapValues(fields => Array(fields(3), fields(4))).collect()
Or else change the code where you pick it apart to this:
printf("%s, [%s] \n",pair._1,Array(pair._2._1, pair._2._2).mkString(","))
Do one of those (don't do both!).
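Putting the two suggestions together, a sketch of the corrected pipeline (assuming the field layout from the question) might look like:

// Keep the values as Array[String] so mkString works, and sort on the RDD
// (before collect/take) so sortByKey is available.
val namesByPCode = accountsByPCode
  .mapValues(fields => Array(fields(3), fields(4)))
  .sortByKey()
  .take(5)

for (pair <- namesByPCode) {
  printf("%s, [%s] \n", pair._1, pair._2.mkString(","))
}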