Scala: read a file and save each line to a variable? - scala

scalaresult.txt
0~250::250~500::500~750::750~1000::1000~1250
481::827::750::256::1000
Scala code:
val filename = "/home/user/scalaresult.txt"
for ((line, index) <- Source.fromFile(filename).getLines().zipWithIndex) {
  println(line)
  println(index)
}
//val step_x = "0~250::250~500::500~750::750~1000::1000~1250"
//val step_y = "481::827::750::256::1000"
Seq("java", "-jar", "/home/user/birt2.jar" , step_x , step_y , "BarChart").lines
I have a file: scalaresult.txt
I need to save the first line (index 0) to step_x
and the second line (index 1) to step_y.
How can I do this? Please guide me. Thank you.

This is not the optimal solution, but you can try the following (I'm not a Scala expert yet! :P):
scala> val it = Source.fromFile(filename).getLines().toList
it: List[String] = List(0~250::250~500::500~750::750~1000::1000~1250, "481::827::750::256::1000 ")
scala> it(1)
res7: String = "481::827::750::256::1000 "
scala> it(0)
res8: String = 0~250::250~500::500~750::750~1000::1000~1250

If all you are trying to do is take the two lines from the file and insert them into the sequence, the indexer on the list will do the trick. Mind you, indexing is an O(n) operation on a List, so if there were a lot of lines it wouldn't be the best approach.
val filename = "/home/user/scalaresult.txt"
val lines = Source.fromFile(filename).getLines().toList  // toList so the lines can be indexed
val seq = Seq("java", "-jar", "/home/user/birt2.jar", lines(0), lines(1), "BarChart")
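Putting the pieces together, here is a minimal sketch of the whole flow. It assumes scala.io.Source and scala.sys.process are available; on Scala 2.11+ the lines method on a process is deprecated in favour of lineStream, so the sketch simply captures the output with !! instead.
import scala.io.Source
import scala.sys.process._

val filename = "/home/user/scalaresult.txt"
val fileLines = Source.fromFile(filename).getLines().toList

val step_x = fileLines(0)  // "0~250::250~500::500~750::750~1000::1000~1250"
val step_y = fileLines(1)  // "481::827::750::256::1000"

// run the external jar and capture its standard output as a single String
val output = Seq("java", "-jar", "/home/user/birt2.jar", step_x, step_y, "BarChart").!!
println(output)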

Related

Scala: How to get the content of PortableDataStream instance from an RDD

As I want to extract data from binary files, I read them using
val dataRDD = sc.binaryFiles("Path") and I get the result as an org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)]
I want to extract the content of my files, which comes in the form of a PortableDataStream.
For that I tried: val data = dataRDD.map(x => x._2.open()).collect()
but I get the following error:
java.io.NotSerializableException: org.apache.hadoop.hdfs.client.HdfsDataInputStream
If you have an idea how I can solve my issue, please help!
Many thanks in advance.
Actually, the PortableDataStream is Serializable. That's what it is meant for. Yet, open() returns a simple DataInputStream (HdfsDataInputStream in your case, because your file is on HDFS), which is not Serializable, hence the error you get.
In fact, when you open the PortableDataStream, you just need to read the data right away. In Scala, you can use scala.io.Source.fromInputStream:
val data: RDD[Array[String]] = sc
  .binaryFiles("path/.../")
  .map { case (fileName, pds) =>
    scala.io.Source.fromInputStream(pds.open())
      .getLines().toArray
  }
This code assumes that the data is textual. If it is not, you can adapt it to read any kind of binary data. Here is an example that builds a sequence of bytes, which you can then process however you want.
val rdd: RDD[Seq[Byte]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    val buffer = Array.ofDim[Byte](1024)
    val all = scala.collection.mutable.ArrayBuffer[Byte]()
    var n = dis.read(buffer)
    while (n != -1) {
      all ++= buffer.take(n)  // keep only the bytes actually read on this iteration
      n = dis.read(buffer)
    }
    dis.close()
    all.toSeq
  }
See the javadoc of DataInputStream for more possibilities. For instance, it possesses readLong, readDouble (and so on) methods.
val bf = sc.binaryFiles("...")
val bytes = bf.map { case (file, pds) =>
  val dis = pds.open()
  val len = dis.available()
  val buf = Array.ofDim[Byte](len)
  dis.readFully(buf)  // read from the stream that was already opened above
  buf
}
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26
scala> bytes.take(1)(0).size
res15: Int = 5879609 // this happened to be the size of my first binary file

Error in regex code

I am trying to find only the words that contain exactly 3 occurrences of a letter ('e' in the example below).
I need to find them using a regex.
val inputString = """edepak,suman,employdee,eeeee,eme,ev"""
and I have written the code below.
val numberPatteren = "([a-z]*e){3,}".r
but I am getting the output below, which is not what I expected:
employdee,eeeee
The output should be only employdee.
Can you please help me with this?
You can achieve that simply by doing the following:
scala> inputString.split(",").filter(word => word.count(_ == 'e') == 3).mkString(",")
//res16: String = employdee
If you want to use a regex, you can do it as below:
scala> val numberPatteren = "[a-df-zA-DF-Z0-9]".r
//numberPatteren: scala.util.matching.Regex = [a-df-zA-DF-Z0-9]
scala> inputString.split(",").filter(numberPatteren.replaceAllIn(_, "").length == 3).mkString(",")
//res0: String = employdee
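If you'd rather match each word with a single anchored regex instead of stripping characters and counting, a pattern requiring exactly three 'e's also works. A small sketch against the same inputString:
scala> inputString.split(",").filter(_.matches("(?:[^e]*e){3}[^e]*")).mkString(",")
// expected result: employdee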

How can I construct a String with the contents of a given DataFrame in Scala

Consider that I have a DataFrame. How can I retrieve the contents of that DataFrame and represent it as a String?
Consider that I try to do that with the example code below.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
  println("x = ", x)
  sb.append(x)
})
println("sb = ", sb)
The output of the code shows the example dataframe has contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final stringbuilder contains an empty string.
Any thoughts how to retrieve a String for a given dataframe in Scala?
Many thanks
UPD: as mentioned by @user8371915, the solution below will only work in a single JVM, i.e. in development (local) mode. In fact, we can't modify broadcast variables as if they were globals. You can use accumulators instead, but they will be quite inefficient. You can also read an answer about reading/writing global variables here. Hope it will help you.
I think you should read the topic about shared variables in Spark. Link here
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
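As a side note on the accumulator route mentioned in the UPD above, here is a minimal sketch. It assumes Spark 2.x, where SparkContext.collectionAccumulator is available, and the order of the collected elements is not guaranteed:
// collect the formatted rows through a collection accumulator
val rowsAcc = sc.collectionAccumulator[String]("rows")
df.foreach(x => rowsAcc.add(x.toString))

// rowsAcc.value is a java.util.List[String]; join it into one String on the driver
import scala.collection.JavaConverters._
val asString = rowsAcc.value.asScala.mkString(", ")
println(asString)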
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
  println("x = ", x)
  broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for a StringBuilder variable sb.
Here is the output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
Hope this helps.
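If the data is small enough to fit in driver memory, an even simpler sketch is to collect it and build the String on the driver, which avoids mutating any state from the executors:
// bring the rows back to the driver and format them there
val asString = df.collect().mkString(", ")
println("sb = " + asString)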
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback and for helping me understand this slightly better.
The combination of responses results in the code below. The requirements have changed slightly, in that I now represent my df as a list of JSONs. The code below does this without using a broadcast.
// imports assumed here: org.apache.spark.sql.DataFrame and scala.util.parsing.json.JSONObject
class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
  val jsons = df.limit(limit).collect.map(rowToJson(_))

  def rowToJson(r: org.apache.spark.sql.Row): JSONObject = {
    try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
    catch { case t: Throwable =>
      JSONObject.apply(Map("Row with error" -> t.toString))
    }
  }
}
I use the class like this:
val jsons = new HandleDf(df, 100).jsons

Converting a String to a Map

Given a String: {'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}
I want to parse it into a Map[String,String]. I already tried this answer, but it doesn't work when the character : is inside the parsed value. Same thing with the ' character; it seems to break every JSON mapper...
Thanks for any help.
Let
val s0 = "{'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}"
val s = s0.stripPrefix("{").stripSuffix("}")
Then
(for (e <- s.split(",") ; xs = e.split(":",2)) yield xs(0) -> xs(1)).toMap
Here we split each key-value pair at the first occurrence of ":". This still rests on a strong assumption, namely that the key itself does not contain any ":". Note that the keys and values keep their surrounding single quotes, so strip them if you need the bare strings.
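A small sketch of the same split-based approach with the quotes stripped off (still assuming that keys contain no ":" and that values contain no ","):
def unquote(str: String): String = str.trim.stripPrefix("'").stripSuffix("'")

val parsed: Map[String, String] =
  s.split(",")
   .map(_.split(":", 2))
   .map(xs => unquote(xs(0)) -> unquote(xs(1)))
   .toMap
// parsed: Map(Name -> Bond, Job -> Agent, LastEntry -> 15/10/2015 13:00)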
You can use the familiar jackson-module-scala, which handles this much better at scale.
For example:
val src = "{'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}"
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
// the input uses single quotes, which are not standard JSON, so this Jackson parser feature is needed
mapper.configure(com.fasterxml.jackson.core.JsonParser.Feature.ALLOW_SINGLE_QUOTES, true)
val myMap = mapper.readValue[Map[String, String]](src)

Spark Data Loading Issue

I am getting an IndexOutOfBoundsException while doing the following operation in spark-shell:
val input = sc.textFile("demo.txt")
input.collect
Both of the above work fine.
val out = input.map(_.split(",")).map(r => r(1))
I get the IndexOutOfBoundsException on the above line.
demo.txt looks like this (header: Name,Gender,age):
Danial,,14
,Male,18
Hema,,
With Pig the same file works without any issue!
You can try this out yourself: just start the Scala console and enter your sample lines.
scala> "Danial,,14".split(",")
res0: Array[String] = Array(Danial, "", 14)
scala> ",Male,18".split(",")
res1: Array[String] = Array("", Male, 18)
scala> "Hema,,".split(",")
res2: Array[String] = Array(Hema)
So oops, the last line doesn't work. Add the number of expected columns to split:
scala> "Hema,,".split(",", 3)
res3: Array[String] = Array(Hema, "", "")
or even better, write a real parser. String.split isn't suitable for production code.
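Applied to the original RDD code, a minimal sketch of the fix. It uses a negative limit so all trailing empty fields are kept; passing the expected column count (3, per the header) works just as well:
val input = sc.textFile("demo.txt")
// a negative limit keeps trailing empty strings, so every row keeps all 3 fields
val out = input.map(_.split(",", -1)).map(r => r(1))
out.collect()  // no more IndexOutOfBoundsException on rows like "Hema,,"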