Scala How to avoid reading the file twice - scala

Here is the sample program I am working on to read a file with list of values one per line. I have to add all these values converting to double and also need to sort the values. Here is what I came up so far and it is working fine.
import scala.io.Source
object Expense{
def main(args: Array[String]): Unit = {
val lines = Source.fromFile("c://exp.txt").getLines()
val sum: Double = lines.foldLeft(0.0)((i, s) => i + s.replaceAll(",","").toDouble)
println("Total => " + sum)
println((Source.fromFile("c://exp.txt").getLines() map (_.replaceAll(",", "").toDouble)).toList.sorted)
}
}
The question here is, as you can see I am reading the file twice and I want to avoid it. As the Source.fromFile("c://exp.txt").getLines() gives you an iterator, I can loop through it only once and next operation it will be null, so I can't reuse the lines again for sorting and I need to read from file again. Also I don't want to store them into a temporary list. Is there any elegant way of doing this in a functional way?

Convert it to a List, so you can reuse it:
val lines = Source.fromFile("c://exp.txt").getLines().toList

As per my comment, convert it to a list, and rewrite as follows
import scala.io.Source
object Expense{
def main(args: Array[String]): Unit = {
val lines = Source.fromFile("c://exp.txt").getLines() map (_.replaceAll(",","").toDouble)
val sum: Double = lines.foldLeft(0.0)((i, s) => i + s)
println("Total => " + sum)
println(lines.toList.sorted)
}
}

Related

Scala: Most efficient way to process files in folder based on a file list

I am trying to find the most efficient way to process files in multiple folders based on a list of allowed files.
I have a list of allowed files that I should process.
The proces is as follows
val allowedFiles = List("File1.json","File2.json","File3.json")
Get list of folders in directory. For this I could use:
def getListOfSubDirectories(dir: File): List[String] =
dir.listFiles
.filter(_.isDirectory)
.map(_.getName)
.toList
Loop through each folder from step 2. and get all files. For this I would use :
def getListOfFiles(dir: String):List[File] = {
val d = new File(dir)
if (d.exists && d.isDirectory) {
d.listFiles.filter(_.isFile).toList
} else {
List[File]()
}
}
If file from step 3. are in list of allowed files call another method that process the file
So I need to loop through a first directory, get files, check if file need to be procssed and then call another functionn. I was thinking about double loop which would work but is the most efficient way. I know in scala I should be using resursive funstions but failed with this double recursive function with call to extra method.
Any ideas welcome.
Files.find() will do both the depth search and filter.
import java.nio.file.{Files,Paths,Path}
import scala.jdk.StreamConverters._
def getListOfFiles(dir: String, targets:Set[String]): List[Path] =
Files.find( Paths.get(dir)
, 999
, (p, _) => targets(p.getFileName.toString)
).toScala(List)
usage:
val lof = getListOfFiles("/DataDir", allowedFiles.toSet)
But, depending on what kind of processing is required, instead of returning a List you might just process each file as it is encountered.
import java.nio.file.{Files,Paths,Path}
def processFile(path: Path): Unit = ???
def processSelected(dir: String, targets:Set[String]): Unit =
Files.find( Paths.get(dir)
, 999
, (p, _) => targets(p.getFileName.toString)
).forEach(processFile)
You can use Files.walk
The code would look like this (I didn't compile it, so it may have some typos)
import java.nio.file.{Files, Path}
import scala.jdk.StreamConverters._
def getFilesRecursive(initialFolder: Path, allowedFiles: Set[String]): List[Path] =
Files
.walk(initialFolder)
.filter(path => allowedFiles.contains(path.getFileName.toString.toLowerCase))
.toScala(List)
I'm no expert on Scala (last time I dabbled in it was probably 18 years ago) but I figured there had to be a way to take this code:
def getListOfSubDirectories(dir: File): List[String] =
dir.listFiles
.filter(_.isDirectory)
.map(_.getName)
.toList
And eliminate at least one extra list creation. I found this SO question which was instructive, and then did a Google search for withFilter.
Looks like you can take that bit above and translate it to the following. By replacing filter with withFilter, a new list is not created and then iterated over.
def getListOfSubDirectories(dir: File): List[String] =
dir.listFiles
.withFilter(_.isDirectory)
.map(_.getName)
.toList

How to use getOrElseUpdate in scala.collection.mutable.HashMap?

The example code counts each word's occurrences in given input file:
object Main {
def main(args: Array[String]) {
val counts = new scala.collection.mutable.HashMap[String, Int]
val in = new Scanner(new File("input.txt"))
while (in.hasNext()) {
val s: String = in.next()
counts(s) = counts.getOrElse(s, 0) + 1 // Here!
}
print(counts)
}
}
Can the highlighted with comment line be rewritten using the getOrElseUpdate method?
P.S. I am only at the 4th part of the "Scala for the impatient", so please don't teach me now about functional Scala which, I am sure, can be more beautiful here.
Thanks.
If you look at the doc you'll see the next:
If given key is already in this map, returns associated value.
Otherwise, computes value from given expression op, stores with key in
map and returns that value.
, but you need to modify map anyway, so getOrElseUpdate is useless here.
You can define default value, which will return if key doesn't exist. And use it the next way:
import scala.collection.mutable.HashMap
object Main {
def main(args: Array[String]) {
val counts = HashMap[String, Int]().withDefaultValue(0)
val in = new Scanner(new File("input.txt"))
while (in.hasNext()) {
val s: String = in.next()
counts(s) += 1
}
print(counts)
}
}

dynamically parse a string and return a function in scala using reflection and interpretors

I am trying to dinamically interpret code given as a String.
Eg:
val myString = "def f(x:Int):Int=x+1".
Im looking for a method that will return the real function out of it:
Eg:
val myIncrementFunction = myDarkMagicFunctionThatWillBuildMyFunction(myString)
println(myIncrementFunction(3))
will print 4
Use case: I want to use some simple functions from that interpreted code later in my code. For example they can provide something like def fun(x: Int): Int = x + 1 as a string, then I use the interpreter to compile/execute that code and then I'd like to be able to use this fun(x) in a map for example.
The problem is that that function type is unknown for me, and this is one of the big problems because I need to cast back from IMain.
I've read about reflection, type system and such, and after some googling I reached this point. Also I checked twitter's util-eval but I cant see too much from the docs and the examples in their tests, it's pretty the same thing.
If I know the type I can do something like
val settings = new Settings
val imain = new IMain(settings)
val res = imain.interpret("def f(x:Int):Int=x+1; val ret=f _ ")
val myF = imain.valueOfTerm("ret").get.asInstanceOf[Function[Int,Int]]
println(myF(2))
which works correctly and prints 3 but I am blocked by the problem I said above, that I dont know the type of the function, and this example works just because I casted to the type I used when I defined the string function for testing how IMain works.
Do you know any method how I could achieve this functionality ?
I'm a newbie so please excuse me if I wrote any mistakes.
Thanks
Ok, I managed to achieve the functionality I wanted, I am still looking for improving this code, but this snippet does what I want.
I used scala toolbox and quasiquotes
import scala.reflect.runtime.universe.{Quasiquote, runtimeMirror}
import scala.tools.reflect.ToolBox
object App {
def main(args: Array[String]): Unit = {
val mirror = runtimeMirror(getClass.getClassLoader)
val tb = ToolBox(mirror).mkToolBox()
val data = Array(1, 2, 3)
println("Data before function applied on it")
println(data.mkString(","))
println("Please enter the map function you want:")
val function = scala.io.StdIn.readLine()
val functionWrapper = "object FunctionWrapper { " + function + "}"
val functionSymbol = tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
// Map each element using user specified function
val dataAfterFunctionApplied = data.map(x => tb.eval(q"$functionSymbol.function($x)"))
println("Data after function applied on it")
println(dataAfterFunctionApplied.mkString(","))
}
}
And here is the result in the terminal:
Data before function applied on it
1,2,3
Please enter the map function you want:
def function(x: Int): Int = x + 2
Data after function applied on it
3,4,5
Process finished with exit code 0
I wanted to elaborate the previous answer with the comment and perform an evaluation of the solutions:
import scala.reflect.runtime.universe.{Quasiquote, runtimeMirror}
import scala.tools.reflect.ToolBox
object Runtime {
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + " ns")
result
}
def main(args: Array[String]): Unit = {
val mirror = runtimeMirror(getClass.getClassLoader)
val tb = ToolBox(mirror).mkToolBox()
val data = Array(1, 2, 3)
println(s"Data before function applied on it: '${data.toList}")
val function = "def apply(x: Int): Int = x + 2"
println(s"Function: '$function'")
println("#######################")
// Function with tb.eval
println(".... with tb.eval")
val functionWrapper = "object FunctionWrapper { " + function + "}"
// This takes around 1sec!
val functionSymbol = time { tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])}
// This takes around 0.5 sec!
val result = time {data.map(x => tb.eval(q"$functionSymbol.apply($x)"))}
println(s"Data after function applied on it: '${result.toList}'")
println(".... without tb.eval")
val func = time {tb.eval(q"$functionSymbol.apply _").asInstanceOf[Int => Int]}
// This takes around 0.5 sec!
val result2 = time {data.map(func)}
println(s"Data after function applied on it: '${result2.toList}'")
}
}
If we execute the code above we see the following output:
Data before function applied on it: 'List(1, 2, 3)
Function: 'def apply(x: Int): Int = x + 2'
#######################
.... with tb.eval
Elapsed time: 716542980 ns
Elapsed time: 661386581 ns
Data after function applied on it: 'List(3, 4, 5)'
.... without tb.eval
Elapsed time: 394119232 ns
Elapsed time: 85713 ns
Data after function applied on it: 'List(3, 4, 5)'
Just to emphasize the importance of do the evaluation to extract a Function, and then apply to the data, without the end to evaluate again, as the comment in the answer indicates.
You can use twitter-util library to do this, check the test file:
https://github.com/twitter/util/blob/b0696d0/util-eval/src/test/scala/com/twitter/util/EvalTest.scala
If you need to use IMain, maybe because you want to use the intepreter with your own custom settings, you can do something like this:
a. First create a class meant to hold your result:
class ResHolder(var value: Any)
b. Create a container object to hold the result and interpret the code into that object:
val settings = new Settings()
val writer = new java.io.StringWriter()
val interpreter = new IMain(settings, writer)
val code = "def f(x:Int):Int=x+1"
// Create a container object to hold the result and bind in the interpreter
val holder = new ResHolder(null)
interpreter.bind("$result", holder.getClass.getName, holder) match {
case Success =>
case Error => throw new ScriptException("error in: binding '$result' value\n" + writer)
case Incomplete => throw new ScriptException("incomplete in: binding '$result' value\n" + writer)
}
val ir = interpreter.interpret("$result.value = " + code)
// Return cast value or throw an exception based on result
ir match {
case Success =>
val any = holder.value
any.asInstanceOf[(Int) => Int]
case Error => throw new ScriptException("error in: '" + code + "'\n" + writer)
case Incomplete => throw new ScriptException("incomplete in :'" + code + "'\n" + writer)
}

How to add line number into each line?

suppose these are my data:
‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
‘Map’ is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
’Reducer’ is responsible to process the intermediate.
output received from the mapper and generate the final output.
and i want to add a number to every line like below output:
1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
2,‘Map’ is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,’Reducer’ is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
save them to file.
i've tried:
object DS_E5 {
def main(args: Array[String]): Unit = {
var i=0
val conf = new SparkConf().setAppName("prep").setMaster("local")
val sc = new SparkContext(conf)
val sample1 = sc.textFile("data.txt")
for(sample<-sample1){
i=i+1
val ss=sample.map(l=>(i,sample))
println(ss)
}
}
}
but its output is like blew :
Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.))
...
How can i edit my code to generate an output like my favorite output?
zipWithIndex is what you need here. It maps from RDD[T] to RDD[(T, Long)] by adding an index on the second position of the pair.
sample1
.zipWithIndex()
.map { case (line, i) => i.toString + ", " + line }
or using string interpolation (see a comment by #DanielC.Sobral)
sample1
.zipWithIndex()
.map { case (line, i) => s"$i, $line" }
By calling val sample1 = sc.textFile("data.txt") you are creating a new RDD.
If you need just an output, you can try to use next code:
sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))
Basically, by using this code, you will do this:
Using .zipWithIndex() will return new RDD[(T, Long)], where (T, Long) is a Tuple, T is a previous RDD elements datatype (java.lang.String, I believe) and Long is an index of element in RDD.
You performed transformation, now you need to make an action. foreach, in this case, suits very well. What is basically does: it applies your statement to every element in current RDD, so we just call quickly formatted println.

Scala equivalent of Haskell's insertWith for Maps

I'm looking to do the simple task of counting words in a String. The easiest way I've found is to use a Map to keep track of word frequencies. Previously with Haskell, I used its Map's function insertWith, which takes a function that resolves key collisions, along with the key and value pair. I can't find anything similar in Scala's library though; only an add function (+), which presumably overwrites the previous value when re-inserting a key. For my purposes though, instead of overwriting the previous value, I want to add 1 to it to increase its count.
Obviously I could write a function to check if a key already exists, fetch its value, add 1 to it, and re-insert it, but it seems odd that a function like this isn't included. Am I missing something? What would be the Scala way of doing this?
Use a map with default value and then update with +=
import scala.collection.mutable
val count = mutable.Map[String, Int]().withDefaultValue(0)
count("abc") += 1
println(count("abc"))
If it's a string then why not use the split module
import Data.List.Split
let mywords = "he is a good good boy"
length $ nub $ splitOn " " mywords
5
If you want to stick with Scala's immutable style, you could create your own class with immutable semantics:
class CountMap protected(val counts: Map[String, Int]){
def +(str: String) = new CountMap(counts + (str -> (counts(str) + 1)))
def apply(str: String) = counts(str)
}
object CountMap {
def apply(counts: Map[String, Int] = Map[String, Int]()) = new CountMap(counts.withDefaultValue(0))
}
And then you can use it:
val added = CountMap() + "hello" + "hello" + "world" + "foo" + "bar"
added("hello")
>>2
added("qux")
>>0
You might also add apply overloads on the companion object so that you can directly input a sequence of words, or even a sentence:
object CountMap {
def apply(counts: Map[String, Int] = Map[String, Int]()): CountMap = new CountMap(counts.withDefaultValue(0))
def apply(words: Seq[String]): CountMap = CountMap(words.groupBy(w => w).map { case(word, group) => word -> group.length })
def apply(sentence: String): CountMap = CountMap(sentence.split(" "))
}
And then the you can even more easily:
CountMap(Seq("hello", "hello", "world", "world", "foo", "bar"))
Or:
CountMap("hello hello world world foo bar")