Scala: Most efficient way to process files in folder based on a file list - scala

I am trying to find the most efficient way to process files in multiple folders based on a list of allowed files.
I have a list of allowed files that I should process.
The process is as follows:
1. Start from the list of allowed files:
val allowedFiles = List("File1.json","File2.json","File3.json")
2. Get the list of folders in the directory. For this I could use:
import java.io.File
def getListOfSubDirectories(dir: File): List[String] =
  dir.listFiles
    .filter(_.isDirectory)
    .map(_.getName)
    .toList
3. Loop through each folder from step 2 and get all its files. For this I would use:
def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
4. If a file from step 3 is in the list of allowed files, call another method that processes the file.
So I need to loop through the first directory, get its files, check whether each file needs to be processed, and then call another function. I was thinking about a double loop, which would work, but is it the most efficient way? I know that in Scala I should be using recursive functions, but I failed to write this double recursion with a call to an extra method.
Any ideas welcome.

Files.find() will do both the depth search and filter.
import java.nio.file.{Files,Paths,Path}
import scala.jdk.StreamConverters._
def getListOfFiles(dir: String, targets: Set[String]): List[Path] =
  Files.find( Paths.get(dir)
            , 999
            , (p, _) => targets(p.getFileName.toString)
            ).toScala(List)
usage:
val lof = getListOfFiles("/DataDir", allowedFiles.toSet)
But, depending on what kind of processing is required, instead of returning a List you might just process each file as it is encountered.
import java.nio.file.{Files,Paths,Path}
def processFile(path: Path): Unit = ???
def processSelected(dir: String, targets: Set[String]): Unit =
  Files.find( Paths.get(dir)
            , 999
            , (p, _) => targets(p.getFileName.toString)
            ).forEach(processFile)
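
Files.find returns a java.util.stream.Stream that holds open directory handles until it is closed, so a longer-running program may want to close it explicitly. The following is a minimal sketch of the same idea, assuming Scala 2.13 or later (for scala.util.Using). The name processSelectedSafely and the returned count are illustrative additions, not part of the answer above; the extra isRegularFile check simply skips directories that happen to share an allowed name.

```scala
import java.nio.file.{Files, Path, Paths}
import java.nio.file.attribute.BasicFileAttributes
import scala.util.Using

// Hypothetical variant of processSelected: closes the stream when done
// and returns how many files were handed to `process`.
def processSelectedSafely(dir: String, targets: Set[String])(process: Path => Unit): Int =
  Using.resource(
    Files.find(
      Paths.get(dir),
      Int.MaxValue, // no depth limit
      (p: Path, attrs: BasicFileAttributes) =>
        attrs.isRegularFile && targets(p.getFileName.toString)
    )
  ) { stream =>
    var count = 0
    stream.forEach { p =>
      process(p)
      count += 1
    }
    count
  }
```

It could be called as processSelectedSafely("/DataDir", allowedFiles.toSet)(processFile).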

You can use Files.walk
The code would look like this (I didn't compile it, so it may have some typos)
import java.nio.file.{Files, Path}
import scala.jdk.StreamConverters._
def getFilesRecursive(initialFolder: Path, allowedFiles: Set[String]): List[Path] =
  Files
    .walk(initialFolder)
    // compare names as-is: lowercasing here would never match entries such as "File1.json"
    .filter(path => allowedFiles.contains(path.getFileName.toString))
    .toScala(List)

I'm no expert on Scala (last time I dabbled in it was probably 18 years ago) but I figured there had to be a way to take this code:
def getListOfSubDirectories(dir: File): List[String] =
  dir.listFiles
    .filter(_.isDirectory)
    .map(_.getName)
    .toList
And eliminate at least one extra list creation. I found this SO question which was instructive, and then did a Google search for withFilter.
It looks like you can translate the snippet above into the following. With withFilter in place of filter, the intermediate filtered collection is never built; the predicate is applied while map runs.
def getListOfSubDirectories(dir: File): List[String] =
  dir.listFiles
    .withFilter(_.isDirectory)
    .map(_.getName)
    .toList
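
A related sketch, offered as an alternative rather than as the method above: going through an iterator also avoids the intermediate collections, and wrapping the call in Option guards against File.listFiles returning null, which it does when the path is not a directory or cannot be read. The name subDirectoryNames is made up for this example.

```scala
import java.io.File

// Hypothetical helper: lazily filters and maps, building only the final List.
def subDirectoryNames(dir: File): List[String] =
  Option(dir.listFiles) // null when dir is not a readable directory
    .map(_.iterator.filter(_.isDirectory).map(_.getName).toList)
    .getOrElse(Nil)
```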

Related

apply a function to list in Scala

I'm trying to learn Scala.
I'd like help understanding the foreach loop.
I have a function that returns only the latest CSV file from a path, but it works only when I pass it a single path:
val path = "src/main/resources/historical_novel/"
def getLastFile(path: String, spark: SparkSession): String = {
val hdfs = ...}
But how can I apply this function to a list such as
val paths: List[String] = List(
"src/main/resources/historical_novel/",
"src/main/resources/detective/",
"src/main/resources/adventure/",
"src/main/resources/horror/")
I want to get a result like this:
src/main/resources/historical_novel/20221027.csv
src/main/resources/detective/20221026.csv
src/main/resources/adventure/20221026.csv
src/main/resources/horror/20221027.csv
I created a DataFrame with a path column and applied the function through withColumn, and that works,
but I want to do it with foreach to understand it.

Let's say your function looks like this:
def f(s: String): Unit = {}
You can simply do this:
paths.foreach(p => f(p))
After your edit, I think you may want to use map, a function that transforms one collection into another, like this:
val result = paths.map(p => getLastFile(p, yourSparkSession))

foreach applies a function that you define or provide to each element in a collection.
The simplest example is to print each path to the console:
paths.foreach(path => println(path))
To apply a series of functions as you describe, you can use {} in the foreach body and call multiple functions:
paths.foreach(path => {
  val file = loadFile(path)
  writeToDataBase(file)
})
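
To make the difference between foreach and a transforming call concrete, here is a small self-contained sketch. Since the body of getLastFile is elided in the question, lastCsvIn below is a hypothetical stand-in that uses only the local file system (no Spark or HDFS) and assumes that date-stamped names such as 20221027.csv sort correctly as plain strings.

```scala
import java.io.File

// Hypothetical stand-in for getLastFile: the path of the last .csv file
// (by name) in a directory, or None if there is no such file.
def lastCsvIn(path: String): Option[String] =
  Option(new File(path).listFiles) // null when the directory does not exist
    .map(_.toList)
    .getOrElse(Nil)
    .filter(f => f.isFile && f.getName.endsWith(".csv"))
    .map(_.getPath)
    .sorted
    .lastOption

// flatMap keeps the results and silently drops directories without a CSV file.
def lastFiles(paths: List[String]): List[String] =
  paths.flatMap(p => lastCsvIn(p))

// foreach is only for side effects, such as printing each result.
def printLastFiles(paths: List[String]): Unit =
  lastFiles(paths).foreach(println)
```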

How to create a map for (key - image name // value - image-file) in Scala

def getListOfImageNames(dir: String): List[String] = {
  val names = new File(dir)
  names.listFiles.filter(_.isFile)
    .map(_.getName).toList
}
def getListOfImages(dir: String): List[String] = {
  val files = new File(dir)
  files.listFiles.filter(_.isFile)
    // accept either extension; two chained filters would require both and match nothing
    .filter(f => f.getName.endsWith(".png") || f.getName.endsWith(".jpg"))
    .map(_.getPath).toList
}
I have a directory containing photos of different sizes, small and large, and I have already written two methods: one returns only the names of the photos, the other returns the photos themselves (as file paths). How can I now combine them into a map, then calculate each photo's resolution and, if the photo is larger than, for example, 500x500, add a prefix to its name and save it in folder X? Do you have any ideas? I'm not experienced in Scala, but I like the language very much.

As I understand it, you need a map from image name to image path. You can achieve it like so:
def getImagesMap(dirPath: String): Map[String, String] = {
  val directory = new File(dirPath)
  directory.listFiles.collect {
    case file if file.isFile &&
      (file.getName.endsWith(".png") ||
       file.getName.endsWith(".jpg")) =>
      file.getName -> file.getPath
  }.toMap
}
Here I use the collect function, which works like a combination of map and filter. It takes a pattern-matching expression: if a file matches the case, the pair of file name and file path is created; otherwise the file is simply dropped. Afterwards I use toMap to convert the Array[(String, String)] to a Map[String, String]. You can read more about collect here.
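
The second half of the question (measure each image and copy the large ones under a prefixed name) could be sketched roughly as follows. This is only one possible approach, using javax.imageio.ImageIO from the JDK to read the dimensions. The function name copyLargeImages, the "large_" prefix and the parameter names are assumptions made for illustration; the input is a name-to-path map like the one getImagesMap returns.

```scala
import java.io.File
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import javax.imageio.ImageIO

// Hypothetical helper: copies every image larger than minSide x minSide
// into targetDir under a prefixed name and returns the new paths.
def copyLargeImages(
    images: Map[String, String], // image name -> image path
    targetDir: String,
    minSide: Int = 500,
    prefix: String = "large_"
): List[Path] = {
  val target = Files.createDirectories(Paths.get(targetDir))
  images.toList.flatMap { case (name, path) =>
    Option(ImageIO.read(new File(path))) // null when no reader understands the file
      .filter(img => img.getWidth > minSide && img.getHeight > minSide)
      .map(_ =>
        Files.copy(Paths.get(path), target.resolve(prefix + name),
          StandardCopyOption.REPLACE_EXISTING))
  }
}
```

Reading the whole image just to learn its size is simple but not cheap; for large collections it may be worth looking into reading only the image metadata.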

Converting discrete chunks of Stdin to usable form

Simply, if the user pastes a chunk of text (multiple lines) into the console all at once, I want to be able to grab that chunk and use it.
Currently my code is
val stringLines: List[String] = io.Source.stdin.getLines().toList
doStuff(stringLines)
However doStuff is never called. I realize the stdin iterator doesn't have an 'end', but how do I get the input as it is currently? I've checked many SO answers, but none of them work for multiple lines that need to be collated. I need to have all the lines of the user input at once, and it will always come as a single paste of data.
This is a rough outline, but it seems to work. I had to delve into Java Executors, which I hadn't encountered before, and I might not be using them correctly.
You might want to play around with the timeout value, but 10 milliseconds passes my tests.
import java.util.concurrent.{Callable, Executors, TimeUnit}
import scala.util.{Failure, Success, Try}
def getChunk: List[String] = { // blocks until StdIn has data
  val executor = Executors.newSingleThreadExecutor()
  val callable = new Callable[String]() {
    def call(): String = io.StdIn.readLine()
  }
  def nextLine(acc: List[String]): List[String] =
    Try {executor.submit(callable).get(10L, TimeUnit.MILLISECONDS)} match {
      case Success(str) => nextLine(str :: acc)
      case Failure(_)   => executor.shutdownNow() // should test for Failure type
                           acc.reverse
    }
  nextLine(List(io.StdIn.readLine())) // this is the blocking part
}
Usage is quite simple.
val input: List[String] = getChunk
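
If the pasted data never contains blank lines and the user can finish the paste with an empty line (or end of input), a simpler alternative without timeouts is possible. This is a different technique from the executor-based one above, sketched under exactly that assumption; readChunk is a name invented for this example, and it takes the reader as a parameter so it can be tried without a console.

```scala
import java.io.{BufferedReader, InputStreamReader, StringReader}

// Hypothetical helper: collects lines until the first empty line or end of input.
def readChunk(in: BufferedReader): List[String] =
  Iterator
    .continually(in.readLine()) // readLine returns null at end of input
    .takeWhile(line => line != null && line.nonEmpty)
    .toList

// Convenience wrapper for the console.
def readChunkFromStdin(): List[String] =
  readChunk(new BufferedReader(new InputStreamReader(System.in)))
```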

Scala Future or other way to execute Java void methods in parallel

I have the following piece of code:
stringList map copyURLToFile // List[String]
def copyURLToFile(symbol: String) = {
  val url = new URL(base.replace("XXX", symbol))
  val file = new File(prefix + symbol + ".csv")
  org.apache.commons.io.FileUtils.copyURLToFile(url, file)
}
How do I make copyURLToFile execute in parallel?
Scala Futures need a return type, but if I make copyURLToFile of type Future[Unit], what and how do I return from the function?
If you want a very easy way to parallelize this, then just map over the list using the parallel collections library (on Scala 2.13 and later this is the separate scala-parallel-collections module, enabled with import scala.collection.parallel.CollectionConverters._).
stringList.par.map(copyURLToFile).toList

You need to return something; otherwise, how are you going to know which files were downloaded?
Change your code to be:
def copyURLToFile(symbol: String): (String, String) = {
  val url = new URL(base.replace("XXX", symbol))
  val file = new File(prefix + symbol + ".csv")
  org.apache.commons.io.FileUtils.copyURLToFile(url, file)
  (symbol, file.getPath) // return the path as a String to match the declared type
}
And then the resulting collection will contain the symbols and the paths of the files they were downloaded to.
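
Since the question asks about Futures specifically, here is a minimal sketch of that route using only the standard library. Wrapping a Unit-returning call in Future { ... } simply yields a Future[Unit], so nothing special has to be returned. The helper name runInParallel, the use of the global execution context and the ten-minute timeout are all assumptions made for the example.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper: runs `work` for every item concurrently and pairs
// each item with its result, keeping the order of the input list.
def runInParallel[A, B](items: List[A])(work: A => B): List[(A, B)] = {
  val all: Future[List[(A, B)]] =
    Future.traverse(items) { item =>
      // blocking hints to the global pool that this task does blocking I/O
      Future(blocking(item -> work(item)))
    }
  Await.result(all, 10.minutes) // rethrows the first failure, if any
}
```

With the copyURLToFile method from the question, the call would be runInParallel(stringList)(copyURLToFile).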

Scala How to avoid reading the file twice

Here is a sample program I am working on. It reads a file containing a list of values, one per line. I have to convert these values to Double and add them up, and I also need to sort them. Here is what I have come up with so far, and it works fine.
import scala.io.Source
object Expense{
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("c://exp.txt").getLines()
    val sum: Double = lines.foldLeft(0.0)((i, s) => i + s.replaceAll(",","").toDouble)
    println("Total => " + sum)
    println((Source.fromFile("c://exp.txt").getLines() map (_.replaceAll(",", "").toDouble)).toList.sorted)
  }
}
The question is that, as you can see, I am reading the file twice, and I want to avoid that. Since Source.fromFile("c://exp.txt").getLines() gives you an iterator, I can loop through it only once; after that it is exhausted, so I can't reuse the lines for sorting and I need to read from the file again. Also, I don't want to store them in a temporary list. Is there an elegant way of doing this in a functional style?
Convert it to a List, so you can reuse it:
val lines = Source.fromFile("c://exp.txt").getLines().toList

As per my comment, convert it to a list, and rewrite as follows:
import scala.io.Source
object Expense {
  def main(args: Array[String]): Unit = {
    // toList materializes the iterator once so it can be traversed twice
    val lines = Source.fromFile("c://exp.txt").getLines().map(_.replaceAll(",", "").toDouble).toList
    val sum: Double = lines.foldLeft(0.0)((i, s) => i + s)
    println("Total => " + sum)
    println(lines.sorted)
  }
}
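
Sorting needs all values in memory at once, so some materialized collection is hard to avoid; what can be avoided is the second read and the unclosed file handle. The sketch below shows one way to do that, assuming Scala 2.13 or later for scala.util.Using. The names readAmounts and summarize are invented for the example, and skipping blank lines is an added assumption about the input.

```scala
import scala.io.Source
import scala.util.Using

// Hypothetical helper: reads the file once, closes it, and returns the parsed values.
def readAmounts(path: String): List[Double] =
  Using.resource(Source.fromFile(path)) { source =>
    source
      .getLines()
      .filter(_.trim.nonEmpty)                // tolerate blank lines
      .map(_.replaceAll(",", "").toDouble)    // "1,000.5" -> 1000.5
      .toList                                 // materialize before the file is closed
  }

// Both results come from the same in-memory list.
def summarize(amounts: List[Double]): (Double, List[Double]) =
  (amounts.sum, amounts.sorted)
```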