How to create a map of (key: image name, value: image file) in Scala

import java.io.File

def getListOfImageNames(dir: String): List[String] = {
  val names = new File(dir)
  names.listFiles.filter(_.isFile)
    .map(_.getName).toList
}

def getListOfImages(dir: String): List[String] = {
  val files = new File(dir)
  files.listFiles.filter(_.isFile)
    // a single filter with || — chaining one filter for ".png" and another
    // for ".jpg" would never match, since no name ends with both extensions
    .filter(f => f.getName.endsWith(".png") || f.getName.endsWith(".jpg"))
    .map(_.getPath).toList
}
I have a directory with different photos, small and large, and I have already written two methods: one pulls out only the names of the photos, the other the photos themselves. How can I now combine them into a map, then compute each photo's resolution, and, if a photo is larger than, say, 500x500, add a prefix to its name and save it in the X folder? Do you have any ideas? I'm not experienced in Scala, but I like the language very much.

As I understand it, you need a map from image name to image path. You can achieve it like so:
import java.io.File

def getImagesMap(dirPath: String): Map[String, String] = {
  val directory = new File(dirPath)
  directory.listFiles.collect {
    case file if file.isFile &&
      (file.getName.endsWith(".png") ||
       file.getName.endsWith(".jpg")) =>
      file.getName -> file.getPath
  }.toMap
}
Here I use the collect function, which is like a combination of map and filter. Inside collect there is a pattern-matching expression: if a file matches the pattern, the pair creation (file name -> file path) is evaluated; otherwise the file is filtered out. Afterwards I use toMap to convert the Array[(String, String)] into a Map[String, String]. You can read more about collect in the Scala collections documentation.
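For the second part of the question (checking resolutions and saving large images with a prefix), here is a minimal sketch. It uses javax.imageio.ImageIO from the standard library to read each image's dimensions; the 500x500 threshold and the X folder come from the question, while the large_ prefix and the copyLargeImages name are illustrative choices:
import java.io.File
import java.nio.file.{Files, Paths, StandardCopyOption}
import javax.imageio.ImageIO

// copy every image larger than 500x500 into targetDir, prefixing its name
def copyLargeImages(images: Map[String, String], targetDir: String): Unit =
  images.foreach { case (name, path) =>
    val img = ImageIO.read(new File(path)) // returns null if the file cannot be decoded
    if (img != null && img.getWidth > 500 && img.getHeight > 500)
      Files.copy(
        Paths.get(path),
        Paths.get(targetDir, s"large_$name"), // illustrative prefix
        StandardCopyOption.REPLACE_EXISTING
      )
  }
The map from getImagesMap feeds straight into it, e.g. copyLargeImages(getImagesMap("photos"), "X").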

Related

apply a function to list in Scala

I'm trying to learn Scala.
I'm asking for help understanding the foreach loop.
I have a function that reads only the last CSV from a path, but it works only when I point it at a single directory:
val path = "src/main/resources/historical_novel/"

def getLastFile(path: String, spark: SparkSession): String = {
  val hdfs = ...
}
but how can I apply this function to a list such as
val paths: List[String] = List(
  "src/main/resources/historical_novel/",
  "src/main/resources/detective/",
  "src/main/resources/adventure/",
  "src/main/resources/horror/")
I want to get a result such as:
src/main/resources/historical_novel/20221027.csv
src/main/resources/detective/20221026.csv
src/main/resources/adventure/20221026.csv
src/main/resources/horror/20221027.csv
I created a DataFrame with a path column and applied the function through withColumn, and that works,
but I want to do it with foreach so that I understand it.
Let's say your function is like this:
def f(s: String): Unit = {}
then you can simply do this:
paths.foreach(p => f(p))
After your edit, I think you may want to use map, a function that transforms a collection into another collection, like this:
val result = paths.map(p => getLastFile(p, yourSparkSession))
foreach applies a function you define or provide to each element in a collection.
The simplest example is to print each path to the console:
paths.foreach(path => println(path))
To apply a series of functions as you describe, you can use {} in the foreach body and call multiple functions:
paths.foreach(path => {
  val file = loadFile(path)
  writeToDataBase(file)
})
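Putting map and foreach together to produce the output listed in the question — a minimal sketch, assuming getLastFile behaves as described and spark is an active SparkSession:
// map turns each folder into the path of its newest CSV,
// foreach then performs the side effect of printing it
paths
  .map(p => getLastFile(p, spark))
  .foreach(println)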

Scala: Most efficient way to process files in folder based on a file list

I am trying to find the most efficient way to process files in multiple folders based on a list of allowed files.
I have a list of allowed files that I should process.
The process is as follows:
val allowedFiles = List("File1.json","File2.json","File3.json")
Get the list of folders in a directory. For this I could use:
def getListOfSubDirectories(dir: File): List[String] =
  dir.listFiles
    .filter(_.isDirectory)
    .map(_.getName)
    .toList
Loop through each folder from step 2 and get all its files. For this I would use:
def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
If a file from step 3 is in the list of allowed files, call another method that processes the file.
So I need to loop through the first directory, get the files, check whether each file needs to be processed, and then call another function. I was thinking about a double loop, which would work, but is it the most efficient way? I know that in Scala I should be using recursive functions, but I failed to write this double recursion with a call to an extra method.
Any ideas welcome.
Files.find() will do both the depth search and filter.
import java.nio.file.{Files, Path, Paths}
import scala.jdk.StreamConverters._

def getListOfFiles(dir: String, targets: Set[String]): List[Path] =
  Files.find(
    Paths.get(dir),
    999,
    (p, _) => targets(p.getFileName.toString)
  ).toScala(List)
usage:
val lof = getListOfFiles("/DataDir", allowedFiles.toSet)
But, depending on what kind of processing is required, instead of returning a List you might just process each file as it is encountered.
import java.nio.file.{Files, Path, Paths}

def processFile(path: Path): Unit = ???

def processSelected(dir: String, targets: Set[String]): Unit =
  Files.find(
    Paths.get(dir),
    999,
    (p, _) => targets(p.getFileName.toString)
  ).forEach(processFile)
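Usage mirrors the list-returning version; this sketch assumes you have replaced the ??? in processFile with a real implementation:
processSelected("/DataDir", allowedFiles.toSet) // each match is handled as soon as it is found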
You can use Files.walk.
The code would look like this (I didn't compile it, so it may have some typos):
import java.nio.file.{Files, Path}
import scala.jdk.StreamConverters._

def getFilesRecursive(initialFolder: Path, allowedFiles: Set[String]): List[Path] = {
  // lowercase both sides; otherwise mixed-case names such as "File1.json" would never match
  val allowed = allowedFiles.map(_.toLowerCase)
  Files
    .walk(initialFolder)
    .filter(path => allowed.contains(path.getFileName.toString.toLowerCase))
    .toScala(List)
}
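A usage sketch, reusing the allowedFiles list from the question and the illustrative /DataDir directory from the previous answer:
import java.nio.file.Paths

val found: List[Path] = getFilesRecursive(Paths.get("/DataDir"), allowedFiles.toSet)
found.foreach(println) // one matching Path per line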
I'm no expert on Scala (last time I dabbled in it was probably 18 years ago) but I figured there had to be a way to take this code:
def getListOfSubDirectories(dir: File): List[String] =
  dir.listFiles
    .filter(_.isDirectory)
    .map(_.getName)
    .toList
And eliminate at least one extra list creation. I found this SO question which was instructive, and then did a Google search for withFilter.
Looks like you can take that bit above and translate it to the following. By replacing filter with withFilter, no intermediate list is created and then iterated over.
def getListOfSubDirectories(dir: File): List[String] =
  dir.listFiles
    .withFilter(_.isDirectory)
    .map(_.getName)
    .toList
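A small standalone demonstration of the difference — a sketch, not part of the original answer:
val xs = List(1, 2, 3, 4)

val strict = xs.filter(_ % 2 == 0)       // allocates an intermediate List(2, 4)
val deferred = xs.withFilter(_ % 2 == 0) // lazy wrapper, no new collection yet

// the predicate is only applied here, fused with the map step
val doubled = deferred.map(_ * 2)        // List(4, 8)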

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of the Stack Overflow user base. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation, in this case "11849", and the number from Age, in this example "35", and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out lines that don't have an Age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data wherever there is a "
When I split on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also, I need to do this for every line of the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array, and now I've successfully managed to assign one value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given its name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
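The same Option-in-flatMap pattern works on plain collections, which makes it easy to test findElement locally before running it on the cluster (a sketch with made-up sample lines):
val sample = List(
  """<row Id="1" Reputation="100" Age="30" />""",
  """<row Id="2" Reputation="50" />""" // no Age: yields None and is dropped
)

val pairs = sample.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
// pairs == List((100, 30))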
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML

// helper function mapping Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] =
  if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key from your line and converts it to the target type (getValueForKeyAs), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // convert explicitly: a bare asInstanceOf from String to A compiles due to
  // type erasure but would fail with a ClassCastException at first use
  convert(res(1).split("\"")(1))
}

How to add a line number to each line?

Suppose this is my data:
‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
‘Map’ is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
’Reducer’ is responsible to process the intermediate.
output received from the mapper and generate the final output.
and I want to add a number to every line, like the output below:
1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
2,‘Map’ is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,’Reducer’ is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
Then save them to a file.
I've tried:
object DS_E5 {
  def main(args: Array[String]): Unit = {
    var i = 0
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val sample1 = sc.textFile("data.txt")
    for (sample <- sample1) {
      i = i + 1
      val ss = sample.map(l => (i, sample))
      println(ss)
    }
  }
}
but its output is like below:
Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.))
...
How can I edit my code to generate my desired output?
zipWithIndex is what you need here. It maps an RDD[T] to an RDD[(T, Long)] by adding the index in the second position of the pair. Note that the index starts at 0, so add 1 to match the numbering in the question.
sample1
  .zipWithIndex()
  .map { case (line, i) => (i + 1).toString + "," + line }
or using string interpolation (see the comment by @DanielC.Sobral)
sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" }
By calling val sample1 = sc.textFile("data.txt") you are creating a new RDD.
If you just need the output, you can try the following code:
sample1.zipWithIndex().foreach(f => println(s"${f._2 + 1},${f._1}"))
Basically, this code does the following:
Using .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a tuple, T is the previous RDD's element datatype (java.lang.String, I believe), and Long is the index of the element in the RDD.
You performed a transformation, so now you need an action, and foreach suits this case very well: it applies your statement to every element of the current RDD, so we just call a quickly formatted println.
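To also satisfy the "save them to a file" part of the question — a sketch; saveAsTextFile is the standard RDD action for this, and the output path is illustrative (Spark writes a directory of part files, not a single file):
sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" }
  .saveAsTextFile("numbered_output")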

How to read input from a file and convert data lines of the file to List[Map[Int,String]] using scala?

My query is: read input from a file and convert the data lines of the file to List[Map[Int,String]] using Scala. Here I give a dataset as the input. My code is:
def id3(attrs: Attributes,
        examples: List[Example],
        label: Symbol): Node = {
  level = level + 1
  // if all the examples have the same label, return a new leaf with that label
  if (examples.forall(x => x(label) == examples(0)(label))) {
    new Leaf(examples(0)(label))
  } else {
    for (a <- attrs.keySet - label) { // except the label, take all attrs
      println("Information gain for %s is %f".format(a,
        informationGain(a, attrs, examples, label)))
    }
    // find the best splitting attribute - this is an argmax on a function over the list
    var bestAttr: Symbol = argmax(attrs.keySet - label, (x: Symbol) =>
      informationGain(x, attrs, examples, label))
    // now we produce a new branch, which splits on that node, and recurse down the nodes
    var branch = new Branch(bestAttr)
    for (v <- attrs(bestAttr)) {
      val subset = examples.filter(x => x(bestAttr) == v)
      if (subset.size == 0) {
        // println(levstr + "Tiny subset!")
        // empty subset: replace it with a leaf labelled with the most common
        // label among the examples
        val m = examples.map(_(label))
        val mostCommonLabel = m.toSet.map((x: Symbol) => (x, m.count(_ == x))).maxBy(_._2)._1
        branch.add(v, new Leaf(mostCommonLabel))
      } else {
        // println(levstr + "Branch on %s=%s!".format(bestAttr, v))
        branch.add(v, id3(attrs, subset, label))
      }
    }
    level = level - 1
    branch
  }
}
} // closes the enclosing object (its opening line is not shown)
object samplet {
  def main(args: Array[String]): Unit = {
    var attrs: sample.Attributes = Map()
    attrs += ('0 -> Set('abc, 'nbv, 'zxc))
    attrs += ('1 -> Set('def, 'ftr, 'tyh))
    attrs += ('2 -> Set('ghi, 'azxc))
    attrs += ('3 -> Set('jkl, 'fds))
    attrs += ('4 -> Set('mno, 'nbh))
    val examples: List[sample.Example] = List(
      Map(
        '0 -> 'abc,
        '1 -> 'def,
        '2 -> 'ghi,
        '3 -> 'jkl,
        '4 -> 'mno
      ),
      ........................
    )
    // obviously we can't use the label as an attribute, that would be silly!
    val label = 'play
    println(sample.try(attrs, examples, label).getStr(0))
  }
}
But how do I change this code to accept input from a .csv file?
I suggest you use Java's io / nio standard library to read your CSV file. I think there is no relevant drawback in doing so.
But the first question we need to answer is: where in the code should the file be read? The parsed input seems to replace the value of examples. This fact also hints at what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class:
class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}
Note that the Charset is only needed if we must distinguish between differently encoded CSV-files.
Okay, so how do we implement the method? It should do the following:
Create an appropriate input reader
Read all lines
Split each line at the comma-separator
Transform each substring into the symbol it represents
Build a map from the list of symbols, using the attributes as keys
Create and return the list of maps
Or expressed in code:
import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
import scala.collection.JavaConversions

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file and discard the first (header) line */
    inputWithHeader(reader).tail
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. There are many ways to do this, and this is probably not the prettiest. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] =
    (JavaConversions.asScalaIterator(reader.lines().iterator()) foldLeft Nil.asInstanceOf[List[Map[Symbol, Symbol]]]) {
      (accumulator, nextLine) => parseLine(nextLine) :: accumulator
    }.reverse

  /** Parse an entry. Does not verify the input: if there are fewer attributes than columns or vice versa, zip creates a list of the size of the shorter list. */
  private def parseLine(line: String): Map[Symbol, Symbol] =
    (Attributes zip (line split Separator map parseSymbol)).toMap

  /** Create a symbol from a String... we could also check whether the string represents a valid symbol. */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}
Caveat: since we expect only valid input, we assume the individual symbol representations do not contain the comma separator. If this cannot be assumed, the code as written would fail to split certain valid input strings correctly.
To use this new code, we could change the main-method as follows:
import java.nio.file.Paths

def main(args: Array[String]): Unit = {
  val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
  val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code
Here, examples uses the value exampleInput, which is the current, hardcoded value of examples if no input argument is specified.
Important: In the code, all error handling has been omitted for convenience. In most cases, errors can occur when reading from files, and user input must be considered invalid, so, sadly, error handling at the boundaries of your program is usually not optional.
Side-notes:
Try not to use null in your code. Returning Option[T] is a better option than returning null, because it makes "nullness" explicit and provides static safety thanks to the type-system.
The return-keyword is not required in Scala, as the last value of a method is always returned. You can still use the keyword if you find the code more readable or if you want to break in the middle of your method (which is usually a bad idea).
Prefer val over var, because immutable values are much easier to understand than mutable values.
The code will fail with the provided CSV string, because it contains the symbols TRUE and FALSE, which are not legal according to your program's logic (they should be true and false instead).
Add all information to your error messages. Your error message only tells me that a value for the attribute 'wind is bad, but it does not tell me what the actual value is.
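To illustrate the null/Option side-note with a minimal sketch (a hypothetical helper, not code from the answer):
// Option makes possible absence explicit instead of hiding it behind null
def headerAttribute(attrs: List[Symbol], index: Int): Option[Symbol] =
  attrs.lift(index) // None when the index is out of range, never null

val attr = headerAttribute(List('outlook, 'wind), 5) // None; no NullPointerException possible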
Read a CSV file:
import scala.io.Source

val datalines = Source.fromFile(filepath).getLines()
So datalines contains all the lines of the CSV file, as an Iterator[String].
Next, convert each line into a Map[Int,String]:
val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}
Here we split each line on ",", then construct a map whose keys are the column numbers and whose values are the words after the split.
Next, if we want a List[Map[Int,String]], we add toList:
val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}.toList
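An end-to-end sketch tying this together (the file name data.csv is illustrative). Note that a Source should be closed when you are done, and that getLines returns an Iterator that can be consumed only once:
import scala.io.Source

val source = Source.fromFile("data.csv")
try {
  val rows: List[Map[Int, String]] = source.getLines()
    .map(_.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap)
    .toList
  rows.foreach(println) // e.g. Map(0 -> abc, 1 -> def, ...)
} finally source.close()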