Scala how to use regex on endsWith? - scala

I'm trying to figure out how to isolate all file extensions from a list of file names using regex and endsWith.
So as an example
input:
file.txt, notepad.exe
output:
txt, exe
What my idea is, is to use filter to get file names that endsWith("."_). But endsWith("."_) doesn't work.
Any suggestions?

You really do not want to filter, you want to map each filename into its extension.
(and maybe then collect only the ones that had an extension and probably you only want each unique extension)
You can use a regex for that.
object ExtExtractor {
val ExtRegex = """.*\.(\w+)?""".r
def apply(data: List[String]): Set[String] =
data.iterator.collect {
case ExtRegex(ext) => ext.toLowerCase
}.toSet
}
You can see it running here.

how about using split('.') which will return a
String[] parts = fileName.split("\\.");
String extension = parts[parts.length-1];

Related

%3d instead of = in file path, then i try to open file from resources

I write some tests and to get absolute path from relative path i use this function
private def getAbsolutePath(filePath: String): String = {
getClass.getResource(filePath).getFile
}
and then i do:
println(getAbsolutePath("/parquetIncrementalProcessor/withPartitioning/"))
println(getAbsolutePath("/parquetIncrementalProcessor/withPartitioning/own_loading_id=1/partition_column=test/"))
i get:
/Users/19658296/csp-fp-snaphot/library/target/scala-2.11/test-classes/parquetIncrementalProcessor/withPartitioning/
/Users/19658296/csp-fp-snaphot/library/target/scala-2.11/test-classes/parquetIncrementalProcessor/withPartitioning/own_loading_id%3d1/partition_column%3dtest/
As you can see, instead of =, I get some strange symbol. At the same time, when I try to read these files with a park, he can read the path without %3d, and with %3d he gets the error "Path does not exist".
How can I fix this?
Seems like its URL encoded, maybe because using stuff from files and resources are designed to work with Universal Resource Locators. You can URLDecode it like so:
import java.net.URLDecoder
def getAbsolutePath(filePath: String): String = {
val path = getClass.getResource(filePath).getFile
URLDecoder.decode(path, "UTF-8")
}

Scala changing parquet path in config (typesafe)

Currently I have a configuration file like this:
project {
inputs {
baseFile {
paths = ["project/src/test/resources/inputs/parquet1/date=2020-11-01/"]
type = parquet
applyConversions = false
}
}
}
And I want to change the date "2020-11-01" to another one during run time. I read I need a new config object since it's immutable, I'm trying this but I'm not quite sure how to edit paths since it's a list and not a String and it definitely needs to be a list or else it's going to say I haven't configured a path for the parquet.
val newConfig = config.withValue("project.inputs.baseFile.paths"(0),
ConfigValueFactory.fromAnyRef("project/src/test/resources/inputs/parquet1/date=2020-10-01/"))
But I'm getting a:
Error com.typesafe.config.ConfigException$BadPath: path parameter: Invalid path 'project.inputs.baseFile.': path has a leading, trailing, or two adjacent period '.' (use quoted "" empty string if you want an empty element)
What's the correct way to set the new path?
One option you have, is to override the entire array:
import scala.collection.JavaConverters._
val mergedConfig = config.withValue("project.inputs.baseFile.paths",
ConfigValueFactory.fromAnyRef(Seq("project/src/test/resources/inputs/parquet1/date=2020-10-01/").asJava))
But a more elegant way to do this (IMHO), is to create a new config, and to use the existing as a fallback.
For example, we can create a new config:
val newJsonString = """project {
|inputs {
|baseFile {
| paths = ["project/src/test/resources/inputs/parquet1/date=2020-10-01/"]
|}}}""".stripMargin
val newConfig = ConfigFactory.parseString(newJsonString)
And now to merge them:
val mergedConfig = newConfig.withFallback(config)
The output of:
println(mergedConfig.getList("project.inputs.baseFile.paths"))
println(mergedConfig.getString("project.inputs.baseFile.type"))
is:
SimpleConfigList(["project/src/test/resources/inputs/parquet1/date=2020-10-01/"])
parquet
As expected.
You can read more about Merging config trees. Code run at Scastie.
I didn't find any way to replace one element of the array with withValue.

Way to Extract list of elements from Scala list

I have standard list of objects which is used for the some analysis. The analysis generates a list of Strings and i need to look through the standard list of objects and retrieve objects with same name.
case class TestObj(name:String,positions:List[Int],present:Boolean)
val stdLis:List[TestObj]
//analysis generates a list of strings
var generatedLis:List[String]
//list to save objects found in standard list
val lisBuf = new ListBuffer[TestObj]()
//my current way
generatedLis.foreach{i=>
val temp = stdLis.filter(p=>p.name.equalsIgnoreCase(i))
if(temp.size==1){
lisBuf.append(temp(0))
}
}
Is there any other way to achieve this. Like having an custom indexof method that over rides and looks for the name instead of the whole object or something. I have not tried that approach as i am not sure about it.
stdLis.filter(testObj => generatedLis.exists(_.equalsIgnoreCase(testObj.name)))
use filter to filter elements from 'stdLis' per predicate
use exists to check if 'generatedLis' has a value of ....
Don't use mutable containers to filter sequences.
Naive solution:
val lisBuf =
for {
str <- generatedLis
temp = stdLis.filter(_.name.equalsIgnoreCase(str))
if temp.size == 1
} yield temp(0)
if we discard condition temp.size == 1 (i'm not sure it is legal or not):
val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))

Reading and filtering a file into a string

I am new to Scala and I need to read the contents of a text file into a string while removing certain lines at the same time. The lines to be removed can be identified with a substring match. I could come up with the following solution, which almost works, the only problem is that the newlines are removed:
val fileAsFilteredString = io.Source.fromFile("file.txt").getLines.filter(s => !(s contains "filter these")).mkString;
How can I keep the newlines?
Add some parameters to mkString:
val fileAsFilteredString = io.Source.fromFile("file.txt").getLines
.filter(s => !(s contains "filter these")).mkString("\n")

Preferred way of processing this data with parallel arrays

Imagine a sequence of java.io.File objects. The sequence is not in any particular order, it gets populated after a directory traversal. The names of the files can be like this:
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
Basically, I can have 3 types of files. First type is the simple ones, which only have a .bin extension. The second type of file is the one formed from _x1.bin till _x5.bin. And the third type of file can be formed of 10 smaller parts, from _part1 till _part10.
I know the naming may be strange, but this is what I have to work with :)
I want to group the files together ( all the pieces of a file should be processed together ), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how can I perform the reduce/acumulation part, since all the threads will be working on the same array.
val allBinFiles = allBins.toArray // array of java.io.File
I was thinking of handling something like that:
val mapAcumulator = java.util.Collections.synchronizedMap[String,ListBuffer[File]](new java.util.HashMap[String,ListBuffer[File]]())
allBinFiles.par.foreach { file =>
file match {
// for something like /some/x_file_x4.bin nameTillPart will be /some/x_file
case ComposedOf5Name(nameTillPart) => {
mapAcumulator.getOrElseUpdate(nameTillPart,new ListBuffer[File]()) += file
}
case ComposedOf10Name(nameTillPart) => {
mapAcumulator.getOrElseUpdate(nameTillPart,new ListBuffer[File]()) += file
}
// simple file, without any pieces
case _ => {
mapAcumulator.getOrElseUpdate(file.toString,new ListBuffer[File]()) += file
}
}
}
I was thinking of doing it like I've shown in the above code. Having extractors for the files, and using part of the path as key in the map. Like for example, /some/x_file can hold as values /some/x_file_x1.bin to /some/x_file_x5.bin. I also think there could be a better way of handling this. I would be interested in your opinions.
The alternative is to use groupBy:
val mp = allBinFiles.par.groupBy {
case ComposedOf5Name(x) => x
case ComposedOf10Name(x) => x
case f => f.toString
}
This will return a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files from this point:
val sqmp = mp.map(_.seq).seq
To ensure that the parallelism kicks in, make sure you have enough elements in you parallel array (10k+).