Scala changing parquet path in config (typesafe)

Currently I have a configuration file like this:
project {
  inputs {
    baseFile {
      paths = ["project/src/test/resources/inputs/parquet1/date=2020-11-01/"]
      type = parquet
      applyConversions = false
    }
  }
}
And I want to change the date "2020-11-01" to another one at run time. I read that I need to build a new Config object since the existing one is immutable. I'm trying the code below, but I'm not sure how to edit paths, since it's a list and not a String, and it has to stay a list or else the parquet input will complain that no path is configured.
val newConfig = config.withValue("project.inputs.baseFile.paths"(0),
ConfigValueFactory.fromAnyRef("project/src/test/resources/inputs/parquet1/date=2020-10-01/"))
But I'm getting a:
Error com.typesafe.config.ConfigException$BadPath: path parameter: Invalid path 'project.inputs.baseFile.': path has a leading, trailing, or two adjacent period '.' (use quoted "" empty string if you want an empty element)
What's the correct way to set the new path?

One option you have is to override the entire array:
import scala.collection.JavaConverters._
val mergedConfig = config.withValue("project.inputs.baseFile.paths",
  ConfigValueFactory.fromAnyRef(Seq("project/src/test/resources/inputs/parquet1/date=2020-10-01/").asJava))
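A quick check (using the mergedConfig from just above) should confirm the override, printing something like:
println(mergedConfig.getStringList("project.inputs.baseFile.paths"))
// [project/src/test/resources/inputs/parquet1/date=2020-10-01/]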
But a more elegant way to do this (IMHO) is to create a new config and use the existing one as a fallback.
For example, we can create a new config:
val newJsonString = """project {
  |  inputs {
  |    baseFile {
  |      paths = ["project/src/test/resources/inputs/parquet1/date=2020-10-01/"]
  |}}}""".stripMargin
val newConfig = ConfigFactory.parseString(newJsonString)
And now to merge them:
val mergedConfig = newConfig.withFallback(config)
The output of:
println(mergedConfig.getList("project.inputs.baseFile.paths"))
println(mergedConfig.getString("project.inputs.baseFile.type"))
is:
SimpleConfigList(["project/src/test/resources/inputs/parquet1/date=2020-10-01/"])
parquet
As expected.
You can read more about Merging config trees. Code run at Scastie.
I didn't find any way to replace one element of the array with withValue.
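If you only want to swap one element, a workaround sketch (still replacing the whole list via withValue; the index 0 and the new date below are just examples) is to read the current list, update the entry, and write the full list back:
import scala.collection.JavaConverters._
import com.typesafe.config.ConfigValueFactory

val oldPaths = config.getStringList("project.inputs.baseFile.paths").asScala
val newPaths = oldPaths.updated(0, "project/src/test/resources/inputs/parquet1/date=2020-10-01/")
val patchedConfig = config.withValue("project.inputs.baseFile.paths",
  ConfigValueFactory.fromIterable(newPaths.asJava))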

Related

Scala how to use regex on endsWith?

I'm trying to figure out how to isolate all file extensions from a list of file names using regex and endsWith.
So as an example
input:
file.txt, notepad.exe
output:
txt, exe
My idea is to use filter to get the file names that endsWith("."_). But endsWith("."_) doesn't work.
Any suggestions?
You really do not want to filter; you want to map each filename into its extension (and maybe collect only the ones that actually had an extension, and probably you only want each unique extension).
You can use a regex for that.
object ExtExtractor {
  // capture the word characters after the last '.' in the file name
  val ExtRegex = """.*\.(\w+)""".r

  def apply(data: List[String]): Set[String] =
    data.iterator.collect {
      case ExtRegex(ext) => ext.toLowerCase
    }.toSet
}
You can see it running here.
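For example (the file names here are made up), the following should print a de-duplicated, lower-cased set of extensions:
println(ExtExtractor(List("file.txt", "notepad.exe", "archive.TXT", "README")))
// Set(txt, exe)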
How about using split('.'), which will return the name in parts; the last part is the extension:
val parts = fileName.split('.')
val extension = parts(parts.length - 1)

Given the file path find the file extension using Scala?

I am trying to find the file type in order to read the file based on its type.
Inputs come in different file formats such as CSV, Excel, ORC, etc.
for example input =>"D:\\resources\\core_dataset.csv"
I am expecting output => csv
You could achieve this as follows:
import java.nio.file.Paths
val path = "/home/gmc/exists.csv"
val fileName = Paths.get(path).getFileName // Convert the path string to a Path object and get the "base name" from that path.
val extension = fileName.toString.split("\\.").last // Split the "base name" on a . and take the last element - which is the extension.
// The above produces:
extension: String = csv
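If the goal is to pick a reader based on the extension, a minimal sketch of dispatching on it could look like this (the handlers are placeholders):
import java.nio.file.Paths

def extensionOf(path: String): String =
  Paths.get(path).getFileName.toString.split("\\.").last.toLowerCase

extensionOf("D:\\resources\\core_dataset.csv") match {
  case "csv"          => println("read as CSV")
  case "xls" | "xlsx" => println("read as Excel")
  case "orc"          => println("read as ORC")
  case other          => println(s"unsupported file type: $other")
}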

How to dump nested list using SnakeYAML

I'm parsing this yaml file
View:
  from : 01.01.2007
  to : 04.01.2007
  driver : sun.jdbc.odbc.JdbcOdbcDriver
using SnakeYAML in Scala like this:
import org.yaml.snakeyaml.Yaml
import scala.collection.JavaConverters._

val stream = getClass.getResourceAsStream("/config_view.yml")
val configMap = new Yaml().load(stream).asInstanceOf[java.util.Map[String, Any]].asScala
var view = configMap("View").asInstanceOf[java.util.LinkedHashMap[String, String]].asScala
view = view + ("from" -> "neu") // some test modification
and I dump it like this:
val fileWriter = new FileWriter(System.getProperty("user.home") + "\\Desktop\\test.yml")
new Yaml().dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)
which saves the new yaml file like this:
View: {driver: sun.jdbc.odbc.JdbcOdbcDriver, from: neu, to: 04.01.2007}
But I want it to save it like this:
View:
  driver: sun.jdbc.odbc.JdbcOdbcDriver
  from: neu
  to: 04.01.2007
How can I tell SnakeYAML to save it in the desired format you see above?
By default SnakeYAML uses DumperOptions.FlowStyle.FLOW, but it can be changed to DumperOptions.FlowStyle.BLOCK, which will dump the data in the desired format.
An example in Kotlin:
val options = DumperOptions()
options.indent = 2
options.defaultFlowStyle = DumperOptions.FlowStyle.BLOCK
Yaml(options).dump(yourObject)
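The same settings translated to Scala, reusing the view and fileWriter values from the question:
import org.yaml.snakeyaml.{DumperOptions, Yaml}

val options = new DumperOptions()
options.setIndent(2)
options.setDefaultFlowStyle(DumperOptions.FlowStyle.BLOCK)
new Yaml(options).dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)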
How about manually handling the indentation and key: value formatting:
view.map{ case (k,v) => s"\t$k: $v\n" }
In the case of nested maps you will want a method that:
- accepts the current "level" of nesting, and places that many tabs in front of the output to give the proper nesting
- checks each of the entries; if an entry is another collection type, the method recursively invokes itself, which increases the indentation level

Compute file content hash with Scala

In our app, we need to compute a file hash, so we can check later whether the file was updated.
The way I am doing it right now is with this little method:
protected[services] def computeMigrationHash(toVersion: Int): String = {
  val migrationClassName = MigrationClassNameFormat.format(toVersion, toVersion)
  val migrationClass = Class.forName(migrationClassName)
  val fileName = migrationClass.getName.replace('.', '/') + ".class"
  val resource = getClass.getClassLoader.getResource(fileName)
  logger.debug("Migration file - " + resource.getFile)
  val file = new File(resource.getFile)
  val hc = Files.hash(file, Hashing.md5())
  logger.debug("Calculated migration file hash - " + hc.toString)
  hc.toString
}
It all works perfectly until the code gets deployed into a different environment and the file is located at a different absolute path. I guess the hashing takes the path into account as well.
What is the best way to calculate some sort of reliable hash of a file's content that will produce the same result for as long as the content of the file stays the same?
Thanks,
Having perused the source code https://github.com/google/guava/blob/master/guava/src/com/google/common/io/Files.java - only the file contents are hashed - the path does not come into play.
public static HashCode hash(File file, HashFunction hashFunction) throws IOException {
  return asByteSource(file).hash(hashFunction);
}
Therefore you need not worry about the locality of the file. As for why you end up with a different hash on a different filesystem: maybe you should compare the sizes/contents to ensure that no compound EOLs (e.g. CRLF) were introduced.
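If you want to take the path out of the picture entirely (and also stay safe when the class ends up inside a jar, where resource.getFile is not a plain file path), one option is to hash the bytes read from the classloader stream instead of a File. A minimal sketch with java.security.MessageDigest; resourceName would be the same fileName computed in the question's method:
import java.security.{DigestInputStream, MessageDigest}

def contentHash(resourceName: String): String = {
  val md = MessageDigest.getInstance("MD5")
  val in = new DigestInputStream(getClass.getClassLoader.getResourceAsStream(resourceName), md)
  try {
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {} // reading updates the digest as a side effect
  } finally in.close()
  md.digest().map("%02x".format(_)).mkString
}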

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, I saw that it was very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take for ever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use the following approach. First, you can get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and ListObjects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Converting to a Scala collection
  files.asScala
}
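For example (the bucket and prefix are placeholders):
val files = listFiles("mybucket", "path/txt/1996/04/")
// files: Buffer[String], each entry like s3://mybucket/path/txt/1996/04/09/filename.txt.gz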
Now pass this list to the following piece of code (note: sc here is the SparkContext):
import org.apache.spark.rdd.RDD

var unified: RDD[String] = null
for (file <- files) {
  val fileRdd = sc.textFile(file)
  if (unified != null) {
    unified = unified.union(fileRdd)
  } else {
    unified = fileRdd
  }
}
Now you have a final unified RDD.
Optionally, you can also repartition it into a single partition:
val singleBigRdd = unified.repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates a pair RDD where the key is the file name and the value is the content of the whole file as a string. You can then map this so the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
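A minimal sketch of that idea (the path, the date-extraction regex and the grouping step are assumptions based on the layout described in the question):
val dateFromPath = """.*/txt/(\d{4})/(\d{2})/(\d{2})/.*""".r

val byDay = sc.wholeTextFiles("s3n://mybucket/path/txt/1996/04/*/*.txt.gz")
  .map { case (path, content) =>
    val dateFromPath(y, m, d) = path
    (s"$y-$m-$d", (path, content))
  }
  .groupByKey() // one record per day: (date, all (filename, content) pairs of that day)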
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz"), as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, creating this file list will be a bottleneck. I'd recommend creating a file that gets appended with the file path every time a file is uploaded to S3.
Same for the output: put it into HDFS and then move it to S3 (although a direct copy might be equally efficient).