Hadoop FileSystem: repeated calls to the rename method when the target dir exists - Scala

There are two directories: /data/src/partition (containing files such as file1.parquet) and /data/target.
The task is to move the folder "partition" with all its content into the target dir. I expected each move to be executed atomically, with the number of files in the target dir growing on each move:
1. First move:
ls:
/data/src/partition/file1.parquet  /data/target
mv:
/data/src -> /data/target/partition/file1.parquet
2. Second move:
ls:
/data/src/partition/file2.parquet  /data/target/partition/file1.parquet
mv:
/data/src -> /data/target/partition/file1.parquet
             /data/target/partition/file2.parquet
3. etc...
Note: src/partition may contain more than one file.
I tried to do this with the Hadoop Java library's FileSystem.rename() method (as I understand, this is what -mv uses under the hood):
1. First move:
// Given /data/src/partition/file1.parquet
val srcPath = new Path("/data/src/partition")
val targetPath = new Path("data/target")
fs.rename(srcPath, targetPath)
This worked well:
data/src -> /data/target/partition/file1.parquet
2. Second move:
// Given /data/src/partition/file2.parquet
val srcPath = new Path("/data/src/partition")
val targetPath = new Path("data/target")
fs.rename(srcPath, targetPath)
This ends in failure; the result is:
/data/src/partition/file2.parquet  /data/target/partition/file1.parquet
So nothing is moved.
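For what it's worth, the two-argument rename overload reports failure through its Boolean return value rather than throwing, so checking it makes the failed second move visible. Below is a minimal sketch, assuming an HDFS FileSystem obtained from the default Configuration (the paths are the ones from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
// Sketch: surface the result of rename instead of ignoring it.
val fs = FileSystem.get(new Configuration())
val srcPath = new Path("/data/src/partition")
val targetPath = new Path("/data/target")
val moved = fs.rename(srcPath, targetPath)
if (!moved)
  println(s"rename($srcPath, $targetPath) returned false - " +
    "the destination /data/target/partition may already exist")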
What am I doing wrong?

Related

Scala changing parquet path in config (typesafe)

Currently I have a configuration file like this:
project {
  inputs {
    baseFile {
      paths = ["project/src/test/resources/inputs/parquet1/date=2020-11-01/"]
      type = parquet
      applyConversions = false
    }
  }
}
And I want to change the date "2020-11-01" to another one at run time. I read that I need to build a new config object since the existing one is immutable. I'm trying the following, but I'm not quite sure how to edit paths, since it's a list and not a String, and it definitely needs to stay a list or else it says I haven't configured a path for the parquet.
val newConfig = config.withValue("project.inputs.baseFile.paths"(0),
ConfigValueFactory.fromAnyRef("project/src/test/resources/inputs/parquet1/date=2020-10-01/"))
But I'm getting a:
Error com.typesafe.config.ConfigException$BadPath: path parameter: Invalid path 'project.inputs.baseFile.': path has a leading, trailing, or two adjacent period '.' (use quoted "" empty string if you want an empty element)
What's the correct way to set the new path?
One option you have is to override the entire array:
import scala.collection.JavaConverters._

val mergedConfig = config.withValue("project.inputs.baseFile.paths",
  ConfigValueFactory.fromAnyRef(Seq("project/src/test/resources/inputs/parquet1/date=2020-10-01/").asJava))
But a more elegant way to do this (IMHO) is to create a new config and use the existing one as a fallback.
For example, we can create a new config:
val newJsonString =
  """project {
    |  inputs {
    |    baseFile {
    |      paths = ["project/src/test/resources/inputs/parquet1/date=2020-10-01/"]
    |}}}""".stripMargin
val newConfig = ConfigFactory.parseString(newJsonString)
And now to merge them:
val mergedConfig = newConfig.withFallback(config)
The output of:
println(mergedConfig.getList("project.inputs.baseFile.paths"))
println(mergedConfig.getString("project.inputs.baseFile.type"))
is:
SimpleConfigList(["project/src/test/resources/inputs/parquet1/date=2020-10-01/"])
parquet
As expected.
You can read more about Merging config trees in the Typesafe Config documentation. The code can be run at Scastie.
I didn't find any way to replace one element of the array with withValue.
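A possible workaround for replacing a single element (my own sketch, not part of the answer above): read the current list, update the element at the desired index, and write the whole list back with withValue:
import scala.collection.JavaConverters._
import com.typesafe.config.{Config, ConfigValueFactory}
// Sketch: rebuild the list with one element replaced and overwrite the whole path.
def replacePathAt(config: Config, index: Int, newPath: String): Config = {
  val paths = config.getStringList("project.inputs.baseFile.paths").asScala.toList
  val updated = paths.updated(index, newPath)
  config.withValue("project.inputs.baseFile.paths",
    ConfigValueFactory.fromIterable(updated.asJava))
}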

Hadoop copyMerge not working properly: Scala

I'm trying to combine 3 files present in HDFS through Scala. All 3 files are in the HDFS location srcPath, as used in the code below.
I created a function as below:
def mergeFiles(conf: Configuration, fs: FileSystem, srcPath: Path, dstPath: String, finalFileName: String): Unit = {
  val localfs = FileSystem.getLocal(conf)
  val status = fs.listStatus(srcPath)
  status.foreach(x =>
    FileUtil.copyMerge(fs, x.getPath, localfs, new Path(dstPath.toString), false, conf, null)
  )
}
I tried executing this: no result, no error, and not even a file gets created.
I verified that I'm passing all the arguments properly.
Any clues?
The second argument of copyMerge is a directory, not an individual file.
This should work:
FileUtil.copyMerge(fs, srcPath, localfs, new Path(dstPath.toString), false, conf, null)
Usually reading the source code is the best way to debug such issues.
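For completeness, the whole function with that fix applied might look like this (a reconstruction from the snippets above; using finalFileName as the name of the merged file is my assumption):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
def mergeFiles(conf: Configuration, fs: FileSystem, srcPath: Path, dstPath: String, finalFileName: String): Unit = {
  val localfs = FileSystem.getLocal(conf)
  // Pass the source directory once; copyMerge walks its files itself.
  FileUtil.copyMerge(fs, srcPath, localfs, new Path(dstPath, finalFileName), false, conf, null)
}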
Note that the FileUtil#copyMerge method has been removed in Hadoop 3.x. See these issues for details on the change:
https://issues.apache.org/jira/browse/HADOOP-12967
https://issues.apache.org/jira/browse/HADOOP-11392
You can use getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
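If you are on Hadoop 3.x, where copyMerge no longer exists, one option is to hand-roll the merge. A rough sketch (my own replacement, not the removed method, and it ignores edge cases such as nested directories):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
// Sketch: stream every file in srcDir into a single destination file.
def manualCopyMerge(fs: FileSystem, srcDir: Path, dstFile: Path, conf: Configuration): Unit = {
  val out = fs.create(dstFile)
  try {
    fs.listStatus(srcDir).filter(_.isFile).sortBy(_.getPath.getName).foreach { status =>
      val in = fs.open(status.getPath)
      try IOUtils.copyBytes(in, out, conf, false)
      finally in.close()
    }
  } finally out.close()
}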

Generating a single output file for each processed input file in Apache Flink

I am using Scala and Apache Flink to build an ETL that periodically reads all the files under a directory in my local file system and writes the result of processing each file to a single output file under another directory.
So an example of this would be:
/dir/to/input/files/file1
/dir/to/input/files/file2
/dir/to/input/files/file3
and the output of the ETL would be exactly:
/dir/to/output/files/file1
/dir/to/output/files/file2
/dir/to/output/files/file3
I have tried various approaches, including reducing the parallelism to one when writing to the data sink, but I still can't achieve the required result.
This is my current code:
val path = "/path/to/input/files/"
val format = new TextInputFormat(new Path(path))
val socketStream = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10)
val wordsStream = socketStream.flatMap(value => value.split(",")).map(value => WordWithCount(value,1))
val keyValuePair = wordsStream.keyBy(_.word)
val countPair = keyValuePair.sum("count")
countPair.print()
countPair.writeAsText("/path/to/output/directory/"+
DateTime.now().getHourOfDay.toString
+
DateTime.now().getMinuteOfHour.toString
+
DateTime.now().getSecondOfMinute.toString
, FileSystem.WriteMode.NO_OVERWRITE)
// The first write method I tried:
val sink = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink.setBucketer(new DateTimeBucketer[WordWithCount]("yyyy-MM-dd--HHmm"))
// The second write method I tried:
val sink3 = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink3.setUseTruncate(false)
sink3.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
sink3.setWriter(new StringWriter[WordWithCount])
sink3.setBatchSize(3)
sink3.setPendingPrefix("file-")
sink3.setPendingSuffix(".txt")
Both writing methods fail to produce the wanted result.
Can someone with experience with Apache Flink guide me to the right approach, please?
I solved this issue by adding the following dependencies to run on my local machine:
hadoop-aws-2.7.3.jar
aws-java-sdk-s3-1.11.183.jar
aws-java-sdk-core-1.11.183.jar
aws-java-sdk-kms-1.11.183.jar
jackson-annotations-2.6.7.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
joda-time-2.8.1.jar
httpcore-4.4.4.jar
httpclient-4.5.3.jar
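If the project is built with sbt, the same jars could be pulled in roughly like this (a sketch: the versions are the ones listed above, while the group IDs are my assumption about the usual Maven coordinates):
libraryDependencies ++= Seq(
  "org.apache.hadoop"          % "hadoop-aws"          % "2.7.3",
  "com.amazonaws"              % "aws-java-sdk-s3"     % "1.11.183",
  "com.amazonaws"              % "aws-java-sdk-core"   % "1.11.183",
  "com.amazonaws"              % "aws-java-sdk-kms"    % "1.11.183",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.6.7",
  "com.fasterxml.jackson.core" % "jackson-core"        % "2.6.7",
  "com.fasterxml.jackson.core" % "jackson-databind"    % "2.6.7",
  "joda-time"                  % "joda-time"           % "2.8.1",
  "org.apache.httpcomponents"  % "httpcore"            % "4.4.4",
  "org.apache.httpcomponents"  % "httpclient"          % "4.5.3"
)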
You can review it at:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html
Section "Provide S3 FileSystem Dependency"

How can I get the project path in Scala?

I'm trying to read some files from my Scala project, and if I use java.io.File(".").getCanonicalPath() I find that my current directory is far away from them (it is where I installed Scala Eclipse). So how can I change the current directory to the root of my project, or get the path to my project? I really don't want to use an absolute path to my input files.
val PATH = raw"E:\lang\scala\progfun\src\examples\"
def printFileContents(filename: String) {
try {
println("\n" + PATH + filename)
io.Source.fromFile(PATH + filename).getLines.foreach(println)
} catch {
case _:Throwable => println("filename " + filename + " not found")
}
}
val filenames = List("random.txt", "a.txt", "b.txt", "c.txt")
filenames foreach printFileContents
Add your files to src/main/resources/<packageName> where <packageName> is your class package.
Then change the PATH line to: val PATH = getClass.getResource("").getPath
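A minimal sketch of that approach, assuming the input files were copied to src/main/resources/examples/ (the "examples" folder name is just an illustration):
import scala.io.Source
def printFileContents(filename: String): Unit = {
  // Look the file up on the classpath instead of using an absolute path.
  Option(getClass.getResource(s"/examples/$filename")) match {
    case Some(url) => Source.fromURL(url).getLines().foreach(println)
    case None      => println("filename " + filename + " not found")
  }
}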
new File(".").getCanonicalPath
will give you the base-path you need
Another workaround is to put the path you need in a user environment variable and read it with sys.env (which throws an exception if the variable is missing) or System.getenv (which returns null if it is missing), for example val PATH = sys.env("ScalaProjectPath"). The problem is that if you move the project you have to update the variable, which I didn't want.
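A small sketch of that approach ("ScalaProjectPath" is just the example variable name used above); sys.env.getOrElse avoids the exception when the variable is not set:
val PATH: String = sys.env.getOrElse("ScalaProjectPath", new java.io.File(".").getCanonicalPath)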

Potential flaw with SBT's IO.zip method?

I'm working on an SBT plugin where I'd like to zip up a directory. This is possible due to the following method in IO:
def zip(sources: Traversable[(File,String)], outputZip: File): Unit
After tinkering with this method, it seems that simply passing it a directory and expecting the resulting zip file to have the same file & folder structure is wrong. Passing a directory (empty or otherwise) results in the following:
[error]...:zipper: java.util.zip.ZipException: ZIP file must have at least one entry
Therefore, it appears that the way to use the zip method is by stepping through the directory and adding each file individually to the Traversable object.
Assuming my understanding is correct, this strikes me as very odd; very rarely do users need to cherry-pick what is to be added to an archive.
Any thoughts on this?
It seems like you can use this to compose a zip with files from multiple places. I can see the use of that in a build system.
A bit late to the party, but this should do what you need:
val parentFolder: File = ???
val folderName: String = ???
val src: File = parentFolder / folderName
val tgt: File = parentFolder / s"$folderName.zip"
IO.zip(allSubpaths(src), tgt)
Here is some code for zipping directories using sbt's IO class:
IO.withTemporaryDirectory(base => {
val dirToZip = new File(base, "lib")
IO.createDirectory(dirToZip)
IO.write(dirToZip / "test1", "test")
IO.write(dirToZip / "test2", "test")
val zip: File = base / ("test.zip")
IO.zip(allSubpaths(dirToZip), zip)
val out: File = base / "out"
IO.createDirectory(out)
IO.unzip(zip,out) mustEqual(Set(out /"test1", out / "test2"))
IO.delete((out ** "*").get)
//Create a zip containing this lib directory but under a different directory in the zip
val finder: PathFinder = dirToZip ** "*" --- dirToZip //Remove dirToZip as you can't rebase a directory to itself
IO.zip(finder x rebase(dirToZip, "newlib"), base / "rebased.zip")
IO.createDirectory(out)
IO.unzip(base / "rebased.zip",out) mustEqual(Set(out /"newlib"/"test1", out / "newlib"/ "test2"))
})
See the docs
http://www.scala-sbt.org/0.12.2/docs/Detailed-Topics/Mapping-Files.html
http://www.scala-sbt.org/0.12.3/docs/Detailed-Topics/Paths.html
for tips on creating the Traversable object to pass to IO.zip