I have a Groovy script that saves multiple files from a remote directory to my temp directory and parses them into XML. It has an interesting bug: each time it runs, it can't find one file in my temp directory. The next time it runs, it finds that file but can't find a new one. If I have 20 files, it won't find all 20 until the 20th run. The temp directory is cleared after each run. I'm wondering if there are other artifacts the program is leaving behind.
If I clean the project after 16 runs, it still finds the first 16 files, so it doesn't seem to be an artifact in Eclipse.
This is running in Eclipse 3, Java 1.5, Windows 7, Groovy 1.0.
remoteftpFile.findAll {
    println "in find"
    ftp.getReply();
    it.isFile()
}.each {
    println "in each"
    ftp.getReply();
    println it
    ftp.getReply();
    def tempDestination = PropertiesUtil.getTempDir()
    def procDestination = PropertiesUtil.getProcessedDir()
    def tempFile = new File(tempDestination + it.name)
    def procFile = new File(procDestination + it.name)
    //set it to delete
    ftp.getReply();
    println "got tempfile"
    def localftpFile = ftp.SaveToDisk(it, tempFile) //Save each file to disk
    //************** Handles decryption via gpgexe
    println "Decrypting file";
    println localftpFile.toString();
    def localftpFileStr = localftpFile.toString();
    def processedftpFileStr = procFile.toString();
    def gpgstring = PropertiesUtil.getGpgString();
    def decryptedOutputName = localftpFileStr.substring(0, (localftpFileStr.length() - 4));
    def decryptedProcOutputName = processedftpFileStr.substring(0, (processedftpFileStr.length() - 4));
    def decryptedOutputXMLName = decryptedOutputName.substring(0, (decryptedOutputName.length() - 4)) + ".xml";
    def decryptedProcOutputXMLName = decryptedProcOutputName.substring(0, (decryptedProcOutputName.length() - 4)) + ".xml";
    println decryptedOutputName;
    def xmlfile = new File(decryptedOutputName)
    def cdmpXmlFile = new File(decryptedOutputXMLName)
    def procCdmpXmlFile = decryptedProcOutputXMLName
    println gpgstring + " --output ${decryptedOutputName} --decrypt ${localftpFile} "
    (new ExternalExecute()).run(gpgstring + " --output ${decryptedOutputName} --decrypt ${localftpFile} ");
    Thread.sleep(1000);
    //************* Now Parse CSV file(s) into xml using stack overflow solution
    println "parsing file"
    def reader = new FileReader(xmlfile)
    def writer = new FileWriter(cdmpXmlFile)
    def csvdata = []
    xmlfile.eachLine { line ->
        if (line) {
            csvdata << line.split(',')
        }
    }
    def headers = csvdata[0]
    def dataRows = csvdata[1..-1]
    def xml = new groovy.xml.MarkupBuilder(writer)
    // write 'root' element
    xml.root {
        dataRows.eachWithIndex { dataRow, index ->
            // write 'entry' element with 'id' attribute
            entry(id: index + 1) {
                headers.eachWithIndex { heading, i ->
                    // write each heading with associated content
                    "${heading}"(dataRow[i])
                }
            }
        }
    }
    println "Performing XSL Translation on ${cdmpXmlFile}"
    def cdmpXML = new XMLTransformer(xmlTranslate).run(cdmpXmlFile) //Run it on each of the xml files and set the output
    new File("C:\\temp\\temp.xml").write(cdmpXML)
    new File(procCdmpXmlFile).write(cdmpXML)
    println "Done Performing XSL Translation"
    println "Uploading Data to CDMP"
    def cdmpUp = new UpdateCDMP(updateDB)
    cdmpUp.run(cdmpXML)
    println "Finished Upload and Import"
    //do clean-up backing it up AND removing the files
    println "Finished"
    println "Closing Buffers"
    reader.close();
    writer.close();
    println "Deleting Local Files"
    new File(decryptedOutputName).deleteOnExit();
    new File(localftpFile).deleteOnExit();
    xmlfile.deleteOnExit();
    cdmpXmlFile.deleteOnExit();
    println "Deleting " + cdmpXmlFile.getName()
    new File("C:\\temp\\temp.xml").deleteOnExit();
}
ftp.close()
}
This is because you are using deleteOnExit, which is not guaranteed to delete a file. It only deletes if:
the files are closed correctly,
the JVM exits correctly (with no exceptions), and
System.exit() was called with a 0 argument (or the VM terminated naturally).
It's especially problematic on Windows OSes. I can't point to a specific Stack Overflow question about this, but most questions involving deleteOnExit discuss this issue.
If you actually want to delete a file, then you should always use aFile.delete() directly. There is really no good reason to delay the deletion until later in your example.
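If you actually want the cleanup to happen, delete and check the result right away. Here is a minimal sketch of that idea (in Scala syntax, but it is the same java.io.File API the Groovy script already calls; the paths are placeholders, not taken from the question):

import java.io.File

def cleanUp(paths: Seq[String]): Unit =
  for (path <- paths) {
    val f = new File(path)
    // delete() returns false when the file could not be removed, e.g. a stream is still open
    if (f.exists() && !f.delete())
      println(s"WARN: could not delete $path")
  }

cleanUp(Seq("C:\\temp\\temp.xml", "C:\\temp\\decrypted.xml"))

Closing the reader and writer before calling delete() matters, especially on Windows, because open handles will make the deletion fail.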
How do we download directory contents from an SFTP server recursively in Scala? Can someone please help me with an example?
def recursiveDirectoryDownload(sourcePath: String, destinationPath: String): Unit = {
  val fileAndFolderList = channelSftp.ls(sourcePath)
  for (item <- fileAndFolderList) {
    if (!item.getAttrs.isDir) {
      channelSftp.get(sourcePath + "/" + item.getFilename, destinationPath + "/" + item.getFilename)
    }
  }
}
You have to call your recursiveDirectoryDownload function recursively whenever you encounter a subfolder.
See this question for an implementation in Java:
Transfer folder and subfolders using channelsftp in JSch?
It should be trivial to translate to Scala.
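A rough translation could look like the sketch below. It assumes a connected com.jcraft.jsch.ChannelSftp named channelSftp is in scope, uses "/" as the remote separator, and skips the "." and ".." entries; it is untested and should be adapted to your session setup.

import java.io.File
import com.jcraft.jsch.ChannelSftp
import scala.collection.JavaConverters._

def recursiveDirectoryDownload(sourcePath: String, destinationPath: String): Unit = {
  new File(destinationPath).mkdirs()                   // make sure the local folder exists
  val entries = channelSftp.ls(sourcePath).asScala
    .map(_.asInstanceOf[ChannelSftp#LsEntry])          // ls() returns an untyped java.util.Vector
    .filterNot(e => e.getFilename == "." || e.getFilename == "..")
  for (entry <- entries) {
    val remote = sourcePath + "/" + entry.getFilename
    val local  = destinationPath + File.separator + entry.getFilename
    if (entry.getAttrs.isDir)
      recursiveDirectoryDownload(remote, local)        // recurse into the subfolder
    else
      channelSftp.get(remote, local)                   // download a single file
  }
}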
This might be a silly question for Scala experts, but as a beginner I'm having a hard time identifying the solution. Any pointers would help.
I have a set of 3 files in an HDFS location with these names:
fileFirst.dat
fileSecond.dat
fileThird.dat
They won't necessarily be stored in any particular order. fileFirst.dat could be created last, so an ls would show a different ordering of the files each time.
My task is to combine all the files into a single file in this order:
fileFirst contents, then fileSecond contents, and finally fileThird contents, with a newline as the separator and no spaces.
I tried some ideas but couldn't come up with anything that works; every time, the order of the combined output gets messed up.
Below is my function to merge whatever is coming in:
def writeFile(): Unit = {
  val in: InputStream = fs.open(files(i).getPath)
  try {
    IOUtils.copyBytes(in, out, conf, false)
    if (addString != null) out.write(addString.getBytes("UTF-8"))
  } finally in.close()
}
files is defined like this:
val files: Array[FileStatus] = fs.listStatus(srcPath)
This is part of a bigger function where I pass in all the arguments used in this method. After everything is done, I call out.close() to close the output stream.
Any ideas are welcome, even if they go against the file-writing logic I'm trying to use; just understand that I'm not that good at Scala, for now :)
If you can enumerate your Paths directly, you don't really need to use listStatus; because you list the paths yourself in the order you want, the concatenation order is deterministic. You could try something like this (untested):
val relativePaths = Array("fileFirst.dat", "fileSecond.dat", "fileThird.dat")
val paths = relativePaths.map(new Path(srcDirectory, _))
val output = fs.create(destinationFile)
try {
  for (path <- paths) {
    val input = fs.open(path)
    try {
      IOUtils.copyBytes(input, output, conf, false)
    } catch {
      case ex: Throwable => throw ex // Feel free to do some error handling here
    } finally {
      input.close()
    }
  }
} catch {
  case ex: Throwable => throw ex // Feel free to do some error handling here
} finally {
  output.close()
}
I want to add a simple button to my app that, on click, will call an Action that creates a CSV file from two lists I have and downloads it to the user's computer.
This is my Action:
def createAndDownloadFile = Action {
  val file = new File("newFile.csv")
  val writer = CSVWriter.open(file)
  writer.writeAll(List(listOfHeaders, listOfValues))
  writer.close()
  Ok.sendFile(file, inline = false, _ => file.getName)
}
but this is not working for me; the file is not getting downloaded by the browser...
I'm expecting to see the file get downloaded by the browser; I thought Ok.sendFile would do the trick.
Thanks!
You can use Enumerators and streams for that. It should work like this:
val enum = Enumerator.fromFile(...)
val source = akka.stream.scaladsl.Source.fromPublisher(play.api.libs.streams.Streams.enumeratorToPublisher(enum))
Result(
  header = ResponseHeader(OK, Map(CONTENT_DISPOSITION → "attachment; filename=whatever.csv.gz")),
  body = HttpEntity.Streamed(source.via(Compression.gzip), None, None)
)
This will actually pipe the download through gzip. Just remove the .via(Compression.gzip) part if that is not needed.
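On newer Play versions (2.6+) the Enumerator-to-publisher bridge is no longer necessary. A hedged sketch of the same idea without gzip, assuming it sits inside a controller where Action is available and using an illustrative file name and content type, might look like this:

import java.io.File
import akka.stream.scaladsl.FileIO
import play.api.http.HttpEntity
import play.api.http.HeaderNames.CONTENT_DISPOSITION
import play.api.http.Status.OK
import play.api.mvc._

def downloadCsv = Action {
  val file = new File("newFile.csv")                 // written earlier, e.g. by CSVWriter
  val source = FileIO.fromPath(file.toPath)          // stream the file back as ByteStrings
  Result(
    header = ResponseHeader(OK, Map(CONTENT_DISPOSITION -> ("attachment; filename=" + file.getName))),
    body = HttpEntity.Streamed(source, Some(file.length()), Some("text/csv"))
  )
}

Either way, make sure the CSVWriter is closed before the response is built, otherwise the browser may receive a truncated file.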
Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (still within Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it in memory, and re-joining the final 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output; a rough sketch of that approach follows below.
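A rough sketch of that mapPartitions approach, assuming rdd is an RDD[String] and that the external program is invoked as /path/to/external_tool <input> <output> (both the path and the calling convention are placeholders for your actual binary):

import java.io.File
import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

val result = rdd.mapPartitions { iter =>
  val inFile  = File.createTempFile("part-in-", ".txt")    // local scratch files on the executor
  val outFile = File.createTempFile("part-out-", ".txt")
  Files.write(inFile.toPath, iter.toSeq.asJava)            // dump the partition to disk
  val exit = Seq("/path/to/external_tool", inFile.getAbsolutePath, outFile.getAbsolutePath).!
  require(exit == 0, s"external tool failed with exit code $exit")
  val out = Files.readAllLines(outFile.toPath).asScala.toList
  inFile.delete(); outFile.delete()
  out.iterator
}

This keeps everything on the executor's local disk, so the IO consideration mentioned above still applies.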
If you end up here based on the question title from a Google search, but you don't have the OP's restriction that the external program needs to read from a file (i.e., your external program can read from stdin), here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer

val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap { case (file, pds) => { // pds is a PortableDataStream
  val rows = new ArrayBuffer[Array[String]]()
  var errors = List[String]()
  val io = new ProcessIO(
    in => { // "in" is an OutputStream; write the encrypted contents of the
            // input file (pds) to this stream
      IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
      in.close
    },
    out => { // "out" is an InputStream; read the decrypted data off this stream.
             // Even though this runs in another thread, we can write to rows, since it
             // is part of the closure for this function
      for (line <- scala.io.Source.fromInputStream(out).getLines) {
        // ...decode line here... for my data, it was pipe-delimited
        rows += line.split('|')
      }
      out.close
    },
    err => { // "err" is an InputStream; read any errors off this stream
             // errors is part of the closure for this function
      errors = scala.io.Source.fromInputStream(err).getLines.toList
      err.close
    }
  )
  val cmd = List("/my/decryption/program", "--decrypt")
  val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
  println(s"-- Results for file $file:")
  if (exitValue != 0) {
    // TBD write to string accumulator instead, so driver can output errors
    // string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
    println(s"exit code: $exitValue")
    errors.foreach(println)
  } else {
    // TBD, you'll probably want to move this code to the driver, otherwise
    // unless you're using the shell, you won't see this output
    // because it will be sent to stdout of the executor
    println(s"row count: ${rows.size}")
    if (showSampleRows) {
      println("6 sample rows:")
      rows.slice(0, 6).foreach(row => println("  " + row.mkString("|")))
    }
  }
  rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types
I'm trying to read some files from my Scala project, and if I use new java.io.File(".").getCanonicalPath() I find that my current directory is far away from them (it is exactly where I have installed Scala Eclipse). So how can I change the current directory to the root of my project, or get the path to my project? I really don't want to have an absolute path to my input files.
val PATH = raw"E:\lang\scala\progfun\src\examples\"

def printFileContents(filename: String) {
  try {
    println("\n" + PATH + filename)
    io.Source.fromFile(PATH + filename).getLines.foreach(println)
  } catch {
    case _: Throwable => println("filename " + filename + " not found")
  }
}

val filenames = List("random.txt", "a.txt", "b.txt", "c.txt")
filenames foreach printFileContents
Add your files to src/main/resources/<packageName>, where <packageName> is your class's package.
Then change the PATH line to: val PATH = getClass.getResource("").getPath
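For example, once a file is under src/main/resources it can be read straight off the classpath with no absolute path at all; the /examples/random.txt location below is illustrative and should match wherever you put the file:

val stream = getClass.getResourceAsStream("/examples/random.txt")
io.Source.fromInputStream(stream).getLines.foreach(println)

This keeps the lookup independent of the current working directory, so it works the same from Eclipse and from a packaged jar.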
new File(".").getCanonicalPath
will give you the base-path you need
Another workaround is to put the path you need in a user environment variable and read it with sys.env (which throws an exception on failure) or System.getenv (which returns null on failure), for example val PATH = sys.env("ScalaProjectPath"), but the problem is that if you move the project you have to update the variable, which I didn't want.