File merge logic: Scala

For Scala experts this might be a silly question, but as a beginner I'm having a hard time finding a solution. Any pointers would help.
I have a set of 3 files in an HDFS location, with the names:
fileFirst.dat
fileSecond.dat
fileThird.dat
They won't necessarily be stored in any particular order; fileFirst.dat could be created last, so an ls shows a different ordering of the files each time.
My task is to combine all the files into a single file, in this order:
fileFirst contents, then fileSecond contents, and finally fileThird contents; with a newline as the separator, no spaces.
I tried some ideas but couldn't come up with anything that works; every time, the order of the combination gets messed up.
Below is my function to merge whatever is coming in:
def writeFile(): Unit = {
  val in: InputStream = fs.open(files(i).getPath)
  try {
    IOUtils.copyBytes(in, out, conf, false)
    if (addString != null) out.write(addString.getBytes("UTF-8"))
  } finally in.close()
}
files is defined like this:
val files: Array[FileStatus] = fs.listStatus(srcPath)
This is part of a bigger function where I pass in all the arguments used in this method. After everything is done, I call out.close() to close the output stream.
Any ideas are welcome, even if they go against the file-write logic I'm trying to use; just understand that I'm not that good at Scala, for now :)

If you can enumerate your Paths directly, you don't really need to use listStatus. You could try something like this (untested):
val relativePaths = Array("fileFirst.dat", "fileSecond.dat", "fileThird.dat")
val paths = relativePaths.map(new Path(srcDirectory, _))
val output = fs.create(destinationFile)
try {
  for (path <- paths) {
    val input = fs.open(path)
    try {
      IOUtils.copyBytes(input, output, conf, false)
    } catch {
      case ex: Exception => throw ex // Feel free to do some error handling here
    } finally {
      input.close()
    }
  }
} catch {
  case ex: Exception => throw ex // Feel free to do some error handling here
} finally {
  output.close()
}
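The question also asks for a newline separator between files; the loop above could write one after copying each file, for example (again untested, and assuming UTF-8 output):
for (path <- paths) {
  val input = fs.open(path)
  try {
    IOUtils.copyBytes(input, output, conf, false)
    output.write("\n".getBytes("UTF-8")) // newline separator, as requested in the question
  } finally {
    input.close()
  }
}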

Related

Should 'require' go inside or outside of the Future?

How do I replace my first conditional with the require function in the context of a Future? Should I wrap the entire inRange method in a Future, and if I do that, how do I handle the last Future so that it doesn't return a Future[Future[List[UserId]]]? Or is there a better way?
I have a block of code that looks something like this:
class RetrieveHomeownersDefault(depA: DependencyA, depB: DependencyB) extends RetrieveHomeowners {
  def inRange(range: GpsRange): Future[List[UserId]] = {
    // I would like to replace this conditional with `require(count >= 0, "The offset…`
    if (count < 0) {
      Future.failed(new IllegalArgumentException("The offset must be a positive integer."))
    } else {
      val retrieveUsers: Future[List[UserId]] = depA.inRange(range)
      for {
        userIds <- retrieveUsers
        homes   <- depB.homesForUsers(userIds)
      } yield FilterUsers.withoutHomes(userIds, homes)
    }
  }
}
I started using the require function in other areas of my code, but when I tried to use it in the context of Futures I ran into some hiccups.
class RetrieveHomeownersDefault(depA: DependencyA, depB: DependencyB) extends RetrieveHomeowners {
  // Wrapped the entire method with Future, but is that the correct approach?
  def inRange(range: GpsRange): Future[List[UserId]] = Future {
    require(count >= 0, "The offset must be a positive integer.")
    val retrieveUsers: Future[List[UserId]] = depA.inRange(range)
    // Now I get a Future[Future[List[UserId]]] error from the compiler.
    for {
      userIds <- retrieveUsers
      homes   <- depB.homesForUsers(userIds)
    } yield FilterUsers.withoutHomes(userIds, homes)
  }
}
Any tips, feedback, or suggestions would be greatly appreciated. I'm just getting started with Futures and still having a tough time wrapping my head around many concepts.
Thanks a bunch!
Just remove the outer Future {...} wrapper. It's not necessary. There's no good reason for the require call to go inside the Future. It's actually better outside since then it will report immediately (in the same thread) to the caller that the argument is invalid.
By the way, the original code is wrong too. The Future.failed(...) is created but not returned. So essentially it didn't do anything.
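For reference, a minimal sketch of the suggested shape (assuming count is available in scope, as in the question):
def inRange(range: GpsRange): Future[List[UserId]] = {
  require(count >= 0, "The offset must be a positive integer.") // throws immediately, in the caller's thread
  for {
    userIds <- depA.inRange(range)
    homes   <- depB.homesForUsers(userIds)
  } yield FilterUsers.withoutHomes(userIds, homes)
}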

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. Since I need to integrate the flow of this application with something bigger (still within Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it, and re-joining the 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly, without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output.
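A rough sketch of that mapPartitions approach, assuming an RDD[String] (called rdd here) and an external program that takes an input path and an output path on the command line; the program path and file names below are placeholders:
import java.io.File
import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

val processed = rdd.mapPartitions { lines =>
  val dir = Files.createTempDirectory("ext-proc").toFile            // scratch space on the executor
  val inFile  = new File(dir, "input.txt")
  val outFile = new File(dir, "output.txt")
  Files.write(inFile.toPath, lines.toSeq.asJava)                    // dump this partition to local disk
  val exit = Seq("/path/to/your_program", inFile.getPath, outFile.getPath).!
  if (exit != 0) sys.error(s"external program failed with exit code $exit")
  Files.readAllLines(outFile.toPath).asScala.iterator               // feed the program's output back into Spark
}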
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer
val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap { case (file, pds) => { // pds is a PortableDataStream
  val rows = new ArrayBuffer[Array[String]]()
  var errors = List[String]()
  val io = new ProcessIO(
    in => {  // "in" is an OutputStream; write the encrypted contents of the
             // input file (pds) to this stream
      IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
      in.close
    },
    out => { // "out" is an InputStream; read the decrypted data off this stream.
             // Even though this runs in another thread, we can write to rows, since it
             // is part of the closure for this function
      for (line <- scala.io.Source.fromInputStream(out).getLines) {
        // ...decode line here... for my data, it was pipe-delimited
        rows += line.split('|')
      }
      out.close
    },
    err => { // "err" is an InputStream; read any errors off this stream
             // errors is part of the closure for this function
      errors = scala.io.Source.fromInputStream(err).getLines.toList
      err.close
    }
  )
  val cmd = List("/my/decryption/program", "--decrypt")
  val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
  println(s"-- Results for file $file:")
  if (exitValue != 0) {
    // TBD write to string accumulator instead, so driver can output errors
    // string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
    println(s"exit code: $exitValue")
    errors.foreach(println)
  } else {
    // TBD, you'll probably want to move this code to the driver, otherwise
    // unless you're using the shell, you won't see this output
    // because it will be sent to stdout of the executor
    println(s"row count: ${rows.size}")
    if (showSampleRows) {
      println("6 sample rows:")
      rows.slice(0, 6).foreach(row => println(" " + row.mkString("|")))
    }
  }
  rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

scala read large files

Hello, I am looking for the fastest, yet still fairly high-level, way to work with a large data collection.
My task consists of two parts: read a lot of large files into memory and then do some statistical calculations (the easiest way to work with the data in this task is a random-access array).
My first approach was to use java.io.ByteArrayOutputStream, because it can resize its internal storage.
import java.io.File
import org.apache.commons.io.IOUtils

def packTo(buf: java.io.ByteArrayOutputStream, f: File) = {
  try {
    val fs = new java.io.FileInputStream(f)
    try IOUtils.copy(fs, buf) finally fs.close()
  } catch {
    case e: java.io.FileNotFoundException =>
  }
}

val buf = new java.io.ByteArrayOutputStream()
files foreach { f: File => packTo(buf, f) }
println(buf.size())
for (i <- 0 until buf.size()) {
  for (j <- 0 until buf.size()) {
    for (k <- 0 until buf.size()) {
      // println("i " + i + " " + buf[i]);
      // Calculate something amazing using buf[i] buf[j] buf[k]
    }
  }
}
println("amazing = " + ???)
But ByteArrayOutputStream can't give me the byte[] itself, only a copy of it, and I cannot afford to have two copies of the data.
Have you tried scala-io? Should be as simple as Resource.fromFile(f).byteArray with it.
Scala's built-in library already provides a nice API to do this:
io.Source.fromFile("/file/path").mkString.getBytes
However, it's often not a good idea to load a whole file into memory as a byte array. Do make sure the largest possible file can still fit into your JVM memory.
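If the concern is specifically the extra copy made by toByteArray, another option is to pre-size a single Array[Byte] from the file lengths and read each file directly into it. A minimal sketch, assuming the combined size fits into one array (i.e. stays under Int.MaxValue bytes) and the files don't change between sizing and reading:
import java.io.{File, FileInputStream}

def readAll(files: Seq[File]): Array[Byte] = {
  val total = files.map(_.length).sum
  require(total <= Int.MaxValue, "data must fit in a single array")
  val buf = new Array[Byte](total.toInt)
  var offset = 0
  for (f <- files) {
    val in = new FileInputStream(f)
    try {
      var n = in.read(buf, offset, buf.length - offset)
      while (n > 0) {
        offset += n
        n = in.read(buf, offset, buf.length - offset)
      }
    } finally in.close()
  }
  buf
}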

How does the following Java "continue" code translate to Scala?

for (String stock : allStocks) {
    Quote quote = getQuote(...);
    if (null == quote) {
        continue;
    }
    Price price = quote.getPrice();
    if (null == price) {
        continue;
    }
}
I don't necessarily need a line by line translation, but I'm looking for the "Scala way" to handle this type of problem.
You don't need continue or breakable or anything like that in cases like this: Options and for comprehensions do the trick very nicely,
val stocksWithPrices =
  for {
    stock <- allStocks
    quote <- Option(getQuote(...))
    price <- Option(quote.getPrice())
  } yield (stock, quote, price)
Generally you try to avoid those situations to begin with by filtering before you even start:
val goodStocks = allStocks.view.
map(stock => (stock, stock.getQuote)).filter(_._2 != null).
map { case (stock, quote) => (stock,quote, quote.getPrice) }.filter(_._3 != null)
(this example shows how you'd carry along partial results if you need them). I've used a view so that results will be computed as needed, instead of creating a bunch of new collections at each step.
Actually, you'd probably have the quotes and such return options--look around on StackOverflow for examples of how to use those instead of null return values.
But, anyway, if that sort of thing doesn't work so well (e.g. because you are generating too many intermediate results that you need to keep, or you are relying on updating mutable variables and you want to keep the evaluation pattern simple so you know what's happening when) and you can't conceive of the problem in a different, possibly more robust way, then you can
import scala.util.control.Breaks._

for (stock <- allStocks) {
  breakable {
    val quote = getQuote(...)
    if (quote eq null) break
    ...
  }
}
The breakable construct specifies where breaks should take you to. If you put breakable outside a for loop, it works like a standard Java-style break. If you put it inside, it acts like continue.
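For contrast, a small sketch of the other placement described above, with breakable outside the loop so that break leaves the loop entirely (using the same illustrative names as the snippet above):
import scala.util.control.Breaks._

breakable {
  for (stock <- allStocks) {
    val quote = getQuote(...)
    if (quote eq null) break // exits the whole loop, like a Java-style break
    // ... rest of the loop body ...
  }
}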
Of course, if you have a very small number of conditions, you don't need the continue at all; just use the else of the if-statement.
Your control structure here can be mapped very idiomatically into the following for loop, and your code demonstrates the kind of filtering that Scala's for loop was designed for.
for {
  stock <- allStocks.view
  quote = getQuote(...)
  if quote != null
  price = quote.getPrice
  if price != null
} {
  // whatever comes after all of the null tests
}
By the way, Scala will automatically desugar this into the code from Rex Kerr's solution
val goodStocks = allStocks.view.
map(stock => (stock, stock.getQuote)).filter(_._2 != null).
map { case (stock, quote) => (stock,quote, quote.getPrice) }.filter(_._3 != null)
This solution probably doesn't work in general for all different kinds of more complex flows that might use continue, but it does address a lot of common ones.
If the focus is really on the continue and not on the null handling, just define an inner method (the null handling part is a different idiom in Scala):
def handleStock(stock: String): Unit = {
  val quote = getQuote(...)
  if (null == quote) {
    return
  }
  val price = quote.getPrice()
  if (null == price) {
    return
  }
}
for (stock <- allStocks) {
  handleStock(stock)
}
The simplest way is to embed the skipped-over code in an if with reversed-sense to what you have.
See http://www.scala-lang.org/node/257
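In other words, a sketch of that reversed-condition version, using the same names as the question:
for (stock <- allStocks) {
  val quote = getQuote(...)
  if (quote != null) {
    val price = quote.getPrice()
    if (price != null) {
      // whatever comes after all of the null tests
    }
  }
}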

Preferred way of processing this data with parallel arrays

Imagine a sequence of java.io.File objects. The sequence is not in any particular order; it gets populated by a directory traversal. The names of the files can be like this:
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
Basically, I can have 3 types of files. The first type is the simple ones, which only have a .bin extension. The second type is formed from pieces _x1.bin through _x5.bin. And the third type can be formed of 10 smaller parts, from _part1 through _part10.
I know the naming may be strange, but this is what I have to work with :)
I want to group the files together (all the pieces of a file should be processed together), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how I can perform the reduce/accumulation part, since all the threads will be working on the same array.
val allBinFiles = allBins.toArray // array of java.io.File
I was thinking of handling something like that:
// a concurrent map that supports getOrElseUpdate
val mapAcumulator = scala.collection.concurrent.TrieMap[String, ListBuffer[File]]()
allBinFiles.par.foreach { file =>
  file match {
    // for something like /some/x_file_x4.bin, nameTillPart will be /some/x_file
    case ComposedOf5Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    case ComposedOf10Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    // simple file, without any pieces
    case _ =>
      mapAcumulator.getOrElseUpdate(file.toString, new ListBuffer[File]()) += file
  }
}
I was thinking of doing it as shown in the code above: having extractors for the files and using part of the path as the key in the map. For example, /some/x_file could hold as values /some/x_file_x1.bin through /some/x_file_x5.bin. I also think there could be a better way of handling this; I would be interested in your opinions.
The alternative is to use groupBy:
val mp = allBinFiles.par.groupBy {
  case ComposedOf5Name(x) => x
  case ComposedOf10Name(x) => x
  case f => f.toString
}
This will return a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files from this point:
val sqmp = mp.map { case (k, files) => (k, files.seq) }.seq
To ensure that the parallelism kicks in, make sure you have enough elements in your parallel array (10k+).
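The ComposedOf5Name and ComposedOf10Name extractors are never shown in the question; one possible shape for them, purely as an illustration of the naming scheme described above (the regexes are assumptions, not from the original code):
import java.io.File

object ComposedOf5Name {
  private val Piece = """(.*)_x[1-5]\.bin""".r
  def unapply(f: File): Option[String] = f.getPath match {
    case Piece(prefix) => Some(prefix) // e.g. /some/other_file_x3.bin -> /some/other_file
    case _ => None
  }
}

object ComposedOf10Name {
  private val Piece = """(.*)_part(?:[1-9]|10)\.bin""".r
  def unapply(f: File): Option[String] = f.getPath match {
    case Piece(prefix) => Some(prefix) // e.g. /some/x_file_part7.bin -> /some/x_file
    case _ => None
  }
}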