Akka Streams: How to group a list of files in a source by size? - scala

So I currently have an akka stream to read a list of files, and a sink to concatenate them, and that works just fine:
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val sink = Sink.fold[ByteString, ByteString](ByteString(""))(_ ++ ByteString("\n" ++ _) // Concatenate
source.toMat(sink)(Keep.right).run().flatMap(concatByteStr => writeByteStrToFile(concatByteStr, "an-output-file.txt"))
While this is fine for a simple case, the files are rather large (on the order of GBs, and can't fit in the memory of the machine I'm running this application on. So I'd like to chunk it after the byte string has reached a certain size. An option is doing it with Source.grouped(N), but files vary greatly in size (from 1 KB to 2 GB), so there's no guarantee on normalizing the size of the file.
My question is if there's a way to chunk writing files by the size of the bytestring. The documentation of akka streams are quite overwhelming and I'm having trouble figuring out the library. Any help would be greatly appreciated. Thanks!

The FileIO module from Akka Streams provides you with a streaming IO Sink to write to files, and utility methods to chunk a stream of ByteString. Your example would become something along the lines of
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(Paths.get("an-output-file.txt"))
source.via(chunking).runWith(sink)
Using FileIO.toPath sink avoids storing the whole folded ByteString into memory (hence allowing proper streaming).
More details on this Akka module can be found in the docs.

I think #Stefano Bonetti already offered a great solution. Just wanted to add that, one could also consider building custom GraphStage to address specific chunking need. In essence, create a chunk emitting method like below for the In/Out handlers as described in this Akka Stream link:
private def emitChunk(): Unit = {
if (buffer.isEmpty) {
if (isClosed(in)) completeStage()
else pull(in)
} else {
val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
buffer = nextBuffer
push(out, chunk)
}
}

After a week of tinkering in the Akka Streams libraries, the solution I ended with was a combination of Stefano's answer along with a solution provided here. I read the source of files line by line via the Framing.delimiter function, and then just simply use the LogRotatorSink provided by Alpakka. The meat of the determining log rotation is here:
val fileSizeRotationFunction = () => {
val max = 10 * 1024 * 1024 // 10 MB, but whatever you really want; I had it at our HDFS block size
var size: Long = max
(element: ByteString) =>
{
if (size + element.size > max) {
val path = Files.createTempFile("out-", ".log")
size = element.size
Some(path)
} else {
size += element.size
None
}
}
}
val sizeRotatorSink: Sink[ByteString, Future[Done]] =
LogRotatorSink(fileSizeRotationFunction)
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
source.via(chunking).runWith(sizeRotatorSink)
And that's it. Hope this was helpful to others.

Related

What is the most efficient way to copy a S3 "folder"?

I want to find an efficient way to copy an S3 folder/prefix with lots of objects to another folder/prefix on the same bucket. This is what I have tried.
Test data: around 200 objects, around 100 MB each.
1) aws s3 cp --recursive. It took around 150 secs.
2) s3-dist-cp. It took around 59 secs.
3) spark & aws jdk, 2 threads. It took around 440 secs.
4) spark & aws jdk, 64 threads. It took around 50 secs.
The threads definitely worked, but when it goes to a single thread, the aws java sdk approach seems not as efficient the aws s3 cp approach. Is there a single-threaded programming API that can have performance comparable to that of aws s3 cp? Or if there is a better to copy data?
Ideally I would prefer to use programming API to have more flexibility.
Below are the codes I used.
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
def listAllFiles(rootPath: String): Seq[String] = {
val fileSystem = FileSystem.get(URI.create(rootPath), new Configuration())
val it = fileSystem.listFiles(new Path(rootPath), true)
var files = List[String]()
while (it.hasNext) {
files = it.next().getPath.toString::files
}
files
}
def s3CopyFiles(spark: SparkSession, fromPath: String, toPath: String): Unit = {
val fromFiles = listAllFiles(fromPath)
val toFiles = fromFiles.map(_.replaceFirst(fromPath, toPath))
val fileMap = fromFiles.zip(toFiles)
s3CopyFiles(spark, fileMap)
}
def s3CopyFiles(spark: SparkSession, fileMap: Seq[(String, String)]): Unit = {
val sc = spark.sparkContext
val filePairRdd = sc.parallelize(fileMap.toList, sc.defaultParallelism)
filePairRdd.foreachPartition(it => {
val p = "s3://([^/]*)/(.*)".r
val s3 = AmazonS3ClientBuilder.defaultClient()
while (it.hasNext) {
val (p(fromBucket, fromKey), p(toBucket, toKey)) = it.next()
s3.copyObject(fromBucket, fromKey, toBucket, toKey)
}
})
}
The AWS SDK transfer manager is multithreaded; you tell it the block size you want to split the copy up by and it will do it across threads and coalesce the output at the end. Your code shouldn't have to care about how the thread/http pool is working.
Remember that the COPY call isn't doing IO; each thread issues the HTTP request and then blocks awaiting the answer...you can have many, many of them blocked at the same time
I would recommend async approach, for example reactive-aws-clients. You still will be limited to the S3 throttling bandwidth but you won't need brute force of huge number of threads on client side. For example, you could create a Monix app with structure like:
val future = listS3filesTask.flatMap(key => Task.now(getS3Object(key))).runAsync
Await.result(future, 100.seconds)
Another possible optimization could be using torrent protocol s3 feature if you have multiple consumers, so you can distribute data files across consumers with just one S3 GetObject operation per file.

Spark: Writing RDD Results to File System is Slow

I'm developing a Spark application with Scala. My application consists of only one operation that requires shuffling (namely cogroup). It runs flawlessly and at a reasonable time. The issue I'm facing is when I want to write the results back to the file system; for some reason, it takes longer than running the actual program. At first, I tried writing the results without re-partitioning or coalescing, and I realized that the number of generated files are huge, so I thought that was the issue. I tried re-partitioning (and coalescing) before writing, but the application took a long time performing these tasks. I know that re-partitioning (and coalescing) is costly, but is what I'm doing the right way? If it's not, could you please give me hints on what's the right approach.
Notes:
My file system is Amazon S3.
My input data size is around 130GB.
My cluster contains a driver node and five slave nodes each has 16 cores and 64 GB of RAM.
I'm assigning 15 executors for my job, each has 5 cores and 19GB of RAM.
P.S. I tried using Dataframes, same issue.
Here is a sample of my code just in case:
val sc = spark.sparkContext
// loading the samples
val samplesRDD = sc
.textFile(s3InputPath)
.filter(_.split(",").length > 7)
.map(parseLine)
.filter(_._1.nonEmpty) // skips any un-parsable lines
// pick random samples
val samples1Ids = samplesRDD
.map(_._2._1) // map to id
.distinct
.takeSample(withReplacement = false, 100, 0)
// broadcast it to the cluster's nodes
val samples1IdsBC = sc broadcast samples1Ids
val samples1RDD = samplesRDD
.filter(samples1IdsBC.value contains _._2._1)
val samples2RDD = samplesRDD
.filter(sample => !samples1IdsBC.value.contains(sample._2._1))
// compute
samples1RDD
.cogroup(samples2RDD)
.flatMapValues { case (left, right) =>
left.map(sample1 => (sample1._1, right.filter(sample2 => isInRange(sample1._2, sample2._2)).map(_._1)))
}
.map {
case (timestamp, (sample1Id, sample2Ids)) =>
s"$timestamp,$sample1Id,${sample2Ids.mkString(";")}"
}
.repartition(10)
.saveAsTextFile(s3OutputPath)
UPDATE
Here is the same code using Dataframes:
// loading the samples
val samplesDF = spark
.read
.csv(inputPath)
.drop("_c1", "_c5", "_c6", "_c7", "_c8")
.toDF("id", "timestamp", "x", "y")
.withColumn("x", ($"x" / 100.0f).cast(sql.types.FloatType))
.withColumn("y", ($"y" / 100.0f).cast(sql.types.FloatType))
// pick random ids as samples 1
val samples1Ids = samplesDF
.select($"id") // map to the id
.distinct
.rdd
.takeSample(withReplacement = false, 1000)
.map(r => r.getAs[String]("id"))
// broadcast it to the executor
val samples1IdsBC = sc broadcast samples1Ids
// get samples 1 and 2
val samples1DF = samplesDF
.where($"id" isin (samples1IdsBC.value: _*))
val samples2DF = samplesDF
.where(!($"id" isin (samples1IdsBC.value: _*)))
samples2DF
.withColumn("combined", struct("id", "lng", "lat"))
.groupBy("timestamp")
.agg(collect_list("combined").as("combined_list"))
.join(samples1DF, Seq("timestamp"), "rightouter")
.map {
case Row(timestamp: String, samples: mutable.WrappedArray[GenericRowWithSchema], sample1Id: String, sample1X: Float, sample1Y: Float) =>
val sample2Info = samples.filter {
case Row(_, sample2X: Float, sample2Y: Float) =>
Misc.isInRange((sample2X, sample2Y), (sample1X, sample1Y), 20)
case _ => false
}.map {
case Row(sample2Id: String, sample2X: Float, sample2Y: Float) =>
s"$sample2Id:$sample2X:$sample2Y"
case _ => ""
}.mkString(";")
(timestamp, sample1Id, sample1X, sample1Y, sample2Info)
case Row(timestamp: String, _, sample1Id: String, sample1X: Float, sample1Y: Float) => // no overlapping samples
(timestamp, sample1Id, sample1X, sample1Y, "")
case _ =>
("error", "", 0.0f, 0.0f, "")
}
.where($"_1" notEqual "error")
// .show(1000, truncate = false)
.write
.csv(outputPath)
Issue here is that normally spark commit tasks, jobs by renaming files, and on S3 renames are really, really slow. The more data you write, the longer it takes at the end of the job. That what you are seeing.
Fix: switch to the S3A committers, which don't do any renames.
Some tuning options to massively increase the number of threads in IO, commits and connection pool size
fs.s3a.threads.max from 10 to something bigger
fs.s3a.committer.threads -number files committed by a POST in parallel; default is 8
fs.s3a.connection.maximum + try (fs.s3a.committer.threads + fs.s3a.threads.max + 10)
These are all fairly small as many jobs work with multiple buckets and if there were big numbers for each it'd be really expensive to create an s3a client...but if you have many thousands of files, probably worthwhile.

This Akka stream sometimes doesn't finish

I have a graph that reads lines from multiple gzipped files and writes those lines to another set of gzipped files, mapped according to some value in each line.
It works correctly against small data sets, but fails to terminate on larger data. (It may not be the size of the data that's to blame, as I have not run it enough times to be sure - it takes a while).
def files: Source[File, NotUsed] =
Source.fromIterator(
() =>
Files
.fileTraverser()
.breadthFirst(inDir)
.asScala
.filter(_.getName.endsWith(".gz"))
.toIterator)
def extract =
Flow[File]
.mapConcat[String](unzip)
.mapConcat(s =>
(JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
.groupBy(1 << 16, _._1)
.groupedWithin(1000, 1.second)
.map { lines =>
val w = writer(lines.head._1)
w.println(lines.map(_._2).mkString("\n"))
w.close()
Done
}
.mergeSubstreams
def unzip(f: File) = {
scala.io.Source
.fromInputStream(new GZIPInputStream(new FileInputStream(f)))
.getLines
.toIterable
.to[collection.immutable.Iterable]
}
def writer(tk: String): PrintWriter =
new PrintWriter(
new OutputStreamWriter(
new GZIPOutputStream(
new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)
))
)
val process = files.via(extract).toMat(Sink.ignore)(Keep.right).run()
Await.result(process, Duration.Inf)
The thread dump shows that the process is WAITING at Await.result(process, Duration.Inf) and nothing else is happening.
OpenJDK v11 with Akka v2.5.15
Most likely it's stuck in groupBy because it ran out of available threads in dispatcher to gather items into 2^16 groups for all sources.
So if I were you I'd probably implement grouping in extract semi-manually using statefulMapConcat with mutable Map[KeyType, List[String]]. Or buffer lines with groupedWithin first and split them into groups that you would write to different files in Sink.foreach.

Directive to complete with RandomAccessFile read

I have a large data file and respond to GET requests with very small portions of that file as Array[Byte]
The directive is:
get {
dataRepo.load(param).map(data =>
complete(
HttpResponse(
entity = HttpEntity(myContentType, data),
headers = List(gzipContentEncoding)
)
)
).getOrElse(complete(HttpResponse(status = StatusCodes.NoContent)))
}
Where dataRepo.load is a function along the lines of:
val pointers: Option[Long, Int] = calculateFilePointers(param)
pointers.map { case (index, length) =>
val dataReader = new RandomAccessFile(dataFile, "r")
dataReader.seek(index)
val data = Array.ofDim[Byte](length)
dataReader.readFully(data)
data
}
Is there a more efficient way to pipe the RandomAccessFile read directly back in the response, rather than having to read it fully first?
Instead of reading the data into a Array[Byte] you could create an Iterator[Array[Byte]] which reads chunks of the file at a time:
val dataReader = new RandomAccessFile(dataFile, 'r')
val chunkSize = 1024
Iterator
.range(index, index + length, chunkSize)
.map { currentIndex =>
val currentBytes =
Array.ofDim[Byte](Math.min(chunkSize, length - currentIndex))
dataReader seek currentIndex
dataReader readFully currentBytes
currentBytes
}
This iterator can now feed an akka Source:
val source : Source[Array[Byte], _] =
Source fromIterator (() => dataRepo.load(param))
Which can then feed an HttpEntity:
val byteStrSource : Source[ByteString, _] = source.map(ByteString.apply)
val httpEntity = HttpEntity(myContentType, byteStrSource)
Now each client will only use 1024 Bytes of memory at-a-time instead of the full length of your file read. This will make your server much more efficient at handling multiple concurrent requests as well as allowing your dataRepo.load to return immediately with a lazy Source value instead of utilizing a Future.

Large file download with Play framework

I have a sample download code that works fine if the file is not zipped because I know the length and when I provide, it I think while streaming play does not have to bring the whole file in memory and it works. The below code works
def downloadLocalBackup() = Action {
var pathOfFile = "/opt/mydir/backups/big/backup"
val file = new java.io.File(pathOfFile)
val path: java.nio.file.Path = file.toPath
val source: Source[ByteString, _] = FileIO.fromPath(path)
logger.info("from local backup set the length in header as "+file.length())
Ok.sendEntity(HttpEntity.Streamed(source, Some(file.length()), Some("application/zip"))).withHeaders("Content-Disposition" -> s"attachment; filename=backup")
}
I don't know how the streaming in above case takes care of the difference in speed between disk reads(Which are faster than network). This never runs out of memory even for large files. But when I use the below code, which has zipOutput stream I am not sure of the reason to run out of memory. Somehow the same 3GB file when I try to use with zip stream, is not working.
def downloadLocalBackup2() = Action {
var pathOfFile = "/opt/mydir/backups/big/backup"
val file = new java.io.File(pathOfFile)
val path: java.nio.file.Path = file.toPath
val enumerator = Enumerator.outputStream { os =>
val zipStream = new ZipOutputStream(os)
zipStream.putNextEntry(new ZipEntry("backup2"))
val is = new BufferedInputStream(new FileInputStream(pathOfFile))
val buf = new Array[Byte](1024)
var len = is.read(buf)
var totalLength = 0L;
var logged = false;
while (len >= 0) {
zipStream.write(buf, 0, len)
len = is.read(buf)
if (!logged) {
logged = true;
logger.info("logging the while loop just one time")
}
}
is.close
zipStream.close()
}
logger.info("log right before sendEntity")
val kk = Ok.sendEntity(HttpEntity.Streamed(Source.fromPublisher(Streams.enumeratorToPublisher(enumerator)).map(x => {
val kk = Writeable.wByteArray.transform(x); kk
}),
None, Some("application/zip"))
).withHeaders("Content-Disposition" -> s"attachment; filename=backupfile.zip")
kk
}
In the first example, Akka Streams handles all details for you. It knows how to read the input stream without loading the complete file in memory. That is the advantage of using Akka Streams as explained in the docs:
The way we consume services from the Internet today includes many instances of streaming data, both downloading from a service as well as uploading to it or peer-to-peer data transfers. Regarding data as a stream of elements instead of in its entirety is very useful because it matches the way computers send and receive them (for example via TCP), but it is often also a necessity because data sets frequently become too large to be handled as a whole. We spread computations or analyses over large clusters and call it “big data”, where the whole principle of processing them is by feeding those data sequentially—as a stream—through some CPUs.
...
The purpose [of Akka Streams] is to offer an intuitive and safe way to formulate stream processing setups such that we can then execute them efficiently and with bounded resource usage—no more OutOfMemoryErrors. In order to achieve this our streams need to be able to limit the buffering that they employ, they need to be able to slow down producers if the consumers cannot keep up. This feature is called back-pressure and is at the core of the Reactive Streams initiative of which Akka is a founding member.
At the second example, you are handling the input/output streams by yourself, using the standard blocking API. I'm not 100% sure about how writing to a ZipOutputStream works here, but it is possible that it is not flushing the writes and accumulating everything before close.
Good thing is that you don't need to handle this manually since Akka Streams provides a way to gzip a Source of ByteStrings:
import javax.inject.Inject
import akka.util.ByteString
import akka.stream.scaladsl.{Compression, FileIO, Source}
import play.api.http.HttpEntity
import play.api.mvc.{BaseController, ControllerComponents}
class FooController #Inject()(val controllerComponents: ControllerComponents) extends BaseController {
def download = Action {
val pathOfFile = "/opt/mydir/backups/big/backup"
val file = new java.io.File(pathOfFile)
val path: java.nio.file.Path = file.toPath
val source: Source[ByteString, _] = FileIO.fromPath(path)
val gzipped = source.via(Compression.gzip)
Ok.sendEntity(HttpEntity.Streamed(gzipped, Some(file.length()), Some("application/zip"))).withHeaders("Content-Disposition" -> s"attachment; filename=backup")
}
}