Bucketed Sink in scalaz-stream - scala

I am trying to make a sink that writes a stream to bucketed files: when a particular condition (time, size of file, etc.) is reached, the current output stream is closed and a new one is opened to a new bucket file.
I checked how the different sinks were created in the io object, but there aren't many examples. So I tried to follow how resource and chunkW were written. I ended up with the following bit of code, where for simplicity buckets are just represented by an Int for now, but would eventually be some kind of output stream.
val buckets: Channel[Task, String, Int] = {
//recursion to step through the stream
def go(step: Task[String => Task[Int]]): Process[Task, String => Task[Int]] = {
// Emit the value and repeat
def next(msg: String => Task[Int]) =
Process.emit(msg) ++
go(step)
Process.await[Task, String => Task[Int], String => Task[Int]](step)(
next
, Process.halt // TODO ???
, Process.halt) // TODO ???
}
//starting bucket
val acquire: Task[Int] = Task.delay {
val startBuck = nextBucket(0)
println(s"opening bucket $startBuck")
startBuck
}
//the write step
def step(os: Int): Task[String => Task[Int]] =
Task.now((msg: String) => Task.delay {
write(os, msg)
val newBuck = nextBucket(os)
if (newBuck != os) {
println(s"closing bucket $os")
println(s"opening bucket $newBuck")
}
newBuck
})
//start the Channel
Process.await(acquire)(
buck => go(step(buck))
, Process.halt, Process.halt)
}
def write(bucket: Int, msg: String) { println(s"$bucket\t$msg") }
def nextBucket(b: Int) = b+1
There are a number of issues in this:
step is passed the bucket once at the start, and this never changes during the recursion. I am not sure how, in the recursive go, to create a new step task that uses the bucket (Int) produced by the previous task, since I have to provide a String before I can get at that value.
the fallback and cleanup of the await calls do not receive the result of rcv (if there is one). In io.resource this works fine because the resource is fixed; in my case, however, the resource might change at any step. How would I go about passing a reference to the currently open bucket to these callbacks?

Well, for one of the conditions (i.e. time) you may be able to use a simple go on the sink. This one is time-based, essentially reopening the file every hour:
val metronome = Process.awakeEvery(1.hour).map(_ => true)
def writeFileSink(file:String):Sink[Task,ByteVector] = ???
def timeBasedSink(prefix:String) = {
def go(index:Int) : Sink[Task,ByteVector] = {
metronome.wye(writeFileSink(prefix + "_" + index))(wye.interrupt) ++ go(index + 1)
}
go(0)
}
For the other conditions (e.g. bytes written) you can use a similar technique: just keep a signal of the bytes written and combine it with the Sink.
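For illustration only, here is a rough, untested sketch of such a size-based variant. It assumes scalaz-stream's async.signalOf and Signal#compareAndSet (plus the usual implicit Strategy) and reuses the writeFileSink stub from above:
def sizeBasedSink(prefix: String, maxBytes: Long): Sink[Task, ByteVector] = {
  def go(index: Int): Sink[Task, ByteVector] = {
    // bytes written to the current file, and a "file is full" process derived from it
    val written = async.signalOf(0L)
    val full = written.discrete.map(_ >= maxBytes)
    // wrap the file sink so that every write also bumps the counter
    val counting: Sink[Task, ByteVector] =
      writeFileSink(prefix + "_" + index).map { write =>
        (bytes: ByteVector) =>
          write(bytes).flatMap(_ => written.compareAndSet(_.map(_ + bytes.size))).map(_ => ())
      }
    // interrupt the current file's sink once the limit is hit, then move to the next index
    full.wye(counting)(wye.interrupt) ++ go(index + 1)
  }
  go(0)
}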

Related

Scala writing and reading from a file inside a while loop

I have an application where I am to write values into a file and read them back into the program in a while loop. This fails because the file is only written when I exit the loop and not in every iteration. Therefore, in the next iterations, I cannot access values that were supposed to be written into the file in the previous iterations. How can I make every iteration write into the file as opposed to writing all values at the end of the while loop?
My application uses Scalafix. It reads in a test suite Scala file and duplicates its test cases at each iteration. The important details are explained by my series of 8 comments. Is there something about how the FileWriter works that makes it wait until the last round of the loop to write back to the file, given that it does not write back to the file at every iteration of the loop?
object Printer{
//1 . This is my filePrinter which I call at every iteration to print the new test file with its test cases duplicated.
def saveFile(filename:String, data: String): Unit ={
val fileWritter: FileWriter = new FileWriter(filename)
val bufferWritter: BufferedWriter = new BufferedWriter(fileWritter)
bufferWritter.write(data)
bufferWritter.flush()
bufferWritter.close()
}
}
object Main extends App {
//2. my loop starts here.
var n = 2
do {
// read in a semanticDocument (function provided below)
val ( sdoc1,base,filename)=SemanticDocumentBuilder.buildSemanticDocument()
implicit val sdoc = sdoc1 //4. P3 is a scalafix "patch" that collects all the test cases of
// test suite and duplicates them. It works just fine, see the next comment.
val p3 =sdoc.tree.collect {
case test @ Term.ApplyInfix(Term.ApplyInfix(_,Term.Name(smc), _,
List(Lit.String(_))), Term.Name("in"), _, params) =>
Patch.addRight(test,"\n" +test.toString())
}.asPatch
//5. I collect the test cases in the next line and print
//out how many they are. At this moment, I have not
// applied the duplicate function, so they are still as
//originally read from the test file.
val staticAnalyzer = new StaticAnalyzer()
val testCases: List[Term.ApplyInfix] =
staticAnalyzer.collectTestCases()
println("Tests cases count: "+ testCases.length)
val r3 = RuleName(List(RuleIdentifier("r3")))
val map:Map[RuleName, Patch] = Map(r3->p3)
val r = PatchInternals(map, v0.RuleCtx(sdoc.tree), None)
//6. After applying the p3 patch in the previous three lines,
//I indeed print out the newly created test suite file
//and it contains each test case duplicated as shown
// by the below println(r._1.getClass).
println(r._1.getClass)
//7. I then call my save-file function (see this function above - first lines of this code)
Printer.saveFile(base+"src/test/scala/"+filename,r._1)
n-=1
//8. Since I have saved my file with the duplicates,
//I would expect that it will save the file back to the
//file (overwrite the original file as I have not used "append = true".
//I would then expect that the next length of test cases will
//have doubled but this is never the case.
//The save function with FileWriter only works in the last loop.
//Therefore, no matter the number of loops, it only doubles once!
println("Loop: "+ n)
} while(n>0)
}
Edit: factored out the reading of the SemanticDocument. This function simply returns a SemanticDocument and two strings representing my file path and filename.
object SemanticDocumentBuilder{
def buildSemanticDocument(): (SemanticDocument,String,String) ={
val base = "/Users/soft/Downloads/simpleAkkaProject/"
val local = new File(base)
//val dependenceisSBTCommand = s"sbt -ivy ./.ivy2 -Dsbt.ivy.home=./.ivy2 -Divy.home=./.ivy2
//val sbtCmd = s"sbt -ivy ./ivy2 -Dsbt.ivy.home=./ivy2 -Divy.home=./ivy2 -Dsbt.boot.directo
val result = sys.process.Process(Seq("sbt","semanticdb"), local).!
val jars = FileUtils.listFiles(local, Array("jar"), true).toArray(new Array[File](0))
.toList
.map(f => Classpath(f.getAbsolutePath))
.reduceOption(_ ++ _)
val classes = FileUtils.listFilesAndDirs(local, TrueFileFilter.INSTANCE, DirectoryFileFilte
.toList
.filter(p => p.isDirectory && !p.getAbsolutePath.contains(".sbt") && p.getAbsolutePath.co
.map(f => Classpath(f.getAbsolutePath))
.reduceOption(_ ++ _)
val classPath = ClassLoader.getSystemClassLoader.asInstanceOf[URLClassLoader].getURLs
.map(url => Classpath(url.getFile))
.reduceOption(_ ++ _)
val all = (jars ++ classes ++ classPath).reduceOption(_ ++ _).getOrElse(Classpath(""))
val symbolTable = GlobalSymbolTable(all)
val filename = "AkkaQuickstartSpec.scala"
val root = AbsolutePath(base).resolve("src/test/scala/")
println(root)
val abspath = root.resolve(filename)
println(root)
val relpath = abspath.toRelative(AbsolutePath(base))
println(relpath)
val sourceFile = new File(base+"src/test/scala/"+filename)
val input = Input.File(sourceFile)
println(input)
if (n == firstRound){
doc = SyntacticDocument.fromInput(input)
}
//println(doc.tree.structure(30))
var documents: Map[String, TextDocument] = Map.empty
Locator.apply(local.toPath)((path, db) => db.documents.foreach({
case document @ TextDocument(_, uri, text, md5, _, _, _, _, _) if !md5.isEmpty => { // skip
if (n == firstRound){
ast= sourceFile.parse[Source].getOrElse(Source(List()))
}
documents = documents + (uri -> document)
println(uri)
}
println(local.canWrite)
if (editedSuite != null){
Printer.saveFile(sourceFile,editedSuite)
}
}))
//println(documents)
val impl = new InternalSemanticDoc(doc, documents(relpath.toString()), symbolTable)
implicit val sdoc = new SemanticDocument(impl)
val symbols = sdoc.tree.collect {
case t @ Term.Name("<") => {
println(s"symbol for $t")
println(t.symbol.value)
println(symbolTable.info(t.symbol.value))
}
}
(sdoc,base,filename)
}
}
In saveFile you need to close the fileWriter after closing the bufferedWriter. You don't need to flush because close will do this for you.
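For illustration, a minimal sketch of saveFile with the closing made explicit in a finally block (names follow the question's code, spelling normalized):
import java.io.{BufferedWriter, FileWriter}

object Printer {
  def saveFile(filename: String, data: String): Unit = {
    val fileWriter = new FileWriter(filename)
    val bufferedWriter = new BufferedWriter(fileWriter)
    try {
      bufferedWriter.write(data)
    } finally {
      bufferedWriter.close() // flushes and closes the buffer
      fileWriter.close()     // then release the underlying file handle
    }
  }
}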
You should also close all the other File objects that you create in the loop, because they may be holding on to stale file handles (e.g. local, ast).
More generally, clean up the code by putting it into functions with meaningful names. There is also a lot of code that could live outside the loop. Doing this will make it easier to see what is going on and allow you to create a Minimal, Complete, and Verifiable example. As it stands, it is really difficult to work out what is going on.

Transform AWS S3 Object into ByteString issue

I'm working on a process that uploads files from S3 to Facebook using Akka. According to the Facebook API docs, files should be uploaded in small parts - chunks. Based on the file size, Facebook gives you information about the byte offset it expects to receive in the next request.
First I make a GetObjectRequest to S3 via the Java AWS SDK, in order to receive a chunk with the required byte range:
val objChunkReq = new GetObjectRequest(get.s3ObjId.bucketName, get.s3ObjId.key)
objChunkReq.setRange(get.fbUploadSession.from, get.fbUploadSession.to)
Try(s3Client.getObject(objChunkReq)) match {
case Success(s3ObjChunk) => Right(S3ObjChunk(s3ObjChunk, get.fbUploadSession))
case Failure(ex) => Left(S3Exception(ex.getMessage))
}
Then, if the S3 response is successful, I can work with the received chunk as an InputStream in order to pass it into the Facebook HTTP request:
private def inputStreamToArrayByte(is: InputStream) = {
Try {
val reads: Int = is.read()
val byteStringBuilder = ByteString.newBuilder
while (is.read() != -1) {
byteStringBuilder.asOutputStream.write(reads)
is.read()
}
is.close()
byteStringBuilder.result()
}
}
The issue I faced is that the s3ObjChunk from the first code snippet is twice as large in bytes as the resulting ByteString from the second snippet.
s3ObjChunk.getObjectMetadata.getContentLength == n
byteStringBuilder.result().length == n / 2
I have two assumptions:
a) I transform the InputStream into ByteString incorrectly
b) The ByteString compresses the InputStream
How to transform an S3 object InputStream into a ByteString correctly?
The issue with n vs n / 2 in the resulting output may be explained by a bug in the implementation.
is.read() is called twice in the loop, and neither of those return values is ever written to the output stream; only the first byte, stored in val reads before the loop, is written on every iteration.
The implementation should change to something like:
val byteStringBuilder = ByteString.newBuilder
val output = byteStringBuilder.asOutputStream
try {
var reads: Int = is.read() // note "var" instead of "val"
while (reads != -1) {
output.write(reads)
reads = is.read()
}
} finally {
is.close() // should it be here or closed by the caller?
// also close "output"
}
byteStringBuilder.result()
Or, another approach would be to use slightly more idiomatic stream reading with scala.io.Source, for example:
val byteStringBuilder = ByteString.newBuilder
val output = byteStringBuilder.asOutputStream
scala.io.Source.fromInputStream(is).foreach(output.write(_))
byteStringBuilder.result()
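If the S3 object can contain arbitrary binary data, a byte-buffer read may be a safer sketch than the character-oriented scala.io.Source; ByteStringBuilder.putBytes below is from akka.util:
// read the stream in 8 KB blocks and append only the bytes actually read
val buffer = new Array[Byte](8192)
val builder = ByteString.newBuilder
var read = is.read(buffer)
while (read != -1) {
  builder.putBytes(buffer, 0, read)
  read = is.read(buffer)
}
is.close()
builder.result()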

How to extract timed-out sessions using mapWithState

I am updating my code to switch from updateStateByKey to mapWithState in order to get users' sessions based on a timeout of 2 minutes (2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) within the session before the timeout.
This was my old code:
val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
val parsed = Utils.parseJSON(eventRecord)
val member_id = parsed.getOrElse("member_id", "")
val timestamp = parsed.getOrElse("timestamp", "").toLong
//The timestamp is returned twice because the first one will be used as the start time and the second one as the end time
(member_id, (timestamp, timestamp, List(eventRecord)))
})
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
//transform to (member_id, (time, time, counter, events within session))
(a._1, (a._2._1, a._2._2, 1, a._2._3))
}).
reduceByKey((a, b) => {
//transform to (member_id, (lowestStartTime, MaxFinishTime, sumOfCounter, events within session))
(Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
}).updateStateByKey(Utils.updateState)
The problems of updateStateByKey are nicely explained here. One of the key reasons why I decided to use mapWithState is because updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.
This is my first attempt to transform the old code to the new version:
val spec = StateSpec.function(updateState _).timeout(Minutes(1))
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
//transform to (member_id, (time, time, counter, events within session))
(a._1, (a._2._1, a._2._2, 1, a._2._3))
})
val userSessionSnapshots = latestSessionInfo.mapWithState(spec).snapshotStream()
I don't quite understand what the content of updateState should be, because as far as I understand the timeout should no longer be calculated manually (it was previously done in my Utils.updateState function), and .snapshotStream should return the timed-out sessions.
Assuming you're always waiting on a timeout of 2 minutes, you can make your mapWithState stream only output data once its timeout is triggered.
What would this mean for your code? It would mean that you now need to monitor the timeout instead of outputting the tuple on each iteration. I would imagine your mapWithState function will look something along the lines of:
def updateState(key: String,
value: Option[(Long, Long, Long, List[String])],
state: State[(Long, Long, Long, List[String])]): Option[(Long, Long, Long, List[String])] = {
def reduce(first: (Long, Long, Long, List[String]), second: (Long, Long, Long, List[String])) = {
(Math.min(first._1, second._1), Math.max(first._2, second._2), first._3 + second._3, first._4 ++ second._4)
}
value match {
case Some(currentValue) =>
val result = state
.getOption()
.map(currentState => reduce(currentState, currentValue))
.getOrElse(currentValue)
state.update(result)
None
case _ if state.isTimingOut() => state.getOption()
}
}
This way, you only output something externally to the stream if the state has timed out, otherwise you aggregate it inside the state.
This means that your Spark DStream graph can filter out all values which aren't defined, and only keep those which are:
latestSessionInfo
.mapWithState(spec)
.filter(_.isDefined)
After filter, you'll only have states which have timed out.
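Putting it together, the wiring could look roughly like the sketch below; the 2-minute timeout comes from the question, and the foreachRDD consumer is just a placeholder:
// only timed-out sessions flow downstream
val timedOutSessions = latestSessionInfo
  .mapWithState(StateSpec.function(updateState _).timeout(Minutes(2)))
  .filter(_.isDefined) // updateState returns Some(...) only on timeout
  .map(_.get)          // (lowestStartTime, maxFinishTime, counter, events)

// hand each finished session to downstream processing (println is a placeholder)
timedOutSessions.foreachRDD(_.foreach(println))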

Sink for line-by-line file IO with backpressure

I have a file processing job that currently uses akka actors with manually managed backpressure to handle the processing pipeline, but I've never been able to successfully manage the backpressure at the input file reading stage.
This job takes an input file and groups lines by an ID number present at the start of each line, and then once it hits a line with a new ID number, it pushes the grouped lines to a processing actor via message, and then continues with the new ID number, all the way until it reaches the end of the file.
This seems like it would be a good use case for Akka Streams, using the File as a sink, but I'm still not sure of three things:
1) How can I read the file line by line?
2) How can I group by the ID present on every line? I currently use very imperative processing for this, and I don't think I'll have the same ability in a stream pipeline.
3) How can I apply backpressure, such that I don't keep reading lines into memory faster than I can process the data downstream?
Akka Streams' groupBy is one approach. But groupBy has a maxSubstreams param which would require you to know the max # of ID ranges up front. So the solution below uses scan to identify same-ID blocks, and splitWhen to split into substreams:
object Main extends App {
implicit val system = ActorSystem("system")
implicit val materializer = ActorMaterializer()
def extractId(s: String) = {
val a = s.split(",")
a(0) -> a(1)
}
val file = new File("/tmp/example.csv")
private val lineByLineSource = FileIO.fromFile(file)
.via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
.map(_.utf8String)
val future: Future[Done] = lineByLineSource
.map(extractId)
.scan( (false,"","") )( (l,r) => (l._2 != r._1, r._1, r._2) )
.drop(1)
.splitWhen(_._1)
.fold( ("",Seq[String]()) )( (l,r) => (r._2, l._2 ++ Seq(r._3) ))
.concatSubstreams
.runForeach(println)
private val reply = Await.result(future, 10 seconds)
println(s"Received $reply")
Await.ready(system.terminate(), 10 seconds)
}
extractId splits lines into id -> data tuples. scan prepends the id -> data tuples with a start-of-ID-range flag. The drop drops the primer element of scan. splitWhen starts a new substream for each start-of-range. fold collects each substream into a list and removes the start-of-ID-range boolean, so that each substream produces a single element. In place of the fold you probably want a custom SubFlow that processes the stream of rows for a single ID and emits some result for the ID range. concatSubstreams merges the per-ID-range substreams produced by splitWhen back into a single stream that's printed by runForeach.
Run with:
$ cat /tmp/example.csv
ID1,some input
ID1,some more input
ID1,last of ID1
ID2,one line of ID2
ID3,2nd before eof
ID3,eof
Output is:
(ID1,List(some input, some more input, last of ID1))
(ID2,List(one line of ID2))
(ID3,List(2nd before eof, eof))
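If, instead of folding each substream into a list, you want the custom per-ID processing mentioned above, a rough, untested sketch is to run a Flow on each substream; here it merely counts the rows per ID (assuming akka.NotUsed and akka.stream.scaladsl.Flow are imported):
// each substream carries (startOfRangeFlag, id, data) tuples for a single ID
val perIdFlow: Flow[(Boolean, String, String), (String, Int), NotUsed] =
  Flow[(Boolean, String, String)]
    .fold(("", 0))((acc, elem) => (elem._2, acc._2 + 1)) // keep the id, count the rows

val counts: Future[Done] = lineByLineSource
  .map(extractId)
  .scan((false, "", ""))((l, r) => (l._2 != r._1, r._1, r._2))
  .drop(1)
  .splitWhen(_._1)
  .via(perIdFlow)
  .concatSubstreams
  .runForeach(println)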
It appears that the easiest way to add "back pressure" to your system without introducing huge modifications is to simply change the mailbox type of the Actor consuming the input groups to a BoundedMailbox.
Give the Actor that consumes your lines a BoundedMailbox with a high mailbox-push-timeout-time:
bounded-mailbox {
mailbox-type = "akka.dispatch.BoundedDequeBasedMailbox"
mailbox-capacity = 1
mailbox-push-timeout-time = 1h
}
val actor = system.actorOf(Props(classOf[InputGroupsConsumingActor]).withMailbox("bounded-mailbox"))
Create an iterator from your file, and a grouped (by ID) iterator from that iterator. Then just cycle through the data, sending groups to the consuming Actor. Note that the send will block in this case when the Actor's mailbox gets full.
def iterGroupBy[A, K](iter: Iterator[A])(keyFun: A => K): Iterator[Seq[A]] = {
def rec(s: Stream[A]): Stream[Seq[A]] =
if (s.isEmpty) Stream.empty else {
s.span(a => keyFun(a) == keyFun(s.head)) match {
case (prefix, suffix) => prefix.toList #:: rec(suffix)
}
}
rec(iter.toStream).toIterator
}
val lines = Source.fromFile("input.file").getLines()
iterGroupBy(lines){l => l.headOption}.foreach {
lines:Seq[String] =>
actor.tell(lines, ActorRef.noSender)
}
That's it!
You probably want to move the file-reading to a separate thread, as it is going to block. Also, by adjusting mailbox-capacity you can regulate the amount of memory consumed. But if reading a batch from the file is always faster than processing it, it seems reasonable to keep the capacity small, like 1 or 2.
Update: iterGroupBy is implemented with Stream and tested not to produce a StackOverflowError.
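To move the blocking read off the caller's thread, as suggested above, one possible (untested) sketch is to run the loop on its own ExecutionContext:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// a dedicated single-thread pool for the blocking file-reading loop
val fileReadingEc = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

Future {
  val lines = Source.fromFile("input.file").getLines()
  iterGroupBy(lines) { l => l.headOption }.foreach { group =>
    actor.tell(group, ActorRef.noSender) // blocks while the bounded mailbox is full
  }
}(fileReadingEc)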

Using Iteratees and Enumerators in Play Scala to Stream Data to S3

I am building a Play Framework application in Scala where I would like to stream an array of bytes to S3. I am using the Play-S3 library to do this. The "Multipart file upload" section of the documentation is what's relevant here:
// Retrieve an upload ticket
val result:Future[BucketFileUploadTicket] =
bucket initiateMultipartUpload BucketFile(fileName, mimeType)
// Upload the parts and save the tickets
val result:Future[BucketFilePartUploadTicket] =
bucket uploadPart (uploadTicket, BucketFilePart(partNumber, content))
// Complete the upload using both the upload ticket and the part upload tickets
val result:Future[Unit] =
bucket completeMultipartUpload (uploadTicket, partUploadTickets)
I am trying to do the same thing in my application but with Iteratees and Enumerators.
The streams and asynchronicity make things a little complicated, but here is what I have so far (Note uploadTicket is defined earlier in the code):
val partNumberStream = Stream.iterate(1)(_ + 1).iterator
val partUploadTicketsIteratee = Iteratee.fold[Array[Byte], Future[Vector[BucketFilePartUploadTicket]]](Future.successful(Vector.empty[BucketFilePartUploadTicket])) { (partUploadTickets, bytes) =>
bucket.uploadPart(uploadTicket, BucketFilePart(partNumberStream.next(), bytes)).flatMap(partUploadTicket => partUploadTickets.map( _ :+ partUploadTicket))
}
(body |>>> partUploadTicketsIteratee).andThen {
case result =>
result.map(_.map(partUploadTickets => bucket.completeMultipartUpload(uploadTicket, partUploadTickets))) match {
case Success(x) => x.map(d => println("Success"))
case Failure(t) => throw t
}
}
Everything compiles and runs without incident. In fact, "Success" gets printed, but no file ever shows up on S3.
There might be multiple problems with your code. It's a bit unreadable because of the map method calls. You might have a problem with your future composition. Another problem might be caused by the fact that all chunks (except for the last) should be at least 5MB.
The code below has not been tested, but shows a different approach. The iteratee approach is one where you can create small building blocks and compose them into a pipe of operations.
To make the code compile I added a trait and a few methods
trait BucketFilePartUploadTicket
val uploadPart: (Int, Array[Byte]) => Future[BucketFilePartUploadTicket] = ???
val completeUpload: Seq[BucketFilePartUploadTicket] => Future[Unit] = ???
val body: Enumerator[Array[Byte]] = ???
Here we create a few parts
// Create 5MB chunks
val chunked = {
val take5MB = Traversable.takeUpTo[Array[Byte]](1024 * 1024 * 5)
Enumeratee.grouped(take5MB transform Iteratee.consume())
}
// Add a counter, used as part number later on
val zipWithIndex = Enumeratee.scanLeft[Array[Byte]](0 -> Array.empty[Byte]) {
case ((counter, _), bytes) => (counter + 1) -> bytes
}
// Map the (Int, Array[Byte]) tuple to a BucketFilePartUploadTicket
val uploadPartTickets = Enumeratee.mapM[(Int, Array[Byte])](uploadPart.tupled)
// Construct the pipe to connect to the enumerator
// the ><> operator is an alias for compose; it is more intuitive because of
// its arrow-like structure
val pipe = chunked ><> zipWithIndex ><> uploadPartTickets
// Create a consumer that ends by finishing the upload
val consumeAndComplete =
Iteratee.getChunks[BucketFilePartUploadTicket] mapM completeUpload
Running it is done by simply connecting the parts
// This is the result, a Future[Unit]
val result = body through pipe run consumeAndComplete
Note that I did not test any code and might have made some mistakes in my approach. This however shows a different way of dealing with the problem and should probably help you to find a good solution.
Note that this approach waits for one part to complete uploading before it takes on the next part. If the connection from your server to Amazon is slower than the connection from the browser to your server, this mechanism will slow the input.
You could take another approach where you do not wait for the Future of the part upload to complete. This would result in another step where you use Future.sequence to convert the sequence of upload futures into a single future containing a sequence of the results. The result would be a mechanism sending a part to amazon as soon as you have enough data.
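For illustration, an untested sketch of that variant, reusing the names defined above and assuming an implicit ExecutionContext is in scope:
// map each chunk to a Future ticket without waiting on it (map instead of mapM),
// then sequence all the futures at the end and complete the upload
val uploadPartTicketsAsync =
  Enumeratee.map[(Int, Array[Byte])](uploadPart.tupled)

val consumeAndCompleteAsync =
  Iteratee.getChunks[Future[BucketFilePartUploadTicket]] mapM { futureTickets =>
    Future.sequence(futureTickets).flatMap(completeUpload)
  }

// fires every part upload as soon as its chunk is available
val asyncResult: Future[Unit] =
  body through (chunked ><> zipWithIndex ><> uploadPartTicketsAsync) run consumeAndCompleteAsync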