Convert Source[ByteString, Any] to Source[ByteString, IOResult] - scala

I have a:
val fileStream: Source[ByteString, Any] = Source.single(ByteString.fromString("Hello"))
This Source[ByteString, Any] type comes from the akka-http fileUpload directive:
https://doc.akka.io/docs/akka-http/current/routing-dsl/directives/file-upload-directives/fileUpload.html
Can I convert this to a Source[ByteString, IOResult]? Alternatively, is there some other operation, similar to Source.single(ByteString.fromString("Hello")), that would give me a Source[ByteString, IOResult] from a string?
I can create an IO result with:
val ioResult: IOResult = IOResult.createSuccessful(1L)
and a ByteString with:
val byteString: ByteString = ByteString.fromString("Hello")
so now I just need them as a Source[ByteString, IOResult]
Note: this is just for a unit test. I'm testing a function that returns a Source[ByteString, IOResult], so I'd like to create an instance of that type (without having to create a file) to assert that the function returns the correct ByteString. I don't really care about the IOResult part of the Source.

Specific to the fileUpload Directive
I suspect the fileUpload authors kept the materialization type as Any to allow for future changes to the API. By keeping it Any, they remain free to change it later to whatever type they eventually settle on.
Therefore, even if you were able to cast to an IOResult, you might bump into issues down the road if you upgrade the akka version and the type has changed...
Materialization Generally
The second of Source's type parameters indicates what the stream will "materialize" into. Using your example code, and changing the Any to the actual type NotUsed, we can show the whole life-cycle of the stream:
val notUsed: NotUsed = Source.single(ByteString.fromString("Hello"))
  .toMat(Sink.ignore)(Keep.left)
  .run()
As you can see, when the stream is run it is turned into an actual value (i.e. materialized). In the above case the type of the value is NotUsed. This is because there isn't much you can do to a stream that sources a single value.
Contrast that stream with a stream that operates on a file:
val file = Paths.get("example.csv")
val fileIOResult: Future[IOResult] = FileIO.fromPath(file)
  .to(Sink.ignore)
  .run()
In this case the stream is reading the contents of the file and streaming it to the Sink. Here it would be useful to know if there were any errors from the file reading. To be able to find out how well the file reading went the stream is materialized into a Future[IOResult] which you can use to get information about the file reading:
fileIOResult foreach { ioResult =>
  println(s"read ${ioResult.count} bytes from file")
}
Therefore, it doesn't really make sense to "convert" an Any into an IOResult...
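That said, if you only need the type to line up in a unit test, you can pin the materialized value yourself. Here is a minimal sketch using mapMaterializedValue (assuming, as the question states, that the actual IOResult value is irrelevant):
import akka.stream.IOResult
import akka.stream.scaladsl.Source
import akka.util.ByteString

// Swap the materialized value for a canned IOResult; the elements flowing
// through the stream are unchanged.
val testSource: Source[ByteString, IOResult] =
  Source.single(ByteString.fromString("Hello"))
    .mapMaterializedValue(_ => IOResult.createSuccessful(1L))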

Related

How to update the "bytes written" count in custom Spark data source?

I created a Spark data source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require; i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // get from parameters; the real code has more preparation here, checking save mode etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things, and in the end I write the data to disk as a side effect of map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote turning Hadoop Writables to byte arrays, as Configuration isn't serializable

data
  .select(...)
  .where(...)
  // much more
  .as[Foo]
  .mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // vice versa to toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to the disk using the Hadoop FS API
    writeResult.iterator
  }
  // do more stuff
// do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside an RDD/DataFrame operation you can get hold of the OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor. So I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics,
                        recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics,
                      bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack, but it is the only "simple" way I could come up with.
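Putting the pieces together inside the partition operation from the question might look like this sketch (customWriteRecords and the written-bytes calculation are the question's own hypothetical helpers):
import org.apache.spark.TaskContext
import org.apache.spark.executor.OutputMetricsBreakout

data.as[Foo].mapPartitions { it =>
  val metrics = TaskContext.get().taskMetrics().outputMetrics
  val writeResult = customWriteRecords(it, conf) // hypothetical writer from the question
  val myBytesWritten: Long = ... // calculate written bytes, e.g. from writeResult
  // accumulate on top of whatever Spark has already recorded for this task
  OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + myBytesWritten)
  writeResult.iterator
}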

Understanding NotUsed and Done

I am having a hard time understanding the purpose and significance of NotUsed and Done in Akka Streams.
Let us look at the following two simple examples:
Using NotUsed:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()

val myStream: RunnableGraph[NotUsed] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .to(Sink.foreach(println))

val runResult: NotUsed = myStream.run()
Using Done:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()

val myStream: RunnableGraph[Future[Done]] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .toMat(Sink.foreach(println))(Keep.right)

val runResult: Future[Done] = myStream.run()
When I run these examples, I get the same output in both cases:
STACKOVERFLOW //output
So what exactly are NotUsed and Done? What are the differences, and when should I prefer one over the other?
First of all, the choice you are making is between NotUsed and Future[Done] (not just Done).
Now, you are essentially deciding the materialized value of your graph by using the different combinators (to vs. toMat with Keep.right).
The materialized value is a way to interact with your stream while it's running. This choice does not affect the data processed by your stream, and for this reason you see the same output in both cases. The same element (the string "stackoverflow") goes through both streams.
The choice depends on what your main program is supposed to do after running the stream:
in case you are not interested in interacting with it, NotUsed is the right choice: it is just a dummy object, and it conveys the information that no interaction with the stream is needed or allowed
in case you need to listen for the completion of the stream to perform some other action, you need to expose the Future[Done]. This way you can attach a callback to it using (e.g.) onComplete or map, as in the sketch below.
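For example, a minimal sketch of the second case, reusing myStream and the implicit system from the question:
import akka.Done
import scala.util.{Failure, Success}
import system.dispatcher // ExecutionContext for the callback

myStream.run().onComplete {
  case Success(Done) => println("stream completed")
  case Failure(e)    => println(s"stream failed: $e")
}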

What's the use and meaning of second type parameter in akka.Source

What's the use and meaning of second type parameter in akka.Source?
Sample code:
def stream: ServiceCall[Source[String, NotUsed], Source[String, NotUsed]]
In most of the code I have seen so far, the second type parameter is set to akka.NotUsed by default, but I don't know what its significance is.
The second type parameter is the materialized value: when you run the source, it is the value that the run method returns to you. All stream shapes in Akka Streams (sources, sinks, flows, BidiFlows, etc.) have one. With sinks it is most obvious: if you are folding the stream, you end up with a single value, and that value is handed to you through the materialized value, which is a future of the result:
def fold[U, T](zero: U)(f: (U, T) ⇒ U): Sink[T, Future[U]]
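For instance, here is a minimal sketch of a folding sink whose materialized value is the eventual result of the fold (given an implicit materializer in scope):
import scala.concurrent.Future
import akka.stream.scaladsl.{Sink, Source}

// The stream materializes into the future of the fold result.
val sum: Future[Int] = Source(1 to 10).runWith(Sink.fold(0)(_ + _))
// sum completes with 55 once the stream finishes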
For sources though, it's less obvious, but here's one example:
def actorRef[T](bufferSize: Int, overflowStrategy: OverflowStrategy): Source[T, ActorRef]
This is a Source that, when you run it, materializes to an ActorRef. Every message that you send to the actor will be emitted from the source. If you wanted to use this in Lagom, you would do something like this:
def stream = ServiceCall { _ =>
  Source.actorRef[String](16, OverflowStrategy.dropHead)
    .mapMaterializedValue { actor =>
      // send messages here, or maybe pass the actor to somewhere else
      actor ! "Hello world"
      // And return NotUsed so that it now materializes to `NotUsed`, as required by Lagom
      NotUsed
    }
}

multipart form data in Lagom

I want to have a service which receives an item object; the object contains name, description, price, and picture.
The other attributes are strings which can easily be sent as a JSON object, but what is the best solution for including the picture?
If multipart form data is the best solution, how is it handled in Lagom?
You may want to check the file upload example in the lagom-recipes repository on GitHub.
Basically, the idea is to create an additional Play router. After that, we have to tell Lagom to use it, as noted in the reference documentation (this feature has been available since 1.5.0). Here is what the router might look like:
import java.io.File
import java.util.UUID

import akka.stream.IOResult
import akka.stream.scaladsl.{FileIO, Sink}
import akka.util.ByteString
import play.api.libs.streams.Accumulator
import play.api.mvc._
import play.api.mvc.MultipartFormData.FilePart
import play.api.routing.Router
import play.api.routing.sird._
import play.core.parsers.Multipart.{FileInfo, FilePartHandler}

import scala.concurrent.{ExecutionContext, Future}

class FileUploadRouter(action: DefaultActionBuilder,
                       parser: PlayBodyParsers,
                       implicit val exCtx: ExecutionContext) {

  private def fileHandler: FilePartHandler[File] = {
    case FileInfo(partName, filename, contentType, _) =>
      val tempFile = {
        val f = new java.io.File("./target/file-upload-data/uploads", UUID.randomUUID().toString).getAbsoluteFile
        f.getParentFile.mkdirs()
        f
      }
      val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(tempFile.toPath)
      val acc: Accumulator[ByteString, IOResult] = Accumulator(sink)
      acc.map {
        case akka.stream.IOResult(_, _) =>
          FilePart(partName, filename, contentType, tempFile)
      }
  }

  val router = Router.from {
    case POST(p"/api/files") =>
      action(parser.multipartFormData(fileHandler)) { request =>
        val files = request.body.files.map(_.ref.getAbsolutePath)
        Results.Ok(files.mkString("Uploaded[", ", ", "]"))
      }
  }
}
And then we simply tell Lagom to use it:
override lazy val lagomServer =
  serverFor[FileUploadService](wire[FileUploadServiceImpl])
    .additionalRouter(wire[FileUploadRouter].router)
Alternatively, we can make use of the PlayServiceCall class. Here is a simple sketch of how to do that, provided by James Roper from the Lightbend team:
// The type of the service call is NotUsed because we are handling it out of band
def myServiceCall: ServiceCall[NotUsed, Result] = PlayServiceCall { wrapCall =>
  // Create a Play action to handle the request
  EssentialAction { requestHeader =>
    // Now we create the sink for where we want to stream the request to - eg it could
    // go to a file, a database, some other service. The way Play gives you a request
    // body is that you need to return a sink from EssentialAction, and when it gets
    // that sink, it streams the request body into that sink.
    val sink: Sink[ByteString, Future[Done]] = ...
    // Play wraps sinks in an abstraction called accumulator, which makes it easy to
    // work with the result of handling the sink. An accumulator is like a future,
    // but rather than just being a value that will be available in future, it is a
    // value that will be available once you have passed a stream of data into it.
    // We wrap the sink in an accumulator here.
    val accumulator: Accumulator[ByteString, Done] = Accumulator(sink)
    // Now we have an accumulator, but we need the accumulator to, when it's done,
    // produce an HTTP response. Right now, it's just producing akka.Done (or whatever
    // your sink materialized to). So we mapFuture it, to handle the result.
    accumulator.mapFuture { done =>
      // At this point we create the ServiceCall; the reason we do that here is that it
      // means we can access the result of the accumulator (in this example it's just
      // Done, so not very interesting, but it could be something else).
      val wrappedAction = wrapCall(ServiceCall { notUsed =>
        // Here is where we can do any of the actual business logic, and generate the
        // result that can be returned to Lagom to be serialized like normal
        ...
      })
      // Now we invoke the wrapped action, and run it with no body (since we've already
      // handled the request body with our sink/accumulator).
      wrappedAction(requestHeader).run()
    }
  }
}
Generally speaking, it probably isn't a good idea to use Lagom for that purpose. As noted on the GitHub issue on PlayServiceCall documentation:
Many use cases where we fallback to PlayServiceCall are related to presentation or HTTP-specific use (I18N, file upload, ...) which indicate: coupling of the lagom service to the presentation layer or coupling of the lagom service to the transport.
Quoting James Roper again (a few years back):
So currently, multipart/form-data is not supported in Lagom, at least not out of the box. You can drop down to a lower level Play API to handle it, but perhaps it would be better to handle it in a web gateway, where any files handled are uploaded directly to a storage service such as S3, and then a Lagom service might store the meta data associated with it.
You can also check the discussion here, which provides some more insight.

How to read a web page, as a stream of lines, using akka-http 2.4.6

I've done a sample project in GitHub: akauppi/akka-2.4.6-trial
What I want seems simple: read a URL, and provide the contents as a line-wise stream of Strings. I've struggled with this (and with reading documentation) for the whole day, so I decided to push the sample public and ask for help.
I'm comfortable with Scala. I know Akka, and the last time I used Akka Streams was probably pre-2.4. Now, I'm lost.
Questions:
On these lines I'd like to return a Source[String,Any], not a Future (note: those lines do not compile).
The problem probably is that Http().singleRequest(...) materialises the flow, and I don't want that. How to just inject the "recipe" of reading a web page without actually reading it?
def sourceAsByteString(url: URL)(implicit as: ActorSystem, mat: Materializer): Source[ByteString, Any] = {
  import as.dispatcher
  val req: HttpRequest = HttpRequest(uri = url.toString)
  val tmp: Source[ByteString, Any] = Http().singleRequest(req).map(resp => resp.entity.dataBytes) // does not compile, gives a 'Future'
  tmp
}
The problem is that the chunks you get from the server are not lines, but might be anything. You will often get small responses in a single chunk. So you have to split the stream to lines yourself.
Something like this should work:
import akka.NotUsed
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.client.RequestBuilding._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Framing}
import akka.util.ByteString

implicit val system = ActorSystem("test")
implicit val mat = ActorMaterializer()

val delimiter: Flow[ByteString, ByteString, NotUsed] =
  Framing.delimiter(
    ByteString("\r\n"),
    maximumFrameLength = 100000,
    allowTruncation = true)

import system.dispatcher

val f = Http().singleRequest(Get("http://www.google.com")).flatMap { res =>
  val lines = res.entity.dataBytes.via(delimiter).map(_.utf8String)
  lines.runForeach { line =>
    println(line)
  }
}

f.foreach { _ =>
  system.terminate()
}
f.foreach { _ =>
system.terminate()
}
Note that if you wanted to return the lines instead of printing them, you would end up with a Future[Source[String, Any]], which is unavoidable because everything in akka-http is async. You could "flatten" this to a Source[String, Any] that produces no elements in case of a failed request, but that would probably not be a good idea.
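If you nevertheless want that flattening, here is a hedged sketch (given a fut: Future[Source[String, Any]] like the one at the end of this thread):
import akka.NotUsed
import akka.stream.scaladsl.Source

// Fails the stream if the request fails; a recover stage could turn
// that into an empty source instead.
val flattened: Source[String, NotUsed] =
  Source.fromFuture(fut).flatMapConcat(identity)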
To get a "recipe" for reading a web page, you could use Http().outgoingConnection("www.google.com") (the connection-level API takes a host name), which creates a Flow[HttpRequest, HttpResponse, Future[OutgoingConnection]]: a thing where you put in HttpRequest objects and get back HttpResponse objects.
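A rough sketch of that approach, building a source that performs the request only when it is materialized (delimiter as defined in the previous answer; an implicit ActorSystem and Materializer are assumed to be in scope):
import akka.NotUsed
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.scaladsl.{Flow, Source}
import scala.concurrent.Future

val connection: Flow[HttpRequest, HttpResponse, Future[Http.OutgoingConnection]] =
  Http().outgoingConnection("www.google.com")

// Nothing is requested until this source is run.
val lines: Source[String, NotUsed] =
  Source.single(HttpRequest(uri = "/"))
    .via(connection)
    .flatMapConcat(_.entity.dataBytes) // flatten the inner byte stream
    .via(delimiter)
    .map(_.utf8String)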
The problem probably is that Http().singleRequest(...) materialises
the flow, and I don't want that.
That was indeed at the heart of the problem. There are two ways to start:
Http().singleRequest(...) leads to a Future (i.e. it materializes the stream right at the beginning).
Source.single(HttpRequest(...)) leads to a Source (non-materialized).
http://doc.akka.io/docs/akka/2.4.7/scala/http/client-side/connection-level.html#connection-level-api
Ideally, such an important difference would be visible in the names of the methods used, but it is not. One simply has to know it and understand that the two approaches above are actually vastly different.
@RüdigerKlaehn's answer covers the linewise cutting pretty well, but also see the Cookbook.
When dealing with a Source, use flatMapConcat in place of the flatMap (which works on Futures) to flatten res.entity.dataBytes (an inner stream). Having these two levels of streams (requests, then chunks per request) adds to the mental complexity, especially since we only have one of the outer entities.
There might still be some simpler way, but I'm not looking at that actively any more. Things work. Maybe once I become more fluent with akka streams, I'll suggest a further solution.
Code for reading an HTTP response, linewise (akka-http 1.1.0-RC2):
val req: HttpRequest = ???
val fut: Future[Source[String, _]] = Http().singleRequest(req).map { resp =>
  resp.entity.dataBytes
    .via(delimiter)
    .map(_.utf8String)
}
delimiter is as in @RüdigerKlaehn's answer.