akka streams, InputStream and binary data - scala

I'm trying to use akka streams (scala) to read gzipped data streamed from Play's WS client. The data ultimately need to be presented as an InputStream to a legacy API. I'm using akka 2.4.10, so using akka.stream.scaladsl.Compression isn't an option. If I do the following (replacing the Play stuff with a toy app for simplicity; the results in Play are the same):
import java.nio.file.Paths
import java.util.zip.GZIPInputStream
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, StreamConverters}

object AkkaTester extends App {
  implicit val system = ActorSystem("system")
  implicit val materializer = ActorMaterializer()

  val filename = "/tmp/helloworld.txt.gz"
  val path = Paths.get(filename)
  val sink = StreamConverters.asInputStream()
  val akkaInput = new GZIPInputStream(FileIO.fromPath(path).runWith(sink))
  var akkaByte = akkaInput.read()
  while (akkaByte != -1) {
    print(akkaByte)
    akkaByte = akkaInput.read()
  }
  akkaInput.close()
  system.terminate()
}
...then it generates the following:
Exception in thread "main" java.io.IOException: akka.stream.impl.io.InputStreamAdapter.read() returned value out of range -1..255: -117
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:272)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:259)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
I'm new to akka streams, so I'm not sure if this is a bug (the Java docs for InputStream.read() clearly state that it should return values in the range 0..255, so it feels like a bug) or just me misusing akka streams in some way. FWIW, adding 256 to the negative value gives exactly the value we would get if we used a plain FileInputStream in place of akka streams.
UPDATE
Definitely a bug - the read() method in the relevant code returns a raw, signed byte, meaning it will be in the range -128 to 127 rather than the required 0 to 255. I've filed an issue on it.
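As a small illustration of the mismatch (just the arithmetic, not the actual akka code): the gzip header contains the byte 0x8B, which is -117 as a signed JVM byte, exactly the value in the exception, while InputStream.read() is required to return the unsigned 139.
val raw: Byte = 0x8B.toByte
println(raw)         // -117: the raw signed byte the adapter returned
println(raw & 0xFF)  // 139: the unsigned value InputStream.read() should return (-117 + 256)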

Related

My flink job keeps reading the same bytes over and over

I have a flink job that always falls behind no matter how many resources I throw at it.
I am using flink 1.15.2, but the problem also happened on 1.14.[4|6].
Here is how I am connecting to the input stream:
val properties = getInputProperties(kafkaReadConfig)
val byteArrayDeserializer = new AbstractDeserializationSchema[Array[Byte]]() {
  override def deserialize(bytes: Array[Byte]): Array[Byte] = bytes
}
val flinkKafkaConsumer = new FlinkKafkaConsumer[Array[Byte]](kafkaReadConfig.topic, byteArrayDeserializer, properties)
val res = env.addSource(flinkKafkaConsumer).setParallelism(kafkaReadConfig.parallelism).name(kafkaReadConfig.topic)
Just doing a simple flatMap, I see the same bytes being read over and over again. My pipeline ends in a print (for debugging):
inputStream.setParallelism(1)
  .flatMap(lookForDupes).setParallelism(1)
  .print()
Is there something dumb I am doing? Do I need to verify that the packet was processed somehow?
Looking for some way to mark the block as read at this point.
Have you inspected the task manager logs for clues? This can happen if the job is in a fail -> recover -> fail again loop.
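If it is a restart loop, one way to make the underlying failure surface quickly, instead of the job silently recovering and re-reading from the last committed offsets, is to cap the restart strategy (a sketch; env is assumed to be the StreamExecutionEnvironment from the question):
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time

// Limit restarts so a fail -> recover -> fail loop shows up as a job failure
// in the task manager logs instead of endlessly replaying the same records.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)))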

How to update the "bytes written" count in custom Spark data source?

I created a Spark Data Source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require, i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // get from parameters; in reality there is more preparation code here, checking the save mode etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things, and in the end I save the data as a side effect of map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote that turns Hadoop Writables into a byte array, as Configuration isn't serializable
data.
  select(...).
  where(...).
  // much more
  as[Foo].
  mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // vice versa of toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to the disk using the Hadoop FS API
    writeResult.iterator
  }.
  // do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside an RDD/DF operation you can get the OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor, so I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics,
                        recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics,
                      bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack but the only "simple" way I could come up with.
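Put together, the hack sits inside the mapPartitions call from the question roughly like this (a sketch; customWriteRecordsCounting is a hypothetical variant of customWriteRecords that also reports the number of bytes it wrote):
import org.apache.spark.TaskContext

mapPartitions { it =>
  val conf = confBytes.toWritable[Configuration]
  // hypothetical variant of customWriteRecords that also returns a per-partition byte count
  val (writeResult, partitionBytes) = customWriteRecordsCounting(it, conf)
  val metrics = TaskContext.get().taskMetrics().outputMetrics
  OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + partitionBytes)
  writeResult.iterator
}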

Convert Source[ByteString, Any] to Source[ByteString, IOResult]

I have a:
val fileStream: Source[ByteString, Any] = Source.single(ByteString.fromString("Hello"))
This Source[ByteString, Any] type comes from akka fileUpload directive:
https://doc.akka.io/docs/akka-http/current/routing-dsl/directives/file-upload-directives/fileUpload.html
Can I convert this to a Source[ByteString, IOResult], or alternatively, perform some other operation similar to Source.single(ByteString.fromString("Hello")) that would return me a Source[ByteString, IOResult] from a string?
I can create an IO result with:
val ioResult: IOResult = IOResult.createSuccessful(1L)
and a ByteString with:
val byteString: ByteString = ByteString.fromString("Hello")
so now I just need them as a Source[ByteString, IOResult]
Note, this is just for a unit test: I'm testing a function that returns a Source[ByteString, IOResult], so I'd like to create an instance of this (without having to create a file) to assert that the function returns the correct ByteString. I don't really care about the IOResult part of the Source.
Specific to the fileUpload Directive
I suspect the fileUpload writers kept the materialization type as Any to allow for future changes to the API. By keeping it an Any they can later change it to the type that they truly want to settle on.
Therefore, even if you are able to cast to an IOResult you may bump into issues down the road if you upgrade the akka version and the type has changed...
Materialization Generally
The second type in the Source type parameters indicates what the stream will "materialize" into. Using your example code, and modifying the Any to the actual type NotUsed, we can show the whole life cycle of the stream:
val notUsed: NotUsed = Source.single(ByteString.fromString("Hello"))
  .toMat(Sink.ignore)(Keep.left)
  .run()
As you can see, when the stream is run it is turned into an actual value (i.e. materialized). In the above case the type of the value is NotUsed. This is because there isn't much you can do to a stream that sources a single value.
Contrast that stream with a stream that operates on a file:
val file = Paths.get("example.csv")
val fileIOResult: Future[IOResult] = FileIO.fromPath(file)
  .to(Sink.ignore)
  .run()
In this case the stream is reading the contents of the file and streaming them to the Sink. Here it would be useful to know whether there were any errors during the file read. To find out how well the read went, the stream is materialized into a Future[IOResult], which you can use to get information about it:
fileIOResult foreach { ioResult =>
  println(s"read ${ioResult.count} bytes from file")
}
Therefore, it doesn't really make sense to "convert" an Any into an IOResult...
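That said, if all the unit test needs is a value of the right type, the materialized value can simply be replaced with mapMaterializedValue (a minimal sketch reusing the values from the question; the IOResult here is a dummy, which matches the "I don't really care about the IOResult" requirement):
val fileStream: Source[ByteString, IOResult] =
  Source.single(ByteString.fromString("Hello"))
    .mapMaterializedValue(_ => IOResult.createSuccessful(1L))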

Understanding NotUsed and Done

I am having a hard time understanding the purpose and significance of NotUsed and Done in Akka Streams.
Let us see the following 2 simple examples:
Using NotUsed :
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()
val myStream: RunnableGraph[NotUsed] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .to(Sink.foreach(println))
val runResult: NotUsed = myStream.run()
Using Done
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()
val myStream: RunnableGraph[Future[Done]] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .toMat(Sink.foreach(println))(Keep.right)
val runResult: Future[Done] = myStream.run()
When I run these examples, I get the same output in both cases:
STACKOVERFLOW //output
So what exactly are NotUsed and Done? What are the differences, and when should I prefer one over the other?
First of all, the choice you are making is between NotUsed and Future[Done] (not just Done).
Now, you are essentially deciding the materialized value of your graph, by using the different combinators (to and toMat with Keep.right).
The materialized value is a way to interact with your stream while it's running. This choice does not affect the data processed by your stream, and for this reason you see the same output in both cases. The same element (the string "stackoverflow") goes through both streams.
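As an aside, a materialized value can also be something you actively use from the outside while the stream runs, for example a kill switch (a sketch, not from the question's code):
import akka.stream.KillSwitches

// Materialize a UniqueKillSwitch so the main program can stop the
// (otherwise infinite) stream from outside.
val killSwitch = Source.repeat("stackoverflow")
  .viaMat(KillSwitches.single)(Keep.right)
  .to(Sink.foreach(println))
  .run()

killSwitch.shutdown() // completes the stream from the outside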
The choice depends on what your main program is supposed to do after running the stream:
in case you are not interested in interacting with it, NotUsed is the right choice. It is just a dummy object, and it conveys the information that no interaction with the stream is allowed or needed
in case you need to listen for the completion of the stream to perform some other action, you need to expose the Future[Done]. This way you can attach a callback to it using (e.g.) onComplete or map.
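For example, using the Future[Done] from the second snippet above (a minimal sketch; importing the system dispatcher provides the ExecutionContext for the callback):
import akka.Done
import scala.util.{Failure, Success}
import system.dispatcher // ExecutionContext for the callback

runResult.onComplete {
  case Success(Done) => println("stream completed")
  case Failure(e)    => println(s"stream failed: $e")
}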

When exactly a Spark task can be serialized?

I read some related questions about this topic, but still cannot understand the following. I have this simple Spark application which reads some JSON records from a file:
object Main {
  // implicit val formats = DefaultFormats // OK: here it works
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Spark Test App")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/home/alex/data/person.json")

    implicit val formats = DefaultFormats // Exception: Task not serializable

    val persons = input.flatMap { line ⇒
      // implicit val formats = DefaultFormats // OK: here it also works
      try {
        val json = parse(line)
        Some(json.extract[Person])
      } catch {
        case e: Exception ⇒ None
      }
    }
  }
}
I suppose the implicit formats is not serializable since it includes a ThreadLocal for the date format. But why does it work when placed as a member of the object Main or inside the closure of flatMap, and not as a plain val inside the main function?
Thanks in advance.
If the formats is inside the flatMap, it's only created as part of executing the mapping function. So the mapper can be serialized and sent to the cluster, since it doesn't contain a formats yet. The flipside is that this will create formats anew every time the mapper runs (i.e. once for every row) - you might prefer to use mapPartitions rather than flatMap so that you can have the value created once for each partition.
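A minimal sketch of that mapPartitions variant (assuming the same Person case class and json4s parse from the question):
val persons = input.mapPartitions { lines =>
  implicit val formats = DefaultFormats // created once per partition, not once per row
  lines.flatMap { line =>
    try Some(parse(line).extract[Person])
    catch { case e: Exception => None }
  }
}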
If formats is outside the flatMap then it's created once on the master machine, and you're attempting to serialize it and send it to the cluster.
I don't understand why formats as a field of Main would work. Maybe objects are magically pseudo-serializable because they're singletons (i.e. their fields aren't actually serialized, rather the fact that this is a reference to the single static Main instance is serialized)? That's just a guess though.
I think the best way to answer your question is with three short answers:
1) Why does it work when placed as a member of the object Main? The point is that the code works because it's inside an object, not necessarily the Main object. And why? Because Spark serializes your whole object and sends it to each of the executors. Moreover, an object in Scala is compiled like a Java static class, and the initial values of static fields in a Java class are stored in the jar, so the workers can use them directly. This is not the same if you use a class instead of an object.
2) The second question is: why does it work if it's inside a flatMap?
When you run transformations on an RDD (filter, flatMap, etc.), your transformation code is serialized on the driver node, sent to the workers, and once there it is deserialized and executed. As you can see, exactly as in 1), the code gets serialized "automatically".
And finally the 3rd question: why does this not work as a plain val inside the main function? This is because the val is not serialized "automatically", but you can test it like this: val yourVal = new YourClass with Serializable