Alpakka - read Kryo-serialized objects from S3 - scala

I have Kryo-serialized binary data stored on S3 (thousands of serialized objects).
Alpakka lets me read the content as data: Source[ByteString, NotUsed]. But the Kryo format doesn't use delimiters, so I can't split each serialized object into a separate ByteString using data.via(Framing.delimiter(...)).
So Kryo itself needs to read the data to know where an object ends, which doesn't look streaming-friendly.
Is it possible to implement this in a streaming fashion, so that at the end of the day I get a Source[MyObject, NotUsed]?

Here is a graph stage that does that. It handles the case where a serialized object spans two byte strings. It would need to be improved for large objects (not my use case) that can span more than two byte strings in the Source[ByteString, NotUsed].
import akka.NotUsed
import akka.stream._
import akka.stream.scaladsl.Flow
import akka.stream.stage._
import akka.util.ByteString
import com.esotericsoftware.kryo.io.Input
import com.esotericsoftware.kryo.{Kryo, KryoException, Serializer}

import scala.collection.immutable
import scala.collection.mutable.ListBuffer

object KryoReadStage {

  def flow[T](kryoSupport: KryoSupport,
              `class`: Class[T],
              serializer: Serializer[_]): Flow[ByteString, immutable.Seq[T], NotUsed] =
    Flow.fromGraph(new KryoReadStage[T](kryoSupport, `class`, serializer))
}

final class KryoReadStage[T](kryoSupport: KryoSupport,
                             `class`: Class[T],
                             serializer: Serializer[_])
  extends GraphStage[FlowShape[ByteString, immutable.Seq[T]]] {

  override def shape: FlowShape[ByteString, immutable.Seq[T]] = FlowShape.of(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = {
    new GraphStageLogic(shape) {

      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          // Prepend any bytes left over from the previous chunk.
          val bytes =
            if (previousBytes.length == 0) grab(in)
            else ByteString.fromArrayUnsafe(previousBytes) ++ grab(in)

          Managed(new Input(new ByteBufferBackedInputStream(bytes.asByteBuffer))) { input =>
            var position = 0
            val acc = ListBuffer[T]()
            kryoSupport.withKryo { kryo =>
              var last = false
              while (!last && !input.eof()) {
                tryRead(kryo, input) match {
                  case Some(t) =>
                    acc += t
                    position = input.total().toInt
                    previousBytes = EmptyArray
                  case None =>
                    // The object is incomplete: stash the unread tail until the next push.
                    val bytesLeft = new Array[Byte](bytes.length - position)
                    val bb = bytes.asByteBuffer
                    bb.position(position)
                    bb.get(bytesLeft)
                    last = true
                    previousBytes = bytesLeft
                }
              }
              push(out, acc.toList)
            }
          }
        }

        private def tryRead(kryo: Kryo, input: Input): Option[T] =
          try {
            Some(kryo.readObject(input, `class`, serializer))
          } catch {
            case _: KryoException => None
          }
      })

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })

      private val EmptyArray: Array[Byte] = Array.empty
      private var previousBytes: Array[Byte] = EmptyArray
    }
  }

  override def toString: String = "KryoReadStage"

  private lazy val in: Inlet[ByteString] = Inlet("KryoReadStage.in")
  private lazy val out: Outlet[immutable.Seq[T]] = Outlet("KryoReadStage.out")
}
Example usage:
client.download(BucketName, key)
  .via(KryoReadStage.flow(kryoSupport, `class`, serializer))
  .flatMapConcat(Source(_))
It uses some additional helpers below.
ByteBufferBackedInputStream:
import java.io.InputStream
import java.nio.ByteBuffer

// Adapts a ByteBuffer to an InputStream so Kryo's Input can consume it.
class ByteBufferBackedInputStream(buf: ByteBuffer) extends InputStream {

  override def read: Int =
    if (!buf.hasRemaining) -1
    else buf.get & 0xFF

  override def read(bytes: Array[Byte], off: Int, len: Int): Int =
    if (!buf.hasRemaining) -1
    else {
      val read = Math.min(len, buf.remaining)
      buf.get(bytes, off, read)
      read
    }
}
Managed:
object Managed {
  type AutoCloseableView[T] = T => AutoCloseable

  def apply[T: AutoCloseableView, V](resource: T)(op: T => V): V =
    try {
      op(resource)
    } finally {
      resource.close()
    }
}
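For illustration, Managed can wrap any resource for which an implicit view to AutoCloseable exists; a quick sketch (my example, not from the original post):
import java.io.FileInputStream

// The stream is closed even if the body throws.
val firstByte: Int = Managed(new FileInputStream("data.bin")) { in =>
  in.read()
}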
KryoSupport:
trait KryoSupport {
  def withKryo[T](f: Kryo => T): T
}

class PooledKryoSupport(serializers: (Class[_], Serializer[_])*) extends KryoSupport {

  override def withKryo[T](f: Kryo => T): T = {
    pool.run(new KryoCallback[T] {
      override def execute(kryo: Kryo): T = f(kryo)
    })
  }

  private val pool = {
    val factory = new KryoFactory() {
      override def create(): Kryo = {
        val kryo = new Kryo
        // KryoSupport.ScalaSerializers (not shown here) registers common Scala types.
        (KryoSupport.ScalaSerializers ++ serializers).foreach {
          case (clazz, serializer) =>
            kryo.register(clazz, serializer)
        }
        kryo
      }
    }
    new KryoPool.Builder(factory).softReferences().build()
  }
}
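For illustration, a minimal round-trip through the pooled Kryo instances (my sketch, not from the original post; it assumes the KryoSupport.ScalaSerializers registry referenced above is available):
import com.esotericsoftware.kryo.io.{Input, Output}

val support: KryoSupport = new PooledKryoSupport()

// Serialize on one pooled Kryo instance...
val bytes: Array[Byte] = support.withKryo { kryo =>
  val output = new Output(256, -1) // grows as needed, no upper bound
  kryo.writeObject(output, "hello")
  output.toBytes
}

// ...and deserialize on another.
val text: String = support.withKryo { kryo =>
  kryo.readObject(new Input(bytes), classOf[String])
}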

Related

Is it possible to pass values from two instances of the same Scala class

Say I have this situation
class Pipe {
  var vel = 3.4
  var V = 300
  var a = 10.2
  var in = ???
  var TotV = V + in
  var out = TotV * a / vel
}
val pipe1 = new Pipe
val pipe2 = new Pipe
The in variable is where my problem is. What I'd like to do is take the out variable from pipe1 and feed it in as the in variable for pipe2, effectively joining the two pipes, but I can't figure out if this is even possible within the same class. I can do it manually, but I need to know if it's possible to do it in the class.
pipe2.in = pipe1.out
My attempted fix was to add an ID field and then try to use it to reference an instance with a higher ID, but that doesn't seem doable, i.e.
class Pipe(id: Int) {
  var vel = 3.4
  var V = 300
  var a = 10.2
  var in = Pipe(id + 1).out // this is the sticking point: I want to reference instances of this class and use their out value as the in value for instances with a lower ID
  var TotV = V + in
  var out = TotV * a / vel
}
Any help would be appreciated.
You can do this by defining a companion object for the class and passing the upstream pipe as an optional parameter to the factory method, then extracting its out value and passing it to the class constructor, as follows:
object Pipe {
  def apply(upstreamPipe: Option[Pipe]): Pipe = {
    val inValue = upstreamPipe match {
      case Some(pipe) => pipe.out
      case None => 0.0 // or whatever your default value is
    }
    new Pipe(inValue)
  }
}
You would then call
val pipe1 = Pipe(None)
val pipe2 = Pipe(Some(pipe1))
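For this to compile, Pipe itself has to take its input as a constructor parameter rather than the bare var in = ??? from the question; a minimal reconstruction (my sketch, not spelled out in the original answer):
class Pipe(val in: Double) {
  val vel = 3.4
  val V = 300
  val a = 10.2
  val TotV = V + in
  val out = TotV * a / vel
}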
Unfortunately your question is not entirely clear. Under certain assumptions, what you describe looks like what is now called "FRP", aka "Functional Reactive Programming". If you want to do it in a serious way, you should probably take a look at a mature library such as RxScala or Monix, which handle many details that matter in the real world, such as error handling and scheduling/threading.
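For example, a rough Monix sketch (my example, assuming Monix 3.x; the formula is the one from the question):
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable

// pipe1 emits inputs; pipe2 applies the question's formula to each value
val pipe1 = Observable(2.0, 3.0)
val pipe2 = pipe1.map(in => (300 + in) * 10.2 / 3.4)

pipe2.foreachL(println).runToFuture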
For a simple task you might roll your own minimal implementation like this:
import scala.collection.mutable

trait Observable {
  def subscribe(subscriber: Subscriber): RxConnection
}

trait RxConnection {
  def disconnect(): Unit
}

trait Subscriber {
  def onChanged(): Unit
}

trait RxOut[T] extends Observable {
  def currentValue: Option[T]
}

class MulticastObservable extends Observable with Subscriber {
  private val subscribers: mutable.Set[Subscriber] = mutable.HashSet()

  override def onChanged(): Unit = subscribers.foreach(s => s.onChanged())

  override def subscribe(subscriber: Subscriber): RxConnection = {
    subscribers.add(subscriber)
    new RxConnection {
      override def disconnect(): Unit = subscribers.remove(subscriber)
    }
  }
}

abstract class BaseRxOut[T](private var _lastValue: Option[T]) extends RxOut[T] {
  private val multicast = new MulticastObservable()

  protected def lastValue: Option[T] = _lastValue

  protected def lastValue_=(value: Option[T]): Unit = {
    _lastValue = value
    multicast.onChanged()
  }

  override def currentValue: Option[T] = lastValue

  override def subscribe(subscriber: Subscriber): RxConnection = multicast.subscribe(subscriber)
}

class RxValue[T](initValue: T) extends BaseRxOut[T](Some(initValue)) {
  def value: T = this.lastValue.get

  def value_=(value: T): Unit = {
    this.lastValue = Some(value)
  }
}
trait InputConnector[T] {
  def connectInput(input: RxOut[T]): RxConnection
}

class InputConnectorImpl[T] extends BaseRxOut[T](None) with InputConnector[T] {
  val inputHolder = new RxValue[Option[(RxOut[T], RxConnection)]](None)

  private def updateValue(): Unit = {
    lastValue = for {
      inputWithDisconnect <- inputHolder.value
      value <- inputWithDisconnect._1.currentValue
    } yield value
  }

  override def connectInput(input: RxOut[T]): RxConnection = {
    val current = inputHolder.value
    if (current.exists(iwd => iwd._1 == input))
      current.get._2
    else {
      current.foreach(iwd => iwd._2.disconnect())
      inputHolder.value = Some((input, input.subscribe(() => this.updateValue())))
      updateValue()
      new RxConnection {
        override def disconnect(): Unit = {
          if (inputHolder.value.exists(iwd => iwd._1 == input)) {
            inputHolder.value.foreach(iwd => iwd._2.disconnect())
            inputHolder.value = None
            updateValue()
          }
        }
      }
    }
  }
}
abstract class BaseRxCalculation[Out] extends BaseRxOut[Out](None) {
  protected def registerConnectors(connectors: InputConnectorImpl[_]*): Unit = {
    connectors.foreach(c => c.subscribe(() => this.recalculate()))
  }

  private def recalculate(): Unit = {
    val newValue = calculateOutput()
    if (newValue != lastValue) {
      lastValue = newValue
    }
  }

  protected def calculateOutput(): Option[Out]
}

case class RxCalculation1[In1, Out](func: Function1[In1, Out]) extends BaseRxCalculation[Out] {
  private val conn1Impl = new InputConnectorImpl[In1]
  def conn1: InputConnector[In1] = conn1Impl // expose only InputConnector to the outside world

  registerConnectors(conn1Impl)

  override protected def calculateOutput(): Option[Out] = {
    for (v1 <- conn1Impl.currentValue) yield func(v1)
  }
}

case class RxCalculation2[In1, In2, Out](func: Function2[In1, In2, Out]) extends BaseRxCalculation[Out] {
  private val conn1Impl = new InputConnectorImpl[In1]
  def conn1: InputConnector[In1] = conn1Impl // expose only InputConnector to the outside world
  private val conn2Impl = new InputConnectorImpl[In2]
  def conn2: InputConnector[In2] = conn2Impl // expose only InputConnector to the outside world

  registerConnectors(conn1Impl, conn2Impl)

  override protected def calculateOutput(): Option[Out] = {
    for {
      v1 <- conn1Impl.currentValue
      v2 <- conn2Impl.currentValue
    } yield func(v1, v2)
  }
}

// add more RxCalculationN if needed
And you can use it like this:
def test(): Unit = {
  val pipe2 = new RxCalculation1((in: Double) => {
    println(s"in = $in")
    val vel = 3.4
    val V = 300
    val a = 10.2
    val TotV = V + in
    TotV * a / vel
  })
  val in1 = new RxValue(2.0)
  println(pipe2.currentValue)
  val conn1 = pipe2.conn1.connectInput(in1)
  println(pipe2.currentValue)
  in1.value = 3.0
  println(pipe2.currentValue)
  conn1.disconnect()
  println(pipe2.currentValue)
}
which prints
None
in = 2.0
Some(905.9999999999999)
in = 3.0
Some(909.0)
None
Here your "pipe" is RxCalculation1 (or another RxCalculationN), which wraps a function; you can "connect" and "disconnect" other "pipes" or plain "values" to its various inputs and start a chain of updates.
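Chaining two pipes, which is what the original question asked for, then looks roughly like this (my sketch, reusing the types above):
// An RxCalculation1 is itself an RxOut, so one pipe can feed another.
val pipe1 = RxCalculation1((in: Double) => (300 + in) * 10.2 / 3.4)
val pipe2 = RxCalculation1((in: Double) => (300 + in) * 10.2 / 3.4)

val source = new RxValue(2.0)
pipe1.conn1.connectInput(source)
pipe2.conn1.connectInput(pipe1) // pipe1's out becomes pipe2's in
println(pipe2.currentValue)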

Akka Streams: Cannot push port twice, or before it being pulled

I am trying to test my sliding window stage using the Akka Streams TestKit and I see this exception.
Exception in thread "main" java.lang.AssertionError: assertion failed: expected OnNext(Stream(2, ?)), found OnError(java.lang.IllegalArgumentException: Cannot push port (Sliding.out(2043106095)) twice, or before it being pulled
Akka, Akka Streams, Akka Streams TestKit version: 2.5.9
Scala version: 2.12.4
case class Sliding[T](duration: Duration, step: Duration, f: T => Long)
  extends GraphStage[FlowShape[T, immutable.Seq[T]]] {

  val in = Inlet[T]("Sliding.in")
  val out = Outlet[immutable.Seq[T]]("Sliding.out")

  override val shape: FlowShape[T, immutable.Seq[T]] = FlowShape(in, out)

  override protected val initialAttributes: Attributes = Attributes.name("sliding")

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler with OutHandler {
      private var buf = Vector.empty[T]
      var watermark = 0L
      var dropUntilDuration = step.toMillis

      private def isWindowDone(current: T) = {
        if (buf.nonEmpty) {
          val hts = f(buf.head)
          val cts = f(current)
          cts >= hts + duration.toMillis
        } else false
      }

      override def onPush(): Unit = {
        val data = grab(in)
        val timeStamp = f(data)
        if (timeStamp > watermark) {
          watermark = timeStamp
          if (isWindowDone(data)) {
            push(out, buf)
            buf = buf.dropWhile { x =>
              val ts = f(x)
              ts < dropUntilDuration
            }
            dropUntilDuration = dropUntilDuration + step.toMillis
          }
          buf :+= data
          pull(in)
        } else {
          pull(in)
        }
      }

      override def onPull(): Unit = {
        pull(in)
      }

      override def onUpstreamFinish(): Unit = {
        if (buf.nonEmpty) {
          push(out, buf)
        }
        completeStage()
      }

      this.setHandlers(in, out, this)
    }
}
Test code:
object WindowTest extends App {
  implicit val as = ActorSystem("WindowTest")
  implicit val m = ActorMaterializer()

  val expectedResultIterator = Stream.from(1).map(_.toLong)
  val infinite = Iterator.from(1)

  Source
    .fromIterator(() => infinite)
    .map(_.toLong)
    .via(Sliding(10 millis, 2 millis, identity))
    .runWith(TestSink.probe[Seq[Long]])
    .request(1)
    .expectNext(expectedResultIterator.take(10).toSeq)
    .request(1)
    .expectNext(expectedResultIterator.take(11).drop(1).toSeq)
    .expectComplete()
}

Create custom Serializer and Deserializer in kafka using scala

I am using kafka_2.10-0.10.0.1 and scala_2.10.3.
I want to write a custom Serializer and Deserializer using Scala.
I tried with this Serializer (for CustomType) and Deserializer (to obtain a CustomType):
class CustomTypeSerializer extends Serializer[CustomType] {
  private val gson: Gson = new Gson()

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {
    // nothing to do
  }

  override def serialize(topic: String, data: CustomType): Array[Byte] = {
    if (data == null)
      null
    else
      gson.toJson(data).getBytes
  }

  override def close(): Unit = {
    // nothing to do
  }
}

class CustomTypeDeserializer extends Deserializer[CustomType] {
  private val gson: Gson = new Gson()

  override def deserialize(topic: String, bytes: Array[Byte]): CustomType = {
    val offerJson = gson.toJson(bytes.toString)
    val psType: Type = new TypeToken[CustomType]() {}.getType()
    val ps: CustomType = gson.fromJson(offerJson, psType)
    ps
  }

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {
    // nothing to do
  }

  override def close(): Unit = {
    // nothing to do
  }
}
But I got this error:
Exception in thread "main" org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition topic_0_1-1 at offset 26
Caused by: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was BEGIN_ARRAY at line 1 column 2 path $
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:224)
at com.google.gson.Gson.fromJson(Gson.java:887)
at com.google.gson.Gson.fromJson(Gson.java:852)
at com.google.gson.Gson.fromJson(Gson.java:801)
at kafka.PSDeserializer.deserialize(PSDeserializer.scala:24)
at kafka.PSDeserializer.deserialize(PSDeserializer.scala:18)
at org.apache.kafka.clients.consumer.internals.Fetcher.parseRecord(Fetcher.java:627)
at org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:548)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
Can you help me, please?
Below are a custom serializer and deserializer for the case class User(name: String, id: Int), based on Java object serialization. Replace User in the code with your case class and it will work.
import java.io.{ByteArrayInputStream, ObjectInputStream}
import java.util

import org.apache.kafka.common.serialization.Deserializer

class CustomDeserializer extends Deserializer[User] {

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {
  }

  override def deserialize(topic: String, bytes: Array[Byte]): User = {
    val byteIn = new ByteArrayInputStream(bytes)
    val objIn = new ObjectInputStream(byteIn)
    val obj = objIn.readObject().asInstanceOf[User]
    byteIn.close()
    objIn.close()
    obj
  }

  override def close(): Unit = {
  }
}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util

import org.apache.kafka.common.serialization.Serializer

class CustomSerializer extends Serializer[User] {

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {
  }

  override def serialize(topic: String, data: User): Array[Byte] = {
    try {
      val byteOut = new ByteArrayOutputStream()
      val objOut = new ObjectOutputStream(byteOut)
      objOut.writeObject(data)
      objOut.close()
      byteOut.close()
      byteOut.toByteArray
    } catch {
      case ex: Exception => throw new Exception(ex.getMessage)
    }
  }

  override def close(): Unit = {
  }
}
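For completeness, a minimal sketch of wiring these into a producer (my example: the topic name and bootstrap address are placeholders; the consumer side would register CustomDeserializer under value.deserializer the same way):
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

case class User(name: String, id: Int) // case classes are Serializable, as ObjectOutputStream requires

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("key.serializer", classOf[StringSerializer].getName)
props.setProperty("value.serializer", classOf[CustomSerializer].getName)

val producer = new KafkaProducer[String, User](props)
producer.send(new ProducerRecord("users", "key-1", User("alice", 1))).get()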

Akka Kafka Custom Serializer

I'm using Akka Kafka (Scala) and want to send custom objects.
class TweetsSerializer extends Serializer[Seq[MyCustomType]] {
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ???
  override def serialize(topic: String, data: Seq[MyCustomType]): Array[Byte] = ???
  override def close(): Unit = ???
}
How can I correctly write my own serializer? And what should I do with the config field?
I would use the StringSerializer, I mean, I'd convert all my types to strings before producing them. However, this works:
import java.io.UnsupportedEncodingException

import org.apache.kafka.common.errors.SerializationException
import org.apache.kafka.common.serialization.Serializer

case class MyCustomType(a: Int)

class TweetsSerializer extends Serializer[Seq[MyCustomType]] {
  private var encoding = "UTF8"

  override def configure(configs: java.util.Map[String, _], isKey: Boolean): Unit = {
    val propertyName =
      if (isKey) "key.serializer.encoding"
      else "value.serializer.encoding"
    var encodingValue = configs.get(propertyName)
    if (encodingValue == null) encodingValue = configs.get("serializer.encoding")
    if (encodingValue != null && encodingValue.isInstanceOf[String])
      encoding = encodingValue.asInstanceOf[String]
  }

  override def serialize(topic: String, data: Seq[MyCustomType]): Array[Byte] =
    try {
      if (data == null) null
      else data.map(_.a.toString).mkString("").getBytes(encoding)
    } catch {
      case e: UnsupportedEncodingException =>
        throw new SerializationException(
          "Error when serializing string to byte[] due to unsupported encoding " + encoding)
    }

  override def close(): Unit = ()
}
object testCustomKafkaSerializer extends App {

  implicit val producerConfig = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("key.serializer", classOf[StringSerializer].getName)
    props.setProperty("value.serializer", classOf[TweetsSerializer].getName)
    props
  }

  lazy val kafkaProducer = new KafkaProducer[String, Seq[MyCustomType]](producerConfig)

  // Send the record and block on the Java future returned by send()
  private def publishToKafka(id: String, data: Seq[MyCustomType]) = {
    kafkaProducer
      .send(new ProducerRecord("outTopic", id, data))
      .get()
  }

  val input = MyCustomType(1)
  publishToKafka("customSerializerTopic", Seq(input))
}

How to create an Akka Stream Source[Seq[A]] from Source[A]

With previous versions of Akka Streams, groupBy returned a Source of Sources that could be materialized into a Source[Seq[A]].
With Akka Streams 2.4 I see that groupBy returns a SubFlow - it's not clear to me how to use this. The transformations I need to apply to the flow require the whole Seq to be available, so I can't just map over the SubFlow (I think).
I've written a class extending GraphStage that does the aggregation via a mutable collection in the GraphStageLogic, but is there built-in functionality for this? Am I missing the point of SubFlow?
I ended up writing a GraphStage:
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

class FlowAggregation[A, B](f: A => B) extends GraphStage[FlowShape[A, Seq[A]]] {
  val in: Inlet[A] = Inlet("in")
  val out: Outlet[Seq[A]] = Outlet("out")
  override val shape = FlowShape.of(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var counter: Option[B] = None
      private var aggregate = scala.collection.mutable.ArrayBuffer.empty[A]

      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          val element = grab(in)
          counter.fold({
            // First element: remember its key and start aggregating.
            counter = Some(f(element))
            aggregate += element
            pull(in)
          }) { p =>
            if (f(element) == p) {
              aggregate += element
              pull(in)
            } else {
              // Key changed: emit the finished group and start a new one.
              push(out, aggregate)
              counter = Some(f(element))
              aggregate = scala.collection.mutable.ArrayBuffer(element)
            }
          }
        }

        override def onUpstreamFinish(): Unit = {
          emit(out, aggregate)
          complete(out)
        }
      })

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })
    }
}
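A quick usage sketch (my example, not from the original answer), grouping consecutive elements that map to the same key:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Source}

implicit val system = ActorSystem("example")
implicit val materializer = ActorMaterializer()

// identity is the key function: consecutive equal elements end up in one group
Source(List(1, 1, 2, 2, 3))
  .via(Flow.fromGraph(new FlowAggregation[Int, Int](identity)))
  .runForeach(println)
// expected: ArrayBuffer(1, 1), ArrayBuffer(2, 2), ArrayBuffer(3)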