Converting a String to an Apache Avro GenericRecord - Scala

Can someone give advice on how to convert a String into a GenericRecord? I would like to convert my records to Avro and put them into a Kafka topic.
The producer used to take the values as a String (I have since replaced that with GenericRecord).
def producerMethod(socket: Seq[Socket], prot: String): KafkaProducer[String, GenericRecord] =
  producer(params(socket, prot))

def producerMethod(params: Map[String, String]): KafkaProducer[String, GenericRecord] =
  new KafkaProducer[String, GenericRecord](params.asInstanceOf[Map[String, Object]].asJava)
I also have an apply method that contains an iterator. It iterates over strings and collects the values into a tuple. At the end I send the data to a topic.
The results variable is a String, so I added asInstanceOf:
result.send(new ProducerRecord[String, GenericRecord](topic, key, results.asInstanceOf[GenericRecord]))
Here is my iterator:
iterator.foreach { tuple =>
  val (keyinfo, info, Worked(outinfo, _)) = tuple
  val (key, results) = keyResultToString(keyinfo, info, outinfo)
  result.send(new ProducerRecord[String, GenericRecord](topic, key, results.asInstanceOf[GenericRecord]))
}
Everything works when I change GenericRecord back to String, but not when I try to convert the value to a GenericRecord.
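A String cannot simply be cast to a GenericRecord; a record has to be built against an Avro Schema. Below is a minimal sketch of that conversion. The schema (a single string field named value) and the toRecord helper are hypothetical illustrations, not taken from the question's code:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Hypothetical schema for illustration; replace it with the real schema of your records.
val schemaJson =
  """{"type":"record","name":"Result","fields":[{"name":"value","type":"string"}]}"""
val schema: Schema = new Schema.Parser().parse(schemaJson)

// Build a GenericRecord from the String instead of casting it.
def toRecord(results: String): GenericRecord = {
  val record = new GenericData.Record(schema)
  record.put("value", results)
  record
}

// In the iterator, send the built record rather than results.asInstanceOf[GenericRecord]:
// result.send(new ProducerRecord[String, GenericRecord](topic, key, toRecord(results)))

Note that the producer's value serializer also has to be one that understands GenericRecord (for example Confluent's KafkaAvroSerializer); a plain StringSerializer will not work.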

Related

Is there a way to name the inner mapValues() topic created as part of the count() operator in kafka-streams?

I'm trying to name all processors of a simple word-count Kafka Streams application, but I can't figure out how to name the inner topic created due to the mapValues() call inside the count() method, which is created as a result of calling stream(). This is the application code, followed by the topology description (showing only the second sub-topology):
def createTopology(builder: StreamsBuilder, config: Config): Topology = {
  val consumed = Consumed
    .as(inputTopic)
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.stringSerde)
  val produced = Produced
    .as(outputTopic)
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.longSerde)
  val flatMapValuesProc = Named.as("flatMapValues")
  val grouped = Grouped
    .as("groupBy")
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.stringSerde)

  val textLines: KStream[String, String] = builder.stream[String, String](inputTopic)(consumed)

  val wordCounts: KTable[String, Long] = textLines
    .flatMapValues(textLine => textLine.toLowerCase.split("\\W+"), named = flatMapValuesProc)
    .groupBy((_, word) => word)(grouped)
    .count(Named.as("count"))(Materialized.as(storeName))

  wordCounts
    .toStream(Named.as("toStream"))
    .to(outputTopic)(produced)

  builder.build()
}
Looking at the count() method, it seems it's not possible to name this operation from the code. Is there another way to name this inner topic?
def count(named: Named)(implicit materialized: Materialized[K, Long, ByteArrayKeyValueStore]): KTable[K, Long] = {
  ...
  new KTable(
    javaCountTable.mapValues[Long](
      ((l: java.lang.Long) => Long2long(l)).asValueMapper,
      Materialized.`with`[K, Long, ByteArrayKeyValueStore](tableImpl.keySerde(), Serdes.longSerde)
    )
  )
}
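For what it's worth, the repartition topic itself is already named by Grouped.as("groupBy") and the state store's changelog by Materialized.as(storeName); what cannot be named through this Scala count() overload is the extra mapValues processor the wrapper inserts. One possible workaround, sketched below and untested, is to drop down to the wrapped Java KGroupedStream (kafka-streams-scala exposes it as the public inner field) so that the wrapper's unnamed node is never created; the resulting table then holds java.lang.Long values.

// Sketch of a possible workaround (untested): call the Java count() directly via
// the Scala wrapper's `inner` field so the unnamed mapValues node is never added.
import org.apache.kafka.common.serialization.{Serdes => JSerdes}
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.kstream.{KTable => KTableJ, Materialized => MaterializedJ}
import org.apache.kafka.streams.state.KeyValueStore

val javaWordCounts: KTableJ[String, java.lang.Long] = textLines
  .flatMapValues(textLine => textLine.toLowerCase.split("\\W+"), named = flatMapValuesProc)
  .groupBy((_, word) => word)(grouped)
  .inner // the underlying Java KGroupedStream[String, String]
  .count(Named.as("count"),
    MaterializedJ.as[String, java.lang.Long, KeyValueStore[Bytes, Array[Byte]]](storeName)
      .withKeySerde(JSerdes.String())
      .withValueSerde(JSerdes.Long()))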

How can I use a value in an Akka Stream to instantiate a GooglePubSub Flow?

I'm attempting to create a Flow to be used with a Source queue. I would like this to work with the Alpakka Google PubSub connector: https://doc.akka.io/docs/alpakka/current/google-cloud-pub-sub.html
In order to use this connector, I need to create a Flow that depends on the topic name provided as a String, as shown in the above link and in the code snippet.
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] =
  GooglePubSub.publish(topic, config)
The question
I would like to be able to set up a Source queue that receives the topic and message required for publishing a message. I first create the necessary PublishRequest out of the message String. I then want to run this through the Flow that is instantiated by running GooglePubSub.publish(topic, config). However, I don't know how to get the topic to that part of the flow.
val gcFlow: Flow[(String, String), PublishRequest, NotUsed] = Flow[(String, String)]
  .map(messageData => {
    PublishRequest(Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._1.getBytes))))
    )
  })
  .via(GooglePubSub.publish(topic, config))

val bufferSize = 10
val elementsToProcess = 5

// newSource is a Source[PublishRequest, NotUsed]
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .preMaterialize()
I'm not sure if there's a way to get the topic into the queue without it being a part of the initial data stream. And I don't know how to get the stream value into the dynamic Flow.
If I have improperly used some terminology, please keep in mind that I'm new to this.
You can achieve it by using flatMapConcat and generating a new Source within it:
// using tuple assuming (Topic, Message)
val gcFlow: Flow[(String, String), (String, PublishRequest), NotUsed] = Flow[(String, String)]
  .map(messageData => {
    val pr = PublishRequest(immutable.Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._2.getBytes)))))
    // output flow shape of (String, PublishRequest)
    (messageData._1, pr)
  })

val publishFlow: Flow[(String, PublishRequest), Seq[String], NotUsed] =
  Flow[(String, PublishRequest)].flatMapConcat {
    case (topic: String, pr: PublishRequest) =>
      // Create a Source[PublishRequest]
      Source.single(pr).via(GooglePubSub.publish(topic, config))
  }

// wire it up
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .via(publishFlow)
  .preMaterialize()
Optionally you could substitute the tuple with a case class to document it better:
case class Something(topic: String, payload: PublishRequest)

// output flow shape of Something
Something(messageData._1, pr)

Flow[Something].flatMapConcat { s =>
  Source.single(s.payload)... // etc
}
Explanation:
In gcFlow we output a flow shape of the tuple (String, PublishRequest), which is passed to publishFlow. publishFlow's input is the tuple (String, PublishRequest), and in flatMapConcat we generate a new Source[PublishRequest] that is run through GooglePubSub.publish.
There is a slight overhead in creating a new Source for every element, but it shouldn't have a measurable impact on performance.
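A small usage sketch (hypothetical element values; an implicit Materializer, or an ActorSystem in newer Akka versions, is assumed in scope): offer (topic, message) pairs to the materialized queue and drain the pre-materialized source of published message ids.

// Run the pre-materialized source, then push elements into the queue.
newSource.runWith(Sink.foreach(ids => println(s"published message ids: $ids")))

queue.offer(("my-topic", "hello pub/sub"))       // returns Future[QueueOfferResult]
queue.offer(("another-topic", "second message"))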

ClassCastException in kafka-streams join

I have two Kafka streams that I want to join. Each of them is typed and contains the right data:
private val inputOrderStream: KStream[String, OrderCreationRequest] =
  orderCreationRequestBuilder
    .stream[OrderCreationRequestKey, OrderCreationRequest]("order-creation-request-topic")
    .map[String, OrderCreationRequest]((key: OrderCreationRequestKey, orderCreationRequest: OrderCreationRequest) ⇒ new KeyValue(key.id, orderCreationRequest))

private val inputPaymentStream: KStream[String, OrderPayment] =
  orderPaymentBuilder
    .stream[OrderPaymentKey, OrderPayment]("payment-topic")
    .map[String, OrderPayment]((key: OrderPaymentKey, orderPayment: OrderPayment) ⇒ new KeyValue(key.id, orderPayment))
When I try to join them by key, I get a very confusing java.lang.ClassCastException:
java.lang.ClassCastException: com.ordercreation.OrderPayment cannot be cast to com.ordercreation.OrderCreationRequest
at com.ordercreation.streams.Streams$$anon$1.apply(Streams.scala:48)
at org.apache.kafka.streams.kstream.internals.AbstractStream$1.apply(AbstractStream.java:71)
at org.apache.kafka.streams.kstream.internals.KStreamKStreamJoin$KStreamKStreamJoinProcessor.process(KStreamKStreamJoin.java:82)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:47)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:82)
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:63)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:47)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:82)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:189)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndPunctuate(StreamThread.java:679)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:557)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
Here's the join code:
private val outputStream: KStream[String, String] =
  inputPaymentStream
    .join[OrderCreationRequest, String](
      inputOrderStream,
      new ValueJoiner[OrderPayment, OrderCreationRequest, String] {
        override def apply(op: OrderPayment, ocr: OrderCreationRequest): String =
          s"Payed ${op.amountInCents} via ${op.method} on behalf of ${ocr.customerNumber}."
      },
      window,
      Serdes.String(),
      null,
      null)
Investigating a bit more (using AnyRef instead of OrderCreationRequest in the join method), I can see that the same value of OrderPayment is given to both arguments of ValueJoiner.apply. Why is this happening? Am I missing something?
Note: if I print the contents of the two streams I get the expected data, so I'm sure the topics don't contain the same data.
Thank you!
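No accepted fix is recorded here, but two things stand out as worth ruling out (guesses, not a confirmed diagnosis): the two streams come from two different builders (orderCreationRequestBuilder and orderPaymentBuilder), whereas a joined topology should normally be built from a single builder, and the two trailing nulls make the join's window stores fall back to the default value serde. A hedged sketch with a single, hypothetical builder and hypothetical serde instances:

// Sketch, not a confirmed fix: build both streams from one builder and pass
// explicit value serdes instead of the two nulls. The serde instances are
// placeholders for whatever the topics actually use.
val orderPaymentSerde: Serde[OrderPayment] = ???
val orderCreationRequestSerde: Serde[OrderCreationRequest] = ???

private val inputOrderStream: KStream[String, OrderCreationRequest] =
  builder
    .stream[OrderCreationRequestKey, OrderCreationRequest]("order-creation-request-topic")
    .map[String, OrderCreationRequest]((key: OrderCreationRequestKey, ocr: OrderCreationRequest) ⇒ new KeyValue(key.id, ocr))

private val inputPaymentStream: KStream[String, OrderPayment] =
  builder
    .stream[OrderPaymentKey, OrderPayment]("payment-topic")
    .map[String, OrderPayment]((key: OrderPaymentKey, op: OrderPayment) ⇒ new KeyValue(key.id, op))

// ...then join exactly as above, replacing the two nulls with
// orderPaymentSerde and orderCreationRequestSerde.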

Streaming CSV Source with AKKA-HTTP

I am trying to stream data from MongoDB using reactivemongo-akkastream 0.12.1 and return the result as a CSV stream in one of the routes (using Akka HTTP).
I implemented that following the example here:
http://doc.akka.io/docs/akka-http/10.0.0/scala/http/routing-dsl/source-streaming-support.html#simple-csv-streaming-example
and it seems to work fine.
The only problem I am facing now is how to add the headers to the output CSV file. Any ideas?
Thanks
Aside from the fact that the example isn't really a robust way of generating CSV (it doesn't provide proper escaping), you'll need to rework it a bit to add headers. Here's what I would do:
make a Flow to convert a Source[Tweet] to a source of CSV rows, e.g. a Source[List[String]]
concatenate it to a source containing your headers as a single List[String]
adapt the marshaller to render a source of rows rather than tweets
Here's some example code:
case class Tweet(uid: String, txt: String)

def getTweets: Source[Tweet, NotUsed] = ???

val tweetToRow: Flow[Tweet, List[String], NotUsed] =
  Flow[Tweet].map { t =>
    List(
      t.uid,
      t.txt.replaceAll(",", "."))
  }

// provide a marshaller from a row (List[String]) to a ByteString
implicit val tweetAsCsv = Marshaller.strict[List[String], ByteString] { row =>
  Marshalling.WithFixedContentType(ContentTypes.`text/csv(UTF-8)`, () =>
    ByteString(row.mkString(","))
  )
}

// enable csv streaming
implicit val csvStreaming = EntityStreamingSupport.csv()

val route = path("tweets") {
  val headers = Source.single(List("uid", "text"))
  val tweets: Source[List[String], NotUsed] = getTweets.via(tweetToRow)
  complete(headers.concat(tweets))
}
Update: if your getTweets method returns a Future, you can just map over its value and prepend the headers that way, e.g.:
val route = path("tweets") {
  val headers = Source.single(List("uid", "text"))
  val rows: Future[Source[List[String], NotUsed]] = getTweets
    .map(tweets => headers.concat(tweets.via(tweetToRow)))
  complete(rows)
}
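As noted above, replaceAll(",", ".") only sidesteps the escaping problem by mutating the data. A minimal quoting helper in the spirit of RFC 4180 (a sketch, not part of the original answer) avoids that:

// Quote each field and double embedded quotes so commas and quotes survive intact.
def csvField(s: String): String = "\"" + s.replace("\"", "\"\"") + "\""

val tweetToRowQuoted: Flow[Tweet, List[String], NotUsed] =
  Flow[Tweet].map(t => List(csvField(t.uid), csvField(t.txt)))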

[Spark Streaming] How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model is used to predict something based on this new message. But as time goes by, the model can change for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this:
def loadingModel(@transient sc: SparkContext) = {
  val model = LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
  model
}

var error = 0.0
var size = 0.0
implicit def bool2int(b: Boolean) = if (b) 1 else 0

def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double, Double)] = {
  val model = loadingModel(sc)
  val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
  val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
  val prediction = model.predict(pairs.features)
  val wrong = prediction != pairs.label
  error = state.getOption().getOrElse(Array(0.0, 0.0))(0) + 1.0 * (wrong: Int)
  size = state.getOption().getOrElse(Array(0.0, 0.0))(1) + 1.0
  val output = (key, error, size)
  state.update(Array(error, size))
  Some(output)
}

val stateSpec = StateSpec.function(updateState _)
  .numPartitions(1)

setupLogging()

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, I get an exception like this:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within a DStream function, Spark seems to serialize the context object (because the model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert the DStream to an RDD, collect the result, and then run the model prediction/scoring in the driver.
I used the netcat utility to simulate streaming and tried the following code to convert the DStream to an RDD; it works. See if it helps.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)))
val logisModel = LogisticRegressionModel.load(sc, "/path/LR_Model")

linedstream.foreachRDD(rdd => {
  for (item <- rdd.collect().toArray) {
    val predictedVal = logisModel.predict(item)
    println(predictedVal + "|" + item)
  }
})
I understand collect is not scalable here, but if your streaming messages are few in number per interval, this is probably an option. This is what I found possible in Spark 1.4.0; higher versions probably have a fix for this. See this if it's useful:
Save ML model for future usage
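Since the question is about picking up an updated model, a variation of the workaround above (a sketch under the same assumptions, reusing the hypothetical path from that snippet) is to reload the model once per micro-batch inside foreachRDD, on the driver, rather than once per message:

// Reload the model at the start of every micro-batch so an updated model file
// is picked up; prediction still happens on the driver after collect().
linedstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val model = LogisticRegressionModel.load(sc, "/path/LR_Model")
    rdd.collect().foreach { item =>
      println(model.predict(item) + "|" + item)
    }
  }
}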