Set timestamp in output with Kafka Streams fails for transformations - scala

Suppose we have a transformer (written in Scala)
new Transformer[String, V, (String, V)]() {
  var context: ProcessorContext = _

  override def init(context: ProcessorContext): Unit = {
    this.context = context
  }

  override def transform(key: String, value: V): (String, V) = {
    val timestamp = toTimestamp(value)
    context.forward(key, value, To.all().withTimestamp(timestamp))
    key -> value
  }

  override def close(): Unit = ()
}
where toTimestamp is just a function which returns a timestamp fetched from the record value. Once it gets executed, there's an NPE:
Exception in thread "...-6f3693b9-4e8d-4e65-9af6-928884320351-StreamThread-5" java.lang.NullPointerException
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:110)
at CustomTransformer.transform()
at CustomTransformer.transform()
at org.apache.kafka.streams.scala.kstream.KStream$$anon$1$$anon$2.transform(KStream.scala:302)
at org.apache.kafka.streams.scala.kstream.KStream$$anon$1$$anon$2.transform(KStream.scala:300)
what essentially happens is that ProcessorContextImpl fails in:
public <K, V> void forward(final K key, final V value, final To to) {
    toInternal.update(to);
    if (toInternal.hasTimestamp()) {
        recordContext.setTimestamp(toInternal.timestamp());
    }
    final ProcessorNode previousNode = currentNode();
because the recordContext was not initialized (and it can only be initialized internally by Kafka Streams).
This is a follow-up to the question Set timestamp in output with Kafka Streams.

If you work with a Transformer, you need to make sure that a new Transformer object is created whenever TransformerSupplier#get() is called (cf. https://docs.confluent.io/current/streams/faq.html#why-do-i-get-an-illegalstateexception-when-accessing-record-metadata).
In the original question, I thought the NPE came from your context variable, but now I realize it comes from the Kafka Streams internals.
The Scala API in 2.0.0 has a bug that can result in the same Transformer instance being reused (https://issues.apache.org/jira/browse/KAFKA-7250). I think you are hitting this bug. Rewriting your code a little bit should fix the issue. Note that Kafka 2.0.1 and Kafka 2.1.0 contain a fix.
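To illustrate what "a new Transformer per get()" means, here is a minimal sketch using the Java TransformerSupplier from Scala; MyTransformer stands in for the transformer shown above and is not part of the original post:

import org.apache.kafka.streams.kstream.{Transformer, TransformerSupplier}

// Correct: each call to get() builds a fresh Transformer, so every stream task
// receives its own instance with its own ProcessorContext.
val supplier: TransformerSupplier[String, V, (String, V)] =
  new TransformerSupplier[String, V, (String, V)] {
    override def get(): Transformer[String, V, (String, V)] = new MyTransformer()
  }

// Incorrect: capturing a single shared instance defeats the supplier's purpose.
// val shared = new MyTransformer()
// val badSupplier = new TransformerSupplier[...] { override def get() = shared }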

@matthias-j-sax Same behavior if the processor is reused in Java code.
Topology topology = new Topology();
MyProcessor myProcessor = new MyProcessor(); // single instance, shared by the supplier below

topology.addSource("source", "topic-1")
    .addProcessor(
        "processor",
        () -> myProcessor, // the supplier returns the same instance every time, reproducing the problem
        "source"
    )
    .addSink("sink", "topic-2", "processor");

KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();
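For comparison, a supplier that builds a fresh processor per call avoids the shared-instance problem; sketched here in Scala against the same topology (MyProcessor's key/value types are assumed to be String):

import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.processor.{Processor, ProcessorSupplier}

// Assumed fix: the supplier builds a new MyProcessor on every get(),
// so each stream task gets its own, fully initialized processor.
val topology = new Topology()
topology
  .addSource("source", "topic-1")
  .addProcessor(
    "processor",
    new ProcessorSupplier[String, String] {
      override def get(): Processor[String, String] = new MyProcessor() // new instance per get()
    },
    "source"
  )
  .addSink("sink", "topic-2", "processor")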

Related

RichSinkFunction for Cassandra in Flink

I read the advantages of using RichSinkFunction over directly calling the DB methods. Therefore, I decided to write my own RichSinkFunction.
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import com.datastax.driver.core.{Session, Cluster}
class CassandraAsSink extends RichSinkFunction {
  override def open(parameters: Configuration): Unit = {
    val cluster = Cluster.builder().addContactPoint("localhost").build()
    val session = cluster.connect("example")
  }

  override def invoke(value: Nothing, context: SinkFunction.Context): Unit = {
    session.execute(
      s"""
         INSERT INTO users (name, credits, user_id)
         VALUES ($name, $credits, $userId)
       """
    )
  }

  override def close(): Unit = {
    //something like session.close()
  }
}
However, I am not able to develop it fully. I want to call this method from a separate class, which should pass the 3 arguments mentioned in the code. The record is in JSON format; I can manage that by parsing it and getting the attributes. But how do I pass them to the invoke method, and how can I share the session object throughout the class? Also, is this a correct way of doing it, since I am new to both Flink and Scala?
Will something like stream/string.new CassandraAsSink().invoke(name, credits, user_id) work when it comes to the calling part?
Modified:
class CassandraSink extends RichSinkFunction[String] {
  var cluster: Cluster = _
  var session: Session = _

  println("inside....")

  override def open(parameters: Configuration): Unit = {
    cluster = Cluster.builder().addContactPoint("localhost").build()
    session = cluster.connect("example")
    println("Connected....")
  }

  override def invoke(value: String): Unit = {
    println("inside invoke: " + value)
    session.execute(
      s"""
         INSERT INTO jsondata1(records_b)
         VALUES ($value)
       """
    )
  }

  override def close(): Unit = {
    session.close()
    println("Session Closed...")
  }
}
Calling part:
val datastreamFromString: DataStream[String] = env.fromElements(data) // where data is a string
datastreamFromString.addSink(new CassandraAsSink())
I figured out that there is some problem with my DataStream created from String. The class is working fine. I have initialized the env variable as the second line in the class.
Flink already has a Cassandra sink; it has valuable features you haven't attempted to support, especially checkpointing.
As for your questions:
You can make session a member variable that is initialized in open and used in invoke.
Flink will call the invoke method for every stream record coming into the sink. The record is passed to invoke as the value parameter, and you'll need to extract fields like name, etc. from that value; a rough sketch of that is below.
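For illustration only, a sketch of what that extraction could look like if the value is a JSON string; json4s and the UserRecord field names are assumptions, not part of the answer:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical record shape; adjust the field names to your actual JSON.
case class UserRecord(name: String, credits: Int, user_id: String)

def extractFields(value: String): UserRecord = {
  implicit val formats: Formats = DefaultFormats
  parse(value).extract[UserRecord]
}

// Inside invoke(value: String, ...) you could then do something like:
//   val rec = extractFields(value)
//   session.execute(
//     "INSERT INTO users (name, credits, user_id) VALUES (?, ?, ?)",
//     rec.name, Int.box(rec.credits), rec.user_id)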
You'll need to attach the sink to your job graph; overall it will end up being something like this:
val env = StreamExecutionEnvironment.getExecutionEnvironment

env
  .addSource(source)
  ... // some processing
  .addSink(new CassandraAsSink())

env.execute()
By the way, there are training lessons with examples and exercises included in the Flink documentation to help you get started.

Apache Flink - Refresh a Hashmap asynchronously

I am developing an Apache Flink application using the Scala API (I am pretty new to this technology).
I am using a hashmap to store some values that come from a database, and I need to refresh these values every hour. Is there any way to refresh this hashmap asynchronously?
Thanks!
I'm not sure what you mean by "refresh this hashmap asynchronously" in the context of a Flink workflow.
For what it's worth, if you have a hashmap that's keyed by some piece of data from records flowing through your workflow, then you can use Flink's support for managed keyed state to store the value (and checkpoint it), and make it queryable; a small sketch of this is below.
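For illustration, a minimal sketch of such managed keyed state; the key/value types and names are assumptions, not from the answer:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Keeps the latest value per key in checkpointed keyed state (queryable if enabled).
class LatestValuePerKey extends KeyedProcessFunction[String, (String, String), (String, String)] {
  private lazy val latest: ValueState[String] =
    getRuntimeContext.getState(new ValueStateDescriptor[String]("latest", classOf[String]))

  override def processElement(in: (String, String),
                              ctx: KeyedProcessFunction[String, (String, String), (String, String)]#Context,
                              out: Collector[(String, String)]): Unit = {
    latest.update(in._2) // stored per key, checkpointed with the job
    out.collect(in)
  }
}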
I interpret your question to mean that you are using some state in Flink to mirror/cache some data that comes from an external database, and you wish to periodically refresh it.
Typically this sort of thing is done by continuously streaming a Change Data Capture (CDC) stream from the external database into Flink. Continuous, streaming solutions are generally a better fit for Flink. But if you want to do this in hourly batches, you could write a custom source or a ProcessFunction that wakes up once an hour, makes a query to the database, and emits a stream of records that can be used to update the operator holding the state.
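For illustration only, a rough sketch of the hourly-batch variant as a custom source that wakes up once an hour; the record type and queryDatabase() are placeholders, not part of the answer:

import org.apache.flink.streaming.api.functions.source.SourceFunction

// Hypothetical reference-data record emitted towards the operator holding the state.
case class ReferenceData(key: String, value: String)

class HourlyDbSource extends SourceFunction[ReferenceData] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[ReferenceData]): Unit = {
    while (running) {
      val rows: Seq[ReferenceData] = queryDatabase() // stands in for your DB client/query
      rows.foreach(ctx.collect)
      Thread.sleep(60 * 60 * 1000L) // wake up once an hour
    }
  }

  override def cancel(): Unit = { running = false }

  private def queryDatabase(): Seq[ReferenceData] = Seq.empty // placeholder
}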
You can achieve this with Apache Flink's Asynchronous I/O for External Data Access; see this post on async I/O for details.
Here's a way to use AsyncDataStream to refresh a map periodically, by creating an async function and attaching it to a source stream.
class AsyncEnricherFunction extends RichAsyncFunction[String, (String, String)] {
  @transient private var m: Map[String, String] = _
  @transient private var client: DataBaseClient = _
  @transient private var refreshInterval: Int = _
  @transient private var lastRefreshed: Long = _

  @throws(classOf[Exception])
  override def open(parameters: Configuration): Unit = {
    client = new DataBaseClient(host, port, credentials)
    refreshInterval = 1000
    load()
  }

  private def load(): Unit = {
    val str = "select key, value from KeyValue"
    m = client.query(str).asMap
    lastRefreshed = System.currentTimeMillis()
  }

  override def asyncInvoke(input: String, resultFuture: ResultFuture[(String, String)]): Unit = {
    Future {
      if (System.currentTimeMillis() > lastRefreshed + refreshInterval) load()
      val enriched = (input, m(input))
      resultFuture.complete(Seq(enriched))
    }(ExecutionContext.global)
  }

  override def close(): Unit = { client.close() }
}

val in: DataStream[String] = env.addSource(src)
val enriched = AsyncDataStream.unorderedWait(in, new AsyncEnricherFunction(), 5000, TimeUnit.MILLISECONDS, 100)

How do I test whether an offset has been committed to Kafka or not

I have an Akka Stream Kafka source that is reading from a Kafka topic.
I have a simple task that allows disabling the commit of the message offset. The commit is usually done by calling commitScaladsl.
My problem is that I don't know how to test whether the offset has been committed or not.
We usually use EmbeddedKafka for testing, but I haven't figured out a way of asking for the last committed offset.
This is an example of the test I have written:
"KafkaSource" should {
"consume from a kafka topic and pass the message " in {
  val commitToKafka = true
  val key = "key".getBytes
  val message = "message".getBytes
  withRunningKafka {
    val source = getKafkaSource(commitToKafka)
    val (_, sub) = source
      .toMat(TestSink.probe[CommittableMessage[Array[Byte], Array[Byte], ConsumerMessage.CommittableOffset]])(Keep.both)
      .run()
    val messageOpt = publishAndRequestRetry(topic, key, message, sub, retries)
    messageOpt should not be empty
    messageOpt.get.value shouldBe message
  }
}
}
Now I want to add a check for the offset being committed or not.
Kafka stores the offsets by topic name and partition ID, so you can use the .committed() or .position() methods to check the last committed offset or the current position of the Kafka consumer.
committed(): Get the last committed offset for the given partition (whether the commit happened by this process or another).
position(): Get the offset of the next record that will be fetched (if a record with that offset exists).
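As an illustration (not from the answers above), one way to check this is with a plain KafkaConsumer pointed at the embedded broker; kafkaPort, topic and commitToKafka refer to the values used in the test, and the partition number is assumed to be 0:

import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, s"localhost:$kafkaPort")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group-id") // same group id as the source under test

val probe = new KafkaConsumer(props, new ByteArrayDeserializer, new ByteArrayDeserializer)
val committedOffset = Option(probe.committed(new TopicPartition(topic, 0))) // null when nothing was committed
probe.close()

// With commits enabled the offset should be defined; with commits disabled it should stay empty.
committedOffset.isDefined shouldBe commitToKafka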
I finally solved it using a ConsumerInterceptor, defined as:
class Interceptor extends ConsumerInterceptor[Array[Byte], Array[Byte]] {

  override def onConsume(records: ConsumerRecords[Array[Byte], Array[Byte]]): ConsumerRecords[Array[Byte], Array[Byte]] = records

  override def onCommit(offsets: java.util.Map[TopicPartition, OffsetAndMetadata]): Unit = {
    import scala.collection.JavaConverters._
    OffsetRecorder.add(offsets.asScala)
  }

  override def close(): Unit = {}

  override def configure(configs: java.util.Map[String, _]): Unit = OffsetRecorder.clear
}
onCommit is called when the commit is done; in this case I just record it. I use the configure method so that each test starts with empty records.
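OffsetRecorder itself is not shown above; a minimal sketch of what it could look like (purely an assumption: a thread-safe in-memory buffer the test can assert against):

import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

object OffsetRecorder {
  private val committed = new ConcurrentLinkedQueue[(TopicPartition, OffsetAndMetadata)]()

  def add(offsets: scala.collection.Map[TopicPartition, OffsetAndMetadata]): Unit =
    offsets.foreach { case (tp, om) => committed.add(tp -> om) }

  def all: List[(TopicPartition, OffsetAndMetadata)] = committed.asScala.toList

  def clear(): Unit = committed.clear()
}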
Then, when creating the consumer settings for the source, I add the interceptor as a property:
ConsumerSettings(system, new ByteArrayDeserializer, new ByteArrayDeserializer)
  .withBootstrapServers(s"localhost:${kafkaPort}")
  .withGroupId("group-id")
  .withProperty(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG, "package.of.my.test.Interceptor")

Random fail on KafkaStreams stateful application

Hi, here is a problem I have been stumbling upon for a few days and cannot solve by myself.
I am using the Scala Streams API v2.0.0.
I have two incoming streams, branched over two handlers for segregation, and both declare a Transformer using a common StateStore.
To give a quick overview, it looks like this:
def buildStream(builder: StreamsBuilder, config: Config) = {
  val store = Stores.keyValueStoreBuilder[String, AggregatedState](Stores.persistentKeyValueStore(config.storeName), ...)
  builder.addStateStore(store)

  val handlers = List(handler1, handler2)

  builder
    .stream(config.topic)
    .branch(handlers.map(_.accepts).toList: _*) // Dispatch events to the first handler accepting it
    .zip(handlers.toList)                       // (KStream[K, V], Handler)
    .map((h, stream) => h.handle(stream))       // process the event on the correct handler
    .reduce((s1, s2) => s1.merge(s2))           // merge them back as they return the same object
    .to(config.output)

  builder
}
Each of my handlers looks the same: take an event, do some operations, pass it through the transform() method to derive a state, and emit an aggregate:
class Handler1(config: Config) {
  def accepts(key: String, value: Event): Boolean = ??? // Implementation not needed

  def handle(stream: KStream[String, Event]) = {
    stream
      .(join/map/filter)
      .transform(new Transformer1(config.storeName))
  }
}

class Handler2(config: Config) {
  def accepts(key: String, value: Event): Boolean = ??? // Implementation not needed

  def handle(stream: KStream[String, Event]) = {
    stream
      .(join/map/filter)
      .transform(new Transformer2(config.storeName))
  }
}
The transformers use the same StateStore with the following logic: for a new event, check if its aggregate exists; if yes, update it, store it, and emit the new aggregate; otherwise, build the aggregate, store it, and emit it.
class Transformer1(storeName: String) extends Transformer[String, Event, (String, AggregatedState)] {
  private var store: KeyValueStore[String, AggregatedState] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, AggregatedState]]
  }

  override def transform(key: String, value: Event): (String, AggregatedState) = {
    val existing: Option[AggregatedState] = Option(store.get(key))
    val agg = existing.map(_.updateWith(value)).getOrElse(new AggregatedState(value))

    store.put(key, agg)
    if (agg.isTerminal) {
      store.delete(key)
    }

    if (isDuplicate(existing, agg)) {
      null       // Tombstone, we have a duplicate
    } else {
      (key, agg) // Emit the new aggregate
    }
  }

  override def close(): Unit = ()
}
class Transformer2(storeName: String) extends Transformer[String, Event, (String, AggregatedState)] {
  private var store: KeyValueStore[String, AggregatedState] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, AggregatedState]]
  }

  override def transform(key: String, value: Event): (String, AggregatedState) = {
    val existing: Option[AggregatedState] = Option(store.get(key))
    val agg = existing.map(_.updateWith(value)).getOrElse(new AggregatedState(value))

    store.put(key, agg)
    if (agg.isTerminal) {
      store.delete(key)
    }

    if (isDuplicate(existing, agg)) {
      null       // Tombstone, we have a duplicate
    } else {
      (key, agg) // Emit the new aggregate
    }
  }

  override def close(): Unit = ()
}
Transformer2 is the same; it's just the business logic that changes (how to merge a new event with an aggregated state).
The problem I have is that on stream startup, I can either have a normal startup or a boot exception:
15:07:23,420 ERROR org.apache.kafka.streams.processor.internals.AssignedStreamsTasks - stream-thread [job-tracker-prod-5ba8c2f7-d7fd-48b5-af4a-ac78feef71d3-StreamThread-1] Failed to commit stream task 1_0 due to the following error:
org.apache.kafka.streams.errors.ProcessorStateException: task [1_0] Failed to flush state store KSTREAM-AGGREGATE-STATE-STORE-0000000003
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:242)
at org.apache.kafka.streams.processor.internals.AbstractTask.flushState(AbstractTask.java:198)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:406)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:380)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:368)
at org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:67)
at org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:362)
at org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:352)
at org.apache.kafka.streams.processor.internals.TaskManager.commitAll(TaskManager.java:401)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:1035)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:845)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: java.lang.IllegalStateException: This should not happen as timestamp() should only be called while a record is processed
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.timestamp(AbstractProcessorContext.java:161)
at org.apache.kafka.streams.state.internals.StoreChangeLogger.logChange(StoreChangeLogger.java:59)
at org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:66)
at org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:31)
at org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.put(InnerMeteredKeyValueStore.java:206)
at org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.put(MeteredKeyValueBytesStore.java:117)
at com.mycompany.streamprocess.Transformer1.transform(Transformer1.scala:49) // Line with store.put(key, agg)
I already searched and found answers saying "the transformer must use a factory pattern", which is what is used here (as .transform takes the transformer and creates a TransformerSupplier under the hood).
As the error is pseudo-random (I could reproduce it sometimes), I guess it could be a race condition on startup, but I found nothing conclusive.
Is it because I use the same state store in different transformers?
I assume you are hitting https://issues.apache.org/jira/browse/KAFKA-7250
It's fixed in versions 2.0.1 and 2.1.0.
If you cannot upgrade, you need to pass in the TransformerSupplier explicitly, because the Scala API constructs the supplier incorrectly in 2.0.0:
.transform(() => new Transformer1(config.storeName))

Creating a flow from actor in Akka Streams

It's possible to create sources and sinks from actors using the Source.actorPublisher() and Sink.actorSubscriber() methods, respectively. But is it possible to create a Flow from an actor?
Conceptually there doesn't seem to be a good reason not to, given that an actor can implement both the ActorPublisher and ActorSubscriber traits, but unfortunately the Flow object doesn't have any method for doing this. In this excellent blog post it's done in an earlier version of Akka Streams, so the question is whether it's also possible in the latest (2.4.9) version.
I'm part of the Akka team and would like to use this question to clarify a few things about the raw Reactive Streams interfaces. I hope you'll find this useful.
Most notably, we'll be posting multiple posts on the Akka team blog about building custom stages, including Flows, soon, so keep an eye on it.
Don't use ActorPublisher / ActorSubscriber
Please don't use ActorPublisher and ActorSubscriber. They're too low level and you might end up implementing them in a way that violates the Reactive Streams specification. They're a relic of the past, and even then they were "power-user mode" only. There really is no reason to use those classes nowadays. We never provided a way to build a flow from them because the complexity is simply explosive if it is exposed as a "raw" Actor API for you to implement while getting all the rules right.
If you really, really want to implement raw Reactive Streams interfaces, then please use the Specification's TCK to verify your implementation is correct. You will likely be caught off guard by some of the more complex corner cases a Flow (or, in RS terminology, a Processor) has to handle.
Most operations can be built without going low-level
Many flows can be built simply by starting from a Flow[T] and adding the needed operations onto it, just as an example:
val newFlow: Flow[String, Int, NotUsed] = Flow[String].map(_.toInt)
Which is a reusable description of the Flow.
Since you're asking about power-user mode, this is the most powerful operator in the DSL itself: statefulMapConcat. The vast majority of operations on plain stream elements are expressible using it: Flow.statefulMapConcat[T](f: () ⇒ (Out) ⇒ Iterable[T]): Repr[T].
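As a small illustration (not from the original answer), a flow that numbers its elements with statefulMapConcat; the () => ... factory runs once per materialization, so each run gets its own counter:

import akka.NotUsed
import akka.stream.scaladsl.Flow

val numbered: Flow[String, (Long, String), NotUsed] =
  Flow[String].statefulMapConcat { () =>
    var counter = 0L // state created once per materialization
    elem => {
      counter += 1
      (counter, elem) :: Nil // emit zero or more elements per input
    }
  }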
If you need timers you could zip with a Source.timer etc.
GraphStage is the simplest and safest API to build custom stages
Instead, building Sources/Flows/Sinks has its own powerful and safe API: the GraphStage. Please read the documentation about building custom GraphStages (they can be a Sink/Source/Flow or even any arbitrary shape). It handles all of the complex Reactive Streams rules for you, while giving you full freedom and type-safety while implementing your stages (which could be a Flow).
For example, taken from the docs, here is a GraphStage implementation of the filter(T => Boolean) operator:
class Filter[A](p: A => Boolean) extends GraphStage[FlowShape[A, A]] {

  val in = Inlet[A]("Filter.in")
  val out = Outlet[A]("Filter.out")

  val shape = FlowShape.of(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          val elem = grab(in)
          if (p(elem)) push(out, elem)
          else pull(in)
        }
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })
    }
}
It also handles asynchronous channels and is fusable by default.
In addition to the docs, these blog posts explain in detail why this API is the holy grail of building custom stages of any shape:
Akka team blog: Mastering GraphStages (part 1, introduction) - a high level overview
... tomorrow we'll publish one about its API as well ...
Kunicki blog: Implementing a Custom Akka Streams Graph Stage - another example implementing sources (really applies 1:1 to building Flows)
Konrad's solution demonstrates how to create a custom stage that utilizes Actors, but in most cases I think that is a bit overkill.
Usually you have some Actor that is capable of responding to questions:
val actorRef : ActorRef = ???
type Input = ???
type Output = ???
val queryActor : Input => Future[Output] =
  (actorRef ? _) andThen (_.mapTo[Output])
This can be easily utilized with basic Flow functionality which takes in the maximum number of concurrent requests:
val actorQueryFlow : Int => Flow[Input, Output, _] =
  (parallelism) => Flow[Input].mapAsync[Output](parallelism)(queryActor)
Now actorQueryFlow can be integrated into any stream...
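For example, a hypothetical wiring (an implicit ask Timeout and a materializer are assumed; the input collection and parallelism are placeholders):

import akka.stream.scaladsl.{Sink, Source}

val inputs: List[Input] = List.empty // placeholder input elements

Source(inputs)
  .via(actorQueryFlow(4)) // at most 4 asks in flight at once
  .runWith(Sink.foreach(println))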
Here is a solution built using a graph stage. The actor has to acknowledge all messages in order to provide back-pressure. The actor is notified when the stream fails/completes, and the stream fails when the actor terminates.
This can be useful if you don't want to use ask, e.g. when not every input message has a corresponding output message.
import akka.actor.{ActorRef, Status, Terminated}
import akka.stream._
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

object ActorRefBackpressureFlowStage {
  case object StreamInit
  case object StreamAck
  case object StreamCompleted
  case class StreamFailed(ex: Throwable)
  case class StreamElementIn[A](element: A)
  case class StreamElementOut[A](element: A)
}
/**
 * Sends the elements of the stream to the given `ActorRef`, which sends back a back-pressure signal.
 * The first message is always `StreamInit`; the stream then waits for the acknowledgement message
 * `StreamAck` from the given actor, which means that it is ready to process
 * elements. It also requires a `StreamAck` message after each stream element
 * to make back-pressure work. Stream elements are wrapped inside `StreamElementIn(elem)` messages.
 *
 * The target actor can emit elements at any time by sending a `StreamElementOut(elem)` message, which will
 * be emitted downstream when there is demand.
 *
 * If the target actor terminates, the stage will fail with a WatchedActorTerminatedException.
 * When the stream is completed successfully, a `StreamCompleted` message
 * will be sent to the destination actor.
 * When the stream is completed with failure, a `StreamFailed(ex)` message will be sent to the destination actor.
 */
class ActorRefBackpressureFlowStage[In, Out](private val flowActor: ActorRef) extends GraphStage[FlowShape[In, Out]] {

  import ActorRefBackpressureFlowStage._

  val in: Inlet[In] = Inlet("ActorFlowIn")
  val out: Outlet[Out] = Outlet("ActorFlowOut")

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {

    private lazy val self = getStageActor {
      case (_, StreamAck) =>
        if (firstPullReceived) {
          if (!isClosed(in) && !hasBeenPulled(in)) {
            pull(in)
          }
        } else {
          pullOnFirstPullReceived = true
        }

      case (_, StreamElementOut(elemOut)) =>
        val elem = elemOut.asInstanceOf[Out]
        emit(out, elem)

      case (_, Terminated(targetRef)) =>
        failStage(new WatchedActorTerminatedException("ActorRefBackpressureFlowStage", targetRef))

      case (actorRef, unexpected) =>
        failStage(new IllegalStateException(s"Unexpected message: `$unexpected` received from actor `$actorRef`."))
    }

    var firstPullReceived: Boolean = false
    var pullOnFirstPullReceived: Boolean = false

    override def preStart(): Unit = {
      // Initialize the stage actor and watch the flow actor.
      self.watch(flowActor)
      tellFlowActor(StreamInit)
    }

    setHandler(in, new InHandler {
      override def onPush(): Unit = {
        val elementIn = grab(in)
        tellFlowActor(StreamElementIn(elementIn))
      }

      override def onUpstreamFailure(ex: Throwable): Unit = {
        tellFlowActor(StreamFailed(ex))
        super.onUpstreamFailure(ex)
      }

      override def onUpstreamFinish(): Unit = {
        tellFlowActor(StreamCompleted)
        super.onUpstreamFinish()
      }
    })

    setHandler(out, new OutHandler {
      override def onPull(): Unit = {
        if (!firstPullReceived) {
          firstPullReceived = true
          if (pullOnFirstPullReceived) {
            if (!isClosed(in) && !hasBeenPulled(in)) {
              pull(in)
            }
          }
        }
      }

      override def onDownstreamFinish(): Unit = {
        tellFlowActor(StreamCompleted)
        super.onDownstreamFinish()
      }
    })

    private def tellFlowActor(message: Any): Unit = {
      flowActor.tell(message, self.ref)
    }

  }

  override def shape: FlowShape[In, Out] = FlowShape(in, out)

}