Does a Flink streaming job maintain its keyed value state between job runs? - apache-kafka

Our use case is to use Flink streaming for a de-duplication job, which reads its data from a source (Kafka topic) and writes unique records into an HDFS file sink.
The Kafka topic may contain duplicate data, which can be identified by the composite key
(adserver_id, unix_timestamp of the record),
so I decided to use a Flink keyed stream with value state to achieve de-duplication.
val messageStream: DataStream[DCNRecord] = env.addSource(flinkKafkaConsumer)
messageStream
  .map { record =>
    // build the composite key (adserver_id + event timestamp) used for de-duplication
    val key = record.adserver_id.get + record.event_timestamp.get
    (key, record)
  }
  .keyBy(_._1)
  .flatMap(new DedupDCNRecord())
  .map(_.toString)
  .addSink(sink)

// execute the stream
env.execute(applicationName)
}
Here is the code for de-duplication using ValueState in Flink.
class DedupDCNRecord extends RichFlatMapFunction[(String, DCNRecord), DCNRecord] {
  private var operatorState: ValueState[String] = _

  override def open(configuration: Configuration): Unit = {
    operatorState = getRuntimeContext.getState(
      DedupDCNRecord.descriptor
    )
  }

  @throws[Exception]
  override def flatMap(value: (String, DCNRecord), out: Collector[DCNRecord]): Unit = {
    if (operatorState.value == null) { // we haven't seen this key yet
      out.collect(value._2)
      // remember the key so that we don't emit elements with this key again
      operatorState.update(value._1)
    }
  }
}
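The companion object holding the state descriptor (DedupDCNRecord.descriptor, referenced in open()) is not shown in the question; a minimal sketch could look like this, with the state name being an assumption:
import org.apache.flink.api.common.state.ValueStateDescriptor

object DedupDCNRecord {
  // descriptor for the ValueState[String] used above; the state name "dedup-key" is hypothetical
  val descriptor = new ValueStateDescriptor[String]("dedup-key", classOf[String])
}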
This approach works fine as long as the streaming job is running: it maintains the set of unique keys in value state and performs de-duplication.
But as soon as I cancel the job, Flink loses the state (the unique keys seen in the previous run) of the value state, keeps only the unique keys of the current run, and lets records pass that were already processed in the previous run of the job.
Is there a way to force Flink to maintain the value state (unique keys) seen so far?
Appreciate your help.

This requires that you capture a snapshot of the state before shutting down the job, and then restart from that snapshot:
Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
Relaunch, using the savepoint as the starting point.
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground. The section on Observing Failure & Recovery is also relevant here.
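Concretely, with the Flink CLI the sequence is roughly the following (savepoint directory, job ID, and jar are placeholders):
# take a savepoint and stop the running job
flink stop -p <savepoint-directory> <jobID>
# relaunch the job from that savepoint
flink run -s <savepoint-path> <your-job.jar>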

Related

Stale ktable records when joining kstream with ktable created by kstream aggregation

I'm trying to implement the event sourcing pattern with Kafka Streams in the following way.
I'm working on a security service and handle two use cases:
Register User, handling RegisterUserCommand should produce UserRegisteredEvent.
Change User Name, handling ChangeUserNameCommand should produce UserNameChangedEvent.
I have two topics:
Command topic, 'security-command'. Every command is keyed, and the key is the user's email. For example:
foo#bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo#bar.com"}}
foo#bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo#bar.com","newName":"Alex1"}}
Event topic, 'security-event'. Every record is keyed by the user's email:
foo#bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo#bar.com","name":"Alex", "version":0}}
foo#bar.com:{"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex1","version":1}}
Kafka Streams version 2.8.0
Kafka version 2.8
The implementation idea can be expressed in the following topology:
commandStream = builder.stream("security-command");

eventStream = builder.stream("security-event",
    Consumed.with(
        ...,
        new ZeroTimestampExtractor()
        /* always returns 0 to get the latest version of the snapshot */));

// build the snapshot to get the current state of the user.
userSnapshots = eventStream.groupByKey()
    .aggregate(() -> new UserSnapshot(),
        (key /* email */, event, currentSnapshot) -> currentSnapshot.apply(event));

// join commands with the latest snapshot at the time of the join
commandWithSnapshotStream =
    commandStream.leftJoin(
        userSnapshots,
        (command, snapshot) -> new CommandWithUserSnapshot(command, snapshot),
        joinParams
    );

// handle the command given the current snapshot
resultingEventStream = commandWithSnapshotStream.flatMap((key /* email */, commandWithSnapshot) -> {
    var newEvents = commandHandler(commandWithSnapshot.command(), commandWithSnapshot.snapshot());
    return Arrays.stream(newEvents)
        .map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
        .toList();
});

// append events to the events topic
resultingEventStream.to("security-event");
For this topology, I'm using EOS exactly_once_beta.
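For reference, the configuration mentioned here (and the tuning knobs tried further down) would be set roughly as follows; this is a sketch in Scala, and the application id and bootstrap servers are placeholders:
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "security-service")    // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA)
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, "4000")              // value experimented with below
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "0")        // value experimented with below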
A more explicit version of this topology:
KStream<String, Command<DomainEvent[]>> commandStream =
    builder.stream(
        commandTopic,
        Consumed.with(Serdes.String(), new SecurityCommandSerde()));

KStream<String, DomainEvent> eventStream =
    builder.stream(
        eventTopic,
        Consumed.with(Serdes.String(), new DomainEventSerde())
            .withTimestampExtractor(
                new LatestRecordTimestampExtractor() /* always returns 0 to get the latest version of the snapshot */));

// build the snapshot KTable by aggregating all the current events for a given user.
KTable<String, UserSnapshot> userSnapshots =
    eventStream.groupByKey()
        .aggregate(
            () -> new UserSnapshot(),
            (email, event, currentSnapshot) -> currentSnapshot.apply(event),
            Materialized.with(
                Serdes.String(),
                new UserSnapshotSerde()));

// join the command stream and the snapshot table to get a stream of <Command, UserSnapshot> pairs
Joined<String, Command<DomainEvent[]>, UserSnapshot> commandWithSnapshotJoinParams =
    Joined.with(
        Serdes.String(),
        new SecurityCommandSerde(),
        new UserSnapshotSerde()
    );

KStream<String, CommandWithUserSnapshot> commandWithSnapshotStream =
    commandStream.leftJoin(
        userSnapshots,
        (command, snapshot) -> new CommandWithUserSnapshot(command, snapshot),
        commandWithSnapshotJoinParams
    );

var resultingEventStream = commandWithSnapshotStream.flatMap((key /* email */, commandWithSnapshot) -> {
    var command = commandWithSnapshot.command();
    if (command instanceof RegisterUserCommand registerUserCommand) {
        var handler = new RegisterUserCommandHandler();
        var events = handler.handle(registerUserCommand);
        // multiple events might be produced when a command is handled.
        return Arrays.stream(events)
            .map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
            .toList();
    }
    if (command instanceof ChangeUserNameCommand changeUserNameCommand) {
        var handler = new ChangeUserNameCommandHandler();
        var events = handler.handle(changeUserNameCommand, commandWithSnapshot.userSnapshot());
        return Arrays.stream(events)
            .map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
            .toList();
    }
    throw new IllegalArgumentException("...");
});

resultingEventStream.to(eventTopic, Produced.with(Serdes.String(), new DomainEventSerde()));
Problems I'm getting:
Launching the stream app on a command topic with existing records:
foo#bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo#bar.com"}}
foo#bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo#bar.com","newName":"Alex1"}}
Outcome:
1. Stream application fails when processing the ChangeUserNameCommand, because the snapshot is null.
2. The events topic has a record for successful registration, but nothing for changing the name:
/*OK*/foo#bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo#bar.com","name":"Alex", "version":0}}
Thoughts:
When processing the ChangeUserNameCommand, the snapshot is missing in the aggregated KTable, userSnapshots. Restarting the application successfully produces the following record:
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex1","version":1}}
Tried increasing the max.task.idle.ms to 4 seconds - no effect.
Launching the stream app and producing a set of ChangeUserNameCommand commands in quick succession.
Producing:
// Produce to command topic
foo#bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo#bar.com"}}
// event topic outcome
/*OK*/ foo#bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo#bar.com","name":"Alex", "version":0}}
// Produce at once to command topic
foo#bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo#bar.com","newName":"Alex1"}}
foo#bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo#bar.com","newName":"Alex2"}}
foo#bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo#bar.com","newName":"Alex3"}}
// event topic outcome
/*OK*/foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex1","version":1}}
/*NOK*/foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex2","version":1}}
/*NOK*/foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex3","version":1}}
Thoughts:
'ChangeUserNameCommand' commands are joined with a stale version of the snapshot (pay attention to the version attribute).
The expected outcome would be:
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex1","version":1}}
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex2","version":2}}
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex3","version":3}}
I tried increasing max.task.idle.ms to 4 seconds and setting cache.max.bytes.buffering to 0; neither has any effect.
What am I missing in building such a topology? I expect every command to be processed against the latest version of the snapshot. If I produce the commands with a few seconds' delay between them, everything works as expected.
I think you missed the changelog recovery part for the tables. Read this to understand what happens during changelog recovery.
For tables, it is more complex because they must maintain additional
information—their state—to allow for stateful processing such as joins
and aggregations like COUNT() or SUM(). To achieve this while also
ensuring high processing performance, tables (through their state
stores) are materialized on local disk within a Kafka Streams
application instance or a ksqlDB server. But machines and containers
can be lost, along with any locally stored data. How can we make
tables fault tolerant, too?
The answer is that any data stored in a table is also stored remotely
in Kafka. Every table has its own change stream for this purpose—a
built-in change data capture (CDC) setup, we could say. So if we have
a table of account balances by customer, every time an account balance
is updated, a corresponding change event will be recorded into the
change stream of that table.
Also keep in mind that restarting a Kafka Streams application should not reprocess previously processed events; for that, the offset of a message needs to be committed after it has been processed.
Found the root cause. Not sure if it is by design or a bug, but a stream task will wait only once per processing cycle for data in other partitions.
So if two records from the command topic are read first, the stream task waits up to max.task.idle.ms, allowing the poll() phase to happen, before processing the first command record. After that record is processed, while processing the second one the stream task will not allow polling to pick up the newly generated events that resulted from processing the first command.
In Kafka 2.8, the code responsible for this behavior is in StreamTask.java. isProcessable() is invoked at the beginning of the processing phase; if it returns false, the polling phase is repeated.
public boolean isProcessable(final long wallClockTime) {
    if (state() == State.CLOSED) {
        return false;
    }
    if (hasPendingTxCommit) {
        return false;
    }
    if (partitionGroup.allPartitionsBuffered()) {
        idleStartTimeMs = RecordQueue.UNKNOWN;
        return true;
    } else if (partitionGroup.numBuffered() > 0) {
        if (idleStartTimeMs == RecordQueue.UNKNOWN) {
            idleStartTimeMs = wallClockTime;
        }
        if (wallClockTime - idleStartTimeMs >= maxTaskIdleMs) {
            return true;
            // idleStartTimeMs is not reset to its default value, RecordQueue.UNKNOWN;
            // therefore the next time the check for all buffered partitions is done, `true` is returned,
            // meaning the task is considered ready to be processed.
        } else {
            return false;
        }
    } else {
        // there's no data in any of the topics; we should reset the enforced
        // processing timer
        idleStartTimeMs = RecordQueue.UNKNOWN;
        return false;
    }
}

Flink state empty (reinitialized) after rerun

I'm trying to connect two streams; the first is persisted in MapState:
RocksDB saves data in the checkpoint folder, but after a new run the state is empty. I run it locally and in a Flink cluster, cancelling the job in the cluster and simply rerunning it locally.
env.setStateBackend(new RocksDBStateBackend(..))
env.enableCheckpointing(1000)
...
val productDescriptionStream: KeyedStream[ProductDescription, String] = env.addSource(..)
  .keyBy(_.id)
val productStockStream: KeyedStream[ProductStock, String] = env.addSource(..)
  .keyBy(_.id)
and
productDescriptionStream
  .connect(productStockStream)
  .process(ProductProcessor())
  .setParallelism(1)
env.execute("Product aggregator")
ProductProcessor
case class ProductProcessor() extends CoProcessFunction[ProductDescription, ProductStock, Product] {

  private[this] lazy val stateDescriptor: MapStateDescriptor[String, ProductDescription] =
    new MapStateDescriptor[String, ProductDescription](
      "productDescription",
      createTypeInformation[String],
      createTypeInformation[ProductDescription]
    )

  private[this] lazy val states: MapState[String, ProductDescription] =
    getRuntimeContext.getMapState(stateDescriptor)

  override def processElement1(value: ProductDescription,
      ctx: CoProcessFunction[ProductDescription, ProductStock, Product]#Context,
      out: Collector[Product]): Unit = {
    states.put(value.id, value)
  }

  override def processElement2(value: ProductStock,
      ctx: CoProcessFunction[ProductDescription, ProductStock, Product]#Context,
      out: Collector[Product]): Unit = {
    if (states.contains(value.id)) {
      val product = Product(
        id = value.id,
        description = Some(states.get(value.id).description),
        stock = Some(value.stock),
        updatedAt = value.updatedAt)
      out.collect(product)
    }
  }
}
Checkpoints are created by Flink for recovering from failures, not for resuming after a manual shutdown. When a job is canceled, the default behavior is for Flink to delete the checkpoints. Since the job can no longer fail, it won't need to recover.
You have several options:
(1) Configure your checkpointing to retain checkpoints when a job is cancelled:
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
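In the Scala job from the question, the same configuration would look roughly like this (a sketch using the same API, combined with the existing state-backend and checkpointing setup; the checkpoint path is a placeholder):
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig

env.setStateBackend(new RocksDBStateBackend(checkpointPath)) // checkpointPath is a placeholder
env.enableCheckpointing(1000)
env.getCheckpointConfig.enableExternalizedCheckpoints(
  CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)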
Then when you restart the job you'll need to indicate that you want it restarted from a specific checkpoint:
flink run -s <checkpoint-path> ...
Otherwise, whenever you start a job it will begin with an empty state backend.
(2) Instead of canceling the job, use stop with savepoint:
flink stop [-p targetDirectory] [-d] <jobID>
after which you'll again need to use flink run -s ... to resume from the savepoint.
Stop with a savepoint is a cleaner approach than relying on there being a recent checkpoint to fall back to.
(3) Or you could use Ververica Platform Community Edition, which raises the level of abstraction to the point where you don't have to manage these details yourself.

Spark streaming store method only work in Duration window but not in foreachRDD workflow in customized receiver

I defined a receiver to read data from Redis.
Part of the receiver's simplified code:
class MyReceiver extends Receiver(StorageLevel.MEMORY_ONLY) {
  override def onStart() = {
    while (!isStopped) {
      val res = readMethod()
      if (res != null) store(res.toIterator)
      // using res.foreach(r => store(r)) the performance is almost the same
    }
  }
}
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)
s.foreachRDD{ r =>
r.persist()
if (!r.isEmpty) {
some short operations about 1s in total
// note this line ######1
}
}
I have a producer which produces much faster than the consumer, so there are plenty of records in Redis; I tested with 10000 of them. I debugged, and all records could quickly be read by readMethod() above once they were in Redis. However, in each microbatch I only get about 30 records. (If store were fast enough, it should get all 10000.)
With this suspicion, I added a 10-second sleep, Thread.sleep(10000), at ######1 above. Each microbatch still gets about 30 records, and each microbatch's processing time increases by 10 seconds. And if I increase the Duration to 200 ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it gets about 120 records.
Does all of this show that Spark Streaming only generates RDDs within the Duration window, and that the store method is temporarily stopped after the RDD is generated and while the main workflow runs? That would be a great waste if true; I want it to keep generating RDDs (store) while the main workflow is running.
Any ideas?
I cannot leave a comment because I don't have enough reputation. Is it possible that the property spark.streaming.receiver.maxRate is set somewhere in your code?
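If it is, it would typically look something like the following in the SparkConf (the property names are standard Spark settings; the values here are only illustrative):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "1500")      // caps records per second per receiver
  .set("spark.streaming.backpressure.enabled", "true")  // dynamically limits the ingestion rate as well
Both settings limit how fast the receiver is allowed to store records, which would cap the number of records per microbatch.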

Spark Structured Streaming: Running Kafka consumer on separate worker thread

So I have a Spark application that needs to read two streams from two Kafka clusters (Kafka A and B) using Structured Streaming, and do some joins and filtering on the two streams. Is it possible to have a Spark job that reads the stream from A, and also runs a thread (called consumer) on each worker that reads Kafka B and puts the data into a map? Later, when we are filtering, we could do something like stream.filter(row => consumer.idNotInMap(row.id)).
I have some questions regarding this approach:
If this approach works, will it cause any problems when the application is run on a cluster?
Will all consumer instances on each worker receive the same data in cluster mode? Or can we even let each consumer listen only on the Kafka partitions for that worker node (which is probably controlled by Spark)?
How will the consumer instance get serialized and passed to the workers?
Currently they are initialized on the driver node, but is there a way to initialize one once per worker node?
I feel like in my case I should use stream joining instead. I've already tried that and it didn't work, which is why I am taking this approach. It didn't work because the stream from Kafka A is append-only, while stream B needs state that can be updated, which makes it update-only, and joining streams in append and update mode is not supported in Spark.
Here is some pseudo-code:
// SparkJob.scala
val consumer = Consumer()
val getMetadata = udf(id => consumer.get(id))
val enrichedDataSet = stream.withColumn("metadata", getMetadata(stream("id")))

// Consumer.java
class Consumer implements Serializable {
    private final ConcurrentHashMap<Integer, String> metadata;

    public Consumer() {
        metadata = new ConcurrentHashMap<>();
        // read stream
        listen();
    }

    // process kafka data inside this loop
    private void listen() {
        Thread t = new Thread(() -> {
            KafkaConsumer consumer = ...;
            while (consumer.hasNext()) {
                var message = consumer.next();
                // update metadata or put in new metadata
                metadata.put(message.id, message);
            }
        });
        t.start();
    }

    public String get(Integer key) {
        return metadata.get(key);
    }
}

Joining a KTable with a KStream and nothing arrives in the output topic

I left-join a KStream with a KTable, but I don't see any output in the output topic:
val stringSerde: Serde[String] = Serdes.String()
val longSerde: Serde[java.lang.Long] = Serdes.Long()
val genericRecordSerde: Serde[GenericRecord] = new GenericAvroSerde()

val builder = new KStreamBuilder()

val networkImprStream: KStream[Long, GenericRecord] = builder
  .stream(dfpGcsNetworkImprEnhanced)

// Create a global table for advertisers. The data from this global table
// will be fully replicated on each instance of this application.
val advertiserTable: GlobalKTable[java.lang.Long, GenericRecord] =
  builder.globalTable(advertiserTopicName, "advertiser-store")

// Join the network impression stream to the advertiser global table. As this is a global table
// we can use a non-key-based join without needing to repartition the input stream.
val networkImprWithAdvertiserNameKStream: KStream[java.lang.Long, GenericRecord] =
  networkImprStream.leftJoin(advertiserTable,
    (_, networkImpr) => {
      println(networkImpr)
      networkImpr.get("advertiserId").asInstanceOf[java.lang.Long]
    },
    (networkImpr: GenericRecord, advertiserIdToName: GenericRecord) => {
      println(networkImpr)
      networkImpr.put("advertiserName", advertiserIdToName.get("name"))
      networkImpr
    }
  )

networkImprWithAdvertiserNameKStream.to(networkImprProcessed)

val streams = new KafkaStreams(builder, streamsConfiguration)
streams.cleanUp()
streams.start()

// usually the stream application would be running forever;
// in this example we just let it run for some time and stop since the input data is finite.
Thread.sleep(15000L)
If I bypass the join and output the input topic directly to the output, I see messages arriving. I've already changed the join to a left join and added some printlns to see when the key is extracted (nothing is printed on the console, though). I also use the Kafka Streams reset tool every time, so I start from the beginning. I am running out of ideas here. I've also added some test access to the store; it works and contains keys from the stream (although this should not prevent any output, because of the left join).
In my source stream the key is null. Although I am not using this key to join the table, the key must not be null. After creating an intermediate stream with a dummy key, it works. So even though I have a GlobalKTable here, the restrictions on the keys of the stream messages still apply:
http://docs.confluent.io/current/streams/developer-guide.html#kstream-ktable-join
Input records for the stream with a null key or a null value are ignored and do not trigger the join.
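For reference, assigning a non-null key before the join can be done with selectKey; a sketch (the chosen key only has to be non-null, since the GlobalKTable join uses the key extractor shown above anyway):
// give every record a non-null key so the join is no longer skipped;
// reusing advertiserId here is arbitrary
val keyedNetworkImprStream = networkImprStream.selectKey[java.lang.Long](
  (_, networkImpr) => networkImpr.get("advertiserId").asInstanceOf[java.lang.Long]
)
// then use keyedNetworkImprStream.leftJoin(advertiserTable, ...) as before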