Spark Structured Streaming: Running Kafka consumer on separate worker thread - scala

So I have a Spark application that needs to read two streams from two kafka clusters (Kafka A and B) using structured streaming, and do some joins and filtering on the two streams. So is it possible to have a Spark job that reads stream from A, and also run a thread (Called consumer) on each worker that reads Kafka B and put data into a map. So later when we are filtering we can do something like stream.filter(row => consumer.idNotInMap(row.id))?
I have some questions regarding this approach:
If this approach works, will it cause any problems when the application is run on a cluster?
Will all consumer instance on each worker receive the same data in cluster mode? Or can we even let each consumer only listen on the Kafka partition for that worker node (which is probably controlled by Spark)?
How will the consumer instance gets serialized and passed to workers?
Currently they are initialized on Driver node but are there some ways to initialize it once for each worker node?
I feel like in my case I should use stream joining instead. I've already tried that and it didn't work, that's why I am taking this approach. It didn't work because stream from Kafka A is append only and stream B needs to have a state that can be updated, which makes it update only. Then joining streams of append and update mode is not supported in Spark.
Here are some pseudo-code:
// SparkJob.scala
val consumer = Consumer()
val getMetadata = udf(id => consumer.get(id))
val enrichedDataSet = stream.withColumn("metadata", getMetadata(stream("id"))
// Consumer.java
class Consumer implements Serializable {
private final ConcurrentHashMap<Integer, String> metadata;
public MetadataConsumer() {
metadata = new ConcurrentHashMap<>();
// read stream
listen();
}
// process kafka data inside this loop
private void listen() {
Thread t = new Thread(() -> {
KafkaConsumer consumer = ...;
while (consumer.hasNext()) {
var message = consumer.next();
// update metadata or put in new metadata
metadata.put(message.id, message);
}
});
t.start();
}
public String get(Integer key) {
return metadata.get(key);
}
}

Related

Does flink streaming job maintain its keyed value state between job runs?

Our usecase is we want to use flink streaming for a de-duplicator job, which reads it's data from source(kafka topic) and writes unique records into hdfs file sink.
Kafka topic could have duplicate data, which can be identified by using composite key
(adserver_id, unix_timestamp of the record)
so I decided to use flink keyed state stream to achieve de-duplication.
val messageStream: DataStream[String] = env.addSource(flinkKafkaConsumer)
messageStream
.map{
record =>
val key = record.adserver_id.get + record.event_timestamp.get
(key,record)
}
.keyBy(_._1)
.flatMap(new DedupDCNRecord())
.map(_.toString)
.addSink(sink)
// execute the stream
env.execute(applicationName)
}
Here is the code for de-duplication using value state from flink.
class DedupDCNRecord extends RichFlatMapFunction[(String, DCNRecord), DCNRecord] {
private var operatorState: ValueState[String] = null
override def open(configuration: Configuration) = {
operatorState = getRuntimeContext.getState(
DedupDCNRecord.descriptor
)
}
#throws[Exception]
override def flatMap(value: (String,DCNRecord), out: Collector[DCNRecord]): Unit = {
if (operatorState.value == null) { // we haven't seen the element yet
out.collect(value._2)
// set operator state to true so that we don't emit elements with this key again
operatorState.update(value._1)
}
}
}
While this approach works fine as long as streaming job is running and maintaining list of unique keys through valueState and performing de-duplication.
But as soon as I cancel the job, flink looses it's state(unique keys seen in previous run of the job) for valueState(only keeps unique keys for the current run) and let the records pass, which were already processed in previous run of the job.
Is there a way, we can enforce flink to mainatain it's valueState(unique_keys) seen so far ?
Appreciate your help.
This requires you capture a snapshot of the state before shutting down the job, and then restart from that snapshot:
Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
Relaunch, using the savepoint as the starting point.
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground. The section on Observing Failure & Recovery is also relevant here.

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as event source, a map function and a sink as image shown below:
I want to enrich the incoming click stream data with user's IP location based on userIp field in raw event ingested from Kafka.
a simplified slice of the CSV file as shown below
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have made some researches and found a couple of potential solutions:
1. Solution: Broadcast the enrichment data and connect with event stream with some IP matching logic.
1. Result: It worked well for a couple sample IP location data but not with whole CSV data. JVM heap has reached to 3.5 GB and due to the broadcast state, there is no way to put the broadcast state into disk (for RocksDb)
2. Solution: Load CSV data in open() method in RichFlatMapFunction into the state(ValueState) before start of the event processing and enrich event data in flatMap method.
2. Result: Due to the enrichment data is so big to store in JVM heap, it's impossible to load into ValueState. And also de/serializing through ValueState is bad practice for data in key-value nature.
3. Solution: To avoid to deal with JVM heap constraint, I have tried to put the enrichment data into RocksDB(uses disk) as state with MapState.
3. Result: Trying to load the CSV file into MapState in open() method, gave me error that tells me you cannot put into MapState in open() method because I was not in keyed context in open() method like this question: Flink keyed stream key is null
4. Solution: Because of need of the keyed context for MapState(to put RocksDB), I tried to load whole CSV file into local RocksDB instance(disk) in the process function after making the DataStream into KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {
var ipMapState: MapState[String, String] = _
var csvFinishedFlag: ValueState[Boolean] = _
override def processElement(event: Event,
ctx: KeyedProcessFunction[Long, Event, Event]#Context,
out: Collector[Event]): Unit = {
val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])
ipMapState = getRuntimeContext.getMapState(ipDescriptor)
csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)
if (!csvFinishedFlag.value()) {
val csv = new CSVParser(defaultCSVFormat)
val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
for (row <- fileSource.getLines()) {
val Some(List(start, end, country)) = csv.parseLine(row)
ipMapState.put(start, country)
}
fileSource.close()
csvFinishedFlag.update(true)
}
out.collect {
if (ipMapState.contains(event.userIp)) {
val details = ipMapState.get(event.userIp)
event.copy(data =
event.data.copy(
ipLocation = Some(details.country)
))
} else {
event
}
}
}
}
4. Result: It's too hacky and prevents event processing due to blocking file read operation.
Could you tell me what can I do for this situation?
Thanks
What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
.partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
.flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
#Override
public int partition(final Long sensorMeasurement, final int numPartitions) {
return Math.toIntExact(sensorMeasurement % numPartitions);
}
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {
private Map<Long, SensorReferenceData> referenceData;
#Override
public void open(final Configuration parameters) throws Exception {
super.open(parameters);
referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
}
#Override
public void flatMap(
final SensorMeasurement sensorMeasurement,
final Collector<EnrichedMeasurements> collector) throws Exception {
SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
}
private Map<Long, SensorReferenceData> loadReferenceData(
final int partition,
final int numPartitions) {
SensorReferenceDataClient client = new SensorReferenceDataClient();
return client.getSensorReferenceDataForPartition(partition, numPartitions);
}
}
Note that the enrichment is not being done on a keyed stream, so you can not use keyed state or timers in the enrichment function.

How to use a persisted StateStore between two Kafka Streams

I'm having some troubles trying to achieve the following via Kafka Streams:
At the startup of the app, the (compacted) topic alpha gets loaded into a Key-Value StateStore map
A Kafka Stream consumes from another topic, uses (.get) the map above and finally produces a new record into topic alpha
The result is that the in-memory map should aligned with the underlying topic, even if the streamer gets restarted.
My approach is the following:
val builder = new StreamsBuilderS()
val store = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("store"), kSerde, vSerde)
)
builder.addStateStore(store)
val loaderStreamer = new LoaderStreamer(store).startStream()
[...] // I wait a few seconds until the loading is complete and the stream os running
val map = instance.store("store", QueryableStoreTypes.keyValueStore[K, V]()) // !!!!!!!! ERROR HERE !!!!!!!!
builder
.stream("another-topic")(Consumed.`with`(kSerde, vSerde))
.doMyAggregationsAndgetFromTheMapAbove
.transform(() => new StoreTransformer[K, V]("store"), "store")
.to("alpha")(Produced.`with`(kSerde, vSerde))
LoaderStreamer(store):
[...]
val builders = new StreamsBuilderS()
builder.addStateStore(store)
builder
.table("alpha")(Consumed.`with`(kSerde, vSerde))
builder.build
[...]
StoreTransformer:
[...]
override def init(context: ProcessorContext): Unit = {
this.context = context
this.store =
context.getStateStore(store).asInstanceOf[KeyValueStore[K, V]]
}
override def transform(key: K, value: V): (K, V) = {
store.put(key, value)
(key, value)
}
[...]
...but what I get is:
Caused by: org.apache.kafka.streams.errors.InvalidStateStoreException:
The state store, store, may have migrated to another instance.
while trying to get the store handler.
Any idea on how to achieve this?
Thank you!
You can't share state store between two Kafka Streams applications.
According to documentation: https://docs.confluent.io/current/streams/faq.html#interactive-queries there might be two reason of above exception:
The local KafkaStreams instance is not yet ready and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready, but the particular state store was just migrated to another instance behind the scenes.
The easiest way to deal with it is to wait till state store will be queryable:
public static <T> T waitUntilStoreIsQueryable(final String storeName,
final QueryableStoreType<T> queryableStoreType,
final KafkaStreams streams) throws InterruptedException {
while (true) {
try {
return streams.store(storeName, queryableStoreType);
} catch (InvalidStateStoreException ignored) {
// store not yet ready for querying
Thread.sleep(100);
}
}
}
Whole example can be found at confluent github.

How to read kafka statestore in processor which was updated from a different processor?

I have created 2 kafka statestores corresponding to 2 different topics as below.
StateStoreSupplier tempStore1 = Stores.create("tempStore1").withKeys(Serdes.String()).withValues(valueSerde).persistent().build();
StateStoreSupplier tempStore2 = Stores.create("tempStore2").withKeys(Serdes.String()).withValues(valueSerde).persistent().build();
streamsBuilder.addSource("Source", "tempTopic1", "tempTopic2")
.addProcessor("Process", () -> new Processor(), "Source")
.connectProcessorAndStateStores("Process", "tempStore1", "tempStore2")
.addStateStore(tempStore1, "Process")
.addStateStore(tempStore2, "Process");
In the Processor class, I'm able to read and add records to the StatesStores when there's a message in respective topics. But I'm unable to read the store tempStore1 when there's a message coming from tempTopic2 and vice verca.
I have to compare the messages from one statestore to the other, which means I need to read both the state stores.
Here's the sample code snipper from process method. I believe the ProcessorContext(context variable) is different for respective topics and hence the other store is not being accessible.
tempKeyValueStore1 = (KeyValueStore) context.getStateStore("tempStore1");
tempKeyValueStore2 = (KeyValueStore) context.getStateStore("tempStore2");
if(context.topic().equals("tempTopic1"))
{
tempKeyValueStore1.put(value.getHeader().getCorrelationId(), value);
}else if(context.topic().equals("tempTopic2"))
{
tempKeyValueStore2.put(value.getHeader().getCorrelationId(), value);
System.out.println("Size: "+tempKeyValueStore1.approximateNumEntries()); // returning as 0 although there records in that statestore
}
Thanks in advance.

How to check if Kafka Consumer is ready

I have Kafka commit policy set to latest and missing first few messages. If I give a sleep of 20 seconds before starting to send the messages to the input topic, everything is working as desired. I am not sure if the problem is with consumer taking long time for partition rebalancing. Is there a way to know if the consumer is ready before starting to poll ?
You can use consumer.assignment(), it will return set of partitions and verify whether all of the partitions are assigned which are available for that topic.
If you are using spring-kafka project, you can include spring-kafka-test dependancy and use below method to wait for topic assignment , but you need to have container.
ContainerTestUtils.waitForAssignment(Object container, int partitions);
You can do the following:
I have a test that reads data from kafka topic.
So you can't use KafkaConsumer in multithread environment, but you can pass parameter "AtomicReference assignment", update it in consumer-thread, and read it in another thread.
For example, snipped of working code in project for testing:
private void readAvro(String readFromKafka,
AtomicBoolean needStop,
List<Event> events,
String bootstrapServers,
int readTimeout) {
// print the topic name
AtomicReference<Set<TopicPartition>> assignment = new AtomicReference<>();
new Thread(() -> readAvro(bootstrapServers, readFromKafka, needStop, events, readTimeout, assignment)).start();
long startTime = System.currentTimeMillis();
long maxWaitingTime = 30_000;
for (long time = System.currentTimeMillis(); System.currentTimeMillis() - time < maxWaitingTime;) {
Set<TopicPartition> assignments = Optional.ofNullable(assignment.get()).orElse(new HashSet<>());
System.out.println("[!kafka-consumer!] Assignments [" + assignments.size() + "]: "
+ assignments.stream().map(v -> String.valueOf(v.partition())).collect(Collectors.joining(",")));
if (assignments.size() > 0) {
break;
}
try {
Thread.sleep(1_000);
} catch (InterruptedException e) {
e.printStackTrace();
needStop.set(true);
break;
}
}
System.out.println("Subscribed! Wait summary: " + (System.currentTimeMillis() - startTime));
}
private void readAvro(String bootstrapServers,
String readFromKafka,
AtomicBoolean needStop,
List<Event> events,
int readTimeout,
AtomicReference<Set<TopicPartition>> assignment) {
KafkaConsumer<String, byte[]> consumer = (KafkaConsumer<String, byte[]>) queueKafkaConsumer(bootstrapServers, "latest");
System.out.println("Subscribed to topic: " + readFromKafka);
consumer.subscribe(Collections.singletonList(readFromKafka));
long started = System.currentTimeMillis();
while (!needStop.get()) {
assignment.set(consumer.assignment());
ConsumerRecords<String, byte[]> records = consumer.poll(1_000);
events.addAll(CommonUtils4Tst.readEvents(records));
if (readTimeout == -1) {
if (events.size() > 0) {
break;
}
} else if (System.currentTimeMillis() - started > readTimeout) {
break;
}
}
needStop.set(true);
synchronized (MainTest.class) {
MainTest.class.notifyAll();
}
consumer.close();
}
P.S.
needStop - global flag, to stop all running thread if any in case of failure of success
events - list of object, that i want to check
readTimeout - how much time we will wait until read all data, if readTimeout == -1, then stop when we read anything
Thanks to Alexey (I have also voted up), I seemed to have resolved my issue essentially following the same idea.
Just want to share my experience... in our case we using Kafka in request & response way, somewhat like RPC. Request is being sent on one topic and then waiting for response on another topic. Running into a similar issue i.e. missing out first response.
I have tried ... KafkaConsumer.assignment(); repeatedly (with Thread.sleep(100);) but doesn't seem to help. Adding a KafkaConsumer.poll(50); seems to have primed the consumer (group) and receiving the first response too. Tested few times and it consistently working now.
BTW, testing requires stopping application & deleting Kafka topics and, for a good measure, restarted Kafka too.
PS: Just calling poll(50); without assignment(); fetching logic, like Alexey mentioned, may not guarantee that consumer (group) is ready.
You can modify an AlwaysSeekToEndListener (listens only to new messages) to include a callback:
public class AlwaysSeekToEndListener<K, V> implements ConsumerRebalanceListener {
private final Consumer<K, V> consumer;
private Runnable callback;
public AlwaysSeekToEndListener(Consumer<K, V> consumer) {
this.consumer = consumer;
}
public AlwaysSeekToEndListener(Consumer<K, V> consumer, Runnable callback) {
this.consumer = consumer;
this.callback = callback;
}
#Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
}
#Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
consumer.seekToEnd(partitions);
if (callback != null) {
callback.run();
}
}
}
and subscribe with a latch callback:
CountDownLatch initLatch = new CountDownLatch(1);
consumer.subscribe(singletonList(topic), new AlwaysSeekToEndListener<>(consumer, () -> initLatch.countDown()));
initLatch.await(); // blocks until consumer is ready and listening
then proceed to start your producer.
If your policy is set to latest - which takes effect if there are no previously committed offsets - but you have no previously committed offsets, then you should not worry about 'missing' messages, because you're telling Kafka not to care about messages that were sent 'previously' to your consumers being ready.
If you care about 'previous' messages, you should set the policy to earliest.
In any case, whatever the policy, the behaviour you are seeing is transient, i.e. once committed offsets are saved in Kafka, on every restart the consumers will pick up where they left previoulsy
I needed to know if a kafka consumer was ready before doing some testing, so i tried with consumer.assignment(), but it only returned the set of partitions assigned, but there was a problem, with this i cannot see if this partitions assigned to the group had offset setted, so later when i tried to use the consumer it didnĀ“t have offset setted correctly.
The solutions was to use committed(), this will give you the last commited offsets of the given partitions that you put in the arguments.
So you can do something like: consumer.committed(consumer.assignment())
If there is no partitions assigned yet it will return:
{}
If there is partitions assigned, but no offset yet:
{name.of.topic-0=null, name.of.topic-1=null}
But if there is partitions and offset:
{name.of.topic-0=OffsetAndMetadata{offset=5197881, leaderEpoch=null, metadata=''}, name.of.topic-1=OffsetAndMetadata{offset=5198832, leaderEpoch=null, metadata=''}}
With this information you can use something like:
consumer.committed(consumer.assignment()).isEmpty();
consumer.committed(consumer.assignment()).containsValue(null);
And with this information you can be sure that the kafka consumer is ready.