How to wait for a finite stream bulk result - apache-kafka

I have a stream processing application built with Spring Cloud Stream and Kafka Streams.
The system takes logs from an application, compares them to observations made by another stream processor,
and produces a score. The log stream is then split by the score (above and below some threshold).
The topology:
The issue:
My problem is how to properly implement the "Log best observation selector processor".
There is a finite number of observations at the moment the log is processed, but there may be a lot of them.
So I came up with 2 solutions:
1. Group & window the log-scored-observations topic by log id and then reduce to get the highest score. (Problem: scoring all observations may take longer than the window.)
2. Emit a scoring-completed message after every scoring, join it with log-relevant-observations, and use a log-scored-observations global table with an interactive query to check that every observation id is in the global table store. When all ids are in the store, map to the observation with the highest score. (Problem: the global table does not appear to work when it is only used for interactive queries.)
What would be the best way to achieve what I'm trying?
I'm hoping not to create any partition, disk or memory bottleneck.
Everything has unique ids and tuples of relevant ids when the value is joined from log & observation.
(Edit: Switched text description of topology with a diagram & change title)

Solution #2 seems to work, but it emitted warnings because interactive queries take some time to be ready, so I implemented the same solution with a Transformer:
@Slf4j
@Configuration
@RequiredArgsConstructor
@SuppressWarnings("unchecked")
public class LogBestObservationsSelectorProcessorConfig {
private String logScoredObservationsStore = "log-scored-observations-store";
private final Serde<LogEntryRelevantObservationIdTuple> logEntryRelevantObservationIdTupleSerde;
private final Serde<LogRelevantObservationIdsTuple> logRelevantObservationIdsTupleSerde;
private final Serde<LogEntryObservationMatchTuple> logEntryObservationMatchTupleSerde;
private final Serde<LogEntryObservationMatchIdsRelevantObservationsTuple> logEntryObservationMatchIdsRelevantObservationsTupleSerde;
@Bean
public Function<
GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>,
Function<
KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple>,
Function<
KTable<String, LogRelevantObservationIdsTuple>,
KStream<String, LogEntryObservationMatchTuple>
>
>
>
logBestObservationSelectorProcessor() {
return (GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> logScoredObservationsTable) ->
(KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple> logScoredObservationProcessedStream) ->
(KTable<String, LogRelevantObservationIdsTuple> logRelevantObservationIdsTable) -> {
return logScoredObservationProcessedStream
.selectKey((k, v) -> k.getLogId())
.leftJoin(
logRelevantObservationIdsTable,
LogEntryObservationMatchIdsRelevantObservationsTuple::new,
Joined.with(
Serdes.String(),
logEntryRelevantObservationIdTupleSerde,
logRelevantObservationIdsTupleSerde
)
)
.transform(() -> new LogEntryObservationMatchTransformer(logScoredObservationsStore))
.groupByKey(
Grouped.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.reduce(
(match1, match2) -> Double.compare(match1.getScore(), match2.getScore()) >= 0 ? match1 : match2,
Materialized.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.toStream()
;
};
}
@RequiredArgsConstructor
private static class LogEntryObservationMatchTransformer implements Transformer<String, LogEntryObservationMatchIdsRelevantObservationsTuple, KeyValue<String, LogEntryObservationMatchTuple>> {
private final String stateStoreName;
private ProcessorContext context;
private TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> kvStore;
@Override
public void init(ProcessorContext context) {
this.context = context;
this.kvStore = (TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>) context.getStateStore(stateStoreName);
}
@Override
public KeyValue<String, LogEntryObservationMatchTuple> transform(String logId, LogEntryObservationMatchIdsRelevantObservationsTuple value) {
val observationIds = value.getLogEntryRelevantObservationsTuple().getRelevantObservations().getObservationIds();
val allObservationsProcessed = observationIds.stream()
.allMatch((observationId) -> {
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
return kvStore.get(key) != null;
});
if (!allObservationsProcessed) {
return null;
}
val observationId = value.getLogEntryRelevantObservationIdTuple().getObservationId();
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
ValueAndTimestamp<LogEntryObservationMatchTuple> observationMatchValueAndTimestamp = kvStore.get(key);
return new KeyValue<>(logId, observationMatchValueAndTimestamp.value());
}
@Override
public void close() {
}
}
}
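The completeness check behind solution #2 (wait until every relevant observation id has a score, then pick the highest) can be modeled in plain Java. This is only a sketch for a single log id, not the Kafka Streams implementation: the class `BestObservationSketch`, its methods, and the in-memory map standing in for the state store are all hypothetical names introduced for illustration.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class BestObservationSketch {
    // Hypothetical stand-in for the scored-observations store: observationId -> score.
    private final Map<String, Double> scores = new HashMap<>();

    void recordScore(String observationId, double score) {
        scores.put(observationId, score);
    }

    // Returns the best observation id only once every relevant id has been scored.
    Optional<String> selectBest(Collection<String> relevantIds) {
        if (relevantIds.isEmpty() || !scores.keySet().containsAll(relevantIds)) {
            return Optional.empty(); // still waiting for at least one score
        }
        String best = null;
        for (String id : relevantIds) {
            if (best == null || scores.get(id) > scores.get(best)) {
                best = id;
            }
        }
        return Optional.of(best);
    }

    public static void main(String[] args) {
        BestObservationSketch sketch = new BestObservationSketch();
        List<String> relevant = List.of("obs-1", "obs-2", "obs-3");
        sketch.recordScore("obs-1", 0.4);
        sketch.recordScore("obs-2", 0.9);
        System.out.println(sketch.selectBest(relevant)); // empty: obs-3 is not scored yet
        sketch.recordScore("obs-3", 0.7);
        System.out.println(sketch.selectBest(relevant)); // now complete: obs-2 wins
    }
}
```

In a real topology the map would be a state store keyed by (log id, observation id), and the check would run each time a scoring-completed message arrives.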

Related

Joining streams Flink doesn't work with Kafka consumer

I'm trying to join two streams, one from the data collection, one consumes from Kafka.
code snippet
public static void main(String[] args) {
KafkaSource<JsonNode> kafkaSource = ...
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Kafka messages : {"name": "John"}
final DataStream<JsonNode> dataStream1 = env.fromSource(kafkaSource, waterMark(), "Kafka").rebalance()
.assignTimestampsAndWatermarks(waterMark());
final DataStream<String> dataStream2 = env.fromElements("John", "Zbe", "Abe")
.assignTimestampsAndWatermarks(waterMark());
dataStream1
.join(dataStream2)
.where(new KeySelector<JsonNode, String>() {
@Override
public String getKey(JsonNode value) throws Exception {
return value.get("name").asText();
}
})
.equalTo(new KeySelector<String, String>() {
@Override
public String getKey(String value) throws Exception {
return value;
}
})
.window(SlidingEventTimeWindows.of(Time.minutes(50) /* size */, Time.minutes(10) /* slide */))
.apply(new JoinFunction<JsonNode, String, String>() {
@Override
public String join(JsonNode first, String second) throws Exception {
return first+" "+second;
}
}).print();
env.execute();
}
watermark
private static <T> WatermarkStrategy<T> waterMark() {
return new WatermarkStrategy<T>() {
@Override
public WatermarkGenerator<T> createWatermarkGenerator(
org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier.Context context) {
return new AscendingTimestampsWatermarks<>();
}
@Override
public TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return (event, timestamp) -> System.currentTimeMillis();
}
};
}
After running snippet code, it doesn't have any merged data in the output. Am I going wrong somewhere?
Apache flink version: 1.13.2
The problem is probably related to watermarking. Since you're not using event-time-based timestamps, try changing SlidingEventTimeWindows to SlidingProcessingTimeWindows and see if it then produces results.
The underlying problem is probably a lack of data. The rebalance() on the Kafka stream guarantees that idle partitions won't stall the watermarks unless all partitions are idle. But if this is an unbounded streaming job, unless you have some data that falls after the first window, the watermark won't advance far enough to trigger the first window.
Options:
1. Send some data with larger timestamps.
2. Configure the Kafka source as a bounded stream by using the .setBounded(...) option on the KafkaSource builder.
3. Stop the job using the --drain option.
The fact that dataStream2 is bounded is also a problem, but I'm not sure how much of one. At best this will prevent any windows after the first one from producing any results (since DataStream joins are inner joins).
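To make the "watermark won't advance far enough" point concrete, here is a small plain-Java model of sliding event-time windows. It is a sketch, not Flink's own classes: `SlidingWindowSketch`, `windowStarts`, and `fires` are hypothetical names, and the model assumes non-negative timestamps, no window offset, and a window that fires once the watermark reaches its last timestamp (end minus one millisecond).

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {
    static final long MINUTE = 60_000L;

    // Start timestamps of every sliding window [start, start + size) that contains the timestamp.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    // A window fires once the watermark has reached the window's last timestamp.
    static boolean fires(long start, long size, long watermark) {
        return watermark >= start + size - 1;
    }

    public static void main(String[] args) {
        long size = 50 * MINUTE;
        long slide = 10 * MINUTE;
        long eventTime = 25 * MINUTE;
        long watermark = eventTime; // no later data has arrived, so the watermark is stuck here
        for (long start : windowStarts(eventTime, size, slide)) {
            System.out.println("window start=" + start / MINUTE + "min fires="
                    + fires(start, size, watermark));
        }
    }
}
```

With a 50-minute window and a 10-minute slide, an element belongs to five windows, and none of them fires while the watermark stays at the element's own timestamp. Only later data (or the end of a bounded input) moves the watermark past a window end.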

Kafka Ktable also streaming duplicate updates

I want to process the changelog stream of a KTable (created with KStream.reduce()), i.e. any change in the value of the keys in the KTable. But it seems that even when the same key-value pair is sent multiple times to the KTable, it is sent downstream every time. I need to send an update for a key only if its value changes.
groupByKey(Grouped.with(new Serdes.LongSerde(), new Serdes.LongSerde()))
    .reduce(new Reducer<Long>() {
        @Override
        public Long apply(Long t1, Long t2) {
            return t2;
        }
    })
    .toStream()
    .foreach((key, value) -> {
        // for each update in ID, send update to the stream
        sendUpdate(key);
    });
This is the default behavior of KTable#toStream(): it converts the changelog into a KStream, so the operator downstream of reduce receives an update each time the upstream reduce operator receives a message.
You can achieve your desired behavior using the Processor API; in this case we use KStream#transformValues().
First register a KeyValueStore to store your latest value:
//you don't need to add number_store, if your KTable already materialized to number_store
streamsBuilder
.addStateStore(Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("number_store"), Serdes.Long(), Serdes.Long()));
numberKStream
.transformValues(ExtractIfValueChangedTransformer::new, "number_store")
.filter((key, value) -> value != null)
.foreach((key, value) -> sendUpdate(key));
Then we create an ExtractIfValueChangedTransformer, which returns the value of the new message only if the value has changed, and null otherwise:
public class ExtractIfValueChangedTransformer implements ValueTransformerWithKey<Long, Long, Long> {
KeyValueStore<Long, Long> kvStore;
@Override
public void init(ProcessorContext context) {
kvStore = (KeyValueStore<Long, Long>) context.getStateStore("number_store");
}
@Override
public Long transform(Long key, Long newValue) {
Long oldValue = kvStore.get(key);
kvStore.put(key, newValue);
if (oldValue == null) return newValue;
return oldValue.equals(newValue) ? null : newValue;
}
@Override
public void close() {}
}
Kafka Streams provides 2 semantics: emit-on-update and emit-on-window-close.
KIP-557 is about adding an emit-on-change semantic based on byte array comparison of data. It has been implemented in Kafka Streams 2.6 and then removed due to "potential data loss".
Nevertheless, I have developed an implementation of the emit-on-change semantic using the Kafka Streams DSL.
The idea is to convert a KStream with emit-on-update semantics to a KStream with emit-on-change semantics. You can use this implementation on the source KStream that you provide to create the KTable, or on the KTable after applying .toStream().
This implementation implicitly creates a state store, where the value contains the KStream data and a flag that indicates whether an update should be emitted. This flag is set in the aggregate operation and is based on Object#equals for comparison, but you could change the implementation to use a Comparator.
Here is the withEmitOnChange method that changes the semantics of a KStream. You might have to specify a serde for the EmitOnChangeState data structure (see below).
public static <K, V> KStream<K, V> withEmitOnChange(KStream<K, V> streams) {
return streams
.groupByKey()
.aggregate(
() -> (EmitOnChangeState<V>) null,
(k, data, state) -> {
if (state == null) {
return new EmitOnChangeState<>(data, true);
} else {
return state.merge(data);
}
}
)
.toStream()
.filter((k, state) -> state.shouldEmit)
.mapValues(state -> (V) state.data);
}
Here is the data structure that is stored in state store and used to check if an update should be emitted.
public static class EmitOnChangeState<T> {
public final T data;
public final boolean shouldEmit;
public EmitOnChangeState(T data, boolean shouldEmit) {
this.data = data;
this.shouldEmit = shouldEmit;
}
public EmitOnChangeState<T> merge(T newData) {
return new EmitOnChangeState<>(newData, !Objects.equals(data, newData));
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
EmitOnChangeState<?> that = (EmitOnChangeState<?>) o;
return shouldEmit == that.shouldEmit && Objects.equals(data, that.data);
}
@Override
public int hashCode() {
return Objects.hash(data, shouldEmit);
}
}
Usage:
KStream<ProductKey, Product> products = builder.stream("product-topic");
withEmitOnChange(products)
.to("out-product-topic"); // output topic with emit-on-change semantic
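The intended emit-on-change rule (emit the first value for a key, then emit only when the value differs from the previous one) can be checked in isolation with a plain-Java model. This is a sketch with a hypothetical class `EmitOnChangeSketch` and a `HashMap` standing in for the state store; it does not use the Kafka Streams API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class EmitOnChangeSketch<K, V> {
    // Hypothetical stand-in for the state store: last value seen per key.
    private final Map<K, V> lastSeen = new HashMap<>();

    // True when the value is the first one for its key or differs from the previous one.
    boolean shouldEmit(K key, V value) {
        boolean first = !lastSeen.containsKey(key);
        V previous = lastSeen.put(key, value);
        return first || !Objects.equals(previous, value);
    }

    public static void main(String[] args) {
        EmitOnChangeSketch<String, Integer> sketch = new EmitOnChangeSketch<>();
        System.out.println(sketch.shouldEmit("k", 1)); // first value for the key
        System.out.println(sketch.shouldEmit("k", 1)); // unchanged value
        System.out.println(sketch.shouldEmit("k", 2)); // changed value
    }
}
```

The comparison relies on `equals`, so value types need a meaningful `equals` implementation for the rule to behave as expected.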

Kafka compare consecutive values for a key

We are building an application to get data from sensors. The data is streamed to Kafka from where consumers will publish it to different data stores. Each data point will have multiple attributes representing the state of the sensor.
In one of the consumers we want to publish the data to the data store only if the value has changed. For example, if there is a temperature sensor that is polled for data every 10 seconds, we expect to receive data like:
----------------------------------------------------------------------
Key        Value
----------------------------------------------------------------------
Sensor1    {timestamp: "10-10-2019 10:20:30", temperature: 10}
Sensor1    {timestamp: "10-10-2019 10:20:40", temperature: 10}
Sensor1    {timestamp: "10-10-2019 10:20:50", temperature: 11}
In the above case only the first record and the third record should be published.
For this we need some way to compare the current value for a key with the previous value for the same key. I believe this should be possible with KTable or KStream, but I have been unable to find examples.
Any help will be great!
Here is an example how to solve this with KStream#transformValues().
StreamsBuilder builder = new StreamsBuilder();
StoreBuilder<KeyValueStore<String, YourValueType>> keyValueStoreBuilder =
    Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(stateStoreName),
        Serdes.String(),
        YourValueTypeSerde());
builder.addStateStore(keyValueStoreBuilder);
builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), YourValueTypeSerde()))
    .transformValues(() -> new ValueTransformerWithKey<String, YourValueType, YourValueType>() {
        private KeyValueStore<String, YourValueType> state;
        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            state = (KeyValueStore<String, YourValueType>) context.getStateStore(stateStoreName);
        }
        @Override
        public YourValueType transform(final String key, final YourValueType value) {
            YourValueType prevValue = state.get(key);
            if (prevValue == null) {
                // first record for this key: remember it, emit nothing yet
                state.put(key, value);
                return null;
            }
            if (prevValue.temperature() != value.temperature()) {
                // temperature changed: store the current record and emit the stored one
                state.put(key, value);
                return prevValue;
            }
            return null; // unchanged temperature: discard the current record
        }
        @Override
        public void close() {}
    }, stateStoreName)
    .filter((key, value) -> value != null) // drop the nulls returned for discarded records
    .to(OUTPUT_TOPIC);
You compare the record with the previous record stored in the state store. If temperature is different you return the record from the state store and store the current record in the state store. If the temperature is equal you discard the current record.
You can use the Kafka Streams Processor API. You can set up a local key-value store as the state context. The process function is called for each record fetched.
In the process function you can check against the last value stored and accept or reject the latest record based on business logic (in your case, comparing the temperature value).
In the punctuate function you can then forward the record to the consumer on a schedule. See the sample code below (without punctuate):
public class SensorProcessor implements Processor<String, String> {
private ProcessorContext context;
private KeyValueStore<String, String> kvStore;
@Override
@SuppressWarnings("unchecked")
public void init(ProcessorContext context) {
// keep the processor context locally because we need it in process() for forward() and commit()
this.context = context;
// retrieve the key-value store named "SensorData"
kvStore = (KeyValueStore<String, String>) context.getStateStore("SensorData");
// a punctuate() schedule could be registered here; it is omitted in this sample
}
@Override
public void process(String sensorName, String sensorData) {
String oldValue = this.kvStore.get(sensorName);
if (oldValue == null) {
this.kvStore.put(sensorName, sensorData);
} else {
//Put the business logic for comparison
//compare temperatures
//if required put the value
this.kvStore.put(sensorName, sensorData);
//Forward it to the consumer
context.forward(sensorName, sensorData);
}
context.commit();
}
@Override
public void close() {
// nothing to do
}
}
If you want to do this with Kafka Streams, you have to use the Processor API.
You need to implement your own custom Transformer with a state store.
For each message, look up the value in the state store: if it has changed or is not present, return the new value, otherwise return null. In addition, save the new value in the state store (KeyValueStore::put(...)).
More about the Processor API can be found in the Kafka Streams documentation.
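For the sensor data in the question, comparing whole values would never detect a repeat, because the timestamp differs on every reading. A plain-Java sketch of the lookup-compare-store step that compares only the temperature might look like this. `SensorChangeFilterSketch`, `Reading`, and `accept` are hypothetical names, and a `HashMap` stands in for the state store.

```java
import java.util.HashMap;
import java.util.Map;

final class SensorChangeFilterSketch {
    // Hypothetical value type mirroring the records shown in the question.
    static final class Reading {
        final String timestamp;
        final int temperature;

        Reading(String timestamp, int temperature) {
            this.timestamp = timestamp;
            this.temperature = temperature;
        }
    }

    // Stand-in for the state store: last temperature seen per sensor.
    private final Map<String, Integer> lastTemperature = new HashMap<>();

    // True when the reading should be published: first reading or changed temperature.
    boolean accept(String sensorId, Reading reading) {
        Integer previous = lastTemperature.put(sensorId, reading.temperature);
        return previous == null || previous != reading.temperature;
    }

    public static void main(String[] args) {
        SensorChangeFilterSketch filter = new SensorChangeFilterSketch();
        System.out.println(filter.accept("Sensor1", new Reading("10-10-2019 10:20:30", 10)));
        System.out.println(filter.accept("Sensor1", new Reading("10-10-2019 10:20:40", 10)));
        System.out.println(filter.accept("Sensor1", new Reading("10-10-2019 10:20:50", 11)));
    }
}
```

In a Transformer, the same decision would return the value when `accept` is true and null otherwise, followed by a filter that drops nulls.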

Invoking Kafka Interactive Queries from inside a Stream

I have a particular requirement for invoking an Interactive Query from inside a stream. This is because I need to create a new stream that contains the data held in the state store. Truncated code below:
tempModifiedDataStream.to(topic.getTransformedTopic(), Produced.with(Serdes.String(), Serdes.String()));
GlobalKTable<String, String> myMetricsTable = builder.globalTable(
topic.getTransformedTopic(),
Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
topic.getTransformedStoreName() /* table/store name */)
.withKeySerde(Serdes.String()) /* key serde */
.withValueSerde(Serdes.String()) /* value serde */
);
KafkaStreams streams = new KafkaStreams(builder.build(), kStreamsConfigs());
KStream<String, String> tempAggrDataStream = tempModifiedDataStream
.flatMap((key, value) -> {
try {
List<KeyValue<String, String>> result = new ArrayList<>();
ReadOnlyKeyValueStore<String, String> keyValueStore =
streams .store(
topic.getTransformedStoreName(),
QueryableStoreTypes.keyValueStore());
In the last line, to access the state store I need the KafkaStreams object, and the topology is finalized when I create the KafkaStreams object. The problem with this approach is that 'tempAggrDataStream' is therefore not part of the topology, and that part of the code never gets executed. And I can't move the KafkaStreams definition further down, because then I can't call the Interactive Query.
I am a bit new to Kafka Streams; is this something silly on my side?
If you want to send the whole content of the topic after each data modification, I think you should rather use the Processor API.
You could create an org.apache.kafka.streams.kstream.Transformer with a state store.
For each processed message it will update the state store and send all of its content downstream.
This is not very efficient, because for each processed message it forwards the whole content of the topic/state store (which can be thousands or millions of records).
If you need only the latest value, it is enough to set your topic's cleanup.policy to compact. On the consuming side, use a KTable, which gives you the abstraction of a table (a snapshot of the stream).
Sample Transformer code for forwarding the whole content of the state store is as follows. All the work is done in the transform(String key, String value) method.
public class SampleTransformer
implements Transformer<String, String, KeyValue<String, String>> {
private String stateStoreName;
private KeyValueStore<String, String> stateStore;
private ProcessorContext context;
public SampleTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
@Override
@SuppressWarnings("unchecked")
public void init(ProcessorContext context) {
this.context = context;
stateStore = (KeyValueStore) context.getStateStore(stateStoreName);
}
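@Override
public KeyValue<String, String> transform(String key, String value) {
    stateStore.put(key, value);
    // close the iterator when done to release the underlying store resources
    try (KeyValueIterator<String, String> iterator = stateStore.all()) {
        iterator.forEachRemaining(keyValue -> context.forward(keyValue.key, keyValue.value));
    }
    return null;
}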
@Override
public void close() {
}
}
More information about the Processor API can be found:
https://docs.confluent.io/current/streams/developer-guide/processor-api.html
https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
How to combine Processor API with Stream DSL can be found:
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration

How to Iterate through list with RxJava and perform initial process on first item

I am new to RxJava and finding it very useful for network and database processing within my Android applications.
I have two use cases that I cannot implement completely in RxJava
Use Case 1
1. Clear down my target database table Table A
2. Fetch a list of database records from Table B that contain a key field
3. For each row retrieved from Table B, call a Remote API and persist all the returned data into Table A
The closest I have managed is this code
final AtomicInteger id = new AtomicInteger(0);
DatabaseController.deleteAll(TableA_DO.class);
DatabaseController.fetchTable_Bs()
.subscribeOn(Schedulers.io())
.toObservable()
.flatMapIterable(b -> b)
.flatMap(b_record -> NetworkController.getTable_A_data(b_record.getKey()))
.flatMap(network -> transformNetwork(id, network, NETWORK_B_MAPPER))
.doOnNext(DatabaseController::persistRealmObjects)
.doOnComplete(onComplete)
.doOnError(onError)
.doAfterTerminate(doAfterTerminate())
.doOnSubscribe(compositeDisposable::add)
.subscribe();
Use Case 2
1. Clear down my target database table Table X
2. Clear down my target database table Table Y
3. Fetch a list of database records from Table Z that contain a key field
4. For each row retrieved from Table Z, call a Remote API and persist some of the returned data into Table X; the remainder of the data should be persisted into Table Y
I have not managed to create any code for use case 2.
I have a number of questions regarding the use of RxJava for these use cases.
Is it possible to achieve both my use cases in RxJava?
Is it "Best Practice" to combine all these steps into an Rx "Stream"?
UPDATE
I ended up with this POC test code which seems to work...
I am not sure if it's the optimum solution. However, my API calls return Single and my database operations return Completable, so I feel like this is the best solution for me.
public class UseCaseOneA {
public static void main(final String[] args) {
login()
.andThen(UseCaseOneA.deleteDatabaseTableA())
.andThen(UseCaseOneA.deleteDatabaseTableB())
.andThen(manufactureRecords())
.flatMapIterable(x -> x)
.flatMapSingle(record -> NetworkController.callApi(record.getPrimaryKey()))
.flatMapSingle(z -> transform(z))
.flatMapCompletable(p -> UseCaseOneA.insertDatabaseTableA(p))
.doOnComplete(() -> System.out.println("ON COMPLETE"))
.doFinally(() -> System.out.println("ON FINALLY"))
.subscribe();
}
private static Single<List<PayloadDO>> transform(final List<RemotePayload> payloads) {
return Single.create(new SingleOnSubscribe<List<PayloadDO>>() {
@Override
public void subscribe(final SingleEmitter<List<PayloadDO>> emitter) throws Exception {
System.out.println("transform - " + payloads.size());
final List<PayloadDO> payloadDOs = new ArrayList<>();
for (final RemotePayload remotePayload : payloads) {
payloadDOs.add(new PayloadDO(remotePayload.getPayload()));
}
emitter.onSuccess(payloadDOs);
}
});
}
private static Observable<List<Record>> manufactureRecords() {
final List<Record> records = new ArrayList<>();
records.add(new Record("111-111-111"));
records.add(new Record("222-222-222"));
records.add(new Record("3333-3333-3333"));
records.add(new Record("44-444-44444-44-4"));
records.add(new Record("5555-55-55-5-55-5555-5555"));
return Observable.just(records);
}
private static Completable deleteDatabaseTableA() {
return Completable.create(new CompletableOnSubscribe() {
@Override
public void subscribe(final CompletableEmitter emitter) throws Exception {
System.out.println("deleteDatabaseTableA");
emitter.onComplete();
}
});
}
private static Completable deleteDatabaseTableB() {
return Completable.create(new CompletableOnSubscribe() {
#Override
public void subscribe(final CompletableEmitter emitter) throws Exception {
System.out.println("deleteDatabaseTableB");
emitter.onComplete();
}
});
}
private static Completable insertDatabaseTableA(final List<PayloadDO> payloadDOs) {
return Completable.create(new CompletableOnSubscribe() {
@Override
public void subscribe(final CompletableEmitter emitter) throws Exception {
System.out.println("insertDatabaseTableA - " + payloadDOs);
emitter.onComplete();
}
});
}
private static Completable login() {
return Completable.complete();
}
}
This code doesn't address all my use case requirements, namely being able to transform the remote payload records into multiple database record types and insert each type into its own specific target database table.
I could just call the Remote API twice to get the same remote data items, transforming them first into one database type and then into the second database type, but that seems wasteful.
Is there an operator in RxJava with which I can reuse the output from my API calls and transform it into another database type?
You have to index the items yourself in some manner, for example, via external counting:
Observable.defer(() -> {
AtomicInteger counter = new AtomicInteger();
return DatabaseController.fetchTable_Bs()
.subscribeOn(Schedulers.io())
.toObservable()
.flatMapIterable(b -> b)
.doOnNext(item -> {
if (counter.getAndIncrement() == 0) {
// this is the very first item
} else {
// these are the subsequent items
}
});
});
The defer is necessary to isolate the counter to the inner sequence so that repetition still works if necessary.
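Why the counter has to be created per subscription can be shown without RxJava at all. The sketch below is a plain-Java model with hypothetical names (`DeferCounterSketch`, `run`, `runDeferred`): each call to `run` plays the role of one subscription, and `runDeferred` plays the role of `defer` by creating a fresh counter for every run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class DeferCounterSketch {
    // One "subscription": walks the items and labels the first one using the given counter.
    static List<String> run(List<String> items, AtomicInteger counter) {
        List<String> out = new ArrayList<>();
        for (String item : items) {
            out.add((counter.getAndIncrement() == 0 ? "first:" : "next:") + item);
        }
        return out;
    }

    // Models defer: a new counter is created for every run, so runs do not affect each other.
    static List<String> runDeferred(List<String> items) {
        return run(items, new AtomicInteger());
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b");

        AtomicInteger shared = new AtomicInteger();
        System.out.println(run(items, shared)); // first run labels "a" as first
        System.out.println(run(items, shared)); // second run no longer sees a first item

        System.out.println(runDeferred(items)); // fresh counter
        System.out.println(runDeferred(items)); // fresh counter again, "a" is first again
    }
}
```

With a shared counter, a retry or a second subscription would skip the first-item branch; a counter created inside the deferred factory avoids that.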