Why does this example not cause Dirty Reads? - apache-kafka

I'm playing with Kafka Streams a bit, and while investigating WordCountProcessorDemo I realized there must be a part of the picture I'm missing. Namely, how does the library guarantee that no dirty read can happen in the code below?
@Override
public void process(final String dummy, final String line) {
    final String[] words = line.toLowerCase(Locale.getDefault()).split(" ");
    for (final String word : words) {
        final Integer oldValue = this.kvStore.get(word);
        if (oldValue == null) {
            this.kvStore.put(word, 1);
        } else {
            this.kvStore.put(word, oldValue + 1);
        }
    }
    context.commit();
}
As far as I understand the matter, after calling kvStore.get(..) the state might get changed by another StreamProcessor instance living on another machine and consuming a different partition. Therefore, since we performed a dirty read, the state would become inconsistent.
Does Kafka Streams somehow deal with such a situation?

the state might get changed by another StreamProcessor instance
Not really. The state is sharded by input partition, and thus each Processor instance has its own exclusive share of the overall state; no two instances ever touch the same shard.
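For intuition, here is a minimal sketch of why the shards can't overlap, assuming a hash-based partitioner (Kafka's DefaultPartitioner actually hashes the serialized key with murmur2; Math.floorMod over hashCode is just a stand-in here):

static int partitionFor(String word, int numPartitions) {
    // The same key always maps to the same partition, hence to the same task.
    return Math.floorMod(word.hashCode(), numPartitions);
}

Since every occurrence of, say, "hello" is routed to one partition, exactly one processor instance ever executes the get/put pair for that word; no other instance can interleave between the get and the put.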

Related

Kafka streams event deduplication keeping last event in window

I'm using Kafka Streams on an event-deduplication problem over short time windows (<= 1 minute).
First I tried to tackle the problem with the DSL API and the .suppress(Suppressed.untilWindowCloses(...)) operator but, given that wall-clock time is not yet supported (I've seen KIP-424), this operator is not viable for my use case.
Then I followed this official Confluent example, in which the low-level Processor API is used. It was working fine, but it has one major limitation for my use case: the single event (obtained by deduplication) is emitted at the beginning of the time window, and subsequent duplicated events are "suppressed". In my use case I need the reverse of that, meaning that a single event should be emitted at the end of the window.
I'm asking for suggestions on how to implement this use case with Processor API.
My idea was to use the Processor API with a custom Transformer and a Punctuator.
The transformer would store the distinct keys received in a WindowStore without returning any KeyValue. Simultaneously, I'd schedule a punctuator running at an interval equal to the window size of the WindowStore. This punctuator will iterate over the elements in the store and forward them downstream.
The following are some core parts of the logic:
DeduplicationTransformer (slightly modified from official Confluent example):
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    eventIdStore = (WindowStore<E, V>) context.getStateStore(this.storeName);

    // Schedule punctuator for this transformer.
    context.schedule(Duration.ofMillis(this.windowSizeMs), PunctuationType.WALL_CLOCK_TIME,
        new DeduplicationPunctuator<E, V>(eventIdStore, context, this.windowSizeMs));
}
@Override
public KeyValue<K, V> transform(final K key, final V value) {
    final E eventId = idExtractor.apply(key, value);
    if (eventId == null) {
        return KeyValue.pair(key, value);
    } else {
        if (!isDuplicate(eventId)) {
            rememberNewEvent(eventId, value, context.timestamp());
        }
        return null;
    }
}
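For context, the isDuplicate and rememberNewEvent helpers referenced above come from the Confluent example; roughly, they look like the following sketch (the exact field names are assumptions here):

private boolean isDuplicate(final E eventId) {
    final long eventTime = context.timestamp();
    // Look for the event id in any window overlapping the deduplication interval.
    try (final WindowStoreIterator<V> timeIterator = eventIdStore.fetch(
            eventId, eventTime - windowSizeMs, eventTime + windowSizeMs)) {
        return timeIterator.hasNext();
    }
}

private void rememberNewEvent(final E eventId, final V value, final long timestamp) {
    eventIdStore.put(eventId, value, timestamp);
}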
DeduplicationPunctuator:
public DeduplicationPunctuator(WindowStore<E, V> eventIdStore, ProcessorContext context,
                               long retainPeriodMs) {
    this.eventIdStore = eventIdStore;
    this.context = context;
    this.retainPeriodMs = retainPeriodMs;
}

@Override
public void punctuate(long invocationTime) {
    LOGGER.info("Punctuator invoked at {}, searching from {}",
        new Date(invocationTime), new Date(invocationTime - retainPeriodMs));

    KeyValueIterator<Windowed<E>, V> it =
        eventIdStore.fetchAll(invocationTime - retainPeriodMs, invocationTime + retainPeriodMs);

    while (it.hasNext()) {
        KeyValue<Windowed<E>, V> next = it.next();
        LOGGER.info("Punctuator running on {}", next.key.key());
        context.forward(next.key.key(), next.value);

        // Delete from store with tombstone
        eventIdStore.put(next.key.key(), null, invocationTime);
        context.commit();
    }
    it.close();
}
Is this a valid approach?
With the previous code, I'm running some integration tests and I have some synchronization issues. How can I be sure that the start of the window will coincide with the Punctuator's scheduled interval?
Also, as an alternative approach, I was wondering (I've googled with no results) whether there is any event triggered by window closing to which I can attach a callback, in order to iterate over the store and publish only distinct events.
Thanks.

Does a FlowableOperator inherently support backpressure?

I've implemented a FlowableOperator as described in the RxJava 2 wiki (https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0#operator-targeting-lift), except that I perform a check in the onNext() method, something like this:
public final class MyOperator implements FlowableOperator<Integer, Integer> {
    ...

    static final class Op implements FlowableSubscriber<Integer>, Subscription {
        @Override
        public void onNext(Integer v) {
            if (v % 2 == 0) {
                child.onNext(v * v);
            }
        }
        ...
    }
}
This operator is part of a chain where I have a Flowable created with the DROP backpressure strategy. In essence, it looks almost like this:
Flowable.<Integer>create(emitter -> myAction(), DROP)
        .filter(v -> v > 2)
        .lift(new MyOperator())
        .subscribe(n -> doSomething(n));
I've run into the following issue:
backpressure occurs, so doSomething(n) cannot keep up with the upstream;
items are dropped due to the chosen backpressure strategy;
but doSomething(n) never receives new items after the drop has happened, even once doSomething(n) is ready to deal with new items again.
Reading back David Karnok's excellent blog post http://akarnokd.blogspot.fr/2015/05/pitfalls-of-operator-implementations.html, it seems that I need to add a request(1) call in the onNext() method. But that was for RxJava 1...
So, my question is: is this fix enough in RxJava 2 to deal with my backpressure issue? Or does my operator have to implement all the atomics and drain machinery described in https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0#atomics-serialization-deferred-actions to properly handle backpressure?
Note: I've added the request(1) call and it seems to work. But I can't figure out whether it's enough or whether my operator really needs the tricky queue-drain and atomics machinery.
Thanks in advance!
Does a FlowableOperator inherently support backpressure?
FlowableOperator is an interface that is called for a given downstream Subscriber and should return a new Subscriber that wraps the downstream and modulates the Reactive Streams events passing in one or both directions. Backpressure support is the responsibility of the Subscriber implementation, not this particular functional interface. It could have been Function<Subscriber, Subscriber> but a separate named interface was deemed more usable and less prone to overload conflicts.
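For reference, the interface is essentially the following shape (simplified; the real RxJava 2 declaration also carries @NonNull annotations):

public interface FlowableOperator<Downstream, Upstream> {
    // Wrap the downstream Subscriber; the returned Subscriber faces upstream.
    Subscriber<? super Upstream> apply(Subscriber<? super Downstream> subscriber) throws Exception;
}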
need to add a request(1) in the onNext() [...]
But I can't figure out whether it's enough or whether my operator needs the tricky stuff of queue-drain and atomics.
Yes, you have to do that in RxJava 2 as well. Since RxJava 2's Subscriber is not a class, it doesn't have v1's convenience request method. You have to save the Subscription in onSubscribe and call upstream.request(1) on the appropriate path in onNext. For your case, it should be quite enough.
I've updated the wiki with a new section explaining this case explicitly:
https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0#replenishing
final class FilterOddSubscriber implements FlowableSubscriber<Integer>, Subscription {

    final Subscriber<? super Integer> downstream;

    Subscription upstream;

    // ...

    @Override
    public void onSubscribe(Subscription s) {
        if (upstream != null) {
            s.cancel();
        } else {
            upstream = s;                  // <-------------------------
            downstream.onSubscribe(this);
        }
    }

    @Override
    public void onNext(Integer item) {
        if (item % 2 != 0) {
            downstream.onNext(item);
        } else {
            upstream.request(1);           // <-------------------------
        }
    }

    @Override
    public void request(long n) {
        upstream.request(n);
    }

    // the rest omitted for brevity
}
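For completeness, wiring it into a chain could look like the sketch below; the one-argument constructor is among the parts "omitted for brevity", so it is assumed here:

Flowable.range(1, 10)
        .lift((FlowableOperator<Integer, Integer>) FilterOddSubscriber::new)
        .subscribe(System.out::println); // prints 1, 3, 5, 7, 9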
Yes, you have to do the tricky stuff...
I would avoid writing custom operators unless you are very sure of what you are doing. Nearly everything can be achieved with the default operators...
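For instance, the MyOperator above (forward only even values, squared) can be expressed entirely with built-in operators, which already handle backpressure and replenishing for you:

Flowable.<Integer>create(emitter -> myAction(), BackpressureStrategy.DROP)
        .filter(v -> v > 2)
        .filter(v -> v % 2 == 0) // the check Op.onNext() performs
        .map(v -> v * v)         // the value Op.onNext() emits
        .subscribe(n -> doSomething(n));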
Writing operators, source-like (fromEmitter) or intermediate-like
(flatMap) has always been a hard task to do in RxJava. There are many
rules to obey, many cases to consider but at the same time, many
(legal) shortcuts to take to build a well performing code. Now writing
an operator specifically for 2.x is 10 times harder than for 1.x. If
you want to exploit all the advanced, 4th generation features, that's
even 2-3 times harder on top (so 30 times harder in total).
There is the tricky stuff explained: https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0

Samza: Delay processing of messages until timestamp

I'm processing messages from a Kafka topic with Samza. Some of the messages come with a timestamp in the future and I'd like to postpone the processing until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.
What I tried is to have my Task queue the messages and implement WindowableTask so that it periodically checks whether their timestamps allow processing them. The basic idea looks like this:
public class MyTask implements StreamTask, WindowableTask {

    private HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
            TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        for (MyMessage message : waitingMessages) {
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing and remove the message from the set
            }
        }
    }
}
This obviously has some downsides: I'd lose the waiting messages held in memory whenever I redeploy my task. So I'd like to know the best practice for delaying the processing of messages in Samza. Do I need to re-emit the messages to the same topic again and again until I can finally process them? We're talking about delaying processing from a few minutes up to 1-2 hours here.
It's important to keep in mind, when dealing with message queues, that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. A properly functioning message queue is expected to deliver messages on demand, which implies that as soon as a message reaches the head of the queue, the next pull on the queue will yield that message.
Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law (L = λW: the average number of items L in the system equals the average arrival rate λ times the average time W an item spends in the system) offers some interesting insights into this; deliberately inflating W inflates the queue depth L for any given arrival rate.
So, in a system where a delay is necessary (for example, to join/wait for a parallel operation to complete), you should be looking at other methods. Typically a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database - a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.
I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set.
It should look something like:
public class MyTask implements StreamTask, WindowableTask, InitableTask {

    private KeyValueStore<String, MyMessage> waitingMessages;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
    }

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
            TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.put(parsedMessage.getId(), parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        KeyValueIterator<String, MyMessage> all = waitingMessages.all();
        while (all.hasNext()) {
            MyMessage message = all.next().getValue();
            // Do the processing and remove the message from the store
        }
        all.close(); // KeyValueIterator must be closed to release its resources
    }
}
If you redeploy your task, Samza will recreate the state of the key-value store (Samza keeps the values in a special Kafka changelog topic backing the store). You do of course need to provide some extra configuration for your store (in the above example, for messages-store).
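A minimal sketch of that configuration, as properties (the store name matches the example above; the factory and config keys follow the Samza docs, while the serde names and the MyMessage serde class are placeholders):

# Back the store with RocksDB and a Kafka changelog topic for recovery
stores.messages-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.messages-store.changelog=kafka.messages-store-changelog
stores.messages-store.key.serde=string
stores.messages-store.msg.serde=my-message

# Register the serdes referenced above
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.my-message.class=com.example.MyMessageSerdeFactory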
You can read about the key-value store here (for the latest Samza version at the time of writing):
https://samza.apache.org/learn/documentation/0.14/container/state-management.html

Understanding RxJava: Differences between Runnable callback

I'm trying to understand RxJava, and I'm sure this question is nonsense... I have this code using RxJava:
public Observable<T> getData(int id) {
    if (dataAlreadyLoaded()) {
        return Observable.create(new Observable.OnSubscribe<T>() {
            @Override
            public void call(Subscriber<? super T> subscriber) {
                T data = getDataFromMemory(id);
                subscriber.onNext(data);
            }
        });
    }
    return Observable.create(new Observable.OnSubscribe<T>() {
        @Override
        public void call(Subscriber<? super T> subscriber) {
            T data = getDataFromRemoteService(id);
            subscriber.onNext(data);
        }
    });
}
And, for instance, I could use it this way:
Action1<String> action = new Action1<String>() {
    @Override
    public void call(String s) {
        // Do something with s
    }
};
getData(3).subscribe(action);
And this is another version, with a callback that implements Runnable:
public void getData(int id, MyClassRunnable callback) {
    if (dataAlreadyLoaded()) {
        T data = getDataFromMemory(id);
        callback.setData(data);
        callback.run();
    } else {
        T data = getDataFromRemoteService(id);
        callback.setData(data);
        callback.run();
    }
}
And I would use it this way:
getData(3, new MyClassRunnable()); //Do something in run method
Which are the differences? Why is the first one better?
The question is not about the framework itself but the paradigm. I'm trying to understand the use cases of reactive.
I appreciate any help. Thanks.
First of all, your RxJava version is much more complex than it needs to be. Here's a much simpler version:
public Observable<T> getData(int id) {
    return Observable.fromCallable(() ->
        dataAlreadyLoaded() ? getDataFromMemory(id) : getDataFromRemoteService(id)
    );
}
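Usage is unchanged, e.g.:

getData(3).subscribe(action); // same Action1<String> as before

As a bonus, fromCallable routes a thrown exception to onError, which the hand-rolled Observable.create versions above don't do by themselves.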
Regardless, the problem you present is so trivial that there is no discernible difference between the two solutions. It's like asking which one is better for assigning integer values - var = var + 1 or var++. In this particular case they are identical, but when using assignment there are many more possibilities (adding values other than one, subtracting, multiplying, dividing, taking into account other variables, etc).
So what is it you can do with reactive? I like the summary on reactivex's website:
Easily create event streams or data streams. For a single piece of data this isn't so important, but when you have a stream of data the paradigm makes a lot more sense.
Compose and transform streams with query-like operators. In your above example there are no operators and a single stream. Operators let you transform data in handy ways, and combining multiple callbacks is much harder than combining multiple Observables.
Subscribe to any observable stream to perform side effects. You're only listening to a single event. Reactive is well-suited for listening to multiple events. It's also great for things like error handling - you can create a long sequence of events, but any errors are forwarded to the eventual subscriber.
Let's look at a more concrete example with a bit more intrigue: validating an email and password. You've got two text fields and a button. You want the button to become enabled once there is an email (let's say matching .*@.*) and a password (of at least 8 characters) entered.
I've got two Observables that represent whatever the user has currently entered into the text fields:
Observable<String> email = /* you figure this out */;
Observable<String> password = /* and this, too */;
For validating each input, I can map the input String to true or false.
Observable<Boolean> validEmail = email.map(str -> str.matches(".*@.*"));
Observable<Boolean> validPw = password.map(str -> str.length() >= 8);
Then I can combine them to determine if I should enable the button or not:
Observable.combineLatest(validEmail, validPw, (b1, b2) -> b1 && b2)
.subscribe(enableButton -> /* enable button based on bool */);
Now, every time the user types something new into either text field, the button's state gets updated. I've set up the logic so that the button simply reacts to the state of the text fields.
This simple example doesn't show it all, but it shows how things get a lot more interesting after you get past a simple subscription. Obviously, you can do this without the reactive paradigm, but it's simpler with reactive operators.
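To make the fragments above concrete, here is a minimal self-contained sketch, using PublishSubjects as stand-ins for the two text fields (everything else is the code from above):

import rx.Observable;
import rx.subjects.PublishSubject;

public class FormValidationDemo {
    public static void main(String[] args) {
        // Stand-ins for the text field change streams
        PublishSubject<String> email = PublishSubject.create();
        PublishSubject<String> password = PublishSubject.create();

        Observable<Boolean> validEmail = email.map(str -> str.matches(".*@.*"));
        Observable<Boolean> validPw = password.map(str -> str.length() >= 8);

        Observable.combineLatest(validEmail, validPw, (b1, b2) -> b1 && b2)
                  .subscribe(enable -> System.out.println("button enabled: " + enable));

        email.onNext("user@example.com");
        password.onNext("12345678");   // prints "button enabled: true"
        password.onNext("short");      // prints "button enabled: false"
    }
}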

What is the purpose of BufferOverflowException?

I don't quite understand the meaning/purpose of BufferOverflowException.
In my course we're using it to code a queue, and while adding elements to the queue we throw a BufferOverflowException.
According to the Oracle docs it is an "Unchecked exception thrown when a relative put operation reaches the target buffer's limit," but I still don't understand the point of it.
import java.nio.BufferOverflowException;

public class FIFOQueue<T> {
    T[] data;
    int first = 0;
    int last = 0;
    boolean full = false;

    @SuppressWarnings("unchecked")
    public FIFOQueue(int capacity) {
        data = (T[]) new Object[capacity];
    }

    public void add(T element) {
        if (full)
            throw new BufferOverflowException();
        data[last] = element;
        last++;
        if (last == data.length)
            last = 0;      // wrap around to the start of the backing array
        if (last == first)
            full = true;   // writer has caught up with the reader: queue is full
    }
}
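For illustration, a quick sketch of how add behaves at capacity (the capacity and values are arbitrary):

FIFOQueue<String> queue = new FIFOQueue<>(2);
queue.add("a");
queue.add("b");  // last wraps to 0 and equals first: queue is now full
queue.add("c");  // throws java.nio.BufferOverflowException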
Buffer Overflow means that too much data is given to an application.
Example: Copy the text of a book into 'New Contact Name' on a phone.
Typically, if not handled well, this leads to a crash...
More importantly, it can be a security flaw!
The extra data can get stored outside of the program's designated memory, and that extra data could be executable code.
So, it's good practice to always validate user input! :)