Kafka Streams - Transformers with State in Fields and Task / Threading Model

I have a Transformer with a state store that uses punctuate to operate on said state store.
After a few iterations of punctuate, the operation may have finished, so I'd like to cancel the punctuate -- but only for the Task that has actually finished the operation on the partition's respective state store. The punctuate operations for the Tasks that are not done yet should keep running. To that purpose my transformer keeps a reference to the Cancellable returned by schedule().
As far as I can tell, every Task always gets its own isolated Transformer instance and every Task gets its own isolated scheduled punctuate() within that instance (?)
However, since this is effectively state, but not inside a stateStore, I'm not sure how safe this is. For instance, are there certain scenarios in which one transformer instance might be shared across tasks (and therefore absolutely no state must be kept outside of StateStores)?
public class CoolTransformer implements Transformer {

    private KeyValueStore stateStore;
    private Cancellable taskPunctuate; // <----- Will this lead to conflicts between tasks?

    @Override
    public void init(ProcessorContext context) {
        this.stateStore = context.getStateStore(...);
        this.taskPunctuate = context.schedule(Duration.ofMillis(...),
                PunctuationType.WALL_CLOCK_TIME, this::scheduledOperation);
    }

    private void scheduledOperation(long timestamp) {
        stateStore.get(...);
        // do stuff...
        if (done) {
            this.taskPunctuate.cancel(); // <----- Will this lead to conflicts between tasks?
        }
    }

    @Override
    public KeyValue transform(Object key, Object value) {
        // do stuff
        stateStore.put(key, value);
        return ...; // forward a result, or null to emit nothing
    }

    @Override
    public void close() {
        taskPunctuate.cancel();
    }
}

You might want to look into TransformerSupplier, specifically TransformerSupplier#get(): it must return a new Transformer instance on every call, which is what ensures each Task gets its own independent transformer. Transformers also should not share objects, so be careful that your Cancellable taskPunctuate is not shared across instances. If either of these rules is violated you may see errors like org.apache.kafka.streams.errors.StreamsException: Current node is unknown, ConcurrentModificationException or InstanceAlreadyExistsException.
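As a sketch of that first point (raw types kept to match the question's CoolTransformer), the supplier must build a brand-new transformer on every call, never hand back a cached one:
public class CoolTransformerSupplier implements TransformerSupplier {
    @Override
    public Transformer get() {
        // a fresh instance per call: each Task then owns its own
        // stateStore reference and its own Cancellable taskPunctuate
        return new CoolTransformer();
    }
}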

Related

opencensus - explicit context management

I am implementing OpenCensus tracing in my (asynchronous) JVM app.
However, I don't understand how the context is passed.
Sometimes it seems to work fine, sometimes traces from different requests appear nested for no reason.
I also have this warning appearing in the logs along with a stacktrace:
SEVERE: Context was not attached when detaching
How do I explicitly create a root span, and how can I explicitly pass a parent/context to the child spans?
In OpenCensus we have a concept of a context that is independent of the "Span" or the "Tags". It represents a map that is propagated with the request (it is implemented as a thread-local, so in synchronous calls it gets propagated automatically). For callbacks/async calls, use the wrap functions defined here: https://github.com/grpc/grpc-java/blob/master/context/src/main/java/io/grpc/Context.java#L589 (we are using io.grpc.Context as the implementation of the context). This ensures just the context propagation, so entries in the context map are propagated between the different threads.
If you want to start a Span in one thread and end it in a different thread, use the withSpan methods from the Tracer (https://www.javadoc.io/doc/io.opencensus/opencensus-api/0.17.0):
class MyClass {
    private static Tracer tracer = Tracing.getTracer();

    void handleRequest(Executor executor) {
        Span span = tracer.spanBuilder("MyRunnableSpan").startSpan();
        // do some work before scheduling the async
        executor.execute(Context.current().wrap(tracer.withSpan(span, new Runnable() {
            @Override
            public void run() {
                try {
                    sendResult();
                } finally {
                    span.end();
                }
            }
        })));
    }
}
A bit more information about this here https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Span.md#span-creation
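The answer above does not show the "explicit root span / explicit parent" part of the question, so here is a minimal sketch of it using the same Tracer API; spanBuilderWithExplicitParent accepts null as the parent, which makes the new span a root span:
import io.opencensus.trace.Span;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;

class ExplicitParentExample {
    private static final Tracer tracer = Tracing.getTracer();

    void example() {
        // a null parent makes this an explicit root span
        Span root = tracer.spanBuilderWithExplicitParent("RootSpan", null).startSpan();
        // passing root explicitly ties the child to it, regardless of
        // whatever span is in the current thread-local context
        Span child = tracer.spanBuilderWithExplicitParent("ChildSpan", root).startSpan();
        child.end();
        root.end();
    }
}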

Does a FlowableOperator inherently support backpressure?

I've implemented a FlowableOperator as described in the RxJava2 wiki (https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0#operator-targeting-lift), except that I perform a check in the onNext() method, something like this:
public final class MyOperator implements FlowableOperator<Integer, Integer> {
    ...

    static final class Op implements FlowableSubscriber<Integer>, Subscription {
        @Override
        public void onNext(Integer v) {
            if (v % 2 == 0) {
                child.onNext(v * v);
            }
        }
        ...
    }
}
This operator is part of a chain where I have a Flowable created with a backpressure drop. In essence, it looks almost like this:
Flowable.<Integer>create(emitter -> myAction(), DROP)
        .filter(v -> v > 2)
        .lift(new MyOperator())
        .subscribe(n -> doSomething(n));
I've run into the following issue:
backpressure occurs, so doSomething(n) cannot keep up with the incoming upstream items
items are dropped due to the backpressure strategy chosen
but doSomething(n) never receives new items after the drop has been performed, even once doSomething(n) is ready to deal with new items again
Reading back the excellent blog post http://akarnokd.blogspot.fr/2015/05/pitfalls-of-operator-implementations.html by David Karnok, it seems that I need to add a request(1) in the onNext() method. But that was for RxJava1...
So, my question is: is this fix enough in RxJava2 to deal with my backpressure issue? Or does my operator have to implement all the atomics and drain machinery described in https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0#atomics-serialization-deferred-actions to properly handle backpressure?
Note: I've added the request(1) and it seems to work. But I can't figure out whether it's enough or whether my operator needs the tricky queue-drain and atomics machinery.
Thanks in advance!
Does a FlowableOperator inherently support backpressure?
FlowableOperator is an interface that is called for a given downstream Subscriber and should return a new Subscriber that wraps the downstream and modulates the Reactive Streams events passing in one or both directions. Backpressure support is the responsibility of the Subscriber implementation, not this particular functional interface. It could have been Function<Subscriber, Subscriber> but a separate named interface was deemed more usable and less prone to overload conflicts.
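To make the shape concrete, a quick sketch (mine, not from the original answer) of how lift() hands the downstream Subscriber to this interface; FilterOddSubscriber is the class shown further below, assuming a one-argument constructor that assigns its downstream field:
import io.reactivex.FlowableOperator;
import org.reactivestreams.Subscriber;

public final class FilterOddOperator implements FlowableOperator<Integer, Integer> {
    @Override
    public Subscriber<? super Integer> apply(Subscriber<? super Integer> downstream) {
        // wrap the downstream Subscriber; whatever we return here is the
        // Subscriber the upstream will actually emit into
        return new FilterOddSubscriber(downstream);
    }
}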
need to add a request(1) in the onNext() [...]
But I can't figure out whether it's enough or whether my operator needs the tricky stuff of queue-drain and atomics.
Yes, you have to do that in RxJava 2 as well. Since RxJava 2's Subscriber is not a class, it doesn't have v1's convenience request method. You have to save the Subscription in onSubscribe and call upstream.request(1) on the appropriate path in onNext. For your case, it should be quite enough.
I've updated the wiki with a new section explaining this case explicitly:
https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0#replenishing
final class FilterOddSubscriber implements FlowableSubscriber<Integer>, Subscription {

    final Subscriber<? super Integer> downstream;

    Subscription upstream;

    // ...

    @Override
    public void onSubscribe(Subscription s) {
        if (upstream != null) {
            s.cancel();
        } else {
            upstream = s; // <-------------------------
            downstream.onSubscribe(this);
        }
    }

    @Override
    public void onNext(Integer item) {
        if (item % 2 != 0) {
            downstream.onNext(item);
        } else {
            upstream.request(1); // <-------------------------
        }
    }

    @Override
    public void request(long n) {
        upstream.request(n);
    }

    // the rest omitted for brevity
}
Yes, you have to do the tricky stuff...
I would avoid writing custom operators unless you are very sure of what you are doing. Nearly everything can be achieved with the default operators (see the sketch at the end of this answer).
Writing operators, source-like (fromEmitter) or intermediate-like (flatMap) has always been a hard task to do in RxJava. There are many rules to obey, many cases to consider but at the same time, many (legal) shortcuts to take to build a well performing code. Now writing an operator specifically for 2.x is 10 times harder than for 1.x. If you want to exploit all the advanced, 4th generation features, that's even 2-3 times harder on top (so 30 times harder in total).
The tricky stuff is explained here: https://github.com/ReactiveX/RxJava/wiki/Writing-operators-for-2.0
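To make that concrete: the questioner's MyOperator (keep even values, emit their squares) needs no custom operator at all. A minimal sketch reusing myAction and doSomething from the question; the built-in filter and map do all the request bookkeeping internally:
Flowable.<Integer>create(emitter -> myAction(), DROP)
        .filter(v -> v > 2)
        .filter(v -> v % 2 == 0) // the check MyOperator ran in onNext
        .map(v -> v * v)         // the value MyOperator passed downstream
        .subscribe(n -> doSomething(n));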

Samza: Delay processing of messages until timestamp

I'm processing messages from a Kafka topic with Samza. Some of the messages come with a timestamp in the future and I'd like to postpone the processing until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.
What I tried to do is make my Task queue the messages and implement WindowableTask to periodically check whether the messages' timestamps allow processing them. The basic idea looks like this:
public class MyTask implements StreamTask, WindowableTask {

    private HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        for (MyMessage message : waitingMessages) {
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing and remove the message from the set
            }
        }
    }
}
This obviously has some downsides. I'd be losing my waiting messages in memory when I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to reemit the messages to the same topic again and again until I can finally process them? We're talking about delaying the processing for a few minutes up to 1-2 hours here.
It's important to keep in mind, when dealing with message queues, that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. It is expected that a properly functioning message queue will deliver messages on demand. What this implies is that as soon as a message reaches the head of the queue, the next pull on the queue will yield the message.
Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.
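For reference, Little's Law says L = λW: the average number of messages in the system (L) equals the average arrival rate (λ) times the average time each message spends in the system (W). Deliberately holding messages for minutes or hours drives W, and therefore the queue depth L, up in direct proportion to your arrival rate.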
So, in a system where a delay is necessary (for example, to join/wait for a parallel operation to complete), you should be looking at other methods. Typically a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database - a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.
I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set.
It should look something like this:
public class MyTask implements StreamTask, WindowableTask, InitableTask {

    private KeyValueStore<String, MyMessage> waitingMessages;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
    }

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
            TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBefore(LocalDate.now())) {
            // Do the processing
        } else {
            waitingMessages.put(parsedMessage.getId(), parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        KeyValueIterator<String, MyMessage> all = waitingMessages.all();
        try {
            while (all.hasNext()) {
                MyMessage message = all.next().getValue();
                // Do the processing and remove the message from the store
            }
        } finally {
            all.close(); // key-value store iterators must be closed
        }
    }
}
If you redeploy your task, Samza should recreate the state of the key-value store (Samza keeps the values in a special Kafka changelog topic associated with the store). You of course need to provide some extra configuration for your store (in the above example, for messages-store), as sketched below.
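For illustration, the store configuration might look something like this (the factory and serde entries are illustrative; a serde named mymessage-serde would have to be registered separately in the serializers registry):
stores.messages-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.messages-store.changelog=kafka.messages-store-changelog
stores.messages-store.key.serde=string
stores.messages-store.msg.serde=mymessage-serde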
You could read about key-value store here (for the latest Samza version):
https://samza.apache.org/learn/documentation/0.14/container/state-management.html

How do I warm up an actor's state from database when starting up?

My requirement is to start a long-running process to tag all the products that are expired. This runs every night at 1:00 AM. The customers may be accessing some of the products on the website, so those products have actor instances around the time the job runs. The others are in the persistent media, not yet having instances, because the customers are not accessing them.
Where should I hook up the logic to read the latest state of an actor from the persistent media and create a brand-new actor? Should I have that call in the PreStart override method? If so, how can I tell the ProductActor that a new actor is being created?
Or should I send a message to the ProductActor like LoadMeFromAzureTable which will load the state from the persistent media after an actor being created?
There are different ways to do it depending on what you need, as opposed to there being precisely one "right" answer.
You could use a Persistent Actor to recover state from a durable store automatically on startup (or in case of crash, to recover). Or, if you don't want to use that module (still in beta as of July 2015), you could do it yourself one of two ways:
1) You could load your state in PreStart, but I'd only go with this if you can make the operation async via your database client and use the PipeTo pattern to send the results back to yourself incrementally. But if you need to have ALL the state resident in memory before you start doing work, then you need to...
2) Make a finite state machine using behavior switching. Start in a gated state, send yourself a message to load your data, and stash everything that comes in. Then switch to a receiving state and unstash all messages when your state is done loading. This is the approach I prefer.
Example (just mocking the DB load with a Task):
public class ProductActor : ReceiveActor, IWithUnboundedStash
{
    public IStash Stash { get; set; }

    public ProductActor()
    {
        // begin in gated state
        BecomeLoading();
    }

    private void BecomeLoading()
    {
        Become(Loading);
        LoadInitialState();
    }

    private void Loading()
    {
        Receive<DoneLoading>(done =>
        {
            BecomeReady();
        });

        // stash any messages that come in until we're done loading
        ReceiveAny(o =>
        {
            Stash.Stash();
        });
    }

    private void LoadInitialState()
    {
        // load your state here async & send back to self via PipeTo
        Task.Run(() =>
        {
            // database loading task here
            return new Object();
        }).ContinueWith(tr =>
        {
            // do whatever (e.g. error handling)
            return new DoneLoading();
        }).PipeTo(Self);
    }

    private void BecomeReady()
    {
        Become(Ready);

        // our state is ready! put all those stashed messages back in the mailbox
        Stash.UnstashAll();
    }

    private void Ready()
    {
        // handle those unstashed + new messages...
        ReceiveAny(o =>
        {
            // do whatever you need to do...
        });
    }
}

/// <summary>
/// Marker message indicating that the initial load is done.
/// </summary>
public class DoneLoading {}

How to access memcached asynchronously in netty

I am writing a server in netty, in which I need to make a call to memcached. I am using spymemcached and can easily do the synchronous memcached call. I would like this memcached call to be async. Is that possible? The examples provided with netty do not seem to be helpful.
I tried using callbacks: I created an ExecutorService pool in my handler and submitted a callback worker to this pool, like this:
public class MyHandler extends ChannelInboundMessageHandlerAdapter<MyPOJO> implements CallbackInterface {
    ...
    private static ExecutorService pool = Executors.newFixedThreadPool(20);

    @Override
    public void messageReceived(ChannelHandlerContext ctx, MyPOJO pojo) {
        ...
        CallingbackWorker worker = new CallingbackWorker(key, this);
        pool.submit(worker);
        ...
    }

    public void myCallback() {
        // get response
        this.ctx.nextOutboundMessageBuf().add(response);
    }
}
CallingbackWorker looks like:
public class CallingbackWorker implements Callable {

    public CallingbackWorker(String key, CallbackInterface c) {
        this.c = c;
        this.key = key;
    }

    public Object call() {
        // get value from key
        c.myCallback(value);
    }
}
However, when I do this, this.ctx.nextOutboundMessageBuf() in myCallback gets stuck.
So, overall, my question is: how to do async memcached calls in Netty?
There are two problems here: a small-ish issue with the way you're trying to code this, and a bigger one with many libraries that provide async service calls, but no good way to take full advantage of them in an async framework like Netty. That forces users into suboptimal hacks like this one, or a less-bad, but still not ideal approach I'll get to in a moment.
First the coding problem. The issue is that you're trying to call a ChannelHandlerContext method from a thread other than the one associated with your handler, which is not allowed. That's pretty easy to fix, as shown below. You could code it a few other ways, but this is probably the most straightforward:
private static ExecutorService pool = Executors.newFixedThreadPool(20);

public void channelRead(final ChannelHandlerContext ctx, final Object msg) {
    //...

    final GetFuture<String> future = memcachedClient().getAsync("foo", stringTranscoder());

    // first wait for the response on a pool thread
    pool.execute(new Runnable() {
        public void run() {
            String value;
            Exception err;
            try {
                value = future.get(3, TimeUnit.SECONDS); // or whatever timeout you want
                err = null;
            } catch (Exception e) {
                err = e;
                value = null;
            }
            // put results into final variables; compiler won't let us do it directly above
            final String fValue = value;
            final Exception fErr = err;

            // now process the result on the ChannelHandler's thread
            ctx.executor().execute(new Runnable() {
                public void run() {
                    handleResult(fValue, fErr);
                }
            });
        }
    });
    // note that we drop through to here right after calling pool.execute() and
    // return, freeing up the handler thread while we wait on the pool thread.
}

private void handleResult(String value, Exception err) {
    // handle it
}
That will work, and might be sufficient for your application. But you've got a fixed-size thread pool, so if you're ever going to handle much more than 20 concurrent connections, that will become a bottleneck. You could increase the pool size, or use an unbounded one, but at that point you might as well be running under Tomcat, as memory consumption and context-switching overhead start to become issues, and you lose the scalability that was the attraction of Netty in the first place!
And the thing is, Spymemcached is NIO-based, event-driven, and uses just one thread for all its work, yet provides no way to fully take advantage of its event-driven nature. I expect they'll fix that before too long, just as Netty 4 and Cassandra recently have by providing callback (listener) methods on Future objects.
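For illustration, a minimal sketch of that listener style using Netty 4's own Future/Promise types (the Promise stands in here for an async client result; this is not Spymemcached's API):
import io.netty.util.concurrent.DefaultEventExecutor;
import io.netty.util.concurrent.EventExecutor;
import io.netty.util.concurrent.FutureListener;
import io.netty.util.concurrent.Promise;

public class ListenerStyleExample {
    public static void main(String[] args) {
        EventExecutor executor = new DefaultEventExecutor();
        Promise<String> promise = executor.newPromise();

        // the callback fires on the event loop thread when the result arrives;
        // no pool thread ever blocks in get(), and nothing polls
        promise.addListener((FutureListener<String>) f -> {
            if (f.isSuccess()) {
                System.out.println("value: " + f.getNow());
            } else {
                f.cause().printStackTrace();
            }
            executor.shutdownGracefully();
        });

        // simulate the async completion a client library would perform
        promise.setSuccess("bar");
    }
}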
Meanwhile, being in the same boat as you, I researched the alternatives, and not being too happy with what I found, I wrote (yesterday) a Future tracker class that can poll up to thousands of Futures at a configurable rate, and call you back on the thread (Executor) of your choice when they complete. It uses just one thread to do this. I've put it up on GitHub if you'd like to try it out, but be warned that it's still wet, as they say. I've tested it a lot in the past day, and even with 10000 concurrent mock Future objects, polling once a millisecond, its CPU utilization is negligible, though it starts to go up beyond 10000. Using it, the example above looks like this:
// in some globally-accessible class:
public static final ForeignFutureTracker FFT =
    new ForeignFutureTracker(1, TimeUnit.MILLISECONDS);

// in a handler class:
public void channelRead(final ChannelHandlerContext ctx, final Object msg) {
    // ...

    final GetFuture<String> future = memcachedClient().getAsync("foo", stringTranscoder());

    // add a listener for the Future, with a timeout in 2 seconds, and pass
    // the Executor for the current context so the callback will run
    // on the same thread.
    Global.FFT.addListener(future, 2, TimeUnit.SECONDS, ctx.executor(),
        new ForeignFutureListener<String, GetFuture<String>>() {
            public void operationSuccess(String value) {
                // do something ...
                ctx.fireChannelRead(someval);
            }
            public void operationTimeout(GetFuture<String> f) {
                // do something ...
            }
            public void operationFailure(Exception e) {
                // do something ...
            }
        });
}
You don't want more than one or two FFT instances active at any time, or they could become a drain on CPU. But a single instance can handle thousands of outstanding Futures; about the only reason to have a second one would be to handle higher-latency calls, like S3, at a slower polling rate, say 10-20 milliseconds.
One drawback of the polling approach is that it adds a small amount of latency. For example, polling once a millisecond, on average it will add 500 microseconds to the response time. That won't be an issue for most applications, and I think is more than offset by the memory and CPU savings over the thread pool approach.
I expect within a year or so this will be a non-issue, as more async clients provide callback mechanisms, letting you fully leverage NIO and the event-driven model.