Is there an RxJava Subject that caches values and forgets them once emitted? - rx-java2

I am looking for a Subject type (or some combination of operators) that would achieve a certain behaviour.
A Subject is created
A Subject's onNext() is called multiple times, all those values are cached in that Subject
A Consumer subscribes to that Subject
The Consumer receives all the values that have been cached
The Consumer unsubscribes from the Subject .. through a call to dispose()
A Subject's onNext() is called with 2 new values
A Consumer subscribes to that Subject
The Consumer should only receive the 2 new values, because the old values have already been emitted.
What I need is basically a special case between a ReplaySubject and a BehaviorSubject.
ReplaySubject replays all events .. BehaviorSubject replays only the last event.
I want a Subject that replays only events that have not been consumed. In other words .. events that were emitted while the Subject had no subscribers.
For completeness, here's a test case that clarifies the behaviour
@Test
public void test() {
    Subject<String> subject = MyDesiredSubject.create();
    subject.onNext("1");
    subject.onNext("2");
    TestObserver<String> testObserver = subject.test();
    testObserver.assertValues("1", "2");
    testObserver.dispose();
    subject.onNext("3");
    subject.onNext("4");
    testObserver = subject.test();
    testObserver.assertValues("3", "4");
}

You could use DispatchWorkSubject in https://github.com/akarnokd/RxJava2Extensions .
A Subject variant that buffers items and allows one or more Observers
to exclusively consume one of the items in the buffer asynchronously.
If there are no Observers (or they all disposed), the
DispatchWorkSubject will keep buffering and later Observers can resume
the consumption of the buffer.
But if you only want to support a maximum of 1 subscriber at a time, then use UnicastWorkSubject (also in the Rx2 extensions).
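To make the requested semantics concrete, here is a minimal plain-Java sketch of a buffer that keeps items while nobody is listening and forgets each item once it has been handed out. It is not the implementation of DispatchWorkSubject or UnicastWorkSubject; the class name DrainingBuffer and its methods are hypothetical, and it only supports one consumer at a time.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical illustration: buffers items while unobserved, hands each item out exactly once. */
final class DrainingBuffer<T> {
    private final Queue<T> pending = new ArrayDeque<>();
    private Consumer<? super T> current; // at most one consumer at a time (assumption)

    /** Delivers immediately if a consumer is attached, otherwise buffers the item. */
    synchronized void onNext(T item) {
        if (current != null) {
            current.accept(item);
        } else {
            pending.add(item);
        }
    }

    /** Attaches a consumer, drains the backlog to it, and returns a handle that detaches it. */
    synchronized Runnable subscribe(Consumer<? super T> consumer) {
        if (current != null) {
            throw new IllegalStateException("only one consumer at a time");
        }
        current = consumer;
        T item;
        while ((item = pending.poll()) != null) {
            consumer.accept(item); // polled items are gone from the buffer for good
        }
        return () -> {
            synchronized (this) {
                if (current == consumer) {
                    current = null;
                }
            }
        };
    }
}
```

Running the handle returned by subscribe plays the role of dispose() in the test case from the question: items pushed afterwards are buffered again until the next consumer attaches.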

Related

Maintaining cold observable semantics with a hot observable

I have a requirement to read items from an external queue, and persist them to a JDBC store. The items must be processed one-by-one, and the next item must only be read from the external queue once the previous item has been successfully persisted. At any given time there may or may not be an item available to read, and if not the application must block until the next item is available.
In order to enforce the one-by-one semantics, I decided to use a cold Observable using the generate method:
return Observable.generate(emitter -> {
final Future<Message> receivedFuture = ...;
final Message message = receivedFuture.get();
emitter.onNext(message);
});
This seems to work as expected for the receiving side.
In order to persist the data to the database, I decided to make use of the Vertx JDBCPool library.
messageObservable
.flatMapSingle(message ->
jdbcPool.prepareQuery("...")
.rxExecute(Tuple.of(...)) // produces a hot observable
)
According to the Vertx docs, the JDBCPool RX methods all produce hot observables.
The problem here seems to be that the flatmap to the JDBCPool method causes the entire chain to become hot. This has the undesirable consequence that messages are read from the queue before the previous message was persisted.
In other words, instead of
Read message 1
Write message 1
Read message 2
Write message 2
I now get
Read message 1
Read message 2
Read message 3
Write message 1
Read message 4
Write message 2
The only solution I have at the moment is to do a very undesirable thing and put the JDBCPool query in its own chain:
messageObservable
    .flatMapSingle(message ->
        Single.just(
            jdbcPool.prepareQuery("...")
                .rxExecute(Tuple.of(...))
                .blockingGet() // blocks until the query result is available
        )
    )
I want to know if there is a way to combine the one-by-one semantics of a cold observable stream with a hot observable operation, while keeping the chain intact.

Why would publisher send new items even after cancel?

The documentation of Subscription#cancel says that
Data may still be sent to meet previously signalled demand after calling cancel.
In which scenario would people expect the publisher to continue to send till previous signalled demand is met?
Also, if I don't want any new items to be sent after cancellation, what should I do?
Unless you are creating low level operators or Publishers, you don't have to worry about this.
In which scenario would people expect the publisher to continue to send till previous signalled demand is met?
None of the mainstream Reactive Streams libraries do that; they all stop sending items eventually. RxJava 2 and Reactor 3 are pretty eager about this, so you'd most likely get one extra item after a low-level, asynchronously issued cancellation. Akka Stream may signal more than that (last time I checked, they mix control and item signals and there is a configuration setting for max synchronous items per stream that can lead to multiple items being emitted before the cancellation takes effect).
Also, if I don't want any new items to be sent after cancellation, what should I do?
Depends on what you implement: a Publisher or a Subscriber.
In a Publisher the most eager method is to set a volatile boolean cancelled field and check that every time you are in some kind of emission loop.
In a Subscriber, you can have a boolean done field that is checked in each onXXX so that when you call Subscription.cancel() from onNext, any subsequent call will be ignored.
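Both techniques can be sketched with the JDK's java.util.concurrent.Flow interfaces, which mirror the Reactive Streams ones. This is a simplified illustration, not a fully specification-compliant publisher: RangePublisher, RangeSubscription and TakeSubscriber are hypothetical names, and the emission loop runs synchronously on the thread that calls request().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical publisher that emits 0..count-1 on demand. */
final class RangePublisher implements Flow.Publisher<Integer> {
    private final int count;
    RangePublisher(int count) { this.count = count; }

    @Override
    public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
        subscriber.onSubscribe(new RangeSubscription(subscriber, count));
    }
}

final class RangeSubscription implements Flow.Subscription {
    private final Flow.Subscriber<? super Integer> downstream;
    private final int end;
    private final AtomicLong requested = new AtomicLong();
    private volatile boolean cancelled; // checked inside the emission loop
    private int index;

    RangeSubscription(Flow.Subscriber<? super Integer> downstream, int end) {
        this.downstream = downstream;
        this.end = end;
    }

    @Override
    public void request(long n) {
        if (n <= 0) {
            cancel();
            downstream.onError(new IllegalArgumentException("n must be positive"));
            return;
        }
        // Add demand (capped at Long.MAX_VALUE); only the 0 -> n transition starts emitting.
        long previous = requested.getAndAccumulate(n, (a, b) -> a + b < 0 ? Long.MAX_VALUE : a + b);
        if (previous != 0) {
            return;
        }
        long target = n;
        long emitted = 0;
        for (;;) {
            while (emitted != target && index < end) {
                if (cancelled) {
                    return; // stop before sending anything else
                }
                downstream.onNext(index++);
                emitted++;
            }
            if (cancelled) {
                return;
            }
            if (index == end) {
                downstream.onComplete();
                return;
            }
            target = requested.addAndGet(-emitted);
            if (target == 0) {
                return;
            }
            emitted = 0;
        }
    }

    @Override
    public void cancel() { cancelled = true; }
}

/** Hypothetical subscriber that cancels after `limit` items and ignores anything sent afterwards. */
final class TakeSubscriber implements Flow.Subscriber<Integer> {
    final List<Integer> received = new ArrayList<>();
    boolean completed;
    private final int limit;
    private Flow.Subscription upstream;
    private boolean done; // guards every onXXX method

    TakeSubscriber(int limit) { this.limit = limit; }

    @Override public void onSubscribe(Flow.Subscription s) { upstream = s; s.request(Long.MAX_VALUE); }

    @Override public void onNext(Integer item) {
        if (done) {
            return;
        }
        received.add(item);
        if (received.size() == limit) {
            done = true;
            upstream.cancel();
        }
    }

    @Override public void onError(Throwable t) { done = true; }

    @Override public void onComplete() {
        if (done) {
            return;
        }
        done = true;
        completed = true;
    }
}
```

The publisher side checks the flag before every onNext and before onComplete, so a cancel() issued from inside onNext takes effect before the next item. The done field on the subscriber side covers the remaining case where a publisher is less eager.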

Event Sourcing - Apache Kafka + Kafka Streams - How to assure atomicity / transactionality

I'm evaluating Event Sourcing with Apache Kafka Streams to see how viable it is for complex scenarios. As with relational databases, I have come across some cases where atomicity/transactionality is essential:
Shopping app with two services:
OrderService: has a Kafka Streams store with the orders (OrdersStore)
ProductService: has a Kafka Streams store (ProductStockStore) with the products and their stock.
Flow:
OrderService publishes an OrderCreated event (with productId, orderId, userId info)
ProductService gets the OrderCreated event and queries its KafkaStreams Store (ProductStockStore) to check if there is stock for the product. If there is stock it publishes an OrderUpdated event (also with productId, orderId, userId info)
The point is that this event would be listened by ProductService Kafka Stream, which would process it to decrease the stock, so far so good.
But, imagine this:
Customer 1 places an order, order1 (there is a stock of 1 for the product)
Customer 2 places concurrently another order, order2, for the same product (stock is still 1)
ProductService processes order1 and sends a message OrderUpdated to decrease the stock. This message is put in the topic after the one from order2 -> OrderCreated
ProductService processes order2-OrderCreated and sends a message OrderUpdated to decrease the stock again. This is incorrect since it will introduce an inconsistency (stock should be 0 now).
The obvious problem is that our materialized view (the store) should be updated directly when we process the first OrderUpdated event. However the only way (I know) of updating the Kafka Stream Store is publishing another event (OrderUpdated) to be processed by the Kafka Stream. This way we can't perform this update transactionally.
I would appreciate ideas to deal with scenarios like this.
UPDATE: I'll try to clarify the problematic bit of the problem:
ProductService has a Kafka Streams Store, ProductStock with this stock (productId=1, quantity=1)
OrderService publishes two OrderPlaced events on the orders topic:
Event1 (key=product1, productId=product1, quantity=1, eventType="OrderPlaced")
Event2 (key=product1, productId=product1, quantity=1, eventType="OrderPlaced")
ProductService has a consumer on the orders topic. For simplicity let's suppose a single partition to assure messages consumption in order. This consumer executes the following logic:
if("OrderPlaced".equals(event.get("eventType"))){
    Order order = new Order();
    order.setId((String)event.get("orderId"));
    order.setProductId((Integer)(event.get("productId")));
    order.setUid(event.get("uid").toString());
    // QUERY PRODUCTSTOCK TO CHECK AVAILABILITY
    Integer productStock = getProductStock(order.getProductId());
    if(productStock > 0) {
        // Named reservedEvent so it does not clash with the incoming event variable
        Map<String, Object> reservedEvent = new HashMap<>();
        reservedEvent.put("name", "ProductReserved");
        reservedEvent.put("orderId", order.getId());
        reservedEvent.put("productId", order.getProductId());
        // WRITES A PRODUCT RESERVED EVENT TO orders topic
        orderProcessor.output().send(MessageBuilder.withPayload(reservedEvent).build(), 500);
    }else{
        //XXX CANCEL ORDER
    }
}
ProductService also has a Kafka Streams processor that is responsible to update the stock:
KStream<Integer, JsonNode> stream = kStreamBuilder.stream(integerSerde, jsonSerde, "orders");
stream.xxx().yyy(() -> {...}, "ProductsStock");
Event1 would be processed first and since there is still 1 available product it would generate the ProductReserved event.
Now, it's Event2's turn. If it is consumed by the ProductService consumer BEFORE the ProductService Kafka Streams Processor processes the ProductReserved event generated by Event1, the consumer would still see that the ProductStore stock for product1 is 1 and would generate a ProductReserved event for Event2, producing an inconsistency in the system.
This answer is a little late for your original question, but let me answer anyway for completeness.
There are a number of ways to solve this problem, but I would encourage addressing it in an event-driven way. This would mean you (a) validate that there is enough stock to process the order and (b) reserve the stock, both as a single atomic step within a single KStreams operation. The trick is to rekey by productId; that way you know orders for the same product will be executed sequentially on the same thread (so you can't get into the situation where Order1 and Order2 reserve stock of the same product twice).
There is a post that discusses how to do this: https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
Maybe more usefully there is some sample code also showing how it can be done:
https://github.com/confluentinc/kafka-streams-examples/blob/1cbcaddd85457b39ee6e9050164dc619b08e9e7d/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java#L76
Note how in this KStreams code the first line rekeys to productId, then a Transformer is used to (a) validate there is sufficient stock to process the order and (b) reserve the stock required by updating the state store. This is done atomically, using Kafka's Transactions feature.
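The core idea, checking and reserving stock in one step per product key, can be illustrated without Kafka at all. The following plain-Java sketch is only an analogy for what the Transformer does against its state store; InventoryReserver and its methods are hypothetical names, and ConcurrentHashMap.computeIfPresent stands in for the per-key sequential processing that rekeying by productId gives you.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-product state store with atomic validate-and-reserve. */
final class InventoryReserver {
    private final Map<String, Integer> stockByProduct = new ConcurrentHashMap<>();

    void setStock(String productId, int quantity) {
        stockByProduct.put(productId, quantity);
    }

    int stockOf(String productId) {
        return stockByProduct.getOrDefault(productId, 0);
    }

    /**
     * Validates and reserves in a single atomic update for the given key.
     * Returns true if the stock was reserved, false if there was not enough.
     */
    boolean tryReserve(String productId, int quantity) {
        boolean[] reserved = {false};
        stockByProduct.computeIfPresent(productId, (id, available) -> {
            if (available >= quantity) {
                reserved[0] = true;
                return available - quantity; // check and decrement happen together
            }
            return available; // not enough stock, leave it untouched
        });
        return reserved[0];
    }
}
```

With a stock of 1, the first tryReserve("product1", 1) succeeds and the second fails, which is the outcome the question's Event1/Event2 scenario needs.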
This same problem is typical in assuring consistency in any distributed system. Instead of going for strong consistency, typically the process manager/saga pattern is used. This is somewhat similar to the 2-phase commit in distributed transactions but implemented explicitly in application code. It goes like this:
The Order Service asks the Product Service to reserve N items. The Product Service either accepts the command and reduces stock or rejects the command if it doesn't have enough items available. Upon a positive reply to the command, the Order Service can now emit an OrderCreated event (although I'd call it OrderPlaced, as "placed" sounds more idiomatic to the domain and "created" is more generic, but that's a detail). The Product Service either listens for OrderPlaced events or an explicit ConfirmReservation command is sent to it. Alternatively, if something else happened (e.g. failed to clear funds), an appropriate event can be emitted or a CancelReservation command sent explicitly to the ProductService. To cater for exceptional circumstances, the ProductService may also have a scheduler (in KafkaStreams punctuation can come in handy for this) to cancel reservations that weren't confirmed or aborted within a timeout period.
The technicalities of the orchestration of the two services and handling the error conditions and compensating actions (cancelling reservation in this case) can be handled in the services directly, or in an explicit Process Manager component to segregate this responsibility. Personally I'd go for an explicit Process Manager that could be implemented using Kafka Streams Processor API.
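As a rough sketch of the Product Service side of that flow, the reservation lifecycle (reserve, confirm, cancel, expire on timeout) could look like the following. All names here are hypothetical, time is passed in explicitly to keep the example deterministic, and a real service would persist this state rather than hold it in memory.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical in-memory reservation ledger for a single product. */
final class ReservationBook {
    private static final class Reservation {
        final int quantity;
        final long deadlineMillis;
        Reservation(int quantity, long deadlineMillis) {
            this.quantity = quantity;
            this.deadlineMillis = deadlineMillis;
        }
    }

    private final Map<String, Reservation> pending = new HashMap<>();
    private final long timeoutMillis;
    private int available;

    ReservationBook(int initialStock, long timeoutMillis) {
        this.available = initialStock;
        this.timeoutMillis = timeoutMillis;
    }

    /** Accepts the command and reduces stock, or rejects it if stock is insufficient. */
    synchronized boolean reserve(String orderId, int quantity, long nowMillis) {
        if (quantity <= 0 || quantity > available || pending.containsKey(orderId)) {
            return false;
        }
        available -= quantity;
        pending.put(orderId, new Reservation(quantity, nowMillis + timeoutMillis));
        return true;
    }

    /** Confirmation makes the stock reduction permanent. */
    synchronized boolean confirm(String orderId) {
        return pending.remove(orderId) != null;
    }

    /** Compensating action: gives the reserved stock back. */
    synchronized boolean cancel(String orderId) {
        Reservation reservation = pending.remove(orderId);
        if (reservation == null) {
            return false;
        }
        available += reservation.quantity;
        return true;
    }

    /** Scheduler hook: releases reservations whose deadline has passed; returns how many. */
    synchronized int expire(long nowMillis) {
        int released = 0;
        Iterator<Reservation> it = pending.values().iterator();
        while (it.hasNext()) {
            Reservation reservation = it.next();
            if (reservation.deadlineMillis <= nowMillis) {
                available += reservation.quantity;
                it.remove();
                released++;
            }
        }
        return released;
    }

    synchronized int available() {
        return available;
    }
}
```

In a Kafka Streams setting, expire would be the kind of work a punctuation callback performs periodically.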

Max number of tuple replays on Storm Kafka Spout

We’re using Storm with the Kafka Spout. When we fail messages, we’d like to replay them, but in some cases bad data or code errors will cause messages to always fail a Bolt, so we’ll get into an infinite replay cycle. Obviously we’re fixing errors when we find them, but would like our topology to be generally fault tolerant. How can we ack() a tuple after it’s been replayed more than N times?
Looking through the code for the Kafka Spout, I see that it was designed to retry with an exponential backoff timer and the comments on the PR state:
"The spout does not terminate the retry cycle (it is my conviction that it should not do so, because it cannot report context about the failure that happened to abort the reqeust), it only handles delaying the retries. A bolt in the topology is still expected to eventually call ack() instead of fail() to stop the cycle."
I've seen StackOverflow responses that recommend writing a custom spout, but I'd rather not be stuck maintaining a custom patch of the internals of the Kafka Spout if there's a recommended way to do this in a Bolt.
What’s the right way to do this in a Bolt? I don’t see any state in the tuple that exposes how many times it’s been replayed.
Storm itself does not provide any support for your problem. Thus, a customized solution is the only way to go. Even if you do not want to patch KafkaSpout, I think, introducing a counter and breaking the replay cycle in it, would be the best approach. As an alternative, you could also inherit from KafkaSpout and put a counter in your subclass. This is of course somewhat similar to a patch, but might be less intrusive and easier to implement.
If you want to use a Bolt, you could do the following (which also requires some changes to the KafkaSpout or a subclass of it).
Assign a unique ID as an additional attribute to each tuple (maybe there is already a unique ID available; otherwise, you could introduce a "counter-ID" or just use the whole tuple, i.e., all attributes, to identify each tuple).
Insert a bolt after KafkaSpout via fieldsGrouping on the ID (to ensure that a tuple that is replayed is streamed to the same bolt instance).
Within your bolt, use a HashMap<ID,Counter> that buffers all tuples and counts the number of (re-)tries. If the counter is smaller than your threshold value, forward the input tuple so it gets processed by the actual topology that follows (of course, you need to anchor the tuple appropriately). If the count is larger than your threshold, ack the tuple to break the cycle and remove its entry from the HashMap (you might also want to LOG all failed tuples).
In order to remove successfully processed tuples from the HashMap, each time a tuple is acked in KafkaSpout you need to forward the tuple ID to the bolt so that it can remove the tuple from the HashMap. Just declare a second output stream for your KafkaSpout subclass and override Spout.ack(...) (of course you need to call super.ack(...) to ensure KafkaSpout gets the ack, too).
This approach might consume a lot of memory though. As an alternative to having an entry for each tuple in the HashMap, you could also use a third stream (that is connected to the bolt like the other two), and forward a tuple ID if a tuple fails (i.e., in Spout.fail(...)). Each time the bolt receives a "fail" message from this third stream, the counter is increased. As long as no entry is in the HashMap (or the threshold is not reached), the bolt simply forwards the tuple for processing. This should reduce the memory used but requires some more logic to be implemented in your spout and bolt.
Both approaches have the disadvantage that each acked tuple results in an additional message to your newly introduced bolt (thus increasing network traffic). For the second approach, it might seem that you only need to send an "ack" message to the bolt for tuples that failed before. However, you do not know which tuples failed and which did not. If you want to get rid of this network overhead, you could introduce a second HashMap in KafkaSpout that buffers the IDs of failed messages. That way, you only send an "ack" message when a previously failed tuple was replayed successfully. Of course, this third approach makes the logic to be implemented even more complex.
Without modifying KafkaSpout to some extent, I see no solution for your problem. I personally would patch KafkaSpout or would use the third approach with a HashMap in a KafkaSpout subclass and the bolt (because it consumes little memory and does not put a lot of additional load on the network compared to the first two solutions).
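The counting logic of the first approach is independent of Storm and can be sketched in plain Java. RetryTracker is a hypothetical helper that the inserted bolt would own; it assumes tuple IDs with proper equals/hashCode and that one bolt task calls it from a single thread, which is why a plain HashMap is used.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping for a bolt that caps the number of deliveries per tuple ID. */
final class RetryTracker<K> {
    private final Map<K, Integer> attempts = new HashMap<>();
    private final int maxAttempts;

    RetryTracker(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /**
     * Records one delivery of the tuple with this ID.
     * Returns true if the tuple should be forwarded for processing,
     * false if it exceeded the limit and should be acked and logged instead.
     */
    boolean shouldForward(K tupleId) {
        int seen = attempts.merge(tupleId, 1, Integer::sum);
        if (seen > maxAttempts) {
            attempts.remove(tupleId); // break the replay cycle and free the entry
            return false;
        }
        return true;
    }

    /** Called when the spout reports (on the second stream) that the tuple was acked. */
    void onAcked(K tupleId) {
        attempts.remove(tupleId);
    }

    int tracked() {
        return attempts.size();
    }
}
```

In the bolt's execute method, a true result would lead to an anchored emit, while a false result would lead to an ack plus a log entry for the dropped tuple.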
Basically it works like this:
If you deploy topologies they should be production grade (that is, a certain level of quality is expected, and the number of failing tuples is low).
If a tuple fails, check if the tuple is actually valid.
If a tuple is valid (for example, it failed to be inserted because it's not possible to connect to an external database, or something like this), replay it.
If a tuple is malformed and can never be handled (for example, a database id which is text while the database is expecting an integer), it should be acked; you will never be able to fix such a thing or insert it into the database.
New kinds of exceptions should be logged (as well as the tuple contents itself). You should check these logs and generate the rule to validate tuples in the future. And eventually add code to correctly process them (ETL) in the future.
Don't log everything, otherwise your log files will be huge; be very selective about what you log. The contents of the log files should be useful and not a pile of rubbish.
Keep doing this, and eventually you will cover all cases.
We also faced a similar problem, where bad incoming data caused the bolt to fail infinitely.
In order to resolve this at runtime, we introduced one more bolt, which we call "DebugBolt" for reference. The spout sends the message to this bolt first, then this bolt applies the required data fix for the bad messages and emits them to the required bolt. This way one can fix the data errors on the fly.
Also, if you need to delete some messages, you can actually pass an ignoreFlag from your DebugBolt to your original Bolt and your original bolt should just send an ack to spout without processing if the ignoreFlag is True.
We simply had our bolt emit the bad tuple on an error stream and acked it. Another bolt handled the error by writing it back to a Kafka topic specifically for errors. This allows us to easily direct normal vs. error data flow through the topology.
The only case where we fail a tuple is because some required resource is offline, such as a network connection, DB, ... These are retriable errors. Anything else is directed to the error stream to be fixed or handled as is appropriate.
This all assumes of course, that you don't want to incur any data loss. If you only want to attempt a best effort and ignore after a few retries, then I would look at other options.
As far as I know, Storm doesn't provide built-in support for this.
I used the following implementation:
public class AuditMessageWriter extends BaseBolt {
    private static final long serialVersionUID = 1L;
    // Failure count per tuple, used to cap the number of replays
    Map<Object, Integer> failedTuple = new HashMap<>();

    public AuditMessageWriter() {
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // any initialization you need
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void execute(Tuple input) {
        try {
            // Write your processing logic
            collector.ack(input);
        } catch (Exception e2) {
            // On any exception, save the tuple in the failedTuple map with a count of 1.
            // If it is already in the map, increase its count and fail the tuple.
            // If the failure count reaches the limit (message reprocess limit),
            // log that, remove the tuple from the map and acknowledge it.
            log(input);
            ExceptionHandler.LogError(e2, "Message IO Exception");
        }
    }

    void log(Tuple input) {
        try {
            // Here you can pass the result to a dead-letter queue or log it
            // and then ack the tuple
        } catch (Exception e) {
            ExceptionHandler.LogError(e, "Exception while logging");
        }
    }

    @Override
    public void cleanup() {
        // No cleanup required in this case.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // To declare output fields. Not required in this case.
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}

How to handle OnComplete message with internal queuing reactive stream subscriber?

I'm using Akka-Stream 1.0 with a simple reactive stream:
A publisher sends N messages
A subscriber consumes the N messages
with
override val requestStrategy = new MaxInFlightRequestStrategy(max = 20) {
override def inFlightInternally: Int = messageBacklog.size
The publisher will close the stream after N messages (dynamically) via sending an OnComplete message.
The subscriber receives the message and goes into the canceled state right away. The problem is that the subscriber needs some time to process each message, so I usually have a backlog of messages, which can't be processed anymore once the subscriber is canceled (IMHO in ActorSubscriber.scala:195).
Processing a message means that my Subscriber offloads the work to someone else (sending content back via Spray's ChunkedMessages) and gets an ack message back as soon as a message is completed. As the Actor is canceled, the ack message is never processed and neither is the backlog.
What is recommended to let me complete the backlog?
I could 'invent' my own 'Done Marker' but that sounds very strange to me. Obviously my code works with MaxInFlightRequestStrategy and a max of 1 - as there the demand will be always only 1 - meaning I never have a backlog of messages.
After long hours of debugging and experimenting, I think I understand what was going on. Hopefully this saves other people some time:
I think I failed because of a conceptual misunderstanding of how to implement a reactive subscriber:
I was spooling messages internally in an ActorSubscriber and released those spooled messages back to the business logic at the right time via self ! SpooledMessage, which caused the Subscriber's calculations to go crazy: each spooled message was counted twice as 'received', causing the internals to ask for even more messages from upstream.
Fixing this by processing the spooled messages within the actor itself resolved that problem, and also allowed me to use OnComplete properly: as soon as this message is received, the Subscriber does not get any new messages, but I process the internal queue on my own (without using self ! ...) and thus complete the whole stream processing.
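The resulting state handling can be summarised in a small sketch. It is written in Java rather than Scala and does not use the Akka API; BacklogTracker and its methods are hypothetical names for the bookkeeping that would live inside the subscriber actor: the stream only counts as finished once upstream has completed and the internal queue has drained.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical bookkeeping for a subscriber that queues messages internally. */
final class BacklogTracker<T> {
    private final Queue<T> backlog = new ArrayDeque<>();
    private boolean upstreamCompleted;

    /** A new message from upstream joins the internal queue. */
    void onNext(T message) {
        if (upstreamCompleted) {
            throw new IllegalStateException("message after completion");
        }
        backlog.add(message);
    }

    /** Upstream is done, but queued messages still have to be worked off. */
    void onComplete() {
        upstreamCompleted = true;
    }

    /** The downstream worker acknowledged the oldest message; returns it, or null if none. */
    T onAck() {
        return backlog.poll();
    }

    /** Corresponds to the value reported as inFlightInternally. */
    int inFlight() {
        return backlog.size();
    }

    /** Only now is it safe to stop the subscriber. */
    boolean isFinished() {
        return upstreamCompleted && backlog.isEmpty();
    }
}
```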