I am getting started with Akka Streams; I am trying to create a stream that reads data from a web service and then persists it to S3.
I was wondering: if I define the persistence Sink using the Sink.fold method (in order to gather information about the persisted elements), will the elements sent to the sink be processed one after another, or in parallel?
It's a basic question, but I wasn't able to find a definitive answer in the docs.
Since Sink.fold needs the result from the previous elements to combine it with the next one, it's necessarily sequential.
It's more of a Sink.foldLeft, actually.
In other words, if you have a, b as elements and you fold them using f, you need acc = f(zero, a) in order to be able to process f(acc, b). So, until the processing of a is done, b cannot be processed.
From the API doc:
A Sink that will invoke the given function for every received element, giving it its previous output (or the given zero value) and the element as input. The returned java.util.concurrent.CompletionStage will be completed with the value of the final function evaluation when the input stream ends, or completed with Failure if a failure is signaled in the stream.
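To make that concrete, here is a minimal sketch: the fold itself is strictly sequential, so any parallelism for the per-element S3 upload has to happen upstream (e.g. via mapAsync), while the fold only does the bookkeeping. The uploadToS3 stage in the comment is a hypothetical placeholder, not part of the question's code.

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("fold-demo")

// Each fold step needs the previous accumulator before the next element is processed.
val persistedCount: Future[Int] =
  Source(1 to 10)
    // .mapAsync(4)(uploadToS3)  // hypothetical parallel upload stage; the fold below stays sequential
    .runWith(Sink.fold[Int, Int](0)((acc, _) => acc + 1))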
I have an event-sourced entity (C) that needs to change its state in response to state changes in another entity of a different type (P). The logic that determines whether the state of C should actually change is quite complex, and the data needed to compute it lives in C; moreover, many instances of C should listen to one instance of P, and the set of instances increases over time, so I'd rather have them pull from a stream, knowing the ID of P, than have P keep track of the IDs of all the Cs and push to them.
I am thinking of doing something such as:
1. Tag a projection of P's events.
2. Have a Subscribe(P.id) command that gets sent to C.
3. If C is not already subscribing to a P (it can only subscribe to one, and that shouldn't change), fire an event Subscribed(P.id).
4. In response to the event, use Akka Persistence Query to materialize the stream of events tagged in 1, map them to commands, and run the stream asynchronously with a Sink that sends them to my event-sourced entity reference.
Having a stream run in the event handler seems a bit like an anti-pattern, though. I am wondering if there's a better/more supported way to do this without the upstream having to know about the downstream. I decided against Akka pub-sub because it does at-most-once delivery, and I'd like to avoid using Kafka if possible.
You definitely don't want to run the stream in the event handler: the event handler should never side effect.
Assuming that you would like a C to receive events from periods when that C was not running (including before that C had ever run), this suggests that a stream should be run for each C. Since the subscription is to one particular P, I'd seriously consider not tagging, but instead using the eventsByPersistenceId stream to get all the events of P and ignore the ones that aren't of interest. In the stream, you translate those events to commands in C's API, include the offset in P's event stream with each command, and send it to C (for at-least-once delivery, a mapAsync with an ask is useful). C persists an event recording that it processed the offset; this makes the command idempotent, since C can simply acknowledge any command whose offset is less than or equal to the high-water offset in its state.
This stream gets kicked off by the command handler after successfully persisting a Subscribed(P.id) event (in this case starting from offset 0), and it gets kicked off again after the persistent actor is rehydrated if the state shows it's subscribed (in this case starting from one plus the high-water offset).
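A rough sketch of that per-C stream, assuming Akka 2.6 typed with Cluster Sharding and the Cassandra read journal; ApplyPEvent and isOfInterest are hypothetical stand-ins for C's actual protocol and filtering logic:

import akka.Done
import akka.actor.typed.{ActorRef, ActorSystem}
import akka.cluster.sharding.typed.scaladsl.EntityRef
import akka.persistence.cassandra.query.scaladsl.CassandraReadJournal
import akka.persistence.query.PersistenceQuery
import akka.stream.scaladsl.Sink
import akka.util.Timeout
import scala.concurrent.Future
import scala.concurrent.duration._

// Hypothetical command in C's protocol: carries P's event plus its offset, and is acked with Done.
final case class ApplyPEvent(offset: Long, event: Any, replyTo: ActorRef[Done])

def isOfInterest(event: Any): Boolean = true // placeholder for C's filtering logic

def runSubscription(
    pPersistenceId: String,    // persistence id of the subscribed P
    fromSeqNr: Long,           // 0 on first subscribe, high-water offset + 1 on recovery
    c: EntityRef[ApplyPEvent]  // sharded entity ref of this C
)(implicit system: ActorSystem[_]): Future[Done] = {
  implicit val timeout: Timeout = 5.seconds

  PersistenceQuery(system)
    .readJournalFor[CassandraReadJournal](CassandraReadJournal.Identifier)
    .eventsByPersistenceId(pPersistenceId, fromSeqNr, Long.MaxValue)
    .filter(env => isOfInterest(env.event))   // drop the P events this C doesn't care about
    .mapAsync(parallelism = 1) { env =>
      // at-least-once delivery: C persists the processed offset and acks, so a redelivered
      // command with an already-seen offset can simply be acknowledged again (idempotent)
      c.ask[Done](replyTo => ApplyPEvent(env.sequenceNr, env.event, replyTo))
    }
    .runWith(Sink.ignore)
}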
The rationale for not using tagging here rests on the assumption that the number of P's events C isn't interested in is smaller than the number of tagged events coming from Ps that C isn't subscribed to (note that for most persistence plugins, the more tags there are, the more overhead there is: a tag used by only one particular entity instance is often not a good idea). If the tag in question is rarely seen, this assumption might not hold, and eventsByTag with filtering by ID could be useful.
This does of course have the downside of running discrete streams for every C. Depending on how many Cs are subscribed to a given P, the overhead may be substantial, and the streams for subscribers that are already caught up will be especially wasteful. In that scenario, responsibility for delivering commands to the subscribed Cs of a given P can be moved to an actor: the only real change is that where a C would have run the stream itself, it instead confirms its subscription by asking the actor feeding events from that P. Because this approach is a marked step up in complexity (especially around managing when Cs join and drop out of the shared "caught-up" stream), I'd tend to recommend starting with the stream-per-C approach and moving to the shared stream later; the transition isn't difficult, since from C's perspective there's no real difference between adapted commands coming from a stream it started and commands coming from a stream run by some other actor. It's also worth noting that there can be multiple shared streams: in fact, I'd tend to make them per-ActorSystem (e.g. a "node singleton" per P of interest) so as not to involve remoting.
We have a Stream of events that are being pulled from Kafka using fs2-kafka, and we finish processing when the events are newer than a given deadline or the offset is at the end of the partition (this does not matter too much for the question, but it gives a bit of context about the program).
We would ideally want to log when those conditions are met, but takeWhile and takeThrough require pure functions O => Boolean.
Our stream is:
partitionStream
  .takeWhile(shouldProcess(processingDelay, _))
  .takeThrough(!atLogEndOffset(_, endOffsets))
Obviously we could do a log.info inside shouldProcess and atLogEndOffset, but that would mean side-effecting inside a pure function, something we would prefer not to do.
What approach would be better, without calling the functions twice (once for logging and once for the condition evaluation)?
Thanks!
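One possible sketch (assuming cats-effect with a log4cats-style logger in scope; the names are illustrative): evaluate each predicate once per element inside evalMap, log there when a condition trips, and thread the Booleans along so takeWhile/takeThrough stay pure:

import cats.syntax.all._

// logger is assumed to be something like org.typelevel.log4cats.Logger[F]
partitionStream
  .evalMap { record =>
    val keep  = shouldProcess(processingDelay, record)
    val atEnd = atLogEndOffset(record, endOffsets)
    val logging =
      logger.info(s"Deadline reached for $record").whenA(!keep) *>
        logger.info(s"Reached log end offset for $record").whenA(atEnd)
    logging.as((record, keep, atEnd))           // predicates evaluated exactly once
  }
  .takeWhile { case (_, keep, _) => keep }      // pure: just reads the precomputed flag
  .takeThrough { case (_, _, atEnd) => !atEnd }
  .map { case (record, _, _) => record }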
I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character](
  "input", new ByteArraySerializer(), new CharacterSerializer())

var i = 0
while (i != 5) {
  testDriver.pipeInput(
    factory.create("input", Character(123, 12), 15 * 10000L))
  i += 1
}

val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age)) // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2)) // case class
  .count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key ?
If you use the TopologyTestDriver, caching is effectively disabled, and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- but which intermediate results you will get is not defined (i.e., non-deterministic); compare Michael Noll's answer.
For your unit test, it should actually not matter much, and you can either test for all output records (i.e., all intermediate results), or put all output records into a key-value Map and only test the last emitted record per key (if you don't care about the intermediate results).
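For example, a rough sketch of the "collect into a Map" approach with the same test-utils API the question uses; the output topic name "output" and CharacterInfosDeserializer (a counterpart to the serializer in the test) are assumptions:

import org.apache.kafka.common.serialization.LongDeserializer
import scala.collection.mutable

val lastCountPerKey = mutable.Map.empty[CharacterInfos, Long]
var record = testDriver.readOutput("output", new CharacterInfosDeserializer(), new LongDeserializer())
while (record != null) {
  lastCountPerKey(record.key()) = record.value()  // later records overwrite the intermediate results
  record = testDriver.readOutput("output", new CharacterInfosDeserializer(), new LongDeserializer())
}
assert(lastCountPerKey(CharacterInfos(123, 12)) == 5L)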
Furthermore, you could use the suppress() operator to get fine-grained control over which output messages you get. suppress(), in contrast to caching, is fully deterministic, and thus writing a unit test works well. However, note that suppress() is event-time driven, so if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
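For the non-windowed count in the question, that would look roughly like the sketch below; countTable stands for the KTable returned by count(), and untilTimeLimit rate-limits updates per key rather than emitting a single final result (and, as noted, stream time only advances when new records arrive):

import java.time.Duration
import org.apache.kafka.streams.kstream.{Printed, Suppressed}
import org.apache.kafka.streams.kstream.Suppressed.BufferConfig

countTable
  .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(30), BufferConfig.maxRecords(1000)))
  .toStream
  .print(Printed.toSysOut[CharacterInfos, Long])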
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams by default emits a new output record as soon as a new input record is received.
When you are aggregating (here: counting) the input data, the aggregation result is updated (and thus a new output record produced) as soon as new input is received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches as well as the commit.interval.ms setting. See Memory Management. However, how much reduction you will see depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: it could be 90% in the first hour of data, 76% in the second hour, etc.). That is, the reduction process is deterministic, but the resulting amount of reduction is difficult to predict from the outside.
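As a rough illustration, here are the two knobs mentioned above (the values are just examples):

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
// larger record cache: more intermediate updates are absorbed before being forwarded downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, (10 * 1024 * 1024).toString)
// longer commit interval: caches are flushed (and pending updates emitted) less frequently
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "30000")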
Note: When doing windowed aggregations (like windowed counts) you can also use the suppress() API so that the number of intermediate updates is not only reduced, but only a single output is ever produced per window. However, in your use case/code the aggregation is not windowed, so you cannot use the suppress() API.
To help you understand why the setup is this way: you must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record is received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- that's the correct result to the best of its knowledge at that point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where suppress() is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the suppress() API allows you to make a trade-off between lower latency with multiple outputs per window (the default behavior, suppression disabled) and higher latency with only a single output per window (suppression enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.
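A sketch of that windowed variant (1-hour windows; the String keys/values and topic name are assumptions):

import java.time.Duration
import org.apache.kafka.streams.kstream.{Suppressed, TimeWindows}
import org.apache.kafka.streams.kstream.Suppressed.BufferConfig
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

val builder = new StreamsBuilder()
val hourlyCounts = builder
  .stream[String, String]("events")
  .groupByKey
  .windowedBy(TimeWindows.of(Duration.ofHours(1)))
  .count()
  // emit only the final count per window; as noted above, it appears roughly 1h later
  .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
  .toStream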
I use Kafka Streams to process real-time data, and I need to do some aggregate operations on the data within a time window.
I have two questions about the aggregate operation.
How do I get the aggregated data? I need to send it to a 3rd service.
After the aggregate operation, I can't send a message to the 3rd service; the code doesn't run.
Here is my code:
stream = builder.stream("topic");
windowedKStream = stream.map(XXXXX).groupByKey().windowedBy("5mins");
ktable = windowedKStream.aggregate(()->"", new Aggregator(K,V,result));
// my data is stored in the 'result' variable, but I can't get it at the end of the 5-minute window.
// I need to send 'result' to a 3rd service, but I don't know where to temporarily store it and then how to get it.
// below is the code that calls the 3rd service, but it can't be executed (reached).
// I think it should be executed every 5 minutes when the window is over. But it isn't.
result = httpclient.execute('result');
I guess you might want to do something like:
ktable.toStream().foreach((k,v) -> httpclient.execute(v));
Each time the KTable is updated (with caching disabled), the update record will be sent downstream, and foreach will be executed with v being the current aggregation result.
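Putting it together, a hedged sketch in the Kafka Streams Scala DSL; the 5-minute window, String serdes, the toy aggregation and the httpclient stub stand in for the placeholders in your code:

import java.time.Duration
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object httpclient { def execute(payload: String): Unit = () } // stand-in for the real 3rd-party call

val builder = new StreamsBuilder()
builder
  .stream[String, String]("topic")
  .groupByKey
  .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
  .aggregate("")((_, value, agg) => agg + value)      // toy aggregation
  .toStream
  // called on every update to the window's aggregate (with caching disabled), not once per window
  .foreach((windowedKey, agg) => httpclient.execute(agg))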
My use case is to identify, in real time rather than with batch jobs, entities from which expected events have not been received after X amount of time. For example:
If we received a PaymentInitiated event at time T but didn't receive any of PaymentFailed / PaymentAborted / PaymentSucceeded by T+X, then raise a trigger saying PaymentStuck along with the details of the PaymentInitiated event.
How can I model such use cases in Apache Storm, given that the time period X rolls relative to each event rather than being a fixed interval?
Thanks,
Harish
For Storm, you would need to put all your logic into your UDF code using the low-level Java API (I doubt that Trident would be helpful). I have never worked with Samza and cannot provide any help for it (or judge which system would be the better fit for your problem).
In Storm, for example, you could assign a timestamp to each tuple in Spout.nextTuple() and buffer all tuples of incomplete payments within a Bolt, ordered by timestamp with the oldest at the head. Each time Bolt.execute() is called, you compare the timestamp of the new tuple with the head (i.e., oldest) tuple of your queue. If the input tuple has a larger timestamp than the head's timestamp plus X, you know that the head tuple has timed out and you can raise your trigger for it.
Of course, you need to use fieldsGrouping() to ensure that all tuples belonging to the same payment are processed by the same Bolt instance. You might also need to order the incoming tuples by timestamp within the Bolt, or use more advanced time-out logic, to deal with out-of-order tuples (with regard to increasing timestamps).
Depending on your latency requirements and input stream rate, you might also use "tick tuples" to trigger the comparison of the head tuple with such a dummy tick tuple. Or, as an even stricter implementation, do all this logic directly in Spout.nextTuple() (if you know that all tuples of a payment go through the same Spout instance).
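A hedged sketch of such a Bolt (Scala against the Storm 2.x Java API; the field names, event-type strings and the PaymentStuck output are assumptions, and tuples are assumed to arrive roughly in timestamp order, as discussed above):

import java.util.{Map => JMap}
import scala.collection.mutable
import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class PaymentTimeoutBolt(timeoutMs: Long) extends BaseRichBolt {
  private var collector: OutputCollector = _
  // open payments in arrival order, oldest at the head: paymentId -> initiated-at timestamp
  private val pending = mutable.LinkedHashMap.empty[String, Long]

  override def prepare(conf: JMap[String, AnyRef], ctx: TopologyContext, out: OutputCollector): Unit =
    collector = out

  override def execute(tuple: Tuple): Unit = {
    val ts: Long = tuple.getLongByField("timestamp")
    val paymentId = tuple.getStringByField("paymentId")
    tuple.getStringByField("type") match {
      case "PaymentInitiated" => pending.put(paymentId, ts)
      case _                  => pending.remove(paymentId) // a terminal event arrived in time
    }
    // anything initiated more than timeoutMs before the newest timestamp is considered stuck
    while (pending.nonEmpty && pending.head._2 + timeoutMs < ts) {
      val (stuckId, initiatedAt) = pending.head
      pending.remove(stuckId)
      collector.emit(new Values("PaymentStuck", stuckId, Long.box(initiatedAt)))
    }
    collector.ack(tuple)
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("type", "paymentId", "timestamp"))
}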