Storm UI: Difference between Execute and Process Latencies - real-time

I've been trying to document meaning of all of the Storm metrics for my current project.
During this process I've collected data from this group responses as well as github.
While some of the metrics are pretty self explanatory, I've got really confused with some of the bolt metrics.
For example, what is the difference between Process Latency and Execute Latency?
From the posts on this Google group I've gathered the following information:
List item processing latency = timestamp when ack is called - timestamp when execute is passed tuple
List item execute latency = timestamp when execute function ends - timestamp when execute is passed tuple (source: http://goo.gl/3KRAl)
and
List item Process latency is time until tuple is acked, execute latency is time spent in execute for a tuple (source: http://goo.gl/m0fTC)
Based on what I'm seeing in my storm UI, my Execute Latency is almost always larger than the Process Latency. How could that be?
Could anyone help me with exact definition of both latencies?
Thanks in advance!

Ack is called in the execute function of the executor or in nutshell will be called before execute method of a bolt ends. So, execute latency is more. This snippet will make things clear (taken from here as is):
public void execute(Tuple tuple) {
String sentence = tuple.getString(0);
for(String word: sentence.split(" ")) {
_collector.emit(tuple, new Values(word));
}
_collector.ack(tuple);
}
I hope this helps.

According to your definition, Execute latency should always be bigger than the Process latency, as ack is called inside the execute method as Abhijeet also mentioned.
If you notice a major difference, then you probably have some code after acking a tuple.

Related

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().gropuByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a rhomb-like topology that starts at single input topic and ends in the single output topic with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamp for my messages is event-time. That is, they get picked from the message body by my custom configured TimestampExtractor implementation. The actual timestamps in my messages are several years to the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is following:
~40K records with the same
key get uploaded into the input topic first
~40K updates are
coming out of the output topic, as expected
another ~40K records
with the same but different to step 1) key get uploaded into the
input topic
only ~100 updates are coming out of the output topic,
instead of expected new ~40K updates. There is nothing special to
see on those ~100 updates, their contents seems to be right, but
only for certain time windows. For other time windows there are no
updates even though the flow logic and input data should definetly
generate 40K records. In fact, when I exchange dataset in step 1)
and 3) I have exactly same situation with ~40K updates coming from
the second dataset and same number ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both, reduce() calls and aggregate() calls instead. The issue persists in both cases.
What I'm noticing else is that with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in debugger before the preceding reduce() (or aggregate()) handlers at least for the first time. I didn't expect that.
Tried both join() and leftJoin() unfortunately same result.
In debugger the second portion of the data doesn't trigger reduce() handler in the "left" flow at all, but does trigger reduce() handler in the "right" flow.
With my configuration, if the number or records in both datasets is 100 in each, the problem doesn't manifests itself, I'm getting 200 output messages as I expect. When I raise the number to 200 in each data set, I'm getting less than 400 expected messages out.
So, it seems at the moment that something like "old" windows get dropped and the new records for those old windows get ignored by the stream.
There is window retention setting that can be set, but with its default value that I use I was expecting for windows to retain their state and stay active for at least 12 hours (what exceeds the time of my unit test run significantly).
Tried to amend the left reducer with the following Window storage config:
Materialized.as(
Stores.inMemoryWindowStore(
"rollup-left-reduce",
Duration.ofDays(5 * 365),
Duration.ofHours(1), false)
)
still no difference in results.
Same issue persists even with only single "left" flow without the "right" flow and without join(). It seems that the problem is in the window retention settings of my set up. Timestamps (event-time) of my input records span 2 years. The second dataset starts from the beginning of 2 years again. this place in Kafka Streams makes sure that the second data set records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I am loading the first dataset and with that the "observed" time of my stream gets set to the maximum timestamp from from input data set.
The upload of the second dataset that starts with records with timestamps that are 2 years before the new observed time causes the stream internal to drop the messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
Stores.inMemoryWindowStore(
"rollup-left-reduce",
Duration.ofDays(5 * 365),
windowSize, false)
)
That's it, the output is as expected.

Asynchronous or synchronous pull for counting stream data in pub sub pub/sub?

I would like to count the number of messages in the last hour (last hour referring to a timestamp field in the message data).
I currently have a code that will count the messages synchronously (I am using Google Cloud Pub/Sub Synchronous pull), but I noticed it will take quite long.
My code will repeatedly poll the subscription for a predefined (I set it to 100+) number of times so that I am sure there are no more messages in the last hour that are coming in out of order.
This is not an acceptable design because it means the user has to wait for 5-10 mins for the service to count the messages when they want the metric!
Are there best practices in Pub Sub design for solving this kind of problem?
This seems like a simple problem to solve (count the number of events in the last X timeframe) so I thought there might be.
Will asynchronous design help? How would an async design work? I am not too sure about the async and Python future concept (I am using GCP Pub/Sub's Python client library).
I will try to catch the message differently. My solution is based on logging and BigQuery. The idea is to write a log, for example message received with timestamp xxxxx, to filter this log pattern and to sink the result in BigQuery.
Then, when a user ask, you simply have to request BigQuery and to count the message in the desired lap of time. You also have the advantage to change the time frame, to have an history,...
For writing this log, 2 solutions
Cheaper but not really recommended, the process which consume the message log it with it process it. However, you are dependent of an external service. And this service has 2 responsibilities: its work, and this log (for metrics). Not SOLID. Maybe it's can be the role of the publisher with a loge like this: message published at XXXX. However this imply that all the publisher or all the subscribers are on GCP.
Better is to plug a function, the cheaper (128Mb of memory) to simply handle the message and write the log.

Mongo DB reading 4 million documents

We have Scheduled jobs that runs daily,This jobs looks for matching Documents for that day and takes the document and do minimal transform and sent it a queue for downstream processing. Typically we have 4 millions Documents to be processed for a day. Our aim is to complete the processing within one hour. I am looking for suggestions on the best practices to read 4 million Documents from MongoDB quickly ?
The MongoDB Async driver is the first stop for low overhead querying. There's a good example of using the SingleResultCallback on that page:
Block<Document> printDocumentBlock = new Block<Document>() {
#Override
public void apply(final Document document) {
System.out.println(document.toJson());
}
};
SingleResultCallback<Void> callbackWhenFinished = new SingleResultCallback<Void>() {
#Override
public void onResult(final Void result, final Throwable t) {
System.out.println("Operation Finished!");
}
};
collection.find().forEach(printDocumentBlock, callbackWhenFinished);
It is a common pattern in asynchronous database drivers to allow results to be passed on for processing as soon as they are available. The use of OS-level async I/O will help with low CPU overhead. Which brings up the next problem - how to get the data out.
Without seeing the specifics of your work, you probably want to place the results into an in memory queue to be picked up by another thread at this point so the reader thread can keep reading results. An ArrayBlockingQueue is probably appropriate. put is more appropriate than add because it will block the reader thread if the worker(s) isn't able to keep up (keeping things balanced). Ideally, you don't want it to back up which is where multiple threads will be necessary. If the order of the results is important, use a single worker thread, otherwise use a ThreadPoolExecutor with the queue passed into the constructor. Using the in-memory queue does open up the possibility for data-loss if the results are being somehow discarded as they are read (i.e. if you were immediately sending off another query to delete them), and the reader process crashed.
At this point, either do the 'minimal transforms' on the worker thread(s), or serialize them in the workers and put them on a real queue (e.g. RabbitMQ, ZeroMQ). Putting them onto a real queue allows the work to be divided up amoungst multiple machines trivially, and provides optional persistence allowing recovery of work, and those queues have great clustering options for scalability. Those machines can then put the results into the queue you mentioned in the question (assuming it has the capacity).
The bottleneck in a system like that is how quickly one machine can get through a single mongo query, and how many results the final queue can handle. All the other parts (MongoDB, queues, # of worker machines) are individually scalable. By doing as little work as possible on the querying machine and pushing that work onto other machines that impact can be greatly reduced. It sounds like your destination queue is out of your control.
When trying to work out where bottlenecks are, measurements are critical. Adding metrics to your application up front will let you know which areas need improvement when things aren't going well.
That set-up can build a pretty scalable system. I've built many similar systems before. Beyond that, you'll want to investigate getting your data into something like Apache Storm.

Spark Streaming mapWithState seems to rebuild complete state periodically

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed in 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up ~160.000 entries, which grow throughout the application. The entire state is up to 3GB, which can be handled by each node in the cluster. In each batch, some data is added to a state but not deleted until the very end of the process, i.e. ~15 minutes.
While following the application UI, every 10th batch's processing time is very high compared to the other batches. See images:
The yellow fields represent the high processing time.
A more detailed Job view shows that in these batches occur at a certain point, exactly when all 20 partitions are "skipped". Or this is what the UI says.
My understanding of skipped is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the amount of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size, it just impacts the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?
Is this a bug in the mapWithState() functionality or is this intended
behaviour?
This is intended behavior. The spikes you're seeing is because your data is getting checkpointed at the end of that given batch. If you'll notice the time on the longer batches, you'll see that it happens persistently every 100 seconds. That's because the checkpoint time is constant, and is calculated per your batchDuration, which is how often you talk to your data source to read a batch multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
if (checkpointDuration == null) {
checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
}
super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
Which lines up exactly with the behavior you're seeing, since your read batch duration is every 10 seconds => 10 * 10 = 100 seconds.
This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think how you can minimize the size of the state you have to keep in memory, in order for this serialization to be as quick as possible. Additionaly, make sure that the data is spread out throughout enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo Serialization instead of the default Java serialization, that can give you a meaningful performance boost.
In addition to the accepted answer, pointing out the price of serialization related to checkpointing, there's another, less known issue which might contribute to the spikey behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
The traversal and deletion process has its price
If you process the stream of timed-out/ deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not as one might have expected, on each RDD)

Apache Storm aggregation rules for missing expected events in rolling time-period

My use-case is to identify entities from which expected events have not been received after X amount of time in real-time rather than using batch jobs. For Example:
If we have received PaymentInitiated event at time T but didn't receive either of PaymentFailed / PaymentAborted / PaymentSucedded by T+X, then raise a trigger saying PaymentStuck along with details of PaymentIntitiated event.
How can I model such use-cases in Apache Storm as it is rolling time period X on each event, rather than fixed time interval.
Thanks,
Harish
For Storm, would need to put all your logic into your UDF code using low level Java API (I doubt that Trindent is helpful). I never worked with Samza and cannot provide any help for it (or judge which system would be the better fit for your problem).
In Storm for example, you could assign a timestamp to each tuple in Spout.nextTuple(), and buffer all tuples of an incomplete payment within a Bolt in descending order of the timestamp. Each time Bolt.execute() is called, you can compare the timestamp of the new tuple with the head (ie, oldest tuple) of your queue. If the input tuple has a larger timestamep than head-T plus X, you know that your head tuple times out and you can raise your trigger for it.
Of course, you need to do fieldsGrouping() to ensure that all tuples belonging to the same payment are processed by the same Bolt instance. You might also need to somewhat order the incoming bolt tuples by timestamp or use more advance time-out logic to deal with out-of-order tuples (with regard to increasing timestamps).
Depending on you latency requirement and input stream rate you might also use "tick tuples" to trigger the comparison of the head tuple with this dummy tick tuple. Or as an ever stricter implementation, do all this logic directly in Spout.next() (if you know that all tuples of a payment go through the same Spout instance).