Spark Streaming mapWithState seems to rebuild complete state periodically - scala

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed in 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up ~160.000 entries, which grow throughout the application. The entire state is up to 3GB, which can be handled by each node in the cluster. In each batch, some data is added to a state but not deleted until the very end of the process, i.e. ~15 minutes.
While following the application UI, every 10th batch's processing time is very high compared to the other batches. See images:
The yellow fields represent the high processing time.
A more detailed Job view shows that in these batches occur at a certain point, exactly when all 20 partitions are "skipped". Or this is what the UI says.
My understanding of skipped is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the amount of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size, it just impacts the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?

Is this a bug in the mapWithState() functionality or is this intended
behaviour?
This is intended behavior. The spikes you're seeing is because your data is getting checkpointed at the end of that given batch. If you'll notice the time on the longer batches, you'll see that it happens persistently every 100 seconds. That's because the checkpoint time is constant, and is calculated per your batchDuration, which is how often you talk to your data source to read a batch multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
if (checkpointDuration == null) {
checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
}
super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
Which lines up exactly with the behavior you're seeing, since your read batch duration is every 10 seconds => 10 * 10 = 100 seconds.
This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think how you can minimize the size of the state you have to keep in memory, in order for this serialization to be as quick as possible. Additionaly, make sure that the data is spread out throughout enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo Serialization instead of the default Java serialization, that can give you a meaningful performance boost.

In addition to the accepted answer, pointing out the price of serialization related to checkpointing, there's another, less known issue which might contribute to the spikey behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
The traversal and deletion process has its price
If you process the stream of timed-out/ deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not as one might have expected, on each RDD)

Related

Minimizing failure without impacting recovery when building processes on top of Kafka

I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum but recovery rate is quick (avoid reprocessing message because it is expensive).
I realized that if there was to be some kind of failure, like my microservice would crash, my messages would be reprocessed. So I thought to add some kind of 'checkpoint' to my process by writing the state of the transformed message to the file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, only after writing to the file is successful.
But then, upon further thinking, I realized that if there was to be a failure on the file system, I might not find my files e.g. using a cloud file service might still have a chance of failure even if the marketed rate is that of >99% availability. I might end up in an inconsistent state where I have data in my Kafka topic (which is unaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned of data reprocess, you could always follow the paradigm of sending the offsets out of kafka.
For example, in your consumer-worker reading loop:
(pseudocode)
while(...)
{
MessageAndOffset = getMsg();
//do your things
saveOffsetInQueueToDB(offset);
}
saveOffsetInQueueToDB is responsible of adding the offset to a Queue/List, or whatever. This operation is only done one the message has been correctly processed.
Periodically, when a certain number of offsets are stored, or when shutdown is captured, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA backed storing system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
If there's a crash, you could read all messages from the beginning (easiest way is just changing the group.id and setting from beginning) but discarding those whose offset is included in the database, avoiding the reprocess. For example by adding a condition in your loop (yep pseudocode again):
while(...)
{
MessageAndOffset = getMsg();
if (offset.notIncluded(offsetListFromDB))
{
//do your things
saveOffsetInQueueToDB(offset);
}
}
You could implement better performant algorithms instead a "non-included" type one, just storing the last read offsets for each partition in a HashMap and then just checking if the partition that belongs to each consumer is bigger or not than the stored one. For example, partition 0's last offset was 558 and partitions 1's 600:
//offsetMap = {[0,558],[1,600]}
while(...)
{
MessageAndOffset = getMsg();
//get partition => 0
if (offset > offsetMap.get(partition))
{
//do your things
saveOffsetInQueueToDB(offset);
}
}
This way, you guarantee that only the non-processed messages from each partition will be processed.
Regarding file system failures, that's why Kafka comes as a cluster: Fault tolerance in Kafka is done by copying the partition data to other brokers which are known as replicas.
So if you have 5 brokers, for example, you must experience a total of 5 different system failures at the same time (I guess brokers are in separate hosts) in order to lose any data. Even 4 different brokers could fail at the same time without losing any data.
All brokers save the same amount of data, same partitions. If a filesystem error occurs in one of the brokers, the others will still hold all the information:

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it just mentions the very vague:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness but what will happen if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google PubSub it already has a watermark which will get taken, but what if the source is something else? For example a Kafka topic (which I believe does not inherently have a watermark, so I don't see how something like this would apply).
Is it always 10 seconds, or just 0? Is it looking at the last few minutes to determine the max lag and if so how many (surely not since forever as that would get distorted by the initial start of processing which might see giant lag)? I could not find anything on the topic.
I also searched outside the context of Google DataFlow for Apache Beam documentation but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks. Note, that TimestampAssigner is not provided in the example, the timestamps of the Kafka records themselves will be used instead.
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().gropuByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a rhomb-like topology that starts at single input topic and ends in the single output topic with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamp for my messages is event-time. That is, they get picked from the message body by my custom configured TimestampExtractor implementation. The actual timestamps in my messages are several years to the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is following:
~40K records with the same
key get uploaded into the input topic first
~40K updates are
coming out of the output topic, as expected
another ~40K records
with the same but different to step 1) key get uploaded into the
input topic
only ~100 updates are coming out of the output topic,
instead of expected new ~40K updates. There is nothing special to
see on those ~100 updates, their contents seems to be right, but
only for certain time windows. For other time windows there are no
updates even though the flow logic and input data should definetly
generate 40K records. In fact, when I exchange dataset in step 1)
and 3) I have exactly same situation with ~40K updates coming from
the second dataset and same number ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both, reduce() calls and aggregate() calls instead. The issue persists in both cases.
What I'm noticing else is that with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in debugger before the preceding reduce() (or aggregate()) handlers at least for the first time. I didn't expect that.
Tried both join() and leftJoin() unfortunately same result.
In debugger the second portion of the data doesn't trigger reduce() handler in the "left" flow at all, but does trigger reduce() handler in the "right" flow.
With my configuration, if the number or records in both datasets is 100 in each, the problem doesn't manifests itself, I'm getting 200 output messages as I expect. When I raise the number to 200 in each data set, I'm getting less than 400 expected messages out.
So, it seems at the moment that something like "old" windows get dropped and the new records for those old windows get ignored by the stream.
There is window retention setting that can be set, but with its default value that I use I was expecting for windows to retain their state and stay active for at least 12 hours (what exceeds the time of my unit test run significantly).
Tried to amend the left reducer with the following Window storage config:
Materialized.as(
Stores.inMemoryWindowStore(
"rollup-left-reduce",
Duration.ofDays(5 * 365),
Duration.ofHours(1), false)
)
still no difference in results.
Same issue persists even with only single "left" flow without the "right" flow and without join(). It seems that the problem is in the window retention settings of my set up. Timestamps (event-time) of my input records span 2 years. The second dataset starts from the beginning of 2 years again. this place in Kafka Streams makes sure that the second data set records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I am loading the first dataset and with that the "observed" time of my stream gets set to the maximum timestamp from from input data set.
The upload of the second dataset that starts with records with timestamps that are 2 years before the new observed time causes the stream internal to drop the messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
Stores.inMemoryWindowStore(
"rollup-left-reduce",
Duration.ofDays(5 * 365),
windowSize, false)
)
That's it, the output is as expected.

GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example if we see an increase in system lag, we will try to see how to bring this metric down. There are few question regarding this metric.
1) What does system lag exactly means?
The maximum time that an item of data has been awaiting processing
Above is what we see in GCP Console when we hit information icon. What does an item of data mean in this case? Stream processing has concept of Windowing, event time vs processing time, watermark, etc. When is an item considered awaiting to be processed? For example is it simply when the message arrives regardless of its state?
2) What is the optimum threshold for this metric?
We try to keep this metric as low as possible, but we don't have any recommendation on how low we should keep it. For example do we have some recommendation such as keeping system lag between 20s to 30s is optimum.
3)How does system lag implicates sinks
How does system lag affect latency of the event itself?
Depending on the pipeline being executed there are a number of places that elements may be queued up awaiting processing. This is typically when the elements are passed between machines, such as within a GroupByKey, although the PubSub source also reflects the oldest unacked element.
For a given step (sinks included) "System Lag" measures the age of the oldest element in the closest input queue to that step.
It is not unusual for there to be spikes in this measure -- elements are pulled off the queue after they are processed, so if many new elements are delivered it may take a while before the queue is back to a manageable size. What is important is that the system lag goes back down after these spikes.
The latency of a sink depends on several factors:
The rate that elements arrive in the pipeline limits the rate the input watermark advances.
The configuration of windowing and triggers affect how long the pipeline must wait before emitting a given window.
System lag is a measure of how much added delay is currently being introduced by code executing within the pipeline.
It is likely easier to look at the "Data Watermark" of the sink, which reports up to what point in (event) time the sink has been processed.

nondeterminism.njoin: maxQueued and prefetching

Why does the njoin prefetch the data before processing? It seems like an unnecessary complication, unless it has something to do with how Processes of Processes are merged?
I have a stream that runs effects whenever a new element is generated. I'd like to keep the effects to a minimum, so whenever a njoin with, say maxOpen = 4, 4 should be the maximum number of elements generated at the same time (no element should be generated unless it can be processed immediately).
Is there a way to solve this gracefully with njoin? Right now I'm using a bounded queue of "tickets" (an element is generated only after it got a ticket).
See https://github.com/scalaz/scalaz-stream/issues/274, specifically the comment below from djspiewak.
"From a conceptual level, the problem here is the interface point between the "pull" model of Process and the "push" model that is required for any concurrent stream merging. Both wye and njoin sit at this boundary point and "cheat" by actively pulling on their source processes to fill an inbound queue, pushing results into an outbound queue pending the pull on the output process. (obviously, both wye and njoin make their inbound queues implicit via Actor) For the most part, this works extremely well and it preserves most of the properties that users care about (e.g. propagation of termination, back pressure, etc)."
The second parameter to njoined, maxQueued, bounds the amount of prefetching. If that parameter is 0, there is no limit on the queue size, and thus no limit on the prefetching. The docs for mergeN, which calls njoin explain a bit more the reasoning for this prefetching behavior. "Internally mergeN keeps small buffer that reads ahead up to n values of A where n equals to number of active source streams. That does not mean that every source process is consulted in this read-ahead cache, it just tries to be as much fair as possible when processes provide their A on almost the same speed." So it seems that the njoin is dealing with the problem of what happens when all the sources provide a value at nearly the same time, but it's trying to prevent any one of those joined streams from crowding out slower streams.