GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example if we see an increase in system lag, we will try to see how to bring this metric down. There are few question regarding this metric.
1) What does system lag exactly means?
The maximum time that an item of data has been awaiting processing
The above is what we see in the GCP Console when we click the information icon. What does "an item of data" mean in this case? Stream processing has concepts such as windowing, event time vs. processing time, watermarks, etc. When is an item considered to be awaiting processing? For example, is it simply from the moment the message arrives, regardless of its state?
2) What is the optimum threshold for this metric?
We try to keep this metric as low as possible, but we don't have any guidance on how low it should be. For example, is there a recommendation such as keeping system lag between 20s and 30s?
3) How does system lag affect sinks?
How does system lag affect the latency of the events themselves?

Depending on the pipeline being executed, there are a number of places where elements may be queued up awaiting processing. This is typically when elements are passed between machines, such as within a GroupByKey, although the Pub/Sub source also reflects the oldest unacked element.
For a given step (sinks included) "System Lag" measures the age of the oldest element in the closest input queue to that step.
It is not unusual for there to be spikes in this measure -- elements are pulled off the queue after they are processed, so if many new elements are delivered it may take a while before the queue is back to a manageable size. What is important is that the system lag goes back down after these spikes.
The latency of a sink depends on several factors:
The rate that elements arrive in the pipeline limits the rate the input watermark advances.
The configuration of windowing and triggers affect how long the pipeline must wait before emitting a given window.
System lag is a measure of how much added delay is currently being introduced by code executing within the pipeline.
It is likely easier to look at the "Data Watermark" of the sink, which reports up to what point in (event) time the sink has processed data.
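To make the windowing/trigger point above concrete, here is a minimal Beam (Java) sketch of a streaming pipeline reading from Pub/Sub. The subscription path, the one-minute window, and the final Count step are placeholder assumptions, and runner/streaming options are omitted; this is an illustration, not a recommended configuration.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class SystemLagExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadFromPubSub",
            // Hypothetical subscription path; replace with your own.
            PubsubIO.readStrings().fromSubscription(
                "projects/my-project/subscriptions/my-subscription"))
        // Results for a window can only be emitted once the input watermark
        // passes the end of that window; system lag adds to that wait.
        .apply("WindowInto",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(AfterWatermark.pastEndOfWindow())
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
        // Grouping steps such as Count/GroupByKey are where queued elements
        // show up as "System Lag" for the downstream (sink-side) step.
        .apply("CountPerElement", Count.perElement());

    p.run();
  }
}

The AfterWatermark.pastEndOfWindow() trigger means downstream steps only see a window's result once the watermark has passed the end of that window, so any time elements spend queued ahead of the grouping step is visible both as system lag and as added sink latency.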

Related

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it just mentions the very vague:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness, but what happens if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google Pub/Sub it already has a watermark which will be used, but what if the source is something else? For example a Kafka topic (which I believe does not inherently have a watermark, so I don't see how something like this would apply).
Is it always 10 seconds, or just 0? Is it looking at the last few minutes to determine the max lag and if so how many (surely not since forever as that would get distorted by the initial start of processing which might see giant lag)? I could not find anything on the topic.
I also searched outside the context of Google DataFlow for Apache Beam documentation but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending-timestamps watermark generator will result in perfect overall watermarks. Note that no TimestampAssigner is provided in that case; the timestamps of the Kafka records themselves will be used instead.
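For reference, a minimal Flink (Java) sketch of that per-partition pattern; the topic name, broker address, and consumer group are assumptions, and the ascending-timestamps generator corresponds to WatermarkStrategy.forMonotonousTimestamps():

import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaPartitionWatermarks {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
    props.setProperty("group.id", "example-group");           // assumed consumer group

    FlinkKafkaConsumer<String> kafkaSource =
        new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props); // assumed topic

    // Attaching the WatermarkStrategy to the Kafka consumer (rather than to the
    // DataStream) makes Flink generate watermarks per Kafka partition and merge them.
    // forMonotonousTimestamps() assumes ascending timestamps within each partition;
    // with no TimestampAssigner given, the Kafka record timestamps are used.
    kafkaSource.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());

    DataStream<String> stream = env.addSource(kafkaSource);
    stream.print();

    env.execute("per-partition-watermarks");
  }
}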
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.
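As an illustrative sketch (not from the Beam docs), the scenario above can be reproduced with Beam's TestStream in Java; the element values and the exact instants are made up for the example, and with the default zero allowed lateness the late record is simply dropped.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class LateDataExample {
  public static void main(String[] args) {
    Instant start = new Instant(0L); // treat the epoch as 0:00 for the example

    TestStream<String> events =
        TestStream.create(StringUtf8Coder.of())
            // On-time element with event timestamp 3:38, inside the 0:00-4:59 window.
            .addElements(TimestampedValue.of("on-time",
                start.plus(Duration.standardSeconds(3 * 60 + 38))))
            // The watermark advances past 5:30 (end of window plus the ~30s heuristic
            // lag), so the first window is now considered complete.
            .advanceWatermarkTo(start.plus(Duration.standardSeconds(5 * 60 + 30)))
            // This element arrives afterwards but carries a 3:38 timestamp,
            // so it is late data for the first window.
            .addElements(TimestampedValue.of("late",
                start.plus(Duration.standardSeconds(3 * 60 + 38))))
            .advanceWatermarkToInfinity();

    Pipeline p = Pipeline.create();
    p.apply(events)
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            // With zero allowed lateness the "late" element is dropped; raising
            // withAllowedLateness would let it fire in a late pane instead.
            .withAllowedLateness(Duration.ZERO))
        .apply(Count.perElement());
    p.run();
  }
}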

Process Pub/Sub messages at a constant rate, using streaming and serverless

The scenario:
I have thousands of requests I need to issue each day.
I know the number at the beginning of the day, and ideally I want to send all the data about the requests to Pub/Sub, one message per request.
I want to make the requests at a constant rate. For example, if I have 172800 requests, I want to process 2 each second.
Ideally the solution would involve Pub/Sub push and Cloud Run.
Using pull with long-running instances is also an option.
Any other options are also welcome.
I want to avoid running in a loop and fetching records from a database with a limit.
This is how I am doing it today.
You can use batching and flow control settings to fine-tune Pub/Sub performance, which will help in processing messages at a constant rate.
Batching
A batch, within the context of Cloud Pub/Sub, refers to a group of one or more messages published to a topic by a publisher in a single publish request. Batching is done by default in the client library or explicitly by the user. The purpose for this feature is to allow for a higher throughput of messages while also providing a more efficient way for messages to travel through the various layers of the service(s). Adjusting the batch size (i.e. how many messages or bytes are sent in a publish request) can be used to achieve the desired level of throughput.
Features specific to batching on the publisher side include setElementCountThreshold(), setRequestByteThreshold(), and setDelayThreshold() as part of setBatchSettings() on a publisher client (the naming varies slightly in the different client libraries). These features can be used to finely tune the behavior of batching to find a better balance among cost, latency, and throughput.
Note: The maximum number of messages that can be published in a single batch is 1000 messages or 10 MB.
An example of these batching properties can be found in the Publish with batching settings documentation.
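As a minimal sketch of those publisher-side settings (the project/topic names and the thresholds are arbitrary examples, not recommendations):

import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class PublisherBatchingExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical project/topic names.
    TopicName topicName = TopicName.of("my-project", "my-topic");

    // Send a publish request once 100 messages or ~1 KB are buffered,
    // or after 1 second, whichever comes first.
    BatchingSettings batchingSettings =
        BatchingSettings.newBuilder()
            .setElementCountThreshold(100L)
            .setRequestByteThreshold(1000L)
            .setDelayThreshold(Duration.ofSeconds(1))
            .build();

    Publisher publisher =
        Publisher.newBuilder(topicName).setBatchingSettings(batchingSettings).build();
    try {
      // ... publish messages with publisher.publish(...) ...
    } finally {
      publisher.shutdown();
    }
  }
}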
Flow Control
Flow control features on the subscriber side can help control the unhealthy behavior of tasks on the pipeline by allowing the subscriber to regulate the rate at which messages are ingested. These features provide the added functionality to adjust how sensitive the service is to sudden spikes or drops of published throughput.
Some features that are helpful for adjusting flow control and other settings on the subscriber are setMaxOutstandingElementCount(), setMaxOutstandingRequestBytes(), and setMaxAckExtensionPeriod().
Examples of these settings being used can be found in the Subscribe with flow control documentation.
For more information refer to this link.
If you have long-running instances as subscribers, then you will need to set the relevant FlowControl settings, for example .setMaxOutstandingElementCount(1000L).
Once you have set it to the desired number (for example 1000), this should control the maximum number of messages the subscriber receives before pausing the message stream, as explained in the code below from this documentation:
// The subscriber will pause the message stream and stop receiving more messages from the
// server if any one of the conditions is met.
FlowControlSettings flowControlSettings =
    FlowControlSettings.newBuilder()
        // 1,000 outstanding messages. Must be >0. It controls the maximum number of messages
        // the subscriber receives before pausing the message stream.
        .setMaxOutstandingElementCount(1000L)
        // 100 MiB. Must be >0. It controls the maximum size of messages the subscriber
        // receives before pausing the message stream.
        .setMaxOutstandingRequestBytes(100L * 1024L * 1024L)
        .build();
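Continuing that snippet, the settings are typically attached to a Subscriber builder roughly as below; the subscription name and the trivial MessageReceiver are placeholders:

import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

// Hypothetical subscription name and a trivial receiver, for illustration only.
ProjectSubscriptionName subscriptionName =
    ProjectSubscriptionName.of("my-project", "my-subscription");
MessageReceiver receiver =
    (message, consumer) -> {
      // Do the actual work for one request here, then ack it.
      consumer.ack();
    };

// Attaching the flow control settings above caps how many messages this
// subscriber holds outstanding at once (1,000 messages / 100 MiB).
Subscriber subscriber =
    Subscriber.newBuilder(subscriptionName, receiver)
        .setFlowControlSettings(flowControlSettings)
        .build();
subscriber.startAsync().awaitRunning();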

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. For example, say there are three data sources pushing their respective events into Kafka topics, but for some reason the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how we can make sure that the data from all three sources is pushed in order, so that when an inference engine requests entries (say the last 100 entries) from multiple Kafka topics, the corresponding entries in each topic belong to the same time window.
I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.
There are 3 ways to define Windows in KSQL:
hopping windows, tumbling windows, and session windows. Hopping and tumbling windows are time windows, because they're defined by fixed durations that you specify. Session windows are dynamically sized based on incoming data and defined by periods of activity separated by gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;
Some suggestions -
Handle delay at the producer end -
Ensure all three producers always send data in sync to Kafka topics by using batch.size and linger.ms (see the producer config sketch after these suggestions).
e.g. if linger.ms is set to 1000, buffered messages are sent to Kafka within at most 1 second (unless batch.size is reached first).
Handle delay at the consumer end -
Any streaming engine on the consumer side (be it Kafka Streams, Spark Streaming, or Flink) provides windowing functionality to join/aggregate stream data based on keys while accounting for delayed data.
See the Flink windows documentation for guidance on how to choose the right window type.
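As a rough sketch of the producer-side suggestion above, assuming a plain Java KafkaProducer with String keys and values; the broker address, topic name, and record contents are placeholders:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerBatchingConfig {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    // linger.ms: wait up to 1000 ms to fill a batch before sending.
    props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);
    // batch.size: send earlier if 32 KB of records accumulate for a partition.
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("events-source-1", "device-42", "{\"event\":\"...\"}"));
    }
  }
}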
To handle this scenario, data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish a batch from data source with a batch Id (Guid) of some form. Consumers can then wait until the next batch id shows up marking the end of the previous batch. This approach assumes sources will not skip a batch, otherwise they will get permanently mis-aligned. There is no algorithm to detect this but you might have some fields in the data that show discontinuity and allow you to realign the data.
A weaker version of this approach is to either just wait x-seconds and assume all sources succeed in this much time or look at some form of time stamps (logical or wall clock) to detect that a source has moved on to the next time window implicitly showing completion of the last window.
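A minimal sketch of that batch-id idea, assuming each record value starts with the batch id followed by a comma; the parsing, topic, broker address, and consumer group are all assumptions to adapt to your actual format:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BatchAlignedConsumer {

  // Assumed record format: "<batchId>,<payload>"; adapt to however your producers tag batches.
  static String extractBatchId(String value) {
    return value.substring(0, value.indexOf(','));
  }

  static void handOffToInference(String batchId, List<String> batch) {
    // Placeholder for the actual alignment/inference call.
    System.out.println("batch " + batchId + " complete with " + batch.size() + " events");
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "inference-engine-1");      // assumed group
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events-source-1")); // assumed topic

      String currentBatchId = null;
      List<String> currentBatch = new ArrayList<>();

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          String batchId = extractBatchId(record.value());
          // A new batch id marks the end of the previous batch: only then is it
          // safe to hand the buffered events over for inference.
          if (currentBatchId != null && !batchId.equals(currentBatchId)) {
            handOffToInference(currentBatchId, currentBatch);
            currentBatch = new ArrayList<>();
          }
          currentBatchId = batchId;
          currentBatch.add(record.value());
        }
      }
    }
  }
}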
The following recommendations should maximize success of event synchronization for the anomaly detection problem using timeseries data.
Use a network time synchronizer on all producer/consumer nodes
Use a heartbeat message from producers every x units of time with a fixed start time, e.g. the messages are sent every two minutes, on the minute.
Build predictors for producer message delay. Use the heartbeat messages to compute this.
With these primitives, we should be able to align the timeseries events, accounting for time drifts due to network delays.
On the inference engine side, expand your windows at a per-producer level to sync up events across producers.

Why doesn't my Azure Function scale up?

For a test, I created a new function app. I added two functions: one was an HTTP trigger that, when invoked, pushed 500 messages to a queue; the other, a queue trigger to read the messages. The queue trigger function code was set up to read a message and randomly sleep from 1 to 30 seconds. This was intended to simulate longer-running tasks.
I invoked the HTTP trigger to create the messages, then watched the queue fill up and the messages get processed by the other trigger. I also wired up Application Insights to this function app, but I did not see it scale beyond 1 server.
Do Azure Functions scale up solely on the number of messages in the queue?
Also, I implemented these functions in PowerShell.
If you're running in the Azure Functions consumption plan, we monitor both the length and the throughput of your queue to determine whether additional VM resources are needed.
Note that a single function app instance can process multiple queue messages concurrently without needing to scale across multiple VMs. So if all 500 messages can be consumed relatively quickly (again, in the consumption plan), then it's possible that you won't scale at all.
The exact algorithm for scaling isn't published (it's subject to lots of tweaking), but generally speaking you can expect the system to automatically scale you out if messages are getting added to the queue faster than your functions can process them. Your app will also scale out if the latency of the first message in the queue is continuously increasing (meaning, messages are sitting idle and not getting processed). The time between VMs getting added is usually in the tens of seconds.
There are some thresholds based on queue count as well. For example, the system tries to ensure that there is at least 1 VM for every 1K queue messages, but usually the scale decisions are based on message throughput as I described earlier.
I think @Chris Gillum put it well; it's hard for us to push the limits of the server to the point where things will start to scale.
Some other options available are:
Use durable functions and scale with Threading:
https://learn.microsoft.com/en-us/azure/azure-functions/durable-functions-cloud-backup
Another method could be to use Event Hubs, which are designed for massive scale. Instead of queues, have Function #1 publish an event, and have Function #2 subscribe to it with an Event Hub trigger. Adding Stream Analytics could also be an option to more fully expand on capabilities if needed.

How can I measure the propagation latency of DynamoDB Streams?

I'm using DynamoDB Streams + Kinesis Client Library (KCL).
How can I measure latency between when an event was created in a stream and when it was processed on KCL side?
As far as I know, KCL's MillisBehindLatest metric is specific to Kinesis Streams (not DynamoDB Streams).
approximateCreationDateTime record attribute has a minute-level approximation, which is not acceptable for monitoring in sub-second latency systems.
Could you please help with some useful metrics for monitoring DynamoDB Streams latency?
You can change the way you do writes in your application so that it can track the propagation delay of mutations in the table's stream. For example, you could always update a 'last_updated=' timestamp attribute when you create and update items. That way, when your creations and updates appear in the stream, you can estimate the propagation delay by subtracting the last_updated value in the NEW_IMAGE of the stream record from the current time.
Because deletions do not have a NEW_IMAGE in stream records, your deletes would need to take place in two steps:
1) a logical deletion, where you write the 'logically_deleted=' timestamp to the item, and
2) a physical deletion, where you actually call DeleteItem immediately following step 1.
Then, you would use the same math as for creations and updates, the only differences being that you would use the OLD_IMAGE when processing deletions and that you would need to subtract at least around 10 ms to account for the time it takes to perform the logical delete (step 1).
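A small sketch of that calculation against the DynamoDB Streams record model (AWS SDK for Java v1), assuming last_updated and logically_deleted are written as epoch milliseconds; adapt the attribute names and the ~10 ms allowance to your setup:

import java.util.Map;

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.Record;

public class StreamLagEstimator {

  // Estimates propagation delay for an INSERT/MODIFY stream record, assuming the
  // application writes a 'last_updated' attribute holding epoch milliseconds.
  static long estimateDelayMillis(Record record) {
    Map<String, AttributeValue> newImage = record.getDynamodb().getNewImage();
    long lastUpdatedMillis = Long.parseLong(newImage.get("last_updated").getN());
    return System.currentTimeMillis() - lastUpdatedMillis;
  }

  // For deletes (two-step logical + physical delete as described above), read the
  // 'logically_deleted' timestamp from the OLD_IMAGE instead and subtract a small
  // allowance (~10 ms) for the time between the two writes.
  static long estimateDeleteDelayMillis(Record record) {
    Map<String, AttributeValue> oldImage = record.getDynamodb().getOldImage();
    long logicallyDeletedMillis = Long.parseLong(oldImage.get("logically_deleted").getN());
    return System.currentTimeMillis() - logicallyDeletedMillis - 10;
  }
}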