Late data handling | Apache Beam

Late data that arrives after the window and its .withAllowedLateness() period is dropped from the pipeline, as documented here.
I have a few questions on this behavior:
How can we handle late data that is dropped from the pipeline? Can we add a default behavior, say, logging all late data somewhere like a catch-all bucket?
Can we have a metric (Google Dataflow Metrics/Beam) that reports how many of these messages are dropped from the pipeline due to excessive lateness?

In general, we consider data late when, by the time it arrives, we prefer to drop it rather than process it any further. As far as I know, adding extra functionality to handle those messages would require substantial effort to modify the Java SDK. However, if you just want to log them, that is already done by the LateDataDroppingDoFnRunner code, which is responsible for dropping data from expired windows:
for (WindowedValue<InputT> input : concatElements) {
  BoundedWindow window = Iterables.getOnlyElement(input.getWindows());
  if (canDropDueToExpiredWindow(window)) {
    // The element is too late for this window.
    droppedDueToLateness.inc();
    WindowTracing.debug(
        "{}: Dropping element at {} for key:{}; window:{} "
            + "since too far behind inputWatermark:{}; outputWatermark:{}",
        LateDataFilter.class.getSimpleName(),
        input.getTimestamp(),
        key,
        window,
        timerInternals.currentInputWatermarkTime(),
        timerInternals.currentOutputWatermarkTime());
  }
}
Note that the log message is emitted at DEBUG level, so you might not see it by default. As explained here, to override the level in Dataflow you can use --defaultWorkerLogLevel=DEBUG or, even better, specify a particular class, such as --workerLogLevelOverrides={"org.apache.beam.sdk.util.WindowTracing":"DEBUG"}. Choosing your keys wisely can help expose information that identifies the dropped message (e.g. for data lineage).
As can be seen in the previous snippet, droppedDueToLateness is a Counter metric that is incremented each time an element is dropped: droppedDueToLateness.inc();. You can monitor it using Stackdriver with resource type dataflow_job and metric custom.googleapis.com/dataflow/droppedDueToLateness.
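If you do need a catch-all bucket for those messages rather than just logs, one workaround is to branch suspected late elements into a side output before the windowing transform drops them. The following is a minimal sketch of that idea, not an SDK feature: since a DoFn cannot observe the watermark, it compares each element's event timestamp against processing time as a heuristic lateness bound, and MyEvent, the ten-minute threshold, and the "latedata" counter namespace are all assumptions for illustration.
final TupleTag<MyEvent> onTimeTag = new TupleTag<MyEvent>() {};
final TupleTag<MyEvent> lateTag = new TupleTag<MyEvent>() {};

PCollectionTuple branches = events.apply("RouteLateData",
    ParDo.of(new DoFn<MyEvent, MyEvent>() {
        // Custom user counter, monitorable just like droppedDueToLateness.
        private final Counter lateCounter = Metrics.counter("latedata", "catchAll");

        @ProcessElement
        public void processElement(ProcessContext c) {
            // Heuristic: treat an element as late when its event time lags
            // processing time by more than the assumed bound.
            Duration lag = new Duration(c.timestamp(), Instant.now());
            if (lag.isLongerThan(Duration.standardMinutes(10))) {
                lateCounter.inc();
                c.output(lateTag, c.element()); // catch-all branch
            } else {
                c.output(c.element()); // main output, continues to windowing
            }
        }
    }).withOutputTags(onTimeTag, TupleTagList.of(lateTag)));
branches.get(lateTag) can then be written to a sink of your choice for inspection, while branches.get(onTimeTag) feeds the windowed part of the pipeline.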

Related

Use Kafka to detect changes on values

I have a streaming application that continuously takes in a stream of coordinates along with some custom metadata that also includes a bitstring. This stream is produced onto a Kafka topic using the Producer API. Now another application needs to process this stream [Streams API], store the specific bit from the bitstring, and generate alerts when this bit changes.
Below is the continuous stream of messages that need to be processed
{"device_id":"1","status_bit":"0"}
{"device_id":"2","status_bit":"1"}
{"device_id":"1","status_bit":"0"}
{"device_id":"3","status_bit":"1"}
{"device_id":"1","status_bit":"1"} // need to generate alert with change: 0->1
{"device_id":"3","status_bits":"1"}
{"device_id":"2","status_bit":"1"}
{"device_id":"3","status_bits":"0"} // need to generate alert with change 1->0
Now I would like to write these alerts to another Kafka topic, like:
{"device_id":1,"init":0,"final":1,"timestamp":"somets"}
{"device_id":3,"init":1,"final":0,"timestamp":"somets"}
I can save the current bit in a state store using something like:
streamsBuilder
    .stream("my-topic")
    .mapValues((key, value) -> value.getStatusBit())
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .reduce((oldAggValue, newMessageValue) -> newMessageValue, Materialized.as("bit-temp-store"));
but I am unable to understand how I can detect the change from the existing bit. Do I need to query the state store somehow inside the processor topology? If yes, how? If not, what else could be done?
Any suggestions/ideas that I can try (maybe something completely different from what I am thinking) are also appreciated. I am new to Kafka, and thinking in terms of event-driven streams is still eluding me.
Thanks in advance.
I am not sure this is the best approach, but for a similar task I used an intermediate entity to capture the state change. In your case it would be something like:
streamsBuilder.stream("my-topic")
    .groupByKey()
    .aggregate(DeviceState::new, new Aggregator<String, Device, DeviceState>() {
        public DeviceState apply(String key, Device newValue, DeviceState state) {
            if (!newValue.getStatusBit().equals(state.getStatusBit())) {
                state.setChanged(true);
            }
            state.setStatusBit(newValue.getStatusBit());
            state.setDeviceId(newValue.getDeviceId());
            state.setKey(key);
            return state;
        }
    }, TimeWindows.of(…) …)
    .filter((s, t) -> t.changed())
    .toStream();
In the resulting topic you will have the changes. You can also add some attributes to DeviceState to initialise it first, depending on whether you want to send an event when the first record for a device arrives, etc.
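For completeness, here is a minimal sketch of what the DeviceState entity referenced above could look like; the original answer does not define it, so the field names and the changed() accessor are assumptions:
public class DeviceState {
    private String key;
    private String deviceId;
    private String statusBit;
    private boolean changed; // set by the aggregator when the incoming bit differs

    public String getStatusBit() { return statusBit; }
    public void setStatusBit(String statusBit) { this.statusBit = statusBit; }
    public void setDeviceId(String deviceId) { this.deviceId = deviceId; }
    public void setKey(String key) { this.key = key; }
    public void setChanged(boolean changed) { this.changed = changed; }
    public boolean changed() { return changed; }
}
Note that for the very first record of a key, state.getStatusBit() is null, so the equals check marks it as changed; that is exactly where the initialisation attributes mentioned above come in. The filtered change stream can then be mapped to your alert JSON and written out with KStream.to(), e.g. .to("alerts-topic") (topic name assumed).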

How to setBusy(false) in SAPUI5 for all controls at one time

We want to make sure that all running busy indicators are stopped after a certain amount of time. How can we do that? At the moment we call setBusy(false) on each control individually.
Thanks a lot!
I think that you should change your overall approach, because it's not a good UI/UX pattern.
First of all, why do you have more than one busy control in your view? For instance, if you are loading records into a list, you set busy only the list, not the whole page. If you are submitting form data, you set busy only the form, not everything else.
Second of all, why do you say "For the moment we use setBusy(false) for each control"? You should remove the busy state after a specific event, for instance when you have finished loading the list's results or when you receive the result of a form submission.
Anyway, to solve your current issue, the best approach is to use XML binding with a temporary JSON model.
You could have a JSON model with content like this:
{
    busy: false
}
Then you bind the busy property of each control to yourJSONModel>/busy. When you need to set the controls to a busy state you can do this.getView().getModel("yourJSONModel").setProperty("/busy", true);, and when the operation has finished you can do this.getView().getModel("yourJSONModel").setProperty("/busy", false);.

Is it possible to get the state of the previous window for a given key

I have events (ProductOrderRequested, ProductColorChanged, ProductDelivered...) and I want to build a golden record of my product.
But my goal is to build the golden record step by step: each session of activity gives me an updated state of my product, and I need to store each version of the state for traceability purposes.
I have a fairly simple pipeline (code is better than words):
events
    .apply("SessionWindow", Window.<KV<String, Event>>into(Sessions.withGapDuration(gapSession))
        .triggering(<early and late data trigger>))
    .apply("GroupByKey", GroupByKey.create())
    .apply("ComputeState", ParDo.of(new StatefulFn()));
My problem is that, for a given window, I have to compute the new state based on:
The previous state (i.e. the computed state of the previous window)
The events received
I would like to avoid calling an external service to get the previous state and instead get the state of the previous window directly. Is this possible?
In Apache Beam, state is always scoped per window (also see this answer). So I can only think of re-windowing into the global window and handling the state there. In this global StatefulFn you can store and handle the prior state(s).
It would then look like this:
events
    .apply("SessionWindow", Window.<KV<String, Event>>into(Sessions.withGapDuration(gapSession))
        .triggering(<early and late data trigger>))
    .apply("GroupByKey", GroupByKey.create())
    .apply("Re-window into Global Window", Window.<KV<String, Event>>into(new GlobalWindows())
        .triggering(<early and late data trigger>))
    .apply("ComputeState", ParDo.of(new StatefulFn()));
Please also note that, as of now, Apache Beam doesn't support stateful processing for merging windows (see this issue). Therefore, your StatefulFn on a session-window basis will not work properly when your triggers emit early or late results of session windows, since the state is not merged. This is another reason to work with a non-merging window such as the global window.
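To make this concrete, here is a rough sketch of what a StatefulFn over the global window could look like, using Beam's per-key state API; ProductState and its copy()/apply() methods are hypothetical placeholders, not from the original question:
public class StatefulFn extends DoFn<KV<String, Iterable<Event>>, KV<String, ProductState>> {

    // Per-key, per-window state; in the global window it effectively lives forever.
    @StateId("priorState")
    private final StateSpec<ValueState<ProductState>> priorStateSpec = StateSpecs.value();

    @ProcessElement
    public void processElement(
            ProcessContext c,
            @StateId("priorState") ValueState<ProductState> priorState) {
        ProductState previous = priorState.read(); // null for the key's first session
        ProductState updated = merge(previous, c.element().getValue());
        priorState.write(updated); // becomes "the previous state" for the next session
        c.output(KV.of(c.element().getKey(), updated)); // emit this version for traceability
    }

    private ProductState merge(ProductState previous, Iterable<Event> sessionEvents) {
        ProductState next = (previous == null) ? new ProductState() : previous.copy();
        for (Event event : sessionEvents) {
            next.apply(event); // hypothetical per-event update of the golden record
        }
        return next;
    }
}
Since every output carries the state version produced by one session, storing each emitted element downstream gives you the per-version history the question asks for.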

Moving from a file-based tracing session to a real-time session

I need to log trace events during boot, so I configure an AutoLogger with all the required providers. But when my service/process starts, I want to switch to real-time mode so that the file doesn't explode.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I could transfer the session to real-time mode. But instead, the file I got contained events from a 15-second period instead of just 10 seconds.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, reopening, and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code alright (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting to process the events)?
I could close the session and create a new, different one, but then I think I'd miss some events. Or I could open a new session and then close the file-based one, but then I might get duplicate events.
I couldn't find any examples online of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the exception of the 'auto-closing and restarting' feature, these are really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but frankly, expecting unusual combinations to just work is generally NOT true.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all, they have IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (moving from one file to another). This certainly solves your problem of file-size growth, and is often a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).

ResearchKit: How to get pedometer data (step count specifically) from ORKOrderedTask.fitnessCheckTaskWithIdentifier result

I added the ORKOrderedTask.fitnessCheckTaskWithIdentifier task and it renders fine in the UI. But unlike other, simpler tasks containing scale/choice/date questions, I was not able to find the exact way to read the sensor data collected via ORKOrderedTask.fitnessCheckTaskWithIdentifier.
I have used the following:
private var walkingTask : ORKTask {
    return ORKOrderedTask.fitnessCheckTaskWithIdentifier("shortWalkTask", intendedUseDescription: "Take a short walk", walkDuration: 10, restDuration: 5, options: nil)
}
Upon task completion, the task view controller delegate below is hit:
//ORKTaskViewControllerDelegate
func taskViewController(taskViewController: ORKTaskViewController, didFinishWithReason reason: ORKTaskViewControllerFinishReason, error: NSError?)
Is there a way to drill down into the result object contained in the task view controller (taskViewController.result) to get the step count? Or will I have to go through HealthKit or something similar and then query the required observation? I would appreciate help from anyone who has used this task before and can provide some input on how to fetch the pedometer data (step count specifically) for the duration the task was active.
I'm using Swift.
The step count is not reflected in the result objects per se. Instead, one of the child ORKFileResult objects, generated from the pedometer recorder, will contain the pedometer records queried from CoreMotion, serialized to JSON.
However, exposing the step count on a result object sounds like a useful extension/improvement, and we should see if it generalizes to other recorders too. Please open an issue on GitHub and we will see what we can do!