Is it possible to get the state of the previous window for a given key - apache-beam

I have events (ProductOrderRequested, ProductColorChanged, ProductDelivered...) and I want to build a golden record of my product.
My goal is to build the golden record step by step: each session of activity gives me an updated state of my product, and I need to store each version of the state for traceability purposes.
I have a fairly simple pipeline (code is better than words):
events
    .apply("SessionWindow", Window.<KV<String, Event>>into(
            Sessions.withGapDuration(gapSession))
        .triggering(<early and late data trigger>))
    .apply("GroupByKey", GroupByKey.create())
    .apply("ComputeState", ParDo.of(new StatefulFn()));
My problem is that, for a given window, I have to compute the new state based on:
The previous state (i.e. the computed state of the previous window)
The events received in the current window
I would like to avoid calling an external service to get the previous state and instead get the state of the previous window directly. Is that possible?

In Apache Beam, state is always scoped per window (also see this answer), so the only approach I can think of is to re-window into the global window and handle the state there. In that global StatefulFn you can store and manage the prior state(s).
It would then look like this:
events
    .apply("SessionWindow", Window.<KV<String, Event>>into(
            Sessions.withGapDuration(gapSession))
        .triggering(<early and late data trigger>))
    .apply("GroupByKey", GroupByKey.create())
    .apply("Re-window into Global Window",
        Window.<KV<String, Iterable<Event>>>into(new GlobalWindows())
            .triggering(<early and late data trigger>))
    .apply("ComputeState", ParDo.of(new StatefulFn()));
Please also note that, as of now, Apache Beam doesn't support stateful processing for merging windows (see this issue). Therefore, your StatefulFn on a per-session-window basis will not work properly when your triggers emit early or late results for session windows, since the state is not merged. This is another reason to work with a non-merging window such as the global window.


Late data handling | Apache Beam

Late data which has missed the window and its .withAllowedLateness period is dropped from the pipeline, as documented here.
I have a few questions on this behavior:
How can we handle late data that is dropped from the pipeline? Can we add a default behavior, say, logging all late data somewhere like a catch-all bucket?
Can we have a metric (Google Dataflow Metrics/Beam) that tells us how many of these messages are dropped from the pipeline due to excessive latency?
In general, we define late data as elements that, by the time they arrive, we simply prefer to drop and do not want to process any further. As far as I know, adding extra functionality to handle those messages would require substantial effort to modify the Java SDK. However, if you just want to log them, that is done by the LateDataDroppingDoFnRunner code, which is responsible for dropping data from expired windows:
for (WindowedValue<InputT> input : concatElements) {
  BoundedWindow window = Iterables.getOnlyElement(input.getWindows());
  if (canDropDueToExpiredWindow(window)) {
    // The element is too late for this window.
    droppedDueToLateness.inc();
    WindowTracing.debug(
        "{}: Dropping element at {} for key:{}; window:{} "
            + "since too far behind inputWatermark:{}; outputWatermark:{}",
        LateDataFilter.class.getSimpleName(),
        input.getTimestamp(),
        key,
        window,
        timerInternals.currentInputWatermarkTime(),
        timerInternals.currentOutputWatermarkTime());
  }
}
Note that the log has DEBUG level so you might not see it. As explained here, to override the level in Dataflow, you can use --defaultWorkerLogLevel=DEBUG or, even better, specify a particular class such as --workerLogLevelOverrides={"org.apache.beam.sdk.util.WindowTracing":"DEBUG"}. Choosing your keys wisely can help expose information to identify the dropped message (i.e. data lineage).
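If you prefer to set this programmatically rather than on the command line, a minimal sketch using the Dataflow runner's DataflowWorkerLoggingOptions might look like the following. This assumes the Dataflow runner dependency is on the classpath; check the exact option class and method names against the Beam version you use:

import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions;
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions.Level;
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions.WorkerLogLevelOverrides;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WindowTracingLogging {
    public static void main(String[] args) {
        DataflowWorkerLoggingOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowWorkerLoggingOptions.class);
        // Raise only the WindowTracing logger to DEBUG instead of the whole worker.
        options.setWorkerLogLevelOverrides(
            new WorkerLogLevelOverrides()
                .addOverrideForName("org.apache.beam.sdk.util.WindowTracing", Level.DEBUG));
        // ... create the Pipeline with these options and run it as usual ...
    }
}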
As can be seen in the LateDataDroppingDoFnRunner snippet above, droppedDueToLateness is a Counter metric that is incremented each time we drop an element: droppedDueToLateness.inc();. You can monitor it using Stackdriver with resource type dataflow_job and metric custom.googleapis.com/dataflow/droppedDueToLateness.

flink streaming window trigger

I have a Flink stream and I am calculating a few things over a time window, say 30 seconds.
What happens is that it gives me a result aggregated over the previous windows as well.
Say for the first 30 seconds I get the result 10.
For the next thirty seconds I want a fresh result; instead I get the last window's result plus the new one,
and so on.
So my question is: how do I get a fresh result for each window?
You need to use a purging trigger. What you want is FIRE_AND_PURGE (emit and remove the window content), whereas the default Flink trigger does FIRE (emit and keep the window content).
input
    .keyBy(...)
    .timeWindow(Time.seconds(30))
    // The important part: wrap the default non-purging ProcessingTimeTrigger in a PurgingTrigger
    .trigger(PurgingTrigger.of(ProcessingTimeTrigger.create()))
    .reduce(...)
For a more in-depth explanation, have a look at Triggers and FIRE vs. FIRE_AND_PURGE.
A Trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Each WindowAssigner comes with a default Trigger. If the default trigger does not fit your needs, you can specify a custom trigger using trigger(...).
When a trigger fires, it can either FIRE or FIRE_AND_PURGE. While FIRE keeps the contents of the window, FIRE_AND_PURGE removes its content. By default, the pre-implemented triggers simply FIRE without purging the window state.
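To make the FIRE vs. FIRE_AND_PURGE difference concrete, here is a minimal, hypothetical custom trigger (the class name is mine; it is not required if you simply wrap the default trigger in PurgingTrigger.of as shown above) that fires at the end of each processing-time window and purges its contents:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Behaves like ProcessingTimeTrigger, but purges the window contents when it
// fires, so each window's result starts from a clean slate.
public class PurgingProcessingTimeTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp,
            TimeWindow window, TriggerContext ctx) {
        // Register a timer for the end of the window; do not fire yet.
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        // Emit the window result AND drop its contents.
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }
}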
The functionality you describe can be found in Tumbling Windows: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/windows.html#tumbling-windows
A bit more detail and/or code would help :)
I'm a little late to this question, but I encountered the same issue as the OP. What I found out later was a bug in my own code; FYI, my mistake could be a good reference for your problem.
// Old code (modified to be an example):
val tenSecondGrouping: DataStream[MyCustomGrouping] = userIdsStream
  .keyBy(_.somePartitionedKey)
  .window(TumblingProcessingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
  .trigger(ProcessingTimeTrigger.create())
  .aggregate(new MyCustomAggregateFunc(new MyCustomGrouping()))
The bug happened at new MyCustomGrouping: I unintentionally created a single MyCustomGrouping object and reused it in MyCustomAggregateFunc. As more tumbling windows were created, the later aggregation results grew out of control! The fix was to create a new MyCustomGrouping each time MyCustomAggregateFunc is triggered. So:
// New code, problem solved
...
  .aggregate(new MyCustomAggregateFunc(() => new MyCustomGrouping()))
  // passing in a func to create a new object per trigger

Delphi: How to restore a form's original location when monitor configuration changes?

I have a multi-form application in which a child form is positioned on the second monitor on startup, at which time its BoundsRect is saved.
When the computer's display configuration changes, Windows moves the form to the first (primary) monitor. I can catch this change with WM_DISPLAYCHANGE:
procedure WMDisplayChange(var msg: TWMDisplayChange); message WM_DISPLAYCHANGE;
What I'm interested in doing is moving the child form back to the second monitor when it reappears in the configuration (i.e. Screen.MonitorCount goes from 1 to 2), e.g.:
childForm.BoundsRect := childForm.m_WorkingBounds;
// (or)
childForm.BoundsRect := Screen.Monitors[Screen.MonitorCount-1].BoundsRect;
However, this assignment has no effect; the child form stays on monitor 0.
I've tried other approaches, such as SetWindowPos(), with no success ...
The root of your problem is that the Delphi VCL does not refresh its internal list of monitors when they actually change. You have to force that refresh yourself.
Monitors are refreshed by the TScreen.GetMonitors method, which is unfortunately a private method, so you cannot call it directly.
However, TApplication.WndProc(var Message: TMessage) processes WM_WTSSESSION_CHANGE and, upon receiving that message, calls Screen.GetMonitors; this is the most benign way to achieve your goal.
When you receive notification that the monitors have changed, just send that message to the Application:
SendMessage(Application.Handle, WM_WTSSESSION_CHANGE, 0, 0);
I tested this with the old Delphi 5, where it was easy enough to just do:
Screen.Free;
Screen := TScreen.Create(Nil);
The screen handling has changed in later versions of Delphi; however, a similar approach may work.

Moving from file-based tracing session to real time session

I need to log trace events during boot, so I configure an AutoLogger with all the required providers. But when my service/process starts, I want to switch to real-time mode so that the file doesn't explode in size.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I can transfer the session to real-time mode. But instead, the file I got contained events from a 15s period instead of just 10s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code all right (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting to process the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the exception of the 'auto-closing and restarting' feature, it is really a question about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but frankly, expecting unusual combinations to just work is generally NOT true.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all, they have IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (from one file to another). This certainly solves your problem of file size growth, and is often a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).

How can I save output from Simulink?

I'm a student learning to use MATLAB. For an assignment, I have to create a simple state machine and collect some results. I'm used to using Verilog/Modelsim, and I'd like to collect data only when the state machine's output changes, which is not necessarily every time/sample period.
Right now I have a model that looks like this:
RequestChart ----> ResponseChart ----> Unit Delay Block --> (Back to RequestChart)
| |
------------------------> Mux --> "To Workspace" Sink Block
I've tried setting the sink block to save as "Array" format, but it only saves 51 values. I've tried setting it to "Timeseries", but it saves tons of zero values.
Can someone give me some suggestions? Like I said, MATLAB is new to me, please let me know if I need to clarify my question or provide more information.
Edit: Here's a screen capture of my model:
Generally Simulink will output a sample at every integration step. If you want to only output data when a particular event occurs -- in this case only when some data changes -- then do the following:
run the output of the state machine into a Detect Change block (from the Logic and Bit Operations library)
run that signal into the trigger port of a Triggered Subsystem.
run the output of the state machine into the data port of the Triggered Subsystem.
inside the triggered subsystem, run the data signal into a To Workspace block.
Data will only be saved at the time points where the trigger occurs, i.e. when your data changes.
In your Simulink window, make sure the Relative Tolerance is small so that you can generate many more points in between your start and ending time. Click on the Simulation option at the top of the window, then click on Model Configuration Parameters.
From there, change the Relative Tolerance to something small... like 1e-10. After that, try running your simulation again. You should have a lot more points in your output array that you can now save.