When to use windowing versus grouping by event time? (apache-beam)

Kind of related, as allowing infinitely late data would also solve my problem.
I have a pipeline connected to a Pub/Sub topic, from which data can arrive very late (days) as well as in near real time: I have to perform simple parsing and aggregation tasks on it, based on event time.
As each element holds a timestamp of the event time, my first idea when looking at the programming guide was to use windowing, setting each element's timestamp to its event time.
However, as the processing can happen a long time after the event, I would have to allow very late data, which, based on the answer to the first link I posted, seems not to be such a good idea.
An alternative solution I have come up with would be to simply group the data by event time, without marking it as a timestamp, and then calculate my aggregates based on that key. As the data for a given event time tends to arrive in the pipeline at roughly the same time, I could then just provide windows of arbitrary length based on processing time, so that my data is ingested as soon as it is available.
This got me wondering: does that mean that windowing is useless as soon as your data comes a bit later than real time? Would it be possible to start a window when the first element is received and close it a bit afterwards?
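For reference, here is a rough sketch of the first approach (event-time windowing with a generous allowed lateness) in the Beam Python SDK. The message shape, the key/value fields, the window size and the three-day lateness bound are all assumptions for illustration, not something given in the question:

```python
import json
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)

def with_event_timestamp(msg):
    record = json.loads(msg)
    # Re-timestamp the element with its event time so windowing uses it.
    yield window.TimestampedValue(record, record["event_ts"])

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    _ = (
        p
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | beam.FlatMap(with_event_timestamp)
        | beam.WindowInto(
            window.FixedWindows(60),                 # 1-minute event-time windows
            trigger=AfterWatermark(late=AfterProcessingTime(60)),
            allowed_lateness=3 * 24 * 3600,          # accept data up to 3 days late
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | beam.Map(lambda r: (r["key"], r["value"]))
        | beam.CombinePerKey(sum))
```

The alternative described in the question would instead key each record by its event time and window on processing time, trading watermark semantics for simplicity.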

Related

Aggregate design with EventSourcing and large number of events

I'd like to start an adventure with EventSourcing. As a playground I have a system that gathers data from a set of Sensors organized in Arrays. Each Sensor has a single value, like temperature. What I need from this system is
to get the current value of the Sensor readings
the last month of Sensor value history
when a Sensor value changes, I have to calculate the Array "status" and store it (also for a month)
the Array "status" can be corrected manually by the user
The number of Arrays and Sensors is growing. For each Array I have many readings per second.
Now I wanted to have the Array as an Aggregate with Sensors as its entities. In this case each Sensor reading update would bump the Array Aggregate version. That gives > 10M changes per month. In this design I can't cut off old events, and I can't imagine the time required to restore ReadModels after a year of data.
I think I could store the current state in a CRUD table and remove the current Sensor data from the Array, keeping just the definition. Then I can use a service that handles the Sensor data stream, checks the Array "status" and keeps the Array "status" as a separate Aggregate. The service would emit a "Sensor data update" event. This event would trigger the ReadModel that keeps the historical data, handling the 1-month constraint. I would not pollute the event store with Sensor reading events. In the case of the Array "status" I would be able to remove whole past "status" Aggregates from the event store. Arrays would keep only Sensor definitions, so the EventStore would stay relatively small.
But I lose the complete history. I can't restore my 1-month signal history ReadModel, and I would have to pay additional attention not to break it.
The goal is to learn how to scale EventSourcing / CQRS system. How to handle large EventStore and rebuild damaged or inflate new ReadModels within hours not days.
Does this idea fit into ES / CQRS?
(EDIT: is it OK to update a ReadModel from an event stream that doesn't come from an Aggregate?)
How do I handle a growing event store and fix broken ReadModels?
Thanks!
Does this idea fit into ES / CQRS?
One of the things that you need to be really careful about, is understanding which information is under the control of your domain model, and which belongs to something outside.
If your sensors are physical devices in the real world, broadcasting readings, then your domain model is not the authority. That sensor data is probably going to be read, validated (i.e. checked for corruption of the messages in transit) and stored. In other words, the sensor measurements are events (past), not commands (imperative). Throw them into a convenient data store.
With that in mind, you need to look carefully at whether your arrays are domain entities (reading in sensor data, and making interesting decisions) or projections (a reorganization of the streams of sensor measurements).
It may be useful to review When to avoid CQRS, by Udi Dahan. One of the things he talks about there is that, when done right, aggregates look like processes.
In short, make sure that you are applying the right tools to your problem.
That said, yes -- if you have enough events that folding them into a projection isn't easy, then it is hard. You have to look at how much budget you have to solve the problem, and start digging into more I/O efficient representations of your events, more memory efficient representations of your events, batching, etc. Trying to find different ways to partition the work among different cores.
LMAX did a pretty good job documenting the lessons they learned in processing high volume message streams; search for information about their architecture.
Aggregates with lots of events
Aggregate is a term for the write side (the C in CQRS). An Aggregate receives a command and, using its state, emits events into the event store. The Aggregate state is built from the events in the event store, so if there are a lot of events for a given aggregate, it takes time to build the state.
To speed up building the state of an aggregate, CQRS/ES frameworks use snapshots - a serialized aggregate state stored for a particular aggregate version, so you build the state not from the beginning of time but from the latest snapshot. You can store snapshots for, say, every 100 events. And don't forget to rebuild them if your projection function changes.
Frameworks such as reSolve do this for you transparently.
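A hand-rolled, in-memory sketch of that snapshotting idea (this is not the reSolve API; every name below is invented for illustration):

```python
SNAPSHOT_EVERY = 100   # assumed snapshot interval

events = {}      # aggregate_id -> list of events (stand-in for the event store)
snapshots = {}   # aggregate_id -> (version, state)

def apply_event(state, event):
    # Toy projection: count readings per sensor.
    state = dict(state)
    state[event["sensor_id"]] = state.get(event["sensor_id"], 0) + 1
    return state

def load_state(aggregate_id):
    version, state = snapshots.get(aggregate_id, (0, {}))
    # Replay only the events recorded after the latest snapshot.
    for event in events.get(aggregate_id, [])[version:]:
        state = apply_event(state, event)
        version += 1
    return version, state

def append_event(aggregate_id, event):
    version, state = load_state(aggregate_id)
    events.setdefault(aggregate_id, []).append(event)
    version, state = version + 1, apply_event(state, event)
    if version % SNAPSHOT_EVERY == 0:
        # Remember to rebuild snapshots if apply_event (the projection) changes.
        snapshots[aggregate_id] = (version, state)
```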
Your scenario
In your particular case it seems to me that your business logic is trivial, meaning you don't need an aggregate state to calculate anything or to make a decision - there is no business logic, you essentially just store events as they are generated by the sensors. So in your custom framework you can avoid building an aggregate state on the write side - just store the events as the sensor data comes in.
On the read side you would use the event stream as usual - upon receiving an event you can store it in the Read Model database with the necessary categorization or time slots.
If you don't need old data in the ReadModel, you may just skip old events during rebuilding - it should be very fast.
If you don't want to store old events in the event store, you can delete them, but this would not be real event sourcing anymore.
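As a loose illustration of that read side (not tied to any particular framework; all names below are invented), the sensor-event stream could be projected into a Read Model bucketed by time slot, with the one-month retention enforced by pruning:

```python
from collections import defaultdict
from datetime import datetime, timedelta

ONE_MONTH = timedelta(days=30)

class SensorHistoryReadModel:
    def __init__(self):
        self.current = {}                   # sensor_id -> latest value
        self.history = defaultdict(list)    # (sensor_id, hour slot) -> readings

    def apply(self, event):
        # event: {"sensor_id": ..., "value": ..., "at": datetime}
        self.current[event["sensor_id"]] = event["value"]
        slot = event["at"].replace(minute=0, second=0, microsecond=0)
        self.history[(event["sensor_id"], slot)].append(event["value"])

    def prune(self, now=None):
        # Enforce the one-month retention constraint from the question.
        cutoff = (now or datetime.utcnow()) - ONE_MONTH
        for key in [k for k in self.history if k[1] < cutoff]:
            del self.history[key]
```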

Comparing records from dstream

I have a case with Apache Spark where I'd like to analyse sensor streams. The stream consists of sensor data from a variety of sensors, but all of them push the same kind of data.
From this stream I'd like to know, for each sensor, how long a specific value stays below a certain threshold. A sensor submits records every x seconds containing a timestamp and a value. I'd like to extract the intervals during which a sensor is below the threshold, to get the duration, start time and end time of each interval and the average value over it.
I'm not sure about the proper ('Sparkish') way to extract the duration, start time and end time of every interval from all connected sensors.
The approach I currently use is a foreach loop with some state variables to uniquely mark each record if it is part of an interval for a specific sensor. Once the records have been marked, a map-reduce approach is used to extract the needed info. But I don't feel comfortable with the foreach loop because it does not fit the map-reduce approach and therefore does not scale well when work is distributed among workers.
In more general terms, I'm facing the challenge of comparing individual records within an RDD and records from different DStreams.
Does anyone recognise such a (trivial) case and know a better/more elegant approach to tackle it?
I found out that the best way to do this is by using mapWithState(). This function offers an elegant and flexible way to maintain state across values from successive DStreams.
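For reference, mapWithState() is part of the Scala/Java Spark Streaming API; below is a rough Python sketch of the same per-sensor state-tracking idea using updateStateByKey(), the closest PySpark analogue. The threshold, the input format and the socket source are all assumptions for illustration:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

THRESHOLD = 10.0

def track_interval(new_records, state):
    # state: (interval_start, value_sum, count) while below threshold, else None
    for ts, value in sorted(new_records):
        if value < THRESHOLD:
            if state is None:
                state = (ts, value, 1)            # open a new interval
            else:
                start, total, n = state
                state = (start, total + value, n + 1)
        else:
            # Interval closed here; a real job would emit (start, ts, total / n).
            state = None
    return state

sc = SparkContext(appName="sensor-intervals")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("/tmp/sensor-checkpoint")          # required for stateful ops

# Assumed input: lines of "sensor_id,timestamp,value" arriving on a socket.
records = (ssc.socketTextStream("localhost", 9999)
              .map(lambda line: line.split(","))
              .map(lambda f: (f[0], (float(f[1]), float(f[2])))))

intervals = records.updateStateByKey(track_interval)
intervals.pprint()

ssc.start()
ssc.awaitTermination()
```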

Reporting of workflow and times

I have to start moving transactional data into a reporting database, but I would like to move towards a more warehouse/data-mart design, eventually leveraging SQL Server analytics.
The thing being measured is the time between points of a workflow on a piece of work. How would you model that when the things that can happen do not have a specific order? Also, some work won't have all the actions, or might have the same action multiple times.
It makes me want to put the data into a typical relational design, with one table for the piece of work (the key) and a table that has all the actions and times. Is that wrong? The business is going to try to use Tableau for report writing and I know it can handle all kinds of sources, but again, I would like to move away from transactional storage into warehousing.
The work is the dimension and the actions and times are the facts?
Are there any other good online resources for modeling questions?
Thanks
It may seem like splitting hairs, but you don't want to measure the time between points in a workflow, you need to measure time within a point of a workflow. If you change your perspective, it can become much easier to model.
Your OLTP system will likely capture the timestamp of when the event occurred. When you convert that to OLAP, you should turn that into a start & stop time for each event. While you're at it, calculate the duration, in seconds or minutes, and the occurrence number for the event. If the task was sent to "Design" three times, you should have three design events, numbered 1,2,3.
If you want to know how much time a task spent in design, the cube will sum the duration of all three design events to present a total time. You can also add calculated measures to determine first time in and last time out.
Having the start & stop times of the task allows you to, for example, find all of the tasks that finished design in January.
If you're looking for an average above the event grain, for example the average time in design across all tasks, you'll need a new calculated measure using total time in design / # tasks (not events).
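A tiny illustrative sketch (pandas assumed, all column names invented) of why that average divides by tasks rather than events:

```python
import pandas as pd

events = pd.DataFrame({
    "task_id":      [1, 1, 1, 2],
    "state":        ["Design", "Design", "Design", "Design"],
    "occurrence":   [1, 2, 3, 1],
    "duration_min": [30, 20, 10, 40],
})

design = events[events["state"] == "Design"]

# Event-grain average (misleading here): (30 + 20 + 10 + 40) / 4 = 25
avg_per_event = design["duration_min"].mean()

# Task-grain average: total design time / number of tasks = 100 / 2 = 50
avg_per_task = design["duration_min"].sum() / design["task_id"].nunique()

print(avg_per_event, avg_per_task)  # 25.0 50.0
```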
Assuming you have more granular states, it is a good idea to define parent states for use in executive reporting. In my company, the operational teams have workflows with 60+ states, but management wanted them rolled up into five summary states. The rollup hierarchy should be part of your workflow states dimension.
Hope that helps.

NoSQL recommendation for a project (text streaming)

I'm looking for a NoSQL DB recommendation... here's what I'm working on:
I'm writing a web-based client for delivering text streams (basically, real-time captions) to a significant number of consumers. Once things are fully ramped up, there might be 100+ events happening at any given moment. Many will be small (< 10 consumers) but some of them could be quite large (10,000+ simultaneous consumers, maybe more?).
During the course of each event, text will be accumulating at a rate of anywhere from a few words per minute up to 200+ words per minute. Each consumer will be running a web client (a browser on a desktop/laptop/tablet/smartphone) which will poll periodically for any text that it hasn't already received. It will also be possible for a given user to ask for the full text of the event up to the time that they make the request. Completed events have to stick around for a while, but will be removed within about 24-36 hours of their completion.
My first thought is to use Redis, which has methods for appending to a text value in the datastore as well as built-in support for getting a substring from the end of a text value (i.e. a client could just hold the character offset of the last character it received and would pass that to the client API and that would be used to pull a substring from the event text). I am concerned though that the growth of the string containing the event text might be an unusual use of Redis and could cause me some issues.
So... is there a NoSQL DB that seems particularly well suited to this sort of application? Is there any significant reason NOT to use Redis for something like this?
An underlying open question is what to do about new clients. For example, say an event has started and someone connects a few minutes into it. Do they need everything from the beginning or just from when they connected?
If the latter, I'd recommend a messaging approach instead of appending strings to strings. One way would be to use Redis' Pub/Sub. That seems a better fit overall, especially if new connections do not need everything from the beginning. For longer-term storage, run a client that listens like any other and writes the archive entry - preferably caching locally and then uploading the transcript when the event completes (or periodically while it is in progress). I'd keep the real-time code separate from the code that serves history and archives.
Another route would be to use a sorted set, using a timestamp for the time the entry was made. The client then only keeps track of the last update and retrieves anything from that time on. The Sorted Sets documentation can be found here. This method also provides the ability to select a region of time from the transcript; with a bit of math you could even replay the event from a transcript viewpoint as if it were live. If you've got tens of thousands of clients pulling the entire transcript on each poll, that traffic adds up quickly, whereas incremental range queries keep each response small.
Another advantage of the timestamped sorted set is string encoding. When using Redis strings and GETRANGE you have to use a fixed-width encoding, because the range is specified in byte offsets, not character offsets. If you need to support, say, UTF-8, this might be a problem for you.
A third option is to append a string of text to a list. This is similar to the sorted set except that your client stores the last index (size of the list) and on each poll tries to get anything from lastIndex+1 to the end.
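A minimal sketch of the sorted-set approach with redis-py (the key names, the sequence-number prefix and the event id are my own invention):

```python
import time
import redis

r = redis.Redis()

def append_caption(event_id: str, seq: int, text: str) -> None:
    # Sorted-set members must be unique, so prefix each chunk with a sequence
    # number; the score is the time the chunk was produced.
    r.zadd(f"captions:{event_id}", {f"{seq}:{text}": time.time()})

def poll_captions(event_id: str, last_seen: float):
    # "(" makes the lower bound exclusive so the last chunk isn't re-sent.
    raw = r.zrangebyscore(f"captions:{event_id}", f"({last_seen}", "+inf",
                          withscores=True)
    new_last_seen = raw[-1][1] if raw else last_seen
    chunks = [member.decode().split(":", 1)[1] for member, _ in raw]
    return chunks, new_last_seen

# Usage: a brand-new client passes last_seen=0 (i.e. "everything so far"),
# then passes back the returned cursor on each subsequent poll.
text_chunks, cursor = poll_captions("event-42", 0)
```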

How should I implement "get objects changed since" pattern with MongoDB?

I have a collection of objects, let's say they are "posts," and those objects can be modified. I'd like to display a list on the client side that updates dynamically. So on the client side, if doing this via polling, the client would invoke an API like:
getPostsChangedSince(serial)
where serial could be a monotonically increasing number, probably a timestamp. The client gets back a list of posts that have changed since that time, stores a new latest-serial, and next time the client polls it requests changes since that latest serial.
I think the basic idea is the same in this question (which is about ASP.NET): How to implement "get latests changed items" with ADO.NET Data Services?
I'm trying to find the best way to implement this in MongoDB.
I like the idea of using the time for the serial, since it automatically works at least mostly correctly even if there are multiple app servers. The serial would be stored in each post object, and updated whenever the object is modified.
The timestamp-based serial could be implemented as:
a Date (I think this is stored as a 64-bit milliseconds since epoch?)
a Timestamp http://www.mongodb.org/display/DOCS/Timestamp+Data+Type
something "by hand" e.g. store milliseconds as a number
Some nice features to have in a solution would include:
ensure that creating then immediately updating an object within the OS timer resolution will still increment the serial despite it being the same time
even better would be a guaranteed monotonic increase globally for all objects, not just a guarantee that changing a given object will bump the serial on that object (absent this, getPostsChangedSince() calls probably need to fuzz backward in time to avoid missing changes - at the price of getting some changes twice)
mongodb-side timestamps might be nice because getting the time in the app creates a gap between when you get the time, and when the new object is saved and available in queries
update using findAndModify() with a query including the old serial, so "conflicts" (two changes at once) will throw an error allowing the app to retry
I realize some of the corner cases here are a little bit "academic" and can likely be fudged around in real life.
My approach so far is:
use the Date type for the serial
when modifying an object, get the current time, and if it matches the object's old serial, add 1 millisecond (yes this breaks if you make two modifications quickly without re-fetching from mongodb, but that seems OK)
use findAndModify(), but based on https://jira.mongodb.org/browse/JAVA-276 there may not be a way to detect if it ends up not finding anything to modify (i.e. second change is ignored, in case of conflict)
Questions:
I feel like I should use Timestamp instead; true? Any downsides?
if you had a mongo cluster, might time in milliseconds be more unique and correct than Timestamp's time in seconds plus a number, while with one mongod Timestamp is more unique?
is there a way to detect whether findAndModify() updated anything?
any general advice / experiences with this problem? how would you do it?
Have you considered "externalizing" the serial number generator? Time with MongoDB precision is good, but it can become difficult to synchronize when multiple machines are involved. One choice is to use memcached or something similar that is memory-based, extremely fast and can serialize updates (memcached has a CAS operation).
So what you would do is store a "seed" in memcached with a key, say counter.
Every time an app needs to do an insert, it gets the next number from memcached and increments the counter.
On second thoughts, you can even do away with memcached and just use a single-row (sorry, single-document) collection that holds only the counter. You can get and increment the counter in one extremely fast operation, mimicking memcached.
And then, naturally, you can index the data appropriately. However, I wonder whether this would result in the index becoming very imbalanced (right-side heavy). Depending upon the situation, it might be worthwhile exploring the use of a capped collection: when you insert data into your main collection, also insert it into the capped collection and read data from that collection.
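A rough pymongo sketch of that single-document counter (collection and field names are made up; find_one_and_update is the driver-level form of findAndModify):

```python
from pymongo import MongoClient, ReturnDocument

db = MongoClient()["blog"]

def next_serial() -> int:
    # Atomically increment and return the counter; upsert creates it on first use.
    doc = db.counters.find_one_and_update(
        {"_id": "post_serial"},
        {"$inc": {"value": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["value"]

def save_post(post_id, fields: dict) -> None:
    db.posts.update_one({"_id": post_id},
                        {"$set": {**fields, "serial": next_serial()}},
                        upsert=True)

def posts_changed_since(serial: int):
    return list(db.posts.find({"serial": {"$gt": serial}}).sort("serial", 1))
```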
You could continue to use your regular collection, as you do now, and after each update additionally insert the ID of the post into a special TTL collection. See http://docs.mongodb.org/manual/tutorial/expire-data/ for more info on using such a collection. Mongo will take care of all timing issues, you don't need to worry about serial numbers, and you can very quickly access time based lists of objects by their IDs.
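A minimal sketch of that TTL-collection route, assuming pymongo; the collection names and the expiry window are assumptions:

```python
import datetime
from pymongo import MongoClient

db = MongoClient()["blog"]

# Documents expire roughly one day after their "updatedAt" time.
db.recent_changes.create_index("updatedAt", expireAfterSeconds=24 * 3600)

def record_change(post_id) -> None:
    # Call this alongside every post update.
    db.recent_changes.insert_one({
        "postId": post_id,
        "updatedAt": datetime.datetime.utcnow(),
    })

def changed_since(ts: datetime.datetime):
    cursor = db.recent_changes.find({"updatedAt": {"$gt": ts}}).sort("updatedAt", 1)
    return [doc["postId"] for doc in cursor]
```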
Caveat:
use the blocking form of findAndModify, to ensure the changes have really been processed:
Blocking/Safe Writes
Unless you specify the "new" parameter as true the write operation will not block, and will not return an error (if there is one). If you do want the "new" document returned then the operation will wait until the write is done to return the new document, or an error.
For a "safe" (blocking) write operation you must call getLastError (if not using "new").