Scheduling job with Apache NiFi by passing dynamic property values - scheduler

I have created a NiFi workflow as shown below:
GenerateFlowFile --> Custom Processor --> LogAttribute
My custom processor has a property as start date. But the start date should change in each scheduled run based on the maximum end date from previous run. Basically looking for incremental data fetch from the server.
Could you please help, how this can be achieved in Apache NiFi?

Processor scheduling is usually left to the data flow manager configuring the processor into their flow. I recommend you let them schedule the processor, expecting it to run on a periodic basis.
But you can use Apache NiFi's State Manager feature to store data that tracks your incremental progress. You could then decide what action to take, if any, when the processor is triggered. If there is nothing to do, don't do anything.
The best examples of this are List* processors like ListFile. These processors typically store a timestamp of the file they last read, the use that timestamp to determine which newer files should be acted on, regardless of how frequently they are asked to check. It is likely that most executions of a List* processor will result in no output.
There are some examples of reading and persisting state data in the AbstractListProcessor class.

Related

Backfill Beam pipeline with historical data

I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub.
In order to bring the pipeline state up to date, and to create the correct outputs, there is a significant amount of historical event data which must be processed first. This historical data is available via JDBC. In testing, I am able to use the JdbcIO.Read PTransform to read and handle all historical state, but I'd like to initialize my production pipeline using this JDBC event data, and then cleanly transition to reading events from Pub/Sub. This same process may happen again in the future if the pipeline logic is ever altered in a backward incompatible way.
Note that while this historical read is happening, new events are continuing to arrive into Pub/Sub (and these end up in the database also), so there should be a clean cutover from only historical events read from JDBC, and only newer events read from Pub/Sub.
Some approaches I have considered:
Have a pipeline that reads from both inputs, but filters data from JDBC before a certain timestamp, and from pub/sub after a certain timestamp. Once the pipeline is caught up deploy an update removing the JDBC input.
I don't think this will work because removal of an I/O transform is not backward compatible. Alternately, the JDBC part of the pipeline must stay there forever, burning CPU cycles for no good reason.
Write a one-time job that populates pub/sub with the entirety of the historical data, and then starts the main pipeline reading only from pub/sub.
This seems to use more pub/sub resources than necessary, AND I think newer data interleaved in the pipeline with much older data will cause watermarks to be advanced too early.
Variation of option #2 -- stop creating new events until the historical data is handled, to avoid messing up watermarks.
This requires downtime.
It seems like it would be a common requirement to backfill historical data into a pipeline, but I haven't been able to find a good approach to this.
Your first option, reading from a Bounded source (filtered to timestamp <= cutoff) and PubSub (filtered to timestamp > cutoff) should work well.
Because JDBC.Read() is a bounded source, it will be read all the data and then "finish" i.e. never produce any more data, advance its watermark to +infinity, and not be invoked again (so there's no concern about it consuming CPU cycles, even if it's present in your graph).

Event sourcing - why a dedicated event store?

I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes, where there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage. Should I store my events in a dedicated store like RDBMS to feed into Kafka or should I feed them straight into Kafka?
Much of the literature on event-sourcing and cqrs comes from the [domain driven design] community; in its earliest form, CQRS was called DDDD... Distributed domain driven design.
One of the common patterns in domain driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you have risk of concurrent writes, and data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem by using transactions, or compare and swap semantics; and attempt to extend the stream with new events is rejected if there has been a concurrent modification.
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kakfa literature, it isn't happy about fine grained streams either (although its been a while since I checked, perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.
Kafka can be used as a DDD event store, but there are some complications if you do so due to the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka can't do either of these currently, since 1 fails since you generally need to have one stream per aggregate type (it doesn't scale to one stream per aggregate, and this wouldn't necessarily be desirable anyway), so there's no way to load just the events for one aggregate, and 2 fails since https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such as way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write them to
streams. Have a command stream per aggregate type, sharded by
aggregate id (these don't need permanent retention). This ensures that you only ever process a single
command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as the snapshotting failing). This can be handled during startup of a particular command processing partition - it simply needs to replay any events since the last snapshot succeeded, and update the corresponding snapshots before resuming command processing.
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).
there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually, people use state-based persistence with relational or document databases. When applying events-based persistence, we need to store new events as one transaction to the event store in a way that we can retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by the aggregate id and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier, and where all events are stored in order, so the stream represents a single aggregate.
Because we rarely can live with a database that only allows us to retrieve a single entity by its id, we need to have some place where we can project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side and models there are called read-models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite, read-models serve the purpose to represent the system state in a way that can be directly consumed by the UI/API and often it doesn't match with the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate state by id, by reading all events for that aggregate. It already requires for the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to have the ability to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore it can be done using internal projections, you can also do it with Kafka streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although it is of course discussable. I would look at the possibility to change the way how events are changing the entity state (fixing a bug, for example) and when you use events to reconstruct the entity state, you will be just fine, snapshots will stay the same and you'll need to apply correction events to fix all the snapshots.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model. In fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic to Kafka streams builder, I would be immediately coupled and from my point of view it is not the best solution.
Theoretically you can use Kafka for Event Store but as many people mentioned above that you will have several restrictions, biggest of those, only able to read event with the offset in the Kafka but no other criteria.
For this reason they are Frameworks there dealing with the Event Sourcing and CQRS part of the problem.
Kafka is only part of the toolchain which provides you the capability of replaying events and back pressure mechanism that are protecting you from overload.
If you want to see how all fits together, I have a blog about it

Is Event Sourcing applicable for batch inputs?

I have a use case where the inputs to the application comes in batches of XML files. For example, a nightly batch of bank transactions. I am trying to see if I can use event sourcing to create a log of events. Based on what I read so far, the examples seems to be based on user driven input (click streams, updates from a user interface etc.,). Is event sourcing using a distributed log mechanism(like Kafka) a valid approach for batch/file based inputs?
Below is the approach I would like to take:
Accept input as a batch in file/xml
Run some basic validations in the memory.
Convert the batch input into a series of events
Write the event log to a Kafka topic(s).
Use the event log to store the data into the database, send the events
to a search engine, update caches, run spark jobs to do aggregations
etc.,
Repeat the process for other incoming batches.
If this approach is not efficient, what other options are available for distributed processing of such inputs?
Are your inputs coming from something that looks like an event storage? I.e. a database that acts as an immutable source of truth, of append only events.
If that is the case, you have the foundation to use event sourcing, and additionally CQRS. (They're not the same thing)
What you would have to realize is that the so called write side / command side... has already been done for you.
The incoming batch of XML files with transactions... each transaction is an event already. It doesn't sound like you need to convert these to events, to then put these into Kafka. You can just map these to something you can put into Kafka, and then all subscribers of the topics can do stuff accordingly.
Effectively you would be implementing the read side of Event Sourcing + CQRS.
In practical terms, unless you are going to be doing things on the write side (where the xml files are generated / where user input is received)... I wouldn't worry too much about the subtleties of event sourcing as it relates to DDD and CQRS. I would simply think of what you're doing as a way to distribute your data to multiple services.
And make sure to consider how caches, search engines, etc. will only be updated whenever you get those XML files.
If each individual event in these xml files has a timestamp then you can think of the output to Kafka as just a steam of late arriving events. Kafka allows you to set the event time on these messages to be the timestamp of the event rather than the time it was ingested to Kafka. In that way, any downstream processing apps like Kafka Streams can put the event into the right temporal context and aggregate into the proper time windows or session windows or even join with other realtime inputs

Apache Flink: changing state parameters at runtime from outside

i'm currently working on a streaming ML pipeline and need exactly once event processing. I was interested by Flink but i'm wondering if there is any way to alter/update the execution state from outside.
The ml algorithm state is kept by Flink and that's ok, but considering that i'd like to change some execution parameters at runtime, i cannot find a viable solution. Basically an external webapp (in GO) is used to tune the parameters and changes should reflect in Flink for the subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input and on the other input you receive state changing commands. This could be easiest ingested via a dedicated Kafka topic.

Is CEP what I need (system state and event replaying)

I'm looking for a CEP engine, but I' don't know if any engine meets my requirements.
My system has to process multiple streams of event data and generate complex events and this is exactly what almost any CEP engine perfectly fits (ESPER, Drools).
I store all raw events in database (it's not CEP part, but I do this) and use rules (or continious queries or something) to generate custom actions on complex events. But some of my rules are dependent on the events in the past.
For instance: I could have a sensor sending event everytime my spouse is coming or leaving home and if both my car and the car of my fancy woman are near the house, I get SMS 'Dangerous'.
The problem is that with restart of event processing service I lose all information on the state of the system (is my wife at home?) and to restore it I need to replay events for unknow period of time. The system state can depend not only on raw events, but on complex events as well.
The same problem arises when I need some report on complex events in the past. I have raw events data stored in database, and could generate these complex events replaying raw events, but I don't know for which exactly period I have to replay them.
At the same time it's clear that for the most rules it's possible to find automatically the number of events to be processed from the past (or period of time to load events to be processed) to restore system state.
If given action depends on presence of my wife at home, CEP system has to request last status change. If report on complex events is requested and complex event depends on average price within the previous period, all price change events for this period should be replayed. And so on...
If I miss something?
The RuleCore CEP Server might solve your problems if I remember correctly. It does not lose state if you restart it and it contains a virtual logical clock so that you can replay events using any notion of time.
I'm not sure if your question is whether current CEP products offer joining historical data with live events, but if that's what you need, Esper allows you to pull data from JDBC sources (which connects your historical data with your live events) and reflect them in your EPL statements. I guess you already checked the Esper website, if not, you'll see that Esper has excellent documentation with lots of cookbook examples
But even if you model your historical events after your live events, that does not solve your problem with choosing the correct timeframe, and as you wrote, this timeframe is use case dependent.
As previous people mentioned, I don't think your problem is really an engine problem, but more of a use case one. All engines I am familiar with, including Drools Fusion and Esper can join incoming events with historical data and/or state data queried on demand from an external source (like a database). It seems to me that what you need to do is persist state (or "timestamp check-points") when a relevant change happens and re-load the state on re-starts instead of replaying events for an unknown time frame.
Alternatively, if using Drools, you can inspect existing rules (kind of reflection on your rules/queries) to figure out which types of events your rules need and backtrack your event log until a point in time where all requirements are met and load/replay your events from there using the session clock.
Finally, you can use a cluster to reduce the restarts, but that does not solve the problem you describe.
Hope it helps.