Flink StreamingEnvironment does not terminate when using Kafka as source

I am using Flink for a Streaming Application. When creating the stream from a Collection or a List, the application terminates and everything after the "env.execute" gets executed normally.
I need to use a different source for the stream. More precisely, I use Kafka as a source (env.addSource(...)). In this case, the program just blocks when it reaches the end of the stream.
I created an appropriate Deserialization Schema for my stream, having an extra event that signals the end of the stream.
I know that the isEndOfStream() condition succeeds on that point (I have an appropriate message printed on the screen in this case).
At this point the program just stops and does nothing, so the commands that follow the "execute" line aren't at my disposal.
I am using Flink 1.7.2 and the flink-connector-kafka_2.11, with Scala 2.11.12. I am executing using the IntelliJ environment and Maven.
While researching, I found a suggestion to throw an exception when reaching the end of the stream (using the Schema's capabilities). That does not serve my goal because I also have more operators/commands within the execution of the environment that need to be executed (and do get executed correctly at this point). If I chose to disrupt the program by throwing an exception, I would lose everything else.
After the execution line I use the .getNetRuntime() function to measure the running time of my operators within the stream.
I need the StreamingEnvironment to end the way it does when using a List as a source. Is there a way to detach the Kafka source at that point, for example?
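For reference, a rough sketch of the kind of setup described above, assuming the universal flink-connector-kafka consumer; the Event type, the "END" sentinel, and the topic/broker settings are placeholders for whatever the real job uses:

    import java.util.Properties

    import org.apache.flink.api.common.serialization.DeserializationSchema
    import org.apache.flink.api.common.typeinfo.TypeInformation
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    // Hypothetical event type; a payload of "END" acts as the end-of-stream sentinel.
    case class Event(payload: String)

    class BoundedEventSchema extends DeserializationSchema[Event] {
      override def deserialize(message: Array[Byte]): Event =
        Event(new String(message, "UTF-8"))

      // When this returns true, the Kafka source stops consuming; once the
      // source has finished, env.execute() can return.
      override def isEndOfStream(nextElement: Event): Boolean =
        nextElement.payload == "END"

      override def getProducedType: TypeInformation[Event] =
        createTypeInformation[Event]
    }

    object BoundedKafkaJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val props = new Properties()
        props.setProperty("bootstrap.servers", "localhost:9092")
        props.setProperty("group.id", "bounded-demo")

        env
          .addSource(new FlinkKafkaConsumer[Event]("events", new BoundedEventSchema, props))
          .map(_.payload.length)
          .print()

        // Only reached once the job has finished, i.e. after the sentinel was seen.
        val result = env.execute("bounded kafka job")
        println(s"net runtime: ${result.getNetRuntime} ms")
      }
    }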

Related

Flink: posting messages to an external API: custom sink or lambda function

We are developing a pipeline in Apache Flink (DataStream API) that needs to send its messages to an external system using API calls. Sometimes such an API call will fail; in this case our message needs some extra treatment (and/or a retry).
We had a few options for doing this:
We map() our stream through a function that does the API call and returns its result, so we can act on failures subsequently (this was my original idea, and why I did this: flink scala map with dead letter queue)
We write a custom sink function that does the same.
However, I think both options have problems:
With the map() approach I won't be able to get exactly-once (or at-most-once, which would also be fine) semantics, since Flink is free to re-execute pieces of the pipeline after recovering from a crash in order to bring the state up to date.
With the custom sink approach I can't get a stream of failed API calls for further processing: a sink is a dead end from the Flink app's point of view.
Is there a better solution for this problem?
The async i/o operator is designed for this scenario. It's a better starting point than a map.
There's also been recent work done to develop a generic async sink, see FLIP-171. This has been merged into master and will be released as part of Flink 1.15.
One of those should be your best way forward. Whatever you do, don't do blocking i/o in your user functions. That causes backpressure and often leads to performance problems and checkpoint failures.
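As a sketch of the async i/o approach, where the ApiResult type and callExternalApi are placeholders for your own message type and non-blocking client call:

    import java.util.concurrent.TimeUnit

    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

    // Keep the outcome of the call next to the original message, so failed calls
    // stay in the stream and can be routed to a retry/dead-letter branch.
    case class ApiResult(message: String, succeeded: Boolean, error: Option[String])

    class ApiCallFunction extends AsyncFunction[String, ApiResult] {
      // Placeholder for a real non-blocking HTTP client call.
      private def callExternalApi(msg: String): Future[Unit] = Future.successful(())

      override def asyncInvoke(input: String, resultFuture: ResultFuture[ApiResult]): Unit = {
        implicit val ec: ExecutionContext = ExecutionContext.global
        callExternalApi(input).onComplete {
          case Success(_)  => resultFuture.complete(Iterable(ApiResult(input, succeeded = true, None)))
          case Failure(ex) => resultFuture.complete(Iterable(ApiResult(input, succeeded = false, Some(ex.getMessage))))
        }
      }
    }

    object AsyncApiJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val messages: DataStream[String] = env.fromElements("a", "b", "c")

        // Up to 100 calls in flight, each with a 5 second timeout.
        val results = AsyncDataStream.unorderedWait(
          messages, new ApiCallFunction, 5, TimeUnit.SECONDS, 100)

        // Failures are ordinary stream elements and can be split off for retries.
        results.filter(!_.succeeded).print()

        env.execute("async api calls")
      }
    }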

How to log custom flatMap function in Flink (scala) on Kubernetes?

I'm applying a custom flatMap function to a DataStream in Flink and want to log the exceptions, that may occur in my flatMap function. The Flink job is deployed and run on Kubernetes, so I think I can't just write to some log files, and access them manually. I may only have access to the Flink manager through the web browser. So, how can I output the exception to stdout or some error/log stream, such that I can view them through the web interface?
If you change the flatMap to a ProcessFunction, then you could use a side output to send a report about each exception to whatever sink you want to connect to the side output.
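For example, a rough sketch of that idea, where the parsing logic and tag name are just placeholders:

    import org.apache.flink.streaming.api.functions.ProcessFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector

    object ExceptionSideOutput {
      // Tag identifying the error side output (name is arbitrary).
      val errorTag: OutputTag[String] = OutputTag[String]("flatmap-errors")

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val input: DataStream[String] = env.fromElements("1", "2", "oops", "4")

        val parsed = input.process(new ProcessFunction[String, Int] {
          override def processElement(value: String,
                                      ctx: ProcessFunction[String, Int]#Context,
                                      out: Collector[Int]): Unit = {
            try {
              out.collect(value.toInt) // the original flatMap logic goes here
            } catch {
              case e: Exception =>
                // Send a report to the side output instead of failing the job.
                ctx.output(errorTag, s"failed on '$value': ${e.getMessage}")
            }
          }
        })

        // Route the error stream wherever it is easiest to inspect: print() ends
        // up in the TaskManager stdout, which is visible in the Flink web UI.
        parsed.getSideOutput(errorTag).print()
        parsed.print()

        env.execute("flatMap with error side output")
      }
    }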

Force Apache Flink to execute at a given point

It is my understanding that Apache Flink does not actually run the operations you ask it to until the result of those operations is needed for something. This makes it difficult to time exactly how long each operation takes, which is exactly what I am trying to do in order to compare its efficiency to Apache Spark. Is there a way to force it to run the operations when I want it to?
When running a Flink program, one defines the topology and the operators to be executed on a cluster. One triggers the job execution by calling env.execute, where env is either an ExecutionEnvironment or a StreamExecutionEnvironment. The one exception is for batch jobs, where the API calls collect and print trigger an eager execution.
You could use the web UI to extract the runtime of the different operators. For each operator you can see when it was deployed and when it finished execution.
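To illustrate the execute-versus-eager distinction with a small hypothetical batch example (the output path is arbitrary):

    import org.apache.flink.api.scala._
    import org.apache.flink.core.fs.FileSystem.WriteMode

    object TimingSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Nothing runs here: this only builds the execution plan.
        val doubled = env.fromElements(1, 2, 3, 4, 5).map(_ * 2)
        doubled.writeAsText("/tmp/doubled", WriteMode.OVERWRITE)

        // execute() submits the job and blocks until it has finished, so the
        // returned JobExecutionResult can be used for timing the whole job.
        val result = env.execute("timed batch job")
        println(s"net runtime: ${result.getNetRuntime} ms")

        // collect() and print() are the batch-API exception: they trigger a
        // separate, eager execution on their own.
        println(env.fromElements(1, 2, 3).map(_ + 1).collect())
      }
    }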

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc.) and it works fine; I can intercept the metrics.
The question is about using DataFrames within the listener itself, as that assumes the usage of the same SparkContext; however, as of 2.1.x only one context per JVM is allowed.
Suppose I want to write some metrics to disk in JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way?! The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the jobEnd event, however a few errors get thrown (basically they refer to not being able to propagate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block the others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend that architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store the events and have some other processing application consume them (just like the Spark History Server works).
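If you do go the getOrCreate route despite that advice, a minimal sketch could look like the following; the metrics content and output path are made up:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
    import org.apache.spark.sql.SparkSession

    class MetricsListener extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        // Reuses the JVM's single SparkContext; SparkContext.getOrCreate() works
        // the same way if only the raw context is needed.
        val spark = SparkSession.builder().getOrCreate()
        import spark.implicits._

        val metrics = Seq((jobEnd.jobId, jobEnd.time)).toDF("jobId", "completionTime")
        // Note: this runs on the listener's event thread and blocks event dispatch;
        // in practice, hand the work off to another thread or an external store.
        metrics.write.mode("append").json(s"/tmp/metrics/job-${jobEnd.jobId}")
      }
    }

The listener can then be registered with spark.sparkContext.addSparkListener(new MetricsListener) or via the spark.extraListeners configuration.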

NEventStore: Sagas, Commands and not Losing Them

NEventStore: 5.1
Simple setup: WebApp (Asp.NET 4.5) == command-side
I'm searching for the "right" way to not lose commands, with an eye on sagas/process-managers that might otherwise wait endlessly for an event produced by a command that was actually never handled.
Old: Dispatchers
I initially used sync commands, but with an eye on sagas/process-managers I thought it would be safer to first store them and then fetch them through a SyncDispatcher (or AsyncDispatcher). Otherwise, that's my concern, if a saga tried to send a command and the command didn't finish due to an app-crash/power-loss/..., it would be lost and no one would know.
So I created a command-stream and appended each command to it. The IsDispatched flag showed whether that command had already been handled.
That worked.
PollingClient and Command-Stream
Now that the dispatchers are obsolete, I switched to PollingClient. What I lost is the Dispatched information.
A startup-issue arose:
I naively started polling from the current latest checkpoint going forward, but when the application restarted there was a chance that commands had been stored but not executed before the crash, and were therefore lost (that actually happened).
I just came across the idea:
store the basic outcome of commands as (non-domain-)events in another stream.
This stream would contain CommandSucceeded and CommandFailed events.
Whenever the application starts, the latest command-id or command-checkpoint-number gets extracted and used to load the commands right after that one...
Questions
Is my concern that sync command-handling brings the danger of losing a saga-generated command wrong? If yes, why?
Is this generally a good idea: one big command stream?
Is this generally a good idea: store generic command-outcome-events in a stream?
You can:
Store your command in a command queue | persistent log
Use command id (guid) as Commit Id on NEventStore
Mark your command as executed in your Command Handler | Pipeline Hook | Polling Client
NEventStore gives you idempotency on the same AggregateId (stream id) + CommitId, so if your app crashes before the command is marked as processed and you replay the command, the resulting commits are automatically discarded by NES.
Afaik NEventStore is meant to be the storage for event sourcing, i.e. storing domain objects as a stream of events. Commands and sagas have nothing to do with it. It's your service bus that should take care of durability and saga management.
Personally, I treat the event store simply as a repository detail. The application service (command handler) will dispatch the generated events, after they've been persisted.
If the app crashes and the service bus is durable (not an in-memory one), then the event/command will be handled again automatically, because the service bus should detect when a message wasn't successfully handled. Of course, your message handlers should be idempotent for that reason.