I'm currently working on a streaming ML pipeline and need exactly-once event processing. I was interested in Flink, but I'm wondering if there is any way to alter/update the execution state from outside.
The ML algorithm state is kept by Flink and that's OK, but since I'd like to change some execution parameters at runtime, I cannot find a viable solution. Basically, an external webapp (in Go) is used to tune the parameters, and changes should be reflected in Flink for subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input, and on the other input you receive state-changing commands. The commands could most easily be ingested via a dedicated Kafka topic.
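To make the two-input idea concrete, here is a minimal, framework-free sketch in plain Python. It is not the Flink API: the class `ControlledScorer`, the `threshold` parameter, and the `("control" | "data", key, payload)` event shape are all hypothetical names chosen for illustration. It only shows the pattern of one operator holding per-key parameters that control events update and data events read.

```python
class ControlledScorer:
    """Stand-in for a two-input operator (hypothetical, not Flink code)."""

    def __init__(self, default_threshold=0.5):
        self.default_threshold = default_threshold
        self.thresholds = {}  # per-key parameter, changed only by control events

    def on_control(self, key, threshold):
        # Second input: a command changes the parameter for one key.
        self.thresholds[key] = threshold
        return []  # control events produce no output

    def on_data(self, key, value):
        # First input: normal events are evaluated with the current parameter.
        threshold = self.thresholds.get(key, self.default_threshold)
        return [(key, value, value >= threshold)]


def run(operator, merged_events):
    """Feed an interleaved sequence of control and data events to the operator."""
    out = []
    for kind, key, payload in merged_events:
        if kind == "control":
            out.extend(operator.on_control(key, payload))
        else:
            out.extend(operator.on_data(key, payload))
    return out
```

A control event only affects data events that arrive after it, and only for its own key, which matches the "changes should reflect for the subsequent events" requirement.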
In all the examples I see a simple single transformer/processor topology for Kafka. My doubt is whether we can modularise the application logic by breaking it down into multiple transformers/processors applied sequentially to a single input stream.
Please find the use case below:
The current application configuration is a single processor containing all processing tasks: filtering, validation, application logic, delaying (Kafka is too fast for the DBs), and invoking an SP/pushing downstream.
But we are now planning to decouple all these operations by breaking each task into a separate processor/transformer of KStream.
Since we are relatively new to Kafka, we are not sure of the pros and cons of this approach, especially with respect to Kafka internals like state stores, task scheduling, and the multithreading model.
Please share your expert opinions and experiences.
Please note that we do not have control over the topics; no new topic can be created for this design. The design must be feasible with the existing topic alone.
Kafka Streams allows you to split your logic into multiple processors. Internally, Kafka Streams implements a "depth-first" execution strategy. Thus, each time you call "forward", the output tuple is immediately processed by the downstream processor, and "forward" returns after downstream processing has finished. (Note that writing data into a topic and reading it back "breaks" the in-memory pipeline: when data is written to a topic, there is no guarantee when the downstream processor will read and process those records.)
If you have state that is shared between multiple processors, you need to attach the store to every processor that needs to access it. Access to the store will be single-threaded, and thus there should be no performance difference.
As long as you connect processors directly (and not via topics), all processors will be part of the same task. Thus, there shouldn't be a performance difference.
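The depth-first behaviour of "forward" can be illustrated with a small plain-Python model. This is only a sketch of the execution order, not Kafka Streams code; the `Processor` class, the `chain` helper, and the processor names are made up for the example.

```python
class Processor:
    """Toy processor node (hypothetical model, not the Kafka Streams API)."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.children = []

    def forward(self, record):
        # Depth-first: each child fully processes the record before forward() returns.
        for child in self.children:
            child.process(record)

    def process(self, record):
        self.log.append(f"{self.name}:start:{record}")
        self.forward(record)
        self.log.append(f"{self.name}:end:{record}")


def chain(names, log):
    """Connect processors directly, one after another, and return the first."""
    procs = [Processor(name, log) for name in names]
    for parent, child in zip(procs, procs[1:]):
        parent.children.append(child)
    return procs[0]
```

Running two records through `chain(["filter", "validate", "sink"], log)` shows that the first record travels all the way to the last processor before the second record is even started, all on the calling thread.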
We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka Streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to be working as designed, but I'm not sure that we are using the APIs appropriately.
Our proof of concept is to use Kafka Streams to keep a running tally of information about a program in an output topic, e.g.
"numberActive": 0,
"numberInactive": 0,
"lastLogin": "01-01-1970T00:00:00Z"
Computing the tally is easy; it is essentially executing a compare and swap (CAS) operation based on the input topic & output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformerSupplier, which explicitly writes the data to the state store using
Is this an appropriate use of the local state store? Is there another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using the Processor API (PAPI), not the Streams DSL): you are reading in one stream and calling transform() on it, with a state store associated with the operator. Since your update logic seems to depend only on the key, it is embarrassingly parallel, and the Streams library can parallelize it via key partitioning.
One thing to note: it seems you are calling "context.commit()" after every single put call, which is not a recommended pattern. commit() is a pretty heavy call that involves flushing the state store, sending a commit offset request to the Kafka broker, etc.; calling it on every record would result in very low throughput. It is recommended to call commit() only after a batch of records has been processed, or to just rely on the Streams config "commit.interval.ms" so that the Streams library calls commit() internally at each interval. Note that this will not affect your processing semantics on graceful shutdown, since on shutdown Streams will always enforce a commit() call.
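For illustration only, the key-dependent update could look roughly like the plain-Python sketch below, with a dict standing in for the state store. The event shape (`"type"`, `"ts"`), the event type names, and the use of an integer timestamp for lastLogin are assumptions made for the example; the question does not specify them.

```python
def apply_event(store, key, event):
    """Update the tally for one key; `store` is a dict standing in for the state store."""
    current = store.get(key, {"numberActive": 0, "numberInactive": 0, "lastLogin": 0})
    updated = dict(current)
    if event["type"] == "activated":          # hypothetical event type
        updated["numberActive"] += 1
    elif event["type"] == "deactivated":      # hypothetical event type
        updated["numberInactive"] += 1
    elif event["type"] == "login":            # hypothetical event type
        # Compare step: only swap in a login time that is newer than the stored one.
        updated["lastLogin"] = max(current["lastLogin"], event["ts"])
    store[key] = updated
    return updated
```

Because each update touches only the entry for its own key, records for different keys never interact, which is what makes partitioning by key safe.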
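The cost difference between committing per record and committing per batch can be sketched in plain Python. `CountingCommitter` and its `commit_every` parameter are hypothetical; the counter merely stands in for the expensive flush plus offset commit, and this is not how Kafka Streams is implemented internally.

```python
class CountingCommitter:
    """Counts how many (expensive) commits a given batching policy triggers."""

    def __init__(self, commit_every):
        self.commit_every = commit_every
        self.pending = 0   # records processed since the last commit
        self.commits = 0   # number of commits performed so far

    def record_processed(self):
        self.pending += 1
        if self.pending >= self.commit_every:
            self.commit()

    def commit(self):
        # Stand-in for flushing the store and committing offsets.
        if self.pending:
            self.commits += 1
            self.pending = 0

    def close(self):
        # Graceful shutdown: always commit whatever is still pending.
        self.commit()
```

With `commit_every=1`, 250 records cost 250 commits; with `commit_every=100` they cost 2 commits while running plus 1 on close, and nothing is left uncommitted after a graceful shutdown.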
I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using DataFrames within the listener itself, as that assumes use of the same SparkContext; however, as of 2.1.x there can be only one context per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record it, and then analyze it programmatically. Maybe that is not the best way?! The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames on the endJob event; however, a few errors are thrown (basically they refer to not being able to propagate events to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound by the limitation of having a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend this architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store events in, and have some other processing application consume them (just like Spark History Server works).
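The "keep the listener fast and hand events off" idea can be sketched without Spark at all. In the plain-Python model below, `QueueingListener`, `on_event`, and the `sink` object are hypothetical names; the sink could be a file, or a client for whatever external store is chosen. The listener callback only enqueues, and a background thread does the slow serialization and writing.

```python
import json
import queue
import threading


class QueueingListener:
    """Listener stand-in that never does I/O on the dispatcher thread."""

    def __init__(self, sink):
        self._events = queue.Queue()
        self._sink = sink  # any object with a write(str) method
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_event(self, event):
        # Called on the event dispatcher thread: enqueue and return immediately.
        self._events.put(event)

    def _drain(self):
        # Runs on the background thread: serialize and write one JSON line per event.
        while True:
            event = self._events.get()
            if event is None:  # sentinel sent by close()
                break
            self._sink.write(json.dumps(event, sort_keys=True) + "\n")

    def close(self):
        # Flush everything queued so far, then stop the worker.
        self._events.put(None)
        self._worker.join()
```

Calling `close()` at application end drains the queue before returning, so metrics from the last job are not lost.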
Does Storm support dynamic topologies? What I want is to dynamically change the topology according to user requirements while the Storm topology is running. For example, when the user wants to know the top-10 words of a stream, I use the top-10 bolt to process it; when the user wants to know something else, I use another bolt to process the stream and 'unplug' the top-10 bolt.
I know it could be done by partitioning or duplicating the stream, always running every function, and only showing the data we want, or we could shut down the stream and deploy another topology, but is there a 'hot plug-in' way to do that?
You can't dynamically change a Storm topology's structure, i.e. modify the wiring of spouts and bolts. A Storm topology's wiring is always static.
However, you could implement the needed functionality in other ways you already described. IMHO, the best, most logical way would be by running multiple topologies -- in case the data processing differs greatly. But if most of the processing is similar in both cases, just duplicate the source stream and process the data in different branches of the same topology.
It was added on STORM-561, on 03/Jun/15:
There is no built-in way to do this (switch out one bolt for another), but what you can do is write a bolt that executes arbitrary code based on the input it receives. So long as your input and output have the same structure in Storm (same tuples emitted), you could theoretically execute whatever you wanted at run time in your bolt. This is especially easy if you build your bolt in Clojure, but it's possible in essentially every language you can use with Storm.
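As a rough illustration of such a bolt, here is a plain-Python sketch that does not use the Storm API. The class `SwitchableBolt`, the `("switch", name)` control tuple, and the two handlers are all hypothetical. The wiring stays fixed; only the function applied inside the bolt changes, and every handler returns the same output shape, a list of (string, int) pairs.

```python
from collections import Counter


class SwitchableBolt:
    """Bolt stand-in whose behaviour is selected at run time by a control tuple."""

    def __init__(self, handlers, active):
        self.handlers = handlers  # name -> function(counts) -> list of (str, int)
        self.active = active
        self.counts = Counter()   # word counts shared by all handlers

    def execute(self, tup):
        kind, payload = tup
        if kind == "switch":
            if payload not in self.handlers:
                raise KeyError(f"unknown handler: {payload}")
            self.active = payload
            return []
        # Any other tuple carries a word: update state, then emit via the active handler.
        self.counts[payload] += 1
        return self.handlers[self.active](self.counts)


def top_n(n):
    def handler(counts):
        # Highest count first; ties broken alphabetically for a stable result.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return handler


def distinct_count(counts):
    return [("distinct", len(counts))]
```

A bolt created as `SwitchableBolt({"top": top_n(10), "distinct": distinct_count}, "top")` emits top-10 results until it receives `("switch", "distinct")`, after which it emits the distinct-word count instead.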
However, this probably doesn't make a lot of sense as most computations you'll want to do involve more than one bolt and lend themselves to passing differently structured tuples around. As schiavuzzi already said in their answer, you're probably better off running multiple topologies if there are multiple, independent computations you'd like to do to a stream.
For hot deployment, there is a new streaming platform from eBay.
Jetstream: https://github.com/pulsarIO/jetstream.
It has a built-in config management tool, and your config sits in MongoDB. When a user modifies a config bean, the tool publishes a notification to ZooKeeper; the corresponding JetStream applications are notified and change the config dynamically.
I am new to Flink. I have a requirement to read data from Kafka, enrich the data conditionally (if a record belongs to category X) using some API, and write it to S3.
I made a hello world Flink application with the above logic which works like a charm.
But, the API which I am using to enrich doesn't have 100% uptime SLA, so I need to design something with retry logic.
These are the options that I found:
Option 1) Make an exponential retry until I get a response from API, but this will block the queue, so I don't like this
Option 2) Use one more topic (called topic-failure) and publish the record to topic-failure if the API is down. This way it won't block the main queue. I will need one more worker to process the data from topic-failure. Again, this queue has to be used as a circular queue if the API is down for a long time. For example: read a message from topic-failure and try to enrich it; if that fails, push it back to topic-failure and consume the next message from topic-failure.
I prefer option 2, but it does not look easy to accomplish. Is there any standard Flink approach for implementing option 2?
This is a rather common problem that occurs when migrating away from microservices. The proper solution would be to have the lookup data also in Kafka or some DB that could be integrated in the same Flink application as an additional source.
If you cannot do it (for example, API is external or data cannot be mapped easily to a data storage), both approaches are viable and they have different advantages.
1) Will allow you to retain the order of input events. If your downstream application expects ordered events, then you need to retry.
2) The common term is dead letter queue (although more often used on invalid records). There are two easy ways to integrate that in Flink, either have a separate source or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source      -\               Async IO     /-> Filter good -> S3 sink
                    +-> Union -> with timeout -+
Kafka Source dead -/           (for API call!)  \-> Filter bad  -> Kafka sink dead
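The routing in this topology can be modelled in a few lines of plain Python. This is not Flink code: `run_pipeline`, `EnrichmentUnavailable`, and `max_passes` are hypothetical names, a raised exception stands in for the Async I/O timeout, and a deque stands in for the dead-letter topic. For simplicity the sketch drains the main input first and then re-reads the dead-letter queue, whereas the real topology would consume both sources concurrently.

```python
from collections import deque


class EnrichmentUnavailable(Exception):
    """Stand-in for a timed-out or failed API call."""


def run_pipeline(main_records, enrich, max_passes=3):
    """Return (sink, still_dead) after routing good and bad records."""
    dead = deque()  # plays the role of the dead-letter topic
    sink = []       # plays the role of the S3 sink

    def attempt(record):
        try:
            sink.append(enrich(record))   # "Filter good -> S3 sink"
        except EnrichmentUnavailable:
            dead.append(record)           # "Filter bad -> Kafka sink dead"

    for record in main_records:
        attempt(record)

    # Re-consume the dead-letter queue a bounded number of times.
    for _ in range(max_passes):
        for _ in range(len(dead)):
            attempt(dead.popleft())
        if not dead:
            break
    return sink, list(dead)
```

If the API fails for the first two records and then recovers, the third record reaches the sink before the first two do, which is exactly the loss of ordering that distinguishes this approach from retrying in place.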