I was wondering how Spring XD deals with processors in a stream. What I would really like to know is whether processors are blocking code, or whether they relate to how Reactor (https://github.com/reactor/reactor/wiki/Processor) deals with processors.
If I need to execute expensive blocking operations (e.g. calling an outside system), what is the best way of doing it? I'd love to use Reactor or any other reactive framework for that, but how can I do so within the XD pipeline architecture?
Regards
The term processor in a Spring XD stream has a specific meaning: it is basically a Spring Integration message flow from a channel named input to a channel named output. These channels, by convention, are how the module consumes and produces the payload in an XD stream. For example, if a stream mystream is defined as someSource | someProcessor | someSink, the processor module may execute an expensive operation asynchronously, but the stream still has to wait for a message on the processor's output channel, so you won't see improved throughput.
However, there are some situations in which implementing a sink to run asynchronously would help. In this case the stream will not block when the message arrives on the sink's input channel. The async sink (kind of has a ring to it) could be attached to a tap on a stream:
mystream = someSource | ... | someSink
mytap = tap:stream:mystream > asyncSink
or a named queue (or topic):
mystream = someSource | ... | > queue:myQueue
queue:myQueue > asyncSink
or it could be the sink for the primary stream.
Implementing an asynchronous sink requires configuring a Spring Integration endpoint, e.g., a ServiceActivator that calls an external service, with a poller and a task executor within the sink module. The endpoint polls a pollable channel (the input channel itself might be declared as a queue channel, for example). See http://docs.spring.io/spring-integration/reference/html/messaging-endpoints-chapter.html for details.
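For illustration, here is a minimal sketch of what such a sink module's wiring might look like using Spring Integration Java config (the bean names, poller settings, and service method are made up for the example; an actual XD module would typically use the equivalent XML configuration):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.channel.QueueChannel;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.scheduling.PollerMetadata;
import org.springframework.messaging.MessageChannel;
import org.springframework.scheduling.support.PeriodicTrigger;

@Configuration
@EnableIntegration
public class AsyncSinkConfig {

    // The sink's input channel is a queue channel, so the stream thread only
    // enqueues the message and returns immediately instead of blocking.
    @Bean
    public MessageChannel input() {
        return new QueueChannel(1000);
    }

    // A poller with its own task executor drains the queue asynchronously.
    @Bean(name = PollerMetadata.DEFAULT_POLLER)
    public PollerMetadata defaultPoller() {
        PollerMetadata poller = new PollerMetadata();
        poller.setTrigger(new PeriodicTrigger(100));
        poller.setTaskExecutor(new SimpleAsyncTaskExecutor());
        return poller;
    }

    // The service activator performs the expensive external call.
    @ServiceActivator(inputChannel = "input")
    public void callExternalSystem(Object payload) {
        // invoke the remote system here
    }
}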
We are developing a pipeline in Apache Flink (DataStream API) that needs to send its messages to an external system using API calls. Sometimes such an API call will fail; in this case our message needs some extra treatment (and/or a retry).
We had a few options for doing this:
We map() our stream through a function that does the API call and returns its result, so we can act upon failures subsequently (this was my original idea, and why I did it this way: flink scala map with dead letter queue).
We write a custom sink function that does the same.
However, both options have problems, I think:
With the map() approach I won't be able to get exactly-once (or at-most-once, which would also be fine) semantics, since Flink is free to re-execute pieces of the pipeline after recovering from a crash in order to bring the state up to date.
With the custom sink approach I can't get a stream of failed API calls for further processing: a sink is a dead end from the Flink app's point of view.
Is there a better solution for this problem?
The async i/o operator is designed for this scenario. It's a better starting point than a map.
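For reference, a minimal sketch of how the operator could be wired in (the class name, the external call, and the timeout/capacity values are illustrative, not from the original question):

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichExample {

    // async function that performs the API call without blocking the task thread
    public static class EnrichViaApi extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> callExternalApi(input)) // hypothetical API call
                .whenComplete((result, error) -> {
                    if (error == null) {
                        resultFuture.complete(Collections.singleton(result));
                    } else {
                        resultFuture.completeExceptionally(error);
                    }
                });
        }

        private String callExternalApi(String input) {
            return input; // placeholder for the real call
        }
    }

    // wiring it into the pipeline with a timeout and a capacity limit
    public static DataStream<String> enrich(DataStream<String> input) {
        return AsyncDataStream.unorderedWait(
                input, new EnrichViaApi(), 5, TimeUnit.SECONDS, 100);
    }
}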
There's also been recent work done to develop a generic async sink, see FLIP-171. This has been merged into master and will be released as part of Flink 1.15.
One of those should be your best way forward. Whatever you do, don't do blocking i/o in your user functions. That causes backpressure and often leads to performance problems and checkpoint failures.
Is there a way to do typical batch processing with Vert.x - like providing a file or DB query as input and letting each record be processed by a verticle in a non-blocking way?
In the verticle examples, a server is defined at startup. And even though multiple verticles are deployed, the server is created only once. This means that the Vert.x engine has a built-in concept of a server and knows how to dispatch incoming requests to each verticle for processing.
The same happens with the event bus as well.
But is there a way to define a verticle with a handler for processing data from a general stream - a query, a file, etc.?
I am particularly interested in spreading data processing over cluster nodes.
One way I can think of is to execute a query the regular way and then publish the data to the event bus for processing. But that means that if I have to process a few million records, I will run out of memory. Of course I could do paging, etc. - but then there is no coordination between retrieving and processing the data.
Thanks
Andrius
If you are using the JDBC Client, you can stream the query result:
(using vertx-rx-java2)
JDBCClient client = ...;

// query parameters (note: a JsonArray, not a JsonObject)
JsonArray params = new JsonArray().add(dataCategory);

client.rxQueryStreamWithParams("SELECT * FROM data WHERE data.category = ?", params)
    .flatMapObservable(SQLRowStream::toObservable)
    .subscribe(
        (JsonArray row) -> vertx.eventBus().send("data.process", row)
    );
This way each row is sent to the event bus. If you then have multiple verticle instances that each listen on this address, you spread the data processing across multiple threads.
If you are using another SQL client, have a look at its documentation - maybe it has a similar method.
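For example, the consuming side might look roughly like this (the verticle class name and instance count are illustrative; the "data.process" address matches the snippet above):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.json.JsonArray;

public class DataProcessorVerticle extends AbstractVerticle {
    @Override
    public void start() {
        // send() delivers each row to exactly one of the registered consumers,
        // so deploying several instances spreads the rows across them
        vertx.eventBus().<JsonArray>consumer("data.process", message -> {
            JsonArray row = message.body();
            // process the row here (use executeBlocking for heavy work)
        });
    }
}

// deployed with multiple instances, e.g.:
// vertx.deployVerticle("com.example.DataProcessorVerticle",
//         new DeploymentOptions().setInstances(4));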
I've been working with Kafka Streams a bit and got some basic functionality working, but I'm having some trouble wrapping my head around the process method in the Kafka Streams DSL. Specifically:
I know there are two ways of using Kafka Streams: the lower-level Processor API and the higher-level Streams DSL. With the lower level you define your topology more explicitly, naming each node and such, while the Streams DSL abstracts most of that away.
However, the higher level Streams DSL has a method called process() which is a terminal operation (i.e. the method returns void). So my question is, where does the processed data - the data that the Processor's void forward(key, value) method sends - go?
In the lower-level Processor API, you name your processor node and can link sinks to that name, but in the Streams DSL there is no name, or at least none that I can find.
It is possible but "clumsy" to forward() data from a Processor in the DSL. As you stated, the process() method is defined as terminal operation and thus, you should not call forward(). If you call forward() anyway and there is no downstream processor assigned, forward() is basically a no-op.
However, instead of trying add a downstream processor for process() (what would be a hack), you should use transform() (and related methods, transformValues(), flatTransform(), and flatTransformValues()) instead ff you want to send data down stream.
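A minimal sketch of what that looks like (the topic names and the transformation itself are made up for the example):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class TransformExample {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                   private ProcessorContext context;

                   @Override
                   public void init(ProcessorContext context) {
                       this.context = context;
                   }

                   @Override
                   public KeyValue<String, String> transform(String key, String value) {
                       // whatever is returned here (or forwarded via context.forward())
                       // continues downstream, unlike with process()
                       return KeyValue.pair(key, value.toUpperCase());
                   }

                   @Override
                   public void close() { }
               })
               .to("output-topic");
        return builder.build();
    }
}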
In all the examples I see a simple single-transformer/processor topology for Kafka. My question is whether we can modularise the application logic by breaking it down into multiple transformers/processors applied sequentially to a single input stream.
Please find the use case below:
The current application configuration is a single processor containing all the processing logic: tasks like filtering, validation, application logic, delaying (Kafka is too fast for the DBs), and invoking a stored procedure / pushing downstream.
But we are now planning to decouple all these operations by breaking each task down into separate KStream processors/transformers.
Since we are relatively new to Kafka, we are not sure of the pros and cons of this approach, especially with respect to Kafka internals like state stores, task scheduling, and the multithreading model.
Please share your expert opinions and experiences.
Please note that we do not have control over the topic; no new topic can be created for this design. The design must be feasible with the existing topic alone.
Kafka Streams allows you to split your logic into multiple processors. Internally, Kafka Streams implements a "depth-first" execution strategy. Thus, each time you call forward() the output record is immediately processed by the downstream processor, and forward() returns only after downstream processing has finished. (Note that writing data into a topic and reading it back "breaks" the in-memory pipeline - when data is written to a topic, there is no guarantee when a downstream processor will read and process those records.)
If you have state that is shared between multiple processors, you would need to attach the store to all processors that need access to it. Execution on the store will be single-threaded and thus there should be no performance difference.
As long as you connect processors directly (and not via topics), all processors will be part of the same task. Thus, there shouldn't be a performance difference.
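As an illustration, a topology split into separate steps that share one state store might look roughly like this (the topic name, store name, and per-step logic are placeholders):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.Stores;

public class ModularTopologyExample {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // one shared store, attached by name to every step that needs it
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("shared-store"),
                Serdes.String(), Serdes.String()));

        builder.<String, String>stream("existing-topic")
               // step 1: filtering
               .filter((key, value) -> value != null)
               // step 2: validation/enrichment as its own transformer, with store access
               .transformValues(() -> new ValueTransformer<String, String>() {
                   @Override
                   public void init(ProcessorContext context) { }
                   @Override
                   public String transform(String value) {
                       return value.trim(); // placeholder logic
                   }
                   @Override
                   public void close() { }
               }, "shared-store")
               // step 3: terminal step, e.g. invoke the stored procedure / push downstream
               .foreach((key, value) -> { /* call the external system here */ });

        return builder.build();
    }
}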
I am new to Flink. I have a requirement to read data from Kafka, enrich the data conditionally (if a record belongs to category X) by using some API, and write it to S3.
I made a hello world Flink application with the above logic which works like a charm.
But, the API which I am using to enrich doesn't have 100% uptime SLA, so I need to design something with retry logic.
Following are the options that I found:
Option 1) Retry with exponential backoff until I get a response from the API, but this will block the queue, so I don't like this approach.
Option 2) Use one more topic (called topic-failure) and publish to it if the API is down. This way it won't block the actual main queue. I will need one more worker to process the data from topic-failure. Again, this queue has to be used as a circular queue if the API is down for a long time: for example, read a message from topic-failure, try to enrich it, and if that fails push it back to topic-failure and consume the next message.
I prefer option 2, but it does not look like an easy task to accomplish. Is there any standard Flink approach available to implement option 2?
This is a rather common problem that occurs when migrating away from microservices. The proper solution would be to have the lookup data also in Kafka or some DB that could be integrated in the same Flink application as an additional source.
If you cannot do that (for example, the API is external or the data cannot easily be mapped to a data store), both approaches are viable and they have different advantages.
1) This will allow you to retain the order of the input events. If your downstream application expects ordering, then you need to retry.
2) The common term is dead letter queue (although it is more often used for invalid records). There are two easy ways to integrate it in Flink: either have a separate source, or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source ------\                                     /-> Filter good -> S3 sink
                    +-> Union -> Async I/O with timeout -+
Kafka Source dead -/             (for the API call)       \-> Filter bad -> Kafka sink dead
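Sketched in the DataStream API, it might look like this (the topic names, the record tagging convention, and the async enrichment function are placeholders, and error handling is simplified):

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class DeadLetterJob {

    // hypothetical async function: calls the API and prefixes failed records with "FAILED|"
    public static class EnrichViaApi extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            resultFuture.complete(Collections.singleton(input)); // placeholder for the real call
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties(); // Kafka connection settings go here

        // main topic and dead-letter topic, consumed by two sources
        DataStream<String> main =
                env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));
        DataStream<String> dead =
                env.addSource(new FlinkKafkaConsumer<>("events-dead", new SimpleStringSchema(), props));

        // union, then enrich via the async i/o operator with a timeout
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                main.union(dead), new EnrichViaApi(), 5, TimeUnit.SECONDS, 100);

        // good records continue to the S3 sink (print() stands in for a StreamingFileSink here)
        enriched.filter(r -> !r.startsWith("FAILED|")).print();

        // failed records go back to the dead-letter topic for a later retry
        enriched.filter(r -> r.startsWith("FAILED|"))
                .addSink(new FlinkKafkaProducer<>("events-dead", new SimpleStringSchema(), props));

        env.execute("enrich-with-dead-letter-topic");
    }
}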