Sending HTTP requests with Kafka Streams

I am aware that it's not recommended to send HTTP requests from within a Kafka Streams application, as the blocking nature of external RPC calls can hurt performance.
But what if the use case doesn't allow me to avoid them?
I'm building an application that consumes from an input topic; each message then goes through various iterations of filtering, mapping, and joining with a KTable. At the end of these iterations the result is ready to be "delivered".
The "delivery" mechanism happens to be an HTTP request: I have to call external REST APIs to send these results to various vendors. I also need to wait for the response to come back, mark the delivery as either successful or failed based on it, and produce the result to an output topic, which is consumed by another service for archiving purposes.
I'm aware that HTTP calls block the calling stream thread, so I configured a timeout that is well below the consumer's max.poll.interval.ms to avoid a rebalance in case the external API service is down. Timed-out requests are also sent to a low-priority queue for later delivery re-attempts.
As you can see, I cannot really avoid making external RPC calls within Kafka Streams. I'm just curious whether there is a better architecture for this kind of use case.
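For reference, the in-topology approach described in the question might look roughly like the following sketch (topic names, the vendor URL, and the deliver helper are illustrative placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DeliveryTopology {
    // Kept well below max.poll.interval.ms (default 300000 ms) to avoid rebalances.
    private static final Duration HTTP_TIMEOUT = Duration.ofSeconds(10);
    private static final HttpClient CLIENT =
            HttpClient.newBuilder().connectTimeout(HTTP_TIMEOUT).build();

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> results = builder.stream("input-topic"); // placeholder
        // ... filtering, mapping, and KTable joins elided ...
        results
            .mapValues(DeliveryTopology::deliver)   // blocking HTTP call per record
            .to("delivery-status-topic");           // placeholder

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "delivery-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    // Returns "SUCCESS" or "FAILED"; timed-out payloads could instead be
    // forwarded to a retry topic (the "low-priority queue" described above).
    private static String deliver(String payload) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://vendor.example.com/deliver")) // placeholder
                .timeout(HTTP_TIMEOUT)
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        try {
            HttpResponse<String> resp = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return resp.statusCode() < 300 ? "SUCCESS" : "FAILED";
        } catch (Exception e) { // includes HttpTimeoutException
            return "FAILED";
        }
    }
}
```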

If you cannot avoid it, then one other option is to send the data to an outbound "request" topic and write a plain consumer that performs the HTTP requests and produces back to a "response" topic, with HTTP status codes or success/fail indicators, for example.
Then have Streams also consume this response topic for the joining.
The main reason not to do blocking RPC within Streams is that Streams is very sensitive to time, and raising its timeouts to accommodate slow external calls should generally be avoided where possible.
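A minimal sketch of that outboard request/response bridge, assuming string keys and values and placeholder topic names (error handling and retries trimmed):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HttpBridge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "http-bridge");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        HttpClient http = HttpClient.newHttpClient();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(List.of("delivery-requests")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    HttpRequest req = HttpRequest.newBuilder()
                            .uri(URI.create("https://vendor.example.com/deliver")) // placeholder
                            .POST(HttpRequest.BodyPublishers.ofString(rec.value()))
                            .build();
                    String status;
                    try {
                        status = String.valueOf(
                                http.send(req, HttpResponse.BodyHandlers.ofString()).statusCode());
                    } catch (Exception e) {
                        status = "FAILED";
                    }
                    // The Streams app joins on this topic to learn the outcome.
                    producer.send(new ProducerRecord<>("delivery-responses", rec.key(), status));
                }
            }
        }
    }
}
```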

Related

How to implement communication between consumer and producer with fast and slow workers?

I'm looking for a pattern and existing implementations for my situation:
I have a synchronous SOA which uses REST APIs; in fact, I implement remote procedure calls with REST. I also have some slow workers which process requests for a substantial time (around 30 seconds), and due to license constraints I can process some requests only sequentially (see system setup).
What are the recommended ways to implement communication for such a case?
How can I mix synchronous and asynchronous communication when the consumer is behind a firewall, so I cannot easily send it notifications about completed tasks, and I might not be able to let the consumer use my message broker if I have one?
Workers are implemented in Python using Flask and gunicorn. At the moment I'm using synchronous REST interfaces and allow for the delay, as I previously only had fast workers. I looked at Kafka and RabbitMQ and they would fit for backend-side communication; however, how does the producer communicate with the consumer?
If the consumer fires an API request, my producer can return code 202; but then how shall the producer notify the consumer that the result is available? Will the consumer have to poll the producer for results? (See the sketch after this question.)
Also, if I use message brokers and my gateway acts on behalf of the consumer, it should have a registry of requests (I already have a GUID for every request) and results. Which approach would you recommend for implementing it?
Producer - the agent which produces the message.
Consumer - the agent which can handle the message and implements the logic for processing it.
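For the 202-then-poll part of the question, one common shape is for the worker to return 202 Accepted together with a status URL, and for the client to poll that URL until the result is ready. A minimal sketch (the host, paths, and JSON payload are all assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PollingClient {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Submit the job; the worker answers 202 with a Location header
        // pointing at a status resource keyed by the request GUID.
        HttpResponse<String> submit = http.send(
                HttpRequest.newBuilder()
                        .uri(URI.create("https://worker.example.com/jobs")) // placeholder
                        .POST(HttpRequest.BodyPublishers.ofString("{\"task\":\"...\"}"))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        String statusUrl = submit.headers().firstValue("Location").orElseThrow();

        // Poll until the slow worker (~30 s) finishes.
        while (true) {
            HttpResponse<String> status = http.send(
                    HttpRequest.newBuilder().uri(URI.create(statusUrl)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            if (status.statusCode() == 200) {   // result ready
                System.out.println(status.body());
                break;
            }                                   // 202: still running
            Thread.sleep(2000);
        }
    }
}
```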

Does the Confluent REST Proxy API (producer) retry event publishing on failure?

We are planning to use the Confluent REST Proxy as the platform layer for listening to all user events (and publishing them to Kafka). Working in a micro-services model with varied types of event generators, we want our APIs/event generators to be decoupled from event listening/handling. Resilience at the event-listening layer is also important for us.
From what I understand, if the REST Proxy layer fails to publish to Kafka (once) for any reason, it doesn't retry. This functionality would need to be handled by the caller (the client layer), which would make synchronous calls and retry on failure. However, I couldn't find any details on this in the product documentation. Could someone please confirm?
Confluent REST Proxy developers claim that with the right REST Proxy cluster set-up and the right request batching by the client, performance as good as a native producer's can be achieved. Any exceptions or thoughts, positive or negative?
Calls to the REST Proxy producer API are blocking. If the client doesn't need to know the partition and offset details, can these calls be made non-blocking in any way, such that once the request is received, resilience is managed by the REST Proxy layer itself and the client just receives a 200 HTTP status as acknowledgement whenever a produce-message request is received?
The REST Proxy is just a normal Kafka Producer and Kafka Consumer and can be configured with or without retries enabled, just as any other Kafka Producer can.
A single producer publishing via a REST Proxy will not achieve the same throughput as a single native Java Producer. However you can scale up many REST proxies and many HTTP producers to get high performance in aggregate. You can also mitigate the performance penalty imposed by HTTP by batching multiple messages together into a consolidated HTTP request to minimize the number of HTTP round trips on the wire.
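For illustration, a batched produce call against the REST Proxy's v2 API could look like this (host, port, topic, and payload are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyBatch {
    public static void main(String[] args) throws Exception {
        // Several records consolidated into one HTTP round trip.
        String body = "{\"records\":["
                + "{\"value\":{\"event\":\"click\",\"user\":1}},"
                + "{\"value\":{\"event\":\"view\",\"user\":2}},"
                + "{\"value\":{\"event\":\"click\",\"user\":3}}]}";

        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://rest-proxy:8082/topics/user-events")) // placeholders
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The response echoes per-record partition/offset (or error codes);
        // a client that doesn't care can simply check for HTTP 200.
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}
```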

What happens to messages that arrive at a server implementing stream processing after the source has reached its bound?

I'm learning Akka Streams, but obviously this is relevant to any streaming framework :)
Quoting the Akka documentation:
Reactive Streams is just to define a common mechanism of how to move data across an asynchronous boundary without losses, buffering or resource exhaustion
Now, from what I understand: before streams, taking an HTTP server as an example, a request would come in, and if the receiver wasn't finished with a previous request, new incoming requests would be collected in a buffer of waiting requests. The problem is that this buffer has an unknown size, and at some point, if the server is overloaded, we can lose requests that were waiting.
So stream processing came into play and bounded this buffer so that it is controllable... we can predefine the number of messages (requests, in my example) we want to have in line, and we can take care of each one at a time.
My question: if we implement a source in our server that can hold at most 3 messages, what happens when the 4th one comes?
I mean, when another server calls us and we are already taking care of 3 requests... what will happen to its request?
What you're describing is not actually the main problem that Reactive Streams implementations solve.
Backpressure in terms of the number of requests is solved with regular networking tools. For example, in Java you can configure the thread pool of a networking library (for example, Netty) to some parallelism level, and the library will take care of accepting as many requests as possible. Or, if you use the synchronous sockets API, it is even simpler - you can postpone calling accept() on the server socket until all of the currently connected clients are served. In either case, there is no "buffer" on either side; it's just that until the server accepts a connection, the client will be blocked (either inside a system call for blocking APIs, or in an event loop for async APIs).
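A crude sketch of the synchronous-sockets variant just described: with a semaphore sized to the worker pool, accept() is simply not called until a slot frees up, so excess clients wait at the networking layer rather than in an unbounded application buffer (port and pool size are arbitrary):

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedServer {
    public static void main(String[] args) throws Exception {
        int capacity = 3;                        // serve at most 3 clients at once
        Semaphore slots = new Semaphore(capacity);
        ExecutorService pool = Executors.newFixedThreadPool(capacity);

        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                slots.acquire();                 // postpone accept() until a slot is free
                Socket client = server.accept();
                pool.submit(() -> {
                    try (client) {
                        // ... handle the request ...
                    } catch (Exception ignored) {
                    } finally {
                        slots.release();         // free the slot for the next accept()
                    }
                });
            }
        }
    }
}
```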
What Reactive Streams implementations solve is how to handle backpressure inside a higher-level data pipeline. Reactive streams implementations (e.g. akka-streams) provide a way to construct a pipeline of data in which, when the consumer of the data is slow, the producer will slow down automatically as well, and this would work across any kind of underlying transport, be it HTTP, WebSockets, raw TCP connections or even in-process messaging.
For example, consider a simple WebSocket connection, where the client sends a continuous stream of information (e.g. data from some sensor), and the server writes this data to some database. Now suppose that the database on the server side becomes slow for some reason (networking problems, disk overload, whatever). The server now can't keep up with the data the client sends, that is, it cannot save it to the database in time before the new piece of data arrives. If you're using a reactive streams implementation throughout this pipeline, the server will signal to the client automatically that it cannot process more data, and the client will automatically tweak its rate of producing in order not to overload the server.
Naturally, this can be done without any Reactive Streams implementation, e.g. by manually controlling acknowledgements. However, like many other libraries, Reactive Streams implementations solve this problem for you. They also provide an easy way to define such pipelines, and usually they have interfaces for various external systems like databases. In particular, such libraries may implement backpressure on the lowest level, down to the TCP connection, which may be hard to do manually.
As for Reactive Streams itself, it is just a description of an API which can be implemented by a library. It defines common terms and behavior and allows such libraries to be interchangeable or to interact easily: e.g. you can connect an akka-streams pipeline to a Monix pipeline using the interfaces from the specification, and the combined pipeline will work seamlessly, supporting all of the backpressure features of Reactive Streams.
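Since Java 9, the Reactive Streams interfaces ship in the JDK as java.util.concurrent.Flow. A toy subscriber that requests one element at a time shows the demand-driven protocol in miniature (buffer size and timings are arbitrary):

```java
import java.util.concurrent.Flow;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

public class BackpressureDemo {
    public static void main(String[] args) throws Exception {
        // Tiny buffer (2) so the producer actually has to wait for the consumer.
        try (SubmissionPublisher<Integer> publisher =
                     new SubmissionPublisher<>(ForkJoinPool.commonPool(), 2)) {
            publisher.subscribe(new Flow.Subscriber<Integer>() {
                private Flow.Subscription subscription;

                public void onSubscribe(Flow.Subscription s) {
                    subscription = s;
                    s.request(1);               // demand exactly one element
                }

                public void onNext(Integer item) {
                    // Simulate a slow consumer (e.g. a slow database write).
                    try { TimeUnit.MILLISECONDS.sleep(100); } catch (InterruptedException ignored) {}
                    System.out.println("processed " + item);
                    subscription.request(1);    // signal readiness for the next one
                }

                public void onError(Throwable t) { t.printStackTrace(); }
                public void onComplete() { System.out.println("done"); }
            });

            // submit() blocks when the subscriber's buffer is full, so the
            // producer is automatically slowed to the consumer's pace.
            for (int i = 0; i < 10; i++) publisher.submit(i);
        }
        TimeUnit.SECONDS.sleep(2);              // let the async consumer drain
    }
}
```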

Is RabbitMQ capable of "pushing" messages from a queue to a consumer?

With RabbitMQ, is there a way to "push" messages from a queue TO a consumer as opposed to having a consumer "poll and pull" messages FROM a queue?
This has been the cause of some debate on a current project I'm on. The argument from one side is that having consumers (i.e. a Windows service) "poll" a queue until a new message arrives is somewhat inefficient and less desirable than having the message "pushed" automatically from the queue to the subscriber(s)/consumer(s).
I can only seem to find information supporting the idea of consumers "polling and pulling" messages off of a queue (e.g. using a Windows service to poll the queue for new messages). There isn't much information on the idea of "pushing" messages to a consumer/subscriber...
Having the server push messages to the client is one of the two ways to get messages to the client, and the preferred way for most applications. This is known as consuming messages via a subscription.
The client is connected. (The AMQP/RabbitMQ/most messaging systems model is that the client is always connected - except for network interruptions, of course.)
You use the client API to arrange that your channel consume messages by supplying a callback method. Then whenever a message is available the server sends it to the client over the channel and the client application gets it via an asynchronous callback (typically one thread per channel). You can set the "prefetch count" on the channel which controls the amount of pipelining your client can do over that channel. (For further parallelism an application can have multiple channels running over one connection, which is a common design that serves various purposes.)
The alternative is for the client to poll for messages one at a time, over the channel, via a get method.
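With the RabbitMQ Java client, the push model described above looks roughly like this (queue name and host are placeholders):

```java
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class PushConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");            // placeholder
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.basicQos(10);                    // prefetch count: pipelining limit

        // The broker PUSHES deliveries to this callback; there is no polling loop.
        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            System.out.println("got: " + new String(delivery.getBody(), StandardCharsets.UTF_8));
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume("work-queue", false, onDeliver, consumerTag -> {});

        // The pull alternative, one message at a time:
        // GetResponse r = channel.basicGet("work-queue", false);
    }
}
```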
You "push" messages from Producer to Exchange.
https://www.rabbitmq.com/tutorials/tutorial-three-python.html
BTW, this fits IoT scenarios very well. Devices produce messages and send them to an exchange. The queue handles persistence, FIFO and other features, as well as delivery of messages to subscribers.
And, by the way, you never "poll" the queue. Instead, you always subscribe to the publisher, similar to the observer pattern. Generally, I would say it's an ingenious principle.
So it is similar to a post box or post office, except it sends you a notification when a message is available.
Quoting from the docs here:
AMQP brokers either deliver messages to consumers subscribed to queues, or consumers fetch/pull messages from queues on demand.
And from here:
Storing messages in queues is useless unless applications can consume them. In the AMQP 0-9-1 Model, there are two ways for applications to do this:

Have messages delivered to them ("push API")
Fetch messages as needed ("pull API")

With the "push API", applications have to indicate interest in consuming messages from a particular queue. When they do so, we say that they register a consumer or, simply put, subscribe to a queue. It is possible to have more than one consumer per queue or to register an exclusive consumer (excludes all other consumers from the queue while it is consuming).

Each consumer (subscription) has an identifier called a consumer tag. It can be used to unsubscribe from messages. Consumer tags are just strings.
The RabbitMQ broker is like a server that won't send data to a consumer unless the consumer client has registered itself with the server. But then the question arises:
Can RabbitMQ keep the client consumer's details and connect to the client when a packet comes in?
The answer is no. So what is the alternative? Write a plugin yourself that maintains client information in some kind of config; the plugin pulls from the RabbitMQ queue and pushes to the client.
This plugin might help:
https://www.rabbitmq.com/shovel.html
Frankly speaking, the client would need to implement the AMQP protocol to receive messages and would have to listen for connections on some port, which sounds like another server.

What is Microsoft Message Queuing (MSMQ)? How does it work?

I need to work with MSMQ (Microsoft Message Queuing). What is it, what is it for, how does it work? How is it different from web services?
With all due respect to @Juan's answer, both are ways of exchanging data between two disconnected processes, i.e. interprocess communication channels (IPC). Message queues are asynchronous, while web services are synchronous. They use different protocols and back-end services to do this, so they are completely different in implementation, but similar in purpose.
You would want to use message queues when there is a possibility that the other communicating process may not be available, yet you still want to have the message sent at the time of the client's choosing. Delivery will occur when the process on the other end wakes up and receives notification of the message's arrival.
As its name states, it's just a queue manager.
You can Send objects (serialized) to the queue where they will stay until you Receive them.
It's normally used to send messages or objects between applications in a decoupled way.
It has nothing to do with web services; they are two different things.
Info on MSMQ:
https://msdn.microsoft.com/en-us/library/ms711472(v=vs.85).aspx
Info on WebServices:
http://msdn.microsoft.com/en-us/library/ms972326.aspx
Transactional Queue Management 101
A transactional queue is a middleware system that asynchronously routes messages of one sort or another between hosts that may or may not be connected at any given time. This means that it must also be capable of persisting the message somewhere. Examples of such systems are MSMQ and IBM MQ.
A Transactional Queue can also participate in a distributed transaction, and a rollback can trigger the disposal of messages. This means that a message is guaranteed to be delivered with at-most-once semantics or guaranteed delivery if not rolled back. The message won't be delivered if:
Host A posts the message but Host B is not connected
Something (possibly but not necessarily initiated from Host A) rolls back the transaction
B connects after the transaction is rolled back
In this case B will never be aware the message even existed unless informed through some other medium. If the transaction was rolled back, this probably doesn't matter. If B connects and collects the message before the transaction is rolled back, the rollback will also reverse the effects of the message on B.
Note that A can post the message to the queue with the guarantee of at-most-once delivery. If the transaction is committed Host A can assume that the message has been delivered by the reliable transport medium. If the transaction is rolled back, Host A can assume that any effects of the message have been reversed.
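MSMQ itself is a .NET API, but the same post/commit/rollback pattern can be sketched with the vendor-neutral JMS API in Java (the queue name is a placeholder, and the ConnectionFactory would come from a concrete broker's client library):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class TransactionalSend {
    // JMS 2.0: Connection is AutoCloseable, so try-with-resources works.
    static void sendAtMostOnce(ConnectionFactory connectionFactory) throws Exception {
        try (Connection connection = connectionFactory.createConnection()) {
            connection.start();
            // true = transacted session: sends are staged until commit().
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("orders");      // placeholder
            MessageProducer producer = session.createProducer(queue);

            producer.send(session.createTextMessage("order #42"));
            try {
                // ... other work in the same transaction ...
                session.commit();     // Host B will (eventually) see the message
            } catch (Exception e) {
                session.rollback();   // Host B never sees the message
                throw e;
            }
        }
    }
}
```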
Web Services
A web service is a remote procedure call or other service (e.g. RESTful APIs) published by a (typically) HTTP server. It is a synchronous request/response protocol and has no guarantee of delivery built into the protocol. It is up to the client to validate that the service has run correctly, typically through a reply to the request or a timeout of the call.
In the latter case, web services do not guarantee at-most-once semantics. The server can complete the service and yet fail to deliver a response (possibly because something outside the server went wrong). The application must be able to deal with this situation.
IIRC, RESTful services should be idempotent (the same state is achieved after any number of invocations of the same service), which is a strategy for dealing with this lack of guaranteed notification of success/failure in web service architectures. The idea is that conceptually one writes state rather than invoking a service, so one can write any number of times. This means that a lack of feedback about success can be tolerated by the application, as it can retry the posting until it gets a 'success' message from the server.
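The retry-until-acknowledged strategy this enables might look like the following sketch, where a PUT to a client-chosen key is safe to repeat (URL, payload, and retry policy are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IdempotentRetry {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // PUT writes the desired state at a client-chosen key, so repeating
        // it after a lost response converges on the same server state.
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/orders/42")) // placeholder
                .PUT(HttpRequest.BodyPublishers.ofString("{\"status\":\"shipped\"}"))
                .build();

        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                HttpResponse<Void> resp = http.send(put, HttpResponse.BodyHandlers.discarding());
                if (resp.statusCode() < 300) break;   // success acknowledged
            } catch (Exception timeoutOrIoError) {
                // No feedback: safe to retry precisely because PUT is idempotent.
            }
            Thread.sleep(1000L * attempt);            // simple backoff
        }
    }
}
```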
Note that you can use Windows Communication Foundation (WCF) as an abstraction layer above MSMQ. This gives you the feel of working with a service - with only one-way operations.
For more information, see:
http://msdn.microsoft.com/en-us/library/ms789048.aspx
Actually, there is no relation between MSMQ and web services.
You use MSMQ for interprocess communication (you can also use sockets, Windows messaging, or mapped memory).
It is a Windows service responsible for keeping messages until someone dequeues them.
You could say it is more reliable than sockets, as messages are stored on a hard disk, but it is slower than other IPC techniques.
You can use MSMQ in .NET with a few lines of code: just declare your MessageQueue object and call its Receive and Send methods.
The message itself can be a plain string or binary data.
As everyone has explained, MSMQ is used as a queue for messages. Messages can be wrappers for actual data, objects, or anything you can serialize and send across the wire. MSMQ has its own limitations: MSMQ 1.0 and MSMQ 2.0 had a 4 MB message limit, a restriction lifted with MSMQ 3.0. Message-oriented middleware (MOM) is a concept that depends heavily on messaging, and the Enterprise Service Bus foundation is built on messaging. All these newer technologies depend on messaging for asynchronous data delivery with reliability.
MSMQ stands for Microsoft Message Queuing.
It is simply a queue that stores formatted messages so they can be passed on to a database (which may be on the same machine or on a server). There are different types of queues which categorize the messages among themselves.
If there is some problem or error inside a message, or an invalid message is passed, it automatically goes to the dead-letter queue, which indicates that it is not to be processed further. Before a message is sent to the dead-letter queue, though, delivery is retried up to a maximum count.
It is generally used for sending log messages from a client machine to a server or database, so that if any issue happens on the client machine, the developer or support team can go through the log to solve the problem.
MSMQ is also a service provided by Microsoft for keeping records of log files.
You can get a better idea from this article: http://msdn.microsoft.com/en-us/library/ms711472(v=vs.85).aspx