Batching the send operation on outbound adapter in Spring Integration

I have an outbound channel adapter (in this case SFTP, but it would be the same for JMS or WS) at the end of a Spring Integration flow. Since direct channels are used, every time there is a message flowing, it is sent out synchronously.
Now, I need to process messages all the way until they reach the outbound adapter, but wait for a predetermined interval before sending them out. In other words, I need to batch the send operation.
I know the Spring Batch project might offer a solution to this, but I need to find a solution with Spring Integration components (in the int-* namespaces).
What would be a typical pattern to achieve this?

The Aggregator pattern is for you.
In your particular case I'd call it more of a window, because you don't have any specific correlation to group messages by; you just need to build a batch, as you call it.
So, I think your Aggregator config may look like:
<int:aggregator input-channel="input" output-channel="output"
correlation-strategy-expression="1"
release-strategy-expression="size() == 10"
expire-groups-upon-completion="true"
send-partial-result-on-expiry="true"/>
correlation-strategy-expression="1" means group any incoming messages together.
release-strategy-expression="size() == 10" forms and releases batches of 10 messages.
expire-groups-upon-completion="true" tells the aggregator to remove a released group from its store. That allows a new group to be formed for the same correlation key (1 in our case).
send-partial-result-on-expiry="true" specifies that the normal release operation (sending to the output-channel) must also be performed on expiry, when there aren't enough messages to build a whole batch (size 10 in our case). For the details of these options, please follow the documentation.
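If you prefer the Java DSL, a minimal sketch of the same aggregator might look like this (assuming Spring Integration 5.x; the channel names come from the XML above, and the groupTimeout value is my assumption for the "wait for a predetermined interval" requirement):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class BatchingFlowConfig {

    @Bean
    public IntegrationFlow batchingFlow() {
        return IntegrationFlows.from("input")
                .aggregate(a -> a
                        .correlationExpression("1")        // one group for all messages
                        .releaseExpression("size() == 10") // release batches of 10
                        .expireGroupsUponCompletion(true)  // allow a fresh group for key 1
                        .sendPartialResultOnExpiry(true)   // flush incomplete batches on expiry
                        .groupTimeout(5000))               // assumed 5-second batching interval
                .channel("output")
                .get();
    }
}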

Related

Asynchronous or synchronous pull for counting stream data in Pub/Sub?

I would like to count the number of messages in the last hour (last hour referring to a timestamp field in the message data).
I currently have code that counts the messages synchronously (I am using Google Cloud Pub/Sub synchronous pull), but I noticed it takes quite long.
My code will repeatedly poll the subscription for a predefined (I set it to 100+) number of times so that I am sure there are no more messages in the last hour that are coming in out of order.
This is not an acceptable design because it means the user has to wait for 5-10 mins for the service to count the messages when they want the metric!
Are there best practices in Pub/Sub design for solving this kind of problem?
This seems like a simple problem to solve (count the number of events in the last X timeframe) so I thought there might be.
Will an asynchronous design help? How would an async design work? I am not too sure about async and the Python futures concept (I am using GCP Pub/Sub's Python client library).
I would catch the messages differently. My solution is based on logging and BigQuery. The idea is to write a log entry, for example message received with timestamp xxxxx, to filter on this log pattern, and to sink the result into BigQuery.
Then, when a user asks, you simply have to query BigQuery and count the messages in the desired window of time. You also have the advantage of being able to change the time frame, to keep a history, and so on.
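For the counting step, here is a minimal sketch with the BigQuery Java client (the dataset, table, and column names are assumptions about how the log sink is configured):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class LastHourCount {

    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Hypothetical sink table and timestamp column produced by the log sink.
        String sql = "SELECT COUNT(*) AS cnt FROM `my_project.logs.message_received` "
                + "WHERE message_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)";
        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        for (FieldValueList row : result.iterateAll()) {
            System.out.println("Messages in the last hour: " + row.get("cnt").getLongValue());
        }
    }
}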
For writing this log, there are 2 solutions (a sketch of the second follows below):
Cheaper but not really recommended: the process which consumes the message also logs it as it processes it. However, you then depend on an external service, and that service has 2 responsibilities: its own work and this logging (for metrics). Not SOLID. It could instead be the role of the publisher, with a log like message published at XXXX; however, this implies that all the publishers or all the subscribers are on GCP.
Better: plug in a function, the cheapest tier (128MB of memory), that simply handles the message and writes the log.
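A sketch of such a function on the Java 11 Cloud Functions runtime (the runtime choice and class names are my assumptions; the suggestion above is runtime-agnostic):

import java.util.logging.Logger;

import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;

public class LogMessage implements BackgroundFunction<LogMessage.PubSubMessage> {

    private static final Logger logger = Logger.getLogger(LogMessage.class.getName());

    @Override
    public void accept(PubSubMessage message, Context context) {
        // This is the line the log filter and the BigQuery sink pick up.
        logger.info("message received with timestamp " + context.timestamp());
    }

    // Minimal Pub/Sub event payload; only what this sketch needs.
    public static class PubSubMessage {
        public String data;
    }
}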

Biztalk - How to throttle a streaming disassemble pipeline

I need to limit the number of orchestration instances spawned while debatching a large message in a streaming disassemble receive pipeline. Let’s say that I have a large XML coming in that contains 100 000 separate "Order" messages. The receive pipeline would then debatch it and create 100 000 "ProcessOrder" orchestrations. This is too much and I need to limit it.
Requirements
The debatching needs to be done in a streaming manner so that I only load one "Order" message in memory at a time before sending it to the MessageBox;
The debatching needs to be throttled based on the number of currently running "ProcessOrder" orchestration instances (say, if I already have 100 running instances, the debatching would wait until one finishes before sending another "Order" message to the MessageBox).
Where I'm at
I have the receive pipeline that does the debatching and functional modifications to my messages. It does what it should in a streaming manner and puts individual messages in VirtualStreams;
I have an orchestration and helper methods that can limit the number of “ProcessOrder” orchestration instances.
The problem
I know that I can run a receive pipeline inside an orchestration (and that would solve my problem, since on every "getnext" call to the pipeline I could just hold on if there are too many running orchestration instances). But, digging into the BizTalk DLLs, I noticed that Microsoft.XLANGs.Pipeline.XLANGPipelineManager still loads all the messages into memory instead of enumerating them like Microsoft.BizTalk.PipelineOM.PipelineManager does. I know it puts every message in a VirtualStream, but this is still inadequate, memory-wise, for such a large number of messages.
Question
My next step would be to run the receive pipeline directly in the receive port (so it would use Microsoft.BizTalk.PipelineOM.PipelineManager) without the orchestration that limits the number of “ProcessOrder” instances, but to meet the requirements I would need to add delay logic to my pipeline. Is this a viable option? If not, why not? And what other alternatives do I have?
You should debatch all the messages once in the pipeline and store the individual messages in MSMQ before they are even processed by an orchestration. Use a standard pipeline to debatch the messages, as it handles debatching of large files efficiently. MSMQ is available for free through Turn Windows Features On, is very easy to use, and does not require any development. Sending to MSMQ will be very fast; 100K messages is not an issue at all.
Then have a receive location read from MSMQ. Depending on your orchestration throughput, you can control the message flow by using BizTalk receive host throttling, by receiving the messages from MSMQ in order, or by a combination of both. Make sure you have separate host instances for the MSMQ receive and send and for your orchestration processing.
This can all be done through configuration, without any extra code, simplifying your design. Make sure your orchestration has a minimum number of persistence points.

How to resequence after filtering for aggregation /Spring Integration/

I'm doing a project in Spring Integration and I have a big problem.
There are some filtering components in the flow and later in the flow I have an aggregation element.
The problem is that the filter component does not support the "apply-sequence" property. It filters out some records without modifying the original sequence numbers, but the number of messages is reduced.
Later in the flow I need an aggregation, which fails to release elements since some messages have been filtered out.
I don't want to use any special routing elements which have apply-sequence property.
Can you suggest me any common solution for this type of filtering problem?
Thanks,
I'd say you misunderstand the behaviour of the filter and aggregator.
I guess you have some apply-sequence-aware component upstream. So, all messages in that group get several headers: correlationId, to group messages in the default aggregator; sequenceNumber, the index of the message; and sequenceSize, the number of messages in the group.
A filter just checks messages against some condition and either sends them to the output-channel or applies the discard logic. It doesn't modify messages. And even if it could, that wouldn't sound good anyway.
Assume we have only two messages in the group. The first one passes the filter - we just send it to the aggregator. But the second is discarded and, yes, won't be sent to the aggregator. So that group is never released, because the sequenceSize is never reached.
To meet your requirement you need some custom ReleaseStrategy on the aggregator (by default it is SequenceSizeReleaseStrategy). For example, check some state in your system indicating that all messages in the group have been handled, independently of the true or false result of the filter. Or send some fake message for the same purpose and check its availability in the group.
In this case you just need to take care of the correlationId to group messages in the aggregator.
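A minimal sketch of such a custom ReleaseStrategy, assuming a hypothetical FilterTracker collaborator that counts how many messages the filter discarded per correlation key (fed, for example, from the filter's discard-channel):

import org.springframework.integration.IntegrationMessageHeaderAccessor;
import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.store.MessageGroup;
import org.springframework.messaging.Message;

public class SeenAllReleaseStrategy implements ReleaseStrategy {

    // Hypothetical collaborator: something in your flow (e.g. a handler on the
    // filter's discard-channel) must count discards per correlation key.
    public interface FilterTracker {
        int discardedCount(Object correlationKey);
    }

    private final FilterTracker tracker;

    public SeenAllReleaseStrategy(FilterTracker tracker) {
        this.tracker = tracker;
    }

    @Override
    public boolean canRelease(MessageGroup group) {
        Message<?> first = group.getOne();
        if (first == null) {
            return false;
        }
        Integer sequenceSize = new IntegrationMessageHeaderAccessor(first).getSequenceSize();
        // Release once surviving plus discarded messages account for the whole sequence.
        int discarded = tracker.discardedCount(group.getGroupId());
        return sequenceSize != null && group.size() + discarded >= sequenceSize;
    }
}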
UPDATE
What is the suggested release strategy for such a scenario? Would it be a good strategy to use a timeout as the release strategy?
What I can say is that sometimes it is really difficult to find a good solution for some integration scenarios. Messaging is stateless by nature, so correlating and grouping an undetermined number of messages can be a problem.
It depends on the requirements and the environment.
For example, when all your messages are processed in a single thread, you can safely send some fake marker message at the end directly to the aggregator and check for it from the ReleaseStrategy. And that will work even when all the messages from the group have been discarded.
If you process those messages in parallel, or they are received from different threads, you really won't be able to determine the order of the messages or the time each one takes to process.
In this case the TimeoutCountSequenceSizeReleaseStrategy really can help. Of course, you will need to find a good timeframe compromise according to the requirements of your system.
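Wiring it in is a one-liner bean; a sketch, with threshold and timeout values that are assumptions to tune for your system:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.aggregator.TimeoutCountSequenceSizeReleaseStrategy;

@Configuration
public class ReleaseStrategyConfig {

    @Bean
    public ReleaseStrategy releaseStrategy() {
        // Release when 10 messages have arrived, or once the group is 30 seconds old.
        return new TimeoutCountSequenceSizeReleaseStrategy(10, 30_000);
    }
}

Reference it from your aggregator via its release-strategy attribute.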

Oracle Service Bus Proxy Service Scheduler

I need to create a proxy service scheduler that receives messages from the queue after 5 minutes. The queue produces messages, single or multiple, but the proxy should receive those messages at an interval of every 5 minutes. How can I achieve this using only Oracle Service Bus?
Kindly help me with this.
OSB does not provide scheduler capabilities out of the box. You can do either of the following:
For the JMS queue, allow infinite retries by not setting a retry limit, and set the retry interval to 5 minutes.
Create a scheduler. Check this post for the same: http://blogs.oracle.com/jamesbayer/entry/weblogic_scheduling_a_polling
Answer left for reference only; messages shouldn't be subject to complex computed selections in this way, only to simple value comparisons and pattern matching.
To fetch only old enough messages from the queue,
not modifying the queue or the messages,
not introducing any new brokers between the queue and the consumer,
not prematurely consuming messages,
use the Message Selector field of the OSB proxy on the JMS Transport tab to set a boolean expression (SQL-92) that checks that the message's JMSTimestamp header is at least 5 minutes older than the current time.
... and I wasn't able to quickly produce a valid message selector from either the timestamp or the JMSMessageID (it contains the time in millis - 'ID:<465788.1372152510324.0>').
I guess somebody could still use it in some specific case.
You can use Quartz scheduler APIs to create schedulers across domains.
Regards,
Sajeev
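For the Quartz route, a minimal sketch (assuming Quartz 2.x; the job that drains the queue is a hypothetical placeholder):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class QueueDrainScheduler {

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(DrainQueueJob.class)
                .withIdentity("drainQueue").build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(SimpleScheduleBuilder.repeatMinutelyForever(5)) // fire every 5 minutes
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }

    public static class DrainQueueJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // hypothetical: consume the pending JMS messages here
        }
    }
}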
I don't know whether this works for you, but it's working well for me. Maybe you can use it to do what you need.
Go to the Transport Details of your proxy service and, under the Advanced Options tab, set the following fields:
Polling Frequency (enter your frequency: 300 sec (5 min))
Physical Directory (here you may need to give your queue path)

Flink kafka to S3 with retry

I am new to Flink. I have a requirement to read data from Kafka, enrich the data conditionally (if a record belongs to category X) by using some API, and write it to S3.
I made a hello world Flink application with the above logic which works like a charm.
But, the API which I am using to enrich doesn't have 100% uptime SLA, so I need to design something with retry logic.
Following are the options that I found,
Option 1) Retry with exponential backoff until I get a response from the API, but this will block the queue, so I don't like this.
Option 2) Use one more topic (called topic-failure) and publish the record to topic-failure if the API is down. This way it won't block the actual main queue. I will need one more worker to process the data from topic-failure. Again, this queue has to be used as a circular queue if the API is down for a long time: read a message from topic-failure, try to enrich it, and if it fails, push it back to the same topic-failure and consume the next message from it.
I prefer option 2, but it does not look like an easy task to accomplish. Is there any standard Flink approach available to implement option 2?
This is a rather common problem that occurs when migrating away from microservices. The proper solution would be to have the lookup data also in Kafka, or in some DB that could be integrated into the same Flink application as an additional source.
If you cannot do that (for example, the API is external or the data cannot easily be mapped to a data store), both approaches are viable and have different advantages.
1) Lets you retain the order of the input events. If your downstream application expects ordering, then you need to retry.
2) The common term is dead letter queue (although it is more often used for invalid records). There are two easy ways to integrate it in Flink: either have a separate source, or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source ------\                                     /-> Filter good -> S3 sink
                    +-> Union -> Async IO with timeout -+
Kafka Source dead -/              (for API call!)        \-> Filter bad -> Kafka sink dead
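A minimal sketch of that topology, assuming Flink 1.x with the Kafka connector (EnrichFunction and the "FAILED:" tagging convention are illustrative assumptions, not a standard API):

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class EnrichWithDeadLetterQueue {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "enricher");

        // One consumer per topic; a topic list on a single source works as well.
        DataStream<String> main = env.addSource(
                new FlinkKafkaConsumer<>("topic-main", new SimpleStringSchema(), props));
        DataStream<String> dead = env.addSource(
                new FlinkKafkaConsumer<>("topic-failure", new SimpleStringSchema(), props));

        // Async enrichment with a timeout, so a slow API cannot block the pipeline.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                main.union(dead), new EnrichFunction(),
                30, TimeUnit.SECONDS,  // per-request timeout
                100);                  // max in-flight API requests

        enriched.filter(v -> !v.startsWith("FAILED:"))
                .print();  // stand-in for the S3 sink (e.g. a StreamingFileSink)
        enriched.filter(v -> v.startsWith("FAILED:"))
                .addSink(new FlinkKafkaProducer<>("topic-failure", new SimpleStringSchema(), props));

        env.execute("enrich-with-dlq");
    }

    // Illustrative async enrichment: tag failures instead of failing the job.
    public static class EnrichFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            CompletableFuture.supplyAsync(() -> callApi(input))
                    .whenComplete((result, error) -> resultFuture.complete(
                            Collections.singleton(error == null ? result : "FAILED:" + input)));
        }

        private static String callApi(String input) {
            return input + ":enriched";  // hypothetical API call
        }
    }
}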