TCP messages - Durable and fast solution tips?

We use Spring Integration for TCP socket communication with hardware.
The client sends a sequence number to uniquely identify each message.
My requirement is to store these sequence numbers, which arrive as part of the socket message, and validate them so that repeated sequence numbers are rejected.
I went through IdempotentReceiver, and it sounds like what I want.
But I need a mechanism that is both durable, so the state survives an unexpected shutdown of the service, and fast, using an in-memory cache to retrieve the latest sequence number.
Thank you in advance!

You can use the PropertiesPersistingMetadataStore for the idempotent receiver:
The PropertiesPersistingMetadataStore is backed by a properties file and a PropertiesPersister.
By default, it only persists the state when the application context is closed normally. It implements Flushable, so you can persist the state at will by invoking flush().
See more in its JavaDocs.
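As a minimal Java-config sketch of that combination (the "sequenceNumber" header name, the base directory, and the discard behaviour are assumptions drawn from the question, not fixed names):

    @Configuration
    @EnableIntegration
    public class SequenceConfig {

        @Bean
        public PropertiesPersistingMetadataStore metadataStore() {
            PropertiesPersistingMetadataStore store = new PropertiesPersistingMetadataStore();
            store.setBaseDirectory("/var/data/seq-store"); // hypothetical location for the backing .properties file
            return store;
        }

        @Bean
        public IdempotentReceiverInterceptor idempotentInterceptor(PropertiesPersistingMetadataStore store) {
            // Key each message by its sequence-number header; a repeated key marks the message as a duplicate.
            MetadataStoreSelector selector = new MetadataStoreSelector(
                    m -> m.getHeaders().get("sequenceNumber", String.class), store);
            IdempotentReceiverInterceptor interceptor = new IdempotentReceiverInterceptor(selector);
            interceptor.setThrowExceptionOnRejection(true); // or route duplicates to a discard channel instead
            return interceptor;
        }
    }

You can then apply the interceptor to the endpoint that consumes the TCP messages, e.g. @ServiceActivator(inputChannel = "...", adviceChain = "idempotentInterceptor"). Since the store only persists on normal shutdown by default, call metadataStore().flush() on whatever schedule gives you acceptable durability; lookups are still served from the in-memory state.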

Related

External processing using Kafka Streams

There are several questions regarding message enrichment using external data, and the recommendation is almost always the same: ingest external data using Kafka Connect and then join the records using state stores. Although it fits in most cases, there are several other use cases in which it does not, such as IP to location and user agent detection, to name a few.
Enriching a message with an IP-based location usually requires a lookup by a range of IPs, but currently, there is no built-in state store that provides such capability. For user agent analysis, if you rely on a third-party service, you have no choices other than performing external calls.
We spent some time thinking about it and came up with the idea of implementing a custom state store on top of a database that supports range queries, like Postgres. We could also abstract an external HTTP or gRPC service behind a state store, but we're not sure if that is the right way.
In that sense, what is the recommended approach when you cannot avoid querying an external service during the stream processing, but you still must guarantee fault tolerance? What happens when an error occurs while the state store is retrieving data (a request fails, for instance)? Do Kafka Streams retry processing the message?
Generally, KeyValueStore#range(fromKey, toKey) is supported by the built-in stores, so it would be good to understand what kind of range queries you are trying to do. Also note that, internally, everything is stored as byte[] arrays, and RocksDB (the default storage engine) sorts data accordingly. Hence, you can actually implement quite sophisticated range queries if you start to reason about the byte layout and pass corresponding "prefix keys" into #range().
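For illustration, with the default String serde the UTF-8 byte layout preserves lexicographic order, so a prefix scan can be expressed directly with #range() (the store name and key layout below are made up):

    // Keys are laid out as "<prefix>:<suffix>", e.g. "user42:2019-06-21".
    // ';' is the byte right after ':', so the inclusive range below covers
    // exactly the keys that start with "user42:".
    KeyValueStore<String, String> store = context.getStateStore("lookup-store");
    try (KeyValueIterator<String, String> iter = store.range("user42:", "user42;")) {
        while (iter.hasNext()) {
            KeyValue<String, String> entry = iter.next();
            // handle entry.key / entry.value
        }
    }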
If you really need to call an external service, you have "two" options to avoid losing data: if an external call fails, throw an exception and let Kafka Streams die. This is obviously not a real option; however, if you swallow errors from the external lookup, you "skip" the input message and it goes unprocessed. Kafka Streams cannot know that processing "failed" (it does not know what your code does) and will not "retry"; it considers the message completed (similar to what happens if you filter it out).
Hence, to make it work, you need to put all the data you use to trigger the lookup into a state store when the external call fails, and retry later (i.e., do a lookup into the store to find unprocessed data and retry). This retry can either be a "side task" when you process the next input message, or you can schedule a punctuation to implement the retry. Note that this mechanism changes the order in which records are processed, which might or might not be OK for your use case.
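A sketch of that retry pattern with the Processor API (the store name, retry interval, and externalLookup() call are all assumptions; the store itself must be registered with the topology via a StoreBuilder):

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class EnrichWithRetryProcessor implements Processor<String, String, String, String> {

        private ProcessorContext<String, String> context;
        private KeyValueStore<String, String> retryStore; // holds inputs whose lookup failed

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.retryStore = context.getStateStore("retry-store");
            // Periodically re-attempt every parked record.
            context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, ts -> {
                List<KeyValue<String, String>> parked = new ArrayList<>();
                try (KeyValueIterator<String, String> iter = retryStore.all()) {
                    iter.forEachRemaining(parked::add);
                }
                parked.forEach(kv -> tryEnrich(kv.key, kv.value));
            });
        }

        @Override
        public void process(Record<String, String> record) {
            tryEnrich(record.key(), record.value());
        }

        private void tryEnrich(String key, String value) {
            try {
                String enriched = externalLookup(value); // the external call (assumed)
                context.forward(new Record<>(key, enriched, System.currentTimeMillis()));
                retryStore.delete(key); // no-op if the record was never parked
            } catch (Exception e) {
                retryStore.put(key, value); // park it; the punctuation retries later
            }
        }

        private String externalLookup(String value) {
            throw new UnsupportedOperationException("call your HTTP/gRPC service here");
        }
    }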

Is there any ACK-like mechanism for PostgreSQLs logical decoding/replication?

Is there a way for a replication client to report whether it was able to successfully store the data, or does PostgreSQL stream pending data to the client and consider it delivered the moment it leaves the network interface?
I'd think the client has a chance to say "ACK - I got the data", but I can't seem to find this anywhere... I'm simply wondering: what if the client fails to store the data (e.g. due to power failure) - isn't there a way to get it again from Postgres?
General info here https://www.postgresql.org/docs/9.5/static/logicaldecoding.html
I'll answer my own Q.
After doing much more reading, I can say there is an ACK-like mechanism there.
Under some conditions (e.g. at intervals), the server asks the logical replication consumer to report the last piece of data it persisted (i.e., flushed to disk or similar). Only then does the server treat the data up to that reported point as delivered for the given replication slot.
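With the PostgreSQL JDBC driver, for example, that report is made by setting the flushed LSN on the replication stream; a sketch (slot name, credentials, and the persistence step are assumptions):

    import java.nio.ByteBuffer;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;
    import java.util.concurrent.TimeUnit;
    import org.postgresql.PGConnection;
    import org.postgresql.PGProperty;
    import org.postgresql.replication.PGReplicationStream;

    public class LogicalConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            PGProperty.USER.set(props, "replicator");
            PGProperty.PASSWORD.set(props, "secret");
            PGProperty.REPLICATION.set(props, "database"); // open a replication connection
            PGProperty.ASSUME_MIN_SERVER_VERSION.set(props, "9.5");
            PGProperty.PREFER_QUERY_MODE.set(props, "simple");
            Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", props);

            PGReplicationStream stream = conn.unwrap(PGConnection.class)
                    .getReplicationAPI()
                    .replicationStream()
                    .logical()
                    .withSlotName("my_slot") // slot created beforehand via pg_create_logical_replication_slot
                    .start();

            while (true) {
                ByteBuffer msg = stream.readPending();
                if (msg == null) {
                    TimeUnit.MILLISECONDS.sleep(10L);
                    continue;
                }
                persistDurably(msg); // your own storage step (assumed): write + fsync
                // The "ACK": report how far we have really flushed. The server keeps
                // everything after this LSN and resends it if we reconnect.
                stream.setAppliedLSN(stream.getLastReceiveLSN());
                stream.setFlushedLSN(stream.getLastReceiveLSN());
            }
        }

        private static void persistDurably(ByteBuffer msg) { /* ... */ }
    }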

What methods are available on unix for pub sub IPC?

There are various options for IPC.
Over a network:
for client-server, can use TCP
for pub sub, can use UDP multicast
Locally:
for client-server, can use unix domain sockets
for pub sub, can use ???
I suppose what I'd be interested in is some kind of file descriptor that supports many readers (subscribers) and many writers (publishers) simultaneously. Is this usage pattern feasible/efficient on unix?
After much googling I haven't found a whole lot in the way of IPC multicast, so I have decided to write a program, pubsub, that takes as arguments a publisher address and a subscriber address, listens for and accepts connections on these two addresses, and then, for each payload received on a publisher connection, writes it to each of the subscriber connections. It wouldn't surprise me if this is inefficient or reinventing the wheel, but I have not come across a better solution.
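In case it is useful, here is a compact sketch of that fan-out design using AF_UNIX sockets (Java 16+; socket paths are illustrative, and framing and cleanup are omitted):

    import java.io.IOException;
    import java.net.StandardProtocolFamily;
    import java.net.UnixDomainSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Consumer;

    public class PubSubBroker {

        private static final Set<SocketChannel> subscribers = ConcurrentHashMap.newKeySet();

        public static void main(String[] args) throws IOException {
            ServerSocketChannel pubSrv = bind("/tmp/pubsub-pub.sock"); // illustrative paths
            ServerSocketChannel subSrv = bind("/tmp/pubsub-sub.sock");

            // Register every new subscriber connection.
            new Thread(() -> acceptLoop(subSrv, subscribers::add)).start();

            // For every publisher, copy its payloads to all current subscribers.
            acceptLoop(pubSrv, pub -> new Thread(() -> pump(pub)).start());
        }

        private static ServerSocketChannel bind(String path) throws IOException {
            ServerSocketChannel srv = ServerSocketChannel.open(StandardProtocolFamily.UNIX);
            srv.bind(UnixDomainSocketAddress.of(path));
            return srv;
        }

        private static void acceptLoop(ServerSocketChannel srv, Consumer<SocketChannel> onConnect) {
            while (true) {
                try {
                    onConnect.accept(srv.accept());
                } catch (IOException e) {
                    return;
                }
            }
        }

        // Fan one publisher's bytes out to every subscriber. Note this copies
        // raw stream chunks; discrete messages would need framing on top.
        private static void pump(SocketChannel pub) {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            try {
                while (pub.read(buf) != -1) {
                    buf.flip();
                    for (SocketChannel sub : subscribers) {
                        try {
                            sub.write(buf.duplicate()); // duplicate: independent position per subscriber
                        } catch (IOException e) {
                            subscribers.remove(sub); // drop dead subscribers
                        }
                    }
                    buf.clear();
                }
            } catch (IOException ignored) {
            }
        }
    }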
I was looking for solutions to a similar problem and found /dev/fanout. Fanout is a kernel module that replicates its input out to all processes reading from it. You can think of it as an IPC broadcast mechanism. It works well for small data payloads, according to the author. Multiple processes can write to the device and multiple processes can read from it. I am not sure about the atomicity of writes, though; small writes from multiple processes should occur atomically, as with FIFOs, etc.
More about Fanout:
http://compgroups.net/comp.linux.development.system/-dev-fanout-a-one-to-many-multi/2869739
http://www.linuxtoys.org/fanout/fanout.html
There are Posix message queues too. As man mq_overview puts it:
POSIX message queues allow processes to exchange data in the form of messages. This API is distinct from that provided by System V message queues (msgget(2), msgsnd(2), msgrcv(2), etc.), but provides similar functionality.
Message queues are created and opened using mq_open(3); this function returns a message queue descriptor (mqd_t), which is used to refer to the open message queue in later calls. Each message queue is identified by a name of the form /somename; that is, a null-terminated string of up to NAME_MAX (i.e., 255) characters consisting of an initial slash, followed by one or more characters, none of which are slashes. Two processes can operate on the same queue by passing the same name to mq_open(3).
Messages are transferred to and from a queue using mq_send(3) and mq_receive(3). When a process has finished using the queue, it closes it using mq_close(3), and when the queue is no longer required, it can be deleted using mq_unlink(3).
Queue attributes can be retrieved and (in some cases) modified using mq_getattr(3) and mq_setattr(3). A process can request asynchronous notification of the arrival of a message on a previously empty queue using mq_notify(3).
A message queue descriptor is a reference to an open message queue description (cf. open(2)). After a fork(2), a child inherits copies of its parent's message queue descriptors, and these descriptors refer to the same open message queue descriptions as the corresponding descriptors in the parent. Corresponding descriptors in the two processes share the flags (mq_flags) that are associated with the open message queue description.
Each message has an associated priority, and messages are always delivered to the receiving process highest priority first.
Message priorities range from 0 (low) to sysconf(_SC_MQ_PRIO_MAX) - 1 (high). On Linux, sysconf(_SC_MQ_PRIO_MAX) returns 32768, but POSIX.1 requires only that an implementation support at least priorities in the range 0 to 31; some implementations provide only this range.
A more friendly introduction by Michael Kerrisk is available here: http://man7.org/conf/lca2013/IPC_Overview-LCA-2013-printable.pdf

Slow Subscriber

I am a newbie to ZMQ.
ZMQ Version - 2.2.1
Ubuntu - 10.04
I am using the PUB-SUB pattern for communication between multiple publishers and multiple subscribers. A forwarder subscribes to data from the publishers and republishes it to all the subscribers.
Currently, three publishers are running, and each publisher sends 1000 messages per second via the PUB channel. The subscriber receives the data, stores it, and writes it to a database every second.
Because of the database work, the rate at which the subscriber consumes data falls behind; as a result, memory usage (RAM) grows by 6-7 MB every second, until the subscriber is finally killed by the OS due to OOM.
I tried using the ZMQ_HWM and ZMQ_SWAP options on both sockets of the forwarder, but the issue persists.
Is there any solution for this?
Overall your problem is that your database cannot keep up with your publisher. 0MQ cannot solve this for you. You need an architectural solution based on changing the behavior of your system, presumably the way you do inserts.
You have a few options:
Use a faster database
Use a faster database insert method
Write to a log which is processed asynchronously by another process
Change to a socket pattern that lets the receivers tell the senders that they are backed up, so the senders pause (if that's possible)
I think in your case the spool-to-disk-file option is the best.
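A minimal sketch of that spool-to-disk approach (the paths, the batching policy, and insertIntoDatabase() are all assumptions):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    public class SpoolingSubscriber {
        private static final Path SPOOL = Paths.get("/var/spool/sub/incoming.log");
        private static final Path BATCH = Paths.get("/var/spool/sub/batch.log");

        // Called from the ZMQ receive loop: appending to a file is cheap and
        // keeps the process's RAM flat no matter how far the DB falls behind.
        static synchronized void spool(String message) throws IOException {
            Files.write(SPOOL, (message + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // Run from a separate thread (or process): swap the spool file out,
        // then drain it into the database at whatever rate the DB can sustain.
        static void drainOnce() throws IOException {
            synchronized (SpoolingSubscriber.class) {
                if (Files.notExists(SPOOL)) {
                    return;
                }
                Files.move(SPOOL, BATCH, StandardCopyOption.REPLACE_EXISTING);
            }
            for (String line : Files.readAllLines(BATCH)) {
                insertIntoDatabase(line); // your existing insert path (assumed)
            }
            Files.delete(BATCH);
        }

        private static void insertIntoDatabase(String line) { /* ... */ }
    }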

Why don't I get all the data with my non-blocking Perl socket?

I'm using Perl sockets in AIX 5.3, Perl version 5.8.2
I have a server written with Perl sockets. There is an option called "Blocking", which can be set to 0 or 1. When I use Blocking => 0, run the server, and have the client send data (5000 bytes), I receive only 2902 bytes in one call. When I use Blocking => 1, I receive all the bytes in one call.
Is this how sockets work or is it a bug?
This is a fundamental part of sockets - or rather, TCP, which is stream-oriented. (UDP is packet-oriented.)
You should never assume that you'll get back as much data as you ask for, nor that there isn't more data available. Basically, more data can come at any time while the connection is open. (The read/recv/whatever call will probably return a specific value to mean "the other end closed the connection".)
This means you have to design your protocol to handle this - if you're effectively trying to pass discrete messages from A to B, two common ways of doing this are:
Prefix each message with a length. The reader first reads the length, then keeps reading the data until it's read as much as it needs.
Have some sort of message terminator/delimiter. This is trickier, as depending on what you're doing you may need to be aware of the possibility of reading the start of the next message while you're reading the first one. It also means "understanding" the data itself in the "reading" code, rather than just reading bytes arbitrarily. However, it does mean that the sender doesn't need to know how long the message is before starting to send.
(The other alternative is to have just one message for the whole connection - i.e. you read until the connection is closed.)
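As an illustration of the length-prefix option, here is what the reader side looks like (shown in Java rather than Perl; the 4-byte big-endian prefix is an assumed convention between sender and reader):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.Socket;

    class Framing {
        // Reads one length-prefixed message: first the 4-byte length, then
        // exactly that many payload bytes. readFully() loops internally until
        // every byte has arrived - the "keep reading" step described above.
        static byte[] readMessage(Socket socket) throws IOException {
            DataInputStream in = new DataInputStream(socket.getInputStream());
            int length = in.readInt();          // blocks until all 4 prefix bytes are read
            byte[] payload = new byte[length];
            in.readFully(payload);              // short reads are handled for you
            return payload;
        }
    }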
Blocking means that the socket waits until there is data before returning from a receive function. It's entirely possible there's a tiny wait at the end as well, to try to fill the buffer before returning, or it could just be a timing issue. It's also entirely possible that the non-blocking implementation returns one packet at a time, whether or not more are available. In short, no, it's not a bug, but the specific 'why' of it is the old cop-out: "it's implementation specific".