Kafka Producer - Trace Kerberos user

In a kerberized set-up with a Kafka cluster I can easily enforce that for topic X, Users A and B have the right to produce messages, and for topic Y, User C has the right to produce messages.
In terms of traceability, this gives me the guarantee that all messages on topic Y are produced by User C. However, I would also like to know which messages on topic X were produced by User A, and which messages were produced by User B.
Is there an option to tag each message with the user that produced it? With one remark perhaps, that I do not want to do this on the producer-side but would like to enforce it broker-side so I have an absolute guarantee that this user information is present.
Any help would be greatly appreciated.

I don't think there currently is a way to do that to be honest. The only two pluggable classes in the broker that I know of are the KafkaPrincipalBuilder and the Authorizer, both of which have the principal you are after, but do not have access to the actual records to add the principal in a header field or wherever one would want it.
And even if we look to the producer side, the two options there would be Interceptor and Serializer classes that you can customize, both of which have access to the records, but not the user principal.
If you actually want to enforce this I'm afraid you'll have to get a bit creative and do something like have each user sign every message with a private key in a header field, or something similar - when processing data on the other end you can then discard anything that doesn't contain a valid signature.
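A minimal sketch of that signing idea, assuming one RSA key pair per producing user and Kafka's record headers (available since 0.11); the producer-signature header name is made up for illustration:

import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.security.PrivateKey;
import java.security.Signature;

public class SigningHelper {

    // Sign the payload and attach the signature as a record header; consumers
    // verify it with the user's public key and discard records that fail.
    public static ProducerRecord<String, String> signed(
            String topic, String key, String value, PrivateKey privateKey) throws Exception {
        Signature sig = Signature.getInstance("SHA256withRSA");
        sig.initSign(privateKey);
        sig.update(value.getBytes(StandardCharsets.UTF_8));

        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        record.headers().add("producer-signature", sig.sign());
        return record;
    }
}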
Apart from that I think there is not much you can do short of actually changing Kafka itself to add this information - in general the handleProduceRequest method should be a good starting place, as there you have the requests that have already been authorized and can also access the request's session, which contains the principal. Records are only represented as ByteBuffers at that point though, so I'm not sure whether you can easily add headers there tbh.

Related

Avro messages within Avro messages: reasonable?

I want to do something crazy with Kafka and avro. Someone talk me off the ledge:
record Bundle {
string key;
array<bytes> msgs;
}
Producers individually serialize a bunch of messages that share a key, then serialize a bundle and post to a topic.
A generic Flattener service is configured by startup parameters to listen to 1...n kafka topics containing bundles, then blindly forward the bundled messages to configured output topics one at a time. (Blindly meaning it takes the bytes from the array and puts them on the wire.)
Use case:
I have services that respond to small operations (update record, delete record, etc.). At times, I want batches of ops that need to be guaranteed not to be interleaved with other ops for the same key.
To accomplish this, my thought was to position a Flattener in front of each of the services in question. Normal, one-off commands get stored in 1-item bundles; true batches are bundled into bigger ones.
I don't use a specific field type for the inner messages because I'd like to be able to re-use Flattener all over the place.
Does this make any sense at all? Potential drawbacks?
EDIT:
Each instance of the Flattener service would only be delivering messages of types known to the ultimate consumers, with schema_ids embedded in them.
The only reason array is not an array of a specific type is that I'd like to be able to re-use Flattener unchanged in front of multiple different services (just started with different environment variables / command line parameters).
I'm going to move my comment to an answer because I think it's reasonable to "talk you off the ledge" ;)
If you set up a Producer<String, GenericRecord> (change the Avro class as you wish), you already have a String key and Avro bytes as the value. This way, you won't need to embed anything.
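As a rough sketch of that setup (assuming Confluent's KafkaAvroSerializer and Schema Registry; the Op schema, topic name, and addresses are placeholders):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Op\",\"fields\":"
              + "[{\"name\":\"action\",\"type\":\"string\"}]}");
        GenericRecord op = new GenericData.Record(schema);
        op.put("action", "update");

        try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The String key preserves per-key ordering; the value is plain
            // Avro, so no Bundle wrapper is needed.
            producer.send(new ProducerRecord<>("ops", "some-key", op));
        }
    }
}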

Message routing in kafka

We're trying to build a platform using microservices that communicate async over kafka.
It would seem natural, the way I understood it, to have one topic per aggregate type in each microservice. So a microservice implementing user registration would publish user-related events into the topic "users".
Other microservices would listen to events created from the "users" microservices and implement their own logic and fill their DBs accordingly. The problem is that other microservices might not be interested in all the events generated by the user microservice but rather a subset of these events, like UserCreated only (without UsernameChanged... for example).
Using RabbitMq is easy since event handlers are invoked based on message type.
Did you ever implement message based routing/filtering over kafka?
Should we consume all the messages, deserialize them, and have the consumer ignore the unneeded ones? (sounds like overhead)
Should we forward these topics to Storm and redirect the messages to consumer-targeted topics? (sounds like overkill and unscalable)
Using partitions doesn't seem logical as a routing mechanism
Use a different topic for each of the standard object actions: Create, Read, Update, and Delete, with a naming convention like "UserCreated", "UserRead", etc. If you think about it, you will likely have a different schema for the objects in each: Create will require a valid object; Read will require some kind of filter; for Update you might want to handle incremental updates (add 10 to a specific field, etc.).
Putting actions with different schemas into the same topic makes deserialization difficult. If you're in a loosey-goosey language like JavaScript, that's no big deal, but in a strictly typed language like Scala, having different schemas in the same topic is problematic.
It'll also solve your problem -- you can listen for exactly the types of actions you want, no more, no less.
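For illustration, a minimal sketch of such a consumer (topic, group, and server names are placeholders): it subscribes only to the UserCreated topic, so UsernameChanged events never reach it at all.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class UserCreatedListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "projection-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe only to the action this service cares about.
            consumer.subscribe(Collections.singletonList("UserCreated"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("new user: %s%n", record.value());
                }
            }
        }
    }
}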

How to resequence after filtering for aggregation /Spring Integration/

I'm doing a project in Spring Integration and I have a big problem.
There are some filtering components in the flow and later in the flow I have an aggregation element.
The problem is that the filtering component does not support the "apply-sequence" property. It filters out some records without modifying the original sequence numbers, so the number of messages is reduced.
Later in the flow I need an aggregation, which fails to release elements since some messages have been filtered out.
I don't want to use any special routing elements which have apply-sequence property.
Can you suggest me any common solution for this type of filtering problem?
Thanks,
I'd say you misunderstand the behaviour of the filter and aggregator.
I guess you have some apply-sequence-aware component upstream, so all messages in that group receive several headers: correlationId - to group messages in the default aggregator; sequenceNumber - the index of the message; sequenceSize - the number of messages in the group.
A filter just checks messages against some condition and either sends them to the output-channel or applies the discard logic. It doesn't modify messages - and even if it could, that wouldn't sound like a good idea anyway.
Assume we have only two messages in the group. The first one passes the filter - we just send it to the aggregator. But the second is discarded and, yes, won't be sent to the aggregator. So the group is never released, because the sequenceSize is never reached.
To overcome your requirement you need some custom ReleaseStrategy on the aggregator (by default it is SequenceSizeReleaseStrategy). It could, for example, check some state in your system confirming that all messages in the group have been sent, independently of the true or false result after the filter. Or you could send some fake marker message for the same purpose and check for its presence in the group.
In that case you will just need to take care of the correlationId to group messages in the aggregator.
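A minimal sketch of the marker-message variant, assuming the upstream flow stamps the last message of each group with a custom endOfGroup header (the header name is invented for illustration):

import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.store.MessageGroup;
import org.springframework.messaging.Message;

public class MarkerAwareReleaseStrategy implements ReleaseStrategy {

    @Override
    public boolean canRelease(MessageGroup group) {
        // Release the group as soon as the marker message has arrived,
        // regardless of how many real messages survived the filter.
        for (Message<?> message : group.getMessages()) {
            if (Boolean.TRUE.equals(message.getHeaders().get("endOfGroup"))) {
                return true;
            }
        }
        return false;
    }
}

You would then reference a bean of this class from the release-strategy attribute of the aggregator.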
UPDATE
What is the suggested release strategy for such a scenario? Would it be a good strategy to use a timeout as the release strategy?
What I can say is that sometimes it is really difficult to find a good solution for some integration scenarios. Messaging is stateless by nature, so correlating and grouping an undetermined number of messages may be a problem.
You need to look at the requirements and the environment.
For example, when all your messages are processed in a single thread, you can safely send some fake marker message at the end directly to the aggregator and check for it from the ReleaseStrategy. That will work even when all the messages from the group have been discarded.
If you process those messages in parallel, or they are received from different threads, you really won't be able to determine the order of messages or the time each one needs to process.
In this case TimeoutCountSequenceSizeReleaseStrategy really can help. Of course, you will need to find a good timeframe compromise according to the requirements of your system.
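For reference, a sketch of wiring that strategy up as a bean; both constructor arguments (count threshold and timeout in milliseconds) are placeholders to tune for your system:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aggregator.TimeoutCountSequenceSizeReleaseStrategy;

@Configuration
public class AggregatorConfig {

    @Bean
    public TimeoutCountSequenceSizeReleaseStrategy releaseStrategy() {
        // Release when the group is complete or older than 30 seconds,
        // whichever comes first; the count threshold is effectively disabled.
        return new TimeoutCountSequenceSizeReleaseStrategy(Integer.MAX_VALUE, 30_000);
    }
}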

Which Solution Handles Publisher/Subscriber Scenario Better?

The scenario is publisher/subscriber, and I am looking for a solution that lets ONE producer send one message to MULTIPLE consumers in real-time. The more lightweight the solution that can handle this scenario, the better!
In the case of AMQP servers I've only checked out RabbitMQ, and to use a RabbitMQ server for the pub/sub pattern each consumer has to declare an anonymous, private queue and bind it to a fanout exchange. So in the case of a thousand users consuming one message in real-time, there will be a thousand or so anonymous queues handled by RabbitMQ.
But I really do not like that approach. It would be ideal if RabbitMQ could handle this pub/sub scenario with one queue, one message, and many consumers listening on that one queue!
What I want to ask is: which AMQP server or other type of solution (anything similar, including XMPP servers, Apache Kafka, or ...) handles the pub/sub pattern/scenario better and more efficiently than RabbitMQ, while (of course) consuming fewer server resources?
preferences in order of interest:
in the case of an AMQP-enabled server, handling the pub/sub scenario with only ONE queue or as few queues as possible (as explained)
handling thousands of consumers in a lightweight manner, consuming fewer server resources compared to other pub/sub solutions
clustering, tolerating failing of nodes
Many Language Bindings ( Python and Java at least)
easy to use and administer
I know my question may be VERY general, but I'd like to hear ideas and suggestions for the pub/sub case.
thanks.
In general, for RabbitMQ, if you put the user in the routing key, you should be able to use a single exchange and then a small number of queues (even a single one if you wanted, but you could divide them up by server or similar if that makes sense given your setup).
If you don't need guaranteed order (as one would for, say, guaranteeing that FK constraints wouldn't get hit for a sequence of changes to various SQL database tables), then there's no reason you can't have a bunch of consumers drawing from a single queue.
If you want a broadcast-message type of scenario, then that could perhaps be handled a bit differently. Instead of the single user in the routing key, which you could use for non-broadcast-type messages, have a special user type, say, __broadcast__, that no user could actually have, and have the users to broadcast to stored in the payload of the message along with the message itself.
Your message processing code could then take care of depositing that message in the database (or whatever the end destination is) across all of those users.
Edit in response to comment from OP:
So the routing key might look something like message.[user], where [user] could be the actual user if it were a point-to-point message, or a special __broadcast__ user (or a similar user name that an actual user would not be allowed to register) indicating a broadcast-style message.
You could then place the users to which the message should be delivered in the payload of the message, and the message content (also in the payload) could be delivered to each user. The mechanism for doing that would depend on your end destination, i.e. do the messages end up getting stored in Postgres, or MongoDB, or similar?
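A sketch of the publishing side of that scheme with the RabbitMQ Java client (exchange name, user names, and the JSON payload shape are all made up for illustration):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

public class UserMessagePublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {

            // One topic exchange for everything; queues bind with patterns
            // like "message.*" or "message.alice".
            channel.exchangeDeclare("messages", "topic", true);

            // Point-to-point: the actual user name in the routing key.
            channel.basicPublish("messages", "message.alice", null,
                    "hello alice".getBytes(StandardCharsets.UTF_8));

            // Broadcast: a reserved pseudo-user in the routing key; the real
            // recipient list travels in the payload.
            String broadcast = "{\"to\":[\"alice\",\"bob\"],\"body\":\"hello all\"}";
            channel.basicPublish("messages", "message.__broadcast__", null,
                    broadcast.getBytes(StandardCharsets.UTF_8));
        }
    }
}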

Specify what type of queue is that

I'm very new to messaging systems, and I was trying to find my answer on http://www.rabbitmq.com/tutorials/. I'm pretty sure it should be over there, but so far I've got a little confused by all the bindings, queues, and exchanges.
So I'm looking for an answer to the question of how to specify, on the producer side, what type of "queue" it is (sorry if I should use another word for this). To be more clear, I'll give you an example:
I want my consumer to subscribe to one "queue", and then once it receives a message, perform some operation based on what's inside. Let's say if the message contains a picture, do something; if it is text, do something else.
I was thinking my producer should add something like type:foo to the payload, and then the consumer would look for this type. But I hope there is a better solution for this - something like adding a header to the message.
Thank you.
If your consumer has to do different tasks for different types of messages, then it would be better to create one distinct consumer per task.
That way, you can easily create one queue for each type of message and make each consumer consume messages from the right queue.
Your producer can send the message to the correct queue either directly or by using RabbitMQ routing.
Take a look at the "Routing" tutorial on the RabbitMQ website, it seems to match your use-case : http://www.rabbitmq.com/tutorials/tutorial-four-python.html
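Along the lines of that tutorial, a sketch with the RabbitMQ Java client (exchange, queue, and routing-key names are illustrative): the producer publishes to a direct exchange with the message type as the routing key, and each consumer binds its own queue to just the type it handles.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;

public class TypedQueues {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        channel.exchangeDeclare("media", "direct");
        // One queue per message type, one consumer per queue.
        channel.queueDeclare("pictures", true, false, false, null);
        channel.queueBind("pictures", "media", "picture");
        channel.queueDeclare("texts", true, false, false, null);
        channel.queueBind("texts", "media", "text");

        // Producer side: the routing key says what kind of message it is.
        channel.basicPublish("media", "text", null,
                "hello".getBytes(StandardCharsets.UTF_8));

        // Consumer side: this consumer only ever sees text messages.
        DeliverCallback onText = (tag, delivery) ->
                System.out.println("text: " + new String(delivery.getBody(), StandardCharsets.UTF_8));
        channel.basicConsume("texts", true, onText, tag -> { });
    }
}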