Avro messages within Avro messages: reasonable? - apache-kafka

I want to do something crazy with Kafka and Avro. Someone talk me off the ledge:
record Bundle {
    string key;
    array<bytes> msgs;
}
Producers individually serialize a bunch of messages that share a key, then serialize a bundle and post to a topic.
A generic Flattener service is configured by startup parameters to listen to 1..n Kafka topics containing bundles, then blindly forward the bundled messages to configured output topics one at a time. ("Blindly" meaning it takes the bytes from the array and puts them on the wire.)
Use case:
I have services that respond to small operations (update record, delete record, etc.). At times, I want batches of ops that are guaranteed not to be interleaved with other ops for the same key.
To accomplish this, my thought was to position a Flattener in front of each of the services in question. Normal, one-off commands get stored in 1-item bundles; true batches are bundled into bigger ones.
I don't use a specific field type for the inner messages, because I'd like to be able to re-use Flattener all over the place.
Does this make any sense at all? Potential drawbacks?
EDIT:
Each instance of the Flattener service would only be delivering messages of types known to the ultimate consumers, with schema_ids embedded in them.
The only reason array is not an array of a specific type is that I'd like to be able to re-use Flattener unchanged in front of multiple different services (just started with different environment variables / command line parameters).
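For illustration, the forwarding step described above could look roughly like the sketch below. It leaves out Kafka and Avro entirely: `Bundle` is an in-memory stand-in for the decoded record, and `send` is a hypothetical callback where a real producer call would go. Note that the no-interleaving property would still depend on only one Flattener instance writing a given key to the output topic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Bundle:
    """In-memory stand-in for the decoded Avro Bundle record."""
    key: str
    msgs: Tuple[bytes, ...]


def flatten(bundle: Bundle) -> List[Tuple[str, bytes]]:
    """Turn one bundle into (key, payload) pairs, preserving order."""
    return [(bundle.key, payload) for payload in bundle.msgs]


def forward(bundles: Iterable[Bundle], send: Callable[[str, bytes], None]) -> int:
    """Forward every inner message, one at a time, through `send`.

    `send` is a hypothetical hook; the payload bytes are passed through
    untouched, which is the "blind" part of the design.
    """
    count = 0
    for bundle in bundles:
        for key, payload in flatten(bundle):
            send(key, payload)
            count += 1
    return count
```

Because `flatten` never looks inside the payloads, the same code works in front of any service, which is the re-use argument made in the question.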

I'm going to move my comment to an answer because I think it's reasonable to "talk you off the ledge" ;)
If you set up a Producer<String, GenericRecord> (change the Avro class as you wish), you already have a String key and Avro bytes as the value. This way, you won't need to embed anything.

Related

Can the Kafka Connect JDBC Sink dump raw data?

Partly for testing and debugging, but also to work around an issue we are seeing in a topic where we are unable to change the producer, I would like to be able to store the value as a string in a CLOB in a database table.
I have this working as a Java-based consumer, but I am looking at whether this could be achieved using Kafka Connect.
Everything I have read says you need a schema, the reasoning being that otherwise it would not know how to process the data into columns (which makes sense). But I don't want to do any processing of the data (which could be JSON but might just be text); I just want to treat the whole value as a string and load it into one column.
Is there any way this can be done within the Connect config, or am I looking at adding extra processing to update the message (in which case the Java client is probably going to end up being simpler)?
No, the JDBC Sink connector requires a schema to work. You could modify the source code to add in this behaviour.
I would personally try to stick with Kafka Connect for streaming data to a database since it does all the difficult stuff (scale out, restarts, etc etc etc) very well. Depending on the processing that you're talking about, it could well be that Single Message Transform would be very applicable, since they fit into the Kafka Connect pipeline. Or for more complex processing, Kafka Streams or ksqlDB.

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, like exactly-once processing and at-least-once processing. The settings are scattered here and there. There doesn't seem to be a single place that documents the properties that need to be configured, roughly, for exactly-once and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I mentioned before, what are the minimum settings to configure to provide exactly-once, at-most-once and at-least-once processing?
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
- Use a single-writer per partition and every time you get a network
  error check the last message in that partition to see if your last
  write succeeded
- Include a primary key (UUID or something) in the message and
  deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
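As a rough sketch of the last alternative in the quote (storing the offset with the data and deduplicating on topic/partition/offset), consider the following. The `Record` and `IdempotentSink` names are made up for illustration, and the dict stands in for a database table with a unique key on those three columns.

```python
from typing import Dict, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical view of a consumed message and its log coordinates."""
    topic: str
    partition: int
    offset: int
    value: bytes


class IdempotentSink:
    """Stores each record at most once, keyed by (topic, partition, offset).

    The dict is an assumption standing in for a table with a unique
    constraint on the same three fields.
    """

    def __init__(self) -> None:
        self.rows: Dict[Tuple[str, int, int], bytes] = {}

    def write(self, record: Record) -> bool:
        """Return True if the record was stored, False if it was a replay."""
        coords = (record.topic, record.partition, record.offset)
        if coords in self.rows:
            return False
        self.rows[coords] = record.value
        return True
```

If the consumer restarts from an older checkpoint and re-reads a message, `write` simply reports a replay instead of storing a duplicate row.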

Understanding Persistent Entities with streams of data

I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to subscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I don't understand is this: if I model my aggregate root as, for example, a List of TwitterMessages, then after running for some time this aggregate root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory, since the goal of this one service is just to persist each incoming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as a Persistent Entity for a stream of messages without it consuming unlimited resources? Is there any example code showing this usage of Lagom?
Event sourcing is a good default go to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages; rather, it could just be NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state; you're just emitting events in response to commands, with no invariants or anything. And so you're not really using the Lagom persistent entity API for what it was made for. Nevertheless, it may make sense to use it this way anyway: it's a high-level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of. If you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially. So maybe you could expect to handle 20 tweets a second. If you expect it to ever be more than that, then you're using the wrong approach, and you'll need to, at a minimum, distribute your tweets across multiple entities.
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.
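To make the "distribute your tweets across multiple entities" idea concrete, here is a minimal sketch of picking an entity id from a stable hash of the user id. The function name, the `tweets-N` id format and the default shard count are all assumptions for illustration, not anything Lagom prescribes.

```python
import hashlib


def entity_id_for(user_id: str, shard_count: int = 16) -> str:
    """Map a user id to one of `shard_count` hypothetical entity ids.

    A cryptographic hash is used only because it is stable across
    processes and restarts, unlike Python's built-in hash() for strings.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"tweets-{shard}"
```

All tweets from one user land on the same entity, so per-user ordering is kept while the overall load is spread over `shard_count` entities.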

Message routing in kafka

We're trying to build a platform using microservices that communicate asynchronously over Kafka.
It would seem natural, the way I understood it, to have one topic per aggregate type in each microservice. So a microservice implementing user registration would publish user-related events into the topic "users".
Other microservices would listen to events created by the "users" microservice, implement their own logic and fill their DBs accordingly. The problem is that other microservices might not be interested in all the events generated by the user microservice but rather in a subset of these events, like UserCreated only (without UsernameChanged, for example).
With RabbitMQ this is easy, since event handlers are invoked based on message type.
Did you ever implement message-based routing/filtering over Kafka?
Should we consume all the messages, deserialize them and ignore unneeded ones in the consumer? (sounds like overhead)
Should we forward these topics to Storm and redirect these messages to consumer-targeted topics? (sounds like overkill and unscalable)
Using partitions doesn't seem logical as a routing mechanism.
Use a different topic for each of the standard object actions: Create, Read, Update, and Delete, with a naming convention like "UserCreated", "UserRead", etc. If you think about it, you will likely have a different schema for the objects in each. Created will require a valid object; Read will require some kind of filter; Update might need to handle incremental updates (add 10 to a specific field, etc.).
If the different actions have different schemas, putting them in one topic makes deserialization difficult. If you're in a loosey-goosey language like JavaScript, OK, no big deal. But in a strictly typed language like Scala, having different schemas in the same topic is problematic.
It'll also solve your problem: you can listen for exactly the types of actions you want, no more, no less.
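A toy, in-memory sketch of the topic-per-event-type idea follows. `TopicRouter` is a made-up stand-in for the broker, and the assumption that each event carries its type in a `"type"` field is mine, not something from the question.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class TopicRouter:
    """In-memory stand-in for a broker with one topic per event type."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler for a single topic, e.g. "UserCreated"."""
        self._subscribers[topic].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver the event to subscribers of the topic named by its type.

        Returns how many handlers received it.
        """
        topic = str(event["type"])
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

A service that only subscribes to "UserCreated" never sees, or has to deserialize, a "UsernameChanged" event.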

Which Solution Handles Publisher/Subscriber Scenario Better?

The scenario is publisher/subscriber, and I am looking for a solution that can deliver one message generated by ONE producer to MULTIPLE consumers in real time. The more lightweight the solution, the better!
Among AMQP servers I've only checked out RabbitMQ. With RabbitMQ's pub/sub pattern, each consumer declares an anonymous, private queue and binds it to a fanout exchange, so with a thousand users consuming one message in real time, RabbitMQ ends up handling a thousand or so anonymous queues.
I really do not like this approach. It would be ideal if RabbitMQ could handle this pub/sub scenario with one queue, one message, and many consumers listening on that one queue!
What I want to ask is: which AMQP server or other type of solution (including XMPP servers, Apache Kafka, or anything similar) handles the pub/sub scenario more efficiently than RabbitMQ, consuming fewer server resources?
Preferences in order of interest:
for an AMQP-enabled server, handling the pub/sub scenario with only ONE queue or a small number of queues (as explained)
handling thousands of consumers in a lightweight manner, consuming fewer server resources than other pub/sub solutions
clustering, tolerating node failures
many language bindings (Python and Java at least)
easy to use and administer
I know my question may be VERY general, but I'd like to hear ideas and suggestions for the pub/sub case.
thanks.
In general, for RabbitMQ, if you put the user in the routing key, you should be able to use a single exchange and then a small number of queues (even a single one if you wanted, but you could divide them up by server or similar if that makes sense given your setup).
If you don't need guaranteed order (as one would for, say, guaranteeing that FK constraints wouldn't get hit for a sequence of changes to various SQL database tables), then there's no reason you can't have a bunch of consumers drawing from a single queue.
If you want a broadcast-message type of scenario, then that could perhaps be handled a bit differently. Instead of the single user in the routing key, which you could use for non-broadcast-type messages, have a special user type, say, __broadcast__, that no user could actually have, and have the users to broadcast to stored in the payload of the message along with the message itself.
Your message processing code could then take care of depositing that message in the database (or whatever the end destination is) across all of those users.
Edit in response to comment from OP:
So the routing key might look something like this: message.[user], where [user] could be the actual user if it were a point-to-point message, or a special __broadcast__ user (or similar user name that an actual user would not be allowed to register), which would indicate a broadcast-style message.
You could then place the users to which the message should be delivered in the payload of the message, and then that message content (which would also be in the payload) could be delivered to each user. The mechanism for doing that would depend on what your end destination is, i.e. do the messages end up getting stored in Postgres, MongoDB or similar?
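The routing-key convention above could be resolved on the consumer side with something like this sketch. It doesn't touch RabbitMQ at all; the `"recipients"` payload field is a hypothetical name for wherever the broadcast targets are stored.

```python
from typing import Any, Dict, List

BROADCAST_USER = "__broadcast__"  # reserved name no real user may register
PREFIX = "message."


def recipients(routing_key: str, payload: Dict[str, Any]) -> List[str]:
    """Work out who a message is for from its routing key and payload.

    Point-to-point keys look like "message.alice"; broadcast keys look
    like "message.__broadcast__" and carry the targets in the payload
    under the (assumed) "recipients" field.
    """
    if not routing_key.startswith(PREFIX):
        raise ValueError(f"unexpected routing key: {routing_key!r}")
    user = routing_key[len(PREFIX):]
    if not user:
        raise ValueError("routing key has no user part")
    if user == BROADCAST_USER:
        return list(payload.get("recipients", []))
    return [user]
```

The message-processing code can then loop over the returned list and write the message body to each user's destination.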