Kafka Message Keys with Composite Values - apache-kafka

I am working on a system that will produce Kafka messages. These messages will be organized into topics that more or less represent database tables. Many of these tables have composite keys, and this aspect of the design is out of my control. The goal is to prepare these messages in a way that they can be easily consumed by common sink connectors, without a lot of manipulation.
I will be using the Schema Registry and Avro format for all of the obvious advantages. Having the entire "row" expressed as a record in the message value is fine for upsert operations, but I also need to support deletes. From what I can tell, this means my messages need a key so I can have "tombstone" messages. Also keep in mind that I want to avoid any sort of transforms unless absolutely necessary.
In a perfect world, the message key would be a "record" that included strongly-typed key-column values, and the message value would have the other column values (both controlled by the Schema Registry). However, it seems like a lot of the tooling around Kafka expects message keys to be a single, primitive value. This makes me wonder if I need to compute a key where I concatenate my multiple key columns into a single string value and keep the individual columns in my message value. Is this right, or am I missing something? What other options do I have?

I'm assuming that you know the relationship between the message key and partition assignment.
As per my understanding, there is nothing that stops you from using a complex type like a STRUCT as a key, with or without a key schema. Please refer to the API here. If you are using an out-of-the-box connector that does not support a complex type as the key, then you may have to write your own Single Message Transformation (SMT) to move the key attributes into the value.
The approach that you mentioned - concatenating columns to create the key while keeping the same columns in the value - would work in many cases if you don't want to write code. The only downside I can see is that your messages would be larger than necessary. If you don't have a partition assignment or ordering requirement, then the message can have no key or a random key.
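For illustration, here is a tiny sketch of that concatenation approach; the column names and delimiter are invented, and the same columns would also remain as fields in the Avro value record:

// Illustrative helper only: build a single string key from the key columns.
// Pick a delimiter that cannot appear in the column values, otherwise two
// different composites could collapse into the same key.
static String compositeKey(long tenantId, String customerCode) {
    return tenantId + "|" + customerCode;
}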

I wanted to follow-up with an answer that solved my issue:
The strategy I mentioned of using a concatenated string technically worked. However, it certainly wasn't very elegant.
My original issue in using a structured key was that I wasn't using the correct converter for deserializing the key, which led to other errors. Once I used the Avro converter, I was able to get my multi-part key and use it effectively.
Both approaches, when implemented appropriately, allowed me to produce valid tombstone messages that could represent deletes.
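For anyone landing here later, below is a minimal producer-side sketch of the structured-key approach, assuming Confluent's KafkaAvroSerializer and Schema Registry; the topic name, field names, and addresses are placeholders rather than anything from the original setup. A populated Avro record key combined with a null value is the tombstone that represents a delete.

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompositeKeyTombstone {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder broker
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry
        props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // The key schema carries the composite key columns as strongly-typed fields.
        Schema keySchema = SchemaBuilder.record("CustomerKey").fields()
                .requiredLong("tenant_id")
                .requiredString("customer_code")
                .endRecord();

        GenericRecord key = new GenericData.Record(keySchema);
        key.put("tenant_id", 42L);
        key.put("customer_code", "ACME");

        try (KafkaProducer<Object, Object> producer = new KafkaProducer<>(props)) {
            // A non-null key with a null value is the tombstone that signals the delete.
            producer.send(new ProducerRecord<>("customer", key, null));
        }
    }
}

On the Connect sink side, the matching piece is setting key.converter to Confluent's Avro converter (io.confluent.connect.avro.AvroConverter) so that the record key deserializes correctly, which was the missing step described above.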

Related

Is it possible to discard records in the Kafka Connect value converter (de-serialization)?

I have a custom binary format for my messages in Kafka (protobuf), and I want to avoid the processing time spent de-serializing those messages.
My idea is to somehow discard the messages I do not want in the value converter, during de-serialization.
I'm trying to write a custom value converter that would only process certain messages based on some headers, so I can avoid the cost of deserializing all of the messages.
Up to now I have a sort of filter transformation to discard those messages, but I wanted to avoid even processing them and really discard them in the value converter itself. The transformation, if I understood correctly, always happens after the converters.
I tried to just return null from it, but that failed: the consumer crashed because the message becomes null. I was wondering if there is a way of doing this and, if yes, any known example?
If not, I can of course just return an empty SchemaAndValue, but I was wondering if there was a nicer way, because that way I still need to return something and then filter it out with a transformation.
EDIT: Based on the answer, which is what I was looking for, the easier way is to simply fall back to the ByteArrayConverter for the messages I'm not interested in.
Filtering is a type of processing, so a transform is the correct way to do this.
If you mean that you want to prevent deserialization, and you're using some custom binary format and filtering based on its content, then using record headers might be a better way to exclude events instead. Then use the ByteArrayConverter as a pass-through.
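To make that concrete, here is a rough sketch (not a drop-in implementation) of a converter that checks a hypothetical "msg-type" record header and only pays the protobuf deserialization cost for messages it cares about, delegating to ByteArrayConverter for everything else so a later filter transform can drop them cheaply. It assumes a Kafka version whose Converter interface has the header-aware toConnectData overload (2.4+); the real protobuf converter is left as a stand-in.

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.connect.converters.ByteArrayConverter;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

public class SelectiveProtobufConverter implements Converter {

    private final ByteArrayConverter passThrough = new ByteArrayConverter();
    // Stand-in for the real protobuf converter; replace with your own implementation.
    private final Converter protobufConverter = new ByteArrayConverter();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        passThrough.configure(configs, isKey);
        protobufConverter.configure(configs, isKey);
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        return passThrough.fromConnectData(topic, schema, value);
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // Without headers there is nothing cheap to decide on, so convert normally.
        return protobufConverter.toConnectData(topic, value);
    }

    @Override
    public SchemaAndValue toConnectData(String topic, Headers headers, byte[] value) {
        Header type = headers.lastHeader("msg-type");   // hypothetical header name
        boolean wanted = type != null && type.value() != null
                && "wanted".equals(new String(type.value(), StandardCharsets.UTF_8));
        if (!wanted) {
            // Skip protobuf deserialization entirely; a downstream filter transform
            // can drop these records based on the same header.
            return passThrough.toConnectData(topic, value);
        }
        return protobufConverter.toConnectData(topic, value);
    }
}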

Kafka Streams WindowStore Fetch Record Ordering

The Kafka Streams 2.2.0 documentation for the WindowStore and ReadOnlyWindowStore method fetch(K key, Instant from, Instant to) states:
For each key, the iterator guarantees ordering of windows, starting from the oldest/earliest available window to the newest/latest window.
None of the other fetch methods state this (except the deprecated fetch(K key, long from, long to)), but do they offer the same guarantee?
Additionally, is there any guarantee on ordering of records within a given window? Or is that up to the underlying hashing collection (I assume) implementation and handling of possible hash collisions?
I should also note that we built the WindowStore with retainDuplicates() set to true. So a single key would have multiple entries within a window. Unless we're using it wrong; which I guess would be a different question...
The other methods don't have ordering guarantees, because the order depends on the byte-order of the serialized keys. It's hard to reason about this ordering for Kafka Streams, because the serializers are provided by the user.
I should also note that we built the WindowStore with retainDuplicates() set to true. So a single key would have multiple entries within a window. Unless we're using it wrong; which I guess would be a different question...
You are using it wrong :) -- you can store different keys for the same window by default. If you enable retainDuplicates() you can store the same key multiple times for the same window.
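As a reference point, here is a minimal sketch (store name, serdes, and sizes are made up) of building a persistent window store with retainDuplicates enabled and reading it back with the fetch() variant discussed above. With retainDuplicates set to true, each put for the same key and window is kept as a separate entry, and fetch(key, from, to) walks the windows oldest-first.

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class WindowStoreSketch {

    // Store builder you would register with the topology; names and sizes are illustrative.
    static final StoreBuilder<WindowStore<String, Long>> BUILDER =
            Stores.windowStoreBuilder(
                    Stores.persistentWindowStore(
                            "my-window-store",
                            Duration.ofHours(24),    // retention
                            Duration.ofMinutes(5),   // window size
                            true),                   // retainDuplicates: keep every entry per key/window
                    Serdes.String(),
                    Serdes.Long());

    // Called from a Processor/Transformer that has obtained the store from the context.
    static void readBack(WindowStore<String, Long> store, Instant from, Instant to) {
        // fetch(key, from, to) iterates windows for this key from oldest to newest.
        try (WindowStoreIterator<Long> iter = store.fetch("some-key", from, to)) {
            while (iter.hasNext()) {
                KeyValue<Long, Long> entry = iter.next();   // entry.key is the window start timestamp
                System.out.println("window start " + entry.key + " -> " + entry.value);
            }
        }
    }
}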

Can we change data type of dimension post ingestion in Druid

We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
1. How does Druid support schemaless input? Let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify new dimensions, identify their data types, and ingest them. Is there any way to achieve this?
2. How does Druid support data type changes? Let's say that at some point (say, after ingesting 100 GB of data) there is a need to change the data type of a dimension from string to long, or from long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over the docs but could not get a substantial overview of either use case.
For question 1, I'd ingest everything as a string and figure it out later. It should be possible to query string columns in Druid as numbers.
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
1. Consider values to be zero; do not try to parse string values. This seems to be the current behaviour.
2. Try to parse string values, and consider values to be zero if they are not parseable, or null, or multi-valued.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr + how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x" even though you might think they should behave the same.
You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888
For question 2, I think it is necessary to reindex the data:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps
1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec, and Druid will treat every column that is not the timestamp as a dimension.
More detail about this approach can be found here:
Druid Schemaless Ingestion
2) For the second question, you can make changes to the schema, and Druid will create new segments with the new data type, while your old segments will still use the old data type.
If you want all of your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing all segments: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html

Can't change partitioning scheme for actors

While playing with Azure Service Fabric actors, here is the weird thing I've recently found out: I can't change the default settings for partitioning. If I try to, say, set Named partitioning or change the low/high key for UniformInt64, it gets overwritten each time I build my project in Visual Studio. There is no problem doing this for a stateful service; it only happens with actors. No errors, no records in the Event Log, no nothing... I've found just one single reference to the same problem on the Internet -
https://social.msdn.microsoft.com/Forums/vstudio/en-US/4edbf0a3-307b-489f-b936-43af9a365a0a/applicationmanifestxml-overwritten-on-each-build?forum=AzureServiceFabric
But I haven't seen any explanations to that - neither on MSDN, nor in official documentation. Any ideas? Would it really be 'by design'?
P.S.
Executing just the PowerShell script to deploy the app does allow me to set the scheme the way I want. Still, it's frustrating not to be able to do this in VS. There's probably a good reason for it... there should be, right? :)
Reliable Services can be created with different partition schemes and partition key ranges. The Actor Service uses the Int64 partitioning scheme with the full Int64 key range to map actors to partitions.
Every ActorId is hashed to an Int64, which is why the actor service must use an Int64 partitioning scheme with the full Int64 key range. However, custom ID values can be used for an ActorId, including GUIDs, strings, and Int64s.
When using GUIDs and strings, the values are hashed to an Int64. However, when explicitly providing an Int64 to an ActorId, the Int64 will map directly to a partition without further hashing. This can be used to control which partition actors are placed in.
(Source)
This ActorId => PartitionKey translation strategy doesn't work if your partitions are named.

DDS Keyed Topics

I am currently using RTI DDS on a system where we will have one main topic for multiple items, such as a car topic with multiple VIN numbers. Given this design, I am trying to make a "keyed" topic, which is basically a topic that has a member acting as a key (kind of like the primary key in a database table); in this example, that would be the VIN of each car. To implement the keyed topic I am using an IDL file, which is as follows:
const string CAR_TOPIC = "CAR";

enum ALARMSTATUS {
    ON,
    OFF
};

struct keys {
    long vin; //@key
    string make;
    ALARMSTATUS alarm;
};
When I run the IDL file through the rtiddsgen tool to generate C, Java, etc. files from the IDL, the only thing I can do is run the program and see
Writing keys, count 0
Writing keys, count 1 ...
and
keys subscriber sleeping for 4 sec...
Received:
vin: 38
make:
alarm : ON
keys subscriber sleeping for 4 sec...
Received:
vin: 38
make:
alarm : ON ...
This makes it hard to see how the keyed topics work and whether they are really working at all. Does anyone have any input on what to do with the files generated from the IDL to make the program more functional? Also, I never see the topic CAR, so I am not sure I am using the right syntax to set the topic for DDS.
When you say "the only thing I can do is run the program", it is not clear what "the" program is. I do not recognize the exact output that you give, so did you adjust the code of the generated example?
Anyway, responding to some of your remarks:
Thus making it hard to see how the keyed topics work and if they are really working at all.
The concept of keys is most clearly visible when you have values for multiple instances (that is, different key values) present simultaneously in your DataReader. This is comparable to having a database table containing multiple rows at the same time. So in order to demonstrate the key concept, you will have to assign different values to the key fields on the DataWriter side and write() the resulting samples. This does not happen by default in the generated examples, so you have to adjust the code to achieve that.
On the DataReader side, you will have to make sure that multiple values remain stored to demonstrate the effect. This means that you should not do a take() (which is similar to a "destructive read"), but a read(). This way, the number of values in your DataReader will grow in line with the number of distinct key values that you wrote.
Note that in real life, you should not have an ever-growing number of key values, just like you do not want a database table to contain an ever-growing number of rows.
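To illustrate, here is a rough Java sketch of those two adjustments, assuming the class names that rtiddsgen produces for the IDL above (keys, keysDataWriter, keysDataReader, keysSeq, ALARMSTATUS) and leaving the rest of the generated example's setup in place:

import com.rti.dds.infrastructure.InstanceHandle_t;
import com.rti.dds.infrastructure.RETCODE_NO_DATA;
import com.rti.dds.infrastructure.ResourceLimitsQosPolicy;
import com.rti.dds.subscription.InstanceStateKind;
import com.rti.dds.subscription.SampleInfo;
import com.rti.dds.subscription.SampleInfoSeq;
import com.rti.dds.subscription.SampleStateKind;
import com.rti.dds.subscription.ViewStateKind;

public class KeyedTopicSketch {

    // Publisher side: vary the key field (vin) so several instances exist at once.
    static void writeSomeCars(keysDataWriter writer) {
        keys instance = new keys();
        for (int count = 0; count < 9; ++count) {
            instance.vin = 100 + (count % 3);        // three distinct key values
            instance.make = "make-" + instance.vin;
            instance.alarm = ALARMSTATUS.ON;
            writer.write(instance, InstanceHandle_t.HANDLE_NIL);
        }
    }

    // Subscriber side: read() instead of take(), so samples stay in the DataReader
    // and you can see one (or more) entries per distinct key value.
    static void readAllCars(keysDataReader reader) {
        keysSeq dataSeq = new keysSeq();
        SampleInfoSeq infoSeq = new SampleInfoSeq();
        try {
            reader.read(dataSeq, infoSeq,
                    ResourceLimitsQosPolicy.LENGTH_UNLIMITED,
                    SampleStateKind.ANY_SAMPLE_STATE,
                    ViewStateKind.ANY_VIEW_STATE,
                    InstanceStateKind.ANY_INSTANCE_STATE);
            for (int i = 0; i < dataSeq.size(); ++i) {
                SampleInfo info = (SampleInfo) infoSeq.get(i);
                if (info.valid_data) {
                    System.out.println(dataSeq.get(i));
                }
            }
        } catch (RETCODE_NO_DATA noData) {
            // Nothing received yet.
        } finally {
            reader.return_loan(dataSeq, infoSeq);
        }
    }
}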
Also I never see the topic CAR so I am not sure I am using the right syntax to set the topic for the DDS.
Check out the piece of code that creates the Topic. The method name depends on the language you use, but should have something like create_topic() in it. The second parameter to that call is the name of the Topic. In general, the IDL constant CAR_TOPIC that you defined will not be automatically used as the name of the Topic, you have to indicate that in the code.
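For reference, in the generated Java code the type registration and Topic creation look roughly like the fragment below; the domain id is arbitrary, and CAR_TOPIC.VALUE is the constant rtiddsgen typically generates for the IDL const string, which has to be passed explicitly as the Topic name:

import com.rti.dds.domain.DomainParticipant;
import com.rti.dds.domain.DomainParticipantFactory;
import com.rti.dds.infrastructure.StatusKind;
import com.rti.dds.topic.Topic;

// Create a participant (domain 0 assumed), register the generated type,
// and create the Topic under the name from the IDL constant.
DomainParticipant participant =
        DomainParticipantFactory.TheParticipantFactory.create_participant(
                0,
                DomainParticipantFactory.PARTICIPANT_QOS_DEFAULT,
                null,                            // no listener
                StatusKind.STATUS_MASK_NONE);

String typeName = keysTypeSupport.get_type_name();
keysTypeSupport.register_type(participant, typeName);

Topic topic = participant.create_topic(
        CAR_TOPIC.VALUE,                         // "CAR"; the constant is not applied automatically
        typeName,
        DomainParticipant.TOPIC_QOS_DEFAULT,
        null,                                    // no listener
        StatusKind.STATUS_MASK_NONE);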
Depending on the example you are running, you could try -h to get some extra flags to use. You might be able to increase verbosity to see the name of the Topic being created, or set the topic name from the command line.
If you want to verify the name of the Topic in your system, you could use rtiddsspy to watch the data flowing. Its output includes the names of the Topics it discovers.