Why base64 encode/decode in Kafka REST Proxy?

A producer serializes messages and sends them to the broker as byte arrays, and consumers deserialize those byte arrays. The broker always stores and passes along byte arrays. That is how I understand it.
But when you use the REST Proxy in Kafka, the producer encodes the message with base64, and the consumer decodes those base64 messages.
A Python example of a producer and a consumer:
import base64

# Producer using the REST Proxy: the raw key/value bytes are base64-encoded
# so they can be embedded in the JSON request body
payload = {"records": [{
    "key": base64.b64encode(b"firstkey").decode("ascii"),
    "value": base64.b64encode(b"firstvalue").decode("ascii"),
}]}

# Consumer using the REST Proxy: "message" is one record from the JSON response;
# decode its base64 fields back into the original bytes
print("Message Key: " + base64.b64decode(message["key"]).decode("utf-8"))
Why do we send messages to the broker in base64 instead of as byte arrays?
When using the REST Proxy, does the broker store messages in base64 format?

When a producer wants to send the message 'Man', it serializes it into bytes (bits). A broker will store it as 010011010110000101101110. When a consumer gets this message, it deserializes it back to Man.
However, according to the Confluent documentation:
Data formats - The REST Proxy can read and write data using JSON, raw bytes encoded with base64 or using JSON-encoded Avro.
Therefore, a producer using the REST Proxy base64-encodes the message Man into TWFu and sends that in the HTTP request; the REST Proxy decodes it back to raw bytes before handing it to the broker. A consumer using the REST Proxy receives the base64 string and decodes it back to Man.
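A quick check of that example with just the Python standard library (the decoded bytes are exactly the bits the broker ends up storing):

import base64

raw = b"Man"                         # 0x4D 0x61 0x6E -> 01001101 01100001 01101110
encoded = base64.b64encode(raw)      # b'TWFu' -- what travels inside the JSON body
decoded = base64.b64decode(encoded)  # b'Man'  -- the bytes the broker actually stores
print(encoded, decoded)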

As you already answered, the broker always stores the data in a binary format.
As for why base64 is needed, I found this in the Confluent documentation (https://www.confluent.io/blog/a-comprehensive-rest-proxy-for-kafka/):
The need for base64 encoding becomes clearer when you have to send raw binary data through the REST Proxy:
If you opt to use raw binary data, it cannot be embedded directly in JSON, so the API uses a string containing the base64 encoded data.
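To make this concrete, here is a minimal sketch of a produce request against the REST Proxy's binary format. The proxy address (localhost:8082) and topic name (jsontest) are assumptions, but the key and value must be base64 strings because JSON has no way to carry raw bytes:

import base64
import requests

url = "http://localhost:8082/topics/jsontest"   # assumed proxy address and topic
headers = {"Content-Type": "application/vnd.kafka.binary.v2+json"}
payload = {"records": [{
    "key": base64.b64encode(b"firstkey").decode("ascii"),
    # arbitrary binary data that could never be pasted into JSON directly
    "value": base64.b64encode(b"\x00\x01\xfe\xffraw-bytes").decode("ascii"),
}]}

resp = requests.post(url, json=payload, headers=headers)
print(resp.status_code, resp.json())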

Related

How to deserialize an Avro message using MirrorMaker?

I want to replicate a Kafka topic to an Azure Event Hub.
The messages are in Avro format and use a schema that is behind a Schema Registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the messages correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro messages using MirrorMaker before sending them?
Cheers
For MirrorMaker 1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to Event Hubs, you should stick with the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker 2, you can use the AvroConverter followed by some transforms properties, but a ByteArrayConverter would still be preferred for a one-to-one byte copy; a configuration sketch follows below.
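As a rough, untested sketch of those MirrorMaker 2 settings (connect-mirror-maker.properties style; cluster aliases, bootstrap servers, topic name, Schema Registry URL, and credentials are all placeholders, and the exact property prefixes can vary with how MirrorMaker 2 is deployed):

clusters = source, target
source.bootstrap.servers = source-kafka:9092
target.bootstrap.servers = target-eventhub.servicebus.windows.net:9093

source->target.enabled = true
source->target.topics = my-avro-topic

# One-to-one byte copy (preferred): records are never deserialized
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter

# Alternative: deserialize/re-serialize Avro through the Schema Registry
# value.converter = io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url = http://schema-registry:8081
# value.converter.basic.auth.credentials.source = USER_INFO
# value.converter.basic.auth.user.info = <api-key>:<api-secret>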

How to disable headers from being consumed in Kafka

Is there any option to stop Kafka headers from being consumed by the consumer? In my case I wrote a consumer to consume messages from a Kafka topic published by an upstream system. My processing doesn't require any information from the headers, and the published headers are heavyweight (bigger than the message itself). So my consumer is taking longer than expected.
Is there any option to consume only the message content and leave out the headers, so that I save the time spent transferring the headers over the network and deserializing them at the consumer? Your help is appreciated.
Every message is a Record with Headers (as of Kafka 0.11).
length: varint
attributes: int8
    bit 0~7: unused
timestampDelta: varint
offsetDelta: varint
keyLength: varint
key: byte[]
valueLen: varint
value: byte[]
Headers => [Header]

Record Header
headerKeyLength: varint
headerKey: String
headerValueLength: varint
Value: byte[]
Even if you skip deserializing them, the headers will still be sent over the wire as part of the record's TCP packet body.
You could try using a Kafka 0.10.2 client version, for example, which might drop the headers entirely because they simply weren't part of the API yet, but YMMV.
As mentioned in the comments, the most reliable option here is to stop sending such heavy information from the upstream application. The middle ground would be to compress and/or binary-encode that data; a sketch of the compression option follows below.
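As one illustration of that middle ground, here is a hedged sketch using the kafka-python client (broker address, topic, and header contents are placeholders): enabling batch compression in the upstream producer shrinks both headers and values on the wire, even though the consumer still receives the headers.

from kafka import KafkaProducer

# gzip-compress whole record batches (headers included) before they hit the network
producer = KafkaProducer(
    bootstrap_servers="broker:9092",       # placeholder
    compression_type="gzip",
)

producer.send(
    "upstream-topic",                      # placeholder
    key=b"some-key",
    value=b"the actual message body",
    headers=[("trace-context", b"...")],   # heavy metadata still travels, just compressed
)
producer.flush()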

Unexpected character in Kafka message with Flume

I have an ingestion pipeline using Flume and Kafka: it consumes CSV files, converts the events to JSON in a Flume interceptor, and pushes them to Kafka.
When I log the message before it is sent to Kafka, it's normal, valid JSON. But when consuming the same message from Kafka, I get errors when trying to parse it, saying it's not valid JSON.
Indeed, I have unrecognized characters at the beginning of my message:
e.g. �
I think they come from the empty header that Flume tries to add to the event when posting to Kafka, but I can't seem to prevent this from happening.
Does anyone know how to completely remove headers from the Flume events being sent, or more precisely, how to remove those characters?
This looks like a basic character-encoding issue, for example if Kafka runs on Linux while the producer runs on a Windows machine. You might want to triple-check that all machines handle UTF-8 encoded messages.
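A quick way to see what those leading characters actually are is to consume the raw bytes and inspect them before any JSON parsing. Here is a hedged sketch with kafka-python (broker address and topic are placeholders):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "flume-topic",                        # placeholder topic
    bootstrap_servers="broker:9092",      # placeholder broker
    auto_offset_reset="earliest",
)

for record in consumer:
    raw = record.value                    # bytes exactly as stored by the broker
    print(repr(raw[:16]))                 # show the leading bytes that break the parser
    # If the prefix turns out to be extra framing rather than a UTF-8 problem,
    # the JSON body should still parse once you locate the first '{':
    start = raw.find(b"{")
    if start != -1:
        print(json.loads(raw[start:].decode("utf-8")))
    break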

Kafka data types of messages

I was wondering what types of data we can have in Kafka topics.
As I understand it, at the application level a message is a key-value pair, and the key and value can be of any type supported by the language.
For example, when we send messages to a topic, could they be JSON, Parquet files, or serialized data, or do we only work with messages as plain text?
Thanks for your help.
There are various message formats depending on if you are talking about the APIs, the wire protocol, or the on disk storage.
Some of these Kafka Message formats are described in the docs here
https://kafka.apache.org/documentation/#messageformat
Kafka has the concept of a Serializer/Deserializer or SerDes (pronounced Sir-Deez).
https://en.m.wikipedia.org/wiki/SerDes
A serializer is a function that takes any message and converts it into the byte array that is actually sent on the wire using the Kafka protocol.
A deserializer does the opposite: it reads the raw message bytes portion of the Kafka wire protocol and re-creates the message as you want the receiving application to see it.
There are built-in SerDes libraries for Strings, Longs, ByteArrays, and ByteBuffers, and a wealth of community SerDes libraries for JSON, Protobuf, Avro, as well as application-specific message formats.
You can build your own SerDes libraries as well; see the following:
How to create Custom serializer in kafka?
On the topic it's always just serialized data. Serialization happens in the producer before sending, and deserialization in the consumer after fetching. Serializers and deserializers are pluggable, so, as you said, at the application level it's key-value pairs of any data type you want.
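For instance, here is a hedged sketch using the kafka-python client (broker address and topic are placeholders) in which the application only ever handles Python dicts, while the topic itself only ever sees bytes, because the JSON serializer and deserializer are plugged in at the client:

import json
from kafka import KafkaProducer, KafkaConsumer

# The application hands over a dict; the pluggable serializer turns it into bytes.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",                        # placeholder
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("example-topic", key="user-42", value={"name": "Alice", "age": 30})
producer.flush()

# The consumer plugs in the matching deserializer and gets dicts back.
consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers="broker:9092",                        # placeholder
    auto_offset_reset="earliest",
    key_deserializer=lambda k: k.decode("utf-8"),
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)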

Protobuf within an Avro-encoded message on Kafka

I wanted to know if there is a better way to solve the problem we are having. Here is the flow:
Our client code understands only protocol buffers (protobuf). On the server side, our gateway gets the protobuf and puts it onto Kafka.
Now, Avro is the recommended encoding scheme, so we put the protobuf inside an Avro record (as a byte array) and put that onto the message bus. The reason we do this is to avoid having to do a full protobuf-to-Avro conversion.
On the consumer side, it reads the Avro message, gets the protobuf out of it, and works on that.
How reliable is protobuf with Kafka? Are there a lot of people using it? What exactly are the advantages/disadvantages of using Kafka with protobuf?
Is there a better way to handle our use case/scenario?
thanks
Kafka doesn't differentiate between encoding schemes, since in the end every message flows in and out of Kafka as binary.
Both Protobuf and Avro are binary encoding schemes, so why would you want to wrap a protobuf inside an Avro schema when you can put the protobuf message onto Kafka directly?
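A minimal sketch of producing the protobuf bytes directly, assuming a protoc-generated module user_pb2 with a User message and the kafka-python client (module, message type, broker, and topic are all hypothetical):

from kafka import KafkaProducer
from user_pb2 import User               # hypothetical protoc-generated class

producer = KafkaProducer(bootstrap_servers="broker:9092")   # placeholder

msg = User(id=42, name="Alice")
payload = msg.SerializeToString()       # protobuf is already a compact binary format

# No Avro wrapper needed: the broker stores these bytes as-is.
producer.send("users", value=payload)
producer.flush()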