Working with Protobuf Kafka messages from Hive

Regular (JSON) Kafka topics can be easily connected to Hive as external tables, like this:
CREATE EXTERNAL TABLE dummy_table (
  `field1` BIGINT,
  `field2` STRING,
  `field3` STRING
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "dummy_topic",
  "kafka.bootstrap.servers" = "dummybroker:9092"
);
But what about Protobuf-encoded topics? Can they be connected too? I wasn't able to find any examples of this on the net.
If so, how (and where) in the code should the .proto file be specified?

You'd have to add kafka.serde.class to the table properties.
Assuming you're using Confluent Schema Registry with Proto messages: only Avro is supported.
Otherwise, there was an old project called Elephant-Bird for adding Protobuf support to Hive. I'm not sure if it still works, or whether it can be used for the Kafka serde config. But assuming it can, your Proto file would need to be placed in HDFS, for example, and picked up by each Hive map task.

Related

How to fix the deserialization error when merging 2 kstreams topics using leftJoin?

I am new to Kafka. I am working on a personal project where I want to write to 2 different Avro topics and merge them using leftJoin. Once I merge them, I want to produce the same messages to ksqlDB as well (I haven't implemented that part yet).
I am using KafkaTemplate to produce to the 2 Avro topics and convert them into KStreams to merge them. I am also using KafkaListener to print any messages in them, and that part is working. Here's where I'm having issues, 2 of them actually; in either case, no messages are produced to the merged topic.
If I remove the Consumed.with() from the KStream, it throws a default key Serde error.
But if I keep it, then it throws a deserialization error.
I have even provided the default serialization and deserialization in both my application.properties and in the streamConfig inside main(), but it's still not working.
Can somebody please help me with how to merge the 2 Avro topics? Is the error occurring because I am using the Avro schema? Should I use JSON instead? I want to use a schema because the value part of my message will have multiple values in it.
For example: {Key : Value} = {company : {inventory_id, company, color, inventory}} = {Toyota : {0, RAV4, 50,000}}
Here's a link to all the files: application.properties, DefaultKeySerdeError.txt, DeserializationError.txt, FilterStreams.java, Inventory.avsc, Pricing.avsc, and MergedAvro.avsc. Let me know if y'all want me to put them below. Thank you very much for your help in advance!
https://gist.github.com/Arjun13/b76f53c9c2b4e88225ef71a18eb08e2f
Looking at the DeserializationError.txt file, it looks like the problem is you haven't provided the credentials for schema registry. Even though you have provided them in the application.properties file, they're not getting into the serdes configuration, so if you add the basic.auth.user.info configs to the serdeConfig map you should be all set.

Glue avro schema registry with flink and kafka for any object

I am trying to register and serialize an object with Flink, Kafka, Glue, and Avro. I've seen this method, which I'm trying:
Schema schema = parser.parse(new File("path/to/avro/file"));
GlueSchemaRegistryAvroSerializationSchema<GenericRecord> test =
        GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs);
FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<GenericRecord>(
        kafkaTopic,
        test,
        properties);
My problem is that this approach doesn't allow me to include an object other than GenericRecord; the object I want to send is a different class and it is very big, so big that it is too complex to transform into a GenericRecord.
I can't find much documentation. How can I send an object other than GenericRecord, or is there any way to include my object inside a GenericRecord?
I'm not sure if I understand correctly, but GlueSchemaRegistryAvroSerializationSchema has another factory method, forSpecific, that accepts a SpecificRecord. So you can use the Avro code-generation plugin for your build tool, depending on what you use (e.g. for sbt, here), to generate classes from your Avro schema that can then be passed to the forSpecific method.

Get all Kafka Source Connectors writing to a specific topic

I have the name of a Kafka Topic. I would like to know what Connectors are using this topic. Specifically, I need the Source Connector name so I can modify the source query. I only have access to the Confluent Control Center. We have hundreds of Connectors and I cannot search through them manually.
Thank you in advance!
You'd need to write a script.
Given: Connect REST endpoint
Do (roughly; a Python sketch using requests):
import re
import requests

CONNECT_URL = 'http://localhost:8083'   # your Connect REST endpoint
to_find = 'your_topic_name'
topic_used_by = []

for connector_name in requests.get(f'{CONNECT_URL}/connectors').json():
    connector_config = requests.get(f'{CONNECT_URL}/connectors/{connector_name}').json()['config']
    if 'topics.regex' in connector_config:
        # pattern-match the regex against your topic
        if re.fullmatch(connector_config['topics.regex'], to_find):
            topic_used_by.append(connector_name)
    elif 'topics' in connector_config:
        # 'topics' is a comma-separated list of topic names
        if to_find in [t.strip() for t in connector_config['topics'].split(',')]:
            topic_used_by.append(connector_name)

print(topic_used_by)
From that, your HTTP client should be easy to extend to update any config value, then PUT /connectors/{connector_name}/config.
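For example, a minimal sketch of that update step, assuming a JDBC source whose query property you want to change; the Connect URL, connector name, and new query are placeholders:
import requests

CONNECT_URL = 'http://localhost:8083'   # assumed Connect REST endpoint
connector_name = 'my-jdbc-source'       # hypothetical connector found by the script above

# Fetch the current config, change the value you care about, then PUT the whole config back
config = requests.get(f'{CONNECT_URL}/connectors/{connector_name}').json()['config']
config['query'] = 'SELECT id, name FROM customers'   # hypothetical new source query
resp = requests.put(f'{CONNECT_URL}/connectors/{connector_name}/config', json=config)
resp.raise_for_status()
print(resp.json())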
Keep in mind that neither 'topics' nor 'topics.regex' is a source connector property (they are sink connector properties). Therefore, you'd need to modify the script to use whatever connector properties you do have (ideally, filtering by the connector class name). For example, the table.whitelist property of the JDBC source, or collection in the MongoDB source, determines the topic name.
And unless you filter by class name, the script returns both sources and sinks, so unless there is "Source", or similar, in the name of the connector class, this is about the best you can do.
Also, you'd need to consider that some topic names may be set by various transform configurations, which can use arbitrary JSON keys in the config, so there's no straightforward way to search those beyond something like transform_config_value.contains('your_topic_name').
You can use the feature implemented through KIP-558: Track the set of actively used topics by connectors in Kafka Connect.
E.g. Get the set of active topics from a connector called 'some-source':
curl -s 'http://localhost:8083/connectors/some-source/topics' | jq
{
  "some-source": {
    "topics": [
      "foo",
      "bar",
      "baz"
    ]
  }
}
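If your Connect cluster exposes that endpoint, the search from the first answer collapses to a loop over it. A minimal sketch, assuming the Connect REST endpoint is at localhost:8083 and 'your_topic_name' is a placeholder:
import requests

CONNECT_URL = 'http://localhost:8083'   # assumed Connect REST endpoint
to_find = 'your_topic_name'

used_by = []
for name in requests.get(f'{CONNECT_URL}/connectors').json():
    # GET /connectors/{name}/topics returns {"<name>": {"topics": [...]}}
    active = requests.get(f'{CONNECT_URL}/connectors/{name}/topics').json()
    if to_find in active.get(name, {}).get('topics', []):
        used_by.append(name)

print(used_by)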

using existing kafka topic with avro in ksqldb

Suppose I have a topic, let's say 'some_topic'. Data in this topic is serialized with Avro using Schema Registry.
The schema subject name is the same as the name of the topic, 'some_topic', without the '-value' postfix.
What I want to do is create a new topic, let's say 'some_topic_new', where data will be serialized with the same schema, but some fields will be 'nullified'.
I'm trying to evaluate whether this can be done using ksqlDB, and I have two questions:
Is it possible to create a stream/table based on an existing topic, using the existing schema?
Maybe something like create table some_topic (*) with (schema_subject=some_topic, ...),
so that the fields for the new table would be taken from the existing schema automatically.
Can creating a new schema with the '-value' postfix be avoided when creating the new stream/table?
When you create a stream in ksqlDB based on another, you can have it inherit the schema. Note that it won't share the same schema, but the definition will be the same.
CREATE STREAM my_stream
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='AVRO');
CREATE STREAM my_stream_new
WITH (KAFKA_TOPIC='some_topic_new', VALUE_FORMAT='AVRO') AS
SELECT * FROM my_stream;

For AvroProducer to Kafka, where are avro schema for "key" and "value"?

From the AvroProducer example in the confluent-kafka-python repo, it appears that the key/value schemas are loaded from files. That is, from this code:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
value_schema = avro.load('ValueSchema.avsc')
key_schema = avro.load('KeySchema.avsc')
value = {"name": "Value"}
key = {"name": "Key"}
avroProducer = AvroProducer({'bootstrap.servers': 'mybroker,mybroker2', 'schema.registry.url': 'http://schema_registry_host:port'}, default_key_schema=key_schema, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value=value, key=key)
it appears that the files ValueSchema.avsc and KeySchema.avsc are loaded independently of the Avro Schema Registry.
Is this right? What's the point of referencing the URL for the Avro Schema Registry, but then loading the schemas from disk for the key/value?
Please clarify.
I ran into the same issue, where it was initially unclear what the point of the local files is. As mentioned by the other answers, for the first write to an Avro topic, or an update to the topic's schema, you need the schema string - you can see this from the Kafka REST documentation here.
Once you have the schema in the registry, you can read it with REST (I used the requests Python module in this case) and use the avro.loads() method to get it. I found this useful because the produce() function requires that you have a value schema for your AvroProducer, and this code will work without that local file being present:
import requests
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Fetch the latest registered value schema over the Schema Registry REST API
get_schema_req_data = requests.get("http://1.2.3.4:8081/subjects/sample_value_schema/versions/latest")
get_schema_req_data.raise_for_status()
schema_string = get_schema_req_data.json()['schema']
value_schema = avro.loads(schema_string)
avroProducer = AvroProducer({'bootstrap.servers': '1.2.3.4:9092', 'schema.registry.url': 'http://1.2.3.4:8081'}, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value={"data": "that matches your schema"})
Hope this helps.
That is just one way to create a key and value schema in the Schema Registry in the first place. You can create it in the SR first using the SR REST API, or you can create new schemas or new versions of existing schemas in the SR by publishing them with new messages. It's entirely your choice which method you prefer.
Take a look at the code and consider that the schema from the registry is needed by the consumer rather than the producer; the MessageSerializer registers the schema in the Schema Registry for you :)
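To make that concrete, here is a minimal consumer-side sketch using the same legacy confluent-kafka-python Avro API as the question (the broker address, group id, registry URL, and topic are placeholders). The consumer never loads a local .avsc file; the deserializer fetches the writer schema from the registry using the schema ID embedded in each message.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    'bootstrap.servers': 'mybroker',
    'group.id': 'my_group',                                   # hypothetical consumer group
    'schema.registry.url': 'http://schema_registry_host:port'
})
consumer.subscribe(['my_topic'])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    # key() and value() are already deserialized with the schema fetched from the registry
    print(msg.key(), msg.value())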