Is there a way of telling a sink connector in Kafka Connect how to look for schema entries - apache-kafka

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because of the topics being created by someone else, the schema registry is also being managed by them. So in my config, I set "schema.registry.url":https://url-to-schema-registry, but we have multiple topics which all use the same schema entry, which is located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?

The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
There are only strategies built-in, however for Topic and/or RecordName lookups. You'll need to write your own class for static lookups from "generic-entry" if you otherwise cannot copy this "generic-entry-value" schema into new subjects
e.g
# get output of this to a file
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema
# upload it again where "new-entry" is the name of the other topic
curl -XPOST -d #schema.json https://url-to-schema-registry/subjects/new-entry-value/versions

Related

How we can Dump kafka topic into presto

I need to pushing a JSON file into a Kafka topic, connecting the topic in presto and structuring the JSON data into a queryable table.
I am following this tutorial https://prestodb.io/docs/current/connector/kafka-tutorial.html#step-2-load-data
I am not able to understand how this command will work.
$ ./kafka-tpch load --brokers localhost:9092 --prefix tpch. --tpch-type tiny
Suppose I have created test topic in kafka using producer. How will tpch file will generate of this topic?
If you already have a topic, you should skip to step 3 where it actually sets up the topics to query via Presto
kafka-tpch load creates new topics with the specified prefix
Above command creates a tpch schema and loads various tables under it. This can be used for testing purpose. If you want to work with your actual kafka topics, you need to enlist them in /catalog/kafka.properties against kafka.tables-names. If you simply provide a topic name without prefix (such as test_topic), it would land into "default" schema. However, if you specify a topic name with prefix (such as test_schema.test_topic), then the topic would appear under test_schema. While querying using presto, you can provide this schema name.

Configuring Kafka connect Postgress Debezium CDC plugin

I am trying to use kafka connect to read changes in postgress DB.
I have Kafka running on my local system and i want to use the Kafka connect API in standalone mode to read the postgress server DB changes.
connect-standalone.sh connect-standalone.properties dbezium.properties
i would appreciate if someone can help me with setting up configuration properties for CDC postgress debezium connector
https://www.confluent.io/connector/debezium-postgresql-cdc-connector/
I am following the below to construct the properties
https://debezium.io/docs/connectors/postgresql/#how-the-postgresql-connector-works
The name of the Kafka topics takes by default the form
serverName.schemaName.tableName, where serverName is the logical name
of the connector as specified with the database.server.name
configuration property
and here is what i have come up with for dbezium.properties
name=cdc_demo
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
plugin.name=wal2json
slot.name=debezium
slot.drop_on_stop=false
database.hostname=localhost
database.port=5432
database.user=postgress
database.password=postgress
database.dbname=test
time.precision.mode=adaptive
database.sslmode=disable
Lets say i create a PG schema name as demo and table name as suppliers
So i need to create a topic with name as test.demo.suppliers so that this plugin can push the data to?
Also can someone suggest a docker image which has the postgress server + with suitable replication plugin such as wal2json etc? i am having hard time configuring postgress and the CDC plugin myself.
Check out the tutorial with associated Docker Compose and sample config.
The topic you've come up with sounds correct, but if you have your Kafka broker configured to auto-create topics (which is the default behaviour IIRC) then it will get created for you and you don't need to pre-create it.

How is schema from Schema-Registry is propagated over Replicator

How do schemas from Confluent Schema-Registry get propagated by Confluent-Replicator to destination Kafka-Cluster and Schema-Registry?
Is each replicated message schema contained in it or are schemas replicated somehow separately through a separate topic?
I didn't see any configuration possibilities in Confluent-Replicator regarding this.
It sounds like you are asking how the schema registry can be used in a multi data center environment. There's a pretty good doc on this https://docs.confluent.io/current/schema-registry/docs/multidc.html
Replicator can be used to keep the schema registry data in sync on the backend as shown in the doc.
Schemas are not stored with the topic, only their ID's. And the _schemas topic is not replicated, only the ID's stored within the replicated topics.
On a high-level, if you use the AvroConverter with Replicator, it'll deserialize the message from the source cluster, optionally rename the topic as per the replicator configuration, then serialize the message and send the new subject name to the destination cluster + registry.
Otherwise, if you use the ByteArrayConverter, it will not inspect the message, and it just copies it along to the destination cluster with no registration.
A small optimization on the Avro way would be to only inspect that the message is Avro encoded on the first 5 bytes, as per the Schema Registry specification, then perform HTTP lookups to the source subject using Schema Registry REST API GET /schemas/ids/:id, again rename topic if needed to the destination cluster, and POST the schema there. A similar approach can work in any Consumer/Producer pair such as a MirrorMaker MessageHandler implementation.

schema registry : Share partially/ authorization system

We need to share part of our Schema registry with another company and don't want them to see all the schemas. They also need to do the same for theirs.
Is there any way that each of us can share only part of our schema registry ?
Out of the box, no.
Assuming each Schema Registry is hooked to a separate Kafka Clusters (call them yours and theirs), what you could do, is
Write a Kafka Streams application to filter() the messages you want them to see to a _schemas_theirs topic.
Use MirrorMaker, or Confluent Replicator, to copy your local _schemas_theirs topic to the theirs Cluster's _schemas topic that is being read by the other registry.
Have them do the same thing, copying their filtered data into yours Kafka Cluster's _schemas topic

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is placed on the Schema registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates a json which includes also some bytes, for example when deserializing BigDecimal. We didn't even know which parsing option to choose (it is not avro, it is not json)
Other unsuitable strategies:
deploying a special kafka consumer would require us to package and place that code in some production server, since we are talking about our production cluster. It is just too long. After all, isn't kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a kafka connect Sink. We didn't find a simple way to reset the consumer offset since apparently the connector created consumer is still active even when we delete the sink
Isn't there a simply, easy way to dump the content of the value (not the schema) of a Kafka topic containing avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus using the correct Java Api of Avro.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use regular console consumer. You would use kafka-avro-console-consumer which deserializes the binary avro data into json for you to read on the console. You can redirect > topic.txt to the console to read it.
If you did use the console consumer, you can't parse the Avro immediately because you still need to extract the schema ID from the data (4 bytes after the first "magic byte"), then use the schema registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read this file as the console consumer writes it expects one entire schema to be placed at the header of the file, not only an ID pointing to anything in the registry at every line. (The basic Avro library doesn't know anything about the registry either)
The only thing configurable about the console consumer is the formatter and the registry. You can add decoders by additionally exporting them into the CLASSPATH
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls if a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest standalone connector, where you can set offset.storage.file.filename property to control how the offsets are stored.
KIP-199 discusses reseting consumer offsets for Connect, but feature isn't implemented.
However, did you see Kafka 0.11 how to reset offsets?
Alternative options include Apache Nifi or Streamsets, both integrate into the Schema Registry and can parse Avro data to transport it to numerous systems
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka Mirror Maker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another- such as a test environment.