Schema Registry: share partially / authorization system - apache-kafka

We need to share part of our Schema Registry with another company and don't want them to see all the schemas. They also need to do the same with theirs.
Is there any way that each of us can share only part of our schema registry?

Out of the box, no.
Assuming each Schema Registry is hooked up to a separate Kafka cluster (call them yours and theirs), what you could do is:
Write a Kafka Streams application to filter() the messages you want them to see into a _schemas_theirs topic (a sketch follows these steps).
Use MirrorMaker, or Confluent Replicator, to copy your local _schemas_theirs topic into the theirs cluster's _schemas topic, which is being read by the other registry.
Have them do the same thing, copying their filtered data into the yours cluster's _schemas topic.
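For the first step, a minimal Kafka Streams sketch might look like the following. It assumes, purely for illustration, that the subjects you want to expose share a "shared." prefix and that the application runs against the yours cluster at localhost:9092; adjust the predicate and addresses to your setup.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class SharedSchemasFilter {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Keys in _schemas are JSON strings such as
        // {"keytype":"SCHEMA","subject":"shared.orders-value","version":1,"magic":1},
        // so a coarse substring check on the subject is enough for this sketch.
        builder.stream("_schemas", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> key != null && key.contains("\"subject\":\"shared."))
               .to("_schemas_theirs", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "shared-schemas-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // yours cluster
        new KafkaStreams(builder.build(), props).start();
    }
}

From there, MirrorMaker or Replicator copies _schemas_theirs across like any other topic, since the registry's data is just ordinary records.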

Related

Is there a way of telling a sink connector in Kafka Connect how to look for schema entries

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because the topics are created by someone else, the schema registry is also managed by them. So in my config, I set "schema.registry.url":https://url-to-schema-registry, but we have multiple topics which all use the same schema entry, which is located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?
The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
The only built-in strategies, however, are for Topic and/or RecordName lookups. You'll need to write your own class for static lookups from "generic-entry" if you otherwise cannot copy this "generic-entry-value" schema into new subjects (a sketch follows the curl example below).
e.g.
# get the output of this into a file, e.g. schema.json
# (the registry expects the POST body to be wrapped as {"schema": "<schema as an escaped JSON string>"})
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema
# upload it again, where "new-entry" is the name of the other topic
curl -XPOST -H "Content-Type: application/vnd.schemaregistry.v1+json" -d @schema.json https://url-to-schema-registry/subjects/new-entry-value/versions
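If you do go the custom-strategy route instead, a minimal sketch could look like the following. The package and class names are made up, and the interface shown (SubjectNameStrategy with a subjectName(topic, isKey, schema) method) is the one used by recent Confluent versions; older releases use a slightly different signature, so check the version you build against.

package com.example; // illustrative package and class names

import java.util.Map;
import io.confluent.kafka.schemaregistry.ParsedSchema;
import io.confluent.kafka.serializers.subject.strategy.SubjectNameStrategy;

// A subject-name strategy that always resolves to one fixed subject,
// regardless of which topic is being read or written.
public class StaticSubjectNameStrategy implements SubjectNameStrategy {

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public String subjectName(String topic, boolean isKey, ParsedSchema schema) {
        // every topic looks up the same subject in the registry
        return "generic-entry-value";
    }
}

Packaged onto the Connect worker's plugin path, it would then be referenced from the sink connector config, e.g. value.converter.value.subject.name.strategy=com.example.StaticSubjectNameStrategy.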

Does Confluent Schema Registry keep track of the producers to various Kafka topics?

I am trying to plot an overall topology for my Kafka cluster (i.e., producers-->topics-->consumers).
For the mapping from topics to consumers, I'm able to obtain it using the kafka-consumer-groups.sh script.
However, for the mapping from producers to topics, I understand there is no equivalent script in vanilla Kafka.
Question:
Does the Schema Registry allow us to associate metadata with producers and/or topics or otherwise create a mapping of all producers producing to a particular topic?
Schema Registry has no such functionality.
The closest I've seen to something like this is using distributed tracing (the Brave library) or Cloudera's SMM tool, which requires authorized Kafka clients so it can trace requests and map producer client.id values to topics, then consumer instances to groups.
There's also the Stream Registry project, whose initial version I helped with, with the vision of managing client state/discovery, but I think it has since taken a different direction and the documentation is not maintained.

How can we dump a Kafka topic into Presto

I need to push a JSON file into a Kafka topic, connect the topic in Presto, and structure the JSON data into a queryable table.
I am following this tutorial https://prestodb.io/docs/current/connector/kafka-tutorial.html#step-2-load-data
I am not able to understand how this command will work.
$ ./kafka-tpch load --brokers localhost:9092 --prefix tpch. --tpch-type tiny
Suppose I have created a test topic in Kafka using a producer. How will the tpch data be generated for this topic?
If you already have a topic, you should skip to step 3, where it actually sets up the topics to query via Presto.
kafka-tpch load creates new topics with the specified prefix.
The above command creates a tpch schema and loads various tables under it. This can be used for testing purposes. If you want to work with your actual Kafka topics, you need to list them in etc/catalog/kafka.properties against kafka.table-names. If you simply provide a topic name without a prefix (such as test_topic), it will land in the "default" schema. However, if you specify a topic name with a prefix (such as test_schema.test_topic), the topic will appear under test_schema. When querying with Presto, you can provide this schema name.
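For reference, a minimal catalog file along those lines might look like this; the broker address and topic names are placeholders:

# etc/catalog/kafka.properties (illustrative values)
connector.name=kafka
kafka.nodes=localhost:9092
# un-prefixed topics appear under the "default" schema;
# a prefixed name such as test_schema.test_topic appears under test_schema
kafka.table-names=test_topic,test_schema.test_topic
kafka.hide-internal-columns=false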

How to replicate schema with Kafka mirror maker?

We are using MirrorMaker to sync on-premise and AWS Kafka topics. How can a topic, with its schema registered on-premise, be replicated exactly the same in other clusters (AWS in this case)?
How is the Avro schema replicated using MirrorMaker?
MirrorMaker only copies byte arrays, not schemas, and doesn't care about the format of the data.
As of Confluent 4.x or later, Schema Registry added the endpoint GET /schemas/ids/(number). So, if your consumers are configured against the original registry, this shouldn't matter, since your destination consumers can look up the schema by ID.
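For example (the registry address and schema ID here are placeholders):
# resolve the writer's schema from the ID embedded in each record
curl https://source-registry:8081/schemas/ids/42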
You otherwise can mirror the _schemas topic as well, as recommended by Confluent when using Confluent Replicator.
If you absolutely need one-to-one schema copying, you would need to implement a MessageHandler interface and pass it to the MirrorMaker command to get and post the schema, similar to the internal logic I added to this Kafka Connect plugin (which you could use with Connect instead of MirrorMaker): https://github.com/OneCricketeer/schema-registry-transfer-smt

How is the schema from Schema Registry propagated over Replicator

How do schemas from the Confluent Schema Registry get propagated by Confluent Replicator to the destination Kafka cluster and Schema Registry?
Does each replicated message contain its schema, or are schemas replicated separately through a separate topic?
I didn't see any configuration possibilities in Confluent-Replicator regarding this.
It sounds like you are asking how the schema registry can be used in a multi data center environment. There's a pretty good doc on this https://docs.confluent.io/current/schema-registry/docs/multidc.html
Replicator can be used to keep the schema registry data in sync on the backend as shown in the doc.
Schemas are not stored with the topic, only their IDs. And the _schemas topic is not replicated; only the IDs embedded in the replicated records are carried over.
At a high level, if you use the AvroConverter with Replicator, it'll deserialize the message from the source cluster, optionally rename the topic as per the Replicator configuration, then re-serialize the message, registering the new subject name with the destination registry and sending the record to the destination cluster.
Otherwise, if you use the ByteArrayConverter, it will not inspect the message, and it just copies it along to the destination cluster with no schema registration.
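As a rough illustration of the two modes, the converter settings on the Replicator connector might look like this; treat the exact property keys and registry addresses as assumptions to check against the Replicator documentation:

# Avro-aware copying: read with the source registry, re-register in the destination registry
src.key.converter=io.confluent.connect.avro.AvroConverter
src.value.converter=io.confluent.connect.avro.AvroConverter
src.value.converter.schema.registry.url=https://source-registry:8081
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=https://destination-registry:8081

# Pass-through copying: bytes are forwarded untouched, nothing is registered
# key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# value.converter=org.apache.kafka.connect.converters.ByteArrayConverter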
A small optimization on the Avro approach would be to only inspect the first 5 bytes of the message to confirm it is Avro encoded, as per the Schema Registry wire format, then look up the schema over HTTP from the source registry with GET /schemas/ids/:id, rename the topic if needed for the destination cluster, and POST the schema there. A similar approach can work in any consumer/producer pair, such as a MirrorMaker MessageHandler implementation.
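For reference, the five-byte check described above amounts to something like this (a minimal sketch; the class and method names are illustrative):

import java.nio.ByteBuffer;

public class WireFormat {

    // Returns the schema ID embedded in a Confluent-Avro encoded record value,
    // or -1 if the value does not start with the registry's magic byte (0x0).
    public static int embeddedSchemaId(byte[] value) {
        if (value == null || value.length < 5 || value[0] != 0x0) {
            return -1; // not encoded with the Schema Registry wire format
        }
        // bytes 1-4 hold the schema ID as a big-endian int, per the wire format
        return ByteBuffer.wrap(value, 1, 4).getInt();
    }
}

The returned ID is what you would then pass to GET /schemas/ids/:id on the source registry before POSTing the schema to the destination subject.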