Confluent Schema Registry as a standalone service - apache-kafka

Can Confluent Schema Registry be used by applications outside of Kafka Streams? I am specifically interested in using this component with message queues other than Apache Kafka, such as Cloud Pub/Sub. Based on my investigation, the component seems tightly coupled to applications using the Confluent Platform.

Well, the Confluent Schema Registry does depend on Kafka (it's where the schemas are actually stored). You don't need the rest of Confluent Platform.
While there is a Storage interface that could, in theory, be rewritten against an external system, I am not aware of a way to swap out the default implementation.
Once you have Kafka (and, by extension, ZooKeeper) running, the REST API itself can be wrapped by any external serialization library. Flink, NiFi, and StreamSets, for example, have taken this approach for Avro schema management.
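For illustration, here is a minimal sketch of using the Confluent Avro serializer outside of a Kafka producer, e.g. to frame Avro payloads for another transport such as Cloud Pub/Sub. The registry URL, topic/subject name, and schema below are placeholders, not anything from a real deployment.

    import io.confluent.kafka.serializers.KafkaAvroSerializer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    import java.util.Map;

    public class RegistrySerializeSketch {
        public static void main(String[] args) {
            KafkaAvroSerializer serializer = new KafkaAvroSerializer();
            // schema.registry.url is the only required setting; false = serializing record values
            serializer.configure(Map.of("schema.registry.url", "http://localhost:8081"), false);

            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");

            // The topic argument only determines the registry subject name ("users-value");
            // the returned bytes can be published over any transport, e.g. Cloud Pub/Sub.
            byte[] payload = serializer.serialize("users", user);
            System.out.println("Serialized " + payload.length + " bytes");

            serializer.close();
        }
    }

The serializer registers (or looks up) the schema over HTTP and prepends the registry schema ID to the Avro bytes, so the payload can travel over any message bus as long as the consumer uses the matching deserializer pointed at the same registry.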

Related

Kafka Connect or Kafka Streams?

I have a requirement to read messages from a topic, enrich the message based on provided configuration (the data required for enrichment is sourced from external systems), and publish the enriched message to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why am I considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin-based approach in Connect. If there is a new type of message that needs to be handled, I just deploy a new connector without having to deploy a full-scale Java app.
Why am I not sure this is a good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or other system other than Kafka, as necessary.
Kafka Streams can also be config-driven, can use plugins (i.e. reflection), is just as scalable, and has no different connection modes (to Kafka). Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches, such as the S3 or JDBC sink connectors.
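If you do take the Streams route for the enrichment step, a minimal sketch might look like the following. The topic names, registry URL, and the enrich() method are placeholders; the actual call to the external system and the error handling are up to you.

    import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    import java.util.Map;
    import java.util.Properties;

    public class EnrichmentAppSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            // Avro Serde backed by the Schema Registry (URL is a placeholder)
            GenericAvroSerde valueSerde = new GenericAvroSerde();
            valueSerde.configure(Map.of("schema.registry.url", "http://localhost:8081"), false);

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic", Consumed.with(Serdes.String(), valueSerde))
                   // mapValues is where the lookup against the external system would go
                   .mapValues(EnrichmentAppSketch::enrich)
                   .to("output-topic", Produced.with(Serdes.String(), valueSerde));

            new KafkaStreams(builder.build(), props).start();
        }

        // Placeholder for the configuration-driven enrichment call
        private static GenericRecord enrich(GenericRecord value) {
            return value;
        }
    }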

Integrating Flink Kafka with schema registry

We are using the Confluent Platform for our Kafka deployment and a Schema Registry for storing schemas. Is it possible to integrate the Schema Registry with Flink? How do we read data in Avro format from the Confluent Platform?
These classes are designed to meet this need:
ConfluentRegistryAvroSerializationSchema
ConfluentRegistryAvroDeserializationSchema
See the linked JavaDoc for more info on the classes.
Each can be provided to the Flink Kafka connector via the respective (de)serialization schema arguments.
Flink SQL can also be used.
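A rough sketch of the consuming side, assuming a known reader schema and using the older FlinkKafkaConsumer API (newer Flink versions would pass the same deserialization schema to KafkaSource instead). The broker address, topic, registry URL, and schema are placeholders.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    import java.util.Properties;

    public class FlinkAvroRegistrySketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "flink-avro-demo");

            // Reader schema for the records on the topic (placeholder schema)
            Schema readerSchema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

            // The deserialization schema fetches writer schemas from the registry as needed
            FlinkKafkaConsumer<GenericRecord> source = new FlinkKafkaConsumer<>(
                    "users",
                    ConfluentRegistryAvroDeserializationSchema.forGeneric(readerSchema, "http://localhost:8081"),
                    props);

            env.addSource(source).print();
            env.execute("flink-avro-registry-sketch");
        }
    }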

Does kafka support schema registries out of the box, or is it a confluent platform feature?

I came across the following article on how to use the Schema Registry available in the Confluent Platform.
https://docs.confluent.io/platform/current/schema-registry/schema-validation.html
According to that article, we can specify confluent.schema.registry.url in server.properties to point Kafka to the schema registry.
My question is, is it possible to point a Kafka cluster that is not part of a Confluent Platform deployment to a schema registry using confluent.schema.registry.url?
Server-side schema validation is part of Confluent Server, not Apache Kafka.
I will make sure that docs page gets updated to be clearer - thanks for raising it.
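For reference, a rough sketch of what that broker-side feature looks like on Confluent Server (the registry address is a placeholder). These keys are not recognized by an Apache Kafka broker, where schema enforcement instead happens in the clients via the schema.registry.url setting on the serializers.

    # server.properties on Confluent Server only (registry address is a placeholder)
    confluent.schema.registry.url=http://localhost:8081

    # validation is then enabled per topic, as a topic-level config, e.g.:
    # confluent.value.schema.validation=true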

Using confluent cp-schema-registry, does it have to talk to the same Kafka you are using for producers/consumers?

We already have Kafka running in production, and unfortunately it's an older version, 0.10.2. I want to start using cp-schema-registry, from the community edition of Confluent Platform. That would mean installing the older 3.2.2 image of Schema Registry for compatibility with our old Kafka.
From what I've read in the documentation, it seems that Confluent Schema Registry uses Kafka as its backend for storing its state. But the clients that are producing to/reading from Kafka topics talk to Schema Registry independently of Kafka.
So I am wondering if it would be easier to manage in production to run Schema Registry/Kafka/Zookeeper together in one container, independent of our main Kafka cluster. Then I can use the latest version of everything. The other benefit is that standing up this new service component could not cause any unexpected negative consequences for the existing Kafka cluster.
I find the documentation doesn't really explain well what the pros/cons of each deployment strategy are. Can someone offer guidance on how they have deployed schema registry in an environment with an existing Kafka? What is the main advantage of connecting schema registry to your main Kafka cluster?
Newer Kafka clients are backwards compatible with Kafka 0.10, so there's no reason you couldn't use a newer Schema Registry than 3.2.
From the docs:
Schema Registry that is included in Confluent Platform 3.2 and later is compatible with any Kafka broker that is included in Confluent Platform 3.0 and later
I would certainly avoid putting everything in one container... That's not how they're meant to be used, and there's no reason you would need another Zookeeper server.
Having a secondary Kafka cluster only to hold one topic of schemas seems unnecessary when you could store the same information on your existing cluster.
the clients that are producing to/reading from Kafka topics talk to Schema Registry independently of Kafka
Clients talk to both. Only Avro schemas are sent over HTTP before your regular client code reaches the topic. No, schemas and client data do not have to be part of the same Kafka cluster
Anytime anyone deploys Schema Registry, it's being added to "an existing Kafka"; the only difference is that yours might have more data in it.
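As a concrete sketch, pointing a standalone (newer-version) Schema Registry at your existing cluster only takes a few lines of schema-registry.properties; the hostnames below are placeholders.

    listeners=http://0.0.0.0:8081
    # use the existing production brokers as the storage backend
    kafkastore.bootstrap.servers=PLAINTEXT://existing-broker-1:9092,PLAINTEXT://existing-broker-2:9092
    # single compacted topic where all schemas are kept
    kafkastore.topic=_schemas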

Why is Kafka Connect lightweight?

I have been working with Kafka Connect, Spark Streaming, and NiFi with Kafka for streaming data.
I am aware that, unlike the other technologies, Kafka Connect is not a separate application but a tool that ships with Kafka.
In distributed mode, all of these technologies achieve parallelism through underlying tasks or threads. What makes Kafka Connect efficient when dealing with Kafka, and why is it called lightweight?
It's efficient and lightweight because it uses the built-in Kafka protocols and doesn't require an external system such as YARN. While it is arguably better/easier to deploy Connect in Mesos/Kubernetes/Docker, it is not required.
The Connect API is also maintained by the core Kafka developers rather than by people who just want a simple integration into another tool. For example, last time I checked, NiFi cannot access Kafka message timestamps. And dealing with the Avro Schema Registry seems to be an afterthought in other tools compared to using Confluent Certified Connectors.