How can we dump a Kafka topic into Presto - apache-kafka

I need to push a JSON file into a Kafka topic, connect the topic in Presto, and structure the JSON data into a queryable table.
I am following this tutorial: https://prestodb.io/docs/current/connector/kafka-tutorial.html#step-2-load-data
I don't understand how this command works:
$ ./kafka-tpch load --brokers localhost:9092 --prefix tpch. --tpch-type tiny
Suppose I have created a test topic in Kafka using a producer. How would the tpch data be generated for this topic?

If you already have a topic, you should skip to step 3, which sets up the topics to query via Presto.
kafka-tpch load creates new topics with the specified prefix; it does not read from an existing topic.

The above command creates a tpch schema and loads various tables under it, which is useful for testing. If you want to work with your actual Kafka topics, you need to list them in etc/catalog/kafka.properties under kafka.table-names. If you provide a topic name without a prefix (such as test_topic), it lands in the "default" schema. However, if you specify a topic name with a prefix (such as test_schema.test_topic), the topic appears under test_schema. When querying with Presto, you can then provide this schema name.
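To illustrate the naming rule, here is a minimal Python sketch of how an entry in kafka.table-names could be split into a schema and a table. It is not Presto's actual implementation: the function name is made up, and splitting at the first dot is an assumption for illustration.

```python
from typing import Tuple


def split_table_name(entry: str, default_schema: str = "default") -> Tuple[str, str]:
    """Split a kafka.table-names entry into (schema, table).

    Assumption: the schema prefix is everything before the first dot.
    Entries without a dot fall back to the default schema.
    """
    schema, sep, table = entry.partition(".")
    if not sep:
        return default_schema, entry
    return schema, table


for entry in ["test_topic", "test_schema.test_topic"]:
    schema, table = split_table_name(entry)
    print(f"{entry} -> schema={schema}, table={table}")
```

With that mapping, test_topic would be queried from the default schema and test_schema.test_topic from test_schema.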

Related

How to group kafka topics in different dbs and collections with mongodb sink connector depending on kafka topic name or message key/value

As the title states, I'm using the Debezium Postgres source connector and I would like the MongoDB sink connector to group Kafka topics into different collections and databases (separate databases to isolate unrelated data) according to their names. While researching this I came across the topics.regex connector property in the Mongo docs. Unfortunately, this only creates a collection in Mongo for each Kafka topic that matches the specified regex, and I'm planning to use the same MongoDB server to host many databases captured from multiple Debezium source connectors. Can you help me?
Note: I read about the Mongo sink setting FieldPathNamespaceMapper, but I'm not sure whether it fits my needs or how to configure it correctly.
topics.regex is a general sink connector property, not unique to Mongo.
If I understand the problem correctly, collections will only be created in the configured database for Kafka topics that actually exist (match the pattern) and are consumed by the sink.
If you want collection names that differ from the topic names, you'll still need to consume those topics, but you need to explicitly rename them via the RegexRouter transform before the records are written to Mongo.
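As a rough illustration of what such a rename does, the Python sketch below mimics a regex/replacement topic rename. It is not the Kafka Connect implementation: the topic names and pattern are hypothetical, and Python writes back-references as \1 where a Java replacement string would use $1.

```python
import re


def route_topic(topic: str, regex: str, replacement: str) -> str:
    """Rename a topic when the whole name matches the regex; otherwise keep it."""
    match = re.fullmatch(regex, topic)
    if match is None:
        return topic
    # Expand back-references such as \1 using the groups captured above.
    return match.expand(replacement)


# Hypothetical Debezium-style topic names: <server>.<schema>.<table>
for topic in ["shop.public.orders", "shop.public.customers", "unrelated-topic"]:
    print(topic, "->", route_topic(topic, r"shop\.public\.(.*)", r"shop_\1"))
```

In a real connector the same idea would be expressed with the transform's regex and replacement settings rather than code.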
In Kafka Connect, workers are simply containers that can run multiple connectors. For each connector, the workers generate tasks according to internal rules and your configuration. So, if you take a look at the MongoDB sink connector configuration properties:
https://www.mongodb.com/docs/kafka-connector/current/sink-connector/configuration-properties/all-properties/
You can create different connectors with the same connection.uri, database and collection, or with different values. So you can use the topics.regex or topics parameter to group the topics for a single connector with its own connection.uri, database and collection, and run multiple connectors at the same time. Remember that if tasks.max > 1 in your connector, messages might be read out of order. If this is not a problem, set tasks.max to a value close to the number of MongoDB shards. The worker will adjust the number of tasks automatically.

Is there a way of telling a sink connector in Kafka Connect how to look for schema entries

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because the topics are created by someone else, the schema registry is also managed by them. So in my config, I set "schema.registry.url":https://url-to-schema-registry, but we have multiple topics which all use the same schema entry, which is located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?
The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
However, the only built-in strategies are for Topic and/or RecordName lookups. You'll need to write your own class for static lookups of "generic-entry", unless you can copy the "generic-entry-value" schema into new subjects.
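To make the difference concrete, here is a small Python sketch of the subject names the built-in strategies produce, next to a hypothetical static strategy of the kind you would have to write yourself. The function names and the record name com.example.GenericEntry are illustrative only; the real strategies are Java classes.

```python
def topic_name_strategy(topic: str, is_key: bool = False) -> str:
    """Default behaviour: <topic>-key or <topic>-value."""
    return f"{topic}-{'key' if is_key else 'value'}"


def record_name_strategy(record_name: str) -> str:
    """Subject is the fully qualified record name, independent of the topic."""
    return record_name


def topic_record_name_strategy(topic: str, record_name: str) -> str:
    """Subject is <topic>-<fully qualified record name>."""
    return f"{topic}-{record_name}"


def static_subject_strategy(subject: str = "generic-entry-value") -> str:
    """Hypothetical custom strategy: always the same subject for every topic."""
    return subject


print(topic_name_strategy("my-topic"))
print(record_name_strategy("com.example.GenericEntry"))
print(topic_record_name_strategy("my-topic", "com.example.GenericEntry"))
print(static_subject_strategy())
```

None of the built-in results equals generic-entry-value for an arbitrary topic, which is why a custom class (or copying the schema) is needed.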
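For example, to copy the schema into a new subject:
# save the output of this request to a file
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema > schema.json
# wrap the saved schema as {"schema": "<escaped schema string>"} in request.json,
# then upload it, where "new-entry" is the name of the other topic
curl -XPOST -H "Content-Type: application/vnd.schemaregistry.v1+json" -d @request.json https://url-to-schema-registry/subjects/new-entry-value/versions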
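One detail that is easy to miss when re-uploading: the registration endpoint expects the schema as an escaped string inside a JSON object, not as the raw schema document. A minimal Python sketch of that wrapping step follows; the helper name and the sample schema are made up for illustration.

```python
import json


def build_register_request(schema_text: str) -> str:
    """Wrap a raw schema string in the JSON body used to register a schema."""
    return json.dumps({"schema": schema_text})


# Hypothetical schema, standing in for the content downloaded from the registry.
raw_schema = json.dumps(
    {
        "type": "record",
        "name": "GenericEntry",
        "fields": [{"name": "id", "type": "string"}],
    }
)

body = build_register_request(raw_schema)
print(body)
```

The printed body is what would be sent as the POST payload for the new subject.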

How is a schema from Schema Registry propagated by Replicator

How do schemas from Confluent Schema-Registry get propagated by Confluent-Replicator to destination Kafka-Cluster and Schema-Registry?
Does each replicated message contain its schema, or are schemas replicated separately through a separate topic?
I didn't see any configuration possibilities in Confluent-Replicator regarding this.
It sounds like you are asking how the schema registry can be used in a multi data center environment. There's a pretty good doc on this https://docs.confluent.io/current/schema-registry/docs/multidc.html
Replicator can be used to keep the schema registry data in sync on the backend as shown in the doc.
Schemas are not stored with the topic, only their IDs are. The _schemas topic is not replicated either; only the IDs embedded in the replicated messages are carried over.
At a high level, if you use the AvroConverter with Replicator, it will deserialize the message from the source cluster, optionally rename the topic as per the Replicator configuration, then serialize the message and send it, under the new subject name, to the destination cluster and registry.
Otherwise, if you use the ByteArrayConverter, it will not inspect the message, and it just copies it along to the destination cluster with no registration.
A small optimization on the Avro way would be to check only the first 5 bytes to confirm that the message is Avro-encoded, as per the Schema Registry specification, then perform HTTP lookups against the source registry using the Schema Registry REST API GET /schemas/ids/:id, rename the topic if needed for the destination cluster, and POST the schema there. A similar approach can work in any Consumer/Producer pair, such as a MirrorMaker MessageHandler implementation.
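The 5-byte check can be sketched in a few lines of Python. This assumes the usual framing of one zero "magic byte" followed by a 4-byte big-endian schema ID; the function names and the sample message are illustrative, and no HTTP call is made here.

```python
import struct

MAGIC_BYTE = 0


def parse_registry_header(message: bytes):
    """Return (schema_id, avro_payload), or None if the framing is absent."""
    if len(message) < 5 or message[0] != MAGIC_BYTE:
        return None
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]


def schema_lookup_path(schema_id: int) -> str:
    """REST path on the source registry for fetching the schema by ID."""
    return f"/schemas/ids/{schema_id}"


# Hypothetical message: magic byte, schema ID 42, then the Avro-encoded body.
sample = b"\x00" + struct.pack(">I", 42) + b"avro-bytes"
header = parse_registry_header(sample)
if header is not None:
    schema_id, payload = header
    print(schema_id, schema_lookup_path(schema_id), len(payload))
```

Only the header is inspected, so the Avro body never has to be decoded on the replication path.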

Schema Registry: partial sharing / authorization system

We need to share part of our Schema registry with another company and don't want them to see all the schemas. They also need to do the same for theirs.
Is there any way that each of us can share only part of our schema registry?
Out of the box, no.
Assuming each Schema Registry is connected to a separate Kafka cluster (call them yours and theirs), what you could do is:
Write a Kafka Streams application that uses filter() to copy the messages you want them to see into a _schemas_theirs topic.
Use MirrorMaker, or Confluent Replicator, to copy your local _schemas_theirs topic to the _schemas topic on their cluster, which is read by the other registry.
Have them do the same thing, copying their filtered data into your Kafka cluster's _schemas topic.
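The filtering step boils down to a predicate over the records of the _schemas topic. The Python sketch below shows one possible predicate and makes two assumptions that should be checked against your registry version: that record keys are JSON objects with keytype and subject fields, and that sharing is decided by an allowlist of subjects. The subject names are hypothetical.

```python
import json
from typing import Optional

# Hypothetical allowlist of subjects that the other company may see.
SHARED_SUBJECTS = {"orders-value", "customers-value"}


def should_share(record_key: Optional[bytes]) -> bool:
    """Keep only schema records whose subject is on the allowlist."""
    if record_key is None:
        return False
    try:
        key = json.loads(record_key)
    except ValueError:
        # Not valid JSON (or not valid UTF-8): never forward it.
        return False
    if not isinstance(key, dict):
        return False
    return key.get("keytype") == "SCHEMA" and key.get("subject") in SHARED_SUBJECTS


keys = [
    b'{"keytype":"SCHEMA","subject":"orders-value","version":1,"magic":1}',
    b'{"keytype":"SCHEMA","subject":"internal-value","version":1,"magic":1}',
    b'{"keytype":"NOOP","magic":0}',
]
for k in keys:
    print(should_share(k), k.decode())
```

In a Kafka Streams application the same condition would go inside the filter() call.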

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is placed on the Schema registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates JSON that also includes some raw bytes, for example when deserializing BigDecimal. We didn't even know which parsing option to choose (it is not Avro, it is not JSON).
Other unsuitable strategies:
Deploying a special Kafka consumer would require us to package and place that code on some production server, since we are talking about our production cluster. That just takes too long. After all, isn't the Kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a Kafka Connect sink. We didn't find a simple way to reset the consumer offset, since apparently the consumer created by the connector is still active even when we delete the sink.
Isn't there a simple, easy way to dump the content of the values (not the schema) of a Kafka topic containing Avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus the correct Avro Java API.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use the regular console consumer. You would use kafka-avro-console-consumer, which deserializes the binary Avro data into JSON for you to read on the console. You can redirect the console output to a file (> topic.txt) to read it later.
If you did use the console consumer, you can't parse the Avro immediately, because you still need to extract the schema ID from the data (the 4 bytes after the first "magic byte"), then use the schema registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read the file that the console consumer writes expects one entire schema at the header of the file, not just an ID on every line pointing into the registry. (The basic Avro library doesn't know anything about the registry either.)
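A quick way to see the mismatch is to look at how the bytes start. The Python sketch below distinguishes an Avro object container file, which begins with the bytes "Obj" followed by 0x01 and then carries the schema, from a registry-framed message, which begins with a zero byte and a 4-byte schema ID. The function name and sample inputs are illustrative.

```python
AVRO_CONTAINER_MAGIC = b"Obj\x01"


def describe_avro_bytes(data: bytes) -> str:
    """Guess the framing of some bytes from their first few bytes only."""
    if data.startswith(AVRO_CONTAINER_MAGIC):
        # What file-based Avro readers expect: magic, then schema metadata.
        return "avro-container-file"
    if len(data) >= 5 and data[0] == 0:
        # Registry framing: magic byte 0, then a big-endian schema ID.
        schema_id = int.from_bytes(data[1:5], "big")
        return f"registry-framed (schema id {schema_id})"
    return "unknown"


print(describe_avro_bytes(b"Obj\x01" + b"rest-of-file"))
print(describe_avro_bytes(b"\x00\x00\x00\x00\x07" + b"avro-body"))
print(describe_avro_bytes(b"hello"))
```

A dump of raw Kafka values falls into the second category, which is why a plain Avro file reader rejects it.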
The only things configurable about the console consumer are the formatter and the registry. You can add decoders by additionally exporting them into the CLASSPATH.
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See the Schema Registry documentation.
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries.
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well.
We didn't find a simple way to reset the consumer offset
The connector name controls whether a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest a standalone connector, where you can set the offset.storage.file.filename property to control how the offsets are stored.
KIP-199 discusses resetting consumer offsets for Connect, but the feature isn't implemented.
However, have you seen how to reset offsets in Kafka 0.11?
Alternative options include Apache NiFi or StreamSets, both of which integrate with the Schema Registry and can parse Avro data to transport it to numerous systems.
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka MirrorMaker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another, such as a test environment.