Should topics created via kafka-topics automatically have associated subjects created?

I'm attempting to mimic 'confluent load' (which isn't recommended for production use) to add connectors, which automatically create the topic, subject, etc. that allow for KSQL stream and table creation. I'm using curl to interact with the REST interface.
When kafka-topics is used to create topics, does this also create the associated subjects for "topicName-value", etc.?
$ curl -X GET http://localhost:8082/topics | jq
[
"Topic_OracleSource2"
]
curl -X GET http://localhost:8081/subjects | jq
[]
Nothing shows. However, performing a curl:
curl -X POST -H "Content-Type: application/vnd.kafka.avro.v2+json" -H "Accept: application/vnd.kafka.v2+json" --data '{"value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}", "records": [{"value": {"name": "testUser"}}]}' "http://localhost:8082/topics/avrotest"
creates the subject:
curl -X GET http://localhost:8081/subjects | jq
[
"avrotest-value"
]
As far as I know, doing it this way isn't recommended as topics are created on the fly and not pre-created in a controlled environment.
The reason this question comes about is that it seems the subject 'topicName-value/key' pair is needed to create streams for the topic inside KSQL.
Without the subject, I can only see data coming across from the Avro-based connector I created, but I can't perform further transformations using KSQL streams and tables.

kafka-topics only interacts with Zookeeper and Kafka. It has no notion of the existence of a Schema Registry.
The Avro schema / subject is created by the Avro serializer configured on the producer. If a Kafka Connect source is configured with the AvroConverter, it will register a schema itself upon getting data, so you should not need curl, assuming you are satisfied with the generated schema.
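For example, a minimal sketch of the relevant converter settings in the Connect worker or connector config (the Schema Registry URL here assumes the usual local default):
# use the Confluent AvroConverter so schemas are registered automatically on first write
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
With this in place, the connector registers "topicName-value" (and "topicName-key" if the key is also Avro) as soon as it produces data.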
To my knowledge, there's no way to prevent KSQL from auto-registering a schema in the registry.
seems the subject 'topicName-value/key' pair is needed to create streams for the topic inside KSQL.
If you want to use Avro, yes. But no, it's not "needed" for the other data formats KSQL supports.
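To illustrate (a sketch only; the stream names and the column are made up): with Avro, KSQL derives the columns from the registered subject, while with JSON you declare them yourself.
-- Avro: works once the Topic_OracleSource2-value subject exists in Schema Registry
CREATE STREAM oracle_source_avro WITH (KAFKA_TOPIC='Topic_OracleSource2', VALUE_FORMAT='AVRO');
-- JSON: no subject needed, but the columns must be declared explicitly
CREATE STREAM oracle_source_json (name VARCHAR) WITH (KAFKA_TOPIC='Topic_OracleSource2', VALUE_FORMAT='JSON');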
can't further perform transformation using ksql stream and table.
You'll need to be more explicit about why that is. Are you getting errors?

When kafka-topics is used to create topics, does this also create the associated subjects for "topicName-value", etc.?
No, the subjects are not created automatically. (kafka-topics today doesn't even allow you to pass an Avro schema.)
Might be worth a feature request?
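In the meantime, if you do want to pre-create a subject in a controlled way rather than relying on a serializer, the Schema Registry REST API accepts the schema directly; a sketch using the same User record as above:
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}"}' \
http://localhost:8081/subjects/avrotest-value/versions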

Related

Not able to get data inserted into the Snowflake database through the Snowflake Kafka connector, even though the connector is getting started

[Kafka,Zookeeper user@ip-XX-XX-XX-XXX config]$ curl -X POST -H "Content-Type: application/json" --data @snowflake-connector.json http://XX.XX.XX.XXX:8083/connectors
{"name":"file-stream-distributed","config":{"connector.class":"com.snowflake.kafka.connector.SnowflakeSinkConnector","tasks.max":"1","topics":"uat.product.topic","buffer.count.records":"10000","buffer.flush.time":"60","buffer.size.bytes":"5000000","snowflake.url.name":"XXX00000.XX-XXXX-0.snowflakecomputing.com:443","snowflake.user.name":"kafka_connector_user_1","snowflake.private.key":"XXXXX","snowflake.database.name":"KAFKA_DB","snowflake.schema.name":"KAFKA_SCHEMA","key.converter":"org.apache.kafka.connect.storage.StringConverter","key.converter.schemas.enable":"true","value.converter":"com.snowflake.kafka.connector.records.SnowflakeJsonConverter","value.converter:schemas.enable":"true","value.converter.schema.registry.url":"http://localhost:8081","name":"file-stream-distributed"},"tasks":[],"type":"sink"}
Checking the status of the connector returns "not found":
[Kafka,Zookeeper user@ip-XX-XX-XX-XXX config]$ curl XX.XX.XX.XXX:8083/connectors/file-stream-demo-distributed/tasks
{"error_code":404,"message":"Connector file-stream-demo-distributed not found"}
And the data is not getting inserted into the database
logs: /opt/Kafka/kafka/logs/connect.logs:
[2022-05-29 14:51:22,521] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 39 with protocol version 2 and got assignment: Assignment{error=0, leader='connect-1-fee23ff6-e61d-4e4c-982d-da2af6272a08', leaderUrl='http://XX.XX.XX.XXX:8083/', offset=4353, connectorIds=[], taskIds=[], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1681)
You need to specify a user that has the appropriate CREATE TABLE privileges for use in the Kafka connector. Suggest you review the documentation.
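As a side note, the creation response above shows the connector was registered under the name file-stream-distributed, so status checks have to use that exact name; for example, with the standard Kafka Connect REST API:
# list the connectors the worker actually knows about
curl XX.XX.XX.XXX:8083/connectors
# check the state of the connector and its tasks under the registered name
curl XX.XX.XX.XXX:8083/connectors/file-stream-distributed/status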

Apache Druid throwing network error in console

When we try to query a datasource, the query runs for 5+ minutes and then throws a network error in the console. We are trying to fetch a huge result set, in the millions of rows. Is this a limitation in Druid, where we can't fetch huge record sets? Other aggregated queries run fine and produce results.
SELECT * FROM "datasource"
WHERE "__time" >= TIMESTAMP '2021-06-21' and "__time" <= TIMESTAMP '2021-06-23' and consumerid=1234
Segment Granularity: DAY
Query Granularity : DAY
Segments Created: 736
Avg Segment Size: 462 MB
Total Datasource Size: 340.28 GB
Replicated Size: 680.55 GB
secondary partition: single_dim (consumerid)
Is there some way to overcome this issue?
I've also tried this via the API; after about 5 minutes 30 seconds it throws an error:
curl --location --request POST 'https://druiddev-druid.int.org/druid/v2/sql' --header 'Authorization: Basic Username/p' --header 'Content-Type: application/json' --data-raw '{
"query": "SELECT * FROM datasource WHERE consumerid=1234 and buydate='\''01/01/2021'\''",
"resultFormat" : "csv",
"batchSize":20480
}' > output.dat
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 17.9M 0 17.9M 0 165 58699 0 --:--:-- 0:05:20 --:--:-- 86008
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

Kafka MirrorMaker 2.0 duplicates each message

I am trying to replicate a Kafka cluster with MirrorMaker 2.0. I am using the following mm2.properties:
name = mirror-site1-site2
topics = .*
connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector
tasks.max = 1
plugin.path=/usr/share/java/kafka/plugin
clusters = site1, site2
# for demo, source and target clusters are the same
source.cluster.alias = site1
target.cluster.alias = site2
site1.sasl.mechanism=SCRAM-SHA-256
site1.security.protocol=SASL_PLAINTEXT
site1.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
username="<someuser>" \
password="<somepass>";
site2.sasl.mechanism=SCRAM-SHA-256
site2.security.protocol=SASL_PLAINTEXT
site2.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
username="<someuser>" \
password="<somepass>";
site1.bootstrap.servers = <IP1>:9093, <IP2>:9093, <IP3>:9093, <IP4>:9093
site2.bootstrap.servers = <IP5>:9093, <IP6>:9093, <IP7>:9093, <IP8>:9093
site1->site2.enabled = true
site1->site2.topics = topic1
# use ByteArrayConverter to ensure that records are not re-encoded
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter
So here's the issue: mm2 seems to always replicate each message three times:
# Manual message production:
kafkacat -P -b <IP1>:9093,<IP2>:9093,<IP3>:9093,<IP4>:9093 -t "topic1"
# Result in the source topic (site1 cluster), read back with a kafkacat consumer:
% Reached end of topic topic1 [2] at offset 405
Message1
% Reached end of topic topic1 [2] at offset 406
Message2
% Reached end of topic topic1 [6] at offset 408
Message3
% Reached end of topic topic1 [2] at offset 407
# Result in the target topic (site2 cluster), read back with a kafkacat consumer:
kafkacat -C -b <IP5>:9093,<IP6>:9093,<IP7>:9093,<IP8>:9093 -t "site1.topic1"
% Reached end of topic site1.topic1 [2] at offset 1216
Message1
Message1
Message1
% Reached end of topic site1.topic1 [2] at offset 1219
Message2
Message2
Message2
% Reached end of topic site1.topic1 [6] at offset 1229
Message3
Message3
Message3
I tried using Kafka from the Confluent package and kafka_2.13-2.4.0 directly from Apache, both on Debian 10.1.
I first encountered this behaviour with Confluent 5.4 and thought it could be a bug in their package, as they have Replicator and may not really care about mm2, but I reproduced exactly the same issue with kafka_2.13-2.4.0 directly from Apache without any change.
I'm aware that mm2 is not yet idempotent and can't guarantee exactly-once delivery. In my tests I tried many things, including producer tuning and bigger batches of thousands of messages. In all these tests mm2 always duplicated every message three times.
Did I miss something, or has someone encountered the same thing? As a side note, with legacy mm1 and the same packages I don't have this issue.
Appreciate any help... Thanks!
Even if the changelog didn't make me very confident about an improvement, I tried again to run mm2, from Kafka 2.4.1 this time. => no change, always these strange duplications.
I installed this release on a new server to ensure the strange behaviour I saw wasn't something related to the server.
As I use ACLs, do I need a special right? I granted "all", thinking it can't be more permissive... Even if mm2 isn't idempotent yet, I'll give the rights related to that a try.
What surprises me the most is that I can't find anything reporting an issue like this; for sure I must be doing something wrong, but what, that is the question...
You need to remove connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector from your configuration. That line tells MirrorMaker to use this class for the Heartbeat and Checkpoint connectors it generates alongside the Source connector that replicates data, which makes all of them behave exactly like a Source connector. That's why each message is replicated 3 times: you've actually generated 3 Source connectors.
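In other words, MirrorMaker 2 derives the MirrorSource, MirrorHeartbeat and MirrorCheckpoint connectors itself from the site1->site2 flow; a trimmed sketch of the configuration above with the override removed (SASL and converter settings unchanged):
clusters = site1, site2
source.cluster.alias = site1
target.cluster.alias = site2
site1.bootstrap.servers = <IP1>:9093, <IP2>:9093, <IP3>:9093, <IP4>:9093
site2.bootstrap.servers = <IP5>:9093, <IP6>:9093, <IP7>:9093, <IP8>:9093
site1->site2.enabled = true
site1->site2.topics = topic1
# note: no connector.class line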
Enabling idempotence in the client config will fix the issue. By default it is set to false. Add the below to the mm2.properties file:
source.cluster.producer.enable.idempotence = true
target.cluster.producer.enable.idempotence = true

Confluent Kafka Rest Proxy - Avro Deserialization

I am trying to use the Confluent Kafka REST Proxy to retrieve data in Avro format from one of my topics, but unfortunately I get a deserialization error. I am querying the Kafka REST Proxy using the following command:
curl -X GET -H "Accept: application/vnd.kafka.avro.v2+json" \
http://localhost:8082/consumers/my-group/instances/my-consumer/records?timeout=30000
And I get the following response:
{
"error_code": 50002,
"message": "Kafka error: Error deserializing key/value for partition input-0 at offset 0. If needed, please seek past the record to continue consumption."
}
and the logs on the Kafka REST Proxy server are:
org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition input-0 at offset 0. If needed, please seek past the record to continue consumption.
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The data has been produced using KafkaAvroSerializer and the schema is present in the Schema Registry. Also note that the data is readable using kafka-avro-console-consumer on the CLI.
Does anybody know how to resolve this issue?
It's most likely that as well as valid Avro messages on the topic, you also have invalid ones. That's what this error means, and is exactly the error that I got when I tried to consume a non-Avro message locally with the REST Proxy:
ERROR Unexpected exception in consumer read task id=io.confluent.kafkarest.v2.KafkaConsumerReadTask#2e20d4f3 (io.confluent.kafkarest.v2.KafkaConsumerReadTask)
org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition avrotest-0 at offset 2. If needed, please seek past the record to continue consumption.
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
I would use a tool such as kafkacat to inspect the actual messages at the offset given in the error, e.g.:
kafkacat -C -b localhost:9092 -t test_topic_avro -o 0 -c 1
The -o 0 will consume the message at offset 0, and -c 1 means consume just one message.
You can also seek past the problematic offset, e.g. for topic avrotest move the offset to 1:
echo '{ "offsets": [ { "topic": "avrotest", "partition": 0, "offset": 1 } ] }' | \
http POST localhost:8082/consumers/rmoff_consumer_group/instances/rmoff_consumer_instance/positions \
Content-Type:application/vnd.kafka.v2+json
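The same seek can also be done with curl, to stay consistent with the other examples here (the consumer group and instance names are the ones from the snippet above):
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
--data '{ "offsets": [ { "topic": "avrotest", "partition": 0, "offset": 1 } ] }' \
http://localhost:8082/consumers/rmoff_consumer_group/instances/rmoff_consumer_instance/positions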
Having String keys and Avro values wasn't supported in the REST Proxy until recently:
https://github.com/confluentinc/kafka-rest/issues/210
So recently that the code has only just been merged; the issue is still open and the docs haven't been fully updated:
https://github.com/confluentinc/kafka-rest/pull/797

Invalid characters shown while consuming using the Kafka console consumer

While consuming from the Kafka topic using the Kafka console consumer or kt (a Go CLI tool for Kafka), I am getting invalid characters:
...
\u0000\ufffd?\u0006app\u0000\u0000\u0000\u0000\u0000\u0000\u003e#\u0001
\u0000\u000cSec-39\u001aSome Actual Value Text\ufffd\ufffd\ufffd\ufffd\ufffd
\ufffd\u0015#\ufffd\ufffd\ufffd\ufffd\ufffd\ufff
...
This happens even though Kafka Connect can actually sink the proper data to an SQL database.
Given that you say
Kafka connect can actually sink the proper data to an SQL database.
my assumption would be that you're using Avro serialization for the data on the topic. Kafka Connect configured correctly will take the Avro data and deserialise it.
However, console tools such as kafka-console-consumer, kt, kafkacat et al do not support Avro, and so you get a bunch of weird characters if you use them to read data from a topic that is Avro-encoded.
To read Avro data to the command line you can use kafka-avro-console-consumer:
kafka-avro-console-consumer \
--bootstrap-server kafka:29092 \
--topic test_topic_avro \
--property schema.registry.url=http://schema-registry:8081
Edit: Adding a suggestion from @CodeGeas too:
Alternatively, reading the data using the REST Proxy can be done with the following:
# Create a consumer instance that reads Avro data, starting from the beginning of the topic
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
-H "Accept: application/vnd.kafka.v2+json" \
--data '{"name": "my_consumer_instance", "format": "avro", "auto.offset.reset": "earliest"}' \
http://kafka-rest-instance:8082/consumers/my_json_consumer
# Subscribe the consumer to a topic
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
--data '{"topics":["YOUR-TOPIC-NAME"]}' \
http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance/subscription
# Then consume some data from the topic, using the base URI returned in the first response
curl -X GET -H "Accept: application/vnd.kafka.avro.v2+json" \
http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance/records
Later, to delete the consumer:
curl -X DELETE -H "Accept: application/vnd.kafka.avro.v2+json" \
http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance
By default, the console consumer tool deserializes both the message key and value using the ByteArrayDeserializer, but then tries to print the data to the command line using the default formatter.
This tool, however, lets you customize the deserializers and the formatter used. See the following extract from the help output:
--formatter <String: class> The name of a class to use for
formatting kafka messages for
display. (default: kafka.tools.
DefaultMessageFormatter)
--property <String: prop> The properties to initialize the
message formatter. Default
properties include:
print.timestamp=true|false
print.key=true|false
print.value=true|false
key.separator=<key.separator>
line.separator=<line.separator>
key.deserializer=<key.deserializer>
value.deserializer=<value.
deserializer>
Users can also pass in customized
properties for their formatter; more
specifically, users can pass in
properties keyed with 'key.
deserializer.' and 'value.
deserializer.' prefixes to configure
their deserializers.
--key-deserializer <String:
deserializer for key>
--value-deserializer <String:
deserializer for values>
Using these settings, you should be able to change the output to be what you want.
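For example, a minimal sketch (the topic name and deserializer choices are illustrative) that prints keys as strings alongside the values:
kafka-console-consumer --bootstrap-server localhost:9092 \
--topic test_topic \
--from-beginning \
--property print.key=true \
--property key.separator=" : " \
--key-deserializer org.apache.kafka.common.serialization.StringDeserializer \
--value-deserializer org.apache.kafka.common.serialization.StringDeserializer
For genuinely Avro-encoded values, though, kafka-avro-console-consumer as shown above remains the simpler option.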