There's no Avro data in HDFS using Kafka Connect - apache-kafka

I am using Kafka Connect in distributed mode.
The command is: bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties
The worker configuration is:
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
Kafka Connect starts up with no errors!
The topics connect-configs, connect-offsets and connect-statuses have been created.
The topic mysiteview has been created.
Then I create the connector using the REST API like this:
curl -X POST -H "Content-Type: application/json" --data '{"name":"hdfs-sink-mysiteview","config":{"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector","tasks.max":"3","topics":"mysiteview","hdfs.url":"hdfs://master1:8020","topics.dir":"/kafka/topics","logs.dir":"/kafka/logs","format.class":"io.confluent.connect.hdfs.avro.AvroFormat","flush.size":"1000","rotate.interval.ms":"1000","partitioner.class":"io.confluent.connect.hdfs.partitioner.DailyPartitioner","path.format":"YYYY-MM-dd","schema.compatibility":"BACKWARD","locale":"zh_CN","timezone":"Asia/Shanghai"}}' http://kafka1:8083/connectors
And then I produce data to the topic "mysiteview", something like this:
{"f1":"192.168.1.1","f2":"aa.example.com"}
The Java code is as follows:
Properties props = new Properties();
props.put("bootstrap.servers","kafka1:9092");
props.put("acks","all");
props.put("retries",3);
props.put("batch.size", 16384);
props.put("linger.ms",30);
props.put("buffer.memory",33554432);
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<String,String>(props);
Random rnd = new Random();
for (long nEvents = 0; nEvents < events; nEvents++) {
    long runtime = new Date().getTime();
    String site = "www.example.com";
    String ipString = "192.168.2." + rnd.nextInt(255);
    String key = "" + rnd.nextInt(255);
    User u = new User();
    u.setF1(ipString);
    u.setF2(site + " " + rnd.nextInt(255));
    System.out.println(JSON.toJSONString(u));
    producer.send(new ProducerRecord<String, String>("mysiteview", JSON.toJSONString(u)));
    Thread.sleep(50);
}
producer.flush();
producer.close();
Then something weird occurs.
I can see the data in the Kafka log files, but there is no data in HDFS (no topic directory).
I check the connector status with:
curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/status
output is:
{"name":"hdfs-sink-mysiteview","connector":{"state":"RUNNING","worker_id":"10.255.223.178:8083"},"tasks":[{"state":"RUNNING","id":0,"worker_id":"10.255.223.178:8083"},{"state":"RUNNING","id":1,"worker_id":"10.255.223.178:8083"},{"state":"RUNNING","id":2,"worker_id":"10.255.223.178:8083"}]}
But when I inspect the task status using the following command:
curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/hdfs-sink-siteview-1
I get the result: "Error 404". All three tasks give the same error!
What's going wrong?

Without seeing the worker's log, I'm not sure exactly which exception your HDFS connector tasks are failing with when you use the settings you describe above. However, I can spot a few issues with the configuration:
You mention that you start your Connect worker with: bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties. These properties default to having key and value converters set to AvroConverter and require you to run the schema-registry service. If indeed you've edited the configuration in connect-avro-distributed.properties to use the JsonConverter instead, your HDFS connector will probably fail during the conversion of Kafka records to Connect's SinkRecord data type, just before it tries to export your data to HDFS.
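For reference, the stock connect-avro-distributed.properties ships with converter settings along these lines (the schema-registry URL is whatever your deployment uses):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081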
Until recently, the HDFS connector was able to export only Avro records, to files in Avro or Parquet format. That requires using the AvroConverter as mentioned above. The capability to export records to text files as JSON was added recently and will appear in version 4.0.0 of the connector (you may try this capability by checking out and building the connector from source).
At this point, my first suggestion would be to try to import your data with bin/kafka-avro-console-producer. Define their schema, confirm that the data are imported successfully with bin/kafka-avro-console-consumer, and then set your HDFS connector to use AvroFormat as above. The quickstart on the connector's page describes a very similar process, and it may be a good starting point for your use case.
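If you prefer to stay in Java rather than use the console tools, a rough sketch of an Avro producer for this topic could look like the snippet below. The registry URL (http://kafka1:8081) and the two-field record schema are assumptions based on your JSON example, so adjust them to your setup:
Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Confluent's Avro serializer registers the schema and writes Avro bytes instead of plain JSON strings
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://kafka1:8081"); // assumed registry address

Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"mysiteview\",\"fields\":["
    + "{\"name\":\"f1\",\"type\":\"string\"},{\"name\":\"f2\",\"type\":\"string\"}]}");

Producer<String, GenericRecord> producer = new KafkaProducer<>(props);
GenericRecord record = new GenericData.Record(schema);
record.put("f1", "192.168.1.1");
record.put("f2", "aa.example.com");
producer.send(new ProducerRecord<>("mysiteview", record));
producer.flush();
producer.close();
This uses the KafkaAvroSerializer from the kafka-avro-serializer artifact together with Avro's GenericRecord; data produced this way can be read by the AvroConverter and written out by the HDFS connector's AvroFormat.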

Maybe you are just using the REST API wrong.
According to the documentation, the call should be
/connectors/:connector_name/tasks/:task_id/status
https://docs.confluent.io/3.3.1/connect/restapi.html#get--connectors-(string-name)-tasks-(int-taskid)-status
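For example, the status of the first task of this connector would be fetched with:
curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/tasks/0/status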

Related

Configure Apache Kafka sink jdbc connector

I want to send the data sent to the topic to a PostgreSQL database. So I followed this guide and configured the properties file like this:
name=transaction-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=transactions
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=db-user
connection.password=
auto.create=true
insert.mode=insert
table.name.format=transaction
pk.mode=none
I start the connector with
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/sink-quickstart-postgresql.properties
The sink-connector is created but does not start due to this error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The schema is in Avro format and registered, and I can send (produce) messages to the topic and read (consume) from it. But I can't seem to send them to the database.
This is my ./etc/schema-registry/connect-avro-standalone.properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
This is the producer feeding the topic using the Java API:
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
try (KafkaProducer<String, Transaction> producer = new KafkaProducer<>(properties)) {
    Transaction transaction = new Transaction();
    transaction.setFoo("foo");
    transaction.setBar("bar");
    UUID uuid = UUID.randomUUID();
    final ProducerRecord<String, Transaction> record = new ProducerRecord<>(TOPIC, uuid.toString(), transaction);
    producer.send(record);
}
I'm verifying data is properly serialized and deserialized using
./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic transactions \
--from-beginning --max-messages 1
The database is up and running.
This is not correct:
The unknown magic byte can be due to an id field that is not part of the schema
What that error means is that the message on the topic was not serialised using the Schema Registry Avro serialiser.
How are you putting data on the topic?
Maybe all the messages have the problem, maybe only some, but by default this will halt the Kafka Connect task.
You can set
"errors.tolerance":"all",
to get it to ignore messages that it can't deserialise. But if none of them are correctly Avro-serialised this won't help; you need to serialise them correctly, or choose a different converter (e.g. if they're actually JSON, use the JsonConverter).
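For example, in the sink connector's properties this could look like the sketch below (the dead-letter-queue settings are optional, need a reasonably recent Connect version, and the topic name here is only an illustration):
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq_transactions
errors.deadletterqueue.context.headers.enable=true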
These references should help you more:
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
http://rmoff.dev/ksldn19-kafka-connect
Edit:
If you are serialising the key with StringSerializer then you need to use this in your Connect config:
key.converter=org.apache.kafka.connect.storage.StringConverter
You can set it at the worker level (a global property that applies to all connectors you run on it), or just for this connector (i.e. put it in the connector's own properties; it will override the worker setting).

Not able to use Kafka's JdbcSourceConnector to read data from Oracle DB to a Kafka topic

I am trying to write a standalone Java program using the kafka-jdbc-connect API to stream data from an Oracle table to a Kafka topic.
API used: I'm currently trying to use Kafka Connect, the JdbcSourceConnector class to be precise.
Constraint: use the Confluent Java API, not the CLI or the provided shell scripts.
What I did: created an instance of the JdbcSourceConnector class and called its start() method, providing the configuration as a parameter. This configuration has the database connection properties, the table whitelist property, the topic prefix, etc.
After starting the thread, I'm unable to read any data from the "topic-prefix-tablename" topic. I am not sure how to pass the Kafka broker details to JdbcSourceConnector. Calling start() on JdbcSourceConnector starts a thread but doesn't do anything.
Is there a simple Java API tutorial page or example code I can refer to? All the examples I see use the CLI or shell scripts.
Any help is appreciated.
Code:
public static void main(String[] args) {
    Map<String, String> jdbcConnectorConfig = new HashMap<String, String>();
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.CONNECTION_URL_CONFIG, "<DATABASE_URL>");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.CONNECTION_USER_CONFIG, "<DATABASE_USER>");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.CONNECTION_PASSWORD_CONFIG, "<DATABASE_PASSWORD>");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.POLL_INTERVAL_MS_CONFIG, "300000");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.BATCH_MAX_ROWS_CONFIG, "10");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.MODE_CONFIG, "timestamp");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.TABLE_WHITELIST_CONFIG, "<TABLE_NAME>");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.TIMESTAMP_COLUMN_NAME_CONFIG, "<TABLE_COLUMN_NAME>");
    jdbcConnectorConfig.put(JdbcSourceConnectorConfig.TOPIC_PREFIX_CONFIG, "test-oracle-jdbc-");
    JdbcSourceConnector jdbcSourceConnector = new JdbcSourceConnector();
    jdbcSourceConnector.start(jdbcConnectorConfig);
}
Assuming you are trying to do it in Standalone mode.
In your Application run configuration, your main class should be "org.apache.kafka.connect.cli.ConnectStandalone" and you need to pass two property files as program arguments.
You should also make sure your custom JdbcSourceConnector class extends the org.apache.kafka.connect.source.SourceConnector class.
Main Class: org.apache.kafka.connect.cli.ConnectStandalone
Program Arguments: .\path-to-config\connect-standalone.conf .\path-to-config\connector.properties
The "connect-standalone.conf" file will contain all the Kafka broker details.
// Example connect-standalone.conf
bootstrap.servers=<comma separated broker list here>
group.id=some_local_group_id
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=connect.offset
offset.flush.interval.ms=100
offset.flush.timeout.ms=180000
buffer.memory=67108864
batch.size=128000
producer.acks=1
"connector.properties" file will contain all details required to create and start connector
// Example connector.properties
name=some-local-connector-name
connector.class=your-custom-JdbcSourceConnector
tasks.max=3
topic=output-topic
fetchsize=10000
More info here: https://docs.confluent.io/current/connect/devguide.html#connector-example
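If you really need to stay inside a Java program (per your constraint), one option is to invoke the same standalone entry point programmatically. A rough sketch, with placeholder file paths and an illustrative class name:
import org.apache.kafka.connect.cli.ConnectStandalone;

public class EmbeddedConnectRunner {
    public static void main(String[] args) throws Exception {
        // Delegates to the standard standalone runtime, which starts the worker,
        // the converters and the connector defined in the second file.
        ConnectStandalone.main(new String[] {
            "path-to-config/connect-standalone.conf",  // worker and broker settings
            "path-to-config/connector.properties"      // connector definition
        });
    }
}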

Using Kafka to produce data for ClickHouse

I want to use the Kafka integration for ClickHouse. I tried to follow the official tutorial, like here! All the tables have been created. I run the Kafka server, then run a Kafka producer and write a JSON object (like a row in the database) at the command prompt. Like this:
{"timestamp":1554138000,"level":"first","message":"abc"}
I checked with a Kafka consumer and it received the object. But when I checked the tables in my ClickHouse database, they were empty. Any ideas what I did wrong?
UPDATE
To ignore malformed messages, pass the kafka_skip_broken_messages parameter to the table definition.
It looks like a well-known issue that occurred in one of the latest versions of ClickHouse; try adding the extra parameter kafka_row_delimiter to the engine configuration:
CREATE TABLE queue (
    timestamp UInt64,
    level String,
    message String
)
ENGINE = Kafka SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'topic',
    kafka_group_name = 'group1',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n',
    kafka_skip_broken_messages = 1;
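For completeness: with the Kafka engine, the queue table is only a consumer, so rows become queryable once a target table and a materialized view (as in the tutorial) move them over. A rough sketch, with an illustrative flat target table instead of the tutorial's aggregated one:
CREATE TABLE storage (
    timestamp UInt64,
    level String,
    message String
) ENGINE = MergeTree()
ORDER BY timestamp;

CREATE MATERIALIZED VIEW consumer_mv TO storage
    AS SELECT timestamp, level, message FROM queue;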
So sorry, it was my fault. Before starting ClickHouse and Kafka I had tested sending simple messages into the topic, and ClickHouse tried to parse them. I just created a new topic and now everything works. Thank you!

How to deserialize Avro from Kafka with an embedded schema

I receive binary Avro files from a Kafka topic and I must deserialize them. In the messages received from Kafka, I can see a schema at the start of every message. I know it's better practice not to embed the schema and to keep it separate from the actual Avro data, but I don't have control over the producer and I can't change that.
My code runs on top of Apache Storm. First I create a reader:
mDatumReader = new GenericDatumReader<GenericRecord>();
And later I try to deserialize the message without declaring the schema:
Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
GenericRecord payload = mDatumReader.read(null, decoder);
But then I get an error when a message arrives:
Caused by: java.lang.NullPointerException: writer cannot be null!
at org.apache.avro.io.ResolvingDecoder.resolve(ResolvingDecoder.java:77) ~[stormjar.jar:?]
at org.apache.avro.io.ResolvingDecoder.<init>(ResolvingDecoder.java:46) ~[stormjar.jar:?]
at org.apache.avro.io.DecoderFactory.resolvingDecoder(DecoderFactory.java:307) ~[stormjar.jar:?]
at org.apache.avro.generic.GenericDatumReader.getResolver(GenericDatumReader.java:122) ~[stormjar.jar:?]
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:137) ~[stormjar.jar:?]
All the answers I've seen are about using other formats, changing the messages delivered to Kafka or something else. I don't have control over those things.
My question is: given a message as a byte[] with the schema embedded in the binary payload, how do I deserialize that Avro data without declaring the schema, so that I can read it?
With the DatumReader/Writer, there is no such thing as an embedded schema. That was my misunderstanding when looking at Avro & Kafka for the first time as well. But the source code of the Avro serializer clearly shows there is no schema embedded when using the GenericDatumWriter.
It is the DataFileWriter that writes a schema at the beginning of the file and then appends GenericRecords using the GenericDatumWriter.
Since you said there is a schema at the beginning, I assume you can read it, turn it into a Schema object and then pass that into the GenericDatumReader(schema) constructor.
It would be interesting to know how the message is serialized. If the DataFileWriter is used to write into a byte[] instead of an actual file, then you could use the DataFileReader to deserialize the data.
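If that is the case, a minimal sketch of reading such a byte[] back with Avro's object-container API could look like the following (it assumes the payload really is complete DataFileWriter output; the class and method names are illustrative):
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class EmbeddedSchemaReader {
    // messageBytes is the raw payload received from Kafka
    static void dumpRecords(byte[] messageBytes) throws IOException {
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new SeekableByteArrayInput(messageBytes),
                new GenericDatumReader<GenericRecord>())) {
            // The writer schema is read from the container header, so none has to be declared
            while (reader.hasNext()) {
                GenericRecord record = reader.next();
                System.out.println(record);
            }
        }
    }
}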
Add the Avro Maven plugin to the build section of your pom.xml
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.9.1</version>
    <executions>
        <execution>
            <goals><goal>schema</goal></goals>
            <configuration><sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory></configuration>
        </execution>
    </executions>
</plugin>
Create a file like below
{"namespace": "tachyonis.space",
"type": "record",
"name": "Avro",
"fields": [
{"name": "Id", "type": "string"},
]
}
Save above as Avro.avsc in src/main/resources.
In Eclipse or any IDE, run Maven generate-sources, which creates Avro.java in the package folder for the [namespace] tachyonis.space.
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, SCHEMA_REGISTRY_URL_CONFIG);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
KafkaConsumer<String, Avro> consumer = new KafkaConsumer<>(props);
The consumer/producer has to run on the same machine. Otherwise, you need to configure the hosts file on Windows/Linux and change all components' configuration properties from localhost to the actual IP address advertised to the producers/consumers. Otherwise you get network connection errors like:
Connection to node -3 (/127.0.0.1:9092) could not be established. Broker may not be available

Checking whether Kafka data is compressed

The documentation says to add the line compression.codec=gzip in producer.properties to have the messages compressed.
However, when I open a data file such as 0000000000000.log, the data does not look like it is compressed. How should I check whether the data in Kafka is already compressed?
P.S.: For every test I stop the Kafka cluster and ZooKeeper, delete all the data in kafka-logs and ZooKeeper, then start the servers again and create a new topic.
The Java ProducerConfig class has changed for this config.
public static final String COMPRESSION_TYPE_CONFIG = "compression.type";
I've successfully produced messages with the Java client (0.8.2.1) using ProducerConfig.COMPRESSION_TYPE_CONFIG and it works fine.
Example:
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
Or set compression.type=gzip in the broker's server.properties file, in which case the broker applies compression regardless of the client.
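To check what actually landed in a log segment, one option is Kafka's DumpLogSegments tool; the metadata it prints for each message or batch includes the compression codec (exact flags vary a bit between versions, and the file path below is just an example):
./bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/test_compression-0/00000000000000000000.log --print-data-log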
Update for cli tool
Read the usage for the command line tools:
chrisblack:kafka:% ./bin/kafka-console-producer.sh
...
--compression-codec [compression-codec] The compression codec: either 'none',
'gzip', 'snappy', or 'lz4'. If
specified without value, then it
defaults to 'gzip'
...
--new-producer Use the new producer implementation.
--producer-property <producer_prop> A mechanism to pass user-defined
properties in the form key=value to
the producer.
--property <prop> A mechanism to pass user-defined
properties in the form key=value to
the message reader. This allows
custom configuration for a user-
defined message reader.
...
Try running a similar command from the shell:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test_compression --compression-codec