Some data is not consumed by the Kafka engine - apache-kafka

We are using the Kafka engine to connect to a Kafka topic, and then a MATERIALIZED VIEW to store the data.
But from time to time, some data is not consumed by the Kafka engine (we also use Flume to put the data into HDFS files, and the missing data can be found there).
Is there any other way to find the relevant logs to locate the problem, apart from upgrading the ClickHouse server version (we are in the process of upgrading the ClickHouse server)?

You can enable rdkafka logs by adding a <kafka><debug>all</debug></kafka> fragment to the ClickHouse config.xml file.
Logs will be written to the /var/log/clickhouse-server/stderr file (in Docker, to the Docker logs).
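For reference, here is a minimal sketch of how that fragment sits inside config.xml (the root element name depends on your ClickHouse version; older releases use <yandex> instead of <clickhouse>):
<clickhouse>
    <!-- enable verbose librdkafka debug output for the Kafka engine -->
    <kafka>
        <debug>all</debug>
    </kafka>
</clickhouse>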

Related

Is there any way to send chrome history logs to kafka?

I want to send my google chrome history to kafka.
My basic idea is to use my local data located in
C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
To do so, I want to use the Kafka file source connector.
But how can I send newly added Chrome history entries after I run the Kafka source connector?
Is there any way to track changes to the source file so that the Kafka broker can pick them up?
Indeed you can use the FileStreamSourceConnector to achieve that. You do not need to do anything else.
Once you start the FileStreamSourceConnector, it will hook into the specified file. So, whenever new data is appended to the file, the connector will automatically produce it to the topic.
From the link that I shared above:
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
This may help you: Read File Data with Connect
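As a rough sketch, a standalone FileStreamSource configuration for that file could look like this (the connector name and topic are placeholders; the path is the one from the question). You would pass this properties file to connect-standalone together with a worker config:
# chrome-history-source.properties (illustrative values)
name=chrome-history-source
connector.class=FileStreamSource
tasks.max=1
file=C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
topic=chrome-history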

Is there a way to know how many records are written by a Kafka Connect sink?

I use the HDFS sink connector and I want to know how many records are written to HDFS.
In the logs I can see when the connector starts to write a file to HDFS, but not how many records it contains.
For example:
INFO Opening record writer for: hdfs://hdfs/path/+tmp/table/partition=2020-02-27/19955b52-8189-4f70-94b5-46d579cd1505_tmp.avro (io.confluent.connect.hdfs.avro.AvroRecordWriterProvider)
Is it possible by extending the connector itself?
I use the Kafka Connect HDFS 2 Sink.
Out of the box, not that I know of (of course, it's open source, and you could look). Each file will contain a variable amount of data, so metric tracking wouldn't be all that useful.
I cannot recall if the debug or trace logs expose that information.
You can use Hive/Spark/HDFS CLI to inspect each file, though.
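For example, here is a small PySpark sketch that reads the committed Avro files back and counts the rows (the HDFS path is only an assumption derived from the log line above, and the spark-avro package must be on the classpath):
# count_records.py - rough sketch, not part of the connector itself
from pyspark.sql import SparkSession

# spark-avro must be available, e.g. via --packages org.apache.spark:spark-avro_2.12:<version>
spark = SparkSession.builder.appName("count-hdfs-sink-records").getOrCreate()

# read the files the sink committed for one partition and count the rows
df = spark.read.format("avro").load("hdfs://hdfs/path/table/partition=2020-02-27/*.avro")
print(df.count())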

Apache Kafka topic data in HDFS format

Currently I have installed Kafka on Linux, created a topic, and published messages to it, and it saves the data in the folder /tmp/kafka-logs/topicname-0. As I checked, the local file system type is xfs. Is there any way Kafka can save data in the HDFS file system format? If yes, please help me with the configuration or steps.
Kafka runs on top of a local filesystem. It cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is using a connector to push the data to HDFS https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html
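As a rough sketch, a minimal HDFS 2 Sink connector configuration looks something like this (the topic name, HDFS URL, and flush size are placeholders):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=topicname
hdfs.url=hdfs://namenode:8020
flush.size=1000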

Which directory does Apache Kafka store the data in on broker nodes?

I can see a property in config/server.properties called log.dir. Does this mean Kafka uses the same directory for storing both logs and data?
Kafka topics are "distributed and partitioned append only logs". The parameter log.dir defines where topics (i.e., data) are stored.
It is not related to application/broker logging.
The default log.dir is /tmp/kafka-logs which you may want to change in case your OS has a /tmp directory cleaner.
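For example, you can point the broker at a more durable location in config/server.properties (the path below is only an illustration):
# config/server.properties
log.dirs=/var/lib/kafka/data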
log.dir or log.dirs in config/server.properties specify the directories in which the log data is kept.
The server log directory is kafka_base_dir/logs by default. You could modify it by specifying another directory for 'kafka.logs.dir' in log4j.properties.
log.dir in server.properties is the place where the Kafka broker will store the commit logs containing your data. Typically this will be a high-speed mounted disk for mission-critical use cases.
For application/broker logging you can use standard log4j logging to send the event logs to a custom location. Below are the variables to do this:
-Dlog4j.configuration=file:<configuration file with log rolling, logging level etc.> & -Dkafka.logs.dir=<path to logs>
The directory locations of logs and data were perfectly described by Mathias. Still, that data is designed for Kafka's internal processing, so you could use Kafka Connect to store and manipulate the data instead. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.
It makes it simple to define connectors that move large amounts of data into and out of Kafka. Kafka Connect can ingest entire databases, making the data available for stream processing, or sink the data of a single topic (or several) to another system or database for further analysis.

Kafka connect not working for file streaming

I have been using Kafka Connect for the Confluent Platform, following this guide:
Kafka connect quickstart
But it doesn't update the sink file anymore; any changes in the source file are not written to the Kafka topic.
I have already deleted all tmp files, but nothing changed.
Thanks in advance
Start up a new file source connector with a new location for storing the offsets. This connector is meant as a demo and really doesn't handle anything except a simple file that only gets append updates. Note, you shouldn't be doing anything with this connector other than a simple demo. Have a look at the connector hub if you need something for production.
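In standalone mode the offsets are kept in the file configured in the worker properties, so pointing it at a fresh file effectively resets the connector (the path below is just an example):
# connect-standalone.properties
offset.storage.file.filename=/tmp/connect-new.offsets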
To OP: I had this just a few minutes ago, but when I restarted the connector it was fine; both test.sink.txt and the consumer were getting the new lines. So, in a nutshell, just restart your connector.
If the FileStreamSource/Sink stops working after it had been working fine, and you have already restarted ZooKeeper, the Kafka server, and the connector but it still does not work, then the problem is with the connect.offsets file in the Kafka directory.
You should delete it and create a new, empty one.
I faced the same problem before, but correcting the paths of the input and output files in the properties files as shown below worked for me. It then streamed from the input file (test.txt) to the output file (test.sink.txt).
# connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/home/mypath/kafka/test.txt
topic=connect-test

# connect-file-sink.properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/home/mypath/kafka/test.sink.txt
topics=connect-test