I am trying to stream data using Kafka Connect with the HDFS Sink Connector. Both standalone and distributed modes run fine, but the connector writes to HDFS only once (based on flush.size) and does not keep streaming afterwards. Please help if I'm missing something.
Confluent 2.0.0 & Kafka 0.9.0
I faced this issue a while back. Check whether any of the parameters below are missing; a minimal sketch of where these settings live follows the list.
connect-hdfs-sink properties:
"logs.dir":"/hdfs_directory/data/log"
"request.timeout.ms":"310000"
"offset.flush.interval.ms":"5000"
"heartbeat.interval.ms":"60000"
"session.timeout.ms":"300000
"max.poll.records":"100"
I am using a file system source connector to ingest data into Kafka. But I am not able to find any file sink connector: I checked Pulse and Spooldir, and both offer only source connectors. I tried the FileStream sink connector, but it is not production-grade, as mentioned on the official website.
Could anyone please suggest a solution or a suitable connector?
Note: I don't want to use a consumer application.
I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
Now, working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS.
Checkpointing with Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we would have to use HDFS only to stream data from Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system". Therefore, other file systems will also work. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode it can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
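To illustrate, here is a minimal PySpark sketch (servers, topic, and paths are placeholders) that reads from Kafka and checkpoints to any HDFS-compatible URI reachable by every executor, such as an s3a:// bucket:

# Requires the spark-sql-kafka-0-10 package on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Read from Kafka (placeholder host and topic)
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-host:9092")
      .option("subscribe", "my_topic")
      .load())

# The checkpoint location can be any HDFS-compatible URI,
# as long as all executors can reach it.
query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("parquet")
         .option("path", "s3a://bucket/output")
         .option("checkpointLocation", "s3a://bucket/checkpoints/kafka-stream")
         .start())

query.awaitTermination()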
I installed Neo4j and I can access the server. I can create nodes through Cypher.
Now I want to use it for data streams, but I'm not sure how to do so. I just started with Neo4j and I'm struggling to install the Streams plugin.
Any help is highly appreciated.
You should copy the jar files for the Neo4j Streams plugin directly into your /plugins folder and configure the connections to Kafka and ZooKeeper, as well as other Neo4j property values, in the neo4j.conf file as described here. For example:
kafka.zookeeper.connect=zookeeper-host:2181
kafka.bootstrap.servers=kafka-host:9092
Alternatively, if you are looking only for a sink connection from Kafka (i.e. moving records from Kafka topics into Neo4j), you can also use Kafka Connect with the supported Kafka Connect Neo4j Sink. More at https://www.confluent.io/hub/neo4j/kafka-connect-neo4j
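As a rough sketch of that second option (topic, hosts, credentials, and the Cypher statement are all placeholders; check the connector documentation for the exact property names), a sink definition posted to the Kafka Connect REST API could look like:

{
  "name": "neo4j-sink",
  "config": {
    "connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
    "topics": "my-topic",
    "neo4j.server.uri": "bolt://neo4j-host:7687",
    "neo4j.authentication.basic.username": "neo4j",
    "neo4j.authentication.basic.password": "password",
    "neo4j.topic.cypher.my-topic": "MERGE (p:Person {id: event.id}) SET p.name = event.name"
  }
}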
I can run the example by following Start Streaming with Kafka and Spring Cloud, but unfortunately it doesn't use the Confluent Schema Registry. I read the Confluent Schema Registry part of the Spring Cloud Stream reference guide, but it didn't work with my Confluent 3.0.0, and the guide doesn't mention how to produce Avro messages using the Confluent Schema Registry. Can anyone guide me on how to achieve this? Thanks!
Spring Cloud Stream is not yet compatible with the Confluent Schema Registry. See the discussion in this thread: https://github.com/spring-cloud/spring-cloud-stream/issues/850
I have a system pushing Avro data into multiple Kafka topics.
I want to push that data to HDFS. I came across Confluent, but I am not sure how to send data to HDFS without starting kafka-avro-console-producer.
Steps I performed:
I have my own Kafka and ZooKeeper running, so I just started the Confluent Schema Registry.
I started kafka-connect-hdfs after changing topic name.
This step is also successful. It's able to connect to HDFS.
After this I started pushing data to Kafka, but the messages were not pushed to HDFS.
Please help. I'm new to Confluent.
You can avoid using kafka-avro-console-producer and use your own producer to send messages to the topics, but we strongly encourage you to use the Confluent Schema Registry (https://github.com/confluentinc/schema-registry) to manage your schemas, and to use the Avro serializer that is bundled with the Schema Registry to keep your Avro data consistent. There's a nice writeup on the rationale for why this is a good idea here. A sketch of such a producer follows.
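For illustration, a minimal Python producer using the Avro serializer from confluent-kafka-python (broker, registry URL, topic, and schema are placeholders; the same idea applies in any client language):

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Placeholder schema; substitute your actual Avro schema.
value_schema = avro.loads('''
{
  "type": "record",
  "name": "Example",
  "fields": [{"name": "f1", "type": "string"}]
}
''')

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=value_schema,
)

# The AvroProducer registers the schema with the registry and
# serializes the record, so the HDFS connector can deserialize it
# using the same registry.
producer.produce(topic="test_hdfs", value={"f1": "value1"})
producer.flush()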
If you are able to send messages produced with kafka-avro-console-producer to HDFS, then your problem is likely that the kafka-connect-hdfs connector cannot deserialize the data. I assume you are going through the quickstart guide. You will get the best results by using the same serializer on both sides (into and out of Kafka) if you intend to write Avro to HDFS. How this process works is described in this documentation.
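Concretely, the Connect worker needs to be pointed at the same Schema Registry via the Avro converter; a sketch of the relevant worker properties (the URL is a placeholder):

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081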