In my Spark job I initialize a Kafka stream with KafkaUtils.createDirectStream.
I read about the seekToEnd method of the Consumer. How can I apply it to the stream?
spark-streaming-kafka transitively includes kafka-clients, so you're welcome to initialize a raw consumer instance on your own and seek it.
Alternatively, if the consumer group has no committed offsets yet, you can set auto.offset.reset=latest in the Kafka parameters (or startingOffsets=latest if you move to Structured Streaming).
Note: the Kafka direct stream API is deprecated as of Spark 2.4, and you should be using Structured Streaming instead.
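For illustration, here is a rough sketch in Scala of the first option: probe the end offsets with a plain KafkaConsumer and hand them to the direct stream via ConsumerStrategies.Assign. This assumes the spark-streaming-kafka-0-10 API; the broker address, topic, group id and batch interval are placeholders.

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object SeekToEndSketch extends App {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("seek-to-end-sketch").setMaster("local[*]"), Seconds(5))

  val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker:9092",
    ConsumerConfig.GROUP_ID_CONFIG -> "my-group",
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    // if the group has no committed offsets yet, this alone starts the stream at the end
    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest")

  // probe the current end offsets with a standalone consumer
  val probe = new KafkaConsumer[String, String](kafkaParams.asJava)
  val partitions = probe.partitionsFor("source-topic").asScala
    .map(p => new TopicPartition(p.topic, p.partition))
  val endOffsets = probe.endOffsets(partitions.asJava).asScala
    .map { case (tp, off) => tp -> off.longValue }.toMap
  probe.close()

  // hand the probed offsets to the direct stream so it starts at the end of each partition
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Assign[String, String](endOffsets.keys.toList, kafkaParams, endOffsets))

  // ... define your processing on `stream`, then ssc.start() / ssc.awaitTermination()
}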
Related
How to create a new Kafka Producer from an existing Consumer with Java?
You can't create a KafkaProducer from a KafkaConsumer instance.
You have to explicitly create a KafkaProducer using the same connection settings as your consumer.
Considering the use case you mentioned (copying data from a topic to another), I'd recommend using Kafka Streams. There's actually an example in Kafka that does exactly that: https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/pipe/PipeDemo.java
I would recommend using the Kafka Streams library. It reads data from Kafka topics, does some processing, and writes the results back to other topics.
That could be the simpler approach for you.
https://kafka.apache.org/documentation/streams/
A current limitation is that the source and destination cluster must be the same with Kafka Streams.
Otherwise you need to use the Processor API to define another destination cluster.
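For what it's worth, here is a minimal sketch (Scala, using the newer StreamsBuilder API) of that topic-to-topic copy, in the spirit of the PipeDemo example linked above; the application id, broker address and topic names are placeholders.

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object PipeSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pipe-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  // read every record from the input topic and write it unchanged to the output topic
  builder.stream[String, String]("source-topic").to("destination-topic")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}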
Another approach is to simply define a producer in the consumer program. Wherever your rule matches (based on offset or any other condition), call the producer.send() method.
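A hedged sketch of that consumer-plus-producer approach (Scala); the topics, group id and the matchesRule predicate are made-up placeholders standing in for whatever rule you actually need.

import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CopyIfMatch extends App {
  val consumerProps = new Properties()
  consumerProps.put("bootstrap.servers", "broker:9092")
  consumerProps.put("group.id", "copy-if-match")
  consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val producerProps = new Properties()
  producerProps.put("bootstrap.servers", "broker:9092")
  producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val consumer = new KafkaConsumer[String, String](consumerProps)
  val producer = new KafkaProducer[String, String](producerProps)
  consumer.subscribe(Collections.singletonList("source-topic"))

  // placeholder rule: forward only the records whose value contains "important"
  def matchesRule(value: String): Boolean = value.contains("important")

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500)).asScala
    for (record <- records if matchesRule(record.value()))
      producer.send(new ProducerRecord[String, String]("destination-topic", record.key(), record.value()))
  }
}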
I have a Scala Spark application in which I need to switch between streaming from Kafka and Kinesis based on the application configuration.
Both the Spark APIs for Kafka streaming (spark-streaming-kafka-0-10_2.11) and Kinesis streaming (spark-streaming-kinesis-asl_2.11) return an InputDStream when creating a stream, but the element types are different.
Creating a Kafka stream returns InputDStream[ConsumerRecord[String, String]],
whereas creating a Kinesis stream returns InputDStream[Array[Byte]].
Is there any API that returns a generic InputDStream irrespective of Kafka or Kinesis, so that my stream processing can have a generic implementation instead of separate code for Kafka and Kinesis?
I tried assigning both streams to an InputDStream[Any], but that did not work.
I'd appreciate any ideas on how this can be done.
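For reference, a sketch (Scala) of the two creation calls and the differing element types described above; the broker, topic, Kinesis stream name, endpoint, region and batch interval are all placeholders, and the Kinesis side assumes the KinesisUtils.createStream overload from spark-streaming-kinesis-asl 2.x.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.kinesis.KinesisUtils

object KafkaOrKinesis extends App {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("kafka-or-kinesis").setMaster("local[*]"), Seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker:9092",
    "group.id" -> "kafka-or-kinesis",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer])

  // Kafka: the element type is ConsumerRecord[String, String]
  val kafkaStream: InputDStream[ConsumerRecord[String, String]] =
    KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("source-topic"), kafkaParams))

  // Kinesis: the element type is Array[Byte], so the two streams do not share one InputDStream type
  val kinesisStream: InputDStream[Array[Byte]] =
    KinesisUtils.createStream(
      ssc, "my-app", "my-stream", "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST, Seconds(5), StorageLevel.MEMORY_AND_DISK_2)
}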
I have just started getting my hands dirty with Kafka. I have gone through this. It only covers data/topic management for the Kafka Streams DSL. Can anyone share a link about the same sort of data management for the Processor API of Kafka Streams? I am especially interested in user and internal topic management for the Processor API.
TopologyBuilder builder = new TopologyBuilder();
// add the source processor node that takes Kafka topic "source-topic" as input
builder.addSource("Source", "source-topic");
How do I populate this source topic with input data before the stream processor starts consuming it?
In short, can we write to the Kafka "Source" topic using Streams, the way a producer writes to a topic? Or are Streams only for parallel consumption of a topic?
I believe we should be able to, as "Kafka's Streams API is built on top of Kafka's producer and consumer clients".
Yes, you have to use a KafkaProducer to generate inputs for the source topics that feed the KStream (a sketch follows below).
But the intermediate topics can be populated via
KStream#to
KStream#through
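As a rough sketch (Scala), seeding the source topic with a plain KafkaProducer might look like this; the broker address and the generated records are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SeedSourceTopic extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  // write a few placeholder records into the topic the streams application reads from
  (1 to 10).foreach { i =>
    producer.send(new ProducerRecord[String, String]("source-topic", s"key-$i", s"value-$i"))
  }
  producer.close()
}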
You can use JXL (Java Excel API) to write a producer that reads an Excel file and writes its rows to a Kafka topic.
Then create a Kafka Streams application to consume that topic and produce to another topic.
And you can use context.topic() to get the topic the processor is currently receiving from.
Then use a set of if statements inside the process() function to call the processing logic for that topic.
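A hedged sketch (Scala) of such a processor, written against the Processor API's AbstractProcessor; the topic names and the per-topic handlers are placeholders.

import org.apache.kafka.streams.processor.AbstractProcessor

// branches on the topic of the record currently being processed
class MultiTopicProcessor extends AbstractProcessor[String, String] {
  override def process(key: String, value: String): Unit = {
    // context() is populated by AbstractProcessor.init() once the topology is started
    context().topic() match {
      case "source-topic" => handleSource(key, value)
      case "other-topic"  => handleOther(key, value)
      case _              => // ignore records from topics we did not wire up
    }
  }

  // placeholder per-topic handlers
  private def handleSource(key: String, value: String): Unit = println(s"source: $key -> $value")
  private def handleOther(key: String, value: String): Unit = println(s"other: $key -> $value")
}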
I did a POC in which I read data from Kafka using Spark Streaming. But our organization uses either Apache Flink or a plain Kafka consumer to read data from Apache Kafka as a standard process, so I need to replace Spark Streaming with a Kafka consumer or Apache Flink. In my use case I need to read data from Kafka, filter the JSON data and put the fields in Cassandra, so the recommendation is to use a Kafka consumer rather than Flink or other streaming frameworks, since I don't really need to do any processing on the Kafka JSON data. I need your help to understand the questions below:
1. Using a Kafka consumer, can I achieve the same continuous data reading as with Spark Streaming or Flink?
2. Is a Kafka consumer sufficient for me, considering I need to read data from Kafka, deserialize it using an Avro schema, filter fields and put them in Cassandra?
3. A Kafka consumer application can be created using the Kafka consumer API, right?
4. Are there any downsides in my case if I just use a Kafka consumer instead of Apache Flink?
Firstly, let's take a look at Flink's Kafka connector and Spark Streaming's Kafka integration: both of them use the Kafka consumer API (either the simple API or the high-level API) internally to consume messages from Apache Kafka for their jobs.
So, regarding your questions:
1. Yes.
2. Yes. However, if you use Spark, you can consider using the Spark Cassandra Connector, which helps to save data into Cassandra efficiently.
3. Right.
4. As mentioned above, Flink also uses a Kafka consumer for its jobs. Moreover, it is a distributed stream and batch processing framework that helps to process data efficiently after consuming it from Kafka. In your case, to save data into Cassandra, you can consider using the Flink Cassandra Connector rather than coding it yourself.
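For the record, a hedged sketch (Scala) of what the plain-consumer route could look like; deserializeAvro, keep and writeToCassandra are placeholder hooks for whatever Avro and Cassandra libraries you already use, and the broker, group id and topic are made up.

import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object FilterToCassandra extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")
  props.put("group.id", "filter-to-cassandra")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

  // placeholder hooks -- wire in your Avro schema and Cassandra session here
  def deserializeAvro(bytes: Array[Byte]): Map[String, Any] = ???
  def keep(record: Map[String, Any]): Boolean = ???
  def writeToCassandra(fields: Map[String, Any]): Unit = ???

  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  consumer.subscribe(Collections.singletonList("json-topic"))

  while (true) {                    // continuous reading, much like a streaming job
    val records = consumer.poll(Duration.ofMillis(500)).asScala
    records.map(r => deserializeAvro(r.value()))
      .filter(keep)
      .foreach(writeToCassandra)
    consumer.commitSync()           // commit only after the batch has been written
  }
}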
I produce some messages first, and these messages are persisted on disk by Kafka's brokers. Then I start the Spark Streaming program to process that data, but I can't receive anything in Spark Streaming, and there is no error in the logs.
However, if I produce messages while the Spark Streaming program is running, it receives the data.
Can Spark Streaming only receive real-time data from Kafka?
To control what data is consumed when a new consumer stream starts, you should provide auto.offset.reset as part of the properties used to create the Kafka stream.
auto.offset.reset can take the following values:
earliest => the Kafka topic is consumed from the earliest offset available
latest => the Kafka topic is consumed starting at the current latest offset
Also note that depending on the Kafka consumer model you are using (receiver-based or direct), the behavior of a restarted Spark Streaming job will be different.
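As a short sketch (Scala, direct stream API), passing auto.offset.reset when creating the stream might look like this; the broker, group id, topic and batch interval are placeholders.

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object OffsetResetSketch extends App {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("offset-reset-sketch").setMaster("local[*]"), Seconds(5))

  val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker:9092",
    ConsumerConfig.GROUP_ID_CONFIG -> "offset-reset-sketch",
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    // "earliest" replays everything still on the brokers for a new group;
    // "latest" starts at the end and only sees records produced afterwards
    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest")

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("source-topic"), kafkaParams))

  stream.map(_.value).print()
  ssc.start()
  ssc.awaitTermination()
}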