Does pyspark support the spark-streaming-kafka-0-10 lib?

My Kafka cluster version is 0.10.0.0, and I want to use pyspark streaming to read Kafka data. But in the Spark Streaming + Kafka Integration Guide, http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html, there is no Python code example.
So, can pyspark use spark-streaming-kafka-0-10 to integrate with Kafka?
Thank you in advance for your help!

I also use Spark Streaming with a Kafka 0.10.0 cluster. After adding the following package to your Spark configuration (e.g. in spark-defaults.conf or via --packages), you are good to go.
spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0
And here is a sample in Python:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Initialize SparkContext
sc = SparkContext(appName="sampleKafka")
# Initialize the streaming context with a 10-second batch interval
batchInterval = 10
ssc = StreamingContext(sc, batchInterval)
# Kafka topic and the number of partitions to consume per receiver
topic = {"myTopic": 1}
# Consumer group id
groupId = "myTopic"
# ZooKeeper quorum used by the receiver-based stream
zkQuorum = "zookeeperhostname:2181"
# Create the Kafka stream
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)
# Do as you wish with your stream
# Start the stream and block until it terminates
ssc.start()
ssc.awaitTermination()

You can use spark-streaming-kafka-0-8 when your brokers are 0.10 or later. spark-streaming-kafka-0-8 supports newer broker versions, while spark-streaming-kafka-0-10 does not support older broker versions. As of now, spark-streaming-kafka-0-10 is still experimental and has no Python support.

Related

Apache Flink KafkaSource doesn't set group.id

I have a simple stream execution configured as:
val config: Configuration = new Configuration()
config.setString("taskmanager.memory.managed.size", "4g")
config.setString("parallelism.default", "4")
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(config)
env
.fromSource(KafkaSource.builder[String]
.setBootstrapServers("node1:9093,node2:9093,node3:9093")
.setTopics("example-topic")
//.setProperties(kafkaProps) // didn't work
.setProperty("security.protocol", "SASL_SSL")
.setProperty("sasl.mechanism", "GSSAPI")
.setProperty("sasl.kerberos.service.name", "kafka")
.setProperty("group.id","groupid-test")
//.setGroupId("groupid-test") // didn't work
.setStartingOffsets(OffsetsInitializer.earliest)
.setProperty("partition.discovery.interval.ms", "60000") // discover part
.setDeserializer(KafkaRecordDeserializationSchema.valueOnly(classOf[StringDeserializer]))
.build,
WatermarkStrategy.noWatermarks[String],
"example-input-topic"
)
.print
env.execute("asdasd")
My Flink version is 1.14.2.
My Kafka is running on Cloudera; the Kafka version is 2.2.1-cdh6.3.2.
I am able to consume records from Kafka, but it doesn't set the group.id for the topic. Does anyone have any ideas?
Since Flink 1.14.0, the group.id is an optional value. See https://issues.apache.org/jira/browse/FLINK-24051. You can set your own value if you want to specify one. You can see from the accompanying PR how this was previously set at https://github.com/apache/flink/pull/17052/files#diff-34b4ff8d43271eeac91ba17f29b13322f6e0ff3d15f71003a839aeb780fe30fbL56
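If you do want to pin the consumer group yourself, here is a minimal sketch against the same Flink 1.14 KafkaSource builder (the brokers and topic reuse the question's values; the SASL/Kerberos properties are omitted and a plain SimpleStringSchema is used purely to keep the example short):
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Security properties from the question are omitted here for brevity.
val source = KafkaSource.builder[String]()
  .setBootstrapServers("node1:9093,node2:9093,node3:9093")
  .setTopics("example-topic")
  .setGroupId("groupid-test")                        // explicitly pinned consumer group
  .setStartingOffsets(OffsetsInitializer.earliest())
  .setValueOnlyDeserializer(new SimpleStringSchema())
  .build()

env
  .fromSource(source, WatermarkStrategy.noWatermarks[String](), "example-input-topic")
  .print()

env.execute("group-id-example")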

Why does the Kafka consumer code freeze when I start the Spark stream?

I am new to Kafka and am trying to implement Kafka consumer logic in Spark 2. When I run all my code in the shell and start the streaming context, it shows nothing.
I have viewed many posts on Stack Overflow, but nothing helped me. I have even downloaded all the dependency jars from Maven and tried to run with them, but it still shows nothing.
Spark version: 2.2.0
Scala version: 2.11.8
The jars I downloaded are kafka-clients-2.2.0.jar and spark-streaming-kafka-0-10_2.11-2.2.0.jar, but I still face the same issue.
Please find the code snippet below:
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.kafka010.{KafkaUtils, ConsumerStrategies, LocationStrategies}
val brokers = "host1:port, host2:port"
val groupid = "default"
val topics = "kafka_sample"
val topicset = topics.split(",").toSet
val ssc = new StreamingContext(sc, Seconds(2))
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupid,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)
val msg = KafkaUtils.createDirectStream[String, String](
ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](topicset, kafkaParams)
)
msg.foreachRDD{
rdd => rdd.collect().foreach(println)
}
ssc.start()
I am expecting Spark Streaming to start, but it doesn't do anything. What mistake have I made here? Or is this a known issue?
The driver will sit idle unless you call ssc.awaitTermination() at the end. If you're using spark-shell, it's not a good tool for streaming jobs.
Please use interactive tools like Zeppelin or Spark Notebook for interacting with streaming, or try building your app as a jar file and then deploying it.
Also, if you're trying out Spark streaming, Structured Streaming would be better, as it is quite easy to work with:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
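For example, a minimal Structured Streaming sketch of the same consumer (assuming Spark 2.2 with the spark-sql-kafka-0-10 package on the classpath; brokers and topic are the placeholders from the question) could look like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafkaStructured").getOrCreate()

// Read the Kafka topic as a streaming DataFrame
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port,host2:port")
  .option("subscribe", "kafka_sample")
  .load()

// Kafka values arrive as binary, so cast them to strings before printing
val query = df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()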
After ssc.start(), use ssc.awaitTermination() in your code.
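In terms of the snippet above, that just means ending it with:
ssc.start()
ssc.awaitTermination() // keep the driver alive so the streaming job keeps running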
For testing, write your code in a main object and run it in an IDE like IntelliJ.
You can use the command shell to publish messages with the Kafka console producer.
I have written all these steps in a simple example in a blog post with working code on GitHub. Please refer to: http://softwaredevelopercentral.blogspot.com/2018/10/spark-streaming-and-kafka-integration.html

Offset Management For Apache Kafka With Apache Spark Batch

I'm writing a Spark (v2.2) batch job which reads from a Kafka topic. The Spark jobs are scheduled with cron.
I can't use Spark Structured Streaming because non-time-based windows are not supported.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "kafka_topic")
  .load()
I need to track the offsets for the Kafka topic so that the next batch job knows where to start reading. How can I do that?
I guess you are using KafkaUtils to create the stream; you can pass the starting offsets as a parameter.
val inputDStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets)
)
Hoping this helps!
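Since the question actually uses the DataFrame Kafka source rather than DStreams, another option is to drive the batch boundaries with the source's startingOffsets / endingOffsets options. A hedged sketch (the offset JSON and its storage are placeholders you would replace with your own bookkeeping):
import org.apache.spark.sql.SparkSession

// Requires the spark-sql-kafka-0-10 package on the classpath.
val spark = SparkSession.builder.appName("kafkaBatchOffsets").getOrCreate()

// Offsets persisted by the previous run, per topic/partition (inclusive start).
// In practice you would load this JSON from your own store (a file, a DB, ...).
val startOffsets = """{"kafka_topic":{"0":42,"1":17}}"""

val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "kafka_topic")
  .option("startingOffsets", startOffsets)  // where this batch begins
  .option("endingOffsets", "latest")        // read up to the current end of each partition
  .load()

// The source exposes "partition" and "offset" columns, so after processing you can
// compute max("offset") + 1 per partition and persist it for the next cron run.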

How can Flink read only the newest data from Kafka

Now, in my scenario, Flink should read only the newest data from Kafka every time.
For example, the Kafka producer sends:
log1
log2
log3
When reading, only log3 is needed.
In the Kafka consumer API, seekToEnd() can do this.
Does FlinkKafkaConsumer have the same function?
Flink 1.3 has this function.
FlinkKafkaConsumer09 flinkKafkaConsumer09 = new FlinkKafkaConsumer09<>(
    properties.getProperty("topic"),
    new RowDeserializationSchema(properties.getProperty("separator"), resultType),
    properties);
flinkKafkaConsumer09.setStartFromLatest();
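For completeness, a hedged Scala sketch of the same idea with a plain string schema (the topic, brokers, and group id are placeholders):
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "myGroup")

val consumer = new FlinkKafkaConsumer09[String]("myTopic", new SimpleStringSchema(), properties)
// Ignore committed offsets and start from the latest record in every partition
consumer.setStartFromLatest()

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.addSource(consumer).print()
env.execute("read-latest-only")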

Flink with Kafka Consumer doesn't work

I want to benchmark Spark vs. Flink, and for this purpose I am running several tests. However, Flink doesn't work with Kafka, while with Spark it works perfectly.
The code is very simple:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "myGroup")
println("topic: "+args(0))
val stream = env.addSource(new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
stream.print
env.execute()
I use Kafka 0.9.0.0 with the same topics (in the consumer [Flink] and the producer [Kafka console]), but when I send my jar to the cluster, nothing happens:
[Flink cluster screenshot]
What could be happening?
Your stream.print will not print to the console on Flink. It will write to flink0.9/logs/recentlog. Otherwise, you can add your own logger to confirm the output.
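For example, a sketch of such a logger (assuming the stream from the snippet above and slf4j on the classpath; the logging map is purely illustrative):
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.slf4j.{Logger, LoggerFactory}

// Logs every record it sees; the messages end up in the taskmanager log
class LoggingMap extends RichMapFunction[String, String] {
  @transient private var log: Logger = _
  override def open(parameters: Configuration): Unit = {
    log = LoggerFactory.getLogger(classOf[LoggingMap])
  }
  override def map(value: String): String = {
    log.info("received from Kafka: {}", value)
    value
  }
}

// stream is the DataStream[String] built from FlinkKafkaConsumer09 above
stream.map(new LoggingMap).print()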
For this particular case (a source chained into a sink), the web interface will never report bytes/records sent/received. Note that this will change in the somewhat near future.
Please check whether the job-/taskmanager logs contain the output.