flink read data from kafka - apache-kafka

I wrote a simple example:
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers","xxxxxx")
properties.setProperty("zookeeper.connect","xxxxxx")
properties.setProperty("group.id", "caffrey")
val stream = env
  .addSource(new FlinkKafkaConsumer082[String]("topic", new SimpleStringSchema(), properties))
  .print()
env.execute("Flink Kafka Example")
When I run this code I get an error like this:
[error] Class
org.apache.flink.streaming.api.checkpoint.CheckpointNotifier not found
- continuing with a stub.
I googled this error and found that CheckpointNotifier is an interface.
I really don't understand what I did wrong.

Since CheckpointNotifier is a class from an older Flink version, I suspect that you are mixing different Flink dependencies in your pom file.
Make sure all Flink dependencies have the same version.
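The same idea applies regardless of build tool; for illustration, a minimal sbt sketch that pins every Flink artifact to a single version could look like this (the version number and connector artifact are placeholders, pick the ones matching your setup):
// build.sbt -- illustrative only: keep one version value for all Flink artifacts
val flinkVersion = "1.1.4"
libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala"               % flinkVersion,
  "org.apache.flink" %% "flink-streaming-scala"     % flinkVersion,
  "org.apache.flink" %% "flink-connector-kafka-0.8" % flinkVersion
)
In Maven, the equivalent is declaring a single property (e.g. flink.version) and referencing it from every Flink dependency.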

Related

Why does the kafka consumer code freeze when I start spark stream?

I am new to Kafka and trying to implement Kafka consumer logic in Spark 2, and when I run all my code in the shell and start the streaming, it shows nothing.
I have viewed many posts on StackOverflow but nothing helped me. I have even downloaded all the dependency jars from Maven and tried to run it, but it still shows nothing.
Spark Version: 2.2.0
Scala version 2.11.8
The jars I downloaded are kafka-clients-2.2.0.jar and spark-streaming-kafka-0-10_2.11-2.2.0.jar,
but I still face the same issue.
Please find the code snippet below:
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.kafka010.{KafkaUtils, ConsumerStrategies, LocationStrategies}
val brokers = "host1:port, host2:port"
val groupid = "default"
val topics = "kafka_sample"
val topicset = topics.split(",").toSet
val ssc = new StreamingContext(sc, Seconds(2))
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
  ConsumerConfig.GROUP_ID_CONFIG -> groupid,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)
val msg = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](topicset, kafkaParams)
)
msg.foreachRDD { rdd =>
  rdd.collect().foreach(println)
}
ssc.start()
I am expecting Spark Streaming to start, but it doesn't do anything. What mistake have I made here? Or is this a known issue?
The driver will sit idle unless you call ssc.awaitTermination() at the end. If you're using spark-shell, it's not a good tool for streaming jobs.
Please use interactive tools like Zeppelin or Spark Notebook for interacting with streaming, or build your app as a jar file and then deploy it.
Also, if you're trying out Spark Streaming, Structured Streaming would be better as it is quite easy to play with.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
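For reference, a minimal Structured Streaming sketch of the same pipeline might look like this (broker list and topic are taken from the question; it assumes the spark-sql-kafka-0-10 package is on the classpath):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaStructuredStreaming").getOrCreate()

// Kafka source: key and value arrive as binary columns
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port, host2:port")
  .option("subscribe", "kafka_sample")
  .load()

// Decode the value and stream it to the console
val query = df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()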
After ssc.start(), use ssc.awaitTermination() in your code.
For testing, write your code in a main object and run it in an IDE like IntelliJ.
You can use the command shell and publish messages from the Kafka producer.
I have written all these steps in a simple example in a blog post, with working code on GitHub. Please refer to: http://softwaredevelopercentral.blogspot.com/2018/10/spark-streaming-and-kafka-integration.html
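Putting the two points together, a minimal sketch of the corrected driver (the object name is hypothetical; broker, group id and topic are the ones from the question) would be:
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaStreamingMain {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaStreamingMain").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(2))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "host1:port, host2:port",
      ConsumerConfig.GROUP_ID_CONFIG -> "default",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    val msg = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("kafka_sample"), kafkaParams)
    )

    msg.foreachRDD(rdd => rdd.collect().foreach(println))

    ssc.start()
    // Without this the driver returns (or sits idle in the shell) before any batch is processed
    ssc.awaitTermination()
  }
}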

How to dockerize and write a Dockerfile for a Scala-written Kafka producer

I am new to Docker and have been trying to understand how to write a Dockerfile to create my custom image. My Scala class produces messages to a topic continuously. I want to reproduce the same functionality with Docker. Can someone help me with the Dockerfile?
I have tried using sbt docker:publishLocal; it creates the image, but when I try to run the image it says it is unable to find the class. I am specifically looking to run it using a Dockerfile.
Here is the code, which is working in IntelliJ:
import java.util.Properties
import org.apache.kafka.clients.producer._
object Scala_producer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  val TOPIC = "tt"
  println(producer.partitionsFor(TOPIC))
  while (true) {
    val record = new ProducerRecord(TOPIC, "key", "hello ")
    producer.send(record)
    println("producing")
  }
  producer.close()
}
I expect to run the Docker image and see messages being produced indefinitely.
If this is the only issue, then add the main class to your build.sbt:
mainClass in Compile := Some("com.example.Scala_producer")
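For completeness, a rough build.sbt sketch of the settings that make sbt docker:publishLocal produce a runnable image (it assumes the sbt-native-packager plugin is already added in project/plugins.sbt, which is what provides docker:publishLocal; names and versions are illustrative):
enablePlugins(JavaAppPackaging, DockerPlugin)

name := "scala-producer"
version := "0.1"
// The posted object has no package, so the fully qualified name is just "Scala_producer";
// adjust this if you move it into a package such as com.example
mainClass in Compile := Some("Scala_producer")

libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.0.0"
After sbt docker:publishLocal you can start the container with docker run on the published image (its name and tag follow the name and version settings). Also note that inside a container localhost:9092 refers to the container itself, so the bootstrap server usually has to point at the host or another reachable broker.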

Read from Kafka into Flink Scala Shell

I'm attempting to connect to and read from Kafka (2.1) on my local machine, in the scala-shell that comes with Flink (1.7.2).
Here's what I'm doing:
:require flink-connector-kafka_2.11-1.7.1.jar
:require flink-connector-kafka-base_2.11-1.7.1.jar
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase
import java.util.Properties
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")
var stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
After the last statement, I'm getting the following error:
scala> var stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
<console>:69: error: overloaded method value addSource with alternatives:
[T](function: org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext[T] => Unit)(implicit evidence$10: org.apache.flink.api.common.typeinfo.TypeInformation[T])org.apache.flink.streaming.api.scala.DataStream[T] <and>
[T](function: org.apache.flink.streaming.api.functions.source.SourceFunction[T])(implicit evidence$9: org.apache.flink.api.common.typeinfo.TypeInformation[T])org.apache.flink.streaming.api.scala.DataStream[T]
cannot be applied to (org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer[String])
var stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
I have created a topic named "topic" and I'm able to produce to it and read messages from it correctly through another client. I'm using Java version 1.8.0_201 and following the instructions from https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html.
Any help on what could be going wrong?
Some dependencies need other dependencies implicitly. We usually use a dependency manager like Maven or sbt, and when we add a dependency to the project, the manager provides its transitive dependencies in the background.
On the other hand, when you use a shell, where there is no dependency manager, you are responsible for providing your code's dependencies yourself. Using the Flink Kafka connector explicitly requires the Flink Connector Kafka jar, but note that Flink Connector Kafka has dependencies of its own. You can find its dependencies at the bottom of its page, in the section Compile Dependencies. With this preface, I added the following jar files to the directory FLINK_HOME/lib (the Flink classpath):
flink-connector-kafka-0.11_2.11-1.4.2.jar
flink-connector-kafka-0.10_2.11-1.4.2.jar
flink-connector-kafka-0.9_2.11-1.4.2.jar
flink-connector-kafka-base_2.11-1.4.2.jar
flink-core-1.4.2.jar
kafka_2.11-2.1.1.jar
kafka-clients-2.1.0.jar
and I could successfully consume Kafka messages using the following code in the Flink shell:
scala> import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
scala> import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
scala> import java.util.Properties
import java.util.Properties
scala> val properties = new Properties()
properties: java.util.Properties = {}
scala> properties.setProperty("bootstrap.servers", "localhost:9092")
res0: Object = null
scala> properties.setProperty("group.id", "test")
res1: Object = null
scala> val stream = senv.addSource(new FlinkKafkaConsumer011[String]("topic", new SimpleStringSchema(), properties)).print()
warning: there was one deprecation warning; re-run with -deprecation for details
stream: org.apache.flink.streaming.api.datastream.DataStreamSink[String] = org.apache.flink.streaming.api.datastream.DataStreamSink@71de1091
scala> senv.execute("Kafka Consumer Test")
Submitting job with JobID: 23e3bb3466d914a2747ae5fed293a076. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@localhost:40093/user/jobmanager#1760995711] with leader session id 00000000-0000-0000-0000-000000000000.
03/11/2019 21:42:39 Job execution switched to status RUNNING.
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to SCHEDULED
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to SCHEDULED
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to DEPLOYING
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to DEPLOYING
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to RUNNING
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to RUNNING
hello
hello
In addition, another way to add jar files to the Flink classpath is to pass the jars as arguments to the Flink shell start command:
bin/start-scala-shell.sh local "--addclasspath <path/to/jar.jar>"
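For example, with one of the connector jars listed above (the path is illustrative):
bin/start-scala-shell.sh local "--addclasspath /opt/flink/lib/flink-connector-kafka-0.11_2.11-1.4.2.jar"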
Test environment:
Flink 1.4.2
Kafka 2.1.0
Java 1.8 201
Scala 2.11
Most probably you should import Flink's Scala implicits before adding a source:
import org.apache.flink.streaming.api.scala._
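A minimal sketch of the shell session with that import in place (after the two :require lines from the question, with the topic and properties unchanged):
import org.apache.flink.streaming.api.scala._   // brings the implicit TypeInformation values into scope
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import java.util.Properties

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")

// With the implicits in scope, addSource can resolve the overload that needs TypeInformation[String]
val stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
senv.execute("Kafka Consumer Test")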

Flink - producing to kinesis not working

I'm trying to run a simple program which reads from one Kinesis stream, does a trivial transformation, and writes the result to another Kinesis stream.
Running locally on Flink 1.4.0 (this is the version supported on EMR currently, so no way of upgrading).
Here is the code:
def main(args: Array[String]) {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val consumerConfig = new Properties()
  consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")
  val kinesisMaps = env.addSource(new FlinkKinesisConsumer[String](
    "source-stream", new SimpleStringSchema, consumerConfig))
  val jsonMaps = kinesisMaps.map { jsonStr => JSON.parseFull(jsonStr).get.asInstanceOf[Map[String, String]] }
  val values = jsonMaps.map(jsonMap => jsonMap("field_name"))
  values.print()
  val producerConfig = new Properties()
  producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
  val kinesisProducer = new FlinkKinesisProducer[String](new SimpleStringSchema, producerConfig)
  kinesisProducer.setFailOnError(true)
  kinesisProducer.setDefaultStream("target-stream")
  kinesisProducer.setDefaultPartition("0")
  values.addSink(kinesisProducer)
  // execute program
  env.execute("Flink Kinesis")
}
If I comment out the producing code, the program runs as expected and prints the correct values.
As soon as I add the producer code, I get the following exception:
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.invoke(FlinkKinesisProducer.java:248)
at org.apache.flink.streaming.api.functions.sink.SinkFunction.invoke(SinkFunction.java:52)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:608)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:569)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collectWithTimestamp(StreamSourceContexts.java:111)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.emitRecordAndUpdateState(KinesisDataFetcher.java:486)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.deserializeRecordForCollectionAndUpdateState(ShardConsumer.java:264)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any idea what's the cause of this?
Apparently, this is an issue with the old version of the Amazon KPL (Kinesis Producer Library) used in Flink 1.4.
There are at least two possible solutions for this:
1. Upgrade to Flink version 1.5. You can still use it on EMR if you install it as described here, under the section Custom EMR Installation:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/aws.html
2. When building the Kinesis connector for Flink 1.4, you can build it with newer AWS dependencies: I've cherry-picked the AWS dependency changes in the connector's pom.xml from 1.5 and built the connector with them. It looks like it's working as expected.

Flink with Kafka Consumer doesn't work

I want to benchmark Spark vs. Flink, and for this purpose I am running several tests. However, Flink doesn't work with Kafka, while with Spark it works perfectly.
The code is very simple:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "myGroup")
println("topic: "+args(0))
val stream = env.addSource(new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
stream.print
env.execute()
I use Kafka 0.9.0.0 with the same topic (in the consumer [Flink] and the producer [Kafka console]), but when I send my jar to the cluster, nothing happens:
[Image: Cluster Flink]
What could be happening?
Your stream.print will not print to the console on Flink. It will write to flink0.9/logs/recentlog. Otherwise, you can add your own logger to confirm the output.
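If you do want an explicit confirmation in the logs, a minimal sketch (the mapper name is hypothetical) is a pass-through map that logs every record before the print sink; the messages then show up in the taskmanager logs:
import org.apache.flink.api.common.functions.RichMapFunction
import org.slf4j.{Logger, LoggerFactory}

// Pass-through mapper that logs each record it sees
class LoggingMapper extends RichMapFunction[String, String] {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(classOf[LoggingMapper])
  override def map(value: String): String = {
    log.info("received: {}", value)
    value
  }
}

// Insert it between the Kafka source and the sink
stream.map(new LoggingMapper).print()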
For this particular case (a Source chained into a Sink) the web interface will never report bytes/records sent/received. Note that this will change in the somewhat near future.
Please check whether the job-/taskmanager logs contain any output.