I am trying to run a Flink job, as shown below, that reads data from Apache Kafka and prints it:
Java Program
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "test.net:9092");
properties.setProperty("group.id", "flink_consumer");
properties.setProperty("zookeeper.connect", "dev.com:2181,dev2.com:2181,dev.com:2181/dev2");
properties.setProperty("topic", "topic_name");
DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer082<>("topic_name", new SimpleStringSchema(), properties));
messageStream.rebalance().map(new MapFunction<String, String>() {
    private static final long serialVersionUID = -6867736771747690202L;

    public String map(String value) throws Exception {
        return "Kafka and Flink says: " + value;
    }
}).print();
env.execute();
Scala Code
var properties = new Properties();
properties.setProperty("bootstrap.servers", "msg01.staging.bigdata.sv2.247-inc.net:9092");
properties.setProperty("group.id", "flink_consumer");
properties.setProperty("zookeeper.connect", "host33.dev.swamp.sv2.tellme.com:2181,host37.dev.swamp.sv2.tellme.com:2181,host38.dev.swamp.sv2.tellme.com:2181/staging_sv2");
properties.setProperty("topic", "sv2.staging.rtdp.idm.events.omnichannel");
var env = StreamExecutionEnvironment.getExecutionEnvironment();
var stream:DataStream[(String)] = env
.addSource(new FlinkKafkaConsumer082[String]("sv2.staging.rtdp.idm.events.omnichannel", new SimpleStringSchema(), properties));
stream.print();
env.execute();
Whenever I run this app in Eclipse, I see the following output to start with:
03/27/2017 20:06:19 Job execution switched to status RUNNING.
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(1/4) switched to SCHEDULED
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(1/4) switched to DEPLOYING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(2/4) switched to SCHEDULED
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(2/4) switched to DEPLOYING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(3/4) switched to SCHEDULED
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(3/4) switched to DEPLOYING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(4/4) switched to SCHEDULED
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(4/4) switched to DEPLOYING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(4/4) switched to RUNNING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(2/4) switched to RUNNING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(1/4) switched to RUNNING
03/27/2017 20:06:19 Source: Custom Source -> Sink: Unnamed(3/4) switched to RUNNING
The questions I have are:
1) Why am I seeing 4 instances of the sink in all the states (SCHEDULED, DEPLOYING, and RUNNING)?
2) Every line received from Apache Kafka is printed here multiple times, mostly 4 times. What is the reason?
Ideally I want to read each line only once and do further processing with it. Any input/help will be appreciated!
If you run the program in the LocalStreamEnvironment (which you get when you call StreamExecutionEnvironment.getExecutionEnvironment() in an IDE), the default parallelism of all operators is equal to the number of CPU cores.
So in your example each operator is parallelized into four subtasks. In the log you see a message for each of these four subtasks (3/4 indicates that this is the third of four tasks in total).
You can control the number of subtasks by calling StreamExecutionEnvironment.setParallelism(int) or by calling setParallelism(int) on each individual operator.
Given your program, the Kafka records should not be replicated. Each record should only be printed once. However, since the records are written in parallel, each line of output is prefixed by x>, where x indicates the id of the parallel subtask that emitted the line.
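For example, here is a minimal Scala sketch (the same calls exist in the Java API) that pins the whole job to a single subtask, so each element is printed exactly once and without the x> prefix. It uses a stand-in source instead of Kafka just to illustrate the parallelism setting:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1) // every operator in this job now runs as a single subtask

// stand-in source instead of the Kafka consumer, only to show the effect of parallelism 1
env.fromElements("a", "b", "c")
  .map(value => "Kafka and Flink says: " + value)
  .print() // alternatively, limit only the sink: .print().setParallelism(1)

env.execute("parallelism-example")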
Related
I have a simple stream execution configured as:
val config: Configuration = new Configuration()
config.setString("taskmanager.memory.managed.size", "4g")
config.setString("parallelism.default", "4")
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(config)
env
.fromSource(KafkaSource.builder[String]
.setBootstrapServers("node1:9093,node2:9093,node3:9093")
.setTopics("example-topic")
//.setProperties(kafkaProps) // didn't work
.setProperty("security.protocol", "SASL_SSL")
.setProperty("sasl.mechanism", "GSSAPI")
.setProperty("sasl.kerberos.service.name", "kafka")
.setProperty("group.id","groupid-test")
//.setGroupId("groupid-test") // didn't work
.setStartingOffsets(OffsetsInitializer.earliest)
.setProperty("partition.discovery.interval.ms", "60000") // discover part
.setDeserializer(KafkaRecordDeserializationSchema.valueOnly(classOf[StringDeserializer]))
.build,
WatermarkStrategy.noWatermarks[String],
"example-input-topic"
)
.print
env.execute("asdasd")
My Flink version is 1.14.2.
My Kafka is running on Cloudera. Kafka version: 2.2.1-cdh6.3.2.
I am able to consume records from Kafka, but it doesn't set the group id for the topic. Does anyone have any ideas?
Since Flink 1.14.0, the group.id is an optional value. See https://issues.apache.org/jira/browse/FLINK-24051. You can set your own value if you want to specify one. You can see from the accompanying PR how this was previously set at https://github.com/apache/flink/pull/17052/files#diff-34b4ff8d43271eeac91ba17f29b13322f6e0ff3d15f71003a839aeb780fe30fbL56
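For example, a minimal sketch using the same KafkaSource builder API as in the question (broker and topic names are placeholders), explicitly specifying a group id:

import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema
import org.apache.kafka.common.serialization.StringDeserializer

// Placeholder broker/topic names; since FLINK-24051 the group id is optional,
// but it can still be set explicitly on the builder.
val source = KafkaSource.builder[String]
  .setBootstrapServers("node1:9093")
  .setTopics("example-topic")
  .setGroupId("groupid-test")
  .setStartingOffsets(OffsetsInitializer.earliest)
  .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(classOf[StringDeserializer]))
  .build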
Versions: Flink 1.11.3, Kafka 2.1.1.
My Flink data pipeline is Kafka (source) -> Flink -> Kafka (sink).
When I first submit the job, it works well,
but after the jobmanager or taskmanagers fail and are restarted, they throw the following exceptions:
2020-12-31 10:35:23.831 [objectOperator -> Sink: objectSink (1/1)] WARN o.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer - Encountered error org.apache.kafka.common.errors.InvalidTxnStateException: The producer attempted a transactional operation in an invalid state. while recovering transaction KafkaTransactionState [transactionalId=objectOperator -> Sink: objectSink-bcabd9b643c47ab46ace22db2e1285b6-3, producerId=14698, epoch=7]. Presumably this transaction has been already committed before
2020-12-31 10:35:23.919 [userOperator -> Sink: userSink (1/1)] WARN org.apache.flink.runtime.taskmanager.Task - userOperator -> Sink: userSink (1/1) (2a5a171aa335f444740b4acfc7688d7c) switched from RUNNING to FAILED.
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
2020-12-31 10:35:24.131 [objectOperator -> Sink: objectSink (1/1)] WARN org.apache.flink.runtime.taskmanager.Task - objectOperator -> Sink: objectSink (1/1) (07fe747a81b31e016e88ea6331b31433) switched from RUNNING to FAILED.
org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default producerId at version 1
I don't know why this error occurs.
My Kafka producer code:
Properties props = new Properties();
props.setProperty("bootstrap.servers", servers);
props.setProperty("transaction.timeout.ms", "30000");
FlinkKafkaProducer<CountModel> producer = new FlinkKafkaProducer<CountModel>(
        topic,
        (record, timestamp) -> new ProducerRecord<>(
                topic,
                Longs.toByteArray(record.getUserInKey()),
                JsonUtils.toJsonBytes(record)),
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
I don't think it's a version issue.
It seems that no one else has experienced the same error.
Each producer is assigned a unique PID when it is initialized. This PID is transparent to the application and is not exposed to the user at all. For a given PID, the sequence number increases from 0, and each topic-partition has an independent sequence number. When the producer sends data, it attaches a sequence number to each message, and the broker uses it to verify whether the data is duplicated. The PID is globally unique, and a new PID is assigned after a producer is restarted following a failure. This is also one of the reasons why idempotence cannot be achieved across sessions.
If you resume from a savepoint, the previous producerId is used, while a new session generates 1000 new producerIds (these ids run through the entire session, equivalent to the default value), so the producerId ends up being non-default.
I'm attempting to connect to and read from Kafka (2.1) on my local machine, in the scala-shell that comes with Flink (1.7.2).
Here's what I'm doing:
:require flink-connector-kafka_2.11-1.7.1.jar
:require flink-connector-kafka-base_2.11-1.7.1.jar
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase
import java.util.Properties
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")
var stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
After the last statement, I'm getting the following error:
scala> var stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
<console>:69: error: overloaded method value addSource with alternatives:
[T](function: org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext[T] => Unit)(implicit evidence$10: org.apache.flink.api.common.typeinfo.TypeInformation[T])org.apache.flink.streaming.api.scala.DataStream[T] <and>
[T](function: org.apache.flink.streaming.api.functions.source.SourceFunction[T])(implicit evidence$9: org.apache.flink.api.common.typeinfo.TypeInformation[T])org.apache.flink.streaming.api.scala.DataStream[T]
cannot be applied to (org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer[String])
var stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()
I have created a topic named "topic" and I'm able to produce and read messages from it correctly through another client. I'm using Java version 1.8.0_201 and following the instructions from https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html .
Any help on what could be going wrong?
Some dependencies need other dependencies, implicitly. We usually use a dependency manager like Maven or sbt, and when we add dependencies to a project, the dependency manager provides their transitive dependencies in the background.
On the other hand, when you use a shell, where there is no dependency manager, you are responsible for providing your code's dependencies yourself. Using the Flink Kafka connector explicitly requires the Flink Connector Kafka jar, but you should note that Flink Connector Kafka has dependencies of its own. You can find its dependencies at the bottom of its Maven page, in the Compile Dependencies section. With that preface, I added the following jar files to the directory FLINK_HOME/lib (the Flink classpath):
flink-connector-kafka-0.11_2.11-1.4.2.jar
flink-connector-kafka-0.10_2.11-1.4.2.jar
flink-connector-kafka-0.9_2.11-1.4.2.jar
flink-connector-kafka-base_2.11-1.4.2.jar
flink-core-1.4.2.jar
kafka_2.11-2.1.1.jar
kafka-clients-2.1.0.jar
and I could successfully consume Kafka messages using the following code in the Flink shell:
scala> import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
scala> import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
scala> import java.util.Properties
import java.util.Properties
scala> val properties = new Properties()
properties: java.util.Properties = {}
scala> properties.setProperty("bootstrap.servers", "localhost:9092")
res0: Object = null
scala> properties.setProperty("group.id", "test")
res1: Object = null
scala> val stream = senv.addSource(new FlinkKafkaConsumer011[String]("topic", new SimpleStringSchema(), properties)).print()
warning: there was one deprecation warning; re-run with -deprecation for details
stream: org.apache.flink.streaming.api.datastream.DataStreamSink[String] = org.apache.flink.streaming.api.datastream.DataStreamSink@71de1091
scala> senv.execute("Kafka Consumer Test")
Submitting job with JobID: 23e3bb3466d914a2747ae5fed293a076. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@localhost:40093/user/jobmanager#1760995711] with leader session id 00000000-0000-0000-0000-000000000000.
03/11/2019 21:42:39 Job execution switched to status RUNNING.
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to SCHEDULED
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to SCHEDULED
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to DEPLOYING
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to DEPLOYING
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to RUNNING
03/11/2019 21:42:39 Source: Custom Source -> Sink: Unnamed(1/1) switched to RUNNING
hello
hello
In addition, another way to add jar files to the Flink classpath is to pass the jars as arguments to the Flink shell start command:
bin/start-scala-shell.sh local "--addclasspath <path/to/jar.jar>"
Test environment:
Flink 1.4.2
Kafka 2.1.0
Java 1.8 201
Scala 2.11
Most probably you should import Flink's Scala implicits before adding a source:
import org.apache.flink.streaming.api.scala._
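If the missing implicits are indeed the cause, the shell session would look roughly like this (same topic and properties as in the question; senv is the StreamExecutionEnvironment bound by the Flink Scala shell):

import org.apache.flink.streaming.api.scala._ // brings the implicit TypeInformation evidence into scope
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import java.util.Properties

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")

// With the implicits in scope, addSource can resolve TypeInformation[String]
// and accepts the FlinkKafkaConsumer as a SourceFunction[String].
val stream = senv.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)).print()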
I am using:
flink 1.1.2
Kafka 2.10-0.10.0.1
flink-connector-kafka-0.9.2.10-1.0.0
I am using the following very simple/basic app:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:33334");
properties.setProperty("partition.assignment.strategy", "org.apache.kafka.clients.consumer.RangeAssignor");
properties.setProperty("group.id", "test");
String topic = "mytopic";
FlinkKafkaConsumer09<String> fkc =
new FlinkKafkaConsumer09<String>(topic, new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(fkc);
env.execute();
After compiling it using Maven, I try to run it using the following command:
bin/flink run -c com.mycompany.app.App fkaf/target/fkaf-1.0-SNAPSHOT.jar
I see the following runtime error:
Submitting job with JobID: f6e290ec7c28f66d527eaa5286c00f4d. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@127.0.0.1:6123/user/jobmanager#-1679485245]
10/12/2016 15:10:06 Job execution switched to status RUNNING.
10/12/2016 15:10:06 Source: Custom Source(1/1) switched to SCHEDULED
10/12/2016 15:10:06 Source: Custom Source(1/1) switched to DEPLOYING
10/12/2016 15:10:06 Map -> Sink: Unnamed(1/1) switched to SCHEDULED
10/12/2016 15:10:06 Map -> Sink: Unnamed(1/1) switched to DEPLOYING
10/12/2016 15:10:06 Source: Custom Source(1/1) switched to RUNNING
10/12/2016 15:10:06 Map -> Sink: Unnamed(1/1) switched to RUNNING
10/12/2016 15:10:06 Map -> Sink: Unnamed(1/1) switched to CANCELED
10/12/2016 15:10:06 Source: Custom Source(1/1) switched to FAILED
java.lang.NoSuchMethodError: org.apache.kafka.clients.consumer.KafkaConsumer.assign(Ljava/util/List;)V
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09.open(FlinkKafkaConsumer09.java:282)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:91)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:376)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:256)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:722)
Any idea why the method assign() is not being found? The method is there in lib/kafka-clients-0.10.0.1.jar.
ParameterTool parameterTool = ParameterTool.fromArgs(args);
DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer09<String>(parameterTool.getRequired("topic"), new SimpleStringSchema(), parameterTool.getProperties()));
// print() will write the contents of the stream to the TaskManager's standard out stream
// the rebalance call causes a repartitioning of the data so that all machines
// see the messages (for example in cases when "num kafka partitions" < "num flink operators")
messageStream.rebalance().map(new MapFunction<String, String>() {
    private static final long serialVersionUID = -6867736771747690202L;

    @Override
    public String map(String value) throws Exception {
        return "Kafka and Flink says: " + value;
    }
}).print();
env.execute();
A NoSuchMethodError indicates a version mismatch.
I would guess the issue is that you are trying to connect a Kafka 0.9 consumer to a Kafka 0.10 instance. Flink 1.1.x does not provide a Kafka 0.10 consumer. However, a 0.10 consumer will be included in the upcoming 1.2.0 release.
You could try to build the Kafka 0.10 consumer yourself from the current master branch (1.2-SNAPSHOT) and use that one with Flink 1.1.2. The corresponding Flink APIs should be stable and backwards compatible from 1.2 to 1.1.
I want to benchmark Spark vs. Flink, and for this purpose I am running several tests. However, Flink doesn't work with Kafka, while with Spark it works perfectly.
The code is very simple:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "myGroup")
println("topic: "+args(0))
val stream = env.addSource(new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
stream.print
env.execute()
I use Kafka 0.9.0.0 with the same topic (in the consumer [Flink] and the producer [Kafka console]), but when I send my jar to the cluster, nothing happens:
Cluster Flink
What could be happening?
Your stream.print will not print to the console on Flink. It will write to flink0.9/logs/recentlog. Otherwise, you can add your own logger to confirm the output.
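For example, a rough sketch (the class name is illustrative) of logging each record through SLF4J so it ends up in the TaskManager log:

import org.apache.flink.api.common.functions.MapFunction
import org.slf4j.LoggerFactory

// Illustrative helper: logs every record before passing it on, so output shows up
// in the TaskManager log even when stdout is not visible.
class LoggingMapper extends MapFunction[String, String] {
  @transient private lazy val log = LoggerFactory.getLogger(classOf[LoggingMapper])

  override def map(value: String): String = {
    log.info("Consumed from Kafka: {}", value)
    value
  }
}

// usage: stream.map(new LoggingMapper).print()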
For this particular case (a Source chained into a Sink) the web interface will never report bytes/records sent/received. Note that this will change in the somewhat near future.
Please check whether the jobmanager/taskmanager logs contain any output.
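As a side note (not part of the original answer, just a hedged sketch): if you want the web interface to show records flowing at all, one option is to break the source-to-sink chain so the two run as separate tasks. This assumes a Flink version where StreamExecutionEnvironment.disableOperatorChaining() is available, and reuses the properties and topic argument built in the question:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

// properties is the same Properties object as in the question; the topic name comes from args(0)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.disableOperatorChaining() // source and sink become separate tasks, so the UI counters between them move

val stream = env.addSource(new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
stream.print
env.execute()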