Apache Flink 1.4 with Apache Kafka 1.0.0

I am trying to get an Apache Flink Scala project to integrate with Apache Kafka 1.0.0. When I attempt to add the flink-connector-kafka package in my build.sbt file, I get an error saying it cannot be resolved.
When I then look at the options available in the Maven repository, there is no flink-connector-kafka artifact for Scala 2.11 in any version above 0.10.2.
val flinkVersion = "1.4.1"
val flinkDependencies = Seq(
  "org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-connector-kafka" % flinkVersion)
Does anyone know how to integrate these versions correctly so that I can connect Apache Flink 1.4 to Apache Kafka 1.0.0 (Scala 2.11)? Nothing I have tried so far works, and I do not wish to downgrade the Kafka version I am connecting to.

This should work. Try:
val flinkVersion = "1.4.2"
libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion,
  "org.apache.flink" %% "flink-connector-kafka-0.11" % flinkVersion
)

Try
org.apache.flink" % "flink-connector-kafka-0.11_2.11" % "1.4.0
flink-connector-kafka-0.11_2.11 is the latest Kafka connector available for Flink 1.4.
Sources: https://search.maven.org/#search%7Cga%7C1%7Cflink%20kafka%20connector , https://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.flink%22%20AND%20a%3A%22flink-connector-kafka-0.11_2.11%22
A Kafka 1.0 broker is backwards compatible with 0.11 and 0.10 APIs.
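To make that concrete, here is a minimal sketch (not from the original answers) of consuming from a Kafka 1.0 broker with Flink 1.4 via the 0.11 connector; the bootstrap server, group id, and topic name are placeholder values:
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

object KafkaReadSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // placeholder broker and consumer group for illustration
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "flink-demo")

    // the 0.11 connector speaks a protocol the Kafka 1.0 broker still accepts
    val stream = env.addSource(
      new FlinkKafkaConsumer011[String]("my-topic", new SimpleStringSchema(), props)
    )
    stream.print()

    env.execute("kafka-0.11-connector-against-kafka-1.0")
  }
}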

Related

fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found

I'm trying to read data from S3 using Spark with the following dependencies and configuration:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.2.1"
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", config.s3AccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", config.s3SecretKey)
spark.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
I'm getting the following error:
java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
It was working fine with an older version of Spark and Hadoop; to be exact, I was previously using Spark 2.4.8 and Hadoop 2.8.5.
I was hoping to use the latest EMR version with Spark 3.2.0 and Hadoop 3.2.1. The issue was caused mainly by Hadoop 3.2.1, so the only option was to use an older EMR release. Spark 2.4.8 and Hadoop 2.10.1 worked for me.
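For reference, a minimal sketch (assuming sbt) of the dependency combination reported working above, with hadoop-aws and hadoop-client kept on the exact same Hadoop version:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8"
// keep hadoop-aws and hadoop-client on the exact same Hadoop version
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.10.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.10.1"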

RestHighLevelClient Not Found in Scala

I am trying to insert into Elasticsearch (ES) from a Scala program.
In build.sbt I have added
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
My code is
val client = new RestHighLevelClient( RestClient.builder(new HttpHost("localhost", 9200, "http")))
While compiling I get the errors below:
not found: type RestHighLevelClient
not found: value RestClient
Am I missing some import? My goal is to take a stream from Flink and insert it into Elasticsearch.
Any help is greatly appreciated.
To use Elasticsearch with Flink it's going to be easier if you use Flink's ElasticsearchSink, rather than working with RestHighLevelClient directly. However, a version of that sink for Elasticsearch 7.x is coming in Flink 1.10, which hasn't been released yet (it's coming very soon; RC1 is already out).
Using this connector requires an extra dependency, such as flink-connector-elasticsearch6_2.11 (or flink-connector-elasticsearch7_2.11, coming with Flink 1.10).
See the docs on using Elasticsearch with Flink.
The reason to prefer Flink's sink over using RestHighLevelClient yourself is that the Flink sink makes bulk requests, handles errors and retries, and it's tied in with Flink's checkpointing mechanism, so it's able to guarantee that nothing is lost if something fails.
As for your actual question, maybe you need to add:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-client" % "7.5.2"
You don't need these dependencies separately to insert data into Elasticsearch when using Flink streaming:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
Just use flink-connector-elasticsearch7 (or flink-connector-elasticsearch6):
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % "1.10.0"
All the Elasticsearch dependencies come along with the Flink Elasticsearch connector, so you don't need to include them separately in the build.sbt file.
build.sbt file for Flink Elasticsearch
name := "flink-streaming-demo"
scalaVersion := "2.12.11"
val flinkVersion = "1.10.0"
libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % flinkVersion
For more details, please go through the working Flink-Elasticsearch code I have provided here.
Note: from Elasticsearch 6.x onwards, the REST client is fully supported; up to Elasticsearch 5.x, the Transport client was used.
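To make the recommendation concrete, here is a minimal sketch (not from the original answer) of writing a stream to Elasticsearch with flink-connector-elasticsearch7 on Flink 1.10; the host and index name are placeholders:
import java.util

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.client.Requests

object ElasticsearchSinkSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = env.fromElements("a", "b", "c")

    val httpHosts = new util.ArrayList[HttpHost]()
    httpHosts.add(new HttpHost("localhost", 9200, "http"))

    val esSinkBuilder = new ElasticsearchSink.Builder[String](
      httpHosts,
      new ElasticsearchSinkFunction[String] {
        override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
          val json = new util.HashMap[String, String]()
          json.put("data", element)
          // "my-index" is a placeholder index name
          indexer.add(Requests.indexRequest().index("my-index").source(json))
        }
      }
    )
    // flush after each element for the demo; tune for real workloads
    esSinkBuilder.setBulkFlushMaxActions(1)

    stream.addSink(esSinkBuilder.build())
    env.execute("flink-elasticsearch7-sketch")
  }
}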

Flink write to S3 on EMR

I am trying to write some output to S3 from Flink running on EMR. I am using Scala 2.11.7, Flink 1.3.2, and EMR 5.11. However, I get the following error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:345)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:222)
at org.apache.flink.api.java.io.TextOutputFormat.open(TextOutputFormat.java:78)
at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:61)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:376)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
My build.sbt looks like this:
libraryDependencies ++= Seq(
"org.apache.flink" % "flink-core" % "1.3.2",
"org.apache.flink" % "flink-scala_2.11" % "1.3.2",
"org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
"org.apache.flink" % "flink-shaded-hadoop2" % "1.3.2",
"org.apache.flink" % "flink-clients_2.11" % "1.3.2",
"org.apache.flink" %% "flink-avro" % "1.3.2",
"org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)
I also found this post, but it didn't resolve the issue: External checkpoints to S3 on EMR
I just put the output to S3: input.writeAsText("s3://test/flink"). Any suggestions would be appreciated.
I'm not sure of the right combination of flink-shaded-hadoop and EMR versions. After several rounds of trial and error, I was able to write to S3 by using a newer version of flink-shaded-hadoop2: "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0".
Your issue is probably due to the fact that some libraries are loaded by EMR/YARN/Flink before your own classes, which leads to the NoSuchMethodError: the classes that get loaded are not the ones you provided, but the ones provided by EMR. Check the classpath in the JobManager/TaskManager logs.
A solution is to put your own jars in the Flink lib directory so that they are loaded before the EMR ones.
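For reference, a sketch of the question's dependency list with the first suggestion applied (everything else kept at 1.3.2 as in the question); whether this combination works also depends on the EMR release:
libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-core" % "1.3.2",
  "org.apache.flink" % "flink-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
  // the newer shaded Hadoop bundle is what made S3 writes work in the answer above
  "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0",
  "org.apache.flink" % "flink-clients_2.11" % "1.3.2",
  "org.apache.flink" %% "flink-avro" % "1.3.2",
  "org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)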

Saving data from Spark to Cassandra results in java.lang.ClassCastException

I'm trying to save data from Spark to Cassandra in Scala using saveToCassandra on an RDD, or save with a DataFrame (both result in the same error). The full message is:
java.lang.ClassCastException:
com.datastax.driver.core.DefaultResultSetFuture cannot be cast to
com.google.common.util.concurrent.ListenableFuture
I've followed along with the code here and still seem to get the error.
I'm using Cassandra 3.6, Spark 1.6.1, and spark-cassandra-connector 1.6. Let me know if there's anything else I can provide to help with the debugging.
I had a similar exception and fixed it by changing the Scala version in build.sbt:
scalaVersion := "2.10.6"
and library dependencies:
libraryDependencies ++= Seq(
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.2",
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)
With this configuration, the example from the 5-minute quick start guide works fine.
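For context, a minimal sketch of the kind of write involved, using spark-cassandra-connector 1.6; the keyspace, table, column names, and host are hypothetical:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// placeholder Cassandra host, keyspace, and table for illustration
val conf = new SparkConf()
  .setAppName("cassandra-save-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val rows = sc.parallelize(Seq((1, "alice"), (2, "bob")))
rows.saveToCassandra("test_ks", "users", SomeColumns("id", "name"))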

In sbt, how can we specify the version of hadoop on which spark depends?

I have an sbt project which uses Spark and Spark SQL, but my cluster uses Hadoop 1.0.4 and Spark 1.2 with Spark SQL 1.2. Currently my build.sbt looks like this:
libraryDependencies ++= Seq(
  "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.5",
  "com.datastax.cassandra" % "cassandra-driver-mapping" % "2.1.5",
  "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.2.1",
  "org.apache.spark" % "spark-core_2.10" % "1.2.1",
  "org.apache.spark" % "spark-sql_2.10" % "1.2.1"
)
It turns out that I am running the app with Hadoop 2.2.0, but I wish to see hadoop-*-1.0.4 in my dependencies. What should I do?
You can exclude the Hadoop dependency from Spark and add an explicit one with the version you need, something along these lines:
libraryDependencies ++= Seq(
  "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.5",
  "com.datastax.cassandra" % "cassandra-driver-mapping" % "2.1.5",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.1",
  "org.apache.spark" % "spark-sql_2.10" % "1.2.1" excludeAll (
    ExclusionRule("org.apache.hadoop")
  ),
  "org.apache.hadoop" % "hadoop-client" % "2.2.0"
)
You probably do not need the dependency on spark-core, since spark-sql should bring it in transitively.
Also, watch out: spark-cassandra-connector probably also depends on Spark, which could again transitively bring in Hadoop, so you might need to add an exclusion rule there as well.
Last note: an excellent tool for investigating which dependency comes from where is https://github.com/jrudolph/sbt-dependency-graph
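A minimal sketch of wiring that plugin in (the plugin version here is an assumption; check the project page for the current one):
// project/plugins.sbt -- plugin version is an assumption
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.9.2")
Then run sbt dependencyTree to see which library pulls in which Hadoop artifacts.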