Flink write to S3 on EMR - scala

I am trying to write some output to S3 from Flink running on EMR. I am using Scala 2.11.7, Flink 1.3.2, and EMR 5.11, but I get the following error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:345)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:222)
at org.apache.flink.api.java.io.TextOutputFormat.open(TextOutputFormat.java:78)
at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:61)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:376)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
My build.sbt looks like this:
libraryDependencies ++= Seq(
"org.apache.flink" % "flink-core" % "1.3.2",
"org.apache.flink" % "flink-scala_2.11" % "1.3.2",
"org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
"org.apache.flink" % "flink-shaded-hadoop2" % "1.3.2",
"org.apache.flink" % "flink-clients_2.11" % "1.3.2",
"org.apache.flink" %% "flink-avro" % "1.3.2",
"org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)
I also found this post, but it didn't resolve the issue: External checkpoints to S3 on EMR
I simply write the output to S3 with input.writeAsText("s3://test/flink"). Any suggestions would be appreciated.

I was not sure which combination of flink-shaded-hadoop and EMR versions is right. After several rounds of trial and error, I was able to write to S3 by using a newer version of flink-shaded-hadoop2: "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0".
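For reference, this is a sketch of the dependency block from the question with only the shaded-hadoop artifact bumped (the other versions left as they were):
libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-core" % "1.3.2",
  "org.apache.flink" % "flink-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0", // bumped from 1.3.2
  "org.apache.flink" % "flink-clients_2.11" % "1.3.2",
  "org.apache.flink" %% "flink-avro" % "1.3.2",
  "org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)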

Your issue is probably due to the fact that some libraries are loaded by EMR/YARN/Flink before your own classes, which leads to the NoSuchMethodError: the classes that get loaded are not the ones you provided, but the ones provided by EMR. Check the classpath printed in the JobManager/TaskManager logs.
A solution is to put your own jars in the Flink lib directory so that they are loaded before the EMR ones.

Related

Spark Cassandra Join ClassCastException

I am trying to join two Cassandra tables with:
t1.join(t2, Seq("some column"), "left")
I am getting the below error message:
Exception in thread "main" java.lang.ClassCastException: scala.Tuple8 cannot be cast to scala.Tuple7 at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy.apply(CassandraDirectJoinStrategy.scala:27)
I am using Cassandra 3.11.13 and Spark 3.3.0. The code dependencies are:
libraryDependencies ++= Seq(
"org.scalatest" %% "scalatest" % "3.2.11" % Test,
"com.github.mrpowers" %% "spark-fast-tests" % "1.0.0" % Test,
"graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12" % Provided,
"org.rogach" %% "scallop" % "4.1.0" % Provided,
"org.apache.spark" %% "spark-sql" % "3.1.2" % Provided,
"org.apache.spark" %% "spark-graphx" % "3.1.2" % Provided,
"com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided)
Your help is greatly appreciated.
The Spark Cassandra connector does not support Apache Spark 3.3.0 yet, and I suspect that is the reason it's not working, though I haven't verified this myself.
Support for Spark 3.3.0 has been requested in SPARKC-686, but the amount of work required is significant, so stay tuned.
The latest supported Spark version is 3.2, using spark-cassandra-connector 3.2. Cheers!
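If staying on Spark 3.2.x is an option for you, a minimal sketch of an aligned dependency pair looks like this (the exact patch versions are illustrative):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.2.1" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided
)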
This commit
adds initial support for Spark 3.3.x, although it is still awaiting RCs/publication at the time of this comment, so for now you would need to build and package the jars yourself to make use of it and resolve the above error when using Spark 3.3. This could be a good opportunity to provide feedback on any subsequent RCs as an active user.
I will update this answer when RCs/stable releases are available, which should resolve the issue for others hitting it. Unfortunately, I don't have enough reputation to add this as a comment to the thread above.

Testing stateful UDFs in Flink

I am trying to test some stateful UDFs in my Scala Flink application following the docs:
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/datastream/testing/#unit-testing-stateful-or-timely-udfs--custom-operators
Based on that link, I've added the following dependencies to the build.sbt file:
"org.apache.flink" %% "flink-test-utils" % flinkVersion % Test,
"org.apache.flink" %% "flink-runtime" % flinkVersion % Test,
"org.apache.flink" %% "flink-streaming-java" % flinkVersion % Test
But I still cannot access the required utility classes such as OneInputStreamOperatorTestHarness (the class cannot be found).
FYI, the Scala version in my project is 2.12.11 and Flink is at 1.13.2. Am I doing something wrong? Is there any reason why I cannot find those classes? Maybe the documentation is not correct?
I added the tests classifier to the flink-streaming-java dependency in the build.sbt file and now it works:
"org.apache.flink" %% "flink-streaming-java" % flinkVersion % Test classifier "tests"

How should I log from my custom Spark JAR

Scala/JVM noob here who wants to understand more about logging, specifically when using Apache Spark.
I have written a library in Scala that depends on a bunch of Spark libraries; here are my dependencies:
import sbt._
object Dependencies {
object Version {
val spark = "2.2.0"
val scalaTest = "3.0.0"
}
val deps = Seq(
"org.apache.spark" %% "spark-core" % Version.spark,
"org.scalatest" %% "scalatest" % Version.scalaTest,
"org.apache.spark" %% "spark-hive" % Version.spark,
"org.apache.spark" %% "spark-sql" % Version.spark,
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
"ch.qos.logback" % "logback-core" % "1.2.3",
"ch.qos.logback" % "logback-classic" % "1.2.3",
"com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
"com.typesafe" % "config" % "1.3.2"
)
val exc = Seq(
ExclusionRule("org.slf4j", "slf4j-log4j12")
)
}
(admittedly I copied a lot of this from elsewhere).
I am able to package my code as a JAR using sbt package which I can then call from Spark by placing the JAR into ${SPARK_HOME}/jars. This is working great.
I now want to implement logging from my code so I do this:
import com.typesafe.scalalogging.Logger
/*
* stuff stuff stuff
*/
val logger : Logger = Logger("name")
logger.info("stuff")
However, when I try to call my library (which I'm doing from Python, not that I think that's relevant here), I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.company.package.class.function.
E : java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger$
Clearly this is because the com.typesafe.scala-logging library is not in my JAR. I know I could solve this by packaging with sbt assembly, but I don't want to do that because it will include all the other dependencies and make my JAR enormous.
Is there a way to selectively include libraries (com.typesafe.scala-logging in this case) in my JAR? Alternatively, should I be attempting to log using another method, perhaps a logger that is included with Spark?
Thanks to pasha701 in the comments, I attempted packaging my dependencies using sbt assembly rather than sbt package.
import sbt._
object Dependencies {
object Version {
val spark = "2.2.0"
val scalaTest = "3.0.0"
}
val deps = Seq(
"org.apache.spark" %% "spark-core" % Version.spark % Provided,
"org.scalatest" %% "scalatest" % Version.scalaTest,
"org.apache.spark" %% "spark-hive" % Version.spark % Provided,
"org.apache.spark" %% "spark-sql" % Version.spark % Provided,
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
"ch.qos.logback" % "logback-core" % "1.2.3",
"ch.qos.logback" % "logback-classic" % "1.2.3",
"com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
"com.typesafe" % "config" % "1.3.2"
)
val exc = Seq(
ExclusionRule("org.slf4j", "slf4j-log4j12")
)
}
Unfortunately, even with the Spark dependencies specified as Provided, my JAR went from 324K to 12M, hence I opted to use println() instead. Here is my commit message:
log using println
I went with the println option because it keeps the size of the JAR small.
I trialled use of com.typesafe.scalalogging.Logger but my tests failed with error:
java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger
because that isn't provided with Spark. I attempted to use sbt assembly
instead of sbt package but this caused the size of the JAR to go from
324K to 12M, even with spark dependencies set to Provided. A 12M JAR
isn't worth the trade-off just to use scalaLogging, hence using println
instead.
I note that pasha701 suggested using log4j instead, as that is provided with Spark, so I shall try that next. Any advice on using log4j from Scala when writing a Spark library would be much appreciated.
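For what it's worth, a minimal sketch of what that could look like using the log4j 1.x API that ships with Spark 2.x, so nothing extra needs to be bundled in the JAR (the class name here is just an example):
import org.apache.log4j.Logger

// Example library class; @transient lazy avoids serializing the logger
// when instances end up inside Spark closures.
class MyLibraryClass extends Serializable {
  @transient private lazy val log: Logger = Logger.getLogger(getClass.getName)

  def doStuff(): Unit = {
    log.info("stuff")
  }
}
Log levels and appenders would then be controlled through Spark's conf/log4j.properties rather than by the library itself.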
As you said, sbt assembly will include all the dependencies in your jar.
If you only want certain ones, you have two options:
Download logback-core and logback-classic and pass them to spark2-submit with the --jars option.
Specify the above dependencies in the --packages option of spark2-submit.

RestHighLevelClient Not Found in Scala

I am trying to insert into Elasticsearch (ES) from a Scala program.
In build.sbt I have added
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
My code is
val client = new RestHighLevelClient( RestClient.builder(new HttpHost("localhost", 9200, "http")))
While compiling, I am getting the errors below:
not found: type RestHighLevelClient
not found: value RestClient
Am I missing some import? My goal is to take a stream from Flink and insert it into Elasticsearch.
Any help is greatly appreciated.
To use Elasticsearch with Flink it's going to be easier if you use Flink's ElasticsearchSink, rather than working with RestHighLevelClient directly. However, a version of that sink for Elasticsearch 7.x is coming in Flink 1.10, which hasn't been released yet (it's coming very soon; RC1 is already out).
Using this connector requires an extra dependency, such as flink-connector-elasticsearch6_2.11 (or flink-connector-elasticsearch7_2.11, coming with Flink 1.10).
See the docs on using Elasticsearch with Flink.
The reason to prefer Flink's sink over using RestHighLevelClient yourself is that the Flink sink makes bulk requests, handles errors and retries, and it's tied in with Flink's checkpointing mechanism, so it's able to guarantee that nothing is lost if something fails.
As for your actual question, maybe you need to add
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-client" % "7.5.2"
We don't need to add these dependencies separately to insert data into Elasticsearch from Flink streaming:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
Just use flink-connector-elasticsearch7 or flink-connector-elasticsearch6 instead:
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % "1.10.0"
All the Elasticsearch dependencies come along with the Flink-Elasticsearch connector, so we don't need to include them separately in the build.sbt file.
build.sbt file for Flink Elasticsearch
name := "flink-streaming-demo"
scalaVersion := "2.12.11"
val flinkVersion = "1.10.0"
libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % flinkVersion
For more details, please go through the working Flink-Elasticsearch code that I have provided here.
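To sketch how the connector is typically wired up (the host, index name, and sample stream below are placeholders, not taken from the original code):
import java.util

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.client.Requests

object FlinkToElasticsearch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = env.fromElements("a", "b", "c") // placeholder stream

    val httpHosts = new util.ArrayList[HttpHost]()
    httpHosts.add(new HttpHost("localhost", 9200, "http"))

    // The sink batches index requests and retries on failure.
    val esSinkBuilder = new ElasticsearchSink.Builder[String](
      httpHosts,
      new ElasticsearchSinkFunction[String] {
        override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
          val json = new util.HashMap[String, String]()
          json.put("data", element)
          indexer.add(Requests.indexRequest.index("my-index").source(json)) // "my-index" is a placeholder
        }
      }
    )
    esSinkBuilder.setBulkFlushMaxActions(1) // flush after every element, demo only

    stream.addSink(esSinkBuilder.build())
    env.execute("flink-to-elasticsearch")
  }
}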
Note: from Elasticsearch 6.x onwards, the REST client is fully supported; up to Elasticsearch 5.x, the Transport client was used.

Apache Flink 1.4 with Apache Kafka 1.0.0

I am trying to get an Apache Flink Scala project to integrate with Apache Kafka 1.0.0. When I attempt to add the flink-connector-kafka package to my build.sbt file, I get an error saying it cannot be resolved.
When I then look at the options available in the Maven repository, there is no dependency available for Apache Kafka 2.11-1.0.0 for any connector version above 0.10.2.
val flinkVersion = "1.4.1"
val flinkDependencies = Seq(
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka" % flinkVersion)
Does anyone know how to integrate these versions correctly so that I can connect Apache Flink 1.4 to Apache Kafka 2.11-1.0.0? Nothing I have tried so far works (and I do not wish to downgrade the Kafka version I am connecting to).
This should work. Try:
val flinkVersion = "1.4.2"
libraryDependencies ++= Seq(
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion,
"org.apache.flink" %% "flink-connector-kafka-0.11" % flinkVersion
)
Try:
"org.apache.flink" % "flink-connector-kafka-0.11_2.11" % "1.4.0"
flink-connector-kafka-0.11_2.11 is the latest Kafka connector Flink makes available.
Sources: https://search.maven.org/#search%7Cga%7C1%7Cflink%20kafka%20connector , https://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.flink%22%20AND%20a%3A%22flink-connector-kafka-0.11_2.11%22
A Kafka 1.0 broker is backwards compatible with 0.11 and 0.10 APIs.
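With the 0.11 connector on the classpath, a minimal consumer sketch for Flink 1.4 could look like the following (the broker address, group id, and topic name are placeholders):
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

object KafkaReadJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
    props.setProperty("group.id", "flink-consumer-group")    // placeholder group id

    // FlinkKafkaConsumer011 speaks the Kafka 0.11 protocol, which a 1.0.0 broker accepts
    val stream = env.addSource(
      new FlinkKafkaConsumer011[String]("my-topic", new SimpleStringSchema(), props))

    stream.print()
    env.execute("read-from-kafka")
  }
}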