Testing stateful UDFs in Flink - Scala

I am trying to test some stateful UDFs in my Scala Flink application following the docs:
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/datastream/testing/#unit-testing-stateful-or-timely-udfs--custom-operators
Based on that link, I've added the following dependencies to the build.sbt file:
"org.apache.flink" %% "flink-test-utils" % flinkVersion % Test,
"org.apache.flink" %% "flink-runtime" % flinkVersion % Test,
"org.apache.flink" %% "flink-streaming-java" % flinkVersion % Test
But still, I cannot access the required utility classes such as OneInputStreamOperatorTestHarness (the class cannot be found).
FYI, the Scala version in my project is 2.12.11 and Flink is at v1.13.2. Am I doing something wrong? Is there a reason why I cannot find those classes? Maybe the documentation is incorrect?

I added the tests classifier to the flink-streaming-java dependency in the build.sbt file and now it works:
"org.apache.flink" %% "flink-streaming-java" % flinkVersion % Test classifier "tests"

Related

How should I log from my custom Spark JAR

Scala/JVM noob here who wants to understand more about logging, specifically when using Apache Spark.
I have written a library in Scala that depends upon a bunch of Spark libraries, here are my dependencies:
import sbt._

object Dependencies {
  object Version {
    val spark = "2.2.0"
    val scalaTest = "3.0.0"
  }

  val deps = Seq(
    "org.apache.spark" %% "spark-core" % Version.spark,
    "org.scalatest" %% "scalatest" % Version.scalaTest,
    "org.apache.spark" %% "spark-hive" % Version.spark,
    "org.apache.spark" %% "spark-sql" % Version.spark,
    "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
    "ch.qos.logback" % "logback-core" % "1.2.3",
    "ch.qos.logback" % "logback-classic" % "1.2.3",
    "com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
    "com.typesafe" % "config" % "1.3.2"
  )

  val exc = Seq(
    ExclusionRule("org.slf4j", "slf4j-log4j12")
  )
}
(admittedly I copied a lot of this from elsewhere).
I am able to package my code as a JAR using sbt package which I can then call from Spark by placing the JAR into ${SPARK_HOME}/jars. This is working great.
I now want to implement logging from my code so I do this:
import com.typesafe.scalalogging.Logger
/*
* stuff stuff stuff
*/
val logger : Logger = Logger("name")
logger.info("stuff")
However, when I try to call my library (which I'm doing from Python, not that I think that's relevant here) I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.company.package.class.function.
E : java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger$
Clearly this is because the com.typesafe.scala-logging library is not in my JAR. I know I could solve this by packaging with sbt assembly, but I don't want to do that because it will include all the other dependencies and cause my JAR to be enormous.
Is there a way to selectively include libraries (com.typesafe.scala-logging in this case) in my JAR? Alternatively, should I be attempting to log using another method, perhaps using a logger that is included with Spark?
Thanks to pasha701 in the comments, I attempted packaging my dependencies with sbt assembly rather than sbt package. Here is the updated Dependencies object, with the Spark libraries marked as Provided:
import sbt._

object Dependencies {
  object Version {
    val spark = "2.2.0"
    val scalaTest = "3.0.0"
  }

  val deps = Seq(
    "org.apache.spark" %% "spark-core" % Version.spark % Provided,
    "org.scalatest" %% "scalatest" % Version.scalaTest,
    "org.apache.spark" %% "spark-hive" % Version.spark % Provided,
    "org.apache.spark" %% "spark-sql" % Version.spark % Provided,
    "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
    "ch.qos.logback" % "logback-core" % "1.2.3",
    "ch.qos.logback" % "logback-classic" % "1.2.3",
    "com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
    "com.typesafe" % "config" % "1.3.2"
  )

  val exc = Seq(
    ExclusionRule("org.slf4j", "slf4j-log4j12")
  )
}
Unfortunately, even with the Spark dependencies specified as Provided, my JAR went from 324K to 12M, so I opted to use println() instead. Here is my commit message:
log using println
I went with the println option because it keeps the size of the JAR small.
I trialled use of com.typesafe.scalalogging.Logger but my tests failed with error:
java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger
because that isn't provided with Spark. I attempted to use sbt assembly
instead of sbt package but this caused the size of the JAR to go from
324K to 12M, even with spark dependencies set to Provided. A 12M JAR
isn't worth the trade-off just to use scalaLogging, hence using println
instead.
I note that pasha701 suggested using log4j instead, as that is provided with Spark, so I shall try that next. Any advice on using log4j from Scala when writing a Spark library would be much appreciated.
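A minimal sketch of what I believe the log4j route would look like (untested; log4j 1.x ships with Spark 2.x, so nothing extra needs to be bundled, and the class name here is just illustrative):

import org.apache.log4j.Logger

// Hypothetical example class; the point is the logger wiring, not the job logic.
class MyJob extends Serializable {
  // @transient + lazy keeps the non-serializable logger out of closures that
  // Spark ships to executors, and re-creates it on each JVM as needed.
  @transient private lazy val log: Logger = Logger.getLogger(getClass.getName)

  def run(): Unit = {
    log.info("stuff")
  }
}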
As you said, 'sbt assembly' will include all the dependencies in your jar. If you only want to include certain ones, there are two options:
1. Download logback-core and logback-classic and add them via the --jars option of spark2-submit
2. Specify the above dependencies in the --packages option of spark2-submit

Flink write to S3 on EMR

I am trying to write some outputs to S3 using EMR with Flink. I am using Scala 2.11.7, Flink 1.3.2, and EMR 5.11. However, I got the following error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:345)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:222)
at org.apache.flink.api.java.io.TextOutputFormat.open(TextOutputFormat.java:78)
at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:61)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:376)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
My build.sbt looks like this:
libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-core" % "1.3.2",
  "org.apache.flink" % "flink-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-shaded-hadoop2" % "1.3.2",
  "org.apache.flink" % "flink-clients_2.11" % "1.3.2",
  "org.apache.flink" %% "flink-avro" % "1.3.2",
  "org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)
I also found this post, but it didn't resolve the issue: External checkpoints to S3 on EMR
I just write the output to S3 with input.writeAsText("s3://test/flink"). Any suggestions would be appreciated.
I'm not sure what the right combination of flink-shaded-hadoop and EMR versions is. After several rounds of trial and error, I was able to write to S3 by using a newer version of flink-shaded-hadoop2: "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0"
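Concretely, that means only the flink-shaded-hadoop2 line in the build.sbt above changes (the other Flink modules stay at 1.3.2, per the answer's report that this mix worked on EMR 5.11):

libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-core" % "1.3.2",
  "org.apache.flink" % "flink-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
  "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0", // bumped from 1.3.2
  "org.apache.flink" % "flink-clients_2.11" % "1.3.2",
  "org.apache.flink" %% "flink-avro" % "1.3.2",
  "org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)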
Your issue is probably due to the fact that some libraries are loaded by EMR/YARN/Flink before your own classes, which leads to the NoSuchMethodError: the classes actually loaded are not the ones you provided, but the ones provided by EMR. Check the classpath logged by the JobManager/TaskManager.
A solution is to put your own jars in the Flink lib directory so that they are loaded before the EMR ones.

Play Framework SBT import play.api.libs.streams

I am building an application with Play Framework (2.4.0) / Scala and trying to add play.api.libs.streams so I can use the Streams object in my application.
So here is my working build.sbt:
libraryDependencies ++= Seq(
  specs2 % Test,
  cache,
  ws,
  "com.softwaremill.macwire" %% "macros" % "2.2.2",
  "com.softwaremill.macwire" %% "runtime" % "1.0.7",
  "org.reactivemongo" %% "play2-reactivemongo" % "0.11.10",
  "com.eclipsesource" %% "play-json-schema-validator" % "0.6.5",
  "org.scalatest" %% "scalatest" % "2.2.5" % Test,
  "org.scalacheck" %% "scalacheck" % "1.12.2" % Test,
  "org.scalatestplus" %% "play" % "1.4.0-M4" % Test,
  "com.typesafe.akka" %% "akka-stream" % "2.4.4"
)
Now when I try to add the following line:
streams,
or when I just add
libraryDependencies += streams
I get the error:
error: No implicit for Append.Value[Seq[sbt.ModuleID], sbt.TaskKey[sbt.Keys.TaskStreams]] found,
so sbt.TaskKey[sbt.Keys.TaskStreams] cannot be appended to Seq[sbt.ModuleID]
libraryDependencies += streams
And I am unable to launch my project.
I found this question, but tweaking by adding '%' or '%%' did not solve the issue, and I was not sure how to use the solutions as I am just trying to add a play.api.libs dependency and not an external one.
I am kind of stuck here; I don't understand why streams is an sbt.TaskKey[sbt.Keys.TaskStreams] while ws or any other entry added to the sequence is an sbt.ModuleID.
In this case, the cache, ws, etc. lines refer not to packages in play.api.libs, but to build artefacts that the Play sbt plugin pre-defines as components in the play.sbt.PlayImport object, for example here.
In this context, ws is exactly equivalent to:
"com.typesafe.play" %% "play-ws" % "2.5.4"
The reason you see an error for streams is that Play defines no such component, so the bare name resolves to sbt's own streams task key (a TaskKey[TaskStreams]), which cannot be appended to a Seq[sbt.ModuleID].
The play.api.libs.streams.Streams object should be available without anything extra added to your build if you have a PlayScala project on Play 2.5.x and above.
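To make the equivalence concrete, here is a small hedged sketch (Play 2.5.4 assumed for the explicit version string); the shortcuts come from the plugin's PlayImport object, which is auto-imported in build.sbt:

import play.sbt.PlayImport._  // defines ws, cache, specs2, ...

// The shortcut...
libraryDependencies += ws
// ...resolves to the same ModuleID as spelling it out explicitly:
// libraryDependencies += "com.typesafe.play" %% "play-ws" % "2.5.4"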

Scala sbt order of imports

Is it possible to change the order of imports in sbt for scala compilation?
I have two unmanaged dependencies in the ./lib/ folder and the rest are managed dependencies in the sbt file:
libraryDependencies ++= Seq(
  "org.slf4j" % "slf4j-api" % "1.7.10",
  "org.slf4j" % "slf4j-simple" % "1.7.10",
  "org.slf4j" % "slf4j-log4j12" % "1.7.10",
  "com.typesafe" % "config" % "1.0.2",
  "edu.washington.cs.knowitall.taggers" %% "taggers-core" % "0.4",
  "com.rockymadden.stringmetric" % "stringmetric-core" % "0.25.3",
  "org.apache.solr" % "solr-solrj" % "4.3.1",
  "com.twitter" %% "util-collection" % "6.3.6",
  "org.scalaj" %% "scalaj-http" % "0.3.10",
  "commons-logging" % "commons-logging" % "1.2"
)
In Eclipse I can run my program because I can change the order of entries on the Java build path (I put the unmanaged dependencies at the end).
However, when I want to run it from the terminal:
sbt "run-main de.questionParser.Test"
I get the following error:
[error] Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.collect.Lists.reverse(Ljava/util/List;)Ljava/util/List;
So the final question is:
Is it possible to change the order in which sbt puts dependencies on the classpath, so that managed dependencies come before the unmanaged ones?
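I don't know of a dedicated sbt setting for this, but one way to experiment is to reorder the runtime classpath yourself. This is a hedged sketch using standard sbt 0.13 keys, not a verified fix for the NoSuchMethodError; it pushes the jars from ./lib to the end so the managed dependencies win:

// In build.sbt: move the unmanaged ./lib jars to the end of the runtime classpath,
// which is what run-main uses.
fullClasspath in Runtime := {
  val unmanagedFiles = (unmanagedJars in Compile).value.map(_.data).toSet
  val cp = (fullClasspath in Runtime).value
  val (unmanaged, managed) = cp.partition(entry => unmanagedFiles.contains(entry.data))
  managed ++ unmanaged // managed dependencies first, ./lib jars last
}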

What do the % and %% operators do when setting up SBT dependencies?

In Lift Web Framework, dependencies for Simple Build Tool (SBT) are specified in LiftProject.scala. That file includes this code:
override def libraryDependencies = Set(
  "net.liftweb" %% "lift-webkit" % liftVersion % "compile->default",
  "net.liftweb" %% "lift-mapper" % liftVersion % "compile->default",
  "org.mortbay.jetty" % "jetty" % "6.1.22" % "test->default",
  "junit" % "junit" % "4.5" % "test->default",
  "org.scala-tools.testing" %% "specs" % "1.6.6" % "test->default",
  "org.scala-lang" % "scala-compiler" % "2.8.1" % "test->default",
  "org.apache.tomcat" % "tomcat-juli" % "7.0.0" % "test->default",
  "com.h2database" % "h2" % "1.2.138"
) ++ super.libraryDependencies
What do the % and %% operators do here? If I paste this code into the Scala interpreter, it errors out, and neither % nor %% is defined for String or RichString. What's going on here?
The difference between these methods is that %% takes the Scala version into account when sbt resolves the dependency, so, for example, net/liftweb/lift-webkit_2.8.1/2.3/lift-webkit_2.8.1-2.3.jar will be downloaded from the repository.
Regarding the compile error: these methods only work when the implicit conversions defined in sbt's class hierarchy are in scope, which is why pasting the code into a plain Scala interpreter fails.
They control grabbing builds for a specific version of Scala.
% grabs the dependency exactly as you described it.
%% tacks the Scala version onto the resource name to fetch a version for the local Scala build. Extra useful if you cross-build for several releases of Scala.
Since 2011, the documentation has become a bit more complete: "Library dependencies".
The article "Sbt heiroglyphs and multi-projects explained" from Divan Visagie also details those sbt operators:
% and %% get a little tricky: they define the ids and versions of each library in the sequence, but it's safe to say that:
"org.scala-tools" % "scala-stm_2.11.1" % "0.3"
Is the equivalent of
"org.scala-tools" %% "scala-stm" % "0.3"
So effectively the extra %% means it figures out what Scala version you are on.
The doc adds:
The idea is that many dependencies are compiled for multiple Scala versions, and you’d like to get the one that matches your project to ensure binary compatibility.
The complexity in practice is that often a dependency will work with a slightly different Scala version; but %% is not smart about that.
So if the dependency is available for 2.10.1 but you’re using scalaVersion := "2.10.4", you won’t be able to use %% even though the 2.10.1 dependency likely works.
If %% stops working, just go see which versions the dependency is really built for, and hardcode the one you think will work (assuming there is one).
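In sbt terms, "hardcoding the one you think will work" just means dropping back from %% to a single % and writing the Scala suffix yourself (the artifact and versions below are purely illustrative):

// Let sbt derive the _2.xx suffix from scalaVersion:
libraryDependencies += "org.scala-stm" %% "scala-stm" % "0.7"
// ...or pin a specific cross-built artifact by hand:
libraryDependencies += "org.scala-stm" % "scala-stm_2.10" % "0.7"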