Spark shell started with assembly jar cannot resolve decline's cats dependency

I want to use decline to parse command-line parameters for a Spark application. I use sbt assembly to create a fat jar and pass it to spark-submit. Unfortunately, I get the error
java.lang.NoSuchMethodError: cats.kernel.Semigroup$.catsKernelMonoidForList()Lcats/kernel/Monoid; when the parameters get parsed (example below). To reproduce the error you can check out my GitHub repo.
This is my code:
package example

import cats.implicits._
import com.monovore.decline._

object Minimal {

  case class Minimal(input: String, count: Int)

  val configOpts: Opts[Minimal] = (
    Opts.option[String]("input", "the input"),
    Opts.option[Int]("count", "the count")
  ).mapN(Minimal.apply)

  def parseMinimalConfig(
      args: Array[String]
  ): Either[Help, Minimal] = {
    val command = Command(name = "min-example", header = "my-header")(configOpts)
    command.parse(args)
  }
}
and this is my build.sbt:
name := "example"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq("com.monovore" %% "decline" % "2.3.0")
This is how I reproduce the error locally (Spark version is 3.1.2):
~/playground/decline-test » ~/apache/spark-3.1.2-bin-hadoop3.2/bin/spark-shell --jars "target/scala-2.12/example-assembly-0.1.jar"
22/08/31 14:36:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://airi:4040
Spark context available as 'sc' (master = local[*], app id = local-1661949407775).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_345)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import example.Minimal._
import example.Minimal._
scala> parseMinimalConfig(Array("x", "x"))
java.lang.NoSuchMethodError: cats.kernel.Semigroup$.catsKernelMonoidForList()Lcats/kernel/Monoid;
at com.monovore.decline.Help$.optionList(Help.scala:74)
at com.monovore.decline.Help$.detail(Help.scala:105)
at com.monovore.decline.Help$.fromCommand(Help.scala:50)
at com.monovore.decline.Parser.<init>(Parser.scala:21)
at com.monovore.decline.Command.parse(opts.scala:20)
at example.Minimal$.parseMinimalConfig(Minimal.scala:19)
... 49 elided
scala> :quit
Interestingly, adding the assembled jar to the plain Scala REPL's classpath does not yield the same error but gives the expected help message. My local Scala version is 2.12.16 and the Spark Scala version is 2.12.10, but I'm unsure whether this could be the cause.
~/playground/decline-test » scala -cp "target/scala-2.12/example-assembly-0.1.jar"
Welcome to Scala 2.12.16-20220611-202836-281c3ee (OpenJDK 64-Bit Server VM, Java 1.8.0_345).
Type in expressions for evaluation. Or try :help.
scala> import example.Minimal._
import example.Minimal._
scala> parseMinimalConfig(Array("x", "x"))
res0: Either[com.monovore.decline.Help,example.Minimal.Minimal] =
Left(Unexpected argument: x
Usage: command --input <string> --count <integer>
our command
Options and flags:
--input <string>
the input
--count <integer>
the count)
scala>
I also tried Scala 2.13 with Spark 3.2.2 and got the same error, although I need to double-check that.
What could I be missing?

Have you tried using ShadeRules to avoid getting stuck in dependency hell?
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("org.typelevel.cats.**" -> "repackaged.org.typelevel.cats.#1").inAll,
  ShadeRule.rename("cats.**" -> "repackaged.cats.#1").inAll
)
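For completeness, a minimal sketch of how this shading could be wired into the build, assuming the fat jar is produced with the sbt-assembly plugin (the plugin version below is an assumption, not taken from the original repo):

// project/plugins.sbt (sketch)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt (sketch) -- relocate cats inside the fat jar so it cannot clash with
// whatever cats-kernel version spark-shell already has on its classpath
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("cats.**" -> "repackaged.cats.#1").inAll
)

After re-running sbt assembly, spark-shell --jars target/scala-2.12/example-assembly-0.1.jar should pick up the relocated decline/cats classes instead of Spark's own copy.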

Related

Why does running the ZIO App with mill not work?

I set up the simple ZIO App from zio.dev.
val myAppLogic =
  for {
    _    <- putStrLn("Hello! What is your name?")
    name <- getStrLn
    _    <- putStrLn(s"Hello, ${name}, welcome to ZIO!")
  } yield ()
When I run it in IntelliJ it works as expected.
However, running it with mill doesn't work.
nbszmbp012:zio-scala-camunda-bot mpa$ mill server.run
[27/37] server.compile
[info] Compiling 1 Scala source to /Users/mpa/dev/Github/pme123/zio-scala-camunda-bot/out/server/compile/dest/classes ...
[info] Done compiling.
[37/37] server.run
Hello! What is your name?
Peter
name <- getStrLn is not executed.
Here is the build.sc
import mill._, scalalib._

object server extends ScalaModule {
  def scalaVersion = "2.12.8"

  def ivyDeps = Agg(
    ivy"dev.zio::zio:1.0.0-RC10-1",
    ivy"com.bot4s::telegram-core:4.3.0-RC1"
  )
}
Did I miss something?
Mill runs in client-server mode by default. One consequence is that build tasks can't consume the input stream.
Your example needs to read from the process's standard input, so you have to explicitly tell mill to run in interactive mode with --interactive (or -i for short).
$ mill -i server.run
[27/37] server.compile
[info] Compiling 1 Scala source to /tmp/zio-q/out/server/compile/dest/classes ...
[info] Done compiling.
[37/37] server.run
Hello! What is your name?
Peter
Hello, Peter, welcome to ZIO!
When invoked with the additional -i (before the task name), the ZIO app correctly reads from STDIN and prints the greeting.
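For reference, here is a minimal complete program around the logic above, modelled on the zio.dev getting-started sample of that era; the object name is made up and the run signature is an assumption based on zio 1.0.0-RC10-1:

import zio.console._

object MyApp extends zio.App {

  // zio.App's entry point; map a failure (e.g. getStrLn's IOException) to exit code 1
  def run(args: List[String]) =
    myAppLogic.fold(_ => 1, _ => 0)

  val myAppLogic =
    for {
      _    <- putStrLn("Hello! What is your name?")
      name <- getStrLn
      _    <- putStrLn(s"Hello, ${name}, welcome to ZIO!")
    } yield ()
}

With mill this still needs the interactive flag, i.e. mill -i server.run, so that getStrLn can actually read from stdin.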

SBT system property not being set

I'm trying to run sbt flywayMigrate -Denvi=foo, but the system property envi is not being set. Pointers for debugging are greatly appreciated, as I haven't been able to identify the cause of this issue for hours now. No question on SO or anywhere else has covered this issue so far.
In build.sbt, the property is read into a variable:
lazy val envi = sys.props.getOrElse("envi", "default")
Using sys.env.get("ENVI") instead is currently not an option due to shared/team repo considerations.
sbt console -Denvi=foo
scala> sys.props.get("envi")
res0: Option[String] = None
scala> sys.props.getOrElse("envi", "default")
res1: String = default
Scala and sbt were installed using brew.
You have to put the system property before the command:
sbt -Denvi=foo console
otherwise it will be passed as an argument to the main class instead of to the JVM.
Alternatively, you can set the property in the JAVA_OPTS environment variable before starting sbt:
export JAVA_OPTS="-Denvi=foo"
sbt console
scala> sys.props.getOrElse("envi", "default")
res0: String = foo
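For the original flywayMigrate use case, once the property reaches the JVM it can drive the plugin settings in build.sbt. A hedged sketch, assuming the flyway-sbt plugin's flywayUrl setting and a made-up per-environment JDBC URL:

// build.sbt (sketch -- the URL pattern and database name are hypothetical)
lazy val envi = sys.props.getOrElse("envi", "default")

flywayUrl := s"jdbc:postgresql://localhost:5432/mydb_$envi"

Invoked as sbt -Denvi=foo flywayMigrate, the interpolated URL ends in mydb_foo; with the -D placed after the task name it silently falls back to mydb_default.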

Failed to load com.databricks.spark.csv while running with spark-submit

I am trying to run my code with spark-submit using the command below:
spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project_2.11-1.0.jar
And my sbt file has the below dependencies:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.5.2"
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.2.0"
My code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SQLContext

object SampleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Sample App").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._

    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/root/input/Account.csv", "header" -> "true"))
    val column_names = df.columns
    val row_count = df.count
    val column_count = column_names.length

    var pKeys = ArrayBuffer[String]()
    for (i <- column_names) {
      if (row_count == df.groupBy(i).count.count) {
        pKeys += df.groupBy(i).count.columns(0)
      }
    }
    pKeys.foreach(print)
  }
}
The error:
16/03/11 04:47:37 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1253)
My Spark version is 1.4.1 and my Scala version is 2.11.7.
(I am following this link: http://www.nodalpoint.com/development-and-deployment-of-spark-applications-with-scala-eclipse-and-sbt-part-1-installation-configuration/)
I have tried the below versions of spark-csv:
spark-csv_2.10 1.2.0
1.4.0
1.3.1
1.3.0
1.2.0
1.1.0
1.0.3
1.0.2
1.0.1
1.0.0
etc.
Please help!
Since you are running the job in local mode, add the external jar paths using the --jars option:
spark-submit --class "SampleApp" --master local[2] --jars file:[path-of-spark-csv_2.11.jar],file:[path-of-other-dependency-jar] target/scala-2.11/sample-project_2.11-1.0.jar
e.g.
spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar
Another thing you can do is create a fat jar. In sbt you can follow proper-way-to-make-a-spark-fat-jar-using-sbt, and in Maven refer to create-a-fat-jar-file-maven-assembly-plugin.
Note: mark the scope of Spark's own jars (i.e. spark-core, spark-streaming, spark-sql, etc.) as provided, otherwise the fat jar will become too fat to deploy.
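A minimal sketch of what that could look like in this project's build.sbt, assuming sbt-assembly builds the fat jar (the versions are aligned to Spark 1.4.1 here, which differs slightly from the original build and is only an assumption):

// build.sbt (sketch) -- Spark itself comes from spark-submit at runtime,
// so only spark-csv and its transitive dependencies end up in the assembly
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided",
  "com.databricks"   %% "spark-csv"  % "1.2.0"
)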
A better solution is to use the --packages option, like below:
spark-submit --class "SampleApp" --master local[2] --packages com.databricks:spark-csv_2.10:1.5.0 target/scala-2.11/sample-project_2.11-1.0.jar
Make sure that the --packages option precedes the application jar.
You have added the spark-csv library to your sbt config, which means you can compile your code against it,
but that still doesn't mean it's present at runtime (spark-sql and spark-core are there by default).
So try the --jars option of spark-submit to add the spark-csv jar to the runtime classpath, or build a fat jar (not sure how you are doing that with sbt).
You are using the Spark 1.3 syntax for loading a CSV file into a DataFrame.
If you check the spark-csv repository, you should use the following syntax on Spark 1.4 and higher:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
I was looking for an option where I could skip the --packages option and provide the dependency directly in the assembly jar. The reason I faced this exception was
sqlContext.read.format("csv"), which assumes Spark already knows a data source named csv. Instead, specify the full data source name with sqlContext.read.format("com.databricks.spark.csv") so it knows where to look and does not throw the exception.

Spark shell command lines

I am new to Spark and trying to figure out how I can use the Spark shell.
I looked into Spark's site documentation and it doesn't show how to create directories or how to see all my files in the Spark shell. If anyone could help me I would appreciate it.
In this context you can assume that the Spark shell is just a normal Scala REPL, so the same rules apply. You can get a list of the available commands using :help.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line> edit history
:help [command] print this summary or command-specific help
:history [num] show the history (optional num is commands to show)
:h? <string> search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v] show the implicits in scope
:javap <path|class> disassemble a file or class name
:line <id>|<line> place line(s) at the end of history
:load <path> interpret lines in a file
:paste [-raw] [path] enter paste mode or paste a file
:power enable power user mode
:quit exit the interpreter
:replay [options] reset the repl and replay all previous commands
:require <path> add a jar to the classpath
:reset [options] reset the repl to its initial state, forgetting all session entries
:save <path> save replayable session to a file
:sh <command line> run a shell command (result is implicitly => List[String])
:settings <options> update compiler options, if possible; see reset
:silent disable/enable automatic printing of results
:type [-v] <expr> display the type of an expression without evaluating it
:kind [-v] <expr> display the kind of expression's type
:warnings show the suppressed warnings from the most recent line which had any
As you can see above, you can invoke shell commands using :sh. For example:
scala> :sh mkdir foobar
res0: scala.tools.nsc.interpreter.ProcessResult = `mkdir foobar` (0 lines, exit 0)
scala> :sh touch foobar/foo
res1: scala.tools.nsc.interpreter.ProcessResult = `touch foobar/foo` (0 lines, exit 0)
scala> :sh touch foobar/bar
res2: scala.tools.nsc.interpreter.ProcessResult = `touch foobar/bar` (0 lines, exit 0)
scala> :sh ls foobar
res3: scala.tools.nsc.interpreter.ProcessResult = `ls foobar` (2 lines, exit 0)
scala> res3.line foreach println
line lines
scala> res3.lines foreach println
bar
foo
The :q or :quit command is used to exit your Scala REPL.
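Since the Spark shell is just a Scala REPL, plain scala.sys.process works as well if you prefer staying in Scala code instead of REPL commands; a small sketch (the directory name is only an example):

import sys.process._

"mkdir -p foobar".!              // run a shell command, returns its exit code
val listing = "ls foobar".!!     // run a command and capture its stdout as a String
listing.split("\n").foreach(println)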

Forking and ordering tests in Sbt 0.13.x

Here is how it was configured for Sbt 0.12.x:
parallelExecution in test := false

testGrouping in Test <<= definedTests in Test map { tests =>
  tests.map { test =>
    import Tests._
    import scala.collection.JavaConversions._
    new Group(
      name = test.name,
      tests = Seq(test),
      runPolicy = SubProcess(javaOptions = Seq(
        "-server", "-Xms4096m", "-Xms4096m", "-XX:NewSize=3584m",
        "-Xss256k", "-XX:+UseG1GC", "-XX:+TieredCompilation",
        "-XX:+UseNUMA", "-XX:+UseCondCardMark",
        "-XX:-UseBiasedLocking", "-XX:+AlwaysPreTouch") ++
        System.getProperties.toMap.map {
          case (k, v) => "-D" + k + "=" + v
        }))
  }.sortWith(_.name < _.name)
}
During migration to Sbt 0.13.x I get the following error:
[error] Could not accept connection from test agent: class java.net.SocketException: socket closed
java.net.SocketException: socket closed
at java.net.DualStackPlainSocketImpl.accept0(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:48)
at java.lang.Thread.run(Thread.java:745)
The migration changes are just updates to the sbt and plugin versions.
Are there any other approaches to forking and ordering tests in sbt 0.13.x that would overcome this exception?
It works fine on Linux and Mac OS.
I got the error on Windows because of the classpath length limit, which prevents the test agent instance from launching, with the following error in System.err:
Error: Could not find or load main class sbt.ForkMain
I also got this error when moving a Scala repo to sbt version sbt.version = 1.3.8 (previously 1.2.8 was fine). Strangely, it worked fine on my Mac but failed on TeamCity Linux build agents.
The fix for me was to set
fork := false,
in build.sbt.
I'm not sure why the repo previously had fork := true set (I guess it was copy/pasted from somewhere else, as there was no strong reason for it in this repo), but this change resolved the issue. Locally on my Mac it also runs a few seconds faster now.
See here for background:
https://www.scala-sbt.org/1.0/docs/Forking.html
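For the original question (keeping the per-test forking and ordering), here is a hedged sketch of the same grouping without the deprecated <<= operator, written against the sbt 1.x ForkOptions API and slash syntax, which also matches the 1.3.8 setup mentioned above; the trimmed JVM option list is only an example:

import Tests._

Test / parallelExecution := false

Test / testGrouping := (Test / definedTests).value.map { test =>
  Group(
    name = test.name,
    tests = Seq(test),
    // one forked JVM per test, sorted by name as in the original config
    runPolicy = SubProcess(
      ForkOptions().withRunJVMOptions(
        Vector("-server", "-Xms4096m", "-Xmx4096m", "-XX:+UseG1GC") ++
          sys.props.map { case (k, v) => s"-D$k=$v" }.toVector
      )
    )
  )
}.sortBy(_.name)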