Scala - Spark-corenlp - java.lang.NoClassDefFoundError - scala

I want to run the spark-coreNLP example, but I get a java.lang.NoClassDefFoundError when running spark-submit.
Here is the Scala code from the GitHub example, which I put into an object with a SparkContext and SQLContext defined:
main.scala.Sentiment.scala:
package main.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import com.databricks.spark.corenlp.functions._
object SQLContextSingleton {
@transient private var instance: SQLContext = _
def getInstance(sparkContext: SparkContext): SQLContext = {
if (instance == null) {
instance = new SQLContext(sparkContext)
}
instance
}
}
object Sentiment {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Sentiment")
val sc = new SparkContext(conf)
val sqlContext = SQLContextSingleton.getInstance(sc)
import sqlContext.implicits._
val input = Seq((1, "<xml>Stanford University is located in California. It is a great university.</xml>")).toDF("id", "text")
val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
output.show(truncate = false)
}
}
And my build.sbt (modified from here)
version := "1.0"
scalaVersion := "2.10.6"
scalaSource in Compile := baseDirectory.value / "src"
initialize := {
val _ = initialize.value
val required = VersionNumber("1.8")
val current = VersionNumber(sys.props("java.specification.version"))
assert(VersionNumber.Strict.isCompatible(current, required), s"Java $required required.")
}
sparkVersion := "1.5.2"
// change the value below to change the directory where your zip artifact will be created
spDistDirectory := target.value
sparkComponents += "mllib"
// add any sparkPackageDependencies using sparkPackageDependencies.
// e.g. sparkPackageDependencies += "databricks/spark-avro:0.1"
spName := "databricks/spark-corenlp"
licenses := Seq("GPL-3.0" -> url("http://opensource.org/licenses/GPL-3.0"))
resolvers += Resolver.mavenLocal
libraryDependencies ++= Seq(
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
"com.google.protobuf" % "protobuf-java" % "2.6.1"
)
I run sbt package without issue, then run Spark with
spark-submit --class "main.scala.Sentiment" --master local[4] target/scala-2.10/sentimentanalizer_2.10-1.0.jar
The program fails after throwing an exception:
Exception in thread "main" java.lang.NoClassDefFoundError: edu/stanford/nlp/simple/Sentence
at main.scala.com.databricks.spark.corenlp.functions$$anonfun$cleanxml$1.apply(functions.scala:55)
at main.scala.com.databricks.spark.corenlp.functions$$anonfun$cleanxml$1.apply(functions.scala:54)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
Things I tried:
I work with Eclipse for Scala, and I made sure to add all the jars from stanford-corenlp as suggested here
./stanford-corenlp/ejml-0.23.jar
./stanford-corenlp/javax.json-api-1.0-sources.jar
./stanford-corenlp/javax.json.jar
./stanford-corenlp/joda-time-2.9-sources.jar
./stanford-corenlp/joda-time.jar
./stanford-corenlp/jollyday-0.4.7-sources.jar
./stanford-corenlp/jollyday.jar
./stanford-corenlp/protobuf.jar
./stanford-corenlp/slf4j-api.jar
./stanford-corenlp/slf4j-simple.jar
./stanford-corenlp/stanford-corenlp-3.6.0-javadoc.jar
./stanford-corenlp/stanford-corenlp-3.6.0-models.jar
./stanford-corenlp/stanford-corenlp-3.6.0-sources.jar
./stanford-corenlp/stanford-corenlp-3.6.0.jar
./stanford-corenlp/xom-1.2.10-src.jar
./stanford-corenlp/xom.jar
I suspect that I need to add something to my command line when submitting the job to Spark, any thoughts?

I was on the right track: my command line was missing something.
spark-submit needs all the stanford-corenlp jars passed via --jars:
spark-submit \
--jars $(echo stanford-corenlp/*.jar | tr ' ' ',') \
--class "main.scala.Sentiment" \
--master local[4] target/scala-2.10/sentimentanalizer_2.10-1.0.jar
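For context on why this works: sbt package puts only the project's own compiled classes into the jar, so the CoreNLP classes that were visible at compile time are absent at run time unless they are supplied with --jars (or bundled into a fat jar). Below is a minimal sketch of a check that could be run inside the submitted job to see whether a class is visible. The object name ClasspathCheck and its helper are hypothetical, not part of spark-corenlp.

```scala
object ClasspathCheck {
  // Returns true if the named class can be located by this object's class loader.
  // initialize = false: we only want to know whether the class is visible,
  // not to run its static initializers.
  def isLoadable(className: String): Boolean =
    try {
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
      case _: LinkageError           => false // e.g. NoClassDefFoundError
    }

  def main(args: Array[String]): Unit = {
    // The class named in the stack trace of the question.
    val wanted = "edu.stanford.nlp.simple.Sentence"
    val status = if (isLoadable(wanted)) "found" else "MISSING from classpath"
    println(wanted + ": " + status)
  }
}
```

If this reports the class as missing, the jar that contains it was not passed to spark-submit.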

Related

Exception in thread "main" java.lang.NoSuchMethodError in ubuntu only

I have the code below:
import java.io.File
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object RDFBenchVerticalPartionedTables {
def main(args: Array[String]): Unit = {
println("Start of programm .... ")
val conf = new SparkConf().setMaster("local").setAppName("SQLSPARK")
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println("Conf and SC declared... ")
val spark = SparkSession
.builder()
.master("local[*]")
.appName("SparkConversionSingleTable")
.getOrCreate()
println("SparkSession declared... ")
println("Before Agrs..... ")
val filePathCSV=args(0)
val filePathAVRO=args(1)
val filePathORC=args(2)
val filePathParquet=args(3)
println("After Agrs..... ")
val csvFiles = new File(filePathCSV).list()
println("After List of Files Agrs..... " + csvFiles.length )
println("Before the foreach ... ")
csvFiles.foreach{verticalTableName=>
println("inside the foreach ... ")
val verticalTableName2=verticalTableName.dropRight(4)
val RDFVerticalTableDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(filePathCSV+"/"+verticalTableName).toDF()
RDFVerticalTableDF.write.format("com.databricks.spark.avro").save(filePathAVRO+"/"+verticalTableName2+".avro")
RDFVerticalTableDF.write.parquet(filePathParquet+"/"+verticalTableName2+".parquet")
RDFVerticalTableDF.write.orc(filePathORC+"/"+verticalTableName2+".orc")
println("Vertical Table: '" +verticalTableName2+"' Has been Successfully Converted to AVRO, PARQUET and ORC !")
}
}
}
This class converts the CSV files in the directory given as args(0) and saves them in three formats (Avro, ORC and Parquet) in the directories given as args(1), args(2) and args(3).
When I submit this job with spark-submit on Windows it works, but running the same job on Ubuntu fails with this error:
ubuntu@ragab:~$ spark-submit --class RDFBenchVerticalPartionedTables --master local[*] /home/ubuntu/testjar/rdfschemaconversion_2.11-0.1.jar "/data/RDFBench4/VerticalPartionnedTables/VerticalPartitionedTables100" "/data/RDFBench3/ConvertedData/SP2Bench100/AVRO/VerticalTables" "/data/RDFBench3/ConvertedData/SP2Bench100/ORC/VerticalTables" "/data/RDFBench3/ConvertedData/SP2Bench100/Parquet"
19/05/04 18:10:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Start of programm ....
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Conf and SC declared...
SparkSession declared...
Before Agrs.....
After Agrs.....
After List of Files Agrs..... 25
Before the foreach ...
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at RDFBenchVerticalPartionedTables$.main(RDFBenchVerticalPartionedTables.scala:45)
at RDFBenchVerticalPartionedTables.main(RDFBenchVerticalPartionedTables.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This is my sbt file:
name := "RDFSchemaConversion"
version := "0.1"
scalaVersion := "2.11.12"
mainClass in (Compile, run) := Some("RDFBenchVerticalPartionedTables")
mainClass in (Compile, packageBin) := Some("RDFBenchVerticalPartionedTables")
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.typesafe" % "config" % "1.3.1"
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
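Your Spark distribution on Ubuntu seems to have been built with Scala 2.12, which is binary-incompatible with your jar file, compiled with Scala 2.11. A NoSuchMethodError on a scala.Predef method such as refArrayOps is a typical symptom of this kind of mismatch.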
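To confirm a mismatch like this, spark-submit --version prints the Scala version the distribution was built with. As a rough sketch, the same information can be read from inside the JVM through scala.util.Properties.versionNumberString. The object and helper names below are made up for illustration.

```scala
object ScalaVersionCheck {
  // Binary version = first two components, e.g. "2.11" for "2.11.12".
  // Deliberately sticks to java.lang.String methods to keep the check simple.
  def binaryVersion(full: String): String = {
    val firstDot = full.indexOf('.')
    val secondDot = full.indexOf('.', firstDot + 1)
    if (firstDot < 0 || secondDot < 0) full else full.substring(0, secondDot)
  }

  def main(args: Array[String]): Unit = {
    // Version of the scala-library jar that is actually on the runtime classpath.
    val running = scala.util.Properties.versionNumberString
    println("Scala library at runtime: " + running +
      " (binary " + binaryVersion(running) + ")")
  }
}
```

The binary version printed here should match the suffix of the application jar (the _2.11 in rdfschemaconversion_2.11-0.1.jar) and the scalaVersion in the sbt file.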

Java Class not Found Exception while doing Spark-submit Scala using sbt

Here is the code I wrote in Scala:
package normalisation
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs.{FileSystem,Path}
object Seasonality {
val amplitude_list_c1: Array[Nothing] = Array()
val amplitude_list_c2: Array[Nothing] = Array()
def main(args: Array[String]){
val conf = new SparkConf().setAppName("Normalization")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val line = "MP"
val ps = "Test"
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
val files = FileSystem.get(sc.hadoopConfiguration ).listStatus(new Path(location))
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
}
The error I received when compiling with sbt package is here: [image]
Here is my build.sbt file
name := "OV"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
In Spark 2.x and later you should generally use SparkSession.
See https://spark.apache.org/docs/2.3.1/api/scala/#org.apache.spark.sql.SparkSession
Then you should be able to do
val spark: SparkSession = ???
// s-interpolation: String.format uses %s-style specifiers and would leave {0}/{1} unchanged
val location = s"hdfs://ipaddress/user/hdfs/$line/ps/$ps/FS/2018-10-17"
spark.read.json(location)
to read all the JSON files in the directory.
I think you would also get another compile error at
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
because ps_data is out of scope there.
If you need to use SparkContext for some reason, it should indeed be in spark-core. Have you tried restarting your IDE, clearing caches, etc.?
EDIT: I just noticed that build.sbt is probably not in the directory where you call sbt package from, so sbt won't pick it up.
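A side note on the path in the question: Scala's format on strings follows java.util.Formatter, so {0} and {1} are not treated as placeholders and the extra arguments are ignored. The sketch below compares three ways of building the path. The object name PathFormatDemo is hypothetical; the values "MP" and "Test" are the ones from the question.

```scala
import java.text.MessageFormat

object PathFormatDemo {
  val template = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17"

  // %s-style formatter: the {0}/{1} markers are left untouched.
  def viaFormat(line: String, ps: String): String =
    template.format(line, ps)

  // String interpolation: the usual Scala way to splice values into a string.
  def viaInterpolation(line: String, ps: String): String =
    s"hdfs://ipaddress/user/hdfs/$line/ps/$ps/FS/2018-10-17"

  // MessageFormat is the JDK class that does understand {0}/{1}.
  def viaMessageFormat(line: String, ps: String): String =
    MessageFormat.format(template, line, ps)

  def main(args: Array[String]): Unit = {
    println(viaFormat("MP", "Test"))
    println(viaInterpolation("MP", "Test"))
    println(viaMessageFormat("MP", "Test"))
  }
}
```

Only the last two produce hdfs://ipaddress/user/hdfs/MP/ps/Test/FS/2018-10-17; the first returns the template unchanged.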

Setup Scala and Apache Spark with SBT in Intellij

I am trying to run a Spark Scala project in IntelliJ IDEA on a Windows 10 machine.
My build.sbt:
name := "SbtIntellSpark1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
project/build.properties:
sbt.version = 1.0.3
Main.scala:
package example
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
object Main {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
val session = SparkSession
.builder()
.appName("StackOverflowSurvey")
.master("local[1]")
.getOrCreate()
val df = session.read
val responses = df
.option("header", true)
.option("inferSchema", true)
.csv("2016-stack-overflow-survey-responses.csv")
responses.printSchema()
}
}
The code runs perfectly (the schema is properly printed) when I run the Main object as shown in the following image:
My Run Configuration is as follows:
The problem is that when I run "Run the program", it shows a huge stack of errors, too large to show here. Please see this gist.
How can I solve this issue?

SBT package scala script

I am trying to use spark submit with a scala script, but first I need to create my package.
Here is my sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
When I try sbt package, I am getting these errors:
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:3: object functions is not a member of package org.apache.spark.sql
import org.apache.spark.sql.functions._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:4: object types is not a member of package org.apache.spark.sql
import org.apache.spark.sql.types._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:25: not found: value sc
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:30: not found: value sqlContext
val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:36: not found: value udf
val sqlfunc = udf(coder)
^
5 errors found
(compile:compileIncremental) Compilation failed
Has anyone faced these errors?
Thanks for helping.
Regards
Majid
You are trying to use the object org.apache.spark.sql.functions and the package org.apache.spark.sql.types. According to the functions documentation, it is available starting from version 1.3.0, and the types package is available since version 1.3.1.
Solution: update the SBT file to:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"
The other errors ("not found: value sc", "not found: value sqlContext", "not found: value udf") are caused by missing variables in your XML_Script_SBT.scala file. I can't solve them without looking at the source code.
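One further point worth checking: the build in the question mixes spark-core 1.5.2 with a different spark-sql version, and Spark modules are normally kept at the same version. Also, sqlContext.read, which the failing line uses, was added in Spark 1.4.0, so a 1.3.x spark-sql would not provide it. A sketch of the dependency lines with a shared version, assuming 1.5.2 (taken from the spark-core line in the question) is the version of the cluster; the val name sparkVersion is just illustrative:

```scala
// build.sbt (sketch): one shared version for every Spark module.
// Assumption: 1.5.2 matches the Spark installation the job is submitted to.
val sparkVersion = "1.5.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)
```

Keeping the version in one val means the two modules cannot drift apart when one line is edited.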
Thanks Sergey, your correction fixes 3 of the errors. Below is my script:
object SimpleApp {
def main(args: Array[String]) {
val today = Calendar.getInstance.getTime
val curTimeFormat = new SimpleDateFormat("yyyyMMdd-HHmmss")
val time = curTimeFormat.format(today)
val destination = "/3.Data/3.Code_Check_Processed/2.XML/" + time + ".extensive.csv"
val source = "/3.Data/2.Code_Check_Raw/2.XML/Extensive/"
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val iter = hdfs.listLocatedStatus(new Path(source))
val uri = iter.next.getPath.toUri
val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
val df2 = df.selectExpr("explode(check) as e").select("e.#VALUE","e.additionalInformation1","e.columnNumber","e.context","e.errorType","e.filePath","e.lineNumber","e.message","e.severity")
val coder: (Long => String) = (arg: Long) => {if (arg > -1) time else "nada"}
val sqlfunc = udf(coder)
val df3 = df2.withColumn("TimeStamp", sqlfunc(col("columnNumber")))
df3.write.format("com.databricks.spark.csv").option("header", "false").save(destination)
hdfs.delete(new Path(uri.toString()), true)
sys.exit(0)
}
}

Scala - spark-corenlp - java.lang.ClassNotFoundException

I want to run the spark-coreNLP example, but I get a java.lang.ClassNotFoundException when running spark-submit.
Here is the scala code, from the github example, which I put into an object, and defined a SparkContext.
analyzer.Sentiment.scala:
package analyzer
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._
object Sentiment {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Sentiment")
val sc = new SparkContext(conf)
val input = Seq(
(1, "<xml>Stanford University is located in California. It is a great university.</xml>")
).toDF("id", "text")
val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
output.show(truncate = false)
}
}
I am using the build.sbt provided by spark-coreNLP; I only changed scalaVersion and sparkVersion to my own.
version := "1.0"
scalaVersion := "2.11.8"
initialize := {
val _ = initialize.value
val required = VersionNumber("1.8")
val current = VersionNumber(sys.props("java.specification.version"))
assert(VersionNumber.Strict.isCompatible(current, required), s"Java $required required.")
}
sparkVersion := "1.5.2"
// change the value below to change the directory where your zip artifact will be created
spDistDirectory := target.value
sparkComponents += "mllib"
spName := "databricks/spark-corenlp"
licenses := Seq("GPL-3.0" -> url("http://opensource.org/licenses/GPL-3.0"))
resolvers += Resolver.mavenLocal
libraryDependencies ++= Seq(
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
"com.google.protobuf" % "protobuf-java" % "2.6.1"
)
Then I created my jar without issues by running:
sbt package
Finally, I submit my job to Spark:
spark-submit --class "analyzer.Sentiment" --master local[4] target/scala-2.11/sentimentanalizer_2.11-0.1-SNAPSHOT.jar
But I get the following error:
java.lang.ClassNotFoundException: analyzer.Sentiment
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:641)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My file Sentiment.scala is correctly located in a package named "analyzer".
$ find .
./src
./src/analyzer
./src/analyzer/Sentiment.scala
./src/com
./src/com/databricks
./src/com/databricks/spark
./src/com/databricks/spark/corenlp
./src/com/databricks/spark/corenlp/CoreNLP.scala
./src/com/databricks/spark/corenlp/functions.scala
./src/com/databricks/spark/corenlp/StanfordCoreNLPWrapper.scala
When I ran the SimpleApp example from the Spark Quick Start, I noticed that MySimpleProject/bin/ contained a SimpleApp.class, whereas MySentimentProject/bin is empty. So I have tried to clean my project (I am using Eclipse for Scala).
I think I need to generate Sentiment.class, but I don't know how to do it. It was done automatically with SimpleApp.scala, and when I try to run/build with Eclipse Scala, it crashes.
Maybe you should try adding
scalaSource in Compile := baseDirectory.value / "src"
to your build.sbt, because the sbt documentation says that "the directory that contains the main Scala sources is by default src/main/scala".
Or just arrange your source code in this structure:
$ find .
./src
./src/main
./src/main/scala
./src/main/scala/analyzer
./src/main/scala/analyzer/Sentiment.scala
./src/main/scala/com
./src/main/scala/com/databricks
./src/main/scala/com/databricks/spark
./src/main/scala/com/databricks/spark/corenlp
./src/main/scala/com/databricks/spark/corenlp/CoreNLP.scala
./src/main/scala/com/databricks/spark/corenlp/functions.scala
./src/main/scala/com/databricks/spark/corenlp/StanfordCoreNLPWrapper.scala
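With the sources directly under src/, sbt finds nothing to compile in its default location, so sbt package can succeed while producing a jar that lacks analyzer/Sentiment.class. Running jar tf on the jar shows its contents; the sketch below does the same check programmatically with java.util.jar.JarFile. The object name JarContains and its helpers are hypothetical.

```scala
import java.util.jar.JarFile

object JarContains {
  // Maps "analyzer.Sentiment" to the jar entry name "analyzer/Sentiment.class".
  def entryNameFor(className: String): String =
    className.replace('.', '/') + ".class"

  // True if the jar at jarPath has an entry for the given class.
  def containsClass(jarPath: String, className: String): Boolean = {
    val jar = new JarFile(jarPath)
    try jar.getEntry(entryNameFor(className)) != null
    finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      println("usage: JarContains <path-to-jar> <fully.qualified.ClassName>")
    } else {
      val found = containsClass(args(0), args(1))
      println(args(1) + (if (found) " is in " else " is NOT in ") + args(0))
    }
  }
}
```

For the question above, the arguments would be the jar under target/scala-2.11/ and analyzer.Sentiment.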