I have created a simple project in GraphX. As soon as I try to run this test project, I get an AbstractMethodError. The error comes from inside the edgeListFile method and looks like something related to the logger, which I am not able to see. Please help.
Here is my .scala file:
object graphtest extends App {
  import org.apache.spark.graphx.{GraphLoader, VertexId}
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.master("local").appName("learning spark").getOrCreate
  val sc = spark.sparkContext
  val graph1 = GraphLoader.edgeListFile(sc, "E:\\code\\Cit-HepTh.txt")
  val res: (VertexId, Int) = graph1.inDegrees.reduce((a, b) => if (a._2 > b._2) a else b)
  graph1.edges.collect().take(10).foreach(println)
}
Here is my build.sbt file:
name := "myproject"
version := "0.1"
scalaVersion := "2.11.8"
mainClass in (Compile, packageBin) := Some("myproject.Processor")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.1",
"org.apache.spark" %% "spark-sql" % "2.3.1",
"org.scalatest" %% "scalatest" % "3.2.0-SNAP10" % Test,
"com.typesafe" % "config" % "1.3.1",
"org.apache.spark" %% "spark-mllib" % "2.0.1"
)
And finally, the complete stack trace of the failure:
Exception in thread "main" java.lang.AbstractMethodError
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
at org.apache.spark.graphx.GraphLoader$.initializeLogIfNecessary(GraphLoader.scala:28)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.graphx.GraphLoader$.log(GraphLoader.scala:28)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.graphx.GraphLoader$.logInfo(GraphLoader.scala:28)
at org.apache.spark.graphx.GraphLoader$.edgeListFile(GraphLoader.scala:96)
at aaa.graphtest$.delayedEndpoint$zettasense$graphtest$1(Test.scala:15)
at aaa.graphtest$delayedInit$body.apply(Test.scala:6)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at aaa.graphtest$.main(Test.scala:6)
at aaa.graphtest.main(Test.scala)
It was a library version mismatch. I updated spark-core, spark-sql, and spark-mllib to the latest version and it worked smoothly. Here is how my build.sbt looks now:
name := "myproject"
version := "0.1"
scalaVersion := "2.11.8"
mainClass in (Compile, packageBin) := Some("myproject.Processor")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.scalatest" %% "scalatest" % "3.2.0-SNAP10" % Test,
"com.typesafe" % "config" % "1.3.1",
"org.apache.spark" %% "spark-mllib" % "2.4.0"
)
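The mismatch most likely came from GraphX being on the classpath only transitively, through spark-mllib 2.0.1, next to spark-core 2.3.1, so the Logging trait that GraphLoader was compiled against no longer matched. A minimal sketch of pinning everything to a single Spark version (the sparkVersion value and the explicit spark-graphx entry are additions for illustration, not part of the original build):
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % sparkVersion,
  "org.apache.spark" %% "spark-sql"    % sparkVersion,
  "org.apache.spark" %% "spark-mllib"  % sparkVersion,
  // declare GraphX explicitly so it cannot drift to a different version transitively
  "org.apache.spark" %% "spark-graphx" % sparkVersion,
  "org.scalatest"    %% "scalatest"    % "3.2.0-SNAP10" % Test,
  "com.typesafe"     %  "config"       % "1.3.1"
)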
Related
The Spark documentation suggests using spark-hadoop-cloud to read/write from S3, see https://spark.apache.org/docs/latest/cloud-integration.html.
There is no Apache Spark published artifact for spark-hadoop-cloud. When trying to use the Cloudera-published module instead, the following exception occurs:
Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
This seems like a classpath conflict, so it looks like it is not possible to use spark-hadoop-cloud to read from S3 with the vanilla Apache Spark 3.1.2 jars. Here are my build.sbt and application code:
version := "0.0.1"
scalaVersion := "2.12.12"
lazy val app = (project in file("app")).settings(
  assemblyPackageScala / assembleArtifact := false,
  assembly / assemblyJarName := "uber.jar",
  assembly / mainClass := Some("com.example.Main"),
  // more settings here ...
)
resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.1.3.1.7270.0-253"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.7.2.7.0-184"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.1.7.2.7.0-184"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.21.3" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local")
      //.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
      //.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
      .appName("spark session").getOrCreate
    val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
    val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
    jsonDF.show()
    csvDF.show()
  }
}
To read and write to S3 from Spark you only need these 2 dependencies:
"org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion
Make sure hadoopVersion is the same version used by your worker nodes, and make sure your worker nodes also have these dependencies available. The rest of your code looks correct.
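As a rough sketch of the Spark side (the explicit credential settings below are assumptions; in practice the default AWS credential provider chain, e.g. environment variables or an instance profile, also works without them):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("s3a example")
  // any fs.s3a.* Hadoop setting can be passed by prefixing it with "spark.hadoop."
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
  .getOrCreate()

val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")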
I am trying to build a jar using sbt package.
build.sbt:
name := "Simple Project"
version := "0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.3.2"
val connectorVersion = "2.3.0"
val cassandraVersion = "3.11"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"org.scalaj" %% "scalaj-http" % "2.4.2",
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion
)
sbt package runs successfully but does not add spark-cassandra-connector and scalaj-http to the final jar it creates.
Do I need to add anything?
If you want the jar to contain all your dependencies, you have to use the sbt-assembly plugin:
https://github.com/sbt/sbt-assembly
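A minimal sketch of the setup (the plugin version below is an assumption; pick one that matches your sbt release):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Then run sbt assembly instead of sbt package. The connector and scalaj-http get bundled into the fat jar, while the dependencies marked "provided" (spark-core, spark-sql, spark-hive) stay out of it, which is what you want when submitting to a cluster.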
I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched on Stack Overflow but am still stuck on this issue. Maybe something is missing or misconfigured.
Running commands:
me#Mycomputer:~/spark-2.1.0$ bin/spark-submit --class ETLApp /home/me/src/etl/target/scala-2.10/etl-assembly-0.1.0.jar
Output:
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
build.sbt:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
mainClass := Some("ETLApp")
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2";
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.6.0"
libraryDependencies += "org.apache.curator" % "curator-test" % "2.6.0"
libraryDependencies += "args4j" % "args4j" % "2.32"
java -version
java version "1.8.0_101"
scala -version
2.10.5
spark version
2.1.0
Any hints welcome. Thanks.
In that case, your jar must bring all dependent classes along when it is submitted to Spark.
In Maven this would be possible with the assembly plugin and the jar-with-dependencies descriptor. With sbt, a quick Google search turns up this: https://github.com/sbt/sbt-assembly
You can change your build.sbt as follows:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
scalacOptions ++= Seq("-deprecation",
"-feature",
"-Xfuture",
"-encoding",
"UTF-8",
"-unchecked",
"-language:postfixOps")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-sql" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming-kafka" % "1.5.2" % Provided,
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2",
"org.apache.curator" % "curator-recipes" % "2.6.0",
"org.apache.curator" % "curator-test" % "2.6.0",
"args4j" % "args4j" % "2.32")
mainClass in assembly := Some("your.package.name.ETLApp")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf")        => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$")    => MergeStrategy.discard
  case "reference.conf"                                  => MergeStrategy.concat
  case x: String if x.contains("UnusedStubClass.class")  => MergeStrategy.first
  case _                                                 => MergeStrategy.first
}
Add the sbt-assembly plugin to your plugins.sbt file under the project directory in your project's root directory. Running sbt assembly in the terminal (Linux) or CMD (Windows) in the root directory of your project will download all the dependencies for you and create an uber jar.
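With the build settings above, the workflow would look roughly like this (the exact output path is an assumption; it depends on your Scala version and target directory):
sbt assembly
spark-submit --class your.package.name.ETLApp target/scala-2.10/etl-0.1.0.jar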
Hello, I am trying to pull in spark-core, spark-streaming, twitter4j, and spark-streaming-twitter with the build.sbt file below:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
libraryDependencies ++= Seq(
"org.twitter4j" % "twitter4j-core" % "3.0.3",
"org.twitter4j" % "twitter4j-stream" % "3.0.3"
)
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
I simply took these libraryDependencies from online examples, so I am not sure which versions, etc. to use.
Can someone please explain how I should fix this .sbt file? I spent a couple of hours trying to figure it out, but none of the suggestions worked. I installed Scala through Homebrew and I am on version 2.11.8.
All of my errors were about:
Modules were resolved with conflicting cross-version suffixes.
The problem is that you are mixing Scala 2.11 and 2.10 artifacts. You have:
scalaVersion := "2.11.8"
And then:
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
where the _2.10 artifact is explicitly requested. You are also mixing Spark versions instead of using a consistent one:
// spark 1.6.1
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
// spark 1.4.1
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
// spark 0.9.0-incubating
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
Here is a build.sbt that fixes both problems:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-twitter" % sparkVersion
)
You also don't need to manually add twitter4j dependencies since they are added transitively by spark-streaming-twitter.
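To illustrate the cross-version point: with scalaVersion := "2.11.8", the %% operator appends the Scala binary version to the artifact name for you, so the two lines below resolve to the same artifact; hard-coding _2.10 instead is what triggers the "conflicting cross-version suffixes" error.
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.6.1"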
It works for me:
name := "spark_local"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.twitter4j" % "twitter4j-core" % "3.0.5",
"org.twitter4j" % "twitter4j-stream" % "3.0.5",
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"org.apache.spark" %% "spark-mllib" % "2.0.0",
"org.apache.spark" %% "spark-streaming" % "2.0.0"
)
I'm using this template to develop a microservice:
http://www.typesafe.com/activator/template/activator-service-container-tutorial
My sbt file is like this:
import sbt._
import Keys._
name := "activator-service-container-tutorial"
version := "1.0.1"
scalaVersion := "2.11.6"
crossScalaVersions := Seq("2.10.5", "2.11.6")
resolvers += "Scalaz Bintray Repo" at "https://dl.bintray.com/scalaz/releases"
libraryDependencies ++= {
  val containerVersion = "1.0.1"
  val configVersion = "1.2.1"
  val akkaVersion = "2.3.9"
  val liftVersion = "2.6.2"
  val sprayVersion = "1.3.3"
  Seq(
    "com.github.vonnagy" %% "service-container" % containerVersion,
    "com.github.vonnagy" %% "service-container-metrics-reporting" % containerVersion,
    "com.typesafe" % "config" % configVersion,
    "com.typesafe.akka" %% "akka-actor" % akkaVersion exclude ("org.scala-lang", "scala-library"),
    "com.typesafe.akka" %% "akka-slf4j" % akkaVersion exclude ("org.slf4j", "slf4j-api") exclude ("org.scala-lang", "scala-library"),
    "ch.qos.logback" % "logback-classic" % "1.1.3",
    "io.spray" %% "spray-can" % sprayVersion,
    "io.spray" %% "spray-routing" % sprayVersion,
    "net.liftweb" %% "lift-json" % liftVersion,
    "com.typesafe.akka" %% "akka-testkit" % akkaVersion % "test",
    "io.spray" %% "spray-testkit" % sprayVersion % "test",
    "junit" % "junit" % "4.12" % "test",
    "org.scalaz.stream" %% "scalaz-stream" % "0.7a" % "test",
    "org.specs2" %% "specs2-core" % "3.5" % "test",
    "org.specs2" %% "specs2-mock" % "3.5" % "test",
    "com.twitter" %% "finagle-http" % "6.25.0",
    "com.twitter" %% "bijection-util" % "0.7.2"
  )
}
scalacOptions ++= Seq(
"-unchecked",
"-deprecation",
"-Xlint",
"-Ywarn-dead-code",
"-language:_",
"-target:jvm-1.7",
"-encoding", "UTF-8"
)
crossPaths := false
parallelExecution in Test := false
assemblyJarName in assembly := "santo.jar"
mainClass in assembly := Some("Service")
The project compiles fine!
But when I run assembly, the terminal shows me this:
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] /path/.ivy2/cache/io.dropwizard.metrics/metrics-core/bundles/metrics-core-3.1.1.jar:com/codahale/metrics/ConsoleReporter$1.class
[error] /path/.ivy2/cache/com.codahale.metrics/metrics-core/bundles/metrics-core-3.0.1.jar:com/codahale/metrics/ConsoleReporter$1.class
What options do I have to fix it?
Thanks
The issue, as it seems, is that transitive dependencies of your dependencies are pulling in two different versions of metrics-core. The best thing to do would be to pick the right library dependencies so that you end up with a single version of this library. Use https://github.com/jrudolph/sbt-dependency-graph if it is difficult to figure out where the conflicting versions come from.
If it is not possible to get to a single version, then you would most likely have to go down the exclude route. I assume this only works if there is compatibility between all the required versions.
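As a rough sketch of the exclude route, you could change the existing service-container entry to something like the line below (which module actually drags in the old com.codahale.metrics artifact is an assumption here, so check the dependency graph first):
// hypothetical: keep only io.dropwizard.metrics by excluding the legacy coordinates
"com.github.vonnagy" %% "service-container" % containerVersion exclude("com.codahale.metrics", "metrics-core"),
Alternatively, if the two class files are known to be binary compatible, an assemblyMergeStrategy case that maps com/codahale/metrics paths to MergeStrategy.first would unblock the assembly, at the risk of hiding a real incompatibility.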