Exception while trying to run Spark app with JSON parsing

I have a simple Spark application written in Scala and built with SBT. First I tried the following:
Run sbt clean package
Run spark-submit --class Main ./target/scala-2.11/sparktest_2.11-1.0.jar
but it fails with the following exception:
Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/module/scala/DefaultScalaModule$
Then I tried the assembly plugin for SBT, but I got the following exception instead:
java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.introspect.POJOPropertyBuilder.addField(Lcom/fasterxml/jackson/databind/introspect/AnnotatedField;Lcom/fasterxml/jackson/databind/PropertyName;ZZZ)V
As far as I can see, both errors point at the Jackson library and its Scala module. Could this be a conflict between library versions?
My build.sbt looks like this:
name := "SparkTest"
version := "1.0"
scalaVersion := "2.11.4"
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8", "-feature")
libraryDependencies ++= {
Seq(
"org.apache.spark" %% "spark-core" % "1.2.1" % "provided",
"com.fasterxml.jackson.core" % "jackson-core" % "2.4.1",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.1",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.1"
)
}
And my application code is simply this:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.{SparkConf, SparkContext}
trait JsonUtil {
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
}
case class Person(name: String)
object Main extends JsonUtil {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Spark Test App")
val sc = new SparkContext(conf)
val inputFile = "/home/user/data/person.json"
val input = sc.textFile(inputFile)
val persons = input.flatMap { line ⇒ {
try {
println(s" [DEBUG] trying to parse '$line'")
Some(mapper.readValue(line, classOf[Person]))
} catch {
case e : Exception ⇒
println(s" [EXCEPTION] ${e.getMessage}")
None
}
}}
println("PERSON LIST:")
for (p ← persons) {
println(s" $p")
}
println("END")
}
}
EDIT: the problem seems to be specific to the Spark application. If I run a simple standalone application that only tests JSON unmarshalling, everything works. But if I do the same from the Spark application, the problem appears as described above. Any ideas?
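Since the standalone test works and only the Spark run fails, one way to check the version-conflict theory is to print where a class is actually loaded from at runtime. The following is a minimal diagnostic sketch, not a fix; the object name WhichJar is made up for illustration, and the same locationOf call could be placed inside the Spark job with classOf[ObjectMapper] to see which Jackson jar wins on the Spark classpath.

```scala
object WhichJar {
  // Returns the location (usually a jar path) a class was loaded from,
  // or None for classes without a code source (e.g. JDK bootstrap classes).
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString

  def main(args: Array[String]): Unit = {
    // Pass fully qualified class names as arguments, for example
    // com.fasterxml.jackson.databind.ObjectMapper
    val names = if (args.nonEmpty) args.toSeq else Seq("scala.Option")
    names.foreach { name =>
      val where =
        try {
          // initialize = false: only locate the class, do not run its static initializers
          val cls = Class.forName(name, false, getClass.getClassLoader)
          locationOf(cls).getOrElse("<no code source>")
        } catch {
          case _: ClassNotFoundException => "<not on classpath>"
        }
      println(s"$name -> $where")
    }
  }
}
```

If the printed jar is not the Jackson version declared in build.sbt, the two exceptions above would be consistent with a mixed classpath.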

Related

Scala Flink get java.lang.NoClassDefFoundError: scala/Product$class after using case class for customized DeserializationSchema

It works fine when I use a regular (non-case) class.
But I get a java.lang.NoClassDefFoundError: scala/Product$class error after changing the class to a case class.
I'm not sure whether this is an sbt packaging problem or a code problem.
I'm using:
sbt
scala: 2.11.12
java: 8
sbt assembly to package
package example
import java.util.Properties
import java.nio.charset.StandardCharsets
import org.apache.flink.api.scala._
import org.apache.flink.streaming.util.serialization.{DeserializationSchema, SerializationSchema}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.api.common.typeinfo.TypeInformation
import Config._
case class Record(
id: String,
startTime: Long
) {}
class RecordDeSerializer extends DeserializationSchema[Record] with SerializationSchema[Record] {
override def serialize(record: Record): Array[Byte] = {
return "123".getBytes(StandardCharsets.UTF_8)
}
override def deserialize(b: Array[Byte]): Record = {
Record("1", 123)
}
override def isEndOfStream(record: Record): Boolean = false
override def getProducedType: TypeInformation[Record] = {
createTypeInformation[Record]
}
}
object RecordConsumer {
def main(args: Array[String]): Unit = {
val config : Properties = {
var p = new Properties()
p.setProperty("zookeeper.connect", Config.KafkaZookeeperServers)
p.setProperty("bootstrap.servers", Config.KafkaBootstrapServers)
p.setProperty("group.id", Config.KafkaGroupID)
p
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(1000)
var consumer = new FlinkKafkaConsumer[Record](
Config.KafkaTopic,
new RecordDeSerializer(),
config
)
consumer.setStartFromEarliest()
val stream = env.addSource(consumer).print
env.execute("record consumer")
}
}
Error
2020-08-05 04:07:33,963 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 1670 of job 4de8831901fa72790d0a9a973cc17dde.
java.lang.NoClassDefFoundError: scala/Product$class
...
build.sbt
My first idea was that a version might be wrong, but everything works fine if I use a regular class.
Here is build.sbt:
ThisBuild / resolvers ++= Seq(
"Apache Development Snapshot Repository" at "https://repository.apache.org/content/repositories/snapshots/",
Resolver.mavenLocal
)
name := "deedee"
version := "0.1-SNAPSHOT"
organization := "dexterlab"
ThisBuild / scalaVersion := "2.11.8"
val flinkVersion = "1.8.2"
val flinkDependencies = Seq(
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-java" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka" % flinkVersion,
)
val thirdPartyDependencies = Seq(
"com.github.nscala-time" %% "nscala-time" % "2.24.0",
"com.typesafe.play" %% "play-json" % "2.6.14",
)
lazy val root = (project in file(".")).
settings(
libraryDependencies ++= flinkDependencies,
libraryDependencies ++= thirdPartyDependencies,
libraryDependencies += "org.scala-lang" % "scala-compiler" % scalaVersion.value,
)
assembly / mainClass := Some("dexterlab.TelecoDataConsumer")
// make run command include the provided dependencies
Compile / run := Defaults.runTask(Compile / fullClasspath,
Compile / run / mainClass,
Compile / run / runner
).evaluated
// stays inside the sbt console when we press "ctrl-c" while a Flink programme executes with "run" or "runMain"
Compile / run / fork := true
Global / cancelable := true
// exclude Scala library from assembly
assembly / assemblyOption := (assembly / assemblyOption).value.copy(includeScala = false)
autoCompilerPlugins := true

It finally succeeded after I added this line to build.sbt:
assembly / assemblyOption := (assembly / assemblyOption).value.copy(includeScala = true)
This includes the Scala library in the jar when running sbt assembly.
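A likely reason this helps: scala/Product$class only exists in the Scala 2.11 (and earlier) standard library, so bundling the scala-library the code was compiled against makes that class available even if the runtime environment ships a different Scala version. As a hedged side note, on sbt-assembly 1.0 and later the copy(...) form is, as far as I know, replaced by a builder-style method, so the equivalent setting would look roughly like this (assumption: your plugin version is 1.x or newer):

```scala
// build.sbt, assuming sbt-assembly 1.0 or later
// Bundle scala-library into the fat jar produced by `sbt assembly`.
assembly / assemblyOption ~= { _.withIncludeScala(true) }
```

Only one of the two forms should be present, and any earlier includeScala = false setting has to be removed or it may override this one depending on order.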

PlaySpec not found in IntelliJ

Below is a Scala test of websocket:
import java.util.function.Consumer
import play.shaded.ahc.org.asynchttpclient.AsyncHttpClient
import play.api.inject.guice.GuiceApplicationBuilder
import play.api.test.{Helpers, TestServer, WsTestClient}
import scala.compat.java8.FutureConverters
import scala.concurrent.Await
import scala.concurrent.duration._
import org.scalatestplus.play._
class SocketTest extends PlaySpec with ScalaFutures {
"HomeController" should {
"reject a websocket flow if the origin is set incorrectly" in WsTestClient.withClient { client =>
// Pick a non standard port that will fail the (somewhat contrived) origin check...
lazy val port: Int = 31337
val app = new GuiceApplicationBuilder().build()
Helpers.running(TestServer(port, app)) {
val myPublicAddress = s"localhost:$port"
val serverURL = s"ws://$myPublicAddress/ws"
val asyncHttpClient: AsyncHttpClient = client.underlying[AsyncHttpClient]
val webSocketClient = new WebSocketClient(asyncHttpClient)
try {
val origin = "ws://example.com/ws"
val consumer: Consumer[String] = new Consumer[String] {
override def accept(message: String): Unit = println(message)
}
val listener = new WebSocketClient.LoggingListener(consumer)
val completionStage = webSocketClient.call(serverURL, origin, listener)
val f = FutureConverters.toScala(completionStage)
Await.result(f, atMost = 1000.millis)
listener.getThrowable mustBe a[IllegalStateException]
} catch {
case e: IllegalStateException =>
e mustBe an[IllegalStateException]
case e: java.util.concurrent.ExecutionException =>
val foo = e.getCause
foo mustBe an[IllegalStateException]
}
}
}
}
}
But compilation fails on the line import org.scalatestplus.play._ with the error:
Cannot resolve symbol scalatestplus
Following https://www.playframework.com/documentation/2.8.x/ScalaTestingWithScalaTest I have added scalatest and play to the build:
build.sbt:
name := "testproject"
version := "1.0"
lazy val `testproject` = (project in file(".")).enablePlugins(PlayScala)
resolvers += "scalaz-bintray" at "https://dl.bintray.com/scalaz/releases"
resolvers += "Akka Snapshot Repository" at "https://repo.akka.io/snapshots/"
scalaVersion := "2.12.2"
libraryDependencies ++= Seq( jdbc , ehcache , ws , guice , specs2 % Test)
// https://mvnrepository.com/artifact/com.typesafe.scala-logging/scala-logging
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.9.2"
libraryDependencies ++= Seq(
"org.scalatestplus.play" %% "scalatestplus-play" % "3.0.0" % "test"
)
unmanagedResourceDirectories in Test <+= baseDirectory ( _ /"target/web/public/test" )
I've tried rebuilding the project and the module with the IntelliJ "Build" menu, and with the "Build" option that appears when I right-click build.sbt, but the import is still not found.

Running sbt dist from the IntelliJ "sbt shell", then File -> "Invalidate caches" with a restart of IntelliJ, seems to fix the issue.

Apache flink (1.9.1) runtime exception when using case classes in scala (2.12.8)

I am using a case class in a Scala (2.12.8) Apache Flink (1.9.1) application. I get the following exception when I run the code below: Caused by: java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V.
NOTE: I have added a default constructor as suggested in (java.lang.NoSuchMethodException for init method in Scala case class), but that does not help in my case.
Here is the complete code:
package com.zignallabs
import org.apache.flink.api.scala._
/**
// Implements the program that reads from a Element list, Transforms it into tuple and outputs to TaskManager
*/
case class AddCount ( firstP: String, count: Int) {
def this () = this ("default", 1) // No help when added default constructor as per https://stackoverflow.com/questions/51129809/java-lang-nosuchmethodexception-for-init-method-in-scala-case-class
}
object WordCount {
def main(args: Array[String]): Unit = {
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// get input data
val input =env.fromElements(" one", "two", "three", "four", "five", "end of test")
// ***** Line 31 throws the exception
// Caused by: java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
// at com.zignallabs.AddCount.<init>(WordCount.scala:7)
// at com.zignallabs.WordCount$.$anonfun$main$1(WordCount.scala:31)
// at org.apache.flink.api.scala.DataSet$$anon$1.map(DataSet.scala:490)
// at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
// at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
// at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:196)
// at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
// at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
// at java.lang.Thread.run(Thread.java:748)
val transform = input.map{w => AddCount(w, 1)} // <- Throwing exception
// execute and print result
println(transform)
transform.print()
transform.printOnTaskManager(" Word")
env.execute()
}
}
The runtime exception is:
at com.zignallabs.AddCount.<init>(WordCount.scala:7)
at com.zignallabs.WordCount$.$anonfun$main$1(WordCount.scala:31)
at org.apache.flink.api.scala.DataSet$$anon$1.map(DataSet.scala:490)
at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:196)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
I am building and running Flink locally, using a local Flink cluster with Flink version 1.9.1.
Here is the build.sbt file:
name := "flink191KafkaScala"
version := "0.1-SNAPSHOT"
organization := "com.zignallabs"
scalaVersion := "2.12.8"
val flinkVersion = "1.9.1"
//javacOptions ++= Seq("-source", "1.7", "-target", "1.7")
val http4sVersion = "0.16.6"
resolvers ++= Seq(
"Local Ivy" at "file:///"+Path.userHome+"/.ivy2/local",
"Local Ivy Cache" at "file:///"+Path.userHome+"/.ivy2/cache",
"Local Maven Repository" at "file:///"+Path.userHome+"/.m2/repository",
"Artifactory Cache" at "https://zignal.artifactoryonline.com/zignal/zignal-repos"
)
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
libraryDependencies ++= Seq(
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-clients" % "1.9.1",
// Upgrade to flink-connector-kafka_2.11
"org.apache.flink" %% "flink-connector-kafka-0.11" % "1.9.1",
//"org.scalaj" %% "scalaj-http" % "2.4.2",
"com.squareup.okhttp3" % "okhttp" % "4.2.2"
)
publishTo := Some("Artifactory Realm" at "https://zignal.artifactoryonline.com/zignal/zignal")
credentials += Credentials("Artifactory Realm", "zignal.artifactoryonline.com", "buildserver", "buildserver")
//mainClass in Compile := Some("com.zignallabs.StoryCounterTopology")
mainClass in Compile := Some("com.zignallabs.WordCount")
scalacOptions ++= Seq(
"-feature",
"-unchecked",
"-deprecation",
"-language:implicitConversions",
"-Yresolve-term-conflict:package",
"-language:postfixOps",
"-target:jvm-1.8")
lazy val root = project.in(file(".")).configs(IntegrationTest)

If you're using default args for the constructors of a case class, it's much more idiomatic Scala to define them like this:
case class AddCount ( firstP: String = "default", count: Int = 1)
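This lets callers omit either argument, which covers roughly the same call patterns as writing the following auxiliary constructors by hand (the compiler implements default arguments differently, so the generated bytecode is not identical):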
case class AddCount ( firstP: String, count: Int) {
def this () = this ("default", 1)
def this (firstP:String) = this (firstP, 1)
def this (count:Int) = this ("default", count)
}
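To make the difference concrete, here is a small self-contained sketch of how the default-argument version is called. The object name AddCountDemo is made up for illustration; this only demonstrates the language feature and is not a fix for the NoSuchMethodError in the question.

```scala
// Case class with default arguments, as suggested above.
case class AddCount(firstP: String = "default", count: Int = 1)

object AddCountDemo {
  def main(args: Array[String]): Unit = {
    println(AddCount())            // both defaults
    println(AddCount("one"))       // default count
    println(AddCount(count = 5))   // default firstP, selected with a named argument
    println(AddCount("two", 2))    // no defaults
  }
}
```

Note that skipping the first argument requires a named argument (count = 5), whereas the hand-written constructor form would be called as new AddCount(5).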

I am now able to run this application with Scala 2.12. The issue was in the environment: I needed to make sure there were no conflicting binaries on the classpath, especially a mix of Scala 2.11 and Scala 2.12 builds.
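The static method scala.Product.$init$ only exists from Scala 2.12 on, so a NoSuchMethodError for it suggests that code compiled for 2.12 is running against a 2.11 scala-library. A quick way to confirm which library is actually on the runtime classpath is to print its version from inside the job. This is a minimal sketch; the object and helper names are made up for illustration.

```scala
object ScalaVersionCheck {
  // Extracts the binary version ("2.11", "2.12", ...) from a full Scala 2 version string.
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the scala-library that was actually loaded at runtime.
    val runtime = scala.util.Properties.versionNumberString
    println(s"scala-library on the classpath: $runtime (binary ${binaryVersion(runtime)})")
  }
}
```

If the printed binary version differs from the scalaVersion in build.sbt, the runtime environment (for example the Flink distribution's lib directory) is supplying a different Scala than the one the job was compiled with.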

File not found exception while loading a properties file on a Scala SBT project

I am trying to learn Scala-Spark JDBC programming in IntelliJ IDEA. To do that, I created a Scala SBT project.
Before writing the JDBC connection parameters in the class, I first tried loading a properties file that contains all my connection properties and printing them to check that they load properly, as below:
connection.properties content:
devUserName=username
devPassword=password
gpDriverClass=org.postgresql.Driver
gpDevUrl=jdbc:url
Code:
package com.yearpartition.obj
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
object PartitionRetrieval {
var conf = new SparkConf().setAppName("Spark-JDBC")
val properties = new Properties()
properties.load(new FileInputStream("connection.properties"))
val connectionUrl = properties.getProperty("gpDevUrl")
val devUserName=properties.getProperty("devUserName")
val devPassword=properties.getProperty("devPassword")
val gpDriverClass=properties.getProperty("gpDriverClass")
println("connectionUrl: " + connectionUrl)
Class.forName(gpDriverClass).newInstance()
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().enableHiveSupport().config(conf).master("local[2]").getOrCreate()
println("connectionUrl: " + connectionUrl)
}
}
Content of build.sbt:
name := "YearPartition"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= {
val sparkCoreVer = "2.2.0"
val sparkSqlVer = "2.2.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkCoreVer % "provided" withSources(),
"org.apache.spark" %% "spark-sql" % sparkSqlVer % "provided" withSources(),
"org.json4s" %% "json4s-jackson" % "3.2.11" % "provided",
"org.apache.httpcomponents" % "httpclient" % "4.5.3"
)
}
Since I am not writing or saving any data and only want to display the values from the properties file, I ran the code with:
SPARK_MAJOR_VERSION=2 spark-submit --class com.yearpartition.obj.PartitionRetrieval yearpartition_2.11-0.1.jar
But I get a file-not-found exception:
Caused by: java.io.FileNotFoundException: connection.properties (No such file or directory)
I tried to fix it in vain. Could anyone tell me what mistake I am making here and how I can correct it?

You must give the full path of your connection.properties file (for example /full_path/connection.properties; note that java.io.FileInputStream takes a plain filesystem path, not a file:// URI). With this option, if you submit the job to a cluster and read the file from local disk, you must save connection.properties at the same path on every server in the cluster. Alternatively, you can read the file from HDFS. Here is a small example of reading a file from HDFS:
@throws[java.io.IOException]
def readFileFromHdfs(file: String): org.apache.hadoop.fs.FSDataInputStream = {
val conf = new org.apache.hadoop.conf.Configuration
conf.set("fs.default.name", "HDFS_HOST")
val fileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val path = new org.apache.hadoop.fs.Path(file)
if (!fileSystem.exists(path)) {
println("File (" + path + ") does not exist.")
null
} else {
val in = fileSystem.open(path)
in
}
}
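For the local-disk option described above, a minimal sketch of a loader that takes an explicit path and fails with a clear message could look like this. The object name PropertiesLoader is made up for illustration, and it assumes the path is passed to the application as its first argument (arguments placed after the jar on the spark-submit command line are handed to main).

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}
import java.util.Properties

object PropertiesLoader {
  // Loads a .properties file from an explicit path on the local filesystem.
  def load(path: String): Properties = {
    val file = Paths.get(path).toAbsolutePath
    if (!Files.isRegularFile(file))
      throw new FileNotFoundException(s"Properties file not found: $file")
    val props = new Properties()
    val in = Files.newInputStream(file)
    try props.load(in) finally in.close() // always release the file handle
    props
  }
}
```

Usage would then be something like val properties = PropertiesLoader.load(args(0)) inside main, with the full path to connection.properties appended after the jar name in the spark-submit command. Moving the load into main also avoids doing file I/O in the object initializer.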

java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterUtils$

I was building a small demo of Spark Streaming with Twitter. I added the required dependencies as shown at http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ and I am using sbt to build the jar. The project builds successfully; the only problem is that it cannot find the TwitterUtils class at runtime.
The build definition and the Scala code are given below.
build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
"org.twitter4j" % "twitter4j-core" % "4.0.4",
"org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
The main scala file is
TwitterCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status
object TwitterCount {
def main(args: Array[String]): Unit = {
val consumerKey = "abc"
val consumerSecret ="abc"
val accessToken = "abc"
val accessTokenSecret = "abc"
val lang ="english"
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret",consumerSecret)
System.setProperty("twitter4j.oauth.accessToken",accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
val conf = new SparkConf().setAppName("TwitterHashTags")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = TwitterUtils.createStream(ssc,None)
val tweetsFilteredByLang = tweets.filter{tweet => tweet.getLang() == lang}
val statuses = tweetsFilteredByLang.map{ tweet => tweet.getText()}
val words = statuses.flatMap{status => status.split("""\s+""")} // flatMap, so each element is a single word
val hashTags = words.filter{ word => word.startsWith("#StarWarsDay")}
val hashcounts = hashTags.count()
hashcounts.print()
ssc.start
ssc.awaitTermination()
}
}
Then I build the project using
sbt package
and I submit the generated jar using
spark-submit --class "TwitterCount" --master local[*] target/scala-2.11/twitterexample_2.11-1.0.jar
Please help me with this.
Thanks

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
You are missing the package name. Your spark-submit command should look like this:
--class com.spark.examples.TwitterCount

I found the solution at last, in this question:
java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags
I have to build the jar using
sbt assembly
but I'm still wondering how it differs from the jar that I build using
sbt package
Does anyone know? Please share.
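On the difference: sbt package puts only the project's own compiled classes into the jar, so library classes such as TwitterUtils are missing at runtime unless they are supplied some other way. sbt assembly (from the sbt-assembly plugin) builds a single "fat" jar that also contains the dependencies. A rough sketch of the setup follows; the plugin version shown is an assumption and should be chosen to match your sbt version, and real builds often also need a merge strategy for duplicate files.

```scala
// project/plugins.sbt
// (plugin version is an assumption; pick one compatible with your sbt)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
// Spark itself is provided by spark-submit at runtime, so it is kept out of the fat jar.
// The Twitter connector is not part of Spark, so it must be bundled.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming"         % sparkVersion % "provided",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0"
)
```

With the default naming, sbt assembly should write target/scala-2.11/twitterexample-assembly-1.0.jar, and that file is the one to pass to spark-submit instead of twitterexample_2.11-1.0.jar.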
