I have an R&D project that reads data from Oracle and then writes it to a Spark standalone cluster, using Scala and IntelliJ.
This is my build.sbt with the library dependencies:
name := "DB_Oracle_V07"
version := "0.1"
scalaVersion := "2.11.12"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.0"
This is my project's main class, which does not contain any Spark code:
package com.xxxx.spark
import java.io.FileWriter
import java.sql.{Connection, DriverManager}
import java.text.SimpleDateFormat
import java.util.Calendar
import scala.collection.mutable.ArrayBuffer
object query01 {
var dateStart = ""
case class DF_LOT_INFO(LOT_NUMBER: String, MACHINE: String, FACILITY: String, LOT_TYPE: String, REC_DATE: String, FILE_NAME: String)
def main(args:Array[String]): Unit = {
val cal = Calendar.getInstance
cal.add(Calendar.DATE, 1)
val date = cal.getTime
val format1 = new SimpleDateFormat("yyyyMMdd_HHmmss")
dateStart = format1.format(date)
print_log("start")
val url = "jdbc:oracle:thin:@TMDT1PEN.XXXX.XXXX.COM:1521:TMDT1"
//val driver = "oracle.jdbc.OracleDriver"
val driver = "oracle.jdbc.driver.OracleDriver"
val username = "TMDB_XXXX"
val password = "XXXXXXXX"
val connection:Connection = null
val result = ArrayBuffer[String]()
try{
print_log("Class.forName start")
val app_dir = System.getProperty("user.dir")
print_log("current dir: " + app_dir)
val java_class_path = System.getProperty("java.class.path")
print_log("java_class_path: " + java_class_path)
Class.forName(driver)
var testing = Class.forName(driver)
print_log("Class.forName end")
//DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver)
val connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement
val rs = statement.executeQuery("select * from tester")
var i = 1
while(rs.next){
val item = rs.getString("tester_name")
println("data:" + item)
print_log("data:" + item)
result.append(item)
i = i + 1
}
// values('aaaaa','bbbbb',sysdate)")  <- leftover fragment of a truncated insert statement
}
catch{
//case unknown => println("Got this unknown exception: " + unknown)
case unknown => print_error_log("Unknown exception: " + unknown)
}
finally{
}
print_log("end")
}
def print_log(msg:String): Unit = {
val fw = new FileWriter(dateStart + "_log.txt", true)
try {
fw.write("\n" + msg)
}
finally fw.close()
}
def print_error_log(msg:String): Unit = {
val fw = new FileWriter(dateStart + "_error_log.txt", true)
try {
fw.write("\n" + msg)
}
finally fw.close()
}
}
I built the artifact as usual and added ojdbc6.jar to my project .jar as my Oracle library:
ojdbc6.jar
But I fail to execute my project jar file and get the error below:
Error: Could not find or load main class com.xxxx.spark.query01
As a blind test, I removed all the extracted Spark jar files from my project .jar,
or created a new project without any library dependencies in build.sbt,
and then I was able to execute my project .jar without error.
From this blind test I am quite sure the error is caused by the Spark libraries.
Can any expert advise me how to solve this error? Thanks to all :)
Related
I suppose some dependencies are not defined in the build.sbt file.
I've added the library dependencies in the build.sbt file, but I'm still getting the error mentioned in the title of this question. I tried to search for a solution on Google but couldn't find one.
My Spark Scala source code (filterEventId100.scala):
package com.projects.setTopBoxDataAnalysis
import java.lang.System._
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.SparkSession
object filterEventId100 extends App {
if (args.length < 2) {
println("Usage: JavaWordCount <Input-File> <Output-file>")
exit(1)
}
val spark = SparkSession
.builder
.appName("FilterEvent100")
.getOrCreate()
val data = spark.read.textFile(args(0)).rdd
val result = data.flatMap{line: String => line.split("\n")}
.map{serverData =>
val serverDataArray = serverData.replace("^", "::")split("::")
val evenId = serverDataArray(2)
if (evenId.equals("100")) {
val serverId = serverDataArray(0)
val timestempTo = serverDataArray(3)
val timestempFrom = serverDataArray(6)
val server = new Servers(serverId, timestempFrom, timestempTo)
val res = (serverId, server.dateDiff(server.timestampFrom, server.timestampTo))
res
}
}.reduceByKey{
case(x: Long, y: Long) => if ((x, y) != null) {
if (x > y) x else y
}
}
result.saveAsTextFile(args(1))
spark.stop
}
class Servers(val serverId: String, val timestampFrom: String, val timestampTo: String) {
val DATE_FORMAT = "yyyy-MM-dd hh:mm:ss.SSS"
private def convertStringToDate(s: String): Date = {
val dateFormat = new SimpleDateFormat(DATE_FORMAT)
dateFormat.parse(s)
}
private def convertDateStringToLong(dateAsString: String): Long = {
convertStringToDate(dateAsString).getTime
}
def dateDiff(tFrom: String, tTo: String): Long = {
val dDiff = convertDateStringToLong(tTo) - tFrom.toLong
dDiff
}
}
My build.sbt file:
name := "SetTopProject"
version := "0.1"
scalaVersion := "2.12.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-sql_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.hadoop" %% "hadoop-common" % "3.2.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-sql_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-hive_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-yarn_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy")
)
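As an aside, combining the %% operator with an explicit _2.12 suffix doubles the Scala suffix in the resolved artifact name (e.g. spark-sql_2.12_2.12), and hadoop-common is a plain Java artifact that takes % rather than %%. A form of these lines that resolves cleanly might look like this sketch (excludes omitted for brevity):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.3",
  "org.apache.spark" %% "spark-sql"  % "2.4.3",
  "org.apache.spark" %% "spark-hive" % "2.4.3",
  "org.apache.spark" %% "spark-yarn" % "2.4.3",
  "org.apache.hadoop" % "hadoop-common" % "3.2.0"
)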
I was expecting everything to be fine because
val spark = SparkSession
.builder
.appName("FilterEvent100")
.getOrCreate()
is defined well (without any compiler errors), and I use the spark value to define the data value:
val data = spark.read.textFile(args(0)).rdd
which calls the saveAsTextFile and reduceByKey functions:
val result = data.flatMap{line: String => line.split("\n")}...
}.reduceByKey {case(x: Long, y: Long) => if ((x, y) != null) {
if (x > y) x else y
}
result.saveAsTextFile(args(1))
What should I do to remove the compiler errors for the saveAsTextFile and reduceByKey function calls?
Replace
val spark = SparkSession
.builder
.appName("FilterEvent100")
.getOrCreate()
val data = spark.read.textFile(args(0)).rdd
with
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("FilterEvent100")
val sc = new SparkContext(conf)
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
val data = sc.textFile(args(0))
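reduceByKey is only defined on an RDD of key/value pairs, so the map step also has to emit a pair for every record instead of falling through to Unit when the event id is not 100. A minimal sketch of that part, reusing the field positions and the Servers class from the question, could be:
val result = data
  .flatMap(_.split("\n"))
  .flatMap { serverData =>
    val fields = serverData.replace("^", "::").split("::")
    if (fields(2) == "100") {
      // emit a (serverId, duration) pair only for event id 100
      val server = new Servers(fields(0), fields(6), fields(3))
      Some(fields(0) -> server.dateDiff(server.timestampFrom, server.timestampTo))
    } else None
  }
  .reduceByKey((x, y) => if (x > y) x else y) // keep the larger duration per server
result.saveAsTextFile(args(1))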
I have a problem running a server/client using ScalaPB on Spark.
It works totally fine when I run my code with "sbt run". I want to run this code on Spark because next I will import my Spark model to predict some labels. But when I submit my jar to Spark, I get an error like this:
Exception in thread "main" io.grpc.ManagedChannelProvider$ProviderNotFoundException:
No functional server found. Try adding a dependency on the grpc-netty artifact
This is my build.sbt:
scalaVersion := "2.11.7"
PB.targets in Compile := Seq(
scalapb.gen() -> (sourceManaged in Compile).value
)
val scalapbVersion =
scalapb.compiler.Version.scalapbVersion
val grpcJavaVersion =
scalapb.compiler.Version.grpcJavaVersion
libraryDependencies ++= Seq(
// protobuf
"com.thesamet.scalapb" %% "scalapb-runtime" % scalapbVersion % "protobuf",
//for grpc
"io.grpc" % "grpc-netty" % grpcJavaVersion ,
"com.thesamet.scalapb" %% "scalapb-runtime-grpc" % scalapbVersion
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Using shading still does not work:
assemblyShadeRules in assembly := Seq(ShadeRule.rename("com.google.**" -> "shadegoogle.@1").inAll)
And this is my main:
import java.util.logging.Logger
import io.grpc.{Server, ServerBuilder}
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.sql.SparkSession
import testproto.test.{Email, EmailLabel, RouteGuideGrpc}
import scala.concurrent.{ExecutionContext, Future}
object HelloWorldServer {
private val logger = Logger.getLogger(classOf[HelloWorldServer].getName)
def main(args: Array[String]): Unit = {
val server = new HelloWorldServer(ExecutionContext.global)
server.start()
server.blockUntilShutdown()
}
private val port = 50051
}
class HelloWorldServer(executionContext: ExecutionContext) {
self =>
private[this] var server: Server = null
private def start(): Unit = {
server = ServerBuilder.forPort(HelloWorldServer.port).addService(RouteGuideGrpc.bindService(new RouteGuideImpl, executionContext)).build.start
HelloWorldServer.logger.info("Server started, listening on " + HelloWorldServer.port)
sys.addShutdownHook {
System.err.println("*** shutting down gRPC server since JVM is shutting down")
self.stop()
System.err.println("*** server shut down")
}
}
private def stop(): Unit = {
if (server != null) {
server.shutdown()
}
}
private def blockUntilShutdown(): Unit = {
if (server != null) {
server.awaitTermination()
}
}
private class RouteGuideImpl extends RouteGuideGrpc.RouteGuide {
override def getLabel(request: Email): Future[EmailLabel] = {
val replay = EmailLabel(emailId = request.emailId, label = "aaaaa")
Future.successful(replay)
}
}
}
thanks
It looks like grpc-netty is not found when an uber jar is made. Instead of using ServerBuilder, change your code to use io.grpc.netty.NettyServerBuilder.
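A minimal sketch of that change in start(), assuming the rest of the class stays as posted:
import io.grpc.netty.NettyServerBuilder

private def start(): Unit = {
  // Building directly on the Netty transport avoids the provider lookup
  // that fails inside the shaded uber jar.
  server = NettyServerBuilder
    .forPort(HelloWorldServer.port)
    .addService(RouteGuideGrpc.bindService(new RouteGuideImpl, executionContext))
    .build
    .start
  HelloWorldServer.logger.info("Server started, listening on " + HelloWorldServer.port)
}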
I am trying to create a project with Slick in IntelliJ IDEA with the PostgreSQL driver, but I have managed to find only this tutorial.
I followed it, but I got an error:
Exception in thread "main" java.lang.ExceptionInInitializerError
at Main.main(Main.scala)
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'url'
Here is my code for the main class:
import scala.slick.driver.PostgresDriver.simple._
object Main {
case class Song(
id: Int,
name: String,
singer: String)
class SongsTable(tag: Tag) extends Table[Song](tag, "songs") {
def id = column[Int]("id")
def name = column[String]("name")
def singer = column[String]("singer")
def * = (id, name, singer) <> (Song.tupled, Song.unapply)
}
lazy val songsTable = TableQuery[SongsTable]
val db = Database.forConfig("scalaxdb")
def main(args: Array[String]): Unit = {
val connectionUrl = "jdbc:postgresql://localhost/songs?user=postgres&password=postgresp"
Database.forURL(connectionUrl, driver = "org.postgresql.Driver") withSession {
implicit session =>
val songs = TableQuery[SongsTable]
songs.list foreach { row =>
println("song with id " + row.id + " has name " + row.name + " and a singer is " + row.singer)
}
}
}
}
This is the application.conf file:
scalaxdb = {
dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
properties = {
driver = "org.postgresql.Driver"
url = "jdbc:postgresql://localhost/dbname?user=user&password=password"
}
}
And this is build.sbt:
libraryDependencies ++= Seq(
"org.postgresql" % "postgresql" % "9.3-1100-jdbc4",
"com.typesafe.slick" %% "slick" % "2.1.0",
"org.slf4j" % "slf4j-nop" % "1.6.4"
)
I cannot figure out what I am doing wrong. I would be very grateful for any advice on fixing this.
Would you mind using a slightly more up-to-date version of Slick?
import slick.jdbc.PostgresProfile.api._
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
object Main {
case class Song(
id: Int,
name: String,
singer: String)
class SongsTable(tag: Tag) extends Table[Song](tag, "songs") {
def id = column[Int]("id")
def name = column[String]("name")
def singer = column[String]("singer")
def * = (id, name, singer) <> (Song.tupled, Song.unapply)
}
val db = Database.forConfig("scalaxdb")
val songs = TableQuery[SongsTable]
def main(args: Array[String]): Unit = {
Await.result({
db.run(songs.result).map(_.foreach(row =>
println("song with id " + row.id + " has name " + row.name + " and a singer is " + row.singer)))
}, 1 minute)
}
}
build.sbt
scalaVersion := "2.12.4"
libraryDependencies += "com.typesafe.slick" %% "slick" % "3.2.1"
libraryDependencies += "org.slf4j" % "slf4j-nop" % "1.7.25"
libraryDependencies += "com.typesafe.slick" %% "slick-hikaricp" % "3.2.1"
libraryDependencies += "org.postgresql" % "postgresql" % "42.1.4"
I am trying to run a Scala example with sbt that reads data from MongoDB. I am getting this error whenever I try to access the data that was read from Mongo into the RDD.
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1431)
at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:494)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
I have imported DataFrame explicitly, even though it is not used in my code. Can anyone help with this issue?
My code:
package stream
import org.apache.spark._
import org.apache.spark.SparkContext._
import com.mongodb.spark._
import com.mongodb.spark.config._
import com.mongodb.spark.rdd.MongoRDD
import org.bson.Document
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame
object SpaceWalk {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("SpaceWalk")
.setMaster("local[*]")
.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/nasa.eva")
.set("spark.mongodb.output.uri", "mongodb://127.0.0.1/nasa.astronautTotals")
val sc = new SparkContext(sparkConf)
val rdd = sc.loadFromMongoDB()
def breakoutCrew ( document: Document ): List[(String,Int)] = {
println("INPUT"+document.get( "Duration").asInstanceOf[String])
var minutes = 0;
val timeString = document.get( "Duration").asInstanceOf[String]
if( timeString != null && !timeString.isEmpty ) {
val time = document.get( "Duration").asInstanceOf[String].split( ":" )
minutes = time(0).toInt * 60 + time(1).toInt
}
import scala.util.matching.Regex
val pattern = new Regex("(\\w+\\s\\w+)")
val names = pattern findAllIn document.get( "Crew" ).asInstanceOf[String]
var tuples : List[(String,Int)] = List()
for ( name <- names ) { tuples = tuples :+ (( name, minutes ) ) }
return tuples
}
val logs = rdd.flatMap( breakoutCrew ).reduceByKey( (m1: Int, m2: Int) => ( m1 + m2 ) )
//logs.foreach(println)
def mapToDocument( tuple: (String, Int ) ): Document = {
val doc = new Document();
doc.put( "name", tuple._1 )
doc.put( "minutes", tuple._2 )
return doc
}
val writeConf = WriteConfig(sc)
val writeConfig = WriteConfig(Map("collection" -> "astronautTotals", "writeConcern.w" -> "majority", "db" -> "nasa"), Some(writeConf))
logs.map( mapToDocument ).saveToMongoDB( writeConfig )
import org.apache.spark.sql.SQLContext
import com.mongodb.spark.sql._
import org.apache.spark.sql.DataFrame
// load the first dataframe "EVAs"
val sqlContext = new SQLContext(sc);
import sqlContext.implicits._
val evadf = sqlContext.read.mongo()
evadf.printSchema()
evadf.registerTempTable("evas")
// load the 2nd dataframe "astronautTotals"
val astronautDF = sqlContext.read.option("collection", "astronautTotals").mongo[astronautTotal]()
astronautDF.printSchema()
astronautDF.registerTempTable("astronautTotals")
sqlContext.sql("SELECT astronautTotals.name, astronautTotals.minutes FROM astronautTotals" ).show()
sqlContext.sql("SELECT astronautTotals.name, astronautTotals.minutes, evas.Vehicle, evas.Duration FROM " +
"astronautTotals JOIN evas ON astronautTotals.name LIKE evas.Crew" ).show()
}
}
case class astronautTotal ( name: String, minutes: Integer )
This is my sbt file:
name := "Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
//libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.2.1"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "0.1"
addCommandAlias("c1", "run-main stream.SaveTweets")
addCommandAlias("c2", "run-main stream.SpaceWalk")
outputStrategy := Some(StdoutOutput)
//outputStrategy := Some(LoggedOutput(log: Logger))
fork in run := true
This error message appears because you are using an incompatible library that only supports Spark 1.x. You should use mongo-spark-connector 2.0.0+ instead. See: https://docs.mongodb.com/spark-connector/v2.0/
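For example, the connector line in the question's build.sbt would become something like the following (the exact 2.x version is an assumption; pick one matching your Spark 2.0.x build):
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.0.0"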
Can anyone help me with how to analyze Twitter data based on whatever 'keys' I write? I found this code, but it gives me an error.
import java.io.File
import com.google.gson.Gson
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
* Collect at least the specified number of tweets into json text files.
*/
object Collect {
private var numTweetsCollected = 0L
private var partNum = 0
private var gson = new Gson()
def main(args: Array[String]) {
// Process program arguments and set properties
if (args.length < 3) {
System.err.println("Usage: " + this.getClass.getSimpleName +
"<outputDirectory> <numTweetsToCollect> <intervalInSeconds> <partitionsEachInterval>")
System.exit(1)
}
val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
Utils.parseCommandLineWithTwitterCredentials(args)
val outputDir = new File(outputDirectory.toString)
if (outputDir.exists()) {
System.err.println("ERROR - %s already exists: delete or specify another directory".format(
outputDirectory))
System.exit(1)
}
outputDir.mkdirs()
println("Initializing Streaming Spark Context...")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(intervalSecs))
val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
.map(gson.toJson(_))
tweetStream.foreachRDD((rdd, time) => {
val count = rdd.count()
if (count > 0) {
val outputRDD = rdd.repartition(partitionsEachInterval)
outputRDD.saveAsTextFile(outputDirectory + "/tweets_" + time.milliseconds.toString)
numTweetsCollected += count
if (numTweetsCollected > numTweetsToCollect) {
System.exit(0)
}
}
})
ssc.start()
ssc.awaitTermination()
}
}
The error is:
object gson is not a member of package com.google
If you know of any link about this, or how to fix the problem, could you share it with me? I want to analyze Twitter data with Spark.
Thanks. :)
As Peter pointed out, you are missing the gson dependency, so you'll need to add the following dependency to your build.sbt:
libraryDependencies += "com.google.code.gson" % "gson" % "2.4"
You can also define all the dependencies in one sequence:
libraryDependencies ++= Seq(
"com.google.code.gson" % "gson" % "2.4",
"org.apache.spark" %% "spark-core" % "1.2.0",
"org.apache.spark" %% "spark-streaming" % "1.2.0",
"org.apache.spark" %% "spark-streaming-twitter" % "1.2.0"
)
Bonus: in case of other missing dependencies, you can search for the dependency on http://mvnrepository.com/, and if you need to find the jar/dependency associated with a given class, you can also use the findjar website.