Suppose there are multiple databases in MongoDB (DB1, DB2, ..., DBa, DBb, ...) and each of them has some collections (Col1A, Col1B, ..., Col2A, Col2B, ...).
I want to find a way to manage multiple inputs and outputs in MongoDB from a self-contained Scala application. Here is pseudocode that shows my idea:
readconfig_DB1.Col1A = read setting pointing to DB=DB1 and collection=Col1A
readconfig_DB2.Col2B = read setting pointing to DB=DB2 and collection=Col2B
val rdd_DB1.Col1A = MongoSpark.load(sc_DB1.Col1A)
val rdd_DB2.Col2B = MongoSpark.load(sc_DB2.Col2B)
DF_Transformation1 = do some transformations on DF1a and DF2b
DF_Transformation2 = do some transformations on DF1b and DF2a
writeConfig_DBa.Col1A = write setting pointing to DB=DBa and collection=Col1A
writeConfig_DBb.Col2B = write setting pointing to DB=DBb and collection=Col2B
MongoSpark.save(DF_Transformation1, writeConfig_DBa.Col1A)
MongoSpark.save(DF_Transformation2, writeConfig_DBb.Col2B)
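For reference, a minimal sketch of how this idea could map onto the connector's ReadConfig/WriteConfig API (the URIs, database, and collection names are placeholders, and the transformation is a stub):
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.{ReadConfig, WriteConfig}
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiCollectionSketch {
  def main(args: Array[String]): Unit = {
    // Base URIs are placeholders; per-collection settings override them below.
    val spark = SparkSession.builder
      .appName("multi-collection-sketch")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/DB1.Col1A")
      .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/DBa.Col1A")
      .getOrCreate()

    // One ReadConfig per input collection, falling back to the session's settings.
    val readDB1Col1A = ReadConfig(Map("database" -> "DB1", "collection" -> "Col1A"), Some(ReadConfig(spark)))
    val readDB2Col2B = ReadConfig(Map("database" -> "DB2", "collection" -> "Col2B"), Some(ReadConfig(spark)))

    val df1a: DataFrame = MongoSpark.load(spark, readDB1Col1A)
    val df2b: DataFrame = MongoSpark.load(spark, readDB2Col2B)

    // Placeholder transformation; replace with the real logic on df1a and df2b.
    val transformed: DataFrame = df1a

    // One WriteConfig per output collection.
    val writeDBaCol1A = WriteConfig(Map("database" -> "DBa", "collection" -> "Col1A"), Some(WriteConfig(spark)))
    MongoSpark.save(transformed, writeDBaCol1A)

    spark.stop()
  }
}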
Edit1:
I tried to run the solution.
The folder structure:
$find .
.
./src
./src/main
./src/main/scala
./src/main/scala/application.conf
./src/main/scala/SimpleApp.scala
./build.sbt
Content of build.sbt:
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.mongodb.spark" %% "mongo-spark-connector" % "2.4.1",
"org.apache.spark" %% "spark-core" % "2.4.1",
"org.apache.spark" %% "spark-sql" % "2.4.1"
)
Content of application.conf:
config {
  database {
    "spark_mongodb_input_uri": "mongodb://127.0.0.1/test.myCollection",
    "spark_mongodb_user": "",
    "spark_mongodb_pass": "",
    "spark_mongodb_input_database": "test",
    "spark_mongodb_input_collection": "myCollection",
    "spark_mongodb_input_readPreference_name": "",
    "spark_mongodb_output_database": "test",
    "spark_mongodb_output_collection": "myCollection"
  }
  newreaderone {
    "database": "test",
    "collection": "myCollection",
    "readPreference.name": ""
  }
  newwriterone {
    "uri": "mongodb://127.0.0.1/test.myCollection",
    "database": "test",
    "collection": "myCollection",
    "replaceDocument": "false", //If set to True, updates an existing document
    "readPreference.name": "",
    "maxBatchSize": "128"
  }
}
Content of SimpleApp.scala:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession
object FirstApp {
  def main(args: Array[String]) {
    import com.typesafe.{Config,ConfigFactory}
    val appConfig: Config = ConfigFactory.load("application.conf")
    import scala.collection.JavaConverters._
    val initial_conf:Config = appconf.getConfig("config.database")
    val confMap: Map[String,String] = initial_conf.entrySet()
      .iterator.asScala
      .map(e => e.getKey.replaceAll("_",".") -> e.getValue.unwrapped.toString).toMap
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.{DataFrame,SparkSession}
    val sparkConfig: SparkConf = new SparkConf()
    sparkConfig.setAll(confMap)
    val spark: SparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport.getOrCreate
    import com.mongodb.spark._
    val data: DataFrame = MongoSpark.load(spark)
    import com.mongodb.spark.config._
    val nreader = appConfig.getConfig("config.newreaderone")
    val readMap: Map[String,Any] = nreader.entrySet()
      .iterator.asScala
      .map(e => e.getKey -> e.getValue.unwrapped)
      .toMap
    val customReader = ReadConfig(readMap)
    val newDF: DataFrame = spark.read.mongo(customReader)
    resultDF.write.mode("append").mongo()
  }
}
Compilation errors:
sbt package
[info] Updated file /Path/3/project/build.properties: set sbt.version to 1.3.10
[info] Loading global plugins from /home/sadegh/.sbt/1.0/plugins
[info] Loading project definition from /Path/3/project
[info] Loading settings for project root-3 from build.sbt ...
[info] Set current project to root-3 (in build file:/Path/3/)
[warn] There may be incompatibilities among your library dependencies; run 'evicted' to see detailed eviction warnings.
[info] Compiling 1 Scala source to /Path/3/target/scala-2.11/classes ...
[error] /Path/3/src/main/scala/SimpleApp.scala:8:13: object typesafe is not a member of package com
[error] import com.typesafe.{Config,ConfigFactory}
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:9:17: not found: type Config
[error] val appConfig: Config = ConfigFactory.load("application.conf")
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:9:26: not found: value ConfigFactory
[error] val appConfig: Config = ConfigFactory.load("application.conf")
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:11:19: not found: type Config
[error] val initial_conf:Config = appconf.getConfig("config.database")
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:11:28: not found: value appconf
[error] val initial_conf:Config = appconf.getConfig("config.database")
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:19:56: not found: value sparkConf
[error] val spark: SparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport.getOrCreate
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:28:21: overloaded method value apply with alternatives:
[error] (options: scala.collection.Map[String,String])com.mongodb.spark.config.ReadConfig.Self <and>
[error] (sparkConf: org.apache.spark.SparkConf)com.mongodb.spark.config.ReadConfig.Self <and>
[error] (sqlContext: org.apache.spark.sql.SQLContext)com.mongodb.spark.config.ReadConfig.Self <and>
[error] (sparkSession: org.apache.spark.sql.SparkSession)com.mongodb.spark.config.ReadConfig.Self <and>
[error] (sparkContext: org.apache.spark.SparkContext)com.mongodb.spark.config.ReadConfig.Self
[error] cannot be applied to (scala.collection.immutable.Map[String,Any])
[error] val customReader = ReadConfig(readMap)
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:29:36: value mongo is not a member of org.apache.spark.sql.DataFrameReader
[error] val newDF: DataFrame = spark.read.mongo(customReader)
[error] ^
[error] /Path/3/src/main/scala/SimpleApp.scala:30:2: not found: value resultDF
[error] resultDF.write.mode("append").mongo()
[error] ^
[error] 9 errors found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 12 s, completed Jun 14, 2020 6:55:43 PM
You can pass the configurations as inputs to your application in HOCON.
You can try the following HOCON configuration snippet for multiple read and write configurations.
config {
  database {
    "spark_mongodb_input_uri": "mongodb://connection/string/here",
    "spark_mongodb_user": "your_user_name",
    "spark_mongodb_pass": "your_password",
    "spark_mongodb_input_database": "Some_Db_Name",
    "spark_mongodb_input_collection": "Some_Col_Name",
    "spark_mongodb_input_readPreference_name": "primaryPreferred",
    "spark_mongodb_output_database": "Some_Output_Db_Name",
    "spark_mongodb_output_collection": "Some_Output_Col_Name"
  }
  newreaderone {
    "database": "sf",
    "collection": "matchrecord",
    "readPreference.name": "secondaryPreferred"
  }
  newwriterone {
    "uri": "mongodb://uri of same or new mongo cluster/instance",
    "database": "db_name",
    "collection": "col_name",
    "replaceDocument": "false", // if true, the whole existing document (matched by _id) is replaced on save; if false only the supplied fields are updated
    "readPreference.name": "secondaryPreferred",
    "maxBatchSize": "128"
  }
}
The above configuration has been tested and can be read easily with the Typesafe Config library.
Update:
Put the above configuration in a file called application.conf.
Use the following code to read the file.
Step 1:
Reading the configuration file
import com.typesafe.config.{Config, ConfigFactory}
val appConfig: Config = ConfigFactory.load("application.conf")
Step 2:
To initialize spark to read and write to MongoDB we use the config under the database section as follows:
import scala.collection.JavaConverters._
val initial_conf: Config = appConfig.getConfig("config.database")
val confMap: Map[String, String] = initial_conf.entrySet()
  .iterator.asScala
  .map(e => e.getKey.replaceAll("_", ".") -> e.getValue.unwrapped.toString)
  .toMap
Step 3: Create SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame,SparkSession}
val sparkConfig: SparkConf = new SparkConf()
sparkConfig.setAll(confMap)
val spark: SparkSession = SparkSession.builder.config(sparkConfig).enableHiveSupport.getOrCreate
Step 4:
Read DataFrame from MongoDB
import com.mongodb.spark._
val data: DataFrame = MongoSpark.load(spark)
The above step reads the collection specified in the database section of the configuration.
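To sanity-check what was loaded at this step, you can, for instance, inspect the schema and a few documents:
// Quick check of the DataFrame loaded from the default collection
data.printSchema()
data.show(5)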
Step 5:
Reading a new collection:
import com.mongodb.spark.config._
import com.mongodb.spark.sql._ // brings in the .mongo implicits on DataFrameReader/DataFrameWriter
val nreader = appConfig.getConfig("config.newreaderone")
val readMap: Map[String, String] = nreader.entrySet()
  .iterator.asScala
  .map(e => e.getKey -> e.getValue.unwrapped.toString)
  .toMap
val customReader = ReadConfig(readMap, Some(ReadConfig(spark))) // fall back to the session's URI and defaults
val newDF: DataFrame = spark.read.mongo(customReader)
Step 6:
Writing to a MongoDB collection
i) Writing to the collection specified in SparkConf:
resultDF.write.mode("append").mongo()
The above code writes to the collection specified under the database section of the configuration.
ii) Writing to a collection other than the one specified in SparkConf:
import com.mongodb.spark.config._
val nwriter = appConfig.getConfig("config.newwriterone")
val writerMap: Map[String, String] = nwriter.entrySet()
  .iterator.asScala
  .map(e => e.getKey -> e.getValue.unwrapped.toString)
  .toMap
val writeConf = WriteConfig(writerMap, Some(WriteConfig(spark)))
MongoSpark.save(resultDF, writeConf)
Update:
The whole code looks as follows; note the two ways the DataFrame is saved to MongoDB at the end.
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.JavaConverters._
import com.mongodb.spark._
import com.mongodb.spark.config._
import com.mongodb.spark.sql._

object ReadWriteMongo {
  def main(args: Array[String]): Unit = {
    val appConfig: Config = ConfigFactory.load("application.conf")
    val initial_conf: Config = appConfig.getConfig("config.database")
    val confMap: Map[String, String] = initial_conf.entrySet()
      .iterator.asScala
      .map(e => e.getKey.replaceAll("_", ".") -> e.getValue.unwrapped.toString)
      .toMap

    val sparkConfig: SparkConf = new SparkConf()
    sparkConfig.setAll(confMap)
    val spark: SparkSession = SparkSession.builder.config(sparkConfig).enableHiveSupport.getOrCreate

    // Read the default collection configured under config.database
    val data: DataFrame = MongoSpark.load(spark)

    val nreader = appConfig.getConfig("config.newreaderone")
    val readMap: Map[String, String] = nreader.entrySet()
      .iterator.asScala
      .map(e => e.getKey -> e.getValue.unwrapped.toString)
      .toMap
    val customReader = ReadConfig(readMap, Some(ReadConfig(spark)))

    /*
     * Read data from MongoDB using the custom ReadConfig
     */
    val newDF: DataFrame = spark.read.mongo(customReader)

    /*
     * After performing your processing on newDF, save the result in a new
     * DataFrame called "resultDF". The line below is only a placeholder.
     */
    val resultDF: DataFrame = newDF

    // You can save the DataFrame as follows
    resultDF.write.mode("append").mongo()

    /*
     * Alternatively, you can save a DataFrame by passing a WriteConfig as follows
     */
    val nwriter = appConfig.getConfig("config.newwriterone")
    val writerMap: Map[String, String] = nwriter.entrySet()
      .iterator.asScala
      .map(e => e.getKey -> e.getValue.unwrapped.toString)
      .toMap
    val writeConf = WriteConfig(writerMap, Some(WriteConfig(spark)))
    MongoSpark.save(resultDF, writeConf)
  }
}
The folder structure should be as follows:
src/
src/main/scala/com/test/ReadWriteMongo.scala
src/main/resources/application.conf
Update: build.sbt
val sparkVersion = "2.4.1"
scalaVersion := "2.11.12"
scalacOptions ++= Seq(
"-deprecation",
"-feature",
"-Xfuture",
"-encoding",
"UTF-8",
"-unchecked",
"-language:postfixOps"
)
libraryDependencies ++= Seq(
"com.typesafe" % "config" % "1.4.0",
"org.mongodb.spark" %% "mongo-spark-connector" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion % Provided
)
mainClass in assembly := Some("com.test.ReadWriteMongo")
assembly / test := {}
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "reference.conf" => MergeStrategy.concat
case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
case _ => MergeStrategy.first
}
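Note that the assembly keys used above come from the sbt-assembly plugin, so project/plugins.sbt needs something like the following (the plugin version here is an assumption; pick one compatible with your sbt release):
// project/plugins.sbt
// sbt-assembly provides the `assembly` task and settings used in build.sbt above.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")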
Related
I am new to Scala/Spark development. I have created a simple streaming application from a Kafka topic using sbt and Scala. I have the following code:
build.sbt
name := "kafka-streaming"
version := "1.0"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyMergeStrategy in assembly := {
case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
case PathList(pl # _*) if pl.contains("log4j.properties") => MergeStrategy.concat
case PathList("META-INF", "io.netty.versions.properties") => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
scalaVersion := "2.11.8"
resolvers += "jitpack" at "https://jitpack.io"
// still want to be able to run in sbt
// https://github.com/sbt/sbt-assembly#-provided-configuration
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
fork in run := true
javaOptions in run ++= Seq(
"-Dlog4j.debug=true",
"-Dlog4j.configuration=log4j.properties")
libraryDependencies ++= Seq(
"com.groupon.sparklint" %% "sparklint-spark162" % "1.0.4" excludeAll (
ExclusionRule(organization = "org.apache.spark")
),
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.3"
)
WeatherDataStream.scala
package com.supergloo

import kafka.serializer.StringDecoder
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * Stream from Kafka
 */
object WeatherDataStream {
  val localLogger = Logger.getLogger("WeatherDataStream")

  def main(args: Array[String]) {
    // update
    // val checkpointDir = "./tmp"
    val sparkConf = new SparkConf().setAppName("Raw Weather")
    sparkConf.setIfMissing("spark.master", "local[5]")

    val ssc = new StreamingContext(sparkConf, Seconds(2))

    val kafkaTopicRaw = "spark-topic"
    val kafkaBroker = "127.0.01:9092"
    val topics: Set[String] = kafkaTopicRaw.split(",").map(_.trim).toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBroker)

    localLogger.info(s"connecting to brokers: $kafkaBroker")
    localLogger.info(s"kafkaParams: $kafkaParams")
    localLogger.info(s"topics: $topics")

    val rawWeatherStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    localLogger.info(s"Manaaaaaaaaaf --->>>: $rawWeatherStream")

    // Kick off
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
I created the jar file using the command
sbt package
and ran the application using the command
./spark-submit --master spark://myserver:7077 --class com.supergloo.WeatherDataStream /home/Manaf/kafka-streaming_2.11-1.0.jar
But I got an error like this:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
at com.supergloo.WeatherDataStream$.main(WeatherDataStream.scala:37)
at com.supergloo.WeatherDataStream.main(WeatherDataStream.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtils$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Based on my Stack Overflow research, I got the idea of creating the jar using the assembly command
sbt assembly
But I got an error like the one below when executing the assembly command:
[error] 153 errors were encountered during merge
[trace] Stack trace suppressed: run last *:assembly for the full output.
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.apache.arrow\arrow-vector\jars\arrow-vector-0.10.0.jar:git.properties
[error] C:\Users\amanaf\.ivy2\cache\org.apache.arrow\arrow-format\jars\arrow-format-0.10.0.jar:git.properties
[error] C:\Users\amanaf\.ivy2\cache\org.apache.arrow\arrow-memory\jars\arrow-memory-0.10.0.jar:git.properties
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\javax.inject\javax.inject\jars\javax.inject-1.jar:javax/inject/Inject.class
[error] C:\Users\amanaf\.ivy2\cache\org.glassfish.hk2.external\javax.inject\jars\javax.inject-2.4.0-b34.jar:javax/inject/Inject.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\javax.inject\javax.inject\jars\javax.inject-1.jar:javax/inject/Named.class
[error] C:\Users\amanaf\.ivy2\cache\org.glassfish.hk2.external\javax.inject\jars\javax.inject-2.4.0-b34.jar:javax/inject/Named.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\javax.inject\javax.inject\jars\javax.inject-1.jar:javax/inject/Provider.class
[error] C:\Users\amanaf\.ivy2\cache\org.glassfish.hk2.external\javax.inject\jars\javax.inject-2.4.0-b34.jar:javax/inject/Provider.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\javax.inject\javax.inject\jars\javax.inject-1.jar:javax/inject/Qualifier.class
[error] C:\Users\amanaf\.ivy2\cache\org.glassfish.hk2.external\javax.inject\jars\javax.inject-2.4.0-b34.jar:javax/inject/Qualifier.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\javax.inject\javax.inject\jars\javax.inject-1.jar:javax/inject/Scope.class
[error] C:\Users\amanaf\.ivy2\cache\org.glassfish.hk2.external\javax.inject\jars\javax.inject-2.4.0-b34.jar:javax/inject/Scope.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\javax.inject\javax.inject\jars\javax.inject-1.jar:javax/inject/Singleton.class
[error] C:\Users\amanaf\.ivy2\cache\org.glassfish.hk2.external\javax.inject\jars\javax.inject-2.4.0-b34.jar:javax/inject/Singleton.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4BlockInputStream.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4BlockInputStream.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4BlockOutputStream.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4BlockOutputStream.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4Compressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4Compressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4Constants.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4Constants.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4Factory.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4Factory.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4FastDecompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4FastDecompressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4HCJNICompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4HCJNICompressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4HCJavaSafeCompressor$HashTable.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4HCJavaSafeCompressor$HashTable.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4HCJavaSafeCompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4HCJavaSafeCompressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4HCJavaUnsafeCompressor$HashTable.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4HCJavaUnsafeCompressor$HashTable.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4HCJavaUnsafeCompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4HCJavaUnsafeCompressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4JNI.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4JNI.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4JNICompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4JNICompressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4JNIFastDecompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4JNIFastDecompressor.class
[error] deduplicate: different file contents found in the following:
[error] C:\Users\amanaf\.ivy2\cache\org.lz4\lz4-java\jars\lz4-java-1.4.0.jar:net/jpountz/lz4/LZ4JNISafeDecompressor.class
[error] C:\Users\amanaf\.ivy2\cache\net.jpountz.lz4\lz4\jars\lz4-1.2.0.jar:net/jpountz/lz4/LZ4JNISafeDecompressor.class
This issue is related to library versions. I have just updated my build.sbt like this
name := "kafka-streaming"
version := "1.0"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyMergeStrategy in assembly := {
case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
case PathList(pl # _*) if pl.contains("log4j.properties") => MergeStrategy.concat
case PathList("META-INF", "io.netty.versions.properties") => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
scalaVersion := "2.11.8"
resolvers += "jitpack" at "https://jitpack.io"
// still want to be able to run in sbt
// https://github.com/sbt/sbt-assembly#-provided-configuration
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
fork in run := true
javaOptions in run ++= Seq(
"-Dlog4j.debug=true",
"-Dlog4j.configuration=log4j.properties")
libraryDependencies ++= Seq(
"com.groupon.sparklint" %% "sparklint-spark162" % "1.0.4" excludeAll (
ExclusionRule(organization = "org.apache.spark")
),
"org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.2" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.6.2" % "provided",
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.2",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)
Now the issue is resolved.
This is the basic code to ingest messages from Kafka into Spark Streaming and do a word-frequency count. The code is customized for a local machine.
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010._

object WordFreqCount {
  def main(args: Array[String]) {
    println("Start")
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("KafkaReceiver")
      .set("spark.driver.bindAddress", "127.0.0.1")
    println("conf created")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext
    val ssc = new StreamingContext(sc, Seconds(10))
    println("ssc created")

    val topics = "wctopic"
    val brokers = "127.0.0.1:9092"
    val groupId = "wcgroup"
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topicsSet, kafkaParams))

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_.value)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // val kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "wcgroup", Map("wctopic" -> 1))
    messages.print() // prints the stream of data received

    ssc.start()
    ssc.awaitTermination()
    println("End")
  }
}
I want to deploy and submit a Spark program using sbt, but it's throwing an error.
Code:
package in.goai.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("First Spark")
    val sc = new SparkContext(conf)
    val fileName = args(0)
    val lines = sc.textFile(fileName).cache
    val c = lines.count
    println(s"There are $c lines in $fileName")
  }
}
build.sbt
name := "First Spark"
version := "1.0"
organization := "in.goai"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
resolvers += Resolver.mavenLocal
Under the first/project directory, build.properties contains:
bt.version=0.13.9
When I try to run sbt package, it throws the error given below.
[root#hadoop first]# sbt package
[info] Loading project definition from /home/training/workspace_spark/first/project
[info] Set current project to First Spark (in build file:/home/training/workspace_spark/first/)
[info] Compiling 1 Scala source to /home/training/workspace_spark/first/target/scala-2.11/classes...
[error] /home/training/workspace_spark/first/src/main/scala/LineCount.scala:3: object apache is not a member of package org
[error] import org.apache.spark.{SparkContext, SparkConf}
[error] ^
[error] /home/training/workspace_spark/first/src/main/scala/LineCount.scala:9: not found: type SparkConf
[error] val conf = new SparkConf().setAppName("First Spark")
[error] ^
[error] /home/training/workspace_spark/first/src/main/scala/LineCount.scala:11: not found: type SparkContext
[error] val sc = new SparkContext(conf)
[error] ^
[error] three errors found
[error] (compile:compile) Compilation failed
[error] Total time: 4 s, completed May 10, 2018 4:05:10 PM
I have also tried extending App, but no change.
Please remove resolvers += Resolver.mavenLocal from build.sbt. Since spark-core is available on Maven Central, there is no need for a local resolver.
After that, you can try sbt clean package.
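For reference, the build.sbt without the local resolver (keeping the same versions as above) would be:
name := "First Spark"

version := "1.0"

organization := "in.goai"

scalaVersion := "2.11.8"

// spark-core is resolved from Maven Central, so no extra resolvers are needed.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"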
I tried to do ETL on a DStream using a Kafka consumer and Spark Streaming, but got the following error. Could you help me fix this? Thanks.
KafkaCardCount.scala:56:28: value reduceByKey is not a member of org.apache.spark.streaming.dstream.DStream[Any]
[error] val wordCounts = etl.reduceByKey(_ + _)
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 7 s, completed Jan 14, 2018 2:52:23 PM
I have this sample code. Many articles suggest adding import org.apache.spark.streaming.StreamingContext._, but that does not seem to work for me.
package example
import org.apache.spark.streaming.StreamingContext._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Durations, StreamingContext}
val ssc = new StreamingContext(sparkConf, Durations.seconds(5))
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val etl = stream.map(r => {
  val split = r.value.split("\t")
  val id = split(1)
  val numStr = split(4)
  if (numStr.matches("\\d+")) {
    val num = numStr.toInt
    val tpl = (id, num)
    tpl
  } else {
    ()
  }
})
// Create the counts per game
val wordCounts = etl.reduceByKey(_ + _)
wordCounts.print()
I have this build.sbt.
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "example",
scalaVersion := "2.11.8",
version := "0.1.0-SNAPSHOT"
)),
name := "KafkaCardCount",
libraryDependencies ++= Seq (
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.1.0",
"org.apache.spark" %% "spark-streaming-kafka-0-10-assembly" % "2.1.0"
)
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Your problem is here:
else {
  ()
}
The common supertype of (String, Int) and Unit is Any, so the result is a DStream[Any].
What you need to do is signal that the processing failed with a value of the same type as your success (if) branch. For example:
else ("-1", -1)
.filter { case (id, res) => id != "-1" && res != -1 }
.reduceByKey(_ + _)
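Putting it together, a sketch of the corrected mapping stage (reusing the stream, field indexes, and sentinel value from the snippets above) could look like this:
// Every branch now returns (String, Int), so the result is a pair DStream
// and reduceByKey becomes available.
val etl = stream.map { r =>
  val split = r.value.split("\t")
  val id = split(1)
  val numStr = split(4)
  if (numStr.matches("\\d+")) (id, numStr.toInt)
  else ("-1", -1) // sentinel for records that failed to parse
}

// Drop the sentinel records, then aggregate the counts per id.
val wordCounts = etl
  .filter { case (id, num) => id != "-1" && num != -1 }
  .reduceByKey(_ + _)

wordCounts.print()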
I am trying to use spark-submit with a Scala script, but first I need to create my package.
Here is my sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
When I try sbt package, I am getting these errors:
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:3: object functions is not a member of package org.apache.spark.sql
import org.apache.spark.sql.functions._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:4: object types is not a member of package org.apache.spark.sql
import org.apache.spark.sql.types._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:25: not found: value sc
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:30: not found: value sqlContext
val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:36: not found: value udf
val sqlfunc = udf(coder)
^
5 errors found
(compile:compileIncremental) Compilation failed
Has anyone faced these errors?
Thanks for helping.
Regards,
Majid
You are trying to use the class org.apache.spark.sql.functions and the package org.apache.spark.sql.types. According to the functions class documentation, it is available starting from version 1.3.0, and the types package is available since version 1.3.1.
Solution: update the SBT file to:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"
The other errors, "not found: value sc", "not found: value sqlContext", and "not found: value udf", are caused by missing variables in your XML_Script_SBT.scala file. They can't be solved without looking at the source code.
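For reference, in a standalone Spark 1.x application those values are usually created along these lines (a minimal sketch; the names and the example UDF are illustrative, not taken from your script):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf // provides the udf(...) helper

val conf = new SparkConf().setAppName("XML_Script_SBT")
val sc = new SparkContext(conf)   // defines the missing `sc`
val sqlContext = new SQLContext(sc) // defines the missing `sqlContext`

// Example UDF, analogous to the one reported as missing:
val coder: (Long => String) = (arg: Long) => if (arg > -1) "ok" else "nada"
val sqlfunc = udf(coder)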
Thanks Sergey, your correction fixes three of the errors. Below is my script:
object SimpleApp {
  def main(args: Array[String]) {
    val today = Calendar.getInstance.getTime
    val curTimeFormat = new SimpleDateFormat("yyyyMMdd-HHmmss")
    val time = curTimeFormat.format(today)
    val destination = "/3.Data/3.Code_Check_Processed/2.XML/" + time + ".extensive.csv"
    val source = "/3.Data/2.Code_Check_Raw/2.XML/Extensive/"
    val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
    val hdfs = FileSystem.get(hconf)
    val iter = hdfs.listLocatedStatus(new Path(source))
    val uri = iter.next.getPath.toUri
    val df = sqlContext.read.format("xml").option("attributePrefix", "").option("rowTag", "project").load(uri.toString())
    val df2 = df.selectExpr("explode(check) as e").select("e.#VALUE","e.additionalInformation1","e.columnNumber","e.context","e.errorType","e.filePath","e.lineNumber","e.message","e.severity")
    val coder: (Long => String) = (arg: Long) => {if (arg > -1) time else "nada"}
    val sqlfunc = udf(coder)
    val df3 = df2.withColumn("TimeStamp", sqlfunc(col("columnNumber")))
    df3.write.format("com.databricks.spark.csv").option("header", "false").save(destination)
    hdfs.delete(new Path(uri.toString()), true)
    sys.exit(0)
  }
}
I cannot access SparkConf in the package, even though I have already imported org.apache.spark.SparkConf. My code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object SparkStreaming {
  def main(arg: Array[String]) = {
    val conf = new SparkConf.setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs_new = words.map(w => (w, 1))
    val wordsCount = pairs_new.reduceByKey(_ + _)
    wordsCount.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
The sbt dependencies are:
name := "Spark Streaming"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
"org.apache.spark" %% "spark-mllib" % "1.5.2",
"org.apache.spark" %% "spark-streaming" % "1.5.2"
)
But the error shows that SparkConf cannot be accessed.
[error] /home/cliu/Documents/github/Spark-Streaming/src/main/scala/Spark-Streaming.scala:31: object SparkConf in package spark cannot be accessed in package org.apache.spark
[error] val conf = new SparkConf.setMaster("local[2]").setAppName("NetworkWordCount")
[error] ^
It compiles if you add parentheses after SparkConf:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
The point is that SparkConf is a class, not a function, and a class name can also be used for scoping purposes. When you add parentheses after the class name, you make sure you are calling the class constructor rather than using the name as a scope. Here is an example from the Scala shell illustrating the difference:
scala> class C1 { var age = 0; def setAge(a:Int) = {age = a}}
defined class C1
scala> new C1
res18: C1 = $iwC$$iwC$C1@2d33c200
scala> new C1()
res19: C1 = $iwC$$iwC$C1@30822879
scala> new C1.setAge(30) // this doesn't work
<console>:23: error: not found: value C1
new C1.setAge(30)
^
scala> new C1().setAge(30) // this works
scala>
In this case you cannot omit the parentheses, so it should be:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")