Can't write to mongodb after mapping in Spark Scala - scala

I have problem when write data to mongo after read and map data.
This is script I use to run the program.
I am using Spark 1.4.0, Scala 2.11.7 and mongo 2.6.10
#!/usr/bin/env bash
SPARK_PATH="/Users/username/spark-1.4.0-bin-hadoop2.6/bin/spark-submit"
CLASS_NAME="com.knx.conversion.ScalaWordCount"
CLUSTER='local[2]'
JARS="/Users/username/spark-1.4.0-bin-hadoop2.6/lib/mongo-hadoop-core-1.4.0.jar,/Users/username/spark-1.4.0-bin-hadoop2.6/lib/mongo-java-driver-3.0.3.jar"
JAR="/Users/username/AggragateConversionFunnel/target/scala-2.11/aggragateconversionfunnel_2.11-1.0.jar"
PROJECT_PATH="/Users/username/AggragateConversionFunnel"
cd ${PROJECT_PATH} && sbt package
${SPARK_PATH} --class ${CLASS_NAME} --master ${CLUSTER} --jars ${JARS} $JAR
and here is the main program here. Just copy from [here][1] and change the input output collection.
package com.knx.conversion
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import org.bson.BasicBSONObject
object ScalaWordCount {
def main(args: Array[String]) {
val sc = new SparkContext("local", "Scala Word Count")
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://127.0.0.1:27017/first-week.interactions")
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/visit_06_2015.output")
val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
// Input contains tuples of (ObjectId, BSONObject)
// Output contains tuples of (null, BSONObject) - ObjectId will be generated by Mongo driver if null
val countsRDD = mongoRDD.flatMap(arg => {
val str = arg._2.get("referer").toString
str.split("h")
})
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
countsRDD.foreach(println)
val saveRDD = countsRDD.map((tuple) => {
val bson = new BasicBSONObject()
bson.put("word", tuple._1)
bson.put("count", tuple._2.toString)
(null, bson)
})
// Only MongoOutputFormat and config are relevant
saveRDD.saveAsNewAPIHadoopFile("file:///bogus", classOf[Any], classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)
}
}
When run I got error
5/07/24 15:53:03 INFO DAGScheduler: Job 0 finished: foreach at ScalaWordCount.scala:39, took 1.111442 s
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at com.knx.conversion.ScalaWordCount$.main(ScalaWordCount.scala:48)
at com.knx.conversion.ScalaWordCount.main(ScalaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/07/24 15:53:03 INFO SparkContext: Invoking stop() from shutdown hook
Just don't know why and how it happened.
[1]: https://github.com/plaa/mongo-spark/blob/master/src/main/scala/ScalaWordCount.scala

This issued is about Scala version that I am currently using is not matching with Spark Scala version.
I am using Scala 2.11.7 to compile and package the jar but Spark 1.4.1 is using Scala 2.10.4.
The answer I found out here.
Then this issue solve by switching version of Scala to 2.10.4.

Related

spark - scala script for wordcount not working Only one SparkContext should be running

I am trying to run a scala code in spark. The wordcount code was copied from internet.
I have the hdfs running.
I keep having the same """Only one SparkContext should be running""" even if I stop de sc
I have also mounted the hdfs and pointed there the file value, but I have the same error.
Don´t know what else could do.
scala> sc.stop()
scala> :load /Users/dvegamar/spark_ej/wordcount.scala
Loading /Users/dvegamar/spark_ej/wordcount.scala...
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
defined object WordCount
scala> WordCount.main()
org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:72)
<init>(<console>:59)
<init>(<console>:63)
<init>(<console>:65)
<init>(<console>:67)
<init>(<console>:69)
<init>(<console>:71)
<init>(<console>:73)
<init>(<console>:75)
<init>(<console>:77)
<init>(<console>:79)
<init>(<console>:81)
<init>(<console>:83)
<init>(<console>:85)
<init>(<console>:87)
<init>(<console>:89)
<init>(<console>:91)
<init>(<console>:93)
<init>(<console>:95)
at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
at WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:83)
... 63 elided
The scala code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordCount {
def main() {
//Configuration for a Spark application.
val conf = new SparkConf()
conf.setAppName("SparkWordCount").setMaster("local")
conf.set("spark.driver.allowMultipleContexts", "true")
//Create Spark Context
val sc = new SparkContext(conf)
//Create MappedRDD by reading from HDFS file from path command line parameter
val rdd = sc.textFile("file:///Users/dvegamar/spark_ej/texto_ejemplo.txt")
//WordCount
rdd.flatMap(_.split(" ")).
map((_, 1)).
reduceByKey(_ + _).
map(x => (x._2, x._1)).
sortByKey(false).
map(x => (x._2, x._1)).
saveAsTextFile("SparkWordCountResult")
//stop context,
sc.stop
}
}

Spark scala not able to push data in Hive table

I am try to push data in existing hive table, i have already created orc table in hive not able to push data in hive. this code is work if i copy paste on spark console but not able to run by spark-submit.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object TestCode {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("first example").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
for (i <- 0 to 100 - 1) {
// sample value but it replace with business logic. and try to push into table.for loop consider as business logic.
var fstring = "fstring" + i
var cmd = "cmd" + i
var idpath = "idpath" + i
import sqlContext.implicits._
val sDF = Seq((fstring, cmd, idpath)).toDF("t_als_s_path", "t_als_s_cmd", "t_als_s_pd")
sDF.write.insertInto("l_sequence");
//sDF.write.format("orc").saveAsTable("l_sequence");
println("write data ==> " + i)
}
}
Giving the error.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: l_sequence;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:449)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:455)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:453)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:443)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:259)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:239)
at com.hq.bds.Helloword$$anonfun$main$1.apply$mcVI$sp(Helloword.scala:16)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.hq.bds.Helloword$.main(Helloword.scala:10)
at com.hq.bds.Helloword.main(Helloword.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
You need to link hive-site.xml with spark conf or copy hive-site.xml into spark conf directory. Spark is not
able to find your hive metastore (derby database which is by default), so for that we have to link hive-conf to spark conf direcrtory.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your hive-site.xml file to Spark’s configuration directory ($SPARK_HOME/conf). If you
don’t have an existing Hive installation, Spark SQL will still run.
Sudo to root user and then copy hive-site to spark conf directory.
sudo -u root
cp /etc/hive/conf/hive-site.xml /etc/spark/conf

Not able to read conf file in spark scala

I would like to read a conf file in to my spark application. The conf file is located in Hadoop edge node directory.
omega.conf
username = "surrender"
location = "USA"
My Spark Code :
package com.test.spark
import org.apache.spark.{SparkConf, SparkContext}
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
object DemoMain {
def main(args: Array[String]): Unit = {
println("Lets Get Started ")
val conf = new SparkConf().setAppName("SIMPLE")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val conf_loc = "/home/cloudera/localinputfiles/omega.conf"
loadConfigFile(conf_loc)
}
def loadConfigFile(loc:String):Unit ={
val config = ConfigFactory.parseFile(new File(loc))
val username = config.getString("username")
println(username)
}
}
I am running this spark application using spark-submit
spark-submit --class com.test.spark.DemoMain --master local /home/cloudera/dev/jars/spark_examples.jar
Spark job is initiated ,but it throws me the below error .It says that No configuration setting found for key 'username'
17/03/29 12:57:37 INFO SparkContext: Created broadcast 0 from textFile at DemoMain.scala:25
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'username'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
at com.typesafe.config.impl.SimpleConfig.getString (SimpleConfig.java:197)
at com.test.spark.DemoMain$.loadConfigFile(DemoMain.scala:53)
at com.test.spark.DemoMain$.main(DemoMain.scala:27)
at com.test.spark.DemoMain.main(DemoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Please help me on fixing this issue
I just tried its working fine i test this with below code
val config=ConfigFactory.parseFile(new File("/home/sandy/my.conf"))
println("::::::::::::::::::::"+config.getString("username"))
and conf file is
username = "surrender"
location = "USA"
Please check location of your file by printing it.

ClassNotFoundException for Kafka-Spark cluster

I'm trying to run some .jar file made in scala-Spark 2.0.2. in my Spark-Kafka cluster. The code's here:
import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer,ProducerConfig}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object sparkKafka{
def main(args: Array[String]): Unit = {
if(args.length < 4){
System.err.println("Usage: sparkKafka <zkQuorum><group> <topics> <numThreads>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf()
.setAppName("sparkKafka")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("E:/temp/")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordsCounts = words.map(x => (x, 1L))
.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordsCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
I built .jar file named: kafka-spark.jar and scp-it to my node in spark2 folder so it could read it.
Afterwards I went to start the script with:
bin/spark-submit --class "sparkKafka" --master local[4] kafka-spark.jar hdp2.local:2181 group1 Topic-example 1 -verbose
The error I'm getting is like it said in the head of topic, or ClassNotFoundException: sparkKafka
[root#hdp2 spark2]# bin/spark-submit --class "sparkKafka" --master local[4] kafka-spark.jar hdp2.local:2181 group1 Topic-example 1 -verbose
java.lang.ClassNotFoundException: sparkKafka
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Where am I making mistake? Also I tried with full path to my jar file, but eihter I get that .jar not found or this error above. Also I tried without -v but I think it doesn't make any difference.
It would be great if someone knows where is a problem. Thank you!
Did you try calling it like this
bin/spark-submit --class "sparkKafka" --master local[4] kafka-spark.jar hdp2.local:2181 group1 Topic-example 1 -verbose
If that does not work you might want to extract content of kafka-spark.jar and check if it actually contains sparkKafka class.
Hava you add package in your sparkKafka.scala ?

Spark: java.lang.IllegalArgumentException: Invalid hostname in URI s3:///<bucket-name>

I have written a sample Spark program in Scala to count the number of lines of a text file present in Amazon S3. Below is my sample program.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.util.{Map => JMap}
import org.apache.hadoop.conf.Configuration
object CountLines {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("CountLines").setMaster("local"))
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId","ABC");
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey","XYZ");
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId","ABC");
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","XYX");
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val path ="s3:///my-bucket/test/test.txt";
println("num lines: " + countLines(sc, path));
}
def countLines(sc: SparkContext, path: String): Long = {
sc.textFile(path).count();
}
}
Unfortunately I am getting IllegalArgumentException which has something to do with credentials. Below is the stack trace.
Exception in thread "main" java.lang.IllegalArgumentException: Invalid hostname in URI s3:/my-bucket/test/test.txt
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
I have given valid credentials. I package this as a JAR file and run on the cluster using spark-submit command. I am not sure if this is the right way to set the access key and secret key in spark. I have tried different approaches but nothing seems to work. Throwing some light on this issue would be highly appreciated.
Thanks,
J Joseph
You have an extra slash. You have to change s3:///my-bucket/test/test.txt to s3://my-bucket/test/test.txt.