Can't write to mongodb after mapping in Spark Scala

Can't write to mongodb after mapping in Spark Scala - scala

I have problem when write data to mongo after read and map data.
This is script I use to run the program.
I am using Spark 1.4.0, Scala 2.11.7 and mongo 2.6.10
#!/usr/bin/env bash
SPARK_PATH="/Users/username/spark-1.4.0-bin-hadoop2.6/bin/spark-submit"
CLASS_NAME="com.knx.conversion.ScalaWordCount"
CLUSTER='local[2]'
JARS="/Users/username/spark-1.4.0-bin-hadoop2.6/lib/mongo-hadoop-core-1.4.0.jar,/Users/username/spark-1.4.0-bin-hadoop2.6/lib/mongo-java-driver-3.0.3.jar"
JAR="/Users/username/AggragateConversionFunnel/target/scala-2.11/aggragateconversionfunnel_2.11-1.0.jar"
PROJECT_PATH="/Users/username/AggragateConversionFunnel"
cd ${PROJECT_PATH} && sbt package
${SPARK_PATH} --class ${CLASS_NAME} --master ${CLUSTER} --jars ${JARS} $JAR
and here is the main program here. Just copy from [here][1] and change the input output collection.
package com.knx.conversion
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import org.bson.BasicBSONObject
object ScalaWordCount {
def main(args: Array[String]) {
val sc = new SparkContext("local", "Scala Word Count")
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://127.0.0.1:27017/first-week.interactions")
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/visit_06_2015.output")
val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
// Input contains tuples of (ObjectId, BSONObject)
// Output contains tuples of (null, BSONObject) - ObjectId will be generated by Mongo driver if null
val countsRDD = mongoRDD.flatMap(arg => {
val str = arg._2.get("referer").toString
str.split("h")
})
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
countsRDD.foreach(println)
val saveRDD = countsRDD.map((tuple) => {
val bson = new BasicBSONObject()
bson.put("word", tuple._1)
bson.put("count", tuple._2.toString)
(null, bson)
})
// Only MongoOutputFormat and config are relevant
saveRDD.saveAsNewAPIHadoopFile("file:///bogus", classOf[Any], classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)
}
}
When run I got error
5/07/24 15:53:03 INFO DAGScheduler: Job 0 finished: foreach at ScalaWordCount.scala:39, took 1.111442 s
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at com.knx.conversion.ScalaWordCount$.main(ScalaWordCount.scala:48)
at com.knx.conversion.ScalaWordCount.main(ScalaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/07/24 15:53:03 INFO SparkContext: Invoking stop() from shutdown hook
Just don't know why and how it happened.
[1]: https://github.com/plaa/mongo-spark/blob/master/src/main/scala/ScalaWordCount.scala

This issued is about Scala version that I am currently using is not matching with Spark Scala version.
I am using Scala 2.11.7 to compile and package the jar but Spark 1.4.1 is using Scala 2.10.4.
The answer I found out here.
Then this issue solve by switching version of Scala to 2.10.4.

Related

spark - scala script for wordcount not working Only one SparkContext should be running

I am trying to run a scala code in spark. The wordcount code was copied from internet.
I have the hdfs running.
I keep having the same """Only one SparkContext should be running""" even if I stop de sc
I have also mounted the hdfs and pointed there the file value, but I have the same error.
Don´t know what else could do.
scala> sc.stop()
scala> :load /Users/dvegamar/spark_ej/wordcount.scala
Loading /Users/dvegamar/spark_ej/wordcount.scala...
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
defined object WordCount
scala> WordCount.main()
org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:72)
<init>(<console>:59)
<init>(<console>:63)
<init>(<console>:65)
<init>(<console>:67)
<init>(<console>:69)
<init>(<console>:71)
<init>(<console>:73)
<init>(<console>:75)
<init>(<console>:77)
<init>(<console>:79)
<init>(<console>:81)
<init>(<console>:83)
<init>(<console>:85)
<init>(<console>:87)
<init>(<console>:89)
<init>(<console>:91)
<init>(<console>:93)
<init>(<console>:95)
at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
at WordCount$.main(/Users/dvegamar/spark_ej/wordcount.scala:83)
... 63 elided
The scala code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordCount {
def main() {
//Configuration for a Spark application.
val conf = new SparkConf()
conf.setAppName("SparkWordCount").setMaster("local")
conf.set("spark.driver.allowMultipleContexts", "true")
//Create Spark Context
val sc = new SparkContext(conf)
//Create MappedRDD by reading from HDFS file from path command line parameter
val rdd = sc.textFile("file:///Users/dvegamar/spark_ej/texto_ejemplo.txt")
//WordCount
rdd.flatMap(_.split(" ")).
map((_, 1)).
reduceByKey(_ + _).
map(x => (x._2, x._1)).
sortByKey(false).
map(x => (x._2, x._1)).
saveAsTextFile("SparkWordCountResult")
//stop context,
sc.stop
}
}

Spark scala not able to push data in Hive table

I am try to push data in existing hive table, i have already created orc table in hive not able to push data in hive. this code is work if i copy paste on spark console but not able to run by spark-submit.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object TestCode {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("first example").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
for (i <- 0 to 100 - 1) {
// sample value but it replace with business logic. and try to push into table.for loop consider as business logic.
var fstring = "fstring" + i
var cmd = "cmd" + i
var idpath = "idpath" + i
import sqlContext.implicits._
val sDF = Seq((fstring, cmd, idpath)).toDF("t_als_s_path", "t_als_s_cmd", "t_als_s_pd")
sDF.write.insertInto("l_sequence");
//sDF.write.format("orc").saveAsTable("l_sequence");
println("write data ==> " + i)
}
}
Giving the error.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: l_sequence;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:449)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:455)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:453)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:443)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:259)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:239)
at com.hq.bds.Helloword$$anonfun$main$1.apply$mcVI$sp(Helloword.scala:16)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.hq.bds.Helloword$.main(Helloword.scala:10)
at com.hq.bds.Helloword.main(Helloword.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

You need to link hive-site.xml with spark conf or copy hive-site.xml into spark conf directory. Spark is not
able to find your hive metastore (derby database which is by default), so for that we have to link hive-conf to spark conf direcrtory.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your hive-site.xml file to Spark’s configuration directory ($SPARK_HOME/conf). If you
don’t have an existing Hive installation, Spark SQL will still run.
Sudo to root user and then copy hive-site to spark conf directory.
sudo -u root
cp /etc/hive/conf/hive-site.xml /etc/spark/conf

Not able to read conf file in spark scala

I would like to read a conf file in to my spark application. The conf file is located in Hadoop edge node directory.
omega.conf
username = "surrender"
location = "USA"
My Spark Code :
package com.test.spark
import org.apache.spark.{SparkConf, SparkContext}
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
object DemoMain {
def main(args: Array[String]): Unit = {
println("Lets Get Started ")
val conf = new SparkConf().setAppName("SIMPLE")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val conf_loc = "/home/cloudera/localinputfiles/omega.conf"
loadConfigFile(conf_loc)
}
def loadConfigFile(loc:String):Unit ={
val config = ConfigFactory.parseFile(new File(loc))
val username = config.getString("username")
println(username)
}
}
I am running this spark application using spark-submit
spark-submit --class com.test.spark.DemoMain --master local /home/cloudera/dev/jars/spark_examples.jar
Spark job is initiated ,but it throws me the below error .It says that No configuration setting found for key 'username'
17/03/29 12:57:37 INFO SparkContext: Created broadcast 0 from textFile at DemoMain.scala:25
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'username'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
at com.typesafe.config.impl.SimpleConfig.getString (SimpleConfig.java:197)
at com.test.spark.DemoMain$.loadConfigFile(DemoMain.scala:53)
at com.test.spark.DemoMain$.main(DemoMain.scala:27)
at com.test.spark.DemoMain.main(DemoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Please help me on fixing this issue

I just tried its working fine i test this with below code
val config=ConfigFactory.parseFile(new File("/home/sandy/my.conf"))
println("::::::::::::::::::::"+config.getString("username"))
and conf file is
username = "surrender"
location = "USA"
Please check location of your file by printing it.

ClassNotFoundException for Kafka-Spark cluster

I'm trying to run some .jar file made in scala-Spark 2.0.2. in my Spark-Kafka cluster. The code's here:
import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer,ProducerConfig}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object sparkKafka{
def main(args: Array[String]): Unit = {
if(args.length < 4){
System.err.println("Usage: sparkKafka <zkQuorum><group> <topics> <numThreads>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf()
.setAppName("sparkKafka")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("E:/temp/")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordsCounts = words.map(x => (x, 1L))
.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordsCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
I built .jar file named: kafka-spark.jar and scp-it to my node in spark2 folder so it could read it.
Afterwards I went to start the script with:
bin/spark-submit --class "sparkKafka" --master local[4] kafka-spark.jar hdp2.local:2181 group1 Topic-example 1 -verbose
The error I'm getting is like it said in the head of topic, or ClassNotFoundException: sparkKafka
[root#hdp2 spark2]# bin/spark-submit --class "sparkKafka" --master local[4] kafka-spark.jar hdp2.local:2181 group1 Topic-example 1 -verbose
java.lang.ClassNotFoundException: sparkKafka
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Where am I making mistake? Also I tried with full path to my jar file, but eihter I get that .jar not found or this error above. Also I tried without -v but I think it doesn't make any difference.
It would be great if someone knows where is a problem. Thank you!

Did you try calling it like this
bin/spark-submit --class "sparkKafka" --master local[4] kafka-spark.jar hdp2.local:2181 group1 Topic-example 1 -verbose
If that does not work you might want to extract content of kafka-spark.jar and check if it actually contains sparkKafka class.

Hava you add package in your sparkKafka.scala ?

Spark: java.lang.IllegalArgumentException: Invalid hostname in URI s3:///<bucket-name>

I have written a sample Spark program in Scala to count the number of lines of a text file present in Amazon S3. Below is my sample program.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.util.{Map => JMap}
import org.apache.hadoop.conf.Configuration
object CountLines {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("CountLines").setMaster("local"))
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId","ABC");
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey","XYZ");
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId","ABC");
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","XYX");
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val path ="s3:///my-bucket/test/test.txt";
println("num lines: " + countLines(sc, path));
}
def countLines(sc: SparkContext, path: String): Long = {
sc.textFile(path).count();
}
}
Unfortunately I am getting IllegalArgumentException which has something to do with credentials. Below is the stack trace.
Exception in thread "main" java.lang.IllegalArgumentException: Invalid hostname in URI s3:/my-bucket/test/test.txt
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
I have given valid credentials. I package this as a JAR file and run on the cluster using spark-submit command. I am not sure if this is the right way to set the access key and secret key in spark. I have tried different approaches but nothing seems to work. Throwing some light on this issue would be highly appreciated.
Thanks,
J Joseph

You have an extra slash. You have to change s3:///my-bucket/test/test.txt to s3://my-bucket/test/test.txt.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Can't write to mongodb after mapping in Spark Scala - scala

This issued is about Scala version that I am currently using is not matching with Spark Scala version. I am using Scala 2.11.7 to compile and package the jar but Spark 1.4.1 is using Scala 2.10.4. The answer I found out here. Then this issue solve by switching version of Scala to 2.10.4.

Related

spark - scala script for wordcount not working Only one SparkContext should be running

Spark scala not able to push data in Hive table

Not able to read conf file in spark scala

ClassNotFoundException for Kafka-Spark cluster

Spark: java.lang.IllegalArgumentException: Invalid hostname in URI s3:///<bucket-name>

Categories

Resources