Executing Spark scala program after compilation - scala

I have compiled a Spark Scala program on the command line. Now I want to execute it, but I don't want to use Maven or sbt. I have used this command to execute the program:
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf

/** Create an RDD of lines from a text file, and keep count of
  * how often each word appears.
  */
object wordcount1 {

  def main(args: Array[String]) {
    // Set up a SparkContext named WordCount that runs locally using
    // all available cores.
    println("before conf")
    val conf = new SparkConf().setAppName("WordCount")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    println("after the textfile")

    // Create an RDD of lines of text in our book
    val input = sc.textFile("book.txt")
    println("after the textfile")

    // Use flatMap to convert this into an RDD of each word in each line
    val words = input.flatMap(line => line.split(' '))

    // Convert these words to lowercase
    val lowerCaseWords = words.map(word => word.toLowerCase())

    // Count up the occurrence of each unique word
    println("before text file")
    val wordCounts = lowerCaseWords.countByValue()

    // Print the first 20 results
    val sample = wordCounts.take(20)
    for ((word, count) <- sample) {
      println(word + " " + count)
    }

    sc.stop()
  }
}
The error points to this line:
val conf = new SparkConf().setAppName("WordCount")
Any help?

Starting from Spark 2.0, the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.appName("App Name")
.getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext.textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
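Putting those pieces together, a minimal sketch of your word count on top of SparkSession could look like this (a sketch only; it assumes book.txt is still read from the working directory, as in your original code):
import org.apache.spark.sql.SparkSession

object wordcount1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // The underlying SparkContext is still available for RDD work
    val input = spark.sparkContext.textFile("book.txt")

    val wordCounts = input
      .flatMap(line => line.split(' '))
      .map(_.toLowerCase)
      .countByValue()

    // Print the first 20 results
    wordCounts.take(20).foreach { case (word, count) =>
      println(word + " " + count)
    }

    spark.stop()
  }
}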
I suggest you have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar

Related

Issues with Spark and Salesforce Connection

I am trying to load a table from Salesforce using Spark. I invoked this code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._

object Sample {
  def main(arg: Array[String]) {
    val spark = SparkSession.builder().
      appName("salesforce").
      master("local[*]").
      getOrCreate()

    val tableName = "Opportunity"
    val outputPath = "output/result" + tableName

    val salesforceDf = spark.
      read.
      format("jdbc").
      option("url", "jdbc:datadirect:sforce://login.salesforce.com;").
      option("driver", "com.ddtek.jdbc.sforce.SForceDriver").
      option("dbtable", tableName).
      option("user", "").
      option("password", "xxxxxxxxx").
      option("securitytoken", "xxxxx")
      .load()

    salesforceDf.createOrReplaceTempView("Opportunity")
    spark.sql("select * from Opportunity").collect.foreach(println)

    // save the result
    salesforceDf.write.save(outputPath)
  }
}
And the docs I was referring to said to start a spark shell as:
spark-shell --jars /path_to_driver/sforce.jar
This printed a lot of lines in the terminal, and this was the last line:
22/07/12 14:57:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a12b060d-5c82-4283-b2b9-53f9b3863b53
And then, to submit the Spark job:
spark-submit --jars sforce.jar --class <Your class name> your jar file
However, I am not sure where this jar file is, whether it was actually created, or how to submit it. Any help is appreciated, thank you.
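For reference only, with the class shown above the submit step would typically look something like this (the jar path is an assumption on my part: sbt package usually writes it under target/scala-&lt;scala-version&gt;/, and the fully qualified class name is just Sample because the object is not in a package):
spark-submit \
  --jars /path_to_driver/sforce.jar \
  --class Sample \
  target/scala-2.12/yourproject_2.12-0.1.jar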

Json argument in Spark submit

My spark-submit command:
spark-submit --deploy-mode cluster --class spark_package.import_jar s3://test-system/test.jar "{\"localparameter\" : {\"mail\": \"\", \"clusterid\": \"test\", \"clientCd\": \"1000\", \"processid\": \"1234\"} }"
Here I want to pass clientCd as a parameter to my Scala code.
My Scala code:
package Spark_package

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SampleFile {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.master("local[*]").appName("SampleFile").getOrCreate()
    val sc = spark.sparkContext
    val conf = new SparkConf().setAppName("SampleFile")
    val sqlContext = spark.sqlContext

    val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("s3a://test-system/data/*.gz")
    df.createOrReplaceTempView("data")

    val res = spark.sql("select count(*) from data where client_cd = $clientCd")
    res.coalesce(1).write.format("csv").option("header","true").mode("Overwrite").save("s3a://dev-system/bkup/")

    spark.stop()
  }
}
My question is how to pass clientCd as a parameter into this line:
val res = spark.sql("select count(*) from data where client_cd = $clientCd")
Kindly help me on this.
Append all program arguments at the end of the spark-submit command; they will be available in args in main, e.g.:
spark-submit --class xxx --deploy-mode cluster xxx.jar arg1 arg2
Then you can parse arg1 with a JSON unmarshaller.
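As a concrete illustration (a sketch only, assuming the JSON shape from the spark-submit command above; json4s is bundled with Spark, so no extra dependency is needed), the driver can pull clientCd out of args(0) and interpolate it into the query:
package Spark_package

import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods._

object SampleFile {
  def main(args: Array[String]): Unit = {
    implicit val formats: Formats = DefaultFormats

    // args(0) is the JSON string appended after the jar in spark-submit
    val localParams = parse(args(0)) \ "localparameter"
    val clientCd    = (localParams \ "clientCd").extract[String]

    val spark = SparkSession.builder.appName("SampleFile").getOrCreate()

    val df = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://test-system/data/*.gz")
    df.createOrReplaceTempView("data")

    // Note the s prefix: without it, $clientCd is sent to SQL literally
    val res = spark.sql(s"select count(*) from data where client_cd = '$clientCd'")
    res.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save("s3a://dev-system/bkup/")

    spark.stop()
  }
}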

Spark MLLib unable to write out to S3 : path already exists

I have data in an S3 bucket under the directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in directory /data/spark
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.), ensuring that they do not exist before the step is run. Even then it is not working.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite mode enabled:
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or, if you insist on using RDD methods, you can do as described in this answer.
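A slightly fuller sketch of that DataFrame route (assuming a SparkSession named spark is in scope in the converter; .text is used here so the output stays line-oriented like saveAsTextFile):
// RDD[String] -> single-column DataFrame, written with overwrite
// so an existing S3 path no longer aborts the job
import spark.implicits._

val coalescedSparkData = sparkdata.coalesce(100)
coalescedSparkData
  .toDF("line")          // one string column
  .write
  .mode("overwrite")     // replaces s3a://mybucket/data/spark if it already exists
  .text(outputPath)      // plain-text output, one line per record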

How to make Spark slaves use HDFS input files 'local' to them in a Hadoop+Spark cluster?

I have a cluster of 9 computers with Apache Hadoop 2.7.2 and Spark 2.0.0 installed on them. Each computer runs an HDFS datanode and Spark slave. One of these computers also runs an HDFS namenode and Spark master.
I've uploaded a few TBs of gz-archives to HDFS with Replication=2. It turned out that some of the archives are corrupt, and I want to find them. It looks like 'gunzip -t' can help. So I'm trying to find a way to run a Spark application on the cluster so that each Spark executor tests, as far as possible, archives that are 'local' to it (i.e. that have one of their replicas located on the same computer where that executor runs). The following script runs, but sometimes Spark executors process 'remote' files in HDFS:
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchivesTester' in spark.pom
// and placing this jar file into Spark's home directory):
//   ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
// means testing for corruption the gz-archives in the directory hdfs://LV-WS10.lviv:9000/commoncrawl
// using a Spark cluster with the Spark master URL spark://LV-WS10.lviv:7077 and 9 Spark slaves

package com.qbeats.cortex

import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.FileSplit
import org.apache.spark.rdd.HadoopRDD
import org.apache.spark.{SparkContext, SparkConf, AccumulatorParam}
import sys.process._

object CommoncrawlArchivesTester extends App {

  object LogAccumulator extends AccumulatorParam[String] {
    def zero(initialValue: String): String = ""
    def addInPlace(log1: String, log2: String) = if (log1.isEmpty) log2 else log1 + "\n" + log2
  }

  override def main(args: Array[String]): Unit = {
    if (args.length >= 3) {
      val appName = "CommoncrawlArchivesTester"
      val conf = new SparkConf().setAppName(appName).setMaster(args(0))
      conf.set("spark.executor.memory", "6g")
      conf.set("spark.shuffle.service.enabled", "true")
      conf.set("spark.dynamicAllocation.enabled", "true")
      conf.set("spark.dynamicAllocation.initialExecutors", args(2))
      val sc = new SparkContext(conf)

      val log = sc.accumulator(LogAccumulator.zero(""))(LogAccumulator)

      val text = sc.hadoopFile(args(1), classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]

      val fileAndLine = hadoopRdd.mapPartitionsWithInputSplit { (inputSplit, iterator) =>
        val fileName = inputSplit.asInstanceOf[FileSplit].getPath.toString

        class FilePath extends Iterable[String] {
          def iterator = List(fileName).iterator
        }

        val result = (sys.env("HADOOP_PREFIX") + "/bin/hadoop fs -cat " + fileName) #| "gunzip -t" !

        println("Processed %s.".format(fileName))
        if (result != 0) {
          log.add(fileName)
          println("Corrupt: %s.".format(fileName))
        }

        (new FilePath).iterator
      }

      val result = fileAndLine.collect()

      println("Corrupted files:")
      println(log.value)
    }
  }
}
What would you suggest?
ADDED LATER:
I tried another script which reads the files from HDFS via textFile(). It looks like a Spark executor doesn't prefer the input files that are 'local' to it. Doesn't that contradict "Spark brings code to data, not data to code"?
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchiveLinesCounter' in spark.pom):
//   ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9

package com.qbeats.cortex

import org.apache.spark.{SparkContext, SparkConf}

object CommoncrawlArchiveLinesCounter extends App {

  override def main(args: Array[String]): Unit = {
    if (args.length >= 3) {
      val appName = "CommoncrawlArchiveLinesCounter"
      val conf = new SparkConf().setAppName(appName).setMaster(args(0))
      conf.set("spark.executor.memory", "6g")
      conf.set("spark.shuffle.service.enabled", "true")
      conf.set("spark.dynamicAllocation.enabled", "true")
      conf.set("spark.dynamicAllocation.initialExecutors", args(2))
      val sc = new SparkContext(conf)

      val helper = new Helper
      val nLines = sc.
        textFile(args(1) + "/*").
        mapPartitionsWithIndex( (index, it) => {
          println("Processing partition %s".format(index))
          it
        }).
        count

      println(nLines)
    }
  }
}
I've solved the problem by switching from Spark’s standalone mode to YARN.
Related topic: How does Apache Spark know about HDFS data nodes?
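For reference, a sketch of what the YARN submission could look like (the spark.locality.wait setting and passing the literal yarn as the first program argument are my assumptions, added because the code calls setMaster(args(0)) and spark.locality.wait controls how long the scheduler waits for a node-local slot before falling back to a remote one):
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.locality.wait=10s \
  spark-cortex-fat.jar yarn hdfs://LV-WS10.lviv:9000/commoncrawl 9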

Run Scala Program with Spark on Hadoop

I have created a Scala program that searches for a word in a text file.
I created the Scala file with Eclipse, then compiled it and built a jar with sbt and sbt assembly. After that I ran the .jar with Spark locally and it ran correctly.
Now I want to try running this program using Spark on Hadoop; I have 1 master and 2 worker machines.
Do I have to change the code? And what command do I run from the shell of the master?
I have created a bucket and I have put the text file in Hadoop.
This is my code:
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object wordcount {
  def main(args: Array[String]) {
    // set spark context
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val distFile = sc.textFile("bible.txt")

    print("Enter word to look for in the HOLY BIBLE: ")
    val word = Console.readLine
    var count = 0;
    var finalCount = 0;
    println("You entered " + word)

    val input = sc.textFile("bible.txt")
    val splitedLines = input.flatMap(line => line.split(" "))
      .filter(x => x.equals(word))

    System.out.println("The word " + word + " appear " + splitedLines.count())
  }
}
Thanks all
Just change the following line,
val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
to
val conf = new SparkConf().setAppName("wordcount")
This way you don't have to modify the code whenever you want to switch from local mode to cluster mode. The master option can be passed via the spark-submit command as follows:
spark-submit --class wordcount --master <master-url> wordcount.jar
and if you want to run your program locally, use the following command:
spark-submit --class wordcount --master local[*] wordcount.jar
Here is the list of master options that you can set while running the application.
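One more detail worth checking when you move to the cluster (my addition, not part of the original answer): a relative path such as bible.txt depends on the default filesystem configuration, so once the file lives in HDFS it is safer to point textFile at a fully qualified HDFS URL, for example:
// Hypothetical path: replace the namenode host, port and path with your cluster's values
val input = sc.textFile("hdfs://namenode-host:9000/user/you/bible.txt")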