I need to run a simple word count on an HDInsight cluster in Azure. I have created a cluster with Hadoop and Spark and I already have the jar file with the code. The problem is that I don't know how to set up the cluster or which command line to use to launch Spark on Azure. I want to try different numbers of worker nodes (2, 4, 8) to see how the program scales.
Every time I launch the app with spark-submit in yarn-client mode, it works, but always with 2 executors and 1 core, taking around 3 minutes for a 1 GB input text file. Even if I request more executors and more cores, it accepts the settings but does not use them. So I think the problem is with the RDD: it doesn't split the input file properly, because it creates only 2 tasks that start on 2 worker nodes while the other nodes remain inactive.
The jar file is built with sbt package.
Command to launch Spark:
spark-submit --class "SimpleApp" --master yarn-client --num-executors 2 simpleapp_2.10-1.0.jar
WordCount Code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.io._
import org.apache.hadoop.fs
import org.apache.spark.rdd.RDD
object SimpleApp {
  def main(args: Array[String]) {
    // set up the Spark context
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
    // word to search for
    val word = "word"
    // start the overall timer
    val now = System.nanoTime
    // input file
    val input = sc.textFile("wasb://xxx@storage.blob.core.windows.net/dizionario1gb.txt")
    // word lookup: split each line into words and count the matches
    val words = input.flatMap(line => line.split(" "))
    val find = System.nanoTime
    val tot = words.filter(x => x.equals(word)).count()
    val w = (System.nanoTime - find) / 1000000
    val rw = (System.nanoTime - now) / 1000000
    // append the result of the execution to a text file
    val writer = new FileWriter("D:\\Users\\user\\Desktop\\File\\output.txt", true)
    try {
      writer.write("Word found " + tot + " time total " + rw + " ms time search " + w + " time read " + (rw - w) + "\n")
    }
    finally writer.close()
    // stop the Spark context
    sc.stop()
  }
}
Level of Parallelism
"Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc) You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster."
Source:
https://spark.apache.org/docs/1.3.1/tuning.html
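As a rough illustration of the "2-3 tasks per CPU core" guideline quoted above, the arithmetic can be sketched in plain Scala. The object name, the helper names, and the figure of 2 cores per executor are assumptions made for this example, not values taken from the question's cluster.

```scala
// Worked arithmetic for the "2-3 tasks per CPU core" guideline.
// All names and the cores-per-executor value are hypothetical.
object ParallelismHint {
  // Total cores available to the application across all executors.
  def totalCores(numExecutors: Int, coresPerExecutor: Int): Int =
    numExecutors * coresPerExecutor

  // Suggested number of partitions (tasks): total cores times tasks per core.
  def suggestedPartitions(numExecutors: Int, coresPerExecutor: Int, tasksPerCore: Int = 3): Int =
    totalCores(numExecutors, coresPerExecutor) * tasksPerCore

  def main(args: Array[String]): Unit = {
    // Executor counts matching the 2-4-8 worker experiment from the question.
    for (executors <- Seq(2, 4, 8)) {
      val partitions = suggestedPartitions(executors, coresPerExecutor = 2)
      println(s"executors=$executors coresPerExecutor=2 suggestedPartitions=$partitions")
    }
  }
}
```

The resulting number could then be passed as the second argument of sc.textFile(path, minPartitions), or set through spark.default.parallelism, together with matching --num-executors and --executor-cores values on the spark-submit command line. Whether this removes the two-task limit seen in the question would still need to be checked on the actual cluster.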
I would like to run Scala code in Zeppelin on a Spark cluster.
For example:
This is the code, stored in HDFS on the Spark cluster, in "HelloWorldScala.scala":
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object HelloWorldScala {
  def main(arg: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("myApp_Enrico")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    val aList = List(1,2,3,4,5,6,7,8,9,10)
    val aRdd = spark.sparkContext.parallelize(aList)
    println("********* HELLO WORLD AND HELLO SPARK!! ******")
    println("Print even numbers")
    aRdd.filter(x=>x%2==0).map(x=>x*2).collect().foreach(println)
  }
}
I would like to import the HelloWorldScala file in Zeppelin and run main, but I get an error:
[Screenshot: Zeppelin error output]
Unfortunately, you can't import a single file in Zeppelin. You can package your Scala files into a .jar library and make it available through spark.jars (set as a property in Spark). After that you can import your library with a line like import your.library.packages.YourClass and use its non-private functions. If you are not familiar with jar packages and the spark.jars property, read a bit more about them first.
UPDATE:
%dep
z.load("your_package_group:artifact:version")
%spark
import com.yourpackage.HelloWorldScala
I'm using Sublime to write my first Scala program, and I'm using the terminal to run it.
First I compile it with the command scalac assignment2.scala, but it shows the error message: "error: object apache is not a member of package org"
How can I fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("assignment2")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    println(result.collect().mkString(","))
  }
}
Where are you trying to submit the job? To run any Spark application you need to submit it with bin/spark-submit from your Spark installation directory, or you need to have SPARK_HOME set in your environment, which you can refer to when submitting.
Actually, you can't run a Spark Scala file directly, because compiling your Scala class requires the Spark library. So to execute the Scala file you need spark-shell. To execute your Spark Scala file inside spark-shell, follow the steps below:
Open spark-shell with the following command:
'spark-shell --master yarn-client'
Load your file using its exact location:
':load File_Name_With_Absolute_path'
Run your main method using the class name: 'ClassName.main(null)'
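The scalac error comes from the Spark library not being on the compile classpath. One possible way to provide it, instead of loading the file into spark-shell, is an sbt build. The following build.sbt is only a sketch: the project name, the Scala version, and the Spark version are assumptions and should be replaced with the versions of the actual installation.

```scala
// build.sbt (sketch; name and versions below are assumptions)
name := "assignment2"

version := "1.0"

// Must match the Scala version your Spark distribution was built with.
scalaVersion := "2.11.12"

// "provided": needed to compile, but supplied by spark-submit at run time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"
```

With such a file next to assignment2.scala, sbt package should produce a jar under target/ that can be passed to spark-submit with --class assignment2.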
I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna and created a Maven project in Eclipse. I was able to run the program within Eclipse using the Run As configuration option and got the output successfully. But when I created the jar for the following program and tried to run it from the Spark shell, I got the following error.
error: ';' expected but 'class' found.
Command used to run the jar:
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program I tried:
package com.kirthi.spark.proj.sparkexamples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordsCount {
  def main(args: Array[String]){
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val words = textFile.flatMap(line => line.split(","))
    val counts = words.map(word => (word,1))
    val wordcount = counts.reduceByKey(_+_)
    val wordcount_sorted = wordcount.sortByKey()
    wordcount_sorted.foreach(println)
    wordcount_sorted.saveAsTextFile(args(1))
  }
}
Kindly help me out, as I am stuck on my first Spark program.
I am using Cloudera QuickStart CDH 5.5.
As shown in the comments, you ran the above command in the Scala REPL, whereas you should run it from a regular Linux shell.
I have the following worksheet in IntelliJ:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
val conf = new SparkConf().
setAppName("Scala Wooksheet").
setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("/Users/someuser/some.json")
df.show
This code works in the REPL, but seems to run only the first time (with some other errors). Each subsequent time, the error is:
16/04/13 11:04:57 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
How can I find the context already in use?
Note: I hear others say to use conf.set("spark.driver.allowMultipleContexts","true") but this seems to be a solution of increasing memory usage (like uncollected garbage).
Is there a better way?
I was having the same problem trying to get code executed with Spark in Scala Worksheet in IntelliJ IDEA (CE 2016.3.4).
The solution for the duplicate Spark context creation was to uncheck 'Run worksheet in the compiler process' checkbox in Settings -> Languages and Frameworks -> Scala -> Worksheet. I have also tested the other Worksheet settings and they had no effect on the problem of duplicate Spark context creation.
I also did not put sc.stop() in the Worksheet.
But I had to set master and appName parameters in the conf for it to work.
Here is the Worksheet version of the code from SimpleApp.scala from Spark Quick Start
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("Simple Application")
val sc = new SparkContext(conf)
val logFile = "/opt/spark-latest/README.md"
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
I have used the same simple.sbt from the guide for importing the dependencies to IntelliJ IDEA.
[Screenshot: the functioning Scala Worksheet with Spark]
UPDATE for IntelliJ CE 2017.1 (Worksheet in REPL mode)
In 2017.1 IntelliJ introduced REPL mode for Worksheet. I have tested the same code with the 'Use REPL' option checked. For this mode to run, the 'Run worksheet in the compiler process' checkbox in the Worksheet settings described above must stay checked (it is by default).
The code runs fine in Worksheet REPL mode.
[Screenshot: the Worksheet running in REPL mode]
As detectivebag stated in this git post you can fix this problem by switching your worksheets to run in only 'eclipse compatibility mode':
1) open preferences
2) under Languages and Frameworks select scala
3) under the worksheet tab uncheck everything except 'Use "eclipse compatibility" mode'
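On the original question of finding the context already in use: Spark 1.4 and later provide SparkContext.getOrCreate, and Spark 1.5 and later provide SQLContext.getOrCreate, both of which return the existing instance in the JVM instead of constructing a second one. The sketch below assumes such a Spark version is on the classpath; the object and method names are made up for the example, and it has not been tried inside an IntelliJ worksheet.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical helper: reuse the running SparkContext if there is one.
object ReuseContext {
  def context(): SparkContext = {
    val conf = new SparkConf()
      .setAppName("Scala Worksheet")
      .setMaster("local[*]")
    // Returns the active context if one exists, otherwise creates it from conf.
    SparkContext.getOrCreate(conf)
  }

  def main(args: Array[String]): Unit = {
    val sc = context()
    val again = context()
    // Both calls should hand back the same instance.
    println(s"same context: ${sc eq again}")
    // The SQLContext can be reused in the same way.
    val sqlContext = SQLContext.getOrCreate(sc)
    println(sqlContext.sparkContext eq sc)
    sc.stop()
  }
}
```

This might avoid the "Another SparkContext is being constructed" warning without setting spark.driver.allowMultipleContexts, provided the worksheet is re-evaluated inside the same JVM.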
I built a Spark application to count the number of words in a file. I run the application on the Cloudera QuickStart VM. Everything is fine when I use the cloudera user directory, but when I want to read or write in another user's directory I get a permission denied error from Hadoop. I would like to know how to change the Hadoop user in Spark.
package user1.item1
import user1.{Article}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._
import scala.util.{Try, Success, Failure}
object WordCount {
  def main(args: Array[String]) {
    // what I would like to do (pseudocode): Context.User = "espacechange"
    val filename = "hdfs://quickstart.cloudera:8020/user/user1/test/wiki_test/wikipedia.txt"
    val conf = new SparkConf().setAppName("word count")
    val sc = new SparkContext(conf)
    val wikipedia = sc.textFile(filename).map(Article.parseWikipediaArticle)
    val counts = wikipedia.flatMap(line => line.text.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/user1/test/word_count")
  }
}
It depends on your cluster's authentication. By default, you can set the following environment variable:
$ export HADOOP_USER_NAME=hdfs
Try the above before submitting the Spark job.
You need to launch the spark-submit script as a different OS user.
For example, use the following command to run the Spark application as (and with the permissions of) the hdfs user:
sudo -u hdfs spark-submit ....
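If the user has to be chosen from inside the program rather than from the shell, a JVM system property with the same name as the environment variable may also work. As far as I know, Hadoop's simple (non-Kerberos) authentication checks the HADOOP_USER_NAME environment variable first and then falls back to a system property of that name. This is an assumption to verify on your cluster. The object and method names below are hypothetical.

```scala
// Hypothetical helper; only relevant for clusters using simple authentication.
object HadoopUser {
  private val Key = "HADOOP_USER_NAME"

  // Requests `user` as the Hadoop user and returns the name that should take effect.
  // An environment variable, if present, takes precedence and is left untouched.
  def runAs(user: String): String =
    sys.env.get(Key) match {
      case Some(fromEnv) => fromEnv
      case None =>
        System.setProperty(Key, user)
        user
    }

  def main(args: Array[String]): Unit = {
    // Call this before creating the SparkContext or touching HDFS.
    val effective = runAs("espacechange")
    println(s"Hadoop user requested for this JVM: $effective")
  }
}
```

The call has to happen before the SparkContext is created, because the Hadoop login user is normally resolved once per JVM. On a Kerberos-secured cluster this setting is ignored.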