I am using Spark Job Server to submit Spark jobs to a cluster. The application I am trying to test is a
Spark program based on Sansa query and the Sansa stack. Sansa is used for scalable processing of huge amounts of RDF data, and Sansa query is one of the Sansa libraries, used for querying RDF data.
When I run the application with the spark-submit command it works as expected. But when I run the program through Spark Job Server, the application fails most of the time with the exception below.
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown hook
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event, stopping job manager.
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at http://10.138.32.96:46627
20/05/29 18:57:00 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com, executor 1, partition 3, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms on us1salxhpw0653.corpnet2.com (executor 1) (1/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64, us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor 2) (2/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition 5, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on us1salxhpw0670.corpnet2.com (executor 3) (3/560)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_4 in memory on us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at SansaQueryExample.scala:32, took 0.732943 s
20/05/29 18:57:00 INFO DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32) failed in 0.556 s due to Stage cancelled because SparkContext was shut down
20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1922)
    at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
Code used for direct execution:
object SansaQueryExampleWithoutSJS {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sansa stack example").getOrCreate()
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = spark.rdf(lang)(input)
    graphRdd.collect().foreach(println)
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
  }
}
Code integrated with Spark Job Server:
object SansaQueryExample extends SparkSessionJob {
  override type JobData = Seq[String]
  override type JobOutput = collection.Map[String, Long]

  override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }

  override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = sparkSession.rdf(lang)(input)
    graphRdd.collect().foreach(println)
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
    sparkSession.sparkContext.parallelize(data).countByValue
  }
}
The steps for executing an application via Spark Job Server are explained here; mainly:
upload the jar into SJS through its REST API
create a Spark context with the required memory and cores, through another API call
execute the job via a third API call, referencing the jar and the context already created
When I observed different executions of the program, I could see that Spark Job Server behaves inconsistently: the program works on a few occasions without any errors. I also observed that the SparkContext is being shut down for some unknown reason. I am using SJS 0.8.0, Sansa 0.7.1 and Spark 2.4.
Related
I'm using Spark/Scala to generate about 10 million people in a CSV file on HDFS by randomly combining two CSV files containing first and last names, adding a random date of birth between 1920 and now, a creation date, and a counter.
I'm running into a bit of a problem: with a for loop everything works correctly, but in that case the loop runs only on the driver, which seems fine up to 1 million records, while generating 10 million takes about 10 minutes longer. So I decided to create a range with 10 million items so I could use map and utilize the cluster. I got the following code:
package ebicus

import org.apache.spark._
import org.joda.time.{DateTime, Interval, LocalDateTime}
import org.joda.time.format.DateTimeFormat
import java.util.Random

object main_generator_spark {
  val conf = new SparkConf().setAppName("Generator")
  val sc = new SparkContext(conf)
  val rs = new Random

  val file = sc.textFile("hdfs://host:8020/user/firstname")
  val fnames = file.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
  val fnames_ar = fnames.collect()
  val fnames_size = fnames_ar.length
  val firstnames = sc.broadcast(fnames_ar)

  val file2 = sc.textFile("hdfs://host:8020/user/lastname")
  val lnames = file2.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
  val lnames_ar = lnames.collect()
  val lnames_size = lnames_ar.length
  val lastnames = sc.broadcast(lnames_ar)

  val range_val = sc.range(0, 100000000, 1, 20)
  val rddpersons = range_val.map(x =>
    (x.toString,
     new DateTime().toString("y-M-d::H:m:s"),
     fnames_ar(rs.nextInt(fnames_size)),   // <-- error at line 77 of the stack trace below
     lnames_ar(rs.nextInt(lnames_size)),
     makeGebDate
    )
  )

  def makeGebDate(): String = {
    lazy val start = new DateTime(1920, 1, 1, 0, 0, 0)
    lazy val end = new DateTime().minusYears(18)
    lazy val hours = (new Interval(start, end).toDurationMillis() / (1000 * 60 * 60)).toInt
    start.plusHours(rs.nextInt(hours)).toString("y-MM-dd")
  }

  def main(args: Array[String]): Unit = {
    rddpersons.saveAsTextFile("hdfs://host:8020/user/output")
  }
}
The code works fine when I use the spark-shell, but when I try to run the script with spark-submit (I'm using Maven to build):
spark-submit --class ebicus.main_generator_spark --num-executors 16 --executor-cores 4 --executor-memory 2G --driver-cores 2 --driver-memory 10g /u01/stage/mvn_test-0.0.2.jar
I get the following error:
16/06/16 11:17:29 INFO DAGScheduler: Final stage: ResultStage 2(saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO DAGScheduler: Parents of final stage: List()
16/06/16 11:17:29 INFO DAGScheduler: Missing parents: List()
16/06/16 11:17:29 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93), which has no missing parents
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(140536) called with curMem=1326969, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 137.2 KB, free 5.2 GB)
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(48992) called with curMem=1467505, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 47.8 KB, free 5.2 GB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.29.7.4:51642 (size: 47.8 KB, free: 5.2 GB)
16/06/16 11:17:29 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
16/06/16 11:17:29 INFO DAGScheduler: Submitting 20 missing tasks from ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO YarnScheduler: Adding task set 2.0 with 20 tasks
16/06/16 11:17:29 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, cloudera-001.fusion.ebicus.com, partition 0,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, cloudera-003.fusion.ebicus.com, partition 1,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, cloudera-001.fusion.ebicus.com, partition 2,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, cloudera-003.fusion.ebicus.com, partition 3,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 12, cloudera-001.fusion.ebicus.com, partition 4,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 5.0 in stage 2.0 (TID 13, cloudera-003.fusion.ebicus.com, partition 5,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com, partition 6,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 7.0 in stage 2.0 (TID 15, cloudera-003.fusion.ebicus.com, partition 7,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-003.fusion.ebicus.com:52334 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-001.fusion.ebicus.com:53110 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 8.0 in stage 2.0 (TID 16, cloudera-001.fusion.ebicus.com, partition 8,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 9.0 in stage 2.0 (TID 17, cloudera-001.fusion.ebicus.com, partition 9,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com): java.lang.NoClassDefFoundError: Could not initialize class ebicus.main_generator_spark$
at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:77)
at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:74)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1205)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Am I making some kind of fundamental thinking error? I would be happy if someone could point me in the right direction.
Edit: I'm using cloudera 5.6.0, spark 1.5.0, scala 2.10.6, yarn 2.10, joda-time 2.9.4
Edit2: Added conf & sc
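For what it's worth, the range-to-records idea itself can be sketched with plain Scala collections; the name arrays, the fixed seed, and the record shape below are illustrative stand-ins for the real HDFS data, not the actual code:

```scala
import scala.util.Random

// Illustrative stand-ins for the collected first/last name arrays.
val firstNames = Array("Anna", "Bart", "Chris")
val lastNames  = Array("Jansen", "Bakker", "Visser")
val rnd = new Random(42) // fixed seed so the sketch is deterministic

// Mirror range_val.map(...): each index in the range becomes one synthetic record.
val persons = (0 until 10).map { i =>
  (i.toString,
   firstNames(rnd.nextInt(firstNames.length)),
   lastNames(rnd.nextInt(lastNames.length)))
}
```

On a cluster, the same map over `sc.range(...)` runs per-partition on the executors, which is the whole point of moving away from a driver-side loop.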
I have written a Spark job (Spark 1.3, Cloudera 5.4) which loops through an Avro file and, for each record, issues a HiveContext query:
val path = "/user/foo/2016/03/07/ALL"
val fw1 = new FileWriter("/home.nfs/Foo/spark-query-result.txt", false)

val conf = new SparkConf().setAppName("App")
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val sqlSc = new SQLContext(sc)
import sqlSc.implicits._

val df = sqlSc.load(path, "com.databricks.spark.avro").cache()
val hc = new HiveContext(sc)

df.filter("fieldA = 'X'").select($"fieldA", $"fieldB", $"fieldC").rdd.toLocalIterator.filter(x => x(1) != null).foreach { x =>
  val query = s"select from hive_table where fieldA = ${x(0)} and fieldB='${x(1)}' and fieldC=${x(2)}"
  val df1 = hc.sql(query)
  df1.rdd.toLocalIterator.foreach { r =>
    println(s"For ${x(0)} Found ${r(0)}\n")
    fw1.write(s"For ${x(0)} Found ${r(0)}\n")
  }
}
The job runs for 2 hours, but then aborts with the error:
16/03/08 12:35:53 WARN TaskSetManager: Lost task 17.0 in stage 34315.0 (TID 82258, foo-cloudera04.foo.com): java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:794)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:833)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:897)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
....
16/03/08 12:35:53 INFO TaskSetManager: Starting task 0.0 in stage 34315.0 (TID 82260, foo-cloudera09.foo.com, NODE_LOCAL, 1420 bytes)
16/03/08 12:35:53 INFO TaskSetManager: Finished task 67.0 in stage 34314.0 (TID 82256) in 1298 ms on foo-cloudera09.foo.com (42/75)
16/03/08 12:35:53 INFO BlockManagerInfo: Added broadcast_12501_piece0 in memory on foo-cloudera09.foo.com:43893 (size: 6.5 KB, free: 522.8 MB)
16/03/08 12:35:53 INFO BlockManagerInfo: Added broadcast_12499_piece0 in memory on foo-cloudera09.foo.com:43893 (size: 44.2 KB, free: 522.7 MB)
16/03/08 12:35:53 INFO TaskSetManager: Starting task 17.1 in stage 34315.0 (TID 82261, foo-cloudera04.foo.com, NODE_LOCAL, 1420 bytes)
16/03/08 12:35:53 WARN TaskSetManager: Lost task 19.0 in stage 34315.0 (TID 82259, foo-cloudera04.foo.com): java.io.FileNotFoundException: /data/1/yarn/nm/usercache/Foo.Bar/appcache/application_1456200816465_188203/blockmgr-79a08609-56ae-490e-afc9-0f0143441a76/27/temp_shuffle_feb9ae13-6cb0-4a19-a60f-8c433f30e0e0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
16/03/08 12:35:53 INFO TaskSetManager: Starting task 19.1 in stage 34315.0 (TID 82262, foo-cloudera04.foo.com, NODE_LOCAL, 1420 bytes)
16/03/08 12:35:53 WARN TaskSetManager: Lost task 17.1 in stage 34315.0 (TID 82261, foo-cloudera04.foo.com): java.io.FileNotFoundException: /data/1/yarn/nm/usercache/Foo.Bar/appcache/application_1456200816465_188203/blockmgr-79a08609-56ae-490e-afc9-0f0143441a76/13/temp_shuffle_2f89df35-9e35-4558-a0f2-1f7353d3f9b0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
I am new to Scala and Spark. I am trying to join two RDDs coming from two different text files. In each text file there are two columns separated by a tab, e.g.

text1               text2
100772C111  ion     200772C222  ion
100772C111  on      200772C222  gon
100772C111  n       200772C2    n

So I want to join these two files based on their second columns and get a result as below, meaning that there are 2 common terms for those two documents:
((100772C111-200772C222,2))
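The intended join-and-count logic can be sketched with plain Scala collections; the doc IDs and terms below are made up for illustration:

```scala
// Two (docId, term) datasets standing in for the tab-separated files.
val file1 = Seq(("docA", "ion"), ("docA", "on"), ("docA", "n"))
val file2 = Seq(("docB", "ion"), ("docB", "on"), ("docC", "n"))

// Join the two sides on the term (the second column), then count
// how many terms each document pair shares.
val joined = for {
  (d1, t1) <- file1
  (d2, t2) <- file2
  if t1 == t2
} yield (d1 + "-" + d2, 1)

val counts = joined.groupBy(_._1).map { case (pair, hits) => (pair, hits.map(_._2).sum) }
// counts contains ("docA-docB", 2): the pair shares the terms "ion" and "on"
```

The RDD version in the script does the same thing with `join` on term-keyed pairs followed by `reduceByKey`.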
My computer's features:
4 X (intel(r) core(tm) i5-2430m cpu #2.40 ghz)
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}

object hw {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)

    val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
      .map { line =>
        val parts = line.split("\t")
        (parts(5), parts(0))
      }

    val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
      .map { line2 =>
        val parts = line2.split("\t")
        (parts(3), parts(1))
      }

    val finalemp = emp_new.distinct().join(emp.distinct())
      .map { case (nk1, (parts1, val1)) => (parts1 + "-" + val1, 1) }
      .reduceByKey((a, b) => a + b)

    finalemp.foreach(println)
  }
}
This code gives what I want when I try it with smaller text files. However, I want to run this script on big text files. I have one text file with a size of 110 KB (approx. 4M rows) and another of 9 gigabytes (more than 1B rows).
When I run my script on these two text files, I observed the following on the log screen:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files locally? After waiting more than 3 hours, the program was still spilling data to disk.
To sum up, is there something I need to change in my code to cope with these performance issues?
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with a very small memory allocation. Unlike, say, an RDBMS, it won't grab as much memory as it can eat; you need to tell it how much you want it to use.
The default is (I believe) one executor per node, with 512 MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and each executor on the SparkUI, which (by default) is at http://localhost:4040
I have 2 questions I want to ask:
This is my code:
import org.apache.spark.{SparkConf, SparkContext}

object Hi {
  def main(args: Array[String]): Unit = {
    println("Sucess")
    val conf = new SparkConf().setAppName("HI").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("src/main/scala/source.txt")
    val rows = textFile.map { line =>
      val fields = line.split("::")
      (fields(0), fields(1).toInt)
    }
    val x = rows.map { case (range, ratednum) => range }.collect.mkString("::")
    val y = rows.map { case (range, ratednum) => ratednum }.collect.mkString("::")
    println(x)
    println(y)
    println("Sucess2")
  }
}
Here is some of the result:
15/04/26 16:49:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/04/26 16:49:57 INFO SparkUI: Started SparkUI at http://192.168.1.105:4040
15/04/26 16:49:57 INFO Executor: Starting executor ID <driver> on host localhost
15/04/26 16:49:57 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#192.168.1.105:64952/user/HeartbeatReceiver
15/04/26 16:49:57 INFO NettyBlockTransferService: Server created on 64954
15/04/26 16:49:57 INFO BlockManagerMaster: Trying to register BlockManager
15/04/26 16:49:57 INFO BlockManagerMasterActor: Registering block manager localhost:64954 with 983.1 MB RAM, BlockManagerId(<driver>, localhost, 64954)
.....
15/04/26 16:49:59 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/04/26 16:49:59 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at map at Hi.scala:25)
15/04/26 16:49:59 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/04/26 16:49:59 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1331 bytes)
15/04/26 16:49:59 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/04/26 16:49:59 INFO HadoopRDD: Input split: file:/Users/Winsome/IdeaProjects/untitled/src/main/scala/source.txt:0+23
15/04/26 16:49:59 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1787 bytes result sent to driver
15/04/26 16:49:59 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (1/1)
15/04/26 16:49:59 INFO DAGScheduler: Stage 1 (collect at Hi.scala:25) finished in 0.013 s
15/04/26 16:49:59 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/04/26 16:49:59 INFO DAGScheduler: Job 1 finished: collect at Hi.scala:25, took 0.027784 s
1~1::2~2::3~3
10::20::30
Sucess2
My first question is: when I check http://localhost:8080/
there is no worker, and I can't open http://192.168.1.105:4040 either.
Is it because I use Spark standalone?
How can I fix this?
(My environment is macOS; the IDE is IntelliJ.)
My 2nd question is:
val x = rows.map{case (range , ratednum) => range}.collect.mkString("::")
val y = rows.map{case (range , ratednum) => ratednum}.collect.mkString("::")
println(x)
println(y)
I think this code could get x and y more simply (something like rows[range], rows[ratednum]), but I'm not familiar with Scala.
Could you give me some advice?
I'm not sure about your first question, but reading your log I see that the work finished in 13 ms, so this may be the reason why you haven't seen it. Run a longer job and you may see the workers.
About the second question, yes, there is a simpler way to write it that is:
val x = rows.map{(tuple) => tuple._1}.collect.mkString("::")
because your RDD is made of Scala Tuple objects, which have two fields you can access with _1 and _2 respectively.
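As a quick, self-contained illustration with plain Scala collections (reusing the sample values printed by the question's program):

```scala
// Pairs shaped like the parsed rows: (range, ratednum).
val rows = Seq(("1~1", 10), ("2~2", 20), ("3~3", 30))

// _1 and _2 pick the tuple fields directly; no pattern match needed.
val x = rows.map(_._1).mkString("::")
val y = rows.map(_._2).mkString("::")
// x: "1~1::2~2::3~3", y: "10::20::30"
```

The same `_._1` / `_._2` shorthand works unchanged on an `RDD[(String, Int)]`.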
I am new to Apache Spark. I am trying to create a schema and load data from HDFS. Below is my code:
// importing sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
//defining the schema
case class Author1(Author_Key: Long, Author_ID: Long, Author: String, First_Name: String, Last_Name: String, Middle_Name: String, Full_Name: String, Institution_Full_Name: String, Country: String, DIAS_ID: Int, R_ID: String)
val D_Authors1 =
sc.textFile("hdfs:///user/D_Authors.txt")
.map(_.split("\\|"))
.map(auth => Author1(auth(0).trim.toLong, auth(1).trim.toLong, auth(2), auth(3), auth(4), auth(5), auth(6), auth(7), auth(8), auth(9).trim.toInt, auth(10)))
//register the table
D_Authors1.registerAsTable("D_Authors1")
val auth = sqlContext.sql("SELECT * FROM D_Authors1")
sqlContext.sql("SELECT * FROM D_Authors1").collect().foreach(println)
When I execute this code it throws an ArrayIndexOutOfBoundsException. Below is the error:
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/08/18 06:57:14 INFO FileInputFormat: Total input paths to process : 1
14/08/18 06:57:14 INFO SparkContext: Starting job: collect at <console>:24
14/08/18 06:57:14 INFO DAGScheduler: Got job 5 (collect at <console>:24) with 2 output partitions (allowLocal=false)
14/08/18 06:57:14 INFO DAGScheduler: Final stage: Stage 5(collect at <console>:24)
14/08/18 06:57:14 INFO DAGScheduler: Parents of final stage: List()
14/08/18 06:57:14 INFO DAGScheduler: Missing parents: List()
14/08/18 06:57:14 INFO DAGScheduler: Submitting Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174), which has no missing parents
14/08/18 06:57:14 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174)
14/08/18 06:57:14 INFO YarnClientClusterScheduler: Adding task set 5.0 with 2 tasks
14/08/18 06:57:14 INFO TaskSetManager: Starting task 5.0:0 as TID 38 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:14 INFO TaskSetManager: Serialized task 5.0:0 as 4401 bytes in 1 ms
14/08/18 06:57:15 INFO TaskSetManager: Starting task 5.0:1 as TID 39 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:15 INFO TaskSetManager: Serialized task 5.0:1 as 4401 bytes in 0 ms
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 38 (task 5.0:0)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 10
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 39 (task 5.0:1)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 9
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Your problem has nothing to do with Spark.
Format your code correctly (I have corrected it).
Don't mix camelCase and underscore naming: use underscores for SQL fields, camelCase for Scala vals.
When you get an exception, read it; it usually tells you what you are doing wrong. In your case it's probably that some of the records in hdfs:///user/D_Authors.txt are not in the form you expect.
When you get an exception, debug it: try actually catching the exception and printing out the records that fail to parse.
_.split("\\|") drops trailing empty strings; use _.split("\\|", -1) instead.
In Scala you don't need magic numbers that manually access elements of an array; it's ugly and more error-prone. Use a pattern match instead...
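The split-limit behaviour can be checked with a quick snippet (the sample string is illustrative):

```scala
val line = "a|b||"

// The default limit (0) drops trailing empty strings...
val dropped = line.split("\\|")      // Array("a", "b")
// ...while limit -1 keeps every field, empty or not.
val kept = line.split("\\|", -1)     // Array("a", "b", "", "")
```

With pipe-delimited records where trailing fields may be empty, the default behaviour silently shortens the array, which is exactly what produces ArrayIndexOutOfBoundsException later.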
Here is a simple example which includes handling of unexpected records:
case class Author(author: String, authorAge: Int)
myData.map(_.split("\t", -1) match {
case Array(author, authorAge) => Author(author, authorAge.toInt)
case unexpectedArrayForm =>
throw new RuntimeException("Record did not have correct number of fields: " +
unexpectedArrayForm.mkString("\t"))
})
Now if you code it like this, your exception will tell you straight away exactly what is wrong with your data.
One final point/concern: why are you using Spark SQL? Your data is in text form; are you trying to transform it into, say, Parquet? If not, why not just use the regular Scala API to perform your analysis? It is, moreover, type-checked and compile-checked, unlike SQL.
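The "plain Scala instead of SQL" suggestion might look like this on in-memory data; the records and the age filter are illustrative, not from the question's dataset:

```scala
case class Author(author: String, authorAge: Int)

// Illustrative tab-separated records.
val myData = Seq("knuth\t82", "hopper\t85")

// Parse with a pattern match, failing loudly on malformed records.
val authors = myData.map { line =>
  line.split("\t", -1) match {
    case Array(author, authorAge) => Author(author, authorAge.toInt)
    case bad => throw new RuntimeException("Bad record: " + bad.mkString("\t"))
  }
}

// A typical "query" is then just typed collection code, checked at compile time.
val over84 = authors.filter(_.authorAge > 84).map(_.author)
```

The same `map`/`filter` chain runs unchanged on an `RDD[String]`, with the type checker catching field-name and type mistakes that a SQL string would only reveal at runtime.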