I am new to Scala and Spark. I am trying to join two RDDs built from two different text files. Each text file has two columns separated by a tab, e.g.
text1:
100772C111	ion
100772C111	on
100772C111	n

text2:
200772C222	ion
200772C222	gon
200772C2	n
So I want to join these two files on their second columns and get a result like the one below, meaning that there are 2 common terms for that pair of documents:
((100772C111-200772C222,2))
My computer's specs:
4 x Intel(R) Core(TM) i5-2430M CPU @ 2.40 GHz
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}

object hw {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)

    // (term, docId) pairs from the first file
    val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
      .map { line => val parts = line.split("\t"); (parts(5), parts(0)) }

    // (term, docId) pairs from the second file
    val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
      .map { line2 => val parts = line2.split("\t"); (parts(3), parts(1)) }

    // join on the term, then count how many terms each document pair shares
    val finalemp = emp_new.distinct().join(emp.distinct())
      .map { case (nk1, (parts1, val1)) => (parts1 + "-" + val1, 1) }
      .reduceByKey(_ + _)

    finalemp.foreach(println)
  }
}
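As a side note, the join-and-count logic the script is aiming for can be sketched on plain Scala collections, without Spark. The sample data below is hypothetical (column indices simplified to 0 and 1, unlike the real files), and with these tiny samples each document pair happens to share only one term:

```scala
object JoinCountSketch {
  // Two tiny in-memory "files", one line per record: "docId<TAB>term"
  // (hypothetical sample data)
  val file1 = Seq("100772C111\tion", "100772C111\ton", "100772C111\tn")
  val file2 = Seq("200772C222\tion", "200772C222\tgon", "200772C2\tn")

  // (term, docId) pairs, mirroring the map step of the RDD version
  def parse(lines: Seq[String]): Seq[(String, String)] =
    lines.map { line => val p = line.split("\t"); (p(1), p(0)) }

  // Join on the term, then count terms per document pair (the reduceByKey step)
  def commonTerms(a: Seq[String], b: Seq[String]): Map[String, Int] = {
    val left  = parse(a).distinct.groupBy(_._1)
    val right = parse(b).distinct.groupBy(_._1)
    val pairs = for {
      term     <- (left.keySet intersect right.keySet).toSeq
      (_, id1) <- left(term)
      (_, id2) <- right(term)
    } yield s"$id1-$id2"
    pairs.groupBy(identity).map { case (k, v) => (k, v.size) }
  }

  def main(args: Array[String]): Unit =
    commonTerms(file1, file2).foreach(println)
}
```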
This code gives what I want when I try it with smaller text files. However, I want to run this script on big text files: one text file with a size of 110 KB (approx. 4M rows) and another of 9 GB (more than 1B rows).
When I run my script on these two text files, I observed the following in the log:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files locally? After waiting more than 3 hours, the program was still spilling data to disk.
To sum up, is there something I need to change in my code to cope with these performance issues?
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with a very small memory allocation. It won't grab as much memory as it can the way, say, an RDBMS does; you need to tell it how much you want it to use.
The default is (I believe) one executor per node and 512 MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and to each executor in the Spark UI, which (by default) is at http://localhost:4040
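The same settings can also be made permanent in conf/spark-defaults.conf instead of being passed on every launch. The values below are only illustrative, not recommendations:

```
# Hypothetical spark-defaults.conf equivalents of the flags above
spark.driver.memory    1g
spark.executor.memory  1g
spark.executor.cores   3
```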
Related
I'm very new to both Spark and Mongo, so please comment if my question needs additional details to be clear.
My Mongo DB contains over 500M records. I'm trying to connect to it using the pyspark-mongo connector, load only 100 records, and show them. It works, but it is very slow. The logs show that the cluster runs over 4800 tasks to perform such a simple load and show. A snippet from the terminal output:
22/08/23 07:00:47 INFO TaskSetManager: Finished task 4395.0 in stage 0.0 (TID 9623) in 6860 ms on xx.xx.xx.xx (executor 16) (2528/4842)
22/08/23 07:00:47 INFO TaskSetManager: Starting task 1649.2 in stage 0.0 (TID 9697) (xx.xx.xx.xx, executor 16, partition 1649, PROCESS_LOCAL, 4952 bytes) taskResourceAssignments Map()
22/08/23 07:00:47 INFO TaskSetManager: Finished task 4416.0 in stage 0.0 (TID 9644) in 6249 ms on xx.xx.xx.xx (executor 16) (2529/4842)
22/08/23 07:00:47 INFO TaskSetManager: Starting task 92.4 in stage 0.0 (TID 9698) (xx.xx.xx.xx, executor 16, partition 92, PROCESS_LOCAL, 4952 bytes) taskResourceAssignments Map()
22/08/23 07:00:47 INFO TaskSetManager: Starting task 128.2 in stage 0.0 (TID 9699) (xx.xx.xx.xx, executor 16, partition 128, PROCESS_LOCAL, 4952 bytes) taskResourceAssignments Map()
22/08/23 07:00:47 INFO TaskSetManager: Finished task 4417.0 in stage 0.0 (TID 9645) in 6129 ms on xx.xx.xx.xx (executor 16) (2530/4842)
...
My code:
my_spark = SparkSession.builder \
.config("spark.executor.memory", exMemory) \
.config("spark.driver.memory", driverMemory) \
.config("spark.executor.cpu", cores) \
.config("spark.executor.cores", cores) \
.config("spark.driver.host", host) \
.config("spark.driver.port", port) \
.appName(appName) \
.getOrCreate()
df = my_spark.read.format("mongodb") \
.option('spark.mongodb.connection.uri', 'mongodb://xx.xx.xx.xx:27017') \
.option("spark.mongodb.input.partitioner","MongoShardedPartitioner") \
.option('spark.mongodb.database', db) \
.option('spark.mongodb.collection', coll).load().limit(100)
df.show()
I suspect the reason is that pyspark loads the whole dataset and only AFTER that limits the result to 100 records. Is this correct? Why is it happening, and how can I fix it?
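That suspicion can be illustrated with a plain-Scala analogy on iterators (only an analogy; this is not the Mongo connector itself): a limit applied lazily at the source touches 100 records, while materializing everything first and then taking 100 touches every record:

```scala
object LimitSketch {
  var produced = 0 // counts how many records the source actually yields

  // A lazy record source of n rows (stand-in for a collection scan)
  def source(n: Int): Iterator[Int] =
    Iterator.tabulate(n) { i => produced += 1; i }

  // Pushdown-style limit: take(100) on the lazy iterator yields only 100 records
  def lazyHundred(n: Int): List[Int] =
    source(n).take(100).toList

  // Materialize-then-limit: every record is produced before the take
  def eagerHundred(n: Int): List[Int] =
    source(n).toList.take(100)
}
```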
I am using Spark Job Server to submit Spark jobs to a cluster. The application I am trying to test is a Spark program based on SANSA query and the SANSA stack. SANSA is used for scalable processing of huge amounts of RDF data, and SANSA query is one of the SANSA libraries, used for querying RDF data.
When I run the application as a Spark program with the spark-submit command, it works correctly, as expected. But when run through Spark Job Server, the application fails most of the time with the exception below.
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown hook
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event, stopping job manger.
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at http://10.138.32.96:46627
20/05/29 18:57:00 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com, executor 1, partition 3, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms on us1salxhpw0653.corpnet2.com (executor 1) (1/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64, us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor 2) (2/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition 5, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on us1salxhpw0670.corpnet2.com (executor 3) (3/560)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_4 in memory on us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at SansaQueryExample.scala:32, took 0.732943 s
20/05/29 18:57:00 INFO DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32) failed in 0.556 s due to Stage cancelled because SparkContext was shut down
20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
  at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
  at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
  at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
  at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
  at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1922)
  at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
  at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
  at scala.util.Try$.apply(Try.scala:192)
Code used for direct execution:
object SansaQueryExampleWithoutSJS {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sansa stack example").getOrCreate()
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = spark.rdf(lang)(input)
    graphRdd.collect().foreach(println)
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
  }
}
Code Integrated with Spark Job Server
object SansaQueryExample extends SparkSessionJob {
  override type JobData = Seq[String]
  override type JobOutput = collection.Map[String, Long]

  override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }

  override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = sparkSession.rdf(lang)(input)
    graphRdd.collect().foreach(println)
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
    sparkSession.sparkContext.parallelize(data).countByValue
  }
}
The steps for executing an application via Spark Job Server are explained here; mainly:
upload the jar into SJS through the REST API
create a Spark context with memory and cores as required, through another API call
execute the job via another API call, referencing the jar and the context already created
When I observed different executions of the program, I could see that Spark Job Server behaves inconsistently: the program sometimes works without any errors. I also observed that the SparkContext is being shut down for unknown reasons. I am using SJS 0.8.0, SANSA 0.7.1, and Spark 2.4.
I'm trying to compute personalized PageRank on a graph with 200M edges using Spark.
I was able to compute it for a single node, but I can't do it for multiple nodes.
This is the code I have written so far:
val ops : Broadcast[GraphOps[Int, Int]] = sc.broadcast(new GraphOps(graph))
vertices.map(vertex => (vertex._1, ops.value.personalizedPageRank(vertex._1, 0.00001, 0.2)))
.mapValues(_.vertices.filter(_._2 > 0))
.mapValues(_.sortBy(_._2, false))
.mapValues(_.mapValues(d => "%.12f".format(d)))
.mapValues(_.take(1000))
.mapValues(_.mkString("\t"))
.saveAsTextFile("hdfs://localhost:9000/user/spark/out/vertices-ppr")
Where vertices is a VertexRDD[Int] and is a subset of the graph vertices.
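For reference, the per-vertex post-processing chain (drop zero scores, sort descending, format to 12 decimals, keep the top entries, join with tabs) can be checked on a plain Seq with hypothetical scores:

```scala
object PprPostprocess {
  // Hypothetical (vertexId, score) pairs, standing in for one vertex's
  // personalizedPageRank result
  val scores = Seq((1L, 0.5), (2L, 0.0), (3L, 0.25))

  // Same chain as in the question, on a local collection
  val topLine: String = scores
    .filter(_._2 > 0)                                // drop zero scores
    .sortBy(-_._2)                                   // descending by score
    .map { case (id, d) => (id, "%.12f".format(d)) } // fixed 12 decimals
    .take(1000)                                      // keep at most 1000
    .mkString("\t")                                  // one tab-joined line
}
```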
If it is small (1, 2, or 10 elements) the code works nicely, but if it is bigger (100 elements) the code just freezes on job 2 after the first one completes. The last lines of the console are:
INFO Got job 13 (reduce at VertexRDDImpl.scala:88) with 22 output partitions
INFO Final stage: ResultStage 63 (reduce at VertexRDDImpl.scala:88)
INFO Parents of final stage: List(ShuffleMapStage 1, ShuffleMapStage 3, ShuffleMapStage 62)
INFO Missing parents: List(ShuffleMapStage 3, ShuffleMapStage 62)
INFO Removed broadcast_4_piece0 on localhost:33231 in memory (size: 2.7 KB, free: 22.7 GB)
Here is a screenshot of spark console :
I'm using Spark/Scala to generate about 10 million people in a CSV file on HDFS by randomly mixing two CSV files with first and last names, adding a random date of birth between 1920 and now, plus a creation date and a counter.
I ran into a bit of a problem using a for loop: everything works correctly, but in that case the loop runs only on the driver. That was fine at 1 million rows, but generating 10 million takes about 10 minutes longer. So I decided to create a range with 10 million items so I could use map and utilize the cluster. I got the following code:
package ebicus

import org.apache.spark._
import org.joda.time.{DateTime, Interval, LocalDateTime}
import org.joda.time.format.DateTimeFormat
import java.util.Random

object main_generator_spark {
  val conf = new SparkConf().setAppName("Generator")
  val sc = new SparkContext(conf)
  val rs = Random

  val file = sc.textFile("hdfs://host:8020/user/firstname")
  val fnames = file.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
  val fnames_ar = fnames.collect()
  val fnames_size = fnames_ar.length
  val firstnames = sc.broadcast(fnames_ar)

  val file2 = sc.textFile("hdfs://host:8020/user/lastname")
  val lnames = file2.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
  val lnames_ar = lnames.collect()
  val lnames_size = lnames_ar.length
  val lastnames = sc.broadcast(lnames_ar)

  val range_val = sc.range(0, 100000000, 1, 20)
  val rddpersons = range_val.map(x =>
    (x.toString,
      new DateTime().toString("y-M-d::H:m:s"),
      fnames_ar(rs.nextInt(fnames_size)), // <-- error at line 77 (see stack trace below)
      lnames_ar(rs.nextInt(lnames_size)),
      makeGebDate
    )
  )

  def makeGebDate(): String = {
    lazy val start = new DateTime(1920, 1, 1, 0, 0, 0)
    lazy val end = new DateTime().minusYears(18)
    lazy val hours = (new Interval(start, end).toDurationMillis() / (1000 * 60 * 60)).toInt
    start.plusHours(rs.nextInt(hours)).toString("y-MM-dd")
  }

  def main(args: Array[String]): Unit = {
    rddpersons.saveAsTextFile("hdfs://host:8020/user/output")
  }
}
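As an aside, the makeGebDate step can be exercised on its own. Here is a small java.time rendition of the same idea (a sketch, not the joda-time original) that is easy to sanity-check locally:

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import java.util.Random

object BirthDateSketch {
  // Random birth date between 1920-01-01 and 18 years ago, as "yyyy-MM-dd"
  // (java.time stand-in for the joda-time makeGebDate above)
  def makeBirthDate(rs: Random): String = {
    val start = LocalDate.of(1920, 1, 1)
    val end   = LocalDate.now().minusYears(18)
    val days  = ChronoUnit.DAYS.between(start, end).toInt
    start.plusDays(rs.nextInt(days)).toString
  }
}
```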
The code works fine when I use the spark-shell, but when I try to run the script with spark-submit (I'm using Maven to build):
spark-submit --class ebicus.main_generator_spark --num-executors 16 --executor-cores 4 --executor-memory 2G --driver-cores 2 --driver-memory 10g /u01/stage/mvn_test-0.0.2.jar
I get the following error:
16/06/16 11:17:29 INFO DAGScheduler: Final stage: ResultStage 2(saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO DAGScheduler: Parents of final stage: List()
16/06/16 11:17:29 INFO DAGScheduler: Missing parents: List()
16/06/16 11:17:29 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93), which has no missing parents
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(140536) called with curMem=1326969, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 137.2 KB, free 5.2 GB)
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(48992) called with curMem=1467505, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 47.8 KB, free 5.2 GB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.29.7.4:51642 (size: 47.8 KB, free: 5.2 GB)
16/06/16 11:17:29 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
16/06/16 11:17:29 INFO DAGScheduler: Submitting 20 missing tasks from ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO YarnScheduler: Adding task set 2.0 with 20 tasks
16/06/16 11:17:29 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, cloudera-001.fusion.ebicus.com, partition 0,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, cloudera-003.fusion.ebicus.com, partition 1,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, cloudera-001.fusion.ebicus.com, partition 2,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, cloudera-003.fusion.ebicus.com, partition 3,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 12, cloudera-001.fusion.ebicus.com, partition 4,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 5.0 in stage 2.0 (TID 13, cloudera-003.fusion.ebicus.com, partition 5,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com, partition 6,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 7.0 in stage 2.0 (TID 15, cloudera-003.fusion.ebicus.com, partition 7,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-003.fusion.ebicus.com:52334 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-001.fusion.ebicus.com:53110 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 8.0 in stage 2.0 (TID 16, cloudera-001.fusion.ebicus.com, partition 8,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 9.0 in stage 2.0 (TID 17, cloudera-001.fusion.ebicus.com, partition 9,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com): java.lang.NoClassDefFoundError: Could not initialize class ebicus.main_generator_spark$
at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:77)
at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:74)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1205)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Am I making some kind of fundamental thinking error? I would be happy if someone could point me in the right direction.
Edit: I'm using Cloudera 5.6.0, Spark 1.5.0, Scala 2.10.6, YARN 2.10, joda-time 2.9.4
Edit2: Added conf & sc
I have 2 questions I want to ask:
This is my code:
object Hi {
  def main(args: Array[String]): Unit = {
    println("Sucess")
    val conf = new SparkConf().setAppName("HI").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("src/main/scala/source.txt")
    val rows = textFile.map { line =>
      val fields = line.split("::")
      (fields(0), fields(1).toInt)
    }
    val x = rows.map { case (range, ratednum) => range }.collect.mkString("::")
    val y = rows.map { case (range, ratednum) => ratednum }.collect.mkString("::")
    println(x)
    println(y)
    println("Sucess2")
  }
}
Here is some of the result:
15/04/26 16:49:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/04/26 16:49:57 INFO SparkUI: Started SparkUI at http://192.168.1.105:4040
15/04/26 16:49:57 INFO Executor: Starting executor ID <driver> on host localhost
15/04/26 16:49:57 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#192.168.1.105:64952/user/HeartbeatReceiver
15/04/26 16:49:57 INFO NettyBlockTransferService: Server created on 64954
15/04/26 16:49:57 INFO BlockManagerMaster: Trying to register BlockManager
15/04/26 16:49:57 INFO BlockManagerMasterActor: Registering block manager localhost:64954 with 983.1 MB RAM, BlockManagerId(<driver>, localhost, 64954)
.....
15/04/26 16:49:59 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/04/26 16:49:59 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at map at Hi.scala:25)
15/04/26 16:49:59 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/04/26 16:49:59 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1331 bytes)
15/04/26 16:49:59 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/04/26 16:49:59 INFO HadoopRDD: Input split: file:/Users/Winsome/IdeaProjects/untitled/src/main/scala/source.txt:0+23
15/04/26 16:49:59 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1787 bytes result sent to driver
15/04/26 16:49:59 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (1/1)
15/04/26 16:49:59 INFO DAGScheduler: Stage 1 (collect at Hi.scala:25) finished in 0.013 s
15/04/26 16:49:59 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/04/26 16:49:59 INFO DAGScheduler: Job 1 finished: collect at Hi.scala:25, took 0.027784 s
1~1::2~2::3~3
10::20::30
Sucess2
My first question is: when I check http://localhost:8080/,
there is no worker, and I can't open http://192.168.1.105:4040 either.
Is this because I use Spark standalone?
How can I fix this?
(My environment is macOS; the IDE is IntelliJ IDEA.)
My 2nd question is:
val x = rows.map{case (range , ratednum) => range}.collect.mkString("::")
val y = rows.map{case (range , ratednum) => ratednum}.collect.mkString("::")
println(x)
println(y)
I think there should be an easier way to get x and y (something like rows[range], rows[ratednum]), but I'm not familiar with Scala.
Could you give me some advice?
I'm not sure about your first question, but reading your log I see that the task lasted only 13 ms, so this may be why you didn't see the worker. Run a longer job and you may see the workers.
About the second question: yes, there is a simpler way to write it, which is:
val x = rows.map{(tuple) => tuple._1}.collect.mkString("::")
because your RDD is made of Scala Tuple objects, which have two fields you can access with _1 and _2 respectively.
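To make that concrete, here is the tuple-access idiom on a local Seq; the sample values mirror the output shown in the question, and unzip is a further option that splits both columns at once:

```scala
object TupleAccess {
  // Local stand-in for the collected rows: (range, ratednum) pairs
  val rows = Seq(("1~1", 10), ("2~2", 20), ("3~3", 30))

  // ._1 / ._2 pick the tuple fields; unzip splits both columns at once
  val x = rows.map(_._1).mkString("::")
  val y = rows.map(_._2).mkString("::")
  val (ranges, ratings) = rows.unzip
}
```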