I am new to Apache Spark. I am trying to create a schema and load data from hdfs. Below is my code:
// importing sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
//defining the schema
case class Author1(Author_Key: Long, Author_ID: Long, Author: String, First_Name: String, Last_Name: String, Middle_Name: String, Full_Name: String, Institution_Full_Name: String, Country: String, DIAS_ID: Int, R_ID: String)
val D_Authors1 =
sc.textFile("hdfs:///user/D_Authors.txt")
.map(_.split("\\|"))
.map(auth => Author1(auth(0).trim.toLong, auth(1).trim.toLong, auth(2), auth(3), auth(4), auth(5), auth(6), auth(7), auth(8), auth(9).trim.toInt, auth(10)))
//register the table
D_Authors1.registerAsTable("D_Authors1")
val auth = sqlContext.sql("SELECT * FROM D_Authors1")
sqlContext.sql("SELECT * FROM D_Authors").collect().foreach(println)
When I am executing this code it throwing array out of bound exception. Below is the error:
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/08/18 06:57:14 INFO FileInputFormat: Total input paths to process : 1
14/08/18 06:57:14 INFO SparkContext: Starting job: collect at <console>:24
14/08/18 06:57:14 INFO DAGScheduler: Got job 5 (collect at <console>:24) with 2 output partitions (allowLocal=false)
14/08/18 06:57:14 INFO DAGScheduler: Final stage: Stage 5(collect at <console>:24)
14/08/18 06:57:14 INFO DAGScheduler: Parents of final stage: List()
14/08/18 06:57:14 INFO DAGScheduler: Missing parents: List()
14/08/18 06:57:14 INFO DAGScheduler: Submitting Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174), which has no missing parents
14/08/18 06:57:14 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174)
14/08/18 06:57:14 INFO YarnClientClusterScheduler: Adding task set 5.0 with 2 tasks
14/08/18 06:57:14 INFO TaskSetManager: Starting task 5.0:0 as TID 38 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:14 INFO TaskSetManager: Serialized task 5.0:0 as 4401 bytes in 1 ms
14/08/18 06:57:15 INFO TaskSetManager: Starting task 5.0:1 as TID 39 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:15 INFO TaskSetManager: Serialized task 5.0:1 as 4401 bytes in 0 ms
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 38 (task 5.0:0)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 10
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 39 (task 5.0:1)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 9
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Your problem has nothing to do with Spark.
Format your code correctly (I have corrected)
Don't mix camel & underscore naming - use underscore for SQL fields, use camel for Scala vals,
When you get an exception read it it usually tells you what you are doing wrong, in your case it's probably that some of the records in hdfs:///user/D_Authors.txt are not how you expect them
When you get an exception debug it, try actually catching the exception and printing out what the records are that fail to parse
_.split("\\|") ignores empty leading and trailing strings, use _.split("\\|", -1)
In Scala you don't need magic numbers that manually access elements of an array, it's ugly and more prone to error, use a pattern match ...
here is a simple example which includes unusual record handling!:
case class Author(author: String, authorAge: Int)
myData.map(_.split("\t", -1) match {
case Array(author, authorAge) => Author(author, authorAge.toInt)
case unexpectedArrayForm =>
throw new RuntimeException("Record did not have correct number of fields: " +
unexpectedArrayForm.mkString("\t"))
})
Now if you coded it like this, your exception would tell you straight away exactly what is wrong with your data.
One final point/concern; why are you using Spark SQL? Your data is in text form, are you trying to transform it into, say, parquet? If not, why not just use the regular Scala API to perform your analysis, moreover it's type checked and compile checked, unlike SQL.
Related
Hi I am new to Scala and I am using intellij Idea. I am trying to filestream a text file running Scala cluster on Hadoop-Spark. My main goal is to count only words (without any special characters) in a key, value format.
I found apache-spark regex extract words from rdd article where they use findAllIn() function with regex but not sure if I am using it correctly in my code.
When I built my project and generate the jar file to run it in Spark I manually provide the text file and it seems that it runs and count words but it also seems that it enters in a loop as it processed the file infinitely. I thought it should process it only once.
Can someone tell me why this may be happening? or is it a better way to achieve my goal?
Part of my text is:
It wlll only take a day,' he said. The others disagreed.
It's too fragile," 289 they said disapprovingly 23 age, but he refused to listen. Not
quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.
My code is:
package streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
def main(args: Array[String]): Unit = {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val sep_words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = sep_words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/task1.txt")
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
My expected output is something like below but as mentioned before it keeps repeating:
22/10/24 05:11:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/10/24 05:11:40 INFO Executor: Finished task 0.0 in stage 27.0 (TID 20). 1529 bytes result sent to driver
22/10/24 05:11:40 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 20) in 18 ms on localhost (executor driver) (1/1)
22/10/24 05:11:40 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool
22/10/24 05:11:40 INFO DAGScheduler: ResultStage 27 (print at NetworkWordCount.scala:28) finished in 0.031 s
22/10/24 05:11:40 INFO DAGScheduler: Job 13 finished: print at NetworkWordCount.scala:28, took 0.036092 s
-------------------------------------------
Time: 1666588300000 ms
-------------------------------------------
(heaped,1)
(safe,1)
(became,1)
(For,1)
(ll,1)
(it,1)
(Let,1)
(Open,1)
(others,1)
(pack,1)
...
22/10/24 05:11:40 INFO JobScheduler: Finished job streaming job 1666588300000 ms.1 from job set of time 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Total delay: 0.289 s for time 1666588300000 ms (execution: 0.222 s)
22/10/24 05:11:40 INFO ShuffledRDD: Removing RDD 40 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 40
22/10/24 05:11:40 INFO MapPartitionsRDD: Removing RDD 39 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 39
Additionally, I notice that when I only use split() function without any regex neither findAllIn() function it process the file only once as expected, but of course it only splits the text by spaces.
Something like this:
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
I almost forgot if you can explain also the meaning of the usage of underscores in code will help me a lot. I am having some problems to understand that part too.
Thanks in advance.
If your file doesn't change while you're parsing it, you won't need to use SparkStreamingContext.
Just create SparkContext:
val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
and then process you file in the way you need.
You will need SparkStreamingContext if you have some datasources which changing and you need to process changed delta continiously
There're many different ways to use underscore in scala, but in your current case underscore is a just syntax sugar reduces syntax of lambda function. It actully means element of collection.
You can rewrite code with undescore like this:
//both lines doing the same
.flatMap("[a-zA-Z]+".r.findAllIn(_))
.flatMap((s: String) => "[a-zA-Z]+".r.findAllIn(s))
And
//both lines doing the same
.map(x => (x, 1)).reduceByKey(_ + _)
.map(x => (x, 1)).reduceByKey((l, r) => l + r)
For better understanding this case and other usages read some articles. Like this or this
Well, I think that I found what was my main error. I was telling spark to monitor the same directory where I was creating my output file. So, my thought is that it was self-triggering a new event everytime it saved the output processed data. Once I created an input and output directory the repeating output stopped and worked as expected. It only generates a new output until I provide a new file to the server manually.
I am using Spark Job Server to submit spark jobs in cluster .The application I am trying to test is a
spark program based on Sansa query and Sansa stack . Sansa is used for scalable processing of huge amounts of RDF data and Sansa query is one of the sansa libraries which is used for querying RDF data.
When I am running the spark application as a spark program with spark-submit command it works correctly as expected.But when ran the program through spark job server , the applications fails most of the time with below exception .
0/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on
us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown
hook 20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end
event, stopping job manger. 20/05/29 18:57:00 INFO JobManagerActor:
Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at
http://10.138.32.96:46627 20/05/29 18:57:00 INFO TaskSetManager:
Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com,
executor 1, partition 3, NODE_LOCAL, 4942 bytes) 20/05/29 18:57:00
INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms
on us1salxhpw0653.corpnet2.com (executor 1) (1/560) 20/05/29 18:57:00
INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64,
us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942
bytes) 20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in
stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor
2) (2/560) 20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in
stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition
5, NODE_LOCAL, 4942 bytes) 20/05/29 18:57:00 INFO TaskSetManager:
Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on
us1salxhpw0670.corpnet2.com (executor 3) (3/560) 20/05/29 18:57:00
INFO BlockManagerInfo: Added rdd_44_4 in memory on
us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on
us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at
SansaQueryExample.scala:32, took 0.732943 s 20/05/29 18:57:00 INFO
DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32)
failed in 0.556 s due to Stage cancelled because SparkContext was shut
down 20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
> > org.apache.spark.SparkException: Job 2 cancelled because SparkContext
was shut down at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) at
org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
at
org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1922) at
org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
at
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
Code which used for direct execution
object SansaQueryExampleWithoutSJS {
def main(args: Array[String]) {
val spark=SparkSession.builder().appName("sansa stack example").getOrCreate()
val input = "hdfs://user/dileep/rdf.nt";
val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
val lang = Lang.NTRIPLES
val graphRdd = spark.rdf(lang)(input)
println(graphRdd.collect().foreach(println))
val result = graphRdd.sparql(sparqlQuery)
result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
}
Code Integrated with Spark Job Server
object SansaQueryExample extends SparkSessionJob {
override type JobData = Seq[String]
override type JobOutput = collection.Map[String, Long]
override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
JobData Or Every[ValidationProblem] = {
Try(config.getString("input.string").split(" ").toSeq)
.map(words => Good(words))
.getOrElse(Bad(One(SingleProblem("No input.string param"))))
}
override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
val input = "hdfs://user/dileep/rdf.nt";
val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
val lang = Lang.NTRIPLES
val graphRdd = sparkSession.rdf(lang)(input)
println(graphRdd.collect().foreach(println))
val result = graphRdd.sparql(sparqlQuery)
result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
sparkSession.sparkContext.parallelize(data).countByValue
}
}
Steps for executing an application via spark job server is explained here ,mainly
upload the jar into SJS through rest api
create a spark context with memory and core as required ,through another api
execute the job via another api mentioning the jar and context already created
So when I observed different executions of the program , I could see like the spark job server is behaving inconsistently and the program works on few occasions without any errors .Also observed like sparkcontext is being shutdown due to some unknown reasons .I am using SJS 0.8.0 and sansa 0.7.1 and spark 2.4
Iam running a spark simple program on a cluster:
val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println()
println()
println()
and i get the following error
15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on
executor 192.168.0.19: java.io.FileNotFoundException (File
file:/home/hduser/README.md does not exist.) [duplicate 6]
15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7)
on executor 192.168.0.19: java.io.FileNotFoundException (File
file:/home/hduser/README.md does not exist.) [duplicate 7]
15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at
SimpleApp.scala:55) failed in 7.636 s
15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at
SimpleApp.scala:55, took 7.810387 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The file is in the correct place. IF I REPLACE README.MD WITH REAMDME.TXT IT WILL WORK FINE. Can somenone help with this?
Thanks
If you are running a multinode cluster, make sure all nodes have the file in the same given path, with respect to their own filesystem. Or, you know, just use HDFS.
In multinode case "/home/hduser/README.md" path is distributed to worker nodes as well. README.md probably exists only on master node. Now when workers try to access this file, they won't look into master's fs instead each will try to find it on their own fs. If you have the same file on the same path in every node. The code is very likely to work. To do this, copy the file to every node's fs using the same path.
As you've already noticed, the solution above is cumbersome. Hadoop FS, HDFS, solves this issue and many more. You should look into it.
It is simply because a file with an extension .md contains plain text with formatting information. When you save this file with .txt extension, the formatting informations are removed or not considered. sc.textFile() works with plain texts.
I am new to scala and spark. I am trying to join two RDDs coming from two different text files. In each text file there two columns separated by tab, e.g.
text1 text2
100772C111 ion 200772C222 ion
100772C111 on 200772C222 gon
100772C111 n 200772C2 n
So I want to join these two files based their second columns and get a result as below meaning that there are 2 common terms for given those two documents:
((100772C111-200772C222,2))
My computer's features:
4 X (intel(r) core(tm) i5-2430m cpu #2.40 ghz)
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}
object hw {
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - `scala\\wos14.txt")
.map { line => val parts = line.split("\t")((parts(5)),parts(0))}
val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
.map{ line2 => val parts = line2.split("\t")
((parts(3)),parts(1)) }
val finalemp = emp_new.distinct().join(emp.distinct())
.map{case((nk1), ((parts1),(val1))) => (parts1 + "-" + val1, 1)}
.reduceByKey((a, b) => a + b)
finalemp.foreach(println)
}
}
This code gives what I want when I try with text files in smaller sizes. However, I want to implement this script for big text files. I have one text file with a size of 110 KB (approx. 4M rows) and another one 9 gigabyte (more than 1B rows).
When I run my script employing these two text files, I observed on the log screen following:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on `localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files in local? After waiting more than 3 hours the program was still spilling data to the disk.
To sum up, is there something that I need to change in my code to cope with the performance issues?
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with very small memory allocation. It won't use as much memory as it can eat like, say, an RDMS. You need to tell it how much you want it to use.
The default is (I believe) one executor per node, and 512MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and each executor on the SparkUI, which (by default) is at http://localhost:4040
I have 2 question want to know:
This is my code:
object Hi {
def main (args: Array[String]) {
println("Sucess")
val conf = new SparkConf().setAppName("HI").setMaster("local")
val sc = new SparkContext(conf)
val textFile = sc.textFile("src/main/scala/source.txt")
val rows = textFile.map { line =>
val fields = line.split("::")
(fields(0), fields(1).toInt)
}
val x = rows.map{case (range , ratednum) => range}.collect.mkString("::")
val y = rows.map{case (range , ratednum) => ratednum}.collect.mkString("::")
println(x)
println(y)
println("Sucess2")
}
}
Here is some of resault :
15/04/26 16:49:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/04/26 16:49:57 INFO SparkUI: Started SparkUI at http://192.168.1.105:4040
15/04/26 16:49:57 INFO Executor: Starting executor ID <driver> on host localhost
15/04/26 16:49:57 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#192.168.1.105:64952/user/HeartbeatReceiver
15/04/26 16:49:57 INFO NettyBlockTransferService: Server created on 64954
15/04/26 16:49:57 INFO BlockManagerMaster: Trying to register BlockManager
15/04/26 16:49:57 INFO BlockManagerMasterActor: Registering block manager localhost:64954 with 983.1 MB RAM, BlockManagerId(<driver>, localhost, 64954)
.....
15/04/26 16:49:59 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/04/26 16:49:59 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at map at Hi.scala:25)
15/04/26 16:49:59 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/04/26 16:49:59 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1331 bytes)
15/04/26 16:49:59 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/04/26 16:49:59 INFO HadoopRDD: Input split: file:/Users/Winsome/IdeaProjects/untitled/src/main/scala/source.txt:0+23
15/04/26 16:49:59 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1787 bytes result sent to driver
15/04/26 16:49:59 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (1/1)
15/04/26 16:49:59 INFO DAGScheduler: Stage 1 (collect at Hi.scala:25) finished in 0.013 s
15/04/26 16:49:59 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/04/26 16:49:59 INFO DAGScheduler: Job 1 finished: collect at Hi.scala:25, took 0.027784 s
1~1::2~2::3~3
10::20::30
Sucess2
My first question is : When I check http://localhost:8080/
There is no worker. and I can't open http://192.168.1.105:4040 too
Is is because I use spark standalone?
How to fixed this??
(My environment is MAC,IDE is Intellij)
My 2nd question is:
val x = rows.map{case (range , ratednum) => range}.collect.mkString("::")
val y = rows.map{case (range , ratednum) => ratednum}.collect.mkString("::")
println(x)
println(y)
I thiink these code could be more easily to get x and y (something like this stuff :rows[range],rows[ratenum]),But I'm not familiar with scala .
Could you give me some advice?
I'm not sure about you first question, but reading your log I see that the worker node lasted 13 ms, so this may be the reason why you haven't see it. Run a longer job and you may see the workers.
About the second question, yes, there is a simpler way to write it that is:
val x = rows.map{(tuple) => tuple._1}.collect.mkString("::")
because your RDD is made of Tuple scala objects, which are made of two fields you can access with _1 and _2 respectively.