The Spark code below does not appear to perform any operation on the file example.txt:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new org.apache.spark.SparkConf()
.setMaster("local")
.setAppName("filter")
.setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
.set("spark.executor.memory", "2g");
val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("C:\\example.txt")
dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
I'm attempting to print the first 10 elements of the file using dataFile.print().
Some of the generated output:
15/03/12 12:23:53 INFO JobScheduler: Started JobScheduler
15/03/12 12:23:54 INFO FileInputDStream: Finding new files took 105 ms
15/03/12 12:23:54 INFO FileInputDStream: New files at time 1426163034000 ms:
15/03/12 12:23:54 INFO JobScheduler: Added jobs for time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Starting job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
-------------------------------------------
Time: 1426163034000 ms
-------------------------------------------
15/03/12 12:23:54 INFO JobScheduler: Finished job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Total delay: 0.157 s for time 1426163034000 ms (execution: 0.006 s)
15/03/12 12:23:54 INFO FileInputDStream: Cleared 0 old files that were older than 1426162974000 ms:
15/03/12 12:23:54 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:55 INFO FileInputDStream: Finding new files took 2 ms
15/03/12 12:23:55 INFO FileInputDStream: New files at time 1426163035000 ms:
15/03/12 12:23:55 INFO JobScheduler: Added jobs for time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Starting job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
-------------------------------------------
Time: 1426163035000 ms
-------------------------------------------
15/03/12 12:23:55 INFO JobScheduler: Finished job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Total delay: 0.011 s for time 1426163035000 ms (execution: 0.001 s)
15/03/12 12:23:55 INFO MappedRDD: Removing RDD 1 from persistence list
15/03/12 12:23:55 INFO BlockManager: Removing RDD 1
15/03/12 12:23:55 INFO FileInputDStream: Cleared 0 old files that were older than 1426162975000 ms:
15/03/12 12:23:55 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:56 INFO FileInputDStream: Finding new files took 3 ms
15/03/12 12:23:56 INFO FileInputDStream: New files at time 1426163036000 ms:
example.txt is in the format:
gdaeicjdcg,194,155,98,107
jhbcfbdigg,73,20,122,172
ahdjfgccgd,28,47,40,178
afeidjjcef,105,164,37,53
afeiccfdeg,29,197,128,85
aegddbbcii,58,126,89,28
fjfdbfaeid,80,89,180,82
As the print documentation states:
/**
* Print the first ten elements of each RDD generated in this DStream. This is an output
* operator, so this DStream will be registered as an output stream and there materialized.
*/
Does this mean 0 RDDs have been generated for this stream? With Apache Spark core, if I want to see the contents of an RDD I would use its collect function. Is there a similar method for streams? In short, how do I print the contents of a stream to the console?
Update:
Updated code based on 0x0FFF's comment. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html does not appear to give an example of reading from the local file system. Is this not as common as with Spark core, where there are explicit examples for reading data from a file?
Here is the updated code:
val conf = new org.apache.spark.SparkConf()
.setMaster("local[2]")
.setAppName("filter")
.setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
.set("spark.executor.memory", "2g");
val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("file:///c:/data/")
dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
But the output is the same. When I add new files to the c:\\data directory (with the same format as the existing data files) they are not processed. I assume dataFile.print should print the first 10 lines to the console?
Update 2:
Perhaps this is related to the fact that I'm running this code in a Windows environment?
You misunderstood the use of textFileStream. Here is its description from the Spark documentation:
Create a input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat).
So first, you should pass it a directory, and second, that directory should be accessible from the node running the receiver, so it is better to use HDFS for this purpose. Then, when you put a new file into this directory, it will be processed by print() and its first 10 lines will be printed.
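For example, a minimal Scala sketch of this pattern (the monitored directory below is only a placeholder; any directory visible to the node running the application will do, and only files that appear after the stream starts are picked up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("dir-monitor")
val ssc = new StreamingContext(conf, Seconds(1))

// Pass a directory, not a file; files created or moved here after ssc.start()
// are read as text and become the next batch's RDD
val lines = ssc.textFileStream("hdfs:///tmp/streaming-input/")
lines.print() // prints the first 10 elements of each batch

ssc.start()
ssc.awaitTermination()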
Update:
My code:
[alex@sparkdemo tmp]$ pyspark --master local[2]
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/12 06:37:49 WARN Utils: Your hostname, sparkdemo resolves to a loopback address: 127.0.0.1; using 192.168.208.133 instead (on interface eth0)
15/03/12 06:37:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Python version 2.6.6 (r266:84292, Nov 22 2013 12:16:22)
SparkContext available as sc.
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 30)
>>> dataFile = ssc.textFileStream('file:///tmp')
>>> dataFile.pprint()
>>> ssc.start()
>>> ssc.awaitTermination()
-------------------------------------------
Time: 2015-03-12 06:40:30
-------------------------------------------
-------------------------------------------
Time: 2015-03-12 06:41:00
-------------------------------------------
-------------------------------------------
Time: 2015-03-12 06:41:30
-------------------------------------------
1 2 3
4 5 6
7 8 9
-------------------------------------------
Time: 2015-03-12 06:42:00
-------------------------------------------
Here is a custom receiver I wrote that listens for data in a specified directory:
package receivers
import java.io.File
import org.apache.spark.{ SparkConf, Logging }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.receiver.Receiver
class CustomReceiver(dir: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    // Start the thread that reads files from the directory
    new Thread("File Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do here, as the thread calling receive()
    // is designed to stop by itself once isStopped() returns true
  }

  def recursiveListFiles(f: File): Array[File] = {
    val these = f.listFiles
    these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
  }

  private def receive() {
    // recursiveListFiles also returns directories, so only read regular files
    for (f <- recursiveListFiles(new File(dir)) if f.isFile) {
      val source = scala.io.Source.fromFile(f)
      store(source.getLines)
      source.close()
    }
    logInfo("Stopped receiving")
    restart("Trying to connect again")
  }
}
One thing to be aware of, I think, is that the files need to be processed in a time that is <= the configured batchDuration. In the example below it's set to 10 seconds, but if the time the receiver takes to process the files exceeds 10 seconds then some data files will not be processed. I'm open to correction on this point.
Here is how the custom receiver is used:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import receivers.CustomReceiver

val conf = new org.apache.spark.SparkConf()
.setMaster("local[2]")
.setAppName("filter")
.setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
.set("spark.executor.memory", "2g");
val ssc = new StreamingContext(conf, Seconds(10))
val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomReceiver("C:\\data\\"))
customReceiverStream.print
customReceiverStream.foreachRDD(m => {
println("size is " + m.collect.size)
})
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
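A side note on the foreachRDD block above: collect() pulls every record of the batch back to the driver just to compute a size, which can get expensive for large batches. A lighter-weight sketch of the same check on the same stream would be:

customReceiverStream.foreachRDD { rdd =>
  // count() runs on the executors; only the resulting Long comes back to the driver
  println("size is " + rdd.count())
}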
More info at :
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html &
https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
I think I found your issue; you should have this in your log:
WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
The problem is that you need at least 2 cores to run a Spark Streaming app.
So the solution should be to simply replace:
val conf = new org.apache.spark.SparkConf()
.setMaster("local")
with:
val conf = new org.apache.spark.SparkConf()
.setMaster("local[*]")
or at least with more than one core.
Related
Hi, I am new to Scala and I am using IntelliJ IDEA. I am trying to file-stream a text file, running Scala on a Hadoop/Spark cluster. My main goal is to count only words (without any special characters) in a (key, value) format.
I found the "apache-spark regex extract words from rdd" article, where they use the findAllIn() function with a regex, but I'm not sure I am using it correctly in my code.
When I build my project and generate the jar file to run it in Spark, I manually provide the text file and it seems that it runs and counts words, but it also seems to enter a loop, as it processes the file infinitely. I thought it should process it only once.
Can someone tell me why this may be happening? Or is there a better way to achieve my goal?
Part of my text is:
It wlll only take a day,' he said. The others disagreed.
It's too fragile," 289 they said disapprovingly 23 age, but he refused to listen. Not
quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.
My code is:
package streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
def main(args: Array[String]): Unit = {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val sep_words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = sep_words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/task1.txt")
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
My expected output is something like below, but as mentioned before it keeps repeating:
22/10/24 05:11:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/10/24 05:11:40 INFO Executor: Finished task 0.0 in stage 27.0 (TID 20). 1529 bytes result sent to driver
22/10/24 05:11:40 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 20) in 18 ms on localhost (executor driver) (1/1)
22/10/24 05:11:40 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool
22/10/24 05:11:40 INFO DAGScheduler: ResultStage 27 (print at NetworkWordCount.scala:28) finished in 0.031 s
22/10/24 05:11:40 INFO DAGScheduler: Job 13 finished: print at NetworkWordCount.scala:28, took 0.036092 s
-------------------------------------------
Time: 1666588300000 ms
-------------------------------------------
(heaped,1)
(safe,1)
(became,1)
(For,1)
(ll,1)
(it,1)
(Let,1)
(Open,1)
(others,1)
(pack,1)
...
22/10/24 05:11:40 INFO JobScheduler: Finished job streaming job 1666588300000 ms.1 from job set of time 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Total delay: 0.289 s for time 1666588300000 ms (execution: 0.222 s)
22/10/24 05:11:40 INFO ShuffledRDD: Removing RDD 40 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 40
22/10/24 05:11:40 INFO MapPartitionsRDD: Removing RDD 39 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 39
Additionally, I notice that when I only use the split() function, without any regex or the findAllIn() function, it processes the file only once as expected, but of course it only splits the text by spaces.
Something like this:
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
I almost forgot: if you can also explain the meaning of the underscores used in the code, that would help me a lot. I am having some problems understanding that part too.
Thanks in advance.
If your file doesn't change while you're parsing it, you don't need a StreamingContext.
Just create a SparkContext:
val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
and then process your file in the way you need.
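For example, a minimal batch-only sketch of the same word count (reusing the question's regex; the input and output paths are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))

val lines = sc.textFile("hdfs:///user/test/input.txt")    // placeholder input path
val words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))     // keep only alphabetic tokens
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.saveAsTextFile("hdfs:///user/test/task1_out")  // placeholder output path
wordCounts.take(10).foreach(println)                      // show a sample on the driver
sc.stop()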
You need a StreamingContext only if you have data sources that change and you need to process the changed delta continuously.
There are many different ways to use the underscore in Scala, but in your case the underscore is just syntactic sugar that shortens a lambda function. It actually stands for an element of the collection.
You can rewrite the underscore version like this:
//both lines doing the same
.flatMap("[a-zA-Z]+".r.findAllIn(_))
.flatMap((s: String) => "[a-zA-Z]+".r.findAllIn(s))
And
//both lines doing the same
.map(x => (x, 1)).reduceByKey(_ + _)
.map(x => (x, 1)).reduceByKey((l, r) => l + r)
For a better understanding of this case and other usages, read some articles, like this or this.
Well, I think I found my main error. I was telling Spark to monitor the same directory where I was creating my output files. So my thought is that it was triggering a new event every time it saved the processed output data. Once I created separate input and output directories, the repeating output stopped and it worked as expected. It only generates new output when I manually provide a new file to the server.
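In other words, keep the monitored directory and the output directory distinct. A rough sketch, assuming the same ssc and word-count pipeline as in the question (the directory names are placeholders):

// Monitor one directory and write results somewhere else, so the stream
// never picks up its own output as "new files"
val lines = ssc.textFileStream("hdfs:///user/test/input/")
val wordCounts = lines
  .flatMap("[a-zA-Z]+".r.findAllIn(_))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/output/task1")  // separate output prefix
wordCounts.print()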
I am using Spark Job Server to submit Spark jobs in a cluster. The application I am trying to test is a
Spark program based on Sansa query and the Sansa stack. Sansa is used for scalable processing of huge amounts of RDF data, and Sansa query is one of the Sansa libraries, used for querying RDF data.
When I run the Spark application as a Spark program with the spark-submit command it works correctly as expected. But when the program is run through Spark Job Server, the application fails most of the time with the exception below.
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown hook
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event, stopping job manger.
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at http://10.138.32.96:46627
20/05/29 18:57:00 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com, executor 1, partition 3, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms on us1salxhpw0653.corpnet2.com (executor 1) (1/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64, us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor 2) (2/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition 5, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on us1salxhpw0670.corpnet2.com (executor 3) (3/560)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_4 in memory on us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at SansaQueryExample.scala:32, took 0.732943 s
20/05/29 18:57:00 INFO DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32) failed in 0.556 s due to Stage cancelled because SparkContext was shut down
20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1922)
    at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
Code used for direct execution:
object SansaQueryExampleWithoutSJS {
def main(args: Array[String]) {
val spark=SparkSession.builder().appName("sansa stack example").getOrCreate()
val input = "hdfs://user/dileep/rdf.nt";
val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
val lang = Lang.NTRIPLES
val graphRdd = spark.rdf(lang)(input)
println(graphRdd.collect().foreach(println))
val result = graphRdd.sparql(sparqlQuery)
result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
}
}
Code integrated with Spark Job Server:
object SansaQueryExample extends SparkSessionJob {
override type JobData = Seq[String]
override type JobOutput = collection.Map[String, Long]
override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
JobData Or Every[ValidationProblem] = {
Try(config.getString("input.string").split(" ").toSeq)
.map(words => Good(words))
.getOrElse(Bad(One(SingleProblem("No input.string param"))))
}
override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
val input = "hdfs://user/dileep/rdf.nt";
val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
val lang = Lang.NTRIPLES
val graphRdd = sparkSession.rdf(lang)(input)
println(graphRdd.collect().foreach(println))
val result = graphRdd.sparql(sparqlQuery)
result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
sparkSession.sparkContext.parallelize(data).countByValue
}
}
The steps for executing an application via Spark Job Server are explained here; mainly:
upload the jar into SJS through the REST API
create a Spark context with the memory and cores required, through another API
execute the job via another API, mentioning the jar and the context already created
When I observed different executions of the program, I could see that Spark Job Server behaves inconsistently, and the program works on a few occasions without any errors. I also observed that the SparkContext is being shut down for some unknown reason. I am using SJS 0.8.0, Sansa 0.7.1 and Spark 2.4.
I am running a simple Spark program on a cluster:
val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println()
println()
println()
and I get the following error:
15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 6]
15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 7]
15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:55) failed in 7.636 s
15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at SimpleApp.scala:55, took 7.810387 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The file is in the correct place. If I replace README.md with README.txt it works fine. Can someone help with this?
Thanks
If you are running a multinode cluster, make sure all nodes have the file in the same given path, with respect to their own filesystem. Or, you know, just use HDFS.
In the multinode case, the path "/home/hduser/README.md" is distributed to the worker nodes as well. README.md probably exists only on the master node. When the workers try to access this file, they won't look into the master's filesystem; instead each will try to find it on its own filesystem. If you have the same file at the same path on every node, the code is very likely to work. To do this, copy the file to every node's filesystem using the same path.
As you've already noticed, the solution above is cumbersome. The Hadoop distributed filesystem, HDFS, solves this issue and many more. You should look into it.
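For instance, after copying the file into HDFS once (e.g. hdfs dfs -put /home/hduser/README.md /user/hduser/), the program only needs an HDFS path; the exact path below is just an example:

// Every node resolves this path against HDFS rather than its local filesystem,
// so the file only has to exist in one place
val logFile = "hdfs:///user/hduser/README.md"
val logData = sc.textFile(logFile).cache()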
It is simply because a file with the extension .md contains plain text with formatting information. When you save this file with a .txt extension, the formatting information is removed or not considered. sc.textFile() works with plain text.
I am new to Scala and Spark. I am trying to join two RDDs coming from two different text files. In each text file there are two columns separated by a tab, e.g.
text1 text2
100772C111 ion 200772C222 ion
100772C111 on 200772C222 gon
100772C111 n 200772C2 n
So I want to join these two files based on their second columns and get a result like the one below, meaning that there are 2 common terms for those two documents:
((100772C111-200772C222,2))
My computer's specs:
4 x Intel(R) Core(TM) i5-2430M CPU @ 2.40 GHz
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}
object hw {
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
  .map { line => val parts = line.split("\t"); (parts(5), parts(0)) }
val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
  .map { line2 => val parts = line2.split("\t"); (parts(3), parts(1)) }
val finalemp = emp_new.distinct().join(emp.distinct())
.map{case((nk1), ((parts1),(val1))) => (parts1 + "-" + val1, 1)}
.reduceByKey((a, b) => a + b)
finalemp.foreach(println)
}
}
This code gives what I want when I try it with smaller text files. However, I want to run this script on big text files. I have one text file with a size of 110 KB (approx. 4M rows) and another one of 9 gigabytes (more than 1B rows).
When I run my script on these two text files, I observe the following on the log screen:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files locally? After waiting more than 3 hours, the program was still spilling data to disk.
To sum up, is there something that I need to change in my code to cope with the performance issues?
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with a very small memory allocation. It won't use as much memory as it can eat like, say, an RDBMS. You need to tell it how much you want it to use.
The default is (I believe) one executor per node, and 512MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and each executor on the SparkUI, which (by default) is at http://localhost:4040
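If you launch a packaged application rather than the shell, the same flags apply to spark-submit (the class and jar names below are placeholders for your own project):

spark-submit --class hw --master local[*] \
  --driver-memory 2G --executor-memory 2G \
  target/scala-2.10/my-join-app.jar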
I have 2 questions I want to ask:
This is my code:
import org.apache.spark.{SparkConf, SparkContext}

object Hi {
def main (args: Array[String]) {
println("Sucess")
val conf = new SparkConf().setAppName("HI").setMaster("local")
val sc = new SparkContext(conf)
val textFile = sc.textFile("src/main/scala/source.txt")
val rows = textFile.map { line =>
val fields = line.split("::")
(fields(0), fields(1).toInt)
}
val x = rows.map{case (range , ratednum) => range}.collect.mkString("::")
val y = rows.map{case (range , ratednum) => ratednum}.collect.mkString("::")
println(x)
println(y)
println("Sucess2")
}
}
Here is some of the result:
15/04/26 16:49:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/04/26 16:49:57 INFO SparkUI: Started SparkUI at http://192.168.1.105:4040
15/04/26 16:49:57 INFO Executor: Starting executor ID <driver> on host localhost
15/04/26 16:49:57 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.105:64952/user/HeartbeatReceiver
15/04/26 16:49:57 INFO NettyBlockTransferService: Server created on 64954
15/04/26 16:49:57 INFO BlockManagerMaster: Trying to register BlockManager
15/04/26 16:49:57 INFO BlockManagerMasterActor: Registering block manager localhost:64954 with 983.1 MB RAM, BlockManagerId(<driver>, localhost, 64954)
.....
15/04/26 16:49:59 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/04/26 16:49:59 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at map at Hi.scala:25)
15/04/26 16:49:59 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/04/26 16:49:59 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1331 bytes)
15/04/26 16:49:59 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/04/26 16:49:59 INFO HadoopRDD: Input split: file:/Users/Winsome/IdeaProjects/untitled/src/main/scala/source.txt:0+23
15/04/26 16:49:59 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1787 bytes result sent to driver
15/04/26 16:49:59 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (1/1)
15/04/26 16:49:59 INFO DAGScheduler: Stage 1 (collect at Hi.scala:25) finished in 0.013 s
15/04/26 16:49:59 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/04/26 16:49:59 INFO DAGScheduler: Job 1 finished: collect at Hi.scala:25, took 0.027784 s
1~1::2~2::3~3
10::20::30
Sucess2
My first question is: when I check http://localhost:8080/,
there is no worker, and I can't open http://192.168.1.105:4040 either.
Is it because I'm using Spark standalone?
How do I fix this?
(My environment is macOS; the IDE is IntelliJ.)
My 2nd question is:
val x = rows.map{case (range , ratednum) => range}.collect.mkString("::")
val y = rows.map{case (range , ratednum) => ratednum}.collect.mkString("::")
println(x)
println(y)
I think there could be an easier way to get x and y (something like rows[range] and rows[ratednum]), but I'm not familiar with Scala.
Could you give me some advice?
I'm not sure about your first question, but reading your log I see that the worker task lasted 13 ms, so this may be the reason why you haven't seen it. Run a longer job and you may see the workers.
About the second question: yes, there is a simpler way to write it, which is:
val x = rows.map{(tuple) => tuple._1}.collect.mkString("::")
because your RDD is made of Scala Tuple objects, which have two fields that you can access with _1 and _2 respectively.
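And since rows is a pair RDD (an RDD of two-element tuples), Spark also exposes keys and values directly, so an equivalent sketch is:

// keys gives the first element of each tuple, values the second
val x = rows.keys.collect.mkString("::")
val y = rows.values.collect.mkString("::")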