EDIT: Note: An executor will normally emit the message [GC (Allocation Failure)]. It does this because it is trying to allocate memory but the heap is full, so it runs a GC in an attempt to free space for whatever it is loading. If an executor does this in a loop, it may mean that what you are trying to load into that executor is too big.
I am running Spark 2.2, Scala 2.11 on AWS EMR 5.8.0
I'm trying to run a count operation on a Dataset that refuses to finish. What's frustrating is that it only hangs on one particular file; I run the same job on a different file from S3 and it finishes without a problem. The original CSV file is about 18 GB, and we run a transformation on it that turns the original CSV into a struct column, adding one extra column.
My environment's core slaves are 8 instances, each of which is:
r3.2xlarge
16 vCores, 61 GiB memory, 160 GB SSD storage
My Spark session settings are:
implicit val spark = SparkSession
.builder()
.appName("MyApp")
.master("yarn")
.config("spark.speculation","false")
.config("hive.metastore.uris", s"thrift://$hadoopIP:9083")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.dynamicAllocation.enabled", false)
.config("spark.executor.cores", 5)
.config("spark.executors.memory", "18G")
.config("spark.yarn.executor.memoryOverhead", "2G")
.config("spark.driver.memory", "18G")
.config("spark.executor.instances", 23)
.config("spark.default.parallelism", 230)
.config("spark.sql.shuffle.partitions", 230)
.enableHiveSupport()
.getOrCreate()
The data comes from a CSV file:
val ds = spark.read
.option("header", "true")
.option("delimiter", ",")
.schema(/* 2 cols: [ValidatedNel, and a struct schema */)
.csv(sourceFromS3)
.as[MyCaseClass]
val mappedDs:Dataset[ValidatedNel, MyCaseClass] = ds.map(...)
mappedDs.repartition(230)
val count = mappedDs.count() // never finishes
As expected, it spins up 230 tasks, and 229 of them finish; one somewhere in the middle does not. See below: the first row shown just hangs forever, the middle row finishes without a problem (though it is odd in that its size/records ratio is very different), and the other 229 tasks all look like the third row.
Index| ID |Attempt |Status|Locality Level|Executor ID / Host| Launch Time | Duration |GC Time|Input Size / Records|Write Time | Shuffle Write Size / Records| Errors
110 117 0 RUNNING RACK_LOCAL 11 / ip-XXX-XX-X-XX.uswest-2.compute.internal 2019/10/01 20:34:01 1.1 h 43 min 66.2 MB / 2289538 0.0 B / 0
0 7 0 SUCCESS PROCESS_LOCAL 9 / ip-XXX-XX-X-XXX.us-west-2.compute.internal 2019/10/01 20:32:10 1.0 s 16 ms 81.2 MB /293 5 ms 59.0 B / 1 <-- this task is odd, but finishes
1 8 0 SUCCESS RACK_LOCAL 9 / ip-XXX-XX-X-XXX.us-west-2.compute.internal 2019/10/01 20:32:10 2.1 min 16 ms 81.2 MB /2894845 9 s 59.0 B / 1 <- the other tasks are all similar to this one
Checking the stdout of the hanging task, I repeatedly see the following, which never ends:
2019-10-01T21:51:16.055+0000: [GC (Allocation Failure) 2019-10-01T21:51:16.055+0000: [ParNew: 10904K->0K(613440K), 0.0129982 secs]2019-10-01T21:51:16.068+0000: [CMS2019-10-01T21:51:16.099+0000: [CMS-concurrent-mark: 0.031/0.044 secs] [Times: user=0.17 sys=0.00, real=0.04 secs]
(concurrent mode failure): 4112635K->2940648K(4900940K), 0.4986233 secs] 4123539K->2940648K(5514380K), [Metaspace: 60372K->60372K(1103872K)], 0.5121869 secs] [Times: user=0.64 sys=0.00, real=0.51 secs]
Another note: I call repartition(230) just prior to calling count on the Dataset[T], to ensure an even distribution of the data.
What is going on here?
It probably has something to do with data skew and/or data-parsing issues. Note that the problem partition has radically more records than the one that was processed successfully:
Input Size / Records
66.2 MB / 2289538
81.2 MB /293
I'd check that all the partition files have roughly the same size and number of records. Perhaps line and/or column delimiters are off in either the problem partition or the "good" one (293 lines seems far too low for an ~80 MB file).
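One quick way to compare the input files is to count records per source file; a minimal sketch that reuses the spark session, read options, and sourceFromS3 path from the question (everything else is illustrative):
import org.apache.spark.sql.functions.{col, input_file_name}
// Records per source file: large outliers point to skew or delimiter problems.
val perFileCounts = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv(sourceFromS3)
  .groupBy(input_file_name().as("file"))
  .count()
  .orderBy(col("count").desc)
perFileCounts.show(false)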
Related
Hi, I am new to Scala and I am using IntelliJ IDEA. I am trying to stream a text file with Scala on a Hadoop/Spark cluster. My main goal is to count only words (without any special characters) in a (key, value) format.
I found the article "apache-spark regex extract words from rdd", where they use the findAllIn() function with a regex, but I am not sure I am using it correctly in my code.
When I build my project, generate the jar file, and run it in Spark, manually providing the text file, it seems to run and count words, but it also seems to enter a loop, processing the file infinitely. I thought it should process it only once.
Can someone tell me why this may be happening? Or is there a better way to achieve my goal?
Part of my text is:
It wlll only take a day,' he said. The others disagreed.
It's too fragile," 289 they said disapprovingly 23 age, but he refused to listen. Not
quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.
My code is:
package streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
def main(args: Array[String]): Unit = {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val sep_words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = sep_words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/task1.txt")
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
My expected output is something like the below, but as mentioned before it keeps repeating:
22/10/24 05:11:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/10/24 05:11:40 INFO Executor: Finished task 0.0 in stage 27.0 (TID 20). 1529 bytes result sent to driver
22/10/24 05:11:40 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 20) in 18 ms on localhost (executor driver) (1/1)
22/10/24 05:11:40 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool
22/10/24 05:11:40 INFO DAGScheduler: ResultStage 27 (print at NetworkWordCount.scala:28) finished in 0.031 s
22/10/24 05:11:40 INFO DAGScheduler: Job 13 finished: print at NetworkWordCount.scala:28, took 0.036092 s
-------------------------------------------
Time: 1666588300000 ms
-------------------------------------------
(heaped,1)
(safe,1)
(became,1)
(For,1)
(ll,1)
(it,1)
(Let,1)
(Open,1)
(others,1)
(pack,1)
...
22/10/24 05:11:40 INFO JobScheduler: Finished job streaming job 1666588300000 ms.1 from job set of time 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Total delay: 0.289 s for time 1666588300000 ms (execution: 0.222 s)
22/10/24 05:11:40 INFO ShuffledRDD: Removing RDD 40 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 40
22/10/24 05:11:40 INFO MapPartitionsRDD: Removing RDD 39 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 39
Additionally, I noticed that when I only use the split() function, without any regex or findAllIn(), it processes the file only once as expected, but of course it only splits the text by spaces.
Something like this:
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
I almost forgot: if you can also explain the meaning of the underscores in the code, that would help me a lot. I am having some trouble understanding that part too.
Thanks in advance.
If your file doesn't change while you're processing it, you don't need a StreamingContext.
Just create SparkContext:
val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
and then process you file in the way you need.
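For example, a one-shot batch version of the word count from the question might look like this sketch (the regex mirrors the question; the paths are placeholders):
// Batch word count: read the input once, count words, write the result.
val counts = sc.textFile("hdfs:///user/test/input")
  .flatMap("[a-zA-Z]+".r.findAllIn(_))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///user/test/task1_batch")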
You need a StreamingContext only if you have data sources that change and you need to process the changing delta continuously.
There are many different ways to use the underscore in Scala, but in your current case the underscore is just syntactic sugar that shortens a lambda function. It stands for the element of the collection.
You can expand the underscore like this:
//both lines doing the same
.flatMap("[a-zA-Z]+".r.findAllIn(_))
.flatMap((s: String) => "[a-zA-Z]+".r.findAllIn(s))
And
//both lines doing the same
.map(x => (x, 1)).reduceByKey(_ + _)
.map(x => (x, 1)).reduceByKey((l, r) => l + r)
For a better understanding of this case and other usages, read some articles on the underscore in Scala.
Well, I think I found my main error. I was telling Spark to monitor the same directory where I was creating my output files, so my guess is that it was triggering a new event every time it saved the processed output. Once I created separate input and output directories, the repeating output stopped and it worked as expected. It now only generates new output when I manually provide a new file to the server.
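The fix boils down to keeping the monitored input directory and the output path separate; a minimal sketch with placeholder directory names:
val lines = ssc.textFileStream("hdfs:///user/test/input")      // monitored input directory
val wordCounts = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/output/task1")   // output written elsewhere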
I have Scala code that runs on top of Spark 2.4.0 to compute the BFS of a graph, which is stored in a table as below:
src | dst | isVertex
1   | 1   | 1
2   | 2   | 1
... | ... | ...
1   | 2   | 0
2   | 4   | 0
... | ... | ...
At some point in the algorithm, I need to update the visited flag of the current vertex's neighbors. I am doing this with the following code. When I execute it, it works fine, but as time goes on it becomes slower and slower. The nested loop at the end seems to be the problem:
//var vertices = schema(StructType(Seq(StructField("id",IntegerType),StructField("visited", IntegerType),StructField("qOrder",IntegerType))
val neighbours = edges.filter($"src" === start).join(vertices,$"id" === $"dst").filter($"visited" === lit(0))
.select($"dst".as("id")).withColumn("visited", lit(1)).withColumn("qOrder", lit(priorityCounter)).cache()
-----------------------------------------------------------------------
vertices.collect.foreach{x=>
if(!neighbours.filter(col("id")===x(0)).head(1).isEmpty){
vertices = vertices.filter($"id" =!= x(0)).union(neighbours.filter(col("id")===x(0))).cache()
}
}
-----------------------------------------------------------------------
When it becomes slow, it starts giving the following errors and warnings:
2021-06-08 19:55:08,998 [driver-heartbeater] WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded
Does anyone have any idea about the problem?
I have set the spark parameters as follows:
spark.scheduler.listenerbus.eventqueue.capacity 100000000
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 2m
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10000
spark.max.fetch.failures.per.stage 10
spark.rpc.io.serverThreads 64
spark.rpc.askTimeout 600s
spark.driver.memory 32g
spark.executor.memory 32g
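For reference, the per-row loop over vertices.collect can usually be replaced by a single set-based update; a rough, untested sketch that reuses the column names and variables from the code above (the join-and-overwrite logic here is an assumption, not a verified fix):
import org.apache.spark.sql.functions.{lit, when}
// Mark every unvisited neighbour of `start` as visited in one pass,
// instead of rebuilding `vertices` once per collected row on the driver.
val neighbourIds = neighbours.select($"id".as("nid"))
vertices = vertices
  .join(neighbourIds, $"id" === $"nid", "left_outer")
  .withColumn("visited", when($"nid".isNotNull, lit(1)).otherwise($"visited"))
  .withColumn("qOrder", when($"nid".isNotNull, lit(priorityCounter)).otherwise($"qOrder"))
  .drop("nid")
  .cache()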
We are using Apache Spark 2.1.1 to generate some daily reports. These reports are generated from some daily data, which we persist before running a report on each unit separately and unioning them all together. Here is a simplified version of what we are doing:
def unitReport(d: Date, df: DataFrame, u: String): DataFrame = ... // Builds a report based on unit `u`
val date: Date = ... // Date to run the report
val dailyData: DataFrame = someDailyData.persist() // Daily data
val units: Seq[String] = Seq("Unit_A", "Unit_B", "Unit_C")
val report: DataFrame =
units.map(unitReport(date, dailyData, _)) // Report for each unit.
.reduce((a, b) => a.union(b)) // Join all the units together.
After this, we write the report to HDFS as a csv, concatenate the parts together, and email the report out.
We've started to have problems with the largest of these reports, which runs on about fifty units. We keep upping the max result size (now at 10G) as well as driver memory, and we keep hitting it. The confusing things here are that (a) we aren't ever pulling results back to the driver and (b) the final output report only takes up 145k and 1298 lines in CSV form, so why are we blowing past 8G of maxResultSize? We feel like there is something we don't understand about how Spark manages memory, what exactly is included in resultSize, and what gets sent back to the driver, but we have had a hard time finding any explanation or documentation. Here is a snippet of the final stages of the report, right before it starts running out of memory, to give you an idea of the report's complexity:
[Stage 2297:===========================================> (4822 + 412) / 5316]
[Stage 2297:===========================================> (4848 + 394) / 5316]
[Stage 2297:============================================> (4877 + 370) / 5316]
[Stage 2297:============================================> (4909 + 343) / 5316]
[Stage 2297:============================================> (4944 + 311) / 5316]
[Stage 2297:============================================> (4964 + 293) / 5316]
[Stage 2297:============================================> (4980 + 278) / 5316]
[Stage 2297:=============================================> (4996 + 266) / 5316]
[Stage 2297:=============================================> (5018 + 246) / 5316]
We have found what we think to be a similar memory effect with the following code:
import org.apache.spark.mllib.random.RandomRDDs._
val df = normalRDD(sc, 1000000000L, 1000000).toDF()
df.filter($"value" > 0.9).count()
While this code only returns a simple count, we eventually hit this out-of-memory error on the driver:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:174)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:45)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
When we monitor the logs on the driver, we find that it is doing full garbage collections CONSTANTLY with the overall memory creeping up:
2.095: [GC [PSYoungGen: 64512K->8399K(74752K)] 64512K->8407K(244224K), 0.0289150 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
3.989: [GC [PSYoungGen: 72911K->10235K(139264K)] 72919K->10709K(308736K), 0.0257280 secs] [Times: user=0.04 sys=0.02, real=0.02 secs]
5.936: [GC [PSYoungGen: 139259K->10231K(139264K)] 139733K->67362K(308736K), 0.0741340 secs] [Times: user=0.40 sys=0.12, real=0.07 secs]
10.842: [GC [PSYoungGen: 139255K->10231K(268288K)] 196386K->86311K(437760K), 0.0678030 secs] [Times: user=0.28 sys=0.07, real=0.07 secs]
19.282: [GC [PSYoungGen: 268279K->10236K(268288K)] 344359K->122829K(437760K), 0.0642890 secs] [Times: user=0.32 sys=0.10, real=0.06 secs]
22.981: [GC [PSYoungGen: 268284K->30989K(289792K)] 380877K->143582K(459264K), 0.0811960 secs] [Times: user=0.20 sys=0.07, real=0.08 secs]
Does anyone have any ideas what is going on? Any explanation or pointers to documentation would be greatly appreciated.
It's hard to know for sure, but I'm guessing this has to do with the total number of partitions in the DataFrame that results from the reduction, and that number is potentially larger the more units you have, because the number of partitions in a.union(b) is the sum of a's and b's partition counts.
While data isn't stored on or sent to the driver, the driver does manage objects representing all the partitions and the tasks assigned to each one of them; if your DataFrame ends up with millions of partitions, the driver will create (and then collect using GC) millions of objects.
So, try changing the union operation to include a coalesce operation that limits the total number of partitions:
val MaxParts = dailyData.rdd.partitions.length * 2 // or anything, but something reasonable
val report: DataFrame =
units.map(unitReport(date, dailyData, _))
.reduce((a, b) => a.union(b).coalesce(MaxParts))
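A quick way to confirm this diagnosis is to look at how many partitions the per-unit reports and the final reduced DataFrame actually end up with; a small sketch reusing the names from the snippets above:
// Partition counts of the per-unit reports and of the final union (illustrative only).
val perUnitParts = units.map(unitReport(date, dailyData, _)).map(_.rdd.getNumPartitions)
println(s"per-unit partition counts: $perUnitParts, summed: ${perUnitParts.sum}")
println(s"final report partitions: ${report.rdd.getNumPartitions}")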
I am new to Scala and Spark. I am trying to join two RDDs coming from two different text files. In each text file there are two columns separated by a tab, e.g.
text1 text2
100772C111 ion 200772C222 ion
100772C111 on 200772C222 gon
100772C111 n 200772C2 n
So I want to join these two files on their second columns and get a result like the one below, meaning that there are 2 common terms for those two documents:
((100772C111-200772C222,2))
My computer's features:
4 x Intel(R) Core(TM) i5-2430M CPU @ 2.40 GHz
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}
object hw {
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
  .map { line => val parts = line.split("\t")
    (parts(5), parts(0)) }
val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
.map{ line2 => val parts = line2.split("\t")
((parts(3)),parts(1)) }
val finalemp = emp_new.distinct().join(emp.distinct())
.map{case((nk1), ((parts1),(val1))) => (parts1 + "-" + val1, 1)}
.reduceByKey((a, b) => a + b)
finalemp.foreach(println)
}
}
This code gives what I want when I try it with smaller text files. However, I want to run this script on big text files: one text file with a size of 110 KB (approx. 4M rows) and another one of 9 gigabytes (more than 1B rows).
When I run my script employing these two text files, I observed on the log screen following:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files locally? After waiting more than 3 hours, the program was still spilling data to disk.
To sum up, is there something I need to change in my code to cope with the performance issues?
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with a very small memory allocation. It won't use as much memory as it can eat like, say, an RDBMS would; you need to tell it how much you want it to use.
The default is (I believe) one executor per node, and 512MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and each executor on the SparkUI, which (by default) is at http://localhost:4040
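If you run the job with spark-submit instead of spark-shell, the same flags apply; for example (the class name matches the question, the jar path is a placeholder):
spark-submit --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3 --class hw target/scala-2.10/simple-application.jar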
The Spark code below does not appear to perform any operation on the file example.txt:
val conf = new org.apache.spark.SparkConf()
.setMaster("local")
.setAppName("filter")
.setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
.set("spark.executor.memory", "2g");
val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("C:\\example.txt")
dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
I'm attempting to print the first 10 elements of the file using dataFile.print().
Some of the generated output:
15/03/12 12:23:53 INFO JobScheduler: Started JobScheduler
15/03/12 12:23:54 INFO FileInputDStream: Finding new files took 105 ms
15/03/12 12:23:54 INFO FileInputDStream: New files at time 1426163034000 ms:
15/03/12 12:23:54 INFO JobScheduler: Added jobs for time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Starting job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
-------------------------------------------
Time: 1426163034000 ms
-------------------------------------------
15/03/12 12:23:54 INFO JobScheduler: Finished job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Total delay: 0.157 s for time 1426163034000 ms (execution: 0.006 s)
15/03/12 12:23:54 INFO FileInputDStream: Cleared 0 old files that were older than 1426162974000 ms:
15/03/12 12:23:54 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:55 INFO FileInputDStream: Finding new files took 2 ms
15/03/12 12:23:55 INFO FileInputDStream: New files at time 1426163035000 ms:
15/03/12 12:23:55 INFO JobScheduler: Added jobs for time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Starting job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
-------------------------------------------
Time: 1426163035000 ms
-------------------------------------------
15/03/12 12:23:55 INFO JobScheduler: Finished job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Total delay: 0.011 s for time 1426163035000 ms (execution: 0.001 s)
15/03/12 12:23:55 INFO MappedRDD: Removing RDD 1 from persistence list
15/03/12 12:23:55 INFO BlockManager: Removing RDD 1
15/03/12 12:23:55 INFO FileInputDStream: Cleared 0 old files that were older than 1426162975000 ms:
15/03/12 12:23:55 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:56 INFO FileInputDStream: Finding new files took 3 ms
15/03/12 12:23:56 INFO FileInputDStream: New files at time 1426163036000 ms:
example.txt is of the format:
gdaeicjdcg,194,155,98,107
jhbcfbdigg,73,20,122,172
ahdjfgccgd,28,47,40,178
afeidjjcef,105,164,37,53
afeiccfdeg,29,197,128,85
aegddbbcii,58,126,89,28
fjfdbfaeid,80,89,180,82
As the print documentation states :
/**
* Print the first ten elements of each RDD generated in this DStream. This is an output
* operator, so this DStream will be registered as an output stream and there materialized.
*/
Does this mean 0 RDDs have been generated for this stream? With Apache Spark, if I wanted to see the contents of an RDD I would use its collect function. Is there a similar method for streams? In short, how do I print the contents of a stream to the console?
Update:
Updated code based on @0x0FFF's comment. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html does not appear to give an example of reading from the local file system. Is this not as common as with Spark core, where there are explicit examples of reading data from a file?
Here is updated code :
val conf = new org.apache.spark.SparkConf()
.setMaster("local[2]")
.setAppName("filter")
.setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
.set("spark.executor.memory", "2g");
val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("file:///c:/data/")
dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
But the output is the same. When I add new files to the c:\\data dir (which have the same format as the existing data files), they are not processed. I assume dataFile.print should print the first 10 lines to the console?
Update 2:
Perhaps this is related to the fact that I'm running this code in a Windows environment?
You misunderstood the use of textFileStream. Here is its description from the Spark documentation:
Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat).
So first, you should pass it a directory, and second, that directory should be accessible from the node running the receiver, so it is better to use HDFS for this purpose. Then, when you put a new file into the directory, it will be processed by print() and its first 10 lines will be printed.
Update:
My code:
[alex#sparkdemo tmp]$ pyspark --master local[2]
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
s15/03/12 06:37:49 WARN Utils: Your hostname, sparkdemo resolves to a loopback address: 127.0.0.1; using 192.168.208.133 instead (on interface eth0)
15/03/12 06:37:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Python version 2.6.6 (r266:84292, Nov 22 2013 12:16:22)
SparkContext available as sc.
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 30)
>>> dataFile = ssc.textFileStream('file:///tmp')
>>> dataFile.pprint()
>>> ssc.start()
>>> ssc.awaitTermination()
-------------------------------------------
Time: 2015-03-12 06:40:30
-------------------------------------------
-------------------------------------------
Time: 2015-03-12 06:41:00
-------------------------------------------
-------------------------------------------
Time: 2015-03-12 06:41:30
-------------------------------------------
1 2 3
4 5 6
7 8 9
-------------------------------------------
Time: 2015-03-12 06:42:00
-------------------------------------------
Here is a custom receiver I wrote that listens for data in a specified directory:
package receivers
import java.io.File
import org.apache.spark.{ SparkConf, Logging }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.receiver.Receiver
class CustomReceiver(dir: String)
extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
def onStart() {
// Start the thread that receives data over a connection
new Thread("File Receiver") {
override def run() { receive() }
}.start()
}
def onStop() {
// There is nothing much to do as the thread calling receive()
// is designed to stop by itself if isStopped() returns false
}
def recursiveListFiles(f: File): Array[File] = {
val these = f.listFiles
these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}
private def receive() {
for (f <- recursiveListFiles(new File(dir))) {
val source = scala.io.Source.fromFile(f)
val lines = source.getLines
store(lines)
source.close()
logInfo("Stopped receiving")
restart("Trying to connect again")
}
}
}
One thing to be aware of is that the files need to be processed in a time that is <= the configured batchDuration. In the example below it is set to 10 seconds, but if the time the receiver takes to process the files exceeds 10 seconds, some data files will not be processed. I'm open to correction on this point.
Here is how the custom receiver is implemented :
val conf = new org.apache.spark.SparkConf()
.setMaster("local[2]")
.setAppName("filter")
.setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
.set("spark.executor.memory", "2g");
val ssc = new StreamingContext(conf, Seconds(10))
val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomReceiver("C:\\data\\"))
customReceiverStream.print
customReceiverStream.foreachRDD(m => {
println("size is " + m.collect.size)
})
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
More info at :
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html &
https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
I probably found your issue; you should have this in your log:
WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
The problem is that you need at least 2 cores to run a Spark Streaming app.
So the solution should be to simply replace:
val conf = new org.apache.spark.SparkConf()
.setMaster("local")
with:
val conf = new org.apache.spark.SparkConf()
.setMaster("local[*]")
Or at least with more than one core.
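Putting it together with the rest of the original conf gives something like this sketch:
val conf = new org.apache.spark.SparkConf()
  .setMaster("local[*]")   // at least 2 threads: one for the receiver, one for processing
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g")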