textFileStream in Spark (Scala)

I have many text files in a local directory. I want a Spark program to read all the files and store them in a database. For the moment, I am trying to read the files using a text file stream, but it is not working.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
/**
* Main Program
*/
object SparkMain extends App {
// Create the Spark configuration
val sparkConf: SparkConf =
new SparkConf()
.setMaster("local")
.setAppName("TestProgram")
// Create a Spark Streaming context with a batch interval of 2 seconds
val ssc: StreamingContext =
new StreamingContext(sparkConf, Seconds(2))
// Create text file stream
val sourceDir: String = "D:\\tmpDir"
val stream: DStream[String] = ssc.textFileStream(sourceDir)
case class TextLine(line: String)
val lineRdd: DStream[TextLine] = stream.map(TextLine)
lineRdd.foreachRDD(rdd => {
rdd.foreach(println)
})
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
}
Input:
//1.txt
Hello World
Nothing is printed when I run the streaming job. What is wrong with it?

textFileStream does not read files that are already present in the directory when the job starts. Start the program, then create a new file in the directory or move a file in from another directory. The following program is a simple word count over a text file stream:
val sourceDir: String = "path to streaming directory"
val stream: DStream[String] = streamingContext.textFileStream(sourceDir)
case class TextLine(line: String)
val lineRdd: DStream[TextLine] = stream.map(TextLine)
lineRdd.foreachRDD(rdd => {
val words = rdd.flatMap(textLine => textLine.line.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
println("=====================")
wordCounts.foreach(println)
println("=====================" + rdd.count())
})
The output should look something like this:
+++++++++++++++++++++++
=====================0
+++++++++++++++++++++++
(are,1)
(you,1)
(how,1)
(hello,1)
(doing,1)
=====================5
+++++++++++++++++++++++
=====================0
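Since only files that appear after the stream has started are picked up, one way to feed the watched directory is to write each file somewhere else first and then move it in with a single atomic step, so the stream never sees a half-written file. Below is a minimal sketch using only java.nio, assuming the staging directory is on the same filesystem as the watched one. The names FilePublisher and publish and the directory names are hypothetical, not part of Spark.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object FilePublisher {
  /** Writes `lines` to a file in `stagingDir`, then moves it into `watchedDir` in one step. */
  def publish(stagingDir: Path, watchedDir: Path, name: String, lines: Seq[String]): Path = {
    Files.createDirectories(stagingDir)
    Files.createDirectories(watchedDir)
    val staged = stagingDir.resolve(name)
    Files.write(staged, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    // ATOMIC_MOVE throws if the move cannot be done atomically (for example across filesystems)
    Files.move(staged, watchedDir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical directories under a temp folder; in practice watchedDir is the streaming directory
    val root = Files.createTempDirectory("stream-demo")
    val target = publish(root.resolve("staging"), root.resolve("watched"), "1.txt", Seq("Hello World"))
    println(s"Published $target")
  }
}
```

With the streaming job already running, calling publish with the streaming directory as watchedDir should make the new file show up in a later batch.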
I hope this helps!

Related

Streaming from HDFS folder

I am trying to implement a Scala + Spark solution that streams word counts for new files arriving in an HDFS folder, like this:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
object HdfsWordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
val sparkConf = new SparkConf().setAppName("HdfsWordCount")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
I tried running HdfsWordCount.main(Array('hdfs:///user/cloudera/sparkStreaming/')) in spark-shell,
but it just shows a | prompt and waits for me to type. Am I doing something wrong?
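One likely cause, judging only from the command shown: the argument is wrapped in single quotes. In Scala, single quotes do not delimit strings, and the // inside the URI then starts a line comment that hides the closing parentheses, so the shell keeps waiting for more input (the | prompt). A small sketch of the same argument as a double-quoted String follows. QuotingExample is a hypothetical name, and the HdfsWordCount call is left as a comment so that the snippet runs without Spark.

```scala
object QuotingExample {
  // Double quotes make a String literal, so the "//" inside the URI is plain text
  val streamArgs: Array[String] = Array("hdfs:///user/cloudera/sparkStreaming/")

  def main(args: Array[String]): Unit = {
    // In spark-shell the call would then be: HdfsWordCount.main(streamArgs)
    println(streamArgs.mkString(", "))
  }
}
```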

Getting error while saving PairRdd in Spark Stream [duplicate]

I am trying to save my pair RDD in Spark Streaming, but I get an error at the last step, while saving.
Here is my sample code:
def main(args: Array[String]) {
val inputPath = args(0)
val output = args(1)
val noOfHashPartitioner = args(2).toInt
println("IN Streaming ")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration;
//hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val ssc = new org.apache.spark.streaming.StreamingContext(sc, Seconds(60))
val input = ssc.textFileStream(inputPath)
val pairedRDD = input.map(row => {
val split = row.split("\\|")
val fileName = split(0)
val fileContent = split(1)
(fileName, fileContent)
})
import org.apache.hadoop.io.NullWritable
import org.apache.spark.HashPartitioner
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}
//print(pairedRDD)
pairedRDD.partitionBy(new HashPartitioner(noOfHashPartitioner)).saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
I am getting an error at the last step, while saving. I am new to Spark Streaming, so I must be missing something here.
The error is:
value partitionBy is not a member of
org.apache.spark.streaming.dstream.DStream[(String, String)]
Please help
pairedRDD is of type DStream[(String, String)], not RDD[(String, String)]. The method partitionBy is not available on DStreams.
Look into foreachRDD, which is available on DStreams and hands you each batch as an RDD, on which partitionBy and saveAsHadoopFile can be called.
EDIT: A bit more context. textFileStream sets up a directory watch on the specified path and streams the content of new files whenever they appear; that is where the stream aspect comes from. Is that what you want? Or do you just want to read the content of the directory "as is" once? For that there is sc.textFile (or sc.wholeTextFiles), which returns a plain RDD instead of a stream.

How to create a stop condition on Spark streaming?

I want to use spark streaming for reading data from the HDFS. The idea is that another program will keep on uploading new files to an HDFS directory, which my spark streaming job would process. However, I also want to have an end condition. That is, a way in which the program uploading files to the HDFS can signal the spark streaming program, that it is done uploading all the files.
For a simple example, take the program from Here. The code is shown below. Assuming another program is uploading those files, how can that program signal the end condition to the Spark Streaming program programmatically (without requiring us to press CTRL+C)?
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage StreamingWordCount <input-directory> <output-directory>")
System.exit(0)
}
val inputDir=args(0)
val output=args(1)
val conf = new SparkConf().setAppName("Spark Streaming Example")
val streamingContext = new StreamingContext(conf, Seconds(10))
val lines = streamingContext.textFileStream(inputDir)
val words = lines.flatMap(_.split(" "))
val wc = words.map(x => (x, 1))
wc.foreachRDD(rdd => {
val counts = rdd.reduceByKey((x, y) => x + y)
counts.saveAsTextFile(output)
val collectedCounts = counts.collect
collectedCounts.foreach(c => println(c))
}
)
println("StreamingWordCount: streamingContext start")
streamingContext.start()
println("StreamingWordCount: await termination")
streamingContext.awaitTermination()
println("StreamingWordCount: done!")
}
}
OK, I got it. Basically you create another thread from where you call ssc.stop(), to signal the stream processing to stop. For example, like this.
val ssc = new StreamingContext(sparkConf, Seconds(1))
//////////////////////////////////////////////////////////////////////
val thread = new Thread {
override def run(): Unit = {
....
// On reaching the end condition
ssc.stop()
}
}
thread.start()
//////////////////////////////////////////////////////////////////////
val lines = ssc.textFileStream("inputDir")
.....
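One way the uploading program could signal the end condition is by writing a marker file once it has finished, which the stop thread polls for. The following is a sketch under that assumption, not something the answer above prescribes. StopSignal, onMarker, the marker name _DONE, and the poll interval are all hypothetical, and the callback is where ssc.stop() would go.

```scala
import java.nio.file.{Files, Path}

object StopSignal {
  /** Starts a daemon thread that polls for `marker` and runs `onDone` once it exists. */
  def onMarker(marker: Path, pollMillis: Long)(onDone: => Unit): Thread = {
    val thread = new Thread(new Runnable {
      override def run(): Unit = {
        while (!Files.exists(marker)) Thread.sleep(pollMillis)
        onDone
      }
    })
    thread.setDaemon(true)
    thread.start()
    thread
  }

  def main(args: Array[String]): Unit = {
    val marker = Files.createTempDirectory("stop-demo").resolve("_DONE")
    // In the streaming job the callback would be: ssc.stop()
    val watcher = onMarker(marker, pollMillis = 100)(println("End condition reached"))
    Files.createFile(marker) // what the uploading program would do when it is finished
    watcher.join()
  }
}
```

This version checks a local path with java.nio; for a marker on HDFS the existence check would presumably go through the Hadoop FileSystem API instead. Calling ssc.stop(stopSparkContext = true, stopGracefully = true) in the callback lets in-flight batches finish before shutdown.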

Spark-submit cannot access local file system

This really simple Scala code fails at the first count() method call.
def main(args: Array[String]) {
// create Spark context with Spark configuration
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count"))
val fileList = recursiveListFiles(new File("C:/data")).filter(_.isFile).map(file => file.getName())
val filesRDD = sc.parallelize(fileList)
val linesRDD = sc.textFile("file:///temp/dataset.txt")
val lines = linesRDD.count()
val files = filesRDD.count()
}
I don't want to set up a HDFS installation for this right now. How do I configure Spark to use the local file system? This works with spark-shell.
To read a file from the local filesystem (from a Windows directory), use the pattern below.
val fileRDD = sc.textFile("C:\\Users\\Sandeep\\Documents\\test\\test.txt");
Below is a sample working program that reads data from the local file system.
package com.scala.example
import org.apache.spark._
object Test extends Serializable {
val conf = new SparkConf().setAppName("read local file")
conf.set("spark.executor.memory", "100M")
conf.setMaster("local");
val sc = new SparkContext(conf)
val input = "C:\\Users\\Sandeep\\Documents\\test\\test.txt"
def main(args: Array[String]): Unit = {
val fileRDD = sc.textFile(input);
val counts = fileRDD.flatMap(line => line.split(","))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.collect().foreach(println)
//Stop the Spark context
sc.stop
}
}
Setting the master on the SparkConf might help:
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count").setMaster("local[8]"))

Spark scala input/output directories

I am new to Spark/Scala programming. I was able to set up the project with Maven and run the sample word count program.
I have two questions, which apply both to running in a Spark environment and to running locally on Windows:
1. How does the Scala program identify its input?
2. How do I write the output to a text file?
Here is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object WordCount {
def main(args: Array[String]) = {
//Start the Spark context
val conf = new SparkConf()
.setAppName("WordCount")
.setMaster("local")
val sc = new SparkContext(conf)
//Read some example file to a test RDD
val textFile = sc.textFile("file:/home/root1/Avinash/data.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.foreach(println)
counts.collect()
counts.saveAsTextFile("file:/home/root1/Avinash/output")
}
}
When I place the file at file:/home/root1/Avinash/data.txt and try to run the program, it does not work. The input is only picked up when I place data.txt in /home/root1/softs/spark-1.6.1/bin or inside the project folder in the workspace.
Similarly, when I try to write the output using counts.saveAsTextFile("file:/home/root1/Avinash/output"), nothing is written and the following error is thrown instead:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: D.
Please help me resolve this!
You are supposed to use three slashes (file:///) in the file URI. Here is an example:
val textFile = sc.textFile("file:///home/root1/Avinash/data.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _).cache()
counts.foreach(println)
//counts.collect()
counts.saveAsTextFile("file:///home/root1/Avinash/output")
Use cache() to avoid recomputing the RDD every time you run an action on it, which matters if the file is big.
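To avoid typing the URI by hand, the string can also be derived from a local path. Here is a short sketch using java.nio (no Spark required). LocalUri and toFileUri are hypothetical names, and the path is the one from the question.

```scala
import java.nio.file.Paths

object LocalUri {
  /** Converts a local path to an absolute file URI string. */
  def toFileUri(path: String): String =
    Paths.get(path).toAbsolutePath.normalize.toUri.toString

  def main(args: Array[String]): Unit = {
    // On a Unix-like system this yields a URI with three slashes after "file:"
    println(toFileUri("/home/root1/Avinash/data.txt"))
  }
}
```

The result can be passed to sc.textFile or saveAsTextFile. The "No FileSystem for scheme: D" error suggests that a Windows path starting with D: was interpreted as a URI with scheme D, which converting the path to a file URI should avoid.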