Spark 1.1: saving RDD in HDFS with saveAsTextFile - eclipse

I get the following error
Exception in thread "main" java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/linkage/out1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:180)
when launching the following command
spark-submit --class spark00.DataAnalysis1 --master local sproject1.jar linkage linkage/out1
The two last arguments (linkage and linkage/out1) are HDFS directories, the first contains several CSV files, the second doesn't exist, I assume that it will be automatically created.
The following code has been tested successfully with REPL (Spark 1.1, Scala 2.10.4), except of course the saveAsTextFile() part. I've followed the step-by-step method explained in O'Reilly's "Advanced Analytics with Spark" book.
Since it worked on REPL, I wanted to transpose this into a JAR file using Eclipse Juno, with the following code.
package spark00
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object DataAnalysis1 {
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)
def isHeader(line:String) = line.contains("id_1")
def toDouble(s:String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line:String) = {
val pieces = line.split(",")
val id1 = pieces(0).toInt
val id2 = pieces(1).toInt
val scores = pieces.slice(2, 11).map(toDouble)
val matched = pieces(11).toBoolean
MatchData(id1, id2, scores, matched)
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("DataAnalysis1")
val sc = new SparkContext(conf)
// Load our input data.
val rawblocks = sc.textFile(args(0))
// CLEAN-UP
// a. calling !isHeader(): suppress header
val noheader = rawblocks.filter(!isHeader(_))
// b. calling parse(): setting feature types and renaming headers
val parsed = noheader.map(line => parse(line))
// EXPORT CLEAN FILE
parsed.coalesce(1,true).saveAsTextFile(args(1))
}
}
As you can see args(0) should be "linkage" directory, and args(1) is actually the output HDFS directory linkage/out1 based on my spark-submit command above.
I've also tried the last line without coalesce(1,true)
Here's the official RDD type for parsed
parsed: org.apache.spark.rdd.RDD[(Int, Int, Array[Double], Boolean)] = MappedRDD[3] at map at <console>:34
Thank you in advance for your support
Nov 20th: I'm adding this simple Wordcount code that works well when running the spark-submit command the same way as for the code above. Thus, my question will be: why the saveAsTextFile() worked for this one and not the for other code ?
object SpWordCount {
def main(args: Array[String]) {
// Create a Scala Spark Context.
val conf = new SparkConf().setMaster("local").setAppName("wordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile(args(0))
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into word and count.
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(args(1))
}
}

Related

Getting error while saving PairRdd in Spark Stream [duplicate]

This question already has an answer here:
Custom partiotioning of JavaDStreamPairRDD
(1 answer)
Closed 4 years ago.
I am trying to save my Pair Rdd in spark streaming but getting error while saving at last step .
Here is my sample code
def main(args: Array[String]) {
val inputPath = args(0)
val output = args(1)
val noOfHashPartitioner = args(2).toInt
println("IN Streaming ")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration;
//hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val ssc = new org.apache.spark.streaming.StreamingContext(sc, Seconds(60))
val input = ssc.textFileStream(inputPath)
val pairedRDD = input.map(row => {
val split = row.split("\\|")
val fileName = split(0)
val fileContent = split(1)
(fileName, fileContent)
})
import org.apache.hadoop.io.NullWritable
import org.apache.spark.HashPartitioner
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}
//print(pairedRDD)
pairedRDD.partitionBy(new HashPartitioner(noOfHashPartitioner)).saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
I am getting at last step while saving .I am new to spark streaming so must be missing something here .
Getting error like
value partitionBy is not a member of
org.apache.spark.streaming.dstream.DStream[(String, String)]
Please help
pairedRDD is of type DStream[(String, String)] not RDD[(String,String)]. The method partitionBy is not available on DStreams.
Maybe look into foreachRDD which should be available on DStreams.
EDIT: A bit more context explanation textFileStream will set up a directory watch on the specified path and whenever there are new files will stream the content. so that's where the stream aspect comes from. Is that what you want? or do you just want to read the content of the directory "as is" once? Then there's readTextFiles which will return a non-stream container.

I am getting error in eclipse while running Spark WordCount in scala [duplicate]

This question already has answers here:
Scala project won't compile in Eclipse; "Could not find the main class."
(12 answers)
Closed 5 years ago.
My Scala program:
import org.apache.spark._
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]) {
val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("wordCount")
// Create a Scala Spark Context.
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile(inputFile)
// Split up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into word and count.
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(outputFile)
}
}
Error I am getting :
Error: Could not find or load main class WordCount
You have not set property of master.
val conf = new SparkConf().setAppName("wordCount").setMaster(local[*])
I think is probably something with the classpath, not correctly set up.
If you're using Intellij, mark that directory as source root, could do the trick.

Spark-submit cannot access local file system

Really simple Scala code files at the first count() method call.
def main(args: Array[String]) {
// create Spark context with Spark configuration
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count"))
val fileList = recursiveListFiles(new File("C:/data")).filter(_.isFile).map(file => file.getName())
val filesRDD = sc.parallelize(fileList)
val linesRDD = sc.textFile("file:///temp/dataset.txt")
val lines = linesRDD.count()
val files = filesRDD.count()
}
I don't want to set up a HDFS installation for this right now. How do I configure Spark to use the local file system? This works with spark-shell.
To read the file from local filesystem(From Windows directory) you need to use below pattern.
val fileRDD = sc.textFile("C:\\Users\\Sandeep\\Documents\\test\\test.txt");
Please see below sample working program to read data from local file system.
package com.scala.example
import org.apache.spark._
object Test extends Serializable {
val conf = new SparkConf().setAppName("read local file")
conf.set("spark.executor.memory", "100M")
conf.setMaster("local");
val sc = new SparkContext(conf)
val input = "C:\\Users\\Sandeep\\Documents\\test\\test.txt"
def main(args: Array[String]): Unit = {
val fileRDD = sc.textFile(input);
val counts = fileRDD.flatMap(line => line.split(","))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.collect().foreach(println)
//Stop the Spark context
sc.stop
}
}
val sc = new SparkContext(new SparkConf().setAppName("Spark File
Count")).setMaster("local[8]")
might help

Spark scala input/output directories

I am new to Spark/Scala Programming.I am able to do the set up using the maven and able to run the sample word count program.
I am having 2 questions over here for both running in spark environment/ in Windows local:
1.How the scala program is identifying the input.
2.How to write the output into text file.
Here is my code
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object WordCount {
def main(args: Array[String]) = {
//Start the Spark context
val conf = new SparkConf()
.setAppName("WordCount")
.setMaster("local")
val sc = new SparkContext(conf)
//Read some example file to a test RDD
val textFile = sc.textFile("file:/home/root1/Avinash/data.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.foreach(println)
counts.collect()
counts.saveAsTextFile("file:/home/root1/Avinash/output")
}
}
When I place the file in file:/home/root1/Avinash/data.txt and try to run it didnt work.Only when i place the data.txt in /home/root1/softs/spark-1.6.1/bin or inside the project folder in workspace it is trying to take the input.
Similarly, when I am trying to write into output using counts.saveAsTextFile("file:/home/root1/Avinash/output"), it is not writing and instead it is throwing the error as
Exception in thread "main" java.io.IOException: No FileSystem for scheme: D.
Please help me in resolving this!!.
you suppose to use /// on file. this is an example
val textFile = sc.textFile("file:///home/root1/Avinash/data.txt")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _).cache()
counts.foreach(println)
//counts.collect()
counts.saveAsTextFile("file:///home/root1/Avinash/output")
use cache to avoid compute every time you are doing action on RDD if the file is big

Why saveAsTextFile do nothing?

I'm trying to implement simple WordCount in Scala + Spark. Here is my code
object FirstObject {
def main(args: Array[String]) {
val input = "/Data/input"
val conf = new SparkConf().setAppName("Simple Application")
.setMaster("spark://192.168.1.162:7077")
val sparkContext = new SparkContext(conf)
val text = sparkContext.textFile(input).cache()
val wordCounts = text.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey((a,b) => a+b)
.sortByKey()
wordCounts.saveAsTextFile("/Data/output")
}
This job is working for 54s, and finally do nothing. Is is not writing output to /Data/output
Also if I replace saveAsTextFile with forEach(println) it is produce desired output.
You should check your user rights for /data/output folder.
This folder should have writing rights for your specific user.