Error reading S3 bucket in Spark - Scala

I'm getting an exception when trying to read files from S3 with Spark. The error and code are given below. The folder contains a number of files named part-00000, part-00001, etc., produced by Hadoop, with sizes ranging from 0 KB to several GB.
16/04/07 15:38:58 INFO NativeS3FileSystem: Opening key 'titlematching214/1.0/bypublicdemand/part-00000' for reading at position '0'
16/04/07 15:38:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/titlematching214%2F1.0%2Fbypublicdemand%2Fpart-00000'
XML Error Message: InvalidRange - The requested range is not satisfiable - bytes=0-0
1AED523DF401F17ECBYUH1h3WkC7/g8/EFE/YyHbzxoNTpRBiX6QMy2RXHur17lYTZXd7XxOWivmqIpu0F7Xx5zdWns=
object ReadMatches extends App {
  override def main(args: Array[String]): Unit = {
    val config = new SparkConf().setAppName("RunAll").setMaster("local")
    val sc = new SparkContext(config)

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")
    hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myRealKeyId")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "realKey")

    val sqlConext = new SQLContext(sc)
    val datset = sc.textFile("s3n://altvissparkoutput/titlematching214/1.0/*/*")
    val ebayRaw = sqlConext.read.json(datset)
    val data = ebayRaw.first()
  }
}

Maybe you can read your dataset straight from S3:
val datset = "s3n://altvissparkoutput/titlematching214/1.0/*/*"
val ebayRaw = sqlConext.read.json(datset)
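Wired into the rest of the original setup (identifiers kept from the question; this is just a sketch, not verified against that bucket), the read would look like:
val config = new SparkConf().setAppName("RunAll").setMaster("local")
val sc = new SparkContext(config)
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myRealKeyId")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "realKey")

val sqlConext = new SQLContext(sc)
// Pass the S3 path pattern straight to the JSON reader instead of going through sc.textFile
val ebayRaw = sqlConext.read.json("s3n://altvissparkoutput/titlematching214/1.0/*/*")
val data = ebayRaw.first()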

Related

ERROR SparkContext: Error initializing SparkContext. ERROR Utils: Uncaught exception in thread main

I thought it would work, but it actually fails.
import math._
import org.apache.spark.sql.SparkSession

object Position {
  def main(args: Array[String]): Unit = {
    // create Spark DataFrame with Spark configuration
    val spark = SparkSession.builder().getOrCreate()

    // Read csv with DataFrame
    val file1 = spark.read.csv("file:///home/aaron/Downloads/taxi_gps.txt")
    val file2 = spark.read.csv("file:///home/aaron/Downloads/district.txt")

    // change the name
    val new_file1 = file1.withColumnRenamed("_c4", "lat")
      .withColumnRenamed("_c5", "lon")
    val new_file2 = file2.withColumnRenamed("_c0", "dis")
      .withColumnRenamed("_1", "lat")
      .withColumnRenamed("_2", "lon")
      .withColumnRenamed("_c3", "r")

    // geo code
    def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
      val R = 6372.8 // radius in km
      val dLat = (lat2 - lat1).toRadians
      val dLon = (lon2 - lon1).toRadians
      val a = pow(sin(dLat / 2), 2) + pow(sin(dLon / 2), 2) * cos(lat1.toRadians) * cos(lat2.toRadians)
      val c = 2 * asin(sqrt(a))
      R * c
    }

    // count
    new_file2.foreach(row => {
      val district = row.getAs[Float]("dis")
      val lon = row.getAs[Float]("lon")
      val lat = row.getAs[Float]("lat")
      val distance = row.getAs[Float]("r")
      var temp = 0
      new_file1.foreach(taxi => {
        val taxiLon = taxi.getAs[Float]("lon")
        val taxiLat = taxi.getAs[Float]("lat")
        if (haversine(lat, lon, taxiLat, taxiLon) <= distance) {
          temp += 1
        }
      })
      println(s"district:$district temp=$temp")
    })
  }
}
Here are the results:
20/06/07 23:04:11 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
......
20/06/07 23:04:11 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
......
20/06/07 23:04:11 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
I am not sure whether using a DataFrame inside another DataFrame's foreach is the only mistake in this program.
I am not familiar with Scala and Spark, so this is quite a tough question for me. I hope you can help me, thanks!
Your exception says org.apache.spark.SparkException: A master URL must be set in your configuration, so set the master URL when you build the session.
If you are running the code in an IDE, replace val spark = SparkSession.builder().getOrCreate() with val spark = SparkSession.builder().master("local[*]").getOrCreate() in your code.
Or, if you are executing this code using spark-submit, add --master yarn instead.
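For example, when submitting to a cluster you would leave .master(...) out of the code and pass the master with spark-submit instead (the jar name below is a placeholder, not from the question):
spark-submit --class Position --master yarn position-app.jar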

Out of memory issue while using the multipart upload API of AWS S3

I am trying to do an AWS multipart upload using the AWS SDK from Spark. The file size is around 14 GB, but I am getting an out of memory error. It fails at this line: val bytes: Array[Byte] = IOUtils.toByteArray(is)
I have tried bumping up driver memory and executor memory to 100 G and tried a few other Spark optimizations.
Below is the code I am trying:
val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
val fs = FileSystem.get(new Configuration())
val filePath = new Path(hdfsFilePath)
val is: InputStream = fs.open(filePath)
val om = new ObjectMetadata()
val bytes: Array[Byte] = IOUtils.toByteArray(is)
om.setContentLength(bytes.length)
val byteArrayInputStream: ByteArrayInputStream = new ByteArrayInputStream(bytes)
val request = new PutObjectRequest(bucketName, keyName, byteArrayInputStream, om)
  .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
  .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
val upload = tm.upload(request)
And this is the exception I am getting:
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at com.amazonaws.util.IOUtils.toByteArray(IOUtils.java:45)
PutObjectRequest accepts a File:
public PutObjectRequest(String bucketName, String key, File file)
Something like the following should work (I haven't checked though):
val result = TransferManagerBuilder.standard.withS3Client(s3Client)
  .build
  .upload(
    new PutObjectRequest(
      bucketName,
      keyName,
      new File(hdfsFilePath) // java.io.File only works if this path is on the local filesystem
    )
      .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
      .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
  )
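If the source file lives on HDFS rather than on local disk, a java.io.File cannot reach it. An alternative sketch (untested; it reuses s3Client, bucketName, keyName, kmsKey and hdfsFilePath from the question) is to hand TransferManager the open HDFS stream plus the length taken from the file status, so the 14 GB file is never materialised as a byte array:
import java.io.InputStream
import com.amazonaws.services.s3.model.{CannedAccessControlList, ObjectMetadata, PutObjectRequest, SSEAwsKeyManagementParams}
import com.amazonaws.services.s3.transfer.TransferManagerBuilder
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
val fs = FileSystem.get(new Configuration())
val filePath = new Path(hdfsFilePath)

// Take the content length from the HDFS file status instead of reading all bytes into memory.
val om = new ObjectMetadata()
om.setContentLength(fs.getFileStatus(filePath).getLen)

val is: InputStream = fs.open(filePath)
val request = new PutObjectRequest(bucketName, keyName, is, om)
  .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
  .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)

// TransferManager reads the stream part by part, so only one part should be buffered at a time.
val upload = tm.upload(request)
upload.waitForCompletion()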

Spark MLlib unable to write out to S3: path already exists

I have data in an S3 bucket in the directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The converted lines should go to S3 in the directory /data/spark.
Basically, each string is repeated the number of times that follows the colon. I am trying to convert a VW LDA input file into the corresponding format consumed by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with this exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1, etc.) and ensuring they do not exist before the step is run. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark, so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite enabled:
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or, if you insist on using RDD methods, you can do as described already in this answer.
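A slightly fuller sketch of the DataFrame route (untested; it assumes a SparkSession named spark is in scope, and the implicits import is what makes toDF available on an RDD[String]):
import spark.implicits._

coalescedSparkData.toDF("line")
  .write
  .mode("overwrite")
  .csv(outputPath)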

Spark-submit cannot access local file system

Really simple Scala code fails at the first count() method call.
def main(args: Array[String]) {
  // create Spark context with Spark configuration
  val sc = new SparkContext(new SparkConf().setAppName("Spark File Count"))

  val fileList = recursiveListFiles(new File("C:/data")).filter(_.isFile).map(file => file.getName())
  val filesRDD = sc.parallelize(fileList)

  val linesRDD = sc.textFile("file:///temp/dataset.txt")
  val lines = linesRDD.count()
  val files = filesRDD.count()
}
I don't want to set up an HDFS installation for this right now. How do I configure Spark to use the local file system? This works with spark-shell.
To read a file from the local filesystem (a Windows directory), you need to use the pattern below.
val fileRDD = sc.textFile("C:\\Users\\Sandeep\\Documents\\test\\test.txt");
Please see the sample working program below that reads data from the local file system.
package com.scala.example

import org.apache.spark._

object Test extends Serializable {
  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val input = "C:\\Users\\Sandeep\\Documents\\test\\test.txt"

  def main(args: Array[String]): Unit = {
    val fileRDD = sc.textFile(input)
    val counts = fileRDD.flatMap(line => line.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    // Stop the Spark context
    sc.stop
  }
}
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count").setMaster("local[8]"))
might help (note that setMaster belongs on the SparkConf, not on the SparkContext).
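Equivalently, the master can be supplied on the command line instead of in code (the class and jar names below are placeholders):
spark-submit --class com.example.SparkFileCount --master "local[8]" spark-file-count.jar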

Spark 1.1: saving RDD in HDFS with saveAsTextFile

I get the following error
Exception in thread "main" java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/linkage/out1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:180)
when launching the following command
spark-submit --class spark00.DataAnalysis1 --master local sproject1.jar linkage linkage/out1
The last two arguments (linkage and linkage/out1) are HDFS directories; the first contains several CSV files, and the second doesn't exist yet, so I assume it will be created automatically.
The following code has been tested successfully in the REPL (Spark 1.1, Scala 2.10.4), except of course for the saveAsTextFile() part. I've followed the step-by-step method explained in O'Reilly's "Advanced Analytics with Spark" book.
Since it worked in the REPL, I wanted to transpose it into a JAR file using Eclipse Juno, with the following code.
package spark00

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object DataAnalysis1 {

  case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

  def isHeader(line: String) = line.contains("id_1")

  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }

  def parse(line: String) = {
    val pieces = line.split(",")
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)
    val matched = pieces(11).toBoolean
    MatchData(id1, id2, scores, matched)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("DataAnalysis1")
    val sc = new SparkContext(conf)

    // Load our input data.
    val rawblocks = sc.textFile(args(0))

    // CLEAN-UP
    // a. calling !isHeader(): suppress header
    val noheader = rawblocks.filter(!isHeader(_))
    // b. calling parse(): setting feature types and renaming headers
    val parsed = noheader.map(line => parse(line))

    // EXPORT CLEAN FILE
    parsed.coalesce(1, true).saveAsTextFile(args(1))
  }
}
As you can see, args(0) should be the "linkage" directory, and args(1) is the output HDFS directory linkage/out1, based on my spark-submit command above.
I've also tried the last line without coalesce(1, true).
Here's the actual RDD type of parsed:
parsed: org.apache.spark.rdd.RDD[(Int, Int, Array[Double], Boolean)] = MappedRDD[3] at map at <console>:34
Thank you in advance for your support.
Nov 20th update: I'm adding this simple WordCount code, which works well when run with spark-submit the same way as the code above. So my question becomes: why does saveAsTextFile() work here but not in the other code?
object SpWordCount {
  def main(args: Array[String]) {
    // Create a Scala Spark Context.
    val conf = new SparkConf().setMaster("local").setAppName("wordCount")
    val sc = new SparkContext(conf)

    // Load our input data.
    val input = sc.textFile(args(0))

    // Split it up into words.
    val words = input.flatMap(line => line.split(" "))

    // Transform into word and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }

    // Save the word count back out to a text file, causing evaluation.
    counts.saveAsTextFile(args(1))
  }
}