Out of memory issue while using the multipart upload API of AWS S3 - Scala

I am trying to do an AWS multipart upload using the AWS SDK and Spark; the file size is around 14 GB, but I am getting an out-of-memory error. It fails at this line - val bytes: Array[Byte] = IOUtils.toByteArray(is)
I have tried bumping driver memory and executor memory up to 100 GB and tried a few other Spark optimizations.
Below is the code I am trying with:
val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
val fs = FileSystem.get(new Configuration())
val filePath = new Path(hdfsFilePath)
val is:InputStream = fs.open(filePath)
val om = new ObjectMetadata()
val bytes: Array[Byte] = IOUtils.toByteArray(is)
om.setContentLength(bytes.length)
val byteArrayInputStream: ByteArrayInputStream = new ByteArrayInputStream(bytes)
val request = new PutObjectRequest(bucketName, keyName, byteArrayInputStream, om).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey)).withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
val upload = tm.upload(request)
And this is the exception I am getting:
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at com.amazonaws.util.IOUtils.toByteArray(IOUtils.java:45)

PutObjectRequest accepts File:
public PutObjectRequest(String bucketName, String key, File file)
Something like the following should work (I haven't checked though):
val result = TransferManagerBuilder.standard
  .withS3Client(s3Client)
  .build
  .upload(
    new PutObjectRequest(
      bucketName,
      keyName,
      new File(hdfsFilePath) // java.io.File takes a path string, not a Hadoop Path
    )
      .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
      .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
  )
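If the file has to be uploaded straight from HDFS, another option (a sketch I haven't run, reusing s3Client, hdfsFilePath, bucketName, keyName and kmsKey from the question) is to keep the InputStream-based request but take the content length from the HDFS file status, so the 14 GB file is never materialized as a byte array:

import com.amazonaws.services.s3.model.{CannedAccessControlList, ObjectMetadata, PutObjectRequest, SSEAwsKeyManagementParams}
import com.amazonaws.services.s3.transfer.TransferManagerBuilder
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
val fs = FileSystem.get(new Configuration())
val filePath = new Path(hdfsFilePath)

// Take the length from HDFS metadata instead of buffering the bytes in memory.
val om = new ObjectMetadata()
om.setContentLength(fs.getFileStatus(filePath).getLen)

// Hand the open HDFS stream straight to the TransferManager, which streams it to S3.
val request = new PutObjectRequest(bucketName, keyName, fs.open(filePath), om)
  .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
  .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)

val upload = tm.upload(request)
upload.waitForCompletion()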

Related

java.lang.IllegalArgumentException when creating an HDFS file

I have HDFS and some text, and I want to create a file containing that text. I tried to use the HDFS API and FSDataOutputStream, but I got an exception. Could you please help me resolve it?
The exception is:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/user1, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:513)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:499)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:594)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:448)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:776)
at com.example.FileBuilder$.buildFile(FileBuilder.scala:23)
The code is:
val fs = FileSystem.get(new Configuration())
val path = new Path(s"hdfs:////user/" + "fileName.sql")
val fsDataOutputStream = fs.create(path)
val outputStreamWriter = new OutputStreamWriter(fsDataOutputStream, "UTF-8")
val bufferedWriter = new BufferedWriter(outputStreamWriter)
bufferedWriter.write(data)
bufferedWriter.close()
outputStreamWriter.close()
fsDataOutputStream.close()
I think there is some problem with the file path. Can you test by replacing the portion below in your code?
val configuration = new Configuration()
val fs = FileSystem.get(new URI(<url:port>), configuration)
val filePath = new Path("/user/fileName.sql")
val fsDataOutputStream = fs.create(filePath)
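For completeness, a self-contained Scala sketch of that suggestion (the NameNode URI below is a placeholder, and data is assumed to hold the text from the question): handing FileSystem.get an explicit hdfs:// URI makes it return the distributed filesystem instead of the local one, which is what the "expected: file:///" message is complaining about.

import java.io.{BufferedWriter, OutputStreamWriter}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val configuration = new Configuration()
// Placeholder URI: replace with your cluster's fs.defaultFS value.
val fs = FileSystem.get(new URI("hdfs://namenode-host:8020"), configuration)
val filePath = new Path("/user/fileName.sql")

val fsDataOutputStream = fs.create(filePath)
val bufferedWriter = new BufferedWriter(new OutputStreamWriter(fsDataOutputStream, "UTF-8"))
try {
  bufferedWriter.write(data) // `data` holds the text to write, as in the question
} finally {
  bufferedWriter.close() // closing the writer also flushes and closes the underlying stream
}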

Writing a file to S3 using FileSystem (Scala)

I'm using Scala and trying to write a file with string content to S3.
I've tried to do that with FileSystem, but I'm getting this error:
"Wrong FS: s3a"
val content = "blabla"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val s3Path: Path = new Path("s3a://bucket/ha/fileTest.txt")
val localPath= new Path("/tmp/fileTest.txt")
val os = fs.create(localPath)
os.write(content.getBytes)
fs.copyFromLocalFile(localPath,s3Path)
and I'm getting an error:
java.lang.IllegalArgumentException: Wrong FS: s3a://...txt, expected: file:///
What is wrong?
Thanks!!
You need to ask for the specific filesystem for that scheme; then you can create a text file directly on the remote system.
val s3Path: Path = new Path("s3a://bucket/ha/fileTest.txt")
val fs = s3Path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val os = fs.create(s3Path, true)
os.write("hi".getBytes)
os.close
There's no need to write locally and upload; the s3a connector will buffer and upload as needed.
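If the content is longer than a short string, the same output stream can be wrapped in a writer; a usage sketch reusing fs, s3Path and content from above:

import java.io.{BufferedWriter, OutputStreamWriter}

val out = fs.create(s3Path, true) // overwrite if the object already exists
val writer = new BufferedWriter(new OutputStreamWriter(out, "UTF-8"))
try {
  writer.write(content)
} finally {
  writer.close() // flushing/closing is what triggers the s3a upload
}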

Read data from HDFS

I'm using FSDataInputStream to read data from HDFS.
The following is the snippet I'm using:
val fs = FileSystem.get(new java.net.URI(#HDFS_URI),new Configuration())
val stream = fs.open(new Path(#PATH))
val reader = new BufferedReader(new InputStreamReader(stream))
val offset: String = reader.readLine() // reads the string "5432" stored in the file
The expected output is "5432", but the actual output is "^#5^#4^#3^#2".
I'm not able to trim the "^#" sequences since they are not treated as regular characters. Please help with an appropriate solution.
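The null-looking "^#" characters interleaved between the digits are what you would see if the file were encoded as UTF-16 but read back with the platform default charset; that is only a guess from the symptom, not something stated here. If so, a sketch passing an explicit charset to InputStreamReader (reusing the stream opened above):

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

// Assumption: the file was written as UTF-16, so decode it explicitly
// instead of relying on the JVM's default charset.
val reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_16))
val offset: String = reader.readLine()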

Error reading S3 bucket in Spark

I'm getting an exception when trying to read files from S3 with Spark. The error and code are given below. The folder consists of a number of files named part-00000, part-00001, etc., output from Hadoop. They range in size from 0 KB to several GB.
16/04/07 15:38:58 INFO NativeS3FileSystem: Opening key 'titlematching214/1.0/bypublicdemand/part-00000' for reading at position '0'
16/04/07 15:38:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/titlematching214%2F1.0%2Fbypublicdemand%2Fpart-00000' XML Error Message: InvalidRange - The requested range is not satisfiable - bytes=0-0 - 1AED523DF401F17ECBYUH1h3WkC7/g8/EFE/YyHbzxoNTpRBiX6QMy2RXHur17lYTZXd7XxOWivmqIpu0F7Xx5zdWns=
object ReadMatches extends App {
  override def main(args: Array[String]): Unit = {
    val config = new SparkConf().setAppName("RunAll").setMaster("local")
    val sc = new SparkContext(config)
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")
    hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myRealKeyId")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "realKey")

    val sqlContext = new SQLContext(sc)
    val dataset = sc.textFile("s3n://altvissparkoutput/titlematching214/1.0/*/*")
    val ebayRaw = sqlContext.read.json(dataset)
    val data = ebayRaw.first()
  }
}
Maybe you can read your dataset straight from S3:
val dataset = "s3n://altvissparkoutput/titlematching214/1.0/*/*"
val ebayRaw = sqlContext.read.json(dataset)

Reading an Avro file from Scala

I'm trying to read an Avro file using Scala.
I've extracted the file's schema using avro-tools and saved it to a file; I then try to read it using the following code:
val zibi= scala.io.Source.fromFile("/home/wasabi/schema").mkString
val schema_obj = new Schema.Parser
val schema2 = schema_obj.parse(zibi)
val READER2 = new GenericDatumReader[GenericRecord](schema2)
val myFile = Files.readAllBytes(Paths.get("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro"))
val datum = READER2.read(null, DecoderFactory.defaultFactory.createBinaryDecoder(myFile,null))
But I keep hitting IOExceptions like this:
java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:444)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:159)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:219)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
When I read the file through avro-tools it reads just fine.
What am I doing wrong?
Try using a DataFileReader instead of a BinaryDecoder.
While encoders/decoders are used for writing and reading raw Avro data, I suspect they are choking on the header info found in Avro data files.
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

val zibi = scala.io.Source.fromFile("/home/wasabi/schema").mkString
val schema_obj = new Schema.Parser
val schema2 = schema_obj.parse(zibi)
val READER2 = new GenericDatumReader[GenericRecord](schema2)

val myFile = new File("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro")
val dataFileReader = new DataFileReader[GenericRecord](myFile, READER2)
val datum = dataFileReader.next()
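To read the whole file rather than only the first record, the reader can be iterated and then closed; a small usage sketch:

// DataFileReader is an iterator over the records in the container file.
while (dataFileReader.hasNext) {
  val record: GenericRecord = dataFileReader.next()
  println(record)
}
dataFileReader.close()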