Getting a 0KB File in S3 when I write the zipInputStream - scala

The following code produces 0 KB files in S3. What is wrong with it?
def extractFilesFromZipStream(zipInputStream: ZipArchiveInputStream,
tgtPath: String, storageType:String): scala.collection.mutable.Map[String, (String, Long)] = {
var filePathMap = scala.collection.mutable.Map[String, (String, Long)]()
Try {
storageWrapper.mkDirs(tgtPath)
Stream.continually(zipInputStream.getNextZipEntry).takeWhile(_ != null).foreach {file =>
val storagePathFilePath = s"$tgtPath/${file.getName}"
storageWrapper.write(zipInputStream, storagePathFilePath)
LOGGER.info(storagePathFilePath)
val lineCount = Source.fromInputStream(storageWrapper.read(storagePathFilePath)).getLines().count(s => s!= null)
There is nothing wrong with the storage wrapper: it takes an input stream and a path, and it has worked well so far. Can anyone suggest what is wrong with how I am using ZipArchiveInputStream?
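One way to narrow this down, without assuming anything about how storageWrapper consumes the stream, is to buffer each entry fully and hand the wrapper a fresh stream over those bytes. The sketch below is only an illustration: it uses java.util.zip.ZipInputStream from the standard library as a stand-in for Commons Compress' ZipArchiveInputStream, and the names ZipEntryBuffering, readCurrentEntry, extractAll and sampleZip are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipEntryBuffering {
  // Drain the current entry. read() returns -1 at the end of the entry,
  // not at the end of the whole archive, and the zip stream stays open.
  def readCurrentEntry(in: InputStream): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](8192)
    var n = in.read(buffer)
    while (n != -1) {
      out.write(buffer, 0, n)
      n = in.read(buffer)
    }
    out.toByteArray
  }

  // Walk the archive lazily, one entry at a time, and keep the bytes per entry name.
  def extractAll(zip: ZipInputStream): Map[String, Array[Byte]] =
    Iterator.continually(zip.getNextEntry)
      .takeWhile(_ != null)
      .filterNot(_.isDirectory)
      .map(entry => entry.getName -> readCurrentEntry(zip))
      .toMap

  // Build a small in-memory archive so the sketch can be exercised without S3.
  def sampleZip(files: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zip = new ZipOutputStream(bytes)
    try {
      files.foreach { case (name, content) =>
        zip.putNextEntry(new ZipEntry(name))
        zip.write(content.getBytes("UTF-8"))
        zip.closeEntry()
      }
    } finally zip.close()
    bytes.toByteArray
  }
}
```

If the buffered bytes have the expected length, each entry could then be passed on as new ByteArrayInputStream(bytes). Whether that removes the 0 KB files depends on the wrapper, which is not shown in the question.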

Related

Download pdf file from s3 using akka-stream-alpakka

I am trying to download pdf file from S3 using the akka-stream-alpakka connector. I have the s3 path and try to download the pdf using a wrapper method over the alpakka s3Client.
def getSource(s3Path: String): Source[ByteString, NotUsed] = {
val (source, _) = s3Client.download(s3Bucket, s3Path)
source
}
From my main code, I call the above method and try to convert it to a file
val file = new File("certificate.pdf")
val res: Future[IOResult] = getSource(data.s3PdfPath)
.runWith(FileIO.toFile(file))
However, instead of getting a file, I end up with a value of type Future[IOResult]. Can someone please point out where I am going wrong?
def download(bucket: String, bucketKey: String, filePath: String) = {
val (s3Source: Source[ByteString, _], _) = s3Client.download(bucket, bucketKey)
val result = s3Source.toMat(FileIO.toPath(Paths.get(filePath)))(Keep.right)
.run()
result
}
download(s3Bucket, key, newSigFilepath).onComplete {
}
Inspect the IOResult, and if successful you can use your file:
res.foreach {
case IOResult(bytes, Success(_)) =>
println(s"$bytes bytes written to $file")
... // do whatever you want with your file
case _ =>
println("some error occurred.")
}

How to efficiently read/parse loads of .gz files in a s3 folder with spark on EMR

I'm trying to read all files in a directory on s3 via a spark app that's executing on EMR.
The data is stored in a typical layout like "s3a://Some/path/yyyy/mm/dd/hh/blah.gz".
If I use deeply nested wildcards (e.g. "s3a://SomeBucket/SomeFolder/*/*/*/*/*.gz"), the performance is terrible: it takes about 40 minutes to read a few tens of thousands of small gzipped JSON files.
It works, but losing 40 minutes to test some code is really bad.
I have two other approaches that my research has told me are much more performant.
Using the hadoop.fs library (2.8.5) I try to read each file path I provide it.
private def getEventDataHadoop(
eventsFilePaths: RDD[String]
)(implicit sqlContext: SQLContext): Try[RDD[String]] =
Try(
{
val conf = sqlContext.sparkContext.hadoopConfiguration
eventsFilePaths.map(eventsFilePath => {
val p = new Path(eventsFilePath)
val fs = p.getFileSystem(conf)
val eventData: FSDataInputStream = fs.open(p)
IOUtils.toString(eventData)
})
}
)
These file paths are generated by the below code:
private[disneystreaming] def generateInputBucketPaths(
s3Protocol: String,
bucketName: String,
service: String,
region: String,
yearsMonths: Map[String, Set[String]]
): Try[Set[String]] =
Try(
{
val days = 1 to 31
val hours = 0 to 23
val dateFormatter: Int => String = buildDateFormat("00")
yearsMonths.flatMap { yearMonth: (String, Set[String]) =>
for {
month: String <- yearMonth._2
day: Int <- days
hour: Int <- hours
} yield
s"$s3Protocol$bucketName/$service/$region/${dateFormatter(yearMonth._1.toInt)}/${dateFormatter(month.toInt)}/" +
s"${dateFormatter(day)}/${dateFormatter(hour)}/*.gz"
}.toSet
}
)
The hadoop.fs code fails because the Path class is not serializable. I can't think of how I can get around that.
So this led me to another approach using AmazonS3Client, where I just ask the client to give me all the file paths in a folder (or prefix), then parse the files to a string, which will likely fail due to them being compressed:
private def getEventDataS3(bucketName: String, prefix: String)(
implicit sqlContext: SQLContext
): Try[RDD[String]] =
Try(
{
import com.amazonaws.services.s3._, model._
import scala.collection.JavaConverters._
val request = new ListObjectsRequest()
request.setBucketName(bucketName)
request.setPrefix(prefix)
request.setMaxKeys(Integer.MAX_VALUE)
val s3 = new AmazonS3Client(new ProfileCredentialsProvider("default"))
val objs: ObjectListing = s3.listObjects(request) // Note that this method returns truncated data if longer than the "pageLength" above. You might need to deal with that.
sqlContext.sparkContext
.parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
.flatMap { key =>
Source
.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream)
.getLines()
}
}
)
This code throws an exception because the profile file cannot be null ("java.lang.IllegalArgumentException: profile file cannot be null").
Remember this code is running on EMR within AWS, so how do I provide the credentials it wants? How are other people running spark jobs on EMR using this client?
Any help with getting any of these approaches working is much appreciated.
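On the compression concern raised above: whichever client ends up fetching the objects, the raw stream can be wrapped in java.util.zip.GZIPInputStream before it is read line by line. The following is a minimal standalone sketch, assuming UTF-8 text; GzipLines, readGzipLines and gzip are hypothetical names, and the in-memory gzip helper only stands in for an S3 object body.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.{Codec, Source}

object GzipLines {
  // Decompress on the fly and return the lines of the underlying text.
  def readGzipLines(raw: InputStream): List[String] = {
    val source = Source.fromInputStream(new GZIPInputStream(raw))(Codec.UTF8)
    try source.getLines().toList finally source.close()
  }

  // Helper for the example only: gzip a string in memory.
  def gzip(text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bytes)
    try out.write(text.getBytes(StandardCharsets.UTF_8)) finally out.close()
    bytes.toByteArray
  }
}
```

Materializing the lines with toList before closing matters here, because getLines() is lazy and would otherwise read from a closed stream.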
Path is serializable in later Hadoop releases, because it is useful to be able to use in Spark RDDs. Until then, convert the path to a URI, marshall that, and create a new path from that URI inside your closure.
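The URI round trip can be sketched with the standard library alone, since java.net.URI implements java.io.Serializable. The object name UriRoundTrip and the sample URI below are made up for illustration; in a Spark job the driver side would call path.toUri and the closure would rebuild the path with new Path(uri).

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.net.URI

object UriRoundTrip {
  // Java-serialize a value, which is what Spark's default closure serializer does.
  def serialize(value: java.io.Serializable): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    bytes.toByteArray
  }

  def deserialize[T](data: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Hypothetical object location, used only for this example.
  val original = new URI("s3a://some-bucket/service/region/2019/01/01/00/blah.gz")

  // The URI survives the round trip; a Hadoop Path would be recreated from it afterwards.
  val restored: URI = deserialize[URI](serialize(original))
}
```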

Scala: How to get the content of PortableDataStream instance from an RDD

As I want to extract data from binary files, I read them using
val dataRDD = sc.binaryFiles("Path"), which gives me a result of type org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)].
I want to extract the content of my files, which comes in the form of a PortableDataStream.
For that I tried: val data = dataRDD.map(x => x._2.open()).collect()
but I get the following error:
java.io.NotSerializableException:org.apache.hadoop.hdfs.client.HdfsDataInputStream
If you have any idea how I can solve this, please help!
Many thanks in advance.
Actually, the PortableDataStream is Serializable. That's what it is meant for. Yet, open() returns a simple DataInputStream (HdfsDataInputStream in your case because your file is on HDFS) which is not Serializable, hence the error you get.
In fact, when you open the PortableDataStream, you just need to read the data right away. In scala, you can use scala.io.Source.fromInputStream:
val data : RDD[Array[String]] = sc
.binaryFiles("path/.../")
.map{ case (fileName, pds) => {
scala.io.Source.fromInputStream(pds.open())
.getLines().toArray
}}
This code assumes that the data is textual. If it is not, you can adapt it to read any kind of binary data. Here is an example to create a sequence of bytes, that you could process the way you want.
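val rdd: RDD[Seq[Byte]] = sc.binaryFiles("...")
  .map { case (file, pds) =>
    val dis = pds.open()
    try {
      val buffer = Array.ofDim[Byte](1024)
      val all = scala.collection.mutable.ArrayBuffer[Byte]()
      // read() returns the number of bytes read, or -1 at the end of the stream
      var n = dis.read(buffer)
      while (n != -1) {
        all ++= buffer.take(n) // append only the bytes actually read
        n = dis.read(buffer)
      }
      all.toSeq
    } finally dis.close()
  }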
See the javadoc of DataInputStream for more possibilities. For instance, it possesses readLong, readDouble (and so on) methods.
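As a small illustration of those typed readers, the sketch below writes a Long and a Double with DataOutputStream and reads them back with DataInputStream. It runs on an in-memory byte array rather than on the stream returned by pds.open(), and the names TypedReads, encode and decode are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object TypedReads {
  // Write a fixed binary layout: an 8-byte long followed by an 8-byte double.
  def encode(id: Long, score: Double): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    try {
      out.writeLong(id)
      out.writeDouble(score)
    } finally out.close()
    bytes.toByteArray
  }

  // Read the values back in the same order they were written.
  def decode(data: Array[Byte]): (Long, Double) = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    try {
      val id = in.readLong()
      val score = in.readDouble()
      (id, score)
    } finally in.close()
  }
}
```

The same decode logic would apply inside the map over binaryFiles, as long as the file really uses this big-endian layout.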
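val bf = sc.binaryFiles("...")
val bytes = bf.map { case (file, pds) =>
  val dis = pds.open()
  try {
    // available() is only an estimate of the readable bytes; it matched the file size here
    val len = dis.available()
    val buf = Array.ofDim[Byte](len)
    dis.readFully(buf) // read from the stream already opened, not from a second pds.open()
    buf
  } finally dis.close()
}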
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26
scala> bytes.take(1)(0).size
res15: Int = 5879609 // this happened to be the size of my first binary file

Play Framework 2.6 Alpakka S3 File Upload

I use Play Framework 2.6 (Scala) and Alpakka AWS S3 Connector to upload files asynchronously to S3 bucket. My code looks like this:
def richUpload(extension: String, checkFunction: (String, Option[String]) => Boolean, cannedAcl: CannedAcl, bucket: String) = userAction(parse.multipartFormData(handleFilePartAsFile)).async { implicit request =>
val s3Filename = request.user.get.id + "/" + java.util.UUID.randomUUID.toString + "." + extension
val fileOption = request.body.file("file").map {
case FilePart(key, filename, contentType, file) =>
Logger.info(s"key = ${key}, filename = ${filename}, contentType = ${contentType}, file = $file")
if(checkFunction(filename, contentType)) {
s3Service.uploadSink(s3Filename, cannedAcl, bucket).runWith(FileIO.fromPath(file.toPath))
} else {
throw new Exception("Upload failed")
}
}
fileOption match {
case Some(opt) => opt.map(o => Ok(s3Filename))
case _ => Future.successful(BadRequest("ERROR"))
}
}
It works, but it returns the filename before the upload to S3 has finished. I want to return the value only after the upload completes. Is there any solution?
Also, is it possible to stream the file upload directly to S3, so that progress is shown correctly and no temporary file on disk is needed?
You need to flip around your source and sink to obtain the materialized value you are interested in.
You have:
a source that reads from your local files, and materializes to a Future[IOResult] upon completion of reading the file.
a sink that writes to S3 and materializes to Future[MultipartUploadResult] upon completion of writing to S3.
You are interested in the latter, but in your code you are using the former. This is because the runWith function always keeps the materialized value of the stage passed as a parameter.
The types in the sample snippet below should clarify this:
val fileSource: Source[ByteString, Future[IOResult]] = ???
val s3Sink : Sink [ByteString, Future[MultipartUploadResult]] = ???
val m1: Future[IOResult] = s3Sink.runWith(fileSource)
val m2: Future[MultipartUploadResult] = fileSource.runWith(s3Sink)
After you have obtained a Future[MultipartUploadResult] you can map on it the same way and access the location field to get a file's URI, e.g.:
val location: Future[Uri] = fileSource.runWith(s3Sink).map(_.location) // map needs an implicit ExecutionContext in scope

Scala: Source.fromInputStream failed for bigger input

I am reading data from AWS S3. The following code works fine if the input file is small, but it fails when the input file is big. Is there any parameter I can modify to increase the buffer size, or anything else, so that it can handle bigger input files as well? Thanks!
val s3Object= s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"));
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
val data = line.split(",")
myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
If you look at the source code you can see that internally it calls Source.createBufferedSource. You can use that to create your own version with a bigger buffer size.
These are the relevant lines from the Scala standard library (scala.io.Source):
def createBufferedSource(
inputStream: InputStream,
bufferSize: Int = DefaultBufSize,
reset: () => Source = null,
close: () => Unit = null
)(implicit codec: Codec): BufferedSource = {
// workaround for default arguments being unable to refer to other parameters
val resetFn = if (reset == null) () => createBufferedSource(inputStream, bufferSize, reset, close)(codec) else reset
new BufferedSource(inputStream, bufferSize)(codec) withReset resetFn withClose close
}
def fromInputStream(is: InputStream, enc: String): BufferedSource =
fromInputStream(is)(Codec(enc))
def fromInputStream(is: InputStream)(implicit codec: Codec): BufferedSource =
createBufferedSource(is, reset = () => fromInputStream(is)(codec), close = () => is.close())(codec)
Edit: Now that I have thought about your issue a bit more, you can increase the buffer size in this way, but I'm not sure that this will actually fix your issue.
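For reference, a call with a larger buffer might look like the sketch below. It is an illustration only: the 64 KB buffer size, the UTF-8 codec and the names BigBufferCsv and parse are assumptions, the input is an in-memory stream rather than an S3 object, and, as noted above, a bigger buffer may not be what fixes the failure.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import scala.io.{Codec, Source}

object BigBufferCsv {
  // Parse "key,value" lines into a map, reading through a buffer of the given size.
  def parse(is: InputStream, bufferSize: Int = 64 * 1024): Map[String, Double] = {
    val source = Source.createBufferedSource(is, bufferSize, close = () => is.close())(Codec.UTF8)
    try {
      source.getLines()
        .map(_.split(","))
        // keep only well-formed two-column rows; toDouble still throws on non-numeric values
        .collect { case Array(key, value) => key -> value.toDouble }
        .toMap
    } finally source.close()
  }

  // Sample data standing in for s3Object.getObjectContent().
  val sample = new ByteArrayInputStream("a,1.5\nb,2.0\n".getBytes(StandardCharsets.UTF_8))
  val parsed: Map[String, Double] = parse(sample)
}
```

Closing the source in a finally block also closes the underlying stream, which is worth doing with S3 object content in any case.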