Reading zip file from S3 bucket using Scala Spark

I am trying to fetch and read text files inside a zip file uploaded to an AWS S3 bucket.
Code I tried:
var ZipFileList = spark.sparkContext.binaryFiles("/path/")
var unit = ZipFileList.flatMap {
  case (zipFilePath, zipContent) => {
    val zipInputStream = new ZipInputStream(zipContent.open())
    val zipEntry = zipInputStream.getNextEntry()
    println(zipEntry.getName)
  }
}
but it gives the error "type mismatch; found: Unit, required: TraversableOnce".
val files = spark.sparkContext.wholeTextFiles("/path/")
files.flatMap({ case (name, content) =>
  unzip(content) // gives error "type mismatch; found : Unit required: scala.collection.GenTraversableOnce[?]"
})
Is there any other way to read file contents inside a zip file?
The zip file contains .json files, and what I want to achieve is to read and parse all of those files.

You aren't actually returning the data in the unzip() command, are you? I think that's part of the problem.
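For illustration, a flatMap body that actually returns something traversable could look roughly like the sketch below, assuming the zip entries are UTF-8 text files (the "/path/" placeholder is kept from the question):

import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

val zipFiles = spark.sparkContext.binaryFiles("/path/")

// Return one String per zip entry, so the flatMap yields an RDD[String] instead of Unit.
val jsonStrings = zipFiles.flatMap { case (zipFilePath, zipContent: PortableDataStream) =>
  val zis = new ZipInputStream(zipContent.open())
  Stream.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .map { entry =>
      // Source reads only up to the end of the current entry
      scala.io.Source.fromInputStream(zis, "UTF-8").mkString
    }
    .toList // materialize while the stream is still open
}

// Each element of jsonStrings is the full text of one .json entry and can then be
// parsed with a JSON library of your choice.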

Related

Write/Read/Delete binary data in Spark Databricks (Scala)

I'm quite new to Spark on Databricks (Scala) and I would like to know how I can write the content of a variable of type Array[Byte] to a temporary file data.bin, either in a mount storage mtn/somewhere/tmp/ (Azure Data Lake) or in file:/tmp/. Then I would like to know how to read it as an InputStream and later delete it when I'm done with it.
All methods I've read about so far either do not work or do not apply to binary data.
Thank you.
Turns out this code works fine:
import java.io._
import org.apache.commons.io.FileUtils
// Create or collect the data
val bytes: Array[Byte] = <some_data>
try {
  // Write data to a temp file
  // Note: here I use a GRIB2 file as I manipulate forecast data,
  // but you can use .bin or .png/.jpg (if it's image data)
  // extensions or no extension at all. It doesn't matter.
  val path: String = "mf-data.grib"
  val file: File = new File(path)
  FileUtils.writeByteArrayToFile(file, bytes)

  // Read the temp file
  val input = new FileInputStream(path)

  ////////// Do something with it //////////

  // Remove the temp file
  if (!file.delete()) {
    println("Cannot delete temporary file!")
  }
} catch {
  case _: Throwable => println("An I/O error occurred")
}

Scala: Listing files that match a regular expression within a directory

I'm trying to list files within a directory that match a regular expression, e.g. ".csv$". This is very similar to Scala & DataBricks: Getting a list of Files.
I've been running in circles for hours trying to figure out how Scala can list a directory of files and filter by regex.
import java.io.File
def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
val name : String = ".csv"
val files = getListOfFiles("/home/con/Scripts").map(_.path).filter(_.matches(name))
println(files)
gives the error
/home/con/Scripts/scala/find_files.scala:13: error: value path is not a member of java.io.File
val files = getListOfFiles("/home/con/Scripts").map(_.path).filter(_.matches(name))
I'm trying to figure out the regular Scala equivalent of dbutils.fs.ls, which eludes me.
How can I list files in a regular directory in Scala?
The error is reporting that path is not a member of java.io.File, which it isn't.
If you want to match by name, why don't you get file names? Also, your regex is a bit off if you want to match based on file extension.
Fixing these two problems:
val name: String = ".+\\.csv"
val files = getListOfFiles("/path/to/files/location")
  .map(f => f.getName)
  .filter(_.matches(name))
will output .csv files in the /path/to/files/location folder.
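Putting the pieces together with the getListOfFiles helper from the question (a sketch; the directory is the one from the question):

import java.io.File

def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) d.listFiles.filter(_.isFile).toList
  else List.empty[File]
}

// Match on the file name, with a regex that covers the whole name.
val pattern: String = ".+\\.csv"
val csvFiles: List[String] =
  getListOfFiles("/home/con/Scripts")
    .map(_.getName)
    .filter(_.matches(pattern))

csvFiles.foreach(println)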

Reading and printing the zip file name and the file names inside the zip files using Scala

I have a directory with many *.zip files and each zipfile can have many *.xml files inside it.
Test1.Zip [2018-09-07 02-57-43Z-OJPRxx.xml , 2018-09-07 03-57-43Z-OJPRxx.xml ]
Test2.Zip [2018-09-17 02-57-43Z-OJPRYY.xml , 2018-09-17 03-57-43Z-OJPRYY.xml ]
Using the Scala code below I am able to print the names of the files inside the zip file, but can't print the name of the zip file itself, while using sc.binaryFiles and ZipInputStream.
I have altered the code that I researched, and it appends the file name to each line of the XML file using the getName method. But I can't find any function in Scala to print the parent zip file name after it builds the stream and starts to read the zip files one by one.
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
    sc.binaryFiles(path, minPartitions)
      .flatMap { case (name: String, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .map { x =>
            val filename1 = x.getName
            scala.io.Source.fromInputStream(zis, "UTF-8").getLines.mkString(s"~${filename1}\n") + s"~${filename1}"
          }
      }
  }
}
val df = sc.readFile("/landing/data/hs/tsnt/froff/bdu/acbdustore/INPUT/zipfiles", 1)
df.saveAsTextFile("/landing/data/hs/tsnt/froff/bdu/acbdustore/INPUT/outputfile")
example output:
<?xml version="1.0" encoding="utf-8"?>~2018-09-06 01-57-43Z-OJPRQL.xml
<pnr xmlns="http://gdsx.com/PnrDataPush.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">~2018-09-06 01-57-43Z-OJPRQL.xml
<PNRid>999999999999</PNRid>~2018-09-06 01-57-43Z-OJPRQL.xml
<recordLocator>OJPRQL</recordLocator>~2018-09-06 01-57-43Z-OJPRQL.xml
<GDS>6</GDS>~2018-09-06 01-57-43Z-OJPRQL.xml
<platformID>NA</platformID>~2018-09-06 01-57-43Z-OJPRQL.xml
I expect my parent zip file name to be captured and printed or written to an output file, but I can't find a function/method to do the same. Would someone please help me out here? Thanks!
It should be in name, or in content.getPath.
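In other words, the reader above already receives the path of the zip file as the first element of each binaryFiles pair; it only needs to be carried into each emitted line. A rough sketch, assuming the same structure as the question's code:

import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
    sc.binaryFiles(path, minPartitions)
      .flatMap { case (zipPath: String, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .map { entry =>
            val entryName = entry.getName
            // Append both the entry name and the parent zip file path to each line
            scala.io.Source.fromInputStream(zis, "UTF-8")
              .getLines
              .map(line => s"$line~$entryName~$zipPath")
              .mkString("\n")
          }
      }
  }
}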

Play Framework 2.6 Alpakka S3 File Upload

I use Play Framework 2.6 (Scala) and Alpakka AWS S3 Connector to upload files asynchronously to S3 bucket. My code looks like this:
def richUpload(extension: String, checkFunction: (String, Option[String]) => Boolean, cannedAcl: CannedAcl, bucket: String) = userAction(parse.multipartFormData(handleFilePartAsFile)).async { implicit request =>
  val s3Filename = request.user.get.id + "/" + java.util.UUID.randomUUID.toString + "." + extension
  val fileOption = request.body.file("file").map {
    case FilePart(key, filename, contentType, file) =>
      Logger.info(s"key = ${key}, filename = ${filename}, contentType = ${contentType}, file = $file")
      if (checkFunction(filename, contentType)) {
        s3Service.uploadSink(s3Filename, cannedAcl, bucket).runWith(FileIO.fromPath(file.toPath))
      } else {
        throw new Exception("Upload failed")
      }
  }
  fileOption match {
    case Some(opt) => opt.map(o => Ok(s3Filename))
    case _ => Future.successful(BadRequest("ERROR"))
  }
}
It works, but it returns the filename before the upload to S3 has finished. I want to return the value only after the upload to S3 completes. Is there any solution?
Also, is it possible to stream the file upload directly to S3, to show progress correctly and to avoid using a temporary file on disk?
You need to flip around your source and sink to obtain the materialized value you are interested in.
You have:
a source that reads from your local file and materializes to a Future[IOResult] upon completion of reading the file.
a sink that writes to S3 and materializes to a Future[MultipartUploadResult] upon completion of writing to S3.
You are interested in the latter, but in your code you are using the former. This is because the runWith function always keeps the materialized value of the stage passed as a parameter.
The types in the sample snippet below should clarify this:
val fileSource: Source[ByteString, Future[IOResult]] = ???
val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = ???
val m1: Future[IOResult] = s3Sink.runWith(fileSource)
val m2: Future[MultipartUploadResult] = fileSource.runWith(s3Sink)
After you have obtained a Future[MultipartUploadResult], you can map on it in the same way and access the location field to get the uploaded file's URI, e.g.:
val location = fileSource.runWith(s3Sink).map(_.location) // a Future holding the uploaded file's location
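Applied to the action in the question, that means running the stream from the file source into the S3 sink and completing the request only when that future finishes. A rough sketch, reusing the names from the question (s3Service.uploadSink, checkFunction, s3Filename) and assuming an implicit Materializer and ExecutionContext are in scope:

val fileOption = request.body.file("file").map {
  case FilePart(key, filename, contentType, file) =>
    if (checkFunction(filename, contentType)) {
      // Keep the sink's materialized value: this Future completes only when
      // the multipart upload to S3 has finished.
      FileIO.fromPath(file.toPath)
        .runWith(s3Service.uploadSink(s3Filename, cannedAcl, bucket))
        .map(_ => Ok(s3Filename))
    } else {
      Future.successful(BadRequest("Upload failed"))
    }
}

fileOption.getOrElse(Future.successful(BadRequest("ERROR")))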

Play! Upload file and save to AWS S3

I am using Play 2.3 and want to store the uploaded files to S3, so I use the Play-S3 module.
However, I got stuck because I need to create a BucketFile to upload to S3 with this module, and a BucketFile is created from an in-memory Array[Byte] of the file. The Play! body parser gives me a temporary on-disk file. How can I put this file into a BucketFile?
Here is my controller Action:
def upload = Action.async(parse.multipartFormData) { implicit request =>
  request.body.file("file").map { file =>
    implicit val credential = AwsCredentials.fromConfiguration
    val bucket = S3("bucketName")
    val result = bucket + BucketFile(file.filename, file.contentType.get, file.ref.file.toString.getBytes)
    result.map { unit =>
      Ok("File uploaded")
    }
  }.getOrElse {
    Future.successful {
      Redirect(routes.Application.index).flashing(
        "error" -> "Missing file"
      )
    }
  }
}
This code does not work because file.ref.file.toString does not return the file's contents; it only returns its path as a string.
Import the following:
import java.nio.file.{Paths, Files}
To create the Array[Byte] do:
val byteArray = Files.readAllBytes(Paths.get(file.ref.file.getPath))
Then upload with:
BucketFile(file.filename, file.contentType.get, byteArray, None, None)
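In the action from the question, only the BucketFile construction changes; a sketch under the same assumptions (the Play-S3 API and the multipart body parser used above):

import java.nio.file.{Files, Paths}

// Read the temporary on-disk file into memory, then hand the bytes to Play-S3.
val byteArray = Files.readAllBytes(Paths.get(file.ref.file.getPath))
val result = bucket + BucketFile(file.filename, file.contentType.get, byteArray, None, None)
result.map { unit =>
  Ok("File uploaded")
}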