Write/Read/Delete binary data in Spark Databricks (Scala)

I'm quite new to Spark on Databricks (Scala) and I would like to know how I can write the content of a variable of type Array[Byte] to a temporary file data.bin in a mounted storage mnt/somewhere/tmp/ (Azure Data Lake) or to file:/tmp/. Then I would like to know how to read it back as an InputStream and later delete it when I'm done with it.
None of the methods I've read about so far work, or they don't apply to binary data.
Thank you.

Turns out this code works fine:

import java.io._
import org.apache.commons.io.FileUtils

// Create or collect the data
val bytes: Array[Byte] = <some_data>

try {
  // Write the data to a temp file.
  // Note: I use a GRIB2 file here because I manipulate forecast data,
  // but .bin or .png/.jpg (for image data) extensions, or no extension
  // at all, work just as well. It doesn't matter.
  val path: String = "mf-data.grib"
  val file: File = new File(path)
  FileUtils.writeByteArrayToFile(file, bytes)

  // Read the temp file back as an InputStream
  val input = new FileInputStream(path)
  ////////// Do something with it //////////
  input.close()

  // Remove the temp file
  if (!file.delete()) {
    println("Cannot delete temporary file!")
  }
} catch {
  case _: Throwable => println("An I/O error occurred")
}
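If the temporary file should live under the mounted storage rather than on the driver's local disk, the same approach works through the /dbfs FUSE path that Databricks exposes for mounts. A minimal sketch, assuming the mount from the question is reachable at /dbfs/mnt/somewhere/tmp and is writable:

import java.io.{File, FileInputStream}
import org.apache.commons.io.FileUtils

val bytes: Array[Byte] = <some_data>

// DBFS mounts are visible on the driver's local filesystem under /dbfs
val mountedPath = "/dbfs/mnt/somewhere/tmp/data.bin"
val file = new File(mountedPath)
FileUtils.writeByteArrayToFile(file, bytes)

val input = new FileInputStream(mountedPath)
////////// Do something with it //////////
input.close()

// Clean up with java.io, or alternatively dbutils.fs.rm("dbfs:/mnt/somewhere/tmp/data.bin")
if (!file.delete()) {
  println("Cannot delete temporary file!")
}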

Related

Issue with try-finally in Scala

I have the following Scala code:
val file = new FileReader("myfile.txt")
try {
  // do operations on file
} finally {
  file.close() // close the file
}
How do I handle the FileNotFoundException thrown when I read the file? If I put that line inside the try block, I am not able to access the file variable inside finally.
For Scala 2.13:
You can simply use Using to acquire a resource and release it automatically, without manual error handling, as long as it is an AutoCloseable:
import java.io.FileReader
import scala.util.{Try, Using}
val newStyle: Try[String] = Using(new FileReader("myfile.txt")) {
  reader: FileReader =>
    // do something with reader
    "something"
}

newStyle
// will be
// Failure(java.io.FileNotFoundException: myfile.txt (No such file or directory))
// if the file is not found, or a Success holding the result if nothing fails
For Scala 2.12:
You can wrap the reader creation in scala.util.Try; if the creation fails, you will get a Failure with the FileNotFoundException inside.
import java.io.FileReader
import scala.util.Try
val oldStyle: Try[String] = Try {
  val file = new FileReader("myfile.txt")
  try {
    // do operations on file
    "something"
  } finally {
    file.close() // close the file
  }
}

oldStyle
// will be
// Failure(java.io.FileNotFoundException: myfile.txt (No such file or directory))
// or a Success with the result of reading the file inside
I recommend not using try ... catch blocks in Scala code: it's not type-safe in some cases and can lead to non-obvious results. But to release a resource in older Scala versions, try-finally is the only way to do it.
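To address the original question directly: once the result is wrapped in a Try, the FileNotFoundException can be handled by matching on the Failure, or by recovering from it. A minimal sketch, shown here with the newStyle value from the 2.13 example (the same works for oldStyle):

import java.io.FileNotFoundException
import scala.util.{Failure, Success}

newStyle match {
  case Failure(e: FileNotFoundException) => println(s"File not found: ${e.getMessage}")
  case Failure(other)                    => println(s"Some other error: ${other.getMessage}")
  case Success(contents)                 => println(s"Read: $contents")
}

// Or keep working with a Try and supply a fallback for the missing-file case only
val withFallback: Try[String] = newStyle.recover {
  case _: FileNotFoundException => "default contents"
}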

Save Dataset elements to files with specified file path

I have a Dataset of an Event case class, and I would like to save the JSON string element inside each event into a file on S3 with a path like bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz
So for example, the events case class looks like this:
case class Event(
  hourPath: String, // e.g. bucketName/service/yyyy/mm/dd/hh/
  json: String,     // The json line that represents this particular event.
  ...               // Other properties used in earlier transformations.
)
Is there a way to save on the dataset where we write the events that belong to a particular hour into a file on s3?
Calling partitionBy on the DataFrameWriter is the closest I can get, but the file path isn't exactly what I want.
You can iterate over each item and write it to a file in S3. Note that the snippet below collects the dataset to the driver first, so the writes happen there one by one; a variant that distributes the writes across the executors in parallel is sketched after the imports below.
This code is working for me:
val tempDS = eventsDS.rdd.collect.map(x => saveJSONtoS3(x.hourPath, x.json))

def saveJSONtoS3(path: String, jsonString: String): Unit = {
  val bucketName = path.substring(0, path.indexOf('/'))
  val file = path.substring(bucketName.length() + 1)

  val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
  val amazonS3Client = new AmazonS3Client(creds)
  val meta = new ObjectMetadata()
  amazonS3Client.putObject(bucketName, file, new ByteArrayInputStream(jsonString.getBytes), meta)
}
You need to import:
import java.io.ByteArrayInputStream
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata
You also need to include the aws-java-sdk library.
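If you want the writes to actually run in parallel on the executors instead of serially on the driver, here is a hedged sketch of a variant. It assumes the AWS credentials are available on the executors, and it builds the S3 client once per partition so nothing has to be serialized from the driver:

eventsDS.rdd.foreachPartition { events =>
  // One client per partition, created on the executor itself
  val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
  val client = new AmazonS3Client(creds)

  events.foreach { x =>
    val bucketName = x.hourPath.substring(0, x.hourPath.indexOf('/'))
    val key = x.hourPath.substring(bucketName.length() + 1)
    client.putObject(bucketName, key, new ByteArrayInputStream(x.json.getBytes), new ObjectMetadata())
  }
}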

Writing to a local file on hdfs in Spark from a non-Spark datastructure

I have the following code:
def writeCSV(indexing: ListBuffer[Array[Int]], outputpath: String): Unit = {
  new PrintWriter(outputpath + "out.csv") {
    write("col1,col2,col3\n")
    for (entry <- indexing) {
      for (num <- entry) {
        write(num + "")
        if (num != entry(2)) write(",")
      }
      write("\n")
    }
    close()
  }
}
This does not work, because Spark complains that the output path cannot be found. How can I write a regular data structure (ListBuffer[Array[Int]]) to a plain file from my Spark program? Do I need to map the ListBuffer to some Spark data structure?
I understand this is not what you would normally do; it is only for debugging and will not be used in production code.
I am new to Spark and I am using Spark 1.6.0.
If you want to write a file to HDFS, you can pass an output stream created via the FileSystem API in the org.apache.hadoop.fs package to the PrintWriter constructor.
Example code:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
new PrintWriter(fs.create(new Path(""))) {
  write(...)
  close()
}
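Putting it together with the original writeCSV, a minimal sketch under these assumptions: spark is the active SparkSession (on Spark 1.6 you can get the same configuration from sc.hadoopConfiguration), outputpath is an HDFS directory, writeCSVToHdfs is just an illustrative name, and the manual comma loop is replaced by mkString, which produces the same comma-separated line:

import java.io.PrintWriter
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

def writeCSVToHdfs(indexing: ListBuffer[Array[Int]], outputpath: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val out = new PrintWriter(fs.create(new Path(outputpath + "out.csv")))
  try {
    out.write("col1,col2,col3\n")
    for (entry <- indexing) {
      out.write(entry.mkString(","))  // e.g. "1,2,3"
      out.write("\n")
    }
  } finally {
    out.close()  // flush and release the HDFS stream
  }
}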

Play Framework 2.6 Alpakka S3 File Upload

I use Play Framework 2.6 (Scala) and the Alpakka AWS S3 Connector to upload files asynchronously to an S3 bucket. My code looks like this:
def richUpload(extension: String, checkFunction: (String, Option[String]) => Boolean, cannedAcl: CannedAcl, bucket: String) =
  userAction(parse.multipartFormData(handleFilePartAsFile)).async { implicit request =>
    val s3Filename = request.user.get.id + "/" + java.util.UUID.randomUUID.toString + "." + extension
    val fileOption = request.body.file("file").map {
      case FilePart(key, filename, contentType, file) =>
        Logger.info(s"key = ${key}, filename = ${filename}, contentType = ${contentType}, file = $file")
        if (checkFunction(filename, contentType)) {
          s3Service.uploadSink(s3Filename, cannedAcl, bucket).runWith(FileIO.fromPath(file.toPath))
        } else {
          throw new Exception("Upload failed")
        }
    }
    fileOption match {
      case Some(opt) => opt.map(o => Ok(s3Filename))
      case _ => Future.successful(BadRequest("ERROR"))
    }
  }
It works, but it returns the filename before the upload to S3 has completed. I want to return the value only after the upload finishes. Is there any solution?
Also, is it possible to stream the file upload directly to S3, to show progress correctly and avoid the temporary file on disk?
You need to flip around your source and sink to obtain the materialized value you are interested in.
You have:
a source that reads from your local files, and materializes to a Future[IOResult] upon completion of reading the file.
a sink that writes to S3 and materializes to Future[MultipartUploadResult] upon completion of writing to S3.
You are interested in the latter, but in your code you are using the former. This is because the runWith function always keeps the materialized value of the stage passed as its parameter.
The types in the sample snippet below should clarify this:
val fileSource: Source[ByteString, Future[IOResult]] = ???
val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = ???
val m1: Future[IOResult] = s3Sink.runWith(fileSource)
val m2: Future[MultipartUploadResult] = fileSource.runWith(s3Sink)
After you have obtained a Future[MultipartUploadResult] you can map over it in the same way and access the location field to get the uploaded file's URI, e.g.:
val location = fileSource.runWith(s3Sink).map(_.location) // a Future of the uploaded file's location
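Applied to the controller from the question, here is a sketch of the relevant branch. It keeps the names from the original code and assumes an implicit ExecutionContext and Materializer are in scope, as they already must be for the original to compile: run the file source into the S3 sink, then map the resulting future so the response is only produced once the upload has finished.

if (checkFunction(filename, contentType)) {
  FileIO.fromPath(file.toPath)
    .runWith(s3Service.uploadSink(s3Filename, cannedAcl, bucket)) // Future[MultipartUploadResult]
    .map(_ => Ok(s3Filename))                                     // respond only after the upload completed
} else {
  Future.successful(BadRequest("Upload failed"))
}

With that change, the surrounding fileOption match can simply return the future from the Some branch.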

Reading a zip file from an S3 bucket using Scala Spark

I am trying to fetch and read text files inside a zip file uploaded to an AWS S3 bucket.
Code I tried:
var ZipFileList = spark.sparkContext.binaryFiles("/path/")
var unit = ZipFileList.flatMap {
  case (zipFilePath, zipContent) => {
    val zipInputStream = new ZipInputStream(zipContent.open())
    val zipEntry = zipInputStream.getNextEntry()
    println(zipEntry.getName)
  }
}
but it gives an error: type mismatch; found: Unit, required: TraversableOnce
val files = spark.sparkContext.wholeTextFiles("/path/")
files.flatMap({ case (name, content) =>
  unzip(content) // gives error "type mismatch; found: Unit, required: scala.collection.GenTraversableOnce[?]"
})
Is there any other way to read the file contents inside a zip file? The zip file contains .json files, and what I want to achieve is to read and parse all of those files.
You aren't actually returning the data from the unzip() function, are you? I think that's part of the problem.
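A hedged sketch of what a version that does return the data could look like with binaryFiles, so that flatMap receives a collection instead of Unit (variable names and the .json filter are illustrative, not taken from the original code):

import java.util.zip.ZipInputStream
import scala.collection.mutable.ListBuffer
import scala.io.Source

val jsonContents = spark.sparkContext
  .binaryFiles("/path/")
  .flatMap { case (zipFilePath, zipContent) =>
    val zipInputStream = new ZipInputStream(zipContent.open())
    val entries = ListBuffer.empty[String]
    var entry = zipInputStream.getNextEntry()
    while (entry != null) {
      // Keep only the .json entries; reading stops at the end of the current entry
      if (!entry.isDirectory && entry.getName.endsWith(".json")) {
        entries += Source.fromInputStream(zipInputStream).mkString
      }
      entry = zipInputStream.getNextEntry()
    }
    zipInputStream.close()
    entries // a TraversableOnce, so flatMap type-checks
  }

// jsonContents is an RDD[String] with one element per .json file inside the zips,
// ready to be parsed with the JSON library of your choice.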