How to read a zip containing multiple files in Apache Spark - scala

I have a zipped file containing multiple text files.
I want to read each of the files and build a List of RDDs containing the content of each file.
val test = sc.textFile("/Volumes/work/data/kaggle/dato/test/5.zip")
just reads the whole archive as a single file, but how do I iterate through each entry of the zip and then save its content in an RDD using Spark?
I am fine with Scala or Python.
Possible partial solution in Python using Spark:
archive = zipfile.ZipFile(archive_path, 'r')
file_paths = zipfile.ZipFile.namelist(archive)
for file_path in file_paths:
    urls = file_path.split("/")
    urlId = urls[-1].split('_')[0]

Apache Spark default compression support
I have written up all the necessary theory in another answer, which you might want to refer to: https://stackoverflow.com/a/45958182/1549135
Read zip containing multiple files
I have followed the advice given by @Herman and used ZipInputStream. This gave me this solution, which returns an RDD[String] of the zip content.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          Stream.continually(zis.getNextEntry)
            .takeWhile {
              case null => zis.close(); false
              case _    => true
            }
            .flatMap { _ =>
              val br = new BufferedReader(new InputStreamReader(zis))
              Stream.continually(br.readLine()).takeWhile(_ != null)
            }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Simply use it by importing the implicit class and calling the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)

If you are reading binary files, use sc.binaryFiles. This will return an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
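A minimal sketch of that pattern, assuming an existing SparkContext named sc and an illustrative path (it mirrors the other answers in this thread rather than being a drop-in implementation):

// Sketch only: read every line of every entry of every zip matched by the path.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

val lines = sc.binaryFiles("/data/archives/*.zip")
  .flatMap { case (zipPath, portableStream) =>
    val zis = new ZipInputStream(portableStream.open())
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        // The ZipInputStream is positioned at the current entry, so we can
        // read it line by line until the entry is exhausted.
        val reader = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(reader.readLine()).takeWhile(_ != null)
      }
  }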

Here's a working version of @Atais's solution (enhanced to close the streams):
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
    if (path.toLowerCase.contains("zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (zipFilePath, zipContent) =>
          val zipInputStream = new ZipInputStream(zipContent.open())
          Stream.continually(zipInputStream.getNextEntry)
            .takeWhile(_ != null)
            .map { _ =>
              scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
            } #::: { zipInputStream.close(); Stream.empty[String] }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Then all you have to do to read a zip file is:
sc.readFile(path)

This reads only the first line of each entry. Can anyone share your insights? I am trying to read a zipped CSV file and create a JavaRDD for further processing.
JavaPairRDD<String, PortableDataStream> zipData =
    sc.binaryFiles("hdfs://temp.zip");
JavaRDD<Record> newRDDRecord = zipData.flatMap(
    new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>() {
        public Iterator<Record> call(Tuple2<String, PortableDataStream> content) throws Exception {
            List<Record> records = new ArrayList<Record>();
            ZipInputStream zin = new ZipInputStream(content._2.open());
            ZipEntry zipEntry;
            while ((zipEntry = zin.getNextEntry()) != null) {
                if (!zipEntry.isDirectory()) {
                    Record sd;
                    String line;
                    InputStreamReader streamReader = new InputStreamReader(zin);
                    BufferedReader bufferedReader = new BufferedReader(streamReader);
                    // readLine() is called only once per entry, which is why only
                    // the first line of each file ends up being parsed.
                    line = bufferedReader.readLine();
                    String[] fields = new CSVParser().parseLineMulti(line);
                    sd = new Record(TimeBuilder.convertStringToTimestamp(fields[0]),
                            getDefaultValue(fields[1]),
                            getDefaultValue(fields[22]));
                    records.add(sd);
                }
            }
            return records.iterator();
        }
    });

Here is another working solution, which also emits the file name; the name can later be split out and used to create separate schemas.
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
    if (path.toLowerCase.contains("zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (zipFilePath, zipContent) =>
          val zipInputStream = new ZipInputStream(zipContent.open())
          Stream.continually(zipInputStream.getNextEntry)
            .takeWhile(_ != null)
            .map { x =>
              val filename1 = x.getName
              scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString(s"~${filename1}\n") + s"~${filename1}"
            } #::: { zipInputStream.close(); Stream.empty[String] }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
The full code is here:
https://github.com/kali786516/Spark2StructuredStreaming/blob/master/src/main/scala/com/dataframe/extraDFExamples/SparkReadZipFiles.scala

Related

How to stream zipped file (on the fly) via Play Framework 2.5 in scala?

I want to stream some files and zip them on the fly, so users can download multiple files into a single zipped file without writing anything to the local disk. However, my current implementation holds everything in memory, and will not work for large files. Is there any way to fix it?
I was looking at this implementation: https://gist.github.com/kirked/03c7f111de0e9a1f74377bf95d3f0f60, but couldn't figure out how to use it.
import java.io.{BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}
import akka.stream.scaladsl.StreamConverters
import org.apache.commons.io.FileUtils
import play.api.mvc.{Action, Controller}

class HomeController extends Controller {

  def single() = Action {
    Ok.sendFile(
      content = new java.io.File("C:\\Users\\a.csv"),
      fileName = _ => "a.csv"
    )
  }

  def zip() = Action {
    Ok.chunked(StreamConverters.fromInputStream(fileByteData)).withHeaders(
      CONTENT_TYPE -> "application/zip",
      CONTENT_DISPOSITION -> s"attachment; filename = test.zip"
    )
  }

  def fileByteData(): ByteArrayInputStream = {
    val fileList = List(
      new java.io.File("C:\\Users\\a.csv"),
      new java.io.File("C:\\Users\\b.csv")
    )
    val baos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(new BufferedOutputStream(baos))
    try {
      fileList.foreach { file =>
        zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
        zos.write(FileUtils.readFileToByteArray(file))
        zos.closeEntry()
      }
    } finally {
      zos.close()
    }
    new ByteArrayInputStream(baos.toByteArray)
  }
}
Instead of using a ByteArrayOutputStream to buffer the contents in an array then putting them into a ByteArrayInputStream you could use Java's piping mechanism.
Here's a sketch solution:
// (In addition to the Play imports above, this needs java.nio.file.Files and scala.annotation.tailrec.)
def zip() = Action {
  // Create a Source that listens to an OutputStream
  // and pass it to the `fileByteData` method.
  val zipSource: Source[ByteString, Unit] =
    StreamConverters
      .asOutputStream()
      .mapMaterializedValue(fileByteData)
  Ok.chunked(zipSource).withHeaders(
    CONTENT_TYPE -> "application/zip",
    CONTENT_DISPOSITION -> s"attachment; filename = test.zip")
}

// Send the file data, given an OutputStream to write to.
def fileByteData(os: OutputStream): Unit = {
  val fileList = List(
    new java.io.File("C:\\Users\\a.csv"),
    new java.io.File("C:\\Users\\b.csv")
  )
  val zos = new ZipOutputStream(os)
  val buffer: Array[Byte] = new Array[Byte](2048)
  try {
    for (file <- fileList) {
      zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
      val fis = Files.newInputStream(file.toPath)
      try {
        @tailrec
        def zipFile(): Unit = {
          val bytesRead = fis.read(buffer)
          if (bytesRead == -1) () else {
            zos.write(buffer, 0, bytesRead)
            zipFile()
          }
        }
        zipFile()
      } finally fis.close()
      zos.closeEntry()
    }
  } finally {
    zos.close()
  }
}
This is just an outline of an approach. You'll also want to make sure:
- the threading is OK: fileByteData should run on a different thread to the sending thread (see the sketch after this list)
- the error handling is OK: for example, all streams are closed properly if there's an error on either the server side (e.g. file not found) or the client side (early disconnect)
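As a hedged illustration of both points, the blocking zip-writing could be pushed onto a dedicated execution context and the stream closed even on failure. The name blockingEc below is an assumed ExecutionContext configured for blocking IO, and zipOffThread is just an illustrative action name; this is a sketch, not part of the original answer:

// Sketch only: run fileByteData off the request thread and always close the stream.
import scala.concurrent.Future
import scala.util.control.NonFatal

def zipOffThread() = Action {
  val zipSource: Source[ByteString, Unit] =
    StreamConverters
      .asOutputStream()
      .mapMaterializedValue { os =>
        Future {
          try fileByteData(os)
          catch { case NonFatal(e) => () /* log the error; the stream will fail */ }
          finally os.close()
        }(blockingEc)
        () // keep the Unit materialized value used in the sketch above
      }
  Ok.chunked(zipSource).withHeaders(
    CONTENT_TYPE -> "application/zip",
    CONTENT_DISPOSITION -> "attachment; filename = test.zip")
}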

How to use Akka HTTP to generate contents via an output stream

I'm quite new to Akka Streams and Akka HTTP.
I'd like to generate a simple HTTP server that can generate a zip file from the contents of a folder and send it to the client.
The org.zeroturnaround.zip.ZipUtil makes the task of creating a zip file very easy, but it needs an outputStream.
Here is my solution (written in Scala language):
val os = new ByteArrayOutputStream()
ZipUtil.pack(myFolder, os)
HttpResponse(entity = HttpEntity(
MediaTypes.`application/zip`,
os.toByteArray))
This solution works, but keeps all the contents in memory, so it isn't scalable.
I think the key for solving this is to use this:
val source = StreamConverters.asOutputStream()
but don't know how to use it. :-(
Any help please?
Try this
val byteSource: Source[ByteString, Unit] = StreamConverters.asOutputStream()
.mapMaterializedValue(os => ZipUtil.pack(myFolder, os))
HttpResponse(entity = HttpEntity(
MediaTypes.`application/zip`,
byteSource))
You only get access to the OutputStream once the source gets materialized,
which might not happen immediately. In theory the source could also be materialized multiple times, so you should be able to deal with this.
I had the same problem. In order to make it backpressure-compatible, I had to write an artificial InputStream, which is later converted to a Source via StreamConverters.fromInputStream(() => input), which in turn you return from your Akka HTTP DSL complete directive.
Here is what I wrote.
import java.io.{File, IOException, InputStream}
import java.nio.charset.StandardCharsets
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.commons.compress.archivers.sevenz.{SevenZArchiveEntry, SevenZFile}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

class DownloadStatsZipReader(path: String, password: String) extends InputStream {

  private val (archive, targetDate) = {
    val inputFile = new SevenZFile(new File(path), password.getBytes(StandardCharsets.UTF_16LE.displayName()))

    @tailrec
    def findValidEntry(): Option[(LocalDate, SevenZArchiveEntry)] =
      Option(inputFile.getNextEntry) match {
        case Some(entry) =>
          if (!entry.isDirectory) {
            val parts = entry.getName.toLowerCase.split("\\.(?=[^\\.]+$)")
            if (parts(1) == "tab" && entry.getSize > 0)
              Try(LocalDate.parse(parts(0), DateTimeFormatter.ISO_LOCAL_DATE)) match {
                case Success(localDate) =>
                  Some(localDate -> entry)
                case Failure(_) =>
                  findValidEntry()
              }
            else
              findValidEntry()
          } else
            findValidEntry()
        case None => None
      }

    val (date, _) = findValidEntry().getOrElse {
      throw new RuntimeException(s"$path has no files named as `YYYY-MM-DD.tab`")
    }
    inputFile -> date
  }

  private val buffer = new Array[Byte](1024)
  private var offsetBuffer: Int = 0
  private var sizeBuffer: Int = 0

  def getTargetDate: LocalDate = targetDate

  override def read(): Int =
    sizeBuffer match {
      case -1 =>
        -1
      case 0 =>
        loadNextChunk()
        read()
      case _ =>
        if (offsetBuffer < sizeBuffer) {
          // Mask to 0..255, as required by the InputStream.read() contract.
          val result = buffer(offsetBuffer) & 0xFF
          offsetBuffer += 1
          result
        } else {
          sizeBuffer = 0
          read()
        }
    }

  @throws[IOException]
  override def close(): Unit = {
    archive.close()
  }

  private def loadNextChunk(): Unit = try {
    val bytesRead = archive.read(buffer)
    if (bytesRead >= 0) {
      offsetBuffer = 0
      sizeBuffer = bytesRead
    } else {
      offsetBuffer = -1
      sizeBuffer = -1
    }
  } catch {
    case ex: Throwable =>
      ex.printStackTrace()
      throw ex
  }
}
If you find bugs in my code please let me know.
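For completeness, here is a hedged sketch of how such an InputStream might be exposed through an Akka HTTP route via StreamConverters.fromInputStream, as described above. The route name, archive path and password are illustrative assumptions, not part of the original answer:

// Sketch only: stream the custom InputStream to the client with backpressure.
import akka.http.scaladsl.model.{ContentTypes, HttpEntity, HttpResponse}
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.StreamConverters

val route =
  path("download-stats") {
    get {
      complete(
        HttpResponse(entity = HttpEntity(
          ContentTypes.`application/octet-stream`,
          // fromInputStream pulls from the reader on demand, so nothing is buffered in full.
          StreamConverters.fromInputStream(() => new DownloadStatsZipReader("/data/stats.7z", "secret"))))
      )
    }
  }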

What happens when you do java data manipulations in Spark outside of an RDD

I am reading a CSV file from HDFS using Spark. It's going into an FSDataInputStream object. I can't use the textFile() method because it splits up the CSV file by line feed, and I am reading a CSV file that has line feeds inside the text fields. OpenCSV from SourceForge handles line feeds inside the cells; it's a nice project, but it accepts a Reader as an input. I need to convert it to a string so that I can pass it to OpenCSV as a StringReader. So: HDFS file -> FSDataInputStream -> String -> StringReader -> an OpenCSV list of strings. Below is the code...
import java.io._
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import com.opencsv._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.lang.StringBuilder

val conf = new Configuration()
val hdfsCoreSitePath = new Path("core-site.xml")
val hdfsHDFSSitePath = new Path("hdfs-site.xml")
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val fileSystem = FileSystem.get(conf)

val csvPath = new Path("/raw_data/project_name/csv/file_name.csv")
val csvFile = fileSystem.open(csvPath)
val fileLen = fileSystem.getFileStatus(csvPath).getLen().toInt

var b = Array.fill[Byte](2048)(0)
var j = 1
val stringBuilder = new StringBuilder()
var bufferString = ""

csvFile.seek(0)
csvFile.read(b)
bufferString = new String(b, "UTF-8")
stringBuilder.append(bufferString)

while (j != -1) {
  b = Array.fill[Byte](2048)(0)
  j = csvFile.read(b)
  bufferString = new String(b, "UTF-8")
  stringBuilder.append(bufferString)
}

// Trim the trailing garbage from the last partially filled buffer.
val stringBuilderClean = stringBuilder.substring(0, fileLen)

val reader: Reader = new StringReader(stringBuilderClean.toString()).asInstanceOf[Reader]
val csv = new CSVReader(reader)
val javaContext = new JavaSparkContext(sc)
val sqlContext = new SQLContext(sc)
val javaRDD = javaContext.parallelize(csv.readAll())
//do a bunch of transformations on the RDD
It works, but I doubt it is scalable. It makes me wonder how big of a limitation it is to have a driver program which pipes in all the data through one JVM. My questions to anyone very familiar with Spark are:
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated like any other program, swapping out like crazy, I guess?
How would you then make any spark program scalable? Do you always NEED to extract the data directly into an input RDD?
Your code loads the data into memory, and then the Spark driver will split and send each part of the data to the executors; of course, it is not scalable.
There are two ways to resolve your question.
Write a custom InputFormat to support the CSV file format:
import java.io.{InputStreamReader, IOException}
import com.google.common.base.Charsets
import com.opencsv.{CSVParser, CSVReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Seekable, Path, FileSystem}
import org.apache.hadoop.io.compress._
import org.apache.hadoop.io.{ArrayWritable, Text, LongWritable}
import org.apache.hadoop.mapred._

class CSVInputFormat extends FileInputFormat[LongWritable, ArrayWritable] with JobConfigurable {
  private var compressionCodecs: CompressionCodecFactory = _

  def configure(conf: JobConf) {
    compressionCodecs = new CompressionCodecFactory(conf)
  }

  protected override def isSplitable(fs: FileSystem, file: Path): Boolean = {
    val codec: CompressionCodec = compressionCodecs.getCodec(file)
    if (null == codec) {
      return true
    }
    codec.isInstanceOf[SplittableCompressionCodec]
  }

  @throws(classOf[IOException])
  def getRecordReader(genericSplit: InputSplit, job: JobConf, reporter: Reporter): RecordReader[LongWritable, ArrayWritable] = {
    reporter.setStatus(genericSplit.toString)
    val delimiter: String = job.get("textinputformat.record.delimiter")
    var recordDelimiterBytes: Array[Byte] = null
    if (null != delimiter) {
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8)
    }
    new CsvLineRecordReader(job, genericSplit.asInstanceOf[FileSplit], recordDelimiterBytes)
  }
}

class CsvLineRecordReader(job: Configuration, split: FileSplit, recordDelimiter: Array[Byte])
  extends RecordReader[LongWritable, ArrayWritable] {

  private val compressionCodecs = new CompressionCodecFactory(job)
  private val maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
    LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE)
  private var filePosition: Seekable = _
  private val file = split.getPath
  private val codec = compressionCodecs.getCodec(file)
  private val isCompressedInput = codec != null
  private val fs = file.getFileSystem(job)
  private val fileIn = fs.open(file)
  private var start = split.getStart
  private var pos: Long = 0L
  private var end = start + split.getLength
  private var reader: CSVReader = _
  private var decompressor: Decompressor = _

  private lazy val CSVSeparator =
    if (recordDelimiter == null)
      CSVParser.DEFAULT_SEPARATOR
    else
      recordDelimiter(0).asInstanceOf[Char]

  if (isCompressedInput) {
    decompressor = CodecPool.getDecompressor(codec)
    if (codec.isInstanceOf[SplittableCompressionCodec]) {
      val cIn = (codec.asInstanceOf[SplittableCompressionCodec])
        .createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK)
      reader = new CSVReader(new InputStreamReader(cIn), CSVSeparator)
      start = cIn.getAdjustedStart
      end = cIn.getAdjustedEnd
      filePosition = cIn
    } else {
      reader = new CSVReader(new InputStreamReader(codec.createInputStream(fileIn, decompressor)), CSVSeparator)
      filePosition = fileIn
    }
  } else {
    fileIn.seek(start)
    reader = new CSVReader(new InputStreamReader(fileIn), CSVSeparator)
    filePosition = fileIn
  }

  @throws(classOf[IOException])
  private def getFilePosition: Long = {
    if (isCompressedInput && null != filePosition) {
      filePosition.getPos
    } else
      pos
  }

  private def nextLine: Option[Array[String]] = {
    if (getFilePosition < end) {
      // readNext automatically splits the line into elements
      reader.readNext() match {
        case null => None
        case elems => Some(elems)
      }
    } else
      None
  }

  override def next(key: LongWritable, value: ArrayWritable): Boolean =
    nextLine
      .exists { elems =>
        key.set(pos)
        val lineLength = elems.foldRight(0)((a, b) => a.length + 1 + b)
        pos += lineLength
        value.set(elems.map(new Text(_)))
        if (lineLength < maxLineLength) true else false
      }

  @throws(classOf[IOException])
  def getProgress: Float =
    if (start == end)
      0.0f
    else
      Math.min(1.0f, (getFilePosition - start) / (end - start).toFloat)

  override def getPos: Long = pos

  override def createKey(): LongWritable = new LongWritable

  override def close(): Unit = {
    try {
      if (reader != null) {
        reader.close()
      }
    } finally {
      if (decompressor != null) {
        CodecPool.returnDecompressor(decompressor)
      }
    }
  }

  override def createValue(): ArrayWritable = new ArrayWritable(classOf[Text])
}
Simple test example:
val arrayRdd = sc.hadoopFile("source path", classOf[CSVInputFormat], classOf[LongWritable], classOf[ArrayWritable],
sc.defaultMinPartitions).map(_._2.get().map(_.toString))
arrayRdd.collect().foreach(e => println(e.mkString(",")))
The other way, which I prefer, is to use spark-csv written by Databricks, which has good support for the CSV file format; you can find usage examples on its GitHub page.
Update for spark-csv: using univocity as the parserLib, which can handle multi-line cells:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("parserLib", "univocity")
.option("inferSchema", "true") // Automatically infer data types
.load("source path")
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated like any other program, swapping out like crazy, I guess?
You load the whole dataset into local memory. So if you have the memory, it works.
How would you then make any spark program scalable?
You have to select a data format that Spark can load, or change your application so that it can load its data format into Spark directly, or a bit of both.
In this case you could look at creating a custom InputFormat that splits on something other than newlines. I think you would also want to look at how you write your data, so that it is partitioned in HDFS at record boundaries, not newlines.
However, I suspect the simplest answer is to encode the data differently: JSON Lines, escaping the newlines in the CSV file during the write, Avro, or anything else that fits better with Spark and HDFS.
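For instance, if the data were re-encoded as JSON Lines during the write, Spark could load it directly with no custom input format. A minimal sketch, with an illustrative path:

// Sketch only: JSON Lines keeps one record per physical line, so the
// embedded-newline problem disappears entirely.
val df = sqlContext.read.json("hdfs:///raw_data/project_name/json/file_name.jsonl")
df.printSchema()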

Creating serializable objects from Scala source code at runtime

To embed Scala as a "scripting language", I need to be able to compile text fragments to simple objects, such as Function0[Unit] that can be serialised to and deserialised from disk and which can be loaded into the current runtime and executed.
How would I go about this?
Say for example, my text fragment is (purely hypothetical):
Document.current.elements.headOption.foreach(_.open())
This might be wrapped into the following complete text:
package myapp.userscripts
import myapp.DSL._
object UserFunction1234 extends Function0[Unit] {
def apply(): Unit = {
Document.current.elements.headOption.foreach(_.open())
}
}
What comes next? Should I use IMain to compile this code? I don't want to use the normal interpreter mode, because the compilation should be "context-free" and not accumulate requests.
What I need to get hold of from the compilation is, I guess, the binary class file? In that case, serialisation is straightforward (a byte array). How would I then load that class into the runtime and invoke the apply method?
What happens if the code compiles to multiple auxiliary classes? The example above contains a closure _.open(). How do I make sure I "package" all those auxiliary things into one object to serialize and class-load?
Note: Given that Scala 2.11 is imminent and the compiler API probably changed, I am happy to receive hints as how to approach this problem on Scala 2.11
Here is one idea: use a regular Scala compiler instance. Unfortunately it seems to require the use of hard disk files both for input and output. So we use temporary files for that. The output will be zipped up in a JAR which will be stored as a byte array (that would go into the hypothetical serialization process). We need a special class loader to retrieve the class again from the extracted JAR.
The following assumes Scala 2.10.3 with the scala-compiler library on the class path:
import scala.tools.nsc
import java.io._
import scala.annotation.tailrec
Wrapping user provided code in a function class with a synthetic name that will be incremented for each new fragment:
val packageName = "myapp"

var userCount = 0

def mkFunName(): String = {
  val c = userCount
  userCount += 1
  s"Fun$c"
}

def wrapSource(source: String): (String, String) = {
  val fun = mkFunName()
  val code = s"""package $packageName
                |
                |class $fun extends Function0[Unit] {
                |  def apply(): Unit = {
                |    $source
                |  }
                |}
                |""".stripMargin
  (fun, code)
}
A function to compile a source fragment and return the byte array of the resulting jar:
/** Compiles a source code consisting of a body which is wrapped in a `Function0`
  * apply method, and returns the function's class name (without package) and the
  * raw jar file produced in the compilation.
  */
def compile(source: String): (String, Array[Byte]) = {
  val set = new nsc.Settings
  val d = File.createTempFile("temp", ".out")
  d.delete(); d.mkdir()
  set.d.value = d.getPath
  set.usejavacp.value = true
  val compiler = new nsc.Global(set)
  val f = File.createTempFile("temp", ".scala")
  val out = new BufferedOutputStream(new FileOutputStream(f))
  val (fun, code) = wrapSource(source)
  out.write(code.getBytes("UTF-8"))
  out.flush(); out.close()
  val run = new compiler.Run()
  run.compile(List(f.getPath))
  f.delete()
  val bytes = packJar(d)
  deleteDir(d)
  (fun, bytes)
}

def deleteDir(base: File): Unit = {
  base.listFiles().foreach { f =>
    if (f.isFile) f.delete()
    else deleteDir(f)
  }
  base.delete()
}
Note: Doesn't handle compiler errors yet!
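One way to surface those errors would be to attach a StoreReporter to the compiler and inspect it after the run. A hedged sketch against the same compiler API; the error message format is illustrative:

// Sketch only: collect compiler messages so a failed compilation can be reported
// instead of silently producing an empty jar.
import scala.tools.nsc.reporters.StoreReporter

val reporter = new StoreReporter
val compiler = new nsc.Global(set, reporter)   // instead of `new nsc.Global(set)`
// ... run the compilation as in `compile` above ...
if (reporter.hasErrors) {
  val messages = reporter.infos.collect {
    case info if info.severity == reporter.ERROR => s"${info.pos}: ${info.msg}"
  }
  sys.error(s"Compilation failed:\n${messages.mkString("\n")}")
}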
The packJar method uses the compiler output directory and produces an in-memory jar file from it:
// cf. http://stackoverflow.com/questions/1281229
def packJar(base: File): Array[Byte] = {
  import java.util.jar._
  val mf = new Manifest
  mf.getMainAttributes.put(Attributes.Name.MANIFEST_VERSION, "1.0")
  val bs = new java.io.ByteArrayOutputStream
  val out = new JarOutputStream(bs, mf)

  def add(prefix: String, f: File): Unit = {
    val name0 = prefix + f.getName
    val name = if (f.isDirectory) name0 + "/" else name0
    val entry = new JarEntry(name)
    entry.setTime(f.lastModified())
    out.putNextEntry(entry)
    if (f.isFile) {
      val in = new BufferedInputStream(new FileInputStream(f))
      try {
        val buf = new Array[Byte](1024)
        @tailrec def loop(): Unit = {
          val count = in.read(buf)
          if (count >= 0) {
            out.write(buf, 0, count)
            loop()
          }
        }
        loop()
      } finally {
        in.close()
      }
    }
    out.closeEntry()
    if (f.isDirectory) f.listFiles.foreach(add(name, _))
  }

  base.listFiles().foreach(add("", _))
  out.close()
  bs.toByteArray
}
A utility function that takes the byte array found in deserialization and creates a map from class names to class byte code:
def unpackJar(bytes: Array[Byte]): Map[String, Array[Byte]] = {
  import java.util.jar._
  import scala.annotation.tailrec
  val in = new JarInputStream(new ByteArrayInputStream(bytes))
  val b = Map.newBuilder[String, Array[Byte]]

  @tailrec def loop(): Unit = {
    val entry = in.getNextJarEntry
    if (entry != null) {
      if (!entry.isDirectory) {
        val name = entry.getName
        val bs = new ByteArrayOutputStream
        var i = 0
        while (i >= 0) {
          i = in.read()
          if (i >= 0) bs.write(i)
        }
        val bytes = bs.toByteArray
        b += mkClassName(name) -> bytes
      }
      loop()
    }
  }
  loop()
  in.close()
  b.result()
}

def mkClassName(path: String): String = {
  require(path.endsWith(".class"))
  path.substring(0, path.length - 6).replace("/", ".")
}
A suitable class loader:
class MemoryClassLoader(map: Map[String, Array[Byte]]) extends ClassLoader {
  override protected def findClass(name: String): Class[_] =
    map.get(name).map { bytes =>
      println(s"defineClass($name, ...)")
      defineClass(name, bytes, 0, bytes.length)
    }.getOrElse(super.findClass(name)) // throws exception
}
And a test case which contains additional classes (closures):
val exampleSource =
  """val xs = List("hello", "world")
    |println(xs.map(_.capitalize).mkString(" "))
    |""".stripMargin

def test(fun: String, cl: ClassLoader): Unit = {
  val clName = s"$packageName.$fun"
  println(s"Resolving class '$clName'...")
  val clazz = Class.forName(clName, true, cl)
  println("Instantiating...")
  val x = clazz.newInstance().asInstanceOf[() => Unit]
  println("Invoking 'apply':")
  x()
}

locally {
  println("Compiling...")
  val (fun, bytes) = compile(exampleSource)
  val map = unpackJar(bytes)
  println("Classes found:")
  map.keys.foreach(k => println(s"  '$k'"))
  val cl = new MemoryClassLoader(map)
  test(fun, cl) // should call `defineClass`
  test(fun, cl) // should find cached class
}

How do I archive multiple files into a .zip file using scala?

Could anyone post a simple snippet that does this?
The files are text files, so compression would be nice rather than just archiving them.
I have the filenames stored in an iterable.
There's not currently any way to do this kind of thing from the standard Scala library, but it's pretty easy to use java.util.zip:
def zip(out: String, files: Iterable[String]) = {
  import java.io.{ BufferedInputStream, FileInputStream, FileOutputStream }
  import java.util.zip.{ ZipEntry, ZipOutputStream }

  val zip = new ZipOutputStream(new FileOutputStream(out))

  files.foreach { name =>
    zip.putNextEntry(new ZipEntry(name))
    val in = new BufferedInputStream(new FileInputStream(name))
    var b = in.read()
    while (b > -1) {
      zip.write(b)
      b = in.read()
    }
    in.close()
    zip.closeEntry()
  }
  zip.close()
}
I'm focusing on simplicity instead of efficiency here (no error checking and reading and writing one byte at a time isn't ideal), but it works, and can very easily be improved.
I recently had to work with zip files too and found this very nice utility: https://github.com/zeroturnaround/zt-zip
Here's an example of zipping all files inside a directory:
import org.zeroturnaround.zip.ZipUtil
ZipUtil.pack(new File("/tmp/demo"), new File("/tmp/demo.zip"))
Very convenient.
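Since the question has the filenames in an iterable rather than in a single directory, zt-zip also appears to provide packEntries for an explicit list of files. A hedged sketch (check the exact signature against your zt-zip version); the paths are illustrative:

// Sketch only: pack an explicit list of files instead of a whole directory.
import java.io.File
import org.zeroturnaround.zip.ZipUtil

val fileNames: Iterable[String] = List("/tmp/a.txt", "/tmp/b.txt")
ZipUtil.packEntries(fileNames.map(new File(_)).toArray, new File("/tmp/out.zip"))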
This is a little bit more Scala-style, in case you prefer a functional approach:
def compress(zipFilepath: String, files: List[File]) {
  def readByte(bufferedReader: BufferedReader): Stream[Int] = {
    bufferedReader.read() #:: readByte(bufferedReader)
  }

  val zip = new ZipOutputStream(new FileOutputStream(zipFilepath))
  try {
    for (file <- files) {
      //add zip entry to output stream
      zip.putNextEntry(new ZipEntry(file.getName))
      val in = Source.fromFile(file.getCanonicalPath).bufferedReader()
      try {
        readByte(in).takeWhile(_ > -1).toList.foreach(zip.write(_))
      } finally {
        in.close()
      }
      zip.closeEntry()
    }
  } finally {
    zip.close()
  }
}
and don't forget the imports:
import java.io.{BufferedReader, FileOutputStream, File}
import java.util.zip.{ZipEntry, ZipOutputStream}
import io.Source
Travis's answer is correct, but I have tweaked it a little to get a faster version of his code:
val Buffer = 2 * 1024

def zip(out: String, files: Iterable[String], retainPathInfo: Boolean = true) = {
  var data = new Array[Byte](Buffer)
  val zip = new ZipOutputStream(new FileOutputStream(out))
  files.foreach { name =>
    if (!retainPathInfo)
      zip.putNextEntry(new ZipEntry(name.splitAt(name.lastIndexOf(File.separatorChar) + 1)._2))
    else
      zip.putNextEntry(new ZipEntry(name))
    val in = new BufferedInputStream(new FileInputStream(name), Buffer)
    var b = in.read(data, 0, Buffer)
    while (b != -1) {
      zip.write(data, 0, b)
      b = in.read(data, 0, Buffer)
    }
    in.close()
    zip.closeEntry()
  }
  zip.close()
}
A bit modified (shorter) version using NIO2:
private def zip(out: Path, files: Iterable[Path]) = {
  val zip = new ZipOutputStream(Files.newOutputStream(out))
  files.foreach { file =>
    zip.putNextEntry(new ZipEntry(file.toString))
    Files.copy(file, zip)
    zip.closeEntry()
  }
  zip.close()
}
As suggested by Gabriele Petronella, you additionally need to add the Maven dependency below to your pom.xml, as well as the following imports:
import org.zeroturnaround.zip.ZipUtil
import java.io.File
<dependency>
    <groupId>org.zeroturnaround</groupId>
    <artifactId>zt-zip</artifactId>
    <version>1.13</version>
    <type>jar</type>
</dependency>
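For an sbt build, the equivalent coordinates would presumably be:

// sbt equivalent of the Maven dependency above (same coordinates).
libraryDependencies += "org.zeroturnaround" % "zt-zip" % "1.13"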