Could anyone post a simple snippet that does this?
The files are plain text, so compression would be nice rather than just archiving the files.
I have the filenames stored in an iterable.
There's not currently any way to do this kind of thing from the standard Scala library, but it's pretty easy to use java.util.zip:
def zip(out: String, files: Iterable[String]) = {
  import java.io.{ BufferedInputStream, FileInputStream, FileOutputStream }
  import java.util.zip.{ ZipEntry, ZipOutputStream }

  val zip = new ZipOutputStream(new FileOutputStream(out))

  files.foreach { name =>
    zip.putNextEntry(new ZipEntry(name))
    val in = new BufferedInputStream(new FileInputStream(name))
    var b = in.read()
    while (b > -1) {
      zip.write(b)
      b = in.read()
    }
    in.close()
    zip.closeEntry()
  }
  zip.close()
}
I'm focusing on simplicity instead of efficiency here (there's no error handling, and reading and writing one byte at a time isn't ideal), but it works and can easily be improved.
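If you are on Scala 2.13 or later, a minimal sketch of the same idea using scala.util.Using (so both streams get closed even if something fails) and a byte buffer might look like this; the buffer size and method name are arbitrary:

import java.io.{ BufferedInputStream, FileInputStream, FileOutputStream }
import java.util.zip.{ ZipEntry, ZipOutputStream }
import scala.util.Using

def zipWithUsing(out: String, files: Iterable[String]): Unit =
  Using.resource(new ZipOutputStream(new FileOutputStream(out))) { zip =>
    files.foreach { name =>
      zip.putNextEntry(new ZipEntry(name))
      Using.resource(new BufferedInputStream(new FileInputStream(name))) { in =>
        val buffer = new Array[Byte](4096)               // arbitrary buffer size
        Iterator
          .continually(in.read(buffer))
          .takeWhile(_ != -1)
          .foreach(read => zip.write(buffer, 0, read))   // copy in chunks instead of byte by byte
      }
      zip.closeEntry()
    }
  }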
I recently had to work with zip files too and found this very nice utility: https://github.com/zeroturnaround/zt-zip
Here's an example of zipping all files inside a directory:
import java.io.File
import org.zeroturnaround.zip.ZipUtil

ZipUtil.pack(new File("/tmp/demo"), new File("/tmp/demo.zip"))
Very convenient.
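Since the original question starts from an iterable of file names rather than a whole directory, zt-zip also has a packEntries helper for that case. A sketch from memory (the paths are placeholders, so double-check the exact signature):

import java.io.File
import org.zeroturnaround.zip.ZipUtil

val fileNames = List("/tmp/a.txt", "/tmp/b.txt") // placeholder paths
ZipUtil.packEntries(fileNames.map(new File(_)).toArray, new File("/tmp/out.zip"))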
This is a little bit more Scala-style, in case you prefer a functional approach:
def compress(zipFilepath: String, files: List[File]): Unit = {
  // Note: this goes through a Reader, i.e. it reads characters, so it is only suitable for plain-text files.
  def readByte(bufferedReader: BufferedReader): Stream[Int] =
    bufferedReader.read() #:: readByte(bufferedReader)

  val zip = new ZipOutputStream(new FileOutputStream(zipFilepath))
  try {
    for (file <- files) {
      // add a zip entry for this file to the output stream
      zip.putNextEntry(new ZipEntry(file.getName))
      val in = Source.fromFile(file.getCanonicalPath).bufferedReader()
      try {
        readByte(in).takeWhile(_ > -1).toList.foreach(zip.write(_))
      } finally {
        in.close()
      }
      zip.closeEntry()
    }
  } finally {
    zip.close()
  }
}
and don't forget the imports:
import java.io.{BufferedReader, File, FileOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.io.Source
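A quick usage sketch with placeholder paths:

compress("/tmp/out.zip", List(new File("/tmp/a.txt"), new File("/tmp/b.txt")))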
Travis's answer is correct, but I have tweaked it a little to get a faster version of his code (it uses the same java.io and java.util.zip imports as above, plus java.io.File):
val Buffer = 2 * 1024

def zip(out: String, files: Iterable[String], retainPathInfo: Boolean = true) = {
  val data = new Array[Byte](Buffer)
  val zip = new ZipOutputStream(new FileOutputStream(out))
  files.foreach { name =>
    if (!retainPathInfo)
      // strip the leading path so only the bare file name ends up in the archive
      zip.putNextEntry(new ZipEntry(name.splitAt(name.lastIndexOf(File.separatorChar) + 1)._2))
    else
      zip.putNextEntry(new ZipEntry(name))
    val in = new BufferedInputStream(new FileInputStream(name), Buffer)
    var b = in.read(data, 0, Buffer)
    while (b != -1) {
      zip.write(data, 0, b)
      b = in.read(data, 0, Buffer)
    }
    in.close()
    zip.closeEntry()
  }
  zip.close()
}
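Usage, as a sketch with placeholder paths; with retainPathInfo = false only the bare file names end up in the archive:

zip("/tmp/out.zip", List("/tmp/dir/a.txt", "/tmp/dir/b.txt"), retainPathInfo = false)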
A slightly modified (shorter) version using NIO.2:
private def zip(out: Path, files: Iterable[Path]) = {
  val zip = new ZipOutputStream(Files.newOutputStream(out))

  files.foreach { file =>
    zip.putNextEntry(new ZipEntry(file.toString))
    Files.copy(file, zip)
    zip.closeEntry()
  }
  zip.close()
}
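For reference, the imports this version relies on, plus a usage sketch with placeholder paths:

import java.nio.file.{Files, Path, Paths}
import java.util.zip.{ZipEntry, ZipOutputStream}

zip(Paths.get("/tmp/out.zip"), List(Paths.get("/tmp/a.txt"), Paths.get("/tmp/b.txt")))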
As suggested by Gabriele Petronella, you also need to add the Maven dependency below to your pom.xml, along with the following imports:
import org.zeroturnaround.zip.ZipUtil
import java.io.File
<dependency>
    <groupId>org.zeroturnaround</groupId>
    <artifactId>zt-zip</artifactId>
    <version>1.13</version>
    <type>jar</type>
</dependency>
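If you are using sbt rather than Maven, the equivalent dependency should be:

libraryDependencies += "org.zeroturnaround" % "zt-zip" % "1.13"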
Related
So I have a list of file locations. I have already downloaded the files to these locations, so they already have content; they are just .txt files.
My problem is that when I zip multiple files with the code below and then unzip the archive in my file manager (Finder on Mac), the result contains a folder hierarchy wrapping the text files. I don't want to see any folders when I unzip it. How can I fix this? Here's my code, by the way.
def zipFiles(fileLocations: List[String], zipOutputFilename: String): Unit = {
  // `managed` comes from the scala-arm library (resource.managed)
  val a =
    for {
      fos <- managed(new FileOutputStream(zipOutputFilename))
      zos <- managed(new ZipOutputStream(fos))
    } yield {
      for {
        fileLoc <- fileLocations
      } {
        val file = new File(fileLoc)
        zos.putNextEntry(new ZipEntry(fileLoc))
        val in = new BufferedInputStream(new FileInputStream(file))
        var b = in.read()
        while (b > -1) {
          zos.write(b)
          b = in.read()
        }
        in.close()
        zos.closeEntry()
      }
      zos.close()
    }
  a.map(identity).tried
  ()
}
Use new ZipEntry(file.getName) instead of new ZipEntry(fileLoc). Storing the full path in the entry name is what creates the folder structure when the archive is unzipped.
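In other words, inside your loop (a minimal sketch of the change; everything else stays as it is):

val file = new File(fileLoc)
// store only the bare file name in the entry, not the full path,
// so no directory structure is created when the archive is extracted
zos.putNextEntry(new ZipEntry(file.getName))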
I have a .Z file stored in Azure ADLS Gen2, and I want to decompress it in ADLS. I tried decompressing it with ADF and C#, but found that .Z is not supported. I also tried using the Apache Commons Compress library, but was unable to read the file into an InputStream.
Does anyone have an idea how to decompress the file using the Apache library in Scala?
.Z files are usually produced by the Unix compress tool (LZW) rather than gzip, but if your files are in fact gzip-compressed despite the extension, you could try this approach:
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream

object UnzipFiles {

  def decompressGzipOrZFiles(file: File, encode: String): BufferedReader = {
    val fis = new FileInputStream(file)
    val gzis = new GZIPInputStream(fis)
    val isr = new InputStreamReader(gzis, encode)
    new BufferedReader(isr)
  }

  def main(args: Array[String]): Unit = {
    val path = new File("/home/cloudera/files/my_file.Z")
    // print to the console
    decompressGzipOrZFiles(path, "UTF-8").lines().toArray.foreach(println)
  }
}
Or you could follow this approach, too:
def uncompressGzip(myFileDotZorGzip: String): Unit = {
  import java.io.FileInputStream
  import java.util.zip.GZIPInputStream

  val gzipInputStream = new GZIPInputStream(new FileInputStream(myFileDotZorGzip))
  try {
    val tam = 128
    val buffer = new Array[Byte](tam)
    var bytesRead = gzipInputStream.read(buffer)
    while (bytesRead != -1) {
      // do something with the data; here we just print it as characters
      buffer.take(bytesRead).foreach(b => print(b.toChar))
      bytesRead = gzipInputStream.read(buffer)
    }
  } finally {
    if (gzipInputStream != null) gzipInputStream.close()
  }
}
I hope this helps.
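Since the question mentions Apache Commons Compress: for genuine Unix-compress .Z files, Commons Compress has a ZCompressorInputStream that handles the LZW format. A minimal sketch, assuming the commons-compress dependency is on the classpath and with a placeholder path:

import java.io.{BufferedInputStream, FileInputStream}
import org.apache.commons.compress.compressors.z.ZCompressorInputStream
import scala.io.Source

val zIn = new ZCompressorInputStream(
  new BufferedInputStream(new FileInputStream("/path/to/my_file.Z"))) // placeholder path
try {
  // read the decompressed bytes as UTF-8 text and print each line
  Source.fromInputStream(zIn, "UTF-8").getLines().foreach(println)
} finally {
  zIn.close()
}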
I want to stream some files and zip them on the fly, so users can download multiple files as a single zipped file without anything being written to the local disk. However, my current implementation holds everything in memory and will not work for large files. Is there any way to fix it?
I was looking at this implementation: https://gist.github.com/kirked/03c7f111de0e9a1f74377bf95d3f0f60, but couldn't figure out how to use it.
import java.io.{BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

import akka.stream.scaladsl.StreamConverters
import org.apache.commons.io.FileUtils
import play.api.mvc.{Action, Controller}

class HomeController extends Controller {

  def single() = Action {
    Ok.sendFile(
      content = new java.io.File("C:\\Users\\a.csv"),
      fileName = _ => "a.csv"
    )
  }

  def zip() = Action {
    Ok.chunked(StreamConverters.fromInputStream(fileByteData)).withHeaders(
      CONTENT_TYPE -> "application/zip",
      CONTENT_DISPOSITION -> s"attachment; filename = test.zip"
    )
  }

  def fileByteData(): ByteArrayInputStream = {
    val fileList = List(
      new java.io.File("C:\\Users\\a.csv"),
      new java.io.File("C:\\Users\\b.csv")
    )
    val baos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(new BufferedOutputStream(baos))
    try {
      fileList.map(file => {
        zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
        zos.write(FileUtils.readFileToByteArray(file))
        zos.closeEntry()
      })
    } finally {
      zos.close()
    }
    new ByteArrayInputStream(baos.toByteArray)
  }
}
Instead of using a ByteArrayOutputStream to buffer the whole archive in an array and then wrapping it in a ByteArrayInputStream, you can stream the bytes as they are produced: Akka's StreamConverters.asOutputStream gives you an OutputStream that feeds the chunked response directly, working much like a pipe.
Here's a sketch solution:
// In addition to the imports in the question, this sketch needs:
//   import java.io.OutputStream
//   import java.nio.file.Files
//   import scala.annotation.tailrec
//   import akka.stream.scaladsl.Source
//   import akka.util.ByteString

def zip() = Action {
  // Create a Source that listens to an OutputStream
  // and pass that OutputStream to the `fileByteData` method.
  val zipSource: Source[ByteString, Unit] =
    StreamConverters
      .asOutputStream()
      .mapMaterializedValue(fileByteData)

  Ok.chunked(zipSource).withHeaders(
    CONTENT_TYPE -> "application/zip",
    CONTENT_DISPOSITION -> s"attachment; filename = test.zip")
}

// Send the file data, given an OutputStream to write to.
def fileByteData(os: OutputStream): Unit = {
  val fileList = List(
    new java.io.File("C:\\Users\\a.csv"),
    new java.io.File("C:\\Users\\b.csv")
  )
  val zos = new ZipOutputStream(os)
  val buffer: Array[Byte] = new Array[Byte](2048)
  try {
    for (file <- fileList) {
      zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
      val fis = Files.newInputStream(file.toPath)
      try {
        @tailrec
        def zipFile(): Unit = {
          val bytesRead = fis.read(buffer)
          if (bytesRead == -1) () else {
            zos.write(buffer, 0, bytesRead)
            zipFile()
          }
        }
        zipFile()
      } finally fis.close()
      zos.closeEntry()
    }
  } finally {
    zos.close()
  }
}
This is just an outline of an approach. You'll also want to make sure:
- the threading is OK: fileByteData should run on a different thread from the one sending the response (see the sketch below)
- the error handling is OK: e.g. all streams are closed properly if there's an error on either the server side (e.g. file not found) or the client side (early disconnect)
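One way to address the threading point, as a sketch: run the writer function in a Future on an ExecutionContext meant for blocking work. The dispatcher name below is only a placeholder; any pool suited to blocking I/O would do.

import java.io.OutputStream
import scala.concurrent.{ExecutionContext, Future}
import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString

// blockingEc is assumed to be backed by a thread pool that is safe for blocking I/O,
// e.g. actorSystem.dispatchers.lookup("blocking-io-dispatcher")
def zipSource(writeTo: OutputStream => Unit)(implicit blockingEc: ExecutionContext): Source[ByteString, _] =
  StreamConverters
    .asOutputStream()
    .mapMaterializedValue(os => Future(writeTo(os))) // run the blocking writes off the materializing thread

You would then call zipSource(fileByteData) and hand the resulting source to Ok.chunked as before.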
I have a zipped file containing multiple text files.
I want to read each of the files and build a list of RDDs containing the content of each file.
val test = sc.textFile("/Volumes/work/data/kaggle/dato/test/5.zip")
will just read the raw archive as text. How do I iterate through each entry of the zip and save its content in an RDD using Spark?
I am fine with Scala or Python.
Possible solution in Python using Spark:
import zipfile

# open the archive and list the entries inside it
archive = zipfile.ZipFile(archive_path, 'r')
file_paths = archive.namelist()
for file_path in file_paths:
    urls = file_path.split("/")
    urlId = urls[-1].split('_')[0]
    # ... process each entry as needed
Apache Spark default compression support
I have written up all the necessary theory in another answer, which you might want to refer to: https://stackoverflow.com/a/45958182/1549135
Read zip containing multiple files
I followed the advice given by @Herman and used ZipInputStream. This gave me the following solution, which returns an RDD[String] of the zip content.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          Stream.continually(zis.getNextEntry)
            .takeWhile {
              case null => zis.close(); false
              case _ => true
            }
            .flatMap { _ =>
              val br = new BufferedReader(new InputStreamReader(zis))
              Stream.continually(br.readLine()).takeWhile(_ != null)
            }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Simply use it by importing the implicit class and calling the readFile method on the SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)
If you are reading binary files use sc.binaryFiles. This will return an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
Here's a working version of @Atais's solution (enhanced to close the streams):
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.toLowerCase.contains("zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap {
          case (zipFilePath, zipContent) ⇒
            val zipInputStream = new ZipInputStream(zipContent.open())
            Stream.continually(zipInputStream.getNextEntry)
              .takeWhile(_ != null)
              .map { _ ⇒
                scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
              } #::: { zipInputStream.close; Stream.empty[String] }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Then all you have to do to read a zip file is the following:
sc.readFile(path)
This only picks up the first line of each entry; can anyone share your insights? I am trying to read a zipped CSV file and create a JavaRDD for further processing.
JavaPairRDD<String, PortableDataStream> zipData =
    sc.binaryFiles("hdfs://temp.zip");

JavaRDD<Record> newRDDRecord = zipData.flatMap(
    new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>() {
      public Iterator<Record> call(Tuple2<String, PortableDataStream> content) throws Exception {
        List<Record> records = new ArrayList<Record>();
        ZipInputStream zin = new ZipInputStream(content._2.open());
        ZipEntry zipEntry;
        while ((zipEntry = zin.getNextEntry()) != null) {
          if (!zipEntry.isDirectory()) {
            Record sd;
            String line;
            InputStreamReader streamReader = new InputStreamReader(zin);
            BufferedReader bufferedReader = new BufferedReader(streamReader);
            // only a single line is read per entry here, which is why just the first line shows up
            line = bufferedReader.readLine();
            String[] fields = new CSVParser().parseLineMulti(line);
            sd = new Record(TimeBuilder.convertStringToTimestamp(fields[0]),
                getDefaultValue(fields[1]),
                getDefaultValue(fields[22]));
            records.add(sd);
          }
        }
        return records.iterator();
      }
    });
Here is another working solution, which also emits the file name; that name can later be split off and used to create separate schemas from it.
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.toLowerCase.contains("zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap {
          case (zipFilePath, zipContent) ⇒
            val zipInputStream = new ZipInputStream(zipContent.open())
            Stream.continually(zipInputStream.getNextEntry)
              .takeWhile(_ != null)
              .map { x ⇒
                val filename1 = x.getName
                scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString(s"~${filename1}\n") + s"~${filename1}"
              } #::: { zipInputStream.close; Stream.empty[String] }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
The full code is here:
https://github.com/kali786516/Spark2StructuredStreaming/blob/master/src/main/scala/com/dataframe/extraDFExamples/SparkReadZipFiles.scala
I was hoping somebody could help; I'm new to Scala and I'm having some issues writing my output to a text file.
I have a data table, and I've written some code that reads it in one line at a time and does what I want it to do; now I need it to write that line out to a text file.
So, for example, I have the following table of data:
Name, Date, goX, goY, stopX, stopY
1, 12/01/01, 1166, 2299, 3300, 4477
My code takes the first character of goX and goY and creates a new number, in this instance 1.2, and does the same for stopX and stopY, so in this case you get 3.4.
What I want to get in the text file is essentially the following:
go, stop
1.2, 3.4
and I want it to go through hundreds of lines doing this, until I have a long list of go and stop values in the text file.
My current code is as follows; this is almost certainly not the most elegant solution, but it is my first ever Scala/Java code:
import scala.io.Source

object FT2 extends App {
  for (line <- Source.fromFile("C://Users//Data.csv").getLines) {
    val array = line.split(",")

    // note: 1|2 is a bitwise OR, i.e. 3, so dropRight(1|2) keeps only the
    // first character of a four-digit value
    val gox = array(2)
    val goX = gox.dropRight(1 | 2)
    val goy = array(3)
    val goY = goy.dropRight(1 | 2)
    val goXY = goX + "." + goY

    val stopx = array(4)
    val stopX = stopx.dropRight(1 | 2)
    val stopy = array(5) // was array(3), which re-read goY instead of stopY
    val stopY = stopy.dropRight(1 | 2)
    val stopXY = stopX + "." + stopY

    val GoStop = List(goXY, stopXY)
    // This is where I want to print GoStop to a text file
  }
}
Any help is much appreciated!
This should do it:
import java.io._

val data = List("everything", "you", "want", "to", "write", "to", "the", "file")
val file = "whatever.txt"

val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
for (x <- data) {
  writer.write(x + "\n") // however you want to format it
}
writer.close()
But you can make it a little nicer by creating a method that will automatically close stuff for you:
def using[T <: Closeable, R](resource: T)(block: T => R): R = {
  try { block(resource) }
  finally { resource.close() }
}

using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))) {
  writer =>
    for (x <- data) {
      writer.write(x + "\n") // however you want to format it
    }
}
So:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.txt")))) {
  writer =>
    for (line <- io.Source.fromFile("input.txt").getLines) {
      writer.write(line + "\n") // however you want to format it
    }
}
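Applied to your FT2 loop (reusing the `using` helper and the java.io._ import from above), a minimal sketch might look like this; the header row, the output path, and taking the first character of each column directly are assumptions on my part, based on your 1.2 / 3.4 example:

using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C://Users//GoStop.txt")))) {
  writer =>
    writer.write("go, stop\n") // header row
    for (line <- io.Source.fromFile("C://Users//Data.csv").getLines.drop(1)) { // drop(1) skips the header line
      val array = line.split(",")
      // take the first character of each column to build the go and stop values
      val goXY = array(2).trim.take(1) + "." + array(3).trim.take(1)
      val stopXY = array(4).trim.take(1) + "." + array(5).trim.take(1)
      writer.write(goXY + ", " + stopXY + "\n")
    }
}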