How to write/create zip files on HDFS using Spark/Scala? - scala

I need to write a Spark/Scala function in Apache Zeppelin that simply puts some files that are already present in an HDFS folder into a zip or gzip archive (or some common archive format that is easy to extract in Windows) in the same folder. How would I do this please? Would it be a Java call? I see there's something called ZipOutputStream, is that the right approach? Any tips appreciated.
Thanks

Spark does not support reading from or writing to zip archives directly, so using ZipOutputStream is essentially the only approach.
Here's the code I used to compress my existing data via Spark. It recursively lists the input directory for files and then compresses each one. This code does not preserve the directory structure, but it keeps the file names.
Input directory:
unzipped/
├── part-00001
├── part-00002
└── part-00003
0 directories, 3 files
Output directory:
zipped/
├── part-00001.zip
├── part-00002.zip
└── part-00003.zip
0 directories, 3 files
ZipPacker.scala:
package com.haodemon.spark.compression

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

import java.io.FileOutputStream
import java.util.zip.{ZipEntry, ZipOutputStream}

object ZipPacker extends Serializable {

  private def getSparkContext: SparkContext = {
    val conf: SparkConf = new SparkConf()
      .setAppName("local")
      .setMaster("local[*]")
    SparkSession.builder().config(conf).getOrCreate().sparkContext
  }

  // recursively list files in a filesystem
  private def listFiles(fs: FileSystem, path: Path): List[Path] = {
    fs.listStatus(path).flatMap { p =>
      if (p.isDirectory) listFiles(fs, p.getPath)
      else List(p.getPath)
    }.toList
  }

  // zip-compress files one by one in parallel
  private def zip(inputPath: Path, outputDirectory: Path): Unit = {
    val outputPath = {
      val name = inputPath.getName + ".zip"
      outputDirectory + "/" + name
    }
    println(s"Zipping to $outputPath")
    val zipStream = {
      // note: FileOutputStream writes to the local filesystem of the machine
      // running the task; to write the archive to HDFS, open the stream with
      // FileSystem.create(...) instead
      val out = new FileOutputStream(outputPath)
      val zip = new ZipOutputStream(out)
      val entry = new ZipEntry(inputPath.getName)
      zip.putNextEntry(entry)
      // max compression
      zip.setLevel(9)
      zip
    }
    val conf = new Configuration
    val uncompressedStream = inputPath.getFileSystem(conf).open(inputPath)
    // close = true closes both streams when the copy finishes
    val close = true
    IOUtils.copyBytes(uncompressedStream, zipStream, conf, close)
  }

  def main(args: Array[String]): Unit = {
    val input = new Path(args(0))
    println(s"Using input path $input")
    val sc = getSparkContext
    val uncompressedFiles = {
      val conf = sc.hadoopConfiguration
      val fs = input.getFileSystem(conf)
      listFiles(fs, input)
    }
    val rdd = sc.parallelize(uncompressedFiles)
    val output = new Path(args(1))
    println(s"Using output path $output")
    rdd.foreach(unzipped => zip(unzipped, output))
  }
}
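Since the question specifically asks for the archive to end up on HDFS, here is a minimal sketch of the same zip step that writes through the Hadoop FileSystem API instead of a local FileOutputStream. The method name zipToHdfs is mine, not part of the code above:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.util.zip.{ZipEntry, ZipOutputStream}

// Sketch: write a zip archive directly to HDFS by wrapping fs.create(...)
// in a ZipOutputStream, rather than using a local FileOutputStream.
def zipToHdfs(inputPath: Path, outputDirectory: Path): Unit = {
  val conf = new Configuration
  val fs = inputPath.getFileSystem(conf)
  val outputPath = new Path(outputDirectory, inputPath.getName + ".zip")
  // fs.create opens an output stream on whatever filesystem the path resolves to
  val zip = new ZipOutputStream(fs.create(outputPath))
  zip.putNextEntry(new ZipEntry(inputPath.getName))
  zip.setLevel(9)
  val in = fs.open(inputPath)
  // close = true closes both the input and the zip stream when done
  IOUtils.copyBytes(in, zip, conf, true)
}
```

Closing the ZipOutputStream (which copyBytes does here) also finishes the archive's central directory, so no explicit closeEntry call is needed for the single-entry case.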

Related

How to delete all files from hdfs directory with scala

For a project I am currently working on with Scala and Spark, I have to write code that checks whether the HDFS directory I am working on is empty, and if it is not, removes every file from the directory.
Before I deploy my code to Azure, I am testing it with a local directory on my computer.
I am starting by writing a method to delete every file from this directory. This is what I have for now:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object DirectoryCleaner {

  val spark: SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val srcPath = new Path("C:\\Users\\myuser\\Desktop\\test_dir\\file1.csv")

  def deleFilesDir(): Unit = {
    if (fs.exists(srcPath) && fs.isFile(srcPath))
      fs.delete(srcPath, true)
  }
}
With this code, I am able to delete a single file (file1.csv). I would like to be able to define my path as val srcPath=new Path("C:\\Users\\myuser\\Desktop\\test_dir") (without specifying any filename) and just delete every file from the test_dir directory. Any idea how I could do that?
Thanks for helping
Use fs.listFiles to get all the files in a directory and then loop through them, deleting each one. Also, set the recursive flag to false so you don't recurse into subdirectories.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def deleteAllFiles(directoryPath: String, fs: FileSystem): Unit = {
  val path = new Path(directoryPath)
  // get all files in the directory (non-recursively)
  val files = fs.listFiles(path, false)
  // delete each file
  while (files.hasNext) {
    val file = files.next()
    fs.delete(file.getPath, false)
  }
}
// Example for local, non HDFS path
val directoryPath = "file:///Users/m_vemuri/project"
val fs = FileSystem.get(new Configuration())
deleteAllFiles(directoryPath, fs)
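If subdirectories should go too, a common alternative is to delete the directory recursively and recreate it empty. A minimal sketch using only the Hadoop FileSystem API (the helper name emptyDirectory is mine):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: empty a directory by removing it recursively, then recreating it.
def emptyDirectory(fs: FileSystem, dir: Path): Unit = {
  if (fs.exists(dir)) {
    // recursive = true removes the directory and everything under it
    fs.delete(dir, true)
  }
  // recreate the now-empty directory
  fs.mkdirs(dir)
}
```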

Spark-Scala read multiple files and move to other directory

I have multiple CSV files in HDFS, and some of them are not in a good format. I would like to read the directory of CSV files and, if a file reads successfully, move it to another directory. How can I achieve this using Spark/Scala?
You need something like this:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)

val srcPath = "dbfs:/src/"
val dest = "dbfs:/dest/"

val ls = fs.listStatus(new Path(srcPath))
ls.foreach { p =>
  // replace `true` with your own format-validation check
  if (true) spark.read.csv(p.getPath.toString).write.csv(dest + p.getPath.getName)
  else println(s"File ${p.getPath.getName} got wrong format")
}
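Note that the read-then-write approach copies the data rather than moving the original files. A sketch of a true move using fs.rename after a validation step; isWellFormed is a placeholder for whatever format check you need, not an existing function:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: validate each file, then move the original with fs.rename.
// `isWellFormed` is a hypothetical callback supplied by the caller.
def moveValidFiles(fs: FileSystem, src: Path, dest: Path,
                   isWellFormed: Path => Boolean): Unit = {
  fs.listStatus(src).filter(_.isFile).foreach { status =>
    val file = status.getPath
    if (isWellFormed(file)) {
      // rename is a metadata operation: it moves the file without copying data
      fs.rename(file, new Path(dest, file.getName))
    }
  }
}
```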

Rename and Move S3 files based on their folders name in spark scala

I have Spark output in an S3 folder, and I want to move all S3 files from that output folder to another location, but while moving I want to rename the files.
For example, I have files in S3 folders like below.
Now I want to rename all the files and put them into another directory, with file names like below:
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.1.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.2.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.3.2017-10-18-0439.Full.txt
Here Fundamental.FinancialStatement is constant in all the files, and 2017-10-18-0439 is the current date-time.
This is what I have tried so far, but I am not able to get the folder name and loop through all the files:
import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/Segments/output")
val dest = new Path("s3://trfsmallfffile/Segments/Finaloutput")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)

//val file = fs.globStatus(new Path("src/DataPartition=Japan/part*.gz"))(0).getPath.getName
//println(file)

val status = fs.listStatus(src)
status.foreach { filename =>
  val a = filename.getPath.getName.toString()
  println("file name" + a)
}
This gives me below output
file nameDataPartition=Japan
file nameDataPartition=SelfSourcedPrivate
file nameDataPartition=SelfSourcedPublic
file name_SUCCESS
This gives me the folder details, not the files inside the folders.
Reference is taken from this Stack Overflow reference.
You are getting directories because you have a subdirectory level in S3.
Use /*/* to go into the subdirectories.
Try this:
import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/Segments/Output/*/*")
val dest = new Path("s3://trfsmallfffile/Segments/FinalOutput")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)

val file = fs.globStatus(new Path("s3://trfsmallfffile/Segments/Output/*/*"))
for (urlStatus <- file) {
  //println("S3 FILE PATH IS ===:" + urlStatus.getPath)
  // extract the partition value from a path segment like DataPartition=Japan
  val partitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)
  val finalPrefix = "Fundamental.FinancialLineItem.Segments."
  val finalFileName = finalPrefix + partitionName + ".txt"
  val destFile = new Path("s3://trfsmallfffile/Segments/FinalOutput" + "/" + finalFileName)
  fs.rename(urlStatus.getPath, destFile)
}
This has worked for me in the past:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val path = "s3://<bucket>/<directory>"
val fs = FileSystem.get(new java.net.URI(path), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(path))
listStatus provides all the files in the S3 directory.

How can one list all csv files in an HDFS location within the Spark Scala shell?

The purpose of this is to manipulate and save a copy of each data file in a second location in HDFS. I will be using
RddName.coalesce(1).saveAsTextFile(pathName)
to save the result to HDFS.
This is why I want to process each file separately, even though I am sure the performance will not be as efficient. However, I have yet to determine how to store the list of CSV file paths in an array of strings and then loop through each one with a separate RDD.
Let us use the following anonymous example as the HDFS source locations:
/data/email/click/date=2015-01-01/sent_20150101.csv
/data/email/click/date=2015-01-02/sent_20150102.csv
/data/email/click/date=2015-01-03/sent_20150103.csv
I know how to list the file paths using Hadoop FS Shell:
hdfs dfs -ls /data/email/click/*/*.csv
I know how to create one RDD for all the data:
val sentRdd = sc.textFile( "/data/email/click/*/*.csv" )
I haven't tested it thoroughly but something like this seems to work:
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator}
import java.net.URI

val path: String = ???

val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val iter = hdfs.listFiles(new Path(path), false)

def listFiles(iter: RemoteIterator[LocatedFileStatus]) = {
  def go(iter: RemoteIterator[LocatedFileStatus], acc: List[URI]): List[URI] = {
    if (iter.hasNext) {
      val uri = iter.next.getPath.toUri
      go(iter, uri :: acc)
    } else {
      acc
    }
  }
  go(iter, List.empty[java.net.URI])
}

listFiles(iter).filter(_.toString.endsWith(".csv"))
This is what ultimately worked for me:
import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI

val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
// source data in HDFS
val sourcePath = new Path("/<source_location>/<filename_pattern>")

hdfs.globStatus(sourcePath).foreach { fileStatus =>
  val filePathName = fileStatus.getPath().toString()
  val fileName = fileStatus.getPath().getName()
  // < DO STUFF HERE>
} // end foreach loop
sc.wholeTextFiles(path) should also help. It gives an RDD of (filePath, fileContent) pairs.
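Putting these pieces together, a minimal sketch of the loop the question describes: list the CSV paths with a glob, then create a separate RDD per file and save a copy. The destination layout under /data/email/copy is my assumption, not from the question:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: enumerate all CSVs under a glob, then process each as its own RDD.
val fs = FileSystem.get(sc.hadoopConfiguration)
val csvPaths = fs.globStatus(new Path("/data/email/click/*/*.csv"))
  .map(_.getPath)

csvPaths.foreach { p =>
  val rdd = sc.textFile(p.toString)
  // coalesce(1) keeps each copy as a single part file, as in the question
  rdd.coalesce(1).saveAsTextFile("/data/email/copy/" + p.getName)
}
```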

Use Spark to list all files in a Hadoop HDFS directory?

I want to loop through all text files in a Hadoop dir and count all the occurrences of the word "error". Is there a way to do a hadoop fs -ls /users/ubuntu/ to list all the files in a dir with the Apache Spark Scala API?
From the first example given, the Spark context seems to only access files individually through something like:
val file = spark.textFile("hdfs://target_load_file.txt")
In my problem, I do not know how many files are in the HDFS folder, nor their names, beforehand. I looked at the SparkContext docs but couldn't find this kind of functionality.
You can use a wildcard:
val errorCount = sc.textFile("hdfs://some-directory/*")
.flatMap(_.split(" ")).filter(_ == "error").count
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.Stack

val fs = FileSystem.get(sc.hadoopConfiguration)
var dirs = Stack[String]()
val files = scala.collection.mutable.ListBuffer.empty[String]
dirs.push("/user/username/")

while (!dirs.isEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach(x => if (x.isDirectory) dirs.push(x.getPath.toString) else
    files += x.getPath.toString)
}

files.foreach(println)
For a local installation (the HDFS default path fs.defaultFS can be found in core-site.xml, e.g. /etc/hadoop/core-site.xml):
For instance,
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://localhost:9000")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val fileStatus = hdfs.listStatus(new Path("hdfs://localhost:9000/foldername/"))
val fileList = fileStatus.map(x => x.getPath.toString)
fileList.foreach(println)