Deleting Unnecessary files from a directory using spark scala - scala

I want to delete the automatically generated .crc files from a particular directory. Here is my code:
val existingSparkSession = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
fs.delete(new Path(s"./src/path/*.crc"), true)
But this doesn't delete any .crc files as expected. Is there a way to delete these files using scala and spark?

Because of the wildcard in the Path, fs.delete does not work propertly. One possible solution can be trying hadoop globStatus like the following:
import org.apache.hadoop.fs.FileStatus
val allStatus = fs.globStatus(new Path("/src/path/*.crc"))
for (currentStatus <- allStatus ) {
fs.delete(currentStatus.getPath, true);
}

Related

How to delete all files from hdfs directory with scala

For a project I am currently working on with Scala and Spark, I have to make a code that checks if the hdfs directory I am working on is empty, and if it is not, I have to remove every files from the directory.
Before I deploy my code into Azur, I am testing it with a local directory from my computer.
I am starting with: making a method to delete every files from this directory. This is what I have for now :
object DirectoryCleaner {
val spark:SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExamples.com")
.getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val srcPath=new Path("C:\\Users\\myuser\\Desktop\\test_dir\\file1.csv")
def deleFilesDir(): Unit = {
if(fs.exists(srcPath) && fs.isFile(srcPath))
fs.delete(srcPath, true)
}
}
With this code, I am able to delete a single file (file1.csv). I would like to be able to define my path this way val srcPath=new Path("C:\\Users\\myuser\\Desktop\\test_dir") (without specifying any filename), and just delete every files from the test_dir directory. Any idea on how I could do that ?
Thank's for helping
Use fs.listFiles to get all the files in a directory and then loop through them while deleting them. Also, set the recursive flag to false, so you don't recurse into directories.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
def deleteAllFiles(directoryPath: String, fs: FileSystem): Unit = {
val path = new Path(directoryPath)
// get all files in directory
val files = fs.listFiles(path, false)
// print and delete all files
while (files.hasNext) {
val file = files.next()
fs.delete(file.getPath, false)
}
}
// Example for local, non HDFS path
val directoryPath = "file:///Users/m_vemuri/project"
val fs = FileSystem.get(new Configuration())
deleteAllFiles(directoryPath, fs)

What is the best way to get the paths of all the files located in a GCS bucket from Spark in Scala?

So let's say I have a GCS bucket ID, something like - gs://uhg802p0on/test_data. How can I fetch all the paths of files located in this bucket from Spark in Scala?
Using Hadoop FS API listFiles method you can do something like this:
import org.apache.hadoop.fs._
val conf = sc.hadoopConfiguration
val gcsBucket = new Path("gs://uhg802p0on/test_data")
val filesIter = gcsBucket.getFileSystem(conf).listFiles(gcsBucket, true)
var files = Seq[Path]()
while (filesIter.hasNext) {
files = files :+ filesIter.next().getPath
}
listFiles with option recursive=true lists all the files recursively under the gcs folder.
If you want only paths without recursivity then you can use globStatus method.

Spark Scala list folders in directory

I want to list all folders within a hdfs directory using Scala/Spark.
In Hadoop I can do this by using the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/
I tried it with:
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)
val path = new Path("hdfs://sandbox.hortonworks.com/demo/")
val files = fs.listFiles(path, false)
But it does not seem that he looks in the Hadoop directory as i cannot find my folders/files.
I also tried with:
FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)
But this also does not help.
Do you have any other idea?
PS: I also checked this thread: Spark iterate HDFS directory but it does not work for me as it does not seem to search on hdfs directory, instead only on the local file system with schema file//.
We are using hadoop 1.4 and it doesn't have listFiles method so we use listStatus to get directories. It doesn't have recursive option but it is easy to manage recursive lookup.
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x=> println(x.getPath))
In Spark 2.0+,
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfs-path}")).filter(_.isDir).map(_.getPath).foreach(println)
Hope this is helpful.
in Ajay Ahujas answer isDir is deprecated..
use isDirectory... pls see complete example and output below.
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object ListHDFSDirectories extends App{
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local[*]").getOrCreate()
val hdfspath = "." // your path here
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}
Result :
file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
I was looking for the same, however instead of HDFS, for S3.
I solved creating the FileSystem with my S3 path as below:
def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
val hadoopConf = sparkContext.hadoopConfiguration
val uri = new URI(path)
FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
_.getPath.toString
}
}
I know this question was related for HDFS, but maybe others like me will come here looking for S3 solution. Since without specifying the URI in FileSystem, it will look for HDFS ones.
java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
.globStatus(new org.apache.hadoop.fs.Path(url))
for (urlStatus <- listStatus) {
println("urlStatus get Path:" + urlStatus.getPath())
}
val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs:FileSystem = projects.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)
This will create an iterator it over org.apache.hadoop.fs.LocatedFileStatus that is your subdirectory
Azure Blog Storage is mapped to a HDFS location, so all the Hadoop Operations
On Azure Portal, go to Storage Account, you will find following details:
Storage account
Key -
Container -
Path pattern – /users/accountsdata/
Date format – yyyy-mm-dd
Event serialization format – json
Format – line separated
Path Pattern here is the HDFS path, you can login/putty to the Hadoop Edge Node and do:
hadoop fs -ls /users/accountsdata
Above command will list all the files. In Scala you can use
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
object HDFSProgram extends App {
val uri = new URI("hdfs://HOSTNAME:PORT")
val fs = FileSystem.get(uri,new Configuration())
val filePath = new Path("/user/hive/")
val status = fs.listStatus(filePath)
status.map(sts => sts.getPath).foreach(println)
}
This is sample code to get list of hdfs files or folder present under /user/hive/
Because you're using Scala, you may also be interested in the following:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!
This will, unfortunately, return the entire output of the command as a string, and so parsing down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands where you could do it in the command line easily and are unsure how to do it in Scala, just use the command line through scala.sys.process._. (Use a single ! if you want to just get the return code.)

Write single CSV file using spark-csv

I am using https://github.com/databricks/spark-csv , I am trying to write a single CSV, but not able to, it is making a folder.
Need a Scala function which will take parameter like path and file name and write that CSV file.
It is creating a folder with multiple files, because each partition is saved individually. If you need a single output file (still in a folder) you can repartition (preferred if upstream data is large, but requires a shuffle):
df
.repartition(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
or coalesce:
df
.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
data frame before saving:
All data will be written to mydata.csv/part-00000. Before you use this option be sure you understand what is going on and what is the cost of transferring all data to a single worker. If you use distributed file system with replication, data will be transfered multiple times - first fetched to a single worker and subsequently distributed over storage nodes.
Alternatively you can leave your code as it is and use general purpose tools like cat or HDFS getmerge to simply merge all the parts afterwards.
If you are running Spark with HDFS, I've been solving the problem by writing csv files normally and leveraging HDFS to do the merging. I'm doing that in Spark (1.6) directly:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
// the "true" setting deletes the source files once they are merged into the new output
}
val newData = << create your dataframe >>
val outputfile = "/user/feeds/project/outputs/subject"
var filename = "myinsights"
var outputFileName = outputfile + "/temp_" + filename
var mergedFileName = outputfile + "/merged_" + filename
var mergeFindGlob = outputFileName
newData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.mode("overwrite")
.save(outputFileName)
merge(mergeFindGlob, mergedFileName )
newData.unpersist()
Can't remember where I learned this trick, but it might work for you.
I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data-sets, but large data-sets would all be thrown into one partition on one node. This is likely to throw OOM errors, or at best, to process slowly.
I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.
EDIT - This effectively brings the data to the driver rather than an executor node. Coalesce() would be fine if a single executor has more RAM for use than the driver.
EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following stack overflow article for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?
If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:
val fileprefix= "/mnt/aws/path/file-prefix"
dataset
.coalesce(1)
.write
//.mode("overwrite") // I usually don't use this, but you may want to.
.option("header", "true")
.option("delimiter","\t")
.csv(fileprefix+".tmp")
val partition_path = dbutils.fs.ls(fileprefix+".tmp/")
.filter(file=>file.name.endsWith(".csv"))(0).path
dbutils.fs.cp(partition_path,fileprefix+".tab")
dbutils.fs.rm(fileprefix+".tmp",recurse=true)
If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtils.copyMerge(). I have not done this, and don't yet know if is possible or not, e.g., on S3.
This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.
The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.
spark's df.write() API will create multiple part files inside given path ... to force spark write only a single part file use df.coalesce(1).write.csv(...) instead of df.repartition(1).write.csv(...) as coalesce is a narrow transformation whereas repartition is a wide transformation see Spark - repartition() vs coalesce()
df.coalesce(1).write.csv(filepath,header=True)
will create folder in given filepath with one part-0001-...-c000.csv file
use
cat filepath/part-0001-...-c000.csv > filename_you_want.csv
to have a user friendly filename
This answer expands on the accepted answer, gives more context, and provides code snippets you can run in the Spark Shell on your machine.
More context on accepted answer
The accepted answer might give you the impression the sample code outputs a single mydata.csv file and that's not the case. Let's demonstrate:
val df = Seq("one", "two", "three").toDF("num")
df
.repartition(1)
.write.csv(sys.env("HOME")+ "/Documents/tmp/mydata.csv")
Here's what's outputted:
Documents/
tmp/
mydata.csv/
_SUCCESS
part-00000-b3700504-e58b-4552-880b-e7b52c60157e-c000.csv
N.B. mydata.csv is a folder in the accepted answer - it's not a file!
How to output a single file with a specific name
We can use spark-daria to write out a single mydata.csv file.
import com.github.mrpowers.spark.daria.sql.DariaWriters
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = sys.env("HOME") + "/Documents/better/staging",
filename = sys.env("HOME") + "/Documents/better/mydata.csv"
)
This'll output the file as follows:
Documents/
better/
mydata.csv
S3 paths
You'll need to pass s3a paths to DariaWriters.writeSingleFile to use this method in S3:
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = "s3a://bucket/data/src",
filename = "s3a://bucket/data/dest/my_cool_file.csv"
)
See here for more info.
Avoiding copyMerge
copyMerge was removed from Hadoop 3. The DariaWriters.writeSingleFile implementation uses fs.rename, as described here. Spark 3 still used Hadoop 2, so copyMerge implementations will work in 2020. I'm not sure when Spark will upgrade to Hadoop 3, but better to avoid any copyMerge approach that'll cause your code to break when Spark upgrades Hadoop.
Source code
Look for the DariaWriters object in the spark-daria source code if you'd like to inspect the implementation.
PySpark implementation
It's easier to write out a single file with PySpark because you can convert the DataFrame to a Pandas DataFrame that gets written out as a single file by default.
from pathlib import Path
home = str(Path.home())
data = [
("jellyfish", "JALYF"),
("li", "L"),
("luisa", "LAS"),
(None, None)
]
df = spark.createDataFrame(data, ["word", "expected"])
df.toPandas().to_csv(home + "/Documents/tmp/mydata-from-pyspark.csv", sep=',', header=True, index=False)
Limitations
The DariaWriters.writeSingleFile Scala approach and the df.toPandas() Python approach only work for small datasets. Huge datasets can not be written out as single files. Writing out data as a single file isn't optimal from a performance perspective because the data can't be written in parallel.
I'm using this in Python to get a single file:
df.toPandas().to_csv("/tmp/my.csv", sep=',', header=True, index=False)
A solution that works for S3 modified from Minkymorgan.
Simply pass the temporary partitioned directory path (with different name than final path) as the srcPath and single final csv/txt as destPath Specify also deleteSource if you want to remove the original directory.
/**
* Merges multiple partitions of spark text file output into single file.
* #param srcPath source directory of partitioned files
* #param dstPath output path of individual path
* #param deleteSource whether or not to delete source directory after merging
* #param spark sparkSession
*/
def mergeTextFiles(srcPath: String, dstPath: String, deleteSource: Boolean): Unit = {
import org.apache.hadoop.fs.FileUtil
import java.net.URI
val config = spark.sparkContext.hadoopConfiguration
val fs: FileSystem = FileSystem.get(new URI(srcPath), config)
FileUtil.copyMerge(
fs, new Path(srcPath), fs, new Path(dstPath), deleteSource, config, null
)
}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame,SaveMode,SparkSession}
import org.apache.spark.sql.functions._
I solved using below approach (hdfs rename file name):-
Step 1:- (Crate Data Frame and write to HDFS)
df.coalesce(1).write.format("csv").option("header", "false").mode(SaveMode.Overwrite).save("/hdfsfolder/blah/")
Step 2:- (Create Hadoop Config)
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
Step3 :- (Get path in hdfs folder path)
val pathFiles = new Path("/hdfsfolder/blah/")
Step4:- (Get spark file names from hdfs folder)
val fileNames = hdfs.listFiles(pathFiles, false)
println(fileNames)
setp5:- (create scala mutable list to save all the file names and add it to the list)
var fileNamesList = scala.collection.mutable.MutableList[String]()
while (fileNames.hasNext) {
fileNamesList += fileNames.next().getPath.getName
}
println(fileNamesList)
Step 6:- (filter _SUCESS file order from file names scala list)
// get files name which are not _SUCCESS
val partFileName = fileNamesList.filterNot(filenames => filenames == "_SUCCESS")
step 7:- (convert scala list to string and add desired file name to hdfs folder string and then apply rename)
val partFileSourcePath = new Path("/yourhdfsfolder/"+ partFileName.mkString(""))
val desiredCsvTargetPath = new Path(/yourhdfsfolder/+ "op_"+ ".csv")
hdfs.rename(partFileSourcePath , desiredCsvTargetPath)
spark.sql("select * from df").coalesce(1).write.option("mode","append").option("header","true").csv("/your/hdfs/path/")
spark.sql("select * from df") --> this is dataframe
coalesce(1) or repartition(1) --> this will make your output file to 1 part file only
write --> writing data
option("mode","append") --> appending data to existing directory
option("header","true") --> enabling header
csv("<hdfs dir>") --> write as CSV file & its output location in HDFS
repartition/coalesce to 1 partition before you save (you'd still get a folder but it would have one part file in it)
you can use rdd.coalesce(1, true).saveAsTextFile(path)
it will store data as singile file in path/part-00000
Here is a helper function with which you can get a single result-file without the part-0000 and without a subdirectory on S3 and AWS EMR:
def renameSinglePartToParentFolder(directoryUrl: String): Unit = {
import sys.process._
val lsResult = s"aws s3 ls ${directoryUrl}/" !!
val partFilename = lsResult.split("\n").map(_.split(" ").last).filter(_.contains("part-0000")).last
s"aws s3 rm ${directoryUrl}/_SUCCESS" !
s"aws s3 mv ${directoryUrl}/${partFilename} ${directoryUrl}" !
}
val targetPath = "s3://my-bucket/my-folder/my-file.csv"
df.coalesce(1).write.csv(targetPath)
renameSinglePartToParentFolder(targetPath)
Write to a single part-0000... file.
Use AWS CLI to list all files and rename the single file accordingly.
by using Listbuffer we can save data into single file:
import java.io.FileWriter
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer
val text = spark.read.textFile("filepath")
var data = ListBuffer[String]()
for(line:String <- text.collect()){
data += line
}
val writer = new FileWriter("filepath")
data.foreach(line => writer.write(line.toString+"\n"))
writer.close()
def export_csv(
fileName: String,
filePath: String
) = {
val filePathDestTemp = filePath + ".dir/"
val merstageout_df = spark.sql(merstageout)
merstageout_df
.coalesce(1)
.write
.option("header", "true")
.mode("overwrite")
.csv(filePathDestTemp)
val listFiles = dbutils.fs.ls(filePathDestTemp)
for(subFiles <- listFiles){
val subFiles_name: String = subFiles.name
if (subFiles_name.slice(subFiles_name.length() - 4,subFiles_name.length()) == ".csv") {
dbutils.fs.cp (filePathDestTemp + subFiles_name, filePath + fileName+ ".csv")
dbutils.fs.rm(filePathDestTemp, recurse=true)
}}}
There is one more way to use Java
import java.io._
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit)
{
val p = new java.io.PrintWriter(f);
try { op(p) }
finally { p.close() }
}
printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println)}

Use Spark to list all files in a Hadoop HDFS directory?

I want to loop through all text files in a Hadoop dir and count all the occurrences of the word "error". Is there a way to do a hadoop fs -ls /users/ubuntu/ to list all the files in a dir with the Apache Spark Scala API?
From the given first example, the spark context seems to only access files individually through something like:
val file = spark.textFile("hdfs://target_load_file.txt")
In my problem, I do not know how many nor the names of the files in the HDFS folder beforehand. Looked at the spark context docs but couldn't find this kind of functionality.
You can use a wildcard:
val errorCount = sc.textFile("hdfs://some-directory/*")
.flatMap(_.split(" ")).filter(_ == "error").count
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import scala.collection.mutable.Stack
val fs = FileSystem.get( sc.hadoopConfiguration )
var dirs = Stack[String]()
val files = scala.collection.mutable.ListBuffer.empty[String]
val fs = FileSystem.get(sc.hadoopConfiguration)
dirs.push("/user/username/")
while(!dirs.isEmpty){
val status = fs.listStatus(new Path(dirs.pop()))
status.foreach(x=> if(x.isDirectory) dirs.push(x.getPath.toString) else
files+= x.getPath.toString)
}
files.foreach(println)
For a local installation, (the hdfs default path fs.defaultFS can be found by reading /etc/hadoop/core.xml):
For instance,
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://localhost:9000")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val fileStatus = hdfs.listStatus(new Path("hdfs://localhost:9000/foldername/"))
val fileList = fileStatus.map(x => x.getPath.toString)
fileList.foreach(println)