How to list and delete files faster in Databricks using pyspark

I want to implement efficient file listing and deletion on Databricks using pyspark. The following link describes a Scala implementation; is there an equivalent pyspark version?
https://kb.databricks.com/en_US/data/list-delete-files-faster

You can use dbutils, the Databricks file system utilities.
To delete a file or a directory:
dbutils.fs.rm("dbfs:/filepath")
To delete all files from a directory, and optionally delete the directory itself, I use a custom-written utility function:
def empty_dir(dir_path, remove_dir=False):
    # Delete every file directly under dir_path (subdirectories are left alone)
    listFiles = dbutils.fs.ls(dir_path)
    for _file in listFiles:
        if _file.isFile():
            dbutils.fs.rm(_file.path)
    # Optionally remove the directory itself once it is empty
    if remove_dir:
        dbutils.fs.rm(dir_path)
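If a plain driver-side loop is too slow for very large directories, below is a rough pyspark sketch of a more parallel variant. It is not a direct port of the Scala job in the linked KB article: it lists the tree recursively with dbutils on the driver and fans the deletes out over a thread pool. The helper names, the pool size, and the use of dbutils from driver-side threads are all my assumptions; isFile()/isDir() on FileInfo also depend on your Databricks Runtime version.
from concurrent.futures import ThreadPoolExecutor

def list_files_recursively(dir_path):
    # Walk the directory tree with dbutils and collect all file paths
    files = []
    for entry in dbutils.fs.ls(dir_path):
        if entry.isDir():
            files.extend(list_files_recursively(entry.path))
        else:
            files.append(entry.path)
    return files

def delete_files_parallel(dir_path, max_workers=32):
    # Delete the collected files concurrently from driver-side threads
    files = list_files_recursively(dir_path)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(dbutils.fs.rm, files))

delete_files_parallel("dbfs:/filepath")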

Related

Check if directory contains json files using org.apache.hadoop.fs.Path in HDFS

I'm following the steps indicated in Avoid "Path does not exist" in dir based spark load to filter which directories in an array contain json files before sending them to the spark.read method.
When I use
inputPaths.filter(f => fs.exists(new org.apache.hadoop.fs.Path(f + "/*.json*")))
It returns an empty result even though json files exist in one of the paths. One of the comments says this doesn't work with HDFS; is there a way to make this work?
I'm running this in a Databricks notebook.
There is a method for listing the files in a directory:
fs.listStatus(dir)
Something like this:
inputPaths.filter(f => fs.listStatus(f).exists(file => file.getPath.getName.endsWith(".json")))
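Since you mention you are running this in a Databricks notebook, here is a hedged pyspark sketch of the same idea using dbutils.fs.ls instead of the Hadoop FileSystem; input_paths and the example directories are placeholders for your own list:
input_paths = ["dbfs:/data/dir1", "dbfs:/data/dir2"]  # hypothetical example paths

def contains_json(dir_path):
    # True if any entry directly inside dir_path is named *.json
    return any(f.name.endswith(".json") for f in dbutils.fs.ls(dir_path))

json_dirs = [p for p in input_paths if contains_json(p)]
df = spark.read.json(json_dirs)  # read only the directories that actually contain json files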

Copy file from Hdfs to Hdfs scala

Is there a known way, using the Hadoop API or Spark Scala, to copy files from one directory to another on HDFS?
I have tried using copyFromLocalFile, but it was not helpful.
Try Hadoop's FileUtil.copy() method, as described here: https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/fs/FileUtil.html#copy(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20boolean,%20org.apache.hadoop.conf.Configuration)
val conf = new org.apache.hadoop.conf.Configuration()
val srcPath = new org.apache.hadoop.fs.Path("hdfs://my/src/path")
val dstPath = new org.apache.hadoop.fs.Path("hdfs://my/dst/path")
org.apache.hadoop.fs.FileUtil.copy(
  srcPath.getFileSystem(conf),
  srcPath,
  dstPath.getFileSystem(conf),
  dstPath,
  true,
  conf
)
As I understand your question, the answer is fairly straightforward. Conceptually there is no difference between your OS filesystem and a distributed filesystem when it comes to operations like copying files; each just has its own command syntax. For instance, to copy a file from one directory to another you can do something like:
hdfs dfs -cp /dir_1/file_1.txt /dir_2/file_1_new_name.txt
The hdfs dfs prefix at the start of the command simply routes it to HDFS rather than to the OS's own file system.
For further reading, see: copying data in hdfs
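For completeness, here is a hedged pyspark sketch of the same FileUtil.copy call, going through Spark's internal (and unsupported) _jvm/_jsc accessors; it assumes a SparkSession named spark is available, and the paths are placeholders:
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()  # internal accessor, may change between Spark versions
Path = sc._jvm.org.apache.hadoop.fs.Path
FileUtil = sc._jvm.org.apache.hadoop.fs.FileUtil

src = Path("hdfs://my/src/path")  # placeholder source path
dst = Path("hdfs://my/dst/path")  # placeholder destination path
# copy(srcFS, src, dstFS, dst, deleteSource, conf); False keeps the source in place
FileUtil.copy(src.getFileSystem(hadoop_conf), src, dst.getFileSystem(hadoop_conf), dst, False, hadoop_conf)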

How to delete multiple hdfs directories starting with some word in Apache Spark

I have persisted object files in Spark Streaming using the dstream.saveAsObjectFiles("/temObj") method, and it produces multiple files in HDFS:
temObj-1506338844000
temObj-1506338848000
temObj-1506338852000
temObj-1506338856000
temObj-1506338860000
I want to delete all the temObj files after reading them. What is the best way to do it in Spark? I tried:
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
hdfs.delete(new org.apache.hadoop.fs.Path(Path), true)
But it can only delete one folder at a time.
Unfortunately, delete doesn't support globs.
You can use globStatus and iterate over the files/directories one by one and delete them.
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(sc.hadoopConfiguration)
val deletePaths = hdfs.globStatus(new Path("/temObj-*")).map(_.getPath)
deletePaths.foreach { path => hdfs.delete(path, true) }
Alternatively, you can use sys.process to execute shell commands
import scala.sys.process._
"hdfs dfs -rm -r /tempObj*" !

Cleaning up BigQueryInputFormat temp files

I am using the BigQueryInputFormat in a Spark job to load data directly from BigQuery into an RDD. The documentation states that you should clean up temporary files using the command:
BigQueryInputFormat.cleanupJob(job)
However, how can I do that from a Spark job, when "job" is a Hadoop job?
Thanks,
Luke
Figured it out: you can set a custom temp path that is unique to your Spark job and delete that path at the end of the job:
hadoopConf.set(BigQueryConfiguration.TEMP_GCS_PATH_KEY, "gs://mybucket/hadoop/tmp/1234")
...
FileSystem.get(new Configuration()).delete(new Path(hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY)), true)
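A hedged pyspark version of the same pattern is below. The gs:// path is a placeholder, and the literal config key is my assumption for the string that BigQueryConfiguration.TEMP_GCS_PATH_KEY resolves to; please verify it against your connector version:
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()  # internal accessor, may change between Spark versions
temp_path = "gs://mybucket/hadoop/tmp/1234"  # placeholder bucket/path unique to this job

# Assumed to be the value of BigQueryConfiguration.TEMP_GCS_PATH_KEY; double-check your connector
hadoop_conf.set("mapred.bq.temp.gcs.path", temp_path)
# ... run the BigQuery-backed job ...

# Delete the temp path at the end of the job, using the filesystem that owns it
Path = sc._jvm.org.apache.hadoop.fs.Path
p = Path(temp_path)
p.getFileSystem(hadoop_conf).delete(p, True)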

Spark Tachyon: How to delete a file?

As an experiment, I create a sequence file on Tachyon using Spark (in Scala) and read it back in. I also want to delete the file from Tachyon using the same Spark script.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't understand the Scala language very well and I cannot find a reference on file path manipulation. I did find a way of using Java from Scala to do this, but I cannot get it to work with Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
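Since Spark already reads and writes tachyon:// paths through a Hadoop-compatible filesystem client, another option is to delete the file through the Hadoop FileSystem API from the same Spark application. A hedged sketch, shown in pyspark to match the rest of this page and assuming a pyspark shell where sc is defined and the Tachyon client is on the classpath:
hadoop_conf = sc._jsc.hadoopConfiguration()  # internal accessor, may change between Spark versions
Path = sc._jvm.org.apache.hadoop.fs.Path

p = Path("tachyon://127.0.0.1:19998/files/123.sf2")
# Resolve the filesystem for the tachyon:// scheme and delete the file (True = recursive)
p.getFileSystem(hadoop_conf).delete(p, True)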