How to delete multiple HDFS directories starting with some word in Apache Spark - Scala

I have persisted object files in Spark Streaming using the dstream.saveAsObjectFiles("/temObj") method, and it shows multiple files in HDFS:
temObj-1506338844000
temObj-1506338848000
temObj-1506338852000
temObj-1506338856000
temObj-1506338860000
I want to delete all the temObj files after reading them all. What is the best way to do it in Spark? I tried:
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
hdfs.delete(new org.apache.hadoop.fs.Path(Path), true)
But it can only delete one folder at a time.

Unfortunately, delete doesn't support globs.
You can use globStatus and iterate over the files/directories one by one and delete them.
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(sc.hadoopConfiguration)
// Expand the glob into concrete paths, then delete each one recursively
val deletePaths = hdfs.globStatus(new Path("/temObj-*")).map(_.getPath)
deletePaths.foreach { path => hdfs.delete(path, true) }
Alternatively, you can use sys.process to execute shell commands:
import scala.sys.process._
"hdfs dfs -rm -r /temObj*" !

Related

Check if directory contains json files using org.apache.hadoop.fs.Path in HDFS

I'm following the steps indicated here, Avoid "Path does not exist" in dir based spark load, to filter which directories in an array contain JSON files before sending them to the spark.read method.
When I use
inputPaths.filter(f => fs.exists(new org.apache.hadoop.fs.Path(f + "/*.json*")))
It returns empty despite JSON files existing in one of the paths. One of the comments says this doesn't work with HDFS; is there a way to make this work?
I'm running this in a Databricks notebook.
There is a method for listing the files in a directory:
fs.listStatus(dir)
Sort of:
inputPaths.filter(f => fs.listStatus(new Path(f)).exists(file => file.getPath.getName.endsWith(".json")))
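Putting it together, a minimal sketch, assuming inputPaths is a collection of directory path strings (the name is taken from the question) and that spark is the active session:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Keep only the directories that contain at least one .json file
val jsonDirs = inputPaths.filter { dir =>
  fs.listStatus(new Path(dir)).exists(_.getPath.getName.endsWith(".json"))
}

// Read only the surviving directories
val df = spark.read.json(jsonDirs: _*)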

Spark write operation HDFS using temporal path

I am trying to write to a CSV file from this Scala code. I'm using HDFS as a temp directory, then just writer.write to create a new file in an existing subfolder. I get the error message shown below the code:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
val hconf = new Configuration() // initialize new hadoop configuration
new Path(path).getFileSystem(hconf) // get new filesystem to handle data
}
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
The same happens whether I choose a new file or an existing one. I've checked that the path is correct; I just want to create a new file in there.
The problem is that, in order to write data using a file-system-based source, you need a temporary directory. This is part of the commit mechanism used by Spark: data is first written to a temporary directory, and once the tasks are finished, the processed files are automatically moved to the final path.
Should I change the temp folder path for each Spark application to S3? I think it is better to process locally (local files / HDFS) and then upload the processed output file to S3.
Also, I just see "No Spark configuration set" in the Databricks cluster I'm using; does this interfere with the issue?
If you are able to read the raw data with Spark/Scala as a DataFrame, you can perform transformations on that DataFrame to build the final one. Once you have the final DataFrame that needs to be written as a CSV file, you can use the single line of code below to save it to an S3 bucket path or an HDFS path.
df.write.format("csv").option("header", "true").mode("overwrite").option("sep", ",").save("s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv")
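A minimal end-to-end sketch of that approach, with the input and output paths copied verbatim from the question and the actual transformations left as a placeholder:
// Read the raw CSV into a DataFrame instead of using a PrintWriter
val raw = spark.read
  .option("header", "true")
  .csv("s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv")

// ... apply your transformations here to build the final DataFrame ...
val finalDf = raw // placeholder

// Spark handles the temporary/commit directories itself during the write
finalDf.write
  .format("csv")
  .option("header", "true")
  .option("sep", ",")
  .mode("overwrite")
  .save("s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv")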

Copy file from Hdfs to Hdfs scala

Is there a known way, using the Hadoop API or Spark Scala, to copy files from one directory to another on HDFS?
I have tried using copyFromLocalFile, but it was not helpful.
Try Hadoop's FileUtil.copy() command, as described here: https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/fs/FileUtil.html#copy(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20boolean,%20org.apache.hadoop.conf.Configuration)
val conf = new org.apache.hadoop.conf.Configuration()
val srcPath = new org.apache.hadoop.fs.Path("hdfs://my/src/path")
val dstPath = new org.apache.hadoop.fs.Path("hdfs://my/dst/path")
org.apache.hadoop.fs.FileUtil.copy(
  srcPath.getFileSystem(conf), // source filesystem
  srcPath,                     // source path
  dstPath.getFileSystem(conf), // destination filesystem
  dstPath,                     // destination path
  true,                        // deleteSource: true removes the source after copying (i.e. a move)
  conf
)
As I understand your question, the answer is as easy as ABC. Actually, there is no fundamental difference between your OS file system and distributed file systems when it comes to concepts like copying files; each simply has its own command syntax. For instance, when you want to copy a file from one directory to another you can do something like:
hdfs dfs -cp /dir_1/file_1.txt /dir_2/file_1_new_name.txt
The first part of the example command (hdfs dfs) just routes the command to the right destination, HDFS rather than the OS's own file system.
For further reading you can use: copying data in hdfs
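If you would rather stay inside a Scala/Spark job than drop to a terminal, the same shell command can be invoked through sys.process, as in the first answer above; the paths here are just the example ones:
import scala.sys.process._

// Run the HDFS shell copy from Scala; '!' returns the process exit code (0 on success)
val exitCode = Seq("hdfs", "dfs", "-cp", "/dir_1/file_1.txt", "/dir_2/file_1_new_name.txt").!
if (exitCode != 0) println(s"hdfs dfs -cp failed with exit code $exitCode")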

How to continuously monitor a directory by using Spark Structured Streaming

I want Spark to continuously monitor a directory and read the CSV files using spark.readStream as soon as a file appears in that directory.
Please don't include a Spark Streaming solution; I am looking for a way to do it with Spark Structured Streaming.
Here is the complete solution for this use case:
If you are running in standalone mode, you can increase the driver memory:
bin/spark-shell --driver-memory 4G
There is no need to set the executor memory, since in standalone mode the executor runs within the driver.
Completing the solution of @T.Gaweda, find the solution below:
import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")

csvDF.writeStream.format("console").option("truncate", "false").start()
Now Spark will continuously monitor the specified directory, and as soon as you add any CSV file to it, your DataFrame operations on "csvDF" will be executed on that file.
Note: If you want Spark to infer the schema, you first have to set the following configuration:
spark.sqlContext.setConf("spark.sql.streaming.schemaInference", "true")
where spark is your Spark session.
As written in the official documentation, you should use the "file" source:
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Code example taken from documentation:
// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
If you don't specify a trigger, Spark will read new files as soon as possible.
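If you do want explicit control over how often the directory is polled, you can add a processing-time trigger on the writer; a minimal sketch (the 10-second interval is only an example), reusing csvDF from above:
import org.apache.spark.sql.streaming.Trigger

// Check the directory for new CSV files every 10 seconds instead of "as soon as possible"
val query = csvDF.writeStream
  .format("console")
  .option("truncate", "false")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()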

Spark Tachyon: How to delete a file?

In Scala, as an experiment I create a sequence file on Tachyon using Spark and read it back in. I want to delete the file from Tachyon using the Spark script also.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't understand the Scala language very well and I cannot find a reference about file path manipulation. I did find a way of somehow using Java in Scala to do this, but I cannot get it to work using Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
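Since Tachyon also exposes a Hadoop-compatible filesystem, another option is to reuse the FileSystem.delete approach from the first answer; this is a sketch that assumes the Tachyon Hadoop client is on the classpath and that sc is your SparkContext:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the filesystem behind the tachyon:// URI from the job's Hadoop configuration
val fs = FileSystem.get(new URI("tachyon://127.0.0.1:19998"), sc.hadoopConfiguration)

// Delete the sequence file written earlier (recursive = true also removes directories)
fs.delete(new Path("/files/123.sf2"), true)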