I have an HDFS directory and I need to get the latest path from it. I was able to get that path using this:
val directoryPath = "/jobs/data/"
val path = new Path(directoryPath)
val fileStatus = hdfs.listStatus(path)
val sortedFiles = fileStatus.sortBy(filestatus => filestatus.getModificationTime)
val latestPath = sortedFiles.last.getPath
The latest path looks like this: /jobs/data/datepartition=mm-dd-yy-00. Now I have an Avro file inside another directory within this path:
/jobs/data/datepartition=mm-dd-yy-00/modelId=model1/file.avro
How do I access this file using the latest path?
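One possible approach (a minimal sketch, assuming the hdfs and latestPath values from above and an active SparkSession named spark; the .avro suffix filter and the spark-avro reader are assumptions, not part of the original question): list the latest partition recursively, collect the Avro files, and read them with Spark.
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

// walk the latest partition recursively and collect the .avro files
val avroFiles = {
  val it = hdfs.listFiles(latestPath, true) // recursive = true
  val buf = ArrayBuffer[Path]()
  while (it.hasNext) {
    val status = it.next()
    if (status.getPath.getName.endsWith(".avro")) buf += status.getPath
  }
  buf.toSeq
}

// read the collected files (requires the spark-avro package on the classpath)
val df = spark.read.format("avro").load(avroFiles.map(_.toString): _*)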
For a project I am currently working on with Scala and Spark, I have to write code that checks whether the HDFS directory I am working on is empty, and if it is not, removes every file from the directory.
Before I deploy my code to Azure, I am testing it with a local directory on my computer.
I am starting by writing a method to delete every file from this directory. This is what I have for now:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object DirectoryCleaner {

  val spark: SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val srcPath = new Path("C:\\Users\\myuser\\Desktop\\test_dir\\file1.csv")

  def deleteFilesDir(): Unit = {
    if (fs.exists(srcPath) && fs.isFile(srcPath))
      fs.delete(srcPath, true)
  }
}
With this code, I am able to delete a single file (file1.csv). I would like to be able to define my path this way: val srcPath = new Path("C:\\Users\\myuser\\Desktop\\test_dir") (without specifying any file name), and just delete every file from the test_dir directory. Any idea how I could do that?
Thanks for helping.
Use fs.listFiles to get all the files in the directory, then loop through them and delete each one. Also, set the recursive flag to false so you don't recurse into sub-directories.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def deleteAllFiles(directoryPath: String, fs: FileSystem): Unit = {
  val path = new Path(directoryPath)
  // get all files in the directory (non-recursive)
  val files = fs.listFiles(path, false)
  // delete each file
  while (files.hasNext) {
    val file = files.next()
    fs.delete(file.getPath, false)
  }
}
// Example with a local (non-HDFS) path
val directoryPath = "file:///Users/m_vemuri/project"
val fs = FileSystem.get(new Configuration())
deleteAllFiles(directoryPath, fs)
I am looking for ways to move a file from the local filesystem to S3 with the Spark APIs, keeping the given file name. The code below creates the file with a part-0000-style name. How do I avoid this?
val df = spark.read.textFile("readPath")
df.coalesce(1).write.mode("Overwrite")
  .save("path")
This worked:
import java.io.File
import org.apache.hadoop.fs.{FileUtil, Path}

val dstPath = new Path(inputFilePath)
val conf = spark.sparkContext.hadoopConfiguration
val fs = dstPath.getFileSystem(conf)
// disable checksum verification so no .crc file is written alongside the copy
fs.setVerifyChecksum(false)
// copy the local file to the destination; the boolean flag deletes the local source after the copy
FileUtil.copy(
  new File(localfilePath),
  fs,
  dstPath,
  true,
  conf
)
I want to delete the automatically generated .crc files from a particular directory. Here is my code:
val existingSparkSession = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
fs.delete(new Path(s"./src/path/*.crc"), true)
But this doesn't delete any .crc files, contrary to what I expected. Is there a way to delete these files using Scala and Spark?
Because of the wildcard in the Path, fs.delete does not work properly. One possible solution is to use Hadoop's globStatus to expand the pattern first, like the following:
import org.apache.hadoop.fs.FileStatus

val allStatus = fs.globStatus(new Path("/src/path/*.crc"))
for (currentStatus <- allStatus) {
  fs.delete(currentStatus.getPath, true)
}
So let's say I have a GCS bucket ID, something like gs://uhg802p0on/test_data. How can I fetch all the paths of the files located in this bucket from Spark in Scala?
Using the Hadoop FS API's listFiles method, you can do something like this:
import org.apache.hadoop.fs._

val conf = sc.hadoopConfiguration
val gcsBucket = new Path("gs://uhg802p0on/test_data")
val filesIter = gcsBucket.getFileSystem(conf).listFiles(gcsBucket, true)

var files = Seq[Path]()
while (filesIter.hasNext) {
  files = files :+ filesIter.next().getPath
}
listFiles with recursive=true lists all the files recursively under the GCS folder.
If you only want the paths directly under that folder, without recursion, you can use the globStatus method instead; see the sketch below.
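A minimal sketch of the non-recursive variant, assuming the same conf and gcsBucket values as above (the single-level * glob pattern is an assumption about the bucket layout):
// expand a one-level glob under the bucket path and collect the matching paths
val topLevelPaths = gcsBucket.getFileSystem(conf)
  .globStatus(new Path(gcsBucket, "*"))
  .map(_.getPath)
  .toSeq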
I have Spark output in S3 folders and I want to move all the files from that output folder to another location, but while moving them I want to rename the files.
For example, I have files in S3 folders like below.
Now I want to rename all the files and put them into another directory, and the names of the files should look like below:
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.1.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.2.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.3.2017-10-18-0439.Full.txt
Here Fundamental.FinancialStatement is constant in all the files and 2017-10-18-0439 is the current date-time.
This is what I have tried so far, but I am not able to get the folder names and loop through all the files:
import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/Segments/output")
val dest = new Path("s3://trfsmallfffile/Segments/Finaloutput")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)

//val file = fs.globStatus(new Path("src/DataPartition=Japan/part*.gz"))(0).getPath.getName
//println(file)

val status = fs.listStatus(src)
status.foreach(filename => {
  val a = filename.getPath.getName.toString()
  println("file name"+a)
  //println(filename)
})
This gives me the output below:
file nameDataPartition=Japan
file nameDataPartition=SelfSourcedPrivate
file nameDataPartition=SelfSourcedPublic
file name_SUCCESS
This gives me the folder details, not the files inside the folders.
The reference is taken from this Stack Overflow reference.
You are getting directories because you have a sub-directory level in S3; use /*/* in the path to go into the sub-directories. Try this:
import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/Segments/Output/*/*")
val dest = new Path("s3://trfsmallfffile/Segments/FinalOutput")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)

val file = fs.globStatus(new Path("s3://trfsmallfffile/Segments/Output/*/*"))
for (urlStatus <- file) {
  //println("S3 FILE PATH IS ===:" + urlStatus.getPath)
  val partitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)
  val finalPrefix = "Fundamental.FinancialLineItem.Segments."
  val finalFileName = finalPrefix + partitionName + ".txt"
  val finalDest = new Path("s3://trfsmallfffile/Segments/FinalOutput/" + finalFileName)
  fs.rename(urlStatus.getPath, finalDest)
}
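The naming scheme in the question also includes a sequence number and the current date-time (e.g. 2017-10-18-0439), which the snippet above does not build. A hedged sketch of one way to add them, reusing the fs and file values from above (the timestamp pattern and the per-listing sequence number are assumptions, not part of the original answer):
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// format the current date-time the way the example file names do, e.g. 2017-10-18-0439
val stamp = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HHmm"))

for ((urlStatus, idx) <- file.zipWithIndex) {
  val partitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)
  // the sequence number here is just the position in the listing, not per partition
  val finalFileName =
    s"Fundamental.FinancialLineItem.Segments.$partitionName.${idx + 1}.$stamp.Full.txt"
  fs.rename(urlStatus.getPath, new Path("s3://trfsmallfffile/Segments/FinalOutput/" + finalFileName))
}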
This has worked for me in the past:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val path = "s3://<bucket>/<directory>"
val fs = FileSystem.get(new java.net.URI(path), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(path))
listStatus returns the statuses of all the files in the S3 directory; a small usage sketch follows below.
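A minimal usage sketch, assuming the same path and fs values as above, that turns the returned statuses into plain path strings:
// listStatus returns an Array[FileStatus]; keep the files and map them to their paths
val filePaths = fs.listStatus(new Path(path))
  .filter(_.isFile)
  .map(_.getPath.toString)

filePaths.foreach(println)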