I have an HDFS directory and I need to get the latest path from it. I was able to get that path using this:
val directoryPath = "/jobs/data/"
val path = new Path(directoryPath)
val fileStatus = hdfs.listStatus(path)
val sortedFiles = fileStatus.sortBy(filestatus => filestatus.getModificationTime)
val latestPath = sortedFiles.last.getPath
The latest path looks like this: /jobs/data/datepartition=mm-dd-yy-00. Now I have an Avro file inside another directory within this path:
/jobs/data/datepartition=mm-dd-yy-00/modelId=model1/file.avro
How do I access this file using the latest path?
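One possible approach (a minimal sketch, assuming the hdfs and latestPath values from above and an active SparkSession named spark; the .avro suffix filter and the spark-avro reader are assumptions, not part of the original question): list the latest partition recursively, collect the Avro files, and read them with Spark.
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

// walk the latest partition recursively and collect the .avro files
val avroFiles = {
  val it = hdfs.listFiles(latestPath, true) // recursive = true
  val buf = ArrayBuffer[Path]()
  while (it.hasNext) {
    val status = it.next()
    if (status.getPath.getName.endsWith(".avro")) buf += status.getPath
  }
  buf.toSeq
}

// read the collected files (requires the spark-avro package on the classpath)
val df = spark.read.format("avro").load(avroFiles.map(_.toString): _*)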
For a project I am currently working on with Scala and Spark, I have to write code that checks whether the HDFS directory I am working on is empty, and if it is not, removes every file from the directory.
Before I deploy my code to Azure, I am testing it with a local directory on my computer.
I am starting by writing a method to delete every file from this directory. This is what I have for now:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object DirectoryCleaner {

  val spark: SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val srcPath = new Path("C:\\Users\\myuser\\Desktop\\test_dir\\file1.csv")

  def deleteFilesDir(): Unit = {
    if (fs.exists(srcPath) && fs.isFile(srcPath))
      fs.delete(srcPath, true)
  }
}
With this code, I am able to delete a single file (file1.csv). I would like to be able to define my path this way: val srcPath = new Path("C:\\Users\\myuser\\Desktop\\test_dir") (without specifying any file name), and just delete every file from the test_dir directory. Any idea how I could do that?
Thanks for helping.
Use fs.listFiles to get all the files in the directory, then loop through them and delete each one. Also, set the recursive flag to false so you don't recurse into sub-directories.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def deleteAllFiles(directoryPath: String, fs: FileSystem): Unit = {
  val path = new Path(directoryPath)
  // get all files in the directory (non-recursive)
  val files = fs.listFiles(path, false)
  // delete each file
  while (files.hasNext) {
    val file = files.next()
    fs.delete(file.getPath, false)
  }
}
// Example with a local (non-HDFS) path
val directoryPath = "file:///Users/m_vemuri/project"
val fs = FileSystem.get(new Configuration())
deleteAllFiles(directoryPath, fs)
I am looking for ways to move a file from the local filesystem to S3 with the Spark APIs, keeping the given file name. The code below creates the file with a part-0000-style name. How do I avoid this?
val df = spark.read.textFile("readPath")
df.coalesce(1).write.mode("Overwrite")
  .save("path")
This worked:
import java.io.File
import org.apache.hadoop.fs.{FileUtil, Path}

val dstPath = new Path(inputFilePath)
val conf = spark.sparkContext.hadoopConfiguration
val fs = dstPath.getFileSystem(conf)
// disable checksum verification so no .crc file is written alongside the copy
fs.setVerifyChecksum(false)
// copy the local file to the destination; the boolean flag deletes the local source after the copy
FileUtil.copy(
  new File(localfilePath),
  fs,
  dstPath,
  true,
  conf
)
I want to delete the automatically generated .crc files from a particular directory. Here is my code:
val existingSparkSession = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
fs.delete(new Path(s"./src/path/*.crc"), true)
But this doesn't delete any .crc files, contrary to what I expected. Is there a way to delete these files using Scala and Spark?
Because of the wildcard in the Path, fs.delete does not work properly. One possible solution is to use Hadoop's globStatus to expand the pattern first, like the following:
import org.apache.hadoop.fs.FileStatus

val allStatus = fs.globStatus(new Path("/src/path/*.crc"))
for (currentStatus <- allStatus) {
  fs.delete(currentStatus.getPath, true)
}
So let's say I have a GCS bucket ID, something like gs://uhg802p0on/test_data. How can I fetch all the paths of the files located in this bucket from Spark in Scala?
Using the Hadoop FS API's listFiles method, you can do something like this:
import org.apache.hadoop.fs._

val conf = sc.hadoopConfiguration
val gcsBucket = new Path("gs://uhg802p0on/test_data")
val filesIter = gcsBucket.getFileSystem(conf).listFiles(gcsBucket, true)

var files = Seq[Path]()
while (filesIter.hasNext) {
  files = files :+ filesIter.next().getPath
}
listFiles with recursive=true lists all the files recursively under the GCS folder.
If you only want the paths directly under that folder, without recursion, you can use the globStatus method instead; see the sketch below.
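A minimal sketch of the non-recursive variant, assuming the same conf and gcsBucket values as above (the single-level * glob pattern is an assumption about the bucket layout):
// expand a one-level glob under the bucket path and collect the matching paths
val topLevelPaths = gcsBucket.getFileSystem(conf)
  .globStatus(new Path(gcsBucket, "*"))
  .map(_.getPath)
  .toSeq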
I have Spark output in S3 folders and I want to move all the files from that output folder to another location, but while moving them I want to rename the files.
For example, I have files in S3 folders like below.
Now I want to rename all the files and put them into another directory, and the names of the files should look like below:
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.1.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.2.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.3.2017-10-18-0439.Full.txt
Here Fundamental.FinancialStatement is constant in all the files and 2017-10-18-0439 is the current date-time.
This is what I have tried so far, but I am not able to get the folder names and loop through all the files:
import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/Segments/output")
val dest = new Path("s3://trfsmallfffile/Segments/Finaloutput")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)

//val file = fs.globStatus(new Path("src/DataPartition=Japan/part*.gz"))(0).getPath.getName
//println(file)

val status = fs.listStatus(src)
status.foreach(filename => {
  val a = filename.getPath.getName.toString()
  println("file name"+a)
  //println(filename)
})
This gives me the output below:
file nameDataPartition=Japan
file nameDataPartition=SelfSourcedPrivate
file nameDataPartition=SelfSourcedPublic
file name_SUCCESS
This gives me the folder details, not the files inside the folders.
The reference is taken from this Stack Overflow reference.
You are getting directories because you have a sub-directory level in S3; use /*/* in the path to go into the sub-directories. Try this:
import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/Segments/Output/*/*")
val dest = new Path("s3://trfsmallfffile/Segments/FinalOutput")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)

val file = fs.globStatus(new Path("s3://trfsmallfffile/Segments/Output/*/*"))
for (urlStatus <- file) {
  //println("S3 FILE PATH IS ===:" + urlStatus.getPath)
  val partitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)
  val finalPrefix = "Fundamental.FinancialLineItem.Segments."
  val finalFileName = finalPrefix + partitionName + ".txt"
  val finalDest = new Path("s3://trfsmallfffile/Segments/FinalOutput/" + finalFileName)
  fs.rename(urlStatus.getPath, finalDest)
}
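The naming scheme in the question also includes a sequence number and the current date-time (e.g. 2017-10-18-0439), which the snippet above does not build. A hedged sketch of one way to add them, reusing the fs and file values from above (the timestamp pattern and the per-listing sequence number are assumptions, not part of the original answer):
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// format the current date-time the way the example file names do, e.g. 2017-10-18-0439
val stamp = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HHmm"))

for ((urlStatus, idx) <- file.zipWithIndex) {
  val partitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)
  // the sequence number here is just the position in the listing, not per partition
  val finalFileName =
    s"Fundamental.FinancialLineItem.Segments.$partitionName.${idx + 1}.$stamp.Full.txt"
  fs.rename(urlStatus.getPath, new Path("s3://trfsmallfffile/Segments/FinalOutput/" + finalFileName))
}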
This has worked for me in the past:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val path = "s3://<bucket>/<directory>"
val fs = FileSystem.get(new java.net.URI(path), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(path))
listStatus returns the statuses of all the files in the S3 directory; a small usage sketch follows below.
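A minimal usage sketch, assuming the same path and fs values as above, that turns the returned statuses into plain path strings:
// listStatus returns an Array[FileStatus]; keep the files and map them to their paths
val filePaths = fs.listStatus(new Path(path))
  .filter(_.isFile)
  .map(_.getPath.toString)

filePaths.foreach(println)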