How can I prevent Hadoop's HDFS API from creating parent directories? - scala

I want HDFS commands to fail if a parent directory doesn't exist when making subdirectories. When I use any of the FileSystem#mkdirs overloads, no exception is raised; the missing parent directories are created instead:
import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
conf.set("fs.defaultFS", s"hdfs://$host:$port")
val fileSystem = FileSystem.get(conf)
val cwd = fileSystem.getWorkingDirectory
// Guarantee non-existence by appending two UUIDs.
val dirToCreate = new Path(cwd, new Path(UUID.randomUUID.toString, UUID.randomUUID.toString))
fileSystem.mkdirs(dirToCreate)
Without the burden of checking for existence myself, how can I force HDFS to throw an exception if a parent directory doesn't exist?

The FileSystem API does not support this type of behavior. Instead, FileContext#mkdir should be used; for example:
import java.util.UUID
import org.apache.hadoop.fs.{FileContext, Path}
import org.apache.hadoop.fs.permission.FsPermission
val files = FileContext.getFileContext()
val cwd = files.getWorkingDirectory
val permissions = new FsPermission("644")
val createParent = false
// Guarantee non-existence by appending two UUIDs.
val dirToCreate = new Path(cwd, new Path(UUID.randomUUID.toString, UUID.randomUUID.toString))
files.mkdir(dirToCreate, permissions, createParent)
The above example will throw:
java.io.FileNotFoundException: Parent directory doesn't exist: /user/erip/f425a2c9-1007-487b-8488-d73d447c6f79
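If the caller would rather handle the failure than let it propagate, the call can be wrapped in scala.util.Try. This is a minimal sketch, not part of the original answer: StrictMkdir and mkdirStrict are hypothetical names, and the permission 755 is only an illustrative choice.

```scala
import scala.util.Try
import org.apache.hadoop.fs.{FileContext, Path}
import org.apache.hadoop.fs.permission.FsPermission

object StrictMkdir {
  // Hypothetical helper: creates `dir` only if its parent already exists.
  // createParent = false makes FileContext#mkdir fail instead of
  // silently creating the missing parent directories.
  def mkdirStrict(fc: FileContext, dir: Path): Try[Unit] =
    Try(fc.mkdir(dir, new FsPermission("755"), false))
}
```

With a FileContext from FileContext.getFileContext() (or FileContext.getLocalFSFileContext() for a quick local check), the result is a Failure when the parent is missing and a Success when only the last path component had to be created.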

Related

check if the temporary file(semaphore) exists using scala

How can I check whether a temp file created on my system exists?
I need to do this check using Scala.
How can I do this?
Try this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fileSystem = FileSystem.get(new Configuration())
val path = new Path("/tmp/foo/bar/meow.parquet")
if (fileSystem.exists(path)){
// TODO ...
}
It works on local, Docker, and remote file systems (S3 and HDFS).
Hope it helps

Saving RDD as textfile gives FileAlreadyExists Exception. How to create new file every time program loads and delete old one using FileUtils

Code:
val badData:RDD[ListBuffer[String]] = rdd.filter(line => line(1).equals("XX") || line(5).equals("XX"))
badData.coalesce(1).saveAsTextFile(propForFile.getString("badDataFilePath"))
The first run works fine. On running the program again, it throws a FileAlreadyExistsException.
I want to resolve this using the FileUtils Java functionality and save the RDD as a text file.
Before you write the file to a specified path, delete the already existing path.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("bad/data/file/path"), true)
Then perform your usual write process. This should resolve the problem.
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(new Path("path/to/the/files")))
fs.delete(new Path("path/to/the/files"), true)
Pass the path as a String; if a file or directory is present there, it is deleted. Run this code before writing to the output path.
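The delete-before-write step from the answers above can be packaged as a small helper. This is a sketch under assumptions: OutputCleaner and deleteIfExists are hypothetical names, and the file system is resolved from the path itself through Path#getFileSystem rather than from the default file system.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object OutputCleaner {
  // Hypothetical helper: recursively deletes `output` if it is present.
  // Returns true when something was deleted, false when nothing existed.
  def deleteIfExists(conf: Configuration, output: String): Boolean = {
    val path = new Path(output)
    // Resolve the file system from the path's scheme (file, hdfs, ...)
    val fs = path.getFileSystem(conf)
    fs.exists(path) && fs.delete(path, true)
  }
}
```

In a Spark job this could be called as OutputCleaner.deleteIfExists(sc.hadoopConfiguration, outputPath) just before saveAsTextFile, where outputPath stands for whatever path the job writes to.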
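Why not use DataFrames? Convert the RDD[ListBuffer[String]] into an RDD[Row], something like:
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
val badData: RDD[Row] = rdd.map(line =>
Row(line(0), line(1)... line(n)))
.filter(row => filter stuff)
// An RDD[Row] has no toDF(); build the DataFrame with a schema (a StructType you define),
// then finish the write with an output format such as .csv(path) or .parquet(path)
spark.createDataFrame(badData, schema).write.mode(SaveMode.Overwrite)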

cannot list files in a hdfs dir using new File.listFiles

I have full permissions on the folder I am trying to list, but listing it still fails.
scala> new File("hdfs://mapdigidev/apps/hive/warehouse/da_ai.db/t_fact_ai_pi_ww").listFiles
res0: Array[java.io.File] = null
java.io doesn't know about Hadoop/HDFS, which is why File.listFiles returns null for an hdfs:// path. You can use the Hadoop libraries to list files in Hadoop instead:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(new URI("hdfs://mapdigidev"), new Configuration())
val files = fs.listFiles(new Path("/apps/hive/warehouse/da_ai.db/t_fact_ai_pi_ww"), false)

Spark Scala list folders in directory

I want to list all folders within a hdfs directory using Scala/Spark.
In Hadoop I can do this by using the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/
I tried it with:
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)
val path = new Path("hdfs://sandbox.hortonworks.com/demo/")
val files = fs.listFiles(path, false)
But it does not seem to look in the Hadoop directory, as I cannot find my folders/files.
I also tried with:
FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)
But this also does not help.
Do you have any other idea?
PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me, as it does not seem to search the HDFS directory, only the local file system with the scheme file://.
We are using hadoop 1.4 and it doesn't have the listFiles method, so we use listStatus to get directories. It has no recursive option, but a recursive lookup is easy to implement.
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x=> println(x.getPath))
In Spark 2.0+,
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// hdfsPath is a placeholder for your directory path, as a String
fs.listStatus(new Path(hdfsPath)).filter(_.isDir).map(_.getPath).foreach(println)
Hope this is helpful.
In Ajay Ahuja's answer, isDir is deprecated.
Use isDirectory instead. See the complete example and output below.
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object ListHDFSDirectories extends App{
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local[*]").getOrCreate()
val hdfspath = "." // your path here
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}
Result :
file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
I was looking for the same thing, but for S3 instead of HDFS.
I solved it by creating the FileSystem with my S3 path, as below:
def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
val hadoopConf = sparkContext.hadoopConfiguration
val uri = new URI(path)
FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
_.getPath.toString
}
}
I know this question was about HDFS, but maybe others like me will come here looking for an S3 solution. Without the URI passed to FileSystem.get, it looks for an HDFS path and fails with:
java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
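One way to avoid this mismatch is to let the path pick its own file system through Path#getFileSystem, which selects the implementation from the path's scheme. A minimal sketch, assuming the connector for the scheme in question (for S3, for example) is on the classpath and configured; FsForPath and listChildren are hypothetical names:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object FsForPath {
  // Hypothetical helper: lists the direct children of `dir`, using the
  // FileSystem that matches the scheme of `dir` (file, hdfs, s3a, ...)
  // instead of the default one from fs.defaultFS.
  def listChildren(conf: Configuration, dir: String): Seq[String] = {
    val path = new Path(dir)
    path.getFileSystem(conf).listStatus(path).map(_.getPath.toString).toSeq
  }
}
```

Because the file system and the path come from the same URI, the two cannot disagree, so the Wrong FS check is not triggered.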
val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
.globStatus(new org.apache.hadoop.fs.Path(url))
for (urlStatus <- listStatus) {
println("urlStatus get Path:" + urlStatus.getPath())
}
val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)
This creates an iterator, it, over org.apache.hadoop.fs.LocatedFileStatus entries, one for each item directly under path, your subdirectories included.
Azure Blob Storage is mapped to an HDFS location, so all the Hadoop operations apply to it.
On the Azure Portal, go to Storage Account; you will find the following details:
Storage account
Key -
Container -
Path pattern – /users/accountsdata/
Date format – yyyy-mm-dd
Event serialization format – json
Format – line separated
The path pattern here is the HDFS path. You can log in (for example via PuTTY) to the Hadoop edge node and run:
hadoop fs -ls /users/accountsdata
The above command lists all the files. In Scala you can use:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
object HDFSProgram extends App {
val uri = new URI("hdfs://HOSTNAME:PORT")
val fs = FileSystem.get(uri,new Configuration())
val filePath = new Path("/user/hive/")
val status = fs.listStatus(filePath)
status.map(sts => sts.getPath).foreach(println)
}
This is sample code that lists the HDFS files and folders present under /user/hive/.
Because you're using Scala, you may also be interested in the following:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!
This will, unfortunately, return the entire output of the command as a string, and so parsing down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands where you could do it in the command line easily and are unsure how to do it in Scala, just use the command line through scala.sys.process._. (Use a single ! if you want to just get the return code.)
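If the command-line route is used anyway, the parsing effort mentioned above can be sketched as follows. This assumes the usual hadoop fs -ls layout: an optional "Found N items" header, then eight whitespace-separated columns per entry (permissions, replication, owner, group, size, date, time, path). LsParser and paths are hypothetical names, and the layout should be checked against the output of your own Hadoop version.

```scala
object LsParser {
  // Extracts the path column from `hadoop fs -ls` output.
  // Assumed layout per entry (8 columns):
  // permissions replication owner group size date time path
  def paths(lsOutput: String): Seq[String] =
    lsOutput
      .split("\\r?\\n")
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("Found "))
      .map(_.split("\\s+", 8)) // limit 8 keeps spaces inside the path intact
      .collect { case cols if cols.length == 8 => cols(7) }
      .toSeq
}
```

It would be applied to the string returned by the !! call, for example LsParser.paths(lsResult). Lines that do not have eight columns are skipped rather than guessed at.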

Use Spark to list all files in a Hadoop HDFS directory?

I want to loop through all text files in a Hadoop dir and count all the occurrences of the word "error". Is there a way to do a hadoop fs -ls /users/ubuntu/ to list all the files in a dir with the Apache Spark Scala API?
From the given first example, the spark context seems to only access files individually through something like:
val file = spark.textFile("hdfs://target_load_file.txt")
In my problem, I know neither the number nor the names of the files in the HDFS folder beforehand. I looked at the Spark context docs but couldn't find this kind of functionality.
You can use a wildcard:
val errorCount = sc.textFile("hdfs://some-directory/*")
.flatMap(_.split(" ")).filter(_ == "error").count
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.Stack
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()
val files = scala.collection.mutable.ListBuffer.empty[String]
dirs.push("/user/username/")
// Depth-first walk: pop a directory, queue its subdirectories, collect its files
while (dirs.nonEmpty) {
val status = fs.listStatus(new Path(dirs.pop()))
status.foreach(x => if (x.isDirectory) dirs.push(x.getPath.toString) else
files += x.getPath.toString)
}
files.foreach(println)
For a local installation (the HDFS default path fs.defaultFS can be found by reading /etc/hadoop/core-site.xml), for instance:
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://localhost:9000")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val fileStatus = hdfs.listStatus(new Path("hdfs://localhost:9000/foldername/"))
val fileList = fileStatus.map(x => x.getPath.toString)
fileList.foreach(println)