I am working in scala and spark environment where I want to read parquet file. Before I read, I want to check if the file exists or not. I am writing the following code in jupyter notebook but it does not work - meaning it does not show any frame because the function testDirExist returns false
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def testDirExist(path: String): Boolean = {
val p = new Path(path)
hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
}
val pt = "abfss://container#account.dfs.core.windows.net/blah/blah/blah
val exists = testDirExist(pt)
if(exists)
{
val dataframe = spark.read.parquet(pt)
dataframe.show()
}
However, the following code works. It shows data frame
val k = spark.read.parquet("abfss://container#account.dfs.core.windows.net/blah/blah/blah)
k.show()
Can anyone help me how can I check if the file exists or not?
Thanks
You just need to set the default filesystem to your storage account:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter
val conf = new Configuration()
conf.set("fs.defaultFS", "abfss://<container_name>#<account_name>.dfs.core.windows.net")
conf.set("fs.azure.account.auth.type.<container_name>.dfs.core.windows.net", "OAuth")
conf.set("fs.azure.account.oauth.provider.type.<container_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
conf.set("fs.azure.account.oauth2.client.id.<container_name>.dfs.core.windows.net", "<client_id>")
conf.set("fs.azure.account.oauth2.client.secret.<container_name>.dfs.core.windows.net", "<secret>")
conf.set("fs.azure.account.oauth2.client.endpoint.<container_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
val fs= FileSystem.get(conf)
val ostream = fs.create(new Path("/abfss_test.out"))
val pwriter = new PrintWriter(ostream)
try {
pwriter.write("Azure Datalake Gen2 test")
pwriter.write("\n")
}
finally {
pwriter.close()
}
// check if the file we've just created exists
println(fs.exists(new Path("/abfss_test.out")))
I am learning how to read and write from files in HDFS by using Spark/Scala.
I am unable to write in HDFS file, the file is created, but it's empty.
I don't know how to create a loop for writing in a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace:
result.foreach{println}
Then it works!
but when using the method of (saveAsTextFile), then an error message is thrown as
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all what you need to do. You don't need to loop through all the rows.
Hope this helps!
What this does!!!
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
RDD action cannot be triggered from RDD transformations unless special conf set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
I f you need other formats in the file to be written, change in rdd itself before writing.
i am trying to find the list of file in hdfs directory but the code its expecting file as the input when i try to run the below code.
val TestPath2="hdfs://localhost:8020/user/hdfs/QERESULTS1.csv"
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
val hadoopPath = new org.apache.hadoop.fs.Path(TestPath1)
val recursive = true
// val ri = hdfs.listFiles(hadoopPath, recursive)()
//println(hdfs.getChildFileSystems)
//hdfs.get(sc
val ri=hdfs.listFiles(hadoopPath, true)
println(ri)
You should set your default filesystem to hdfs:// first, I seems like your default filesystem is file://
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://some-path")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
...
I want to list all folders within a hdfs directory using Scala/Spark.
In Hadoop I can do this by using the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/
I tried it with:
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)
val path = new Path("hdfs://sandbox.hortonworks.com/demo/")
val files = fs.listFiles(path, false)
But it does not seem that he looks in the Hadoop directory as i cannot find my folders/files.
I also tried with:
FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)
But this also does not help.
Do you have any other idea?
PS: I also checked this thread: Spark iterate HDFS directory but it does not work for me as it does not seem to search on hdfs directory, instead only on the local file system with schema file//.
We are using hadoop 1.4 and it doesn't have listFiles method so we use listStatus to get directories. It doesn't have recursive option but it is easy to manage recursive lookup.
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x=> println(x.getPath))
In Spark 2.0+,
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfs-path}")).filter(_.isDir).map(_.getPath).foreach(println)
Hope this is helpful.
in Ajay Ahujas answer isDir is deprecated..
use isDirectory... pls see complete example and output below.
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object ListHDFSDirectories extends App{
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local[*]").getOrCreate()
val hdfspath = "." // your path here
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}
Result :
file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
I was looking for the same, however instead of HDFS, for S3.
I solved creating the FileSystem with my S3 path as below:
def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
val hadoopConf = sparkContext.hadoopConfiguration
val uri = new URI(path)
FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
_.getPath.toString
}
}
I know this question was related for HDFS, but maybe others like me will come here looking for S3 solution. Since without specifying the URI in FileSystem, it will look for HDFS ones.
java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
.globStatus(new org.apache.hadoop.fs.Path(url))
for (urlStatus <- listStatus) {
println("urlStatus get Path:" + urlStatus.getPath())
}
val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs:FileSystem = projects.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)
This will create an iterator it over org.apache.hadoop.fs.LocatedFileStatus that is your subdirectory
Azure Blog Storage is mapped to a HDFS location, so all the Hadoop Operations
On Azure Portal, go to Storage Account, you will find following details:
Storage account
Key -
Container -
Path pattern – /users/accountsdata/
Date format – yyyy-mm-dd
Event serialization format – json
Format – line separated
Path Pattern here is the HDFS path, you can login/putty to the Hadoop Edge Node and do:
hadoop fs -ls /users/accountsdata
Above command will list all the files. In Scala you can use
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
object HDFSProgram extends App {
val uri = new URI("hdfs://HOSTNAME:PORT")
val fs = FileSystem.get(uri,new Configuration())
val filePath = new Path("/user/hive/")
val status = fs.listStatus(filePath)
status.map(sts => sts.getPath).foreach(println)
}
This is sample code to get list of hdfs files or folder present under /user/hive/
Because you're using Scala, you may also be interested in the following:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!
This will, unfortunately, return the entire output of the command as a string, and so parsing down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands where you could do it in the command line easily and are unsure how to do it in Scala, just use the command line through scala.sys.process._. (Use a single ! if you want to just get the return code.)
The purpose of this is in order to manipulate and save a copy of each data file in a second location in HDFS. I will be using
RddName.coalesce(1).saveAsTextFile(pathName)
to save the result to HDFS.
This is why I want to do each file separately even though I am sure the performance will not be as efficient. However, I have yet to determine how to store the list of CSV file paths into an array of strings and then loop through each one with a separate RDD.
Let us use the following anonymous example as the HDFS source locations:
/data/email/click/date=2015-01-01/sent_20150101.csv
/data/email/click/date=2015-01-02/sent_20150102.csv
/data/email/click/date=2015-01-03/sent_20150103.csv
I know how to list the file paths using Hadoop FS Shell:
HDFS DFS -ls /data/email/click/*/*.csv
I know how to create one RDD for all the data:
val sentRdd = sc.textFile( "/data/email/click/*/*.csv" )
I haven't tested it thoroughly but something like this seems to work:
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator}
import java.net.URI
val path: String = ???
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val iter = hdfs.listFiles(new Path(path), false)
def listFiles(iter: RemoteIterator[LocatedFileStatus]) = {
def go(iter: RemoteIterator[LocatedFileStatus], acc: List[URI]): List[URI] = {
if (iter.hasNext) {
val uri = iter.next.getPath.toUri
go(iter, uri :: acc)
} else {
acc
}
}
go(iter, List.empty[java.net.URI])
}
listFiles(iter).filter(_.toString.endsWith(".csv"))
This is what ultimately worked for me:
import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI
val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
// source data in HDFS
val sourcePath = new Path("/<source_location>/<filename_pattern>")
hdfs.globStatus( sourcePath ).foreach{ fileStatus =>
val filePathName = fileStatus.getPath().toString()
val fileName = fileStatus.getPath().getName()
// < DO STUFF HERE>
} // end foreach loop
sc.wholeTextFiles(path) should help. It gives an rdd of (filepath, filecontent).