What is the correct way to check whether a folder exists on an ADLS Gen2 account in Scala?

I am working in a Scala and Spark environment where I want to read a Parquet file. Before reading it, I want to check whether the file exists. I wrote the following code in a Jupyter notebook, but it does not work: no data frame is shown, because the function testDirExist returns false.
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def testDirExist(path: String): Boolean = {
val p = new Path(path)
hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
}
val pt = "abfss://container@account.dfs.core.windows.net/blah/blah/blah"
val exists = testDirExist(pt)
if(exists)
{
val dataframe = spark.read.parquet(pt)
dataframe.show()
}
However, the following code works and shows the data frame:
val k = spark.read.parquet("abfss://container@account.dfs.core.windows.net/blah/blah/blah")
k.show()
Can anyone tell me how I can check whether the file exists?
Thanks

You just need to set the default filesystem to your storage account:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter
val conf = new Configuration()
conf.set("fs.defaultFS", "abfss://<container_name>@<account_name>.dfs.core.windows.net")
conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "<client_id>")
conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "<secret>")
conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
val fs= FileSystem.get(conf)
val ostream = fs.create(new Path("/abfss_test.out"))
val pwriter = new PrintWriter(ostream)
try {
pwriter.write("Azure Datalake Gen2 test")
pwriter.write("\n")
}
finally {
pwriter.close()
}
// check if the file we've just created exists
println(fs.exists(new Path("/abfss_test.out")))
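A likely reason the check in the question returns false is that FileSystem.get(conf) hands back the filesystem named by fs.defaultFS, which is not necessarily the ADLS account, whereas spark.read.parquet resolves the filesystem from the path's own URI. As an alternative to changing the default filesystem, the filesystem can be resolved from the path itself with Path.getFileSystem. The sketch below assumes a helper name (dirExists) that is not part of any library, and it runs against a local temporary directory so that it works without Azure credentials.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import java.nio.file.Files

// Hypothetical helper: resolves the filesystem from the path's own scheme
// (file://, hdfs://, abfss://, ...) instead of relying on fs.defaultFS.
def dirExists(path: String, conf: Configuration): Boolean = {
  val p = new Path(path)
  val fs = p.getFileSystem(conf)
  fs.exists(p) && fs.getFileStatus(p).isDirectory
}

// Demonstration against the local filesystem (an assumption made so the
// sketch runs without Azure credentials).
val conf = new Configuration()
val tmpDir = Files.createTempDirectory("dir_exists_demo")
val existing = dirExists(tmpDir.toUri.toString, conf)
val missing = dirExists(tmpDir.resolve("no_such_dir").toUri.toString, conf)
println(s"existing=$existing, missing=$missing")
```

In a Spark notebook the same helper could be called with the session's own configuration, for example dirExists(pt, spark.sparkContext.hadoopConfiguration), provided the account's authentication settings are present in that configuration.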

Related

Reading a File from HDFS, Scala Spark

I am trying to read a file from HDFS, but I am having a problem. The file might not exist, so I have to check for it first. If the file exists I read it; otherwise I use an empty DataFrame.
So what I am trying is:
val fs: FileSystem = FileSystem.get(new URI(path), new Configuration())
if (fs.exists(new org.apache.hadoop.fs.Path(path))) {
val df6 = spark.read.parquet(path)
} else {
val df6 = df1.limit(0)
}
df6.show()
But I am getting the following error on Jupyter:
Message: <console>:28: error: not found: type FileSystem
What am I doing wrong?
Try something like this (adjusted to your setup):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.net.URI
import scala.io.Source
val hdfs = FileSystem.get(new URI("hdfs://cluster:8020/"), new Configuration())
val path = new Path("/HDFS/FILE/LOCATION")
val stream = hdfs.open(path)
val temp = Source.fromInputStream(stream).getLines()
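Apart from the missing imports, the snippet in the question declares val df6 inside each branch, so the name is not visible after the if/else. Because if/else is an expression in Scala, its result can be bound to a single val instead. The sketch below wraps that idea in a helper; the name readIfExists is hypothetical, and the demonstration uses a local temporary file and plain strings in place of DataFrames so that it runs without a cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import java.nio.file.Files

// Hypothetical helper: runs `read` when the path exists, otherwise evaluates
// `fallback`. The if/else is an expression, so its value is the result.
def readIfExists[A](path: String, conf: Configuration)(read: String => A)(fallback: => A): A = {
  val p = new Path(path)
  if (p.getFileSystem(conf).exists(p)) read(path) else fallback
}

// Local demonstration; the strings stand in for the two DataFrames.
val conf = new Configuration()
val tmpFile = Files.createTempFile("read_if_exists", ".txt")
val found = readIfExists(tmpFile.toUri.toString, conf)(_ => "read")("fallback")
val notFound = readIfExists(tmpFile.toUri.toString + ".missing", conf)(_ => "read")("fallback")
println(s"$found, $notFound")

// With Spark this might look like:
//   val df6 = readIfExists(path, spark.sparkContext.hadoopConfiguration)(p => spark.read.parquet(p))(df1.limit(0))
//   df6.show()
```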

Smile - Model Persistence - How to write models to HDFS?

I am trying to use Smile in my Scala project which uses Spark and HDFS. For reusability of my models, I need to write them to HDFS.
Right now I am using the write object, checking beforehand whether the path exists and creating it if it does not (otherwise a FileNotFoundException is thrown):
import java.io.File
import java.nio.file.{Path, Paths}
val path: String = "hdfs:/my/hdfs/path"
val outputPath: Path = Paths.get(path)
val outputFile: File = outputPath.toFile
if(!outputFile.exists()) {
outputFile.getParentFile().mkdirs(); // This is a no-op if it exists
outputFile.createNewFile();
}
write(mySmileModel, path)
But this creates the path "hdfs:/my/hdfs/path" locally and writes the model there, instead of actually writing to HDFS.
Note that using a spark model and its save method works:
mySparkModel.save("hdfs:/my/hdfs/path")
Therefore my question: How to write a Smile model to HDFS?
Similarly, if I manage to write a model to HDFS, I will probably also wonder how to read a model from HDFS.
Thanks!
In the end, I solved my problem by writing my own save method for my wrapper class, which roughly amounts to:
import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.net.URI
val path: String = "/my/hdfs/path"
val file: Path = new Path(path)
val conf: Configuration = new Configuration()
val hdfs: FileSystem = FileSystem.get(new URI(path), conf)
val outputStream: FSDataOutputStream = hdfs.create(file)
val objectOutputStream: ObjectOutputStream = new ObjectOutputStream(outputStream)
objectOutputStream.writeObject(model)
objectOutputStream.close()
Similarly, for loading the saved model I wrote a method doing roughly the following:
val conf: Configuration = new Configuration()
val path: String = "/my/hdfs/path"
val hdfs: FileSystem = FileSystem.get(new URI(path), conf)
val inputStream: FSDataInputStream = hdfs.open(new Path(path))
val objectInputStream: ObjectInputStream = new ObjectInputStream(inputStream)
val model: RandomForest = objectInputStream.readObject().asInstanceOf[RandomForest]
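The save and load steps above can be sketched as a pair of helpers that close their streams in a finally block, so the file is flushed and released even if serialization fails. The names saveObject and loadObject are hypothetical, the object being stored must be java.io.Serializable, and an array of doubles stands in for the Smile model. The demonstration writes to a local temporary directory; an hdfs:// path should work the same way if the Hadoop configuration can reach the cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.nio.file.Files

// Hypothetical helper: serializes `obj` to `path`, overwriting an existing file.
def saveObject(obj: AnyRef, path: String, conf: Configuration): Unit = {
  val p = new Path(path)
  val out = new ObjectOutputStream(p.getFileSystem(conf).create(p))
  try out.writeObject(obj) finally out.close()
}

// Hypothetical helper: reads the object back; the cast to A is unchecked.
def loadObject[A](path: String, conf: Configuration): A = {
  val p = new Path(path)
  val in = new ObjectInputStream(p.getFileSystem(conf).open(p))
  try in.readObject().asInstanceOf[A] finally in.close()
}

// Local round trip; `weights` is a stand-in for a trained model.
val conf = new Configuration()
val target = Files.createTempDirectory("model_demo").resolve("model.bin").toUri.toString
val weights = Array(0.25, 0.5, 0.75)
saveObject(weights, target, conf)
val restored = loadObject[Array[Double]](target, conf)
println(restored.mkString(","))
```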

Using iterated writing in HDFS file by using Spark/Scala

I am learning how to read and write from files in HDFS by using Spark/Scala.
I am unable to write to a file in HDFS: the file is created, but it is empty.
I don't know how to write a loop that writes to the file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace it with:
result.foreach{println}
then it works!
But when I use saveAsTextFile, the following error is thrown:
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all you need to do. You don't need to loop through all the rows.
Hope this helps!
Here is the problem with this code:
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
An RDD action cannot be triggered from inside another RDD operation unless a special configuration is set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
If you need the file to be written in a different format, transform the RDD itself before writing.
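As an illustration of changing the format before writing: saveAsTextFile writes each element using its toString, so an RDD of maps would produce lines such as Map(AGE -> 39, ...). One option is to map every row to a plain line first. The helper name toCsvLine and the sample column WORKCLASS below are assumptions made for the example; only AGE appears in the question.

```scala
// Hypothetical helper: renders one row as a comma-separated line,
// with the values in header order (missing columns become empty strings).
def toCsvLine(header: Seq[String], row: Map[String, String]): String =
  header.map(col => row.getOrElse(col, "")).mkString(",")

// Small local check with made-up sample data.
val sampleHeader = Seq("AGE", "WORKCLASS")
val sampleRow = Map("WORKCLASS" -> "Private", "AGE" -> "39")
val line = toCsvLine(sampleHeader, sampleRow)
println(line)

// With the question's RDD this would be applied before the single save call:
//   result.map(row => toCsvLine(header, row)).saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
```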

Read the data from HDFS using Scala

I am new to Scala. How can I read a file from HDFS using Scala (not using Spark)?
When I searched for it, I only found examples of writing to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.PrintWriter;
/**
* @author ${user.name}
*/
object App {
//def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)
def main(args : Array[String]) {
println( "Trying to write to HDFS..." )
val conf = new Configuration()
//conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
val fs= FileSystem.get(conf)
val output = fs.create(new Path("/tmp/mySample.txt"))
val writer = new PrintWriter(output)
try {
writer.write("this is a test")
writer.write("\n")
}
finally {
writer.close()
println("Closed!")
}
println("Done!")
}
}
Please help me. How can I read or load a file from HDFS using Scala?
One of the ways (kinda in functional style) could be like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.collection.immutable.Stream
val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
// This example checks each line for null and prints every existing line sequentially
readLines.takeWhile(_ != null).foreach(line => println(line))
You could also take a look at this article, or here and here; these questions look related to yours and contain working (but more Java-like) code examples, if you're interested.
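One caveat with the snippet above: readLine on the Hadoop input stream is inherited from java.io.DataInputStream, where it is deprecated because it does not convert bytes to characters properly, and Stream itself is deprecated in Scala 2.13 in favour of LazyList. A variant that avoids both is to wrap the stream in scala.io.Source and close it when done. The sketch below uses a hypothetical helper name (readAllLines), assumes UTF-8 content, and reads from a local temporary file so it runs without a cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.Source

// Hypothetical helper: reads all lines eagerly, then closes the stream.
def readAllLines(path: String, conf: Configuration): List[String] = {
  val p = new Path(path)
  val in = p.getFileSystem(conf).open(p)
  try Source.fromInputStream(in, "UTF-8").getLines().toList
  finally in.close()
}

// Local demonstration file; replace with an hdfs:// URI on a real cluster.
val conf = new Configuration()
val sample = Files.createTempFile("read_demo", ".txt")
Files.write(sample, "this is a test\nsecond line\n".getBytes(StandardCharsets.UTF_8))
val lines = readAllLines(sample.toUri.toString, conf)
lines.foreach(println)
```

Reading everything into a List is only reasonable for small files; for large ones the lines would be better processed inside the try block.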

How can one list all csv files in an HDFS location within the Spark Scala shell?

The purpose is to manipulate each data file and save a copy of it in a second location in HDFS. I will be using
RddName.coalesce(1).saveAsTextFile(pathName)
to save the result to HDFS.
This is why I want to do each file separately even though I am sure the performance will not be as efficient. However, I have yet to determine how to store the list of CSV file paths into an array of strings and then loop through each one with a separate RDD.
Let us use the following anonymous example as the HDFS source locations:
/data/email/click/date=2015-01-01/sent_20150101.csv
/data/email/click/date=2015-01-02/sent_20150102.csv
/data/email/click/date=2015-01-03/sent_20150103.csv
I know how to list the file paths using Hadoop FS Shell:
hdfs dfs -ls /data/email/click/*/*.csv
I know how to create one RDD for all the data:
val sentRdd = sc.textFile( "/data/email/click/*/*.csv" )
I haven't tested it thoroughly but something like this seems to work:
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator}
import java.net.URI
val path: String = ???
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val iter = hdfs.listFiles(new Path(path), false)
def listFiles(iter: RemoteIterator[LocatedFileStatus]) = {
def go(iter: RemoteIterator[LocatedFileStatus], acc: List[URI]): List[URI] = {
if (iter.hasNext) {
val uri = iter.next.getPath.toUri
go(iter, uri :: acc)
} else {
acc
}
}
go(iter, List.empty[java.net.URI])
}
listFiles(iter).filter(_.toString.endsWith(".csv"))
This is what ultimately worked for me:
import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI
val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
// source data in HDFS
val sourcePath = new Path("/<source_location>/<filename_pattern>")
hdfs.globStatus( sourcePath ).foreach{ fileStatus =>
val filePathName = fileStatus.getPath().toString()
val fileName = fileStatus.getPath().getName()
// < DO STUFF HERE>
} // end foreach loop
sc.wholeTextFiles(path) should help. It gives an rdd of (filepath, filecontent).
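To get the matching paths into an array of strings, as the question asks, the globStatus approach can be wrapped in a small helper. The sketch below is one way to do it under stated assumptions: globPaths is a hypothetical name, and the directory layout is recreated in a local temporary directory so the example runs without HDFS.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import java.nio.file.Files

// Hypothetical helper: expands a glob pattern into a sorted array of file paths.
def globPaths(pattern: String, conf: Configuration): Array[String] = {
  val p = new Path(pattern)
  // globStatus can return null when a pattern-free path does not exist.
  val statuses = Option(p.getFileSystem(conf).globStatus(p)).getOrElse(Array.empty[FileStatus])
  statuses.filter(_.isFile).map(_.getPath.toString).sorted
}

// Local stand-in for /data/email/click/date=.../sent_....csv
val base = Files.createTempDirectory("glob_demo")
for (day <- Seq("01", "02")) {
  val dir = Files.createDirectories(base.resolve(s"date=2015-01-$day"))
  Files.createFile(dir.resolve(s"sent_201501$day.csv"))
}
Files.createFile(base.resolve("date=2015-01-01").resolve("notes.txt"))

val conf = new Configuration()
val root = base.toUri.toString.stripSuffix("/")
val csvPaths = globPaths(s"$root/*/*.csv", conf)
csvPaths.foreach(println)

// In the Spark shell, each path could then get its own RDD, for example:
//   csvPaths.foreach { p => val rdd = sc.textFile(p) }
```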