Writing a file to S3 using FileSystem (Scala)

I'm using Scala and trying to write a file with string content to S3.
I tried to do it with FileSystem, but I'm getting the error:
"Wrong FS: s3a"
val content = "blabla"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val s3Path: Path = new Path("s3a://bucket/ha/fileTest.txt")
val localPath= new Path("/tmp/fileTest.txt")
val os = fs.create(localPath)
os.write(content.getBytes)
fs.copyFromLocalFile(localPath,s3Path)
and I'm getting the error:
java.lang.IllegalArgumentException: Wrong FS: s3a://...txt, expected: file:///
What is wrong?
Thanks!!

You need to ask for the FileSystem for that specific scheme; then you can create a text file directly on the remote store.
val s3Path: Path = new Path("s3a://bucket/ha/fileTest.txt")
val fs = s3Path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val os = fs.create(s3Path, true)
os.write("hi".getBytes)
os.close()
There's no need to write locally and upload; the s3a connector will buffer and upload as needed.
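Putting the answer together, here is a minimal sketch (the bucket/key is a placeholder, and spark is assumed to be an active SparkSession) that writes the string straight to S3, closing the stream in a finally block so the upload completes even if the write throws:

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

val s3Path = new Path("s3a://bucket/ha/fileTest.txt") // placeholder bucket/key
val fs: FileSystem = s3Path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val os = fs.create(s3Path, true) // overwrite = true
try {
  os.write("blabla".getBytes(StandardCharsets.UTF_8))
} finally {
  os.close() // close() flushes the buffered data and completes the S3 upload
}
```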

Related

How to copy a file from local to s3 using spark as a single file with the given name?

I am looking for ways to move a file from the local filesystem to S3 with the Spark APIs, keeping the given file name. The code below creates a part-00000 file instead. How can I avoid this?
val df = spark.read.textFile("readPath")
df.coalesce(1).write.mode("overwrite")
  .save("path")
This worked:
val dstPath = new org.apache.hadoop.fs.Path(inputFilePath)
val conf = spark.sparkContext.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
fs.setVerifyChecksum(false)
org.apache.hadoop.fs.FileUtil.copy(
  new java.io.File(localfilePath),
  dstPath.getFileSystem(conf),
  dstPath,
  true, // delete the source after copying
  conf
)

Read data from HDFS

I'm using FSDataInputStream to read data from HDFS.
The following is the snippet I'm using:
val fs = FileSystem.get(new java.net.URI(HDFS_URI), new Configuration()) // HDFS_URI: your namenode URI
val stream = fs.open(new Path(PATH)) // PATH: path to the file
val reader = new BufferedReader(new InputStreamReader(stream))
val offset: String = reader.readLine() // reads the string "5432" stored in the file
Expected output is "5432".
But the actual output is "^#5^#4^#3^#2"
I'm not able to trim the "^#" since they are not treated as regular characters. Please help with an appropriate solution.
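Those control characters are almost certainly interleaved NUL bytes, which usually means the file was written in UTF-16 while InputStreamReader is decoding it with the platform default charset. A hedged sketch of the fix (assuming the file really is UTF-16; a ByteArrayInputStream stands in for the HDFS stream here):

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

// Stand-in for the HDFS stream: the file content "5432" encoded as UTF-16.
// Decoded as single-byte text, every other byte shows up as a NUL control char.
val bytes = "5432".getBytes(StandardCharsets.UTF_16)

// Fix: tell InputStreamReader the real charset instead of relying on the default
val reader = new BufferedReader(
  new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_16))
val offset: String = reader.readLine() // "5432", with no stray control characters
```

With the real HDFS stream, the only change needed is passing the charset as the second argument to InputStreamReader.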

Error while moving a file into HDFS from Local using Scala

I have a Scala list: fileNames, that contains the file names present on a local directory.
Ex:
fileNames(2)
res0: String = file:///tmp/audits/xx_user.log
I am trying to move a file from the list fileNames from local to HDFS using Scala. In order to do that, I followed the steps below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.commons.io.IOUtils;
val hadoopconf = new Configuration();
hadoopconf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
val fs = FileSystem.get(hadoopconf);
val outFileStream = fs.create(new Path("hdfs://mydev/user/devusr/testfolder"))
The code works fine up to here. When I try to add the input stream, I get the error below:
val inStream = fs.open(new Path(fileNames(2)))
java.lang.IllegalArgumentException: Wrong FS: file:/tmp/audits/xx_user.log, expected: hdfs://mergedev
I also tried by specifying the file name directly and result is same:
val inStream = fs.open(new Path("file:///tmp/audits/xx_user.log"))
java.lang.IllegalArgumentException: Wrong FS: file:/tmp/audits/xx_user.log, expected: hdfs://mergedev
But when I try to load the file directly into spark, it is working fine:
val localToSpark = spark.read.text(fileNames(2))
localToSpark: org.apache.spark.sql.DataFrame = [value: string]
localToSpark.collect
res1: Array[org.apache.spark.sql.Row] = Array([[Wed Dec 20 06:18:02 UTC 2017] INFO: ], [*********************************************************************************************************], [ ], [[Wed Dec 20 06:18:02 UTC 2017] INFO: Diagnostic log for xx_user.]
Could anyone tell me what mistake I am making at this point, for which I am getting the error:
val inStream = fs.open(new Path(fileNames(2)))
For small files, copyFromLocalFile() is sufficient:
fs.copyFromLocalFile(new Path(localFileName), new Path(hdfsFileName))
For large files, it's more efficient to use Apache Commons-IO:
IOUtils.copyLarge(
  new FileInputStream(new File(localFileName)),
  fs.create(new Path(hdfsFileName)))
Keep in mind that the local file name should not include the protocol (so no file:/// in there).
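Since the entries in fileNames carry the scheme, they can be normalized before being handed to copyFromLocalFile. A small sketch (the copy call is commented out because fs and hdfsFileName come from the surrounding code; stripPrefix is a no-op when the prefix is absent):

```scala
// Entries in fileNames look like "file:///tmp/audits/xx_user.log", but
// copyFromLocalFile wants a plain local path, so strip the scheme first.
val raw = "file:///tmp/audits/xx_user.log"
val localFileName = raw.stripPrefix("file://") // "/tmp/audits/xx_user.log"
// fs.copyFromLocalFile(new Path(localFileName), new Path(hdfsFileName))
```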

Hdfs file list in scala

I am trying to list the files in an HDFS directory, but the code expects a file as the input when I try to run the code below.
val TestPath2 = "hdfs://localhost:8020/user/hdfs/QERESULTS1.csv"
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
val hadoopPath = new org.apache.hadoop.fs.Path(TestPath2)
val recursive = true
val ri = hdfs.listFiles(hadoopPath, recursive)
println(ri)
You should set your default filesystem to hdfs:// first; it seems your default filesystem is file://.
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://some-path")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
...

Spark Scala list folders in directory

I want to list all folders within a hdfs directory using Scala/Spark.
In Hadoop I can do this by using the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/
I tried it with:
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)
val path = new Path("hdfs://sandbox.hortonworks.com/demo/")
val files = fs.listFiles(path, false)
But it does not seem to look in the HDFS directory, as I cannot find my folders/files.
I also tried with:
FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)
But this also does not help.
Do you have any other idea?
PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me, as it does not seem to search the HDFS directory, only the local file system with the scheme file://.
We are using Hadoop 1.4, which doesn't have the listFiles method, so we use listStatus to get the directories. It doesn't have a recursive option, but recursive lookup is easy to manage.
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x => println(x.getPath))
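The recursive lookup mentioned above can be sketched like this (a hedged sketch: listRecursive is a hypothetical helper, and it uses the newer isDirectory; on Hadoop 1.x the same check is spelled isDir):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Walk the tree with listStatus, since this Hadoop version
// has no recursive listFiles. Directories are included in the result.
def listRecursive(fs: FileSystem, path: Path): Seq[Path] =
  fs.listStatus(path).toSeq.flatMap { status =>
    if (status.isDirectory) status.getPath +: listRecursive(fs, status.getPath)
    else Seq(status.getPath)
  }
```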
In Spark 2.0+,
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val hdfsPath = "your hdfs path here"
fs.listStatus(new Path(hdfsPath)).filter(_.isDir).map(_.getPath).foreach(println)
Hope this is helpful.
In Ajay Ahuja's answer, isDir is deprecated; use isDirectory instead. Please see the complete example and output below.
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

object ListHDFSDirectories extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()
  val hdfspath = "." // your path here
  import org.apache.hadoop.fs.{FileSystem, Path}
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}
Result:
file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
I was looking for the same, but for S3 instead of HDFS.
I solved it by creating the FileSystem with my S3 path, as below:
def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
  val hadoopConf = sparkContext.hadoopConfiguration
  val uri = new URI(path)
  FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
    _.getPath.toString
  }
}
I know this question was about HDFS, but maybe others like me will come here looking for an S3 solution. Without specifying the URI when getting the FileSystem, it will look up the default (HDFS) one and fail with:
java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
  .globStatus(new org.apache.hadoop.fs.Path(url))
for (urlStatus <- listStatus) {
  println("urlStatus get Path: " + urlStatus.getPath())
}
val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)
This creates an iterator it over org.apache.hadoop.fs.LocatedFileStatus entries, which are the contents of your directory.
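Note that listLocatedStatus returns a Hadoop RemoteIterator, which is not a Scala Iterator and has no foreach or map, so it has to be drained by hand. A hedged sketch continuing the snippet above:

```scala
// RemoteIterator only exposes hasNext/next, so drain it with a while loop
while (it.hasNext) {
  val status = it.next()
  if (status.isDirectory) println(status.getPath) // keep only subdirectories
}
```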
Azure Blob Storage is mapped to an HDFS location, so all the Hadoop operations work on it.
On the Azure Portal, go to the Storage Account; you will find the following details:
Storage account –
Key –
Container –
Path pattern – /users/accountsdata/
Date format – yyyy-mm-dd
Event serialization format – json
Format – line separated
The Path pattern here is the HDFS path. You can log in (e.g. via PuTTY) to the Hadoop edge node and run:
hadoop fs -ls /users/accountsdata
The above command will list all the files. In Scala you can use:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
object HDFSProgram extends App {
  val uri = new URI("hdfs://HOSTNAME:PORT")
  val fs = FileSystem.get(uri, new Configuration())
  val filePath = new Path("/user/hive/")
  val status = fs.listStatus(filePath)
  status.map(sts => sts.getPath).foreach(println)
}
This is sample code to get the list of HDFS files or folders present under /user/hive/.
Because you're using Scala, you may also be interested in the following:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!
This will, unfortunately, return the entire output of the command as a string, so parsing down to just the file names requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands that you could do easily on the command line and are unsure how to do in Scala, just use the command line through scala.sys.process._. (Use a single ! if you just want the return code.)
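If you do go the shell route, extracting just the paths can be sketched as below. The sample output is an assumption about the usual -ls format: a "Found N items" header, then one line per entry whose last whitespace-separated column is the path.

```scala
// Stand-in for the string returned by Seq("hadoop", "fs", "-ls", ...).!!
val lsResult =
  """Found 2 items
    |drwxr-xr-x   - hdfs hdfs          0 2017-12-20 06:18 /demo/dir1
    |-rw-r--r--   3 hdfs hdfs       1234 2017-12-20 06:18 /demo/file1.txt""".stripMargin

val paths = lsResult.linesIterator
  .filterNot(_.startsWith("Found")) // drop the "Found N items" header
  .map(_.split("\\s+").last)        // the path is the last column
  .toList
// paths == List("/demo/dir1", "/demo/file1.txt")
```

This is fragile compared to fs.listStatus (it breaks on paths containing spaces), which is another reason to prefer the FileSystem API.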