Load XML file from HDFS in Scala - scala

I want to load a XML file from HDFS using XML Scala API. I am trying as follows but its not recognizing the path. Could anyone let me know how we can load file from HDFS by using Scala?
import scala.xml.{NodeSeq, XML}
val xml_load = XML.loadFile("hdfs:////user/np.user/raw/xmlfile.xml")

I assume you're using Scala 2.12.x; I also assume those four slashes in hdfs:////user... are typo.
You're using method XML.loadFile(name: String); it internally uses FileInputStream. It's not possible to open an HDFS file with a plain FileInputStream. You need an input stream which supports HDFS. You can find it in org.apache.hadoop:hadoop-hdfs library.
The code then looks like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
// configure properly so the code knows which Hadoop cluster to connect to
// https://hadoop.apache.org/docs/r3.2.0/api/org/apache/hadoop/conf/Configuration.html
val conf = new Configuration()
// obtain input stream instance
val hdfsPath: Path = new Path("hdfs://user/np.user/raw/xmlfile.xml")
val fs: FileSystem = hdfsPath.getFileSystem(conf)
val inputStream: FSDataInputStream = fs.open(hdfsPath)
// load XML
try {
val xml_load = XML.load(inputStream)
} finally {
// close resources; of course, this will silently swallow any exception in close() methods
inputStream.close()
fs.close()
}

Related

How to read files in flink if its raw log file using scala?

I am trying to read raw log file in flink to process it to kinesis .However in flink scala how to read raw files instead of using text file since text file create DataStream[String] which creates issue in formatting.I have tried this but its for readtext:
val inputStream = env.readTextFile(config.input.stream)
inputStream.print()
inputStream.map{str =>
val strByte = str.getBytes()
val thriftSerializer = new LazyBinaryThriftStructSerializer[CollectorPayload] {
override def codec: CollectorPayload.type = CollectorPayload
}
val collectorPayload = thriftSerializer.fromBytes(strByte)
collectorPayload
}.print()
Here it does not convert the data in proper format so wanted to read as binary file is it possible?
Instead of using readTextFile(path) you can use readFile(fileInputFormat, path) and provide a fileInputFormat that meets your needs.
https://ci.apache.org/projects/flink/flink-docs-release-1.2/api/java/org/apache/flink/api/common/io/class-use/FileInputFormat.html

How to read HDFS file from Scala code

I am new to Scala and HDFS:
I am just wondering I am able to read local file from Scala code but how to read from HDFS:
import scala.io.source
object ReadLine {
def main(args:Array[String]) {
if (args.length>0) {
for (line <- Source.fromLine(args(0)).getLine())
println(line)
}
}
in Argument I have passed hdfs://localhost:9000/usr/local/log_data/file1.. But its giving FileNotFoundException error
I am definitely missing something.. can anyone help me out here ?
scala.io.source api cannot read from HDFS. Source is used to read from local file system.
Spark
If you want to read from hdfs then I would recommend to use spark where you would have to use sparkContext.
val lines = sc.textFile(args(0)) //args(0) should be hdfs:///usr/local/log_data/file1
No Spark
If you don't want to use spark then you should go with BufferedReader or StreamReader or hadoop filesystem api. for example
val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))

Will standalone scala program takes advantage of distributed/parallel processing? or does spark Scala require separate code?

First of all sorry for asking the basic doubt over here, but still explanation for the below will be appreciable..
i am very new to scala and spark, so my doubt is if i write a standalone scala program, and execute it on spark(1 master 3 worker), will the scala program takes advantage of disturbed/parallel processing, or should i need to write a separate program to get an advantage of distributed processing??
For example, we have a scala code that process a particular formatted file to comma separated file, it takes a directory as input and parses all file and write an output to single file(each file will be usually 100-200MB). So here is the code.
import scala.io.Source
import java.io.File
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import java.util.Calendar
//import scala.io.Source
//import org.apache.spark.SparkContext
//import org.apache.spark.SparkContext._
//import org.apache.spark.SparkConf
object Parser {
def main(args:Array[String]) {
//val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
//val sc = new SparkContext(conf)
var inp = new File(args(0))
var ext: String = ""
if(args.length == 1)
{ ext = "log" } else { ext = args(1) }
var files: List[String] = List("")
if (inp.exists && inp.isDirectory) {
files = getListOfFiles(inp,ext)
}
else if(inp.exists ) {
files = List(inp.toString)
}
else
{
println("Enter the correct Directory/File name");
System.exit(0);
}
if(files.length <=0 )
{
println(s"No file found with extention '.$ext'")
}
else{
var out_file_name = "output_"+Calendar.getInstance().getTime.toString.replace(" ","-").replace(":","-")+".log"
var data = getHeader(files(0))
var writer=new PrintWriter(new File(out_file_name))
var record_count = 0
//var allrecords = data.mkString(",")+("\n")
//writer.write(allrecords)
for(eachFile <- files)
{
record_count += parseFile(writer,data,eachFile)
}
writer.close()
println(record_count +s" processed into $out_file_name")
}
//all func are defined here.
}
Files from the specific dir are read using scala.io
Source.fromFile(file).getLines
So my doubt is will the above code(standalone prg) can be executed on distributed spark system? will i get an advantage of parallel processing??
ok, how about using sc to read file, will it then uses distributed processing
val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
val sc = new SparkContext(conf)
...
...
for(eachFile <- files)
{
record_count += parseFile(sc,writer,data,eachFile)
}
------------------------------------
def parseFile(......)
sc.textFile(file).getLines
So if i edit the top code to make use of sc then will it process on distributes spark system.
No it won't. To make use of distributed computing using Spark, you need to use SparkContext.
If you run the application you have provided using spark-submit you will not be using the Spark cluster at all. You have to rewrite it to use the SparkContext. Please read through the Spark Programming Guide.
It is extremely helpful to watch some introductory videos on Youtube for getting to know how Apache Spark works in general.
For example, these:
https://www.youtube.com/watch?v=7k4yDKBYOcw
https://www.youtube.com/watch?v=rvDpBTV89AM&list=PLF6snu5Jy-v-WRAcCfWNHks7lcNO-zrTI&index=4
Is is very important to understand it for using Spark.
"advantage of distributed processing"
Using Spark can give you advantages of distributing processing on multiple server cluster. So if you are going to move your application later to the cluster, it makes sense to develop application using Spark model and corresponding API.
Well, you can run Spark application locally on your local machine but in this case you won't get all the advantages the Spark can provide.
Anyway, as it is said before, Spark is a special framework with its own libraries for developtment. So you have to rewrite your application using Spark context and Spark API, i.e. special objects like RDDs or Dataframes and corresponding methods.

How to get the file name from DStream of Spark StreamingContext?

Event after lots of try and googling, could not get the fileName, if I am use the streaming context. I can use the wholeTextFiles of SparkContext but, then I have to re-implement the streaming context's functionality.
Note: FileName (error events as json file) is the input to the system, so retaining the name in the output is extremely important so that any event can be traced during audit.
Note: FileName is of the format below. SerialNumber part can be extracted from the event json, but time is stored as milliseconds and difficult to get in below format in a reliable way and no way to find the counter.
...
Each file contains just one line as a complex json string. Using the streaming context I am able to create a RDD[String], where each string is a json string from a single file. Can any one have any solution/workaround for associating the strings with the respective file name.
val sc = new SparkContext("local[*]", "test")
val ssc = new StreamingContext(sc, Seconds(4))
val dStream = ssc.textFileStream(pathOfDirToStream)
dStream.foreachRDD { eventsRdd => /* How to get the file name */ }
You could do this using fileStream and creating your own FileInputFormat, similar to TextInputFormat which uses the InputSplit to provide the filename as a Key. Then you can use fileStream to get a DStream with filename and line.
Hi to get file names from DStream I have created a java function which fetch file path using java spark api and than in spark-streaming(which is written in scala) i have called that function.
Here is a java Code sample:
import java.io.Serializable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.NewHadoopPartition;
import org.apache.spark.rdd.UnionPartition;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.Partition;
public class GetFileNameFromStream implements Serializable{
public String getFileName(Partition partition)
{
UnionPartition upp = (UnionPartition)partition;
NewHadoopPartition npp = (NewHadoopPartition) upp.parentPartition();
String filePath=npp.serializableHadoopSplit().value().toString();
return filePath;
}
}
In spark streaming, i have called above java function
Here is a code sample
val obj =new GetFileNameFromStream
dstream.transform(rdd=>{
val lenPartition = rdd.partitions.length
val listPartitions = rdd.partitions
for(part <-listPartitions){
var filePath=obj.getFileName(part)
})

Spark Scala list folders in directory

I want to list all folders within a hdfs directory using Scala/Spark.
In Hadoop I can do this by using the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/
I tried it with:
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)
val path = new Path("hdfs://sandbox.hortonworks.com/demo/")
val files = fs.listFiles(path, false)
But it does not seem that he looks in the Hadoop directory as i cannot find my folders/files.
I also tried with:
FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)
But this also does not help.
Do you have any other idea?
PS: I also checked this thread: Spark iterate HDFS directory but it does not work for me as it does not seem to search on hdfs directory, instead only on the local file system with schema file//.
We are using hadoop 1.4 and it doesn't have listFiles method so we use listStatus to get directories. It doesn't have recursive option but it is easy to manage recursive lookup.
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x=> println(x.getPath))
In Spark 2.0+,
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfs-path}")).filter(_.isDir).map(_.getPath).foreach(println)
Hope this is helpful.
in Ajay Ahujas answer isDir is deprecated..
use isDirectory... pls see complete example and output below.
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object ListHDFSDirectories extends App{
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local[*]").getOrCreate()
val hdfspath = "." // your path here
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}
Result :
file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
I was looking for the same, however instead of HDFS, for S3.
I solved creating the FileSystem with my S3 path as below:
def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
val hadoopConf = sparkContext.hadoopConfiguration
val uri = new URI(path)
FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
_.getPath.toString
}
}
I know this question was related for HDFS, but maybe others like me will come here looking for S3 solution. Since without specifying the URI in FileSystem, it will look for HDFS ones.
java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
.globStatus(new org.apache.hadoop.fs.Path(url))
for (urlStatus <- listStatus) {
println("urlStatus get Path:" + urlStatus.getPath())
}
val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs:FileSystem = projects.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)
This will create an iterator it over org.apache.hadoop.fs.LocatedFileStatus that is your subdirectory
Azure Blog Storage is mapped to a HDFS location, so all the Hadoop Operations
On Azure Portal, go to Storage Account, you will find following details:
Storage account
Key -
Container -
Path pattern – /users/accountsdata/
Date format – yyyy-mm-dd
Event serialization format – json
Format – line separated
Path Pattern here is the HDFS path, you can login/putty to the Hadoop Edge Node and do:
hadoop fs -ls /users/accountsdata
Above command will list all the files. In Scala you can use
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
object HDFSProgram extends App {
val uri = new URI("hdfs://HOSTNAME:PORT")
val fs = FileSystem.get(uri,new Configuration())
val filePath = new Path("/user/hive/")
val status = fs.listStatus(filePath)
status.map(sts => sts.getPath).foreach(println)
}
This is sample code to get list of hdfs files or folder present under /user/hive/
Because you're using Scala, you may also be interested in the following:
import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!
This will, unfortunately, return the entire output of the command as a string, and so parsing down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands where you could do it in the command line easily and are unsure how to do it in Scala, just use the command line through scala.sys.process._. (Use a single ! if you want to just get the return code.)