Reading a File from HDFS, Scala Spark - scala

I am trying to read a file from HDFS but I am having a problem here. The file couldn't exists so for that reason I have to check if exists. If the file exists I read that file, otherwise I read an empty DF.
So what I am trying is:
val fs: FilySystem = FileSystem.get(new URI(path), new Configuration())
if (fs.exists(new org.apache.hadoop.fs.Path(s"$Path"))) {
val df6 = spark.read.parquet(path)
} else {
val df6 = df1.limit(0)
}
val df6.show()
But I am getting the following error on Jupyter:
Message: <console>:28: error: not found: type FileSystem
What I am doing wrong?

Try something like this (with your adjustment) -
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.net.URI
import scala.io.Source
val hdfs = FileSystem.get(new URI("hdfs://cluster:8020/"), new Configuration())
val path = new Path("/HDFS/FILE/LOCATION")
val stream = hdfs.open(path)
val temp = Source.fromInputStream(stream).getLines()

Related

What is the correct way to identify if the folder exist on ADLS gen 2 account or not

I am working in scala and spark environment where I want to read parquet file. Before I read, I want to check if the file exists or not. I am writing the following code in jupyter notebook but it does not work - meaning it does not show any frame because the function testDirExist returns false
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def testDirExist(path: String): Boolean = {
val p = new Path(path)
hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
}
val pt = "abfss://container#account.dfs.core.windows.net/blah/blah/blah
val exists = testDirExist(pt)
if(exists)
{
val dataframe = spark.read.parquet(pt)
dataframe.show()
}
However, the following code works. It shows data frame
val k = spark.read.parquet("abfss://container#account.dfs.core.windows.net/blah/blah/blah)
k.show()
Can anyone help me how can I check if the file exists or not?
Thanks
You just need to set the default filesystem to your storage account:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter
val conf = new Configuration()
conf.set("fs.defaultFS", "abfss://<container_name>#<account_name>.dfs.core.windows.net")
conf.set("fs.azure.account.auth.type.<container_name>.dfs.core.windows.net", "OAuth")
conf.set("fs.azure.account.oauth.provider.type.<container_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
conf.set("fs.azure.account.oauth2.client.id.<container_name>.dfs.core.windows.net", "<client_id>")
conf.set("fs.azure.account.oauth2.client.secret.<container_name>.dfs.core.windows.net", "<secret>")
conf.set("fs.azure.account.oauth2.client.endpoint.<container_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
val fs= FileSystem.get(conf)
val ostream = fs.create(new Path("/abfss_test.out"))
val pwriter = new PrintWriter(ostream)
try {
pwriter.write("Azure Datalake Gen2 test")
pwriter.write("\n")
}
finally {
pwriter.close()
}
// check if the file we've just created exists
println(fs.exists(new Path("/abfss_test.out")))

Using iterated writing in HDFS file by using Spark/Scala

I am learning how to read and write from files in HDFS by using Spark/Scala.
I am unable to write in HDFS file, the file is created, but it's empty.
I don't know how to create a loop for writing in a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace:
result.foreach{println}
Then it works!
but when using the method of (saveAsTextFile), then an error message is thrown as
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all what you need to do. You don't need to loop through all the rows.
Hope this helps!
What this does!!!
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
RDD action cannot be triggered from RDD transformations unless special conf set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
I f you need other formats in the file to be written, change in rdd itself before writing.

HDFS : java.io.FileNotFoundException : File does not exist: name._COPYING

I'm working with Spark Streaming using Scala. I need to read a .csv file dinamically from HDFS directory with this line:
val lines = ssc.textFileStream("/user/root/")
I use the following command line to put the file into HDFS:
hdfs dfs -put ./head40k.csv
It works fine with a relatively small file.
When I try with a larger one, I get this error:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/root/head800k.csv._COPYING
I can understand why, but I don't know how to fix it. I've tried this solution too:
hdfs dfs -put ./head800k.csv /user
hdfs dfs -mv /usr/head800k.csv /user/root
but my program doesn't read the file.
Any ideas?
Thanks in advance
PROGRAM:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
import scala.sys.process._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import StreamingContext._
object Traccia2014{
def main(args: Array[String]){
if (args.length < 2) {
System.err.println(s"""
|Usage: DirectKafkaWordCount <brokers> <test><topicRisultato>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
|
""".stripMargin)
System.exit(1)
}
val Array(brokers,risultato) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val lines = ssc.textFileStream("/user/root/")
//val lines= ssc.fileStream[LongWritable, Text, TextInputFormat](directory="/user/root/",
// filter = (path: org.apache.hadoop.fs.Path) => //(!path.getName.endsWith("._COPYING")),newFilesOnly = true)
//********** Definizioni Producer***********
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val slice=30
lines.foreachRDD( rdd => {
if(!rdd.isEmpty){
val min=rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
if(!min.isEmpty){
val ipDst= rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong/slice).round*slice+" "+(x.split(",")(2)),1)).reduceByKey(_ + _)
if(!ipDst.isEmpty){
val ipSrc=rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong/slice).round*slice+" "+(x.split(",")(1)),1)).reduceByKey(_ + _)
if(!ipSrc.isEmpty){
val Rapporto=ipSrc.leftOuterJoin(ipDst).mapValues{case (x,y) => x.asInstanceOf[Int] / y.getOrElse(1) }
val RapportoFiltrato=Rapporto.filter{case (key, value) => value > 100 }
println("###(ConsumerScala) CalcoloRapporti: ###")
Rapporto.collect().foreach(println)
val str = Rapporto.collect().mkString("\n")
println(s"###(ConsumerScala) Produco Risultato : ${str}")
val message = new ProducerRecord[String, String](risultato, null, str)
producer.send(message)
Thread.sleep(1000)
}else{
println("src vuoto")
}
}else{
println("dst vuoto")
}
}else{
println("min vuoto")
}
}else
{
println("rdd vuoto")
}
})//foreach
ssc.start()
ssc.awaitTermination()
} }
/user/root/head800k.csv._COPYING is a transient file that is created while the copy process is on going. Wait for the copy process to complete and you will have a fail without the _COPYING suffix ie /user/root/head800k.csv.
to filter these transient in your spark-streaming job you can use the fileStream method documented here
as shown below for example
ssc.fileStream[LongWritable, Text, TextInputFormat](
directory="/user/root/",
filter = (path: org.apache.hadoop.fs.Path) => (!path.getName.endsWith("_COPYING")), // add other filters like files starting with dot etc
newFilesOnly = true)
EDIT
since you are moving your file from local filesystem to HDFS, the best solution is to move your file to a temporary staging location in the HDFS and then move them to your target directory. copying or moving within the HDFS filesystem should avoid the transient files

Read the data from HDFS using Scala

I am new to Scala. How can I read a file from HDFS using Scala (not using Spark)?
When I googled it I only found writing option to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.PrintWriter;
/**
* #author ${user.name}
*/
object App {
//def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)
def main(args : Array[String]) {
println( "Trying to write to HDFS..." )
val conf = new Configuration()
//conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
val fs= FileSystem.get(conf)
val output = fs.create(new Path("/tmp/mySample.txt"))
val writer = new PrintWriter(output)
try {
writer.write("this is a test")
writer.write("\n")
}
finally {
writer.close()
println("Closed!")
}
println("Done!")
}
}
Please help me.How can read the file or load file from HDFS using scala.
One of the ways (kinda in functional style) could be like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.collection.immutable.Stream
val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
//This example checks line for null and prints every existing line consequentally
readLines.takeWhile(_ != null).foreach(line => println(line))
Also you could take a look this article or here and here, these questions look related to yours and contain working (but more Java-like) code examples if you're interested.

Wiki xml parser - org.apache.spark.SparkException: Task not serializable

I am newbie to both scala and spark, and trying some of the tutorials, this one is from Advanced Analytics with Spark. The following code is supposed to work:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/petr/Downloads/wiki/wiki"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
val page = new EnglishWikipediaPage()
WikipediaPage.readPage(page, xml)
if (page.isEmpty) None
else Some((page.getTitle, page.getContent))
}
val plainText = rawXmls.flatMap(wikiXmlToPlainText)
But it gives
scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)
...
Running Spark v1.3.0 on a local (and I have loaded only about a 21MB of the wiki articles, just to test it).
All of https://stackoverflow.com/search?q=org.apache.spark.SparkException%3A+Task+not+serializable didn't get me any clue...
Thanks.
try
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/terrapin/Downloads/enwiki-20150304-pages-articles1.xml-p000000010p000010000"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
val plainText = rawXmls.flatMap{line =>
val page = new EnglishWikipediaPage()
WikipediaPage.readPage(page, line)
if (page.isEmpty) None
else Some((page.getTitle, page.getContent))
}
The first guess which comes to mind is that: all your code is wrapped in the object where SparkContext is defined. Spark tries to serialize this object to transfer wikiXmlToPlainText function to nodes. Try to create different object with the only one function wikiXmlToPlainText.