How to write to HDFS using Scala

I am learning Scala and I need to write a custom file to HDFS. I have my own HDFS running on a Cloudera image using VMware Fusion on my laptop.
This is my current code:
package org.glassfish.samples

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

/**
 * @author ${user.name}
 */
object App {
  def main(args: Array[String]): Unit = {
    println("Trying to write to HDFS...")
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    val output = fs.create(new Path("hdfs://quickstart.cloudera:8020/tmp/mySample.txt"))
    val writer = new PrintWriter(output)
    try {
      writer.write("this is a test")
      writer.write("\n")
    } finally {
      writer.close()
    }
    print("Done!")
  }
}
And I am getting this exception:
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://quickstart.cloudera:8020/tmp, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:414)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:588)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:439)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.glassfish.samples.App$.main(App.scala:19)
at org.glassfish.samples.App.main(App.scala)
... 6 more
I can access HDFS using the terminal and Hue:
[cloudera@quickstart ~]$ hdfs dfs -ls /tmp
Found 3 items
drwxr-xr-x - hdfs supergroup 0 2015-06-09 17:54 /tmp/hadoop-yarn
drwx-wx-wx - hive supergroup 0 2015-08-17 15:24 /tmp/hive
drwxr-xr-x - cloudera supergroup 0 2015-08-17 16:50 /tmp/labdata
This is my pom.xml.
I ran the project using the command:
mvn clean package scala:run
What am I doing wrong? Thank you in advance!
EDIT after @jeroenr's advice
This is the current code:
package org.glassfish.samples

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

/**
 * @author ${user.name}
 */
object App {
  //def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)
  def main(args: Array[String]): Unit = {
    println("Trying to write to HDFS...")
    val conf = new Configuration()
    //conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
    conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
    val fs = FileSystem.get(conf)
    val output = fs.create(new Path("/tmp/mySample.txt"))
    val writer = new PrintWriter(output)
    try {
      writer.write("this is a test")
      writer.write("\n")
    } finally {
      writer.close()
      println("Closed!")
    }
    println("Done!")
  }
}

Have a look at this example. I think the problem is that you don't configure the default file system using
conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
and pass the path without the hdfs:// prefix, like so:
fs.create(new Path("/tmp/mySample.txt"))
To write to the file, call 'write' directly on the output stream returned by fs.create, like so:
val os = fs.create(new Path("/tmp/mySample.txt"))
os.write("This is a test".getBytes)

Related

How to write Spark-submit logs file with the Scala code?

I am trying to build a Scala-based jar file that uses log4j to write logs. Executing the code below with spark-shell works fine (the logs print to the console). But when I try to make it write to a log file (with spark-shell or spark-submit), only the line with logging.info is printed out. I wish to set the log level to DEBUG. Here is my code:
import org.apache.log4j
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger, PatternLayout, Priority, RollingFileAppender}
import java.time
import java.time.format.DateTimeFormatter

trait SparkContextProvider {
  def spark: SparkSession
}

trait Logs extends SparkContextProvider {
  lazy val logging: log4j.Logger = Logger.getLogger(getClass.getName)
  lazy val applicationId: String = spark.sparkContext.applicationId

  val appender = new RollingFileAppender()
  appender.setAppend(true)
  appender.setMaxFileSize("50MB")
  appender.setMaxBackupIndex(10)
  appender.setFile("/usr/spark-3.0.2/app-logs/spark-" + applicationId + ".log")
  appender.activateOptions()

  val layOut = new PatternLayout()
  layOut.setConversionPattern("%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n")
  appender.setLayout(layOut)

  logging.addAppender(appender)
  logging.setLevel(Level.DEBUG)
}

object DataExtractionProcess extends Logs {
  def Main(): Unit = {
    logging.info("hello test world")
  }

  override def spark: SparkSession = SparkSession.builder
    .appName("PredictiveDataOperation")
    .getOrCreate()
}
I trigger the job with DataExtractionProcess.Main()
I also tried to set the log level with:
//Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)
//Logger.getRootLogger().setLevel(Level.DEBUG)
//spark.sparkContext.setLogLevel("all")
But no change in the log file.
Thanks for the help
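One hedged observation (an aside, not a confirmed fix): the RollingFileAppender above is attached only to the application's own logger, so Spark's internal DEBUG output never reaches the file, because log4j events propagate from child loggers up to the root, not the other way around. A small sketch of attaching the same appender to the root logger, assuming the appender defined in the Logs trait:
import org.apache.log4j.{Level, Logger}

// Sketch: attach the appender from the Logs trait to the root logger so that
// DEBUG events from Spark's own loggers (org.apache.spark.*) also reach the rolling file.
Logger.getRootLogger.addAppender(appender)
Logger.getRootLogger.setLevel(Level.DEBUG)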

What is the correct way to identify if a folder exists on an ADLS Gen2 account or not

I am working in a Scala and Spark environment where I want to read a Parquet file. Before I read it, I want to check whether the file exists. I am writing the following code in a Jupyter notebook, but it does not work; it does not show any data frame, because the function testDirExist returns false:
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

def testDirExist(path: String): Boolean = {
  val p = new Path(path)
  hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
}

val pt = "abfss://container@account.dfs.core.windows.net/blah/blah/blah"
val exists = testDirExist(pt)
if (exists) {
  val dataframe = spark.read.parquet(pt)
  dataframe.show()
}
However, the following code works; it shows the data frame:
val k = spark.read.parquet("abfss://container@account.dfs.core.windows.net/blah/blah/blah")
k.show()
Can anyone help me check whether the file exists or not?
Thanks
You just need to set the default filesystem to your storage account:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

val conf = new Configuration()
conf.set("fs.defaultFS", "abfss://<container_name>@<account_name>.dfs.core.windows.net")
conf.set("fs.azure.account.auth.type.<container_name>.dfs.core.windows.net", "OAuth")
conf.set("fs.azure.account.oauth.provider.type.<container_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
conf.set("fs.azure.account.oauth2.client.id.<container_name>.dfs.core.windows.net", "<client_id>")
conf.set("fs.azure.account.oauth2.client.secret.<container_name>.dfs.core.windows.net", "<secret>")
conf.set("fs.azure.account.oauth2.client.endpoint.<container_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")

val fs = FileSystem.get(conf)
val ostream = fs.create(new Path("/abfss_test.out"))
val pwriter = new PrintWriter(ostream)
try {
  pwriter.write("Azure Datalake Gen2 test")
  pwriter.write("\n")
} finally {
  pwriter.close()
}

// check if the file we've just created exists
println(fs.exists(new Path("/abfss_test.out")))
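As an aside, a sketch that builds on the asker's own notebook code rather than on the configuration above (container, account and path are placeholders): instead of changing the default file system, you can resolve the FileSystem from the fully qualified path itself, since FileSystem.get on the session's configuration returns the default file system, which may not be the abfss account the path points at.
import org.apache.hadoop.fs.Path

// Placeholder path; the abfss:// scheme selects the right FileSystem implementation.
val fullPath = "abfss://<container_name>@<account_name>.dfs.core.windows.net/blah/blah/blah"

def dirExists(path: String): Boolean = {
  val p = new Path(path)
  // resolve the FileSystem for this specific path instead of using the default one
  val pfs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
  pfs.exists(p) && pfs.getFileStatus(p).isDirectory
}

if (dirExists(fullPath)) {
  spark.read.parquet(fullPath).show()
}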

HDFS: java.io.FileNotFoundException: File does not exist: name._COPYING

I'm working with Spark Streaming using Scala. I need to read a .csv file dynamically from an HDFS directory with this line:
val lines = ssc.textFileStream("/user/root/")
I use the following command line to put the file into HDFS:
hdfs dfs -put ./head40k.csv
It works fine with a relatively small file.
When I try with a larger one, I get this error:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/root/head800k.csv._COPYING
I can understand why, but I don't know how to fix it. I've tried this solution too:
hdfs dfs -put ./head800k.csv /user
hdfs dfs -mv /usr/head800k.csv /user/root
but my program doesn't read the file.
Any ideas?
Thanks in advance
PROGRAM:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
import scala.sys.process._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import StreamingContext._

object Traccia2014 {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <test><topicRisultato>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }
    val Array(brokers, risultato) = args
    val sparkConf = new SparkConf().setAppName("Traccia2014")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.textFileStream("/user/root/")
    //val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](directory = "/user/root/",
    //  filter = (path: org.apache.hadoop.fs.Path) => (!path.getName.endsWith("._COPYING")), newFilesOnly = true)

    // ********** Definizioni Producer ***********
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val slice = 30

    lines.foreachRDD( rdd => {
      if (!rdd.isEmpty) {
        val min = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
        if (!min.isEmpty) {
          val ipDst = rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong / slice).round * slice + " " + (x.split(",")(2)), 1)).reduceByKey(_ + _)
          if (!ipDst.isEmpty) {
            val ipSrc = rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong / slice).round * slice + " " + (x.split(",")(1)), 1)).reduceByKey(_ + _)
            if (!ipSrc.isEmpty) {
              val Rapporto = ipSrc.leftOuterJoin(ipDst).mapValues { case (x, y) => x.asInstanceOf[Int] / y.getOrElse(1) }
              val RapportoFiltrato = Rapporto.filter { case (key, value) => value > 100 }
              println("###(ConsumerScala) CalcoloRapporti: ###")
              Rapporto.collect().foreach(println)
              val str = Rapporto.collect().mkString("\n")
              println(s"###(ConsumerScala) Produco Risultato : ${str}")
              val message = new ProducerRecord[String, String](risultato, null, str)
              producer.send(message)
              Thread.sleep(1000)
            } else {
              println("src vuoto")
            }
          } else {
            println("dst vuoto")
          }
        } else {
          println("min vuoto")
        }
      } else {
        println("rdd vuoto")
      }
    }) //foreach

    ssc.start()
    ssc.awaitTermination()
  }
}
/user/root/head800k.csv._COPYING is a transient file that is created while the copy process is ongoing. Wait for the copy process to complete and you will have a file without the _COPYING suffix, i.e. /user/root/head800k.csv.
To filter out these transient files in your Spark Streaming job, you can use the fileStream method, as shown below for example:
ssc.fileStream[LongWritable, Text, TextInputFormat](
directory="/user/root/",
filter = (path: org.apache.hadoop.fs.Path) => (!path.getName.endsWith("_COPYING")), // add other filters like files starting with dot etc
newFilesOnly = true)
EDIT
Since you are moving your file from the local filesystem to HDFS, the best solution is to copy it to a temporary staging location in HDFS and then move it to your target directory. Copying or moving within the HDFS filesystem avoids the transient files.
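A rough sketch of that staging approach with the Hadoop FileSystem API (all paths here are placeholders, and the staging directory must live outside the directory watched by textFileStream):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// 1) upload from the local filesystem into an HDFS staging directory (placeholder paths)
//    that the streaming job does not watch
fs.copyFromLocalFile(new Path("file:///home/user/head800k.csv"),
                     new Path("/tmp/staging/head800k.csv"))

// 2) rename within HDFS into the watched directory; a rename inside HDFS is a single
//    atomic metadata operation, so the stream never sees a half-written ._COPYING_ file
fs.rename(new Path("/tmp/staging/head800k.csv"),
          new Path("/user/root/head800k.csv"))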

Read the data from HDFS using Scala

I am new to Scala. How can I read a file from HDFS using Scala (not using Spark)?
When I googled it, I only found examples of writing to HDFS.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

/**
 * @author ${user.name}
 */
object App {
  //def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)
  def main(args: Array[String]): Unit = {
    println("Trying to write to HDFS...")
    val conf = new Configuration()
    //conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
    conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
    val fs = FileSystem.get(conf)
    val output = fs.create(new Path("/tmp/mySample.txt"))
    val writer = new PrintWriter(output)
    try {
      writer.write("this is a test")
      writer.write("\n")
    } finally {
      writer.close()
      println("Closed!")
    }
    println("Done!")
  }
}
Please help me. How can I read or load a file from HDFS using Scala?
One of the ways (in a somewhat functional style) could be like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.collection.immutable.Stream

val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)

def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))

// This example checks each line for null and prints every existing line sequentially
readLines.takeWhile(_ != null).foreach(line => println(line))
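A slightly more conventional variant (a sketch reusing the same placeholder URI and path as above) wraps the stream in a BufferedReader and closes it when done, avoiding the deprecated readLine on the raw data stream:
import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// same placeholder URI and file path as in the example above
val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val reader = new BufferedReader(
  new InputStreamReader(hdfs.open(new Path("/path/to/file/")), StandardCharsets.UTF_8))
try {
  Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
} finally {
  reader.close()
}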
You could also take a look at this article and the other related questions; they look related to yours and contain working (but more Java-like) code examples if you're interested.

How to make Spark slaves use HDFS input files 'local' to them in a Hadoop+Spark cluster?

I have a cluster of 9 computers with Apache Hadoop 2.7.2 and Spark 2.0.0 installed on them. Each computer runs an HDFS datanode and Spark slave. One of these computers also runs an HDFS namenode and Spark master.
I've uploaded a few TBs of gz-archives to HDFS with Replication=2. It turned out that some of the archives are corrupt. I want to find them. It looks like 'gunzip -t' can help. So I'm trying to find a way to run a Spark application on the cluster so that each Spark executor, as far as possible, tests archives that are 'local' to it (i.e. that have one of their replicas located on the same computer where the executor runs). The following script runs, but sometimes Spark executors process 'remote' files in HDFS:
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchivesTester' in spark.pom
// and placing this jar file into Spark's home directory):
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
// means testing for corruption the gz-archives in the directory hdfs://LV-WS10.lviv:9000/commoncrawl
// using a Spark cluster with the Spark master URL spark://LV-WS10.lviv:7077 and 9 Spark slaves
package com.qbeats.cortex

import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.FileSplit
import org.apache.spark.rdd.HadoopRDD
import org.apache.spark.{SparkContext, SparkConf, AccumulatorParam}
import sys.process._

object CommoncrawlArchivesTester extends App {
  object LogAccumulator extends AccumulatorParam[String] {
    def zero(initialValue: String): String = ""
    def addInPlace(log1: String, log2: String) = if (log1.isEmpty) log2 else log1 + "\n" + log2
  }

  override def main(args: Array[String]): Unit = {
    if (args.length >= 3) {
      val appName = "CommoncrawlArchivesTester"
      val conf = new SparkConf().setAppName(appName).setMaster(args(0))
      conf.set("spark.executor.memory", "6g")
      conf.set("spark.shuffle.service.enabled", "true")
      conf.set("spark.dynamicAllocation.enabled", "true")
      conf.set("spark.dynamicAllocation.initialExecutors", args(2))
      val sc = new SparkContext(conf)
      val log = sc.accumulator(LogAccumulator.zero(""))(LogAccumulator)

      val text = sc.hadoopFile(args(1), classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
      val fileAndLine = hadoopRdd.mapPartitionsWithInputSplit { (inputSplit, iterator) =>
        val fileName = inputSplit.asInstanceOf[FileSplit].getPath.toString
        class FilePath extends Iterable[String] {
          def iterator = List(fileName).iterator
        }
        val result = (sys.env("HADOOP_PREFIX") + "/bin/hadoop fs -cat " + fileName) #| "gunzip -t" !
        println("Processed %s.".format(fileName))
        if (result != 0) {
          log.add(fileName)
          println("Corrupt: %s.".format(fileName))
        }
        (new FilePath).iterator
      }

      val result = fileAndLine.collect()
      println("Corrupted files:")
      println(log.value)
    }
  }
}
What would you suggest?
ADDED LATER:
I tried another script which gets files from HDFS via textFile(). It looks like a Spark executor doesn't prefer the input files that are 'local' to it. Doesn't that contradict "Spark brings code to data, not data to code"?
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchiveLinesCounter' in spark.pom)
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9

package com.qbeats.cortex

import org.apache.spark.{SparkContext, SparkConf}

object CommoncrawlArchiveLinesCounter extends App {
  override def main(args: Array[String]): Unit = {
    if (args.length >= 3) {
      val appName = "CommoncrawlArchiveLinesCounter"
      val conf = new SparkConf().setAppName(appName).setMaster(args(0))
      conf.set("spark.executor.memory", "6g")
      conf.set("spark.shuffle.service.enabled", "true")
      conf.set("spark.dynamicAllocation.enabled", "true")
      conf.set("spark.dynamicAllocation.initialExecutors", args(2))
      val sc = new SparkContext(conf)
      val helper = new Helper

      val nLines = sc.
        textFile(args(1) + "/*").
        mapPartitionsWithIndex( (index, it) => {
          println("Processing partition %s".format(index))
          it
        }).
        count

      println(nLines)
    }
  }
}
I've solved the problem by switching from Spark’s standalone mode to YARN.
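(As a hedged aside, not the fix used here: on the standalone scheduler one knob that is sometimes worth experimenting with for locality is spark.locality.wait, which controls how long Spark waits for a node-local slot before falling back to a remote one. Reusing the SparkConf named conf in the scripts above:)
// hedged example: wait up to 10s for a NODE_LOCAL slot before scheduling remotely (the default is 3s)
conf.set("spark.locality.wait", "10s")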
Related topic: How does Apache Spark know about HDFS data nodes?