Loop and process multiple HDFS files in spark/scala - scala

I have multiple files in my HDFS folder and I want to loop and run my scala transformation logic on it.
I am using below script which is working fine in my development environment using local files but it is failing when I run on my HDFS environment. Any idea where am I doing wrong please?
val files = new File("hdfs://172.X.X.X:8020/landing/").listFiles.map(_.getName).toList
files.foreach { file =>
print(file)
val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + file)
event.show(false)
}
Can someone correct it or suggest alternative solution please.

You should use Hadoop IO library to handle hadoop files.
code:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
val spark=SparkSession.builder().master("local[*]").getOrCreate()
val fs=FileSystem.get(new URI("hdfs://172.X.X.X:8020/"),spark.sparkContext.hadoopConfiguration)
fs.globStatus(new Path("/landing/*")).toList.foreach{
f=>
val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + f.getPath.getName)
event.show(false)
}

Related

How to "force" CRC files to appear when writing csv/parquet on HDFS in Spark

I seem to have the opposite problem from the rest of the Internet - any search on the topic would throw thousands of questions on how to suppress CRC files when writing out using Spark.
When using Spark on a cluster and writing stuff out to the HDFS I can't see any of the .crc files I usually see on the local system. Any ideas how to "force" them to appear?
You can try the below approach and see if .crc file is appearing on the hdfs folders.
val customConf = spark.sparkContext.hadoopConfiguration
val fileSystemObject = org.apache.hadoop.fs.FileSystem.get(customConf)
fileSystemObject.setVerifyChecksum(true)
If you write to text file on HDFS - you need to call setWriteChecksum with "false". And you will have only one your file:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
conf.set("fs.defaultFS", uri)
val hdfs = FileSystem.get(conf)
// this is it!
hdfs.setWriteChecksum(false)
val outputStream = hdfs.create(new Path("full/file/path"))
outputStream.write("string to be written".getBytes)
outputStream.close()
hdfs.close()

Execute code scala from spark in Zeppelin

I would like to run a scala code on Zeppelin from Spark cluster.
For example:
This is code into hdfs Spark "HelloWorldScala.scala":
object HelloWorldScala{
def main (arg: Array[String]): Unit = {
val conf = new SparkConf().setAppName("myApp_Enrico")
val spark = SparkSession.builder.config(conf).getOrCreate()
val aList = List(1,2,3,4,5,6,7,8,9,10)
val aRdd = spark.sparkContext.parallelize(aList)
println("********* HELLO WORLD AND HELLO SPARK!! ******")
println("Print even numbers")
aRdd.filter(x=>x%2==0).map(x=>x*2).collect().foreach(println)
}
}
I would like to import in Zeppelin the HelloWorldScala file and run main, but I see the error:
Error code Zeppelin
Unfortunately you can't import single file in Zeppelin. You can pack your scala files into .jar library and put it to spark.jars (setted as property in spark) directory, after you will can import your library using line: import your.libray.packages.YourClass and using non-private functions from it. If you don't know about jar packages, and spark.jar directories just read a bit more about that.
UPDATE:
%dep
z.load("your_package_group:artifact:version")
%spark
import com.yourpackage.HelloWorldScala

Scala - reading to a DataFrame when a path to the file doesn't exist

I'm reading metrics data from json files from S3. What is the right way to handle the case when a path to the file doesn't exist? Currently I'm getting an AnalysisException: Path does not exist when there is no file with a given $metricsData name.
I think one way is to throw an exception but how should I correctly check if a path to the file exists?
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
I wouldn't use java.nio.file, it doesn't have a proper binding to S3 and/or HDFS. If you want your code to be applicable for all filesystems (local, in Docker (CI/CD), S3, HDFS, etc.) try using Apache Hadoop utils:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
val path = new Path("base/path/to/data")
val fs = path.getFileSystem(new Configuration())
// applicable for local and remote FS
if (fs.exists(path)) {
sparkSession.read(...)
}
You can use java.nio.file :
import java.nio.file.{Paths, Files}
if(Files.exists(Paths.get(s"$dataPath/$metricsData.json")))
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
How to check if path or file exist in Scala

Spark scala :FIle already exists exception while Uploading csv file to azure blob

I am reading sas file from azure blob . Converting it to csv and trying to upload csv to azure blob . However for small files in MBs I am able to do the same successfully with the following spark scala code .
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.github.saurfang.sas.spark._
val sqlContext = new SQLContext(sc)
val df=sqlContext.sasFile("wasbs://container#storageaccount/input.sas7bdat")
df.write.format("csv").save("wasbs://container#storageaccount/output.csv");
But for large files in GB it gives me Analysis exception wasbs://container#storageaccount/output.csv file already exists exception. I have tried overwrite also . But no luck . Any help would be appriciated
Actually, you could not overwrite an existing file on HDFS normally, even for small files in MBs.
Please try to use the code below to overwrite, please check your spark version because there are some differences to use the methed for different spark version.
df.write.format("csv").mode("overwrite").save("wasbs://container#storageaccount/output.csv");
I don't know the code above using overwrite mode whether you had tried as you said.
So there is another way to do it that first delete the existing files befer do the overwrite operation.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("<hdfs://<namenodehost>/ or wasb[s]://<containername>#<accountname>.blob.core.windows.net/<path> >"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
And there is a spark topic discussed similar issue, please see http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html.

spark unable to save in hadoop (permission denied for user)

I build a spark application to count the number of word in a file. I run the application on the cloudera quickstart VM, all is fine when i use the cloudera user directory but when i want to write or read in an other user directory i have a permission denied from hadoop. I would like to know how to change the hadoop user in spark.
package user1.item1
import user1.{Article}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._
import scala.util.{Try, Success, Failure}
object WordCount {
def main(args: Array[String]) {
Context.User = 'espacechange'
val filename = "hdfs://quickstart.cloudera:8020/user/user1/test/wiki_test/wikipedia.txt"
val conf = new SparkConf().setAppName("word count")
val sc = new SparkContext(conf)
val wikipedia = sc.textFile(filename).map(Article.parseWikipediaArticle)
val counts = wikipedia.flatMap(line => line.text.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/user1/test/word_count")
}
}
It depends on your cluster's authentication. By default, you can set following environment variable:
$ export HADOOP_USER_NAME=hdfs
Try the above before submitting spark job.
You need to launch the spark-submit script using a different OS user.
For example, use the following command to run the spark application as (and get the permissions of) the HDFS user:
sudo -u hdfs spark-submit ....