Spark write operation HDFS using temporal path - scala

I am trying to write a CSV file from the Scala code below. I'm using HDFS as a temp directory, then calling writer.write to create a new file in an existing subfolder. The error message I get is shown after the code:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
  val hconf = new Configuration() // initialize new hadoop configuration
  new Path(path).getFileSystem(hconf) // get new filesystem to handle data
}
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
The same happens whether I choose a new file or an existing one. I've checked that the path is correct; I just want to create a new file there.
The problem is that writing data with a file-system-based source requires a temporary directory. This is part of the commit mechanism used by Spark: data is first written to a temporary directory and, once the tasks have finished, the processed files are automatically moved to the final path.
Should I change the temp folder path for each Spark application to S3? I think it is better to process locally (local files or HDFS) and then upload the processed output file to S3.
Also, I just noticed that the Databricks cluster I'm using shows "No Spark configuration set". Could this be related to the issue?

If you can read the raw data with Spark/Scala as a DataFrame, you can apply transformations to it to build the final DataFrame. Once you have the final DataFrame, a single line of code saves it as CSV to the S3 bucket path or the HDFS path. Note that Spark writes a directory of part files at that path rather than a single file.
df.write.format("csv").option("header", "true").mode("overwrite").option("sep", ",").save("s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv")
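The FileNotFoundException in the question comes from java.io.File, which only understands local paths and treats the s3a URI as a local file name. If the rows are built on the driver, as in the question's foreach loop, one option is to open the output stream through the Hadoop FileSystem API instead. The sketch below assumes the S3A connector and credentials are already configured; writeLines is a hypothetical helper name, not part of Spark or Hadoop.

```scala
import java.io.{BufferedWriter, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: writes each line to `target` on whatever file system
// the URI scheme resolves to (file://, hdfs://, s3a://, ...).
def writeLines(target: String, lines: Seq[String],
               conf: Configuration = new Configuration()): Unit = {
  val path = new Path(target)
  val fs = path.getFileSystem(conf)   // picks the FileSystem for the scheme
  val out = fs.create(path, true)     // true = overwrite an existing file
  val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
  try lines.foreach { line => writer.write(line); writer.write("\n") }
  finally writer.close()              // also closes the underlying stream
}
```

In the question's code this would replace the PrintWriter: collect the header and the formatted rows into a Seq[String] and pass them together with the output URI.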

Related

How to provide a text file location in spark when file is on server

I am learning Spark to use it in my project. I want to run this command in the Spark shell:
val rddFromFile = spark.sparkContext.textFile("abc");
where abc is the file location. My file is on a remote server, and I open the Spark shell from that same server. How should I specify the file location?
I also tried putting a text file on my local C drive and passing that location, but that did not work either. I get a similar error for every file location.
Error:
scala> val rddFromFile = spark.sparkContext.textFile("C:/Users/eee/Spark test/Testspark.txt")
rddFromFile: org.apache.spark.rdd.RDD[String] = C:/Users/eee/Spark test/Testspark.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> rddFromFile.collect().foreach(f=>{
| println(f)
| })
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "C"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
Spark is expecting the file to be present in the Hadoop FS, as it looks like that's the default file system set in your app.
To load a file from the local FS, prefix the path with the file:// scheme:
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
That will work when you run Spark in local mode.
If you run Spark in the cluster, then the file would have to be present on all executor nodes.
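To see why the original path fails, it helps to look at how a location string is split into a scheme and a path. The sketch below uses java.net.URI from the JDK, not Hadoop's own Path parser, but Hadoop applies a comparable rule: whatever precedes the first colon (when it comes before the first slash) is taken as the scheme. schemeOf is a hypothetical helper, and the space in the original folder name is left out because java.net.URI rejects unescaped spaces.

```scala
import java.net.URI

// Hypothetical helper: returns the URI scheme of a location string,
// or None when the string has no scheme.
def schemeOf(location: String): Option[String] =
  Option(new URI(location).getScheme)

println(schemeOf("C:/Users/eee/Testspark.txt"))         // Some(C)
println(schemeOf("file:///C:/Users/eee/Testspark.txt")) // Some(file)
println(schemeOf("/user/eee/Testspark.txt"))            // None
```

The drive letter is read as a scheme named "C", which matches the error message above. A path with no scheme is resolved against the default file system configured for the application.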

Scala - reading to a DataFrame when a path to the file doesn't exist

I'm reading metrics data from json files from S3. What is the right way to handle the case when a path to the file doesn't exist? Currently I'm getting an AnalysisException: Path does not exist when there is no file with a given $metricsData name.
I think one way is to throw an exception but how should I correctly check if a path to the file exists?
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
I wouldn't use java.nio.file, it doesn't have a proper binding to S3 and/or HDFS. If you want your code to be applicable for all filesystems (local, in Docker (CI/CD), S3, HDFS, etc.) try using Apache Hadoop utils:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
val path = new Path("base/path/to/data")
val fs = path.getFileSystem(new Configuration())
// applicable for local and remote FS
if (fs.exists(path)) {
  sparkSession.read.option("multiline", "true").json(path.toString)
}
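A variation on the check above is to wrap it in a function that returns an Option, so the caller decides whether a missing file is an error. This is a minimal sketch: readJsonIfExists is a hypothetical name, and it assumes the SparkSession's Hadoop configuration already carries whatever S3 settings the job needs, which is why it uses that configuration rather than a fresh one.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: Some(DataFrame) when the path exists, None otherwise.
def readJsonIfExists(spark: SparkSession, location: String): Option[DataFrame] = {
  val path = new Path(location)
  // Reuse Spark's Hadoop configuration so file system settings are picked up.
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  if (fs.exists(path)) Some(spark.read.option("multiline", "true").json(location))
  else None
}
```

A caller could then write readJsonIfExists(spark, s"$dataPath/$metricsData.json").getOrElse(...) and either throw a domain-specific exception or fall back to an empty DataFrame.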
You can use java.nio.file:
import java.nio.file.{Files, Paths}
// Note: this check only sees paths on the local file system
val metricsDataDF: Option[DataFrame] =
  if (Files.exists(Paths.get(s"$dataPath/$metricsData.json")))
    Some(spark.read.option("multiline", "true").json(s"$dataPath/$metricsData.json"))
  else None
How to check if path or file exist in Scala

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is outputted in the current worker node directory, and I know this because executing "ls -a"!! through scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. I save the file to HDFS and then read it back through the SQL context using its HDFS path. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
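The copy step described above could look roughly like this with the Hadoop FileSystem API. It is a sketch under assumptions: publishLocalFile is a hypothetical helper, the target URI is whatever HDFS location you choose, and the call has to run on the node where the decompressed file actually lives.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: copies a local file to `target`, overwriting any
// previous copy, and returns the destination path.
def publishLocalFile(localFile: String, target: String, conf: Configuration): Path = {
  val dst = new Path(target)
  val fs = dst.getFileSystem(conf)
  // delSrc = false keeps the local file, overwrite = true replaces the old copy
  fs.copyFromLocalFile(false, true, new Path(localFile), dst)
  dst
}
```

The returned path can then be handed to the reader in place of the bare "out.las" file name.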
I have used the hadoop API before to get to files, I dunno if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I have not tested the code below, so treat it as an outline of what to do afterward rather than a finished solution.
val file = new Path(filePath + "out.las")
val fileIn: FSDataInputStream = fs.open(file)
// Size the buffer to the file length (assumes the file fits in memory)
val readIn = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
fileIn.readFully(0, readIn)
fileIn.close()

How to continuously monitor a directory by using Spark Structured Streaming

I want spark to continuously monitor a directory and read the CSV files by using spark.readStream as soon as the file appears in that directory.
Please don't include a solution of Spark Streaming. I am looking for a way to do it by using spark structured streaming.
Here is the complete solution for this use case:
If you are running in local mode (no cluster manager), you can increase the driver memory like this:
bin/spark-shell --driver-memory 4G
There is no need to set the executor memory, because in local mode the executor runs within the driver.
To complete @T.Gaweda's answer, here is the full code:
import org.apache.spark.sql.types.StructType
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // Specify schema of the csv files
.csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
csvDF.writeStream.format("console").option("truncate","false").start()
Spark will now continuously monitor the specified directory, and as soon as you add a CSV file to it, the operations defined on csvDF will be executed on that file.
Note: If you want Spark to infer the schema, you first have to set the following configuration:
spark.sqlContext.setConf("spark.sql.streaming.schemaInference","true")
where spark is your Spark session.
As written in the official documentation, you should use the "file" source:
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Code example taken from documentation:
// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // Specify schema of the csv files
.csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
If you don't specify a trigger, Spark will read new files as soon as possible.
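If you would rather control how often the directory is checked, you can set a trigger on the write side. The sketch below combines the documentation's read example with a console sink; watchCsvDirectory is a hypothetical helper, and the 10-second interval and the maxFilesPerTrigger value are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: starts a query that prints new CSV rows to the console.
def watchCsvDirectory(spark: SparkSession, dir: String): StreamingQuery = {
  val userSchema = new StructType().add("name", "string").add("age", "integer")
  spark.readStream
    .option("sep", ";")
    .option("maxFilesPerTrigger", "1")              // at most one new file per batch
    .schema(userSchema)
    .csv(dir)
    .writeStream
    .format("console")
    .trigger(Trigger.ProcessingTime("10 seconds"))  // check for new files every 10 s
    .start()
}
```

In an application (as opposed to the Spark shell) you would typically call awaitTermination() on the returned query so the driver keeps running.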

Decrypting PGP file which is on HDFS

We are decrypting a PGP file with the help of "com.didisoft.pgp.PGPLib" in Scala.
This works fine with local files, but when we run it against HDFS files we get an error like "File not found exception for securingkey".
Even when trying the same thing with the Unix gpg utility, we got a file-not-found error when the path of an HDFS file was passed.
Below is sample code for local files that works fine:
val decryptionPassword = "xxxx"
val sec = "C:/Users/path/secring.gpg"
val originalFileName = pgp.decryptFile("C:/Users/path/pgp_sample_file.PGP", sec,
  decryptionPassword, "C:/Users/path/opfile/PGP.txt")
How can we use these utilities to decrypt files that live on HDFS?
You can't access hdfs like a normal file system. You need to either download the file to your local system then use the local file, or open a stream or load the file into memory then decrypt that.
To use gpg from the command line
hdfs dfs -cat <hdfs_file_path> | gpg --batch --yes --passphrase <passphrase> -d
I can't answer how to do it with the Java library (it seems to be proprietary), but there is probably a way to accept an inputstream instead of a filename.
To get an InputStream from an hdfs file, you need to use the hadoop fs api
val fs = org.apache.hadoop.fs.FileSystem.get(new org.apache.hadoop.conf.Configuration())
val inputStream = fs.open(new org.apache.hadoop.fs.Path(<filepath>))
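For libraries or tools that only accept a local file name, the "download first" option mentioned above can be sketched as follows. fetchToLocalTemp is a hypothetical helper, not part of Hadoop; it assumes the default Configuration on the classpath points at your cluster, and it leaves cleanup of the temporary directory to the caller.

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: copies a remote file into a fresh local temp directory
// and returns the local path, so file-name-based APIs can use it.
def fetchToLocalTemp(remote: String,
                     conf: Configuration = new Configuration()): java.nio.file.Path = {
  val src = new Path(remote)
  val fs = src.getFileSystem(conf)
  val localDir = Files.createTempDirectory("hdfs-fetch")
  val local = localDir.resolve(src.getName)
  // delSrc = false keeps the remote file; the last flag uses the raw local
  // file system so no .crc checksum file is written next to the copy.
  fs.copyToLocalFile(false, src, new Path(local.toUri), true)
  local
}
```

The returned path's toString can then be passed wherever a local file name such as the key ring or the encrypted input is expected.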
Based on the sample code from puhlen, I suggest trying this:
val pgp = new com.didisoft.pgp.PGPLib()
val decryptionPassword = "xxxx"
val fs = org.apache.hadoop.fs.FileSystem.get(new org.apache.hadoop.conf.Configuration())
val keysStream = fs.open(new org.apache.hadoop.fs.Path("hdfs://.../secring.gpg"))
val ks = new com.didisoft.pgp.KeyStore()
ks.importKeyRing(keysStream)
val inputData = fs.open(new org.apache.hadoop.fs.Path("hdfs://.../pgp_sample_file.PGP"))
val outputData = fs.create(new org.apache.hadoop.fs.Path("hdfs://.../PGP.txt"))
val originalFileName = pgp.decryptStream(inputData, ks,
decryptionPassword, outputData)
(don't forget to replace the dots with the correct HDFS paths)