Decrypting PGP file which is on HDFS - scala

We are decrypting PGP file with the help of "com.didisoft.pgp.PGPLib" in scala.
this is working fine with local files but when we run it for HDFS files we are facing issue like "File not found exception for securingkey"
Even while trying the same thing with unix utility for gpg we faced a file not found issue when path of HDFS file is passed.
Below is sample code for local files thats working fine:
val decryptionPassword = "xxxx"
val sec = "C:/Users/path/secring.gpg"
val originalFileName =pgp.decryptFile("C:/Users/path/pgp_sample_file.PGP",sec,
decryptionPassword ,"C:/Users/path/opfile/PGP.txt")
How can we use these utilities for decrypting our files lying on the HDFS?

You can't access hdfs like a normal file system. You need to either download the file to your local system then use the local file, or open a stream or load the file into memory then decrypt that.
To use gpg from the command line
hdfs dfs -cat <hdfs_file_path> | gpg --batch --yes --passphrase <passphrase> -d
I can't answer how to do it with the Java library (it seems to be proprietary), but there is probably a way to accept an inputstream instead of a filename.
To get an InputStream from an hdfs file, you need to use the hadoop fs api
val fs = org.apache.hadoop.fs.FileSystem.get(new org.apache.hadoop.conf.Configuration())
val inputStream = fs.open(new org.apache.hadoop.fs.Path(<filepath>))

Based on the sample code from puhlen, I can suggest you to try this:
val pgp = new com.didisoft.pgp.PGPLib()
val decryptionPassword = "xxxx"
val fs = org.apache.hadoop.fs.FileSystem.get(new org.apache.hadoop.conf.Configuration())
val keysStream = fs.open(new org.apache.hadoop.fs.Path("hdfs://.../secring.gpg"))
val ks = new com.didisoft.pgp.KeyStore()
ks.importKeyRing(keysStream)
val inputData = fs.open(new org.apache.hadoop.fs.Path("hdfs://.../pgp_sample_file.PGP"))
val outputData = fs.create(new org.apache.hadoop.fs.Path("hdfs://.../PGP.txt"))
val originalFileName = pgp.decryptStream(inputData, ks,
decryptionPassword, outputData)
(don't forget to replace the dots with the correct HDFS paths)

Related

How to provide a text file location in spark when file is on server

I am Learning spark to implement in my project. I want to run command in spark shell-
val rddFromFile = spark.sparkContext.textFile("abc");
where abc is file location. My file is on remote server and through that remote server I am opening spark shell, how should I specify file location.
I tried to put a text file in local C drive and provided the location to read that, it also did not worked. I am getting similar error for all the file location.
Error :
scala> val rddFromFile = spark.sparkContext.textFile("C:/Users/eee/Spark test/Testspark.txt")
rddFromFile: org.apache.spark.rdd.RDD[String] = C:/Users/eee/Spark test/Testspark.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> rddFromFile.collect().foreach(f=>{
| println(f)
| })
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "C"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
Spark is expecting the file to be present in the Hadoop FS, as it looks like that's the default file system set in your app.
To load a file from local FS, you need to put it like
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
That will work when you run Spark in local mode.
If you run Spark in the cluster, then the file would have to be present on all executor nodes.

How to "force" CRC files to appear when writing csv/parquet on HDFS in Spark

I seem to have the opposite problem from the rest of the Internet - any search on the topic would throw thousands of questions on how to suppress CRC files when writing out using Spark.
When using Spark on a cluster and writing stuff out to the HDFS I can't see any of the .crc files I usually see on the local system. Any ideas how to "force" them to appear?
You can try the below approach and see if .crc file is appearing on the hdfs folders.
val customConf = spark.sparkContext.hadoopConfiguration
val fileSystemObject = org.apache.hadoop.fs.FileSystem.get(customConf)
fileSystemObject.setVerifyChecksum(true)
If you write to text file on HDFS - you need to call setWriteChecksum with "false". And you will have only one your file:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
conf.set("fs.defaultFS", uri)
val hdfs = FileSystem.get(conf)
// this is it!
hdfs.setWriteChecksum(false)
val outputStream = hdfs.create(new Path("full/file/path"))
outputStream.write("string to be written".getBytes)
outputStream.close()
hdfs.close()

Spark write operation HDFS using temporal path

I am trying to write to a csv file from this Scala code. I'm using HDFS as a temp directory, then just writer.write to create a new file in an existing subfolder. I get the following error message:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
val hconf = new Configuration() // initialize new hadoop configuration
new Path(path).getFileSystem(hconf) // get new filesystem to handle data
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
same happens if I choose new file or exiting one, I've checked the path is correct, just want to create a new file in there.
Problem is in order to write data using file system based source you'll need a temporal directory, this is a part of the commit mechanism used by Spark, i.e data is first written to a temporary directory, and once the tasks are finished, automatically moved the processed file to the final path.
Should I change the path to the temp folder for each Spark application to S3? I think is better to process locally (Local Files HDFS) then upload the processed output file to S3
Also I just see there is no "No Spark configuration set" in the databricks cluster I'm using, this interferes with the issue?
If you are able to read the raw data using spark/scala in the form of the DataFrame then you could perform transformations on your dataframe to build the final dataframe. Once you have the final dataframe then needs to be written as csv file you can just use the below single line of code to save the csv file to s3 bucket path or the hdfs path.
df.write.format('csv').option('header','true').mode('overwrite').option('sep',',').save('s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv')

Scala - reading to a DataFrame when a path to the file doesn't exist

I'm reading metrics data from json files from S3. What is the right way to handle the case when a path to the file doesn't exist? Currently I'm getting an AnalysisException: Path does not exist when there is no file with a given $metricsData name.
I think one way is to throw an exception but how should I correctly check if a path to the file exists?
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
I wouldn't use java.nio.file, it doesn't have a proper binding to S3 and/or HDFS. If you want your code to be applicable for all filesystems (local, in Docker (CI/CD), S3, HDFS, etc.) try using Apache Hadoop utils:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
val path = new Path("base/path/to/data")
val fs = path.getFileSystem(new Configuration())
// applicable for local and remote FS
if (fs.exists(path)) {
sparkSession.read(...)
}
You can use java.nio.file :
import java.nio.file.{Paths, Files}
if(Files.exists(Paths.get(s"$dataPath/$metricsData.json")))
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
How to check if path or file exist in Scala

Copy file from Hdfs to Hdfs scala

Is there a known way using Hadoop api / spark scala to copy files from one directory to another on Hdfs ?
I have tried using copyFromLocalFile but was not helpful
Try Hadoop's FileUtil.copy() command, as described here: https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/fs/FileUtil.html#copy(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20boolean,%20org.apache.hadoop.conf.Configuration)
val conf = new org.apache.hadoop.conf.Configuration()
val srcPath = new org.apache.hadoop.fs.Path("hdfs://my/src/path")
val dstPath = new org.apache.hadoop.fs.Path("hdfs://my/dst/path")
org.apache.hadoop.fs.FileUtil.copy(
srcPath.getFileSystem(conf),
srcPath,
dstPath.getFileSystem(conf),
dstPath,
true,
conf
)
As I've understand your question, the answer is as easy as abc. Actually, there is no difference between your OS filesystem and some other distributed versions in the fundamental concepts like copying files in them. That is true that each would have its own rules in commands. For instance, when you want to copy a file from one directory to another you can do something like:
hdfs dfs -cp /dir_1/file_1.txt /dir_2/file_1_new_name.txt
The first part of the example command is just to let the command to be routed to the true destination not the OS's own file system.
for further reading you can use: copying data in hdfs