How to "force" CRC files to appear when writing csv/parquet on HDFS in Spark - scala

I seem to have the opposite problem from the rest of the Internet - any search on the topic throws up thousands of questions about how to suppress CRC files when writing out with Spark.
When using Spark on a cluster and writing to HDFS, I can't see any of the .crc files I usually see on the local filesystem. Any ideas how to "force" them to appear?

You can try the approach below and see whether the .crc files appear in the HDFS folders.
val customConf = spark.sparkContext.hadoopConfiguration
val fileSystemObject = org.apache.hadoop.fs.FileSystem.get(customConf)
// verify checksums when reading files back
fileSystemObject.setVerifyChecksum(true)
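Note that setVerifyChecksum only affects reads; the write-side counterpart is setWriteChecksum, so to ask the client to write .crc companion files you would also want (reusing the same objects as above):
// write-side counterpart: ask the client to write .crc companion files
fileSystemObject.setWriteChecksum(true)
Bear in mind that whether a visible .crc sidecar actually appears depends on the FileSystem implementation: HDFS stores checksums in block metadata on the datanodes rather than as sidecar files, which is why you see them on the local filesystem but not on the cluster.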

If you write a text file to HDFS through the FileSystem API, call setWriteChecksum(false) and you will get only your single data file, with no .crc companion:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
conf.set("fs.defaultFS", uri) // uri is your namenode URI, e.g. "hdfs://namenode:8020"
val hdfs = FileSystem.get(conf)
// this is it!
hdfs.setWriteChecksum(false)
val outputStream = hdfs.create(new Path("full/file/path"))
outputStream.write("string to be written".getBytes)
outputStream.close()
hdfs.close()
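Either way, you can confirm that HDFS still tracks checksums even when no .crc sidecar is visible, because FileSystem.getFileChecksum returns the checksum the filesystem holds for a file (a quick sketch; the output path is a placeholder):
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// on HDFS this prints an MD5-of-MD5-of-CRC32 style checksum; null means the FS keeps none
println(fs.getFileChecksum(new Path("/path/to/output/part-00000")))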

Related

Loop and process multiple HDFS files in spark/scala

I have multiple files in my HDFS folder and I want to loop over them and run my Scala transformation logic on each.
I am using the script below, which works fine in my development environment using local files, but fails when I run it against my HDFS environment. Any idea where I am going wrong?
val files = new File("hdfs://172.X.X.X:8020/landing/").listFiles.map(_.getName).toList
files.foreach { file =>
  print(file)
  val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + file)
  event.show(false)
}
Can someone correct it or suggest an alternative solution, please?
You should use the Hadoop FileSystem API to list files on HDFS; java.io.File only knows about the local filesystem, which is why the listing fails on the cluster.
code:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").getOrCreate()
// bind a FileSystem handle to the HDFS namenode
val fs = FileSystem.get(new URI("hdfs://172.X.X.X:8020/"), spark.sparkContext.hadoopConfiguration)
// globStatus expands the pattern into the list of matching files
fs.globStatus(new Path("/landing/*")).toList.foreach { f =>
  val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + f.getPath.getName)
  event.show(false)
}
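As an aside, if you don't need per-file logic, Spark's readers accept glob paths, so the whole loop can often be replaced by a single read (a sketch using the same placeholder host):
// read every file in the landing directory in one pass
val events = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/*")
events.show(false)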

Scala - reading to a DataFrame when a path to the file doesn't exist

I'm reading metrics data from JSON files in S3. What is the right way to handle the case where the path to a file doesn't exist? Currently I get an AnalysisException: Path does not exist when there is no file with the given $metricsData name.
I think one option is to throw an exception, but how should I correctly check whether the path exists?
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
I wouldn't use java.nio.file - it has no proper binding for S3 and/or HDFS. If you want your code to work across all filesystems (local, in Docker (CI/CD), S3, HDFS, etc.), try the Apache Hadoop utils:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration

val path = new Path("base/path/to/data")
val fs = path.getFileSystem(new Configuration())
// applicable for local and remote FS alike
if (fs.exists(path)) {
  val metricsDataDF = sparkSession.read.option("multiline", "true").json(path.toString)
}
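Note that path.getFileSystem(...) resolves the right FileSystem implementation from the path's scheme (file://, hdfs://, s3a://, ...), which is why the same check works everywhere. In a Spark app you may want to pass spark.sparkContext.hadoopConfiguration instead of a fresh Configuration, so that credentials and endpoints already configured on the session are picked up.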
You can use java.nio.file:
import java.nio.file.{Paths, Files}

if (Files.exists(Paths.get(s"$dataPath/$metricsData.json"))) {
  val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
    .json(s"$dataPath/$metricsData.json")
}
How to check if path or file exist in Scala

Spark scala :FIle already exists exception while Uploading csv file to azure blob

I am reading a SAS file from Azure Blob storage, converting it to CSV, and uploading the CSV back to Azure Blob. For small files in the MB range this works fine with the following Spark Scala code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.github.saurfang.sas.spark._
val sqlContext = new SQLContext(sc)
val df=sqlContext.sasFile("wasbs://container#storageaccount/input.sas7bdat")
df.write.format("csv").save("wasbs://container#storageaccount/output.csv");
But for large files in the GB range it gives me an AnalysisException saying wasbs://container#storageaccount/output.csv already exists. I have tried overwrite as well, but no luck. Any help would be appreciated.
Normally you cannot overwrite an existing file on HDFS, even for small files in the MB range.
Please try the code below to overwrite; also check your Spark version, because the method differs slightly between Spark versions.
df.write.format("csv").mode("overwrite").save("wasbs://container#storageaccount/output.csv");
I don't know whether the overwrite mode above is what you had already tried.
If it is, there is another way: delete the existing files first, before doing the write:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("<hdfs://<namenodehost>/ or wasb[s]://<containername>#<accountname>.blob.core.windows.net/<path> >"), hadoopConf)
// recursively delete the old output; swallow the error if it does not exist yet
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _: Throwable => }
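Putting the two steps together (a sketch; the wasbs URI is the placeholder from the question and df is the DataFrame from the conversion above):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val outputPath = "wasbs://container#storageaccount/output.csv"
val fs = FileSystem.get(new URI(outputPath), new Configuration())
// remove the previous output (delete returns false if nothing is there), then write fresh
fs.delete(new Path(outputPath), true)
df.write.format("csv").mode("overwrite").save(outputPath)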
There is also a Spark mailing-list thread discussing a similar issue; see http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html.

How to read a file using sparkstreaming and write to a simple file using Scala?

I'm trying to read a file with a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I am trying to write it out as a new file on my local machine as well. But whenever I write the stream out and store it as Parquet, I end up with blank folders.
This is my code :
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("StreamAFile")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()
import spark.implicits._
val schemaforfile = new StructType().add("SrNo", IntegerType).add("Name", StringType).add("Age", IntegerType).add("Friends", IntegerType)
val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")
file.writeStream.format("parquet").start("C:\\Users\\roswal01\\Desktop\\streamed")
spark.stop()
Is there anything missing from my code, or anywhere I have gone wrong?
I also tried reading the file from an HDFS location, but the same code ends up not creating any output folders on my HDFS either.
You've made a mistake here:
val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")
The csv() function should be given a directory path as its argument. It will scan that directory and read new files as they are moved into it.
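For example (the input directory name is illustrative):
// point the stream at a directory; new CSV files moved there will be picked up
val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\input\\")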
For checkpointing, you should add
.option("checkpointLocation", "path/to/HDFS/dir")
For example:
val query = file.writeStream.format("parquet")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .start("C:\\Users\\roswal01\\Desktop\\streamed")
query.awaitTermination()
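Note also that the original code calls spark.stop() immediately after start(). start() is asynchronous, so that shuts the application down before the query has processed anything, which is another reason for the empty output folders; query.awaitTermination() keeps it running.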

PySpark: Empty RDD on reading gzipped BSON files

I have a script to analyse BSON dumps; however, it only works with uncompressed files. I get an empty RDD when reading gzipped BSON files.
from datetime import datetime
from json import JSONEncoder

from bson import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'big_bson.gz'

class BsonEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)

def setup_spark_with_pymongo(app_name='App'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sc.addPyFile(pyspark_location)
    return sc

def main():
    spark_context = setup_spark_with_pymongo('PysparkApp')
    filename = HDFS_HOME + INPUT_FILE
    import pymongo_spark
    pymongo_spark.activate()
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())  # raises ValueError("RDD is empty")
I am using mongo-java-driver-3.2.2.jar, mongo-hadoop-spark-1.5.2.jar, pymongo-3.2.2-py2.7-linux-x86_64 and pymongo_spark, along with spark-submit.
The version of Spark deployed is 1.6.1 along with Hadoop 2.6.4.
I am aware that the library does not support splitting compressed BSON files; however, it should still read a file that fits in a single split.
I have hundreds of compressed BSON files to analyse, and decompressing each of them doesn't seem to be a viable option.
Any idea how should I proceed further?
Thanks in advance!
I've just tested this in the same environment: mongo-hadoop-spark-1.5.2.jar, Spark 1.6.1 for Hadoop 2.6.4, pymongo 3.2.2. The source file is compressed mongodump output, small enough for a single split (uncompressed collection size of 105 MB). Running through PySpark:
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()
conf = SparkConf().setAppName("pyspark-bson")
# build the context from the conf before using it
sc = SparkContext(conf=conf)
file_path = "/file/example_bson.gz"
rdd = sc.BSONFileRDD(file_path)
rdd.first()
It was able to read the compressed BSON file and list the first document. Please make sure you can reach the input file and that it is in the correct BSON format.