Remove all files with a given extension using scala spark - scala

I have some csv.crc files generated when I try to write a dataframe into a csv file using spark. Therefore I want to delete all files with .csv.crc extension
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
val srcPath=new Path("./src/main/resources/myDirectory/*.csv.crc")
println(fs.exists(srcPath))
println(fs.isFile(srcPath))
if (fs.exists(srcPath) && fs.isFile(srcPath)) {
  fs.delete(srcPath, true)
}
Both println lines print false, so the code never even enters the if block. How can I delete all .csv.crc files using Scala and Spark?

You can use the option below to avoid generating .crc files while writing (note: you're disabling the checksum):
fs.setVerifyChecksum(false)
Alternatively, you can skip the checksum while reading by setting:
config.set("dfs.client.read.shortcircuit.skip.checksum", "true")

Related

Creating temporary resource test files in Scala

I am currently writing tests for a function that takes file paths and loads a dataset from them. I am not able to change the function. To test it, I currently create files for each run of the test. I am worried that simply creating files and then deleting them is bad practice. Is there a better way to create temporary test files in Scala?
import java.io.{File, PrintWriter}
val testFile = new File("src/main/resources/temp.txt" )
val pw = new PrintWriter(testFile)
val testLines = List("this is a text line", "this is the next text line")
testLines.foreach(pw.write)
pw.close
// test logic here
testFile.delete()
I would generally prefer java.nio over java.io. You can create a temporary file like so:
import java.nio.file.Files
Files.createTempFile("prefix", ".suffix")
You can delete it using Files.delete. To ensure that the file is deleted even in the case of an error, you should put the delete call into a finally block.
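A minimal sketch of that pattern (loadDataset here is a stand-in for the function under test):
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
val testLines = List("this is a text line", "this is the next text line")
// createTempFile takes a name prefix and suffix; these values are placeholders.
val tempFile: Path = Files.createTempFile("test-data", ".txt")
try {
  Files.write(tempFile, testLines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  // test logic here, e.g. loadDataset(tempFile.toString)
} finally {
  // Runs even if the test logic throws, so the temp file is always removed.
  Files.delete(tempFile)
}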

Read large file into StringBuilder

I've got text files sitting in HDFS, ranging in size from around 300-800 MB each. They are almost valid json files. I am trying to make them valid json files so I can save them as ORC files.
I am attempting to create a StringBuilder with the needed opening characters, then read the file in line by line, stripping off the newlines, append each line to the string builder, and then add the needed closing character.
import org.apache.hadoop.fs.{FileSystem,Path, PathFilter, RemoteIterator}
import scala.collection.mutable.StringBuilder
//create stringbuilder
var sb = new scala.collection.mutable.StringBuilder("{\"data\" : ")
//read in the file
val path = new Path("/path/to/crappy/file.json")
val stream = fs.open(path)
//read the file line by line. This will strip off the newlines so we can append it to the string builder
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => sb.append(line))
That works. But as soon as I try to append the closing }:
sb.append("}")
It crashes with out of memory:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala
...
I've tried setting the initial size of the stringbuilder to be larger than the file I'm currently testing with, but that didn't help. I've also tried giving the driver more memory (spark-shell --driver-memory 3g), didn't help either.
Is there a better way to do this?
If that's all you need, you can just do it without Scala via hdfs command-line:
hadoop fs -cat /hdfs/path/prefix /hdfs/path/badjson /hdfs/path/suffix | hadoop fs -put - /hdfs/path/properjson
where the file prefix contains just {"data" :, and the suffix contains a single }.
1) Don't use Scala's Stream. It is just a broken abstraction: it's extremely difficult to use an infinite/huge stream without blowing up the heap. Stick either with a plain old Iterator or use more principled approaches from fs2 / zio.
In your case the readLines value accumulates every line read so far, even though it only needs to hold one at a time.
2) The sb object leaks as well: it accumulates the entire file content in memory.
Consider writing the corrected content directly into some OutputStreamWriter, as in the sketch below.
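A rough sketch of that approach, assuming fs is the Hadoop FileSystem from the original snippet and using a hypothetical output path; lines are read with an Iterator and written straight to the output, so neither the lines nor the full document accumulate in memory:
import java.io.{BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.Path
val inPath = new Path("/path/to/crappy/file.json")
val outPath = new Path("/path/to/fixed/file.json") // hypothetical destination
val reader = new BufferedReader(new InputStreamReader(fs.open(inPath), StandardCharsets.UTF_8))
val writer = new BufferedWriter(new OutputStreamWriter(fs.create(outPath, true), StandardCharsets.UTF_8))
try {
  writer.write("{\"data\" : ")
  // Iterator keeps only the current line, unlike Stream, which memoizes everything it has seen.
  Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(line => writer.write(line))
  writer.write("}")
} finally {
  reader.close()
  writer.close()
}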

How to check a file/folder is present using pyspark without getting exception

I am trying to check whether a file is present before reading it from PySpark in Databricks, to avoid exceptions. I tried the code snippet below, but I get an exception when the file is not present.
from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
    df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter",",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")
When the file is present, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'".
Thanks #Dror and #Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the prefix of your cluster's file system.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
The answer posted by #rosefun worked for me, but it took me a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is the same, and it works fine for checking whether or not a file exists in the S3 bucket path that you provide.
You will have to adjust this function depending on how you specify the path value you pass to it.
path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)
If the path variable you define has the s3 prefix, this will work.
The part of the code that splits the string extracts just the bucket name:
path.split("/")[2] will give you `bucket-name`
But if the path does not have the s3 prefix, you will have to change the function slightly, as shown below:
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
Looks like you should change except IOError: to except AnalysisException:.
Spark throws different errors/exceptions than regular Python in many cases. It's not doing typical Python I/O operations when reading a file, so it makes sense for it to throw a different exception.
Nice to see you on StackOverflow.
I second dijksterhuis's solution, with one caveat: AnalysisException is a very general exception in Spark and can result from various causes, not only from a missing file.
If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether it's S3, POSIX, or something else). The downside of this solution is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
You can validate existence of a file as seen here:
import os
if os.path.isfile('/path/file.csv'):
    print("File Exists")
    my_df = spark.read.load("/path/file.csv")
    ...
else:
    print("File doesn't exist")
dbutils.fs.ls(file_location)
Do not import dbutils. It's already there when you start your cluster.

How to save RDD data into json files, not folders

I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly I want to save the outputs, but I am mentioning it just in case).
The following code works well, but it saves folders with the names like jsonFile-19-45-46.json, and then inside the folders it saves files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) into JSON files rather than folders? I thought that repartition(1) would do the trick, but it didn't.
myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-"+datetimeString+".json")
}
AFAIK there is no option to save it as a single file. Spark is a distributed processing framework, and it's not good practice to write to a single file; instead, each partition writes its own file in the specified path.
We can only pass the output directory where we want to save the data. The OutputWriter will create one or more files (depending on the number of partitions) inside the specified path, with a part- file name prefix.
As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For Java I implemented this one. Hope it helps:
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);

How to write streaming data to S3?

I want to write RDD[String] to Amazon S3 in Spark Streaming using Scala. These are basically JSON strings. Not sure how to do it more efficiently.
I found this post, in which the library spark-s3 is used. The idea is to create a SparkContext and then an SQLContext. After this, the author of the post does something like this:
myDstream.foreachRDD { rdd =>
  rdd.toDF().write
    .format("com.knoldus.spark.s3")
    .option("accessKey","s3_access_key")
    .option("secretKey","s3_secret_key")
    .option("bucket","bucket_name")
    .option("fileType","json")
    .save("sample.json")
}
What other options are there besides spark-s3? Is it possible to append streaming data to a file on S3?
Files on S3 cannot be appended. An "append" in S3 means replacing the existing object with a new object that contains the additional data.
You should take a look at the mode method of DataFrameWriter in the Spark documentation:
public DataFrameWriter mode(SaveMode saveMode)
Specifies the behavior when data or table already exists. Options include:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: default option, throw an exception at runtime.
You can try something like this with the Append save mode:
import org.apache.spark.sql.SaveMode
rdd.toDF.write
  .format("json")
  .mode(SaveMode.Append)
  .save("s3://iiiii/ttttt.json")
Spark Append:
Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
Basically, you can choose which output format you want by passing it to the format method:
public DataFrameWriter format(java.lang.String source)
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
e.g. as Parquet:
df.write().format("parquet").save("yourfile.parquet")
or as json:
df.write().format("json").save("yourfile.json")
Edit: added details about S3 credentials.
There are two different options for setting credentials, and we can see this in SparkHadoopUtil.scala:
either with environment variables (System.getenv("AWS_ACCESS_KEY_ID")) or with spark.hadoop.* properties:
SparkHadoopUtil.scala:
if (key.startsWith("spark.hadoop.")) {
  hadoopConf.set(key.substring("spark.hadoop.".length), value)
}
So you need to get the Hadoop configuration via javaSparkContext.hadoopConfiguration() or scalaSparkContext.hadoopConfiguration and set:
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", mySecretKey)