Creating temporary resource test files in Scala - scala

I am currently writing tests for a function that takes file paths and loads a dataset from them. I am not able to change the function. To test it currently I am creating files for each run of the test function. I am worried that simply making files and then deleting them is a bad practice. Is there a better way to create temporary test files in Scala?
import java.io.{File, PrintWriter}
val testFile = new File("src/main/resources/temp.txt" )
val pw = new PrintWriter(testFile)
val testLines = List("this is a text line", "this is the next text line")
testLines.foreach(pw.write)
pw.close
// test logic here
testFile.delete()

I would generally prefer java.nio over java.io. You can create a temporary file like so:
import java.nio.Files
Files.createTempFile()
You can delete it using Files.delete. To ensure that the file is deleted even in the case of an error, you should put the delete call into a finally block.

Related

Remove all files with a given extension using scala spark

I have some csv.crc files generated when I try to write a dataframe into a csv file using spark. Therefore I want to delete all files with .csv.crc extension
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
val srcPath=new Path("./src/main/resources/myDirectory/*.csv.crc")
println(fs.exists(srcPath))
println(fs.isFile(srcPath))
if(fs.exists(srcPath) && fs.isFile(srcPath)) {
fs.delete(srcPath,true)
}
both prinln lines give false as the value. therefor its not even going into the if condition. How can I delete all.csv.crc files using scala and spark
You can use below option to avoid crc file while writing.(Note :you're eliminating checksum).
fs.setVerifyChecksum(false).
else you can avoid crc files while reading using below,
config.("dfs.client.read.shortcircuit.skip.checksum", "true").

How to check a file/folder is present using pyspark without getting exception

I am trying to keep a check for the file whether it is present or not before reading it from my pyspark in databricks to avoid exceptions? I tried below code snippets but i am getting exception when file is not present
from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter",",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
print("File Exists")
except IOError:
print("file not found")`
When i have file, it reads file and "prints File Exists" but when the file is not there it will throw "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'"
Thanks #Dror and #Kini. I run spark on cluster, and I must add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]), here s3 is the prefix of the file system of your cluster.
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
The answer posted by #rosefun worked for me but it took lot of time for me to get it working. So I am giving some details about how that solution is working and what are the stuffs you should avoid.
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is same and it works fine to check whether a file exists or not in the S3 bucket path that you provided.
You will have to change this function based on how you are specifying your path value to this function.
path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)
if the path variable that you are defining is having the s3 prefix in the path then it would work.
Also the portion of the code which split the string gets you just the bucket name as follows:
path.split("/")[2] will give you `bucket-name`
but if you don't have s3 prefix in the path then you will have to use the function by changing some code and which is as below:
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
Looks like you should change except IOError: to except AnalysisException:.
Spark throws different errors/exception than regular python in a lot of cases. It’s not doing typical python io operations when reading a file, so makes sense for it to throw a different exception.
nice to see you on StackOverFlow.
I second dijksterhuis's solution, with one exception -
Analysis Exception is very general exception in Spark, and may be resulted for various reasons, not only due to missing file.
If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction, and access the storage system directly (Whether is s3, posix, or something else). The down side of this solution is the lack of abstraction - once you'll change your underlying FS, you'll need to change your code as well.
You can validate existence of a file as seen here:
import os
if os.path.isfile('/path/file.csv'):
print("File Exists")
my_df = spark.read.load("/path/file.csv")
...
else:
print("File doesn't exists")
dbutils.fs.ls(file_location)
Do not import dbutils. It's already there when you start your cluster.

Prepend header record (or a string / a file) to large file in Scala / Java

What is the most efficient (or recommended) way to prepend a string or a file to another large file in Scala, preferably without using external libraries? The large file can be binary.
E.g.
if prepend string is:
header_information|123.45|xyz\n
and large file is:
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
...
I would expect to get:
header_information|123.45|xyz
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
...
I come up with the following solution:
Turn prepend string/file into InputStream
Turn large file into InputStream
"Combine" InputStreams together using java.io.SequenceInputStream
Use java.nio.file.Files.copy to write to target file
object FileAppender {
def main(args: Array[String]): Unit = {
val stringToPrepend = new ByteArrayInputStream("header_information|123.45|xyz\n".getBytes)
val largeFile = new FileInputStream("big_file.dat")
Files.copy(
new SequenceInputStream(stringToPrepend, largeFile),
Paths.get("output_file.dat"),
StandardCopyOption.REPLACE_EXISTING
)
}
}
Tested on ~30GB file, took ~40 seconds on MacBookPro (3.3GHz/16GB).
This approach can be used (if necessary) to combine multiple partitioned files created by e.g. Spark engine.

Passing In Config In Gatling Tests

Noob to Gatling/Scala here.
This might be a bit of a silly question but I haven't been able to find an example of what I am trying to do.
I want to pass in things such as the baseURL, username and passwords for some of my calls. This would change from env to env, so I want to be able to change these values between the envs but still have the same tests in each.
I know we can feed in values but it appears that more for iterating over datasets and not so much for passing in the config values like I have.
Ideally I would like to house this information in a JSON file and not pass it in on the command line, but maybe thats not doable?
Any guidance on this would be awesome.
I have a similar setup and you can use pure scala here .In this scenario you can create an object called Config for eg
object Configuration { var INPUT_PROFILE_FILE_NAME = ""; }
This class can also read a file , I have the below code in the above object
val file = getClass.getResource("data/config.properties").getFile()
val prop = new Properties()
prop.load(new FileInputStream(file));
INPUT_PROFILE_FILE_NAME = prop.getProperty("inputProfileFileName")
Now you can import this object in Gattling Simulation File
val profileName= Configuration.INPUT_PROFILE_FILE_NAME ;
https://docs.scala-lang.org/tutorials/tour/singleton-objects.html.html

Using the same object from different PyTest testfiles?

im working with pytest right know. My Problem is that I need to use the same object generated in one test_file1.py in another test_file2.py which are in two different directories and invoked separately from another.
Heres the code:
$ testhandler.py
# Starts the first testcases
returnValue = pytest.main(["-x", "--alluredir=%s" % test1_path, "--junitxml=%s" % test1_path+"\\JunitOut_test1.xml", test_file1])
# Starts the second testcases
pytest.main(["--alluredir=%s" % test2_path, "--junitxml=%s" % test2_path+"\\JunitOut_test2.xml", test_file2])
As you can see the first one is critical, therefore I start it with -x to interrupt if there is an error. And --alluredir deletes the target directory before starting the new tests. Thats why I decided to invoke pytest twice in my testhandler.py (moreoften in the future maybe)
Here is are the test_files:
$ test1_directory/test_file1.py
#pytest.fixture(scope='session')
def object():
# Generate reusable object from another file
def test_use_object(object):
# use the object generated above
Note that the object is actually a class with parameters and functions.
$ test2_directory/test_file2.py
def test_use_object_from_file1():
# reuse the object
I tried to generate the object in the testhandler.py file and importing it to both testfiles. The problem was that the object was not excatly the same as in the testhandler.py or test_file1.py.
My question is now if there is a possibility to use excatly that one generated object. Maybe with a global conftest.py or something like that.
Thank you for your time!
By exactly the same you mean a similar object, right? The only way to do this is to marshal it in the first process and unmarshal it in the other process. One way to do it is by using json or pickle as marshaller, and pass the filename to use for the json/pickle file to be able to read the object back.
Here's some sample code, untested:
# conftest.py
def pytest_addoption(parser):
parser.addoption("--marshalfile", help="file name to transfer files between processes")
#pytest.fixture(scope='session')
def object(request):
filename = request.getoption('marshalfile')
if filename is None:
raise pytest.UsageError('--marshalfile required')
# dump object
if not os.path.isfile(filename):
obj = create_expensive_object()
with open(filename, 'wb') as f:
pickle.dump(f, obj)
else:
# load object, hopefully in the other process
with open(filename, 'rb') as f:
obj = pickle.load(f)
return obj