I am using Cucumber with Scala and the following jars:
cucumber-junit-1.2.0.jar
cucumber-core-1.2.0.jar
cucumber-html-0.2.3.jar
cucumber-jvm-deps-1.0.3.jar
cucumber-java-1.2.0.jar
I am using the Cucumber framework in my big data testing, and using Spark to read/write/process the data.
I am using the Cucumber cli.Main method to run my features:
import cucumber.api.cli.Main
import org.apache.spark.SparkFiles

val glue = args(0)
val gluePath = args(1)
val tag = args(2)
val tagName = args(3)
val fileNames = args(4)
val arrFileNames = fileNames.split(",")
arrFileNames.foreach(x => sqlContext.sparkContext.addFile(x))

val plugin = "-p"
val pluginNameAndPath = "com.cucumber.listener.ExtentCucumberFormatter:hdfs:///tmp/target/cucumber-reports/report.html"
val pluginNameAndPathJson = "json:hdfs:///tmp/target/cucumber-reports/report.json"

Main.main(Array(glue, gluePath, tag, tagName, plugin, pluginNameAndPath,
  plugin, pluginNameAndPathJson, SparkFiles.get("xxx.feature")))
When I run the above code in cluster mode, it runs successfully, but the Cucumber report is not generated in the given HDFS location.
But when I run in client mode (without hdfs:///), it runs successfully and creates the Cucumber report on the local node.
It seems like Cucumber does not have HDFS file system support, so it cannot create the file in HDFS.
Can anyone please help with how to create the Cucumber report at the given HDFS path, or suggest any other way to achieve this?
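One possible workaround (just a sketch, not from the original post: it assumes the report formatters can only write to the driver's local file system, and the local and HDFS directories below are placeholders) is to let Cucumber write the report to a local path and then copy it to HDFS with the Hadoop FileSystem API after Main.main returns:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical local directory the formatters wrote to, and the target HDFS directory.
val localReportDir = "/tmp/cucumber-reports"
val hdfsReportDir  = "hdfs:///tmp/target/cucumber-reports"

// Copy the locally generated report files up to HDFS (overwrite = true, keep the local copy).
val fs = FileSystem.get(new java.net.URI(hdfsReportDir), new Configuration())
fs.copyFromLocalFile(false, true, new Path(localReportDir), new Path(hdfsReportDir))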
I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve them from within the Jupyter notebook code. I've tried doing so in two ways, but neither of them worked for me:
Method A. as Arguments and then tried to retrieve them using sys.argv[].
Method B. as Spark configuration and then tried to retrieve them using sc.getConf().getAll().
I suspect that either:
I am not specifying parameters correctly
or I am using the wrong way to retrieve them in the Jupyter Notebook code
or parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for the Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.
The issue is with the entryFilePath. In the Spark activity of the HDInsight cluster, you must give the entryFilePath as either a .jar file or a .py file. When we follow this, we can successfully pass arguments, which can be accessed using sys.argv.
The following is an example of how you can pass arguments to a Python script.
The code inside nb1.py (sample) is as shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
# Using the argument from the pipeline to create the table.
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and the required table will be created (the name of this table is passed as an argument from the pipeline's Spark activity). We can query this table in the HDInsight cluster's Jupyter notebook using the following query:
select * from new_hvac
NOTE:
Please ensure that you are passing arguments to a Python script (.py file), not to a Python notebook.
What I'm trying to do
I am attempting to unit test an equivalent of the following function, using pytest:
def read_s3_csv_into_spark_df(s3_uri, spark):
    df = spark.read.csv(
        s3_uri.replace("s3://", "s3a://")
    )
    return df
The test is defined as follows:
def test_load_csv(self, test_spark_session, tmpdir):
    # here I 'upload' a fake csv file using the tmpdir fixture and moto's mock_s3 decorator
    # now that the fake csv file is uploaded to s3, I try to read it into a spark df using my function
    baseline_df = read_s3_csv_into_spark_df(
        s3_uri="s3a://bucket/key/baseline.csv",
        spark=test_spark_session
    )
In the above test, the test_spark_session fixture used is defined as follows:
@pytest.fixture(scope="session")
def test_spark_session():
    test_spark_session = (
        SparkSession.builder.master("local[*]").appName("test").getOrCreate()
    )
    return test_spark_session
The problem
I am running pytest on a SageMaker notebook instance, using python 3.7, pytest 6.2.4, and pyspark 3.1.2. I am able to run other tests by creating the DataFrame using test_spark_session.createDataFrame, and then performing aggregations. So the local spark context is indeed working on the notebook instance with pytest.
However, when I attempt to read the csv file in the test I described above, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o84.csv.
E : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can I test this function without actually uploading any csv files to S3?
I have also tried providing the S3 uri using s3:// instead of s3a://, but got a different, related error: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3".
I am trying to split my csv files into multiple csv part files using Spark Scala. When I execute the code manually it works fine and I am able to see the part files. But when I run the same command from a jar submitted via spark-submit, I get an error like:
split: can not open 'file_location/filename' for reading No such file or directory.
Can someone please guide me on what the issue is here?
Code:
import scala.sys.process._
import org.apache.hadoop.fs.{FileSystem, Path}

val file1 = "filelocation/filename"
val file2 = file1.replace(".csv", "_")
val fs = FileSystem.get(sc.hadoopConfiguration)  // sc is the SparkContext
if (fs.exists(new Path(file1))) {
  s"split -l 20000000 -d --additional-suffix=.csv /hadoop$file1 /hadoop$file2".!
}
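If the end goal is simply to get multiple csv part files, one possible alternative (a sketch only, not from the original post; the input/output paths and the partition count are placeholders) is to let Spark itself do the splitting, which avoids shelling out to the local split command and depending on a /hadoop mount being visible on whichever node the job lands on:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Read the source csv from HDFS and write it back out as multiple part files.
val df = spark.read.option("header", "true").csv("file_location/filename.csv")
df.repartition(10)  // roughly controls how many part files are produced
  .write
  .option("header", "true")
  .csv("file_location/filename_parts")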
I am new to Flink (v1.3.2) and trying to read Avro records continuously in Scala on EMR. I know that for files you can use something like the following, and it will keep running and scanning the directory:
val stream = env.readFile(
  inputFormat = textInputFormat,
  filePath = path,
  watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
  interval = Time.seconds(10).toMilliseconds
)
Is there a similar way in Flink for Avro records? So far I have the following code:
val textInputFormat = new AvroInputFormat(new Path(path), classOf[User])
textInputFormat.setNestedFileEnumeration(true)
val avroInputStream = env.createInput(textInputFormat)
val output = avroInputStream
  .map(line => (line.getUserID, 1))
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)
output.print()
I am able to see the output, but then Flink switches to FINISHED; I still want the job to keep running and waiting for any new files that arrive in the future. Is there something like FileProcessingMode.PROCESS_CONTINUOUSLY for this? Please suggest!
I figured this out by setting up a Flink YARN session on EMR and making it run with PROCESS_CONTINUOUSLY:
env.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
Create a new Flink YARN session using flink-yarn-session -n 2 -d
Get the application_id using yarn application -list; for example, application_0000000000_0002
Attach the flink run job to that application_id: flink run -m yarn-cluster -yid application_0000000000_0002 xxx.jar
More detail can be found in the EMR documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html
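Putting the two pieces together, a minimal sketch of the continuously monitored Avro source (assuming the same path, AvroInputFormat, Path, and Avro-generated User class as in the question; the exact AvroInputFormat package differs between Flink versions, so those imports are omitted here) would look roughly like this:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Same input format as in the question.
val avroFormat = new AvroInputFormat(new Path(path), classOf[User])
avroFormat.setNestedFileEnumeration(true)

// Re-scan the directory every 10 seconds instead of finishing after a single pass.
val users = env.readFile(avroFormat, path,
  FileProcessingMode.PROCESS_CONTINUOUSLY, Time.seconds(10).toMilliseconds)

users
  .map(u => (u.getUserID, 1))
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)
  .print()

env.execute("continuous-avro-read")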
In Scala, as an experiment, I create a sequence file on Tachyon using Spark and read it back in. I also want to delete the file from Tachyon using the Spark script.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't know the Scala language very well and I cannot find a reference about file path manipulation. I did find a way of using Java from Scala to do this, but I cannot get it to work with Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
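Since the question asks for a way to do this from the Spark script itself, here is a sketch in Scala that goes through the Hadoop FileSystem API instead of java.io.File (assuming the Tachyon client is on the classpath and the tachyon:// scheme is registered in the Hadoop configuration, which must already be the case for saveAsSequenceFile to write there):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val target = new Path("tachyon://127.0.0.1:19998/files/123.sf2")

// Resolve the filesystem for the tachyon:// scheme from Spark's Hadoop configuration.
val fs = FileSystem.get(new URI(target.toString), sc.hadoopConfiguration)

// Recursive delete: saveAsSequenceFile actually writes a directory of part files.
fs.delete(target, true)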