Pass parameters/arguments to HDInsight/Spark Activity in Azure Data Factory - pyspark

I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve them from within the Jupyter notebook code. I've tried doing so in two ways, but neither worked for me:
Method A: as Arguments, which I then tried to retrieve using sys.argv.
Method B: as Spark configuration, which I then tried to retrieve using sc.getConf().getAll().
I suspect that either:
I am not specifying the parameters correctly,
or I am using the wrong way to retrieve them in the Jupyter notebook code,
or the parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.

The issue is with the entryFilePath. In the Spark activity of an HDInsight cluster, the entryFilePath must point to either a .jar file or a .py file. When you follow this, you can successfully pass arguments and retrieve them using sys.argv.
The following is an example of how you can pass arguments to a Python script.
The sample code inside nb1.py is shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
# Use the argument passed from the pipeline as the table name
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and the required table is created (the table name is passed as an argument from the pipeline's Spark activity). We can then query this table in the HDInsight cluster's Jupyter notebook using the following query:
select * from new_hvac
NOTE:
Please ensure that you pass arguments to a Python script (.py file), not to a Jupyter notebook.
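As a quick sanity check, a minimal entry script can print whatever the Spark activity passes in; the file name args_check.py below is hypothetical and not part of the original answer:
# args_check.py - hypothetical helper to verify what the ADF Spark activity passes in
import sys
from pyspark import SparkContext

sc = SparkContext()
# sys.argv[0] is the script path; the activity's Arguments start at index 1
print("Arguments received from the Spark activity:", sys.argv[1:])
sc.stop()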

Related

How to pass an argument to a spark submit job in airflow

I have to trigger a PySpark module from Airflow using a SparkSubmitOperator. The PySpark module needs to take the Spark session variable as an argument. I used application_args to pass the parameter to the PySpark module, but when I ran the DAG the SparkSubmitOperator failed and the parameter I passed was treated as a NoneType variable.
I need to know how to pass an argument to a PySpark module triggered through SparkSubmitOperator.
The DAG code is below:
from pyspark.sql import SparkSession

# Airflow imports (assumed; the SparkSubmitOperator import path varies by Airflow version)
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()

spark_config = {
    'conn_id': 'spark_default',
    'driver_memory': '1g',
    'executor_cores': 1,
    'num_executors': 1,
    'executor_memory': '1g'
}

dag = DAG(
    dag_id="spark_session_prgm",
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False)

spark_submit_task1 = SparkSubmitOperator(
    task_id='spark_submit_task1',
    application='/home/airflow_home/dags/tmp_spark_1.py',
    # application_args passes the literal string 'spark', not the SparkSession object
    application_args=['spark'],
    **spark_config, dag=dag)
The sample code in tmp_spark_1.py program:
After a bit of debugging, I found the solution to my problem:
argparse was the reason it was not working. Instead, I used sys.argv[1], and that does the job.
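For illustration, here is a sketch of how the submitted module can read that argument with sys.argv instead of argparse; the body below is assumed, since the original tmp_spark_1.py code is not shown:
# tmp_spark_1.py (illustrative sketch, not the original module)
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # application_args=['spark'] in the DAG arrives here as sys.argv[1]
    arg_from_dag = sys.argv[1] if len(sys.argv) > 1 else None
    print("Argument received from SparkSubmitOperator:", arg_from_dag)

    # The SparkSession itself is created inside the module; only strings can be
    # passed through application_args.
    spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()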

How to mock an S3AFileSystem locally for testing spark.read.csv with pytest?

What I'm trying to do
I am attempting to unit test an equivalent of the following function, using pytest:
def read_s3_csv_into_spark_df(s3_uri, spark):
    df = spark.read.csv(
        s3_uri.replace("s3://", "s3a://")
    )
    return df
The test is defined as follows:
def test_load_csv(self, test_spark_session, tmpdir):
    # here I 'upload' a fake csv file using the tmpdir fixture and moto's mock_s3 decorator
    # now that the fake csv file is uploaded to s3, I try to read it into a spark df using my function
    baseline_df = read_s3_csv_into_spark_df(
        s3_uri="s3a://bucket/key/baseline.csv",
        spark=test_spark_session
    )
In the above test, the test_spark_session fixture used is defined as follows:
@pytest.fixture(scope="session")
def test_spark_session():
    test_spark_session = (
        SparkSession.builder.master("local[*]").appName("test").getOrCreate()
    )
    return test_spark_session
The problem
I am running pytest on a SageMaker notebook instance, using python 3.7, pytest 6.2.4, and pyspark 3.1.2. I am able to run other tests by creating the DataFrame using test_spark_session.createDataFrame, and then performing aggregations. So the local spark context is indeed working on the notebook instance with pytest.
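For context, this is roughly the kind of createDataFrame-based test that already passes with the fixture above (a hypothetical example, not taken from the original post):
def test_aggregation_without_s3(test_spark_session):
    # Build a small DataFrame locally, with no S3 access involved
    df = test_spark_session.createDataFrame(
        [("a", 1), ("a", 2), ("b", 3)], ["key", "value"]
    )
    counts = {r["key"]: r["count"] for r in df.groupBy("key").count().collect()}
    assert counts == {"a": 2, "b": 1}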
However, when I attempt to read the csv file in the test I described above, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o84.csv.
E : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can I, without actually uploading any csv files to S3, test this function?
I have also tried providing the S3 uri using s3:// instead of s3a://, but got a different, related error: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3".
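The ClassNotFoundException indicates the S3A connector is simply not on the local test session's classpath. As a hedged sketch of one possible direction (not from the original post), the fixture could pull in the hadoop-aws package and point S3A at a standalone moto server, assuming the hadoop-aws version matches the Hadoop bundled with PySpark 3.1.2 and that a moto server is running locally (e.g. started with moto_server s3 -p 5000):
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def test_spark_session():
    # Hypothetical variant of the fixture: fetch the S3A connector and direct it
    # to a local moto endpoint instead of real S3 (versions and port are assumptions)
    return (
        SparkSession.builder.master("local[*]")
        .appName("test")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
        .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:5000")
        .config("spark.hadoop.fs.s3a.access.key", "testing")
        .config("spark.hadoop.fs.s3a.secret.key", "testing")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )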

ModuleNotFoundError: No module named 'pyspark.dbutils' while running multiple .py files/notebooks on job clusters in Databricks

I am working in a TravisCI, MLflow, and Databricks environment where .travis.yml sits on the git master branch and detects any change in a .py file; whenever one is updated, it runs an MLflow command to run the .py file in the Databricks environment.
My MLProject file looks as follows:
name: mercury_cltv_lib
conda_env: conda-env.yml
entry_points:
  main:
    command: "python3 run-multiple-notebooks.py"
The workflow is as follows:
TravisCI detects a change in the master branch --> it triggers a build that runs the MLflow command, which spins up a job cluster in Databricks to run the .py file from the repo.
It worked fine with one .py file, but when I tried to run multiple notebooks using dbutils, it throws:
File "run-multiple-notebooks.py", line 3, in <module>
from pyspark.dbutils import DBUtils
ModuleNotFoundError: No module named 'pyspark.dbutils'
Please find below the relevant code section from run-multiple-notebooks.py
def get_spark_session():
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()

def get_dbutils(self, spark=None):
    try:
        if spark == None:
            spark = spark
        from pyspark.dbutils import DBUtils  # error line
        dbutils = DBUtils(spark)  # error line
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    spark = get_spark_session()
    dbutils = get_dbutils(spark)
I tried all the options, including
https://stackoverflow.com/questions/61546680/modulenotfounderror-no-module-named-pyspark-dbutils
but it is not working :(
Can someone please suggest a fix for the above error when running a .py file on a job cluster? My code works fine inside a Databricks notebook, but running it from outside via TravisCI and MLflow isn't working, which is a must-have requirement for pipeline automation.

How to write cucumber json report in HDFS

I am using Cucumber with Scala and the following jars:
cucumber-junit-1.2.0.jar
cucumber-core-1.2.0.jar
cucumber-html-0.2.3.jar
cucumber-jvm-deps-1.0.3.jar
cucumber-java-1.2.0.jar
I am using the Cucumber framework in my big data testing and using Spark to read/write/process the data.
I am using the Cucumber cli.Main method to run my features:
import cucumber.api.cli.Main
import org.apache.spark.SparkFiles

val glue = args(0)
val gluePath = args(1)
val tag = args(2)
val tagName = args(3)
val fileNames = args(4)
val arrFileNames = fileNames.split(",")
arrFileNames.foreach(x => sqlContext.sparkContext.addFile(x))

val plugin = "-p"
val pluginNameAndPath = "com.cucumber.listener.ExtentCucumberFormatter:hdfs:///tmp/target/cucumber-reports/report.html"
val pluginNameAndPathJson = "json:hdfs:///tmp/target/cucumber-reports/report.json"

Main.main(Array(glue, gluePath, tag, tagName, plugin, pluginNameAndPath, plugin, pluginNameAndPathJson, SparkFiles.get("xxx.feature")))
When I run the above code in cluster mode, it runs successfully, but the Cucumber report is not generated in the given HDFS location.
When I run it in client mode (without hdfs:///), it runs successfully and creates the Cucumber report on the local node.
It seems that Cucumber does not know about the HDFS file system, so it cannot create the file in HDFS.
Can anyone please help with how to create the Cucumber report at the HDFS path, or suggest any other way to achieve this?

Spark Tachyon: How to delete a file?

As an experiment, I create a sequence file on Tachyon using Spark (in Scala) and read it back in. I also want to delete the file from Tachyon within the same Spark script.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't know the Scala language very well and I cannot find a reference on file path manipulation. I did find a way of using Java from within Scala to do this, but I cannot get it to work with Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java