Submit command line arguments to a pyspark job on airflow

I have a pyspark job available on GCP Dataproc to be triggered on airflow as shown below:
config = help.loadJSON("batch/config_file")
MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        "args": <TO_BE_ADDED>
    },
}
I need to supply command line arguments to this pyspark job as shown below [this is how I am running my pyspark job from the command line]:
spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
I am providing the arguments to my pyspark job using "configparser". Therefore, arg1 is the key and val1 is the value from my spark-submit command above.
How do I define the "args" param in the "MY_PYSPARK_JOB" defined above [equivalent to my command line arguments]?
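For context, here is a minimal sketch of how such --key value pairs arrive in a PySpark driver script; this is only an illustration, since the actual script reads them through a configparser-based helper that is not shown in the question:
# my_spark_file.py -- illustrative sketch only, not the actual driver
import sys

# spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
# places everything after the script path into sys.argv.
cli_args = sys.argv[1:]                            # ['--arg1', 'val1', '--arg2', 'val2']
pairs = dict(zip(cli_args[0::2], cli_args[1::2]))  # {'--arg1': 'val1', '--arg2': 'val2'}
print(pairs.get("--arg1"))                         # val1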

I finally managed to solve this conundrum.
If we are making use of ConfigParser, the key has to be specified as below [irrespective of whether the argument is passed on the command line or via Airflow]:
--arg1
In Airflow, the configs are passed as a Sequence[str] (as mentioned by @Betjens below) and each argument is defined as follows:
arg1=val1
Therefore, as per my requirement, command line arguments are defined as depicted below:
"args": ["--arg1=val1",
"--arg2=val2"]
PS: Thank you @Betjens for all your suggestions.
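Putting the pieces together, the job dict from the question would end up looking roughly like this (a sketch assembled only from the snippets above; the config helper is the one from the question):
config = help.loadJSON("batch/config_file")
MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        # one complete "--key=value" string per argument
        "args": ["--arg1=val1", "--arg2=val2"],
    },
}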

You have to pass a Sequence[str]. If you check DataprocSubmitJobOperator you will see that the job param expects a google.cloud.dataproc_v1.types.Job (or a dict of the same form).
class DataprocSubmitJobOperator(BaseOperator):
    ...
    :param job: Required. The job resource. If a dict is provided, it must be of the same form as the protobuf message
        :class:`~google.cloud.dataproc_v1.types.Job`
So, in the section about the PySpark job type, which is google.cloud.dataproc_v1.types.PySparkJob:
args Sequence[str]
Optional. The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.
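For illustration, wiring such a job dict into the operator could look like the sketch below; the task_id, region and DAG context are placeholders I am assuming, not values from the question:
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Inside a DAG definition: submit MY_PYSPARK_JOB; its "args" end up in
# sys.argv of the driver script on the Dataproc cluster.
submit_pyspark = DataprocSubmitJobOperator(
    task_id="submit_pyspark_job",       # placeholder task id
    job=MY_PYSPARK_JOB,                 # dict in the same form as dataproc_v1.types.Job
    region="us-central1",               # placeholder region
    project_id="my_project_id",
    gcp_conn_id="google_cloud_default",
)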

Related

Karate- Gatling: Not able to run scenarios based on tags

I am trying to run a performance test on the scenario tagged as perf from the feature file below:
@tag1 @tag2 @tag3

Background:
user login

@tag4 @perf
Scenario: scenario1

@tag4
Scenario: scenario2
Below is my .scala file setup:
class PerfTest extends Simulation {
  val protocol = karateProtocol()
  val getTags = scenario("Name goes here").exec(karateFeature("classpath:filepath"))

  setUp(
    getTags.inject(
      atOnceUsers(1)
    ).protocols(protocol)
  )
}
I have tried passing the tags from the command line as well as passing the tag as an argument to the exec method in the Scala setup.
Terminal command:
mvn clean test-compile gatling:test "-Dkarate.env={env}" "-Dkarate.options= --tags @perf"
.scala update: I have also tried passing the tag as an argument in the karateFeature call.
val getTags = scenario("Name goes here").exec(karateFeature("classpath:filepath", "@perf"))
Both scenarios are being executed with either approach. Any pointers on how I can force only the test tagged perf to run?
I wanted to share the finding here. I realized it works fine when I pass the tag info in the .scala file.
My scenario with the perf tag was a combination of a GET and a POST call, as I needed some data from the GET call to pass into the POST call. That's why I was seeing both calls when running the performance test.
I did not find any reference in the Karate-Gatling documentation to passing tags in the terminal execution command, so I am assuming that might not be a supported case.

Running local python code with arguments in Databricks via dbx utility

I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility, to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help?
I am following this guide, but it is a bit unclear and lacks good examples:
https://dbx.readthedocs.io/en/latest/quickstart.html
I also found this, but it is not clear either: How can I pass and than get the passed arguments in databricks job
The Databricks manuals are not very clear in this area.
My PySpark script:
import sys

n = len(sys.argv)
print("Total arguments passed:", n)
print("Script name", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
dbx deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}
dbx execute command:
dbx execute \
  --cluster-id=<reducted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package
Output:
(parameter-test) user#735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user#735 parameter-test %
Please help :)
It turns out the parameters section of my deployment.json was in the wrong place: it needs to be nested inside spark_python_task. Here is the corrected example:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}
I've also posted my original question in the Databricks forum: https://community.databricks.com/s/feed/0D58Y00008znXBxSAM?t=1659032862560
Hope it helps someone else.
I also believe the "jobs" parameter is deprecated and you should use "workflows" instead.
Source: https://dbx.readthedocs.io/en/latest/migration/
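Based on my reading of that migration guide, a newer-style deployment file would look roughly like the sketch below (untested; the environments/workflows nesting is my interpretation of the docs, the rest mirrors the corrected example above):
{
  "environments": {
    "default": {
      "workflows": [
        {
          "name": "parameter-test",
          "spark_python_task": {
            "python_file": "parameter-test.py",
            "parameters": ["test1", "test2"]
          }
        }
      ]
    }
  }
}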

Passing parameters using AZ CLI for ARM template deployment

I am trying to use az group deployment create to perform an ARM template deployment and I want to pass in parameters where the values are defined in variables. I can do a single parameter with no issues using the syntax below:
--parameters parameter1=$var1
But when I try to add additional parameters using the syntax below, it fails:
--parameters parameter1=$var1, parameter2=$var2
The syntax below fails as well since it will not use the values of the variables:
--parameters '{
"parameter1": { "value": "$var1" },
"parameter2": { "value": "$var2" }
}'
Does anyone know if what I am trying to do is possible and what the correct syntax would be?
I was fighting a combination of a corrupt shell and slightly incorrect syntax. The correct syntax for what I was trying to do is listed below:
--parameters parameter1=$var1 parameter2=$var2
Or, for a cleaner view when several parameters are involved:
--parameters parameter1=$var1 `
parameter2=$var2 `
parameter3=$var3

Pass opt arguments in an application executed as a .jar through spark-submit --class and use the existing context

I am writing a Scala project in which I want classes that are executable from spark-submit as a jar (e.g. spark-submit --class org.project).
My problems are the following:
I want to use the Spark context configuration that the user sets when doing a spark-submit and optionally overwrite some parameters like the application name. Example: spark-submit --num-executors 6 --class org.project should pass 6 to the number-of-executors configuration field of the Spark context.
I want to be able to pass optional parameters like --inputFile or --verbose to my project without interfering with the Spark parameters (ideally avoiding name overlap).
Example: spark-submit --num-executors 6 --class org.project --inputFile ./data/mystery.txt should pass "--inputFile ./data/mystery.txt" to the args input of the main method of class org.project.
My progress on those problems is the following:
In my main method I run
val conf = new SparkConf().setAppName("project")
val sc = new SparkContext(conf)
but I am not sure if this does things as expected.
Spark considers those optional arguments as arguments of spark-submit itself and outputs an error.
Note 1: My class currently does not inherit from any other class.
Note 2: I am new to the world of Spark and I couldn't find anything relevant with a basic search.
You will have to handle parameter parsing yourself. Here we use scopt.
When you spark-submit your job, it must enter through an object's def main(args: Array[String]). Take these args and parse them using your favorite argument parser, set your SparkConf and SparkSession accordingly, and launch your process.
Spark has examples of that whole idea:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala

How to access values of arguments which is passed in a task with DataProcPySparkOperator operator?

I want to pass parameters to a specific task in my Airflow DAG and access them in my pyspark code. Below is the task definition:
run_cmd_arg_test_job = DataProcPySparkOperator(
    task_id='test',
    main='gs://dataprocessing_scripts/testArg.py',
    arguments=['2018-05-07'],
    job_name='test',
    dataproc_cluster='smoke-cluster-{{ ds_nodash }}',
    gcp_conn_id='google_cloud_default',
    region='global'
)
How can I access the value of the "arguments" property in the main file "gs://dataprocessing_scripts/testArg.py"?
You have to use sys.argv[1], sys.argv[2], and so on.
sys.argv[0] will be the filename itself, and sys.argv[1] will be '2018-05-07'.
Also, don't forget to import sys.
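A minimal sketch of what gs://dataprocessing_scripts/testArg.py could look like (the variable names are illustrative, not from the question):
# testArg.py -- reads the value passed via arguments=['2018-05-07']
import sys

if __name__ == "__main__":
    run_date = sys.argv[1]   # '2018-05-07'
    print("Running for date:", run_date)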