Submit command line arguments to a pyspark job on airflow

I have a pyspark job available on GCP Dataproc to be triggered on airflow as shown below:
config = help.loadJSON("batch/config_file")
MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        "args": <TO_BE_ADDED>
    },
}
I need to supply command line arguments to this pyspark job as shown below [this is how I am running my pyspark job from the command line]:
spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
I am providing the arguments to my pyspark job using "configparser". Therefore, arg1 is the key and val1 is the value from my spark-submit command above.
How do I define the "args" param in the "MY_PYSPARK_JOB" defined above [equivalent to my command line arguments]?
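For context, here is a minimal sketch of how such --key value pairs arrive in a PySpark driver script; this is only an illustration, since the actual script reads them through a configparser-based helper that is not shown in the question:
# my_spark_file.py -- illustrative sketch only, not the actual driver
import sys

# spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
# places everything after the script path into sys.argv.
cli_args = sys.argv[1:]                            # ['--arg1', 'val1', '--arg2', 'val2']
pairs = dict(zip(cli_args[0::2], cli_args[1::2]))  # {'--arg1': 'val1', '--arg2': 'val2'}
print(pairs.get("--arg1"))                         # val1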

I finally managed to solve this conundrum.
If we are making use of ConfigParser, the key has to be specified as below [irrespective of whether the argument is passed on the command line or via Airflow]:
--arg1
In Airflow, the configs are passed as a Sequence[str] (as mentioned by @Betjens below) and each argument is defined as follows:
arg1=val1
Therefore, as per my requirement, command line arguments are defined as depicted below:
"args": ["--arg1=val1",
"--arg2=val2"]
PS: Thank you @Betjens for all your suggestions.
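Putting the pieces together, the job dict from the question would end up looking roughly like this (a sketch assembled only from the snippets above; the config helper is the one from the question):
config = help.loadJSON("batch/config_file")
MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        # one complete "--key=value" string per argument
        "args": ["--arg1=val1", "--arg2=val2"],
    },
}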

You have to pass a Sequence[str]. If you check DataprocSubmitJobOperator you will see that the job param expects a google.cloud.dataproc_v1.types.Job (or a dict of the same form).
class DataprocSubmitJobOperator(BaseOperator):
    ...
    :param job: Required. The job resource. If a dict is provided, it must be of the same form as the protobuf message
        :class:`~google.cloud.dataproc_v1.types.Job`
So, in the section about the PySpark job type, which is google.cloud.dataproc_v1.types.PySparkJob:
args Sequence[str]
Optional. The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.
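For illustration, wiring such a job dict into the operator could look like the sketch below; the task_id, region and DAG context are placeholders I am assuming, not values from the question:
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Inside a DAG definition: submit MY_PYSPARK_JOB; its "args" end up in
# sys.argv of the driver script on the Dataproc cluster.
submit_pyspark = DataprocSubmitJobOperator(
    task_id="submit_pyspark_job",       # placeholder task id
    job=MY_PYSPARK_JOB,                 # dict in the same form as dataproc_v1.types.Job
    region="us-central1",               # placeholder region
    project_id="my_project_id",
    gcp_conn_id="google_cloud_default",
)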

Related

Karate- Gatling: Not able to run scenarios based on tags

I am trying to run a performance test on the scenario tagged as perf from the feature file below:
@tag1 @tag2 @tag3

Background:
user login

@tag4 @perf
Scenario: scenario1

@tag4
Scenario: scenario2
Below is my .scala file setup:
class PerfTest extends Simulation {
  val protocol = karateProtocol()
  val getTags = scenario("Name goes here").exec(karateFeature("classpath:filepath"))

  setUp(
    getTags.inject(
      atOnceUsers(1)
    ).protocols(protocol)
  )
}
I have tried passing the tags from the command line as well as passing the tag as an argument to the exec method in the Scala setup.
Terminal command:
mvn clean test-compile gatling:test "-Dkarate.env={env}" "-Dkarate.options= --tags @perf"
.scala update: I have also tried passing the tag as an argument in the karateFeature call.
val getTags = scenario("Name goes here").exec(karateFeature("classpath:filepath", "@perf"))
Both scenarios are being executed with either approach. Any pointers on how I can force only the test tagged perf to run?
I wanted to share the finding here. I realized it works fine when I pass the tag info in the .scala file.
My scenario with the perf tag was a combination of a GET and a POST call, as I needed some data from the GET call to pass into the POST call. That's why I was seeing both calls when running the performance test.
I did not find any reference in the Karate-Gatling documentation to passing tags in the terminal execution command, so I am assuming that might not be a supported case.

Running local python code with arguments in Databricks via dbx utility

I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility, to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help?
I am following this guide, but it is a bit unclear and lacks good examples:
https://dbx.readthedocs.io/en/latest/quickstart.html
I also found this, but it is not clear either: How can I pass and than get the passed arguments in databricks job
The Databricks manuals are not very clear in this area.
My PySpark script:
import sys

n = len(sys.argv)
print("Total arguments passed:", n)
print("Script name", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
dbx deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}
dbx execute command:
dbx execute \
  --cluster-id=<reducted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package
Output:
(parameter-test) user#735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user#735 parameter-test %
Please help :)
It turns out the parameters section of my deployment.json was in the wrong place: it needs to be nested inside spark_python_task. Here is the corrected example:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}
I've also posted my original question in the Databricks forum: https://community.databricks.com/s/feed/0D58Y00008znXBxSAM?t=1659032862560
Hope it helps someone else.
I also believe the "jobs" parameter is deprecated and you should use "workflows" instead.
Source: https://dbx.readthedocs.io/en/latest/migration/
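Based on my reading of that migration guide, a newer-style deployment file would look roughly like the sketch below (untested; the environments/workflows nesting is my interpretation of the docs, the rest mirrors the corrected example above):
{
  "environments": {
    "default": {
      "workflows": [
        {
          "name": "parameter-test",
          "spark_python_task": {
            "python_file": "parameter-test.py",
            "parameters": ["test1", "test2"]
          }
        }
      ]
    }
  }
}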

Passing parameters using AZ CLI for ARM template deployment

I am trying to use az group deployment create to perform an ARM template deployment and I want to pass in parameters where the values are defined in variables. I can do a single parameter with no issues using the syntax below:
--parameters parameter1=$var1
But when I try to add additional parameters using the syntax below, it fails:
--parameters parameter1=$var1, parameter2=$var2
The syntax below fails as well since it will not use the values of the variables:
--parameters '{
"parameter1": { "value": "$var1" },
"parameter2": { "value": "$var2" }
}'
Does anyone know if what I am trying to do is possible and what the correct syntax would be?
I was fighting a combination of a corrupt shell and slightly incorrect syntax. The correct syntax for what I was trying to do is listed below:
--parameters parameter1=$var1 parameter2=$var2
Or, for a cleaner view when several parameters are involved:
--parameters parameter1=$var1 `
parameter2=$var2 `
parameter3=$var3

Pass opt arguments in an application executed as a .jar through spark-submit --class and use the existing context

I am writing a Scala project in which I want classes that are executable from spark-submit as a jar (e.g. spark-submit --class org.project).
My problems are the following:
I want to use the Spark context configuration that the user sets when doing a spark-submit and optionally overwrite some parameters like the application name. Example: spark-submit --num-executors 6 --class org.project should pass 6 to the number-of-executors configuration field of the Spark context.
I want to be able to pass optional parameters like --inputFile or --verbose to my project without interfering with the Spark parameters (ideally avoiding name overlap).
Example: spark-submit --num-executors 6 --class org.project --inputFile ./data/mystery.txt should pass "--inputFile ./data/mystery.txt" to the args input of the main method of class org.project.
My progress on those problems is the following:
In my main method I run
val conf = new SparkConf().setAppName("project")
val sc = new SparkContext(conf)
but I am not sure if this does things as expected.
Spark considers those optional arguments as arguments of spark-submit itself and outputs an error.
Note 1: My class currently does not inherit from any other class.
Note 2: I am new to the world of Spark and I couldn't find anything relevant with a basic search.
You will have to handle parameter parsing yourself. Here we use scopt.
When you spark-submit your job, it must enter through an object's def main(args: Array[String]). Take these args and parse them using your favorite argument parser, set your SparkConf and SparkSession accordingly, and launch your process.
Spark has examples of that whole idea:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala

How to access values of arguments which is passed in a task with DataProcPySparkOperator operator?

I want to pass parameters to a specific task in my Airflow DAG and access them in my pyspark code. Below is the task definition:
run_cmd_arg_test_job = DataProcPySparkOperator(
    task_id='test',
    main='gs://dataprocessing_scripts/testArg.py',
    arguments=['2018-05-07'],
    job_name='test',
    dataproc_cluster='smoke-cluster-{{ ds_nodash }}',
    gcp_conn_id='google_cloud_default',
    region='global'
)
How can I access the value of the "arguments" property in the main file "gs://dataprocessing_scripts/testArg.py"?
You have to use sys.argv[1], sys.argv[2], and so on.
sys.argv[0] will be the filename itself, and sys.argv[1] will be '2018-05-07'.
Also, don't forget to import sys.
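A minimal sketch of what gs://dataprocessing_scripts/testArg.py could look like (the variable names are illustrative, not from the question):
# testArg.py -- reads the value passed via arguments=['2018-05-07']
import sys

if __name__ == "__main__":
    run_date = sys.argv[1]   # '2018-05-07'
    print("Running for date:", run_date)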