I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help?
I am following this guide, but it is a bit unclear and lacks good examples:
https://dbx.readthedocs.io/en/latest/quickstart.html
I also found this, but it is not clear either: How can I pass and than get the passed arguments in databricks job
The Databricks documentation is not very clear in this area.
My PySpark script:
import sys
n = len(sys.argv)
print("Total arguments passed:", n)
print("Script name", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
dbx deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}
dbx execute command:
dbx execute \
  --cluster-id=<redacted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package
Output:
(parameter-test) user#735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user#735 parameter-test %
Please help :)
It turns out the format of the parameters section in my deployment.json was not correct: the parameters must be nested inside spark_python_task. Here is the corrected example:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}
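With the parameters nested inside spark_python_task, they show up in the entrypoint's sys.argv just like regular command-line arguments. As a minimal sketch of my own (not part of the original script), the same values could also be collected with argparse:

# sketch: collect whatever spark_python_task.parameters contains ("test1", "test2" above)
import argparse

parser = argparse.ArgumentParser(description="dbx parameter test")
parser.add_argument("values", nargs="*", help="parameters passed by the job")
args = parser.parse_args()
print("Arguments passed:", args.values)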
I've also posted my original question on the Databricks forum: https://community.databricks.com/s/feed/0D58Y00008znXBxSAM?t=1659032862560
Hope it helps someone else.
I also believe the "jobs" parameter is deprecated and you should use "workflows" instead.
Source: https://dbx.readthedocs.io/en/latest/migration/
Related
The purpose is to debug only one unit test in an .exs file, so the other unit tests in the same file need to be ignored.
My previous solution was to comment out the other unit tests, but the downside of this solution is that I can no longer find the other unit tests easily through VS Code's outline view.
From the mix documentation, I found that the mix test command has --include and --only options.
I adjusted the launch.json file, setting the task args to --trace --only :external, and updated the .exs file, but when running mix test it gives the following error message:
(Debugger) Task failed because an exception was raised:
** (Mix.Error) Could not invoke task "test": 1 error found!
--trace --only :external : Unknown option
(mix 1.13.4) lib/mix.ex:515: Mix.raise/2
(elixir_ls_debugger 0.10.0) lib/debugger/server.ex:1119: ElixirLS.Debugger.Server.launch_task/2
Then I changed launch.json to use "--trace --only :external" as a single string, and got a similar error message:
(Debugger) Task failed because an exception was raised:
** (Mix.Error) Could not invoke task "test": 1 error found!
--trace --only :external : Unknown option
(mix 1.13.4) lib/mix.ex:515: Mix.raise/2
(elixir_ls_debugger 0.10.0) lib/debugger/server.ex:1119: ElixirLS.Debugger.Server.launch_task/2
I use a plugin called Elixir Test. It has a few nice features including what you are asking for.
To run a single test place your cursor within the code of the test, then select "Elixir Test: Run test at cursor" from the command palette.
Another helpful command is: "Elixir Test: Jump". If you are editing a module file, this command will jump to the test file corresponding to the module. It will optionally create the skeleton for the test file if you haven't created it yet.
It is caused by a syntax problem: every parameter must be its own element in the list, as follows (passing "--trace --only :external" as one string makes mix treat it as a single unknown option):
"taskArgs": [
"--trace", "--warnings-as-errors", "--only", "external"
],
I have a PySpark job on GCP Dataproc that is triggered from Airflow, as shown below:
config = help.loadJSON("batch/config_file")
MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        "args": <TO_BE_ADDED>
    },
}
I need to supply command line arguments to this PySpark job as shown below [this is how I run my PySpark job from the command line]:
spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
I am providing the arguments to my PySpark job using "configparser". Therefore, arg1 is the key and val1 is the value from my spark-submit command above.
How do I define the "args" param in the "MY_PYSPARK_JOB" defined above [equivalent to my command line arguments]?
I finally managed to solve this conundrum.
If we are making use of ConfigParser, the key has to be specified as below [irrespective of whether the argument is passed on the command line or via Airflow]:
--arg1
In Airflow, the args are passed as a Sequence[str] (as mentioned by #Betjens below) and each argument is defined as follows:
arg1=val1
Therefore, as per my requirement, command line arguments are defined as depicted below:
"args": ["--arg1=val1",
"--arg2=val2"]
PS: Thank you #Betjens for all your suggestions.
You have to pass a Sequence[str]. If you check DataprocSubmitJobOperator you will see that the job param implements the class google.cloud.dataproc_v1.types.Job.
class DataprocSubmitJobOperator(BaseOperator):
...
:param job: Required. The job resource. If a dict is provided, it must be of the same form as the protobuf message.
:class:`~google.cloud.dataproc_v1.types.Job`
So, in the section about the pySpark job type, which is google.cloud.dataproc_v1.types.PySparkJob:
args Sequence[str]
Optional. The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.
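Putting it together, a minimal sketch of the operator call inside a DAG (task_id and region are placeholders I've added; the job dict reuses the values from the question) could look like:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "args": ["--arg1=val1", "--arg2=val2"],  # Sequence[str]
    },
}

submit_pyspark = DataprocSubmitJobOperator(
    task_id="submit_pyspark",   # placeholder task id
    job=MY_PYSPARK_JOB,
    region="us-central1",       # placeholder region
    project_id="my_project_id",
)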
I want to convert our integration pytests over to bazel, but the test doesn't seem to actually run or produce the junitxml output I'm expecting:
ARGS = [
    "--verbose",
    "--json-report",
    "--junit-xml=junit.xml",
]

py_test(
    name = "test_something",
    srcs = ["test_something.py"],
    args = ARGS,
    tags = [
        "yourit",
    ],
    deps = [
        requirement("pytest"),
        requirement("pytest-json-report"),
        ":lib",
    ],
)
There is a similar question here: How do I use pytest with bazel?. But it didn't cover the multitude of issues I ran into.
So there were multiple things to address in order to make bazel's py_test work with pytest.
Of course you need the pytest dependency, requirement("pytest"), added.
The pytest files need a hook added (note that sys.argv[1:] is passed along as well, so py_test args can be forwarded to the pytest file):
if __name__ == "__main__":
    sys.exit(pytest.main([__file__] + sys.argv[1:]))
Since our pytests are trying to access the host network, we need to tell bazel to allow network traffic outside the sandbox by adding a requires-network tag.
The host network is local, so we need to make sure the test runs locally by adding the local tag.
Bazel py_test will create junitxml under bazel-testlogs, but it only shows whether the py_test rule passed or failed. To get more granular junitxml results, we need to tell pytest to overwrite bazel's py_test results using the XML_OUTPUT_FILE environment variable. Note the $$ is needed to prevent variable expansion. See https://docs.bazel.build/versions/main/test-encyclopedia.html and https://github.com/bazelbuild/bazel/blob/master/tools/test/test-setup.sh#L67 for more info.
Attempts to write the json report output to bazel-testlogs using the TEST_UNDECLARED_OUTPUTS_DIR environment variable didn't work. The report is still written, but inside bazel-out. There might be a way to scoop up those files, but solving that problem was not a priority.
Note that ARGS is used to keep things DRY (don't repeat yourself); we use the same arguments across multiple py_test rules.
So the BUILD now looks like this:
ARGS = [
    "--verbose",
    "--json-report",
    "--junit-xml=$$XML_OUTPUT_FILE",
]

py_test(
    name = "test_something",
    srcs = ["test_something.py"],
    args = ARGS,
    tags = [
        "local",
        "requires-network",
        "yourit",
    ],
    deps = [
        requirement("pytest"),
        requirement("pytest-json-report"),
        ":lib",
    ],
)
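For completeness, a minimal test_something.py that works with this rule could look like the sketch below (the test body is a placeholder; the __main__ hook is the essential part):

import sys

import pytest


def test_placeholder():
    # stand-in for the real integration tests
    assert True


if __name__ == "__main__":
    # forward bazel's py_test args (ARGS above) to pytest
    sys.exit(pytest.main([__file__] + sys.argv[1:]))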
I'm using API Job Add to create one Job with one Task in Azure Batch.
This is my test code:
{
  "id": "20211029-1540",
  "priority": 0,
  "poolInfo": {
    "poolId": "pool-test"
  },
  "jobManagerTask": {
    "id": "task2",
    "commandLine": "cmd /c dir",
    "resourceFiles": [
      {
        "storageContainerUrl": "https://linkToMyStorage/MyProject/StartTask.txt"
      }
    ]
  }
}
To execute the API I'm using Postman and to monitor the result I'm using BatchExplorer.
The job and its task are created correctly, but the automatically generated 'wd' folder is empty.
If I understood correctly, I should see the file from the storage URL there, right?
Maybe some other parameter is needed in the JSON body?
Thank you!
A task state of completed does not necessarily indicate success. From your JSON body, you most likely have an error:
"resourceFiles": [
{
"storageContainerUrl": "https://linkToMyStorage/MyProject/StartTask.txt"
}
You've specified a storageContainerUrl, but the URL points to a single file. A storageContainerUrl must reference a blob container; to pull down a single file, use httpUrl (together with filePath) instead. Also ensure you have provided proper permissions (either via SAS or a user managed identity).
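For reference, here is a hedged sketch of a corrected jobManagerTask, written as a Python dict purely for illustration (the SAS token and filePath value are placeholders I've added):

job_manager_task = {
    "id": "task2",
    "commandLine": "cmd /c dir",
    "resourceFiles": [
        {
            # httpUrl points at a single blob; append a SAS token (or attach a
            # managed identity) so the compute node can read it
            "httpUrl": "https://linkToMyStorage/MyProject/StartTask.txt?<sas-token>",
            # filePath is where the file lands relative to the task's working directory
            "filePath": "StartTask.txt",
        }
    ],
}

# storageContainerUrl, by contrast, must reference a whole container, e.g.
# "https://<account>.blob.core.windows.net/MyProject?<container-sas>"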
I read some posts regarding the error I am seeing when importing pyspark; some suggest installing py4j, which I have already done, yet I am still seeing the error.
I am using a conda environment; here are the steps:
1. create a yml file and include the needed packages (including py4j)
2. create an env based on the yml
3. create a kernel pointing to the env
4. start the kernel in Jupyter
5. running `import pyspark` throws the error: ImportError: No module named py4j.protocol
The issue was resolved by adding an env section to kernel.json and explicitly specifying the following variables:
"env": {
"HADOOP_CONF_DIR": "/etc/spark2/conf/yarn-conf",
"PYSPARK_PYTHON":"/opt/cloudera/parcels/Anaconda/bin/python",
"SPARK_HOME": "/opt/cloudera/parcels/SPARK2",
"PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.7-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
"PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
}
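As a quick sanity check (my own sketch; the paths come from the kernel.json above, so adjust them to your installation), the kernel should now resolve py4j from PYTHONPATH:

# run inside the Jupyter kernel started from this kernel.json
import sys
print([p for p in sys.path if "py4j" in p])  # should list the py4j-0.10.7 zip from PYTHONPATH

import pyspark  # previously failed with: ImportError: No module named py4j.protocol
print(pyspark.__version__)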