Run same Databricks notebook for different arguments concurrently? - pyspark

The following code (not mine) is able to run NotebookA and NotebookB concurrently. I need some help to figure out how to pass multiple arguments to the same notebooks.
I want to pass this list of arguments to each notebook:
args = {}
args["arg1"] = "some value"
args["arg2"] = "another value"
If I wanted to pass the arguments above to each of the running notebooks, what would I need to amend in the code below?
Here is the working code:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)
inputs = [("NotebookA", "NotebookB")]
run_in_parallel = lambda x: dbutils.notebook.run(x, 1800)

from concurrent.futures import ThreadPoolExecutor, wait
pool = ThreadPoolExecutor(3)
results = []
with ThreadPoolExecutor(3) as pool:
    for x in inputs:
        results.extend(pool.map(run_in_parallel, list(x)))

dbutils.notebook.run accepts a third argument as well: a map of parameters (see the documentation for more details). So in your case, you'll need to change the definition of run_in_parallel to something like this:
run_in_parallel = lambda x: dbutils.notebook.run(x, 1800, args)
and the rest of the code should be the same.
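Putting it together, a minimal sketch of the adjusted code with the same args dict passed to every notebook (notebook names and the 1800-second timeout are taken from the question; dbutils is assumed to be available, i.e. this runs inside a Databricks notebook):

from concurrent.futures import ThreadPoolExecutor

# arguments shared by every notebook run
args = {"arg1": "some value", "arg2": "another value"}

notebooks = ["NotebookA", "NotebookB"]

# each call returns the notebook's exit value (dbutils.notebook.exit)
run_in_parallel = lambda path: dbutils.notebook.run(path, 1800, args)

results = []
with ThreadPoolExecutor(3) as pool:
    results.extend(pool.map(run_in_parallel, notebooks))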
If you want to pass different arguments to different notebooks, then you'll need a list of tuples, and pass this list to map, like this:
data = [('notebook 1', {'arg1':'abc'}), ('notebook2', {'arg1': 'def', 'arg2': 'jkl'})]
...
run_in_parallel = lambda x: dbutils.notebook.run(x[0], 1800, x[1])
with ThreadPoolExecutor(3) as pool:
    results.extend(pool.map(run_in_parallel, data))

Related

How do I use an Airflow variable inside a Databricks notebook?

I have a Databricks PySpark notebook that gets called from an Airflow DAG.
I created a variable in Airflow by going to Admin - Variables and added a key-value pair.
I cannot find a way to use that Airflow variable in Databricks.
Edit to add a sample of my code.
notebook_task = {
    'notebook_path': '/Users/email#exaple.com/myDAG',
    'base_parameters': {
        "token": token
    }
}
and the operator is defined here:
opr_submit_run = DatabricksSubmitRunOperator(
    task_id='run_notebook',
    existing_cluster_id='xxxxx',
    run_name='test',
    databricks_conn_id='databricks_xxx',
    notebook_task=notebook_task
)
What ended up working was using base_parameters instead of notebook_params, which can be found here: https://docs.databricks.com/dev-tools/api/latest/jobs.html
and accessing it from Databricks by using
my_param = dbutils.widgets.get("token")
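Putting those pieces together, a minimal sketch of the DAG side (the import paths assume the apache-airflow-providers-databricks package; the variable key "token" and the cluster/connection IDs are taken from the snippets above):

from airflow.models import Variable
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Read the value stored under Admin > Variables (key assumed to be "token").
token = Variable.get("token")

notebook_task = {
    'notebook_path': '/Users/email#exaple.com/myDAG',
    'base_parameters': {"token": token},
}

opr_submit_run = DatabricksSubmitRunOperator(
    task_id='run_notebook',
    existing_cluster_id='xxxxx',
    run_name='test',
    databricks_conn_id='databricks_xxx',
    notebook_task=notebook_task,
)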
Extending the answer provided by Alex, since this question was asked in the context of Apache Airflow executing a Databricks notebook:
The DatabricksRunNowOperator (which is available from the Databricks provider) has notebook_params, a dict from keys to values for jobs with a notebook task, e.g. "notebook_params": {"name": "john doe", "age": "35"}. The map is passed to the notebook and is accessible through the dbutils.widgets.get function. As Alex explained, you can access the value from the Databricks notebook as:
my_param = dbutils.widgets.get("key")
An example usage would be:
notebook_run = DatabricksRunNowOperator(
    task_id='notebook_run',
    job_id=job_id,  # ID of an existing Databricks job with a notebook task
    notebook_params={"name": "john doe", "age": "35"},
)
The issue now is how to pass a value from an Airflow Variable rather than a static value. For that, we need notebook_params to be a templated field so that the Jinja engine will template the value. The problem is that notebook_params is not listed in template_fields.
To overcome this we can create a custom version of the operator as:
class MyDatabricksRunNowOperator(DatabricksRunNowOperator):
    template_fields = DatabricksRunNowOperator.template_fields + ('notebook_params',)
Then we can use the macro {{ var.value.my_var }}, which will be templated at run time:
notebook_run = MyDatabricksRunNowOperator(
    task_id='notebook_run',
    job_id=job_id,  # ID of an existing Databricks job with a notebook task
    notebook_params={"var_value": "{{ var.value.my_var }}"},
)
The operator will get the value of my_var Variable and pass it to your notebook.
If you set it as a parameter to the notebook call (parameters inside notebook_task), then you need to use the dbutils.widgets.get function; put something like this at the beginning of the notebook:
my_param = dbutils.widgets.get("key")

Best approach for building an LSH table using Apache Beam and Dataflow

I have an LSH table builder utility class which goes as follows (referred from here):
class BuildLSHTable:
    def __init__(self, hash_size=8, dim=2048, num_tables=10, lsh_file="lsh_table.pkl"):
        self.hash_size = hash_size
        self.dim = dim
        self.num_tables = num_tables
        self.lsh = LSH(self.hash_size, self.dim, self.num_tables)
        self.embedding_model = embedding_model
        self.lsh_file = lsh_file

    def train(self, training_files):
        for id, training_file in enumerate(training_files):
            image, label = training_file
            if len(image.shape) < 4:
                image = image[None, ...]
            features = self.embedding_model.predict(image)
            self.lsh.add(id, features, label)
        with open(self.lsh_file, "wb") as handle:
            pickle.dump(self.lsh, handle, protocol=pickle.HIGHEST_PROTOCOL)
I then execute the following in order to build my LSH table:
training_files = zip(images, labels)
lsh_builder = BuildLSHTable()
lsh_builder.train(training_files)
Now, when I am trying to do this via Apache Beam (code below), it's throwing:
TypeError: can't pickle tensorflow.python._pywrap_tf_session.TF_Operation objects
Code used for Beam:
def generate_lsh_table(args):
    options = beam.options.pipeline_options.PipelineOptions(**args)
    args = namedtuple("options", args.keys())(*args.values())

    with beam.Pipeline(args.runner, options=options) as pipeline:
        (
            pipeline
            | 'Build LSH Table' >> beam.Map(
                args.lsh_builder.train, args.training_files)
        )
This is how I am invoking the beam runner:
args = {
    "runner": "DirectRunner",
    "lsh_builder": lsh_builder,
    "training_files": training_files
}

generate_lsh_table(args)
Apache Beam pipelines have to be converted to a standard (for example, proto) format before being executed. As part of this, certain pipeline objects such as DoFns get serialized (pickled). If your DoFns have instance variables that cannot be serialized, this process cannot continue.
One way to solve this is to load/define such instance objects or modules during execution instead of creating and storing such objects during pipeline submission. This might require adjusting your pipeline.
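As a rough illustration of that idea (not the original code), here is a minimal sketch that defers building the model and the LSH index to DoFn.setup(), so that nothing unpicklable is captured when the pipeline is submitted; it assumes BuildLSHTable and its embedding_model are importable/definable on the workers, and it leaves out how the per-worker LSH entries would be merged and persisted:

import apache_beam as beam

class TrainLSHFn(beam.DoFn):
    """Creates the unpicklable objects (TF model, LSH index) on the worker."""

    def setup(self):
        # Runs on the worker after deserialization, so the embedding model
        # is never pickled as part of the pipeline graph.
        self.lsh_builder = BuildLSHTable()

    def process(self, element):
        # element is (id, (image, label)), mirroring the original train() loop
        idx, (image, label) = element
        if len(image.shape) < 4:
            image = image[None, ...]
        features = self.lsh_builder.embedding_model.predict(image)
        self.lsh_builder.lsh.add(idx, features, label)
        yield idx

with beam.Pipeline("DirectRunner") as pipeline:
    (
        pipeline
        | "Create inputs" >> beam.Create(list(enumerate(zip(images, labels))))
        | "Build LSH table" >> beam.ParDo(TrainLSHFn())
    )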

Function in pytest file works only with hard coded values

I have the below test_dss.py file which is used for pytest:
import dataikuapi
import pytest
def setup_list():
    client = dataikuapi.DSSClient("{DSS_URL}", "{APY_KEY}")
    client._session.verify = False
    project = client.get_project("{DSS_PROJECT}")
    # Check that there is at least one scenario TEST_XXXXX & that all test scenarios pass
    scenarios = project.list_scenarios()
    scenarios_filter = [obj for obj in scenarios if obj["name"].startswith("TEST")]
    return scenarios_filter
def test_check_scenario_exist():
    assert len(setup_list()) > 0, "You need at least one test scenario (name starts with 'TEST_')"

@pytest.mark.parametrize("scenario", setup_list())
def test_scenario_run(scenario, params):
    client = dataikuapi.DSSClient(params['host'], params['api'])
    client._session.verify = False
    project = client.get_project(params['project'])
    scenario_id = scenario["id"]
    print("Executing scenario ", scenario["name"])
    scenario_result = project.get_scenario(scenario_id).run_and_wait()
    assert scenario_result.get_details()["scenarioRun"]["result"]["outcome"] == "SUCCESS", \
        "test " + scenario["name"] + " failed"
My issue is with the setup_list function, which only works with hard-coded values for {DSS_URL}, {APY_KEY}, and {PROJECT}. I'm not able to use params or another method like in test_scenario_run.
Any idea how I can pass the params to this function as well?
The parameters in the mark.parametrize marker are read at load time, where the information about the config parameters is not yet available. Therefore you have to parametrize the test at runtime, where you have access to the configuration.
This can be done in pytest_generate_tests (which can live in your test module):
@pytest.hookimpl
def pytest_generate_tests(metafunc):
    if "scenario" in metafunc.fixturenames:
        host = metafunc.config.getoption('--host')
        api = metafunc.config.getoption('--api')
        project = metafunc.config.getoption('--project')
        metafunc.parametrize("scenario", setup_list(host, api, project))
This implies that your setup_list function takes these parameters:
def setup_list(host, api, project):
    client = dataikuapi.DSSClient(host, api)
    client._session.verify = False
    project = client.get_project(project)
    ...
And your test just looks like this (without the parametrize marker, as the parametrization is now done in pytest_generate_tests):
def test_scenario_run(scenario, params):
    scenario_id = scenario["id"]
    ...
The parametrization is now done at run-time, so it behaves the same as if you had placed a parametrize marker in the test.
And the other test, which tests setup_list, now also has to use the params fixture to get the needed arguments:
def test_check_scenario_exist(params):
    assert len(setup_list(params["host"], params["api"], params["project"])) > 0, \
        "You need at least ..."

dbutils.notebook.run not working for mapping arguments

Suppose I have 2 notebooks of which the first is the main and the second is for testing.
In the main, I have the following
dbutils.notebook.run("testing", timeoutSeconds = 60, arguments = Map("var" -> "1234"))
In testing:
%scala
println(s"Donut price = $var")
And when I run the notebook from Main, I get an error.
You can pass arguments to DataImportNotebook and run different notebooks (DataCleaningNotebook or ErrorHandlingNotebook) based on the result from DataImportNotebook.
val status = dbutils.notebook.run("DataImportNotebook", timeoutSeconds = 60,
    arguments = Map("x" -> "1234"))
println("Status: " + status)
In Scala, variables are declared as follows. The following is an example of a variable definition:
var price = 1234
println(s"Donut price: $price")
For more details, refer "Scala - How to declare variables" and "Databricks - Notebook workflows".
Hope this helps.
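Since the question is tagged pyspark, here is a minimal Python sketch of the same workflow, assuming the child notebook is named "testing"; the passed argument is read inside the child notebook with dbutils.widgets.get (as in the Airflow answer above), and "var" is avoided as a key name because it is a reserved keyword in Scala:

# Main notebook: the third argument to dbutils.notebook.run is a plain dict of parameters.
status = dbutils.notebook.run("testing", 60, {"price": "1234"})
print("Status: " + status)

# "testing" notebook: the passed parameter is read back as a widget value.
price = dbutils.widgets.get("price")
print("Donut price = " + price)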

Pytest - skip (xfail) mixed with parametrize

Is there a way to use the @incremental plugin, as described at Pytest: how to skip the rest of tests in the class if one has failed?, mixed with @pytest.mark.parametrize like below:
@pytest.mark.incremental
class TestClass:
    @pytest.mark.parametrize("input", data)
    def test_preprocess_check(self, input):
        ...  # prerequisite for the test

    @pytest.mark.parametrize("input", data)
    def test_process_check(self, input):
        ...  # test only if test_preprocess_check succeeded
The problem I encountered is that at the first failure of test_preprocess_check for a given input of my data set, all the following test_preprocess_check and test_process_check runs are labeled "xfail".
The behaviour I expect is that for each new "input" of my parametrized data set, the tests act in an incremental fashion.
ex: data = [0, 1, 2]
If only test_preprocess_check(0) fails,
I get the following report:
1 failed, 5 xfailed
but I expect the report:
1 failed, 1 xfailed, 4 passed
Thanks
After some experiments I found a way to generalize @incremental to work with the parametrize annotation. Simply rewrite the _previousfailed attribute to make it unique for each input; the _genid attribute was exactly what was needed.
I added a @pytest.mark.incrementalparam marker to achieve this.
The code becomes:
def pytest_runtest_setup(item):
    previousfailed_attr = getattr(item, "_genid", None)
    if previousfailed_attr is not None:
        previousfailed = getattr(item.parent, previousfailed_attr, None)
        if previousfailed is not None:
            pytest.xfail("previous test failed (%s)" % previousfailed.name)
    previousfailed = getattr(item.parent, "_previousfailed", None)
    if previousfailed is not None:
        pytest.xfail("previous test failed (%s)" % previousfailed.name)

def pytest_runtest_makereport(item, call):
    if "incrementalparam" in item.keywords:
        if call.excinfo is not None:
            previousfailed_attr = item._genid
            setattr(item.parent, previousfailed_attr, item)
    if "incremental" in item.keywords:
        if call.excinfo is not None:
            parent = item.parent
            parent._previousfailed = item
It's worth mentioning that this can't be used without parametrize, because the parametrize annotation automatically creates the _genid attribute.
Hope this helps others as well.
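For illustration, a minimal sketch of how the new marker might be applied, assuming the two hooks above live in conftest.py; with data = [0, 1, 2], a failure in test_preprocess_check[1] should then xfail only test_process_check[1] and leave the other parameters untouched:

import pytest

data = [0, 1, 2]

@pytest.mark.incrementalparam
class TestClass:
    @pytest.mark.parametrize("input", data)
    def test_preprocess_check(self, input):
        assert input != 1  # simulate a failing prerequisite for input == 1

    @pytest.mark.parametrize("input", data)
    def test_process_check(self, input):
        assert True  # xfails only for input == 1, passes for the rest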