Unit testing in Databricks notebooks - github

The following code is intended to run unit tests in Databricks notebooks, using pytest.
import pytest
import os
import sys
repo_name = "Databricks-Code-Repo"
# Get the path to this notebook, for example "/Workspace/Repos/{username}/{repo-name}".
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
# Get the repo's root directory name.
repo_root = os.path.dirname(os.path.dirname(notebook_path))
# Prepare to run pytest from the repo.
os.chdir(f"/Workspace/{repo_root}/{repo_name}")
print(os.getcwd())
# Skip writing pyc files on a readonly filesystem.
sys.dont_write_bytecode = True
# Run pytest.
retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])
# Fail the cell execution if there are any test failures.
assert retcode == 0, "The pytest invocation failed. See the log for details."
This code snippet is in the guide provided by Databricks.
However, it produces the following error:
PermissionError: [Errno 1] Operation not permitted: '/Workspace//Repos/<email_address>/Databricks-Code-Repo/Databricks-Code-Repo'
This notebook is inside Databricks Repos. I have two other notebooks:
functions (where I have defined three data transformation functions);
test_functions (where I have defined a test function for each of the data transformation functions from the previous notebook).
I get that the error has something to do with permissions, but I can't figure out what is causing it. I would appreciate any suggestions.
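A note on the path in the error message: the repo name appears twice and there is a double slash after /Workspace, which suggests the snippet is building a directory that is not the actual repo root. The sketch below reproduces that string handling outside Databricks. The notebook path is a hypothetical example (two levels below the repo folder) and the helper names are made up for illustration; whether this fully explains the PermissionError is an assumption, not something confirmed in the question.

```python
import posixpath


def build_chdir_target(notebook_path, repo_name):
    """Mirror the snippet: go up two levels, then append the repo name again."""
    repo_root = posixpath.dirname(posixpath.dirname(notebook_path))
    return f"/Workspace/{repo_root}/{repo_name}"


def workspace_repo_root(notebook_path):
    """Alternative: go up two levels and prefix /Workspace once, without re-adding the repo name."""
    repo_root = posixpath.dirname(posixpath.dirname(notebook_path))
    return posixpath.join("/Workspace", repo_root.lstrip("/"))


# Hypothetical notebook location: <repo>/tests/run_tests
notebook_path = "/Repos/user@example.com/Databricks-Code-Repo/tests/run_tests"

doubled = build_chdir_target(notebook_path, "Databricks-Code-Repo")
single = workspace_repo_root(notebook_path)
print(doubled)
print(single)
```

Printing the computed directory before calling os.chdir on it makes it easy to compare against the real location of the repo in the workspace browser.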

Related

Pytest: load and run test from notebook

In the official Databricks example a notebook is created which runs tests with unittest:
The notebook imports another notebook (containing the test_* methods) with %run
The test is launched in the notebook with unittest.main(...
I would like to do the same thing with pytest:
I imported the notebook containing the test_* methods using %run
I tried to launch the tests with retcode = pytest.main([]) but I always get "no tests ran"
Notebook containing the tests (notebook_my_test)
def test_trivial2():
    myvar=True
    assert myvar == True
Main notebook:
%run ./notebook_my_test
import pytest
retcode = pytest.main([])
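For context, pytest.main([]) collects tests from test files on disk (by default under the current working directory). Functions that %run pulled into the notebook's namespace are not files, so nothing is collected. One workaround, sketched below without pytest, is to call the test_* functions found in the namespace directly. run_namespace_tests is a hypothetical helper, not a pytest or Databricks API, and it gives up pytest features such as fixtures and detailed assertion messages.

```python
def run_namespace_tests(namespace):
    """Call every callable named test_* in the mapping and collect failures by name."""
    failures = {}
    for name in sorted(namespace):
        obj = namespace[name]
        if name.startswith("test_") and callable(obj):
            try:
                obj()
            except Exception as exc:  # AssertionError is included here
                failures[name] = exc
    return failures


def test_trivial2():
    myvar = True
    assert myvar == True


# In a notebook, globals() would contain whatever %run defined.
failures = run_namespace_tests({"test_trivial2": test_trivial2})
assert not failures, f"Failing tests: {sorted(failures)}"
```

If the pytest features are needed, the alternative is to keep the tests in a .py file that pytest can find on disk rather than in a notebook loaded with %run.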

How to deploy a Google dataflow worker with a file loaded into memory?

I am trying to deploy Google Dataflow streaming for use in my machine learning streaming pipeline, but cannot seem to deploy the worker with a file already loaded into memory. Currently, I have set up the job to pull a pickle file from a GCS bucket, load it into memory, and use it for model prediction. But this is executed on every cycle of the job, i.e. the file is pulled from GCS every time a new object enters the Dataflow pipeline, so the pipeline currently runs much slower than it needs to.
What I really need is a way to allocate a variable within the worker nodes when each worker is set up, and then use that variable within the pipeline without having to reload it on every execution of the pipeline.
Is there a way to do this step before the job is deployed, something like
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
But within my setup.py file?
##### based on - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
"""Setup.py module for the workflow's worker utilities.
All the workflow related code is gathered in a package that will be built as a
source distribution, staged in the staging area for the workflow being run and
then installed in the workers when they start running.
This behavior is triggered by specifying the --setup_file command line option
when running the workflow for remote execution.
"""
# pytype: skip-file
from __future__ import absolute_import
from __future__ import print_function
import subprocess
from distutils.command.build import build as _build # type: ignore
import setuptools
# This class handles the pip install mechanism.
class build(_build): # pylint: disable=invalid-name
    """A build command class that will be invoked during package install.
    The package built using the current setup.py will be staged and later
    installed in the worker using `pip install package'. This class will be
    instantiated during install for this specific scenario and will trigger
    running the custom commands specified.
    """
    sub_commands = _build.sub_commands + [('CustomCommands', None)]
# One list holding every command; assigning CUSTOM_COMMANDS three times would keep only the last one.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'scikit-learn==0.23.1'],
    ['pip', 'install', 'google-cloud-storage'],
    ['pip', 'install', 'mlxtend'],
]
class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""
    def initialize_options(self):
        pass
    def finalize_options(self):
        pass
    def RunCustomCommand(self, command_list):
        print('Running command: %s' % command_list)
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        print('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))
    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)
REQUIRED_PACKAGES = [
    'google-cloud-storage',
    'mlxtend',
    'scikit-learn==0.23.1',
]
setuptools.setup(
    name='ML pipeline',
    version='0.0.1',
    description='ML set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    })
Snippet of current ML load mechanism:
class MlModel(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd
    def process(self,element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket(myBucket)
            blob = bucket.get_blob(myBlob)
            self._model = self._pkl.loads(blob.download_as_string())
        new_df = self._pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = self._pd.DataFrame(data=predict, columns=["A", "B"])
        A = df.iloc[0]['A']
        B = df.iloc[0]['B']
        d = {'A':A, 'B':B}
        return [d]
You can use the setup() method of your MlModel DoFn to load your model once, and then use it in the process() method. Beam calls setup() once per DoFn instance when that instance is initialized on a worker, not once per element.
I had written a similar answer here
HTH
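To make the setup() suggestion concrete, here is a minimal sketch. The load_model argument is a hypothetical zero-argument callable (for example, one that downloads and unpickles the blob from GCS as in the question); it is injected so the class can be tried without GCS access. The import fallback to object is only there so the sketch also runs where apache_beam is not installed, and the class name MlModelDoFn is made up for illustration.

```python
try:
    import apache_beam as beam
    _DoFnBase = beam.DoFn
except ImportError:  # fallback so the sketch can be exercised without Beam
    _DoFnBase = object


class MlModelDoFn(_DoFnBase):
    """Loads the model in setup() so process() never has to fetch it."""

    def __init__(self, load_model):
        super().__init__()
        # Hypothetical loader; on Dataflow it must be picklable,
        # e.g. a module-level function.
        self._load_model = load_model
        self._model = None

    def setup(self):
        # Runs once per DoFn instance, before any element is processed.
        self._model = self._load_model()

    def process(self, element):
        # The model is already in memory here.
        yield self._model.predict(element)
```

In a pipeline this would be applied with beam.ParDo(MlModelDoFn(my_loader)), where my_loader is whatever function wraps the GCS download and pickle.loads call.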

ModuleNotFoundError: No module named 'pyspark.dbutils' while running multiple.py file/notebook on job clusters in databricks

I am working in a TravisCI, MLflow and Databricks environment where .travis.yml sits on the git master branch and detects any change to a .py file. Whenever the file is updated, it runs an mlflow command to run the .py file in the Databricks environment.
My MLproject file looks as follows:
name: mercury_cltv_lib
conda_env: conda-env.yml
entry_points:
  main:
    command: "python3 run-multiple-notebooks.py"
The workflow is as follows:
TravisCI detects a change in the master branch --> triggers a build, which runs the MLflow command and spins up a job cluster in Databricks to run the .py file from the repo.
It worked fine with one .py file, but when I tried to run multiple notebooks using dbutils, it threw:
File "run-multiple-notebooks.py", line 3, in <module>
from pyspark.dbutils import DBUtils
ModuleNotFoundError: No module named 'pyspark.dbutils'
Please find below the relevant code section from run-multiple-notebooks.py
def get_spark_session():
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()
def get_dbutils(spark=None):
    try:
        # Fall back to the active session when none is passed in.
        if spark is None:
            spark = get_spark_session()
        from pyspark.dbutils import DBUtils #error line
        dbutils = DBUtils(spark) #error line
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils
def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    spark = get_spark_session()
    dbutils = get_dbutils(spark)
I tried all the options, including
https://stackoverflow.com/questions/61546680/modulenotfounderror-no-module-named-pyspark-dbutils
as well. It is not working :(
Can someone please suggest a fix for the above-mentioned error when running a .py file on a job cluster? My code works fine inside a Databricks notebook, but running it from outside using TravisCI and MLflow does not work, which is a must-have requirement for pipeline automation.

How can I debug my python unit tests within Tox with PUDB?

I'm trying to debug a python codebase that uses tox for unit tests. One of the failing tests is proving difficult to figure out, and I'd like to use pudb to step through the code.
At first thought, one would think to just pip install pudb, then add import pudb and pudb.set_trace() to the unit test code. But that results in a ModuleNotFoundError:
> import pudb
>E ModuleNotFoundError: No module named 'pudb'
>tests/mytest.py:130: ModuleNotFoundError
> ERROR: InvocationError for command '/Users/me/myproject/.tox/py3/bin/pytest tests' (exited with code 1)
Noticing the .tox project folder leads one to realize there's a site-packages folder within tox, which makes sense since the point of tox is to manage testing under different virtualenv scenarios. This also means there's a tox.ini configuration file, with a deps section that may look like this:
[tox]
envlist = lint, py3
[testenv]
deps =
    pytest
commands = pytest tests
Adding pudb to the deps list should solve the ModuleNotFoundError, but it leads to another error:
self = <_pytest.capture.DontReadFromInput object at 0x103bd2b00>
def fileno(self):
> raise UnsupportedOperation("redirected stdin is pseudofile, "
"has no fileno()")
E io.UnsupportedOperation: redirected stdin is pseudofile, has no fileno()
.tox/py3/lib/python3.6/site-packages/_pytest/capture.py:583: UnsupportedOperation
So, I'm stuck at this point. Is it not possible to use pudb instead of pdb within Tox?
There's a package called pytest-pudb which overrides the pudb entry points within an automated test environment like tox to successfully jump into the debugger.
To use it, just make your tox.ini file have both the pudb and pytest-pudb entries in its testenv dependencies, similar to this:
[tox]
envlist = lint, py3
[testenv]
deps =
    pytest
    pudb
    pytest-pudb
commands = pytest tests
Using the original PDB (not PUDB) could work too. At least, it works with the Django and Nose test runners. Without changing tox.ini, simply add a pdb breakpoint wherever you need one, with:
import pdb; pdb.set_trace()
Then, when execution gets to that breakpoint, you can use the regular PDB commands:
w - print stacktrace
s - step into
n - step over
c - continue
p - print the value of an expression
a - print arguments of current function

Pytest: collecting 0 items even after following the conventions

I created a test module by following all the conventions, but when I run the test, I get the following message:
collecting 0 items
Here's my directory hierarchy:
integration_tests (Directory)-> tests (Directory)-> test_integration_use_cases.py (python file)
And this is the content of the file:
import pytest
from some_tests.integration_tests.backbone.SomeIntegrationTestBase import SomeIntegrationTestBase
class TestSomeIntegration(SomeIntegrationTestBase):
    #pytest.mark.p1
    def test_some_integration_use_cases(self):
        print("**** Executing integration tests ****")
        result = self.execute_test(4)
        assert (True == result)
When I run the following command:
pytest test_integration_use_cases.py
I see the following result without any errors:
collecting 0 items
FYI: I am running this on a development machine (Like vagrant)
I had the same problem as you, even after following all the recommended conventions. My application structure was as follows:
Application
-- API
   app.py
-- docs
-- venv
-- tests
   -- unit_test
      test_factory
...
...
However, I resolved the issue by moving the tests directory under the API package, so that my application structure looked as below:
Application
-- API
   app.py
   -- tests
      -- unit_test
         test_factory
         ...
-- docs
-- venv
...
Although pytest is supposed to auto-discover the tests, it seems to do that only if they are placed in the application root. Check out the pytest for flask
I also found this resource helpful.
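As a quick sanity check when nothing is collected: by default pytest only collects from files named test_*.py or *_test.py (the python_files setting), so it is worth confirming the file names first. The helper below is a hypothetical stdlib-only sketch of that check. It does not call pytest and it ignores any custom python_files value in a project's configuration.

```python
import fnmatch

# pytest's default file-name patterns for test modules.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def matches_default_pattern(filename):
    """Return True if the bare file name matches one of pytest's default patterns."""
    return any(fnmatch.fnmatch(filename, pattern) for pattern in DEFAULT_PATTERNS)


for name in ["test_integration_use_cases.py", "test_factory", "app.py"]:
    print(name, matches_default_pattern(name))
```

A file name that passes this check can still yield zero collected items for other reasons, so this only rules out the naming convention.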