How to run jupyter notebook in airflow - jupyter

My code is written in jupyter and saved as .ipynb format.
We want to use airflow to schedule the execution and define the dependencies.
How can the notebooks be executed in airflow?
I know I can convert them to python files first but the graphs generated on the fly will be difficult to handle.
Is there an easier solution? Thanks

You can also use a combination of Airflow and papermill.
Papermill
Papermill is a tool for running jupyter notebooks with parameters: https://github.com/nteract/papermill
Running a Jupyter notebook is easy; you can do it from a Python script:
import papermill as pm
pm.execute_notebook(
    'path/to/input.ipynb',
    'path/to/output.ipynb',
    parameters=dict(alpha=0.6, ratio=0.1)
)
or from CLI:
$ papermill local/input.ipynb s3://bkt/output.ipynb -p alpha 0.6 -p l1_ratio 0.1
Either way, papermill runs the notebook from the input path, creates a copy at the output path, and updates this copy after each cell runs.
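The mapping between the Python `parameters` dict and the CLI's repeated `-p NAME VALUE` pairs can be sketched as below. This is only an illustration: `papermill_command` is a hypothetical helper, not part of papermill, and it merely builds the argument list without running anything.

```python
import shlex


def papermill_command(input_nb, output_nb, parameters=None):
    """Build the argument list for a `papermill` CLI call (hypothetical helper)."""
    args = ["papermill", input_nb, output_nb]
    for name, value in (parameters or {}).items():
        # Each parameter becomes one `-p NAME VALUE` triple.
        args += ["-p", name, str(value)]
    return args


cmd = papermill_command(
    "local/input.ipynb",
    "s3://bkt/output.ipynb",
    {"alpha": 0.6, "l1_ratio": 0.1},
)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where papermill is installed.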
Airflow Integration
To integrate it with Airflow, there is a dedicated papermill operator for running parametrized notebooks: https://airflow.readthedocs.io/en/latest/howto/operator/papermill.html
You can set up the same input/output/parameters arguments directly in the DAG definition and use templating for the Airflow variables:
run_this = PapermillOperator(
    task_id="run_example_notebook",
    dag=dag,
    input_nb="/tmp/hello_world.ipynb",
    output_nb="/tmp/out-{{ execution_date }}.ipynb",
    parameters={"msgs": "Ran from Airflow at {{ execution_date }}!"}
)

We encountered this problem before and spent a couple of days solving it.
We packaged the solution as a Dockerfile and published it on GitHub: https://github.com/michaelchanwahyan/datalab.
It works by modifying the open source package nbparameterize and passing in arguments such as execution_date. Graphs generated on the fly can also be updated and saved inside the notebook.
When it is executed:
the notebook is read and the parameters are injected
the notebook is executed and the output overwrites the original file
It also installs and configures common tools such as Spark, Keras, TensorFlow, etc.
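The "read the notebook and inject the parameters" step can be illustrated with a minimal sketch that uses only the standard library. It is not how the datalab image or nbparameterize does it; `inject_parameters` and the `injected-parameters` tag are assumptions made for this example, and the notebook is a hand-built nbformat 4 dictionary.

```python
import json


def inject_parameters(notebook, parameters):
    """Return a copy of an nbformat-4 notebook dict with a parameters cell first."""
    # One assignment per parameter, e.g. execution_date = '2019-01-01'
    source = "".join(f"{name} = {value!r}\n" for name, value in parameters.items())
    cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},  # hypothetical tag
        "outputs": [],
        "source": source,
    }
    patched = json.loads(json.dumps(notebook))  # cheap deep copy
    patched["cells"].insert(0, cell)
    return patched


notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print(execution_date)",
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

patched = inject_parameters(notebook, {"execution_date": "2019-01-01"})
print(patched["cells"][0]["source"])
```

The patched dictionary could then be written back with `json.dump` and executed by whichever runner is in use.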

Another alternative is to use Ploomber (disclaimer: I'm the author). It uses papermill under the hood to build multi-stage pipelines. Tasks can be notebooks, scripts, functions, or any combination of them. You can run pipelines locally, on Airflow, or on Kubernetes (using Argo Workflows).
This is what a pipeline declaration looks like:
tasks:
  - source: notebook.ipynb
    product:
      nb: output.html
      data: output.csv
  - source: another.ipynb
    product:
      nb: another.html
      data: another.csv
Repository
Exporting to Airflow
Exporting to Kubernetes
Sample pipelines

Related

Use local script as source for Argo workflow

I have a python script that I'd like to execute on cloud using an Argo workflow.
Currently, I alternate between two approaches. The first is copying the source code into the workflow itself (copy and paste), which is inconvenient and causes issues.
The second option is uploading my project directory to an S3 bucket, then downloading the source code to the Argo pod, then running the commands.
Both methods require some action to sync the source code after I modify the script.
Is there a way to tell the Argo workflow where it should take the source code from?
Say, instead of creating a script template that takes the source from a string specified in the .yml file, could it take the source from a local file by specifying a local path?
Prefer not to use Git for that
Also, if possible would prefer solutions with support for attaching additional dependencies source code files
If you have something more sophisticated than a simple script that you use in the .yml file, it might be worthwhile to use a docker image with a container template that you pre-build for your workflow.
The image will be named my-script and the entrypoint my-entrypoint
Assuming the script is a Python file called script.py, you can have the following files:
workflow.yml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-name-
spec:
  entrypoint: my-entrypoint
  templates:
    - name: my-entrypoint
      container:
        image: my-script
        command: [python3]
        args:
          - script.py
script.py
import requests
response = requests.get('https://www.google.com')
print(response.status_code)
requirements.txt
requests
Dockerfile
FROM python:3.11.1-slim
COPY . .
RUN pip3 install -r requirements.txt
CMD python3 script.py
If you can build your image directly into the cluster (as with minikube), run:
docker build -t my-script .
This approach also makes your code testable, should you decide to have tests. Git is not necessary for this, although I'd encourage you to use it for collaboration and versioning. The COPY command in the Dockerfile copies all files in your directory to the image, so any additional source files are readily available. I would discourage you from copying actual data this way; use Argo parameters and artifacts instead.
Check out https://argoproj.github.io/argo-workflows/workflow-concepts/ for more info
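For a single script without extra dependency files, a different route than the Docker image above would be to generate the manifest from the local file before submitting it. The following is a rough sketch under that assumption: `workflow_from_script` is a hypothetical helper, the default image is only borrowed from the Dockerfile above, and the script body is placed in the `source` field of an Argo script template.

```python
import json
import tempfile
from pathlib import Path


def workflow_from_script(script_path, image="python:3.11.1-slim"):
    """Build a Workflow manifest whose script template embeds a local file."""
    source = Path(script_path).read_text()
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "workflow-name-"},
        "spec": {
            "entrypoint": "my-entrypoint",
            "templates": [
                {
                    "name": "my-entrypoint",
                    # The local file's content becomes the inline script source.
                    "script": {
                        "image": image,
                        "command": ["python3"],
                        "source": source,
                    },
                }
            ],
        },
    }


# Demo with a throwaway script file; in practice this would be your own script.py.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "script.py"
    script.write_text("print('hello from a local file')\n")
    manifest = workflow_from_script(script)

print(json.dumps(manifest, indent=2))
```

Since JSON is a subset of YAML, the printed manifest can be saved to a file and should be usable in place of a hand-written workflow.yml. This sketch does not cover additional dependency files, for which the image-based approach remains the better fit.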

Is there a way to parameterize magic commands in Databricks notebooks?

I want to be able to run through a list of config files and use %run to import variables from each config file into a Databricks notebook.
But I can't find a way to dynamically change the file that follows %run.
I have tried specifying a parameter like this:
config = './config.py'
%run $config
But it doesn't work. I cannot use dbutils.notebook.run(config) as I won't get access to the variables in my main notebook.
Can anyone think of a way to do this?
Since you have already mentioned config files, I will assume that the config files are already available at some path and that they are not Databricks notebooks.
You can use Python's configparser in one notebook to read the config files and pull that notebook into the main notebook with %run (or you can skip the extra notebook and use configparser directly in the main notebook).
Reference: How to read a config file using python
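A minimal sketch of the configparser approach, assuming the config files are INI-style (the `[job]` section and its keys are made up for this example, and temporary files stand in for the real config paths):

```python
import configparser
import tempfile
from pathlib import Path


def load_config(path):
    """Read one INI file into a plain dict of sections."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}


# Create two throwaway config files to loop over.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, table in [("a.ini", "sales"), ("b.ini", "returns")]:
        path = Path(tmp) / name
        path.write_text(f"[job]\ntable = {table}\nlimit = 10\n")
        paths.append(path)
    configs = [load_config(path) for path in paths]

for config in configs:
    # configparser returns strings, so convert numeric values explicitly.
    print(config["job"]["table"], int(config["job"]["limit"]))
```

Because the values end up in ordinary Python dictionaries, the main notebook can loop over any list of paths without needing to parameterize %run.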

How to include a PowerShell script file in a GitLab CI YAML file

Currently I have a large Bash script in my GitLab CI YAML file. The example below is how I am grouping my functions so that I can use them during my CI process.
.test-functions: &test-functions |
  function write_one() { echo "Function 1"; }
  function write_two() { echo "Function 2"; }
  function write_three() { echo "Function 3"; }
.plugin-nuget:
  tags:
    - test-es
    - kubernetes
  image: mygitlab-url.com:4567/containers/dotnet-sdk-2.2:latest
  script:
    - *test-functions
    - write_one
    - write_two
    - write_three
The example below shows how we can include a YAML file inside another one:
include:
  - project: MyGroup/GitlabCiPlugins/Dotnet
    file: plugin-test.yml
    ref: JayCiTest
I would like to do the same thing with my script. Instead of having the script in the same file as my YAML, I would like to include the file, so that my YAML has access to my script's functions. I would also like to use PowerShell instead of Bash if possible.
How can I do this?
Split shell scripts and GitLab CI syntax
GitLab CI has no feature for including a file's content in a script block.
GitLab CI's include feature doesn't import YAML anchors.
GitLab CI can't concatenate arrays. So you can't write before_script in one .gitlab-ci.yml file and then extend it in the before_script of another. You can only overwrite it, not concatenate.
Because of all these problems, you can't easily manage your scripts: splitting them, organizing them, and decomposing them the way you would with ordinary code.
There are possible workarounds. You can store your scripts somewhere else, where a GitLab runner can reach them, and then inject them into your current job environment via source /ci-scripts/my-script.sh in the before_script block.
Possible locations for storing ci scripts:
Special docker image with all your build/test/deploy utils and ci scripts
The same, but dedicated build server
You can deploy a simple web page containing your scripts, then download and import them in before_script. Just in case, make sure nobody except the GitLab runner can access it.
Using PowerShell
You can use PowerShell only if your GitLab Runner is installed on Windows. In that case, you can't use any other shell.

How to run a pytest-bdd test?

I don't understand how to properly run a simple test (a feature file and a Python file)
with the pytest-bdd library.
From the official documentation, I can't work out which command to issue to run a test.
I tried the pytest command, but it reported that no tests ran.
Do I need to use another library, behave, to run a feature file?
After trying for 2 days, I figured out that
running a pytest-bdd test has certain requirements, at least in my view:
put both the feature file and the Python file in the same directory (maybe this can be changed with configuration files)
the Python file name needs to start with test_
the Python file needs to contain a function whose name starts with test_
the function starting with test_ needs to be decorated with @scenario
to run the test, issue the pytest command in the same directory (maybe this is also configurable)
After running it, you will see only the function whose name starts with test_ reported as passed, but all the steps actually ran. To check, you can put assert False in any @when or @then decorated function; the test will then fail.
The system contained : pytest-bdd==3.0.2 (copied from pip freeze output)
Feature files and Python files can be placed in different folders using the bdd_features_base_dir option provided by pytest-bdd; I think it is better to keep feature files in a separate folder too.
Here you can see a working example (a simple hello world BDD test):
https://github.com/davidemoro/pytest-play-docker/tree/master/tests
https://github.com/davidemoro/pytest-play-docker/blob/master/tests/pytest.ini (see bdd_features_base_dir in [pytest] section)
https://github.com/davidemoro/pytest-play-docker/tree/master/tests/bdd
If you want to try out pytest-bdd without installing it, you can use Docker. Create a folder containing your pytest BDD files (and, if you want, a separate features folder targeted by bdd_features_base_dir) and run:
docker run --rm -it -v $(pwd):/src davidemoro/pytest-play:latest
I've found out that in the Python file you don't have to satisfy this requirement:
the function starting with test_ needs to be decorated with @scenario
You can just add scenarios("") to run the tests that use the steps defined in this specific Python file.
Remember to import scenarios: from pytest_bdd import scenarios
Example:
Command..
pytest -v path_to_test_file.py
Things to note:
The feature file must use the .feature extension (filename.feature)
Always add __init__.py files to the test packages, otherwise the test runner will not find the test files
Glue the right step definitions to the test function
Add the feature file to the features module
If you are using Python 3, run the tests with python3
So:
python3 -m pytest -v path_to_test_file.py
Documentation
https://pytest-bdd.readthedocs.io/en/stable/#

Running an IPython/Jupyter notebook non-interactively

Does anyone know if it is possible to run an IPython/Jupyter notebook non-interactively from the command line and have the resulting .ipynb file saved with the results of the run? If it isn't already possible, how hard would it be to implement with PhantomJS, something to turn the kernel on and off, and something to turn the web server on and off?
To be more specific, let's assume I already have a notebook original.ipynb and I want to rerun all cells in that notebook and save the results in a new notebook new.ipynb, but do this with one single command on the command line without requiring interaction either in the browser or to close the kernel or web server, and assuming no kernel or web server is already running.
example command:
$ ipython notebook run original.ipynb --output=new.ipynb
Yes, it is possible and easy. It will (mostly) be in IPython core for 2.0; I would suggest looking at those examples for now.
[edit]
$ jupyter nbconvert --to notebook --execute original.ipynb --output=new.ipynb
It is now in Jupyter NbConvert. NbConvert comes with a set of Preprocessors that are disabled by default; two of them (ClearOutputPreprocessor and ExecutePreprocessor) are of interest. You can either enable them in your (local|global) config file(s) via c.<PreprocessorName>.enabled=True (capitalized True, since the config is Python), or on the command line with --ExecutePreprocessor.enabled=True, keeping the rest of the command as usual.
The --ExecutePreprocessor.enabled=True flag has a convenient --execute alias that can be used in recent versions of NbConvert. It can be combined with --inplace if desired.
For example, convert to html after running the notebook headless :
$ jupyter nbconvert --to=html --execute RunMe.ipynb
Or convert to PDF after stripping outputs:
$ ipython nbconvert --to=pdf --ClearOutputPreprocessor.enabled=True RunMe.ipynb
This (of course) does work with non-python kernels by spawning a <insert-your-language-here> kernel, if you set --profile=<your fav profile>. The conversion can be really long as it needs to rerun the notebook. You can do notebook to notebook conversion with the --to=notebook option.
There are various other options (timeout, allow errors, ...) that might need to be set/unset depending on use case. See documentation and of course jupyter nbconvert --help, --help-all, or nbconvert online documentation for more information.
Until this functionality becomes part of the core, I put together a little command-line app that does just what you want. It's called runipy and you can install it with pip install runipy. The source and readme are on github.
Run and replace original .ipynb file:
jupyter nbconvert --ExecutePreprocessor.timeout=-1 --to notebook --inplace --execute original.ipynb
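The nbconvert flags mentioned in this thread can be assembled programmatically, for example when a scheduler needs to run many notebooks. The helper below is a hypothetical sketch that only builds the argument list from flags already shown above (`--to notebook`, `--execute`, `--output`, `--inplace`, `--ExecutePreprocessor.timeout`); it does not call nbconvert itself.

```python
def nbconvert_execute_command(notebook, output=None, timeout=-1):
    """Build a `jupyter nbconvert` call that executes a notebook (hypothetical helper)."""
    args = [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout}",
    ]
    # Without an output name, overwrite the input notebook in place.
    args.append(f"--output={output}" if output else "--inplace")
    args.append(notebook)
    return args


print(" ".join(nbconvert_execute_command("original.ipynb", output="new.ipynb")))
print(" ".join(nbconvert_execute_command("original.ipynb")))
```

Either list could be handed to `subprocess.run(args, check=True)` in an environment where Jupyter and nbconvert are installed.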
For features such as parallel workers, input parameters, e-mail sending, or S3 input/output, you can install jupyter-runner:
pip install jupyter-runner
Readme on github: https://github.com/omar-masmoudi/jupyter-runner
Another way is to use papermill, which has a command line interface.
Usage example (you need to specify an output path where the execution results will be stored):
papermill your_notebook.ipynb logs/yourlog.out.ipynb
You can also pass parameters if you wish, using one -p flag per parameter:
papermill your_notebook.ipynb logs/yourlog.out.ipynb -p env "prod" -p tests "e2e"
Another papermill-related answer: https://stackoverflow.com/a/55458141/2957102
You can just run the iPython-Notebook-server via command line:
ipython notebook --pylab inline
This will start the server in non-interactive mode and all output is printed below the code. You can then save the .ipynb-File which includes Code & Output.