I am trying to run a simple DAG with only a DummyOperator and a Databricks operator (DatabricksRunNowOperator), just to test it. I uploaded the DAG into the Airflow container, but the Databricks operator is not part of the core Airflow package. I installed it (locally) with pip install apache-airflow-providers-databricks. Accordingly, the package is not present in the container and an error occurs.
Does anyone know how to provide the mentioned package to the Airflow container?
If you use Docker Compose, as recommended by the official Airflow documentation on Docker setup, then you can specify additional dependencies with the _PIP_ADDITIONAL_REQUIREMENTS environment variable (it can also be put into an .env file in the same folder). For example, I have the following in my testing environment:
_PIP_ADDITIONAL_REQUIREMENTS="apache-airflow-providers-databricks==2.4.0rc1"
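A minimal sketch of that workflow, assuming the unmodified official docker-compose.yaml (which forwards _PIP_ADDITIONAL_REQUIREMENTS into the Airflow containers so it gets pip-installed at startup):
# run in the folder that contains the official docker-compose.yaml
echo '_PIP_ADDITIONAL_REQUIREMENTS=apache-airflow-providers-databricks==2.4.0rc1' >> .env
docker compose up -d    # containers are recreated and the provider is installed when they start
Note that the Airflow documentation describes _PIP_ADDITIONAL_REQUIREMENTS as a convenience for testing; for a longer-lived setup, building a custom image with the provider baked in is the recommended approach.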
I need to install some libraries in my pod before it starts working as expected.
My use case: I need some libraries that will support SMB (samba), and the image that I have to use does not have it installed.
Unfortunately, exec'ing into the actual pod and running commands does not seem to be a very good idea.
Is there a way by which I can use an init-container to install libsmbclient-dev in my ubuntu pod?
Edit: Some restrictions in my case.
I use a Helm chart to install my app (nextcloud). So I guess I cannot use a custom image (as far as I know, we cannot use our own images in an existing Helm chart). This would have been the best solution.
I cannot run commands in the Kubernetes values.yaml since I do not use kubectl to install my app. Also, I need to restart apache2 after I install the library, and unfortunately, restarting apache2 restarts the pod, effectively making the whole installation meaningless.
Since the nextcloud Helm chart allows the use of init containers, I wondered if those could be used, but as far as I understand init containers, this is not possible (?).
You should build your own container image, e.g. with docker, and push it to a container registry that is suitable for your cluster, e.g. Docker Hub, AWS ECR, Google Artifact Registry ...
First install Docker (https://docs.docker.com/get-docker/).
Create an empty directory and change into it.
Then create a file named Dockerfile with the following content:
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y libsmbclient-dev \
&& rm -rf /var/lib/apt/lists/*
Execute
docker build -t myimage:latest .
This will download the Ubuntu base image and build a new container image in which the commands from the RUN instruction are executed. The image name will be myimage and the tag will be latest.
Then push your image with docker push to your registry.
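To have the chart use that custom image, most charts, including the nextcloud chart, expose image values that can be overridden at install time. A sketch, where myaccount is a placeholder Docker Hub account, the repo alias nextcloud is assumed to be already added via helm repo add, and the value names should be verified against the chart's values.yaml:
docker tag myimage:latest myaccount/nextcloud-smb:latest
docker push myaccount/nextcloud-smb:latest
helm upgrade --install nextcloud nextcloud/nextcloud --set image.repository=myaccount/nextcloud-smb --set image.tag=latest
For the nextcloud case you would likely base the Dockerfile on the official nextcloud image instead of plain ubuntu, so the application itself is still present.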
See also Docker best practices:
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
I'm exploring this python package mrjob to run MapReduce jobs in python. I've tried running it in the local environment and it works perfectly.
I have Hadoop 3.3 running on a Kubernetes (GKE) cluster. So I also managed to run mrjob successfully from inside the name-node pod.
Now, I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace). I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.
The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So, based on the documentation, I created a config file called mrjob.conf as follows:
runners:
  hadoop:
    cmdenv:
      PATH: <pod name>:/opt/hadoop
However, mrjob is still unable to detect the hadoop binary and gives the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'
So is there a way in which I can configure mrjob to run with my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but was unable to find one.
mrjob is a wrapper around hadoop-streaming, and therefore requires the Hadoop binaries to be installed on the server(s) where the code will run (pods here, I guess), including the Jupyter pod that submits the application.
IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications in k8s than hadoop-streaming since you don't "need" Hadoop in k8s to run such distributed processes.
Beam would be recommended since it is compatible with GCP Dataflow.
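If you do install a matching Hadoop client in the Jupyter image, a sketch of what the notebook pod would need so that mrjob's hadoop runner can find the binary (the paths and the job file name are placeholders for illustration):
export HADOOP_HOME=/opt/hadoop           # wherever the Hadoop client is unpacked in the Jupyter pod
export PATH="$PATH:$HADOOP_HOME/bin"     # mrjob looks for the hadoop executable in $HADOOP_HOME/bin and on $PATH
python my_mr_job.py -r hadoop hdfs:///path/to/input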
I created a Dataproc cluster with Anaconda as an optional component and created a virtual env in it. Now, when running a PySpark .py file on the master node, I'm getting this error:
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I need the RDKit package inside the virtual env, and with it a Python 3.x version gets installed. I run the following commands on my master node, and then the Python version changes.
conda create -n my-venv -c rdkit rdkit=2019.*
conda activate my-venv
conda install -c conda-forge rdkit
How can I solve this?
There are a few things here:
The 1.3 (default) image uses conda with Python 2.7. I recommend switching to 1.4 (--image-version 1.4), which uses conda with Python 3.6.
If this library will be needed on the workers, you can use this initialization action to apply the change consistently to all nodes (see the sketch below).
PySpark does not currently support virtualenvs, but this support is coming. Currently you can run a PySpark program from within a virtualenv, but this does not mean the workers will run inside the virtualenv. Is it possible to apply your changes to the base conda environment without a virtualenv?
Additional info can be found here: https://cloud.google.com/dataproc/docs/tutorials/python-configuration
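Putting these together, a hedged sketch of the cluster creation command; the region, the init-action bucket path, and the CONDA_PACKAGES metadata key follow the linked tutorial and should be verified there:
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --image-version=1.4 \
  --metadata='CONDA_PACKAGES=rdkit' \
  --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/conda-install.sh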
When trying to create a new cluster with gcloud dataproc clusters create, the following error is displayed:
ERROR: gcloud failed to load (gcloud.dataproc.clusters.create): Problem loading gcloud.dataproc.clusters.create: No module named jsonschema.
This usually indicates corruption in your gcloud installation or problems with your Python interpreter.
Please verify that the following is the path to a working Python 2.7 executable:
/usr/bin/python2
If it is not, please set the CLOUDSDK_PYTHON environment variable to point to a working Python 2.7 executable.
If you are still experiencing problems, please run the following command to reinstall:
$ gcloud components reinstall
If that command fails, please reinstall the Cloud SDK using the instructions here:
https://cloud.google.com/sdk/
Installing jsonschema does not seem to help.
This was an issue with gcloud SDK release 208.0.0. Upgrading to 208.0.1 should resolve it.
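To check which release you are on and update in place (for installs not managed by a system package manager):
gcloud version              # shows the installed Cloud SDK release
gcloud components update    # upgrades the SDK to the latest release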
I'm currently running arangodb using docker and I want to be able to start with a clean slate just by restarting my containers.
I have mounted volumes in docker where I want the code of my services to be mounted.
How can I automatically have ArangoDB install those services? I want to be able to edit the code in the volume so that I can develop my services without having to upload them again. Also, it is important that I can use version control (VCS) directly in the mounted volume from my client machine.
The ArangoDB container has script hooks that can be used in derived containers by placing files in specific directories:
FROM arangodb/testdrivearangodocker
MAINTAINER Frank Celler <info@arangodb.com>
COPY test.js /docker-entrypoint-initdb.d
COPY test.sh /docker-entrypoint-initdb.d
COPY dumps /docker-entrypoint-initdb.d/dumps
COPY verify.js /
As we demonstrate in this testcontainer.
The dumps directory will be restored using arangorestore.
.js files will be executed using arangosh.
.sh files will be executed.
This script mechanism is implemented in this part of the docker entrypoint script.
With ArangoDB 3.3 you can use the old foxx-manager to install services; from ArangoDB 3.4 on, you may use foxx-cli for that purpose.
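For the mounted-volume development workflow, a sketch using foxx-cli; the mount point /myservice and the local service path are placeholders, and the connection flags for your server are left out (see foxx help install):
npm install -g foxx-cli                          # foxx-cli is distributed as an npm package
foxx install /myservice ./services/myservice     # installs the service code from the mounted folder under /myservice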