Run Python mrjob on a Hadoop Cluster in Kubernetes

I'm exploring the Python package mrjob to run MapReduce jobs in Python. I've tried running it in the local environment and it works perfectly.
I have Hadoop 3.3 running on a Kubernetes (GKE) cluster, and I also managed to run mrjob successfully from inside the name-node pod.
Now I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace), and I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.
The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So, based on the documentation, I created a config file called mrjob.conf as follows:
runners:
  hadoop:
    cmdenv:
      PATH: <pod name>:/opt/hadoop
However, mrjob is still unable to find the hadoop binary and gives the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'
So, is there a way to configure mrjob to run against my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but was unable to find one.

mrjob is a wrapper around Hadoop Streaming, and therefore requires the Hadoop binaries to be installed on the server(s) where the code runs (pods here, I guess), including the Jupyter pod that submits the application.
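If the Hadoop distribution is also installed inside the Jupyter pod's image, a minimal sketch of an mrjob.conf that points mrjob at it might look like the following (paths assume Hadoop is unpacked under /opt/hadoop in that pod; the jar version is a placeholder). Note that cmdenv only sets environment variables for the job's tasks, so it does not tell mrjob where to find the hadoop binary:
runners:
  hadoop:
    hadoop_bin: /opt/hadoop/bin/hadoop
    hadoop_streaming_jar: /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar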
IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications on k8s than hadoop-streaming ones, since you don't "need" Hadoop in k8s to run such distributed processes.
Beam would be recommended since it is compatible with GCP Dataflow.
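To illustrate that recommendation, here is a minimal sketch of a Beam word count in Python (the bucket paths are placeholders); the same pipeline runs locally on the DirectRunner or on Dataflow just by changing the pipeline options:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs on the DirectRunner by default; pass --runner=DataflowRunner
# (plus project/region/temp_location) to submit the same code to Dataflow.
with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")      # placeholder path
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}\t{count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/word-counts")    # placeholder path
    )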

Related

How to start a GitHub Actions self-hosted runner on my own machine with Ubuntu and Docker?

I successfully downloaded and started the GitHub runner on my own Ubuntu machine. GitLab has a nice runner installer, but GitHub only provides a package with files and a run.sh script.
When I start run.sh, it works and the GitHub runner starts listening for actions.
But I couldn't find anywhere in the documentation how to correctly integrate the package on Ubuntu so that the script starts automatically when Ubuntu boots.
I also didn't find whether any steps are needed to secure the runner from the internet.
...and I also don't know where I can set up the ability to run parallel actions, or where I can configure resource limits, etc.
Thanks a lot for any help.
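For what it's worth, the runner package described above ships an svc.sh helper that can register the runner as a systemd service so it starts automatically at boot; a sketch of the usual steps, run from the runner's directory after ./config.sh has been completed:
sudo ./svc.sh install    # registers a systemd unit for this runner
sudo ./svc.sh start      # starts the service immediately
sudo ./svc.sh status     # verifies that it is running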

How to import Databricks operators into an Airflow container?

I am trying to run a simple DAG with only a dummy operator and a Databricks operator (DatabricksRunNowOperator), just to test it. I uploaded the DAG into the Airflow container, but the Databricks operator is not part of the ordinary Airflow package. I installed it (locally) with pip install apache-airflow-providers-databricks. Accordingly, the package is not present in the container and an error occurs.
Does anyone know how I can provide the mentioned package to the Airflow container?
If you use Docker Compose as recommended by the official Airflow documentation on Docker setup, then you can specify additional dependencies with the _PIP_ADDITIONAL_REQUIREMENTS environment variable (it can also be put into a .env file in the same folder). For example, I have the following in my testing environment:
_PIP_ADDITIONAL_REQUIREMENTS="apache-airflow-providers-databricks==2.4.0rc1"
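Once the provider package is installed in the container, a minimal sketch of the kind of DAG described in the question could look like this (the DAG name, connection id, and job id are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="databricks_smoke_test",      # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")

    run_job = DatabricksRunNowOperator(
        task_id="run_databricks_job",
        databricks_conn_id="databricks_default",  # placeholder connection id
        job_id=12345,                             # placeholder Databricks job id
    )

    start >> run_job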

AWX 9.1.1: j2 render Playbook runs successfully, but won't write results out to directory; running from CLI works properly

I have a weird situation that I think is a bug in the latest release of AWX (v9.1.1). In fact, I have registered this issue as a possible bug with AWX development (Issue #5818). From the report:
SUMMARY
When executing a j2 template rendering Playbook role from AWX, the Playbook runs without error, but the rendered file is never written out to the directory. If you run the Playbook from the CLI, it runs without error and will write out the file correctly.
ENVIRONMENT
AWX version: 9.1.1
AWX install method: Docker Compose
Ansible version: 2.8.5 in AWX; 2.9.4 in the CLI environment; I have downgraded the CLI to 2.8.5 with no change in behavior.
Operating System: Ubuntu 18.04.4 LTS
Web Browser: Chrome
STEPS TO REPRODUCE
Simply execute the Playbook role from CLI - successful. Execute within AWX - successful but no rendered template file is written.
EXPECTED RESULTS
Expect the template to render and show Change=1 as the completed status. Running the job once more should result in Change=0 due to idempotency.
ACTUAL RESULTS
No matter how many times the Playbook is run in AWX, it still shows Change=1 (idempotency is indicating that the rendered file and the existing file don't match).
--
One other piece of info noted during debugging is that AWX 9.1.1 apparently uses Python 3 in its venv, whereas my old functioning instance uses Python 2.7. Still, I've tried running the Playbook with different versions of Ansible and in both Python 2 and Python 3 venvs. Again, no issue with CLI execution using "ansible-playbook foo.yml".
I tried stopping all containers, did a docker system prune -a, deleted the cloned awx repo, and re-cloned/re-installed AWX. I've even tried pointing to both an internal and an external assets database, but still no change.
Hopefully someone else has encountered this bizarre problem.
Thanks, Community!
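For context, a minimal sketch of the kind of j2 rendering task the report describes (the role layout, template, and destination path are hypothetical):
# roles/render_config/tasks/main.yml (hypothetical role)
- name: Render configuration file from a Jinja2 template
  template:
    src: app.conf.j2          # hypothetical template in the role's templates/ directory
    dest: /opt/app/app.conf   # hypothetical destination path
    mode: "0644"
One thing worth checking in this setup: when the play targets localhost, AWX runs it inside its own task container, so a rendered file written to a local path ends up in that container rather than on the Docker host.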

Automatically install services from a folder when starting

I'm currently running ArangoDB using Docker, and I want to be able to start with a clean slate just by restarting my containers.
I have mounted volumes in Docker where I want the code of my services to live.
How can I have ArangoDB install those services automatically? I want to be able to edit the code in the volume so I can develop my services without having to upload them again. It is also important that I can use version control (VCS) directly in the mounted volume from my client machine.
The ArangoDB container has script hooks that can be used in derived containers by placing files in specific directories:
FROM arangodb/testdrivearangodocker
MAINTAINER Frank Celler <info@arangodb.com>
COPY test.js /docker-entrypoint-initdb.d
COPY test.sh /docker-entrypoint-initdb.d
COPY dumps /docker-entrypoint-initdb.d/dumps
COPY verify.js /
As we demonstrate in this testcontainer.
the dumps directory will be restored using arangorestore
.js files will be executed using arangosh
.sh files will be executed
This script mechanism is implemented in this part of the docker entrypoint script.
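Building on that mechanism, a sketch of mounting a local folder of init scripts into the hook directory, assuming the official arangodb image (the path, tag, and password are placeholders):
docker run -d --name arangodb-dev \
  -e ARANGO_ROOT_PASSWORD=devpassword \
  -p 8529:8529 \
  -v /path/to/init-scripts:/docker-entrypoint-initdb.d \
  arangodb:3.4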
With ArangoDB 3.3 you can use the old foxx-manager to install services; from ArangoDB 3.4 on you may use foxx-cli for that purpose.
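For example, a hedged sketch of installing a service with foxx-cli (the mount point and source path are placeholders; server, database, and credential flags depend on your setup):
foxx install /my-service ./build/my-service.zip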

Killing a Spark application on YARN via Zeppelin

Is there a recommended way to kill a Spark application on YARN from inside Zeppelin (using Scala)? In the Spark shell I use
:q
and it cleanly exits the shell, kills the application on YARN, and releases the cores I was using.
I've found that I can use
sys.exit
which does kill the application on yarn successfully, but it also throws an error and requires that I restart the interpreter if I want to start a new session. If I'm actively running another notebook with a separate instance of the same interpreter then sys.exit isn't ideal because I can't restart the interpreter until I've finished the work in the second notebook.
You probably want to go to the YARN ResourceManager UI and kill the application there. It should be running on port 8088 of your primary master node. However, this will require a restart of the service as well.
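If you prefer a shell, the same thing can be done with the YARN CLI on any node that has the YARN client configured (the application id below is a placeholder):
yarn application -list                           # find the application id
yarn application -kill application_1234_0001     # kill that application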
Ideally you let YARN deal with this, though. Just because Zeppelin will start Spark with a specified number of executors and cores doesn't mean these are "reserved" in the way you think. These cores are still available for other containers. YARN manages these resources very well. Unless you have a limited cluster and/or are doing something that requires every last drop of resource management from YARN then you should be fine to leave the Spark application that Zeppelin is using alone.
You could try restarting the Zeppelin Spark interpreter (which can be done from within the interpreter settings page). This should kill the Zeppelin app, but will only restart the interpreter (and hence the Zeppelin app), when you try executing a paragraph again.