set PYSPARK_SUBMIT_ARGS="--name" "PySparkShell" "pyspark-shell" && jupyter notebook - pyspark

I'm looking to install PySpark on my Windows 10 machine and have been unable to correctly specify the PYSPARK_SUBMIT_ARGS argument.
This is the error I'm seeing when I run the pyspark command from Git Bash:
$ pyspark
set PYSPARK_SUBMIT_ARGS="--name" "PySparkShell" "pyspark-shell" && jupyter notebook
I've uninstalled all versions of Java, except version 8. Within my .bashrc file, my path is currently specified as:
export JAVA_HOME="C:\PROGRA~2\Java\jre1.8.0_261"
export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_DRIVER_PYTHON="jupyter"
export SPARK_HOME="C:/spark/spark-2.4.7-bin-hadoop2.7"
export PATH=$SPARK_HOME/bin:$PATH
And JAVA_HOME is specified within my env variables and set in Path as well.
I would really appreciate any additional troubleshooting techniques!
Thank you so much!!!

Try running it from the Windows Command Prompt instead; Git Bash has issues with this setup.
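For reference, the same settings can be applied in a Command Prompt session before launching pyspark. This is only a sketch using the paths from the question (adjust them to your installation); cmd.exe uses set rather than export, and %VAR% rather than $VAR:
set JAVA_HOME=C:\PROGRA~2\Java\jre1.8.0_261
set SPARK_HOME=C:\spark\spark-2.4.7-bin-hadoop2.7
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
set PATH=%SPARK_HOME%\bin;%PATH%
pyspark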


Why does the WSL2 import command in PowerShell output "Access is denied"?

On Windows 10 Pro and 11 Pro I have installed and activated Ubuntu-20.04 and Debian. Using the documentation from MS on switching those distros to a secondary drive, everything seemed to work fine until the wsl --import command, which outputs "Access is denied". I've tried Windows Terminal, PowerShell, and even WebStorm; I get the same output.
I am running with elevated privileges, but to no avail. The export works fine; I give the exported file a different name so that I can restore the distro under its original name when importing. The wsl.conf editing looks good and everything lines up... until the import command.
I am at a loss. I've exhausted all research. Can anyone help me resolve this so I can run these distros from my F: drive?
Cheers,
RN
You just have to put a filename at the end, like:
wsl --export Ubuntu C:\Users\Desktop\OneDrive\Documents\ubuntu.tar
Suppose you want to import an exported distribution "ubuntu.tar".
Try to cd to the location of the .tar file before executing the wsl --import command in PowerShell (running as a standard user), for example:
PS X:\> cd D:\
PS D:\> wsl --import Ubuntu_copy .\Ubuntu_copy ubuntu.tar
Executing the wsl --import command with an absolute path didn't work for me, but the above-mentioned method did.
Just in case this is an ongoing issue for anyone: you need to run wsl --import not just from an Administrator account, but from a PowerShell/cmd session run as Administrator, for example by right-clicking a pwsh.exe icon/shortcut and clicking "Run as administrator". If you're running as a standard user and choose "Run as administrator", the import will install the distro for the admin user you've chosen to run as.
The full syntax is:
wsl --import <Distro name> <Install folder> <Source .tar file>
The import syntax is as follows; be careful about the order of the install directory and imported tar file arguments:
--import <Distro> <InstallLocation> <FileName> [Options]
Imports the specified tar file as a new distribution.
The filename can be - for standard input.
Options:
--version <Version>
Specifies the version to use for the new distribution.
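Putting it together, a minimal example (the paths are placeholders, and --version 2 is only needed if you want the imported distro to run under WSL 2):
wsl --import Ubuntu_copy F:\WSL\Ubuntu_copy D:\backups\ubuntu.tar --version 2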

Running ./pyspark fails to find local directories

After installing Spark I am trying to run PySpark from the installation folder:
opt/spark/bin/pyspark
But I get the following errors:
opt/spark/bin/pyspark: line 24: /opt/spark/bin/load-spark-env.sh: No such file or directory
opt/spark/bin/pyspark: line 68: /opt/spark/bin/spark-submit: No such file or directory
opt/spark/bin/pyspark: line 68: exec: /opt/spark/bin/spark-submit: cannot execute: No such file or directory
Why is this happening when I can see these items in their respective directories? I'm also trying to get PySpark to run standalone as a command, but I'd imagine that I must solve the former problem first.
I am running this on macOS.
This error indicates that SPARK_HOME is not set. Try this:
export SPARK_HOME=/opt/spark
pyspark
FYI, it is strongly recommended to install software on macOS using a package manager such as https://brew.sh.
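For example, a Homebrew-based install would look roughly like the following (a sketch assuming Homebrew is already installed; the libexec location reflects how the apache-spark formula is currently laid out and may differ on your system):
brew install apache-spark
export SPARK_HOME=$(brew --prefix apache-spark)/libexec
export PATH=$SPARK_HOME/bin:$PATH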
This is the configuration:
export SPARK_HOME=<YOUR-PATH>/spark-2.4.4-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
And if you are planning to use a notebook as well:
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
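After reloading the shell configuration, a quick sanity check could look like this (assuming the paths above actually exist on your machine):
source ~/.bashrc
echo $SPARK_HOME
python3 -c "import pyspark; print(pyspark.__version__)"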

Change Conda environment via powershell script (for Gitlab-CI)

I am running some automated Python tests with GitLab CI on a Windows 10 machine. The GitLab Runner on the machine used to work with executor = "shell" using the simple Windows shell. This recently stopped working (the docs say support for this shell is deprecated), and the only way to get it working again has been to use PowerShell instead, adding shell = "powershell" to our config.toml file. For the tests to run, we need to activate a conda environment. Unfortunately, this does not seem to work via the PowerShell script that GitLab CI creates for the job.
When I open PowerShell manually, logged in as the user that executes the GitLab Runner jobs, changing conda environments works. I have run conda init powershell and can change the environment with conda activate myenv. Yet, when I include the following in my gitlab-ci.yml file:
script:
- conda activate myenv
- conda list
the output from conda list confirms that the environment myenv is not activated and instead the base environment is used.
Also trying the absolute path like this
script:
- conda activate C:\Users\myuser\Miniconda3\envs\myenv
- conda list
does not work.
So it seems like I can manually activate the correct conda environment in the powershell, but activating the environment via the powershell script created by GitLab-CI does not work. Is there a fix for this problem? Any help is greatly appreciated.
Looks like GitLab executes each line of the script in a separate subshell, so combine the commands into a single line.
If that doesn't work, most conda commands will accept the name of the environment via the -n parameter:
conda list -n myenv
conda install -n myenv PackageName
...
As long as you're just using conda, it shouldn't be necessary to activate the environment.
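For instance, a sketch of the job script using the -n approach (myenv is the environment from the question, myunittests.py is a placeholder test script, and conda run requires a reasonably recent conda):
script:
- conda list -n myenv
- conda run -n myenv python myunittests.py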
As it seems to be a problem within PowerShell but not with cmd, one could use the following in the .gitlab-ci.yml:
- cmd '/C' 'conda activate myenv && python myunittests.py'

spark-notebook: command not found

I want to set up spark-notebook on my laptop, following the instructions listed at http://spark-notebook.io. I ran the command bin/spark-notebook and I'm getting:
-bash: bin/spark-notebook: command not found
How to resolve this? I want to run spark-notebook for spark standalone and scala.
You can download
spark-notebook-0.7.0-pre2-scala-2.10.5-spark-1.6.3-hadoop-2.7.2-with-parquet.tgz
Set the path in .bashrc.
Example:
$ sudo gedit ~/.bashrc
export SPARK_HOME=/your/path/
export PATH=$PATH:$SPARK_HOME/bin
Then start your notebook with the following command:
$ spark-notebook

Running an IPython/Jupyter notebook non-interactively

Does anyone know if it is possible to run an IPython/Jupyter notebook non-interactively from the command line and have the resulting .ipynb file saved with the results of the run? If it isn't already possible, how hard would it be to implement with PhantomJS, something to turn the kernel on and off, and something to turn the web server on and off?
To be more specific, let's assume I already have a notebook original.ipynb and I want to rerun all cells in that notebook and save the results in a new notebook new.ipynb, but do this with a single command on the command line, without requiring any interaction in the browser or having to shut down the kernel or web server manually, and assuming no kernel or web server is already running.
example command:
$ ipython notebook run original.ipynb --output=new.ipynb
Yes, it is possible, and it's easy. It will (mostly) be in IPython core for 2.0; I would suggest looking at those examples for now.
[edit]
$ jupyter nbconvert --to notebook --execute original.ipynb --output=new.ipynb
It is now in Jupyter NbConvert. NbConvert comes with a bunch of preprocessors that are disabled by default; two of them (ClearOutputPreprocessor and ExecutePreprocessor) are of interest here. You can either enable them in your (local|global) config file(s) via c.<PreprocessorName>.enabled=True (uppercase, as that's Python), or on the command line with --ExecutePreprocessor.enabled=True, keeping the rest of the command as usual.
The --ExecutePreprocessor.enabled=True flag has a convenient --execute alias that can be used with recent versions of NbConvert. It can be combined with --inplace if desired.
For example, to convert to HTML after running the notebook headlessly:
$ jupyter nbconvert --to=html --execute RunMe.ipynb
And to convert to PDF after stripping the outputs:
$ ipython nbconvert --to=pdf --ClearOutputPreprocessor.enabled=True RunMe.ipynb
This (of course) also works with non-Python kernels, by spawning a <insert-your-language-here> kernel, if you set --profile=<your fav profile>. The conversion can take quite a while, as it needs to rerun the notebook. You can do notebook-to-notebook conversion with the --to=notebook option.
There are various other options (timeout, allow errors, ...) that might need to be set or unset depending on the use case. See jupyter nbconvert --help, --help-all, or the nbconvert online documentation for more information.
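For the exact scenario in the original question (execute original.ipynb and write the executed copy to new.ipynb), one possible invocation, with a generous timeout and errors tolerated, would be:
jupyter nbconvert --to notebook --execute --allow-errors --ExecutePreprocessor.timeout=600 original.ipynb --output new.ipynb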
Until this functionality becomes part of the core, I put together a little command-line app that does just what you want. It's called runipy and you can install it with pip install runipy. The source and readme are on GitHub.
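Typical usage looks like the following (using the input and output notebook names from the question):
runipy original.ipynb new.ipynb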
Run and replace original .ipynb file:
jupyter nbconvert --ExecutePreprocessor.timeout=-1 --to notebook --inplace --execute original.ipynb
To cover some features such as parallel workers, input parameters, e-mail sending or S3 input/output... you can install jupyter-runner
pip install jupyter-runner
Readme on github: https://github.com/omar-masmoudi/jupyter-runner
One more way is to use papermill, which has a command-line interface.
Usage example (you need to specify the output path where the execution results will be stored):
papermill your_notebook.ipynb logs/yourlog.out.ipynb
You can also pass parameters if you wish, with a -p flag for each one:
papermill your_notebook.ipynb logs/yourlog.out.ipynb -p env "prod" -p tests "e2e"
One more papermill-related reply: https://stackoverflow.com/a/55458141/2957102
You can just run the IPython notebook server via the command line:
ipython notebook --pylab inline
This will start the server in non-interactive mode, and all output is printed below the code. You can then save the .ipynb file, which includes the code and output.