Download file from Databricks (Scala) - scala

I've used the following piece of code to divide romania-latest.osm.pbf into romania-latest.osm.pbf.node.parquet and romania-latest.osm.pbf.way.parquet in Databricks. Now I want to download these files to my local computer so that I can use them in IntelliJ, but I can't seem to find where they're located or how to get them. I'm using the Community Edition of Databricks. This is done in Scala.
import sys.process._

// download the osm-parquetizer jar
"wget https://github.com/adrianulbona/osm-parquetizer/releases/download/v1.0.0/osm-parquetizer-1.0.0.jar -P /tmp/osm".!!

// download the .osm.pbf extract
"wget http://download.geofabrik.de/europe/monaco-latest.osm.pbf -P /tmp/osm".!!

// split it into .node.parquet and .way.parquet files
"java -jar /tmp/osm/osm-parquetizer-1.0.0.jar /tmp/osm/monaco-latest.osm.pbf".!!
I've searched on Google for a solution but nothing seems to work.
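One thing worth checking: the parquet files produced above live on the driver's local filesystem under /tmp/osm, not in DBFS, so they never show up in the workspace Data UI. A minimal sketch of copying them into FileStore so they can be fetched from a browser (shown as a Python cell; the same dbutils calls work from Scala, and the FileStore path and the /files/ URL pattern are assumptions about a default Community Edition workspace):

# minimal sketch, assuming the parquetizer wrote its output next to the input file
dbutils.fs.cp("file:/tmp/osm/monaco-latest.osm.pbf.node.parquet",
              "dbfs:/FileStore/osm/monaco-latest.osm.pbf.node.parquet", recurse=True)
dbutils.fs.cp("file:/tmp/osm/monaco-latest.osm.pbf.way.parquet",
              "dbfs:/FileStore/osm/monaco-latest.osm.pbf.way.parquet", recurse=True)

# anything under /FileStore can then be downloaded in a browser via
# https://community.cloud.databricks.com/files/osm/<file-name>?o=<workspace-id>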

Related

Visual Studio Code using pytest for PySpark getting stuck at SparkSession creation

I am trying to run a PySpark unit test in Visual Studio Code on my local Windows machine. When I debug the test, it gets stuck at the line where I create a SparkSession. It doesn't show any error or failure, but the status bar just shows "Running Tests". Once it works, I can refactor my test to create the SparkSession as part of a test fixture, but presently my test is getting stuck at SparkSession creation.
Do I have to install or configure anything on my local machine for the SparkSession to work?
I tried a simple test with assert 'a' == 'b' and I can debug it and the test runs successfully, so I assume my pytest configuration is correct. The issue I am facing is with creating the SparkSession.
# test code
from pyspark.sql import SparkSession, Row, DataFrame
import pytest

def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test
    spark_session.createDataFrame(data, schema)  # data and schema not shown here
Thanks
What I did to make it work was:
Create a .env file in the root of the project
Add the following content to the created file:
SPARK_LOCAL_IP=127.0.0.1
JAVA_HOME=<java_path>/jdk/zulu#1.8.192/Contents/Home
SPARK_HOME=<spark_path>/spark-3.0.1-bin-hadoop2.7
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Go to the .vscode folder in the root, expand it and open settings.json. Add the following line (replace <workspace_path> with your actual workspace path):
"python.envFile": "<workspace_path>/.env"
After refreshing the Testing section in Visual Studio Code, the setup should succeed.
Note: I use pyenv to set up my Python version, so I had to make sure that VS Code was using the correct Python version with all the expected dependencies installed.
Solution inspired by py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM and https://github.com/microsoft/vscode-python/issues/6594
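Once the environment variables are picked up, the refactoring mentioned in the question (creating the SparkSession in a pytest fixture) could look roughly like the sketch below; the fixture scope, the app name and the toy DataFrame are only illustrative assumptions:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # one SparkSession shared by the whole test session
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("vscode-pytest")
             .getOrCreate())
    yield spark
    spark.stop()

def test_poc(spark_session):
    df = spark_session.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2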

PySpark on Linux with PyCharm - first exception error

I am trying to run my first PySpark script on a Linux VM I configured. The error message I have is KeyError: SPARK_HOME when I run the following:
from os import environ
from pyspark import SparkContext
I momentarily made this error go away by running export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7. I then ran into a new error, error=2, No such file or directory. Searching took me to this page: https://community.cloudera.com/t5/Community-Articles/Tutorial-Install-Configure-iPython-and-create-run-PySpark/ta-p/246400. I then ran export PYSPARK_PYTHON=~/python3*. This brought me back to experiencing the KeyError: SPARK_HOME error.
Honestly, I'm stumbling through this, because it's my first time configuring Spark and using PySpark. I still don't quite understand the ins and outs of PyCharm either.
I expect to be able to run the following basic sample script on this page: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f with no issues.
There is a package called findspark here, or you may use the code below to set the path if it is not found in the environment:
import os

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'full_path_to_spark_root'
[code continues]
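If you go the findspark route instead, a minimal sketch (assuming pip install findspark and that the path below matches where you unpacked Spark) is:

import findspark

# point findspark at the unpacked Spark distribution before importing pyspark
findspark.init("/home/<user>/spark-2.4.3-bin-hadoop2.7")

from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="first_script")
print(sc.version)
sc.stop()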

can't find module 'graphframes' -- Jupyter

I'm trying to install the graphframes package following some instructions I have already read.
My first attempt was to do this on the command line:
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
This works perfectly and the download completed successfully on the machine.
However, when I try to import the package in my Jupyter notebook, it displays the error:
can't find module 'graphframes'
I then tried to copy the package folder /graphframes to /site-packages, but I could not manage it with a simple cp command.
I'm quite new to using Spark and I'm sure I'm missing some parts of the configuration...
Could you please help me?
This was what worked for me.
Extract the contents of the graphframes-xxx-xxx-xxx.jar file. You should get something like:
graphframes
|-- examples
|   |-- ...
|-- __init__.py
|-- ...
Zip up the entire folder (not just the contents) and name it whatever you want. We'll just call it graphframes.zip.
Then, run the pyspark shell with
pyspark --py-files graphframes.zip \
--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
You may need to do
sc.addPyFile('graphframes.zip')
before
import graphframes
The simplest way to start Jupyter with pyspark and graphframes is to launch Jupyter from pyspark.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
An added advantage of this is that if you later want to run your code via spark-submit, you can use the same start command.
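Once the notebook is running, a quick sanity check that the package resolved (the toy vertices and edges below are just an illustration; spark is the session the pyspark-launched notebook provides) could be:

from graphframes import GraphFrame

# "id", "src" and "dst" are the column names GraphFrame expects
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()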

Error when running pyspark

I tried to run pyspark via the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, the error below comes out in the terminal.
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to add the line below to your environment:
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it helps.
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library (pip install findspark).
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()

Import a python file into another python file on pythonanywhere

I am trying to build a web app on PythonAnywhere.
I have a file abc.py and another file xyz.py. In the file xyz.py I want to use a function from abc.py, so how do I import abc.py into xyz.py given that they are in the same directory? What will be the import statement? I was trying:
import "absolute/path/to/the/file.py"
which didn't work. I also did
from . import abc.py
in xyz.py, which also didn't work.
PythonAnywhere dev here: if they're in the same directory, then you need to use
from abc import functionname
This part of the official Python tutorial explains how import works.
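For concreteness, a minimal sketch (the function name greet is only a placeholder):

# abc.py
def greet(name):
    return "Hello, " + name + "!"

# xyz.py -- same directory as abc.py
from abc import greet

print(greet("PythonAnywhere"))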