I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file I get an error.
I have downloaded spark-avro_2.11-4.0.0.jar, but I am not sure where in my code I should be inserting the Avro package. Any suggestions would be great.
This is an example of the code I am using to read the Avro file:
df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro")
This is the error I get
AnalysisException: 'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'
Download the jar to a location and use the following code snippet in your PySpark app:
import os

# Set this before any SparkContext/SparkSession is created so the jar is picked up.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/jar/spark-avro_2.11-4.0.0.jar pyspark-shell'
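Alternatively, if the machine has internet access, you can let Spark resolve the package from Maven instead of pointing at a local jar. A minimal sketch, again set before any SparkContext or SparkSession is created:
import os

# Let Spark pull spark-avro from Maven instead of shipping a local jar.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell'
)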
Related
I am trying to read a VCF file using Spark.
Spark 3.0
spark.read.format("com.databricks.vcf").load("vcfFilePath")
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.vcf. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:674)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
... 49 elided
Caused by: java.lang.ClassNotFoundException: com.databricks.vcf.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:648)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:648)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:648)
... 52 more
I have tried this with Spark locally on Ubuntu and also in a Databricks environment. Can you folks help me with this?
On Databricks (as Alex mentioned) you have to use the Databricks Genomics Runtime.
If you want to work with VCF files with Spark on your local machine, you have to add the Glow package manually; this package contains the VCF reader. The official Glow documentation describes the required steps in detail.
For PySpark locally, the instructions are something like this:
# Install pyspark
pip install pyspark==3.0.1
# Install Glow
pip install glow.py
# Start PySpark with the Glow Maven package
pyspark --packages io.projectglow:glow-spark3_2.12:0.6.0
In the Python shell:
import glow
glow.register(spark)
df = spark.read.format('vcf').load(path)
To load the example from the PDF document that you mentioned, make sure to replace the spaces with tabs; otherwise you will get a malformed-header exception. The VCF format requires the header and each record to be tab-delimited.
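If you need to fix such a file, here is a minimal sketch (with hypothetical input and output file names) that rewrites whitespace-separated fields as tab-separated ones:
# Hypothetical file names; meta-information lines ("##...") are copied unchanged,
# while the header line and the data records are re-joined with tabs.
with open('example_spaced.vcf') as src, open('example_tabbed.vcf', 'w') as dst:
    for line in src:
        if line.startswith('##'):
            dst.write(line)
        else:
            dst.write('\t'.join(line.split()) + '\n')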
I am trying to run my first PySpark script on a Linux VM I configured. The error message I get is KeyError: 'SPARK_HOME' when I run the following:
from os import environ
from pyspark import SparkContext
I momentarily made this error go away by running export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7. I then ran into a new error, error=2, No such file or directory. Searching took me to this page: https://community.cloudera.com/t5/Community-Articles/Tutorial-Install-Configure-iPython-and-create-run-PySpark/ta-p/246400. I then ran export PYSPARK_PYTHON=~/python3*. This brings me back to the KeyError: SPARK_HOME error.
Honestly, I'm stumbling through this because it's my first time configuring Spark and using PySpark. I still don't quite understand the ins and outs of PyCharm either.
I expect to be able to run the following basic sample script on this page: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f with no issues.
There is a package called findspark that locates your Spark installation for you.
Alternatively, you can use the code below to set the path if SPARK_HOME is not found in the environment:
import os
if 'SPARK_HOME' not in os.environ:
os.environ['SPARK_HOME'] = 'full_path_to_spark_root'
[code continues]
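For completeness, a minimal sketch of the findspark route (the Spark path below is hypothetical; use the directory where you extracted Spark):
import findspark

# Point findspark at the Spark installation explicitly if SPARK_HOME is not set.
findspark.init('/home/user/spark-2.4.3-bin-hadoop2.7')  # hypothetical path

from pyspark import SparkContext

sc = SparkContext(appName='spark-home-check')
print(sc.version)
sc.stop()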
I'm trying to install graphframes package following some instructions I have already read.
My first attempt was to do this in the command line:
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
This works perfectly and the package was downloaded successfully on the machine.
However, when I try to import the package in my Jupyter notebook, it displays the error:
can't find module 'graphframes'
My first attempt was to copy the package folder graphframes to site-packages, but I could not make that work with a simple cp command.
I'm quite new to Spark and I'm sure I'm missing some part of the configuration...
Could you please help me?
This was what worked for me.
Extract the contents of the graphframes-xxx-xxx-xxx.jar file. You should get something like
graphframes
|-- examples
|-- ...
|-- __init__.py
|-- ...
Zip up the entire folder (not just the contents) and name it whatever you want. We'll just call it graphframes.zip.
Then, run the pyspark shell with
pyspark --py-files graphframes.zip \
--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
You may need to do
sc.addPyFile('graphframes.zip')
before
import graphframes
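A quick sanity check that the zip is actually being picked up (a minimal sketch):
import graphframes

# The module should resolve to a path inside graphframes.zip.
print(graphframes.__file__)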
The simplest way to start Jupyter with PySpark and GraphFrames is to launch Jupyter from pyspark itself.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this is also that if you later want to run your code via spark-submit, you can use the same start command.
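Once the notebook is up (so the spark session created by pyspark is already available), a minimal GraphFrames sketch with toy data to confirm the package loaded correctly:
from graphframes import GraphFrame

# Toy vertices (need an "id" column) and edges (need "src" and "dst" columns).
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()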
I tried to run PySpark via the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, the error below comes out in the terminal.
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to add the line below to your environment:
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope this helps.
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
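To check which interpreter and Jupyter binary are actually being picked up, a small diagnostic sketch (not specific to this setup):
import shutil
import sys

print(sys.executable)           # interpreter the notebook kernel is running under
print(shutil.which('jupyter'))  # jupyter binary found first on PATH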
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library (pip install findspark).
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()
I am trying to get Apache SystemML set up and running (on Ubuntu) in a standalone mode.
I am relying on the github documentation to set this up.
I would like to run this with pyspark and I am following the instructions from this beginner's guide
After successfully installing systemml and launching pyspark shell, I tried the following code from the tutorial:
import systemml as sml
import numpy as np
m1 = sml.matrix(np.ones((3,3)) + 2)
The import statements work fine; however, I encounter the following error on the third line:
ImportError: Unable to load systemML.jar into the current pyspark session.Hint: Provide
the following argument to pyspark: --driver-class-path /usr/local...
As per the hint provided, I launched pyspark again, appending the "--driver-class-path ..." argument at the end, but I encountered the same error.
While googling this, I found the error highlighted in the Apache SystemML documentation. However, I wasn't really able to address the issue.
Any help will be greatly appreciated!
Can you please confirm that "/usr/local..." in your comment is the path to systemml-*-incubating-SNAPSHOT.jar and that the file exists?
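If the jar does exist, one workaround sketch (the jar path below is a placeholder, not the actual location from the hint) is to put it on the driver class path via PYSPARK_SUBMIT_ARGS before any Spark session is created, e.g. in a notebook or plain Python script rather than an already-running pyspark shell:
import os

# Placeholder path -- substitute the systemml jar location printed in the hint.
systemml_jar = '/path/to/systemml-incubating-SNAPSHOT.jar'
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path {0} --jars {0} pyspark-shell'.format(systemml_jar)
)

import systemml as sml
import numpy as np

m1 = sml.matrix(np.ones((3, 3)) + 2)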