I am trying to read a VCF file using Spark.
Spark 3.0
spark.read.format("com.databricks.vcf").load("vcfFilePath")
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.vcf. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:674)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
... 49 elided
Caused by: java.lang.ClassNotFoundException: com.databricks.vcf.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:648)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:648)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:648)
... 52 more
I have tried this with Spark on my local Ubuntu machine, and have also tried it in the Databricks environment. Can you folks help me with this?
On Databricks (as Alex mentioned) you have to use the Databricks Genomics Runtime when creating your cluster.
If you want to work with VCF files with Spark on your local machine, then you have to add the Glow package manually. This package contains the VCF reader. The official documentation here describes the steps that you have to do in detail.
For PySpark locally, the instructions are something like this:
# Install pyspark
pip install pyspark==3.0.1
# Install Glow
pip install glow.py
# Start PySpark with the Glow Maven package
pyspark --packages io.projectglow:glow-spark3_2.12:0.6.0
In the Python shell:
import glow
glow.register(spark)
df = spark.read.format('vcf').load(path)
To load the example from the PDF document that you mentioned, make sure to replace the spaces with tabs; otherwise you will get a malformed-header exception. The VCF format requires the header line and each record to be tab-delimited.
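If you need to fix up such a file, here is a minimal sketch (the file names are placeholders, and it assumes columns are separated only by whitespace with no spaces inside fields) that rewrites it with tab delimiters:
# Rewrite a space-separated copy of a VCF file with tab delimiters.
with open("example_spaces.vcf") as src, open("example_tabs.vcf", "w") as dst:
    for line in src:
        if line.startswith("##"):
            dst.write(line)  # meta-information lines stay as they are
        else:
            # the header line (#CHROM ...) and all data records must be tab-delimited
            dst.write("\t".join(line.split()) + "\n")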
Related
I am learning to solve some optimisation problems using Google OR-Tools.
I started with their example code and I am trying to run it in IntelliJ.
But when I write the code and compile it, I get the following error.
Exception in thread "main" java.lang.UnsatisfiedLinkError: com.google.ortools.linearsolver.operations_research_linear_solverJNI.MPSolver_CLP_LINEAR_PROGRAMMING_get()I
at com.google.ortools.linearsolver.operations_research_linear_solverJNI.MPSolver_CLP_LINEAR_PROGRAMMING_get(Native Method)
at com.google.ortools.linearsolver.MPSolver$OptimizationProblemType.<clinit>(MPSolver.java:221)
I searched for some answers and found that it requires jniortools.dll.
But I am working on Ubuntu, so I assume I need to load the libjniortools.so file instead. Am I right?
So I included the line
static {
    System.loadLibrary("libjniortools");
}
I also have a lib folder in which I have put both com.google.ortools.jar and protobuf.jar, along with all the other library files that were present when I extracted the zip file (basically, I copy-pasted the lib folder from the extracted zip).
I have added the jar paths as dependencies in IntelliJ, as shown in the figure (the last two entries in the dependency list in the image).
Then I also tried giving the lib path in the VM options:
-Djava.library.path=/home/surajvashistha/IdeaProjects/LPModel/lib
After all this, I get the following error:
Exception in thread "main" java.lang.UnsatisfiedLinkError: no libjniortools in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:871)
at java.lang.System.loadLibrary(System.java:1124)
at LP.<clinit>
I am stuck here and not able to move forward. Can anyone help?
I am trying to run my first PySpark script on a Linux VM I configured. The error message I get is KeyError: 'SPARK_HOME' when I run the following:
from os import environ
from pyspark import SparkContext
I momentarily made this error go away by running export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7. I then ran into a new error: error=2, No such file or directory. Searching took me to this page: https://community.cloudera.com/t5/Community-Articles/Tutorial-Install-Configure-iPython-and-create-run-PySpark/ta-p/246400. I then ran export PYSPARK_PYTHON=~/python3*. This brought me back to the KeyError: 'SPARK_HOME' error.
Honestly, I'm stumbling through this because it's my first time configuring Spark and using PySpark. I also don't quite understand the ins and outs of PyCharm yet.
I expect to be able to run the following basic sample script on this page: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f with no issues.
There is a package called findspark (available on PyPI) that locates Spark for you.
Alternatively, you may use the code below to set the path if it is not found in the environment:
import os
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'full_path_to_spark_root'
[code continues]
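For completeness, a minimal sketch of the findspark route (it assumes findspark has been installed with pip, and the Spark path is a placeholder):
import findspark
# Placeholder path: point this at the directory you extracted Spark into.
findspark.init('/home/user/spark-2.4.3-bin-hadoop2.7')

from pyspark import SparkContext
sc = SparkContext(appName='smoke-test')
print(sc.parallelize(range(10)).sum())  # should print 45
sc.stop()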
I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file, I get an error.
I have downloaded the spark-avro_2.11:4.0.0 jar, but I am not sure where in my code I should be adding the Avro package. Any suggestions would be great.
This is an example of the code I am using to read the Avro file:
df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro")
This is the error I get
AnalysisException: 'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'
Download the jar to a location and use the following code snippet in your PySpark app:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/jar/spark-avro_2.11-4.0.0.jar pyspark-shell'
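Alternatively, a rough sketch that lets Spark resolve the package from Maven instead of pointing at a local jar (either way, this has to be set before the SparkContext/SQLContext is created):
import os
# Resolve the Avro reader from Maven; must run before the Spark JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell'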
I'm trying to install the graphframes package following some instructions I have already read.
My first attempt was to do this in the command line:
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
This works perfectly, and the download completed successfully on the machine.
However, when I try to import the package in my Jupyter notebook, it displays the error:
can't find module 'graphframes'
My first attempt to fix this was to copy the package folder graphframes to site-packages, but I could not do it with a simple cp command.
I'm quite new to using Spark, and I'm sure I'm missing some part of the configuration...
Could you please help me?
This was what worked for me.
Extract the contents of the graphframes-xxx-xxx-xxx.jar file. You should get something like:
graphframes
|-- examples
|-- ...
|-- __init__.py
|-- ...
Zip up the entire folder (not just the contents) and name it whatever you want. We'll just call it graphframes.zip.
Then, run the pyspark shell with
pyspark --py-files graphframes.zip \
--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
You may need to do
sc.addPyFile('graphframes.zip')
before
import graphframes
The simplest way to start Jupyter with pyspark and graphframes is to start Jupyter from pyspark.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
An added advantage of this is that if you later want to run your code via spark-submit, you can use the same start command.
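Once the notebook is up, a quick smoke test (assuming the spark session that the pyspark shell creates for you) to confirm that both the Python module and the underlying Scala package load:
from graphframes import GraphFrame

# Tiny graph just to verify the installation.
v = spark.createDataFrame([('a', 'Alice'), ('b', 'Bob')], ['id', 'name'])
e = spark.createDataFrame([('a', 'b', 'friend')], ['src', 'dst', 'relationship'])
g = GraphFrame(v, e)
g.inDegrees.show()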
I am trying to get Apache SystemML set up and running (on Ubuntu) in standalone mode.
I am relying on the GitHub documentation to set this up.
I would like to run this with pyspark, and I am following the instructions from this beginner's guide.
After successfully installing systemml and launching the pyspark shell, I tried the following code from the tutorial:
import systemml as sml
import numpy as np
m1 = sml.matrix(np.ones((3,3)) + 2)
The import statements work fine; however, I encounter the following error on the third line:
ImportError: Unable to load systemML.jar into the current pyspark session.Hint: Provide
the following argument to pyspark: --driver-class-path /usr/local...
As per the hint provided, I launched pyspark again, appending the "--driver-class-path ..." argument at the end. But I encountered the same error.
While googling for this, I found this error highlighted in the Apache SystemML documentation. However, I wasn't really able to address the issue.
Any help will be greatly appreciated!
Can you please confirm that "/usr/local..." in your comment is the path to systemml-*-incubating-SNAPSHOT.jar and that the file exists?
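If the jar does exist there, one way to make sure it is picked up is to set PYSPARK_SUBMIT_ARGS before the Spark session starts; a rough sketch (the jar path is a placeholder, use the one from the hint on your machine):
import os
# Placeholder path: replace with the systemml jar reported in the ImportError hint.
systemml_jar = '/usr/local/lib/python3.6/dist-packages/systemml/systemml-java/systemml-1.2.0.jar'
# Must be set before the Spark JVM starts (i.e. before the pyspark session is created).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path {0} --jars {0} pyspark-shell'.format(systemml_jar)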