It is often the case that we need to import extra libraries in PySpark. Databricks provides a great environment for practicing PySpark, but is it possible to install a needed library there? If yes, how?
Or is there any workaround for using a non-built-in library/package?
Thanks.
There are multiple ways to do so, depending on the case and the package type. If it is a PyPI package, then the easiest way is to use dbutils:
dbutils.library.installPyPI("pypipackage", version="version", repo="repo", extras="extras")
Or you can attach a library to the cluster. More info can be found here:
https://docs.databricks.com/libraries.html#install-workspace-libraries
I am trying to deploy PySpark locally using the instructions at
https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi
I can see that extra dependencies are available, such as sql and pandas_on_spark, which can be installed with
pip install pyspark[sql,pandas_on_spark]
But how can we find all available extras?
Looking in the JSON of the pyspark package (based on https://wiki.python.org/moin/PyPIJSON) at
https://pypi.org/pypi/pyspark/json
I could not find the possible extra dependencies (as described in What is 'extra' in pypi dependency?); the value for requires_dist is null.
Many thanks for your help.
As far as I know, you cannot easily get the list of extras. If the list is not clearly documented, you have to look at the packaging code/config; for PySpark, the extras are defined in its setup.py, which gives the following list: ml, mllib, sql, and pandas_on_spark.
I'm trying to install GeoMesa in Azure Databricks (Databricks Version 6.6 / Scala 2.11), following this tutorial.
I have installed GeoMesa in Databricks using the Maven coordinates org.locationtech.geomesa:geomesa-spark-jts_2.11:2.3.2 as described.
However, when I run import org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator, it tells me that it's not found.
All the other imports in this tutorial work just fine:
import org.locationtech.jts.geom._
import org.locationtech.geomesa.spark.jts._
I looked at GeoMesa's GitHub, and it seems like it's the correct location.
I'm not super familiar with Java / Scala / Jars.
Not sure what other way I can approach this.
Thanks for help in advance!
Good question! It appears that there's a small error with this tutorial. The GeoMesaSparkKryoRegistrator is used for managing the serialization of SimpleFeatures in Spark.
This tutorial does not seem to use SimpleFeatures (at least as of August 2020). As such, this import is likely unnecessary. You ought to be able to progress by skipping that import and the registration of the GeoMesaSparkKryoRegistrator.
The geomesa-spark-jts module you installed provides just the spatial types and functions necessary for basic geometry support in Spark. To leverage GeoMesa's datastores in Spark, one would import a GeoMesa database-specific spark-runtime jar. Since those datastores use GeoTools SimpleFeatures, those jars would include the GeoMesaSparkKryoRegistrator, and its use would be similar to what is in that notebook and in the documentation on geomesa.org.
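If the registrator does become necessary later (say, after attaching a datastore spark-runtime jar), it is typically wired in through Spark's standard Kryo settings. Here is a rough sketch, not taken from that tutorial, using the class names from the imports above:

import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._

// Kryo registration is only needed once SimpleFeatures come into play, i.e. when a
// GeoMesa datastore spark-runtime jar (which bundles GeoMesaSparkKryoRegistrator)
// is attached to the cluster.
val spark = SparkSession.builder()
  .appName("geomesa-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator")
  .getOrCreate()

// With only geomesa-spark-jts attached, this registers the JTS geometry types and
// spatial functions on the session.
spark.withJTS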
It seems basic, but from what I see on the Databricks website, nothing works on my side.
I have installed the koalas package on my cluster.
But when I try to import the package in my Scala notebook, I get an error:
command-3313152839336470:1: error: not found: value databricks
import databricks.koalas
If I do it in Python, everything works fine
Details of the cluster & notebook:
Thanks for your help
Matt
Koalas is a Python package that mimics the interface of Pandas (another Python package). Currently no Scala version is published, even though the project may contain some Scala code. The goal of Koalas is to provide a drop-in replacement for Pandas that makes use of the distributed nature of Apache Spark. Since Pandas is only available in Python, I don't expect a direct port of this in Scala.
https://github.com/databricks/koalas
For Scala, your best bet is to use the Dataset and DataFrame APIs of Spark:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
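For example, here is a minimal sketch of a pandas/Koalas-style group-by aggregation expressed with the DataFrame API (the data and column names are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("dataframe-example").getOrCreate()
import spark.implicits._

// A tiny DataFrame standing in for whatever data you would have loaded into Koalas
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// Roughly the equivalent of df.groupby("key")["value"].mean() in pandas/Koalas
df.groupBy("key")
  .agg(avg($"value").alias("mean_value"))
  .show()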
When using the Scala kernel with Vegas, we see the nice charts.
But when switching to the scala-spark kernel, the imports no longer work:
What is the way to fix the imports for the spark kernel?
As described here, you'll probably need to tweak your notebook config to pre-load those libraries so that they are available at runtime.
Then you can do a normal import (without the funny $ivy syntax, which actually comes from the Ammonite REPL).
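For instance, once the Vegas jars are on the kernel's classpath, the plain imports from the Vegas README work. A minimal sketch (the data is made up, and how the chart is actually rendered depends on the kernel):

import vegas._

val plot = Vegas("Country Pop")
  .withData(Seq(
    Map("country" -> "USA", "population" -> 314),
    Map("country" -> "UK", "population" -> 64),
    Map("country" -> "DK", "population" -> 80)
  ))
  .encodeX("country", Nom)
  .encodeY("population", Quant)
  .mark(Bar)

// Rendering needs an HTML displayer appropriate to the kernel to be in scope;
// for Spark DataFrames, import vegas.sparkExt._ to get withDataFrame.
plot.show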
I would like to add the jar files from Stanford's CoreNLP to my Scala project. The part I'm struggling with is doing this in the context of a Scala kernel for Jupyter notebooks.
I'm using the Apache Toree distribution for the kernel. There may be a simple one-line command within a cell, but I can't find it.
Any help would be appreciated!
Not sure this applies to Stanford CoreNLP, but in a past project that involved evaluating IBM DSX on Jupyter Notebook, I read this article by Dustin V, which consists of steps for adding jars. My guess is that the within-cell command you're seeking is something similar to the following:
%AddJar http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar -f
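Note that the URL above is the models jar; the stanford-corenlp code jar itself also needs to be added before the classes resolve. Once both are on the classpath, a minimal sketch of using the pipeline from Scala (assuming a CoreNLP version recent enough to provide CoreDocument, i.e. 3.9+):

import java.util.Properties
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import scala.collection.JavaConverters._

// Build a small pipeline: tokenization, sentence splitting and part-of-speech tagging
val props = new Properties()
props.setProperty("annotators", "tokenize,ssplit,pos")
val pipeline = new StanfordCoreNLP(props)

val doc = new CoreDocument("Stanford CoreNLP runs inside a Toree notebook.")
pipeline.annotate(doc)

// Print each token with its part-of-speech tag
doc.tokens().asScala.foreach(t => println(s"${t.word()} -> ${t.tag()}"))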