Impossible to import koalas in scala notebook - scala

It seems basic, but from what I see on the Databricks website, nothing works on my side.
I have installed the koalas package on my cluster.
But when I try to import the package in my Scala notebook, I get an error:
command-3313152839336470:1: error: not found: value databricks
import databricks.koalas
If I do it in Python, everything works fine
Cluster & notebook details:
Thanks for your help
Matt

Koalas is a Python package that mimics the interfaces of Pandas (another Python package). Currently no Scala version is published, even though the project may contain some Scala code. The goal of Koalas is to provide a drop-in replacement for Pandas that makes use of the distributed nature of Apache Spark. Since Pandas is only available for Python, I don't expect a direct port of it to Scala.
https://github.com/databricks/koalas
For Scala, your best bet is to use the Dataset and DataFrame APIs of Spark (a short sketch follows the links below):
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
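For illustration, here is a minimal sketch of the kind of filtering and aggregation you would otherwise reach for Koalas/Pandas to do, written against the Scala DataFrame API (the column names and data are made up for the example; in a Databricks notebook the spark session already exists):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// In a Databricks notebook `spark` is already provided; the builder is only needed standalone
val spark = SparkSession.builder().appName("koalas-alternative").getOrCreate()
import spark.implicits._
// A small DataFrame in place of reading a real table
val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")
// Pandas/Koalas-style filtering and aggregation with the DataFrame API
people
  .filter($"age" > 30)
  .agg(avg($"age").as("avg_age"))
  .show()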

Related

Jupyter for Scala with spylon-kernel without having to install Spark

Based on web searches and strong recommendations, I am trying to run Jupyter locally for Scala (using spylon-kernel).
I was able to create a notebook, but when I try to run a Scala code snippet I see the message initializing scala interpreter, and in the console I see this error:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
I am not planning to install Spark. Is there a way I can still use Jupyter for Scala without installing Spark?
I am new to Jupyter and the ecosystem. Pardon me for the amateur question.
Thanks

Installing GeoMesa on Databricks

I'm trying to install GeoMesa in Azure Databricks (Databricks version 6.6 / Scala 2.11), following this tutorial.
I have installed GeoMesa in Databricks using the Maven coordinates org.locationtech.geomesa:geomesa-spark-jts_2.11:2.3.2 as described.
However, when I run import org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator, it tells me that it's not found.
All the other imports in this tutorial work just fine:
import org.locationtech.jts.geom._
import org.locationtech.geomesa.spark.jts._
I looked at GeoMesa's GitHub, and that seems to be the correct location.
I'm not super familiar with Java / Scala / JARs.
Not sure how else I can approach this.
Thanks for help in advance!
Good question! It appears that there's a small error with this tutorial. The GeoMesaSparkKryoRegistrator is used for managing the serialization of SimpleFeatures in Spark.
This tutorial does not seem to use SimpleFeatures (at least as of August 2020). As such, this import is likely unnecessary. You ought to be able to progress by skipping that import and the registration of the GeoMesaSparkKryoRegistrator.
The imported module provides just the spatial types and functions necessary for basic geometry support in Spark. To leverage GeoMesa's datastores in Spark, one would import a GeoMesa database-specific spark-runtime jar. Since those datastores use GeoTools SimpleFeatures, those jars would include the GeoMesaSparkKryoRegistrator, and its use would be similar to what is in that notebook and in the documentation on geomesa.org.
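For example, a minimal sketch of using geomesa-spark-jts on its own, without the Kryo registrator (the withJTS call and the st_point SQL function are taken from the geomesa-spark-jts documentation; double-check the exact names against the GeoMesa docs for your version):
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom._
import org.locationtech.geomesa.spark.jts._
val spark = SparkSession.builder().appName("geomesa-jts-example").getOrCreate()
// Register the JTS geometry types and the st_* spatial SQL functions
spark.withJTS
// Basic geometry support, no GeoMesa datastore or Kryo registrator needed
val points = spark.sql("SELECT st_point(-122.3, 47.6) AS geom")
points.show()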

How can I include extra library/package in Databricks pyspark notebook?

Often we need to import extra libraries in PySpark. Databricks provides a great environment for practicing PySpark; however, is it possible to install a needed library there? If yes, how?
Or is there any workaround to use a non-built-in library/package?
Thanks.
There are multiple ways to do so, depending on the case and the package type. If it is a PyPI package, the easiest way is using dbutils:
dbutils.library.installPyPI("pypipackage", version="version", repo="repo", extras="extras")
Or you could attach a library to a cluster. More info can be found here
https://docs.databricks.com/libraries.html#install-workspace-libraries

How to use the Vegas visualization within a scala-spark jupyter notebook

When using the Scala kernel with Vegas we see the charts rendered nicely.
But when switching to the scala-spark kernel, the imports no longer work:
What is the way to fix the imports for the spark kernel?
As described here, you'll probably need to tweak your notebook config to pre-load those libraries so they are available at runtime.
Then you can do a normal import (without the funny $ivy syntax, which actually comes from the Ammonite REPL).
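For example, a sketch only, assuming the vegas and vegas-spark jars are pre-loaded on the kernel's classpath and that withDataFrame is the helper provided by the vegas-spark module:
import vegas._
import vegas.sparkExt._   // withDataFrame for Spark DataFrames (assumed from the vegas-spark module)
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("vegas-example").getOrCreate()
import spark.implicits._
// Small example DataFrame; in practice this would be your own data
val df = Seq(("USA", 314), ("UK", 64), ("DK", 6)).toDF("country", "population")
Vegas("Population by country")
  .withDataFrame(df)
  .encodeX("country", Nom)
  .encodeY("population", Quant)
  .mark(Bar)
  .show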

Scala is a must for Spark?

I am new to Spark. Its docs say it is available in either Scala or Python.
And some blogs say Spark depends on Scala (for instance, http://cn.soulmachine.me/blog/20130614/). Therefore I am wondering: is Scala a must for Spark? (Do I have to install Scala first because of the dependency?)
The API of Spark has the following language bindings:
Scala
Java
Python
Scala is a natural fit, since it has strong support for functional programming, which is an obvious benefit in the area of Big Data. Most tutorials and code snippets you find on the net are written in Scala.
Concerning the runtime dependencies, please have a look at
the project download page
"Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark 1.2.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x)."
Java is a must for Spark, plus many other transitive dependencies (the Scala compiler is just a library for the JVM). PySpark just connects remotely (over a socket) to the JVM using Py4J (Python-Java interoperation). Py4J is included in PySpark.
PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.
All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.
Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.
https://spark.apache.org/docs/0.9.1/python-programming-guide.html - this guide shows how to build and run all of this with the Scala/Java build tool sbt, which will download all dependencies (including Scala) automatically from a remote repository. You can also use Maven.
If you don't want Java on your machine, you can start it on any other machine and configure PySpark to use it (via SparkConf().setMaster).
So, you need Java on the master node running Spark itself (along with all Java dependencies such as Scala), and Python 2.6 for the Python client.