Databricks-connect with Python + Scala UDF in JAR file not working locally - pyspark

I'm trying to use a JAR file from Python (using databricks-connect) in VS Code.
I have already checked the path to the JAR file.
I have the following code as an example:
import datetime
import time
from pyspark.sql import SparkSession
from pyDataHub import LoadProcessorBase, ProcessItem
from pyspark.sql.functions import col, lit, sha1, concat, udf, array
from pyspark.sql import functions
from pyspark.sql.types import TimestampType, IntegerType, DoubleType, StringType
from pyspark import SparkContext
from pyspark.sql.functions import sha1, upper
from pyspark.sql.column import Column, _to_java_column, _to_seq

spark = SparkSession \
    .builder \
    .config("spark.jars", "/users/Phill/source/jar/DataHub_Core_Functions.jar") \
    .getOrCreate()

sc = spark.sparkContext

def PhillHash(col):
    # Look up the Scala UDF shipped in the JAR and wrap its result as a Column
    f = sc._jvm.com.narato.datahub.core.HashContentGenerator.getGenerateHashUdf()
    return upper(sha1(Column(f.apply(_to_seq(sc, [col], _to_java_column)))))

sc._jsc.addJar("/users/Phill/source/jar/DataHub_Core_Functions.jar")

spark.range(100).withColumn("test", PhillHash("id")).show()
Any help would be appreciated, because I'm out of options here...
The error I get is the following:
Exception has occurred: TypeError 'JavaPackage' object is not callable

Add the JAR to a DBFS location and update the path accordingly; the workers cannot connect to your local filesystem.
Also make sure you are running version 5.4 of the Databricks runtime (or higher).
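For example, a minimal sketch of that change, assuming the JAR is first uploaded with the Databricks CLI (the dbfs:/FileStore/jars/ destination below is only an illustrative path, not something from the question):

# In a shell, copy the JAR to DBFS (hypothetical destination path):
#   dbfs cp /users/Phill/source/jar/DataHub_Core_Functions.jar dbfs:/FileStore/jars/DataHub_Core_Functions.jar

from pyspark.sql import SparkSession

# Point spark.jars (and addJar) at the DBFS copy instead of the local file
spark = SparkSession \
    .builder \
    .config("spark.jars", "dbfs:/FileStore/jars/DataHub_Core_Functions.jar") \
    .getOrCreate()
sc = spark.sparkContext
sc._jsc.addJar("dbfs:/FileStore/jars/DataHub_Core_Functions.jar")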

Related

Pyspark in google colab

I am trying to use pyspark on Google Colab. Every tutorial follows a similar method:
!pip install pyspark
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark
# Import a Spark function from library
from pyspark.sql.functions import col
But I get an error on this line:
----> 4 spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
RuntimeError: Java gateway process exited before sending its port number
I tried installing Java using something like this:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
as suggested by the tutorials, but nothing seems to work.
This worked for me, so I'm posting it in case someone needs it.
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

import os
# Point PySpark at the JDK that was just installed
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf

# Create the configuration (expose the Spark UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")
# Create the context
sc = pyspark.SparkContext(conf=conf)
# Create the session
spark = pyspark_sql.SparkSession.builder.getOrCreate()
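As a quick sanity check after the session is created (an illustrative snippet, not part of the original answer), you can build a tiny DataFrame:

# Verify the session works end to end
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()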

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks, and I'm using imports to import functions from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
    "spark.driver.memory", "32g").getOrCreate()
The imported function "x" will take as an argument a string that was read as a pyspark dataframe, as such:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install the old_file in the cluster for this to work? If so, how would this work and will the package update if I change old_file again?
Thanks

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I'm trying to import SparkSession from pyspark, but it throws an error: 'No module named 'pyspark''. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()```
Error:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
      1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
      3 # Create our SparkSession, this can take a couple minutes locally
      4 spark = SparkSession.builder.appName("basics").getOrCreate()

ModuleNotFoundError: No module named 'pyspark'
I am in my conda env, and I tried pip install pyspark, but I already have it.
If you are using Zepl, they have their own specific way of importing. This makes sense: since they are running in the cloud, they need their own syntax, which distinguishes their interpreter directives from Python itself. For instance, %spark.pyspark.
%spark.pyspark
from pyspark.sql import SparkSession
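For example, an illustrative paragraph (assuming a standard Zeppelin/Zepl Spark interpreter, where a SparkSession is typically already exposed as spark):

%spark.pyspark
# The interpreter usually provides `spark` and `sc`, so there is no need to build a new session
df = spark.range(5)
df.show()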

spark error: spark.read.format("org.apache.spark.csv")

I am getting the following error after firing the command from spark-shell
scala> val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswa
s7917/src_files/movies_data_srcfile_sess06_01.csv")
<console>:21: error: not found: value spark
val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
Do I need to import something explicitly?
Please help with the complete command set.
Thanks.
It seems like you are using an old version of Spark. You need to use Spark 2.x or higher and import the implicits as
import spark.implicits._
And then
val df1 = spark.read.format("csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("path")
You aren't even getting a SparkSession. It seems you are using an older version of Spark, so you should use the SQLContext, and you also need to include the external Databricks CSV library when you start the spark shell...
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
and then from within the spark shell...
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
You can see more info about it here

Running h2o in Jupyter scala notebook

I'm trying to get h2o running on a Jupyter notebook with a Scala kernel, with no success so far. Maybe someone can give me a hint on what could be wrong? The code I'm executing at the moment is:
classpath.add("ai.h2o" % "sparkling-water-core_2.10" % "1.6.5")
import org.apache.spark.h2o._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setAppName("appName").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val h2oContext = new H2OContext(sc).start()
It fails on the last line with error
java.lang.NoClassDefFoundError: water/H2O
....
And prints out the exception
java.lang.RuntimeException: Cannot launch H2O on executors: numOfExecutors=1, executorStatus=(driver,false) (Cannot launch H2O on executors: numOfExecutors=1, executorStatus=(driver,false))
org.apache.spark.h2o.H2OContextUtils$.startH2O(H2OContextUtils.scala:169)
org.apache.spark.h2o.H2OContext.start(H2OContext.scala:214)
If you use Toree,
in /usr/local/share/jupyter/kernels/apache_toree_scala/kernel.json
you should add --packages ai.h2o:sparkling-water-core_2.10:1.6.6 to __TOREE_SPARK_OPTS__, something like
"__TOREE_SPARK_OPTS__": "--master local[*] --executor-memory 12g --driver-memory 12g --packages ai.h2o:sparkling-water-core_2.10:1.6.6",
Then sc is created when you create your notebook, so you don't need to recreate it.