PySpark ModuleNotFoundError when importing custom package - pyspark

Context: I'm running a script on Azure Databricks, and I'm using imports to pull functions in from another file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" will take as an argument a string that was read in as a PySpark DataFrame, like this:
import pyspark.pandas as ps  # assuming ps is pandas-on-Spark, which the line below implies
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like:
ModuleNotFoundError: No module named 'old_file'
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would that work, and will the package update if I change old_file again?
Thanks
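For reference, one commonly used way to make a sibling module such as old_file visible on the executors is SparkContext.addPyFile. A minimal sketch follows; the path is hypothetical, and the driver still needs old_file importable on its own path:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('workflow').getOrCreate()

# Distribute old_file.py to the executors so that functions imported from it
# can be unpickled and executed there (the path below is hypothetical).
spark.sparkContext.addPyFile("/dbfs/FileStore/code/old_file.py")

from old_file import x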

Related

How read .shp files in databricks from filestore?

I'm using Databricks Community Edition, and I saved a .shp file in the FileStore, but when I try to read it I get this error:
DriverError: /dbfs/FileStore/tables/World_Countries.shp: No such file or directory
This is my code:
import geopandas as gpd
gdf = gpd.read_file("/dbfs/FileStore/tables/World_Countries.shp")
I also tried:
gdf = gpd.read_file("/FileStore/tables/World_Countries.shp")
First, verify that the file path is correct and that the file exists at the specified location. You can list the contents of the directory with the dbutils.fs.ls command and check whether the file is present:
dbutils.fs.ls("dbfs:/FileStore/path/to/your/file.shp")
Also, make sure that you have the correct permissions to access the file; in Databricks, you may need to be an administrator or be granted access explicitly.
You can also try to read the file through Spark using the full path, including the file extension. Note that Spark has no built-in "shapefile" data source, so this assumes a third-party geospatial library is installed on the cluster:
file_path = "dbfs:/FileStore/path/to/your/file.shp"
df = spark.read.format("shapefile").option("shape", file_path).load()
There are then several ways to read such files in Databricks (all of them assume a third-party geospatial data source, since none of these readers ship with Spark itself):
1. Through the DataFrame reader with an explicit format:
from pyspark.sql.functions import *
file_path = "dbfs:/FileStore/path/to/your/file.shp"
df = spark.read.format("shapefile").option("shape", file_path).load()
df.show()
2. Through a dedicated reader method, if the library provides one:
df = spark.read.shape(file_path)
3. By converting a shape column into a geometry column (shape_to_geometry would also come from such a library, not from pyspark.sql.functions):
from pyspark.sql import functions as F
from shapely.geometry import Point
geo_df = df.select("shape").withColumn("geometry", F.shape_to_geometry("shape")).drop("shape").select("geometry")
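Coming back to the geopandas approach from the question: a shapefile consists of several sidecar files (.shp, .shx, .dbf, and usually .prj), all of which must be uploaded alongside the .shp. A minimal sketch (file names hypothetical) that copies them to the driver's local disk first, which also helps on Community Edition where the /dbfs FUSE mount can be unavailable:
import geopandas as gpd

# A shapefile needs its sidecar files too; copy each one from DBFS to the
# driver's local disk (names/extensions are hypothetical -- check dbutils.fs.ls).
for ext in ["shp", "shx", "dbf", "prj"]:
    dbutils.fs.cp(f"dbfs:/FileStore/tables/World_Countries.{ext}",
                  f"file:/tmp/World_Countries.{ext}")

gdf = gpd.read_file("/tmp/World_Countries.shp")
print(gdf.head())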

Unable to call Notebook when using scala code in Databricks

I am in a situation where I am able to successfully run the snippet below in Azure Databricks from a separate command cell (Cmd).
%run ./HSCModule
But I run into issues when that line is included in a cell with other Scala code that imports the packages below; I get the following error.
import java.io.{File, FileInputStream}
import java.text.SimpleDateFormat
import java.util{Calendar, Properties}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import scala.util._
Error:
:168: error: ';' expected but '.' found.
%run ./HSCModule
FYI - I have also tried dbutils.notebook.run and am still facing the same issue.
You can't mix magic commands, like %run, %pip, etc., with Scala/Python code in the same cell. The documentation says:
%run must be in a cell by itself, because it runs the entire notebook inline.
So you need to put this magic command into a separate cell.
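For illustration, a sketch of the intended layout, reusing the snippets from the question:
Cmd 1 (only the magic command):
%run ./HSCModule
Cmd 2 (the regular Scala code and its imports):
import java.io.{File, FileInputStream}
import org.apache.spark.sql.SparkSession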

Error in Pycharm when linking to pyspark: name 'spark' is not defined

When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in PyCharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something is wrong with how I linked PyCharm to pyspark.
(Screenshots of my Environment Variable, Project Structure, and Project Interpreter settings.)
When you start pyspark from the command line, you have a SparkSession object and a SparkContext available to you as spark and sc respectively.
To use them in PyCharm, you need to create these variables yourself first:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
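Putting the pieces together, a minimal sketch of what should then run in PyCharm (assuming your local Hadoop/winutils setup is in order, as noted in the edit below):
from pyspark.sql import SparkSession

# Create the objects that the pyspark shell normally provides for you.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The example from the question now works:
l = [('Alice', 1)]
print(spark.createDataFrame(l).collect())  # [Row(_1='Alice', _2=1)]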
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I am trying to import SparkSession from pyspark, but it throws the error No module named 'pyspark'. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env and I tried pip install pyspark, but I already have it.
If you are using Zepl, it has its own specific way of importing. This makes sense: since it runs in the cloud, it needs its own syntax to mark which interpreter a paragraph uses, as distinct from the Python code itself, for instance %spark.pyspark.
%spark.pyspark
from pyspark.sql import SparkSession
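So a complete paragraph might look like this (a sketch that reuses the code from the question inside the pyspark interpreter):
%spark.pyspark
# Import SparkSession and create it inside the pyspark interpreter paragraph.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basics").getOrCreate()
print(spark.version)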

Databricks-connect with Python + Scala UDF in JAR file not working locally

I'm trying to use a JAR file from Python (using databricks-connect) in VS Code.
I already checked the path to the JAR file.
I have the following code as an example:
import datetime
import time
from pyspark.sql import SparkSession
from pyDataHub import LoadProcessorBase, ProcessItem
from pyspark.sql.functions import col, lit, sha1, concat, udf, array
from pyspark.sql import functions
from pyspark.sql.types import TimestampType, IntegerType, DoubleType, StringType
from pyspark import SparkContext
from pyspark.sql.functions import sha1, upper
from pyspark.sql.column import Column, _to_java_column, _to_seq
spark = SparkSession \
    .builder \
    .config("spark.jars", "/users/Phill/source/jar/DataHub_Core_Functions.jar") \
    .getOrCreate()
sc = spark.sparkContext
def PhillHash(col):
    f = sc._jvm.com.narato.datahub.core.HashContentGenerator.getGenerateHashUdf()
    return upper(sha1(Column(f.apply(_to_seq(sc, [col], _to_java_column)))))
sc._jsc.addJar("/users/Phill/source/jar/DataHub_Core_Functions.jar")
spark.range(100).withColumn("test", PhillHash("id")).show()
Any help would be appreciated because I'm out of options here...
The error I get is the following:
Exception has occurred: TypeError 'JavaPackage' object is not callable
Add the JAR to a DBFS location and update the path accordingly. The workers cannot connect to your local filesystem.
Also make sure you are running version 5.4 of the Databricks Runtime (or higher).
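A sketch of what the adjusted setup might look like (the DBFS path is hypothetical; the JAR has to be uploaded there first, for example via the Databricks UI or dbutils.fs.cp):
from pyspark.sql import SparkSession

# Point spark.jars at a DBFS location the cluster can reach, instead of a
# path on the local machine (the path below is hypothetical).
jar_path = "dbfs:/FileStore/jars/DataHub_Core_Functions.jar"

spark = SparkSession \
    .builder \
    .config("spark.jars", jar_path) \
    .getOrCreate()

sc = spark.sparkContext
sc._jsc.addJar(jar_path)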