Spatial with SparkSQL/Python in Synapse Spark Pool using apache-sedona? - pyspark

I would like to run spatial queries on large data sets, for which e.g. geopandas would be too slow.
Inspiration I found here: https://anant-sharma.medium.com/apache-sedona-geospark-using-pyspark-e60485318fbe
In Spark Pool of Synapse Analytics I prepared (via Azure Portal):
Apache Spark Pool / Settings / Packages / Requirement files:
requirement.txt:
azure-storage-file-share
geopandas
apache-sedona
Apache Spark Pool / Settings / Packages / Workspace packages:
geotools-wrapper-geotools-24.1.jar
sedona-sql-3.0_2.12-1.2.0-incubating.jar
Apache Spark Pool / Settings / Packages / Spark configuration
config.txt:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
In Pyspark Notebook
print(spark.version)
print(spark.conf.get("spark.kryo.registrator"))
print(spark.conf.get("spark.serializer"))
The output was:
3.1.2.5.0-58001107
org.apache.sedona.core.serde.SedonaKryoRegistrator
org.apache.spark.serializer.KryoSerializer
Then I tried:
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Sedona App")
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)
But it failed:
Py4JJavaError: An error occurred while calling o636.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
A simple check that everything is correctly installed would probably be to run this:
%%sql
SELECT ST_Point(0,0);
Please help me get the spatial functions registered in pyspark running in a Synapse notebook!

As per the repro from my end, I'm able to successfully run the above commands without any issue.
I just installed the requirement.txt file containing apache-sedona and downloaded the two jar files below:
sedona-python-adapter-3.0_2.12-1.0.0-incubating.jar
geotools-wrapper-geotools-24.0.jar
Note: the config.txt file is not required.
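For completeness, a minimal verification sketch for the Synapse notebook, assuming the packages above are installed on the pool and using the spark session the notebook already provides (rather than building a new local[*] one):

from sedona.register import SedonaRegistrator

# Register the Sedona ST_* functions on the existing Synapse session
SedonaRegistrator.registerAll(spark)

# If registration worked, this should return a single point geometry
spark.sql("SELECT ST_Point(0.0, 0.0) AS pt").show()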

Related

Reading url via pyspark in Databricks notebook

I am unable to read the content of a URL via pySpark in Databricks Notebooks (Version 8.3, Spark 3.1.1). I have tried almost all the possibilities but am unable to find the exact problem. Here is my code.
from pyspark import SparkFiles
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
spark.sparkContext.addFile(url)
df1 = spark.read.text("file://"+SparkFiles.get('8028d38a.tps'))
df1.show()
Here is the error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43) (10.139.64.4 executor 0): com.databricks.sql.io.FileReadException: Error while reading file file:/local_disk0/spark-95887d0f-a955-4075-86ac-520a51f0c64e/userFiles-9204e03a-a0fd-4999-9f40-9d9c3cc599a6/8028d38a.tps. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
I have referred reading data from URL using spark databricks platform as an example. Did anyone face the similar problem?
This is the best I've found, from the YouTube "pyspark for everyone" playlist:
!curl "https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps" >> 8028d38a.tps
As a workaround, we can read the URL into a pandas DataFrame and convert it into a PySpark DataFrame for further processing.
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url))
display(df)
If you want to skip the first row (if it is an invalid one), pandas' read_csv accepts a skiprows argument; see the sketch below.
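A short sketch of that workaround, assuming the first row really should be dropped (skiprows is a standard pandas read_csv argument; the separator may need adjusting for this particular file):

import pandas as pd

url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'

# pandas fetches the URL on the driver, so no executor-side file access is needed
pdf = pd.read_csv(url, skiprows=1)

# convert to a PySpark DataFrame for further processing
df = spark.createDataFrame(pdf)
display(df)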

pyspark dataframe error due to java.lang.ClassNotFoundException: org.postgresql.Driver

I want to read data from PostgreSQL using JDBC and store it in a pyspark dataframe. When I preview the data in the dataframe with methods like df.show() or df.take(), they return an error: Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver. But df.printSchema() returns the info of the DB table perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("spark://spark-master:7077")
.appName("read-postgres-jdbc")
.config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
.config("spark.executor.memory", "1g")
.getOrCreate()
)
sc = spark.sparkContext
df = (
spark.read.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://postgres/postgres")
.option("table", 'public."ASSET_DATA"')
.option("dbtable", _select_sql)
.option("user", "airflow")
.option("password", "airflow")
.load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed on JupyterLab in a separated docker container from the Standalone Spark cluster.
You are not using the proper option.
When reading the doc, you see this:
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver only. That is why acquiring the schema works: it is an action done on the driver side. But when you run a Spark command, that command is executed by the workers (or executors), and they also need the .jar to access Postgres.
If your Postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the workers using spark.jars - I know MySQL does not require dependencies, for example, but I have never tried Postgres. If it needs dependencies, then it is better to pull the package directly from Maven using the spark.jars.packages option (see the linked doc for help); a sketch of that follows below.
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"
So that the jar is included for your executors as well.
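Alternatively, a hedged sketch that lets Spark resolve the PostgreSQL driver from Maven for both the driver and the executors (the coordinates below are assumed to match the 42.2.18 jar used above):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("read-postgres-jdbc")
    # pulls the jar from Maven and ships it to the driver and all executors
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.18")
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)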

Issue when connecting from Spark 2.2.0 to MongoDB

My problem is about connecting from Apache Spark to MongoDB using the official connector.
Stack versions are as follows:
Apache Spark 2.2.0 (HDP build: 2.2.0.2.6.3.0-235)
MongoDB 3.4.10 (2x-node replica set with authentification)
I use these jars:
mongo-spark-connector-assembly-2.2.0.jar, which I tried both to download from the Maven repo and to build myself with the proper Mongo driver version
mongo-java-driver.jar, downloaded from the Maven repo
The issue is about version correspondence, as mentioned here and here.
They all say that the method was renamed in Spark 2.2.0, so I need to use the connector built for version 2.2.0 - and indeed it was: here is the method in Spark connector 2.1.1, and here is the renamed one in 2.2.0.
But I am sure that I use the proper one. I did these steps:
git clone https://github.com/mongodb/mongo-spark.git
cd mongo-spark
git checkout tags/2.2.0
sbt check
sbt assembly
scp target/scala-2.11/mongo-spark-connector_2.11-2.2.0.jar user@remote-spark-server:/opt/jars
All tests were OK. After that I used pyspark and Zeppelin (so deploy-mode is client) to read some data from MongoDB:
df = sqlc.read.format("com.mongodb.spark.sql.DefaultSource") \
.option('spark.mongodb.input.uri', 'mongodb://user:password@172.22.100.231:27017,172.22.100.234:27017/dmp?authMechanism=SCRAM-SHA-1&authSource=admin&replicaSet=rs0&connectTimeoutMS=300&readPreference=nearest') \
.option('collection', 'verification') \
.option('sampleSize', '10') \
.load()
df.show()
And got this error:
Py4JJavaError: An error occurred while calling o86.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 3, worker01, executor 1): java.lang.NoSuchMethodError:org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonTypeOfTwo()Lscala/Function2;
And I am sure this method is not in the jar: TypeCoercion$.findTightestCommonTypeOfTwo()
I go to the SparkUI and look at the environment:
spark.repl.local.jars file:file:/opt/jars/mongo-spark-connector-assembly-2.2.0.jar,file:/opt/jars/mongo-java-driver-3.6.1.jar
And there are no other MongoDB-related files anywhere.
Please help, what am I doing wrong? Thanks in advance.
It was the filecache... this advice helped: https://community.hortonworks.com/content/supportkb/150578/how-to-clear-local-file-cache-and-user-cache-for-y-1.html
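For reference, a sketch of avoiding locally copied assembly jars altogether by resolving the connector from Maven, which sidesteps stale copies lingering in the filecache (the Maven coordinates are assumed; the hosts and credentials are the question's placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-read")
    # assumed coordinates for the Spark 2.2.0 / Scala 2.11 connector
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
    .getOrCreate()
)

df = (
    spark.read.format("com.mongodb.spark.sql.DefaultSource")
    .option("spark.mongodb.input.uri",
            "mongodb://user:password@172.22.100.231:27017,172.22.100.234:27017/dmp?authSource=admin&replicaSet=rs0")
    .option("collection", "verification")
    .load()
)
df.show()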

Connecting to AWS Redshift with Zeppelin Spark 2.0 and Pyspark

I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.
I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:
%pyspark
from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()
and it worked. That was summer of 2016.
I haven't had need of it since then and now AWS has Spark 2.0.
The new syntax is myDF = spark.read.jdbc(...), like this:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", dbtable=aquery).load()
dfMinDates.show()
but I get this error:
Py4JJavaError: An error occurred while calling o119.jdbc.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o119.jdbc.\n', JavaObject id=o121), <traceback object>)
I researched the Spark 2.0 documentation, and found this:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
I don't know how to implement this, so I did more reading from various posts, some blogs, and some Stack Overflow posts, and found this:
spark.driver.extraClassPath = org.postgresql.Driver
I did this in the Interpreter settings page of Zeppelin, but still I get the same error.
I tried to add a Postgres Interpreter, and I'm not sure I did it right (because I wasn't sure whether to put it in the Spark interpreter or Python interpreter), and I chose the Spark interpreter. Now the Postgres interpreter also has all the same settings as the Spark interpreter, which might not matter, but still I get the same error.
In Spark 1.6, I just don't remember going through all this trouble.
As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!
The Zeppelin site has Postgres covered but their information looks like code rather than how to set up the interpreters, so I don't know how to use it.
I'm out of ideas and references.
Any suggestions are much appreciated!
You need to use Amazon's Redshift specific driver. You can download it from here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.
However, if you're using EMR it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
Here's how to declare it: AWS Redshift driver in Zeppelin
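For illustration, a hedged sketch of the read once that jar is on the interpreter's classpath (the driver class name is the one shipped in RedshiftJDBC41.jar; the host, database, and credentials are the question's placeholders):

%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time FROM schema.table GROUP BY serial_number) AS minDates"

dfMinDates = (
    spark.read.format("jdbc")
    .option("driver", "com.amazon.redshift.jdbc41.Driver")
    .option("url", "jdbc:redshift://dadadadaaaredshift.amazonaws.com:5439/idw")
    .option("user", "user")
    .option("password", "password")
    .option("dbtable", aquery)
    .load()
)
dfMinDates.show()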

apache zeppelin fails on reading csv using pyspark

I'm using Zeppelin-Sandbox 0.5.6 with Spark 1.6.1 on Amazon EMR.
I am reading csv file located on s3.
The problem is that sometimes I get an error reading the file. I need to restart the interpreter several times until it works. Nothing in my code changes; I can't reproduce it deliberately, and can't tell when it's happening.
My code goes as following:
defining dependencies:
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.4.0")
using spark-csv:
%pyspark
import pyspark.sql.functions as func
df = sqlc.read.format("com.databricks.spark.csv").option("header", "true").load("s3://some_location/some_csv.csv")
error msg:
Py4JJavaError: An error occurred while calling o61.load. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, ip-172-22-2-187.ec2.internal):
java.io.InvalidClassException: com.databricks.spark.csv.CsvRelation;
local class incompatible: stream classdesc serialVersionUID =
2004612352657595167, local class serialVersionUID =
6879416841002809418
...
Caused by: java.io.InvalidClassException:
com.databricks.spark.csv.CsvRelation; local class incompatible
Once I'm reading the csv into the dataframe, the rest of the code works fine.
Any advice?
Thanks!
You need to launch Spark with the spark-csv package added, like this:
$ pyspark --packages com.databricks:spark-csv_2.10:1.2.0
Now spark-csv will be on your classpath.
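A short sketch, assuming the launch above (or the %dep block) pins a single spark-csv version so the driver and executors load the same class (mixing versions is exactly the kind of mismatch that produces the serialVersionUID error shown in the question):

df = (
    sqlc.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .load("s3://some_location/some_csv.csv")
)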