Unable to run PySpark (Kafka to Delta) in local and getting SparkException: Cannot find catalog plugin class for catalog 'spark_catalog'

Unable to run PySpark (Kafka to Delta) in local and getting SparkException: Cannot find catalog plugin class for catalog 'spark_catalog' - pyspark

I'm trying to write a PySpark code to read from the Kafka topic and publish to the Delta table. But it is not working.
Sample Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *
spark = SparkSession \
.builder \
.appName("test") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "demo.topic") \
.option("startingOffsets", "latest") \
.load() \
.withColumn("current_timestamp", unix_timestamp()) \
.withColumn("value_str", col("value").cast(StringType())) \
.select("current_timestamp", "value_str")
stream = kafka_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
.toTable("events")
stream.awaitTermination()
Command:
Spark version: 3.3.1
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 kafka_to_delta.py
Console:
Traceback (most recent call last):
File "/Users/user/Desktop/python-module/kafka_to_delta.py", line 24, in <module>
stream = kafka_df.writeStream \
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 1468, in toTable
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.toTable.
: org.apache.spark.SparkException: Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.spark.sql.delta.catalog.DeltaCatalog
at org.apache.spark.sql.errors.QueryExecutionErrors$.catalogPluginClassNotFoundForCatalogError(QueryExecutionErrors.scala:1638)
at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:65)
at org.apache.spark.sql.connector.catalog.CatalogManager.loadV2SessionCatalog(CatalogManager.scala:67)
at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$2(CatalogManager.scala:86)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$1(CatalogManager.scala:86)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.connector.catalog.CatalogManager.v2SessionCatalog(CatalogManager.scala:85)
at org.apache.spark.sql.connector.catalog.CatalogManager.catalog(CatalogManager.scala:51)
at org.apache.spark.sql.connector.catalog.CatalogManager.currentCatalog(CatalogManager.scala:122)
at org.apache.spark.sql.connector.catalog.LookupCatalog.currentCatalog(LookupCatalog.scala:34)
at org.apache.spark.sql.connector.catalog.LookupCatalog.currentCatalog$(LookupCatalog.scala:34)
at org.apache.spark.sql.catalyst.analysis.Analyzer.currentCatalog(Analyzer.scala:187)
at org.apache.spark.sql.connector.catalog.LookupCatalog$CatalogAndIdentifier$.unapply(LookupCatalog.scala:111)
at
:
:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.delta.catalog.DeltaCatalog
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:55)
... 25 more
23/01/29 18:00:53 INFO SparkContext: Invoking stop() from shutdown hook
Do I need to specify catalog and schema, before running this code? And what is the best practice for doing this?

Try to add io.delta:delta-core_2.12:2.1.0 to the list of packages passed via --packages (you also need to make sure that you use matching version of the spark-sql-kafka):
spark-submit --packages \
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 \
kafka_to_delta.py
or remove --packages and specify Kafka dependency inside the script, although it's always better not to have specific versions inside the code, and instead provide all options outside of it.

Related

Connnect to a postgres database on the localhost with pyspark

I am trying to connect to a postgres database with pyspark. My code looks as follows:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "C:/Users/.../Documents/HWH/projecten/spark/postgresql-42.5.0.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/postgres") \
.option("dbtable", "chicago_crime") \
.option("user", "postgres") \
.option("password", "postgres") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()
I keep getting the following error:
Py4JJavaError: An error occurred while calling o128.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
at ...
It says the class is not found. Does this mean the path to the jar is not correctly specified?
Anything would help!

pyspark kafka streaming data handler

I'm using spark 2.3.2 with pyspark and just figured out that foreach and foreachBatch are not available in 'DataStreamWriter' object in this configuration. The problem is the company Hadoop is 2.6 and spark 2.4(that provides what I need) doesn't work(SparkSession is crashing). There is some another alternative to send data to a custom handler and process streaming data?
This is my code until now:
def streamLoad(self,customHandler):
options = self.options
self.logger.info("Recuperando o schema baseado na estrutura do JSON")
jsonStrings = ['{"sku":"9","ean":"4","name":"DVD","description":"foo description","categories":[{"code":"M02_BLURAY_E_DVD_PLAYER"}],"attributes":[{"name":"attrTeste","value":"Teste"}]}']
myRDD = self.spark.sparkContext.parallelize(jsonStrings)
jsonSchema = self.spark.read.json(myRDD).schema # Maybe there is a way to serialize this
self.logger.info("Iniciando o streaming no Kafka[opções: {}]".format(str(options)))
df = self.spark \
.readStream \
.format("kafka") \
.option("maxFilesPerTrigger", 1) \
.option("kafka.bootstrap.servers", options["kafka.bootstrap.servers"]) \
.option("startingOffsets", options["startingOffsets"]) \
.option("subscribe", options["subscribe"]) \
.option("failOnDataLoss", options["failOnDataLoss"]) \
.load() \
.select(
col('value').cast("string").alias('json'),
col('key').cast("string").alias('kafka_key'),
col("timestamp").cast("string").alias('kafka_timestamp')
) \
.withColumn('pjson', from_json(col('json'), jsonSchema)).drop('json')
query = df \
.writeStream \
.foreach(customHandler) \ #This doesn't work in spark 2.3.x Alternatives, please?
.start()
query.awaitTermination()

Failed to find data source: com.mongodb.spark.sql.DefaultSource

I'm trying to connect spark (pyspark) to mongodb as follows:
conf = SparkConf()
conf.set('spark.mongodb.input.uri', default_mongo_uri)
conf.set('spark.mongodb.output.uri', default_mongo_uri)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession \
.builder \
.appName("my-app") \
.config("spark.mongodb.input.uri", default_mongo_uri) \
.config("spark.mongodb.output.uri", default_mongo_uri) \
.getOrCreate()
But when I do the following:
users = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("uri", '{uri}.{col}'.format(uri=mongo_uri, col='users')).load()
I get this error:
java.lang.ClassNotFoundException: Failed to find data source:
com.mongodb.spark.sql.DefaultSource
I did the same thing from pyspark shell and I was able to retrieve data. This is the command I ran:
pyspark --conf "spark.mongodb.input.uri=mongodb_uri" --conf "spark.mongodb.output.uri=mongodburi" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.2
But here we have the option to specify the package we need to use. But what about standalone apps and scripts. how can I configure mongo-spark-connector there.
Any ideas?

Here how I did it in Jupyter notebook:
1. Download jars from central or any other repository and put them in directory called "jars":
mongo-spark-connector_2.11-2.4.0
mongo-java-driver-3.9.0
2. Create session and write/read any data
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
working_directory = 'jars/*'
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection") \
.config("spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection") \
.config('spark.driver.extraClassPath', working_directory) \
.getOrCreate()
people = my_spark.createDataFrame([("JULIA", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 22)], ["name", "age"])
people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.select('*').where(col("name") == "JULIA").show()
As a result you will see this:

If you are using SparkContext & SparkSession, you have mentioned the connector jar packages in SparkConf, check the following Code:
from pyspark import SparkContext,SparkConf
conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
If you are using only SparkSession then use following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.2') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

If you're using the newest version of mongo-spark-connector, i.e. v10.0.1 at the time of writing this, you need to use SparkConf object, as stated by the mongo documentation (https://www.mongodb.com/docs/spark-connector/current/configuration/).
Besides, you don't need to manually download anything, it will do it for you.
Bellow is the solution I came up with, for :
mongo-spark-connector: 10.0.1
mongo server : 5.0.8
spark : 3.2.0
def init_spark():
password = os.environ["MONGODB_PASSWORD"]
user = os.environ["MONGODB_USER"]
host = os.environ["MONGODB_HOST"]
db_auth = os.environ["MONGODB_DB_AUTH"]
mongo_conn = f"mongodb://{user}:{password}#{host}:27017/{db_auth}"
conf = SparkConf()
# Download mongo-spark-connector and its dependencies.
# This will download all the necessary jars and put them in your $HOME/.ivy2/jars, no need to manually download them :
conf.set("spark.jars.packages",
"org.mongodb.spark:mongo-spark-connector:10.0.1")
# Set up read connection :
conf.set("spark.mongodb.read.connection.uri", mongo_conn)
conf.set("spark.mongodb.read.database", "<my-read-database>")
conf.set("spark.mongodb.read.collection", "<my-read-collection>")
# Set up write connection
conf.set("spark.mongodb.write.connection.uri", mongo_conn)
conf.set("spark.mongodb.write.database", "<my-write-database>")
conf.set("spark.mongodb.write.collection", "<my-write-collection>")
# If you need to update instead of inserting :
conf.set("spark.mongodb.write.operationType", "update")
SparkContext(conf=conf)
return SparkSession \
.builder \
.appName('<my-app-name>') \
.getOrCreate()
spark = init_spark()
df = spark.read.format("mongodb").load()
df_grouped = df.groupBy("<some-column>").agg(mean("<some-other-column>"))
df_grouped.write.format("mongodb").mode("append").save()

I was also facing same error "java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource" while trying to connect to MongoDB from Spark (2.3).
I had to download and copy mongo-spark-connector_2.11 JAR file(s) into jars directory of spark installation.
That resolved my issue and I was successfully able to call my spark code via spark-submit.
Hope it helps.

Here is how this error got resolved by downloading the jar files below. (Used the solution of this question.)
1.Downloaded the jar files below.
mongo-spark-connector_2.11-2.4.1 from here
mongo-java-driver-3.9.0 from here
copy and paste both these jar files into 'jars' location in spark directory.
Pyspark Code in jupiter notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mongo").\
config("spark.mongodb.input.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
config("spark.mongodb.output.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
getOrCreate()
df=spark.read.format('com.mongodb.spark.sql.DefaultSource')\
.option( "uri", "mongodb://127.0.0.1:27017/$database.$table_name") \
.load()
df.printSchema()
#create Temp view of df to view the data
table = df.createOrReplaceTempView("df")
#to read table present in mongodb
query1 = spark.sql("SELECT * FROM df ")
query1.show(10)

You are not using sc to create the SparkSession. Maybe this code can help you:
conf.set('spark.mongodb.input.uri', mongodb_input_uri)
conf.set('spark.mongodb.input.collection', 'collection_name')
conf.set('spark.mongodb.output.uri', mongodb_output_uri)
sc = SparkContext(conf=conf)
spark = SparkSession(sc) # Using the context (conf) to create the session

kafka to pyspark structured streaming, parsing json as dataframe

I am experimenting with spark structured streaming (spark v2.2.0) to consume json data from kafka. However I encountered the following error.
pyspark.sql.utils.StreamingQueryException: 'Missing required
configuration "partition.assignment.strategy" which has no default
value.
Does anyone know why? The job was submitted using spark-submit below.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 sparksstream.py
This is the entire python script.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("test") \
.getOrCreate()
# Define schema of json
schema = StructType() \
.add("Session-Id", StringType()) \
.add("TransactionTimestamp", IntegerType()) \
.add("User-Name", StringType()) \
.add("ID", StringType()) \
.add("Timestamp", IntegerType())
# load data into spark-structured streaming
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "xxxx:9092") \
.option("subscribe", "topicName") \
.load() \
.select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
# Print output
query = df.writeStream \
.outputMode("append") \
.format("console") \
.start()

use this instead to submit:
spark-submit \
--conf "spark.driver.extraClassPath=$SPARK_HOME/jars/kafka-clients-1.1.0.jar" \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
sparksstream.py
Assuming that you have donwloaded the kafka-clients*jar in you $SPARK_HOME/jars folder

Using pyspark to connect to PostgreSQL

I am trying to connect to a database with pyspark and I am using the following code:
sqlctx = SQLContext(sc)
df = sqlctx.load(
url = "jdbc:postgresql://[hostname]/[database]",
dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
password = "MichaelJordan",
user = "ScottyPippen",
source = "jdbc",
driver = "org.postgresql.Driver"
)
and I am getting the following error:
Any idea why is this happening?
Edit: I am trying to run the code locally in my computer.

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download/
Then replace the database configuration values by yours.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/databasename") \
.option("dbtable", "tablename") \
.option("user", "username") \
.option("password", "password") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()
More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the python script load the tables as a DataFrame as follows:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc').\
options(url='jdbc:%s' % url, dbtable='tablename').\
load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.

It is necesary copy postgresql-42.1.4.jar in all nodes... for my case, I did copy in the path /opt/spark-2.2.0-bin-hadoop2.7/jars
Also, i set classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars" )
and work fine in pyspark console and jupyter

You normally need either:
to install the Postgres Driver on your cluster,
to provide the Postgres driver jar from your client with the --jars option
or to provide the maven coordinates of the Postgres driver with --packages option.
If you detail how are you launching pyspark, we may give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell

One approach, building on the example per the quick start guide, is this blog post which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
--packages org.postgresql:postgresql:9.4.1211\
--driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
--master local[4] main.py
And in main.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
database='my_db',
dbtable='my_table'
).load()
dataframe.show()

To use pyspark and jupyter notebook notebook: first open pyspark with
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then in jupyter notebook
import os
jardrv = "~/spark_drivers/postgresql-42.2.12.jar"
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

I had trouble to get a connection to the postgresDB with the jars i had on my computer.
This code solved my problem with the driver
from pyspark.sql import SparkSession
import os
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
spark = SparkSession \
.builder \
.config("spark.driver.extraClassPath", sparkClassPath) \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "yourtablename") \
.option("user", "postgres") \
.option("password", "***") \
.load()
df.show()

I also get this error
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(Unknown Source)
and add one item .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') in SparkSession - that worked.
eg:
from pyspark import SparkContext, SparkConf
import os
from pyspark.sql.session import SparkSession
spark = SparkSession \
.builder \
.appName('Python Spark Postgresql') \
.config("spark.jars", "./postgresql-42.2.18.jar") \
.config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/abc") \
.option("dbtable", 'tablename') \
.option("user", "postgres") \
.option("password", "1") \
.load()
df.printSchema()

This exception means jdbc driver does not in driver classpath.
you can spark-submit jdbc jars with --jar parameter, also add it into driver classpath using spark.driver.extraClassPath.

Download postgresql jar from here:
Add this to ~Spark/jars/ folder.
Restart your kernel.
It should work.

Just initialize pyspark with --jars <path/to/your/jdbc.jar>
E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar
then create a dataframe as suggested above in other answers
E.g.:
df2 = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/db").option("dbtable", "yourTableHere").option("user", "postgres").option("password", "postgres").option("driver", "org.postgresql.Driver").load()

Download postgres JDBC driver from https://jdbc.postgresql.org/download.html
and use the script below.
Changes to make:
Edit PATH_TO_JAR_FILE
Save your DB credentials in an environment file and load them
Query the DB using query option and limit using fetch size
import os
from pyspark.sql import SparkSession
PATH_TO_JAR_FILE = "/home/user/Downloads/postgresql-42.3.3.jar"
spark = SparkSession \
.builder \
.appName("Example") \
.config("spark.jars", PATH_TO_JAR_FILE) \
.getOrCreate()
DB_HOST = os.environ.get("PG_HOST")
DB_PORT = os.environ.get("PG_PORT")
DB_NAME = os.environ.get("PG_DB_CLEAN")
DB_PASSWORD = os.environ.get("PG_PASSWORD")
DB_USER = os.environ.get("PG_USERNAME")
df = spark.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}") \
.option("user", DB_USER) \
.option("password", DB_PASSWORD) \
.option("driver", "org.postgresql.Driver") \
.option("query","select * from your_table") \
.option('fetchsize',"1000") \
.load()
df.printSchema()

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Unable to run PySpark (Kafka to Delta) in local and getting SparkException: Cannot find catalog plugin class for catalog 'spark_catalog' - pyspark

Related

Connnect to a postgres database on the localhost with pyspark

pyspark kafka streaming data handler

Failed to find data source: com.mongodb.spark.sql.DefaultSource

kafka to pyspark structured streaming, parsing json as dataframe

Using pyspark to connect to PostgreSQL

Categories

Resources