Issues in passing application configuration parameters to spark application - scala

I created an object using Spark/Scala to load data from an Oracle source into a Hive table. The database password is passed through an application.conf file read via Typesafe's ConfigFactory.
I placed application.conf in my user folder in one attempt, and on the classpath in another attempt, using the spark-submit below.
On every attempt I encounter the error "java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':",
which suggests the properties never reached the ConfigFactory methods.
Can someone help me with what I am missing?
// my object snippet
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object LoadFromOracleToHive {
  def SaveToHive(spark: SparkSession): Unit = {
    try {
      val appConf = ConfigFactory.load("application.conf").getConfig("my.config")
      val sparkConfig = appConf.getConfig("spark")
      val df = spark
        .read
        .format("jdbc")
        .options(Map(
          "password" -> sparkConfig.getString("password"),
          "driver" -> "oracle.jdbc.driver.OracleDriver"))
// my application.conf
my.config {
  spark {
    password = "password"
  }
}
//my spark-submit
spark-submit --class LoadFromOracleToHive \
--master yarn \
--deploy-mode client \
--driver-memory 4g \
--executor-memory 8g \
--num-executors 15 \
--executor-cores 5 \
--conf spark.kryoserializer.buffer.max=512m \
--queue csg \
--jars /home/myuserfolder/ojdbc7.jar /home/myuserfolder/SandeepTest-1.0-SNAPSHOT-jar-with-dependencies.jar \
--queue /home/myuserfolder/application.conf \
--conf spark.driver.extraClassPath=-Dconfig.file=/home/myuserfolder/application.conf \
--conf spark.executor.extraClassPath=-Dconfig.file=/home/myuserfolder/application.conf
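For comparison, a commonly used way to ship a Typesafe config file with spark-submit (a sketch only; paths are illustrative) is to distribute it with --files and point the driver and executor JVMs at it with -Dconfig.file passed as Java options, since extraClassPath expects classpath entries rather than -D flags, and to keep the application jar as the last argument:
spark-submit --class LoadFromOracleToHive \
  --master yarn \
  --deploy-mode client \
  --files /home/myuserfolder/application.conf \
  --driver-java-options "-Dconfig.file=/home/myuserfolder/application.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
  --jars /home/myuserfolder/ojdbc7.jar \
  /home/myuserfolder/SandeepTest-1.0-SNAPSHOT-jar-with-dependencies.jar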

Related

Unable to run PySpark (Kafka to Delta) in local and getting SparkException: Cannot find catalog plugin class for catalog 'spark_catalog'

I'm trying to write PySpark code to read from a Kafka topic and publish to a Delta table, but it is not working.
Sample Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *

spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "demo.topic") \
    .option("startingOffsets", "latest") \
    .load() \
    .withColumn("current_timestamp", unix_timestamp()) \
    .withColumn("value_str", col("value").cast(StringType())) \
    .select("current_timestamp", "value_str")

stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")

stream.awaitTermination()
Spark version: 3.3.1
Command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 kafka_to_delta.py
Console:
Traceback (most recent call last):
File "/Users/user/Desktop/python-module/kafka_to_delta.py", line 24, in <module>
stream = kafka_df.writeStream \
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 1468, in toTable
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.toTable.
: org.apache.spark.SparkException: Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.spark.sql.delta.catalog.DeltaCatalog
at org.apache.spark.sql.errors.QueryExecutionErrors$.catalogPluginClassNotFoundForCatalogError(QueryExecutionErrors.scala:1638)
at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:65)
at org.apache.spark.sql.connector.catalog.CatalogManager.loadV2SessionCatalog(CatalogManager.scala:67)
at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$2(CatalogManager.scala:86)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$1(CatalogManager.scala:86)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.connector.catalog.CatalogManager.v2SessionCatalog(CatalogManager.scala:85)
at org.apache.spark.sql.connector.catalog.CatalogManager.catalog(CatalogManager.scala:51)
at org.apache.spark.sql.connector.catalog.CatalogManager.currentCatalog(CatalogManager.scala:122)
at org.apache.spark.sql.connector.catalog.LookupCatalog.currentCatalog(LookupCatalog.scala:34)
at org.apache.spark.sql.connector.catalog.LookupCatalog.currentCatalog$(LookupCatalog.scala:34)
at org.apache.spark.sql.catalyst.analysis.Analyzer.currentCatalog(Analyzer.scala:187)
at org.apache.spark.sql.connector.catalog.LookupCatalog$CatalogAndIdentifier$.unapply(LookupCatalog.scala:111)
at
:
:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.delta.catalog.DeltaCatalog
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:55)
... 25 more
23/01/29 18:00:53 INFO SparkContext: Invoking stop() from shutdown hook
Do I need to specify the catalog and schema before running this code? And what is the best practice for doing this?
Try adding io.delta:delta-core_2.12:2.1.0 to the list of packages passed via --packages (you also need to make sure that you use a matching version of spark-sql-kafka):
spark-submit --packages \
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 \
kafka_to_delta.py
Or remove --packages and specify the Kafka dependency inside the script, although it's generally better not to hard-code specific versions in the code and instead provide all such options from outside it.
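If you do take the in-script route, a minimal sketch (assuming Spark 3.3.1 and Scala 2.12 builds of both packages) is to set spark.jars.packages on the builder before the session is created:
from pyspark.sql import SparkSession

# The Kafka source and Delta Lake are resolved via spark.jars.packages,
# so no --packages flag is needed on spark-submit.
spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()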

Spark GraphX pregel the number of iterations is greater than 3, resulting in FULL GC

When the number of Pregel iterations is greater than 3, the job ends up in full GC.
The following is the specific implementation code (Spark 2.4.5):
def vprog(vid: VertexId, vdata: Set[(Int, ArrayBuffer[VertexId])], message: Set[(Int, ArrayBuffer[Long])]): Set[(Int, ArrayBuffer[VertexId])] = {
  message.filter(_._1 > 0) ++ vdata
}

def addMaps(spmap1: Set[(Int, ArrayBuffer[Long])], spmap2: Set[(Int, ArrayBuffer[Long])]): Set[(Int, ArrayBuffer[Long])] = {
  spmap1 ++ spmap2
}

def sendMsg(e: EdgeTriplet[Set[(Int, ArrayBuffer[Long])], _]): Iterator[(VertexId, Set[(Int, ArrayBuffer[VertexId])])] = {
  if (e.srcAttr.isEmpty) {
    Iterator.empty
  } else {
    val newAttr = e.srcAttr
      .filter(!_._2.contains(e.dstId))
      .map(lineData => (lineData._2.length, lineData._2 :+ e.dstId))
    Iterator((e.dstId, newAttr))
  }
}

val graph = atomicGraph.mapVertices((vid, value) => Set((0, ArrayBuffer(vid))))
atomicGraph.unpersist(true)

val initializeMessage: Set[(Int, ArrayBuffer[Long])] = Set((0, ArrayBuffer()))
val resultGraph = graph.pregel(initializeMessage, totalDegree)(vprog, sendMsg, addMaps)
println(resultGraph.vertices.collect().mkString("\n"))
The program spends a lot of time in GC.
Number of vertices: 183
Number of edges: 2000+
Spark-submit:
spark-submit \
--master yarn \
--num-executors 5 \
--deploy-mode cluster \
--driver-memory 20g \
--executor-memory 20g \
--executor-cores 10 \
--driver-cores 10 \
--conf spark.driver.memoryOverhead=20g \
--conf spark.executor.memoryOverhead=30g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \

Hive table in Spark

I am running the following job in HDP.
export SPARK_MAJOR_VERSION=2
spark-submit --class com.spark.sparkexamples.Audit --master yarn --deploy-mode cluster \
--files /bigdata/datalake/app/config/metadata.csv BRNSAUDIT_v4.jar dl_raw.ACC /bigdatahdfs/landing/AUDIT/BW/2017/02/27/ACC_hash_total_and_count_20170227.dat TH 20170227
It's failing with the error:
Table or view not found: `dl_raw`.`ACC`; line 1 pos 94;
'Aggregate [count(1) AS rec_cnt#58L, 'count('BRCH_NUM) AS hashcount#59, 'sum('ACC_NUM) AS hashsum#60]
+- 'Filter (('trim('country_code) = trim(TH)) && ('from_unixtime('unix_timestamp('substr('bus_date, 0, 11), MM/dd/yyyy), yyyyMMdd) = 20170227))
   +- 'UnresolvedRelation `dl_raw`.`ACC`
The table is present in Hive, however, and it is accessible from spark-shell.
Update:
val sparkSession = SparkSession.builder
  .appName("spark session example")
  .enableHiveSupport()
  .getOrCreate()

sparkSession.conf.set("spark.sql.crossJoin.enabled", "true")

val df_table_stats = sparkSession.sql("""select count(*) as rec_cnt, count(distinct BRCH_NUM) as hashcount,
  sum(ACC_NUM) as hashsum
  from dl_raw.ACC
  where trim(country_code) = trim('BW')
  and from_unixtime(unix_timestamp(substr(bus_date,0,11),'MM/dd/yyyy'),'yyyyMMdd')='20170227'""")
Include the hive-site.xml file in the --files parameter when submitting the job.
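For example, a sketch of the same submit command with hive-site.xml shipped alongside the existing file (the /etc/hive/conf/hive-site.xml path is an assumption; use your cluster's Hive client configuration):
spark-submit --class com.spark.sparkexamples.Audit --master yarn --deploy-mode cluster \
  --files /bigdata/datalake/app/config/metadata.csv,/etc/hive/conf/hive-site.xml \
  BRNSAUDIT_v4.jar dl_raw.ACC /bigdatahdfs/landing/AUDIT/BW/2017/02/27/ACC_hash_total_and_count_20170227.dat TH 20170227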

kafka to pyspark structured streaming, parsing json as dataframe

I am experimenting with spark structured streaming (spark v2.2.0) to consume json data from kafka. However I encountered the following error.
pyspark.sql.utils.StreamingQueryException: 'Missing required configuration "partition.assignment.strategy" which has no default value.
Does anyone know why? The job was submitted using the spark-submit command below.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 sparksstream.py
This is the entire python script.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()

# Define schema of json
schema = StructType() \
    .add("Session-Id", StringType()) \
    .add("TransactionTimestamp", IntegerType()) \
    .add("User-Name", StringType()) \
    .add("ID", StringType()) \
    .add("Timestamp", IntegerType())

# load data into spark-structured streaming
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxxx:9092") \
    .option("subscribe", "topicName") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

# Print output
query = df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
use this instead to submit:
spark-submit \
--conf "spark.driver.extraClassPath=$SPARK_HOME/jars/kafka-clients-1.1.0.jar" \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
sparksstream.py
This assumes that you have downloaded the kafka-clients*.jar into your $SPARK_HOME/jars folder.

Using pyspark to connect to PostgreSQL

I am trying to connect to a database with pyspark and I am using the following code:
sqlctx = SQLContext(sc)
df = sqlctx.load(
    url="jdbc:postgresql://[hostname]/[database]",
    dbtable="(SELECT * FROM talent LIMIT 1000) as blah",
    password="MichaelJordan",
    user="ScottyPippen",
    source="jdbc",
    driver="org.postgresql.Driver"
)
and I am getting the following error:
Any idea why this is happening?
Edit: I am trying to run the code locally on my computer.
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download/
Then replace the database configuration values with yours.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()
More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the python script load the tables as a DataFrame as follows:
from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.
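A minimal sketch of doing that at the top of the script (the app name is illustrative):
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the contexts explicitly when the script runs under spark-submit.
sc = SparkContext(appName="postgres-jdbc-example")
sqlContext = SQLContext(sc)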
It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to the path /opt/spark-2.2.0-bin-hadoop2.7/jars.
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),
and it works fine in the pyspark console and in Jupyter.
You normally need either:
to install the Postgres Driver on your cluster,
to provide the Postgres driver jar from your client with the --jars option
or to provide the maven coordinates of the Postgres driver with --packages option.
If you detail how you are launching pyspark, we may be able to give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell
One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
  --packages org.postgresql:postgresql:9.4.1211 \
  --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
  --master local[4] main.py
And in main.py:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
    database='my_db',
    dbtable='my_table'
).load()
dataframe.show()
To use pyspark with a Jupyter notebook, first open pyspark with:
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then, in the Jupyter notebook:
import os
from pyspark.sql import SparkSession

jardrv = "~/spark_drivers/postgresql-42.2.12.jar"
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
I had trouble getting a connection to the Postgres DB with the jars I had on my computer.
This code solved my problem with the driver:
from pyspark.sql import SparkSession
import os

sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", sparkClassPath) \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()
I also got this error:
java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(Unknown Source)
Adding one item, .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar'), to the SparkSession worked for me.
E.g.:
from pyspark import SparkContext, SparkConf
import os
from pyspark.sql.session import SparkSession

spark = SparkSession \
    .builder \
    .appName('Python Spark Postgresql') \
    .config("spark.jars", "./postgresql-42.2.18.jar") \
    .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/abc") \
    .option("dbtable", 'tablename') \
    .option("user", "postgres") \
    .option("password", "1") \
    .load()

df.printSchema()
This exception means the JDBC driver is not on the driver classpath.
You can pass the JDBC jar to spark-submit with the --jars parameter, and also add it to the driver classpath using spark.driver.extraClassPath.
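For instance, a minimal sketch combining both (the jar path and script name are placeholders):
spark-submit \
  --jars /path/to/postgresql-42.2.5.jar \
  --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
  your_script.py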
Download the postgresql jar from here:
Add it to the $SPARK_HOME/jars/ folder.
Restart your kernel.
It should work.
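For example, with a local Spark installation (the jar name and download location are illustrative):
cp ~/Downloads/postgresql-42.2.18.jar "$SPARK_HOME"/jars/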
Just initialize pyspark with --jars <path/to/your/jdbc.jar>
E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar
Then create a dataframe as suggested in the other answers above.
E.g.:
df2 = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/db") \
    .option("dbtable", "yourTableHere") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()
Download postgres JDBC driver from https://jdbc.postgresql.org/download.html
and use the script below.
Changes to make:
Edit PATH_TO_JAR_FILE
Save your DB credentials in an environment file and load them
Query the DB using the query option and set the JDBC fetch size with the fetchsize option
import os
from pyspark.sql import SparkSession

PATH_TO_JAR_FILE = "/home/user/Downloads/postgresql-42.3.3.jar"

spark = SparkSession \
    .builder \
    .appName("Example") \
    .config("spark.jars", PATH_TO_JAR_FILE) \
    .getOrCreate()

DB_HOST = os.environ.get("PG_HOST")
DB_PORT = os.environ.get("PG_PORT")
DB_NAME = os.environ.get("PG_DB_CLEAN")
DB_PASSWORD = os.environ.get("PG_PASSWORD")
DB_USER = os.environ.get("PG_USERNAME")

df = spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}") \
    .option("user", DB_USER) \
    .option("password", DB_PASSWORD) \
    .option("driver", "org.postgresql.Driver") \
    .option("query", "select * from your_table") \
    .option('fetchsize', "1000") \
    .load()

df.printSchema()