How to connect to an Oracle database in Spark using Scala

I have Spark 3.1.2 and Scala 2.12.8. I want to connect to an Oracle database, read a table, and show it, using this code:
import org.apache.spark.sql.SparkSession

object readTable extends App {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:#x.x.x.x:1521:orcldb")
    .option("dbtable", "orcl.brnc_grp")
    .option("user", "orcl")
    .option("password", "xxxxxxx")
    .load()

  jdbcDF.show()
}
When I run the code, I receive this error:
Exception in thread "main" java.sql.SQLException: No suitable driver
Would you please guide me on how to solve this problem?
Any help would be really appreciated.

Download the Oracle JDBC driver JAR from the Oracle website and place it in the $SPARK_HOME/jars folder.
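If copying the JAR into $SPARK_HOME/jars is not an option, here is a minimal sketch of the same idea that ships the driver with the job and names the driver class explicitly. The ojdbc8.jar path, the submit command, and the driver option are assumptions added for illustration, not part of the original answer:

// Sketch only: ship the Oracle JDBC driver at submit time instead of copying it into $SPARK_HOME/jars.
// For example (paths are placeholders):  spark-submit --jars /path/to/ojdbc8.jar --class readTable app.jar
import org.apache.spark.sql.SparkSession

object readTable extends App {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("Spark SQL basic example")
    .getOrCreate()

  val jdbcDF = spark.read
    .format("jdbc")
    // the Oracle thin URL is written with '@' before the host
    .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
    // naming the driver class also avoids "No suitable driver" once the JAR is on the classpath
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", "orcl.brnc_grp")
    .option("user", "orcl")
    .option("password", "xxxxxxx")
    .load()

  jdbcDF.show()
}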

Related

How can I fix an error in a Spark application?

I'm loading data from HDFS into ClickHouse with a Spark job and get the error DB::Exception: Too many parts (306). Merges are processed much slower than inserts. (TOO_MANY_PARTS) (version 22.3.44 (official build))
The data is in Parquet, about 34 GB in total.
The path to the Parquet files is "hdfs://host:8020/user/stat/year=2022/month=1/day=1/101.parq".
My settings are as follows:
val df = spark.read.parquet("hdfs://host:8020/user/stat/year=2022/")

df.write
  .format("jdbc")
  .mode("append")
  .option("driver", "cc.blynk.clickhouse.ClickHouseDriver")
  .option("url", "jdbc:clickhouse://host:8123/default")
  .option("user", "login")
  .option("password", "pass")
  .option("dbtable", "table")
  .save()
I'm new to Scala and Spark, so thanks for any advice.
Use async_insert:
https://clickhouse.com/docs/en/operations/settings/settings/#async-insert
.option("async_insert", "1")
Or use Engine=Buffer tables for older ClickHouse versions:
https://clickhouse.com/docs/en/engines/table-engines/special/buffer/
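As a sketch of how that setting might be wired into the write from the question: Spark forwards unrecognised JDBC options to the driver as connection properties, but whether the cc.blynk driver passes async_insert through as a ClickHouse setting depends on the driver version, so treat this as an assumption to verify:

// Sketch only: the same write as in the question, with async_insert forwarded to ClickHouse.
// batchsize is a standard Spark JDBC writer option; the async_insert pass-through is an assumption.
val df = spark.read.parquet("hdfs://host:8020/user/stat/year=2022/")

df.write
  .format("jdbc")
  .mode("append")
  .option("driver", "cc.blynk.clickhouse.ClickHouseDriver")
  .option("url", "jdbc:clickhouse://host:8123/default")
  .option("user", "login")
  .option("password", "pass")
  .option("dbtable", "table")
  .option("async_insert", "1")   // let the server buffer and merge small inserts
  .option("batchsize", "100000") // larger JDBC batches also mean fewer parts per insert
  .save()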

How to connect Snowflake with PySpark?

I am trying to connect to Snowflake with PySpark on my local machine.
My code is as follows:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf()
conf.set('spark.jars',
         '/path/to/driver/snowflake-jdbc-3.12.17.jar,'
         '/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar')

spark = SparkSession.builder \
    .master("local") \
    .appName("snowflake-test") \
    .config(conf=conf) \
    .getOrCreate()

sfOptions = {
    "sfURL": "https://someurl.com",
    "sfAccount": "account",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "database",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "warehouse"
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select * from DimDate") \
    .load()

df.show()
When I run this, I get the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o46.load.
How to fix this?
With the Snowflake Spark connector JAR "spark-snowflake_2.12:2.10.0-spark_3.2",
Snowflake JDBC 3.13.14 needs to be used. I see that you are using JDBC version 3.12.17.
Can you add JDBC version 3.13.14 and then test? As pointed out by FKyani, this is a compatibility issue between the Snowflake Spark connector JAR and the JDBC JAR.
Please confirm the correct JDBC version is imported.
Recommended client versions: https://docs.snowflake.com/en/release-notes/requirements.html#recommended-client-versions
This looks similar to the error mentioned in this article:
https://community.snowflake.com/s/article/Error-py4j-protocol-Py4JJavaError-An-error-occurred-while-calling-o74-load-java-lang-NoSUchMethodError-scala-Product-init-Lscala-Product-V
If you are using Scala 2.12, you need to downgrade it to 2.11. Please note that, in this case, you will also have to use the associated version of the Spark connector for Snowflake.
Spark connectors for Snowflake are listed in the Snowflake documentation. We recommend using the latest connector version that matches your Spark version and Scala 2.11.
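For reference, here is a sketch of the matched-version setup the first answer describes, written in Scala as in the main question of this thread (the option names are the same from PySpark). The JAR paths and connection values are placeholders carried over from the question:

// Sketch only: pair spark-snowflake_2.12-2.10.0-spark_3.2 with snowflake-jdbc-3.13.14.
import org.apache.spark.sql.SparkSession

object SnowflakeReadSketch extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("snowflake-test")
    .config("spark.jars",
      "/path/to/driver/snowflake-jdbc-3.13.14.jar," +
      "/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar")
    .getOrCreate()

  val sfOptions = Map(
    "sfURL" -> "https://someurl.com",
    "sfAccount" -> "account",
    "sfUser" -> "user",
    "sfPassword" -> "password",
    "sfDatabase" -> "database",
    "sfSchema" -> "PUBLIC",
    "sfWarehouse" -> "warehouse"
  )

  val df = spark.read
    .format("net.snowflake.spark.snowflake")
    .options(sfOptions)
    .option("query", "select * from DimDate")
    .load()

  df.show()
}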

PySpark, MongoDB and a missing BSON reference

I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and the MongoDB driver 3.2.2.
My code is:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
When I execute this in PySpark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to install the missing BSON reference?

How do spark-shell or Zeppelin notebooks set a HiveContext on the SparkSession?

Does anyone know why I can access an existing Hive table from spark-shell or a Zeppelin notebook by doing this:
val df = spark.sql("select * from hive_table")
But when I submit a Spark jar with a SparkSession created this way,
val spark = SparkSession
  .builder()
  .appName("Yet another spark app")
  .config("spark.sql.shuffle.partitions", 18)
  .config("spark.executor.memory", "2g")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
I get this:
Table or view not found
What I really want is to learn and understand what the shell and the notebooks are doing for us in order to provide a Hive context to the SparkSession.
When working with Hive, you must instantiate the SparkSession with Hive support: call enableHiveSupport() on the session builder.
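For example, a minimal sketch of the submitted job's builder with that call added (this is what spark-shell does for you by default when Hive classes are available; it assumes hive-site.xml, or equivalent metastore configuration, is visible to the job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Yet another spark app")
  .config("spark.sql.shuffle.partitions", 18)
  .config("spark.executor.memory", "2g")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()   // connects the session to the Hive metastore, like spark-shell does by default
  .getOrCreate()

val df = spark.sql("select * from hive_table")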

PySpark MongoDB / java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame

I'm trying to connect PySpark to MongoDB with this (running on Databricks):
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS
from pyspark.sql import SQLContext
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
but I get this error:
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I am using Spark 2.0 and the Mongo Spark connector for Scala 2.11, and I have defined spark.mongodb.input.uri and spark.mongodb.output.uri.
You are using spark.read.format before you have defined spark.
As you can see in the Spark 2.1.0 documentation:
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
I managed to make it work; the problem was that I was using mongo-spark-connector_2.10-1.0.0 instead of mongo-spark-connector_2.10-2.0.0.
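For completeness, a minimal sketch of the working pattern once a matching connector build is on the classpath, written in Scala as in the main question of this thread (the configuration keys are the same from PySpark; the URIs are placeholders):

// Sketch only: define the session, including the Mongo URIs, before calling spark.read,
// and make sure the mongo-spark-connector version matches your Spark and Scala build.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("mongo-read")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll")   // placeholder URI
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll")  // placeholder URI
  .getOrCreate()

val df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.show()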