Netezza connection with Spark / Scala JDBC - scala

I've set up Spark 2.2.0 on my Windows machine using Scala 2.11.8 on IntelliJ IDE. I'm trying to make Spark connect to Netezza using JDBC drivers.
I've read through this link and added the com.ibm.spark.netezza jars to my project through Maven. To test the connection, I'm running the Scala script below:
package jdbc

object SimpleScalaSpark {
  def main(args: Array[String]) {
    import org.apache.spark.sql.{SparkSession, SQLContext}
    import com.ibm.spark.netezza

    val spark = SparkSession.builder
      .master("local")
      .appName("SimpleScalaSpark")
      .getOrCreate()

    val sqlContext = SparkSession.builder()
      .appName("SimpleScalaSpark")
      .master("local")
      .getOrCreate()

    val nzoptions = Map("url" -> "jdbc:netezza://SERVER:5480/DATABASE",
      "user" -> "USER",
      "password" -> "PASSWORD",
      "dbtable" -> "ADMIN.TABLENAME")

    val df = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load()
  }
}
However I get the following error:
17/07/27 16:28:17 ERROR NetezzaJdbcUtils$: Couldn't find class org.netezza.Driver
java.lang.ClassNotFoundException: org.netezza.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:49)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:netezza://SERVER:5480/DATABASE
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:54)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
I have two ideas:
1) I don't believe I actually installed any Netezza JDBC driver, though I thought the jars I brought into my project from the link above were sufficient. Am I just missing a driver, or am I missing something in my Scala script?
2) In the same link, the author makes mention of starting the Netezza Spark package:
For example, to use the Spark Netezza package with Spark’s interactive
shell, start it as shown below:
$SPARK_HOME/bin/spark-shell --packages
com.ibm.SparkTC:spark-netezza_2.10:0.1.1
--driver-class-path ~/nzjdbc.jar
I don't believe I'm invoking any package apart from jdbc in my script. Do I have to add that to my script?
Thanks!

Your 1st idea is right, I think. You almost certainly need to install the Netezza JDBC driver if you have not done this already.
From the link you posted:
This package can be deployed as part of an application program or from
Spark tools such as spark-shell, spark-sql. To use the package in the
application, you have to specify it in your application’s build
dependency. When using from Spark tools, add the package using
--packages command line option. Netezza JDBC driver also should be
added to the application dependencies.
The Netezza driver is something you have to download yourself, and you need a support entitlement to get access to it (via IBM's Fix Central or Passport Advantage). It is included in both the Windows driver/client support package and the Linux driver package. Once you have the driver jar, add it to your application's dependencies alongside the spark-netezza package.
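One quick way to confirm whether the driver jar is actually visible to the application is to try loading the driver class yourself before calling Spark. The following is a minimal sketch, not part of the spark-netezza package: the object name DriverCheck and the helper isOnClasspath are made up for illustration, and org.netezza.Driver is simply the class name taken from the error message in the question.

```scala
import scala.util.Try

object DriverCheck {
  // Hypothetical helper: true if the named class can be loaded
  // by the current class loader, false otherwise.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  def main(args: Array[String]): Unit = {
    // Class name taken from the ClassNotFoundException in the question.
    val driverClass = "org.netezza.Driver"
    if (isOnClasspath(driverClass))
      println(s"$driverClass found on the classpath")
    else
      println(s"$driverClass is missing; add the Netezza JDBC jar to the project dependencies")
  }
}
```

If this reports the class as missing when run from the same IntelliJ module, the problem is the classpath rather than the Spark code.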

Related

java.lang.NoSuchMethodError: com.mongodb.internal.connection.Cluster.selectServer

I am new to Apache Spark and I am using Scala and Mongodb to learn it.
https://docs.mongodb.com/spark-connector/current/scala-api/
I am trying to read an RDD from my MongoDB database. My notebook script is below:
import com.mongodb.spark.config._
import com.mongodb.spark._
val readConfig = ReadConfig(Map("uri" -> "mongodb+srv://root:root@mongodbcluster.td5gp.mongodb.net/test_database.test_collection?retryWrites=true&w=majority"))
val testRDD = MongoSpark.load(sc, readConfig)
print(testRDD.collect)
At the print(testRDD.collect) line, I got this error:
java.lang.NoSuchMethodError:
com.mongodb.internal.connection.Cluster.selectServer(Lcom/mongodb/selector/ServerSelector;)Lcom/mongodb/internal/connection/Server;
followed by more than 10 "at ..." lines.
Used libraries:
org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
org.mongodb.scala:mongo-scala-driver_2.12:4.2.3
Is this a problem in the MongoDB internal libraries, and how can I fix it?
Many thanks
I suspect that there is a conflict between mongo-spark-connector and mongo-scala-driver. The former uses Mongo driver 4.0.5, but the latter is based on version 4.2.3. I recommend trying with mongo-spark-connector only.
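To see which jar actually supplies the class named in the error, you can ask the JVM where it loaded the class from. This is only a diagnostic sketch under the assumption that the conflict is between two driver versions on the classpath; JarLocator and locationOf are hypothetical names, and the class name is copied from the NoSuchMethodError in the question.

```scala
import scala.util.Try

object JarLocator {
  // Hypothetical helper: returns the location (usually a jar path) a class
  // was loaded from, or None if the class cannot be loaded or has no
  // code source (as is typical for JDK classes).
  def locationOf(className: String): Option[String] =
    for {
      cls    <- Try(Class.forName(className)).toOption
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString

  def main(args: Array[String]): Unit = {
    // Class name taken from the NoSuchMethodError in the question.
    val name = "com.mongodb.internal.connection.Cluster"
    println(s"$name -> ${locationOf(name).getOrElse("not found")}")
  }
}
```

In a notebook, calling JarLocator.locationOf with that class name should show whether the class comes from the driver version the connector expects or from the newer one.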
I was facing the same problem and solved it by using the mongo-spark-connector_2.12:3.0.1 jar and also adding the scalaj-http 2.4.2 jar. It's working fine now.

Connecting HIVE from Spark/Scala

I have installed Hadoop-3.3.0 and Hive-3.1.2 in Ubuntu WSL (Windows Subsystem for Linux).
I have all the Hadoop, YARN and hiveserver2 daemons running in Ubuntu WSL.
On my Windows OS (host), I open Scala IDE. Via Spark/Scala, I would like to connect to the Hive tables that are available in Ubuntu WSL.
On Windows, I have nothing related to Hadoop/Hive installed. Everything is available only in Ubuntu WSL.
Can someone please help me do this in Scala IDE?
I build the project with Maven.
Code I use:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("My APP")
  .config("spark.sql.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate
spark.sql("show tables").show();
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport
Thanks!

Spark unit tests with hive on local metastore

I'm using Spark 2.2.0, and I would like to create unit tests for Spark with Hive support.
The tests should rely on a metastore that is stored on the local disk (as explained in the programming guide).
I define the session in the following way:
val spark = SparkSession
  .builder
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()
The creation of the Spark session fails with the error:
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
I managed to work around this error by adding the following dependency:
"org.datanucleus" % "datanucleus-accessplatform-jdo-rdbms" % "3.2.9"
This is strange to me, since this library is already included in Spark.
Is there another way to solve this?
I wouldn't want to keep track of the library and update it with every new Spark version.

Zeppelin 6.5 + Apache Kafka connector for Structured Streaming 2.0.2

I'm trying to run a zeppelin notebook that contains spark's Structured Streaming example with Kafka connector.
>kafka is up and running on localhost port 9092
>from zeppelin notebook, sc.version returns String = 2.0.2
Here is my environment:
kafka: kafka_2.10-0.10.1.0
zeppelin: zeppelin-0.6.2-bin-all
spark: spark-2.0.2-bin-hadoop2.7
Here is the code in my zeppelin notebook:
import org.apache.spark.sql.functions.{explode, split}

// Setup connection to Kafka
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // comma separated list of broker:host
  .option("subscribe", "twitter") // comma separated list of topics
  .option("startingOffsets", "latest") // read data from the end of the stream
  .load()
Here is the error I'm getting when I run the notebook:
import org.apache.spark.sql.functions.{explode, split}
java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:218)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 86 elided
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at scala.util.Try$.apply(Try.scala:192)
Any help or advice would be greatly appreciated.
Thanks
You have probably figured this out already, but I'm putting in the answer for others: you have to add the following to zeppelin-env.sh.j2
SPARK_SUBMIT_OPTIONS="--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0"
along with potentially other dependencies if you are using the Kafka client:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,org.apache.spark:spark-sql_2.11:2.1.0,org.apache.kafka:kafka_2.11:0.10.0.1,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0,org.apache.kafka:kafka-clients:0.10.0.1
This solution has been tested in Zeppelin version 0.10.1.
You need to add the dependencies of your code. This can be done in the Zeppelin UI. Go to the Interpreter panel (http://localhost:8080/#/interpreter) and, in the spark section, add the artifact of each dependency under Dependencies. If you run into other dependency issues after adding spark-sql-kafka, add all the packages that spark-sql-kafka needs. You can find them in the Compile Dependencies section of its Maven repository page.
I'm working with Spark version 3.0.0 and Scala version 2.12, and I was trying to integrate Spark with Kafka. I managed to get past this issue by adding all the artifacts below:
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0
com.google.code.findbugs:jsr305:3.0.0
org.apache.spark:spark-tags_2.12:3.3.0
org.apache.spark:spark-token-provider-kafka-0-10_2.12:3.3.0
org.apache.kafka:kafka-clients:3.0.0
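The coordinates in these answers follow the pattern group:artifact_scalaBinaryVersion:version. The Scala suffix has to match the Scala version your Spark build uses, and the connector version is usually chosen to match the Spark version. As a small illustration of that convention (KafkaCoordinates and sqlKafka are hypothetical names, not part of Spark or Zeppelin):

```scala
object KafkaCoordinates {
  // Hypothetical helper: builds the Maven coordinate of the Structured
  // Streaming Kafka source for a given Scala binary version and Spark version.
  def sqlKafka(scalaBinaryVersion: String, sparkVersion: String): String =
    s"org.apache.spark:spark-sql-kafka-0-10_$scalaBinaryVersion:$sparkVersion"

  def main(args: Array[String]): Unit = {
    // Scala and Spark versions as stated in the last answer.
    println(sqlKafka("2.12", "3.0.0"))
  }
}
```

The resulting string is what you would paste into the Zeppelin interpreter's Dependencies field or pass to --packages.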

Unable to connect to redshift from spark

I am trying to read data from Redshift into Spark 1.5 using Scala 2.10.
I have built the spark-redshift package and added the Amazon JDBC connector to the project, but I keep getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentials
I have authenticated in the following way:
import org.apache.spark.sql.DataFrame

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "ACCESSKEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "SECRETACCESSKEY")

val df: DataFrame = sqlContext.read.format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://AWS_SERVER:5439/warehouse?user=USER&password=PWD")
  .option("dbtable", "fact_time")
  .option("tempdir", "s3n://bucket/path")
  .load()
df.show()
Concerning your first error, java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentials, I repeat what I said in the comment:
You forgot to ship the AWS dependency jar in your Spark application jar.
As for the second error, I'm not sure of the package, but it's most likely the org.apache.httpcomponents library that you need. (I don't know what you are using it for, though!)
You can add the following to your maven dependencies :
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.4.3</version>
</dependency>
and you'll need to build an assembly jar that bundles everything.
PS: You'll always need to provide the libraries yourself when they are not installed on the cluster. You must also be careful with the size of the jar you are submitting, because it can harm performance.