object App {
def main(args: Array[String]) {
val conf = new spark.SparkConf().setMaster("local[2]").setAppName("mySparkApp")
val sc = new spark.SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcUrl = "1.2.34.567"
val jdbcUser = "someUser"
val jdbcPassword = "xxxxxxxxxxxxxxxxxxxx"
val tableName = "myTable"
val driver = "org.postgresql.Driver"
Class.forName(driver)
val df = sqlContext
.read
.format("jdbc")
.option("driver", driver)
.option("url", jdbcUrl)
.option("userName", jdbcUser)
.option("password", jdbcPassword)
.option("dbtable", tableName) // NullPointerException occurs here
.load()
}
}
I want to connect to a Postgres database on my LAN from Spark. During runtime, the following error occurs:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at <redacted>.App$.main(App.scala:42)
at <redacted>.App.main(App.scala)
Is there an obvious reason why there's a NullPointerException at the .option("dbtable", tableName) line? I'm using spark-2.3.1-bin-hadoop2.7 with Scala 2.11.12. For the Postgres dependency, I'm using this version:
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>9.4-1200-jdbc41</version>
</dependency>
The error message (which isn't very helpful for troubleshooting) is probably not about the dbtable option, but about the url option.
It looks like your jdbcUrl is missing the jdbc:postgresql:// protocol prefix. See Spark's documentation on JDBC data sources for the expected URL format.
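For example, a well-formed PostgreSQL JDBC URL has the shape jdbc:postgresql://<host>:<port>/<database>. A minimal sketch of the corrected read, with a hypothetical port and database name added to the question's host:
// hypothetical port and database name; substitute your own values
val jdbcUrl = "jdbc:postgresql://1.2.34.567:5432/myDatabase"
val df = sqlContext
  .read
  .format("jdbc")
  .option("driver", driver)
  .option("url", jdbcUrl)
  .option("user", jdbcUser) // note: the standard Spark JDBC option name is "user", not "userName"
  .option("password", jdbcPassword)
  .option("dbtable", tableName)
  .load()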
I am trying to read a simple CSV file from Azure Data Lake Storage V2 with Spark 2.4, using IntelliJ IDEA on a Mac.
Code below:
package com.example
import org.apache.spark.SparkConf
import org.apache.spark.sql._
object Test extends App {
val appName: String = "DataExtract"
val master: String = "local[*]"
val sparkConf: SparkConf = new SparkConf()
.setAppName(appName)
.setMaster(master)
.set("spark.scheduler.mode", "FAIR")
.set("spark.sql.session.timeZone", "UTC")
.set("spark.sql.shuffle.partitions", "32")
.set("fs.defaultFS", "abfs://development#xyz.dfs.core.windows.net/")
.set("fs.azure.account.key.xyz.dfs.core.windows.net", "~~key~~")
val spark: SparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
spark.time(run(spark))
def run(spark: SparkSession): Unit = {
val df = spark.read.csv("abfs://development#xyz.dfs.core.windows.net/development/sales.csv")
df.show(10)
}
}
It isn't able to read the file; it throws what looks like a security-related exception:
Exception in thread "main" java.lang.NullPointerException
at org.wildfly.openssl.CipherSuiteConverter.toJava(CipherSuiteConverter.java:284)
at org.wildfly.openssl.OpenSSLEngine.toJavaCipherSuite(OpenSSLEngine.java:1094)
at org.wildfly.openssl.OpenSSLEngine.getEnabledCipherSuites(OpenSSLEngine.java:729)
at org.wildfly.openssl.OpenSSLContextSPI.getCiphers(OpenSSLContextSPI.java:333)
at org.wildfly.openssl.OpenSSLContextSPI$1.getSupportedCipherSuites(OpenSSLContextSPI.java:365)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.<init>(SSLSocketFactoryEx.java:105)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.initializeDefaultFactory(SSLSocketFactoryEx.java:72)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.<init>(AbfsClient.java:79)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:817)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:149)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
Can anyone help me figure out what the mistake is?
As per my research, you will receive this error message when you have a jar that is incompatible with your Hadoop version.
Please go through the issues below:
http://mail-archives.apache.org/mod_mbox/spark-issues/201907.mbox/%3CJIRA.13243325.1562321895000.591499.1562323440292#Atlassian.JIRA%3E
https://issues.apache.org/jira/browse/HADOOP-16410
I had the same issue; it was resolved by adding wildfly-openssl version 1.0.7, as the docs shared by @cheekatlapradeep-msft mention:
<dependency>
<groupId>org.wildfly.openssl</groupId>
<artifactId>wildfly-openssl</artifactId>
<version>1.0.7.Final</version>
</dependency>
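If you build with sbt rather than Maven, the equivalent dependency would presumably be:
libraryDependencies += "org.wildfly.openssl" % "wildfly-openssl" % "1.0.7.Final"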
I had the following code:
import org.jooq._
import org.jooq.impl._
import org.jooq.impl.DSL._
import java.sql.DriverManager
import org.apache.log4j.receivers.db.dialect.SQLDialect
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
val userName = "user"
val password = "pass"
val c = DriverManager.getConnection("jdbc:mysql://blah_blah.com", userName, password)
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then I got the following error:
found : org.jooq.SQL
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df1 = sql(s"select * from $db1_name.$tb1_name")
^
found : java.sql.PreparedStatement
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then, per the suggestions in the comments, I changed my code to the following:
val userName = "user"
val password = "pass"
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
....
....
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = session.read.format("jdbc").
option("url", "jdbc:mysql://blah_blah.com").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", userName).
option("password", password).
option("dbtable",s"select * from $db2_name.$tb2_name").load()
I am getting errors like the following:
The last packet sent successfully to the server was 0 milliseconds
ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:632)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1016)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
... 78 elided
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:567)
... 100 more
Any solution or suggestion for these two errors?
I have tried the postgresql and h2 drivers as well (e.g. org.postgresql.Driver), but I get similar errors (maybe not exactly the same).
Your issue is that the Scala compiler has already typed the vars df1 and df2 as empty DataFrames.
You have to read directly from Spark:
spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
For more information, you can check the Apache Spark documentation directly:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
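Putting that together with the connection details from the question, the read might look roughly like this (a sketch only: the port and database in the URL are hypothetical, and the query option requires Spark 2.4 or later):
// session is the SparkSession from the question; adjust the URL to your server
val df2 = session.read
  .format("jdbc")
  .option("url", "jdbc:mysql://blah_blah.com:3306/some_db") // hypothetical port and database
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", userName)
  .option("password", password)
  .option("query", s"select * from $db2_name.$tb2_name")
  .load()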
You can simply get a DataFrame by reading as shown below. First, set your connection details:
import java.util.Properties

val jdbcHostname = "some.host.name"
val jdbcDatabase = "some_db"
val driver = "com.mysql.cj.jdbc.Driver" // update the driver as needed; in your case it will be org.postgresql.Driver
// URL to the DB
val jdbcUrl = s"jdbc:mysql://$jdbcHostname:3306/$jdbcDatabase"
val username = "someUser"
val password = "somePass"
// create connection properties for your DB
val connectionProperties = new Properties()
connectionProperties.put("user", username)
connectionProperties.put("password", password)
connectionProperties.setProperty("driver", driver)
and then read from JDBC as:
// use above created url and connection properties to fetch data
val tableName = "some-table"
val mytable = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.
You can use the mytable DataFrame above to run your queries or save the data.
Say you want to select some columns and save them:
// your select query
val selectedDF = mytable.select("c1", "c2")
// now you can save above dataframe
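For example, the selected columns could be written back out over the same connection (a sketch; the target table name is hypothetical):
// write the selected columns to a (hypothetical) target table, reusing the url and properties from above
selectedDF.write
  .mode("append")
  .jdbc(jdbcUrl, "some_target_table", connectionProperties)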
I'm trying to query an HBase table with Spark, but I get this error:
14:08:35.134 [main] DEBUG org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
I have set HADOOP_HOME in .bashrc, and echo $HADOOP_HOME gives me the path.
My code:
object HbaseQuery {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "emp"
System.setProperty("hadoop.home.dir", "/usr/local/hadoop-2.7.6")
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("hbase.master", "localhost:60000")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + hBaseRDD.count())
hBaseRDD.foreach(println)
}
}
I tried creating spark-env.sh too, by adding
export HADOOP_HOME="my path"
but I still get the same problem.
Thanks in advance.
I have a task to read a CSV file and load it into a SQL table, but I am not sure about my code: I am getting a "No suitable driver" error, even after trying a new driver.
val DBURL= "jdbc:sqlserver://servername:port;DatabaseName=DBname"
val srcfile=spark.read.text("filename")
val test =srcfile.write.format("jdbc")
.option("url", DBURL)
.option("dbtable", "tablename")
.option("user", "username")
.option("password", "password")
.save()
Any help is highly appreciated.
You can also add the corresponding driver in the options, like:
.option("driver", "org.postgresql.Driver")
or
.option("driver", "com.mysql.jdbc.Driver")
I hope the following answer helps you. It has been tried out, so it should not have any errors.
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("Testing Transpose").setMaster("local[*]").set("spark.sql.crossJoin.enabled", "true")
  val sc = new SparkContext(conf)
  val sparksession = SparkSession.builder().config("spark.sql.warehouse.dir", "file:///c://tmp/spark-warehouse").getOrCreate()
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  // read the CSV file (Path is the location of your file)
  val df = sparksession.read.format("com.databricks.spark.csv").option("header", "true").load(Path)
  // JDBC connection properties
  val prop: java.util.Properties = new Properties()
  prop.setProperty("user", "(temp_User)")
  prop.setProperty("password", "(temp_password)")
  // append the dataframe to the target SQL Server table
  df
    .write
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("append")
    .jdbc("jdbc:sqlserver://(database_ip):(database_port_to_access)", "(table_name)", prop)
  sparksession.stop()
}
Include this dependency if you want to use databricks spark-csv; otherwise you can replace it with:
val df = sparkSession.read.option("header","true").csv("src/main/resources/sales.csv")
If you do use databricks spark-csv, this needs to be included in build.sbt:
libraryDependencies += "com.databricks" % "spark-csv_2.10" % "0.1"
If your file doesn't have a header, you can provide the column names like the following:
import sqlContext.implicits._
df.toDF("column_name_1","column_name_2",.....)
Note: the number of column names must match the number of columns in the dataframe. One more thing to note: you need to change the header option to false, as follows:
sparksession.read.format("com.databricks.spark.csv").option("header", "false").load(Path)
I have a Spark (v1.2.1) job that inserts the contents of an RDD into Postgres using org.postgresql.Driver for Scala:
rdd.foreachPartition(iter => {
//connect to postgres database on the localhost
val driver = "org.postgresql.Driver"
var connection:Connection = null
Class.forName(driver)
connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
iter.foreach(row => {
val mapRequest = Utils.getInsertMap(row)
val query = Utils.getInsertRequest(squares_table, mapRequest)
try { statement.execute(query) }
catch {
case pe: PSQLException => println("exception caught: " + pe);
}
})
connection.close()
})
In the above code, I open a new connection to Postgres for each partition of the RDD and close it. I think the right way to go would be to use a connection pool for Postgres that I can take connections from (as described here), but it's just pseudo-code:
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
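For illustration, the ConnectionPool object in the pseudo-code could be backed by a lazily initialized pool such as HikariCP (a minimal sketch under that assumption; HikariCP, the URL, and the credentials are not part of the original code):
import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// initialized lazily, once per executor JVM, on first use
object ConnectionPool {
  private lazy val dataSource: HikariDataSource = {
    val config = new HikariConfig()
    config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb") // hypothetical URL
    config.setUsername("username")                             // hypothetical credentials
    config.setPassword("password")
    config.setMaximumPoolSize(10)
    new HikariDataSource(config)
  }

  def getConnection(): Connection = dataSource.getConnection()

  // closing a pooled connection returns it to the pool rather than closing the socket
  def returnConnection(connection: Connection): Unit = connection.close()
}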
What is the right way to connect to Postgres with a connection pool from Spark?
This code will work for Spark version 2 or greater with Scala. First, you have to add the JDBC driver dependency.
If you are using Maven, you can do it this way: add this dependency to your pom file.
<dependency>
<groupId>postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>9.1-901-1.jdbc4</version>
</dependency>
Then write this code in a Scala file:
import org.apache.spark.sql.SparkSession
object PostgresConnection {
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-Basic")
.master("local[4]")
.getOrCreate()
val prop = new java.util.Properties
prop.setProperty("driver","org.postgresql.Driver")
prop.setProperty("user", "username")
prop.setProperty("password", "password")
val url = "jdbc:postgresql://127.0.0.1:5432/databaseName"
val df = spark.read.jdbc(url, "table_name",prop)
df.show(5)
}
}