BSON Error Connecting to MongoDB from Spark

Following the instructions to connect to Mongo, I get the following error:
Uncaught exception: org/bson/BsonValue (java.lang.NoClassDefFoundError)
Here is my code:
val user = "<>"
val pwd = "<>"
val url = s"mongodb://$user:$pwd#<address>"
spark.conf.set("spark.mongodb.read.connection.uri", url)
spark.conf.set("spark.mongodb.read.database", "<db>")
val df = spark.read.format("mongodb")
.option("database", "<db>")
.option("collection", "<collection>").load()
What am I doing wrong?
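For reference, a NoClassDefFoundError for org/bson/BsonValue usually means the MongoDB Spark Connector (which pulls in the BSON library) is not on the classpath. Below is a minimal sketch of supplying it, assuming Spark 3.x with Scala 2.12 and a 10.x connector version (adjust the coordinates to your environment); in a notebook or managed cluster such as Databricks, the package is normally attached via --packages or the cluster's library settings rather than spark.jars.packages:
import org.apache.spark.sql.SparkSession

// Assumption: Spark 3.x / Scala 2.12; adjust the connector version to your environment.
val spark = SparkSession.builder()
  .appName("mongo-read")
  .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1")
  // note: a standard connection string separates credentials from the host with '@', not '#'
  .config("spark.mongodb.read.connection.uri", s"mongodb://$user:$pwd@<address>")
  .getOrCreate()

val df = spark.read.format("mongodb")
  .option("database", "<db>")
  .option("collection", "<collection>")
  .load()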

Related

Push down DML commands to SQL using Pyspark on Databricks

I'm using Azure Databricks and want to push down a query to an Azure SQL database using PySpark. I've tried many ways and found a solution using Scala (code below), but doing it this way I need to convert part of my code to Scala and then bring it back to PySpark again.
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
Is there a way to achieve the pushdown of DML commands using PySpark instead of Scala?
I found something related, but it only works for reading data and DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
You can actually achieve the same thing in Python as in the Scala example you provided.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
exec_statement = connection.prepareCall(query)
exec_statement.execute()
exec_statement.close()
connection.close()
For your case, I would try:
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()

Scala sql query remote access error from GCP to on-premises

I had the following code:
import org.jooq._
import org.jooq.impl._
import org.jooq.impl.DSL._
import java.sql.DriverManager
import org.apache.log4j.receivers.db.dialect.SQLDialect
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
val userName = "user"
val password = "pass"
val c = DriverManager.getConnection("jdbc:mysql://blah_blah.com", userName, password)
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then I got the following error:
found : org.jooq.SQL
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df1 = sql(s"select * from $db1_name.$tb1_name")
^
found : java.sql.PreparedStatement
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then, per the suggestions in the comments, I changed my code to the following:
val userName = "user"
val password = "pass"
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
....
....
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = session.read.format("jdbc").
option("url", "jdbc:mysql://blah_blah.com").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", userName).
option("password", password).
option("dbtable",s"select * from $db2_name.$tb2_name").load()
I am getting the following errors:
The last packet sent successfully to the server was 0 milliseconds
ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:632)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1016)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
... 78 elided
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:567)
... 100 more
Any solution or suggestion for these two errors?
I have also tried the PostgreSQL and H2 drivers (e.g. org.postgresql.Driver), but I get similar (though not identical) errors.
Your issue is that the Scala compiler has already initialized the vars df1 and df2 as empty DataFrames.
You have to read directly from Spark:
spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
For more information, you can check the Apache Spark documentation page:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
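Applied to the second read in the question, a fuller sketch might look like the following. Note that the query option requires Spark 2.4 or later; on older versions, wrap the select in parentheses with an alias and pass it as dbtable. The host, port, and database are placeholders carried over from the question:
val df2 = session.read
  .format("jdbc")
  .option("url", "jdbc:mysql://blah_blah.com:3306/<database>")  // placeholder host, port, and database
  .option("driver", "com.mysql.jdbc.Driver")                    // or com.mysql.cj.jdbc.Driver for Connector/J 8.x
  .option("user", userName)
  .option("password", password)
  .option("query", s"select * from $db2_name.$tb2_name")
  .load()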
You can simply get a DataFrame by reading as below. Set your connection details:
import java.util.Properties

val jdbcHostname = "some.host.name"
val jdbcDatabase = "some_db"
val driver = "com.mysql.cj.jdbc.Driver" // update the driver as needed; in your case it will be `org.postgresql.Driver`
// url to DB
val jdbcUrl = s"jdbc:mysql://$jdbcHostname:3306/$jdbcDatabase"
val username = "someUser"
val password = "somePass"
// create a properties map for your DB connection
val connectionProperties = new Properties()
connectionProperties.put("user", s"${username}")
connectionProperties.put("password", s"${password}")
connectionProperties.setProperty("Driver", driver)
and then read from JDBC as:
// use above created url and connection properties to fetch data
val tableName = "some-table"
val mytable = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.
You can use the above mytable dataframe to run your queries or save data.
Say you want to select a couple of columns and then save the result:
// your select query
val selectedDF = mytable.select("c1", "c2")
// now you can save the above dataframe, e.g. as Parquet (the output path is just an example)
selectedDF.write.mode("overwrite").parquet("/path/to/output")

NullPointerException when connecting to Postgres from Spark -- why?

object App {
  def main(args: Array[String]) {
    val conf = new spark.SparkConf().setMaster("local[2]").setAppName("mySparkApp")
    val sc = new spark.SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val jdbcUrl = "1.2.34.567"
    val jdbcUser = "someUser"
    val jdbcPassword = "xxxxxxxxxxxxxxxxxxxx"
    val tableName = "myTable"
    val driver = "org.postgresql.Driver"
    Class.forName(driver)
    val df = sqlContext
      .read
      .format("jdbc")
      .option("driver", driver)
      .option("url", jdbcUrl)
      .option("userName", jdbcUser)
      .option("password", jdbcPassword)
      .option("dbtable", tableName) // NullPointerException occurs here
      .load()
  }
}
I want to connect to a Postgres database on my LAN from Spark. During runtime, the following error occurs:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at <redacted>?.main(App.scala:42)
at <redacted>.App.main(App.scala)
Is there an obvious reason why there's a NullPointerException at the option("dbtable", tableName) line? I'm using spark-2.3.1-bin-hadoop2.7 with Scala 2.11.12. For the Postgres dependency, I'm using this version:
<dependency>
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>9.4-1200-jdbc41</version>
</dependency>
The error message (which isn't very helpful for troubleshooting) probably points not at the dbtable option, but at the url option.
It looks like your jdbcUrl is missing the URL protocol jdbc:postgresql:// as its prefix. See Spark's JDBC data sources documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
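A minimal sketch of what the corrected read might look like, assuming the default Postgres port 5432 and a placeholder database name (as a side note, Spark's JDBC option is spelled user rather than userName):
// assumption: default Postgres port and a placeholder database name
val jdbcUrl = "jdbc:postgresql://1.2.34.567:5432/<database>"

val df = sqlContext.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", jdbcUrl)
  .option("user", jdbcUser)         // Spark's JDBC source expects "user", not "userName"
  .option("password", jdbcPassword)
  .option("dbtable", tableName)
  .load()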

Error on loading csv file from hdfs using spark sql

This is the code I'm running:
scala> val telecom = sqlContext.read.format("csv").load("hdfs:///CDR.csv")
but I am getting an error:
<console>:23: error: not found: value sqlContext
val telecom = sqlContext.read.format("csv").load("hdfs:///CDR.csv")
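The error means sqlContext is not defined in this shell session. A minimal sketch of a workaround, assuming a Spark 2.x spark-shell where the SparkSession is predefined as spark (the header option is only an assumption about CDR.csv):
// Spark 2.x spark-shell predefines a SparkSession named `spark`
val telecom = spark.read
  .format("csv")
  .option("header", "true")   // assumption: CDR.csv has a header row; drop this if it does not
  .load("hdfs:///CDR.csv")

// if a SQLContext is specifically needed, it is available from the session
val sqlContext = spark.sqlContext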

Getting error while converting DynamicFrame to a Spark DataFrame using toDF

I started using AWS Glue to read data using the Data Catalog and GlueContext, and to transform it as per the requirements.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // this line is getting the error
While executing the above code, I got the following error:
Error : Syntax Error: error: value toDf is not a member of
com.amazonaws.services.glue.DynamicFrame val revisedDF =
stGBDyf.toDf() one error found.
I followed this example to convert a DynamicFrame to a Spark DataFrame.
Please suggest the best way to resolve this problem.
There's a typo. It should work fine with capital F in toDF:
val revisedDF = stGBDyf.toDF()
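If the converted DataFrame should then be written to the S3 location defined in the question, a minimal sketch using plain Spark (Parquet and the overwrite mode are just example choices; Glue's own sinks could be used instead):
// write the converted DataFrame to the S3 output directory defined earlier
revisedDF.write
  .mode("overwrite")   // example write mode
  .parquet(outputDir)  // outputDir = "s3://output/directory/abc"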