I tried to run a Scala program to extract data from the MySQL retail_db database, but it throws an SQLException.
My Code
import java.sql.DriverManager
import java.sql.Connection
import java.util.Properties
import org.apache.spark.sql.functions._
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://127.0.0.1:3306/employee"
val username = "cool"
val password = "Cool12345"
Class.forName (driver).newInstance ();
val connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
val resultSet = statement.executeQuery(s"SELECT * FROM emp")
Error
SQLNonTransientConnectionException: Could not connect to address=(host=127.0.0.1)(port=3306)(type=master) : Connection refused (Connection refused)
Caused by: ConnectException: Connection refused (Connection refused)
Database Details
Hostname - 127.0.0.1
port - 3306
username - cool
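For context, "Connection refused" means nothing accepted the TCP connection on 127.0.0.1:3306, so the JDBC call cannot succeed regardless of credentials. A minimal sketch to verify the port is reachable before retrying (host and port taken from the question; this probe is my addition, not part of the original code):
import java.net.{InetSocketAddress, Socket}

// Quick reachability probe for the MySQL port from the question.
// If this throws ConnectException, the MySQL server is not listening on 127.0.0.1:3306
// (server down, wrong port, or bound to another interface).
val socket = new Socket()
try {
  socket.connect(new InetSocketAddress("127.0.0.1", 3306), 5000) // 5 second timeout
  println("MySQL port is reachable")
} finally {
  socket.close()
}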
Related
Following the instructions to connect to Mongo, I get the following error:
Uncaught exception: org/bson/BsonValue (java.lang.NoClassDefFoundError)
Here is my code:
val user = "<>"
val pwd = "<>"
val url = s"mongodb://$user:$pwd#<address>"
spark.conf.set("spark.mongodb.read.connection.uri", url)
spark.conf.set("spark.mongodb.read.database", "<db>")
val df = spark.read.format("mongodb")
  .option("database", "<db>")
  .option("collection", "<collection>")
  .load()
What am I doing wrong?
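For what it's worth, a NoClassDefFoundError for org/bson/BsonValue usually indicates that the MongoDB Java driver's bson classes are not on the runtime classpath; they normally come in transitively with the Spark connector. A hedged build sketch, assuming sbt and that the connector version is chosen to match your Spark and Scala versions (the version placeholder is mine, not from the original post):
// Hypothetical build.sbt fragment: the connector pulls in bson / the MongoDB driver transitively.
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "<connector-version>"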
Has anyone ever managed to get this to work? I've added a connection in AWS Glue to connect to my MongoDB cluster in Atlas, but I'm getting the following error in AWS:
Check that your connection definition references your Mongo database with correct URL syntax, username, and password Exiting with error code 30
I spun up an EC2 instance in the same subnet as the Glue connection in my VPC and it connects just fine. I also allowed all traffic in my security group, but I'm still getting the same error.
You might need to take a look at authSource, one of the optional components of a MongoDB connection string URI.
Scala Example
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
object GlueApp {
  val DEFAULT_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017"
  val WRITE_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017"

  lazy val defaultJsonOption = jsonOptions(DEFAULT_URI)
  lazy val writeJsonOption = jsonOptions(WRITE_URI)

  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get DynamicFrame from MongoDB
    val resultFrame: DynamicFrame = glueContext.getSource("mongodb", defaultJsonOption).getDynamicFrame()

    // Write DynamicFrame to MongoDB and DocumentDB
    glueContext.getSink("mongodb", writeJsonOption).writeDynamicFrame(resultFrame)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database":"test",
         |"collection":"coll",
         |"username": "username",
         |"password": "pwd",
         |"ssl":"true",
         |"ssl.domain_match":"false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
You may need to specify the authentication source as a database in the cluster you intend to connect to, for example:
mongodb://<an_ip_from_atlas_project_ip_access_list>:27017/?authSource=test
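If you are using the jsonOptions helper from the GlueApp example above, the authSource-qualified URI can simply be passed through it. A small sketch inside that object; the auth database name "test" is only an assumption, on Atlas the user is often defined in "admin":
// Hedged sketch: reuse jsonOptions from the GlueApp example with an authSource-qualified URI.
val AUTH_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017/?authSource=test"
lazy val authJsonOption = jsonOptions(AUTH_URI)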
References
docs.atlas.mongodb.com. 2021. Connect to a Cluster. [ONLINE] Available at: https://docs.atlas.mongodb.com/connect-to-cluster.
docs.aws.amazon.com. 2021. Examples: Setting Connection Types and Options. [ONLINE] Available at: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-samples.html.
docs.mongodb.com. 2021. Configuration Options. [ONLINE] Available at: https://docs.mongodb.com/spark-connector/master/configuration#partitioner-conf.
docs.mongodb.com. 2021. Connection String URI Format. [ONLINE] Available at: https://docs.mongodb.com/manual/reference/connection-string/.
I had the following code:
import org.jooq._
import org.jooq.impl._
import org.jooq.impl.DSL._
import java.sql.DriverManager
import org.apache.log4j.receivers.db.dialect.SQLDialect
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
val userName = "user"
val password = "pass"
val c = DriverManager.getConnection("jdbc:mysql://blah_blah.com", userName, password)
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then I got the following error:
found : org.jooq.SQL
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df1 = sql(s"select * from $db1_name.$tb1_name")
^
found : java.sql.PreparedStatement
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then, per the suggestions in the comments, I changed my code to the following:
val userName = "user"
val password = "pass"
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
....
....
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = session.read.format("jdbc").
  option("url", "jdbc:mysql://blah_blah.com").
  option("driver", "com.mysql.jdbc.Driver").
  option("useUnicode", "true").
  option("continueBatchOnError", "true").
  option("useSSL", "false").
  option("user", userName).
  option("password", password).
  option("dbtable", s"select * from $db2_name.$tb2_name").load()
I am getting errors like the following:
The last packet sent successfully to the server was 0 milliseconds
ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:632)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1016)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
... 78 elided
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:567)
... 100 more
Any solution or suggestion for these two errors?
I have tried the PostgreSQL and H2 drivers as well (e.g. org.postgresql.Driver),
but I get similar errors (maybe not exactly the same).
Your issue is that the Scala compiler has already initialized the vars df1 and df2 as empty DataFrames.
You have to read directly with Spark instead:
spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
For more information, you can check the Apache Spark documentation directly:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
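Applied to df2 from the question, that would look roughly like this. A sketch only: it keeps the question's placeholder names and uses the query option from the snippet above.
// Sketch: db2_name and tb2_name are the placeholders from the question.
df2 = session.read.format("jdbc")
  .option("url", "jdbc:mysql://blah_blah.com")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", userName)
  .option("password", password)
  .option("query", s"select * from $db2_name.$tb2_name")
  .load()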
You can simply get a DataFrame by reading as below. First, set your connection details:
val jdbcHostname = "some.host.name"
val jdbcDatabase = "some_db"
val driver = "com.mysql.cj.jdbc.Driver" // update the driver as needed; in your case it would be org.postgresql.Driver
// url to DB
val jdbcUrl = s"jdbc:mysql://$jdbcHostname:3306/$jdbcDatabase"
val username = "someUser"
val password = "somePass"
// create a properties map for your DB connection
val connectionProperties = new java.util.Properties()
connectionProperties.put("user", s"${username}")
connectionProperties.put("password", s"${password}")
connectionProperties.setProperty("Driver", driver)
and then read from JDBC as:
// use above created url and connection properties to fetch data
val tableName = "some-table"
val mytable = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.
You can use the mytable DataFrame above to run your queries or save the data.
Say you want to select some columns and save them:
// your select query
val selectedDF = mytable.select("c1", "c2")
// now you can save above dataframe
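For instance, to persist the selection you could do something like the following; the output path and target table name are placeholders of mine, and it reuses jdbcUrl and connectionProperties from above:
// Hedged examples of saving the selected DataFrame.
selectedDF.write.mode("overwrite").parquet("/some/output/path") // save as Parquet files
// or write back over JDBC using the same url / properties defined above
selectedDF.write.mode("append").jdbc(jdbcUrl, "some_target_table", connectionProperties)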
I am trying to connect to HDFS locally via IntelliJ installed on my laptop. The cluster I am trying to connect to is Kerberized, with an edge node. I generated a keytab for the edge node and configured that in the code below. I am now able to log in to the edge node, but when I try to access the HDFS data on the namenode it throws an error.
Below is the Scala code that tries to connect to HDFS:
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}
import org.apache.hadoop.security.token.{Token, TokenIdentifier}
import java.security.{AccessController, PrivilegedAction, PrivilegedExceptionAction}
import java.io.PrintWriter
object DataframeEx {
  def main(args: Array[String]) {
    // $example on:init_session$
    val spark = SparkSession
      .builder()
      .master(master = "local")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    runHdfsConnect(spark)
    spark.stop()
  }

  def runHdfsConnect(spark: SparkSession): Unit = {
    System.setProperty("HADOOP_USER_NAME", "m12345")
    val path = new Path("/data/interim/modeled/abcdef")

    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenodename.hugh.com:8020")
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("dfs.namenode.kerberos.principal.pattern", "hdfs/_HOST#HUGH.COM")

    UserGroupInformation.setConfiguration(conf)
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI("m12345#HUGH.COM", "C:\\Users\\m12345\\Downloads\\m12345.keytab")
    println(UserGroupInformation.isSecurityEnabled())

    ugi.doAs(new PrivilegedExceptionAction[String] {
      override def run(): String = {
        val fs = FileSystem.get(conf)
        val output = fs.create(path)
        val writer = new PrintWriter(output)
        try {
          writer.write("this is a test")
          writer.write("\n")
        } finally {
          writer.close()
          println("Closed!")
        }
        "done"
      }
    })
  }
}
I am able to log into the edge node, but when I try to write to HDFS (the doAs method) it throws the following error:
WARN Client: Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com#HUGH.COM
18/06/11 12:12:01 ERROR UserGroupInformation: PriviledgedActionException m12345#HUGH.COM (auth:KERBEROS) cause:java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com#HUGH.COM
18/06/11 12:12:01 ERROR UserGroupInformation: PriviledgedActionException as:m12345#HUGH.COM (auth:KERBEROS) cause:java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com#HUGH.COM; Host Details : local host is: "INMBP-m12345/172.29.155.52"; destination host is: "namenodename.hugh.com":8020;
Exception in thread "main" java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com#HUGH.COM; Host Details : local host is: "INMBP-m12345/172.29.155.52"; destination host is: "namenodename.hugh.com":8020
If I log into the edge node, do a kinit, and then access HDFS, it works fine. So why am I not able to access the HDFS namenode when I am able to log into the edge node?
Let me know if any more details are needed from my side.
The Hadoop Configuration object was set incorrectly. Below is what worked for me:
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenodename.hugh.com:8020")
conf.set("hadoop.security.authentication", "kerberos")
conf.set("hadoop.rpc.protection", "privacy") ***---(was missing this parameter)***
conf.set("dfs.namenode.kerberos.principal","hdfs/_HOST#HUGH.COM") ***---(this was initially wrongly set as dfs.namenode.kerberos.principal.pattern)***
I have a Spark (v1.2.1) job that inserts the contents of an RDD into Postgres using org.postgresql.Driver for Scala:
rdd.foreachPartition(iter => {
  // connect to the postgres database on the localhost
  val driver = "org.postgresql.Driver"
  var connection: Connection = null
  Class.forName(driver)
  connection = DriverManager.getConnection(url, username, password)
  val statement = connection.createStatement()
  iter.foreach(row => {
    val mapRequest = Utils.getInsertMap(row)
    val query = Utils.getInsertRequest(squares_table, mapRequest)
    try { statement.execute(query) }
    catch {
      case pe: PSQLException => println("exception caught: " + pe)
    }
  })
  connection.close()
})
In the above code I open a new connection to Postgres for each partition of the RDD and then close it. I think the right way to go would be to use a connection pool for Postgres that I can take connections from (as described here), but that is just pseudo-code:
rdd.foreachPartition { partitionOfRecords =>
  // ConnectionPool is a static, lazily initialized pool of connections
  val connection = ConnectionPool.getConnection()
  partitionOfRecords.foreach(record => connection.send(record))
  ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
What is the right way to connect to Postgres with a connection pool from Spark?
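One way to realize that pseudo-code is a lazily initialized pool object per executor JVM. A minimal sketch, assuming HikariCP is on the classpath; HikariCP, the pool size, and the placeholder URL/credentials are my assumptions, not part of the original post:
import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// Hedged sketch: one pool per executor JVM, created lazily on first use.
object ConnectionPool {
  private lazy val dataSource: HikariDataSource = {
    val config = new HikariConfig()
    config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb") // placeholder URL
    config.setUsername("username")                             // placeholder credentials
    config.setPassword("password")
    config.setMaximumPoolSize(8)
    new HikariDataSource(config)
  }

  def getConnection(): Connection = dataSource.getConnection()

  // Closing a pooled connection returns it to the pool.
  def returnConnection(connection: Connection): Unit = connection.close()
}

// Usage mirroring the question's loop (rdd, Utils and squares_table are from the question above):
rdd.foreachPartition { partitionOfRecords =>
  val connection = ConnectionPool.getConnection()
  val statement = connection.createStatement()
  try {
    partitionOfRecords.foreach { row =>
      statement.execute(Utils.getInsertRequest(squares_table, Utils.getInsertMap(row)))
    }
  } finally {
    statement.close()
    ConnectionPool.returnConnection(connection)
  }
}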
This code will work for Spark 2 or greater and Scala. First, you have to add the JDBC driver dependency.
If you are using Maven, add this dependency to your pom file:
<dependency>
  <groupId>postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>9.1-901-1.jdbc4</version>
</dependency>
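If you use sbt instead of Maven, the equivalent dependency (same coordinates as the pom snippet above) would be:
// build.sbt equivalent of the Maven dependency above
libraryDependencies += "postgresql" % "postgresql" % "9.1-901-1.jdbc4"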
Then write this code in a Scala file:
import org.apache.spark.sql.SparkSession
object PostgresConnection {
  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-Basic")
        .master("local[4]")
        .getOrCreate()

    val prop = new java.util.Properties
    prop.setProperty("driver", "org.postgresql.Driver")
    prop.setProperty("user", "username")
    prop.setProperty("password", "password")

    val url = "jdbc:postgresql://127.0.0.1:5432/databaseName"
    val df = spark.read.jdbc(url, "table_name", prop)

    df.show(5) // show() already prints the rows, so there is no need to wrap it in println
  }
}
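Since the original question was about inserting data, note that the same URL and properties can also be used to write a DataFrame back to Postgres. A small sketch; the target table name is a placeholder of mine:
// Hedged sketch: append the DataFrame read above to a (placeholder) Postgres table,
// reusing url and prop from the example.
df.write.mode("append").jdbc(url, "target_table_name", prop)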