I got this error when I tried to write a Spark DataFrame to a Postgres DB. I am using a local cluster, and the code is as follows:
from pyspark import SparkContext
from pyspark import SQLContext, SparkConf
import os
os.environ["SPARK_CLASSPATH"] = '/usr/share/java/postgresql-jdbc4.jar'
conf = SparkConf() \
    .setMaster('local[2]') \
    .setAppName("test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sc.parallelize([("a", "b", "c", "d")]).toDF()
url_connect = "jdbc:postgresql://localhost:5432"
table = "table_test"
mode = "overwrite"
properties = {"user":"postgres", "password":"12345678"}
df.write.option('driver', 'org.postgresql.Driver').jdbc(
    url_connect, table, mode, properties)
The error log is as follows:
Py4JJavaError: An error occurred while calling o119.jdbc.
: java.lang.NullPointerException
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I have tried searching the web for an answer but could not find one. Thank you in advance!
Have you tried specifying the database in your table name (table_test)? I have a similar implementation that looks like this:
mysqlUrl = "jdbc:mysql://mysql:3306"
properties = {
    'user': 'root',
    'password': 'password',
    'driver': 'com.mysql.cj.jdbc.Driver'
}
# Qualify the table with its database name
table = 'db_name.table_name'
try:
    schemaDF = spark.read.jdbc(mysqlUrl, table, properties=properties)
    print('schema DF loaded')
except Exception as e:
    # Any failure (connection, driver, missing table) ends up here
    print('schema DF does not exist!', e)
I had the same problem with MySQL.
I solved it by finding the right JDBC driver jar.
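If the jar is the suspect, a quick check before starting Spark can rule out a wrong path. This is only a minimal sketch: the helper name driver_classpath_conf is made up for illustration, and it assumes the driver is picked up through spark.driver.extraClassPath, which may not be enough for every deployment mode.

```python
import os


def driver_classpath_conf(jar_path):
    """Return Spark conf entries for a JDBC driver jar, or raise if the jar is missing.

    Hypothetical helper: it only checks that the file exists, not that the jar
    actually contains a driver matching your database version.
    """
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("JDBC driver jar not found: " + jar_path)
    # spark.driver.extraClassPath is a standard Spark setting; executors may
    # need the jar as well, depending on how the job is deployed.
    return {"spark.driver.extraClassPath": os.path.abspath(jar_path)}
```

The returned dictionary can be applied to a SparkConf with conf.set(key, value) for each entry before the SparkContext is created.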
I have followed the instructions from this posting to read data from an existing Postgres database with a table named "objects", as defined and created by the Objects class in SQLAlchemy. In my Jupyter notebook, my code is
from pyspark import SparkContext
from pyspark import SparkConf
from random import random
#spark conf
conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('pyspark')
sc = SparkContext(conf=conf)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
properties = {
"driver": "org.postgresql.Driver"
}
url = 'jdbc:postgresql://PG_USER:PASSWORD#PG_SERVER_IP/db_name'
df = sqlContext.read.jdbc(url=url, table='objects', properties=properties)
The last line results in the following:
Py4JJavaError: An error occurred while calling o25.jdbc.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:158)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
So it looks like it can't resolve the table. How do I test from here to make sure that I am connected to the database properly?
Problems with name resolution are indicated by org.postgresql.util.PSQLException and don't result in an NPE. The source of the issue is actually the connection string, in particular the way you provide user credentials. At first glance it looks like a bug, but if you're looking for a quick solution you can either use URL properties:
url = 'jdbc:postgresql://PG_SERVER_IP/db_name?user=PG_USER&password=PASSWORD'
or the properties argument:
properties = {
"user": "PG_USER",
"password": "PASSWORD",
"driver": "org.postgresql.Driver"
}
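Both options can be put behind small helpers so the credentials never end up in the host part of the URL. This is a rough sketch, not the driver's own API: the function names and the default port 5432 are assumptions, and the percent-encoding only matters when the user name or password contains characters such as & or @.

```python
from urllib.parse import quote, urlencode


def postgres_jdbc_url(host, database, user=None, password=None, port=5432):
    """Build a PostgreSQL JDBC URL, optionally passing credentials as URL parameters."""
    url = "jdbc:postgresql://{}:{}/{}".format(host, port, quote(database))
    params = {}
    if user is not None:
        params["user"] = user
    if password is not None:
        params["password"] = password
    if params:
        # Percent-encode values so special characters do not break the query string
        url += "?" + urlencode(params, quote_via=quote)
    return url


def jdbc_properties(user, password, driver="org.postgresql.Driver"):
    """Build the properties dictionary for the second option."""
    return {"user": user, "password": password, "driver": driver}
```

With the second option you would call postgres_jdbc_url(host, database) without credentials and pass jdbc_properties(...) as the properties argument of read.jdbc.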
I am trying to push data into an existing Hive table. I have already created an ORC table in Hive, but I am not able to push data into it. This code works if I copy and paste it into the Spark console, but it does not run with spark-submit.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object TestCode {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("first example").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    for (i <- 0 to 100 - 1) {
      // Sample values standing in for the business logic; the loop represents that logic and tries to push each row into the table.
      var fstring = "fstring" + i
      var cmd = "cmd" + i
      var idpath = "idpath" + i
      import sqlContext.implicits._
      val sDF = Seq((fstring, cmd, idpath)).toDF("t_als_s_path", "t_als_s_cmd", "t_als_s_pd")
      sDF.write.insertInto("l_sequence");
      //sDF.write.format("orc").saveAsTable("l_sequence");
      println("write data ==> " + i)
    }
  }
}
It gives the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: l_sequence;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:449)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:455)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:453)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:443)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:259)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:239)
at com.hq.bds.Helloword$$anonfun$main$1.apply$mcVI$sp(Helloword.scala:16)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.hq.bds.Helloword$.main(Helloword.scala:10)
at com.hq.bds.Helloword.main(Helloword.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
You need to link hive-site.xml with the Spark conf, or copy hive-site.xml into the Spark conf directory. Spark is not able to find your Hive metastore (by default a Derby database), so we have to link the Hive conf to the Spark conf directory.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your hive-site.xml file to Spark’s configuration directory ($SPARK_HOME/conf). If you don’t have an existing Hive installation, Spark SQL will still run.
Copy hive-site.xml to the Spark conf directory as root:
sudo cp /etc/hive/conf/hive-site.xml /etc/spark/conf
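To confirm the copy landed where Spark looks, something like the following can be run on the machine that submits the job. It is a small sketch with a hypothetical helper name; it assumes the configuration directory is the conf folder under your Spark home (for the paths above that would be /etc/spark).

```python
import os


def find_hive_site(spark_home):
    """Return the path to hive-site.xml under <spark_home>/conf, or None if it is absent."""
    candidate = os.path.join(spark_home, "conf", "hive-site.xml")
    return candidate if os.path.isfile(candidate) else None
```

For example, find_hive_site(os.environ.get("SPARK_HOME", "/etc/spark")) returning None would suggest that Spark is still running without your Hive metastore settings.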
I would like to read a conf file into my Spark application. The conf file is located in a directory on the Hadoop edge node.
omega.conf
username = "surrender"
location = "USA"
My Spark code:
package com.test.spark
import org.apache.spark.{SparkConf, SparkContext}
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
object DemoMain {
  def main(args: Array[String]): Unit = {
    println("Lets Get Started ")
    val conf = new SparkConf().setAppName("SIMPLE")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val conf_loc = "/home/cloudera/localinputfiles/omega.conf"
    loadConfigFile(conf_loc)
  }
  def loadConfigFile(loc: String): Unit = {
    val config = ConfigFactory.parseFile(new File(loc))
    val username = config.getString("username")
    println(username)
  }
}
I am running this Spark application using spark-submit:
spark-submit --class com.test.spark.DemoMain --master local /home/cloudera/dev/jars/spark_examples.jar
The Spark job starts, but it throws the error below. It says: No configuration setting found for key 'username'
17/03/29 12:57:37 INFO SparkContext: Created broadcast 0 from textFile at DemoMain.scala:25
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'username'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:197)
at com.test.spark.DemoMain$.loadConfigFile(DemoMain.scala:53)
at com.test.spark.DemoMain$.main(DemoMain.scala:27)
at com.test.spark.DemoMain.main(DemoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Please help me fix this issue.
I just tried it and it works fine. I tested it with the code below:
val config=ConfigFactory.parseFile(new File("/home/sandy/my.conf"))
println("::::::::::::::::::::"+config.getString("username"))
and the conf file is
username = "surrender"
location = "USA"
Please check the location of your file by printing it.
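As far as I know, ConfigFactory.parseFile returns an empty config when the file is missing under its default parse options, which would give exactly this "No configuration setting found" error. A quick way to check, sketched here in Python with a made-up helper name, is to print where the path resolves and which keys the file defines. The parsing is deliberately simplistic: it only understands flat key = "value" lines and is not a HOCON parser.

```python
import os


def describe_config_file(path):
    """Return (absolute_path, keys); keys is None when the file does not exist.

    Simplified illustration: only flat `key = value` lines are recognised.
    """
    resolved = os.path.abspath(path)
    if not os.path.isfile(resolved):
        return resolved, None
    keys = []
    with open(resolved, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines
            if not line or line.startswith("#") or line.startswith("//"):
                continue
            if "=" in line:
                keys.append(line.split("=", 1)[0].strip())
    return resolved, keys
```

If keys comes back as None, the path is wrong on the machine where the driver runs; if it comes back without username, the file content is the problem.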
Please guide me through the steps to connect to and read data from MS SQL Server using PySpark.
Below is my code and the error message I get when I try to load data from MS SQL Server.
import urllib
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
APP_NAME = 'My Spark Application'
conf = SparkConf().setAppName("APP_NAME").setMaster("local[4]")
sc = SparkContext(conf=conf)
sqlcontext = SQLContext(sc)
jdbcDF = sqlcontext.read.format("jdbc")\
    .option("url", "jdbc:sqlserver:XXXX:1433")\
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")\
    .option("dbtable", "dbo.XXXX")\
    .option("user", "XXXX")\
    .option("password", "XXX")\
    .load()
******************************ERROR***************************************
teway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\spark-2.0.1-bin-hadoop2.6\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark-2.0.1-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o66.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:167)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
The following solution worked for me:
Copy the mssql-jdbc-7.0.0.jre8.jar file into Spark's jars sub-folder (e.g. C:\spark\spark-2.2.2-bin-hadoop2.7\jars), or use whichever version of the driver jar matches your system.
Then use the following command to connect to MS SQL Server and create the Spark DataFrame:
dbData = spark.read.jdbc("jdbc:sqlserver://servername;databaseName=ExampleDB;user=username;password=password", "tablename")
Download mssql-jdbc-x.x.x.jrex.jar file (https://learn.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-ver15)
Run the following code:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
appName = "PySpark SQL Server Example - via JDBC"
master = "local[*]"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath","path/to/mssql-jdbc-x.x.x.jrex.jar")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession
database = "mydatabase"
table = "dbo.mytable"
user = "username"
password = "password"
jdbcDF = spark.read.format("jdbc") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", f"jdbc:sqlserver://serverip:1433;databaseName={database}") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()
jdbcDF.show()
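Note that the URL in the question (jdbc:sqlserver:XXXX:1433) has no // after the scheme, and connection properties are written as key=value pairs separated by semicolons. A tiny builder makes that shape harder to get wrong. This is only a sketch: sqlserver_jdbc_url is a made-up name, and it does not escape values that themselves contain semicolons or braces.

```python
def sqlserver_jdbc_url(server, port=1433, **properties):
    """Build a URL of the form jdbc:sqlserver://server:port;key=value;key=value.

    Hypothetical helper for illustration; values are inserted as-is.
    """
    url = "jdbc:sqlserver://{}:{}".format(server, port)
    for key, value in properties.items():
        url += ";{}={}".format(key, value)
    return url
```

For the code above, sqlserver_jdbc_url("serverip", databaseName=database) produces the same string as the f-string passed to the url option.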
I have written a sample Spark program in Scala to count the number of lines of a text file present in Amazon S3. Below is my sample program.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.util.{Map => JMap}
import org.apache.hadoop.conf.Configuration
object CountLines {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CountLines").setMaster("local"))
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId","ABC");
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey","XYZ");
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId","ABC");
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","XYZ");
    sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    val path ="s3:///my-bucket/test/test.txt";
    println("num lines: " + countLines(sc, path));
  }
  def countLines(sc: SparkContext, path: String): Long = {
    sc.textFile(path).count();
  }
}
Unfortunately, I am getting an IllegalArgumentException, which seems to have something to do with credentials. Below is the stack trace.
Exception in thread "main" java.lang.IllegalArgumentException: Invalid hostname in URI s3:/my-bucket/test/test.txt
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
I have given valid credentials. I package this as a JAR file and run it on the cluster using the spark-submit command. I am not sure if this is the right way to set the access key and secret key in Spark. I have tried different approaches but nothing seems to work. Any light on this issue would be highly appreciated.
Thanks,
J Joseph
You have an extra slash. You have to change s3:///my-bucket/test/test.txt to s3://my-bucket/test/test.txt.
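The reason the extra slash matters: with three slashes the host part of the URI (where the bucket name goes) is empty, and my-bucket becomes the first path segment. That matches the "Invalid hostname in URI" message. You can see the same split with Python's standard URL parser; the helper name s3_bucket below is just for illustration.

```python
from urllib.parse import urlparse


def s3_bucket(uri):
    """Return the bucket (host part) of an S3 URI, or raise if it is empty."""
    parsed = urlparse(uri)
    if not parsed.netloc:
        # e.g. s3:///my-bucket/... parses with an empty host and the bucket in the path
        raise ValueError("no bucket (hostname) in URI: " + uri)
    return parsed.netloc
```

Calling s3_bucket on the corrected path returns my-bucket, while the original three-slash path raises the ValueError.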