pyspark-hbase: Failed to find data source: org.apache.phoenix.spark - pyspark

When I use PySpark to read from HBase, I want to fetch the data through Phoenix. This is my code:
df = spark.read \
    .format("org.apache.phoenix.spark") \
    .option("table", table) \
    .option("zkUrl", phoenixurl) \
    .load()
but I get the following error:
Py4JJavaError: An error occurred while calling o137.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
This has been bothering me. How can I fix it? Thanks very much.
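The ClassNotFoundException means the org.apache.phoenix.spark data source is not on Spark's classpath. One common approach (a sketch, not a confirmed fix for this setup: the jar path below is a placeholder, and the exact jar name varies by Phoenix version and distribution) is to pass the Phoenix client jar to spark-submit and put it on both the driver and executor classpaths:
spark-submit \
  --jars /path/to/phoenix-client.jar \
  --conf spark.driver.extraClassPath=/path/to/phoenix-client.jar \
  --conf spark.executor.extraClassPath=/path/to/phoenix-client.jar \
  your_script.py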

Related

Error when save data with glueContext.write_dynamic_frame.from_options

When I finished the data processing, I needed to save the data back to the Postgres DB.
I used the same JDBC properties that I had used for reading the data out, but I faced an error. I can insert the data in DataGrip with the same config, and that JDBC config is not read-only; I can write via DataGrip with the same config.
I cannot insert the data with the logic below:
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(dataFrame, glueContext, "tableName"),
    connection_type="postgresql",
    connection_options={
        "url": jdbcConfig["url"],
        "user": jdbcConfig["user"],
        "password": jdbcConfig["password"],
        "dbtable": "tableName"
    }
)
Below is the error I faced (taken from the CloudWatch log):
ERROR [Thread-9] postgresql.Driver (Driver.java:connect(233)): Error in url: jdbc:postgresql://cc2-data-analytics-cluster-ndec.cluster-abcdefg.us-west-2.rds.amazonaws.com:5440
2022-04-13 22:36:26,838 INFO [Thread-9] log.GlueLogger (GlueLogger.scala:info(8)): Exception as An error occurred while calling o203.pyWriteDynamicFrame.
: java.lang.Exception: Connection can not be established. Please check your JDBC url, username and password.
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$12.apply(JDBCUtils.scala:968)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$12.apply(JDBCUtils.scala:964)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$8.apply(JDBCUtils.scala:920)
at scala.Option.getOrElse(Option.scala:121)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:920)
at scala.Option.getOrElse(Option.scala:121)
at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:920)
at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:963)
at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:734)
at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:734)
at com.amazonaws.services.glue.util.JDBCWrapper.writeDF(JDBCUtils.scala:883)
at com.amazonaws.services.glue.sinks.PostgresDataSink.writeDynamicFrame(PostgresDataSink.scala:41)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I tried another method, but it failed with the same error as above:
dataFrame.select(*igDataDfSchema).write.format("jdbc") \
    .option("url", jdbcConfig['url']) \
    .option("driver", "org.postgresql.Driver").option("dbtable", "tableName") \
    .option("user", jdbcConfig['user']).option("password", jdbcConfig['password']).save()

Reading data from snowflake using pyspark throws SnowflakeLoggedFeatureNotSupportedException

Good day,
I am using sample code from the Snowflake documentation on using PySpark to connect to it:
sfparams = {
    "sfURL": "SOME_URL",
    "sfUser": "SOME_USER",
    "sfPassword": "SOME_PASSWORD",
    "sfDatabase": "SOME_DB",
    "sfSchema": "SOME_SCHEMA",
    "sfWarehouse": "SOME_WH",
    "sfRole": "sysadmin"
}
df = self.spark_sql_context \
    .read \
    .format('snowflake') \
    .options(**sfparams) \
    .option('query', "SELECT * FROM TABLE1 LIMIT 10") \
    .load()
df.show(truncate=False)
I have downloaded the required jar files (snowflake-jdbc-3.9.2.jar and spark-snowflake_2.11-2.9.3-spark_2.4.jar) and put them in the Spark jars directory. I have also added the following to the Spark config:
.set("spark.jars", "/path_to/spark-2.4.5-bin-without-hadoop/jars/snowflake-jdbc-3.4.2.jar") \
.set("spark.jars", "/path_to/spark-2.4.5-bin-without-hadoop/jars/spark-snowflake_2.11-2.9.3-spark_2.4.jar")
However, whenever I try to run the code above, the following exception shows up:
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/SnowflakeLoggedFeatureNotSupportedException
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:68)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: net.snowflake.client.jdbc.SnowflakeLoggedFeatureNotSupportedException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
I couldn't find anything anywhere on how to deal with it, so here I am.
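Two things stand out in the config above (observations, not a confirmed diagnosis): calling .set("spark.jars", ...) twice keeps only the second value, because the second call overwrites the first, and the first path names snowflake-jdbc-3.4.2.jar while the downloaded jar is 3.9.2. A sketch that passes both jars in a single comma-separated value (paths as given in the question):
from pyspark import SparkConf

conf = SparkConf().set("spark.jars", ",".join([
    "/path_to/spark-2.4.5-bin-without-hadoop/jars/snowflake-jdbc-3.9.2.jar",
    "/path_to/spark-2.4.5-bin-without-hadoop/jars/spark-snowflake_2.11-2.9.3-spark_2.4.jar"
]))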
Educated guess: since the query is simply .option('query', "SELECT * FROM TABLE1 LIMIT 10"), the table may contain a column with an unsupported data type such as BLOB/BINARY. If that is the case, an explicit column list that omits such columns will help.
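For example (COL1 and COL2 are hypothetical column names):
.option('query', "SELECT COL1, COL2 FROM TABLE1 LIMIT 10")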
Using the Spark Connector - From Snowflake to Spark SQL

Connect to AWS Postgres using Databricks

I'm having issues connecting to AWS Postgres from Azure Databricks. I am new to Azure; below is the code I am using to connect to Postgres, but it's throwing an error.
error: org.postgresql.util.PSQLException: The connection attempt failed.
Code:
jdbc_url="jdbc:postgresql://postgreshost:5432/db?user={}&password={}&ssl=true.format(username,password)"
pushdown_query = "(select * from test limit 10) emp_alias"
df = spark.read.jdbc(url=jdbc_url, table="test")
display(df)
2nd method:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://postgreshost:5432/db?user=user&password=password") \
    .option("dbtable", "test") \
    .load()
Am I missing anything, or should I follow any other steps prior to execution?
Log using Scala:
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:275)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
at org.postgresql.Driver.makeConnection(Driver.java:450)
at org.postgresql.Driver.connect(Driver.java:252)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:55)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:298)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:202)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-3334328075204474:8)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-3334328075204474:51)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw$$iw.<init>(command-3334328075204474:53)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw.<init>(command-3334328075204474:55)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw.<init>(command-3334328075204474:57)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw.<init>(command-3334328075204474:59)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read.<init>(command-3334328075204474:61)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$.<init>(command-3334328075204474:65)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$.<clinit>(command-3334328075204474)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$eval$.$print$lzycompute(<notebook>:7)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$eval$.$print(<notebook>:6)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:199)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:587)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:542)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:324)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:304)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:45)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:268)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:45)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:304)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:584)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:475)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:542)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:381)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:328)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.postgresql.core.PGStream.<init>(PGStream.java:68)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:144)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
at org.postgresql.Driver.makeConnection(Driver.java:450)
at org.postgresql.Driver.connect(Driver.java:252)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:55)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:298)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:202)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-3334328075204474:8)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-3334328075204474:51)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw$$iw.<init>(command-3334328075204474:53)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw$$iw.<init>(command-3334328075204474:55)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw$$iw.<init>(command-3334328075204474:57)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$$iw.<init>(command-3334328075204474:59)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read.<init>(command-3334328075204474:61)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$.<init>(command-3334328075204474:65)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$read$.<clinit>(command-3334328075204474)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$eval$.$print$lzycompute(<notebook>:7)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$eval$.$print(<notebook>:6)
at lined9bdaa60f31e4f44a370d2ec7ae9793627.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:199)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:587)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:542)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:189)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:324)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$7.apply(DriverLocal.scala:304)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:45)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:268)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:45)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:304)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:584)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:475)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:542)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:381)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:328)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215)
at java.lang.Thread.run(Thread.java:748)
I've never provided the username and password in the connection URL, so I am not sure that works. Usually they are specified as separate options. Checking the Spark docs, the connection is specified like this (in Scala):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
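The PySpark equivalent, with the same placeholder values, would be:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://postgreshost:5432/db") \
    .option("dbtable", "test") \
    .option("user", username) \
    .option("password", password) \
    .load()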
It turned out to be an internal company issue, nothing to do with the code.

pyspark - Spark hbase connector throws failed to find data source

I'm trying to connect to HBase from PySpark using the SHC API, following the link below:
https://community.hortonworks.com/questions/143802/read-hbase-with-pyspark-from-jupyter-notebook.html
Sample code:
spark = SparkSession.builder.appName("Hbase Read").getOrCreate()
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"table"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"cf", "col":"col1", "type":"int"}
    }
}""".split())
df = spark.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()
df.show()
Spark-submit:
spark-submit --packages com.hortonworks:shc:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ TestHbaseRead.py
I'm getting the error below.
Error log:
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.execution.datasources.hbase. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.hbase.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
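A hedged guess based on the root cause (ClassNotFoundException for org.apache.spark.sql.execution.datasources.hbase.DefaultSource): that class ships in the shc-core artifact rather than shc, so the --packages coordinate may be the problem. A sketch with the artifact name swapped, keeping the same version tag as in the question:
spark-submit --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ TestHbaseRead.py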

Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':

I'm trying to connect to Hive through IntelliJ. I'm using Scala version 2.11.4, and the versions of spark-core, spark-hive, and spark-sql are 2.1.1. Here is the code snippet I'm using to connect remotely from my Windows machine. While connecting I'm getting the following error; can someone help me address this issue?
Note: some threads I read mentioned checking the permissions of the tmp directory, in this case /tmp/hive/warehouse; it has the appropriate permissions for the user xyz I'm connecting with. Using this functional ID, I'm able to connect manually from one of the Unix servers. I even tried spark.sql("show databases"), but I got the same error.
def main(args: Array[String]): Unit = {
  createKerberosTicket()
  val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("SparkHiveTest")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("hive.exec.dynamic.partition", "true")
      .config("mapreduce.job.queuename", "root.XYZ_Pool")
      .enableHiveSupport()
      .getOrCreate()
  }
  spark.sparkContext.hadoopConfiguration.addResource(new Path("core-site.xml"))
  spark.sparkContext.hadoopConfiguration.addResource(new Path("hdfs-site.xml"))
  spark.sparkContext.hadoopConfiguration.addResource(new Path("hive-site.xml"))
  spark.sparkContext.hadoopConfiguration.set("fs.hdfs.impl", classOf[DistributedFileSystem].getName)
  spark.sparkContext.hadoopConfiguration.set("fs.file.impl", classOf[LocalFileSystem].getName)
  val listOfDBs = spark.sqlContext.sql("show databases")
}
18/05/02 23:59:13 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/tmp/hive/warehouse').
18/05/02 23:59:13 INFO SharedState: Warehouse path is '/tmp/hive/warehouse'.
18/05/02 23:59:14 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
18/05/02 23:59:14 INFO metastore: Trying to connect to metastore with URI thrift://xyz.net:1234
18/05/02 23:59:14 INFO metastore: Connected to metastore.
18/05/02 23:59:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
at spark.SparkPlusHive$.main(SparkPlusHive.scala:25)
at spark.SparkPlusHive.main(SparkPlusHive.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 12 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
... 17 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
... 25 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:358)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:66)
... 30 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
... 38 more
Caused by: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:470)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:510)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:488)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:309)
at org.apache.hadoop.hive.ql.session.SessionState.createPath(SessionState.java:639)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:567)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 39 more
18/05/02 23:59:13 INFO SharedState: spark.sql.warehouse.dir is not set
From this log line it is clear that your spark.sql.warehouse.dir is not set; please set it to resolve the issue:
val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val spark = SparkSession
.builder()
.appName("***")
.master("***")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
Hope this helps.
In my situation, it worked after following these steps:
Download winutils.exe.
Set the environment variable HADOOP_HOME.
Create the directory C:\tmp\hive, which is used as the value of the Spark property spark.sql.warehouse.dir.
Change the permissions of the directory C:\tmp\hive to 777 (see the sketch below).
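A sketch of the corresponding commands on Windows, assuming winutils.exe was placed under C:\hadoop\bin (that location is an assumption; use wherever you put it):
set HADOOP_HOME=C:\hadoop
mkdir C:\tmp\hive
rem grant full permissions on the Hive scratch dir via winutils
%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive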