I'm trying to connect to HBase from PySpark using the SHC API, following the link below.
https://community.hortonworks.com/questions/143802/read-hbase-with-pyspark-from-jupyter-notebook.html
Sample code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Hbase Read").getOrCreate()
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"table"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"cf", "col":"col1", "type":"int"}
    }
}""".split())

df = spark.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()
df.show()
Spark-submit:
spark-submit --packages com.hortonworks:shc:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ TestHbaseRead.py
I'm getting this error.
Error log:
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.execution.datasources.hbase. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.hbase.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
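As a side note, I understand the same dependency can also be requested from inside the script via spark.jars.packages instead of (or in addition to) the --packages flag; a minimal sketch, assuming the same coordinates and repository as the spark-submit command above (it only takes effect if set before the SparkSession/JVM is created):
from pyspark.sql import SparkSession

# Sketch only: coordinates and repository copied from the spark-submit command above.
spark = (SparkSession.builder
         .appName("Hbase Read")
         .config("spark.jars.packages", "com.hortonworks:shc:1.1.1-2.1-s_2.11")
         .config("spark.jars.repositories",
                 "http://repo.hortonworks.com/content/groups/public/")
         .getOrCreate())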
When I finish the data processing, I need to save the data back to the Postgres DB.
I use the same JDBC properties that I used to read the data out, but I get an error. With the same config I can insert data via DataGrip, so the JDBC config is not read-only.
I cannot insert the data with the logic below:
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(dataFrame, glueContext, "tableName"),
    connection_type="postgresql",
    connection_options={
        "url": jdbcConfig["url"],
        "user": jdbcConfig["user"],
        "password": jdbcConfig["password"],
        "dbtable": "tableName"
    }
)
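For reference, jdbcConfig above is just a dict with the usual Spark JDBC fields; a hypothetical sketch with placeholder values (the standard PostgreSQL JDBC URL form is jdbc:postgresql://host:port/database):
# Placeholder values only -- the real endpoint and credentials come from my configuration.
jdbcConfig = {
    "url": "jdbc:postgresql://<host>:5440/<database>",
    "user": "<user>",
    "password": "<password>",
}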
And below is the error I faced (from the CloudWatch log):
ERROR [Thread-9] postgresql.Driver (Driver.java:connect(233)): Error in url: jdbc:postgresql://cc2-data-analytics-cluster-ndec.cluster-abcdefg.us-west-2.rds.amazonaws.com:5440
2022-04-13 22:36:26,838 INFO [Thread-9] log.GlueLogger (GlueLogger.scala:info(8)): Exception as An error occurred while calling o203.pyWriteDynamicFrame.
: java.lang.Exception: Connection can not be established. Please check your JDBC url, username and password.
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$12.apply(JDBCUtils.scala:968)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$12.apply(JDBCUtils.scala:964)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$8.apply(JDBCUtils.scala:920)
at scala.Option.getOrElse(Option.scala:121)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:920)
at scala.Option.getOrElse(Option.scala:121)
at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:920)
at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:963)
at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:734)
at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:734)
at com.amazonaws.services.glue.util.JDBCWrapper.writeDF(JDBCUtils.scala:883)
at com.amazonaws.services.glue.sinks.PostgresDataSink.writeDynamicFrame(PostgresDataSink.scala:41)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I also tried another method, but it fails with the same error:
dataFrame.select(*igDataDfSchema).write.format("jdbc") \
    .option("url", jdbcConfig['url']) \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "tableName") \
    .option("user", jdbcConfig['user']) \
    .option("password", jdbcConfig['password']) \
    .save()
Good day,
I am using sample code from the Snowflake documentation for connecting to Snowflake from PySpark:
sfparams = {
    "sfURL": "SOME_URL",
    "sfUser": "SOME_USER",
    "sfPassword": "SOME_PASSWORD",
    "sfDatabase": "SOME_DB",
    "sfSchema": "SOME_SCHEMA",
    "sfWarehouse": "SOME_WH",
    "sfRole": "sysadmin"
}

df = self.spark_sql_context \
    .read \
    .format('snowflake') \
    .options(**sfparams) \
    .option('query', "SELECT * FROM TABLE1 LIMIT 10") \
    .load()
df.show(truncate=False)
I have downloaded the required jar files (snowflake-jdbc-3.9.2.jar and spark-snowflake_2.11-2.9.3-spark_2.4.jar) and put them in the Spark jars directory. I have also added the following to the Spark config:
.set("spark.jars", "/path_to/spark-2.4.5-bin-without-hadoop/jars/snowflake-jdbc-3.4.2.jar") \
.set("spark.jars", "/path_to/spark-2.4.5-bin-without-hadoop/jars/spark-snowflake_2.11-2.9.3-spark_2.4.jar")
However, whenever I try to run the code above, the following exception shows up:
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/SnowflakeLoggedFeatureNotSupportedException
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:68)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: net.snowflake.client.jdbc.SnowflakeLoggedFeatureNotSupportedException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
I couldn't find anything anywhere on how to deal with this, so here I am.
Educated guess: since the query is just SELECT * FROM TABLE1 LIMIT 10, the table may contain a column with an unsupported data type such as BLOB/BINARY. If that is the case, an explicit column list that omits the offending column should help (see the sketch below).
Using the Spark Connector - From Snowflake to Spark SQL
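A minimal sketch of that workaround, reusing the read from the question (the column names here are hypothetical):
# Hypothetical column names -- list every column except the BLOB/BINARY one.
df = self.spark_sql_context \
    .read \
    .format('snowflake') \
    .options(**sfparams) \
    .option('query', "SELECT COL_A, COL_B, COL_C FROM TABLE1 LIMIT 10") \
    .load()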
We are doing PySpark development (Cloudera), and inside PySpark we are using the Spark SQL engine to migrate Greenplum to Hive SQL. Some of the calls are failing. Please suggest.
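For context, the failing call is DataFrameWriter.insertInto; a minimal sketch of how such a call typically looks in PySpark (database/table names are hypothetical, and the target Hive table must already exist):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("greenplum-to-hive").getOrCreate()

# Hypothetical source/target names, just to show the shape of the failing call.
df = spark.sql("SELECT * FROM staging_db.source_table")
df.write.mode("overwrite").insertInto("target_db.target_table")
The error and stack trace: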
An error occurred while calling o72.insertInto. :
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:255)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:135)
at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:201)
at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:195)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:71)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:195)
at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:179)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:215)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:230)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:684)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:334)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:320)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I am getting a strange error while saving a dataframe to AWS S3.
df.coalesce(1).write.mode(SaveMode.Overwrite)
  .json(s"s3://myawsacc/results/")
From spark-shell I was able to write data to the same location, and it works:
spark.sparkContext.parallelize(1 to 4).toDF.write.mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .save(s"s3://myawsacc/results/")
My question is: why does it work in spark-shell but not via spark-submit? Is there any logic/property/configuration behind this?
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.amazon.ws.emr.hadoop.fs.s3n.S3Credentials.initialize(S3Credentials.java:45)
at com.amazon.ws.emr.hadoop.fs.HadoopConfigurationAWSCredentialsProvider.&lt;init&gt;(HadoopConfigurationAWSCredentialsProvider.java:26)
at com.amazon.ws.emr.hadoop.fs.guice.DefaultAWSCredentialsProviderFactory.getAwsCredentialsProviderChain(DefaultAWSCredentialsProviderFactory.java:44)
at com.amazon.ws.emr.hadoop.fs.guice.DefaultAWSCredentialsProviderFactory.getAwsCredentialsProvider(DefaultAWSCredentialsProviderFactory.java:28)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.getAwsCredentialsProvider(EmrFSProdModule.java:70)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createS3Configuration(EmrFSProdModule.java:86)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createAmazonS3LiteClient(EmrFSProdModule.java:80)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createAmazonS3Lite(EmrFSProdModule.java:120)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSBaseModule.provideAmazonS3Lite(EmrFSBaseModule.java:99)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ProviderMethod.get(ProviderMethod.java:104)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:40)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1031)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.Scopes$1$1.get(Scopes.java:65)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:40)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.SingleFieldInjector.inject(SingleFieldInjector.java:53)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.MembersInjectorImpl.injectMembers(MembersInjectorImpl.java:110)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:94)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:54)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1009)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)
at com.org.ComparatorUtil$.writeLogNError(ComparatorUtil.scala:245)
at com.org.ComparatorUtil$.writeToJson(ComparatorUtil.scala:161)
at com.org.comparator.SnowFlakeTableComparator$.mainExecutor(SnowFlakeTableComparator.scala:98)
at com.org.executor.myclass$$anonfun$main$4$$anonfun$apply$1.apply(myclass.scala:232)
at com.org.executor.myclass$$anonfun$main$4$$anonfun$apply$1.apply(myclass.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.org.executor.myclass$$anonfun$main$4.apply(myclass.scala:153)
at com.org.executor.myclass$$anonfun$main$4.apply(myclass.scala:134)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.org.executor.myclass$.main(myclass.scala:134)
at com.org.executor.myclass.main(myclass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: URI is not absolute
at java.net.URI.toURL(URI.java:1088)
at org.apache.hadoop.fs.http.AbstractHttpFileSystem.open(AbstractHttpFileSystem.java:60)
at org.apache.hadoop.fs.http.HttpFileSystem.open(HttpFileSystem.java:23)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)
at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
at java.net.URL.openStream(URL.java:1045)
at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.core.JsonFactory._optimizedStreamFromURL(JsonFactory.java:1479)
at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:779)
at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2679)
at com.amazon.ws.emr.hadoop.fs.util.PlatformInfo.getClusterIdFromConfigurationEndpoint(PlatformInfo.java:39)
at com.amazon.ws.emr.hadoop.fs.util.PlatformInfo.getJobFlowId(PlatformInfo.java:53)
at com.amazon.ws.emr.hadoop.fs.util.EmrFsUtils.getJobFlowId(EmrFsUtils.java:384)
at com.amazon.ws.emr.hadoop.fs.util.EmrFsUtils.&lt;clinit&gt;(EmrFsUtils.java:60)
... 77 more
import java.net.URI
import org.apache.spark.sql.SaveMode
import spark.implicits._

spark.sparkContext.parallelize(1 to 4).toDF
  .coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .json(new URI("s3://myawsacc/results/").toString)

spark.sparkContext.parallelize(1 to 4).toDF
  .coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .json(URI.create("s3://myawsacc/results/").toString)
Both of these work fine for me.
It seems spark-shell implicitly applies new URI or URI.create, which is why the original code was working there.
From a remote Scala program, using Spark 1.3, how do I initialize the SparkContext so that I can connect to Spark running on YARN? That is, where do I put the address of the YARN node(s)?
Currently my program contains:
val conf = new SparkConf().setMaster("yarn-client").setAppName("MyApp")
val sc = new SparkContext(conf)
and it yields
[error] (run-main-0) java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:104)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:179)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:310)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)
at SparkExampleLocalDriver$.main(SparkExample.scala:9)
at SparkExampleLocalDriver.main(SparkExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:217)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:212)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:104)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:179)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:310)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)
at SparkExampleLocalDriver$.main(SparkExample.scala:9)
at SparkExampleLocalDriver.main(SparkExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:195)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:212)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:104)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:179)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:310)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)
at SparkExampleLocalDriver$.main(SparkExample.scala:9)
at SparkExampleLocalDriver.main(SparkExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
Your Spark binary doesn't contain the YARN-related classes.
Use a pre-built binary for Hadoop:
http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.4.tgz
If you are compiling from source, include the yarn and hadoop profiles:
./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0