Getting an error trying to access a Hive table with JsonSerDe format through a PySpark notebook

CREATE EXTERNAL TABLE `table_name` (
  data string COMMENT 'from deserializer')
PARTITIONED BY (date string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('paths'='')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://'
TBLPROPERTIES ('classification'='json')
I am trying to read this table using PySpark, but I get the error below:
table = spark.table('table_name')
table.show(5, False)
: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$2(HiveTableScanExec.scala:210)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2524)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:210)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:326)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:444)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:430)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3733)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2762)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
However, this table is accessible from Athena.
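The ClassNotFoundException suggests the OpenX JSON SerDe jar is simply not on the Spark classpath for this session. As a hedged sketch (the jar path below is a placeholder, not something from the question), one common way to make the class visible is to attach the jar before reading the table:

# Sketch only: the jar location is a placeholder and must point at your copy of
# the OpenX Hive-JSON-Serde ("json-serde-*-jar-with-dependencies.jar").
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    # ship the SerDe jar with the session so the Hive table scan can load the class
    .config("spark.jars", "/path/to/json-serde-1.3.8-jar-with-dependencies.jar")
    .getOrCreate()
)

# or register it at runtime before querying
spark.sql("ADD JAR /path/to/json-serde-1.3.8-jar-with-dependencies.jar")

table = spark.table("table_name")
table.show(5, False)

The same jar can also be passed with --jars when submitting the job, or installed on the cluster so both Hive and Spark see it; Athena bundles its own copy of this SerDe, which is why the query works there.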

Related

AWS Glue Studio RDS -> Redshift invalid timestamp format

I am trying to create an AWS Glue ETL job to move data from Aurora RDS to Redshift, but I cannot work out how to get the timestamp fields mapped properly. All stages of the job show a valid preview of the expected data, but the job always fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o179.pyWriteDynamicFrame.
: java.sql.SQLException:
Error (code 1206) while loading data into Redshift: "Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]"
Table name: public.stage_table_ae89e9dffe974b649bbf4852e49a4b12
Column name: updated_at
Column type: timestamp(0)
Raw line: 1234,5341,1121,0,2022-01-06 16:29:55.000000000,2022-01-06 16:29:55.000000000,1,1,Suzy
Raw field value: 0
I have tried reformatting the dates to remove the microseconds and forcing quotes around the date fields; nothing works.
Here is the generated script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrameCollection
from awsglue.dynamicframe import DynamicFrame
from awsglue import DynamicFrame
# Script generated for node Custom transform
def CastIntsTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0])
    df_resolved = (
        df.resolveChoice(specs=[("id", "cast:bigint")])
        .resolveChoice(specs=[("user_id", "cast:bigint")])
        .resolveChoice(specs=[("connected_user_id", "cast:bigint")])
        .resolveChoice(specs=[("mg_id", "cast:bigint")])
        .resolveChoice(specs=[("access_level", "cast:tinyint")])
        .resolveChoice(specs=[("status", "cast:tinyint")])
    )
    return DynamicFrameCollection({"CustomTransform0": df_resolved}, glueContext)

def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node JDBC Connection
JDBCConnection_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="ABC123",
    table_name="user_connections",
    transformation_ctx="JDBCConnection_node1",
)
# Script generated for node SQL
SqlQuery0 = """
select
id,
user_id,
connected_user_id,
COALESCE(mg_id, 0) mg_id,
created_at,
updated_at,
updated_at,
access_level,
status,
COALESCE(nickname, '') nickname
from
apiData
"""
SQL_node1647619002820 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"apiData": JDBCConnection_node1},
    transformation_ctx="SQL_node1647619002820",
)
# Script generated for node Custom transform
Customtransform_node1647612655336 = CastIntsTransform(
    glueContext,
    DynamicFrameCollection(
        {"SQL_node1647619002820": SQL_node1647619002820}, glueContext
    ),
)
# Script generated for node Select From Collection
SelectFromCollection_node1647613332516 = SelectFromCollection.apply(
    dfc=Customtransform_node1647612655336,
    key=list(Customtransform_node1647612655336.keys())[0],
    transformation_ctx="SelectFromCollection_node1647613332516",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=SelectFromCollection_node1647613332516,
    mappings=[
        ("id", "bigint", "id", "bigint"),
        ("user_id", "bigint", "user_id", "bigint"),
        ("connected_user_id", "bigint", "connected_user_id", "bigint"),
        ("mg_id", "bigint", "mg_id", "bigint"),
        ("created_at", "timestamp", "created_at", "timestamp"),
        ("updated_at", "timestamp", "updated_at", "timestamp"),
        ("access_level", "tinyint", "access_level", "tinyint"),
        ("status", "tinyint", "status", "tinyint"),
        ("nickname", "varchar", "nickname", "varchar"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Amazon Redshift
pre_query = "drop table if exists public.stage_table_cd5d65739d334453938f090ea1cb2d6e;create table public.stage_table_cd5d65739d334453938f090ea1cb2d6e as select * from public.test_user_connections where 1=2;"
post_query = "begin;delete from public.test_user_connections using public.stage_table_cd5d65739d334453938f090ea1cb2d6e where public.stage_table_cd5d65739d334453938f090ea1cb2d6e.id = public.test_user_connections.id; insert into public.test_user_connections select * from public.stage_table_cd5d65739d334453938f090ea1cb2d6e; drop table public.stage_table_cd5d65739d334453938f090ea1cb2d6e; end;"
AmazonRedshift_node1647612972417 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=ApplyMapping_node2,
    catalog_connection="ABC123",
    connection_options={
        "database": "test",
        "dbtable": "public.stage_table_cd5d65739d334453938f090ea1cb2d6e",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="AmazonRedshift_node1647612972417",
)
job.commit()
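Two things in the generated script may be worth checking: SqlQuery0 selects updated_at twice while ApplyMapping_node2 maps it only once, and the raw values carry nine digits of fractional seconds. A minimal sketch of the query with the duplicate removed and the timestamps truncated to whole seconds (a diagnostic guess, not a confirmed fix) would be:

# Sketch only: drops the duplicated updated_at column and truncates the
# timestamps to second precision before they reach the Redshift stage table.
SqlQuery0 = """
select
    id,
    user_id,
    connected_user_id,
    COALESCE(mg_id, 0) mg_id,
    date_trunc('second', created_at) created_at,
    date_trunc('second', updated_at) updated_at,
    access_level,
    status,
    COALESCE(nickname, '') nickname
from
    apiData
"""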

Apache Flink connect to PostgreSQL

I'm trying to connect to PostgreSQL with PyFlink on Windows, and I'm using the following code:
from pyflink.table import EnvironmentSettings, TableEnvironment
env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)
table_env.execute_sql("""
CREATE TABLE test_nifi (
codecountry VARCHAR(50),
name VARCHAR(50),
PRIMARY KEY (codecountry) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:postgresql://localhost:5432/TestDS',
'table-name' = 'public.test_nifi',
'username' = 'postgres',
'password' = 'postgres'
)
""")
result = table_env.from_path("test_nifi").select("codecountry, name")
print(result.to_pandas())
and I'm getting the following error:
Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'jdbc' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
Any idea why this is happening?
Add the following line:
table_env.get_config().get_configuration().set_string("pipeline.jars", "file:///C:/Users/Admin/Desktop/Flink/flink-connector-jdbc_2.12-1.14.3.jar;file:///C:/Users/Admin/Desktop/Flink/postgresql-42.3.1.jar")
Since Flink is a Java/Scala-based project, implementations of both connectors and formats are available as jars.
PostgreSQL support in PyFlink relies on Java's flink-connector-jdbc implementation, so you need to add that jar to the stream_execution_environment:
stream_execution_environment.add_jars("file:///my/jar/path/connector1.jar", "file:///my/jar/path/connector2.jar")
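Putting the two suggestions together, a minimal self-contained sketch (the jar versions and Windows paths are taken from the answer above and must match your own Flink installation and driver downloads):

# Sketch: make the JDBC connector and PostgreSQL driver jars visible to the
# Table API before the DDL is executed.
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)

table_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///C:/Users/Admin/Desktop/Flink/flink-connector-jdbc_2.12-1.14.3.jar;"
    "file:///C:/Users/Admin/Desktop/Flink/postgresql-42.3.1.jar",
)

table_env.execute_sql("""
    CREATE TABLE test_nifi (
        codecountry VARCHAR(50),
        name VARCHAR(50),
        PRIMARY KEY (codecountry) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/TestDS',
        'table-name' = 'public.test_nifi',
        'username' = 'postgres',
        'password' = 'postgres'
    )
""")

print(table_env.from_path("test_nifi").select(col("codecountry"), col("name")).to_pandas())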

Produce data into Kafka topic with KSQL

The topic test already exists in a remote Kafka cluster. The session details are saved in the client variable. We have three columns in our table, called FEATURE, ACCOUNTHOLDER, and CLASS.
Now I create sample_table:
client.create_table(table_name='sample_table',
                    columns_type=['FEATURE double', 'ACOUNTHOLDER string', 'CLASS int'],
                    topic='test',
                    value_format='DELIMITED',
                    key='ACCOUNTHOLDER')
The result is true, so the table is created successfully.
Now I want to push data into sample_table:
client.ksql("""INSERT INTO sample_table (FEATURE, ACCOUNTHOLDER, CLASS) VALUES (6.1, 'C1', 1)""")
I receive a NullPointerException.
KSQLError: ('java.lang.NullPointerException', 50000, ['io.confluent.ksql.rest.server.execution.InsertValuesExecutor.extractRow(InsertValuesExecutor.java:164)', 'io.confluent.ksql.rest.server.execution.InsertValuesExecutor.execute(InsertValuesExecutor.java:98)', 'io.confluent.ksql.rest.server.validation.CustomValidators.validate(CustomValidators.java:109)', 'io.confluent.ksql.rest.server.validation.RequestValidator.validate(RequestValidator.java:143)', 'io.confluent.ksql.rest.server.validation.RequestValidator.validate(RequestValidator.java:115)', 'io.confluent.ksql.rest.server.resources.KsqlResource.handleKsqlStatements(KsqlResource.java:163)', 'sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)', 'sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)', 'java.lang.reflect.Method.invoke(Method.java:498)', 'org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)', 'org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)', 'org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)', 'org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)', 'org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)', 'org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)', 'org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:415)', 'org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:104)', 'org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)', 'org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)', 'org.glassfish.jersey.internal.Errors$1.call(Errors.java:268)', 'org.glassfish.jersey.internal.Errors.process(Errors.java:316)', 'org.glassfish.jersey.internal.Errors.process(Errors.java:298)', 'org.glassfish.jersey.internal.Errors.process(Errors.java:268)', 'org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:289)', 'org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:256)', 'org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:703)', 'org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:416)', 'org.glassfish.jersey.servlet.ServletContainer.serviceImpl(ServletContainer.java:409)', 'org.glassfish.jersey.servlet.ServletContainer.doFilter(ServletContainer.java:584)', 'org.glassfish.jersey.servlet.ServletContainer.doFilter(ServletContainer.java:525)', 'org.glassfish.jersey.servlet.ServletContainer.doFilter(ServletContainer.java:462)', 'org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)', 'org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)', 'org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1700)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)', 'org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)', 
'org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)', 'org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1667)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)', 'org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)', 'org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)', 'org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)', 'org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)', 'org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)', 'org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)', 'org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)', 'org.eclipse.jetty.server.Server.handle(Server.java:505)', 'org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)', 'org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)', 'org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)', 'org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)', 'org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)', 'org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)', 'org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)', 'org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)', 'java.lang.Thread.run(Thread.java:748)'])
Any ideas what I am doing wrong?
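One detail worth flagging (a guess, not a confirmed cause of the NullPointerException): the table is declared with the column spelled ACOUNTHOLDER, while both the key argument and the INSERT statement use ACCOUNTHOLDER. A sketch of the same calls with the spelling made consistent everywhere:

# Sketch only: identical to the calls above, but with ACCOUNTHOLDER spelled
# the same way in the column list, the key, and the INSERT statement.
client.create_table(table_name='sample_table',
                    columns_type=['FEATURE double', 'ACCOUNTHOLDER string', 'CLASS int'],
                    topic='test',
                    value_format='DELIMITED',
                    key='ACCOUNTHOLDER')

client.ksql("""INSERT INTO sample_table (FEATURE, ACCOUNTHOLDER, CLASS) VALUES (6.1, 'C1', 1)""")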

Problem when using Spark SQL (2.1) to work with a PostgreSQL DB

I use the following test case to write data to a PostgreSQL table, and it works fine.
test("SparkSQLTest") {
val session = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
val url = "jdbc:postgresql://dbhost:12345/db1"
val table = "schema1.table1"
val props = new Properties()
props.put("user", "user123")
props.put("password", "pass#123")
props.put(JDBCOptions.JDBC_DRIVER_CLASS, "org.postgresql.Driver")
session.range(300, 400).write.mode(SaveMode.Append).jdbc(url, table, props)
}
Then I use spark-sql -f sql_script_file.sql to write Hive data into the PostgreSQL table.
CREATE OR REPLACE TEMPORARY VIEW tmp_v1
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'org.postgresql.Driver',
  url 'jdbc:postgresql://dbhost:12345/db1',
  dbtable 'schema1.table2',
  user 'user123',
  password 'pass#123',
  batchsize '2000'
);

insert into tmp_v1 select
  name,
  age
from test.person; -- test.person is the Hive db.table
But when I run the above script with spark-sql -f sql_script.sql, it complains that the PostgreSQL user/password is invalid; the exception is below. I think the two approaches are basically the same, so where is the problem? Thanks.
org.postgresql.util.PSQLException: FATAL: Invalid username/password,login denied.
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:375)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:189)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:64)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:124)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:28)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:20)
at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:22)
at org.postgresql.Driver.makeConnection(Driver.java:392)
at org.postgresql.Driver.connect(Driver.java:266)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:59)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:50)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:76)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:75)
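For comparison, here is a PySpark sketch of what the SQL script does, built only from the values already shown above. Running it from the same environment as spark-sql can help confirm whether the failure is specific to the CREATE TEMPORARY VIEW ... USING path or to the credentials themselves:

# Sketch only: mirrors the SQL script using the DataFrame API, with the same
# connection values that appear in the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

(spark.table("test.person")
    .select("name", "age")
    .write
    .format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://dbhost:12345/db1")
    .option("dbtable", "schema1.table2")
    .option("user", "user123")
    .option("password", "pass#123")
    .option("batchsize", "2000")
    .mode("append")
    .save())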

How to use Spark BigQuery Connector locally?

For test purposes, I would like to use the BigQuery Connector to write Parquet Avro logs to BigQuery. As I'm writing this, there is no way to ingest Parquet directly from the UI, so I'm writing a Spark job to do so.
In Scala, for the time being, the job body is the following:
val events: RDD[RichTrackEvent] =
  readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)
val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")

// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
  conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
  "mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName
)

events
  .mapPartitions { items =>
    val gson = new Gson()
    items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
  }
  .map(x => (null, x))
  .saveAsNewAPIHadoopDataset(conf)
As the BigQueryOutputFormat isn't finding the Google credentials, it falls back on the metadata host to try to discover them, with the following stack trace:
2016-06-13 11:40:53 WARN HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
This is of course expected, but it should be able to use my service account and its key, since GoogleCredential.getApplicationDefault() returns appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
As the connector seems to read credentials from the Hadoop configuration, what are the keys to set so that it reads GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?
If I understand your question correctly - you might want to set:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Here, the mapred.bq.auth.service.account.keyfile should point to the full file path to the older-style "P12" keyfile; alternatively, if you're using the newer "JSON" keyfiles, you should replace the "email" and "keyfile" entries with the single mapred.bq.auth.service.account.json.keyfile key:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Also, you might want to take a look at https://github.com/spotify/spark-bigquery, which is a much more civilised way of working with BQ and Spark. Its setGcpJsonKeyFile method takes the same JSON file you'd set for mapred.bq.auth.service.account.json.keyfile when using the BQ connector for Hadoop.