How to use Spark BigQuery Connector locally? - scala

For testing purposes, I would like to use the BigQuery Connector to write Parquet Avro logs to BigQuery. As I'm writing this, there is no way to ingest Parquet directly from the UI, so I'm writing a Spark job to do so.
In Scala, for the time being, the job body is the following:
val events: RDD[RichTrackEvent] =
  readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)
val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")
// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema
// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
  conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
  "mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName
)
events
  .mapPartitions { items =>
    val gson = new Gson()
    items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
  }
  .map(x => (null, x))
  .saveAsNewAPIHadoopDataset(conf)
Since the BigQueryOutputFormat isn't finding the Google credentials, it falls back to the metadata host to try to discover them, with the following stack trace:
2016-06-13 11:40:53 WARN HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
This is of course expected, but it should be able to use my service account and its key, since GoogleCredential.getApplicationDefault() returns appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
As the connector seems to read credentials from the Hadoop configuration, what are the keys to set so that it reads GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?

If I understand your question correctly - you might want to set:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Here, the mapred.bq.auth.service.account.keyfile should point to the full file path to the older-style "P12" keyfile; alternatively, if you're using the newer "JSON" keyfiles, you should replace the "email" and "keyfile" entries with the single mapred.bq.auth.service.account.json.keyfile key:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
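For a locally-run job like yours, you can set these directly on the Hadoop configuration. A minimal sketch for the JSON-keyfile variant (the project id, bucket and key path below are placeholders):
// Sketch only: the same mapred.bq.* keys as listed above, with placeholder values.
val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")
conf.set("mapred.bq.gcs.bucket", "my-temp-bucket")
conf.set("mapred.bq.auth.service.account.enable", "true")
// Point this at the same JSON key that GOOGLE_APPLICATION_CREDENTIALS refers to.
conf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/service-account.json")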
Also, you might want to take a look at https://github.com/spotify/spark-bigquery - which is a much more civilised way of working with BQ and Spark. The setGcpJsonKeyFile method used there takes the same JSON file you'd set for mapred.bq.auth.service.account.json.keyfile if using the BQ connector for Hadoop.
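For the credential part of that library, a rough sketch (assuming the com.spotify:spark-bigquery artifact is on the classpath and its implicit SQLContext helper is still named as in its README):
import org.apache.spark.sql.SQLContext
import com.spotify.spark.bigquery._
val sqlContext: SQLContext = ??? // your existing SQLContext
// Same JSON key file as mapred.bq.auth.service.account.json.keyfile above.
sqlContext.setGcpJsonKeyFile("/path/to/service-account.json")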

Related

Getting "java.lang.IllegalStateException: Tried to lookup lag for unknown task 3_0" after upgrading Kafka Stream from 2.5.1 to 2.6.2

I just upgraded our Kafka Stream application from 2.5.1 to 2.6.2. It used to work, now it doesn't.
Here is the troublesome topology (I have omitted the irrelevant Serdes):
val builder = new StreamsBuilder()
val contractEventStream: KStream[TariffId, ContractEvent] =
  builder.stream[String, ContractUpsertAvro](settings.contractsTopicName)
    .flatMap { (_, contractAvro) =>
      ContractEvent.from(contractAvro)
        .map(contractEvent => (contractEvent.tariffId, contractEvent))
    }
val tariffsTable: KTable[TariffId, Tariff] =
  builder.stream[String, TariffUpdateEventAvro](settings.tariffTopicName)
    .flatMapValues(Tariff.fromAvro(_))
    .selectKey((_, tariff) => tariff.tariffId)
    .toTable(Materialized.`with`(tariffIdSerde, tariffSerde)) // Materialized.as also throws the same IllegalStateException
contractEventStream
  .join(tariffsTable)(JourneyStep.from(_, _).asInstanceOf[ContractCreated])(Joined.`with`(tariffIdSerde, contractEventSerde, tariffSerde))
  .selectKey((_, contractUpdated) => contractUpdated.accountId)
  .foreach((_, journeyStep) => println(journeyStep))
The join gives the following exception:
java.lang.IllegalStateException: Tried to lookup lag for unknown task 3_0
at org.apache.kafka.streams.processor.internals.assignment.ClientState.lagFor(ClientState.java:306)
at java.util.Comparator.lambda$comparingLong$6043328a$1(Comparator.java:511)
at java.util.Comparator.lambda$thenComparing$36697e65$1(Comparator.java:216)
at java.util.TreeMap.compare(TreeMap.java:1295)
at java.util.TreeMap.put(TreeMap.java:538)
at java.util.TreeSet.add(TreeSet.java:255)
at java.util.AbstractCollection.addAll(AbstractCollection.java:344)
at java.util.TreeSet.addAll(TreeSet.java:312)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.getPreviousTasksByLag(StreamsPartitionAssignor.java:1275)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assignTasksToThreads(StreamsPartitionAssignor.java:1189)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.computeNewAssignment(StreamsPartitionAssignor.java:940)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assign(StreamsPartitionAssignor.java:399)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:589)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:684)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$1000(AbstractCoordinator.java:111)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:597)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:560)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1160)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1135)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1296)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:624)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
I can't see what I am doing wrong. The code above works with Kafka 2.5.1. Does anyone have any idea what is going on?
The problem is caused by the cache that Kafka Streams keeps on disk. This cache is specific to the Kafka version and to the Kafka Streams topology you use (i.e. a change in your topology could also lead to this error).
The cache is usually found in /tmp, or elsewhere if you passed the "state.dir" property to Kafka Streams. Clear the directory containing the cache and you should be able to start cleanly again.
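If you prefer to do this programmatically, KafkaStreams exposes a cleanUp() method that wipes the application's local state directory; a minimal sketch (the application id, brokers and state directory below are placeholders):
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")    // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder brokers
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams") // the "state.dir" mentioned above
val streams = new KafkaStreams(builder.build(), props)
streams.cleanUp() // deletes this application's local state; only valid while the instance is not running
streams.start()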

Extract Embedded AWS Glue Connection Credentials Using Scala

I have a Glue job that reads directly from Redshift, and to do that one has to provide connection credentials. I have created an embedded Glue connection and can extract the credentials with the following PySpark code. Is there a way to do this in Scala?
glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)
table = (
    spark.read.format('com.databricks.spark.redshift')
    .option('url', 'jdbc:redshift://prod.us-east-1.redshift.amazonaws.com:5439/db')
    .option('user', response['Connection']['ConnectionProperties']['USERNAME'])
    .option('password', response['Connection']['ConnectionProperties']['PASSWORD'])
    .option('dbtable', 'db.table')
    .option('tempdir', 's3://config/glue/temp/redshift/')
    .option('forward_spark_s3_credentials', 'true')
    .load()
)
There is no Scala equivalent from AWS to issue this API call, but you can use Java SDK code inside Scala, as mentioned in this answer.
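A rough Scala sketch of that getConnection call via the v1 Java SDK (assuming the aws-java-sdk-glue dependency is on the classpath; the region and connection name are placeholders copied from your snippet):
import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.GetConnectionRequest
// Build a Glue client for the region the connection lives in.
val glueClient = AWSGlueClientBuilder.standard().withRegion("us-east-1").build()
val request = new GetConnectionRequest()
  .withName("name-of-embedded-connection")
  .withHidePassword(false)
// ConnectionProperties is a java.util.Map keyed by USERNAME, PASSWORD, etc.
val props = glueClient.getConnection(request).getConnection.getConnectionProperties
val username = props.get("USERNAME")
val password = props.get("PASSWORD")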
getConnection is the relevant Java SDK call; if you don't want to go that route, you can follow the approach below:
Create an AWS Glue Python shell job and retrieve the connection information.
Once you have the values, call the other Scala Glue job with these as arguments from inside your Python shell job, as shown below:
glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)
run_response = glue.start_job_run(
    JobName='my_scala_Job',
    Arguments={
        '--username': response['Connection']['ConnectionProperties']['USERNAME'],
        '--password': response['Connection']['ConnectionProperties']['PASSWORD']
    }
)
Then access these parameters inside your scala job using getResolvedOptions as shown below:
import com.amazonaws.services.glue.util.GlueArgParser
val args = GlueArgParser.getResolvedOptions(
  sysArgs,
  Array("username", "password")
)
val user = args("username")
val pwd = args("password")

How to write to HDFS using spark programming API if I have authentication details?

I need to write to an external HDFS cluster whose authentication details are available for both simple as well as Kerberos authentication. For the sake of simplicity, let's assume we are dealing with simple authentication.
This is what I have:
External HDFS cluster connection details (host, port)
Authentication details (user for simple auth)
HDFS location where files need to be written (hdfs://host:port/loc)
Also, other details like format, etc.
Please note that the Spark user is not the same as the user specified for HDFS auth.
Now, using the Spark programming API, this is what I am trying to do:
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", fileSystemPath)
hadoopConf.set("hadoop.job.ugi", userName)
val jConf = new JobConf(hadoopConf)
jConf.setUser(user)
jConf.set("user.name", user)
jConf.setOutputKeyClass(classOf[NullWritable])
jConf.setOutputValueClass(classOf[Text])
jConf.setOutputFormat(classOf[TextOutputFormat[NullWritable, Text]])
outputDStream.foreachRDD(r => {
  val rdd = r.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      println(x.toString)
      (NullWritable.get(), text)
    }
  }
  val rddCount = rdd.count()
  if (rddCount > 0) {
    rdd.saveAsHadoopFile(config.outputPath, classOf[NullWritable], classOf[Text], classOf[TextOutputFormat[NullWritable, Text]], jConf)
  }
})
Here, I was assuming that if we pass a JobConf with the correct details, it should be used for authentication and the write should be done using the user specified in the JobConf.
However, the write still happens as the Spark user ("root"), irrespective of the auth details present in the JobConf ("hdfs" as user). Below is the exception that I get:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/spark-deploy/out/_temporary/0":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1698)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1682)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1665)
at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:71)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3900)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:978)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy40.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:558)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy41.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3000)
... 45 more
Please let me know if there are any suggestions.
This is probably more a comment than an answer but as it is too long I put it here. I haven't tried this because I have no environment to test it. Please try and let me know if this works (and if it doesn't I'll remove this answer).
Looking a bit into the code it looks like DFSClient creates a proxy using createProxyWithClientProtocol that uses UserGroupInformation.getCurrentUser() (I haven't traced the createHAProxy branch down but I suspect the same logic there). Then this info is sent to the server for authentication.
It means that you need to change what UserGroupInformation.getCurrentUser() returns in the context of your particular call. This is what UserGroupInformation.doAs is supposed to do so you just need to get a proper UserGroupInformation instance. And in the case of simple authentication UserGroupInformation.createRemoteUser might actually work.
So I suggest trying something like this:
...
// Requires: import java.security.PrivilegedExceptionAction
//           import org.apache.hadoop.security.UserGroupInformation
val rddCount = rdd.count()
if (rddCount > 0) {
  val remoteUgi = UserGroupInformation.createRemoteUser("hdfsUserName")
  // Run the save inside doAs so the remote user, not the local Spark user, is used for the HDFS write.
  remoteUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit =
      rdd.saveAsHadoopFile(config.outputPath, classOf[NullWritable], classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]], jConf)
  })
}

No FileSystem for scheme: cos

I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme, based on the Stocator docs. However, the error suggests Stocator isn't available, or is an old version.
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9), which supports the fs.cos scheme, is not yet deployed on Spark as a Service (it will be soon). Please use the Stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The lib is already installed on our service. Example:
import ibmos2spark
credentials = {
    'endpoint': 'endpointXXXX',
    'access_key': 'XXXX',
    'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
It looks like the cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure the Stocator classes are on the classpath. You can use the packages system by executing pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with both the swift2d and cos schemes.
UPDATE 2:
Just follow Stocator documentation (https://github.com/CODAIT/stocator). It contains all details how to install it, what branch to use, etc.
I found the same issue, and to solve it I just changed environments:
Within IBM Watson Studio, if you start a Jupyter notebook in an environment without a pre-configured Spark cluster, then you get that error. Installing PySpark is not enough.
Instead, if you start a notebook with a Spark cluster available, you will be just fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some other fs.cos... configurations.
Here's an end-to-end code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosIntanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_service_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
    SparkSession
    .builder
    .appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_service_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')

Access public available Amazon S3 file from Apache Spark

I have a publicly available Amazon S3 resource (a text file) and want to access it from Spark. That means I don't have any Amazon credentials. It works fine if I just want to download it:
val bucket = "<my-bucket>"
val key = "<my-key>"
val client = new AmazonS3Client
val o = client.getObject(bucket, key)
val content = o.getObjectContent // <= can be read and used as input stream
However, when I try to access the same resource from spark context
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val f = sc.textFile(s"s3a://$bucket/$key")
println(f.count())
I receive the following error with stacktrace:
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at com.example.Main$.main(Main.scala:14)
at com.example.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
I don't want to provide any AWS credentials - I just want to access the resource anonymously (for now) - how can I achieve this? I probably need to make it use something like AnonymousAWSCredentialsProvider - but how do I plug that into Spark or Hadoop?
P.S. My build.sbt, just in case:
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1",
"org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)
UPDATE: After some investigation, I see the reason why it isn't working.
First of all, S3AFileSystem creates the AWS client with the following order of credential providers:
AWSCredentialsProviderChain credentials = new AWSCredentialsProviderChain(
    new BasicAWSCredentialsProvider(accessKey, secretKey),
    new InstanceProfileCredentialsProvider(),
    new AnonymousAWSCredentialsProvider()
);
The accessKey and secretKey values are taken from the Spark conf instance (the keys must be "fs.s3a.access.key" and "fs.s3a.secret.key", or the org.apache.hadoop.fs.s3a.Constants.ACCESS_KEY and org.apache.hadoop.fs.s3a.Constants.SECRET_KEY constants, which is more convenient).
Second - you probably see that AnonymousAWSCredentialsProvider is the third option (last priority) - what could possibly be wrong with that? See the implementation of AnonymousAWSCredentials:
public class AnonymousAWSCredentials implements AWSCredentials {
    public String getAWSAccessKeyId() {
        return null;
    }
    public String getAWSSecretKey() {
        return null;
    }
}
It simply returns null for both access key and secret key. Sounds reasonable. But look inside AWSCredentialsProviderChain:
AWSCredentials credentials = provider.getCredentials();
if (credentials.getAWSAccessKeyId() != null &&
        credentials.getAWSSecretKey() != null) {
    log.debug("Loading credentials from " + provider.toString());
    lastUsedProvider = provider;
    return credentials;
}
It doesn't choose a provider when both keys are null - which means anonymous credentials can't work. This looks like a bug inside aws-java-sdk 1.7.4. I tried to use the latest version, but it's incompatible with hadoop-aws 2.7.1.
Any other ideas?
I have personally never accessed public data from Spark. You can try to use dummy credentials, or create some just for this usage. Set them directly on the SparkConf object.
val sparkConf: SparkConf = ???
val accessKeyId: String = ???
val secretAccessKey: String = ???
sparkConf.set("spark.hadoop.fs.s3.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3.awsSecretAccessKey", secretAccessKey)
sparkConf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", secretAccessKey)
As an alternative, read the documentation of DefaultAWSCredentialsProviderChain to see where the credentials are looked for. The list (order is important) is as follows, with a small example after it:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
Java System Properties - aws.accessKeyId and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Instance profile credentials delivered through the Amazon EC2 metadata service
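For instance, the second option in that list (Java system properties) can be exercised without touching the Spark or Hadoop configuration at all; a minimal sketch with placeholder values, set early in the driver before any S3 client is created:
// Java system properties read by the AWS SDK's SystemPropertiesCredentialsProvider.
System.setProperty("aws.accessKeyId", "MY_ACCESS_KEY") // placeholder
System.setProperty("aws.secretKey", "MY_SECRET_KEY")   // placeholder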
This is what helped me:
val session = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
.getOrCreate()
val df = session.read.csv(filesFromS3:_*)
Versions:
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.hadoop" % "hadoop-aws" % "2.8.5",
Documentation:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties
It seems you can now use the fs.s3a.aws.credentials.provider config key to enable the anonymous access given by org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider, which correctly special-cases the anonymous provider. However, you need a newer hadoop-aws than 2.7, which means you also need a Spark installation without a bundled Hadoop.
Here is how I did it in Colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!pip install -q findspark
!pip install -q pyarrow
Now we install Hadoop on the side, and set the output of hadoop classpath as SPARK_DIST_CLASSPATH, so Spark can see it.
import os
!wget -q http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz
os.environ['HADOOP_HOME']= '/content/hadoop-2.8.4'
os.environ["SPARK_DIST_CLASSPATH"] = "/content/hadoop-2.8.4/etc/hadoop:/content/hadoop-2.8.4/share/hadoop/common/lib/*:/content/hadoop-2.8.4/share/hadoop/common/*:/content/hadoop-2.8.4/share/hadoop/hdfs:/content/hadoop-2.8.4/share/hadoop/hdfs/lib/*:/content/hadoop-2.8.4/share/hadoop/hdfs/*:/content/hadoop-2.8.4/share/hadoop/yarn/lib/*:/content/hadoop-2.8.4/share/hadoop/yarn/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/lib/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/*:/content/hadoop-2.8.4/contrib/capacity-scheduler/*.jar"
Then we do as in https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/, but add s3a and anonymous reading support, which is what the question is about.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.6,org.apache.hadoop:hadoop-aws:2.8.4 --conf spark.sql.execution.arrow.enabled=true --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider pyspark-shell'
And finally we can create the session.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()