InvalidSignatureException occurs when trying to add a user record using the Kinesis Producer library.
AWS_JAVA_SDK_VERSION=1.10.26
AWS_KINESIS_PRODUCER_VERSION=0.10.1
ERROR:
PutRecords failed: {"__type":"InvalidSignatureException","message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method.
SCALA KINESIS PRODUCER CODE
private val configuration: KinesisProducerConfiguration = new KinesisProducerConfiguration
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(config.awsAccessKey, config.awsSecretKey)
configuration.setCredentialsProvider(credentialsProvider)
configuration.setRecordMaxBufferedTime(config.timeLimit)
configuration.setAggregationMaxCount(1)
configuration.setRegion(config.streamRegion)
configuration.setMetricsLevel("none")
private val kinesisProducer = new KinesisProducer(configuration)
kinesisProducer.addUserRecord(streamName, key, eventBytes)
The above code does not work. However, I can add records to the Kinesis stream through the AWS CLI from the terminal, and also with the KinesisClient code shown below.
private def createKinesisClient = {
val accessKey = config.awsAccessKey
val secretKey = config.awsSecretKey
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(accessKey, secretKey)
val client = new AmazonKinesisClient(credentialsProvider)
client.setEndpoint(config.streamEndpoint)
client
}
This happens because your VM/PC/server clock might be skewed.
If you're running Ubuntu, try updating your system time:
sudo ntpdate ntp.ubuntu.com
If you are using docker-machine on Mac, you can resolve with this command:
docker-machine ssh default 'sudo ntpclient -s -h pool.ntp.org'
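If you want to confirm the skew from the JVM before touching NTP, a minimal sketch is to compare the local clock with the Date header of an HTTPS response; the Kinesis endpoint below is just an assumption, any reliable server works. A skew of more than a few minutes is enough for AWS to reject the signature.
import java.net.{HttpURLConnection, URL}

// Rough clock-skew check: compare the local clock with the server's Date header.
// The endpoint is an assumption; substitute your region's Kinesis endpoint or any reliable host.
object ClockSkewCheck extends App {
  val conn = new URL("https://kinesis.us-east-1.amazonaws.com")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("HEAD")
  val serverTime = conn.getHeaderFieldDate("Date", 0L) // 0 if the header is missing
  val skewMillis = System.currentTimeMillis() - serverTime
  println(s"Approximate clock skew: $skewMillis ms")
  conn.disconnect()
}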
Related
I am trying to push records to AWS S3 with the Alpakka S3 library. Due to security requirements I have to assume a role, because my IAM user doesn't have PUT access. With the AWS CLI I can push to S3 successfully by passing the --profile parameter (aws s3 --profile <profile>). I want to know how to assume a role in the Alpakka S3 library.
my application.conf file has the credentials as in https://github.com/akka/alpakka/blob/v3.0.4/s3/src/main/resources/reference.conf:
alpakka.s3 {
  # default values for AWS configuration
  aws {
    # to use the same configuration as if credentials.provider = default
    credentials {
      # static credentials
      provider = default //static
      access-key-id = <> //valid access key exists in original code
      secret-access-key = <>
    }
  }
}
and my PUT code is:
object Experiments extends App{
//AWS S3 configs
val s3BucketName = config.getString("alpakka.s3.bucket_name")
val s3BucketRegion = config.getString("alpakka.s3.bucket_region")
val bucket_path = config.getString("alpakka.s3.path_inside_bucket")
Source.single(record)
.map { Record => println("Record value: " + Record)
val K_record = //some parsing for the record into JSON
//push to S3
val bucketKey = s"$bucket_path/${K_record.session_id}.json"
try{
Source.single(ByteString(K_record.featureSet))
.runWith(S3.multipartUpload(bucket = s3BucketName, key = bucketKey))
}
catch{
case e2:Exception => println(s"S3 pushing error: ${e2.getMessage}")
}
}
.runWith(Sink.ignore)
}
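One hedged sketch for the assume-role part, reusing the names from the snippet above: build an STS-backed credentials provider and attach it to the upload via S3Attributes. The role ARN and session name are placeholders, and it assumes the AWS SDK v2 STS module is on the classpath; treat this as a sketch rather than a confirmed recipe.
import akka.stream.alpakka.s3.{S3Attributes, S3Ext}
import software.amazon.awssdk.services.sts.StsClient
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest

// Sketch: assume a role via STS and hand the resulting provider to Alpakka S3.
// The role ARN and session name are placeholders; requires the "sts" AWS SDK v2 dependency.
val assumeRoleProvider = StsAssumeRoleCredentialsProvider.builder()
  .stsClient(StsClient.create())
  .refreshRequest(
    AssumeRoleRequest.builder()
      .roleArn("arn:aws:iam::123456789012:role/my-put-role") // placeholder
      .roleSessionName("alpakka-s3-upload")
      .build())
  .build()

// "system" is the ActorSystem already implicit in the streaming code (an assumption).
val s3SettingsWithRole = S3Ext(system).settings.withCredentialsProvider(assumeRoleProvider)

// Attach the settings to the sink so only this upload uses the assumed role.
Source.single(ByteString(K_record.featureSet))
  .runWith(
    S3.multipartUpload(bucket = s3BucketName, key = bucketKey)
      .withAttributes(S3Attributes.settings(s3SettingsWithRole)))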
I'm using Akka (version 2.5.18) to send JSON strings to a specific server via https. I have used a poolRouter (balancing-pool with 10 instances) in order to create a pool of actors that are going to send JSONs (generated from different customers) to a single server:
val router: ActorRef = system.actorOf(
FromConfig.props(Props(new SenderActor(configuration.getString("https://server.com"), this.self))),
"poolRouter"
)
The project specification says that the requests can also be sent using curl:
curl -X PUT --cert certificate.pem --key private.key -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d '[{"id": "test"}]' 'https://server.com'
Where "certificate.pem" is the tls certificate of the customer and "private.key" is the private key used to generate the CSR of the customer.
I'm using a balancing-pool because I will have a very big set of certificates (one for each customer) and I need to send the requests concurrently.
My approach is to have a "SenderActor" class that will be created by the balancing pool. Each actor, upon the reception of a message with a "customerId" and the JSON data generated by this customer, will send a https request:
override def receive: Receive = {
  case Data(customerId, jsonData) =>
    send(customerId, jsonData)
}
Each SenderActor will read the certificate (and the private key) based on a path using the customerId. For instance, the customerId: "cust1" will have their certificate and key stored in "/home/test/cust1". This way, the same actor class can be used for all the customers.
According to the documentation, I need to create a HttpsConnectionContext in order to send the different requests:
def send(customerId: String, dataToSend: String): Future[HttpResponse] = {
// Create the request
val req = HttpRequest(
PUT,
uri = "https://server.com",
entity = HttpEntity(`application/x-www-form-urlencoded` withCharset `UTF-8`, dataToSend),
protocol = `HTTP/1.0`)
val ctx: SSLContext = SSLContext.getInstance("TLS")
val permissiveTrustManager: TrustManager = new X509TrustManager() {
override def checkClientTrusted(chain: Array[X509Certificate], authType: String): Unit = {}
override def checkServerTrusted(chain: Array[X509Certificate], authType: String): Unit = {}
override def getAcceptedIssuers(): Array[X509Certificate] = Array.empty
}
ctx.init(Array.empty, Array(permissiveTrustManager), new SecureRandom())
val httpsConnContext: HttpsConnectionContext = ConnectionContext.https(ctx)
// Send the request
Http(system).singleRequest(req, httpsConnContext)
}
The problem I have is that I don't have any clue about how to "set the certificate and the key" in the request, so that the server accepts them.
For instance, I can read the certificate using the following code:
import java.util.Base64
val certificate: String => String = (customer: String) => IO {
Source.fromInputStream(getClass.getClassLoader
.getResourceAsStream("/home/test/".concat(customer).concat("_cert.pem")))
.getLines().mkString
}.unsafeRunSync()
val decodedCertificate = Base64.getDecoder.decode(certificate(customerId)
.replaceAll(X509Factory.BEGIN_CERT, "").replaceAll(X509Factory.END_CERT, ""))
val cert: Certificate = CertificateFactory.getInstance("X.509")
.generateCertificate(new ByteArrayInputStream(decodedCertificate))
But I don't know how to "set" this certificate and the private key in the request (which is protected by a passphrase), so that the server accepts it.
Any hint or help would be greatly appreciated.
The following allows making an HTTPS request and identifying yourself with a private key from an X.509 certificate.
The following libraries are used to manage the SSL configuration and to make HTTPS calls:
ssl-config
akka-http
Convert your PEM certificate to PKCS12 format as defined here:
openssl pkcs12 -export -out certificate.pfx -inkey privateKey.key -in certificate.crt
Define a key-store in your application.conf. It supports only PKCS12, which is why step 1 (the conversion above) is required.
ssl-config {
keyManager {
stores = [
{
type = "pkcs12"
path = "/path/to/pkcs12/cetificate"
password = changeme //the password is set when using openssl
}
]
}
}
Load the SSL config using the special Akka trait DefaultSSLContextCreation:
import akka.actor.ActorSystem
import akka.actor.ExtendedActorSystem
import akka.http.scaladsl.DefaultSSLContextCreation
import com.typesafe.sslconfig.akka.AkkaSSLConfig
import com.typesafe.sslconfig.ssl.SSLConfigFactory
class TlsProvider(val system: ActorSystem) extends DefaultSSLContextCreation {
override protected def sslConfig: AkkaSSLConfig =
throw new RuntimeException("Unsupported behaviour when creating new sslConfig")
def httpsConnectionContext() = {
val akkaSslConfig =
new AkkaSSLConfig(system.asInstanceOf[ExtendedActorSystem], SSLConfigFactory.parse(system.settings.config))
createClientHttpsContext(akkaSslConfig)
}
}
Create an HTTPS context and use it in an HTTP connection pool:
Http(actorSystem).cachedHostConnectionPoolHttps[RequestContext](
host = host,
port = portValue,
connectionContext = new TlsProvider(actorSystem).httpsConnectionContext()
)
Or pass the connection context to the Http(actorSystem).singleRequest method, for example:
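A minimal sketch, reusing the TlsProvider above; "request" stands for whatever HttpRequest you have built.
// Sketch: per-request connection context instead of a pooled client.
val response = Http(actorSystem).singleRequest(
  request,
  connectionContext = new TlsProvider(actorSystem).httpsConnectionContext())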
In summary, I used the ssl-config library to manage certificates instead of doing it programmatically. By defining a keyManager in ssl-config, any HTTP request made with the help of the custom httpsConnectionContext will use the certificate to identify the caller/client.
I focused on describing how to establish an HTTPS connection using a client certificate. Any dynamic behavior for managing multiple certificates is omitted, but I hope this code gives you an understanding of how to proceed.
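If you do need the per-customer behavior programmatically rather than through ssl-config, a minimal sketch is to load each customer's PKCS12 file into its own HttpsConnectionContext. The file layout, passphrase handling and the /home/test/<customerId>.p12 naming below are assumptions based on the question.
import java.io.FileInputStream
import java.security.{KeyStore, SecureRandom}
import javax.net.ssl.{KeyManagerFactory, SSLContext, TrustManagerFactory}
import akka.http.scaladsl.{ConnectionContext, HttpsConnectionContext}

// Sketch: one HttpsConnectionContext per customer, built from that customer's PKCS12 file.
def httpsContextFor(customerId: String, passphrase: Array[Char]): HttpsConnectionContext = {
  val keyStore = KeyStore.getInstance("PKCS12")
  val in = new FileInputStream(s"/home/test/$customerId/$customerId.p12") // assumed layout
  try keyStore.load(in, passphrase) finally in.close()

  val kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm)
  kmf.init(keyStore, passphrase)

  val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm)
  tmf.init(null.asInstanceOf[KeyStore]) // null = use the JVM's default trust store

  val ctx = SSLContext.getInstance("TLS")
  ctx.init(kmf.getKeyManagers, tmf.getTrustManagers, new SecureRandom())
  ConnectionContext.https(ctx) // newer Akka HTTP versions prefer ConnectionContext.httpsClient(ctx)
}
The resulting context can then be passed per customer to cachedHostConnectionPoolHttps or singleRequest as shown above.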
I am trying to get some data from a REST web service. So far I can get the data correctly if I don't use HTTPS; this code works as expected:
val client = Http.client.newService(s"$host:80")
val r = http.Request(http.Method.Post, "/api/search/")
r.host(host)
r.content = queryBuf
r.headerMap.add(Fields.ContentLength, queryBuf.length.toString)
r.headerMap.add("Content-Type", "application/json;charset=UTF-8")
val response: Future[http.Response] = client(r)
But when I try to get the same data over an HTTPS request (following this link):
val client = Http.client.withTls(host).newService(s"$host:443")
val r = http.Request(http.Method.Post, "/api/search/")
r.headerMap.add("Cookie", s"_elfowl=${authToken.elfowlToken}; dc=$dc")
r.host(host)
r.content = queryBuf
r.headerMap.add(Fields.ContentLength, queryBuf.length.toString)
r.headerMap.add("Content-Type", "application/json;charset=UTF-8")
r.headerMap.add("User-Agent", authToken.userAgent)
val response: Future[http.Response] = client(r)
I get the error
Remote Info: Not Available at remote address: searchservice.com/10.59.201.29:443. Remote Info: Not Available, flags=0x08
I can curl the same endpoint on port 443 and it returns the right result. Can anyone please help me troubleshoot the issue?
A few things to check:
withTls(host)
needs to be the host name that is in the server's certificate (as opposed to, for instance, the IP address).
you can try:
Http.client.withTlsWithoutValidation
to verify the above.
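For example, a quick diagnostic, with everything else unchanged from the snippet in the question:
// Sketch: disable certificate validation only to confirm validation is the problem, never in production.
val client = Http.client.withTlsWithoutValidation.newService(s"$host:443")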
Also, you might want to verify whether the server checks that the Host header is set; if so, you might want to include it:
val withHeader = new SimpleFilter[http.Request, http.Response] {
override def apply(request: http.Request, service: HttpService): Future[http.Response] = {
request.host_=(host)
service(request)
}
}
withHeader.andThen(client)
more info on host header:
What is http host header?
I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the stocator docs. However, the error suggests stocator isn't available or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9), which supports the fs.cos scheme, is not yet deployed on Spark as a Service (it will be soon). Please use the Stocator scheme fs.s3d to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The library is already installed on our service. Example:
import ibmos2spark
credentials = {
'endpoint': 'endpointXXXX',
'access_key': 'XXXX',
'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
It looks like the cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure the Stocator classes are on the classpath. You can use the packages system by executing pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with the swift2d and cos schemes.
UPDATE 2:
Just follow the Stocator documentation (https://github.com/CODAIT/stocator). It contains all the details on how to install it, which branch to use, etc.
I found the same issue, and to solve it I just changed environments:
Within IBM Watson Studio, if you start a Jupyter notebook in an environment without a pre-configured Spark cluster, then you get that error. Installing PySpark is not enough.
Instead, if you start a notebook with a Spark cluster available, you will be just fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some other fs.cos... configurations.
Here's an end-to-end code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosInstanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_service_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
SparkSession
.builder
.appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_servicce_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')
I have a publicly available Amazon S3 resource (a text file) and want to access it from Spark. That means I don't have any Amazon credentials. It works fine if I just want to download it:
val bucket = "<my-bucket>"
val key = "<my-key>"
val client = new AmazonS3Client
val o = client.getObject(bucket, key)
val content = o.getObjectContent // <= can be read and used as input stream
However, when I try to access the same resource from the Spark context:
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val f = sc.textFile(s"s3a://$bucket/$key")
println(f.count())
I receive the following error with stacktrace:
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at com.example.Main$.main(Main.scala:14)
at com.example.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
I don't want to provide any AWS credentials; I just want to access the resource anonymously (for now). How can I achieve this? I probably need to make it use something like AnonymousAWSCredentialsProvider, but how do I put it inside Spark or Hadoop?
P.S. My build.sbt just in case
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1",
"org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)
UPDATE: After some investigation, I see the reason why it isn't working.
First of all, S3AFileSystem creates the AWS client with the following order of credential providers:
AWSCredentialsProviderChain credentials = new AWSCredentialsProviderChain(
new BasicAWSCredentialsProvider(accessKey, secretKey),
new InstanceProfileCredentialsProvider(),
new AnonymousAWSCredentialsProvider()
);
"accessKey" and "secretKey" values are taken from the spark conf instance (keys must be "fs.s3a.access.key" and "fs.s3a.secret.key" or org.apache.hadoop.fs.s3a.Constants.ACCESS_KEY and org.apache.hadoop.fs.s3a.Constants.SECRET_KEY constants, which is more convenient).
Second, you probably see that AnonymousAWSCredentialsProvider is the third option (last priority). What could possibly be wrong with that? See the implementation of AnonymousAWSCredentials:
public class AnonymousAWSCredentials implements AWSCredentials {
public String getAWSAccessKeyId() {
return null;
}
public String getAWSSecretKey() {
return null;
}
}
It simply returns null for both access key and secret key. Sounds reasonable. But look inside AWSCredentialsProviderChain:
AWSCredentials credentials = provider.getCredentials();
if (credentials.getAWSAccessKeyId() != null &&
credentials.getAWSSecretKey() != null) {
log.debug("Loading credentials from " + provider.toString());
lastUsedProvider = provider;
return credentials;
}
It doesn't choose a provider when both keys are null, which means anonymous credentials can't work. It looks like a bug in aws-java-sdk 1.7.4. I tried to use the latest version, but it's incompatible with hadoop-aws 2.7.1.
Any other ideas?
I personally have never accessed public data from Spark. You can try to use dummy credentials, or create some just for this usage. Set them directly on the SparkConf object:
val sparkConf: SparkConf = ???
val accessKeyId: String = ???
val secretAccessKey: String = ???
sparkConf.set("spark.hadoop.fs.s3.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3.awsSecretAccessKey", secretAccessKey)
sparkConf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", secretAccessKey)
As an alternative, read the documentation of DefaultAWSCredentialsProviderChain to see where the credentials are looked for. The list (order is important) is:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
Java System Properties - aws.accessKeyId and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Instance profile credentials delivered through the Amazon EC2 metadata service
This is what helped me:
val session = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
.getOrCreate()
val df = session.read.csv(filesFromS3:_*)
Versions:
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.hadoop" % "hadoop-aws" % "2.8.5",
Documentation:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties
It seems you can now use the fs.s3a.aws.credentials.provider config key to get anonymous access via org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider, which correctly special-cases the anonymous provider. However, you need a newer hadoop-aws than 2.7, which means you also need a Spark installation without a bundled Hadoop.
Here is how I did it in Colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!pip install -q findspark
!pip install -q pyarrow
Now we install Hadoop on the side, and set the output of hadoop classpath in SPARK_DIST_CLASSPATH, so Spark can see it.
import os
!wget -q http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz
os.environ['HADOOP_HOME']= '/content/hadoop-2.8.4'
os.environ["SPARK_DIST_CLASSPATH"] = "/content/hadoop-2.8.4/etc/hadoop:/content/hadoop-2.8.4/share/hadoop/common/lib/*:/content/hadoop-2.8.4/share/hadoop/common/*:/content/hadoop-2.8.4/share/hadoop/hdfs:/content/hadoop-2.8.4/share/hadoop/hdfs/lib/*:/content/hadoop-2.8.4/share/hadoop/hdfs/*:/content/hadoop-2.8.4/share/hadoop/yarn/lib/*:/content/hadoop-2.8.4/share/hadoop/yarn/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/lib/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/*:/content/hadoop-2.8.4/contrib/capacity-scheduler/*.jar"
Then we do as in https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/, but add s3a and anonymous reading support, which is what the question is about.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.6,org.apache.hadoop:hadoop-aws:2.8.4 --conf spark.sql.execution.arrow.enabled=true --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider pyspark-shell'
And finally we can create the session.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()