Spark Streaming doesn't receive any data from Flume - scala

I created a simple spark streaming application to consume data from Flume using Pull-based approach.
Spark version: 2.2.0
Flume version: 1.7.0
It works well when I run the program from my PC in Eclipse (Run As - Scala Application). But after compiling it into jar and submit the app via spark-submit, it's not receiving any data from Flume. Here's my code:
def main(args: Array[String]){
val conf = new SparkConf().setAppName("twitter").set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(30))
val flumeStream = FlumeUtils.createPollingStream(ssc, "172.31.190.31", 9999)
val tweets = flumeStream.map(e => new String(e.event.getBody.array()))
tweets.print()
tweets.foreachRDD(rdd=>{
rdd.saveAsTextFile("/warehouse/raw/twitter/data")
})
ssc.start()
ssc.awaitTermination()
}
I build the program via Right Click the project - Run As - Maven Build - Goals=package - Run.
And here's how I submit the app:
spark-submit --master local[*] --deploy-mode client --class co.id.linknet.general.StreamingFlume ./spark/lib/linknet-general-1.0.1.jar
Flume config:
TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS
TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = keyword1, keyword2, keywordN
TwitterAgent01.sinks = sparkStream
TwitterAgent01.sinks.sparkStream.type = org.apache.spark.streaming.flume.sink.SparkSink
TwitterAgent01.sinks.sparkStream.hostname = edge01
TwitterAgent01.sinks.sparkStream.port = 9999
TwitterAgent01.sinks.sparkStream.channel = MemoryChannel01
TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
Flume and spark submission are in the same server, I'm able to telnet port 9999 from itself.
For additional info, I've added some required library in flume and spark directory
$FLUME_HOME/lib
spark-streaming-flume_2.11-2.2.0.jar
spark-streaming-flume-sink_2.11-2.2.0.jar
scala-library-2.11.8.jar
commons-lang3-3.5.jar
$SPARK_HOME/jars
spark-streaming-flume_2.11-2.2.0.jar
spark-streaming-flume-sink_2.11-2.2.0.jar
scala-library-2.11.8.jar
commons-lang-2.5.jar
commons-lang3-3.5.jar
Did I miss something ?

Related

Configure log4j.properties for Kafka appender, error when parsing property bootstrap.servers

I want to add a Kafka appender to the audit-hdfs log in a Cloudera cluster.
I have successfully configured a log4j2.xml file with a Kafka appender, I need to convert this log4j2.xml into a log4j2.properties file in order to be able to merge it with the HDFS log log4j2.properties file. I am unable to do this because when I launch my dummy process with log4j2.properties instead of XML, I get an error.
I have tried writing the properties file in several different ways, always resulting in problems with the bootstrap.servers property
This is my properties file
filters = threshold
filter.threshold.type = ThresholdFilter
filter.threshold.level = ALL
appenders = console,kafka
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %m%n
appender.kafka.type = Kafka
appender.kafka.name = kafka
appender.kafka.layout.type = PatternLayout
appender.kafka.layout.pattern =%m%n
appender.kafka.property.type = Property
appender.kafka.property.bootstrap.servers = ip:port
appender.kafka.topic = cdh-audit-hdfs
Here the problem is in this line:
appender.kafka.property.bootstrap.servers = ip:port
I have tried the following, to no avail:
appender.kafka.property.bootstrap.servers = ip:port
appender.kafka.property.bootstrap\.servers = ip:port
appender.kafka.property.name = "bootstrap.servers"
appender.kafka.property.bootstrap.servers = ip:port
appender.kafka.property.key = "bootstrap.servers"
appender.kafka.property.value = ip:port
etc...
This is my dummy process:
package blabla
import org.apache.logging.log4j.LogManager
object dummy extends App{
val logger = LogManager.getLogger
val record = "...c"
while(true){
logger.info(record)
Thread.sleep(5000)
}
}
How do I need to configure my log4j2.properties in order to be able to define this property?
Im expecting this process to write the record in my Kafka topic but instead I get errors like this:
Exception in thread "main" org.apache.logging.log4j.core.config.ConfigurationException: No type attribute provided for component bootstrap
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value.
try this solution, it works for me :
appender.kafka.property.type=Property
appender.kafka.property.name=bootstrap.servers
appender.kafka.property.value=host:port

No FileSystem for scheme: cos

I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the stocator docs. However, the error suggests stocator isn't available or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9) that supports fs.cos scheme is not yet deployed on Spark aaService (It will be soon). Please use the stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The lib is already installed on our service. Example:
import ibmos2spark
credentials = {
'endpoint': 'endpointXXXX',
'access_key': 'XXXX',
'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
It looks like cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure stocator classes are on the classpath. You can use packages system by exceuting pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with swift2d and cos scheme.
UPDATE 2:
Just follow Stocator documentation (https://github.com/CODAIT/stocator). It contains all details how to install it, what branch to use, etc.
I found the same issue, and to solve it I just changed environment:
Within IBM Watson Studio, if you start a a Jupyter notebook in an environment without a pre-configured spark cluster, than you get that error. Installing PySpark is not enough.
Instead, if you start a notebook with the Spark cluster available, you will be just fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some others fs.cos... configurations.
Here's an end-to-end snippet code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosIntanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_servicce_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
SparkSession
.builder
.appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_servicce_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')

Logging for Custom Code Written in Spark

We are writing Scala code for Apache Spark and running the process in Yarn Mode (Yarn Client Mode) in cloudera 5.5. Spark version is 1.5
I need to do logging for this code and want to move logs in Specific Directory for Spark ,outside of noise in spark logs
We are using plain log4j. Dont have time for logging trait for now.
I have changed the default log4j file in ${spark_home} like this
# Set everything to be logged to the console
log4j.rootLogger=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Spark Logs
log4j.appender.aAppender=org.apache.log4j.RollingFileAppender
log4j.appender.aAppender.File=/var/log/aLogger.log
log4j.appender.aAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.aAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
log4j.appender.aAppender.maxFileSize=50MB
log4j.appender.aAppender.maxBackupIndex=5
log4j.appender.aAppender.encoding=UTF-8
# My custom logging goes to another file
log4j.logger.fsaLogger=ERROR, fsaAppender
This is my Code in scala
This is inside main method
val logger =LogManager.getLogger("aLogger")
val jobName = "AData"
logger.warn("jobName ::" +jobName +" Started at ::" +
+Calendar.getInstance().getTime())
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sourceDF = sqlContext.read
.format("com.databricks.spark.avro")
.load(pathToFiles)
finalCSMDF.map { row =>
//To remove Serilizable Exception
var loggers = org.apache.log4j.LogManager.getLogger("aLogger")
loggers.error("Test Log")
row
}
logger.warn("jobName ::" +jobName +" completed at ::" +
+Calendar.getInstance().getTime())
My problem here is that when ever i run this in yarn mode , i find the lines jobName ::" +jobName started" etc in /var/log/aLogger.log but not the loggers.error("Test Log") which is inside the closure. Basically everything in driver is going in closure but not the exceutioners closures
Please help

AWS: InvalidSignature exception while adding record

InvalidSignatureException occurs when trying to add user record using Kinesis Producer library.
AWS_JAVA_SDK_VERSION=1.10.26
AWS_KINESIS_PRODUCER_VERSION=0.10.1
ERROR:
PutRecords failed: {"__type":"InvalidSignatureException","message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method.
SCALA KINESIS PRODUCER CODE
private val configuration: KinesisProducerConfiguration = new KinesisProducerConfiguration
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(config.awsAccessKey, config.awsSecretKey)
configuration.setCredentialsProvider(credentialsProvider)
configuration.setRecordMaxBufferedTime(config.timeLimit)
configuration.setAggregationMaxCount(1)
configuration.setRegion(config.streamRegion)
configuration.setMetricsLevel("none")
private val kinesisProducer = new KinesisProducer(configuration)
kinesisProducer.addUserRecord(streamName, key, eventBytes)`
The above code is not working. But its possible for me to add records to kinesis stream through aws cli from terminal and KinesisClient in code which is specified below.
private def createKinesisClient = {
val accessKey = config.awsAccessKey
val secretKey = config.awsSecretKey
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(accessKey, secretKey)
val client = new AmazonKinesisClient(credentialsProvider)
client.setEndpoint(config.streamEndpoint)
client
}
This happens because your VM/PC/Server clock might by skewed.
If you're running ubuntu, try updating your system time:
sudo ntpdate ntp.ubuntu.com
If you are using docker-machine on Mac, you can resolve with this command:
docker-machine ssh default 'sudo ntpclient -s -h pool.ntp.org'

Access public available Amazon S3 file from Apache Spark

I have a public available Amazon s3 resource (text file) and want to access it from spark. That means - I don't have any Amazon credentials - it works fine if I want to just download it:
val bucket = "<my-bucket>"
val key = "<my-key>"
val client = new AmazonS3Client
val o = client.getObject(bucket, key)
val content = o.getObjectContent // <= can be read and used as input stream
However, when I try to access the same resource from spark context
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val f = sc.textFile(s"s3a://$bucket/$key")
println(f.count())
I receive the following error with stacktrace:
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at com.example.Main$.main(Main.scala:14)
at com.example.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
I don't want to provide any AWS credentials - I just want to access resource anonymously (for now) - how to achieve this? I probably need to make it use something like AnonymousAWSCredentialsProvider - but how to put it inside spark or hadoop?
P.S. My build.sbt just in case
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1",
"org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)
UPDATED: After I did some investigations - I see the reason why itsn't working.
First of all, S3AFileSystem creates AWS client with the following order of credentials:
AWSCredentialsProviderChain credentials = new AWSCredentialsProviderChain(
new BasicAWSCredentialsProvider(accessKey, secretKey),
new InstanceProfileCredentialsProvider(),
new AnonymousAWSCredentialsProvider()
);
"accessKey" and "secretKey" values are taken from the spark conf instance (keys must be "fs.s3a.access.key" and "fs.s3a.secret.key" or org.apache.hadoop.fs.s3a.Constants.ACCESS_KEY and org.apache.hadoop.fs.s3a.Constants.SECRET_KEY constants, which is more convenient).
Second - you probably see that AnonymousAWSCredentialsProvider is the third option (last priority) - what could possible be wrong with that? See the implementation of AnonymousAWSCredentials:
public class AnonymousAWSCredentials implements AWSCredentials {
public String getAWSAccessKeyId() {
return null;
}
public String getAWSSecretKey() {
return null;
}
}
It simply returns null for both access key and secret key. Sounds reasonable. But look inside AWSCredentialsProviderChain:
AWSCredentials credentials = provider.getCredentials();
if (credentials.getAWSAccessKeyId() != null &&
credentials.getAWSSecretKey() != null) {
log.debug("Loading credentials from " + provider.toString());
lastUsedProvider = provider;
return credentials;
}
It doesn't choose provider in case both keys are null - that means anonymous credentials can't work. Looks like a bug inside aws-java-sdk-1.7.4. I tried to use latest version - but it's incompatible with hadoop-aws-2.7.1.
Any other ideas?
I personally never accessed public data from Spark. You can try to use dummy credentials, or to create ones just for this usage. Set them directly on the SparkConf object.
val sparkConf: SparkConf = ???
val accessKeyId: String = ???
val secretAccessKey: String = ???
sparkConf.set("spark.hadoop.fs.s3.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3.awsSecretAccessKey", secretAccessKey)
sparkConf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", secretAccessKey)
As an alternative, read the documentation of DefaultAWSCredentialsProviderChain to see where the credentials are looked for. The list (order is important) is:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
Java System Properties - aws.accessKeyId and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Instance profile credentials delivered through the Amazon EC2 metadata service
This is what helped me:
val session = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
.getOrCreate()
val df = session.read.csv(filesFromS3:_*)
Versions:
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.hadoop" % "hadoop-aws" % "2.8.5",
Documentation:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties
It seems you can now use the aws.credentials.provider config key to use anonymous access given by org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider, which correctly special case the anonymous provider. However, you need a newer hadoop-aws than 2.7, which means you also need a spark installation without a bundled hadoop.
Here is how I did it colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!pip install -q findspark
!pip install -q pyarrow
Now we install hadoop on the side, and set the output of hadoop classpath to SPARK_DIST_CLASSPATH, so spark can see it.
import os
!wget -q http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz
os.environ['HADOOP_HOME']= '/content/hadoop-2.8.4'
os.environ["SPARK_DIST_CLASSPATH"] = "/content/hadoop-2.8.4/etc/hadoop:/content/hadoop-2.8.4/share/hadoop/common/lib/*:/content/hadoop-2.8.4/share/hadoop/common/*:/content/hadoop-2.8.4/share/hadoop/hdfs:/content/hadoop-2.8.4/share/hadoop/hdfs/lib/*:/content/hadoop-2.8.4/share/hadoop/hdfs/*:/content/hadoop-2.8.4/share/hadoop/yarn/lib/*:/content/hadoop-2.8.4/share/hadoop/yarn/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/lib/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/*:/content/hadoop-2.8.4/contrib/capacity-scheduler/*.jar"
Then we do like in https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/, but add s3a and anonymous reading support, which is what the question is about.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.6,org.apache.hadoop:hadoop-aws:2.8.4 --conf spark.sql.execution.arrow.enabled=true --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider pyspark-shell'
And finally we can create the session.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()