Play 2.5 + Slick application.conf configuration error with URL - postgresql

In case anyone gets this weird error, which doesn't help explain what the problem is:
CreationException: Unable to create injector, see the following errors: 1) Error in custom provider, java.lang.IllegalStateException: when specifying driverClassName, jdbcUrl must also be specified while locating play.api.db.evolutions.ApplicationEvolutionsProvider at play.api.db.evolutions.EvolutionsModule.bindings(EvolutionsModule.scala:22): Binding(class play.api.db.evolutions.ApplicationEvolutions to ProviderConstructionTarget(class play.api.db.evolutions.ApplicationEvolutionsProvider) eagerly) (via modules: com.google.inject.util.Modules$OverrideModule -> play.api.inject.guice.GuiceableModuleConversions$$anon$1) while locating play.api.db.evolutions.ApplicationEvolutions 1 error
What I found strange was that the error goes away if you remove
"com.typesafe.play" %% "play-slick-evolutions" % "2.0.0"
from your build.sbt file.

Anyway, the problem was that my application.conf file looked like this:
slick.dbs.default.driver = "slick.driver.PostgresDriver$"
slick.dbs.default.db.driver = "org.postgresql.Driver"
slick.dbs.default.url = "jdbc:postgresql://localhost:5432/pusdienodb"
slick.dbs.default.user = "pusdieno"
slick.dbs.default.password = "password"
Turns out that url, user, and password all need the .db. part as well.
So your configuration should look something like this in the end:
slick.dbs.default.driver = "slick.driver.PostgresDriver$"
slick.dbs.default.db.driver = "org.postgresql.Driver"
slick.dbs.default.db.url = "jdbc:postgresql://localhost:5432/pusdienodb"
slick.dbs.default.db.user = "pusdieno"
slick.dbs.default.db.password = "password"
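For reference, here is a minimal sketch (the HealthCheck class and its query are made up for illustration) of how play-slick typically consumes the corrected slick.dbs.default block through DatabaseConfigProvider:
import javax.inject.Inject
import play.api.db.slick.DatabaseConfigProvider
import slick.driver.JdbcProfile

class HealthCheck @Inject()(dbConfigProvider: DatabaseConfigProvider) {
  // play-slick resolves the "default" database from slick.dbs.default above
  private val dbConfig = dbConfigProvider.get[JdbcProfile]
  import dbConfig.driver.api._

  // Runs a trivial query against the configured database
  def ping() = dbConfig.db.run(sql"select 1".as[Int].head)
}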

Related

Failed to load data source for config using Play-2.6 and Quill.io

I'm currently getting an error when I try to run my Play app. It says Failed to load data source, but then it looks like it is indeed loading the data source. I'm very new to Play and Scala and the rest of my team is also new, so apologies if this is a silly error or if I'm missing some code samples. The database app-users with owner root exists on my local machine, and I don't believe root has a password (it was created using the createuser tool).
Any ideas on what could cause this? Or what am I missing?
Error:
play.api.UnexpectedException: Unexpected exception[IllegalStateException: Failed to load data source for config: 'Config(SimpleConfigObject({"dataSource":"org.postgresql.ds.PGSimpleDataSource","database":"app-users","driver":"org.postgresql.Driver","host":"localhost","password":"","port":5432,"url":"jdbc:postgresql://localhost:5432/app-users","user":"root"}))']
at play.core.server.DevServerStart$$anon$1.reload(DevServerStart.scala:186)
at play.core.server.DevServerStart$$anon$1.get(DevServerStart.scala:124)
at play.core.server.AkkaHttpServer.modelConversion(AkkaHttpServer.scala:183)
at play.core.server.AkkaHttpServer.handleRequest(AkkaHttpServer.scala:189)
at play.core.server.AkkaHttpServer.$anonfun$createServerBinding$3(AkkaHttpServer.scala:106)
at akka.stream.impl.fusing.MapAsync$$anon$24.onPush(Ops.scala:1191)
at akka.stream.impl.fusing.GraphInterpreter.processPush(GraphInterpreter.scala:512)
at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:475)
at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:371)
at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:584)
Caused by: java.lang.IllegalStateException: Failed to load data source for config: 'Config(SimpleConfigObject({"dataSource":"org.postgresql.ds.PGSimpleDataSource","database":"app-users","driver":"org.postgresql.Driver","host":"localhost","password":"","port":5432,"url":"jdbc:postgresql://localhost:5432/app-users","user":"root"}))'
at io.getquill.JdbcContextConfig.dataSource(JdbcContextConfig.scala:24)
at io.getquill.PostgresJdbcContext.<init>(PostgresJdbcContext.scala:17)
at io.getquill.PostgresJdbcContext.<init>(PostgresJdbcContext.scala:18)
at io.getquill.PostgresJdbcContext.<init>(PostgresJdbcContext.scala:19)
at db.db.package$DBContext.<init>(package.scala:6)
at MyComponents.ctx$lzycompute(MyApplicationLoader.scala:19)
at MyComponents.ctx(MyApplicationLoader.scala:19)
at MyComponents.userService$lzycompute(MyApplicationLoader.scala:22)
at MyComponents.userService(MyApplicationLoader.scala:22)
at MyComponents.applicationController$lzycompute(MyApplicationLoader.scala:29)
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: argument type mismatch
at com.zaxxer.hikari.util.PropertyElf.setProperty(PropertyElf.java:154)
at com.zaxxer.hikari.util.PropertyElf.lambda$setTargetFromProperties$0(PropertyElf.java:57)
at java.util.Hashtable.forEach(Hashtable.java:879)
at com.zaxxer.hikari.util.PropertyElf.setTargetFromProperties(PropertyElf.java:52)
at com.zaxxer.hikari.HikariConfig.<init>(HikariConfig.java:132)
at io.getquill.JdbcContextConfig.dataSource(JdbcContextConfig.scala:21)
at io.getquill.PostgresJdbcContext.<init>(PostgresJdbcContext.scala:17)
at io.getquill.PostgresJdbcContext.<init>(PostgresJdbcContext.scala:18)
at io.getquill.PostgresJdbcContext.<init>(PostgresJdbcContext.scala:19)
at db.db.package$DBContext.<init>(package.scala:6)
Caused by: java.lang.IllegalArgumentException: argument type mismatch
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.zaxxer.hikari.util.PropertyElf.setProperty(PropertyElf.java:149)
at com.zaxxer.hikari.util.PropertyElf.lambda$setTargetFromProperties$0(PropertyElf.java:57)
at java.util.Hashtable.forEach(Hashtable.java:879)
at com.zaxxer.hikari.util.PropertyElf.setTargetFromProperties(PropertyElf.java:52)
at com.zaxxer.hikari.HikariConfig.<init>(HikariConfig.java:132)
at io.getquill.JdbcContextConfig.dataSource(JdbcContextConfig.scala:21)
application.conf
play.db {
config = "db"
default = "default"
}
db.default {
driver = "org.postgresql.Driver"
dataSource = "org.postgresql.ds.PGSimpleDataSource"
url = "jdbc:postgresql://localhost:5432/app-users"
user = "root"
user = ${?DB_USER}
host = "localhost"
host = ${?DB_HOST}
port = 5432
port = ${?DB_PORT}
password = ""
password = ${?DB_PASSWORD}
database = "app-users"
}
db/package.scala
import io.getquill.{PostgresJdbcContext, SnakeCase}
package object db {
class DBContext(config: String) extends PostgresJdbcContext(SnakeCase, config)
trait Repository {
val ctx: DBContext
}
}
Using:
Scala 2.12.4
Quill 2.3.2
Play 2.6.6
Postgres JDBC Driver 42.2.1
PostgreSQL 10.2
UPDATE:
Added a password of "root" to the root user and switched to using the same format as the Quill docs, so now application.conf looks like this:
db.default {
dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
dataSource.user = root
dataSource.password = root
dataSource.databaseName = app-users
dataSource.portNumber = 5432
dataSource.serverName = host
connectionTimeout = 30000
}
But the error message is still basically the same:
play.api.UnexpectedException: Unexpected exception[IllegalStateException: Failed to load data source for config: 'Config(SimpleConfigObject({"connectionTimeout":30000,"dataSource":{"databaseName":"app-users","password":"root","portNumber":5432,"serverName":"host","user":"root"},"dataSourceClassName":"org.postgresql.ds.PGSimpleDataSource"}))']
The following worked for me:
db.default {
dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
dataSource.user = root
dataSource.password = root
dataSource.databaseName = app-users
dataSource.portNumber = 5432
dataSource.serverName = localhost
connectionTimeout = 30000
}
Basically, localhost instead of host for dataSource.serverName. I'm guessing the first iteration didn't work because of the quotes.
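For completeness, a minimal sketch (the AppDatabase object is made up; it mirrors the DBContext in db/package.scala above) of how Quill picks up this block:
import io.getquill.{PostgresJdbcContext, SnakeCase}

object AppDatabase {
  // "db.default" is the config prefix Quill hands to HikariCP,
  // i.e. the working block shown above
  lazy val ctx = new PostgresJdbcContext(SnakeCase, "db.default")
}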

Akka Streams Hikari Connection Pool for MySQL Streaming

I am streaming data from MySQL using Slick 3 and Akka Streams.
This is how I build my source:
import slick.jdbc.MySQLProfile.api._
val enableJdbcStreaming: (java.sql.Statement) => Unit = {statement =>
if (statement.isWrapperFor(classOf[com.mysql.cj.jdbc.StatementImpl])) {
statement.unwrap(classOf[com.mysql.cj.jdbc.StatementImpl]).enableStreamingResults()
}
}
val query = Tables.Foo.filter(r => r.isActive === true)
.map(r => r.id).result.withStatementParameters(statementInit = enableJdbcStreaming)
Source.fromPublisher(db.stream(query))
My application runs for like 20 minutes and then shuts down with the following error
[error] Exception in thread "abhipool network timeout executor" java.lang.NullPointerException
[info] 15:31:46 INFO [HikariPool] - abhipool - Close initiated...
[error] at com.mysql.cj.mysqla.io.MysqlaProtocol.setSocketTimeout(MysqlaProtocol.java:1397)
[error] at com.mysql.cj.mysqla.MysqlaSession$1.run(MysqlaSession.java:401)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)
I have a feeling that, because my query runs for a very long time, some kind of timeout occurs and initiates this shutdown.
My connection configuration:
mysql {
profile = "slick.jdbc.MySQLProfile$"
dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
properties {
driver = "com.mysql.cj.jdbc.Driver"
url = "jdbc:mysql://foo:3306/bar?useLegacyDatetimeCode=false&serverTimezone=America/Chicago"
user = "foo"
password = "bar"
}
connectionTimeout = 0
idleTimeout = 0
maxLifetime = 0
maxConnections = 40
minConnections = 10
poolName = "abhipool"
numThreads = 10
}
Dependencies
"com.typesafe.slick" %% "slick" % "3.2.1",
"com.typesafe.slick" %% "slick-hikaricp" % "3.2.1",
"mysql" % "mysql-connector-java" % "6.0.6",
How can I configure my application's database connections so that even if my streaming application streams data for several days... it keeps running?
There is an extremely lengthy conversation about this same issue here, but it doesn't tell me how to really fix it. This issue makes it totally impossible to write long-running streaming tasks which use MySQL as a source.
You can configure the MySQL driver by adding parameters to the URL:
url = "jdbc:mysql://foo:3306/bar?useLegacyDatetimeCode=false&serverTimezone=America/Chicago&socketTimeout=30000"
I used 30000 for the sake of the example; use whatever value fits your needs.
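If the Database were built programmatically rather than from config, the same parameter would go on the URL there as well; a rough sketch (not the poster's exact setup, and 30000 ms is again only an example):
import slick.jdbc.MySQLProfile.api._

// socketTimeout (in milliseconds) is passed straight through to the MySQL driver
val db = Database.forURL(
  url = "jdbc:mysql://foo:3306/bar?useLegacyDatetimeCode=false&serverTimezone=America/Chicago&socketTimeout=30000",
  user = "foo",
  password = "bar",
  driver = "com.mysql.cj.jdbc.Driver"
)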

Accessing postgres using slick is not working

I have the following environment: Scala 2.11.8 / Akka 2.4.8 / Slick 3.1.1 / PostgreSQL 9.6.
I have the following configuration in application.conf:
mydb {
driver = "slick.driver.PostgresDriver$"
db {
url = "jdbc:postgresql://localhost:5432/mydb"
driver = org.postgresql.Driver
user="postgres"
password="postgres"
numThreads = 10
connectionPool = disabled
keepAliveConnection = true
}
}
The DB access is done in this class:
package mib
import slick.driver.PostgresDriver.api._
import scala.concurrent.ExecutionContext.Implicits.global
class DBAccess {
import scala.concurrent.Future
import scala.concurrent._
import scala.concurrent.duration._
import slick.backend.DatabaseConfig
import slick.driver.JdbcProfile
import slick.driver.PostgresDriver
import slick.driver.PostgresDriver.api._
import slick.jdbc.JdbcBackend.Database
println("creating database")
val dbConfig: DatabaseConfig[PostgresDriver] = DatabaseConfig.forConfig("mydb")
val db = dbConfig.db
try{
val accesspoints = TableQuery[mibPoint]
// SELECT "mib_id" FROM "mib_non_info"
val q = for (a <- accesspoints) yield a.mib_id
val dbAction = q.result
val f: Future[Seq[String]] = db.run(dbAction)
Await.result(f, Duration.Inf)
f.onSuccess { case s => println(s"Result: $s") }
}
catch
{
case _: Throwable =>println("got some exception")
}
finally
db.close
}
// this is a class that represents the table I've created in the database
class mibPoint(tag: Tag) extends Table[(String, Double,Double)](tag, "mib_non_info") {
def mac_id = column[String]("mib_id",O.PrimaryKey)
def lat = column[Double]("lat")
def lng = column[Double]("lng")
def * = (mib_id, lat,lng)
}
This class is called from the App object as follows:
object wmib extends App {
val mWBootStrapper = new bootStrap
mWBootStrapper.ReadProperties();
val mdB = new DBAccess
}
However, after running, I always get the output "got some exception".
I have tried to enable logging using slf4j/logback, but I still do not see much in the logs.
The above seems very trivial, and I am probably missing something obvious.
Thanks in advance,
Vishal
I added the exception handling as suggested by sarvesh. That was cool and thank you.
However my problem vanished and there was no exception.
What happened?
Earlier in the day, I had attempted to access the DB the plain Java JDBC way,
i.e. just to check that there was nothing wrong with the DB and DB access.
In the process, I downloaded and added the Postgres driver to the classpath. Earlier that was not the case.
Since the driver was now on the classpath, the code just worked.
Since I was not printing the exception, I did not realize the error.
I then removed the driver jar AND I got the following error.
01:44:08.224 [mydb.db-1] DEBUG slick.jdbc.JdbcBackend.statement - Preparing statement: select "mib_id" from "mibpoint"
01:44:08.224 [mydb.db-1] DEBUG slick.jdbc.DriverDataSource - Driver org.postgresql.Driver not already registered; trying to load it
java.lang.ClassNotFoundException: org.postgresql.Driver
at java.lang.ClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at slick.util.ClassLoaderUtil$$anon$1.loadClass(ClassLoaderUtil.scala:12)
at slick.jdbc.DriverDataSource$$anonfun$init$2.apply(DriverDataSource.scala:60)
at slick.jdbc.DriverDataSource$$anonfun$init$2.apply(DriverDataSource.scala:58)
at scala.Option.getOrElse(Option.scala:121)
Thanks to all for helping.
Vishal
I was running into the same connection issues when first using Slick. I submitted this PR with details on how to connect to a local Postgres server:
https://github.com/slick/slick/issues/1861#issuecomment-387616310.
But basically, try editing your build.sbt and application.conf files:
The 2020 answer:
You have to make sure of two things:
Add the driver to build.sbt's libraryDependencies: "org.postgresql" % "postgresql" % "42.2.5". That will cause java.sql.DriverManager's getDrivers method (which is used by Slick in the class DriverDataSource) to find the driver org.postgresql.Driver.
Make sure that the database url in application.conf follows JDBC's full-url pattern, as described in the source code: https://github.com/slick/slick/blob/42d787b4950fe876569b5fd68e98c4e0379ac83c/slick/src/main/scala/slick/jdbc/DatabaseUrlDataSource.scala#L9. For example: postgresql://user:password@localhost:5432/postgres.
My full configuration is:
build.sbt
libraryDependencies ++= Seq(
...,
"org.postgresql" % "postgresql" % "42.2.5"
)
application.conf
slick-postgres {
profile = "slick.jdbc.PostgresProfile$"
db {
dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
properties = {
driver = "org.postgresql.Driver"
url = "postgresql://postgres:postgres#localhost:5432/postgres"
}
}
}
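For completeness, a minimal sketch (assuming Slick 3.2+, to match the slick.jdbc.PostgresProfile$ profile above) of how this named block can be loaded:
import slick.basic.DatabaseConfig
import slick.jdbc.PostgresProfile

object SlickPostgres {
  // "slick-postgres" is the top-level key from the application.conf above
  val dbConfig = DatabaseConfig.forConfig[PostgresProfile]("slick-postgres")
  import dbConfig.profile.api._

  // Simple smoke test: this only succeeds if org.postgresql.Driver is on the classpath
  def ping() = dbConfig.db.run(sql"select 1".as[Int].head)
}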
mydb {
dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
properties = {
driver = "slick.driver.PostgresDriver$"
url = "postgres://postgresql:postgresql#localhost:5432/mydb"
}
}
Or, you can try something like:
mydb = {
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
properties = {
url = "jdbc:postgresql://localhost:5432/mydb"
user = "postgres"
password = "postgres"
}
numThreads = 10
}
You need the Postgres Driver on the classpath:
Try adding "org.postgresql" % "postgresql" % "42.1.4" to your libraryDependencies.

com.typesafe.config.ConfigException$NotResolved: has not been resolved,

I am trying to read the following config file using typesafe config
common = {
jdbcDriver = "com.mysql.jdbc.Driver"
slickDriver = "slick.driver.MySQLDriver"
port = 3306
db = "foo"
user = "bar"
password = "baz"
}
source = ${common} {server = "remoteserver"}
target = ${common} {server = "localserver"}
When I try to read my config using this code
val conf = ConfigFactory.parseFile(new File("src/main/resources/application.conf"))
val username = conf.getString("source.user")
I get an error
com.typesafe.config.ConfigException$NotResolved: source.user has not been resolved, you need to call Config#resolve(), see API docs for Config#resolve()
I don't get any error if I put everything directly inside the "source" or "target" blocks. I get errors only when I try to use "common".
I solved it myself.
ConfigFactory.parseFile(new File("src/main/resources/application.conf")).resolve()
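Put together with the snippet from the question, the fix looks like this (parseFile alone does not expand the ${common} substitutions; resolve() does):
import java.io.File
import com.typesafe.config.ConfigFactory

// resolve() expands ${common} before any values are read
val conf = ConfigFactory.parseFile(new File("src/main/resources/application.conf")).resolve()
val username = conf.getString("source.user") // "bar"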
I solved it.
Config confSwitchEnv = ConfigFactory.load("env.conf");
The env.conf file is in the resources dir.
Reference: https://nicedoc.io/lightbend/config

Access public available Amazon S3 file from Apache Spark

I have a publicly available Amazon S3 resource (a text file) and want to access it from Spark. That means I don't have any Amazon credentials. It works fine if I just want to download it:
val bucket = "<my-bucket>"
val key = "<my-key>"
val client = new AmazonS3Client
val o = client.getObject(bucket, key)
val content = o.getObjectContent // <= can be read and used as input stream
However, when I try to access the same resource from the Spark context:
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val f = sc.textFile(s"s3a://$bucket/$key")
println(f.count())
I receive the following error with stacktrace:
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at com.example.Main$.main(Main.scala:14)
at com.example.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
I don't want to provide any AWS credentials - I just want to access the resource anonymously (for now) - how do I achieve this? I probably need to make it use something like AnonymousAWSCredentialsProvider - but how do I plug that into Spark or Hadoop?
P.S. My build.sbt, just in case:
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1",
"org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)
UPDATE: After some investigation, I see the reason why it isn't working.
First of all, S3AFileSystem creates the AWS client with the following order of credentials:
AWSCredentialsProviderChain credentials = new AWSCredentialsProviderChain(
new BasicAWSCredentialsProvider(accessKey, secretKey),
new InstanceProfileCredentialsProvider(),
new AnonymousAWSCredentialsProvider()
);
"accessKey" and "secretKey" values are taken from the spark conf instance (keys must be "fs.s3a.access.key" and "fs.s3a.secret.key" or org.apache.hadoop.fs.s3a.Constants.ACCESS_KEY and org.apache.hadoop.fs.s3a.Constants.SECRET_KEY constants, which is more convenient).
Second - you probably see that AnonymousAWSCredentialsProvider is the third option (last priority) - what could possibly be wrong with that? See the implementation of AnonymousAWSCredentials:
public class AnonymousAWSCredentials implements AWSCredentials {
public String getAWSAccessKeyId() {
return null;
}
public String getAWSSecretKey() {
return null;
}
}
It simply returns null for both access key and secret key. Sounds reasonable. But look inside AWSCredentialsProviderChain:
AWSCredentials credentials = provider.getCredentials();
if (credentials.getAWSAccessKeyId() != null &&
credentials.getAWSSecretKey() != null) {
log.debug("Loading credentials from " + provider.toString());
lastUsedProvider = provider;
return credentials;
}
It doesn't choose a provider when both keys are null, which means anonymous credentials can't work. Looks like a bug inside aws-java-sdk-1.7.4. I tried to use the latest version, but it's incompatible with hadoop-aws-2.7.1.
Any other ideas?
I personally have never accessed public data from Spark. You can try to use dummy credentials, or create some just for this purpose. Set them directly on the SparkConf object.
val sparkConf: SparkConf = ???
val accessKeyId: String = ???
val secretAccessKey: String = ???
sparkConf.set("spark.hadoop.fs.s3.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3.awsSecretAccessKey", secretAccessKey)
sparkConf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", secretAccessKey)
As an alternative, read the documentation of DefaultAWSCredentialsProviderChain to see where the credentials are looked for. The list (order is important) is:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
Java System Properties - aws.accessKeyId and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Instance profile credentials delivered through the Amazon EC2 metadata service
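Since the question uses the s3a:// scheme, the corresponding keys there (the same fs.s3a.access.key / fs.s3a.secret.key constants mentioned in the question's update) would be set like this:
val accessKeyId: String = ???
val secretAccessKey: String = ???
val sparkConf = new org.apache.spark.SparkConf()
// The spark.hadoop. prefix forwards these to the Hadoop configuration used by S3AFileSystem
sparkConf.set("spark.hadoop.fs.s3a.access.key", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3a.secret.key", secretAccessKey)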
This is what helped me:
val session = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
.getOrCreate()
val df = session.read.csv(filesFromS3:_*)
Versions:
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.hadoop" % "hadoop-aws" % "2.8.5",
Documentation:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties
It seems you can now use the aws.credentials.provider config key for anonymous access via org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider, which correctly special-cases the anonymous provider. However, you need a newer hadoop-aws than 2.7, which means you also need a Spark installation without a bundled Hadoop.
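For a plain Scala setup (outside Colab), the same provider can be set directly on the SparkConf, assuming a sufficiently new hadoop-aws is on the classpath; the names mirror the question:
import org.apache.spark.{SparkConf, SparkContext}

val bucket = "<my-bucket>"
val key = "<my-key>"

// The spark.hadoop. prefix forwards the property to the underlying Hadoop configuration
val conf = new SparkConf()
  .setAppName("app")
  .setMaster("local[*]")
  .set("spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
val sc = new SparkContext(conf)
println(sc.textFile(s"s3a://$bucket/$key").count())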
Here is how I did it in Colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!pip install -q findspark
!pip install -q pyarrow
Now we install Hadoop on the side and set the output of hadoop classpath as SPARK_DIST_CLASSPATH, so Spark can see it.
import os
!wget -q http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz
os.environ['HADOOP_HOME']= '/content/hadoop-2.8.4'
os.environ["SPARK_DIST_CLASSPATH"] = "/content/hadoop-2.8.4/etc/hadoop:/content/hadoop-2.8.4/share/hadoop/common/lib/*:/content/hadoop-2.8.4/share/hadoop/common/*:/content/hadoop-2.8.4/share/hadoop/hdfs:/content/hadoop-2.8.4/share/hadoop/hdfs/lib/*:/content/hadoop-2.8.4/share/hadoop/hdfs/*:/content/hadoop-2.8.4/share/hadoop/yarn/lib/*:/content/hadoop-2.8.4/share/hadoop/yarn/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/lib/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/*:/content/hadoop-2.8.4/contrib/capacity-scheduler/*.jar"
Then we proceed as in https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/, but add s3a and anonymous-read support, which is what the question is about.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.6,org.apache.hadoop:hadoop-aws:2.8.4 --conf spark.sql.execution.arrow.enabled=true --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider pyspark-shell'
And finally we can create the session.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()