Access a publicly available Amazon S3 file from Apache Spark - Scala

I have a publicly available Amazon S3 resource (a text file) and want to access it from Spark. That means I don't have any Amazon credentials. It works fine if I just want to download it:
val bucket = "<my-bucket>"
val key = "<my-key>"
val client = new AmazonS3Client
val o = client.getObject(bucket, key)
val content = o.getObjectContent // <= can be read and used as input stream
However, when I try to access the same resource from the Spark context:
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val f = sc.textFile(s"s3a://$bucket/$key")
println(f.count())
I receive the following error with stacktrace:
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at com.example.Main$.main(Main.scala:14)
at com.example.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
I don't want to provide any AWS credentials - I just want to access the resource anonymously (for now). How can I achieve this? I probably need to make it use something like AnonymousAWSCredentialsProvider - but how do I plug that into Spark or Hadoop?
P.S. My build.sbt, just in case:
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1",
"org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)
UPDATED: After some investigation, I see the reason why it isn't working.
First of all, S3AFileSystem creates the AWS client with the following order of credential providers:
AWSCredentialsProviderChain credentials = new AWSCredentialsProviderChain(
new BasicAWSCredentialsProvider(accessKey, secretKey),
new InstanceProfileCredentialsProvider(),
new AnonymousAWSCredentialsProvider()
);
The "accessKey" and "secretKey" values are taken from the Spark conf instance (the keys must be "fs.s3a.access.key" and "fs.s3a.secret.key", or the org.apache.hadoop.fs.s3a.Constants.ACCESS_KEY and org.apache.hadoop.fs.s3a.Constants.SECRET_KEY constants, which is more convenient).
Second - you have probably noticed that AnonymousAWSCredentialsProvider is the third option (last priority) - what could possibly be wrong with that? See the implementation of AnonymousAWSCredentials:
public class AnonymousAWSCredentials implements AWSCredentials {
public String getAWSAccessKeyId() {
return null;
}
public String getAWSSecretKey() {
return null;
}
}
It simply returns null for both the access key and the secret key. Sounds reasonable. But look inside AWSCredentialsProviderChain:
AWSCredentials credentials = provider.getCredentials();
if (credentials.getAWSAccessKeyId() != null &&
credentials.getAWSSecretKey() != null) {
log.debug("Loading credentials from " + provider.toString());
lastUsedProvider = provider;
return credentials;
}
It doesn't choose a provider when both keys are null - which means anonymous credentials can't work. Looks like a bug inside aws-java-sdk-1.7.4. I tried to use the latest version, but it's incompatible with hadoop-aws-2.7.1.
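As a sanity check that the object really is publicly readable once the provider chain is bypassed, the plain SDK accepts AnonymousAWSCredentials directly - a minimal sketch reusing the bucket and key from the first snippet:
import com.amazonaws.auth.AnonymousAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
// Hand the client anonymous credentials explicitly, bypassing the provider chain.
val anonClient = new AmazonS3Client(new AnonymousAWSCredentials())
val obj = anonClient.getObject(bucket, key)
val stream = obj.getObjectContent // readable without any AWS account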
Any other ideas?

I personally have never accessed public data from Spark. You can try to use dummy credentials, or create some just for this purpose. Set them directly on the SparkConf object:
val sparkConf: SparkConf = ???
val accessKeyId: String = ???
val secretAccessKey: String = ???
sparkConf.set("spark.hadoop.fs.s3.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3.awsSecretAccessKey", secretAccessKey)
sparkConf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", secretAccessKey)
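Since the question uses the s3a:// scheme, the equivalent fs.s3a keys from the question's update can be set the same way - a sketch using the same placeholder values:
sparkConf.set("spark.hadoop.fs.s3a.access.key", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3a.secret.key", secretAccessKey)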
As an alternative, read the documentation of DefaultAWSCredentialsProviderChain to see where the credentials are looked for. The list (order is important) is below; a short sketch of the system-property route follows the list:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
Java System Properties - aws.accessKeyId and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Instance profile credentials delivered through the Amazon EC2 metadata service
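For example, a minimal sketch of the system-property route, set before the SparkContext is created (the property names are the ones listed above; the values are placeholders):
// Make credentials visible to the credentials chain via Java system properties.
System.setProperty("aws.accessKeyId", "<access-key>")
System.setProperty("aws.secretKey", "<secret-key>")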

This is what helped me:
val session = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
.getOrCreate()
val df = session.read.csv(filesFromS3:_*)
Versions:
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.hadoop" % "hadoop-aws" % "2.8.5",
Documentation:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties
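If you are on the RDD API rather than SparkSession, the same provider can be passed through SparkConf with the spark.hadoop. prefix, which Spark copies into the Hadoop configuration - a sketch, assuming a hadoop-aws version that honours fs.s3a.aws.credentials.provider (2.8+ as above):
import org.apache.spark.{SparkConf, SparkContext}

// spark.hadoop.* entries are copied into the Hadoop configuration,
// so S3AFileSystem picks up the anonymous credentials provider.
val conf = new SparkConf()
  .setAppName("app")
  .setMaster("local[*]")
  .set("spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
val sc = new SparkContext(conf)
val lines = sc.textFile("s3a://<my-bucket>/<my-key>") // placeholders as in the question
println(lines.count())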

It seems you can now use the fs.s3a.aws.credentials.provider config key to get anonymous access via org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider, which correctly special-cases the anonymous provider. However, you need a newer hadoop-aws than 2.7, which means you also need a Spark installation without a bundled Hadoop.
Here is how I did it in Colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!pip install -q findspark
!pip install -q pyarrow
Now we install Hadoop on the side and set the output of hadoop classpath as SPARK_DIST_CLASSPATH, so Spark can see it.
import os
!wget -q http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz
os.environ['HADOOP_HOME']= '/content/hadoop-2.8.4'
os.environ["SPARK_DIST_CLASSPATH"] = "/content/hadoop-2.8.4/etc/hadoop:/content/hadoop-2.8.4/share/hadoop/common/lib/*:/content/hadoop-2.8.4/share/hadoop/common/*:/content/hadoop-2.8.4/share/hadoop/hdfs:/content/hadoop-2.8.4/share/hadoop/hdfs/lib/*:/content/hadoop-2.8.4/share/hadoop/hdfs/*:/content/hadoop-2.8.4/share/hadoop/yarn/lib/*:/content/hadoop-2.8.4/share/hadoop/yarn/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/lib/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/*:/content/hadoop-2.8.4/contrib/capacity-scheduler/*.jar"
Then we do as in https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/, but add s3a and anonymous-reading support, which is what the question is about.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.6,org.apache.hadoop:hadoop-aws:2.8.4 --conf spark.sql.execution.arrow.enabled=true --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider pyspark-shell'
And finally we can create the session.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Related

No FileSystem for scheme: cos

I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the Stocator docs. However, the error suggests Stocator isn't available, or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9), which supports the fs.cos scheme, is not yet deployed on Spark as a Service (it will be soon). Please use the Stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The lib is already installed on our service. Example:
import ibmos2spark
credentials = {
'endpoint': 'endpointXXXX',
'access_key': 'XXXX',
'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
It looks like the cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure the Stocator classes are on the classpath. You can use the packages system by executing pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with both the swift2d and cos schemes.
UPDATE 2:
Just follow the Stocator documentation (https://github.com/CODAIT/stocator). It contains all the details on how to install it, which branch to use, etc.
I found the same issue, and to solve it I just changed environment:
Within IBM Watson Studio, if you start a Jupyter notebook in an environment without a pre-configured Spark cluster, then you get that error. Installing PySpark is not enough.
Instead, if you start a notebook with a Spark cluster available, you will be just fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some other fs.cos... configurations.
Here's an end-to-end code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosInstanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_service_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
SparkSession
.builder
.appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_service_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')

Accessing Postgres using Slick is not working

I have the following environment: Scala 2.11.8 / Akka 2.4.8 / Slick 3.1.1 / PostgreSQL 9.6
I have done the following configuration in application.conf:
mydb {
driver = "slick.driver.PostgresDriver$"
db {
url = "jdbc:postgresql://localhost:5432/mydb"
driver = org.postgresql.Driver
user="postgres"
password="postgres"
numThreads = 10
connectionPool = disabled
keepAliveConnection = true
}
}
The DB access is done in class
package mib
import slick.driver.PostgresDriver.api._
import scala.concurrent.ExecutionContext.Implicits.global
class DBAccess {
import scala.concurrent.Future
import scala.concurrent._
import scala.concurrent.duration._
import slick.backend.DatabaseConfig
import slick.driver.JdbcProfile
import slick.driver.PostgresDriver
import slick.driver.PostgresDriver.api._
import slick.jdbc.JdbcBackend.Database
println("creating database")
val dbConfig: DatabaseConfig[PostgresDriver] = DatabaseConfig.forConfig("mydb")
val db = dbConfig.db
try{
val accesspoints = TableQuery[mibPoint]
// SELECT * FROM users WHERE username='john'
val q = for (a <- accesspoints) yield a.mib_id
val dbAction = q.result
val f: Future[Seq[String]] = db.run(dbAction)
Await.result(f, Duration.Inf)
f.onSuccess { case s => println(s"Result: $s") }
}
catch
{
case _: Throwable =>println("got some exception")
}
finally
db.close
}
// this is a class that represents the table I've created in the database
class mibPoint(tag: Tag) extends Table[(String, Double,Double)](tag, "mib_non_info") {
def mib_id = column[String]("mib_id", O.PrimaryKey)
def lat = column[Double]("lat")
def lng = column[Double]("lng")
def * = (mib_id, lat,lng)
}
This class is called from APP object as
object wmib extends App {
val mWBootStrapper = new bootStrap
mWBootStrapper.ReadProperties();
val mdB = new DBAccess
}
However, after running it, I always get the output "got some exception".
I have tried to enable logging using slf4j/logback, but I still do not see much in the logs.
The above seems very trivial, and I am probably missing something obvious.
Thanks in advance,
Vishal
I added the exception handling as suggested by sarvesh. That was cool, and thank you.
However, my problem vanished and there was no exception.
What happened?
Earlier in the day, I had attempted to access the DB using the plain Java JDBC way,
i.e. just to check that there was nothing wrong with the DB and DB access.
In the process, I downloaded and added the Postgres driver to the classpath. Earlier that was not the case.
Since the driver was now on the classpath, the code just worked.
Since I was not printing the exception, I did not notice the error.
I then removed the driver jar AND I got the following error:
01:44:08.224 [mydb.db-1] DEBUG slick.jdbc.JdbcBackend.statement - Preparing statement: select "mib_id" from "mibpoint"
01:44:08.224 [mydb.db-1] DEBUG slick.jdbc.DriverDataSource - Driver org.postgresql.Driver not already registered; trying to load it
java.lang.ClassNotFoundException: org.postgresql.Driver
at java.lang.ClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at slick.util.ClassLoaderUtil$$anon$1.loadClass(ClassLoaderUtil.scala:12)
at slick.jdbc.DriverDataSource$$anonfun$init$2.apply(DriverDataSource.scala:60)
at slick.jdbc.DriverDataSource$$anonfun$init$2.apply(DriverDataSource.scala:58)
at scala.Option.getOrElse(Option.scala:121)
Thanks to all for helping.
Vishal
I was running into the same connection issues when first using Slick. I submitted this PR with details on how to connect up a local Postgres server.
https://github.com/slick/slick/issues/1861#issuecomment-387616310.
But basically, try editing your build.sbt and application.conf files:
The 2020 answer:
You have to make sure of two things:
Add the driver to build.sbt's libraryDependencies: "org.postgresql" % "postgresql" % "42.2.5". That will cause java.sql.DriverManager's getDrivers method (which is used by Slick in the class DriverDataSource) to find the driver org.postgresql.Driver
Make sure that the database url in application.conf follows JDBC's full-url pattern, as described in the source code: https://github.com/slick/slick/blob/42d787b4950fe876569b5fd68e98c4e0379ac83c/slick/src/main/scala/slick/jdbc/DatabaseUrlDataSource.scala#L9. For example: postgresql://user:password@localhost:5432/postgres.
My full configuration is:
build.sbt
libraryDependencies ++= Seq(
...,
"org.postgresql" % "postgresql" % "42.2.5"
)
application.conf
slick-postgres {
profile = "slick.jdbc.PostgresProfile$"
db {
dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
properties = {
driver = "org.postgresql.Driver"
url = "postgresql://postgres:postgres@localhost:5432/postgres"
}
}
}
mydb {
dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
properties = {
driver = "slick.driver.PostgresDriver$"
url = "postgres://postgresql:postgresql@localhost:5432/mydb"
}
}
Or, you can try something like:
mydb = {
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
properties = {
url = "jdbc:postgresql://localhost:5432/mydb"
user = "postgres"
password = "postgres"
}
numThreads = 10
}
You need the Postgres Driver on the classpath:
Try adding "org.postgresql" % "postgresql" % "42.1.4" to your libraryDependencies.
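For completeness, a minimal build.sbt sketch with the driver added (the Slick version is the one from the question; the driver version is the one suggested above):
libraryDependencies ++= Seq(
  "com.typesafe.slick" %% "slick" % "3.1.1",
  "org.postgresql" % "postgresql" % "42.1.4"
)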

How to use Spark BigQuery Connector locally?

For test purposes, I would like to use the BigQuery Connector to write Parquet Avro logs to BigQuery. As of this writing, there is no way to ingest Parquet directly from the UI, so I'm writing a Spark job to do so.
In Scala, for the time being, the job body is the following:
val events: RDD[RichTrackEvent] =
readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)
val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")
// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema
// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
"mapreduce.job.outputformat.class",
classOf[BigQueryOutputFormat[_, _]].getName
)
events
.mapPartitions {
items =>
val gson = new Gson()
items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
}
.map(x => (null, x))
.saveAsNewAPIHadoopDataset(conf)
As the BigQueryOutputFormat isn't finding the Google credentials, it falls back on the metadata host to try to discover them, with the following stack trace:
2016-06-13 11:40:53 WARN HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
It is of course expected, but it should be able to use my service account and its key, as GoogleCredential.getApplicationDefault() returns appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
As the connector seems to read credentials from the Hadoop configuration, what are the keys to set so that it reads GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?
If I understand your question correctly - you might want to set:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Here, the mapred.bq.auth.service.account.keyfile should point to the full file path of the older-style "P12" keyfile; alternatively, if you're using the newer "JSON" keyfiles, you should replace the "email" and "keyfile" entries with the single mapred.bq.auth.service.account.json.keyfile key:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
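A sketch of setting the JSON-keyfile variant on the job's Hadoop configuration, in the same style as the question's code (the keyfile path and bucket name are placeholders, and enabling service-account auth with "true" is my assumption):
// Assumes sc is the SparkContext from the question; values below are placeholders.
val hconf = sc.hadoopConfiguration
hconf.set("mapred.bq.auth.service.account.enable", "true")
hconf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/service-account.json")
hconf.set("mapred.bq.project.id", "myproject")
hconf.set("mapred.bq.gcs.bucket", "<my-gcs-bucket>")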
Also, you might want to take a look at https://github.com/spotify/spark-bigquery - which is a much more civilised way of working with BQ and Spark. The setGcpJsonKeyFile method used in that case takes the same JSON keyfile you'd set for mapred.bq.auth.service.account.json.keyfile when using the BQ connector for Hadoop.

AWS: InvalidSignature exception while adding record

An InvalidSignatureException occurs when trying to add a user record using the Kinesis Producer Library.
AWS_JAVA_SDK_VERSION=1.10.26
AWS_KINESIS_PRODUCER_VERSION=0.10.1
ERROR:
PutRecords failed: {"__type":"InvalidSignatureException","message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method.
SCALA KINESIS PRODUCER CODE
private val configuration: KinesisProducerConfiguration = new KinesisProducerConfiguration
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(config.awsAccessKey, config.awsSecretKey)
configuration.setCredentialsProvider(credentialsProvider)
configuration.setRecordMaxBufferedTime(config.timeLimit)
configuration.setAggregationMaxCount(1)
configuration.setRegion(config.streamRegion)
configuration.setMetricsLevel("none")
private val kinesisProducer = new KinesisProducer(configuration)
kinesisProducer.addUserRecord(streamName, key, eventBytes)
The above code is not working. But it is possible for me to add records to the Kinesis stream through the AWS CLI from the terminal, and through the KinesisClient code specified below.
private def createKinesisClient = {
val accessKey = config.awsAccessKey
val secretKey = config.awsSecretKey
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(accessKey, secretKey)
val client = new AmazonKinesisClient(credentialsProvider)
client.setEndpoint(config.streamEndpoint)
client
}
This happens because your VM/PC/server clock might be skewed.
If you're running Ubuntu, try updating your system time:
sudo ntpdate ntp.ubuntu.com
If you are using docker-machine on Mac, you can resolve it with this command:
docker-machine ssh default 'sudo ntpclient -s -h pool.ntp.org'

Requester cannot establish the connection. Jetty, Lift /Scala, iSeries DB2/400

I'm working my way through the Lift Application Development Cookbook by Gilberto T. Garcia Jr. and have run up against a problem I can't seem to resolve. I've copied the source code Chap06-map-table and I'm trying to modify it to work with my IBM i (iSeries, AS/400, i5) database. I was able to make it work with the first type of connection using Squeryl Record. However, I can't seem to figure out how to get this to work using a JNDI Datasource. I've spent a couple of days searching the internet for examples of setting this up and have not found a good example involving a DB2/400 database connection. Below is the error I get when I attempt to start the container, and the code I've modified in an effort to make it work. Any help would be appreciated.
There seem to be several choices for the data source class from jt400.jar (JTOpen), and I'm not sure which would be the best to use, or perhaps there's another. I've been trying this with each of the three and am assuming the first is the correct one.
com.ibm.as400.access.AS400JDBCManagedConnectionPoolDataSource
com.ibm.as400.access.AS400JDBCConnectionPoolDataSource
com.ibm.as400.access.AS400JDBCDataSource
Thanks. Bob
This is the start of the error:
> container:start
[info] jetty-8.0.4.v20111024
[info] No Transaction manager found - if your webapp requires one, please configure one.
[info] NO JSP Support for /, did not find org.apache.jasper.servlet.JspServlet
[info] started o.e.j.w.WebAppContext{/,[file:/C:/Users/Bob/Lift26Projects/scala_210/chap06-map-table/src/main/webapp/]}
[info] started o.e.j.w.WebAppContext{/,[file:/C:/Users/Bob/Lift26Projects/scala_210/chap06-map-table/src/main/webapp/]}
18:21:47.062 [pool-7-thread-1] ERROR n.liftweb.http.provider.HTTPProvider - Failed to Boot! Your application may not run properly
java.sql.SQLException: The application requester cannot establish the connection. ("jdbc:as400://www.busapp.com;libraries=PLAY2TEST";naming=system;errors=full;)
at com.ibm.as400.access.JDError.throwSQLException(JDError.java:524) ~[jt400-6.7.jar:JTOpen 6.7]
at com.ibm.as400.access.AS400JDBCConnection.setProperties(AS400JDBCConnection.java:3142) ~[jt400-6.7.jar:JTOpen 6.7]
at com.ibm.as400.access.AS400JDBCManagedDataSource.createPhysicalConnect...
My Build.sbt File:
name := "Lift 2.5 starter template"
version := "0.0.1"
organization := "net.liftweb"
scalaVersion := "2.10.0"
resolvers ++= Seq("snapshots" at "http://oss.sonatype.org/content/repositories/snapshots",
"staging" at "http://oss.sonatype.org/content/repositories/staging",
"releases" at "http://oss.sonatype.org/content/repositories/releases"
)
seq(com.github.siasia.WebPlugin.webSettings :_*)
unmanagedResourceDirectories in Test <+= (baseDirectory) { _ / "src/main/webapp" }
scalacOptions ++= Seq("-deprecation", "-unchecked")
env in Compile := Some(file("./src/main/webapp/WEB-INF/jetty-env.xml") asFile)
libraryDependencies ++= {
val liftVersion = "2.5"
Seq(
"net.liftweb" %% "lift-webkit" % liftVersion % "compile",
"net.liftmodules" %% "lift-jquery-module_2.5" % "2.3",
"org.eclipse.jetty" % "jetty-webapp" % "8.0.4.v20111024" % "container",
"org.eclipse.jetty" % "jetty-plus" % "8.0.4.v20111024" % "container",
"ch.qos.logback" % "logback-classic" % "1.0.6",
"org.specs2" %% "specs2" % "1.14" % "test",
"net.liftweb" %% "lift-squeryl-record" % liftVersion % "compile",
"net.sf.jt400" % "jt400" % "6.7",
"org.liquibase" % "liquibase-maven-plugin" % "3.0.2"
)
}
This is my boot.scala file:
package bootstrap.liftweb
import _root_.liquibase.database.DatabaseFactory
import _root_.liquibase.database.jvm.JdbcConnection
import _root_.liquibase.exception.DatabaseException
import _root_.liquibase.Liquibase
import _root_.liquibase.resource.FileSystemResourceAccessor
import net.liftweb._
import util._
import Helpers._
import common._
import http._
import sitemap._
import Loc._
import net.liftmodules.JQueryModule
import net.liftweb.http.js.jquery._
import net.liftweb.squerylrecord.SquerylRecord
import org.squeryl.Session
import java.sql.{SQLException, DriverManager}
import org.squeryl.adapters.DB2Adapter
import javax.naming.InitialContext
import javax.sql.DataSource
import code.model.LiftBookSchema
/**
* A class that's instantiated early and run. It allows the application
* to modify lift's environment
*/
class Boot {
def runChangeLog(ds: DataSource) {
val connection = ds.getConnection
try {
val database = DatabaseFactory.getInstance().
findCorrectDatabaseImplementation(new JdbcConnection(connection))
val liquibase = new Liquibase(
"database/changelog/db.changelog-master.xml",
new FileSystemResourceAccessor(),
database
)
liquibase.update(null)
} catch {
case e: SQLException => {
connection.rollback()
throw new DatabaseException(e)
}
}
}
def boot {
// where to search snippet
LiftRules.addToPackages("code")
prepareDb()
// Build SiteMap
val entries = List(
Menu.i("Home") / "index", // the simple way to declare a menu
// more complex because this menu allows anything in the
// /static path to be visible
Menu(Loc("Static", Link(List("static"), true, "/static/index"),
"Static Content")))
// set the sitemap. Note if you don't want access control for
// each page, just comment this line out.
LiftRules.setSiteMap(SiteMap(entries: _*))
//Show the spinny image when an Ajax call starts
LiftRules.ajaxStart =
Full(() => LiftRules.jsArtifacts.show("ajax-loader").cmd)
// Make the spinny image go away when it ends
LiftRules.ajaxEnd =
Full(() => LiftRules.jsArtifacts.hide("ajax-loader").cmd)
// Force the request to be UTF-8
LiftRules.early.append(_.setCharacterEncoding("UTF-8"))
// Use HTML5 for rendering
LiftRules.htmlProperties.default.set((r: Req) =>
new Html5Properties(r.userAgent))
//Init the jQuery module, see http://liftweb.net/jquery for more information.
LiftRules.jsArtifacts = JQueryArtifacts
JQueryModule.InitParam.JQuery = JQueryModule.JQuery172
JQueryModule.init()
}
def prepareDb() {
Class.forName("com.ibm.as400.access.AS400JDBCManagedConnectionPoolDataSource")
val ds = new InitialContext().lookup("java:/comp/env/jdbc/dsliftbook").asInstanceOf[DataSource]
runChangeLog(ds)
SquerylRecord.initWithSquerylSession(
Session.create(
ds.getConnection,
new DB2Adapter)
)
}
}
This is my jetty-env.xml file:
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
<New id="dsliftbook" class="org.eclipse.jetty.plus.jndi.Resource">
<Arg></Arg>
<Arg>jdbc/dsliftbook</Arg>
<Arg>
<New class="com.ibm.as400.access.AS400JDBCManagedConnectionPoolDataSource">
<Set name="serverName">"jdbc:as400://www.[server].com;libraries=PLAY2TEST";naming=system;errors=full;</Set>
<Set name="user">[user]</Set>
<Set name="password">[password]</Set>
</New>
</Arg>
</New>
</Configure>
Okay, I've managed to get connected. One problem was the quotation marks in the jetty-env.xml file. Also, the user name/password I was using apparently did not have the authority required to make this work; I'm not sure why, since this is the same ID/password I use for all my iSeries development. So for now, I'm using another user profile with security officer authority until I can figure out what's happening or what authorities are required.
Once I got signed on, I was not able to set a library list for the user, and this was causing the SQL to fail: it was looking for a library with the same name as the user ID. For the time being, I've gotten around this issue by creating a new library named the same as the user ID.
One other problem is that even though I'm supplying both the ID and password, I'm getting prompted to enter the ID/password before it will connect. The ID and URL are filled in, but the password always has to be re-keyed.
I've included the current source for the jetty-env.xml file and the boot.scala file. Hopefully this may help others.
Thanks to Dave and James for their help!
Bob
boot.scala:
package bootstrap.liftweb
// import _root_.liquibase.database.DatabaseFactory
// import _root_.liquibase.database.jvm.JdbcConnection
// import _root_.liquibase.exception.DatabaseException
// import _root_.liquibase.Liquibase
// import _root_.liquibase.resource.FileSystemResourceAccessor
import net.liftweb._
import util._
import Helpers._
import common._
import http._
import sitemap._
import Loc._
import net.liftmodules.JQueryModule
import net.liftweb.http.js.jquery._
import net.liftweb.squerylrecord.SquerylRecord
import org.squeryl.Session
import java.sql.{SQLException, DriverManager}
import org.squeryl.adapters.DB2Adapter
import javax.naming.InitialContext
import javax.sql.DataSource
import code.model.LiftBookSchema
import com.ibm.as400.access.AS400JDBCManagedConnectionPoolDataSource
/**
* A class that's instantiated early and run. It allows the application
* to modify lift's environment
*/
class Boot {
// def runChangeLog(ds: DataSource) {
// val connection = ds.getConnection
// try {
// val database = DatabaseFactory.getInstance().
// findCorrectDatabaseImplementation(new JdbcConnection(connection))
// val liquibase = new Liquibase(
// "database/changelog/db.changelog-master.xml",
// new FileSystemResourceAccessor(),
// database
// )
// liquibase.update(null)
// } catch {
// case e: SQLException => {
// connection.rollback()
// throw new DatabaseException(e)
// }
// }
// }
def boot {
// where to search snippet
LiftRules.addToPackages("code")
prepareDb()
// Build SiteMap
val entries = List(
Menu.i("Home") / "index", // the simple way to declare a menu
// more complex because this menu allows anything in the
// /static path to be visible
Menu(Loc("Static", Link(List("static"), true, "/static/index"),
"Static Content")))
// set the sitemap. Note if you don't want access control for
// each page, just comment this line out.
LiftRules.setSiteMap(SiteMap(entries: _*))
//Show the spinny image when an Ajax call starts
LiftRules.ajaxStart =
Full(() => LiftRules.jsArtifacts.show("ajax-loader").cmd)
// Make the spinny image go away when it ends
LiftRules.ajaxEnd =
Full(() => LiftRules.jsArtifacts.hide("ajax-loader").cmd)
// Force the request to be UTF-8
LiftRules.early.append(_.setCharacterEncoding("UTF-8"))
// Use HTML5 for rendering
LiftRules.htmlProperties.default.set((r: Req) =>
new Html5Properties(r.userAgent))
//Init the jQuery module, see http://liftweb.net/jquery for more information.
LiftRules.jsArtifacts = JQueryArtifacts
JQueryModule.InitParam.JQuery = JQueryModule.JQuery172
JQueryModule.init()
}
def prepareDb() {
Class.forName("com.ibm.as400.access.AS400JDBCManagedConnectionPoolDataSource")
val ds = new InitialContext().lookup("java:/comp/env/jdbc/dsliftbook").asInstanceOf[DataSource]
// runChangeLog(ds)
SquerylRecord.initWithSquerylSession(Session.create(ds.getConnection, new DB2Adapter)
)
}
}
jetty-env.xml
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
<New id="dsliftbook" class="org.eclipse.jetty.plus.jndi.Resource">
<Arg></Arg>
<Arg>jdbc/dsliftbook</Arg>
<Arg>
<New class="com.ibm.as400.access.AS400JDBCManagedConnectionPoolDataSource">
<Set name="serverName">www.[server].com</Set>
<Set name="user">DBUSER</Set>
<Set name="password">DBUSER</Set>
</New>
</Arg>
</New>
</Configure>