Can't connect from Spark to S3 - AmazonS3Exception Status Code: 400 - scala

I am trying to connect from Spark (running on my PC) to my S3 bucket:
val spark = SparkSession
.builder
.appName("S3Client")
.config("spark.master", "local")
.getOrCreate()
val sc = spark.sparkContext;
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
val txtFile = sc.textFile("s3a://bucket-name/folder/file.txt")
val contents = txtFile.collect();
But I am getting the following exception:
Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400,
AWS Service: Amazon S3, AWS Request ID: 07A7BDC9135BCC84, AWS Error
Code: null, AWS Error Message: Bad Request, S3 Extended Request ID:
6ly2vhZ2mAJdQl5UZ/QUdilFFN1hKhRzirw6h441oosGz+PLIvLW2fXsZ9xmd8cuBrNHCdh8UPE=
I have seen this question but it didn't help me.
Edit:
As Zack suggested, I added:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
But I still get the same exception.

I've solved the problem.
I was targeting a region (Frankfurt) that required using version 4 of the signature.
I've changed the region of the S3 bucket to Ireland and now it's working.
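For anyone who needs to stay on a V4-only region such as Frankfurt, here is a minimal sketch of the alternative (not the fix I used above, and assuming the same hadoop-aws/AWS SDK setup as in the question): enable V4 signing and point s3a at the regional endpoint before the first S3 call. In local mode the system property takes effect on the driver, which is also the executor.
// Same effect as -Dcom.amazonaws.services.s3.enableV4=true, set before any S3 client is created.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
val spark = SparkSession
.builder
.appName("S3Client")
.config("spark.master", "local")
.getOrCreate()
val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
// A V4-only region also needs its regional endpoint, e.g. Frankfurt:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
val contents = sc.textFile("s3a://bucket-name/folder/file.txt").collect()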

According to the S3 documentation, some regions only support "Signature Version 4", so you need to add the configurations below:
--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
and
--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"

Alon,
try the configurations below:
val spark = SparkSession
.builder
.appName("S3Client")
.config("spark.master", "local")
.getOrCreate()
val sc = spark.sparkContext;
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
val txtFile = sc.textFile("s3a://s3a://bucket-name/folder/file.txt")
val contents = txtFile.collect();
I believe your issue was due to not specifying the endpoint in your configuration. Substitute us-east-1 with whichever region you use.

This works for me (this is everything; no other exports etc. are needed):
sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY)
sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET)
sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
to run:
spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' --packages org.apache.hadoop:hadoop-aws:2.7.1 spark_read_s3.py

Related

Issue running aws glue job locally

I'm trying to run a Glue job locally, but I'm facing a problem: when I run my script, an exception is raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o47.getDynamicFrame.
: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I downloaded aws-glue-libs from here: https://github.com/awslabs/aws-glue-libs/tree/glue-1.0/awsglue.
My code:
from pyspark.sql import SparkSession
from awsglue.context import GlueContext
spark = SparkSession \
.builder \
.appName("GlueSparkJobExample") \
.config("spark.jars", "AWSGlueETLPython-1.0.0-jar-with-dependencies.jar") \
.config("spark.local.dir", "/tmp") \
.getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
db = "database"
table = "table"
my_df = glueContext.create_dynamic_frame.from_catalog(
database=db, table_name=table)
If someone could help, that would be great.

Need a solution on connecting Teradata using Pyspark

I have the code below, which will be used to connect the Hadoop environment to Teradata.
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").options(url="jdbc:teradata://teradata-dns-sysa.fg.rbc.com",driver="com.teradata.jdbc.TeraDriver",dbtable="table",user="userid",password="xxxxxxxx").load()
Now the userid & password is different for different users. Hence looking out for a solution where credentials can be stored in a file in a secure location and the code simply refer to the data (userid & password) in the file
Here you can use property file where you can store required user id and password in file. You can refer properties file using argument parameter --properties-file file_name in command while running spark-submit command. Below is sample code for same.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Teradata Connect") \
.getOrCreate()
sc = spark.sparkContext
DB_DRIVER = sc._conf.get('spark.DB_DRIVER')
JDBC_URL = sc._conf.get('spark.JDBC_URL')
DB_USER = sc._conf.get('spark.DB_USER')
DB_PASS = sc._conf.get('spark.DB_PASS')
jdbcDF = (spark.read.format("jdbc").option("driver", DB_DRIVER)
.option("url", JDBC_URL)
.option("dbtable", "sql_query")
.option("user", DB_USER)
.option("password", DB_PASS)
.load())
jdbcDF.show(10)
Sample Properties file
spark.DB_DRIVER com.teradata.jdbc.TeraDriver
spark.JDBC_URL jdbc:teradata://teradata-dns-sysa.fg.rbc.com
spark.DB_USER userid
spark.DB_PASS password
Spark submit command
spark2-submit --master yarn \
--deploy-mode cluster \
--properties-file $CONF_FILE \
pyspark_script.py

Failed to find data source: com.mongodb.spark.sql.DefaultSource

I'm trying to connect spark (pyspark) to mongodb as follows:
conf = SparkConf()
conf.set('spark.mongodb.input.uri', default_mongo_uri)
conf.set('spark.mongodb.output.uri', default_mongo_uri)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession \
.builder \
.appName("my-app") \
.config("spark.mongodb.input.uri", default_mongo_uri) \
.config("spark.mongodb.output.uri", default_mongo_uri) \
.getOrCreate()
But when I do the following:
users = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("uri", '{uri}.{col}'.format(uri=mongo_uri, col='users')).load()
I get this error:
java.lang.ClassNotFoundException: Failed to find data source:
com.mongodb.spark.sql.DefaultSource
I did the same thing from pyspark shell and I was able to retrieve data. This is the command I ran:
pyspark --conf "spark.mongodb.input.uri=mongodb_uri" --conf "spark.mongodb.output.uri=mongodburi" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.2
But here we have the option to specify the package we need to use. What about standalone apps and scripts? How can I configure mongo-spark-connector there?
Any ideas?
Here is how I did it in a Jupyter notebook:
1. Download the jars from Maven Central or any other repository and put them in a directory called "jars":
mongo-spark-connector_2.11-2.4.0
mongo-java-driver-3.9.0
2. Create a session and write/read some data:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
working_directory = 'jars/*'
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection") \
.config("spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection") \
.config('spark.driver.extraClassPath', working_directory) \
.getOrCreate()
people = my_spark.createDataFrame([("JULIA", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 22)], ["name", "age"])
people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.select('*').where(col("name") == "JULIA").show()
As a result, you will see the row for "JULIA" in the DataFrame output.
If you are using SparkContext & SparkSession, you have mentioned the connector jar packages in SparkConf, check the following Code:
from pyspark import SparkContext,SparkConf
conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
If you are using only SparkSession, then use the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.2') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
If you're using the newest version of mongo-spark-connector, i.e. v10.0.1 at the time of writing this, you need to use a SparkConf object, as stated in the MongoDB documentation (https://www.mongodb.com/docs/spark-connector/current/configuration/).
Besides, you don't need to download anything manually; the connector will do it for you.
Below is the solution I came up with, for:
mongo-spark-connector: 10.0.1
mongo server : 5.0.8
spark : 3.2.0
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean
def init_spark():
    password = os.environ["MONGODB_PASSWORD"]
    user = os.environ["MONGODB_USER"]
    host = os.environ["MONGODB_HOST"]
    db_auth = os.environ["MONGODB_DB_AUTH"]
    mongo_conn = f"mongodb://{user}:{password}@{host}:27017/{db_auth}"
    conf = SparkConf()
    # Download mongo-spark-connector and its dependencies.
    # This will download all the necessary jars and put them in your $HOME/.ivy2/jars, no need to download them manually:
    conf.set("spark.jars.packages",
             "org.mongodb.spark:mongo-spark-connector:10.0.1")
    # Set up the read connection:
    conf.set("spark.mongodb.read.connection.uri", mongo_conn)
    conf.set("spark.mongodb.read.database", "<my-read-database>")
    conf.set("spark.mongodb.read.collection", "<my-read-collection>")
    # Set up the write connection:
    conf.set("spark.mongodb.write.connection.uri", mongo_conn)
    conf.set("spark.mongodb.write.database", "<my-write-database>")
    conf.set("spark.mongodb.write.collection", "<my-write-collection>")
    # If you need to update instead of inserting:
    conf.set("spark.mongodb.write.operationType", "update")
    SparkContext(conf=conf)
    return SparkSession \
        .builder \
        .appName('<my-app-name>') \
        .getOrCreate()
spark = init_spark()
df = spark.read.format("mongodb").load()
df_grouped = df.groupBy("<some-column>").agg(mean("<some-other-column>"))
df_grouped.write.format("mongodb").mode("append").save()
I was also facing the same error "java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource" while trying to connect to MongoDB from Spark (2.3).
I had to download and copy the mongo-spark-connector_2.11 JAR file(s) into the jars directory of the Spark installation.
That resolved my issue, and I was successfully able to call my Spark code via spark-submit.
Hope it helps.
Here is how this error got resolved, by downloading the jar files below. (I used the solution from this question.)
1. Downloaded the jar files below:
mongo-spark-connector_2.11-2.4.1 from here
mongo-java-driver-3.9.0 from here
2. Copied and pasted both of these jar files into the 'jars' directory of the Spark installation.
PySpark code in a Jupyter notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mongo").\
config("spark.mongodb.input.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
config("spark.mongodb.output.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
getOrCreate()
df=spark.read.format('com.mongodb.spark.sql.DefaultSource')\
.option( "uri", "mongodb://127.0.0.1:27017/$database.$table_name") \
.load()
df.printSchema()
#create Temp view of df to view the data
df.createOrReplaceTempView("df")
#to read table present in mongodb
query1 = spark.sql("SELECT * FROM df ")
query1.show(10)
You are not using sc to create the SparkSession. Maybe this code can help you:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf()
conf.set('spark.mongodb.input.uri', mongodb_input_uri)
conf.set('spark.mongodb.input.collection', 'collection_name')
conf.set('spark.mongodb.output.uri', mongodb_output_uri)
sc = SparkContext(conf=conf)
spark = SparkSession(sc) # Using the context (conf) to create the session

Not able to read data from AWS S3(orc) through Intellij local(spark/Scala)

We are reading the data/table from AWS (Hive) through Spark/Scala using IntelliJ (which is on a local machine). We can see the schema of the table, but we are not able to read the data.
Please find the flow below to get a better understanding:
Intellij(spark/scala)------> hive:9083(remote)------> s3(orc)
Note: Here IntelliJ is present locally, while Hive and S3 are on AWS.
Please find the code below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
//import org.apache.spark.sql.hive.HiveContext
object hiveconnect {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.config("hive.metastore.uris", "thrift://10.20.30.40:9083")
.master("local[*]")
.config("spark.sql.warehouse.dir", "s3://abc/test/main")
.config("spark.driver.allowMultipleContexts", "true")
.config("access-key","key")
.config("secret-key","key")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("show databases").show()
spark.sql("select *from ace.visit limit 5").show()
}
}
Error: Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
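For what it's worth, the error message itself points at what it expects: the credentials have to be passed under the Hadoop property names it lists (fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey), not under ad-hoc keys like "access-key". A minimal, untested sketch of the builder with those properties (keeping the question's placeholder values, and relying on Spark copying spark.hadoop.* entries into the Hadoop configuration):
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.config("hive.metastore.uris", "thrift://10.20.30.40:9083")
.master("local[*]")
.config("spark.sql.warehouse.dir", "s3://abc/test/main")
// spark.hadoop.* settings are forwarded to the Hadoop configuration,
// which is where the s3:// filesystem looks for these keys.
.config("spark.hadoop.fs.s3.awsAccessKeyId", "key")
.config("spark.hadoop.fs.s3.awsSecretAccessKey", "key")
.enableHiveSupport()
.getOrCreate()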

PySpark sqlContext read Postgres 9.6 NullPointerException

Trying to read a table with PySpark from a Postgres DB. I have set up the following code and verified SparkContext exists:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /tmp/jars/postgresql-42.0.0.jar --jars /tmp/jars/postgresql-42.0.0.jar pyspark-shell'
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('pyspark')
sc = SparkContext(conf=conf)
from pyspark.sql import SQLContext
properties = {
"driver": "org.postgresql.Driver"
}
url = 'jdbc:postgresql://tom:#localhost/gqp'
sqlContext = SQLContext(sc)
sqlContext.read \
.format("jdbc") \
.option("url", url) \
.option("driver", properties["driver"]) \
.option("dbtable", "specimen") \
.load()
I get the following error:
Py4JJavaError: An error occurred while calling o812.load. : java.lang.NullPointerException
The name of my database is gqp, the table is specimen, and I have verified it is running on localhost using the Postgres.app macOS app.
The URL was the problem!
Originally it was: url = 'jdbc:postgresql://tom:#localhost/gqp'
I removed the tom:# part, and it worked. The URL must follow the pattern: jdbc:postgresql://ip_address:port/db_name, whereas mine was directly copied from a Flask project.
If you're reading this, hope you didn't make this same mistake :)