I am trying to write a Parquet file to an S3 bucket from a Spark/Scala program, but I am not sure what the best way to do it is. After a bit of research, I found the option below:
Add the following dependencies to the build:
"com.amazonaws" % "aws-java-sdk" % "1.11.690",
"org.apache.hadoop" % "hadoop-aws" % "2.6.0"
Then set the required configuration in the code:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access key")
// Replace the value with your AWS secret key (you can find it in IAM)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "end point", "us-east-1")
But I am getting the error below:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID...
I am inclined to go with the option above because it would let me write the file directly to S3.
With the AWS client option, I would need to generate the file locally first and then upload it to the S3 bucket.
I know the keys are correct because they worked fine when I used them in one of my Java processes:
AmazonS3 s3client = AmazonS3ClientBuilder.standard()
    .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(s3BucketAccessKey, s3BucketSecretKey)))
    .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(s3BucketEndpoint, "us-east-1"))
    .build(); // the builder chain must end with build() to yield an AmazonS3 client
return s3client;
I only have access to signed HTTPS URLs for the CSV files (a separate URL for each file).
Below is the code I am using:
for blob_url in paths:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(f"test").getOrCreate()
    storage_account_name = '***'
    container_name = '***'
    url = blob_url.split("?")[0]
    access_key = '?'+blob_url.split("?")[1] # tried without '?' also
    conf_path = "fs.azure.sas."+container_name+"."+storage_account_name+".blob.core.windows.net"
    spark.conf.set(conf_path, access_key)
    # WASB URIs take the form wasbs://<container>@<account>.blob.core.windows.net/<path>
    blob_path = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net/"+url.split(".net/")[1]
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
The first file read is successful; every subsequent read fails. Even if I change the order of the files, only the first one succeeds.
I have tried stopping the Spark session on every iteration of the loop.
I have tried giving the Spark session a different name on every iteration.
Nothing seems to work.
The same code works in Databricks but does not work in Dataproc.
I want to read the files in sequence and persist them somewhere, but I am not able to do so.
Error: py4j.protocol.Py4JJavaError: An error occurred while calling o68.csv.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
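For reference, the string handling in the loop above can be sketched as one small helper. This is only an illustration in plain Python, not a confirmed fix: the function name `split_signed_blob_url` and the example URL are hypothetical. It does make one thing visible: the `fs.azure.sas.<container>.<account>.blob.core.windows.net` key depends only on the container and the account, so every file in the same container writes its own token to the same key. Whether an already-created (cached) file system object picks up the new token is worth checking on Dataproc; Hadoop's `fs.<scheme>.impl.disable.cache` setting is one thing that may be relevant there.

```python
from urllib.parse import urlsplit


def split_signed_blob_url(blob_url):
    """Split a signed HTTPS blob URL into the pieces the WASB driver needs.

    Hypothetical helper: assumes URLs of the form
    https://<account>.blob.core.windows.net/<container>/<blob path>?<sas token>
    """
    parts = urlsplit(blob_url)
    account = parts.netloc.split(".")[0]
    container, _, blob_name = parts.path.lstrip("/").partition("/")
    host = f"{account}.blob.core.windows.net"
    return {
        # Same key for every blob in the same container and account.
        "conf_key": f"fs.azure.sas.{container}.{host}",
        # Query string without the leading '?'.
        "sas_token": parts.query,
        "wasbs_path": f"wasbs://{container}@{host}/{blob_name}",
    }


# Made-up URL, for illustration only.
example = split_signed_blob_url(
    "https://myaccount.blob.core.windows.net/mycontainer/data/file1.csv?sv=2020-08-04&sig=abc"
)
```

Inside the loop this would be used as `spark.conf.set(example["conf_key"], example["sas_token"])` followed by `spark.read.csv(example["wasbs_path"], ...)`.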
I try to execute the following command line:
I get the error:
Py4JJavaError: An error occurred while calling z:mssparkutils.fs.ls.
: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403, GET, https://myadfs.dfs.core.windows.net/mycontainer?upn=false&resource=filesystem&maxResults=5000&directory=myfolder&timeout=90&recursive=false, AuthorizationFailure, "This request is not authorized to perform this operation.
I followed the steps described in this link
and granted both myself and my Synapse workspace the "Storage Blob Data Contributor" role at the container (file system) level:
Even so, I still get this error. Am I missing other steps?
I got the same kind of error in my environment. I followed the official document and reproduced the setup, and it now works fine for me. You can follow the code below; it should solve your problem.
Sample code:
from pyspark.sql import SparkSession
account_name = 'your_blob_name'
container_name = 'your_container_name'
relative_path = 'your_folder path'
linked_service_name = 'Your_linked_service_name'
sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)
# Access to Blob Storage
path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (container_name,account_name,relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (container_name,account_name),sas_token)
print('Remote blob path: ' + path)
Sample output:
Updated answer
Reference to configure Spark in pyspark notebook:
I'm trying to create a bucket in cloud object storage using python. I have followed the instructions in the API docs.
This is the code I'm using
COS_ENDPOINT = "https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints"
# Create client
cos = ibm_boto3.client("s3",
s3 = ibm_boto3.resource('s3')
def create_bucket(bucket_name):
    print("Creating new bucket: {0}".format(bucket_name))
bucket_name = 'test_bucket_442332'
I'm getting the error below. I tried setting CreateBucketConfiguration={"LocationConstraint":"us-south"}, but it doesn't seem to work:
"ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to."
Resolved by going to https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-endpoints#endpoints
and choosing the endpoint specific to the region I need. The "Endpoint" provided with the credentials is not the actual service endpoint.
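A minimal sketch of that fix, with stated assumptions: the helper names are hypothetical, the host pattern `s3.<region>.cloud-object-storage.appdomain.cloud` is the public regional endpoint form as I understand it from the IBM endpoints page linked above, and the `<region>-<storage class>` form of `LocationConstraint` (for example `us-south-standard`) is likewise an assumption to verify against that documentation.

```python
def cos_public_endpoint(region):
    """Build the public regional endpoint URL (assumed host pattern)."""
    return f"https://s3.{region}.cloud-object-storage.appdomain.cloud"


def create_bucket_kwargs(bucket_name, region, storage_class="standard"):
    """Keyword arguments for create_bucket.

    Assumption: LocationConstraint combines region and storage class,
    e.g. "us-south-standard".
    """
    return {
        "Bucket": bucket_name,
        "CreateBucketConfiguration": {
            "LocationConstraint": f"{region}-{storage_class}"
        },
    }


COS_ENDPOINT = cos_public_endpoint("us-south")
kwargs = create_bucket_kwargs("test-bucket-442332", "us-south")

# With a configured client this would be passed along roughly as:
#   cos = ibm_boto3.client("s3", ..., endpoint_url=COS_ENDPOINT)
#   cos.create_bucket(**kwargs)
```

The point is that the client is given the regional service endpoint rather than the `.../v2/endpoints` URL, which only lists the available endpoints.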
I have a bunch of code that is used to read/write data from/to a S3 bucket.
After deploying this code to a new AWS Region (it was originally deployed in eu-west-1; the new deployment is in ap-east-1), I get the following error:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: SOMEID, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: SOMEREQUESTID
After investigation the error is thrown from S3AFileSystem.java at the following line
if (!s3.doesBucketExist(bucket)) {
Here, a HEAD operation is performed on the bucket.
It appears that the endpoint is always https://s3.amazonaws.com and the HEAD request is sent to https://bucket-name.s3.amazonaws.com, when it should go to https://bucket-name.s3.ap-east-1.amazonaws.com (also, fs.s3a.path.style.access is false).
Specifying the region with the AWS_DEFAULT_REGION environment variable does not change the behavior.
One last piece of information: here are my dependencies as defined in build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
Needless to say, whenever I point the code at my bucket in the eu-west-1 configuration, it works.
What should I change (in my code or in my bucket configuration) to allow my code to load the files?
For the record, the code is
getSparkSession().read.format("csv").option("delimiter", ";").option("header","false").load(path)
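One thing that might be worth trying, sketched here under assumptions: `fs.s3a.endpoint` is the S3A property that overrides the default `s3.amazonaws.com` endpoint, and it can be passed to Spark with the `spark.hadoop.` prefix. The helper names below are hypothetical, and the thread does not confirm that this alone is enough with hadoop-aws 2.7.3.

```python
def s3a_region_options(region):
    """Spark options pointing S3A at a regional endpoint.

    Hypothetical helper; keys prefixed with "spark.hadoop." are copied by
    Spark into the Hadoop configuration.
    """
    return {"spark.hadoop.fs.s3a.endpoint": f"s3.{region}.amazonaws.com"}


def virtual_hosted_url(bucket, region):
    """URL that the HEAD request is expected to target, per the question."""
    return f"https://{bucket}.s3.{region}.amazonaws.com"


options = s3a_region_options("ap-east-1")
expected_url = virtual_hosted_url("bucket-name", "ap-east-1")

# In a PySpark session builder these would be applied as:
#   for key, value in options.items():
#       builder = builder.config(key, value)
```

Comparing `expected_url` with the host that appears in the request logs is a quick way to see whether the setting has taken effect.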
I've created a model:
val model = pipeline.fit(commentLower)
and I'm attempting to write it to s3:
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "MYACCESSKEY")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
but I'm getting this error:
Name: java.lang.IllegalArgumentException
Message: Wrong FS: s3n://sparkstore/model, expected: file:///
StackTrace: org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
I also tried with my access key inline:
How do I write a model (or any file for that matter) to s3 from Spark?
I don't have S3 connection to test.
But here is what I think you should use:
val hconf=sc.hadoopConfiguration
hconf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hconf.set("fs.s3.awsAccessKeyId", "MYACCESSKEY")
hconf.set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
When I do df.write.save("s3://sparkstore/model")
I get Name: org.apache.hadoop.fs.s3.S3Exception
Message: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/model' - ResponseCode=403, ResponseMessage=Forbidden
StackTrace: org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:229)
which makes me believe that it did recognize the s3 protocol for the S3 file system.
It failed authentication, which is expected since I used placeholder keys.
Hopefully it fixes your issue.
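Note that the question sets `fs.s3.*` properties while the error message mentions an `s3n://` path. Each Hadoop S3 connector reads its own credential properties, so the keys should match the scheme of the path being written. A small sketch of that lookup in plain Python (the helper name `credential_keys_for` is hypothetical; the property names are the ones Hadoop uses for each scheme):

```python
from urllib.parse import urlsplit

# Credential property names per URI scheme.
_CREDENTIAL_KEYS = {
    "s3": ("fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"),
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
    "s3a": ("fs.s3a.access.key", "fs.s3a.secret.key"),
}


def credential_keys_for(path):
    """Return (access key property, secret key property) for a path's scheme."""
    scheme = urlsplit(path).scheme
    if scheme not in _CREDENTIAL_KEYS:
        raise ValueError(f"no S3 credential properties known for scheme {scheme!r}")
    return _CREDENTIAL_KEYS[scheme]


access_prop, secret_prop = credential_keys_for("s3n://sparkstore/model")
```

This does not by itself explain the "Wrong FS ... expected: file:///" message; it only rules out one mismatch between the configured keys and the path.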
This isn't exactly what I wanted to do, but I found a similar thread with a similar problem:
How to save models from ML Pipeline to S3 or HDFS?
This is what I ended up doing:
sc.parallelize(Seq(model), 1).saveAsObjectFile("swift://RossL.keystone/model")
val modelx = sc.objectFile[PipelineModel]("swift://RossL.keystone/model").first()