I am trying to write a Parquet file to an S3 bucket from a Spark/Scala program, but I am not sure of the best way to do it. After a bit of research, I found the option below.
Add the dependencies below to the build:
"com.amazonaws" % "aws-java-sdk" % "1.11.690",
"org.apache.hadoop" % "hadoop-aws" % "2.6.0"
Then set the required configuration in the code:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access key")
// Replace Key with your AWS secret key (You can find this on IAM
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secrete key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "end point", "us-east-1")
But I am getting the error below:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID...
I am inclined to go with the above option, as it would give me the flexibility to write the file directly to S3:
jdbcWriteDF.write.format("parquet")
.mode(SaveMode.Overwrite)
.save("s3a://cs-dev/cmc_site_qa.parquet")
With the AmazonS3 client option, I would need to generate the file locally first and then upload it to the S3 bucket.
I know the keys are correct, because when I used them in one of my Java processes they worked fine:
AmazonS3 s3client = AmazonS3ClientBuilder.standard()
.withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(s3BucketAccessKey, s3BucketSecretKey)))
.withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(s3BucketEndpoint, "us-east-1"))
.build();
return s3client;
I only have access to signed HTTPS URLs for CSV files (a separate URL for each file), e.g.:
https://<container_name>.blob.core.windows.net/<folder_name>/<file_name>.csv?sig=****st=****&se=****&sv=****&sp=r&sr=b
Below is the code I am using:
from pyspark.sql import SparkSession

for blob_url in paths:
    spark = SparkSession.builder.appName(f"test").getOrCreate()
    storage_account_name = '***'
    container_name = '***'
    url = blob_url.split("?")[0]
    access_key = '?' + blob_url.split("?")[1]  # tried without '?' also
    conf_path = "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net"
    spark.conf.set(conf_path, access_key)
    blob_path = ("wasbs://" + container_name + "@" + storage_account_name
                 + ".blob.core.windows.net/" + url.split(".net/")[1])
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.show()
The first file read is successful, but the subsequent reads fail. Even if I change the order of the files, only the first one succeeds.
I have tried stopping the Spark session every time in the loop.
I have tried giving the Spark session a different name every time.
Nothing seems to work.
The same code works in Databricks but does not work in Dataproc.
I want to read the files in sequence and persist them somewhere, but I am not able to do so.
Error: py4j.protocol.Py4JJavaError: An error occurred while calling o68.csv.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
I try to execute the following command:
mssparkutils.fs.ls("abfss://mycontainer@myadfs.dfs.core.windows.net/myfolder/")
I get the error:
Py4JJavaError: An error occurred while calling z:mssparkutils.fs.ls.
: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403, GET, https://myadfs.dfs.core.windows.net/mycontainer?upn=false&resource=filesystem&maxResults=5000&directory=myfolder&timeout=90&recursive=false, AuthorizationFailure, "This request is not authorized to perform this operation.
I followed the steps described in this link, granting my user and my Synapse workspace the "Storage Blob Data Contributor" role at the container (file system) level.
Even so, I still get this persistent error. Am I missing other steps?
I got the same kind of error in my environment. I just followed this official document, reproduced the setup, and now it's working fine for me. You can follow the code below; it should solve your problem.
Sample code:
from pyspark.sql import SparkSession

account_name = 'your_storage_account_name'
container_name = 'your_container_name'
relative_path = 'your_folder_path'
linked_service_name = 'your_linked_service_name'

sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)

# Access Blob Storage
path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (container_name, account_name, relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (container_name, account_name), sas_token)
print('Remote blob path: ' + path)
Updated answer
Reference for configuring Spark in a PySpark notebook:
https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/notebook-this-request-is-not-authorized-to-perform-this/ba-p/1712566
I'm trying to create a bucket in IBM Cloud Object Storage using Python. I have followed the instructions in the API docs.
This is the code I'm using:
import ibm_boto3
from ibm_botocore.client import Config

COS_ENDPOINT = "https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints"

# Create client
cos = ibm_boto3.client("s3",
                       ibm_api_key_id=COS_API_KEY_ID,
                       ibm_service_instance_id=COS_INSTANCE_CRN,
                       config=Config(signature_version="oauth"),
                       endpoint_url=COS_ENDPOINT)

s3 = ibm_boto3.resource('s3')

def create_bucket(bucket_name):
    print("Creating new bucket: {0}".format(bucket_name))
    s3.Bucket(bucket_name).create()
    return

bucket_name = 'test_bucket_442332'
create_bucket(bucket_name)
I'm getting this error. I tried setting CreateBucketConfiguration={"LocationConstraint": "us-south"}, but it doesn't seem to work:
"ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to."
Resolved by going to https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-endpoints#endpoints and choosing the endpoint specific to the region I need. The "Endpoint" provided with the credentials is not the actual regional endpoint.
I have a bunch of code that is used to read/write data from/to an S3 bucket.
After duplicating this code for a new AWS region (originally deployed in eu-west-1, the new deployment is in ap-east-1), I get the following error:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: SOMEID, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: SOMEREQUESTID
After investigation, the error is thrown from S3AFileSystem.java at the following line:
if (!s3.doesBucketExist(bucket)) {
Here, a HEAD operation is performed on the bucket.
After investigation, it appears that the endpoint is always https://s3.amazonaws.com and the HEAD request is sent to https://bucket-name.s3.amazonaws.com, when it should go to https://bucket-name.s3.ap-east-1.amazonaws.com (also, fs.s3a.path.style.access is false).
Specifying the region with AWS_DEFAULT_REGION environment variable does not modify the behavior.
Lastly, here are my dependencies as defined in build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
Needless to say, any time I point at my bucket with the eu-west-1 configuration, the code works.
What should I do (in my code or in my bucket configuration) to allow my code to load the files?
For the record, the code is:
getSparkSession().read.format("csv").option("delimiter", ";").option("header","false").load(path)
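For what it's worth, a hedged sketch of the configuration that usually resolves this: ap-east-1 is one of the newer regions that accepts only Signature Version 4 and its region-specific endpoint, so fs.s3a.endpoint has to be set explicitly, and the older AWS SDK bundled with hadoop-aws 2.7.x additionally needs V4 signing switched on through a JVM system property. The spark-submit flags below are one common way to pass that property; adjust them to your deployment.

// In the application, before the first access to the bucket:
getSparkSession().sparkContext.hadoopConfiguration
  .set("fs.s3a.endpoint", "s3.ap-east-1.amazonaws.com")

// On spark-submit, enable SigV4 in both the driver and executor JVMs:
//   --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
//   --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true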
I've created a model:
val model = pipeline.fit(commentLower)
and I'm attempting to write it to S3:
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "MYACCESSKEY")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
model.write.overwrite().save("s3n://sparkstore/model")
but I'm getting this error:
Name: java.lang.IllegalArgumentException
Message: Wrong FS: s3n://sparkstore/model, expected: file:///
StackTrace: org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80)
I also tried with my access key inline:
model.write.overwrite().save("s3n://MYACCESSKEY:MYSECRETKEY#/sparkstore/model")
How do I write a model (or any file for that matter) to s3 from Spark?
I don't have an S3 connection to test with.
But here is what I think you should use:
val hconf=sc.hadoopConfiguration
hconf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hconf.set("fs.s3.awsAccessKeyId", "MYACCESSKEY")
hconf.set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
When I do df.write.save("s3://sparkstore/model"), I get:
Name: org.apache.hadoop.fs.s3.S3Exception
Message: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/model' - ResponseCode=403, ResponseMessage=Forbidden
StackTrace: org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:229)
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:111)
which makes me believe that it did recognize the s3 protocol for the S3 filesystem,
but it failed authentication, which is to be expected.
Hopefully it fixes your issue.
Thanks,
Charles.
This isn't exactly what I wanted to do, but I found a similar thread with a similar problem:
How to save models from ML Pipeline to S3 or HDFS?
This is what I ended up doing:
sc.parallelize(Seq(model), 1).saveAsObjectFile("swift://RossL.keystone/model")
val modelx = sc.objectFile[PipelineModel]("swift://RossL.keystone/model").first()
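For completeness, once the hadoop-aws connector and its matching AWS SDK jars are on the classpath, MLWriter can usually save the model straight to S3 over s3a without this object-file detour. A hedged sketch, reusing the bucket name and key placeholders from the question:

// Configure s3a before the first S3 access (keys shown as placeholders)
sc.hadoopConfiguration.set("fs.s3a.access.key", "MYACCESSKEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "MYSECRETKEY")

// Save the fitted pipeline directly to the bucket
model.write.overwrite().save("s3a://sparkstore/model")

// Load it back later
val restored = org.apache.spark.ml.PipelineModel.load("s3a://sparkstore/model")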