EMR Spark Fails to Save Dataframe to S3 - scala

I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM role that has the AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess policies attached. The first policy allows all S3 actions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing I do is read a table from my Redshift cluster into a Spark DataFrame using the com.databricks.spark.redshift format, passing the same IAM role to unload the data from Redshift as I used for the EMR JobFlowRole.
As far as I understand, this runs an UNLOAD command on Redshift that dumps the data into the S3 bucket I specify. Spark then loads the newly unloaded data into a DataFrame. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
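For context, the read looks roughly like this (a sketch; the JDBC URL, table, bucket, and role ARN below are placeholders, not my real values):
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster:5439/mydb?user=username&password=password")
  .option("dbtable", "my_schema.my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/MyEmrJobFlowRole")
  .load()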
I then run some transformations and attempt to save the DataFrame in CSV format to the same S3 bucket Redshift unloaded into.
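The write itself is nothing special; roughly (a sketch, with a placeholder output path):
df.write
  .format("csv")
  .mode("overwrite")
  .save("s3n://my-bucket/output/")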
However, when I try to do this, it throws the following error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I used DefaultAWSCredentialsProviderChain to load the AWS access key ID and secret key, and set them via:
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed the Hadoop configuration settings and hardcoded an IAM user's credentials in the S3 URL via s3n://ACCESS_KEY:SECRET_KEY@BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket, which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.

The problem was the following:
The EC2 instances generated temporary credentials during the EMR bootstrap phase.
When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role, which invalidated the credentials the EC2 instance had generated.
I then tried to upload to S3 using the old credentials (the ones stored in the instance's metadata).
It failed because it was trying to use out-of-date credentials.
The solution was to remove Redshift authorization via aws_iam_role and replace it with the following:
import com.amazonaws.util.EC2MetadataUtils

// The returned map of instance-profile credentials is keyed by the IAM role name
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)

On Amazon EMR, try using the s3:// prefix to refer to objects in S3.
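For example, a minimal sketch with a placeholder bucket:
df.write.format("csv").save("s3://my-bucket/output/")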
It's a long story.

Related

ADF Copy data from Azure Databricks Delta Lake to Azure SQL Server

I'm trying to use the Copy Data activity to extract information from an Azure Databricks Delta Lake, but I've noticed that it doesn't pass the information directly from the Delta Lake to the SQL Server I need; it has to pass through an Azure Blob Storage account first. When running it, it throws the following error:
ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information, I found a possible solution, but it didn't work:
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass information from an azure databricks delta lake table to a table in Sql Server??
These are some images of the structure that I have in ADF:
In the image, I get a message telling me that I must have a Storage Account to continue.
These are the configuration (Conf) and failed execution (Fail) screenshots:
Thank you very much
The solution for this problem was the following:
Correct the way the Storage Access Key configuration was being defined. The setting
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net
must be changed to:
spark.hadoop.fs.azure.account.key.<storageaccountname>.dfs.core.windows.net
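For reference, the same account key can also be set from a notebook session; a sketch, with the storage account name and access key as placeholders:
spark.conf.set(
  "fs.azure.account.key.<storageaccountname>.dfs.core.windows.net",
  "<storage-account-access-key>")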
Does anyone have any idea how the hell to pass information from an azure databricks delta lake table to a table in Sql Server??
To achieve the above scenario, follow the steps below:
First, go to your Databricks cluster, edit it, and under Advanced options >> Spark >> Spark config add the code below if you are using Blob storage.
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that, since you are using SQL Database as a sink:
Enable staging, set the same Blob storage account's linked service as the staging account linked service, and give a storage path from your Blob storage.
Then debug it. Make sure you complete the prerequisites from the official documentation.
My sample Input:
Output in SQL:

Parquet streaming of Azure Blob Storage into Databricks with Unity Catalog

Unity Catalog has recently been set up in my Databricks account, and I am trying to stream from an Azure container containing Parquet files to a service catalog, using a notebook that ran fine before.
However, I now get the following
py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.streaming.DataStreamReader.format(java.lang.String) is not whitelisted on class class org.apache.spark.sql.streaming.DataStreamReader
when trying to run the following spark command from my Notebook:
df = (spark
.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.useNotifications", "false") # useNotifications determines if we efficiently scan the new files or if we set up a subscription to listen to new file events
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns") # schemaEvolutionMode determines what happens when the schema changes
.option("cloudFiles.schemaLocation", schemaPath)
.load(dataPath)
)
where schemaPath and dataPath contain the paths to the Parquet schema and data files.
The closest related error I have found is the following pre-Unity Catalog error, suggesting that I should disable table access control on my clusters:
https://kb.databricks.com/en_US/streaming/readstream-is-not-whitelisted
All table access controls are disabled in my Admin Console.
Are there some other settings that should be set to ensure white-listing from Azure files now that Unity Catalog is set up?
------ Edit -----
Using a Single User cluster on Databricks runtime version 11.3 beta, I get the following error instead:
com.databricks.sql.cloudfiles.errors.CloudFilesIOException: Failed to write to the schema log at location
followed by the path to the schema location in my Azure storage. I also get this error message when spawning new job clusters from Azure Data Factory.

How to set Hadoop fs.s3a.acl.default on AWS EMR?

I have a map-reduce application running on AWS EMR that writes some output to a different (AWS account) S3 bucket. I have the permissions set up and the job can write to the external bucket, but the owner is still root from the account where the Hadoop job is running. I would like to change this to the external account that owns the bucket.
I found that I can set fs.s3a.acl.default to bucket-owner-full-control; however, that doesn't seem to be working. This is what I am doing:
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
FileSystem fileSystem = FileSystem.get(URI.create(s3Path), conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(filePath));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write(contentAsString);
writer.close();
fsDataOutputStream.close();
Any help is appreciated.
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
is the right property you are setting.
As this the property in core-site.xml to give full control to bucket owner.
<property>
<name>fs.s3a.acl.default</name>
<description>Set a canned ACL for newly created and copied objects. Value may be private,
public-read, public-read-write, authenticated-read, log-delivery-write,
bucket-owner-read, or bucket-owner-full-control.</description>
</property>
BucketOwnerFullControl
Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
I also recommend setting fs.s3.canned.acl to the value BucketOwnerFullControl.
For debugging, you can use the snippet below to see which configuration parameters are actually being passed:
import java.util.Map.Entry;

// Print every key/value pair in the Hadoop configuration
for (Entry<String, String> entry : conf) {
    System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
For testing purposes, run this command from the command line:
aws s3 cp s3://bucket/source/dummyfile.txt s3://bucket/target/dummyfile.txt --sse --acl bucket-owner-full-control
If this works from the command line, it will also work through the API.
Bonus points with Spark, useful for Spark Scala users:
For Spark to access the S3 filesystem, set the proper configurations as in the example below:
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.fast.upload","true")
hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version","2")
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoopConf.set("fs.s3a.canned.acl","BucketOwnerFullControl")
hadoopConf.set("fs.s3a.acl.default","BucketOwnerFullControl")
If you are using EMR then you have to use the AWS team's S3 connector, with "s3://" URLs, and use their documented configuration options. They don't support the Apache one, so any option starting with "fs.s3a" isn't going to have any effect whatsoever.
As mentioned in the answer by Stevel, for EMR with PySpark use this:
sc=spark.sparkContext
hadoop_conf=sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.canned.acl","BucketOwnerFullControl")
Canned ACL: BucketOwnerFullControl
Description: Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-s3-acls.html
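For Scala users, the equivalent of the PySpark snippet above would be roughly:
spark.sparkContext.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")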

Connection to an S3 instance using a service-connector

I'm trying to create a service-connector to my s3 instance like this:
cf service-connector 13001 mybucketname.ds31s3.swisscom.com:443
But I get the following error:
Server-Error 403: Check of security groups failed (no access)
I have created my service key according to this documentation.
Connecting to my MongoDB works perfectly using a service connector.
You can access Swisscom's S3 directly without the service connector.
The error message suggests that your current org and space do not have access to the S3 service. This is usually the case if there is no app binding for that service in the current space. Please check whether you created your service key in the right org and space.
There was a misconfiguration due to security changes. We fixed the issue, so connecting to s3 with the service-connector should now work.

AWS Unload Error: 'The bucket you are attempting to access must be addressed using the specified endpoint.'

I am running the following query in SQL. I am trying to unload data from Redshift to a bucket in my personal S3 account:
UNLOAD ('SELECT * FROM table WHERE
UPPER(description) LIKE \'%something%\'')
TO 's3://mybucketname/sometextname.txt' CREDENTIALS
'aws_access_key_id=xxx;aws_secret_access_key=xxx'
PARALLEL OFF
When I do this, I get the following error:
The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid AE9F82CD626A5B05,ExtRid 1hl5HHhv9rkaq0Vw7fB0kpm2WO1uOmy4MmXq
Is my s3 path correct? Do I need to change some permissions for my s3 account or bucket?
Unloading to a bucket in a different AWS Region is now supported via the REGION parameter: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
unload ('select * from category')
to 's3://your-bucket/your-prefix'
iam_role 'arn:aws:iam::xxxxxxxx:role/redshift-role'
region 'us-west-2';