AWS Unload Error: 'The bucket you are attempting to access must be addressed using the specified endpoint.' - postgresql

I am running the following query in SQL. I am trying to unload data from Redshift to a bucket in my personal S3 account:
UNLOAD ('SELECT * FROM table WHERE
UPPER(description) LIKE \'%something%\')
TO 's3://mybucketname/sometextname.txt' CREDENTIALS
'aws_access_key_id=xxx;aws_secret_access_key=xxx'
PARALLEL OFF
When I do this, I get the following error:
The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.,Status 301,Error PermanentRedirect,Rid AE9F82CD626A5B05,ExtRid 1hl5HHhv9rkaq0Vw7fB0kpm2WO1uOmy4MmXq
Is my s3 path correct? Do I need to change some permissions for my s3 account or bucket?

This feature is now supported. https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
unload ('select * from category')
to 's3://your-bucket/your-prefix'
iam_role 'arn:aws:iam::xxxxxxxx:role/redshift-role'
region 'us-west-2';

Related

ADF Copy data from Azure Data Bricks Delta Lake to Azure Sql Server

I'm trying to use the data copy activity to extract information from azure databricks delta lake, but I've noticed that it doesn't pass the information directly from the delta lake to the SQL server I need, but must pass it to an azure blob storage, when running it, it throws the following error
ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information I found a possible solution but it didn't work.
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass information from an azure databricks delta lake table to a table in Sql Server??
These are some images of the structure that I have in ADF:
In the image I get a message that tells me that I must have a Storage Account to continue
These are the configuration images, and execution failed:
Conf:
Fail:
Thank you very much
The solution for this problem was the following:
Correct the way the Storage Access Key configuration was being defined:
in the instruction: spark.hadoop.fs.azure.account.key..blob.core.windows.net
The following change must be made:
spark.hadoop.fs.azure.account.key.
storageaccountname.dfs.core.windows.net
Does anyone have any idea how the hell to pass information from an azure databricks delta lake table to a table in Sql Server??
To achieve Above scenario, follow below steps:
First go to your Databricks cluster Edit it and under Advance options >> spark >> spark config Add below code if you are using blob storage.
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that as you are using SQL Database as a sink.
Enable staging and give same blob storage account linked service as Staging account linked service give storage path from your blob storage.
And then debug it. make sure you complete Prerequisites from official document.
My sample Input:
Output in SQL:

AWS Athena Federated Query - GENERIC_USER_ERROR when running DB query for PostgreSQL

Hi all,
I am trying to execute queries on a postgresql database I created in AWS.
I added a data source to Athena, I created the data source for postgresql and I created the lambda function.
In Lambda function I set:
default connection string
spill_bucket and spill prefix (I set the same for both: 'athena-spill'. In the S3 page I cannot see any athena-spill bucket)
the security group --> I set the security group I created to access the db
the subnet --> I set one of the database subnet
I deployed the lambda function but I received an error and I had to add a new environment variable created with the connection string but named as 'dbname_connection_string'.
After adding this new env variable I am able to see the database in Athena but when I try to execute any query on this database as:
select * from tests_summary limit 10;
I receive this error after running query:
GENERIC_USER_ERROR: Encountered an exception[com.amazonaws.SdkClientException] from your LambdaFunction[arn:aws:lambda:eu-central-1:449809321626:function:data-production-athena-connector-nina-lambda] executed in context[retrieving meta-data] with message[Unable to execute HTTP request: Connect to s3.eu-central-1.amazonaws.com:443 [s3.eu-central-1.amazonaws.com/52.219.170.25] failed: connect timed out]
This query ran against the "public" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 3366bd80-143e-459c-a4da-5350b5ab4a77
What could be causing the problem?
Thanks a lot!
Root Cause:
VPC have no internet connection issue, causing Lambda can't access S3.
Solution:
Add VPC Gateway Endpoint (Select com.amazonaws.eu-central-1.s3) in Lambda associated VPC.

Cannot copy data from Snowflake into Azure Blob

I am trying to copy data from Snowflake into an Azure Blob using Azure Data Factory.
The role I am using has select permissions on the table, and I have no issues querying the data using the Snowflake console.
I am also able to copy into the targeted blob from other sources (in Azure) using the same SAS token.
This is the query I have, generated by Azure Data Factory, (with specifics omitted)
COPY INTO 'azure://****.blob.core.windows.net/snowflake-stage/********-****-****-****-************/SnowflakeExportCopyCommand/'
FROM (select * from MYSCHEMA.MYTABLE)
CREDENTIALS = (AZURE_SAS_TOKEN = '****')
FILE_FORMAT = (type = CSV COMPRESSION = GZIP RECORD_DELIMITER = '
' FIELD_OPTIONALLY_ENCLOSED_BY = '"' ESCAPE = '\\' NULL_IF = '')
HEADER = TRUE
SINGLE = FALSE
OVERWRITE = TRUE
MAX_FILE_SIZE = 268435456
And this is the error I am getting:
ErrorCode=UserErrorOdbcOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ERROR
[42501] Failed to access remote file: access denied.
Please check your credentials,Source=Microsoft.DataTransfer.Runtime.GenericOdbcConnectors,''Type=System.Data.Odbc.OdbcException,Message=ERROR
[42501] Failed to access remote file: access denied. Please check your
credentials,Source=Snowflake,'
Are there more Snowflake permissions that I need in order to do this kind of copy? Or is this perhaps an issue with the write-permissions to the Azure container?
The "solution" for this indicates a likely bug for the permissions on the blob itself.
Switching the container permissions to public, then back to private again fixes the issue.

External Table on DELTA format files in ADLS Gen 1

We have number of databricks DELTA tables created on ADLS Gen1. and also, there are external tables built on top each of those tables in one of the databricks workspace.
similarly, I am trying to create same sort of external tables on the same DELTA format files,but in different workspace.
I do have read only access via Service principle on ADLS Gen1. So I can read DELTA files through spark data-frames, as in given below:
read_data_df = spark.read.format("delta").load('dbfs:/mnt/data/<foldername>')
I can even able to create hive external tables, but I do see following warning while reading data from the same table:
Error in SQL statement: AnalysisException: Incompatible format detected.
A transaction log for Databricks Delta was found at `dbfs:/mnt/data/<foldername>/_delta_log`,
but you are trying to read from `dbfs:/mnt/data/<foldername>` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
If I create external table 'using DELTA', then I see a different access error as in:
Caused by: org.apache.hadoop.security.AccessControlException:
OPEN failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
Does it mean that I would need full access, rather just READ ONLY?, on those underneath file system?
Thanks
Resolved after upgrading to Databricks Runtime environment to runtime version DBR-7.3.

EMR Spark Fails to Save Dataframe to S3

I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing which I do is read a table from my Redshift cluster into a Spark Dataframe using the com.databricks.spark.redshift format and using the same IAM Role to unload the data from redshift as I did for the EMR JobFlowRole.
So far as I understand, this runs an UNLOAD command on Redshift to dump into the S3 bucket I specify. Spark then loads the newly unloaded data into a Dataframe. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
I then run some transformations and attempt to save the dataframe in the csv format to the same S3 bucket Redshift Unloaded into.
However, when I try to do this, it throws the following error
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended hadoop configuration parameters. I then used DefaultAWSCredentialsProviderChain to load the AWSAccessKeyID and AWSSecretKey and set via
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed setting the hadoop configurations and hardcoded an IAM user's credentials in the s3 url via s3n://ACCESS_KEY:SECRET_KEY#BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket.. which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
EC2 Instance Generated Temporary Credentials on EMR Bootstrap Phase
When I queried Redshift, I passed the aws_iam_role to theDatabricks driver. The driver then re-generated temporary credentials for that same IAM role. This invalidated the credentials the EC2 instance generated.
I then tried to upload to S3 using the old credentials (and the credentials which were stored in the instance's metadata)
It failed because it was trying to use out-of-date credentials.
The solution was to remove redshift authorization via aws_iam_role and replace it with the following:
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
On amazon EMR, try usong the prefix s3:// to refer to an object in S3.
It's a long story.