(Unauthorized) not authorized error when setting up for Data Federation to move data to S3 - mongodb

I am setting up Data Federation and want to move data from my cluster to S3, following the guide How to Automate Continuous Data Copying from MongoDB to S3. I have added my cluster and my S3 bucket as data sources in Data Federation. Next, I created a linked data source with the service name "kaison-data-federation" for Data Federation. Then, in my trigger, I wrote the pipeline.
However, I am hitting the error "(Unauthorized) not authorized", with no further detail about the cause. I have referred to other posts such as Unauth Error on Moving Data from DataLake to S3, but it is still not working. I have been stuck here for 3 days. Can anyone give me some insight into this? Thank you.

Related

ADLS Gen2 operation failed for: An error occurred while sending the request. User error 2011

Hi, I get the above error when accessing a storage container folder while trying to get the metadata of a folder and its files. It can't access the folders for some reason. I checked the linked service and the storage container: public access is enabled and a private endpoint is also set.
Please let me know what else is missing.
I tried to reproduce the error and got a similar one.
The cause of the error was that I was trying to access an ADLS Gen2 account that is not available or does not exist.
After providing the correct information, I was able to connect to ADLS Gen2 successfully.

ADF Copy data from Azure Data Bricks Delta Lake to Azure Sql Server

I'm trying to use the Copy Data activity to extract data from an Azure Databricks Delta Lake table. I've noticed that it doesn't pass the data directly from the Delta Lake to the SQL Server I need; it must first stage it in Azure Blob Storage. When I run it, it throws the following error:
ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information, I found a possible solution, but it didn't work:
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass information from an Azure Databricks Delta Lake table to a table in SQL Server?
These are some images of the structure that I have in ADF:
In the image I get a message that tells me that I must have a Storage Account to continue
These are the configuration images, and execution failed:
Conf:
Fail:
Thank you very much
The solution to this problem was the following:
Correct the way the storage access key configuration was defined.
Instead of the setting: spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net
the following must be used:
spark.hadoop.fs.azure.account.key.<storageaccountname>.dfs.core.windows.net
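The change described above only swaps the endpoint suffix in the property name. As a minimal sketch (the helper names `account_key_property` and `spark_config_line` are hypothetical, as are the account name and key used with them), the property could be built like this so that a missing account name or a wrong endpoint is caught before it reaches the cluster configuration:

```python
# Endpoint suffixes that appear in this thread: Blob Storage and ADLS Gen2 (dfs).
ENDPOINT_SUFFIXES = {
    "blob": "blob.core.windows.net",
    "dfs": "dfs.core.windows.net",
}


def account_key_property(storage_account: str, endpoint: str = "dfs") -> str:
    """Build the Spark property name that holds the storage access key."""
    if not storage_account:
        # An empty name would produce "...account.key..dfs..." (double dot).
        raise ValueError("storage_account must not be empty")
    if endpoint not in ENDPOINT_SUFFIXES:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    suffix = ENDPOINT_SUFFIXES[endpoint]
    return f"spark.hadoop.fs.azure.account.key.{storage_account}.{suffix}"


def spark_config_line(storage_account: str, access_key: str, endpoint: str = "dfs") -> str:
    """Return one 'name value' line in the form used in the cluster's Spark config box."""
    return f"{account_key_property(storage_account, endpoint)} {access_key}"
```

Calling `spark_config_line("mystorage", "<Accesskey>")` gives a line that can be pasted into the Spark config field; passing `endpoint="blob"` gives the Blob Storage variant instead.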
Does anyone have any idea how the hell to pass information from an Azure Databricks Delta Lake table to a table in SQL Server?
To achieve the above scenario, follow the steps below:
First, go to your Databricks cluster, edit it, and under Advanced options >> Spark >> Spark config add the settings below if you are using Blob Storage.
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that, since you are using SQL Database as the sink, enable staging.
Use the same Blob Storage account linked service as the staging account linked service, and give a storage path from your Blob Storage.
Then debug it. Make sure you complete the prerequisites from the official documentation.
My sample Input:
Output in SQL:

Azure Data Factory CICD error: The document creation or update failed because of invalid reference

All, when running a build pipeline in Azure DevOps with an ARM template, the process consistently fails when trying to deploy a dataset, or a reference to a dataset, with this error:
ARM Template deployment: Resource Group scope (AzureResourceManagerTemplateDeployment)
BadRequest: The document creation or update failed because of invalid reference 'dataset_1'.
I've tried renaming the dataset and also recreating it to see if that would help.
I then deleted the dataset_1.json file from the repo and still get the same message, so I think the problem is some reference to this dataset and not the dataset itself. I've looked through all the other files for references to it, but they all look fine.
Any ideas on how to troubleshoot this?
thanks
Try this:
It looks like you have created the 'myTestLinkedService' linked service and tested the connection but haven't published it yet, and you are trying to reference that linked service in the new dataset that you are creating with PowerShell.
To reference any Data Factory entity from PowerShell, make sure the entity is published first. Try publishing the linked service from the portal, and then run your PowerShell script to create the new dataset/activity.
I think I found the issue. When I went into the detailed logs, I found that in addition to this error there was an error message about an invalid SQL connection string, so I thought it might be related, since the dataset in question uses an Azure SQL Database linked service.
I adjusted the connection string and this seems to have solved the issue.
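For the troubleshooting step of looking through the other files for references to 'dataset_1', a small script can do the search instead of reading each file by eye. This is only a sketch: the function name `find_mentions` is hypothetical, and it simply reports every string value in a parsed JSON document that contains the given name, together with its path. It matches substrings, so 'dataset_1' would also match 'dataset_10'.

```python
import json


def find_mentions(node, name, path="$"):
    """Return (json_path, value) pairs for every string value containing `name`."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_mentions(value, name, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_mentions(item, name, f"{path}[{index}]"))
    elif isinstance(node, str) and name in node:
        hits.append((path, node))
    return hits


def find_mentions_in_file(file_path, name):
    """Load one JSON file (for example an ARM template) and search it."""
    with open(file_path, encoding="utf-8") as handle:
        return find_mentions(json.load(handle), name)
```

Running `find_mentions_in_file` over the ARM template and over each pipeline, dataset, and linked service file in the repo shows where the name still appears, for example in a `referenceName` field or a `dependsOn` entry.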

Aspera Node API /files/{id}/files endpoint not returning up to date data

I am working on a webapp for transferring files with Aspera. We are using AoC for the transfer server and an S3 bucket for storage.
When I upload a file to my S3 bucket using Aspera Connect, everything appears to be successful: I see it in the bucket, and I see the new file in the directory when I run /files/browse on the parent folder.
I am refactoring my code to use the /files/{id}/files endpoint to list the directory because the documentation says it is faster compared to the /files/browse endpoint. After the upload is complete, when I run the /files/{id}/files GET request, the new file does not show up in the returned data right away. It only becomes available after a few minutes.
Is there some caching mechanism in place? I can't find anything about this in the documentation. When I make a transfer in the AoC dashboard everything updates right away.
Thanks,
Tim
Yes, the file-id based system uses an in-memory cache (Redis).
This cache is updated when a new file is uploaded using Aspera. For files moved directly on the storage, there is a daemon that periodically scans and finds new files.
If you want to bypass the cache, and have the API read the storage, you can add this header in the request:
X-Aspera-Cache-Control: no-cache
Another possibility is to trigger a scan by reading /files/{id} for the folder id.
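As a sketch of the cache-bypass header in a request, here is how the listing call could be prepared with Python's standard library. The function name `build_listing_request`, the base URL, and the way the Authorization value is supplied are assumptions for illustration; only the `/files/{id}/files` path and the `X-Aspera-Cache-Control: no-cache` header come from the answer above.

```python
from urllib.request import Request

CACHE_HEADER = "X-Aspera-Cache-Control"


def build_listing_request(base_url, folder_id, auth_header, bypass_cache=True):
    """Prepare a GET request for /files/{id}/files, optionally bypassing the cache."""
    url = f"{base_url.rstrip('/')}/files/{folder_id}/files"
    headers = {
        "Authorization": auth_header,  # assumed: caller supplies a ready-made value
        "Accept": "application/json",
    }
    if bypass_cache:
        # Ask the API to read the storage instead of the cached listing.
        headers[CACHE_HEADER] = "no-cache"
    return Request(url, headers=headers, method="GET")
```

The returned object can be passed to `urllib.request.urlopen` to send it. Bypassing the cache on every call would presumably give up the speed advantage that motivated the switch from /files/browse, so it may be worth sending the header only right after an upload.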

EMR Spark Fails to Save Dataframe to S3

I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing which I do is read a table from my Redshift cluster into a Spark Dataframe using the com.databricks.spark.redshift format and using the same IAM Role to unload the data from redshift as I did for the EMR JobFlowRole.
As far as I understand, this runs an UNLOAD command on Redshift to dump the data into the S3 bucket I specify. Spark then loads the newly unloaded data into a Dataframe. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
I then run some transformations and attempt to save the Dataframe in CSV format to the same S3 bucket that Redshift unloaded into.
However, when I try to do this, it throws the following error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I used DefaultAWSCredentialsProviderChain to load the AWSAccessKeyID and AWSSecretKey and set them via:
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed the Hadoop configuration settings and hardcoded an IAM user's credentials in the S3 URL via s3n://ACCESS_KEY:SECRET_KEY@BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket, which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
The EC2 instance generated temporary credentials during the EMR bootstrap phase.
When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role. This invalidated the credentials the EC2 instance had generated.
I then tried to upload to S3 using the old credentials (the ones stored in the instance's metadata).
It failed because it was trying to use out-of-date credentials.
The solution was to remove Redshift authorization via aws_iam_role and replace it with the following:
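import com.amazonaws.util.EC2MetadataUtils

// Reuse the temporary credentials the instance already holds; IAM_ROLE is the role name used as the map key
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)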
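For a PySpark job, the same three options can be collected in one place before they are handed to the reader. This is a minimal sketch, not the answer's own code: the helper name `temporary_credential_options` is hypothetical, and how the three values are obtained from the instance metadata is left to the caller. Only the option names are taken from the snippet above.

```python
def temporary_credential_options(access_key_id, secret_access_key, session_token):
    """Map one set of temporary credentials to the three reader options."""
    values = {
        "temporary_aws_access_key_id": access_key_id,
        "temporary_aws_secret_access_key": secret_access_key,
        "temporary_aws_session_token": session_token,
    }
    missing = [name for name, value in values.items() if not value]
    if missing:
        # All three must come from the same credential set, so fail early.
        raise ValueError(f"missing credential values: {', '.join(missing)}")
    return values
```

Assuming a reader configured with the com.databricks.spark.redshift format, the result could be applied in one call with `reader.options(**temporary_credential_options(key, secret, token))` in place of the aws_iam_role option.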
On Amazon EMR, try using the prefix s3:// to refer to an object in S3.
It's a long story.