Operation failed: "This request is not authorized to perform this operation." in Synapse with a Pyspark Notebook - pyspark

I try to execute the following command line:
mssparkutils.fs.ls("abfss://mycontainer@myadfs.dfs.core.windows.net/myfolder/")
I get the error:
Py4JJavaError: An error occurred while calling z:mssparkutils.fs.ls.
: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403, GET, https://myadfs.dfs.core.windows.net/mycontainer?upn=false&resource=filesystem&maxResults=5000&directory=myfolder&timeout=90&recursive=false, AuthorizationFailure, "This request is not authorized to perform this operation.
I followed the steps described in this link,
granting both myself and my Synapse workspace the "Storage Blob Data Contributor" role at the container (file system) level.
Even so, I still get this error. Am I missing other steps?

I got the same kind of error in my environment. I followed this official document, reproduced the scenario, and it now works fine for me. The code below should solve your problem.
Sample code:
from pyspark.sql import SparkSession
account_name = 'your_blob_name'  # storage account name
container_name = 'your_container_name'
relative_path = 'your_folder path'
linked_service_name = 'Your_linked_service_name'
# Get the SAS token from the linked service
sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)
# Access to Blob Storage
path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (container_name, account_name, relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (container_name, account_name), sas_token)
print('Remote blob path: ' + path)
Updated answer
Reference to configure Spark in pyspark notebook:
https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/notebook-this-request-is-not-authorized-to-perform-this/ba-p/1712566
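The path and the configuration key in the sample code follow a fixed pattern, which can be sketched as plain string helpers. This is a minimal sketch and not part of the original answer: the helper names `wasbs_path` and `sas_conf_key` are hypothetical, and the `spark.conf.set` call in the final comment assumes a Synapse notebook where `spark` and `sas_token` already exist.

```python
def wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Build a wasbs:// URI; container and account are joined with '@'."""
    return "wasbs://%s@%s.blob.core.windows.net/%s" % (
        container, account, relative_path.lstrip("/"))


def sas_conf_key(container: str, account: str) -> str:
    """Configuration key that holds the SAS token for one container."""
    return "fs.azure.sas.%s.%s.blob.core.windows.net" % (container, account)


if __name__ == "__main__":
    # Placeholder values, mirroring the sample code.
    print(wasbs_path("your_container_name", "your_blob_name", "your_folder"))
    print(sas_conf_key("your_container_name", "your_blob_name"))
    # In a notebook (assumption):
    # spark.conf.set(sas_conf_key(container, account), sas_token)
```

Keeping the two strings in helpers makes it harder to mistype the `@` separator or to set the token under a key that does not match the path being read.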

Related

Not able to read multiple files from azure blob with https signed URL from dataproc pyspark

I only have access to signed HTTPS URLs for CSV files (a separate URL for each file),
for example:
https://<container_name>.blob.core.windows.net/<folder_name>/<file_name>.csv?sig=****st=****&se=****&sv=****&sp=r&sr=b
Below is the code I am using:
for blob_url in paths:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(f"test").getOrCreate()
    storage_account_name = '***'
    container_name = '***'
    url = blob_url.split("?")[0]
    access_key = '?'+blob_url.split("?")[1] # tried without '?' also
    conf_path = "fs.azure.sas."+container_name+"."+storage_account_name+".blob.core.windows.net"
    spark.conf.set(conf_path, access_key)
    blob_path = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net/"+url.split(".net/")[1]
    df = spark.read.csv(blob_path, header=False, inferSchema=True)
    df.show()
The first file is read successfully, but subsequent reads fail. Even if I change the order of the files, only the first one succeeds.
I have tried stopping the Spark session on every iteration of the loop.
I have tried giving the Spark session a different name each time.
Nothing seems to work.
The same code works in Databricks but does not work in Dataproc.
I want to read the files in sequence and persist them somewhere, but I am not able to do so.
Error: py4j.protocol.Py4JJavaError: An error occurred while calling o68.csv.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
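The URL handling in the loop above can be sketched with the standard library. This is only an illustration of the parsing step, not a confirmed fix for the Dataproc failure; the function name `split_signed_blob_url` and the example URL are made up. It assumes the usual blob URL layout, in which the first host label is the storage account and the first path segment is the container.

```python
from urllib.parse import urlsplit


def split_signed_blob_url(blob_url: str) -> dict:
    """Split a signed blob URL into account, container, blob path and SAS token."""
    parts = urlsplit(blob_url)
    # Host is expected to look like <account>.blob.core.windows.net
    account = parts.netloc.split(".")[0]
    # Path is expected to look like /<container>/<blob path>
    container, _, blob_path = parts.path.lstrip("/").partition("/")
    return {
        "account": account,
        "container": container,
        "blob_path": blob_path,
        # Query string without the leading '?'
        "sas_token": parts.query,
        "conf_key": "fs.azure.sas.%s.%s.blob.core.windows.net" % (container, account),
        "wasbs_path": "wasbs://%s@%s.blob.core.windows.net/%s" % (container, account, blob_path),
    }


if __name__ == "__main__":
    # Hypothetical URL with a shortened, fake signature.
    example = ("https://myaccount.blob.core.windows.net/mycontainer/"
               "data/file1.csv?sv=2020&sr=b&sp=r&sig=abc")
    for key, value in split_signed_blob_url(example).items():
        print(key, "=", value)
```

Two files in the same container produce the same `conf_key`, so each `spark.conf.set` call in the loop replaces the previously stored token.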

Getting an error while using copy activity (polybase) in adf to copy parquet files in ADLS gen2 to Azure synapse table

My source is a set of Parquet files in ADLS Gen2. All of them are part files of 10-14 MB each, and the total size should be around 80 GB.
The sink is an Azure Synapse table.
The copy method is PolyBase. The activity fails within 5 seconds of starting, with the error below:
ErrorCode=PolybaseOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse. Operation: 'Create external table'.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, URL',Source=.Net SqlClient Data Provider,SqlErrorNumber=105019,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=105019,State=1,Message=External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD,
I've seen this error caused by failed authentication. Check whether the authorization header and/or signature is wrong.
For example, create the database scoped credential using your ADLS Gen2 storage account access key:
CREATE DATABASE SCOPED CREDENTIAL [MyADLSGen2Cred] WITH
IDENTITY='user',
SECRET='zge . . . 8V/rw=='
The external data source is created as follows:
CREATE EXTERNAL DATA SOURCE [MyADLSGen2] WITH (
TYPE=HADOOP,
LOCATION='abfs://myblob@pabechevb.dfs.core.windows.net',
CREDENTIAL=[MyADLSGen2Cred])
You can specify wasb instead of abfs, and if you're using SSL, specify it as abfss. Then the external table is created as follows:
CREATE EXTERNAL TABLE [dbo].[ADLSGen2] (
[Content] varchar(128))
WITH (
LOCATION='/',
DATA_SOURCE=[MyADLSGen2],
FILE_FORMAT=[TextFileFormat])
You can find additional information in my book "Hands-On Data Virtualization with Polybase".

How to create a bucket using the python SDK?

I'm trying to create a bucket in cloud object storage using python. I have followed the instructions in the API docs.
This is the code I'm using
COS_ENDPOINT = "https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints"
# Create client
cos = ibm_boto3.client("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)
s3 = ibm_boto3.resource('s3')
def create_bucket(bucket_name):
    print("Creating new bucket: {0}".format(bucket_name))
    s3.Bucket(bucket_name).create()
    return
bucket_name = 'test_bucket_442332'
create_bucket(bucket_name)
I'm getting the error below. I tried setting CreateBucketConfiguration={"LocationConstraint":"us-south"}, but it doesn't seem to work:
"ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to."
Resolved by going to https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-endpoints#endpoints
and choosing the endpoint specific to the region I need. The "Endpoint" provided with the credentials is not the actual endpoint.
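The resolution can be sketched as follows. This is a hedged sketch, not the poster's exact code: the helper names are hypothetical, the endpoint pattern `s3.<region>.cloud-object-storage.appdomain.cloud` and the `<region>-<storage class>` form of the location constraint are assumptions to check against the endpoints page linked above, and `ibm_boto3` is imported lazily so the string helpers can be tried without the SDK installed.

```python
def regional_endpoint(region: str) -> str:
    """Public regional service endpoint (pattern is an assumption; verify on the endpoints page)."""
    return "https://s3.%s.cloud-object-storage.appdomain.cloud" % region


def location_constraint(region: str, storage_class: str = "standard") -> str:
    """LocationConstraint value in the assumed '<region>-<storage class>' form."""
    return "%s-%s" % (region, storage_class)


def create_bucket(bucket_name: str, api_key: str, instance_crn: str,
                  region: str = "us-south"):
    """Create a bucket through the regional endpoint instead of the control endpoint."""
    # Lazy imports: only needed when a bucket is actually created.
    import ibm_boto3
    from ibm_botocore.client import Config

    cos = ibm_boto3.client(
        "s3",
        ibm_api_key_id=api_key,
        ibm_service_instance_id=instance_crn,
        config=Config(signature_version="oauth"),
        endpoint_url=regional_endpoint(region),
    )
    return cos.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": location_constraint(region)},
    )
```

The key difference from the question's code is that `endpoint_url` points at a region-specific service endpoint and the same client is used to create the bucket.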

Google Cloud authorization keeps failing with Python 3 - Type is None, expected one of ('authorized_user', 'service_account')

I am trying to download a file for the first time from Google Cloud Storage.
I set the path to the googstruct.json service account key file that I downloaded from https://cloud.google.com/storage/docs/reference/libraries#client-libraries-usage-python
Do I need to set up the authorization to Google Cloud outside the code somehow? Or is there a better "How to use Google Cloud Storage" guide than the one on the Google site?
It seems like I am passing the wrong type to storage_client = storage.Client().
The exception string is below.
Exception has occurred: google.auth.exceptions.DefaultCredentialsError
The file C:\Users\Cary\Documents\Programming\Python\QGIS\GoogleCloud\googstruct.json does not have a valid type.
Type is None, expected one of ('authorized_user', 'service_account').
MY PYTHON 3.7 CODE
from google.cloud import storage
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\GoogleCloud\\googstruct.json"
# Instantiates a client
storage_client = storage.Client()
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name
        )
    )
bucket_name = 'structure_ssi'
destination_file_name = "C:\\Users\\18809_PIPEM.shp"
source_blob_name = '18809_PIPEM.shp'
# Call the function only after it has been defined
download_blob(bucket_name, source_blob_name, destination_file_name)
I did look at this but I cannot tell if this is my issue. I have tried both.
('Unexpected credentials type', None, 'Expected', 'service_account') with oauth2client (Python)
This error means that the JSON service account credentials file you are trying to use, C:\\GoogleCloud\\googstruct.json, is corrupt or of the wrong type.
The first (or second) line in the file googstruct.json should be "type": "service_account".
A few other suggestions to improve your code:
You do not need to use \\; just use / to make your code easier and cleaner to read.
Load your credentials directly and do not modify environment variables:
storage_client = storage.Client.from_service_account_json('C:/GoogleCloud/googstruct.json')
Wrap API calls in try / except. Stack traces do not impress customers. It is better to have clear, simple, easy to read error messages.
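The check on the "type" field and the try / except advice can be combined in a small pre-flight helper. This is a minimal sketch under stated assumptions: `check_key_file` and `make_storage_client` are hypothetical names, and the Google client library is imported lazily so the validation part runs with the standard library alone.

```python
import json


def check_key_file(path, allowed=("authorized_user", "service_account")):
    """Return the credential type stored in a key file, or raise ValueError."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            info = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        raise ValueError("Cannot read key file %s: %s" % (path, exc)) from exc
    cred_type = info.get("type") if isinstance(info, dict) else None
    if cred_type not in allowed:
        raise ValueError(
            "%s has type %r, expected one of %r" % (path, cred_type, allowed))
    return cred_type


def make_storage_client(path):
    """Validate the key file first, then build the client from it."""
    # Lazy import: only needed when a client is actually created.
    from google.cloud import storage

    check_key_file(path, allowed=("service_account",))
    return storage.Client.from_service_account_json(path)
```

Calling `make_storage_client('C:/GoogleCloud/googstruct.json')` inside a try / except ValueError block would then report a short message instead of a stack trace when the file is not a service account key.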

IBM - COS - SDK IAM token

I am trying to access my COS service using Python. Referring to IBM's documentation, I was able to write the following code snippet:
import ibm_boto3
from ibm_botocore.client import Config
api_key = 'key'
service_instance_id = 'resource-service-id'
auth_endpoint = 'http://iam.bluemix.net/'
service_endpoint = 'endpoint'
s3 = ibm_boto3.resource('s3',
    ibm_api_key_id=api_key,
    ibm_service_instance_id=service_instance_id,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version='oauth'),
    endpoint_url=service_endpoint)
s3.Bucket('bucket name').download_file('object name','location where the object must be saved')
Is this correct? Also, when I execute the above code, the SDK is not able to retrieve the authentication token from auth_endpoint. Am I missing something?
Please help.
Thanks in advance!
I am including the output for your reference:
ibm_botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from https://iam.ng.bluemix.net/oidc/token: Retrieval of tokens from server failed
I am using Python 3.x.
As instructed in the README, the auth_endpoint should have /oidc/token at the end, for example, 'http://iam.bluemix.net/oidc/token'.
auth_endpoint = 'https://iam.bluemix.net/oidc/token'
The auth_endpoint should be https
See the example here
https://github.com/IBM/ibm-cos-sdk-python
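The two corrections above (the /oidc/token suffix and the https scheme) can be applied mechanically. The following is a small sketch with a hypothetical helper name, `normalize_auth_endpoint`; it assumes the input already includes a scheme such as http:// or https://.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_auth_endpoint(url: str) -> str:
    """Force https and make sure the path ends with /oidc/token."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith("/oidc/token"):
        path += "/oidc/token"
    # Query string and fragment are dropped on purpose.
    return urlunsplit(("https", parts.netloc, path, "", ""))


if __name__ == "__main__":
    # The value used in the question.
    print(normalize_auth_endpoint("http://iam.bluemix.net/"))
```

The result can be passed as `ibm_auth_endpoint` when creating the `ibm_boto3` resource.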
To connect to an IBM Cloud Object Storage account, we need api_key, service_instance_id, auth_endpoint and service_endpoint.
import ibm_boto3
from ibm_botocore.client import Config
from ibm_botocore.exceptions import ClientError
api_key = '......' # you can find api_key in the service credentials in your IBM Cloud account
service_instance_id = '.....' # you can find service_instance_id in the service credentials in your IBM Cloud account
auth_endpoint = 'https://iam.bluemix.net/oidc/token'
service_endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
cos = ibm_boto3.resource('s3',
    ibm_api_key_id=api_key,
    ibm_service_instance_id=service_instance_id,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version='oauth'),
    endpoint_url=service_endpoint)
# To create a bucket
new_bucket = 'abcd1234'
def create_bucket():
    cos.create_bucket(Bucket=new_bucket)
    return "Bucket created successfully"
create_bucket()
# To list the buckets in the cloud
def get_buckets():
    print("Retrieving list of buckets")
    try:
        buckets = cos.buckets.all()
        for bucket in buckets:
            print("Bucket Name: {0}".format(bucket.name))
    except ClientError as be:
        print("CLIENT ERROR: {0}\n".format(be))
    except Exception as e:
        print("Unable to retrieve list buckets: {0}".format(e))
get_buckets()