Getting an error while using Copy activity (PolyBase) in ADF to copy Parquet files in ADLS Gen2 to an Azure Synapse table - azure-data-factory

My source is Parquet files in ADLS Gen2. All the Parquet files are part files of 10-14 MB each; the total size is around 80 GB.
The sink is an Azure Synapse table.
The copy method is PolyBase. I get the error below within about 5 seconds of execution:
ErrorCode=PolybaseOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse. Operation: 'Create external table'.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, URL',Source=.Net SqlClient Data Provider,SqlErrorNumber=105019,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=105019,State=1,Message=External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD,

I've seen this error caused by failed authentication; check whether the authorization header and/or signature is wrong.
For example, create the scoped credential using your ADLS Gen2 storage account access key:
CREATE DATABASE SCOPED CREDENTIAL [MyADLSGen2Cred] WITH
    IDENTITY = 'user',
    SECRET = 'zge . . . 8V/rw==';
The external data source is created as follows:
CREATE EXTERNAL DATA SOURCE [MyADLSGen2] WITH (
    TYPE = HADOOP,
    LOCATION = 'abfs://myblob@pabechevb.dfs.core.windows.net',
    CREDENTIAL = [MyADLSGen2Cred]);
You can specify wasb instead of abfs, and if you're using SSL, specify it as abfss. The FILE_FORMAT referenced below must already exist (created with CREATE EXTERNAL FILE FORMAT; for a Parquet source like yours, use FORMAT_TYPE = PARQUET). Then the external table is created as follows:
CREATE EXTERNAL TABLE [dbo].[ADLSGen2] (
    [Content] varchar(128))
WITH (
    LOCATION = '/',
    DATA_SOURCE = [MyADLSGen2],
    FILE_FORMAT = [TextFileFormat]);
You can find additional information in my book "Hands-On Data Virtualization with Polybase".
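Since the failure in the question happens on PolyBase's directory-existence check (HdfsBridge_IsDirExist) with a 403, it can also help to confirm outside of Synapse that the key or identity you plan to put in the credential can actually list the source directory. Below is a minimal sketch using the azure-storage-file-datalake SDK; the account, container, and folder names are hypothetical placeholders.

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical values - replace with your own account, key, container and folder.
account_name = "mystorageaccount"
account_key = "<storage account access key>"
container = "myblob"
folder = "parquet/source"

service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=account_key,
)
file_system = service.get_file_system_client(container)

# If this listing also fails with a 403, the problem is storage-side authorization
# (key, RBAC role, or firewall), not the PolyBase objects themselves.
for item in file_system.get_paths(path=folder, recursive=False):
    print(item.name)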

Related

Get Blob API call from Azure Data Factory

I asked the same question on the MS Q&A site too.
In ADF, I tried to call Get Blob: https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob
I got this error message: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature."
I'd like to read an image or unstructured file and insert it into a varchar(max) column in SQL Server (source: binary to sink: binary in SQL Server).
My pipeline is configured as below.
linked service:
base url: https://{account name}.blob.core.windows.net/
authentication type: anonymous
server certificate: disabled
type: Rest
dataset
type: Rest
relative url: {container name}/xyz.jpeg
copy data activity
request method: get
x-ms-date: #concat(formatDateTime(utcNow(), 'yyyy-MM-ddTHH:mm:ss'), 'Z')
x-ms-version: 2018-11-09
x-ms-blob-type: BlockBlob
Authorization: SharedKey {storage name}:CBntp....{SAS key}....LsIHw%3D
(I took the key from a SAS connection string: ...https&sig=CBntp{SAS key}LsIHw%3D)
Is it possible to call the Azure Blob REST API in ADF pipelines?
Unfortunately, this is not possible, because when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to a Binary dataset.
(Screenshots: source dataset properties when the source is Binary, and the corresponding sink dataset properties.)
Reference - https://learn.microsoft.com/en-us/azure/data-factory/format-binary#copy-activity-properties
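As a side note on the 403 in the question itself: a SharedKey Authorization header has to be an HMAC-SHA256 signature computed over the canonicalized request, so pasting the sig value from a SAS connection string into that header will always fail authentication. If you just want to test the Get Blob call outside of ADF, the simpler route is to pass the SAS token as the query string. A minimal sketch, with hypothetical account, container, and token values:

import requests

# Hypothetical values - the SAS token is the query string generated under the
# storage account's "Shared access signature" blade (it starts with "sv=").
account = "mystorageaccount"
container = "mycontainer"
blob_name = "xyz.jpeg"
sas_token = "sv=...&ss=b&srt=o&sp=r&sig=..."

# Get Blob, authorized by the SAS query string instead of an Authorization header.
url = f"https://{account}.blob.core.windows.net/{container}/{blob_name}?{sas_token}"
response = requests.get(url)
response.raise_for_status()

image_bytes = response.content  # raw bytes of the blob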

Operation failed: "This request is not authorized to perform this operation." in Synapse with a Pyspark Notebook

I try to execute the following command:
mssparkutils.fs.ls("abfss://mycontainer@myadfs.dfs.core.windows.net/myfolder/")
I get the error:
Py4JJavaError: An error occurred while calling z:mssparkutils.fs.ls.
: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403, GET, https://myadfs.dfs.core.windows.net/mycontainer?upn=false&resource=filesystem&maxResults=5000&directory=myfolder&timeout=90&recursive=false, AuthorizationFailure, "This request is not authorized to perform this operation.
I followed the steps described in this link by granting myself and my Synapse workspace the "Storage Blob Data Contributor" role at the container (file system) level.
Even so, I still get this persistent error. Am I missing other steps?
I got the same kind of error in my environment. I just followed this official document, did the repro, and now it's working fine for me. You can follow the code below; it should solve your problem.
Sample code:
from pyspark.sql import SparkSession

account_name = 'your_blob_name'
container_name = 'your_container_name'
relative_path = 'your_folder_path'
linked_service_name = 'Your_linked_service_name'

sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)

# Access to Blob Storage
path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (container_name, account_name, relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (container_name, account_name), sas_token)
print('Remote blob path: ' + path)
Sample output: (screenshot of the printed remote blob path omitted)
Updated answer
Reference for configuring Spark in a PySpark notebook:
https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/notebook-this-request-is-not-authorized-to-perform-this/ba-p/1712566
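As a usage follow-up to the snippet above: once the SAS configuration is set on the Spark session, DataFrame reads against the wasbs path should no longer hit the 403. A minimal sketch, assuming the folder holds Parquet files (adjust the reader otherwise):

# Minimal usage sketch - switch to spark.read.csv(path, header=True)
# or another reader if your data is not Parquet.
df = spark.read.parquet(path)
df.show(5)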

SAM deployment failed - Waiter StackCreateComplete failed: Waiter encountered a terminal failure state

When I try to deploy the package with SAM, the very first status that appears in the CloudFormation console is ROLLBACK_IN_PROGRESS, and after that it changes to ROLLBACK_COMPLETE.
I have tried deleting the stack and deploying again, but the same issue occurs every time.
The error in the terminal looks like this:
Sourcing local options from ./SAMToolkit.devenv
SAM_PARAM_PKG environment variable not set
SAMToolkit will operate in legacy mode.
Please set SAM_PARAM_PKG in your .devenv file to run modern packaging.
Run 'sam help package' for more information
Runtime: java
Attempting to assume role from AWS Identity Broker using account 634668058279
Assumed role from AWS Identity Broker successfully.
Deploying stack sam-dev* from template: /home/***/1.0/runtime/sam/template.yml
sam-additional-artifacts-url.txt was not found, which is fine if there is no additional artifacts uploaded
Replacing BATS::SAM placeholders in template...
Uploading template build/private/tmp/sam-toolkit.yml to s3://***/sam-toolkit.yml
make_bucket failed: s3://sam-dev* An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
upload: build/private/tmp/sam-toolkit.yml to s3://sam-dev*/sam-toolkit.yml
An error occurred (ValidationError) when calling the DescribeStacks operation: Stack with id sam-dev* does not exist
sam-dev* will be created.
Creating ChangeSet ChangeSet-2020-01-20T12-25-56Z
Deploying stack sam-dev*. Follow in console: https://aws-identity-broker.amazon.com/federation/634668058279/CloudFormation
ChangeSet ChangeSet-2020-01-20T12-25-56Z in sam-dev* succeeded
"StackStatus": "REVIEW_IN_PROGRESS",
sam-dev* reached REVIEW_IN_PROGRESS
Deploying stack sam-dev*. Follow in console: https://console.aws.amazon.com/cloudformation/home?region=us-west-2
Waiting for stack-create-complete
Waiter StackCreateComplete failed: Waiter encountered a terminal failure state
Command failed.
Please see the logs above.
I had set SQS as the event source for the Lambda function, but didn't provide permissions like this
- Effect: Allow
  Action:
    - sqs:ReceiveMessage
    - sqs:DeleteMessage
    - sqs:GetQueueAttributes
  Resource: "*"
in the Lambda policies.
I found this error in the "Events" tab of the CloudFormation service.
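If digging through the console is awkward, the same failure reason can be pulled programmatically. A minimal sketch with boto3; the stack name is a hypothetical placeholder:

import boto3

cloudformation = boto3.client("cloudformation")

# List the stack's events and print the ones that carry a failure reason.
# Replace "sam-dev-mystack" with the actual stack name.
events = cloudformation.describe_stack_events(StackName="sam-dev-mystack")["StackEvents"]
for event in events:
    if "FAILED" in event["ResourceStatus"]:
        print(event["LogicalResourceId"], event.get("ResourceStatusReason"))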

Azure DevOps deployment

Failed to deploy web package to IIS website.
Error: An error was encountered when processing operation 'Create File' on D:\Websites\project\project.pdb.
Error: The error code was 0x800704C8.
Error: The requested operation cannot be performed on a file with a user-mapped section open.
I have tried adding this to the pipeline variables:
MSDEPLOY_RENAME_LOCKED_FILES = 1

ACL verification fails for Data Factory Service Principal, although it has rwx permissions

I have a U-SQL script that executes successfully from VS Code with my personal credentials, but fails when triggered from a Data Factory pipeline. My personal account has Owner rights on the Azure subscription. ADF uses Service Principal authentication with Data Lake Analytics & Store.
I am using Data Factory V2 and Data Lake Gen1 with the Default Integration Runtime. ADLA Firewall is disabled.
The U-SQL script is very simple: it just reads data from a CSV file and tries to write it to another CSV file. This is the whole script:
@companies =
    EXTRACT
        Id string,
        Name string
    FROM @InputFile
    USING Extractors.Csv(skipFirstNRows: 1);

OUTPUT @companies
TO @OutputFile
USING Outputters.Csv(outputHeader: true);
The parameters InputFile and OutputFile contain the ADL paths to the input and output data. These parameters are passed from Data Factory. The first stage of the script ("Extract") executes successfully, and the graph shows that the error occurs in the "PodAggregate" stage. A similar error occurs if I try to write the output to a managed table instead of a CSV file.
The high level error message in Data Factory is:
Error Id: VertexFailedFast, Error Message: Vertex failed with a fail-fast error.
Data Lake Analytics gives the more detailed error:
E_STORE_USER_ERROR: A user error has been reported when reading or writing data.
Component: STORE
Description: Operation 'Open::Wait' returned error code -2096559454 'Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.' for stream 'adl://[myadl].azuredatalakestore.net/adla/tmp/8a1495dc-8d80-44b9-a724-f2a0a963b3c8/stack_test/Companies.csv---6F21F973-45B9-46C7-805F-192672C99393-9_0_1.dtf%23N'.
Details:
Stream name 'adl://[myadl].azuredatalakestore.net/adla/tmp/8a1495dc-8d80-44b9-a724-f2a0a963b3c8/stack_test/Companies.csv---6F21F973-45B9-46C7-805F-192672C99393-9_0_1.dtf%23N'
Thu Jul 19 02:35:42 2018: Store user error, Operation:[Open::Wait], ErrorEither the resource does not exist or the current user is not authorized to perform the requested operation
7ffd8c4195b7 ScopeEngine!?ToStringInternal#KeySampleCollection#SSLibV3#ScopeEngine##AEAA?AV?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##XZ + 11b7
7ffd8c39a96d ScopeEngine!??0ExceptionWithStack#ScopeEngine##QEAA#W4ErrorNumber#1#AEBV?$initializer_list#VScopeErrorArg#ScopeCommon###std##_N#Z + 13d
7ffd8c3abe3e ScopeEngine!??0DeviceException#ScopeEngine##QEAA#AEAVBlockDevice#1#AEBV?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##J#Z + 1de
7ffd8c3f8c7b ScopeEngine!?GetTotalIoWaitTime#Statistics#Scanner#ScopeEngine##QEAA_JXZ + 133b
7ffd8c3f87dc ScopeEngine!?GetTotalIoWaitTime#Statistics#Scanner#ScopeEngine##QEAA_JXZ + e9c
7ffd9157780d ScopeCodeGenEngine!ScopeEngine::CosmosOutput::IssueWritePage + 4d d:\data\yarnnm\local\usercache\f675bad0-3d48-4f08-9933-d7cb614ec7a8\appcache\application_1531519980045_88416\container_e194_1531519980045_88416_01_000001\wd\scopeio.h line:6063
7ffd9156be7d ScopeCodeGenEngine!ScopeEngine::TextOutputStream,ScopeEngine::CosmosOutput>::Write + 2bd d:\data\yarnnm\local\usercache\f675bad0-3d48-4f08-9933-d7cb614ec7a8\appcache\application_1531519980045_88416\container_e194_1531519980045_88416_01_000001\wd\scopeio.h line:7828
7ffd91579290 ScopeCodeGenEngine!ScopeEngine::TextOutputPolicy::SerializeHeader + 30 d:\data\yarnnm\local\usercache\f675bad0-3d48-4f08-9933-d7cb614ec7a8\appcache\application_1531519980045_88416\container_e194_1531519980045_88416_01_000001\wd__scopecodegenengine__.dll.cpp line:514
7ffd91574ba7 ScopeCodeGenEngine!ScopeEngine::Outputer,ScopeEngine::BinaryInputStream,ScopeEngine::ExecutionStats>,SV1_Extract_out0,ScopeEngine::ScopeUnionAll,ScopeEngine::BinaryInputStream,ScopeEngine::ExecutionStats>,SV1_Extract_out0>,3>,SV1_Extract_out0,1>,SV1_Extract_out0,ScopeEngine::TextOutputPolicy,ScopeEngine::TextOutputStream,ScopeEngine::CosmosOutput>,0,ScopeEngine::ExecutionStats,ScopeEngine::DummyStatsWriter>::DoOutput + 27 d:\data\yarnnm\local\usercache\f675bad0-3d48-4f08-9933-d7cb614ec7a8\appcache\application_1531519980045_88416\container_e194_1531519980045_88416_01_000001\wd\scopeoperators.h line:5713
7ffd91582258 ScopeCodeGenEngine!SV2_PodAggregate_execute + 658 d:\data\yarnnm\local\usercache\f675bad0-3d48-4f08-9933-d7cb614ec7a8\appcache\application_1531519980045_88416\container_e194_1531519980045_88416_01_000001\wd__scopecodegenengine__.dll.cpp line:722
7ffd8c36571d ScopeEngine!??1OutputFileInfo#ScopeEngine##QEAA#XZ + 60d
7ffd8c397aa0 ScopeEngine!?RunUserCode#Vertex#ScopeEngine##SA_N_NAEBV?$function#$$A6AXXZ#std###Z + 1b0
7ffd8c397a4e ScopeEngine!?RunUserCode#Vertex#ScopeEngine##SA_N_NAEBV?$function#$$A6AXXZ#std###Z + 15e
7ffd8c397915 ScopeEngine!?RunUserCode#Vertex#ScopeEngine##SA_N_NAEBV?$function#$$A6AXXZ#std###Z + 25
7ffd8c365c7f ScopeEngine!??1OutputFileInfo#ScopeEngine##QEAA#XZ + b6f
7ffd8c3950c4 ScopeEngine!?Execute#Vertex#ScopeEngine##SA_NAEBVVertexStartupInfo#2#PEAUVertexExecutionInfo#2##Z + 3f4
7ff731d8ae8d scopehost!(no name)
7ff731d8adbd scopehost!(no name)
7ffd8c4274d9 ScopeEngine!?Execute#VertexHostBase#ScopeEngine##IEAA_NAEAVVertexStartupInfo#2##Z + 379
7ff731d8d236 scopehost!(no name)
7ff731d6a966 scopehost!(no name)
7ff731d98dac scopehost!(no name)
7ffd9e4713d2 KERNEL32!BaseThreadInitThunk + 22
7ffd9e5e54e4 ntdll!RtlUserThreadStart + 34
The Service Principal account has Owner permissions on Data Lake Analytics. The SP account also has (default) rwx permissions on the /stack_test subdirectory in Data Lake Store and all its files and children, and x permission on the root directory. The error message seems to say that the SP account is missing permissions on the destination file (/stack_test/Companies.csv), but I can explicitly see that it has rwx on that file. Which permissions am I still missing?
For reference, the script and the Data Factory resources necessary to reproduce this problem can be found at: https://github.com/lehmus/StackQuestions/tree/master/ADF_ADLA_Auth.