Error ingesting a flat file into Azure Data Lake Store using Azure Data Lake - azure-data-factory

I am getting the error below for certain time slices while using a Copy activity in a Data Factory pipeline, but for other slices the same activity copies the file successfully to the specified Data Lake folder. I can't tell whether this is an issue with the factory, the lake, or a Data Management Gateway communication failure. The pipeline creates two GUID folders underneath the specified lake folder and a 0 KB "temp" file, but the Copy activity fails with the error below.
FileWriter|Error trying to write the file|uri:https://bcbsne.azuredatalakestore.net/webhdfs/v1/Consumer/FlatFiles/2017071013/2a621c14-bdac-4cd6-a0d3-efba4a4526a0/5a3ac937-8176-469d-b6c6-ca738f8ab3a6/_tmp_test.txt-0.tmp?op=APPEND&overwrite=true&user.name=testUser&api-version=2014-01-01&offset=0&length=27&append=true,Content:
Job ID: 41ff39a9-f6e0-4b94-8f9d-625dec7f84de
Log ID: Error

Related

ADF Copy data from Azure Data Bricks Delta Lake to Azure Sql Server

I'm trying to use the Copy data activity to extract information from an Azure Databricks Delta Lake, but I've noticed that it doesn't pass the data directly from the Delta Lake to the SQL Server I need; it has to go through an Azure Blob Storage staging area first. When I run it, it throws the following error:
ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information, I found a possible solution, but it didn't work:
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass data from an Azure Databricks Delta Lake table to a table in SQL Server?
These are some images of the structure that I have in ADF:
In the image, I get a message telling me that I must have a Storage Account to continue.
These are images of the configuration and of the failed execution:
Conf:
Fail:
Thank you very much
The solution to this problem was the following:
Correct the way the storage access key configuration was being defined. In the setting:
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net
the following change must be made:
spark.hadoop.fs.azure.account.key.<storageaccountname>.dfs.core.windows.net
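For reference, here is a minimal sketch of the corrected setting and a quick check from a notebook; the account name "mystorageaccount" and the secret scope/key names are placeholders, not taken from the original post:
# Cluster Spark config line (Advanced options >> Spark config), using the dfs endpoint instead of blob:
#   spark.hadoop.fs.azure.account.key.mystorageaccount.dfs.core.windows.net {{secrets/my-scope/storage-access-key}}

# From a notebook, confirm the cluster picked up the key (raises if the entry is missing):
spark.conf.get("spark.hadoop.fs.azure.account.key.mystorageaccount.dfs.core.windows.net")

# Or set it at session level for notebook work (dbutils is available on Databricks):
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-access-key"),
)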
To achieve the above scenario, follow the steps below:
First, go to your Databricks cluster, edit it, and under Advanced options >> Spark >> Spark config add the settings below if you are using Blob storage.
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that, since you are using SQL Database as the sink:
Enable staging, select the same Blob storage account's linked service as the staging account linked service, and provide a storage path from your Blob storage.
Then debug it, making sure you have completed the prerequisites from the official documentation.
My sample Input:
Output in SQL:
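If staging through Blob storage is not an option, a different route (not part of the answer above, just a hedged sketch) is to push the Delta table to SQL Server directly from a Databricks notebook over JDBC, using the Microsoft SQL Server driver that ships with the Databricks runtime. All names below are placeholders:
# Hypothetical table, server, and secret names; adjust to your workspace and database.
df = spark.table("my_database.my_delta_table")  # or spark.read.format("delta").load("/path/to/table")

(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
   .option("dbtable", "dbo.my_target_table")
   .option("user", "sql_user")
   .option("password", dbutils.secrets.get(scope="my-scope", key="sql-password"))
   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
   .mode("append")
   .save())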

Azure Data Factory - source dataset fails with "path does not resolve to any file(s)" when sink to a different directory is in progress

We have an ADF pipeline with a Copy activity to transfer data from Azure Table Storage to a JSON file in an Azure Blob Storage container. While the data transfer is in progress, other pipelines that use this dataset as a source fail with the following error: "Job failed due to reason: Path does not resolve to any file(s)".
The dataset has a property that indicates the container directory. This property is populated by the trigger time of the pipeline copying the data, so it writes to a different directory on each run. The other, failing pipelines use a directory corresponding to an earlier run of the copying pipeline, and I have confirmed that the path does exist.
Does anyone know why this is happening and how to solve it?
Probably your expression in the directory and file textboxes inside the dataset is not correct.
Check this link: Azure data flow not showing / in path to data source
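As an illustration only (the actual dataset parameters are not shown in the question), a trigger-time-based directory expression usually looks something like @concat('output/', formatDateTime(pipeline().TriggerTime, 'yyyyMMddHH')). If the reading pipelines build the path with a different format string or a different timestamp than the writing pipeline, the resolved folder will not match the one that was actually written.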

synapse spark notebook strange errors reading from ADLS Gen 2 and writing to another ADLS Gen 2 both using private endpoints and TokenLibrary

The premise is simple: two ADLS Gen 2 accounts, both accessed as abfss://.
The source account has a directory layout of yyyy/mm/dd, with records stored in jsonl format.
I need to read this account recursively starting at any directory.
Transform the data.
Write the data to the target account in parquet with format Year=yyyy/Month=mm/Day=dd.
The source account is an ADLS Gen 2 account with private endpoints that is not a part of Synapse Analytics.
The target account is the default ADLS Gen 2 for Azure Synapse Analytics.
Using Spark notebook within Synapse Analytics with managed virtual network.
Source storage account has private endpoints.
Code written in pyspark.
Using linked services for both ADLS Gen 2 accounts, set up with private endpoints.
from pyspark.sql.functions import col, substring
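#point the session-level SAS authentication at the source linked service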
spark.conf.set("spark.storage.synapse.linkedServiceName", psourceLinkedServiceName)
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
#read the data into a data frame
df = spark.read.option("recursiveFileLookup","true").schema(inputSchema).json(sourceFile)
#perform the transformations to flatten this structure
#create the partition columns for delta lake
dfDelta = df.withColumn('ProductSold_OrganizationId',col('ProductSold.OrganizationId'))\
.withColumn('ProductSold_ProductCategory',col('ProductSold.ProductCategory'))\
.withColumn('ProductSold_ProductId',col('ProductSold.ProductId'))\
.withColumn('ProductSold_ProductLocale',col('ProductSold.ProductLocale'))\
.withColumn('ProductSold_ProductName',col('ProductSold.ProductName'))\
.withColumn('ProductSold_ProductType',col('ProductSold.ProductType'))\
.withColumn('Year',substring(col('CreateDate'),1,4))\
.withColumn('Month',substring(col('CreateDate'),6,2))\
.withColumn('Day',substring(col('CreateDate'),9,2))\
.drop('ProductSold')
#dfDelta.show()
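#switch the session-level SAS authentication to the sink linked service before writing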
spark.conf.set("spark.storage.synapse.linkedServiceName", psinkLinkedServiceName)
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
dfDelta.write.partitionBy("Year","Month","Day").mode('append').format("parquet").save(targetFile)
When trying to access a single file, you get the error message:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (vm-19231616 executor 2): java.nio.file.AccessDeniedException: Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD,
The error indicates that it cannot authenticate to the source file. However, here is where it gets strange.
If I uncomment the line
#dfDelta.show()
it works.
However, if you put the comment back and run it again, it continues to work. The only way to see the failure again is to completely stop the Spark session and then restart it.
OK, more strangeness: change the source file path to something like 2022/06, which should read multiple files, and regardless of whether the dfDelta.show() statement is uncommented, you get the same error.
The only method I have found to get this to work is to process one file at a time within the Spark notebook; option("recursiveFileLookup","true") only works when a single file is processed.
Finally, what have I tried?
I have tried creating another Spark session:
Spark session 1 reads the data and puts it into a view.
Spark session 2 reads the data in the view and attempts to write it.
The configuration for Spark session 2 uses the token library for the sink file.
This results in the same error message.
My best guess: this has something to do with the Spark cores processing multiple files; when I change the config, they get confused about how to read from the source file.
I had this working perfectly before I changed the Synapse Analytics account to use a managed virtual network. But in that case I accessed the source storage account using a managed identity linked service and had no issues writing to the default Synapse Analytics ADLS Gen 2 account.
I also tried option("forwardSparkAzureStorageCredentials", "true") on both the read and the write dataframes.
Any suggestions on how to get this to work with multiple files would be appreciated.
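One hedged idea (not from the original post, and assuming your Synapse runtime supports the account-scoped form of these TokenLibrary settings): scope the linked-service SAS configuration to each storage account's endpoint instead of swapping the global session settings between the read and the write, so the lazily evaluated read and the subsequent write each resolve their own credentials. A sketch, with placeholder endpoint names:
# Hypothetical endpoint host names; replace with the real source and sink accounts.
source_host = "sourceaccount.dfs.core.windows.net"
sink_host = "sinkaccount.dfs.core.windows.net"

for host, linked_service in [(source_host, psourceLinkedServiceName),
                             (sink_host, psinkLinkedServiceName)]:
    # Account-scoped variants of the three session-level settings used above.
    spark.conf.set(f"spark.storage.synapse.{host}.linkedServiceName", linked_service)
    spark.conf.set(f"fs.azure.account.auth.type.{host}", "SAS")
    spark.conf.set(f"fs.azure.sas.token.provider.type.{host}",
                   "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")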

Extra Blob Created after Sink in Data Flow

I'm importing from Snowflake to Azure blob using data flow activity in Azure Data Factory.
I noticed that whenever I create a blob through the sink (placed inside the provider/Inbound/ folder), I get an extra empty blob file outside Inbound.
Does this happen for all data flow sink to blob?
I created a data flow and loaded data to blob storage from Snowflake, and I don't see any additional blob file generated outside my sink folder.
Make sure the sink connection points to the correct folder, and also double-check whether any process other than this data flow is running that could be creating an extra file outside the sink folder.
Snowflake source:
Sink:
Output file path to generate the out file:
Sink setting to add a date as the filename:
Output folder:
Output file generated after executing the data flow.

How to use file name prefix in Data Factory when importing data into Azure data lake from SAP BW Open Hub?

I have a source of SAP BW Open Hub in data factory and a sink of Azure data lake gen2 and am using a copy activity to move the data.
I am attempting to transfer the data to the lake and split it into numerous files, with 200,000 rows per file. I would also like to be able to prefix all of the filenames, e.g. 'cust_', so the files would be something along the lines of cust_1, cust_2, cust_3, etc.
This only seems to be an issue when using SAP BW Open Hub as a source (it works fine when using SQL Server as a source). Please see the warning message below. After checking with our internal SAP BW team, they assure me that the data is in a tabular format and no explicit partitioning is enabled, so there shouldn't be an issue.
When executing the copy activity, the files are transferred to the lake but the file name prefix setting is ignored, and the filenames are instead set automatically, as below (the name seems to be made up of the SAP BW Open Hub table name and the request ID):
Here is the source config:
All other properties on the other tabs are set to default and have been unchanged.
QUESTION: without using a data flow, is there any way to split the files when pulling from SAP BW Open Hub and also be able to dictate the filenames in the lake?
I tried to reproduce the issue, and it works fine with a workaround. Instead of splitting the data while copying from SAP BW to Azure Data Lake Storage, simply copy the entire data set (without partitioning) into an Azure SQL Database. Please follow Copy data from SAP Business Warehouse by using Azure Data Factory (make sure to use Azure SQL Database as the sink).
Now that the data is in your Azure SQL Database, you can simply use a second copy activity to copy the data to Azure Data Lake Storage, where the file splitting and filename prefix settings work as expected.
In the source configuration, keep "Partition option" as None.
Source Config:
Sink config:
Output: