External Table on DELTA format files in ADLS Gen 1 - pyspark

We have a number of Databricks Delta tables created on ADLS Gen1, and there are also external tables built on top of each of those tables in one of our Databricks workspaces.
Similarly, I am trying to create the same sort of external tables on the same Delta format files, but in a different workspace.
I have read-only access via a service principal on ADLS Gen1, so I can read the Delta files through Spark DataFrames, as shown below:
read_data_df = spark.read.format("delta").load('dbfs:/mnt/data/<foldername>')
I can even create Hive external tables, but I see the following warning when reading data from such a table:
Error in SQL statement: AnalysisException: Incompatible format detected.
A transaction log for Databricks Delta was found at `dbfs:/mnt/data/<foldername>/_delta_log`,
but you are trying to read from `dbfs:/mnt/data/<foldername>` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
If I create the external table 'using DELTA', then I see a different access error:
Caused by: org.apache.hadoop.security.AccessControlException:
OPEN failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
Does this mean that I would need full access, rather than just read-only, on the underlying file system?
Thanks

Resolved after upgrading the Databricks Runtime to version 7.3 (DBR 7.3).
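For reference, a minimal sketch of the kind of external Delta table declaration involved, once the runtime supports it; the table name below is a placeholder and the path is the mounted folder from the question:
# Hypothetical example: the table name is a placeholder; the LOCATION is the
# mounted ADLS Gen1 folder referenced in the question.
spark.sql("""
CREATE TABLE IF NOT EXISTS my_external_delta_table
USING DELTA
LOCATION 'dbfs:/mnt/data/<foldername>'
""")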

Related

ADF Copy data from Azure Databricks Delta Lake to Azure SQL Server

I'm trying to use the Copy data activity to extract information from an Azure Databricks Delta Lake, but I've noticed that it doesn't pass the information directly from the Delta Lake to the SQL Server I need; it has to pass it through an Azure Blob Storage account first. When running it, it throws the following error:
ErrorCode=AzureDatabricksCommandError, Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key. Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information, I found a possible solution, but it didn't work:
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass information from an Azure Databricks Delta Lake table to a table in SQL Server?
I have attached some images of the structure that I have in ADF; in one of them I get a message telling me that I must have a Storage Account to continue. I have also attached images of the configuration and of the failed execution.
Thank you very much
The solution for this problem was the following:
Correct the way the Storage Access Key configuration was being defined. The setting
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net
must be changed to
spark.hadoop.fs.azure.account.key.<storageaccountname>.dfs.core.windows.net
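A minimal sketch of how the corrected key could be applied at notebook level via spark.conf, assuming the same placeholders used elsewhere in this thread:
# Hypothetical notebook-level equivalent of the corrected setting; replace the
# placeholders with the real storage account name and access key.
spark.conf.set(
    "fs.azure.account.key.<storageaccountname>.dfs.core.windows.net",
    "<access-key>"
)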
Does anyone have any idea how the hell to pass information from an Azure Databricks Delta Lake table to a table in SQL Server?
To achieve the above scenario, follow the steps below:
First, go to your Databricks cluster, edit it, and under Advanced options >> Spark >> Spark config add the lines below if you are using Blob Storage:
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that, since you are using SQL Database as a sink, enable staging and provide the same Blob Storage account linked service as the staging account linked service, giving a storage path from your Blob Storage.
Then debug it. Make sure you complete the prerequisites from the official document.
Screenshots of my sample input and of the resulting output in SQL were attached.

parquet streaming of Azure Blob storage into databricks with unity catalog

Unity Catalog has recently been set up in my Databricks account, and I am trying to stream from an Azure container containing Parquet files to a catalog, using a notebook that ran fine before.
However, I now get the following error:
py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.streaming.DataStreamReader.format(java.lang.String) is not whitelisted on class class org.apache.spark.sql.streaming.DataStreamReader
when trying to run the following spark command from my Notebook:
df = (spark
.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.useNotifications", "false") # useNotifications determines if we efficiently scan the new files or if we set up a subscription to listen to new file events
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns") # schemaEvolutionMode determines what happens when the schema changes
.option("cloudFiles.schemaLocation", schemaPath)
.load(dataPath)
)
where schemaPath and dataPath contain the paths to the Parquet schema and data files.
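For context, this reader is normally paired with a writer; a minimal sketch of how the stream would typically be completed, assuming a checkpointPath variable and a placeholder target table name (neither appears in the original notebook):
# Hypothetical continuation of the stream above; checkpointPath and the target
# table name are assumed placeholders.
(df.writeStream
 .option("checkpointLocation", checkpointPath)  # streaming state and schema tracking
 .trigger(availableNow=True)                    # process all available files, then stop
 .toTable("my_catalog.my_schema.my_table"))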
The closest related error I have found is the following pre-Unity Catalog error, suggesting that I should disable table access control on my clusters:
https://kb.databricks.com/en_US/streaming/readstream-is-not-whitelisted
All table access control are disabled in my Admin Console.
Are there some other settings that should be set to ensure white-listing from Azure files now that Unity Catalog is set up?
------ Edit -----
Using a Single User cluster on Databricks runtime version 11.3 beta, I get the following error instead:
com.databricks.sql.cloudfiles.errors.CloudFilesIOException: Failed to write to the schema log at location
followed by the location to the azure schema in my storage location. I also get this error message by spawning new job clusters from azure datafactory.

synapse spark notebook strange errors reading from ADLS Gen 2 and writing to another ADLS Gen 2 both using private endpoints and TokenLibrary

The premise is simple: two ADLS Gen 2 accounts, both accessed as abfss://.
The source account stores records in JSONL format under a yyyy/mm/dd directory layout.
I need to read this account recursively, starting at any directory.
Transform the data.
Then write the data to the target account as Parquet, in the format Year=yyyy/Month=mm/Day=dd.
The source account is an ADLS Gen 2 account with private endpoints and is not part of Synapse Analytics.
The target account is the default ADLS Gen 2 account for Azure Synapse Analytics.
I am using a Spark notebook within Synapse Analytics with a managed virtual network.
The source storage account has private endpoints.
The code is written in PySpark.
Linked services are used for both ADLS Gen 2 accounts, set up with private endpoints.
from pyspark.sql.functions import col, substring
spark.conf.set("spark.storage.synapse.linkedServiceName", psourceLinkedServiceName)
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
#read the data into a data frame
df = spark.read.option("recursiveFileLookup","true").schema(inputSchema).json(sourceFile)
#perform the transformations to flatten this structure
#create the partition columns for delta lake
dfDelta = df.withColumn('ProductSold_OrganizationId',col('ProductSold.OrganizationId'))\
.withColumn('ProductSold_ProductCategory',col('ProductSold.ProductCategory'))\
.withColumn('ProductSold_ProductId',col('ProductSold.ProductId'))\
.withColumn('ProductSold_ProductLocale',col('ProductSold.ProductLocale'))\
.withColumn('ProductSold_ProductName',col('ProductSold.ProductName'))\
.withColumn('ProductSold_ProductType',col('ProductSold.ProductType'))\
.withColumn('Year',substring(col('CreateDate'),1,4))\
.withColumn('Month',substring(col('CreateDate'),6,2))\
.withColumn('Day',substring(col('CreateDate'),9,2))\
.drop('ProductSold')
#dfDelta.show()
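#switch the storage configuration to the sink linked service before writing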
spark.conf.set("spark.storage.synapse.linkedServiceName", psinkLinkedServiceName)
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
dfDelta.write.partitionBy("Year","Month","Day").mode('append').format("parquet").save(targetFile)
When trying to access a single file, I get the error message:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (vm-19231616 executor 2): java.nio.file.AccessDeniedException: Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD,
The error indicates that it cannot authenticate to the source file. However, here is where it gets strange.
If I uncomment the line
#dfDelta.show()
it works.
Moreover, if I put the comment back and run it again, it continues to work. The only way to see the failure again is to completely stop the Spark session and then restart it.
There is more strangeness: if I change the source file path to something like 2022/06, which should read multiple files, I get the same error regardless of whether the dfDelta.show() statement is uncommented.
The only method I have found to get this to work is to process one file at a time within the Spark notebook; option("recursiveFileLookup","true") only works when there is a single file to process.
Finally, what have I tried?
I have tried creating another Spark session:
Spark session 1 reads the data and puts it into a view.
Spark session 2 reads the data from the view and attempts to write it.
The configuration for Spark session 2 uses the token library for the sink file.
This results in the same error message.
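A minimal sketch of that two-session attempt as I understand it; the global temp view name is illustrative and the transformation step is omitted for brevity:
# Hypothetical reconstruction of the two-session attempt described above.
# Session 1: configure the source linked service and expose the data as a global temp view.
spark.conf.set("spark.storage.synapse.linkedServiceName", psourceLinkedServiceName)
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
df = spark.read.option("recursiveFileLookup","true").schema(inputSchema).json(sourceFile)
df.createOrReplaceGlobalTempView("staged_source")
# Session 2: switch to the sink linked service and write from the shared view.
spark2 = spark.newSession()
spark2.conf.set("spark.storage.synapse.linkedServiceName", psinkLinkedServiceName)
spark2.conf.set("fs.azure.account.auth.type", "SAS")
spark2.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
spark2.table("global_temp.staged_source").write.partitionBy("Year","Month","Day").mode('append').format("parquet").save(targetFile)
Because Spark evaluates lazily, the source files are only actually read when the write in the second session triggers the job, which may be why the same authentication error appears.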
My best guess: this has something to do with the Spark cores processing multiple files, and when I change the config they get confused about how to read from the source file.
I had this working perfectly before I changed the Synapse Analytics workspace to use a managed virtual network. In that case I accessed the source storage account using a managed identity linked service and had no issues writing to the default Synapse Analytics ADLS Gen 2 account.
I have also tried option("forwardSparkAzureStorageCredentials", "true") on both the read and the write DataFrames.
Any suggestions on how to get this to work with multiple files would be appreciated.

Azure Synapse Exception while reading table from synapse Dwh

While reading from a table I'm getting:
jdbc.SQLServerException: Create External Table As Select statement failed as the path ####### could not be used for export. Error Code: 105005
This error occurs because PolyBase can't complete the operation. The failure can be due to the following reasons:
Network failure when you try to access the Azure Blob storage
The configuration of the Azure storage account
You can fix this issue by following the first article below; it helps you resolve the problem that occurs when you run a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement.
For more detail, please refer to the links below:
https://learn.microsoft.com/en-us/troubleshoot/sql/analytics-platform-system/error-cetas-to-blob-storage
https://www.sqlservercentral.com/articles/access-external-data-from-azure-synapse-analytics-using-polybase
https://knowledge.informatica.com/s/article/000175628?language=en_US

Loading Amazon Redshift with a manifest, with an error in one file

When using the COPY command to load Amazon Redshift with a manifest, suppose one of the files contains an error.
Is there a way to just log the error for that file, but continue loading the other files?
The manifest file indicates whether a file is mandatory and whether an error should be generated if a file is not found. (Using a Manifest to Specify Data Files)
The COPY command will retry if it cannot read a file. (Errors When Reading Multiple Files)
The COPY command can specify a MAXERRORS parameter that permits a certain number of errors before the COPY command fails. (MAXERROR)
When loading data from files, Amazon Redshift will report any errors in the STL_LOAD_ERRORS table. (STL_LOAD_ERRORS)
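Putting those pieces together, a minimal sketch of a manifest-based COPY with MAXERROR followed by an inspection of STL_LOAD_ERRORS; the connection details, table, bucket, and IAM role are placeholders, not taken from the question:
# Hypothetical sketch; replace the placeholders with real values and add the
# data format options appropriate for your files.
import psycopg2

conn = psycopg2.connect(host="<cluster-endpoint>", port=5439,
                        dbname="<db>", user="<user>", password="<password>")

copy_sql = """
    COPY <target_table>
    FROM 's3://<bucket>/manifests/load.manifest'
    IAM_ROLE '<redshift-iam-role-arn>'
    MANIFEST
    MAXERROR 100;  -- tolerate up to 100 bad rows before the load fails
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
    # Rejected rows, if any, are logged here for inspection.
    cur.execute("SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;")
    for row in cur.fetchall():
        print(row)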
As said above, the MAXERROR parameter should satisfy this requirement.
In addition, the NOLOAD parameter of COPY checks the validity of the data without loading it. Running with NOLOAD is much faster, as it only parses the files.
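A hedged sketch of the validation-only variant, reusing the psycopg2 connection and placeholders from the sketch above:
# Hypothetical validation-only run: NOLOAD parses the files and records any
# errors without inserting rows.
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY <target_table>
        FROM 's3://<bucket>/manifests/load.manifest'
        IAM_ROLE '<redshift-iam-role-arn>'
        MANIFEST
        NOLOAD;
    """)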