Databricks load file from path which contains equals (=) sign - pyspark

I'm looking to export Azure Monitor data from Log Analytics to a storage account and then read the JSON files into Databricks using PySpark.
The blob path for the Log Analytics export contains an equals (=) sign and Databricks throws an exception when using the path.
WorkspaceResourceId=/subscriptions/subscription-id/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/<workspace>/y=<four-digit numeric year>/m=<two-digit numeric month>/d=<two-digit numeric day>/h=<two-digit 24-hour clock hour>/m=<two-digit 60-minute clock minute>/PT05M.json
Log Analytics Data Export
Is there a way to escape the equals sign so that the JSON files can be loaded from the blob location?

I tried a similar use case, referring to the Microsoft documentation; below are the steps:
Mount the storage container. You can do it with Python code as below; make sure you pass all the parameters correctly, because incorrect parameters will lead to multiple different errors.
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
Below are the parameter descriptions:
<storage-account-name> is the name of your Azure Blob storage account.
<container-name> is the name of a container in your Azure Blob storage account.
<mount-name> is a DBFS path representing where the Blob storage container or a folder inside the container (specified in source) will be mounted in DBFS.
<conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.windows.net or fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as a secret in a secret scope.
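As a small illustrative sketch (all names are placeholders), the extra_configs entry using the account-key variant of <conf-key> would look like this:
extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}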
Then you can access those files as below:
df = spark.read.text("/mnt/<mount-name>/...")
df = spark.read.text("dbfs:/<mount-name>/...")
There are also multiple ways of accessing the file; all of them are clearly covered in the doc.
Also check this Log Analytics workspace doc to understand how to export the data to Azure Storage.
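If the exception comes from Spark's partition discovery over the key=value folders (the export path contains m= twice, once for the month and once for the minute), one hedged workaround, sketched below under the assumption that the container is mounted at /mnt/<mount-name> and that you are on Spark 3+, is to read with recursiveFileLookup, which lists files recursively and disables partition inference, so the = signs in the folder names are never interpreted:
# Sketch only: recursiveFileLookup=true disables partition inference, so the duplicate m= folders do not conflict
df = (spark.read
      .option("recursiveFileLookup", "true")
      .json("/mnt/<mount-name>/WorkspaceResourceId=/subscriptions/subscription-id/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/<workspace>"))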

Related

ADF Copy data from Azure Data Bricks Delta Lake to Azure Sql Server

I'm trying to use the Copy data activity to extract information from Azure Databricks Delta Lake, but I've noticed that it doesn't pass the information directly from the Delta Lake to the SQL Server I need; it must first pass it to an Azure Blob Storage. When running it, it throws the following error:
ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configuration. Invalid configuration value detected for fs.azure.account.key. Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information I found a possible solution but it didn't work.
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass information from an Azure Databricks Delta Lake table to a table in SQL Server?
(Screenshots of the ADF pipeline structure, the configuration, and the failed execution were attached; one of them shows a message saying that a Storage Account is required to continue.)
Thank you very much
The solution for this problem was the following:
Correct the way the Storage Access Key configuration was being defined.
In the instruction:
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net
the following change must be made:
spark.hadoop.fs.azure.account.key.<storageaccountname>.dfs.core.windows.net
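For reference, a minimal hedged sketch of setting the same key from a notebook session instead of the cluster configuration (the storage account, scope, and key names are placeholders):
spark.conf.set(
    "fs.azure.account.key.<storageaccountname>.dfs.core.windows.net",
    dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))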
Does anyone have any idea how the hell to pass information from an Azure Databricks Delta Lake table to a table in SQL Server?
To achieve the above scenario, follow the steps below:
First, go to your Databricks cluster, edit it, and under Advanced options >> Spark >> Spark config add the configuration below if you are using Blob storage.
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that, since you are using SQL Database as a sink, enable staging and select the same Blob storage account linked service as the Staging account linked service, giving a storage path from your Blob storage.
Then debug it. Make sure you complete the prerequisites from the official document.
(Screenshots of the sample input and of the resulting output in SQL were attached.)
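As a side note, separate from the ADF solution above: if the goal is simply to land a Delta table in SQL Server, a direct JDBC write from a Databricks notebook is another option. The sketch below is hedged; the server, database, table, and credential names are placeholders:
jdbc_url = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<database-name>"
(spark.read.table("<delta_table_name>")
    .write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "<schema>.<target-table>")
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .mode("append")
    .save())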

Azure Data Factory - source dataset fails with "path does not resolve to any file(s)" when sink to a different directory is in progress

We have an ADF pipeline with Copy activity to transfer data from Azure Table Storage to a JSON file in an Azure Blob Storage container. When the data transfer is in progress, other pipelines that use this dataset as a source fail with the following error "Job failed due to reason: Path does not resolve to any file(s)".
The dataset has a property that indicates the container directory. This property is populated by the trigger time of the pipeline copying the data, so it writes to a different directory in each run. The other failing pipelines use a directory corresponding to an earlier run of the pipeline copying the data and I have confirmed that the path does exist.
Anyone knows why this is happening and how to solve it?
The expression in the directory and file text boxes inside the dataset is probably not correct.
Check this link: Azure data flow not showing / in path to data source

Reading Json file from Azure datalake as a file using Json.load in Azure databricks /Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to give the file name and use open(file name) to load the data. When I provide the data lake path, it throws an error that the file path is not found. I am able to read the data into dataframes, but how can I read the file from the data lake without converting it to a dataframe, opening it as a file?
Current code approach on the local machine, which is working:
import json

f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the data lake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
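For illustration (the mount-point and file names are placeholders), once the folder is mounted the file becomes reachable through the local /dbfs FUSE path, so the plain open/json.load approach works:
import json

with open("/dbfs/mnt/<mount-name>/File_Name.Json", "r") as f:
    data = json.load(f)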
The open function works only with local files; it does not understand (out of the box) cloud file paths. You can of course try to mount the cloud storage, but as @ARCrow mentioned, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a dataframe, it means that the cluster has all the necessary settings for accessing the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from the cloud storage to the local disk, and then open it with the open function. Something like this:
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
data = json.load(f)

Read a zip file in databricks from Azure Storage Explorer

I want to read zip files that have csv files. I have tried many ways but I have not succeeded. In my case, the path where I should read the file is in Azure Storage Explorer.
For example, when I have to read a csv in databricks I use the following code:
dfDemandaBilletesCmbinad = spark.read.csv("/mnt/data/myCSVfile.csv", header=True)
So the Azure Storage path that I want is "/mnt/data/myZipFile.zip", which contains some CSV files.
Is it possible to read csv files coming from Azure storage via pySpark in databricks?
I think the only way to do this is with pandas, openpyxl, and the zipfile library for Python, as there is no similar library for PySpark.
import pandas as pd
import openpyxl, zipfile
# Unzip and extract to a file. Maybe it could be better to unzip in memory with StringIO.
with zipfile.ZipFile('/dbfs/mnt/data/file.zip', 'r') as zip_ref:
    zip_ref.extractall('/dbfs/mnt/data/unzipped')
#read excel
my_excel = openpyxl.load_workbook('/dbfs/mnt/data/unzipped/file.xlsx')
ws = my_excel['worksheet1']
# create pandas dataframe
df = pd.DataFrame(ws.values)
# create spark dataframe
spark_df = spark.createDataFrame(df)
The problem is that this is only executed on the driver VM of the cluster.
Please keep in mind that Azure Storage Explorer does not store any data. It's a tool that lets you access your Azure storage account from any device and on any platform; the data is always stored in an Azure storage account.
In your scenario, it appears that your Azure storage account is already mounted to a Databricks DBFS file path. Since it is mounted, you can use the spark.read command to access the file directly from the Azure storage account.
Sample: df = spark.read.text("dbfs:/mymount/my_file.txt")
Reference: https://docs.databricks.com/data/databricks-file-system.html
and regarding ZIP files, please refer to
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
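Putting the pieces together for the CSV case in the question, a rough sketch, assuming the container is mounted at /mnt/data as in the question and that the zip really contains CSV files (file names are illustrative):
import zipfile

# Unzip on the driver through the /dbfs FUSE path
with zipfile.ZipFile("/dbfs/mnt/data/myZipFile.zip", "r") as zip_ref:
    zip_ref.extractall("/dbfs/mnt/data/unzipped")

# Read the extracted CSV files in parallel with Spark
df = spark.read.csv("/mnt/data/unzipped/*.csv", header=True)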

Find Last modified timestamp of files/folders in Azure Datalake through python script in Azure databricks that uses Credential passthrough

I have an Azure DataLake Storage Gen2 which contains a few Parquet files. My Organization has enabled credential passthrough and so I am able to create a python script in Azure Databricks and access the files available in ADLS using dbutils.fs.ls. All these work fine.
Now, I need to access the last modified timestamp of these files too. I found a link that does this. However, it uses BlockBlobService and requires an account_key.
I do not have an account key and can't get one due to security policies of the organization. I am unsure of how to do the same using Credential passthrough. Any ideas here?
You can try to mount the Azure DataLake Storage Gen2 instance with credential passthrough.
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

mount_name = 'localmountname'
container_name = 'containername'
storage_account_name = 'datalakestoragename'

dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = f"/mnt/{mount_name}",
  extra_configs = configs)
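Once mounted, a quick usage check is to list the files through the mount point (the modification timestamps themselves are retrieved in the approach below):
for f in dbutils.fs.ls(f"/mnt/{mount_name}"):
    print(f.path, f.size)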
You can do this using the Hadoop FileSystem object accessible via Spark:
import time

path = spark._jvm.org.apache.hadoop.fs.Path
fs = path('abfss://container@storageaccount.dfs.core.windows.net/').getFileSystem(sc._jsc.hadoopConfiguration())
res = fs.listFiles(path('abfss://container@storageaccount.dfs.core.windows.net/path'), True)
while res.hasNext():
    file = res.next()
    localTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(file.getModificationTime() / 1000))
    print(f"{file.getPath()}: {localTime}")
Note that the True parameter in the listFiles() method means recursive listing.