Find the last modified timestamp of files/folders in Azure Data Lake through a Python script in Azure Databricks that uses credential passthrough - pyspark

I have an Azure Data Lake Storage Gen2 account which contains a few Parquet files. My organization has enabled credential passthrough, so I am able to create a Python script in Azure Databricks and access the files in ADLS using dbutils.fs.ls. All of this works fine.
Now I also need the last modified timestamp of these files. I found a link that does this, but it uses BlockBlobService and requires an account key.
I do not have an account key and cannot get one due to the security policies of the organization. I am unsure how to do the same with credential passthrough. Any ideas?

You can try to mount the Azure Data Lake Storage Gen2 instance with credential passthrough:
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
mount_name = 'localmountname'
container_name = 'containername'
storage_account_name = 'datalakestoragename'

dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = f"/mnt/{mount_name}",
  extra_configs = configs)
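Once the mount exists, a minimal sketch for the original question (assuming the mount above succeeded; mount_name is the variable defined there) is to read the modification times through the local /dbfs FUSE path with plain Python file APIs:
import os
import time

# Sketch only: the mount is also visible on the driver under /dbfs, so
# standard os.stat() calls return the files' modification times.
for entry in os.scandir(f"/dbfs/mnt/{mount_name}"):
    mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(entry.stat().st_mtime))
    print(f"{entry.name}: {mtime}")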

You can do this using the Hadoop FileSystem object accessible via Spark:
import time

# Hadoop Path class and FileSystem, reached through the Spark JVM gateway
path = spark._jvm.org.apache.hadoop.fs.Path
fs = path('abfss://container@storageaccount.dfs.core.windows.net/').getFileSystem(sc._jsc.hadoopConfiguration())

res = fs.listFiles(path('abfss://container@storageaccount.dfs.core.windows.net/path'), True)
while res.hasNext():
    file = res.next()
    localTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(file.getModificationTime() / 1000))
    print(f"{file.getPath()}: {localTime}")
Note that the True parameter in the listFiles() call means recursive listing.
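If you only need the timestamp of a single file or folder rather than a recursive listing, the same FileSystem object exposes getFileStatus(); a short sketch with a placeholder path:
# Placeholder path; getFileStatus() works for both files and directories
status = fs.getFileStatus(path('abfss://container@storageaccount.dfs.core.windows.net/path/file.parquet'))
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(status.getModificationTime() / 1000)))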

Related

Reading a JSON file from Azure Data Lake as a file using json.load in Azure Databricks/Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to pass the filename and use open(file_name) to load the data. When I provide the data lake path, it throws an error that the file path was not found. I am able to read the data into dataframes, but how can I read a file from the data lake without converting it to a dataframe, i.e. open it as a plain file?
Current code approach on my local machine, which works:
f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the data lake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks workspace has access to all mounted locations.
Documentation on mounting to DBFS: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
The open function works only with local files; it does not understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as @ARCrow mentioned, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a dataframe, it means the cluster has all the settings necessary for accessing the cloud storage. In that case you can simply use the dbutils.fs.cp command to copy the file from cloud storage to the local disk, and then open it with the open function. Something like this:
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
data = json.load(f)
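Alternatively, if the storage is already mounted (for example with the passthrough mount shown earlier), the file can be opened directly through the /dbfs FUSE path and the copy step is not needed; a sketch with a placeholder mount name:
import json

# <mount-name> is a placeholder for wherever the data lake is mounted
with open("/dbfs/mnt/<mount-name>/File_Name.Json", "r") as f:
    data = json.load(f)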

Databricks load file from path which contains equals (=) sign

I'm looking to export Azure Monitor data from Log Analytics to a storage account and then read the JSON files into Databricks using PySpark.
The blob path for the Log Analytics export contains an equals (=) sign, and Databricks throws an exception when using the path.
WorkspaceResourceId=/subscriptions/subscription-id/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/<workspace>/y=<four-digit numeric year>/m=<two-digit numeric month>/d=<two-digit numeric day>/h=<two-digit 24-hour clock hour>/m=<two-digit 60-minute clock minute>/PT05M.json
Log Analytics Data Export
Is there a way to escape the equals sign so that the JSON files can be loaded from the blob location?
I tried a similar use case following the Microsoft documentation; below are the steps:
Mount the storage container. We can do it with Python code as below; make sure you pass all the parameters correctly, because incorrect parameters will lead to many different errors.
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
Below is a description of the parameters:
<storage-account-name> is the name of your Azure Blob storage account.
<container-name> is the name of a container in your Azure Blob storage account.
<mount-name> is a DBFS path representing where the Blob storage container or a folder inside the container (specified in source) will be mounted in DBFS.
<conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.windows.net or fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as a secret in a secret scope.
Then you can access those files as below:
df = spark.read.text("/mnt/<mount-name>/...")
df = spark.read.text("dbfs:/<mount-name>/...")
There are multiple ways of accessing the file; all of them are clearly described in the doc.
Also check this Log Analytics workspace doc to understand how to export the data to Azure Storage.
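To make the mounting answer concrete for the Log Analytics export layout from the question, here is a hedged sketch (the mount name and path segments are placeholders taken from the question); the equals signs are ordinary characters in the path string and need no escaping:
# Placeholders throughout; point the path at one exported PT05M.json file
export_file = ("/mnt/<mount-name>/WorkspaceResourceId=/subscriptions/<subscription-id>/"
               "resourcegroups/<resource-group>/providers/microsoft.operationalinsights/"
               "workspaces/<workspace>/y=<yyyy>/m=<mm>/d=<dd>/h=<hh>/m=<mm>/PT05M.json")
df = spark.read.json(export_file)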

Read a zip file in databricks from Azure Storage Explorer

I want to read zip files that contain CSV files. I have tried many ways but have not succeeded. In my case, the path from which I should read the file is in Azure Storage Explorer.
For example, when I have to read a csv in databricks I use the following code:
dfDemandaBilletesCmbinad = spark.read.csv("/mnt/data/myCSVfile.csv", header=True)
So, the Azure Storage path that I want is "/mnt/data/myZipFile.zip", which contains some CSV files inside.
Is it possible to read csv files coming from Azure storage via pySpark in databricks?
I think the only way to do this is with pandas, openpyxl and the zipfile library for Python, as there is no similar library for PySpark.
import pandas as pd
import openpyxl, zipfile

# Unzip and extract to a folder. It might be better to unzip in memory with BytesIO.
with zipfile.ZipFile('/dbfs/mnt/data/file.zip', 'r') as zip_ref:
    zip_ref.extractall('/dbfs/mnt/data/unzipped')

# Read the Excel workbook
my_excel = openpyxl.load_workbook('/dbfs/mnt/data/unzipped/file.xlsx')
ws = my_excel['worksheet1']

# Create a pandas dataframe from the worksheet values
df = pd.DataFrame(ws.values)

# Convert to a Spark dataframe
spark_df = spark.createDataFrame(df)
The problem is that this runs only on the driver VM of the cluster.
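Since the question is about CSV files inside the zip, a short follow-up sketch (reusing the paths assumed above): once the archive has been extracted to the mounted location, the CSVs can be read back with Spark in the usual way, even though the extraction itself ran only on the driver:
# The unzipped folder lives on mounted storage, so the Spark workers can read it
df = spark.read.csv("/mnt/data/unzipped/", header=True)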
Please keep in mind that Azure Storage Explorer does not store any data. It is a tool that lets you access your Azure storage account from any device and on any platform; the data is always stored in an Azure storage account.
In your scenario, it appears that your Azure storage account is already mounted to a Databricks DBFS path. Since it is mounted, you can use the spark.read command to access the file directly from the Azure storage account.
Sample: df = spark.read.text("dbfs:/mymount/my_file.txt")
Reference: https://docs.databricks.com/data/databricks-file-system.html
Regarding ZIP files, please refer to
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html

Cannot copy data from Snowflake into Azure Blob

I am trying to copy data from Snowflake into an Azure Blob using Azure Data Factory.
The role I am using has select permissions on the table, and I have no issues querying the data using the Snowflake console.
I am also able to copy into the target blob from other sources (in Azure) using the same SAS token.
This is the query I have, generated by Azure Data Factory (with specifics omitted):
COPY INTO 'azure://****.blob.core.windows.net/snowflake-stage/********-****-****-****-************/SnowflakeExportCopyCommand/'
FROM (select * from MYSCHEMA.MYTABLE)
CREDENTIALS = (AZURE_SAS_TOKEN = '****')
FILE_FORMAT = (type = CSV COMPRESSION = GZIP RECORD_DELIMITER = '
' FIELD_OPTIONALLY_ENCLOSED_BY = '"' ESCAPE = '\\' NULL_IF = '')
HEADER = TRUE
SINGLE = FALSE
OVERWRITE = TRUE
MAX_FILE_SIZE = 268435456
And this is the error I am getting:
ErrorCode=UserErrorOdbcOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ERROR
[42501] Failed to access remote file: access denied.
Please check your credentials,Source=Microsoft.DataTransfer.Runtime.GenericOdbcConnectors,''Type=System.Data.Odbc.OdbcException,Message=ERROR
[42501] Failed to access remote file: access denied. Please check your
credentials,Source=Snowflake,'
Are there more Snowflake permissions that I need in order to do this kind of copy? Or is this perhaps an issue with the write-permissions to the Azure container?
The "solution" for this indicates a likely bug for the permissions on the blob itself.
Switching the container permissions to public, then back to private again fixes the issue.
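If you want to rule out the SAS token itself before toggling container permissions, one hedged option (using the azure-storage-blob Python package; the account, container and token are placeholders mirroring the COPY INTO statement) is to attempt a small upload with the same token:
from azure.storage.blob import BlobClient

# Placeholders: use the same account, container and SAS token as in the COPY INTO
blob = BlobClient(
    account_url="https://<account>.blob.core.windows.net",
    container_name="snowflake-stage",
    blob_name="sas-write-test.txt",
    credential="<sas-token>",
)
blob.upload_blob(b"write test", overwrite=True)  # raises an error if the token lacks write permission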

Pyspark and BigQuery using two different project-ids in Google Dataproc

I want to run some pyspark jobs using Google Dataproc with different project IDs, without success so far. I'm a newbie with pyspark and Google Cloud, but I have followed this example and it runs well (if the BigQuery dataset is either public or belongs to my GCP project, which is ProjectA). The input parameters look like this:
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
projectA = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectA',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'my_table',
}

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
But what I need is to run a job from a BQ dataset of a ProjectB (I have credentials to query it), so when setting the input parameters, which look like this:
conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectB',
    'mapred.bq.input.dataset.id': 'the_datasetB',
    'mapred.bq.input.table.id': 'the_tableB',
}
and try to load the data in from BQ, my script keeps running indefinitely. How should I set it up properly?
FYI, after running the example I mentioned before, I can see that two folders (shard-0 and shard-1) are created in Google Storage and contain the corresponding BQ data, but with my job only shard-0 is created and it is empty.
I talked to my co-worker Dennis and here is his suggestion:
"Hmm, not sure, it should work. They might want to test with "bq" CLI inside the master node to manually try some "bq extract" job of the projectB table into their GCS bucket since that's all the connector is doing under the hood.
If I had to guess I'd suspect they only meant their personal username has the credentials to query projectB, but the default service account of projectA might not have the query permissions. Everything inside Dataproc VMs act on behalf of the compute service account assigned to the VMs, not the end-user.
They can
gcloud compute instances describe -m
and somewhere in there it lists the service-account email address."
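One hedged way to test Dennis's guess from inside the cluster (assuming the google-cloud-bigquery package is installed on the master node; the table name is the one from the question) is to run a tiny query there, which executes with the VM's service account rather than your personal credentials:
from google.cloud import bigquery

# Billing project is projectA; the table lives in projectB. If the VM's
# service account lacks access, this fails quickly with a permission error.
client = bigquery.Client(project="projectA")
rows = client.query("SELECT COUNT(*) AS n FROM `projectB.the_datasetB.the_tableB`").result()
print(list(rows))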