Azure linked service using python - azure-data-factory

I have created a linked service in Azure Data Factory using the Azure portal. I want to connect to this linked service from a notebook activity in Synapse using Python. Is there an API for this?
Please let me know.
Thanks

As per your requirements, you can copy data directly from a Synapse notebook to Azure Blob Storage using Python itself, without creating a linked service in Data Factory.
I created an Azure Blob Storage account, created a new container, and generated a SAS for the storage account.
I created a DataFrame in the Synapse notebook using Python with the code below:
import pandas as pd
data = {
    'Name': ['ABC', 'CDE', 'BBA', 'DSA'],
    'Age': [25, 30, 35, 40],
    'Gender': ['F', 'M', 'M', 'M']
}
df = pd.DataFrame(data)
I want to save the DataFrame as a CSV file in Azure Blob Storage, so I converted it to CSV format using the code below:
csv_data = df.to_csv(index=False)
I copied the connection string of the Blob Storage account, then uploaded the data to Blob Storage with the code below:
from azure.storage.blob import BlobServiceClient

connection_string = "<connection string>"
container_name = "<container name>"
blob_name = "<filename>.csv"

# Connect to the storage account and upload the CSV data as a blob
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob_name)
blob_client.upload_blob(csv_data, overwrite=True)
It executed successfully, and the file was created in the Blob Storage container.
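To verify the upload, the same blob client can read the CSV back into pandas. This is a minimal sketch, assuming the blob_client and the pandas import from the code above:
import io

# Download the blob content (bytes) and decode it back into a pandas DataFrame
downloaded = blob_client.download_blob().readall()
df_check = pd.read_csv(io.BytesIO(downloaded))
print(df_check.head())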

Related

Databricks load file from path which contains equals (=) sign

I'm looking to export Azure Monitor data from Log Analytics to a storage account and then read the JSON files into Databricks using PySpark.
The blob path for the Log Analytics export contains an equals (=) sign, and Databricks throws an exception when using the path.
WorkspaceResourceId=/subscriptions/subscription-id/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/<workspace>/y=<four-digit numeric year>/m=<two-digit numeric month>/d=<two-digit numeric day>/h=<two-digit 24-hour clock hour>/m=<two-digit 60-minute clock minute>/PT05M.json
Log Analytics Data Export
Is there a way to escape the equals sign so that the JSON files can be loaded from the blob location?
I tried a similar use case following the Microsoft documentation; below are the steps:
Mount the storage container. This can be done with the Python code below; make sure you pass all the parameters correctly, because incorrect parameters lead to a variety of errors.
dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point = "/mnt/<mount-name>",
    extra_configs = {"<conf-key>": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
Below is a description of each parameter (a filled-in example follows the list):
<storage-account-name> is the name of your Azure Blob storage account.
<container-name> is the name of a container in your Azure Blob storage account.
<mount-name> is a DBFS path representing where the Blob storage container or a folder inside the container (specified in source) will be mounted in DBFS.
<conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.windows.net or fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as a secret in a secret scope.
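For illustration, here is the same mount call with the placeholders filled in. The storage account, container, mount name, and secret scope below are hypothetical and only show how the pieces map onto each other:
# Hypothetical names, used only to illustrate the parameter mapping
dbutils.fs.mount(
    source = "wasbs://loganalytics-export@mystorageacct.blob.core.windows.net",
    mount_point = "/mnt/loganalytics",
    extra_configs = {
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope = "my-scope", key = "storage-account-key")})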
Then you can access those files as below:
df = spark.read.text("/mnt/<mount-name>/...")
df = spark.read.text("dbfs:/<mount-name>/...")
There are also multiple other ways of accessing the file, all of which are described clearly in the doc.
Also check the Log Analytics workspace doc to understand how exporting data to Azure Storage works.

Read a zip file in databricks from Azure Storage Explorer

I want to read zip files that contain CSV files. I have tried many ways but have not succeeded. In my case, the path from which I should read the file is shown in Azure Storage Explorer.
For example, when I have to read a csv in databricks I use the following code:
dfDemandaBilletesCmbinad = spark.read.csv("/mnt/data/myCSVfile.csv", header=True)
So the Azure Storage path I want is "/mnt/data/myZipFile.zip", which contains some CSV files.
Is it possible to read csv files coming from Azure storage via pySpark in databricks?
I think the only way to do this is with pandas, openpyxl, and Python's zipfile library, as there is no equivalent library for PySpark.
import pandas as pd
import openpyxl, zipfile

# Unzip and extract to a folder. It might be better to unzip in memory with io.BytesIO.
with zipfile.ZipFile('/dbfs/mnt/data/file.zip', 'r') as zip_ref:
    zip_ref.extractall('/dbfs/mnt/data/unzipped')

# Read the Excel workbook
my_excel = openpyxl.load_workbook('/dbfs/mnt/data/unzipped/file.xlsx')
ws = my_excel['worksheet1']

# Create a pandas DataFrame from the worksheet values
df = pd.DataFrame(ws.values)

# Create a Spark DataFrame from the pandas DataFrame
spark_df = spark.createDataFrame(df)
The problem is that this is only executed on the driver VM of the cluster.
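The unzip step itself does run only on the driver, but once the archive is extracted into DBFS the files sit on shared storage, so Spark can read them from every worker. A minimal sketch, assuming the archive actually contains CSV files (as described in the question) and the unzipped folder used above:
# Spark reads the extracted files from mounted storage, not from the driver's local disk
spark_df = spark.read.csv("/mnt/data/unzipped/", header=True)
spark_df.show(5)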
Please keep in mind that Azure Storage Explorer does not store any data. It is a tool that lets you access your Azure storage account from any device and on any platform; the data is always stored in an Azure storage account.
In your scenario, it appears that your Azure storage account is already mounted to a Databricks DBFS path. Since it is mounted, you can use the spark.read command to access the file directly from the Azure storage account.
Sample: df = spark.read.text("dbfs:/mymount/my_file.txt")
Reference: https://docs.databricks.com/data/databricks-file-system.html
Regarding ZIP files, please refer to
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html

Find the last modified timestamp of files/folders in Azure Data Lake through a Python script in Azure Databricks that uses credential passthrough

I have an Azure DataLake Storage Gen2 which contains a few Parquet files. My Organization has enabled credential passthrough and so I am able to create a python script in Azure Databricks and access the files available in ADLS using dbutils.fs.ls. All these work fine.
Now, I need to access the last modified timestamp of these files too. I found a link that does this. However, it uses BlockBlobService and requires an account_key.
I do not have an account key and can't get one due to security policies of the organization. I am unsure of how to do the same using Credential passthrough. Any ideas here?
You can try to mount the Azure Data Lake Storage Gen2 container with credential passthrough:
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
mount_name = 'localmountname'
container_name = 'containername'
storage_account_name = 'datalakestoragename'

dbutils.fs.mount(
    source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
    mount_point = f"/mnt/{mount_name}",
    extra_configs = configs)
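Once the mount succeeds, the files become visible under the mount point and can be listed with the same dbutils.fs.ls call you are already using, for example:
# List the mounted container; each entry is a FileInfo object
files = dbutils.fs.ls(f"/mnt/{mount_name}")
for f in files:
    print(f.path)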
You can do this using the Hadoop FileSystem object accessible via Spark:
import time

# Use the Hadoop FileSystem API exposed through the Spark JVM gateway
path = spark._jvm.org.apache.hadoop.fs.Path
fs = path('abfss://container@storageaccount.dfs.core.windows.net/').getFileSystem(sc._jsc.hadoopConfiguration())
res = fs.listFiles(path('abfss://container@storageaccount.dfs.core.windows.net/path'), True)
while res.hasNext():
    file = res.next()
    localTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(file.getModificationTime() / 1000))
    print(f"{file.getPath()}: {localTime}")
Note that the True parameter in the listFiles() method means recursive.
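If you need the results as a table rather than printed output, the same loop can collect (path, timestamp) pairs into a list first. A minimal sketch, reusing the fs and path objects from above and building a pandas DataFrame from the list:
import time
import pandas as pd

# Collect (path, last modified) pairs instead of printing them
rows = []
res = fs.listFiles(path('abfss://container@storageaccount.dfs.core.windows.net/path'), True)
while res.hasNext():
    file = res.next()
    rows.append((str(file.getPath()),
                 time.strftime("%Y-%m-%d %H:%M:%S",
                               time.localtime(file.getModificationTime() / 1000))))
timestamps_df = pd.DataFrame(rows, columns=["path", "last_modified"])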

Load model from Google Cloud Storage without downloading

Is there a way to serve a model from Google Cloud Storage without actually downloading a copy of the model, like streaming the data directly?
I'm trying to load a fastText model that is hosted on Google Cloud Storage. Every time I run the program, it needs to fetch and download a copy of that model from the bucket.
from google.cloud import storage

storage_client = storage.Client()

language_model_filename = 'lid.176.bin'   # filename in GCS
language_model_local = 'lid.176.bin'      # local file name when downloaded

bucket = storage_client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob(language_model_filename)
blob.download_to_filename(language_model_local)
language_model = FastText.load_model(language_model_local)
You can use Streaming Transfers for that purpose. As explained in the documentation, you can use the third-party boto client library plugin for Cloud Storage.
A streaming download example would look like this:
import sys
import boto

downloaded_file = 'saved_data_file'
MY_BUCKET = 'my_app_bucket'
object_name = 'data_file'

# Stream the object's contents to stdout instead of saving it to disk
src_uri = boto.storage_uri(MY_BUCKET + '/' + object_name, 'gs')
src_uri.get_key().get_file(sys.stdout)
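If you prefer to stay with the google-cloud-storage client from the question, the blob can also be streamed into an in-memory buffer instead of onto the local disk. A minimal sketch, reusing the bucket constant from the question; whether the model loader can then consume an in-memory buffer depends on which fastText library you are using:
import io
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob('lid.176.bin')

# Stream the object into memory rather than writing a local file
buffer = io.BytesIO()
blob.download_to_file(buffer)
buffer.seek(0)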

Read data stored in zip file in Google Cloud Storage from Notebook in Google Cloud Datalab

I have a relatively large dataset (1 GB) stored in a zip file in Google Cloud Storage.
I need to use a notebook hosted in Google Cloud Datalab to access that file and the data it contains. How do I go about this?
Thank you.
Can you try the following?
import pandas as pd
# Path to the object in Google Cloud Storage that you want to copy
sample_gcs_object = 'gs://path-to-gcs/Hello.txt.zip'
# Copy the file from Google Cloud Storage to Datalab
!gsutil cp $sample_gcs_object 'Hello.txt.zip'
# Unzip the file
!unzip 'Hello.txt.zip'
# Read the file into a pandas DataFrame
pandas_dataframe = pd.read_csv('Hello.txt')
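As an alternative, pandas can read the zipped file directly, and can even read straight from the gs:// path if the gcsfs package is installed in the Datalab kernel (an assumption). Both forms below assume the archive contains a single delimited text file:
import pandas as pd

# Read the local copy without unzipping first
pandas_dataframe = pd.read_csv('Hello.txt.zip', compression='zip')

# Or skip the gsutil copy entirely (requires gcsfs)
pandas_dataframe = pd.read_csv('gs://path-to-gcs/Hello.txt.zip', compression='zip')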