How to read a csv file from a "File Share" in an ADLS Gen2 Datalake inside Databricks using pyspark

I have an ADLS Gen2 data lake with "Blob Containers" and "File Shares". I have mounted the blob containers in my Databricks notebook, so I can read everything in them inside my Databricks notebooks.
I also have some files in the "File Share", but I am not able to read these files into a dataframe through Databricks using pyspark.
I have created a Shared Access Signature for the File Share and I have the URL for one of the files inside the share as well. That URL works fine through Postman: I can download the file using it.
The sample url is shown below:
https://somedatalakename.file.core.windows.net/file_share_name/Data_20200330_1030.csv?sv=yyyy-mm-dd&si=somename&sr=s&sig=somerandomsignature%3D
How can I read the same csv, which is inside this file share, into a dataframe through Databricks using pyspark?
I also tried
from pyspark import SparkFiles
spark.sparkContext.addFile(uri)
call_df = spark.read.format("csv").option("header", "true").load("file://" + SparkFiles.get("Data_" + date_str + "_1030.csv"))
And I get the below error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/local_disk0/spark-ce42ed1b-5d82-4559-9000-d1bf3621539e/userFiles-eaf0fd36-68aa-409e-8610-a7909635b006/Data_20200330_1030.csv
Please give me some pointers on how to solve this problem. Thanks.

The problem is with your load syntax. file: does not work in Databricks, so you need to replace it with dbfs:, i.e. the Databricks File System.
Command to load the file:
spark.read.format("csv").option("header", "true").load("dbfs:/path/to/your/directory/FileName.csv")
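If the File Share itself is not mounted, another possible workaround (a rough sketch, not from the original answer: it assumes the requests library is available on the cluster, and the SAS URL and paths are placeholders) is to download the file to DBFS on the driver and then load it with the dbfs: scheme:
import requests
# Placeholder SAS URL for the file in the File Share (same shape as the one in the question).
sas_url = "https://somedatalakename.file.core.windows.net/file_share_name/Data_20200330_1030.csv?<sas-token>"
resp = requests.get(sas_url)
resp.raise_for_status()
# /dbfs/... is the driver-local (FUSE) view of DBFS.
with open("/dbfs/tmp/Data_20200330_1030.csv", "wb") as f:
    f.write(resp.content)
call_df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/Data_20200330_1030.csv")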

Related

Unzipping a zip file through ADF is giving a special character

We receive a zip file over SFTP and are trying to unzip it through ADF. After unzipping, the file contains a special character, as shown below.
Actual data
|"QLD Mackay"|
After unzipping through ADF
|"QLD |"56ay"|
But when we try to unzip it manually, we do not get this issue.
Can someone help with this issue, please?
Make sure your data does not have any unknown characters in it. I reproduced the scenario with sample data and was able to unzip the file without any issues.
Example:
I used a binary dataset for both the source and the sink to unzip the zip file with the Azure Data Factory Copy data activity.
In the source dataset, select compression type ZipDeflate.
Connect the sink to a sink dataset with compression type None. In the sink settings, select copy behavior Preserve hierarchy to preserve the source filename.
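If you want to inspect the extracted file for unexpected bytes yourself, a small sketch like the one below (not part of the original answer; the path is a placeholder) flags lines containing bytes outside the printable ASCII range:
# Report lines of the extracted file that contain bytes outside printable ASCII
# (tab/CR/LF are allowed); the path is a placeholder.
allowed = set(b"\t\r\n") | set(range(0x20, 0x7F))
with open("/path/to/extracted/file.csv", "rb") as f:
    for line_no, line in enumerate(f, start=1):
        bad = sorted(set(line) - allowed)
        if bad:
            print(line_no, [hex(b) for b in bad])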

Reading a JSON file from Azure Data Lake as a file using json.load in Azure Databricks/Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to give the filename and use open(file_name) to load the data. When I provide a data lake path, it throws an error that the file path is not found. I am able to read the data into dataframes, but how can I read the file from the data lake without converting it to a dataframe, i.e. open it as a file?
Current approach on my local machine, which works:
import json
f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the data lake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
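As a rough sketch of the kind of mount that documentation describes (assuming a service principal; the container, storage account, secret scope, and tenant values are placeholders):
# Mount an ADLS Gen2 container to DBFS using OAuth with a service principal.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)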
The open function works only with local files; it does not understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as @ARCrow mentioned, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you are able to read the file into a dataframe, then the cluster already has all the necessary settings for accessing the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from cloud storage to the local disk and then open it with the open function. Something like this:
import json
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)

Extra Blob Created after Sink in Data Flow

I'm importing from Snowflake to Azure Blob Storage using a data flow activity in Azure Data Factory.
I noticed that whenever I create a blob through the sink (placed inside the provider/Inbound/ folder), I get an extra empty blob file outside Inbound.
Does this happen for every data flow that sinks to blob storage?
I created a data flow that loads data from Snowflake to blob storage and I do not see any additional blob file generated outside my sink folder.
Make sure the sink connection points to the correct folder, and also double-check whether any process other than this data flow is running that could be creating the extra file outside the sink folder.
(The original answer included screenshots of the Snowflake source, the sink and its dataset, the output file path, the sink setting that adds a date to the filename, the output folder, and the output file generated after executing the data flow.)

Read a zip file in databricks from Azure Storage Explorer

I want to read zip files that contain csv files. I have tried many ways but have not succeeded. In my case, the path I need to read the file from is one I see in Azure Storage Explorer.
For example, when I have to read a csv in databricks I use the following code:
dfDemandaBilletesCmbinad = spark.read.csv("/mnt/data/myCSVfile.csv", header=True)
So the Azure Storage path that I want is "/mnt/data/myZipFile.zip", which contains some csv files.
Is it possible to read csv files coming from Azure Storage via pySpark in Databricks?
I think the only way to do this is with pandas, openpyxl and the zipfile library for Python, as there is no similar library for pySpark.
import pandas as pd
import openpyxl, zipfile
# Unzip and extract the file. It might be better to unzip in memory (e.g. with io.BytesIO).
with zipfile.ZipFile('/dbfs/mnt/data/file.zip', 'r') as zip_ref:
    zip_ref.extractall('/dbfs/mnt/data/unzipped')
# Read the Excel workbook
my_excel = openpyxl.load_workbook('/dbfs/mnt/data/unzipped/file.xlsx')
ws = my_excel['worksheet1']
# create pandas dataframe
df = pd.DataFrame(ws.values)
# create spark dataframe
spark_df = spark.createDataFrame(df)
The problem is that this only runs on the driver VM of the cluster.
Please keep in mind that Azure Storage Explorer does not store any data. It is a tool that lets you access your Azure storage account from any device and on any platform; the data is always stored in an Azure storage account.
In your scenario, it appears that your Azure storage account is already mounted to a Databricks DBFS file path. Since it is mounted, you can use the spark.read command to access the file directly from the Azure storage account.
Sample: df = spark.read.text("dbfs:/mymount/my_file.txt")
Reference: https://docs.databricks.com/data/databricks-file-system.html
And regarding the ZIP file, please refer to:
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
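For the ZIP part, a minimal sketch of one way to do it (assuming the same /mnt/data mount as in the question; the folder names are placeholders) is to unzip on the driver with Python's zipfile module and then read the extracted CSVs with Spark:
import zipfile
# The mount is visible on the driver's local filesystem under /dbfs.
with zipfile.ZipFile("/dbfs/mnt/data/myZipFile.zip", "r") as zf:
    zf.extractall("/dbfs/mnt/data/unzipped")
# Spark reads the extracted csv files through the DBFS path.
df = spark.read.csv("/mnt/data/unzipped/", header=True)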

Read data stored in zip file in Google Cloud Storage from Notebook in Google Cloud Datalab

I have a zip file containing a relatively large dataset (1 GB) stored in Google Cloud Storage.
I need to use a notebook hosted in Google Cloud Datalab to access that file and the data it contains. How do I go about this?
Thank you.
Can you try the following?
import pandas as pd
# Path to the object in Google Cloud Storage that you want to copy
sample_gcs_object = 'gs://path-to-gcs/Hello.txt.zip'
# Copy the file from Google Cloud Storage to Datalab
!gsutil cp $sample_gcs_object 'Hello.txt.zip'
# Unzip the file
!unzip 'Hello.txt.zip'
# Read the file into a pandas DataFrame
pandas_dataframe = pd.read_csv('Hello.txt')
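As a possible shortcut (my own addition, not from the original answer): if the archive contains exactly one file, pandas can read it straight from the zip and skip the explicit unzip step:
# pandas can decompress a single-member zip archive on the fly.
pandas_dataframe = pd.read_csv('Hello.txt.zip', compression='zip')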