access azure files using azure databricks pyspark

I am trying to access a file with an .rds extension. I am using the code below, however it is not helping.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.rds?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its URL with SAS token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)

I created a storage account, created a file share, and uploaded the .rds file into the file share. I then generated a SAS key in the storage account.
I installed the Azure file share client library in Databricks using
pip install azure-storage-file
and installed the pyreadr package to load the .rds file using
pip install pyreadr
I tried to load the .rds file in Databricks using
from azure.storage.file import FilePermissions, FileService
from datetime import datetime, timedelta
import pyreadr
from urllib.request import urlopen

url_sas_token = "<File Service SAS URL>"

# download the file content via the SAS URL
response = urlopen(url_sas_token)
content = response.read()

# save it to a local file, then read it with pyreadr
with open('counties.rds', 'wb') as fhandle:
    fhandle.write(content)

result = pyreadr.read_r("counties.rds")
print(result)
In the above code I provided the File Service SAS URL as url_sas_token. The code loaded the .rds file data successfully. In this way I accessed an .rds file stored in an Azure file share from Databricks.
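If, as in the question, the goal is to end up with a PySpark dataframe, the pyreadr result can be converted afterwards. pyreadr.read_r returns an OrderedDict of pandas dataframes, and for an .rds file the single object is stored under the None key, so a minimal sketch (reusing the counties.rds file downloaded above) would be:
# pyreadr returns an OrderedDict of pandas dataframes;
# for an .rds file the single dataframe is stored under the key None
pdf = result[None]
# convert the pandas dataframe to a PySpark dataframe in Databricks
df = spark.createDataFrame(pdf)
display(df)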

Related

Exporting firebase data to json

I'm trying to export a particular Firestore collection using gcloud.
Right now, I did:
gcloud storage buckets create gs://my-bucket
gcloud firestore export gs://my-bucket --collection-ids=my-collection
gcloud storage cp -r gs://my-bucket/2022-09-22T09:20:16_11252 my-directory
which results in some content in my-directory/all_namespaces/.../output-1. The output-1 file definitely seems to contain some relevant data but it is not very readable. Hence some questions:
Which file format is used?
Can I export to JSON (or CSV or XML) directly?
Can I convert the current file to JSON, CSV or XML?
And, related:
Why is output-0 empty?
Firestore does not support exporting existing data directly to a readable file, but it does have a managed export and import service that lets you dump your data into a GCS bucket. It produces the same format that Cloud Datastore uses, which means you can then load it into BigQuery.
Also, as mentioned in the comment by Dazwilkin above, the output of a managed export uses the LevelDB log format.
Surprisingly, there don't seem to be many LevelDB tools available, so exporting to the LevelDB format is not convenient.
I managed to export to CSV by adding two extra steps: loading into and extracting from BigQuery. So I now do something like:
# create bucket
gcloud storage buckets create gs://my-bucket
# firestore export
gcloud firestore export gs://my-bucket/my-prefix --collection-ids=my-collection
# create dataset
bq mk --dataset my-dataset
# load bucket into BigQuery
bq load --replace --source_format=DATASTORE_BACKUP my-dataset.input \
gs://my-bucket/my-prefix/all_namespaces/.../....export_metadata
# export BigQuery as csv to bucket
bq extract --compression GZIP 'my-dataset.input' gs://my-bucket/results.csv
# download csv file
gcloud storage cp -r gs://my-bucket/results.csv <local-dir>
Yes, you can export Firebase data as JSON. Follow this article for exporting the data from Firebase: https://support.google.com/firebase/answer/6386780?hl=en#zippy=%2Cin-this-article. I hope this helps.
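If you want JSON straight from Firestore without going through the managed export, another option is to dump the collection with the Firestore client library. This is only a minimal sketch, assuming Application Default Credentials are configured and that my-collection is the collection from the question; non-JSON-serializable field types such as timestamps are simply stringified:
import json
from google.cloud import firestore

# connect using Application Default Credentials
db = firestore.Client()

# stream every document in the collection into a plain dict keyed by document ID
docs = {doc.id: doc.to_dict() for doc in db.collection("my-collection").stream()}

# write the collection out as JSON; default=str stringifies timestamps and references
with open("my-collection.json", "w") as f:
    json.dump(docs, f, indent=2, default=str)
Note that this only exports the top-level documents of one collection; subcollections would need their own traversal.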

How to read tables from a synapse database using pyspark

I am a newbie to Azure Synapse and I have to work in an Azure Spark notebook. One of my colleagues connected the on-prem database using an Azure linked service. Now I have written a test framework for comparing the on-prem data and the data lake (curated) data, but I don't understand how to read those tables using PySpark.
Here is my linked service data structure, along with my linked service names and database name.
You can read any file stored in a Synapse linked location as a table by using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
First, read the file that you want to expose as a table in Synapse. Use the code below to read the file.
%%pyspark
df = spark.read.load('abfss://sampleadls2@sampleadls1.dfs.core.windows.net/business.csv', format='csv', header=True)
Then convert this dataframe into a table using the code below:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS business")
df.write.mode("overwrite").saveAsTable("business.data")
Now you can run any Spark SQL command on this table as shown below:
%%pyspark
data = spark.sql("SELECT * FROM business.data")
display(data)
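If the tables you need to compare already live in a dedicated SQL pool in the same workspace, the Dedicated SQL Pool Connector mentioned above can also read them directly into a Spark dataframe. A minimal sketch, assuming a dedicated pool named sqlpool1 with a table dbo.history (both names are placeholders):
%%pyspark
# read a dedicated SQL pool table through the Azure Synapse Dedicated SQL Pool Connector
df_history = spark.read.synapsesql("sqlpool1.dbo.history")
display(df_history)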

Error when trying to import with CSV file format in Cloud SQL

HTTPError 400: Unknow export file type was thrown when I tried to import a CSV file from my Cloud Storage bucket into my Cloud SQL database. Any idea what I missed?
Reference:
gcloud sql import csv
CSV import is not supported for the SQL Server edition of Cloud SQL. As mentioned here,
In Cloud SQL, SQL Server currently supports importing databases using
SQL and BAK files.
However, it is supported for the MySQL and PostgreSQL versions of Cloud SQL.
You could apply one of the following solutions:
Change the database engine to either PostgreSQL or MySQL (where CSV files are supported).
If the data in your CSV file came from an on-premises SQL Server DB table, you can create a SQL file from it and then use that to import into Cloud SQL for SQL Server.

How to upload text file to FTP from Databricks notebook

I tried to find a solution but found nothing. I am new to this, so please help me if you know the solution.
Thanks!
Ok, I found a solution.
# copy a file from ADLS to an FTP server over TLS (FTP_TLS is FTPS, not SFTP)
from ftplib import FTP_TLS
from azure.datalake.store import core, lib

keyVaultName = "yourkeyvault"
# the Key Vault above backs a Databricks secret scope holding the ADLS credentials

# set up authentication for ADLS
tenant_id = dbutils.secrets.get(scope = keyVaultName, key = "tenantId")
username = dbutils.secrets.get(scope = keyVaultName, key = "appRegID")
password = dbutils.secrets.get(scope = keyVaultName, key = "appRegSecret")
store_name = 'ADLSStoridge'
token = lib.auth(tenant_id = tenant_id, client_id = username, client_secret = password)
adl = core.AzureDLFileSystem(token, store_name=store_name)

# create a secure connection to the FTP server
ftp = FTP_TLS('ftp.xyz.com')
# add credentials
ftp.login(user='', passwd='')
ftp.prot_p()
# set the target directory on the FTP server
ftp.cwd('folder path on FTP')
# open the file in ADLS
f = adl.open('adls path of your file')
# upload it to the FTP server
ftp.storbinary('STOR myfile.csv', f)
# clean up
f.close()
ftp.quit()
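Note that FTP_TLS gives you FTPS (FTP over a TLS-encrypted connection), not SFTP, even though the two are often mixed up. If the target server actually speaks SFTP, a separate library such as paramiko is needed. A minimal sketch under that assumption (host, port, credentials and paths are placeholders, and paramiko must be installed on the cluster):
import paramiko

# open an SSH transport to the SFTP server and authenticate
transport = paramiko.Transport(('sftp.xyz.com', 22))
transport.connect(username='myuser', password='mypassword')
sftp = paramiko.SFTPClient.from_transport(transport)

# adl.open returns a file-like object, so it can be streamed straight to the server
f = adl.open('adls path of your file')
sftp.putfo(f, '/upload/myfile.csv')
f.close()

sftp.close()
transport.close()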
In Databricks, you can access files stored in ADLS using any one of the methods described below.
There are three ways of accessing Azure Data Lake Storage Gen2:
Mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0.
Use a service principal directly.
Use the Azure Data Lake Storage Gen2 storage account access key directly.
Steps to mount and access the files in your filesystem as if they were local files:
To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside a container, use the following command:
Syntax:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<password>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point = "/mnt/flightdata",
    extra_configs = configs)
After mounting the ADLS filesystem, you can access the files as if they were local, for example:
df = spark.read.csv("/mnt/flightdata/flightdata.csv", header="true")
display(df)
Reference: Databricks - Azure Data Lake Storage Gen2.
Hope this helps.
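The third option listed above, using the storage account access key directly, avoids mounting altogether. A minimal sketch along those lines (the secret scope and key names, container and file path are placeholders):
# set the ADLS Gen2 account key for this Spark session (ideally retrieved from a secret scope)
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key-name>"))
# read directly from the abfss path without a mount
df = spark.read.csv(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/flightdata.csv",
    header="true")
display(df)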
I have found a workaround for accessing the files outside Databricks (using SFTP software like WinSCP/FileZilla).
These are the steps I followed:
I installed the AWS CLI in the Databricks cluster's terminal, following this documentation: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
I ran this command in my Databricks notebook:
!aws s3api put-object-acl --bucket S3-bucket-name --key path/to-s3/file.txt --acl bucket-owner-full-control
By doing these two steps I was able to access the file outside Databricks.

AWS Glue Error | Not able to read Glue tables from Developer Endpoints using Spark

I am not able to access AWS Glue tables even though I have given all the required IAM permissions. I can't even list the databases. Here is the code.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# New recommendation from AWS Support 2018-03-22
# sc is the SparkContext that the Glue dev endpoint pre-creates
newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
sc.stop()
sc = sc.getOrCreate(newconf)
# End AWS Support Workaround
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
The error occurs here, while accessing one of the Glue tables:
datasource_history_1 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "history", transformation_ctx = "datasource_history_1")
I also tried to list the databases, but I can only see the default one, not the ones I have created in Glue.
I tried to refer to the link below, but it still did not help me.
Unable to run scripts properly in AWS Glue PySpark Dev Endpoint
You seem to have taken your code straight from this question, braj: Unable to run scripts properly in AWS Glue PySpark Dev Endpoint. But that code is specific to my AWS Glue environment, and the tables I'm referencing won't exist in your environment.
For this command to work:
datasource_history_1 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "history", transformation_ctx = "datasource_history_1")
Check your own Glue Catalog (https://eu-west-1.console.aws.amazon.com/glue/home) and ensure you have a table called history inside a database called dev. If you don't, then I'm not sure what behaviour you expect to see from this code.
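You can also check what the endpoint actually sees in the Data Catalog from the notebook itself. A minimal sketch using boto3, assuming the dev endpoint's IAM role has glue:GetDatabases and glue:GetTables and that the region shown is a placeholder:
import boto3

# list the databases visible to this role in the Glue Data Catalog
glue = boto3.client("glue", region_name="eu-west-1")
databases = [db["Name"] for db in glue.get_databases()["DatabaseList"]]
print(databases)

# if "dev" is there, list its tables and check for "history"
if "dev" in databases:
    tables = [t["Name"] for t in glue.get_tables(DatabaseName="dev")["TableList"]]
    print(tables)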
Instead of starting from a script taken from someone else's Stack Overflow answer, I suggest you create a Job in Glue and let it generate the source connection code for you first. Use that as your starting point. It will generate the create_dynamic_frame.from_catalog command for you in that script.