How to upload text file to FTP from Databricks notebook - pyspark

I tried to find a solution but found nothing. I am new to this, so please help me if you know the solution.
Thanks!

Ok, I found a solution.
#copy a file from ADLS to an FTP server over TLS (FTPS)
from ftplib import FTP_TLS
from azure.datalake.store import core, lib, multithread
import pandas as pd
keyVaultName = "yourkeyvault"
#the key vault needs to be configured with the ADLS credentials
#set up authentication for ADLS
tenant_id = dbutils.secrets.get(scope = keyVaultName, key = "tenantId")
username = dbutils.secrets.get(scope = keyVaultName, key = "appRegID")
password = dbutils.secrets.get(scope = keyVaultName, key = "appRegSecret")
store_name = 'ADLSStoridge'
token = lib.auth(tenant_id = tenant_id, client_id = username, client_secret = password)
adl = core.AzureDLFileSystem(token, store_name=store_name)
#create a secure connection (explicit FTP over TLS)
ftp = FTP_TLS('ftp.xyz.com')
#add credentials
ftp.login(user='', passwd='')
ftp.prot_p()
#change to the target directory on the FTP server
ftp.cwd('folder path on FTP')
#open the file on ADLS
f = adl.open('adls path of your file')
#stream the file to the FTP server
ftp.storbinary('STOR myfile.csv', f)
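To finish the transfer cleanly you can check that the file arrived and then close the handles. A short follow-up sketch, assuming the ftp session and the f file handle from the snippet above are still open:
#optional: list the remote directory; 'myfile.csv' should appear
print(ftp.nlst())
#close the ADLS file handle and end the FTPS session
f.close()
ftp.quit()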

In Databricks, you can access files stored in ADLS using any one of the methods described below.
There are three ways of accessing Azure Data Lake Storage Gen2:
Mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0.
Use a service principal directly.
Use the Azure Data Lake Storage Gen2 storage account access key directly.
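For example, the third option (using the storage account access key directly) only needs a Spark configuration setting before you read; a minimal sketch, where the account name, secret scope, and container name are placeholders:
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>"))
df = spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1/flightdata.csv", header="true")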
Steps to mount and access the files in your filesystem as if they were local files:
To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside a container, use the following command:
Syntax:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<appId>",
"fs.azure.account.oauth2.client.secret": "<password>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<container-name>#<storage-account-name>.dfs.core.windows.net/folder1",
mount_point = "/mnt/flightdata",
extra_configs = configs)
After mounting the ADLS filesystem, you can access the files as if they were local, for example:
df = spark.read.csv("/mnt/flightdata/flightdata.csv", header="true")
display(df)
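When the mount is no longer needed, it can be removed with a one-liner (using the same mount point as above):
dbutils.fs.unmount("/mnt/flightdata")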
Reference: Databricks - Azure Data Lake Storage Gen2.
Hope this helps.

I have found a workaround for accessing the files outside Databricks (using SFTP software like WinSCP/FileZilla).
These are the steps I followed:
I installed the AWS CLI in the Databricks cluster's terminal, following this documentation: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
I ran this command in my Databricks notebook:
!aws s3api put-object-acl --bucket S3-bucket-name --key path/to-s3/file.txt --acl bucket-owner-full-control
By doing these two steps I was able to access the file outside Databricks.
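If you prefer to stay in Python rather than shell out to the CLI, the same ACL change can be made with boto3; a minimal sketch, assuming AWS credentials are already configured on the cluster and reusing the placeholder bucket and key names from above:
import boto3
s3 = boto3.client("s3")
#grant the bucket owner full control over the object, same as the CLI call above
s3.put_object_acl(
    Bucket="S3-bucket-name",
    Key="path/to-s3/file.txt",
    ACL="bucket-owner-full-control")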


access azure files using azure databricks pyspark

I am trying to access a file with the .rds extension. I am using the below code, however it is not helping.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.rds?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
I created a storage account, created a file share, and uploaded the .rds file into the file share.
I generated a SAS key in the storage account.
I installed the Azure file share library in Databricks using
pip install azure-storage-file
I installed the pyreadr package to load the .rds file using
pip install pyreadr
I tried to load the .rds file in Databricks using
from azure.storage.file import FilePermissions, FileService
from datetime import datetime, timedelta
import pyreadr
from urllib.request import urlopen
url_sas_token = "<File Service SAS URL>"
#download the file content via the SAS URL
response = urlopen(url_sas_token)
content = response.read()
#write the downloaded bytes to a local .rds file
fhandle = open('counties.rds', 'wb')
fhandle.write(content)
fhandle.close()
#read the local .rds file with pyreadr
result = pyreadr.read_r("counties.rds")
print(result)
In the above code I have given the File Service SAS URL as url_sas_token.
The above code loaded the .rds file data successfully.
In this way I accessed an .rds file stored in an Azure file share from Databricks.
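Since the question is about Databricks/pyspark, the pyreadr result can also be turned into a Spark DataFrame. A short sketch, assuming the .rds file holds a single data frame (pyreadr keys .rds objects under None):
result = pyreadr.read_r("counties.rds")
#pyreadr returns a dict of pandas DataFrames; for .rds files the key is None
pdf = result[None]
#convert the pandas DataFrame to a Spark DataFrame
df = spark.createDataFrame(pdf)
display(df)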

Upload changed files only to az blob container folder

Using an Azure DevOps pipeline, I am currently deleting all files in the container and then uploading everything again. Instead, I want to upload only the files changed since the last run (like an incremental backup), using an az storage blob cmdlet or any other cmdlet except azcopy. Can anyone guide me on how to do so?
Since you want to use an az storage blob command to upload files to the Azure Storage account blob container, I suggest you use az storage blob sync.
For example:
az storage blob sync -c mycontainer -s "path/to/file" -d NewBlob
In an Azure DevOps pipeline, you can use the Azure CLI task to run the command.
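If the Python SDK is an acceptable alternative to the CLI, here is a rough sketch of the same incremental idea: it uploads a local file only when it is newer than the blob's last_modified timestamp. The connection string, container name, and folder path are placeholders.
import os
from azure.storage.blob import ContainerClient
from azure.core.exceptions import ResourceNotFoundError
container = ContainerClient.from_connection_string("<connection-string>", "mycontainer")
local_root = "path/to/folder"
for root, _, files in os.walk(local_root):
    for name in files:
        local_path = os.path.join(root, name)
        blob_name = os.path.relpath(local_path, local_root).replace(os.sep, "/")
        blob = container.get_blob_client(blob_name)
        try:
            remote_mtime = blob.get_blob_properties().last_modified.timestamp()
        except ResourceNotFoundError:
            remote_mtime = 0  # blob does not exist yet, so upload it
        if os.path.getmtime(local_path) > remote_mtime:
            with open(local_path, "rb") as data:
                blob.upload_blob(data, overwrite=True)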

Google Cloud SQL Restore BAK file

I am new to Google Cloud. I created a Cloud SQL instance and I need to restore the data from a .bak file. I have the .bak file in a GCS bucket, and I am trying to restore it using Microsoft Management Studio -> Task -> Restore, but I'm not able to access the file.
Can anyone help me with the procedure on how to restore from a .bak file?
You need to give the Cloud SQL service account access to the bucket where the file is saved.
On Cloud Shell run the following:
gcloud sql instances describe [INSTANCE_NAME]
In the output, search for the field "serviceAccountEmailAddress" and copy the service account email.
Then, again on Cloud Shell, run the following:
gsutil iam ch serviceAccount:<SERVICE_ACCOUNT_EMAIL>:legacyBucketWriter gs://<BUCKET_NAME>
gsutil iam ch serviceAccount:<SERVICE_ACCOUNT_EMAIL>:objectViewer gs://<BUCKET_NAME>
That should give the service account permission to access the bucket and retrieve the file. Also, here is the guide on doing the import; keep in mind that doing the import will overwrite all the data in the database.
Also remember that:
You cannot import a database that was exported from a higher version of SQL Server. For example, if you exported a SQL Server 2017 Enterprise version, you cannot import it into a SQL Server 2017 Standard version.
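For the import step itself, a small Python sketch that shells out to gcloud is below; the instance name, bucket path, and database name are placeholders, and the underlying command is gcloud sql import bak:
import subprocess
#trigger the import of the .bak file from the bucket into the target database
subprocess.run([
    "gcloud", "sql", "import", "bak", "<INSTANCE_NAME>",
    "gs://<BUCKET_NAME>/<FILE_NAME>.bak",
    "--database=<DATABASE_NAME>",
], check=True)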

Connecting to DocumentDB from AWS Glue

For a current ETL job, I am trying to create a Python Shell job in Glue. The transformed data needs to be persisted in DocumentDB, but I am unable to access DocumentDB from Glue.
Since the DocumentDB cluster resides in a VPC, I thought of creating an interface VPC endpoint to access DocumentDB from Glue, but DocumentDB is not one of the supported services for interface endpoints. I see tunneling as a suggested option, but I do not want to do that.
So, I want to know whether there is a way to connect to DocumentDB from Glue.
Create a dummy JDBC connection in AWS Glue. You will not need to run a test connection, but this will allow ENIs to be created in the VPC. Attach this connection to your Python shell job. This will allow you to interact with your resources.
Have you tried using the MongoDB connection type in Glue connections? You can connect to DocumentDB through that option.
I have been able to connect DocumentDB with Glue and ingest data using a CSV in S3; here's the script to do that:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
# Constants
data_catalog_database = 'sample-db'
data_catalog_table = 'data'
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
job = Job(glue_context)
job.init(args['JOB_NAME'], args)
# Read from the data source registered in the Glue Data Catalog
## @type: DataSource
## @args: [database = "glue-gzip", table_name = "glue_gzip"]
## @return: dynamic_frame
## @inputs: []
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=data_catalog_database,
    table_name=data_catalog_table
)
documentdb_write_uri = 'mongodb://yourdocumentdbcluster.amazonaws.com:27017'
write_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###"
}
# Write the DynamicFrame to DocumentDB (MongoDB-compatible writer)
glue_context.write_dynamic_frame.from_options(dynamic_frame, connection_type="documentdb",
                                              connection_options=write_documentdb_options)
In summary:
Create a crawler that infers the schema of your data (e.g. the CSV files in an S3 bucket) and creates a Data Catalog table.
Use that database and table to ingest the data into your DocumentDB.
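Reading the data back from DocumentDB works the same way in reverse; a short sketch, assuming the same glue_context and connection details as in the script above:
read_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###"
}
#build a DynamicFrame from the DocumentDB collection
dynamic_frame_read = glue_context.create_dynamic_frame.from_options(
    connection_type="documentdb",
    connection_options=read_documentdb_options)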

How can I backup an Azure DB to local disk?

I have tried using the Export Data-tier Application option, but I get errors in all outputs:
Exporting Database
Extracting Schema
Extracting Schema from database
and there are no details about why it failed.
Is there a way I can back it up on Azure and copy the bacpac file, or do it with PowerShell?
You can use SqlPackage to create a backup (export a bacpac) of your Azure SQL Database on your local drive.
sqlpackage.exe /Action:Export /ssn:tcp:sqlftpbackupserver.database.windows.net /sdn:sqlftpbackupdb /su:alberto /tf:c:\sql\sqlftpbackup.bacpac /sp:yourpwd /p:Storage=File
In the above example, we are exporting (Export) from an Azure SQL server named sqlftpbackupserver.database.windows.net, where the source database name is sqlftpbackupdb and the source user is alberto. The target file where we will export is c:\sql\sqlftpbackup.bacpac. /sp specifies the password of the Azure SQL user, and /p:Storage=File tells SqlPackage to back the schema model with a file instead of memory.
Another example is:
SqlPackage /Action:Export /SourceServerName:SampleSQLServer.sample.net,1433
/SourceDatabaseName:SampleDatabase /TargetFile:"F:\Temp\SampleDatabase.bacpac"
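To restore the exported .bacpac into a local SQL Server instance you can use SqlPackage's Import action; a rough sketch wrapped in Python so it can be scripted, where the file path, server, and database names are placeholders:
import subprocess
#import the exported .bacpac into a local SQL Server instance
subprocess.run([
    "SqlPackage", "/Action:Import",
    "/SourceFile:C:\\sql\\sqlftpbackup.bacpac",
    "/TargetServerName:localhost",
    "/TargetDatabaseName:sqlftpbackup_restored",
], check=True)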
You can also try backing up the database to Blob Storage in the Azure Portal, then downloading it to your local disk.
For PowerShell, here's a command example to back up a SQL Azure database to local disk only:
C:\PS> Backup-Database -Name "database1" -DownloadLocation "D:\temp" -Server "mydatabaseserver" -UserName "username" -Password "password" -Verbose
For more details, reference Backup SQL Azure Database.
Here's a tutorial, How to backup Azure SQL Database to Local Machine, that covers almost all the ways to back up an Azure SQL Database to local disk.
Hope this helps.