Confusion in the dbutils.fs.ls() command output. Please suggest - pyspark

When I use the below command in Azure Databricks
display(dbutils.fs.ls("/mnt/MLRExtract/excel_v1.xlsx"))
My output is coming as wasbs://paycnt#sdvstr01.blob.core.windows.net/mnt/MLRExtract/excel_v1.xlsx
not as expected-- dbfs://mnt/MLRExtract/excel_v1.xlsx
Please suggest

Mounting a storage account to Databricks File System allows users to access them any number of times without any credentials. Any files or directories can be accessed from Databricks clusters using these mount points. The procedure you used allows you to mount blob storage container to DBFS.
So, you can access your blob storage container from DBFS using the mount point. The method dbutils.fs.ls(<mount_point>) displays all the files and directories available in that mount point. It is not necessary to provide path of a file, instead simply use:
display(dbutils.fs.ls(“/mnt/MLRExtract/”))
The above command returns all the files available in the mount point (which is your blob storage container). You can perform all the required operations and then write to this DBFS, which will be reflected in your blob storage container too.
Refer to the following link to understand more about Databricks file system.
https://docs.databricks.com/data/databricks-file-system.html

Related

How to access different storage accounts with same container name in databricks notebooks

I have 2 different storage accounts with same container name. Lets say tenant1 and tenant2 as storage account name with "appdata" as container name in both accounts. I can create and mount both containers to dbfs. But i am unable to read/write dynamically by passing storage account names to the mount point code. since dbfs has mnt/containername as mount point in dbfs, only latest or previously passed storage account's mount point is being referred in databricks. How to achieve my goal here?
Mount points should be static, so you just need to have two different mount points pointing to the correct container, something like this:
/mnt/storage1_appdata
/mnt/storage2_appdata
so if you want your code be dynamic, use the f"/mnt/{storage_name}_appdata".
It's not recommended to dynamically remount containers - you can get cryptic errors when you remount mount point while somebody is reading/writing data using it.
Also, you can access ADLS directly if you specify correct configuration for your cluster/job (see doc) - you can even access both containers at the same time, just need to setup configuration for both storage accounts:
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net",
"OAuth")
spark.conf.set(
"fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(
"fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net",
"<application-id>")
spark.conf.set(
"fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"))
spark.conf.set(
"fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")

RMAN backup into Google Cloud Storage

I want to take Oracle database backup using RMAN directly into the Google Cloud Storage
I am unable to find the plugin to use to take the RMAN backups into Cloud Storage. We have a plugin for Amazon S3 and am looking for one such related to Google Cloud Storage.
I don't believe there's an official way of doing this. Although I did file a Feature Request for the Cloud Storage engineering team to look into that you can find here.
I recommend you to star the Feature Request, for easy visibility and access, allowing you to view its status updates. The Cloud Storage team might ask questions there too.
You can use gcsfuse to mount GCS bucket as file systems on your machine and use RMAN to create backups there.
You can find more information about gcsfuse on its github page. Here are the basic steps to mount a bucket and run RMAN:
Create a bucket oracle_bucket. Check that it doesn't have a retention policy defined on it (it looks like gcsfuse has some issues with retention policies).
Please have a look at mounting.md that describes credentials for GCS. For example, I created a service account with Storage Admin role and created a JSON key for it.
Next, set up credentials for gcsfuse on your machine. In my case, I set GOOGLE_APPLICATION_CREDENTIALS to the path to JSON key from step 1. Run:
sudo su - oracle
mkdir ./mnt_bucket
gcsfuse --dir-mode 755 --file-mode 777 --implicit-dirs --debug_fuse oracle_bucket ./mnt_bucket
From gcsfuse docs:
Important: You should run gcsfuse as the user who will be using the
file system, not as root. Do not use sudo.
Configure RMAN to create a backup in mnt_bucket. For example:
configure controlfile autobackup format for device type disk to '/home/oracle/mnt_bucket/%F';
configure channel device type disk format '/home/oracle/mnt_bucket/%U';
After you run backup database you'll see a backup files created in your GCS bucket.

How to copy yarn ssh logs automatically using scala to blob storage

We have a requirement to download the yarn ssh logs to blob storage automatically. I found that the yarn logs does get added to storage account under /app-logs/user/logs/ etc path but they are in a binary format and there is no documented way to convert these into text format. So we are trying to run the external command yarn logs -application <application_id> using scala at the end of our application run to capture the logs and save them to the blob storage but facing issues with that. Looking for a solution to get these logs automatically downloaded to storage account as part of the spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I ssh into the head node of the spark cluster and run them. But they are not working when executed from jupyter notebook or scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands using jupyter notebook, the first command works fine to redirect to a local file but the second one to move the file to blob fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a Dataframe and writing the dataframe to blob. It succeeded for small logs but for huge logs it failed with the error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As regards your requirement, there are two methods to achieve this;
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy and no manual copy or coding required.
However, because the yarn applications logs are stored in TFile binary format and can only be read using ‘yarn logs’ command, it means that you need to have another tool application to read the file when from the destination later on. You can use the tool here to read the files https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts it to a readable file before the transfer. Sample script below
**
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob #sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
**
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the Yarn Application logs, this means the logs will be retained indefinitely. To do this, change the config “yarn.log-aggregation.retain-seconds” to value -1. This config can be found in yarn-site.xml.
Once this is done, you can always read your Yarn Applications logs anytime from the cluster using the Yarn UI or CLI.
Hope this helps

How to mount all bucket in Google Cloud Storage

From document https://cloud.google.com/storage/docs/gcs-fuse, when use FUSE GCS there is only mound predefine bucket to the specified path.
In case I have many buckets, show how I can mount all buckets as root directory can access or create any bucket as the directory?
gcsfuse doesn't support mounting multiple buckets in a single process, nor does it support creating buckets for you. You'll need to create buckets in the usual way outside gcsfuse, then run one gcsfuse process per desired bucket.

mount google cloud storage bucket but cache locally

I would like to know if there is a way to mount google cloud storage bucket as a folder for the first time
and each time we read the file, cache it locally (so it won't use money/bandwidth).
GCSFUSE has two type of caching available, Stat caching and type caching. You can refer to this document which provide detailed information on these types of caching with there trade-offs.