How to mount all buckets in Google Cloud Storage - gcsfuse

From the documentation at https://cloud.google.com/storage/docs/gcs-fuse, it appears that Cloud Storage FUSE can only mount a single, predefined bucket to a specified path.
I have many buckets. How can I mount all of them under one root directory, so that I can access any bucket, or create a new bucket, as a subdirectory?

gcsfuse doesn't support mounting multiple buckets in a single process, nor does it support creating buckets for you. You'll need to create buckets in the usual way outside gcsfuse, then run one gcsfuse process per desired bucket.
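For example, here is a minimal sketch in Python (assuming the google-cloud-storage client library and gcsfuse are installed, and that your credentials can list the project's buckets; the mount root is a placeholder) that starts one gcsfuse process per bucket, each under its own subdirectory:

import os
import subprocess
from google.cloud import storage  # pip install google-cloud-storage

MOUNT_ROOT = "/mnt/gcs"  # hypothetical root directory for all mounts

client = storage.Client()
for bucket in client.list_buckets():
    mount_point = os.path.join(MOUNT_ROOT, bucket.name)
    os.makedirs(mount_point, exist_ok=True)
    # One gcsfuse process per bucket, mounted at its own directory.
    subprocess.run(["gcsfuse", bucket.name, mount_point], check=True)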

Related

Confusion in the dbutils.fs.ls() command output. Please suggest

When I use the below command in Azure Databricks
display(dbutils.fs.ls("/mnt/MLRExtract/excel_v1.xlsx"))
My output comes back as wasbs://paycnt#sdvstr01.blob.core.windows.net/mnt/MLRExtract/excel_v1.xlsx
rather than the expected dbfs://mnt/MLRExtract/excel_v1.xlsx.
Please suggest.
Mounting a storage account to the Databricks File System (DBFS) lets users access its data any number of times without supplying credentials each time. Any files or directories in the container can be accessed from Databricks clusters through the mount point. The procedure you used mounts a blob storage container to DBFS.
So, you can access your blob storage container from DBFS using the mount point. The method dbutils.fs.ls(<mount_point>) lists all the files and directories available under that mount point. It is not necessary to pass the path of a single file; instead, simply use:
display(dbutils.fs.ls("/mnt/MLRExtract/"))
The above command returns all the files available in the mount point (which is your blob storage container). You can perform all the required operations and then write to this DBFS path, and the changes will be reflected in your blob storage container too.
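For completeness, a minimal sketch of the mounting step itself (the container, storage account, scope, and key names below are placeholders, and the exact config key depends on how you authenticate):

dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/MLRExtract",
    extra_configs={
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key-name>")
    }
)

# List the mount point (a directory), not a single file path.
display(dbutils.fs.ls("/mnt/MLRExtract/"))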
Refer to the following link to understand more about Databricks file system.
https://docs.databricks.com/data/databricks-file-system.html

How to access different storage accounts with the same container name in Databricks notebooks

I have 2 different storage accounts with the same container name. Let's say the storage accounts are named tenant1 and tenant2, each with a container named "appdata". I can create and mount both containers to DBFS, but I am unable to read/write dynamically by passing the storage account name to the mount-point code: since DBFS uses /mnt/<container-name> as the mount point, only the most recently mounted storage account is referenced in Databricks. How can I achieve my goal here?
Mount points should be static, so you just need to have two different mount points pointing to the correct container, something like this:
/mnt/storage1_appdata
/mnt/storage2_appdata
So if you want your code to be dynamic, build the path with an f-string: f"/mnt/{storage_name}_appdata".
It's not recommended to remount containers dynamically - you can get cryptic errors if you remount a mount point while somebody is reading or writing data through it.
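For instance, a minimal sketch of the dynamic-path approach (the function name and the parquet paths below are illustrative):

def read_appdata(storage_name: str, relative_path: str):
    # Each storage account keeps its own static mount point, e.g. /mnt/tenant1_appdata.
    mount_point = f"/mnt/{storage_name}_appdata"
    return spark.read.parquet(f"{mount_point}/{relative_path}")

df1 = read_appdata("tenant1", "some/table")  # hypothetical path
df2 = read_appdata("tenant2", "some/table")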
Also, you can access ADLS directly if you specify the correct configuration for your cluster/job (see the docs) - you can even access both containers at the same time; you just need to set up the configuration for both storage accounts:
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net",
"OAuth")
spark.conf.set(
"fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(
"fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net",
"<application-id>")
spark.conf.set(
"fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"))
spark.conf.set(
"fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")

Is there a way to figure out in which region a Google Cloud Storage bucket is hosted?

NCBI (the National Center for Biotech Info) generously provided their data for 3rd parties to consume. The data is located in cloud buckets such as gs://sra-pub-run-1/. I would like to read this data without incurring additional costs, which I believe can be achieved by reading it from the same region the bucket is hosted in. Unfortunately, I can't figure out which region the bucket is hosted in (NCBI mentions in their docs that it's in the US, but not where in the US). So my questions are:
Is there a way to figure out in which region a bucket that I don't own, like gs://sra-pub-run-1/, is hosted?
Is my understanding correct that reading the data from instances in the same region is free of charge? What if the GCS bucket is multi-region?
Doing a simple gsutil ls -b -L either provides no information (when listing a specific directory within sra-pub-run-1), or I get a permission-denied error if I try to list info on gs://sra-pub-run-1/ directly using:
gsutil -u metagraph ls -b gs://sra-pub-run-1/
You cannot specify a specific Compute Engine zone as a bucket location, but all Compute Engine VM instances in zones within a given region have similar performance when accessing buckets in that region.
Billing-wise, egressing data from Cloud Storage into a Compute Engine instance in the same location/region (for example, US-EAST1 to US-EAST1) is free, regardless of zone.
So, check the "Location constraint" of the GCS bucket (gsutil ls -Lb gs://bucketname ), and if it says "US-EAST1", and if your GCE instance is also in US-EAST1, downloading data from that GCS bucket will not incur an egress fee.

RMAN backup into Google Cloud Storage

I want to take an Oracle database backup using RMAN directly into Google Cloud Storage.
I am unable to find a plugin for writing RMAN backups to Cloud Storage. There is a plugin for Amazon S3, and I am looking for something similar for Google Cloud Storage.
I don't believe there's an official way of doing this, although I did file a Feature Request for the Cloud Storage engineering team to look into it, which you can find here.
I recommend you star the Feature Request for easy visibility and access; that lets you follow its status updates, and the Cloud Storage team might ask questions there too.
You can use gcsfuse to mount a GCS bucket as a file system on your machine and point RMAN at it to create backups there.
You can find more information about gcsfuse on its GitHub page. Here are the basic steps to mount a bucket and run RMAN:
Create a bucket oracle_bucket. Check that it doesn't have a retention policy defined on it (it looks like gcsfuse has some issues with retention policies).
Please have a look at mounting.md, which describes credentials for GCS. For example, I created a service account with the Storage Admin role and created a JSON key for it.
Next, set up credentials for gcsfuse on your machine. In my case, I set GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON key from the previous step. Run:
sudo su - oracle
mkdir ./mnt_bucket
gcsfuse --dir-mode 755 --file-mode 777 --implicit-dirs --debug_fuse oracle_bucket ./mnt_bucket
From gcsfuse docs:
Important: You should run gcsfuse as the user who will be using the
file system, not as root. Do not use sudo.
Configure RMAN to create a backup in mnt_bucket. For example:
configure controlfile autobackup format for device type disk to '/home/oracle/mnt_bucket/%F';
configure channel device type disk format '/home/oracle/mnt_bucket/%U';
After you run backup database, you'll see the backup files created in your GCS bucket.
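To confirm that the backup pieces landed in the bucket, you can list it, for example with the Python client (assuming the same oracle_bucket and suitable credentials):

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
for blob in client.list_blobs("oracle_bucket"):
    print(blob.name, blob.size)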

mount google cloud storage bucket but cache locally

I would like to know if there is a way to mount a Google Cloud Storage bucket as a folder so that each file is fetched from the bucket the first time it's read
and then cached locally for subsequent reads (so it won't cost money/bandwidth).
gcsfuse has two types of caching available: stat caching and type caching. You can refer to this document, which provides detailed information on these types of caching and their trade-offs.
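As a rough sketch, both metadata caches can be tuned at mount time (the bucket and mount-point names below are placeholders, and flag names can differ between gcsfuse versions); note that stat and type caching only cache metadata, not file contents:

import subprocess

# --stat-cache-ttl / --type-cache-ttl control how long object metadata is cached.
subprocess.run(
    ["gcsfuse",
     "--stat-cache-ttl", "1h",
     "--type-cache-ttl", "1h",
     "my-bucket", "/mnt/my-bucket"],
    check=True,
)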