How to access different storage accounts with the same container name in Databricks notebooks - PySpark

I have 2 different storage accounts with the same container name. Let's say the storage accounts are tenant1 and tenant2, each with a container named "appdata". I can create and mount both containers to DBFS. But I am unable to read/write dynamically by passing the storage account name to the mount-point code: since DBFS uses /mnt/<containername> as the mount point, only the latest (previously passed) storage account's mount point is referenced in Databricks. How can I achieve my goal here?

Mount points should be static, so you just need to have two different mount points pointing to the correct container, something like this:
/mnt/storage1_appdata
/mnt/storage2_appdata
So if you want your code to be dynamic, build the path with f"/mnt/{storage_name}_appdata".
It's not recommended to remount containers dynamically - you can get cryptic errors if you remount a mount point while somebody is reading/writing data through it.
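For example, a minimal sketch of that pattern, assuming the two static mounts above already exist and the data is stored as parquet (the helper name read_appdata is just an illustration):

def read_appdata(storage_name):
    # Pick the static mount that belongs to this storage account,
    # e.g. /mnt/storage1_appdata or /mnt/storage2_appdata.
    path = f"/mnt/{storage_name}_appdata"
    return spark.read.parquet(path)

df1 = read_appdata("storage1")
df2 = read_appdata("storage2")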
Also, you can access ADLS directly if you specify the correct configuration for your cluster/job (see the docs) - you can even access both containers at the same time; you just need to set up the configuration for both storage accounts:
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net",
"OAuth")
spark.conf.set(
"fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(
"fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net",
"<application-id>")
spark.conf.set(
"fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"))
spark.conf.set(
"fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")

Related

Confusion in the dbutils.fs.ls() command output. Please suggest

When I use the below command in Azure Databricks
display(dbutils.fs.ls("/mnt/MLRExtract/excel_v1.xlsx"))
My output is coming as wasbs://paycnt@sdvstr01.blob.core.windows.net/mnt/MLRExtract/excel_v1.xlsx
and not as expected: dbfs://mnt/MLRExtract/excel_v1.xlsx
Please suggest
Mounting a storage account to the Databricks File System (DBFS) allows users to access it any number of times without providing credentials each time. Any files or directories under the mount can be accessed from Databricks clusters using these mount points. The procedure you used mounts a blob storage container to DBFS.
So, you can access your blob storage container from DBFS using the mount point. The method dbutils.fs.ls(<mount_point>) lists all the files and directories available under that mount point. It is not necessary to provide the path of a single file; instead, simply use:
display(dbutils.fs.ls("/mnt/MLRExtract/"))
The above command returns all the files available under the mount point (which is your blob storage container). You can perform all the required operations and then write back through DBFS, and the changes will be reflected in your blob storage container too.
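To make this concrete, here is a short sketch: list the mount first, then read the file. The /dbfs local path and the pandas/openpyxl route are assumptions about the cluster setup, not part of the original answer.

# List everything under the mount point (a directory, not a single file).
for f in dbutils.fs.ls("/mnt/MLRExtract/"):
    print(f.path, f.size)

# The same mount is also visible on the driver's local filesystem under /dbfs,
# so pandas can open the Excel file directly (assumes openpyxl is installed on the cluster).
import pandas as pd
pdf = pd.read_excel("/dbfs/mnt/MLRExtract/excel_v1.xlsx")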
Refer to the following link to understand more about Databricks file system.
https://docs.databricks.com/data/databricks-file-system.html

Folder level access control inside containers in ADLS

I can see storage-service-level and container-level access control, but is it possible to assign folder-level access control for users and service principals?
No, there isn't folder-level access control in ADLS. We can't assign it for users or service principals.
Azure Data Lake only supports service-, container-, and object-level access control.
HTH.

Mount a shared volume to a Kubernetes cluster so that all users can access the same storage and share files

I am following Zero to JupyterHub with Kubernetes to create a JupyterHub environment for my team to use.
I am using Google Kubernetes Engine, and every user gets his/her own storage where files are stored - this setup works fine.
I am having trouble figuring out how to create a volume or shared database so that everyone in the team can see each other's notebooks and share files and data.
To explain more, in the desired setup, when a user signs in and goes to his/her Jupyter image, every user sees the same "shared" folder; users can create individual folders for themselves inside it but are also able to reuse code that someone else has already written.
I looked into NFS with Filestore, but that seems very expensive.
As per the documentation, gcePersistentDisk does not support multiple read/write mounts.
There is an alternative solution for the problem: Rook is a storage backend with various storage provisioners available through it. One of them is Ceph, which has a shared filesystem solution on Kubernetes.

Copying directories into minikube and persisting them

I am trying to copy some directories into the minikube VM to be used by some of the pods that are running. These include API credential files and template files used at run time by the application. I have found you can copy files using scp into the /home/docker/ directory; however, these files are not persisted over reboots of the VM. I have read that files/directories are persisted if stored in the /data/ directory on the VM (among others); however, I get permission denied when trying to copy files to these directories.
Are there:
A: Any directories in minikube that will persist data that aren't protected in this way
B: Any other ways of doing the above without running into this issue (could well be going about this the wrong way)
To clarify, I have already been able to mount the files from /home/docker/ into the pods using volumes, so it's just the persisting data I'm unclear about.
Kubernetes has dedicated object types for these sorts of things. API credential files you might store in a Secret, and template files (if they aren't already built into your Docker image) could go into a ConfigMap. Both of them can either get translated to environment variables or mounted as artificial volumes in running containers.
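For instance, from the application's point of view inside the pod, consuming them is ordinary environment-variable and file access. A minimal Python sketch (the variable name API_TOKEN and the mount path /etc/templates are hypothetical examples, not from the question):

import os

# Secret exposed as an environment variable in the pod spec (hypothetical name).
api_token = os.environ.get("API_TOKEN")

# ConfigMap mounted as a volume: the template shows up as a plain file
# (hypothetical mount path and file name).
with open("/etc/templates/report.tmpl") as f:
    template = f.read()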
In my experience, trying to store data directly on a node isn't a good practice. It's common enough to have multiple nodes, to not directly have login access to those nodes, and for them to be created and destroyed outside of your direct control (imagine an autoscaler running on a cloud provider that creates a new node when all of the existing nodes are 90% scheduled). There's a good chance your data won't (or can't) be on the host where you expect it.
This does lead to a proliferation of Kubernetes objects and associated resources, and you might find a Helm chart to be a good resource to tie them together. You can check the chart into source control along with your application, and deploy the whole thing in one shot. While it has a couple of useful features beyond just packaging resources together (a deploy-time configuration system, a templating language for the Kubernetes YAML itself) you can ignore these if you don't need them and just write a bunch of YAML files and a small control file.
For minikube, data kept in the $HOME/.minikube/files directory is copied by minikube into the / directory of the VM.

Can the operating system use EBS transparently?

I understand we can attach and detach a volume to an instance dynamically. My question is: will the OS allocate these physical resources automatically, or does it have to be configured by the user, i.e., create a mount point for the file system and explicitly tell the application where the mount point is?
I use this CloudFormation template to deploy MongoDB to AWS. The template gives users the option to specify the volume size to host the database server. I just wonder, even if I allocate the physical resource, how can the template use it? How can I know which volume the data resides on? When I try to detach one of the volumes from the instance, things just break, but I am sure I do not need so many volumes to host the data.
Yes, you need to do that manually. As soon as you create an EBS volume and attach it to an instance, you need to follow these steps (on Linux systems):
- Check if the volume is attached and get its name:
lsblk
- Format the newly attached volume:
mkfs -t ext4 /dev/<volume_name>
- Create a mount point:
mkdir mount_point
- Mount the volume to the mount point:
mount /dev/<volume_name> mount_point
- Verify the newly attached partition:
df -hT
I can't see your CloudFormation template.