Cannot specify location of bucket when creating a gcloud storage bucket - gcloud

The following code, I would think, would create a bucket in the us-west region, but in my Google console the location is listed as multi-regional.
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.create_bucket(bucket_name)
bucket.location = 'us-west2-a'

The problem in your code sample is that you specified 'us-west2-a', which is a zone name, instead of putting 'us-west2', which is the region (location) name.
from google.cloud import storage

storage_client = storage.Client()
# The location must be set when the bucket is created; it cannot be changed
# afterwards, so pass it to create_bucket directly.
bucket = storage_client.create_bucket(bucket_name, location='us-west2')
By passing 'us-west2' as the location at creation time (a bucket's location cannot be changed after it is created), the bucket should be created in the desired region.
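If you want to confirm where the bucket ended up, you can read its metadata back with the same client. A minimal sketch, assuming bucket_name is defined as above; the location_type property is available in recent versions of google-cloud-storage:
from google.cloud import storage

storage_client = storage.Client()
# Fetch the bucket metadata; location_type distinguishes 'region' from 'multi-region'
bucket = storage_client.get_bucket(bucket_name)
print(bucket.location)
print(bucket.location_type)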
References:
Storage - Location sample code
Google Cloud Locations

Azure linked service using python

I have created a linked service in Azure Data Factory using the Azure portal. I want to connect to this linked service from a notebook activity in Synapse using Python. Is there an API for this?
Thanks
As per your requirements, you can copy data from a Synapse notebook directly to Azure Blob Storage using Python itself, without creating a linked service in Data Factory.
I created an Azure Blob Storage account, created a new container, and generated a SAS for the storage account.
I created a DataFrame in a Synapse notebook using the Python code below:
import pandas as pd

data = {
    'Name': ['ABC', 'CDE', 'BBA', 'DSA'],
    'Age': [25, 30, 35, 40],
    'Gender': ['F', 'M', 'M', 'M']
}
df = pd.DataFrame(data)
I want to save the DataFrame as a CSV file in Azure Blob Storage, so I converted it to CSV format using the code below:
csv_data = df.to_csv(index=False)
I copied the connection string of the storage account and uploaded the data to Blob Storage using the code below:
from azure.storage.blob import BlobServiceClient

connection_string = "<connection string>"
container_name = "<container name>"
blob_name = "<filename>.csv"

# Connect to the storage account and upload the CSV data to the container
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob_name)
blob_client.upload_blob(csv_data, overwrite=True)
It executed successfully, and the file was created in the Blob Storage container.
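If you later need to read that CSV back into a DataFrame from the notebook, a minimal sketch, assuming the same connection string, container name and blob name as above, could look like this:
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Download the blob into memory and parse it back into a DataFrame
csv_bytes = blob_client.download_blob().readall()
df_back = pd.read_csv(io.BytesIO(csv_bytes))
print(df_back.head())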

Find the last modified timestamp of files/folders in Azure Data Lake through a Python script in Azure Databricks that uses credential passthrough

I have an Azure Data Lake Storage Gen2 account which contains a few Parquet files. My organization has enabled credential passthrough, so I am able to create a Python script in Azure Databricks and access the files available in ADLS using dbutils.fs.ls. All of this works fine.
Now, I need to access the last modified timestamp of these files too. I found a link that does this. However, it uses BlockBlobService and requires an account_key.
I do not have an account key and can't get one due to the security policies of the organization. I am unsure how to do the same using credential passthrough. Any ideas here?
You can try to mount the Azure Data Lake Storage Gen2 container with credential passthrough.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
mount_name = 'localmountname'
container_name = 'containername'
storage_account_name = 'datalakestoragename'
dbutils.fs.mount(
    source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
    mount_point = f"/mnt/{mount_name}",
    extra_configs = configs)
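Once the container is mounted, you can list it with dbutils.fs.ls; on recent Databricks Runtime versions each FileInfo it returns also carries a modificationTime field in epoch milliseconds (this is an assumption about your runtime version). A minimal sketch using the mount name above:
import time

# On recent Databricks Runtime versions each FileInfo returned by dbutils.fs.ls
# carries a modificationTime field (epoch milliseconds).
for f in dbutils.fs.ls(f"/mnt/{mount_name}"):
    local_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(f.modificationTime / 1000))
    print(f"{f.path}: {local_time}")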
You can do this using the Hadoop FileSystem object accessible via Spark:
import time

# Get a Hadoop FileSystem handle for the ADLS Gen2 container
path = spark._jvm.org.apache.hadoop.fs.Path
fs = path('abfss://container@storageaccount.dfs.core.windows.net/').getFileSystem(sc._jsc.hadoopConfiguration())

# Recursively list the files and print each path with its last-modified time
res = fs.listFiles(path('abfss://container@storageaccount.dfs.core.windows.net/path'), True)
while res.hasNext():
    file = res.next()
    localTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(file.getModificationTime() / 1000))
    print(f"{file.getPath()}: {localTime}")
Note that the True parameter in the listFiles() call means recursive.

Load model from Google Cloud Storage without downloading

Is there a way to serve a model from Google Cloud Storage without actually downloading a copy of the model, like streaming the data directly?
I'm trying to load a fastText model that is hosted on Google Cloud Storage. Every time I run the program, it needs to fetch and download a copy of the model from the bucket.
language_model_filename = 'lid.176.bin'  # filename in GCS
language_model_local = 'lid.176.bin'     # local file name when downloaded

bucket = storage_client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob(language_model_filename)
blob.download_to_filename(language_model_local)
language_model = FastText.load_model(language_model_local)
You can use Streaming Transfers for that purpose. As explained in the documentation, you can use the third-party boto client library plugin for Cloud Storage.
A streaming download example would look like this:
import sys
import boto

downloaded_file = 'saved_data_file'
MY_BUCKET = 'my_app_bucket'
object_name = 'data_file'

# Stream the object's contents directly to stdout instead of saving a local file
src_uri = boto.storage_uri(MY_BUCKET + '/' + object_name, 'gs')
src_uri.get_key().get_file(sys.stdout)
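Alternatively, newer releases of the google-cloud-storage client expose a file-like streaming read through Blob.open(), which avoids writing a local copy as long as the consumer accepts a file object (fastText's load_model still expects a path, so this only helps for libraries that can read from a stream). A minimal sketch, assuming the bucket and object names from the question:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob(language_model_filename)

# Stream the object without writing a local copy; useful for consumers
# that accept a file-like object rather than a filename.
with blob.open("rb") as f:
    chunk = f.read(1024)
    print(len(chunk))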

Creating bucket in Google Cloud Storage in custom location

I would like to create a bucket in GCS based in Europe using the python client.
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# The name for the new bucket
bucket_name = 'my-new-bucket'

# Creates the new bucket
bucket = storage_client.create_bucket(bucket_name)

print('Bucket {} created.'.format(bucket.name))
This creates a multi-regional bucket in the US. How can I change this to Europe?
The create_bucket method is limited. For more parameters, you'd create a bucket resource and invoke its create() method, like so:
storage_client = storage.Client()
bucket = storage_client.bucket('bucket-name')
bucket.create(location='EU')
Bucket.create accepts a few other parameters and is documented here: https://googleapis.github.io/google-cloud-python/latest/storage/buckets.html#google.cloud.storage.bucket.Bucket.create
You can try with this:
from google.cloud import storage

def create_bucket(bucket_name):
    storage_client = storage.Client()
    bucket = storage_client.create_bucket(bucket_name, location='EUROPE-WEST1')
    print("Bucket {} created".format(bucket.name))

Read data stored in zip file in Google Cloud Storage from Notebook in Google Cloud Datalab

I have a relatively large dataset (1 GB) stored in a zip file in a Google Cloud Storage bucket.
I need to use a notebook hosted in Google Cloud Datalab to access that file and the data it contains. How do I go about this?
Thank you.
Can you try the following?
import pandas as pd
# Path to the object in Google Cloud Storage that you want to copy
sample_gcs_object = 'gs://path-to-gcs/Hello.txt.zip'
# Copy the file from Google Cloud Storage to Datalab
!gsutil cp $sample_gcs_object 'Hello.txt.zip'
# Unzip the file
!unzip 'Hello.txt.zip'
# Read the file into a pandas DataFrame
pandas_dataframe = pd.read_csv('Hello.txt')
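If you prefer to stay in pure Python rather than shelling out to gsutil and unzip, a minimal sketch using the google-cloud-storage client together with the standard zipfile module could also work. It assumes the same gs:// path as above, that the archive contains a single CSV member, and that holding the archive in memory is acceptable; download_as_bytes is the newer name (older client versions use download_as_string):
import io
import zipfile
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('path-to-gcs')  # bucket name taken from the gs:// URI above
blob = bucket.blob('Hello.txt.zip')

# Download the archive into memory and read its first member as CSV
archive = zipfile.ZipFile(io.BytesIO(blob.download_as_bytes()))
with archive.open(archive.namelist()[0]) as member:
    pandas_dataframe = pd.read_csv(member)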