Load model from Google Cloud Storage without downloading - google-cloud-storage

Is there a way to serve a model from Google Cloud Storage without actually downloading a copy of it, like streaming the data directly?
I'm trying to load a fastText model that is hosted on Google Cloud Storage. Every time I run the program, it has to fetch and download a copy of the model from the bucket.
from google.cloud import storage

language_model_filename = 'lid.176.bin'  # object name in GCS
language_model_local = 'lid.176.bin'     # local file name when downloaded

storage_client = storage.Client()
bucket = storage_client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob(language_model_filename)
blob.download_to_filename(language_model_local)
language_model = FastText.load_model(language_model_local)

You can use Streaming Transfers for that purpose. As explained in the documentation, you can use the third-party boto client library plugin for Cloud Storage.
A streaming download example would look like this:
import sys
import boto

downloaded_file = 'saved_data_file'  # destination name if you write to a local file instead of stdout
MY_BUCKET = 'my_app_bucket'
object_name = 'data_file'
src_uri = boto.storage_uri(MY_BUCKET + '/' + object_name, 'gs')
src_uri.get_key().get_file(sys.stdout)  # streams the object's bytes to stdout
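If you would rather stay on the official google-cloud-storage client, here is a rough sketch of the same idea (assuming a recent client version, which provides Blob.open for streamed reads; whether your model loader accepts a file-like object instead of a path depends on the library):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob('lid.176.bin')

# Stream the object's bytes without writing a local copy
with blob.open('rb') as f:
    model_bytes = f.read()  # or read in chunks, or hand the file object to a loader that accepts one
As far as I know, fastText's load_model expects a filesystem path, so for that particular library you may still need to write the bytes to a (temporary) file.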

Related

Reading Json file from Azure datalake as a file using Json.load in Azure databricks /Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My current approach is to pass a file name to open() and load the data from it. When I provide a Data Lake path instead, it throws a "file path not found" error. I am able to read the data into DataFrames, but how can I read the file from the Data Lake without converting it to a DataFrame, i.e. open it as a plain file?
Current code approach on my local machine, which works:
import json
f = open('File_Name.Json')
data = json.load(f)
Failing scenario when providing the Data Lake path:
f = open('Datalake path/File_Name.Json')
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
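A minimal sketch of such a mount in a Databricks notebook (the container, storage account, and credential configs are placeholders; see the linked documentation for the exact extra_configs for your authentication method):
import json

# Mount an ADLS Gen2 container at /mnt/datalake (placeholder names and configs)
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,  # auth settings built per the documentation linked above
)

# Once mounted, plain Python can open the file through the /dbfs FUSE path
with open("/dbfs/mnt/datalake/File_Name.Json") as f:
    data = json.load(f)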
The open function works only with local files; it doesn't understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as @ARCrow mentioned, that would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a DataFrame, the cluster already has all the necessary settings for accessing the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from cloud storage to the local disk, and then open it with the open function. Something like this:
import json

dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)

Is there a way to use the data from Google Cloud Storage directly in Colab?

I want to use a dataset (170+ GB) in Google Colab. I have two questions:
Since the available space in Colab is about 66 GB, is there a way to use the data directly from GCS in Colab, if the data is hosted in GCS? If not, what is a possible solution?
How can I upload the dataset to GCS directly from a downloadable link, since I cannot wget it into Colab due to the limited available space?
Any help is appreciated.
Authenticate:
from google.colab import auth
auth.authenticate_user()
Install the Google Cloud SDK:
!curl https://sdk.cloud.google.com | bash
Initialize the SDK to configure the project settings:
!gcloud init
1. Download a file from Cloud Storage to Google Colab:
!gsutil cp gs://your-bucket/your-file.csv .
2. Upload a file from Google Colab to Cloud Storage:
!gsutil cp your-file.csv gs://your-bucket/
Hope it helps.
I have a working example that uses tf.io.gfile.copy (doc).
import tensorflow as tf
# Getting file names based on patterns
gcs_pattern = 'gs://flowers-public/tfrecords-jpeg-331x331/*.tfrec'
filenames = tf.io.gfile.glob(gcs_pattern)
#['gs://flowers-public/tfrecords-jpeg-331x331/flowers00-230.tfrec',
# 'gs://flowers-public/tfrecords-jpeg-331x331/flowers01-230.tfrec',
# 'gs://flowers-public/tfrecords-jpeg-331x331/flowers02-230.tfrec',
#...
# Downloading the first file
origin = filenames[0]
dest = origin.split("/")[-1]
tf.io.gfile.copy(origin, dest)
After that, if I run the ls command, I can see the file (flowers00-230.tfrec).
In some cases you may need authentication (from G.MAHESH's answer):
from google.colab import auth
auth.authenticate_user()
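For the "use the data directly" part of the question, tf.data can also read gs:// paths natively, so nothing needs to be copied to the Colab disk at all. A small sketch using the same public flowers bucket (for a private bucket you would authenticate first, as above):
import tensorflow as tf

gcs_pattern = 'gs://flowers-public/tfrecords-jpeg-331x331/*.tfrec'
filenames = tf.io.gfile.glob(gcs_pattern)

# The records are streamed straight from GCS; no local copy is made
dataset = tf.data.TFRecordDataset(filenames)
for raw_record in dataset.take(1):
    print(len(raw_record.numpy()), 'bytes in the first record')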

How to upload/download file from GCS to/from ftp server with Airflow FTPHook

I am currently trying to use the FTPHook in Airflow to upload and download files to/from a remote FTP server, but I'm not sure whether I can use a gs:// path as part of the source/destination path.
I don't want to use a local folder within the Airflow pod since the file size might get big, so I would rather use the GCS path directly or a GCS file stream.
conn = FTPHook(ftp_conn_id='ftp_default')
conn.store_file('in', 'gs://bucket_name/file_name.txt')
Link to the FTPHook code: here
Thanks for any help!
I found a simple streaming solution to upload/download files between GCS and an FTP server using pysftp, which I'd like to share.
First, I found this solution, which was working great, but its only issue was that it didn't support uploading a file from GCS to the FTP server. So I kept looking for something else.
I then looked into a different approach and found this Google document, which lets you stream to/from a blob file; that was exactly what I was looking for.
import pysftp
from airflow.hooks.base import BaseHook  # airflow.hooks.base_hook on older Airflow versions

# `bucket` is a google.cloud.storage Bucket object created elsewhere in the operator
params = BaseHook.get_connection(self.ftp_conn_id)
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None
sftp = pysftp.Connection(host=params.host, username=params.login,
                         password=params.password, port=params.port,
                         cnopts=cnopts)

# This will download a file from the FTP server and stream it into GCS
with sftp.open(self.ftp_folder + '/' + file_to_load, 'r+') as remote_file:
    blob = bucket.blob(self.gcs_prefix + file_to_load)
    blob.upload_from_file(remote_file)

# This will upload a file from GCS to the FTP server
with sftp.open(self.ftp_folder + '/' + file_name, 'w+') as remote_file:
    blob = bucket.blob(fileObject['name'])
    blob.download_to_file(remote_file)
GCS does not implement FTP support, so this won't work.
It looks like the FTPHook only knows how to deal with a local file path or buffer, not the GCS APIs.
You might be able to find (or write) some code that reads from FTP and writes to GCS, as sketched below.
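A rough sketch of that idea, staging the file through an in-memory buffer (the host, credentials, and object names are placeholders):
import io
from ftplib import FTP
from google.cloud import storage

ftp = FTP('ftp.example.com')   # placeholder host
ftp.login('user', 'password')  # placeholder credentials

# Read the file from the FTP server into memory
buffer = io.BytesIO()
ftp.retrbinary('RETR file_name.txt', buffer.write)
buffer.seek(0)

# Write it to GCS
bucket = storage.Client().bucket('bucket_name')
bucket.blob('file_name.txt').upload_from_file(buffer)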

Read data stored in zip file in Google Cloud Storage from Notebook in Google Cloud Datalab

I have a zip file containing a relatively large dataset (1 GB) stored in Google Cloud Storage.
I need to access that file, and the data it contains, from a notebook hosted in Google Cloud Datalab. How do I go about this?
Thank you.
Can you try the following?
import pandas as pd
# Path to the object in Google Cloud Storage that you want to copy
sample_gcs_object = 'gs://path-to-gcs/Hello.txt.zip'
# Copy the file from Google Cloud Storage to Datalab
!gsutil cp $sample_gcs_object 'Hello.txt.zip'
# Unzip the file
!unzip 'Hello.txt.zip'
# Read the file into a pandas DataFrame
pandas_dataframe = pd.read_csv('Hello.txt')
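If you would rather skip the local copy, here is a sketch that reads the archive straight from GCS into memory (assuming it fits in RAM; the bucket and object names follow the example above):
import io
import zipfile
import pandas as pd
from google.cloud import storage

client = storage.Client()
blob = client.bucket('path-to-gcs').blob('Hello.txt.zip')

# Pull the zip into memory and read the CSV inside it without touching local disk
zip_bytes = io.BytesIO(blob.download_as_bytes())  # download_as_string() on older client versions
with zipfile.ZipFile(zip_bytes) as zf:
    with zf.open('Hello.txt') as f:
        pandas_dataframe = pd.read_csv(f)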

How do you use storage service in Bluemix?

I'm trying to store some data on Bluemix. I searched many wiki pages but couldn't work out how to proceed. Can anyone tell me how to store images and files in Bluemix storage from code in any language (Java, Node.js)?
You have several options at your disposal for storing files from your app. None of them involve the app container's file system, as that space is ephemeral and will be recreated from the droplet each time a new instance of your app is created.
You can use services like MongoLab, Cloudant, Object Storage, and Redis to store all kinds of blob data.
Assuming that you're using Bluemix to write a Cloud Foundry application, another option is sshfs. At your app's startup time, you can use sshfs to create a connection to a remote server that is mounted as a local directory. For example, you could create a ./data directory that points to a remote SSH server and provides a persistent storage location for your app.
Here is a blog post explaining how this strategy works and a source repo showing it used to host a Wordpress blog in a Cloud Foundry app.
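A minimal sketch of that startup step in Python (hypothetical host and paths; assumes SSH keys are already configured for the remote server):
import os
import subprocess

# Mount a remote directory over SSH as ./data so the app gets a persistent storage location
os.makedirs('data', exist_ok=True)
subprocess.run(['sshfs', 'user@storage.example.com:/srv/app-data', 'data'], check=True)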
Note that as others have suggested, there are a number of services for storing object data. Go to the Bluemix Catalog [1] and select "Data Management" in the left hand margin. Each of those services should have sufficient documentation to get you started, including many sample applications and tutorials. Just click on a service tile, and then click on the "View Docs" button to find the relevant documentation.
[1] https://console.ng.bluemix.net/?ace_base=true/#/store/cloudOEPaneId=store
Check out https://www.ng.bluemix.net/docs/#services/ObjectStorageV2/index.html#gettingstarted. The storage service in Bluemix is OpenStack Swift running in Softlayer. Check out this page (http://docs.openstack.org/developer/swift/) for docs on Swift.
Here is a page that lists some clients for Swift.
https://wiki.openstack.org/wiki/SDKs
When I searched, there was a service called Object Storage that was created by IBM, but at the moment I can't see it in the Bluemix Catalog. I guess they withdrew it and will publish a new service in the future.
Be aware that the object store in Bluemix is now S3 compatible, so you can, for instance, use boto or boto3 (for the Python folks); it is 100% API compatible.
See some examples here: https://ibm-public-cos.github.io/crs-docs/crs-python.html
This script lists all objects in all buckets recursively:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint)
for bucket in s3.buckets.all():
    print(bucket.name)
    for obj in bucket.objects.all():
        print(" - %s" % obj.key)
If you want to specify your credentials explicitly, it would be:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint,
                    aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard,
                    aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey,
                    use_ssl=True)
for bucket in s3.buckets.all():
    print(bucket.name)
    for obj in bucket.objects.all():
        print(" - %s" % obj.key)
If you want to create a "hello.txt" file in a new bucket. :
import boto3
endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)
my_bucket=s3.create_bucket('my-new-bucket')
s3.Object(my_bucket, 'hello.txt').put(Body=b"I'm a test file")
If you want to upload a file to a new bucket:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint,
                    aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard,
                    aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey,
                    use_ssl=True)
my_bucket = s3.create_bucket(Bucket='my-new-bucket')
timestampstr = str(timestamp)
my_bucket.upload_file(<location of yourfile>, <your file name>,
                      ExtraArgs={"ACL": "public-read",
                                 "Metadata": {"METADATA1": "resultat", "METADATA2": "1000",
                                              "gid": "blabala000", "timestamp": timestampstr}})