How to import a file from Google Cloud Storage to H2O running in R - google-cloud-storage

I would like to import a csv file from my Google Cloud Storage bucket into H2O running in R locally (h2o.init(ip = "localhost")).
I tried following the instructions at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/cloud-integration/gcs.html?highlight=environment.
I can already upload CSV files from R to GCS and vice versa using the R package cloudml, so I am reasonably sure I have the authorizations set correctly.
I have tried using Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = "/full/path/to/auth.json"). I tried doing the same thing from the RStudio terminal: export GOOGLE_APPLICATION_CREDENTIALS="/full/path/to/auth.json". I also tried gcloud auth application-default login from the RStudio terminal.
But in every case, the following fails from within RStudio:
h2o.init()
h2o::h2o.importFile(path = "gs://[gcs_bucket]/[tbl.csv]",
                    destination_frame = "tbl_from_gcs")
H2O throws the error:
Error in h2o.importFolder(path, pattern = "", destination_frame = destination_frame, :
all files failed to import
If I turn on logging (h2o::h2o.startLogging("logfile")), it shows:
GET http://localhost:54321/3/ImportFiles?path=gs%3A%2F%2F[gcs_bucket]%2F[tbl.csv]&pattern=
postBody:
curlError: FALSE
curlErrorMessage:
httpStatusCode: 200
httpStatusMessage: OK
millis: 182
{"__meta":{"schema_version":3,"schema_name":"ImportFilesV3","schema_type":"ImportFiles"},"_exclude_fields":"","path":"gs://[gcs_bucket]/[tbl.csv]","pattern":"","files":[],"destination_frames":[],"fails":["gs://[gcs_bucket]/[tbl.csv]"],"dels":[]}
(Obviously, I changed the bucket name and table name, but hopefully you get the idea.)
I am running H2O version 3.26.0.2 in R 3.6.1 and RStudio 1.2.1578. (I am running RStudio Server in Docker on my local server using rocker/tidyverse:latest, FYI.)
If anyone could walk me through the steps to authenticate H2O so it can access GCS buckets directly, I would appreciate it. I know I could use cloudml or googleCloudStorageR as a workaround, but I would like to be able to use H2O directly so I can more easily switch from a local H2O cluster to a cloud H2O cluster.

I found one solution to this authentication issue: because I am running H2O in a Docker swarm, I can set an environment variable for the container in Docker Compose.
The relevant parts of the docker compose file look like this:
environment:
  - GOOGLE_APPLICATION_CREDENTIALS=/run/secrets/google_auth_secret
secrets:
  - google_auth_secret
...
secrets:
  google_auth_secret:
    file: ./gcloud_auth.json
Where gcloud_auth.json is the credentials file described here for your GCS bucket.
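To confirm that the mounted credentials file really does grant access to the bucket (independently of H2O), a quick check with the google-cloud-storage Python client can help. This is just a minimal sketch: the secret path matches the compose file above, but the bucket name is a placeholder for your own.
from google.cloud import storage

# Placeholder bucket name; the credentials path matches the secret mounted above
client = storage.Client.from_service_account_json("/run/secrets/google_auth_secret")
for blob in client.list_blobs("your-gcs-bucket", max_results=5):
    print(blob.name)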

Related

gcloud transfer jobs does not recognize my azure credentials file

Whenever I try to use the gcloud transfer CLI to create a transfer job from an Azure Blob container to a Google Cloud Storage bucket, I get the following error:
ERROR: gcloud crashed (ValueError): Source creds file must be JSON or INI format.
The file at the JSON path I pass in looks like this:
{"sasToken": "working_token"}
and I also tried this
{
"sasToken": "working_token"
}
as stated here, but neither of them actually works.
I also tried putting the content of the file directly into the --source-creds-file option, but it says this is not a path to a file, which is true.
When I create the same job using the console, it just works. So has anyone gotten the CLI to work? Any ideas?
Thanks in advance
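One thing that may be worth checking, since gcloud is complaining specifically about the file format: whether the file parses as plain JSON at all. A UTF-8 BOM or stray bytes at the start of the file can make something that looks like valid JSON fail to parse. A minimal Python sketch, with a hypothetical path:
import json

# Hypothetical path: use the same path you pass to --source-creds-file
path = "/path/to/azure_creds.json"
with open(path, "rb") as f:
    raw = f.read()
print(repr(raw[:16]))                        # look for a BOM or stray bytes at the start
print(json.loads(raw.decode("utf-8-sig")))   # raises if the content is not valid JSON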

unexpected "ed25519-nkey" algorithm error when using NAS and NSC of NATS.io

A team I'm working with has created a NAS Docker container. The Dockerfile uses FROM synadia/nats-account-server:0.8.4 and installs NSC using curl -L https://raw.githubusercontent.com/nats-io/nsc/master/install.py | python. When NAS is run in the Docker container, it is given a path to a server.conf file that contains operatorjwtpath: "/nsc/accounts/nats/OperatorName/OperatorName.jwt".
The problem is, that when I generate the operator on my PC using nsc add operator -i and when I run the Docker container on AWS Fargate and mount the JWT file to the appropriate folder using an AWS EFS filesystem, the container crashes and shows the error unexpected "ed25519-nkey" algorithm.
According to the NATS basics page, the algorithm that should be used is "alg": "ed25519". But when I generated the JWT and decoded it on this site, I see that what's being used is "alg": "ed25519-nkey".
So what is going on here? I can't find any specific info about an algorithm that has the "nkey" appended to its name. This is the default JWT that's generated. Why is it different from what the NAS algorithm expects? How do I solve this error?
Extra info: According to this site, it's supposed to be due to a version conflict, but even upgrading to FROM synadia/nats-account-server:1.0.0 didn't solve it.
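In case it helps with debugging, the JWT header can also be inspected locally instead of pasting the token into a website; the header is just base64url-encoded JSON. A minimal Python sketch, assuming the operator JWT path from the server.conf above:
import base64
import json

# Path taken from the operatorjwtpath setting mentioned above
with open("/nsc/accounts/nats/OperatorName/OperatorName.jwt") as f:
    token = f.read().strip()

header_b64 = token.split(".")[0]
# JWT segments are base64url-encoded without padding, so pad before decoding
header = json.loads(base64.urlsafe_b64decode(header_b64 + "=" * (-len(header_b64) % 4)))
print(header)  # e.g. {"typ": "jwt", "alg": "ed25519-nkey"}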

Is there a way to use the data from Google Cloud Storage directly in Colab?

I want to use a dataset (170+GB) in Google Colab. I have two questions:
Since the available space in Colab is about 66 GB, is there a way to use the data from GCS directly in Colab if the data is hosted in GCS? If not, what is a possible solution?
How can I upload the dataset to GCS directly from a downloadable link, since I cannot wget it into Colab due to the limited available space?
Any help is appreciated.
Authenticate:
from google.colab import auth
auth.authenticate_user()
Install the Google Cloud SDK:
!curl https://sdk.cloud.google.com | bash
Initialize the SDK to configure the project settings:
!gcloud init
1. Download a file from Cloud Storage to Google Colab:
!gsutil cp gs://your-bucket/your_file.csv .
2. Upload a file from Google Colab to Cloud Storage:
!gsutil cp your_file.csv gs://your-bucket/
Hope it helps. Source
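For the second part of the question (getting the dataset into GCS from a download link without staging it on the Colab disk), one option is to stream the HTTP download straight into a GCS object. This is a minimal sketch using requests and the google-cloud-storage client; the URL, project, bucket, and object names are placeholders:
from google.colab import auth
import requests
from google.cloud import storage

auth.authenticate_user()

# Placeholders: replace with your own download link, project, bucket, and object name
url = "https://example.com/big_dataset.tar.gz"
client = storage.Client(project="your-project-id")
blob = client.bucket("your-bucket").blob("datasets/big_dataset.tar.gz")

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    resp.raw.decode_content = True
    # upload_from_file reads from the response stream in chunks, so the file
    # never has to fit on the Colab disk
    blob.upload_from_file(resp.raw, content_type=resp.headers.get("Content-Type"))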
I have a working example, that uses tf.io.gfile.copy (doc).
import tensorflow as tf
# Getting file names based on patterns
gcs_pattern = 'gs://flowers-public/tfrecords-jpeg-331x331/*.tfrec'
filenames = tf.io.gfile.glob(gcs_pattern)
#['gs://flowers-public/tfrecords-jpeg-331x331/flowers00-230.tfrec',
# 'gs://flowers-public/tfrecords-jpeg-331x331/flowers01-230.tfrec',
# 'gs://flowers-public/tfrecords-jpeg-331x331/flowers02-230.tfrec',
#...
# Downloading the first file
origin = filenames[0]
dest = origin.split("/")[-1]
tf.io.gfile.copy(origin, dest)
After that if I run the ls command, I can see the file (flowers00-230.tfrec).
In some cases you may need authentication (from G.MAHESH's answer):
from google.colab import auth
auth.authenticate_user()
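Regarding the first part of the question (using the data directly from GCS without copying it into Colab): for TFRecord data like the files above, tf.data can read gs:// paths directly, so the dataset never has to fit on the Colab disk. A minimal sketch based on the same public bucket:
import tensorflow as tf

gcs_pattern = 'gs://flowers-public/tfrecords-jpeg-331x331/*.tfrec'
filenames = tf.io.gfile.glob(gcs_pattern)

# tf.data streams records straight from GCS, so nothing is copied to local disk
dataset = tf.data.TFRecordDataset(filenames)
for raw_record in dataset.take(1):
    print(len(raw_record.numpy()), "bytes in the first record")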

Error in Google Cloud Shell Commands while working on the lab (Securing Google Cloud with CFT Scorecard)

I am working in a GCP lab (Securing Google Cloud with CFT Scorecard). All instructions for the lab are given.
First, I have to run the following two commands to set environment variables:
export GOOGLE_PROJECT=$DEVSHELL_PROJECT_ID
export CAI_BUCKET_NAME=cai-$GOOGLE_PROJECT
In the second command above, I don't know what I need to replace with my own credentials. Maybe that is why I am getting an error.
Now I have to enable the cloudasset.googleapis.com service. For this, the lab gives the following command:
gcloud services enable cloudasset.googleapis.com \
--project $GOOGLE_PROJECT
The error for this is shown in the screenshot attached here:
Error in the service enabling command
The next step is to clone the policy library. The given command for that is:
git clone https://github.com/forseti-security/policy-library.git
After that they said: "You realize Policy Library enforces policies that are located in the policy-library/policies/constraints folder, in which case you can copy a sample policy from the samples directory into the constraints directory".
and gave this command:
cp policy-library/samples/storage_blacklist_public.yaml policy-library/policies/constraints/
On running this command I received this:
error on running the directory command
Finally they said "Create the bucket that will hold the data that Cloud Asset Inventory (CAI) will export" and gave the following command:
gsutil mb -l us-central1 -p $GOOGLE_PROJECT gs://$CAI_BUCKET_NAME
I am confused about where I need to substitute my own credentials; for example, in place of the project ID I wrote my own project ID.
Also, I don't know why these errors are occurring. Kindly help me.
I'm unable to access the tutorial.
What happens if you run the following:
echo ${DEVSHELL_PROJECT_ID}
I suspect you'll get an empty result because I think this environment variable isn't actually set.
I think it should be:
echo ${DEVSHELL_GCLOUD_CONFIG}
Does that return a result?
If so, perhaps try using that variable instead:
export GOOGLE_PROJECT=${DEVSHELL_GCLOUD_CONFIG}
export CAI_BUCKET_NAME=cai-${GOOGLE_PROJECT}
It's not entirely clear to me why this tutorial is using this approach but, if the above works, it may get you further along.
Were you asked to create a Google Cloud Platform project?
As per the shared error, this seems to be because your env variable GOOGLE_PROJECT is not set. You can verify it by using echo $GOOGLE_PROJECT and seeing whether it returns the project ID or not. You could also use echo $DEVSHELL_PROJECT_ID. If that returns the project ID and the former doesn't, it means that you didn't export the variable as stated at the beginning.
If the problem is that GOOGLE_PROJECT doesn't have any value, there are different approaches on how to solve it.
Set the env variable as you explained at the beginning. Obviously this will only work if the variable DEVSHELL_PROJECT_ID is also set.
export GOOGLE_PROJECT=$DEVSHELL_PROJECT_ID
Manually set the project ID in that variable. This is far from ideal because Qwiklabs creates a new temporary project for every lab, so this would only have worked if you were still on that project. The project ID can be seen in both of your shared screenshots.
export GOOGLE_PROJECT=qwiklabs-gcp-03-c6e1787dc09e
Avoid the --project argument. According to the documentation, this argument is optional, and if it is omitted the command will use the default project from your configuration settings. You can get the current project with:
gcloud config get-value project
If the previous command matches the project ID you want to use, you can simply issue the following command:
gcloud services enable cloudasset.googleapis.com
Notice that the project ID is not being explicitly mentioned using --project.
Regarding your issue with the GitHub file, I have checked the repository and the file storage_blacklist_public.yaml doesn't seem to be in the policy-library/samples directory. There is a trace that it was once there, but it isn't anymore; they should probably update the lab.
About your credentials confusion: you don't have to use your own project ID, just the one given in your lab. If I recall correctly, all the needed data should be on the left side of the lab. Still, you shouldn't need to authenticate in a normal situation, as you are already logged into your temporary project if you are accessing it from Cloud Shell, which is where you should be doing all this.
Adding this for later versions:
In Cloud Shell you can set a temporary variable for the current project ID with
PROJECT_ID="$(gcloud config get-value project)"
and then use it like this:
--project ${PROJECT_ID}

How do you use storage service in Bluemix?

I'm trying to store some data on Bluemix. I searched many wiki pages but couldn't figure out how to proceed. Can anyone tell me how to store images and files in Bluemix storage from code in any language (Java, Node.js)?
You have several options at your disposal for storing files in your app. None of them include doing it in the app container file system as the file space is ephemeral and will be recreated from the droplet each time a new instance of your app is created.
You can use services like MongoLab, Cloudant, Object Storage, and Redis to store all kinds of blob data.
Assuming that you're using Bluemix to write a Cloud Foundry application, another option is sshfs. At your app's startup time, you can use sshfs to create a connection to a remote server that is mounted as a local directory. For example, you could create a ./data directory that points to a remote SSH server and provides a persistent storage location for your app.
Here is a blog post explaining how this strategy works and a source repo showing it used to host a Wordpress blog in a Cloud Foundry app.
Note that as others have suggested, there are a number of services for storing object data. Go to the Bluemix Catalog [1] and select "Data Management" in the left hand margin. Each of those services should have sufficient documentation to get you started, including many sample applications and tutorials. Just click on a service tile, and then click on the "View Docs" button to find the relevant documentation.
[1] https://console.ng.bluemix.net/?ace_base=true/#/store/cloudOEPaneId=store
Check out https://www.ng.bluemix.net/docs/#services/ObjectStorageV2/index.html#gettingstarted. The storage service in Bluemix is OpenStack Swift running in Softlayer. Check out this page (http://docs.openstack.org/developer/swift/) for docs on Swift.
Here is a page that lists some clients for Swift.
https://wiki.openstack.org/wiki/SDKs
From what I found, there was a service named Object Storage that was created by IBM, but at the moment I can't see it in the Bluemix Catalog. I guess they withdrew it and will publish a new service in the future.
Be aware that the object store in Bluemix is now S3 compatible, so for instance you can use Boto or boto3 (for Python users); it is 100% API compatible.
See some examples here: https://ibm-public-cos.github.io/crs-docs/crs-python.html
This script recursively lists all objects in all buckets:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint)
for bucket in s3.buckets.all():
    print(bucket.name)
    for obj in bucket.objects.all():
        print(" - %s" % obj.key)
If you want to specify your credentials explicitly, this would be:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)
for bucket in s3.buckets.all():
    print(bucket.name)
    for obj in bucket.objects.all():
        print(" - %s" % obj.key)
If you want to create a "hello.txt" file in a new bucket:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)
my_bucket = s3.create_bucket(Bucket='my-new-bucket')
s3.Object(my_bucket.name, 'hello.txt').put(Body=b"I'm a test file")
If you want to upload a file to a new bucket:
import time
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)
my_bucket = s3.create_bucket(Bucket='my-new-bucket')
timestampstr = str(time.time())
my_bucket.upload_file(<location of yourfile>, <your file name>, ExtraArgs={"ACL": "public-read", "Metadata": {"METADATA1": "resultat", "METADATA2": "1000", "gid": "blabala000", "timestamp": timestampstr}})
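To complement the snippets above, reading an object back uses the same boto3 resource. A small sketch that reuses the placeholder credentials and the bucket/key from the previous examples:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)
# Read the object created above into memory
print(s3.Object('my-new-bucket', 'hello.txt').get()['Body'].read().decode('utf-8'))
# Or download it to a local file
s3.Bucket('my-new-bucket').download_file('hello.txt', 'hello_local.txt')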