Execute a file in a Google Cloud Platform (GCP) bucket using Scala

I'm looking to execute Scala code stored in a text file on GCP using the Spark shell.
Using GCP (Google Cloud Platform), I've done the following:
Created a Dataproc cluster and named it gcp-cluster-091122.
Created a Cloud Storage bucket and named it gcp-bucket-091122.
Created a simple text file called 1.txt and uploaded it into the newly created bucket, gcp-bucket-091122.
Logged onto the VM instance via SSH-in-browser and ran spark-shell to reach the scala> prompt.
From here, how does one read or execute a particular file uploaded into a GCP bucket? I've researched this topic but have been unsuccessful.
I've also used Cloud Storage FUSE (gcsfuse) to successfully mount the bucket gcp-bucket-091122 onto a local directory created in the SSH session, called lfs-directory-091122.
So an additional question is: how does one execute a file located in that local directory using the Spark shell?
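For reference, reading the uploaded file from the scala> prompt typically looks like the sketch below, assuming the default Dataproc image (which ships with the GCS connector) and the bucket/file names above:

```scala
// Inside spark-shell on the Dataproc cluster, a gs:// path can be read directly,
// because the GCS connector is preinstalled on Dataproc images.
val lines = spark.read.textFile("gs://gcp-bucket-091122/1.txt")   // Dataset[String]
lines.show(truncate = false)

// Equivalent RDD-style read:
val rdd = sc.textFile("gs://gcp-bucket-091122/1.txt")
println(s"line count: ${rdd.count()}")

// If 1.txt actually contains Scala code to run, the REPL can execute it from a
// local path, e.g. the gcsfuse mount described above (adjust the path to wherever
// lfs-directory-091122 is mounted):
// :load lfs-directory-091122/1.txt
```

Alternatively, spark-shell -i <file> runs a Scala script at startup; both -i and :load expect a local path, so the gcsfuse mount (or a gsutil cp onto the VM) is a convenient way to make the bucket file visible locally.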

Related

Firestore: Copy/Import data from local emulator to Cloud DB

I need to copy Firestore DB data from my local Firebase emulator to the cloud instance. I can move data from the cloud DB to the local DB fine, using the EXPORT functionality in the Firebase admin console. We have been working on the local database instance for 3-4 months and now we need to move it back to the cloud. I have tried to move the local "--export-on-exit" files back to my storage bucket and then IMPORT from there to the cloud DB, but it fails every time.
I have seen one comment by Doug at https://stackoverflow.com/a/71819566/20390759 saying that this is not possible and that the best solution is to write a program to copy from local to cloud. I've started working on that, but I can't find a way to have both projects/databases, local and cloud, open at the same time; they share the same project ID, app key, etc.
Attempted IMPORT: I copied the files created by "--export-on-exit" in the emulator to my Cloud Storage bucket. Then I selected IMPORT and chose the file I had copied up to the bucket. I get this error message:
Google Cloud Storage file does not exist: /xxxxxx.appspot.com/2022-12-05 from local/2022-12-05 from local.overall_export_metadata
So I renamed the metadata file to match the directory name from the local system; the IMPORT then claims to start successfully but fails with no error message.
I've been using Firestore for several years, but we just started using the emulator this year. Overall I like it, but if I can't easily move data back to the cloud, I can't use it. Thanks for any ideas or help.
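One way to approach the "write a program to copy it" route is to open two clients in the same process: one pinned to the emulator through the builder's emulator-host setting, and one using the normal cloud credentials, so sharing the same project ID stops being a blocker. A rough Scala sketch using the Java Firestore client (collection name, project ID, and emulator host are placeholders; the emulator-host option requires a recent version of the client library):

```scala
import com.google.cloud.firestore.{Firestore, FirestoreOptions}
import scala.jdk.CollectionConverters._

object CopyEmulatorToCloud {
  def main(args: Array[String]): Unit = {
    // Client pointed at the local emulator; host and port are placeholders.
    // (The client also honours the FIRESTORE_EMULATOR_HOST environment variable,
    // but setting it would redirect BOTH clients, which is exactly the problem.)
    val local: Firestore = FirestoreOptions.newBuilder()
      .setProjectId("my-project-id")        // placeholder project ID
      .setEmulatorHost("localhost:8080")    // placeholder emulator host:port
      .build()
      .getService

    // Client pointed at the real cloud project, using application default credentials.
    val cloud: Firestore = FirestoreOptions.newBuilder()
      .setProjectId("my-project-id")
      .build()
      .getService

    // Copy every document of one collection; "users" is a placeholder name.
    val docs = local.collection("users").get().get().getDocuments.asScala
    docs.foreach { snap =>
      cloud.collection("users").document(snap.getId).set(snap.getData).get()
    }

    local.close()
    cloud.close()
  }
}
```

This only copies one flat collection; subcollections would need to be walked recursively.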

MLflow storing artifacts in Google Cloud Storage but not displaying them in the MLflow UI

I am working in a Docker environment (docker-compose) with a Jupyter notebook image and a Postgres image for running ML models, and I'm using Google Cloud Storage to store the model artifacts. Storing the models in Cloud Storage works fine, but I can't get them to show up in the MLflow UI. I have seen similar problems, but none of the solutions used Google Cloud Storage as the artifact store. The error message says the following: Unable to list artifacts stored under <gs-location> for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory. What could possibly be causing this problem?
I had exactly the same issue. Keywords: docker-compose, Google Cloud Storage, success storing artifacts in GCS, but failure listing them in the UI.
In my case, it turned out that if the docker-compose file assigns the environment variables by reading from a .env file (e.g. GOOGLE_APPLICATION_CREDENTIALS), the server might start before the assignment happens. The quick fix is to assign the variable directly under the environment: key instead of using env_file:.
For sensitive data that you still need to keep in a .env file, you can add a wait time for the server, and add depends_on: in the docker-compose file to make sure the database container starts before the MLflow server if you are using a database-backed store.
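For illustration, a docker-compose fragment contrasting the two approaches (service names, image, and credential path are placeholders):

```yaml
services:
  mlflow:
    image: my-mlflow-image                 # placeholder image
    # Option 1: set the variable directly so it exists when the server starts.
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcs-key.json
    # Option 2: keep sensitive values in a .env file, but make sure the backend
    # store is up first and the server waits long enough to pick them up.
    # env_file:
    #   - .env
    depends_on:
      - postgres
  postgres:
    image: postgres:14
```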
I faced the same issue when running MLflow locally. The issue was resolved after adding GOOGLE_APPLICATION_CREDENTIALS to the environment variables.
https://googleapis.dev/python/google-api-core/latest/auth.html

Having a problem adding local Hasura to Google Cloud Run

Do you have any information or a tutorial on adding a local Hasura setup to Google Cloud Run?
I have already set up Hasura on Google Cloud Run successfully, but it seems I have a problem connecting it to our local database.
I got an error:
ERROR: (gcloud.builds.submit) Unable to read file [cloudbuild.yaml]: [Errno 2] No such file or directory: 'cloudbuild.yaml'
Is there something that is not configured yet?
Best
Zaid
Your question is vague.
The error you reference is from Google Cloud Build and suggests that you're running gcloud builds submit ... and that this is failing because the command is unable to find a cloudbuild.yaml file. It's entirely probable that you want to do the deployment using Cloud Build, but you'll need to create the cloudbuild.yaml file for this to work.
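For reference, a minimal cloudbuild.yaml for building a container and deploying it to Cloud Run might look like the following (image name, service name, and region are placeholders):

```yaml
steps:
  # Build the container image from the Dockerfile in the current directory.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/hasura', '.']
  # Push the image to the registry.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/hasura']
  # Deploy the image to Cloud Run.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args: ['run', 'deploy', 'hasura',
           '--image', 'gcr.io/$PROJECT_ID/hasura',
           '--region', 'us-central1', '--platform', 'managed']
images:
  - 'gcr.io/$PROJECT_ID/hasura'
```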
For those of us unfamiliar with "hasura": do you mean hasura.io?
That appears to require a running container image that defaults to port :8080 (which is good, as that's the default assumed by Cloud Run) and a connection to a PostgreSQL database.
If you're using Cloud SQL to run PostgreSQL, you can follow the Cloud Run documentation for connecting to Cloud SQL.

How to download a file from a URL and store it in an AWS S3 bucket?

As stated, I'm trying to download this dataset of zip archives containing images: https://data.broadinstitute.org/bbbc/BBBC006/ and store them in an S3 bucket so I can later unzip them in the bucket, reorganize them, and pull them in smaller chunks into a VM for some computation. The problem is, I don't know how to get the data from https://data.broadinstitute.org/bbbc/BBBC006/BBBC006_v1_images_z_00.zip, for example, or any of the other ones, and then send it to S3.
This is my first time using AWS or really any cloud platform, so please bear with me :]
Amazon EC2 provides a virtual computer just like a normal Linux or Windows computer.
Amazon S3 is an object storage service where you can upload/download files.
If you wish to copy files from a website to Amazon S3, you will need to write an application or script that will:
Download the files from the website
Upload them to Amazon S3
If you wish to do it from a script, you could use the AWS Command-Line Interface (CLI).
Or, you could do it from a programming language, see: SDKs and Programming Toolkits for AWS
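As a rough sketch of those two steps in Scala, using the AWS SDK for Java v2 (bucket name, key, and region are placeholders):

```scala
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

object DownloadToS3 {
  def main(args: Array[String]): Unit = {
    val sourceUrl = "https://data.broadinstitute.org/bbbc/BBBC006/BBBC006_v1_images_z_00.zip"
    val localPath = Paths.get("BBBC006_v1_images_z_00.zip")

    // Step 1: download the file from the website to local disk.
    val in = new URL(sourceUrl).openStream()
    try Files.copy(in, localPath, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()

    // Step 2: upload the local file to S3 (bucket name and key are placeholders).
    val s3 = S3Client.builder().region(Region.US_EAST_1).build()
    val request = PutObjectRequest.builder()
      .bucket("my-bucket-name")
      .key("bbbc006/BBBC006_v1_images_z_00.zip")
      .build()
    s3.putObject(request, RequestBody.fromFile(localPath))
    s3.close()
  }
}
```

The same two steps can also be done without code, e.g. by fetching the URL onto an EC2 instance with curl or wget and then running aws s3 cp <file> s3://<bucket>/ with the CLI mentioned above.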

Some questions about Google Data Fusion

I am discovering the tool and I have some questions:
- What exactly is meant by the type File in (Source, Sink)?
- Is it also possible to send the result of the pipeline directly to an FTP server?
I checked the documentation, but I did not find this information.
Thank you
Short answer: File refers to the filesystem where the pipelines run. In the Data Fusion context, if you are using the File sink, the contents will be written to HDFS on the Dataproc cluster.
Data Fusion has an SFTP Put action that can be used to write to SFTP. Here is a simple pipeline showing how to write to SFTP from GCS.
Step 1: GCS Source to File sink: this writes the content of GCS to HDFS on Dataproc when the pipeline is run.
Step 2: SFTP Put action, which takes the output of the File sink and uploads it to SFTP.
You need to configure the output path of the File sink to be the same as the source path in the SFTP Put action.