Read data stored in zip file in Google Cloud Storage from Notebook in Google Cloud Datalab - google-cloud-storage

I have a relatively large dataset (1 GB) stored as a zip file in a Google Cloud Storage bucket.
I need to use a notebook hosted in Google Cloud Datalab to access that file and the data it contains. How do I go about this?
Thank you.

Can you try the following?
import pandas as pd
# Path to the object in Google Cloud Storage that you want to copy
sample_gcs_object = 'gs://path-to-gcs/Hello.txt.zip'
# Copy the file from Google Cloud Storage to Datalab
!gsutil cp $sample_gcs_object 'Hello.txt.zip'
# Unzip the file
!unzip 'Hello.txt.zip'
# Read the file into a pandas DataFrame
pandas_dataframe = pd.read_csv('Hello.txt')
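If disk space in the notebook is tight, the unzip step can in principle be skipped by reading the member straight out of the archive. The following is a minimal sketch using only the standard library, not part of the original answer; the helper name read_rows_from_zip and the tiny in-memory archive are hypothetical stand-ins for the copied 'Hello.txt.zip'.

```python
import csv
import io
import zipfile


def read_rows_from_zip(zip_source, member_name, encoding="utf-8"):
    """Return the rows of one CSV member as a list of dicts, without extracting to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member_name) as raw:
            # ZipExtFile yields bytes; wrap it so the csv module sees text.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.DictReader(text))


# Build a tiny archive in memory as a stand-in for the downloaded 'Hello.txt.zip'.
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("Hello.txt", "id,name\n1,alpha\n2,beta\n")
demo_zip.seek(0)

rows = read_rows_from_zip(demo_zip, "Hello.txt")
print(rows)
```

In the notebook, zip_source would be the local path 'Hello.txt.zip'. For a 1 GB file, collecting every row in a list may be too memory-hungry; pandas can also read a zip archive that contains a single data file directly with pd.read_csv.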

Related

Reading Json file from Azure datalake as a file using Json.load in Azure databricks /Synapse notebooks

I am trying to parse JSON data with multiple levels of nesting. My approach is to pass a file name to open(File-name) and load the data. When I provide a data lake path, it throws an error saying the file path was not found. I am able to read the data into dataframes, but how can I read the file from the data lake and open it as a plain file, without converting it to a dataframe?
Current code approach on local machine which is working:
f = open(File_Name.Json)
data = json.load(f)
Failing scenario when providing the data lake path:
f = open(Datalake path/File_Name.Json)
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
The open function works only with local files; it does not understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as mentioned by @ARCrow, that would be a security risk (unless you create a so-called passthrough mount, which controls access at the cloud storage level).
But if you're able to read the file into a dataframe, the cluster already has all the settings needed to access the cloud storage. In that case you can use the dbutils.fs.cp command to copy the file from the cloud storage to the local disk, and then open it with the open function. Something like this:
import json
# Copy from cloud storage to the driver's local disk, then open it as a regular file
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)
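The copy-then-open pattern above can be wrapped in a small helper. This is only a sketch under assumptions: load_json_via_local_copy is a hypothetical name, and copy_fn stands for anything with the same (source, destination) calling shape as dbutils.fs.cp, which is not available outside Databricks, so a local stand-in is used here.

```python
import json
import os
import shutil
import tempfile


def load_json_via_local_copy(source_path, copy_fn, local_dir="/tmp"):
    """Copy source_path to local disk with copy_fn, then parse it with json.load."""
    local_path = os.path.join(local_dir, os.path.basename(source_path))
    # The destination uses the "file://" scheme, as in the answer's dbutils.fs.cp call.
    copy_fn(source_path, "file://" + local_path)
    with open(local_path, "r") as f:
        return json.load(f)


def local_copy(src, dst):
    """Stand-in for dbutils.fs.cp (assumption): strips the scheme and copies locally."""
    shutil.copy(src, dst[len("file://"):])


with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    src = os.path.join(src_dir, "File_Name.json")
    with open(src, "w") as f:
        json.dump({"a": {"b": [1, 2]}}, f)
    data = load_json_via_local_copy(src, local_copy, local_dir=dst_dir)

print(data["a"]["b"])
```

In a Databricks notebook one would presumably pass dbutils.fs.cp as copy_fn and the data lake path as source_path.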

Read a zip file in databricks from Azure Storage Explorer

I want to read zip files that have csv files. I have tried many ways but I have not succeeded. In my case, the path where I should read the file is in Azure Storage Explorer.
For example, when I have to read a csv in databricks I use the following code:
dfDemandaBilletesCmbinad = spark.read.csv("/mnt/data/myCSVfile.csv", header=True)
So the Azure Storage path I want to read is "/mnt/data/myZipFile.zip", which contains several csv files.
Is it possible to read csv files coming from Azure storage via pySpark in databricks?
I think the only way to do this is with pandas, openpyxl, and Python's zipfile library, as there is no similar library for PySpark.
import pandas as pd
import openpyxl, zipfile
# Unzip and extract to files. It might be better to unzip in memory with StringIO.
with zipfile.ZipFile('/dbfs/mnt/data/file.zip', 'r') as zip_ref:
    zip_ref.extractall('/dbfs/mnt/data/unzipped')
#read excel
my_excel = openpyxl.load_workbook('/dbfs/mnt/data/unzipped/file.xlsx')
ws = my_excel['worksheet1']
# create pandas dataframe
df = pd.DataFrame(ws.values)
# create spark dataframe
spark_df = spark.createDataFrame(df)
The problem is that this runs only on the driver VM of the cluster.
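The in-memory alternative mentioned in the code comment could look roughly like the sketch below, adapted to the csv case from the question. It uses only the standard library; read_all_csv_members is a hypothetical helper name and the demo archive is built in memory purely for illustration. Like the code above, it would still run only on the driver.

```python
import csv
import io
import zipfile


def read_all_csv_members(zip_source, encoding="utf-8"):
    """Map each .csv member name in the archive to its parsed rows (lists of strings)."""
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip directories and non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                tables[name] = list(csv.reader(text))
    return tables


# In-memory stand-in for a mounted archive such as '/dbfs/mnt/data/myZipFile.zip'.
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("a.csv", "x,y\n1,2\n")
    archive.writestr("b.csv", "x,y\n3,4\n")
    archive.writestr("notes.txt", "not a table")
demo_zip.seek(0)

tables = read_all_csv_members(demo_zip)
print(sorted(tables))
```

Each value is a list of rows with the header row first, which could then be converted into a pandas or Spark DataFrame.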
Please keep in mind that Azure Storage Explorer does not store any data. It is a tool that lets you access your Azure storage account from any device and on any platform. The data is always stored in an Azure storage account.
In your scenario, it appears that your Azure storage account is already mounted to a Databricks DBFS file path. Since it is mounted, you can use the spark.read command to access the file directly from the Azure storage account.
Sample: df = spark.read.text("dbfs:/mymount/my_file.txt")
Reference: https://docs.databricks.com/data/databricks-file-system.html
For ZIP files, please refer to:
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html

Is there a way to use the data from Google Cloud Storage directly in Colab?

I want to use a dataset (170+GB) in Google Colab. I have two questions:
Since the available space in Colab is about 66GB, is there a way to use the data from GCS directly in colab, if the data is hosted in GCS? If not, what is a possible solution?
How can I upload the dataset to GCS directly from a downloadable link, since I cannot wget into colab due to the limited available space?
Any help is appreciated.
Authenticate:
from google.colab import auth
auth.authenticate_user()
Install the Google Cloud SDK:
!curl https://sdk.cloud.google.com | bash
Initialize the SDK to configure the project settings:
!gcloud init
1. Download a file from Cloud Storage to Google Colab:
!gsutil cp gs://<your-bucket>/<your-file>.csv .
2. Upload a file from Google Colab to Cloud Storage:
!gsutil cp yourfile.csv gs://<your-bucket>/
Hope it helps.
I have a working example that uses tf.io.gfile.copy.
import tensorflow as tf
# Getting file names based on patterns
gcs_pattern = 'gs://flowers-public/tfrecords-jpeg-331x331/*.tfrec'
filenames = tf.io.gfile.glob(gcs_pattern)
#['gs://flowers-public/tfrecords-jpeg-331x331/flowers00-230.tfrec',
# 'gs://flowers-public/tfrecords-jpeg-331x331/flowers01-230.tfrec',
# 'gs://flowers-public/tfrecords-jpeg-331x331/flowers02-230.tfrec',
#...
# Downloading the first file
origin = filenames[0]
dest = origin.split("/")[-1]
tf.io.gfile.copy(origin, dest)
After that if I run the ls command, I can see the file (flowers00-230.tfrec).
In some cases you may need authentication (from G.MAHESH's answer):
from google.colab import auth
auth.authenticate_user()
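For the second part of the question (getting a large download into GCS without holding it all on the Colab disk), one possible approach is to copy between two file-like objects in fixed-size chunks. The sketch below is an assumption-laden illustration rather than a tested recipe: stream_copy is a hypothetical helper, and the idea that the source could be urllib.request.urlopen(url) and the destination tf.io.gfile.GFile('gs://...', 'wb') has not been verified against a real bucket here. The demo uses in-memory buffers instead.

```python
import io


def stream_copy(src, dst, chunk_size=8 * 1024 * 1024):
    """Copy src to dst in chunks so only one chunk is held in memory; return bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for the remote download stream and the GCS object.
payload = b"0123456789" * 1000
source = io.BytesIO(payload)
destination = io.BytesIO()

copied = stream_copy(source, destination, chunk_size=1024)
print(copied)
```

Because nothing is written to the local disk, the 66GB limit would apply only to whatever is later copied back for processing.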

Load model from Google Cloud Storage without downloading

Is there a way to serve a model from Google Cloud Storage without actually downloading a copy of the model, for example by streaming the data directly?
I'm trying to load a fastText model that is hosted on Google Cloud Storage. Every time I run the program, it has to download a copy of that model from the bucket.
language_model_filename = 'lid.176.bin'  # filename in GCS
language_model_local = 'lid.176.bin'  # local file name when downloaded
bucket = storage_client.get_bucket(CLOUD_STORAGE_BUCKET)
blob = bucket.blob(language_model_filename)
blob.download_to_filename(language_model_local)
language_model = FastText.load_model(language_model_local)
You can use streaming transfers for that purpose. As explained in the documentation, you can use the third-party boto client library plugin for Cloud Storage.
A streaming download example would look like this:
import sys
import boto
downloaded_file = 'saved_data_file'
MY_BUCKET = 'my_app_bucket'
object_name = 'data_file'
src_uri = boto.storage_uri(MY_BUCKET + '/' + object_name, 'gs')
src_uri.get_key().get_file(sys.stdout)
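Since the question's code loads the model from a local file name, a different option (caching rather than streaming, and not part of the answer above) is to download the file only when it is not already present. This is a minimal sketch: ensure_local_copy is a hypothetical helper, and fake_download stands in for a real call such as the question's blob.download_to_filename.

```python
import os
import tempfile


def ensure_local_copy(local_path, download_fn):
    """Call download_fn(path) only when local_path is missing; return local_path."""
    if not os.path.exists(local_path):
        # Download to a temporary name first so a failed download leaves no partial file.
        partial_path = local_path + ".part"
        download_fn(partial_path)
        os.replace(partial_path, local_path)
    return local_path


download_calls = []


def fake_download(path):
    """Stand-in for blob.download_to_filename (assumption): records the call, writes bytes."""
    download_calls.append(path)
    with open(path, "wb") as f:
        f.write(b"model bytes")


with tempfile.TemporaryDirectory() as cache_dir:
    model_path = os.path.join(cache_dir, "lid.176.bin")
    first = ensure_local_copy(model_path, fake_download)
    second = ensure_local_copy(model_path, fake_download)
    with open(second, "rb") as f:
        cached_bytes = f.read()

print(len(download_calls))
```

This only helps when the local disk survives between runs; on a fresh instance the first run still has to fetch the file.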

Copy files from one Google Cloud Storage Bucket to other using Apache Airflow

Problem: I want to copy files from a folder in Google Cloud Storage Bucket (e.g Folder1 in Bucket1) to another Bucket (e.g Bucket2). I can't find any Airflow Operator for Google Cloud Storage to copy files.
I just found a new operator in contrib uploaded 2 hours ago: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/gcs_to_gcs.py called GoogleCloudStorageToGoogleCloudStorageOperator that should copy an object from a bucket to another, with renaming if requested.
I know this is an old question, but I found myself dealing with this task too. I'm using Google Cloud Composer, and GoogleCloudStorageToGoogleCloudStorageOperator was not available in the current version.
I managed to solve this issue by using a simple BashOperator:
from datetime import timedelta
from airflow import models
from airflow.operators.bash_operator import BashOperator
# dag_name and default_dag_args are assumed to be defined elsewhere
with models.DAG(
        dag_name,
        schedule_interval=timedelta(days=1),
        default_args=default_dag_args) as dag:
    copy_files = BashOperator(
        task_id='copy_files',
        bash_command='gsutil -m cp <Source Bucket> <Destination Bucket>'
    )
It is very straightforward, and it can create folders and rename your files if you need.
You can use GoogleCloudStorageToGoogleCloudStorageOperator
The code below moves all the .csv files from the source bucket to the destination bucket.
Package: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/gcs_to_gcs/index.html
from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
backup_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='Move_File_to_backupBucket',
    source_bucket='adjust_data_03sept2020',
    source_object='*.csv',
    destination_bucket='adjust_data_03sept2020_backup',
    move_object=True,
    google_cloud_storage_conn_id='connection_name',
    dag=dag
)