How to check, in the GCP console, the number of PUT operations performed while uploading a file to a GCS bucket? - google-cloud-storage

I have uploaded a big file of more than 5 GB, but I want to check whether this big file is counted as a single PUT operation or as multiple PUT operations based on the number of small chunks GCP divided the file into.
I am not able to find any such monitoring in the GCP console. Kindly help/guide me on how to check the number of Class A or Class B operations performed so far in GCP.
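One place to look is Cloud Monitoring, which exposes a per-bucket request-count metric broken down by API method (the same metric is visible in the console under Monitoring > Metrics Explorer). Below is a minimal sketch of querying it with the google-cloud-monitoring client library; the project ID and bucket name are placeholders, not values from the question.

from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 24 * 3600},  # last 24 hours
    }
)

# storage.googleapis.com/api/request_count is labelled by API method, which lets
# you see how many write-style calls a single upload produced.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = "storage.googleapis.com/api/request_count" '
            'AND resource.labels.bucket_name = "my-bucket"'  # hypothetical bucket
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    method = series.metric.labels["method"]
    total = sum(point.value.int64_value for point in series.points)
    print(method, total)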

Related

Streaming data (images) from Storage Bucket to training job on Google AI Platform

I am running a training job using Keras. The code is computationally heavy and the training set is around 30 GB of images stored in Google Cloud Storage. I have created a Docker container for my training script and everything works according to plan. However, I noticed that the images are copied onto the VM allocated for the job, and copying them from Cloud Storage to the training job takes about an hour and a half. This increases the training time and is not suitable for me, as the number of images keeps growing by at least 1 GB every week. I have previously worked with AWS training jobs, which offered 'pipe' mode to stream input from S3 into a SageMaker training job. Is there a similar service in Google Cloud, or is there a better solution than copying all the files to the VM allocated to the training job?
I am posting this answer with all the information I shared in the comment section.
According to the documentation, it is advisable to use the gcloud tool to upload your application. In this link, you can find more about defining your system's environment variables as well as submitting your training job.
Regarding your comment about Cloud Storage FUSE, it is an open-source adapter which allows you to mount Cloud Storage buckets as file systems on Linux or macOS. Applications can then interact with the mounted bucket like a simple file system, providing virtually limitless storage when running in the cloud. However, in your case it is not recommended, because it has much higher latency than a local file system. For the same reason, it is also not recommended to run a database over Cloud Storage FUSE; you can read more about its limitations here.
Since you reported that your problem is the training time, I would recommend using a custom tier to train your model, increasing the number of workers so that the job runs faster. Alternatively, you can use one of the pre-configured tiers described in the documentation.
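As one way to apply the custom-tier suggestion, here is a minimal sketch of submitting an AI Platform training job with a CUSTOM scale tier through the Google API Python client; the project, bucket, job ID, and machine types are assumptions, not values from the question.

from googleapiclient import discovery

ml = discovery.build("ml", "v1")

job_spec = {
    "jobId": "keras_training_custom_tier_001",
    "trainingInput": {
        "scaleTier": "CUSTOM",
        "masterType": "n1-highmem-8",   # assumed machine types
        "workerType": "n1-highmem-8",
        "workerCount": "4",             # more workers -> shorter wall-clock time
        "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "region": "us-central1",
        "runtimeVersion": "2.1",
        "pythonVersion": "3.7",
    },
}

request = ml.projects().jobs().create(parent="projects/my-project", body=job_spec)
response = request.execute()
print(response.get("state"))  # newly created jobs normally start in QUEUED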
As a bonus, once your model is trained, consider using online predictions over batch predictions. Online predictions are optimized to minimize the latency of the results, even though you pass the data to a cloud-hosted ML model in both cases. Besides that, according to the documentation, a batch prediction can take several minutes, while an online prediction can return results almost instantly.
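For illustration, a minimal sketch of calling the online prediction service through the same API client; the project and model names are hypothetical.

from googleapiclient import discovery

ml = discovery.build("ml", "v1")
response = ml.projects().predict(
    name="projects/my-project/models/my_keras_model",
    body={"instances": [[0.1, 0.2, 0.3]]},  # the instance shape depends on your model
).execute()
print(response["predictions"])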

Is writing files to a gcloud storage bucket supposed to be slow?

I'm using a gcloud storage bucket mounted to a VM instance with gcsfuse. I have no problems opening files and reading them when the files are stored on the storage bucket, but when I try to write files to the storage bucket it is enormously slow and when I say 'enormously' I mean at least 10 times slower if not 100 times. Is it supposed to be that way? If so, I guess I'm going to have to write files to a persistent disk, then upload the files to the storage bucket, then download the files to my personal computer from the storage bucket. Although the process will take the same amount of time, at least the psychological demoralization will not occur.
From Documentation:
Performance: Cloud Storage FUSE has much higher latency than a local file system. As such, throughput may be reduced when reading or writing one small file at a time. Using larger files and/or transferring multiple files at a time will help to increase throughput.
Individual I/O streams run approximately as fast as gsutil.
The gsutil rsync command can be particularly affected by latency because it reads and writes one file at a time. Using the top-level -m flag with the command is often faster.
Small random reads are slow due to latency to first byte (don't run a database over Cloud Storage FUSE!)
Random writes are done by reading in the whole blob, editing it locally, and writing the whole modified blob back to Cloud Storage. Small writes to large files work as expected, but are slow and expensive.
Optionally, check out the gsutil tool, the GCS client libraries, or even the Storage Transfer Service, since they may suit your needs better depending on your specific use case.
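For example, writing an object with the Cloud Storage Python client library bypasses the gcsfuse write path described above. A minimal sketch, with a hypothetical bucket and object name:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("results/output.bin")

# Uploads the local file directly to the bucket; for many small files, running
# several uploads in parallel tends to help throughput.
blob.upload_from_filename("/tmp/output.bin")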
I hope this clarifies your concerns.

Push billions of records spread across CSV files in S3 to MongoDB

I have an S3 bucket which receives almost 14-15 billion records spread across 26,000 CSV files every day.
I need to parse these files and push them to MongoDB.
Previously, with just 50 to 100 million records, I was using bulk upserts with multiple parallel processes on an EC2 instance and it was fine. But since the number of records has increased drastically, the previous method is no longer efficient.
So what will be the best method to do this?
You should look at mongoimport, which is written in Go and can make effective use of threads to parallelize the uploading. It's pretty fast. You would have to copy the files from S3 to local disk prior to importing, but if you put the node in the same region as the S3 bucket and the database, it should run quickly. Also, you could use MongoDB Atlas and its API to turn up the IOPS on your cluster while loading, to speed up the import, and dial it down afterwards.
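A minimal sketch of that approach (the bucket, connection URI, paths, and collection names are hypothetical): pull one CSV from S3 with boto3, then load it with mongoimport using parallel insertion workers.

import subprocess

import boto3

s3 = boto3.client("s3")
s3.download_file("my-source-bucket", "exports/part-00001.csv", "/data/part-00001.csv")

subprocess.run(
    [
        "mongoimport",
        "--uri", "mongodb+srv://user:pass@cluster0.example.mongodb.net/mydb",
        "--collection", "records",
        "--type", "csv",
        "--headerline",                 # use the first CSV row as field names
        "--numInsertionWorkers", "8",   # parallelize inserts
        "--file", "/data/part-00001.csv",
    ],
    check=True,
)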

Running data processing tasks on Google Buckets in GCP

We have a lot of big files (~ gigabytes) in our Google bucket. I would like to process these files and generate new ones. To be specific, these are JSON files, from which I want to extract one field and join some files into one.
I could write some scripts running as pods in Kubernetes, which would connect to the bucket and stream the data from there and back. But I find it ugly - is there something made specifically for data processing in buckets?
Smells like a Big Data problem.
Use Big Data software like Apache Spark to process the huge files. Since the data is already in Google Cloud, I would recommend Google Cloud Dataproc. Also, Big Data on K8S is still a work in progress, so I would recommend leaving K8S aside for now and perhaps adopting Big Data on K8S down the line (more on Big Data on K8S here and here).
With your solution (using K8S and hand-written code), all of the fault tolerance has to be handled manually. With Apache Spark, fault tolerance (a node going down, network failures, etc.) is taken care of automatically.
To conclude, I would recommend forgetting about K8S for now and focusing on Big Data tooling to solve the problem.
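A minimal PySpark sketch (hypothetical bucket paths and field name) of the kind of job that could run on Dataproc: read the JSON files from GCS, keep one field, and write the result back to another bucket.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract-field").getOrCreate()

# Dataproc clusters read gs:// paths directly through the built-in GCS connector.
df = spark.read.json("gs://my-input-bucket/raw/*.json")

# Keep only the field of interest and merge the output into fewer files.
df.select("myField").coalesce(8).write.mode("overwrite").json("gs://my-output-bucket/extracted/")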

Performance benchmarks for attaching read-only disks to google compute engine

Has anyone benchmarked the performance of attaching a singular, read-only disk to multiple Google Compute Engine instances (i.e., the same disk in read-only mode)?
The Google documentation ( https://cloud.google.com/compute/docs/disks/persistent-disks#use_multi_instances ) indicates that it is OK to attach the same disk to multiple instances, and personal experience has shown it to work at a small scale (5 to 10 instances), but soon we will be running a job across 500+ machines (GCE instances). We would like to know how performance scales out as the number of parallel attachments grows, and as the bandwidth of those attachments grows. We currently pull down large blocks of data (read-only) from Google Cloud Storage buckets, and are wondering about the merits of switching to a standard persistent disk configuration. This involves terabytes of data, so we don't want to change course willy-nilly.
One important consideration: It is likely that code on each of the 500+ machines will try to access the same file (400MB) at the same time. How do buckets and attached drives compare in that case? Maybe the answer is obvious - and it would save having to set up a rigorous benchmarking system (across 500 machines) ourselves. Thanks.
Persistent disks on GCE should have consistent performance. Currently that is 12 MB/s and 30 IOPS per 100 GB of volume size for a standard persistent disk:
https://cloud.google.com/compute/docs/disks/persistent-disks#pdperformance
Using the disk from multiple instances should not change its overall performance. It will, however, make it easier to reach those limits, since you don't need to worry about a single instance's maximum read speed. Accessing the same data many times at once might, though. I don't know how either persistent disks or GCS handle that contention.
If it is only a 400 MB file that is in contention, it may make sense to benchmark the fastest method to deliver it separately. One possible solution is to make duplicates of your critical file and have each instance pick one copy at random. This should cause fewer nodes to contend for each file.
Duplicating the critical file means a bigger disk, which therefore also contributes to your I/O performance. If you already intended to increase your volume size for better performance, the copies are free.
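A minimal sketch (hypothetical paths and copy count) of the random-duplicate idea above: each worker opens one of N copies of the shared 400 MB file at random, so fewer nodes contend for any single copy.

import random

NUM_COPIES = 4  # assumed number of duplicates placed on the read-only disk
path = "/mnt/shared-disk/data/shared-input.%d.bin" % random.randrange(NUM_COPIES)

with open(path, "rb") as f:
    payload = f.read()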