I've got a customer requirement to replicate a multiregion storage bucket (Mr Bucket A --> Mr Bucket B) so that every new object gets copied. Would Cloud Functions be the way to go here?
Cloud Storage multi-region buckets already replicate data across regions, but only within a single bucket. If you need every new object copied from one bucket to another, a Cloud Function triggered on object creation that copies each new file may be the most practical mechanism.
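A minimal sketch of that trigger-and-copy approach, assuming a function deployed with a `google.cloud.storage.object.finalize` trigger on bucket A; the destination bucket name and the guard helper are illustrative assumptions:

```python
# Hedged sketch, not an official recipe: replicate each new object from the
# trigger bucket into DEST_BUCKET. Bucket name is a placeholder assumption.
DEST_BUCKET = "mr-bucket-b"  # hypothetical destination bucket

def should_replicate(name: str) -> bool:
    """Simple guard: skip zero-byte "folder" marker objects ending in '/'."""
    return not name.endswith("/")

def replicate_object(event, context):
    """Cloud Functions entry point for an object.finalize event."""
    from google.cloud import storage  # needs the google-cloud-storage package
    if not should_replicate(event["name"]):
        return
    client = storage.Client()
    src_bucket = client.bucket(event["bucket"])
    dest_bucket = client.bucket(DEST_BUCKET)
    # copy_blob is a server-side copy; the object bytes never pass through
    # the function instance itself.
    src_bucket.copy_blob(src_bucket.blob(event["name"]), dest_bucket, event["name"])
```

Because the copy happens server-side, the function stays fast and cheap even for large objects.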
If I want to use an S3 bucket as a data store/path for my MongoDB instance, how can something like this be implemented?
Note: I want this to work with every S3-compatible storage provider (Wasabi, DigitalOcean Spaces), not just AWS S3.
Thanks.
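The provider-portability part of the question usually comes down to boto3's `endpoint_url` override, which lets the same client code talk to any S3-compatible store. A minimal sketch; the endpoint URLs below are illustrative assumptions and should be checked against each provider's documentation:

```python
# Hedged sketch: reach any S3-compatible store with boto3 by overriding
# endpoint_url. The endpoint values here are assumptions; verify them in
# your provider's docs.
ENDPOINTS = {
    "aws": None,  # None lets boto3 use the default AWS endpoint
    "wasabi": "https://s3.wasabisys.com",
    "digitalocean": "https://nyc3.digitaloceanspaces.com",
}

def endpoint_for(provider: str):
    """Return the endpoint URL for a named provider (None = AWS default)."""
    if provider not in ENDPOINTS:
        raise ValueError(f"unknown provider: {provider}")
    return ENDPOINTS[provider]

def make_client(provider: str, key: str, secret: str):
    """Build a boto3 S3 client pointed at the chosen provider."""
    import boto3  # requires the boto3 package
    return boto3.client(
        "s3",
        endpoint_url=endpoint_for(provider),
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )
```

Everything else (bucket/object calls) stays identical across providers, since they all speak the S3 API.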
I have an API where users can query some time-series data. But now I want to make the entire dataset available for users to download for their own use. How would I go about doing something like this? I have RDS and an EC2 instance set up. What would my next steps be?
In this scenario, and with no other data or restrictions given, I would put an S3 bucket at the center of this process.
Create an S3 bucket to hold the database/dataset dumps.
Dump the database/dataset to S3 (for example, from a Docker container or a Lambda function).
Transform the dataset to CSV manually, or use a Lambda triggered on every dataset dump (I'm not sure pg_dump can produce CSV out of the box).
Host those datasets in a bucket accessible to your users and grant access as appropriate for each case:
You can create a publicly available bucket and share its HTTP URL.
You can create a pre-signed URL to allow limited access to your dataset.
S3 is proposed since it's cheap and there is a lot of readily available tooling to work with it.
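The pre-signed URL option above can be sketched as follows; the bucket layout and key-naming helper are assumptions for illustration:

```python
# Hedged sketch: hand out time-limited pre-signed URLs for dataset dumps
# instead of making the bucket public. Bucket/key names are assumptions.
from datetime import date

def dump_key(day: date) -> str:
    """Deterministic object key for a daily CSV dump."""
    return f"dumps/{day.isoformat()}.csv"

def presigned_url(bucket: str, key: str, expires: int = 3600) -> str:
    """Generate a pre-signed GET URL valid for `expires` seconds."""
    import boto3  # requires the boto3 package
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```

Your API can then return such a URL from an authenticated endpoint, so only your users get download access while S3 serves the bytes.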
I have two sources:
A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5 MB, and the updates will take place once weekly.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, since you don't have a big volume of data. Dataflow is more complex and more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure the trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easy, with a lot of native integrations. It also has a very interesting free tier.
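A rough sketch of the whole pipeline described above, assuming a Cloud Function on a storage `object.finalize` trigger; the scraper stub, entity kind, and field names are all placeholder assumptions:

```python
# Hedged sketch: on CSV upload, run the scraper, merge both sources into
# JSON-ready records, and save them to Datastore. Names are assumptions.
import csv
import io

def run_scraper() -> dict:
    """Placeholder for the Python scraping step; replace with your scraper."""
    return {"scraped_at": "2024-01-01"}  # hypothetical field

def merge_sources(csv_text: str, scraped: dict) -> list:
    """Merge each CSV row with the scraped fields into one dict per record."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{**row, **scraped} for row in rows]

def on_csv_upload(event, context):
    """Entry point for a google.cloud.storage.object.finalize trigger."""
    # Needs the google-cloud-storage and google-cloud-datastore packages.
    from google.cloud import storage, datastore
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    merged = merge_sources(blob.download_as_text(), run_scraper())
    ds = datastore.Client()
    for i, record in enumerate(merged):
        entity = datastore.Entity(ds.key("MergedRecord", f"{event['name']}-{i}"))
        entity.update(record)
        ds.put(entity)
```

At 5 MB once a week this comfortably fits within Cloud Functions limits, and likely within the Datastore free tier as well.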
Can you create something similar to a folder in mongoose(or perhaps MongoDB)?
I've tried creating separate databases for each new so-called "folder" but it gets a bit tedious after a while.
MongoDB does not store data in folders. It stores documents, since it is a document-oriented datastore.
If you want a database or storage option that resembles folders, you might have to look into an object store such as AWS S3 (cloud) or MinIO (local).
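Instead of separate databases per "folder", one common workaround is a single collection with a path field, the same way S3 fakes folders with key prefixes. A sketch using pymongo-style calls; collection and field names are assumptions:

```python
# Hedged sketch: emulate folders in one MongoDB collection via a "path"
# field, mirroring how S3 uses key prefixes. Names are assumptions.
import re

def folder_filter(folder: str) -> dict:
    """MongoDB query filter matching documents directly inside `folder`."""
    prefix = folder.rstrip("/") + "/"
    # Anchor at the prefix and forbid deeper slashes (direct children only).
    return {"path": {"$regex": f"^{re.escape(prefix)}[^/]+$"}}

def save_file(db, folder: str, name: str, data: bytes):
    """Store one 'file' document under its folder path (pymongo Database)."""
    db.files.insert_one({"path": f"{folder.rstrip('/')}/{name}", "data": data})
```

Listing a "folder" is then `db.files.find(folder_filter("reports/"))`, with no per-folder database to manage.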
I have some datasets in Google Cloud Storage. I could find out how to append more data to a dataset, but if I want to merge datasets (insert else update), how do I do it?
One option I have is Hive's INSERT OVERWRITE. Is there any better option?
Is there any option in the Google Cloud Storage API itself?
Maybe this could be helpful: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
Objects: compose
Concatenates a list of existing objects into a new object in the same bucket.
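A sketch of calling that compose API via the Python client; bucket and object names are assumptions. Note that compose only concatenates bytes, so it appends rather than doing a true "insert else update" merge:

```python
# Hedged sketch: server-side concatenation of existing objects into one new
# object with GCS compose. All names here are placeholder assumptions.
def batches(names: list, size: int = 32) -> list:
    """Split source names into compose-sized batches (32-object API limit)."""
    return [names[i:i + size] for i in range(0, len(names), size)]

def compose_objects(bucket_name: str, sources: list, dest_name: str):
    """Concatenate up to 32 source objects into dest_name, server-side."""
    from google.cloud import storage  # needs the google-cloud-storage package
    bucket = storage.Client().bucket(bucket_name)
    dest = bucket.blob(dest_name)
    dest.compose([bucket.blob(n) for n in sources])
    return dest
```

For more than 32 sources, compose intermediate objects batch by batch, then compose the intermediates.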
GCS treats your objects (files) as blobs; there are no built-in GCS operations on the text inside your objects. There is, however, an easier way to do what you are doing.
App Engine-hosted MapReduce provides built-in adapters for GCS. You can find the example code in this repo.