How to Sync Queryable Metadata with Cloud Blob Storage - mongodb

I am trying to understand the general architecture and components needed to link metadata with blob objects stored in the cloud, such as Azure Blob Storage or AWS.
Consider an application which allows users to upload blob files to the cloud. With each file there would be a myriad of metadata describing the file, its cloud URL, and perhaps the emails of users the file is shared with.
In this case, the file gets saved to the cloud and the metadata goes into some type of database somewhere else. How would you go about doing this transactionally, so that it is guaranteed that both the file and the metadata were saved? If one of the two fails, the application would need to notify the user so that another attempt could be made.

There's no built-in mechanism to span transactions across two disparate systems, such as Neo4j/MongoDB and Azure/AWS blob storage as you mentioned. This would be up to your app to manage, and how you go about that is really a matter of opinion/discussion.
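To make that concrete, here is a minimal sketch of one way an application could manage it itself: insert the metadata first in a "pending" state, upload the blob, then mark the record committed, and compensate if either step fails. It uses the MongoDB Java driver; the connection string, field names, and the uploadToBlobStorage helper are hypothetical placeholders for your own schema and blob-storage SDK.

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import org.bson.types.ObjectId;

public class FileUploadCoordinator {

    private final MongoCollection<Document> files =
            MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("app")
                    .getCollection("files");

    public void upload(String fileName, byte[] content, String ownerEmail) {
        // 1. Record the metadata first, marked as pending.
        ObjectId id = new ObjectId();
        files.insertOne(new Document("_id", id)
                .append("name", fileName)
                .append("owner", ownerEmail)
                .append("status", "pending"));
        try {
            // 2. Upload the blob (hypothetical helper wrapping your Azure/AWS SDK).
            String blobUrl = uploadToBlobStorage(fileName, content);

            // 3. Only now mark the metadata as committed and store the blob URL.
            files.updateOne(Filters.eq("_id", id),
                    Updates.combine(Updates.set("status", "committed"),
                                    Updates.set("url", blobUrl)));
        } catch (Exception e) {
            // 4. Compensate: remove the pending record and surface the failure
            //    so the user can retry.
            files.deleteOne(Filters.eq("_id", id));
            throw new RuntimeException("Upload failed, please try again", e);
        }
    }

    private String uploadToBlobStorage(String name, byte[] content) {
        // Placeholder for the actual blob-storage call.
        throw new UnsupportedOperationException("wire up your storage SDK here");
    }
}

A periodic cleanup job can also sweep "pending" records whose uploads never completed, so the metadata stays trustworthy for queries.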

Related

How to zip objects in an object storage

How would you go about organizing a process of zipping objects that reside in an object storage?
For context, our users sometimes request an extraction of their entire data from the app - think of the "Downloading Twitter archive" feature of Twitter.
Our users are able to upload files, so the extracted data must contain files stored in an object storage (Google Cloud Storage). The requested data must be packed into a single .zip archive.
A naive approach would look like this:
download all files from object storage onto a disk,
zip all files into an archive,
put the .zip back on the object storage,
send a link to download the .zip file back to the user.
However, there are multiple disadvantages here:
sometimes the files for even a single user add up to gigabytes,
if the process of zipping is interrupted, it has to start over.
What's a reasonable way to design a process of generating a .zip archive with user files, that originally reside on an object storage?
Unfortunately, your naive approach is the only way because Cloud Storage offers no compute abilities. Archiving files requires compute, memory, and temporary storage.
The key item is to choose a service, such as Compute Engine, that can meet your file processing requirements: multi-gig files, fast processing (compression), and high-speed networking.
Another issue will be the time that it takes to download, zip, and upload. That means using an asynchronous event-based design. Start file processing and notify the user (email, message, web inbox, etc) once the file processing is complete.
You could make the process synchronous and display a progress bar, but that will complicate the design.
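As a rough sketch of the download-zip-upload step on such a worker, the following assumes the google-cloud-storage Java client and streams each object through a ZipOutputStream directly into a new GCS object, so the worker never has to stage gigabytes on local disk. The bucket names and prefix are placeholders.

import com.google.cloud.ReadChannel;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.io.InputStream;
import java.nio.channels.Channels;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class UserArchiveJob {

    public static void main(String[] args) throws Exception {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        String sourceBucket = "user-files";      // placeholder
        String prefix = "users/12345/";          // placeholder: one user's objects
        BlobInfo archive = BlobInfo.newBuilder("user-archives", "12345.zip").build();

        // Write the .zip straight back to Cloud Storage through a write channel,
        // so nothing needs to be staged on the local disk.
        try (ZipOutputStream zip =
                     new ZipOutputStream(Channels.newOutputStream(storage.writer(archive)))) {
            // Stream every object under the prefix into the archive, one entry at a time.
            for (Blob blob : storage.list(sourceBucket, Storage.BlobListOption.prefix(prefix)).iterateAll()) {
                zip.putNextEntry(new ZipEntry(blob.getName()));
                try (ReadChannel reader = blob.reader();
                     InputStream in = Channels.newInputStream(reader)) {
                    in.transferTo(zip); // Java 9+: copy without buffering the whole object
                }
                zip.closeEntry();
            }
        }
        // The .zip now exists in GCS; notify the user with a (signed) download link.
    }
}

Because each user's archive is an independent job, an interrupted run can simply be restarted for that user, and the notification step (email, message, web inbox) fits naturally at the end of the worker.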

How to automatically create a new file based on an existing file within Google Cloud Storage

This is the first time I've used Google Cloud, so I might be asking the question in the wrong place.
An information provider uploads a new file to Google Cloud Storage every day.
The file contains the information for all my clients/departments.
I have to sort through the information and create a new file (or files) containing the relevant information for each department in my company, so that everyone gets only the information relevant to them (security).
I can't figure out what steps I need to follow to complete the task.
Can you help me?
You want to have a process that starts automatically and subsequently generates a new file once you upload something to Google Cloud Storage.
The easiest way to handle this is with Object Change Notifications. You can set up Object Change Notifications per bucket, and this will send a POST request to a URL that you define.
You can then easily set up a server (or run it on App Engine) that executes an action based on the POST request it receives, as in the sketch below.
There is an even simpler option (although still in alpha) named Cloud Functions. Cloud Functions is a serverless service that provides event-based microservices (e.g. 'do this' when a new file is uploaded to GCS). This means you only have to write the code that defines what needs to happen when a new file is uploaded, and Cloud Functions will take care of executing it when you upload a file to GCS. See the tutorial on using Cloud Functions with Google Cloud Storage.
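As a rough sketch of the first option, here is what an App Engine servlet receiving those POST notifications might look like in Java. The header name, state values, and JSON fields are my recollection of the Object Change Notification format, and splitFilePerDepartment is a hypothetical placeholder for your own logic, so verify them against the current documentation.

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

@WebServlet("/gcs-notification")
public class GcsNotificationServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // GCS describes the change in this header: "sync", "exists" or "not_exists".
        String state = req.getHeader("X-Goog-Resource-State");
        if ("exists".equals(state)) {
            // The request body is the JSON resource of the uploaded/changed object.
            JsonObject object = JsonParser.parseReader(req.getReader()).getAsJsonObject();
            String bucket = object.get("bucket").getAsString();
            String name = object.get("name").getAsString();
            splitFilePerDepartment(bucket, name); // hypothetical: your per-department split
        }
        resp.setStatus(HttpServletResponse.SC_OK); // acknowledge so the notification is not retried
    }

    private void splitFilePerDepartment(String bucket, String name) {
        // Read the object, filter its contents per department, and write one output object per department.
    }
}

With Cloud Functions the handler body would be essentially the same, minus the servlet plumbing.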

Google Cloud Platform - Data Distribution

I am trying to figure out a proper solution for the following:
We have a client from which we want to receive data, for instance a binary of about 200 MB that is updated daily. We want them to deposit that data file(s) onto a local server near them (Europe).
We then want to do one of the following:
We want to retrieve the data, either from a local server where we are (China/HK), or
we can log into their European server where they have deposited the files and pull the files directly ourselves.
QUESTIONS:
Can Google's cloud platform serve as a secure, easy way to provide a cloud drive on which to store and pull the data file?
Does Google's cloud platform distribute data such that files pushed onto a server in Europe will be mirrored on a server in East Asia? (That is, where and how would this distribution model work with regard to my example?)
For storing binary data, Google Cloud Storage is a fine solution. To answer your questions:
Secure: yes. Easy: yes, in that you don't need to write different code depending on your location, but there is a caveat on performance.
Google Cloud Storage replicates files for durability and availability, but it doesn't mirror files across all bucket locations. So for the best performance, you should store the data in a bucket located where you will access it the most frequently. For example, if you create the bucket and choose its location to be Europe, transfers to your European server will be fast but transfers to your HK server will be slow. See the Google Cloud Storage bucket locations documentation for details.
If you need frequent access from both locations, you could create one bucket in each location and keep them in sync with a tool like gsutil rsync.
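For example, a scheduled job on either side could mirror the buckets with a command along these lines (bucket names are placeholders; -r recurses into subdirectories):

gsutil rsync -r gs://my-bucket-europe gs://my-bucket-asia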

Using Google storage bucket as on-demand pull CDN

I am trying to use a Google Cloud Storage bucket to serve static files from a web server on GCE. I see in the docs that I have to copy files manually, but I am searching for a way to dynamically copy files on demand, just like other CDN services. Is that possible?
If you're asking whether Google Cloud Storage will automatically and transparently cache frequently-accessed content from your web server, then the answer is no, you will have to copy files to your bucket explicitly yourself.
However, if you're asking if it's possible to copy files dynamically (i.e., programmatically) to your GCS bucket, rather than manually (e.g., via gsutil or the web UI), then yes, that's possible.
I imagine you would use something like the following process:
# pseudocode, not actual code in any language
HandleRequest(request) {
    gcs_uri = computeGcsUrlForRequest(request)
    if exists(gcs_uri) {
        data = read(gcs_uri)
        return data to user
    } else {
        new_data = computeDynamicData(request)
        # important! serve data to user first, to ensure low latency
        return new_data to user
        storeToGcs(new_data)  # asynchronously, don't block the request
    }
}
If this matches what you're planning to do, then there are several ways to accomplish this, e.g.,
language-specific libraries (recommended)
JSON API
XML API
Note that to avoid filling up your Google Cloud Storage bucket indefinitely, you should configure a lifecycle management policy to automatically remove files after some time or set up some other process to regularly clean up your bucket.
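As a rough sketch of that flow with the Java client library (google-cloud-storage), where the bucket name, computeGcsNameForRequest, and computeDynamicData are placeholders for your own application logic:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.util.concurrent.CompletableFuture;

public class PullThroughCache {

    private final Storage storage = StorageOptions.getDefaultInstance().getService();
    private final String bucket = "my-static-cache"; // placeholder bucket name

    public byte[] handleRequest(String requestPath) {
        String objectName = computeGcsNameForRequest(requestPath);

        // Serve from the bucket if this content has already been cached.
        Blob cached = storage.get(bucket, objectName);
        if (cached != null) {
            return cached.getContent();
        }

        // Otherwise generate it, return it immediately, and store it asynchronously
        // so the upload does not add latency to this request.
        byte[] newData = computeDynamicData(requestPath);
        CompletableFuture.runAsync(() ->
                storage.create(BlobInfo.newBuilder(bucket, objectName).build(), newData));
        return newData;
    }

    private String computeGcsNameForRequest(String requestPath) {
        return requestPath.replaceAll("^/+", ""); // hypothetical mapping from URL path to object name
    }

    private byte[] computeDynamicData(String requestPath) {
        return ("generated for " + requestPath).getBytes(); // placeholder content generation
    }
}

Combined with the lifecycle policy mentioned above, this gives a simple pull-through cache in front of the dynamic backend.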

Adding to user metadata after an object has been created in Google Cloud Storage

I want to add user metadata that is calculated from the stream as it is uploaded.
I am using the Google Cloud Storage Client from inside a Servlet during a file upload.
The only solutions I have come up with and tried are not really satisfactory, for a couple of reasons.
Buffer the stream in memory, calculate the metadata as the stream is buffered, then write the stream out to Cloud Storage after it has been completely read.
Write the stream to a temporary bucket and calculate the metadata, then read the object from the temporary bucket and write it to its final location with the calculated metadata.
Pre-calculate the metadata on the client and send it with the upload.
Why these aren't acceptable:
Doesn't work for large objects, which some of these will be.
Will cost a fortune if lots of objects are uploaded, which there will be.
Can't trust the clients, and some of the clients can't calculate some of what I need.
Is there any way to update the metadata of a Google Cloud Storage object after the fact?
You are likely using the Google Cloud Storage Java Client for App Engine library. This library is great for App Engine users, but it offers only a subset of the features of Google Cloud Storage. To my knowledge, it does not support updating the metadata of existing objects. However, Google Cloud Storage definitely supports this.
You can use the Google API Java client library, which exposes Google Cloud Storage's JSON API. With this library, you'll be able to use the storage.objects.update method or the storage.objects.patch method, both of which can update metadata (the difference is that update replaces the object's existing writable properties wholesale, while patch changes only the specified fields). The code would look something like this:
// Build the metadata to apply to the existing object.
StorageObject objectMetadata = new StorageObject()
    .setName("OBJECT_NAME")
    .setMetadata(ImmutableMap.of("key1", "value1", "key2", "value2"));
// patch() takes the bucket name, the object name, and the fields to change.
Storage.Objects.Patch patchObject =
    storage.objects().patch("mybucket", "OBJECT_NAME", objectMetadata);
patchObject.execute();