How would you go about organizing a process of zipping objects that reside in object storage?
For context, our users sometimes request an extraction of their entire data from the app - think of Twitter's "Download your archive" feature.
Our users are able to upload files, so the extracted data must contain the files stored in object storage (Google Cloud Storage). The requested data must be packed into a single .zip archive.
A naive approach would look like this:
download all files from object storage to disk,
zip all files into an archive,
put the .zip back in object storage,
send the user a link to download the .zip file.
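The naive pipeline above can be sketched in Python. The zip step is real; the download/upload steps are only marked in comments where the google-cloud-storage calls would go, and all bucket/object names are illustrative:

```python
# Sketch of the naive pipeline: everything is buffered in memory/disk.
import io
import zipfile

def zip_files(files: dict) -> bytes:
    """Pack a {filename: content-bytes} mapping into a .zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()

# 1. download all files (e.g. blob.download_as_bytes() per object)
# 2. archive = zip_files(downloaded)
# 3. upload (e.g. bucket.blob("export.zip").upload_from_string(archive))
# 4. send the user a link to the uploaded archive
```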
However, there are multiple disadvantages here:
sometimes the files for even a single user add up to gigabytes,
if the process of zipping is interrupted, it has to start over.
What's a reasonable way to design a process of generating a .zip archive with user files, that originally reside on an object storage?
Unfortunately, your naive approach is the only way because Cloud Storage offers no compute abilities. Archiving files requires compute, memory, and temporary storage.
The key item is to choose a service, such as Compute Engine, that can meet your file processing requirements: multi-gig files, fast processing (compression), and high-speed networking.
Another issue will be the time that it takes to download, zip, and upload. That means using an asynchronous event-based design. Start file processing and notify the user (email, message, web inbox, etc) once the file processing is complete.
You could make the process synchronous and display a progress bar, but that will complicate the design.
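One way to soften the multi-gigabyte concern on the worker VM is to stream: read each object and write it straight into a zip whose output flows back to object storage, so the full archive never sits on local disk. Here is a minimal sketch of just the streaming core; the GCS wiring (passing `blob.open("rb")` readers and a `blob.open("wb")` writer from the google-cloud-storage client) is an assumption noted in the comments, not shown running:

```python
# Sketch: stream (name, file-like) pairs into a zip written to `out`
# without buffering the whole archive. With GCS, `out` would be
# bucket.blob("export.zip").open("wb") and each reader blob.open("rb").
import shutil
import zipfile

def stream_zip(out, entries):
    """Write (name, file-like reader) pairs into a zip on `out`."""
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, reader in entries:
            # ZipFile.open(name, "w") streams one entry at a time.
            with zf.open(name, "w") as entry:
                shutil.copyfileobj(reader, entry)
```

If the worker dies mid-run, only the in-flight archive upload is lost; the per-user source objects are untouched, so the job can be retried.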
Related
I am trying to understand the general architecture and components needed to link metadata with blob objects stored in the cloud, such as Azure Blob Storage or AWS.
Consider an application which allows users to upload blob files to the cloud. With each file there would be a myriad of metadata describing the file, its cloud URL, and perhaps the emails of users the file is shared with.
In this case, the file gets saved to the cloud and the metadata into some type of database somewhere else. How would you go about doing this transactionally, so that it is guaranteed both the file and the metadata were saved? If either one fails, the application would need to notify the user so that another attempt could be made.
There's no built-in mechanism to span transactions across two disparate systems, such as Neo4j/mongodb and Azure/AWS blob storage as you mentioned. This would be up to your app to manage. And how you go about that is really a matter of opinion/discussion.
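One common app-level approach is a compensating action: write the blob first, then the metadata, and delete the blob if the metadata write fails, so you never leave an orphaned blob behind. A minimal sketch with stand-in store/DB interfaces (all names here are hypothetical, not any real SDK's API):

```python
# Sketch of an app-managed "transaction" across blob storage and a DB.
def save_with_metadata(blob_store, db, key, data, meta):
    blob_store.put(key, data)      # step 1: write the blob
    try:
        db.insert(key, meta)       # step 2: write the metadata
    except Exception:
        blob_store.delete(key)     # compensate: no orphaned blob
        raise                      # surface the failure to the user
```

The inverse ordering (metadata first, blob second, delete the row on failure) also works; what matters is that one side always has a compensating cleanup.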
I use gcloud node v0.24 for interacting with Google Cloud Storage. I've encountered an issue where an immediate list after an upload doesn't return all the files that were uploaded.
So the question is
does Bucket#getFiles always list files right after Bucket#upload?
or
is there any delay between upload's callback and when file becomes available (e.g. can be listed, downloaded)?
Note: the answer below is no longer up to date -- GCS object listing is now strongly consistent.
Google Cloud Storage provides strong global consistency for all read-after-write, read-after-update, and read-after-delete operations, including both data and metadata. As soon as you get a success response from an upload message, you may immediately read the object.
However, object and bucket listing is only eventually consistent. Objects will show up in a list call after you upload them, but not necessarily immediately.
In other words, if you know the name of an object that you have just uploaded, you can immediately download it, but you cannot necessarily discover that object by listing the objects in a bucket immediately.
For more, see https://cloud.google.com/storage/docs/consistency.
We've noticed that when trying to GET a shared file (it doesn't matter if you're the one sharing or being shared with) using the REST API, the fileSize field is simply non-existent, which is actually rather troublesome for our use case.
We can certainly download the file via one of the various links in the response, and then detect the size, but that would require processing on our backend, and we'd much prefer to fail out (if need be) on our frontend before reaching that point.
Per the API documentation:
fileSize long The size of the file in bytes. This is only populated for files with content stored in Drive.
Are shared files not technically "stored" in Google Drive?
Thanks all!
-Cory
I would like to synchronize uploads from our own server to our clients' Dropboxes, to which we have full access. Syncing changes on Dropbox is easy because I can use the delta call, but I need a more efficient way to identify and upload changes made locally to Dropbox.
The Sync API would be amazing for this, but I'm not trying to make a mobile app, so the languages with the API are not easily accessible (AFAIK). Is there an equivalent to the Sync API for Python running on a Linux server?
Possible solution:
So far, I was thinking of using anydbm to store string-to-string dictionaries that would hold folder names as the key and the hash generated from the server's metadata call as the value. Then I could query Dropbox, and every time I run into a folder, compare that folder with the metadata in the anydbm store:
if there is a difference, compare the file dates/sizes in the folder and, if there are any subfolders, recurse into them;
if it is the same, skip the folder.
This should save a substantial amount of time compared to the current verification of each and every file, but if there are better solutions, please do let me know.
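The folder-hash cache described above can be sketched with Python's standard dbm module (anydbm's successor in Python 3). Here `get_folder_hash` is a stand-in for whatever call fetches the folder's metadata hash from Dropbox:

```python
# Sketch: persist each folder's last-seen hash in a dbm file and
# yield only the folders whose hash has changed since the last run.
import dbm

def changed_folders(cache_path, folders, get_folder_hash):
    """Yield folder paths whose hash differs from the cached value,
    updating the cache as it goes."""
    with dbm.open(cache_path, "c") as cache:
        for folder in folders:
            h = get_folder_hash(folder)
            key = folder.encode()
            if cache.get(key) != h.encode():
                cache[key] = h
                yield folder
```

Unchanged folders are skipped entirely, which is where the time savings over per-file verification come from.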
I must provide a solution where users can upload files that must be stored together with some metadata, and this may grow really big.
Access to these files must be controlled, so they want me to just store them as DB BLOBs, but I fear PostgreSQL won't handle it well over time.
My first idea was to use some NoSQL DB solution, but I couldn't find any that would replace a good RDBMS and elegantly store files alongside it. Then I thought of just saving these files on disk somewhere the web server won't serve them, naming them after their table ID, and loading them into RAM to serve them with the proper content-type.
Could anyone suggest a better solution for this?
I had the requirement to store many images (with some meta data) and allow controlled access to them, here is what I did.
To the cloud™
I save the image files in Amazon S3. My local database holds the metadata with the S3 location of the file as one column. When an authenticated and authorized user needs to see the file they hit a URL in my system (where the authentication and authorization checks occur) which then generates a pre-signed, expiring URL for the image and sends a redirect back to the browser. The browser is then able to load the image for a given amount of time (as specified in the signature within the URL.)
With this solution I have user level access to the resources and I don't have to store them as BLOBs or anything like that which may grow unwieldy over time. I also don't use MY bandwidth to stream the files to the client and get cheap, redundant storage for them. Obviously the suitability of this solution will depend on the nature of the binary files you are looking to store and your level of trust in Amazon. The world doesn't end if there is a slip and someone sees an image from my system they shouldn't. YMMV.