Delta for file upload - dropbox-api

I would like to synchronize uploads from our own server to our clients' dropboxes to which we have full access. syncing changes on dropbox is easy because i can use the delta call, but I need a more efficient way to identify and upload changes made locally to dropbox.
The sync api would be amazing for this but I'm not trying to make a mobile app so the languages with the api are not easily accessible (AFAIK). Is there an equivalent to the sync api for python running on a linux server?
Possible solution:
So far, I was thinking of using anydbm to store string,string dictionaries that would hold folder names as the key and the hash generated from the metadata call from the server. then I could query the dropbox and every time I run into a folder, I will check the folder compared with the metadata on the anydbm.
if there is a difference, compare the file dates/sizes in the folder and if there are any subfolders, recurse the function into them,
if it the same, skip the folder.
This should save a substantial amount of time compared to the current verification of each and every file, but if there are better solutions, please do let me know.

Related

How to zip objects in an object storage

How would you go about organizing a process of zipping objects that reside an object storage?
For context, our users sometimes request an extraction of their entire data from the app - think of "Downloading Twitter archive" feature of Twitter.
Our users are able to upload files, so the extracted data must contain files stored in a object storage (Google Cloud Storage). The requested data must be packed into a single .zip archive.
A naive approach would look like this:
download all files from object storage on a disk,
zip all files into an archive,
put it .zip back on an object storage,
send a link to download the .zip file back to user.
However, there are multiple disadvantages here:
sometimes files for even single user add up to gigabytes,
if the process of zipping is interrupted, it has to start over.
What's a reasonable way to design a process of generating a .zip archive with user files, that originally reside on an object storage?
Unfortunately, your naive approach is the only way because Cloud Storage offers no compute abilities. Archiving files requires compute, memory, and temporary storage.
The key item is to choose a service, such as Compute Engine, that can meet your file processing requirements: multi-gig files, fast processing (compression), and high-speed networking.
Another issue will be the time that it takes to download, zip, and upload. That means using an asynchronous event-based design. Start file processing and notify the user (email, message, web inbox, etc) once the file processing is complete.
You could make the process synchronous and display a progress bar, but that will complicate the design.

Best DB solution for storing large files

I must provide a solution where user can upload files and they must be stored together with some metadata, and this may grow really big.
Access to these files must be controlled, so they want me to just store them in DB BLOBs, but I fear PostgreSQL won't handle it properly over time.
My first idea was use some NoSQL DB solution, but I couldn't find any that would replace a good RDBMS and elegantly store files together. Then I thought on just saving these files in HD somewhere WebServer won't serve them, name them their table ID, and just load them on RAM and print them with proper content-type.
Could anyone suggest me any better solution for this?
I had the requirement to store many images (with some meta data) and allow controlled access to them, here is what I did.
To the cloud™
I save the image files in Amazon S3. My local database holds the metadata with the S3 location of the file as one column. When an authenticated and authorized user needs to see the file they hit a URL in my system (where the authentication and authorization checks occur) which then generates a pre-signed, expiring URL for the image and sends a redirect back to the browser. The browser is then able to load the image for a given amount of time (as specified in the signature within the URL.)
With this solution I have user level access to the resources and I don't have to store them as BLOBs or anything like that which may grow unwieldy over time. I also don't use MY bandwidth to stream the files to the client and get cheap, redundant storage for them. Obviously the suitability of this solution will depend on the nature of the binary files you are looking to store and your level of trust in Amazon. The world doesn't end if there is a slip and someone sees an image from my system they shouldn't. YMMV.

Amazon S3 getDate() API Call?

Is there a way (API call) to know the current time on an Amazon S3 server?
Here is a bit of background to explain why I need this:
I have an iphone app that sometimes has to download a set of files from a bucket on a Amazon AWS S3 account.
Between two such downloads, the server files may be modified by a CMS (Web Content Management System), or not.
So, when a second download occurs, The client app tries to be efficient by downloading only the files that have been modified on the server since the previous such download.
To achieve this, the app stores the date of the last download and when a new download occurs, it just focuses on the files that have been modified on the server since the date of the last download (using there “modified date” property accessible using the SDK listObjects() function).
The problem with this is that the date on the phone and the modified dates on the s3 server may not be compatible. The phone user may have changed his phone date & time settings, etc.
To make this work, the saved “last download date” should come from an Amazon S3 API call to make sure all dates used by the app logic are in sync.
Is there such thing? Or maybe an alternative or a workaround?
You could use a file hash instead of the modified date. An Amazon S3Object has an etag property that is indeed such kind of hash. You retrieve this property the same way as you access date.
Have your client device save this hash along with the file. The next time you connect to the server, ask for the etag using the method about and compare the returned value to your local copy.
A different etag value will indicate to the client that the file has changed since the last download. This approach would be completely independent of any datetime functionality.

Performing Get Copy All Operation With Microsoft Sync Framework

I'm testing out Microsoft Sync Framework to try and see if it'll be suitable for a task that I'm working on. One of the things I'd like to be able to do is to have the option to not just send changed files, but instead to send all of the files (for example, if I'm syncing to a client machine for the first time, and so want to send all files).
I can't seem to find an example of this in the documentation, so any advice would be welcome.
if you're synching for the first time, then there is nothing special to configure as it will sync everything.
if you've already synched and want to re-send all files regardless of whether they've changed or not, just delete the metadata file and that should remove all knowledge of what has been synched.

Looking for a version control based backup tool

I'm traveling all the time (every 2-3 months, I'm in a new city or country), with no real permanent address. I've managed to work out all the kinks over the last couple of years...except having a good backup/sync solution.
I have a macbook pro & a thinkpad w701 (which runs two different VMs). It's a pain in the ass because making changes on one machine (such as adding some new music or updating some presentations) requires me to keep track of what changed where. And then every couple of weeks, after syncing the three different images, I try to manually sync it out to a backup drive that I carry around.
It's pretty much the most annoying thing ever...especially when I sometimes make changes on the backup drive and I have to remember not to override them.
What I'd really like is something simple that has more of a version control like workflow:
I can push out changes to some
central server (like a commit.
Example: I add some changes to my
music directory and then I can just
commit those changes to backup)
Before the backup happens, I'd like to see a "diff": what files will be
overridden, which one's newer, etc
I can access my files off the server (if I'm making an audio mix and need
to pull out some songs, I'd like to get them from the server. All the
backups can't just be one big binary
compressed zip blob)
Dropbox comes pretty close but it lacks the "commit" & "diff" functionality. I thought about using Amazon AWS but that falls short because I can't see diffs and can't access my files directly off aws.
Any ideas? Or any other solutions? I guess what I'd really like is TimeMachine in the cloud or maybe even a NAS that's securely accessible through the internet
You might want to use rsync. It's a unix synchronization tool you can use it on windows and unix variants (including Mac OS X). It uses delta copying to minimize transfer and hardlinks to minimize backup size.
You can access all files in every backup as though they were normal files. Diffing can be done using traditional tools. It is all command-line based so if you don't want that you will need to find GUI tools, but I don't know which you could use.
You would need a server with a rsync deamon/service. I don't know if there are providers for it but you can set up your own VPS starting at a few dollars a month.
Have you looked at Amazon S3? S3 is data storage mechanism and there are bunch of tools to "sync" your local directory with S3. Some of the tools are:
http://www.vinodlive.com/2007/08/20/amazon-s3-storage-tools/
Out of these , S3Sync should do what you are looking for, I.e. submit only changed files and a mode that would tell you changes before submitting them.