Google storage operations extremely slow when using Customer Managed Encryption Key - google-cloud-storage

We're planning on switching from Google-managed keys to our own keys (we work with sensitive medical data), but we're struggling with the performance degradation when we turn on CMEK. Our application moves many big files (5-200GB) around Cloud Storage, both with the Java Storage API and with gsutil. The former stops working even on 2GB files (it times out, and when timeouts are raised it silently fails to copy the files), and the latter takes about 100x longer.
Any insights into this behaviour?

When using CMEK, you are actually adding a layer of encryption on top of Google-managed encryption keys, not replacing them. As for gsutil, if your move involves validating the objects' hashes, gsutil performs an additional operation per object, which might explain why moving the big files takes much longer than usual.
As a workaround, you may instead use resumable uploads. This type of upload works best with large files, since it uploads the data in multiple chunks and lets you resume the operation even if the flow of data is interrupted.
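For the Java side, here is a minimal sketch of a chunked, writer-based upload with a CMEK attached, assuming a recent google-cloud-storage client that exposes Storage.BlobWriteOption.kmsKeyName; the bucket, object, file path and key names are placeholders. The writer() API performs a resumable upload under the hood, so the large transfer is sent in chunks rather than as one request.

import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CmekResumableUpload {
  public static void main(String[] args) throws Exception {
    // Placeholder names for illustration only.
    String bucket = "my-bucket";
    String object = "big-file.bin";
    String kmsKey = "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key";

    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(bucket, object)).build();

    // writer() uses a resumable upload; kmsKeyName attaches the CMEK to the new object.
    try (WriteChannel writer =
             storage.writer(blobInfo, Storage.BlobWriteOption.kmsKeyName(kmsKey));
         InputStream in = Files.newInputStream(Paths.get("/data/big-file.bin"))) {
      writer.setChunkSize(16 * 1024 * 1024); // send the file in 16 MiB chunks
      byte[] buffer = new byte[16 * 1024 * 1024];
      int read;
      while ((read = in.read(buffer)) >= 0) {
        writer.write(ByteBuffer.wrap(buffer, 0, read));
      }
    }
  }
}

For the gsutil path, note that gsutil already switches to resumable uploads above a size threshold, so the bigger question there is usually whether per-object hash validation is what is eating the time.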

Related

Best DB solution for storing large files

I must provide a solution where users can upload files that are stored together with some metadata, and this may grow really big.
Access to these files must be controlled, so they want me to just store them in DB BLOBs, but I fear PostgreSQL won't handle it properly over time.
My first idea was to use some NoSQL DB solution, but I couldn't find one that would replace a good RDBMS while also storing files elegantly. Then I thought about just saving these files on disk somewhere the web server won't serve them directly, naming them after their table ID, loading them into RAM and serving them with the proper content type.
Could anyone suggest a better solution for this?
I had the requirement to store many images (with some metadata) and allow controlled access to them; here is what I did.
To the cloud™
I save the image files in Amazon S3. My local database holds the metadata with the S3 location of the file as one column. When an authenticated and authorized user needs to see the file they hit a URL in my system (where the authentication and authorization checks occur) which then generates a pre-signed, expiring URL for the image and sends a redirect back to the browser. The browser is then able to load the image for a given amount of time (as specified in the signature within the URL.)
With this solution I have user level access to the resources and I don't have to store them as BLOBs or anything like that which may grow unwieldy over time. I also don't use MY bandwidth to stream the files to the client and get cheap, redundant storage for them. Obviously the suitability of this solution will depend on the nature of the binary files you are looking to store and your level of trust in Amazon. The world doesn't end if there is a slip and someone sees an image from my system they shouldn't. YMMV.
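As an illustration of the redirect step, here is a minimal sketch using the AWS SDK for Java (v1); the bucket and key names are made up, and in a real app the authorization check and the expiry policy would live in your controller.

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.net.URL;
import java.util.Date;

public class PresignedImageUrl {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    // The URL stops working five minutes after it is issued.
    Date expiration = new Date(System.currentTimeMillis() + 5 * 60 * 1000);
    // Hypothetical bucket/key; in practice these come from your metadata table.
    URL url = s3.generatePresignedUrl("my-image-bucket", "images/42.jpg",
        expiration, HttpMethod.GET);
    System.out.println(url); // send this back as a 302 redirect to the authorized user
  }
}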

Any optimizations in reducing the number of disk accesses for inode number lookup by web-servers?

Web servers typically have a document root denoting the filesystem subtree visible via the web. For example, if the document root is /home/foouser/public_html/, the web server maps a request for http://www.foo.com/pics/foo.jpg to /home/foouser/public_html/pics/foo.jpg. This results in a series of disk requests to obtain the inode number of foo.jpg.
Do web servers do any optimizations to reduce the number of disk accesses, or is it the role of the server admin to set the document root as close to "/" as possible to reduce the number of disk accesses in the filename-to-inode-number translation?
I know this isn't directly the answer to your question, but by setting up a caching strategy you can drastically reduce disk reads. Especially if your static content is not hosted on your server.
Options:
Host static content on a CDN:
Pros: Off-load all load onto someone else's network. Cost?
Cons: Potentially less control. Cost?
Use Contendo/Akamai, which is also a CDN, but with some differences.
Pros: Host your content, but after the first read the cdn will handle caching based on the headers you send with your content (static or not)
Cons: Sometimes headers are really annoying to manage. Cache busting (breaking your own cache) can be annoying to handle when you want to replace old content.
Cache things locally. If you are making a DB request, for instance, you can cache the result. The next time your code runs, check your in-memory cache first (as opposed to making a DB request immediately); a sketch of this follows below. You could also cache entire pages, then at the application controller/route level check whether there is a cached version of the page/asset and serve that.
Pros: Lots of control. You can cache almost anything.
Cons: A ton of work to set up caching on every little thing. You need a strategy for every part of your website.
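As a rough sketch of the "check your in-memory cache first" idea from the last option (the class name and the 60-second TTL are arbitrary):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal time-based cache: serve from memory while fresh, otherwise run the DB loader.
public class QueryCache<T> {
  private static final long TTL_MS = 60_000; // arbitrary 60-second freshness window

  private static final class Entry<V> {
    final V value;
    final long storedAt;
    Entry(V value, long storedAt) { this.value = value; this.storedAt = storedAt; }
  }

  private final Map<String, Entry<T>> entries = new ConcurrentHashMap<>();

  public T get(String key, Supplier<T> dbLoader) {
    Entry<T> cached = entries.get(key);
    if (cached != null && System.currentTimeMillis() - cached.storedAt < TTL_MS) {
      return cached.value; // cache hit: no DB round-trip
    }
    T fresh = dbLoader.get(); // cache miss or expired: hit the database once
    entries.put(key, new Entry<>(fresh, System.currentTimeMillis()));
    return fresh;
  }
}

A route handler would then call something like cache.get("recent_posts", () -> loadRecentPosts()) instead of querying on every request (loadRecentPosts being whatever your own DB call is).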
My recommendation is to start out by moving your assets to Amazon S3 or Rackspace or something similar. Joyent has an offering for this as well. You could then enable CloudFront for S3, which turns on the CDN and caches content in various regions. This is a really cheap solution (depending on the number of files you have).
You could also go the Contendo route.
The application-side caching route takes quite a bit of work and depends entirely on your server/language/DB/configuration.

Apple "Avoid writing cache files to disk." - where should I save cache?

In Apple's Performance Tuning Guide there is the following advice:
Avoid writing cache files to disk. The only exception to this rule is when your app quits and you need to write state information that can be used to put your app back into the same state when it is next launched.
I'm saving a lot of cache files in the Library/Caches directory, because my app deals with web services and nobody likes a white screen. What does this statement mean? That I shouldn't do this, or what?
Thank you!
Well, "avoid" means "avoid if possible, because writing/reading is relatively slow". If by caching a small amount of data (I assume the definitions of the web services retrieved from somewhere?) you can improve the performance of your app's startup, by all means do it. If you are only using this data for one run of your application, and the next run will re-fetch this anyway, use an in-memory cache.
Library/Caches is basically designed to store data you fetched from somewhere, which gives a performance boost when it is available locally.
The text from Apple feels more like a general guideline against overusing storage when you don't need data to persist from one run of your application to another.

Store files on disk or MongoDB

I am creating a mongodb/nodejs blogging system (similar to wordpress).
I currently have the images being saved on disk with a pointer placed in Mongo. Since I already store all sessions in MongoDB to enable easy load balancing across servers, I was wondering whether storing the actual files in Mongo would also be a smart idea for easy multi-server setups and/or performance gains.
If everything is stored in a DB, you can simply spawn more web servers and/or Mongo replicas to scale horizontally.
Opinions?
MongoDB is a good option to store your files (I'm talking about GridFS), especially for the use case you described above.
When you store files in MongoDB (GridFS, not documents), you get all the replication and sharding capability for free, which is awesome.
If you have to spawn a new server and the files are already in MongoDB, all you have to do is enable replication (and thus scale horizontally). I'm sure this can save you a lot of headaches.
Resources:
Is GridFS fast and reliable enough for production?
http://www.mongodb.org/display/DOCS/GridFS
http://www.coffeepowered.net/2010/02/17/serving-files-out-of-gridfs/
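Your stack is Node, but the GridFS flow looks much the same in any driver; here is a sketch using the MongoDB Java driver (mongodb-driver-sync), with placeholder connection string, database, bucket and file names.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import org.bson.types.ObjectId;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GridFsExample {
  public static void main(String[] args) throws Exception {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoDatabase db = client.getDatabase("blog");          // placeholder database name
      GridFSBucket images = GridFSBuckets.create(db, "images");

      // Store the file; GridFS splits it into chunks that replicate like any other data.
      ObjectId fileId;
      try (InputStream in = Files.newInputStream(Paths.get("header.png"))) {
        fileId = images.uploadFromStream("header.png", in);
      }

      // Later, stream it back out (e.g. into an HTTP response).
      try (OutputStream out = Files.newOutputStream(Paths.get("header-copy.png"))) {
        images.downloadToStream(fileId, out);
      }
    }
  }
}

The returned ObjectId is what you would keep in the post document in place of the current filesystem pointer.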
Aside from GridFS, you might be considering a cloud-based deployment. In that case, you might consider storing files in cloud-specific storage (Windows Azure has Blob Storage, for example). Sticking with Windows Azure for this example (since that's what I work with), you'd reference a file by its storage account URI. For example:
https://mystorageacct.blob.core.windows.net/mycontainer/myvideo.wmv
Since you'd be storing the MongoDB database itself in its own blob (mounted as a disk volume on your Linux or Windows VM), you could then choose to store your files in either the same storage account or a completely different storage account (with each storage account providing 200TB of storage).
Storing the image as a document in MongoDB would be a bad idea, as the resources that could have been used to send a large amount of informational data would be spent sending files.
Have a look at MongoDB's file storage, GridFS; that might solve your problem of storing images while providing horizontal scalability as well.
http://www.mongodb.org/display/DOCS/GridFS
http://www.mongodb.org/display/DOCS/GridFS+Specification

Best strategy for synching data in iPhone app

I am working on a regular iPhone app which pulls data from a server (XML, JSON, etc.), and I'm wondering what the best way is to implement data syncing. The criteria are speed (less network data exchange), robustness (data recovery in case an update fails), offline access, and flexibility (adaptable when the structure of the database changes slightly, like a new column). I know it varies from app to app, but can you guys share some of your strategy/experience?
For me, I'm thinking of something like this:
1) Store Last Modified Date in iPhone
2) Upon launching, send a message like getNewData.php?lastModifiedDate=...
3) Server will process and send back only modified data from last time.
4) This data is formatted as so:
<+><data id="..."></data></+> // add this to SQLite/CoreData
<-><data id="..."></data></-> // remove this
<%><data id="..."><attribute>newValue</attribute></data></%> // new modified value
I don't want to make <+>, <->, <%>... for each attribute as well, because it would be too complicated, so probably when I receive a <%> field I would just remove the data with the specified id and then add it again (assuming the id here is not an auto-incremented field).
5) Once everything is downloaded and updated, I will update the Last Modified Date field.
The main problem with this strategy: if the network goes down while I am updating something, the Last Modified Date is not yet updated, so the next time I relaunch the app I will have to go through the same thing again. Not to mention potentially inconsistent data. If I use a temporary table for the update and make the whole thing atomic, it would work, but then again, if the update takes too long (lots of data changes), the user has to wait a long time until new data is available. Should I keep a Last-Modified-Date for each data field and update the data gradually?
I would start by making the update routine atomic, since you'll have enough on your hands figuring out how to get the client-server communication working properly.
After that is a good time to consider tweaking it to be incremental, but only after you do some testing to figure out if it's really necessary. If you're tuning your update protocol to be as low bandwidth as possible, you might discover that even a "big" update is downloaded fast enough.
Another way to look at it is to ask yourself, how often is there going to be network trouble when an average user is doing a sync? You probably don't want to tune for unlikely scenarios.
If you are trying to optimize (minimize) the data transfer you may want to consider a different format than XML, since XML is fairly verbose. Or at least you may want to trade in XML readability for space by making each element name and attribute as small as possible, and eliminate all unnecessary whitespace.
Your basic scheme is good. What you need to do is somehow make your updates idempotent, so that you can restart a partially completed transfer without risk. This is a better way to go than trying to implement some sort of true atomic commit (though you could do that too, using, e.g., the SQLite database).
In our experience, fairly large updates (tens of KB) can be downloaded quite rapidly if the server is fast enough, so there is no great need to break updates up into tiny bits. But it certainly won't hurt to try to minimize the amount of data transferred by keeping more granular "last update" info.
(And definitely you should use JSON rather than XML as your transmitted data representation.)
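To make the idempotence point concrete, here is a rough sketch in Java using the xerial sqlite-jdbc driver; the items/sync_state tables and the Change shape are hypothetical. The idea is that re-running a half-applied changeset is harmless, and the last-modified stamp only advances when everything commits.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class SyncApplier {
  // Hypothetical change shape: kind is "ADD", "MODIFY" or "REMOVE".
  public record Change(String kind, long id, String payload) {}

  public static void applyChangeset(List<Change> changes, String newLastModified) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:app.db")) {
      conn.setAutoCommit(false);
      try (PreparedStatement upsert = conn.prepareStatement(
               "INSERT OR REPLACE INTO items(id, payload) VALUES (?, ?)");
           PreparedStatement delete = conn.prepareStatement(
               "DELETE FROM items WHERE id = ?");
           PreparedStatement stamp = conn.prepareStatement(
               "UPDATE sync_state SET last_modified = ?")) {
        for (Change c : changes) {
          if ("REMOVE".equals(c.kind())) {
            delete.setLong(1, c.id());
            delete.executeUpdate();           // deleting an already-missing row is a no-op
          } else {
            upsert.setLong(1, c.id());        // ADD and MODIFY both become "replace by id"
            upsert.setString(2, c.payload());
            upsert.executeUpdate();
          }
        }
        stamp.setString(1, newLastModified);  // only advances if every change landed
        stamp.executeUpdate();
        conn.commit();
      } catch (Exception e) {
        conn.rollback();
        throw e;
      }
    }
  }
}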
I wonder if you have considered using a sync framework to manage the synchronization. If that interests you, take a look at the open source project OpenMobster's Sync service. You can perform the following sync operations:
two-way
one-way client
one-way device
bootup
Besides that, all modifications are automatically tracked and synced with the cloud. Your app can be offline when the network connection is down; it will track any changes and automatically synchronize them with the cloud in the background when the connection returns. It also provides iCloud-like synchronization across multiple devices.
Also, modifications in the Cloud are synched using Push notifications, so the data is always current even if it is stored locally.
In your case,
Criteria are speed (less network data exchange), robustness (data recovery in case update fails), offline access
Speed: Only the changes are sent across the network in both directions
Robustness: It stores data in a transactional store like SQLite, and any failed updates are communicated in the SyncML payload. Only the successful operations are processed, while the failed operations are retried during the next sync.
Here is a link to the open source project: http://openmobster.googlecode.com
Here is a link to iPhone App Sync: http://code.google.com/p/openmobster/wiki/iPhoneSyncApp