Suggestions about file storage in Amazon AWS - MongoDB

I'm developing an ASP.NET MVC project that will be hosted on Amazon AWS, but I have some questions about the storage of clients' files. The documentation from Amazon is not clear to me, and I'm looking for some directions and experiences here.
1 - Each client has a few files with low disk space requirements, low update frequency, but very high access frequency (like a brand image, and even sensitive files like certificates). Is it appropriate to store these files in the app_data folder on the web server?
2 - The most critical for me are sensitive documents (from hundreds to dozens of thousands per client, mostly signed XML files). These files have a medium read frequency but a very high demand for creation. One solution I found is MongoDB, which gives me some freedom to manage the storage policy and makes external backups easy, but I'm not sure about that. Other options are to use Amazon storage and handle all these files and GBs there across a lot of folders, or maybe to use a regular database and save the files as XML or binary.
My concerns are about the amount of data, the security, and the reliability in case of disaster, as most of these documents have legal value.

You could, but storing them locally violates the shared-nothing architecture and would limit your scaling options. Amazon S3 is a good option here. You can make some files public and serve them directly from S3 (or through CloudFront), and keep others private and provide access via signed URLs.
Again, you can put the files on S3 and make them private. You will probably still store references to the files in your database. Generally it's not a great idea to store large blob files in a database, since databases are often not well optimized for accessing them.
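For the signed-URL part, here is a minimal sketch in Python with boto3 (the question is ASP.NET, where the AWS SDK for .NET offers the same operations); the bucket name and object key are placeholders:

```python
# Minimal sketch with boto3; bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a private object (no public ACL), e.g. a signed XML document.
s3.upload_file("invoice-001.xml", "my-private-bucket", "clients/42/invoice-001.xml")

# Generate a time-limited signed URL so the client can download the private file.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "clients/42/invoice-001.xml"},
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)
```

Public assets (like the brand images) can skip the signed URL entirely and be served straight from the bucket or through CloudFront.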

Related

How can I have software access files on the cloud

So I have a small company with plenty of documents, and I want to set up an archiving system. I have several employees with different levels of permission to access the files on the server. This will serve as an archive system plus a management system, as employees can read and write files (depending on their permissions) for a certain project, and the admin can block access to a certain directory (i.e. project).
So after some research, I think the best idea is to have a cloud-based NAS that a user can mount locally by providing the correct username and password. Then a piece of software will access these files (which are now local) and can display some data (e.g. project progress, minutes of meetings), or the user can access the files directly.
Does any of this make sense? I mean, is that what a NAS can actually do, and can it be done in the cloud? And can users access the file system (with restrictions) given a username and password (much like on a local network)? Is there a better alternative for my purposes?
To the best of my knowledge, I could instead create software that accesses the cloud directly, but how would I get users to write files that are stored in the cloud? Won't that be more complicated to implement? Can I use an RDBMS for it? I've used one before, but never for files.
If I understand your use case correctly, all you really want is to have access to different files for different roles within your company, is this correct?
To the best of my knowledge, Google provides corporate accounts which are quite affordable and should have access control schemes suiting what you need (after all, storing files on scalable storage, with various access controls, in an offsite location, and with redundancy is partly what the cloud is for).
If not, or if this solution isn't appealing to you and you would prefer to use your NAS, the best way to do this would be to use Google's Backup and Sync application (you can download it by clicking the cog icon in Drive and selecting it). If you install and run it on an admin computer that is always on and always connected to (mounted with) your NAS, you can set a root folder on the NAS as your Drive sync folder. Any files added to this folder will be uploaded to Drive, and any added to Drive will be automatically downloaded. After this you can configure access control on the NAS using various user accounts and roles, and have each employee mount the store using their own credentials, revealing only the files they have access to.

How to store and organize uploaded images on a web server?

I am writing a server that allows users to upload images. It appears that most people tend to store those files on the filesystem directly.
My question would be whether that really is the way to do it. I'm not familiar with the capacity of a server, but what I'm curious about is, for example, how to make sure that the server does not run out of disk space.
I would also like to know how one would organize those files for many different users. Is it enough to just store them like war/images/<user-database-id>/<uuid-for-image>.(jpeg|png), using the user ID from the database, or are there a lot more things to consider when it comes to storing images?
I think your best bet would be to use a cloud storage system such as Amazon S3, Google Cloud Storage, Rackspace, or MS Azure.
Using a path like the one you suggested ought to be possible, but you could also omit the user-database-id if the database already gives you a list of objects owned by that user.
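If it helps, here is a hypothetical sketch of that path scheme in Python, using the standard uuid module; the base directory and extension handling are assumptions:

```python
# Hypothetical path builder: images/<user-database-id>/<uuid>.<ext>
import os
import uuid

def build_image_path(base_dir: str, user_id: int, original_name: str) -> str:
    ext = os.path.splitext(original_name)[1].lower() or ".bin"
    return os.path.join(base_dir, "images", str(user_id), f"{uuid.uuid4()}{ext}")

print(build_image_path("war", 42, "avatar.PNG"))
# e.g. war/images/42/2f1e6c3a-....png
```

The same layout works unchanged as an object key prefix on S3 or GCS, which keeps a later migration off the local filesystem simple.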

Uploading images to a PHP app on GCE and storing them in GCS

I have a PHP app running on several instances of Google Compute Engine (GCE). The app allows users to upload images of various sizes, resizes the images, and then stores the resized images (and their thumbnails) on the storage disk and their metadata in the database.
What I've been trying to find is a method for storing the images in Google Cloud Storage (GCS) from the PHP app running on the GCE instances. A similar question was asked here, but no clear answer was given. Any hints or guidance on the best way to achieve this are highly appreciated.
You have several options, all with pros and cons.
Your first decision is how users upload data to your service. You might choose to have customers upload their initial data to Google Cloud Storage, where your app would then fetch it and transform it, or you could choose to have them upload it directly to your service. Let's assume you choose the second option, and you want users to stream data directly to your service.
Your service then transforms the data into a different size. Great. You now have a new file. If this was video, you might care about streaming the data to Google Cloud Storage as you encode it, but for images, let's assume you want to process the whole thing locally and then store it in GCS afterwards.
Now we have to get a file into GCS. It's a PHP app, and so as you have identified, your main three options are:
Invoke the GCS JSON API through the Google API PHP client.
Invoke either the GCS XML or JSON API via custom code.
Use gsutil.
Using gsutil will be the easiest solution here. On GCE, it automatically picks up appropriate credentials for your service account, and it's got several useful performance optimizations and tuning that a raw use of the API might not do without extra work (for example, multithreaded uploads). Plus it's already installed on your GCE instances.
The upside of the PHP API is that it's in-process and offers more fine-grained, programmatic control. As your logic gets more complicated, you may eventually prefer this approach. Getting it to perform as well as gsutil may take some extra work, though.
This choice is comparable to copying files via SCP with the "scp" command line application or by using the libssh2 library.
tl;dr: Using gsutil is a good idea unless you need to handle interactions with GCS more directly.
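To give a feel for the in-process API option, here is an illustrative sketch using the Python client library (google-cloud-storage); the PHP client library follows a similar pattern, and the bucket and object names are placeholders. On GCE the client picks up the instance's service-account credentials automatically:

```python
# Illustrative upload of a resized image to GCS; names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-image-bucket")
blob = bucket.blob("resized/photo-123_800x600.jpg")
blob.upload_from_filename("/tmp/photo-123_800x600.jpg")
print(blob.name)
```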

File storage + permissions: MongoDB vs. filesystem approach

The Java web app I'm developing allows users to upload files (pictures and documents) to their profiles and define access rules for those files (define which of the other users are able to view / download the file). The access control / permission system is custom made, and the rules are stored in MongoDB alongside the user's profile and the actual file entry.
Knowing that I need the application and storage to be distributed and fault-tolerant, I need to figure out the best strategy for file storage.
Should I store the files inside MongoDB, in the files collection where the file document containing the description and access rules is located?
Or should I store the files in the server's file system and keep the path in the MongoDB document? With the filesystem approach, will I still be able to enforce the user-defined access permissions, and how?
Finally, with the filesystem approach, how do I distribute files across servers? Should I use dedicated servers for this, or can I store the files on the web app servers or the MongoDB servers?
Thanks a lot for all your insights! Any help or feedback appreciated.
Alex
There are several alternatives:
put the files in a storage service (e.g. S3): easy and lots of space, but worse performance
put the files in the local filesystem: fast, but doesn't scale
put the files in MongoDB documents: easy, powerful, and scalable, but limited to 16 MB per document
use the GridFS layer of MongoDB. Functionality is limited, but it is made for scalability (thanks to sharding) and is fairly fast too. Note that you can put info about the file (permissions, etc.) right into the file's metadata object.
In your case it sounds like the last option may be best; quite a few users have switched from the filesystem to GridFS, and it worked very well for them.
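A minimal GridFS sketch with PyMongo, where the database name and the shape of the metadata (owner / allowed users) are assumptions purely for illustration:

```python
# Store a file plus its access rules in GridFS; names and metadata shape are assumptions.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["myapp"]
fs = gridfs.GridFS(db)

with open("report.pdf", "rb") as f:
    file_id = fs.put(
        f,
        filename="report.pdf",
        metadata={"owner": "alice", "allowed_users": ["bob", "carol"]},
    )

# Later: the application enforces the rules before streaming the file back.
grid_out = fs.get(file_id)
if "bob" in grid_out.metadata["allowed_users"]:
    data = grid_out.read()
```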
Things to keep in mind:
GridFS sharding works but is not perfect: usually only the file data is sharded, not the metadata. Not a big deal, but the shard holding the metadata must be very safe.
it can be beneficial to run GridFS in a separate MongoDB cluster from your core data, since the requirements (storage, backup, etc.) are usually different.

How do we share data between two different services

I am currently working on a web service which is periodically polled. It does not store its state and is instantiated every time it is queried. Essentially, it retrieves the state of other external entities, e.g. databases, and delivers it back to the requester.
Recently, the need to store state has arisen, in that:
There is a need to continuously collect data from a particular source and store the bits that are important/relevant
There is a need to collect the aggregate of a particular data source over a period of time
I came up with the following idea:
My main concern here is the fact that I am using a static class (essentially a global) to share data between the two services. Is there a better way of doing this?
Edit: Thanks for the responses thus far. Apologies for the vagueness of this question: I'm just trying to work out the best way to share data across different services and am unsure of the specifics (i.e. what is required). The platform I am developing on is the .NET Framework, and both services are simply WCF services hosted as a Windows service.
The database route sounds like the most conventional way to go - however, I am reluctant to go down that path for now (mainly because of deployment/setup issues; it introduces the need to create new tables, etc., in addition to simply installing the software), given that at this point only relatively small amounts of data are transferred. This may of course change in the future, and going the database route might be the way to go at that point.
Is there any other way besides adding a database persistence layer?
If you need to collect and aggregate data, you might want to consider using a database between the two layers. Or have I misunderstood something?
You should consider enhancing your question with more requirements: pretty much all options are open here.
Sure - how about data binding? I don't have a lot of information to go on here about your platform, but most sufficiently advanced systems offer it in some form.
You could replace your static shared data with some database representation, with a caching layer (like memcached) between the database and the webservice, so that most of the time the data is available very quickly from the cache, but can be retrieved from the database as needed.
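As a rough illustration of that cache-aside setup, here is a sketch in Python (the services in the question are WCF/.NET; the memcached client, key, and expiry used here are assumptions):

```python
# Cache-aside sketch: try memcached first, fall back to the database.
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_from_database(key: str) -> dict:
    # Placeholder for the real query against the shared database.
    return {"key": key, "value": 42}

def get_shared_value(key: str) -> dict:
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                   # fast path: served from the cache
    value = load_from_database(key)                 # slow path: hit the database
    cache.set(key, json.dumps(value), expire=300)   # keep it warm for 5 minutes
    return value
```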
I appreciate that you want to keep the architecture simple. Depending on the number of items you have to look up and their permanence, you might just consider leveraging your file system or a message queue. It sounds like you want a file system, because that sounds like the least impact on your design.
If you start dealing with tens of thousands of small files, your directories can get hard to navigate and slow to do file lookups on. I typically shoot for about 1,000 - 10,000 files per directory and concoct a routine that can generate a path to the file based on the file name pattern. Keeping the distribution across subdirectories even is important, and some file systems have a limit on the number of subdirectories in a parent directory.
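Here is a hypothetical example of such a routine in Python, hashing the file name into two levels of subdirectories so files stay evenly spread:

```python
# Derive a balanced subdirectory path from the file name.
import hashlib
import os

def shard_path(base_dir: str, file_name: str, levels: int = 2) -> str:
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    # Use the first hex pairs as nested subdirectories, e.g. ab/cd/<file_name>.
    parts = [digest[i * 2:(i * 2) + 2] for i in range(levels)]
    return os.path.join(base_dir, *parts, file_name)

print(shard_path("/var/data", "invoice-001.xml"))
# e.g. /var/data/ab/cd/invoice-001.xml
```

Because the path is derived from the name alone, the same routine can later locate the file without any extra lookup table.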