I am developing a REST application that can be used as a data upload service for large files. I create chunks of the file and upload each chunk separately. I would like to have multiple instances of this service running (for load balancing), and I would like the REST service to be stateless (no information kept about each stored chunk). This helps me avoid server affinity. If I allowed server affinity, I could dedicate a server to each upload request, store the chunks in a temporary file on disk, and move them somewhere else once the upload is complete.
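For context, the client side does roughly this (a simplified sketch in Python; the endpoint path, chunk size and upload_id are just illustrative, not my real API):

import requests

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per chunk (arbitrary choice)
BASE_URL = "https://upload.example.com"  # illustrative service URL

def upload_in_chunks(path, upload_id):
    # Each chunk is an independent POST, so any instance behind the
    # load balancer may receive any chunk of the same file.
    with open(path, "rb") as f:
        chunk_no = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            resp = requests.post(
                "%s/upload/%s/%d" % (BASE_URL, upload_id, chunk_no),
                data=chunk,
                headers={"Content-Type": "application/octet-stream"},
            )
            resp.raise_for_status()
            chunk_no += 1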
Ideally I would use a central place for the data to be stored, but I would like to avoid that because it is a single point of failure (bad in a distributed system). I was therefore thinking about a distributed file system such as HDFS, but appending to a file there is not very efficient, so that is not an option either.
Is it possible to use some kind of cache for storing the data? Since the data is quite big (2–3 GB files), traditional cache solutions like Memcached cannot be used.
Is there any other option to solve this problem? Is there a direction I haven't considered?
Any help will be greatly appreciated.
I am asking for advice on possibly better solutions for the part of the project I'm working on. I'll first give some background and then my current thoughts.
Background
Our clients can use my company's products to generate potentially large data sets for use in their industry. Once the data sets are generated, the clients file a processing request with us.
We want to send the clients a summary email which contains some statistical charts as well as sampling points from the data sets so they can do some initial quality control work. If the data sets are of bad quality, they don't need to file any request.
One problem is that the charts and sampling points can be too large to send in an email. The charts and sampling points we want to include in the emails are pictures. Although we can use a lower-quality format such as JPEG to save space, we cannot control how many data sets will be included in a summary email, so the total size could still exceed the normal email size limit.
In terms of technologies, we are mainly developing in Python on Ubuntu 14.04.
Goals of the Solution
In general, we want to present something report-like to the clients so they can do some initial QA. The report may contain external links but does not need to be very interactive; in other words, a static report should be fine.
We want to reduce the steps our clients must take to read the report. For example, if the report can be just an email, the user only needs to 1) log in and 2) open the email. If they use an email client, they can skip 1) and just open the report and begin reading.
We also want to minimize the burden of maintaining extra user accounts, both for us and for our clients. For example, a solution that requires us to register a new user account is still acceptable, but would not rank very high.
Security is important because our clients don't want their reports to be read by unauthorized third parties.
We want the process automated, so the solution should provide a programming interface that lets us automate the report sending/sharing process.
Performance is NOT a critical issue. Our user base is not large, at most in the hundreds, and they don't generate data that frequently, at most once a week. We don't need real-time response; even a delay of a few hours is acceptable.
My Current Thoughts on a Solution
Possible solution #1: In-house web service. I can set up a server machine and develop our own web service. We put the report into our database and the clients can then query it over the Internet.
Possible solution #2: Amazon Web Services. AWS is quite mature, but I'm not sure whether it would be expensive, because all we want to do is share a report with our remote clients, which doesn't seem like a big enough deal to justify AWS.
Possible solution #3: Google Drive. I know Google Drive provides an API for uploading and sharing programmatically, but I think we would need to register a dedicated Google account to use it.
Any better solutions??
You could possibly use AWS S3 and CloudFront. Files can easily be loaded into S3 using the AWS SDKs and API. You can then use the API to generate secure links to the files that can only be opened for a specific time and, optionally, only from a specific IP.
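For example, a minimal sketch with the Python SDK (boto3) using plain S3 pre-signed URLs (CloudFront signed URLs, which add the optional IP restriction, take a bit more setup; see the private-content link below). The bucket and key names are placeholders:

import boto3

s3 = boto3.client("s3")

# Upload the report to S3 (bucket and key are placeholders)
s3.upload_file("report.pdf", "my-report-bucket", "client-42/report.pdf")

# Generate a link that stops working after seven days
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-report-bucket", "Key": "client-42/report.pdf"},
    ExpiresIn=7 * 24 * 3600,
)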
Files on S3 can also be automatically cleaned up after a specific time if needed using lifecycle rules.
Storage and transfer prices are fairly cheap with AWS. Remember that the S3 storage cost is quoted per month, so if you only keep an object for a few days, you only pay for those few days.
S3: http://aws.amazon.com/s3/pricing
Cloudfront: https://aws.amazon.com/cloudfront/pricing/
Here's a list of the SDKs for AWS:
https://aws.amazon.com/tools/#sdk
Or you can use their command-line tools for Windows batch or PowerShell scripting:
https://aws.amazon.com/tools/#cli
Here's some info on how the private content URLs are created:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
I would suggest building this service using a mix of your #1 and #2 options: do the processing yourselves and leverage AWS S3 for transferring the data, which is quite cheap.
Example: 100 GB costs roughly $3.
AWS S3 is also beneficial because you are covered in case of a disaster in your local environment; your data will be safe in S3.
For security you can leverage data encryption and signed URLs in AWS S3.
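For example, with boto3 (bucket and key names here are placeholders), server-side encryption is just an extra parameter on the upload:

import boto3

s3 = boto3.client("s3")

# Store the object encrypted at rest with S3-managed keys (SSE-S3);
# a signed, time-limited download URL can then be generated for it
# with generate_presigned_url, as shown in the other answer.
with open("dataset.csv", "rb") as f:
    s3.put_object(
        Bucket="my-transfer-bucket",   # placeholder bucket name
        Key="client-42/dataset.csv",
        Body=f,
        ServerSideEncryption="AES256",
    )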
Current implementation:
Divide the original file into files equal to the number of servers.
Ensure each server picks one file for processing.
Each server splits the file into 90 buckets.
Use ForkManager to fork 90 processes, each operating on one bucket (a rough sketch of this step follows below).
The child processes will make the API calls.
Merge the output of child processes.
Merge the output of each server.
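A rough Python sketch of the per-server part (the actual implementation uses Perl's Parallel::ForkManager; fetch_users_in_bucket stands in for the real API calls):

from multiprocessing import Pool

BUCKETS = 90  # one worker process per bucket, mirroring the ForkManager setup

def fetch_users_in_bucket(bucket_no):
    # Stand-in for the real work: call the API for every user assigned
    # to this bucket and collect the ~40KB responses.
    results = []
    return results

def process_file_on_this_server():
    # Fork one process per bucket and merge the child outputs,
    # mirroring the fork/merge steps above.
    with Pool(processes=BUCKETS) as pool:
        per_bucket = pool.map(fetch_users_in_bucket, range(BUCKETS))
    return [row for bucket in per_bucket for row in bucket]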
Stats:
The size of the content downloaded using the API call is 40KB.
On 2 servers, the above process runs in 15 minutes for a 225k-user file. My aim is to finish a 10-million-user file in 30 minutes. (Hope this doesn't sound absurd!)
I contemplated using BerkeleyDB, but couldn't figure out how to convert a BerkeleyDB file into a normal ASCII file.
This sounds like a one-time operation to me. Although I don't understand the 30-minute limit, I have a few suggestions based on experience.
First of all, as I said in my comment, your bottleneck will not be reading the data from your files. It will also not be writing the results back to a hard drive. The bottleneck will be the transfer between your machines and the remote machines. Your setup sounds sophisticated, but that might not help you in this situation.
If you are hitting a web service, someone is running that service, and their servers can only handle a certain amount of load. I have brought down the dev-environment servers of a big logistics company with a very small load test I ran at night. Often, these systems are equipped for sustained load, but not for short, heavy bursts.
Since IT is all about talking to each other through various protocols, such as web services or other APIs, you should also consider simply talking to the people who run this service. If you have a business relationship, that is easy. If not, try to find a way to reach them and ask whether their service can handle that many requests at all. Otherwise you could end up being blocked permanently, because to their admins it may look like you tried to DDoS them.
I'd ask them whether you could send them the files (or an excerpt of the data, cut down to what is relevant for processing) so they can run the operations in batch on their side. That way, you remove both the load of processing everything as individual web requests and the time it takes to make those requests.
I must provide a solution where users can upload files that are stored together with some metadata, and this may grow really big.
Access to these files must be controlled, so they want me to just store them as database BLOBs, but I fear PostgreSQL won't handle that well over time.
My first idea was to use some NoSQL DB solution, but I couldn't find any that would replace a good RDBMS and also store files elegantly. Then I thought of just saving the files on disk somewhere the web server won't serve them directly, naming them after their table ID, and loading them into RAM and sending them with the proper content type.
Could anyone suggest a better solution for this?
I had a requirement to store many images (with some metadata) and allow controlled access to them; here is what I did.
To the cloud™
I save the image files in Amazon S3. My local database holds the metadata, with the S3 location of the file as one column. When an authenticated and authorized user needs to see a file, they hit a URL in my system (where the authentication and authorization checks occur), which then generates a pre-signed, expiring URL for the image and sends a redirect back to the browser. The browser is then able to load the image for a given amount of time (as specified in the signature within the URL).
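A condensed sketch of that flow in Python/Flask with boto3 (the route, lookup_image_metadata and user_may_view here are placeholders for my real metadata lookup and auth checks):

import boto3
from flask import Flask, abort, redirect

app = Flask(__name__)
s3 = boto3.client("s3")

def lookup_image_metadata(image_id):
    # Placeholder: fetch the S3 bucket/key for this image from the local DB
    return {"bucket": "my-image-bucket", "key": "images/%d.jpg" % image_id}

def user_may_view(record):
    # Placeholder: the real authentication/authorization check
    return True

@app.route("/images/<int:image_id>")
def serve_image(image_id):
    record = lookup_image_metadata(image_id)
    if record is None or not user_may_view(record):
        abort(403)
    # Pre-signed, expiring URL; the browser follows the redirect and loads
    # the image straight from S3 while the signature remains valid.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": record["bucket"], "Key": record["key"]},
        ExpiresIn=300,  # valid for five minutes
    )
    return redirect(url)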
With this solution I have user-level access to the resources, and I don't have to store them as BLOBs or anything like that, which might grow unwieldy over time. I also don't use MY bandwidth to stream the files to the client, and I get cheap, redundant storage for them. Obviously the suitability of this solution will depend on the nature of the binary files you are looking to store and your level of trust in Amazon. The world doesn't end if there is a slip and someone sees an image from my system they shouldn't. YMMV.
What is a good tool for applying a layer of caching between a web server and an application server?
Basic Requirements:
The application server needs a way to remove items from the cache and put items in the cache with an expiration date.
The webserver needs a way to pull items out of the cache in a very light-weight, fast manner without requiring thread allocation on the application server.
It does not necessarily need to be a distributed cache (accessible from multiple machines), but it wouldn't hurt.
Strategies I have considered:
Static file caching. A request comes in and gets hashed; if a corresponding file exists we serve it, and if not we route the request to the app server. Are high I/O or file-locking problems a concern under concurrency? Is it accurate that the file system is actually very fast due to kernel-level caching in memory?
Using a key-value DB like MongoDB or Redis. This would store the finished HTML/JSON fragments in the DB. The webserver would be equipped to read from the DB and route to the app server if needed. The app server would be equipped to insert/remove from the DB. (A rough Redis sketch of this option follows below.)
A memory cache like memcached or Varnish (don't know much about Varnish). My only concern with memcached is that I'm going to want to cache 3 - 10 gigabytes of data at any given time, which is more than I can safely allocate in memory. Does memcached have a method to spill to the filesystem?
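For reference, a rough sketch of what the key-value option could look like with Redis (key naming and TTLs here are made up; the app server writes and invalidates, the web tier only reads):

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def cache_fragment(key, html, ttl_seconds=600):
    # App-server side: store a rendered HTML/JSON fragment with an expiration
    r.setex(key, ttl_seconds, html)

def evict_fragment(key):
    # App-server side: remove an item explicitly
    r.delete(key)

def get_fragment_or_none(key):
    # Web-server side: cheap read; on a miss the request gets routed
    # to the app server instead
    value = r.get(key)
    return value.decode("utf-8") if value is not None else None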
Any thoughts on some techniques and pitfalls when trying this type of caching layer?
You can also use the GigaSpaces XAP in-memory data grid for caching and even for hosting your web application. You can choose just the caching option, or combine the two and gain single management of your environment, among other things.
Unlike the key-value approach you suggested, with GigaSpaces XAP you'll be able to run complex queries such as SQL, object-based templates and much more. For your caching scenario you should specifically check out the local-cache-related features.
Local Cache
Web Container
Disclaimer: I am a developer at GigaSpaces.
Eitan
Just to answer this from the POV of using Coherence (http://coherence.oracle.com/):
1. The application server needs a way to remove items from the cache and put items in the cache with an expiration date.
// remove one item from cache
cache.remove(key);
// remove multiple items from cache
cache.keySet().removeAll(keylist);
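// put one item into the cache with an expiration
// (three-argument put: key, value, time-to-live in milliseconds)
cache.put(key, value, 60000);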
2. The webserver needs a way to pull items out of the cache in a very light-weight, fast manner without requiring thread allocation on the application server.
// access one item from cache
Object value = cache.get(key);
// access multiple items from cache
Map mapKV = cache.getAll(keylist);
3. It does not necessarily need to be a distributed cache (accessible from multiple machines), but it wouldn't hurt.
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
A memory cache like memcached or Varnish (don't know much about Varnish). My only concern with memcached is that I'm going to want to cache 3 - 10 gigabytes of data at any given time, which is more than I can safely allocate in memory. Does memcached have a method to spill to the filesystem?
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
I am creating a MongoDB/Node.js blogging system (similar to WordPress).
I currently have the images saved on disk, with a pointer stored in MongoDB. Since I already store all sessions in MongoDB to enable easy load balancing across servers, I was wondering whether storing the actual files in MongoDB would also be a smart idea for easy multi-server setups and/or performance gains.
If everything is stored in a DB, you can simply spawn more web servers and/or MongoDB replicas to scale horizontally.
Opinions?
MongoDB is a good option for storing your files (I'm talking about GridFS), especially for the use case you described above.
When you store files in MongoDB (GridFS, not regular documents), you get all the replication and sharding capability for free, which is awesome.
If you have to spawn a new server and the files are already in MongoDB, all you have to do is enable replication (and thus scale horizontally). I'm sure this can save you a lot of headaches.
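A small illustration of the GridFS API using PyMongo (your stack is Node.js, whose driver has an equivalent GridFS interface; this is just to show the shape of it, and the database name is a placeholder):

import gridfs
from pymongo import MongoClient

db = MongoClient()["blog"]   # placeholder database name
fs = gridfs.GridFS(db)

# Store an uploaded image; GridFS splits it into chunks that are
# replicated and sharded like any other collection.
with open("header.png", "rb") as f:
    file_id = fs.put(f, filename="header.png")

# Later, on any server pointing at the same replica set, read it back by id
image_bytes = fs.get(file_id).read()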
Resources:
Is GridFS fast and reliable enough for production?
http://www.mongodb.org/display/DOCS/GridFS
http://www.coffeepowered.net/2010/02/17/serving-files-out-of-gridfs/
Aside from GridFS, you might be considering a cloud-based deployment. In that case, you could store files in cloud-specific storage (Windows Azure has Blob Storage, for example). Sticking with Windows Azure for this example (since that's what I work with), you'd reference a file by its storage-account URI. For example:
https://mystorageacct.blob.core.windows.net/mycontainer/myvideo.wmv
Since you'd be storing the MongoDB database itself in its own blob (mounted as a disk volume on your Linux or Windows VM), you could then choose to store your files in either the same storage account or a completely different one (each storage account providing 200 TB of storage).
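For illustration, uploading a blob with the current Python SDK for Azure Blob Storage (azure-storage-blob; the connection string, container and blob names are placeholders):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="mycontainer", blob="myvideo.wmv")

# After the upload the blob is addressable by its storage-account URI,
# e.g. https://mystorageacct.blob.core.windows.net/mycontainer/myvideo.wmv
with open("myvideo.wmv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

print(blob.url)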
Storing images as regular documents in MongoDB would be a bad idea, as resources that could be used to serve a large amount of informational data would instead be spent serving files.
Have a look at MongoDB's file storage, GridFS; that might solve your problem of storing images while providing horizontal scalability as well.
http://www.mongodb.org/display/DOCS/GridFS
http://www.mongodb.org/display/DOCS/GridFS+Specification