Lightweight HTTP application/server for static content - webserver

I am in need of a scalable and performant HTTP application/server that will be used for static file serving/uploading. So I only need support for GET and PUT operations.
However, there are a few extra features that I need:
Custom authentication: I need to check credentials against a database for each request, so I must be able to integrate proprietary database interaction.
Support for signed access keys: Access to resources via PUT should be signed with a key, like http://uri/?key=foo. The key contains information about the request, such as md5(user + path + secret), which allows me to block unwanted requests; the application/server should allow me to check for this (see the sketch below).
Performance: I'd like to avoid piping content as much as possible. Otherwise the whole application could be implemented in Perl etc. in a few lines as a CGI.
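To make the signed-key check concrete, the verification I have in mind is roughly this (a Python sketch; the secret and parameter names are just placeholders):

import hashlib
import hmac

SECRET = "change-me"  # placeholder shared secret

def expected_key(user, path):
    # md5(user + path + secret), as described above
    return hashlib.md5((user + path + SECRET).encode()).hexdigest()

def request_allowed(user, path, presented_key):
    # presented_key is the value from the ?key=... query string
    return hmac.compare_digest(expected_key(user, path), presented_key)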
Perlbal (in webserver mode) looks nice; however, the single-threaded model does not fit my database lookups, and it also does not support query strings.
Lighttpd/Nginx/… have modules for these tasks, but it is not feasible to put everything together without ending up writing my own extensions/modules.
So how would you solve this? Are there other lightweight web servers available for this?
Should I implement an application inside of a web server (i.e. CGI)? How can I avoid or speed up piping content between the web server and my application?
Thanks in advance!

Have a look at nodejs http://nodejs.org/
There are a few modules for static web servers and database interfaces:
http://wiki.github.com/ry/node/modules
You might have to write your own file upload handler, or use one from this example http://www.componentix.com/blog/13

nginx + spawn-fcgi + an FCGI application written in C + memcached + sqlite serves a similar task well; latency is about 20-30 ms for small data over fast connections on the same local network. As far as I know, the production server handles about 100-150 requests per second with no problem. On a test server I peaked at about 20k requests per second, again with no problem; average latency was about 60 ms. Aggressive caching and UNIX domain sockets are the key.
I don't know how that configuration would behave with frequent PUT requests; in our task they are very rare and typically batched.

Related

Need advice: How to share a potentially large report to remote users?

I am asking for advice on possibly better solutions for the part of the project I'm working on. I'll first give some background and then my current thoughts.
Background
Our clients can use my company's products to generate potentially large data sets for use in their industry. When the data sets are generated, the clients will file a processing request to us.
We want to send the clients a summary email which contains some statistical charts as well as sampling points from the data sets so they can do some initial quality control work. If the data sets are of bad quality, they don't need to file any request.
One problem is that the charts and sampling points can be too large to be sent in an email. The charts and the sampling points we want to include in the emails are pictures. Although we can use a low-quality format such as JPEG to save space, we cannot control how many data sets will be included in the summary email, so the total size could still exceed the normal email size limit.
In terms of technologies, we are mainly developing in Python on Ubuntu 14.04.
Goals of the Solution
In general, we want to present a report-like document to the clients for some initial QA. The report may contain external links but does not need to be very interactive; in other words, a static report should be fine.
We want to reduce the steps our clients must take to read the report. For example, if the report can just be an email, the user only needs to 1) log in and 2) open the email. If they use a mail client, they may skip 1) and just open it and begin to read.
We also want to minimize the burden of maintaining extra user accounts for both us and our clients. For example, if the solution requires us to register a new user account, this solution is, although still acceptable, not ranked very high.
Security is important because our clients don't want their reports to be read by unauthorized third parties.
We want the process automated: the solution should provide a programming interface so that we can automate the report sending/sharing process.
Performance is NOT a critical issue. Our user base is not large, at most in the hundreds, and they don't generate data that frequently, at most once a week. We don't need real-time response; even a delay of a few hours is acceptable.
My Current Thoughts of Solution
Possible solution #1: An in-house web service. I can set up a server machine and develop our own web service. We put the reports into our database and the clients can then query them via the Internet.
Possible solution #2: Amazon Web Services. AWS is quite mature, but I'm not sure whether it could be expensive, because so far we just want to share a report with our remote clients, which doesn't look like a big enough deal to use AWS.
Possible solution #3: Google Drive. I know Google Drive provides API to do uploading and sharing programmatically, but I think we need to register a dedicated Google account to use that.
Any better solutions??
You could possibly use AWS S3 and CloudFront. Files can easily be loaded into S3 using the AWS SDKs and API. You can then use the API to generate secure links to the files that can only be opened for a specific time and, optionally, only from a specific IP.
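For example, generating a time-limited download link with the Python SDK (boto3) looks roughly like the sketch below; the bucket and key names are placeholders, and the optional IP restriction mentioned above is a CloudFront signed-URL feature rather than part of this plain S3 call:

import boto3

s3 = boto3.client("s3")

# a download link that stops working after one hour; bucket/key are placeholders
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-report-bucket", "Key": "reports/2015-06/summary.html"},
    ExpiresIn=3600,
)
print(url)  # email or otherwise hand this URL to the client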
Files on S3 can also be automatically cleaned up after a specific time if needed using lifecycle rules.
Storage and transfer prices are fairly cheap with AWS, and remember that the S3 storage cost quoted is per month, so if you only keep an object for a few days then you only pay for those few days.
S3: http://aws.amazon.com/s3/pricing
CloudFront: https://aws.amazon.com/cloudfront/pricing/
Here's a list of the SDKs for AWS:
https://aws.amazon.com/tools/#sdk
Or you can use their command line tools for Windows batch or powershell scripting:
https://aws.amazon.com/tools/#cli
Here's some info on how the private content urls are created:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
I would suggest building this service using a mix of your #1 and #2 options. You can do the processing yourself and leverage AWS S3, which is quite cheap, for transferring the data.
Example: 100 GB costs approximately $3 per month.
AWS S3 is also beneficial because you are covered against any disaster in your local environment; your data will be safe in S3.
For security you can leverage data encryption and signed URLs in AWS S3.
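A rough boto3 sketch of that combination (encrypt at rest on upload, then share a short-lived signed URL; all names are placeholders):

import boto3

s3 = boto3.client("s3")

# upload the report encrypted at rest (SSE-S3)
with open("summary.pdf", "rb") as fh:
    s3.put_object(
        Bucket="my-report-bucket",
        Key="reports/client-42/summary.pdf",
        Body=fh,
        ServerSideEncryption="AES256",
    )

# hand the client a link that expires after 24 hours
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-report-bucket", "Key": "reports/client-42/summary.pdf"},
    ExpiresIn=24 * 3600,
)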

What are the limitations of the flask built-in web server

I'm a newbie in web server administration. I've read multiple times that the Flask built-in web server is not designed for "production", and must be used only for testing and debugging...
But what if my app touches only a thousand users who occasionally send data to the server?
If it works, when will I have to bother with configuring a more sophisticated web server? (I am looking for approximate metrics.)
In a nutshell, I would love to know what the built-in web server can do (with approximate thresholds) and what it cannot.
Thanks a lot!
There isn't one right answer to this question, but here are some things to keep in mind:
With the right amount of horizontal scaling, it is quite possible you could keep scaling out use of the debug server forever. When exactly you would need to start scaling (or switch to using a "real" web server) would also depend on the environment you are hosting in, the expectations of the users, etc.
The main issue you would probably run into is that the server is single-threaded: it handles each request one at a time, serially. This means that if you are trying to serve more than one request (including favicons, static items like images, CSS and JavaScript files, etc.), the requests will take longer. If any given request happens to take a long time (say, 20 seconds), then your entire application is unresponsive for that time (20 seconds). This is only the default, of course: you could bump the thread count (or have requests handled in other processes), which might alleviate some issues. But once again, it can still be slow under a "high" load, and what counts as a "high" load depends on your application and the maximum acceptable response time.
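For example (a minimal sketch, still the development server and not a production setup), you can ask it to handle each request in its own thread:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello"

if __name__ == "__main__":
    # threaded=True gives one thread per request; processes=4 would use
    # worker processes instead. Either way it is still the development server.
    app.run(threaded=True)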
Another issue is security: if you are concerned at ALL about security (and not just the security of the data in the application itself, but the security of the box that will be running it as well) then you should not use the development server. It is not ready to withstand any sort of attack.
Finally, the development server could just fail outright. It is not designed to be used as a long-running process (days, weeks, months), and so it has not been well tested to work in this capacity.
So, yes, it has limitations. Yes, you could still conceivably use it in production. And yes, I would still recommend using a "real" web server. If you don't like the idea of needing to install something like Apache or Nginx, you can still go with a solution that is as easy as "run a Python script" by using one of the standalone WSGI servers, which can run a production-ready server with something as simple as running python run_app.py on the command line. You typically just need to create a 4-5 line Python script that imports and creates the server object, points it to your Flask app, and runs it.
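For instance, with Waitress (one of those standalone WSGI servers; this sketch assumes pip install waitress and the same myproject:app layout used in the gunicorn example below):

# run_app.py - a minimal sketch using the Waitress WSGI server
from waitress import serve

from myproject import app  # assumes your Flask object is "app" in the "myproject" package

serve(app, host="0.0.0.0", port=8080)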
gunicorn could be run with only the following on the command line, no extra script needed:
gunicorn myproject:app
...where "myproject" is the Python package that contains the app Flask object. Keep in mind that one of developers of gunicorn would probably recommend against this approach. See https://serverfault.com/questions/331256/why-do-i-need-nginx-and-something-like-gunicorn.
The OP has long since moved on, but for those who encounter this question in the future I would just add that setting up an Apache server, even on a laptop, is free and pretty easy. It can be readily configured for as few or as many features as you want just by uncommenting or commenting out lines in the config file. There might be an even easier GUI method for doing that nowadays, but just editing the configs is simple.

Need inputs on optimising web service calls from Perl

Current implementation:
Divide the original file into files equal to the number of servers.
Ensure each server picks one file for processing.
Each server splits the file into 90 buckets.
Use ForkManager to fork 90 processes, each operating on a bucket.
The child processes will make the API calls.
Merge the output of child processes.
Merge the output of each server.
Stats:
The size of the content downloaded using the API call is 40KB.
On 2 servers, the above process for a 225k-user file runs in 15 minutes. My aim is to finish a 10-million-user file in 30 minutes. (Hope this doesn't sound absurd!)
I contemplated using BerkeleyDB but couldn't figure out how to convert the BerkeleyDB file into a normal ASCII file.
This sounds like a one-time operation to me. Although I don't understand the 30 minute limit, I have a few suggestions I know from experience.
First of all, as I said in my comment, your bottleneck will not be reading the data from your files, and it will not be writing the results back to a hard drive either. The bottleneck will be the transfer between your machines and the remote machines. Your setup sounds sophisticated, but that might not help you in this situation.
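As a back-of-the-envelope check using the numbers from your question: 10 million records in 30 minutes is roughly 5,500 requests per second, and at about 40 KB per response that is on the order of 220 MB/s (close to 2 Gbit/s) of sustained transfer from the remote service to your machines. That is the scale both their servers and your network link would have to handle.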
If you are hitting a web service, someone is running that service, and their servers can only handle a certain amount of load. I have brought down the dev-environment servers of a big logistics company with a very small load test I ran at night. Often these systems are equipped for long-term load, but not for short, heavy load.
Since IT is all about talking to each other through various protocols, like web services or other APIs, you should also consider just talking to the people who run this service. If you have a business relationship, that is easy. If not, try to find a way to reach them and ask whether their service can handle that many requests at all. You could end up with them excluding you permanently, because to their admins it looks like you are trying to DDoS them.
I'd ask them whether you could send them the files (or an excerpt of the data, cut down to what is relevant for processing) so they can do the operations in batch on their side. That way, you remove the load of processing everything as web requests, and the time it takes to make those requests.

Using CouchDB as an interface. Is it an appropriate way?

Our devices (microscopes with cameras) produce images plus additional information for each image.
Now a middleware supplier wants to connect these devices to a lab automation system. They have to acquire the data and we have to provide it. An astonishing thing for me was their interface suggestion: a very cryptic, token-separated format (ASTM E1394-97). Unfortunately, they can't even accommodate images in their protocol and are aiming to get file paths.
I thought this is not an up-to-date approach. While looking for alternatives, I came across CouchDB.
So my idea was that our devices would import their data, including images, into CouchDB, and the middleware could get the data from there. It even seems that, using mustache templates, we could produce the format they want (ASCII text), placing URLs as image references instead of paths.
My question is: has someone already applied CouchDB to such a use case? It seems to be a little bit of a misuse of CouchDB, as the main intention here is an interface, not data storage. Another point disturbing me is that the inventor of CouchDB has moved on to another project, Couchbase. Could that mean a lack of support for CouchDB in the future?
Thank you very much for any insights and suggestions!
It's an OK use case, and we are actually using CouchDB in exactly this way: as proxying middleware between medical laboratory analyzers and a LIS. Some of them publish images or PDF data to shared folders, and we just load these into the related documents as attachments.
Moreover, you may like to know that CouchDB can manage external processes (aka os_daemons) and take care of their lifespan: restarting them if one terminates, and starting them right after you update the config options through the HTTP interface. This helps in setting up ASTM client and server processes (since that protocol is different from HTTP, which is native for CouchDB) that talk to the devices and create documents as regular CouchDB clients. In the same way you can set up daemons that monitor shared folders for specific files. And all of this is just CouchDB with a few "low bounded" plugins.
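As a rough illustration of the attachment part, a Python client can push an image into a CouchDB document over its plain HTTP API (the server URL, database and file names here are placeholders):

import requests

COUCH = "http://localhost:5984"   # placeholder CouchDB location
DB = "lab_results"                # placeholder database name

# create a document describing one acquisition
doc = {"device": "microscope-01", "taken_at": "2015-06-01T12:00:00Z"}
resp = requests.post(COUCH + "/" + DB, json=doc).json()
doc_id, rev = resp["id"], resp["rev"]

# attach the image; CouchDB will serve it back over HTTP, so an ASCII export
# can reference it by URL instead of a file path
with open("image_0001.png", "rb") as fh:
    requests.put(
        COUCH + "/" + DB + "/" + doc_id + "/image_0001.png",
        params={"rev": rev},
        data=fh,
        headers={"Content-Type": "image/png"},
    )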

How can I communicate across Perl CGI scripts?

I am searching for efficient ways of communication across two Perl scripts. I have two scripts; Script 1 generates some data. I want my Script 2 to be able to access that information.
The easiest/dumbest way is to write the data generated by Script 1 to a file and read it later using Script 2. Is there any other way than this? Can I store the data in memory and make it available to Script 2 (of course with support from my Linux)? Meaning, malloc some data in Script 1 and make Script 2 able to access it.
There is no guarantee that Script 2 will be run after Script 1, so there should be some way to free that memory, for example using a watchdog timer.
Let me reveal some more context. I am running these scripts on a web server using Perl CGI. At the click of a button, Script 1 runs and generates an HTML web page. The user can then add some inputs to this generated page and click a button on the new page. Now Script 2 should be able to read the data on the new page. I could post the data back to the web server again, but a more efficient way would be to keep a copy of the generated page on the server and make it available to Script 2. I would like to avoid writing the generated page out as a file; I was thinking of storing it in memory.
This depends somewhat on your usage... one large set of data? Many small messages? Do you care at all about data persistence? Is it TOTALLY asynchronous?
Some of the options are:
For any but the most high-performance web sites, the best approach is to write out the HTML pages to files. Unless the inter-process communication is benchmarked to be the bottleneck in performance, don't bother with any of the non-file solutions (shared memory, cache, intermediate server).
Specifically, for two CGI scripts on the same server, if you run them under mod_perl or some other arrangement which shares a Perl interpreter between the two CGI processes, you can develop a package to serve as a cache which, with its package-level variables, would be preserved in memory by mod_perl for as long as mod_perl is running, and can thus be used by a writer CGI process and a reader CGI process to communicate. Of course, the usual synchronization/deadlock and persistence issues associated with readers/writers need to be considered.
As an alternative, use Apache::Session sessions to store inter-session data.
As you noted, shared memory. For example use IPC::ShareLite, IPC::Cache, or this solution from perlmonks.
Also, please check Chapter 16 Recipe 12 "Sharing Variables in Different Processes" from O'Reilly's "Perl Cookbook" (no link since non-pirated versions aren't online anywhere I know of)
Use a permanent medium. A file is one option. A database is another.
For async, use an intermediate messaging system (MQ, Tibco, something more lightweight). Probably a bit of overkill in this scenario, but a valid option to be aware of. This one is likely to be pretty stable, solid and optimized, but possibly not free and less flexible/tailored.
Or roll your own simple messaging server - it's not THAT complicated for the very simple one you seem to need.
Listen on one port for requests from the first process to store data, listen on another port for requests from the consumer process to send it that data, store the data in a storage area in memory, and purge it when it expires using alarms or a separate watcher child process.
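To illustrate the shape of that last option (sketched here in Python rather than Perl, and using a single port with a command prefix instead of the two ports described above, purely to keep it short):

import socketserver
import threading
import time

STORE = {}   # key -> (value, expiry timestamp)
TTL = 300    # seconds to keep an entry (placeholder value)
LOCK = threading.Lock()

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        line = self.rfile.readline().decode().strip()
        cmd, _, rest = line.partition(" ")
        with LOCK:
            # lazily purge expired entries on every request
            now = time.time()
            for key in [k for k, (_, exp) in STORE.items() if exp < now]:
                del STORE[key]
            if cmd == "PUT":
                key, _, value = rest.partition(" ")
                STORE[key] = (value, now + TTL)
                self.wfile.write(b"OK\n")
            elif cmd == "GET":
                value, _ = STORE.get(rest, ("", 0))
                self.wfile.write((value + "\n").encode())

if __name__ == "__main__":
    # the writer process sends "PUT <key> <value>", the reader sends "GET <key>"
    with socketserver.ThreadingTCPServer(("127.0.0.1", 9999), Handler) as server:
        server.serve_forever()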
You've tagged your question as "cgi". Are they both CGI programs? In that case, they can just talk to each other by making HTTP requests.
However, you'll have to tell a lot more about why you are trying to do this and what you need to accomplish for us to help you. It's certainly easy for Perl programs to communicate with each other in some fashion, but that doesn't mean it's the right answer for you.
When you have complex requirements for interaction among CGI programs, you probably want to move to a web framework that handles a lot of those details for you. Catalyst might be where you'd want to start. There's even a book for it.