Need inputs on optimising web service calls from Perl

Current implementation:
Divide the original file into as many files as there are servers.
Ensure each server picks up one file for processing.
Each server splits its file into 90 buckets.
Use Parallel::ForkManager to fork 90 processes, each operating on one bucket (see the sketch after these steps).
The child processes make the API calls.
Merge the output of the child processes.
Merge the output of each server.
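A minimal Parallel::ForkManager sketch of the fork-per-bucket step, for reference (the bucket file names and call_api() are placeholders, not the actual code):

use strict;
use warnings;
use Parallel::ForkManager;

# Placeholders: bucket_*.txt and call_api() stand in for the real data and web service call.
my @buckets = glob('bucket_*.txt');            # the 90 per-server buckets
my $pm      = Parallel::ForkManager->new(scalar @buckets);

for my $bucket (@buckets) {
    $pm->start and next;                       # parent moves on to the next bucket
    open my $in,  '<', $bucket       or die "$bucket: $!";
    open my $out, '>', "$bucket.out" or die "$bucket.out: $!";
    while (my $user = <$in>) {
        chomp $user;
        print {$out} call_api($user), "\n";    # one web service call per user
    }
    $pm->finish;                               # child exits here
}
$pm->wait_all_children;                        # afterwards, merge the *.out files

sub call_api {
    my ($user) = @_;
    return "result-for-$user";                 # stand-in for the real 40KB download
}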
Stats:
The size of the content downloaded per API call is 40KB.
On 2 servers, the above process runs in 15 minutes for a 225k-user file. My aim is to finish a 10-million-user file in 30 minutes. (Hope this doesn't sound absurd!)
I considered using BerkeleyDB, but couldn't find how to convert the BerkeleyDB file back into a normal ASCII file.
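(For anyone with the same BerkeleyDB question: such a file can be walked like any tied hash. A rough DB_File sketch follows - the file name and tab-separated output format are assumptions - and the db_dump utility that ships with Berkeley DB does much the same from the command line.)

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDONLY);

# Assumption: the data lives in users.db as a plain key/value hash.
my %db;
tie %db, 'DB_File', 'users.db', O_RDONLY, 0644, $DB_HASH
    or die "Cannot open users.db: $!";

while (my ($key, $value) = each %db) {
    print "$key\t$value\n";                    # dump as tab-separated ASCII
}

untie %db;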

This sounds like a one-time operation to me. Although I don't understand the 30 minute limit, I have a few suggestions I know from experience.
First of all, as I said in my comment, your bottleneck will not be reading the data from your files, nor will it be writing the results back to a hard drive. The bottleneck will be the transfer between your machines and the remote machines. Your setup sounds sophisticated, but that might not help you in this situation.
If you are hitting a web service, someone is running that service, and servers can only handle a certain amount of load. I have brought down the dev environment servers of a big logistics company with a very small load test I ran at night. Often, these things are equipped for long-term load, but not for short, heavy load.
Since IT is all about talking to each other through various protocols, like web services or other APIs, you should also consider just talking to the people who run this service. If you have a business relationship, that is easy. If not, try to find a way to reach them and ask whether their service can handle that many requests at all. You could end up being excluded permanently, because to their admins it looks like you tried to DDoS them.
I'd ask them if you could send them the files (or an excerpt of the data, cut down to what is relevant for processing) so they can do the operations in batch on their side. That way, you remove the load for processing everything as web requests, and the time it takes to do these requests.

Site on two different servers

I'm considering getting a web server in China to reduce site loading times for users in China. The problem is how to sync/keep the same data between the two sites. When content is edited on the main site, the changes should be propagated to the site on the China server.
The server is running Linux, Apache and MySQL. The website uses WordPress.
FYI, I'm already using a CDN and the site loading speed from China is still too long.
Basically your solution would need to...
Copy the entire contents of your http'd directory from the main server to the Chinese server.
Copy the entire contents of your MySQL database from the main server to the Chinese server.
Perform these tasks at a regular interval without manual intervention.
I can guide you to references that will help with each task and sometimes can show you a quick example. However, if you want to get it to work and especially if you want to optimize the process, you're going to have to look through the references yourself.
If I didn't do it this way this answer would get even more horrendously long than it already is.
Before we start you should remember...
Thing 0 - Please Try Not to be Intimidated by the Length of this Answer
I know I've written a lot, perhaps more than I should have, but I guarantee you are capable of implementing this in no more than a day. I have tried to be thorough but that does not mean that what I'm describing is particularly complicated.
Thing 1 - Shut Down your Chinese Server During Transfer
This transfer of data is going to make your Chinese server unusable while it's in progress, as you might have guessed. You need to make sure that your Chinese server is not operational during the transfer. Otherwise the server might have only partial data available, which could cause problems for both client and server, particularly in relation to MySQL.
Thing 2 - Use Compression as much as You Can
As time consuming as compression and decompression can be for large amounts of data, believe me it is nothing compared to the time you will waste sending the uncompressed data to China. Network usage, not processor time, is really going to be the limiting factor in getting the transfer done quickly. Try to send compressed files whenever possible.
Thing 3 - Try to Use Checksums
Sending all your data, particularly in compressed format, will leave it vulnerable to corruption in transit. Whenever you send a file I encourage you to use some kind of checksum on the data to verify that it has not been corrupted. For brevity I will not be showing you how to do this but I'm sure you're smart enough to figure out how to pepper in some verification.
In case you're not familiar with checksums, the Wikipedia article about them is pretty straightforward. The most commonly used are MD5 and SHA-1, but both of those are somewhat collision-prone. I would recommend SHA-2 (also called SHA-256/512) or the very new SHA-3.
Copying your Http'd Directory to the Chinese Server
As far as I know (and I could be wrong) there is no built-in way to transfer files from one Apache server to another...so you're going to have to write your own script for this.
You're also going to need to have two separate scripts: one for the main server and one for the Chinese server. Here's a breakdown of what each script needs to do.
On your main server...
Log in as your Apache server's user. (Reference for switching users.)
zip/gzip/tar.gz your http'd directory's contents. (Reference for zip. Reference for gzip. Reference for tar.)
scp (secure copy) the compressed file to your Chinese server. Make sure to copy it to the username that Apache runs under. (Reference for scp.)
Delete the compressed file.
Initiate the Chinese server's script (this will be discussed later).
You will likely be using a shell script for all of this, so I hope you're familiar with the terminal. A simple example would look like this.
#!/bin/sh
## First I'll define some variables to explain this better.
APACHE_USER="whatever your Apache server's username is (usually it's www-data)"
WWW_DIR="your http'd directory (usually it's /var/www)"
CHINA_HOST="the host name/IP address of your Chinese server"
CHINA_USER="Apache's username on the Chinese server"
CHINA_HOME="the home directory of the Apache user on your Chinese server"
## Now to the real scripting. I will be using zip for compression.
## Run this whole script as the Apache user (e.g. sudo -u www-data /path/to/script.sh);
## calling su - inside the script would only open a new shell and would not
## run the commands below as that user.
zip -r copy.zip "$WWW_DIR"
## scp does not read passwords from stdin, so set up SSH key-based
## authentication between the two servers instead of hard-coding a password.
scp copy.zip "$CHINA_USER@$CHINA_HOST:$CHINA_HOME"
rm copy.zip
## Then you initiate the next step of the process.
## Like I said this will be covered later.
On your Chinese server...
Log in as the Apache user.
Delete the content of the http'd directory (probably /var/www relative to ~).
Decompress the scp'd file (this will change depending on how you compressed it).
Copy the decompressed directory to the http'd directory (this step is unnecessary if you choose to compress with zip).
Delete the compressed, scp'd file.
Notify the main server to continue with the next step (again, this will be discussed later).
This is pretty straightforward and I don't think you need another example for this part.
Copying the MySQL Database Contents
You can find a good reference for how to do this in this article from the MySQL website. Basically, copying database contents is a built-in feature. Try to make use of the compression options!
Performing these Tasks at Regular Intervals without Manual Intervention
Ok this is where things get kind of complicated.
The first thing you need to know is how to schedule tasks at regular intervals on Linux. This is done with a command line tool called crontab. You can see good examples for setting up cron jobs in this article, and the full crontab documentation here.
However what will take more skill than just scheduling the job at regular intervals will be synchronizing the data transfer. If you simply set one server to send data at a certain time and the other to receive it at a certain time, you will get many bugs. Be sure of that.
My recommendation would be to create a socket in the Chinese server that listens for instructions from the main server.
This can be done in a variety of languages. Because you're using Linux I would recommend doing this in C, but it can be done in almost any language including Bash.
A full example would be too much but basically this will be the flow of what you have to do.
Socket in China listens for connections.
Cron job in main server connects to China socket.
Main server authenticates itself.
Chinese server stops Apache, stops accepting requests.
Chinese server acknowledges authentication approved.
Main server scp's website contents to Chinese server.
Main server tells Chinese server that scp is complete.
Chinese server replaces Apache's http'd directory's contents with the data that has been scp'd.
Chinese server announces success to main server.
Main server copies MySQL data.
Main server tells Chinese server process is complete.
Chinese server resumes Apache service.
Chinese server notifies main server that service is resumed.
Socket is closed.
Chinese server goes back to listening for connection from main server.
I hope this helps!

What are the limitations of the flask built-in web server

I'm a newbie in web server administration. I've read multiple times that the Flask built-in web server is not designed for "production", and must be used only for tests and debugging...
But what if my app touches only a thousand users who occasionally send data to the server?
If it works, when will I have to bother with the configuration of a more sophisticated web server? (I am looking for approximate metrics.)
In a nutshell, I would love to find out what the built-in web server can do (with approximate thresholds) and what it cannot.
Thanks a lot!
There isn't one right answer to this question, but here are some things to keep in mind:
With the right amount of horizontal scaling, it is quite possible you could keep scaling out use of the debug server forever. When exactly you would need to start scaling (or switch to using a "real" web server) would also depend on the environment you are hosting in, the expectations of the users, etc.
The main issue you would probably run into is that the server is single-threaded. This means that it will handle each request one at a time, serially, so if you are trying to serve more than one request (including favicons, static items like images, CSS and Javascript files, etc.) the requests will take longer. If any given request happens to take a long time (say, 20 seconds) then your entire application is unresponsive for that time (20 seconds). This is only the default, of course: you could bump the thread counts (or have requests be handled in other processes), which might alleviate some issues. But once again, it can still be slow under a "high" load. What is considered a "high" load will depend on your application and the expectations of a maximum acceptable response time.
Another issue is security: if you are concerned at ALL about security (and not just the security of the data in the application itself, but the security of the box that will be running it as well) then you should not use the development server. It is not ready to withstand any sort of attack.
Finally, the development server could just fail outright. It is not designed to be used as a long-running process (days, weeks, months), and so it has not been well tested to work in this capacity.
So, yes, it has limitations. Yes, you could still conceivably use it in production. And yes, I would still recommend using a "real" web server. If you don't like the idea of needing to install something like Apache or Nginx, you can still go with a solution that is as easy as "run a Python script" by using one of the standalone WSGI servers, which can run a production-ready server with something as simple as python run_app.py on the command line. You typically just need to create a 4-5 line Python script to import and create the server object, point it to your Flask app, and run it.
gunicorn could be run with only the following on the command line, no extra script needed:
gunicorn myproject:app
...where "myproject" is the Python package that contains the app Flask object. Keep in mind that one of developers of gunicorn would probably recommend against this approach. See https://serverfault.com/questions/331256/why-do-i-need-nginx-and-something-like-gunicorn.
The OP has long since moved on, but for those who encounter this question in the future I would just add that setting up an Apache server, even on a laptop, is free and pretty easy. It can be readily configured for as few or as many features as you want just by uncommenting or commenting out lines in the config file. There might be an even easier GUI method for doing that nowadays, but just editing the configs is simple.

How can I communicate across Perl CGI scripts?

I am searching for efficient ways of communication across two Perl
scripts. I have two scripts; Script 1 generates some data. I want my
Script 2 to be able to access that information.
The easiest/dumbest
way is to write the data generated by Script 1 as a file and read it
later using Script 2. Is there any other way than this? Can I store
the data in memory and make it available to Script 2 (of course with
support from my Linux)? Meaning, malloc some data in Script 1 and make
Script 2 able to access it.
There is no guarantee that Script 2 will be run after Script 1. So
there should be some way to free that memory using a watchdog timer.
Let me reveal some more context. I am running these scripts on a web server using CGI-Perl. So at the click of a button, Script 1 is run and generates an HTML web page. Now the user can add some input to this generated web page and click a button on this new page. Now Script 2 should be able to read the data on the new web page. I can post the data back to the web server again, but a more efficient way would be to keep a copy of the generated page on the server as well and make it available to Script 2. I would like to avoid writing the generated page out as a file; I was thinking of storing it in memory.
This depends somewhat on your usage... one large set of data? Many small messages? Do you care at all about data persistence? Is it TOTALLY asynchronous?
Some of the options are:
For any but the most high-performance web sites, the best approach is to write out the HTML pages to files! Unless the inter-process communication is benchmarked to be the bottleneck in performance, don't bother with any of the non-file solutions (shared memory, cache, intermediate server).
Specifically for two CGI scripts on the same server, if you run them under mod_perl or some other arrangement which shares a Perl interpreter between the two CGI processes, you can develop a package to serve as a cache, which - with its package-level variables - would be preserved in memory by mod_perl as long as mod_perl is running, and can thus be used by a writer CGI process and a reader CGI process to communicate. Of course, the usual synchronization/deadlock and persistence issues associated with readers/writers need to be considered.
As an alternative, use Apache::Session sessions to store inter-session data.
As you noted, shared memory. For example, use IPC::ShareLite, IPC::Cache, or this solution from perlmonks (a short IPC::ShareLite sketch follows this list).
Also, please check Chapter 16 Recipe 12 "Sharing Variables in Different Processes" from O'Reilly's "Perl Cookbook" (no link since non-pirated versions aren't online anywhere I know of)
Use a permanent medium. A file is one option. A database is another.
For async, use an intermediate messaging system (MQ, Tibco, something more lightweight). Probably a bit of overkill in this scenario, but a valid option to be aware of. This one is likely to be pretty stable, solid and optimized, but possibly not free and less flexible/tailored.
Or roll your own simple messaging server - it's not THAT complicated for the very simple one you seem to need.
Listen on one port for requests from the first process to store data, listen on another port for requests from the consumer process to send it that data, store the data in a storage area in memory, and purge it when it expires using alarms or a separate watcher child process.
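To make the shared-memory option a bit more concrete, here is a rough IPC::ShareLite sketch; the key value is arbitrary and locking, expiry and error handling are left out:

# Writer side (e.g. the first CGI script): store the generated page in shared memory.
use strict;
use warnings;
use IPC::ShareLite;

my $share = IPC::ShareLite->new(
    -key     => 1971,                  # arbitrary key; both scripts must agree on it
    -create  => 'yes',
    -destroy => 'no',
) or die "Cannot create shared segment: $!";

$share->store('<html>...generated page...</html>');

# Reader side (e.g. the second CGI script): fetch it later.
my $reader = IPC::ShareLite->new(
    -key     => 1971,
    -create  => 'no',
    -destroy => 'no',
) or die "Cannot attach to shared segment: $!";

my $page = $reader->fetch;
print "Content-type: text/html\n\n$page";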
You've tagged your question as "cgi". Are they both CGI programs? In that case, they can just talk to each other by making HTTP requests.
However, you'll have to tell a lot more about why you are trying to do this and what you need to accomplish for us to help you. It's certainly easy for Perl programs to communicate with each other in some fashion, but that doesn't mean it's the right answer for you.
When you have complex requirements for interaction among CGI programs, you probably want to move to a web framework that handles a lot of those details for you. Catalyst might be where you'd want to start. There's even a book for it.

How is Accurev Performance?

How is performance in the current version (4.7) of Accurev?
time to checkout per 100mb, per gb?
time to commit per # of files or mb?
responsiveness of gui when 100+ streams?
I just had a demo of Accurev, and the streams look like a lightweight way to model workflow around code/projects. I've heard people praising Accurev for the streams back end and complaining about performance. Accurev appears to have worked on the performance, but I'd like to get some real world data to make sure it isn't a case of demos-well-runs-less-well.
Does anyone have Accurev performance anecdotes or (even better) data from testing?
I don't have any numbers but I can tell you where we have noticed performance issues.
Our builds typically use 30-40K files from source control. In my workspace currently there are over 66K files including build intermediate and output files, over 15GB in size. To keep AccuRev working responsively we aggressively use the ignore elements so AccuRev ignores any intermediate files such as *.obj. In addition we use the time stamp optimization. In general running an update is quick, but the project sizes are typically 5-10 people so normally only a couple of dozen files come down if you update daily. Even if someone made changes that touched lots of files speed is not an issue. On the other hand a full populate of all 30K+ files is slow. I don't have a time since I seldom do this and on the rare occasion I do, I run the populate when I'm going to lunch or a meeting. I expect it could be as much as 10 minutes. In general source files come down very quickly, but we have some large binary files, 10-20MB, that take a couple of seconds each.
If the exclude rules and ignore elements are not correctly configured, AccuRev can take a couple of minutes to run an update for workspaces of this size. When I hear other developers complaining about the speed, I know something is misconfigured and we get it straightened out.
A year or so ago one of the projects updated Boost, bringing in 25K+ files, and also added Firefox to the repository (I forget the size, but it made Boost look small). They also added ICU, wrote a lot of software and modified countless files. In all I recall there were approx 250K+ files sitting in a stream. I unfortunately decided that all their good code should be promoted to the root so all projects could share. This turned out to be a little beyond what AccuRev could handle well. It was a multi-hour process getting all the changes promoted. As I recall, once Firefox was promoted the rest went smoothly - perhaps a single transaction with over 100K files was the issue?
I recently updated Boost and so had to keep and promote 25K+ files. It took a minute or two, but that's not unreasonable considering the number of files and the size of the binaries.
As for the number of streams, we have over 800 streams and workspaces. Performance here is not an issue. In general I find the large number of streams hard to navigate, so I run a filtered view of just my workspaces and just the streams I'm interested in. However, when I need to look at the unfiltered list to find something, performance is fine.
As a final note, AccuRev support is terrific - we call them the voice in the sky. Every now and again we shoot ourselves in the foot using AccuRev and wind up clueless on how to fix things. Almost always we did something dumb and then tried something dumber to fix it. Eventually we place a support request and the next thing we know they are walking us through the steps to righteousness, either on the phone or in a GoToMeeting session. I've even contacted them for trivial things that I just don't have time to figure out as I'm having a hectic day, and they kindly walk me through it rather than telling me to RTFM.
Edit 2014: We can now get acceptable X-Windows performance by using the commercial version of RealVNC.
Original comment: This answer applies to any version of Accurev, not just 4.7. Firstly, GUI performance might be OK if you can use the web client. If you can't use the web client and you want GUI performance, then you'd better be using Windows, or have all your developers in one place, i.e. where the Accurev server is located. Try to run the GUI on X-Windows over a WAN? Forget it: our experience has been dozens of seconds or minutes for basic point-and-click operations. This is over a fairly good WAN about 800 miles distant, with an almost optimal ping time. This is not a failing of Accurev, but of X-Windows, and you'll likely have similar problems with other X applications over a WAN. So avoid basic X if you possibly can. Currently we cannot, and our WAN users are forcibly relegated to command-line only.
The basic problem is that Accurev is centralized and you can't increase the speed of light. I believe you can get around WAN latency by running Accurev Replication Servers, but that still does not properly address the problem if you have remote developers at single-person offices over VPN. It is ironic that the replication servers somewhat turn this centralized VCS into a form of DVCS.
If you don't have replication servers, then a horrible but somewhat workable work-around is to use a delta-synchronization tool such as rsync to sync your source tree between your local machine where you can run the GUI (i.e. GUI running directly on your Windows or Linux laptop) and the machine where you're actually working (e.g. a UNIX machine 1,000 miles away). Another option is to use something like VNC, which works better over a WAN than X, connecting to a virtual desktop at the Accurev server's location, and use X from there.
At my workplace more than one team has resorted to using Mercurial on the side and promoting to Accurev only when it's strictly necessary. As Stephen Nutt points out above, other necessary work is to use time-stamp optimization and ignores. We also have our Accurev admins (yes, it requires you employ people to babysit it) complain when we need to include large numbers of files, despite the fact that they form a core part of our product and MUST be included and version controlled. Draw your own conclusions.

Detect a file in transit?

I'm writing an application that monitors a directory for new input files by polling the directory every few seconds. New files may often be several megabytes, and so take some time to fully arrive in the input directory (eg: on copy from a remote share).
Is there a simple way to detect whether a file is currently in the process of being copied? Ideally any method would be platform and filesystem agnostic, but failing that specific strategies might be required for different platforms.
I've already considered taking two directory listings separated by a few seconds and comparing file sizes, but this introduces a time/reliability trade-off that my superiors aren't happy with unless there is no alternative.
For background, the application is being written as a set of Matlab M-files, so no JRE/CLR tricks I'm afraid...
Edit: files are arriving in the input directory by a straight move/copy operation, either from a network drive or from another location on a local filesystem. This copy operation will probably be initiated by a human user rather than another application.
As a result, it's pretty difficult to place any responsibility on the file provider to add control files or use an intermediate staging area...
Conclusion: it seems like there's no easy way to do this, so I've settled for a belt-and-braces approach (a rough sketch follows the list) - a file is ready for processing if:
its size doesn't change in a certain period of time, and
it's possible to open the file in read-only mode (some copying processes place a lock on the file).
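A rough Perl sketch of that check, for anyone wanting to adapt the idea (the 5-second stability window is an arbitrary choice; the actual application here is written in Matlab):

use strict;
use warnings;

# A file is treated as ready if its size is stable over a short window
# and it can be opened (some copy processes hold a lock until they finish).
sub file_is_ready {
    my ($path) = @_;

    my $size_before = -s $path;
    return 0 unless defined $size_before;
    sleep 5;
    my $size_after = -s $path;
    return 0 unless defined $size_after && $size_after == $size_before;

    open(my $fh, '<', $path) or return 0;   # still locked by the copier?
    close $fh;
    return 1;
}

print "ready\n" if file_is_ready('incoming/data.bin');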
Thanks to everyone for their responses!
The safest method is to have the application(s) that put files in the directory first put them in a different, temporary directory, and then move them to the real one (which should be an atomic operation even when using FTP or file shares). You could also use naming conventions to achieve the same result within one directory.
Edit:
It really depends on the filesystem, on whether its copy functionality even has the concept of a "completed file". I don't know the SMB protocol well, but if it has that concept, you could write an app that exposes an SMB interface (or patch Samba) and an API to get notified for completed file copies. Probably a lot of work though.
This is a middleware problem as old as the hills, and the short answer is: no.
The two 'solutions' put the onus on the file-uploader: (1) upload the file in a staging directory and then move it into the destination directory (2) upload the file, and then create/upload a 'ready' file that indicates the state of the content file.
The 1st one is the better, but both are inelegant. The truth is that better communication media exist than the filesystem. Consider using some IPC that involves only a push or a pull (and not both, as does the filesystem) such as an HTTP POST, a JMS or MSMQ queue, etc. Furthermore, this can also be synchronous, allowing the process receiving the file to acknowledge the content, even check it for worthiness, and hand the client a receipt - this is the righteous road to non-repudiation. Follow this, and you will never suffer arguments over whether a file was or was not delivered to your server for processing.
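As a rough illustration of that push approach, a minimal Perl/LWP sketch (the URL and the form field name are made up for illustration):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Push the file to a hypothetical receiving endpoint instead of dropping it
# in a watched directory.
my $response = $ua->post(
    'http://example.com/intake',                 # made-up endpoint
    Content_Type => 'form-data',
    Content      => [ upload => ['data.bin'] ],  # file to send
);

# The synchronous response doubles as the delivery receipt mentioned above.
die 'Upload rejected: ' . $response->status_line unless $response->is_success;
print $response->decoded_content, "\n";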
M.
One simple possibility would be to poll at a fairly large interval (2 to 5 minutes) and only acknowledge the new file the second time you see it.
I don't know of a way in any OS to determine whether a file is still being copied, other than maybe checking if the file is locked.
How are the files getting there? Can you set an attribute on them as they are written and then change the attribute when write is complete? This would need to be done by the thing doing the writing ... which sounds like it isn't an option.
Otherwise, caching the listing and treating a file as new if it has the same file size for two consecutive listings is the best way I can think of.
Alternatively, you could use the modified time on the file - the file has to be new and have a modified time that is at least x in the past. But I think this will be about equivalent to caching the listing.
If you are polling the folder every few seconds, it's not much of a time penalty, is it? And it's platform agnostic.
Also, linux only: http://www.linux.com/feature/144666
Like cron but for files. Not sure how it deals with your specific problem - but may be of use?
What is your OS? In Unix you can use the "lsof" utility to determine if a user has the file open for write. Apparently somewhere in the MS Windows Process Explorer there is the same functionality.
Alternatively, you could just try an exclusive open on the file and bail out if this fails. But this can be a little unreliable and it's easy to tread on your own toes.