Lazy Expiration in Memcached

I read this question: How does the lazy expiration mechanism in memcached operate?
So I have a question. Is it possible/recommended to write a program that periodically sends a GET request for every item in memcached so that expired items are removed?
The reason I want this is that I want proper control over the percentage of memory used in memcached. If that percentage is near 100%, I can never be sure whether I should add more memory or not worry because many of the items are already expired.
I use PHP, and unfortunately this doesn't return all the items in memcached (no clue why):
$allSlabs = $memcache->getExtendedStats('slabs');
foreach ($allSlabs as $server => $slabs) {
    foreach ($slabs as $slabId => $slabMeta) {
        $cdump = $memcache->getExtendedStats('cachedump', (int)$slabId);
        foreach ($cdump as $keys => $arrVal) {
            foreach ($arrVal as $key => $v) {
                echo $key, "\n";
            }
        }
    }
}

You should just use the eviction rate to see whether you need to add more memory. The "evictions" stat increases whenever memcached has to throw out a live object to make room for storing the next one, and the "reclaimed" stat counts the number of times memcached could reuse the memory of an expired item to store a new one.
The code you suggest would bog down the server by repeatedly dumping its entire contents.
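As a point of reference, here is a minimal sketch of reading those counters (and the used percentage the question asks about) straight off the memcached stats protocol. The question uses PHP, but the stat names are the same whichever client you use; the host and port below are illustrative.
use strict;
use warnings;
use IO::Socket::INET;

# Connect to the memcached instance (adjust host/port as needed).
my $sock = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => 11211,
    Proto    => 'tcp',
) or die "connect: $!";

print $sock "stats\r\n";

my %stats;
while (my $line = <$sock>) {
    last if $line =~ /^END/;
    $stats{$1} = $2 if $line =~ /^STAT (\S+) (\S+)/;
}

# evictions: live items thrown out to make room; reclaimed: expired items whose memory was reused.
print "evictions: $stats{evictions}\n";
print "reclaimed: $stats{reclaimed}\n";
printf "used: %.1f%%\n", 100 * $stats{bytes} / $stats{limit_maxbytes};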

Related

Perl - Net::WebSocket::Server with an external periodic event

In my server program I need the ability to iterate every 5 minutes through all open connections and see which of them are really "active".
I know the best approach is to use a "heartbeat", but the server still needs some way to check whether a connection is "off" so that it can delete the "user parameters" attached to that connection.
My first approach was to use the Async module, but that runs in a separate process, so I cannot delete any element from the main process unless I find a way for the child process to invoke a subroutine in the main process (I don't know how; any help will be warmly welcomed).
Another possibility with Async is to create a permanent client (running on the server itself) that keeps sending "commands" to the server, but that seems excessive to me: such a client wastes memory and eats CPU time (much more, I think, than a simple event, the equivalent of setTimeout in JS).
Yet another approach is to use EV, but when I call EV::run it runs NOTHING but this periodic event, meaning it never reaches the next line, where the server's ->start is.
Placing it after ->start makes the event useless too: while the server is running, the program never gets past the ->start line.
Using EV::run EV::RUN_NOWAIT; lets the server work, but then the EV watcher somehow does not fire, for some strange reason (does anyone know how I can still make it work?).
I prefer not to use Net::WebSocket::EV, because according to its examples it does not do the handshake automatically, and I would have to handle many things manually (including the SSL connection I use), which would mean changing a lot in my program.
PROBLEM SUMMARY:
How do I make the EV code run every 5 minutes, together with the server?
my %CON; # Connections (and user data) hash
my $__ConChk = EV::periodic 0, 300, 0, sub {
    my @l = keys %CON;
    for (my $i = 0; $i < @l; $i++) {
        if ($CON{$l[$i]}{"T"} + 3600 < time()) { # I give one hour to be completely offline (for different reasons)
            $CON{$l[$i]}{"C"}->disconnect();
            delete $CON{$l[$i]};
        }
    }
};
EV::run EV::RUN_NOWAIT; # This is not working - seems to be ignored!
Net::WebSocket::Server->new(
    listen      => $ssl, # Earlier preset
    silence_max => 60,   # Time to just shut the connection off, but don't delete user data
    on_connect  => sub {
        my ($serv, $conn) = @_;
        my $cid; # Connection ID (for the hash)
        $conn->on(
            handshake => sub {
                my ($conn, $handshake) = @_;
                # Create user data in $CON{$cid}
            },
            binary => sub {
                $CON{$cid}{"T"} = time();
                # Handling of a single incoming message (command)
            },
            disconnect => sub {
                # Do NOT delete the entry!! Maybe the connection dropped due to instability!!
            }
        );
    }
)->start; # This will run - but EV::run is ignored - what to do?
undef $__ConChk;

How to manage a long running process in a Catalyst App?

This is my first Catalyst app and I'm not sure how to solve the following problem.
The user enters some data in a form and selects a file (up to 100MB) for uploading. After submitting the form, the actual computation takes up to 5 minutes and the results are stored in a DB.
What I want to do is run this process (and maybe also the file upload) in the background to avoid a server timeout. There should be some kind of feedback to the user (like a message "Job has been started" or a progress bar). The form should be blocked while the job is still running, and a result page should be displayed once the job has finished.
In hours of reading I stumbled upon concepts like asynchronous requests, job queues, daemons, Gearman, or Catalyst::Plugin::RunAfterRequest.
How would you do it?
Thanks for helping a web dev novice!
PS: In my current local app the work is done in parallel with Parallel::ForkManager. For the real app, would it be advisable to use a cloud computing service like Amazon EC2, or just find a hosting provider that offers multi-core servers?
Somehow I couldn't quite get the hang of File::Queue. For non-blocking parallel execution, I ended up using a combination of TheSchwartz and Parallel::Prefork, as implemented in the Foorum Catalyst app.
Basically, there are 5 important elements. Maybe this summary will be helpful to others.
1) TheSchwartz DB
2) A client (DB handle) for the TheSchwartz DB
package MyApp::TheSchwartz::Client;
use TheSchwartz;
use Exporter 'import';
our @EXPORT_OK = qw(theschwartz); # so callers can do: use MyApp::TheSchwartz::Client qw/theschwartz/;

sub theschwartz {
    my $theschwartz = TheSchwartz->new(
        databases => [ {
            dsn  => 'dbi:mysql:theschwartz',
            user => 'user',
            pass => 'pass',
        } ],
        verbose => 1,
    );
    return $theschwartz;
}

1;
3) A job worker (where the actual work is done)
package MyApp::TheSchwartz::Worker::Test;
use base qw( TheSchwartz::Moosified::Worker );
use MyApp::Model::DB; # Catalyst DB connect_info
use MyApp::Schema;    # Catalyst DB schema

sub work {
    my $class = shift;
    my $job   = shift;
    my ($args) = $job->arg;
    my ($arg1, $arg2) = @$args;
    # re-use the Catalyst DB schema
    my $connect_info = MyApp::Model::DB->config->{connect_info};
    my $schema = MyApp::Schema->connect($connect_info);
    # do the heavy lifting
    $job->completed();
}

1;
4) A worker process, TheSchwartzWorker.pl, that monitors the job table non-stop
use MyApp::TheSchwartz::Client qw/theschwartz/; # db connection
use MyApp::TheSchwartz::Worker::Test;
use Parallel::Prefork;

my $client = theschwartz();

my $pm = Parallel::Prefork->new({
    max_workers  => 16,
    trap_signals => {
        TERM => 'TERM',
        HUP  => 'TERM',
        USR1 => undef,
    }
});

while ($pm->signal_received ne 'TERM') {
    $pm->start and next;
    $client->can_do('MyApp::TheSchwartz::Worker::Test');
    my $delay = 10; # When no job is available, the working process will sleep for $delay seconds
    $client->work( $delay );
    $pm->finish;
}
$pm->wait_all_children();
5) In the Catalyst controller: insert a new job into the job table and pass some arguments
use MyApp::TheSchwartz::Client qw/theschwartz/;

sub start : Chained('base') PathPart('start') Args(0) {
    my ($self, $c) = @_;
    my $client = theschwartz();
    $client->insert('MyApp::TheSchwartz::Worker::Test', [ $arg1, $arg2 ]);
    $c->response->redirect(
        $c->uri_for(
            $self->action_for('archive'),
            { mid => $c->set_status_msg("Run '$name' started") }
        )
    );
}
The new run is greyed out on the "archive" page until all results are available in the database.
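For completeness, the archive page needs something to check against; a hypothetical status action could look roughly like the sketch below. The DB::Run result set, its finished column, and the JSON shape are invented here purely for illustration; adapt them to your own schema.
# Hypothetical controller action the archive page can poll via AJAX.
sub status : Chained('base') PathPart('status') Args(1) {
    my ($self, $c, $run_id) = @_;
    my $run  = $c->model('DB::Run')->find($run_id);   # invented result set
    my $done = ($run && $run->finished) ? 1 : 0;      # invented column
    $c->response->content_type('application/json');
    $c->response->body($done ? '{"finished":true}' : '{"finished":false}');
}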
Put the job in a queue and do it in a different process, outside of the web application. While your Catalyst process is busy, even if it uses Catalyst::Plugin::RunAfterRequest, it cannot be used to process other web requests.
There are very simple queuing systems, like File::Queue. Basically, you assign a job ID to the document and put it in the queue; another process checks the queue and picks up new jobs.
You can save the job status in a database, or anywhere else the web application can access. On the front end, you can poll the job status every X seconds or minutes to give feedback to the user.
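A minimal sketch of that pattern with File::Queue follows (untested; the queue file path and the job-string format are made up for illustration):
use File::Queue;

my $q = File::Queue->new(File => '/tmp/myapp.queue'); # hypothetical path

# In the web application: hand the work off and return immediately.
$q->enq("job-42:/path/to/uploaded/file");

# In a separate worker process: poll the queue and do the heavy lifting.
while (1) {
    if (my $job = $q->deq()) {
        my ($job_id, $file) = split /:/, $job, 2;
        # ... run the long computation, then record the job status in the DB
        # so the front end's polling can report it as finished.
    }
    else {
        sleep 5; # nothing queued right now
    }
}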
You have to figure out how much memory and CPU you need. A multi-core CPU or multiple CPUs may not be required, even with several processes running. Choosing between a dedicated server and a cloud like EC2 is more about flexibility (resizing, snapshots, etc.) versus price.

Why does a program with Parallel::Loops exhaust my memory?

I've inherited some code at work that I'm trying to improve on. My Perl skills are somewhat lacking, so I would love some assistance!
Essentially this script SNMP-polls a network of thousands of nodes to update its local interface index cache. I've found it hits a problem where it exhausts its memory and fails. The code is as follows (heavily reduced, but I think you'll get the gist):
use strict;
use warnings;
use Parallel::Loops;

my $maxProcs = 50;
my @exceptions;
my @devices;
my %snmp_results;

my $pl = Parallel::Loops->new($maxProcs);
$pl->share(\%snmp_results, \@exceptions);

load_devices();
get_snmp_interfaces();

sub get_snmp_interfaces {
    $pl->foreach(\@devices, sub {
        my ($name, $community, $snmp_ver) = @$_;
        # Create the new ifindex cache, and return a reference to the new entries
        my $result = getSNMPIFFull($name, $community, $snmp_ver);
        if (defined $result && $result ne "") {
            my %cache = %{$result};
            print "Got cache for $name\n";
            # Build hash of all the links polled through SNMP
            # [ifindex, ifdesc, ifalias, ifspeed, ip]
            for my $link (keys %cache) {
                $snmp_results{$name}{$cache{$link}[0]} = [$cache{$link}[0], $cache{$link}[1], $cache{$link}[2], $cache{$link}[3], $cache{$link}[4]];
            }
        }
        else {
            push(@exceptions, "Unable to poll $name - $community - $snmp_ver");
        }
    });
}
This particular VM has 3.1 GB of allocatable RAM and idles at about 83 MB of usage when this script is not running. If I drop maxProcs down to 25 it finishes fine, but the script can already take a long time given the sheer number of devices plus latency, so I would rather keep the parallelism high!
I have a feeling that $pl->share() is sharing the ever-expanding %snmp_results with each forked process, which is definitely not necessary since each worker is not reading or modifying other entries, just adding new ones. Is there a better way I can be doing this?
I'm also slightly unsure about my %cache = %{$result};. If this just creates an alias for the hash then cool, but if it does a copy, that's also a bit wasteful!
Any help will be greatly appreciated!
Documentation of the module can be found on CPAN.
There is one part that talks about performance:
Also, if each loop sub returns a massive amount of data, this needs to be communicated back to the parent process, and again that could outweigh parallel performance gains unless the loop body does some heavy work too.
You are probably moving around complete copies of the variables in memory, pushing the machine to its limit if the MIB to poll and the number of machines are big enough.
Since what you are doing is an I/O-intensive task and not a CPU-bound task that could benefit from parallel CPU processing, I would reconsider the approach of launching so many (50!) workers for polling.
Run the program with $maxProcs down to 1 to 5 processes and see how it behaves. Do some profiling of your code, attaching Devel::NYTProf, to check where you are spending time and whether increasing the number of processes actually leads to better performance.
Reconsider using Parallel::Loops for this task. You may get better performance with use threads and a hash shared between the different threads (use threads::shared).
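A minimal sketch of that threads-based alternative, assuming the same @devices layout ([name, community, version] arrayrefs) and getSNMPIFFull() helper as in the question; untested, for illustration only:
use strict;
use warnings;
use threads;
use threads::shared;

my %snmp_results :shared;
my @exceptions   :shared;
my @devices;          # filled by load_devices(), as in the question
my $max_threads = 5;  # start low and measure before raising it

while (my @batch = splice(@devices, 0, $max_threads)) {
    my @workers = map {
        threads->create(sub {
            my ($name, $community, $snmp_ver) = @{ $_[0] };
            my $result = getSNMPIFFull($name, $community, $snmp_ver);
            unless (defined $result && $result ne "") {
                push @exceptions, "Unable to poll $name - $community - $snmp_ver";
                return;
            }
            my %cache = %{$result};
            # Nested structures stored in a shared hash must themselves be shared.
            $snmp_results{$name} = shared_clone(
                { map { $cache{$_}[0] => [ @{ $cache{$_} }[0..4] ] } keys %cache }
            );
        }, $_);
    } @batch;
    $_->join for @workers;
}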
Apologies if this could have been a comment; starting out on SO is difficult due to all the limitations that are in place :(
If you have already found a solution, it would be great if you could share your findings with us. I didn't know Parallel::Loops before, and I think I can put it to some use.

Should I re-use a single HTML::SimpleLinkExtor object for memory efficiency?

This may seem like a silly question, but I'm building an application where memory is a very limited resource, so I need to be as cautious about memory usage as I can. My question is: which of the following is more memory efficient?
while (<LINKS_FILE>) {
    my $extor = HTML::SimpleLinkExtor->new($resp->base); # $resp from above somewhere
    $extor->parse($_);
    my @links = $extor->links;
    for my $link (@links) { print "$link\n" }
}
or
my $extor = HTML::SimpleLinkExtor->new($resp->base); # $resp from above somewhere
while (<LINKS_FILE>) {
    $extor->parse($_);
    my @links = $extor->links;
    for my $link (@links) { print "$link\n" }
    $extor->clear_links;
}
So the first creates a new HTML::SimpleLinkExtor object every time, whereas the second just kind of resets the same one for reuse. It seems to me the second would be more memory efficient, but to be honest I don't really know how good Perl is about releasing memory back to the OS, or whether it will hold on to the memory of some of the HTML::SimpleLinkExtor objects even after they go out of scope. Thanks for the help!
I am not inclined to spend time profiling, but if I were in your situation, I would try HTML::LinkExtor first. If you provide a callback, it will not save the links it finds internally, reducing the footprint of your application. You can then decide whether to store the links, or maybe write them to an external file, to keep memory use to a minimum:
use HTML::LinkExtor;

my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %links) = @_;
    print "$tag @{[ %links ]}\n";
});
$parser->parse_file("index.html");

Caching & avoiding Cache Stampedes - multiple simultaneous calculations

We have a very expensive calculation that we'd like to cache. So we do something similar to:
my $result = $cache->get( $key );
unless ($result) {
    $result = calculate( $key );
    $cache->set( $key, $result, '10 minutes' );
}
return $result;
Now, during calculate($key), before we store the result in the cache, several other requests come in that also start running calculate($key), and system performance suffers because many processes are all calculating the same thing.
Idea: let's put a flag in the cache saying that a value is being calculated, so the other requests just wait for that one calculation to finish and then all use its result. Something like:
my $result = $cache->get( $key );
if ($result) {
    while ($result =~ /Wait, \d+ is running calculate../) {
        sleep 0.5; # core sleep() ignores the fraction; use Time::HiRes qw(sleep) for sub-second sleeps
        $result = $cache->get( $key );
    }
} else {
    $cache->set( $key, "Wait, $$ is running calculate()", '10 minutes' );
    $result = calculate( $key );
    $cache->set( $key, $result, '10 minutes' );
}
return $result;
Now that opens up a whole new can of worms. What if $$ dies before it sets the cache? What if, what if... All of them are solvable, but since there is nothing on CPAN that does this (and there is something on CPAN for everything), I start wondering:
Is there a better approach? Is there a particular reason e.g. Perl's Cache and Cache::Cache classes don't provide some mechanism like this? Is there a tried and true pattern I could use instead?
Ideal would be a CPAN module with a debian package already in squeeze or a eureka moment, where I see the error of my ways... :-)
EDIT: I have since learned that this is called a Cache stampede and have updated the question's title.
flock() it.
Since your worker processes are all on the same system, you can probably use good, old-fashioned file locking to serialize the expensive calculate()ions. As a bonus, this technique appears in several of the core docs.
use Fcntl qw(:DEFAULT :flock); # warning: this code not tested

use constant LOCKFILE => 'you/customize/this/please';

my $result = $cache->get( $key );
unless ($result) {
    # Get an exclusive lock
    my $lock;
    sysopen($lock, LOCKFILE, O_WRONLY|O_CREAT) or die;
    flock($lock, LOCK_EX) or die;

    # Did someone update the cache while we were waiting?
    $result = $cache->get( $key );
    unless ($result) {
        $result = calculate( $key );
        $cache->set( $key, $result, '10 minutes' );
    }
    # Exclusive lock released here as $lock goes out of scope
}
return $result;
Benefit: worker death will instantly release the $lock.
Risk: LOCK_EX can block forever, and that is a long time. Avoid SIGSTOPs, perhaps get comfortable with alarm().
Extension: if you don't want to serialize all calculate() calls, but merely all calls for the same $key or some set of keys, your workers can flock() /some/lockfile.$key_or_a_hash_of_the_key.
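A minimal sketch of that per-$key extension, with an alarm() guard on the blocking flock() as suggested above; the lock-file path and the 60-second timeout are made up for illustration, and the code is untested:
use Fcntl qw(:DEFAULT :flock);
use Digest::MD5 qw(md5_hex);

sub get_or_calculate {
    my ($key) = @_;
    my $result = $cache->get( $key );
    return $result if $result;

    # One lock file per key, so unrelated calculations don't serialize each other.
    my $lockfile = '/tmp/myapp-calc.' . md5_hex($key); # hypothetical path
    sysopen(my $lock, $lockfile, O_WRONLY|O_CREAT) or die "sysopen: $!";

    # Bound the wait instead of blocking forever.
    eval {
        local $SIG{ALRM} = sub { die "lock timeout\n" };
        alarm(60);
        flock($lock, LOCK_EX) or die "flock: $!";
        alarm(0);
    };
    die $@ if $@;

    # Did someone fill the cache while we were waiting?
    $result = $cache->get( $key );
    unless ($result) {
        $result = calculate( $key );
        $cache->set( $key, $result, '10 minutes' );
    }
    return $result; # lock released as $lock goes out of scope
}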
Use a lock? Or maybe that would be overkill? Or, if it is possible, precalculate the result offline and then use it online?
Although it may (or may not) be overkill for your use case, have you considered using a message queue for the processing? RabbitMQ seems to be a popular choice in the Perl community at the moment and it is supported through the AnyEvent::RabbitMQ module.
The basic strategy in this case would be to submit a request to the message queue whenever you need to calculate a new key. The queue could then be set to calculate only a single key at a time (in the order requested) if that's all you can reliably handle. Alternately, if you can safely compute multiple keys concurrently, the queue can also be used to consolidate multiple requests for the same key, computing it once and returning the result to all clients who requested that key.
Of course, this would add a bit of complexity and AnyEvent calls for a somewhat different programming style than you may be used to (I would offer an example, but I've never really gotten the hang of it myself), but it may offer sufficient gains in efficiency and reliability to make those costs worth your while.
I agree generally with pilcrow's approach above. I would add one thing to it: Investigate the use of the memoize() function to potentially speed up the calculate() operation in your code.
See http://perldoc.perl.org/Memoize.html for details
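For what it's worth, wiring Memoize in is a one-liner; just note that its cache lives inside a single process, so it complements rather than replaces the shared cache above:
use Memoize;
memoize('calculate');

my $first  = calculate($key); # computed and remembered
my $second = calculate($key); # same argument, returned from Memoize's in-process cache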