Keeping stream open with AnyEvent::Twitter::Stream - perl

I'm using AnyEvent::Twitter::Stream to write a Twitter bot. I'm new to event-based programming, so I'm relying heavily on the docs.
Since this is a bot, I want the stream to keep going, unsupervised, when there's an error (such as a broken pipe, or a timeout error).
If I don't include an error handler at all, the entire program dies on error. Similarly, if I use an error handler like what's in the sample docs:
on_error => sub {
    my $error = shift;
    warn "Error: $error";
    $done->send;
},
The program dies. If I remove the "$done->send;" line, the stream is interrupted and the program hangs.
I've looked at the (sparse) docs for AE::T::S, and for AnyEvent, but I'm not sure what I need to do to keep things going. The stream can give me 5,000 events a minute, and I can't lose this on random network hiccups.
Thanks.

Solution for anyone else using this:
I spoke to someone more knowledgeable about event-based stuff. Basically, each ::Stream object represents a single HTTP request to Twitter; when that request dies, that's the end of a stream. The on_error method is called at this point, so it's just giving you a place to say that the stream ended, for logging purposes or whatever. This method doesn't let you restart the stream.
If you want the bot to keep going, you have to create a new stream, either by having the OS restart the program, or by wrapping the entire body of the program in a while(1) {} loop.
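For anyone wanting a skeleton, here's a minimal sketch of that loop (the handler bodies and credentials are placeholders, and newer Twitter endpoints want the OAuth options instead of username/password):
#!/usr/bin/env perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Twitter::Stream;

while (1) {
    my $done = AnyEvent->condvar;

    # Each ::Stream object is one HTTP request; keep it in scope until
    # the condvar fires, then loop around and build a fresh one.
    my $stream = AnyEvent::Twitter::Stream->new(
        username => 'myuser',
        password => 'mypass',
        method   => 'sample',
        on_tweet => sub {
            my $tweet = shift;
            # ... handle the tweet ...
        },
        on_error => sub {
            warn "Stream error: $_[0], reconnecting\n";
            $done->send;    # leave the event loop so the outer while restarts
        },
        on_eof   => sub { $done->send },    # server closed the stream cleanly
    );

    $done->recv;    # block here until the stream ends one way or another
    sleep 2;        # small back-off so a hard failure doesn't spin the CPU
}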
Hope this helps someone else!

Related

How to request a replay of an already received FIX message

I have an application that could potentially throw an error on receiving an ExecutionReport (35=8) message.
This error is thrown at the application level and not at the fix engine level.
The fix engine records the message as seen and therefore will not send a ResendRequest (35=2). However, the application has not processed it, and I would like to manually trigger a re-processing of the missed ExecutionReport.
Forcing a ResendRequest (35=2) does not work as it requires modifying the expected next sequence number.
I was wondering if FIX supports replaying messages without requiring a sequence number reset?
When processing an execution report, you should not throw any exceptions and expect the FIX library to handle them. You either process the report or you have a system failure (i.e., call abort()). Therefore, if your code that handles the execution report throws an exception and you know how to handle it, then catch it in that very same function, eliminate the cause of the problem, and try processing again. For example (pseudo-code):
// This function is called by the FIX library. No exceptions must be thrown,
// because the FIX library has no idea what to do with them.
void on_exec_report(const fix::msg &msg)
{
    for (;;) {
        try {
            // Handle the execution report however you want.
            handle_exec_report(msg);
            return;  // success -- we're done with this report
        } catch (const try_again_exception &) {
            // Oh, some resource was temporarily unavailable? Try again!
            continue;
        } catch (const std::exception &) {
            // This should never happen, but it did. Call 911.
            abort();
        }
    }
}
Of course, it is possible to make the FIX library issue a re-transmission request and pass you that message again when an exception is thrown. However, it does not make any sense at all: what is the point of asking the sender (over the network, using TCP/IP) to re-send a message that you already have (up your stack :)) and just need to process? Even if it did, what's the guarantee the failure won't happen again? Re-transmission in this case not only sounds wrong logically; the other side (i.e. the exchange) may call you up and ask you to stop doing this crap, because unnecessary re-transmits put too much load on their server. (In real life, TCP/IP does not lose messages, and the FIX sequence sync process happens only when connecting, unless of course some non-reliable transport is used, which is theoretically possible but doesn't happen in practice.)
When aborting, however, it is the FIX library's responsibility not to increment the RX sequence number unless it knows for sure that the user has processed the message. That way, the next time the application starts, it actually performs synchronization and receives the missing messages. If QuickFIX is not doing this, then you need to either fix it, take care of it manually (i.e. edit the file where it stores the RX/TX sequence numbers), or use some other library that handles this correctly.
This is the wrong thing to do.
A ResendRequest tells the other side that there was some transmission error. In your case, there wasn't, so you shouldn't do that. You're misusing the protocol to cover your app's mistakes. This is wrong. (Also, as Vlad Lazarenko points out in his answer, if they did resend it, what's to say you won't have the error again?)
If an error occurs in your message handler, then that's your problem, and you need to find it and fix it, or alternately you need to catch your own exception and handle it accordingly.
(Based on past questions, I bet you are storing ExecutionReports to a DB store or something, and you want to use Resends to compensate for DB storage exceptions. Bad idea. You need to come up with your own redundancy solution.)

Boost Async UDP Client

I've read through the boost::asio documentation (which appears silent on async clients), and looked through here, but can't seem to see the forest for the trees.
I've got a simulation that has a main loop that looks like this:
for(;;)
{
    a = do_stuff1();
    do_stuff2(a);
}
Easy enough.
What I'd like to do, is modify it so that I have:
for(;;)
{
    a = do_stuff1();
    check_for_new_received_udp_data(&b);
    modify_a_with_data_from_b(a,b);
    do_stuff2(a);
}
Where I have the following requirements:
I cannot lose data just because I wasn't actively listening. I.e., I don't want to lose packets because I was in do_stuff2() instead of check_for_new_received_udp_data() at the time the server sent the packet.
I can't have check_for_new_received_udp_data() block for more than about 2 ms, since the main for loop needs to execute at 60 Hz.
The server will be running elsewhere and has a completely erratic schedule. Sometimes there will be no data, other times I may get the same packet repeatedly.
I've played with the async UDP, but that requires calling io_service.run(), which blocks indefinitely, so that doesn't really help me.
I thought about timing out a blocking socket read, but it seems you have to cheat and get out of the boost calls to do that, so that's a non-starter.
Is the answer going to involve threading? Either way, could someone kindly point me to an example that is somewhat similar? Surely this has been done before.
To avoid blocking in io_service::run(), you can use io_service::poll_one().
Regarding losing UDP packets, I think you are out of luck. UDP does not guarantee delivery, and any part of the network may decide to drop UDP packets if there is too much traffic. If you need to ensure delivery, you need to either implement some sort of flow control or just use TCP.
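For illustration, here's a rough sketch of the poll-based approach under the question's constraints (the port, buffer size, and commented-out do_stuff calls are assumptions; io_service::poll() runs every handler that's ready, while poll_one() would run at most one):
#include <boost/asio.hpp>
#include <cstddef>

boost::asio::io_service io;
boost::asio::ip::udp::socket sock(
    io, boost::asio::ip::udp::endpoint(boost::asio::ip::udp::v4(), 9000));
boost::asio::ip::udp::endpoint sender;
char buf[1500];

void start_read()
{
    // Re-armed after every datagram, so packets queue in the socket's
    // receive buffer instead of being lost while we're in do_stuff2().
    sock.async_receive_from(
        boost::asio::buffer(buf), sender,
        [](const boost::system::error_code &ec, std::size_t n) {
            if (!ec) {
                // copy the n received bytes somewhere the main loop can see
            }
            start_read();
        });
}

int main()
{
    start_read();
    for (;;) {
        // a = do_stuff1();
        io.poll();  // dispatch any completed reads, then return immediately
        // modify_a_with_data_from_b(a, b);
        // do_stuff2(a);
    }
}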
I think your problem is that you're still thinking synchronously. You need to think asynchronously.
Async read on UDP socket - will call handler when data arrives.
Within that handler do your processing on the incoming data. Keep in mind that while you're processing, if you have a single thread, nothing else dispatches. This can be perfectly OK (UDP messages will still be queued in the network stack...).
As a result of this you could be starting other asynchronous operations.
If you need to do work in parallel that is essentially unrelated or offline, that will involve threads. Create a thread that calls io_service.run().
If you need to do periodic work in an asynch framework, use timers.
In your particular example we can rearrange things like this (pseudo-code):
read_handler( ... )
{
    modify_a_with_data_from_b(a, b);
    do_stuff2(a);
    a = do_stuff1();
    udp->async_read( ..., read_handler );
}
periodic_handler( ... )
{
    // do periodic stuff
    timer.async_wait( ..., periodic_handler );
}
main()
{
    ...
    a = do_stuff1();
    udp->async_read( ..., read_handler );
    timer.async_wait( ..., periodic_handler );
    io_service.run();
}
Now I'm sure there are other requirements that aren't evident from your question, but you'll need to figure out an asynchronous answer to them; this is just an idea. Also ask yourself whether you really need an asynchronous framework, or whether you can just use the synchronous socket APIs.

Perl, how to fetch data from urls in parallel?

I need to fetch some data from many web data providers, who do not expose any service, so I have to write something like this, using for example WWW::Mechanize:
use WWW::Mechanize;
my @urls = ('http://www.first.data.provider.com', 'http://www.second.data.provider.com', 'http://www.third.data.provider.com');
my %results;
foreach my $url (@urls) {
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    $mech->form_number(1);
    $mech->set_fields('user' => 'myuser', pass => 'mypass');
    my $resp = $mech->submit();
    $results{$url} = parse($resp->content());
}
consume(%results);
Is there some (possibly simple ;-) way to fetch data into a common %results variable simultaneously, i.e. in parallel, from all the providers?
threads are to be avoided in Perl. use threads is mostly for emulating UNIX-style fork on Windows; beyond that, it's pointless. (If you care, the implementation makes this fact very clear. In perl, the interpreter is a PerlInterpreter object. The way threads works is by making a bunch of threads, and then creating a brand-new PerlInterpreter object in each thread. Threads share absolutely nothing, even less than child processes do; fork gets you copy-on-write, but with threads, all the copying is done in Perl space! Slow!)
If you'd like to do many things concurrently in the same process, the way to do that in Perl is with an event loop, like EV, Event, or POE, or by using Coro. (You can also write your code in terms of the AnyEvent API, which will let you use any event loop. This is what I prefer.) The difference between the two approaches is how you write your code.
AnyEvent (and EV, Event, POE, and so on) forces you to write your code in a callback-oriented style. Instead of control flowing from top to bottom, control is in a continuation-passing style. Functions don't return values, they call other functions with their results. This allows you to run many IO operations in parallel -- when a given IO operation has yielded results, your function to handle those results will be called. When another IO operation is complete, that function will be called. And so on.
The disadvantage of this approach is that you have to rewrite your code. So there's a module called Coro that gives Perl real (user-space) threads that will let you write your code top-to-bottom, but still be non-blocking. (The disadvantage of this is that it heavily modifies Perl's internals. But it seems to work pretty well.)
So, since we don't want to rewrite WWW::Mechanize tonight, we're going to use Coro. Coro comes with a module called Coro::LWP that will make all calls to LWP non-blocking. It will block the current thread ("coroutine", in Coro lingo), but it won't block any other threads. That means you can make a ton of requests all at once, and process the results as they become available. And Coro will scale better than your network connection; each coroutine uses just a few k of memory, so it's easy to have tens of thousands of them around.
With that in mind, let's see some code. Here's a program that starts three HTTP requests in parallel, and prints the length of each response. It's similar to what you're doing, minus the actual processing; but you can just put your code in where we calculate the length and it will work the same.
We'll start off with the usual Perl script boilerplate:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';   # the examples below use say()
Then we'll load the Coro-specific modules:
use Coro;
use Coro::LWP;
use EV;
Coro uses an event loop behind the scenes; it will pick one for you if you want, but we'll just specify EV explicitly. It's the best event loop.
Then we'll load the modules we need for our work, which is just:
use WWW::Mechanize;
Now we're ready to write our program. First, we need a list of URLs:
my @urls = (
    'http://www.google.com/',
    'http://www.jrock.us/',
    'http://stackoverflow.com/',
);
Then we need a function to spawn a thread and do our work. To make a new thread on Coro, you call async like async { body; of the thread; goes here }. This will create a thread, start it, and continue with the rest of the program.
sub start_thread($) {
    my $url = shift;
    return async {
        say "Starting $url";
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        printf "Done with $url, %d bytes\n", length $mech->content;
    };
}
So here's the meat of our program. We just put our normal LWP program inside async, and it will be magically non-blocking. get blocks, but the other coroutines will run while waiting for it to get the data from the network.
Now we just need to start the threads:
start_thread $_ for @urls;
And finally, we want to start handling events:
EV::loop;
And that's it. When you run this, you'll see some output like:
Starting http://www.google.com/
Starting http://www.jrock.us/
Starting http://stackoverflow.com/
Done with http://www.jrock.us/, 5456 bytes
Done with http://www.google.com/, 9802 bytes
Done with http://stackoverflow.com/, 194555 bytes
As you can see, the requests are made in parallel, and you didn't have
to resort to threads!
Update
You mentioned in your original post that you want to limit the number of HTTP requests that run in parallel. One way to do that is with a semaphore, like Coro::Semaphore in Coro.
A semaphore is like a counter. When you want to use the resource that a semaphore protects, you "down" the semaphore. This decrements the counter and continues running your program. But if the counter is at zero when you try to down the semaphore, your thread/coroutine will go to sleep until it is non-zero. When the count goes up again, your thread will wake up, down the semaphore, and continue. Finally, when you're done using the resource that the semaphore protects, you "up" the semaphore and give other threads the chance to run.
This lets you control access to a shared resource, like "making HTTP requests".
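As a standalone illustration of down/up, separate from our HTTP program (the counts and the coroutine bodies here are arbitrary):
use strict;
use warnings;
use Coro;
use Coro::Semaphore;

my $slots = Coro::Semaphore->new(2);   # allow two holders at a time

my @coros = map {
    my $id = $_;
    async {
        $slots->down;                  # blocks while the counter is zero
        print "coroutine $id holds a slot\n";
        cede;                          # pretend to work; let others run
        $slots->up;                    # release, waking a sleeper if any
    };
} 1 .. 4;

$_->join for @coros;                   # wait for all four to finish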
All you need to do is create a semaphore that your HTTP request threads will share:
my $sem = Coro::Semaphore->new(5);
The 5 means "let us call 'down' 5 times before we block", or, in other words, "let there be 5 concurrent HTTP requests".
Before we add any code, let's talk about what can go wrong. Something bad that could happen is a thread "down"-ing the semaphore, but never "up"-ing it when it's done. Then nothing can ever use that resource, and your program will probably end up doing nothing. There are lots of ways this could happen. If you wrote some code like $sem->down; do something; $sem->up, you might feel safe, but what if "do something" throws an exception? Then the semaphore will be left down, and that's bad.
Fortunately, Perl makes it easy to have scope Guard objects that will automatically run code when the variable holding the object goes out of scope. We can make that code be $sem->up, and then we'll never have to worry about holding a resource when we don't intend to.
Coro::Semaphore integrates the concept of guards, meaning you can say my $guard = $sem->guard, and that will automatically down the semaphore and up it when control flows away from the scope where you called guard.
With that in mind, all we have to do to limit the number of parallel requests is guard the semaphore at the top of our HTTP-using coroutines:
async {
    say "Waiting for semaphore";
    my $guard = $sem->guard;
    say "Starting";
    ...;
    return result;
}
Addressing the comments:
If you don't want your program to live forever, there are a few options. One is to run the event loop in another thread, and then join on each worker thread. This lets you pass results from the thread to the main program, too:
async { EV::loop };

# start all threads
my @running = map { start_thread $_ } @urls;

# wait for each one to return
my @results = map { $_->join } @running;

for my $result (@results) {
    say $result->[0], ': ', $result->[1];
}
Your threads can return results like:
sub start_thread($) {
    my $url = shift;
    return async {
        ...;
        return [$url, length $mech->content];
    };
}
This is one way to collect all your results in a data structure. If you don't want to return things, remember that all the coroutines share state. So you can put:
my %results;
at the top of your program, and have each coroutine update the results:
async {
    ...;
    $results{$url} = 'whatever';
};
When all the coroutines are done running, your hash will be filled with the results. You'll have to join each coroutine to know when the answer is ready, though.
Finally, if you are doing this as part of a web service, you should use a coroutine-aware web server like Corona. This will run each HTTP request in a coroutine, allowing you to handle multiple HTTP requests in parallel, in addition to being able to send HTTP requests in parallel. This will make very good use of memory, CPU, and network resources, and will be pretty easy to maintain!
(You can basically cut-n-paste our program from above into the coroutine that handles the HTTP request; it's fine to create new coroutines and join inside a coroutine.)
Looks like ParallelUserAgent is what you're looking for.
Well, you could create threads to do it--specifically see perldoc perlthrtut and Thread::Queue. So, it might look something like this.
use WWW::Mechanize;
use threads;
use threads::shared;
use Thread::Queue;
my @urls = ( # whatever
);
my %results :shared;
my $queue = Thread::Queue->new();
foreach (@urls)
{
    $queue->enqueue($_);
}
my @threads = ();
my $num_threads = 16; # Or whatever...a pre-specified number of threads.
foreach (1..$num_threads)
{
    push @threads, threads->create(\&mechanize);
}
foreach (@threads)
{
    $queue->enqueue(undef);
}
foreach (@threads)
{
    $_->join();
}
consume(\%results);
sub mechanize
{
    while (my $url = $queue->dequeue)
    {
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        $mech->form_number(1);
        $mech->set_fields('user' => 'myuser', pass => 'mypass');
        my $resp = $mech->submit();
        $results{$url} = parse($resp->content());
    }
}
Note that since you're storing your results in a hash (instead of writing stuff to a file), you shouldn't need any kind of locking unless there's a danger of overwriting values, in which case you'll want to lock %results by replacing
$results{$url} = parse($resp->content());
with
{
    lock(%results);
    $results{$url} = parse($resp->content());
}
Try https://metacpan.org/module/Parallel::Iterator -- saw a very good presentation about it last week, and one of the examples was parallel retrieval of URLs -- it's also covered in the pod example. It's simpler than using threads manually (although it uses fork underneath).
As far as I can tell, you'd still be able to use WWW::Mechanize, but avoid messing about with memory sharing between threads. It's a higher-level model for this task, and might be a little simpler, leaving the main logic of @Jack Maney's mechanize routine intact.
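For concreteness, a sketch of that approach built from the question's loop (the worker count is an arbitrary choice, and the URLs and form fields are the question's placeholders; call your parse() where noted):
use strict;
use warnings;
use Parallel::Iterator qw( iterate_as_array );
use WWW::Mechanize;

my @urls = (
    'http://www.first.data.provider.com',
    'http://www.second.data.provider.com',
    'http://www.third.data.provider.com',
);

# Each worker runs in a forked child; results come back ordered by index.
my @pages = iterate_as_array(
    { workers => 8 },
    sub {
        my ( $index, $url ) = @_;
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        $mech->form_number(1);
        $mech->set_fields( 'user' => 'myuser', 'pass' => 'mypass' );
        return $mech->submit()->content();   # or: parse(...) right here
    },
    \@urls,
);

my %results;
@results{@urls} = @pages;   # rebuild the url => content map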

Why doesn't Catalyst die only once in a chained action?

Consider the following actions:
sub get_stuff :Chained('/') :PathPart('stuff') :CaptureArgs(1) {
    my ($self, $c, $stuff_id) = @_;
    die "ARRRRRRGGGG";
}
sub view_stuff :Chained('get_stuff') :PathPart('') :Args(0) {
    die "DO'H";
}
Now if you request '/stuff/314/', you'll get
Error: ARRRRRRGGGG in get_stuff at ...
Error: DO'H in view_stuff at ...
Is there a reason why it doesn't just throw the error at the first failing link in the chain?
Why is Catalyst trying to carry on with the chain?
I'm not sure of the answer as to 'why' but I presume it was done that way to give flexibility.
You should probably catch the error with eval (or preferably something like Try::Tiny or TryCatch) and call $c->detach if you want to stop processing actions.
Catalyst::Plugin::MortalForward
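For instance, a minimal sketch of the catch-and-detach pattern with Try::Tiny (the model name here is hypothetical):
use Try::Tiny;

sub get_stuff :Chained('/') :PathPart('stuff') :CaptureArgs(1) {
    my ($self, $c, $stuff_id) = @_;
    try {
        # anything that dies in here is caught below
        $c->stash( stuff => $c->model('DB::Stuff')->find($stuff_id) );
    }
    catch {
        $c->log->error("get_stuff failed: $_");
        $c->response->status(404);
        $c->detach;    # stop here; view_stuff never runs
    };
}
Note that $c->detach is called inside the catch block, not the try block; detach works by throwing a control exception, which Try::Tiny would otherwise swallow.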
The chosen answer may be outdated. Catalyst can die early when the application's config key abort_chain_on_error_fix is set.
__PACKAGE__->config(abort_chain_on_error_fix => 1);
See the documentation about Catalyst configuration. It also states that this behaviour may become standard in the future.
Cucabit is right, detach is the way to go. As to the why: normally in a perl process, 'die' stops the process. In Catalyst you don't want that. If, for instance, you run your Catalyst app under FastCGI, you spawn one or more standalone processes that handle multiple requests. If the first request killed the process itself, the web server would have to respawn the FastCGI process in order to handle the next call. I think that's why Catalyst catches 'die' (it's used a lot as the default, as in 'do_something() or die $!') and turns it into an Exception.
You could also end the process with 'exit', I guess, but you end up with the same problem as above: killing the process.
What you can of course do is create your own 'die' method that logs the error passed with the default log object and then calls detach or something. It should also be possible to redefine the Catalyst exception handling, as anything is possible with Catalyst :)

Scala actors: receive vs react

Let me first say that I have quite a lot of Java experience, but have only recently become interested in functional languages. Recently I've started looking at Scala, which seems like a very nice language.
However, I've been reading about Scala's Actor framework in Programming in Scala, and there's one thing I don't understand. In chapter 30.4 it says that using react instead of receive makes it possible to re-use threads, which is good for performance, since threads are expensive in the JVM.
Does this mean that, as long as I remember to call react instead of receive, I can start as many Actors as I like? Before discovering Scala, I've been playing with Erlang, and the author of Programming Erlang boasts about spawning over 200,000 processes without breaking a sweat. I'd hate to do that with Java threads. What kind of limits am I looking at in Scala as compared to Erlang (and Java)?
Also, how does this thread re-use work in Scala? Let's assume, for simplicity, that I have only one thread. Will all the actors that I start run sequentially in this thread, or will some sort of task-switching take place? For example, if I start two actors that ping-pong messages to each other, will I risk deadlock if they're started in the same thread?
According to Programming in Scala, writing actors to use react is more difficult than with receive. This sounds plausible, since react doesn't return. However, the book goes on to show how you can put a react inside a loop using Actor.loop. As a result, you get
loop {
  react {
    ...
  }
}
which, to me, seems pretty similar to
while (true) {
  receive {
    ...
  }
}
which is used earlier in the book. Still, the book says that "in practice, programs will need at least a few receive's". So what am I missing here? What can receive do that react cannot, besides return? And why do I care?
Finally, coming to the core of what I don't understand: the book keeps mentioning how using react makes it possible to discard the call stack to re-use the thread. How does that work? Why is it necessary to discard the call stack? And why can the call stack be discarded when a function terminates by throwing an exception (react), but not when it terminates by returning (receive)?
I have the impression that Programming in Scala has been glossing over some of the key issues here, which is a shame, because otherwise it's a truly excellent book.
First, each actor waiting on receive is occupying a thread. If it never receives anything, that thread will never do anything. An actor on react does not occupy any thread until it receives something. Once it receives something, a thread gets allocated to it, and it is initialized in it.
Now, the initialization part is important. A receiving thread is expected to return something; a reacting thread is not. So the previous stack state at the end of the last react can be, and is, wholly discarded. Not needing to either save or restore the stack state makes the thread faster to start.
There are various performance reasons why you might want one or the other. As you know, having too many threads in Java is not a good idea. On the other hand, because you have to attach an actor to a thread before it can react, it is faster to receive a message than react to it. So if you have actors that receive many messages but do very little with them, the additional delay of react might make it too slow for your purposes.
The answer is "yes" - if your actors are not blocking on anything in your code and you are using react, then you can run your "concurrent" program within a single thread (try setting the system property actors.maxPoolSize to find out).
One of the more obvious reasons why it is necessary to discard the call stack is that otherwise the loop method would end in a StackOverflowError. As it is, the framework rather cleverly ends a react by throwing a SuspendActorException, which is caught by the looping code which then runs the react again via the andThen method.
Have a look at the mkBody method in Actor and then the seq method to see how the loop reschedules itself - terribly clever stuff!
Those statements about "discarding the stack" confused me for a while too, and I think I get it now; this is my understanding. In the case of receive, there is a dedicated thread blocking on the message (using object.wait() on a monitor), which means the complete thread stack is available and ready to continue from the point of waiting on a message.
For example if you had the following code
val a = 10
while (!done) {
  receive {
    case msg => println("MESSAGE RECEIVED: " + msg)
  }
  println("after receive and printing a " + a)
}
the thread would wait in the receive call until a message is received, and would then continue on and print "after receive and printing a 10", using the value 10 that was in the stack frame from before the thread blocked.
In the case of react there is no such dedicated thread; the whole body of the react method is captured as a closure and is executed by some arbitrary thread when the corresponding actor receives a message. This means only those statements that can be captured in a closure will be executed, and that's where the return type of Nothing comes into play. Consider the following code
val a = 10
while (!done) {
  react {
    case msg => println("MESSAGE RECEIVED: " + msg)
  }
  println("after react and printing a " + a)
}
If react had a return type of void (Unit), it would mean that it was legal to have statements after the react call (in the example, the println statement that prints "after react and printing a 10"), but in reality those would never get executed, as only the body of the react method is captured and sequenced for execution later (on the arrival of a message). Since the contract of react has the return type Nothing, there cannot be any statements following react, and therefore there is no reason to maintain the stack. In the example above, the variable a would not have to be maintained, as the statements after the react call are not executed at all. Note that all the variables needed by the body of react are already captured in the closure, so it can execute just fine.
The Java actor framework Kilim actually does the stack maintenance, saving the stack and unrolling it when a react gets a message.
Just to have it here:
Event-Based Programming without Inversion of Control
These papers are linked from the scala api for Actor and provide the theoretical framework for the actor implementation. This includes why react may never return.
I haven't done any major work with Scala/Akka, however I understand that there is a very significant difference in the way actors are scheduled.
Akka is just a smart threadpool which is time-slicing the execution of actors. Each time slice is one message executed to completion by an actor, unlike in Erlang, where scheduling could be per instruction.
This leads me to think that react is better, as it hints to the current thread to consider other actors for scheduling, whereas receive "might" engage the current thread to continue executing other messages for the same actor.