perl socket client won't recognize end of message

I am running a simple perl socket client that queries a server, and then tries to close the socket once a keyword or phrase is encountered.
...
local $\ = "\x{0d}";
while($line = <$sock>){
print $line."\n";
last if ($line =~ /C6/);
}
...
close($sock);
I'd be happy to terminate on either the (0x0d) or the "C6" string - they both terminate the message. I'm monitoring it with Wireshark, and both triggers occur at the end of the message, yet I can't break out of the while loop, either with a break or last, nor does $line ever print.
Ideas? TIA

You don't exit when you receive C6 without receiving a Carriage Return (or EOF) because your code always waits for a Carriage Return (or EOF). Fix:
# sysread returns as soon as data is available, so BLOCK_SIZE is just a maximum.
use constant BLOCK_SIZE => 4*1024*1024;
my $buf = '';
while (1) {
my $rv = sysread($sock, $buf, BLOCK_SIZE, length($buf));
die($!) if !defined($rv);
last if !$rv;
process_message($1)
while $buf =~ s/^( (?: [^C\x0D] | C (?=[^6]) )*+ (?: C6 | \x0D ) )//xs;
}
die("Premature EOF") if length($buf);
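To see what the extraction loop does, here is a small standalone sketch of the same regex applied to a pre-filled buffer (the sample bytes are invented, and pushing onto an array stands in for process_message):

```perl
use strict;
use warnings;

# Hypothetical buffer as it might look after a few sysreads: two
# complete messages (one CR-terminated, one "C6"-terminated)
# followed by an incomplete tail.
my $buf = "hello\x0Dworld C6part";

my @messages;
while ($buf =~ s/^( (?: [^C\x0D] | C (?=[^6]) )*+ (?: C6 | \x0D ) )//xs) {
    push @messages, $1;   # stand-in for process_message($1)
}

# @messages is ("hello\x0D", "world C6"); "part" stays in $buf
# until more data arrives.
```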

I think the root of your problem here is that you've got $\ set, which is the output record separator, rather than $/ which is the input record separator.
So your while is waiting for a \n to occur before handing $line onto the rest of the loop.
But failing that, there's also a question of buffering and autoflushing on your socket.
And ... when you say you're monitoring with wireshark, how sure are you that those values are part of the packet content payload rather than part of the packet? Do you actually get \n sent from the server as part of the packet at any point?
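Putting that together, here is a minimal sketch of the read loop with the record separator set correctly. An in-memory handle stands in for the real socket, and the message bytes are invented; it assumes the server terminates each message with a carriage return:

```perl
use strict;
use warnings;

# Simulated server output: CR-terminated messages, the second one
# carrying the "C6" keyword that should end the session.
my $wire = "STATUS OK\x0DRESULT C6\x0Dnot read";
open my $sock, '<', \$wire or die $!;   # stands in for the real socket

local $/ = "\x0D";                      # input record separator ($/), not $\
my @seen;
while (my $line = <$sock>) {
    chomp $line;                        # strips the trailing \x0D
    push @seen, $line;
    last if $line =~ /C6/;              # leave the loop on the keyword
}
close $sock;
```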

Related

Why is my last line always output twice?

I have a uniprot document with a protein sequence as well as some metadata. I need to use perl to match the sequence and print it out but for some reason the last line always comes out two times. The code I wrote is here
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m /^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq.=$1; but it printed out the sequence 4.5 times. I'm sure I have made a mistake here but am not sure what. Here is the input file https://www.uniprot.org/uniprot/P30988.txt
Here is your code reformatted and extra whitespace added between operators to make it clearer what scope the statements are running in.
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
}
print $seq;
}
The placement of the print command means that $seq will be printed for every line from the input file -- even those that don't match the regex.
I suspect you want this
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
# only print $seq for lines that match with /^\s+(\D+)/
# Also - added a newline to make it easier to debug
print $seq . "\n";
}
}
When I run that I get this
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
You can simplify this a bit:
while (<IN>) {
next unless m/^\s/;
s/\s+//g;
print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different from accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.
You don't need to get a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable, and the scoping issue disappears: with no if block, there is no scope for $seq to leak out of, so the extra output can't happen.
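For reference, here is the simplified loop as a self-contained script. An in-memory handle with an abbreviated, made-up record stands in for P30988.txt, and the sequence is accumulated into a variable so it can be printed on one line:

```perl
use strict;
use warnings;

# Abbreviated stand-in for the UniProt flat file.
my $uniprot = <<'END';
ID   CALCR_HUMAN
SQ   SEQUENCE   474 AA;
     MRFTF TSRCL
     ALFLL LNHPT
//
END

open my $in, '<', \$uniprot or die $!;
my $seq = '';
while (<$in>) {
    next unless m/^\s/;   # sequence lines are the ones starting with whitespace
    s/\s+//g;             # strip all whitespace, including the newline
    $seq .= $_;
}
print "$seq\n";
```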

Read perl file handle with $INPUT_RECORD_SEPARATOR as a regex

I'm looking for a way to read from a file handle line by line (and then execute a function on each line) with the following twist: what I want to treat as a "line" shall be terminated by varying characters, not just a single character that I define as $/. I know that $INPUT_RECORD_SEPARATOR ($/) does not support regular expressions or a list of characters to be treated as line terminators, and this is where my problem lies.
My file handle comes from stdout of a process. Thus, I cannot seek inside the file handle and the full content is not available immediately but is produced bit by bit as the process is executed. I want to be able to attach things like a timestamp to each "line" the process produces using a function that I called handler in my examples. Each line should be handled as soon as it gets produced by the program.
Unfortunately, I can only come up with a way that either executes the handler function immediately but seems horribly inefficient or a way that uses a buffer but will only lead to "grouped" calls of the handler function and thus, for example, produce wrong timestamps.
In fact, in my specific case, my regex would even be very simple and just read /\n|\r/. So for this particular problem I don't even need full regex support but just the possibility to treat more than one character as the line terminator. But $/ doesn't support this.
Is there an efficient way to solve this problem in Perl?
Here is some quick pseudo-perl code to demonstrate my two approaches:
read the input file handle byte-by-byte
This would look like this:
my $acc = "";
while (read($fd, my $b, 1)) {
$acc .= $b;
if ($acc =~ /someregex$/) {
handler($acc);
$acc = "";
}
}
The advantage here is that handler gets dispatched immediately once enough bytes have been read. The disadvantage is that we do a string append and a regex check for every single byte we read from $fd.
read the input file handle with blocks of X-byte at a time
This would look like this:
my $acc = "";
while (read($fd, my $b, $bufsize)) {
if ($b =~ /someregex/) {
my @parts = split /someregex/, $b;
# for brevity let's assume we always get more than 2 parts...
my $first = shift @parts;
handler($acc . $first);
my $last = pop @parts;
foreach my $part (@parts) {
handler($part);
}
$acc = $last;
}
}
The advantage here is that we are more efficient, as we only check the regex every $bufsize bytes. The disadvantage is that the execution of handler has to wait until $bufsize bytes have been read.
Setting $INPUT_RECORD_SEPARATOR to a regex wouldn't help, because Perl's readline uses buffered IO, too. The trick is to use your second approach but with unbuffered sysread instead of read. If you sysread from a pipe, the call will return as soon as data is available, even if the whole buffer couldn't be filled (at least on Unix).
The suggestion by nwellnhof allowed me to implement a solution to this problem:
my $acc = "";
while (1) {
my $ret = sysread($fh, my $buf, 1000);
die "sysread failed: $!" if !defined($ret);
last if $ret == 0;
# we split with a capturing group so that we also retain which line
# terminator was used
# a negative limit is used to also produce trailing empty fields if
# required
my @parts = split /(\r|\n)/, $buf, -1;
my $numparts = scalar @parts;
if ($numparts == 1) {
# line terminator was not found
$acc .= $buf;
} elsif ($numparts >= 3) {
# first match needs special treatment as it needs to be
# concatenated with $acc
my $first = shift @parts;
my $term = shift @parts;
handler($acc . $first . $term);
my $last = pop @parts;
for (my $i = 0; $i < $numparts - 3; $i+=2) {
handler($parts[$i] . $parts[$i+1]);
}
# the last part is put into the accumulator. This might
# just be the empty string if $buf ended in a line
# terminator
$acc = $last;
}
}
# if the output didn't end with a linebreak, handle the rest
if ($acc ne "") {
handler($acc);
}
My tests show that sysread will indeed return before having read 1000 characters if there is a pause in the input stream. The code above takes care to concatenate consecutive full 1000-byte reads correctly, and to split reads that are shorter or that contain multiple terminators.
Please shout if you see any bug in above code.

Print pipe per character instead of newline

In Perl I do like this now:
while (<$pipe>)
{
print $_;
}
But this just gives me the output linewise. How can I make it print, let's say, each character, or split by \r instead of \n? The pipe only feeds me data on newline.
(I want to print the output from another process, and that process is using \r to print its progress; for me this just ends up as a single line reading 100%.)
Perl has a concept of an “input record separator” $/ which is usually set to separate lines. You can read the full documentation here. Whenever you read a line/record from a filehandle, data is read until the end of file, or until the current string inside the $/ variable has been read.
For example: Given the input bytes aabaa, and $/ = "b", then my @records = <$fh> would produce ('aab', 'aa'). Note that the separator is always included, but can be removed with chomp (regardless of what the separator has been set to).
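That example can be run directly against an in-memory file handle:

```perl
use strict;
use warnings;

my $data = "aabaa";
open my $fh, '<', \$data or die $!;   # in-memory file handle

local $/ = "b";            # records now end at "b"
my @raw = <$fh>;           # ("aab", "aa") - separator included

my @records = @raw;
chomp @records;            # ("aa", "aa") - chomp removes the "b"
```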
When reading from a file, the $/ has to be set before the lines are read, like so:
local $/ = "\r"; # "local" avoids overriding this value everywhere
while(my $line = <$pipe>) {
chomp $line;
...
}
There are a few special values for $/:
The empty string $/ = '' treats a sequence of two or more consecutive \ns as the separator (“paragraph mode”).
If set to a reference to an integer, that many characters are read at a time: $/ = \42 reads 42-character records. In this case, one would rather use the read function.
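A quick sketch of the fixed-length case, using in-memory data and a record size of 3 for brevity:

```perl
use strict;
use warnings;

my $data = "12345678";
open my $fh, '<', \$data or die $!;

local $/ = \3;             # reference to an integer: 3-byte records
my @chunks = <$fh>;        # ("123", "456", "78") - last record may be short
```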

Is it ever safe to combine select(2) and buffered IO for file handles?

I am using IO::Select to keep track of a variable number of file handles for reading. Documentation I've come across strongly suggests not to combine the select statement with <> (readline) for reading from the file handles.
My situation:
I will only ever use each file handle once, i.e. when the select offers me the file handle, it will be completely used and then removed from the select. I will be receiving a hash and a variable number of files. I do not mind if this blocks for a time.
For more context, I am a client sending information to be processed by my servers. Each file handle is a different server I'm talking to. Once the server is finished, a hash result will be sent back to me from each one. Inside that hash is a number indicating the number of files to follow.
I wish to use readline in order to integrate with existing project code for transferring Perl objects and files.
Sample code:
my $read_set = IO::Select->new;
my $count = @agents_to_run; #array comes as an argument
for my $agent ( @agents_to_run ) {
( $sock, my $peerhost, my $peerport )
= server( $config_settings{ $agent }->{ 'Host' },
$config_settings{ $agent }->{ 'Port' } );
$read_set->add( $sock );
}
while ( $count > 0) {
my @rh_set = $read_set->can_read();
for my $rh ( @rh_set ) {
my %results = <$rh>;
my $num_files = $results{'numFiles'};
my #files = ();
for (my $i = 0; $i < $num_files; $i++) {
$files[$i] = <$rh>;
}
#process results, close fh, decrement count, etc
}
}
Using readline (aka <>) is quite wrong for two reasons: It's buffered, and it's blocking.
Buffering is bad
More precisely, buffering using buffers that cannot be inspected is bad.
The system can do all the buffering it wants, since you can peek into its buffers using select.
Perl's IO system cannot be allowed to do any buffering because you cannot peek into its buffers.
Let's look at an example of what can happen using readline in a select loop.
"abc\ndef\n" arrives on the handle.
select notifies you that there is data to read.
readline will try to read a chunk from the handle.
"abc\ndef\n" will be placed in Perl's buffer for the handle.
readline will return "abc\n".
At this point, you call select again, and you want it to let you know that there is more to read ("def\n"). However, select will report there is nothing to read since select is a system call, and the data has already been read from the system. That means you will have to wait for more to come in before being able to read "def\n".
The following program illustrates this:
use IO::Select qw( );
use IO::Handle qw( );
sub producer {
my ($fh) = @_;
for (;;) {
print($fh time(), "\n") or die;
print($fh time(), "\n") or die;
sleep(3);
}
}
sub consumer {
my ($fh) = @_;
my $sel = IO::Select->new($fh);
while ($sel->can_read()) {
my $got = <$fh>;
last if !defined($got);
chomp $got;
print("It took ", (time()-$got), " seconds to get the msg\n");
}
}
pipe(my $rfh, my $wfh) or die;
$wfh->autoflush(1);
fork() ? producer($wfh) : consumer($rfh);
Output:
It took 0 seconds to get the msg
It took 3 seconds to get the msg
It took 0 seconds to get the msg
It took 3 seconds to get the msg
It took 0 seconds to get the msg
...
This can be fixed using non-buffered IO:
sub consumer {
my ($fh) = @_;
my $sel = IO::Select->new($fh);
my $buf = '';
while ($sel->can_read()) {
sysread($fh, $buf, 64*1024, length($buf)) or last;
while ($buf =~ s/^(.*)\n//) {
my $got = $1;
print("It took ", (time()-$got), " seconds to get the msg\n");
}
}
}
Output:
It took 0 seconds to get the msg
It took 0 seconds to get the msg
It took 0 seconds to get the msg
It took 0 seconds to get the msg
It took 0 seconds to get the msg
It took 0 seconds to get the msg
...
Blocking is bad
Let's look at an example of what can happen using readline in a select loop.
"abcdef" arrives on the handle.
select notifies you that there is data to read.
readline will try to read a chunk from the socket.
"abcdef" will be placed in Perl's buffer for the handle.
readline hasn't received a newline, so it tries to read another chunk from the socket.
There is no more data currently available, so it blocks.
This defies the purpose of using select.
[ Demo code forthcoming ]
Solution
You have to implement a version of readline that doesn't block, and only uses buffers you can inspect. The second part is easy because you can inspect the buffers you create.
Create a buffer for each handle.
When data arrives from a handle, read it but no more. When data is waiting (as we know from select), sysread will return what's available without waiting for more to arrive. That makes sysread perfect for this task.
Append the data read to the appropriate buffer.
For each complete message in the buffer, extract it and process it.
Adding a handle:
$select->add($fh);
$clients{fileno($fh)} = {
buf => '',
...
};
select loop:
use experimental qw( refaliasing declared_refs );
while (my #ready = $select->can_read) {
for my $fh (#ready) {
my $client = $clients{fileno($fh)};
my \$buf = \($client->{buf}); # Make $buf an alias for $client->{buf}
my $rv = sysread($fh, $buf, 64*1024, length($buf));
if (!$rv) {
delete $clients{fileno($fh)};
$select->remove($fh);
if (!defined($rv)) {
... # Handle error
}
elsif (length($buf)) {
... # Handle eof with partial message
}
else {
... # Handle eof
}
next;
}
while ($buf =~ s/^(.*)\n//) {
my $msg = $1;
... # Process message.
}
}
}
By the way, this is much easier to do using threads, and this doesn't even handle writers!
Note that IPC::Run can do all the hard work for you if you're communicating with a child process, and that asynchronous IO can be used as an alternative to select.
After much discussion with @ikegami, we determined that in my extremely specific case the readline is actually not an issue. I'm still leaving ikegami's as the accepted right answer because it is far and away the best way to handle the general situation, and a wonderful writeup.
Readline (aka <>) is acceptable in my situation due to the following facts:
The handle is only returned once from the select statement, and then it is closed/removed
I only send one message through the file handle
I do not care if read handles block
I am accounting for timeouts and closed handle returns from select (error checking not included in the sample code above)

How do I read paragraphs at a time with Perl?

When I write
#!/usr/bin/perl -w
use strict;
while( <DATA> ) {
print "\n-------------------------\n\n";
print;
<>;
}
after each "return" I get one line.
Why don't I get with the next script after each "return" one paragraph?
#!/usr/bin/perl -w
use strict;
local $/ = "";
while( <DATA> ) {
print "\n-------------------------\n\n";
print;
<>;
}
__DATA__
line one
line two
line three
line four
line five
line six
line seven
line eight
line nine
line ten
line eleven
line twelve
In your first script, with the $/ variable set to default "\n", the <DATA> will only return one line at a time.
I believe the second script does what you want, it's just that <> won't terminate the read on a 'return' but rather on a <ctrl-d> due to your $/ setting (as someone else pointed out <> reads from STDIN but I think you already know that and are using it to regulate the output).
If you really want to regulate the output with 'return' then you need to do more with $/ in the loop.
while( <DATA> ) {
print "\n-------------------------\n\n";
print;
$/ = "\n"; # default so that the following terminates the read on 'return'
<>;
$/ = "";
}
I guess you're expecting this line
local $/ = "";
to change the behaviour of
<DATA>
to keep reading until the end of the data.
But in fact it takes something like this
{
local $/; # $/ becomes undef in this block
...
}
to enable slurp mode (and to contain that mode to the scope inside the {curlys}).
In effect it's saying "forget about thinking of newlines as the end-of-record marker".
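For example, slurping a whole file in one read (an in-memory handle stands in for a real file here):

```perl
use strict;
use warnings;

my $text = "line one\nline two\nline three\n";
open my $fh, '<', \$text or die $!;

my $slurped = do {
    local $/;              # undef inside this block: slurp mode
    <$fh>;                 # a single read returns everything
};
# $slurped now holds all three lines, newlines included
```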
Besides that... there's a tie fighter in your code!
while( <DATA> ) {
print "\n-------------------------\n\n";
print;
<>; # <-- Feel the power of the DARK SIDE!!!
}
This little guy will read from STDIN, not from DATA - is that really what you want?
Using <> that way (interactively) in paragraph mode is going to be confusing. It won't return when you hit "return"; instead, it will read until it gets a non empty line (the start of a paragraph), then read until it gets an empty line (the end of that paragraph), then continue reading until it gets a non-empty line (the start of the following paragraph - which will be buffered, not returned) so it knows that it's discarded any extra empty lines.
Perhaps you should be using:
local $/ = "\n"; <>
at the end of your loop instead. Or maybe POSIX::getchar().