Filter from large log file - perl

I would like to do the following without changing the file for a large log file in Windows format
Remove all CRLF characters
Insert a blank line between the "CLG..." "TRC..." in the last line of the log file
After reading the results in paragraph mode, print the paragraph if a particular string exists
The code below does not work.
use strict;
use warnings;
my $ID = "D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2";
my $SDP;
open (LOG, "file.log") || die $!;
my $line;
while(<LOG>) {
$line .= $_;
$line =~s/\r//g;
}
local $/ = '';
while (<>) {
if ( /Call-ID:\s+(.+)/ and $ID ) {
$SDP = 1;
print;
next;
}
print if $SDP && /\brtpmap\b/;
$SDP = 0;
}
close(LOG);
Jan 28 11:39:37.525 CET: //1393628/D5CC0586A87B/SIP/Msg/ccsipDisplayMsg:^M
Received:^M
SIP/2.0 200 OK^M
Via: SIP/2.0/UDP 10.218.16.2:5060;branch=z9hG4bKB22001ED5^M
From: "Frankeerapparaat Secretariaat" <sip:089653717@10.210.2.49>;tag=E7E0EF64-192F^M
To: <sip:022046187@10.210.2.49>;tag=25079324~19cc0abf-61d9-407f-a138-96eaffee1467-27521338^M
Date: Mon, 28 Jan 2013 10:39:32 GMT^M
Call-ID: D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2^M
CSeq: 102 INVITE^M
Allow: INVITE, OPTIONS, INFO, BYE, CANCEL, ACK, PRACK, UPDATE, REFER, SUBSCRIBE, NOTIFY^M
Allow-Events: presence^M
Supported: replaces^M
Supported: X-cisco-srtp-fallback^M
Supported: Geolocation^M
Session-Expires: 1800;refresher=uas^M
Require: timer^M
P-Preferred-Identity: <sip:022046187@10.210.2.49>^M
Remote-Party-ID: <sip:022046187@10.210.2.49>;party=called;screen=no;privacy=off^M
Contact: <sip:022046187@10.210.2.49:5060>^M
Content-Type: application/sdp^M
Content-Length: 209^M
^M
v=0^M
o=CiscoSystemsCCM-SIP 2000 1 IN IP4 10.210.2.49^M
s=SIP Call^M
c=IN IP4 10.210.2.1^M
t=0 0^M
m=audio 16844 RTP/AVP 8 101^M
a=rtpmap:8 PCMA/8000^M
a=ptime:20^M
a=rtpmap:101 telephone-event/8000^M
a=fmtp:101 0-15^M
^M
Jan 28 11:39:37.529 CET: //1393628/D5CC0586A87B/SIP/Msg/ccsipDisplayMsg:^M
Sent:^M
ACK sip:022046187@10.210.2.49:5060 SIP/2.0^M
Via: SIP/2.0/UDP 10.218.16.2:5060;branch=z9hG4bKB2247150A^M
From: "Frankeerapparaat Secretariaat" <sip:089653717@10.210.2.49>;tag=E7E0EF64-192F^M
To: <sip:022046187@10.210.2.49>;tag=25079324~19cc0abf-61d9-407f-a138-96eaffee1467-27521338^M
Date: Mon, 28 Jan 2013 10:39:36 GMT^M
Call-ID: D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2^M
Max-Forwards: 70^M
CSeq: 102 ACK^M
Authorization: Digest username="Genk_AC_1",realm="infraxnet.be",uri="sip:022046187@10.210.2.49:5060",response="9546733290a96d1470cfe29a7500c488",nonce="5V/Jt8FHd5I8uaoahshiaUud8O6UujJJ",algorithm=MD5^M
Allow-Events: telephone-event^M
Content-Length: 0^M
^M
^M
Jan 28 11:39:37.529 CET: //1393627/D5CC0586A87B/SIP/Msg/ccsipDisplayMsg:^M
Sent:^M
SIP/2.0 200 OK^M
Via: SIP/2.0/UDP 192.168.8.11:5060;branch=z9hG4bK24ecaaaa6dbd3^M
From: "Frankeerapparaat Secretariaat" <sip:3717@192.168.8.11>;tag=e206cc93-1791-457a-aaac-1541296cf17c-29093746^M
To: <sip:022046187@192.168.8.28>;tag=E7E0F8A4-EA3^M
Date: Mon, 28 Jan 2013 10:39:32 GMT^M
Call-ID: fedc8f80-10615564-45df0-b08a8c0@192.168.8.11^M
CSeq: 101 INVITE^M
Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, PRACK, UPDATE, REFER, SUBSCRIBE, NOTIFY, INFO, REGISTER^M
Allow-Events: telephone-event^M
Remote-Party-ID: <sip:022046187@192.168.8.28>;party=called;screen=no;privacy=off^M
Contact: <sip:022046187@192.168.8.28:5060>^M
Supported: replaces^M
Supported: sdp-anat^M
Server: Cisco-SIPGateway/IOS-15.3.1.T^M
Session-Expires: 1800;refresher=uas^M
Require: timer^M
Supported: timer^M
Content-Type: application/sdp^M
Content-Disposition: session;handling=required^M
Content-Length: 247^M
^M
v=0^M
o=CiscoSystemsSIP-GW-UserAgent 7276 9141 IN IP4 192.168.8.28^M
s=SIP Call^M
c=IN IP4 192.168.8.28^M
t=0 0^M
m=audio 30134 RTP/AVP 8 101^M
c=IN IP4 192.168.8.28^M
a=rtpmap:8 PCMA/8000^M
a=rtpmap:101 telephone-event/8000^M
a=fmtp:101 0-15^M
a=ptime:20^M
^M
CLG(2022-11-07 00:09:06.444)| Call(Terminate) | 302A330B040C73070A021806021C0200 | ^M
TRC(2022-11-15 00:00:38.012)| SIP( OUT : Response ) Trying( 100 INVITE ) | 2 | | 0 | 332C30050A0F750A00011A06021C0200 | SIP/2.0 100 Trying^M

There are many things getting in your way here. I think I'll get close to what you are trying to do, but I have to make some guesses.
First, the bare @ in $ID is interpolating, so your string is missing the @10 that should have been a literal. You aren't getting a warning, perhaps because that identifier would be a Perl special variable rather than a user-defined one.
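As a minimal sketch of that first point (the variable names $double and $single are mine, not from the question), the same text gives different strings in double and single quotes, because inside double quotes Perl treats @10 as an array to interpolate:

```perl
use strict;
use warnings;

# Double quotes: "@10" is parsed as the (empty) array @10 and vanishes.
my $double = "D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2";

# Single quotes: no interpolation, so the @ stays literal.
my $single = 'D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2';

printf "double: %s\nsingle: %s\n", $double, $single;
print "The two strings differ\n" if $double ne $single;
```

Escaping the character as \@ inside double quotes would be another way to keep it literal.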
Second, you have some weird filehandling there.
You're building up a modified log file into a single, big string in $line. You say that you have large files, and that means different things to different people. But some people work in contexts where "large" is tens or hundreds of gigs. Don't do that.
Third, you don't do anything with $line after you build it up. I think that you are expecting to read it again with <>.
I'd approach this a bit differently. I don't care so much about the line-endings right off the bat. If I'm going through millions of records, I don't want to spend time converting every line when I'm going to ignore most of them. I can convert that when I have stuff to output. That also depends a bit on the proportion of hits you expect. If you are printing almost every record, it doesn't matter as much as if you are printing 1% of them.
To start, I know that the file format looks almost like the email mbox format. That first line with the date ruins it all because it doesn't have a fixed string that you can use to see the start of the record like the mbox envelope does. This also means that since the header and body within a record are separated by CRLFCRLF and the records themselves are also separated by CRLFCRLF, it's a bit tricky to get a complete record in paragraph mode.
So, let's read chunks separated by CRLFCRLF. The first chunk should be the header and the second chunk should be the body. There's a chance to get out of step here and there are some things you can do to recover from that, but I'll skip that here. Basically, inspect the chunk and see if it is what you expected (begins with date, etc). If you are curious about that sort of thing, the design of UTF-8 is interesting since it started with the idea that things can get garbled but you can get back on track.
Here's what we have so far. Single-quote $ID to get the true value, and set the input record separator ($/) to CRLFCRLF so that the empty <> (which reads from the ARGV filehandle) returns one chunk at a time. I select ARGV, set the value, then reselect the previous default; since $/ is a global variable rather than a per-filehandle one, that select dance is harmless but not strictly required. Then, the meat of my program is a while loop:
use v5.10;
use strict;
use warnings;
my $ID = 'D5CCA1AE-686D11E2-A881ED01-8DFA6D70#10.218.16.2';
my $old = select(ARGV);
$/ = "\x0D\x0A" x 2;
select($old);
while( <> ) {
...
}
The outer <> gets the header ($_), and I always have to check for a body even if I want to skip that entire record. That merely keeps the reading synchronized (and your other requirements forebode some other ways to get out of sync). The trick is that I have to look at the Content-length header to see if there's a body. I don't particularly trust that length value though because I haven't done the work to see if it's from LF output or CRLF output (that is, the logger added octets without changing the Content-Length header).
There are many ways to do this, but this is simple enough: check if the content length is greater than zero:
while( <> ) {
my $body = '';
$body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
... filter goes here ...
print $_, $body
}
Now I need to decide if I want this record. You have two requirements, I think:
Call-ID has the $ID value
The body has rtpmap
Start with the $ID value. I want that to be the header value, so I want that in the pattern. For that, I use quotemeta to prepare the string to be interpolated into the pattern (the . is a special char). You have a line /Call-ID:\s+(.+)/ and $ID where I think you thought the capture value from (.+) would be compared to $ID, but that's not how it works.
my $ID = quotemeta('D5CCA1AE-686D11E2-A881ED01-8DFA6D70#10.218.16.2');
while( <> ) {
my $body = '';
$body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
next unless /\vCall-ID:\s+$ID/;
print $_, $body
}
Here's an interesting note. I can't use the ^ beginning of line anchor because my line ending is CRLFCRLF, but the internal lines are separated by CRLF. I do know that the header will have vertical whitespace before it, so I add a \v to anchor it. Not a big deal.
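To make the quotemeta and \v points concrete, here is a small sketch on a made-up header chunk whose internal lines end in CRLF. It also tries ^ with the /m flag, which is an alternative I am adding for comparison rather than something used in the code above:

```perl
use strict;
use warnings;

# A header chunk whose internal lines end in CRLF (made-up sample data).
my $header = join "\x0D\x0A",
    'Date: Mon, 28 Jan 2013 10:39:32 GMT',
    'Call-ID: D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2',
    'CSeq: 102 INVITE';

# quotemeta escapes the ., - and @ so they match literally in the pattern.
my $id = quotemeta('D5CCA1AE-686D11E2-A881ED01-8DFA6D70@10.218.16.2');

# Without /m, ^ only matches at the very start of the string.
my $caret_plain = $header =~ /^Call-ID:\s+$id/  ? 1 : 0;
# \v matches the LF that precedes the internal line.
my $vertical    = $header =~ /\vCall-ID:\s+$id/ ? 1 : 0;
# With /m, ^ also matches right after each LF.
my $caret_multi = $header =~ /^Call-ID:\s+$id/m ? 1 : 0;

print "plain ^: $caret_plain, \\v: $vertical, ^ with /m: $caret_multi\n";
```

Only the plain ^ version fails to match here, because Call-ID is not the first line of the chunk.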
Now, my pattern isn't going to change, so I can pre-compile that with qr//. Later I can use that right in the m//:
my $ID = quotemeta('D5CCA1AE-686D11E2-A881ED01-8DFA6D70#10.218.16.2');
my $header_pattern = qr/\vCall-ID:\s+$ID/;
while( <> ) {
my $body = '';
$body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
next unless /$header_pattern/;
next unless $body =~ /\brtpmap\b/;
print $_, $body
}
I might be content to leave it like this. The output will still have CRLF feeds, but who cares? I can fix that in another program such as dos2unix if it really matters.
In your example, I don't think the line endings matter. It looks like you want to extract a couple of values. I think that you want to print the value of the Call-ID line and the rtpmap value. Your code doesn't quite work for that for various reasons as you are trying to remember where you are and you stomp all over your state. Instead, I now have the header and body, I use captures to get the values, then I output them joined with newlines. I never converted the line endings because I never needed to.
while( <> ) {
my $body = '';
$body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
next unless m/$header_pattern/;
my $this_id = $1;
next unless $body =~ /\b(rtpmap:[^\v]+)/;
print join "\n", $this_id, $1;
}
But there's another problem. The body has multiple rtpmap lines. If I want all of them, I need to make an adjustment. I can match the body with a global match in list context and check how many results I get:
my @rtpmaps = $body =~ /\b(rtpmap:[^\v]+)/g;
next unless @rtpmaps > 0;
print join "\n", $this_id, @rtpmaps;
Here it is all together:
#!perl
use v5.10;
use warnings;
my $ID = quotemeta('D5CCA1AE-686D11E2-A881ED01-8DFA6D70#10.218.16.2');
my $header_pattern = qr/\vCall-ID:\s+($ID)/;
my $old = select(ARGV);
$/ = "\x0D\x0A" x 2;
select($old);
chdir '/Users/brian/Desktop';
@ARGV = 'test.log';
while( <> ) {
my $body = '';
$body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
next unless m/$header_pattern/;
next unless length $body;
my $this_id = $1;
my @rtpmaps = $body =~ /\b(rtpmap:[^\v]+)/g;
next unless @rtpmaps > 0;
print join "\n", $this_id, @rtpmaps;
}
You had an additional requirement with the lines beginning with CLG and TRC. You can inspect the line before you look for a body and decide what you'd like to do with those.
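One possible way to handle those lines, as a rough sketch only: the helper name separate_clg_trc is hypothetical, and I am assuming the requirement means "put an empty line in front of a TRC line that follows a CLG line". The sample chunk is shortened from the log in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: if a chunk starts with a CLG or TRC line, put an
# empty line in front of every following TRC line; otherwise return the
# chunk unchanged.
sub separate_clg_trc {
    my ($chunk) = @_;
    return $chunk unless $chunk =~ /\A(?:CLG|TRC)\(/;
    # Double the line ending (CRLF or bare LF) that precedes "TRC(".
    $chunk =~ s/(\x0D?\x0A)(?=TRC\()/$1$1/g;
    return $chunk;
}

my $tail = "CLG(2022-11-07 00:09:06.444)| Call(Terminate) |\x0D\x0A"
         . "TRC(2022-11-15 00:00:38.012)| SIP( OUT : Response ) |\x0D\x0A";
print separate_clg_trc($tail);
```

Inside the while loop this could be called on $_ before the Content-Length check, since such a chunk has no body to read.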

I assume you are running on a system that uses Unix-style line endings, otherwise the file's Windows line endings would not be a problem. The key to handling Windows files under Unix is to let Perl do the dirty work by using the :crlf I/O layer when you open the file. To do this, you need to use the three-argument version of open(). In your case this is open LOG, '<:crlf', 'file.log' or die $!. Note that I do not need the parentheses in the open() because I have used the loosely-binding or rather than the tightly-binding ||.
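A quick way to convince yourself of what the :crlf layer does is the sketch below. It writes a tiny CRLF-terminated sample (made-up content) to a temporary file and reads it back; File::Temp and the variable names are my choices for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny CRLF-terminated sample in binary mode so nothing is translated.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "Call-ID: abc\x0D\x0A\x0D\x0Av=0\x0D\x0A";
close $out or die "Failed to close $path: $!\n";

# Read it back through the :crlf layer, which maps CRLF to "\n".
open my $in, '<:crlf', $path or die "Failed to open $path: $!\n";
my @lines = <$in>;
close $in;

my $cr_count = grep { /\x0D/ } @lines;
printf "%d lines read, %d still contain a CR\n", scalar @lines, $cr_count;
```

Because the carriage returns are gone by the time the lines reach your code, paragraph mode sees the empty lines as real paragraph separators.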
The following is how I would implement your code, assuming I understand your requirements:
#!/usr/bin/env perl
use 5.010; # for \K
use strict;
use warnings;
open my $log, '<:crlf', 'file.log'
or die "Failed to open file.log: $!\n";
local $/ = '';
my $state = \&state_1;
while ( <$log> ) {
if ( eof $log ) {
s/ ^ CLG .*? \n \K (?= TRC ) /\n/smx;
}
$state = $state->();
}
sub state_1 {
if ( m/ Call-ID: \s+ /smx ) {
print;
return \&state_2;
}
return \&state_1;
}
sub state_2 {
if ( m/ \b rtpmap \b /smx ) {
print;
}
return \&state_1;
}
# ex: set ts=8 sts=4 sw=4 tw=72 ft=perl expandtab shiftround :
Rather than do logic on flag variables (your $SDP) I just implemented a state machine.
My logic does not mention $ID because the value you give is always true. If $ID is false I believe no output at all should be produced.
Strictly speaking, $/ should be localized to prevent Spooky Action at a Distance, but in a small script like this it is not likely to cause problems.
The if ( eof $log ) ... implements your requirement that a blank line be inserted between two lines in the last paragraph. If your intent was to break this into two paragraphs you will need a different implementation.

Related

Extract preceding and trailing characters to a matched string from file in awk

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';
my $string;
# Read submitted files line by line (or STDIN if @ARGV is empty)
while (<>) {
chomp;
$string = $_;
last; # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz); # test
my $padding = 3; # for the given test sample
my @patterns = do {
my $search_file = 'search.txt';
open my $fh, '<', $search_file or die "Can't open $search_file: $!";
<$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw); # test
foreach my $patt (@patterns) {
if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
say "$1\t$2\t$3";
# or
# printf "%-3s\t%3s%3s\n", $1, $2, $3;
}
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.) up to $padding times (3 above). I used it in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after $patt.
For your problem, set $padding to 100. With the 100-wide "padding" before and after each pattern, when a pattern is found closer to the beginning than 100 characters, the desired \t alignment could break if the position falls short of 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
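The difference between the two forms can be seen on the test string. This is only a sketch: the context helper and its $strict switch are names I made up for the comparison.

```perl
use strict;
use warnings;

my $string  = 'abcdefghijklmnopqrstuvwxyz';
my $padding = 3;

# Return [before, match, after] or undef; $strict demands the full padding.
sub context {
    my ($patt, $strict) = @_;
    my $re = $strict
        ? qr/(.{$padding})(\Q$patt\E)(.{$padding})/
        : qr/(.{0,$padding})(\Q$patt\E)(.{0,$padding})/;
    return $string =~ $re ? [ $1, $2, $3 ] : undef;
}

for my $patt (qw(bcd fgh)) {
    for my $strict (0, 1) {
        my $hit = context($patt, $strict);
        printf "%s strict=%d: %s\n", $patt, $strict,
            $hit ? join('|', @$hit) : 'no match';
    }
}
```

For bcd only the {0,$padding} form matches (with a one-character "before" field), while fgh gives cde|fgh|ijk with either form.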
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest
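A minimal sketch of what that option handling could look like, assuming the three option names from the invocation above and defaults that I picked for illustration (parse_args is a hypothetical helper):

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse an argument list into a hash of options, starting from defaults.
sub parse_args {
    my @args = @_;
    my %opt = (sequence => 'seq.txt', search => 'search.txt', padding => 100);
    GetOptionsFromArray(
        \@args,
        'sequence=s' => \$opt{sequence},   # file with the long string
        'search=s'   => \$opt{search},     # file with one pattern per line
        'padding=i'  => \$opt{padding},    # characters of context per side
    ) or die "Usage: $0 [--sequence FILE] [--search FILE] [--padding N]\n";
    return \%opt;
}

my $opt = parse_args(@ARGV);
print "sequence=$opt->{sequence} search=$opt->{search} padding=$opt->{padding}\n";
```

Getopt::Long accepts unambiguous abbreviations by default, so --pad 3 would also set the padding.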
One in awk. -v b=3 is the before-context length, -v a=3 is the after-context length, and -v n=3 is the match length, which is always constant. It hashes all the substrings of seq.txt into memory, so memory use depends on the size of seq.txt and you might want to follow the consumption with top. The hash looks like: abcdefghij -> s["def"]="abcdefghi" , s["efg"]="bcdefghij" etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
e=length()-(n+a-1)
for(i=1;i<=e;i++) {
k=substr($0,(i+b),n)
s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
}
next
}
($0 in s) {
print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
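If overlapping contexts matter, one alternative (not part of the grep approach, just a sketch in Perl with a hypothetical context_around helper) is to look each pattern up independently with index and cut the context out with substr, so that one match never consumes text needed by another:

```perl
use strict;
use warnings;

# Return (before, match, after) for the first occurrence of $patt in $seq,
# or an empty list when $patt does not occur.
sub context_around {
    my ($seq, $patt, $pad) = @_;
    my $pos = index($seq, $patt);
    return if $pos < 0;
    my $start  = $pos > $pad ? $pos - $pad : 0;   # clamp at the string start
    my $before = substr($seq, $start, $pos - $start);
    my $after  = substr($seq, $pos + length($patt), $pad);
    return ($before, $patt, $after);
}

my $seq = 'abcdefghijklmnopqrstuvwxyz';
for my $patt (qw(fgh ghi)) {    # these two matches overlap
    my @hit = context_around($seq, $patt, 3);
    print join("\t", @hit), "\n" if @hit;
}
```

Both overlapping patterns are reported here, each with its own three characters of context on either side.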
You can try following approach to your problem:
load string input data
load into an array patterns
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh, '<', $fname_s
or die "Couldn't open $fname_s";
my $data = do { local $/; <$fh> };
close $fh;
open $fh, '<', $fname_p
or die "Couldn't open $fname_p";
my @patterns = <$fh>;
close $fh;
chomp @patterns;
for ( @patterns ) {
my @found = $data =~ /(.{100}$_.{100})/g;
s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for @found;
}
Test code for provided test data (added later)
use strict;
use warnings;
use feature 'say';
my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };
for( @pat ) {
say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz

Perl CGI produces unexpected output

I have a Perl CGI script for online concordance application that searches for an instance of word in a text and prints the sorted output.
#!/usr/bin/perl -wT
# middle.pl - a simple concordance
# require
use strict;
use diagnostics;
use CGI;
# ensure all fatals go to browser during debugging and set-up
# comment this BEGIN block out on production code for security
BEGIN {
$|=1;
print "Content-type: text/html\n\n";
use CGI::Carp('fatalsToBrowser');
}
# sanity check
my $q = new CGI;
my $target = $q->param("keyword");
my $radius = $q->param("span");
my $ordinal = $q->param("ord");
my $width = 2*$radius;
my $file = 'concordanceText.txt';
if ( ! $file or ! $target ) {
print "Usage: $0 <file> <target>\n";
exit;
}
# initialize
my $count = 0;
my @lines = ();
$/ = ""; # Paragraph read mode
# open the file, and process each line in it
open(FILE, " < $file") or die("Can not open $file ($!).\n");
while(<FILE>){
# re-initialize
my $extract = '';
# normalize the data
chomp;
s/\n/ /g; # Replace new lines with spaces
s/\b--\b/ -- /g; # Add spaces around dashes
# process each item if the target is found
while ( $_ =~ /\b$target\b/gi ){
# find start position
my $match = $1;
my $pos = pos;
my $start = $pos - $radius - length($match);
# extract the snippets
if ($start < 0){
$extract = substr($_, 0, $width+$start+length($match));
$extract = (" " x -$start) . $extract;
}else{
$extract = substr($_, $start, $width+length($match));
my $deficit = $width+length($match) - length($extract);
if ($deficit > 0) {
$extract .= (" " x $deficit);
}
}
# add the extracted text to the list of lines, and increment
$lines[$count] = $extract;
++$count;
}
}
sub removePunctuation {
my $string = $_[0];
$string = lc($string); # Convert to lowercase
$string =~ s/[^-a-z ]//g; # Remove non-aplhabetic characters
$string =~ s/--+/ /g; #Remove 2+ hyphens with a space
$string =~s/-//g; # Remove hyphens
$string =~ s/\s=/ /g;
return($string);
}
sub onLeft {
#USAGE: $word = onLeft($string, $radius, $ordinal);
my $left = substr($_[0], 0, $_[1]);
$left = removePunctuation($left);
my @word = split(/\s+/, $left);
return($word[-$_[2]]);
}
sub byLeftWords {
my $left_a = onLeft($a, $radius, $ordinal);
my $left_b = onLeft($b, $radius, $ordinal);
lc($left_a) cmp lc($left_b);
}
# process each line in the list of lines
print "Content-type: text/plain\n\n";
my $line_number = 0;
foreach my $x (sort byLeftWords @lines){
++$line_number;
printf "%5d",$line_number;
print " $x\n\n";
}
# done
exit;
The perl script produces expected result in terminal (command line). But the CGI script for online application produces unexpected output. I cannot figure out what mistake I am making in the CGI script. The CGI script should ideally produce the same output as the command line script. Any suggestion would be very helpful.
Command Line Output
CGI Output
The BEGIN block executes before anything else and thus before
my $q = new CGI;
The output goes to the server process' stdout and not to the HTTP stream, so the default is text/plain as you can see in the CGI output.
After you solve that problem you'll find that the output still looks like a big ugly block because you need to format and send a valid HTML page, not just a big block of text. You cannot just dump a bunch of text to the browser and expect it to do anything intelligent with it. You must create a complete HTML page with tags to layout your content, probably with CSS as well.
In other words, the output required will be completely different from the output when writing only to the terminal. How to structure it is up to you, and explaining how to do that is out of scope for StackOverflow.
As the other answers state, the BEGIN block is executed at the very start of your program.
BEGIN {
$|=1;
print "Content-type: text/html\n\n";
use CGI::Carp('fatalsToBrowser');
}
There, you output an HTTP header Content-type: text/html\n\n. The browser sees that first, and treats all your output as HTML. But you only have text. Whitespace in an HTML page is collapsed into single spaces, so all your \n line breaks disappear.
Later, you print another header, the browser cannot see that as a header any more, because you already had one and finished it off with two newlines \n\n. It's now too late to switch back to text/plain.
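The effect can be reproduced without a web server. The sketch below builds a shortened version of what the script sends (the body line is made up) and splits it the way a client does, at the first empty line only:

```perl
use strict;
use warnings;

# What the script in the question effectively sends (body shortened).
my $response = "Content-type: text/html\n\n"     # from the BEGIN block
             . "Content-type: text/plain\n\n"    # printed later
             . "    1 first line\n\n";

# A client splits headers from body at the first empty line only.
my ($headers, $body) = split /\n\n/, $response, 2;

print "Headers seen by the browser: $headers\n";
print "Body starts with: ", (split /\n/, $body)[0], "\n";
```

The second Content-type line ends up as the first line of the body, which is why it shows up as visible text in the browser.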
It is perfectly fine to have a CGI program return text/plain and just have text without markup be displayed in a browser when all you want is text, and no colors or links or tables. For certain use cases this makes a lot of sense, even if it doesn't have the hyper in Hypertext any more. But you're not really doing that.
Your BEGIN block serves a purpose, but you are overdoing it. You're trying to make sure that when an error occurs, it gets nicely printed in the browser, so you don't need to deal with the server log while developing.
The CGI::Carp module and its fatalsToBrowser functionality bring their own mechanism for that. You don't have to do it yourself.
You can safely remove the BEGIN block and just put your use CGI::Carp at the top of the script with all the other use statements. They all get run first anyway, because use gets run at compile time, while the rest of your code gets run at run time.
If you want, you can keep the $|++, which turns off the buffering for your STDOUT handle. It gets flushed immediately and every time you print something, that output goes directly to the browser instead of collecting until it's enough or there is a newline. If your process runs for a long time, this makes it easier for the user to see that stuff is happening, which is also useful in production.
The top of your program should look like this now.
#!/usr/bin/perl -T
# middle.pl - a simple concordance
use strict;
use warnings;
use diagnostics;
use CGI;
use CGI::Carp('fatalsToBrowser');
$|=1;
my $q = CGI->new;
Finally, a few quick words on the other parts I deleted from there.
Your comment # require over the use statements is misleading. Those are use, not require. As I said above, use gets run at compile time. require on the other hand gets run at run time and can be done conditionally. Misleading comments will make it harder for others (or you) to maintain your code later on.
I removed the -w flag from the shebang (#!/usr/bin/perl) and put the use warnings pragma in. That's a more modern way to turn on warnings, because sometimes the shebang can be ignored.
The use diagnostics pragma gives you extra long explanations when things go wrong. That's useful, but also extra slow. You can use it during development, but please remove it for production.
The comment sanity check should be moved down under the CGI instantiation.
Please use the method-call form CGI->new to instantiate CGI, and any other classes. The -> syntax is unambiguous, while the old indirect-object form new CGI can be parsed by Perl in ways you did not intend.
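To illustrate the use versus require point from the list above, here is a small sketch. The can_load helper and the module name No::Such::Module::Here are made up for the demonstration; List::Util and Data::Dumper are core modules.

```perl
use strict;
use warnings;

# "use" runs at compile time, before any of the prints below are executed.
use List::Util qw(sum);

# "require" runs when execution reaches it, so it can be wrapped in eval
# and made conditional.
sub can_load {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

print "sum: ", sum(1, 2, 3), "\n";
print "Data::Dumper loadable: ", can_load('Data::Dumper'), "\n";
print "No::Such::Module::Here loadable: ", can_load('No::Such::Module::Here'), "\n";
```

A failed use would stop the script before it runs at all, while the failed require here is simply reported as 0.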
I ran your CGI script. The BEGIN block is run regardless, and you print a Content-type header there, which explicitly asks for HTML. Then later you attempt to print another header for plain text. This is why you can see the header text (that hasn't taken effect) at the beginning of the text in the browser window.

Parsing a syslog entry

This is what an entry looks like:
Jan 26 20:53:31 hostname logger: System rebooted for hard disk upgrade
I'm writing a small application to parse entries like this and email a nicely formatted message to the admin. I'm writing in Perl and found the split() function which is exactly what I'm looking for:
my @tmp = split(/ /, $string, 4);
@tmp = ($date, $hostname, $facility, $message)
That's what I'm hoping to get. Split() can handle the spaces in the $message part because I limit the amount of "words" to split off. However, the spaces in the $date part throw it off. Is there a clean way I can get these variables to represent what they're supposed to?
I know I could use substr() to grab the first 15 characters (the date), then use split() and limit it to 3 words instead of 4, then grab all my strings from there. But is there a more elegant way to do this?
If one-lined-ness is important to elegance, split on spaces that are not followed by a digit:
my ( $time, $hostname, $facility, $message ) = split /\s+(?=\D)/, $string, 4;
But it makes more sense to use a combination of split and unpack to address the need:
my ( $timeStamp, $log ) = unpack 'A15 A*', $string;
my ( $host, $facility, $msg ) = split ' ', $log, 3;
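Here is a sketch comparing the two approaches on an entry whose host starts with a digit. The IP address as hostname is my assumption for the test, and parse_entry is a hypothetical wrapper; it uses x in the unpack template to skip the space after the timestamp.

```perl
use strict;
use warnings;

# Fixed-width timestamp via unpack, then a limited split on whitespace.
sub parse_entry {
    my ($entry) = @_;
    my ($stamp, $rest) = unpack 'A15 x A*', $entry;
    my ($host, $facility, $msg) = split ' ', $rest, 3;
    return ($stamp, $host, $facility, $msg);
}

my $entry = 'Jan 26 20:53:31 10.0.0.5 logger: System rebooted';
print join('|', parse_entry($entry)), "\n";

# The lookahead split keeps a digit-led host glued to the timestamp.
my @fields = split /\s+(?=\D)/, $entry, 4;
print join('|', @fields), "\n";
```

The limit of 3 on the second split keeps the spaces inside the message intact, and the fixed-width unpack does not care what the hostname looks like.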
Does Parse::Syslog do what you need without the i-try-this-regexp-oh-it-does-not-work-ok-i-hcanged-and-it-works-oh-not-always-hmm-let-me-try-that-much-better-yay-oh-no-it-broke-let-me-try-this-one-has-nobody-done-this-yet feeling?
Use a regex. Here is a simple example:
$mystring = "Jan 26 20:53:31 hostname logger: System rebooted for hard disk upgrade";
if($mystring =~ m/(\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s([^\s]*)\s([^\s:]*):\s(.*$)/) {
$date=$1;
$host=$2;
$facility=$3;
$mesg=$4;
print "Date: $date\nHost: $host\nFacility: $facility\nMesg: $mesg";
}
Old question, but I experienced a similar problem and fixed it by changing the format of my syslog messages (hence a modified rsyslog.conf).
I created an rsyslog template as follows
template(name="CustomisedTemplate" type="list") {
property(name="timestamp")
constant(value=" ")
property(name="$year")
constant(value=";")
property(name="hostname")
constant(value=";")
property(name="programname")
constant(value=";")
property(name="msg" spifno1stsp="on")
property(name="msg" droplastlf="on")
constant(value="\n")
}
then
I set my customised template as default by adding
$ActionFileDefaultTemplate CustomisedTemplate.
to (r)syslog.conf
I could also create the filter for my program (logger), which will use template and redirect message created by program ( logger) to separate file. To achieve that, I added
if $programname contains "logger" then /var/logs/logger.err;CustomisedTemplate
to (r)syslog.conf
So at the end my syslog entry looks like
Jan 26 20:53:31 2016;hostname;logger:;System rebooted for hard disk upgrade
which is rather easy to parse.
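For example, an entry in that format could be taken apart in Perl with a single limited split. This is a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $entry = 'Jan 26 20:53:31 2016;hostname;logger:;System rebooted for hard disk upgrade';

# A limit of 4 keeps any semicolons inside the message itself.
my ($date, $host, $program, $message) = split /;/, $entry, 4;
$program =~ s/:\z//;    # drop the trailing colon from the program field

print "Date: $date\nHost: $host\nProgram: $program\nMessage: $message\n";
```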

how to put a file into an array and save it in perl

Hello everyone, I'm a beginner in Perl and I'm facing some problems. I want to put my strings, starting from AA to \, into an array and save it. There are about 2000-3000 such blocks in a txt file, each starting with the same initials, i.e., AA to \. This is how I'm doing it; please correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
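A way to avoid that mistake is to wrap the handle in braces. In this sketch an in-memory filehandle stands in for the real output file, which the question does not show being opened:

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for the real output file here.
my $buffer = '';
open my $ofh, '>', \$buffer or die "Cannot open in-memory file: $!\n";

my $line = "AA c0001\n";
print {$ofh} $line;     # braces make it obvious that $ofh is the handle
close $ofh;

print "Captured: $buffer";
```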
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
    chomp;
    if (m/^\\$/)
    {
        # End of group - save to data array and start new hash
        $data[$nrecs++] = $hashref;
        $hashref = { };
    }
    else
    {
        # Key is the leading tag; value is the rest of the line
        m/^([A-Z]+)\s+(.*)$/;
        $hashref->{$1} = $2;
    }
}
foreach my $i (0..$nrecs-1)
{
    print "Record $i:\n";
    # Dereference explicitly: keys on a bare reference fails on Perl 5.24+
    foreach my $key (sort keys %{ $data[$i] })
    {
        print " $key = $data[$i]->{$key}\n";
    }
}
print Data::Dumper->Dump([ \@data ], [ '@data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$@data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
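As one possible reorganization, if you needed to look records up by their AA value, you could build a hash on top of the array. This is only a sketch: the name %by_aa is made up for illustration, the two records are abbreviated from the sample output, and it assumes each AA value appears only once.

```perl
use strict;
use warnings;

# Two records in the same shape as the array of hashes (abbreviated sample values).
my @data = (
    { AA => 'c0001', CC => 'table' },
    { AA => 'e0002', CC => 'chair' },
);

# Lookup table keyed by the AA value; a repeated AA would overwrite the earlier record.
my %by_aa;
$by_aa{ $_->{AA} } = $_ for @data;

print "e0002 is a $by_aa{e0002}{CC}\n";
```

The hash holds references to the same records as the array, so no data is copied.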
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out, because you do not reset the flag when you find a break line. You'll have to parse your input and exit the loop when necessary.
Second, you never read any input within this loop, so you process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
    append item to array
else
    save array to file
    start with new array
end
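In Perl, that pattern might look roughly like the sketch below. The sub name group_records is hypothetical, and instead of writing each array to a file it collects the finished arrays in a list, which is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

# Group lines into records; a line starting with "\" closes the current record.
sub group_records {
    my @records;
    my @current;
    for my $line (@_) {
        if ($line =~ /^\\/) {
            push @records, [@current];    # "save array" (kept in memory here)
            @current = ();                # "start with new array"
        }
        else {
            push @current, $line;         # "append item to array"
        }
    }
    return @records;
}

my @records = group_records("AA c0001", "CC table", "\\", "AA e0002", "\\");
printf "%d records, first has %d lines\n", scalar @records, scalar @{ $records[0] };
```

There is a single loop over the input, and the decision about what to do is made once per line.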
I believe what you want is to split the file's content at \, though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandles, read perlvar.
#!perl
use strict;
use warnings;
my $content;
{
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    local $/; # slurp mode
    $content = <$fh>;
    close $fh;
}
my @blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my @blocks;
{
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    local $/ = '\\'; # separate at \
    @blocks = <$fh>;
    close $fh;
}
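To see what the blocks look like when $/ is set to a backslash, here is a small sketch that reads from an in-memory string instead of test.txt (the string $text and the final grep are assumptions added for illustration). Each block keeps its trailing \, and the newline after the last separator comes back as a block of its own, which the grep drops.

```perl
use strict;
use warnings;

# Sample content in place of test.txt (an assumption so the sketch is self-contained).
my $text = "AA c0001\nBB x\n\\\nAA e0002\n\\\n";

my @blocks;
{
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '\\';    # each read now ends at a backslash
    @blocks = <$fh>;
    close $fh;
}

# The newline after the final "\" is returned as a block too; drop blank blocks.
@blocks = grep { /\S/ } @blocks;

print scalar(@blocks), " blocks\n";
```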
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless unless you change it, because if I were to print the @data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA", and save that into $line.
Concatenate every new line from the file into $line.
When you find a line that begins with a backslash \, stop concatenating lines and save $line into @data.
Then, find the next line that begins with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA\b/ (the AA line carries data after the tag, so it cannot be anchored with $), but then you need to watch out for whitespace at the beginning and end of the line. So perhaps /^\s*\\\s*$/ and /^\s*AA\b/ instead.
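A quick way to compare the loose and stricter patterns is to run them against a few sample strings. In this sketch the four sub names are hypothetical, and /^\s*AA\b/ is an assumed stricter pattern for the start tag, chosen because the AA line has data after the tag.

```perl
use strict;
use warnings;

# Each helper returns 1 on a match and 0 otherwise.
sub is_start_loose  { $_[0] =~ /^AA/          ? 1 : 0 }
sub is_start_strict { $_[0] =~ /^\s*AA\b/     ? 1 : 0 }
sub is_end_loose    { $_[0] =~ /^\\/          ? 1 : 0 }
sub is_end_strict   { $_[0] =~ /^\s*\\\s*$/   ? 1 : 0 }

# In single quotes, '\\' is one backslash, so '\\bonkers' is the text \bonkers.
for my $s ('AA c0001', 'AAARGH', '\\', '\\bonkers') {
    printf "%-10s start: loose=%d strict=%d  end: loose=%d strict=%d\n",
        $s, is_start_loose($s), is_start_strict($s),
        is_end_loose($s), is_end_strict($s);
}
```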
The code:
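use warnings;
use strict;
my $line="";
my @data;
while (<DATA>) {
    if (/^AA/) {
        # Start a new record at the "AA" line
        $line = $_;
        while (<DATA>) {
            $line .= $_;
            last if /^\\/;    # The backslash line closes the record
        }
        # Save only collected records; stray lines between records are skipped
        push @data, $line;
    }
}
use Data::Dumper;
print Dumper \@data;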
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\

How can I get a full Mail::SpamAssassin::MailMessage object from text?

I use the following code to generate a spam report using SpamAssassin:
use Mail::SpamAssassin;
my $sa = Mail::SpamAssassin->new();
open FILE, "<", "mail.txt";
my @lines = <FILE>;
my $mail = $sa->parse(@lines);
my $status = $sa->check($mail);
my $report = $status->get_report();
$report =~ s/\n/\n<br>/g;
print "<h1>Spam Report</h1>";
print $report;
$status->finish();
$mail->finish();
$sa->finish();
The problem I have is that it classifies 'sample-nonspam.txt' as spam:
Content preview: [...]
Content analysis details: (6.9 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.2 MISSING_HEADERS        Missing To: header
 0.1 MISSING_MID            Missing Message-Id: header
 1.8 MISSING_SUBJECT        Missing Subject: header
 2.3 EMPTY_MESSAGE          Message appears to have no textual parts and no
                            Subject: text
-0.0 NO_RECEIVED            Informational: message has no Received headers
 1.4 MISSING_DATE           Missing Date: header
 0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822 headers
And that information -is- in the file. What worries me is that in the documentation, it states "Parse will return a Mail::SpamAssassin::Message object with just the headers parsed.". Does that mean it will not return a full message?
You're missing a single character:
my $mail = $sa->parse(\@lines);
From the docs (with emphasis added):
parse($message, $parse_now [, $suppl_attrib])
Parse will return a Mail::SpamAssassin::Message object with just the headers parsed. When calling this function, there are two optional parameters that can be passed in: $message is either undef (which will use STDIN), a scalar of the entire message, an array reference of the message with 1 line per array element, or a file glob which holds the entire contents of the message; and $parse_now, which specifies whether or not to create the MIME tree at parse time or later as necessary.
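To see why the backslash matters, consider how Perl flattens an array in an argument list. The sketch below uses a hypothetical stand-in sub, describe_args, which only mimics the ($message, $parse_now) parameter list; it is not part of Mail::SpamAssassin, and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in with the same leading parameters as parse();
# it only reports what arrived in $message.
sub describe_args {
    my ($message, $parse_now) = @_;
    return ref($message) eq 'ARRAY'
        ? 'array reference of ' . scalar(@$message) . ' lines'
        : 'scalar: ' . $message;
}

my @lines = ("Subject: hello\n", "To: someone\n", "\n", "Body text\n");

print describe_args(@lines);            # array is flattened: $message gets line 1 only
print describe_args(\@lines), "\n";     # one argument: a reference to all the lines
```

With the bare array, the first line lands in $message and the second in $parse_now, so the callee never sees the message as a whole.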
With the change above, I get the following output (HTML stripped):
 pts rule name              description
---- ---------------------- --------------------------------------------------
-2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
                            [score: 0.0000]
As the docs mention, parse is flexible. You could instead use
my $mail = $sa->parse(join "" => <FILE>); # scalar of the entire message
or
my $mail = $sa->parse(\*FILE); # a file glob with the entire contents
or
my $mail;
{ local $/; $mail = $sa->parse(<FILE>) } # scalar of the entire message
or even
open STDIN, "<", "mail.txt" or die "$0: open: $!";
my $mail = $sa->parse(undef); # undef means read STDIN
You'd remove my @lines = <FILE> for these last four examples to function as expected.
This is the right way to construct a Message:
my $mail = Mail::SpamAssassin::Message->new({ "message" => $content });