Parsing a syslog entry - perl

This is what an entry looks like:
Jan 26 20:53:31 hostname logger: System rebooted for hard disk upgrade
I'm writing a small application to parse entries like this and email a nicely formatted message to the admin. I'm writing in Perl and found the split() function which is exactly what I'm looking for:
my @tmp = split(/ /, $string, 4);
@tmp = ($date, $hostname, $facility, $message)
That's what I'm hoping to get. split() can handle the spaces in the $message part because I limit the number of "words" to split off. However, the spaces in the $date part throw it off. Is there a clean way I can get these variables to represent what they're supposed to?
I know I could use substr() to grab the first 15 characters (the date), then use split() and limit it to 3 words instead of 4, then grab all my strings from there. But is there a more elegant way to do this?

If one-lined-ness is important to elegance, split on spaces that are not followed by a digit:
my ( $time, $hostname, $facility, $message ) = split /\s+(?=\D)/, $string, 4;
But it makes more sense to use a combination of split and unpack to address the need:
my ( $timeStamp, $log ) = unpack 'A15 A*', $string;
my ( $host, $facility, $msg ) = split ' ', $log, 3; # ' ' skips the leading space left after A15; the limit keeps the message in one piece

Does Parse::Syslog do what you need without the i-try-this-regexp-oh-it-does-not-work-ok-i-changed-and-it-works-oh-not-always-hmm-let-me-try-that-much-better-yay-oh-no-it-broke-let-me-try-this-one-has-nobody-done-this-yet feeling?

Use a regex. Here is a simple example:
$mystring = "Jan 26 20:53:31 hostname logger: System rebooted for hard disk upgrade";
if($mystring =~ m/(\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s([^\s]*)\s([^\s:]*):\s(.*$)/) {
$date=$1;
$host=$2;
$facility=$3;
$mesg=$4;
print "Date: $date\nHost: $host\nFacility: $facility\nMesg: $mesg";
}

Old question, but I experienced a similar problem and solved it by changing the format of my syslog messages (that is, by modifying rsyslog.conf).
I created an rsyslog template as follows:
template(name="CustomisedTemplate" type="list") {
property(name="timestamp")
constant(value=" ")
property(name="$year")
constant(value=";")
property(name="hostname")
constant(value=";")
property(name="programname")
constant(value=";")
property(name="msg" spifno1stsp="on")
property(name="msg" droplastlf="on")
constant(value="\n")
}
Then I set my customised template as the default by adding
$ActionFileDefaultTemplate CustomisedTemplate
to (r)syslog.conf
I could also create a filter for my program (logger) that uses the template and redirects messages created by the program to a separate file. To achieve that, I added
if $programname contains "logger" then /var/logs/logger.err;CustomisedTemplate
to (r)syslog.conf
So at the end my syslog entry looks like
Jan 26 20:53:31 2016;hostname;logger:;System rebooted for hard disk upgrade
which is rather easy to parse.
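As a rough sketch of why this format is easy to handle, the sample line above can be taken apart with a single limited split. The variable names are made up for illustration, and stripping the trailing colon from the program field is an assumption based only on the sample entry shown.

```perl
use strict;
use warnings;

# Sample entry copied from the post above.
my $line = 'Jan 26 20:53:31 2016;hostname;logger:;System rebooted for hard disk upgrade';

# A limit of 4 keeps any ';' inside the message text intact.
my ($timestamp, $host, $program, $message) = split /;/, $line, 4;

# The sample shows a trailing colon on the program field; drop it if present.
$program =~ s/:\z//;

print "Date: $timestamp\nHost: $host\nProgram: $program\nMessage: $message\n";
```

Because the separator is a single character that does not occur in the timestamp, no lookahead or fixed-width trick is needed.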

Related

Data::Dumper wraps second word's output

I'm experiencing a rather odd problem while using Data::Dumper to try and check on my importing of a large list of data into a hash.
My Data looks like this in another file.
##Product ID => Market for product
ABC => Euro
XYZ => USA
PQR => India
Then in my script, I'm trying to read in my list of data into a hash like so:
open(CONFIG_DAT_H, "<", $config_data);
while(my $line = <CONFIG_DAT_H>) {
if($line !~ /^\#/) {
chomp($line);
my @words = split(/\s*\=\>\s/, $line);
%product_names->{$words[0]} = $words[1];
}
}
close(CONFIG_DAT_H);
print Dumper (%product_names);
My parsing mostly works, in that I can find all of my data in the hash, but when I print it using Data::Dumper it doesn't print properly. This is my output.
$VAR1 = 'ABC';
';AR2 = 'Euro
$VAR3 = 'XYZ';
';AR4 = 'USA
$VAR5 = 'PQR';
';AR6 = 'India
Does anybody know why the Dumper is printing the '; characters over the first two letters on my second column of data?
There is one unclear thing in the code: is *product_names a hash or a hashref?
If it is a hash, you should use the $product_names{key} syntax, not %product_names->{key}, and you need to pass a reference to Data::Dumper: Dumper(\%product_names).
If it is a hashref, then it should be labelled with the correct sigil: $product_names->{key} and Dumper($product_names).
As noted by mob, if your input has any line ending other than \n, it needs to be cleaned up more explicitly, say with s/\s*$// per the comment. See the answer by ikegami.
I'd also like to add that the loop can be simplified by losing the if branch:
open my $config_dat_h, "<", $config_data or die "Can't open $config_data: $!";
while (my $line = <$config_dat_h>)
{
next if $line =~ /^\#/; # or /^\s*\#/ to account for possible spaces
# ...
}
I have changed to the lexical filehandle, the recommended practice with many advantages. I have also added a check for open, which should always be in place.
Hmm... this appears wrong to me, even if you're using Perl 6:
%product_names->{$words[0]} = $words[1];
I don't know Perl 6 very well, but in Perl 5 the element access should look like the line below, assuming %product_names exists and is declared:
$product_names{...} = ... ;
If you could expose the full code, I can help to solve this problem.
The file uses CR LF as line endings. This would become evident by adding the following to your code:
local $Data::Dumper::Useqq = 1;
You could convert the file to use unix line endings (seeing as you are on a unix system). This can be achieved using the dos2unix utility.
dos2unix config.dat
Alternatively, replace
chomp($line);
with the more flexible
$line =~ s/\s+\z//;
Note: %product_names->{$words[0]} makes no sense. It happens to do what you want in old versions of Perl, but it rightfully throws an error in newer versions. $product_names{$words[0]} is the proper syntax for accessing the value of an element of a hash.
Tip: You should be using print Dumper(\%product_names); instead of print Dumper(%product_names);.
Tip: You might also find local $Data::Dumper::Sortkeys = 1; useful. Data::Dumper has such bad defaults :(
Tip: Using split(/\s*=>\s*/, $line, 2) instead of split(/\s*=>\s*/, $line) would permit the value to contain =>.
Tip: You shouldn't use global variables without reason. Use open(my $CONFIG_DAT_H, ...) instead of open(CONFIG_DAT_H, ...), and replace other instances of CONFIG_DAT_H with $CONFIG_DAT_H.
Tip: Using next if $line =~ /^#/; would avoid a lot of indenting.
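Putting these tips together, a minimal sketch of the reading loop might look like the following. The in-memory sample data (with CR LF line endings) stands in for the real config file and is an assumption made only so the example runs on its own.

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Sortkeys = 1;   # stable key order in the dump
local $Data::Dumper::Useqq    = 1;   # make stray \r or \n visible

# Hypothetical stand-in for the config file, using CR LF line endings.
my $sample = "##Product ID => Market for product\r\n"
           . "ABC => Euro\r\nXYZ => USA\r\nPQR => India\r\n";

open my $fh, '<', \$sample or die "Can't open in-memory data: $!";

my %product_names;
while (my $line = <$fh>) {
    next if $line =~ /^#/;      # skip comment lines
    $line =~ s/\s+\z//;         # strips \n, \r\n and trailing blanks
    my ($id, $market) = split /\s*=>\s*/, $line, 2;
    $product_names{$id} = $market;
}
close $fh;

print Dumper(\%product_names);  # pass a reference, not the flattened hash
```

With Useqq enabled, any carriage return that survived would show up as \r in the dump instead of silently moving the cursor.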

Extracting DNA sequences from FASTA file with BioPerl with non-standard header

I'm trying to extract sequences from a database using the following code:
use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";
Where the header of the sequence I'm trying to extract is: C7136661:0-107, as in the file:
>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA
The code works fine when I switch the header to something more standard (like test). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?
By default, Bio::DB::Fasta will use all non-space characters immediately following the > on the header line to form the key for the sequence. In your case this looks like C7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop) you can use $db->seq("$id:$start-$stop"), so a call to $db->seq('C7136661:0-107') looks like you are asking for $db->seq('C7136661', 0, 107), and that key doesn't exist.
I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid callback to modify the key. Then if you use just C7136661 to retrieve the sequence it will work.
This code demonstrates. Note that you will probably already have a .index cache file that you must delete before you see any change in behaviour.
use strict;
use warnings;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = qw(
secondround_merged_expanded.fasta
C7136661
1 10
);
my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);
sub makeid {
my ($head) = @_;
$head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
$1;
}
my $seq = $db->seq($id, $start, $end);
print $seq, "\n";
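The callback can be tried on its own, without BioPerl or an index file, to see which keys it would produce. This is only a sketch: it assumes, as the answer's code does, that the callback receives the header line including the leading >, and the two headers are the ones quoted in the question.

```perl
use strict;
use warnings;

# Same logic as the makeid callback above: keep everything between
# the leading '>' and the first colon.
sub makeid {
    my ($head) = @_;
    $head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
    return $1;
}

for my $header ('>C7136661:0-107', '>C7047455:0-100') {
    print "$header -> ", makeid($header), "\n";
}
```

If two headers share the same prefix before the colon, their keys would collide, so this shortcut only suits data where that prefix is unique.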
I have a related question to this post. I was wondering if anyone has tried what happens when the position in the query is outside the limits of the FASTA sequence. Let's say the FASTA sequence contains 100 bases and your query contains position 102: does this method trap the error? I tried this on some real data and it appears to always return "1"; however, my FASTA sequences contain 0/1, so it is hard to tell whether this is an error code or the output for the wrong base.
I tried looking in the documentation but could not find anything.

Weird behavior with Perl string concatenation

I'm working on a pretty simple script, reading a maplist.txt file and using the \n separated map names in it to build a command string - however, I'm getting some unexpected behavior.
My full code:
# compiles a map pack from maplist.txt
# for every server.
# Filipe Dobreira <dobreira@gmail.com>
# v1 # Sept. 2011
use strict;
my @servers = <*>;
foreach my $server (@servers)
{
# we only want folders:
next if -f $server;
print "server: $server\n";
my $maplist = $server . '/orangebox/cstrike/maplist.txt';
my $mapdir = $server . '/orangebox/cstrike/maps';
print " maplist: $maplist\n";
print " map folder: $mapdir\n";
# check if the maplist actually exists:
if(!(-e $maplist))
{
print "!!! failed to find $maplist\n";
next;
}
open MAPLIST, "<$maplist";
foreach my $map (<MAPLIST>)
{
chomp($map);
next if !$map;
# full path to the map file:
my $mapfile = "$mapdir/$map.bsp";
print "$mapfile\n";
}
}
Where I declare $mapfile, I expect the result to be something like:
zombieescape1/orangebox/cstrike/maps/ze_stargate_escape_v8.bsp
However, it seems like the concatenation is being made to the START of the string, and the final result ends up being something like:
.bspiescape1/orangebox/cstrike/maps/ze_stargate_escape_v8
So the .bsp portion is actually being written over the start of the leftmost string. I have very little perl experience, and I can only assume this is me failing to understand some quirk or operator behavior.
Note: I've also tried using "${mapdir}/${map}.bsp", concatenating everything with the dot operator, and a join "", $mapdir, $map, ".bsp", with the same result.
Thanks in advance.
PS: for reference, here's what a maplist.txt looks like:
zm_3dubka_v3
zm_4way_tunnel_v2
zm_abstractchode_pyramid2
zm_anotheruglyzmap_v1e
zm_app7e_betterbworld_JDfix_v3
zm_atix_helicopter_mini
zm_base_winter_beta3
zm_battleforce_panic_ua
zm_black_lion_macd_v8
zm_bunker_f57_v2
zm_burbsdelchode_b3
zm_choddarena_b12
zm_choddasnowpanic_b4
zm_citylife_V2b
zm_crazycity
zm_deep_thought_nv
zm_desert_fortress_v2
ZM_desprerados_a1
zm_doomlike_station_v2
zm_dust_arena_v1_final
zm_exhibit_night_2F
zm_facility_v1
zm_farm3_nav72
zm_firewall_samarkand
zm_fortress_b7
zm_ghs_flats
zm_gl33m4x_errata
zm_idm_hauntedhouse_v1
zm_industry_v2
zm_kruma_kakariko_village_006
zm_kruma_panic_004
zm_lila_off!ce_v4
zm_little_city_v5pf_fix
zm_moonlight_v3_pF
zm_moon_roflicious_pF_02
zm_moocbblechode_b2
zm_mountain_b2
zm_neko_abura_v2
zm_neko_athletic_park_v2
zm_novum_v3_JDfix
zm_ocx_orly_v4
zm_officeattack_b5a
zm_officerush_betav7
zm_officesspace_pfss
zm_omi_facility_pfv2
zm_penumbra_PF3
zm_raindance_ak_v2
zm_roflicious_pfcf2
zm_roy_abandoned_canals_new
zm_roy_barricade_factory
zm_roy_highway
zm_roy_industrial_complex
zm_roy_old_industrial_pF
zm_roy_the_ship_pf
zm_roy_zombieranch_night_b4
zm_survival_f2a
zm_temple_v3pf
zm_towers_v3
zm_tx_highschool_zkedit_v2
zm_unpanicv2_pF
zm_vc2_office_redone_b1
zm_wasteyard_beta3
zm_winterfun_b4a
zm_wtfhax_v6
zm_wtfhax_v6e
zm_wwt_twinsteel_v8
I'd guess that the maplist.txt has non-unix line endings - probably dos - and as result you see what looks like prepending.
The problem is that the chomp() is only consuming one of the two line ending characters, leaving the carriage return behind.
You might find that if you set the Perl special variable $/ (input record separator) before opening the map list, chomp then does the job: it will consume both line-ending characters.
$/ = qq{\r\n};
Another solution would be to convert the line endings in the file before processing, perhaps using dos2unix.
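A third option, sketched here, is to strip all trailing whitespace instead of relying on chomp, which copes with both kinds of line ending in the same file. The hard-coded list stands in for the lines read from maplist.txt and is an assumption for the sake of a runnable example.

```perl
use strict;
use warnings;

my $mapdir = 'zombieescape1/orangebox/cstrike/maps';

# Hypothetical input: one DOS-style line, one Unix-style line, one blank line.
my @lines = ("ze_stargate_escape_v8\r\n", "zm_3dubka_v3\n", "\r\n");

my @mapfiles;
for my $line (@lines) {
    my $map = $line;
    $map =~ s/\s+\z//;          # removes \n, \r\n and trailing spaces
    next if $map eq '';         # skip blank lines
    push @mapfiles, "$mapdir/$map.bsp";
}

print "$_\n" for @mapfiles;
```

With the carriage return gone, the terminal no longer jumps back to the start of the line before printing .bsp, so the paths display as expected.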

Parsing a log file using perl

I have a log file where some of the entries look like this:
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
and I'm trying to get it into a CSV format:
Date,Time,v1,v2,v3,v4,v5
YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC
I'd like to do this in Perl - speaking personally, I could probably do it far quicker in other languages but I'd really like to expand my horizons a bit.
So far I can get as far as reading the file in and picking out only the lines which meet my criteria, but I can't seem to get the next stage done. I'll need to slice up the input line, but so far I just can't work out how to do this. I've looked at s// and m// but they don't really give me what I want. If anyone can advise me how this can be done or give me pointers, I'd much appreciate it.
Important points:
The values in the second part of the line are always in the same order, so mapping / re-organising is not necessarily a problem.
Some of the fields have free text which is not quoted :( but as the labels all start v<number>= I'm hoping parsing this should still be a possibility.
Since there is no one delimiter, you'll need to try this a few different ways:
First, split on ' ', then take the first three values:
my @array = split / /, $line;
my ($date, $time, $constant) = splice @array, 0, 3;
Join the rest of the fields together again, and re-split on v\d+= to get the values:
my $rest = join ' ', @array;
# $rest should now be "v1=XXX v2=YYY ..."
my @values = split /\s*v\d+=/, $rest;
shift @values; # since the first element in @values will be empty
print join ',', $date, $time, @values;
Edit: Here's another approach that may be easier to follow, and is slightly more efficient. This takes advantage of the fact that your constant text occurs between the date/time and the value list.
# assume that CONSTANT is your constant text
my ($datetime, $valuelist) = split /\s*CONSTANT\s*/, $line;
my ($date, $time) = split / /, $datetime;
my @values = split /\s*v\d+=/, $valuelist;
shift @values;
print join(',', $date, $time, @values), "\n"; # keep "\n" outside the join to avoid a trailing comma
What have you tried with regular expressions and how has it failed? A regex with m// works fine for me:
#!/usr/bin/env perl
use strict;
use warnings;
print "Date,Time,v1,v2,v3,v4,v5\n";
while (my $line = <DATA>) {
my @matched = $line =~ m{^([^ ]+) ([^ ]+).*v1=(.*) v2=(.*) v3=(.*) v4=(.*) v5=(.*)};
print join(',', @matched), "\n";
}
__DATA__
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
Two caveats:
1) v1 cannot contain the substring " v2=", v2 cannot contain " v3=", etc., but, with such a loose format, that's something that would likely cause problems for a human attempting to parse it, too.
2) This code assumes that there will always be v1 through v5. If there are fewer than five vN fields, the line will fail to match. If there are more, all additional fields will be appended to v5 (including their vN tags).
If the log is fixed-width, you are better off using unpack; you will see its performance benefits if the log grows very large.
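One way to relax the second caveat is to collect however many vN= values a line has with a global match. This is a sketch rather than a tested production parser: it still assumes that no value contains text that looks like " vN=", and that the date and time are the first two space-separated fields.

```perl
use strict;
use warnings;

# Sample line copied from the question.
my $line = 'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC';

# Date and time are the first two whitespace-separated fields.
my ($date, $time) = split ' ', $line, 3;

# Each value runs from after "vN=" up to the next " vN=" or the end of the line.
my @values = $line =~ /\bv\d+=(.*?)(?=\s+v\d+=|\s*\z)/g;

print "Date,Time,", join(',', map { "v$_" } 1 .. @values), "\n";
print join(',', $date, $time, @values), "\n";
```

The lookahead leaves the next label unconsumed, so the following iteration of the global match can start on it.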

Can I use Perl's unpack to break up a string into vars?

I have an image file name that consists of four parts:
$Directory (the directory where the image exists)
$Name (for an art site, this is the painting's name reference #)
$File (the image's file name minus extension)
$Extension (the image's extension)
$example 100020003000.png
Which I desire to be broken down accordingly:
$dir=1000 $name=2000 $file=3000 $ext=.png
I was wondering if substr was the best option in breaking up the incoming $example so I can do stuff with the 4 variables like validation/error checking, grabbing the verbose name from its $Name assignment or whatever. I found this post:
is unpack faster than substr?
So, in my beginners "stone tool" approach:
my $example = "100020003000.png";
my $dir = substr($example, 0,4);
my $name = substr($example, 4,4);
my $file = substr($example, 8,4);
my $ext = substr($example, 13,3); # will add the "." later #
So, can I use unpack, or maybe even another approach that would be more efficient?
I would also like to avoid loading any modules unless doing so would use less resources for some reason. Mods are great tools I luv'em but, I think not necessary here.
I realize I should probably push the vars into an array/hash but, I am really a beginner here and I would need further instruction on how to do that and how to pull them back out.
Thanks to everyone at stackoverflow.com!
Absolutely:
my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = unpack 'A4' x 4, $example;
print "$dir\t$name\t$file\t$ext\n";
Output:
1000 2000 3000 .png
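The question also mentions keeping the parts in a hash. A hash slice does that in one assignment; the key names below are made up for illustration.

```perl
use strict;
use warnings;

my $example = '100020003000.png';

# Assign the four unpacked fields to four hash keys in one step.
my %part;
@part{qw(dir name file ext)} = unpack 'A4' x 4, $example;

# Pull the values back out by key.
print "$_ = $part{$_}\n" for qw(dir name file ext);
```

Individual parts are then available as $part{dir}, $part{name} and so on, which keeps related values together for validation or lookups.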
I'd just use a regex for that:
my ($dir, $name, $file, $ext) = $path =~ m:(.*)/(.*)/(.*)\.(.*):;
Or, to match your specific example:
my ($dir, $name, $file, $ext) = $example =~ m:^(\d{4})(\d{4})(\d{4})\.(.{3})$:;
Using unpack is good, but since the elements are all the same width, the regex is very simple as well:
my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = $example =~ /(.{4})/g;
It isn't unpack, but since you have groups of 4 characters, you could use a limited split, with a capture:
my ($dir, $name, $file, $ext) = grep length, split /(....)/, $filename, 4;
This is pretty obfuscated, so I probably wouldn't use it, but the capture in a split is an often overlooked ability.
So, here's an explanation of what this code does:
Step 1. split with capturing parentheses adds the values captured by the pattern to its output stream. The stream contains a mix of fields and delimiters.
qw( a 1 b 2 c 3 ) == split /(\d)/, 'a1b2c3';
Step 2. split with 3 args limits how many times the string is split.
qw( a b2c3 ) == split /\d/, 'a1b2c3', 2;
Step 3. Now, when we use a delimiter pattern that matches pretty much anything /(....)/, we get a bunch of empty (0 length) strings. I've marked delimiters with D characters, and fields with F:
( '', 'a', '', '1', '', 'b', '', '2' ) == split /(.)/, 'a1b2';
F D F D F D F D
Step 4. So if we limit the number of fields to 3 we get:
( '', 'a', '', '1', 'b2' ) == split /(.)/, 'a1b2', 3;
F D F D F
Step 5. Putting it all together we can do this (I used a .jpeg extension so that the extension would be longer than 4 characters):
( '', 1000, '', 2000, '', 3000, '.jpeg' ) == split /(....)/, '100020003000.jpeg',4;
F D F D F D F
Step 6. Step 5 is almost perfect, all we need to do is strip out the null strings and we're good:
( 1000, 2000, 3000, '.jpeg' ) == grep length, split /(....)/, '100020003000.jpeg',4;
This code works, and it is interesting. But it's not any more compact than any of the other solutions. I haven't benchmarked it, but I'd be very surprised if it wins any speed or memory efficiency prizes.
But the real issue is that it is too tricky to be good for real code. Using split to capture delimiters (and maybe one final field), while throwing out the field data is just too weird. It's also fragile: if one field changes length the code is broken and has to be rewritten.
So, don't actually do this.
At least it provided an opportunity to explore some lesser known features of split.
Both substr and unpack bias your thinking toward fixed-layout, while regex solutions are more oriented toward flexible layouts with delimiters.
The example you gave appeared to be fixed layout, but directories are usually separated from file names by a delimiter (e.g. slash for POSIX-style file systems, backslash for MS-DOS, etc.). So you might actually have a case for both: a regex solution to split directory and file name apart (or even directory/name/extension), and then a fixed-length approach for the name part by itself.
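A minimal sketch of that combined approach follows. The path is hypothetical (the question only shows the bare file name), and the variable names are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical path: a real directory in front of the fixed-layout file name.
my $path = 'images/100020003000.png';

# Delimiter-based step: directory, base name, and extension.
my ($directory, $basename, $extension) =
    $path =~ m{^(.*)/([^/]+)\.([^./]+)\z}
    or die qq(Unexpected path "$path");

# Fixed-width step: three 4-character fields inside the base name.
my ($dir, $name, $file) = unpack 'A4 A4 A4', $basename;

print "$directory | $dir $name $file | $extension\n";
```

The regex absorbs changes in directory depth or extension length, while unpack stays responsible only for the part that really is fixed-width.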