Creating a sort of "composable" parser for log files

Creating a sort of "composable" parser for log files - perl

I've started a little pet project to parse log files for Team Fortress 2. The log files have an event on each line, such as the following:
L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")
Notice there are some common parts of the syntax for log files. Names, for example consist of four parts: the name, an ID, a Steam ID, and the team of the player at the time. Rather than rewriting this type of regular expression, I was hoping to abstract this out slightly.
For example:
my $name = qr/(.*)<(\d+)><(.*)><(Red|Blue)>/
my $kill = qr/"$name" killed "$name"/;
This works nicely, but the regular expression now returns results that depend on the format of $name (breaking the abstraction I'm trying to achieve). The example above would match as:
my ($name_1, $id_1, $steam_1, $team_1, $name_2, $id_2, $steam_2, $team_2)
But I'm really looking for something like:
my ($player1, $player2)
Where $player1 and $player2 would be tuples of the previous data. I figure the "killed" event doesn't need to know exactly about the player, as long as it has information to create the player, which is what these tuples provide.
Sorry if this is a bit of a ramble, but hopefully you can provide some advice!

I think I understand what you are asking. What you need to do is reverse your logic. First you need to regex to split the string into two parts, then you extract your tuples. Then your regex doesn't need to know about the name, and you just have two generic player parsing regexs. Here is an short example:
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $log = 'L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><
Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")';
my ($player1_string, $player2_string) = $log =~ m/(".*") killed (".*?")/;
my #player1 = $player1_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my #player2 = $player2_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
print STDERR Dumper(\#player1, \#player2);
Hope this what you were looking for.

Another way to do it, but the same strategy as dwp's answer:
my #players =
map { [ /(.*)<(\d+)><(.*)><(Red|Blue)>/ ] }
$log_text =~ /"([^\"]+)" killed "([^\"]+)"/
;
Your log data contains several items of balanced text (quoted and parenthesized), so you might consider Text::Balanced for parts of this job, or perhaps a parsing approach rather than a direct attack with regex. The latter might be fragile if the player names can contain arbitrary input, for example.

Consider writing a Regexp::Log subclass.

Related

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <--Patient ID is 5 digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value=YEARMMDD"/>
more fields for address etc.are following in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my #fields;
sub get_family_name {my($twig,$data)=#_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=#_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that have unique tags (family, city, zip code, etc.) but XML:Twig only returns the middle initial for the tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!

It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required
I've chosen to build a hash of information %patient out of each patient element and push it onto an array #patients that contains all the data in the file. If you have only one patient per file then this will need to be changed
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs
use strict;
use warnings 'all';
use XML::Twig;
my #patients;
my $twig = XML::Twig->new(
twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
my ($twig, $pat) = #_;
my %patient;
$patient{id} = $pat>att('extension');
my $name = $pat->first_child('name');yy
$patient{family} = $name->first_child_trimmed_text('family');
$patient{given} = join ' ', $name->children_trimmed_text('given');
$patient{dob} = $pat->first_child('birthTime')->att('value');
push #patients, \%patient;
}

Perl XML::SAX - character() method error

I'm new to using Perl XML::SAX and I encountered a problem with the characters event that is triggered. I'm trying to parse a very large XML file using perl.
My goal is to get the content of each tag (I do not know the tag names - given any xml file, I should be able to crack the record pattern and return every record with its data and tag like Tag:Data).
While working with small files, everything is ok. But when running on a large file, the characters{} event does partial reading of the content. There is no specific pattern in the way it cuts down the reading. Sometimes its the starting few characters of data and sometimes its last few characters and sometimes its just one letter from the actual data.
The Sax Parser is:
$myhandler = MyFilter->new();
$parser = XML::SAX::ParserFactory->parser(Handler => $myhandler);
$parser->parse_file($filename);
And, I have written my own Handler called MyFilter and overridding the character method of the parser.
sub characters {
my ($self, $element) = #_;
$globalvar = $element->{Data};
print "content is: $globalvar \n";
}
Even this print statement, reads the values partially at times.
I also tried loading the Parsesr Package before calling the $parser->parse() as:
$XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
Stil doesn't work. Could anyone help me out here? Thanks in advance!

Sounds like you need XML::Filter::BufferText.
http://search.cpan.org/dist/XML-Filter-BufferText/BufferText.pm
From the description "One common cause of grief (and programmer error) is that XML parsers aren't required to provide character events in one chunk. They can, but are not forced to, and most don't. This filter does the trivial but oft-repeated task of putting all characters into a single event."
It's very easy to use once you have it installed and will solve your partial character data problem.

Feasibility of extracting arbitrary locations from a given string? [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I have many spreadsheets with travel information on them amongst other things.
I need to extract start and end locations where the row describes travel, and one or two more things from the row, but what those extra fields are shouldn't be important.
There is no known list of all locations and no fixed pattern of text, all that I can look for is location names.
The field I'm searching in has 0-2 locations, sometimes locations have aliases.
The Problem
If we have this:
00229 | 445 | RTF | Jan | trn_rtn_co | Chicago to Base1
00228 | 445 | RTF | Jan | train | Metroline to home coming from Base1
00228 | 445 | RTF | Jan | train_s | Standard train journey to Friends
I, for instance (though it will vary), will want this:
RTF|Jan|Chicago |Base1
RTF|Jan|Home |Base1
RTF|Jan|NULL |Friends
And then to go though, look up what Base1 and Friends mean for that person (whose unique ID is RTF) and replace them with sensible locations (assuming they only have one set of 'friends'):
RTF|Jan|Chicago |Rockford
RTF|Jan|Home |Rockword
RTF|Jan|NULL |Milwaukee
What I need
I need a way to pick out key words from the final column, such as: Metroline to home coming from Base1.
There are three types of words I'm looking for:
Home LocationsThese are known and limited, I can get these from a list
Home AliasesThese are known and limited, I can get these from a list
Away LocationsThese are unknown but cities/towns/etc in the UK I don't know how to recognize these in the string. This is my main problem
My Ideas
My go to program I thought of was awk, but I don't know if I can reliably search to find where a proper noun (i.e. location) is used for the location names.
Is there a package, library or dictionary of standard locations?
Can I get a program to scour the spreadsheets and 'learn' the names of locations?
This seems like a problem that would have been solved already (i.e. find words in a string of text), but I'm not certain what I'm doing, and I'm only a novice programmer.
Any help on what I can do would be appreciated.
Edit:
Any answer such as "US_Locations_Cities is something you could check against", "Check for strings mentioned in a file in awk using ...", "There is a library for language X that will let a program learn to recognise location names, it's not RegEx, but it might work", or "There is a dictionary of location names here" would be fine.
Ultimately anything that helps me do what I want to do (i.e get the location names!) would be excellent.

Sorry to tell you, but i think this is not 100% programmable.
The best bet would be to define some standard searches:
Chicago to Base1
[WORD] to [WORD]:
where "to" is fixed and you look for exactly one word before and after. the word before then is your source and word after your target
Metroline to home coming from Base1
[WORD] to [WORD] coming from [WORD]:
where "to" and "coming from" is fixed and you look for three words in the appropriate slots.
etc
if you can match a source and target -> ok
if you cannot match something then throw an error for that line and let the user decide or even better implement an appropiate correction and let the program automatically reevaluate that line.
these are non-trivial goals.
consider:
Cities out of us of a
Non english text entries
Abbreviations
for automatic error corrections try to match the found [WORD]'s with a list of us or other cities.
if the city is not found throw an error. if you find that error either include that not found city to your city list or translate a city name in a publicly known (official) name.

The best I can suggest is that, as long as your locations are all US cities, you can use a database of zip codes such as this one.
I don't know how you expect any program to pick up things like Friends or Base1

I have to agree with hacktick that as it stands now, it is not programmable. It seems that the only solution is to invent a language or protocol.
I think an easy implementation follows:
In this language you have two keywords: to and from (you could also possibly allocate at as a keyword synoym for from as well).
These keywords define a portion of string that follows as a "scan area" for
recognizing names
I'm only planning on implementing the simplest scan, but as indicated at the end of the post allows you to do your fallback.
In the implementation you have a "Preferred Name" hash, where you define the names that you want displayed for things that appear there.
{ Base1 => 'Rockford'
, Friends => 'Milwaukee'
, ...
}
You could split your sentences by chunks of text between the keywords, using the following rules:
A. First chunk, if not a keyword is taken as the value of 'from'.
A. On this or any subsequent chunk, if keyword then save the next chunk
after that for that value.
A. Each value is "scanned" for a preferred phrase before being stored
as the value.
my #chunks
= grep {; defined and ( s/^\s+//, s/\s+$//, length ) }
split /\b(from|to)\s+/i, $note
;
my %parts = ( to => '', from => '' );
my $key;
do {
last unless my $chunk = shift #chunks;
if ( $key ) {
$parts{ $key } = $preferred_title{ $chunk } // $chunk;
$key = '';
}
elsif ( exists $parts{ lc $chunk } ) {
$key = lc $chunk;
}
elsif ( !$parts{from} ) {
$parts{from} = $preferred_title{ $chunk } // $chunk;
}
} while ( #chunks );
say join( '|', $note, #parts{ qw<from to> } );
At the very least, collecting these values and printing them out can give you a sieve to decide on further courses of action. This will tell you that 'home coming' is perceived as a 'from' statement, as well as 'Standard train journey'.
You *could fix the 'home coming' by amending the regex thusly:
/\b(?:(?:coming )?(from)|(to))\s+/i
And we could add the following key-value pair to our preferred_title hash:
home => 'Home'
We could simply define 'Standard train journey' => '', or we could create a list of rejection patterns, where we reject a string as a meaningful value if they fit a pattern.
But they allow you to dump out a list of values and refine your scan of data. Another idea is that as it seems that your pretty consistent with your use of capitals (except for 'home') for places. So we could increase our odds of finding the right string by matching the chunk with
/\b(home|\p{Upper}.*)/
Note that this still considers 'Standard train journey' a proper location. So this would still need to be handled by rejection rules.
Here I reiterate that this can be a minimal approach to scanning the data to the point that you can make sense of what it this system takes to be locations and "80/20" it down: that is, hopefully those rules handle 80 percent of the cases, and you can tune the algorithm to handle 80 percent of the remaining 20, and iterate to the point that you simply have to change a handful of entries at worst.
Then, you have a specification that you would need to follow in creating travel notes from then on. You could even scan the notes as they were entered and alert something like
'No destination found in note!'.

HOP::Lexer with overlapping tokens

I'm using HOP::Lexer to scan BlitzMax module source code to fetch some data from it. One particular piece of data I'm currently interested in is a module description.
Currently I'm searching for a description in the format of ModuleInfo "Description: foobar" or ModuleInfo "Desc: foobar". This works fine. But sadly, most modules I scan have their description defined elsewhere, inside a comment block. Which is actually the common way to do it in BlitzMax, as the documentation generator expects it.
This is how all modules have their description defined in the main source file.
Rem
bbdoc: my module description
End Rem
Module namespace.modulename
This also isn't really a problem. But the line after the End Rem also contains data I want (the module name). This is a problem, since now 2 definitions of tokens overlap each other and after the first one has been detected it will continue from where it left off (position of content that's being scanned). Meaning that the token for the module name won't detect anything.
Yes, I've made sure my order of tokens is correct. It just doesn't seem possible (somewhat understandable) to move the cursor back a line.
A small piece of code for fetching the description from within a Rem-End Rem block which is above a module definition (not worked out, but working for the current test case):
[ 'MODULEDESCRIPTION',
qr/[ \t]*\bRem\n(?:\n|.)*?\s*\bEnd[ \t]*Rem\nModule[\s\t]+/i,
sub {
my ($label, $value) = #_;
$value =~ /bbdoc: (.+)/;
[$label, $1];
}
],
So in my test case I first scan for a single comment, then the block above (MODULEDESCRIPTION), then a block comment (Rem-End Rem), module name, etc.
Currently the only solution I can think of is setup a second lexer only for the module description, though I wouldn't prefer that. Is what I want even possible at all with HOP::Lexer?
Source of my Lexer can be found at https://github.com/maximos/maximus-web/blob/develop/lib/Maximus/Class/Lexer.pm

I've solved it by adding (a slightly modified version of) the MODULEDESCRIPTION. Inside the subroutine I simply filter out the module name and return an arrayref with 4 elements, which I later on iterate over to create a nice usable array with tokens and their values.
Solution is again at https://github.com/maximos/maximus-web/blob/develop/lib/Maximus/Class/Lexer.pm
Edit: Or let me just paste the piece of code here
[ 'MODULEDESCRIPTION',
qr/[ \t]*\bRem\R(?:\R|.)*?\bEnd[ \t]*Rem\R\bModule[\s\t]\w+\.\w+/i,
sub {
my ($label, $value) = #_;
my ($desc) = ($value =~ /\bbbdoc: (.+)/i);
my ($name) = ($value =~ /\bModule (\w+\.\w+)/i);
[$label, $desc, 'MODULENAME', $name];
}
],

How to skip 'die' in perl

I am trying to extract data from website using perl API. The process is to use a list of uris as input. Then I extract related information for each uri from website. If the information for one uri is not present it dies. Some thing like the code below
my #tags = $c->posts_for(uri =>"$currentURI");
die "No candidate related articles\n" unless #tags;
Now, I don't want the program to stop if it doesn't get any tags. I want the program to skip that particular uri and go to the next available uri. How can i do it?
Thank you for your time and help.
Thank you,
Sammed

Well, assuming that you're inside a loop processing each of the URIs in turn, you should be able to do something like:
next unless #tags;
For example, the following program only prints lines that are numeric:
while (<STDIN>) {
next unless /^\d+$/;
print;
}
The loop processes every input line in turn but, when one is found that doesn't match that regular expression (all numeric), it restarts the loop (for the next input line) without printing.
The same method is used in that first code block above to restart the loop if there are no tags, moving to the next URI.

Besides the traditional flow control tools, i.e. next/last in a loop or return in a sub, one can use exceptions in perl:
eval {
die "Bad bad thing";
};
if ($#) {
# do something about it
};
Or just use Try::Tiny.
However, from the description of the task it seems next is enough (so I voted for #paxdiablo's answer).

The question is rather strange, but as near as I can tell, you are asking how to control the flow of your current loop. Of course, using die will cause your program to exit, so if you do not want that, you should not use die. Seems elementary to me, that's why it is a strange questions.
So, I assume you have a loop such as:
for my $currentURI (#uris) {
my #tags = $c->posts_for(uri =>"$currentURI");
die "No candidate related articles\n" unless #tags;
# do stuff with #tags here....
}
And if #tags is empty, you want to go to the next URI. Well, that's a simple thing to solve. There are many ways.
next unless #tags;
for my $tag (#tags) { ... stuff ... }
if (#tags) { .... }
Next is the simplest one. It skips to the end of the loop block and starts with the next iteration. However, using a for or if block causes the same behaviour, and so are equivalent. For example:
for my $currentURI (#uris) {
my #tags = $c->posts_for(uri =>"$currentURI");
for my $tag (#tags) {
do_something($tag);
}
}
Or even:
for my $currentURI (#uris) {
for my $tag ($c->posts_for(uri =>"$currentURI")) {
do_something($tag);
}
}
In this last example, we removed #tags all together, because it is not needed. The inner loop will run zero times if there are no "tags".
This is not really complex stuff, and if you feel unsure, I suggest you play around a little with loops and conditionals to learn how they work.