Question about how to represent data in perl - perl

I'm messing around in perl, and I want to build a script to print hockey points and assists, with the date. It's about 40 rows, and I want to associate the date with the 2 pieces of information. I'm stuck on how to implement this.
Something
Like
10/24 2 goals 1 assist
As well filtering the totals. I can do this in a simple DB and use SQL but that seems like an overblown solution. (Plus I want to learn more perl). I feel I may be overthinking this. Any help is much appreciated.

When you represent structured data in Perl (and many other languages), it's common to use a hashmap. Perl doesn't have structs, so developers typically use hashes with well-known keys (which is also what many object-oriented frameworks in Perl end up doing under the hood). Without knowing anything else about your other goals, I'd start with something like this:
my %data = (
"2022-11-24" => { goals => 2, assists => 1 },
"2022-11-23" => { goals => 3, assists => 0 },
...
);
From this data structure, you can:
# get today's number of assists
my $assists_today = $data{"2022-11-24"}{assists};
# sum up each day's number of goals
my $total_goals = 0;
$total_goals += $data{$_}{goals} foreach keys %data;

Related

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <--Patient ID is 5 digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value=YEARMMDD"/>
more fields for address etc.are following in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my #fields;
sub get_family_name {my($twig,$data)=#_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=#_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that have unique tags (family, city, zip code, etc.) but XML:Twig only returns the middle initial for the tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required
I've chosen to build a hash of information %patient out of each patient element and push it onto an array #patients that contains all the data in the file. If you have only one patient per file then this will need to be changed
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs
use strict;
use warnings 'all';
use XML::Twig;
my #patients;
my $twig = XML::Twig->new(
twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
my ($twig, $pat) = #_;
my %patient;
$patient{id} = $pat>att('extension');
my $name = $pat->first_child('name');yy
$patient{family} = $name->first_child_trimmed_text('family');
$patient{given} = join ' ', $name->children_trimmed_text('given');
$patient{dob} = $pat->first_child('birthTime')->att('value');
push #patients, \%patient;
}

Libreoffice basic - Associative array

I come from PHP/JS/AS3/... this kind languages. Now I'm learning basic for Libreoffice and I'm kind of struggling to find how to get something similar as associative array I use to use with others languages.
What I'm trying to do is to have this kind structure:
2016 => October => afilename.csv
2016 => April => anotherfilename.csv
with the year as main key, then the month and some datas.
More I try to find informations and more I confuse, so if someone could tell me a little bit about how to organise my datas I would be so pleased.
Thanks!
As #Chrono Kitsune said, Python and Java have such features but Basic does not. Here is a Python-UNO example for LibreOffice Writer:
def dict_example():
files_by_year = {
2016 : {'October' : 'afilename.csv',
'November' : 'bfilename.csv'},
2017 : {'April' : 'anotherfilename.csv'},
}
doc = XSCRIPTCONTEXT.getDocument()
oVC = doc.getCurrentController().getViewCursor()
for year in files_by_year:
for month in files_by_year[year]:
filename = files_by_year[year][month]
oVC.getText().insertString(
oVC, "%s %d: %s\n" % (month, year, filename), False)
g_exportedScripts = dict_example,
Create a file with the code above using a text editor such as Notepad or GEdit. Then place it here.
To run it, open Writer and go to Tools -> Macros -> Run Macro, and find the file under My Macros.
I'm not familiar with LibreOffice (or OpenOffice.org) BASIC or VBA, but I haven't found anything in the documentation for any sort of associative array, hash, or whatever else someone calls it.
However, many modern BASIC dialects allow you to define your own type as a series of fields. Then it's just a matter of using something like
Dim SomeArray({count|range}) As New MyType
I think that's as close as you'll get without leveraging outside libraries. Maybe the Python-UNO bridge would help since Python has such a feature (dictionaries), but I wouldn't know for certain. I also don't know how it would impact performance. You might prefer Java instead of Python for interfacing with UNO, and that's okay too: there's the java.util.HashMap type. Sorry I can't help more, but the important thing to remember is that any BASIC code in tends to live up to the meaning of the word "basic" in English without external assistance.
This question was asked a long ago, but answers are only half correct.
It is true that LibreOffice Basic does not have a native associative array type. But the LibreOffice API provides services. The com.sun.star.container.EnumerableMap service will meet your needs.
Here's an example:
' Create the map
map = com.sun.star.container.EnumerableMap.create("long", "any")
' The values
values = Array( _
Array(2016, "October", "afilename.csv"), _
Array(2016, "April", "anotherfilename.csv") _
)
' Fill the map
For i=LBound(values) to UBound(values)
value = values(i)
theYear = value(0)
theMonth = value(1)
theFile = value(2)
If map.containsKey(theYear) Then
map2 = map.get(theYear)
Else
map2 = com.sun.star.container.EnumerableMap.create("string","string")
map.put(theYear, map2)
End If
map2.put(theMonth, theFile)
Next
' Access to an element
map.get(2016).get("April") ' anotherfilename.csv
As you see, the methods are similar to what you can find in more usual languages.
Beware: if you experience IllegalTypeException you might have to use CreateUNOValue("<type>", ...) to cast a value into the declared type. (See this very old issue https://bz.apache.org/ooo/show_bug.cgi?id=121192 for an example.)

Feasibility of extracting arbitrary locations from a given string? [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I have many spreadsheets with travel information on them amongst other things.
I need to extract start and end locations where the row describes travel, and one or two more things from the row, but what those extra fields are shouldn't be important.
There is no known list of all locations and no fixed pattern of text, all that I can look for is location names.
The field I'm searching in has 0-2 locations, sometimes locations have aliases.
The Problem
If we have this:
00229 | 445 | RTF | Jan | trn_rtn_co | Chicago to Base1
00228 | 445 | RTF | Jan | train | Metroline to home coming from Base1
00228 | 445 | RTF | Jan | train_s | Standard train journey to Friends
I, for instance (though it will vary), will want this:
RTF|Jan|Chicago |Base1
RTF|Jan|Home |Base1
RTF|Jan|NULL |Friends
And then to go though, look up what Base1 and Friends mean for that person (whose unique ID is RTF) and replace them with sensible locations (assuming they only have one set of 'friends'):
RTF|Jan|Chicago |Rockford
RTF|Jan|Home |Rockword
RTF|Jan|NULL |Milwaukee
What I need
I need a way to pick out key words from the final column, such as: Metroline to home coming from Base1.
There are three types of words I'm looking for:
Home LocationsThese are known and limited, I can get these from a list
Home AliasesThese are known and limited, I can get these from a list
Away LocationsThese are unknown but cities/towns/etc in the UK I don't know how to recognize these in the string. This is my main problem
My Ideas
My go to program I thought of was awk, but I don't know if I can reliably search to find where a proper noun (i.e. location) is used for the location names.
Is there a package, library or dictionary of standard locations?
Can I get a program to scour the spreadsheets and 'learn' the names of locations?
This seems like a problem that would have been solved already (i.e. find words in a string of text), but I'm not certain what I'm doing, and I'm only a novice programmer.
Any help on what I can do would be appreciated.
Edit:
Any answer such as "US_Locations_Cities is something you could check against", "Check for strings mentioned in a file in awk using ...", "There is a library for language X that will let a program learn to recognise location names, it's not RegEx, but it might work", or "There is a dictionary of location names here" would be fine.
Ultimately anything that helps me do what I want to do (i.e get the location names!) would be excellent.
Sorry to tell you, but i think this is not 100% programmable.
The best bet would be to define some standard searches:
Chicago to Base1
[WORD] to [WORD]:
where "to" is fixed and you look for exactly one word before and after. the word before then is your source and word after your target
Metroline to home coming from Base1
[WORD] to [WORD] coming from [WORD]:
where "to" and "coming from" is fixed and you look for three words in the appropriate slots.
etc
if you can match a source and target -> ok
if you cannot match something then throw an error for that line and let the user decide or even better implement an appropiate correction and let the program automatically reevaluate that line.
these are non-trivial goals.
consider:
Cities out of us of a
Non english text entries
Abbreviations
for automatic error corrections try to match the found [WORD]'s with a list of us or other cities.
if the city is not found throw an error. if you find that error either include that not found city to your city list or translate a city name in a publicly known (official) name.
The best I can suggest is that, as long as your locations are all US cities, you can use a database of zip codes such as this one.
I don't know how you expect any program to pick up things like Friends or Base1
I have to agree with hacktick that as it stands now, it is not programmable. It seems that the only solution is to invent a language or protocol.
I think an easy implementation follows:
In this language you have two keywords: to and from (you could also possibly allocate at as a keyword synoym for from as well).
These keywords define a portion of string that follows as a "scan area" for
recognizing names
I'm only planning on implementing the simplest scan, but as indicated at the end of the post allows you to do your fallback.
In the implementation you have a "Preferred Name" hash, where you define the names that you want displayed for things that appear there.
{ Base1 => 'Rockford'
, Friends => 'Milwaukee'
, ...
}
You could split your sentences by chunks of text between the keywords, using the following rules:
A. First chunk, if not a keyword is taken as the value of 'from'.
A. On this or any subsequent chunk, if keyword then save the next chunk
after that for that value.
A. Each value is "scanned" for a preferred phrase before being stored
as the value.
my #chunks
= grep {; defined and ( s/^\s+//, s/\s+$//, length ) }
split /\b(from|to)\s+/i, $note
;
my %parts = ( to => '', from => '' );
my $key;
do {
last unless my $chunk = shift #chunks;
if ( $key ) {
$parts{ $key } = $preferred_title{ $chunk } // $chunk;
$key = '';
}
elsif ( exists $parts{ lc $chunk } ) {
$key = lc $chunk;
}
elsif ( !$parts{from} ) {
$parts{from} = $preferred_title{ $chunk } // $chunk;
}
} while ( #chunks );
say join( '|', $note, #parts{ qw<from to> } );
At the very least, collecting these values and printing them out can give you a sieve to decide on further courses of action. This will tell you that 'home coming' is perceived as a 'from' statement, as well as 'Standard train journey'.
You *could fix the 'home coming' by amending the regex thusly:
/\b(?:(?:coming )?(from)|(to))\s+/i
And we could add the following key-value pair to our preferred_title hash:
home => 'Home'
We could simply define 'Standard train journey' => '', or we could create a list of rejection patterns, where we reject a string as a meaningful value if they fit a pattern.
But they allow you to dump out a list of values and refine your scan of data. Another idea is that as it seems that your pretty consistent with your use of capitals (except for 'home') for places. So we could increase our odds of finding the right string by matching the chunk with
/\b(home|\p{Upper}.*)/
Note that this still considers 'Standard train journey' a proper location. So this would still need to be handled by rejection rules.
Here I reiterate that this can be a minimal approach to scanning the data to the point that you can make sense of what it this system takes to be locations and "80/20" it down: that is, hopefully those rules handle 80 percent of the cases, and you can tune the algorithm to handle 80 percent of the remaining 20, and iterate to the point that you simply have to change a handful of entries at worst.
Then, you have a specification that you would need to follow in creating travel notes from then on. You could even scan the notes as they were entered and alert something like
'No destination found in note!'.

How to do a range query

I have a bunch of numbers timestamps that I want to check against a range to see if they match a particular range of dates. Basically like a BETWEEN .. AND .. match in SQL. The obvious data structure would be a B-tree, but while there are a number of B-tree implementations on CPAN, they only seem to implement exact matching. Berkeley DB has the same problem; there are B-tree indices, but no range matching.
What would be the simplest way to do this? I don't want to use an SQL database unless I have to.
Clarification: I have a lot of these, so I'm looking for an efficient method, not just grep over an array.
grep will be fast, even on a million of them.
# Get everything between 500 and 10,000:
my #items = 1..1_000_000;
my $min = 500;
my $max = 10_000;
my #matches = grep {
$_ <= $max && $_ >= $min
} #items;
Run under time I get this:
time perl million.pl
real 0m0.316s
user 0m0.210s
sys 0m0.070s
Timestamps are numbers. why not common numerical comparaison operators like > and < ?
If you have many of timestamps the problem is not different if you just want to filter your set once. It's O(n) and every other method will be longer.
On the other hand, with a huge set from which you want to extract many different ranges, it could be more efficient to first sort the items. Call the number of search m, the complexity of direct filtering will be O(m.n). With sort followed by search it could be O(n.log(n) + m.log(n)) which is usually much better.
Any O(n.log(n)) sort method will do, including using the built-in sort operator (or b-tree like you suggested). The major difference between efficient sorting methods is if your memory can hold your full set or not. I there is a memory bootleneck to keep both datas and keys (timestamps) in memory you can keep only the timestamp and some index to data in memory and the real data elsewhere (disk file, database). But if your data set is really so big the most efficient solution would probably be to put the whole thing in a database with and index on timestamp (tie to database is real easy using perl).
Then you will have your range. You just use a dicotomic search to look for index of the first element included in range and of the last, complexity will be O(log(n)) (if you do a linear search the whole purpose of sorting will be defeated).
Below example of using sort and binary_search on an array of timestamps, extending use to some data structure with timestamp and content is left as an exercice.
use Search::Binary;
my #array = sort ((1, 2, 1, 1, 2, 3, 2, 2, 8, 3, 8, 3) x 100000);
my $nbelt = #array;
sub cmpfn
{
my ($h, $v, $i) = #_;
$i = $lasti + 1 unless $i;
$record = #array[$i||$lasti + 1];
$lasti = $i;
return ($v<=>$record, $i);
}
for (1..1){
$pos = binary_search(1, $nbelt, 2, \&cmpfn);
}
print "found at $pos\n";
I haven't used it. But found this on searching CPAN. This may provide what you want. You can use Tree::Binary for constructing your data and subclass Tree::Binary::Visitor::Base to do your range queries.
Other easy way is to use SQLite.

Creating a sort of "composable" parser for log files

I've started a little pet project to parse log files for Team Fortress 2. The log files have an event on each line, such as the following:
L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")
Notice there are some common parts of the syntax for log files. Names, for example consist of four parts: the name, an ID, a Steam ID, and the team of the player at the time. Rather than rewriting this type of regular expression, I was hoping to abstract this out slightly.
For example:
my $name = qr/(.*)<(\d+)><(.*)><(Red|Blue)>/
my $kill = qr/"$name" killed "$name"/;
This works nicely, but the regular expression now returns results that depend on the format of $name (breaking the abstraction I'm trying to achieve). The example above would match as:
my ($name_1, $id_1, $steam_1, $team_1, $name_2, $id_2, $steam_2, $team_2)
But I'm really looking for something like:
my ($player1, $player2)
Where $player1 and $player2 would be tuples of the previous data. I figure the "killed" event doesn't need to know exactly about the player, as long as it has information to create the player, which is what these tuples provide.
Sorry if this is a bit of a ramble, but hopefully you can provide some advice!
I think I understand what you are asking. What you need to do is reverse your logic. First you need to regex to split the string into two parts, then you extract your tuples. Then your regex doesn't need to know about the name, and you just have two generic player parsing regexs. Here is an short example:
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $log = 'L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><
Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")';
my ($player1_string, $player2_string) = $log =~ m/(".*") killed (".*?")/;
my #player1 = $player1_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my #player2 = $player2_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
print STDERR Dumper(\#player1, \#player2);
Hope this what you were looking for.
Another way to do it, but the same strategy as dwp's answer:
my #players =
map { [ /(.*)<(\d+)><(.*)><(Red|Blue)>/ ] }
$log_text =~ /"([^\"]+)" killed "([^\"]+)"/
;
Your log data contains several items of balanced text (quoted and parenthesized), so you might consider Text::Balanced for parts of this job, or perhaps a parsing approach rather than a direct attack with regex. The latter might be fragile if the player names can contain arbitrary input, for example.
Consider writing a Regexp::Log subclass.