Feasibility of extracting arbitrary locations from a given string? [closed] - perl

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I have many spreadsheets with travel information on them amongst other things.
I need to extract start and end locations where the row describes travel, and one or two more things from the row, but what those extra fields are shouldn't be important.
There is no known list of all locations and no fixed pattern of text, all that I can look for is location names.
The field I'm searching in has 0-2 locations, sometimes locations have aliases.
The Problem
If we have this:
00229 | 445 | RTF | Jan | trn_rtn_co | Chicago to Base1
00228 | 445 | RTF | Jan | train | Metroline to home coming from Base1
00228 | 445 | RTF | Jan | train_s | Standard train journey to Friends
I, for instance (though it will vary), will want this:
RTF|Jan|Chicago |Base1
RTF|Jan|Home |Base1
RTF|Jan|NULL |Friends
And then to go though, look up what Base1 and Friends mean for that person (whose unique ID is RTF) and replace them with sensible locations (assuming they only have one set of 'friends'):
RTF|Jan|Chicago |Rockford
RTF|Jan|Home |Rockword
RTF|Jan|NULL |Milwaukee
What I need
I need a way to pick out key words from the final column, such as: Metroline to home coming from Base1.
There are three types of words I'm looking for:
Home LocationsThese are known and limited, I can get these from a list
Home AliasesThese are known and limited, I can get these from a list
Away LocationsThese are unknown but cities/towns/etc in the UK I don't know how to recognize these in the string. This is my main problem
My Ideas
My go to program I thought of was awk, but I don't know if I can reliably search to find where a proper noun (i.e. location) is used for the location names.
Is there a package, library or dictionary of standard locations?
Can I get a program to scour the spreadsheets and 'learn' the names of locations?
This seems like a problem that would have been solved already (i.e. find words in a string of text), but I'm not certain what I'm doing, and I'm only a novice programmer.
Any help on what I can do would be appreciated.
Edit:
Any answer such as "US_Locations_Cities is something you could check against", "Check for strings mentioned in a file in awk using ...", "There is a library for language X that will let a program learn to recognise location names, it's not RegEx, but it might work", or "There is a dictionary of location names here" would be fine.
Ultimately anything that helps me do what I want to do (i.e get the location names!) would be excellent.

Sorry to tell you, but i think this is not 100% programmable.
The best bet would be to define some standard searches:
Chicago to Base1
[WORD] to [WORD]:
where "to" is fixed and you look for exactly one word before and after. the word before then is your source and word after your target
Metroline to home coming from Base1
[WORD] to [WORD] coming from [WORD]:
where "to" and "coming from" is fixed and you look for three words in the appropriate slots.
etc
if you can match a source and target -> ok
if you cannot match something then throw an error for that line and let the user decide or even better implement an appropiate correction and let the program automatically reevaluate that line.
these are non-trivial goals.
consider:
Cities out of us of a
Non english text entries
Abbreviations
for automatic error corrections try to match the found [WORD]'s with a list of us or other cities.
if the city is not found throw an error. if you find that error either include that not found city to your city list or translate a city name in a publicly known (official) name.

The best I can suggest is that, as long as your locations are all US cities, you can use a database of zip codes such as this one.
I don't know how you expect any program to pick up things like Friends or Base1

I have to agree with hacktick that as it stands now, it is not programmable. It seems that the only solution is to invent a language or protocol.
I think an easy implementation follows:
In this language you have two keywords: to and from (you could also possibly allocate at as a keyword synoym for from as well).
These keywords define a portion of string that follows as a "scan area" for
recognizing names
I'm only planning on implementing the simplest scan, but as indicated at the end of the post allows you to do your fallback.
In the implementation you have a "Preferred Name" hash, where you define the names that you want displayed for things that appear there.
{ Base1 => 'Rockford'
, Friends => 'Milwaukee'
, ...
}
You could split your sentences by chunks of text between the keywords, using the following rules:
A. First chunk, if not a keyword is taken as the value of 'from'.
A. On this or any subsequent chunk, if keyword then save the next chunk
after that for that value.
A. Each value is "scanned" for a preferred phrase before being stored
as the value.
my #chunks
= grep {; defined and ( s/^\s+//, s/\s+$//, length ) }
split /\b(from|to)\s+/i, $note
;
my %parts = ( to => '', from => '' );
my $key;
do {
last unless my $chunk = shift #chunks;
if ( $key ) {
$parts{ $key } = $preferred_title{ $chunk } // $chunk;
$key = '';
}
elsif ( exists $parts{ lc $chunk } ) {
$key = lc $chunk;
}
elsif ( !$parts{from} ) {
$parts{from} = $preferred_title{ $chunk } // $chunk;
}
} while ( #chunks );
say join( '|', $note, #parts{ qw<from to> } );
At the very least, collecting these values and printing them out can give you a sieve to decide on further courses of action. This will tell you that 'home coming' is perceived as a 'from' statement, as well as 'Standard train journey'.
You *could fix the 'home coming' by amending the regex thusly:
/\b(?:(?:coming )?(from)|(to))\s+/i
And we could add the following key-value pair to our preferred_title hash:
home => 'Home'
We could simply define 'Standard train journey' => '', or we could create a list of rejection patterns, where we reject a string as a meaningful value if they fit a pattern.
But they allow you to dump out a list of values and refine your scan of data. Another idea is that as it seems that your pretty consistent with your use of capitals (except for 'home') for places. So we could increase our odds of finding the right string by matching the chunk with
/\b(home|\p{Upper}.*)/
Note that this still considers 'Standard train journey' a proper location. So this would still need to be handled by rejection rules.
Here I reiterate that this can be a minimal approach to scanning the data to the point that you can make sense of what it this system takes to be locations and "80/20" it down: that is, hopefully those rules handle 80 percent of the cases, and you can tune the algorithm to handle 80 percent of the remaining 20, and iterate to the point that you simply have to change a handful of entries at worst.
Then, you have a specification that you would need to follow in creating travel notes from then on. You could even scan the notes as they were entered and alert something like
'No destination found in note!'.

Related

how do duckduckgo spice IA secondary API calls get their parameters?

I have been looking through the spice instant answer source code. Yes, I know it is in maintenance mode, but I am still curious.
The documentation makes it fairly clear that the primary spice to API gets its numerical parameters $1, $2, etc. from the handle function.
My question: should there be secondary API calls included with spice alt_to as, say, in the movie spice IA, where do the numerical parameters to that API call come from?
Note, for instance, the $1 in both the movie_image and cast_image secondary API calls in spice alt_to at the preceding link. I am asking which regex capture returns those instances of $1.
I believe I see how this works now. The flow of information is still a bit murky to me, but at least I see how all of the requisite information is there.
I'll take the cryptocurrency instant answer as an example. The alt_to element in the perl package file at that link has a key named cryptonator. The corresponding .js file constructs a matching endpoint:
var endpoint = "/js/spice/cryptonator/" + from + "/" + to;
Note the general shape of the "remainder" past /js/spice/cryptonator: from/to, where from and to will be two strings.
Back in the perl package the hash alt_to->{cryptonator} has a key from which receives, I think, this remainder from/to. The value corresponding to that key is a regex meant to split up that string into its two constituents:
from => '([^/]+)/([^/]*)'
Applied to from/to, that regex will return $1=from and $2=to. These, then, are the $1 and $2 that go into
to => 'https://api.cryptonator.com/api/full/$1-$2'
in alt_to.
In short:
The to field of alt_to->{blah} receives its numerical parameters by having the from regex operate on the remainder past /js/spice/blah/ of the name of the corresponding endpoint constructed in the relevant .js file.

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <--Patient ID is 5 digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value=YEARMMDD"/>
more fields for address etc.are following in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my #fields;
sub get_family_name {my($twig,$data)=#_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=#_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that have unique tags (family, city, zip code, etc.) but XML:Twig only returns the middle initial for the tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required
I've chosen to build a hash of information %patient out of each patient element and push it onto an array #patients that contains all the data in the file. If you have only one patient per file then this will need to be changed
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs
use strict;
use warnings 'all';
use XML::Twig;
my #patients;
my $twig = XML::Twig->new(
twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
my ($twig, $pat) = #_;
my %patient;
$patient{id} = $pat>att('extension');
my $name = $pat->first_child('name');yy
$patient{family} = $name->first_child_trimmed_text('family');
$patient{given} = join ' ', $name->children_trimmed_text('given');
$patient{dob} = $pat->first_child('birthTime')->att('value');
push #patients, \%patient;
}

Autocomplete with Firebase

How does one use Firebase to do basic auto-completion/text preview?
For example, imagine a blog backed by Firebase where the blogger can tag posts with tags. As the blogger is tagging a new post, it would be helpful if they could see all currently-existing tags that matched the first few keystrokes they've entered. So if "blog," "black," "blazing saddles," and "bulldogs" were tags, if the user types "bl" they get the first three but not "bulldogs."
My initial thought was that we could set the tag with the priority of the tag, and use startAt, such that our query would look something like:
fb.child('tags').startAt('bl').limitToFirst(5).once('value', function(snap) {
console.log(snap.val())
});
But this would also return "bulldog" as one of the results (not the end of the world, but not the best either). Using startAt('bl').endAt('bl') returns no results. Is there another way to accomplish this?
(I know that one option is that this is something we could use a search server, like ElasticSearch, for -- see https://www.firebase.com/blog/2014-01-02-queries-part-two.html -- but I'd love to keep as much in Firebase as possible.)
Edit
As Kato suggested, here's a concrete example. We have 20,000 users, with their names stored as such:
/users/$userId/name
Oftentimes, users will be looking up another user by name. As a user is looking up their buddy, we'd like a drop-down to populate a list of users whose names start with the letters that the searcher has inputted. So if I typed in "Ja" I would expect to see "Jake Heller," "jake gyllenhaal," "Jack Donaghy," etc. in the drop-down.
I know this is an old topic, but it's still relevant. Based on Neil's answer above, you more easily search doing the following:
fb.child('tags').startAt(queryString).endAt(queryString + '\uf8ff').limit(5)
See Firebase Retrieving Data.
The \uf8ff character used in the query above is a very high code point
in the Unicode range. Because it is after most regular characters in
Unicode, the query matches all values that start with queryString.
As inspired by Kato's comments -- one way to approach this problem is to set the priority to the field you want to search on for your autocomplete and use startAt(), limit(), and client-side filtering to return only the results that you want. You'll want to make sure that the priority and the search term is lower-cased, since Firebase is case-sensitive.
This is a crude example to demonstrate this using the Users example I laid out in the question:
For a search for "ja", assuming all users have their priority set to the lowercased version of the user's name:
fb.child('users').
startAt('ja'). // The user-inputted search
limitToFirst(20).
once('value', function(snap) {
for(key in snap.val()){
if(snap.val()[key].indexOf('ja') === 0) {
console.log(snap.val()[key];
}
}
});
This should only return the names that actually begin with "ja" (even if Firebase actually returns names alphabetically after "ja").
I choose to use limitToFirst(20) to keep the response size small and because, realistically, you'll never need more than 20 for the autocomplete drop-down. There are probably better ways to do the filtering, but this should at least demonstrate the concept.
Hope this helps someone! And it's quite possible the Firebase guys have a better answer.
(Note that this is very limited -- if someone searches for the last name, it won't return what they're looking for. Hence the "best" answer is probably to use a search backend with something like Kato's Flashlight.)
It strikes me that there's a much simpler and more elegant way of achieving this than client side filtering or hacking Elastic.
By converting the search key into its' Unicode value and storing that as the priority, you can search by startAt() and endAt() by incrementing the value by one.
var start = "ABA";
var pad = "AAAAAAAAAA";
start += pad.substring(0, pad.length - start.length);
var blob = new Blob([start]);
var reader = new FileReader();
reader.onload = function(e) {
var typedArray = new Uint8Array(e.target.result);
var array = Array.prototype.slice.call(typedArray);
var priority = parseInt(array.join(""));
console.log("Priority of", start, "is:", priority);
}
reader.readAsArrayBuffer(blob);
You can then limit your search priority to the key "ABB" by incrementing the last charCode by one and doing the same conversion:
var limit = String.fromCharCode(start.charCodeAt(start.length -1) +1);
limit = start.substring(0, start.length -1) +limit;
"ABA..." to "ABB..." ends up with priorities of:
Start: 65666565656565650000
End: 65666665656565650000
Simples!
Based on Jake and Matt's answer, updated version for sdk 3.1. '.limit' no longer works:
firebaseDb.ref('users')
.orderByChild('name')
.startAt(query)
.endAt(`${query}\uf8ff`)
.limitToFirst(5)
.on('child_added', (child) => {
console.log(
{
id: child.key,
name: child.val().name
}
)
})

Creating a sort of "composable" parser for log files

I've started a little pet project to parse log files for Team Fortress 2. The log files have an event on each line, such as the following:
L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")
Notice there are some common parts of the syntax for log files. Names, for example consist of four parts: the name, an ID, a Steam ID, and the team of the player at the time. Rather than rewriting this type of regular expression, I was hoping to abstract this out slightly.
For example:
my $name = qr/(.*)<(\d+)><(.*)><(Red|Blue)>/
my $kill = qr/"$name" killed "$name"/;
This works nicely, but the regular expression now returns results that depend on the format of $name (breaking the abstraction I'm trying to achieve). The example above would match as:
my ($name_1, $id_1, $steam_1, $team_1, $name_2, $id_2, $steam_2, $team_2)
But I'm really looking for something like:
my ($player1, $player2)
Where $player1 and $player2 would be tuples of the previous data. I figure the "killed" event doesn't need to know exactly about the player, as long as it has information to create the player, which is what these tuples provide.
Sorry if this is a bit of a ramble, but hopefully you can provide some advice!
I think I understand what you are asking. What you need to do is reverse your logic. First you need to regex to split the string into two parts, then you extract your tuples. Then your regex doesn't need to know about the name, and you just have two generic player parsing regexs. Here is an short example:
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $log = 'L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><
Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")';
my ($player1_string, $player2_string) = $log =~ m/(".*") killed (".*?")/;
my #player1 = $player1_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my #player2 = $player2_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
print STDERR Dumper(\#player1, \#player2);
Hope this what you were looking for.
Another way to do it, but the same strategy as dwp's answer:
my #players =
map { [ /(.*)<(\d+)><(.*)><(Red|Blue)>/ ] }
$log_text =~ /"([^\"]+)" killed "([^\"]+)"/
;
Your log data contains several items of balanced text (quoted and parenthesized), so you might consider Text::Balanced for parts of this job, or perhaps a parsing approach rather than a direct attack with regex. The latter might be fragile if the player names can contain arbitrary input, for example.
Consider writing a Regexp::Log subclass.

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code generation script for UN/LOCODE system and the database has unique 3 letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia, "AR TLL" can also exist (the country code and the 3 letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, the user has also the possibility of entering the 3 letter/number him/herself (which will be checked against the database before submission automatically).
Finally neither 0 or 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've came up with:
I'd check with AAA till 999, but then for each code it would require a new query (slow?).
I could store all the 40000 possibilities in an array and subtract all the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually, maybe 40000 isn't such a big number).
Generate a random code and hope it doesn't exist yet and see if it does, if it does start over again. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I will go with number 2, it is simple and 40000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 (A-Z, 2-9) letters.
I would for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
I went with the 2nd option. I was also able to make a script that will try to match as close as possible the country name, for example for Tartu it will try to match T** then TA* and if possible TAR, if not it will try TAT as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array
for($i=0;$i<$length;$i++)
for($j=0;$j<$length;$j++)
for($k=0;$k<$length;$k++)
$codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
$used[] = $result['code'];
$remaining = array_diff($codes, $used);
$code = $remaining[0];
Thanks for your opinion, this will be the key to transport codes all over the world :)