what's the proper syntax to insert a multi-line substitution string and apply it toward a single array value? - perl

Here's the scenario -- One step of the process involves fixing city names when the data is obviously misspelled, along with some basic conversions like "MTN" to "Mountain" and so forth. I've built a variable containing several substitution strings, and I'm trying to apply that set of subs on one of the input fields later down the line.
my $citysub = <<'EOF';
s/DEQUEEN/DE QUEEN/;
s/ELDORADO/EL DORADO/;
... # there are about 100 such substitution strings
EOF
...
while ($line = <INFILE>)
{
...
@field = split(/","/,$line); # it's a comma-delimited file with quoted strings; this is splitting exactly like I intend; at the end, I'll piece it back together properly
...
# the 9th field and 12th field are city names, i.e., $field[8] and $field[12]
$field[8] =~ $citysub; # this is what I'm wanting to do, but it doesn't work!
# since that doesn't work, I'm using the following, but it's much slower, obviously
$field[8] = `echo $field[8]|sed -e "$citysub"`; # external calls to system commands
So, what's the proper syntax to insert a multi-line substitution string and apply it toward a single array value?

my %citysub = ( "DEQUEEN" => "DE QUEEN", "ELDORADO" => "EL DORADO" );
for my $find ( keys %citysub ) {
my $replace = $citysub{ $find };
$field[8] =~ s/$find/$replace/g;
}
Explanation: Create a hash of "thing to match" => "thing to replace with", then loop over that hash and run s/// with each match/replacement pair.
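As for why the original attempt fails: the right-hand side of =~ is treated as a pattern to match, not as code to run, so a string full of s/.../.../; statements is never executed as substitutions. If looping over roughly 100 keys for every field turns out to be slow, one option is to fold all keys into a single precompiled alternation. The sketch below assumes the keys are literal city names; the \b word boundaries and the helper name fix_city are additions for illustration, not part of the answer above.

```perl
use strict;
use warnings;

# Literal misspelling => correction pairs (same idea as %citysub above).
my %citysub = (
    'DEQUEEN'  => 'DE QUEEN',
    'ELDORADO' => 'EL DORADO',
);

# One alternation of all keys, longest first so longer keys win over prefixes.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %citysub;
my $city_re = qr/\b($alternation)\b/;

# Hypothetical helper: returns a corrected copy of one field.
sub fix_city {
    my ($city) = @_;
    $city =~ s/$city_re/$citysub{$1}/g;
    return $city;
}

print fix_city('ELDORADO'), "\n";
```

In the loop from the question this would be used as $field[8] = fix_city($field[8]); the regex is compiled once, outside the loop.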

Related

Perl: how to format a string containing a tilde character "~"

I have run into an issue where a perl script we use to parse a text file is omitting lines containing the tilde (~) character, and I can't figure out why.
The sample below illustrates what I mean:
#!/usr/bin/perl
use warnings;
formline " testing1\n";
formline " ~testing2\n";
formline " testing3\n";
my $body_text = $^A;
$^A = "";
print $body_text
The output of this example is:
testing1
testing3
The line containing the tilde is dropped entirely from the accumulator. This happens whether there is any text preceding the character or not.
Is there any way to print the line with the tilde treated as a literal part of the string?
~ is special in forms (see perlform) and there's no way to escape it. But you can create a field for it and populate it with a tilde:
formline " \@testing2\n", '~';
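A small sketch of the difference between a tilde in the picture and a tilde supplied as field data. The helper name render is hypothetical; it only wraps formline and the accumulator $^A so the two cases can be compared side by side.

```perl
use strict;
use warnings;

# Hypothetical helper: run one formline call and return what it
# appended to the accumulator, leaving the global $^A untouched.
sub render {
    my ($picture, @values) = @_;
    local $^A = '';
    formline($picture, @values);
    return $^A;
}

# A literal ~ in the picture with no fields: the line is suppressed.
my $suppressed = render(" ~testing2\n");

# A one-character @ field filled with '~': the tilde is ordinary data.
my $kept = render(" \@testing2\n", '~');

print "suppressed: [$suppressed]\n";
print "kept: [$kept]\n";
```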
The first argument to formline is the "picture" (template). That picture uses various characters to mean particular things. The ~ means to suppress output if the fields are blank. Since you supply no fields in your call to formline, your fields are blank and output is suppressed.
my @lines = ( '', 'x y z', 'x~y~z' );
foreach $line ( @lines ) { # forms don't use lexicals, so no my on control
write;
}
format STDOUT =
~ ID: @*
$line
.
The output doesn't have a line for the blank field because the ~ in the picture told it to suppress output when $line doesn't have anything:
ID: x y z
ID: x~y~z
Note that tildes coming from the data are just fine; they are like any other character.
Here's probably something closer to what you meant. Create a picture, @* (variable-width multiline text), and supply it with values to fill it:
while( <DATA> ) {
local $^A;
formline '@*', $_;
print $^A, "\n";
}
__DATA__
testing1
~testing2
testing3
The output shows the field with the ~:
testing1
~testing2
testing3
However, the question is very odd because the way you appear to be doing things seems like you aren't really doing what formats want to do. Perhaps you have some tricky thing where you're trying to take the picture from input data. But if you aren't going to give it any values, what are you really formatting? Consider that you may not actually want formats.

Best way to parse string in perl

To achieve the task below I have written the C-like Perl program below (I am new to Perl), but I am not sure it is the best way to do it.
Can someone please guide me?
Note: I am not asking for a full program, only for where I can improve it.
Thanks in advance
Input :
$str = "mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>"
Expected Output :
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
Sample Program
my $str="mail1, local<mail1\@mail.local>, mail2\@mail.local, <mail3\@mail.local>, mail4, local<mail4\@mail.local>";
my ($count, $flag, $tempStr, @array) = (0, 0, "");
for my $c (split (//,$str)) {
if( ($count eq 0) and ($c eq ' ') ) {
next;
}
if($c) {
if( ($c eq ',') and ($flag eq 1) ) {
push @array, $tempStr;
$count=0;
$flag=0;
$tempStr="";
next;
}
if( ($c eq '>' ) or ( $c eq '@' ) ) {
$flag=1;
}
$tempStr="$tempStr$c";
$count++;
}
}
if($count>0) {
push @array, $tempStr;
}
foreach my $var (@array) {
print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1@mail.local>
"mail4, local"<mail4@mail.local>
Sample Code:
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
my @addresses = split('\n',$str);
if(scalar @addresses) {
foreach my $address (@addresses) {
if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
$address="\"$address";
$address=~ s/</\"</g;
}
}
$str = join(',',@addresses);
}
print "$str\n";
As I see, you want to replace each:
comma and following spaces,
occurring after either @ or >,
with a newline.
To make such replacement, instead of writing a parsing program, you can use
a regex.
The search part can be as follows:
([^@>]+[@>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^@>]+ - A non-empty sequence of chars other than @ or >.
[@>] - Either @ or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4, local<mail4@mail.local>';
print "Before:\n$str\n";
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all the needed commas I used the g option.
Note that I put the source string in single quotes; otherwise Perl
would have complained about "Possible unintended interpolation of @mail".
Edit
Your modified requirements must be handled in a different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some fragments to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(actual mail address), enclosed in angle brackets, e.g. <mail1@mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the content of this loop is to print all these captured groups,
with required additional chars.
The code (again much shorter than yours) should look like below:
my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex
use warnings;
use strict;
use feature 'say';
my $string = # broken into two parts for readability
q(mail1, local<mail1@mail.local>, mail2@mail.local, )
. q(<mail3@mail.local>, mail4, local<mail4@mail.local>);
my @addresses = split /@.+?\K,\s*/, $string;
say for @addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so @.+?,
To match a pattern only when it is preceded by another brings to mind a lookbehind before the comma. But those can't be of variable length, which is precisely the case here.
We can instead normally match the pattern @.+? and then use the \K form (of the lookbehind), which keeps everything matched before it out of the match so that it is not removed from the string. Thus the above splits on ,\s* when that is preceded by the email address, @... (which isn't consumed).
It prints
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example
my @addresses = split /@.+?\K,\s*/, $string;
s/(.+?,\s*.+?)</"$1"</ for @addresses;
say for @addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the index variable (or $_) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (@addresses) {
$elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.
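The aliasing described above can be seen directly, along with two ways to leave the original list untouched when that matters. This is a minimal sketch: the sample addresses come from the question, the array names are made up for the example, and the /r flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my @addresses = (
    'mail1, local<mail1@mail.local>',
    'mail2@mail.local',
);

# Option 1: copy first, then modify the copy in place through the $_ alias.
my @quoted = @addresses;
s/(.+?,\s*.+?)</"$1"</ for @quoted;

# Option 2: /r returns the changed string and leaves $_ (the element) alone.
my @quoted_r = map { s/(.+?,\s*.+?)</"$1"</r } @addresses;

print "original: $_\n" for @addresses;
print "quoted:   $_\n" for @quoted;
```

Running the substitution directly on @addresses, as in the answer, changes the elements themselves, which is the right choice when the old values are no longer needed.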

How can i make substitutions of the same word in Perl on the same xml line?

I'm working on an XML Document, I need to open it and transform to uppercase some specific tag values on the same line. If I have the same word it only does the substitution for one of them although I'm using two different if loops:
This is my XML:
<pageID="1" width="827" height="1169" Sender_Company="société" Sender_Address="société" Sender_Fax="" Category="C2" Language_2="" Document_Object="" Language_1="french" Language_3="" NumPage="1" Script_1="typed">
This is my code:
while (<FILEIN>) {
if ($_ =~ /pageID="1"/) {
$haschanged = 1;
if ($_ !~ /Sender_Address=""/) {
if ($_ =~ /(Sender_Address="(.*?)")/){
my $SenderAddress = $2;
$SenderAddress = uc($SenderAddress);
$_ =~ s/$1/Sender_Address="$SenderAddress"/;
}
}
if ($_ !~ /Sender_Company=""/) {
if ($_ =~ /(Sender_Company="(.*?)")/) {
my $SenderCompany = $2;
$SenderCompany = uc($SenderCompany);
$_ =~ s/$1/Sender_Company="$SenderCompany"/;
#print "$_\n";
}
}
}
}
When I use two different values for Sender_Company="bla" and Sender_Address="société" the transformation to uppercase works but when I use in this case the same word Sender_Company="société" and Sender_Address="société" it doesn't do the transformation to uppercase.
Does anyone have any ideas? I can't find the logic behind it not wanting to transform the same word when I'm using two distinct if loops at a time. Thank you!
Your understanding of XML is a bit debatable:
That isn't XML. It is an XML fragment at most (Element not closed, tag name can't double as attribute like <pageID="1">, no <?xml ...?> declaration, no root element, …)
Don't parse XML with regexes ;-)
XML doesn't have a concept of “lines”.
Besides that, the code should work fine. Do note that you can make your life easy, and your code short:
$_ =~ /foo/ is the same as /foo/, $_ !~ /foo/ is the same as !/foo/.
Instead of extracting two captures, and substituting the result in a second regex, you can do it all in just one step:
s{ (?<=Sender_Address=") ([^"]+) (?=") }{ uc $1 }ex
Wait, what? I extract one or more non-"-characters that are preceded by the string Sender_Address=" and are followed by " (look-around assertions). The thing in between I capture, and substitute it with an uppercased version. Because I match at least one character, I don't have to test for the empty tag case. The /e flag allows code in the substitution (not really necessary here, since \U$1 in a plain replacement would do the same), and the /x allows us to include nonmatching whitespace for better formatting.
You can easily extend this for both attributes you want to uppercase:
# This subsumes your whole logic inside `if (/pageID="1"/)`
$haschanged = 1;
for my $attr (qw/Sender_Address Sender_Company/) {
s{ (?<=\Q$attr\E=") ([^"]+) (?=") }{ uc $1 }ex;
}
The \Q...\E causes the interpolated stuff to match literally, even if it contains characters that would be regex metacharacters otherwise.
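To check the case from the question, where both attributes hold the same word, the loop can be wrapped in a function and run against a sample line. This is a sketch only: the helper name upcase_attrs and the simplified, ASCII-only input line are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase the values of the named attributes in one line.
sub upcase_attrs {
    my ($line, @attrs) = @_;
    for my $attr (@attrs) {
        $line =~ s{ (?<=\Q$attr\E=") ([^"]+) (?=") }{ uc $1 }ex;
    }
    return $line;
}

# Simplified stand-in for the question's line; both attributes hold the same word.
my $line = '<page ID="1" Sender_Company="acme" Sender_Address="acme" Sender_Fax="">';
print upcase_attrs($line, qw/Sender_Address Sender_Company/), "\n";
```

Because each substitution is anchored on its own attribute name, identical values in the two attributes do not interfere with each other, and the empty Sender_Fax is left alone.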
There are a few remaining bugs:
You fail to uppercase characters that are given as entities.
XML allows single quotes '...' to be used as attribute value delimiters. You don't handle them.
See the points under Your understanding of XML…
All of these can be solved by using an XML parser, and then transforming the attributes in the DOM.

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven't been able to make it work. Eventually I've found a way that works:
my @atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (@atoms,$atomi);
$natoms++;
}
@atoms = sort {$b cmp $a} @atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that input files are not always identical and they may contain white space before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated by the split command and I'm having some trouble removing it. This is what I have done:
@info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (@info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array @info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my @tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', @tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc -f split)
Also, the first part of your program might be simpler as:
my @atoms = sort {$b cmp $a} keys %$atome;
my $natoms = @atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my @atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
# (exact string comparison, so that ID 8 does not also match 18)
if (grep { $_ eq $atom_id } @atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}
Let's warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my @atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar @atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly by the count of values in @atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
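The empty first column from the question can be reproduced with a short comparison of the two split forms. This is a sketch; the sample row is the line from the question with leading spaces added, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# The row from the question, with leading whitespace added.
my $row = '   8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500';

my @by_regex = split /\s+/, $row;   # leading whitespace yields an empty first field
my @by_space = split ' ',   $row;   # special case: leading whitespace is skipped

printf "regex: %d fields, first is '%s'\n", scalar @by_regex, $by_regex[0];
printf "space: %d fields, first is '%s'\n", scalar @by_space, $by_space[0];
```

The first field from the regex form is the empty string, not a space, which is why a test like $info[0] eq " " never fires.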
For your second question you could use this line:
@info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read all the lines of the file and store them in a hash with $info[0] as the key and [@info[1..8]] as the value. Then you can look up the entries by your index.
my %details;
while (<IN_LIG>) {
my @info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ @info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).
thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. thanks!
As for the regex, I found something that could make all keys as part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, but I liked the join/quotemeta solution better
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys
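For reference, a working form of the join/quotemeta idea might look like the sketch below. Two things stand out in the attempt above: there is no comma after "|", and a hash reference is normally dereferenced as %{$hash} or %$hash. The data here is hypothetical, the keys are sorted by their own numeric value rather than by the hash values, and the helper name wanted_index is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical data: atom index => anything (only the keys matter here).
my $atome = { 8 => 1, 12 => 1, 120 => 1 };

# Keys sorted numerically, each escaped, joined into one alternation.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { $a <=> $b } keys %{$atome};

# Anchor on the first column and require whitespace after the index,
# so that 12 does not match a line whose index is 120 or 112.
my $index_re = qr/^\s*($alternation)\s/;

# Hypothetical helper: returns the matched index, or undef if the
# line's first column is not one of the wanted keys.
sub wanted_index {
    my ($row) = @_;
    return $row =~ $index_re ? $1 : undef;
}

my $hit = wanted_index('  8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500');
print defined $hit ? "matched index $hit\n" : "no match\n";
```

With the index already isolated by split ' ', a plain hash lookup such as exists $atome->{$id} does the same job without building a regex at all.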

Perl Map Function

I'm new to the map and grep functions and I'm trying to make an existing script more concise.
I can "grep" the @tracknames successfully but I'm having a problem with "map". I want @trackartist to return true if two consecutive "--" are found in a line and take the value of $1, otherwise false, but it returns the whole line if the upper condition is not met.
What am I doing wrong?
my @tracknames = grep /^\d\d\..*?(\.(?:flac|wv))$/, <*.*>;
my @trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; } <*.*>;
Sample of files
01. some track artist 1 -- some track name 1.(flac or wv)
02. some track artist 2 -- some track name 2.(flac or wv)
03. some track artist 3 -- some track name 3.(flac or wv)
etc.
Remember that grep is for filtering a list and map is for transforming a list. Right now, your map statement returns $_ for every item in the list. If $_ matches the pattern in your substitution, it will be modified and replaced with the first match. Otherwise, it's not modified and the original $_ is returned.
It sounds like you want to filter out items that don't match the pattern. One way would be to combine a map and a grep:
my @trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; }
grep { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ } <*.*>;
Of course, this means you're doing the same pattern match twice. Another approach is to do a transform with map, but transform anything that doesn't match the pattern into an empty list.
my @trackartist = map { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ ? $1 : ( ) } <*.*>;
This uses the ternary conditional operator (?:) to check if the regex matches (returning a true value). If it does, $1 is returned from the map block, if not, an empty list ( ) is returned, which adds nothing to the list resulting from the map.
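The empty-list behaviour is easy to try without touching the file system. In the sketch below the list @files is hypothetical and stands in for the result of the glob; the regex is the one from the answer.

```perl
use strict;
use warnings;

# Hypothetical file names standing in for the result of <*.*>.
my @files = (
    '01. some artist 1 -- some track 1.flac',
    '02. some artist 2 -- some track 2.wv',
    '03. no separator here.flac',
    'notes.txt',
);

# Matching names contribute their captured artist; the rest contribute nothing.
my @trackartist =
    map { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ ? $1 : () } @files;

print "$_\n" for @trackartist;
```

Four names go in and only two artists come out, because each non-matching name maps to an empty list rather than to itself.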
As a side note, you might want to look into using the glob function rather than <>, which has some disadvantages.
I like map and grep as much as the next guy, but your task seems more suited to a divide-and-conquer parsing approach. I say this because your comments suggest that your interest in map is leading you down a road where you'll end up with a data model consisting of parallel arrays -- @tracks, @artists, etc. -- which is often difficult to maintain in the long run. Here's a sketch of what I mean:
my @tracks;
while (my $file_name = <DATA>){ # You'll use glob() or <*.*>
# Filter out unwanted files.
my ($num, $artist_title, $ext) = $file_name =~ /
^ (\d\d) \. \s*
(.*)
\. (flac|wv) $
/x;
next unless $ext;
# Try to parse the artist and title. Adjust as needed.
my ($artist, $title) = split /\s+--\s+/, $artist_title, 2;
($artist, $title) = ('UNKNOWN', $artist) unless $title;
# Store all info as a hash ref. No need for parallel arrays.
push @tracks, {
file_name => $file_name,
ext => $ext,
artist => $artist,
title => $title,
};
}
__DATA__
01. Perl Jam -- Open or die.wv
02. Perl Jam -- Map to nowhere.flac
03. Perl Jam -- What the #$#!?.wv
04. Perl Jam -- Regex blues.wv
05. Perl Jam -- Use my package, baby.wv
06. Perl Jam -- No warnings.wv
07. Perl Jam -- Laziness ISA virtue.wv
08. Guido and the Pythons -- Home on the xrange.flac
09. Guido and the Pythons -- You gotta keep em generated.flac
10. StackOverflow medley.wv
foo.txt