Need further understanding in the next unless code I'm reading - perl

I need help with two things in this code that I'm reading. First, I keep seeing this inside a while loop that reads a file:
while(<filename>){
next unless (/\w/);
chomp;
s/^\s*//;
s/^\s*$//;
my($name, $datatype, $io, $dummy) = split /\s*,\s*/, $_, 4;
}
So, I'm wondering what that is doing. There are commas in the line being read, so wouldn't the commas make it go to the next iteration? How would it split the line if it moves on to another iteration when the commas are read?
Another one I'm stumped by is:
while (<AP>) {
chomp;
s/
//g;
}
I have no idea what that code is actually substituting...
Thanks!

The first snippet:
Reads a line from a filehandle called filename. This is a really bad name for a filehandle.
It skips the processing if there is not even a single \w (word character) on the line.
The next unless (/\w/); is the same as next if not (/\w/). Note that there is no need for parentheses -- next unless /\w/; is fine.
A word character is, from perlretut:
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
It removes (only) the newline with chomp. Then it removes leading spaces, if any
It removes blank lines, the ones with only spaces on them
It splits the line by commas, allowing that they have spaces before and/or after. It also limits the number of terms returned, to 4. This means that it returns the first three comma-separated fields, and then all the rest as one string in the last element of the list
The second snippet is really bad, whatever it is meant to do. (Remove spaces on the line?)
Comments
It is far better to use lexical filehandles, rather than barenames. So you'd open a file as
open my $fh, '<', $filename or die "Can't open $filename: $!";
and read it by while (my $line = <$fh>) or by while (<$fh>).
Normally you'll see lines skipped if they have nothing other than spaces
next unless /\S/; # or
next if /^\s*$/;
Using \w also skips lines with some characters (other than what is matched by \w), which means that one had better be very sure that those are fine to skip.
Here it may be meant to skip a line with commas but no \w (comma is not matched by \w), for which split would return spaces (or empty strings) in a list. I find this a bit hidden and fragile. I'd drop lines with spaces only, and handle possible loose commas in processing. As it stands it doesn't help with ,,a, anyway, which yields ('', '', 'a'). So checking is probably needed in any case.
Note that altogether this code leaves trailing spaces. When split is invoked with the optional third argument (the limit), the last field keeps everything that remains, spaces included, and they haven't been removed otherwise.
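The comments above can be put together in a minimal sketch. It assumes the same four-field, comma-separated layout as the question; the helper name parse_line and the sample data are made up for illustration, and the data is read through an in-memory filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line into up to four comma-separated fields.
# Returns an empty list for lines that contain only whitespace.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /\S/;       # skip whitespace-only lines
    $line =~ s/^\s+|\s+$//g;           # trim leading and trailing whitespace
    return split /\s*,\s*/, $line, 4;  # at most 4 fields; the 4th gets the rest
}

# Made-up sample data, read through a lexical, in-memory filehandle.
my $data = "  foo , int, in, extra, stuff  \n   \nbar,char,out,x\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
while (my $line = <$fh>) {
    my ($name, $datatype, $io, $rest) = parse_line($line) or next;
    print "name=$name type=$datatype io=$io rest=$rest\n";
}
close $fh;
```

Trimming both ends before the split is a design choice here; it avoids the trailing spaces that the original loop leaves in the last field.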

Related

Need to create regular expression for one file having sentence

L02 TIME DEPOSITS 489,26,45,422.92
L18 DRAFTS ACCOUNT (IF CREDIT) 10,063.00 10,063.00
L21 SBI BILLS ACCOUNT (CONTRA) A18 37,51,432.00
A12A DEMAND LOANS 4,39,13,597.30
These are the lines I have in my file I want to extract the amounts from each line which starts with either (L or A) and store into a variable.
This is what I have written
pattern =/[A-Z]\w+\s*([\d,.]*)\s*([\d,.])*/g
$first = $1;
$second= $2;
Your regex is looking for a string of \w and then spaces in the middle, so it cannot match multiple words. The last * should be inside the parentheses, like the first one (but see below). The [A-Z] matches any uppercase letter while you say that you want A or L, so use [AL] instead.
my @amounts = $string =~ /^[AL]\w+ \s+ [A-Za-z ]* ([\d,.]*)/xg;
You don't want to literally repeat the pattern with * quantifier in order to account for a variable number of occurrences. What if 2 becomes 3 when requirements change? Four? Instead, you can capture all matches in an array and get exactly as many as there are.
The /x allows us to use spaces inside for readability.
Here is another approach, which is more flexible.
You need a pattern that matches a run made up only of digits, , (comma), and . (period). You want this only on lines that start with A or L.
So skip lines which do not start with A or L, then match only the needed pattern.
use warnings;
use strict;
my $filename = '...';
open my $fh, '<', $filename or die "Can't open $filename: $!";
while (<$fh>)
{
next unless /^[AL]/; # skip if the line doesn't start with A or L
my @amounts = $_ =~ /\b ([\d,.]+) \b/xg;
print "@amounts\n" if @amounts;
}
close $fh;
Here you need to specify \b, the word boundary. Otherwise 02 in L02 is matched, for example.
With no matches the array is empty so we test, to not print empty lines. Adjust as suitable.
The next step in reducing reliance on regex details and making code more flexible is to split the line by spaces and process term by term. Then adjustments are far easier and changes can be absorbed.
For example, this helps with the change in data mentioned in a comment -- what if there is a date? The above regex would match the numeric parts, while the first one would just break down.
With a loop over fields on each line we can just skip the date, next if /\d{4}-\d{2}/;
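The field-by-field approach could look roughly like the sketch below. The helper name amounts_in is made up, and the rule for what counts as an amount (a field that starts with a digit and contains only digits, commas, and periods) is an assumption based on the sample lines in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the amount-like fields of one line.
sub amounts_in {
    my ($line) = @_;
    return unless $line =~ /^[AL]/;        # only lines starting with A or L
    my @amounts;
    for my $field (split ' ', $line) {     # split on runs of whitespace
        next if $field =~ /^\d{4}-\d{2}/;  # skip date-like fields
        # Assumed rule: an amount has only digits, commas, and periods.
        push @amounts, $field if $field =~ /^\d[\d,.]*$/;
    }
    return @amounts;
}

for my $line (
    'L02 TIME DEPOSITS 489,26,45,422.92',
    'L18 DRAFTS ACCOUNT (IF CREDIT) 10,063.00 10,063.00',
    'L21 SBI BILLS ACCOUNT (CONTRA) A18 37,51,432.00',
    'A12A DEMAND LOANS 4,39,13,597.30',
) {
    my @amounts = amounts_in($line);
    print "@amounts\n" if @amounts;
}
```

Because each field is tested as a whole, codes such as L02 or A18 are rejected without needing \b, and a new kind of field only requires one more next line in the loop.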

How to delete duplicate lines while ignoring particular characters?

I need to remove all of the duplicate lines from a file, but ignoring all appearances of these characters:
(),、“”。!?#
As an example, these two lines would be considered duplicates, so one of them would be deleted:
“This is a line。“
This is a line
Similarly, these three lines would be considered duplicates, and only one would remain:
This is another line、 with more words。
“This is another line with more words。”
This is another line! with more words!
It does not matter which of the duplicate lines is kept remaining in the document.
After removing duplicates, the orders of the lines should not be changed.
Nearly all lines have important punctuation, but the punctuation might vary somewhat. Whichever line is kept might still have punctuation, so the punctuation should not be deleted in the final output.
How can I delete all of the duplicate lines in a file, while ignoring some characters?
From your example, you could just delete your symbols, and then remove your duplicates.
For instance :
$ cat foo
«This is a line¡»
This is another line! with more words¡
Similarly, these three lines would be considered duplicates, and only one would remain:
This is a line
This is another line, with more words!
This is another line with more words
$ tr --delete '¡!«»,' < foo | awk '!a[$0]++'
This is a line
This is another line with more words
Similarly these three lines would be considered duplicates and only one would remain:
$
Seems to do the job.
Edit :
From your question, it seems like those symbols and punctuation marks do not matter. You should clarify that.
I don't have time to write it out, but I think the easiest way is to parse your file and maintain an array of the lines already printed:
for each line:
cleanedLine = stripFromSymbol(line)
if cleanedLine not in AlreadyPrinted:
AlreadyPrinted.push(cleanedLine)
print line
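The pseudocode above might be written in Perl roughly as follows. This is only a sketch: the helper name dedupe_ignoring is made up, a hash is used instead of an array for the "already printed" lookup, and squashing whitespace before comparing is an assumption that goes slightly beyond the pseudocode. The file must be saved as UTF-8 for the literal characters in tr to work with use utf8.

```perl
use strict;
use warnings;
use utf8;    # the source contains literal non-ASCII characters

# Hypothetical helper: keep the first line of each group of duplicates,
# comparing lines after the ignored characters have been removed.
sub dedupe_ignoring {
    my @lines = @_;
    my (%seen, @kept);
    for my $line (@lines) {
        (my $key = $line) =~ tr/(),、“”。!?#//d;  # strip the ignored characters
        $key =~ s/\s+/ /g;                         # squash runs of whitespace
        $key =~ s/^ | $//g;                        # drop leading/trailing space
        push @kept, $line unless $seen{$key}++;    # original line, original order
    }
    return @kept;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for dedupe_ignoring(
    '“This is a line。“',
    'This is a line',
    'This is another line、 with more words。',
    'This is another line! with more words!',
);
```

This keeps the first line of each group and preserves the original order, which matches the requirements stated in the question.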
This is an approach. You collect them into arrays keyed on a normalized version. Normalized here means remove all the chars you don’t want and squash spaces too. Then it picks the shortest version to print/keep. That heuristic—which to keep—wasn’t really specified so season to taste. Code is a bit terse for production so you might flesh it out for clarity.
use utf8;
use strictures;
use open qw/ :std :utf8 /;
my %tree;
while (my $original = <DATA>) {
chomp $original;
( my $normalized = $original ) =~ tr/ (),、“”。!?#/ /sd;
push @{$tree{$normalized}}, $original;
#print "O:",$original, $/;
#print "N:",$normalized, $/;
}
@{$_} = sort { length $a <=> length $b } @{$_} for values %tree;
print $_->[0], $/ for values %tree;
__DATA__
“This is a line。“
This is a line
This is a line
This is another line、 with more words。
This is another line with more words
This is another line! with more words!
Yields–
This is another line with more words
This is a line

How to remove all lines except those ending with the .c extension using Perl scripting

I've a string $string which has got list of lines, some ending with *.c, *.pdf,etc and few without any extensions(these are directories). I need to remove all lines except *.c lines. How can i do that using regular expression? I've written to get removed *.c files as below but how to do a not of it?
next if $line =~ /(\.c)/i;
Any ideas.
thanks,
Sharath
Use unless instead of if to reverse the sense of the condition.
next unless $line =~ /\.c$/i;
or simply invert the test:
next if $line !~ /\.c$/i;
Also, you don't need parentheses around the regexp, and you need $ to anchor it to the end of the line.
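Since the question starts from a single string holding many lines, the same test can also be applied with grep. This is a sketch under that assumption; the helper name keep_c_lines and the sample paths are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines of $string that end in .c
# (case-insensitively, like the /i used above), joined by newlines.
sub keep_c_lines {
    my ($string) = @_;
    return join "\n", grep { /\.c$/i } split /\n/, $string;
}

# Made-up sample input: one path per line.
my $string = "src/main.c\ndocs/manual.pdf\nsrc\nlib/util.C\nnotes.c.bak\n";
print keep_c_lines($string), "\n";
```

The $ anchor is what rejects notes.c.bak here; without it that line would also be kept.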

Clarification on chomp

I'm on break from classes right now and decided to spend my time learning Perl. I'm working with Beginning Perl (http://www.perl.org/books/beginning-perl/) and I'm finishing up the exercises at the end of chapter three.
One of the exercises asked that I "Store your important phone numbers in a hash. Write a program to look up numbers by the person's name."
Anyway, I had come up with this:
#!/usr/bin/perl
use warnings;
use strict;
my %name_number=
(
Me => "XXX XXX XXXX",
Home => "YYY YYY YYYY",
Emergency => "ZZZ ZZZ ZZZZ",
Lookup => "411"
);
print "Enter the name of who you want to call (Me, Home, Emergency, Lookup)", "\n";
my $input = <STDIN>;
print "$input can be reached at $name_number{$input}\n";
And it just wouldn't work. I kept getting this error message:
Use of uninitialized value in concatenation (.) or string at hello.plx line 17, <STDIN> line 1.
I tried playing around with the code some more but each "solution" looked more complex than the "solution" that came before it. Finally, I decided to check the answers.
The only difference between my code and the answer was the presence of chomp ($input); after <STDIN>;.
Now, the author has used chomp in previous example but he didn't really cover what chomp was doing. So, I found this answer on www.perlmeme.org:
The chomp() function will remove (usually) any newline character from
the end of a string. The reason we say usually is that it actually
removes any character that matches the current value of $/ (the input
record separator), and $/ defaults to a newline..
Anyway, my questions are:
What newlines are getting removed? Does Perl automatically append a "\n" to the input from <STDIN>? I'm just a little unclear because when I read "it actually removes any character that matches the current value of $/", I can't help but think "I don't remember putting a $/ anywhere in my code."
I'd like to develop best practices as soon as possible - is it best to always include chomp after <STDIN> or are there scenarios where it's unnecessary?
<STDIN> reads to the end of the input string, which contains a newline if you press return to enter it, which you probably do.
chomp removes the newline at the end of a string. $/ is a variable (as you found, defaulting to newline) that you probably don't have to worry about; it just tells perl what the 'input record separator' is, which I'm assuming means it defines how far <FILEHANDLE> reads. You can pretty much forget about it for now, it seems like an advanced topic. Just pretend chomp chomps off a trailing newline. Honestly, I've never even heard of $/ before.
As for your other question, it is generally cleaner to always chomp variables and add newlines as needed later, because you don't always know if a variable has a newline or not; by always chomping variables you always get the same behavior. There are scenarios where it is unnecessary, but if you're not sure it can't hurt to chomp it.
Hope this helps!
OK, as for 1), perl doesn't add any \n to the input. It is you who hits Enter when finished entering the name. If you don't set $/, it defaults to \n (under UNIX, at least).
As for 2), chomp will be needed whenever input comes from the user, or whenever you want to remove the line-ending character (when reading from a file, for example).
Finally, the error you're getting may be from perl not understanding your variable within the double quotes of the last print, because it does have a _ character. Try to write the string as follows:
print "$input can be reached at ${name_number{$input}}\n";
(note the {} around the last variable).
<STDIN> is a shortcut notation for readline( *STDIN );. What readline() does is read from the filehandle until it encounters the contents of $/ (aka $INPUT_RECORD_SEPARATOR), and it returns everything it has read, including the contents of $/. What chomp() does is remove a trailing occurrence of the contents of $/, if present.
The contents is often called a newline character but it may be composed of more than one character. On Linux, it contains a LF character but on Windows, it contains CR-LF.
See:
perldoc -f readline
perldoc -f chomp
perldoc perlvar and search for /\$INPUT_RECORD_SEPARATOR/
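To see that chomp follows $/ rather than looking for a newline specifically, here is a small sketch. The helper chomp_with is made up for illustration; it sets $/ locally and then chomps the given text.

```perl
use strict;
use warnings;

# Hypothetical helper: chomp $text with $/ temporarily set to $separator.
sub chomp_with {
    my ($separator, $text) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    chomp $text;              # removes one trailing copy of $/, if present
    return $text;
}

print chomp_with("\n", "Me\n"), "|\n";     # trailing newline removed
print chomp_with(";",  "Home;"), "|\n";    # trailing semicolon removed
print chomp_with(";", "Home\n") eq "Home\n"
    ? "newline kept\n"                     # "\n" is not the separator here
    : "newline removed\n";
```

With the default $/ left alone, as in the phone-number program, chomp simply removes the newline that Enter put at the end of the input.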
I think best practice here is to write:
chomp(my $input = <STDIN>);
Here is a quick example of how the chomp function works ($/ is explained there), removing just one trailing newline (if any):
chomp (my $input = "Me\n"); # OK
chomp ($input = "Me"); # OK (nothing done)
chomp ($input = "Me\n\n"); # $input now is "Me\n";
chomp ($input); # finally "Me"
print "$input can be reached at $name_number{$input}\n";
BTW: The funny thing is that I am learning Perl too, and I reached hashes five minutes ago.
Though it may be obvious, it's still worth mentioning why the chomp is needed here.
The hash created contains 4 lookup keys: "Me", "Home", "Emergency" and "Lookup"
When $input is specified from <STDIN>, it'll contain "Me\n", "Me\r\n" or some other line-ending variant depending on what operating system is being used.
The uninitialized value error comes about because the "Me\n" key does not exist in the hash. And this is why the chomp is needed:
my $input = <STDIN>; # "Me\n" --> Key DNE, $name_number{$input} not defined
chomp $input; # "Me" --> Key exists, $name_number{$input} defined
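Putting this together, a lookup that chomps the input and also checks that the key exists might look like the sketch below. The helper name lookup and the fallback message are made up; the hash is a shortened copy of the one in the question.

```perl
use strict;
use warnings;

my %name_number = (
    Me   => "XXX XXX XXXX",
    Home => "YYY YYY YYYY",
);

# Hypothetical helper: look up a name as typed by the user (newline included).
sub lookup {
    my ($input) = @_;
    chomp $input;    # "Me\n" becomes "Me"
    return exists $name_number{$input}
        ? "$input can be reached at $name_number{$input}"
        : "No number stored for '$input'";
}

print lookup("Me\n"), "\n";
print lookup("Nobody\n"), "\n";
```

The exists test avoids the "uninitialized value" warning for names that are simply not in the hash, which chomp alone cannot prevent.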

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file via CGI in Perl, and I've noticed that when the file is saved in the Mac's TextEdit the line separator is recognized, but when I upload a CSV that is exported straight from Excel, it is not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it's looking for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
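As a minimal sketch of localizing $/ while reading, the example below reads records from an in-memory filehandle with a caller-supplied separator. The helper name read_records and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read all records from $data, using $separator as $/.
sub read_records {
    my ($data, $separator) = @_;
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    local $/ = $separator;    # only in effect until this sub returns
    my @records = <$fh>;      # each record ends where $/ is found
    chomp @records;           # chomp removes the current $/, not just "\n"
    close $fh;
    return @records;
}

my @rows = read_records("a,b\r\nc,d\r\n", "\r\n");
print "$_\n" for @rows;
```

Because $/ is localized inside the sub, code elsewhere in the program keeps reading newline-terminated lines as usual.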
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
local $/ = undef;
open F, $YOUR_FILE or die;
@lines = split /\R/, <F>;
close F;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
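A quick way to check this is to split a string with mixed line endings. This is a sketch with made-up sample text and a made-up helper name, split_lines; \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lines on any generic newline.
sub split_lines {
    my ($text) = @_;
    return split /\R/, $text;    # \R treats "\r\n" as one line break
}

# Made-up sample with LF, CRLF, and bare CR line endings.
my $text  = "unix line\nwindows line\r\nold mac line\rlast line";
my @lines = split_lines($text);
print scalar(@lines), " lines\n";
print "[$_]\n" for @lines;
```

The brackets in the output make it easy to confirm that no stray \r is left at the end of any line.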
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and exhaustive explanation about \R in Brian D Foy's article : The \R generic line ending which even has a couple of fun videos.