Difference between chomp and trim in Perl?

What is the difference between chomp and trim in Perl? Which one is better to use and when?

Chomp: The chomp() function will remove (usually) any newline character from the end of a string. The reason we say usually is that it actually removes any character that matches the current value of $/ (the input record separator), and $/ defaults to a newline.
For more information see chomp.
As rightfold has commented, there is no trim function in Perl. People generally write their own function named trim (any other name works too) to remove leading and trailing whitespace (or single or double quotes, or any other special characters).
A trim removes whitespace from both ends of a string:
$str =~ s/^\s+|\s+$//g;

trim removes both leading and trailing whitespace; chomp removes only a trailing input record separator (usually a newline character).
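A minimal sketch of the difference, assuming the default value of $/. The trim helper and the sample strings are illustrative only; trim is not a Perl built-in:

```perl
use strict;
use warnings;

# Hypothetical helper: Perl has no built-in trim.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;    # strip whitespace at both ends
    return $s;
}

my $line = "  hello world  \n";

my $chomped = $line;
chomp $chomped;               # removes only the trailing "\n"
my $trimmed = trim($line);    # removes whitespace (including "\n") at both ends

print "[$chomped]\n";         # prints "[  hello world  ]"
print "[$trimmed]\n";         # prints "[hello world]"

# chomp follows $/, so a different record separator changes what is removed.
my $record = "dataEND";
{
    local $/ = "END";
    chomp $record;
}
print "[$record]\n";          # prints "[data]"
```

So chomp is the usual choice right after reading a line, while a trim-style substitution is for cleaning up surrounding whitespace in general.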

Chomp: It removes only the trailing newline, if there is one (more precisely, a trailing copy of the input record separator $/).
More details can be found at: http://perldoc.perl.org/functions/chomp.html
Trim: There is no function called trim in Perl. However, we can write our own function to remove leading and trailing whitespace. The code can be as follows:
# Perl trim function: remove leading and trailing whitespace
sub trim($)
{
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
More details can be found at: http://perlmaven.com/trim

Related

Code does not remove non-ascii characters from variable

Why do the following lines of code not remove non-ASCII characters from my variable and replace them with a single space?
$text =~ s/[[:^ascii:]]+/ /rg;
$text =~ s/\h+/ /g;
Whereas this works to remove newlines?
$log_mess =~ s/[\r\n]+//g;
To explain the problem for anyone finding this question in the future:
$text =~ s/[[:^ascii:]]+/ /rg;
The problem is the /r option on the substitution operator (s/.../.../).
This operator is documented in the "Regexp Quote-Like Operators" section of perlop. It says this about /r:
r - Return substitution and leave the original string untouched.
You see, in most cases, the substitution operator works on the string that it is given (e.g. your variable $text) but in some cases, you don't want that. In some cases, you want the original variable to remain unchanged and the altered string to be returned so that you can store it in a new variable.
Previously, you would do this:
my $new_var = $var;
$new_var =~ s/regex/substitution/;
But since the /r option was added, you can simplify that to:
my $new_var = $var =~ s/regex/substitution/r;
I'm not sure why you used /r in your code (I guess you copied it from somewhere else), but you don't need it here and it's what is leading to your original string being unchanged.
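To make the difference concrete, here is a small sketch. The sample string and variable names are made up for illustration, and /r needs Perl 5.14 or later:

```perl
use strict;
use warnings;

# Sample text containing one non-ASCII character (U+00E9) and a run of spaces.
my $text = "caf\x{e9}   au lait";

# With /r the modified string is returned and $text is left untouched.
my $copy = $text =~ s/[[:^ascii:]]+/ /gr;

# $text still contains the non-ASCII character at this point.
print(($text =~ /[[:^ascii:]]/) ? "unchanged\n" : "changed\n");    # prints "unchanged"

# Without /r the substitution edits $text in place.
$text =~ s/[[:^ascii:]]+/ /g;
$text =~ s/\h+/ /g;

print "$text\n";    # prints "caf au lait"
```

In the original code the value returned by the /r substitution was simply thrown away, which is why nothing appeared to happen.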

Need further understanding in the next unless code I'm reading

I need help with 2 things in this code that I'm reading. First, I keep seeing this inside a while loop that reads a file:
while (<filename>) {
next unless (/\w/);
chomp;
s/^\s*//;
s/^\s*$//;
my($name, $datatype, $io, $dummy) = split /\s*,\s*/, $_, 4;
}
So, I'm wondering what that is doing. There are commas in the line being read, so wouldn't the commas make it go to the next iteration? And how would it split the lines if it moves on to another iteration when the commas are read?
Another one I'm stumped by is:
while (<AP>) {
chomp;
s/
//g;
}
I have no idea what that code is actually substituting...
Thanks!
The first snippet:
It reads a line from a filehandle called filename. This is a really bad name for a filehandle.
It skips the processing if there is not even a single \w (word character) on the line.
The next unless (/\w/); is the same as next if not (/\w/). Note that there is no need for parentheses: next unless /\w/; is fine.
A word character is, from perlretut:
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
It removes (only) the newline with chomp. Then it removes leading spaces, if any.
It then empties lines that contain only spaces (s/^\s*$//), although such lines have already been skipped by the \w check above.
It splits the line by commas, allowing for spaces before and/or after them. It also limits the number of terms returned to 4. This means that it returns the first three comma-separated fields, and then all the rest as one string in the last element of the list.
The second snippet is really bad, whatever it is meant to do. (Remove spaces on the line?)
Comments
It is far better to use lexical filehandles, rather than barewords. So you'd open a file as
open my $fh, '<', $filename or die "Can't open $filename: $!";
and read it by while (my $line = <$fh>) or by while (<$fh>).
Normally you'll see lines skipped if they have nothing other than spaces
next unless /\S/; # or
next if /^\s*$/;
Using \w also skips lines that contain only characters not matched by \w (punctuation, for example), so one had better be very sure that those are fine to skip.
Here it may be meant to skip a line with commas but no \w (comma is not matched by \w), for which split would return spaces (or empty strings) in a list. I find this a bit hidden and fragile. I'd drop lines with spaces only, and handle possible loose commas in processing. As it stands it doesn't help with ,,a, anyway, which yields ('', '', 'a'). So checking is probably needed in any case.
Note that altogether this code leaves trailing spaces. Nothing strips them, and when split is invoked with its optional third argument (the limit) the remainder of the line ends up intact in the last field.
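Putting these comments together, the loop could be sketched as follows. The sample data, the in-memory filehandle, and the names $rest and @rows are assumptions for illustration; the real code would open its own file:

```perl
use strict;
use warnings;

# Made-up sample data; an in-memory filehandle keeps the sketch self-contained.
my $data = "  clk , bit , in , extra, more  \n\n   \nrst,bit,in,x\n";
open my $fh, '<', \$data or die "Can't open in-memory data: $!";

my @rows;
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /\S/;      # skip lines with only whitespace
    $line =~ s/^\s+|\s+$//g;        # trim both ends, so no trailing spaces remain

    # Limit of 4: three fields, then everything else in the last one.
    my ($name, $datatype, $io, $rest) = split /\s*,\s*/, $line, 4;
    push @rows, [$name, $datatype, $io, $rest];
    print join('|', $name, $datatype, $io, $rest), "\n";
}
close $fh;
# prints "clk|bit|in|extra, more" and then "rst|bit|in|x"
```

Lines consisting of loose commas would still get through here, so any check for empty fields belongs after the split.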

How can I use "s" as a substitution delimiter in Perl?

I was playing with Perl and thought that
sssssss
Would have been the same as
s/s/ss/
It seems only certain delimiters can be used. What are they?
You can use any non-whitespace character as the delimiter, but you can't use the delimiter inside PATTERN or REPLACEMENT without escaping it. This is totally valid:
my $x = 's';
$x =~ s s\ss\s\ss;
print $x; # prints "ss"
Note that a space is required after the first s, or else ss will be parsed as an identifier.
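In practice, alternative delimiters are mostly used to avoid escaping slashes. A short sketch with a made-up path and illustrative variable names:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";

# '#' as the delimiter: the slashes in the pattern need no escaping.
(my $with_hash = $path) =~ s#/usr/local#/opt#;

# Bracketing delimiters are used in pairs, one pair per part.
(my $with_braces = $path) =~ s{/usr/local}{/opt};

# A comma works as well.
(my $with_comma = $path) =~ s,/usr/local,/opt,;

print "$with_hash\n$with_braces\n$with_comma\n";    # prints "/opt/bin" three times
```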

Perl non-English Character

See this piece of perl code:
#!/usr/bin/perl -w -CS
use feature 'unicode_strings';
open IN, "<", "wiki.txt";
open OUT, ">", "wikicorpus.txt";
binmode( IN, ':utf8' );
binmode( OUT, ':utf8' );
## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model
while (<IN>) {
# Remove starting and trailing tags (e.g. <s>)
# s/\<[a-z\/]+\>//g;
# Remove ellipses
s/\.\.\./ /g;
# Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words
# Unicode 2026 (horizontal ellipsis)
# Unicode 2013 and 2014 (m- and n-dash)
s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;
# Remove quotes
s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/'' / /g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/?/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/\s+/ /g;
# Remove leading space
s/^\s+//;
chomp($_);
print OUT uc($_) . "\n";
# print uc($_) . " ";
} print OUT "\n";
It seems that there is a non-English character on line 49, namely the line s/?/e/g;.
So when I run this, I get the warning Quantifier follows nothing in regex.
How can I deal with this problem? How can I make Perl recognize the character? I have to run this code with Perl 5.10.
Another small question: what is the meaning of the "-CS" in the first line?
Thanks to all.
I think your problem is that your editor doesn't handle unicode characters, so the program is trashed before it even gets to perl, and as this apparently isn't your program, it was probably trashed before it got to you.
Until the entire tool chain handles unicode correctly, you have to be careful to encode non-ascii characters in a way that preserves them. It's a pain, and no simple solutions exist. Consult your perl manual for how to embed unicode characters safely.
As per the comment line just before the erroneous line, the character to be replaced is an accented "e"; presumably what is meant is e with an acute accent: "é". Assuming your input is Unicode, it can be represented in Perl as \x{00E9}. See also http://www.fileformat.info/info/unicode/char/e9/index.htm
I guess you copy/pasted this script from a web page on a server which was not properly configured to display the required character encoding. See further also http://en.wikipedia.org/wiki/Mojibake
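As a sketch of that suggestion, assuming the intended character really is U+00E9 and that the input has been decoded as Unicode (the sample string is made up):

```perl
use strict;
use warnings;

# Sample line with U+00E9 written as an escape, so the source file stays pure ASCII.
my $line = "caf\x{00E9} r\x{00E9}sum\x{00E9}";

# Replace e-acute with a plain e; the escape survives any editor or web page.
$line =~ s/\x{00E9}/e/g;

print uc($line), "\n";    # prints "CAFE RESUME"
```

Writing the character as an escape sidesteps the encoding of the script file entirely, which matters here because the file was evidently damaged somewhere along the way.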

In perl pattern matching..how to exclude the \n character from pattern

I am new to perl and writing my first few programs and using its pattern matching abilities. I am reading a file into array like this:
@list=<file>
Then I index each line of the array as $list[0..9] etc. When I match against a pattern, $list[0] includes the \n character, so the match fails: if ($string =~ $list[0]) fails, though it would match without the \n character in the pattern.
How do I tell the pattern matcher to ignore the \n character in the pattern?
Thanks
You can shave the line ends from the array after reading:
@lines = …;
chomp @lines;
Now @lines contains the lines without line ends. See perldoc chomp for details.
If you want to remove the \n from your lines you can:
chomp $list[0]
see perldoc -f chomp for the details.
This is a good opportunity to get to know how Perl modules work.
You can, for example, use Perl6::Slurp, which will a) read the file, b) put the contents in an array, and c) remove the newline characters for you.
For example:
use Perl6::Slurp;
my @lines = slurp '<:utf8', 'filename', {chomp=>"\n"};
This will match with the \n:
if ( $list[0] =~ "$string\n")
Or if you want the \n to be optional:
if ( $list[0] =~ /$string\n?/ )
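A self-contained sketch of the chomp approach, using made-up data and an in-memory filehandle in place of the real file. The \Q...\E quoting is an optional extra, useful when the lines should be matched literally rather than as regex syntax:

```perl
use strict;
use warnings;

# Made-up file contents; an in-memory filehandle stands in for the real file.
my $data = "foo\nbar.baz\n";
open my $fh, '<', \$data or die "Can't open in-memory data: $!";
my @list = <$fh>;
close $fh;

chomp @list;    # removes the trailing newline from every element

my $string  = "xx foo yy";
my $pattern = $list[0];

# \Q...\E quotes regex metacharacters in the interpolated text.
print "matched\n" if $string =~ /\Q$pattern\E/;    # prints "matched"
```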