Perl non-English Character - perl

See this piece of perl code:
#!/usr/bin/perl -w -CS
use feature 'unicode_strings';
open IN, "<", "wiki.txt";
open OUT, ">", "wikicorpus.txt";
binmode( IN, ':utf8' );
binmode( OUT, ':utf8' );
## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model
while (<IN>) {
# Remove starting and trailing tags (e.g. <s>)
# s/\<[a-z\/]+\>//g;
# Remove ellipses
s/\.\.\./ /g;
# Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words
# Unicode 2026 (horizontal ellipsis)
# Unicode 2013 and 2014 (m- and n-dash)
s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;
# Remove quotes
s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/'' / /g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/?/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/\s+/ /g;
# Remove leading space
s/^\s+//;
chomp($_);
print OUT uc($_) . "\n";
# print uc($_) . " ";
} print OUT "\n";
It seems that there is a non-English character on line 49, namely the line s/?/e/g;.
When I run this, Perl complains: Quantifier follows nothing in regex.
How can I deal with this problem? How can I make Perl recognize the character? I have to run this code with Perl 5.10.
Another small question: what is the meaning of "-CS" in the first line?
Thanks to all.

I think your problem is that your editor doesn't handle Unicode characters, so the program is trashed before it even gets to perl. As this apparently isn't your program, it was probably trashed before it got to you.
Until the entire tool chain handles Unicode correctly, you have to be careful to encode non-ASCII characters in a way that preserves them. It's a pain, and no simple solutions exist. Consult the Perl documentation for how to embed Unicode characters safely.

As per the comment line just before the erroneous line, the character to be replaced is an accented "e"; presumably what is meant is e with an acute accent: "é". Assuming your input is Unicode, it can be represented in Perl as \x{00E9}. See also http://www.fileformat.info/info/unicode/char/e9/index.htm
I guess you copied and pasted this script from a web page on a server that was not properly configured to declare the required character encoding. See also http://en.wikipedia.org/wiki/Mojibake
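A minimal sketch of that suggestion, assuming the input has been decoded to Unicode characters and that the mangled character really was é (U+00E9). The sample string is made up for illustration. Writing the code point as an escape keeps the script source pure ASCII, so an editor or web page cannot mangle it again:

```perl
use strict;
use warnings;

# Hypothetical sample text; \x{E9} is U+00E9 (e with acute accent).
my $text = "caf\x{E9} r\x{E9}sum\x{E9}";

# Replace the accented e by its code point instead of a literal character.
(my $ascii = $text) =~ s/\x{00E9}/e/g;

binmode STDOUT, ':encoding(UTF-8)';
print "$ascii\n";    # prints "cafe resume"
```

In the script from the question, the broken line would then read s/\x{00E9}/e/g; and should work on Perl 5.10.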

Need further understanding in the next unless code I'm reading

I need help with two things in this code that I'm reading. First, I keep seeing this inside a while loop that reads a file:
while (<filename>) {
next unless (/\w/);
chomp;
s/^\s*//;
s/^\s*$//;
my($name, $datatype, $io, $dummy) = split /\s*,\s*/, $_, 4;
}
So I'm wondering what that is doing. There are commas in the line being read, so wouldn't the commas make it go to the next iteration? How would it split the lines if it skips to the next iteration when commas are read?
Another one I'm stumped by is:
while (<AP>) {
chomp;
s/
//g;
}
I have no idea what that code is actually substituting...
Thanks!
The first snippet:
Reads a line from a filehandle called filename. This is a really bad name for a filehandle.
It skips the processing if there is not even a single \w (word character) on the line.
The next unless (/\w/); is the same as next if not (/\w/). Note that there is no need for parentheses: next unless /\w/; is fine.
A word character is, from perlretut:
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
It removes (only) the newline with chomp. Then it removes leading whitespace, if any.
It then replaces a whitespace-only line with an empty string. Such lines have already been skipped by the next, so this substitution has no effect here.
It splits the line on commas, allowing whitespace before and/or after them. It also limits the number of fields returned to 4. This means that it returns the first three comma-separated fields, and then all the rest as one string in the last element of the list.
The second snippet is really bad, whatever it is meant to do. (Remove spaces on the line?) As displayed here, the pattern is a literal line break, so it deletes any newline characters left after chomp; whether that was the intent is unclear.
Comments
It is far better to use lexical filehandles, rather than barenames. So you'd open a file as
open my $fh, '<', $filename or die "Can't open $filename: $!";
and read it by while (my $line = <$fh>) or by while (<$fh>).
Normally you'll see lines skipped if they have nothing other than spaces:
next unless /\S/; # or
next if /^\s*$/;
Using \w also skips lines that do contain characters, just none matched by \w (punctuation only, for example), so one had better be very sure that those are fine to skip.
Here it may be meant to skip a line with commas but no \w (a comma is not matched by \w), for which split would return spaces (or empty strings) in a list. I find this a bit hidden and fragile. I'd drop lines with spaces only, and handle possible loose commas in processing. As it stands it doesn't help with a line like ,,a which yields ('', '', 'a'). So checking is probably needed in any case.
Note that altogether this code leaves trailing spaces on the last field: chomp removes only the newline, the substitutions strip only leading whitespace, and split with the optional limit (its third argument, here 4) does not trim them.
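To make the points above concrete, here is a small sketch of the same loop run over in-memory sample lines. The sample data and the @rows array are made up for illustration; only the loop body follows the code in the question.

```perl
use strict;
use warnings;

# Hypothetical input: one data line, one line with only commas, one blank line.
my @lines = ("  alpha , int, in, x, y  \n", " , , \n", "\n");
my @rows;

for (@lines) {
    next unless /\w/;    # skips the comma-only and the blank line
    chomp;               # removes the trailing newline only
    s/^\s*//;            # removes leading whitespace only
    my @fields = split /\s*,\s*/, $_, 4;    # at most 4 fields
    push @rows, [@fields];
}

# The fourth field holds the rest of the line, trailing spaces included.
print join('|', map { "[$_]" } @{ $rows[0] }), "\n";
# prints "[alpha]|[int]|[in]|[x, y  ]"
```

Only the first line survives the next, and the last field still carries its inner comma and trailing spaces, so any trimming has to be done separately.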

Perl automatically adds newline to output file

I am using Perl to write to a file. It keeps adding a newline to the output file in the same spot even after I use chomp. I cannot figure out why.
Sample Code (reading from an input file, processing the line and then writing that line out to the output file):
open(OUT, "> out.txt");
# ...
while(<STDIN>) {
# ...
my $var = substr($_, index($_, "as "));
chomp($var);
print("Var is: " . $var); # no newline
print OUT $var . ","; # adds newline before the comma
# ...
}
# ...
close(OUT);
Any ideas as to what might be causing this or how to fix it? Thanks.
The canonical procedure:
while(<STDIN>) {
chomp;
# ...
my $var = substr($_, index($_, "as "));
print("Var is: " . $var); # no newline
print OUT $var . ","; # adds newline before the comma
# ...
}
In most operating systems, lines in files are terminated by newlines.
Just what is used as a newline may vary from OS to OS. Unix
traditionally uses \012 , one type of DOSish I/O uses \015\012 , Mac
OS uses \015 , and z/OS uses \025 .
Perl uses \n to represent the "logical" newline, where what is logical
may depend on the platform in use. In MacPerl, \n always means \015 .
On EBCDIC platforms, \n could be \025 or \045 . In DOSish perls, \n
usually means \012 , but when accessing a file in "text" mode, perl
uses the :crlf layer that translates it to (or from) \015\012 ,
depending on whether you're reading or writing. Unix does the same
thing on ttys in canonical mode. \015\012 is commonly referred to as
CRLF.
To trim trailing newlines from text lines use chomp(). With default
settings that function looks for a trailing \n character and thus
trims in a portable way.
In this case you're hitting a cross-platform barrier: you're reading a file written on one OS from a different platform whose line endings are not compatible.
For a one-off run, you should convert your file's line endings to match the host.
To address the issue permanently, you can try: https://metacpan.org/pod/File::Edit::Portable . Thanks #stevieb
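If converting the file is not practical, a common workaround (not the module mentioned above, just a plain substitution) is to strip either kind of line ending explicitly instead of relying on chomp. A sketch, with made-up sample lines standing in for the file contents:

```perl
use strict;
use warnings;

# Hypothetical input lines: one with a DOS ending (CRLF), one with a Unix ending (LF).
my @raw = ("id as one\r\n", "id as two\n");
my @out;

for my $line (@raw) {
    # Remove a trailing LF, together with a CR directly before it if present.
    (my $clean = $line) =~ s/\r?\n\z//;
    push @out, substr($clean, index($clean, "as "));
}

print join(",", @out), "\n";    # prints "as one,as two"
```

With chomp alone on a Unix host, the first value would keep its carriage return, which many terminals and editors display as a line break before the comma.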

Difference between chomp and trim in Perl?

What is the difference between chomp and trim in Perl? Which one is better to use and when?
Chomp: The chomp() function will remove (usually) any newline character from the end of a string. The reason we say usually is that it actually removes any character that matches the current value of $/ (the input record separator), and $/ defaults to a newline.
For more information see chomp.
As rightfold has commented, there is no trim function in Perl. Generally people write a function named trim (any other name works too) to remove leading and trailing whitespace (or single or double quotes, or any other special character).
trim removes whitespace from both ends of a string:
$str =~ s/^\s+|\s+$//g;
trim removes both leading and trailing whitespace; chomp removes only a trailing input record separator (usually a newline character).
Chomp: It removes only a trailing newline (more precisely, a trailing copy of the input record separator $/), if there is one.
More details can be found at: http://perldoc.perl.org/functions/chomp.html
Trim: There is no function called trim in Perl. However, we can write our own function to remove leading and trailing whitespace. The code can be as follows:
# Perl trim function: remove leading and trailing whitespace
sub trim
{
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
More details can be found at: http://perlmaven.com/trim
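A short side-by-side sketch of the difference. The sample string is made up, and trim here is a user-defined helper as described above, not a built-in:

```perl
use strict;
use warnings;

# User-defined helper: strips whitespace from both ends of a copy of the string.
sub trim {
    my ($string) = @_;
    $string =~ s/^\s+|\s+$//g;
    return $string;
}

my $line = "  hello world  \n";

my $chomped = $line;
chomp $chomped;              # removes only the trailing newline
my $trimmed = trim($line);   # removes all leading and trailing whitespace

print "[$chomped]\n";    # prints "[  hello world  ]"
print "[$trimmed]\n";    # prints "[hello world]"
```

So chomp is the right tool for lines just read from a file, and a trim helper is for cleaning up stray whitespace around a value.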

How can I use "s" as a substitution delimiter in Perl?

I was playing with Perl and thought that
sssssss
Would have been the same as
s/s/ss/
It seems only certain delimiters can be used. What are they?
You can use any non-whitespace character as the delimiter, but you can't use the delimiter inside PATTERN or REPLACEMENT without escaping it. This is totally valid:
my $x = 's';
$x =~ s s\ss\s\ss;
print $x; # prints "ss"
Note that a space is required after the first s, or else it will be interpreted as the identifier ss.
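For comparison, a short sketch with more conventional delimiters. The sample string is hypothetical. Bracketing delimiters such as braces come in pairs, one pair for the pattern and one for the replacement, and choosing a delimiter that does not occur in the pattern avoids escaping altogether:

```perl
use strict;
use warnings;

my $path = 'a/b/c';    # hypothetical sample string

# Braces as delimiters: the slash in the pattern needs no escaping.
(my $colons = $path) =~ s{/}{::}g;

# A single non-bracketing character works the same way.
(my $dashes = $path) =~ s#/#-#g;

print "$colons\n";    # prints "a::b::c"
print "$dashes\n";    # prints "a-b-c"
```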

Perl: How to remove spaces and blank lines in one pass

I have two Perl scripts: the first removes blank lines from a file and the second removes all spaces in a file. Is it possible to combine both operations in one script?
To remove spaces, I used this transliteration: $str =~ tr/ //d;
and for blank lines, I used this loop:
while (<$file>) {
    if (/\S/) {
        print $new_file $_;
    }
}
It should be really easy: just add tr/ //d before the if line.
Note: It will remove lines containing spaces only, too. If you want to keep them (but transliterated to empty lines), insert the transliteration before the print line.
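Put together, the combined loop might look like the sketch below. The in-memory filehandles and the sample text are assumptions made so the example runs on its own; in the real script $file and $new_file would be opened on actual files.

```perl
use strict;
use warnings;

# Hypothetical input: two text lines, an empty line, and a spaces-only line.
my $input  = "a b c\n\n   \nd  e\n";
my $output = '';

open my $file,     '<', \$input  or die "Cannot open in-memory input: $!";
open my $new_file, '>', \$output or die "Cannot open in-memory output: $!";

while (<$file>) {
    tr/ //d;                  # delete every space on the line
    if (/\S/) {               # keep only lines that still have content
        print {$new_file} $_;
    }
}

close $new_file;
close $file;

print $output;    # prints "abc" and "de" on separate lines
```

Because the transliteration runs before the test, the spaces-only line becomes empty and is dropped along with the blank line.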
If you want to trim trailing whitespace from each line,
you might do it like this:
perl -pi -e 's/\s*$/\n/' f1 f2 f3 #UNIX file format
perl -pi -e 's/\s*$/\r\n/' f1 f2 f3 #DOS file format