Regular Expression for # Array in Perl - perl

Actually i m using one perl script to prepare Readme txt for my builds. in that script i used
foreach $line (<LOG>)
{
if(length(trim($line))>0)
{
$line=trim($line);
$line=~ s/[\r]//gs;
if(rindex($line,'#')!=-1)
{
$icut=substr($line,0,index($line,']'));
$icut2=substr($icut,index($icut,'#')+1,length($icut));
}
push(#issue,$icut2);
it's fetching correct issue no but when the situation comes like
[I#1303350], [I#1270918],[I#1312521] Updated Physical Confirmation Template based on CO
then it's fetching only one issue no not all issue i that same line. so i modified my code like
foreach $revno(<REV>)
{
if(length(trim($revno))>0)
{
$revno=trim($revno);
$revno=~ s/[\r]//gs;
if(rindex($revno,'#')!=-1)
{
$revcut=substr($revno,0,rindex($revno,']'));
print "$revcut\n";
$revcut1=substr($revcut,index($revcut,'#')+1,length($revcut));
}
}
push(#issue,$revcut1);
now it's fetch it all revision no but output is like 1312588,1303350], [I#1270918],[I#1312521 but i want to remove the # [ ] I only but not , so pls tell me how can i parse this through regex.

This can be done without regular expressions: Transliterate: tr///
use warnings;
use strict;
my $s = '1312588,1303350], [I#1270918],[I#1312521';
$s =~ tr/ ][#I//d;
print "$s\n";
__END__
1312588,1303350,1270918,1312521

You can do it like this:
echo "[I#1303350], [I#1270918],[I#1312521]" | perl -lnwe "print for m/#(\d+)/g"

This works for me:
my #issues = $line_of_data =~ m/#(\d+)/g;
And if you want commas, it's far easier to do this:
my $with_commas = join( ', ', #issues );
And you still have the individual issues numbers as "atoms" of data.

Related

Regular expression to validate an email in perl

I am new in Perl and trying to catch hold of the scripting language where I come across regular expression to validate a email address. I am sharing the perl script. I am not sure where I am making mistake in it. The \# part is omitted always following which the correct email id is also showing as invalid.
Here is the code :
#!/usrs/bin/perl/
$string = "XYZ#yahoo.com";
if ( $string =~ /([a-zA-Z]+)\#([a-zA-Z]+)\.(com|net|org)/)
{
print "TRUE";
print $string;
}
else
{
print "FALSE";
print $string;
}
Thanks for your help.
The regex for validating an email address is included in the source code for Email::Valid. I've copied it below:
$RFC822PAT = <<'EOF';
[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*#[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>#,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>#,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>#,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:#[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>#,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*#[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>#,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*#[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>#,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>#,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)
EOF
But your actual problem is this line:
$string = "XYZ#yahoo.com";
The #yahoo looks like an array variable to Perl, and and you don't have an array called #yahoo it gets replaced with an empty string. Try printing the value of $string to see what you get.
The solution is either to use single quotes (so the array variable isn't expanded):
$string = 'XYZ#yahoo.com';
Or to escape the # with a \:
$string = "XYZ\#yahoo.com";
Alway add use strict and use warnings to your Perl programs. They would have told you what the problem is.
#!/usr/bin/perl
use strict;
use warnings;
use Email::Valid;
my $email_address = 'a.n#example.com';
unless( Email::Valid->address($email_address) ) {
print "Sorry, that email address is not valid!";
}
Reference Site: http://learn.perl.org/examples/email_valid.html
For Regex Pattern try this:
my $pattern= '^([a-zA-Z][\w\_\.]{6,15})\#([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$';
Reference Site: https://www.experts-exchange.com/articles/8652/Email-validation-using-Regular-Expression-in-Perl.html
Change your if condition to
if ($string =~ /^[a-z0-9A-Z][A-Za-z0-9.]+[A-Za-z0-9]\#[A-Za-z0-9.-]+$/)
and change
$string = "XYZ#yahoo.com"; to
$string = 'XYZ#yahoo.com';
Refference : http://perlmaven.com/email-validation-using-regular-expression-in-perl
for details.
Try
if ($email =~ /^[a-z0-9]([a-z0-9.]+[a-z0-9])?\#[a-z0-9.-]+$/)
And some test code:
#!/usr/bin/perl
my $email = "john.doe\#acme.org";
if ($email =~ /^[a-z0-9]([a-z0-9.]+[a-z0-9])?\#[a-z0-9.-]+$/) {
print "Valid email\n";
} else {
print "Not valid email\n";
}
exit;
Output:
Valid email

How to extract directory names from a path in Perl

I have a path like this
/home/user/doc/loc
I want to extract home, user, doc, loc separately. I tried split (////) and also split("/")
but none of them worked. Please give me sample script:
while (<EXPORT>) {
if (/^di/) {
($key, $curdir) = split(/\t/);
printf "the current dir is %s\n", $curdir;
printf("---------------------------------\n");
($home_dir, $user_dir, $doc_dir, $loc_dir) = split("/");
}
}
But it didn't work; hence please help me.
Given $curdir containing a path, you'd probably use:
my(#names) = split m%/%, $curdir;
on a Unix-ish system. Or you would use File::Spec and splitdir. For example:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;
my $curdir = "/home/user/doc/loc";
my(#names) = split m%/%, $curdir;
foreach my $part (#names)
{
print "$part\n";
}
print "File::Spec->splitdir()\n";
my(#dirs) = File::Spec->splitdir($curdir);
foreach my $part (#dirs)
{
print "$part\n";
}
Ouput (includes a leading blank line):
home
user
doc
loc
File::Spec->splitdir()
home
user
doc
loc
split's first result will be the string preceding the first instance of the regular expression passed to it. Since you have a leading "/" here you would get an empty string in $home_dir, 'user' in $user_dir and so on. Add undef to the list assignment's first position or alternatively trim a leading slash first.
Also I'm not sure if you can call split without passing it $curdir here. Try:
(undef, $home_dir, $user_dir, $doc_dir, $loc_dir) = split("/", $curdir);

Perl: How to detect which file exists among foo.(txt|abc)

My perl script needs to detect the extension of an existing file and print out the filename. The input that specifies the filename with a vague extension would be in this format:
foo.(txt|abc)
and the script would print "foo.txt" if it exists. If foo.txt does not exist and foo.abc exists, then it would print "foo.abc."
How can I do this detection and printing of the correct existing file in a neat and clean way?
Thanks!
Actually, you've almost got the regular expression right there: the only thing you need to do is escape the . with a backslash (since . means "any character except the newline character" in regular expressions), and it would also help to put a ?: inside of the parentheses (since you don't need to capture the file extension). Also, ^ and $ denote markers for the beginning and the end of the string (so we're matching the entire string, not just part of a string...that way we don't get a match for the file name "thisisnotfoo.txt")
Something like this should work:
use strict;
use warnings;
my $file1="foo.txt";
my $file2="foo.abc";
my $file3="some_other_file";
foreach ($file1,$file2,$file3)
{
if(/^foo\.(?:txt|abc)$/)
{
print "$_\n";
}
}
When the above code is run, the output is:
foo.txt
foo.abc
Take a look at perldoc perlretut for more stuff about regular expressions.
You may want to look at glob, but you'd have to use a different syntax. The equivalent would be:
foo.{txt,abc}
See File::Glob for more information. Also note that this will return a list of all of the matches, so you'll have to do your own rules if it should prefer one when multiple exist.
sub text_to_glob {
my ($s) = #_;
$s =~ s/([\\\[\]{}*?~\s])/\\$1/g;
return $s;
}
my $pat = 'foo.(txt|abc)';
my #possibilities;
if (my ($base, $alt) = $pat =~ /^(.*\.)\(([^()]*)\)\z/s) {
#possibilities = glob(
text_to_glob($base) .
'{' . join(',', split(/\|/, $alt)) . '}'
);
} else {
#possibilities = $pat;
}
for my $possibility (#possibilities) {
say "$possibility: ", -e $possibility ? "exists" : "doesn't exist";
}
glob, but also see File::Glob
-e
use strict;
use warnings;
FILE:
for (glob "file.{txt,abc}") {
if (-f $_) {
print $_, "\n";
last FILE;
}
}

help in parsing

I am having a XML file as shown below,
<message1>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message1>
<message2>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message2>
I have to parse the values (val) and i could not use XML::Simple module. The parsing should be started from <message1> and i have to put the values in an array till </message1> and then i have to repeat this for <message2> till </message2>.
Pictorially it is like
<message1>
----100
----200
----300
----400
</message1>
<message2>
----100
----200
----300
----400
</message2>
Can any one help me .. I am struggling a lot
Thanks
Senthil kumar
Since we're back in 1999, I think I would forget about strict and warnings, use symbolic references and string eval, and be done with it:
#!/usr/bin/perl
while( <DATA>)
{ s{<(message\d)>}{\#$1=(}; # #message1=(
s{<val\d>}{}; #
s{<\/val\d>}{,}; # ,
s{</message\d>}{);}; # );
$s.=$_;
};
eval $s;
$,= ", "; $\= "\n";
foreach (1..2) { print "\#message$_: ", #{"message$_"}; }
__DATA__
<message1>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message1>
<message2>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message2>
(in case that's not clear: that's a joke! As they say "Have you tried using an XML parser instead?")
Assuming your input is completely regular as you show, the following should work.
But you are far better off getting a real XML parser to work, by wrapping a root element around all your content or by parsing each message separately.
use strict;
use warnings;
my %data;
while (<>) {
# skip blank lines
next unless /\S/;
my ($tag) = /^<(.*)>$/
or warn("expected tag, got $_ "), next;
$data{$tag} ||= [];
while (<>) {
last if /^<\/\Q$tag\E>$/;
my (undef, $value) = /^<val(\d+)>(.*)<\/val\1>$/
or warn("expected val, got $_ "), next;
push #{ $data{$tag} }, $value;
}
}
use Data::Dumper;
print Dumper \%data;

Why are some characters missing when I converted my Perl script to executable using Perlapp?

Before posting my question to the ActiveState forum, I'd like to try luck here :)
I'm trying to convert a simple script of mine to .exe file using Perlapp (version 8.1). The Perl script works fine and it seems Perlapp also did its job successfully.
But the converted .exe file has some weird behavior, which, I believe, must be related to utf-8 encoding. For example, the Perl script would yield the result like:
hàn huáng zhòng sè sī qīng guó
But running the executable file would give me only this:
h hu zh s s q gu
I've already configured Perlapp so that utf8.pm is explicitly added but the problem just refuses to go away. I've tried something else. For example,
binmode DATA, ":utf8";
and
">:encoding(utf8)"
but without any luck;
Can anyone kindly give me some hint as to what might be the reason? Thanks like always :)
I can post the whole code here but it seems unnecessary so I just paste some snippets of the code that I think is relevant to the problem:
use utf8;
%zidian = map {chomp;split/\s+/,$_,2} <DATA>;
open my $in,'<:utf8',"./original.txt";
open my $out,'>:utf8',"./modified.txt";
if ( $code~~%zidian) {
$value = lc$zidian{$code};
}
__DATA__
3400 Qiū
3401 TIǎN
3404 KUà
3405 Wǔ
And one more thing, I'm running ActivePerl 5.10.0.on Windows XP (Chinese Version) and the script is saved as utf-8 encoding without BOM. PerlApp cannot handle a script that has BOM.
Edit
If I were to give a workable snippet, then I suppose it's like giving the whole code because I'm using three inter-connected sub-routines, which I take with some modifications from Lingua::han::Pinyin module and Lingua::han::Utils module.
#! perl
# to make good vertical alignment,
# set font family to SonTi and font size to Four(12pts)
use utf8;
sub Unihan {
my $hanzi = shift;
my #unihan = map { uc sprintf("%x",$_) } unpack ("U*", $hanzi);
}
sub csplit {
my $hanzi = shift;
my #return_hanzi;
my #code = Unihan($hanzi);
foreach my $code (#code) {
my $value = pack("U*", hex $code);
push #return_hanzi, $value if ($value);
}
return wantarray ? #return_hanzi : join( '', #return_hanzi );
}
%zidian = map {chomp;split/\s+/,$_,2} <DATA>;
sub han2pinyin {
my $hanzi = shift;
my #pinyin;
my #code = Unihan($hanzi);
foreach $code (#code) {
if ( $code~~%zidian) {
$value = lc$zidian{$code};
}
else {
$value = " ";
}
push #pinyin, $value;
}
return wantarray ? #pinyin : join( '', #pinyin );
}
open $in,'<:utf8',"./original.txt";
seek $in, 3,0;
open $out,'>:utf8',"./modified.txt";
while(<$in>){
s/(.{18})/$1\n/g;
push #tmp, $_;
}
foreach (#tmp){
my #hanzi;
my #pinyin;
#hanzi = csplit($_);
my $hang = join "", #hanzi;
#pinyin = han2pinyin($hang);
for ( my $i = 0; $i < #hanzi && $i < #pinyin; ++$i ) {
if ( $hanzi[$i] =~ /[\xEFBC8C]|[\xE38082]|[\xEFBC81]|[\xEFBC9F]|[\xE2809C]|[\xE2809D]|[\xEFBC9A]/ ) {
splice(#pinyin, $i, 0," ");
}
}
printf $out "%-7s" x #pinyin, #pinyin;
print $out "\n";
printf $out "%-6s" x #hanzi, #hanzi;
print $out "\n";
}
__DATA__
3400 Qiū
3401 TIǎN
3404 KUà
3405 Wǔ
ActiveState hasn't given me any help yet. Whatever. Now I've figured out a workaround for my problem and this workaround looks very weird.
First I added some otherwise useless utf-8 encoded characters to my DATA section, like the following:
__DATA__
aardvark 'ɑ:dvɑ:k
aardwolf 'ɑ:dwulf
aasvogel 'ɑ:sfәugәl
3400 Qiū
3401 TIǎN
3404 KUà
3405 Wǔ
And then I removed the use utf8; pragma from my script;
and then I removed the utf8 flag from the following line of code:
open $out,'>:utf8',"./modified.txt";
Now it becomes
open $out,'>',"./modified.txt";
But I had to let the following line of code unchanged:
open $in,'<:utf8',"./original.txt";
Then everything was okay except that I'd receive "wide characters in print" warnings. But I added another line of code:
no warnings;
And then I Perlapped my script and everything worked fine :)
This is really strange. I'm suspecting this problem is somehow OS specific. It's also quite likely that there's something wrong with my Windows system. And I also tried Perl2exe and the compiled executable gave me some "memory 0010c4 cannot be read" error. Whatever. My problem is somehow fixed by myself :)