I am parsing an Excel file that has Japanese text in some of the cells. Using Spreadsheet::ParseExcel (version 0.15, which I know is older than the current release), some of the cells with the characters below:
＜設定B-１コース＞
are appearing as:
print Dumper $oWkc->{_Value};
$VAR1 = "\x{ff1c}\x{8a2d}\x{5b9a}B-\x{ff11}\x{30b3}\x{30fc}\x{30b9}\x{ff1e}";
and
print $oWkc->{Val} . "\n";
[-0
$VAR1 = "\x{ff1c}\x{8a2d}\x{5b9a}B-\x{ff13}\x{30b3}\x{30fc}\x{30b9}\x{ff1e}";
[-0
To get these values printed in their actual form, I set the STDOUT file handle to ":utf8" and my terminal to display UTF-8 (otherwise I get "wide character" warnings). I have to pick out cells containing B-1 or B-2, but I am not sure what to set inside my script so that these characters can be handled the same way I see them on STDOUT.
Currently I am using a regular expression to convert these wide characters to their corresponding ASCII values. As an example, if I want to match B-1, which is stored as 'B-\x{ff11}', I do:
$oWkc->{_Value} =~ /([AB]-)(\x{ff11}|\x{ff12}|\x{ff13})/;
my $lookup = $1 . $2;
$lookup =~ s/\x{ff11}/1/;
$lookup =~ s/\x{ff12}/2/;
$lookup =~ s/\x{ff13}/3/;
For reference, values like B-1 and A-2 come from another source and currently range over [AB]-[1-3].
What is the standard way to deal with these wide characters? I have not managed to get encode/decode and the like to work. Can anyone give me some direction? For now I am able to get the work done using the regex above...
While I did not verify it (I am not going to install a module from March 2001), the module apparently already decodes to Perl native strings, so you do not have to do much. The straightforward way works just fine; there is no need to overcomplicate things with those substitutions.
use utf8;
my $val = '＜設定B-１コース＞';
# does it match A or B, followed by a dash, followed by a fullwidth 1, 2 or 3?
$val =~ /(?:A|B)-[１２３]/; # returns true (1)
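If you later need the plain ASCII digit for a lookup (as in the question), a single tr/// range folds every fullwidth digit at once instead of one substitution per character. A minimal sketch, reusing $val from above (the $ascii variable is only for illustration):
use utf8;
# fold the fullwidth digits U+FF10..U+FF19 to ASCII 0-9 in one pass
(my $ascii = $val) =~ tr/\x{ff10}-\x{ff19}/0-9/;
# $ascii now contains an ASCII "1", so /[AB]-[123]/ matches it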
To deal with multi-byte characters in Spreadsheet::ParseExcel you should update to the latest version and use the FmtJapan formatter. Several bug fixes around Japanese formatting went into recent versions.
Here is an example:
#!/usr/bin/perl
use warnings;
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::FmtJapan;
my $filename = 'Test2000J.xls';
my $parser = Spreadsheet::ParseExcel->new();
my $formatter = Spreadsheet::ParseExcel::FmtJapan->new();
my $workbook = $parser->parse($filename, $formatter);
if ( !defined $workbook ) {
    die "Parsing error: ", $parser->error(), ".\n";
}

# Set your output encoding.
binmode STDOUT, ':encoding(cp932)';
# Or maybe this:
#binmode STDOUT, ':utf8';

for my $worksheet ( $workbook->worksheets() ) {

    print "Worksheet name: ", $worksheet->get_name(), "\n\n";

    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {

            my $cell = $worksheet->get_cell( $row, $col );
            next unless $cell;

            print "  Row, Col    = ($row, $col)\n";
            print "  Value       = ", $cell->value(),       "\n";
            print "  Unformatted = ", $cell->unformatted(), "\n";
            print "\n";
        }
    }
}
I have been able to tokenize an RTF document and then print it to another RTF document. My question is whether or not it is possible to keep the original formatting from the first document (font, font color, background color). There are some things that are randomly colored in the document, so keeping the formatting is important.
Here is the tokenizer code:
#!/usr/bin/perl
use strict;
use warnings;
use RTF::Writer;
use Data::Dumper;
use RTF::Tokenizer;

die "usage: $0 input output\n" unless @ARGV == 2;
my $infile  = shift;
my $outfile = shift;

my $tokenizer = RTF::Tokenizer->new();
$tokenizer->read_file($infile);

my ( $token_type, $argument, $parameter );
{
    # reduce bogus warnings
    no warnings 'uninitialized';
    # get past the header
    ( $token_type, $argument, $parameter ) = $tokenizer->get_token()
        until ( $token_type eq 'control' and $argument eq 'par' );
}

my @final;
while ( $token_type ne 'eof' ) {
    ( $token_type, $argument, $parameter ) = $tokenizer->get_token();
    push @final, $argument if $token_type eq 'text';
}

my $rtf = RTF::Writer->new_to_file($outfile);

my @sorted = sort {
    my @fields_a = split / /, $a;
    my @fields_b = split / /, $b;
    chomp( $a, $b );
    $fields_a[0] cmp $fields_b[0];
} @final;

$rtf->prolog;
$rtf->print(\@sorted);
$rtf->close;
This is what I'm inputting:
{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}
{\colortbl;\red255\green0\blue0;
\red0\green0\blue255;}
\cf1 145747.2545
\cf0 134758.2545
and I want to output these in order with the same formatting. I have already made a sorting script for it.
According to the documentation for RTF::Writer, sequences of RTF commands need to be passed to the print() method as scalar references. For example:
use strict;
use warnings;
use RTF::Writer;
my $rtf = RTF::Writer->new_to_handle(*STDOUT);
while (<DATA>) {
    $rtf->print(\$_);
}
$rtf->close;
__DATA__
{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}
{\colortbl;\red255\green0\blue0;
\red0\green0\blue255;}
\cf1 145747.2545
\cf0 134758.2545
I'm not familiar with the RTF spec, so I don't know whether newlines are desirable here or not. The code above passes the input through unchanged:
{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}
{\colortbl;\red255\green0\blue0;
\red0\green0\blue255;}
\cf1 145747.2545
\cf0 134758.2545
If you just pass a scalar to print() rather than a scalar reference, it looks like some escaping is performed:
\'7b\'5crtf1\'5cansi\'5cdeff0\'7b\'5cfonttbl\'7b\'5cf0 Times New Roman;\'7d\'7d
\line \'7b\'5ccolortbl;\'5cred255\'5cgreen0\'5cblue0;
\line \'5cred0\'5cgreen0\'5cblue255;\'7d
\line \'5ccf1 145747\'2e2545
\line \'5ccf0 134758\'2e2545
\line
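Applying that to the tokenizer script above, one way to keep each colour code attached to the value it colours is to carry the control words seen since the last text token along with that text, and then hand the raw RTF back to print() as scalar references so it is not escaped. This is only an untested sketch: it reuses $tokenizer and $rtf from the original script, assumes each value in the body is immediately preceded by its own \cfN control word, and glosses over re-emitting the colour table in the prolog.
my @rows;
my $pending = '';
while (1) {
    my ( $type, $arg, $param ) = $tokenizer->get_token();
    last if $type eq 'eof';
    if ( $type eq 'control' ) {
        # remember e.g. "\cf1 " until the text it applies to turns up
        $pending .= "\\$arg" . ( defined $param ? $param : '' ) . ' ';
    }
    elsif ( $type eq 'text' ) {
        push @rows, { rtf => $pending . $arg, key => $arg };
        $pending = '';
    }
}
# sort on the text, but emit the raw RTF (control word + text) via scalar refs
for my $row ( sort { $a->{key} cmp $b->{key} } @rows ) {
    $rtf->print( \"$row->{rtf}\n" );
}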
I have a program that prints the contents of two arrays in rows. I would like it to print each array as a column, side by side.
This is the code:
#!/usr/local/bin/perl
use strict;
use warnings;

my @M_array;
my @F_array;

open (my $input, "<", 'ssbn1898.txt');
while ( <$input> ) {
    chomp;
    my ( $name, $id ) = split ( /,/ );
    if ( $id eq "M" ) {
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );

print "M: @M_array \n";
print "F: @F_array \n";
Is this possible or am I trying to do something that can't be done?
Desired format:
M F
Namem1 Namef1
Namem2 Namef2
You can add whatever separator you like between your data by using the join function; the example below prints the data from your arrays separated by tabs:
...
use List::MoreUtils qw/pairwise/;

my $separator = "\t";
print join( $separator, qw(M F) ), "\n";
print join(
    "\n",
    pairwise { ( $a // '' ) . $separator . ( $b // '' ) } @M_array, @F_array
), "\n";
...
I think you should use Perl formats. Have a look at the perlform documentation. You may want to use the @* format field in your case.
I extended your code to print the desired output at the end:
use strict;
use warnings;

my @M_array;
my @F_array;

open (my $input, "<", 'ssbn1898.txt');
while ( <$input> ) {
    chomp;
    my ( $name, $id ) = split ( /,/ );
    if ( $id eq "M" ) {
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );

unshift @M_array, 'M';
unshift @F_array, 'F';

my $namem;
my $namef;

my $max = 0;
$max = ( length($_) > $max ? length($_) : $max ) for @M_array;
my $w = '@' . '<' x $max;

eval "
format STDOUT =
$w @*
\$namem, \$namef
.
";

while ( @M_array or @F_array ) {
    $namem = shift @M_array || '';
    $namef = shift @F_array || '';
    write;
}
join is probably the simplest approach to take; tabs will align your columns nicely.
join ( "\t", @array ),
Alternatively, perl allows formatting via (s)printf:
printf ( "%-10s %-10s", "first", "second" );
Or a more detailed 'format'
Given what you're trying to do is put your two arrays into columns though:
#!/usr/local/bin/perl
use strict;
use warnings;

my $format = "%-10s\t%-10s\n";

my @M_array = qw ( M1 M2 M3 M4 M5 );
my @F_array = qw ( F1 F2 F3 );

my $maxrows = $#M_array > $#F_array ? $#M_array : $#F_array;

printf ( $format, "M", "F" );
for my $rownum ( 0 .. $maxrows ) {
    printf ( $format, $M_array[$rownum] // '', $F_array[$rownum] // '' );
}
This will print a header row, and then loop through your arrays printing one line at a time. // is the defined-or operator: it tests whether the left-hand side is defined. It's only available in newer perls though*. In older versions || will do the trick; it's almost the same, but treats '' and 0 (which are defined but false) differently.
* Perl 5.10 onward, so it is pretty safe, but worth mentioning because some systems are still running Perl 5.8.
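For example (a quick illustration; // needs Perl 5.10+):
my $count = 0;
my $dor = $count // 'default';   # 0         - 0 is defined, so it is kept
my $or  = $count || 'default';   # 'default' - 0 is false, so it is replaced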
You may format the output with the sprintf function, but there are some more problems to solve: What if the arrays don't have the same number of entries? For that you need a placeholder. How many characters must fit into a column? How should it be aligned? Some code for illustration:
#!/usr/bin/perl
use strict;
use warnings;

my @m = (1, 2, 3);
my @f = (11, 22, 33, 44);

# calculate how many rows to display
my $max = @m;
if (@m < @f) {
    $max = @f;
}

# placeholder for missing data
my $none = '-';

# formatting 20 chars per column, left aligned
my $fmt = "%-20s%-20s\n";

# print header
print sprintf($fmt, "M", "F");

# print data rows
foreach my $i (0..$max-1) {
    print sprintf($fmt, ($m[$i] or $none), ($f[$i] or $none));
}
If you are interested in more sophisticated formatting (for instance center-aligned text), you should switch to the special formatting capabilities Perl provides for report generation.
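For example, a centred two-column layout using a format might look like this rough sketch (the field widths are arbitrary, and $m/$f are package variables so the format picture can see them):
our ( $m, $f );

format STDOUT =
@||||||||||||||| @|||||||||||||||
$m,              $f
.

( $m, $f ) = ( 'Namem1', 'Namef1' );
write;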
Borrowing from @HunterMcMillen:
use strict;
use warnings;
use feature "say";

local $, = "\t";   # separator when printing a list

my $i = (@F_array > @M_array) ? $#F_array : $#M_array;
say qw(M F);
say $M_array[$_] // "", $F_array[$_] // "" for 0 .. $i;
I guess Text::Table is the module you want; it does not come with the Perl distribution, so you need to install it from CPAN. Go through the documentation below.
Documentation of Text::Table
You pass the content as an array to the add() method and it does the work for you.
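A minimal sketch, reusing @M_array and @F_array from the question:
use Text::Table;

my $tb = Text::Table->new( 'M', 'F' );
my $last = $#M_array > $#F_array ? $#M_array : $#F_array;
$tb->add( $M_array[$_] // '', $F_array[$_] // '' ) for 0 .. $last;
print $tb->table;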
I am currently extracting text from CSV files using Perl and the module, Text::CSV.
Each of the CSV files has quotation marks around each field. The text is being saved to separate text files with tab-separated columns. I can read and print each column from the text files no problem, but when I try to use the values in a loop, I get the error Unrecognized character \xEF.
An example of my code is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

#### Match ligand data with GPCR interaction data ####

my $csv = Text::CSV->new();

my $file = $ARGV[0];

open (FILE, "<$file");
open (OUT, ">new_$file");

while (my $line2 = <FILE>)
{
    binmode(STDOUT, ":utf8");
    if ($line2 =~ /^(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)$/)
    {
        #### Data from filtered1.txt ####
        my $up_fil     = $1;
        my $ligid_fil  = $2;
        my $units_fil  = $3;
        my $low_fil    = $4;
        my $median_fil = $5;
        my $upper_fil  = $6;
        my $ref        = $7;

        #### Convert negative log affinity values to normal ####
        my $activity = $units_fil;
        $activity =~ s/p//;

        my $value;
        if ($median_fil ne "")
        {
            $value = $median_fil;
            $value = (10**-$median_fil)/(10**-9);
        }
        elsif ($low_fil ne "" and $upper_fil ne "")
        {
            my $lower = $low_fil;
            $lower = (10**-$low_fil)/(10**-9);
            my $upper = $upper_fil;
            $upper = (10**-$upper_fil)/(10**-9);
            $value = "$upper - $lower";
        }
        else
        {
            $value = "n/a";
        }

        #### Match entries from filtered1.txt with ligands.csv ####
        open (LIG, "<ligands.csv");
        while (my $line3 = <LIG>)
        {
            $csv->parse($line3);
            my @ligand_fields = $csv->fields();
            if (!$ligand_fields[14]) { next; }
            if ($ligand_fields[0] eq $ligid_fil)
            {
                #print OUT "$ligand_fields[14]\t$ligand_fields[13]\t$up_fil\t$ligid_fil\t$activity\t$value\t$ref\n";
                print "$ligand_fields[14]\t$ligand_fields[13]\t$up_fil\t$ligid_fil\t$activity\t$value\t$ref\n";
                next;
            }
        }
        close LIG;
    }
}
close FILE;
close OUT;
I've also tried using a regex along the lines of the following, but to no avail.
# remove BOM
${$self->{CODE}} =~ s/^(?:
\xef\xbb\xbf |
\xfe\xff |
\xff\xfe |
\x00\x00\xfe\xff |
\xff\xfe\x00\x00
)//x;
The original CSV files appear not to have any BOM, so I suspect that Text::CSV may be creating it when it is parsing and returning values. I hope this was a clear enough explanation of the problem, and if needed, I can provide more details. Thanks in advance for any advice given.
The documentation of Text::CSV states you should almost certainly be using the binary mode.
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
From https://metacpan.org/pod/Text::CSV#SYNOPSIS.
You may also want to take a look at Text::CSV::Encoded.
I also see you are setting a binmode of :utf8 on STDOUT. There are a couple of problems with that:
You are setting it each time around the loop unnecessarily
The :utf8 binmode does not have good error checking, you should use :encoding(UTF-8) instead
The byte 0xEF can appear in UTF-8 byte sequences, but only in very specific circumstances: it is too high (> 0x7F) to stand on its own as a single-byte character. However, in Perl \xEF or \x{ef} does not refer to the byte 0xEF but to the Unicode code point U+00EF, which is represented in UTF-8 as 0xC3 0xAF. You can see this in a Unicode/UTF-8 character table such as http://www.utf8-chartable.de/.
$ perl -E 'binmode STDOUT, ":encoding(UTF-8)"; say "\xEF";'
ï
So I think that is why your regular expression to remove the BOM didn't work.
I would recommend using three argument open with either '<:encoding(UTF-8)' or '>:encoding(UTF-8)' to open all of your input and output files, and using Text::CSV in binary mode, for the best results.
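Putting those pieces together, a trimmed-down sketch of the ligand-matching part of your script (the file name and column indexes are taken from your code; this is an illustration rather than a drop-in replacement):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

binmode STDOUT, ':encoding(UTF-8)';   # set once, outside any loop

open my $lig, '<:encoding(UTF-8)', 'ligands.csv'
    or die "Cannot open ligands.csv: $!";

while ( my $row = $csv->getline($lig) ) {   # getline() parses a row directly
    next unless $row->[14];
    print join( "\t", @{$row}[ 14, 13, 0 ] ), "\n";
}
close $lig;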
I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so is that embedded in the text file is a six-character string of digits that indicates the line number. In order to use the file, this area needs to be correctly renumbered. (I don't need to re-sort the data, but I do need to replace the current six characters with the new line number, and it must be padded with zeros!) Unfortunately each row is one long string of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (immediately following the "ZZ") need to be renumbered based on the current row number, so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.
Here's the solution I came up with in Perl. It assumes that the numbering is always six digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;

my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
    my $replace = sprintf("%06d", $i++);
    $_ =~ s/ZZ\d{6}/ZZ$replace/g;
    print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
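The same substitution also works as a one-liner, using $. (the current input line number) inside an /e substitution:
perl -pe 's/(?<=ZZ)\d{6}/sprintf "%06d", $./e' data.dat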
If I were to solve this, I would write a simple Python script that reads the lines, filters them as grep does, and keeps an internal counter inside the script.
As a hint, you can read each line into a string and take the slice line[16:22] (characters 17-22 of the line; Python slices are zero-based).
Python strings have a replace method; use it to replace that value with the counter you maintain.
I hope this helps.
To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'
This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;

while( my $line = <> ){
    chomp $line;

    my @elem = unpack 'A16 A6 A*', $line;
    $elem[1] = sprintf '%06d', $.;
    # $. is the line number for the last used file handle

    say @elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines, you can use the following as an example of how to use unpack to split them up.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;

use DateTime;

my @date_elem = qw'
    year month day
    hour minute second
';
my @elem_names = ( @date_elem, qw'
    ZZ
    line_number
    random_data
');

while( my $line = <> ){
    chomp $line;

    my %data;
    @data{ @elem_names } = unpack 'A4 (A2)6 A6 A*', $line;

    # choose either this:
    $data{line_number} = sprintf '%06d', $.;
    say @data{ @elem_names };

    # or this:
    $data{line_number} = $.;
    printf '%04d' . ('%02d' x 5) . "%2s%06d%s\n", @data{ @elem_names };

    # the choice will affect the contents of %data

    # this just shows the contents of %data
    for( @elem_names ){
        printf qq'%12s: "%s"\n', $_, $data{$_};
    }

    # you can create a DateTime object with the date elements
    my $dt = DateTime->new(
        (map{ $_, $data{$_} } @date_elem),
        time_zone => 'floating',
    );
    say $dt;

    print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...

my $rdate = join '', map {"(\\d{$_})"} 4, (2) x 5;
my $rx = qr/$rdate (ZZ) (\d{6}) (.*)/xa;   # interpolating delimiters so $rdate is expanded

while( my $line = <> ){
    chomp $line;

    my %data;
    unless( @data{ @elem_names } = $line =~ $rx ){
        die qq'unable to parse line "$line" ($.)';
    }
    ...
It would be better still to use named capture groups, which were added in Perl 5.10.
...
my $rx = qr'
    (?<year>  \d{4} ) (?<month>  \d{2} ) (?<day>    \d{2} )
    (?<hour>  \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
    ZZ
    (?<line_number> \d{6} )
    (?<random_data> .*    )
'xa;

while( my $line = <> ){
    chomp $line;

    unless( $line =~ $rx ){
        die qq'unable to parse line "$line" ($.)';
    }
    my %data = %+;

    # for compatibility with previous examples
    $data{ZZ} = 'ZZ';
    ...
My text file contains this:
COcoNut,Other,900,21_7_2011,Coimbatore,TEINGKAAY
CotTon,Others,3500,21_7_2011,Coimbatore,PARUTTI
Maize,Others,1200,21_7_2011,Coimbatore,MAKKAACHOOLAM
Bajra,Other,1325,14_7_2011,Coimbatore,KAMBU
Jowar,Other,2750,14_7_2011,Coimbatore,CHOOLAM
Ragi,Other,910,14_7_2011,Coimbatore,KEIZHVARAKU
Coconut,Grade_I,650,12_7_2011,Coimbatore,TEINGKAAY GRADE ONNU
Copra,other,5300,7_7_2011,Coimbatore,KOPPARAI
Paddy,ADT_______36,950,15_7_2011,Madurai,NELLU ADT MUPPATTI AARU
Paddy,AST_16,950,15_7_2011,Madurai,NELLU AST PATINAARU
Here I have COcoNut, CotTon, JOWar, etc., but I want to print them as Coconut, Cotton, Jowar, i.e. the first letter uppercase and the rest lowercase, using a regular expression in Perl and not any packages.
Also, you can see 'Others' in some rows, but I want only 'Other' in the text file. This should be handled by the same expression.
Then the Perl script should read the file and write the result back to the same file, i.e. overwrite it in place.
Can anyone suggest an approach?
#!/usr/bin/perl
use strict;
use warnings;
use IO::InSitu;

my ( $in, $out ) = open_rw ( '/path/to/file' );
my $separator = ',';

while ( <$in> ) {
    chomp;   # drop the trailing newline; we add our own back below
    my @fields = split ( $separator => $_ );
    $fields[ 0 ] = ucfirst lc $fields[ 0 ];
    $fields[ 1 ] =~ s/(?<=other)s//gi;
    print { $out } join ( $separator => @fields ) . "\n";
}
s/^([^,]*)/\u\L$1/;s/,Others,/,Other,/;
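To overwrite the file in place, the same substitutions can be run with perl's -p and -i switches (-i.bak keeps a backup; the file name here is just a placeholder):
perl -i.bak -pe 's/^([^,]*)/\u\L$1/; s/,Others,/,Other,/' yourfile.txt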