Problems reading header line from my Excel 2007 files created with Perl - perl

I have a problem with merging two dynamically created Excel 2007 files.
My files are created with the Perl Module Excel::Writer::XLSX on Solaris.
Say I have two files, fileA.xlsx and fileB.xlsx. Now I want to merge them together (fileA + fileB => fileC).
It is not really possible at this time to append fileB to fileA. This is a limitation of Excel::Writer::XLSX, which can only create new files.
Both .xlsx files can be opened without complaints in Excel 2007, in LibreOffice 3 (on linux), and (with the help of Microsoft's xlsx to xls converters) even in Excel 2003.
However, when I open them with perl (using the module Spreadsheet::XLSX), the contents of the header row, (row 0) are always skipped;
# ...
foreach my $infile (#infiles) {
my $excel = Spreadsheet::XLSX->new($infile);
my $i = 0;
foreach my $sheet ( #{ $excel->{Worksheet} } ) {
printf( "Infile '$infile', Sheet $i: %s\n", $sheet->{Name} );
$sheet->{MaxRow} ||= $sheet->{MinRow};
print "$infile: " . $sheet->{MaxRow} . " rows\n";
print "data starts at row: " . $sheet->{MinRow} . ". \n";
next unless $i == 0; # only copy data from the first sheet (for speed)
my $start_row = $sheet->{MinRow};
foreach my $row ( $start_row .. $sheet->{MaxRow} ) {
$sheet->{MaxCol} ||= $sheet->{MinCol};
foreach my $col ( $sheet->{MinCol} .. $sheet->{MaxCol} ) {
my $cell = $sheet->{Cells}[$row][$col];
if ($cell) {
# do something with the data
# ...
# write to outfile
$excel_writer->sheets(0)->write($dest_row, $col, $cell->{Val} )
}
}
}
}
}
Now, the ouput of this code fragment is always
data starts at row: 1.
But this is not true, it starts at row 0. If I manually go to read in data from row0, $cell is undefined (although it shouldn't be).
Interestingly, when I open the file in Microsoft Excel, and change it trivially, (say, by adding a blank space to one of the cell values in the header row), and save the file, then the header row IS found by the code above.
data starts at row: 0.
By the way, when I open, change, save the file in LibreOffice, there are numerous warnings concerning date values when I re-read them with the code above. (Thus, datetime values seem to be saved slightly incorrectly by LibreOffice).
The code that produces the files looks like this (note: some vars are defined outside of this sub):
sub exportAsXLS {
#require Spreadsheet::WriteExcel;
require Excel::Writer::XLSX;
my ( $data, $dir, $sep, #not2export ) = #_;
my $val;
my $EXCEL_MAXROW = 1048576;
return undef unless $data;
return "." unless scalar #$data > 0;
my $time = time2str( "%Y%m%d_%H%M%S", time() );
my $file = "$outdir/$dir/${host}_${port}-${time}.xlsx";
#my $workbook = Spreadsheet::WriteExcel->new($file);
my $workbook = Excel::Writer::XLSX->new($file);
$workbook->set_optimization();
my $worksheet = $workbook->add_worksheet();
# Set the default format for dates.
#my $date_formatHMS = $workbook->add_format( num_format => 'mmm d yyyy hh:mm AM/PM' );
#my $date_formatHMS = $workbook->add_format( num_format => 'yyyy-mm-ddThh:mm:ss.sss' );
my %formats;
$formats{date_HM} = $workbook->add_format( num_format => 'yyyy-mm-ddThh:mm' );
$formats{date_HMS} = $workbook->add_format( num_format => 'yyyy-mm-ddThh:mm:ss' );
$formats{num} = $workbook->add_format();
$formats{num}->set_num_format();
$formats{headline} = $workbook->add_format();
$formats{headline}->set_bold();
$formats{headline}->set_num_format('#');
# Format as a string. use the Excel text format #:
# Doesn't change to a number when edited
$formats{string} = $workbook->add_format( num_format => '#' );
$worksheet->set_row( 0, 15, $formats{headline} );
my $row = 0;
my $col = 0;
for ( my $r = -1 ; $r < #$data && $r < $EXCEL_MAXROW ; $r++ ) {
for ( my $i = 0 ; $i < #$column ; $i++ ) {
next if grep( $_ eq $column->[$i], #not2export );
my $val = $data->[$r]{ $column->[$i] };
my $t = int $type->[$i];
if ( $r < 0 ) {
#warn " type: $type->[$i] , ";
# Erste Zeile = Spaltennamen ausgeben
$worksheet->write_string( $row, $col++, $column->[$i], $formats{string});
#$worksheet->write_comment( 0, 0, "\x{263a}" ); # Smiley
#$worksheet->write( $row, $col++, $column->[$i], $formats{headline} );
} elsif ( ( $t == 11 ) or ( $t == 9 ) ) {
# 11 - Der Wert ist ein Datum, im SHORT Format, 9- long
$val = time2str( "%Y-%m-%dT%H:%M:%S", str2time( $data->[$r]{ $column->[$i] } ) );
$worksheet->write_date_time( $row, $col++, $val, $formats{date_HMS} );
} else {
$worksheet->write( $row, $col++, $val );
}
}
$col = 0;
$row++;
}
return $file;
}
The difference between the files is as follows.
On the left is the file that Excel::Writer::XLSX produces. ON the right is the file that MS Excel 2003 produces after a trivial change to the header row. the row header data is refactored, externalized to a different file, sharedStrings.xml
Which looks like this.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="5" uniqueCount="5">
<si>
<t>SITE</t>
</si>
<si>
<t>LOG_DATE</t>
</si>
<si>
<t>KTZI201_WF_TEMPERATUR</t>
</si>
<si>
<t>KTZI300_TEMP_RESERVOIR</t>
</si>
<si>
<t>XPEDITION</t>
</si>
</sst>
Spreadsheet::XLSX can read the header if the .xlsx file is formatted as shown on the right half of the picture, but skips the header row when formatted as shown on the left half.

When I run your program against the output of this Excel::Writer::XLSX example program it correctly reports data in the first row (row == 0):
Infile 'a_simple.xlsx', Sheet 0: Sheet1
a_simple.xlsx: 10 rows
data starts at row: 0.
Perhaps you should double check the program that is producing the input files.
Also, make sure you are on the latest version of Excel::Writer::XLSX.

Related

Perl script is producing the symbol  while converting an Excel file to CSV

We have a batch process in our system which will convert an Excel .xlsx file to CSV format using Perl. When it converts the CSV file it produces some symbols like Â, so I am not getting the expected result. Can some please help me how to use the same value as in the Excel file while converting to CSV?
Value in Excel file:
Unverifiable License Documentation NB Only
Value converted in CSV through Perl:
Unverifiable License Documentation – NB Only
I want to retain the same value that is in Excel while converting to CSV
Note: I used Encoding(UTF-8) while opening the file but even then it didn't work.
My Perl code
use Spreadsheet::XLSX;
use File::Basename;
use set_env_cfg;
use Date::Simple (':all');
use Math::Round;
$sts = open( INP, "< ${if}" );
#$sts = open (INP, '<:encoding(UTF-8)', ${if} );
#$sts = open (INP, '<:encoding(ISO-8859-1)', ${if} );
if ( $sts == 0 ) {
print LOG tmstmp() . ": Error opening input file\n";
close LOG;
print LOG "$ldlm\n";
`cp $lf $od`;
die;
}
print LOG "$ldlm\n";
print LOG tmstmp() . ": Conversion started for $if\n";
$oBook = Spreadsheet::XLSX->new($if);
foreach $WkS ( #{ $oBook->{Worksheet} } ) {
print LOG tmstmp() . ": Converting worksheet ----- " . $WkS->{Name}, "\n";
$cfgrec = ''; # initialize the configure record
$sts = open( OUT, ">$od/$WkS->{Name}.txt" );
if ( $sts == 0 ) {
print LOG tmstmp() . ": Error opening output file\n";
close LOG;
close INP;
print LOG "$ldlm\n";
`cp $lf $od`;
die;
}
$WkS->{MaxRow} ||= $WkS->{MinRow};
foreach $iR ( $WkS->{MinRow} .. $WkS->{MaxRow} ) {
$WkS->{MaxCol} ||= $WkS->{MinCol};
print OUT $cfgkey if ( ( $cfgko == 0 ) && ( $iR >= $hdrcnt ) );
foreach $iC ( $WkS->{MinCol} .. $WkS->{MaxCol} ) {
$cell = $WkS->{Cells}[$iR][$iC];
if ($cell) {
if ( ( $cell->{Type} ) eq "Date" ) {
if ( int( $cell->{Val} ) == ( $cell->{Val} ) ) {
$tmpval = date("1900-01-01") + ( $cell->{Val} ) - 2;
}
else {
$css = round( ( ( $cell->{Val} ) - int( $cell->{Val} ) ) * 86400 );
$cmi = int( $css / 60 );
$chr = int( $css / 3600 );
$css = $css - $cmi * 60;
$cmi = $cmi - $chr * 60;
$tmpval = date("1900-01-01") + int( $cell->{Val} ) - 2;
$tmpval .= " $chr:$cmi:$css";
}
}
else {
$tmpval = Spreadsheet::XLSX::Utility2007::unescape_HTML( $cell->{Val} );
}
print OUT $tmpval; ###Added double quotes in txt file to handle the comma delimiter value
}
if ( ( $iR == ${hdr_seq} - 1 ) ) {
if ( ( $cell->{Type} ) eq "Date" ) {
if ( int( $cell->{Val} ) == ( $cell->{Val} ) ) {
$tmpval = date("1900-01-01") + ( $cell->{Val} ) - 2;
}
else {
$css = round( ( ( $cell->{Val} ) - int( $cell->{Val} ) ) * 86400 );
$cmi = int( $css / 60 );
$chr = int( $css / 3600 );
$css = $css - $cmi * 60;
$cmi = $cmi - $chr * 60;
$tmpval = date("1900-01-01") + int( $cell->{Val} ) - 2;
$tmpval .= " $chr:$cmi:$css";
}
}
else {
$tmpval = Spreadsheet::XLSX::Utility2007::unescape_HTML( $cell->{Val} );
}
$cfgrec .= $tmpval;
}
if ( ( $iC == 0 ) && ( $iR == ${hdr_seq} ) ) {
$cfgrec = uc($cfgrec);
$cfgko = cnt_ocr( $cfgrec, $keyhdr );
$cfgkey = "*|" x ( $klm - $cfgko );
}
print OUT "|" if ( $iC < $WkS->{MaxCol} );
print OUT $cfgkey if ( ( $cfgko == $iC + 1 ) && ( $iR >= $hdrcnt ) );
}
print OUT "\n";
}
print LOG tmstmp() . ": Worsheet conversion completed successfully ----- " . $WkS->{Name}, "\n";
close OUT;
push #csv_file_lst, "$WkS->{Name}.txt";
}
print LOG tmstmp() . ": Conversion completed successfully for $if\n";
My guess is that your Excel file contains data encoded using the Windows-1252 code page that has been reencoded into UTF-8 without first being decoded
This string from your Excel file
Unverifiable License Documentation – NB Only
contains an EN DASH, which is represented as "\x96" in Windows-1252. If this is again encoded into UTF-8 the result is the two bytes "\xC2\x96". Interpreting this using Windows-1252 results in the two characters LATIN CAPITAL LETTER A WITH CIRCUMFLEX followed by EN DASH, which is what you're seeing
As far as I can tell, the only change necessary is to open your file with Windows-1252 decoding, like this
open my $fh, '<:encoding(Windows-1252)', $excel_file or die $!
Update
Your revised question shows your Perl code, but has removed the essential information from the Excel data that you show. This string
Unverifiable License Documentation NB Only
now has just two spaces between Documentation and NB and omits the "0x96" n-dash
Note — I've since restored the original data and tidied your code.
Your various attempts at opening the input file are here
$sts=open (INP, "< ${if}" );
#$sts=open (INP, '<:encoding(UTF-8)', ${if} );
#$sts=open (INP, '<:encoding(ISO-8859-1)', ${if} );
and you came very close with ISO-8859-1, but Microsft, in their wisdom, have reused the gaps in ISO-8859-1 encoding between 0x7F and 0x9F to represent printable characters in Windows-1252. The n-dash character at 0x96 is inside this range, so decoding your input as ISO-8859-1 won't render it correctly
As far as I can see, you just need to write
$sts = open (INP, '<:encoding(Windows-1252)', ${if} );
and your input data will be read correctly
You should also specify the encoding of your output file to avoid Wide character in print warnings and malformed data. I can't tell whether you want to duplicate the encoding of your Excel file, use UTF-8, or something else entirely, but you should change this
$sts = open( OUT, ">$od/$WkS->{Name}.txt" );
to either
$sts = open OUT, '>:encoding(Windows-1252)', "$od/$WkS->{Name}.txt";
or
$sts = open OUT, '>:encoding(UTF-8)', "$od/$WkS->{Name}.txt";
as appropriate
Note also that it is best practice to use the three-parameter form of open all the time, and it is best to use lexical file names instead of the global ones that you have. But this isn't a code review, so I've disregarded those points
I hope this underlines to you that it is vital to establish the encoding of your input data and decode it correctly. Guessing really isn't an option
Update
My apologies. I overlooked that the initial open is ignore by the Spreadsheet::XLSX module, which is passed a filename, rather than a file handle
This module is awkward in that it completely hides all character decoding, and relies on [Text::Iconv][Text::Iconv] to do the little conversion that it supports: something that is much better supported by Perl's own [Encode][Encode] module
The change I suggested to your open call is wrong, because it seems that a .xlsx file is a zipped file. However you never read from INP so it will make no difference. You should also close INP immediately after you have opened it as it is a wasted resource
Short of using a different module, the best thing I can suggest is that you hack the data returned by Spreadsheet::XLSX->new
This block will correct the erroneous re-encoding. I have added it right before your foreach $iR ( ... )` loop
You will need to add
use Encode qw/ decode :fallbacks /;
to the top of your code
Please let me know how you get on. Now I really must go!
{
my $columns = $WkS->{Cells};
for my $row ( #$columns ) {
next unless $row;
for my $cell ( #$row) {
next unless $cell and $cell->type eq 'Text';
for ( $cell->{_Value} ) {
$_ = decode('UTF-8', $_, FB_CROAK);
$_ = decode('Windows-1252', $_, FB_CROAK);
}
}
}
}

How to delete entire column in Excel sheet and write updated data in new excel file using Perl?

I am new to Perl. I have excel file say "sample.xls" which looks like follows.
Sample.xls
There are about data of 1000 rows like this. I want to parse this file and write it in another file say "output.xls" with following output format.
output.xls
I have written a script in perl, however, it doesn't give me the exact output the way I want. Also, looks like the script is not very efficient. Can anyone guide me how I can improve my script as well as have my output as shown in "output.xls" ??
Here's the Script:
#!/usr/bin/perl –w
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;
use Spreadsheet::WriteExcel::Chart;
# Read the input and output filenames.
my $inputfile = "path/sample.xls";
my $outputfile = "path/output.xls";
if ( !$inputfile || !$outputfile ) {
die( "Couldn't find file\n" );
}
my $parser = Spreadsheet::ParseExcel->new();
my $inwb = $parser->parse( $inputfile );
if ( !defined $inwb ) {
die "Parsing error: ", $parser->error(), ".\n";
}
my $outwb = Spreadsheet::WriteExcel->new( $outputfile );
my $inws = $inwb->worksheet( "Sheet1" );
my $outws = $outwb->add_worksheet("Sheet1");
my $out_row = 0;
my ( $row_min, $row_max ) = $inws->row_range();
my ( $col_min, $col_max ) = $inws->col_range();
my $format = $outwb->add_format(
center_across => 1,
bold => 1,
size => 10,
border => 4,
color => 'black',
border_color => 'black',
align => 'vcenter',
);
$outws->write(0,0, "Item Name", $format);
$outws->write(0,1, "Spec", $format);
$outws->write(0,2, "First name", $format);
$outws->write(0,3, "Middle Name", $format);
$outws->write(0,4, "Last Name", $format);
$outws->write(0,5, "Customer Number", $format);
$outws->write(0,6, "Age", $format);
$outws->write(0,7, "Units", $format);
my $col_count = 1;
#$row_min = 1;
for my $inws ( $inwb->worksheets() ) {
my ( $row_min, $row_max ) = $inws->row_range();
my ( $col_min, $col_max ) = $inws->col_range();
for my $in_row ( 2 .. $row_max ) {
for my $col ( 0 .. 0 ) {
my $cell = $inws->get_cell( $in_row, $col);
my #fields = split /_/, $cell->value();
next unless $cell;
$outws->write($in_row,$col, $cell->value());
$outws->write($in_row,$col+1, $fields[1]);
}
}
for my $in_row ( 2 .. $row_max ) {
for my $col ( 1 .. 1 ) {
my $cell = $inws->get_cell( $in_row, $col);
my #fields = split /_/, $cell->value();
next unless $cell;
#$outws->write($in_row,$col+1, $cell->value());
$outws->write($in_row,$col+1, $fields[0]);
$outws->write($in_row,$col+2, $fields[1]);
$outws->write($in_row,$col+3, $fields[2]);
$outws->write($in_row,$col+4, $fields[3]);
}
}
for my $in_row ( 2 .. $row_max ) {
for my $col ( 2 .. 2 ) {
my $cell = $inws->get_cell( $in_row, $col);
my #fields = split /_/, $cell->value();
next unless $cell;
$outws->write($in_row,6, $cell->value());
}
}
for my $in_row ( 2 .. $row_max ) {
for my $col ( 3 .. 9 ) {
my $cell = $inws->get_cell( $in_row, $col);
next unless $cell;
}
}
for my $in_row ( 2 .. $row_max ) {
for my $col ( 10 .. 10 ) {
my $cell = $inws->get_cell( $in_row, $col );
next unless $cell;
$outws->write($in_row,7, $cell->value());
}
}
}
To get your output sorted, you need to collect all the information first before you are writing it out. Right now, you are doing a bit of jumping back and forth between rows and columns.
Here are some changes I would make to get it sorted, and make it more efficient (to read).
Create a data structure $data outside of your loop to store all the information.
If there is only one worksheet, you don't need to loop over sheets. Just work with one sheet.
Loop over the lines.
Inside that loop, use the code you have to parse the individual fields to just parse them. No 2..2 loops. Just a bunch of statements.
my #item_fields = split /_/, $inws->get_cell( $in_row, 0 ) || q{};
my #name_fields = split /_/, $inws->get_cell( $in_row, $col ) || q{};
Store them in $data per item.
push #{ $data } = [ $item_fields[0], ... ];
Done with the loop. Open the output file.
Loop over $data with a sort and write to the output file.
foreach my $row (sort { $a->[0] cmp $b->[0] } #{ $data } ) { ... }
Done.
I suggest you read up on sort and also check out perlref and perlreftut to learn more about references (data structures).

Spreadsheet::ParseExcel building an array or hash

I am new to Spreadsheet::ParseExcel. I have a space-delimited file which I opened in Microsoft Excel and saved it as a XLS file.
I installed Spreadsheet::ParseExcel and used the example code in documentation to print the contents of the file. My objective is to build an array of some of the data to write to a database. I just need a little help building the array -- writing to a database I'll figure out later.
I'm having a hard time understanding this module -- I did read the documentation, but because of my inexperience I'm unable to understand it.
Below is the code I'm using for the output.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Spreadsheet::ParseExcel;
my $parser = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse( 'file.xls' );
if ( !defined $workbook ) {
die $parser->error(), ".\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );
next unless $cell;
print "Row, Col = ($row, $col)\n";
print "Value = ", $cell->value(), "\n";
print "Unformatted = ", $cell->unformatted(), "\n";
print "\n";
}
}
}
And here is some of the output
Row, Col = (0, 0)
Value = NewRecordFlag
Unformatted = NewRecordFlag
Row, Col = (0, 1)
Value = AgencyName
Unformatted = AgencyName
Row, Col = (0, 2)
Value = CredentialIdnt
Unformatted = CredentialIdnt
Row, Col = (0, 3)
Value = ContactIdnt
Unformatted = ContactIdnt
Row, Col = (0, 4)
Value = AgencyRegistryCardNumber
Unformatted = AgencyRegistryCardNumber
Row, Col = (0, 5)
Value = Description
Unformatted = Description
Row, Col = (0, 6)
Value = CredentialStatusDescription
Unformatted = CredentialStatusDescription
Row, Col = (0, 7)
Value = CredentialStatusDate
Unformatted = CredentialStatusDate
Row, Col = (0, 8)
Value = CredentialIssuedDate
Unformatted = CredentialIssuedDate
My objective is to build an array of CredentialIssuedDate, AgencyRegistryCardNumber, and AgencyName. Once I grasp the concept of doing that, I can go to town with this great module.
Here's a quick example of something that should work for you. It builds an array #rows of arrays of the three field values you want for each worksheet, and displays each result using Data::Dumper. I haven't been able to test it, but it looks right and does compile
It starts by building a hash %headers that relates the column header strings to the column number, based on the first row in each worksheet.
Then the second row onwards is processed, extracting the cells in the columns named in the #wanted array, and putting their values in the array #row, which is pushed onto #rows as each one is accumulated
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Data::Dumper;
my #wanted = qw/
CredentialIssuedDate
AgencyRegistryCardNumber
AgencyName
/;
my $parser = Spreadsheet::ParseExcel->new;
my $workbook = $parser->parse('file.xls');
if ( not defined $workbook ) {
die $parser->error, ".\n";
}
for my $worksheet ( $workbook->worksheets ) {
my ( $row_min, $row_max ) = $worksheet->row_range;
my ( $col_min, $col_max ) = $worksheet->col_range;
my %headers;
for my $col ( $col_min, $col_max ) {
my $header = $worksheet->get_cell($row_min, $col)->value;
$headers{$header} = $col;
}
my #rows;
for my $row ( $row_min + 1 .. $row_max ) {
my #row;
for my $name ( #wanted ) {
my $col = $headers{$name};
my $cell = $worksheet->get_cell($row, $col);
push #row, $cell ? $cell->value : "";
}
push #rows, \#row;
}
print Dumper \#rows;
}
I was able to resolve this by using the Spreadsheet::BasicReadNamedCol module
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Spreadsheet::BasicReadNamedCol;
my $xlsFileName = 'shit.xls';
my #columnHeadings = (
'AgencyName',
'eMail',
'PhysicalAddress1',
'PhysicalAddress2'
);
my $ss = new Spreadsheet::BasicReadNamedCol($xlsFileName) ||
die "Could not open '$xlsFileName': $!";
$ss->setColumns(#columnHeadings);
# Print each row of the spreadsheet in the order defined in
# the columnHeadings array
my $row = 0;
while (my $data = $ss->getNextRow())
{
$row++;
print join('|', $row, #$data), "\n";
}

making cells blank in and excel sheet using perl

I have an excel sheet to which I need to empty some cells
So far this is what it looks like:
I open the sheet, and check for cells not empty in column M.
I add those cells to my array mistake
and then I would like to make black all those cells and save the file (this step not working), as that file needs to be the input to anotherprogram/
thanks!
$infile = $ARGV[0];
$columns = ReadData($infile) or die "cannot open excel table\n\n";
print "xls sheet contains $columns->[1]{maxrow} rows\n";
my $xlsstartrow;
if ( getExcel( A . 1 ) ne "text" ) {
$xlsstartrow = 2;
}
else
{
$xlsstartrow = 4;
}
check_templates();
print "done";
sub check_templates {
for ( $row = $xlsstartrow ; $row < ( $columns->[1]{maxrow} + 1 ) ; $row++ ) {
if (getExcel(M . $row) ne "" ){
$cell = "M" . $row ;
push(#mistakes,$cell);
}
}
rewritesheet(#mistakes);
}
sub rewritesheet {
my $FileName = $infile;
my $parser = Spreadsheet::ParseExcel::SaveParser->new();
my $template = $parser->Parse($FileName);
my $worksheet = $template->worksheet(0);
my $row = 0;
my $col = 0;
# Get the format from the cell
my $format = $template->{Worksheet}[$sheet]
->{Cells}[$row][$col]
->{FormatNo};
foreach (#mistakes){
$worksheet->AddCell( $_, "" );
}
$template->SaveAs($infile2);`
Empty column values in an Excel sheet and save the result?
If the whole purpose of your program is to delete all column M values from a .xls file, then the following program (adopted from your program) will do exactly that:
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
my $infile = $ARGV[0];
(my $infile2 = $infile) =~ s/(\.xls)$/_2$1/;
my $parser = Spreadsheet::ParseExcel::SaveParser->new();
my $workbook = $parser->Parse($infile);
my $sheet = $workbook->worksheet(0);
print "xls sheet contains rows \[0 .. $sheet->{MaxRow}\]\n";
my $startrow = $sheet->get_cell(0, 0) eq 'text' ? 4-1 : 2-1;
my $col_M = ord('M') - ord('A');
for my $row ($startrow .. $sheet->{MaxRow}) {
my $c = $sheet->get_cell($row, $col_M);
if(defined $c && length($c->value) > 0) { # why check?
$sheet->AddCell($row, $col_M, undef) # delete value
}
}
$workbook->SaveAs($infile2);
print "done";
But, if you really want to clear out column M only, why would you test for values? You could just overwrite them without test. Maybe thats not all your program is required to perform? I don't know.
Regards
rbo

Parsing unsorted data from large fixed width text

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time. Which makes it difficult (at least for me) to parse into Matlab. So seeing how Matlab is not that good at parsing text I thought I would try Perl. I wrote the following code which works ... at least on my small test file. However it is painfully slow when I tried it on an actual large data file. It was pieced together which lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: Real file has about 2000 parameter and is 1-2GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1.1 Param 1 UI 5
2.23 Param 3 TXT Some Text 1
3.2 Param 1 UI 10
4.5 Param 2 D 2.1234
5.3 Param 1 UI 15
6.121 Param 2 D 3.1234
7.56 Param 3 TXT Some Text 2
The basic logic of my script is to:
Read until the ---- line to build list of parameters to extract (always has "filter =").
Use the --- line to determine field widths. It is broken by spaces.
For each parameter build time and data array (while nested inside of foreach param)
In continue block write time and data to binary file. Then record name, type, and offsets in text table of contents file (used to read the file later into Matlab).
Here is my script:
#!/usr/bin/perl
$lineArg1 = #ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";
my $line;
my $data_start_pos;
my #param_name;
my #template;
while ($line = <INFILE>) {
chomp $line;
if ($line =~ s/\s+filter = ALL_VALUES//) {
$line = =~ s/^\s+//;
$line =~ s/\s+$//;
push #param_name, $line;
}
elsif ($line =~ /^------/) {
#template = map {'A'.length} $line =~ /(\S+\s*)/g;
$template[-1] = 'A*';
$data_start_pos = tell INFILE;
last; #Reached start of data exit loop
}
}
my $template = "#template";
my #lineData;
my #param_data;
my #param_time;
my $data_type;
foreach $current_param (#param_name) {
#param_time = ();
#param_data = ();
seek(INFILE,$data_start_pos,0); #Jump to data start
while ($line = <INFILE>) {
if($line =~ /$current_param/) {
chomp($line);
#lineData = unpack $template, $line;
push #param_time, #lineData[0];
push #param_data, #lineData[3];
}
} # END WHILE <INFILE>
} #END FOR EACH NAME
continue {
$data_type = #lineData[2];
print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time
print BINOUT pack('d*', #param_time); #Write TimeStamps
print TOCOUT tell(BINOUT).","; #offset to end of time/data start
if ($data_type eq "TXT") {
print BINOUT pack 'A*', join("\n",#param_data);
}
elsif ($data_type eq "D") {
print BINOUT pack('d*', #param_data);
}
elsif ($data_type eq "UI") {
print BINOUT pack('L*', #param_data);
}
print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
This is probably slow (guessing) because of the nested loops and searching the line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
Any other speed improvement tips you can give?
Edit: I modified the example text file to illustrate non-integer time stamps and Param Names may contain spaces.
First, you should always have 'use strict;' and 'use warnings;' pragmas in your script.
It seems like you need a simple array (#param_name) for reference, so loading those values would be straight forward as you have it. (again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)
I suggest you read this, to understand how you can load your data file into a
Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.
For example, using time as the key for the hash
%HoH = (
1 => {
name => "Param1",
ty => "UI",
value => "5",
},
2 => {
name => "Param3",
ty => "TXT",
value => "Some Text 1",
},
3 => {
name => "Param1",
ty => "UI",
value => "10",
},
);
Make sure you close the INFILE after reading in the contents, before you start processing.
So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.
Let me know if you need more info.
Note: if you go this route, include Data:Dumper - a significant help to printing and understanding the data in your hash!
It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.
I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.
To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.
#!/usr/bin/env perl
use strict; use warnings;
my %data;
$_ = <DATA> until /^-+/; # skip header
while (my $line = <DATA>) {
$line =~ s/\s+\z//;
last unless $line =~ /\S/;
my (undef, $param, undef, $value) = split ' ', $line, 4;
push #{ $data{ $param } }, $value;
}
use Data::Dumper;
print Dumper \%data;
__DATA__
Param1 filter = ALL_VALUES
Param2 filter = ALL_VALUES
Param3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1 Param1 UI 5
2 Param3 TXT Some Text 1
3 Param1 UI 10
4 Param2 D 2.1234
5 Param1 UI 15
6 Param2 D 3.1234
7 Param3 TXT Some Text 2
Output:
$VAR1 = {
'Param2' => [
'2.1234',
'3.1234'
],
'Param1' => [
'5',
'10',
'15'
],
'Param3' => [
'Some Text 1',
'Some Text 2'
]
};
First off, this piece of code causes the input file to be read once for every param. Which is quite in-efficient.
foreach $current_param (#param_name) {
...
seek(INFILE,$data_start_pos,0); #Jump to data start
while ($line = <INFILE>) { ... }
...
}
Also there is very rarely a reason to use a continue block. This is more style / readability, then a real problem.
Now on to make it more performant.
I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.
This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;
my $input_filename = shift #ARGV;
open my $input, '<', $input_filename;
my #param_names;
my $template = ''; # stop uninitialized warning
my #field_names;
my $field_name_line;
while( <$input> ){
chomp;
next if /^\s*$/;
if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
push #param_names, $param;
}elsif( /^[\s-]+$/ ){
my #fields = split /(\s+)/;
my $pos = 0;
for my $field (#fields){
my $length = length $field;
if( substr($field, 0, 1) eq '-' ){
$template .= "\#${pos}A$length ";
}
$pos += $length;
}
last;
}else{
$field_name_line = $_;
}
}
#field_names = unpack $template, $field_name_line;
for( #field_names ){
s(^\s+){};
$_ = lc $_;
$_ = 'type' if substr('type', 0, length $_) eq $_;
}
my %temp_files;
for my $param ( #param_names ){
for(qw'time data'){
my $fh = tempfile 'temp_XXXX', UNLINK => 1;
binmode $fh, ':raw';
$temp_files{$param}{$_} = $fh;
}
}
my %convert = (
TXT => sub{ pack 'A*', join "\n", #_ },
D => sub{ pack 'd*', #_ },
UI => sub{ pack 'L*', #_ },
);
sub print_time{
my($param,$time) = #_;
my $fh = $temp_files{$param}{time};
print {$fh} $convert{D}->($time);
}
sub print_data{
my($param,$format,$data) = #_;
my $fh = $temp_files{$param}{data};
print {$fh} $convert{$format}->($data);
}
my %data_type;
while( my $line = <$input> ){
next if $line =~ /^\s*$/;
my %fields;
#fields{#field_names} = unpack $template, $line;
print_time( #fields{(qw'name time')} );
print_data( #fields{(qw'name type value')} );
$data_type{$fields{name}} //= $fields{type};
}
close $input;
open my $bin, '>:raw', $input_filename.".bin";
open my $toc, '>', $input_filename.".toc";
for my $param( #param_names ){
my $data_fh = $temp_files{$param}{data};
my $time_fh = $temp_files{$param}{time};
seek $data_fh, 0, 0;
seek $time_fh, 0, 0;
my #toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );
copy( $time_fh, $bin, 8*1024 );
close $time_fh;
push #toc_line, sysseek($bin, 0, 1);
copy( $data_fh, $bin, 8*1024 );
close $data_fh;
push #toc_line, sysseek($bin, 0, 1);
say {$toc} join ',', #toc_line, '';
}
close $bin;
close $toc;
I modified my code to build a Hash as suggested. I have not incorporate the output to binary yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be to difficult ... hopefully
On an actual data file (~350MB & 2.0 Million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nill on the other 3) and Perl memory usage topped out at around 325MB ... until it dumped millions of lines to the prompt. However the print Dump will be replaced with a binary pack.
Please let me know if I am making any rookie mistakes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);
my $line;
my #param_names;
my #template;
while ($line = <INFILE>) {
chomp $line; #Remove New Line
if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
push #param_names, trim($line);
}
elsif ($line =~ /^----/) {
#template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
$template[-1] = 'A*';
my $data_start_pos = tell INFILE;
last; #Reached start of data exit loop
}
}
my $size = $#param_names+1;
my #getType = ((1) x $size);
my $template = "#template";
my #lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
if ($lineCount % 100000 == 0){
print "On Line: ".$lineCount."\n";
}
if ($line =~ /^\d/) {
chomp($line);
#lineData = unpack $template, $line;
my ($inHeader, $headerIndex) = findStr($lineData[1], #param_names);
if ($inHeader) {
push #{$dataHash{$lineData[1]}{time} }, $lineData[0];
push #{$dataHash{$lineData[1]}{data} }, $lineData[3];
if ($getType[$headerIndex]){ # Things that only need written once
$dataHash{$lineData[1]}{type} = $lineData[2];
$getType[$headerIndex] = 0;
}
}
}
$lineCount ++;
} # END WHILE <INFILE>
close(INFILE);
print Dumper \%dataHash;
#WRITE BINARY FILE and TOC FILE
my %convert = (TXT=>sub{pack 'A*', join "\n", #_}, D=>sub{pack 'd*', #_}, UI=>sub{pack 'L*', #_});
open my $binfile, '>:raw', $lineArg1.'.bin';
open my $tocfile, '>', $lineArg1.'.toc';
for my $param (#param_names){
my $data = $dataHash{$param};
my #toc_line = ($param, $data->{type}, tell $binfile );
print {$binfile} $convert{D}->(#{$data->{time}});
push #toc_line, tell $binfile;
print {$binfile} $convert{$data->{type}}->(#{$data->{data}});
push #toc_line, tell $binfile;
print {$tocfile} join(',',#toc_line,''),"\n";
}
sub trim { #Trim leading and trailing white space
my (#strings) = #_;
foreach my $string (#strings) {
$string =~ s/^\s+//;
$string =~ s/\s+$//;
chomp ($string);
}
return wantarray ? #strings : $strings[0];
} # END SUB
sub findStr { #Return TRUE if string is contained in array.
my $searchStr = shift;
my $i = 0;
foreach ( #_ ) {
if ($_ eq $searchStr){
return (1,$i);
}
$i ++;
}
return (0,-1);
} # END SUB
The output is as follows:
$VAR1 = {
'Param 1' => {
'time' => [
'1.1',
'3.2',
'5.3'
],
'type' => 'UI',
'data' => [
'5',
'10',
'15'
]
},
'Param 2' => {
'time' => [
'4.5',
'6.121'
],
'type' => 'D',
'data' => [
'2.1234',
'3.1234'
]
},
'Param 3' => {
'time' => [
'2.23',
'7.56'
],
'type' => 'TXT',
'data' => [
'Some Text 1',
'Some Text 2'
]
}
};
Here is the output TOC File:
Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
Thanks everyone for their help so far! This is an excellent resource!
EDIT: Added Binary & TOC file writing code.