perl array from csv file creating newline in unexpected place - perl

Hi, I have a few scripts that convert an xlsx file to a tab-separated file, then remove any commas and duplicates, and then split each line by commas. (I do this to make sure users have not put any commas in a column.)
I then do some processing and convert the result back to an xlsx file. This has always worked fine. But instead of opening and closing files all the time, I thought I would push the file into an array and convert it to an xlsx at the end. Unfortunately, when I try to convert back to an xlsx file, it creates a newline at the space inside the name. If I output to a csv file, then open it and convert it to an xlsx file, it works fine.
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::BasicRead;
use Excel::Writer::XLSX;
local $" = "'\n'";
open( STDERR, ">&STDOUT" );
#convert to csv
my $xlsx_WSD = ( "C:\\Temp\\testing_file.xlsx"),, 1;
my @csvtemp;
if ( -e $xlsx_WSD ) {
my $ss = new Spreadsheet::BasicRead($xlsx_WSD) or die;
my $col = '';
my $row = 0;
while ( my $data = $ss->getNextRow() ) {
$row++;
$col= join( "\t", @$data );
push @csvtemp, $col . "\n" if ( $col ne "" );
}
}
else {
print " C:\\Temp\\testing_file.xlsx file EXISTS ...!!\n";
print " please investigate and use the restore option if required !..\n";
exit;
}
;
my @arraynew;
my %seen;
our $Header_row = shift (@csvtemp);
foreach (@csvtemp){
chomp;
$_ =~ s/,//g;
$_ =~ s/\t/,/g;
# print $_ . "\n" if !$seen{$_}++ ;
push @arraynew, $_ . "\n" if !$seen{$_}++ ; #remove any dupes
}
#convert back to xlsx
my $workbook = Excel::Writer::XLSX->new("C:\\Temp\\testing_filet.xlsx");
my $worksheet = $workbook->add_worksheet();
my ( $x, $y ) = ( 0, 0 );
while (<@arraynew>) {
my @list = split /,/;
foreach my $c (@list) {
$worksheet->write( $x, $y++, $c );
}
$x++;
$y = 0;
}
__DATA__
Animal keeper M/F Years START DATE FRH FSM
GIRAFFE JAMES LE M 5 10/12/2007 Y
HIPPO JACKIE LEAN F 6 11/12/2007 Y
ZEBRA JAMES LEHERN M 7 12/12/2007 Y
GIRAFFE AMIE CAHORT M 5 13/12/2012 Y
GIRAFFE MICKY JAMES M 5 14/06/2007 Y
MEERKAT JOHN JONES M 9 15/12/2007 v v
LEOPPARD JIM LEE M 8 16/12/2002
unexpected result
GIRAFFE JAMES
LE M 5 10/12/2007 Y
"
HIPPO" JACKIE
LEAN F 6 11/12/2007 Y
"
ZEBRA" JAMES
LEHERN M 7 12/12/2007 Y
"
GIRAFFE" AMIE
CAHORT M 5 13/12/2012 Y
"
GIRAFFE" MICKY
JAMES M 5 14/06/2007 Y
"
MEERKAT" JOHN
JONES M 9 15/12/2007 v v
"
LEOPPARD" JIM
LEE M 8 16/12/2002

Since you are running this on Windows, have you considered using Win32::OLE instead?
use strict;
use Win32::OLE;
my $app = Win32::OLE->GetActiveObject('Excel.Application')
|| Win32::OLE->new('Excel.Application', 'Quit');
my $wb = $app->Workbooks->Open("C:/Temp/testing_file.xlsx");
my $ws = $wb->ActiveSheet;
my $max_row = $ws->UsedRange->Rows->Count;
my $max_col = $ws->UsedRange->Columns->Count;
my ($row, %already) = (1);
while ($row <= $max_row) {
my ($col, @output) = (1);
while ($col <= $max_col) {
my $val = $ws->Cells($row, $col)->{Text};
if ($val =~ /[,\t]/) {
$val =~ tr/,//d;
$val =~ tr/\t/,/;
$ws->Cells($row, $col)->{Value} = $val;
}
$output[$col - 1] = $val;
$col++;
}
if ($already{join "|", @output}++) {
$ws->Rows($row)->EntireRow->Delete;
$max_row--;
} else {
$row++;
}
}
$wb->SaveAs("C:\\temp\\testing_filet.xlsx");

This is an issue with end of line characters.
There are three conventions for marking the end of a line: \n on Unix, \r\n on Windows and \r on classic Mac OS (before OS X). It looks as though your script is assuming the Mac convention while input and output use the Windows convention.
So after reading the input, a leading \n appears on all lines except the first. As long as this is also the case with the output lines prior to composing them with \r, you end up with an output file with perfectly \r\n-delimited lines. Clearly it's better to make your script wary of what line ending convention the input is using and ensure it uses the same for splitting the lines and composing the output.
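One more thing worth checking, offered as a sketch rather than a confirmed diagnosis: in Perl, <@arraynew> is not an array iterator. Angle brackets around anything other than a filehandle or a simple scalar are treated as a glob pattern, and glob splits its argument on whitespace, which would explain a break at the space inside "JAMES LE". Looping over the array with foreach avoids that. The helper name rows_to_cells and the sample rows below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn comma-joined rows into lists of cells.
# It iterates with foreach, so spaces inside a cell are left alone.
sub rows_to_cells {
    my @rows = @_;
    my @table;
    foreach my $line (@rows) {
        ( my $copy = $line ) =~ s/\r?\n\z//;      # drop a trailing LF or CRLF
        push @table, [ split /,/, $copy, -1 ];    # -1 keeps trailing empty cells
    }
    return @table;
}

# Sample rows shaped like the ones in @arraynew (assumed data).
my @arraynew = ( "GIRAFFE,JAMES LE,M,5\n", "HIPPO,JACKIE LEAN,F,6\n" );

my ( $x, $y ) = ( 0, 0 );
for my $cells ( rows_to_cells(@arraynew) ) {
    for my $c (@$cells) {
        # In the real script this would be: $worksheet->write( $x, $y++, $c );
        print "row $x col ", $y++, ": $c\n";
    }
    $x++;
    $y = 0;
}
```

Stripping \r?\n explicitly also sidesteps the line-ending question, since it behaves the same for Unix and Windows input.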

Related

Next line array if a condition is achieved

I have a text in a file F1 each sentence in line, and another file contain the part of speech(POS) of each word in the text for example:
F1 contains:
he lives in paris\n
he jokes
F2 contains:
he pro\n
lives verb\n
in prep\n
paris adv_pl\n
he pro\n
jokes verb\n
I would like to parse each sentence of F1 and extract the POS of each word. I managed to extract the POS of the first sentence, but the program can't parse the second line. This is the code:
open( FILE, $filename ) || die "Problème d'ouverture du ficher en entrée";
open( FILEOUT, ">$filenameout" ) || die "Problème d'ouverture";
open( F, "/home/ahmed/Bureau/test/corpus.txt" ) || die " Pb pour ouvrir";
open( F2, "/home/ahmed/Bureau/test/corp.txt" ) || die " Pb pour ouvrir";
my $z;
my $y = 0;
my $l;
my $li;
my $pos;
while ( $ligne = <F> ) {
while ( $li = <F2> ) { # F2 POS
chomp($li);
# prem contain the first word of F2 in each line,
# deux contain the POS of this word
( $prem, $deux ) = ( $li =~ m/^\W*(\w+)\W+(\w+)/ );
print "premier: $prem\n";
chomp($ligne);
@val = split( / /, $ligne ); # text corpus
$l = @val;
while ( $y < $l ) { # $l length of sentence
$z = $val[$y];
print "z : $z\n";
if ( $z eq $prem ) {
print "true\n";
$pos .= "POSw" . $y . "=" . $deux . " ";
++$y;
} else {
last;
}
}
}
print FILEOUT "$pos\n";
$pos = "";
}
The result I had in the terminal:
premier: he
z : he
true
premier : lives
z : lives
true
premier : in
z : in
true
premier : paris
z : paris
true
premier : he
premier : jokes
The first sentence has 4 words; once all 4 have been processed, the program should move on to the next line of the text, but I can't work out how to do that.
There are some issues in your script.
You must always use strict; use warnings; to show the most common syntax and/or typing errors, unused variables, etc.
You should always use the three-parameter open and no global FILEHANDLEs (see opentut).
You should use some sensible names for your filehandles, not FH, FH1, etc. but $fh_sentences and $fh_grammar (or other meaningful names).
So far for the general part. Now let's get more specific:
Your outer loop (F) reads the sentences one by one. The next loop (F2) reads the grammatical types but it does so only once for the first sentence. When the F2 file is read, subsequent calls to <F2> will always return undef because the file was already read. You have to reset the filepointer to the beginning of the file after each sentence or -- even better -- read the file F2 in advance and store its contents in a hash.
Iterating over an array of words in a sentence is easier with foreach my $word (@words). No need to do the housekeeping of index variables (like $y) yourself.
Chomping and splitting the sentence should be moved outside the F2 loop: $ligne doesn't change inside that loop, so repeating the work only burns CPU cycles.
Putting this together I end up with this:
use strict;
use warnings;
# Read the grammar file, F2, into a hash:
my %grammar;
open( my $fh_grammar, '<', 'F2' ) or die "Pb pour ouvrir F2: $!\n";
while( my $ligne = <$fh_grammar> ) {
my ($prem, $deux) = ( $ligne =~ m/^\W*(\w+)\W+(\w+)/ );
$grammar{$prem} = $deux;
}
close($fh_grammar);
# The hash is now:
# %grammar = (
# 'he' => 'pro',
# 'lives' => 'verb',
# 'in' => 'prep',
# 'paris' => 'adv_pl',
# 'jokes' => 'verb'
# );
# Read the sentences from F1 and check the grammar:
open( my $fh_sentences, '<', 'F1' ) or die "Pb pour ouvrir F1: $!\n";
while( my $ligne = <$fh_sentences> ) {
my @words = split(/\s+/, $ligne );
foreach my $word (@words) {
print "z: $word\n";
if ( exists $grammar{$word} ) {
print "true; $grammar{$word}\n";
}
}
print "\n";
}
close($fh_sentences);
Output:
z: he
true; pro
z: lives
true; verb
z: in
true; prep
z: paris
true; adv_pl
z: he
true; pro
z: jokes
true; verb
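For comparison, the other option mentioned above, rewinding the POS file for each lookup, can be sketched with seek. This is only an illustration: the helper name tag_sentence is invented, and in-memory strings stand in for the real files. It rescans the file for every word, so the hash approach is usually the better choice.

```perl
use strict;
use warnings;

# Hypothetical helper: look up each word of one sentence by scanning the
# POS filehandle from the top. seek() rewinds it before every scan.
sub tag_sentence {
    my ( $sentence, $fh_pos ) = @_;
    my @tags;
    for my $word ( split ' ', $sentence ) {
        seek( $fh_pos, 0, 0 ) or die "Cannot rewind: $!";
        while ( my $line = <$fh_pos> ) {
            # Skip lines that do not look like "word tag".
            my ( $w, $pos ) = $line =~ /^\W*(\w+)\W+(\w+)/ or next;
            if ( $w eq $word ) {
                push @tags, $pos;
                last;
            }
        }
    }
    return @tags;
}

# In-memory stand-in for F2 (contents assumed from the question).
my $f2 = "he pro\nlives verb\nin prep\nparis adv_pl\nhe pro\njokes verb\n";
open( my $fh_pos, '<', \$f2 ) or die "Cannot open in-memory file: $!";

for my $sentence ( "he lives in paris", "he jokes" ) {
    my @tags = tag_sentence( $sentence, $fh_pos );
    print join( ' ', map { "POSw$_=$tags[$_]" } 0 .. $#tags ), "\n";
}
```

A word that is missing from the POS file is silently skipped here, which shifts the POSw numbering; a real script would need to decide how to report that.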
You can also solve the problem a different way:
First read the POS file and put it in a hash.
Code:
my $filename = "F2";
open FH2, $filename or die "Error";
my %POS_hash;
while(<FH2>)
{
chomp($_);
my @arr = split(/ /, $_); # you can change the split function
$POS_hash{$arr[0]} = $arr[1];
}
Now read your file and print the POS of each word:
my $filename1 = "F1";
open FH1, $filename1 or die "Error";
while(<FH1>)
{
chomp($_);
my @arr = split(/ /, $_); # you can change the split function
foreach my $val (@arr)
{
if(exists $POS_hash{$val})
{
print "$POS_hash{$val}\t";
}
}
print "\n";
}
I believe this is a better way to approach your problem. I hope it solves it.

Splitting and tallying substrings within mixed integer-string data

Input Data (example):
40A3B35A3C
30A5B28A2C2B
Desired output (per-line) is a single number determined by the composition of the code 40A3B35A3C and the following rules:
if A - add the preceding number to the running total
if B - add the preceding number to the running total
if C - subtract the preceding number from the running total
40A 3B 35A 3C would thus produce 40 + 3 + 35 - 3 = 75.
Output from both lines:
75
63
Is there an efficient way to achieve this for a particular column (such as $F[2]) in a tab-delimited .txt file using a one-liner? I have considered splitting the entire code into individual characters, then performing if statement checks to detect A/B/C, but my Perl knowledge is limited and I am unsure how to go about this.
When you use split with a capture, the captured group is returned from split, too.
perl -lane '
@ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * ("C" eq $op ? -1 : 1) while ($n, $op) = splice @ar, 0, 2;
print $s
' < input
Or maybe more declarative:
BEGIN { %one = ( A => 1,
B => 1,
C => -1 ) }
@ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * $one{$op} while ($n, $op) = splice @ar, 0, 2;
print $s
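The same split-with-capture idea can be written as a small standalone script, which may be easier to follow than the one-liner. This is a sketch only; the sub name tally and the %sign table are illustrative choices, not part of the original answer.

```perl
use strict;
use warnings;

# Multiplier per letter: A and B add, C subtracts.
my %sign = ( A => 1, B => 1, C => -1 );

# Hypothetical helper: total one code such as "40A3B35A3C".
sub tally {
    my ($code) = @_;

    # With a capturing group, split returns the separators too:
    # "40A3B" becomes ( 40, "A", 3, "B" ).
    my @parts = split /([ABC])/, $code;

    my $sum = 0;
    while ( my ( $n, $op ) = splice @parts, 0, 2 ) {
        last unless defined $op;    # ignore a trailing number with no letter
        $sum += $n * $sign{$op};
    }
    return $sum;
}

# The two sample lines from the question.
print tally($_), "\n" for qw(40A3B35A3C 30A5B28A2C2B);
```

For the two sample codes this prints 75 and 63, matching the desired output in the question.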
When working through a string like this, it's useful to know that regular expressions can return a list of results.
E.g.
my @matches = $str =~ m/(\d+[A-C])/g; #will catch repeated instances
So you can do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my $total;
#break the string into digit+letter groups.
for (m/(\d+[A-C])/g) {
#separate out this group into num and code.
my ( $num, $code ) = m/(\d+)([A-C])/;
print "\t",$num, " => ", $code, "\n";
if ( $code eq "C" ) {
$total -= $num;
}
else {
$total += $num;
}
}
print $total, " => ", $_;
}
__DATA__
40A3B35A3C
30A5B28A2C2B
perl -lne 'push @a,/([\d]+)[AB]/g;
push @b,/([\d]+)[C]/g;
$sum+=$_ for(@a);$sum-=$_ for(@b);
print $sum;@a=@b=();undef $sum' Your_file
How it works:
use the command line arg as the input
set the hash "%op" to the operations per letter
substitute the letters for operators in the input
evaluate the substituted input as an expression
use strict;
use warnings;
my %op=qw(A + B + C -);
$ARGV[0] =~ s/(\d+)(A|B|C)/$op{$2} $1/g;
print eval($ARGV[0]);

PERL Fixed Width to CSV based on Input Files

EDITED: I'm attempting to create a brief script that calls for an input fixed width file and a file with the start position and length of each attribute and then outputs the file as CSV instead of fixed width. I haven't messed with removing whitespace yet and am currently focusing on building the file reader portion.
Fixed:
My current issue is that this code returns data from the third row for $StartPosition and from the fourth row for $Length when they should both be first found on the first row of COMMA. I have no idea what is prompting this behavior.
Next issue: It only reads the first record in practice_data.txt. I'm guessing I need to tell COMMA to go back to the beginning?
while (my $sourceLine = <SOURCE>) {
$StartPosition = 0;
$Length = 0;
$Output = "";
$NextRecord ="";
while (my $commaLine = <COMMA>) {
my $Comma = index($commaLine, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($commaLine, 0, $Comma);
print "Start position is $StartPosition \n";
$Comma = $Comma + 1
$Length = substr($commaLine, $Comma);
print "Length is $Length \n";
$NextRecord = substr($sourceLine, $StartPosition, $Length);
$Output = "$Output . ',' . $NextRecord";
}
print OUTPUT "$Output \n";
}
practice_data.txt
1234512345John Doe 123 Mulberry Lane Columbus Ohio 43215Johnny Jane
5432154321Jason McKinny 423 Thursday Lane Columbus Ohio 43212Jase Jamie
4321543212Mike Jameson 289 Front Street Cleveland Ohio 43623James Sarah
Each record is 100 characters long.
Definitions.txt:
0,10
10,10
20,10
30,20
50,10
60,10
70,5
75,15
90,10
It always helps to provide enough information so that we can at least do some testing without having to read your code and imagine what the data must look like.
I suggest you use unpack, after first building a template from the file that holds the field specifications. Note that the A field specifier trims trailing spaces from the data.
It is all but essential to use the Text::CSV module to parse or generate well-formed CSV data. And I have used the autodie pragma to avoid having to explicitly check and report on the status of every I/O operation.
I have used this data
my_source_data.txt
12345678  ABCDE1234FGHIJK
my_field_spec.txt
0,8
10,5
15,4
19,6
And this program
use strict;
use warnings;
use 5.010;
use autodie;
use Text::CSV;
my @template;
open my $field_fh, '<', 'my_field_spec.txt';
while ( <$field_fh> ) {
my (@info) = /\d+/g;
die unless @info == 2;
push @template, sprintf '@%dA%d', @info;
}
my $template = "@template";
open my $source_fh, '<', 'my_source_data.txt';
my $csv = Text::CSV->new( { binary => 1, eol => $/ } );
while ( <$source_fh> ) {
my @fields = unpack $template;
$csv->print(\*STDOUT, \@fields);
}
output
12345678,ABCDE,1234,FGHIJK
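To see the template mechanics in isolation, here is a minimal sketch with no file I/O. The sub name build_template and the sample record are assumptions made for the example; the field specs are the first three lines of the asker's Definitions.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: build an unpack template from "start,length" pairs.
# "@N" jumps to absolute offset N; "A<len>" reads <len> characters and
# strips trailing spaces.
sub build_template {
    my @specs = @_;
    return join ' ', map {
        my ( $start, $len ) = /(\d+)\s*,\s*(\d+)/
            or die "Bad field spec: $_";
        "\@${start}A${len}";
    } @specs;
}

my $template = build_template( '0,10', '10,10', '20,10' );

# Assumed sample row: three fields, each padded to 10 characters.
my $record = sprintf '%-10s%-10s%-10s', '1234512345', 'John', 'Doe';

my @fields = unpack $template, $record;
print join( ',', @fields ), "\n";
```

The plain join is only for display. For real output, Text::CSV as used above is the safer choice because it quotes fields that contain commas or quotes.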
It looks like you're slightly confused about how to read the contents of the COMMA filehandle. Each time you read <COMMA>, you're reading another line from that file. Instead, read a line into a scalar like my $line = <FH> and use that scalar. Also note that once COMMA has been read to the end it stays at end-of-file, so it has to be rewound before each source line:
while (my $source_line = <SOURCE>) {
chomp($source_line);
$StartPosition = 0;
$Length = 0;
$Output = "";
$NextRecord ="";
seek(COMMA, 0, 0); # rewind the definitions file for every record
while (my $comma_line = <COMMA>) {
chomp($comma_line);
my $Comma = index($comma_line, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($comma_line, 0, $Comma);
print "Start position is $StartPosition \n";
$Length = substr($comma_line, $Comma + 1); # skip past the comma itself
print "Length is $Length \n";
$NextRecord = substr($source_line, $StartPosition, $Length) . ','; # "." concatenates; "+" would add numerically
$Output = "$Output$NextRecord";
}
print OUTPUT "$Output \n";
}

Parse only date from a line in text file

I have a text file with delimiters as spaces at the start of the lines.
Lines with no initial spaces should go in the first column of the CSV file; those with two spaces should go in the second column of the CSV; and those with four spaces should go in the third column.
This is all working fine as required.
In lines starting with two spaces I want that only the date should go in the second column, discarding the other data of the line. The rest should all remain as it is.
I have denoted spaces at the start of the line as # for clarity.
Text file:
Component1
##(111) Amar Sen <amar.sen@gmail.com> <No comment> 2013/04/01
####/Com/src/folder1/folder2/newfile.txt
##(1199) Prashant Singh <psinsgh@gmail.com> <No comment> 2013/04/24
####/Com/src/folder1/folder2/testfile24
####/Com/src/folder1/folder2/testfile25
####/Com/src/folder1/folder2/testfile26
##(1204) Anthony Li <anthon.li@gmail.com> <No comment> 2013/04/25
####/Com/src2
Component2(added)
Component3
Output format:
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt
2013/04/24,/Com/src/folder1/folder2/testfile24
/Com/src/folder1/folder2/testfile25
/Com/src/folder1/folder2/testfile26
2013/04/25,/Com/src2
Component2(added)
Component3
Here's the code. It's working fine except for the change described above.
use strict;
use warnings;
my $previous_count = "-1"; #beginning, we will think, that no spaces.
my $current_count = "0"; #current default value
my $maximum_count = 3;
my $to_written = "";
my $delimiter_between_columns = ",";
my $newline_separator = ";";
my $file = 'C:\\textfile.txt';
open (my $fh, '<:encoding(UTF-8)', $file) or die "Could not open file '$file' $!";
while (my $row = <$fh>) {
# ok, read.
chomp($row);
# print "row is : $row\n";
if ($row =~ m/^(\s*)/) {
#print length($1);
$current_count = length($1) / 2; #take number of spaces divided by 2
$row =~ s/^\s+//;
if ($previous_count >= $current_count || $previous_count == $maximum_count) {
#output here
print "$to_written" . $newline_separator . "\n";
$previous_count = 0;
$to_written = "";
}
$previous_count = 0 if ($previous_count == -1);
$to_written .= $delimiter_between_columns x ($current_count - $previous_count) . "$row";
$previous_count = $current_count;
#print"\n";
}
}
print "$to_written" . $newline_separator . "\n";
You seem to have got yourself tied up in knots a little with your solution.
This program seems to do what you need. I have added some commas to your "output format" as your example has no placeholders for initial empty fields.
I have kept the hash characters for this purpose. Obviously it is trivial to change them for spaces, replacing s/^(#*)// with s/^(\s*)//.
use strict;
use warnings;
my @row;
while (<DATA>) {
chomp;
s/^(#*)//;
my $i = length($1) / 2;
if ($i == 1 and m<(\d{4}/\d{2}/\d{2})>) {
$row[$i] = $1;
}
else {
$row[$i] = $_;
}
if ($i == 2) {
print join(',', @row), ";\n";
@row = ('') x 3;
}
}
__DATA__
Component1
##(111) Amar Sen <amar.sen@gmail.com> <No comment> 2013/04/01
####/Com/src/folder1/folder2/newfile.txt
##(1199) Prashant Singh <psinsgh@gmail.com> <No comment> 2013/04/24
####/Com/src/folder1/folder2/testfile24
####/Com/src/folder1/folder2/testfile25
####/Com/src/folder1/folder2/testfile26
##(1204) Anthony Li <anthon.li@gmail.com> <No comment> 2013/04/25
####/Com/src2
output
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;
,2013/04/24,/Com/src/folder1/folder2/testfile24;
,,/Com/src/folder1/folder2/testfile25;
,,/Com/src/folder1/folder2/testfile26;
,2013/04/25,/Com/src2;
Update
It makes more sense to cascade values from columns one and two into subsequent rows where they are not supplied. If you remove the line @row = ('') x 3 from my program it will do just that, with this output
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile26;
Component1,2013/04/25,/Com/src2;

Perl multidimensional array question

I have a program that prints out the location of commas in a paragraph of text in the form
For example if the paragraph is
one,two,three
three and a half
four,five
six
seven,eight
The program will print
0:4
0:8
2:5
4:6
I would like to use this output to create an array where the numbers after the colon are listed across columns in the row specified by the index before the colon. The array formed by the coordinates above would be
4 8
<blank or character '.'>
5
<blank or character '.'>
6
so array[0,0] = 4, array[0,1] = 8
array[1,0] = empty
array[2,0] = 5
etc...
I bet this is simple but I need help to write it.
$data_file="file.out";
open(DAT, $data_file) || die("Could not open file!");
@raw_data=<DAT>;
close(DAT);
my %array;
my $line = 0;
foreach $newline(@raw_data) {
chomp;
while ( $newline=~m/(,|;|:|and)/) {
push @{ $array{$line} }, pos($newline); # autovivification
}
$line++; }
Program
#!/usr/bin/env perl
use strict;
use warnings;
my %array;
my $line = 0;
while (<DATA>) {
chomp;
while ( /(,|;|:|(?:and))/g ) {
my $position = pos() - length($1) + 1;
push @{ $array{$line} }, $position; # autovivification
}
$line++;
}
for ( sort { $a <=> $b } keys %array ) {
my $flattened_value = join ', ', @{ $array{$_} };
print "$_ -> $flattened_value\n";
}
__DATA__
one,two,three
three and a half
four,five
six
seven,eight
Output
0 -> 4, 8
1 -> 7
2 -> 5
4 -> 6
See: chomp, join, keys, sort, split.
Refer to the following documents for an understanding of Perl's data structures, and especially of autovivification, which is used in this example.
perldoc perlref
perldoc perlreftut
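If the goal is the row-by-row layout from the question, with a '.' for lines that have no entries, the hash of arrays can be rendered along these lines. This is a sketch only: grid_rows is an invented helper name, and the positions are copied from the question's example rather than computed.

```perl
use strict;
use warnings;

# Hypothetical helper: render a hash of arrays (line number => positions)
# as one text row per line, using '.' for lines with no entries.
sub grid_rows {
    my ( $positions, $line_count ) = @_;
    return map {
        my $row = $positions->{$_};
        $row && @$row ? join( ' ', @$row ) : '.';
    } 0 .. $line_count - 1;
}

# Positions as listed in the question (0:4, 0:8, 2:5, 4:6).
my %array = ( 0 => [ 4, 8 ], 2 => [5], 4 => [6] );

print "$_\n" for grid_rows( \%array, 5 );
```

Individual cells are still reachable as $array{0}[1] and so on; the '.' placeholder is produced only when printing, so the hash itself never stores empty rows.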