Format of the date field gets changed through Spreadsheet::ParseExcel

Format of the date field gets changed through Spreadsheet::ParseExcel - perl

I have an Excel Sheet (A.xls) which has following content:
Date,Value
10/1/2020,36.91
10/2/2020,36.060001
I got following output using same script with Perl v5.6.1 on solaris 5.8
>>./a4_test.pl
INFO>Excel File=A.xls,#WorkSheet=1,AuthorID=Sahoo, Ashish
DEBUG>row 2 - col 0:10-2-20
DEBUG>row 2 - col 1:36.060001
And I got different output for date field using same script with perl v5.26.3 on solaris 5.11
>>./a4_test.pl
INFO>Excel File=A.xls,#WorkSheet=1,AuthorID=Sahoo, Ashish
DEBUG>row 2 - col 0:2020-10-02
DEBUG>row 2 - col 1:36.060001
I used 0.2602 version of Spreadsheet::ParseExcel on Solaris 8 machine and 0.65 version on Solaris 11 machine.
Why am I getting different output while reading date field from
excel sheet through Spreadsheet::ParseExcel module?
#!/usr/perl/5.12/bin/perl -w
use Spreadsheet::ParseExcel;
my $srce_file = "a.xls";
my $oExcel = new Spreadsheet::ParseExcel;
my $oBook = $oExcel->Parse($srce_file);
my %hah_sheet = ();
my $header_row = 1;
my($iR, $iC, $oWkS, $oWkC);
my $book = $oBook->{File};
my $nsheet= $oBook->{SheetCount};
my $author= $oBook->{Author};
unless($nsheet){
print "ERR>No worksheet found for source file:$srce_file\n";
return 0;
}
else{
print "INFO>Excel
File=$srce_file,#WorkSheet=$nsheet,AuthorID=$author\n";
}
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++) {
next if($iSheet >0);
$oWkS = $oBook->{Worksheet}[$iSheet];
my $rows = 0;
for(my $iR = $oWkS->{MinRow}; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++) {
$rows++;
my $str_len = 0;
for(my $iC = $oWkS->{MinCol}; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol}; $iC++) {
$oWkC = $oWkS->{Cells}[$iR][$iC];
next if ($iR <$header_row);
if (defined($oWkC)){
my $cell_value = $oWkC->Value;
$cell_value =~s/\n+//g; #removed newline inside the value
#
##if the first column at header row is null then skip. Column might be shifted
if($iR==$header_row && $iC == 0){
last unless($cell_value);
}
if($iR == $header_row){
$hah_sheet{$iR}{$iC} = uc($cell_value);
}else {
$hah_sheet{$iR}{$iC} = $cell_value;
$str_len += length($cell_value);
##View cell value by row/column
print "DEBUG>row ${iR} - col ${iC}:$cell_value\n";
}
}else{
$hah_sheet{$iR}{$iC} = ""; #keep position for NULL value
}
} # END of Column loop
} # END of Row loop
} # END of Worksheet

If you search for "date" in Changes, you see this:
0.33 2008.09.07
- Default format for formatted dates changed from 'm-d-yy' to 'yyyy-mm-dd'
This explains why you see different date formats between versions 0.2602 and 0.65 of Spreadsheet::ParseExcel.
If you always want your code to print the same format regardless of which version you are using, you could transform the date in your code. For example, if you always want to see yyyy-mm-dd:
$cell_value =~ s/^(\d+)-(\d+)-(\d+)$/sprintf '%04d-%02d-%02d', 2000+$3, $1, $2/e;
Or, vice versa:
$cell_value =~ s/^(\d+)-(\d+)-(\d+)$/sprintf '%0d-%0d-%02d', $2, $3, $1-2000/e;

Related

Perl: calculate jackknife error from of each column of a multi-column file

I am trying to calculate the jacknife average and error of each column in a multi-column file.
My example data file look like this:
$ cat data.HW2
1.1 2.1 3.1 4.1
1.2 2.2 3.2 4.2
1.3 2.3 3.3 4.3
1.4 2.4 3.4 4.4
My attempted solution is to define arrays that will eventually be the size same as the number of columns (in this case 4) and iterate over them line by line:
cat jackkinfe.pl
#! /usr/bin/perl
use warnings; use strict;
my #n=0;
my #x;
my $j;
my $i;
my $dg;
my #x_jack;
my #x_tot=0;
my $cols;
my $col_start=0;
# read in the data
while(<>)
{
my #column = split();
$cols=#column;
foreach my $j ($col_start .. $#column) {
$x[$n[$j]][$j] = $column[$j];
$x_tot[$j] += $x[$n[$j]][$j];
$n[$j]++;
}
}
# Do the jackknife estimates
for ($j=$col_start; $j<$cols; $j++)
{
for ($i = 0; $i < $n[$j]; $i++)
{
$x_jack[$i][$j] = ($x_tot[$j] - $x[$i][$j]) / ($n[$j] - 1);
}
# Do the final jackknife estimate
my #g_jack_av=0;
my #g_jack_err=0;
for ($i = 0; $i < $n[$j]; $i++)
{
$dg = $x_jack[$i][$j];
$g_jack_av[$j] += $dg;
$g_jack_err[$j] += $dg**2;
}
$g_jack_av[$j] /= $n[$j];
$g_jack_err[$j] /= $n[$j];
$g_jack_err[$j] = sqrt(($n[$j] - 1) * abs($g_jack_err[$j] - $g_jack_av[$j]**2));
printf "%e %e ", $g_jack_av[$j], $g_jack_err[$j];
}
printf "\n";
It gives me the following two warnings:
$cat data.HW2 | perl jackknife.pl
Use of uninitialized value within #n in array element at cols_jacknife.pl line 19, <> line 1.
Use of uninitialized value within #n in array element at cols_jacknife.pl line 20, <> line 1.
It is complaining at the following two lines:
$x[$n[$j]][$j] = $column[$j];
$x_tot[$j] += $x[$n[$j]][$j];
But I want to set the size of #n dynamically depending on the size of the data file.
How do I remove this warning?
Any other suggestions on my Perl usage are also welcome and much appreciated since I am trying to learn best practices.

This part of your code
my #n=0;
....
foreach my $j ($col_start .. $#column) {
$x[$n[$j]][$j] = $column[$j];
$x_tot[$j] += $x[$n[$j]][$j];
$n[$j]++;
}
Will trigger the warning once for every value of $j larger than 0, because only the first element in #n is defined: $n[0] = 0. Only at the end of the loop iteration is the array value finally defined, when it is set to 1 by the increment operator with $n[$j]++.
Technically, the code will still work as you expect, because undef will be cast to 0. So.... it should be safe to ignore the warning. You can do something like this inside your loop to avoid it:
$n[$j] //= 0; # $n[$j] is defined, or set to 0
This is equivalent to
if (not defined($n[$j])) {
$n[$j] = 0;
}

insert null values in missing rows of file

I have a text file which consists some data of 24 hours time stamp segregated in 10 minutes interval.
2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,01:00:00,ujjawal,36072-2,MT,55,0,2
2016-02-06,01:10:00,ujjawal,36072-2,MT,41,0,2
2016-02-06,01:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,01:30:00,ujjawal,36072-2,MT,56,0,3
2016-02-06,01:40:00,ujjawal,36072-2,MT,38,0,2
2016-02-06,01:50:00,ujjawal,36072-2,MT,49,0,2
2016-02-06,02:00:00,ujjawal,36072-2,MT,58,0,4
2016-02-06,02:10:00,ujjawal,36072-2,MT,43,0,2
2016-02-06,02:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,02:30:00,ujjawal,36072-2,MT,61,0,2
2016-02-06,02:40:00,ujjawal,36072-2,MT,57,0,3
2016-02-06,02:50:00,ujjawal,36072-2,MT,45,0,2
2016-02-06,03:00:00,ujjawal,36072-2,MT,45,0,3
2016-02-06,03:10:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:20:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:30:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:40:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:50:00,ujjawal,36072-2,MT,67,0,3
2016-02-06,04:00:00,ujjawal,36072-2,MT,82,0,8
2016-02-06,04:10:00,ujjawal,36072-2,MT,82,0,5
2016-02-06,04:20:00,ujjawal,36072-2,MT,122,0,4
2016-02-06,04:30:00,ujjawal,36072-2,MT,133,0,3
2016-02-06,04:40:00,ujjawal,36072-2,MT,142,0,3
2016-02-06,04:50:00,ujjawal,36072-2,MT,202,0,1
2016-02-06,05:00:00,ujjawal,36072-2,MT,731,1,3
2016-02-06,05:10:00,ujjawal,36072-2,MT,372,0,7
2016-02-06,05:20:00,ujjawal,36072-2,MT,303,0,2
2016-02-06,05:30:00,ujjawal,36072-2,MT,389,0,3
2016-02-06,05:40:00,ujjawal,36072-2,MT,454,0,1
2016-02-06,05:50:00,ujjawal,36072-2,MT,406,0,6
2016-02-06,06:00:00,ujjawal,36072-2,MT,377,0,1
2016-02-06,06:10:00,ujjawal,36072-2,MT,343,0,5
2016-02-06,06:20:00,ujjawal,36072-2,MT,370,0,2
2016-02-06,06:30:00,ujjawal,36072-2,MT,343,0,9
2016-02-06,06:40:00,ujjawal,36072-2,MT,315,0,8
2016-02-06,06:50:00,ujjawal,36072-2,MT,458,0,3
2016-02-06,07:00:00,ujjawal,36072-2,MT,756,1,3
2016-02-06,07:10:00,ujjawal,36072-2,MT,913,1,3
2016-02-06,07:20:00,ujjawal,36072-2,MT,522,0,3
2016-02-06,07:30:00,ujjawal,36072-2,MT,350,0,7
2016-02-06,07:40:00,ujjawal,36072-2,MT,328,0,6
2016-02-06,07:50:00,ujjawal,36072-2,MT,775,1,3
2016-02-06,08:00:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,08:10:00,ujjawal,36072-2,MT,308,0,6
2016-02-06,08:20:00,ujjawal,36072-2,MT,738,1,3
2016-02-06,08:30:00,ujjawal,36072-2,MT,294,0,6
2016-02-06,08:40:00,ujjawal,36072-2,MT,345,0,1
2016-02-06,08:50:00,ujjawal,36072-2,MT,367,0,6
2016-02-06,09:00:00,ujjawal,36072-2,MT,480,0,3
2016-02-06,09:10:00,ujjawal,36072-2,MT,390,0,3
2016-02-06,09:20:00,ujjawal,36072-2,MT,436,0,3
2016-02-06,09:30:00,ujjawal,36072-2,MT,1404,2,3
2016-02-06,09:40:00,ujjawal,36072-2,MT,346,0,3
2016-02-06,09:50:00,ujjawal,36072-2,MT,388,0,3
2016-02-06,10:00:00,ujjawal,36072-2,MT,456,0,2
2016-02-06,10:10:00,ujjawal,36072-2,MT,273,0,7
2016-02-06,10:20:00,ujjawal,36072-2,MT,310,0,3
2016-02-06,10:30:00,ujjawal,36072-2,MT,256,0,7
2016-02-06,10:40:00,ujjawal,36072-2,MT,283,0,3
2016-02-06,10:50:00,ujjawal,36072-2,MT,276,0,3
2016-02-06,11:00:00,ujjawal,36072-2,MT,305,0,1
2016-02-06,11:10:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,11:20:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:30:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:40:00,ujjawal,36072-2,MT,247,0,7
2016-02-06,11:50:00,ujjawal,36072-2,MT,366,0,2
2016-02-06,12:00:00,ujjawal,36072-2,MT,294,0,2
2016-02-06,12:10:00,ujjawal,36072-2,MT,216,0,5
2016-02-06,12:20:00,ujjawal,36072-2,MT,233,0,1
2016-02-06,12:30:00,ujjawal,36072-2,MT,785,1,2
2016-02-06,12:40:00,ujjawal,36072-2,MT,466,0,1
2016-02-06,12:50:00,ujjawal,36072-2,MT,219,0,9
2016-02-06,13:00:00,ujjawal,36072-2,MT,248,0,6
2016-02-06,13:10:00,ujjawal,36072-2,MT,223,0,7
2016-02-06,13:20:00,ujjawal,36072-2,MT,276,0,8
2016-02-06,13:30:00,ujjawal,36072-2,MT,219,0,6
2016-02-06,13:40:00,ujjawal,36072-2,MT,699,1,2
2016-02-06,13:50:00,ujjawal,36072-2,MT,439,0,2
2016-02-06,14:00:00,ujjawal,36072-2,MT,1752,2,3
2016-02-06,14:10:00,ujjawal,36072-2,MT,203,0,5
2016-02-06,14:20:00,ujjawal,36072-2,MT,230,0,7
2016-02-06,14:30:00,ujjawal,36072-2,MT,226,0,1
2016-02-06,14:40:00,ujjawal,36072-2,MT,195,0,6
2016-02-06,14:50:00,ujjawal,36072-2,MT,314,0,1
2016-02-06,15:00:00,ujjawal,36072-2,MT,357,0,2
2016-02-06,15:10:00,ujjawal,36072-2,MT,387,0,9
2016-02-06,15:20:00,ujjawal,36072-2,MT,1084,1,3
2016-02-06,15:30:00,ujjawal,36072-2,MT,1295,2,3
2016-02-06,15:40:00,ujjawal,36072-2,MT,223,0,8
2016-02-06,15:50:00,ujjawal,36072-2,MT,254,0,1
2016-02-06,16:00:00,ujjawal,36072-2,MT,252,0,7
2016-02-06,16:10:00,ujjawal,36072-2,MT,268,0,1
2016-02-06,16:20:00,ujjawal,36072-2,MT,242,0,1
2016-02-06,16:30:00,ujjawal,36072-2,MT,254,0,9
2016-02-06,16:40:00,ujjawal,36072-2,MT,271,0,3
2016-02-06,16:50:00,ujjawal,36072-2,MT,244,0,7
2016-02-06,17:00:00,ujjawal,36072-2,MT,281,0,1
2016-02-06,17:10:00,ujjawal,36072-2,MT,190,0,8
2016-02-06,17:20:00,ujjawal,36072-2,MT,187,0,1
2016-02-06,17:30:00,ujjawal,36072-2,MT,173,0,9
2016-02-06,17:40:00,ujjawal,36072-2,MT,140,0,5
2016-02-06,17:50:00,ujjawal,36072-2,MT,147,0,6
2016-02-06,18:00:00,ujjawal,36072-2,MT,109,0,4
2016-02-06,18:10:00,ujjawal,36072-2,MT,99,0,1
2016-02-06,18:20:00,ujjawal,36072-2,MT,66,0,6
2016-02-06,18:30:00,ujjawal,36072-2,MT,67,0,4
2016-02-06,18:40:00,ujjawal,36072-2,MT,40,0,2
2016-02-06,18:50:00,ujjawal,36072-2,MT,52,0,3
2016-02-06,19:00:00,ujjawal,36072-2,MT,40,0,3
2016-02-06,19:10:00,ujjawal,36072-2,MT,30,0,2
2016-02-06,19:20:00,ujjawal,36072-2,MT,25,0,3
2016-02-06,19:30:00,ujjawal,36072-2,MT,35,0,4
2016-02-06,19:40:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,19:50:00,ujjawal,36072-2,MT,97,0,7
2016-02-06,20:00:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,20:10:00,ujjawal,36072-2,MT,12,0,4
2016-02-06,20:20:00,ujjawal,36072-2,MT,11,0,2
2016-02-06,20:30:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,20:40:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,20:50:00,ujjawal,36072-2,MT,13,0,2
2016-02-06,21:00:00,ujjawal,36072-2,MT,5,0,1
2016-02-06,21:10:00,ujjawal,36072-2,MT,12,0,2
2016-02-06,21:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,21:30:00,ujjawal,36072-2,MT,21,0,2
2016-02-06,21:50:00,ujjawal,36072-2,MT,9,0,3
2016-02-06,22:00:00,ujjawal,36072-2,MT,2,0,1
2016-02-06,22:10:00,ujjawal,36072-2,MT,12,0,5
2016-02-06,22:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,22:30:00,ujjawal,36072-2,MT,9,0,1
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
2016-02-06,23:10:00,ujjawal,36072-2,MT,10,0,3
2016-02-06,23:20:00,ujjawal,36072-2,MT,10,0,1
2016-02-06,23:30:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,23:40:00,ujjawal,36072-2,MT,12,0,1
if you see above sample as per 10 minutes interval there should be total 143 rows in 24 hours in this file but after second last line which has time 2016-02-06,23:40:00 data for date, time 2016-02-06,23:50:00 is missing.
similarly after 2016-02-06,22:40:00 data for date, time 2016-02-06,22:50:00 is missing.
can we insert missing date,time followed by 6 null separated by commas e.g. 2016-02-06,22:50:00,null,null,null,null,null,null where ever any data missing in rows of this file based on count no 143 rows and time stamp comparison in rows 2016-02-06,00:00:00 to 2016-02-06,23:50:00 which is also 143 in count ?
here is what i have tried
created a file with 143 entries of date and time as 2.csv and used below command
join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.1,2.1,2.2 <(sort -k2 1.csv) <(sort -k2 2.csv)|grep "2016-02-06,21:30:00"| sort -u|sed "s/\t//g"> 3.txt
part of output is repetitive like this :
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,21:30:00
any suggestions ?

I'd actually not cross reference a new csv file, and instead do it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Piece;
my $last_timestamp;
my $interval = 600;
#read stdin line by line
while ( <> ) {
#extract date and time from this line.
my ( $date, $time, #fields ) = split /,/;
#parse the timestamp
my $timestamp = Time::Piece -> strptime ( $date . $time, "%Y-%m-%d%H:%M:%S" );
#set last if undefined.
$last_timestamp //= $timestamp;
#if there's a gap... :
if ( $last_timestamp + $interval < $timestamp ) {
#print "GAP detected at $timestamp: ",$timestamp - $last_timestamp,"\n";
#print lines to fill in the gap
for ( ($timestamp - $last_timestamp) % 600 ) {
$last_timestamp += 600;
print join ( ",", $last_timestamp -> strftime("%Y-%m-%d,%H:%M:%S"), ("null")x6),"\n";
}
}
$last_timestamp = $timestamp;
print;
}
Which for your sample gives me lines (snipped for brevity):
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,22:50:00,null,null,null,null,null,null
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
Note - this is assuming the timestamps are exactly 600s apart. You can adjust the logic a little if that isn't a valid assumption, but it depends exactly what you're trying to get at that point.

Here's another Perl solution
It initialises $date to the date contained in the first line of the file, and a time of 00:00:00
It then fills the %values hash with records using the value of $date as a key, incrementing the value by ten minutes until the day of month changes. These form the "default" values
Then the contents of the file are used to overwrite all elements of %values for which we have an actual value. Any gaps will remain set to their default from the previous step
Then the hash is simply printed in sorted order, resulting in a full set of data with defaults inserted as necessary
use strict;
use warnings 'all';
use Time::Piece;
use Time::Seconds 'ONE_MINUTE';
use Fcntl ':seek';
my $delta = 10 * ONE_MINUTE;
my $date = Time::Piece->strptime(<ARGV> =~ /^([\d-]+)/, '%Y-%m-%d');
my %values;
for ( my $day = $date->mday; $date->mday == $day; $date += $delta ) {
my $ds = $date->strftime('%Y-%m-%d,%H:%M:%S');
$values{$ds} = $ds. ',null' x 6 . "\n";
}
seek ARGV, 0, SEEK_SET;
while ( <ARGV> ) {
my ($ds) = /^([\d-]+,[\d:]+)/;
$values{$ds} = $_;
}
print $values{$_} for sort keys %values;

here is the answer..
cat 1.csv 2.csv|sort -u -t, -k2,2

...or a shell script:
#! /bin/bash
set -e
file=$1
today=$(head -1 $file | cut -d, -f1)
line=0
for (( h = 0 ; h < 24 ; h++ ))
do
for (( m = 0 ; m < 60 ; m += 10 ))
do
stamp=$(printf "%02d:%02d:00" $h $m)
if [ $line -eq 0 ]; then IFS=',' read date time data; fi
if [ "$time" = "$stamp" ]; then
echo $date,$time,$data
line=0
else
echo $today,$stamp,null,null,null,null,null,null
line=1
fi
done
done <$file

I would write it like this in Perl
This program expects the name of the input file as a parameter on the command line, and prints its output to STDOUT, which may be redirected as normal
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
use Time::Seconds 'ONE_MINUTE';
my $format = '%Y-%m-%d,%H:%M:%S';
my $delta = 10 * ONE_MINUTE;
my $next;
our #ARGV = 'mydates.txt';
while ( <> ) {
my $new = Time::Piece->strptime(/^([\d-]+,[\d:]+)/, $format);
while ( $next and $next < $new ) {
say $next->strftime($format) . ',null' x 6;
$next += $delta;
}
print;
$next = $new + $delta;
}
while ( $next and $next->hms('') > 0 ) {
say $next->strftime($format) . ',null' x 6;
$next += $delta;
}
output
2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:30:00,null,null,null,null,null,null
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,01:00:00,ujjawal,36072-2,MT,55,0,2
2016-02-06,01:10:00,ujjawal,36072-2,MT,41,0,2
2016-02-06,01:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,01:30:00,ujjawal,36072-2,MT,56,0,3
2016-02-06,01:40:00,ujjawal,36072-2,MT,38,0,2
2016-02-06,01:50:00,ujjawal,36072-2,MT,49,0,2
2016-02-06,02:00:00,ujjawal,36072-2,MT,58,0,4
2016-02-06,02:10:00,ujjawal,36072-2,MT,43,0,2
2016-02-06,02:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,02:30:00,ujjawal,36072-2,MT,61,0,2
2016-02-06,02:40:00,ujjawal,36072-2,MT,57,0,3
2016-02-06,02:50:00,ujjawal,36072-2,MT,45,0,2
2016-02-06,03:00:00,ujjawal,36072-2,MT,45,0,3
2016-02-06,03:10:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:20:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:30:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:40:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:50:00,ujjawal,36072-2,MT,67,0,3
2016-02-06,04:00:00,ujjawal,36072-2,MT,82,0,8
2016-02-06,04:10:00,ujjawal,36072-2,MT,82,0,5
2016-02-06,04:20:00,ujjawal,36072-2,MT,122,0,4
2016-02-06,04:30:00,ujjawal,36072-2,MT,133,0,3
2016-02-06,04:40:00,ujjawal,36072-2,MT,142,0,3
2016-02-06,04:50:00,ujjawal,36072-2,MT,202,0,1
2016-02-06,05:00:00,ujjawal,36072-2,MT,731,1,3
2016-02-06,05:10:00,ujjawal,36072-2,MT,372,0,7
2016-02-06,05:20:00,ujjawal,36072-2,MT,303,0,2
2016-02-06,05:30:00,ujjawal,36072-2,MT,389,0,3
2016-02-06,05:40:00,ujjawal,36072-2,MT,454,0,1
2016-02-06,05:50:00,ujjawal,36072-2,MT,406,0,6
2016-02-06,06:00:00,ujjawal,36072-2,MT,377,0,1
2016-02-06,06:10:00,ujjawal,36072-2,MT,343,0,5
2016-02-06,06:20:00,ujjawal,36072-2,MT,370,0,2
2016-02-06,06:30:00,ujjawal,36072-2,MT,343,0,9
2016-02-06,06:40:00,ujjawal,36072-2,MT,315,0,8
2016-02-06,06:50:00,ujjawal,36072-2,MT,458,0,3
2016-02-06,07:00:00,ujjawal,36072-2,MT,756,1,3
2016-02-06,07:10:00,ujjawal,36072-2,MT,913,1,3
2016-02-06,07:20:00,ujjawal,36072-2,MT,522,0,3
2016-02-06,07:30:00,ujjawal,36072-2,MT,350,0,7
2016-02-06,07:40:00,ujjawal,36072-2,MT,328,0,6
2016-02-06,07:50:00,ujjawal,36072-2,MT,775,1,3
2016-02-06,08:00:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,08:10:00,ujjawal,36072-2,MT,308,0,6
2016-02-06,08:20:00,ujjawal,36072-2,MT,738,1,3
2016-02-06,08:30:00,ujjawal,36072-2,MT,294,0,6
2016-02-06,08:40:00,ujjawal,36072-2,MT,345,0,1
2016-02-06,08:50:00,ujjawal,36072-2,MT,367,0,6
2016-02-06,09:00:00,ujjawal,36072-2,MT,480,0,3
2016-02-06,09:10:00,ujjawal,36072-2,MT,390,0,3
2016-02-06,09:20:00,ujjawal,36072-2,MT,436,0,3
2016-02-06,09:30:00,ujjawal,36072-2,MT,1404,2,3
2016-02-06,09:40:00,ujjawal,36072-2,MT,346,0,3
2016-02-06,09:50:00,ujjawal,36072-2,MT,388,0,3
2016-02-06,10:00:00,ujjawal,36072-2,MT,456,0,2
2016-02-06,10:10:00,ujjawal,36072-2,MT,273,0,7
2016-02-06,10:20:00,ujjawal,36072-2,MT,310,0,3
2016-02-06,10:30:00,ujjawal,36072-2,MT,256,0,7
2016-02-06,10:40:00,ujjawal,36072-2,MT,283,0,3
2016-02-06,10:50:00,ujjawal,36072-2,MT,276,0,3
2016-02-06,11:00:00,ujjawal,36072-2,MT,305,0,1
2016-02-06,11:10:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,11:20:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:30:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:40:00,ujjawal,36072-2,MT,247,0,7
2016-02-06,11:50:00,ujjawal,36072-2,MT,366,0,2
2016-02-06,12:00:00,ujjawal,36072-2,MT,294,0,2
2016-02-06,12:10:00,ujjawal,36072-2,MT,216,0,5
2016-02-06,12:20:00,ujjawal,36072-2,MT,233,0,1
2016-02-06,12:30:00,ujjawal,36072-2,MT,785,1,2
2016-02-06,12:40:00,ujjawal,36072-2,MT,466,0,1
2016-02-06,12:50:00,ujjawal,36072-2,MT,219,0,9
2016-02-06,13:00:00,ujjawal,36072-2,MT,248,0,6
2016-02-06,13:10:00,ujjawal,36072-2,MT,223,0,7
2016-02-06,13:20:00,ujjawal,36072-2,MT,276,0,8
2016-02-06,13:30:00,ujjawal,36072-2,MT,219,0,6
2016-02-06,13:40:00,ujjawal,36072-2,MT,699,1,2
2016-02-06,13:50:00,ujjawal,36072-2,MT,439,0,2
2016-02-06,14:00:00,ujjawal,36072-2,MT,1752,2,3
2016-02-06,14:10:00,ujjawal,36072-2,MT,203,0,5
2016-02-06,14:20:00,ujjawal,36072-2,MT,230,0,7
2016-02-06,14:30:00,ujjawal,36072-2,MT,226,0,1
2016-02-06,14:40:00,ujjawal,36072-2,MT,195,0,6
2016-02-06,14:50:00,ujjawal,36072-2,MT,314,0,1
2016-02-06,15:00:00,ujjawal,36072-2,MT,357,0,2
2016-02-06,15:10:00,ujjawal,36072-2,MT,387,0,9
2016-02-06,15:20:00,ujjawal,36072-2,MT,1084,1,3
2016-02-06,15:30:00,ujjawal,36072-2,MT,1295,2,3
2016-02-06,15:40:00,ujjawal,36072-2,MT,223,0,8
2016-02-06,15:50:00,ujjawal,36072-2,MT,254,0,1
2016-02-06,16:00:00,ujjawal,36072-2,MT,252,0,7
2016-02-06,16:10:00,ujjawal,36072-2,MT,268,0,1
2016-02-06,16:20:00,ujjawal,36072-2,MT,242,0,1
2016-02-06,16:30:00,ujjawal,36072-2,MT,254,0,9
2016-02-06,16:40:00,ujjawal,36072-2,MT,271,0,3
2016-02-06,16:50:00,ujjawal,36072-2,MT,244,0,7
2016-02-06,17:00:00,ujjawal,36072-2,MT,281,0,1
2016-02-06,17:10:00,ujjawal,36072-2,MT,190,0,8
2016-02-06,17:20:00,ujjawal,36072-2,MT,187,0,1
2016-02-06,17:30:00,ujjawal,36072-2,MT,173,0,9
2016-02-06,17:40:00,ujjawal,36072-2,MT,140,0,5
2016-02-06,17:50:00,ujjawal,36072-2,MT,147,0,6
2016-02-06,18:00:00,ujjawal,36072-2,MT,109,0,4
2016-02-06,18:10:00,ujjawal,36072-2,MT,99,0,1
2016-02-06,18:20:00,ujjawal,36072-2,MT,66,0,6
2016-02-06,18:30:00,ujjawal,36072-2,MT,67,0,4
2016-02-06,18:40:00,ujjawal,36072-2,MT,40,0,2
2016-02-06,18:50:00,ujjawal,36072-2,MT,52,0,3
2016-02-06,19:00:00,ujjawal,36072-2,MT,40,0,3
2016-02-06,19:10:00,ujjawal,36072-2,MT,30,0,2
2016-02-06,19:20:00,ujjawal,36072-2,MT,25,0,3
2016-02-06,19:30:00,ujjawal,36072-2,MT,35,0,4
2016-02-06,19:40:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,19:50:00,ujjawal,36072-2,MT,97,0,7
2016-02-06,20:00:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,20:10:00,ujjawal,36072-2,MT,12,0,4
2016-02-06,20:20:00,ujjawal,36072-2,MT,11,0,2
2016-02-06,20:30:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,20:40:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,20:50:00,ujjawal,36072-2,MT,13,0,2
2016-02-06,21:00:00,ujjawal,36072-2,MT,5,0,1
2016-02-06,21:10:00,ujjawal,36072-2,MT,12,0,2
2016-02-06,21:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,21:30:00,ujjawal,36072-2,MT,21,0,2
2016-02-06,21:40:00,null,null,null,null,null,null
2016-02-06,21:50:00,ujjawal,36072-2,MT,9,0,3
2016-02-06,22:00:00,ujjawal,36072-2,MT,2,0,1
2016-02-06,22:10:00,ujjawal,36072-2,MT,12,0,5
2016-02-06,22:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,22:30:00,ujjawal,36072-2,MT,9,0,1
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,22:50:00,null,null,null,null,null,null
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
2016-02-06,23:10:00,ujjawal,36072-2,MT,10,0,3
2016-02-06,23:20:00,ujjawal,36072-2,MT,10,0,1
2016-02-06,23:30:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,23:40:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,23:50:00,null,null,null,null,null,null

pass 3 value to sub : Too many arguments for

I have an input like this:
100 200 A_30:120,A_140:180,B_180:220
100 300 A_70:220,B_130:300,A_190:200,A_60:300
I want to count number of A or B in each line and also compare range of A or B in each line with the range in two first column and return the length of intersection. e.g. output for first line: A:2 A_length:40 B:1 B_length:20
while(<>){
chomp($_);
my #line = split("\t| ", $_);
my $col1 = $line[0];
my $col2 = $line[1];
my #col3 = split(",",$line[2]);
my $A=0;
my $B=0;
my $A_length=0;
my $B_length=0;
for my $i (0 .. #col3-1){
my $col3 = $col3[$i];
if ($col3 =~ /A/){
my $length=0;
$length = range ($col3,$col1,$col2);
$A_length = $A_length+$length;
$A++;
}
if ($col3 =~ /B/){
my $length=0;
$length = range ($col3,$col1,$col2);
$B_length = $B_length+$length;
$B++;
}
$i++;
}
print("#A: ",$A,"\t","length_A: ",$A_length,"\t","#B: ",$B,"\t","length_B: ",$B_length,"\n");}
sub range {
my ($col3, $col1, $col2) = ($_[0],$_[1],$_[2]);
my #sub = split(":|_", $col3);
my $sub_strt = $sub[1];
my $sub_end = $sub[2];
my $sub_length;
if (($col1 >= $sub_strt) && ($col2 >= $sub_end)){
$sub_length = ($sub_end) - ($col1);}
if (($col1 >= $sub_strt) && ($col2 >= $sub_end)){
$sub_length = ($col2) - ($col1);}
if(($col1 <= $sub_strt) && ($col2 >= $sub_end)){
$sub_length = ($sub_end) - ($sub_strt);}
if(($col1 <= $sub_strt) && ($col2 <= $sub_end)){
$sub_length = ($col2) - ($sub_strt);}
return $sub_length;
}
I FIXED IT :)

Perl already has a builtin length function, which only takes one argument. As perl is compiling your script and gets to your length function call, it doesn't know about the sub length { ... } that you have defined later in the script, so it complains that you are using the builtin length function incorrectly.
How to fix this? This is Perl, so there are many ways
name your function something else. Making a function with the same name as a Perl builtin function is usually a bad idea
Call your function with the & sigil: my $length = &length($col3,$col1,$col2); That will be enough of a hint to the compiler that your function call does not refer to the builtin function.
Qualify your function call with a package name, in this case main::length($col3,$col1,$col2) or just ::length($col3,$col1,$col2).
Note that even if Perl did know about the length function you defined (you could get Perl to know by moving the sub length { ... } definition to the top of the script, for example), the function call would still be ambiguous to the compiler, the compiler would emit a warning like
Ambiguous call resolved as CORE::length(), qualify as such or use & at ...
and your script would still fail to compile. Here CORE::length would mean that Perl is resolving the ambiguity in favor of the builtin function.

Compare two CSV files and show only the difference

I have two CSV files:
File1.csv
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 915.13, Aircel_Indore
File2.csv
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 815.13, Aircel_Indore
These are sample input files in actual so many headers and values will be there, so I can not make them hard coded.
In my expected output I want to keep the first two columns and the last column as it is as there won't be any change in the same and then the comparison should happen for the rest of the columns and values.
Expected output:
Time, Object_Name, Frequency, Longname
2013-08-05 00:00, 815.13, Aircel_Indore
How can I do this?

Please look at the links below, there are some examples scripts:
http://bytes.com/topic/perl/answers/647889-compare-two-csv-files-using-perl
Perl: Compare Two CSV Files and Print out differences
http://www.perlmonks.org/?node_id=705049

If you are not bound to Perl, here a solution using AWK:
#!/bin/bash
awk -v FS="," '
function filter_columns()
{
return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}
NF !=0 && NR == FNR {
if (NR == 1) {
print filter_columns();
} else {
memory[line++] = filter_columns();
}
} NF != 0 && NR != FNR {
if (FNR == 1) {
line = 0;
} else {
new_line = filter_columns();
if (new_line != memory[line++]) {
print new_line;
}
}
}' File1.csv File2.csv
This outputs:
Time, Object_Name, Frequany, Longname
2013-08-05 00:00, Alpha, 815.13, Aircel_Indore
Here the explanation:
#!/bin/bash
# FS = "," makes awk split each line in fields using
# the comma as separator
awk -v FS="," '
# this function selects the columns you want. NF is the
# the number of field. Therefore $NF is the content of
# the last column and $(NF-1) of the but last.
function filter_columns()
{
return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}
# This block processes just the first file, this is the aim
# of the condition NR == FNR. The condition NF != 0 skips the
# empty lines you have in your file. The block prints the header
# and then save all the other lines in the array memory.
NF !=0 && NR == FNR {
if (NR == 1) {
print filter_columns();
} else {
memory[line++] = filter_columns();
}
}
# This block processes just the second file (NR != FNR).
# Since the header has been already printed, it skips the first
# line of the second file (FNR == 1). The block compares each line
# against that one saved in the array memory (the corresponding
# line in the first file). The block prints just the lines
# that do not match.
NF != 0 && NR != FNR {
if (FNR == 1) {
line = 0;
} else {
new_line = filter_columns();
if (new_line != memory[line++]) {
print new_line;
}
}
}' File1.csv File2.csv

Answering #IlmariKaronen's questions would clarify the problem much better, but meanwhile I made some assumptions and took a crack at the problem - mainly because I needed an excuse to learn a bit of Text::CSV.
Here's the code:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Array::Compare;
use feature 'say';
open my $in_file, '<', 'infile.csv';
open my $exp_file, '<', 'expectedfile.csv';
open my $out_diff_file, '>', 'differences.csv';
my $text_csv = Text::CSV->new({ allow_whitespace => 1, auto_diag => 1 });
my $line = readline($in_file);
my $exp_line = readline($exp_file);
die 'Different column headers' unless $line eq $exp_line;
$text_csv->parse($line);
my #headers = $text_csv->fields();
my %all_differing_indices;
#array-of-array containings lists of "expected" rows for differing lines
# only columns that differ from the input have values, others are empty
my #all_differing_rows;
my $array_comparer = Array::Compare->new(DefFull => 1);
while (defined($line = readline($in_file))) {
$exp_line = readline($exp_file);
if ($line ne $exp_line) {
$text_csv->parse($line);
my #in_fields = $text_csv->fields();
$text_csv->parse($exp_line);
my #exp_fields = $text_csv->fields();
my #differing_indices = $array_comparer->compare([#in_fields], [#exp_fields]);
#all_differing_indices{#differing_indices} = (1) x scalar(#differing_indices);
my #output_row = ('') x scalar(#exp_fields);
#output_row[0, 1, #differing_indices, $#exp_fields] = #exp_fields[0, 1, #differing_indices, $#exp_fields];
$all_differing_rows[$#all_differing_rows + 1] = [#output_row];
}
}
my #columns_needed = (0, 1, keys(%all_differing_indices), $#headers);
$text_csv->combine(#headers[#columns_needed]);
say $out_diff_file $text_csv->string();
for my $row_aref (#all_differing_rows) {
$text_csv->combine(#{$row_aref}[#columns_needed]);
say $out_diff_file $text_csv->string();
}
It works for the File1 and File2 given in the question and produces the Expected output (except that the Object_Name 'Alpha' is present in the data line - I'm assuming that's a typo in the question).
Time,Object_Name,Frequany,Longname
"2013-08-05 00:00",Alpha,815.13,Aircel_Indore

I've created a script for it with very powerful linux tools. Link here...
Linux / Unix - Compare Two CSV Files
This project is about comparison of two csv files.
Let's assume that csvFile1.csv has XX columns and csvFile2.csv has YY columns.
Script I've wrote should compare one (key) column form csvFile1.csv with another (key) column from csvFile2.csv. Each variable from csvFile1.csv (row from key column) will be compared to each variable from csvFile2.csv.
If csvFile1.csv has 1,500 rows and csvFile2.csv has 15,000 total number of combinations (comparisons) will be 22,500,000. So this is very helpful way how to create Availability Report script which for example could compare internal product database with external (supplier's) product database.
Packages used:
csvcut (cut columns)
csvdiff (compare two csv files)
ssconvert (convert xlsx to csv)
iconv
curlftpfs
zip
unzip
ntpd
proFTPD
More you can find on my official blog (+example script):
http://damian1baran.blogspot.sk/2014/01/linux-unix-compare-two-csv-files.html

Transpose using AWK or Perl

Hi I would like to use AWK or Perl to get an output file in the format below. My input file is a space separated text file. This is similar to an earlier question of mine, but in this case the input and output has no formatting. My column positions may change so would appreciate a technique which does not reference column number
Input File
id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 12 pink circle 20
2 12 yellow pentagon 3 orange rectangle 4 purple oval 6
Desired Output
id colour shape size
1 blue square 10
1 red triangle 12
1 pink circle 20
2 yellow pentagon 3
2 orange rectangle 4
2 purple oval 6
I am using this code by Dennis Williamson. Only problem is the output I get has no space separation in the transposed fields. I require one space separation
#!/usr/bin/awk -f
BEGIN {
col_list = "quantity colour shape"
# Use a B ("blank") to add spaces in the output before or
# after a format string (e.g. %6dB), but generally use the numeric argument
# columns to be repeated on multiple lines may appear anywhere in
# the input, but they will be output together at the beginning of the line
repeat_fields["id"]
# since these are individually set we won't use B
repeat_fmt["id"] = "%-1s "
# additional fields to repeat on each line
ncols = split(col_list, cols)
for (i = 1; i <= ncols; i++) {
col_names[cols[i]]
forms[cols[i]] = "%-1s"
}
}
# save the positions of the columns using the header line
FNR == 1 {
for (i = 1; i <= NF; i++) {
if ($i in repeat_fields) {
repeat[++nrepeats] = i
repeat_look[i] = i
rformats[i] = repeat_fmt[$i]
}
if ($i in col_names) {
col_nums[++n] = i
col_look[i] = i
formats[i] = forms[$i]
}
}
# print the header line
for (i = 1; i <= nrepeats; i++) {
f = rformats[repeat[i]]
sub("d", "s", f)
gsub("B", " ", f)
printf f, $repeat[i]
}
for (i = 1; i <= ncols; i++) {
f = formats[col_nums[i]]
sub("d", "s", f)
gsub("B", " ", f)
printf f, $col_nums[i]
}
printf "\n"
next
}
{
for (i = 1; i <= NF; i++) {
if (i in repeat_look) {
f = rformats[i]
gsub("B", " ", f)
repeat_out = repeat_out sprintf(f, $i)
}
if (i in col_look) {
f = formats[i]
gsub("B", " ", f)
out = out sprintf(f, $i)
coln++
}
if (coln == ncols) {
print repeat_out out
out = ""
coln = 0
}
}
repeat_out = ""
}
Output
id quantitycolourshape
1 10bluesquare
1 redtrianglepink
2 circle12yellow
2 pentagonorangerectangle
My apologies for not including all info about the actual file earlier. I did this only for simplicity, but it did not capture all my requirements.
In my actual file I am looking to transpose fields n_cell and n_bsc for NODE SITE CHILD
NODE SITE CHILD n_cell n_bsc
Here is a link to the actual file I am working on

<>;
print("id colour shape size\n");
while (<>) {
my #combined_fields = split;
my $id = shift(#combined_fields);
while (#combined_fields) {
my #fields = ( $id, splice(#combined_fields, 0, 3) );
print(join(' ', #fields), "\n");
}
}

You have told us that your real data consists of over 5,000 columns and that its column positions may change and I'm afraid that really isn't enough.
So in the absence of any proper information I have written this, which uses the header line to calculate the number and size of the sets of data, where the id column is, and in which column the first set starts.
It works fine on your example data, but I can only guess whether it will work on your live file.
use strict;
use warnings;
my #headers = split ' ', <>;
my %headers;
$headers{$_}++ for #headers;
die "Expected exactly one 'id' column" unless $headers{id} // 0 == 1;
my $id_index = 0;
$id_index++ while $headers[$id_index] ne 'id';
my #labels = grep $headers{$_} > 1, keys %headers;
my $set_size = #labels;
my $num_sets = $headers{$labels[0]};
my $start_index = 0;
$start_index++ while $headers[$start_index] ne $labels[0];
my #reformat;
while (<>) {
my #fields = split;
next unless #fields;
my $id = $fields[$id_index];
for (my $i = $start_index; $i < #fields; $i+=$set_size) {
push #reformat, [ $id, #fields[$i..$i + $set_size - 1] ];
}
}
unshift #labels, 'id';
print "#labels\n";
print "#$_\n" for #reformat;
output
id colour shape size
1 blue square 10
1 red triangle 12
1 pink circle 20
2 yellow pentagon 3
2 orange rectangle 4
2 purple oval 6