Perl - Use Data::Dumper to write back to a file

I have a large .csv file (2 - 3 million records). I need to concatenate the first three fields (with underscores) and append the result to each record, then sort the file on that new field and three other fields. I am able to do that (I'm testing with a 4-record file for now) - but I'm not sure how to write it back to the file in the same .csv form - instead of the way Data::Dumper formats each line as a separate variable. Here is the code I have so far - I have a couple of print (to screen) lines to see what it's doing -
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $filename = '/testpath/test.csv';
#$filename = 'test.csv';

open my $FH, $filename
    or die "Could not read from $filename <$!>, program halting.";

# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;

my @data;
# Read the lines one by one.
while ($line = <$FH>) {
    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
    push @data, \@fields;
}
close $FH;

print "Unsorted:\n", Dumper(@data); #, $/;

@data = sort {
    $a->[0]  cmp $b->[0]  ||
    $a->[20] cmp $b->[20] ||
    $a->[23] cmp $b->[23] ||
    $a->[26] cmp $b->[26]
} @data;

open my $OFH, '>', '/testpath/parsedTest.csv';
print $OFH Dumper(@data);
close $OFH;
exit;
I'm assuming it's in the "print $OFH Dumper(@data);" line that I need to re-format it back to its original form.
And please be kind as I am a novice.
__________EDIT__________________________________
Here are the four lines from the test .csv file - first line is the header record:
STORE_NBR,CONTROL_NBR,LINE_NBR,SALES_NBR,QTY_MISTINT,REASON_CODE,MISTINT_COMM,SZ_CDE,TINTER_MODEL,TINTER_SERL_NBR,SPECTRO_MODEL,SPECTRO_SERL_NBR,EMP_NBR,TRAN_DATE,TRAN_TIME,CDS_ADL_FLD,PROD_NBR,PALETTE,COLOR_ID,INIT_TRAN_DATE,GALLONS_MISTINTED,UPDATE_EMP_NBR,UPDATE_TRAN_DATE,GALLONS,FORM_SOURCE,UPDATE_TRAN_TIME,SOURCE_IND,CANCEL_DATE,COLOR_TYPE,CANCEL_EMP_NBR,NEED_EXTRACTED,MISTINT_MQ_XTR,DATA_SOURCE,GUID,QUEUE_NAME,BROKER_NAME,MESSAGE_ID,PUT_TIME,CREATED_TS
1334,53927,1,100551589,1,6,Bad Shercolor Match,16,IFC 8112NP,01DX8005513,,,77,10/23/2015,95816,,OV0020001,,MANUAL,10/21/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,5394A0E67FFF4D01A0D9AD16FA29ABB1,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533133333464EB2956052C0020,10/23/2015 10:45,10/23/2015 10:45
2525,67087,1,650462328,1,4,Tinted Wrong Product,14,IFC 8012NP,Standalone-5,,,11,10/23/2015,104314,,A91W00353,,,10/20/2015,0.25,0,,0.25,,,COMP,,CUSTOM MATCH,0,TRUE,TRUE,O,1AC5D8742D47435EA05343D57372AD32,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533235323531C2295605350020,10/23/2015 10:46,10/23/2015 10:47
1350,163689,1,650462302,1,3,Tinted Wrong Color,14,IFC 8012NP,06DX8006805,,,1,10/23/2015,104907,,A91W00351,COLOR,6233,10/23/2015,0.25,0,,0.5,ENG,,SW,,PALETTE,0,TRUE,TRUE,O,F1A072BCC548412FA22052698B5B0C28,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F53313335307BC12956053C0020,10/23/2015 10:52,10/23/2015 10:52
Hope that's not too convoluted to read.

Data::Dumper outputs a format that is valid Perl and is good for debugging, but not for writing a CSV file.
You could write the CSV by hand:
foreach my $row (@data) {
    print $OFH join(',', @$row), "\n";
}
but you really should use a specialized module, in this case Text::CSV, both for reading and writing the CSV – it will handle all the edge cases (such as fields with embedded commas).
The synopsis contains a good example of both reading and writing; I won't repeat that here.

You don't have to rebuild the line if you just store it in @data too!
my @data;
while (my $line = <$FH>) {
    chomp($line);
    my @fields = split(/,/, $line);
    push @data, [ "$line\n", join('_', @fields[0..2]), @fields[19, 22, 25] ];
}

@data = sort {
    $a->[1] cmp $b->[1] ||
    $a->[2] cmp $b->[2] ||
    $a->[3] cmp $b->[3] ||
    $a->[4] cmp $b->[4]
} @data;

print($OFH $_->[0]) for @data;
If your input didn't contain NULs, you could even use the following faster approach:
print $OFH
    map { /[^\0]*\z/g }
    sort
    map {
        chomp;
        my @fields = split /,/;
        join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], "$_\n")
    }
    <$FH>;
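The NUL-packing trick above is easier to see on toy data. A minimal sketch (made-up three-field records; only the first two fields are used as sort keys):

```perl
use strict;
use warnings;

# Pack the sort keys in front of each line, joined by "\0",
# sort the packed strings once, then keep only the original
# line (everything after the last NUL).
my @lines = ("b,2,x\n", "a,1,y\n", "b,1,z\n");

my @sorted =
    map  { /([^\0]*)\z/ }               # strip the packed keys
    sort                                # one plain string sort
    map  {
        my @f = split /,/;
        join "\0", $f[0], $f[1], $_;    # keys first, original line last
    }
    @lines;

print @sorted;    # a,1,y / b,1,z / b,2,x
```

This works because "\0" sorts before every printable character, so key boundaries never bleed into each other - which is also why the answer stresses that the input must not contain NULs.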
But yeah, you should probably use a legit CSV parser.
use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

my @data;
while (my $row = $csv->getline($FH)) {
    push @data, [ join('_', @$row[0..2]), $row ];
}

@data = sort {
    $a->[0]     cmp $b->[0]     ||
    $a->[1][19] cmp $b->[1][19] ||
    $a->[1][22] cmp $b->[1][22] ||
    $a->[1][25] cmp $b->[1][25]
} @data;

$csv->say($OFH, $_->[1]) for @data;
The following is the fast approach using a CSV parser:
use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });

print $OFH
    map { /[^\0]*\z/g }
    sort
    map {
        $csv->parse($_);
        my @fields = $csv->fields();
        join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], $_)
    }
    <$FH>;

Was unable to use Text::CSV_XS because it was not available on our server, unfortunately - but did find that adding this single "print" line worked -
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedTest.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;
Tested out fine with the small file, now to test on the actual file!
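One caveat with the plain join/split round trip (as the first answer notes): it only survives fields without embedded commas. A quick check with a made-up quoted value:

```perl
use strict;
use warnings;

# A field a CSV writer would quote -- naive split /,/ breaks it apart.
my $line = '1334,"Bad, Match",16';
my @fields = split /,/, $line;

# The quoted field is split in two: 4 pieces instead of 3.
print scalar(@fields), "\n";    # 4
```

If a free-text field like MISTINT_COMM can ever contain a comma, the Text::CSV approach is the safe one.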


Reformat Dates in Perl (for later use in MySQL)

I am writing a Perl script to open a .csv file, make some changes, sort it on four fields, then write it back to a new file. Because this data will then be used to load a MySQL table, I also need to reformat the date fields. Currently, dates appear in the file as 00/00/0000, and MySQL needs them formatted as 0000-00-00. Right now I'm trying to do it for just one field, although I actually need to do it on three date fields in each line of the .csv file.
This script is running - but it is not reformatting the Date field I'm trying to test this on.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

#my $filename = '/swpkg/shared/batch_processing/mistints/mistints.csv';
my $filename = 'tested.csv';

open my $FH, $filename
    or die "Could not read from $filename <$!>, program halting.";

# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;

my @data;
# Read the lines one by one.
while ($line = <$FH>) {
    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
    push @data, \@fields;
    my $in_date = $fields[14];
    my $db_date = join '-', reverse split /\D/, $in_date;
}
close $FH;

print "Unsorted:\n", Dumper(@data); #, $/;

@data = sort {
    $a->[0]  cmp $b->[0]  ||
    $a->[20] cmp $b->[20] ||
    $a->[23] cmp $b->[23] ||
    $a->[26] cmp $b->[26]
} @data;

open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
#print $OFH Dumper(@data);
print $OFH join(',', @$_), $/ for @data;
close $OFH;

#print "Sorted:\n", Dumper(@data);
#print "Sorted:", Dumper(@data);
exit;
The two lines I added to this script (which are not working) are the my $in_date and my $db_date lines. I will also need to reformat two DATETIME fields (at the end of each line), e.g. 10/23/2015 10:47, where I only need to reformat the date within that field, and I'm not even sure where to begin tackling that one.
And please go easy since I'm a noob with Perl.
EDIT - SORRY, had to re-edit because I didn't notice the first part of my script had not copied.
Rather than using a bunch of string functions, it's better to use the Time::Piece module to parse and reformat date-time values. It has strptime and strftime methods to do this for you. This short program shows the reformatting of both date-time formats that you mention. ymd is a convenience method, and is equivalent to strftime('%Y-%m-%d')
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = Time::Piece->strptime($in_date, '%m/%d/%Y')->ymd;
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = Time::Piece->strptime($in_date, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
output
01/02/2003 -> 2003-02-01
01/02/2003 04:05 -> 2003-02-01 04:05
Update
If you prefer, you could write a subroutine that takes the original date and its format string, together with the desired format. Like this
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = date_from_to($in_date, '%m/%d/%Y', '%Y-%m-%d');
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = date_from_to($in_date, '%m/%d/%Y %H:%M', '%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
sub date_from_to {
    my ($date, $from, $to) = @_;
    Time::Piece->strptime($date, $from)->strftime($to);
}
The output is identical to that of the program above
Update
Regarding your comment, your code should look like this
$_ = join '-', (split /\//)[2,0,1] for @fields[14, 20, 23];
$_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;
But I would prefer to see some consistency in the way the date fields are handled, like this
$_ = Time::Piece->strptime($_, '%m/%d/%Y')->strftime('%Y-%m-%d') for @fields[14,20,23];
$_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;

Selecting records from a file based on keys from a second file

My first file looks like:
CHR id position
1 rs58108140 10583
1 rs189107123 10611
1 rs180734498 13302
1 rs144762171 13327
1 chr1:13957:D 13957
And my second file looks like:
CHR SNP POS RiskAl OTHER_ALLELE RAF logOR Pval
10 rs1999138 110140096 T C 0.449034245446375 0.0924443 1.09e-06
6 rs7741604 20839503 C A 0.138318264238111 0.127947 1.1e-06
8 rs1486006 82553172 G C 0.833130882716561 0.147456 1.12727730194884e-06
My script reads in the first file and stores it in an array, and then I would like to find rsIDs from column 2 of the first file that are in column 2 of the second file. I think I am having a problem with how I'm matching the expressions. Here is my script:
#! perl -w
use strict;
use warnings;
my $F = shift @ARGV;
my @snps;
open IN, "$F";
while (<IN>) {
    next if m/CHR/;
    my @L = split;
    push @snps, [$L[0], $L[1], $L[2]] if $L[0] !~ m/[XY]/;
}
close IN;

open IN, "DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt";
while (<IN>) {
    my @L = split;
    next if m/CHR/;
    foreach (@snps) {
        next if ($L[0] != ${$_}[0]);
        # if not on same chromosome
        if ($L[0] = ${$_}[0]) {
            # if on same chromosome
            if ($L[1] =~ ${$_}[1]) {
                print "$L[0] $L[1] ${$_}[2]\n";
                last;
            }
        }
    }
}
Your code doesn't seem to correspond to your description. You are comparing both the first and second columns of the file rather than just the second.
The main problems are:
You use $L[0] = ${$_}[0] to compare the first columns. This will do an assignment instead of a comparison. You should use $L[0] == ${$_}[0] instead or, better, $L[0] == $_->[0]
You use $L[1] =~ ${$_}[1] to compare the second columns. This will check whether ${$_}[1] is a substring of $L[1]. You could use anchors like $L[1] =~ /^${$_}[1]$/ but it's much better to just do a string comparison as $L[1] eq $_->[1]
The easiest way is to read the second file first so as to build a list of values that you want included from the first file. I have written it so that it does what your code looks like it's supposed to do, i.e. match the first two columns.
That would look like this
use strict;
use warnings;
use autodie;

my ($f1, $f2) = @ARGV;

my %include;

open my $fh2, '<', $f2;
while (<$fh2>) {
    my @fields = split;
    my $key = join '|', @fields[0,1];
    ++$include{$key};
}
close $fh2;

open my $fh1, '<', $f1;
while (<$fh1>) {
    my @fields = split;
    my $key = join '|', @fields[0,1];
    print "@fields[0,1,2]\n" if $include{$key};
}
close $fh1;
output
Unfortunately your choice of sample data doesn't include any records in the first file that have matching keys in the second, so there is no output!
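On toy rows that do share a key, the composite-key technique from the program above looks like this (values invented for illustration):

```perl
use strict;
use warnings;

# Build the include set from "file 2" rows: chromosome + SNP id.
my @file2 = ( [10, 'rs1999138'], [6, 'rs7741604'] );
my %include;
$include{ join '|', @$_[0,1] }++ for @file2;

# Keep only "file 1" rows whose composite key is in the set.
my @file1 = ( [10, 'rs1999138', 110140096], [1, 'rs58108140', 10583] );
my @kept  = grep { $include{ join '|', @$_[0,1] } } @file1;

print "@$_\n" for @kept;    # 10 rs1999138 110140096
```

Each file-2 row costs one hash insert and each file-1 row one hash lookup, so this stays linear instead of the nested-loop scan in the original.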
Update
This is a corrected version of your own program. It should work, but it is far more efficient and concise to use hashes, as above
use strict;
use warnings;
use autodie;

my ($filename) = @ARGV;

my @snps;

open my $in_fh, '<', $filename;
<$in_fh>;    # Discard header line
while (<$in_fh>) {
    my @fields = split;
    push @snps, \@fields unless $fields[0] =~ /[XY]/;
}
close $in_fh;

open $in_fh, '<', 'DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt';
<$in_fh>;    # Discard header line
while (<$in_fh>) {
    my @fields = split;
    for my $snp (@snps) {
        next unless $fields[0] == $snp->[0] and $fields[1] eq $snp->[1];
        print "$fields[0] $fields[1] $snp->[2]\n";
        last;
    }
}
close $in_fh;

perl hash mapping/retrieval issues with split and select columns

Perl find and replace multiple (huge) strings in one shot
P.S. This question is related to the answer for the above question.
When I try to replace this code:
Snippet-1
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
with this code:
Snippet-2
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
even though the key exists (as shown in the Dumper output), the exists check doesn't return the value for the key.
For the same input file, it works perfectly with plain split (Snippet-1), but returns nothing when I select specific columns after the split (Snippet-2).
Is there some integer/string datatype mess-up happening here?
Input Mapping File
483329,Buffalo
483330,Buffalo
483337,Buffalo
Script Output
$VAR1 = {
    '483329' => 'Buffalo',
    '46546'  => 'Chicago_CW',
    '745679' => 'W. Washington',
};
1 search is ENB
2 search is 483329 **expected Buffalo here**
3 search is 483330
4 search is 483337
Perl Code
open my $map_fh, '<', $MarketMapFile or die $!;
if ($MapSelection =~ /eNodeBID/i) {
    my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
    use Data::Dumper;
    print Dumper(\%replace);
}
close $map_fh;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/, quote_space => 0 });
my $tmpCSVFile = $CSVFile . "tmp";

open my $in_fh,  '<', $CSVFile    or die $!;
open my $out_fh, '>', $tmpCSVFile or die $!;

my $cnt = 1;
while (my $row = $csv->getline($in_fh)) {
    my $search = $row->[5];
    $search =~ s/[^[:print:]]+//g;
    if ($MapSelection =~ /eNodeBID/i) {
        $search =~ s/(...)-(...)-//g;
        $search =~ s/\(M\)//g;
    }
    my $match = (exists $replace{$search}) ? $replace{$search} : undef;
    print "\n$cnt search is $search ";
    if (defined($match)) {
        $match =~ s/[^[:print:]]+//g;
        print "and match is $match";
    }
    push @$row, $match;
    #print " match is $match";
    $csv->print($out_fh, $row);
    $cnt++;
}
# untie %replace;
close $in_fh;
close $out_fh;
You have a problem of scope. Your code:
if ($MapSelection =~ /eNodeBID/i) {
    my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
    use Data::Dumper;
    print Dumper(\%replace);
}
declares %replace within the if block. Move it outside so that it can also be seen by later code:
my %replace;
if ($MapSelection =~ /eNodeBID/i) {
    %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
    use Data::Dumper;
    print Dumper(\%replace);
}
Putting use strict and use warnings at the top of your code helps you find these kinds of issues.
Also, you can just use my $match = $replace{$search} since it's equivalent to your ?: operation.
Always include use strict; and use warnings; at the top of EVERY Perl script. If you had done that and been maintaining good coding practice by declaring your variables, you would have gotten the error:
Global symbol "%replace" requires explicit package name at
That would've let you know there was a scoping issue with your code. One way to avoid that is to use a ternary in your initialization of %replace
my %replace = ($MapSelection =~ /eNodeBID/i)
    ? map { chomp; (split /,/)[0,1] } <$map_fh>
    : ();

Not getting output while creating Hash of Arrays

I am trying to create a hash of arrays. I am taking data from a txt file and converting it into a hash of arrays.
Txt file data is as below
group1 : usr1 usr4 usr6
group2 : usr2 usr1 usr5
group3 : usr1 usr2 usr3
so on ......
I am converting this to a hash of arrays like
%hash = (group1 => [usr1 usr4 usr6], group2 => [usr2 usr1 usr5]);
Following code i am trying
%hash = ();
open (FH, "2.txt") or die "file not found";
while (<FH>) {
    @array = split (":", $_);
    $array[1] =~ s/^\s*//;
    $array[1] =~ s/\s*$//;
    @arrayRef = split (" ", $array[1]);
    $hash{$array[0]} = [ @arrayRef ];
    #print @array;
    #print "\n";
}
close FH;
print $hash{group1}[0];
print @{ $hash{group2} };
I am not getting output. There is something wrong in the code. Please help me understanding it better
Your code works for me, but the problem is that you are using the key "group1 " (note the extra space), not "group1" as you think. When you split on the colon :, you remember to strip spaces from the field after it, but not from the field before. You should probably do:
my @array = split /\s*:\s*/, $_;
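A short self-contained check of the difference (sample line from the question):

```perl
use strict;
use warnings;

my $line = "group1 : usr1 usr4 usr6";

# Plain split on ":" keeps the trailing space in the key.
my ($bad_key) = split ":", $line;

# Stripping whitespace around the colon gives the key you expect.
my ($good_key) = split /\s*:\s*/, $line;

print "bad: <$bad_key>  good: <$good_key>\n";    # bad: <group1 >  good: <group1>
```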
Also, you should always use
use strict;
use warnings;
Coding without these two pragmas is difficult and takes much longer.
use strict;
use warnings;

my %hash;
open (my $FH, "<", "2.txt") or die $!;
while (<$FH>) {
    my ($key, @array) = split /[:\s]+/, $_;
    $hash{$key} = \@array;
}
close $FH;

use Data::Dumper;
print Dumper \%hash;
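With clean keys, the access patterns from the question then behave as expected; a sketch with the sample data inlined:

```perl
use strict;
use warnings;

# The structure the fixed loop builds from the sample file.
my %hash = (
    group1 => [qw(usr1 usr4 usr6)],
    group2 => [qw(usr2 usr1 usr5)],
);

print $hash{group1}[0], "\n";      # first member of group1: usr1
print "@{ $hash{group2} }\n";      # whole list: usr2 usr1 usr5
```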

Perl sort multiple column array

The hostlist.txt file has only one column. The program reads the hostlist.txt file, removes duplicate hostnames, sorts the list, looks up the IP address of each host in the list, and prints the output on the terminal.
hostlist.txt
host01
host03
host02
host01
output on terminal
host01,192.168.1.15
host02,192.168.1.12
host03,192.168.1.33
Program:
open(HOSTFILE, "hostlist.txt") or die "Couldn't open location file: $!\n";
while ($hosts = <HOSTFILE>) {
    chomp($hosts);
    push(@hostnames, $hosts);
}
close HOSTFILE;

@hostnameUnique = uniq(@hostnames);
@hostnameUniqueSorted = sort { lc($a) cmp lc($b) } @hostnameUnique;

foreach $hostname (@hostnameUniqueSorted) {
    $ipaddr = inet_ntoa((gethostbyname($hostname))[4]);
    print "$hostname,$ipaddr\n";
}
I want to do the same thing as above, except the input file newhostlist.txt has 3 columns. Remove the duplicate hostnames, sort by the first column ($type), then the third column ($location), then the second column ($hostname), look up the IP address, and print the output.
How do I process the multiple column array?
newhostlist.txt
dell,host01,dc2
dell,host03,dc1
hp,host02,dc1
dell,host01,dc2
Output:
dell,host03,192.168.1.33,dc1
hp,host02,192.168.1.12,dc1
dell,host01,192.168.1.15,dc2
#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', "newhostlist.txt") or die "Unable to open file: $!\n";

my %unique = map { $_ => 1 } <$fh>;

my @data =
    map { join ",", ($_->[0], $_->[1], (@{$_->[3]} = gethostbyname($_->[1])) ? inet_ntoa($_->[3][4]) : '-', $_->[2]) }
    sort {
        $a->[0] cmp $b->[0] ||
        $a->[2] cmp $b->[2] ||
        $a->[1] cmp $b->[1]
    }
    map { s/^\s+|\s+$//g; [ split /,/ ] } keys %unique;
ETA: Added the check for failed ipaddr lookup.
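The map/sort/map chain above is a Schwartzian transform. With the gethostbyname lookup left out, the same multi-column comparator can be sketched on the sample rows like this:

```perl
use strict;
use warnings;

my @lines = ("dell,host01,dc2\n", "dell,host03,dc1\n", "hp,host02,dc1\n");

my @sorted =
    map  { join ',', @$_ }               # arrayref back to a CSV string
    sort {
        $a->[0] cmp $b->[0] ||           # type
        $a->[2] cmp $b->[2] ||           # location
        $a->[1] cmp $b->[1]              # hostname
    }
    map  { chomp(my $l = $_); [ split /,/, $l ] }    # line -> arrayref
    @lines;

print "$_\n" for @sorted;
```

The point of the transform is that each line is split exactly once, no matter how many comparisons the sort makes.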
The easiest way to handle this would be to use the diamond operator, I feel:
use strict;
use warnings;
use ARGV::readonly;
use Socket qw(inet_ntoa);

my %seen;
while (<>) {
    chomp;    # to avoid missing newline at eof errors
    next if $seen{$_};
    $seen{$_}++;
    my @row  = split /,/, $_;
    my @host = gethostbyname($row[1]);
    my $ipaddr;
    if (@host == 0) {
        $ipaddr = "host not found - ip not avail";
    } else {
        $ipaddr = inet_ntoa($host[4]);
    }
    splice @row, 2, 0, $ipaddr;
    print join(",", @row), "\n";
}
Using ARGV::readonly allows for somewhat safer usage of the implicit file opens used with the diamond operator.
After that, simply weed out lines already seen by using a hash, split the row, put in the value you need where you need it, and print out the reassembled row.
If you expect more complicated data in your rows, you might wish to look at a csv module, such as Text::CSV.
I recommend to use an array of hashes for this:
.....
my ($type, $hostname, $location) = split /,/, $line;
push @records, {
    type     => $type,
    hostname => $hostname,
    location => $location,
};
.....
my @records_sorted = sort {
    $a->{type}     cmp $b->{type}     ||
    $a->{location} cmp $b->{location} ||
    $a->{hostname} cmp $b->{hostname}
} @records;
...