The hostlist.txt file has only one column. The program reads hostlist.txt, removes duplicate hostnames, sorts the list, looks up the IP address of each host in the list, and prints the output on the terminal.
hostlist.txt
host01
host03
host02
host01
output on terminal
host01,192.168.1.15
host02,192.168.1.12
host03,192.168.1.33
Program:
use Socket qw(inet_ntoa);        # inet_ntoa is not a built-in
use List::MoreUtils qw(uniq);    # provides uniq()
open(HOSTFILE, "<", "hostlist.txt") or die "Couldn't open host list file: $!\n";
while ($hosts = <HOSTFILE>) {
    chomp($hosts);
    push(@hostnames, $hosts);
}
close HOSTFILE;
@hostnameUnique = uniq(@hostnames);
@hostnameUniqueSorted = sort { lc($a) cmp lc($b) } @hostnameUnique;
foreach $hostname (@hostnameUniqueSorted) {
    $ipaddr = inet_ntoa((gethostbyname($hostname))[4]);
    print "$hostname,$ipaddr\n";
}
I want to do the same thing as above, except the input file newhostlist.txt has 3 columns. Remove duplicate hostnames, sort by the first column ($type), then by the 3rd column ($location), then by the 2nd column ($hostname), look up the IP address, and print the output.
How do I process the multiple column array?
newhostlist.txt
dell,host01,dc2
dell,host03,dc1
hp,host02,dc1
dell,host01,dc2
Output:
dell,host03,192.168.1.33,dc1
hp,host02,192.168.1.12,dc1
dell,host01,192.168.1.15,dc2
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_ntoa);   # needed for inet_ntoa
open(my $fh, '<', "newhostlist.txt") or die "Unable to open file: $!\n";
my %unique = map {$_ => 1} <$fh>;
my @data =
    map {join ",", ($_->[0], $_->[1], (@{$_->[3]} = gethostbyname($_->[1])) ? inet_ntoa($_->[3][4]) : '-', $_->[2])}
    sort {$a->[0] cmp $b->[0] ||
          $a->[2] cmp $b->[2] ||
          $a->[1] cmp $b->[1]}
    map {s/^\s+|\s+$//g; [split /,/]} keys %unique;
print "$_\n" for @data;
ETA: Added the check for failed ipaddr lookup.
The easiest way to handle this would be to use the diamond operator, I feel:
use strict;
use warnings;
use ARGV::readonly;
use Socket qw(inet_ntoa);   # for inet_ntoa
my %seen;
while (<>) {
    chomp; # to avoid missing newline at eof errors
    next if $seen{$_};
    $seen{$_}++;
    my @row = split /,/, $_;
    my @host = gethostbyname($row[1]);   # column 2 holds the hostname
    my $ipaddr;
    if (@host == 0) {
        $ipaddr = "host not found - ip not avail";
    } else {
        $ipaddr = inet_ntoa($host[4]);
    }
    splice @row, 2, 0, $ipaddr;
    print join(",", @row), "\n";
}
Using ARGV::readonly allows for somewhat safer usage of the implicit file opens used with the diamond operator.
After that, simply weed out lines already seen by using a hash, split the row, put in the value you need where you need it, and print out the reassembled row.
If you expect more complicated data in your rows, you might wish to look at a csv module, such as Text::CSV.
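For instance, a minimal sketch of reading the three-column host file with Text::CSV instead of a plain split (file name and column layout taken from the question above):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
open my $fh, '<', 'newhostlist.txt' or die "Unable to open file: $!\n";

while (my $row = $csv->getline($fh)) {
    # getline returns an array ref of fields with any quoting already handled
    my ($type, $hostname, $location) = @$row;
    print "$type $hostname $location\n";
}
close $fh;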
I recommend using an array of hashes for this:
.....
my ($type, $hostname, $location) = split /,/, $line;
push @records, {
type => $type,
hostname => $hostname,
location => $location,
};
.....
my @records_sorted = sort { $a->{type} cmp $b->{type} || $a->{location} cmp $b->{location} || $a->{hostname} cmp $b->{hostname} } @records;
...
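Putting those pieces together, a complete runnable sketch might look like this; the file name, the duplicate check, and the Socket-based IP lookup are filled in from the question rather than taken from the snippet above:
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

open my $fh, '<', 'newhostlist.txt' or die "Unable to open file: $!\n";
my (%seen, @records);
while (my $line = <$fh>) {
    chomp $line;
    next if $seen{$line}++;                      # skip duplicate rows
    my ($type, $hostname, $location) = split /,/, $line;
    push @records, {
        type     => $type,
        hostname => $hostname,
        location => $location,
    };
}
close $fh;

my @records_sorted = sort {
       $a->{type}     cmp $b->{type}
    || $a->{location} cmp $b->{location}
    || $a->{hostname} cmp $b->{hostname}
} @records;

for my $rec (@records_sorted) {
    my @host = gethostbyname($rec->{hostname});
    my $ipaddr = @host ? inet_ntoa($host[4]) : '-';   # '-' if the lookup fails
    print join(',', $rec->{type}, $rec->{hostname}, $ipaddr, $rec->{location}), "\n";
}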
I have a large .csv file (2 to 3 million records). I need to concatenate the first three fields (with underscores) and append the result to each record, then sort the file based on that new field and three other fields. I am able to do that (I'm testing it with a 4-record file for now), but I'm not sure how to write it back to the file in the same .csv form, instead of the way Data::Dumper formats each line as a separate variable. Here is the code I have so far (I have a couple of print-to-screen lines to see what it's doing):
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $filename = '/testpath/test.csv';
#$filename = 'test.csv';
open my $FH, '<', $filename
or die "Could not read from $filename <$!>, program halting.";
# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;
my @data;
# Read the lines one by one.
while($line = <$FH>) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my @fields = split(/,/, $line);
unshift @fields, join '_', @fields[0..2];
push @data, \@fields;
}
close $FH;
print "Unsorted:\n", Dumper(#data); #, $/;
#data = sort {
$a->[0] cmp $b->[0] ||
$a->[20] cmp $b->[20] ||
$a->[23] cmp $b->[23] ||
$a->[26] cmp $b-> [26]
} #data;
open my $OFH, '>', '/testpath/parsedTest.csv';
print $OFH Dumper(#data);
close $OFH;
exit;
I'm assuming it's in the "print $OFH Dumper(@data);" line that I need to re-format it back to its original form.
And please be kind as I am a novice.
__________EDIT__________________________________
Here are the four lines from the test .csv file - first line is the header record:
STORE_NBR,CONTROL_NBR,LINE_NBR,SALES_NBR,QTY_MISTINT,REASON_CODE,MISTINT_COMM,SZ_CDE,TINTER_MODEL,TINTER_SERL_NBR,SPECTRO_MODEL,SPECTRO_SERL_NBR,EMP_NBR,TRAN_DATE,TRAN_TIME,CDS_ADL_FLD,PROD_NBR,PALETTE,COLOR_ID,INIT_TRAN_DATE,GALLONS_MISTINTED,UPDATE_EMP_NBR,UPDATE_TRAN_DATE,GALLONS,FORM_SOURCE,UPDATE_TRAN_TIME,SOURCE_IND,CANCEL_DATE,COLOR_TYPE,CANCEL_EMP_NBR,NEED_EXTRACTED,MISTINT_MQ_XTR,DATA_SOURCE,GUID,QUEUE_NAME,BROKER_NAME,MESSAGE_ID,PUT_TIME,CREATED_TS
1334,53927,1,100551589,1,6,Bad Shercolor Match,16,IFC 8112NP,01DX8005513,,,77,10/23/2015,95816,,OV0020001,,MANUAL,10/21/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,5394A0E67FFF4D01A0D9AD16FA29ABB1,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533133333464EB2956052C0020,10/23/2015 10:45,10/23/2015 10:45
2525,67087,1,650462328,1,4,Tinted Wrong Product,14,IFC 8012NP,Standalone-5,,,11,10/23/2015,104314,,A91W00353,,,10/20/2015,0.25,0,,0.25,,,COMP,,CUSTOM MATCH,0,TRUE,TRUE,O,1AC5D8742D47435EA05343D57372AD32,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533235323531C2295605350020,10/23/2015 10:46,10/23/2015 10:47
1350,163689,1,650462302,1,3,Tinted Wrong Color,14,IFC 8012NP,06DX8006805,,,1,10/23/2015,104907,,A91W00351,COLOR,6233,10/23/2015,0.25,0,,0.5,ENG,,SW,,PALETTE,0,TRUE,TRUE,O,F1A072BCC548412FA22052698B5B0C28,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F53313335307BC12956053C0020,10/23/2015 10:52,10/23/2015 10:52
Hope that's not too convoluted to read.
Data::Dumper outputs a format that is valid perl, and is good for debugging, but not for writing a CSV file.
You could write the CSV by hand:
foreach my $row (@data) {
print $OFH join(',', @$row), "\n";
}
but you really should use a specialized module, in this case Text::CSV, both for reading and writing the CSV – it will handle all the border cases (such as fields with embedded commas).
The synopsis contains a good example of both reading and writing; I won't repeat that here.
You don't have to rebuild the line if you just store it in @data too!
my @data;
while(my $line = <$FH>) {
chomp($line);
my @fields = split(/,/, $line);
push @data, [ "$line\n", join('_', @fields[0..2]), @fields[19, 22, 25] ];
}
@data = sort {
$a->[1] cmp $b->[1] ||
$a->[2] cmp $b->[2] ||
$a->[3] cmp $b->[3] ||
$a->[4] cmp $b->[4]
} @data;
print($OFH $_->[0]) for @data;
If your input didn't contain NULs, you could even use the following faster approach:
print $OFH
map { /[^\0]*\z/g }
sort
map {
chomp;
my @fields = split /,/;
join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], "$_\n")
}
<$FH>;
But yeah, you should probably use a legit CSV parser.
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
my @data;
while (my $row = $csv->getline($FH)) {
push @data, [ join('_', @$row[0..2]), $row ];
}
@data = sort {
$a->[0] cmp $b->[0] ||
$a->[1][19] cmp $b->[1][19] ||
$a->[1][22] cmp $b->[1][22] ||
$a->[1][25] cmp $b->[1][25]
} @data;
$csv->say($OFH, $_->[1]) for @data;
The following is the fast approach using a CSV parser:
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
print $OFH
map { /[^\0]*\z/g }
sort
map {
$csv->parse($_);
my @fields = $csv->fields();
join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], $_)
}
<$FH>;
Was unable to use Text::CSV_XS because it was not available on our server, unfortunately - but I did find that adding this single "print" line worked -
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedTest.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;
Tested out fine with the small file, now to test on the actual file!
I've asked before how to do this with AWK, but it doesn't handle it all that well.
The data has semicolons in quoted fields, which AWK doesn't take into account, so I was trying it in Perl with the Text::CSV module so I don't have to think about that. The problem is I don't know how to output the rows to files based on a column value.
Short example from previous question, the data:
10002394;"""22.98""";48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;"""Miami""";http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;Chicago;"""http://testdata.com/bla/28798580.jpg""";5.95;10201848233
10025825;12.99;65;Chicago;"""http://testdata.com/bla/29017837.jpg""";5.95;93962025367
The desired result:
File --> 26.csv
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049
File --> 48.csv
10002394;22.98;48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452
File --> 53.csv
10003122;13.0;53;Miami;http://testdata.com/bla/29019899.jpg;5.95;24404000059
File --> 65.csv
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10007645;20.99;65;Chicago;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025825;12.99;65;Chicago;http://testdata.com/bla/29017837.jpg;5.95;93962025367
This is what I have so far. EDIT: Modified code:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
#use Data::Dumper;
use Time::Piece;
my $inputfile = shift || die "Give input and output names!\n";
open my $infile, '<', $inputfile or die "Sourcefile in use / not found :$!\n";
#binmode($infile, ":encoding(utf8)");
my $csv = Text::CSV_XS->new({binary => 1,sep_char => ";",quote_space => 0,eol => $/});
my %fh;
my %count;
my $country;
my $date = localtime->strftime('%y%m%d');
open(my $fh_report, '>', "report$date.csv");
$csv->getline($infile);
while ( my $elements = $csv->getline($infile)){
EDITED IN:
__________
next unless ($elements->[29] =~ m/testdata/);
for (@$elements){
next if ($elements =~ /apple|orange|strawberry/);
}
__________
for (@$elements){
s/\"+/\"/g;
}
my $filename = $elements->[2];
my $shop = $elements->[3] .";". $elements->[2];
$count{$country}++;
$fh{$filename} ||= do {
open(my $fh, '>:encoding(UTF-8)', $filename . ".csv") or die "Could not open file '$filename'";
$fh;
};
$csv->print($fh{$filename}, $elements);
}
#print $fh_report Dumper(\%count);
foreach my $name (reverse sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
print $fh_report "$name;$count{$name}\n";
}
close $fh_report;
Errors:
Can't call method "print" on an undefined value at sort_csv_delimiter.pl line 28, <$infile> line 2
I've been messing around with this but I'm totally at a loss. Can someone help me?
My guess is that you want hash of cached file handles,
my %fh;
while ( my $elements = $csv->getline( $infile ) ) {
my $filename = $elements->[2];
$fh{$filename} ||= do {
open my $fh, ">", "$filename.csv" or die $!;
$fh;
};
# $csv->combine(@$elements);
$csv->print($fh{$filename}, $elements);
}
I don't see an instance of your stated problem -- occurrences of the semicolon separator character ; within quoted fields -- but you are correct that Text::CSV will handle it correctly.
This short program reads your example data from the DATA file handle and prints the result to STDOUT. I presume you know how to read from or write to different files if you wish.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });
my @data;
while ( my $row = $csv->getline(\*DATA) ) {
push @data, $row;
}
my $file;
for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
unless (defined $file and $file == $row->[2]) {
$file = $row->[2];
printf "\nFile --> %d.csv\n", $file;
}
$csv->print(\*STDOUT, $row);
}
__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
output
File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;"10201848233 "
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
Update
I have just realised that your "desired result" isn't the output that you expect to see, but rather the way separate records are written to different files. This program solves that.
It looks from your question as though you want the data sorted in order of the first field as well, and so I have read all of the file into memory and printed a sorted version to the relevant files. I have also used autodie to avoid having to code status checks for all the IO operations.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });
my @data;
while ( my $row = $csv->getline(\*DATA) ) {
push @data, $row;
}
my ($file, $fh);
for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
unless (defined $file and $file == $row->[2]) {
$file = $row->[2];
open $fh, '>', "$file.csv";
}
$csv->print($fh, $row);
}
close $fh;
__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
FWIW I have done this using Awk (gawk):
awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }' bigfile.txt
other_process data | awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }'
Let me explain the awk script:
BEGIN { # execution block before reading any file (once)
if(!(col ~/^[1-9]/)) exit 2; # assert the `col` variable is a positive number
outname = "part-%s.txt"; # formatting string of the output file names
}
!/^#/ { # only process lines not starting with '#' (header/comments in various data files)
out = sprintf(outname, $col); # format the output file name, given the value in column `col`
print > out; # put the line to that file
}
If you like you can add a variable to specify a custom filename or use the current filename (or STDIN) as prefix:
NR == 1 { # at the first file (not BEGIN, as we might need FILENAME)
if(!(col ~/^[1-9]/)) exit 2; # assert the `col` variable is a positive number
if(!outname) outname = (FILENAME == "-" ? "STDIN" : FILENAME); # if `outname` variable was not provided (with `-v/--assign`), use current filename or STDIN
if(!(outname ~ /%s/)) outname = outname ".%s"; # if `outname` is not a formatting string - containing %s - append it
}
!/^#/ { # only process lines not starting with '#' (header/comments in various data files)
out = sprintf(outname, $col); # format the output file name, given the value in column `col`
print > out; # put the line to that file
}
Note: if you provide multiple input files, only the first file's name will be used as output prefix. To support multiple input files and multiple prefixes, you can use FNR == 1 instead and add another variable to distinguish between user-provided outname and the auto-generated one.
I have 3 (or more) files I need to merge; the data looks like this:
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4,
0334.89765
0334.89765
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried the following, but the data in the 4th file does not come out in sorted order:
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";
foreach $file (@ARGV)
{
print "$file\n";
open (IN, "$file") || die "cannot open file $!";
open (OUT,">>$outFile") || die "cannot open file $!";
while ( <IN> )
{
chomp $_;
($timestamp,$data) = split (/\./,$_);
$hash{$timeStamp}{'data'}=$data;
if (defined $hash{$timeStamp})
{
print "$_\n";
print OUT"$_\n";
}
}
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but unix utilties should be able to handle this just fine.
cat the 3 files together.
use sort to sort the merged file.
However, using perl, could just do the following:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
push @data, $_ while (<>);
# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient both in memory and speed if you're able to take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next one in order each loop of the while:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;
# Initialize File handles
my @fhs = map {open my $fh, '<', $_; $fh} @ARGV;
# First Line of each file
my @data = map {scalar <$_>} @fhs;
# Loop while a next line exists
while (@data) {
# Pull out the next entry.
my $index = (sort {$data[$a] cmp $data[$b]} (0..$#data))[0];
print $data[$index];
# Fill In next Data at index.
if (! defined($data[$index] = readline $fhs[$index])) {
# End of that File
splice @fhs, $index, 1;
splice @data, $index, 1;
}
}
Using Miller's idea in a more reusable way,
use strict;
use warnings;
sub get_sort_iterator {
my @fhs = map {open my $fh, '<', $_ or die $!; $fh} @_;
my @d;
return sub {
for my $i (0 .. $#fhs) {
# skip to next file handle if it doesn't exists or we have value in $d[$i]
next if !$fhs[$i] or defined $d[$i];
# reading from $fhs[$i] file handle was success?
if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
# file handle at EOF, not needed any more
else { undef $fhs[$i] }
}
# compare as numbers, return undef if no more data
my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0..$#d
or return;
# return value from $d[$index], and set it to undef
return delete $d[$index];
};
}
my $iter = get_sort_iterator(@ARGV);
while (defined(my $x = $iter->())) {
print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Suppose every input files are already in ascending order and have at least one line in them, this script could merge them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'reduce';
sub min_index {
reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}
my @fhs = map { open my $fh, '<', $_; $fh } @ARGV;
my @data = map { scalar <$_> } @fhs;
while (@data) {
my $idx = min_index(@data);
print "$data[$idx]";
if (! defined($data[$idx] = readline $fhs[$idx])) {
splice @data, $idx, 1;
splice @fhs, $idx, 1;
}
}
Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes - each hash corresponding to an input file and containing a file handle (fh), the last line read (line), and the timestamp extracted from that line (timestamp).
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeatedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;
my @data;
for my $file (@ARGV) {
my $item;
open $item->{fh}, '<', $file;
insert_item($item, \@data);
}
while (@data) {
my $item = pop @data;
print $item->{line};
insert_item($item, \@data);
}
sub insert_item {
my ($item, $array) = @_;
return if eof $item->{fh};
$item->{line} = readline $item->{fh};
($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
my $i = 0;
++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
splice @$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764
Given a set of genes and existing pair of genes, I want to generate new pairs of genes which are not already existing.
The genes file has the following format :
123
134
23455
3242
3423
...
...
The genes pairs file has the following format :
12,345
134,23455
23455,343
3242,464452
3423,7655
...
...
But I still get a few common elements between known_interactions and new_pairs. I'm not sure where the error is.
For the arguments,
perl generate_random_pairs.pl entrez_genes_file known_interactions_file 250000
I got 15880 common elements. The number 250000 tells the program how many random pairs I want it to generate.
#!/usr/bin/perl
use strict;
use warnings;
if (@ARGV != 3) {
die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = @ARGV;
open (IN, $e_file) or die "Error!! Cannot open $e_file\n";
open (IN2, $k_file) or die "Error!! Cannot open $k_file\n";
my @e_file = <IN>; s/\s+\z// for @e_file;
my @k_file = <IN2>; s/\s+\z// for @k_file;
my (%known_interactions);
my %entrez_genes;
$entrez_genes{$_}++ foreach @e_file;
foreach my $line (@k_file) {
my @array = split (/,/, $line);
$known_interactions{$array[0]} = $array[1];
}
my $count = 0;
foreach my $key1 (keys %entrez_genes) {
foreach my $key2 (keys %entrez_genes) {
if ($key1 != $key2) {
if (exists $known_interactions{$key1} && ($known_interactions{$key1} == $key2)) {next;}
if (exists $known_interactions{$key2} && ($known_interactions{$key2} == $key1)) {next;}
if ($key1 < $key2) { print "$key1,$key2\n"; $count++; }
else { print "$key2,$key1\n"; $count++; }
}
if ($count == $interactions) {
die "$count\n";
}
}
}
I can see nothing wrong with your code. I wonder if you have some whitespace in your data - either after the comma or at the end of the line? It would be safer to extract just the digit fields with, for instance
my @e_file = map /\d+/g, <IN>;
Also, you would be better off keeping both elements of the pair as the hash key, so that you can just check the existence of the element. And if you make sure the lower number is always first you don't need to do two lookups.
This example should work for you. It doesn't address the random selection part of your requirement, but that wasn't in your own code and wasn't your immediate problem
use strict;
use warnings;
@ARGV = qw/ entrez_genes.txt known_interactions.txt 9 /;
if (@ARGV != 3) {
die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = @ARGV;
open my $fh, '<', $e_file or die "Error!! Cannot open $e_file: $!";
my @e_file = sort { $a <=> $b } map /\d+/g, <$fh>;
open $fh, '<', $k_file or die "Error!! Cannot open $k_file: $!";
my %known_interactions;
while (<$fh>) {
my $pair = join ',', sort { $a <=> $b } /\d+/g;
$known_interactions{$pair}++;
}
close $fh;
my $count = 0;
PAIR:
for my $i (0 .. $#e_file-1) {
for my $j ($i+1 .. $#e_file) {
my $pair = join ',', @e_file[$i, $j];
unless ($known_interactions{$pair}) {
print $pair, "\n";
last PAIR if ++$count >= $interactions;
}
}
}
print "\nTotal of $count interactions\n";
First of all, you are not chomping (removing newlines) from your file of known interactions. That means that given a file like:
1111,2222
you will build this hash:
$known_interactions{1111} = "2222\n";
That is probably why you are getting duplicate entries. My guess is (can't be sure without your actual input files) that these loops should work ok:
map{
chomp;
$entrez_genes{$_}++ ;
}@e_file;
and
map {
chomp;
my @array = sort(split (/,/));
$known_interactions{$array[0]} = $array[1];
}@k_file;
Also, as a general rule, I find my life is easier if I sort the interacting pair (the joys of bioinformatics :) ). That way I know that 111,222 and 222,111 will be treated in the same way and I can avoid multiple if statements like you have in your code.
Your next loop would then be (which IMHO is more readable):
my @genes = keys(%entrez_genes);
for (my $i=0; $i<=$#genes; $i++) {
for (my $k=$i; $k<=$#genes; $k++) {
next if $genes[$i] == $genes[$k];
my @pp = sort($genes[$i], $genes[$k]);
# skip the pair only when it is a known interaction
next if exists $known_interactions{$pp[0]} and $known_interactions{$pp[0]} == $pp[1];
print "$pp[0], $pp[1]\n";
$count++;
die "$count\n" if $count == $interactions;
}
}
Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line:
"one,,three,,42"
where (0-indexed) $position = 4 holds the primary key, 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file and check if the primary key is in the hash; if it is, I get the line from file1array, add the columns I need from the first array to the second array, and concatenate that to the end of the output string. Then I delete the hash entry, and at the very end dump the entire thing to a file. (I am using an SSD so I want to minimise file writes.)
It is probably best explained with code:
sub merge_file2{
my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my %line_for;
my @file1array;
open FILE1, "<$file1";
print "$file1 opened\n";
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
$file1array[$.] = $_; #store line in file1array.
}
close FILE1;
print "$file2 opened - merging..\n";
open FILE2, "<", $file2;
my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
while (<FILE2>){
print "$.\n" if ($.%1000) == 0;
chomp;
my @array2 = split /,/, $_; #split 2nd csv line by commas
my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
# ^ ^ ^
# prev line lookup line in 1st file,lookup hash, pos of key
#my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)
foreach(@from1to2){
$array2[$_] = $array1[$_];
}
my $outstring = join ",", @array2;
$OUTSTRING.=$outstring."\n";
delete $line_for{$array2[$position]};
}
close FILE2;
print "adding rest of lines\n";
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
print "writing file $out\n\n\n";
write_line($out,$OUTSTRING);
}
The first while is fine, takes less than 1 minute, however the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think it is possible for a lot of speedup? :) Thanks in advance.
Solution:
sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;
my (@file1,@file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
if ($.==1){
$header = $_;
next;
}
print "$.\n" if ($.%100000) == 0;
chomp;
push @file1, [split ',', $_];
}
close FILE1;
open FILE2, "<$file2" or die;
while (<FILE2>){
next if $.==1;
print "$.\n" if ($.%100000) == 0;
chomp;
push @file2, [split ',', $_];
}
close FILE2;
print "sorting files\n";
my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
print "sorted\n";
@file1 = ();   # free the unsorted copies
@file2 = ();
#foreach my $line (@file1){print "\t [ @$line ],\n"; }
my ($i,$j) = (0,0);
while ($i < $#sortedf1 and $j < $#sortedf2){
my $key1 = $sortedf1[$i][$position];
my $key2 = $sortedf2[$j][$position];
if ($key1 eq $key2){
foreach(0..$hsize){ #header size.
$sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
}
$i++;
$j++;
}
elsif ( $key1 < $key2){
push(@sortedf2,[@{$sortedf1[$i]}]);
$i++;
}
elsif ( $key1 > $key2){
$j++;
}
}
#foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }
print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(@sortedf2){
print OUT (join ",", @{$_})."\n";
}
close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join (see the sketch after the second option below).
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
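To illustrate that first option, a rough sketch using DBI with DBD::SQLite; the table and column names are invented for illustration (the real files have many more columns), and the per-row loading step is elided:
use strict;
use warnings;
use DBI;

# in-memory database for the sketch; use a file path for real data
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

# hypothetical two-column layout: primary key plus one data column per file
$dbh->do("CREATE TABLE f1 (pk TEXT PRIMARY KEY, col_a TEXT)");
$dbh->do("CREATE TABLE f2 (pk TEXT PRIMARY KEY, col_b TEXT)");

# ... load each CSV into its table here, one INSERT per row (step elided) ...

# emulate a full outer join with LEFT JOIN + UNION, since older SQLite
# versions have no FULL OUTER JOIN
my $rows = $dbh->selectall_arrayref(q{
    SELECT f1.pk, f1.col_a, f2.col_b FROM f1 LEFT JOIN f2 ON f2.pk = f1.pk
    UNION
    SELECT f2.pk, f1.col_a, f2.col_b FROM f2 LEFT JOIN f1 ON f1.pk = f2.pk
});

# write the joined rows back out as CSV, empty string for missing fields
print join(',', map { defined $_ ? $_ : '' } @$_), "\n" for @$rows;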
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;
my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;
open my $fh, '>', 'output.txt' or die $!;
if ($method eq 'concat'){ # Dog slow. Gets slower as #data gets big.
my $outstring;
for my $i (0 .. $#data){
print STDERR $i, "\n" if $i % 1000 == 0;
$outstring .= $data[$i];
}
print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
print $fh join('', @data);
}
else { # Fast
print $fh @data;
}
If you want to merge you should really merge. First of all you have to sort your data by key and then merge! You will beat even MySQL in performance; I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;
use constant KEYPOS => 4;
die "Insufficient number of parameters" if #ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
my $row;
if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) { # merge rows
$row = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
$row1 = $csv->getline($file1);
$row2 = $csv->getline($file2);
}
elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
$row = $row1;
$row1 = $csv->getline($file1);
}
else {
$row = $row2;
$row2 = $csv->getline($file2);
}
$csv->print( *STDOUT, $row );
}
# flush possible tail
while ( $row1 ) {
$csv->print( *STDOUT, $row1 );
$row1 = $csv->getline($file1);
}
while ( $row2 ) {
$csv->print( *STDOUT, $row2 );
$row2 = $csv->getline($file2);
}
close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.
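As a self-contained illustration of that advice (the loop body and file name here are placeholders, not from the original script):
use strict;
use warnings;

my @out_lines;                          # accumulate lines instead of one big string
for my $i (1 .. 1_000_000) {            # stand-in for the real merge loop
    push @out_lines, "merged row $i\n"; # push is cheap; repeated .= is not
}

open my $out_fh, '>', 'outfile.txt' or die "Cannot open outfile.txt: $!";
print {$out_fh} @out_lines;             # a single write of the whole array
close $out_fh;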
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming lines of around 20 bytes, each of your files would amount to about 20 MB, which isn't too big.
Since you are using hash your time complexity doesn't seem to be a problem.
In your second loop you are printing to the console for each line; this bit is slow. Try removing it, which should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help, but not too much I think; there is always going to be read-ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0; # for // ("defined-or")
use Carp;
use Text::CSV;
sub merge_csv {
my($path,$record) = @_;
open my $fh, "<", $path or croak "$0: open $path: $!";
my $csv = Text::CSV->new;
local $_;
while (<$fh>) {
if ($csv->parse($_)) {
my @f = map length($_) ? $_ : undef, $csv->fields;
next unless @f >= 1;
my $primary = pop @f;
if ($record->{$primary}) {
$record->{$primary}[$_] //= $f[$_]
for 0 .. $#{ $record->{$primary} };
}
else {
$record->{$primary} = \@f;
}
}
else {
warn "$0: $path:$.: parse failed; skipping...\n";
next;
}
}
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};
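If you then need to write the merged records back out, a short follow-up sketch might be (the output file name is made up, and the primary key is re-appended as the last field, mirroring how merge_csv popped it off):
use Text::CSV;

my $csv_out = Text::CSV->new({ eol => "\n" });
open my $out, '>', 'merged.csv' or die "$0: open merged.csv: $!";
for my $primary (sort keys %rec) {
    # turn unknown (undef) values back into empty fields, put the key at the end
    my @fields = map { defined $_ ? $_ : '' } @{ $rec{$primary} };
    $csv_out->print($out, [ @fields, $primary ]);
}
close $out;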