Getting GLOB values instead of file content - perl

I am trying to write a script that takes in one file, compares it to a second, and then outputs the difference. I had it working, but then decided I wanted to get rid of any line that starts with '#'. I had to use push, as .= was not working. Ever since then I get output like
keys = GLOB(0x23d2d48)
I'm not sure what I am missing.
#!/usr/bin/perl
use warnings;
use lib '/var/www/oooOOoO/lib/Perl';
my @a1;
my @a2;
my %diff1;
my %diff2;
my @diff1;
my @diff2;
my $input_file = "/etc/mysql/conf.d/replication-slave.cnf";
my $input_file2 = "tA.cnf";
open( my $input_fh, "<", $input_file ) || die "Can't open $input_file: $!";
open( my $input_fh2, "<", $input_file2 ) || die "Can't open $input_file: $!";
@a1 = ' ';
for ($input_fh) {
next if /^#/;
push @a1, $_;
}
@a2 = ' ';
for ($input_fh2) {
next if /^#/;
push @a2, $_;
}
@diff1{ @a1 } = @a1;
delete @diff1{ @a2 };
# %diff1 contains elements from '@a1' that are not in '@a2'
@k = (keys %diff1);
print "keys = @k\n";
I've tried changing keys to values, but that didn't work.
Thanks

The problem lies in this bit of code:
for ($input_fh) {
next if /^#/;
push @a1, $_;
}
This is creating a single-element list containing the filehandle, then pushing that filehandle onto @a1. To read from the filehandle you need to wrap it in <>:
while (<$input_fh>) {
next if /^#/;
push @a1, $_;
}
Note I've switched the for to a while: for (<$input_fh>) would impose list context and read the whole file at once, whereas while reads one line at a time. You can also remove:
@a1 = ' ';
@a2 = ' ';
Which just adds an extra element to both arrays.
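To see why the original loop prints a GLOB, here is a minimal sketch (the file name is hypothetical):
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', 'example.txt' or die "Can't open: $!"; # hypothetical file
my @lines;
for ($fh) { # a one-element list containing the filehandle itself
    push @lines, $_; # pushes the glob, not a line of text
}
print "@lines\n"; # prints something like GLOB(0x23d2d48)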

It should work, but your code's a little messy. I'm also not sure what you're trying to do when you assign @diff1{@a1} = @a1.
Try this re-write and let me know:
#!/usr/bin/perl
use strict;
use warnings;
use lib '/var/www/ooooOOooOoo/lib/Perl';
my $input_file = "/etc/mysql/conf.d/replication-slave.cnf";
my $input_file2 = "tA.cnf";
open my $input_fh, "<", $input_file or die "Can't open $input_file: $!";
open my $input_fh2, "<", $input_file2 or die "Can't open $input_file2: $!";
my (@a1, @a2);
while (<$input_fh>) {
chomp;
next if /^#/;
push @a1, $_;
}
while (<$input_fh2>) {
chomp;
next if /^#/;
push @a2, $_;
}
my %diff1;
@diff1{@a1} = @a1; # What are you actually trying to do here?
delete @diff1{@a2};
# %diff1 contains elements from '@a1' that are not in '@a2'
my @k = (keys %diff1);
print "keys = @k\n";
But you might want to try this approach instead:
my @nums1 = qw(1 2 3 4 5);
my @nums2 = qw(one two three four 5);
my (%compare1, %compare2);
foreach (@nums1) {
chomp;
$compare1{$_} = 1;
}
foreach (@nums2) {
chomp;
$compare2{$_} = 1;
}
foreach my $key (keys %compare1) {
print "$key\n" unless $compare2{$key};
}
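Note that this loop only reports items from @nums1 that are missing from @nums2; if you also want the items unique to @nums2, add the mirror loop:
foreach my $key (keys %compare2) {
print "$key\n" unless $compare1{$key};
}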

Once you have loaded the two arrays, it's better to use a CPAN module to do this kind of task. I think Array::Utils is a good candidate to achieve your goals. From the module documentation:
use Array::Utils qw(:all);
my @a = qw( a b c d );
my @b = qw( c d e f );
# symmetric difference
my @diff = array_diff(@a, @b);
# intersection
my @isect = intersect(@a, @b);
# unique union
my @unique = unique(@a, @b);
# check if arrays contain same members
if ( !array_diff(@a, @b) ) {
# do something
}
# get items from array @a that are not in array @b
my @minus = array_minus( @a, @b );
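Applied to the question's two config files, it might look like this sketch (slurp_lines is a helper I've made up here, not part of Array::Utils):
#!/usr/bin/perl
use strict;
use warnings;
use Array::Utils qw(array_minus);

sub slurp_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    chomp(my @lines = <$fh>);
    return grep { !/^#/ } @lines; # skip comment lines, as in the question
}

my @a1 = slurp_lines('/etc/mysql/conf.d/replication-slave.cnf');
my @a2 = slurp_lines('tA.cnf');
print "$_\n" for array_minus(@a1, @a2); # lines in the first file only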

Related

How do I properly find double entries in two files in Perl?

Let's say I have two files with lists of IP addresses. Lines in the first file are unique. Lines in the second may or may not be the same as in the first one.
What I need is to compare two files, and remove possible doubles from the second file in order to merge it with the base file later.
I've managed to write the following code and it seems to work properly, but I have a solid feeling that this code can be improved or I may be totally missing some important concept.
Are there any ways to solve the task without using complex data structures, i.e. hashrefs?
#!/usr/bin/perl
use strict;
use warnings;
my $base = shift @ARGV;
my $input = shift @ARGV;
my $res = 'result.txt';
open ("BASE","<","$base");
open ("INP","<","$input");
open ("RES", ">", "$res");
my $rhash = {}; # result hash
while (my $line = <BASE>) {chomp($line); $rhash->{$line}{'res'} = 1;} # create uniq table
while (my $line = <INP>) { chomp($line); $rhash->{$line}{'res'}++; $rhash->{$line}{'new'} = 1; } # create compare table marking it's entries as new and incrementing double keys
close BASE;
close INP;
for my $line (sort keys %$rhash) {
next if $line =~ /\#/; # removing comments
printf("%-30s%3s%1s", $line, $rhash->{$line}{'res'}, "\n") if $rhash->{$line}{'res'} > 1; # diagnostic output of doubles
if (($rhash->{$line}{'new'}) and ($rhash->{$line}{'res'} < 2)) {
print RES "$line\n"; # printing new uniq entries to result file
}
}
close RES;
If I understand correctly, file1 and file2 each contain IPs (unique within each file), and you want the IPs in file2 that are not in file1. If so, then maybe the following code achieves your goal.
Although it seems your code will do it, this might be clearer.
#!/usr/bin/perl
use strict;
use warnings;
my $base = shift @ARGV;
my $input = shift @ARGV;
my $res = 'result.txt';
open ("BASE","<","$base") or die $!;
open ("INP","<","$input") or die $!;
open ("RES", ">", "$res") or die $!;
my %seen;
while (my $line = <BASE>) {
chomp $line;
$seen{$line}++;
}
close BASE or die $!;
while (my $line = <INP>) {
chomp $line;
print RES "$line\n" unless $seen{$line}; # only in file2 not in file1
}
close INP or die $!;
close RES or die $!;
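As a side note, the same logic with lexical filehandles and autodie (a stylistic variant, not a behavioural change) would be a sketch like:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my ($base, $input) = @ARGV;
open my $base_fh, '<', $base;
open my $input_fh, '<', $input;
open my $res_fh, '>', 'result.txt';

my %seen;
while (my $line = <$base_fh>) { chomp $line; $seen{$line}++ }
while (my $line = <$input_fh>) {
    chomp $line;
    print {$res_fh} "$line\n" unless $seen{$line}; # only in file2, not in file1
}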

How to randomly pair items in a list

I have a list of Accession numbers that I want to pair randomly using a Perl script below:
#!/usr/bin/perl -w
use List::Util qw(shuffle);
my $file = 'randomseq_acc.txt';
my @identifiers = map { (split /\n/)[1] } <$file>;
chomp @identifiers;
#Shuffle them and put in a hash
@identifiers = shuffle @identifiers;
my %pairs = (@identifiers);
#print the pairs
for (keys %pairs) {
print "$_ and $pairs{$_} are partners\n";
but keep getting errors.
The accession numbers in the file randomseq_acc.txt are:
1094711
1586007
2XFX_C
Q27031.2
P22497.2
Q9TVU5.1
Q4N4N8.1
P28547.2
P15711.1
AAC46910.1
AAA98602.1
AAA98601.1
AAA98600.1
EAN33235.2
EAN34465.1
EAN34464.1
EAN34463.1
EAN34462.1
EAN34461.1
EAN34460.1
I needed to add the closing right curly brace to be able to compile the script.
As arrays are indexed from 0, (split /\n/)[1] returns the second field, i.e. whatever follows the newline on each line (i.e. nothing). Change it to [0] to make it work:
my @identifiers = map { (split /\n/)[0] } <$file>; # Still wrong.
The diamond operator needs a file handle, not a file name. Use open to associate the two:
open my $FH, '<', $file or die $!;
my @identifiers = map { (split /\n/)[0] } <$FH>;
Using split to remove a newline is not common. I'd probably use something else:
map { /(.*)/ } <$FH>
# or
map { chomp; $_ } <$FH>
# or, thanks to ikegami
chomp(my @identifiers = <$FH>);
So, the final result would be something like the following:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw(shuffle);
my $filename = '...';
open my $FH, '<', $filename or die $!;
chomp(my @identifiers = <$FH>);
my %pairs = shuffle(@identifiers);
print "$_ and $pairs{$_} are partners\n" for keys %pairs;

Selecting records from a file based on keys from a second file

My first file looks like:
CHR id position
1 rs58108140 10583
1 rs189107123 10611
1 rs180734498 13302
1 rs144762171 13327
1 chr1:13957:D 13957
And my second file looks like:
CHR SNP POS RiskAl OTHER_ALLELE RAF logOR Pval
10 rs1999138 110140096 T C 0.449034245446375 0.0924443 1.09e-06
6 rs7741604 20839503 C A 0.138318264238111 0.127947 1.1e-06
8 rs1486006 82553172 G C 0.833130882716561 0.147456 1.12727730194884e-06
My script reads in the first file and stores it in an array, and then I would like to find rsIDs from column 2 of the first file that are in column 2 in the second file. I think I am having a problem with how I'm matching the expressions. Here is my script:
#! perl -w
use strict;
use warnings;
my $F = shift @ARGV;
my @snps;
open IN, "$F";
while (<IN>) {
next if m/CHR/;
my @L = split;
push @snps, [$L[0], $L[1], $L[2]] if $L[0] !~ m/[XY]/;
}
close IN;
open IN, "DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt";
while (<IN>) {
my @L = split;
next if m/CHR/;
foreach (@snps) {
next if ($L[0] != ${$_}[0]);
# if not on same chromosome
if ($L[0] = ${$_}[0]) {
# if on same chromosome
if ($L[1] =~ ${$_}[1]) {
print "$L[0] $L[1] ${$_}[2]\n";
last;
}
}
}
}
Your code doesn't seem to correspond to your description. You are comparing both the first and second columns of the file rather than just the second.
The main problems are:
You use $L[0] = ${$_}[0] to compare the first columns. This will do an assignment instead of a comparison. You should use $L[0] == ${$_}[0] instead or, better, $L[0] == $_->[0]
You use $L[1] =~ ${$_}[1] to compare the second columns. This will check whether ${$_}[1] is a substring of $L[1]. You could use anchors like $L[1] =~ /^${$_}[1]$/ but it's much better to just do a string comparison as $L[1] eq $_->[1]
The easiest way is to read the second file first so as to build a list of values that you want included from the first file. I have written it so that it does what your code looks like it's supposed to do, i.e. match the first two columns.
That would look like this
use strict;
use warnings;
use autodie;
my ($f1, $f2) = @ARGV;
my %include;
open my $fh2, '<', $f2;
while (<$fh2>) {
my @fields = split;
my $key = join '|', @fields[0,1];
++$include{$key};
}
close $fh2;
open my $fh1, '<', $f1;
while (<$fh1>) {
my @fields = split;
my $key = join '|', @fields[0,1];
print "@fields[0,1,2]\n" if $include{$key};
}
close $fh1;
output
Unfortunately your choice of sample data doesn't include any records in the first file that have matching keys in the second, so there is no output!
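(Purely as a hypothetical illustration: if the first file also contained a line beginning 10 rs1999138 110140096, its key 10|rs1999138 would be found in %include and that line's first three columns would be printed.)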
Update
This is a corrected version of your own program. It should work, but it is far more efficient and concise to use hashes, as above.
use strict;
use warnings;
use autodie;
my ($filename) = @ARGV;
my @snps;
open my $in_fh, '<', $filename;
<$in_fh>; # Discard header line
while (<$in_fh>) {
my @fields = split;
push @snps, \@fields unless $fields[0] =~ /[XY]/;
}
close $in_fh;
open $in_fh, '<', 'DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt';
<$in_fh>; # Discard header line
while (<$in_fh>) {
my @fields = split;
for my $snp (@snps) {
next unless $fields[0] == $snp->[0] and $fields[1] eq $snp->[1];
print "$fields[0] $fields[1] $snp->[2]\n";
last;
}
}
close $in_fh;

Perl Merge file

I have 3 or multiple files I need to merge, the data looks like this..
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4,
0334.89765
0334.89765
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried the following, but the data in the fourth file does not come out in sorted order:
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";
foreach $file (@ARGV)
{
print "$file\n";
open (IN, "$file") || die "cannot open file $!";
open (OUT,">>$outFile") || die "cannot open file $!";
while ( <IN> )
{
chomp $_;
($timestamp,$data) = split (/\./,$_);
$hash{$timeStamp}{'data'}=$data;
if (defined $hash{$timeStamp})
{
print "$_\n";
print OUT"$_\n";
}
}
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but the standard Unix utilities should be able to handle this just fine:
cat the 3 files together.
use sort to sort the merged file (sort alone will actually do both steps, e.g. sort file1 file2 file3 > file4).
However, using Perl, you could just do the following:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
push @data, $_ while (<>);
# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient both in memory and speed if you're able to take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next line in order on each pass of the while loop:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;
# Initialize File handles
my @fhs = map {open my $fh, '<', $_; $fh} @ARGV;
# First line of each file
my @data = map {scalar <$_>} @fhs;
# Loop while a next line exists
while (@data) {
# Pull out the next entry
my $index = (sort {$data[$a] cmp $data[$b]} (0..$#data))[0];
print $data[$index];
# Fill in next data at index
if (! defined($data[$index] = readline $fhs[$index])) {
# End of that file
splice @fhs, $index, 1;
splice @data, $index, 1;
}
}
Using Miller's idea in a more reusable way,
use strict;
use warnings;
sub get_sort_iterator {
my @fhs = map {open my $fh, '<', $_ or die $!; $fh} @_;
my @d;
return sub {
for my $i (0 .. $#fhs) {
# skip to next file handle if it doesn't exists or we have value in $d[$i]
next if !$fhs[$i] or defined $d[$i];
# reading from $fhs[$i] file handle was success?
if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
# file handle at EOF, not needed any more
else { undef $fhs[$i] }
}
# compare as numbers, return undef if no more data
my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0..$#d
or return;
# return value from $d[$index], and set it to undef
return delete $d[$index];
};
}
my $iter = get_sort_iterator(@ARGV);
while (defined(my $x = $iter->())) {
print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Supposing every input file is already in ascending order and has at least one line, this script merges them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'reduce';
sub min_index {
reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}
my @fhs = map { open my $fh, '<', $_; $fh } @ARGV;
my @data = map { scalar <$_> } @fhs;
while (@data) {
my $idx = min_index(@data);
print "$data[$idx]";
if (! defined($data[$idx] = readline $fhs[$idx])) {
splice @data, $idx, 1;
splice @fhs, $idx, 1;
}
}
Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes, each hash corresponding to an input file and containing a file handle (fh), the last line read (line), and the timestamp extracted from that line (timestamp).
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeatedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;
my @data;
for my $file (@ARGV) {
my $item;
open $item->{fh}, '<', $file;
insert_item($item, \@data);
}
while (@data) {
my $item = pop @data;
print $item->{line};
insert_item($item, \@data);
}
sub insert_item {
my ($item, $array) = @_;
return if eof $item->{fh};
$item->{line} = readline $item->{fh};
($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
my $i = 0;
++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
splice @$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764

Argument is not numeric error while comparing hash values based on keys

#!/usr/bin/perl
use strict;
use Data::Dumper;
use warnings;
my @mdsum;
open (IN1,"$ARGV[0]") || die "couldn't open";
open (MYFILE, '>>md5sum-problem.txt');
open (IN2, "mdsumfile.txt");
my %knomexl=();
my %knomemdsum = ();
my @arrfile ;
my $tempkey ;
my $tempval ;
my @values ;
my $val;
my $i;
my @newarra;
my $testxl ;
my $testmdsum;
while(<IN1>){
next if /barcode/;
@arrfile = split('\t', $_);
$knomexl{$arrfile[0]} = $arrfile[2];
}
while(<IN2>){
chomp $_;
@newarra = split(/ {1,}/, $_);
$tempval = $newarra[0];
$tempkey = $newarra[1];
$tempkey=~ s/\t*$//g;
$tempval=~ s/\s*$//g;
$tempkey=~s/.tar.gz//g;
$knomemdsum{$tempkey} = $tempval;
}
@values = keys %knomexl;
foreach $i(@values){
$testxl = $knomexl{$values[$i]};
print $testxl."\n";
$testmdsum = $knomemdsum{$values[$i]};
print $testmdsum."\n";
if ( $testxl ne $testmdsum ) {
if ($testxl ne ""){
print MYFILE "Files hasving md5sum issue $i\n";
}
}
}
close (MYFILE);
I have two files, both containing file names and md5sum values, and I need to check which files' md5sum values do not match. I understand there are cases where a key will have no corresponding value in the other file, and I want those cases only. Any workaround on this code, please? The code is pretty simple, but I don't know why it's not working!
@values = keys %knomexl;
foreach $i(@values){
#print Dumper $knomexl{$values[$i]};
$testxl = $knomexl{$i};
print $testxl."\n";
$testmdsum = $knomemdsum{$i};
print $testmdsum."\n";
$i is an element of @values because of the foreach, not an index, so you shouldn't use $values[$i].
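A sketch of the corrected loop, reusing the OP's variable names:
foreach my $i (@values) {
    my $testxl = $knomexl{$i}; # $i is the key itself
    my $testmdsum = $knomemdsum{$i};
    if ( !defined $testmdsum or $testxl ne $testmdsum ) {
        print MYFILE "Files having md5sum issue $i\n";
    }
}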