I need to compare checksums (NUM1 and NUM2) between file1 and file2 (see the example below).
The first field in file1 and file2 is the file path.
The second field is the first checksum.
The third field is the second checksum.
The goal is to read each file path (the first field) from file1 and verify whether that path exists in file2.
If the file path exists in file2, compare the checksum numbers between file1 and file2.
If the checksums are equal, write the file path and checksum numbers to equal.txt;
if the checksums are not equal, write the file path and checksum numbers to not_equal.txt.
(If a file path from file1 does not exist in file2, write the file path to not_exist.txt.)
I need to do this for every file path in file1 until EOF.
Question: Does someone have a smart Perl script for this?
File1
NUM1 NUM2
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cpqarray.ko 1317610 32
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cryptoloop.ko 320619 9
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/DAC960.ko 20639107 6
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/floppy.ko 9547813 71
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/loop.ko 2083034 23
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/nbd.ko 6470230 18
/data/libc-2.5.so 55861 1574
/bin/libcap.so.1.10 03221 12
/var/libcidn-2.5.so 31744 188
/etc/libcom_err.so.2.1 40247 8
.
.
.
File2
NUM1 NUM2
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cpqarray.ko 541761 232
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cryptoloop.ko 224619 9
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/DAC960.ko 06391 73
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/floppy.ko 54081 71
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/loop.ko 08307 23
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/nbd.ko 470275 58
.
.
.
.
.
For each file, create a hash table where the key is the filename and the value is the checksum pair.
Iterate through the filenames from the first file (foreach my $file (keys %hash_from_file1)) and check whether that filename exists in the hash from the second file. If it does, check whether the values of the two hash tables are the same ($hash_from_file1{$file} eq $hash_from_file2{$file}). If they match, write the file and its checksums to equal.txt; if not, write them to not_equal.txt.
Is it possible for there to be an entry in the second file that wouldn't exist in the first file?
mobrule's solution is correct.
This is the code:
use strict;
use warnings;

# Build a lookup of path => "chksum1_chksum2" from file2.
my %file2_hash;
open my $fin, '<', 'file2' or die "Can't open file2: $!";
while (my $line = <$fin>) {
    next unless $line =~ /^(\S+)\s+(\d+)\s+(\d+)\s*$/;    # skip the header line
    $file2_hash{$1} = "$2_$3";
}
close $fin;

# Walk file1 and classify each path into one of the three output files.
open $fin, '<', 'file1' or die "Can't open file1: $!";
open my $equal,     '>', 'equal.txt'     or die "Can't open equal.txt: $!";
open my $not_equal, '>', 'not_equal.txt' or die "Can't open not_equal.txt: $!";
open my $not_exist, '>', 'not_exist.txt' or die "Can't open not_exist.txt: $!";
while (my $line = <$fin>) {
    next unless $line =~ /^(\S+)\s+(\d+)\s+(\d+)\s*$/;    # skip the header line
    my $output_str = "$1\t$2\t$3\n";
    if (not exists $file2_hash{$1}) {
        print $not_exist $output_str;
    } elsif ($file2_hash{$1} ne "$2_$3") {
        print $not_equal $output_str;
    } else {
        print $equal $output_str;
    }
}
close $_ for $fin, $equal, $not_equal, $not_exist;
I have different files in the format below.
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to output the header that has the greater number of columns, provided the columns of the shorter header appear in it in the same order.
Expected output for scenario 1:
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected results:
the message "Both file headers and positions are different"
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
close ARGV;
next if @lines < 2;
@lines = sort { length $a <=> length $b } @lines;
if (0 == index "$lines[1],", "$lines[0],") {
print $lines[1];
} else {
print "Both file headers and positions are different";
}' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code for the next input line; the rest of the code only runs once both header lines have been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)
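For reference, here is the same logic written out as a standalone script. This is a sketch that assumes the two file names are passed on the command line:
use strict;
use warnings;

# Read only the header (first) line of each file named on the command line.
my @headers;
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    chomp(my $header = <$fh>);
    push @headers, $header;
    close $fh;
}

# Put the shorter header first.
@headers = sort { length $a <=> length $b } @headers;

# The shorter header plus a trailing comma must be a prefix of the longer one.
if (0 == index "$headers[1],", "$headers[0],") {
    print "$headers[1]\n";
} else {
    print "Both file headers and positions are different\n";
}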
I am trying to prepare two input files based on information in a third file. File 1 is for sample1 and File 2 is for sample2. Both these files have lines with tab delimited columns. The first column contains unique identifier and the second column contains information.
File 1
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
...and so on. Similarly, File 2 contains
>ENG012 ggggggggggggg
>ENG098 ksksksksksks
>ENG234 wewewewewew
I have a File 3 that contains two columns each corresponding to the identifier from File 1 and File 2
>ENT01 >ENG78
>ENT02 >ENG098
>ENT02 >ENG012
>ENT02 >ENG234
>ENT03 >ENG012
and so on. I want to prepare input files from File 1 and File 2 by following the order in file 3. If an entry is repeated in file 3 (e.g. ENT02), I want to repeat the information for that entry. The expected output is
For File 1:
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
And for file 2
>ENG78 some info
>ENG098 ksksksksks
>ENG012 gggggggg
>ENG234 wewewewewew
>ENG012 gggggggg
All the entries in file 1 and file 2 are unique, but not in file 3. Also, there are some entries in file 3, in either column, that are not present in file 1 or file 2. My current logic is to find the intersection of the identifiers in column 1 of files 1 and 2 with the respective columns of file 3, store this as a list, and use this list to compare against File 1 and File 2 separately to generate the output for each. I am working with the following lines:
awk 'FNR==NR{a[$1]=$0;next};{print a[$1]}' file1 intersectlist
grep -v -x -f idsnotfoundinfile1 file3
I am not able to get the right output; I think at some point the data is getting sorted and only unique values are printed. Can someone please help me work this out?
You need to read and remember the first 2 files into some data structure, and then for the third file, output to 2 new files:
$ awk -F'\t' -v OFS='\t' '
FNR == 1 {file_num++}
file_num == 1 || file_num == 2 {data[file_num,$1] = $2; next}
function value(str) {
return str ? str : "some info"
}
{
for (i=1; i<=2; i++) {
print $i, value(data[i,$i]) > (ARGV[i] ".new")
}
}
' file1 file2 file3
$ cat file1.new
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
$ cat file2.new
>ENG78 some info
>ENG098 ksksksksksks
>ENG012 ggggggggggggg
>ENG234 wewewewewew
>ENG012 ggggggggggggg
Files 1 and 2 first need to be read so that you can look up their lines by the identifiers from file 3. Since the identifiers in these files are unique, you can build a hash for each file, with the identifiers as keys.
Then process file 3 line by line, where for each identifier on the line retrieve its value from the hash for the appropriate file and write the corresponding lines to new files 1 and 2.
use warnings;
use strict;
use feature 'say';
use Path::Tiny;
my ($file1, $file2, $file3) = qw(File1.txt File2.txt File3.txt);
my ($fileout1, $fileout2) = map { $_ . 'new' } ($file1, $file2);
my %file1 = map { split } path($file1)->lines;
my %file2 = map { split } path($file2)->lines;
my ($ofh1, $ofh2) = map { path($_)->openw } ($fileout1, $fileout2);
open my $fh, '<', $file3 or die "Can't open $file3: $!";
while (<$fh>) {
my ($f1, $f2) = split;
say $ofh1 "$f1\t", $file1{$f1} // 'some info'; #/ see text
say $ofh2 "$f2\t", $file2{$f2} // 'some info';
}
close $_ for $ofh1, $ofh2, $fh;
This produces the correct output based on fragments of input files that are provided.
I use Path::Tiny here for its conciseness. Its lines method returns all lines of a file, and in map's block each line is split on whitespace by default. The list of such pairs returned by map is assigned to a hash, whereby each pair of successive strings forms a key-value pair.
Multiple files can be opened in one statement, and Path::Tiny again makes it clean with openw. Its methods throw an exception (die) on errors, so we get error checking as well.
If an identifier in File 3 is not found in File 1/2 I bluntly use 'some info' as stated in the question,† but I expect that there is a more rounded solution for such a case. Then the laconic // should be changed to accommodate extra processing (or call a sub in place of 'some info' string).
It is assumed that files 1 and 2 always have two entries on a line.
Some shortcuts are taken, like reading each file into a hash in one line. Please expand the code as appropriate, with whatever checks may be needed.
† In such a case $file1{$f1} is undef, so the // (defined-or) operator returns its right-hand-side argument. A "proper" way is to test if (exists $file1{$f1}), but // works as well here.
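For instance, a minimal sketch of that change, where missing_info is a hypothetical sub standing in for whatever extra processing is wanted:
# Explicit exists test instead of //; missing_info() is hypothetical:
say $ofh1 "$f1\t", exists $file1{$f1} ? $file1{$f1} : missing_info($f1);
say $ofh2 "$f2\t", exists $file2{$f2} ? $file2{$f2} : missing_info($f2);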
I have 2 separate files, A and B, containing the same header lines but 2 and 1 data columns respectively. I want to take the inverse log2 (that is, 2^x) of the 2nd column in file A and the 1st column in file B, but keep the other description lines intact. I have something like this; the values $1 and $2 in file A are separated by a tab.
file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831
file B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831
I expect output like this. File A:
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2
and for file B (in this example the expected values are just copied from file A):
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr2
This awk script does the calculation that you want:
awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file
This matches lines that contain only digits, periods, and whitespace, replacing the last field $NF with 2 raised to the power of $NF. The format specifier %.12f can be modified to give the required number of decimal places. The 1 at the end is awk shorthand for {print}.
Testing it out on your new files:
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686
So here's the Perl version:
use strict;
use warnings;

open my $in, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    if (/^(\d+)[\t ]+(-?\d\.\d*)/) {    # format "nn m.mmmmm"
        my $power = 2 ** $2;
        print "$1\t$power\n";
    } elsif (/^(-?\d\.\d*)/) {          # format "m.mmmmm"
        my $power = 2 ** $1;
        print "$power\n";
    } else {                            # echo all other lines
        print "$_\n";
    }
}
close $in;
If you run <file>.pl <datafile> (replace with appropriate names), it will convert one file so that each numeric line carries 2**<value>. It simply echoes the lines that do not match the number pattern.
This is a modified version of @ThomasKilian's little script.
Thanks to him for providing the framework.
use strict;
use warnings;

open my $in, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    if (/^(\d+)[\t ]+(-?\d\.\d*)/) {    # format "nn m.mmmmm"
        my $power = sprintf "%.12f", 2 ** $2;
        print "$1\t$power\n";
    } elsif (/^(-?\d\.\d*)/) {          # format "m.mmmmm"
        my $power = sprintf "%.12f", 2 ** $1;
        print "$power\n";
    } else {                            # echo all other lines
        print "$_\n";
    }
}
close $in;
I need help with finding a string in a text.
From the text below, I need to find the first occurrence of /infile after the text ${SALE}. Once I have found that /infile, I need the contents of the following /fields section.
From the example below the output should be
all 1 char 178,
zip 170 char 5***
The output will be the text between /fields and the next /.
Solutions in shell, perl, awk would be appreciated.
script starts here
${CHKERR}
echo ${SALE}
badchar ${SALE} - | upshift - - | ssort '
/stat
/padbyte " "
/infile 0 open stlf
***/fields
all 1 char 178,
zip 170 char 5***
/joinkey zip
/derived country " "
/infile /data/retprep/rethold/statezip stlf
/fields
zipkey 1 char 5,
state 6 char 2
/derived x 1
Not the smallest one, but it works:
awk 'f3 && NF {print $0;getline;print $0;f3=0} /\${SALE}/ {f1=1} f1 && /\/infile/ {f2=1} f2 && /\*\*\*\/fields/ {f3=1}' file
all 1 char 178,
zip 170 char 5***
This solution works by reading through the file looking for each pattern in a list one by one. Once the last has been found the lines from the file are printed until a blank line is found.
The program expects the path to the input file as a parameter on the command line.
use strict;
use warnings;

# Patterns to find, in order: ${SALE}, then /infile, then /fields.
my @matches = ( qr<\$\{SALE\}>, qr<\Q/infile>, qr<\Q/fields> );

for my $match (@matches) {
    while (<>) {
        last if $_ =~ $match;
    }
}

# Print the lines that follow, up to the first blank line.
print while ($_ = <>) =~ /\S/;
output
all 1 char 178,
zip 170 char 5***
Try doing this:
perl -0777 -ne 'print $1 if m!\$\{SALE\}.*?/infile.*?/fields[ \t]*\n(.*?)^\s*/!ms' file
The -0777 switch slurps the whole file into one string, so the match can span lines: it looks for ${SALE}, then the first /infile after it, then the following /fields, and captures everything up to the next line that starts with /.
I have a few very large files which are basically a concatenation of several small files and I need to split them into their constituent files. I also need to name the files the same as the original files.
For example the files QMAX123 and QMAX124 have been concatenated to:
;QMAX123 - Student
... file content ...
;QMAX124 - Course
... file content ...
I need to recreate the file QMAX123 as
;QMAX123 - Student
... file content ...
And QMAX124 as
;QMAX124 - Course
... file content ...
The original file's header ;QMAX<some number> is unique and only appears as a header in the file.
I used the script below to split the content of the files, but I haven't been able to adapt it to get the file names right.
awk '/^;QMAX/{close("file"f);f++}{print $0 > "file"f}' <filename>
So I can either adapt that script to name the file correctly or I can rename the split files created using the script above based on the content of the file, whichever is easier.
I'm currently using cygwin bash (which has perl and awk) if that has any bearing on your answer.
The following Perl should do the trick
use warnings ;
use strict ;

my $F ;    # will hold a filehandle

while (<>) {
    if ( / ^ ; (\S+) /x ) {
        my $filename = $1 ;
        open $F, '>' , $filename or die "can't open $filename: $!" ;
        print $F $_ or warn "can't write" ;    # keep the header line in the new file
    } else {
        next unless defined $F ;    # skip any input before the first header
        print $F $_ or warn "can't write" ;
    }
}
Note that it discards any input before a line with a filename (the next unless defined $F). You may care to generate an error or add a default file, as sketched below; let me know and I can change it.
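A minimal sketch of the default-file idea; the file name preamble.txt is an assumption, not part of the original answer:
# Open a catch-all file up front (instead of the bare `my $F ;`) so that
# any lines before the first header are kept rather than discarded
# ('preamble.txt' is a hypothetical name):
open my $F, '>', 'preamble.txt' or die "can't open preamble.txt: $!" ;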
With Awk, it's as simple as
awk '/^;QMAX/ {filename = substr($1,2)} {print >> filename}' input_file
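Note that with very many constituent files some awks can run out of open file descriptors, and >> will append to files left over from a previous run. A variant along the same lines (a sketch) that truncates each output file and closes the previous one when a new header appears:
awk '/^;QMAX/ {close(filename); filename = substr($1,2)} filename {print > filename}' input_file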