Getting member size from zip using Archive::Zip::MemberRead - perl

I am trying to read each member file size from a zip without actually extracting. I iterate through all member names, then use Archive::Zip::MemberRead to get a file handle for each member, against which I was hoping to be able to use the stat method to get the size. However, stat on a file handle from a zip file element returns an empty array so I can't get my file size. Here is my code:
my $zip = Archive::Zip->new($zipFilePath);
my @mbrs = $zip->memberNames();
foreach my $mbrName(@mbrs)
{
my $fh = Archive::Zip::MemberRead->new($zip, $mbrName);
my @fileStats = stat($fh);
my $size = $fileStats[7];
print "\n".$mbrName." -- ".$size;
}
However, the output I get does not display any file size:
dir/fileName1.txt --
dir/fileName2.txt --
The question is how to retrieve member file sizes without actually extracting them.

Why not just use the Archive::Zip module itself? This seems to work for me:
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
my $filename = "somezipfile.zip";
# Read in the ZIP file
my $zip = Archive::Zip->new();
unless ($zip->read($filename) == AZ_OK) {
die "Read error\n";
}
# Loop through the members, printing their name,
# compressed size, and uncompressed size.
my @members = $zip->members();
foreach (@members)
{
print " - " . $_->fileName() . ": " . $_->compressedSize() .
" (" . $_->uncompressedSize() . ")\n";
}

Here is one way, though it only works if you have 7-Zip installed:
#!/usr/bin/env perl
use warnings;
use strict;
## List files from zip file provided as first argument to the script, the format
## is like:
# Date Time Attr Size Compressed Name
#------------------- ----- ------------ ------------ ------------------------
#2012-10-19 16:56:38 ..... 139 112 1.txt
#2012-10-19 16:56:56 ..... 126 105 2.txt
#2012-10-19 16:57:24 ..... 71 53 3.txt
#2012-10-03 14:39:54 ..... 155 74 A.txt
#2012-09-29 17:53:44 ..... 139 70 AA.txt
#2011-12-08 10:41:16 ..... 30 30 AAAB.txt
#2011-12-08 10:41:16 ..... 18 18 AAAC.txt
# ...
for ( map { chomp; $_ } qx/7z l $ARGV[0]/ ) {
# Omit headers and footers with this flip-flop.
if ( my $l = ( m/^(?:-+\s+){2,}/ ... m/^(?:-+\s+){2,}/ ) ) {
## Don't match flip-flop boundaries.
next if $l == 1 || $l =~ m/E0$/;
## Extract file name and its size.
my @f = split ' ';
printf qq|%s -- %d bytes\n|, $f[5], $f[3];
}
}
I run it like:
perl script.pl files.zip
That yields the following in my test (with some output suppressed):
1.txt -- 139 bytes
2.txt -- 126 bytes
3.txt -- 71 bytes
A.txt -- 155 bytes
AA.txt -- 139 bytes
AAAB.txt -- 30 bytes
AAAC.txt -- 18 bytes
B.txt -- 40 bytes
BB.txt -- 131 bytes
C.txt -- 4 bytes
CC.txt -- 184 bytes
File1.txt -- 177 bytes
File2.txt -- 250 bytes
aaa.txt -- 30 bytes
...

Related

Perl Net::FTP::Recursive: Can't download folder but files of subfolders

I can successfully download files from a subfolder of the FTP server, but downloading the folders from the level above does not work.
Here is the folder structure:
folder rwx r-x r-x
subfolder1 rwx r-x r-x
file1 rw- r-- r--
file2 rw- r-- r--
subfolder2 rxx r-x r-x
file3 rw- r-- r--
file4 rw- r-- r--
If I use this:
$f1->cwd("/folder/subfolder1");
$f1->rget();
$f1->quit;
the files file1 and file2 will be downloaded.
If I use this:
$f1->cwd("/folder");
$f1->rget();
$f1->quit;
nothing is downloaded and the program ends with a timeout. I expected it to download subfolder1 and subfolder2 along with their contents. Is there an explanation for this, and how can I download both the subfolders and their files?
A detailed description of the code is here
UPDATE 1: Debugging
Debugging with
my $f1 = Net::FTP::Recursive->new($host1, Debug => 1) or die "Can't open $host1\n";
gives the following:
Net::FTP::Recursive=GLOB(0x312bf50)>>> CWD /folder
Net::FTP::Recursive=GLOB(0x312bf50)<<< 250 CWD command successful
Net::FTP::Recursive=GLOB(0x312bf50)>>> PWD
Net::FTP::Recursive=GLOB(0x312bf50)<<< 257 "/folder" is the current directory
Net::FTP::Recursive=GLOB(0x312bf50)>>> PASV
Net::FTP::Recursive=GLOB(0x312bf50)<<< 227 Entering Passive Mode (188,40,220,103,255,187).
Net::FTP::Recursive=GLOB(0x312bf50)>>> LIST
Net::FTP::Recursive=GLOB(0x312bf50)<<< 150 Opening BINARY mode data connection for file list
Timeout at C:/Strawberry/perl/lib/Net/FTP.pm line 1107.
UPDATE 2: Timeout at C:/Strawberry/perl/lib/Net/FTP.pm line 1107.
_list_cmd is the function containing the line mentioned in the debug output. I have also added the lines where _list_cmd is used, and wrapped long lines to make them more readable while preserving the line numbers.
671 # Try to delete the contents
672 # Get a list of all the files in the directory, excluding
# the current and parent directories
673 my @filelist = map { /^(?:\S+;)+ (.+)$/ ? ($1) : () }
grep { !/^(?:\S+;)*type=[cp]dir;/i } $ftp->_list_cmd("MLSD", $dir);
925 sub ls { shift->_list_cmd("NLST", @_); }
925 sub dir { shift->_list_cmd("LIST", @_); }
1087 sub _list_cmd {
1088 my $ftp = shift;
1089 my $cmd = uc shift;
1090
1091 delete ${*$ftp}{'net_ftp_port'};
1092 delete ${*$ftp}{'net_ftp_pasv'};
1093
1094 my $data = $ftp->_data_cmd($cmd, @_);
1095
1096 return
1097 unless (defined $data);
1098
1099 require Net::FTP::A;
1100 bless $data, "Net::FTP::A"; # Force ASCII mode
1101
1102 my $databuf = '';
1103 my $buf = '';
1104 my $blksize = ${*$ftp}{'net_ftp_blksize'};
1105
1106 while ($data->read($databuf, $blksize)) {
1107 $buf .= $databuf;
1108 }
1109
1110 my $list = [split(/\n/, $buf)];
1111
1112 $data->close();
1114 if (EBCDIC) {
1115 for (@$list) { $_ = $ftp->toebcdic($_) }
1116 }
1117
1118 wantarray
1119 ? @{$list}
1120 : $list;
1121 }
To download the directories and files I tried the following workaround: looping over all subdirectories and applying rget to each one. It does the job I want. Nevertheless, why rget does not work at the upper level is still unanswered. At least it is now clear that it is not a permission problem.
# catfile is exported by File::Spec::Functions
use File::Spec::Functions qw(catfile);
# ftp-server directory
my $ftpdir = "folder";
# Define local download folder
my $download = "C:/local";
chdir($download);
# Change to remote directory
$f1->cwd($ftpdir) or die "Can't cwd to $ftpdir\n", $f1->message;
# list all entries of the top level
my @ftp_subdir = $f1->ls;
# remove . and ..
@ftp_subdir = grep { !/^\.+$/ } @ftp_subdir;
foreach my $sd (@ftp_subdir) {
    # Make folder on local computer
    my $localdir = catfile($download, $sd);
    mkdir $localdir;
    # Change local working directory
    chdir $localdir;
    # Change to remote sub directory to be downloaded
    $f1->cwd($sd) or die "Can't cwd to $sd\n";
    # download files
    $f1->rget();
    # Change to upper level
    $f1->cwd("..");
}
$f1->quit;

Perl: how to compare array to hash and print out results

I'm quite new to Perl, so I'm sorry if this is somewhat rudimentary.
I'm working with a Perl script that is working as a wrapper for some Python, text formatting, etc. and I'm struggling to get my desired output.
The script takes a folder, for this example, the folder contains 6 text files (test1.txt through test6.txt). The script then extracts some information from the files, runs a series of command line programs and then outputs a tab-delimited result. However, that result contains only those results that made it through the rest of the processing by the script, i.e. the result.
Here are some snippets of what I have so far:
use strict;
use warnings;
## create array to capture all of the file names from the folder
opendir(DIR, $folder) or die "couldn't open $folder: $!\n";
my @filenames = grep { /\.txt$/ } readdir DIR;
closedir DIR;
#here I run some subroutines, the last one looks like this
my $results = `blastn -query $shortname.fasta -db DB/$db -outfmt "6 qseqid sseqid score evalue" -max_target_seqs 1`;
#now I would like to compare what is in the @filenames array with $results
Example of tab delimited result - stored in $results:
test1.txt 200 1:1-20 79 80
test3.txt 800 1:1-200 900 80
test5.txt 900 1:1-700 100 2000
test6.txt 600 1:1-1000 200 70
I would like the final output to include all of the files that were run through the script, so I think I need a way to compare two arrays or perhaps compare an array to a hash?
Example of the desired output:
test1.txt 200 1:1-20 79 80
test2.txt 0 No result
test3.txt 800 1:1-200 900 80
test4.txt 0 No result
test5.txt 900 1:1-700 100 2000
test6.txt 600 1:1-1000 200 70
Update
Ok, so I got this to work with suggestions by @terdon by reading the file into a hash and then comparing. I was then trying to figure out how to do this without writing to a file and then reading the file back in, but I still can't seem to get the syntax correct. Here's what I have; however, it seems like I'm not able to match the array to the hash, meaning the hash must not be correct:
#!/usr/bin/env perl
use strict;
use warnings;
#create variable to mimic blast results
my $blast_results = "file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984";
#create array to mimic filename array
my @filenames = ("file1.ab1", "file2.ab1", "file3.ab1");
#header for file
my $header = "Query\tSeq_length\tTarget found\tScore (Bits)\tExpect(E-value)\tAlign-length\tIdentities\tPositives\tChr\tStart\tEnd\n";
#initialize hash
my %hash;
#split blast results into array
my @row = split(/\s+/, $blast_results);
$hash{$row[0]}=$_;
print $header;
foreach my $file (@filenames){
## If this filename has an associated entry in the hash, print it
if(defined($hash{$file})){
print "$row[0]\t$row[9]\t$row[1]:$row[7]-$row[8]\t$row[2]\t$row[3]\t$row[4]\t$row[5]\t$row[6]\t$row[1]\t$row[7]\t$row[8]\n";
}
## If not, print this.
else{
print "$file\t0\tNo Blast Results: Sequencing Rxn Failed\n";
}
}
print "-----------------------------------\n";
print "$blast_results\n"; #test what results look like
print "-----------------------------------\n";
print "$row[0]\t$row[1]\n"; #test if array is getting split correctly
print "-----------------------------------\n";
print "$filenames[2]\n"; #test if other array present
The result from this script is (the @filenames array is not matching the hash):
Query Seq_length Target found Score (Bits) Expect(E-value) Align-length Identities Positives Chr Start End
file1.ab1 0 No Blast Results: Sequencing Rxn Failed
file2.ab1 0 No Blast Results: Sequencing Rxn Failed
file3.ab1 0 No Blast Results: Sequencing Rxn Failed
-----------------------------------
file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984
-----------------------------------
file1.ab1 9
-----------------------------------
file3.ab1
I'm not entirely sure what you need here but the equivalent of awk's A[$1]=$0 is done using hashes in Perl. Something like:
my %hash;
## Open the output file
open(my $fh, "<", "text_file") or die "Cannot open text_file: $!";
while(<$fh>){
## remove newlines
chomp;
## split the line
my @A=split(/\s+/);
## Save this in a hash whose keys are the 1st fields and whose
## values are the associated lines.
$hash{$A[0]}=$_;
}
close($fh);
## Now, compare the file to @filenames
foreach my $file (@filenames){
## Print the file name
print "$file\t";
## If this filename has an associated entry in the hash, print it
if(defined($hash{$file})){
print "$hash{$file}\n";
}
## If not, print this.
else{
print "0\tNo result\n";
}
}
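If the goal is to skip the temporary file, the same hash can be built straight from the captured string. In the attempt from the question, the whole string is split only once and `$_` is never set, so every value stored in the hash is undef and the `defined` test fails for each file. Below is a minimal sketch, assuming one result per line with the file name as the first whitespace-separated field. `merge_results` is a hypothetical helper name, and the sample data is made up to mirror the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in for the captured blastn output (one hit per line).
my $blast_results = join "\n",
    "test1.txt\t200\t1:1-20\t79\t80",
    "test3.txt\t800\t1:1-200\t900\t80";

my @filenames = ("test1.txt", "test2.txt", "test3.txt");

# Return one output line per file name: the matching result line if the
# file has a hit, otherwise a "No result" placeholder.
sub merge_results {
    my ($results, @names) = @_;
    my %hash;
    for my $line (split /\n/, $results) {
        next unless $line =~ /\S/;        # skip blank lines
        my ($key) = split ' ', $line;     # first field is the file name
        $hash{$key} = $line;              # the last hit per name is kept
    }
    return map { exists $hash{$_} ? $hash{$_} : "$_\t0\tNo result" } @names;
}

print "$_\n" for merge_results($blast_results, @filenames);
```

With this version the last hit wins if a name appears twice; push the lines onto an array per key instead if you need all of them.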

some help on the following perl script

I need help merging (column-binding) several files.
I have several ASCII files, each defining one variable, which I have converted to a single-column array.
I have such columnised data for many variables, so I need to perform a column bind, as R does, and write the result to one single file.
I can do the same in R, but there are too many files. Being able to do it with one single script would save a lot of time.
I am using the following code. I am new to Perl and need help with it.
@filenames = ("file1.txt","file2.txt");
open F2, ">file_combined.txt" or die;
for($j = 0; $j< scalar @filenames;$j++){
open F1, $filenames[$j] or die;
for($i=1;$i<=6;$i++){$line=<F1>;}
while($line=<F1>){
chomp $line;
@spl = split '\s+', $line;
for($i=0;$i<scalar @spl;$i++){
print F2 "$spl[$i]\n";
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
close F1;
}
The input files here are ASCII text files of a raster. They look like this:
32 12 34 21 32 21 22 23
12 21 32 43 21 32 21 12
The above mentioned code without the paste syntax converts these files into a single column
32
12
34
21
32
21
22
23
12
21
32
43
21
32
21
12
The output should look like this
12 21 32
32 23 23
32 21 32
12 34 12
43 32 32
32 23 23
32 34 21
21 32 23
Each column represents a different ascii file.
I need to bind around 15 such ASCII files into one dataframe. I can do the same in R, but it takes a lot of time because there are many files and regions of interest, and the files are also fairly large.
Let's step through what you have...
# files you want to open for reading..
@filenames = ("file1.txt","file2.txt");
# I would use the 3 arg lexical scoped open
# I think you want to open this for 'append' as well
# open($fh, ">>", "file_combined.txt") or die "cannot open";
open F2, ">file_combined.txt" or die;
# @filenames is best thought as a 'list'
# for my $file (@filenames) {
for($j = 0; $j< scalar @filenames;$j++){
# see above example of 'open'
# - $filenames[$j] + $file
open F1, $filenames[$j] or die;
# what are you trying to do here? You're overriding
# $line in the next 'while loop'
for($i=1;$i<=6;$i++){$line=<F1>;}
# while(<$fh1>) {
while($line=<F1>){
chomp $line;
# @spl is short for split?
# give '@spl' list a meaningful name
@spl = split '\s+', $line;
# again, @spl is a list...
# for my $word (@spl) {
for($i=0;$i<scalar @spl;$i++){
# this whole block is a bit confusing.
# 'F2' is 'file_combined.txt'. Then you try and merge
# ( and overwrite the file) with the paste afterwards...
print F2 "$spl[$i]\n";
# is this a 'system call'?
# Missing 'backticks' or 'system'
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
# close $fh1
close F1;
}
# I'm assuming there's a 'close F2' somewhere here..
It looks like you're trying to do this:
@filenames = ("file1.txt","file2.txt");
$outfile = "combined_text.txt";
`paste $filenames[0] $filenames[1] > $outfile`;
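If you would rather not shell out to paste, the column bind can be done in Perl itself. Here is a rough sketch, assuming each input file has 6 header lines to skip (as the loop in the question suggests) and that all the values fit in memory. `read_column` and `column_bind` are names made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Flatten one raster file into a single column of values.
# $skip is the number of header lines to discard.
sub read_column {
    my ($path, $skip) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    for (1 .. $skip) { my $header = <$fh>; }
    my @values;
    while (my $line = <$fh>) {
        push @values, split ' ', $line;
    }
    close $fh;
    return \@values;
}

# Join several columns side by side, one tab-separated row per index.
# Shorter columns are padded with empty strings.
sub column_bind {
    my @columns = @_;
    my $rows = 0;
    for my $col (@columns) {
        $rows = @$col if @$col > $rows;
    }
    my @lines;
    for my $i (0 .. $rows - 1) {
        push @lines, join "\t", map { defined $_->[$i] ? $_->[$i] : '' } @columns;
    }
    return @lines;
}

# Usage: perl cbind.pl file1.txt file2.txt ... > file_combined.txt
if (@ARGV) {
    my @columns = map { read_column($_, 6) } @ARGV;
    print "$_\n" for column_bind(@columns);
}
```

Each file becomes one column in the order given on the command line, so 15 files need no change to the script.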

Loading Big files into Hashes in Perl (BLAST tables)

I'm a perl beginner, please help me out with my query... I'm trying to extract information from a blast table (a snippet of what it looks like is below):
It's a standard blast table input... I basically want to extract any information on a list of reads (Look at my second script below , to get an idea of what I want to do).... Anyhow this is precisely what I've done in the second script:
INPUTS:
1) the blast table:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) The reads list:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3) the output I want:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
I want a subset of the information in the first list, given a list of read names I want to extract (that is found in the 4th column)
Instead of hashing the reads list (only?) I want to hash the blast table itself, and use the information in Column 4 (of the blast table)as the keys to extract the values of each key, even when that key may have more than one value(i.e: each read name might actually have more than one hit , or associated blast result in the table), keeping in mind, that the value includes the WHOLE row with that key(readname) in it.
My greplist.pl script does this, but is very very slow, I think , ( and correct me if i'm wrong) that by loading the whole table in a hash, that this should speed things up tremendously ...
Thank you for your help.
My scripts:
The Broken one (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "bingo!";
my $output =$hash{$readName};
print OUT "$output\n";
}
else
{
print "it aint workin\n";
#print %hash;
}
}
}
close (LIST);
The Slow and quick cheat (that works) and is very slow (my blast tables can be about 400MB to 2GB large, I'm sure you can see why it's so slow)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
if ($line=~/^(.*)$/)
{
$read = $1;
print "$read\n";
system("grep \"$read\" $db >$read\_.out\n");
}
#system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
I don't know how to define that 4th column as the key : maybe I could use the split function, but I've tried to find a way that does this for a table of more than 2 columns to no avail... Please help!
If there is an easy way out of this please let me know
Thanks !
I'd do the opposite i.e read the readslist file into a hash then walk thru the big blast file and print the desired lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;
open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
chomp;
# retrieve the key (4th column)
my ($key) = (split/\s+/)[3];
# print the line if the key exists in the hash
say $_ if exists $readslist{$key};
}
close $fh_blast;
I suggest you build an index to turn your blast file temporarily into an indexed-sequential file. Read through it and build a hash of the addresses within the file where every record for each key starts.
After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails reading the big file only once. This example code demonstrates.
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;
my %index;
open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
my $place = tell $blast;
my $line = <$blast>;
my $key = (split ' ', $line, 5)[3];
push @{$index{$key}}, $place;
}
open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
next unless my ($key) = /(\S+)/;
next unless my $places = $index{$key};
foreach my $place (@$places) {
seek $blast, $place, SEEK_SET;
my $line = <$blast>;
print $line;
}
}
Voila, 2 ways of doing this, one with nothing to do with perl :
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time.
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo6.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo6.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
#chomp;
if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
{
#print "$1\n";
$key= $5;#read
$val= $1;#whole row; notice the brackets around the whole match.
$hash{$key} .= "$val\n"; # the match excludes the newline, so end every stored row with one
}
else {
print "something wrong with format";
}
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "$readName\n";
my $output =$hash{$readName};
print OUT "$output";
}
else
{
#print "it aint workin\n";
}
}
}
close (LIST);
close (OUT);
The oneliner is faster, and probably better than my script, I'm sure some people can find easier ways to do it... I just thought I'd put this up since it does what I want.
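For reference, the 22-group regex is not needed to pick out the fourth column: split with a field limit does the same job, and a hash of arrays keeps every hit for a read. This is only a sketch, assuming the table is tab-delimited (as the regex above implies) and fits in memory. `index_by_read` is a hypothetical name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Group table rows by the read name in column 4 (index 3).
# Takes an open filehandle; returns a hash ref of
# read name => array ref of complete rows.
sub index_by_read {
    my ($fh) = @_;
    my %rows_for;
    while (my $line = <$fh>) {
        chomp $line;
        my $key = (split /\t/, $line, 5)[3];
        next unless defined $key;           # skip short or blank lines
        push @{ $rows_for{$key} }, $line;   # array is created on the first hit
    }
    return \%rows_for;
}

# Usage: perl by_read.pl blast_table reads_list > subset.txt
if (@ARGV == 2) {
    my ($table, $list) = @ARGV;
    open my $tfh, '<', $table or die "Cannot open $table: $!";
    my $rows_for = index_by_read($tfh);
    close $tfh;

    open my $lfh, '<', $list or die "Cannot open $list: $!";
    while (my $read = <$lfh>) {
        chomp $read;
        next unless exists $rows_for->{$read};
        print "$_\n" for @{ $rows_for->{$read} };
    }
    close $lfh;
}
```

For a 2GB table, hashing the much smaller reads list and streaming the table, as in the first answer, will probably use far less memory.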

perl + compare numbers (NUM1 and NUM2) between two files

I need to compare checksums (NUM1 and NUM2) between file1 and file2 (see the example below).
The first field in file1 or file2 is the file path.
The second field in file1 or file2 is the first checksum.
The third field in file1 or file2 is the second checksum.
The goal is to read the first field (the file path) from file1 and verify whether this path exists in file2.
If the file path exists in file2, the checksum numbers in file1 and file2 need to be compared.
If the checksums are equal, write the file path and checksum numbers to equal.txt;
otherwise, write the file path and checksum numbers to not_equal.txt.
Note: if a file path from file1 does not exist in file2, write the file path to not_exist.txt.
I need to do this for every file path in file1 until EOF.
Question: does anyone have a smart Perl script for this?
File1
NUM1 NUM2
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cpqarray.ko 1317610 32
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cryptoloop.ko 320619 9
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/DAC960.ko 20639107 6
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/floppy.ko 9547813 71
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/loop.ko 2083034 23
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/nbd.ko 6470230 18
/data/libc-2.5.so 55861 1574
/bin/libcap.so.1.10 03221 12
/var/libcidn-2.5.so 31744 188
/etc/libcom_err.so.2.1 40247 8
.
.
.
File2
NUM1 NUM2
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cpqarray.ko 541761 232
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/cryptoloop.ko 224619 9
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/DAC960.ko 06391 73
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/floppy.ko 54081 71
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/loop.ko 08307 23
/lib/modules/2.6.18-128.el5PAE/kernel/drivers/block/nbd.ko 470275 58
.
.
.
.
.
For each file, create a hashtable where the key is the filename and the value is the checksum.
Iterate through the filenames from the first file (foreach $file (keys %hash_from_file1)) and check if that filename exists in the hash from the second file. If it does, check that the values of the two hashtables are the same ($hash_from_file1{$file} eq $hash_from_file2{$file}). If those match, then write the file and its hash value to equal.txt. If not, write the file and hash value to not_equal.txt.
Is it possible for there to be an entry in the second file that wouldn't exist in the first file?
mobrule's solution is correct.
This is the code:
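use strict;
use warnings;

# Matches "path NUM1 NUM2"; header and filler lines do not match and are skipped.
my $record = qr/^(.*?)\s+(\d+)\s+(\d+)\s*$/;

# Load file2 into a hash: path => "NUM1_NUM2"
my $file2_hash = {};
open my $in2, '<', 'file2' or die "Cannot open file2: $!";
while (my $line = <$in2>) {
    next unless $line =~ $record;
    $file2_hash->{$1} = "${2}_$3";
}
close $in2;

open my $in1, '<', 'file1' or die "Cannot open file1: $!";
open my $equal, '>', 'equal.txt' or die "Cannot open equal.txt: $!";
open my $not_equal, '>', 'not_equal.txt' or die "Cannot open not_equal.txt: $!";
open my $not_exist, '>', 'not_exist.txt' or die "Cannot open not_exist.txt: $!";
# Reading line by line keeps the loop going past lines that do not match.
while (my $line = <$in1>) {
    next unless $line =~ $record;
    my $output_str = "$1\t$2\t$3\n";
    if (not exists $file2_hash->{$1}) {
        print {$not_exist} $output_str;
    } elsif ($file2_hash->{$1} ne "${2}_$3") {
        print {$not_equal} $output_str;
    } else {
        print {$equal} $output_str;
    }
}
close $_ for $in1, $equal, $not_equal, $not_exist;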