Comparing FILE1 value to FILE2 range and printing matches - perl

I'm very new to Perl and am working on a Bioinformatics project at University. I have FILE1 containing a list of positions, in the format:
99269
550
100
126477
1700
And FILE2 in the format:
517 1878 forward
700 2500 forward
2156 3289 forward
99000 100000 forward
22000 23000 backward
I want to compare every position in FILE1 to every range of values in FILE2, and if a position falls into one of the ranges, print the position, the range and the direction.
So my expected output would be:
99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward
Currently it runs with no errors, but it doesn't output any information, so I am unsure where I am going wrong! When I split up the final 'if' condition it runs, but it only works if the position is on exactly the same line as the range.
My code is as follows:
#!/usr/bin/perl
use strict;
use warnings;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt"
  or die "cannot open > CC22: $!";
open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt"
  or die "cannot open > CDS: $!";
open( OUTPUTFILE, ">$outputfile" ) or die "Could not open output file: $! \n";

while (<FILE1>) {
    if (/^(\d+)/) {
        my $CC22 = $1;
        while (<FILE2>) {
            if (/^(\d+)\s+(\d+)\s+(\S+)/) {
                my $CDS1 = $1;
                my $CDS2 = $2;
                my $CDS3 = $3;
                if ( $CC22 > $CDS1 && $CC22 < $CDS2 ) {
                    print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
                }
            }
        }
    }
}
close(FILE1);
close(FILE2);
close(FILE1);
close(FILE2);
I have posted the same question on Perlmonks.

Because you read FILE2 only once, it is compared with just the first line of FILE1.
Subsequent lines of FILE1 are compared with the already-exhausted file.
Stash the lines from FILE1 in an array, and then compare each line of FILE2 with each array entry, as shown below:
#!/usr/bin/perl
use strict;
use warnings;

my $outputfile = "out.txt";

open FILE1, "file1.txt"
  or die "cannot open > CC22: $!";
open FILE2, "file2.txt"
  or die "cannot open > CDS: $!";
open( OUTPUTFILE, ">$outputfile" ) or die "Could not open output file: $! \n";

my @file1list = ();
while (<FILE1>) {
    if (/^(\d+)/) {
        push @file1list, $1;
    }
}

while (<FILE2>) {
    if (/^(\d+)\s+(\d+)\s+(\S+)/) {
        my $CDS1 = $1;
        my $CDS2 = $2;
        my $CDS3 = $3;
        for my $CC22 (@file1list) {
            if ( $CC22 > $CDS1 && $CC22 < $CDS2 ) {
                print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
            }
        }
    }
}
(There are also stylistic issues with the program, like capital letters for variable names, but I've ignored these; it's quite a nice program for a beginner.)

I thought I could simplify some of that by using split instead of a regex, but I think my code is actually longer and more difficult to read! In any event, remember that split works great for problems like this:
# User config area
my $positions_file = 'input_positions.txt';
my $ranges_file    = 'input_ranges.txt';
my $output_file    = 'output_data.txt';

# Reading data
open my $positions_fh, "<", $positions_file or die "Cannot open $positions_file: $!";
open my $ranges_fh,    "<", $ranges_file    or die "Cannot open $ranges_file: $!";

chomp( my @positions = <$positions_fh> );

# Store the range data in an array containing hash tables,
# to be used like $range_data[0] = { start => $start, end => $end, dir => $dir }
my @range_data;
while (<$ranges_fh>) {
    chomp;
    my ( $start, $end, $dir ) = split;    # splits $_ according to whitespace
    push @range_data, { start => $start, end => $end, dir => $dir };
    #print "start: $start, end: $end, direction: $dir\n";
}    #/while

close $positions_fh;
close $ranges_fh;

# Data processing:
open my $output_fh, ">", $output_file or die "Cannot open $output_file: $!";

# It feels like it should be more efficient to process one range at a time for all data points
foreach my $range (@range_data) {    # start one range at a time
    # each $range = $range_data[#] = { hash table }
    foreach my $position (@positions) {    # check all positions
        if ( ( $range->{start} <= $position ) and ( $position <= $range->{end} ) ) {
            my $output_string = "$position " . $range->{start} . " " . $range->{end} . " " . $range->{dir} . "\n";
            print $output_fh $output_string;
        }    #/if
    }    #/foreach position
}    #/foreach range
close $output_fh;
This code would probably run faster if the data processing was done during the while loop that's reading the range data.
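For illustration, here is a minimal sketch of that restructuring (reusing the file names from the script above): each range line is checked against every position as soon as it is read, so @range_data is never built.
# User config area
my $positions_file = 'input_positions.txt';
my $ranges_file    = 'input_ranges.txt';
my $output_file    = 'output_data.txt';

open my $positions_fh, "<", $positions_file or die "Cannot open $positions_file: $!";
chomp( my @positions = <$positions_fh> );
close $positions_fh;

open my $ranges_fh, "<", $ranges_file or die "Cannot open $ranges_file: $!";
open my $output_fh, ">", $output_file or die "Cannot open $output_file: $!";

# Compare while reading: no intermediate array of hash tables
while (<$ranges_fh>) {
    chomp;
    my ( $start, $end, $dir ) = split;
    for my $position (@positions) {
        print $output_fh "$position $start $end $dir\n"
            if $start <= $position and $position <= $end;
    }
}

close $ranges_fh;
close $output_fh;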

Your bug was that you were nesting the reads of the two files, so your inner loop went through the second file's contents only a single time and was then stuck at eof.
The easiest solution is just to load the inner loop file entirely into memory first.
The following demonstrates using more Modern Perl techniques:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $cc22file = "/Users/edwardtickle/Documents/CC22positions.txt";
my $cdsfile  = "/Users/edwardtickle/Documents/CDSpositions.txt";
my $outfile  = "/Users/edwardtickle/Documents/CC22CDS.txt";

my @ranges = do {
    # open my $fh, '<', $cdsfile; # Using Fake Data instead below
    open my $fh, '<', \ "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n";
    map { [split] } <$fh>;
};

# open my $infh, '<', $cc22file; # Using Fake Data instead below
open my $infh, '<', \ "99269\n550\n100\n126477\n1700\n";

# open my $outfh, '>', $outfile; # Using STDOUT instead below
my $outfh = \*STDOUT;

CC22:
while (my $cc22 = <$infh>) {
    chomp $cc22;
    for my $cds (@ranges) {
        if ($cc22 > $cds->[0] && $cc22 < $cds->[1]) {
            print $outfh "$cc22 @$cds\n";
            next CC22;
        }
    }
    # warn "$cc22 No match found\n";
}
Outputs:
99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward

Related

Sort all records according to modified date and time

I have a problem with my code. I have 1 GB of records, which I have to sort according to date and time. The records look like:
TYP_journal article|KEY_1926000001|AED_|TIT_A Late Eighteenth-Century Purist|TPA_|GLO_Pronouncements of George Campbell and his contemporaries which time has set aside.|AUT_Bryan, W. F.|AUS_|AFF_|RES_|IED_|TOC_|FJN_Studies in Philology|ISN_0039-3738|ESN_|PLA_Chapel Hill, NC|URL_|DAT_1926|VOL_23|ISS_|EXT_358-370|CPP_|FSN_|ISN_|PLA_|SNO_|PUB_|IBZ_|PLA_|PYR_|PAG_|DAN_|DGI_|DGY_|OFP_|OFU_|FSS_|PDF_|LIB_|INO_|FAU_|INH_|IUR_|INU_|CDT_9/15/2003 3:12:28 PM|MDT_5/16/2017 9:18:40 AM|
I sort these records using MDT_5/16/2017 9:18:40 AM.
I used the technique below:
I filter the file on whether each record has MDT_ or not (creating two files, one with MDT_ and one without).
Code for the MDT_ data:
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i } <read_file>;
my $doc_MD = new IO::File(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
$doc_MD->binmode(':utf8');
print $doc_MD @Dt_ModifiedDate;
$doc_MD->close;
close (read_file);
Code for the un-MDT_ data:
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/ } <read_file>;
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
my $doc_UMD = new IO::File(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
$doc_UMD->binmode(':utf8');
print $doc_UMD @un_ModifiedDate;
$doc_UMD->close;
close (read_file);
From the file containing MDT_, I collect all dates and times, sort them, and remove duplicates.
@modi_date = map $_->[0],
             sort { uc($a->[1]) cmp uc($b->[1]) }
             map { [ $_, toISO8601($_) ] } @modi_date;
@modi_date = reverse (@modi_date);
@modi_date = uniq (@modi_date);
According to the sorted dates and times, I grep all records from the MDT_ file, and finally create the final file.
my $doc1 = new IO::File(">$current_ou/output/$file_name_with_out_ext.sorted_data");
$doc1->binmode(':utf8');
foreach my $changes (@modi_date)
{
    chomp($changes);
    $Count_pro++;
    @ab = grep (/$changes/, @all_data_with_time);
    print $doc1 ("@ab\n");
    $progress_bar->update($Count_pro);
}
$doc1->close;
But this process takes a lot of time. Is there any way to do it in less time?
As you pointed out, doing everything in memory is not an option on your machine. However, I do not see why you first sort the dates and then grep all records with each date, instead of sorting those records on the date directly.
I also suspect that if you were to go through the original file line by line, and not in one huge map-sort-map chain, you might save some memory, but I'll leave that up to you to try; it would save you creating the files and then re-parsing things.
I would suggest doing steps 2 and 3 in one go:
Skip building @modi_date (which happens somewhere not visible to us :/).
use DateTime::Format::Strptime;    # needed for the parser below

my $mdt_fn = 'with_mdt.txt'; # <- whatever name you gave that file?
open ( my $fh, '< :encoding(UTF-8)', $mdt_fn )
    or die "could not open file '$mdt_fn' to read: $!";

my $dt_parser = DateTime::Format::Strptime->new(
    pattern => '%m/%d/%Y %r',
);

# Get all records from the file. To ensure we only need to parse each line once,
# store the datetime alongside the record in a hashref.
my @records;
while ( my $line = <$fh> ){
    push @records, {
        dt     => _dt_from_record($line),
        record => $line,
    };
}

# If you wanted to compare strings rather than doing datetime comparison,
# adapt _dt_from_record and use 'cmp' instead of '<=>'
@records = sort { $a->{dt} <=> $b->{dt} } @records;

open ( my $out_fh, '> :encoding(UTF-8)', 'sorted.txt' )
    or die "could not open file to write to: $!";
# Or reverse first if you want latest to oldest
print $out_fh $_->{record}."\n" for @records;
close $out_fh;

# I prefer using DateTime for this.
# Using a parser will alert me if some date was set, but cannot be parsed.
# If you want to spare yourself some additional time,
# why not store the parsed date in the file. However, I doubt this takes long.
sub _dt_from_record {
    my $record = shift;
    $record =~ /MDT_([^\|]+)/;
    return $dt_parser->parse_datetime($1);
}
Finally, I have done it.
The complete code is:
use warnings;
use strict;
use 5.010;
use Cwd;
binmode STDOUT, ":utf8";
use Date::Simple ('date', 'today');
use Time::Simple;
use Encode;
use Time::Piece;
use Win32::Console::ANSI;
use Term::ANSIScreen qw/:color /;
use File::Copy;

BEGIN {
    our $start_run = time();
    my $Start = localtime;
    print colored ['bold green'], ("\nstart time :- $Start\n");
}

##variables
my $current_dir = getcwd();
my $current_in  = $ARGV[0];
my $current_ou  = $ARGV[1];
my @un_ext_file;
my @un_ext_file1;
my $current_data = today();
my $time   = Time::Simple->new();
my $hour   = $time->hours;
my $minute = $time->minutes;
my $second = $time->seconds;
my $current_time = "$hour" . "-" . "$minute" . "-" . "$second";
my $ren_folder   = "output_" . "$current_data" . "_" . "$current_time";

##check for output name DIR
opendir(DIR1, $current_ou);
my @current_ou_folder = readdir(DIR1);
closedir(DIR1);
foreach my $entry (@current_ou_folder)
{
    if ($entry eq "output")
    {
        move "$current_ou/output", "$current_ou/$ren_folder";
        mkdir "$current_ou/output";
    }
    else
    {
        mkdir "$current_ou/output";
    }
}

opendir(DIR, $current_in);
my @files_and_folder = readdir(DIR);
closedir(DIR);
foreach my $entry (@files_and_folder)
{
    next if $entry eq '.' or $entry eq '..';
    next if -d $entry;
    push(@un_ext_file1, $entry);
}

##### check for duplicate file names
my %seen;
my @file_test;
foreach my $file_name (@un_ext_file1)
{
    if ($file_name =~ /(.*)\.([a-z]+)$/)
    {
        push (@file_test, $1);
    }
    else
    {
        push (@file_test, $file_name);
    }
}
foreach my $string (@file_test)
{
    next unless $seen{$string}++;
    print "'$string' is duplicated.\n";
}

##process every file from the array
foreach my $file_name (@un_ext_file1)
{
    my $REC_counter = 0;
    if ($file_name =~ /(.*)\.([a-z]+)$/) ##### works for all extensions
    {
        my $file_name_with_out_ext = $1;
        my @modi_date_not_found;
        eval {
            ##### read the source file
            ##### first sort the file date-wise (old dates appear first, new dates last)
            ##### To get modifiedDate from the file
            open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
            my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i } <read_file>;
            my $doc_MD = new IO::File(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
            $doc_MD->binmode(':utf8');
            print $doc_MD @Dt_ModifiedDate;
            $doc_MD->close;
            close (read_file);
            @Dt_ModifiedDate = undef;    ##### free after use
            print colored ['bold green'], ("\n\tAll ModifiedDate data Filtered\n\n");

            ##### To get un-modifiedDate from the file
            open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
            my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/ } <read_file>;
            my $doc_UMD = new IO::File(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
            $doc_UMD->binmode(':utf8');
            print $doc_UMD @un_ModifiedDate;
            $doc_UMD->close;
            close (read_file);
            @un_ModifiedDate = undef;    ##### free after use
            print colored ['bold green'], ("\n\tAll unModifiedDate data Filtered\n\n\n\n");

            ##### Read ModifiedDate
            open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.ModifiedDate" || die "file found $!";
            my @all_ModifiedDate = <read_file_ModifiedDate>;
            close(read_file_ModifiedDate);

            ##### write the ModifiedDate data into the sorted_data file after sorting it.
            my $doc1 = new IO::File(">$current_ou/output/$file_name_with_out_ext.sorted_data");
            $doc1->binmode(':utf8');
            print $doc1 sort { (toISO8601($a)) cmp (toISO8601($b)) } @all_ModifiedDate;
            $doc1->close;

            ##### Read sorted_data, reverse it, then read the unModifiedDate data and write the final file.
            open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.sorted_data" || die "file found $!";
            my @all_sorted_data = <read_file_ModifiedDate>;
            close(read_file_ModifiedDate);
            @all_sorted_data = reverse (@all_sorted_data);

            open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.unModifiedDate" || die "file found $!";
            my @all_unModifiedDate = <read_file_ModifiedDate>;
            close(read_file_ModifiedDate);

            my $doc_final = new IO::File(">$current_ou/output/$1.txt");
            $doc_final->binmode(':utf8');
            print $doc_final @all_sorted_data;
            print $doc_final @all_unModifiedDate;
            $doc_final->close;

            unlink("$current_ou/output/$file_name_with_out_ext.ModifiedDate");
            unlink("$current_ou/output/$file_name_with_out_ext.sorted_data");
            unlink("$current_ou/output/$file_name_with_out_ext.unModifiedDate");
        }
    }
}

##### Process complete.
say "\n\n---------------------------------------------";
print colored ['bold green'], ("\tProcess Completed\n");
say "---------------------------------------------\n";
get_time();

sub toISO8601
{
    my $record = shift;
    $record =~ /MDT_([^\|]+)/;
    return (Time::Piece->strptime($1, '%m/%d/%Y %I:%M:%S %p')->datetime);
}

sub get_time
{
    my $end_run  = time();
    my $run_time = $end_run - our $start_run;
    #my $days = int($sec/(24*60*60));
    my $hours = ($run_time/(60*60))%24;
    my $mins  = ($run_time/60)%60;
    my $secs  = $run_time%60;
    print "\nJob took";
    print colored ['bold green'], (" $hours:$mins:$secs ");
    print "to complete this process\n";
    my $End = localtime;
    print colored ['bold green'], ("\nEnd time :- $End\n");
}
The whole process completes within 20 minutes.
I am especially very thankful to @bytepusher.

Perl: How to print a random section (word definition) from a dictionary file

I want to print a random new English word from a dictionary file in a Unix terminal using Perl. I want to select and print a random line and the 2 lines that follow it.
But my code doesn't accomplish this.
Please help me to improve it.
An example of the output I want:
#inspire: ....
ghk
lko...
Dictionary file:
#inspiration: mean....
abc def...
ghk lmn
...
#inspire: ....
ghk
lko...
#people: ...
...
The complete dictionary file is here: anhviet109K.txt. It's about 14 MB.
My code:
use strict;
use warnings;
use File::Copy qw(copy move);

my $files = 'anhviet109K.txt';
my $fh;
my $linewanted = 16 + int( rand( 513796 - 16 ) );
# 513796: number of lines of file dic.txt
open( $fh, "<", $files ) or die "cannot open < $fh: $!";
my $del   = " {2,}";
my $temp  = 0;
my $count = 0;
while ( my $line = <$fh> ) {
    if ( ( $line =~ "#" ) && ( $. > $linewanted ) ) {
        $count = 4;
    }
    else {
        next;
    }
    if ( $count > 0 ) {
        print $line;
        $count--;
    }
    else {
        last;
    }
}
close $fh;
Something like this, perhaps?
Your data has helped me to exclude the header entries in your dictionary file.
This program finds the locations of all of the entries (lines beginning with #) in the file, then chooses one at random and prints it.
Good luck learning English!
use strict;
use warnings 'all';
use Fcntl ':seek';

use constant FILE => 'anhviet109K.txt';

open my $fh, '<', FILE or die qq{Unable to open "@{[FILE]}" for input: $!};

my @seek;    # Locations of all the definitions
my $addr = tell $fh;
while ( <$fh> ) {
    push @seek, $addr if /^\#(?!00-)/;
    $addr = tell $fh;
}

my $choice = $seek[rand @seek];
seek $fh, $choice, SEEK_SET;
print scalar <$fh>;
while ( <$fh> ) {
    last if /^\#/;
    print;
}
output
#finesse /fi'nes/
* danh từ
- sự khéo léo, sự phân biệt tế nhị
- mưu mẹo, mánh khoé
* động từ
- dùng mưu đoạt (cái gì); dùng mưu đẩy (ai) làm gì; dùng mưu, dùng kế
=to finesse something away+ dùng mưu đoạt cái gì
A single pass approach:
use strict;
use warnings;
use autodie;
open my $fh, '<:utf8', 'anhviet109K.txt';
my $definition = '';
my $count;
my $select;
while (my $line = <$fh>) {
if ($line =~ /^#(?!00-)/) {
++$count;
$select = rand($count) < 1;
if ($select) {
$definition = $line;
}
}
elsif ($select) {
$definition .= $line;
}
}
# remove blank line that some entries have
$definition =~ s/^\s+\z//m;
binmode STDOUT, ':utf8';
print $definition;
This iterative random selection always selects the first item, then has a 1/2 chance of replacing it with the second item, a 1/3 chance for the third, and so on; after n items, every item has ended up as the current selection with probability exactly 1/n, so no second pass (and no line count) is needed.
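The same rule works on any list or stream. Here is a standalone sketch of just the selection step (pick_random is an illustrative helper, not part of the answer above):
use strict;
use warnings;

# Keep the k-th item with probability 1/k; after n items, every
# item has ended up as the final choice with probability 1/n.
sub pick_random {
    my $count = 0;
    my $choice;
    for my $item (@_) {
        $choice = $item if rand(++$count) < 1;
    }
    return $choice;
}

print pick_random('a' .. 'z'), "\n";    # one uniformly chosen letter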

Parsing out text from string

I have a tab-delimited file1:
20 50 80 110
520 590 700 770
410 440 20 50
300 340 410 440
I read the lines and put them into an array:
while (<INPUT>)
{
    chomp;
    push @inputarray, $_;
}
Now I'm looping through another file2:
20, 410, 700
80, 520
300
For each number on each line of file2, I want to search @inputarray for that number. If it exists, I want to grab the corresponding number that follows it. For instance, for the number 20, I want to grab the number 50. I assume that they are still separated by a tab in the string that exists as an array element in @inputarray.
while (my $line = <INPUT2>)
{
    chomp $line;
    my @linearray = split("\t", $line);
    foreach my $start (@linearray)
    {
        if (grep ($start, @inputarray))
        {
            # want to grab the corresponding number
        }
    }
}
Once grep finds it, I don't know how to grab that array element in order to find the position of the number and extract the corresponding number, perhaps using the substr function. How do I grab the array element that grep found?
A desired output would be:
line1:
20 50
410 440
700 770
line2:
80 110
520 590
line3:
300 340
IMHO, it would be best to store the numbers from file1 in a hash. Referring to the example content of file1 you provided above, you would have something like below:
{
    '20'  => '50',
    '80'  => '110',
    '520' => '590',
    '700' => '770',
    '410' => '440',
    '300' => '340',
}
A sample piece of code would look like this:
my %inputarray;
while (<INPUT>)
{
    my @numbers = split ' ', $_;    # split the line on whitespace
    # take the numbers in pairs: each even-indexed number maps to the one after it
    for ( my $i = 0; $i < $#numbers; $i += 2 )
    {
        $inputarray{ $numbers[$i] } = $numbers[ $i + 1 ];
    }
}
A demonstration of the above loop:
index:   0  1  2  3
numbers: 20 50 80 110
first iteration: $i = 0
$inputarray{$numbers[0]} = $numbers[1];
$i = 2; # $i += 2;
second iteration: $i = 2
$inputarray{$numbers[2]} = $numbers[3];
And then while parsing file2, you just need to treat the number as the key of %inputarray.
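A hedged sketch of what that file2 loop could look like, assuming %inputarray has been filled as above and INPUT2 is the filehandle for file2:
my $lineno = 1;
while ( my $line = <INPUT2> ) {
    chomp $line;
    print "line$lineno:\n";
    for my $start ( split /,\s*/, $line ) {
        # exists cleanly distinguishes "no such start number"
        print "$start $inputarray{$start}\n" if exists $inputarray{$start};
    }
    $lineno++;
}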
I believe this gets you close to what you want.
#!/usr/bin/perl -w
my %follows;

open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
    chomp;
    my $prev = undef;
    foreach my $curr ( split /\s+/ )
    {
        $follows{$prev} = $curr if ($prev);
        $prev = $curr;
    }
}
close $file1;

open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
    chomp;
    print "line $lineno\n";
    $lineno++;
    foreach my $val ( split /,\s+/, $_ )
    {
        print $val, " ", ($follows{$val} // "no match"), "\n";
    }
    print "\n";
}
If you only want to consider numbers from file1 in pairs, as opposed to seeing which numbers follow what other numbers without taking pair boundaries into account, then you need to change the logic in the first while loop slightly.
#!/usr/bin/perl -w
my %follows;

open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
    chomp;
    my $line = $_;
    while ( $line =~ s/(\S+)\s+(\S+)\s*// )
    {
        $follows{$1} = $2;
    }
}
close $file1;

open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
    chomp;
    print "line $lineno\n";
    $lineno++;
    foreach my $val ( split /,\s+/, $_ )
    {
        print $val, " ", ($follows{$val} // "no match"), "\n";
    }
    print "\n";
}
If you want to read the input once but check for numbers a lot, you might be better off splitting each input line into individual numbers, then adding each number as a key in a hash, with the following number as its value. That makes reading slower and takes more memory, but the second part, where you want to check for following numbers, will be a breeze thanks to exists and the nature of hashes.
If I understood your question correctly, you could use just one big hash. That is of course assuming that every number is always followed by the same number.

Perl - binary unpack using pointer and index

I have a binary file that contains 3 files: a PNG, a PHP and a TGA file.
Here is the file to give you the idea: container.bin
The file is built this way:
the first 6 bytes are a pointer to the index, in this case 211794
then you have all 3 files stacked one after the other
and at the offset 211794 you have the index, which tells you where each file starts and ends.
In this example you have:
[offset start] [offset end] [random data] [offset start] [name]
6 15149 asdf 6 Capture.PNG
15149 15168 4584 15149 index.php
15168 211794 12 15168 untilted.tga
meaning that Capture.PNG starts at offset 6 and finishes at offset 15149, then asdf is random data, and the start offset is repeated again.
Now what I want to do is write a Perl script to split this binary file into its component files.
The script needs to read the first 6 bytes of the file (the header), then jump to the index location and use the list to extract the files.
A mix of seek and read can be used to achieve the task:
#!/usr/bin/env perl
use strict;
use warnings;

use Fcntl 'SEEK_SET';

sub get_files_info {
    my ( $fh, $offset ) = @_;

    my %file;
    while (<$fh>) {
        chomp;
        my $split_count = my ( $offset_start, $offset_end, $random_data, $offset_start_copy,
            $file_name ) = split /\s/;
        next if $split_count != 5;

        if ( $offset_start != $offset_start_copy ) {
            warn "Start of offset mismatch: $file_name\n";
            next;
        }

        $file{$file_name} = {
            'offset_start' => $offset_start,
            'offset_end'   => $offset_end,
            'random_data'  => $random_data,
        };
    }

    return %file;
}

sub write_file {
    my ( $fh, $file_name, $file_info ) = @_;

    seek $fh, $file_info->{'offset_start'}, SEEK_SET;
    read $fh, my $contents,
        $file_info->{'offset_end'} - $file_info->{'offset_start'};

    open my $fh_out, '>', $file_name or die "Error opening file: $!";
    binmode $fh_out;
    print $fh_out $contents;

    print "Wrote file: $file_name\n";
}

open my $fh, '<', 'container.bin' or die "Error opening file: $!";
binmode $fh;

read $fh, my $offset, 6;
seek $fh, $offset, SEEK_SET;

my %file = get_files_info $fh, $offset;

for my $file_name ( keys %file ) {
    write_file $fh, $file_name, $file{$file_name};
}
The only real difficulty here is to make sure that both input and output files are read in binary mode. This can be achieved by using the :raw PerlIO layer when the files are opened.
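For example, the two open calls in the program above could specify the layer directly instead of calling binmode afterwards:
# read raw bytes: no CRLF or encoding translation on input
open my $fh, '<:raw', 'container.bin' or die "Error opening file: $!";

# write the extracted bytes untouched
open my $fh_out, '>:raw', $file_name or die "Error opening file: $!";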
This program seems to do what you want. It first locates and reads the index block into a string, and then opens that string for input and reads the start and end position and name of each of the constituent files. Thereafter processing each file is simple.
Be aware that unless the formatting of the index block is more strict than you say, you can rely only on the first, second, and last whitespace-separated fields on each line since random text could contain spaces. There is also no way to specify a file name containing spaces.
The output, using Data::Dump, is there to demonstrate correct functionality and is not necessary for the functioning of the program.
use v5.10;
use warnings;

use Fcntl ':seek';
use autodie qw/ open read seek close /;

open my $fh, '<:raw', 'container.bin';

read $fh, my $index_loc, 6;
seek $fh, $index_loc, SEEK_SET;
read $fh, my ($index), 1024;

my %contents;

open my $idx, '<', \$index;
while (<$idx>) {
    my @fields = split;
    next unless @fields;
    $contents{$fields[-1]} = [ $fields[0], $fields[1] ];
}

use Data::Dump;
dd \%contents;

for my $file (keys %contents) {
    my ($start, $end) = @{ $contents{$file} };
    my $size = $end - $start;
    seek $fh, $start, SEEK_SET;
    my $nbytes = read $fh, my ($data), $size;
    die "Premature EOF" unless $nbytes == $size;
    open my $out, '>:raw', $file;
    print { $out } $data;
    close $out;
}
output
{
    "Capture.PNG"  => [6, 15149],
    "index.php"    => [15149, 15168],
    "untilted.tga" => [15168, 211794],
}

Perl while loops and reading lines

Each record has 4 lines, like the following:
@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+
!)))))****(((***%%((((*(((+,**(((+**+,-
There are two files, where each record in one file corresponds to a record in the other, and there is an array of sequences, A1.
So I read one record at a time from file 1 and one record from file 2. If the sequence in the record from file 1 (line 2) matches a sequence in the array A1, I print the record from file 2 to an output file, and so on. The point is that I need to read one record at a time. How would I break out of the inner loop so that I can read the next record from file 1 and then compare it to the next record in file 2?
If you are asking about controlling nested loops, you can do that with labels.
Example:
OUTER:
while (<>) {
    for (@something) {
        last OUTER;
    }
}
See last for example.
In case only lines with the same line number could ever match, you don't really need more than one loop. You can call a reading operation (<>, read, sysread) wherever you want. It is only usually placed directly in the loop condition because it conveniently returns undef and ends the loop when the work is done.
while (defined(my $first_line = <FIRST>)) {
    my $second_line = <SECOND>;
    if ($first_line eq $second_line) {
        print "match\n";
    } else {
        print "no match\n";
    }
}
From your sentence "I need to check if the sequence matches any with the sequence from the second" I gather that you want to check whether any lines in the two files match?
If you need to read a file several times then you can use seek to rewind to the start of it without reopening it.
This program shows the idea.
use strict;
use warnings;

open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $out, '>', 'matches' or die $!;

while (my $line1 = <$fh1>) {
    seek $fh2, 0, 0;
    while (my $line2 = <$fh2>) {
        if ($line1 eq $line2) {
            print $out $line1;
            last;
        }
    }
}
Edit
Your comment has changed the problem. Both files have four-line records and you want to compare the second line in corresponding records across the two files.
use strict;
use warnings;
open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $match, '>', 'matches' or die $!;
open my $nomatch, '>', 'nomatch' or die $!;
while (1) {
my (#data1, #data2);
for (1 .. 4) {
my $line;
$line = <$fh1>;
push #data1, $line if defined $line;
$line = <$fh2>;
push #data2, $line if defined $line;
}
last unless #data1 == 4 and #data2 == 4;
if ($data1[1] eq $data2[1]) {
print $match #data2;
}
else {
print $nomatch #data2;
}
}
A full example:
#!/usr/bin/env perl
use strict;
use warnings;

open F1, "<", "/path/1" or die $!;
open F2, "<", "/path/2" or die $!;

my @a1 = <F1>;
my @a2 = <F2>;

for (0 .. $#a1) {
    if ($a1[$_] eq $a2[$_]) {
        print "MATCH line [$_]\n";
    } else {
        print "DOESN'T MATCH line [$_]\n";
    }
}