I am writing a comparefiles subroutine in Perl that reads a line of text from one file (f1) and then searches for it in another (f2) in the normal O(n^2) way.
sub comparefiles {
    my ($f1, $f2) = @_;
    while (<f1>) {
        # reset f2 to the beginning of the file
        while (<f2>) {
        }
    }
}
sub someother {
    open (one, "<one.out");
    open (two, "<two.out");
    &comparefiles(&one, &two);
}
I have two questions:
How do I pass the file handles to the subroutine? In the above code, I have used them as scalars. Is that the correct way?
How do I reset the file pointer f2 to the beginning of the file at the position marked in the comment above?
First of all, always start your script with:
use strict;
use warnings;
Use lexical filehandles, the three-argument form of open, and test the result:
open my $fh1 , '<' , $filename1 or die "can't open '$filename1' for reading : $!";
Then you can pass the filehandles to the sub:
comparefiles($fh1, $fh2);
To rewind the file, use the seek function (perldoc -f seek):
seek $fh, 0, 0;
If the files are small enough to fit in memory, you might consider storing the lines in a hash, which would prevent the need for O(n^2) searching.
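Here is a rough sketch of that idea (not from the answer above; it assumes $fh1 and $fh2 are lexical filehandles opened as shown earlier, and that lines match exactly, newline included):

my %seen;
while ( my $line = <$fh2> ) {
    $seen{$line} = 1;              # remember every line of the second file
}
while ( my $line = <$fh1> ) {
    print $line if $seen{$line};   # constant-time hash lookup instead of rescanning f2
}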
Within the framework of your existing approach, I would advise against nesting your file reading loops -- perhaps on aesthetic grounds if nothing else. Instead, put the inner loop in a subroutine.
use strict;
use warnings;
# Works for 2 or more files.
analyze_files(@ARGV);

sub analyze_files {
    my @file_names = @_;
    my @handles = map { open my $h, '<', $_; $h } @_;
    my $fh = shift @handles;
    while (my $line = <$fh>) {
        my @line_numbers = map { find_in_file($_, $line) } @handles;
        print join("\t", @line_numbers, $line);
    }
}

# Takes a file handle and a line to hunt for.
# Returns line number if the line is found.
sub find_in_file {
    my ($fh, $find_this) = @_;
    seek $fh, 0, 0;
    while (my $line = <$fh>) {
        return $. if $line eq $find_this;
    }
    return -1; # Not found.
}
I have a text file which lists a service, a device and a filter; here are 3 example lines:
service1 device04 filter9
service2 device01 filter2
service2 device10 filter11
I have written a Perl script that iterates through the file and should then print device=device filter=filter to a file named according to the service it belongs to, but if a string contains a duplicate filter, it should add the devices to the same file, separated by semicolons. Looking at the above example, I then need this result:
service1.txt
device=device04 filter=filter9
service2.txt
device=device01 filter=filter2 ; device=device10 filter=filter11
Here is my code:
use strict;
use warnings qw(all);
open INPUT, "<", "file.txt" or die $!;
my @Input = <INPUT>;
foreach my $item (@Input) {
    my ($serv, $device, $filter) = split(/ /, $item);
    chomp ($serv, $device, $filter);
    push my @arr, "device==$device & filter==$filter";
    open OUTPUT, ">>", "$serv.txt" or die $!;
    print OUTPUT join(" ; ", @arr);
    close OUTPUT;
}
The problem I am having is that both service1.txt and service2.txt are created, but my results are all wrong, see my current result:
service1.txt
device==device04 filter==filter9
service2.txt
device==device04 filter==filter9 ; device==device01 filter==filter2device==device04 filter==filter9 ; device==device01 filter==filter2 ; device==device10 filter==filter11
I apologise, I know this is something stupid, but it has been a really long night and my brain cannot function properly I believe.
For each service to have its own file where its data accumulates, you need to work out, for each line, which file to print it to.
Then open a new service file whenever a service without one is encountered; this is feasible since there aren't many of them, as clarified in a comment. This can be organized with a hash of service => filehandle.
use warnings;
use strict;
use feature 'say';
my $file = shift @ARGV || 'data.txt';
my %handle;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    my ($serv, $device, $filter) = split;

    if (exists $handle{$serv}) {
        print { $handle{$serv} } " ; device==$device & filter==$filter";
    }
    else {
        open my $fh_out, '>', "$serv.txt" or do {
            warn "Can't open $serv.txt: $!";
            next;
        };
        print $fh_out "device==$device & filter==$filter";
        $handle{$serv} = $fh_out;
    }
}
say $_ '' for values %handle; # terminate the line in each file
close $_ for values %handle;
For clarity the code prints almost the same thing in both branches, which could surely be made cleaner. It was tested only with the provided sample data and produces the desired output.
Note that when the filehandle is given by an expression that needs to be evaluated, it must be wrapped in { }. See this post, for example.
Comments on the original code (addressed in the code above)
Use lexical filehandles (my $fh) instead of typeglobs (FH)
Don't read the whole file at once unless there is a specific reason for that
split has nice defaults, split ' ', $_, where ' ' splits on whitespace and discards leading and trailing space as well. (And then there is no need to chomp in this case.)
Another option is to first collect the data for each service, just as the OP attempts, but again using a hash (service => arrayref/string with data) and printing at the end. But I don't see a reason not to print as you go, since you'd need the same logic to decide when the ; needs to be added.
Your code looks pretty perl4-ish, but that's not a problem. As MrTux has pointed out, you are confusing collection and fanning out of your data. I have refactored this to use a hash as an intermediate container, with the service name as the keys. Please note that this will not accumulate results across multiple calls (as it uses ">" and not ">>").
use strict;
use warnings qw(all);
use File::Slurp qw/read_file/;
my @Input = read_file('file.txt', chomp => 1);

my %store = (); # Global container

# Capture
foreach my $item (@Input) {
    my ($serv, $device, $filter) = split(/ /, $item);
    push @{$store{$serv}}, "device==$device & filter==$filter";
}

# Write out for each service file
foreach my $k (keys %store) {
    open(my $OUTPUT, ">", "$k.txt") or die $!;
    print $OUTPUT join(" ; ", @{$store{$k}});
    close( $OUTPUT );
}
I am trying to both learn Perl and use it in my research. I need to do a simple task, which is counting the number of sequences and their lengths in a file such as the following:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
This is the code I have written which is very crude and simple:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = @ARGV;

open(INFILE, '<', $input) or die "Can't open $input, $!\n";   # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!";  # Open a file for writing.

while (<INFILE>) {
    chomp;
    if (/^>/)
    {
        my $number_of_sequences++;
    }else{
        my length = length ($input);
    }
}
print length, number_of_sequences;
close (INFILE);
I'd be grateful if you could give me some hints. For example, in the else block, when I use the length function, I am not sure what argument I should pass to it.
Thanks in advance
You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = @ARGV;
my ($lastSeq, $number_of_sequences) = ('', 0);

open(INFILE, '<', $input) or die "Can't open $input, $!\n";   # Open a file for reading.
# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.

while (<INFILE>) {
    chomp;
    if (/^>(.+)/)
    {
        $lastSeq = $1;
        $number_of_sequences++;
    }
    else
    {
        my $length = length($_);
        print "$lastSeq $length\n";
    }
}
print "Total number of sequences = $number_of_sequences\n";
close (INFILE);
Since you have indicated that you want feedback on your program, here goes:
my ($input, $output) = @ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n";   # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!";  # Open a file for writing.
Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:
perl program.pl input.txt > output.txt
In this part:
my $number_of_sequences++;
you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.
In this part:
my length = length ($input);
you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:
my $length = length(<INFILE>);
Although this will also include the newline in the length.
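A small sketch of how to avoid counting the newline (my example, not part of the original answer):

my $line = <INFILE>;
chomp $line;                 # remove the trailing newline first
my $length = length $line;   # now counts only the sequence characters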
Here you have forgotten the sigils again:
print length, number_of_sequences;
And of course, this will not create the expected output. It will print something like sequence112.
Recommendations:
Use a while (<>) loop to read your input. This is the idiomatic method to use.
You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ). This will read a line, count the letters ATCG, return the count and discard the read line.
Summary:
use strict;
use warnings; # always use these two pragmas
my $count;

while (<>) {
    next unless /^>/;                # ignore non-header lines
    $count++;                        # increment counter
    chomp;
    my $length = (<> =~ tr/ATCG//);  # get length of next line
    s/^>(\S+)/$1 $length\n/;         # remove > and insert length
} continue {
    print;                           # print to STDOUT
}

print "Total number of sequences = $count\n";
Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.
And as I said above, you can redirect this to a file if you want.
For starters, you need to change your inner loop to this:
...
chomp;
if (/^>/)
{
    $number_of_sequences++;
    $sequence_name = $_;
}else{
    print "$sequence_name ", length($_), "\n";
}
...
Note the following:
The my declaration has been removed from $number_of_sequences
The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.
To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:
my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";
The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.
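As a toy illustration of that difference (not from the answer itself):

for my $i (1 .. 3) {
    my $inside = '';               # a fresh variable every iteration
    $inside .= 'x';
    print "inside: $inside\n";     # always prints a single 'x'
}

my $outside = '';                  # declared once, persists across iterations
for my $i (1 .. 3) {
    $outside .= 'x';
    print "outside: $outside\n";   # prints 'x', then 'xx', then 'xxx'
}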
#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $length, $tag, $count);

$file = $ARGV[0];
open (FILE, "$file") or print "can't open file $file\n";

while (<FILE>) {
    $line = $_;
    chomp $line;
    if ($line =~ /^>/) {
        $tag = $line;
    }
    else {
        $length = length ($line);
        $count = 1;
    }
    if ($count == 1) {
        print "$tag\t$length\n";
        $count = 0;
    }
}
close FILE;
I have 3 or more files that I need to merge; the data looks like this:
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4,
0334.89765
0334.89765
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried this, but the data in the 4th file does not come out in sorted order:
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";

foreach $file (@ARGV)
{
    print "$file\n";
    open (IN, "$file") || die "cannot open file $!";
    open (OUT, ">>$outFile") || die "cannot open file $!";
    while ( <IN> )
    {
        chomp $_;
        ($timestamp, $data) = split (/\./, $_);
        $hash{$timeStamp}{'data'} = $data;
        if (defined $hash{$timeStamp})
        {
            print "$_\n";
            print OUT "$_\n";
        }
    }
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but the unix utilities should be able to handle this just fine.
cat the 3 files together.
use sort to sort the merged file.
However, using perl, you could just do the following:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
push @data, $_ while (<>);

# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient both in memory and speed if you're able to take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next entry in order on each pass of the while loop:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;
# Initialize File handles
my @fhs = map { open my $fh, '<', $_; $fh } @ARGV;

# First Line of each file
my @data = map { scalar <$_> } @fhs;

# Loop while a next line exists
while (@data) {
    # Pull out the next entry.
    my $index = (sort { $data[$a] cmp $data[$b] } (0 .. $#data))[0];
    print $data[$index];

    # Fill In next Data at index.
    if (! defined($data[$index] = readline $fhs[$index])) {
        # End of that File
        splice @fhs, $index, 1;
        splice @data, $index, 1;
    }
}
Using Miller's idea in a more reusable way,
use strict;
use warnings;
sub get_sort_iterator {
    my @fhs = map { open my $fh, '<', $_ or die $!; $fh } @_;
    my @d;

    return sub {
        for my $i (0 .. $#fhs) {
            # skip to the next file handle if it no longer exists or we already have a value in $d[$i]
            next if !$fhs[$i] or defined $d[$i];

            # reading from the $fhs[$i] file handle was a success?
            if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
            # file handle at EOF, not needed any more
            else { undef $fhs[$i] }
        }
        # compare as numbers, return undef if no more data
        my ($index) = sort { $d[$a] <=> $d[$b] } grep { defined $d[$_] } 0 .. $#d
            or return;

        # return value from $d[$index], and set it to undef
        return delete $d[$index];
    };
}
my $iter = get_sort_iterator(@ARGV);

while (defined(my $x = $iter->())) {
    print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Supposing every input file is already in ascending order and has at least one line in it, this script could merge them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'reduce';
sub min_index {
    reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}

my @fhs  = map { open my $fh, '<', $_; $fh } @ARGV;
my @data = map { scalar <$_> } @fhs;

while (@data) {
    my $idx = min_index(@data);
    print "$data[$idx]";
    if (! defined($data[$idx] = readline $fhs[$idx])) {
        splice @data, $idx, 1;
        splice @fhs, $idx, 1;
    }
}
Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes, each hash corresponding to an input file and containing a file handle (fh), the last line read (line) and the timestamp extracted from that line (timestamp).
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeatedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;
my @data;

for my $file (@ARGV) {
    my $item;
    open $item->{fh}, '<', $file;
    insert_item($item, \@data);
}

while (@data) {
    my $item = pop @data;
    print $item->{line};
    insert_item($item, \@data);
}

sub insert_item {
    my ($item, $array) = @_;
    return if eof $item->{fh};
    $item->{line} = readline $item->{fh};
    ($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
    my $i = 0;
    ++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
    splice @$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764
I wrote a perl script to count the occurrences of a character in a file.
So far this is what I have got,
#!/usr/bin/perl -w
use warnings;
no warnings ('uninitialized', 'substr');
my $lines_ref;
my @lines;
my $count;

sub countModule()
{
    my $file = "/test";
    open my $fh, "<", $file or die "could not open $file: $!";
    my @contents = $fh;
    my @filtered = grep (/\// , @contents);
    return \@filtered;
}

@lines = countModule();
#@lines = $lines_ref;
$count = @lines;
print "###########\n $count \n###########\n";
My test file looks like this:
10.0.0.1/24
192.168.10.0/24
172.16.30.1/24
I am basically trying to count the number of instances of "/"
This is the output that I get:
###########
1
###########
I am getting 1 instead of 3, which is the number of occurrences.
Still learning Perl, so any help will be appreciated. Thank you!
Here are a few points about your code
You should always use strict at the top of your program, and only use no warnings for special reasons in a limited scope. There is no general reason why a working Perl program should need to disable warnings globally
Declare your variables close to their first point of use. The style of declaring everything at the top of the file is unnecessary and is a legacy of C
Never use prototypes in your code. They are available for very special purposes and shouldn't be used for the vast majority of Perl code. sub countModule() { ... } insists that countModule may never be called with any parameters and isn't necessary or useful. The definition should be just sub countModule { ... }
A big well done! for using a lexical file handle, the three-parameter form of open, and putting $! in your die string
my @contents = $fh will just set @contents to a single-element list containing just the filehandle. To read the whole file into the array you need my @contents = <$fh>
You can avoid escaping slashes in a regular expression if you use a different delimiter. To do that you need to use the m operator explicitly, like my @filtered = grep m|/|, @contents
You return an array reference but assign the returned value to an array, so @lines = countModule() sets @lines to a single-element list containing just the array reference. You should either return a list with return @filtered or dereference the return value on assignment with @lines = @{ countModule() }
If all you need to do is to print the number of lines in the file that contain a slash character then you could write something like this
use strict;
use warnings;
my $count;

sub countModule {
    open my $fh, '<', '/test' or die "Could not open /test: $!";
    return [ grep m|/|, <$fh> ];
}

my $lines = countModule;
$count = @$lines;

print "###########\n $count \n###########\n";
Close, but a few issues:
use strict;
use warnings;
sub countModule
{
    my $file = "/test";
    open my $fh, "<", $file or die "could not open $file: $!";
    my @contents = <$fh>;      # The <> brackets are used to read from $fh.
    my @filtered = grep (/\// , @contents);
    return @filtered;          # Remove the reference.
}

my @lines = countModule();
my $count = scalar @lines;     # 'scalar' is not required, but lends clarity.

print "###########\n $count \n###########\n";
Each of the changes I made to your code are annotated with a #comment explaining what was done.
Now in list context your subroutine will return the filtered lines. In scalar context it will return a count of how many lines were filtered.
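For example, with the modified sub above (hypothetical calls):

my @matching = countModule();   # list context: the filtered lines themselves
my $count    = countModule();   # scalar context: return @filtered yields the element count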
You did also mention finding the occurrences of a character (despite everything in your script being line-oriented). Perhaps your counter sub would look like this:
sub file_tallies {
    my $file = '/test';
    open my $fh, '<', $file or die $!;
    my $count;
    my $lines;
    while ( <$fh> ) {
        $lines++;
        $count += $_ =~ tr[\/][\/];
    }
    return ( $lines, $count );
}
my( $line_count, $slash_count ) = file_tallies();
In list context,
return \@filtered;
returns a list with one element -- a reference to the named array @filtered. Maybe you wanted to return the list itself:
return @filtered;
Here's some simpler code:
sub countMatches {
    my ($file, $c) = @_;      # Pass parameters
    local $/;
    undef $/;                 # Slurp input
    open my $fh, "<", $file or die "could not open $file: $!";
    my $s = <$fh>;            # The <> brackets are used to read from $fh.
    close $fh;
    my $ptn = quotemeta($c);  # So we can match strings like ".*" verbatim
    my @hits = $s =~ m/($ptn)/g;
    0 + @hits
}
print countMatches ("/test", '/') . "\n";
The code pushes Perl beyond the very basics, but not too much. Salient points:
By undeffing $/ you can read the input into one string. If you're counting occurrences of a string in a file, and not occurrences of lines that contain the string, this is usually easier to do.
m/(...)/g will find all the hits, but if you want to count strings like "." you need to quote the meta characters in them.
Store the results in an array to evaluate m// in list context.
Adding 0 to an array evaluates it in scalar context, which gives the number of items in it.
I quickly jotted off a Perl script that would average a few files with just columns of numbers. It involves reading from an array of filehandles. Here is the script:
#!/usr/local/bin/perl
use strict;
use warnings;
use Symbol;
die "Usage: $0 file1 [file2 ...]\n" unless scalar(@ARGV);

my @fhs;
foreach (@ARGV) {
    my $fh = gensym;
    open $fh, $_ or die "Unable to open \"$_\"";
    push(@fhs, $fh);
}

while (scalar(@fhs)) {
    my ($result, $n, $a, $i) = (0, 0, 0, 0);
    while ($i <= $#fhs) {
        if ($a = <$fhs[$i]>) {
            $result += $a;
            $n++;
            $i++;
        }
        else {
            $fhs[$i]->close;
            splice(@fhs, $i, 1);
        }
    }
    if ($n) { print $result/$n . "\n"; }
}
This doesn't work. If I debug the script, after I initialize @fhs it looks like this:
DB<1> x @fhs
0 GLOB(0x10443d80)
-> *Symbol::GEN0
FileHandle({*Symbol::GEN0}) => fileno(6)
1 GLOB(0x10443e60)
-> *Symbol::GEN1
FileHandle({*Symbol::GEN1}) => fileno(7)
So far, so good. But it fails at the part where I try to read from the file:
DB<3> x $fhs[$i]
0 GLOB(0x10443d80)
-> *Symbol::GEN0
FileHandle({*Symbol::GEN0}) => fileno(6)
DB<4> x $a
0 'GLOB(0x10443d80)'
$a is filled with this string rather than something read from the glob. What have I done wrong?
You can only use a simple scalar variable inside <> to read from a filehandle. <$foo> works. <$foo[0]> does not read from a filehandle; it's actually equivalent to glob($foo[0]). You'll have to use the readline builtin, a temporary variable, or use IO::File and OO notation.
$text = readline($foo[0]);
# or
my $fh = $foo[0]; $text = <$fh>;
# or
$text = $foo[0]->getline; # If using IO::File
If you weren't deleting elements from the array inside the loop, you could easily use a temporary variable by changing your while loop to a foreach loop.
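A minimal sketch of that suggestion (my own, and only applicable if you stop splicing handles out of @fhs inside the loop):

for my $fh (@fhs) {                        # $fh is a simple scalar, so <$fh> works
    while (defined(my $line = <$fh>)) {
        # ... process $line ...
    }
}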
Personally, I think using gensym to create filehandles is an ugly hack. You should either use IO::File, or pass an undefined variable to open (which requires at least Perl 5.6.0, but that's almost 10 years old now). (Just say my $fh; instead of my $fh = gensym;, and Perl will automatically create a new filehandle and store it in $fh when you call open.)
If you are willing to use a bit of magic, you can do this very simply:
use strict;
use warnings;
die "Usage: $0 file1 [file2 ...]\n" unless @ARGV;
my $sum = 0;

# The current filehandle is aliased to ARGV
while (<>) {
    $sum += $_;
}
continue {
    # We have finished a file:
    if ( eof ARGV ) {
        # $. is the current line number.
        print $sum/$. , "\n" if $.;
        $sum = 0;
        # Closing ARGV resets $. because ARGV is
        # implicitly reopened for the next file.
        close ARGV;
    }
}
Unless you are using a very old perl, the messing about with gensym is not necessary. IIRC, perl 5.6 and newer are happy with normal lexical handles: open my $fh, '<', 'foo';
I have trouble understanding your logic. Do you want to read several files, each of which just contains numbers (one number per line), and print the average for each file?
use strict;
use warnings;
my @fh;
foreach my $f (@ARGV) {
    open(my $fh, '<', $f) or die "Cannot open $f: $!";
    push @fh, $fh;
}

foreach my $fh (@fh) {
    my ($sum, $n) = (0, 0);
    while (<$fh>) {
        $sum += $_;
        $n++;
    }
    print "$sum / $n: ", $sum / $n, "\n" if $n;
}
Seems like a for loop would work better for you, where you could actually use the standard read (iteration) operator.
for my $fh ( @fhs ) {
    while ( defined( my $line = <$fh> )) {
        # since we're reading integers we test for *defined*
        # so we don't close the file on '0'
        #...
    }
    close $fh;
}
It doesn't look like you want to shortcut the loop at all. Therefore, while seems to be the wrong loop idiom.