Having this snippet:
my $file = "input.txt"; # let's assume that this is an ascii file
my $size1 = -s $file;
print "$size1\n";
$size2 = 0;
open F, $file;
$size2 += length($_) while (<F>);
close F;
print "$size2\n";
when can one assert that it is true that $size1 equals $size2?
If you don't specify an encoding that supports multibyte characters, it should hold. Otherwise, the result can be different:
$ cat 1.txt
žluťoučký kůň
$ perl -E 'say -s "1.txt";
open my $FH, "<:utf8", "1.txt";
my $f = do { local $/; <$FH> };
say length $f;'
20
14
You cannot, because the input layer may do some convert on the input line, for example change crlf to cr, that may change the length of that line.
In addition, length $line count how many characters in $line, in the multi-byte encoding, as the example given by #choroba, one character may occupy more than one bytes.
See perlio for further details.
No, as Lee Duhem says, the two numbers may be different because of Perl's end-of-line processing, or because length reports the size of the string in characters, which will throw the numbers out if there are any wide characters in the text.
However the tell function will report the exact position in bytes that you have read up to, so an equivalent to your program for which the numbers are guaranteed to match is this
use strict;
use warnings;
my $file = 'input.txt';
my $size1 = -s $file;
print "$size1\n";
open my $fh, '<', $file or die $!;
my $size2 = 0;
while (<$fh>) {
$size2 = tell $fh;
}
close $fh;
print "$size2\n";
Please note the use of use strict and use warnings, the lexical file handle, the three-parameter form of open, and the check that it succeeded. All of these are best-practice for Perl programs and should be used in everything you write
You're simply missing binmode(F); or the :raw IO layer. These cause Perl to return the file exactly as it appears on disk. No line ending translation. No decoding of character encodings.
open(my $fh, '<:raw', $file)
or die "open $file: $!\n");
Then your code works fine.
my $size = 0;
$size += length while <$fh>;
That's not particularly good because it could read the entire file at once for binary files. So let's read fixed-sized blocks instead.
local $/ = \(64*1024);
my $size = 0;
$size += length while <$fh>;
That's basically the same as using read, which reads 4K or 8K (in newer Perls) at a time. There are performance benefits to reading more than that at a time, and we can use sysread to do that.
my $size = 0;
while (my $bytes_read = sysread($fh, my $buf, 64*1024)) {
$size += $bytes_read;
}
Reading the whole file is silly, though. You could just seek to the end of the file.
use Fcntl qw( SEEK_END );
my $size = sysseek($fh, 0, SEEK_END);
But then again, you might as well just use -s.
my $size = -s $fh;
Related
I'm doing some simple parsing on text files (which could get up into the 1GB range). How would I go about skipping the first N rows, and more importantly, the last (different) N rows? I'm sure I could open the file and count the rows, and do something with $_ < total_row_count -N, but that seems incredibly inefficient.
I'm pretty much a perl newb, btw.
A file is a sequence of bytes, without the notion of "lines." Some of those bytes are considered as "line" separators (linefeeds), which is how software gives us our "logical" lines to work with. So there is no way to know how many lines there are in a file -- without having read it and counted them, that is.
A simple and naive way is to read line-by-line and count
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
++$cnt while <$fh>;
with a little faster version using $. variable
1 while <$fh>;
my $cnt = $.;
These take between 2.5 and 3 seconds for a 1.1 Gb text file on a reasonable desktop.
We can speed this up a lot by reading in larger chunks and counting newline characters
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
NUM_LINES: {
my $len = 64_000;
my $buf;
$cnt += $buf =~ tr/\n//
while read $fh, $buf, $len;
seek $fh, 0, 0;
};
This goes in barely over half a second, on same hardware and Perl versions.
I've put it in a block to scope unneeded variables but it should be in a sub, where you can then check where the filehandle is when you get it and return it there after counting (so we can count the "rest" of lines from some point in the file and the processing can then continue), etc. It should also include checks on read operation, at each invocation.
I'd think that a half a second overhead on a Gb large file isn't that bad at all.
Still, you can go for faster yet, at the expense of it being far messier. Get the file size (metadata, so no reading involved) and seek to a position estimated to be the wanted number of lines before the end (no reading involved). That most likely won't hit the right spot so read to the end to count lines and adjust, seeking back (further or closer). Repeat until you reach the needed place.
open my $fh, "<", $file;
my $size = -s $file;
my $estimated_line_len = 80;
my $num_last_lines = 100;
my $pos = $size - $num_last_lines*$estimated_line_len;
seek $fh, $pos, 0;
my $cnt;
++$cnt while <$fh>;
say "There are $cnt lines from position $pos to the end";
# likely need to seek back further/closer ...
I'd guess that this should get you there in under 100 ms. Note that $pos is likely inside a line.
Then once you know the number of lines (or the position for desired number of lines before the end) do seek $fh, 0, 0 and process away. Or really have this in a sub which puts the filehandle back where it was before returning, as mentioned.
I think you need a circular buffer to avoid reading entire file on your memory.
skip-first-last.pl
#!/usr/bin/perl
use strict;
use warnings;
my ($first, $last) = #ARGV;
my #buf;
while (<STDIN>) {
my $mod = $. % $last;
print $buf[$mod] if defined $buf[$mod];
$buf[$mod] = $_ if $. > $first;
}
1;
Skip first 5 lines and last 2 lines:
$ cat -n skip-first-last.pl | ./skip-first-last.pl 5 2
6
7 my #buf;
8 while (<STDIN>) {
9 my $mod = $. % $last;
10 print $buf[$mod] if defined $buf[$mod];
11 $buf[$mod] = $_ if $. > $first;
12 }
I am trying to both learn perl and use it in my research. I need to do a simple task which is counting the number of sequences and their lengths in a file such as follow:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
This is the code I have written which is very crude and simple:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>/)
{
my $number_of_sequences++;
}else{
my length = length ($input);
}
}
print length, number_of_sequences;
close (INFILE);
I'd be grateful if you could give me some hints, for example, in the else block, when I use the length function, I am not sure what argument I should pass into it.
Thanks in advance
You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
my ($lastSeq, $number_of_sequences) = ('', 0);
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>(.+)/)
{
$lastSeq = $1;
$number_of_sequences++;
}
else
{
my $length = length($_);
print "$lastSeq $length\n";
}
}
print "Total number of sequences = $number_of_sequences\n";
close (INFILE);
Since you have indicated that you want feedback on your program, here goes:
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:
perl program.pl input.txt > output.txt
In this part:
my $number_of_sequences++;
you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.
In this part:
my length = length ($input);
you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:
my $length = length(<INFILE>);
Although this will also include the newline in the length.
Here you have forgotten the sigils again:
print length, number_of_sequences;
And of course, this will not create the expected output. It will print something like sequence112.
Recommendations:
Use a while (<>) loop to read your input. This is the idiomatic method to use.
You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ) This will read a line, count the letters ATGC, return the count and discard the read line.
Summary:
use strict;
use warnings; # always use these two pragmas
my $count;
while (<>) {
next unless /^>/; # ignore non-header lines
$count++; # increment counter
chomp;
my $length = (<> =~ tr/ATCG//); # get length of next line
s/^>(\S+)/$1 $length\n/; # remove > and insert length
} continue {
print; # print to STDOUT
}
print "Total number is sequences = $count\n";
Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.
And as I said above, you can redirect this to a file if you want.
For starters, you need to change your inner loop to this:
...
chomp;
if (/^>/)
{
$number_of_sequences++;
$sequence_name = $_;
}else{
print "$sequence_name ", length($input), "\n";
}
...
Note the following:
The my declaration has been removed from $number_of_sequences
The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.
To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:
my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";
The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.
#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $length, $tag, $count);
$file = $ARGV[0];
open (FILE, "$file") or print"can't open file $file\n";
while (<FILE>){
$line=$_;
chomp $line;
if ($line=~/^>/){
$tag = $line;
}
else{
$length = length ($line);
$count=1;
}
if ($count==1){
print "$tag\t$length\n";
$count=0
}
}
close FILE;
I have a big (300 kB) text file containing words delimited by spaces. Now I want to open this file and process every word in it one by one.
The problem is that perl reads the file line by line (i.e) the entire file at once which gives me strange results. I know the normal way is to do something like
open($inFile, 'tagged.txt') or die $!;
$_ = <$inFile>;
#splitted = split(' ',$_);
print $#splitted;
But this gives me a faulty word count (too large array?).
Is it possible to read the text file word by word instead?
Instead of reading it in one fell swoop, try the line-by-line approach which is easier on your machine's memory usage too (although 300 KB isn't too large for modern computers).
use strict;
use warnings;
my #words;
open (my $inFile, '<', 'tagged.txt') or die $!;
while (<$inFile>) {
chomp;
#words = split(' ');
foreach my $word (#words) { # process }
}
close ($inFile);
To read the file one word at a time, change the input record separator ($/) to a space:
local $/ = ' ';
Example:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
{
local $/ = ' ';
while (<DATA>) {
say;
}
}
__DATA__
one two three four five
Output:
one
two
three
four
five
It's unclear what you input file looks like, but you imply that it contains just a single line composed of many "words".
300KB is far from a "big text file". You should read it in its entirety and pull the words from there one by one. This program demonstrates
use strict;
use warnings;
my $data = do {
open my $fh, '<', 'data.txt' or die $!;
local $/;
<$fh>;
};
my $count = 0;
while ($data =~ /(\S+)/g ) {
my $word = $1;
++$count;
printf "%2d: %s\n", $count, $word;
}
output
1: alpha
2: beta
3: gamma
4: delta
5: epsilon
Without more explanation of what a "faulty word count" might be it is very hard to help, but it is certain that the problem isn't because of the size of your array: if there was a problem there then Perl would raise an exception and die.
But if you are comparing the result with the statistics from a word processor, then it is probably because the definition of "word" is different. For instance, the word processor may consider a hyphenated word to be two words.
300K doesn't seem to be big, so you may try:
my $text=`cat t.txt` or die $!;
my #words = split /\s+/, $text;
foreach my $word (#words) { # process }
or slightly modified solution of squiguy
use strict;
use warnings;
my #words;
open (my $inFile, '<', 'tagged.txt') or die $!;
while (<$inFile>) {
push(#words,split /\s+/);
}
close ($inFile);
foreach my $word (#words) { # process }
Earlier I was working on a loop within a loop and if a match was made it would replace the entire string from the second loop file. Now i have a slightly different situation. I'm trying to replace a substring from the first loop with a string from the second loop. They're both csv files and semicolon delimited. What i'm trying to replace are special characters: from the numerical code to the character itself The first file looks like:
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał8;9
and the second file has the numerical code and the corresponding character:
Ą;Ą
ą;ą
Ǟ;Ǟ
Á;Á
á;á
Â;Â
ł;ł
The first semicolon in the second file belongs to the numerical code of the corresponding character and should not be used to split the file. The result should be:
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał;8;9
This is the code I have. How can i fix this?
use strict;
use warnings;
my $inputfile1 = shift || die "input/output!\n";
my $inputfile2 = shift || die "input/output!\n";
my $outputfile = shift || die "output!\n";
open my $INFILE1, '<', $inputfile1 or die "Used/Not found :$!\n";
open my $INFILE2, '<', $inputfile2 or die "Used/Not found :$!\n";
open my $OUTFILE, '>', $outputfile or die "Used/Not found :$!\n";
my $infile2_pos = tell $INFILE2;
while (<$INFILE1>) {
s/"//g;
my #elements = split /;/, $_;
seek $INFILE2, $infile2_pos, 0;
while (<$INFILE2>) {
s/"//g;
my #loopelements = split /;/, $_;
#### The problem part ####
if (($elements[2] =~ /\&\#\d{3}\;/g) and (($elements[2]) eq ($loopelements[0]))){
$elements[2] =~ s/(\&\#\d{3}\;)/$loopelements[1]/g;
print "$2. elements[2]\n";
}
#### End problem part #####
}
my $output_line = join(";", #elements);
print $OUTFILE $output_line;
#print "\n"
}
close $INFILE1;
close $INFILE2;
close $OUTFILE;
exit 0;
Assuming your character codes are standard Unicode entities, you are better off using HTML::Entities to decode them.
This program processes the data you show in your first file and ignores the second file completely. The output seems to be what you want.
use strict;
use warnings;
use HTML::Entities 'decode_entities';
binmode STDOUT, ":utf8";
while (<DATA>) {
print decode_entities($_);
}
__DATA__
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał8;9
output
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał8;9
You split your #elements at every occurrence of ;, which is then removed. You will not find it in your data, the semicolon in your Regexp can never match, so no substitutions are done.
Anyway, using seek is somewhat disturbing for me. As you have a reasonable number of replacement codes (<5000), you might consider putting them into a hash:
my %subst;
while(<$INFILE2>){
/^&#(\d{3});;(.*)\n/;
$subst{$1} = $2;
}
Then we can do:
while(<$INFILE1>){
s| &# (\d{3}) | $subst{$1} // "&#$1" |egx;
# (don't try to concat undef
# when no substitution for our code is defined)
print $OUTFILE $_;
}
We do not have to split the files or view them as CSV data if replacement should occur everywhere in INFILE1. My solution should speed things up a bit (parsing INFILE2 only once). Here I assumed your input data is correct and the number codes are not terminated by a semicolon but by length. You might want to remove that from your Regexes.(i.e. m/&#\d{3}/)
If you have trouble with character encodings, you might want to open your files with :uft8 and/or use Encode or similar.
Ive been trying to compare lines between two files and matching lines that are the same.
For some reason the code below only ever goes through the first line of 'text1.txt' and prints the 'if' statement regardless of if the two variables match or not.
Thanks
use strict;
open( <FILE1>, "<text1.txt" );
open( <FILE2>, "<text2.txt" );
foreach my $first_file (<FILE1>) {
foreach my $second_file (<FILE2>) {
if ( $second_file == $first_file ) {
print "Got a match - $second_file + $first_file";
}
}
}
close(FILE1);
close(FILE2);
If you compare strings, use the eq operator. "==" compares arguments numerically.
Here is a way to do the job if your files aren't too large.
#!/usr/bin/perl
use Modern::Perl;
use File::Slurp qw(slurp);
use Array::Utils qw(:all);
use Data::Dumper;
# read entire files into arrays
my #file1 = slurp('file1');
my #file2 = slurp('file2');
# get the common lines from the 2 files
my #intersect = intersect(#file1, #file2);
say Dumper \#intersect;
A better and faster (but less memory efficient) approach would be to read one file into a hash, and then search for lines in the hash table. This way you go over each file only once.
# This will find matching lines in two files,
# print the matching line and it's line number in each file.
use strict;
open (FILE1, "<text1.txt") or die "can't open file text1.txt\n";
my %file_1_hash;
my $line;
my $line_counter = 0;
#read the 1st file into a hash
while ($line=<FILE1>){
chomp ($line); #-only if you want to get rid of 'endl' sign
$line_counter++;
if (!($line =~ m/^\s*$/)){
$file_1_hash{$line}=$line_counter;
}
}
close (FILE1);
#read and compare the second file
open (FILE2,"<text2.txt") or die "can't open file text2.txt\n";
$line_counter = 0;
while ($line=<FILE2>){
$line_counter++;
chomp ($line);
if (defined $file_1_hash{$line}){
print "Got a match: \"$line\"
in line #$line_counter in text2.txt and line #$file_1_hash{$line} at text1.txt\n";
}
}
close (FILE2);
You must re-open or reset the pointer of file 2. Move the open and close commands to within the loop.
A more efficient way of doing this, depending on file and line sizes, would be to only loop through the files once and save each line that occurs in file 1 in a hash. Then check if the line was there for each line in file 2.
If you want the number of lines,
my $count=`grep -f [FILE1PATH] -c [FILE2PATH]`;
If you want the matching lines,
my #lines=`grep -f [FILE1PATH] [FILE2PATH]`;
If you want the lines which do not match,
my #lines = `grep -f [FILE1PATH] -v [FILE2PATH]`;
This is a script I wrote that tries to see if two file are identical, although it could easily by modified by playing with the code and switching it to eq. As Tim suggested, using a hash would probably be more effective, although you couldn't ensure the files were being compared in the order they were inserted without using a CPAN module (and as you can see, this method should really use two loops, but it was sufficient for my purposes). This isn't exactly the greatest script ever, but it may give you somewhere to start.
use warnings;
open (FILE, "orig.txt") or die "Unable to open first file.\n";
#data1 = ;
close(FILE);
open (FILE, "2.txt") or die "Unable to open second file.\n";
#data2 = ;
close(FILE);
for($i = 0; $i < #data1; $i++){
$data1[$i] =~ s/\s+$//;
$data2[$i] =~ s/\s+$//;
if ($data1[$i] ne $data2[$i]){
print "Failure to match at line ". ($i + 1) . "\n";
print $data1[$i];
print "Doesn't match:\n";
print $data2[$i];
print "\nProgram Aborted!\n";
exit;
}
}
print "\nThe files are identical. \n";
Taking the code you posted, and transforming it into actual Perl code, this is what I came up with.
use strict;
use warnings;
use autodie;
open my $fh1, '<', 'text1.txt';
open my $fh2, '<', 'text2.txt';
while(
defined( my $line1 = <$fh1> )
and
defined( my $line2 = <$fh2> )
){
chomp $line1;
chomp $line2;
if( $line1 eq $line2 ){
print "Got a match - $line1\n";
}else{
print "Lines don't match $line1 $line2"
}
}
close $fh1;
close $fh2;
Now what you may really want is a diff of the two files, which is best left to Text::Diff.
use strict;
use warnings;
use Text::Diff;
print diff 'text1.txt', 'text2.txt';