Reading and writing to the same file - perl

I'm using this code I found online to read a properties file in my Perl script:
open (CONFIG, "myfile.properties");
while (CONFIG){
chomp; #no new line
s/#.*//; #no comments
s/^\s+//; #no leading white space
s/\s+$//; #no trailing white space
next unless length;
my ($var, $value) = split (/\s* = \s*/, $_, 2);
$$var = $value;
}
Is it posssible to also write to the text file inside this while loop? Let's say the text file looks like this:
#Some comments
a_variale = 5
a_path = /home/user/path
write_to_this_variable = ""
How can I put some text in write_to_this_variable?

It is not really practical to overwrite text files where you have variable length records (lines). It is normal to copy the file, something like this:
my $filename = 'myfile.properites';
open(my $in, '<', $filename) or die "Unable to open '$filename' for read: $!";
my $newfile = "$filename.new";
open(my $out, '>', $newfile) or die "Unable to open '$newfile' for write: $!";
while (<$in>) {
s/(write_to_this_variable =) ""/$1 "some text"/;
print $out;
}
close $in;
close $out;
rename $newfile,$filename or die "unable to rename '$newfile' to '$filename': $!";
You might have to sanitse the text you are writing with something like \Q if it contains non-alphanumerics.

This is an example of a program that uses the Config::Std module to read an write a simple config file like yours. As far as I know it is the only module that will preserve any comments in the original file.
There are two points to note:
The first hash key in $props{''}{write_to_this_variable} forms the name of the config file section that will contain the value. If there are no sections, as for your file, then you must use an empty string here
If you need quotes around the a value then you must add these explicitly when you are assigning to the hash element, as I do here with '"Some text"'
I think the rest of the program is self-explanatory.
use strict;
use warnings;
use Config::Std { def_sep => ' = ' };
my %props;
read_config 'myfile.properties', %props;
$props{''}{write_to_this_variable} = '"Some text"';
write_config %props;
output
#Some comments
a_variale = 5
a_path = /home/user/path
write_to_this_variable = "Some text"

Related

Perl - Compare two large txt files and return the required lines from the first

So I am quite new to perl programming. I have two txt files, combined_gff.txt and pegs.txt.
I would like to check if each line of pegs.txt is a substring for any of the lines in combined_gff.txt and output only those lines from combined_gff.txt in a separate text file called output.txt
However my code returns empty. Any help please ?
P.S. I should have mentioned this. Both the contents of the combined_gff and pegs.txt are present as rows. One row has a string. second row has another string. I just wish to pickup the rows from combined_gff whose substrings are present in pegs.txt
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my #gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my #ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (#gff) {
foreach my $extline (#ext) {
if ( index($gffline, $extline) != -1) {
$str=$str.$gffline;
$str=$str."\n";
exit;
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
The first problem is exit. The output file is never created if a substring is found.
The second problem is chomp: you don't remove newlines from the lines, so the only way how a substring can be found is when a string from pegs.txt is a suffix of a string from combined_gff.txt.
Even after fixing these two problems, the algorithm will be very slow, as you're comparing each line from one file to each line of the second file. It will also print a line multiple times if it contains several different substrings (not sure if that's what you want).
Here's a different approach: First, read all the lines from pegs.txt and assemble them into a regex (quotemeta is needed so that special characters in substrings are interpreted literally in the regex). Then, read combined_gff.txt line by line, if the regex matches the line, print it.
#!/usr/bin/perl
use warnings;
use strict;
open my $data, '<', 'pegs.txt' or die $!;
chomp( my #ext = <$data> );
my $regex = join '|', map quotemeta, #ext;
open my $file, '<', 'combined_gff.txt' or die $!;
open my $out, '>', 'output.txt' or die $!;
while (<$file>) {
print {$out} $_ if /$regex/;
}
close $out;
I also switched to 3 argument version of open with lexical filehandles as it's the canonical way (3 argument version is safe even for files named >file or rm *| and lexical filehandles aren't global and are easier to pass as arguments to subroutines). Also, showing the actual error is more helpful than just dying with "error".
As choroba says you don't need the "exit" inside the loop since it ends the complete execution of the script and you must remove the line forwards (LF you do it by chomp lines) to find the matches.
Following the logic of your script I made one with the corrections and it worked fine.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my #gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my #ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (#gff) {
chomp($gffline);
foreach my $extline (#ext) {
chomp($extline);
print $extline;
if ( index($gffline, $extline) > -1) {
$str .= $gffline ."\n";
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
Hope it works for you.
Welcho

Recursive search in Perl?

I'm incredibly new to Perl, and never have been a phenomenal programmer. I have some successful BVA routines for controlling microprocessor functions, but never anything embedded, or multi-facted. Anyway, my question today is about a boggle I cannot get over when trying to figure out how to remove duplicate lines of text from a text file I created.
The file could have several of the same lines of txt in it, not sequentially placed, which is problematic as I'm practically comparing the file to itself, line by line. So, if the first and third lines are the same, I'll write the first line to a new file, not the third. But when I compare the third line, I'll write it again since the first line is "forgotten" by my current code. I'm sure there's a simple way to do this, but I have issue making things simple in code. Here's the code:
my $searchString = pseudo variable "ideally an iterative search through the source file";
my $file2 = "/tmp/cutdown.txt";
my $file3 = "/tmp/output.txt";
my $count = "0";
open (FILE, $file2) || die "Can't open cutdown.txt \n";
open (FILE2, ">$file3") || die "Can't open output.txt \n";
while (<FILE>) {
print "$_";
print "$searchString\n";
if (($_ =~ /$searchString/) and ($count == "0")) {
++ $count;
print FILE2 $_;
} else {
print "This isn't working\n";
}
}
close (FILE);
close (FILE2);
Excuse the way filehandles and scalars do not match. It is a work in progress... :)
The secret of checking for uniqueness, is to store the lines you have seen in a hash and only print lines that don't exist in the hash.
Updating your code slightly to use more modern practices (three-arg open(), lexical filehandles) we get this:
my $file2 = "/tmp/cutdown.txt";
my $file3 = "/tmp/output.txt";
open my $in_fh, '<', $file2 or die "Can't open cutdown.txt: $!\n";
open my $out_fh, '>', $file3 or die "Can't open output.txt: $!\n";
my %seen;
while (<$in_fh>) {
print $out_fh unless $seen{$_}++;
}
But I would write this as a Unix filter. Read from STDIN and write to STDOUT. That way, your program is more flexible. The whole code becomes:
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
print unless $seen{$_}++;
}
Assuming this is in a file called my_filter, you would call it as:
$ ./my_filter < /tmp/cutdown.txt > /tmp/output.txt
Update: But this doesn't use your $searchString variable. It's not clear to me what that's for.
If your file is not very large, you can store each line readed from the input file as a key in a hash variable. And then, print the hash keys (ordered). Something like that:
my %lines = ();
my $order = 1;
open my $fhi, "<", $file2 or die "Cannot open file: $!";
while( my $line = <$fhi> ) {
$lines {$line} = $order++;
}
close $fhi;
open my $fho, ">", $file3 or die "Cannot open file: $!";
#Sort the keys, only if needed
my #ordered_lines = sort { $lines{$a} <=> $lines{$b} } keys(%lines);
for my $key( #ordered_lines ) {
print $fho $key;
}
close $fho;
You need two things to do that:
a hash to keep track of all the lines you have seen
a loop reading the input file
This is a simple implementation, called with an input filename and an output filename.
use strict;
use warnings;
open my $fh_in, '<', $ARGV[0] or die "Could not open file '$ARGV[0]': $!";
open my $fh_out, '<', $ARGV[1] or die "Could not open file '$ARGV[1]': $!";
my %seen;
while (my $line = <$fh_in>) {
# check if we have already seen this line
if (not $seen{$line}) {
print $fh_out $line;
}
# remember this line
$seen{$line}++;
}
To test it, I've included it with the DATA handle as well.
use strict;
use warnings;
my %seen;
while (my $line = <DATA>) {
# check if we have already seen this line
if (not $seen{$line}) {
print $line;
}
# remember this line
$seen{$line}++;
}
__DATA__
foo
bar
asdf
foo
foo
asdfg
hello world
This will print
foo
bar
asdf
asdfg
hello world
Keep in mind that the memory consumption will grow with the file size. It should be fine as long as the text file is smaller than your RAM. Perl's hash memory consumption grows a faster than linear, but your data structure is very flat.

Calculate the length of a string in a specific file format with perl

I am trying to both learn perl and use it in my research. I need to do a simple task which is counting the number of sequences and their lengths in a file such as follow:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
This is the code I have written which is very crude and simple:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>/)
{
my $number_of_sequences++;
}else{
my length = length ($input);
}
}
print length, number_of_sequences;
close (INFILE);
I'd be grateful if you could give me some hints, for example, in the else block, when I use the length function, I am not sure what argument I should pass into it.
Thanks in advance
You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
my ($lastSeq, $number_of_sequences) = ('', 0);
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>(.+)/)
{
$lastSeq = $1;
$number_of_sequences++;
}
else
{
my $length = length($_);
print "$lastSeq $length\n";
}
}
print "Total number of sequences = $number_of_sequences\n";
close (INFILE);
Since you have indicated that you want feedback on your program, here goes:
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:
perl program.pl input.txt > output.txt
In this part:
my $number_of_sequences++;
you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.
In this part:
my length = length ($input);
you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:
my $length = length(<INFILE>);
Although this will also include the newline in the length.
Here you have forgotten the sigils again:
print length, number_of_sequences;
And of course, this will not create the expected output. It will print something like sequence112.
Recommendations:
Use a while (<>) loop to read your input. This is the idiomatic method to use.
You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ) This will read a line, count the letters ATGC, return the count and discard the read line.
Summary:
use strict;
use warnings; # always use these two pragmas
my $count;
while (<>) {
next unless /^>/; # ignore non-header lines
$count++; # increment counter
chomp;
my $length = (<> =~ tr/ATCG//); # get length of next line
s/^>(\S+)/$1 $length\n/; # remove > and insert length
} continue {
print; # print to STDOUT
}
print "Total number is sequences = $count\n";
Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.
And as I said above, you can redirect this to a file if you want.
For starters, you need to change your inner loop to this:
...
chomp;
if (/^>/)
{
$number_of_sequences++;
$sequence_name = $_;
}else{
print "$sequence_name ", length($input), "\n";
}
...
Note the following:
The my declaration has been removed from $number_of_sequences
The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.
To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:
my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";
The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.
#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $length, $tag, $count);
$file = $ARGV[0];
open (FILE, "$file") or print"can't open file $file\n";
while (<FILE>){
$line=$_;
chomp $line;
if ($line=~/^>/){
$tag = $line;
}
else{
$length = length ($line);
$count=1;
}
if ($count==1){
print "$tag\t$length\n";
$count=0
}
}
close FILE;

How can I count paragraphs in text file using Perl?

I need to create Perl code which allows counting paragraphs in text files. I tried this and doesn't work:
open(READFILE, "<$filename")
or die "could not open file \"$filename\":$!";
$paragraphs = 0;
my($c);
while($c = getc(READFILE))
{
if($C ne"\n")
{
$paragraphs++;
}
}
close(READFILE);
print("Paragraphs: $paragraphs\n");
See perlfaq5: How can I read in a file by paragraphs?
local $/ = ''; # enable paragraph mode
open my $fh, '<', $file or die "can't open $file: $!";
1 while <$fh>;
my $count = $.;
Have a look at the Beginning Perl book at http://www.perl.org/books/beginning-perl/. In particular, the following chapter will help you: http://docs.google.com/viewer?url=http%3A%2F%2Fblob.perl.org%2Fbooks%2Fbeginning-perl%2F3145_Chap06.pdf
If you're determining paragraphs by a double-newline ("\n\n") then this will do it:
open READFILE, "<$filename"
or die "cannot open file `$filename' for reading: $!";
my #paragraphs;
{local $/; #paragraphs = split "\n\n", <READFILE>} # slurp-split
my $num_paragraphs = scalar #paragraphs;
__END__
Otherwise, just change the "\n\n" in the code to use your own paragraph separator. It may even be a good idea to use the pattern \n{2,}, just in case someone went crazy on the enter key.
If you are worried about memory consumption, then you may want to do something like this (sorry for the hard-to-read code):
my $num_paragraphs;
{local $/; $num_paragraphs = #{[ <READFILE> =~ /\n\n/g ]} + 1}
Although, if you want to keep using your own code, you can change if($C ne"\n") to if($c eq "\n").

How do I read the contents of a small text file into a scalar in Perl?

I have a small text file that I'd like to read into a scalar variable exactly as it is in the file (preserving line separators and other whitespace).
The equivalent in Python would be something like
buffer = ""
try:
file = open("fileName", 'rU')
try:
buffer += file.read()
finally:
file.close()
except IOError:
buffer += "The file could not be opened."
This is for simply redisplaying the contents of the file on a web page, which is why my error message is going into my file buffer.
From the Perl Cookbook:
my $filename = 'file.txt';
open( FILE, '<', $filename ) or die 'Could not open file: ' . $!;
undef $/;
my $whole_file = <FILE>;
I would localize the changes though:
my $whole_file = '';
{
local $/;
$whole_file = <FILE>;
}
As an alternative to what Alex said, you can install the File::Slurp module (cpan -i File::Slurp from the command line) and use this:
use File::Slurp;
# Read data into a variable
my $buffer = read_file("fileName");
# or read data into an array
my #buffer = read_file("fileName");
Note that this dies (well... croaks, but that's just the proper way to call die from a module) on errors, so you may need to run this in an eval block to catch any errors.
If I don't have Slurp or Perl6::Slurp near by then I normally go with....
open my $fh, '<', 'file.txt' or die $!;
my $whole_file = do { local $/; <$fh> };
There is a discussion of the various ways to read a file here.
I don't have enough reputation to comment, so I apologize for making this another post.
# Harold Bamford: $/ should not be an obscure variable to a Perl programmer. A beginner may not know it, but he or she should learn it. The join method is a poor choice for the reasons stated in the article linked by hackingwords above. Here's the relevant quotation from the article:
That needlessly splits the input file into lines (join provides a list context to ) and then joins up those lines again. The original coder of this idiom obviously never read perlvar and learned how to use $/ to allow scalar slurping.
You could do something like:
$data_file="somefile.txt";
open(DAT, $data_file);
#file_data = <DAT>;
close(DAT);
That'll give you the file contents in an array, that you can use for whatever you want, for example, if you wanted each individual line, you could do something like:
foreach $LINE (#file_data)
{
dosomethingwithline($LINE);
}
For a full usage example:
my $result;
$data_file = "somefile.txt";
my $opened = open(DAT, $data_file);
if (!$opened)
{
$result = "Error.";
}
else
{
#lines = <DAT>;
foreach $LINE (#lines)
{
$result .= $LINE;
}
close(DAT);
}
Then you can use $result however you need. Note: This code is untested, but it should give you an idea.
I'd tweak draegtun's answer like this, to make it do exactly what was being asked:
my $buffer;
if ( open my $fh, '<', 'fileName' ) {
$buffer = do { local $/; <$fh> };
close $fh;
} else {
$buffer = 'The file could not be opened.';
}
Just join all lines together into a string:
open(F, $file) or die $!;
my $content = join("", <F>);
close F;
(It was previously suggested to use join "\n" but that will add extra newlines. Each line already has a newline at its end when it's read.)