I am writing a small Perl program that checks for #start and #end patterns. The goal is to create a separate file containing the lines between each pair of start and end patterns. I am able to do this with the script below.
#!/usr/bin/perl
open(INFILE,"<","testcases") || die "Can't open file: $!";
my $binary;
my $tccounter=1;
while(<INFILE>)
{
if(/^#start/i)
{
open(OUTFILE,">",$tccounter."_case.sh") || die "Can't open file: $!";
print "start of the script\n";
next;
}
elsif(/^#end/i)
{
################################
# Want to replace the previously
# written line here with some
# addtional customized lines
################################
close(OUTFILE);
$tccounter++;
print "End of the script\n";
print "last line for this testcase is \n $binary\n";
next;
}
else
{
$binary=$_ unless(/^\s*$/);
print OUTFILE $_;
}
}
What I additionally need is to identify the last line written to each file and replace that line with some custom data.
For example, in my case the last line of every file is execute.
I want to replace the line "execute" in all the output files.
In the current output files the last line is:
execute
The expected last lines of the output files are:
preline
execute
postline
Input file (testcases):
#start
line1
line 2
execute
#end
#start
line3
line 4
execute
#end
#start
line5
line 6
execute
#end
#start
line7
line 8
execute
#end
I suggest that you buffer your output.
If you push each line onto an array instead of printing it, then once the #end tag is seen it is simple to locate the last non-blank line in the array and replace it.
The output file can then be opened and the contents of the array printed to it.
Here's an untested example:
use strict;
use warnings 'all';
open my $fh, "<", "testcases" or die "Can't open input file: $!";
my $n;
my $i;
my $print;
my @buff;
while ( <$fh> ) {
    if ( /^#start/i ) {
        @buff = ();
        $i = undef;
        $print = 1;
        print "start of the script\n";
    }
    elsif ( /^#end/i ) {
        my $file = ++$n . "_case.sh";
        $print = 0;
        unless ( defined $i ) {
            warn "No data found in block $n";
            next;
        }
        splice @buff, $i, 1, "preline\n", $buff[$i], "postline\n";
        open my $fh, ">", $file or die qq{Can't open "$file" for output: $!};
        print $fh @buff;
        close $fh;
        print "End of the script\n";
    }
    elsif ( $print ) {
        push @buff, $_;
        $i = $#buff if /\S/;
    }
}
I think Borodin's answer is the way to go (I'm just not able to comment yet).
So the general algorithm is:
collect full record, from start marker to end marker
once end marker is reached, process record content. In your case:
find last non-empty line and surround it with others
print found line
write out file for record
repeat as needed
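The steps above can be sketched as a small helper. This is only a minimal sketch, not the code from either answer: the name wrap_last_nonblank and its arguments are made up for illustration, and it works on a copy of the record rather than on the buffer itself.

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of one record, return a copy in which
# the last non-blank line is surrounded by the given pre and post lines.
sub wrap_last_nonblank {
    my ($lines, $pre, $post) = @_;
    my @out = @$lines;

    # Walk backwards to find the index of the last line with a non-space character.
    my $idx;
    for my $i (reverse 0 .. $#out) {
        if ($out[$i] =~ /\S/) {
            $idx = $i;
            last;
        }
    }
    return @out unless defined $idx;    # only blank lines: leave unchanged

    my $line = $out[$idx];              # copy before modifying the array
    splice @out, $idx, 1, $pre, $line, $post;
    return @out;
}

my @record = ("line1\n", "line 2\n", "execute\n", "\n");
print wrap_last_nonblank(\@record, "preline\n", "postline\n");
```

Trailing blank lines stay where they are, so the wrapped line remains the last non-blank line of the output.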
I couldn't resist and rewrote Borodin's solution using the flip-flop operator:
use strict;
use warnings;
open(my $in,'<','in.file') || die "Can't open file: $!";
my ($cnt,@rec);
while( <$in> ) {
    push(@rec,$_) if /^#start/i .. /^#end/i; # collect record lines (using flip-flop operator)
    if( /^#end/i ) { # end of record reached?
        if( @rec <= 2 ) { @rec=(); next } # ignore empty records (and reset the buffer)
        # determine index of last nonempty line
        my ($lci) = grep {$rec[$_]=~/\S/} reverse (1..$#rec-1); # ...except markers
        printf "last line for this testcase is \n%s\n", # print the line that was found
            splice @rec, $lci, 1, ("preline\n",$rec[$lci],"postline\n"); # surround with pre&post
        # write out result
        open(my $out,'>',++$cnt.'_case.sh') || die "Can't open file: $!";
        $out->print(@rec[1..$#rec-1]); # ...except markers
        $out->close;
        @rec=(); # empty record for next use
    }
}
I have a .conf file that lists the tests to execute, one test per line, each with its own set of scenarios.
It looks like this:
scenario1,scenario2
scenario1,scenario2,scenario3
scenario1
In my code I open the file:
sub get_tests {
    my $nb_tests = 0;
    my @length_tests;
    my @lists_scenarios;
    my @current_list;
    my $current_length;
    # Open the conf file with all the tests to execute
    my $filename = $folder_lists_scenarios.$scenario_list.".conf";
    open(my $fh, '<:encoding(UTF-8)', $filename) or die $!;
    # open my $fh, "<", $folder_lists_scenarios.$scenario_list.".conf" or die $!;
    # Get all the scenarios
    while (my $row = <$fh>) {
        chomp $row; # delete carrier return
        $nb_tests++; # increment number of tests
        @current_list = split(/,/, $row); # separate the test into scenarios
        $current_length = @current_list; # get the number of scenarios in the test
        push @length_tests, $current_length; # store the number of scenarios
        push @lists_scenarios, [@current_list]; # store the list of scenarios of the test
    }
    # Close the file
    close $fh;
    return ($nb_tests, \@length_tests, \@lists_scenarios);
}
My problem is that I use these strings to open files that are named after them:
sub open_txt {
    # Open a txt file and return then content in an array
    my $filename = "folder_with_scenarios/".$_[0]."/content.txt";
    my @lines;
    my $temp;
    open(my $fh, '<:encoding(UTF-8)', $filename) or die "Could not open file '$filename'. Please check if the file name is correct or in the good repertory.";
    my $cnt_line = 0;
    while (my $row = <$fh>) {
        chomp $row;
        $cnt_line++;
        if ( length($row) > 1 ) { # if the line is not empty
            $temp = length($row);
            push @lines, $row;
        }
    }
    # If the file is not empty
    if ($cnt_line > 0) {
        return ($cnt_line, @lines);
    # If the file is empty
    } else {
        die "[ERROR] The file $filename is empty\n";
    }
}
When I do this, the first scenarios on a line work well, but the last scenario on the line causes an error:
Uncaught exception from user code:
/content.txt'. Please check if the file name is correct or in the good repertory. at ./my_code.pl line 2219.
main::open_txt('folder_with_scenarios/scenario2\x{d}/content') called at ./my_code.pl line 2588
Apparently I have a \x{d} at the end of my string and I don't know how to get rid of it.
Any ideas?
Thanks,
SLP
The value you are passing to open_txt ends with a Carriage Return.
You presumably read the value from a file that had Windows (CRLF) line endings on a non-Windows machine.
You presumably used chomp to remove the Line Feed, but left the Carriage Return in place.
If so, replace
chomp;
with
s/\s+\z//;
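To see the difference in isolation, here is a minimal sketch. The sample string is made up; it stands for one line of a CRLF file as read on a Unix-like system with the default $/.

```perl
use strict;
use warnings;

# One line as it would be read from a CRLF file on a Unix-like system.
my $row = "scenario1,scenario2\r\n";

my $chomped = $row;
chomp $chomped;                  # removes the "\n" only (default $/)

my $stripped = $row;
$stripped =~ s/\s+\z//;          # removes all trailing whitespace, "\r" included

print "after chomp, CR remains: ", ($chomped  =~ /\r\z/ ? "yes" : "no"), "\n";
print "after s///, CR remains:  ", ($stripped =~ /\r\z/ ? "yes" : "no"), "\n";

# The last field of the split is now clean and safe to use in a file name.
my @scenarios = split /,/, $stripped;
print "last scenario: '$scenarios[-1]'\n";
```

In the question's get_tests the line is held in $row, so the substitution would be applied as $row =~ s/\s+\z//; in place of chomp $row;.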
I am trying both to learn Perl and to use it in my research. I need to do a simple task: counting the sequences in a file and reporting their lengths, for a file such as the following:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
This is the code I have written which is very crude and simple:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = @ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>/)
{
my $number_of_sequences++;
}else{
my length = length ($input);
}
}
print length, number_of_sequences;
close (INFILE);
I'd be grateful if you could give me some hints. For example, in the else block, when I use the length function, I am not sure what argument I should pass to it.
Thanks in advance
You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = @ARGV;
my ($lastSeq, $number_of_sequences) = ('', 0);
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>(.+)/)
{
$lastSeq = $1;
$number_of_sequences++;
}
else
{
my $length = length($_);
print "$lastSeq $length\n";
}
}
print "Total number of sequences = $number_of_sequences\n";
close (INFILE);
Since you have indicated that you want feedback on your program, here goes:
my ($input, $output) = @ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:
perl program.pl input.txt > output.txt
In this part:
my $number_of_sequences++;
you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.
In this part:
my length = length ($input);
you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:
my $length = length(<INFILE>);
Although this will also include the newline in the length.
Here you have forgotten the sigils again:
print length, number_of_sequences;
And of course, this will not create the expected output. It will print something like sequence112.
Recommendations:
Use a while (<>) loop to read your input. This is the idiomatic method to use.
You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ). This will read a line, count the letters A, T, C and G, return the count and discard the line that was read.
Summary:
use strict;
use warnings; # always use these two pragmas
my $count;
while (<>) {
next unless /^>/; # ignore non-header lines
$count++; # increment counter
chomp;
my $length = (<> =~ tr/ATCG//); # get length of next line
s/^>(\S+)/$1 $length\n/; # remove > and insert length
} continue {
print; # print to STDOUT
}
print "Total number of sequences = $count\n";
Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.
And as I said above, you can redirect this to a file if you want.
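The tr/// counting idiom used above can be tried on its own. This is a minimal sketch; count_bases is a made-up name, and it takes a string argument instead of reading from <>.

```perl
use strict;
use warnings;

# Hypothetical helper: count only the characters A, T, C and G in a string.
# With an empty replacement list and no /d flag, tr/// counts the matching
# characters and leaves the string unmodified.
sub count_bases {
    my ($seq) = @_;
    return ($seq =~ tr/ATCG//);
}

# The trailing newline is not in the search list, so it is not counted.
print count_bases("ATCGATCGATCG\n"), "\n";
```

Because only the listed characters are counted, no chomp is needed before measuring the sequence.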
For starters, you need to change your inner loop to this:
...
chomp;
if (/^>/)
{
$number_of_sequences++;
$sequence_name = $_;
}else{
print "$sequence_name ", length($_), "\n"; # length of the current line, not of the file name in $input
}
...
Note the following:
The my declaration has been removed from $number_of_sequences
The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.
To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:
my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";
The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.
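The scoping rule described above can be demonstrated with a short sketch. The sample lines and the variable names $outer_count, $inner_count and $inner_total are made up for illustration.

```perl
use strict;
use warnings;

my @lines = (">sequence1", "ATCG", ">sequence2", "AAAATTTT");

my $outer_count = 0;                 # declared once, outside the loop
my $inner_total = 0;
for my $line (@lines) {
    next unless $line =~ /^>/;
    my $inner_count = 0;             # a fresh variable on every iteration
    $inner_count++;
    $outer_count++;
    $inner_total = $inner_count;     # always 1: the previous value is gone
}

print "outer: $outer_count, inner: $inner_total\n";
```

The counter declared outside the loop ends at 2, while the one declared inside the loop never gets past 1.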
#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $length, $tag);
my $count = 0; # initialised so that the first comparison does not warn
$file = $ARGV[0];
open (FILE, '<', $file) or die "can't open file $file: $!\n";
while (<FILE>){
$line=$_;
chomp $line;
if ($line=~/^>/){
$tag = $line;
}
else{
$length = length ($line);
$count=1;
}
if ($count==1){
print "$tag\t$length\n";
$count=0
}
}
close FILE;
I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file which contains a string, using another file that holds the start and the end of each substring I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
My script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my @cline = <INFILE>;
foreach my $cline (@cline) {
print "$cline\n";
my @data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
When I run it, it seems to do only one iteration, and it prints the start of the first file.
It looks like the foreach loop doesn't work.
substr doesn't seem to work either.
When I add an exit after printing the cline to check the loop, it prints all the lines of the coordinates file.
I am sorry if I am becoming annoying, but I must finish this and I am a little desperate...
Thank you again.
This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your @cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\@cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
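If slurp mode is wanted for the first file only, one option is to confine local $/ to a bare block of its own. The following is a minimal sketch under that assumption; the in-memory filehandles and sample data are made up so that the example is self-contained.

```perl
use strict;
use warnings;

# In-memory filehandles stand in for the two input files (sample data is made up).
my $fasta_text  = ">scaffold30 24194\nCTTAGC\nAGCAGC\n";
my $coords_text = "1\t3\n2\t5\n";
open my $fasta_fh,  '<', \$fasta_text  or die "Can't open in-memory file: $!";
open my $coords_fh, '<', \$coords_text or die "Can't open in-memory file: $!";

my $header = <$fasta_fh>;        # read the ">..." line normally
my $rest;
{
    local $/;                    # slurp mode, but only inside this block
    $rest = <$fasta_fh>;         # everything from here to EOF
}                                # $/ is restored when the block ends

my @cline = <$coords_fh>;        # read line by line again
print scalar(@cline), " coordinate lines\n";
```

With the block closed before the second file is read, @cline gets one element per line instead of a single element.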
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my @cline = <INFILE>;
# Since @cline only has one element, the loop only has one iteration
foreach my $cline (@cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my @cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}
As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
I am fairly new to Perl scripting and need some help. Below is my query:
I have a file which has contents like below:
AA ABC 0 0
line1
line2
...
AA XYZ 1 1
line..
line..
AA GHI 2 2
line..
line...
Now I would like to get all the lines between the lines that start with the string/pattern "AA" and write them to the files ABC.txt, XYZ.txt and GHI.txt respectively, including the AA* line itself. For example, ABC.txt should look like
AA ABC 0 0
line1
line2...
and XYZ.txt should look like
AA XYZ 1 1
line..
line..
I hope the question is clear; any help with this is much appreciated.
Thanks,
Sandy
I presume you're asking for an algorithm since you didn't specify what you needed help with.
Declare a file handle for use for output.
While you haven't reached the end of the input file,
Read a line.
If it's a header line,
Parse it.
Determine file name.
(Re)open the output file.
Print the line to the output file handle.
Lest you be tempted to use one of the poor solutions that have been posted since I posted the above, here's the code:
my $fh;
while (<>) {
if (my ($fn) = /^AA\s+(\S+)/) {
$fn .= '.txt';
open($fh, '>', $fn)
or die("Can't create file \"$fn\": $!\n");
}
print $fh $_;
}
Possible improvements, all of which are easy to add:
Check for duplicate headers. (if -e $fn is one way)
Check for data before the first header. (if !$fh is one way)
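A minimal sketch of how the two improvements might be added. Assumptions: the name split_records, the $dir argument and the sample data are made up, and duplicates are detected with a %seen hash rather than the -e file test mentioned above, so only headers repeated within the same input are caught.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: split records from $in_fh into "<name>.txt" files
# under $dir, refusing duplicate headers and data before the first header.
sub split_records {
    my ($in_fh, $dir) = @_;
    my ($out_fh, %seen, @files);
    while (my $line = <$in_fh>) {
        if (my ($name) = $line =~ /^AA\s+(\S+)/) {
            die "Duplicate header \"$name\"\n" if $seen{$name}++;
            my $fn = "$dir/$name.txt";
            open($out_fh, '>', $fn)          # reopening closes the previous file
                or die("Can't create file \"$fn\": $!\n");
            push @files, $fn;
        }
        die "Data found before the first header\n" if !$out_fh;
        print $out_fh $line;
    }
    close($out_fh) if $out_fh;
    return @files;
}

my $data = "AA ABC 0 0\nline1\nline2\nAA XYZ 1 1\nline3\n";
open my $in, '<', \$data or die "Can't open in-memory file: $!";
my $dir = tempdir(CLEANUP => 1);             # scratch directory, removed at exit
print "$_\n" for split_records($in, $dir);
```

The sketch writes into a temporary directory only so that it can be run without leaving files behind; the original code writes to the current directory.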
You just need to keep one file open at a time... When a line matches XYZ, then you open your XYZ.txt file and output the line. You keep that file open (let's just say it's the handle CURRENT_FILE) and output each successive line to it until you match a new header line. Then you close the current file and open another one.
My Perl is extremely rusty, so I don't think I can provide code that compiles, but essentially it's something close to this.
my $current_name = "";
foreach my $line (<INPUT>)
{
    # Only header lines carry a name; other lines go to the file already open
    if( my($name) = $line =~ /^AA (\w+)/ ) {
        if( $name ne $current_name ) {
            close(CURRENT_FILE) if $current_name ne "";
            open(CURRENT_FILE, ">>", "$name.txt") || die "Argh\n";
            $current_name = $name;
        }
    }
    next if $current_name eq "";
    print CURRENT_FILE $line;
}
close(CURRENT_FILE) if $current_name ne "";
What do you think about this one?
1: Get the contents of the file (maybe using File::Slurp's read_file) and save them to a scalar.
use File::Slurp qw(read_file write_file);
my $contents = read_file($filename);
2: Use a regex pattern match similar to this, capturing each record from its AA header line up to the next header line (or the end of the file):
my @file_rows = ($contents =~ /(^AA\s[A-Z]{3}\s+\d+\s+\d+.*?)(?=^AA\s|\z)/gms);
3: If column 2 values are always unique throughout the file:
foreach my $file_row (@file_rows) {
    my @values = split(' ', $file_row, 3);
    write_file($values[1] . ".txt", $file_row);
}
3: Otherwise: Split the row values. Store them in a hash using the second column as the key. Write the data to the output files using the hash.
my %hash;
foreach my $file_row (@file_rows) {
    my @values = split(' ', $file_row, 3);
    if (defined $hash{$values[1]}) {
        $hash{$values[1]} .= $file_row;
    } else {
        $hash{$values[1]} = $file_row;
    }
}
foreach my $key (keys %hash) {
    write_file($key . '.txt', $hash{$key});
}
Here's an option that looks for the pattern matching the start of each record. When found, it loops through the data file's lines and builds a record until it finds the same pattern again or eof, then that record is written to a file. It does not check to see if the file already exists before writing to it, so it will replace ABC.txt if it already exists:
use strict;
use warnings;
my $dataFile = 'data.txt';
my $nextLine = '';
my $recordRegex = qr/^AA\s+(\S+)\s+\d+\s+\d+/;
open my $inFH, '<', $dataFile or die $!;
RECORD: while ( my $line = <$inFH> ) {
my $record = $nextLine . $line;
if ( $record =~ $recordRegex ) {
my $fileName = $1 . '.txt';
while ( $nextLine = <$inFH> ) {
if ( $nextLine =~ $recordRegex or eof $inFH ) {
$record .= $nextLine if eof $inFH;
open my $outFH, '>', $fileName or die $!;
print $outFH $record;
close $outFH;
next RECORD;
}
$record .= $nextLine;
}
}
}
close $inFH;
Hope this helps!
Edit: This code replaces the original that was problematic. Thank you, amon, for reviewing the original code.
I have one text file like this as the query file:
fooLONGcite
GetmoreDATA
stringMATCH
GOODthing
and another text file like this as the subject file:
sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
brotherGETDATA
CITEMORETHING
TOOLONGSTUFFETC
The expected result is to find the lines of the subject file that contain a query string and print them out. So the output should be:
sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
Here is my Perl script, but it doesn't work. Can you help me find where the problem is? Thanks.
#!/usr/bin/perl
use strict;
# to check the command line option
if($#ARGV<0){
printf("Usage: \n <tag> <seq> <outfile>\n");
exit 1;
}
# to open the given infile file
open(tag, $ARGV[0]) or die "Cannot open the file $ARGV[0]";
open(seq, $ARGV[1]) or die "Cannot open the file $ARGV[1]";
my %seqhash = ();
my $tag_id;
my $tag_seq;
my $seq_id;
my $seq_seq;
my $seq;
my $i = 0;
print "Processing cds seq\n";
#check the seq file
while(<seq>){
my @line = split;
if($i != 0){
$seqhash{$seq_seq} = $seq;
$seq = "";
print "$seq_seq\n";
}
$seq_seq = $line[0];
$i++;
}
while(<tag>){
my @tagline = split;
$tag_seq = $tagline[0];
$seq = $seqhash{$seq_seq};
#print "$tag_seq\n";
print "$seq\n";
#print output ">$id\n$seq\n";
}
#print "Ending of Processing gff\n";
close(tag);
close(seq);
As I understand it, you are looking for a match on part of the string, not an exact one. Here is a script that does what I think you are looking for.
Content of script.pl. I assume that the query file is small, because I add all of its content to a regex:
use warnings;
use strict;
## Check arguments.
die qq[Usage: perl $0 <query_file> <subject_file>\n] unless @ARGV == 2;
## Open input files. Abort if found errors.
open my $fh_query, qq[<], shift @ARGV or die qq[Cannot open input file: $!\n];
open my $fh_subject, qq[<], shift @ARGV or die qq[Cannot open input file: $!\n];
## Variable to save a regex with alternations of the content of the 'query' file.
my $query_regex;
{
## Read content of the 'query' file in slurp mode.
local $/ = undef;
my $query_content = <$fh_query>;
## Remove trailing spaces and generate a regex.
$query_content =~ s/\s+\Z//;
$query_content =~ s/\n/|/g;
$query_regex = qr/(?i:($query_content))/;
}
## Read 'subject' file and for each line compare if that line matches with
## any word of the 'query' file and print in success.
while ( <$fh_subject> ) {
if ( m/$query_regex/o ) {
print
}
}
Run the script:
perl script.pl query.txt subject.txt
And result:
sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
Your current code doesn't make a lot of sense; you're even referencing variables you don't assign anything to.
All you need to do is read the first file into a hash, then check each line of the second against that hash.
while (my $line = <FILE>)
{
chomp($line);
$hash{$line} = 1;
}
...
while (my $line = <FILE2>)
{
chomp($line);
if (defined $hash{$line})
{
print "$line\n";
}
}
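Note that a hash lookup like the one above only finds lines that are exactly equal to a query string. If substring matches are wanted, as the expected output in the question suggests, a regex built from the query strings is one option. The following is a minimal sketch under that assumption, with the question's sample data inlined in place of the two files.

```perl
use strict;
use warnings;

# Sample data copied from the question.
my @queries  = qw(fooLONGcite GetmoreDATA stringMATCH GOODthing);
my @subjects = qw(sometingfooLONGcite anyotherfooLONGcite matchGetmoreDATA
                  GETGOODthing brotherGETDATA CITEMORETHING TOOLONGSTUFFETC);

# Build one alternation from the query strings; quotemeta escapes any
# characters that would otherwise be special in a regex.
my $alternation = join '|', map { quotemeta($_) } @queries;
my $query_regex = qr/$alternation/;

# Keep every subject line that contains at least one query string.
my @matched = grep { /$query_regex/ } @subjects;
print "$_\n" for @matched;
```

The match here is case-sensitive; adding the /i flag to qr// would make it case-insensitive.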