how to extract substrings by knowing the coordinates

how to extract substrings by knowing the coordinates - perl

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file whick contains string by using another file with the begin and the end of each substring that I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
my script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my #cline = <INFILE>;
foreach my $cline (#cline) {
print "$cline\n";
my#data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
when I run it, it looks like it does only one iteration and it prints me the start of the first file.
It seems like the foreach loop doesn't work.
also substr seems that doesn't work.
when I put an exit to print the cline to check the loop, it prints all the lines of the file with the coordinates.
I am sorry if I become annoying, but I must finish it and I am a little bit desperate...
Thank you again.

This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your #cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\#cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my #cline = <INFILE>;
# Since #cline only has one element, the loop only has one iteration
foreach my $cline (#cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my #cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}

As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";

Related

Perl script to subset multiple DNA sequences

I have a FASTA file of ~500 DNA sequences, each of which has target position for a Single-Neucleotide Polymorphism (SNP) of interest that is known to me.
For each entry in the file, I have a separate tab-delimited text file that has on each line
The FASTA sequence name
The start position
The end position
The SNP position
The sequences and the positions in the text file are in the same order.
The dummy FASTA file is:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC
and the dummy position file is
AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
This is the script I have written and tried
use warnings;
use strict;
# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited
text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
# Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = ();
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my #fields = <POS>;
while (my $seq = <FASTA>){
chomp $seq;
my #seq = split(/\n/,$seq);
if($seq[0] =~ /^>/){
$seq1 = $seq[0];
}elsif($seq[0] =~ /[^>]/){ #matches any character except the >
$seq1 = ">".$seq[0];
}
for my $pos (#fields){
chomp $pos;
my #field = split(/\t/,$pos);
open(OUTFILE,">>$out");
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;
This is what I get out, which is what I sort of want:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
AGGAGCTCAC
However, I need to also print out the SNP position (column 4) after the sequence name, e.g.,
>AOS-94_S25_L002_R1_001_trimmed_contig_767
pos=10
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
pos=11
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
pos=18
AGGAGCTCAC
I tried inserting print OUTFILE "pos= $field[3]\n";after print OUTFILE "$seq1"; and I get the following:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
AGGAGCTCAC
Obviously I have messed up my loops, and probably some chomp commands.
For instance, when I print "$seq1" to a file, why does it not need a "\n" included in the printed string? There must already be a hard return in the string?
I know I am missing some basics of how this is structured, but I so far can't figure out how to fix my mistakes. Can anyone provide any suggestions?
Update
Perl code reformatted for legibility
use warnings;
use strict;
# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited
text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out = <STDIN>;
chomp $out;
local $/ = "\n>"; # Read by FASTA record
my $seq1 = ();
open( FASTA, $fasta ) || die "\n Unable to open the file!\n";
open( POS, $text ) || die "\n Unable to open the file! \n";
my #fields = <POS>;
while ( my $seq = <FASTA> ) {
chomp $seq;
my #seq = split( /\n/, $seq );
if ( $seq[0] =~ /^>/ ) {
$seq1 = $seq[0];
}
elsif ( $seq[0] =~ /[^>]/ ) { # matches any character except the >
$seq1 = ">" . $seq[0];
}
for my $pos ( #fields ) {
chomp $pos;
my #field = split( /\t/, $pos );
open( OUTFILE, ">>$out" );
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] - 1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;

There are many problems with your code
Your comments don't correspond to the code. For instance, you have Read in the complete FASTA file when the code just accepts the file name from STDIN and trims it. It is usually best to write clean code with well-chosen identifiers; that way the program explains itself
You are using the two-parameter form of open and global file handles. You also don't have the reason for failure in the die string, and you have a newline at the end, which will prevent Perl from giving you the source file name and line number where the error occurred
Something like
open( FASTA, $fasta ) || die "\n Unable to open the file!\n"
should be
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!}
and
open( OUTFILE, ">>$out" );
should be
open my $out_fh, '>>', $output_file or die qq{Unable to open "$output_file" for appending: $!}
You should avoid putting quotes around variable names.
print OUTFILE "$seq1"
should be
print OUTFILE $seq1
You set the input record separator to "\n>". That means that every time you call <FASTA> Perl will read up to the next occurrence of that string. It also means that chomp will remove exactly that string from the end of the line, if it is there
The biggest problem is that you never reset $/ before reading from POS. Remember that its setting affects every readline (or <>) and every chomp. And because your $text file probably contains no > characters at the start of a line, you will read the entire file in one go
That is why you are seeing newlines in your output without asking for them. You have read the whole file, together with all embedded newlines, and chomp is useless here because you have modified the string that it removes
local is named that way for a reason. It alters the value temporarily and locally to the current scope. But your "current scope" is the entirety of the rest of the file and you are reading both files with the modified terminator
Use some braces { ... } to limit the scope of the local modification. Alternatively, because file handles in more recent versions of Perl behave as IO::Handle objects, you can write
$fasta_fh->input_record_separator("\n>")
and the change will apply only to that file handle, and there is no need to localise $/ at all
Here's an amended version of your program which also addresses some poor choices of identifier as well as a few other things. Please note that this code is untested. I am working on the train at present and can only check what I'm writing mentally
Note that things like while ( <$fasta_fh> ) and for ( #pos_records ) use the default variable $_ when no loop variable is specified. Likewise, operators like chomp and split will apply to $_ when the corresponding parameter is missing. That way there is never any need to mention any variable explicitly, and it leads to more concise and readable code. $_ is equivalent to it in the English language
I encourage you to understand what the things you're writing actually do. It is becoming common practice to copy code from one part of the internet and offer it to some kind souls elsewhere to get it working for you. That isn't "learning to program", and you will not understand anything unless you study the language and put your mind to it
And please take more care with laying out your code. I hope you can see that the edit I made to your question, and the code in my solution, is more more comfortable to read than the program that you posted? While you're welcome to make your own job as awkward as you like, it's unfair and impolite to offer a mess like that to a world of total strangers whom you're asking for free programming help. A nice middle line is to alter your editor to use an indent of four spaces when the tab key is pressed. Never use tab characters in source code!
use strict;
use warnings 'all';
print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );
print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );
print "Name of the output file: ";
chomp( my $out_file = <STDIN> );
open my $out_fh, '>', $out_file die qq{Unable to open "$out_file" for output: $!};
my #pos_records = do {
open $pos_, '<', $pos_file or die qq{Unable to open "$pos_file" for input: $!};
<$pos_fh>;
};
chomp #pos_records; # Remove all newlines
{
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};
local $/ = "\n>"; # Reading FASTA format now
while ( <$fasta_fh> ) {
chomp; # Remove "">\n" from the end
my ( $header, $seq ) = split /\n/; # Separate the two lines
$header =~ s/^>?/>/; # Replace any chomped >
for ( #pos_records ) {
my ( $name, $beg, $end, $pos ) = split /\t/;
my $subseq = substr $seq, $beg-1, $end-$beg;
print $out_fh "$header\n";
print $out_fh "pos=$pos\n";
print $out_fh "$subseq\n";
}
}
} # local $/ expires here
close $out_fh or die $!;

Okay, with a couple very minor edits, your code worked perfectly. This is the solution that worked for me:
#!/usr/bin/perl
use strict;
use warnings;
print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );
print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );
print "Name of the output file: ";
chomp( my $out_file = <STDIN> );
open my $out_fh, '>', $out_file or die qq{Unable to open "out_file" for output: $!};
my #pos_records = do {
open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
<$pos_>;
};
chomp #pos_records; #remove all newlines
{
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};
local $/ = "\n>"; #Reading FASTA format now
for ( <$fasta_fh> ) {
chomp; #Remove ">\n" from the end
my ( $header, $seq) = split /\n/; #separate the two lines
$header = ">$header" unless $header =~ /^>/; # Replace any chomped >
for ( #pos_records ) {
my ($name,$beg,$end,$pos) = split /\t/;
my $subseq = substr $seq, $beg-1, $end-$beg;
my $final_SNP = $end - $pos;
if($header =~ /$name/){
print $out_fh "$header\n";
print $out_fh "pos=$final_SNP\n";
print $out_fh "$subseq\n";
}
}
}
} #local expires here
close $out_fh or die $!;
The only substantive thing I changed was the addition of an if statement. Without that, each fasta sequence was being written three times, each one with one with one of the three SNP positions. I also slightly changed what I was doing to notate the SNP position, which after excising the sequence, was actually $end - $pos and not just $pos.
Again, I can't thank you enough, as it is obvious you spent a fair bit of time helping me. For what its worth, I sincerely appreciate it. Your solution will serve as a template for my future efforts (which will likely be similar manipulations of fasta files), and your explanations have helped me to better understand things like what local does in a way that my pea brain can comprehend.

print hashes with values from different files

I want to create output file that has values from file 1 and file 2.
The line from file 1:
chr1 Cufflinks exon 708356 708487 1000 - .
gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "5"; FPKM
"3.1300591420"; frac "1.000000"; conf_lo "2.502470"; conf_hi
"3.757648"; cov "7.589085"; chr1Cufflinks exon 708356
708487 . - . gene_id "XLOC_001284"; transcript_id
"TCONS_00007667"; exon_number "7"; gene_name "LOC100288069"; oId
"CUFF.15.2"; nearest_ref "NR_033908"; class_code "j"; tss_id
"TSS2981";
The line from file 2:
CUFF.48557
chr4:160253850-160259462:160259621-160260265:160260507-160262715
The second column from this file is unique id (uniq_id).
I want to get output file in the following format:
transcript_id(CUFF_id) uniq_id gene_id(XLOC_ID) FPKM
My script takes XLOC_ID and FPKM values from first file and print them together with two columns from the second file.
#!/usr/bin/perl -w
use strict;
my $v_merge_gtf = shift #ARGV or die $!;
my $unique_gtf = shift #ARGV or die $!;
my %fpkm_hash;
my %xloc_hash;
open (FILE, "$v_merge_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my #array = split("\t", $line);
if ($array[2] eq 'exon') {
my $id = $array[8];
if ($id =~ /transcript_id \"(CUFF\S+)/) {
$id = $1;
$id =~ s/\"//g;
$id =~ s/;//;
}
my $fpkm = $array[8];
if ($fpkm =~ /FPKM \"(\S+)/) {
$fpkm = $1;
$fpkm =~ s/\"//g;
$fpkm =~ s/;//;
}
my $xloc = $array[17];
if ($xloc =~ /gene_id \"(XLOC\S+)/) {
$xloc = $1;
$xloc =~ s/\"//g;
$xloc =~ s/;//;
}
$fpkm_hash{$id} = $fpkm;
$xloc_hash{$id} = $xloc;
}
}
}
close FILE;
open (FILE, "$unique_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my #array = split("\t", $line);
my $id = $array[0];
my $uniq = $array[1];
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
}
}
close FILE;
I initialized hashes outside of the files, but I get the following error for each CUFF values:
CUFF.24093
chr17:3533641-3539345:3527526-3533498:3526786-3527341:3524707-3526632
Use of uninitialized value in concatenation (.) or string at ex_1.pl
line 55, line 9343.
Use of uninitialized value in concatenation (.) or string at ex_1.pl
line 55, line 9343.
How can I fix this issue?
Thank you!

I think the warning message is because the $id key, (CUFF.24093), you get on line 9343 of the second file isn't contained in the hashes you created in the first file.
Is it possible that an ID in the second file isn't contained in the first file? That seems to be the case here.
If so, and you just want to skip over this unknown ID, you could add a line to your program like:
my $id = $array[0];
my $uniq = $array[1];
next unless exists $fpkm_hash{$id}; # add this line
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
This will bypass the following print statement and go back to the top of the while loop and read in the next line and continue processing.
It depends on what action you want to take if you encounter an unknown ID.
Update: I thought I might make some observations/improvements to your code.
my $v_merge_gtf = shift #ARGV or die $!;
my $unique_gtf = shift #ARGV or die $!;
The error variable $! serves no purpose here (this is a fact I only recently discovered even after 14 years using Perl). $! is only set for system calls, (where you are involving the operating system).The most common are open and close for files, and opendir and closedir for directories. If an error occurs in opening/closing a file or a directory, $! will contain the error message. (See in my included code how I handled this - I created a message, $usage to be printed if the shift didn't succeed.
Instead of using 2 hashes to store the information, I used 1 hash,%data. The advantage is that it will use less memory, (because its only storing 1 set of keys instead of 2), Though, you could use the 2 if you like.
I used the recommended 3 argument (filehandle, mode, filename) form for opening the files. The 2 argument approach you used is outdated and less safe (for reasons I won't go into detail here). Also, the lexical filehandles I used, my $mrg and my $unique are the newer ways to create filehandles (instead of usingFILEfor your 2 opens).
You can directly assign to $linein your while loop like while (my $line = <FILE>) instead of the way you did it. In my sample program, I didn't assign to $line, but instead relied on the default variable $_. (It simplifies the 2 following statements, next unless /\S/; my #array = split /\t/;). I didn't chomp for the first file because you're only parsing inside the string and aren't using anything from the end of the string.chomp is necessary for the second while loop because the second variable my $uniq = ... would have a newline at its end if it wasn't removed by chomp.
I didn't know what you meant by this statement, if ($line =~ /[a-z]/). I am assuming you wanted to check for empty lines and only process lines with non-space data. That's why I wrote next unless /\S/;instead. (says to skip the following statements and got to the top of the while loop and read the next record).
Your first while loop worked because you had no errors in your input file. If there had errors, the way you wrote the code could have been a problem.
The statementmy $id = $array[8]; gives $id a value that would have been wrongly used if the following if statement had been false. (The same thing for the 2 other variables you want to capture,$fpkm and $xloc). You can see in my code example how I handled this.
In my code, I died if the match didn't succeed, You might not want todie but say match or next to try the next line of data. It depends on how you would want to handle a failed match.
And in this line$array[8] =~ /gene_id "(CUFF\S+)";/, Note that I put the ";following the captured data, so there is no need to remove it from the captured data (as you did in your substitutions)
Well, I know this is a long comment on your code, but I hope you get some good ideas about why I recommended the changes given.
or die "Could not find ID in $v_merge_gtf (line# $.)";
$. is the line number of the file being read.
#!/usr/bin/perl
use warnings;
use strict;
my $usage = "USAGE: perl $0 merge_gtf_file unique_gtf_file\n";
my $v_merge_gtf = shift #ARGV or die $usage;
my $unique_gtf = shift #ARGV or die $usage;
my %data;
open my $mrg, '<', $v_merge_gtf or die $!;
while (<$mrg>) {
next unless /\S/;
my #array = split /\t/;
if ($array[2] eq 'exon') {
$array[8] =~ /gene_id "(CUFF\S+)";/
or die "Could not find ID in $v_merge_gtf (line# $.)";
my $id = $1;
$array[8] =~ /FPKM "(\S+)";/
or die "Could not find FPKM in $v_merge_gtf (line# $.)";
my $fpkm = $1;
$array[17] =~ /gene_id "(XLOC\S+)";/
or die "Could not find XLOC in $v_merge_gtf (line# $.)";
my $xloc = $1;
$data{$id}{fpkm} = $fpkm;
$data{$id}{xloc} = $xloc;
}
}
close $mrg or die $!;
open my $unique, '<', $unique_gtf or die $!;
while (<$unique>) {
next unless /\S/;
chomp;
my ($id, $uniq) = split /\t/;
print join("\t", $id, $uniq, $data{$id}{fpkm}, $data{$id}{xloc}), "\n";
}
close $unique or die $!;

Calculate the length of a string in a specific file format with perl

I am trying to both learn perl and use it in my research. I need to do a simple task which is counting the number of sequences and their lengths in a file such as follow:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
This is the code I have written which is very crude and simple:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>/)
{
my $number_of_sequences++;
}else{
my length = length ($input);
}
}
print length, number_of_sequences;
close (INFILE);
I'd be grateful if you could give me some hints, for example, in the else block, when I use the length function, I am not sure what argument I should pass into it.
Thanks in advance

You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
my ($lastSeq, $number_of_sequences) = ('', 0);
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>(.+)/)
{
$lastSeq = $1;
$number_of_sequences++;
}
else
{
my $length = length($_);
print "$lastSeq $length\n";
}
}
print "Total number of sequences = $number_of_sequences\n";
close (INFILE);

Since you have indicated that you want feedback on your program, here goes:
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:
perl program.pl input.txt > output.txt
In this part:
my $number_of_sequences++;
you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.
In this part:
my length = length ($input);
you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:
my $length = length(<INFILE>);
Although this will also include the newline in the length.
Here you have forgotten the sigils again:
print length, number_of_sequences;
And of course, this will not create the expected output. It will print something like sequence112.
Recommendations:
Use a while (<>) loop to read your input. This is the idiomatic method to use.
You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ) This will read a line, count the letters ATGC, return the count and discard the read line.
Summary:
use strict;
use warnings; # always use these two pragmas
my $count;
while (<>) {
next unless /^>/; # ignore non-header lines
$count++; # increment counter
chomp;
my $length = (<> =~ tr/ATCG//); # get length of next line
s/^>(\S+)/$1 $length\n/; # remove > and insert length
} continue {
print; # print to STDOUT
}
print "Total number is sequences = $count\n";
Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.
And as I said above, you can redirect this to a file if you want.

For starters, you need to change your inner loop to this:
...
chomp;
if (/^>/)
{
$number_of_sequences++;
$sequence_name = $_;
}else{
print "$sequence_name ", length($input), "\n";
}
...
Note the following:
The my declaration has been removed from $number_of_sequences
The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.
To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:
my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";
The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.

#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $length, $tag, $count);
$file = $ARGV[0];
open (FILE, "$file") or print"can't open file $file\n";
while (<FILE>){
$line=$_;
chomp $line;
if ($line=~/^>/){
$tag = $line;
}
else{
$length = length ($line);
$count=1;
}
if ($count==1){
print "$tag\t$length\n";
$count=0
}
}
close FILE;

Printing array in Perl

I currently have my Perl script to read fstab files, split them up by column and search for which word in each column is the longest to display it. All that works peachy (I think), the problem I'm having is that it keeps printing out the same length for every line which is not true. Example $dev_parts prints 24, and $labe_parts prints 24 and so on...
below is my code.
#!/usr/bin/perl
use strict;
print "Enter file name: \n";
my $file_name = <STDIN>;
open(IN, "$file_name");
my #parts = split( /\s+/, $file_name);
foreach my $usr_file (<IN>) {
chomp($usr_file);
#parts = split( /\s+/, $usr_file);
push(#dev, $parts[0]);
push(#label, $parts[1]);
push(#tmpfs, $parts[2]);
push(#devpts, $parts[3]);
push(#sysfs, $parts[4]);
push(#proc, $parts[5]);
}
foreach $dev_parts (#dev) {
$dev_length1 = length ($parts[$dev_parts]);
if ( $dev_length1 > $dev_length2) {
$dev_length2 = $dev_length1;
}
}
print "The longest word in the first line is: $dev_length2 \n";
foreach $label_parts (#label) {
$label_length1 = length($parts[$label_parts]);
if ($label_length1 > $label_length2) {
$label_length2 = $label_length1;
}
}
print "The longest word in the first line is: $label_length2 \n";

This is how your code should be
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
print "Enter file name: \n";
my $file_name = <STDIN>;
chomp($file_name);
open(FILE, "$file_name") or die $!;
my %colhash;
while (<FILE>) {
my $col=0;
my #parts = split /\s+/;
map { my $len = length($_);
$col++;
if($colhash{$col} < $len ){
$colhash{$col} = $len; # store the longest word length for each column
}
} #parts;
}
print Dumper(\%colhash);

You have a mistake here:
foreach $dev_parts (#dev) {
$dev_length1 = length ($parts[$dev_parts]);
As I understand it, you are looking for the longest element in #dev. However, you take the length of an element from the #parts array. This array is always set to whatever the last line of the file is. So you are looking at each element in the last line of the file, rather than each element of the appropriate column.
You just need to take length($dev_parts) instead.
Incidentally, here is a simpler way to find the longest length in an array:
use List::Util qw/max/; #Core module, always available.
my $longest_dev = max map {length} #dev;
A few other comments on your code:
use strict; is good. You should also use warnings;. It will help
you catch silly mistakes in your code.
You ought to check for errors whenever you open a file:
open(IN, $file_name) or die "Failed to open $file_name: $!";
Better yet, use the preferred open syntax with a lexical filehandle:
open(my $in_file, '<', $file_name) or die "Failed to open $file_name: $!";
...
while (<$in_file>) {
I'm not sure what you are trying to do here:
my #parts = split( /\s+/, $file_name);
You are splitting the file name by white space, but you don't use that for anything. And then you re-use the same array to hold the lines later.
A while loop is preferred to foreach when you go through lines of a file. It saves memory because it doesn't read the whole file into memory first (and it is otherwise exactly the same).
while (my $usr_file = <IN>) {

Count the occurrences of a string in a file

I wrote a perl script to count the occurrences of a character in a file.
So far this is what I have got,
#!/usr/bin/perl -w
use warnings;
no warnings ('uninitialized', 'substr');
my $lines_ref;
my #lines;
my $count;
sub countModule()
{
my $file = "/test";
open my $fh, "<",$file or die "could not open $file: $!";
my #contents = $fh;
my #filtered = grep (/\// ,#contents);
return \#filtered;
}
#lines = countModule();
##lines = $lines_ref;
$count = #lines;
print "###########\n $count \n###########\n";
My test file looks like this:
10.0.0.1/24
192.168.10.0/24
172.16.30.1/24
I am basically trying to count the number of instances of "/"
This is the output that I get:
###########
1
###########
I am getting 1 instead of 3, which is the number of occurrences.
Still learning perl, so any help will be appreciated..Thank you!!

Here are a few points about your code
You should always use strict at the top of your program, and only use no warnings for special reasons in a limited scope. There is no general reason why a working Perl program should need to disable warnings globally
Declare your variables close to their first point of use. The style of declaring everything at the top of the file is unnecessary and is a legacy of C
Never use prototypes in your code. They are available for very special purposes and shouldn't be used for the vast majority of Perl code. sub countModule() { ... } insists that countModule may never be called with any parameters and isn't necessary or useful. The definition should be just sub countModule { ... }
A big well done! for using a lexical file handle, the three-parameter form of open, and putting $! in your die string
my #contents = $fh will just set #contents to a single-element list containing just the filehandle. To read the whole file into the array you need my #contents = <$fh>
You can avoid escaping slashes in a regular expression if you use a different delimiter. To do that you need to use the m operator explicitly, like my #filtered = grep m|/|, #contents)
You return an array reference but assign the returned value to an array, so #lines = countModule() sets #lines to a single-element list containing just the array reference. You should either return a list with return #filtered or dereference the return value on assignment with #lines = #{ countModule }
If all you need to do is to print the number of lines in the file that contain a slash character then you could write something like this
use strict;
use warnings;
my $count;
sub countModule {
open my $fh, '<', '/test' or die "Could not open $file: $!";
return [ grep m|/|, <$fh> ];
}
my $lines = countModule;
$count = #$lines;
print "###########\n $count \n###########\n";

Close, but a few issues:
use strict;
use warnings;
sub countModule
{
my $file = "/test";
open my $fh, "<",$file or die "could not open $file: $!";
my #contents = <$fh>; # The <> brackets are used to read from $fh.
my #filtered = grep (/\// ,#contents);
return #filtered; # Remove the reference.
}
my #lines = countModule();
my $count = scalar #lines; # 'scalar' is not required, but lends clarity.
print "###########\n $count \n###########\n";
Each of the changes I made to your code are annotated with a #comment explaining what was done.
Now in list context your subroutine will return the filtered lines. In scalar context it will return a count of how many lines were filtered.
You did also mention find the occurrences of a character (despite everything in your script being line-oriented). Perhaps your counter sub would look like this:
sub file_tallies{
my $file = '/test';
open my $fh, '<', $file or die $!;
my $count;
my $lines;
while( <$fh> ) {
$lines++;
$count += $_ =~ tr[\/][\/];
}
return ( $lines, $count );
}
my( $line_count, $slash_count ) = file_tallies();

In list context,
return \#filtered;
returns a list with one element -- a reference to the named array #filtered. Maybe you wanted to return the list itself
return #filtered;

Here's some simpler code:
sub countMatches {
my ($file, $c) = #_; # Pass parameters
local $/;
undef $/; # Slurp input
open my $fh, "<",$file or die "could not open $file: $!";
my $s = <$fh>; # The <> brackets are used to read from $fh.
close $fh;
my $ptn = quotemeta($c); # So we can match strings like ".*" verbatim
my #hits = $s =~ m/($ptn)/g;
0 + #hits
}
print countMatches ("/test", '/') . "\n";
The code pushes Perl beyond the very basics, but not too much. Salient points:
By undeffing $/ you can read the input into one string. If you're counting
occurrences of a string in a file, and not occurrences of lines that contain
the string, this is usually easier to do.
m/(...)/g will find all the hits, but if you want to count strings like
"." you need to quote the meta characters in them.
Store the results in an array to evaluate m// in list context
Adding 0 to a list gives the number of items in it.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

how to extract substrings by knowing the coordinates - perl

Related

Perl script to subset multiple DNA sequences

print hashes with values from different files

Calculate the length of a string in a specific file format with perl

Printing array in Perl

Count the occurrences of a string in a file

Categories

Resources