Conversion text file into csv using perl

I have a long text file that I want to convert into a spreadsheet. Each record consists of an Id, Length, Name and sequence. Every new protein starts with a > sign, and the order is Id, Length and Name, with the sequence on a new line.
Example
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Output
Table will be
Id Length Name Sequence
LPT_ECOLI 190-255 (Clockwise) Thr operon leader peptide KRISTTITTT

Provided your IDs are unique, this will do what you want:
use strict;
use warnings;
my ($id, $length, $name, $sequence);
my %data;
while (<DATA>) {
    chomp;
    my @split = split(/,\s*/);
    ($id, $length, $name) = @split[0..2] if /^\d+/;
    $id =~ s/^\d+\s>\s//;
    # store the fields in the same order they are unpacked below
    $data{$id} = [$length, $name, $_] if /^[A-Z]/;
}
open my $out, '>', 'out.csv' or die $!;
print $out "Id,Length,Name,Sequence\n";
foreach my $id (sort keys %data) {
    ($length, $name, $sequence) = @{$data{$id}};
    print $out "$id,$length,$name,$sequence\n";
}
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
This works by splitting your data on , and building a hash of arrays, using the ids as keys and the other information as values. This can then be printed to a .csv file.
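If a protein name can itself contain a comma, splitting on every comma would shift the fields. The following is a minimal sketch of the same hash-of-arrays idea, assuming each header line has the form `N > ID, LENGTH, NAME` and is followed by exactly one sequence line. The helper names parse_records and csv_field are hypothetical and not part of the answer above.

```perl
use strict;
use warnings;

# Quote a field only when it contains a comma, a double quote, or a newline.
sub csv_field {
    my ($value) = @_;
    return $value unless $value =~ /[",\n]/;
    (my $quoted = $value) =~ s/"/""/g;
    return qq{"$quoted"};
}

# Turn header/sequence line pairs into a hash of arrays keyed by ID.
sub parse_records {
    my (@lines) = @_;
    my (%data, $id, $length, $name);
    for my $line (@lines) {
        if ((my $rest = $line) =~ s/^\d+\s*>\s*//) {
            # A limit of 3 keeps any further commas inside the name.
            ($id, $length, $name) = split /\s*,\s*/, $rest, 3;
        }
        elsif (defined $id && $line =~ /^[A-Z]+$/) {
            $data{$id} = [ $length, $name, $line ];
        }
    }
    return %data;
}

my %data = parse_records(
    '1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide',
    'KRISTTITTTITITTGNGAG',
);
for my $id (sort keys %data) {
    print join(',', map { csv_field($_) } $id, @{ $data{$id} }), "\n";
}
```

For anything beyond simple data, the Text::CSV module handles quoting and escaping more thoroughly than a hand-written helper.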

Here's another option:
use strict;
use warnings;
while ( my $lines = <DATA> . <DATA> ) {
print join (',', ( split />\s+|,\s+|\n/, $lines )[ 1 .. 4 ]), "\n";
}
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Output:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
The while loop reads two lines at a time. The split uses a regex to split those lines on "> ", ", " or "\n"; elements 1-4 of the result are then joined with commas and printed.
Hope this helps!
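To see why the slice starts at element 1, it can help to print what the split returns for a single record. This is only an illustrative sketch using the first sample record; the variable names are mine.

```perl
use strict;
use warnings;

# One record as the loop sees it: header line and sequence line concatenated.
my $lines = "1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide\n"
          . "KRISTTITTTITITTGNGAG\n";

# Split on "> ", ", " or a newline; element 0 is the leading record number.
my @parts = split />\s+|,\s+|\n/, $lines;

printf "%d: [%s]\n", $_, $parts[$_] for 0 .. $#parts;
```

Element 0 holds the record number (with its trailing space), which is why only elements 1 to 4 are joined.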

With a somewhat awkward sed script:
sed -nE '/^[0-9]+[ \t]+>/ { s/^[0-9]+[ \t]+>[ \t]+//; h; n; x; G; s/\n/,/; s/[ \t]*,[ \t]*/,/g; p }'
Output:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
You can import this into your spreadsheet as CSV.
Edit: Same thing with Perl if you insist:
perl -lpe 'chomp($_ .= "," . <>) if (s/^\d+\s*>\s*//o); s/\s*,\s*/,/g'

And in Perl:
#!/usr/bin/perl
use strict; use warnings;
open(my $fh, "<", "foo.data") || die "Cannot open foo.data: $!";
my $last_was_rec_start = 0;
my ($id, $len, $name);
foreach (my $lineno=1; my $line = <$fh>; $lineno++ ) {
    chomp($line);
    if ($last_was_rec_start) {
        # Add validation that line matches protein sequence?
        print "${id},${len},${name},$line\n";
        $last_was_rec_start = 0;
        next;
    }
    my @fields = split(/,\s+/, $line);
    unless (scalar(@fields) == 3) {
        print STDERR "Malformed line ${lineno}; expecting 3 comma-delimited fields:\n${line}\n";
        next;
    }
    $len = $fields[1];
    $name = $fields[2];
    unless ($fields[0] =~ /\d+ > (.*)/) {
        print STDERR "Malformed line ${lineno}; expecting number >\n${line}\n";
        next;
    }
    $last_was_rec_start = 1;
    $id = $1;
}
Which gives this output on your example:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Basically, the code reads the file line by line and splits each record-start line on ", ". The first field is then matched against a pattern to strip the leading number and ">". Once a record-start line has been found, the line after it is taken as the sequence line.
However, you might also want to look at Bio::Perl. It can probably write CSV files, and if your input is in some standard format it might be able to read that as well.

Below is some sample code. In a working version, replace <DATA> with <STDIN> and run it as script < input-file > output-file
use strict; use warnings;
# print CSV header line
print "N, Id, Length, Name, Sequence\n";
my($line1,$line2);
while( defined($line1=<DATA>) and defined($line2=<DATA>)) {
# put two input lines slurped above into $_
local $_ = $line1 . $line2;
my ($N, $Id, $Length, $Name, $Sequence ) = m{
^(\d{1,6}) # $N - record number (?)
\x20>\x20
([A-Z0-9_]{1,128}?) # $Id
\x20*,\x20*
([- ()0-9A-Za-z]{1,128}?) # Length
\x20*,\x20*
([^,\"\'\n\r]{1,256}?) # $Name
# the quotes (\"\') are escaped/backslashed to make SO syntax coloring work
\x20*\r?\n
([A-Z]{1,4096}?) # $Sequence
\r?\n
}sox or die "wrong line format (line $.):\n $_";
printf "%d, %s, %s, %s, %s\n", $N, $Id, $Length, $Name, $Sequence;
}
die if defined($line1); # incomplete set of input lines
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

Related

Error use of uninitialized value although it is initialized

I am trying to build a table from the content of an input file, but it constantly gives me the error
Use of uninitialized value $ac[3] in concatenation (.) or string at table.pl
line 58 (#1)
and
Use of uninitialized value $or[2] in concatenation (.) or string at table.pl
line 61 (#1)
and although I have tried almost every possible change, it still gives me an error and does not print properly.
This is what my input file looks like:
HEADER OXIDOREDUCTASE 08-JUN-12 2LU5
EXPDTA SOLID-STATE NMR
REMARK 2 RESOLUTION. NOT APPLICABLE.
HETNAM CU COPPER (II) ION
HETNAM ZN ZINC
FORMUL 2 CU CU 2+
FORMUL 2 ZN ZN 2+
END
This is a script I am using:
#!/usr/bin/env perl
use strict;
use warnings;
use diagnostics;
#my $testfile=shift;
open(INPUT, "$ARGV[0]") or die 'Cannot make it';
my @file=<INPUT>;
close INPUT;
my @ac=();
my @dr=();
my @os=();
my @or=();
my @fo=();
for (my $line=0;$line<=$#file;$line++)
{
    chomp($file[$line]);
    if ($file[$line] =~ /^HEADER/)
    {
        print( (split '\s+', $file[$line])[-1]);
        print "\t";
        while ($file[$line] !~ /^END /)
        {
            $line++;
            if ($file[$line]=~/^EXPDTA/)
            {
                $file[$line]=~s/^EXPDTA//;
                @os=(@os,split '\s+', $file[$line]);
            }
            if ($file[$line] =~ /^REMARK 2 RESOLUTION./)
            {
                $file[$line]=~s/^REMARK 2 RESOLUTION.//;
                @ac = (@ac,split'\s+',$file[$line]);
            }
            if ($file[$line] =~ /^HETNAM/)
            {
                $file[$line]=~s/^HETNAM//;
                $file[$line] =~ s/\s+//;
                push @dr, $file[$line];
            }
            if ($file[$line] =~ /^SOURCE 2 ORGANISM_SCIENTIFIC/)
            {
                $file[$line]=~s/^SOURCE 2 ORGANISM_SCIENTIFIC//;
                @or = (@or,split'\s+',$file[$line]);
            }
            if ($file[$line] =~ /^FORMUL/)
            {
                $file[$line]=~s/^FORMUL//;
                $file[$line] =~ s/\s+//;
                push @fo, $file[$line];
            }
        }
        print "$os[1] $os[2]\t";
        print "\t";
        @os=();
        print "$ac[3] $ac[4]\t" or die "Cannot be printed"; #line 58
        print "\t";
        @ac=();
        print "$or[2] $or[3]\t" or die "Cannot be printed"; #line 61
        print "\t";
        @or=();
        foreach (@dr)
        {
            print "$_";
            print "\t\t\t\t\t";
        }
        @dr=();
        print "\n";
    }
}
And this is the output it gives me, but it doesn't seem to print properly and I am really not sure why:
2LU5 SOLID-STATE NMR CU COPPER (II) ION
Desired output that I am expecting is :
HEADER EXPDTA REMARK HETNAM FORMUL
OXIDOREDUCTASE 2LU5 SOLID-STATE NMR RESOLUTION. NOT APPLICABLE. COPPER (II) ION (here better to say last column because certain diversity exists before "copper") CU 2+
ZN ZINC ZN 2+

The root of your error is that:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @ac = ();
my $str = "REMARK 2 RESOLUTION. NOT APPLICABLE. ";
$str =~ s/^REMARK 2 RESOLUTION.//;
@ac = ( @ac, split '\s+', $str );
print Dumper \@ac;
The contents of @ac are:
$VAR1 = [
'',
'NOT',
'APPLICABLE.'
];
There is no $ac[3]; you only have elements 0, 1 and 2 in there.
With your @or error, you don't have any lines matching /^SOURCE 2 ORGANISM_SCIENTIFIC/,
so that array is empty, which means you've got no $or[2] to print either.
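The empty first element comes from the leading space that is left once the prefix has been removed. As a small sketch (the variable names are mine), compare a regex split with the special ' ' pattern, and guard an index before printing it:

```perl
use strict;
use warnings;

my $rest = " NOT APPLICABLE. ";    # what is left after the prefix is removed

my @by_regex = split /\s+/, $rest; # leading empty field is kept
my @by_space = split ' ',  $rest;  # special case: leading whitespace is skipped

print scalar(@by_regex), " fields with /\\s+/\n";
print scalar(@by_space), " fields with ' '\n";

# Guard an index before using it instead of assuming it exists.
my $third = defined $by_space[2] ? $by_space[2] : '(missing)';
print "third field: $third\n";
```

Either way the array is shorter than the indexes used in the question, so the print statements need to use indexes that actually exist.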
More generally - what you're doing here is actually really quite clunky, and there's a much cleaner solution.
How about:
#!/usr/bin/env perl
use strict;
use warnings;
#set the text "END" as our record separator
local $/ = 'END';
#define the fields to print out.
my @field_order = qw ( HEADER EXPDTA REMARK HETNAM FORMUL );
print join ( ",", @field_order), "\n"; #print header row
#iterate STDIN or file named on command line.
#just like you're doing with open (FILE, $ARGV[0])
while ( <> ) {
    #select key value pairs into a hash - first word on the line is the 'key'
    #and the value is 'anything else'.
    my %this_entry = m/^(\w+)\s+(.*)$/gm;
    next unless $this_entry{'HEADER'}; #check we have a header.
    s/\s+/ /g for values %this_entry; #strip repeated spaces from fields;
    s/\s+$//g for values %this_entry; #strip trailing whitespace.
    #split 'header' row into separate subfields
    #this is an example of how you could transform other fields.
    ($this_entry{'HEADER'}, $this_entry{'DATE'}, $this_entry{'STRUCT'} ) = split ' ', $this_entry{'HEADER'};
    print join (",", @this_entry{@field_order} ), "\n";
}
This will - given your input - print:
HEADER,DATE,STRUCT,EXPDTA,REMARK,HETNAM,FORMUL
OXIDOREDUCTASE,08-JUN-12,2LU5,SOLID-STATE NMR,2 RESOLUTION. NOT APPLICABLE.,CU COPPER (II) ION,2 CU CU 2+
That doesn't quite match your desired output, but hopefully it illustrates how much simpler this task could be.
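One thing to watch with this approach: a list assignment to a hash keeps only the last value for a repeated key, so a record with several HETNAM or FORMUL lines keeps just one of each. A minimal sketch of collecting every value instead, using a shortened copy of the sample record (the %entry name is an assumption of mine):

```perl
use strict;
use warnings;

my $record = <<'END_RECORD';
HEADER OXIDOREDUCTASE 08-JUN-12 2LU5
HETNAM CU COPPER (II) ION
HETNAM ZN ZINC
FORMUL 2 CU CU 2+
FORMUL 2 ZN ZN 2+
END_RECORD

# Collect every value for a key instead of letting later lines overwrite earlier ones.
my %entry;
while ($record =~ /^(\w+)[ \t]+(.*)$/gm) {
    push @{ $entry{$1} }, $2;
}

for my $key (sort keys %entry) {
    print "$key: ", join(' | ', @{ $entry{$key} }), "\n";
}
```

Each key then maps to an array reference, so the repeated rows can be printed on separate output lines if the table needs them.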

Multi-column minimal representation discovery in perl

I have a CSV of data with about 20 columns, and each column has more than one distinct value. Each row after the header row is an individual data sample. I want to narrow the list down programmatically so that I have the smallest number of data samples in which each permutation of the column data is still represented.
Example data
SERIAL,ACTIVE,COLOR,CLASS,SEASON,SEATS
.0xb468d47cc9749fb862990426ff79aafb,T,GREEN,BETA,SUMMER,3
.0x847129b35bad62f5837eec30dc07a8a4,T,VIOLET,DELTA,SUMMER,1
.0x14b8df88fd6d6547e387f4caa99e52fd,F,ORANGE,ALPHA,SUMMER,4
.0x0a07fb97224caf79ea73d3fdd5495b8f,T,YELLOW,DELTA,WINTER,1
.0x7d747e689bb27b60198283d7b86db409,F,READ,DELTA,SPRING,2
.0x8247524df49bd19c4c316ee070a2dd4a,T,BLUE,GAMA,WINTER,2
.0x4103ed42af6e8e463708a6c629907fb5,T,YELLOW,ALPHA,SPRING,5
.0xc38deea7f02fbfbcdde1d3718d6decb4,T,YELLOW,DELTA,FALL,5
.0xa3d562edcf64e151d7de08ff8f8e0a94,F,VIOLET,DELTA,SUMMER,3
.0x9da58b3b05603325c24629f700c25c97,T,YELLOW,OMEGA,SPRING,4
.0xef0c0e75083229d654c9b111e3af8726,T,BLUE,GAMA,FALL,1
.0xa9022c8713f0aba2a8e1d20475a3104a,T,YELLOW,BETA,SUMMER,2
.0x5bb5f73e6030730610866cee80cfc2fb,F,ORANGE,BETA,FALL,5
.0xc202e5b43dd65525754fdc52b89e7375,T,BLUE,OMEGA,SUMMER,3
.0xfac9145af33a74aedae7cc0442426432,F,READ,BETA,SPRING,1
.0x457949648053f710b4f2d55cb237a91d,T,GREEN,BETA,SPRING,3
.0xed94d4df300f10f5c4dc5d3ac76cf9e5,F,VIOLET,ALPHA,WINTER,15
.0x870130135beed4cbbe06478e368b40b3,F,YELLOW,ALPHA,SPRING,3
.0x3b6f17841edb9651e732e3ffbacbe14a,T,GREEN,OMEGA,SUMMER,3
.0xfb30e054466b9e4cf944c8e48ff74c93,F,VIOLET,DELTA,SUMMER,8
.0xf741ddc71b4a667585acaa35b67dc6c9,F,BLUE,BETA,FALL,4
.0x60257ad6c299e466086cc6e5bb0a9a33,F,VIOLET,OMEGA,SPRING,1
.0xa5d208bfee5a27a7619ba07dcbdaeea0,T,GREEN,OMEGA,FALL,1
.0x53bc78fa8863e53e8c9fb11c5f6d2320,F,GREEN,GAMA,SPRING,2
.0x5a01253ce5cb0a6aa5213f34f0b35416,T,READ,BETA,WINTER,3
.0xaed9a979ba9f6fbf39895b610dde80f4,T,ORANGE,DELTA,WINTER,1
.0xe7769918e36671af77b5d3d59ea15cfe,T,ORANGE,OMEGA,FALL,4
.0x9e5327a1583332e4c56d29c356dbc5d2,T,INDEGO,ALPHA,WINTER,5
.0x79c5c70732ff04b4d00e81ac3a07c3b7,T,READ,OMEGA,FALL,5
.0x55f54d3c9cd2552e286364894aeef62a,F,READ,GAMA,SPRING,15

Use a hash to determine whether you have seen a particular column combination before, and then use that to determine whether to print a particular line.
Here is a rather basic example to demonstrate the idea:
filter.pl
#!/usr/bin/env perl
use warnings;
use strict;
die "usage: $0 file col1,col2,col3, ... coln\n" unless @ARGV;
my ($file, $columns) = @ARGV;
-f $file or die "$file does not exist!";
defined $columns or die "need to pass in columns!";
my @columns;
for my $col ( split /,/, $columns ) {
    die "Invalid column id $col" unless $col >= 1; # 1-based
    push @columns, $col - 1; # 0-based
}
scalar @columns or die "No columns!";
open my $fh, "<", $file or die "Unable to open $file : $!";
my %uniq;
while (<$fh>) {
    chomp();
    next if $. == 1; # Skip Header
    my (@data) = split /,/, $_; # Use Text::CSV for any non-trivial csv file
    my $key = join '|', @data[ @columns ]; # key will look like 'foo|bar|baz'
    if (not defined $uniq{ $key } ) {
        print $_ . "\n"; # Print the whole line with the first unique set of columns
        $uniq{ $key } = 1; # Now we have seen this combo
    }
}
data.csv
SERIAL,TRUTH,IN,PARALLEL
123,TRUE,YES,5
124,TRUE,YES,5
125,TRUE,YES,3
126,TRUE,NO,5
127,FALSE,YES,1
128,FALSE,YES,3
129,FALSE,NO,7
Output
perl filter.pl data.csv 2,3
123,TRUE,YES,5
126,TRUE,NO,5
127,FALSE,YES,1
129,FALSE,NO,7
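If the goal is instead that every distinct value of each column appears at least once (one possible reading of the question, and an assumption on my part), the same "seen" hash can be keyed per column rather than per combination. The sketch below is a greedy pass with a hypothetical covering_rows helper; it does not guarantee the smallest possible set.

```perl
use strict;
use warnings;

# Keep a row only if it contributes a (column, value) pair not seen so far.
# This is a greedy pass: every value of every listed column ends up
# represented, but the result is not guaranteed to be the smallest set.
sub covering_rows {
    my ($columns, @rows) = @_;    # $columns: array ref of 0-based column indexes
    my (%seen, @kept);
    for my $row (@rows) {
        my @fields = split /,/, $row;
        my @new = grep { !$seen{"$_|$fields[$_]"} } @$columns;
        next unless @new;
        $seen{"$_|$fields[$_]"} = 1 for @$columns;
        push @kept, $row;
    }
    return @kept;
}

my @rows = (
    '123,TRUE,YES,5',
    '124,TRUE,YES,5',
    '126,TRUE,NO,5',
    '127,FALSE,YES,1',
    '129,FALSE,NO,7',
);
print "$_\n" for covering_rows([1, 2], @rows);
```

With these rows it keeps 123, 126 and 127, because row 129 adds no value that has not already been covered.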

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

I don't know if this is just a quirk with Strawberry Perl, but I can't seem to get it to run. I just need to take a FASTA file and reverse every sequence in it.
-The problem-
I have a multifasta file:
>seq1
ABCDEFG
>seq2
HIJKLMN
and the expected output is:
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
The script is here:
$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my @proteins = split (/>/, $text); ## put all input sequences into an array
for my $protein (@proteins) { ## evaluate each input sequence individually
$protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
## the FASTA-formatted protein
my $name = $1; ## remember the name of the input sequence
print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
## protein stand out in a list
$protein =~ s/\n//gm; ## remove newline characters from sequence
$protein = reverse($protein); ## reverse the sequence
while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols
$protein =~ s/(.{$NUM_C0L})//;
my $line = $1;
print $OUT "$line\n";
}
print $OUT "$protein\n"; ## print last portion of reversed protein
}
close ($IN);
close ($OUT);
print "done\n";

This will do as you ask.
It builds a hash %fasta out of the FASTA file, keeping an array @keys to preserve the order of the sequences, and then prints out each element of the hash.
Each line of the sequence is reversed using reverse before it is added to the hash, and unshift adds the lines of the sequence in reverse order.
The program expects the input file as a parameter on the command line, and prints the result to STDOUT, which may be redirected on the command line.
use strict;
use warnings 'all';
my (%fasta, @keys);
{
    my $key;
    while ( <> ) {
        chomp;
        if ( s/^>\K/REV/ ) {
            $key = $_;
            push @keys, $key;
        }
        elsif ( $key ) {
            unshift @{ $fasta{$key} }, scalar reverse;
        }
    }
}
for my $key ( @keys ) {
    print $key, "\n";
    print "$_\n" for @{ $fasta{$key} };
}
output
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
Update
If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash.
This alternative uses the length of the longest line in the original file as the limit, and rewraps the reversed sequence to the same length. It's clear that it would be simple to specify an explicit length instead of calculating it.
You will need to add use List::Util 'max' at the top of the program.
my $len = max map length, map @$_, values %fasta;
for my $key ( @keys ) {
    print $key, "\n";
    my $seq = join '', @{ $fasta{$key} };
    print "$_\n" for $seq =~ /.{1,$len}/g;
}
Given the original data the output is identical to that of the solution above. I used this as input
>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY
with this result. All lines have been wrapped to eleven characters - the length of the longest JKLMNOPQRST line in the original data
>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
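The reverse-then-rewrap step can also be isolated in a small function with an explicit width. This is a sketch only; reverse_wrap is a hypothetical name, and the width of 11 is chosen to mirror the example above.

```perl
use strict;
use warnings;

# Reverse a sequence and cut it into lines of at most $width characters.
sub reverse_wrap {
    my ($sequence, $width) = @_;
    my $reversed = reverse $sequence;    # scalar assignment: reverses the string
    return $reversed =~ /.{1,$width}/g;  # list of chunks, last one may be shorter
}

print "$_\n" for reverse_wrap('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 11);
```

Called on the joined lines of seq1 from the example, it yields the same three lines shown for >REVseq1.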

I don't know if this is just for a class that uses toy datasets or actual research FASTAs that can be gigabytes in size. If the latter, it would make sense not to keep the whole data set in memory as both your program and Borodin's do but read it one sequence at a time, print that out reversed and forget about it. The following code does that and also deals with FASTA files that may have asterisks as sequence-end markers as long as they start with >, not ;.
#!/usr/bin/perl
use strict;
use warnings;
my $COL_WIDTH = 80;
my $sequence = '';
my $seq_label;
sub print_reverse {
my $seq_label = shift;
my $sequence = reverse shift;
return unless $sequence;
print "$seq_label\n";
for(my $i=0; $i<length($sequence); $i += $COL_WIDTH) {
print substr($sequence, $i, $COL_WIDTH), "\n";
}
}
while(my $line = <>) {
chomp $line;
if($line =~ s/^>/>REV/) {
print_reverse($seq_label, $sequence);
$seq_label = $line;
$sequence = '';
next;
}
$line = substr($line, 0, -1) if substr($line, -1) eq '*';
$sequence .= $line;
}
print_reverse($seq_label, $sequence);

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have set up the counters and can read each line of DNA sequence, but at the moment it is doing a running summation of the total across all lines. I want it to read each sequence, print the counts once that sequence has been read, and then move on to the next one, so that each line has its own individual base counts.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encountered and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements assess which base is present and make a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.

Several things:
1. Use a hash instead of declaring a separate variable for each base.
2. An assignment such as $G = (0); does work, but it is not the idiomatic way to assign a scalar. The parentheses only group the expression, and in scalar context a parenthesized list yields its last element, so the result is still 0. The clearer way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}

Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = @_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line as another statement, like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.
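On the wrong percentages: in the question, expressions such as ($A+=$C+=$G+=$T) use +=, which modifies the counters while printing, so the second print works with already-changed values. A minimal sketch that computes the totals from fresh sums instead, counting with tr/// (base_stats is a hypothetical helper name, and it assumes upper-case sequence letters):

```perl
use strict;
use warnings;

# Count bases with tr/// (which returns the number of matching characters)
# and compute the GC percentage from fresh sums, never with +=.
sub base_stats {
    my ($sequence) = @_;
    my %count = (
        A => ($sequence =~ tr/A//),
        C => ($sequence =~ tr/C//),
        G => ($sequence =~ tr/G//),
        T => ($sequence =~ tr/T//),
    );
    my $total = $count{A} + $count{C} + $count{G} + $count{T};
    my $gc    = $total ? 100 * ($count{G} + $count{C}) / $total : 0;
    return (%count, total => $total, gc_percent => $gc);
}

my %stats = base_stats('GGCCAATT');
printf "G:%d C:%d A:%d T:%d total:%d GC:%.1f%%\n",
    @stats{qw(G C A T total)}, $stats{gc_percent};
```

Because nothing is reassigned during printing, the counts can be reported in any order without affecting each other.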

How to parse through tab-delimited file in perl?

I'm new to Perl, and I've hit a mental roadblock. I need to extract information from a tab delimited file as shown below.
#name years risk total
adam 5 100 200
adam 5 50 100
adam 10 20 300
bill 20 5 100
bill 30 10 800
In this example, the tab delimited file shows length of investment, amount of money risked, and total at the end of investment.
I want to parse through this file and, for each name (e.g. adam), calculate the sum of years invested (5+5+10) and the sum of earnings, (200-100) + (100-50) + (300-20). I would also like to save the totals for each name (200, 100, 300).
Here's what I have tried so far:
my $filename;
my $seq_fh;
open $seq_fh, $frhitoutput
or die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my @line = split(/\s+/, $line);
my $yeartotal = 0;
my $earning = 0;
#$line[0] = name
#$line[1] = years
#$line[2] = start
#$line[3] = end
while (@line[0]){
$yeartotal += $line[1];
$earning += ($line[3]-$line[2]);
}
}
Any ideas of where I went wrong?

The Text::CSV module can be used to read tab-delimited data. Often much nicer than trying to manually hack yourself something up with split and so on when it comes to things like quoting, escaping, etc..

You're wrong here: while (@line[0]) { ... }. The condition never changes inside the loop, so once it is true the loop never terminates.
I'd do:
my $seq_fh;
my %result;
open($seq_fh, $frhitoutput) || die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my @line = split(/\s+/, $line);
$result{$line[0]}{yeartotal} += $line[1];
$result{$line[0]}{earning} += $line[3] - $line[2];
}

You should use hash, something like this:
my %hash;
while (my $line = <>) {
next if $line =~ /^#/;
my ($name, $years, $risk, $total) = split /\s+/, $line;
next unless defined $name and defined $years
and defined $risk and defined $total;
$hash{$name}{years} += $years;
$hash{$name}{risk} += $risk;
$hash{$name}{total} += $total;
$hash{$name}{earnings} += $total - $risk;
}
foreach my $name (sort keys %hash) {
print "$name earned $hash{$name}{earnings} in $hash{$name}{years}\n";
}
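The question also asks to keep the individual totals for each name, which can be done by pushing onto an array inside the same hash. A minimal sketch, assuming whitespace-separated rows of name, years, risk and total; summarize is a hypothetical helper name.

```perl
use strict;
use warnings;

# Aggregate whitespace-separated rows of: name years risk total.
sub summarize {
    my (@lines) = @_;
    my %by_name;
    for my $line (@lines) {
        next if $line =~ /^#/ || $line !~ /\S/;    # skip comments and blank lines
        my ($name, $years, $risk, $total) = split ' ', $line;
        $by_name{$name}{years}    += $years;
        $by_name{$name}{earnings} += $total - $risk;
        push @{ $by_name{$name}{totals} }, $total; # keep each total
    }
    return %by_name;
}

my %summary = summarize(
    "#name years risk total",
    "adam 5 100 200",
    "adam 5 50 100",
    "adam 10 20 300",
    "bill 20 5 100",
    "bill 30 10 800",
);
for my $name (sort keys %summary) {
    my $s = $summary{$name};
    print "$name: years $s->{years}, earnings $s->{earnings}, totals @{ $s->{totals} }\n";
}
```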

Nice opportunity to explore Perl's powerful command line options! :)
Code
Note: this code should be a command line oneliner, but it's a little bit easier to read this way. When writing it in a proper script file, you really should enable strict and warnings and use slightly better names. This version won't compile under strict; you would have to declare our %d.
#!/usr/bin/perl -nal
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
# print summary
END { print "$_:\tyears: $d{$_}{y},\tearnings: $d{$_}{e}" for sort keys %d }
Output
adam: years: 20, earnings: 430
bill: years: 50, earnings: 885
Explanation
I make use of the -n switch here which basically lets your code iterate over the input records (-l tells it to use lines). The -a switch lets perl split the lines into the array @F. Simplified version:
while (defined($_ = <STDIN>)) {
chomp $_;
our(@F) = split(' ', $_, 0);
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
}
%d is a hash with the names as keys and hashrefs as values, which contain years (y) and earnings (e).
The END block is executed after finishing the input line processing and outputs %d.
Use O's Deparse to view the code which is actually executed:
book:/tmp memowe$ perl -MO=Deparse tsv.pl
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
our(@F) = split(' ', $_, 0);
$d{$F[0]}{'y'} += $F[1];
$d{$F[0]}{'e'} += $F[3] - $F[2];
sub END {
print "${_}:\tyears: $d{$_}{'y'},\tearnings: $d{$_}{'e'}" foreach (sort keys %d);
}
;
}
tsv.pl syntax OK

It seems like a fixed-width file, I would use unpack for that
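As a rough illustration of that idea, here is a sketch with a purely hypothetical layout of three 6-character columns followed by the rest of the line. The real column widths would have to be taken from the actual file.

```perl
use strict;
use warnings;

# Hypothetical layout: three 6-character columns, then the rest of the line.
# The 'A' format strips trailing spaces from each field.
my $line = 'adam  5     100   200';
my ($name, $years, $risk, $total) = unpack 'A6 A6 A6 A*', $line;
print join('|', $name, $years, $risk, $total), "\n";
```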
