I have a long text file that I want to convert to a spreadsheet. It consists of an Id, Name, Length and sequence. Every new protein starts with a (>) sign, and the order is Id, Length and Name, with the sequence on a new line.
Example
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Output
The table would be:
Id Length Name Sequence
LPT_ECOLI 190-255 (Clockwise) Thr operon leader peptide KRISTTITTT
Provided your IDs are unique, this will do what you want:
use strict;
use warnings;

my ($id, $length, $name, $sequence);
my %data;
while (<DATA>) {
    chomp;
    my @split = split /,/;
    ($id, $length, $name) = @split[0..2] if /^\d+/;
    $id =~ s/^\d+\s>\s//;
    $data{$id} = [$length, $name, $_] if /^[A-Z]/;
}
open my $out, '>', 'out.csv' or die $!;
print $out "Id,Length,Name,Sequence\n";
foreach my $id (sort keys %data) {
    ($length, $name, $sequence) = @{$data{$id}};
    print $out "$id,$length,$name,$sequence\n";
}
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
This works by splitting your data on commas and building a hash of arrays, using the IDs as keys and the other information as values. This can then be printed to a .csv file.
Here's another option:
use strict;
use warnings;
while ( my $lines = <DATA> . <DATA> ) {
print join (',', ( split />\s+|,\s+|\n/, $lines )[ 1 .. 4 ]), "\n";
}
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Output:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
The while loop starts by reading in two lines at a time. The split uses a regex to split those lines on "> " (followed by whitespace), ", " or a newline, and then joins elements 1-4 from the split with commas and prints the result.
Hope this helps!
With a somewhat awkward sed script:
sed -nE '/^[0-9]+[ \t]+>/ { s/^[0-9]+[ \t]+>[ \t]+//; h; n; x; G; s/\n/,/; s/[ \t]*,[ \t]*/,/g; p }'
Output:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
This you can import in your spreadsheet as CSV.
Edit: Same thing with Perl if you insist:
perl -lpe 'chomp($_ .= "," . <>) if (s/^\d+\s*>\s*//o); s/\s*,\s*/,/g'
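For readers who don't speak one-liner, here is roughly what that does when written out as a script (a sketch, not an exact deparse):

use strict;
use warnings;

while (<>) {
    chomp;                      # -l strips the newline from each input line
    if (s/^\d+\s*>\s*//) {      # header line: strip the leading "1 > " part
        chomp($_ .= "," . <>);  # pull in the following sequence line, joined with a comma
    }
    s/\s*,\s*/,/g;              # normalise the spacing around commas
    print "$_\n";               # -p prints $_; -l restores the newline
}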
And in Perl:
#!/usr/bin/perl
use strict; use warnings;
open(my $fh, "<", "foo.data") || die;
my $last_was_rec_start = 0;
my ($id, $len, $name);
foreach (my $lineno=1; my $line = <$fh>; $lineno++ ) {
chomp($line);
if ($last_was_rec_start) {
# Add validation that line matches protein sequence?
print "${id},${len},${name}',$line\n";
$last_was_rec_start = 0;
next;
}
my @fields = split(/,\s+/, $line);
unless (scalar(@fields) == 3) {
print STDERR "Malformed line ${lineno}; expecting 3 comma-delimited fields:\n${line}\n";
next;
};
$len = $fields[1];
$name = $fields[2];
unless ($fields[0] =~ /\d+ > (.*)/) {
print STDERR "Malformed line ${lineno}; expecting number >\n${line}\n";
next;
}
$last_was_rec_start = 1;
$id = $1;
}
Which gives this output on your example:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Basically the code reads the input line by line, splitting header lines on ", ". The first field is sub-matched to strip the leading "number >" part and recover the id. Once a header line has matched, the following line is taken as the sequence line.
However, you might also want to look at Bio::Perl. It can probably write CSV files and if your input in some standard format it might be able to read that as well.
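For illustration, here is a minimal Bio::SeqIO sketch, assuming the records were first saved as plain FASTA (the filename proteins.fasta is made up, and the leading "1 > " numbering would have to be stripped beforehand):

use strict;
use warnings;
use Bio::SeqIO;   # part of the BioPerl distribution

my $in = Bio::SeqIO->new( -file => 'proteins.fasta', -format => 'fasta' );

print "Id,Description,Length,Sequence\n";
while ( my $seq = $in->next_seq ) {
    # display_id is the first word after '>', desc is the rest of the header line
    printf "%s,%s,%d,%s\n", $seq->display_id, $seq->desc, $seq->length, $seq->seq;
}

If the description itself can contain commas, you would want Text::CSV for the output side as well.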
Below please find sample code. In a working version, replace <DATA> with <STDIN> and execute it as: script < input-file > output-file
use strict; use warnings;
# print CSV header line
print "N, Id, Length, Name, Sequence\n";
my($line1,$line2);
while( defined($line1=<DATA>) and defined($line2=<DATA>)) {
# put two input lines slurped above into $_
local $_ = $line1 . $line2;
my ($N, $Id, $Length, $Name, $Sequence ) = m{
^(\d{1,6}) # $N - record number (?)
\x20>\x20
([A-Z1-9_]{1,128}?) # $Id
\x20*,\x20*
([- ()0-9A-Za-z]{1,128}?) # Length
\x20*,\x20*
([^,\"\'\n\r]{1,256}?) # $Name
# the quotes (\"\') are escaped/backslashed to make SO syntax coloring work
\x20*\r?\n
([A-Z]{1,4096}?) # $Sequence
\r?\n
}sox or die "wrong line format (line $.):\n $_";
printf "%d, %s, %s, %s, %s\n", $N, $Id, $Length, $Name, $Sequence;
}
die if defined($line1); # incomplete set of input lines
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
EDITED: I'm attempting to create a brief script that takes an input fixed-width file and a file with the start position and length of each attribute, and then outputs the data as CSV instead of fixed width. I haven't dealt with removing whitespace yet and am currently focusing on building the file-reader portion.
Fixed:
My current issue is that this code returns data from the third row for $StartPosition and from the fourth row for $Length when they should both be first found on the first row of COMMA. I have no idea what is prompting this behavior.
Next issue: It only reads the first record in practice_data.txt. I'm guessing I need to tell COMMA to go back to the beginning?
while (my $sourceLine = <SOURCE>) {
$StartPosition = 0;
$Length = 0;
$Output = "";
$NextRecord ="";
while (my $commaLine = <COMMA>) {
my $Comma = index($commaLine, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($commaLine, 0, $Comma);
print "Start position is $StartPosition \n";
$Comma = $Comma + 1;
$Length = substr($commaLine, $Comma);
print "Length is $Length \n";
$NextRecord = substr($sourceLine, $StartPosition, $Length);
$Output = "$Output . ',' . $NextRecord";
}
print OUTPUT "$Output \n";
}
practice_data.txt
1234512345John Doe 123 Mulberry Lane Columbus Ohio 43215Johnny Jane
5432154321Jason McKinny 423 Thursday Lane Columbus Ohio 43212Jase Jamie
4321543212Mike Jameson 289 Front Street Cleveland Ohio 43623James Sarah
Each record is 100 characters long.
Definitions.txt:
0,10
10,10
20,10
30,20
50,10
60,10
70,5
75,15
90,10
It always helps to provide enough information so that we can at least do some testing without having to read your code and imagine what the data must look like.
I suggest you use unpack, after first building a template from the file that holds the field specifications. Note that the A field specifier trims trailing spaces from the data.
It is all but essential to use the Text::CSV module to parse or generate well-formed CSV data. And I have used the autodie pragma to avoid having to explicitly check and report on the status of every I/O operation.
I have used this data
my_source_data.txt
12345678 ABCDE1234FGHIJK
my_field_spec.txt
0,8
10,5
15,4
19,6
And this program
use strict;
use warnings;
use 5.010;
use autodie;
use Text::CSV;
my @template;
open my $field_fh, '<', 'my_field_spec.txt';
while ( <$field_fh> ) {
my (@info) = /\d+/g;
die unless @info == 2;
push @template, sprintf '@%dA%d', @info;
}
my $template = "@template";
open my $source_fh, '<', 'my_source_data.txt';
my $csv = Text::CSV->new( { binary => 1, eol => $/ } );
while ( <$source_fh> ) {
my @fields = unpack $template;
$csv->print(\*STDOUT, \@fields);
}
output
12345678,ABCDE,1234,FGHIJK
It looks like you're slightly confused about how to read the contents of the COMMA filehandle. Each time you read <COMMA>, you're reading another line from that file. Instead, read a line into a scalar like my $line = <FH> and use that instead:
while (my $source_line = <SOURCE>) {
$StartPosition = 0;
$Length = 0;
$Output = "";
$Input = $source_line;
$NextRecord ="";
while (my $comma_line = <COMMA>) {
my $Comma = index($comma_line, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($comma_line, 0, $Comma);
print "Start position is $StartPosition \n";
$Length = substr($comma_line, $Comma);
print "Length is $Length \n";
$NextRecord = substr($Input, $StartPosition, $Length) . ',';
$Output = "$Output$NextRecord";
}
print OUTPUT "$Output \n";
}
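One way to avoid exhausting the COMMA filehandle on every source line (a sketch, assuming the practice_data.txt and Definitions.txt files shown in the question) is to read the start/length pairs into an array once, before looping over the records:

use strict;
use warnings;

open my $def_fh, '<', 'Definitions.txt' or die $!;
my @spec;                                    # list of [start, length] pairs
while (my $line = <$def_fh>) {
    chomp $line;
    push @spec, [ split /,/, $line ];
}
close $def_fh;

open my $src_fh, '<', 'practice_data.txt' or die $!;
while (my $record = <$src_fh>) {
    chomp $record;
    my @fields = map { substr $record, $_->[0], $_->[1] } @spec;
    print join(',', @fields), "\n";          # or print to your OUTPUT handle
}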
I have a csv with about 160,000 lines, it looks like this:
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
I would like to pair lines where column 1 is the same, column 3 matches column 2 and where column 6 is a '+' while on the other line it is a '-'. If this is true I would like to sum column 4 and column 5.
My desired output would be:
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,-
The best solution I can think of is to duplicate the file and then match columns between the file and its duplicate with Perl:
#!/usr/bin/perl
use strict;
use warnings;
open my $firstfile, '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";
my ($chr_a, $chr_b,$start,$end,$begin,$finish, $sum_a, $sum_b, $total_a,
$total_b,$sign_a,$sign_b);
while (<$firstfile>) {
my @col = split /,/;
$chr_a = $col[0];
$start = $col[1];
$end = $col[2];
$sum_a = $col[3];
$total_a = $col[4];
$sign_a = $col[5];
seek($secondfile,0,0);
while (<$secondfile>) {
my @seccol = split /,/;
$chr_b = $seccol[0];
$begin = $seccol[1];
$finish = $seccol[2];
$sum_b = $seccol[3];
$total_b = $seccol[4];
$sign_b = $seccol[5];
print join ("\t", $col[0], $col[1], $col[2], $col[3]+=$seccol[3],
$col[4]+=$seccol[4], $col[5]),
"\n" if ($chr_a eq $chr_b and $end==$begin and $sign_a ne $sign_b);
}
}
And that works fine, but ideally I'd like to be able to do this within the file itself without having to duplicate it, because I have many files, and I'd like to run a less time-consuming script over all of them.
Thanks.
In the absence of a response to my comment, this program will do as you ask with the data you provide.
use strict;
use warnings;
my @last;
while (<DATA>) {
    s/\s+\z//;
    my @line = split /,/;
    if (@last
        and $last[0] eq $line[0]
        and $last[2] eq $line[1]
        and $last[5] eq '+' and $line[5] eq '-') {
        $last[3] += $line[3];
        $last[4] += $line[4];
        print join(',', @last), "\n";
        @last = ();
    }
    else {
        print join(',', @last), "\n" if @last;
        @last = @line;
    }
}
print join(',', @last), "\n" if @last;
__DATA__
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
output
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+
I have a text file where lines are distinguished by the number of spaces at the beginning.
It's like below:
The first line doesn't have any spaces at the beginning.
The second line has 2 spaces.
The third line has 4 spaces at the beginning.
The fourth line has 6 spaces at the beginning.
This pattern is repeated until the end of the file in a random way, as shown in the example text file below.
I want to read these lines from the text file and save them into a CSV file in this pattern:
lines with no leading spaces go in the first column,
lines with 2 spaces in the second column,
lines with 4 spaces in the third column,
and lines with 6 spaces in the fourth column.
The text file structure is (representing spaces by #) :
ABC
##EFG"123"
####<HIJK> 22: test file
######LMNOP "Test"
######sssstt"123"
QRS
##TU"223"
####<www> 32: test2 file
######yz test1
####<www> 88: test3 file
######rreeeww
######oooiiiii
##PP
##ss
####<qqq> 89: test6 file
######hhhhggg
######bbbbaaa
######cccczzz
######uu test3
Expected output (was attached as an image):
I am new to Perl. I know how to open a file and read it line by line, but I don't understand how to store this kind of structure in CSV columns.
my $file = 'C:\\outputfile.txt';
open(my $fh, '<:encoding(UTF-8)', $file) or die "Could not open file '$file' $!";
while (my $row = <$fh>) { # reading each row till end of file
chomp $row;
# what should be done here?
}
Please help.
If you have questions about the code I can answer them, but this is not a good or best-practice example of Perl code; it was just quick to write.
my $previous_count = "-1"; # at the beginning, assume there are no spaces
my $current_count = "0"; # current default value
my $maximum_count = 3; # as you said
my $to_written = "";
my $delimiter_between_columns = ",";
my $newline_separator = ";";
my $symbol_at_the_beginning = "#"; # input any symbol; I suppose you want the "\s" whitespace character class, in which case set it like this: $var = "\s";
my @aggregate_array_of_ports = ();
while(my $row = <DATA>){
#ok, read.
chomp($row);
#print "row is : $row\n";
if($row =~ m/^([$symbol_at_the_beginning]*)/){
#print length($1);
$current_count = length($1) / 2; #take number of spaces divided by 2
$row =~ s/^[$symbol_at_the_beginning]+//;
#hint here, we can get counts as 0,1,2,3 <-see?
#if you take first and third word, you need to add 2 separators.
#OR if you take count with LESSER then previous count, it mean, that you need output
#print"prev : $previous_count and curr : $current_count\n ";
#print"I will write: $to_written\n";
#print "\n PREV: $previous_count --> CURR: $current_count \n";
if($previous_count>=$current_count){
#output here
print "$to_written".$newline_separator."\n";
$previous_count = 0;
$to_written = "";
}
$previous_count = 0 if($previous_count==-1);
#print "$delimiter_between_columns x($current_count-$previous_count)\n";
#print "current: $current_count previous: $previous_count \n";
$to_written .= $delimiter_between_columns x ($current_count - $previous_count + (($current_count-$previous_count)==3?2:0) )."$row";
if ($current_count==($maximum_count-1)){
#print "I input this!: $to_written\n";
$to_written = prepare_to_input_four_spaces($to_written, $delimiter_between_columns);
}
$previous_count = $current_count;
#print"\n";
}
}
#print "$to_written".$newline_separator."\n";
sub prepare_to_input_four_spaces{
my $str = shift; #take string
my $delim = shift;
if ($str=~ m/(.+?[>])\s+(\d+)[:]\s+(.+?)$/){
#here I want to capture the first group up to and including [>]: |(.+?[>])|
#next, some spaces |\s+|, and I want to catch the port number |(\d+)|.
#next, the |[:]| symbol and some spaces again |\s+| before the tail of the string.
#and then catch this tail: |(.+?)$|.
#where $ means the right "border" of the string (really, the end of the string)
$str = $1.$delim.$2.$delim.$3;
}
return $str;
}
=pod
__DATA__
ABC
EFG"123"
HIJK (12345)
LMNOP "Test"
sssstt"123"
QRS
TU"223"
vwx"55"
www"88"
yz:test1
__END__
=cut
__DATA__
ABC
##EFG"123"
####<HIJK> 22: test file
######LMNOP "Test"
######sssstt"123"
QRS
##TU"223"
####<www> 32: test2 file
######yz test1
####<www> 88: test3 file
######rreeeww
######oooiiiii
##PP
##ss
####<qqq> 89: test6 file
######hhhhggg
######bbbbaaa
######cccczzz
######uu test3
This is probably OK for you:
I just skipped putting in the header and used "|" as the separator; you can change that anyhow.
> perl -lne 'if(/^[^\#]/){if($.!=1){print "$a"};$a=$_;}else{s/^#*//g;$a.="|$_";}END{print $a}' temp
ABC|EFG"123"|HIJK (12345)|LMNOP "Test"|sssstt"123"
QRS|TU"223"|vwx"55"|www"88"|yz:test1
I'm new to Perl, and I've hit a mental roadblock. I need to extract information from a tab delimited file as shown below.
#name years risk total
adam 5 100 200
adam 5 50 100
adam 10 20 300
bill 20 5 100
bill 30 10 800
In this example, the tab delimited file shows length of investment, amount of money risked, and total at the end of investment.
I want to parse through this file, and for each name (e.g. adam), calculate sum of years invested 5+5, and calculate sum of earnings (200-100) + (100-50) + (300-20). I also would like to save the totals for each name (200, 100, 300).
Here's what I have tried so far:
my $filename;
my $seq_fh;
open $seq_fh, $frhitoutput
or die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my @line = split(/\s+/, $line);
my $yeartotal = 0;
my $earning = 0;
#$line[0] = name
#$line[1] = years
#$line[2] = start
#$line[3] = end
while (@line[0]){
$yeartotal += $line[1];
$earning += ($line[3]-$line[2]);
}
}
Any ideas of where I went wrong?
The Text::CSV module can be used to read tab-delimited data. Often much nicer than trying to manually hack yourself something up with split and so on when it comes to things like quoting, escaping, etc..
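As a minimal sketch of what that could look like for this data (the filename investments.tsv is made up, and the "#name ..." title line is simply skipped):

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => "\t", binary => 1 })
    or die Text::CSV->error_diag;

open my $fh, '<', 'investments.tsv' or die $!;
my %totals;
while (my $row = $csv->getline($fh)) {
    my ($name, $years, $risk, $total) = @$row;
    next if $name =~ /^#/;                   # skip the commented title line
    $totals{$name}{years}    += $years;
    $totals{$name}{earnings} += $total - $risk;
}
close $fh;

print "$_: $totals{$_}{years} years, $totals{$_}{earnings} earned\n"
    for sort keys %totals;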
You're wrong here: while(@line[0]){
I'd do:
my $seq_fh;
my %result;
open($seq_fh, $frhitoutput) || die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my @line = split(/\s+/, $line);
$result{$line[0]}{yeartotal} += $line[1];
$result{$line[0]}{earning} += $line[3] - $line[2];
}
You should use a hash, something like this:
my %hash;
while (my $line = <>) {
next if $line =~ /^#/;
my ($name, $years, $risk, $total) = split /\s+/, $line;
next unless defined $name and defined $years
and defined $risk and defined $total;
$hash{$name}{years} += $years;
$hash{$name}{risk} += $risk;
$hash{$name}{total} += $total;
$hash{$name}{earnings} += $total - $risk;
}
foreach my $name (sort keys %hash) {
print "$name earned $hash{$name}{earnings} in $hash{$name}{years}\n";
}
Nice opportunity to explore Perl's powerful command line options! :)
Code
Note: this code could be a command line one-liner, but it's a little bit easier to read this way. When writing it in a proper script file, you really should enable strict and warnings and use slightly better names. This version won't compile under strict; you'd have to declare our %d.
#!/usr/bin/perl -nal
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
# print summary
END { print "$_:\tyears: $d{$_}{y},\tearnings: $d{$_}{e}" for sort keys %d }
Output
adam: years: 20, earnings: 430
bill: years: 50, earnings: 885
Explanation
I make use of the -n switch here, which basically makes your code iterate over the input records (-l tells it to use lines and handles the line endings). The -a switch makes perl split each line into the array @F. Simplified version:
while (defined($_ = <STDIN>)) {
chomp $_;
our(@F) = split(' ', $_, 0);
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
}
%d is a hash with the names as keys and hashrefs as values, which contain years (y) and earnings (e).
The END block is executed after finishing the input line processing and outputs %d.
Use the O module's Deparse backend to view the code that is actually executed:
book:/tmp memowe$ perl -MO=Deparse tsv.pl
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
our(@F) = split(' ', $_, 0);
$d{$F[0]}{'y'} += $F[1];
$d{$F[0]}{'e'} += $F[3] - $F[2];
sub END {
print "${_}:\tyears: $d{$_}{'y'},\tearnings: $d{$_}{'e'}" foreach (sort keys %d);
}
;
}
tsv.pl syntax OK
It seems like a fixed-width file; I would use unpack for that.
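A rough sketch of that idea, with the start/length pairs from Definitions.txt in the earlier question hard-coded into an unpack template (each @N sets the offset and A trims trailing spaces):

use strict;
use warnings;

my $template = '@0A10 @10A10 @20A10 @30A20 @50A10 @60A10 @70A5 @75A15 @90A10';

open my $fh, '<', 'practice_data.txt' or die $!;
while (my $record = <$fh>) {
    chomp $record;
    print join(',', unpack $template, $record), "\n";
}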