Split a line on every 16th comma - perl

I am using perl to extract "Yes," or "No," from a large CSV, and output to a file using this code
open my $fin, "leads.csv";
my $str;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
}
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);
This is working correctly, and outputting data like this http://pastebin.com/r7Lwwz8p, however I need to break
to a new line after the 16th element so it looks like this on output: http://pastebin.com/xC8Lyk5R
Any tips/tricks greatly appreciated!

The following splits a line by commas, and then regroups them by 16 elements:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my #fields = split ',', $line;
while (my #data = splice #fields, 0, 16) {
print join(',', #data), "\n";
}
}
__DATA__
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineH,2,3,4,5,6,7,8,9,10,11,12
Outputs:
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineH,2,3,4,5,6,7,8,9,10,11,12

Use a variable to count the number of yes/no matches that you find, and then use the mod (%) operator to insert a newline into the string.
#!/usr/bin/perl
use strict;
use warnings;
open my $fin, "leads.csv";
my $str;
my $count = 0;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
$count++;
}
$str .= "\n" unless ($count % 16);
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);

Related

Pick up the longest peptide using perl

I want to find out the longest possible protein sequence translated from cds in 6 forward and reverse frame.
This is the example input format:
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
I would like to find out all the strings which start from "M" and stop at "X", count the each length of the strings and select the longest.
For example, in the case above:
the script will find,
>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX
Then count each match's length, and print the string and number of longest matches which is the result I want:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
But it prints out no answer. Does anyone know how to fix it? Any suggestion will be helpful.
#!/usr/bin/perl -w
use strict;
use warnings;
my #pep=();
my $i=();
my #Xnum=();
my $n=();
my %hash=();
my #k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
chomp;
if($_=~/^[^\>]/){
#pep=split(//, $_);
if($_ =~ /(X)/){
push(#Xnum, $1);
if($n >= 0 && $n <= $#Xnum){
if(#pep eq "M"){
for($i=1; $i<=$#pep; $i++){
$seq=join("",#pep);
$hash{$i}=$seq;
push(#k, $i);
}
}
elsif(#pep eq "X"){
$n=$n+1;
}
foreach (sort {$a cmp $b} #k){
print "$hash{$k[0]}\t$k[0]";
}
}
}
}
elsif($_=~/^\>/){
print "$_\n";
}
}
close IN;
Check out this Perl one-liner
$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 } } print "$x ". length($x)."\n";$x="" } else { $p=$_ } ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7
$
There's more than one way to do it!
Try this too:
print and next if /^>/;
chomp and my #z = $_ =~ /(M[^X]*X)/g;
my $m = "";
for my $s (#z) {
$m = $s if length $s > length $m
}
say "$m\t" . length $m
Output:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
uses >=5.14 and make sure to run script with perl -n
As a one-liner:
perl -E 'print and next if /^>/; chomp and my #z = $_ =~ /(M[^X]*X)/g; my $m = ""; for my $s (#z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt
Here is solution using reduce from List::Util.
Edit: mistakenly used maxstr which gave results but is not what was needed. Have reedited this post to use reduce (correctly) instead.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
open my $fh, '<', \<<EOF;
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
EOF
my $id;
while (<$fh>) {
chomp;
if (/^>/) {
$id = $_;
}
else {
my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
print "$id\n$data\t" . length($data) . "\n" if $data;
}
}
Here's my take on it.
I like fasta files tucked into a hash, with the fasta name as the key. This way you can just add descriptions to it, e.g. base composition etc...
#!/usr/local/ActivePerl-5.20/bin/env perl
use strict;
use warnings;
my %prot;
open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
my $string = do { local $/; <$fh> };
close $fh;
chomp $string;
my #fasta = grep {/./} split (">", $string);
for my $aa (#fasta){
my ($key, $value) = split ("\n", $aa);
$value =~ s/[A-Z]*(M.*M)[A-Z]/$1/;
$prot{$key}->{'len'} = length($value);
$prot{$key}->{'prot'} = $value;
}
for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
last;
}
__DATA__
>1232
ASDFASMJJJJJMFASDFSDAFSDDFSA
>2343
AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
Output
>2343
MJJJJJJJJJJJJJJM 16

Perl : Need to append two columns if the ID's are repeating

If id gets repeated I am appending app1, app2 and printing it once.
Input:
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
Output:
id|Name|app1|app2
1|abc|234,265|231,321|
2|xyz|123|215|
3|asd|213|235|
Output I'm getting:
id|Name|app1|app2
1|abc|234,231|
2|xyz|123,215|
1|abc|265,321|
3|asd|213,235|
My Code:
#! usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
my $counter = 0;
my %RepeatNumber;
my $pos=0;
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
open(FH, '<', join('', $basedir, $file)) || die $!;
my $line = readline(FH);
unless ($counter) {
chomp $line;
print OUTFILE $line;
print OUTFILE "\n";
}
while ($line = readline(FH)) {
chomp $line;
my #obj = split('\|',$line);
if($RepeatNumber{$obj[0]}++) {
my $str1= join("|",$obj[0]);
my $str2=join(",",$obj[2],$obj[3]);
print OUTFILE join("|",$str1,$str2);
print OUTFILE "\n";
}
}
This should do the trick:
use strict;
use warnings;
my $file_in = "doctor.txt";
open (FF, "<$file_in");
my $temp = <FF>; # remove first line
my %out;
while (<FF>)
{
my ($id, $Name, $app1, $app2) = split /\|/, $_;
$out{$id}[0] = $Name;
push #{$out{$id}[1]}, $app1;
push #{$out{$id}[2]}, $app2;
}
foreach my $key (keys %out)
{
print $key, "|", $out{$key}[0], "|", join (",", #{$out{$key}[1]}), "|", join (",", #{$out{$key}[2]}), "\n";
}
EDIT
To see what the %out contains (in case it's not clear), you can use
use Data::Dumper;
and print it via
print Dumper(%out);
I'd tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use 5.14.0;
my %stuff;
#extract the header row.
#use the regex to remove the linefeed, because
#we can't chomp it inline like this.
#works since perl 5.14
#otherwise we could just chomp (#header) later.
my ( $id, #header ) = split( /\|/, <DATA> =~ s/\n//r );
while (<DATA>) {
#turn this row into a hash of key-values.
my %row;
( $id, #row{#header} ) = split(/\|/);
#print for diag
print Dumper \%row;
#iterate each key, and insert into $row.
foreach my $key ( keys %row ) {
push( #{ $stuff{$id}{$key} }, $row{$key} );
}
}
#print for diag
print Dumper \%stuff;
print join ("|", "id", #header ),"\n";
#iterate ids in the hash
foreach my $id ( sort keys %stuff ) {
#join this record by '|'.
print join('|',
$id,
#turn inner arrays into comma separated via map.
map {
my %seen;
#use grep to remove dupes - e.g. "abc,abc" -> "abc"
join( ",", grep !$seen{$_}++, #$_ )
} #{ $stuff{$id} }{#header}
),
"\n";
}
__DATA__
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
This is perhaps a bit overkill for your application, but it should handle arbitrary column headings and arbitary numbers of duplicates. I'll coalesce them though - so the two abc entries don't end up abc,abc.
Output is:
id|Name|app1|app2
1|abc|234,265|231,321
2|xyz|123|215
3|asd|213|235
Another way of doing it which doesn't use a hash (in case you want to be more memory efficient), my contribution lies under the opens:
#!/usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
select(OUTFILE);
open(FH, '<', join('', $basedir, $file)) || die $!;
print(scalar(<FH>));
my #lastobj = (undef);
foreach my $obj (sort {$a->[0] <=> $b->[0]}
map {chomp;[split('|')]} <FH>) {
if(defined($lastobj[0]) &&
$obj[0] eq $lastobj[0])
{#lastobj = (#obj[0..1],
$lastobj[2].','.$obj[2],
$lastobj[3].','.$obj[3])}
else
{
if($lastobj[0] ne '')
{print(join('|',#lastobj),"|\n")}
#lastobj = #obj[0..3];
}
}
print(join('|',#lastobj),"|\n");
Note that split, without it's third argument ignores empty elements, which is why you have to add the last bar. If you don't do a chomp, you won't need to supply the bar or the trailing hard return, but you would have to record $obj[4].

Split list of delimited lines to hash

The following produces what i want.
#!/usr/bin/env perl
use 5.020;
use warnings;
use Data::Dumper;
sub command {
<DATA>
#in the reality instead of the DATA I have
#qx(some weird shell command what produces output like in the DATA);
}
my #lines = grep { !/^\s*$/ } command();
chomp #lines;
my $data;
#how to write the following nicer - more compact, elegant, etc.. ;)
for my $line (#lines) {
my #arr = split /:/, $line;
$data->{$arr[0]}->{text} = $arr[1];
$data->{$arr[0]}->{par} = $arr[2];
$data->{$arr[0]}->{val} = $arr[3];
}
say Dumper $data;
__DATA__
line1:some text1:par1:val1
line2:some text2:par2:val2
line3:some text3:par3:val3
Wondering how to write the loop in more perlish form. ;)
You can assign to a hash slice:
for my $line (#lines) {
my ($id, #arr) = split /:/, $line;
#{ $data->{$id} }{qw{ text par val }} = #arr;
}
Also, use the following instead of qx, so you don't need to store all the lines in an array:
open my $PIPE, '-|', 'command' or die $!;
while (<$PIPE>) {
# ...
}

cant retrieve values from hash reversal (Perl)

I've initialized a hash with Names and their class ranking as follows
a=>5,b=>2,c=>1,d=>3,e=>5
I've this code so far
my %Ranks = reverse %Class; #As I need to find out who's ranked first
print "\nFirst place goes to.... ", $Ranks{1};
The code only prints out
"First place goes to...."
I want it to print out
First place goes to....c
Could you tell me where' I'm going wrong here?
The class hash prints correctly
but If I try to print the reversed hash using
foreach $t (keys %Ranks) {
print "\n $t $Ranks{$t}"; }
It prints
5
abc23
cab2
ord
If this helps in any way
FULL CODE
#Script to read from the data file and initialize it into a hash
my %Code;
my %Ranks;
#Check whether the file exists
open(fh, "Task1.txt") or die "The File Does Not Exist!\n", $!;
while (my $line = <fh>) {
chomp $line;
my #fields = split /,/, $line;
$Code{$fields[0]} = $fields[1];
$Class{$fields[0]} = $fields[2];
}
close(fh);
#Prints the dataset
print "Code \t Name\n";
foreach $code ( keys %Code) {
print "$code \t $Code{$code}\n";
}
#Find out who comes first
my %Ranks = reverse %Class;
foreach $t (keys %Ranks)
{
print "\n $t $Ranks{$t}";
}
print "\nFirst place goes to.... ", $Ranks{1}, "\n";
When you want to check what your data structures actually contain, use Data::Dumper. use Data::Dumper; local $Data::Dumper::Useqq = 1; print(Dumper(\%Class));. You'll find un-chomped newlines.
You need to use chomp. At present your $fields[2] value has a trailing newline.
Change your file read loop to this
while (my $line = <fh>) {
chomp $line;
my #fields = split /,/, $line;
$Code{$fields[0]} = $fields[1];
$Class{$fields[0]} = $fields[2];
}

making one single line

keyword harry /
sally/
tally/
want that whenever the string matches with keyword it should also look for "/" character.This signifies continuation of line
Then I want output as
keyword harry sally tally
==========================
My current code:
#!/usr/bin/perl
open (file2, "trial.txt");
$keyword_1 = keyword;
foreach $line1 (<file2>) {
s/^\s+|\s+$//g;
if ($line1 =~ $keyword_1) {
$line2 =~ (s/$\//g, $line1) ;
print " $line2 " ;
}
}
If the ===== lines in your question are supposed to be in the output, then use
#! /usr/bin/env perl
use strict;
use warnings;
*ARGV = *DATA; # for demo only; delete
sub print_line {
my($line) = #_;
$line =~ s/\n$//; # / fix Stack Overflow highlighting
print $line, "\n",
"=" x (length($line) + 1), "\n";
}
my $line = "";
while (<>) {
$line .= $line =~ /^$|[ \t]$/ ? $_ : " $_";
if ($line !~ s!/\n$!!) { # / ditto
print_line $line;
$line = "";
}
}
print_line $line if length $line;
__DATA__
keyword jim-bob
keyword harry /
sally/
tally/
Output:
keyword jim-bob
================
keyword harry sally tally
==========================
You did not specify what to do with the lines that do not contain the keyword. You might use this code as an inspiration, though:
#!/usr/bin/perl
use warnings;
use strict;
my $on_keyword_line;
while (<>) {
if (/keyword/ or $on_keyword_line) {
chomp;
if (m{/$}) {
chop;
$on_keyword_line = 1;
} else {
$on_keyword_line = 0;
}
print;
} else {
$on_keyword_line = 0;
print "\n";
}
}
A redo is useful when dealing with concatenating continuation lines.
my $line;
while ( defined( $line = <DATA> )) {
chomp $line;
if ( $line =~ s{/\s*$}{ } ) {
$line .= <DATA>;
redo unless eof(DATA);
}
$line =~ s{/}{};
print "$line\n";
}
__DATA__
keyword harry /
sally/
tally/
and
done!!!
$ ./test.pl
keyword harry sally tally and
done!!!
I think you need to simply concatenate all lines that end in a slash, regardless of the keyword.
I suggest this code.
Updated to account for the OP's comment that continuation lines are terminated by backslashes.
while (<>) {
s|\\\s*\z||;
print;
}