Split a line on every 16th comma - perl
I am using perl to extract "Yes," or "No," from a large CSV, and output to a file using this code
open my $fin, "leads.csv";
my $str;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
}
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);
This is working correctly, and outputting data like this http://pastebin.com/r7Lwwz8p, however I need to break
to a new line after the 16th element so it looks like this on output: http://pastebin.com/xC8Lyk5R
Any tips/tricks greatly appreciated!
The following splits a line by commas, and then regroups them by 16 elements:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my @fields = split ',', $line;
while (my @data = splice @fields, 0, 16) {
print join(',', @data), "\n";
}
}
__DATA__
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineH,2,3,4,5,6,7,8,9,10,11,12
Outputs:
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineH,2,3,4,5,6,7,8,9,10,11,12
Use a variable to count the number of yes/no matches that you find, and then use the mod (%) operator to insert a newline into the string.
#!/usr/bin/perl
use strict;
use warnings;
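open my $fin, '<', 'leads.csv' or die "Cannot open leads.csv: $!";
my $str = '';
my $count = 0;
while (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
# Start a new line after every 16th match.
$str .= "\n" unless ++$count % 16;
}
}
close $fin;
open my $out, '>>', 'data.txt' or die "Cannot open data.txt: $!";
print $out $str;
close $out;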
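The counting idea can be tried without leads.csv. This is only a minimal sketch, assuming the matches have already been collected into a list; the sub name group_matches and the sample values are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: join values with commas and start a new line
# after every $per_line values (16 in the question).
sub group_matches {
    my ($per_line, @values) = @_;
    my $str   = '';
    my $count = 0;
    for my $value (@values) {
        $str .= "$value,";
        # The newline is added only when a value was just appended.
        $str .= "\n" unless ++$count % $per_line;
    }
    return $str;
}

# Made-up sample: five values, two per line.
print group_matches(2, qw(Yes No No Yes Yes)), "\n";
```

Keeping the newline test next to the increment means lines that do not match the pattern can never add stray blank lines.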
Related
Pick up the longest peptide using perl
I want to find the longest possible protein sequence translated from a CDS in the 6 forward and reverse frames. This is the example input format:
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
I would like to find all the strings which start at "M" and stop at "X", count the length of each, and select the longest. For example, in the case above, the script will find:
>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX
Then it should count each match's length and print the longest match and its length, which is the result I want:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
But my script prints no answer. Does anyone know how to fix it? Any suggestion will be helpful.
#!/usr/bin/perl -w
use strict;
use warnings;
my @pep=();
my $i=();
my @Xnum=();
my $n=();
my %hash=();
my @k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
    chomp;
    if($_=~/^[^\>]/){
        @pep=split(//, $_);
        if($_ =~ /(X)/){
            push(@Xnum, $1);
            if($n >= 0 && $n <= $#Xnum){
                if(@pep eq "M"){
                    for($i=1; $i<=$#pep; $i++){
                        $seq=join("",@pep);
                        $hash{$i}=$seq;
                        push(@k, $i);
                    }
                }
                elsif(@pep eq "X"){
                    $n=$n+1;
                }
                foreach (sort {$a cmp $b} @k){
                    print "$hash{$k[0]}\t$k[0]";
                }
            }
        }
    }
    elsif($_=~/^\>/){
        print "$_\n";
    }
}
close IN;
Check out this Perl one-liner:
$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 } } print "$x ". length($x)."\n";$x="" } else { $p=$_ } ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7
$
Note that [^M] cannot cross the second M in >222, so this version reports MPPPPPX (7) there rather than the 38-character match the question asks for.
There's more than one way to do it! Try this too:
print and next if /^>/;
chomp and my @z = $_ =~ /(M[^X]*X)/g;
my $m = "";
for my $s (@z) { $m = $s if length $s > length $m }
say "$m\t" . length $m
Output:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
This uses Perl >= 5.14; make sure to run the script with perl -n.
As a one-liner:
perl -E 'print and next if /^>/; chomp and my @z = $_ =~ /(M[^X]*X)/g; my $m = ""; for my $s (@z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt
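The difference between the two patterns above comes down to which character the class excludes: M[^X]*X runs from an M to the first X after it, even across another M. A minimal sketch of that matching step as a standalone sub, where the name longest_m_to_x is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the longest substring that starts at "M"
# and runs to the first following "X" (empty string if there is none).
sub longest_m_to_x {
    my ($seq) = @_;
    my $best = '';
    # In list context, a /g match without capture groups returns every match.
    for my $match ($seq =~ /M[^X]*X/g) {
        $best = $match if length($match) > length($best);
    }
    return $best;
}

my $seq  = 'WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK';
my $best = longest_m_to_x($seq);
print "$best\t", length($best), "\n";
```

Because matches found with /g do not overlap, a sequence with several M...X stretches yields one candidate per stretch, and the loop keeps the longest.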
Here is a solution using reduce from List::Util.
Edit: I mistakenly used maxstr, which gave results but is not what was needed. I have re-edited this post to use reduce (correctly) instead.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
open my $fh, '<', \<<EOF;
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
EOF
my $id;
while (<$fh>) {
    chomp;
    if (/^>/) {
        $id = $_;
    }
    else {
        my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
        print "$id\n$data\t" . length($data) . "\n" if $data;
    }
}
Here's my take on it. I like fasta files tucked into a hash, with the fasta name as the key. This way you can just add descriptions to it, e.g. base composition etc.
#!/usr/local/ActivePerl-5.20/bin/env perl
use strict;
use warnings;
my %prot;
open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
my $string = do { local $/; <$fh> };
close $fh;
chomp $string;
my @fasta = grep {/./} split (">", $string);
for my $aa (@fasta){
    my ($key, $value) = split ("\n", $aa);
    $value =~ s/[A-Z]*(M.*M)[A-Z]*/$1/;
    $prot{$key}->{'len'} = length($value);
    $prot{$key}->{'prot'} = $value;
}
for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
    print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
    last;
}
__DATA__
>1232
ASDFASMJJJJJMFASDFSDAFSDDFSA
>2343
AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
Output
>2343
MJJJJJJJJJJJJJJM 16
Perl: Need to append two columns if the IDs are repeating
If an id is repeated, I want to append app1 and app2 and print the record once.
Input:
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
Output:
id|Name|app1|app2
1|abc|234,265|231,321|
2|xyz|123|215|
3|asd|213|235|
Output I'm getting:
id|Name|app1|app2
1|abc|234,231|
2|xyz|123,215|
1|abc|265,321|
3|asd|213,235|
My code:
#! usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
my $counter = 0;
my %RepeatNumber;
my $pos=0;
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
open(FH, '<', join('', $basedir, $file)) || die $!;
my $line = readline(FH);
unless ($counter) {
    chomp $line;
    print OUTFILE $line;
    print OUTFILE "\n";
}
while ($line = readline(FH)) {
    chomp $line;
    my @obj = split('\|',$line);
    if($RepeatNumber{$obj[0]}++) {
        my $str1= join("|",$obj[0]);
        my $str2=join(",",$obj[2],$obj[3]);
        print OUTFILE join("|",$str1,$str2);
        print OUTFILE "\n";
    }
}
This should do the trick:
use strict;
use warnings;
my $file_in = "doctor.txt";
open (FF, "<$file_in");
my $temp = <FF>; # remove first line
my %out;
while (<FF>) {
    my ($id, $Name, $app1, $app2) = split /\|/, $_;
    $out{$id}[0] = $Name;
    push @{$out{$id}[1]}, $app1;
    push @{$out{$id}[2]}, $app2;
}
foreach my $key (keys %out) {
    print $key, "|", $out{$key}[0], "|", join (",", @{$out{$key}[1]}), "|", join (",", @{$out{$key}[2]}), "\n";
}
EDIT: To see what %out contains (in case it's not clear), you can add use Data::Dumper; and print it via print Dumper(\%out);
I'd tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use 5.14.0;
my %stuff;
#extract the header row.
#use the regex to remove the linefeed, because
#we can't chomp it inline like this.
#works since perl 5.14
#otherwise we could just chomp (@header) later.
my ( $id, @header ) = split( /\|/, <DATA> =~ s/\n//r );
while (<DATA>) {
    #turn this row into a hash of key-values.
    my %row;
    ( $id, @row{@header} ) = split(/\|/);
    #print for diag
    print Dumper \%row;
    #iterate each key, and insert into %stuff.
    foreach my $key ( keys %row ) {
        push( @{ $stuff{$id}{$key} }, $row{$key} );
    }
}
#print for diag
print Dumper \%stuff;
print join ("|", "id", @header ),"\n";
#iterate ids in the hash
foreach my $id ( sort keys %stuff ) {
    #join this record by '|'.
    print join('|',
        $id,
        #turn inner arrays into comma separated via map.
        map {
            my %seen;
            #use grep to remove dupes - e.g. "abc,abc" -> "abc"
            join( ",", grep !$seen{$_}++, @$_ )
        } @{ $stuff{$id} }{@header}
    ), "\n";
}
__DATA__
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
This is perhaps a bit overkill for your application, but it should handle arbitrary column headings and arbitrary numbers of duplicates. It does coalesce repeated values, though, so the two abc entries don't end up as abc,abc. Output is:
id|Name|app1|app2
1|abc|234,265|231,321
2|xyz|123|215
3|asd|213|235
Another way of doing it, which doesn't use a hash (in case you want to be more memory efficient). My contribution lies under the opens:
#!/usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
select(OUTFILE);
open(FH, '<', join('', $basedir, $file)) || die $!;
print(scalar(<FH>));
my @lastobj;
# Each $obj is an array reference holding one split row.
foreach my $obj (sort {$a->[0] <=> $b->[0]} map {chomp; [split(/\|/)]} <FH>) {
    if(@lastobj && $obj->[0] eq $lastobj[0]) {
        @lastobj = (@$obj[0..1], $lastobj[2].','.$obj->[2], $lastobj[3].','.$obj->[3]);
    }
    else {
        print(join('|',@lastobj),"|\n") if @lastobj;
        @lastobj = @$obj[0..3];
    }
}
print(join('|',@lastobj),"|\n") if @lastobj;
Note that split, without its third argument, discards trailing empty fields, which is why you have to add the last bar. If you don't chomp, you won't need to supply the bar or the trailing newline, but you would have to carry $obj->[4] along as well.
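The hash-based answers above print ids in whatever order the hash or a sort yields. If the input order matters, one option is to record each id the first time it is seen. A minimal sketch under that assumption; the sub name merge_rows is made up, and the rows are passed in directly so that it does not depend on doctor.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: merge rows that share an id, joining the app1 and
# app2 columns with commas and keeping ids in first-seen order.
sub merge_rows {
    my @lines = @_;
    my (%merged, @order);
    for my $line (@lines) {
        chomp $line;
        my ($id, $name, $app1, $app2) = split /\|/, $line;
        # Remember the id before the pushes below create its entry.
        push @order, $id unless exists $merged{$id};
        $merged{$id}{name} = $name;
        push @{ $merged{$id}{app1} }, $app1;
        push @{ $merged{$id}{app2} }, $app2;
    }
    return map {
        join('|', $_, $merged{$_}{name},
            join(',', @{ $merged{$_}{app1} }),
            join(',', @{ $merged{$_}{app2} })) . '|'
    } @order;
}

print "$_\n" for merge_rows('1|abc|234|231|', '2|xyz|123|215|',
                            '1|abc|265|321|', '3|asd|213|235|');
```

The trailing bar is appended by hand because split drops the empty field after the last delimiter.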
Split list of delimited lines to hash
The following produces what I want.
#!/usr/bin/env perl
use 5.020;
use warnings;
use Data::Dumper;
sub command {
    <DATA>
    #in the reality instead of the DATA I have
    #qx(some weird shell command what produces output like in the DATA);
}
my @lines = grep { !/^\s*$/ } command();
chomp @lines;
my $data;
#how to write the following nicer - more compact, elegant, etc.. ;)
for my $line (@lines) {
    my @arr = split /:/, $line;
    $data->{$arr[0]}->{text} = $arr[1];
    $data->{$arr[0]}->{par} = $arr[2];
    $data->{$arr[0]}->{val} = $arr[3];
}
say Dumper $data;
__DATA__
line1:some text1:par1:val1
line2:some text2:par2:val2
line3:some text3:par3:val3
Wondering how to write the loop in a more perlish form. ;)
You can assign to a hash slice:
for my $line (@lines) {
    my ($id, @arr) = split /:/, $line;
    @{ $data->{$id} }{qw{ text par val }} = @arr;
}
Also, use the following instead of qx, so you don't need to store all the lines in an array:
open my $PIPE, '-|', 'command' or die $!;
while (<$PIPE>) {
    # ...
}
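The hash-slice assignment can be checked on its own. A minimal sketch, assuming the lines are already in memory; the sub name parse_lines is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "id:text:par:val" lines into a hash of hashes.
sub parse_lines {
    my %data;
    for my $line (@_) {
        my ($id, @fields) = split /:/, $line;
        # Hash slice: fill the three named keys in one assignment.
        # $data{$id} springs into existence as a hash reference here.
        @{ $data{$id} }{qw(text par val)} = @fields;
    }
    return \%data;
}

my $data = parse_lines('line1:some text1:par1:val1',
                       'line2:some text2:par2:val2');
print "$data->{line2}{text}\n";
```

If a line has fewer than four fields, the missing keys are simply set to undef, so a real script may want to validate the field count first.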
Can't retrieve values from hash reversal (Perl)
I've initialized a hash with names and their class ranking as follows:
a=>5,b=>2,c=>1,d=>3,e=>5
I have this code so far:
my %Ranks = reverse %Class; #As I need to find out who's ranked first
print "\nFirst place goes to.... ", $Ranks{1};
The code only prints out "First place goes to....". I want it to print out
First place goes to....c
Could you tell me where I'm going wrong here? The class hash prints correctly, but if I try to print the reversed hash using
foreach $t (keys %Ranks) {
    print "\n $t $Ranks{$t}";
}
it prints
5 abc23 cab2 ord
in case this helps in any way.
FULL CODE
#Script to read from the data file and initialize it into a hash
my %Code;
my %Ranks;
#Check whether the file exists
open(fh, "Task1.txt") or die "The File Does Not Exist!\n", $!;
while (my $line = <fh>) {
    chomp $line;
    my @fields = split /,/, $line;
    $Code{$fields[0]} = $fields[1];
    $Class{$fields[0]} = $fields[2];
}
close(fh);
#Prints the dataset
print "Code \t Name\n";
foreach $code ( keys %Code) {
    print "$code \t $Code{$code}\n";
}
#Find out who comes first
my %Ranks = reverse %Class;
foreach $t (keys %Ranks) {
    print "\n $t $Ranks{$t}";
}
print "\nFirst place goes to.... ", $Ranks{1}, "\n";
When you want to check what your data structures actually contain, use Data::Dumper:
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
print(Dumper(\%Class));
You'll find un-chomped newlines.
You need to use chomp. At present your $fields[2] value has a trailing newline. Change your file read loop to this:
while (my $line = <fh>) {
    chomp $line;
    my @fields = split /,/, $line;
    $Code{$fields[0]} = $fields[1];
    $Class{$fields[0]} = $fields[2];
}
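chomp only removes the current input record separator, so a file with Windows line endings can still leave a "\r" on the last field. A minimal sketch that strips any trailing whitespace instead, assuming "code,name,rank" lines like those in the question; the sub name ranks_from_lines and the sample names are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: build a rank => code lookup from "code,name,rank" lines.
sub ranks_from_lines {
    my %class;
    for my $line (@_) {
        (my $clean = $line) =~ s/\s+\z//;   # drops "\n" and "\r\n" alike
        my ($code, $name, $rank) = split /,/, $clean;
        $class{$code} = $rank;
    }
    # reverse swaps keys and values; codes that share a rank collapse to one.
    my %ranks = reverse %class;
    return \%ranks;
}

my $ranks = ranks_from_lines("a,Ann,5\r\n", "c,Cat,1\r\n", "b,Bob,2\n");
print "First place goes to.... $ranks->{1}\n";
```

With the question's data, a and e both have rank 5, so only one of them would survive the reversal; a hash of arrays would be needed to keep ties.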
making one single line
keyword harry /
sally/
tally/
I want that, whenever a line matches the keyword, the script also looks for the "/" character, which signifies that the line continues. Then I want the output as
keyword harry sally tally
==========================
My current code:
#!/usr/bin/perl
open (file2, "trial.txt");
$keyword_1 = keyword;
foreach $line1 (<file2>) {
    s/^\s+|\s+$//g;
    if ($line1 =~ $keyword_1) {
        $line2 =~ (s/$\//g, $line1) ;
        print " $line2 " ;
    }
}
If the ===== lines in your question are supposed to be in the output, then use
#! /usr/bin/env perl
use strict;
use warnings;
*ARGV = *DATA; # for demo only; delete
sub print_line {
    my($line) = @_;
    $line =~ s/\n$//; # / fix Stack Overflow highlighting
    print $line, "\n", "=" x (length($line) + 1), "\n";
}
my $line = "";
while (<>) {
    $line .= $line =~ /^$|[ \t]$/ ? $_ : " $_";
    if ($line !~ s!/\n$!!) { # / ditto
        print_line $line;
        $line = "";
    }
}
print_line $line if length $line;
__DATA__
keyword jim-bob
keyword harry /
sally/
tally/
Output:
keyword jim-bob
================
keyword harry sally tally
==========================
You did not specify what to do with the lines that do not contain the keyword. You might use this code as an inspiration, though:
#!/usr/bin/perl
use warnings;
use strict;
my $on_keyword_line;
while (<>) {
    if (/keyword/ or $on_keyword_line) {
        chomp;
        if (m{/$}) {
            chop;
            $on_keyword_line = 1;
        } else {
            $on_keyword_line = 0;
        }
        print;
    } else {
        $on_keyword_line = 0;
        print "\n";
    }
}
A redo is useful when concatenating continuation lines.
my $line;
while ( defined( $line = <DATA> )) {
    chomp $line;
    if ( $line =~ s{/\s*$}{ } ) {
        $line .= <DATA>;
        redo unless eof(DATA);
    }
    $line =~ s{/}{};
    print "$line\n";
}
__DATA__
keyword harry /
sally/
tally/
and done!!!
$ ./test.pl
keyword harry sally tally and done!!!
I think you need to simply concatenate all lines that end in a slash, regardless of the keyword. I suggest this code.
Updated to account for the OP's comment that continuation lines are terminated by backslashes.
while (<>) {
    s|\\\s*\z||;
    print;
}
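The same idea can be written as a sub that buffers text until a line arrives without the continuation marker. This is only a sketch: it assumes the forward slash from the original question is the marker, and the name join_continuations is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: join each line that ends in "/" with the line
# that follows it, separating the pieces with a single space.
sub join_continuations {
    my @out;
    my $buffer = '';
    for my $line (@_) {
        (my $text = $line) =~ s/\s+\z//;        # drop newline and trailing blanks
        my $continues = $text =~ s{\s*/\z}{};   # strip the marker, remember it was there
        $buffer .= length($buffer) ? " $text" : $text;
        next if $continues;
        push @out, $buffer;
        $buffer = '';
    }
    push @out, $buffer if length($buffer);      # input ended on a continuation
    return @out;
}

print "$_\n" for join_continuations("keyword harry /\n", "sally/\n",
                                    "tally\n", "keyword jim-bob\n");
```

Returning a list rather than printing directly makes it easy to add the ===== underline from the question afterwards, or to filter on the keyword.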