How to remove fasta formatted sequences that contain Ns - perl

I have a FASTA file like this:
">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronY
acgtacgtNNNNa
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt
I need to remove sequences with at least 2 Ns in a row (because these introns are misannotated).
Here, that would be sequence ">ENS..._intronY" (lines 3, 4, and 5 should be removed).
Any suggestions?
Thank you,

With gawk
gawk -v RS='">' '!/NN/{printf "%s", $0 RT}' file
">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt

Since it appears you're pursuing bioinformatics, consider becoming familiar with Bio::SeqIO, as it'll help with this and many other fasta parsing jobs:
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => shift, -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
print '>' . $seq->id . ' ' . $seq->desc . "\n" . $seq->seq . "\n"
if $seq->seq !~ /nn/i;
}
Usage: perl script.pl inFile [>outFile]
The last, optional parameter directs output to a file.
Output on your dataset:
>ENS..._intronX
acgtacgtacgtacgt
>ENS..._intronZ
acgtacgtacgtacgtacgtacgtacgtacgt
Hope this helps!
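If neither gawk nor BioPerl is at hand, the same filter can be sketched in plain Python. This is only a rough sketch: the function name drop_records_with_n_run and the min_run parameter are made up for illustration, and it assumes header lines start with ">". Like the Bio::SeqIO version, it joins the wrapped sequence lines before testing, so a run of Ns split across a line break is still detected, and the test is case-insensitive.

```python
def drop_records_with_n_run(fasta_text, min_run=2):
    """Return fasta_text without records whose sequence has min_run or more Ns in a row.

    Hypothetical helper: assumes every record starts with a '>' header line.
    """
    records = []  # list of (header, [sequence lines])
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append((line, []))
        elif records and line.strip():
            records[-1][1].append(line.strip())

    bad_run = "N" * min_run
    out = []
    for header, seq_lines in records:
        # Join wrapped lines so a run of Ns across a line break is still seen.
        if bad_run in "".join(seq_lines).upper():
            continue
        out.extend([header] + seq_lines)
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    sample = (
        ">intronX\nacgtacgtacgtacgt\n"
        ">intronY\nacgtacgtNNNNa\nacgtacgtacgtacgt\n"
        ">intronZ\nacgtacgtacgtacgt\nacgtacgtacgtacgt\n"
    )
    print(drop_records_with_n_run(sample), end="")
```

Unlike the Bio::SeqIO script, this keeps the original line wrapping of the records that survive.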

Related

How do I search and print the matched word in UNIX or Perl?

1=ABC,2=mnz,3=xyz
1=pqr,3=ijk,2=lmn
I have this in a text file. I want to search for 1= and print only the matched words, 1=ABC and 1=pqr.
Any suggestions in Perl or Unix?
Input:
$ cat grep.in
1=ABC,2=mnz,3=xyz
1=pqr,3=ijk,2=lmn
4=pqr,3=ijk,2=lmn
Command:
$ grep -o '1=[^,]\+' grep.in
1=ABC
1=pqr
Explanations:
You can just use grep on your input
-o is to output only the matching pattern
1=[^,]\+ the regex matches strings that start with 1= followed by at least one character that is not a comma (I have assumed that the value to the right of the = never contains a comma, so a comma only appears as the separator)
if you want to accept an empty value as well, replace the \+ with *
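For comparison, the same pattern can be tried out in Python. This is a small sketch, not a replacement for the grep command; the function name fields_with_key is made up for illustration. It uses the same idea: the key, an equals sign, then one or more characters that are not a comma.

```python
import re


def fields_with_key(line, key="1"):
    """Return every 'key=value' substring of line, value being a run of non-comma characters.

    Hypothetical helper mirroring grep -o '1=[^,]\\+'.
    """
    pattern = re.escape(key) + r"=[^,\n]+"
    return re.findall(pattern, line)


if __name__ == "__main__":
    for line in ["1=ABC,2=mnz,3=xyz", "1=pqr,3=ijk,2=lmn", "4=pqr,3=ijk,2=lmn"]:
        for match in fields_with_key(line):
            print(match)
```

As with the grep version, the pattern is not anchored to the start of a field, so a key such as 11= would also produce a match on its trailing 1=.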
It appears that your input data is in CSV format. Here is a Perl solution based on Text::CSV
parse the CSV content row-wise
print out columns that start with 1=
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
eol => "\n",
}) or die "CSV\n";
# parse
while (my $row = $csv->getline(\*DATA)) {
foreach (@{ $row }) {
print "$_\n" if /^1=/;
}
}
exit 0;
__DATA__
1=ABC,2=mnz,3=xyz
1=pqr,3=ijk,2=lmn
Test run:
$ perl dummy.pl
1=ABC
1=pqr
Replace DATA with STDIN to read the input from standard input instead.
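The same row-wise idea can be sketched with Python's standard csv module. The function name cells_starting_with is hypothetical. The point of going through a CSV parser rather than a plain split is that a quoted cell containing a comma stays in one piece.

```python
import csv
import io


def cells_starting_with(text, prefix="1="):
    """Parse text as CSV and return the cells that start with prefix (hypothetical helper)."""
    matches = []
    for row in csv.reader(io.StringIO(text)):
        matches.extend(cell for cell in row if cell.startswith(prefix))
    return matches


if __name__ == "__main__":
    data = "1=ABC,2=mnz,3=xyz\n1=pqr,3=ijk,2=lmn\n"
    for cell in cells_starting_with(data):
        print(cell)
```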

Move last character of line to specific column -- sed? awk?

I need to replace all lines ending with specific character (say, &) such that this character should be in certain column (say, 80).
Which tool is best?
I have started thinking about sed:
sed 's/\(.*\)&/\1 <what should be here??> &/'
but cannot understand how to replace with variable number of spaces such that & goes to column 80.
Thanks!
Use the /e switch to s/// that tells Perl to evaluate the replacement portion to compute the result.
#! /usr/bin/env perl
use strict;
use warnings;
while (<>) {
s/^(.*)(&)$/$1 . " " x (79 - length $1) . $2/e;
print;
}
Sample run:
$ echo -e 'foo&\n&\nbar &\nbaz' | ./align-ampersands
foo                                                                            &
                                                                               &
bar                                                                            &
baz
If your input contains TAB characters, you will need to use more sophisticated processing.
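One possible way to deal with TAB characters is to expand them to spaces before padding. Here is a rough Python sketch of that idea; the function name align_trailing_marker is made up, and the tab stop width of 8 is an assumption that may not match your editor or terminal. Note that the tabs are replaced by spaces in the output.

```python
def align_trailing_marker(line, marker="&", column=80, tabsize=8):
    """Move a trailing marker to the given 1-based column, expanding tabs first.

    Hypothetical helper. Lines that do not end with marker are returned unchanged
    (minus the newline). Lines already longer than the target column are not
    truncated, so the marker simply stays at the end.
    """
    text = line.rstrip("\n")
    if not text.endswith(marker):
        return text
    # Expand tabs so that len() reflects the displayed width of the line.
    body = text[: -len(marker)].expandtabs(tabsize)
    return body.ljust(column - 1) + marker


if __name__ == "__main__":
    for sample in ["foo&", "&", "bar &", "baz", "a\tb&"]:
        print(align_trailing_marker(sample))
```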
Not sure if I understand your question correctly but you can try something like (assuming your file is space delimited):
awk '/&$/ {for(i=1;i<=NF;i++) $i=(i==80)?"& "$i:$i}1' yourFile
Awk and Perl will both work. Both have printf and substr:
#! /usr/bin/env perl
use warnings;
use strict;
my $string = "this is some text &";
my $last_char = substr($string, -1, 1);
$string = substr ($string, 0, length ($string ) - 1);
printf qq(%-79.79s%s\n), $string, $last_char;
The substr command is available in both Awk and Perl.
The whole command could be made into a one liner:
printf qq(%-79.79s%s\n), substr ($string, 0, length ($string ) - 1), substr($string, -1, 1);
awk '/&$/{$80="&"}1' file

Search and replace in Perl for particular word

I have a huge file which consists of similar lines below , with different clocks:
cmd -quiet [get_ports p1] ref_clocks "cudtclk_sp cudtclk"
cmd -quiet [get_ports p2] clock "cu2xdtclk_sp cu2xdtclk"
And I need to replace cudtclk with some other name like cdtclk whenever I have ref_clocks in my file, globally.
I have written following code but it doesn't seem to be working.
#!/usr/bin/perl
use strict;
use warnings;
sub clock_change
{       # Get the subroutine's argument.
my $arg = shift;
# Hash of stuff we want to replace.
my %replace = (
"cudtclk" => "cdtclk",
);
# See if there's a replacement for the given text.
my $text = $replace{$arg};
if(defined($text)) {
return $text;
}
return $arg;
}
open PAR, "<file name>";
while(<PAR>) {
$_ =~ s/\S+\s\S+\s\S+\s\S+\sref_clocks\s+(\S+\s+\S+)/clock_change($1)/eig;
print $_;   ##print it to some file later.
}
"And I need to replace cudtclk with some other name like cdtclk"
perl -pe 's/\bcudtclk\b/cdtclk/' thefile > newfile
"whenever I have ref_clocks"
perl -pe 's/\bcudtclk\b/cdtclk/ if /\bref_clocks\b/' thefile > newfile
Alternatively:
# saves original file as file.bak
perl -i.bak -pe 's/\bcudtclk\b/cdtclk/ if /\bref_clocks\b/' file
Tighten to suit your data, as necessary.
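If you want to experiment with the "only on lines that mention ref_clocks" logic outside of a one-liner, here is a rough Python sketch of the same approach. The function name rename_clock and its parameters are made up for illustration. As in the Perl commands above, the word boundaries mean that cudtclk_sp is left alone, because the underscore counts as a word character.

```python
import re


def rename_clock(line, old="cudtclk", new="cdtclk", keyword="ref_clocks"):
    """Replace whole-word occurrences of old with new, but only on lines containing keyword.

    Hypothetical helper mirroring: s/\\bcudtclk\\b/cdtclk/ if /\\bref_clocks\\b/
    (it replaces every whole-word occurrence on a matching line).
    """
    if not re.search(r"\b" + re.escape(keyword) + r"\b", line):
        return line
    # A function replacement avoids any special treatment of backslashes in new.
    return re.sub(r"\b" + re.escape(old) + r"\b", lambda m: new, line)


if __name__ == "__main__":
    lines = [
        'cmd -quiet [get_ports p1] ref_clocks "cudtclk_sp cudtclk"',
        'cmd -quiet [get_ports p2] clock "cu2xdtclk_sp cu2xdtclk"',
    ]
    for line in lines:
        print(rename_clock(line))
```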
Although the substitution seems unnecessarily complex, you can fix it with something similar to:
$_ =~ s/(ref_clocks\s+")([^_]+)_sp(\s+)\2/
$1.clock_change($2)."_sp$3".clock_change($2)/eig;

reformat text in perl

I have a file of 1000 lines, each line in the format
filename dd/mm/yyyy hh:mm:ss
I want to convert it to read
filename mmddhhmm.ss
I have been attempting to do this in Perl and awk with no success and would appreciate any help.
thanks
You can do a simple regular expression replacement if the format is really fixed:
s|(..)/(..)/.... (..):(..):(..)$|$2$1$3$4.$5|
I used | as a separator so that I do not need to escape the slashes.
You can use this with Perl on the shell in place:
perl -pi -e 's|(..)/(..)/.... (..):(..):(..)$|$2$1$3$4.$5|' file
(Look up the option descriptions with man perlrun).
Another, somewhat ugly, approach: for each line ($str here) that you read from the file, do something like this:
my $str = 'filename 26/12/2010 21:09:12';
my @arr1 = split(' ',$str);
my @arr2 = split('/',$arr1[1]);
my @arr3 = split(':',$arr1[2]);
my $day = $arr2[0];
my $month = $arr2[1];
my $year = $arr2[2];   # parsed but not part of the requested output format
my $hours = $arr3[0];
my $minutes = $arr3[1];
my $seconds = $arr3[2];
print $arr1[0].' '.$month.$day.$hours.$minutes.'.'.$seconds."\n";
Pipe your file to a perl script with:
while ( my $line = <> ) {
    if ( $line =~ /(\S+)\s+(\d{2})\/(\d{2})\/\d{4}\s+(\d{2}):(\d{2}):(\d{2})/ ) {
        # $2 is the day and $3 the month, so print them swapped
        print $1 . " " . $3 . $2 . $4 . $5 . '.' . $6 . "\n";
    }
}
Redirect the output however you want.
This says match line to:
(non-whitespace>=1)whitespace>=1(2digits)/(2digits)/4digits
whitespace>=1(2digits):(2digits):(2digits)
Capture groups are in () numbered 1 to 6 left to right.
Using sed:
sed -r 's|([0-9]{2})/([0-9]{2})/[0-9]{4} |\2\1|; s/://; s/:/./' file.txt
swap the day and the month, deleting the slashes and the year /yyyy
delete the first colon
change the remaining colon to a dot
Using awk:
awk '{split($2,d,"/"); split($3,t,":"); print $1, d[2] d[1] t[1] t[2] "." t[3]}' file.txt
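The split-based approach translates almost word for word into Python. This is a minimal sketch that assumes every line really has the three whitespace-separated parts shown in the question; the function name reformat is made up for illustration.

```python
def reformat(line):
    """Turn 'filename dd/mm/yyyy hh:mm:ss' into 'filename mmddhhmm.ss' (hypothetical helper)."""
    name, date, time = line.split()
    day, month, _year = date.split("/")  # the year is dropped from the output
    hours, minutes, seconds = time.split(":")
    return f"{name} {month}{day}{hours}{minutes}.{seconds}"


if __name__ == "__main__":
    print(reformat("filename 26/12/2010 21:09:12"))
```

A line that does not have exactly three parts raises a ValueError, which may be preferable to silently passing malformed lines through.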

How can I remove the timestamp from a filename in Perl?

I have a file which has a line in it as:
/hosting/logs/U01-ecom-SIT01/CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut_10.01.21_16.54.18.log
I need a script which would read this line and remove the time stamp from it, that is:
10.01.21_16.54.18
The script should print the filename without the timestamp and holding the full path, that is:
/hosting/logs/U01-ecom-SIT01/CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut.log
Please help as I'm unable to pattern match and output the file path without the timestamp.
echo "/hosting/logs/U01-ecom-SIT01/CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut_10.01.21_16.54.18.log" |
perl -pe "s/_\d\d\.\d\d\.\d\d_\d\d\.\d\d\.\d\d//;"
$ perl -le 's{_\d{2}\.\d{2}\.\d{2}_\d{2}\.\d{2}\.\d{2}}{} and print for @ARGV' /hosting/logs/U01-ecom-SIT01/CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut_10.01.21_16.54.18.log
Path shortened to prevent scrolling:
$ cat paths
CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut_10.01.21_16.54.18.log
$ perl -pe 's/(_(\d\d(\.\d\d){2})){2}\.log$/.log/' paths
CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut.log
The timestamp consists of two groups of the form _##.##.##, and each group ends with two repetitions of .##. The outer and inner {2} quantifiers express these two repetitions.
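The nested quantifiers can be tried out in Python as well. This is a rough sketch of the same pattern; the names TIMESTAMP and strip_timestamp are made up for illustration. A lookahead stands in for the trailing \.log$ so that only the timestamp itself is removed.

```python
import re

# Two groups of _##.##.## (each one "_##" followed by two ".##"),
# immediately followed by ".log" at the end of the string.
TIMESTAMP = re.compile(r"(?:_\d\d(?:\.\d\d){2}){2}(?=\.log$)")


def strip_timestamp(path):
    """Remove a trailing _##.##.##_##.##.## timestamp from a .log path (hypothetical helper)."""
    return TIMESTAMP.sub("", path)


if __name__ == "__main__":
    path = "CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut_10.01.21_16.54.18.log"
    print(strip_timestamp(path))
```

Paths without a timestamp in that exact position come back unchanged.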
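while (<>) {
    chomp;                                        # drop the newline so it does not end up in the extension
    @s = split /\//;                              # path components
    $fullpath = join("/", splice @s, 0, $#s);     # everything except the file name
    @a = split /[_.]/, $s[-1];                    # file name split on _ and .
    $newfile = "$fullpath/$a[0].$a[-1]";          # base name plus extension
    print $newfile . "\n";
}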
You can use the following code:
use strict;
use warnings;
my $var = "/hosting/logs/U01-ecom-SIT01/CU01-DC05-IFIO_SIT01_NU01-nc3sz1ecmas11/waslogs/SystemOut_10.01.21_16.54.18.log";
# Removes both the date part (_10.01.21) and the time part (_16.54.18)
$var =~ s/_\d\d\.\d\d\.\d\d//g;
# $var =~ s/_10\.01\.21_16\.54\.18//g; # You can also do it this way
print "$var\n";