Extracting fasta ids after string match - perl

I have a list of fasta sequences as follows:
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
The original fasta sequence is much longer than the subset posted here. I wanted to extract the 10 characters after the pattern "TCAT" into a separate file and did this
grep -oP "(?<=TCAT).{10}"
I do get the needed result as:
CTCACCTACT
TGATAAGGGG
I would like their corresponding fasta ids as one column and the extracted pattern as second column like:
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG

Try this one-liner. The leading [^>] skips the header lines, and $p holds the previous line (the fasta id) when a sequence line matches:
perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' file
with your given inputs
$ cat fasta.txt
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
$ perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' fasta.txt
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
$

Another way is to use awk. Note that this relies on every fasta id being exactly 36 characters long, including the leading ">":
cat <your_file>| awk -F"_" '/Product/{printf "%s", $0; next} 1'|awk -F"TCAT" '{ print substr($1,2,35) "\t" substr($2,1,10)}'
The output:
Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Hope it helps.
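
For comparison, the same pairing of fasta id and extracted fragment can be sketched in Python. This is only a minimal sketch, assuming each record is one header line starting with ">" followed by a single sequence line, as in the sample. The function name extract_after and its parameters are made up for illustration, and only the first occurrence of the motif per sequence is reported.

```python
import re

def extract_after(lines, motif="TCAT", width=10):
    """Yield (header, fragment) pairs: the `width` characters after the first `motif`.

    Assumes one sequence line per header, as in the sample input.
    """
    pattern = re.compile(re.escape(motif) + "(.{%d})" % width)
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line  # remember the most recent fasta id
            continue
        match = pattern.search(line)
        if header is not None and match:
            yield header, match.group(1)

sample = [
    ">Product_1_001:299:H377WBGXB:1:11101",
    "TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA",
    ">Product_2_001:299:H377WBGXB:2:11101",
    "CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT",
]

for header, fragment in extract_after(sample):
    print(header, fragment, sep="\t")
```

To run it on a real file, pass an open file object instead of the sample list.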


Normalize column fill with space on right

Working with the example log file below:
1;000117;20190529;055529;9521;0988388019
1;000015;20190529;071944;2222;2231
1;000012;20190529;072734;4258;4252
1;000006;20190529;073336;2226;1000
3;000005;20190529;073715;1000;037760967
3;000004;20190529;073751;1000;037760967
I need to normalize the last column by padding it with spaces on the right until its length is 25.
I tried this Perl code, without success:
perl -F';' -lane '$F[5] = $F[5], sprintf "% 25d"; $" = ";"; print "@F"'
I need the output below:
1;000117;20190529;055529;9521;0988388019
1;000015;20190529;071944;2222;2231
1;000012;20190529;072734;4258;4252
1;000006;20190529;073336;2226;1000
3;000005;20190529;073715;1000;037760967
3;000004;20190529;073751;1000;037760967
$ awk 'BEGIN{FS=OFS=";"} {$NF=sprintf("%-25s",$NF)}1' file
1;000117;20190529;055529;9521;0988388019
1;000015;20190529;071944;2222;2231
1;000012;20190529;072734;4258;4252
1;000006;20190529;073336;2226;1000
3;000005;20190529;073715;1000;037760967
3;000004;20190529;073751;1000;037760967
So you can see the blanks:
$ awk 'BEGIN{FS=OFS=";"} {$NF=sprintf("%-25s",$NF)}1' file | tr ' ' '#'
1;000117;20190529;055529;9521;0988388019###############
1;000015;20190529;071944;2222;2231#####################
1;000012;20190529;072734;4258;4252#####################
1;000006;20190529;073336;2226;1000#####################
3;000005;20190529;073715;1000;037760967################
3;000004;20190529;073751;1000;037760967################
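You were on the right track. Working Perl versions:
perl -F';' -lane '$F[5]=sprintf("%-25s",$F[5]);print join ";",@F'
perl -F';' -lpane '$F[5]=sprintf("%-25s",$F[5]);$_=join ";",@F'
The -l switch matters in both: without it the trailing newline stays in the last field and is counted as part of the 25 characters.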
This might work for you (GNU sed):
sed -i ':a;/;[^;]\{25\}$/!s/$/ /;ta' file
If the last field is not 25 characters long, add a space until it is.
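
The same padding can be sketched in Python with str.ljust. This is a minimal sketch, not a drop-in replacement for the commands above; the function name pad_last_field and its parameters are assumptions chosen for illustration.

```python
def pad_last_field(line, width=25, sep=";"):
    """Left-justify the last `sep`-separated field, padding with spaces to `width`."""
    fields = line.rstrip("\n").split(sep)
    # ljust only pads; a field already longer than `width` is left unchanged
    fields[-1] = fields[-1].ljust(width)
    return sep.join(fields)

padded = pad_last_field("1;000015;20190529;071944;2222;2231")
print(padded.replace(" ", "#"))  # show the padding as '#', like the tr example
```

Like the awk version, it leaves over-long fields alone rather than truncating them.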

Sed - replace words

I have a problem with replacing string.
|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
I want to find the field starting with Svc (up to the next |) and swap it with the field starting with Stm (up to the next |).
My attempts so far only replaced characters, which is not my goal.
awk -F'|' -v OFS='|' '{
    a = b = 0
    for (i = 1; i <= NF; i++) { a = $i ~ /^Stm=/ ? i : a; b = $i ~ /^Svc=/ ? i : b }
    t = $a; $a = $b; $b = t
}7' file
outputs:
|Svc=101|Seq=2|Num=2|Stm=2|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
The code exchanges the Stm and Svc columns, no matter which one comes first.
If a Perl solution is okay, here is one that assumes exactly one column matches each search term:
$ cat ip.txt
|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
$ perl -F'\|' -lane '
@i = grep { $F[$_] =~ /Svc|Stm/ } 0..$#F;
$t=$F[$i[0]]; $F[$i[0]]=$F[$i[1]]; $F[$i[1]]=$t;
print join "|", @F;
' ip.txt
|Svc=101|Seq=2|Num=2|Stm=2|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
-F'\|' -lane split input line on |, see also Perl flags -pe, -pi, -p, -w, -d, -i, -t?
@i = grep { $F[$_] =~ /Svc|Stm/ } 0..$#F get index of columns matching Svc and Stm
$t=$F[$i[0]]; $F[$i[0]]=$F[$i[1]]; $F[$i[1]]=$t swap the two columns
Or use ($F[$i[0]], $F[$i[1]]) = ($F[$i[1]], $F[$i[0]]); courtesy How can I swap two Perl variables
print join "|", @F print the modified array
You need to use capture groups and backreferences in a string substitution.
The below will swap the two. Matching each field with [^|]* keeps a group from running past the next |:
echo '|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631' | sed 's/\(Stm=[^|]*\)\(.*\)\(Svc=[^|]*\)/\3\2\1/'
As pointed out in the comment from @Kent, this will not work if the strings were not in that order.
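
An order-independent swap can also be sketched in Python by working on the split fields rather than on the raw string. This is only an illustration: the function name swap_fields and its parameters are hypothetical, and it assumes at most one field starts with each prefix.

```python
def swap_fields(line, first="Stm=", second="Svc=", sep="|"):
    """Swap the field starting with `first` and the field starting with `second`."""
    fields = line.split(sep)
    # index of the first field with each prefix, or None if absent
    i = next((n for n, f in enumerate(fields) if f.startswith(first)), None)
    j = next((n for n, f in enumerate(fields) if f.startswith(second)), None)
    if i is not None and j is not None:
        fields[i], fields[j] = fields[j], fields[i]
    return sep.join(fields)

line = "|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631"
print(swap_fields(line))
```

If either field is missing, the line is returned unchanged.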

sed/awk/cut/grep - Best way to extract string

I have a results.txt file that is structured in this format:
Uncharted 3: Javithaxx l Rampant l Graveyard l Team Deathmatch HD (D1VpWBaxR8c)
Matt Darey feat. Kate Louise Smith - See The Sun (Toby Hedges Remix) (EQHdC_gGnA0)
The Matrix State (SXP06Oax70o)
Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)
I want to create a new file extracting the YouTube video ID given in the last characters of each line, like "8aOdRACuXiU".
I'm trying to build a URL like this in a new file:
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Note, I appended the &hd=1 to the string that I am trying to be replaced. I have tried using Linux reverse and cut but reverse or rev munges my data. The hard part here is that each line in my text file will have entries with parentheses and I only care about getting the data between the last set of parentheses. Each line has a variable length so that isn't helpful either. What about using grep and .$ for the end of the line?
In summary, I want to extract the youtube ID from results.txt and export it to a new file in the following format: http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Using awk:
awk '{
v = substr( $NF, 2, length( $NF ) - 2 )
printf "%s%s%s\n", "http://www.youtube.com/watch?v=", v, "&hd=1"
}' infile
It yields:
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
$ sed 's!.*(\(.*\))!http://www.youtube.com/watch?v=\1\&hd=1!' results.txt
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Here, .*(\(.*\)) looks for the last occurrence of a pair of parentheses, and captures the characters inside those parentheses. The captured group is then inserted into the URL using \1.
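
The same "last pair of parentheses" idea can be sketched in Python. This is a minimal sketch under the assumption that the ID is the final parenthesised group at the end of the line; the names LAST_PARENS and to_url are made up for illustration.

```python
import re

# Greedy .* pushes the match to the last "(", and [^()]* keeps the capture
# inside a single pair of parentheses at the end of the line.
LAST_PARENS = re.compile(r".*\(([^()]*)\)\s*$")

def to_url(line):
    """Return the watch URL for the ID in the last parentheses, or None."""
    match = LAST_PARENS.match(line)
    if match is None:
        return None
    return "http://www.youtube.com/watch?v=" + match.group(1) + "&hd=1"

title = "Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)"
print(to_url(title))
```

Lines without a trailing parenthesised group yield None, so they can be skipped when writing the new file.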
Using a perl one-liner :
perl -lne 'printf "http://www.youtube.com/watch?v=%s&hd=1\n", $& if /[^\(]+(?=\)$)/' file.txt
Or multi-line version :
perl -lne '
printf(
"http://www.youtube.com/watch?v=%s&hd=1\n",
$&
) if /[^\(]+(?=\)$)/
' file.txt

Sort a file with unordered columns of integers

I have an input file with two columns of integer values. I would like to chop the input file in this way
input file:
...
...
12312 565456
565456 12312
...
...
#
output file:
...
...
12312 565456
...
...
Namely, if the same two numbers appear as a pair more than once, in either order, I want a single line in the output file where the first number is the smaller of the two.
How can this be done with sort or a Perl script?
You can try:
perl -nale ' @F=reverse @F if($F[0]>$F[1]);
$x=$F[0]." ".$F[1]; if(!$h{$x}){print $x;$h{$x}=1;}'
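You could combine perl and sort. The numeric comparison keeps the smaller number first; a plain sort would compare the fields as strings:
perl -lne 'BEGIN { $, = " " } print sort { $a <=> $b } split' infile | sort -u
awk '$2<$1 {print $2,$1; next} {print $1,$2}' infile | sort -u
would also work. Printing both fields explicitly gives every line the same separator, so sort -u can spot the duplicates.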
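
The normalise-then-deduplicate idea behind these answers can be sketched in Python as well. This is a minimal sketch, assuming two whitespace-separated integers per line; the function name unique_pairs is hypothetical, and unlike the sort -u pipelines it keeps the pairs in first-seen order.

```python
def unique_pairs(lines):
    """Return each unordered pair once, smaller number first, in first-seen order."""
    seen = set()
    result = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pair = tuple(sorted(int(p) for p in parts))  # numeric, not string, order
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result

for a, b in unique_pairs(["12312 565456", "565456 12312", "10 9"]):
    print(a, b)
```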

How can I change spaces to underscores and lowercase everything?

I have a text file which contains:
Cycle code
Cycle month
Cycle year
Event type ID
Event ID
Network start time
I want to change this text so that every space is replaced with a _, and then convert all characters to lowercase, like below:
cycle_code
cycle_month
cycle_year
event_type_id
event_id
network_start_time
How could I accomplish this?
Another Perl method:
perl -pe 'y/A-Z /a-z_/' file
tr alone works:
tr ' [:upper:]' '_[:lower:]' < file
Looking into sed documentation some more and following advice from the comments the following command should work.
sed -r {filehere} -e 's/[A-Z]/\L&/g;s/ /_/g' -i
There is a perl tag in your question as well. So:
#!/usr/bin/perl
use strict; use warnings;
while (<DATA>) {
print join('_', split ' ', lc), "\n";
}
__DATA__
Cycle code
Cycle month
Cycle year
Event type ID
Event ID
Network start time
Or:
perl -i.bak -wple '$_ = join("_", split " ", lc)' test.txt
sed "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ /abcdefghijklmnopqrstuvwxyz_/" filename
Just use your shell, if you have Bash 4:
while IFS= read -r line
do
line=${line,,} # change to lowercase
echo "${line// /_}"
done < "file" > newfile
mv newfile file
With gawk:
awk '{$0=tolower($0);$1=$1}1' OFS="_" file
With Perl:
perl -ne 's/ +/_/g;print lc' file
With Python:
>>> f = open("file")
>>> for line in f:
...     print('_'.join(line.split()).lower())
...
>>> f.close()