Extract a specific gene from several prokka-annotated sequences - annotations

I annotated 500 sequences with Prokka and need to extract only the TcdA gene from each of them, using the annotation in each sequence's .ffn file.
How can I do this automatically, without having to open the folder of each annotated sequence?
Prokka files:
Strain1
>Strain1.err
>Strain1.faa
>Strain1.fna
>Strain1.ffn *I use this file to extract the gene*
I need the TcdA gene from all 500 sequences, for example:
>Strain1_01428 glycosylating toxin TcdA
ATGTCTTTAATATCTAAAGAAGAGTTAATAAAACTCGCATATAGCATTAGACCAAGAGAA
AATGAGTATAAAACTATATTAACTAATTTAGACGAATATAATAAGTTAACTACAAACAAT
AATGAAAATAAATATTTACAATTAAAAAAACTAAATGAATCAATTGATGTTTTTATGAAT
AAATATAAAAATTCAAGCAGAAATAGAGCACTCTCTAATCTAAAAAAAGATATATTAAAA
GAAGTAATTCTTATTAAAAATTCCAATACAAGTCCTGTAGAAAAAAATTTACATTTTGTA

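Something like:
for i in /path/to/*.ffn; do awk 'BEGIN {RS=">"} /glycosylating toxin TcdA/ {printf ">%s", $0}' "$i"; done > TcdA.fasta
The redirection goes after done so that the matches from every file end up in the same output; redirecting inside the loop with > would overwrite TcdA.fasta on each iteration.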
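Since Prokka writes each strain's files into its own folder, a variant that walks those folders may be closer to the actual layout. This is only a sketch: the function name extract_gene and the layout prokka_out/Strain1/Strain1.ffn are assumptions, not something stated in the question. It treats the product name as a plain substring of the header line and prints the whole matching record.

```shell
# Print every FASTA record whose header contains the given product name.
# $1: root folder holding one subfolder per strain (assumed layout)
# $2: product name to look for in the header line
extract_gene() {
    for f in "$1"/*/*.ffn; do
        [ -e "$f" ] || continue          # the glob matched nothing
        awk -v pat="$2" '
            /^>/ { keep = (index($0, pat) > 0) }   # header line: decide whether to keep this record
            keep                                   # print header and sequence lines of kept records
        ' "$f"
    done
}
```

It could then be called as: extract_gene /path/to/prokka_out "glycosylating toxin TcdA" > TcdA.fasta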

Related

Replace file entries in subdirectories with sed from list

I want to replace a certain line (#6) in a whole bunch of documents that looks like this:
N Metal1 Metal2 Metal3 Metal4
where the metals need to be replaced with chemical symbols from a list of permutations:
CrHfMoNb CrHfMoTa CrHfMoTi CrHfMoV CrHfMoW CrHfMoZr CrHfNbTa CrHfNbTi CrHfNbV CrHfNbW CrHfNbZr CrHfTaTi CrHfTaV CrHfTaW CrHfTaZr CrHfTiV CrHfTiW CrHfTiZr CrHfVW CrHfVZr CrHfWZr CrMoNbTa CrMoNbTi CrMoNbV CrMoNbW CrMoNbZr CrMoTaTi CrMoTaV CrMoTaW CrMoTaZr CrMoTiV CrMoTiW CrMoTiZr CrMoVW CrMoVZr CrMoWZr CrNbTaTi CrNbTaV CrNbTaW CrNbTaZr CrNbTiV CrNbTiW CrNbTiZr CrNbVW CrNbVZr CrNbWZr CrTaTiV CrTaTiW CrTaTiZr CrTaVW CrTaVZr CrTaWZr CrTiVW CrTiVZr CrTiWZr CrVWZr HfMoNbTa HfMoNbTi HfMoNbV HfMoNbW HfMoNbZr HfMoTaTi HfMoTaV HfMoTaW HfMoTaZr HfMoTiV HfMoTiW HfMoTiZr HfMoVW HfMoVZr HfMoWZr HfNbTaTi HfNbTaV HfNbTaW HfNbTaZr HfNbTiV HfNbTiW HfNbTiZr HfNbVW HfNbVZr HfNbWZr HfTaTiV HfTaTiW HfTaTiZr HfTaVW HfTaVZr HfTaWZr HfTiVW HfTiVZr HfTiWZr HfVWZr MoNbTaTi MoNbTaV MoNbTaW MoNbTaZr MoNbTiV MoNbTiW MoNbTiZr MoNbVW MoNbVZr MoNbWZr MoTaTiV MoTaTiW MoTaTiZr MoTaVW MoTaVZr MoTaWZr MoTiVW MoTiVZr MoTiWZr MoVWZr NbTaTiV NbTaTiW NbTaTiZr NbTaVW NbTaVZr NbTaWZr NbTiVW NbTiVZr NbTiWZr NbVWZr TaTiVW TaTiVZr TaTiWZr TaVWZr TiVWZr
to make it look for example like this:
N Cr Hf Mo Nb
I can do this easily with the sed command using:
sed -i '6s/Metal1 Metal2 Metal3 Metal4/Cr Hf Mo Nb/' filename
The problem is that I need to do it automatically for all 126 combinations, where each file is residing in its own subdirectory for each composition and has to be adjusted accordingly to its own elements. The file always has the same name and is completely identical before this change.
The chemical symbols have to be listed alphabetically and there must be one space between each, or the code won't work. I assume this is difficult because all the elements used have two letters in their symbols except for V and W. Is there an efficient way to do this?
Handling the two-char (Zr) versus one-char (W) problem is easy: each capital letter marks the beginning of a new element.
cd dir/
for dir in *N; do
    split="$(sed 's/N$//;s/[A-Z]/ &/g' <<< "$dir")"
    sed -i "s/N Metal1 Metal2 Metal3 Metal4/N $split/" "$dir/file"
done
Note that $split starts with a space, so the replacement string is something like N Cr Hf Mo Nb with two spaces between N and Cr – just as you wanted.
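The capital-letter rule can be tried in isolation before touching any files. The following is a minimal sketch, not the answerer's exact method: it assumes the subdirectories are named exactly like the permutations listed in the question (for example CrHfMoNb, without a trailing N), that GNU sed is available for -i, and the helper names split_symbols and fill_metals are hypothetical.

```shell
# Turn "CrHfMoV" into "Cr Hf Mo V": put a space before every capital
# letter, then drop the space that ends up at the start.
split_symbols() {
    printf '%s\n' "$1" | sed 's/[[:upper:]]/ &/g; s/^ //'
}

# $1: parent directory with one subdirectory per composition (assumed layout)
# $2: name of the file to edit inside each subdirectory
fill_metals() {
    for dir in "$1"/*/; do
        [ -f "$dir$2" ] || continue
        name=$(basename "$dir")
        # Only line 6 is edited, as in the question.
        sed -i "6s/Metal1 Metal2 Metal3 Metal4/$(split_symbols "$name")/" "$dir$2"
    done
}
```

Running split_symbols on a few names first is a cheap way to confirm the spacing before fill_metals edits the files in place.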

Extracting values from a single file

I have a file with multiple lines; but a specific line contains tons of information, with several repeated expressions. I'm trying to extract some specific values. I first tried some commands with sed, for instance, but with no success. So, I was wondering if you could give me some insights.
Here is a fragment of the single line I mentioned:
[...]6[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01},DLOOP.rate_median=0.04131395026396427,length=
[...]
10[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61},DLOOP.rate_median=0.04131395026396427,length=
[...]
My aim is first to extract all the values between the braces after "habitat.set.prob={" and put them on a single line in a text file.
It would also be important to extract the numbers that appear just before the expression "[&length_range=", which in this case are "6" and "10". They are the labels of the sets of numbers after "prob={".
So the set of numbers I want to extract always appears between "habitat.set.prob={" and "},DLOOP.rate_median", while the other number (the label) is always right before "[&length_range="; what comes before the label is not a fixed expression, though; it is actually a random number.
The goal, then, is to end up with a file with the following characteristics:
6 0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
and so on …
What do you think? Is this possible?
I started with this very basic command at least to try to extract the set of numbers, but it didn't work
sed -n "/habitat.set.prob={/,/},DLOOP.rate_median=/ p"
Well... I got some improvement.
I was able to get the values at least:
awk '{gsub("habitat.set.prob={","\n");printf"%s",$0}' filename | awk -F'},' '{print $1"}"}' | grep -iv "TREE" > stats.txt
Many thanks in advance.
Cheers,
Luiz
Something like this:
sed -rn '/.*[0-9]+\[&length_range=\{/,/habitat.set.prob=\{/{s/.*\b([0-9]+)\[&length_range.*/\1/p; s/.*habitat.set.prob=\{([^D]+)\},DLOOP.rate.*/\1/p}' habitat
6
0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10
0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
The first part, '/.a./,/.b./', selects the range from a line matching pattern a to a line matching pattern b, which may span multiple lines. The -n option tells sed not to print anything by default.
In '/.a./,/.b./{s/.c./.d./p; s/.e./.f./p}'
there are two substitution commands in curly braces, each with the p (print) flag.
I am not sure whether you have really dug into this yet, so I am not providing the complete answer, but let's hope this helps:
For the first part, getting the number (which you call the label): you didn't mention whether it follows any specific pattern, so try this (data is the file that contains the actual input). You will need to work out how to capture the whole number and tweak the RE a bit:
sed -n 's/.*\([0-9][0-9]*\).*length_range.*/\1/p' data
For the other part which gives the numericals between habitat and DLOOP:
sed -n 's/.*habitat.set.prob=\(.*\),DLOOP.*/\1/pg' data | tr '{' ' ' | tr '}' ' '
Now, try to take this as a starter and work on your output to get your desired result!
To explain a bit:
In the first command, I am trying to capture the digits between anything (.*) and (.*)length_range [you can escape the characters [ and & by putting \ in front of them].
In the second command, I am capturing the pattern between habitat.set.prob and DLOOP and then using tr to remove the braces.
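The two partial commands can be combined so that each label ends up on the same line as its values. This is a rough sketch under stated assumptions: grep supports -o (GNU and BSD grep do), the label digits sit directly before "[&length_range", and every label is followed by exactly one habitat.set.prob block before the next label appears. The name extract_habitat is hypothetical.

```shell
# Print "label values" pairs, one pair per line.
# $1: file containing the long annotated line
extract_habitat() {
    # Step 1: pull out only the two kinds of fragments, in input order.
    grep -oE '[0-9]+\[&length_range|habitat\.set\.prob=[{][^}]*[}]' "$1" |
        # Step 2: strip the surrounding markers, leaving label or values.
        sed 's/\[&length_range//; s/habitat\.set\.prob={//; s/}//' |
        # Step 3: join every two consecutive lines with a space.
        paste -d' ' - -
}
```

If a label could ever appear without a matching habitat.set.prob block, the pairing done by paste would drift, so the output is worth a quick visual check.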
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string p = "1:2:3:4"; // input string of single-digit integers separated by ':'
    int arr[4] = {};      // zero-initialized array to hold the extracted integers
    size_t j = 0;
    for (size_t i = 0; i < p.length() && j < 4; i++) { // scan the string character by character
        if (p[i] == ':') { continue; } // skip the separators
        arr[j++] = p[i] - '0';         // convert the digit character to its integer value
    }
    cout << "String={" << arr[0] << " " << arr[1] << " " << arr[2] << " " << arr[3] << "}"; // print the array
    return 0;
}

convert row to column based on text

I have a rather large file (single column) with data similar to this:
BT1111
2.2.2.2/3
3.3.3.3/4
7.2.1.1/5
BT6766
2.2.1.1/5
4.5.1.1/7
BT9898
4.4.4.4/2
8.8.8.8/9
I wish to find a function that can align it into two columns, by moving all entries starting with a digit one column over ($1 to $2) and enriching them with the corresponding BT field, so the desired output should be:
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9
I can't figure out how the "look for the next occurrence" step should be performed, but I hope there is a function for it that I have managed to overlook?
perl -nle'if (/^\D/) { $n=$_ } else { print "$n;$_" }' input.txt
See "Specifying file to process to Perl one-liner" for alternate usages.
$ awk '/BT/{a=$1; next}{print a ";" $1}' input.txt
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9
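A slightly more defensive variant of the awk approach is sketched below. It assumes that header lines always start with BT at the beginning of the line, and it skips any data line that appears before the first header instead of printing it with an empty key. The function name pair_with_header is hypothetical.

```shell
# Prefix every data line with the most recent BT header, separated by ";".
pair_with_header() {
    awk '
        /^BT/     { key = $1; next }       # header line: remember it and move on
        key != "" { print key ";" $1 }     # data line: emit "header;value"
    ' "$@"
}
```

Anchoring the pattern with ^ keeps a data line that merely contains "BT" somewhere from being mistaken for a header.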

sed: replace letter between square brackets

I have the following string:
signal[i]
signal[bg]
output [10:0]
input [i:1]
What I want is to replace the letters between square brackets (with underscores, for example) and to keep the other strings, which represent table declarations:
signal[_]
signal[__]
output [10:0]
input [i:1]
thanks
Try:
awk '{gsub(/\[[a-zA-Z]+\]/,"[_]")} 1' Input_file
This globally substitutes each bracketed run of letters (matched at its longest) with [_]. The trailing 1 prints every line, edited or not.
EDIT: The above replaces all the letters with a single _. To get as many underscores as there are letters, the following may help; i and q are reset for every line so that a shorter match is not padded with underscores left over from an earlier line.
awk '{i=0;q="";match($0,/\[[a-zA-Z]+\]/);VAL=substr($0,RSTART+1,RLENGTH-2);if(VAL){len=length(VAL);while(i<len){q=q?q"_":"_";i++}};gsub(/\[[a-zA-Z]+\]/,"["q"]")}1' Input_file
OR
awk '{
  i=0; q="";
  match($0,/\[[a-zA-Z]+\]/);
  VAL=substr($0,RSTART+1,RLENGTH-2);
  if(VAL){
    len=length(VAL);
    while(i<len){
      q=q?q"_":"_";
      i++
    }
  };
  gsub(/\[[a-zA-Z]+\]/,"["q"]")
}
1
' Input_file
Will add explanation soon.
EDIT2: The following version adds explanations, for the OP and other users.
awk '{
  i=0; q="";  #### Reset the counter i and the underscore string q for every line.
  match($0,/\[[a-zA-Z]+\]/);  #### Use the built-in match function of awk to find [letters], as per the requirement of the OP.
  VAL=substr($0,RSTART+1,RLENGTH-2);  #### VAL holds the substring that starts at RSTART+1 and is RLENGTH-2 characters long, i.e. the letters without the brackets.
  #### RSTART and RLENGTH are built-in variables that match sets whenever it finds a match.
  if(VAL){  #### If VAL is not empty, perform the following actions.
    len=length(VAL);  #### len holds the length of VAL.
    while(i<len){  #### Loop while i is smaller than len, i.e. once per letter.
      q=q?q"_":"_";  #### Append one "_" to q on every iteration.
      i++  #### Increment i by 1 each time.
    }
  };
  gsub(/\[[a-zA-Z]+\]/,"["q"]")  #### Globally substitute [letters] with [ followed by the value of q (all underscores) and ].
}
1  #### The 1 prints every line, edited or not.
' Input_file  #### Input_file is the file to process.
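The match/substr idea can also be applied repeatedly, so that a line containing several bracketed groups of different lengths gets each one masked to its own length. A minimal sketch, assuming a POSIX awk; mask_letters is a hypothetical helper name, and groups that contain anything other than letters (such as [10:0] or [i:1]) are left untouched.

```shell
# Replace every letter inside [letters] with "_", one group at a time.
mask_letters() {
    awk '{
        out = ""
        rest = $0
        # Consume the line left to right, one bracketed group per iteration.
        while (match(rest, /\[[a-zA-Z]+\]/)) {
            mask = substr(rest, RSTART + 1, RLENGTH - 2)   # the letters only
            gsub(/./, "_", mask)                           # one "_" per letter
            out = out substr(rest, 1, RSTART - 1) "[" mask "]"
            rest = substr(rest, RSTART + RLENGTH)          # continue after the group
        }
        print out rest
    }' "$@"
}
```

Because every variable is rebuilt per line and per group, no state leaks from one match to the next.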
Alternative gawk solution:
awk -F'\\[|\\]' '$2!~/^[0-9]+:[0-9]$/{ gsub(/./,"_",$2); $2="["$2"]" }1' OFS= file
The output:
signal[_]
signal[__]
output [10:0]
-F'\\[|\\]' - treating [ and ] as field separators
$2!~/^[0-9]+:[0-9]$/ - performing action if the 2nd field does not represent table declaration
gsub(/./,"_",$2) - replace each character with _
This might work for you (GNU sed):
sed ':a;s/\(\[_*\)[[:alpha:]]\([[:alpha:]]*\]\)/\1_\2/;ta' file
Match on opening and closing square brackets with any number of _'s and at least one alpha character and replace said character by an underscore and repeat.
awk '{sub(/\[i\]/,"[_]"); sub(/\[bg\]/,"[__]")}1' file
signal[_]
signal[__]
output [10:0]
input [i:1]
The explanation is as follows: since a bracket is a special character, it has to be escaped to be treated literally; after that, it is easy to use sub.

Deduplicate FASTA, keep a seq id

I need to format files for a miRNA-identifying tool (miREAP).
I have a fasta file in the following format:
>seqID_1
CCCGGCCGTCGAGGC
>seqID_2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_4
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_5
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_6
AGGGCACGCCTGCCTGGGCGTCACGC
I want to count the number of times each sequence occurs and append that number to the seqID line. The count for each sequence and an original ID referring to the sequence need only appear once in the file like this:
>seqID_1 1
CCCGGCCGTCGAGGC
>seqID_2 2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_3 3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
Fastx_collapser does the trick nearly as I'd like (http://hannonlab.cshl.edu/fastx_toolkit/index.html). However, rather than maintain seqIDs, it returns:
>1 1
CCCGGCCGTCGAGGC
>2 2
AGGGCACGCCTGCCTGGGCGTCACGC
>3 3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
This means that the link between my sequence, seqID, and genome mapping location is lost. (Each seqID corresponds to a sequence in my fasta file and a genome mapping spot in a separate Bowtie2-generated .sam file)
Is there a simple way to do the desired deduplication at the command line?
Thanks!
Linearize, then sort and uniq -c:
awk '/^>/ {if(N>0) printf("\n"); ++N; printf("%s ",$0);next;} {printf("%s",$0);} END { printf("\n");}' input.fa | \
sort -t ' ' -k2,2 | uniq -f 1 -c |\
awk '{printf("%s_%s\n%s\n",$2,$1,$3);}'
>seqID_2_2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_1_1
CCCGGCCGTCGAGGC
>seqID_3_3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
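If the original record order and the "ID count" header format from the question matter, the counting can also be done in a single awk pass. This is a sketch, not a replacement for the pipeline above: it keeps the ID of the first occurrence of each sequence, joins multi-line sequences before comparing them, and holds all distinct sequences in memory. The name collapse_fasta is hypothetical.

```shell
# Collapse identical sequences, keeping the first ID seen and appending the count.
collapse_fasta() {
    awk '
        # Record the sequence gathered so far under its header.
        function store() {
            if (id == "") return
            if (!(seq in count)) { order[++n] = seq; first[seq] = id }
            count[seq]++
        }
        /^>/ { store(); id = $0; seq = ""; next }   # new header: save the previous record
        { seq = seq $0 }                            # sequence line: append (handles wrapped FASTA)
        END {
            store()
            for (i = 1; i <= n; i++) {
                s = order[i]
                print first[s] " " count[s]
                print s
            }
        }
    ' "$@"
}
```

On the example from the question, collapse_fasta input.fa would print the three records in their original order with headers such as ">seqID_2 2".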