convert row to column based on text - perl

I have a rather large file (single column) with data similar to this:
BT1111
2.2.2.2/3
3.3.3.3/4
7.2.1.1/5
BT6766
2.2.1.1/5
4.5.1.1/7
BT9898
4.4.4.4/2
8.8.8.8/9
I wish to find a function that can align it into two columns, by moving all entries starting with digit one column ($1 to $2) and enrich it with the corresponding BT field, so desired output should be
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9
I can't imagine how to ensure the "look for next occurence" should be performed, but hope there is a function for it I have managed to overlook ?

perl -nle'if (/^\D/) { $n=$_ } else { print "$n;$_" }' input.txt
See Specifying file to process to Perl one-liner for alternate usages.

$ awk '/BT/{a=$1; next}{print a ";" $1}' input.txt
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9

Related

Extract a specific gene from several prokka-annotated sequences

I annotated 500 sequences with Prokka from which I need to specifically extract only TcdA gene from all sequences, I need use the annotation of .ffn file of all sequences.
¿How can I do this automatically without having to open each folder of each sequence noted?
Prokka files:
Strain1
>Strain1.err
>Strain1.faa
>Strain1.fna
>Strain1.ffn *I use this file for extract gene*
I need the TcdA gene of the 500 sequences
Strain1_01428 glycosylating toxin TcdA
ATGTCTTTAATATCTAAAGAAGAGTTAATAAAACTCGCATATAGCATTAGACCAAGAGAA
AATGAGTATAAAACTATATTAACTAATTTAGACGAATATAATAAGTTAACTACAAACAAT
AATGAAAATAAATATTTACAATTAAAAAAACTAAATGAATCAATTGATGTTTTTATGAAT
AAATATAAAAATTCAAGCAGAAATAGAGCACTCTCTAATCTAAAAAAAGATATATTAAAA
GAAGTAATTCTTATTAAAAATTCCAATACAAGTCCTGTAGAAAAAAATTTACATTTTGTA
something like:
for i in /path/to/*.ffn; do awk 'BEGIN {RS=">"} /glycosylating toxin TcdA/ {print ">"$0}' $i > TcdA.fasta; done

Extracting values from a single file

I have a file with multiple lines; but a specific line contains tons of information, with several repeated expressions. I'm trying to extract some specific values. I first tried some commands with sed, for instance, but with no success. So, I was wondering if you could give me some insights.
So, here you have one fraction of the unique line of the given document I mentioned:
[...]6[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01},DLOOP.rate_median=0.04131395026396427,length=
[...]
10[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61},DLOOP.rate_median=0.04131395026396427,length=
[...]
My aim here is first to extract all the values that is between the brackets, after "habitat.set.prob={". and put them in a single line in a text file.
Also, it would be important to extract the numbers that appears just before the expression "[&length_range=]", which in this case are "6" and "10". They are the label of the set of numbers after "prob={"
So the set of numbers I want to extract always appears between "habitat.set.prob={" and "},DLOOP.rate_median", while the other number (the label) is always rigth before "[&length_range="; but what is before the label is not the same expression; actually it is a random number.
The goal then is end up with a file with the following characteristcs:
6 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
10 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
and so on …
What do you think? Is this possible?
I started with this very basic command at least to try to extract the set of numbers, but it didn't work
sed -n "/habitat.set.prob={/,/},DLOOP.rate_median=/ p"
| Well... I got some improvement.
I was able to get the values at least:
awk '{gsub("habitat.set.prob={","\n");printf"%s",$0}' filename | awk -F'},' '{print $1"}"}' | grep -iv "TREE" > stats.txt
|
Many thanks in advance.
Cheers,
Luiz
Something like that:
sed -rn '/.*[0-9]+\[&length_range=\{/,/habitat.set.prob=\{/{s/.*\b([0-9]+)\[&length_range.*/\1/p; s/.*habitat.set.prob=\{([^D]+)\},DLOOP.rate.*/\1/p}' habitat
6
0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10
0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
The first part '/.a./,/.b./' searches from pattern a to b, distributed over multiple lines. The -n told sed to do non-printing as default.
In '/.a./,/.b./{s/.c./.d./p; s/.e./.f./p}'
there are two substitution commands with p=print in curly braces.
I am not sure if you really digged a little, so not providing the complete answer, but let's hope this would help you:
for the first part: getting the no(which you call as label) you didn't mention if there is any specific pattern, so try this (data is the file which contains the actual input) - you need to work on how to get the number and tweak the RE a bit
sed -n 's/.*\([0-9][0-9]*\).*length_range.*/\1/p' data
For the other part which gives the numericals between habitat and DLOOP:
sed -n 's/.*habitat.set.prob=\(.*\),DLOOP.*/\1/pg' data | tr '{' ' ' | tr '}' ' '
Now, try to take this as a starter and work on your output to get your desired result!
To explain a bit:
In the first section - I am trying to capture the numericals between anything(.*) and (.*)length_range [you can escape the character [ and & by using \ in front of them]
In the second section: I am capturing pattern in between habitat.set.prob and DLOOP and then doin a tr to remove the brackets.
#include <iostream>
using namespace std;
int main()
{
string p = "1:2:3:4"; //input your string
int arr[4] = {}; //create a new empty integer array to put the integers in it
for(int i=0, j=0; i <p.length(); i++){//loop on the string to extract integers
if( p[i] == ':'){continue;}//if the value = ':' skip it and continue
arr[j]=(int)p[i]-48;j++;//put the integer in the array we created
}
cout << "String={"<<arr[0]<<" "<<arr[1]<<" "<<arr[2]<<" "<<arr[3]<<"}";//print the array
return 0;
}

Change one line to column

I have a file like this:
20180127200000
DEFAULT, Proc_m0_s1
DEFAULT, Proc_m0_s2
DEFAULT, Proc_m0_s3
20180127200001
DEFAULT, Proc_m0_s1
DEFAULT, Proc_m0_s2
DEFAULT, Proc_m0_s3
I'd like to convert to:
20180127200000 DEFAULT, Proc_m0_s1
20180127200000 DEFAULT, Proc_m0_s2
20180127200000 DEFAULT, Proc_m0_s3
20180127200001 DEFAULT, Proc_m0_s1
20180127200001 DEFAULT, Proc_m0_s2
20180127200001 DEFAULT, Proc_m0_s3
The time stamp appears every 4 lines.
Awk should be both faster and simpler than a shell loop for this type of task.
awk 'NR%4==1 { time=$1; next }
{ print time " " $0 }' file
NR is the line number. If the line number modulo 4 is one, remember the first field but don't print. Otherwise, print with the remembered value and a space as the prefix.
iconvert() {
while true; do
read linePrefix || return 0
for i in $(seq $((prefix_lines - 1))); do
read line || return 0
echo "$linePrefix" "$line"
done
done
}
You just call iconvert like this:
prefix_lines=4
iconvert < file.txt
which gives your desired output.

sed: replace letter between square brackets

I have the following string:
signal[i]
signal[bg]
output [10:0]
input [i:1]
what I want is to replace the letters between square brackets (by underscore for example) and to keep the other strings that represents table declaration:
signal[_]
signal[__]
output [10:0]
input [i:1]
thanks
try:
awk '{gsub(/\[[a-zA-Z]+\]/,"[_]")} 1' Input_file
Globally substituting the (bracket)alphabets till their longest match then with [_]. Mentioning 1 will print the lines(edited or without edited ones).
EDIT: Above will substitute all alphabets with one single _, so to get as many underscores as many characters are there following may help in same.
awk '{match($0,/\[[a-zA-Z]+\]/);VAL=substr($0,RSTART+1,RLENGTH-2);if(VAL){len=length(VAL);;while(i<len){q=q?q"_":"_";i++}};gsub(/\[[a-zA-Z]+\]/,"["q"]")}1' Input_file
OR
awk '{
match($0,/\[[a-zA-Z]+\]/);
VAL=substr($0,RSTART+1,RLENGTH-2);
if(VAL){
len=length(VAL);
while(i<len){
q=q?q"_":"_";
i++
}
};
gsub(/\[[a-zA-Z]+\]/,"["q"]")
}
1
' Input_file
Will add explanation soon.
EDIT2: Following is the one with explanation purposes for OP and users.
awk '{
match($0,/\[[a-zA-Z]+\]/); #### using match awk's built-in utility to match the [alphabets] as per OP's requirement.
VAL=substr($0,RSTART+1,RLENGTH-2); #### Creating a variable named VAL which has substr($0,RSTART+1,RLENGTH-2); which will have substring value, whose starting point is RSTART+1 and ending point is RLENGTH-2.
RSTART and RLENGTH are the variables out of the box which will be having values only when awk finds any match while using match.
if(VAL){ #### Checking if value of VAL variable is NOT NULL. Then perform following actions.
len=length(VAL); #### creating a variable named len which will have length of variable VAL in it.
while(i<len){ #### Starting a while loop which will run till the value of VAL from i(null value).
q=q?q"_":"_"; #### creating a variable named q whose value will be concatenated it itself with "_".
i++ #### incrementing the value of variable i with 1 each time.
}
};
gsub(/\[[a-zA-Z]+\]/,"["q"]") #### Now globally substituting the value of [ alphabets ] with [ value of q(which have all underscores in it) then ].
}
1 #### Mentioning 1 will print (edited or non-edited) lines here.
' Input_file #### Mentioning the Input_file here.
Alternative gawk solution:
awk -F'\\[|\\]' '$2!~/^[0-9]+:[0-9]$/{ gsub(/./,"_",$2); $2="["$2"]" }1' OFS= file
The output:
signal[_]
signal[__]
output [10:0]
-F'\\[|\\]' - treating [ and ] as field separators
$2!~/^[0-9]+:[0-9]$/ - performing action if the 2nd field does not represent table declaration
gsub(/./,"_",$2) - replace each character with _
This might work for you (GNU sed);
sed ':a;s/\(\[_*\)[[:alpha:]]\([[:alpha:]]*\]\)/\1_\2/;ta' file
Match on opening and closing square brackets with any number of _'s and at least one alpha character and replace said character by an underscore and repeat.
awk '{sub(/\[i\]/,"[_]")sub(/\[bg\]/,"[__]")}1' file
signal[_]
signal[__]
output [10:0]
input [i:1]
The explanation is as follows: Since bracket is as special character it has to be escaped to be handled literally then it becomes easy use sub.

Insert double quotes multiple times into string

I have inherited a flat html file with a few hundred lines similar to this:
<blink>
<td class="pagetxt bordercolor="#666666 width="203 colspan="3 height="20>
</blink>
So far I have not been able to work out a sed way of inserting the closing double quotes for each element. Probably needs something other than sed to do this. Can anyone suggest an easy way to do this?
Thanks
sed -i 's/"\([^" >]\+\)\( \|>\)/"\1"\2/g' file.html
Explanation:
" - leading double quote
\([^" >]\+\) - non-quote-or-space-or-'>' chars, grouped (into group 1)
\( \|>\) - terminating space or '>', grouped (into group 2)
We replace it with '"<group1>"<group2>'.
One solution that pops out at me is to parse through each line of the file looking for the quote. When it finds one, activate a flag to keep track of being inside a quoted area, then continue parsing the line until it hits the first space or > it comes to and inserts an additional " just before it. Flip the flag off, then continue through the string looking for the next quote. Probably not a perfect solution, but a start perhaps.
If all lines share the same structure, you could use a simple texteditor to globally replace
' bordercolor'
with
'" bordercolor'
(without single-quotes). This is then independend from the field values and works similarly for the other fields. You still have to do some manual work, but if it's just one big file, I'd bite the bullet this time and not waste probably more time working out a sed-solution.
This should do if your file is simple - it won't work if you have whitespace which should be inside the quotes - in that case, a more complex code will be needed, but can be done along the same lines.
#!usr/bin/env python
#change the "utf-8" bellow to your files encoding
data = open("<myfile.html>").read().decode("utf-8")
new_data = []
inside_tag = False
inside_quotes = False
for char in data:
if char == "<":
inside_tag = True
if char == '"':
inside_quotes = True
if inside_tag and (char.isspace() or char==">") and inside_quotes:
new_data.append('"')
inside_quotes = False
if char == ">":
inside_tag = False
new_data.append(char)
outputfile = open("<mynewfile.html>", "wt")
outputfile.write("".join(new_data).encode("utf-8"))
outputfile.close()
with bash
for file in *
do
flag=0
while read -r line
do
case "$line" in
*"<blink>"*)
flag=1
;;
esac
if [ "$flag" -eq 1 ];then
case "$line" in
*class=\"pagetxt*">" )
line="${line%>}\">"
flag=0
;;
esac
fi
echo "${line}"
done <"file" > temp
mv temp "$file"
done
Regular expressions are your friend:
Find: (="[^" >]+)([ >])
Replace: \1"\2
After you've done that, make sure to run this one too:
Find: </?blink>
Replace: \n
(This won't fix more than one class on an element, like <element class="class1 class2 id="jimmy">)