Remove everything after and including a tab in FASTA header? - sed

I am trying to keep only the first field identifier for each sequence in a .fasta file that looks like this:
>hetGla3 ENST00000215754.179
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10 ENST00000215754.270
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN
I want to remove the tab and the "ENST..." identifier after it, returning:
>hetGla3
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN
I have already tried sed to remove the whitespace from the headers, but it doesn't appear to be working (it returns the original format):
sed 's/\.[^\.]*//'
Any help would be greatly appreciated! Thank you.

Using GNU sed
$ sed -E '/^>/s/( +|\t).*//' input_file
>hetGla3
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN
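If your sed is not GNU sed, \t may not be understood as a tab. A minimal portable sketch, assuming a made-up two-record sample (shortened sequences, one header with a tab and one with a space), is to put a literal tab into the bracket expression through a shell variable:

```shell
# Hypothetical sample: shortened sequences, one tab-separated and one
# space-separated header. The variable name "tab" is just for illustration.
tab=$(printf '\t')
result=$(printf '>hetGla3\tENST00000215754.179\nATGCCG\n>musMus10 ENST00000215754.270\nATGCCT\n' |
  sed "/^>/s/[ $tab].*//")   # on header lines, drop the first space/tab and everything after it
printf '%s\n' "$result"
```

Sequence lines are left alone because the substitution is restricted to lines starting with >.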

Another option is to capture the leading part that contains no whitespace in group 1, and to match the rest of the line, which should be removed.
In the replacement, use capture group 1 via \1:
sed -E 's/^(>[^[:space:]]+)[[:space:]].*/\1/' file
The pattern matches:
^ Start of the line
(>[^[:space:]]+) Capture group 1, match > and 1+ non-whitespace characters using a negated character class
[[:space:]] Match a single whitespace character (space or tab)
.* Match the rest of the line
Output
>hetGla3
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN
If awk is also an option, you can print field 1 if the line starts with >, else you print the whole line.
awk '/^>/ {print $1;next}1' file

This is the job that cut exists to do (its default field delimiter is a tab):
$ cut -f1 file
>hetGla3
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN
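One caveat worth checking on your own file: cut -f1 only splits on tabs. A small sketch, assuming shortened made-up records, showing the tab case and the -d ' ' variant you would need if the headers turn out to be space-separated:

```shell
# Hypothetical shortened records; the variable names are for illustration only.
# Tab-separated header: the default delimiter works.
tab_case=$(printf '>hetGla3\tENST00000215754.179\nATGCCG\n' | cut -f1)
# Space-separated header: the delimiter has to be given explicitly.
space_case=$(printf '>musMus10 ENST00000215754.270\nATGCCT\n' | cut -d ' ' -f1)
printf '%s\n' "$tab_case" "$space_case"
```

Lines without the delimiter (the sequence lines) are passed through unchanged by cut.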

How to replace a fixed position character of a string?

Suppose I have a file containing the string AKASHMANDAL
I want to replace the character at the 7th position (whatever that character may be) with "D"
The output should look like
AKASHMDNDAL
I tried the following command, which only adds the character after the 7th position
sed -E 's/^(.{7})/\1D/' file
This gives me AKASHMADNDAL
How can I replace the character instead of just adding?
Substitute any character in the 7th position using sed
$ sed 's/./D/7' input_file
AKASHMDNDAL
You can simply match one character outside of the capture group:
sed -E 's/^(.{6})./\1D/'
(notice the dot outside the parentheses)
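To see the difference side by side, here is a small sketch running both substitutions on the sample string (the variable names are only for illustration):

```shell
# Capturing 7 characters keeps all of them, so D is inserted after position 7.
added=$(printf 'AKASHMANDAL\n' | sed -E 's/^(.{7})/\1D/')
# Capturing 6 and matching one more outside the group drops the 7th character.
replaced=$(printf 'AKASHMANDAL\n' | sed -E 's/^(.{6})./\1D/')
printf '%s\n' "$added" "$replaced"
```

The first command prints AKASHMADNDAL and the second AKASHMDNDAL.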
If you can consider an awk solution, awk can handle this without a regex and makes it easier to adjust the position:
awk '{print substr($0,1,6) "D" substr($0,8)}' file
AKASHMDNDAL
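If the position or the replacement character may change, the same substr idea can be parameterized. A sketch, assuming two made-up awk variables pos and ch passed with -v:

```shell
# pos and ch are hypothetical names chosen for this example.
pos=7
ch=D
result=$(printf 'AKASHMANDAL\n' |
  awk -v pos="$pos" -v ch="$ch" '{ print substr($0, 1, pos - 1) ch substr($0, pos + 1) }')
printf '%s\n' "$result"
```

Note that for a line shorter than pos this simply appends ch, so you may want a length($0) >= pos check depending on your data.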
With your shown samples only, please try the following awk code, written and tested in GNU awk (it relies on the gawk-specific RT variable).
awk -v RS='^.{7}' '
RT{
sub(/.$/,"",RT)
ORS=RT"D"
print
}
END{
ORS=""
print
}
' Input_file

GREP Print Blank Lines For Non-Matches

I want to extract strings between two patterns with GREP, but when no match is found, I would like to print a blank line instead.
Input
This is very new
This is quite old
This is not so new
Desired Output
is very
is not so
I've attempted:
grep -o -P '(?<=This).*?(?=new)'
But this does not preserve the second blank line in the above example. I have searched for over an hour and tried a few things, but nothing has worked out.
I will happily use a solution in sed if that's easier!
You can use
#!/bin/bash
s='This is very new
This is quite old
This is not so new'
sed -En 's/.*This(.*)new.*|.*/\1/p' <<< "$s"
This yields
is very
is not so
Details:
E - enables POSIX ERE regex syntax
n - suppresses default line output
s/.*This(.*)new.*|.*/\1/ - finds either any text, This, any text (captured into Group 1, \1), new, and then any text again, or else the whole string (in sed, the line), and replaces the match with the Group 1 value.
p - prints the result of the substitution.
And this is what you need for your actual data:
sed -En 's/.*"user_ip":"([^"]*).*|.*/\1/p'
The [^"]* matches zero or more chars other than a " char.
With your shown samples, please try the following awk code.
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} NF!=3{print ""}' Input_file
OR
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} {print ""}' Input_file
Explanation: This\\s+ and \\s+new are set as alternative field separators for every line of Input_file. In the main program, if NF (the number of fields) is 3, the 2nd field is printed and next skips to the next line. Otherwise (NF is not 3), a blank line is printed.
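To make the NF test less mysterious, here is a small sketch that prints NF next to the second field for each sample line. As an assumption on my side, \\s+ is replaced by [ ]+ so that it does not depend on GNU awk:

```shell
input='This is very new
This is quite old
This is not so new'
# The separator at the start and at the end of a line produces an empty
# first and last field, hence NF is 3 only when both separators are present.
result=$(printf '%s\n' "$input" | awk -F'This[ ]+|[ ]+new' '{ print NF ":" $2 }')
printf '%s\n' "$result"
```

This prints 3:is very, 2:is quite old and 3:is not so, which is why only lines with NF==3 are treated as matches.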
sed:
sed -E '
/This.*new/! s/.*//
s/.*This(.*)new.*/\1/
' file
First line: for lines not matching "This.*new", remove all characters, leaving a blank line.
Second line: for lines matching the pattern, keep only the "middle" text.
Note that this is not the PCRE non-greedy match: the line
This is new but that is not new
will produce the output
is new but that is not
To continue to use PCRE, use perl:
perl -lpe '$_ = /This(.*?)new/ ? $1 : ""' file
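A quick way to check that the blank line really survives is to count the output lines. A sketch using the two-command sed approach on the sample input (variable names are made up for the example):

```shell
input='This is very new
This is quite old
This is not so new'
# Non-matching lines are emptied, matching lines keep the text between
# This and new (including the surrounding spaces).
result=$(printf '%s\n' "$input" |
  sed -E -e '/This.*new/!s/.*//' -e 's/.*This(.*)new.*/\1/')
line_count=$(printf '%s\n' "$result" | wc -l)
second_line=$(printf '%s\n' "$result" | sed -n '2p')
printf '%s\n' "$result"
```

Three lines come out for three lines in, and the second one is empty.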
This might work for you:
sed -E 's/.*This(.*)new.*|.*/\1/' file
If the first match is made, the line is replaced by everything between This and new.
Otherwise the second match will remove everything.
N.B. The substitution will always match one of the conditions. The solution was suggested by Wiktor Stribiżew.

sed - Replace comma after first regex match

I'm trying to perform the following substitution on lines of the general format:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......
As you can see, the problem is that it's a comma-separated file with a specific field containing a decimal comma. I would like to replace that comma with a dot.
I've tried the following to replace the first occurrence of a pattern after a match, but to no avail. Could someone help me?
sed -e '/,"/!b' -e "s/,/./"
sed -e '/"/!b' -e ':a' -e "s/,/\./"
Thanks in advance. An awk or perl solution would help me as well. Here's an awk effort:
gawk -F "," 'substr($10, 0, 3)==3 && length($10)==12 { gsub(/,/,".", $10); print}'
That yielded the same file unchanged.
CSV files should be parsed in GNU awk with a proper FPAT variable that defines what constitutes a valid field in such a file. Once you do that, you can just iterate over the fields to do the substitution you need:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS="," }
{ for(i=1; i<=NF;i++) if ($i ~ /[,]/) gsub(/[,]/,".",$i);}1' file
See this answer of mine to understand how to define and parse CSV file content with FPAT variable. Also see Save modifications in place with awk to do in-place file modifications like sed -i''.
The following sed (GNU sed, since \? and \+ are GNU extensions to basic regular expressions) will convert the decimal comma in all quoted numeric fields:
sed 's/"\([-+]\?[0-9]*\),\([0-9]\+\([eE][-+]\?[0-9]\+\)\?\)"/"\1.\2"/g'
See: https://www.regular-expressions.info/floatingpoint.html
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./;ta' file
This regexp matches a , within a pair of "'s and replaces it with a dot. The regexp is anchored to the start of the line and thus needs to be repeated until no further matches can be made, hence the :a and ta commands, which cause the substitution to be repeated for as long as it succeeds.
N.B. The solution expects that all double quotes are matched and that no double quotes are quoted i.e. \" does not appear in a line.
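As a sanity check of the loop, here is a sketch on a made-up line with two quoted fields. The commands are split into separate -e expressions, which as far as I know also keeps non-GNU seds happy with the label:

```shell
# Hypothetical input with two quoted decimal-comma fields.
result=$(printf '%s\n' 'a,"1,5","2,75",b' |
  sed -E -e ':a' -e 's/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./' -e 'ta')
printf '%s\n' "$result"
```

Each pass fixes one comma inside quotes, and the commas between fields are left untouched, giving a,"1.5","2.75",b.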
If your input always follows that format of only one quoted field containing 1 comma then all you need is:
$ sed 's/\([^"]*"[^"]*\),/\1./' file
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC, .......
If it's more complicated than that then see What's the most robust way to efficiently parse CSV using awk?.
Assuming you have this:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC
Try this:
awk -F',' '{print $1,$2,$3,$4"."$5,$6,$7}' filename | awk '$1=$1' FS=" " OFS=","
Output will be:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
You simply need to know the field numbers for replacing the field separator between them.
To write +, ( and ) unescaped, as you would in a Perl-style regex, enable extended regular expressions with -r (GNU sed; -E is the more portable spelling).
So if you want to convert every such quoted number and drop the " characters, you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/g'
If you want to replace only the first occurrence, you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/1'
https://www.gnu.org/software/sed/manual/sed.txt

Select specific items from a file using sed

I'm very much a junior when it comes to the sed command, and my Bruce Barnett guide sits right next to me, but one thing has been troubling me. With a file, can you filter it using sed to select only specific items? For example, in the following file:
alpha|november
bravo|october
charlie|papa
alpha|quebec
bravo|romeo
charlie|sahara
Would it be possible to set a command to return only the bravos, like:
bravo|october
bravo|romeo
With sed:
sed '/^bravo|/!d' filename
Alternatively, with grep (because it's sort of made for this stuff):
grep '^bravo|' filename
or with awk, which works nicely for tabular data,
awk -F '|' '$1 == "bravo"' filename
The first two use a regular expression, selecting those lines that match it. In ^bravo|, ^ matches the beginning of the line and bravo| the literal string bravo|, so this selects all lines that begin with bravo|.
The awk way splits the line across the field separator | and selects those lines whose first field is bravo.
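A small sketch to convince yourself that the three commands select the same lines, using the sample data from the question fed through a pipe instead of a file:

```shell
data='alpha|november
bravo|october
charlie|papa
alpha|quebec
bravo|romeo
charlie|sahara'
with_sed=$(printf '%s\n' "$data" | sed '/^bravo|/!d')            # delete non-matching lines
with_grep=$(printf '%s\n' "$data" | grep '^bravo|')               # keep matching lines
with_awk=$(printf '%s\n' "$data" | awk -F '|' '$1 == "bravo"')    # compare the first field exactly
printf '%s\n' "$with_awk"
```

All three variables end up holding the two bravo lines.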
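You could also use a regex with awk (the | must be escaped here, because awk uses extended regular expressions, where a bare | means alternation and would make the pattern match every line):
awk '/^bravo\|/' filename
...but I don't think this plays to awk's strengths in this case.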
Another solution with sed:
sed -n '/^bravo|/p' filename
-n option => no printing by default.
If line begins with bravo|, print it (p)
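There are (at least) two ways with sed.
Removing the unwanted lines:
sed '/^bravo|/ !d' YourFile
Printing only the wanted lines:
sed -n '/^bravo|/ p' YourFile
If there is no other constraint or action, both are equivalent, and grep is the better tool.
If further commands follow, the choice can affect performance: d ends the cycle and moves straight to the next line, whereas p prints the line and then continues with the following commands.
Note that the pipe must not be escaped here: in a basic regular expression a bare | is a literal pipe, while GNU sed treats \| as alternation (so /^bravo\|/ would match every line). With -E (extended regular expressions) it is the other way round, and the literal pipe is written \|.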
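Since the escaping rules for | are easy to mix up, here is a sketch of the two spellings that, as far as I know, mean a literal pipe: unescaped in a basic regular expression, escaped with -E.

```shell
data='alpha|november
bravo|october
charlie|papa
alpha|quebec
bravo|romeo
charlie|sahara'
# Basic regular expression: a bare | is an ordinary character.
bre=$(printf '%s\n' "$data" | sed -n '/^bravo|/p')
# Extended regular expression (-E): | is alternation, so the literal pipe is \|.
ere=$(printf '%s\n' "$data" | sed -E -n '/^bravo\|/p')
printf '%s\n' "$bre"
```

Both print only the two bravo lines. Writing \| without -E on GNU sed turns it into an alternation with an empty branch, which matches every line.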

sed command to substitute any content after equals on line 3 with a string

I am trying to substitute this line
<data_item name="any_text">
which is on line 3 with
<data_item name="my_text">
So I tried something like sed '3s/=*/=my_text/' input_file > output_file
But this is printing my text at the beginning of the line. Tried braces around (=*) and (=my_text) but that doesn't do anything.
Try this
sed '3s/any_text/my_text/' file
Example:
$ echo '<data_item name="any_text">' | sed '1s/any_text/my_text/'
<data_item name="my_text">
Your code:
sed '3s/=*/=my_text/'
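Your code will replace 0 or more equals signs on the third line with =my_text. Because zero equals signs already match at the very start of the line, the text is inserted there.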
sed '3s/=".*"/="my_text"/'
should work. You need the . because it matches any character, and the * says you want 0 or more of the preceding symbol (here, the .). In regex terms, * doesn't mean "match anything". It is a quantifier: it specifies an amount, namely how many of the preceding term you want to match.
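A small sketch showing both behaviours on a one-line input (so the address is 1 rather than 3). The [^"]* variant is my own assumption for the case where the line might carry more than one quoted attribute, since .* is greedy:

```shell
line='<data_item name="any_text">'
# =* matches the empty string at the start of the line, so the text is inserted there.
wrong=$(printf '%s\n' "$line" | sed '1s/=*/=my_text/')
# Match the equals sign and the whole quoted value, then rewrite it.
right=$(printf '%s\n' "$line" | sed '1s/="[^"]*"/="my_text"/')
printf '%s\n' "$wrong" "$right"
```

The first command prints =my_text<data_item name="any_text"> and the second prints <data_item name="my_text">.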
sed '3 c\
<data_item name="my_text">' YourFile
This assumes that the content of my_text does not contain an unescaped \, & or '.
Avoid reading from (<) and redirecting to (>) the same file in a single command, as it could have unexpected results. Use a temporary file, or -i with GNU sed.
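A sketch of the temporary-file route, assuming a throwaway three-line file created under mktemp (the file names are made up for the example):

```shell
# Hypothetical scratch file; only line 3 carries the attribute to change.
workdir=$(mktemp -d)
printf 'line one\nline two\n<data_item name="any_text">\n' > "$workdir/input_file"
# Write to a separate file first, then replace the original only if sed succeeded.
sed '3s/="[^"]*"/="my_text"/' "$workdir/input_file" > "$workdir/input_file.tmp" &&
  mv "$workdir/input_file.tmp" "$workdir/input_file"
third=$(sed -n '3p' "$workdir/input_file")
untouched=$(sed -n '1p' "$workdir/input_file")
rm -rf "$workdir"
printf '%s\n' "$third"
```

Redirecting straight back into input_file would truncate it before sed gets to read it, which is why the intermediate file is used.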