Delete lines shorter than a certain length and the one above it (remove short sequences in a FASTA file) - sed

I have a file containing the following text:
>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC
If a line that doesn't start with ">" is 5 characters or shorter, I want to delete it and the line right above it.
Expected output:
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
I have tried sed -r '/^.{,5}$/d', but it also deletes the lines with ">".

Using sed
$ sed '/^>/N;/\n[A-Z]\{6,\}$/!d' input_file
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

With your shown samples, you could try the following awk code. Simple explanation: if a line starts with >, store it in the variable val and use next to skip the remaining statements; otherwise, if the length of the current line is greater than 5, print val, ORS (a newline) and the current line.
awk '/^>/{val=$0;next} length($0)>5{print val ORS $0}' Input_file
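Run against the sample above (assuming it is saved as Input_file), this prints:
$ awk '/^>/{val=$0;next} length($0)>5{print val ORS $0}' Input_file
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC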

With GNU sed, you can use
sed -E '/>/N;/\n[^>].{0,4}$/d'
Details:
/>/ - finds lines with > (if it must be at the start, add ^ before >)
N - reads the next line of input and appends it to the pattern space with a leading newline
\n[^>].{0,4}$ - a newline, a char other than > (as the first char should not be >) and then zero to four chars till the end of the string
d - deletes the pattern space (both lines).
See the online demo:
#!/bin/bash
s='>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC'
sed -E '/>/N;/\n[^>].{0,4}$/d' <<< "$s"
Output:
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

Do not reinvent the wheel. Use common bioinformatics tools for that, such as seqtk or seqkit. Among other things, these tools can handle multiline FASTA sequences. Examples:
seqtk seq -L 6 in.fasta > out.fasta
seqkit seq -m 6 in.fa > out.fasta
Both options keep sequences whose length is at least the given value, so a threshold of 6 drops the 5-base seq1 as in the expected output above.
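As an illustration of the multiline handling mentioned above, here is a quick sketch (the file wrapped.fasta and its record are made up; a line-based sed approach would drop seq5 because each of its lines is only 5 characters):
$ cat wrapped.fasta
>seq5
ATTCC
GTGCC
$ seqkit seq -m 6 wrapped.fasta
>seq5
ATTCCGTGCC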
To install these tools, use conda, specifically miniconda, for example:
conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate
REFERENCES:
Remove sequences <300 bases from FASTA file: https://www.biostars.org/p/329680/
seqtk: https://github.com/lh3/seqtk
seqkit: https://bioinf.shenwei.me/seqkit/
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

very inelegant solution just to get it to work =(
{m,g}awk ' !_<NR && 5 < length($(NF-=_==$NF))\
&& \
($!NF =">seq" $!_ ORS $+NF)^_' FS='[ \t\n]+' OFS='\n' RS='\n*>seq'
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

I would recommend Biopython for this. It makes FASTA processing convenient.
'''
Biopython version: 1.79
'''
from Bio import SeqIO

for seq_record in SeqIO.parse("input.fasta", "fasta"):
    if len(seq_record.seq) > 5:
        print(">" + seq_record.id)
        print(seq_record.seq)
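If you would rather write the surviving records to a new FASTA file instead of printing them, SeqIO.write can do that; a minimal sketch, assuming the file names input.fasta and filtered.fasta are placeholders:
from Bio import SeqIO

# keep only records whose sequence is longer than 5 bases
records = (r for r in SeqIO.parse("input.fasta", "fasta") if len(r.seq) > 5)
SeqIO.write(records, "filtered.fasta", "fasta")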

Related

Use sed to remove lines that do not match a pattern but keep header line

I am cleaning up a dataset (a CSV dataset). I only want to keep records in which all fields are complete and have the right type of values. This is what I tried:
sed -r '{
/regex_pattern/!d
more commands follow...
}' $1
The program works just fine and does what it is supposed to do. The problem is that it also removes the very first line (the header line), since it does not match the specific regex_pattern. I know there is a way to specify the range in which a command should apply, so for example:
sed '2,$ s/A/a/'
will do substitutions on the data while skipping the header line. Based on this logic I tried:
sed -r '{
2,$/regex_pattern/!d
more commands follow...
}' $1
so that the header line will be left untouched; however, this code does not run at all. So what (and why) would be the right command to do what I am intending?
As an example, imagine my csv file is fruits.csv and that my regex_pattern is [0-9]+,[0-9]+
apples,oranges
20,5
7,3
,4
a,b
12,22
When I call the .sh script that contains the sed commands, it should output:
apples,oranges
20,5
7,3
12,22
So, note that:
Header line was not deleted even though it does not match the regex_pattern.
Line number 4, i.e. ",4" was deleted as it does not match the regex_pattern.
Line number 5, i.e. "a,b" was deleted as it does not match the regex_pattern.
Any help is very much appreciated and I wish to thank you all in advance.
Kind regards.
You could write it like this, matching the whole line, starting at the second line:
sed -r '
2,${/^[0-9]+,[0-9]+$/!d}
' file
Output
apples,oranges
20,5
7,3
12,22
If you also want to allow single numbers or more than just 2 comma separated numbers:
sed -r '
2,${/^[0-9]+(,[0-9]+)*$/!d}
' file
Using sed
$ sed '2,${/[0-9]\+,[0-9]\+/!d}' input_file
apples,oranges
20,5
7,3
12,22
Any one of these should work in gawk, mawk 1/2, or macOS nawk:
mawk 'NF-_^(NF==NR)' FS='^[0-9]+,[0-9]+$'
nawk '(NF!=NR)!=NF' FS='^[0-9]+,[0-9]+$'
gawk 'NF-(NF!~NR)' FS='^[0-9]+,[0-9]+$'
apples,oranges
20,5
7,3
12,22
more concisely would be
mawk -F'[0-9]+,[0-9]+' '(NF<NR)-NF' # using FS
gawk '/[0-9]+,[0-9]+/^+(NF<NR)' # not using FS
nawk '(NF<NR)<=/([0-9]+,?){2}/' # same approach, rev. order
mawk '(NF~NR)-/[0-9]+,[0-9]+/' # truly fringe but
# concise syntax
nawk '(NF~NR)!=/([0-9]+,?){2}/' # same approach, to
# circumvent nawk peculiarities
sed is a bad choice for working with CSVs since it has no inbuilt functionality for working with fields, literal strings, or variables, and it doesn't use EREs by default (all of the answers you have so far will only work with GNU sed), etc. To do what you specifically want with any awk in any shell on every Unix box is simply:
$ awk 'NR==1 || /[0-9]+,[0-9]+/' file
apples,oranges
20,5
7,3
12,22
which says "if the current line number (stored in NR) is 1 or the regexp matches the current line contents then print the line". Anything else you want to do with your CSV will also be easier with awk than with sed.
Meh, I would just preserve the first line.
sed -r '
1{p;d}
/regex_pattern/!d
more commands follow...
' "$1"
or run it on every line except the first:
1!{
/regex_pattern/!d
more commands follow...
}
This might work for you (GNU sed):
sed -E '1!{/^[0-9]+,[0-9]+$/!d}' file
If it is not the first line, delete any line that does not match one set of comma separated natural numbers.
Alternative:
sed -E '1b;/^[0-9]+,[0-9]+$/!d' file
Or:
sed -nE '1p;1b;/^[0-9]+,[0-9]+$/p' file

Sed Process Substitution on Insert - Without Backslashes

I have a function that prints a header that needs to be applied across several files, but if I utilize sed with process substitution, the lines prior to the last end up with a backslash \ on them.
E.g.
function print_header() {
cat << EOF
-------------------------------------------------------------------
$(date '+%B %d, %Y # ~ %r') ID:$(echo $RANDOM)
EOF
}
If I then take a file such as test.txt:
line 1
line 2
line 3
line 4
line 5
sed "1 i $(print_header | sed 's/$/\\/g')" test.txt
I get:
-------------------------------------------------------------------\
November 24, 2015 # ~ 11:18:28 AM ID:13187
line 1
line 2
line 3
line 4
line 5
Notice the troublesome backslash at the end of the first line; I'd like that backslash not to appear. Any ideas?
I would use cat for that:
cat <(print_header) file > file_with_header
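Since the header needs to be applied across several files, the same idea extends to a loop; a sketch, assuming bash (process substitution is not POSIX) and that *.txt matches the files you want to prepend to:
for f in *.txt; do
  cat <(print_header) "$f" > "$f.with_header"
done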
This behavior depends on the sed dialect; unfortunately, it's one of the things that varies from version to version.
To simplify debugging, try specifying verbatim text. Here's an example from a Debian system.
vnix$ sed '1i\
> foo\
> bar' <<':'
> hello
> goodbye
> :
foo
bar
hello
goodbye
Your diagnostics appear to indicate that your sed dialect does not in fact require the backslash after the first i.
Since you are generating the contents of the header programmatically anyway, my recommended solution would be to refactor the code so that you can avoid this conundrum. If you don't want cat <<EOF test.txt then maybe experiment with sed '1r /dev/stdin' <<EOF test.txt (I could not get 1r- to work, but /dev/stdin should be portable to any Linux.)
Here is my kludgy fix; if you can find something more elegant, I'll gladly credit you:
sed "1 i $(print_header | sed 's/$/\\/g;$s/$/\x01/')" test.txt | tr -d '\001'
This puts an unprintable SOH (\x01) ASCII Start of Heading character after the inserted text, which precludes the trailing backslashes, and then I run it through tr to delete the SOH chars.

Using sed to keep the beginning of a line

I have a file in which some lines start by a >
For these lines, and only these ones, I want to keep the first eleven characters.
How can I do that using sed?
Or maybe something else is better?
Thanks!
Muriel
Let's start with this test file:
$ cat file
line one with something or other
>1234567890abc
other line in file
To keep only the first 11 characters of lines starting with > while keeping all other lines:
$ sed -r '/^>/ s/(.{11}).*/\1/' file
line one with something or other
>1234567890
other line in file
To keep only the first eleven characters of lines starting with > and deleting all other lines:
$ sed -rn '/^>/ s/(.{11}).*/\1/p' file
>1234567890
The above was tested with GNU sed. For BSD sed, replace the -r option with -E.
Explanation:
/^>/ is a condition. It means that the command which follows only applies to lines that start with >
s/(.{11}).*/\1/ is a substitution command. It replaces the whole line with just the first eleven characters.
-r turns on extended regular expression format, eliminating the need for some escape characters.
-n turns off automatic printing. With -n in effect, lines are only printed if we explicitly ask them to be printed. In the second case above, that is done by adding a p after the substitute command.
Other forms:
$ sed -r 's/(>.{10}).*/\1/' file
line one with something or other
>1234567890
other line in file
And:
$ sed -rn 's/(>.{10}).*/\1/p' file
>1234567890
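Since the question asks whether something other than sed might be better, here is the same thing in awk for comparison; a sketch where substr keeps the first eleven characters of lines starting with >:
$ awk '/^>/ { $0 = substr($0, 1, 11) } 1' file
line one with something or other
>1234567890
other line in file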

Replacing several lines in a script with a single line using sed

Say I have a script where I want to replace several lines with a single line.
For example, I have a new function that summarizes several commands, so I can edit my script as follows:
Original
some_code
command1
command2
command3
some_more_code
Edited
some_code
foo()
some_more_code
How would you do that using sed?
sed '/some_code/,/command3/ !b
/some_code/ b
/command3/ a\
foo()
d' YourFile
Be careful about metacharacters (like &\\^$[]{}().) in any of the patterns (except your foo() line).
I am answering my own question here.
I couldn't figure out a way to do it in one go, so I split the problem into two parts.
Part 1: replace the first line
sed -e 's/command1/foo()/g' file1 > file2
Part 2: remove the rest of the lines
sed -e '/command2/,+1d' file2 > file3
I'd prefer a more elegant way though, where I can be flexible in the number of lines that I am replacing, possibly matching the last command in the block. Any ideas?
Just use awk:
$ awk -v RS='^$' -v ORS= '{sub(/command1\ncommand2\ncommand3/,"foo()")}1' file
some_code
foo()
some_more_code
The above uses GNU awk for multi-char RS.
This might work for you (GNU sed):
sed '/command1/,/command3/c\foo()' file
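Applied to the sample in the question (GNU sed, assuming the script is saved as file), the range change produces:
$ sed '/command1/,/command3/c\foo()' file
some_code
foo()
some_more_code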

sed + remove "#" and empty lines with one sed command

How do I remove comment lines (such as # bla bla) and empty lines (lines without characters) from a file with one sed command?
THX
lidia
If you're worried about starting two sed processes in a pipeline for performance reasons, you probably shouldn't be; it's still very efficient. But based on your comment that you want to do in-place editing, you can still do that with distinct commands (sed commands rather than invocations of sed itself).
You can either use multiple -e arguments or separate commands with a semicolon, something like (just one of these, not both):
sed -i -e 's/#.*$//' -e '/^$/d' fileName
sed -i 's/#.*$//;/^$/d' fileName
The following transcript shows this in action:
pax> printf 'Line # with a comment\n\n# Line with only a comment\n' >file
pax> cat file
Line # with a comment

# Line with only a comment
pax> cp file filex ; sed -i 's/#.*$//;/^$/d' filex ; cat filex
Line
pax> cp file filex ; sed -i -e 's/#.*$//' -e '/^$/d' filex ; cat filex
Line
Note how the file is modified in-place even with two -e options. You can see that both commands are executed on each line: the comment-only line first has its comment removed and is then deleted because it is empty, while the line with a trailing comment keeps its text.
In addition, the original empty line is also removed.
#paxdiablo has a good answer but it can be improved.
(1) The '/^$/d' clause only matches 100% blank lines.
If you want to also match lines that are entirely whitespace (spaces, tabs, etc.), use this instead:
'/^\s*$/d'
(2) Combined with '/^$/d', the 's/#.*$//' clause only fully removes comment lines whose # is in column 0; a comment line with leading whitespace is left behind as a whitespace-only line.
If you also want to remove lines that have only whitespace before the first #, use this instead:
'/^\s*#.*$/d'
The above criteria may not be universal (e.g. within a heredoc block or a Python multi-line string, the different approaches could be significant), but in many cases the conventional definition of "blank" lines includes whitespace-only lines, and "comment" lines includes whitespace-then-#.
(3) Lastly, on OSX at least, the #paxdiablo approach, in which the first clause turns comment lines into blank lines and the second clause strips blank lines (including what were originally comments), doesn't work. It seems more portable to make both clauses /d delete actions, as I've done.
The revised command incorporating the above is:
sed -e '/^\s*#.*$/d' -e '/^\s*$/d' inputFile
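A quick check with GNU sed (\s is a GNU extension; the sample lines are made up to show an indented comment and a tab-only line):
$ printf '  # indented comment\ncode line\n\t\n# plain comment\n' | sed -e '/^\s*#.*$/d' -e '/^\s*$/d'
code line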
This tiny jewel removes all # comments, no matter where they begin in a line (see caution below):
sed -e 's/\s*#.*$//'
Example:
text="
this is a # test
#this is a test
#this is a #test
this is # another #test
"
$echo "$text" | sed -e 's/\s*#.*$//'
this is a
this is
Next, this removes any resulting blank lines:
$ echo "$text" | sed -e 's/\s*#.*$//' | sed -e '/^\s*$/d'
Caution: Depending on the syntax and/or interpretation of the lines you're processing, this might not be an appropriate solution, as it blindly removes the end of any line containing '#', even if the '#' is part of your data or code. However, for use cases where you'll never use a hash except as an end-of-line comment, it works fine. So, just as with all coding, context must be taken into consideration.
Alternative variant, using grep:
cat file.txt | grep -Ev '(#.*$)|(^$)'
you can use awk
awk 'NF{gsub(/^[ \t]*#/,"");print}' file
The first example (paxdiablo's) is very good, except it does not change the file, it just outputs the result. If you want to change it in place:
sudo sed -i 's/#.*$//;/^$/d' inputFile
On (one of) my linux boxes, sed understands extended regular expressions with the -r option, so:
sed -r '/(^\s*#)|(^\s*$)/d' squid.conf.installed
is very useful for showing all non-blank, non-comment lines.
The regex matches either start of line followed by zero or more spaces or tabs followed by either a hash or end of line, and deletes those matching lines from the input.