sed: behavior of H and D

My sed script is this:
# script.sed
1,3H
1,3g
3D
When I run it, I get the following:
$ seq 5 | sed -f script.sed
1
1
2
4
5
However, this seems wrong to me. On line 3, once the D command has executed, the pattern space contains:
1
2
3
When the cycle is restarted, H should set the hold space to:
<empty_line>
1
2
3
1
2
3
Then, g should set the pattern space to the same content. D will then remove the first (empty) line. Every time the cycle is restarted, the hold space will effectively double. Hence, this should lead to an infinite loop.
What am I missing?

Below, I show how I interpret the expected execution, showing as an ordered pair the result of the command, with the pattern space first and the hold space following:
1: H(1,\n1) g(\n1,\n1) > \n1\n
2: H(2,\n1\n2) g(\n1\n2,\n1\n2) > \n1\n2\n
3: H(3,\n1\n2\n3) g(\n1\n2\n3,\n1\n2\n3) D(,\n1\n2\n3) >
4: > 4\n
5: > 5\n
If I take the output of this interpretation and concatenate it into an echo command with the -e option, I get:
$ echo -e '\n1\n\n1\n2\n4\n5\n'
1
1
2
4
5
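The empty lines in this output are easy to lose in a terminal listing. As a small sketch, assuming GNU sed (other implementations may handle the restarted cycle differently), the result can be piped through a second sed whose l command marks every line end with $, so that empty lines show up as a lone $. The variable name visible is made up for this example.

```shell
# Run the three commands from script.sed, then make the line ends visible.
# "visible" is a name chosen for this example only.
visible=$(seq 5 | sed -e '1,3H' -e '1,3g' -e '3D' | sed -n 'l')
printf '%s\n' "$visible"
```

Comparing this listing with the interpretation above shows line by line where the two agree.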

Related

Appending using sed pattern after certain line number

I am using the following command to append a string after AMP, but now I want to append only after the AMP that comes after SET2 (or after line number 9). Can this command be modified to append the string only after SET2 or line number 9? And if I want to append only to the SET1 AMPs (before line number 9), could someone help me with the command? Thanks.
$ sed -i '/AMP/a Target4' test.txt
$ cat test.txt
#SET1
AMP
Target 1
Target 2
AMP
Target 3
Target 4
Target 5
#Set2
AMP
Target 11
Target 12
Note that there are no blank lines in the text above.
Would you please try the following:
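sed -i '
# from the line that starts with "#Set2" through the last line ($), run the {block}
/^#Set2/,${
# append the string after each "AMP" line; this comment must sit on its own line,
# because everything after "a" up to the end of the line becomes appended text
/AMP/a Target4
}
' test.txt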
If you want to append the string before the #Set2 line, please try:
sed -i '
1,/^#Set2/ { ;# execute the {block} from line 1 through the first line that matches /^#Set2/
/AMP/a Target4
}
' test.txt
The expression address1,address2 works like a flip-flop operator. Once
address1 (a line number, regular expression, or other condition) matches,
the operator keeps returning true until address2 matches.
The command or block that follows is therefore executed on every line
from address1 through address2.
If you want to append after every AMP that comes after #Set2 or after line number 9,
I think it is better to process lines 1 through 8 and lines 9 onward separately.
For example:
sed '
1,8{
/^#Set2/,${
/AMP/a Target4
}
}
9,${
/AMP/a Target4
}' test.txt

sed: Delete first line of hold space?

How do I delete the first line of the hold space in sed?
I've tried
x;
s/.*\n//;
x;
But .*\n matches up to the last newline, deleting all the lines except for the last one.
This should remove the first line from the hold space (the result is left in the pattern space; add another x to swap it back):
x;s/[^\n]*\n//
Example:
kent$ sed -n 'H;${x;p}' <(seq 3)
1
2
3
Removing the first (empty) line:
kent$ sed -n 'H;${x;s/[^\n]*\n//;p}' <(seq 3)
1
2
3
Alternatively, simply put something into the hold space with h first, e.g. 1h;1d; by default the hold space is empty.
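The difference between the greedy pattern from the question and the [^\n]*\n pattern can be seen side by side in a short sketch. It assumes GNU sed, since \n inside a bracket expression is a GNU extension; the variable names greedy and first_only are made up for this example.

```shell
# H on every line leaves the hold space as "\na\nb\nc" at the last line.
# Greedy .*\n swallows everything up to the last newline:
greedy=$(printf 'a\nb\nc\n' | sed -n 'H;${x;s/.*\n//;p;}')

# [^\n]*\n stops at the first newline, so only the leading empty line goes
# (\n inside a bracket expression is a GNU sed extension):
first_only=$(printf 'a\nb\nc\n' | sed -n 'H;${x;s/[^\n]*\n//;p;}')

printf '%s\n' "$greedy" "---" "$first_only"
```

The first result keeps only the last line, while the second keeps all three input lines.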

replace two consecutive lines based on a pattern and repeat through out the file

I'm trying to replace two consecutive lines based on a pattern match, and would want this to repeat for the entire file. Here is the input file:
c aaaaa bbb
+ 0.1
c xxxx
c yyyy
+ 0.2
* c gggg
m eeeee hhhhh
+ 0.3
The command I tried is:
sed '/^c/{N;s/+/*+/}'
I expected a * to be prepended to each line beginning with +, but only for lines immediately following a c line:
c aaaaa bbb
*+ 0.1
c xxxx
c yyyy
*+ 0.2
* c gggg
m eeeee hhhhh
+ 0.3
What I actually get:
c aaaaa bbb
*+ 0.1
c xxxx
c yyyy
+ 0.2
* c gggg
m eeeee hhhhh
+ 0.3
Here, I see that only the first occurrence of + (where the previous line begins with c) is replaced with *+. The second occurrence of + in the file is not replaced.
What am I doing wrong? How do I get the result I want: replacement happens in multiple consecutive lines in the file?
The problem you run into is that when a line that starts with c comes right after another line that starts with c, the N command in your code consumes it, and it isn't available for checking when you process the line that comes next.
Instead of reading ahead to see if the next line should be changed, I'd remember the last line and look back to see if the current line should be changed:
sed 'x; G; /^c/ s/+/*+/; s/.*\n//' file
This works as follows:
x # Swap pattern space and hold buffer. Because we do this here,
# the previous line will be in the hold buffer for every line
# (except the first, then it is empty)
G # append hold buffer to pattern space. Now the pattern space
# contains the previous line followed by the current line.
/^c/ s/+/*+/ # If the pattern space begins with a c (i.e., if the previous
# line began with a c), replace + with *+
s/.*\n// # Remove the first line (the previous one) from the pattern
# space
# Then drop off the end. The changed current line is printed.
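As a quick check, the look-back script can be run on the sample input from the question. This is only a sketch: the variable name result is made up here, and note that s/+/*+/ replaces the first + anywhere in the combined two lines, so a + inside the preceding c line itself would be changed instead.

```shell
# Sample input from the question, fed through the look-back script.
# "result" is a name used only in this example.
result=$(printf '%s\n' 'c aaaaa bbb' '+ 0.1' 'c xxxx' 'c yyyy' '+ 0.2' \
  '* c gggg' 'm eeeee hhhhh' '+ 0.3' |
  sed 'x; G; /^c/ s/+/*+/; s/.*\n//')
printf '%s\n' "$result"
```

Both + 0.1 and + 0.2 follow a line starting with c and should gain the *, while + 0.3 follows an m line and should stay unchanged.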
sed -e 'H;$!d' -e 'x' -e ':cycle' -e 's/\(\nc[[:alnum:][:blank:][:punct:]]*\n\)+/\1*+/g;t cycle' -e 's/.//' YourFile
This is a POSIX version that changes the whole file in at most two internal cycles:
load the file into memory (-e 'H;$!d' -e 'x')
add the * in front of each line starting with + that follows a line starting with c ( s/\(\nc[[:alnum:][:blank:][:punct:]]*\n\)+/\1*+/g)
repeat the substitution as long as it still matches ( :cycle and t cycle)
use a trick to ensure the buffer starts with a newline (H appends the current line to the hold buffer even for the first line, which leaves an extra leading newline), so that a first line starting with c is matched too, and remove that newline at the end (s/.//)

Deduplicate FASTA, keep a seq id

I need to format files for a miRNA-identifying tool (miREAP).
I have a fasta file in the following format:
>seqID_1
CCCGGCCGTCGAGGC
>seqID_2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_4
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_5
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_6
AGGGCACGCCTGCCTGGGCGTCACGC
I want to count the number of times each sequence occurs and append that number to the seqID line. The count for each sequence and an original ID referring to the sequence need only appear once in the file like this:
>seqID_1 1
CCCGGCCGTCGAGGC
>seqID_2 2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_3 3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
Fastx_collapser does the trick nearly as I'd like (http://hannonlab.cshl.edu/fastx_toolkit/index.html). However, rather than maintain seqIDs, it returns:
>1 1
CCCGGCCGTCGAGGC
>2 2
AGGGCACGCCTGCCTGGGCGTCACGC
>3 3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
This means that the link between my sequence, seqID, and genome mapping location is lost. (Each seqID corresponds to a sequence in my fasta file and a genome mapping spot in a separate Bowtie2-generated .sam file)
Is there a simple way to do the desired deduplication at the command line?
Thanks!
Linearize each record onto one line, then use sort and uniq -c; the command is followed by its output:
awk '/^>/ {if(N>0) printf("\n"); ++N; printf("%s ",$0);next;} {printf("%s",$0);} END { printf("\n");}' input.fa | \
sort -t ' ' -k2,2 | uniq -f 1 -c |\
awk '{printf("%s_%s\n%s\n",$2,$1,$3);}'
>seqID_2_2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_1_1
CCCGGCCGTCGAGGC
>seqID_3_3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
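If the result should keep the original order and the ">seqID count" header format shown in the question, one possible sketch is a single awk pass. It assumes every sequence sits on exactly one line (as in the sample) and that keeping the first ID seen for each sequence is acceptable; the function name dedup_fasta and the array names are made up for this example.

```shell
# dedup_fasta [FILE...]: collapse identical sequences, keep the first ID seen,
# and append the count to the header line. Hypothetical helper name.
dedup_fasta() {
  awk '
    /^>/ { id = $0; next }            # remember the most recent header line
    {
      if (!($0 in count)) {           # first time this sequence is seen
        order[++n] = $0               # keep first-seen order
        first_id[$0] = id             # keep the ID that came with it
      }
      count[$0]++
    }
    END {
      for (i = 1; i <= n; i++) {
        seq = order[i]
        print first_id[seq] " " count[seq]
        print seq
      }
    }
  ' "$@"
}
```

It would be called as dedup_fasta input.fa. Multi-line FASTA records would need to be linearized first, for example with the first awk command of the pipeline above.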

Add leading 0 in sed substitution

I have input data:
foo 24
foobar 5 bar
bar foo 125
and I'd like to have output:
foo 024
foobar 005 bar
bar foo 125
So I can use this sed substitutions:
s,\([a-z ]\+\)\([0-9]\)\([a-z ]*\),\100\2\3,
s,\([a-z ]\+\)\([0-9][0-9]\)\([a-z ]*\),\10\2\3,
But, can I make one substitution, that will do the same? Something like:
if (one digit) then two leading 0
elif (two digits) then one leading 0
Regards.
I doubt that the "if - else" logic can be incorporated in one substitution command without saving the intermediate data (length of the match for instance). It doesn't mean you can't do it easily, though. For instance:
$ N=5
$ sed -r ":r;s/\b[0-9]{1,$(($N-1))}\b/0&/g;tr" infile
foo 00024
foobar 00005 bar
bar foo 00125
It uses a loop, adding one zero per pass to every number that is shorter than $N digits, and ends when no more substitutions can be made. The :r label together with tr basically says: try the substitution, then go back to r if something was substituted. See the sed documentation on flow control for more.
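With N set to 3 instead of 5, the same loop should produce the output asked for in the question. A minimal sketch, assuming GNU sed (for -r and \b); the variable name padded is made up for this example:

```shell
# N is the target width; 3 matches the output asked for in the question.
N=3
padded=$(printf '%s\n' 'foo 24' 'foobar 5 bar' 'bar foo 125' |
  sed -r ":r;s/\b[0-9]{1,$((N-1))}\b/0&/g;tr")
printf '%s\n' "$padded"
```

The double quotes matter here: they let the shell expand $((N-1)) before sed sees the script, while \b and & pass through untouched.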
Use two substitute commands: the first searches for a single digit and inserts two zeroes just before it, and the second searches for a two-digit number and inserts one zero just before it. GNU sed is needed because the word-boundary escape (\b) is used to delimit the numbers.
sed -e 's/\b[0-9]\b/00&/g; s/\b[0-9]\{2\}\b/0&/g' infile
EDIT to add a test:
Content of infile:
foo 24 9
foo 645 bar 5 bar
bar foo 125
Running the previous command gives the following output:
foo 024 009
foo 645 bar 005 bar
bar foo 125
Add the max number of leading zeros first, then take this number of characters from the end:
echo 55 | sed -e 's:^:0000000:' -e 's:0\+\(.\{8\}\)$:\1:'
00000055
You seem to have the sed options covered, here's one way with awk:
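# Requires GNU awk: RT (the text that matched RS) is a gawk extension.
BEGIN { RS="[ \n]"; ORS=OFS="" }           # split records on spaces and newlines
/^[0-9]+$/ { $0 = sprintf("%03d", $0) }    # zero-pad records that are whole numbers
{ print $0, RT }                           # re-emit each record with its original separator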
I find the following sed approach for padding an integer with zeroes to 5 (n) digits quite straightforward:
sed -e "s/\<\([0-9]\{1,4\}\)\>/0000\1/; s/\<0*\([0-9]\{5\}\)\>/\1/"
If there is at least one and at most 4 (n-1) digits, add 4 (n-1) zeroes in front.
If, after the first transformation, there is any number of zeroes followed by 5 (n) digits, keep just these last 5 (n) digits.
When there happen to be more than 5 (n) digits, this approach behaves the usual way: nothing is padded or trimmed.
Input:
0
1
12
123
1234
12345
123456
1234567
Output:
00000
00001
00012
00123
01234
12345
123456
1234567
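The two substitutions can be wrapped in a small shell function so that the width n becomes a parameter. This is only a sketch: the function name pad_numbers is made up, it assumes a width of at least 2 and GNU sed (for \< and \>), and like the original command it pads only the first number on each line because there is no g flag.

```shell
# pad_numbers WIDTH: zero-pad the first number on each input line to WIDTH digits.
# Hypothetical helper name; assumes WIDTH >= 2 and GNU sed (for \< and \>).
pad_numbers() {
  width=$1
  # A run of WIDTH-1 zeros, e.g. "0000" for WIDTH=5.
  zeros=$(printf "%0$((width - 1))d" 0)
  sed -e "s/\<\([0-9]\{1,$((width - 1))\}\)\>/${zeros}\1/" \
      -e "s/\<0*\([0-9]\{${width}\}\)\>/\1/"
}

printf '%s\n' 0 1234 123456 | pad_numbers 5
```

With a width of 5 this should reproduce the corresponding rows of the table above; calling pad_numbers 3 on the input from the question gives the three-digit padding asked for there.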
This might work for you (GNU sed):
echo '1.23 12,345 1 12 123 1234 1' |
sed 's/\(^\|\s\)\([0-9]\(\s\|$\)\)/\100\2/g;s/\(^\|\s\)\([0-9][0-9]\(\s\|$\)\)/\10\2/g'
1.23 12,345 001 012 123 1234 001
or perhaps a little easier on the eye:
sed -r 's/(^|\s)([0-9](\s|$))/\100\2/g;s/(^|\s)([0-9][0-9](\s|$))/\10\2/g'