Using sed to replace strings containing special characters - sed

I'm trying to edit some fastq files.
Essentially I want to change:
#SRX1409044.10.1 10 length=80
to:
#SRX1409044.10/1 10 length=80
for every line that contains .1 in the file.
I've tried using sed:
sed 's#.1#/1#g'
It works for most lines, however for lines such as:
#SRX1409044.11.1 11 length=80
I get:
#SRX1409044./1/1 /1 length=80
I've had a search around and I think I may have to escape the special characters? Every post I came across only gave examples for swapping special characters on their own so I'm not too sure how to go about it.

This command changes the first occurrence of ".1 " (a dot, a 1 and a space) on each line to "/1 ". Notice the escaped dot: an unescaped . matches any character, which is why the original attempt also rewrote things like 11.
sed 's|\.1 |/1 |' infile
For an example input file such as
#SRX1409044.10.1 10 length=80
#SRX1409044.12.1 10 length=80
#SRX1409044.14.1 10 length=80
#SRX1409044.15.1 10 length=80
#SRX1409044.990.1 10 length=80
the result is
#SRX1409044.10/1 10 length=80
#SRX1409044.12/1 10 length=80
#SRX1409044.14/1 10 length=80
#SRX1409044.15/1 10 length=80
#SRX1409044.990/1 10 length=80
Now, if the .1 could also be at the end of a line, we have to change the command slightly because we require a space at the moment:
sed 's#\.1\( \|$\)#/1\1#' infile
This is ".1 followed by either a space or the end of the line, replace with /1 and whatever came after the .1". For example:
$ sed 's#\.1\( \|$\)#/1\1#' <<< 'SRX1409044.116884523.1'
SRX1409044.116884523/1
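As a minimal sketch, the same substitution can be wrapped in a small shell function. The name fix_read_id is made up for illustration, and -E (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed:

```shell
# Hypothetical helper: turn the ".1" that ends a read ID into "/1".
# Reads fastq header lines on stdin, writes the result to stdout.
fix_read_id() {
    # \.      a literal dot (unescaped, "." would match any character)
    # ( |$)   the ".1" must be followed by a space or the end of the line
    # \1      puts that space (or nothing) back after "/1"
    sed -E 's#\.1( |$)#/1\1#'
}

printf '%s\n' '#SRX1409044.11.1 11 length=80' | fix_read_id
```

For the problem line from the question this prints #SRX1409044.11/1 11 length=80, because the first ".1" is followed by another 1 rather than by a space.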

The decimal point . has to be escaped as \., otherwise it matches any character.
I think your problem is that you need to distinguish single-digit from double-digit numbers.
If there are never more than 2 digits, the simplest approach is to repeat the bracket expression twice, like:
[0-9][0-9]
This matches any 2-digit sequence.
I suggest this basic form because I don't know which version of sed you are using or which extensions it supports.
I'm also not sure exactly what you want to accept and what you want to reject.
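One way to avoid depending on the number of digits is to match the whole digit run explicitly. The following is only a sketch: it assumes the read number is a run of digits sitting between two dots, and the function name fix_read_id_digits is hypothetical.

```shell
# Hypothetical helper: rewrite ".<digits>.1" to ".<digits>/1".
fix_read_id_digits() {
    # (\.[0-9]+)  a dot and the read number, kept as group 1
    # \.1         the literal ".1" suffix to rewrite
    # ( |$)       must be followed by a space or the end of the line (group 2)
    sed -E 's#(\.[0-9]+)\.1( |$)#\1/1\2#'
}

printf '%s\n' '#SRX1409044.1.1 1 length=80' '#SRX1409044.11.1 11 length=80' | fix_read_id_digits
```

Because [0-9]+ accepts one or more digits, the same expression covers read 1, read 11 and longer numbers without repeating [0-9] a fixed number of times.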

Related

sed Remove lines between two patterns (excluding end pattern)

given text like
_adsp TXT "dkim=all"
VVKMU6SE3C2MF88BG4DJQAECMR9SIIF0 NSEC3 1 1 10 C4F407437E8EA4C5 (
175MCHR31K25LP89OVJI5LCE0JA2N2AP
A MX TXT AAAA RRSIG SPF )
RRSIG NSEC3 7 3 1800 (
20200429171433 20200330161758 11672 example.com
H3l26qmtkuiFZCeSYCCAo5krFE3gjM0I8UeQ9jhj3STy
X6fM0YizCHEuv4VZynOJGJc1XJnHRHI+p7yLlZ+OVseK
UfIkPVP+VOmlerwozEpM+Tnt8evwnMTDbcn0zxf/6YJx
kZeO2AszWkRZ0bctqW7INYo8YuyyuTSxSr8se27fiaPA
4GXQymepGgv/JGqargzHbyhhkDhENmNo7Qwkjl+a0kI4
6qqKcEWCsDvnlYUQiDFzc5oRs2j7TT9uybTfwUDQxV+t
MQFMhzu7LNbRIUuOb16sAEGSdl9mWQ4sZRJ9wuXJWbso
G+3tY0pBbq4ffScz/JKcrJ0qAuBF1F5JcQ== )
$TTL 1800
I want to get rid of the part that starts at a line not beginning with whitespace and containing " NSEC3 ", up to (but not including) the next line that does not begin with a whitespace character,
resulting in
_adsp TXT "dkim=all"
$TTL 1800
in the example.
I tried sed '/^[^\s].*\sNSEC3\s/,/^[^\s]/d;' filename, but that doesn't work as expected. The example results in
_adsp TXT "dkim=all"
H3l26qmtkuiFZCeSYCCAo5krFE3gjM0I8UeQ9jhj3STy
X6fM0YizCHEuv4VZynOJGJc1XJnHRHI+p7yLlZ+OVseK
UfIkPVP+VOmlerwozEpM+Tnt8evwnMTDbcn0zxf/6YJx
kZeO2AszWkRZ0bctqW7INYo8YuyyuTSxSr8se27fiaPA
4GXQymepGgv/JGqargzHbyhhkDhENmNo7Qwkjl+a0kI4
6qqKcEWCsDvnlYUQiDFzc5oRs2j7TT9uybTfwUDQxV+t
MQFMhzu7LNbRIUuOb16sAEGSdl9mWQ4sZRJ9wuXJWbso
G+3tY0pBbq4ffScz/JKcrJ0qAuBF1F5JcQ== )
$TTL 1800
So printing resumes far too early.
What am I missing?
Thank you.
P.S.:
As you may have guessed, I want to strip the DNSSEC parts out of a named zone. I haven't found any other way to remove the RRSIG and NSEC3 entries yet. If someone has an idea, I would appreciate that too.
[\s] matches a literal \ or s character. It doesn't match whitespace, because \s is not special inside a bracket expression.
Even if [\s] worked as you expect, the range ending in /^[^\s]/ followed by d would also delete the line that ends the range, i.e. the first line without leading whitespace. I think you have to loop manually.
On the example you've given, the following seems to work:
sed -n '/^[^ \t].*\sNSEC3\s/{ :a; n; /^[^ \t]/bb; ba}; :b; p'
This might work for you (GNU sed):
sed -n '/^\S.*NSEC/{:a;n;/^\S/!ba};p' file
Turn off implicit printing by using the -n option.
When a line starts with a non-space character and contains the string NSEC, throw it away together with every following line that starts with whitespace.
Print all other lines.
Alternative:
sed '/^\S.*NSEC/,/^\S/{/^\S.*NSEC\|^\s/d}' file
Yet another alternative:
sed '/^\S.*NSEC/{:a;N;/\n\S/!ba;s/.*\n//}' file
And another:
sed '/^\S.*NSEC/{:a;N;/\n\S/!s/\n//;ta;D}' file
N.B. The first two solutions delete lines whether or not a later line marks the end of the block, whereas the last two solutions delete lines only if such an end line exists.
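The looping approach can also be spelled out as a multi-line script that uses POSIX character classes instead of GNU-only escapes such as \S. This is a sketch rather than a tested zone-file tool: the function name strip_nsec3 is hypothetical, and it assumes continuation lines really do begin with whitespace.

```shell
# Hypothetical helper: drop NSEC3 records and their indented continuation lines.
strip_nsec3() {
    sed -n '
        :top
        # Owner line starting in column 1 that contains the NSEC3 type
        /^[^[:space:]].*[[:space:]]NSEC3[[:space:]]/{
            :skip
            # Fetch the next line; sed quits here if there is none
            n
            # Indented line: still inside the record, keep skipping
            /^[^[:space:]]/!b skip
            # New record: test it again in case it is another NSEC3 record
            b top
        }
        p
    '
}
```

It reads the zone on standard input, for example strip_nsec3 < zonefile, where zonefile is a placeholder name. The extra b top means that two NSEC3 records in a row are both removed, which the one-line versions do not guarantee.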

Join certain lines with sed

I have an input which looks like this, with blank lines separating the groups:
1
2

3
4
5

6
And I want to transform it with sed to :
12
345
6
I know it can be easily done with other tools but I want to do it specifically with sed as a learning exercise.
I have attempted this:
sed ':x ; /^ *$/{ N; s/\n// ; bx; }'
But it prints :
123456
Can someone help me fix this?
Quoting from the GNU sed manual:
A common technique to process blocks of text such as paragraphs (instead of line-by-line) is using the following construct:
sed '/./{H;$!d} ; x ; s/REGEXP/REPLACEMENT/'
The first expression, /./{H;$!d} operates on all non-empty lines, and adds the current line (in the pattern space) to the hold space. On all lines except the last, the pattern space is deleted and the cycle is restarted.
The other expressions x and s are executed only on empty lines (i.e. paragraph separators). The x command fetches the accumulated lines from the hold space back to the pattern space. The s/// command then operates on all the text in the paragraph (including the embedded newlines).
And indeed,
sed '/./{H;$!d} ; x ; s/\n//g'
does what you want.
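Here is a small runnable sketch of that paragraph technique, assuming the groups in the input are separated by blank lines. The function name join_paragraphs is made up for illustration.

```shell
# Hypothetical helper: join the lines of each blank-line-separated group.
join_paragraphs() {
    # /./{H;$!d;}  non-empty line: append it to the hold space, and unless it
    #              is the last line, delete it and start the next cycle
    # x            blank line or last line: swap the collected lines back in
    # s/\n//g      remove the newlines that H inserted between the lines
    sed -e '/./{H;$!d;}' -e 'x' -e 's/\n//g'
}

printf '1\n2\n\n3\n4\n5\n\n6\n' | join_paragraphs
```

This prints 12, 345 and 6 on separate lines. The x command also leaves the blank separator in the hold space, so each new group starts from an empty buffer.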
FWIW here's how to really do that task in UNIX:
$ awk -v RS= -v OFS= '{$1=$1}1' file
12
345
6
The above will work on any UNIX box.
A GNU awk approach:
$ awk -F"\n" '{gsub("\n","");}1' RS='\n{2,}' file
12
345
6
Note that it adds a trailing newline (\n) after the last line.

How to ignore 1st occurrence of alphanumeric and replace everything after 2nd occurrence?

This is a follow-up to an existing question, "How to replace fixlength alphanumeric character?".
There are two use cases:
a. Remove everything from where the 2nd alphanumeric token in a line starts, if the line contains two alphanumeric tokens of length 7.
b. Remove everything from where the 1st alphanumeric token in a line starts, if the line contains only one alphanumeric token of length 7.
testing-1xs-a-2x-782b1x9.abc.txt
testing-12a-b-2y-486eee2.bcd.txt
testing-1a-c-2z-b62cx7d.cde.txt
testing-1aasdfa-c-2z-b62cx7d.cde.txt
I tried this command - sed 's/[a-zA-Z0-9]{7}.*//2g' file
Expected output :
testing-1xs-a-2x
testing-12a-b-2y
testing-1a-c-2z
testing-1aasdfa-c-2z
This might work for you (GNU sed):
sed -r 's/-?\w{7}([-.]\w{1,6}\b)*//2g' file
This removes an optional - followed by a 7-character word, followed by zero or more words of 1 to 6 characters, each preceded by a . or a -.
N.B. For the last line of the test data the result is just testing, because in GNU sed the 2g flag means "replace the 2nd match and every match after it".
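A different approach, shown only as a sketch, is to anchor on the dot instead of counting matches. It assumes the token to drop is always the 7-character alphanumeric one directly before the first dot; that holds for the four sample lines but is not stated in the question. The function name strip_hash_suffix is hypothetical.

```shell
# Hypothetical helper: cut a line at the "-XXXXXXX." token, where X is alphanumeric.
strip_hash_suffix() {
    # -[[:alnum:]]{7}  a hyphen and seven alphanumeric characters
    # \.               which must be followed directly by a literal dot
    # .*               then everything up to the end of the line
    sed -E 's/-[[:alnum:]]{7}\..*//'
}

printf '%s\n' 'testing-1aasdfa-c-2z-b62cx7d.cde.txt' | strip_hash_suffix
```

For that input it prints testing-1aasdfa-c-2z, since 1aasdfa is followed by a hyphen rather than a dot and is therefore left alone.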

SED Command to remove first digits and spaces of each line

I have a simple text file in below format.
1 12658003Y
2 34345345N
3 34653785Y
4 36452342N
5 86747488Y
6 34634543Y
so on
10 37456338Y
11 33535555Y
12 37456378Y
so on
100 23432434Y
As you can see, there are two spaces after the first number.
I'm trying to write a sed command to remove the digits before the spaces, together with the spaces themselves. Is there a sed command for that?
Output file should look like below.
12658003Y
34345345N
34653785Y
36452342N
so on..
Please assist. I'm very new to shell scripting.
sed 's/[0-9]\+\s\+//' infile > outfile
Explanation:
s: we want to use substitution
/: marks the start of the expression we want to match
[0-9]: match any digit
\+: match the previous item one or more times (a GNU extension)
\s: match a whitespace character (a GNU extension)
\+: match the previous item one or more times
/: marks the end of the match expression and the start of the replacement (which is empty)
/: marks the end of the replacement; flags would go after this (we use none)
infile: the file we want to read
>: redirect stdout to
outfile: where we want to store the output
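Your sed command could be:
sed 's/.* //g' file
This removes everything up to and including the last space on the line, i.e. the leading number along with the spaces that follow it. It relies on the rest of the line containing no spaces.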
Remove leading digits, then following spaces:
sed 's/^[0-9]* *//' file
sed 's/^[0-9]*[ ]*//g' input.txt
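The answers above differ mainly in how strictly they anchor the match. As a sketch that uses only POSIX basic regular expressions, the following removes a leading number only when it is actually followed by whitespace. The function name strip_line_numbers is made up for illustration.

```shell
# Hypothetical helper: drop a leading counter such as "12  " from each line.
strip_line_numbers() {
    # ^                         only look at the start of the line
    # [0-9][0-9]*               one or more digits (portable spelling of [0-9]\+)
    # [[:space:]][[:space:]]*   one or more spaces or tabs
    sed 's/^[0-9][0-9]*[[:space:]][[:space:]]*//'
}

printf '%s\n' '1  12658003Y' '100  23432434Y' | strip_line_numbers
```

This prints 12658003Y and 23432434Y. A line that consists of a single field, such as 12658003Y on its own, is left unchanged because no whitespace follows the digits.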

How to use 'sed or gawk' to delete a text block until the third line previous the last one

Good day,
I was wondering how to delete a text block like this:
1
2
3
4
5
6
7
8
and delete everything after the second line up to, but not including, the third line from the end, to obtain:
1
2
6
7
8
Thanks in advance!!!
BTW, this text block is just an example; the real text blocks I'm working on are huge, and each one has a different number of lines.
Getting the number of lines with wc and using awk to print the requested range:
$ awk 'NR<M || NR>N-M' M=3 N="$(wc -l < file)" file
1
2
6
7
8
This allows you to easily change the range by just changing the value of M: the first M-1 lines and the last M lines are kept.
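If the two counts should be independent, the same two-pass idea can be sketched with separate parameters. The function name keep_head_tail and its argument order (file, lines to keep at the top, lines to keep at the bottom) are assumptions made for this example.

```shell
# Hypothetical helper: keep_head_tail FILE HEAD TAIL
# Prints the first HEAD and the last TAIL lines of FILE.
keep_head_tail() {
    file=$1
    # Reading from stdin makes wc print only the number, without the file name.
    total=$(wc -l < "$file")
    # h = lines kept at the top, t = lines kept at the bottom, n = total lines
    awk -v h="$2" -v t="$3" -v n="$total" 'NR <= h || NR > n - t' "$file"
}
```

Calling keep_head_tail file 2 3 on the 8-line example keeps lines 1, 2, 6, 7 and 8. The file is read twice, once by wc and once by awk, so it has to be a regular file rather than a pipe.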
This might work for you (GNU sed):
sed '3,${:a;$!{N;s/\n/&/3;Ta;D}}' file
or if you prefer:
sed '1,2b;:a;$!{N;s/\n/&/3;Ta;D}' file
These always print the first two lines, then build a running window of three lines.
Unless the end of file is reached the first line is popped off the window and deleted. At the end of file the remaining 3 lines are printed.
Since you mentioned that the files are huge and that the number of lines differs, I would suggest this awk one-liner:
awk 'NR<3{print;next}{delete a[NR-3];a[NR]=$0}END{for(x=NR-2;x<=NR;x++)print a[x]}' file
It processes the input file only once, without (pre)calculating the total number of lines.
It keeps minimal data in memory: at any point during processing, only 3 lines are stored.
If you want to change the filtering criteria, for example to remove from line x to $-y, you simply change the offsets in the one-liner.
A test:
kent$ seq 8|awk 'NR<3{print;next}{delete a[NR-3];a[NR]=$0}END{for(x=NR-2;x<=NR;x++)print a[x]}'
1
2
6
7
8
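To make those offsets explicit, here is a parameterized sketch of the same single-pass idea. The function name trim_middle and its two arguments (lines kept at the top, lines kept at the bottom, the latter assumed to be at least 1) are hypothetical. It uses a ring buffer indexed by NR % t instead of delete.

```shell
# Hypothetical helper: trim_middle HEAD TAIL
# Reads stdin once and prints the first HEAD and the last TAIL lines.
trim_middle() {
    awk -v h="$1" -v t="$2" '
        NR <= h { print; next }          # leading lines pass straight through
        { buf[NR % t] = $0 }             # ring buffer of the last t lines
        END {
            start = NR - t + 1
            if (start <= h) start = h + 1   # short input: avoid reprinting
            for (i = start; i <= NR; i++) print buf[i % t]
        }
    '
}

printf '%s\n' 1 2 3 4 5 6 7 8 | trim_middle 2 3
```

This prints 1, 2, 6, 7 and 8. Because it never needs the total line count in advance, it also works on a pipe, and the check on start keeps inputs shorter than HEAD plus TAIL lines from being printed twice.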
Using sed:
sed -n '
## Append second line, print first two lines and delete them.
N;
p;
s/^.*$//;
## Read next three lines removing leading newline character inserted
## by the "N" command.
N;
s/^\n//;
N;
:a;
N;
## I will keep three lines in buffer until last line when I will print
## them and exit.
$ { p; q };
## Not the last line yet, so drop the oldest line from the buffer (FIFO order).
s/^[^\n]*\n//;
## Goto label "a".
ba
' infile
It yields:
1
2
6
7
8