sh script to get rid of block of text on multiple lines - sed

This may have been asked before, but I can't find it. I want to remove a block of text every time it appears in a file, replacing it with nothing. The use case is decoding a certificate and removing the bag attributes: the blocks can vary slightly between their first and last lines, but they always start the same way and end just before a line I want to keep.
For example, the file will contain text like the following, possibly several times when chained certificates are used:
Bag Attributes
localKeyID: xx 00 yy 00
friendlyName: something.domain.com
subject=/serialNumber=Coporate Entity/jurisdictionCountryName=US
/jurisdictionStateOrProvinceName=Washington/businessCategory=Corporate Entity/C=US /ST=Washington/L=Seattle/O=Bobs Place/OU=Bobs/CN=something.domain.com
issuer=/C=GB/O=GoDaddy/CN=GoDaddy Certificate
-----BEGIN CERTIFICATE-----
I want to look for the pattern that starts with "Bag Attributes" and ends with the line containing "-----BEGIN CERTIFICATE-----", and delete all the lines BEFORE "-----BEGIN CERTIFICATE-----" while keeping that line itself. When certificates are chained, the previous entry ends with "-----END CERTIFICATE-----", so I don't want to go back to the beginning of the file or back a fixed number of lines. I really want to start at the "Bag Attributes" line and stop at the line before the next "-----BEGIN CERTIFICATE-----" string.
I tried the following, but only succeeded in deleting all the contents, and not just the lines I wanted:
sed -i -ne '/Bag Attributes/ {p; r $CertName.pem' -e ':a; n; /-----BEGIN CERTIFICATE-----/ {p; b}; ba}; p' "-----BEGIN CERTIFICATE-----"
This was supposed to take the file "$CertName.pem" and look for everything starting with "Bag Attributes" through the "-----BEGIN CERTIFICATE-----" line and then just replace it with "-----BEGIN CERTIFICATE-----", but that obviously did not work as I was hoping.
Any suggestions?

With awk it is easy:
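awk '
/-----BEGIN CERTIFICATE-----/ {pr=0}   # certificate header: resume printing
/Bag Attributes/ {pr=1}                # start of a bag block: stop printing
pr==0 {print}                          # print only while not suppressed
' file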
With sed you can do:
sed 's/-----BEGIN CERTIFICATE----/\r/' file |
sed -rz 's/Bag Attributes[^\r]*\r/-----BEGIN CERTIFICATE----/g'
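A single-pass sed alternative, offered as a rough sketch rather than a drop-in: it assumes every unwanted block starts with a line beginning "Bag Attributes" and ends at the next "-----BEGIN CERTIFICATE-----" line. The function name strip_bag_attributes and the sample file contents are made up for illustration.

```shell
# Hypothetical helper: delete each "Bag Attributes" block but keep the
# "-----BEGIN CERTIFICATE-----" line that terminates it.
strip_bag_attributes() {
    sed '/^Bag Attributes/,/-----BEGIN CERTIFICATE-----/{
        /-----BEGIN CERTIFICATE-----/!d
    }' "$1"
}

# Made-up sample with two chained entries (certificate bodies shortened).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Bag Attributes
    friendlyName: something.domain.com
subject=/CN=something.domain.com
issuer=/C=GB/O=GoDaddy/CN=GoDaddy Certificate
-----BEGIN CERTIFICATE-----
MIIB
-----END CERTIFICATE-----
Bag Attributes
subject=/C=GB/O=GoDaddy/CN=GoDaddy Certificate
-----BEGIN CERTIFICATE-----
MIIC
-----END CERTIFICATE-----
EOF

stripped=$(strip_bag_attributes "$sample")
printf '%s\n' "$stripped"
rm -f "$sample"
```

The address range selects each block including its closing line, and the inner `!d` deletes every line in the range except that closing line, so lines outside any block are untouched.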

I got this to work via the following, but now have one more issue.
Working line to get rid of the bag attributes throughout:
sed -ne '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p' $CertName.pem > $CertName.tmp1
Now I also have to remove one of the entries in the certificate chain. How do I do the reverse of the line above: look for a block that starts with two specific lines and ends with two specific lines, then cut out that whole block?
e.g. look for pattern
-----Begin Certificate-----
abxdrfg12765
through
oiyrntlklkdn
-----End Certificate-----
as in the example below, where the second block (the one starting with abxdrfg12765) should be found and deleted:
-----Begin Certificate-----
acbdef123456
mlkhdsftonnl
ljhfnodvlndv
qpiuekjnxvsn
lmnopqr43210
-----End Certificate-----
-----Begin Certificate-----
abxdrfg12765
mlkhdsftonnl
ljhfnodvlndv
qpiuekjnxvsn
oiyrntlklkdn
-----End Certificate-----
-----Begin Certificate-----
xyzabc543211
mlkhdsftonnl
ljhfnodvlndv
qpiuekjnxvsn
lmnopqr43210
-----End Certificate-----
The blocks above and below the one to delete have different contents. The first block is always unpredictable, and the second and third blocks may be swapped, so I can't count n lines up or down and remove those. I have to match the multi-line pattern and remove that block of text. Since sed isn't great at multi-line text, how would I do this type of parsing and deletion?
Any suggestions for problem 2?
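One possible approach to problem 2, sketched with awk instead of sed: buffer each certificate block and print it only if it is not the one to drop. This assumes the block can be identified by the line right after the BEGIN marker and the line right before the END marker, as in the example above. The function name drop_cert and its arguments are hypothetical, and the markers are compared case-insensitively so both the mixed-case example and real PEM markers match.

```shell
# Hypothetical helper. Usage: drop_cert FILE FIRST_BODY_LINE LAST_BODY_LINE
drop_cert() {
    awk -v first="$2" -v last="$3" '
        toupper($0) == "-----BEGIN CERTIFICATE-----" { inblock = 1; n = 0 }
        inblock {
            buf[++n] = $0                       # collect the current block
            if (toupper($0) == "-----END CERTIFICATE-----") {
                # print the block unless its first and last body lines match
                if (!(buf[2] == first && buf[n-1] == last))
                    for (i = 1; i <= n; i++) print buf[i]
                inblock = 0
            }
            next
        }
        { print }                               # lines outside any block
        END { if (inblock) for (i = 1; i <= n; i++) print buf[i] }
    ' "$1"
}

chain=$(mktemp)
cat > "$chain" <<'EOF'
-----Begin Certificate-----
acbdef123456
lmnopqr43210
-----End Certificate-----
-----Begin Certificate-----
abxdrfg12765
mlkhdsftonnl
oiyrntlklkdn
-----End Certificate-----
-----Begin Certificate-----
xyzabc543211
lmnopqr43210
-----End Certificate-----
EOF

kept=$(drop_cert "$chain" abxdrfg12765 oiyrntlklkdn)
printf '%s\n' "$kept"
rm -f "$chain"
```

Because whole blocks are compared, it does not matter whether the unwanted entry is second or third in the chain.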

Related

Use sed for Mixed Case Tags

Trying to reformat tags in an XML file with GNU sed v4.7 on Win10 (shoot me). sed is in the path and run from the Command Prompt, so some Windows command-line characters need to be escaped with ^.
sourcefile
BEGIN
...
<trn:description>V7906 03/11 ALFREDOCAMEL HATSWOOD 74564500125</trn:description>
...
END
(There are three spaces at the start of the line.)
Expected output:
BEGIN
...
<trn:description>V7906 03/11 Alfredocamel Hatswood 74564500125</trn:description>
...
END
I want Title Case, but this command converts the text to lower case in place:
sed -i 's/^<trn:description^>\(.*\)^<\/trn:description^>$/^<trn:description^>\L\1^<\/trn:description^>/g' sourcefile
This command changes to Title Case:
sed 's/.*/\L^&/; s/\w*/\u^&/g' sourcefile
Can this be brought together as a one-liner to edit the original sourcefile in-place?
I want to use sed because it is available on the system and the code is consistently structured. I'm aware I should use a tool like xmlstarlet as explained:
sed ... code can't distinguish a comment that talks about sessionId tags from a real sessionId tag; can't recognize element encodings; can't deal with unexpected attributes being present on your tag; etc.
Thanks to Whirlpool Forum members for the answer and discussion.
It was too hard to achieve pattern matching "within the tags" in sed, and the file was well formed, so the required lines were changed as a whole:
sed -i.bak '/^<trn:description^>/s/\w\+/\L\u^&/g; s/^&.*;\^|Trn:Description/\L^&/g' filename
Explanation
in-place edit saving original file with .bak extension
select lines containing <trn:description>
for one or more words
replace first character with uppercase and rest with lowercase
select strings starting with & and ending with ; or Trn:Description
restore codes by replacing characters with lowercase
source/target filename
Note: ^ is the Windows Command Prompt escape character and is not required in other shells
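For comparison, here is roughly how the same command might look in a POSIX shell, where the ^ escapes are dropped. This is a sketch that assumes GNU sed (for \w, \L, \u and \|); the variable names are made up, and the sample line is the one from the question.

```shell
line='   <trn:description>V7906 03/11 ALFREDOCAMEL HATSWOOD 74564500125</trn:description>'

# 1st s: Title Case every word on lines containing the tag.
# 2nd s: lower-case entities (&...;) and the tag name again.
titled=$(printf '%s\n' "$line" |
    sed '/<trn:description>/s/\w\+/\L\u&/g; s/&.*;\|Trn:Description/\L&/g')

printf '%s\n' "$titled"
```

To edit a file in place with a backup, the same script can be passed to sed -i.bak followed by the file name, as in the command above.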

sed Remove lines between two patterns (excluding end pattern)

given text like
_adsp TXT "dkim=all"
VVKMU6SE3C2MF88BG4DJQAECMR9SIIF0 NSEC3 1 1 10 C4F407437E8EA4C5 (
175MCHR31K25LP89OVJI5LCE0JA2N2AP
A MX TXT AAAA RRSIG SPF )
RRSIG NSEC3 7 3 1800 (
20200429171433 20200330161758 11672 example.com
H3l26qmtkuiFZCeSYCCAo5krFE3gjM0I8UeQ9jhj3STy
X6fM0YizCHEuv4VZynOJGJc1XJnHRHI+p7yLlZ+OVseK
UfIkPVP+VOmlerwozEpM+Tnt8evwnMTDbcn0zxf/6YJx
kZeO2AszWkRZ0bctqW7INYo8YuyyuTSxSr8se27fiaPA
4GXQymepGgv/JGqargzHbyhhkDhENmNo7Qwkjl+a0kI4
6qqKcEWCsDvnlYUQiDFzc5oRs2j7TT9uybTfwUDQxV+t
MQFMhzu7LNbRIUuOb16sAEGSdl9mWQ4sZRJ9wuXJWbso
G+3tY0pBbq4ffScz/JKcrJ0qAuBF1F5JcQ== )
$TTL 1800
I want to get rid of the part that starts with a line matching "(not beginning with whitespace) NSEC3 " and runs up to, but not including, the next line that does not begin with a whitespace character,
resulting
_adsp TXT "dkim=all"
$TTL 1800
in the example.
I tried sed '/^[^\s].*\sNSEC3\s/,/^[^\s]/d;' filename but that doesn't work as expected, example results in
_adsp TXT "dkim=all"
H3l26qmtkuiFZCeSYCCAo5krFE3gjM0I8UeQ9jhj3STy
X6fM0YizCHEuv4VZynOJGJc1XJnHRHI+p7yLlZ+OVseK
UfIkPVP+VOmlerwozEpM+Tnt8evwnMTDbcn0zxf/6YJx
kZeO2AszWkRZ0bctqW7INYo8YuyyuTSxSr8se27fiaPA
4GXQymepGgv/JGqargzHbyhhkDhENmNo7Qwkjl+a0kI4
6qqKcEWCsDvnlYUQiDFzc5oRs2j7TT9uybTfwUDQxV+t
MQFMhzu7LNbRIUuOb16sAEGSdl9mWQ4sZRJ9wuXJWbso
G+3tY0pBbq4ffScz/JKcrJ0qAuBF1F5JcQ== )
$TTL 1800
So printing resumes far too early.
What am I missing?
Thank you.
P.S.:
As you may have guessed, I want to remove the DNSSEC parts from a named zone. I haven't found any other way to remove RRSIG and NSEC3 entries yet. If someone has an idea, I would appreciate that too.
[\s] matches a literal \ or s character. It doesn't match whitespace.
Even if [\s] worked as you expect, the range end /^[^\s]/ would also delete the closing line (the first line that does not start with whitespace), because an address range includes its end line. I think you have to loop manually.
On the example you've given, the following seems to work:
sed -n '/^[^ \t].*\sNSEC3\s/{ :a; n; /^[^ \t]/bb; ba}; :b; p'
This might work for you (GNU sed):
sed -n '/^\S.*NSEC/{:a;n;/^\S/!ba};p' file
Turn off implicit printing by using the -n option.
When a line starts with a non-space character and contains the string NSEC, throw it away together with all following lines that start with whitespace.
Print all other lines.
Alternative:
sed '/^\S.*NSEC/,/^\S/{/^\S.*NSEC\|^\s/d}' file
Yet another alternative:
sed '/^\S.*NSEC/{:a;N;/\n\S/!ba;s/.*\n//}' file
And another:
sed '/^\S.*NSEC/{:a;N;/\n\S/!s/\n//;ta;D}' file
N.B. The first two solutions delete lines even when no line marks the end of the deletion (for example, when the block runs to the end of the file), whereas the last two solutions delete lines only if such a closing line exists.
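The same idea can also be sketched in awk, which avoids the explicit loop. This is an alternative, not one of the posted answers. It assumes continuation lines (including the following RRSIG record) are indented with spaces or tabs, as the question describes. The sample zone data is shortened and the file name is a temporary placeholder.

```shell
zone=$(mktemp)
cat > "$zone" <<'EOF'
_adsp TXT "dkim=all"
VVKMU6SE3C2MF88BG4DJQAECMR9SIIF0 NSEC3 1 1 10 C4F407437E8EA4C5 (
        175MCHR31K25LP89OVJI5LCE0JA2N2AP
        A MX TXT AAAA RRSIG SPF )
        RRSIG NSEC3 7 3 1800 (
        20200429171433 20200330161758 11672 example.com
        G+3tY0pBbq4ffScz/JKcrJ0qAuBF1F5JcQ== )
$TTL 1800
EOF

# Every line that starts with a non-blank decides whether it and its
# indented continuation lines are skipped.
cleaned=$(awk '/^[^ \t]/ { skip = ($0 ~ /[ \t]NSEC3[ \t]/) } !skip' "$zone")

printf '%s\n' "$cleaned"
rm -f "$zone"
```

Since the skip flag is recomputed on every unindented line, two NSEC3 records in a row are both removed.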

How to sed replace UTF-8 characters with HTML entities in another file?

I'm running cygwin under windows 10
Have a dictionary file (1-dictionary.txt) that looks like this:
labelling labeling
flavour flavor
colour color
organisations organizations
végétales v&#x00E9;g&#x00E9;tales
contrôlée contr&#x00F4;l&#x00E9;e
" &#x0022
The separators between are TABs (\ts).
The dictionary file is encoded as UTF-8.
Want to replace words and symbols in the first column with words and HTML entities in the second column.
My source file (2-source.txt) has the target UTF-8 and ASCII symbols. The source file also is encoded as UTF-8.
Sample text looks like this:
Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system
I run the following sed one-liner in a shell script (./3-script.sh):
sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt
The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.
However the substitution of ASCII symbols, such as the quote symbol, and UTF-8 words produces this result:
vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)
If I use only the specific symbol (not the full word) I get results like this:
vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e
The ASCII quote symbol has #x0022 appended to it; it is not replaced.
Similarly, the UTF-8 symbol is appended with its HTML entity - not replaced with the HTML entity.
The expected output would look like this:
v#x00E9;g#x00E9;tales
#x0022cultivated#x0022
contr#x00F4;l#x00E9;e
How to modify the sed script so that target ASCII and UTF-8 symbols are replaced with their HTML entity equivalent as defined in the dictionary file?
I tried it: replacing every & with \& in your 1-dictionary.txt solves the problem. In the replacement, an unescaped & stands for the whole matched text, which is why the match shows up in your output.
sed's substitute command treats the "from" part as a regex, so watch for regex metacharacters there and put a \ in front of them.
The "to" part has special characters too, mainly \ and &; put an extra \ in front of those as well.
This is described in the GNU sed documentation; for other sed versions, check man sed.
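If editing the dictionary by hand is not desirable, the escaping could also be done while generating the sed script. The following is a minimal sketch under a few assumptions: GNU sed, a tab-separated dictionary, and a tiny made-up dictionary and source line. The file name rules.sed is hypothetical. Only & is escaped here; regex metacharacters or slashes in the first column would need similar treatment.

```shell
workdir=$(mktemp -d)

# Tab-separated dictionary: column 1 is the text to find, column 2 its replacement.
printf '%s\t%s\n' 'colour' 'color' 'é' '&#x00E9;' '"' '&#x0022;' \
    > "$workdir/1-dictionary.txt"
printf '%s\n' 'colour of "végétales"' > "$workdir/2-source.txt"

# Step 1: escape every & so it stays literal in the replacement.
# Step 2: turn each "from<TAB>to" pair into an s/from/to/g command.
sed -E -e 's/&/\\&/g' -e 's_(.+)\t(.+)_s/\1/\2/g_' \
    "$workdir/1-dictionary.txt" > "$workdir/rules.sed"

translation=$(sed -f "$workdir/rules.sed" "$workdir/2-source.txt")
printf '%s\n' "$translation"
rm -rf "$workdir"
```

Writing the generated rules to a file instead of using process substitution also makes it easy to inspect what sed is actually being asked to do.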

sed to replace a string which consist of a forwardslash

I'm trying to replace the specific lines below in a file
/ACCOUNT/passwd=
/BMC/CONFIRMATION/PASSWORD=
I need help in preparing the sed command
The required output would look something like this
/ACCOUNT/passwd=-2$-$A88CA7BD3DADDDFFC
/TMC/CONFIRMATION/PASSWORD=-2$-$A88CA7BD3DADDDFFC
Any help is appreciated.
There is nothing special about the forward slash, except if you choose to use it as the delimiter in your sed command, so don’t:
sed 's,ACCOUNT/passwd=,ACCOUNT/passwd=-2$-$A88CA7BD3DADDDFFC,g'
And similar for other target strings.
Here I’ve used a comma as the delimiter. You can choose another character as you prefer.
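As a rough illustration of both target lines handled in one call, the sketch below uses | as the delimiter and & in the replacement to reuse the matched text. It assumes the lines appear exactly as in the question, with nothing after the = sign, and it keeps the /BMC/ prefix from the input; the variable name is made up.

```shell
updated=$(printf '%s\n' '/ACCOUNT/passwd=' '/BMC/CONFIRMATION/PASSWORD=' 'other=1' |
    sed -e 's|^/ACCOUNT/passwd=$|&-2$-$A88CA7BD3DADDDFFC|' \
        -e 's|^/BMC/CONFIRMATION/PASSWORD=$|&-2$-$A88CA7BD3DADDDFFC|')

printf '%s\n' "$updated"
```

The single quotes keep the shell from expanding the $ characters, and a $ inside a sed replacement is an ordinary character.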

Using sed to eliminate all lines that do not match the desired form

I have a single column csv that looks something like this:
KFIG
KUNV
K~LK
K7RT
3VGT
Some of the data points are garbled in transmission. I need to keep only the entries that begin with a capital letter followed by three more characters, each of which may be a capital letter OR a digit. For example, in the list above I would have to delete K~LK and 3VGT.
I know that to delete all but capital letters I can write
sed -n '/[A-Z]\{4,\}/p'
I just want to adjust this so that the last three characters can be capital letters or digits. Any help would be appreciated.
Just use:
sed -n '/[A-Z][A-Z0-9]\{3,\}/p'
However, if these identifiers are really all that there is in the file, I would propose the following command (it will assure that the whole line is matched, so it will reject for example identifiers more than 4 characters long):
sed -n '/^[A-Z][A-Z0-9]\{3\}$/p'
^ means "match zero-length string at the beginning of line";
\{3\} means "match exactly 3 occurrences of the previous atom", the previous atom being [A-Z0-9];
$ means "match zero-length string at the end of line".
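A quick check of the anchored version against the sample data, plus one made-up five-character entry (KFIGX) to show that over-long identifiers are rejected. Setting LC_ALL=C is an added precaution, not part of the answer above: in some locales a range like [A-Z] can also match lower-case letters.

```shell
codes=$(printf '%s\n' KFIG KUNV 'K~LK' K7RT 3VGT KFIGX |
    LC_ALL=C sed -n '/^[A-Z][A-Z0-9]\{3\}$/p')

printf '%s\n' "$codes"
```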