Sed capitalize every word on specific lines - sed

I have markdown files formatted like
chapter one
Blah, blah, blah.
chapter one-hundred-fifty-three
And also files formatted like
CHAPTER ONE
Blah, blah, blah.
CHAPTER ONE-HUNDRED-FIFTY-THREE
In both cases I want to capitalize the chapter lines to say
# Chapter One
Blah, blah, blah.
# Chapter One-Hundred-Fifty-Three
I want to use sed (or awk, or possibly some other Linux CLI program that pipes input and output).
I've found solutions to capitalize every word in a file, but I'm not sure how to restrict it to specific lines or how to include words-connected-with-dashes-instead-of-whitespace.

Using GNU sed, use an address (indicating the target line number) to tell it to apply the substitution to the desired line. For example to apply the substitution to the first line:
sed -r '1s/(\w)(\w*)/\U\1\L\2/g' file
To apply the substitution to the third line:
sed -r '3s/(\w)(\w*)/\U\1\L\2/g' file
To apply the substitution to both the first and third lines:
sed -r -e '1s/(\w)(\w*)/\U\1\L\2/g' -e '3s/(\w)(\w*)/\U\1\L\2/g' file
If you don't mind the second line being modified, you can use an address range:
sed -r '1,3s/(\w)(\w*)/\U\1\L\2/g' file
EDIT:
As per comments below:
sed -r '/^chapter/I { s/^/# /; s/(\w)(\w*)/\U\1\L\2/g }' file
Results:
# Chapter One
Blah, blah, blah.
# Chapter One-Hundred-Fifty-Three
# Chapter One
Blah, blah, blah.
# Chapter One-Hundred-Fifty-Three
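Since the question also allows awk, here is a minimal sketch that avoids the GNU-only \U and \L escapes. It assumes chapter lines always begin with the word "chapter" in any case, and it treats every non-letter (space, dash, apostrophe, digit) as a word break, which is slightly different from \w. The sample input is made up from the two formats in the question.

```shell
# Sample input combining both formats from the question (assumed layout).
input='chapter one
Blah, blah, blah.
CHAPTER ONE-HUNDRED-FIFTY-THREE'

out=$(printf '%s\n' "$input" | awk '
tolower($0) ~ /^chapter/ {
    line = tolower($0)            # normalize, so CHAPTER and chapter behave alike
    result = ""
    cap = 1                       # 1 means the next letter starts a word
    for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c ~ /[a-z]/) {
            if (cap) c = toupper(c)
            cap = 0
        } else {
            cap = 1               # space, dash, etc. end the current word
        }
        result = result c
    }
    $0 = "# " result              # add the markdown heading prefix
}
{ print }')

printf '%s\n' "$out"
```

Lines that do not start with "chapter" pass through untouched, so body text keeps its original capitalization.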

I'd do something like that in Perl:
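#!/usr/bin/perl
while (<>) {
    if (/^chapter/i) {
        # lower-case the line, then capitalize each space-separated word
        $_ = join " ", map ucfirst, split / /, lc;
        # capitalize the words joined by dashes as well
        $_ = join "-", map ucfirst, split /-/;
        # prefix the markdown heading marker from the desired output
        $_ = "# $_";
    }
    print;
}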
Call this like e.g. perl script < input-text > capitalized-text. My Perl-fu is a bit rusty; I'm sure somebody will fold this into a one-liner passed as an argument.


Can grep or sed show only words that match multiple search patterns in a line?

I am wondering, if one can print the matched strings as it is in each line... using grep or sed?
TestCase1: File1 contains below text
The Sun
Thunder The Rain They say
They say The dance
If I use this command:
egrep -o 'The|They' File1
The output I get is:
The
The
They
They
The
But, my expected output should be as below:
The
The They
They The
I am aware that in grep the option -o, --only-matching prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
Edit: Please also suggest how to filter on exact word matches with multiple match strings,
i.e. <The> and <They> as whole words, simply separated by spaces.
TestCase2: File2 contains below text
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Output is:
The
The They
They the
the
The the they
How to approach this?
With GNU awk for FPAT:
$ awk -v FPAT='\\<[Tt]hey?\\>' '{$1=$1}1' file
The
The They
They The
They the
The the they
Note that this cannot avoid matching They when it appears in They're, because the apostrophe counts as a word boundary. If that's really an issue and you want to look for space-separated complete strings then this might be what you want:
$ awk '{c=0; for (i=1;i<=NF;i++) if ($i ~ /^[Tt]hey?$/) printf "%s%s", (c++?OFS:""), $i; print ""}' file
The
The They
They The
the
The the they
If not, let us know.
The above was run against this iteration of the OP's posted sample input:
$ cat file
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Best do it with Perl:
~$ perl -nE 'say /They? /g' File1
The
The They
They The
EDIT : Add new conditions. The regex still matches all but the lowercase the. Adding the i flag makes the match case-insensitive and matches all your test strings.
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
There is a little bit of a trick here: the match also picks up the space after the optional y and prints it in the output. E.g. the first line of output is really: "The_\n" - where "_" = space character. This may or may not be acceptable. One way to remove the spaces and reassemble the string would be:
$ perl -nE 'say join " ", map {substr $_,0,-1} /They? /ig' File1
As to your question about matching full words <The> and <They>, as you put it, the ? in They? indicates that the 'y' is optional. I.e. matches 0 or 1 times. Therefore the pattern is considering 'The' and 'They' as full words, one or the other, followed by a space. You could rewrite the pattern as:
$ perl -nE 'say /(?:They|The) /ig' File1
And effect the same output.
Now that you are considering lowercase the you may run into more edge case "gotchas" like words that end in "the". "loathe" and "tythe" come to mind.
$ echo "I'm loathe to cringe and tythe socks" >> File1
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
the the <--- not wanted!
You can then add the \b test in to match on word boundaries (as in zdim's answer):
$ perl -nE 'say /\bThey? /ig' File1
The
The They
They The
the
The the they
<-- But you get this empty line where no match occurs
So to refine further, you could only print if the line matches. Like this:
$ perl -nE 'say /\bThey? /ig if /\bThey? /i' File1
The
The They
They The
the
The the they
Then, I'm sure, you can find more edge cases that will blow it all up and force further refinement.
Things are not fully specified so here are a couple of possibilities
To catch all words starting with The, and print them with a space in between
perl -wnE'say join " ", /\bThe\w*/g' file
where \b is a word-boundary, a zero-width anchor, and \w is a word character. Using \S (a non-space character) is yet more permissive.
For only The or They, you can instead use
perl -wnE'say join " ", /\bThey?\b/g' file
where y? makes y optional.
To allow the as well use [tT] instead of T in the pattern, or /i for either case for all chars.
It's been clarified in comments that punctuation after The|They isn't allowed, and that lower-case t is. Then we need to constrain the match by whitespace, not a word boundary, and use [tT] as mentioned
perl -wnE'say join " ", /\b([Tt]hey?)\s/g' file
Now the capturing parentheses () are needed since \s consumes a character, unlike the \b before.
This prints the desired output with the provided input.
awk to the rescue!
$ awk -v p="They?" '$0~p{for(i=1;i<=NF;i++) if($i~p) printf "%s",$i OFS; print ""}' file
The
The They
They The
try one more awk:
awk '{while(match($0,/The|They/)){string=substr($0,RSTART,RLENGTH);VAL=VAL?VAL OFS string:string;$0=substr($0,RSTART+RLENGTH+1);};print VAL;VAL=""}' Input_file
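The same solution in multi-line form:
awk '{
  # keep extracting the leftmost match of The or They from the remaining text
  while (match($0, /The|They/)) {
    string = substr($0, RSTART, RLENGTH)      # the matched text
    VAL = VAL ? VAL OFS string : string       # append it to the result, separated by OFS
    $0 = substr($0, RSTART + RLENGTH + 1)     # drop the text up to and including the character after the match
  }
  print VAL                                   # one output line per input line (empty if nothing matched)
  VAL = ""                                    # reset for the next input line
}
' Input_file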
Will add the explanation shortly for same.

sed: replace pattern only if followed by empty line

I need to replace a pattern in a file, only if it is followed by an empty line. Suppose I have following file:
test
test

test

...
the following command would replace all occurrences of test with xxx
cat file | sed 's/test/xxx/g'
but I need to replace test only if the next line is empty. I have tried matching a hex code, but that does not work:
cat file | sed 's/test\x0a/xxx/g'
The desired output should look like this:
test
xxx

xxx

...
Suggested solutions for sed, perl and awk:
sed
sed -rn '1h;1!H;${g;s/test([^\n]*\n\n)/xxx\1/g;p;}' file
I got the idea from sed multiline search and replace. Basically slurp the entire file into sed's hold space and do global replacement on the whole chunk at once.
perl
$ perl -00 -pe 's/test(?=[^\n]*\n\n)$/xxx/m' file
-00 triggers paragraph mode which makes perl read chunks separated by one or several empty lines (just what OP is looking for). Positive look ahead (?=) to anchor substitution to the last line of the chunk.
Caveat: -00 will squash multiple empty lines into single empty lines.
awk
$ awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
Basically store previous line in l, substitute pattern in l if current line is empty. Print l. Finally print the very last line.
Output in all three cases
test
xxx

xxx

...
This might work for you (GNU sed):
sed -r '$!N;s/test(\n\s*)$/xxx\1/;P;D' file
Keep a window of 2 lines throughout the length of the file and if the second line is empty and the first line contains the pattern then make a substitution.
Using sed
sed -r ':a;$!{N;ba};s/test([^\n]*\n(\n|$))/xxx\1/g'
explanation
:a # set label a
$ !{ # if not end of file
N # Add a newline to the pattern space, then append the next line of input to the pattern space
b a # Unconditionally branch to label. The label may be omitted, in which case the next cycle is started.
}
# in short, the commands above (:a;$!{N;ba}) read the whole file into the pattern space.
s/test([^\n]*\n(\n|$))/xxx\1/g # replace the keyword if the next line is empty (\n\n), including an empty last line (\n followed by the end of the pattern space, $)

sed delete lines not containing specific string

I'm new to sed and I have the following question. In this example:
some text here
blah blah 123
another new line
some other text as well
another line
I want to delete all lines except those that contain the string 'text' and/or the string 'blah', so my output file looks like this:
some text here
blah blah 123
some other text as well
Any hints how this can be done using sed?
This might work for you:
sed '/text\|blah/!d' file
some text here
blah blah 123
some other text as well
You want to print only those lines which match either 'text' or 'blah' (or both), where the distinction between 'and' and 'or' is rather crucial.
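sed -n -e '/text/{p;d;}' -e '/blah/{p;d;}' your_data_file
The -n means don't print by default. The first pattern searches for 'text', prints the line if matched and then d starts the next cycle with the next line; the second pattern does the same for 'blah'. If the d was not there then a line containing 'text and blah' would be printed twice. (n is not a safe substitute for d here: it loads the next line and carries on with the remaining commands, so a line that matches only an earlier pattern can be skipped without being printed.) Although I could have used just -e '/blah/p', the symmetry is better, especially if you need to extend the list of matched words.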
If your version of sed supports extended regular expressions (for example, GNU sed does, with -r), then you can simplify that to:
sed -r -n -e '/text|blah/p' your_data_file
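To make the 'and' versus 'or' distinction concrete, here is a small sketch. The sample adds one hypothetical line containing both words to the question's data. The "or" form branches to the end of the script on the first match, so each line is printed at most once; the "and" form nests the second test inside the first.

```shell
# Assumed sample: lines from the question plus one line containing both words.
input='some text here
blah blah 123
another new line
text and blah together'

# OR: b with no label jumps to the end of the script, where the line is
# auto-printed; lines that match neither pattern reach d and are deleted.
either=$(printf '%s\n' "$input" | sed -e '/text/b' -e '/blah/b' -e 'd')

# AND: only test for blah on lines that already matched text.
both=$(printf '%s\n' "$input" | sed -n '/text/{/blah/p;}')

printf 'either:\n%s\n\nboth:\n%s\n' "$either" "$both"
```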
You could simply do it through awk,
$ awk '/blah|text/' file
some text here
blah blah 123
some other text as well
Are you looking for the grep?
Here is an example to look for different texts.
cat yourfile.txt | grep "text\|blah"

replace two newlines to one in shell command line

There are a lot of questions about replacing multiple newlines with one newline, but none of the answers is working for me.
I have a file:
first line
second line MARKER
third line MARKER
other lines
many other lines
I need to replace two newlines (if they exist) after MARKER to one newline. A result file should be:
first line
second line MARKER
third line MARKER
other lines
many other lines
I tried sed ':a;N;$!ba;s/MARKER\n\n/MARKER\n/g' Fail.
sed is useful for single line replacements but has problems with newlines. It can't find \n\n
I tried perl -i -p -e 's/MARKER\n\n/MARKER\n/g' Fail.
This solution looks closer, but it seems that the regexp doesn't react to \n\n.
Is it possible to replace \n\n only after MARKER and not to replace other \n\n in the file?
I am interested in one-line-solution, not scripts.
I think you were on the right track. In a multi-line program, you would load the entire file into a single scalar and run this substitution on it:
s/MARKER\n\n/MARKER\n/g
The trick to getting a one-liner to load a file into a multi-line string is to set $/ in a BEGIN block. This code will get executed once, before the input is read.
perl -i -pe 'BEGIN{$/=undef} s/MARKER\n\n/MARKER\n/g' input
Your Perl solution doesn't work because you are searching for lines that contain two newlines. There is no such thing when the input is read one line at a time. Here's one solution:
perl -ne'print if !$m || !/^$/; $m = /MARKER$/;' infile > outfile
Or in-place:
perl -i~ -ne'print if !$m || !/^$/; $m = /MARKER$/;' file
If you're ok with loading the entire file into memory, you can use
perl -0777pe's/MARKER\n\n/MARKER\n/g;' infile > outfile
or
perl -0777pe's/MARKER\n\K\n//g;' infile > outfile
As above, you can use -i~ to edit in-place. Remove the ~ if you don't want to make a backup.
awk:
kent$ cat a
first line
second line MARKER
third line MARKER
other lines
many other lines
kent$ awk 'BEGIN{RS="\x034"} {gsub(/MARKER\n\n/,"MARKER\n");printf "%s", $0}' a
first line
second line MARKER
third line MARKER
other lines
many other lines
See sed one liners.
awk '
marker { marker = 0; if (/^$/) next }
/MARKER/ { marker = 1 }
{ print }
'
This can be done in very simple sed.
sed '/MARKER$/{n;/./!d}'
This might work for you:
sed '/MARKER/,//{//!d}'
Explanation:
Deletes all lines between MARKER's preserving the MARKER lines.
Or:
sed '/MARKER/{n;N;//D}'
Explanation:
Read the next line after MARKER, then append the line after that. Delete the previous line if the current line is a MARKER line.

How do I print word after regex but not a similar word?

I want an awk or sed command to print the word after regexp.
I want to find the WORD after a WORD but not the WORD that looks similar.
The file looks like this:
somethingsomething
X-Windows-Icon=xournal
somethingsomething
Icon=xournal
somethingsomething
somethingsomething
I want "xournal" from the one that says "Icon=xournal". This is how far I have come until now. I have tried an awk command too but it was also unsuccessful.
cat "${file}" | grep 'Icon=' | sed 's/.*Icon=//' >> /tmp/text.txt
But I get both, so the text file contains xournal twice, which I don't want.
Use ^ to anchor the pattern at the beginning of the line. And you can even do the grepping directly within sed:
sed -n '/^Icon=/ { s/.*=//; p; }' "$file" >> /tmp/text.txt
You could also use awk, which I think reads a little better. Using = as the field separator, if field 1 is Icon then print field 2:
awk -F= '$1=="Icon" {print $2}' "$file" >> /tmp/text.txt
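One caveat with splitting on =: a value that itself contains = would be cut off at $2. A sketch that avoids this checks that the line starts with Icon= and prints everything after that prefix. The sample lines, including the last one, are hypothetical.

```shell
# Hypothetical excerpt; the last line has "=" inside the value.
input='X-Windows-Icon=xournal
Icon=xournal
Icon=name=with=equals'

# index() returns 1 only when the line begins with "Icon=";
# substr($0, 6) is everything after that 5-character prefix.
out=$(printf '%s\n' "$input" |
  awk 'index($0, "Icon=") == 1 { print substr($0, 6) }')

printf '%s\n' "$out"
```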
This might be useful even though Perl is not one of the tags.
In case you are interested in Perl, this small program will do the task for you:
#!/usr/bin/perl -w
while(<>)
{
if(/Icon\=/i)
{
print $';
}
}
This is the output:
C:\Documents and Settings\Administrator>io.pl new2.txt
xournal
xournal
explanation:
while (<>) takes the input data from the file given as an argument on the command line while executing.
(/Icon\=/i) is the regex used in the if condition.
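$' will print the part of the line after the regex. Note that /Icon\=/i is not anchored, so it also matches X-Windows-Icon=xournal, which is why xournal appears twice in the output above; use /^Icon=/ to match only lines that begin with Icon=.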
All you need is:
sed -n 's/^Icon=//p' file