CLI command to remove specific newlines - perl

I have a markdown file that contains a bunch of these blocks:
```
json :
{
"something": "here"
}
```
I want to fix all of these so that they become valid markdown, i.e.:
```json
{
"something": "here"
}
```
How can I do that effectively across any number of files?
I have Googled around a bit and found similar issues, but have been unable to adapt their solutions to my specific need. It seems that sed is not great at multi-line matching, and the ` character is obviously also causing issues.
I've tried with
perl -pe "s/\njson :/json/g"
but that did not give any matches.

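With -p, Perl reads the input one line at a time, so the newline is always the last character of the string being matched and \njson can never match. To make your Perl program work, you need to change the input record separator $/. A BEGIN block that undefs it before the program runs its implicit while loop will do.
Here foo is your input file.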
$ perl -pe 'BEGIN{undef $/} s/\njson :/json/g' foo
```json
{
"something": "here"
}
```
Perl will now slurp in the whole file at once, which is fine for a markdown document. For files several gigabytes in size, though, you would need enough RAM to hold the whole file.
Note that you need -i as well to do in-place editing.
$ perl -pi -e '...' *
A much shorter version is to use the -0 flag instead of the BEGIN block to tell Perl about the input record separator. perlrun says this:
The special value 00 will cause Perl to slurp files in paragraph
mode. Any value 0400 or above will cause Perl to slurp files whole,
but by convention the value 0777 is the one normally used for this
purpose.
You could have detected this yourself by running your program with the re 'debug' pragma, which turns on debugging mode for regexes. Its output shows which string each match attempt ran against.
$ perl -Mre=debug -pe 's/\njson :/json/g' foo
Compiling REx "\njson :"
Final program:
1: EXACT <\njson :> (4)
4: END (0)
anchored "%njson :" at 0 (checking anchored isall) minlen 7
Matching REx "\njson :" against "```%n"
Regex match can't succeed, so not even tried
```
Matching REx "\njson :" against "json :%n"
Intuit: trying to determine minimum start position...
Did not find anchored substr "%njson :"...
Match rejected by optimizer
json :
Matching REx "\njson :" against "{%n"
Regex match can't succeed, so not even tried
{
Matching REx "\njson :" against " %"something%": %"here%"%n"
Intuit: trying to determine minimum start position...
Did not find anchored substr "%njson :"...
Match rejected by optimizer
"something": "here"
Matching REx "\njson :" against "}%n"
Regex match can't succeed, so not even tried
}
Matching REx "\njson :" against "```%n"
Regex match can't succeed, so not even tried
```
Matching REx "\njson :" against "%n"
Regex match can't succeed, so not even tried
Freeing REx: "\njson :"
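The giveaway is this. Each attempt runs against a single line, and nothing follows the trailing newline (%n), so \njson cannot match: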
Matching REx "\njson :" against "```%n"
Regex match can't succeed, so not even tried

Related

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.
This is purely a regex issue. By using \S you asked to match only non-whitespace characters inside (..), but the failing case contains a space, so the match fails. Keep it simple by explicitly listing the characters allowed inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use the character class [:alnum:] if you want digits too:
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'
sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
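The [^)]* approach can be wrapped up as below. This is a sketch under the assumption that values never contain a closing parenthesis; extract_status is a hypothetical helper name, and the -n switch with the p flag is added here so that lines without a STATUS field are skipped instead of being echoed unchanged.

```shell
# Sketch: print the text inside STATUS(...) for each input line.
# extract_status is a hypothetical helper name.
extract_status() {
    # [^)]* matches everything up to the first closing parenthesis,
    # so spaces inside the value are kept. -n plus the p flag prints
    # only the lines where the substitution succeeded.
    sed -n 's/.*STATUS(\([^)]*\)).*/\1/p' "$@"
}

# The field order does not matter, and lines without STATUS print nothing.
printf '%s\n' 'QMNAME(QMTKGW01) STATUS(Running)' \
              'STATUS(Ended normally) QMNAME(QMTKGW01)' \
              'no status here' | extract_status
```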
Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead).
The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).
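The effect of the word boundary can be checked with a line that also contains SUBSTATUS. This is a small sketch, assuming a grep built with -P support (GNU grep, for instance); status_values is a hypothetical helper name.

```shell
# Sketch: extract STATUS(...) values with grep -oP.
# status_values is a hypothetical helper; it assumes grep supports -P.
status_values() {
    # \b rejects SUBSTATUS, \K drops the prefix from the reported match,
    # and the lookahead requires the closing parenthesis without returning it.
    grep -oP '\bSTATUS\(\K[^)]*(?=\))' "$@"
}

# Only the value of the standalone STATUS field is printed.
printf '%s\n' 'SUBSTATUS(ignored) STATUS(Ended normally)' | status_values
```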

Perl one-liner: deleting a line with pattern matching

I am trying to delete a bunch of lines in a file if they match a particular pattern, part of which is variable.
I am trying to delete the lines that match abc12, abc13, etc.
I tried writing a C-shell script, and this is the code:
#!/bin/csh
foreach $x (12 13 14 15 16 17)
perl -ni -e 'print unless /abc$x/' filename
end
This doesn't work, but when I use the one-liner without a variable (abc12), it works.
I am not sure if there is something wrong with the pattern matching or if there is something else I am missing.
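Yes, it's the fact you're using single quotes. The shell does not expand $x inside single quotes, so Perl receives the literal text $x and treats it as a Perl variable of its own, which is not set.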
Of course, you're also doing it very inefficiently, because you're processing each file multiple times.
If you're looking to remove lines abc12 to abc17 you can do this all in one go:
perl -n -i.bak -e 'print unless m/abc1[234567]/' filename
Try this
perl -n -i.bak -e 'print unless m/abc1[2-7]/' filename
Using the range [2-7] merely removes the need to type [234567], which saves you three keystrokes.
man 1 bash: Pattern Matching
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched.
A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

Convert Perl to Shell

I have a Perl script that I use to SNMP walk devices. However, the server I have available to me does not allow me to install all the modules needed. So I need to convert the script to Shell (sh). I can run the script on individual devices but would like it to read from a text file like it did in Perl. The Perl script starts with:
open(TEST, "cat test.txt |");
@records=<TEST>;
close(TEST);
foreach $line (@records)
{
($field1, $field2, $field3)=split(/\s+/, $line);
# Run and record SNMP walk results.
Depending on exactly what the input is and what you are trying to do, that perl code fragment would likely translate to:
while read field1 field2 field3
do
# Run and record SNMP walk results.
echo "1=$field1 2=$field2 3=$field3"
done <text.txt
For example, if text.txt is:
$ cat text.txt
one two three
i ii iii
Then, the above code produces the output:
1=one 2=two 3=three
1=i 2=ii 3=iii
As you can see, the shell read command reads a line (record) at a time and also does splitting on whitespace. There are many options for read to control whether newlines or something else divide records (-d) and whether splitting is to be done on whitespace or something else (IFS) or whether backslashes in the input are to be treated as escape characters or not (-r). See man bash.
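As a small illustration of those options, the sketch below splits colon-separated records instead of whitespace-separated ones. The colon format and the name parse_records are assumptions made for the example.

```shell
# Sketch: split colon-separated records with read.
# parse_records is a hypothetical helper name; it reads stdin.
parse_records() {
    # IFS=: applies only to this read command; -r keeps backslashes literal.
    while IFS=: read -r field1 field2 field3; do
        echo "1=$field1 2=$field2 3=$field3"
    done
}

printf 'one:two:three\ni:ii:iii\n' | parse_records
```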
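while read -r string; do
    str1=${string%% *}       # first field: drop everything from the first space on
    str3=${string##* }       # last field: drop everything up to the last space
    temp=${string#$str1 }    # remainder after the first field
    str2=${temp%% *}         # second field
    echo "$str1 $str2 $str3"
done <test.txt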
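Alternate version, which takes the third field rather than the last one:
while read -r string; do
    str1=${string%% *}             # first field
    temp=${string#$str1 }          # remainder after the first field
    str2=${temp%% *}               # second field
    temp=${string#$str1 $str2 }    # remainder after the first two fields
    str3=${temp%% *}               # third field
    echo "$str1 $str2 $str3"
done <test.txt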
POSIX substring parameter expansion
${parameter%word}
Remove Smallest Suffix Pattern. The word shall be expanded to produce
a pattern. The parameter expansion shall then result in parameter,
with the smallest portion of the suffix matched by the pattern
deleted.
${parameter%%word}
Remove Largest Suffix Pattern. The word shall be expanded to produce a
pattern. The parameter expansion shall then result in parameter, with
the largest portion of the suffix matched by the pattern deleted.
${parameter#word}
Remove Smallest Prefix Pattern. The word shall be expanded to produce
a pattern. The parameter expansion shall then result in parameter,
with the smallest portion of the prefix matched by the pattern
deleted.
${parameter##word}
Remove Largest Prefix Pattern. The word shall be expanded to produce a
pattern. The parameter expansion shall then result in parameter, with
the largest portion of the prefix matched by the pattern deleted.
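The four forms quoted above can be compared side by side on one sample value. The string and the patterns here are chosen purely for illustration.

```shell
# Sketch: the four POSIX prefix/suffix removal forms on one sample string.
string='one two three'

echo "${string% *}"     # smallest suffix matching ' *' removed: one two
echo "${string%% *}"    # largest suffix matching ' *' removed:  one
echo "${string#* }"     # smallest prefix matching '* ' removed: two three
echo "${string##* }"    # largest prefix matching '* ' removed:  three
```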

Please explain one line of perl code

I have this line from https://camlistore.googlesource.com/camlistore/+/master/third_party/rewrite-imports.sh
find . -type f -name '*.go' -exec perl -pi -e 's!"code.google.com/!"camlistore.org/third_party/code.google.com/!' {} \;
I would like help understanding what exactly this does:
perl -pi -e 's!"code.google.com/!"camlistore.org/third_party/code.google.com/!'
Especially the exclamation marks and the " characters. Thanks!
From perldoc perlrun:
-p means "run the expression for each line, and print the result"
-i means "edit the input file in place"
-e means "the next parameter is the Perl expression to evaluate"
For the expression itself:
The ! marks are the separators for the s (substitution) operator. Any non-alphanumeric character can be used for that - whatever follows the s.
The " characters don't mean anything special, they're just part of the text to be replaced, and the replacement.
So we have:
s: substitute
!: (separator)
"code.google.com/: text to find
!: (separator)
"camlistore.org/third_party/code.google.com/: replacement text
!: (separator)
Which all means:
For each line in the file
Find the text "code.google.com/
And (if found) replace it with "camlistore.org/third_party/code.google.com/
The bangs ! are just an alternative delimiter for the search and replace regex s///.
Because the content of the search and replace includes forward slashes, it makes sense to use a different delimiter to avoid having to escape them all. Exclamation points are sometimes used for this purpose (s!!!), but my preferred alternative is braces: s{}{}.
As for what that code does, it's replacing references to "code.google.com/ with "camlistore.org/third_party/code.google.com/ in the found files (the first one on each line, since there is no /g flag).
This is a pretty straightforward search-and-replace. The s/PATTERN/REPLACEMENT/ operator sees if a string matches the regular expression pattern and replaces the part that matches with the value of the replacement string.
Since sometimes / characters are an inconvenient delimiter (such as dealing with web URIs), Perl allows you to swap them out for other characters, in this case they chose to use !.
The -p switch causes Perl to assume a loop around the code in question for processing lines. The -i switch allows input lines to be edited in-place as they are processed, optionally preserving the original in another file. (See perldoc perlrun for the gory details.)
So all this code is doing is replacing the first occurrence of "code.google.com/ on each line with "camlistore.org/third_party/code.google.com/.

Need to delete the entire line except the matching strings

What I need is to delete the entire line but keep the matching string.
The matching pattern starts with Unhandled and ends with a :
I tried the code below, which prints the matching pattern, but I need to delete the rest of the line from the file.
perl -0777 -ne 'print "Unhandled error at$1\n" while /Unhandled\ error\ at(.*?):/gs' filename
Below is the sample input:
2012-04-09 01:52:13,717 - uhrerror - ERROR - 22866 - /home/shabbir/web/middleware.py process_exception - 217 - Unhandled error at /user/resetpassword/: : {'mod_wsgi.listener_port': '8080', 'HTTP_COOKIE': "__utma=1.627673239.1309689718.1333823126.1333916263.156; __utmz=1.1333636950.152.101.utmgclid=CMmkz934na8CFY4c6wod_R8JbA|utmccn=(not%20set)|utmcmd=(not%20set)|utmctr=non-stick%20kadai%20online; subpopdd=yes; _msuuid_1690zlm11992=FCC09820-3004-413A-97A3-1088EE128CE9; _we_wk_ls_=%7Btime%3A'1322900804422'%7D; _msuuid_lf2uu38ua0=08D1CEFE-3C19-4B9E-8096-240B92BA0ADD; nevermissadeal=True; sessionid=c1e850e2e7db09e98a02415fc1ef490; __utmc=1; __utmb=1.7.10.1333916263; 'wsgi.file_wrapper': , 'HTTP_ACCEPT_ENCODING': 'gzip, deflate'}
The code you gave already provides the requested behaviour.
That said, there's a huge redundant string in your program you can eliminate.
perl -0777nE'say $1 while /(Unhandled error at .*?):/gs' filename
Finally, slurping the entire file seems entirely superfluous.
perl -nE'say $1 if /(Unhandled error at .*?):/g' filename
perl -0777 -i -pe 's/.*?(Unhandled error .*?):.*/$1/g' filename
This replaces each line that contains an error with just the matched string, in the file itself.
-0777 : makes Perl read the whole file in one go.
-i : edits the file in place.
-p : loops over the input (here a single record holding the whole file), runs the code in single quotes, i.e. 's/.*?(Unhandled error .*?):.*/$1/g', and prints the result, which -i writes back to the file.
-e : takes the program from the command line.
If one match is all you want to keep from the whole string, you could replace the string value with the match afterwards. (i.e. Simply assign the new value)
If you have several matches within the string, the least complicated method may be to store the matches temporarily in an array. Then just discard the original variable if you don't need it anymore.
I would use -l option to handle line endings (less version dependent, prints a new line for each match), and a for loop to print all the matches, not just the first one $1. No need to slurp the file with -0777.
perl -nwle 'print for /Unhandled error at .*?:/g'
Note that with the /g modifier, a capturing parenthesis is not required.
If only one (the first) match is to be printed, /g is redundant and you can just use $1:
perl -nlwe 'print $1 if /(Unhandled error at .*?):/'