Convert Perl to Shell - perl

I have a Perl script that I use to SNMP walk devices. However, the server available to me does not allow me to install all the modules it needs, so I need to convert the script to shell (sh). I can run the script on individual devices, but I would like it to read from a text file as it did in Perl. The Perl script starts with:
open(TEST, "cat test.txt |");
@records=<TEST>;
close(TEST);
foreach $line (@records)
{
($field1, $field2, $field3)=split(/\s+/, $line);
# Run and record SNMP walk results.

Depending on exactly what the input is and what you are trying to do, that Perl code fragment would likely translate to:
while read field1 field2 field3
do
# Run and record SNMP walk results.
echo "1=$field1 2=$field2 3=$field3"
done <text.txt
For example, if text.txt is:
$ cat text.txt
one two three
i ii iii
Then, the above code produces the output:
1=one 2=two 3=three
1=i 2=ii 3=iii
As you can see, the shell read command reads a line (record) at a time and also does splitting on whitespace. There are many options for read to control whether newlines or something else divide records (-d) and whether splitting is to be done on whitespace or something else (IFS) or whether backslashes in the input are to be treated as escape characters or not (-r). See man bash.
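As a minimal sketch of the IFS and -r options mentioned above, the loop below splits on colons instead of whitespace and leaves backslashes in the input alone. The colon-separated layout, the field names host, community and oid, and the function name parse_devices are assumptions made for illustration; the real SNMP walk command would go where the printf is.

```shell
# Hypothetical input format: one device per line, fields separated by ":".
parse_devices() {
    # IFS=: applies only to this read; -r keeps backslashes literal.
    while IFS=: read -r host community oid; do
        # A real script would run its SNMP walk here; this sketch only prints the fields.
        printf 'host=%s community=%s oid=%s\n' "$host" "$community" "$oid"
    done
}

printf '%s\n' 'router1:public:1.3.6.1.2.1.1' 'switch2:private:1.3.6.1.2.1.2' | parse_devices
```

Reading from a file instead of a pipe works the same way: parse_devices <test.txt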

while read string; do
str1=${string%% *}
str3=${string##* }
temp=${string#$str1 }
str2=${temp%% *}
echo $str1 $str2 $str3
done <test.txt
Alternate version:
while read string; do
str1=${string%% *}
temp=${string#$str1 }
str2=${temp%% *}
temp=${string#$str1 $str2 }
str3=${temp%% *}
echo $str1 $str2 $str3
done <test.txt
POSIX substring parameter expansion
${parameter%word}
Remove Smallest Suffix Pattern. The word shall be expanded to produce
a pattern. The parameter expansion shall then result in parameter,
with the smallest portion of the suffix matched by the pattern
deleted.
${parameter%%word}
Remove Largest Suffix Pattern. The word shall be expanded to produce a
pattern. The parameter expansion shall then result in parameter, with
the largest portion of the suffix matched by the pattern deleted.
${parameter#word}
Remove Smallest Prefix Pattern. The word shall be expanded to produce
a pattern. The parameter expansion shall then result in parameter,
with the smallest portion of the prefix matched by the pattern
deleted.
${parameter##word}
Remove Largest Prefix Pattern. The word shall be expanded to produce a
pattern. The parameter expansion shall then result in parameter, with
the largest portion of the prefix matched by the pattern deleted.
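To make the four expansions concrete, here is a small sketch applied to one sample string. The variable names are illustrative only, and the pattern in each case is a space next to a `*` wildcard.

```shell
string='one two three'

shortest_suffix_removed=${string% *}    # removes " three"
longest_suffix_removed=${string%% *}    # removes " two three"
shortest_prefix_removed=${string#* }    # removes "one "
longest_prefix_removed=${string##* }    # removes "one two "

printf '%s\n' "$shortest_suffix_removed" "$longest_suffix_removed" \
    "$shortest_prefix_removed" "$longest_prefix_removed"
```

This prints "one two", "one", "two three" and "three", one per line.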

Related

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.
This is purely a regex issue. \S matches only non-whitespace characters, so \S* inside the parentheses cannot match a value that contains a space, as "Ended normally" does. Make it simple by explicitly listing the characters to match inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use character classes [:alnum:] if you want numbers too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'
sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).
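Where grep lacks the -P option, the same extraction can be sketched with sed. The helper name get_field and its argument are hypothetical. The sketch assumes no nested parentheses and a field name free of regex metacharacters, and unlike the \b version it would also match a longer name ending in STATUS.

```shell
get_field() {
    # $1 = field name, e.g. STATUS (hypothetical helper).
    # -n plus the p flag prints only lines where the substitution succeeded.
    sed -n "s/.*$1(\([^)]*\)).*/\1/p"
}

printf '%s\n' \
    'QMNAME(QMTKGW01) STATUS(Ended normally)' \
    'STATUS(Running) QMNAME(QMTKGW01)' \
    'a line without the field' | get_field STATUS
```

This prints "Ended normally" and "Running"; the third line is skipped because it does not contain the field.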

Cli command to remove specific newlines

Given a markdown file that contains a bunch of these blocks:
```
json :
{
"something": "here"
}
```
And I want to fix all of these to become valid markdown, i.e.:
```json
{
"something": "here"
}
```
How can I do that effectively across any number of files?
I have Googled around a bit and found similar issues, but have been unable to adapt their solutions to my specific need. It seems that sed is not great at multi-line matching, and the inclusion of the ` character is obviously also causing issues.
I've tried with
perl -pe "s/\njson :/json/g"
but that did not give any matches.
To make your Perl program work, you need to change the input record separator $/. A simple BEGIN block will do to undef it before the program runs its while loop.
foo is your input file.
$ perl -pe 'BEGIN{undef $/} s/\njson :/json/g' foo
```json
{
"something": "here"
}
```
Perl will now slurp in the whole file at once, which should be fine for a markdown document. If you want to process files of several GBs of size, get more RAM though.
Note that you need -i as well to do in-place editing.
$ perl -pi -e '...' *
A much shorter version is to use the -0 flag instead of the BEGIN block to tell Perl about the input record separator. perlrun says this:
The special value 00 will cause Perl to slurp files in paragraph
mode. Any value 0400 or above will cause Perl to slurp files whole,
but by convention the value 0777 is the one normally used for this
purpose.
You could have detected this yourself by running your program with the re 'debug' pragma, which turns on debugging mode for regex. It would have told you.
$ perl -Mre=debug -pe 's/\njson :/json/g' foo
Compiling REx "\njson :"
Final program:
1: EXACT <\njson :> (4)
4: END (0)
anchored "%njson :" at 0 (checking anchored isall) minlen 7
Matching REx "\njson :" against "```%n"
Regex match can't succeed, so not even tried
```
Matching REx "\njson :" against "json :%n"
Intuit: trying to determine minimum start position...
Did not find anchored substr "%njson :"...
Match rejected by optimizer
json :
Matching REx "\njson :" against "{%n"
Regex match can't succeed, so not even tried
{
Matching REx "\njson :" against " %"something%": %"here%"%n"
Intuit: trying to determine minimum start position...
Did not find anchored substr "%njson :"...
Match rejected by optimizer
"something": "here"
Matching REx "\njson :" against "}%n"
Regex match can't succeed, so not even tried
}
Matching REx "\njson :" against "```%n"
Regex match can't succeed, so not even tried
```
Matching REx "\njson :" against "%n"
Regex match can't succeed, so not even tried
Freeing REx: "\njson :"
The giveaway is this:
Matching REx "\njson :" against "```%n"
Regex match can't succeed, so not even tried

Converting long single-line comments into multiple short lines

I have some lines with very long single-line comments:
# this is a comment describing the function, let's pretend it's long.
function whatever()
{
    # this is an explanation of something that happens in here.
    do_something();
}
For this example (adapting it to other numbers should be trivial) I want:
each line to contain at most 33 characters (each indentation level is 4 spaces),
each line to be broken at the last possible space, and
each additional line to be indented exactly like the original line.
So it would end up looking like this:
# this is a comment describing
# the function, let's pretend
# it's long.
function whatever()
{
    # this is an explanation of
    # something that happens in
    # here.
    do_something();
}
I'm trying to write a sed script for that, my attempt looking like this (leaving out the attempts to make it break at a particular character count for clarity and because it didn't work):
s/\(^[^#]*# \)\(.*\) \(.*\)/\1\2\n\1\3/g;
This breaks the line only once and not repeatedly like I falsely assumed g to do (and which it actually would do if it were only s/ /\n/g or something).
Perl to the rescue!
Its Text::Wrap module does what you need:
perl -MText::Wrap='wrap,$columns' -pe '
s/^(\s*#)(.*)/$columns = 33 - length $1; wrap("$1", "$1 ", "$2")/e
' < input > output
-M uses the given module with the given parameters. Here, we'll use the wrap function and the $columns variable.
-p reads the input line by line and prints the possibly modified line (like sed)
s///e is a substitution that uses code in the replacement part, the matching part is replaced by the value returned from the code
to calculate the width, we subtract the initial whitespace from 33. If you use tabs in your sources, you'll have to handle them specially.
wrap takes three parameters: prefix for the first line, prefix for the rest of the lines (in this case, they're almost the same: the comment prefix, we just need to add the space to the second one), and the text to wrap.
Comparing the output to yours, it seems you want 33 characters regardless of the leading whitespace. If that's true, just remove the - length $1 part.
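If Perl or Text::Wrap is not available, the same greedy wrapping can be sketched with awk. This is a swapped-in technique, not the Text::Wrap algorithm. The function name wrap_comments and its width argument are hypothetical, the width counts the leading whitespace, and only lines that start with optional whitespace and # are touched.

```shell
wrap_comments() {
    # $1 = maximum line width, e.g. 33 (illustrative parameter)
    awk -v width="$1" '
        /^[ \t]*#/ {
            match($0, /^[ \t]*#/)
            prefix = substr($0, 1, RLENGTH) " "    # indentation, "#", one space
            n = split(substr($0, RLENGTH + 1), words, " ")
            if (n == 0) { print; next }            # bare "#": nothing to wrap
            line = prefix words[1]
            for (i = 2; i <= n; i++) {
                if (length(line) + 1 + length(words[i]) <= width) {
                    line = line " " words[i]       # the word still fits
                } else {
                    print line                     # break at the last possible space
                    line = prefix words[i]
                }
            }
            print line
            next
        }
        { print }                                  # non-comment lines pass through
    '
}

printf '%s\n' \
    "# this is a comment describing the function, let's pretend it's long." \
    'function whatever()' | wrap_comments 33
```

A single word longer than the width is left on its own over-long line rather than being split.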

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?
Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } { print }'
Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it doesn't make any odds (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } { print }'
pattern # pattern #replace# pattern
$
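A sketch of the BEGIN-block variant mentioned above, wrapped in a function so the pattern and replacement can be passed in. The function name replace_third_field and its two parameters are illustrative, not part of any tool.

```shell
replace_third_field() {
    # $1 = awk regex to look for in field 3, $2 = replacement for the whole field
    awk -v pat="$1" -v rep="$2" '
        BEGIN { FS = OFS = "#" }    # set both separators once, before any input is read
        $3 ~ pat { $3 = rep }       # only field 3 is tested and changed
        { print }                   # print every line, changed or not
    '
}

printf '%s\n' 'pattern # pattern # pattern # pattern' | replace_third_field pattern replace
```

This prints the same line as the example above, pattern # pattern #replace# pattern, and lines whose third field does not match pass through unchanged.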
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-hash characters, a hash, another field of non-hash characters, and remembers the lot; it then looks for a hash, the pattern (which must be in the third field since the first two fields have been matched already), another hash, and then the residue of the line. When the line matches, it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
Perl editing the third field. Note that the default output is $_, which has to be reassembled from the auto-split fields in the array @F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", @F; ' "$@"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array @F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil, $ or @, changes depending on whether you're working with a value from the array or the array as a whole). The =~ is the binding operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat, then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, '-suffix' and a space. The $_ = join "#", @F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.
Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#[^#]*\) - gather everything before the pattern, including the 2nd # and any text in the third field before the match - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2
This might work for you:
echo '0001 # fish # animal # eats worms' |
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
1. Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it: s/#/&\n/2;s/#/\n&/3
2. Save the line in the hold space: h
3. Delete the fields on either side: s/\n#.*//;s/.*\n//
4. Now process the field, i.e. change all a's to b's: y/a/b/
5. Now append the original line: G
6. Substitute the new field for the old field (also removing any newlines): s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. In step 4 the pattern space contains only the defined field, so any number of commands may be carried out there without affecting the rest of the line.
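For comparison, the same edit of the third field only (every a becomes b) can be sketched in awk without the hold space. The function name edit_third_field is hypothetical, and the sketch assumes every line has at least three #-separated fields.

```shell
edit_third_field() {
    # gsub works on $3 only; awk then rebuilds the line using OFS.
    awk 'BEGIN { FS = OFS = "#" } { gsub(/a/, "b", $3); print }'
}

printf '%s\n' '0001 # fish # animal # eats worms' | edit_third_field
```

This prints 0001 # fish # bnimbl # eats worms, matching the sed result above.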

Why am I only getting one match?

I am parsing a log file trying to pull out the lines where the phrase "failures=" is a unique non-zero digit.
The first part of my perl one liner will pull out all the lines where "failures" are greater than zero. But that part of the log file repeats until a new failure occurs, i.e., after the first failure the log entries will be "failures=1" until the second error then it will read, "failures=2".
What I'd like to do is pull only the first line where that value changes and I thought I had it with this:
cat -n Logstats.out | perl -nle 'print "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+).*tracking_id="(\d+)"/ && $2 > 0' | perl -ne 'print unless $a{/failures=\d+/}++'
However, that only pulls the first non-zero "failure" line and nothing else. What do I need to change for it to pull all the unique values for "failures"?
thanks in advance!
Update: The amount of text in each line up to the "tracking_id" is more text than I can post. Sorry!
2011-09-06 14:14:18 [INFO] [logStats]: name=somename id=d6e6f4f0-4c0d-93b6-7a71-8e3100000030
successes=1 failures=0 skipped=0 eventDelta=41 original=188 simulated=229
totalDelta=41 averageDelta=41 min=0 max=41 averageOriginalDuration=188 averageSimulatedDuriation=229(txid = b3036eca-6288-48ef-166f-5ae200000646
date = 2011-09-02 08:00:00 type = XML xml
=
perl -ne 'print unless $a{/failures=\d+/}++'
does not work because a hash subscript is evaluated in scalar context, and the m// operator does not return the match in scalar context. Instead, it returns 1. So (since every line matches), what you wrote is equivalent to:
perl -ne 'print unless $a{1}++'
and I think you can see the problem there.
There's a number of ways to work around that, but I'd use:
perl -ne 'print unless /(failures=\d+)/ and $a{$1}++'
However, I'd do the whole thing in one call to perl, including numbering the lines:
perl -nle '
print "Line No. $.: failures=$1; eventDelta=$2; tracking_id=$3"
if /failures=(\d+).*?eventDelta=(.\d+).*?tracking_id="(\d+)"/
&& $1 > 0
&& !$seen{$1}++
' Logstats.out
($. automatically counts input lines in Perl. The line breaks can be removed if desired, but it will actually work as is.)
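On a machine where Perl is not an option, the same "first line for each new non-zero failures value" idea can be sketched in awk. The function name first_failures is made up, and the sketch reports only the line number and the failure count, since the other fields of the real log were not shown in full.

```shell
first_failures() {
    awk '
        match($0, /failures=[0-9]+/) {
            # "failures=" is 9 characters long; the digits follow it.
            count = substr($0, RSTART + 9, RLENGTH - 9) + 0
            # Print only the first line seen for each non-zero count.
            if (count > 0 && !seen[count]++)
                printf "Line No. %d: failures=%d\n", NR, count
        }
    ' "$@"
}

printf '%s\n' \
    'successes=1 failures=0 skipped=0' \
    'successes=1 failures=1 skipped=0' \
    'successes=2 failures=1 skipped=0' \
    'successes=2 failures=2 skipped=0' | first_failures
```

This prints "Line No. 2: failures=1" and "Line No. 4: failures=2". As with the Perl version, the seen array is keyed on the captured value itself, not on the true/false result of the match.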
You could use a hash to store the results and print it:
perl -nle '$f{$2} ||= "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+ ).*tracking_id="(\d+)"/ && $2;END{ print $f{$_} for keys %f }' Logstats.out
(not tested due to missing input data...)
HTH,
Paul
Since your input does not match your regex, I can't really help you. But I can tell you that this is doing a lot of backtracking--and that's bad if there is a lot of data after the part that you're interested in.
So here are some alternative ideas:
qr{ \s # a single space
failures=(\d+) # the entry for failures
\s+ # at least one space
skipped=\d+ # skipped
\s+
eventDelta=(.\d+)
.*? # any number of non-newline characters *UNTIL* ...
\btracking_id="(\d+)" # another specified sequence of data
}x;
The parser will match "skipped=" followed by a group of digits a lot faster than it will scan the rest of the line and then backtrack to 'eventDelta=' when it fails. It is better to put it in, if you know it will always be there.
Since you don't show tracking_id in your example, I can't tell how it occurs, so in this case I used a non-greedy "any" match, which will always be looking for the next sequence. Again, if there is a lot of data in the line, you do not want to scan to the end and backtrack until you find that you've already read 'tracking_id="nnn"'. However, lookaheads cost processing time, so it is still better to spell out 'skipped=' and all possible values than to use a non-greedy "any" match.
You'll also notice that after accepting any data, I specify that tracking_id should appear at a word boundary, which disambiguates it from the possible (though not likely) 'backtracking_id='.