Why am I only getting one match? - perl

I am parsing a log file trying to pull out the lines where the phrase "failures=" is a unique non-zero digit.
The first part of my perl one liner will pull out all the lines where "failures" are greater than zero. But that part of the log file repeats until a new failure occurs, i.e., after the first failure the log entries will be "failures=1" until the second error then it will read, "failures=2".
What I'd like to do is pull only the first line where that value changes and I thought I had it with this:
cat -n Logstats.out | perl -nle 'print "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+).*tracking_id="(\d+)"/ && $2 > 0' | perl -ne 'print unless $a{/failures=\d+/}++'
However, that only pulls the first non-zero "failure" line and nothing else. What do I need to change for it to pull all the unique values for "failures"?
thanks in advance!
Update: The amount of text in each line up to the "tracking_id" is more text than I can post. Sorry!
2011-09-06 14:14:18 [INFO] [logStats]: name=somename id=d6e6f4f0-4c0d-93b6-7a71-8e3100000030
successes=1 failures=0 skipped=0 eventDelta=41 original=188 simulated=229
totalDelta=41 averageDelta=41 min=0 max=41 averageOriginalDuration=188 averageSimulatedDuriation=229(txid = b3036eca-6288-48ef-166f-5ae200000646
date = 2011-09-02 08:00:00 type = XML xml
=

perl -ne 'print unless $a{/failures=\d+/}++'
does not work because a hash subscript is evaluated in scalar context, and the m// operator does not return the match in scalar context. Instead, it returns 1. So (since every line matches), what you wrote is equivalent to:
perl -ne 'print unless $a{1}++'
and I think you can see the problem there.
There's a number of ways to work around that, but I'd use:
perl -ne 'print unless /(failures=\d+)/ and $a{$1}++'
However, I'd do the whole thing in one call to perl, including numbering the lines:
perl -nle '
print "Line No. $.: failures=$1; eventDelta=$2; tracking_id=$3"
if /failures=(\d+).*?eventDelta=(.\d+).*?tracking_id="(\d+)"/
&& $1 > 0
&& !$seen{$1}++
' Logstats.out
($. automatically counts input lines in Perl. The line breaks can be removed if desired, but it will actually work as is.)

You could use a hash to store the results and print them:
perl -nle '$f{$2} ||= "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+).*tracking_id="(\d+)"/ && $2;END{ print $f{$_} for keys %f }' Logstats.out
(not tested due to missing input data...)
HTH,
Paul

Since your input does not match your regex, I can't really help you. But I can tell you that this is doing a lot of backtracking--and that's bad if there is a lot of data after the part that you're interested in.
So here are some alternative ideas:
qr{ \s # a single space
failures=(\d+) # the entry for failures
\s+ # at least one space
skipped=\d+ # skipped
\s+
eventDelta=(.\d+)
.*? # any number of non-newline characters *UNTIL* ...
\btracking_id="(\d+)" # another specified sequence of data
}x;
The parser will scan "skipped=" followed by a group of digits much faster than it can scan the rest of the line and then backtrack to 'eventDelta=' when it fails. If you know the field will always be there, it is better to spell it out.
Since your example doesn't include tracking_id, I can't tell where it occurs, so in this case I used a non-greedy "match anything", which will always be looking for the next sequence. Again, if there is a lot of data in the line, you do not want to scan to the end and backtrack until you find that you've already read 'tracking_id="nnn"'. However, lookaheads cost processing time, so it is still better to spell out 'skipped=' and all possible values than to use a non-greedy "match anything".
You'll also notice that after accepting any data, I specify that tracking_id should appear at a word boundary, which disambiguates it from the possible, though unlikely, 'backtracking_id='.

Related

Perl one-liner: deleting a line with pattern matching

I am trying to delete bunch of lines in a file if they match with a particular pattern which is variable.
I am trying to delete a line which matches with abc12, abc13, etc.
I tried writing a C-shell script, and this is the code:
#!/bin/csh
foreach $x (12 13 14 15 16 17)
perl -ni -e 'print unless /abc$x/' filename
end
This doesn't work, but when I use the one-liner without a variable (abc12), it works.
I am not sure if there is something wrong with the pattern matching or if there is something else I am missing.
Yes, it's the fact you're using single quotes. It means that $x is being interpreted literally.
Of course, you're also doing it very inefficiently, because you're processing each file multiple times.
If you're looking to remove lines abc12 to abc17 you can do this all in one go:
perl -n -i.bak -e 'print unless m/abc1[234567]/' filename
Try this
perl -n -i.bak -e 'print unless m/abc1[2-7]/' filename
using the range [2-7] only removes the need to type [234567] which has the effect of saving you three keystrokes.
man 1 bash: Pattern Matching
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched.
A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

Change formatting on paragraphs, with perl

I have a number of paragraphs that have returns at the end of a line. I do not want returns at the end of lines, I will let the layout program take care of that. I would like to remove the returns, and replace them with spaces.
The issue is that I do want returns in between paragraphs. So, if there is more than one return in a row (2, 3, etc) I would like to keep two returns.
This would allow for there to be paragraphs, with one blank line between them, but all other formatting for lines would be removed. This would allow the layout program to worry about the line breaks, and not have the breaks determined by a set number of characters, as they are now.
I would like to use Perl to accomplish this change, but am open to other methods.
example text:
This is a test.
This is just a test.
This too is a test.
This too is just a test.
would become:
This is a test. This is just a test.
This too is a test. This too is just a test.
Can this be done easily?
Using a perl one-liner. Replace 2 or more newlines with just 2. Strip all single newlines:
perl -0777 -pe 's{(\n{2})\n*|\n}{$1//" "}eg' file.txt > newfile.txt
Switches:
-0777: Slurps the entire file
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
I came up with another solution and also wanted to explain what your regex was matching.
Matt@MattPC ~/perl/testing/8
$ cat input.txt
This is a test.
This is just a test.
This too is a test.
This too is just a test.
another test.
test.
Matt@MattPC ~/perl/testing/8
$ perl -e '$/ = undef; $_ = <>; s/(?<!\n)\n(?!\n)/ /g; s/\n{2,}/\n\n/g; print' input.txt
This is a test. This is just a test.
This too is a test. This too is just a test.
another test. test.
I basically just wrote a perl program and mashed it into a one-liner. It would normally look like this.
# First two lines read in the whole file
$/ = undef;
$_ = <>;
# This regex replaces every `\n` by a space
# if it is not preceded or followed by a `\n`
s/(?<!\n)\n(?!\n)/ /g;
# This replaces every two or more \n by two \n
s/\n{2,}/\n\n/g;
# finally print $_
print;
perl -p -i -e 's/(\w+|\s+)[\r\n]/$1 /g' abc.txt
Part of the problem here is what you are matching. (\w+|\s+) matches one or more word characters, which is the same as [a-zA-Z0-9_], OR one or more whitespace characters, which is the same as [\t\n\f\r ].
This wouldn't match your input, since you aren't matching periods, and no line consists of only white space or only characters (even the blank lines would need two whitespace characters to match it, since we have [\r\n] at the end). Plus, neither would match a period.

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?
Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } { print }'
Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it didn't make any odds (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } { print }'
pattern # pattern #replace# pattern
$
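The BEGIN-block variant mentioned above might look like the following sketch. The `replace_third` name is made up for this example, and the explicit `{ print }` is there so that every line, changed or not, is written out:

```shell
# Set OFS once, rewrite field 3 only when it matches, print every line.
replace_third() {
  awk -F'#' 'BEGIN { OFS = "#" } $3 ~ /pattern/ { $3 = "replace" } { print }'
}

echo "pattern # pattern # pattern # pattern" | replace_third
```

Assigning to `$3` makes awk rebuild the record with OFS, which is why the spaces around the third field disappear while the other fields keep theirs.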
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-# characters, a #, another field of non-# characters and remembers the lot; it looks for a #, the pattern (which must be in the third field since the first two fields have been matched already), another #, and then the residue of the line. When the line matches, then it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field, and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
Perl editing the third field. Note that the default output is $_ which had to be reassembled from the auto-split fields in the array @F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", @F; ' "$@"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array @F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil, the @ or $, changes depending on whether you're working with a value from the array or the array as a whole). The =~ is a match operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, and '-suffix' and a space. The $_ = join "#", @F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.
Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#[^#]*\) - gather everything before the pattern, including the 2nd # and any text in the 3rd field before the match - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2
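A quick sketch of the three-piece idea on a line where the search text appears in every column. The leading `^` is an addition made here (an assumption about intent, not part of the command above) so that the two skipped fields are always the first two. The `replace_col3` name and the sample words are made up:

```shell
# \1 = first two fields plus any text in field 3 before the match,
# \2 = the rest of field 3 and everything after it.
replace_col3() {
  sed 's/^\([^#]*#[^#]*#[^#]*\)animal\([^#]*#.*\)/\1mammal\2/'
}

echo 'animal # animal # animal # animal' | replace_col3
```

Only the third column changes. As written, the expression needs a separator after the third field, so a line whose third column is the last one would be left alone.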
This might work for you:
echo "0001 # fish # animal # eats worms" |
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it. s/#/&\n/2;s/#/\n&/3
Save the line in the hold space. h
Delete the fields either side s/\n#.*//;s/.*\n//
Now process the field i.e. change all a's to b's. y/a/b/
Now append the original line. G
Substitute the new field for the old field (also removing any newlines). s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. That in step 4 the pattern space only contains the defined field, so any number of commands may be carried out here and the result will not affect the rest of the line.

How to reformat a source file to go from 2 space indentations to 3?

This question is nearly identical to this question except that I have to go to three spaces (company coding guidelines) rather than four and the accepted solution will only double the matched pattern. Here was my first attempt:
:%s/^\(\s\s\)\+/\1 /gc
But this does not work because four spaces get replaced by three. So I think that what I need is some way to get the count of how many times the pattern matched "+" and use that number to create the other side of the substitution but I feel this functionality is probably not available in Vim's regex (Let me know if you think it might be possible).
I also tried doing the substitution manually by replacing the largest indents first and then the next smaller indent until I got it all converted but this was hard to keep track of the spaces:
:%s/^ \(\S\)/ \1/gc
I could send it through Perl as it seems like Perl might have the ability to do it with its Extended Patterns. But I could not get it to work with my version of Perl. Here was my attempt with trying to count a's:
:%!perl -pe 'm<(?{ $cnt = 0 })(a(?{ local $cnt = $cnt + 1; }))*aaaa(?{ $res = $cnt })>x; print $res'
My last resort will be to write a Perl script to do the conversion but I was hoping for a more general solution in Vim so that I could reuse the idea to solve other issues in the future.
Let vim do it for you?
:set sw=3<CR>
gg=G
The first command sets the shiftwidth option, which is how much you indent by. The second line says: go to the top of the file (gg), and reindent (=) until the end of the file (G).
Of course, this depends on vim having a good formatter for the language you're using. Something might get messed up if not.
Regexp way... Safer, but less understandable:
:%s#^\(\s\s\)\+#\=repeat(' ',strlen(submatch(0))*3/2)#g
(I had to do some experimentation.)
Two points:
If the replacement starts with \=, it is evaluated as an expression.
You can use many things instead of /, so / is available for division.
The perl version you asked for...
From the command line (edits in-place, no backup):
bash$ perl -pi -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
(in-place, original backed up to "YOUR_FILE.bak"):
bash$ perl -pi.bak -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
From vim while editing YOUR_FILE:
:%!perl -pe 's{^((?:  )+)}{"   " x (length($1)/2)}e'
The regex matches the beginning of the line, followed by (the captured set of) one or more "two space" groups. The substitution pattern is a perl expression (hence the 'e' modifier) which counts the number of "two space" groups that were captured and creates a string of that same number of "three space" groups. If an "extra" space was present in the original it is preserved after the substitution. So if you had three spaces before, you'll have four after, five before will turn into seven after, etc.

How to find rows which has a certain char less than a certain count?

I am trying to write a shell/perl command which will give me the numbers of the rows that have fewer than a certain number of fields.
E.g. I have a comma-delimited text file. I am trying to find the rows which have fewer than, say, 15 fields. So I guess the problem essentially boils down to returning rows which have fewer than 14 commas.
Can anyone help me with that?
Thanks!
You can do this easily in bash by calling awk. This sort of script is exactly what awk was designed to do.
awk -F, '{ if (NF < 15 ) print NR "," $0 }' fileToTest
-F, tells awk to split each line on the comma char, and NF (Number_of_Fields) indicates how many fields were split in each line. Change the 15 value as needed to help you validate your files.
Don't forget that CSV files may have commas embedded inside the fields if the field is surrounded by quotes, i.e.
fld1, "text for, fld2", fld3, fld4,....
Solving that problem is significantly harder. Use a tab char to separate your fields (or some other character you can be sure will never appear in your data), and then sleep easy at night ;-)
I hope this helps.
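A minimal sketch of the same awk test with the threshold passed in as a parameter. The `short_rows` name and the three sample rows are made up, and plain unquoted comma-separated data is assumed:

```shell
# Print "row number,row" for every row with fewer than $1 fields.
short_rows() { awk -F, -v min="$1" 'NF < min { print NR "," $0 }'; }

printf '%s\n' 'a,b,c' 'a,b' 'a,b,c,d' | short_rows 3
```

Only the second row has fewer than three fields, so it is the only one reported.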
Cute version
perl -lne 'print if tr/,// < 14'
tr/x// is a Perl idiom for counting the number of xes in a string.
More flexible version
perl -F, -lane 'print if @F < 15'
-a enables "autosplit mode", -F sets the delimiter to comma, and the code in the -e says to print if there are fewer than 15 fields. This is nice if you eventually want to do something else with the contents of the fields, since they're available in @F already split on comma.
Properly CSV version
Doesn't make a nice one-liner, but you might consider using Text::xSV or Text::CSV_XS if your data is really CSV and not merely "comma separated" — the difference is that CSV can contain embedded commas, newlines, and other weird things by using quoted fields.
You also asked for Perl. This is not the only way and it assumes the commas are always field delimiters–
perl -ne 'print "$.: $_" if 15 > split/,/' my-comma-file.txt