Get rid of repeated fields in a log line - powershell

I have logs lines in which the information fields are repeated, the first time they are separated by a comma and a space, the second time they are separated by a semicolon, I want to get rid of their second occurrence, the word (SECOND) is not in the log, I put it there to make it more clear
targets:somehost state:Memory\Buffers=398672, Memory\Cached=4620216, Memory\MemFree=833748, Memory\MemTotal=8001352 (SECOND) Memory\Buffers=398672;Memory\Cached=4620216;Memory\MemFree=833748;Memory\MemTotal=8001352 type:Unix Resources
I Was thinking in using replace.
%{$_ -replace "Memory\\Buffers=([0-9]+);Memory\\Cached=([0-9]+);Memory\\MemFree=([0-9]+);Memory\\MemTotal=([0-9]+)",""}
but the log has a lot more fields, that I din't put in here to make it more readable.
is there a better way to do this?

Assuming all your log lines follow the pattern of your sample line, and assuming that all fields following state: are repeated, you can use the following regex (using a simplified input string, in which the fields Me\Bu=398672 and Me\Ca=4620216 are repeated):
'a:b c:Me\Bu=398672, Me\Ca=4620216 Me\Bu=398672;Me\Ca=4620216 d:e f' | % {
$_ -replace '[^ ]+;[^ ]+ '
}
The above yields:
a:b c:Me\Bu=398672, Me\Ca=4620216 d:e f
[^ ]+; matches the first field in the second occurrence of the fields list.
[^ ]+ matches all remaining fields in the second list.

You can remove everything starting from (and including) the second occurrence of Memory\Buffers= with a positive look-behind assertion ((?<=...)):
$string -replace '(?<=Memory\\Buffers=.*?)\s*Memory\\Buffers.*$'
As you've found, with a positive look-ahead assertion ((?=...)) you can then specify where the second sequence stops:
$string -replace '(?<=Memory\\Buffers=.*?)\s*Memory\\Buffers.*(?=\stype:)'

Related

How to find and replace with sed, except when between curly braces?

I have a command like this, it is marking words to appear in an index in the document:
sed -i "s/\b$line\b/\\\keywordis\{$line\}\{$wordis\}\{$definitionis\}/g" file.txt
The problem is, it is finding matches within existing matches, which means its e.g. "hello" is replaced with \keywordis{hello}{a common greeting}, but then "greeting" might be searched too, and \keywordis{hello}{a common \keywordis{greeting}{a phrase used when meeting someone}}...
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
Curley brackets in this case will always appear on the same line.
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
First tokenize input. Place something unique, like | or byte \x01 between every \keywordis{hello}{a common greeting} and store that in hold space. Something along s/\\the regex to match{hello}{a common greeting}/\x01&\x01/g'.
Ten iterate over elements in hold space. Use \n to separate elements already parsed from not parsed - input from output. If the element matches the format \keywordis{hello}{a common greeting}, just move it to the front before the newline in hold space, if it does not, perform the replacement. Here's an example: Identify and replace selective space inside given text file , it uses double newline \n\n as input/output separator.
Because, as you noted, replacements can have overlapping words with the patterns you are searching for, I believe the simplest would be after each replacement shuffling the pattern space like for ready output and starting the process all over for the current line.
Then on the end, shuffle the hold space to remove \x01 and newline and any leftovers and output.
Overall, it's Latex. I believe it would be simpler to do it manually.
By "eating" the string from the back and placing it in front of input/output separator inside pattern space, I simplified the process. The following program:
sed '
# add our input/output separator - just a newline
s/^/\n/
: loop
# l1000
# Ignore any "\keywords" and "{stuff}"
/^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
s//\3\1\n\2/
b loop
}
# Replace hello followed by anthing not {}
# We match till the end because regex is greedy
# so that .* will eat everything.
/^\([^\n]*\)\n\(.*\)hello\([{}]*\)$/{
s//\\keywordis{hello}{a common greeting}\3\1\n\2/
b loop
}
# Hello was not matched - ignore anything irrelevant
# note - it has to match at least one character after newline
/^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
s//\3\1\n\2/
b loop
}
s/\n//
' <<<'
\keywordis{hello}{hello} hello {some other hello} another hello yet
'
outputs:
\keywordis{hello}{hello} \keywordis{hello}{a common greeting} {some other hello} another \keywordis{hello}{a common greeting} yet

Sed seems to replace only the last occurrence in global string substitution

I use this command, but it's not working ad intended:
echo "0+223+141+800+450+1*(106+400)+1*(1822+500)+1*(183+400)" | sed 's/\*\(.*\)+/*\1suma/g'
This is the expected output:
0+223+141+800+450+1*(106suma400)+1*(1822suma500)+1*(183suma400)
but this is what I get:
0+223+141+800+450+1*(106+400)+1*(1822+500)+1*(183suma400)
It looks like only the last occurrence is being replaced, despite the use of g.
Try the following:
echo "0+223+141+800+450+1*(106+400)+1*(1822+500)+1*(183+400)" |
sed 's/\(\*([^+]*\)+/\1suma/g'
which yields:
0+223+141+800+450+1*(106suma400)+1*(1822suma500)+1*(183suma400)
The trick is to avoid sed's invariably greedy matching, so expression [^+]* is used instead of .*, so as to only match up to the next +.
Note that your attempt didn't only replace the last occurrence of your intended pattern, but - due to greedy matching - found only 1 match spanning multiple intended patterns, which it replaced:
\*\(.*\)+ matched *(106+400)+1*(1822+500)+1*(183+ - everything from the first * literal to the last + literal, and capture group \1 therefore expanded to (106+400)+1*(1822+500)+1*(183

meaning of the following regular expressions written in perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken form file, in effect this code is for go through lines in file.
Questions:
I don't clearly understand what the condition in while is doing. I think it is trying to match group of \ followed by some number of white spaces at the end of line and loop should stop whenever a line ends with \ and may be some white spaces. I am not sure of it.
I came across statement $a ~= s/^(.*$)/$1/ . What I understand that ^ will force matching at the beginning of string, but in (.*$) would mean match all the characters at the end of string . Dose it mean that the statement is trying to find if any group of character at the end is same as group of character in the beginning of text ?
It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches end of line with optional newline.
Since you do not have the /g modifier, this regex will not iterate through matches, it will simply check if there is a match, resetting the regex each iteration, thereby causing an endless loop.
The statement
$a ~= s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.
1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will have the while make only one loop, due to the $ - making a match unique (whatever matches up to the end of string is unique, as $ marks the end, and there is nothing after...).
2. $a ~= s/^(.*$)/$1/
This is a substitution. If the string ^.*$ matches (and it will, since ^.*$ matches (almost, see comment) anything) it is replaced with... $1 or what's inside the (), ie itself, since the match occurs from 1st char to the end of string
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.
it matches a literal backslash followed by 0 or more spaces followed by the end of the line.
it executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).
It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.
(\\\s*)$ this regex means --- a \ followed by zero or more number of white space characters which is followed by end of the line. Since you have your regex in (...), you can extract what you matched using $1, if you need.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a ~= s/^(.$)/$1/ --- this is search and replace. So your regex matches a line which contains exactly one character (since you use . http://www.regular-expressions.info/dot.html), except a new-line character. Since you use (...), the character which matched the regex is extracted and stored in variable a
EDIT -- you changed your regex so here is the updated answer
$a ~= s/^(.*$)/$1/ -- same as above except now it matches zero or more characters (except new-line)

Decipher this sed one-liner

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.
I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]]
The code doesn't work for me, perhaps due to some locale setting, but this does:
vvv
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
^^^
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while ($my pattern = <>) {
chomp $pattern;
$pattern = "$pattern\n$hold"; # G
$pattern =~ s/(\n)/$1$1/; # s/\n/&&/
if ($pattern =~ /^([^\n]*\n).*\n\1/) { # /…/
next; # d
}
$pattern =~ s/\n//; # s/\n//
$hold = $pattern; # h
$pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceeded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match subsequent and non-subsequent duplicates the same, see the next step.
^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including newline repeated again (\1). If two subsequent lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.
For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short _[$0] is a counter (of appearance) for each unique line, for any line ($0) appearing for the second time _[$0] >= 1, thus !_[$0] evaluates to false, causing it not to be printed except its first time appearance.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?
Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; }'
Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it didn't make any odds (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; }'
pattern # pattern #replace# pattern
$
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-at-signs, an at-sign, another field of non-at-signs and remembers the lot; it looks for an at-sign, the pattern (which must be in the third field since the first two fields have been matched already), another at-sign, and then the residue of the line. When the line matches, then it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field, and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
Perl editing the third field. Note that the default output is $_ which had to be reassembled from the auto-split fields in the array #F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", #F; ' "$#"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array #F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil — the # or $ — changes depending on whether you're working with a value from the array or the array as a whole. The =~ is a match operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, and '-suffix' and a space. The $_ = join "#", #F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.
Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#\[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#\[^#]*\) - gather everything before the pattern, including the 3rd # and any text before the math - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2
This might work for you:
echo 0001 # fish # animal # eats worms|
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it. s/#/&\n/2;s/#/\n&/3
Save the line in the hold space. h
Delete the fields either side s/\n#.*//;s/.*\n//
Now process the field i.e. change all a's to b's. y/a/b/
Now append the original line. G
Substitute the new field for the old field (also removing any newlines). s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. That in step 4 the pattern space only contains the defined field, so any number of commands may be carried out here and the result will not affect the rest of the line.