Matching line subfield delimited with square brackets - perl

I have a file whose lines contain fields delimited with square brackets, for example:
[tag "x"][severity "y"][id "z"][client 1]
I need to extract the data from the client field. But I am struggling with the best way to do this. Obviously it's too advanced for the likes of cut.
I have been struggling to use sed (and I'm not even sure sed is the "best" or "most appropriate" tool), but sed regex like this doesn't seem to work :
sed 's/^.*\[client\(.*\)/\1/g'
I'm guessing the "most appropriate" tool is probably Perl with some sort of Perl module ?

In Perl, you can capture each bracket contents like so:
$ perl -lne 'print $1 while /(?<=\[)([^\]]+)(?=\])/g' file
tag "x"
severity "y"
id "z"
client 1
So then if you only want the client match you can do:
$ perl -lne 'for (/(?<=\[)([^\]]+)(?=\])/g) { print if /^client\b/ }' file
client 1
As pointed out in comments, /\[([^\]]+)\]/g is maybe a little more efficient.
$ perl -lne 'for (/\[([^\]]+)\]/g) { print if /^client\b/}' file
client 1

You don't show your expected output, so this is a guess based on what the script you posted appears to be attempting. Is this what you want?
$ sed 's/.*\[client *\([^]]*\).*/\1/g' file
1
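To see why the sed expression from the question falls short, it can help to run both substitutions side by side. The sketch below assumes the sample line from the question; the variable name line is only for illustration.

```shell
# Sample line from the question (the variable name is illustrative).
line='[tag "x"][severity "y"][id "z"][client 1]'

# Original attempt: \(.*\) is greedy, so it also captures the leading
# space and the closing bracket.
printf '%s\n' "$line" | sed 's/^.*\[client\(.*\)/\1/'

# Revised pattern: skip the spaces after "client", then capture only
# characters that are not a closing bracket.
printf '%s\n' "$line" | sed 's/.*\[client *\([^]]*\).*/\1/'
```

The first command leaves " 1]" behind, while the second prints just the value.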

I would use tr -d.
echo '[tag "x"][severity "y"][id "z"][client 1]' | tr -d '[]'
tag "x"severity "y"id "z"client 1
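Deleting the brackets leaves the fields run together, so the client value still has to be isolated. One possible variation, sketched here with a hypothetical helper named bracket_field, turns each bracket into a newline and lets sed pick the wanted field by name. It assumes the field name contains no regex metacharacters.

```shell
# bracket_field NAME: read bracketed fields on stdin and print the value
# of the field called NAME. The function name is made up for this sketch.
bracket_field() {
  # Turn every [ and ] into a newline so each field sits on its own line,
  # then print only the line that starts with NAME, minus the name itself.
  tr '[]' '\n\n' | sed -n "s/^$1 //p"
}

echo '[tag "x"][severity "y"][id "z"][client 1]' | bracket_field client
```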

echo '[tag "x"][severity "y"][id "z"][client 1]' | awk -F'[][]+' '{print $5}'
client 1
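The separator [][]+ treats any run of brackets as a single delimiter. Because the line starts with [, the first field is empty, which is why client 1 lands in $5. Relying on the position is fragile if the fields can be reordered, so a lookup by name is one alternative. This is only a sketch, and the helper name client_value is hypothetical.

```shell
# client_value: print the value of the "client" field wherever it
# appears on the line (hypothetical helper for illustration).
client_value() {
  awk -F'[][]+' '{
    for (i = 1; i <= NF; i++) {
      # Keep only the field whose text begins with "client ".
      if (index($i, "client ") == 1) {
        print substr($i, length("client ") + 1)
      }
    }
  }'
}

echo '[client 1][tag "x"][severity "y"][id "z"]' | client_value
```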

Related

Can grep or sed show only words that match multiple search patterns in a line?

I am wondering, if one can print the matched strings as it is in each line... using grep or sed?
TestCase1: File1 contains below text
The Sun
Thunder The Rain They say
They say The dance
If I use this command:
egrep -o 'The|They' File1
The output I get is:
The
The
They
They
The
But, my expected output should be as below:
The
The They
They The
I am aware that in grep the option -o, --only-matching prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
Edit: Please also suggest, if one wants to have a filter with exact word match with multiple match strings
i.e. <The> and <They> exact word match? Space separated words simply.
TestCase2: File2 contains below text
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Output is:
The
The They
They The
the
The the they
How to approach this?
With GNU awk for FPAT:
$ awk -v FPAT='\\<[Tt]hey?\\>' '{$1=$1}1' file
The
The They
They The
They the
The the they
Note that this cannot avoid matching They when it appears in They're. If that's really an issue and you want to look for space-separated complete strings, then this might be what you want:
$ awk '{c=0; for (i=1;i<=NF;i++) if ($i ~ /^[Tt]hey?$/) printf "%s%s", (c++?OFS:""), $i; print ""}' file
The
The They
They The
the
The the they
If not, let us know.
The above was run against this iteration of the OP's posted sample input:
$ cat file
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Best do it with Perl:
~$ perl -nE 'say /They? /g' File1
The
The They
They The
EDIT : Add new conditions. The regex still matches all but the lowercase the. Adding the i flag makes the match case-insensitive and matches all your test strings.
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
There is a little bit of a trick here: the match also picks up the space after the ? and prints it in the output. E.g. the first line of output is really: "The_\n" - where "_" = space character. This may or may not be acceptable. One way to remove the spaces and reassemble the string would be:
$ perl -nE 'say join " ", map {substr $_,0,-1} /They? /ig' File1
As to your question about matching full words <The> and <They>, as you put it, the ? in They? indicates that the 'y' is optional. I.e. matches 0 or 1 times. Therefore the pattern is considering 'The' and 'They' as full words, one or the other, followed by a space. You could rewrite the pattern as:
$ perl -nE 'say /(?:They|The) /ig' File1
And effect the same output.
Now that you are considering lowercase the you may run into more edge case "gotchas" like words that end in "the". "loathe" and "tythe" come to mind.
$ echo "I'm loathe to cringe and tythe socks" >> File1
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
the the <--- not wanted!
You can then add the \b test in to match on word boundaries (as in zdim's answer):
$ perl -nE 'say /\bThey? /ig' File1
The
The They
They The
the
The the they
<-- But you get this empty line where no match occurs
So to refine further, you could only print if the line matches. Like this:
$ perl -nE 'say /\bThey? /ig if /\bThey? /i' File1
The
The They
They The
the
The the they
Then, I'm sure, you can find more edge cases that will blow it all up and force further refinement.
Things are not fully specified so here are a couple of possibilities
To catch all words starting with The, and print them with a space in between
perl -wnE'say join " ", /\bThe\w*/g' file
where \b is a word-boundary, a zero-width anchor, and \w is a word character. Using \S (a non-space character) is yet more permissive.
For only The or They, you can instead use
perl -wnE'say join " ", /\bThey?\b/g' file
where y? makes y optional.
To allow the as well use [tT] instead of T in the pattern, or /i for either case for all chars.
It's been clarified in comments that punctuation after The|They isn't allowed, and that lowercase t is. Then we need to constrain the match by space, not word boundary, and use [tT] as mentioned
perl -wnE'say join " ", /\b([Tt]hey?)\s/g' file
Now the capturing parentheses () are needed since \s does consume, unlike \b before.
This prints the desired output with the provided input.
awk to the rescue!
$ awk -v p="They?" '$0~p{for(i=1;i<=NF;i++) if($i~p) printf "%s",$i OFS; print ""}' file
The
The They
They The
try one more awk:
awk '{while(match($0,/The|They/)){string=substr($0,RSTART,RLENGTH);VAL=VAL?VAL OFS string:string;$0=substr($0,RSTART+RLENGTH+1);};print VAL;VAL=""}' Input_file
The same solution in multi-line form:
awk '{
while(match($0,/The|They/)){
string=substr($0,RSTART,RLENGTH);
VAL=VAL?VAL OFS string:string;
$0=substr($0,RSTART+RLENGTH+1);
};
print VAL;
VAL=""
}
' Input_file
Will add the explanation shortly for same.

How to modify the matched pattern

Just wondering if there is a handy way to modify matched pattern variable in Perl one liner. For instance in the string abcdef I'd like to replace def with e (output abce) using a command looking like this :
echo "abcdef" | perl -pne 's/(def)/{command that trims first and last character of $1 and returns it as a string for perl to use it as a replacement}/'
It would be easy to use such functionality to perform various formatting tasks. Can we do this in sed?
This is easy in Perl with the /e flag:
echo 'abcdef' | perl -pe 's/(def)/substr $1, 1, -1/e'
The /e flag tells perl to parse the replacement part as a block of code, not a string. You can put arbitrary code in there.
But your concrete task (trimming the first and last character) can also be done like this:
echo 'abcdef' | perl -pe 's/d(e)f/$1/'
(Also, perl -p already implies -n. No need to specify both.)
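The question also asks about sed. POSIX sed has no counterpart to Perl's /e flag, so arbitrary code cannot run in the replacement, but the capture-group form of this particular task carries over directly. A minimal sketch:

```shell
# Keep only the middle character of "def" by capturing it.
echo 'abcdef' | sed 's/d\(e\)f/\1/'

# More generally: drop the first and last character of a matched run,
# here a run made of the letters d, e and f.
echo 'abcdef' | sed 's/[def]\([def]*\)[def]/\1/'
```

Both commands print abce for this input.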

grep or awk - how to return line if column 1 and 3 have the same value

I have a tab delimited file and I want the output to have the entire line in my file if values in column 1 are the same as the values in column 3. Having very limited knowledge in perl and linux, this is as close as I came to a solution.
File example
Apple Sugar Apple
Apple Butter Orange
Raisins Flour Orange
Orange Butter Orange
The results would be:
Apple Sugar Apple
Orange Butter Orange
Code:
#!/bin/sh
awk '{
prev=$0; f1=$1; f3=$3;
getline
if ($1 == $3) {
print prev
print
}
}' myfilename
I am sure that there is an easier solution to it. Maybe even a grep or awk on the command line. But that was the only code I could find that seemed to give me my solution.
Thanks!
It's easy with awk:
awk '$1 == $3' myfile
The default action is to print out the record, so if fields 1 and 3 are equal, that's what will happen.
Using awk
awk is the tool for the job:
awk '$1 == $3'
If your fields in the data are strictly tab separated and may contain blanks, then you will need to specify the field separator explicitly:
awk -F'\t' '$1 == $3'
(where the \t represents a tab; you may have to type Tab (or even Control-V Tab) to get it into the string).
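To illustrate why the explicit separator matters, here is a small sketch with made-up data in which one field contains a blank. The function name same_first_third is hypothetical.

```shell
# same_first_third: print tab-separated records whose first and third
# fields are identical (the function name is made up for this sketch).
same_first_third() {
  awk -F'\t' '$1 == $3'
}

# Hypothetical data: the first record has a blank inside two fields.
printf 'Red Apple\tSugar\tRed Apple\nApple\tButter\tOrange\n' | same_first_third
```

With the default separator, awk would split the first record on the blank as well and compare Red with Sugar, so the record would not be printed.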
Using grep
You can do it with grep, but you don't want to do it with grep:
grep -E '([A-Za-z]+)\t[A-Za-z]+\t\1'
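The key part of the regex is the \1, which means 'the same value as the first captured string'. POSIX extended regular expressions do not define \t, so with grep -E a literal tab generally has to be typed in its place.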
You might even go through gyrations like this in bash:
grep -E $'([A-Za-z]+)\t[A-Za-z]+\t\\1'
You could simplify life by noting (assuming) there are no spaces within fields:
grep -E '([A-Za-z]+)[[:space:]]+[A-Za-z]+[[:space:]]+\1'
As noted in one of the comments, I didn't put a $ at the end of the search pattern; it would be feasible (though the data would have to be cleaned up to contain tabs and drop trailing blanks), so that 'Good Noise GoodBad' would not be picked up. There are other ways to do it, and you can make the regex more and more complex to handle more possible situations. But those only go to emphasize that the awk solution is better; awk deals with the details automatically.
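One way to apply the anchoring discussed above is to put a literal tab into the pattern through a shell variable and to pin both ends of the line. This is a sketch that assumes strictly tab-separated data with exactly three fields; back-references in grep -E are a common extension rather than something POSIX guarantees, and the function name is hypothetical.

```shell
# grep_same_first_third: anchored variant of the grep approach
# (hypothetical helper; assumes exactly three tab-separated fields).
grep_same_first_third() {
  tab=$(printf '\t')
  # ^ and $ pin the match to the whole line, so a third field that merely
  # starts with the first one (GoodBad vs Good) is rejected.
  grep -E "^([^${tab}]+)${tab}[^${tab}]+${tab}\\1\$"
}

printf 'Apple\tSugar\tApple\nGood\tNoise\tGoodBad\n' | grep_same_first_third
```

Only the Apple record is printed; the awk version remains the simpler choice.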
Using grep:
grep -P "([^\t]+)\t[^\t]+\t\1" inFile

Remove from the beginning till certain part in a string

I work with strings like
abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf
and I need to get a new one where I remove in the original string everything from the beginning till the last appearance of "_" and the next characters (can be 3, 4, or whatever number)
so in this case I would get
_adf
How could I do it with "sed" or another bash tool?
Regular expression pattern matching is greedy. Hence ^.*_ will match all characters up to and including the last _. Then just put the underscore back in:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*_/_/'
sed -E 's/^(.*)_([^_]*)$/_\2/' < input.txt
Do you need to modify the string, or just find everything after the last underscore? The regex to find the last _{anything} would be /(_[^_]+)$/ ($ matches the end of the string), or if you also want to match a trailing underscore with nothing after it, /(_[^_]*)$/.
Unless you really need to modify the string in place instead of just finding this piece, or you really want to do this from the command line instead of a script, this regex is a bit simpler (you tagged this with perl, so I wasn't sure quite how committed to using just the command line as opposed to a simple script you were).
If you do need to modify the string in place, use sed -E -i 's/.*(_[^_]+)$/\1/' myfile. The -E flag enables extended regular expressions, so the parentheses form a capture group, and the leading .* consumes everything before the last underscore. The -i flag overwrites the old file with the new one (BSD and macOS sed expect a backup suffix argument after -i, which may be empty). If you want to create a new file and not clobber the old one, use sed -E 's/.../.../' oldfile > newfile. A g after the s/// would replace every match on each line rather than only the first; because this pattern is anchored with $, it can match at most once per line, so g is not needed.
If the string is not by itself at the end of the line, but rather embedded in other text and separated by whitespace, replace the $ with \s, which will match a whitespace character (the end of a word).
If you have strings like these in bash variables (I don't see that specified in the question), you can use parameter expansion:
s="abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf"
t="_${s##*_}"
echo "$t" # ==> _adf
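The ## operator removes the longest prefix matching the pattern *_, which is why only the text after the last underscore survives. For comparison, a short sketch of the neighbouring operators (the variable names last, rest and head are illustrative):

```shell
s="abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf"

last="_${s##*_}"   # longest prefix removed: text after the last _
rest="${s#*_}"     # shortest prefix removed: text after the first _
head="${s%_*}"     # shortest suffix removed: text before the last _

echo "$last"
echo "$rest"
echo "$head"
```

Because this is plain parameter expansion, no external process is started, which can matter inside a loop.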
In Perl, you could do this:
my $string = "abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf";
if ( $string =~ m/(_[^_]+)$/ ) {
print $1;
}
[Edit]
A Perl one liner approach (ie, can be run from bash directly):
perl -lne 'm/(_[^_]+)$/ && print $1;' infile > outfile
Or using substitution:
perl -pe 's/.*(_[^_]+)$/$1/' infile > outfile
Just group the last non-underscore characters preceded by the last underscore with \(_[^_]*\), then reference this group with \1:
sed 's/^.*\(_[^_]*\)$/\1/'
Result:
$ echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*\(_[^_]*\)$/\1/'
_adf
A Perl way:
echo 'abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf' | \
perl -e 'print ((split/(_)/,<>)[-2..-1])'
output:
_adf
Just for fun:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | tr _ '\n' | tail -n 1 | rev | tr '\n' _ | rev

How do I print word after regex but not a similar word?

I want an awk or sed command to print the word after regexp.
I want to find the WORD after a WORD but not the WORD that looks similar.
The file looks like this:
somethingsomething
X-Windows-Icon=xournal
somethingsomething
Icon=xournal
somethingsomething
somethingsomething
I want "xournal" from the one that says "Icon=xournal". This is how far I have come until now. I have tried an AWK string too but it was also unsuccessful.
cat "${file}" | grep 'Icon=' | sed 's/.*Icon=//' >> /tmp/text.txt
But I get both, so the text file contains xournal twice, which I don't want.
Use ^ to anchor the pattern at the beginning of the line. And you can even do the grepping directly within sed:
sed -n '/^Icon=/ { s/.*=//; p; }' "$file" >> /tmp/text.txt
You could also use awk, which I think reads a little better. Using = as the field separator, if field 1 is Icon then print field 2:
awk -F= '$1=="Icon" {print $2}' "$file" >> /tmp/text.txt
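One caveat with splitting on =: if the value itself contains an equals sign, $2 holds only the part up to the next =. A possible variation, assuming such values can occur, prints everything after the first = instead. The helper name icon_value is hypothetical.

```shell
# icon_value: print the value of lines whose key is exactly "Icon"
# (hypothetical helper; reads the named files, or stdin if none given).
icon_value() {
  # Match on the key as before, but print the rest of the line after the
  # first "=" rather than just the second field.
  awk -F= '$1 == "Icon" { print substr($0, index($0, "=") + 1) }' "$@"
}

printf 'X-Windows-Icon=xournal\nIcon=xournal\nIcon=a=b\n' | icon_value
```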
This might be useful even though Perl is not one of the tags.
In case you are interested in Perl, this small program will do the task for you:
#!/usr/bin/perl -w
while(<>)
{
if(/Icon\=/i)
{
print $';
}
}
This is the output:
C:\Documents and Settings\Administrator>io.pl new2.txt
xournal
xournal
explanation:
while (<>) takes the input data from the file given as an argument on the command line while executing.
(/Icon\=/i) is the regex used in the if condition.
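$' holds the part of the line after the match, so print $' outputs it. Note that the unanchored pattern also matches X-Windows-Icon=xournal, which is why xournal appears twice above; anchoring it as /^Icon=/ restricts the match to the wanted line.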
All you need is:
sed -n 's/^Icon=//p' file