Wrong line numbers with multiline grep -P (Perl extension) match

I have been using grep with the Perl extension (-P) for multiline matching. However, it turns out that the line number reported for every match depends on the number of lines in the first multiline match!
The grep regex to find the start of a C function:
grep -iPn '^[^\S\n]*?\w+\s+\w+?\s*\([\w-0-9,/* \s]*\)\s*\{$'
I can explain better with an example:
Suppose these two functions exist in the source file:
int f1(int a) {
int
f2 (int b )
{
In this case, grep matches the regex successfully and the line numbers written to stdout agree with the line numbers in the source file.
The problem arises when a multiline function comes first. This shifts the reported line numbers, and after examining the file for some time I came to a conclusion: the multiline functions are matched, but grep treats each match as a single line and so assigns the whole function start a single line number. Every line that follows is then numbered too low, by the number of extra lines the 'function definition start' regex spans.
There are numerous multiline C functions in my file, and the reported line numbers drift further with each one.
Is there a way to correct this?

Using pcregrep with the same regex shows the correct line numbers!
pcregrep -Mni '^[^\S\n]*?\w+\s+\w+?\s*\([\w-0-9,/* \s]*\)\s*\{$'

Your solution doesn't work with my (Linux, GNU) grep.
I need to add -z to make it work. With -z, "lines" are separated by a NUL character, so the whole file becomes a single "line". Line numbers therefore won't work, but byte offsets (-b) do. Sometimes that is enough.
So, using grep:
grep -ziPbo '^[^\S\n]*?\w+\s+\w+?\s*\([\w-0-9,/* \s]*\)\s*\{$'
obtaining
24:int
f2 (int b )
{
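If a line number is still needed, the byte offset printed by -b can be converted back by counting the newlines that precede it. A minimal sketch, assuming head -c is available (GNU and BSD head both provide it); offset_to_line is a hypothetical helper name, not something grep offers:

```shell
# offset_to_line FILE OFFSET
# Print the 1-based number of the line containing byte OFFSET (0-based,
# as printed by grep -b) by counting the newlines that precede it.
offset_to_line() {
    echo $(( $(head -c "$2" "$1" | wc -l) + 1 ))
}
```

Called as offset_to_line yourfile.c 24 (yourfile.c standing in for the real source file), it reports the line on which the match starting at byte 24 begins.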


Extract filename from multiple lines in unix

I'm trying to extract the name of a file that has been generated by a Java program. This Java program prints multiple lines, and I know exactly what the format of the file name is going to be. The output of the Java program is as follows:
ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;
I'm capturing the output in a variable and then applying a sed operator in the following format:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p'
The result set is:
YANNANI-0008876_17.xml.
However, my problem is that I want the extraction of the filename to stop at .xml. The last dot should never be extracted.
Is there a way to do this using sed?
Let's look at what your capture group actually captures:
$ grep 'YANNANI.\([[:digit:]]\).\([xml]\)*' infile
Generated file YANNANI-0008876_17.xml.
That's probably not what you intended:
\([[:digit:]]\) captures just a single digit (and the capture group around it doesn't do anything)
\([xml]\)* is "any of x, m or l, 0 or more times", so it matches the empty string (as above – or the line wouldn't match at all!), x, xx, lll, mxxxxxmmmmlxlxmxlmxlm, xml, ...
There is no way the final period is removed because you don't match anything after the capture groups
What would make sense instead:
Match "digits or underscores, 0 or more": [[:digit:]_]*
Match .xml, literally (escape the period): \.xml
Make sure the rest of the line (just the period, in this case) is matched by adding .* after the capture group
So the regex for the string you'd like to extract becomes
$ grep 'YANNANI.[[:digit:]_]*\.xml' infile
Generated file YANNANI-0008876_17.xml.
and to remove everything else on the line using sed, we surround regex with .*\( ... \).*:
$ sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p' infile
YANNANI-0008876_17.xml
This assumes you really meant . after YANNANI (any character).
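Since the question captures the program output in a variable, the same substitution can be applied to that variable directly. A small sketch under that assumption; the variable names output and filename are made up for illustration:

```shell
# Stand-in for the captured output of the Java program.
output='ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;'

# Keep only the captured file name; lines without a match print nothing.
filename=$(printf '%s\n' "$output" |
    sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p')
echo "$filename"
```

The trailing .* after the capture group is what swallows the final period.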
You can call sed twice: first in printing and then in replacement mode:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p' | sed 's/\.$//g'
The second sed removes the trailing . from every line printed by the first sed.
Or you can go for an awk solution if you prefer:
awk '/.*YANNANI.[0-9]+.[0-9]+.xml/{print substr($NF,1,length($NF)-1)}'
This prints the last field of every line that matches the regex, using substr to drop its final character.

How to use sed to isolate only the first part of a file

I'm running Windows and have the GnuWin32 toolkit, which includes sed. Specifically:
C:\TEMP>sed --version
GNU sed version 4.2.1
I have a text file with two sections: A fixed part I want to preserve, and a part that's appended after running a job.
In the file is a unique string that identifies the start of the part that's added, and I'd like to use Gnu sed to isolate only the part of the file that's before the unique string - i.e., so I can append different data to the fixed part each time the job is run.
I know I could keep the fixed portion in a separate file, but that adds complexity and it would be more elegant if I could just reuse the data at the start of the same file.
A long time ago I knew how to set up sed scripts, and I'm sure this can be done with sed, but I've slept since then. :)
Can you please describe how to use sed to display the lines of text in a file up to and not including a specific string?
Example:
line 1 of fixed portion
line 2 of fixed portion
unique string
line 1 of appended portion
line 2 of appended portion
line 3 of appended portion
What I'd like is to see as output:
line 1 of fixed portion
line 2 of fixed portion
I've gotten as far as:
sed -r -n -e "0,/unique string/p"
but that prints the unique string as well.
Thanks in advance.
-Noel
This should work for you:
sed -n '/unique string/q;p' file
It quits processing at unique string. Other lines get printed.
An alternative might be to use a range address like this:
sed -n '1,/unique string/{/unique string/!p}' file
Note that sed includes the range border. We need to exclude unique string from printing.
Furthermore I'm using the -n option which makes sed suppress the output of input lines by default.
One caveat: if the unique string can contain characters that are also regex syntax characters, like ...
test*
... sed might not be the right tool for the job any more, since it can only match regular expressions, not fixed strings.
In that case awk might be the tool of choice:
awk 'index($0, "unique string"){exit}1' file
index($0, "string") returns a non-zero value (the position) if the string is found in the current line. In that case we stop processing further input and don't print that line either.
The trailing 1 always evaluates to true and makes awk print all the lines until the previous condition applies.
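To avoid editing the awk program each time, the marker can be passed in with -v. A hedged sketch: before_marker is a hypothetical wrapper name, and note that awk interprets backslash escapes in -v values, so a marker containing backslashes would need extra care:

```shell
# before_marker MARKER: copy stdin to stdout, stopping before the first
# line that contains MARKER as a fixed string (no regex interpretation).
before_marker() {
    awk -v marker="$1" 'index($0, marker) { exit } 1'
}

printf '%s\n' 'line 1 of fixed portion' 'line 2 of fixed portion' \
    'test*' 'line 1 of appended portion' | before_marker 'test*'
```

Because index() does a plain substring search, the * in test* is matched literally rather than as a repetition operator.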

Sed to replace variable length string between 2 known patterns

I'd like to be able to replace a string between 2 known patterns. The catch is that I want to replace it by a string of the same length that is composed only of 'x'.
Let's say I have a file containing:
Hello.StringToBeReplaced.SecondString
Hello.ShortString.SecondString
I'd like the output to be like this:
Hello.xxxxxxxxxxxxxxxxxx.SecondString
Hello.xxxxxxxxxxx.SecondString
Using sed loops
You can use sed, though the thinking required is not wholly obvious:
sed ':a;s/^\(Hello\.x*\)[^x]\(.*\.SecondString\)/\1x\2/;t a'
This is for GNU sed; BSD (Mac OS X) sed and other versions may be fussier and require:
sed -e ':a' -e 's/^\(Hello\.x*\)[^x]\(.*\.SecondString\)/\1x\2/' -e 't a'
The logic is identical in both:
Create a label a
Substitute the lead string and a sequence of x's (capture 1), followed by a non-x, and arbitrary other data plus the second string (capture 2), and replace it with the contents of capture 1, an x and the content of capture 2.
If the s/// command made a change, go back to the label a.
It stops substituting when there are no non-x's between the two marker strings.
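To see the loop at work, it can help to apply the substitution once, then twice, then until nothing changes. This is only an illustration on a shortened sample string (Hello.abc.SecondString is made up for the purpose):

```shell
step='s/^\(Hello\.x*\)[^x]\(.*\.SecondString\)/\1x\2/'

# One pass replaces exactly one character; the loop repeats until no non-x is left.
one=$(echo 'Hello.abc.SecondString' | sed -e "$step")
two=$(echo 'Hello.abc.SecondString' | sed -e "$step" -e "$step")
all=$(echo 'Hello.abc.SecondString' | sed -e ':a' -e "$step" -e 't a')
printf '%s\n' "$one" "$two" "$all"
```

The three lines printed are Hello.xbc.SecondString, Hello.xxc.SecondString and Hello.xxx.SecondString: each pass turns the first remaining non-x character into an x.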
Two tweaks to the regex allow the code to recognize two copies of the pattern on a single line. Lose the ^ that anchors the match to the beginning of the line, and change .* to [^.]* (so that the regex is not quite so greedy):
$ echo Hello.StringToBeReplaced.SecondString Hello.StringToBeReplaced.SecondString |
> sed ':a;s/\(Hello\.x*\)[^x]\([^.]*\.SecondString\)/\1x\2/;t a'
Hello.xxxxxxxxxxxxxxxxxx.SecondString Hello.xxxxxxxxxxxxxxxxxx.SecondString
$
Using the hold space
hek2mgl suggests an alternative approach in sed using the hold space. This can be implemented using:
$ echo Hello.StringToBeReplaced.SecondString |
> sed 's/^\(Hello\.\)\([^.]\{1,\}\)\(\.SecondString\)/\1#\3##\2/
> h
> s/.*##//
> s/./x/g
> G
> s/\(x*\)\n\([^#]*\)#\([^#]*\)##.*/\2\1\3/
> '
Hello.xxxxxxxxxxxxxxxxxx.SecondString
$
This script is not as robust as the looping version but works OK as written when each line matches the lead-middle-tail pattern. It first splits the line into three sections: the first marker, the bit to be mangled, and the second marker. It reorganizes that so that the two markers are separated by #, followed by ## and the bit to be mangled. h copies the result to the hold space. Remove everything up to and including the ##; replace each character in the bit to be mangled by x, then copy the material in the hold space after the x's in the pattern space, with a newline separating them. Finally, recognize and capture the x's, the lead marker, and the tail marker, ignoring the newline, the # and ## plus trailing material, and reassemble as lead marker, x's, and tail marker.
To make it robust, you'd recognize the pattern first and group the commands shown inside { and } so they're only executed when the pattern is recognized:
sed '/^\(Hello\.\)\([^.]\{1,\}\)\(\.SecondString\)/{
s/^\(Hello\.\)\([^.]\{1,\}\)\(\.SecondString\)/\1#\3##\2/
h
s/.*##//
s/./x/g
G
s/\(x*\)\n\([^#]*\)#\([^#]*\)##.*/\2\1\3/
}'
Adjust to suit your needs...
Adjusting to suit your needs
[I tried one of your solutions and it worked fine.]
However when I try to replace the 'hello' by my real string (which is
'1.2.840.') and my second string (which is simply a dot '.'), things stop
working. I guess all these dots confuse the sed command.
What I try to achieve is transform this '1.2.840.10008.' to
'1.2.840.xxxxx.'
And this pattern happens several times in my file with variable number
of characters to be replaced between the '1.2.840.' and the next dot '.'
There are times when it is important to get your question close enough to the real scenario, and this may be one such. Dot is a metacharacter in sed regular expressions (and in most other dialects of regular expression, shell globbing being the notable exception). If the 'bit to be mangled' is always digits, then we can tighten up the regular expressions, though actually (when I look at the code ahead) the tightening really isn't imposing much in the way of a restriction.
Pretty much any solution using regular expressions is a balancing act that has to pit convenience and abbreviation against reliability and precision.
Revised code plus data
cat <<EOF |
transform this '1.2.840.10008.' to '1.2.840.xxxxx.'
OK, and hence 1.2.840.21. and 1.2.840.20992. should lose the 21 and 20992.
EOF
sed ':a;s/\(1\.2\.840\.x*\)[^x.]\([^.]*\.\)/\1x\2/;t a'
Example output:
transform this '1.2.840.xxxxx.' to '1.2.840.xxxxx.'
OK, and hence 1.2.840.xx. and 1.2.840.xxxxx. should lose the 21 and 20992.
The changes in the script are:
sed ':a;s/\(1\.2\.840\.x*\)[^x.]\([^.]*\.\)/\1x\2/;t a'
Add 1\.2\.840\. as the start pattern.
Revise the 'character to replace' expression to 'not x or .'.
Use just \. as the tail pattern.
You could replace the [^x.] with [0-9] if you're sure you only want digits matched, in which case you won't have to worry about spaces as discussed below.
You may decide you don't want spaces to be matched so that a casual comment like:
The net prefix is 1.2.840. And there are other prefixes too.
does not end up as:
The net prefix is 1.2.840.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
In which case, you probably need to use:
sed ':a;s/\(1\.2\.840\.x*\)[^x. ]\([^ .]*\.\)/\1x\2/;t a'
And so the changes continue until you've got something precise enough to do what you want without doing anything you don't want on your current data set. Writing bullet-proof regular expressions requires a precise specification of what you want matched, and can be quite hard.
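As a sketch of the digits-only variant mentioned earlier (replacing [^x.] with [0-9], and restricting the tail to digits as well, which is an extra assumption about the data), the casual-comment case is left alone while a real identifier is still masked:

```shell
# Only runs of digits after the prefix are replaced; "1.2.840. And" is untouched.
masked=$(echo 'The net prefix is 1.2.840. And 1.2.840.10008. is masked.' |
    sed -e ':a' -e 's/\(1\.2\.840\.x*\)[0-9]\([0-9]*\.\)/\1x\2/' -e 't a')
echo "$masked"
```

The -e form is used so the same command also suits the fussier sed versions discussed above.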
I'd choose perl:
perl -pe 's/(?<=Hello\.)(.*?)(?=\.SecondString)/ "x" x length($1) /e' file
This awk should do:
awk -F. '{for (i=1;i<=length($2);i++) a=a"x";$2=a;a=""}1' OFS="." file
Hello.xxxxxxxxxxxxxxxxxx.SecondString
Hello.xxxxxxxxxxx.SecondString
Bash Works Too
While the perl, sed and awk solutions are probably the better choice, a Bash solution is not that difficult (just longer). Bash has good character-by-character handling abilities as well:
#!/bin/bash
rep=0   # replace flag
skip=0  # delay reset flag
while IFS= read -r line; do               # read each line, keeping whitespace
    for ((i=0; i<${#line}; i++)); do      # for each character in the line
        c=${line:i:1}
        # if '.' and replace on, turn off and set skip
        [[ $c == '.' && $rep -eq 1 ]] && { rep=0; skip=1; }
        # print char or "x" depending on replace flag
        if [[ $rep -eq 0 ]]; then printf '%s' "$c"; else printf 'x'; fi
        # if '.' and replace off
        if [[ $c == '.' && $rep -eq 0 ]]; then
            # if skip, turn skip off, else set replace on
            [[ $skip -eq 1 ]] && skip=0 || rep=1
        fi
    done
    printf "\n"
done
exit 0
Input
$ cat dat/replacefile.txt
Hello.StringToBeReplaced.SecondString
Hello.ShortString.SecondString
Output
$ bash replacedot.sh < dat/replacefile.txt
Hello.xxxxxxxxxxxxxxxxxx.SecondString
Hello.xxxxxxxxxxx.SecondString
For the sake of your sanity, just use awk:
$ awk 'BEGIN{FS=OFS="."} {gsub(/./,"x",$2)} 1' file
Hello.xxxxxxxxxxxxxxxxxx.SecondString
Hello.xxxxxxxxxxx.SecondString

How to find patterns across multiple lines using perl

I want to grep some strings spread across multiple lines within some begin and end pattern.
Example:
MediaHelper->fetchStrings( names => [ //Here new line may or may not be
**'ubp-firstrun_heading',
'firstrun_text',
'_firstrun-or-start_search',
'installed'** //may end here also );
]);
Using perl or grep, how can I get the list of 4 strings here? The begin pattern is MediaHelper->fetchStrings(names => [ and the end pattern is );
Or are there any other suggestions using commands like sed or awk?
Try this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ p' <yourfile>
Or, if you want to skip the delimiting lines, this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ {/MediaHelper->fetchStrings( names =>/b; /^);/b; p}' <yourfile>
If I understand your question, you need to match all strings in all lines (and not just the MediaHelper thing).
If this is the case, then sed is the right tool, because it is by default line-oriented.
In our case, if you want to match the string in every line:
sed "s/.*\('.*'\).*/\1/" <your_file>
Hope it helps
Edit: To be more descriptive, first we need to match the whole line (that's the first and the last .*) and then we enclose in parentheses the part of the line we want to print, which in our case is everything inside single quotes. The \1 before the last delimiter means that we want to print the first (and here also the only) parenthesized group.
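The range address and the quote-extracting substitution from the two sed answers can also be combined, so that only quoted strings inside the begin/end block are printed. A minimal sketch, assuming one quoted string per line as in the question's example; the variable names are illustrative:

```shell
# Sample input mirroring the question's layout.
input="MediaHelper->fetchStrings( names => [
'ubp-firstrun_heading',
'firstrun_text',
'_firstrun-or-start_search',
'installed'
]);"

# Inside the begin/end range, print only what sits between the single quotes.
strings=$(printf '%s\n' "$input" |
    sed -n "/MediaHelper->fetchStrings(/,/);/ s/.*'\(.*\)'.*/\1/p")
printf '%s\n' "$strings"
```

Lines in the range that contain no quoted string, such as the delimiting lines, are skipped because -n suppresses default output and p only fires when the substitution succeeds.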
Just process the file in slurp mode instead of line by line:
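perl -0777 -ne 'print $1 while m{MediaHelper->fetchStrings\(\s*names\s*=>\s*\[(.*?)\]}gs' file
Explanation:
Switches:
-0777: Slurp mode: read the whole file at once instead of line by line
-n: Creates a while(<>){..} loop around the code; in slurp mode each iteration reads one whole input file.
-e: Tells perl to execute the code on the command line.
The ( after fetchStrings is escaped so that it matches a literal parenthesis, and the s modifier lets . match newlines so that the capture can span lines.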

sed scripting : how to search after n no. of lines?

How can I capture the first occurrence of a pattern using grep after 'n' lines in a large file?
For instance,
I have 1000 lines of code in which 'wire' occurs both before and after the 451st line.
I want to grab the first occurrence of 'wire' after the 451st line.
You can use sed's range expressions to perform this task easily. For example:
sed -n '452,$ { /wire/ {p;q} }' /tmp/foo
This will skip the first 451 lines, then scan each line until EOF for "wire". When found, it will print the pattern space and then quit.
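The same idea can be written in awk when the line count and pattern need to be parameters. This is a sketch of an alternative, not the answer's method; first_after is a hypothetical helper name, and the pattern is treated as a regular expression:

```shell
# first_after N PATTERN: read stdin and print the first line after line N
# that matches PATTERN, prefixed with its line number, then stop reading.
first_after() {
    awk -v n="$1" -v pat="$2" 'NR > n && $0 ~ pat { print NR ": " $0; exit }'
}

printf '%s\n' 'wire a' 'reg x' 'wire b' 'wire c' | first_after 1 wire
```

For the question's case it would be called as first_after 451 wire < /tmp/foo.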