Extract the part enclosed by a predefined multiline character sequence - perl

Hope the AWK gurus can provide a solution to my problem.
I have a file that goes like this:
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat dog pat ate cat dog
I have to use AWK to extract the text between the first occurring c and a matching d. Starting from the first c, a count should be kept of the number of c's and d's, such that when the counts match, the part between the first c and the matched d should be output to a file, including the number of the line on which the matching d occurred.
In this particular example the match occurs on the seventh dog, therefore the output will have to be:
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat d
The match can go beyond just two lines! The output may or may not include the c and the d. The text contains all kinds of characters, including special ones!
The print should occur only when the counts match.
Thanks in advance for the replies. Suggestions are always welcome.
EDIT: Capturing the text between c and d can be sacrificed, as long as the condition is met and the line number of the closing d is obtained :)

A few tips, without giving the full solution:
By default, awk considers each line as a record. The default record separator is RS="\n".
Depending on your version of awk (GNU awk supports this), you may be able to set RS, the record separator, to a regex which matches either c or d. Then, for each record, you can check the RT variable, which will contain either c or d, depending on what has actually been matched. Starting from there, with a variable that is incremented on c and decremented on d, you will be able to find the end of the match when it reaches 0.
You can then use a variable that contains your match so far, and keep concatenating RT and the new record to it, until you're done.
If you need to know the line number of the end of the match, you can set RS to a regex that matches c or d, as previously, but also matches \n. By maintaining another counter variable that is incremented every time RT tells you that \n has been matched, you'll have your line number.
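The counter idea from the tips above can also be sketched without RS or RT, for awk versions that lack them. This is only a sketch, not the RS-based technique itself: it walks each line character by character, assumes c and d are single characters, and wraps the awk program in a shell function whose name (first_balanced) is made up for illustration.

```shell
# first_balanced: read text on stdin, print everything from the first "c"
# up to the "d" that balances the count, then the line number of that "d".
first_balanced() {
    awk '
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)
            if (!started && ch != "c") continue   # skip text before the first c
            started = 1
            buf = buf ch                          # the match so far
            if (ch == "c") depth++
            else if (ch == "d") depth--
            if (depth == 0) {                     # counts are equal: done
                print buf
                print "line " NR
                exit
            }
        }
        if (started) buf = buf "\n"               # the match continues on the next line
    }'
}

printf '%s\n' \
    'cat cat cat cat cat cat dog rat ate dog tit' \
    'dog cat dog dog dog rat dog pat ate cat dog' | first_balanced
```

On the sample from the question this stops at the seventh dog, on line 2.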

Here's a sed solution just for fun:
sed -rne ':r;$!{N;br};s/^[^c]*(.*d)[^d]*$/\1/;:a;h;s/[^cd]//g;' \
-e ':s;s/d(.*)c/c\1d/;ts;s/cd/c\nd/;T;y/c/d/;/^(d+)\n\1$/{g;i -------' \
-e 'p};g;s/d[^d]*d$/d/;ta'
This prints all satisfying sequences from longest to shortest.

Related

Delete string after '#' using sed

I have a text file that looks like:
#filelists.txt
a
# aaa
b
#bbb
c #ccc
I want to delete everything from '#' to the end of each line and then, if a line started with '#' (so that nothing is left of it), delete the whole line.
So I use 'sed' command in my shell:
sed -e "s/#*//g" -e "/^$/d" filelists.txt
I want the result to be:
a
b
c
but the actual result is:
filelists.txt
a
aaa
b
bbb
c ccc
What's wrong with my "sed" command?
As far as I know, '*' means "any", so I thought that '#*' means "the string after #".
Doesn't it?
You may use
sed 's/#.*//;/^$/d' file > outfile
The s/#.*// removes # and all the rest of the line and /^$/d drops empty lines.
See an online test:
s="#filelists.txt
a
# aaa
b
#bbb
c #ccc"
sed 's/#.*//;/^$/d' <<< "$s"
Output:
a
b
c
Another idea: match lines having #, then remove # and the rest of the line there and drop if the line is empty:
sed '/#/{s/#.*//;/^$/d}' file > outfile
This way, you keep the original empty lines.
* does not mean "any" (at least not in regular expression context). * means "zero or more of the preceding pattern element". Which means you are deleting "zero or more #". Since you only have one #, you delete it, and the rest of the line is intact.
You need s/#.*//: "delete # followed by zero or more of any character".
EDIT: was suggesting grep -v, but didn't notice the third example (# in the middle of the line).
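The difference between the two patterns can be checked side by side. A small sketch on the sample from the question; the two function names are made up for illustration:

```shell
sample='#filelists.txt
a
# aaa
b
#bbb
c #ccc'

# "#*" is "zero or more # characters": only the # signs themselves disappear.
strip_hash_chars() { sed -e 's/#*//g' -e '/^$/d'; }

# "#.*" is "a # followed by zero or more of any character": the comment goes too.
strip_comments() { sed -e 's/#.*//' -e '/^$/d'; }

printf '%s\n' "$sample" | strip_hash_chars
echo '---'
printf '%s\n' "$sample" | strip_comments
```

Note that the last line keeps the space that preceded the #, since only the text from # onwards is removed.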

how to delete lines connected with "+" signs with sed

In this example, a "+" sign at the start of a line means that the line continues the previous one. I would like to delete a specific group of lines that are connected by "+".
For example, I'd like to remove the 1st through 4th lines (.groupA ~ + G H I). Please help me do it with sed.
To delete the line containing .groupA and all consecutive +-prefixed lines after it, one easy-to-understand approach is:
sed '/\.groupA/,/^[^+]/ { /\.groupA/d; /\.groupA/!{/^+/d} }' file
We first select everything between .groupA and the first non +-prefixed line (inclusive), then for that selection of lines, we delete the first line (containing .groupA), and of the remaining lines, we delete all with + prefix.
Note that . is a regex metacharacter and must be escaped to match it literally. In sed's default basic regular expressions a plain + is already literal; \+ is a GNU extension meaning "one or more", so /^+/ is the portable way to match a leading plus sign.
A slightly more advanced, but more elegant approach (the starting block pattern is used only once) uses a loop to skip the first line of the matched block and all the following lines that start with +:
sed -n '/\.groupA/ { :a; n; s/^+//; ta }; p' file
IMHO this is more readily done with awk, but kindly just ignore if that is not an option for you.
So, every time I see a line starting with .groupA, I set a flag d to say I am deleting, and then skip to the next line. If I see a line starting with a + and I am currently deleting, I skip to the next line. If I see anything else, I change the flag to say I am no longer deleting and print the line:
awk '/^\.groupA/ {d=1; next}
/^\+/ && d==1 {next}
{d=0; print}' file
Sample Output
** Example **
abcdef ghijkl
.groupB abc def
+ JKL
+ MNO
+ GHI
opqrst vwxyz
You can cast it as a one-liner like this:
awk '/^\.groupA/{d=1; next} d==1 && /^\+/ {next} {d=0;print}' file
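To compare the flag approach in awk with the loop approach in sed, here is a small sketch that runs both on the same data. The input is an assumption (the example from the question is not shown, so it is reconstructed loosely from the sample output), and the function names are made up for illustration.

```shell
# Assumed input: a .groupA block (lines 1-4) followed by lines that must survive.
input='.groupA abc def
+ ABC
+ DEF
+ G H I
abcdef ghijkl
.groupB abc def
+ JKL
+ MNO
+ GHI
opqrst vwxyz'

# awk: a flag d records whether we are inside the block being deleted.
drop_group_awk() {
    awk '/^\.groupA/ {d=1; next}
         /^\+/ && d==1 {next}
         {d=0; print}'
}

# sed: after the .groupA line, keep reading lines while they start with "+".
drop_group_sed() {
    sed -n '
/^\.groupA/{
:a
n
s/^+//
ta
}
p'
}

printf '%s\n' "$input" | drop_group_awk
```

Both functions should leave the .groupB block untouched, because its + lines are only skipped while the flag (or the loop) is active.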

how to use sed to print line #6 from a file, but only if any other line in the file matches a pattern

and, in a more generic way, is it possible to use program sed to print any line matching PATTERN 1, but only if any other line in the file matches PATTERN 2? It can be done with a combination of grep commands, but I am trying to get it done with a single sed command.
This is NOT a job for sed:
awk 'NR==FNR{if (/PATTERN2/) f=1; next} f && (FNR==6)' file file
awk 'NR==FNR{if (/PATTERN2/) f=1; next} f && /PATTERN1/' file file
or if you don't want to specify the file name twice:
awk 'BEGIN{ARGV[ARGC]=ARGV[ARGC-1]; ARGC++} NR==FNR{if (/PATTERN2/) f=1; next} f && /PATTERN1/' file
I believe it's possible:
:l1 {
/foo/ { H }
/bar/ { x ; s/^\n//; p ; s/.*//; h ; b l2}
n
b l1
}
:l2 {
/foo/ { p }
n
b l2
}
Quick overview:
l1 is our initial loop. It checks for /foo/ (pattern 1). If it is found on a line, that line is APPENDED to the hold space.
The next line checks for /bar/. When it is found, we exchange the hold space and the pattern space (x), remove the initial newline from the data (it is there because we used H in the first line), print the data, empty it, and store it back in the hold space (so that it is empty). Then we branch to l2, in effect leaving the loop l1.
If the line matches neither pattern 1 (foo) nor pattern 2 (bar), we go to the next line and jump back to the start of l1 again.
Once we are in l2, we check for pattern 1 /foo/. Since we KNOW that we have found pattern 2 earlier (otherwise we wouldn't be here), we can safely print this data. If the line is not foo, we just skip it and loop back to the start of l2.
Pretty much tested this with the following data:
a
b
c foo
d
e foo
f bar
g foo
h
i foo
j
k
Depending on "bar" being there, it will either print all lines with foo, or nothing at all.
Granted, it will not win any beauty contests, but it's written in sed only.
Here's a sed script that prints lines matching pattern1, if there exists a line matching pattern2, regardless of the order of pattern1, pattern2:
#n
:loop
/foo/H
/bar/{
g
s/\n//
/foo/p
:loop2
n
/foo/p
b loop2
}
n
b loop
If you save this into a file like s.sed, you can do
sed -f s.sed file
The #n works the same way as -n, meaning it suppresses automatic printing of the pattern space. The loop appends any lines matching foo (pattern1) to the hold space. When it encounters bar (pattern2), it gets the contents of the hold space (wiping out the current pattern space) with the g command. It removes the first newline (as the H command adds a newline even when the hold space is empty). It prints out the pattern space if it contains foo (meaning it's not empty). Then the n goes to the next line. Now that we have matched pattern2, we can safely print all matches of foo by starting loop2.
This might work for you (GNU sed):
sed -n ':a;6H;/pattern/{z;H};n;$!ba;x;s/\n//;s///p' file
Turn off automatic printing of the pattern space with the -n option. Set up a loop that reads every line of the file and appends to the hold space line 6 (when it is reached) and an empty line (denoting that a match on pattern has occurred). At the end of the file, swap to the hold space, remove the ever-present first newline (if a line or an empty line has been appended), then remove a second newline and print the result if that succeeds.
N.B. If pattern exists in the file the hold space will contain two newlines either the first two characters or the first and the last characters.
There is no way you can do this in the general case without parsing the file twice. This means invoking sed twice.
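A two-pass version might look like the sketch below. It is not taken from the answer: the function name print_if_present and its argument order are made up for illustration, and the patterns are assumed to contain no / characters.

```shell
# print_if_present TRIGGER MATCH FILE
# Print the lines of FILE matching MATCH, but only if some line matches TRIGGER.
print_if_present() {
    # Pass 1: stop at the first line matching the trigger and report it.
    if [ -n "$(sed -n "/$1/{s/.*/found/p;q;}" "$3")" ]; then
        # Pass 2: print every line matching the second pattern.
        sed -n "/$2/p" "$3"
    fi
}

tmp=$(mktemp)
printf '%s\n' '1 aaa' '2 match' '3 match' '4 ddd' '5 trigger' > "$tmp"
print_if_present trigger match "$tmp"
rm -f "$tmp"
```

Because the file is read twice, the order of the trigger line and the matching lines does not matter, at the cost of needing a real file rather than a pipe.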
If the line(s) matching the "trigger pattern" always occur after all occurrences of the line(s) matching the "match pattern", then this might do it for you:
$ cat testdata
1 aaa
2 match
3 match
4 ddd
5 trigger (only print line 2 and 3 if this line is present)
$ sed -n -e '/match/H' -e '/trigger/{x;^Mp;^M;q^M}' testdata
2 match
3 match
(the ^M in there are verbatim newlines)
I'm not sure how to delete the initial empty line in the output. Pointers about this are welcomed.
UPDATE: I put a final q (quit) at the end of the command sequence for the "trigger pattern" just to make sure that further trigger patterns later in the file wouldn't screw up the output.

Sed: backreference substring from line matching regexp in 's' command

I'm experimenting with sed and recently I've noticed interesting behavior. However, I'm unable to find any documentation that describes it.
Imagine that we have file called 'sedtest':
$cat sedtest
hello 0 world
example
4 sed
Phone number: 123-456-789
Next, I'll run it through sed:
$cat sedtest | sed '/\([[:digit:]]\+\)/s,,(\1),'
hello (0) world
example
(4) sed
Phone number: (123)-456-789
That was fairly easy to understand sed script:
First, it matches string by regexp \([[:digit:]]\+\), which means "match string that contains 1 or more digits". Notice that I also use s-command-style \( and \) parentheses to mark substring here (is it allowed?).
In case of match it proceeds with s command s,,(\1), (with empty regexp field), that means "replace matched substring with (\1)".
Initially I thought that it should fail with error, because \1 and similar backreferences should work only for substrings from s command matcher field, which is empty in this case.
But the result is as if it was s,\([[:digit:]]\+\),(\1), script (\regexp\ matcher moved inside s command matcher field)!
So, the question is: is it normal (i.e. it is desired behavior) to backreference substrings of text matched by \regexp\ rule from s//replace/ command as if they were matched by s/regexp/replace/ command?
P.S.
My sed version is: GNU sed 4.2.1
And motivation behind question is that way you can do something like:
sed '/^Number: \([[:digit:]]\+\)$/{s,,#NUMBER: (\1),;p;d};q 1', i.e.
/^Number: \([[:digit:]]\+\)$/ - match every string of kind Number: 12345 and in case of match:
s,,#NUMBER: (\1), - replace it with #NUMBER: (12345)
p - print it
d - clear pattern space, and start new cycle (fetch new line and start parsing script expression from beginning)
q 1 - exit with code 1. This command is executed only if no match occurred in step 1 (because of d command presence) - it checks for 'not matched' case, which in my situation means 'not allowed string' and must result in error.
Main trick here was executing p and d commands after substitution took place, which is not possible when using 'normal' s/match/replace/ command.
It's normal. An empty regex means "reuse the last regular expression that was applied", and that includes its capture groups. Since the regex for s is empty, sed reuses the address regex, so \1 refers to the group captured there. You can see the difference when s has its own regex:
$ sed '/\([[:digit:]]\+\)/s,\(a\),(\1),' sedtest
hello 0 world
example
4 sed
Phone number: 123-456-789
Nothing matched: none of the lines with digits contains an a, and \1 now belongs to the new regex rather than to the address.
$ sed '/\([[:digit:]]\+\)/s,\(e\),(\1),' sedtest
h(e)llo 0 world
example
4 s(e)d
Phon(e) number: 123-456-789
e matched and that becomes the back reference.
If you don't want this behaviour, you shouldn't create the back reference by putting \( \) around [[:digit:]] in the first place.
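The equivalence can be checked directly. In this sketch the two function names are made up for illustration, and [[:digit:]][[:digit:]]* is used instead of the GNU-only \+ so that it also runs on other sed implementations:

```shell
# Address regex with a group, then an empty regex in s: the address is reused.
via_address() { sed '/\([[:digit:]][[:digit:]]*\)/s//(\1)/'; }

# The same regex written out in the s command itself.
via_s_only() { sed 's/\([[:digit:]][[:digit:]]*\)/(\1)/'; }

printf '%s\n' 'hello 0 world' 'example' '4 sed' 'Phone number: 123-456-789' | via_address
printf '%s\n' 'hello 0 world' 'example' '4 sed' 'Phone number: 123-456-789' | via_s_only
```

Both commands should print the same four lines, with the first run of digits on each line wrapped in parentheses.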

Get the n-th range by pattern

My input is like this:
start
content A
end
garbage
start
content B
end
I want to extract the second (or first, or third ...) start .. end block. With
sed -ne '/start/,/end/p'
I can filter out the garbage, but how do I get just "start content B end"?
But anyway, if you want sed - you get sed:)
/^start$/{
x
s/^/a/
/^aaa$/{
x
:loop
p
/^end$/q
n
bloop
}
x
}
The number of a's in the middle match determines which segment you get. You could also write it as a regexp repetition, as Dennis noted. That approach allows passing the number directly to the script.
Note: the script should be run with -n sed option.
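The repetition variant could be wrapped like this. It is only a sketch: the function name nth_block is made up, and the block number is spliced into a basic-regex interval \{N\}, so the argument is assumed to be a plain positive integer.

```shell
# nth_block N: print the N-th start..end block read from stdin.
nth_block() {
    sed -n '
/^start$/{
    x
    s/^/a/
    /^a\{'"$1"'\}$/{
        x
        :loop
        p
        /^end$/q
        n
        bloop
    }
    x
}'
}

printf '%s\n' start 'content A' end garbage start 'content B' end | nth_block 2
```

The hold space still counts blocks with one a per start line; only the test for "this is the wanted block" changes.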
Get all ranges
$ awk 'BEGIN{RS="end";FS="start"}{ print $NF}' file
content A
content B
Get 2nd range
$ awk 'BEGIN{RS="end";FS="start"}{c++; if (c==2) print $NF}' file
content B
Ruby(1.9+), get first range
$ ruby -0777 -ne 'puts $_.scan(/start(.*?)end/m)[0]' file
content A