Can I use sed if I need to extract a pattern enclosed by a specific pattern, if it exists in a line?
Suppose I have a file with the following lines :
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
In both cases I have to scan the line for the first occurrence of the opening pattern, i.e. '[/' or '/*' respectively, and store the text that follows up to the closing pattern, i.e. '/]' or '*/' respectively.
In short, I need fear and answer. If possible, can this be extended to multiple lines, in the sense that the closing pattern may occur on a different line than the opening one?
Any kind of help in the form of suggestions or algorithms is welcome. Thanks in advance for the replies.

use strict;
use warnings;
while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#g) {
        print "$2\n";
    }
}
__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
As a one-liner:
perl -nlwe 'while (m#/(\*?)(.*?)\1/#g) { print $2 }' input.txt
The inner while loop iterates over all matches on the line, thanks to the /g modifier. The backreference \1 makes sure we only match identical open/close tags.
If you need to match blocks that extend over multiple lines, you need to slurp the input:
use strict;
use warnings;
$/ = undef;
while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#sg) {
        print "$2\n";
    }
}
__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say. /* foofer */
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
foo bar /
baaz / fooz
perl -0777 -nlwe 'while (m#/(\*?)(.*?)\1/#sg) { print $2 }' input.txt
The -0777 switch and $/ = undef will cause file slurping, meaning all of the file is read into a scalar. I also added the /s modifier to allow the wildcard . to match newlines.
Explanation for the regex: m#/(\*?)(.*?)\1/#sg
m# # a simple m//, but with # as delimiter instead of slash
/(\*?) # slash followed by optional *
(.*?) # shortest possible string of wildcard characters
\1/ # backref to optional *, followed by slash
#sg # s modifier to make . match \n, and g modifier
The "magic" here is that the backreference requires a star * only when one is found before it.

Quick and dirty way in awk
awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' input_file
$ cat file
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn't.
$ awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' file

Single-Line Matches
If you really want to do this in sed, you can extract your delimited patterns relatively easily as long as they are on the same line.
# Using GNU sed. Escape a whole lot more if your sed doesn't handle
# the -r flag.
sed -rn 's![^*/]*(/\*?.*/).*!\1!p' /tmp/foo
Multi-Line Matches
If you want to perform multi-line matches with sed, things get a little uglier. However, it can certainly be done.
# Multi-line matching of delimiters with GNU sed.
sed -rn ':loop
/\/[^\/]/ {
T loop
}' /tmp/foo
The trick is to look for a starting delimiter, then keep appending lines in a loop until you find the ending delimiter.
This works really well as long as you really do have an ending delimiter. Otherwise, the contents of the file will keep being appended to the pattern space until sed finds one, or until it reaches the end of the file. This may cause problems with certain versions of sed or with really, really large files where the size of the pattern space gets out of hand.
See GNU sed's Limitations and Non-limitations for more information.


Extract everything between first and last occurrence of the same pattern in a single iteration

This question is very much the same as this except that I am looking to do this as fast as possible, doing only a single pass of the (unfortunately gzip compressed) file.
Given the pattern CAPTURE and input
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I am not quite sure, as regular expressions these days provide lookaheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
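#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    # once the first match has been seen, buffer every line
    push @buf, $line unless $first;
    if ( $line=~/CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        # flush everything buffered up to and including this match
        print @buf;
        @buf = ();
    }
}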
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know the memory complications this will give for a large file, but you would prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
It needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the hold area with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think regex will be faster than double scan...
Here is an awk solution (double scan)
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively if you have any a priori information on where the patterns appear on the file you can have heuristic approaches which will be faster on those special cases.
Here is one more example with a regex (the drawback is that a large file will consume a lot of memory):
local $/ = undef;
open FILE, $ARGV[0] or die "Couldn't open file: $!";
binmode FILE;
$string = <FILE>;
close FILE;
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or with one liner:
cat file.tmp | perl -ne '$/=undef; print $1 if <STDIN> =~ /([^\n]+(CAPTURE).*\2.*?)\n/s'
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the last two previous sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the last three previous sentences.
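As a quick check of the walk-through above, the command can be run on a few made-up lines. This is a sketch that assumes GNU sed (the one-line form with labels separated by semicolons); the sample lines and the function name between_captures are invented for illustration.

```shell
# Hypothetical wrapper around the sed command from the answer (GNU sed assumed).
between_captures() {
    sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb'
}

# Made-up input: two CAPTURE lines with unrelated lines before, between and after.
printf '%s\n' 'before' 'CAPTURE 1' 'middle' 'CAPTURE 2' 'after' | between_captures
```

With this input only the lines from "CAPTURE 1" through "CAPTURE 2" are printed; "before" and "after" are dropped.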
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files. A file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE and a file containing of the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.
While posting this question, the problem I had at hand was that I had several huge gzip compressed log files generated by a java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to the event from these files. The problem became unsolvable with a trivial grep for EventId just due to the fact that the exception lines could be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception and the answers posted here would not print the stacktrace lines. However it wasn't hard to modify haukex's solution to cover these cases as well:
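#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    # a line belongs to the event if it carries the EventId, or if it is a
    # continuation line (no "(AppName)") following a line of the event
    if ( $line=~/EventId/ or ($first==0 and $line!~/\(AppName\)/)) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}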
I am still wondering whether the faster solutions (mainly walter's sed solution or haukex's in-memory perl solution) could be modified to do the same.

Creating CSV of information extracted from filenames in a given format

I have a little script that lists paths to all files in a directory and all subdirectories and parses each path on the list with regex in Perl.
find * -type f | while read j; do
    echo $j | perl -n -e '/\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/ && print "\"0\";\"$1$2$3\";\"$4\";\"$5\";$fl\""' >> bss.csv
    echo | readlink -f -n "$j" >>bss.csv
    echo \">>bss.csv
done
I am using the readlink from GNU coreutils: -n suppresses newline at the end, -f performs canonicalization by recursively following symlinks on the path.
The problem is that when the input string does not match the regex, I get a line containing only the file path.
How can I add a condition so that the path is shown only if the regex matched?
I have racked my brain over various combinations, but haven't found one that works properly.
Description of solution
In Perl, use if (/…/) {…} else {…} instead of /…/ && …. Thus you can execute print if match is successful and some other code otherwise.
If this is not the problem and you only want to get rid of the readlink output and closing quote, you can call readlink from Perl using backticks.
Resulting code
I turned everything into a single Perl program, used File::Find instead of the find command, assumed that $fl at the end of the print in Perl is a relic (and ignored it), and used Cwd::realpath() to find the canonical path of the file instead of readlink -f from GNU coreutils. If you still want to use readlink -f, feel free to change Cwd::realpath($_) to `readlink -f '$_'` (including the backticks!), but then it will not work for filenames containing a single quote.
You should call this script as ./script-name starting-directory > bss.csv. If you put it in the directory you are examining, the output would contain it too, along with the bss.csv.
# Usage: ./$0 [<starting-directory>...]
use strict;
use warnings;
use File::Find;
use Cwd;
no warnings 'File::Find';
sub handleFile() {
    return if not -f;
    if ($File::Find::name =~ /\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/) {
        local $, = ';', $\ = "\n";
        print map "\"$_\"", 0, $1.$2.$3, $4, $5, Cwd::realpath($_);
    } else {
        print STDERR "File $File::Find::name did not match\n";
    }
}
find(\&handleFile, @ARGV ? @ARGV : '.');
For reference I also enclose a polished version of the original program. It calls readlink from Perl as I suggested above and really utilizes the -n option of Perl, avoiding the while read loop.
find . -type f | perl -n -e 'm{/(\d{2})/(\d{2})/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?} && print qq{"0";"$1$2$3";"$4";"$5";"`readlink -f -n '\''$_'\''`"}' > bss.csv
Other remarks to the original code
The echo | before the readlink does nothing and should be removed. Readlink does not read its stdin.
Where does $fl at the end of the print in Perl come from? I assume it is a relic.
Use of generic quotes like qq{} and thoughtful use of delimiters (e.g. in regex matching and other quote-like operators) can save you from quoting hell. I already used this tip above: /…/ → m{…} and "…" → qq{…}. Thx, Slade! See perlop manpage for more info.
If I understand you, you want to capture the following parts of the filename:
~~ ~~ ~ ~~~ ~~~~~~~ ~
1 2 3 4 5 6
But your perl regex doesn't do that. Let's break it apart for better understanding.
Sliced into pieces, this would be...
\/(\d{2}) - a slash then two digits (with the digits captured)
\/(\d{2}) - another slash and two digits
\/(\d+) - one more slash and one or more digits
.*- - any run of characters until the final hyphen in the input string
([a-zA-Z]+) - one or more alpha characters
(?:_(\d{1}))? - an optional underscore followed by a single digit; the outer (?:...) group does not capture, but the inner (\d{1}) still captures the digit as $5
If you step through your filename, you'll see that there is nothing here to handle the second last string of digits.
I'd do this using simpler tools. Sed, for example:
[ghoti@pc ~]$ s="/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
[ghoti@pc ~]$ echo "$s" | sed -rne 's/.*/"&"/;h;s:.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.*:"0";"\1\2\3";"\4";"\6":;G;s/\n/;/;p'
[ghoti@pc ~]$
I'll break up the sed script for easier reading:
s/.*/"&"/; - Put quotes around the filename.
h; - Store the filename in Sed's "hold" space, for future use...
s: - Start the big substitution...
.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.* - This is the pattern we want to match for substitution. Similar to what you did in Perl, obviously, but using ERE instead of PCRE.
:"0";"\1\2\3";"\4";"\6":; - The replacement pattern, with each \N (\1, \2, ...) being replaced by the corresponding bracketed element of the RE. Note that \5 is skipped in the replace string, as that subexpression is only being used for the match.
G; - Append the "hold" space to the pattern space
s/\n/;/; - and remove the newline between them.
p - Print the result.
Note that this solution, as is, assumes that all input lines match the pattern you're looking for. If that's not the case, then you may get unpredictable output, and should put some pattern matching into the script.

Sed: syntax error with unexpected "("

I've got file.txt which looks like this:
And I'm trying to do two things:
Select the lines that have $id_play as 2nd field.
Replace ; with - on those lines.
My attempt:
$result = `sed s@^\([^;]*\);$id_play;\([^;]*\);\([^;]*\);\([^;]*\);\([^;]*\);\([^;]*\)\$@\1-$id_play-\2-\3-\4-\5-\6@g $input`;
And I'm getting this error:
sh: 1: Syntax error: "(" unexpected
You have to escape the @ characters (Perl would otherwise try to interpolate them as arrays inside the backticks), double the backslashes in some cases (thanks ysth!), put single quotes around the sed script, and make it filter the lines as well. So replace it with this:
$result = `sed 's\@^\\([^;]*\\);$id_play;\\([^;]*\\);\\([^;]*\\);\\([^;]*\\);\\([^;]*\\);\\([^;]*\\);\\([^;]*\\)\$\@\\1-$id_play-\\2-\\3-\\4-\\5-\\6-\\7\@g;tx;d;:x' $input`;
PS. What you are trying to do can be achieved in a much cleaner way without calling sed, by using split. For example:
use warnings;
use strict;
my $id_play=3;
my $input="file.txt";
open (my $IN,'<',$input) or die "Cannot open $input: $!";
while (<$IN>) {
    my @row=split/;/;
    print join('-',@row) if $row[1]==$id_play;
}
close $IN;
There is no need to ever call sed from Perl, as the Perl regex engine is already built in and much easier to use. The above answer is perfectly fine. With such a simple dataset, another simple way to do it a little more idiomatically (although maybe a little more obfuscated... then again, that sed command was a little complex in itself!) would be:
use warnings;
use strict;
my $id_play = 3;
my @result = map { s/;/-/g; $_ } grep { /^\w+;$id_play;/ } <DATA>;
print @result;
Assuming the file isn't too terribly large, you can just use grep with a regex to grab the lines you are looking for, and then map with a substitution operator to convert those semicolons to hyphens and store the results in a list that you can then print out. I tested it with the DATA block below the code, but instead of reading in from that block, you would probably read in from your file as normal.
edit: Also forgot to mention that in sed, by default, '(' and ')' are treated as literal characters and not as regex groupings. If you're dead set on sed for such things, use the -r option of sed to have it use those characters in the regex sense.
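The difference between the two sed dialects can be seen side by side in a small sketch. The sample line and the function names are invented for illustration; -E is used here as the portable spelling of GNU sed's -r, and only the first two separators are rewritten to keep the example short.

```shell
id_play=2

# Basic regular expressions (the default): groups are written \( ... \).
dash_bre() {
    sed "s/^\([^;]*\);$id_play;/\1-$id_play-/"
}

# Extended regular expressions (-E, spelled -r in GNU sed): plain ( ... ) groups.
dash_ere() {
    sed -E "s/^([^;]*);$id_play;/\1-$id_play-/"
}

# Made-up input line whose 2nd field equals $id_play.
printf '%s\n' 'a;2;c;d' | dash_bre
printf '%s\n' 'a;2;c;d' | dash_ere
```

Both commands print a-2-c;d; the only difference is whether the parentheses need a backslash to act as a group.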
$ cat file
$ id_play=2
$ awk -v id="$id_play" -F';' -v OFS='-' '$2==id{$1=$1}1' file

Simple search and replace without regex

I've got a file with various wildcards in it that I want to be able to substitute from a (Bash) shell script. I've got the following which works great until one of the variables contains characters that are special to regexes:
perl -i -pe "s/VERSION/${VERSION}/g" txtfile.txt # No problems here
perl -i -pe "s/APP_NAME/${APP_NAME}/g" txtfile.txt # Error!
So instead I want something that just performs a literal text replacement rather than a regex. Are there any simple one-line invocations with Perl or another tool that will do this?
The 'proper' way to do this is to escape the contents of the shell variables so that they aren't seen as special regex characters. You can do this in Perl with \Q, as in
but when called from a shell script the backslash must be doubled to avoid it being lost, like so
perl -i -pe "s/APP_NAME/\\Q${APP_NAME}/g" txtfile.txt
But I suggest that it would be far easier to write the entire script in Perl
Use the following:
perl -i -pe "s|APP_NAME|\\Q${APP_NAME}|g" txtfile.txt
Since a vertical bar rarely appears in a path, this is usually safe (although it is technically a legal filename character on most Unix file systems).
I don't particularly like this answer because there should be a better way to do a literal replace in Perl. \Q is cryptic. Using quotemeta adds extra lines of code.
But... You can use substr to replace a portion of a string.
my $name = "Jess.*";
my $sentence = "Hi, my name is Jess.*, dude.\n";
my $new_name = "Prince//";
my $name_idx = index $sentence, $name;
if ($name_idx >= 0) {
    substr($sentence, $name_idx, length($name), $new_name);
}
print $sentence;
Hi, my name is Prince//, dude.
You don't have to use a regular expression for this (using substr(), index(), and length()):
perl -pe '
  foreach $var ("VERSION", "APP_NAME") {
    while (($i = index($_, $var)) != -1) {
      substr($_, $i, length($var)) = $ENV{$var};
    }
  }
' txtfile.txt
Make sure you export your variables.
You can use a regex but escape any special characters.
Something like this may work.
APP_NAME=`echo "$APP_NAME" | sed -e '{s:/:\/:}'`
perl -i -pe "s/APP_NAME/${APP_NAME}/g" txtfile.txt
perl -i -pe "\$r = qq/\Q${APP_NAME}\E/; s/APP_NAME/\$r/go"
Rationale: Escape sequences
I managed to get a working solution, partly based on bits and pieces from other peoples' answers:
perl -pe "\$r = q/${app_name//\//\\/}/; s/APP_NAME/\$r/g" <<<'APP_NAME'
This creates a Perl variable, $r, from the result of the shell parameter expansion:
${ # Open parameter expansion
app_name # Variable name
// # Start global substitution
\/ # Match / (backslash-escaped to avoid being interpreted as delimiter)
/ # Delimiter
\\/ # Replace with \/ (literal backslash needs to be escaped)
} # Close parameter expansion
All that work is needed to prevent forward slashes inside the variable from being treated as Perl syntax, which would otherwise close the q// quotes around the string.
In the replacement part, use the variable $r (the $ is escaped, to prevent it from being treated as a shell variable within double quotes).
Testing it out:
$ app_name='../../path/to/myapp'
$ perl -pe "\$r = q/${app_name//\//\\/}/; s/APP_NAME/\$r/g" <<<'APP_NAME'

Need to print the last occurrence of a string in Perl

I have a script in Perl that searches for an error that is in a config file, but it prints out any occurrence of the error. I need to match what is in the config file and print out only the last time the error occurred. Any ideas?
Wow... I was not expecting this much of a response. I should have been clearer in stating that this is for log monitoring on a Windows box that sends an alert to Nagios. This is actually my first Perl program and all this information has been very helpful. Does anyone know how I can apply any of the tail answers on a Wintel box?
Another way to do it:
perl -n -e '$e = $1 if /(REGEX_HERE)/; END{ print $e }' CONFIG_FILE_HERE
What exactly do you need to print? The line containing the error? More context than that?
File::ReadBackwards can be helpful.
In outline:
my $errinfo;
while (<>)
{
    $errinfo = "whatever" if (m/the error pattern/);
}
print "error: $errinfo\n" if ($errinfo);
This catches all errors, but doesn't print until the end, when only the last one survives.
A brute-force approach involves setting up your own pipeline by pointing STDOUT to tail. This allows you to print all errors, and then it's up to tail to worry about only letting the last one out.
You didn't specify, so I assume a legal config line is of the form
Name = some value
Matching that is straightforward:
^ (starting at the beginning of line)
\w+ (one or more “word characters”)
\s+ (followed by mandatory whitespace)
= (followed by an equals sign)
\s+ (more mandatory whitespace)
.+ (some mandatory value)
$ (finishing at the end of the line)
Gluing it together, we get
#! /usr/bin/perl
use warnings;
use strict;
# for demo only: make <> read the sample lines after __DATA__
*ARGV = *DATA;
my $pid = open STDOUT, "|-", "tail", "-1" or die "$0: open: $!";
while (<>) {
    print unless /^ \w+ \s+ = \s+ .+ $/x;
}
close STDOUT or warn "$0: close: $!";
__DATA__
This = assignment is ok
But := not this
And == definitely not this
$ ./lasterr
And == definitely not this
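The same pipeline idea can also be put together directly in the shell, which may be easier to experiment with. This is a sketch under the same assumption about what a legal config line looks like; the function name last_bad_line is invented, and POSIX character classes stand in for Perl's \w and \s.

```shell
# Hypothetical helper: drop legal "Name = value" lines, keep only the last offender.
last_bad_line() {
    grep -v -E '^[[:alnum:]_]+[[:space:]]+=[[:space:]]+.+$' | tail -n 1
}

printf '%s\n' \
    'This = assignment is ok' \
    'But := not this' \
    'And == definitely not this' | last_bad_line
```

As with the Perl version, every bad line flows through the pipe and tail keeps only the final one.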
With regular expressions, when you want the last occurrence of a pattern, place ^.* at the front of your pattern. For example, to replace the last X in the input with Y, use
$ echo XABCXXXQQQXX | perl -pe 's/^(.*)X/$1Y/'
Note that the ^ is redundant because regular-expression quantifiers are greedy, but I like having it there for emphasis.
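The greedy-prefix trick is not specific to Perl. As a small sketch, the same substitution written for sed with basic regular expressions behaves the same way; the function name replace_last_x is made up for illustration.

```shell
# Hypothetical helper: the greedy .* swallows as much as it can, so the X that
# follows the group is necessarily the last X on the line.
replace_last_x() {
    sed 's/^\(.*\)X/\1Y/'
}

echo XABCXXXQQQXX | replace_last_x
```

Only the final X is changed, giving XABCXXXQQQXY.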
Applying this technique to your problem, you can search for the last line in your config file that contains an error as in the following program:
#! /usr/bin/perl
use warnings;
use strict;
local $_ = do { local $/; scalar <DATA> };
if (/\A.* ^(?! \w+ \s+ = \s+ [^\r\n]+ $) (.+?)$/smx) {
    print $1, "\n";
}
__DATA__
This = assignment is ok
But := not this
And == definitely not this
The syntax of the regular expression is a bit different because $_ contains multiple lines, but the principle is the same. \A is similar to ^, but it matches only at the beginning of string to be searched. With the /m switch (“multi-line”), ^ matches at logical line boundaries.
Up to this point, we know the pattern
/\A.* ^ .../
matches the last line that looks like something. The negative look-ahead assertion (?!...) looks for a line that is not a legal config line. Ordinarily . matches any character except newline, but the /s switch (“single line”) lifts this restriction. Specifying [^\r\n]+, that is, one or more characters that are neither carriage return nor line feed, does not allow the match to spill into the next line.
Look-around assertions do not capture, so we grab the offending line with (.+?)$. The reason it's safe to use . in this context is because we know the current line is bad and the non-greedy quantifier +? stops matching as soon as it can, which in this case is the end of the current logical line.
All these regular expressions use the /x switch (“extended mode”) to allow extra whitespace: the aim is to improve readability.