sed insert multiple lines before multiple lines pattern - sed

I have the following very long file:
...
close unit 1
...
...
close unit 1
...
...
close unit 1
stop
I want to insert multiples lines before the last close unit 1 which is before stop. The file contains an undefined number of close unit 1.
I found a lot of other similar questions here and there, but the answers couldn't help me... For example I tried https://stackoverflow.com/a/8635732/1689664 but this didn't work...

Using sed and tac:
$ tac inputfile | sed '/close unit 1/ {s/\(.*\)/\1\nLine3\nLine2\nLine1/; :loop; n; b loop}' | tac
...
close unit 1
...
...
close unit 1
...
...
Line1
Line2
Line3
close unit 1
stop
Note that you'd need to specify the input lines in the reverse order in the sed expression.

Perl solution:
perl -ne ' push #arr, $_;
print shift #arr if #arr > 3;
if ("stop\n" eq $_ and "close unit 1\n" eq $arr[0]) {
print "\n\n"; # Inserted lines
}
}{ print #arr ' long-file > new-file
It keeps a sliding window of last 3 lines, if the last line in the window is stop and the first one is close unit 1, it prints the lines.
Another possibility is to use nl to number the lines, then grep lines containing close unit 1, getting the number of the last such a line and using it in a sed address:
nl -ba long-file \
| grep -F 'close unit 1' \
| tail -n1 \
| ( read line junk
sed -e $line's/^/\n\n/' long-file > new-file
)

This might work for you (GNU sed):
sed '/close unit 1/,$!b;/close unit 1/{x;/./p;x;h;d};H;$!d;x;i\line 1\nline 2\n...' file
Print every line before the first occurrence of close unit 1 as normal. Store collections of lines beginning with close unit 1 in the hold space and print the previous collection before storing the next. At the end of the file, the last collection will still be in the hold space, so insert the lines required and then print the last collection.

Related

Append to line that is preceded AND followed by empty line

I need to append an asterisk to a line, but only if said line is preceded and followed by empty lines (FYI, said empty lines will NOT have any white space in them).
Suppose I have the following file:
foo
foo
foo
foo
foo
I want the output to look like this:
foo
foo
foo
foo*
foo
I tried modifying the following awk command (found here):
awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
to suit my uses, but got all tied up in knots.
Sed or Perl solutions are, of course, welcome also!
UPDATE:
It turned out that the question I asked was not quite correct. What I really needed was code that would append text to non-empty lines that do not start with whitespace AND are followed, two lines down, by non-empty lines that also do not start with whitespace.
For this revised problem, suppose I have the following file:
foo
third line foo
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foo
eleventh line foo
this line starts with a space foo
last line foo
I want the output to look like this:
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
For that, this sed one-liner does the trick:
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Thanks to Benjamin W.'s clear and informative answer below, I was able to cobble this one-liner together!
A sed solution:
$ sed '1N;N;s/^\(\n.*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
foo
foo*
foo
N;P;D is the idiomatic way to look at two lines at the same time by appending the next one to the pattern space, then printing and deleting the first line.
1N;N;P;D extends that to always having three lines in the pattern space, which is what we want here.
The substitution matches if the first and last line are empty (^\n and \n$) and appends one * to the line between the empty lines.
Notice that this matches and appends a * also for the second line of three empty lines, which might not be what you want. To make sure this doesn't happen, the first capture group has to have at least one non-whitespace character:
sed '1N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
Question from comment
Can we not append the * if the line two above begins with abc?
Example input file:
foo
foo
abc
foo
foo
foo
foo
There are three foo between empty lines, but the first one should not get the * appended because the line two above starts with abc. This can be done as follows:
$ sed '1{N;N};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
abc
foo
foo*
foo*
foo
This keeps four lines at a time in the pattern space and only makes the substitution if the pattern space does not start with abc:
1 { # On the first line
N # Append next line to pattern space
N # ... again, so there are three lines in pattern space
}
N # Append fourth line
/^abc/! # If the pattern space does not start with abc...
s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/ # Append '*' to 3rd line in pattern space
P # Print first line of pattern space
D # Delete first line of pattern space, start next cycle
Two remarks:
BSD sed requires an extra semicolon: 1{N;N;} instead of 1{N;N}.
If the first and third line of the file are empty, the second line does not get an asterisk appended because we only start checking once there are four lines in the pattern space. This could be solved by adding an extra substitution into the 1{} block:
1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/}
(remember the extra ; for BSD sed), but trying to cover all edge cases makes sed even less readable, especially in one-liners:
sed '1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
One way to think about these problems is as a state machine.
start: state = 0
0: /* looking for a blank line */
if (blank line) state = 1
1: /* leading blank line(s)
if (not blank line) {
nonblank = line
state = 2
}
2: /* saw non-blank line */
if (blank line) {
output noblank*
state = 0
} else {
state = 1
}
And we can translate this pretty directly to an awk program:
BEGIN {
state = 0; # start in state 0
}
state == 0 { # looking for a (leading) blank line
print;
if (length($0) == 0) { # found one
state = 1;
next;
}
}
state == 1 { # have a leading blank line
if (length($0) > 0) { # found a non-blank line
saved = $0; # save it
state = 2;
next;
} else {
print; # multiple leading blank lines (ok)
}
}
state == 2 { # saw the non-blank line
if (length($0) == 0) { # followed by a blank line
print saved "*"; # BINGO!
state = 1; # to the saw a blank-line state
} else { # nope, consecutive non-blank lines
print saved; # as-is
state = 0; # to the looking for a blank line state
}
print;
next;
}
END { # cleanup, might have something saved to show
if (state == 2) print saved;
}
This is not the shortest way, nor likely the fastest, but it's probably the most straightforward and easy to understand.
EDIT
Here is a comparison of Ed's way and mine (see the comments under his answer for context). I replicated the OP's input a million-fold and then timed the runnings:
# ls -l
total 22472
-rw-r--r--. 1 root root 111 Mar 13 18:16 ed.awk
-rw-r--r--. 1 root root 23000000 Mar 13 18:14 huge.in
-rw-r--r--. 1 root root 357 Mar 13 18:16 john.awk
# time awk -f john.awk < huge.in > /dev/null
2.934u 0.001s 0:02.95 99.3% 0+0k 112+0io 1pf+0w
# time awk -f ed.awk huge.in huge.in > /dev/null
14.217u 0.426s 0:14.65 99.8% 0+0k 272+0io 2pf+0w
His version took about 5 times as long, did twice as much I/O, and (not shown in this output) took 1400 times as much memory.
EDIT from Ed Morton:
For those of us unfamiliar with the output of whatever time command John used above, here's the 3rd-invocation results from the normal UNIX time program on cygwin/bash using GNU awk 4.1.3:
$ wc -l huge.in
1000000 huge.in
$ time awk -f john.awk huge.in > /dev/null
real 0m1.264s
user 0m1.232s
sys 0m0.030s
$ time awk -f ed.awk huge.in huge.in > /dev/null
real 0m1.638s
user 0m1.575s
sys 0m0.030s
so if you'd rather write 37 lines than 3 lines to save a third of a second on processing a million line file then John's answer is the right one for you.
EDIT#3
It's the standard "time" built-in from tcsh/csh. And even if you didn't recognize it, the output should be intuitively obvious. And yes, boys and girls, my solution can also be written as a short incomprehensible mess:
s == 0 { print; if (length($0) == 0) { s = 1; next; } }
s == 1 { if (length($0) > 0) { p = $0; s = 2; next; } else { print; } }
s == 2 { if (length($0) == 0) { print p "*"; s = 1; } else { print p; s = 0; } print; next; }
END { if (s == 2) print p; }
Here's a perl filter version, for the sake of illustration — hopefully it's clear to see how it works. It would be possible to write a version that has a lower input-output delay (2 lines instead of 3) but I don't think that's important.
my #lines;
while (<>) {
# Keep three lines in the buffer, print them as they fall out
push #lines, $_;
print shift #lines if #lines > 3;
# If a non-empty line occurs between two empty lines...
if (#lines == 3 && $lines[0] =~ /^$/ && $lines[2] =~ /^$/ && $lines[1] !~ /^$/) {
# place an asterisk at the end
$lines[1] =~ s/$/*/;
}
}
# Flush the buffer at EOF
print #lines;
A perl one-liner
perl -0777 -lne's/(?<=\n\n)(.*?)(\n\n)/$1\*$2/g; print' ol.txt
The -0777 "slurps" in the whole file, assigned to $_, on which the (global) substitution is run and which is then printed.
The lookbehind (?<=text) is needed for repeating patterns, [empty][line][empty][line][empty]. It is a "zero-width assertion" that only checks that the pattern is there without consuming it. That way the pattern stays available for next matches.
Such consecutive repeating patterns trip up the /(\n\n)(.*?)(\n\n)/$1$2\*$3/, posted initially, since the trailing \n\n are not considered for the start of the very next pattern, having been just matched.
Update: My solution also fails after two consecutive matches as described above and needs the same lookback: s/(?<=\n\n)(\w+)\n\n/\1\2*\n\n/mg;
The easiest way is to use multi-line match:
local $/; ## slurp mode
$file = <DATA>;
$file =~ s/\n\n(\w+)\n\n/\n\n\1*\n\n/mg;
printf $file;
__DATA__
foo
foo
foo
foo
foo
It's simplest and clearest to do this in 2 passes:
$ cat tst.awk
NR==FNR { nf[NR]=NF; nr=NR; next }
FNR>1 && FNR<nr && NF && !nf[FNR-1] && !nf[FNR+1] { $0 = $0 "*" }
{ print }
$ awk -f tst.awk file file
foo
foo
foo
foo*
foo
The above takes one pass to record the number of fields on each line (NF is zero for an empty line) and then the second pass just checks your requirements - the current line is not the first or last in the file, it is not empty and the lines before and after are empty.
alternative awk solution (single pass)
$ awk 'NR>2 && !pp && !NF {p=p"*"}
NR>1{print p}
{pp=length(p);p=$0}
END{print p}' foo
foo
foo
foo
foo*
foo
Explanation: defer printing to next line for decision making, so need to keep previous line in p and state of the second previous line in pp (length zero assumed to be empty). Do the bookkeeping assignments and at the end print the last line.

Remove newline depending on the format of the next line

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like all newlines starting with "_" to be placed as a second column to the line before
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it is not giving me what I want to do (doing nothing basically)
Can anyone point me to the right way of doing it ?
Thanks
Try following solution:
In sed the loop is done creating a label (:a), and while not match last line ($!) append next one (N) and return to label a:
:a
$! {
N
b a
}
After this we have the whole file into memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere
If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file
This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

Any way to find if two adjacent new lines start with certain words?

Say I have a file like so:
+jaklfjdskalfjkdsaj
fkldsjafkljdkaljfsd
-jslakflkdsalfkdls;
+sdjafkdjsakfjdskal
I only want to find and count the amount of times during this file a line that starts with - is immediately followed by a line that starts with +.
Rules:
No external scripts
Must be done from within a bash script
Must be inline
I could figure out how to do this in a Python script, for instance, but I've never had to do something this extensive in Bash.
Could anyone help me out? I figure it'll end up being grep, perl, or maybe a talented sed line -- but these are things I'm still learning.
Thank you all!
grep -A1 "^-" $file | grep "^+" | wc -l
The first grep finds all of the lines starting with -, and the -A1 causes it to also output the line after the match too.
We then grep that output for any lines starting with +. Logically:
We know the output of the first grep is only the -XXX lines and the following lines
We know that a +xxx line cannot also be a -xxx line
Therefore, any +xxx lines must be following lines, and should be counted, which we do with wc -l
Easy in Perl:
perl -lne '$c++ if $p and /^\+/; $p = /^-/ }{ print $c' FILE
awk one-liner:
awk -v FS='' '{x=x sprintf("%s", $1)}END{print gsub(/-\+/,"",x)}' file
e.g.
kent$ cat file
+jaklfjdskalfjkdsaj
fkldsjafkljdkaljfsd
-jslakflkdsalfkdls;
+sdjafkdjsakfjdskal
-
-
-
+
-
+
foo
+
kent$ awk -v FS='' '{x=x sprintf("%s", $1)}END{print gsub(/-\+/,"",x)}' file
3
Another Perl example. Not as terse as choroba's, but more transparent in how it works:
perl -e'while (<>) { $last = $cur; $cur = $_; print $last, $cur if substr($last, 0, 1) eq "-" && substr($cur, 0, 1) eq "+" }' < infile
Output:
-jslakflkdsalfkdls;
+sdjafkdjsakfjdskal
Pure bash:
unset c p
while read line ; do
[[ $line == +* && $p == 0 ]] && (( c++ ))
[[ $line == -* ]]
p=$?
done < FILE
echo $c

sed — joining a range of selected lines

I'm a beginner to sed. I know that it's possible to apply a command (or a set of commands) to a certain range of lines like so
sed '/[begin]/,/[end]/ [some command]'
where [begin] is a regular expression that designates the beginning line of the range and [end] is a regular expression that designates the ending line of the range (but is included in the range).
I'm trying to use this to specify a range of lines in a file and join them all into one line. Here's my best try, which didn't work:
sed '/[begin]/,/[end]/ {
N
s/\n//
}
'
I'm able to select the set of lines I want without any problem, but I just can't seem to merge them all into one line. If anyone could point me in the right direction, I would be really grateful.
One way using GNU sed:
sed -n '/begin/,/end/ { H;g; s/^\n//; /end/s/\n/ /gp }' file.txt
This is straight forward if you want to select some lines and join them. Use Steve's answer or my pipe-to-tr alternative:
sed -n '/begin/,/end/p' | tr -d '\n'
It becomes a bit trickier if you want to keep the other lines as well. Here is how I would do it (with GNU sed):
join.sed
/\[begin\]/ {
:a
/\[end\]/! { N; ba }
s/\n/ /g
}
So the logic here is:
When [begin] line is encountered start collecting lines into pattern space with a loop.
When [end] is found stop collecting and join the lines.
Example:
seq 9 | sed -e '3s/^/[begin]\n/' -e '6s/$/\n[end]/' | sed -f join.sed
Output:
1
2
[begin] 3 4 5 6 [end]
7
8
9
I like your question. I also like Sed. Regrettably, I do not know how to answer your question in Sed; so, like you, I am watching here for the answer.
Since no Sed answer has yet appeared here, here is how to do it in Perl:
perl -wne 'my $flag = 0; while (<>) { chomp; if (/[begin]/) {$flag = 1;} print if $flag; if (/[end]/) {print "\n" if $flag; $flag = 0;} } print "\n" if $flag;'

Add column to middle of tab-delimited file (sed/awk/whatever)

I'm trying to add a column (with the content '0') to the middle of a pre-existing tab-delimited text file. I imagine sed or awk will do what I want. I've seen various solutions online that do approximately this but they're not explained simply enough for me to modify!
I currently have this content:
Affx-11749850 1 555296 CC
I need this content
Affx-11749850 1 0 555296 CC
Using the command awk '{$3=0}1' filename messes up my formatting AND replaces column 3 with a 0, rather than adding a third column with a 0.
Any help (with explanation!) so I can solve this problem, and future similar problems, much appreciated.
Using the implicit { print } rule and appending the 0 to the second column:
awk '$2 = $2 FS "0"' file
Or with sed, assuming single space delimiters:
sed 's/ / 0 /2' file
Or perl:
perl -lane '$, = " "; $F[1] .= " 0"; print #F'
awk '{$2=$2" "0; print }' your_file
tested below:
> echo "Affx-11749850 1 555296 CC"|awk '{$2=$2" "0;print}'
Affx-11749850 1 0 555296 CC