multiline pattern delete with single line command - perl

I would like to delete all empty segments in my file.
An empty segment is a pair of consecutive lines: a line starting with START immediately followed by a line starting with END. Valid segments have some content between the START and END lines.
Sample Input
Header
START arguments
END
Any contents
START arguments
...
something
...
END
Footer
Desired Output
Header
Any contents
START arguments
...
something
...
END
Footer
Here I'm looking for possible one-liners. Any help would be appreciated.
Trials
I tried the following awk. It works to some extent, but it deletes START lines even in valid segments.
awk '/^START/ && getline && /^END$/ {next} 1' file
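A getline-based fix would have to re-emit both buffered lines when the next line is not END; a rough, untested sketch (though I'm still after something shorter):
awk '/^START/ { if ((getline nxt) > 0) { if (nxt ~ /^END$/) next; print; print nxt; next } } 1' file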

perl -00 -pe 's/START .*?\nEND//g' file
This is a better one: the solution I gave earlier will discard the whole paragraph if the empty segment is not separated from the surrounding content by blank lines.
Earlier answer below:
How about this Perl one-liner?
perl -00 -ne 'print if not /START .*\nEND/' file
Read the file in paragraph mode and discard paragraphs matching START <string><newline>END.

While people were suggesting nice solutions, I came up with an alternative solution using sed:
sed '/^START/N;/^START.*END$/d' file
Or, as suggested by @jthill:
sed '/^START/N; /\nEND$/d' file

gawk only (it relies on a regexp record separator, a GNU awk extension):
awk -v RS='START[^\n]*\nEND\n' '{printf "%s", $0}' file.txt

Perhaps the following will be helpful:
perl -ne 'print /^START/?do{$x=<>;$_,$x if $x!~/^END/}:$_' inFile
Output on your dataset:
Header
Any contents
START arguments
...
something
...
END
Footer

$ awk '{rec = rec $0 RS} END{ gsub(/START[^\n]*\nEND\n/,"",rec); printf "%s", rec }' file
Header
Any contents
START arguments
...
something
...
END
Footer

/^START/ {                 # buffer a START line; don't print it yet
    startline = $0
    next
}
/^END$/ && startline {     # START immediately followed by END: empty segment, drop both
    startline = ""
    next
}
startline {                # the segment has content, so the buffered START line is valid
    print startline
}
{ startline = "" }
1                          # print the current line

Related

sed/awk conditionally delete lines from the start and end of a file

I have several thousand text files which might start with
"
Start of text
but not all of them have the same number of line breaks and not all of them have "
I would like to remove " (if it exists) and any line breaks, if any.
(and the ending too but I'll probably figure it out if you show me how to remove it from the start)
End of file...
"
perl is also ok
My attempt would be something like this with fish shell (awk is probably more performant, though):
if head -1 $file | grep -q '"'
    sed -i 1d $file
end
if head -1 $file | grep -q '^[[:space:]]*$'
    sed -i 1d $file
end
if head -1 $file | grep -q '^[[:space:]]*$'
    sed -i 1d $file
end
if head -1 $file | grep -q '^[[:space:]]*$'
    sed -i 1d $file
end
That might actually work; I'm going to try it.
The simplest way to do this is a 2-pass approach where on the first pass you figure out the beginning and ending line numbers for the "good" lines and on the second you print the lines between those numbers:
awk '
NR==FNR { if (NF && !/^"$/) { if (!beg) beg=NR; end=NR } next }
(beg <= FNR) && (FNR <= end)
' file file
For example given this input:
$ cat file
"
Start of text
but not all of them have the same number of line breaks and not all of them have "
I would like to remove " (if it exists) and any line breaks, if any.
(and the ending too but I'll probably figure it out if you show me how to remove it from the start)
End of file...
"
We can do the following using any awk in any shell on every UNIX box:
$ awk 'NR==FNR{if (NF && !/^"$/) {if (!beg) beg=NR; end=NR} next} (beg <= FNR) && (FNR <= end)' file file
Start of text
but not all of them have the same number of line breaks and not all of them have "
I would like to remove " (if it exists) and any line breaks, if any.
(and the ending too but I'll probably figure it out if you show me how to remove it from the start)
End of file...
You can use ed to do it in a single pass, too:
Something like
printf '%s\n' '1g/^"$/.,/^./-1d' '$g/^"$/?^.?+1,$d' w | ed -s "$file"
Translated: If the first line is nothing but a quote, delete it and any following empty lines. If the last line is nothing but a quote, delete all preceding empty lines and it. Finally write the file back to disk.
This might work for you (GNU sed):
sed '1{/^"$/d};/\S/!d;:a;${/^"$/Md};/\S/{n;ba};$d;N;ba' file
Delete the first line if it contains only a single ".
Delete all empty lines from the start of the file.
Form a loop for the remainder of the file.
Delete the last line(s) if it/they contain only a single ".
If the current line(s) is/are not empty, print it/them, fetch the next and repeat.
If the current line(s) is/are the last and empty, delete it/them.
The current line(s) is/are empty so append the next line and repeat.
N.B. This is a single pass solution and allows for empty lines within the body of the file.
Alternative, memory intensive:
sed -Ez 's/^"?\n+//;s/\n+("\n)?$/\n/' file
In addition to the two-pass processing, here's a one-pass:
awk '!/^"*$/{print b $0;f=1;b=""} f&&/^"*$/{b=b $0 ORS}' file
The program consists of two small parts:
(1) Whenever there's content (lines that contain more than just "), print any buffered lines and the current input line, set a flag (f) that content has started, and clear the buffer.
(2) If content has started (f) but the current line doesn't contain any content, we may have reached the end, so we buffer these empty lines. Later, part (1) will print them, or they will be discarded at EOF.

sed: replace pattern only if followed by empty line

I need to replace a pattern in a file, but only if it is followed by an empty line. Suppose I have the following file:
test
test

test

...
the following command would replace all occurrences of test with xxx
cat file | sed 's/test/xxx/g'
but I need to replace test only if the next line is empty. I have tried matching a hex code, but that does not work:
cat file | sed 's/test\x0a/xxx/g'
The desired output should look like this:
test
xxx

xxx

...
Suggested solutions for sed, perl and awk:
sed
sed -rn '1h;1!H;${g;s/test([^\n]*\n\n)/xxx\1/g;p;}' file
I got the idea from sed multiline search and replace. Basically slurp the entire file into sed's hold space and do global replacement on the whole chunk at once.
perl
$ perl -00 -pe 's/test(?=[^\n]*\n\n)$/xxx/m' file
-00 triggers paragraph mode, which makes perl read chunks separated by one or more empty lines (just what the OP is looking for). The positive lookahead (?=) anchors the substitution to the last line of the chunk.
Caveat: -00 will squash multiple empty lines into single empty lines.
awk
$ awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
Basically: store the previous line in l, substitute the pattern in l if the current line is empty, then print l. Finally, print the very last line.
Output in all three cases
test
xxx

xxx

...
This might work for you (GNU sed):
sed -r '$!N;s/test(\n\s*)$/xxx\1/;P;D' file
Keep a window of two lines throughout the length of the file; if the second line is empty and the first line contains the pattern, make the substitution.
Using sed
sed -r ':a;$!{N;ba};s/test([^\n]*\n(\n|$))/xxx\1/g'
Explanation:
:a                               # set label a
$!{                              # if not at end of file
N                                # append a newline and the next input line to the pattern space
ba                               # branch back to label a
}
# In short, :a;$!{N;ba} reads the whole file into the pattern space.
s/test([^\n]*\n(\n|$))/xxx\1/g   # replace the keyword if the next line is empty (\n\n) or at end of input ($)

put all separate paragraphs of a file into a separate line

I have a file that contains sequence data, where each new paragraph (separated by two blank lines) contains a new sequence:
#example
ASDHJDJJDMFFMF
AKAKJSJSJSL---
SMSM-....SKSKK
....SK


SKJHDDSNLDJSCC
AK..SJSJSL--HG
AHSM---..SKSKK
-.-GHH
and I want to end up with a file looking like:
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
each sequence is the same length (if that helps).
I would also be looking to do this over multiple files stored in different directories.
I have just tried
sed -e '/./{H;$!d;}' -e 'x;/regex/!d' ./text.txt
however this just deleted the entire file :S
Any help would be appreciated. It doesn't have to be in sed; if you know how to do it in perl or something else then that's also great.
Thanks.
All you're asking to do is convert a file of blank-line-separated records (RS), where each field is separated by newlines, into a file of newline-separated records where each field is separated by nothing (OFS). Just set the appropriate awk variables and recompile the record:
$ awk '{$1=$1}1' RS= OFS= file
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
awk '
/^[[:space:]]*$/ {if (line) print line; line=""; next}
{line=line $0}
END {if (line) print line}
'
perl -00 -pe 's/\n//g; $_.="\n"'
For multiple files:
# adjust your glob pattern to suit,
# don't be shy to ask for assistance
for file in */*.txt; do
newfile="/some/directory/$(basename "$file")"
perl -00 -pe 's/\n//g; $_.="\n"' "$file" > "$newfile"
done
A Perl one-liner, if you prefer:
perl -nle 'BEGIN{$/=""};s/\n//g;print $_' file
The $/ variable is the equivalent of awk's RS variable. When set to the empty string ("") it causes two or more empty lines to be treated as one empty line. This is the so-called "paragraph mode" of reading. For each record read, all newline characters are removed. The -l switch adds a newline to the end of each output string, thus giving the desired result.
Just find the double line breaks (\n\n, or \r\n\r\n with Windows line endings) and first replace those with a special marker such as :$:.
After that, replace every remaining line break with an empty string to get the whole file on one line.
Next, replace your special marker with a single line break :)
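A rough sketch of that idea as a Perl one-liner (using <BREAK> as the marker here purely to avoid escaping :$:; any string that never occurs in the data will do, and file is a placeholder name):
perl -0777 -pe 's/\n{2,}/<BREAK>/g; s/\n//g; s/<BREAK>/\n/g; s/$/\n/' file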

join 2 consecutive rows under condition

I have 5 lines like:
typeA;pointA1
typeA;pointA2
typeA;pointA3
typeB;pointB1
typeB;pointB2
result output would be:
typeA;pointA1;typeA;pointA2
typeA;pointA2;typeA;pointA3
typeB;pointB1;typeB;pointB2
Is it possible to use sed or awk for this purpose?
This is easy with awk:
awk -F';' '$1 == prevType { printf("%s;%s;%s\n", $1, prevPoint, $0) } { prevType = $1; prevPoint = $2 }'
I've assumed that the blank lines between the records are not part of the input; if they are, just run the input through grep -v '^$' before awk.
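For example (with file as a stand-in for your input file name):
grep -v '^$' file | awk -F';' '$1 == prevType { printf("%s;%s;%s\n", $1, prevPoint, $0) } { prevType = $1; prevPoint = $2 }'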
paste could be useful in this case; it could save a lot of code:
sed '1d' file|paste -d";" file -|awk -F';' '$1==$3'
see the test below
kent$ cat a
typeA;pointA1
typeA;pointA2
typeA;pointA3
typeB;pointB1
typeB;pointB2
kent$ sed '1d' a|paste -d";" a -|awk -F';' '$1==$3'
typeA;pointA1;typeA;pointA2
typeA;pointA2;typeA;pointA3
typeB;pointB1;typeB;pointB2
This GNU sed solution might work for you:
sed -rn '1{h;b};H;x;/^([^;]*);.*\n\1/!{s/.*\n//;x;d};s/\n/;/p' source_file
This assumes no blank lines; otherwise preformat the source file with sed '/^$/d' source_file and pipe it in.
EDIT:
On reflection the above solution is far too elaborate and can be condensed to:
sed -ne '1{h;b};H;x;/^\([^;]*\);.*\1/s/\n/;/p' source_file
Explanation:
The -n prevents any lines being implicitly printed. The first line is copied to the hold space (HS, an extra register) and then a break is made that ends the iteration. All subsequent lines are appended to the HS. The HS is then swapped with the pattern space (PS, a register holding the current line). The HS at this point contains the previous and current lines, which are now checked to see if the first field of each line is identical. If so, the newline separating the two lines is replaced by a ;, and provided the substitution occurred the PS is printed out. The next iteration then takes place: the current line refreshes the PS, and the HS now holds the previous line.

replace two newlines to one in shell command line

There are a lot of questions about replacing multiple newlines with one newline, but none of them work for me.
I have a file:
first line
second line MARKER

third line MARKER

other lines
many other lines
I need to replace two newlines (if they exist) after MARKER with one newline. The result file should be:
first line
second line MARKER
third line MARKER
other lines
many other lines
I tried sed ':a;N;$!ba;s/MARKER\n\n/MARKER\n/g'. Fail.
sed is useful for single-line replacements but has problems with newlines; it can't find \n\n.
I tried perl -i -p -e 's/MARKER\n\n/MARKER\n/g'. Fail.
This solution looks closer, but it seems the regexp doesn't react to \n\n.
Is it possible to replace \n\n only after MARKER and not replace other \n\n in the file?
I am interested in a one-line solution, not scripts.
I think you were on the right track. In a multi-line program, you would load the entire file into a single scalar and run this substitution on it:
s/MARKER\n\n/MARKER\n/g
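For reference, a rough sketch of that multi-line version (the file name input.txt is made up):
use strict;
use warnings;
open my $fh, '<', 'input.txt' or die "open: $!";
my $text = do { local $/; <$fh> };   # slurp the whole file into one scalar
$text =~ s/MARKER\n\n/MARKER\n/g;    # squeeze the blank line that follows MARKER
print $text;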
The trick to getting a one-liner to load a file into a multi-line string is to set $/ in a BEGIN block. This code will get executed once, before the input is read.
perl -i -pe 'BEGIN{$/=undef} s/MARKER\n\n/MARKER\n/g' input
Your Perl solution doesn't work because you are searching for lines that contain two newlines. There is no such thing. Here's one solution:
perl -ne'print if !$m || !/^$/; $m = /MARKER$/;' infile > outfile
Or in-place:
perl -i~ -ne'print if !$m || !/^$/; $m = /MARKER$/;' file
If you're ok with loading the entire file into memory, you can use
perl -0777pe's/MARKER\n\n/MARKER\n/g;' infile > outfile
or
perl -0777pe's/MARKER\n\K\n//g;' infile > outfile
As above, you can use -i~ to edit in place. Remove the ~ if you don't want to make a backup.
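For example, applied in place to the second one-liner (file is the same placeholder name as above):
perl -i~ -0777pe's/MARKER\n\K\n//g;' file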
awk:
kent$ cat a
first line
second line MARKER

third line MARKER

other lines
many other lines
kent$ awk 'BEGIN{RS="\x034"} {gsub(/MARKER\n\n/,"MARKER\n");printf $0}' a
first line
second line MARKER
third line MARKER
other lines
many other lines
See sed one-liners.
awk '
marker { marker = 0; if (/^$/) next }
/MARKER/ { marker = 1 }
{ print }
'
This can be done with very simple sed:
sed '/MARKER$/{n;/./!d}'
This might work for you:
sed '/MARKER/,//{//!d}'
Explanation:
Deletes all lines between MARKERs, preserving the MARKER lines.
Or:
sed '/MARKER/{n;N;//D}'
Explanation:
Read the next line after MARKER, then append the line after that. Delete the previous line if the current line is a MARKER line.