sed/awk conditionally delete lines from the start and end of a file - perl

I have several thousand text files which might start with
"
Start of text
but not all of them have the same number of line breaks and not all of them have "
I would like to remove " (if it exists) and any line breaks, if any.
(and the ending too but I'll probably figure it out if you show me how to remove it from the start)
End of file...
"
perl is also ok
my attempt would be something like this with fish shell. awk is probably more performant though
if head -1 | grep \"
sed -i 1d $file
if head -1 | grep '^\r\n$'
sed -i 1d $file
if head -1 | grep '^\r\n$'
sed -i 1d $file
if head -1 | grep '^\r\n$'
sed -i 1d $file
that might actually work I'm going to try it

The simplest way to do this is a 2-pass approach where on the first pass you figure out the beginning and ending line numbers for the "good" lines and on the second you print the lines between those numbers:
awk '
NR==FNR { if (NF && !/^"$/) { if (!beg) beg=NR; end=NR } next }
(beg <= FNR) && (FNR <= end)
' file file
For example given this input:
$ cat file
"
Start of text
but not all of them have the same number of line breaks and not all of them have "
I would like to remove " (if it exists) and any line breaks, if any.
(and the ending too but I'll probably figure it out if you show me how to remove it from the start)
End of file...
"
We can do the following using any awk in any shell on every UNIX box:
$ awk 'NR==FNR{if (NF && !/^"$/) {if (!beg) beg=NR; end=NR} next} (beg <= FNR) && (FNR <= end)' file file
Start of text
but not all of them have the same number of line breaks and not all of them have "
I would like to remove " (if it exists) and any line breaks, if any.
(and the ending too but I'll probably figure it out if you show me how to remove it from the start)
End of file...
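Since the question mentions several thousand files, here is a minimal sketch of wrapping the two-pass command so it rewrites a file in place. The function name trim_edges and the use of mktemp are assumptions for illustration, not part of the answer above; it treats a line as "good" exactly as the awk command does (non-blank and not a lone double quote).

```shell
# trim_edges FILE  (hypothetical helper, not from the answer above)
# Rewrites FILE in place, keeping only the lines between the first and
# last "good" line (non-blank and not a lone double quote).
trim_edges() {
    tmp=$(mktemp) || return 1
    if awk '
        NR==FNR { if (NF && !/^"$/) { if (!beg) beg=NR; end=NR } next }
        (beg <= FNR) && (FNR <= end)
    ' "$1" "$1" > "$tmp"
    then
        cat "$tmp" > "$1"    # overwrite the contents, keep the file's permissions
    fi
    rm -f "$tmp"
}

# Example: apply it to every .txt file in the current directory.
# for f in ./*.txt; do trim_edges "$f"; done
```

Writing to a temporary file first avoids truncating the input while awk is still reading it; try it on copies before running it over the real files.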

You can use ed to do it in a single pass, too:
Something like
printf '%s\n' '1g/^"$/.,/^./-1d' '$g/^"$/?^.?+1,$d' w | ed -s "$file"
Translated: If the first line is nothing but a quote, delete it and any following empty lines. If the last line is nothing but a quote, delete all preceding empty lines and it. Finally write the file back to disk.

This might work for you (GNU sed):
sed '1{/^"$/d};/\S/!d;:a;${/^"$/Md};/\S/{n;ba};$d;N;ba' file
Delete the first line if it contains only a single ".
Delete all empty lines from the start of the file.
Form a loop for the remainder of the file.
Delete the last line(s) if it/they contains a single ".
If the current line(s) is/are not empty, print it/them, fetch the next and repeat.
If the current line(s) is/are the last and empty, delete it/them.
The current line(s) is/are empty so append the next line and repeat.
N.B. This is a single pass solution and allows for empty lines within the body of the file.
Alternative, memory intensive:
sed -Ez 's/^"?\n+//;s/\n+("\n)?$/\n/' file

In addition to the two-pass processing, here's a one-pass:
awk '!/^"*$/{print b $0;f=1;b=""} f&&/^"*$/{b=b $0 ORS}' file
The program consists of two small parts:
1. Whenever there's content (a line that contains anything other than double quotes), print any buffered lines followed by the current input line, set a flag that content has started, and clear the buffer.
2. If content has started (f) but the current line doesn't contain any content, we may have reached the end, so we buffer these empty lines. Later, (1) will print them, or they will be discarded at EOF.

Related

Bash one-liner to insert text marker into the fourth and all consecutive tabs of fields populated with text

This is a Bash/.bat terminal script for Mac.
I'm trying to add text ("!!XX!!") into a group of tab-delimited .txt files in a folder, but I only want to add it to the 4th and all following tab-delimited fields in each .txt file, and then only if those cells have text in them. So the end result would be something like the following (assuming the 7th cell/field/bit of info is blank). So turn this:
text01
text02
text03
text04
text05
text06
... into this:
text01 [TAB] text02 [TAB] text03 [TAB] text04!!XX!! [TAB] text05!!XX!! [TAB] text06!!XX!! [TAB]
The text marker "!!XX!!" is so that another script in a different system can run on the file and perform special system-compatible/custom line formatting at each incident of "!!XX!!", but I don't want to populate the first three fields/tab-delimited text (because it's not needed there) or in the empty fields (because it's not wanted there).
I'm already replacing each line return with a tab, so it is possible to do it there, though my preference is to do it later to the tab-delimited text b/c of weird issues with the line returns/formatting coming in from .rtf files. Below is what I am using to replace each line return with a TAB (and, yes, that is an actual line return and tab in there, which seems to work best because... Macs?):
perl -pi -w -e 's/
/ /g' *.txt;
Thanks in advance :)
This post assumes an input file that has lines with tab-separated fields, where each field starting from (and including) the fourth need be edited if it has something.
One way
perl -F"\t" -wlane'
for (3..$#F) { $F[$_] .= "!XX!" if defined $F[$_] }; print join("\t", @F)
' file
(In the tcsh shell you need to escape those ! with a backslash.) Once you've tested enough, add the -i switch to change the input file in place (-i.bak keeps a backup).
This uses Perl's -a switch to break input lines by what is given under the -F switch (or by whitespace by default), and the resulting array is in @F. See switches in perlrun.
Then it iterates from the fourth field to the last. I use the syntax $#ary for the index of the last element of array @ary.
I don't know what counts for cells that "have text in them" so above I test a field for defined-ness; thus this will append even for an empty string. Adjust as suitable.
Or use a regex, which allows more flexibility here. For example,
for (3..$#F) { $F[$_] =~ s/.+\K/!XX!/ }
This matches all characters and then adds !XX! (keeping what it matched, via the \K assertion). Using a regex both allows and requires you to specify more precisely what is accepted there; the pattern shown will match even whitespace alone, but not an empty string. To leave whitespace-only fields untouched, and to strip trailing spaces if any:
for (3..$#F) { $F[$_] =~ s/.+\S\K\s*/XX/ };
Again, adjust to your details.
I don't quite understand the discussion of newlines and what is wanted of them; the above one-liner goes line by line. If that's not what you need please clarify. I don't have Macs to test, so I can't comment on all that.
A self-contained example for ready testing and tweaking
echo "t1\tt2\tt3\tt4\t\tt6 \t " |
perl -F"\t" -wlanE'for (3..$#F) { $F[$_] =~ s/.+\S\K\s*/XX/ } say for @F'
where I print each field on a separate line for easier inspection. The last tab in input is followed by trailing spaces only -- this results in an empty field, but with no text marker added (as asked for in a comment).
with GNU sed
$ echo text{01..07}$'\t' | sed -E 's/([^\t]+)(\t|$)/\1!!xx!!\2/4g'
text01 text02 text03 text04!!xx!! text05!!xx!! text06!!xx!! text07!!xx!!
or
$ echo text{01..07}$'\t' | sed -E 's/\t([^\t]+)/\t\1!!xx!!/3g'
Assuming each text file contains 7 lines, you can do
paste -s *.txt | awk '
BEGIN {FS=OFS="\t"}
{for (i=4; i<=NF; i++) if ($i != "") $i = $i "!!XX!!"; print}
'
Here is an awk:
echo text{01..10}$'\t' |
awk -v OFS=$'\t' '{for(i=1;i<=NF;i++) printf "%s%s", $i, i>=4 ? "XXX\t" : i<NF ? OFS : ORS }'
With perl, I would do this:
echo text{01..10}$'\t' |
perl -lpE '$cnt=0; s/\h+/++$cnt>=4 ? "XXX\t" : "\t"/ge;'
Both print:
text01 text02 text03 text04XXX text05XXX text06XXX text07XXX text08XXX text09XXX text10XXX

How to avoid the last newline in sed?

I want to remove the last part of a file, starting at a line following a certain pattern and including the preceding newline.
So, stopping at "STOP", the following file:
keep\n
STOP\n
whatever
Should output:
keep
With no trailing newline.
I tried this, and the logic seems to work, but it seems that sed adds a newline every time it prints its buffer. How can I avoid that? When sed doesn't manipulate the buffer, I don't have that problem (i.e., if I remove the STOP, sed outputs 'whatever' at the end of the file without a newline).
printf 'keep
STOP
Whatever' | sed 'N
/\nSTOP/ {
s/\n.*$//
P
Q
}
P
D'
I'm trying to write a git cleaning filter, and I cannot have a new newline appended every time I commit.
$ awk '/^STOP/{exit} {printf "%s%s", ors, $0; ors=RS}' file
keep$
The above prints every line without a trailing newline but preceded by a newline (\n or \r\n - whichever your environment dictates so it'll behave correctly on UNIX or Windows or whatever) for every 2nd and subsequent line. When it finds a STOP line it just exits before printing anything.
Note that the above doesn't keep anything in memory except the current line so it'll work no matter how large your input file is and no matter where the STOP appears in it - it'll even work if STOP is the first line of the file unlike the other answers you have so far.
It will also work using any awk in any shell on every UNIX box.
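Because the question is about a git clean filter, which reads stdin and writes stdout, the same idea can be wrapped as a filter and checked by counting bytes. This is only a sketch: the function name strip_from_stop is made up for illustration, and the separator is hard-coded to "\n" rather than taken from RS.

```shell
# strip_from_stop: hypothetical stdin-to-stdout filter.
# Prints every line before the first line starting with STOP, writing the
# newline *before* each line except the first, so nothing trails the output.
strip_from_stop() {
    awk '/^STOP/ { exit } { printf "%s%s", sep, $0; sep = "\n" }'
}

# "keep" is 4 bytes, so wc -c should report 4 if no newline was appended.
printf 'keep\nSTOP\nwhatever' | strip_from_stop | wc -c
```

One caveat of this design: input that contains no STOP line also comes out without its final newline, so it is worth checking whether that matters for the filter.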
This might work for you (GNU sed):
sed -z 's/\nSTOP.*//' file
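The -z option makes sed use NUL instead of newline as the line separator, so a text file with no NUL bytes is slurped into memory as a single "line"; the substitute command then removes the remainder of the file, starting at the first newline followed by STOP.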
Using awk you could:
$ awk '$0=="STOP"{exit} {b=b (b==""?"":ORS) $0} END{printf "%s",b}' file
Output:
keep$
Explained:
$ awk '
$0=="STOP" { exit } # exit at STOP, ie. go to END
{ b=b (b==""?"":ORS) $0 } # gather an output buffer, control \n
END { printf "%s",b } # in the END output output buffer
' file
... more (focusing a bit on the conditional operator):
b=b # appending to b, so b is b and ...
(b==""?"":ORS) # if b was empty, add nothing to it, if not add ORS ie. \n ...
$0 # and the current record

how to delete lines connected with "+" signs with sed

In this example, "+" sign means it connects the previous line and the current line. So I like to delete a specific group of lines that are connected by "+".
For example, I'd like to remove from 1st line to 4th line(.groupA ~ + G H I). Please help me on how to do it with sed.
To delete lines starting with .groupA and all consecutive +-prefixed lines, one easy to understand approach is:
sed '/\.groupA/,/^[^+]/ { /\.groupA/d; /\.groupA/!{/^\+/d} }' file
We first select everything between .groupA and the first non +-prefixed line (inclusive), then for that selection of lines, we delete the first line (containing .groupA), and of the remaining lines, we delete all with + prefix.
Note you need to escape regex metacharacters (like . and +) if you want to match them literally.
A little bit more advanced, but more elegant (only one use of starting block pattern) approach uses a loop to skip the first line of the matched block, and all the following lines that start with +:
sed -n '/\.groupA/ { :a; n; s/^\+//; ta }; p' file
IMHO this is more readily done with awk, but kindly just ignore if that is not an option for you.
So, every time I see a line starting with .groupA, I set a flag d to say I am deleting, and then skip to the next line. If I see a line starting with a + and I am currently deleting, I skip to the next line. If I see anything else, I change the flag to say I am no longer deleting and print the line:
awk '/^\.groupA/ {d=1; next}
/^\+/ && d==1 {next}
{d=0; print}' file
Sample Output
** Example **
abcdef ghijkl
.groupB abc def
+ JKL
+ MNO
+ GHI
opqrst vwxyz
You can cast it as a one-liner like this:
awk '/^\.groupA/{d=1; next} d==1 && /^\+/ {next} {d=0;print}' file
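The flag logic described above can also be parameterized on the group name. This is a sketch under assumptions: the function name delete_group is invented here, the group is identified by comparing the first whitespace-separated field rather than by regex, and the sample input in the usage comment is hypothetical because the original input is not shown.

```shell
# delete_group NAME: hypothetical filter that drops the line whose first
# field is NAME plus every "+" continuation line directly after it.
delete_group() {
    awk -v group="$1" '
        $1 == group       { skipping = 1; next }   # group header: start skipping
        skipping && /^\+/ { next }                 # continuation of that group
                          { skipping = 0; print }  # anything else is kept
    '
}

# Usage with made-up input:
# printf '.groupA abc\n+ DEF\nkeep\n.groupB x\n+ JKL\n' | delete_group .groupA
```

Comparing $1 to the name avoids having to escape the dot, and it does not match a longer name such as .groupAB.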

put all separate paragraphs of a file into a separate line

I have a file that contains sequence data, where each new paragraph (separated by two blank lines) contain a new sequence:
#example
ASDHJDJJDMFFMF
AKAKJSJSJSL---
SMSM-....SKSKK
....SK
SKJHDDSNLDJSCC
AK..SJSJSL--HG
AHSM---..SKSKK
-.-GHH
and I want to end up with a file looking like:
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
each sequence is the same length (if that helps).
I would also be looking to do this over multiple files stored in different directories.
I have just tried
sed -e '/./{H;$!d;}' -e 'x;/regex/!d' ./text.txt
however this just deleted the entire file :S
Any help would be appreciated - it doesn't have to be in sed; if you know how to do it in perl or something else then that's also great.
Thanks.
All you're asking to do is convert a file of blank-lines-separated records (RS) where each field is separated by newlines into a file of newline-separated records where each field is separated by nothing (OFS). Just set the appropriate awk variables and recompile the record:
$ awk '{$1=$1}1' RS= OFS= file
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
awk '
/^[[:space:]]*$/ {if (line) print line; line=""; next}
{line=line $0}
END {if (line) print line}
'
perl -00 -pe 's/\n//g; $_.="\n"'
For multiple files:
# adjust your glob pattern to suit,
# don't be shy to ask for assistance
for file in */*.txt; do
newfile="/some/directory/$(basename "$file")"
perl -00 -pe 's/\n//g; $_.="\n"' "$file" > "$newfile"
done
A Perl one-liner, if you prefer:
perl -nle 'BEGIN{$/=""};s/\n//g;print $_' file
The $/ variable is the equivalent of awk's RS variable. When set to the empty string ("") it causes two or more empty lines to be treated as one empty line. This is the so-called "paragraph mode" of reading. For each record read, all newline characters are removed. The -l switch adds a newline to the end of each output string, thus giving the desired result.
Just find the double line breaks (\n or \r) first and replace them with a special marker such as :$:
After that, replace every remaining line break with an empty string to get the whole file on one line.
Finally, replace the special marker with a simple line break :)
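The three steps above can be sketched with awk by reading the whole input into one string first. The function name join_paragraphs is an assumption for illustration, and the sketch assumes the marker :$: never occurs in the data and that the file is small enough to hold in memory.

```shell
# join_paragraphs [FILE...]: hypothetical helper implementing the
# marker-based approach; reads stdin when no file is given.
join_paragraphs() {
    awk '
        { text = text $0 "\n" }                # slurp the whole input
        END {
            gsub(/\n\n+/, ":$:", text)         # 1. runs of line breaks -> marker
            gsub(/\n/, "", text)               # 2. drop the remaining line breaks
            gsub(/:\$:/, "\n", text)           # 3. marker -> single line break
            print text
        }
    ' "$@"
}

# Usage: join_paragraphs text.txt > joined.txt
```

If the input ends with blank lines, the last marker becomes an extra empty line at the end of the output.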

sed: joining lines depending on the second one

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with '+' (possibly preceded by spaces).
line 1
line 2
+ continue 2
line 3
...
I'd like to join the split line back:
line 1
line 2 continue 2
line 3
...
using sed. I'm not clear how to join a line with the preceding one.
Any suggestion?
This might work for you:
sed 'N;s/\n\s*+//;P;D' file
These are actually four commands:
N
Append line from the input file to the pattern space
s/\n\s*+//
Remove newline, following whitespace and the plus
P
print line from the pattern space until the first newline
D
delete line from the pattern space until the first newline, i.e. the part which was just printed
The relevant manual page parts are
Selecting lines by numbers
Addresses overview
Multiline techniques - using D,G,H,N,P to process multiple lines
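As written, the N;...;P;D command only ever looks at two lines at a time, so a second consecutive + line is not joined. A possible variation, sketched here as an assumption rather than as part of the answer, loops back with t after each successful join. The function name join_continuations is made up, and [[:space:]] is used in place of \s to stay within POSIX classes.

```shell
# join_continuations [FILE...]: hypothetical helper that joins every run of
# "+" continuation lines onto the line before it.
join_continuations() {
    # :a   loop label
    # $!N  append the next input line unless we are already on the last one
    # s    remove "newline, optional whitespace, +" if the appended line is a continuation
    # ta   a join happened, so loop and look at the following line too
    # P;D  print and drop the first line, keeping the remainder for the next cycle
    sed -e ':a' -e '$!N' -e 's/\n[[:space:]]*+//' -e 'ta' -e 'P' -e 'D' "$@"
}

# Usage: join_continuations file
```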
Doing this in sed is certainly a good exercise, but it's pretty trivial in perl:
perl -0777 -pe 's/\n\s*\+//g' input
I'm not partial to sed so this was a nice challenge for me.
sed -n '1{h;n};/^ *+ */{s// /;H;n};{x;s/\n//g;p};${x;p}'
In awk this is approximately:
awk '
NR == 1 {hold = $0; next}
/^ *\+/ {$1 = ""; hold=hold $0; next}
{print hold; hold = $0}
END {if (hold) print hold}
'
If the last line is a "+" line, the sed version will print a trailing blank line. Couldn't figure out how to suppress it.
You can use Vim in Ex mode:
ex -sc g/+/-j -cx file
g global search
- select previous line
j join with next line
x save and close
Different use of hold space with POSIX sed... to load the entire file into the hold space before merging lines.
sed -n '1x;1!H;${g;s/\n\s*+//g;p}'
1x on the first line, swap the line into the empty hold space
1!H on non-first lines, append to the hold space
$ on the last line:
g get the hold space (the entire file)
s/\n\s*+//g replace newlines preceding +
p print everything
Input:
line 1
line 2
+ continue 2
+ continue 2 even more
line 3
+ continued
becomes
line 1
line 2 continue 2 continue 2 even more
line 3 continued
This (or potong's answer) might be more interesting than a sed -z implementation if you want other commands for other manipulations of the data: you can simply stick them in before 1!H. With sed -z the entire file is loaded into the pattern space immediately, which means you aren't manipulating single lines at any point. The same goes for perl -0777.
In other words, if you want to also eliminate comment lines starting with *, add in /^\s*\*/d to delete the line
sed -n '1x;/^\s*\*/d;1!H;${g;s/\n\s*+//g;p}'
versus:
sed -z 's/\n\s*+//g;s/\n\s*\*[^\n]*\n/\n/g'
The former's accumulation in the hold space line by line keeps you in classic sed line processing land, while the latter's sed -z dumps you into what could be some painful substring regexes.
But that's sort of an edge case, and you could always just pipe sed -z back into sed. So +1 for that.
Footnote for internet searches: This is SPICE netlist syntax.
A solution for versions of sed that can read NUL separated data, like here GNU Sed's -z:
sed -z 's/\n\s*+//g'
Compared to potong's solution this has the advantage of being able to join multiple lines that start with +. For example:
line 1
line 2
+ continue 2
+ continue 2 even more
line 3
becomes
line 1
line 2 continue 2 continue 2 even more
line 3