Why is a line break added when an element node is appended to a node that already has a text node? - dom

I'm probably overlooking the obvious, but would you please explain why this code produces a line break after the span tag?
If the text node is not first appended to $div then there is not a line break after the span.
Thank you.
set div [$doc createElement div]
$div appendChild [$doc createTextNode Hello]
chan puts stdout "[$div asHTML] -- line break: [string first \n [$div asHTML]]"
set span [$doc createElement span]
$span setAttribute class orig
$span appendChild [$doc createTextNode { there!}]
chan puts stdout "[$span asHTML] -- line break: [string first \n [$span asHTML]]"
$div appendChild $span
chan puts stdout "[$div asHTML] -- line break: [string first \n [$div asHTML]]"
# Results:
# <div>Hello</div> -- line break: -1
# <span class="orig"> there!</span> -- line break: -1
# <div>Hello<span class="orig"> there!</span>
# </div> -- line break: 43
It doesn't appear to be something in the above code because this code generates a line break also.
set em [$::doc createElement em]
$em appendFromList [ list {span} {class added} { {{#text} Hello}\
{{span} {class inner} {{{#text} { there!}}}}}]
set html [$em asHTML]
chan puts stdout "$html -- [string first \n $html]"
# <em><span class="added">Hello<span class="inner"> there!</span>
# </span></em> -- 63

sed/awk/perl remove the first two lines of a 3 line pattern

I have a huge text file. I need to replace all occurrences of this three line
pattern:
|pattern|some data|
|giberish|,,
|pattern|some other data|
by the last line of the pattern:
|pattern|some other data|
That is, remove the first two lines of the pattern and keep only the last one.
The first line of the pattern starts with |pattern| and does not end with two commas.
The second line ends with two commas and does not start with |pattern|.
The third line starts with |pattern| and does not end with two commas.
I tried this:
sed 'N;N;/^|pattern|.*\n.*,,\n|pattern|.*/I,+1 d' trial.txt
without much luck.
Edit: Here is a more substantial example
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
EOL
and it should become:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
@zdim:
the first three lines of the file:
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
satisfy the pattern. So they are replaced by
|pattern|sdk;sd|
so the top of the file now becomes:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
...
the first three lines of which are:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
which satisfy the pattern, so they are replaced by:
|pattern|aslkaa|
so the top of the file now is:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
....
@JosephQuinsey:
consider this file:
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|blabla|
|||4|||-0.97|0|1429037262.8271||20160229||1025||1000.0|0.01|,,
|pattern|blable|
|||5|||-1.27|0|1429037262.854||20160229||1025||1000.0|0.01|,,
|pattern|blasbla|
|||493|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,,
|||11|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,|T|347||1429043438.1962|-0.22|5|0||-0.22|1429043438.1962|,|Q|346||1429043437.713|-0.24|26|-0.22|5|||1429043437.713|
|pattern|jksds|
|||232|||-5.66|0|1429037262.817||20150415||1025||1000.0|0.01|,,
|pattern|bdjkds|
|||123q|||-7.15|0|1429037262.8271||20150415||1025||1000.0|0.01|,,
|pattern|blabla|
|||239ps|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,,
|||-92opa|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|1||1428969600.5019|-0.99|1|11||||,
|||kj2w|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|2||1428969600.5019|-1|1|11||||,
|||0293|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|3||1428969600.5019|-1.01|1|11||||,
|||2;;w32|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|4||1428969600.5019|-1.11|1|11||||,
EOL
Here is a simple take on it, using a buffer to collect and manage the pattern lines:
use warnings;
use strict;
use feature 'say';
my $file = shift or die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my @buf;
while (<$fh>) {
    chomp;
    if (/^\|pattern\|/ and not /,,$/) {
        @buf = $_;    # start the buffer (first line) or overwrite (third)
    }
    elsif (/,,$/ and not /^\|pattern\|/) {
        if (@buf) { push @buf, $_ }    # add to buffer with first line in it
        else      { say }              # not part of 3-line-pattern; print
    }
    else {
        say for @buf;    # time to print out buffer
        @buf = ();       # ... empty it ...
        say              # and print the current line
    }
}
say for @buf;    # print anything still buffered when the input ends
This prints the expected output.
Explanation.
Pattern lines go into a buffer, and when we get the "third line" the first two need to be removed. So we assign to the array whenever we see ^|pattern|: this either starts the buffer (if it is the first line) or re-initializes the array, discarding what is in it (if it is the third line).
A line ending with ,, is added to the buffer if there is a line there already. Nothing prevents lines ending with ,, from occurring outside of a pattern; in that case the line is just printed.
So each |pattern| line sets the buffer straight: it either starts it or resets it. Thus once we run into a line with neither ^|pattern| nor ,,$ we can print out the buffer, followed by that line.
Please test more comprehensively, which I have not yet had a chance to do.
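For readers who find the control flow easier to follow outside Perl, here is a minimal Python sketch of the same buffering logic. The function name collapse and the use of plain string tests instead of regexes are choices made for this illustration, not part of the answer above; it also flushes the buffer at end of input.

```python
def collapse(lines):
    """Drop the first two lines of each |pattern| / ,, / |pattern| triple."""
    out, buf = [], []
    for line in lines:
        starts = line.startswith("|pattern|")
        ends = line.endswith(",,")
        if starts and not ends:
            buf = [line]            # start the buffer, or overwrite it (third line)
        elif ends and not starts:
            if buf:
                buf.append(line)    # possible middle line of a triple
            else:
                out.append(line)    # ",," line outside any pattern: emit as-is
        else:
            out.extend(buf)         # ordinary line: flush the buffer ...
            buf = []
            out.append(line)        # ... then emit the current line
    out.extend(buf)                 # flush whatever is left at end of input
    return out


sample = ["|pattern|a|", "|x|,,", "|pattern|b|", "|y|,,", "|z|"]
print("\n".join(collapse(sample)))
```

On this small input the first two lines are dropped and the remaining three are printed unchanged.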
In order to run this either in a pipeline or on a file use the "magical" <> filehandle. So it becomes
use warnings;
use strict;
use feature 'say';
my @buf;
while (<>) { # reads lines from files given on command line, or from STDIN
...
}
Now you can run it either as data | script.pl or as script.pl datafile. (Make the script executable for this, or use as perl script.pl.)
The script's output goes to STDOUT which can be piped into other programs or redirected to a file.
It depends on how huge your file is, but if it fits in memory, how about:
perl -0777 -pe '
    1 while s/^\|pattern\|.+?\|\n(?!\|pattern\|).+?,,\n(\|pattern\|.+?\|)$/$1/m;
' trial.txt
Output:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
An awk solution:
awk -v pa=pattern '
$0 ~ pa {
do {
hold=$0;
getline;
hold=hold "\n" $0;
getline;
} while(match($0, pa));
print hold
}
1' trial.txt
The idea is to buffer the line that matched the pattern, then the line after it. If the next line also matches the pattern, loop, this time buffering the most recent matching line and the one following it. This has the effect of removing the lines that need to be replaced.
When the loop stops, the first line the buffer contains is either the line to replace the removed lines or simply a first pattern match that is not to be removed. Either way the contents of the buffer get printed.
The final 1 statement is needed to print the line that ended the while loop and all other lines that aren't the first or second after one matching the pattern.
Updated answer: The following sed solution should work:
sed '/\n/!N;/\n.*\n/!N;/^|pattern|.*\n.*,,\n|pattern|/!{P;D;};s/[^\n]*\n//;D;'
Explanation:
/\n/!N if the P-space has only one line, read the next
/\n.*\n/!N if the P-space has only two lines, read in a third
/^|pattern|.*\n.*,,\n|pattern|/ test if the first and third lines start with |pattern|, and the middle line ends with two commas
!{P;D;} if the match fails, then print the first line and start over
s/[^\n]*\n//;D; otherwise, when the match succeeds, delete the first two lines, and start over.
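The three-line sliding window that this sed script maintains can be sketched in Python as follows. This is only an illustration under the same matching rule as the sed regex (first and third lines start with |pattern|, middle line ends with two commas); the names TRIPLE and collapse_window are hypothetical.

```python
import re

# Same test as the sed address: line 1 and line 3 start with |pattern|,
# line 2 ends with two commas.
TRIPLE = re.compile(r"^\|pattern\|.*\n.*,,\n\|pattern\|")


def collapse_window(lines):
    out, window = [], []
    for line in lines:
        window.append(line)            # like N: pull the next line into the window
        if len(window) < 3:
            continue                   # keep reading until three lines are held
        if TRIPLE.match("\n".join(window)):
            window = window[2:]        # match: drop the first two lines, keep the third
        else:
            out.append(window.pop(0))  # no match: like P;D, emit and drop the first line
    out.extend(window)                 # end of input: emit what is still in the window
    return out


sample = ["|pattern|a|", "|x|,,", "|pattern|b|", "|y|,,", "|z|"]
print("\n".join(collapse_window(sample)))
```

After a match the surviving third line becomes the first line of the next window, which is what lets chained triples collapse one after another.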
This might work for you (GNU sed):
sed ':a;N;s/[^\n]*/&/3;Ta;/^|pattern|.*\n.*,,\n|pattern|/{/,,\n.*\n\|,,$/!{s/.*\n//;ba}};P;D' file
Populate the pattern space with the next three lines of the file. If the first pattern matches the current three lines and neither the first nor the third line ends with ,,, then delete the first two lines and repeat. Otherwise print and delete the first line of the three-line window and repeat.

Find and print two blocks of lines from file with sed in one pass

I am trying to come up with an sed command to find and print two blocks of variable number of lines from a text file that look like this:
...
INFO first block to match
id: "value"
...
last line of the first block
INFO next irrelevant block
id: "different value"
...
INFO second block to match
id: "value"
...
last line of the second block
...
I only have prior knowledge of the id value and the fact that each block starts with a line that has "INFO". I want to match each block from that first line and not include the first line of the next block in the output:
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
Ideally I would prefer to do it in a single pass, not have the file scanned multiple times from top to bottom. Currently I have this (it only matches the first block, and I need both):
sed -n -e "/INFO/{"'$!'"{N;/INFO.*id: \"value\"/{:l;p;n;/^[^\\[]/bl;}}}" file.log
EDIT
Linebreak between blocks is certainly nice, but entirely optional.
EDIT 2
Please note that INFO and id: "value" do not have to be in the beginning of the line, and all other words in my example are arbitrary and not known in advance. There can be any number of blocks (including 0) between and around the ones I need to match.
sed is powerful, terse, and dumb. awk is smarter!
awk '/^INFO/{f = /match/? 1: 0} f'
edit: I see you want a linebreak between each "block"; will update if I find a tighter way:
awk '/^INFO/{f = /match/? 1: 0; if(i++) $0 = RS $0} f'
/^INFO/{action}: Execute {action} only on lines beginning with "INFO"
variable = if ? then : else: Conditional Expression (ternary operator)
if(i++): The first time this is evaluated, i will be zero, thus the expression will be false. This prevents an extra line break at the first block.
$0 = RS $0: Prepend a Record Separator (newline) to $0 (entire record)
f: If f is non-zero, {print $0} is implied.
This might work for you (GNU sed):
sed -nE ':a;/^INFO/{N;/^id: "value"/M!D;:b;H;$!{n;/^INFO/!bb};x;s/^/x/;/^x{2}/{s/^x*.//p;q};x;ba}' file
This solution stores the required blocks in the hold space, prefixed by a counter. Once the required number of blocks are stored the counters are removed, the blocks printed and the process quit.
The solution (based only on the input provided) supposes that an id (if it exists) always follows the INFO line.
Here is an alternative solution using a combination of sed and awk. It allows you to parse the input blockwise or recordwise. This approach relies on setting awk record separator (RS) to the empty string which makes awk read a full block in at a time.
So there are 2 steps:
Make the input record-parsable.
Process each record.
For your example this could be something like this:
sed '1!s/^INFO/\n&/' infile | awk '/id: "value"/' RS= ORS='\n\n'
Output:
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
awk is good for this, and if you could set RS to a multi-character expression it would be ideal. (gnu awk allows this, but why bother with gnu awk when there is perl?)
perl -wnle 'BEGIN{$/="INFO"; undef $\} print "$/$_" if m/id: \"value\"/' input
Basically, this sets the record separator ($/) to the string "INFO" (so now each of your "records" is a "line" to perl). If the record matches the pattern id: "value", it is printed with "INFO" prepended to the start. (Without -l, perl would retain the record separator at the end of each record, which is not quite what you want.) By omitting the "undef $\", you can get an extra newline between records. Some code golf could probably cut the length of this in half, but my perl is a bit rusty. Waiting for the shorter version in comments.
This may or may not be what you want depending on what your real data looks like:
$ awk '/INFO/{info=$0; f=0} /id: "value"/{print info; f=1} f' file
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
or if you want to do more with each block than just print it as you go then some variation of this is better:
$ awk '
/INFO/ { prt() }
{ block = block $0 ORS }
END { prt() }
function prt() {
if (block ~ /id: "value"/) {
printf "%s", block
}
block=""
}
' file
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
The above will behave the same using any awk in any shell on any UNIX box.
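The block-accumulating idea in the second awk script (collect lines until the next INFO line, then decide whether to keep the whole block) can be sketched in Python like this. The function name matching_blocks and its parameters are made up for the illustration; as in the awk version, the markers may appear anywhere in a line.

```python
def matching_blocks(lines, start="INFO", wanted='id: "value"'):
    """Group lines into blocks that begin at a `start` line and keep
    only the blocks that contain `wanted` somewhere."""
    blocks, current = [], []
    for line in lines:
        if start in line and current:
            blocks.append(current)     # a new block begins: close the previous one
            current = []
        current.append(line)
    if current:
        blocks.append(current)         # close the final block at end of input
    return [b for b in blocks if any(wanted in l for l in b)]


log = [
    "...",
    "INFO first block to match",
    'id: "value"',
    "last line of the first block",
    "INFO next irrelevant block",
    'id: "different value"',
    "INFO second block to match",
    'id: "value"',
    "last line of the second block",
]
for block in matching_blocks(log):
    print("\n".join(block))
```

Because each block is held as a list before it is printed, anything more elaborate than printing (sorting, counting, reformatting) can be applied per block.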

Append to line that is preceded AND followed by empty line

I need to append an asterisk to a line, but only if said line is preceded and followed by empty lines (FYI, said empty lines will NOT have any white space in them).
Suppose I have the following file:
foo
foo
foo
foo
foo
I want the output to look like this:
foo
foo
foo
foo*
foo
I tried modifying the following awk command (found here):
awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
to suit my uses, but got all tied up in knots.
Sed or Perl solutions are, of course, welcome also!
UPDATE:
It turned out that the question I asked was not quite correct. What I really needed was code that would append text to non-empty lines that do not start with whitespace AND are followed, two lines down, by non-empty lines that also do not start with whitespace.
For this revised problem, suppose I have the following file:
foo
third line foo
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foo
eleventh line foo
this line starts with a space foo
last line foo
I want the output to look like this:
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
For that, this sed one-liner does the trick:
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Thanks to Benjamin W.'s clear and informative answer below, I was able to cobble this one-liner together!
A sed solution:
$ sed '1N;N;s/^\(\n.*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
foo
foo*
foo
N;P;D is the idiomatic way to look at two lines at the same time by appending the next one to the pattern space, then printing and deleting the first line.
1N;N;P;D extends that to always having three lines in the pattern space, which is what we want here.
The substitution matches if the first and last line are empty (^\n and \n$) and appends one * to the line between the empty lines.
Notice that this matches and appends a * also for the second line of three empty lines, which might not be what you want. To make sure this doesn't happen, the first capture group has to have at least one non-whitespace character:
sed '1N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
Question from comment
Can we not append the * if the line two above begins with abc?
Example input file:
foo
foo
abc
foo
foo
foo
foo
There are three foo between empty lines, but the first one should not get the * appended because the line two above starts with abc. This can be done as follows:
$ sed '1{N;N};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
abc
foo
foo*
foo*
foo
This keeps four lines at a time in the pattern space and only makes the substitution if the pattern space does not start with abc:
1 { # On the first line
N # Append next line to pattern space
N # ... again, so there are three lines in pattern space
}
N # Append fourth line
/^abc/! # If the pattern space does not start with abc...
s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/ # Append '*' to 3rd line in pattern space
P # Print first line of pattern space
D # Delete first line of pattern space, start next cycle
Two remarks:
BSD sed requires an extra semicolon: 1{N;N;} instead of 1{N;N}.
If the first and third line of the file are empty, the second line does not get an asterisk appended because we only start checking once there are four lines in the pattern space. This could be solved by adding an extra substitution into the 1{} block:
1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/}
(remember the extra ; for BSD sed), but trying to cover all edge cases makes sed even less readable, especially in one-liners:
sed '1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
One way to think about these problems is as a state machine.
start: state = 0
0: /* looking for a blank line */
    if (blank line) state = 1
1: /* leading blank line(s) */
    if (not blank line) {
        nonblank = line
        state = 2
    }
2: /* saw non-blank line */
    if (blank line) {
        output nonblank*
        state = 1
    } else {
        state = 0
    }
And we can translate this pretty directly to an awk program:
BEGIN {
state = 0; # start in state 0
}
state == 0 { # looking for a (leading) blank line
print;
if (length($0) == 0) { # found one
state = 1;
next;
}
}
state == 1 { # have a leading blank line
if (length($0) > 0) { # found a non-blank line
saved = $0; # save it
state = 2;
next;
} else {
print; # multiple leading blank lines (ok)
}
}
state == 2 { # saw the non-blank line
if (length($0) == 0) { # followed by a blank line
print saved "*"; # BINGO!
state = 1; # to the saw a blank-line state
} else { # nope, consecutive non-blank lines
print saved; # as-is
state = 0; # to the looking for a blank line state
}
print;
next;
}
END { # cleanup, might have something saved to show
if (state == 2) print saved;
}
This is not the shortest way, nor likely the fastest, but it's probably the most straightforward and easy to understand.
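As a cross-check of the state transitions, here is a small Python sketch that follows the awk program rule for rule. The function name star_isolated is hypothetical, and lines are assumed to arrive without their trailing newline, so an empty string stands for a blank line.

```python
def star_isolated(lines):
    """Append '*' to a non-blank line that sits between two blank lines."""
    out, state, saved = [], 0, None
    for line in lines:
        blank = (line == "")
        if state == 0:                 # looking for a (leading) blank line
            out.append(line)
            if blank:
                state = 1
        elif state == 1:               # have a leading blank line
            if blank:
                out.append(line)       # several leading blank lines are fine
            else:
                saved, state = line, 2
        else:                          # state 2: holding a non-blank line
            if blank:
                out.append(saved + "*")
                state = 1              # this blank line can lead the next candidate
            else:
                out.append(saved)
                state = 0
            out.append(line)
    if state == 2:                     # input ended while a line was held back
        out.append(saved)
    return out


print(star_isolated(["foo", "", "foo", "foo", "", "foo", "", "foo"]))
```

Only the sixth line gains an asterisk here: it is the only non-blank line with a blank line directly before and after it.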
EDIT
Here is a comparison of Ed's way and mine (see the comments under his answer for context). I replicated the OP's input a million-fold and then timed the runs:
# ls -l
total 22472
-rw-r--r--. 1 root root 111 Mar 13 18:16 ed.awk
-rw-r--r--. 1 root root 23000000 Mar 13 18:14 huge.in
-rw-r--r--. 1 root root 357 Mar 13 18:16 john.awk
# time awk -f john.awk < huge.in > /dev/null
2.934u 0.001s 0:02.95 99.3% 0+0k 112+0io 1pf+0w
# time awk -f ed.awk huge.in huge.in > /dev/null
14.217u 0.426s 0:14.65 99.8% 0+0k 272+0io 2pf+0w
His version took about 5 times as long, did twice as much I/O, and (not shown in this output) took 1400 times as much memory.
EDIT from Ed Morton:
For those of us unfamiliar with the output of whatever time command John used above, here's the 3rd-invocation results from the normal UNIX time program on cygwin/bash using GNU awk 4.1.3:
$ wc -l huge.in
1000000 huge.in
$ time awk -f john.awk huge.in > /dev/null
real 0m1.264s
user 0m1.232s
sys 0m0.030s
$ time awk -f ed.awk huge.in huge.in > /dev/null
real 0m1.638s
user 0m1.575s
sys 0m0.030s
so if you'd rather write 37 lines than 3 lines to save a third of a second on processing a million line file then John's answer is the right one for you.
EDIT#3
It's the standard "time" built-in from tcsh/csh. And even if you didn't recognize it, the output should be intuitively obvious. And yes, boys and girls, my solution can also be written as a short incomprehensible mess:
s == 0 { print; if (length($0) == 0) { s = 1; next; } }
s == 1 { if (length($0) > 0) { p = $0; s = 2; next; } else { print; } }
s == 2 { if (length($0) == 0) { print p "*"; s = 1; } else { print p; s = 0; } print; next; }
END { if (s == 2) print p; }
Here's a perl filter version, for the sake of illustration — hopefully it's clear to see how it works. It would be possible to write a version that has a lower input-output delay (2 lines instead of 3) but I don't think that's important.
my @lines;
while (<>) {
    # Keep three lines in the buffer, print them as they fall out
    push @lines, $_;
    print shift @lines if @lines > 3;
    # If a non-empty line occurs between two empty lines...
    if (@lines == 3 && $lines[0] =~ /^$/ && $lines[2] =~ /^$/ && $lines[1] !~ /^$/) {
        # place an asterisk at the end
        $lines[1] =~ s/$/*/;
    }
}
# Flush the buffer at EOF
print @lines;
A perl one-liner
perl -0777 -lne's/(?<=\n\n)(.*?)(\n\n)/$1\*$2/g; print' ol.txt
The -0777 "slurps" in the whole file, assigned to $_, on which the (global) substitution is run and which is then printed.
The lookbehind (?<=text) is needed for repeating patterns, [empty][line][empty][line][empty]. It is a "zero-width assertion" that only checks that the pattern is there without consuming it. That way the pattern stays available for next matches.
Such consecutive repeating patterns trip up the /(\n\n)(.*?)(\n\n)/$1$2\*$3/, posted initially, since the trailing \n\n are not considered for the start of the very next pattern, having been just matched.
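The difference between consuming the leading blank line and merely asserting it with a lookbehind can be reproduced with Python's re module, whose lookbehind syntax is the same for this fixed-width case. The sample text below is invented for the illustration.

```python
import re

# Two candidate lines in a row, each surrounded by empty lines.
text = "a\n\nfoo\n\nbar\n\nb\n"

# Consuming version: the trailing \n\n of the first match is used up,
# so "bar" no longer has a leading \n\n left to match against.
consuming = re.sub(r"(\n\n)(.*?)(\n\n)", r"\1\2*\3", text)

# Lookbehind version: the leading \n\n is only asserted, never consumed,
# so the \n\n that ended one match can also introduce the next one.
lookbehind = re.sub(r"(?<=\n\n)(.*?)(\n\n)", r"\1*\2", text)

print(repr(consuming))
print(repr(lookbehind))
```

The first result stars only foo, while the second stars both foo and bar.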
Update: My solution also fails after two consecutive matches as described above and needs the same lookbehind: s/(?<=\n\n)(\w+)\n\n/$1*\n\n/mg;
The easiest way is to use multi-line match:
local $/; ## slurp mode
$file = <DATA>;
$file =~ s/\n\n(\w+)\n\n/\n\n$1*\n\n/mg;
print $file;
__DATA__
foo
foo
foo
foo
foo
It's simplest and clearest to do this in 2 passes:
$ cat tst.awk
NR==FNR { nf[NR]=NF; nr=NR; next }
FNR>1 && FNR<nr && NF && !nf[FNR-1] && !nf[FNR+1] { $0 = $0 "*" }
{ print }
$ awk -f tst.awk file file
foo
foo
foo
foo*
foo
The above takes one pass to record the number of fields on each line (NF is zero for an empty line) and then the second pass just checks your requirements - the current line is not the first or last in the file, it is not empty and the lines before and after are empty.
An alternative awk solution (single pass):
$ awk 'NR>2 && !pp && !NF {p=p"*"}
NR>1{print p}
{pp=length(p);p=$0}
END{print p}' foo
foo
foo
foo
foo*
foo
Explanation: printing of each line is deferred until the next line has been read, so that the decision can be made with the following line in hand. The previous line is kept in p, and pp holds the length of the line before that (a length of zero means it was empty). After the bookkeeping assignments, the END block prints the last line.

conditional searching using perl

Here is my input file:
Test1
{
abc
jkl
mno
search_text
pqr
}
Test2
{
stu
vwx
}
Test3
{
yza
search_text
bcd
}
Problem: print the test name (Test1, Test2, Test3 etc.) if "search_text" occurs between the curly braces of that test.
Expected output for the given input file:
Test1
Test3
This one-liner does what you need:
perl -lne'
BEGIN { $/ = "}" }
if ( /search_text/ ) {
$match = $.==1 ? (split "\n")[0] : (split "\n")[1];
print $match;
}' file
Test1
Test3
Explanation:
-l: Chomps the newline while processing and places it back during print.
-n: Creates an implicit while(<>) { ... } loop to process each line.
-e: Tells the perl interpreter to execute the code that follows it.
$/: In the BEGIN block we set the input record separator to } instead of the default newline.
Since each record is now a chunk of lines terminated by }, we test whether the chunk contains search_text. If it does, we split the chunk on newlines: for the first record the test name is the first element, for every later record it is the second (the chunk starts with the newline that followed the previous }). We then print it.

sed, awk or perl: Pattern range match, print 45 lines then add record delimiter

I have a file containing records delimited by the pattern /#matchee/. These records are of varying lengths ...say 45 - 75 lines. They need to ALL be 45 lines and still maintain the record delimiter. Records can be from different departments, department name is on line 2 following a blank line. So record delimiter could be thought of as simply /^#matchee/ or /^matchee/ followed by \n. There is a Deluxe edition of this problem and a Walmart edition ...
DELUXE EDITION
Pull each record by pattern range so I can sort records by department. Eg., with sed
sed -n '/^DEPARTMENT NAME/,/^#matchee/{p;}' mess-o-records.txt
Then, Print only the first 45 lines of each record in the file to conform to
the 45 line constraint.
Finally, make sure the result still has the record delimiter on line 45.
WALMART EDITION
Same as above, but instead of using a range, just use the record delimiter.
STATUS
My attempt at this might clarify what I'm trying to do.
sed -n -e '/^DEPARTMENT-A/,/^#matchee/{p;}' -e '45q' -e '$s/.*/#matchee/' mess-o-records.txt
This doesn't work, of course, because sed is operating on the entire file at each command.
I need it to operate on each range match not the whole file.
SAMPLE INPUT - 80 Lines ( truncated for space )
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
<way too many lines here>
#matchee
SAMPLE OUTPUT - now only 45 lines
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<Record now equals exactly 45 lines>
<yet record delimiter is maintained>
#matchee
CLARIFICATION UPDATE
I will never need more than the first 40 lines if this makes things easier. Maybe the process would be:
Match pattern(s)
Print first 40 lines.
Pad to appropriate length. Eg., 45 lines.
Tack delimiter back on. Eg., #matchee
I think this would be more flexible -- Ie., can handle record shorter than 45 lines.
Here's a riff based on @Borodin's Perl example below:
my $count = 0;
$/ = "#matchee";
while (<>) {
if (/^REDUNDANCY.*DEPT/) {
print;
$count = 0;
}
else {
print if $count++ < 40;
print "\r\n" x 5;
print "#matchee\r\n";
}
}
This adds 5 newlines to each record plus the delimiting pattern /#matchee/. So it's wrong, but it illustrates what I want.
Print 40 lines based on department -- pad -- tack delimiter back on.
I think I understand what you want. Not sure about the bit about pull each record by pattern range. Is #matchee always followed by a blank line and then the department line? So in fact record number 2?
This Perl fragment does what I understand you need.
If you prefer you can put the input file on the command line and drop the open call. Then the loop would have to be while (<>) { ... }.
Let us know if this is right so far, and what more you need from it.
use strict;
use warnings;
open my $fh, '<', 'mess-o-records.txt' or die $!;
my $count = 0;
while (<$fh>) {
if (/^#matchee/) {
print;
$count = 0;
}
else {
print if $count++ < 45;
}
}
I know this has already had an accepted answer, but I figured I'd post an awk example for anyone interested. It's not 100%, but it gets the job done.
Note: this numbers the lines so you can verify the script is working as expected. Remove the i, from print i, current[i] to remove the line numbers.
dep.awk
BEGIN { RS = "#matchee\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= 45; i++) {
print i, current[i];
}
print "#matchee\n"
}
In this example, you begin the script by setting the record separator (RS) to "#matchee\n\n". There are two newlines because the first ends the line on which #matchee occurs and the second is the blank line on its own.
The match validates that a record contains letters or numbers to be valid. You could also check that the match starts with 'DEPARTMENT-', but this would fail if there is a stray newline. Checking the content is the safest route. Because this uses a block record (i.e., DEPARTMENT-A through #matchee), you could either pass $0 through awk or sed again, or use the awk split function and loop through 45 lines. In awk, the arrays aren't zero-indexed.
The print function includes a newline, so the block ends with print "#matchee\n" only instead of the double \n in the record separator variable.
You could also drop the same awk script into a bash script and make the number of lines and the record separator parameters. Of course, you should add validations and whatnot, but here's the start:
dep.sh
#!/bin/bash
# prints the first n lines within every block of text delimited by splitter
splitter=$1
numlines=$2
awk 'BEGIN { RS="'$1'\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for(i=1;i<='$numlines';i++) {
print i, current[i]
}
print "'$splitter'", "\n"
}' $3
Make the script executable and run it.
./dep.sh '#matchee' 45 input.txt > output.txt
I added these files to a gist so you could also verify the output
This might work for you:
D="DEPARTMENT-A" M="#matchee"
sed '/'"$D/,/$M"'/{/'"$D"'/{h;d};H;/'"$M"'/{x;:a;s/\n/&'"$M"'/45;tb;s/'"$M"'/\n&/;ta;:b;s/\('"$M"'\).*/\1/;p};d}' file
Explanation:
Focus on range of lines /DEPARTMENT/,/#matchee/
At start of range move pattern space (PS) to hold space (HS) and delete PS /DEPARTMENT/{h;d}
All subsequent lines in the range append to HS and delete H....;d
At end of range:/#matchee/
Swap to HS x
Test for 45 lines in range and if successful append #matchee at the 45th line s/\n/&#matchee/45
If previous substitution was successful branch to label b. tb
If previous substitution was unsuccessful insert a linefeed before #matchee s/'"$M"'/\n&/ thus lengthening a short record to 45 lines.
Branch to label a and test for 45 lines etc . ta
Replace everything from the first occurrence of #matchee to the end of the pattern space by #matchee itself. s/\('"$M"'\).*/\1/ thus shortening a long record to 45 lines.
Print the range of records. p
All non-range records pass through untouched.
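The pad-or-truncate step is easier to see without sed's branching. The following Python sketch assumes the delimiter should be the last line of a fixed-size record, as the question asks ("record delimiter on line 45"); the helper names split_records and fit_record are hypothetical.

```python
def split_records(lines, delimiter="#matchee"):
    """Yield the body of each record, without its delimiter line."""
    record = []
    for line in lines:
        if line == delimiter:
            yield record
            record = []
        else:
            record.append(line)


def fit_record(body, delimiter="#matchee", size=45):
    """Return exactly `size` lines: the body truncated or padded with
    empty lines, followed by the delimiter as the last line."""
    kept = body[: size - 1]                    # long record: cut it down
    padding = [""] * (size - 1 - len(kept))    # short record: pad with blanks
    return kept + padding + [delimiter]


raw = ["", "DEPARTMENT-A", "Office space 206", "#matchee",
       "", "DEPARTMENT-B", "#matchee"]
for body in split_records(raw):
    fitted = fit_record(body, size=5)          # 5 instead of 45 to keep the demo short
    print(len(fitted), fitted[-1])
```

Both records come out five lines long with #matchee on the last line, whatever their original length.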
TXR Solution ( http://www.nongnu.org/txr )
For illustration purposes using the fake data, I shorten the requirement from 40 lines to 12 lines. We find records beginning with a department name, delimited by #matchee. We dump them, chopped to no more than 12 lines, with #matchee added again.
@(collect)
@ (all)
@dept
@ (and)
@ (collect)
@line
@ (until)
#matchee
@ (end)
@ (end)
@(end)
@(output)
@ (repeat)
@{line[0..12] "\n"}
#matchee
@ (end)
@(end)
Here, the dept variable is expected to come from a -D command line option, but of course the code can be changed to accept it as an argument and put out a usage if it is missing.
Run on the sample data:
$ txr -Ddept=DEPARTMENT-A trim-extract.txr mess-o-records.txt
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
#matchee
The blank lines before DEPARTMENT-A are gone, and there are exactly 12 lines, which happen to include one line of the <too many ...> junk.
Note that the semantics of @(until) is such that the #matchee is excluded from the collected material. So it is correct to unconditionally add it in the @(output) clause. This program will work even if a record happens to be shorter than 12 lines before #matchee is found.
It will not match a record if #matchee is not found.