comparing lines with awk vs while read line

I have two files, one with 17k lines and another with 4k lines. I want to compare positions 115 to 125 of each line in the first file with each line in the second file and, if there is a match, write the entire line from the first file into a new file. I came up with a solution that reads the file using 'cat $filename | while read LINE', but it takes around 8 minutes to complete. Is there another way, such as using 'awk', to reduce the processing time?
My code:
cat $filename | while read LINE
do
#read 115 to 125 and then remove trailing spaces and leading zeroes
vid=`echo "$LINE" | cut -c 115-125 | sed 's,^ *,,; s, *$,,' | sed 's/^[0]*//'`
exist=0
#match vid with entire line in id.txt
exist=`grep -x "$vid" $file_dir/id.txt | wc -l`
if [[ $exist -gt 0 ]]; then
echo "$LINE" >> $dest_dir/id.txt
fi
done

How about this:
FNR==NR { # FNR == NR is only true in the first file
    s = substr($0,115,11) # Store the section of the line we are interested in (columns 115-125)
    sub(/^\s*/,"",s) # Remove any leading whitespace
    sub(/\s*$/,"",s) # Remove any trailing whitespace
    sub(/^0+/,"",s) # Remove leading zeroes, as the original script does
    lines[s]=$0 # Create array of lines
    next # Get next line in first file
}
{ # Now in second file
    for(i in lines) # For each line in the array
        if (i~$0) { # If matches the current line in second file
            print lines[i] # Print the matching line from file1
            next # Get next line in second file
        }
}
Save it to a script script.awk and run like:
$ awk -f script.awk "$filename" "${file_dir}/id.txt" > "${dest_dir}/id.txt"
This will still be slow because for each line in the second file you need to look at roughly 50% of the unique lines in the first (assuming most lines do in fact match). This can be significantly improved if you can confirm that the lines in the second file are full-line matches against the substrings.
For full-line matches this should be faster:
FNR==NR { # FNR == NR is only true in the first file
    s = substr($0,115,11) # Store the section of the line we are interested in (columns 115-125)
    sub(/^\s*/,"",s) # Remove any leading whitespace
    sub(/\s*$/,"",s) # Remove any trailing whitespace
    sub(/^0+/,"",s) # Remove leading zeroes, as the original script does
    lines[s]=$0 # Create array of lines
    next # Get next line in first file
}
($0 in lines) { # Now in second file
    print lines[$0] # Print the matching line from file1
}
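For what it's worth, here is a minimal sketch of a third variant, assuming (as with the grep -x in the question) that a line of id.txt must equal the trimmed field exactly. It loads the smaller id file into an array and then streams the large file, so every matching line is printed in its original order, including several lines that share the same id (an assignment like lines[s]=$0 keeps only the last such line). The function name match_ids is made up for this example; columns 115-125 inclusive are 11 characters, hence the 11 passed to substr.

```shell
# match_ids ID_FILE DATA_FILE   (hypothetical helper name)
# Print every line of DATA_FILE whose columns 115-125, trimmed of spaces
# and leading zeroes, occur as a whole line in ID_FILE.
match_ids() {
    awk '
        FNR == NR { ids[$0]; next }      # first file: remember every id
        {
            key = substr($0, 115, 11)    # columns 115..125 inclusive
            gsub(/^ +| +$/, "", key)     # trim leading and trailing spaces
            sub(/^0+/, "", key)          # drop leading zeroes
            if (key in ids) print        # whole line from the data file
        }
    ' "$1" "$2"
}
```

It could be called as match_ids "$file_dir/id.txt" "$filename" > "$dest_dir/id.txt". One caveat of the FNR == NR idiom: if the id file is empty, the condition stays true for the data file and nothing is printed.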

Related

Append to line that is preceded AND followed by empty line

I need to append an asterisk to a line, but only if said line is preceded and followed by empty lines (FYI, said empty lines will NOT have any white space in them).
Suppose I have the following file:
foo
foo
foo
foo
foo
I want the output to look like this:
foo
foo
foo
foo*
foo
I tried modifying the following awk command (found here):
awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
to suit my uses, but got all tied up in knots.
Sed or Perl solutions are, of course, welcome also!
UPDATE:
It turned out that the question I asked was not quite correct. What I really needed was code that would append text to non-empty lines that do not start with whitespace AND are followed, two lines down, by non-empty lines that also do not start with whitespace.
For this revised problem, suppose I have the following file:
foo
third line foo
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foo
eleventh line foo
this line starts with a space foo
last line foo
I want the output to look like this:
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
For that, this sed one-liner does the trick:
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Thanks to Benjamin W.'s clear and informative answer below, I was able to cobble this one-liner together!

A sed solution:
$ sed '1N;N;s/^\(\n.*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
foo
foo*
foo
N;P;D is the idiomatic way to look at two lines at the same time by appending the next one to the pattern space, then printing and deleting the first line.
1N;N;P;D extends that to always having three lines in the pattern space, which is what we want here.
The substitution matches if the first and last line are empty (^\n and \n$) and appends one * to the line between the empty lines.
Notice that this matches and appends a * also for the second line of three empty lines, which might not be what you want. To make sure this doesn't happen, the first capture group has to have at least one non-whitespace character:
sed '1N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
Question from comment
Can we not append the * if the line two above begins with abc?
Example input file:
foo
foo
abc
foo
foo
foo
foo
There are three foo between empty lines, but the first one should not get the * appended because the line two above starts with abc. This can be done as follows:
$ sed '1{N;N};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
abc
foo
foo*
foo*
foo
This keeps four lines at a time in the pattern space and only makes the substitution if the pattern space does not start with abc:
1 { # On the first line
N # Append next line to pattern space
N # ... again, so there are three lines in pattern space
}
N # Append fourth line
/^abc/! # If the pattern space does not start with abc...
s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/ # Append '*' to 3rd line in pattern space
P # Print first line of pattern space
D # Delete first line of pattern space, start next cycle
Two remarks:
BSD sed requires an extra semicolon: 1{N;N;} instead of 1{N;N}.
If the first and third line of the file are empty, the second line does not get an asterisk appended because we only start checking once there are four lines in the pattern space. This could be solved by adding an extra substitution into the 1{} block:
1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/}
(remember the extra ; for BSD sed), but trying to cover all edge cases makes sed even less readable, especially in one-liners:
sed '1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile

One way to think about these problems is as a state machine.
start: state = 0
0: /* looking for a blank line */
if (blank line) state = 1
1: /* saw leading blank line(s) */
    if (not blank line) {
        nonblank = line
        state = 2
    }
2: /* saw non-blank line */
    if (blank line) {
        output nonblank*
        state = 1
    } else {
        output nonblank
        state = 0
    }
And we can translate this pretty directly to an awk program:
BEGIN {
state = 0; # start in state 0
}
state == 0 { # looking for a (leading) blank line
print;
if (length($0) == 0) { # found one
state = 1;
next;
}
}
state == 1 { # have a leading blank line
if (length($0) > 0) { # found a non-blank line
saved = $0; # save it
state = 2;
next;
} else {
print; # multiple leading blank lines (ok)
}
}
state == 2 { # saw the non-blank line
if (length($0) == 0) { # followed by a blank line
print saved "*"; # BINGO!
state = 1; # to the saw a blank-line state
} else { # nope, consecutive non-blank lines
print saved; # as-is
state = 0; # to the looking for a blank line state
}
print;
next;
}
END { # cleanup, might have something saved to show
if (state == 2) print saved;
}
This is not the shortest way, nor likely the fastest, but it's probably the most straightforward and easy to understand.
EDIT
Here is a comparison of Ed's way and mine (see the comments under his answer for context). I replicated the OP's input a million-fold and then timed the runs:
# ls -l
total 22472
-rw-r--r--. 1 root root 111 Mar 13 18:16 ed.awk
-rw-r--r--. 1 root root 23000000 Mar 13 18:14 huge.in
-rw-r--r--. 1 root root 357 Mar 13 18:16 john.awk
# time awk -f john.awk < huge.in > /dev/null
2.934u 0.001s 0:02.95 99.3% 0+0k 112+0io 1pf+0w
# time awk -f ed.awk huge.in huge.in > /dev/null
14.217u 0.426s 0:14.65 99.8% 0+0k 272+0io 2pf+0w
His version took about 5 times as long, did twice as much I/O, and (not shown in this output) took 1400 times as much memory.
EDIT from Ed Morton:
For those of us unfamiliar with the output of whatever time command John used above, here's the 3rd-invocation results from the normal UNIX time program on cygwin/bash using GNU awk 4.1.3:
$ wc -l huge.in
1000000 huge.in
$ time awk -f john.awk huge.in > /dev/null
real 0m1.264s
user 0m1.232s
sys 0m0.030s
$ time awk -f ed.awk huge.in huge.in > /dev/null
real 0m1.638s
user 0m1.575s
sys 0m0.030s
so if you'd rather write 37 lines than 3 lines to save a third of a second on processing a million line file then John's answer is the right one for you.
EDIT#3
It's the standard "time" built-in from tcsh/csh. And even if you didn't recognize it, the output should be intuitively obvious. And yes, boys and girls, my solution can also be written as a short incomprehensible mess:
s == 0 { print; if (length($0) == 0) { s = 1; next; } }
s == 1 { if (length($0) > 0) { p = $0; s = 2; next; } else { print; } }
s == 2 { if (length($0) == 0) { print p "*"; s = 1; } else { print p; s = 0; } print; next; }
END { if (s == 2) print p; }

Here's a perl filter version, for the sake of illustration — hopefully it's clear to see how it works. It would be possible to write a version that has a lower input-output delay (2 lines instead of 3) but I don't think that's important.
my @lines;
while (<>) {
    # Keep three lines in the buffer, print them as they fall out
    push @lines, $_;
    print shift @lines if @lines > 3;
    # If a non-empty line occurs between two empty lines...
    if (@lines == 3 && $lines[0] =~ /^$/ && $lines[2] =~ /^$/ && $lines[1] !~ /^$/) {
        # place an asterisk at the end
        $lines[1] =~ s/$/*/;
    }
}
# Flush the buffer at EOF
print @lines;

A perl one-liner
perl -0777 -lne's/(?<=\n\n)(.*?)(\n\n)/$1\*$2/g; print' ol.txt
The -0777 "slurps" in the whole file, assigned to $_, on which the (global) substitution is run and which is then printed.
The lookbehind (?<=text) is needed for repeating patterns, [empty][line][empty][line][empty]. It is a "zero-width assertion" that only checks that the pattern is there without consuming it. That way the pattern stays available for next matches.
Such consecutive repeating patterns trip up the s/(\n\n)(.*?)(\n\n)/$1$2\*$3/ substitution posted initially, since the trailing \n\n, having just been matched, is not considered for the start of the very next pattern.

Update: My solution also fails after two consecutive matches as described above and needs the same lookbehind: s/(?<=\n\n)(\w+)\n\n/$1*\n\n/mg;
The easiest way is to use multi-line match:
local $/; ## slurp mode
$file = <DATA>;
$file =~ s/\n\n(\w+)\n\n/\n\n\1*\n\n/mg;
print $file;
__DATA__
foo
foo
foo
foo
foo

It's simplest and clearest to do this in 2 passes:
$ cat tst.awk
NR==FNR { nf[NR]=NF; nr=NR; next }
FNR>1 && FNR<nr && NF && !nf[FNR-1] && !nf[FNR+1] { $0 = $0 "*" }
{ print }
$ awk -f tst.awk file file
foo
foo
foo
foo*
foo
The above takes one pass to record the number of fields on each line (NF is zero for an empty line) and then the second pass just checks your requirements - the current line is not the first or last in the file, it is not empty and the lines before and after are empty.

alternative awk solution (single pass)
$ awk 'NR>2 && !pp && !NF {p=p"*"}
NR>1{print p}
{pp=length(p);p=$0}
END{print p}' foo
foo
foo
foo
foo*
foo
Explanation: printing is deferred by one line so that the decision can be made once the following line is known. The previous line is kept in p and the length of the line before that in pp (a length of zero means that line was empty). After each line the bookkeeping assignments are updated, and the END block prints the last line.

copying every nth line to a new line

I have a txt file in which I need to copy the 1st line of every four lines and append it to the 3rd line of those four, then write the result to a new txt file.
e.g
#CR5SM:00004:00029
TTTTCTCTTTCTTTCTT
+
>>>/>#99419BAAABB
#CR5SM:00005:00026
ATTATAGAGGGATAG
+
;969999999-4;BB
change it to this:
#CR5SM:00004:00029
TTTTCTCTTTCTTTCTT
+CR5SM:00004:00029
>>>/>#99419BAAABB
#CR5SM:00005:00026
ATTATAGAGGGATAG
+CR5SM:00005:00026
;969999999-4;BB
I have tried using awk but can't seem to find the correct commands to do this.
Does anyone have any solutions? Thanks

Using awk:
$ awk '/^#/{a=substr($0,2)}/^\+/{$0=$0 a}1' file
#CR5SM:00004:00029
TTTTCTCTTTCTTTCTT
+CR5SM:00004:00029
>>>/>#99419BAAABB
#CR5SM:00005:00026
ATTATAGAGGGATAG
+CR5SM:00005:00026
;969999999-4;BB
You can redirect the output to another file by saying:
awk '/^#/{a=substr($0,2)}/^\+/{$0=$0 a}1' file > newfile
We use the substr function to capture the lines that start with # from second character onwards until the end of the line.
We look for lines that start with + (notice we escape it since it is a meta-character). Once we find that line, we append our captured line to the existing line.
1 at the end allows us to print the lines.

Try:
awk '
# first line of every group of four: save it without its first character, print and continue
(NR-1) % 4 == 0 { l=substr($0,2); print; next }
# third line of every group of four: append the saved text, print and continue
(NR-1) % 4 == 2 { print $0 l; next }
# all other lines: print as is
{ print }' infile > outfile

This might work for you (GNU sed):
sed 'h;n;n;G;s/\n.//;n' file
This copies the first line to the hold space, prints the first and second lines, appends the held line to the third line and removes the joining newline together with the first character of the held line, then prints the result and the fourth line, and repeats.

Search for a particular multiline pattern using awk and sed

I want to read from the file /etc/lvm/lvm.conf and check for the below pattern that could span across multiple lines.
tags {
hosttags = 1
}
There could be as many white spaces between tags and {, { and hosttags and so forth. Also { could follow tags on the next line instead of being on the same line with it.
I'm planning to use awk and sed to do this.
While reading the file lvm.conf, it should skip empty lines and comments.
I'm doing that with:
data=$(awk < cat `cat /etc/lvm/lvm.conf`
/^#/ { next }
/^[[:space:]]*#/ { next }
/^[[:space:]]*$/ { next }
.
.
How can I use sed to find the pattern I described above?

Are you looking for something like this
sed -n '/{/,/}/p' input
i.e. print lines between tokens (inclusive)?
To delete lines containing # as well as empty lines and lines containing only whitespace (spaces or tabs), use
sed -n '/{/,/}/p' input | sed '/#/d' | sed '/^[[:space:]]*$/d'
update
If empty lines are just empty lines (no ws), the above can be shortened to
sed -e '/#/d' -e '/^$/d' input
update2
To check if the pattern tags {... is present in file, use
$ tr -d '\n' < input | grep -o 'tags\s*{[^}]*}'
tags { hosttags = 1# this is a comment}
The tr part above removes all newlines, i.e. it turns the whole file into one single line (this works well as long as the file isn't too large); grep then searches for the tags pattern and outputs all matches.
The return code from grep will be 0 if the pattern was found, 1 if not.
The return code is stored in the variable $?. Alternatively, pipe the above to wc -l to get the number of matches found.
update3
Regex for searching for tags { hosttags = 1 } with any amount of whitespace between the tokens (the [^}]* allows trailing text such as a comment before the closing brace):
'tags\s*{\s*hosttags\s*=\s*1\b[^}]*}'
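As a rough, portable alternative that avoids the GNU-only \s and grep -o, the same check can be sketched in plain awk. The function name has_hosttags is hypothetical, and the sketch assumes that # only ever starts a comment (never appears inside a quoted value) and that nothing but whitespace surrounds hosttags = 1 inside the braces. It strips comments, joins all lines with spaces and then tests the joined text once.

```shell
# has_hosttags FILE   (hypothetical helper name)
# Exit status 0 if FILE contains a "tags { hosttags = 1 }" block,
# possibly spread over several lines; 1 otherwise.
has_hosttags() {
    awk '
        { sub(/#.*/, "") }           # drop comments (assumption: "#" always starts one)
        { buf = buf " " $0 }         # join all lines, separated by a space
        END {
            found = (buf ~ /(^|[^A-Za-z0-9_])tags[ \t]*\{[ \t]*hosttags[ \t]*=[ \t]*1[ \t]*\}/)
            exit !found              # 0 when the block was found
        }
    ' "$1"
}
```

Usage would then look like: if has_hosttags /etc/lvm/lvm.conf; then echo "hosttags enabled"; fi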

try this line:
awk '/^\s*#|^\s*$/{next}1' /etc/lvm/lvm.conf

One could try preprocessing the file first, removing comments and empty lines and inserting an empty line after each closing curly brace, so that the second awk can process the result in paragraph mode (RS=).
awk 'NF && $1!~/^#/{print; if(/}/) print x}' file | awk '/pattern/' RS=

Match a string from File1 in File2 and replace the string in File1 with corresponding matched string in File2

The title may be confusing, here's what I'm trying to do:
File1
12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5
File2
Tom 12 John 921 Mike 813
Output
Tom=John:5,Mike:5
The file2 has the values of the numbers in file1, and I want match and replace the numbers with string values. I tried this with my limited knowledge in awk, but couldn't do it.
Any help appreciated.

Here's one way using GNU awk. Run like:
awk -f script.awk file1 file2
Contents of script.awk:
BEGIN {
FS="[ =:,]"
}
FNR==NR {
a[$1]=$0
next
}
$2 in a {
split(a[$2],b)
for (i=3;i<=NF-1;i+=2) {
for (j=2;j<=length(b)-1;j+=2) {
if ($(i+1) == b[j]) {
line = (line ? line "," : "") $i ":" b[j+1]
}
}
}
print $1 "=" line
line = ""
}
Results:
Tom=John:5,Mike:5
Alternatively, here's the one-liner:
awk -F "[ =:,]" 'FNR==NR { a[$1]=$0; next } $2 in a { split(a[$2],b); for (i=3;i<=NF-1;i+=2) for (j=2;j<=length(b)-1;j+=2) if ($(i+1) == b[j]) line = (line ? line "," : "") $i ":" b[j+1]; print $1 "=" line; line = "" }' file1 file2
Explanation:
Change awk's field separator to either a space, equals sign, colon or comma.
'FNR==NR { ... }' is only true for the first file in the arguments list.
So when processing file1, awk will add column '1' to an array and we assign the whole line as a value to this array element.
'next' will simply skip processing the rest of the script, and read the next line of input.
When awk has finished reading the input in file1, it will continue reading file2. However, this also resets 'FNR' to '1', so awk will skip processing the 'FNR==NR' block for file2 because it is no longer true.
So for file2: if column '2' can be found in the array mentioned above:
Split the value of the array element into another array. This essentially splits up the whole line in file1.
Now create two loops.
The first will loop through all the names in file2
And the second will loop through all the values in the (second) array (this essentially loops over all the fields in file1).
Now when a value succeeding a name in file2 is equal to one of the key numbers in file1, create a line construct that looks like: 'name:number_following_key_number_from_file1'.
When more names and values are found during the loops, the ternary construct '( ... ? ... : ...)' adds these elements onto the end of the line. It's like an if statement; if there's already a line, add a comma onto the end of it, else don't add anything.
When all the loops are complete, print out column '1' and the line. Then empty the line variable so that it can be used again.
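For comparison, here is a sketch that avoids the nested loops and the GNU-specific length(array) by first building an id-to-name lookup table from file2 and then rewriting file1. It assumes file2 always consists of name/id pairs separated by whitespace and file1 has the form key=id:n,id:n,...; the function name map_ids is invented for this example.

```shell
# map_ids FILE1 FILE2   (hypothetical helper name)
map_ids() {
    awk '
        FNR == NR {                       # FILE2 is read first: "name id name id ..."
            for (i = 1; i < NF; i += 2) name[$(i + 1)] = $i
            next
        }
        {                                 # FILE1: "key=id:n,id:n,..."
            eq  = index($0, "=")
            key = substr($0, 1, eq - 1)
            if (!(key in name)) next      # skip lines whose key has no name
            out = ""
            n = split(substr($0, eq + 1), pair, ",")
            for (i = 1; i <= n; i++) {
                split(pair[i], kv, ":")   # kv[1] = id, kv[2] = number after the colon
                if (kv[1] in name)
                    out = out (out == "" ? "" : ",") name[kv[1]] ":" kv[2]
            }
            print name[key] "=" out
        }
    ' "$2" "$1"
}
```

With the sample files from the question, map_ids file1 file2 prints Tom=John:5,Mike:5. Ids without a name in file2 are silently dropped, which is one possible reading of the desired output.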
HTH. Goodluck.

The following may work as a template:
skrynesaver@busybox ~/ perl -e '$values="12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5";
$data = "Tom 12 John 921 Mike 813";
($line,$values)=split/=/,$values;
@values=split/,/,$values;
$values{$line}="=";
map{$_=~/(\d+)(:\d+)/;$values{$1}="$2";}@values;
if ($data=~/\w+\s$line\s/){
$data=~s/(\w+)\s(\d+)\s?/$1$values{$2}/g;
}
print "$data\n";
'
Tom=John:5Mike:5
skrynesaver@busybox ~/

sed, awk or perl: Pattern range match, print 45 lines then add record delimiter

I have a file containing records delimited by the pattern /#matchee/. These records are of varying lengths ...say 45 - 75 lines. They need to ALL be 45 lines and still maintain the record delimiter. Records can be from different departments, department name is on line 2 following a blank line. So record delimiter could be thought of as simply /^#matchee/ or /^matchee/ followed by \n. There is a Deluxe edition of this problem and a Walmart edition ...
DELUXE EDITION
Pull each record by pattern range so I can sort records by department. Eg., with sed
sed -n '/^DEPARTMENT NAME/,/^#matchee/{p;}' mess-o-records.txt
Then, Print only the first 45 lines of each record in the file to conform to
the 45 line constraint.
Finally, make sure the result still has the record delimiter on line 45.
WALMART EDITION
Same as above, but instead of using a range, just use the record delimiter.
STATUS
My attempt at this might clarify what I'm trying to do.
sed -n -e '/^DEPARTMENT-A/,/^#matchee/{p;}' -e '45q' -e '$s/.*/#matchee/' mess-o-records.txt
This doesn't work, of course, because sed is operating on the entire file at each command.
I need it to operate on each range match not the whole file.
SAMPLE INPUT - 80 Lines ( truncated for space )
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
<way too many lines here>
#matchee
SAMPLE OUTPUT - now only 45 lines
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<Record now equals exactly 45 lines>
<yet record delimiter is maintained>
#matchee
CLARIFICATION UPDATE
I will never need more than the first 40 lines if this makes things easier. Maybe the process would be:
Match pattern(s)
Print first 40 lines.
Pad to appropriate length. Eg., 45 lines.
Tack delimiter back on. Eg., #matchee
I think this would be more flexible -- Ie., can handle record shorter than 45 lines.
Here's a riff based on @Borodin's Perl example below:
my $count = 0;
$/ = "#matchee";
while (<>) {
if (/^REDUNDANCY.*DEPT/) {
print;
$count = 0;
}
else {
print if $count++ < 40;
print "\r\n" x 5;
print "#matchee\r\n";
}
}
This adds 5 newlines to each record plus the delimiting pattern /#matchee/. So it's wrong -- but it illustrates what I want.
Print 40 lines based on department -- pad -- tack delimiter back on.

I think I understand what you want. Not sure about the bit about pull each record by pattern range. Is #matchee always followed by a blank line and then the department line? So in fact record number 2?
This Perl fragment does what I understand you need.
If you prefer you can put the input file on the command line and drop the open call. Then the loop would have to be while (<>) { ... }.
Let us know if this is right so far, and what more you need from it.
use strict;
use warnings;
open my $fh, '<', 'mess-o-records.txt' or die $!;
my $count = 0;
while (<$fh>) {
if (/^#matchee/) {
print;
$count = 0;
}
else {
print if $count++ < 45;
}
}
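If the clarified process from the question is what is wanted (keep the first 40 lines, pad to 45, tack the delimiter back on), a line-by-line sketch in awk could look like the following. The function name fix_records and its two numeric parameters are made up for this example, and it assumes the number of kept lines is smaller than the total record length and that every record ends with a line starting with #matchee.

```shell
# fix_records KEEP TOTAL [FILE...]   (hypothetical helper name)
# Keep at most KEEP content lines per record and pad with blank lines so
# that each record, delimiter included, is exactly TOTAL lines long.
fix_records() {
    keep=$1; total=$2; shift 2
    awk -v keep="$keep" -v total="$total" '
        /^#matchee/ {                                  # delimiter closes the record
            while (n < total - 1) { print ""; n++ }    # pad short records
            print                                      # delimiter is line number TOTAL
            n = 0
            next
        }
        n < keep { print; n++ }                        # lines beyond KEEP are dropped
    ' "$@"
}
```

For the numbers in the question this would be run as fix_records 40 45 mess-o-records.txt. Lines after the last delimiter are printed (up to KEEP of them) without padding, since no delimiter closes that record.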

I know this has already had an accepted answer, but I figured I'd post an awk example for anyone interested. It's not 100%, but it gets the job done.
Note This numbers the lines so you can verify the script is working as expected. Remove the i, from print i, current[i] to remove the line numbers.
dep.awk
BEGIN { RS = "#matchee\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= 45; i++) {
print i, current[i];
}
print "#matchee\n"
}
In this example, you begin the script by setting the record separator (RS) to "#matchee\n\n". There are two newlines because the first ends the line on which #matchee occurs and the second is the blank line on its own.
The match validates that a record contains letters or numbers to be valid. You could also check that the match starts with 'DEPARTMENT-', but this would fail if there is a stray newline. Checking the content is the safest route. Because this uses a block record (i.e., DEPARTMENT-A through #matchee), you could either pass $0 through awk or sed again, or use the awk split function and loop through 45 lines. In awk, the arrays aren't zero-indexed.
The print function includes a newline, so the block ends with print "#matchee\n" only instead of the double \n in the record separator variable.
You could also drop the same awk script into a bash script and change the number of lines and field separator. Of course, you should add validations and whatnot, but here's the start:
dep.sh
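#!/bin/bash
# Prints the first n lines within every block of text delimited by splitter.
# Usage: dep.sh SPLITTER NUMLINES FILE
splitter=$1
numlines=$2
# Pass the shell values in with -v instead of splicing them into the awk source
awk -v splitter="$splitter" -v numlines="$numlines" '
BEGIN { RS = splitter "\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
    split($0, current, "\n")
    for (i = 1; i <= numlines; i++) {
        print i, current[i]
    }
    print splitter "\n"
}' "$3"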
Make the script executable and run it.
./dep.sh '#matchee' 45 input.txt > output.txt
I added these files to a gist so you could also verify the output

This might work for you:
D="DEPARTMENT-A" M="#matchee"
sed '/'"$D/,/$M"'/{/'"$D"'/{h;d};H;/'"$M"'/{x;:a;s/\n/&'"$M"'/45;tb;s/'"$M"'/\n&/;ta;:b;s/\('"$M"'\).*/\1/;p};d}' file
Explanation:
Focus on range of lines /DEPARTMENT/,/#matchee/
At start of range move pattern space (PS) to hold space (HS) and delete PS /DEPARTMENT/{h;d}
All subsequent lines in the range append to HS and delete H....;d
At end of range:/#matchee/
Swap to HS x
Test for 45 lines in range and if successful append #matchee at the 45th line s/\n/&#matchee/45
If previous substitution was successful branch to label b. tb
If previous substitution was unsuccessful insert a linefeed before #matchee s/'"$M"'/\n&/ thus lengthening a short record to 45 lines.
Branch to label a and test for 45 lines etc . ta
Replace everything from the first occurrence of #matchee to the end of the pattern space with #matchee itself, s/\('"$M"'\).*/\1/, thus shortening a long record to 45 lines.
Print the range of records. p
All non-range records pass through untouched.

TXR Solution ( http://www.nongnu.org/txr )
For illustration purposes using the fake data, I shorten the requirement from 40 lines to 12 lines. We find records beginning with a department name, delimited by #matchee. We dump them, chopped to no more than 12 lines, with #matchee added again.
@(collect)
@  (all)
@dept
@  (and)
@    (collect)
@line
@    (until)
#matchee
@    (end)
@  (end)
@(end)
@(output)
@  (repeat)
@{line[0..12] "\n"}
#matchee
@  (end)
@(end)
Here, the dept variable is expected to come from a -D command line option, but of course the code can be changed to accept it as an argument and put out a usage if it is missing.
Run on the sample data:
$ txr -Ddept=DEPARTMENT-A trim-extract.txr mess-o-records.txt
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
#matchee
The blank lines before DEPARTMENT-A are gone, and there are exactly 12 lines, which happen to include one line of the <too many ...> junk.
Note that the semantics of @(until) is such that the #matchee is excluded from the collected material. So it is correct to unconditionally add it in the @(output) clause. This program will work even if a record happens to be shorter than 12 lines before #matchee is found.
It will not match a record if #matchee is not found.