I have a flat file of records, each 33 lines long. I need to format this file to specs in a template. The template is in DOS format while the source file is in NIX format. The template has specific indenting and spacing which must be adhered to. I've thought of a few options:
BASH with classic nix tools: sed, awk, grep etc...
BASH with Template Toolkit
Perl with Template Toolkit
Perl
These are in order of my familiarity. Here's a sample source record ( NIX format ):
I've reduced the number of newlines to save space ( normally 33 lines ):
JACKSON HOLE SANITARIUM AND REPTILE ZOO
45 GREASY HOLLER LN
JACKSON HOLE, AK 99999
Change Service Requested
BUBBA HOTEP
3 DELIVERANCE RD
MINNEAPOLIS, MN 99998
BUBBA HOTEP 09090909090909
You have a hold available for pickup as of 2012-01-04:
Title: Banjo for Fun and Profit
Author: Williams, Billy Dee
Price: $10
Here's the template ( DOS format -- lines reduced - 66 lines normally):
<%BRANCH-NAME%>
<%BRANCH-ADDR%>
<%BRANCH-CTY%>
<%CUST-NAME%> <%BARCODE%>
You have a hold available for pickup as of <%DATE%>:
Title: <%TITLE%>
Author: <%AUTHOR%>
Price: <%PRICE%>
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY%>
end of file
It actually does say "end of file" at the end of each record.
Thoughts? I tend to over-complicate things.
UPDATE 2
Figured it out.
My answer is below. Feel free to suggest improvements.
As a starter, here is a hint: Perl HERE-documents (showing just a few substitutions as a demo):
#!/usr/bin/perl
use strict;
use warnings;
my @lines = qw/branchname cust_name barcode bogus whatever/; # (<>)
my ($branchname, $cust_name, $barcode, undef, $whatever) = @lines;
print <<TEMPLATE;
$branchname
<%BRANCH-ADDR%>
<%BRANCH-CTY%>
$cust_name $barcode
You have a hold available for pickup as of <%DATE%>:
Title: <%TITLE%>
Author: <%AUTHOR%>
Price: <%PRICE%>
$cust_name
<%CUST-ADDR%>
<%CUST-CTY%>
end of file
TEMPLATE
Replace the dummy input array with the lines read from stdin via (<>) if you like. (Use a loop that reads n lines and pushes them onto the array if that is more efficient; see the sketch below.) I just showed the gist: add more variables as required, and skip input lines by specifying undef for the 'capture' variable (as shown).
Now, simply interpolate these variables into your text.
If line endings are giving you any grief, consider using chomp, e.g.:
my @lines = <>;  # just read 'em all...
chomp @lines;    # strips newlines in place (note: map { chomp } @lines would return counts, not the cleaned strings)
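Since each record is a fixed 33 lines, here is a minimal sketch of the loop-reading variant mentioned above. The field positions are illustrative assumptions, not the real offsets:

#!/usr/bin/perl
use strict;
use warnings;

my $REC_LEN = 33; # lines per record

while (!eof()) {
    # read exactly one record's worth of lines
    my @rec;
    for (1 .. $REC_LEN) {
        my $line = <>;
        last unless defined $line;
        chomp $line;
        push @rec, $line;
    }
    # pick fields by position; the indexes below are placeholders
    my ($branchname, $cust_name) = @rec[0, 4];
    print <<"TEMPLATE";
$branchname
$cust_name
end of file
TEMPLATE
}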
This is what I am using for this project. Feel free to suggest improvements or submit better solutions.
cp "$FILE" "$WORKING" # we won't mess with the original
NUM_RECORDS=$( grep -c "^Price:" "$FILE" ) # need to know how many records we have,
# counted as occurrences of the end-of-record regex
TMP=record.txt # holds single record, used as temp storage in loop below
# Sanity
# Make sure temp storage exists. If not create -- if so, clear it.
: > "$TMP" # creates it if missing, truncates it if present
# functions
function make_template () {
local _file="$1"
mapfile -t filecontent < "$_file"
_loc_name="${filecontent[0]}"
_loc_strt="${filecontent[1]}"
_loc_city="${filecontent[2]}"
_pat_name="${filecontent[14]}"
_pat_addr="${filecontent[15]}"
_pat_city="${filecontent[16]}"
_barcode=${filecontent[27]:(-14)} # pull barcode from end of string
_date=${filecontent[29]:(-11)} # pull date from end of string
# Test title length - truncate if necessary - 70 chars.
_title=$(grep -E "^Title:" "$_file")
MAXLEN=70
[ "${#_title}" -gt "$MAXLEN" ] && _title="${filecontent[31]:0:$MAXLEN}" || :
_auth=$(grep -E "^Author:" "$_file")
_price=$(grep -E "^Price:" "$_file")
sed "
s#<%BRANCH-NAME%>#${_loc_name}#g
s#<%BRANCH-ADDR%>#${_loc_strt}#g
s#<%BRANCH-CTY%>#${_loc_city}#g
s#<%CUST-NAME%>#${_pat_name}#g
s#<%CUST-ADDR%>#${_pat_addr}#
s#<%CUST-CTY%>#${_pat_city}#
s#<%BARCODE%>#${_barcode}#g
s#<%DATE%>#${_date}#
s#<%TITLE%>#${_title}#
s#<%AUTHOR%>#${_auth}#
s#<%PRICE%>#${_price}#" "$TEMPLATE"
}
####################################
# MAIN
####################################
for ((i = 1; i <= NUM_RECORDS; i++))
do
sed -n '1,/^Price:/{p;}' "$WORKING" >"$TMP" # copy first record with end of record
# and copy to temp storage.
sed -i '1,/^Price:/d' "$WORKING" # delete first record using EOR regex.
make_template "$TMP" # send temp file/record to template fu
done
# cleanup
exit 0
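Note: the script assumes FILE, WORKING, and TEMPLATE are defined near the top. Something like this, with placeholder names:

#!/bin/bash
FILE=records.txt      # original source file (33-line records) -- placeholder path
WORKING=working.txt   # scratch copy that the main loop consumes record by record
TEMPLATE=template.txt # the DOS-format template containing the <%...%> tokens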
Related
I have many files in a folder with the format '{galaxyID}-cutout-HSC-I-{#}-pdr2_wide.fits', where {galaxyID} and {#} are different numbers for each file. Here are some examples:
2185-cutout-HSC-I-9330-pdr2_wide.fits
992-cutout-HSC-I-10106-pdr2_wide.fits
2186-cutout-HSC-I-9334-pdr2_wide.fits
I want to change the format of all files in this folder to match the following:
2185_HSC-I.fits
992_HSC-I.fits
2186_HSC-I.fits
namely, I want to take out "cutout", the second number, and "pdr2_wide" from each file name. I would prefer to do this in either Perl or Python. For my Perl script, so far I have the following:
rename [-n];
my @parts=split /-/;
my $this=$parts[0].$parts[1].$parts[2].$parts[3].$parts[4].$parts[5];
$_ = $parts[0]."_".$parts[2]."_".$parts[3];
*fits
which gives me the error message
Not enough arguments for rename at ./rename.sh line 3, near "];" Execution of ./rename.sh aborted due to compilation errors.
I included the [-n] because I want to make sure the changes are what I want before actually doing it; either way, this is in a duplicated directory just for safety.
It looks like you are using the rename you get on Ubuntu (it's not the one that's on my ArchLinux box), but there are other ones out there. But, you've presented it oddly. The brackets around -n shouldn't be there and the ; ends the command.
The syntax, if you are using what I think you are, is this:
% rename -n -e PERL_EXPR file1 file2 ...
The Perl expression is the argument to the -e switch, and can be a simple substitution. Note that this expression is a string that you give to -e, so that probably needs to be quoted:
% rename -n -e 's/-\d+-pdr2_wide//' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-cutout-HSC-I.fits)
And, instead of doing this in one step, I'd do it in two:
% rename -n -e 's/-cutout-/-/; s/-\d+-pdr2_wide//' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-HSC-I.fits)
There are other patterns that might make sense. Instead of taking away parts, you can keep parts:
% rename -n -e 's/\A(\d+).*(HSC-I).*/$1-$2.fits/' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-HSC-I.fits)
I'd be inclined to use named captures so the next poor slob knows what you are doing:
% rename -n -e 's/\A(?<galaxy>\d+).*(HSC-I).*/$+{galaxy}-$2.fits/' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-HSC-I.fits)
From your description {galaxyID}-cutout-HSC-I-{#}-pdr2_wide.fits, I assume that cutout-HSC-I is fixed.
Here's a script that will do the rename. It takes a list of files on stdin. But, you could adapt to take the output of readdir:
#!/usr/bin/perl
master(@ARGV);
exit(0);
sub master
{
my($oldname);
while ($oldname = <STDIN>) {
chomp($oldname);
# find the file extension/suffix
my($ix) = rindex($oldname,".");
next if ($ix < 0);
# get the suffix
my($suf) = substr($oldname,$ix);
# only take filenames of the expected format
next unless ($oldname =~ /^(\d+)-cutout-(HSC-I)/);
# get the new name
my($newname) = $1 . "_" . $2 . $suf;
printf("OLDNAME: %s NEWNAME: %s\n",$oldname,$newname);
# rename the file
# change to "if (1)" to actually do it
if (0) {
rename($oldname,$newname) or
die("unable to rename '$oldname' to '$newname' -- $!\n");
}
}
}
For your sample input file, here's the program output:
OLDNAME: 2185-cutout-HSC-I-9330-pdr2_wide.fits NEWNAME: 2185_HSC-I.fits
OLDNAME: 992-cutout-HSC-I-10106-pdr2_wide.fits NEWNAME: 992_HSC-I.fits
OLDNAME: 2186-cutout-HSC-I-9334-pdr2_wide.fits NEWNAME: 2186_HSC-I.fits
The above is how I usually do things but here's one with just a regex. It's fairly strict in what it accepts [for safety], but you can adapt as desired:
#!/usr/bin/perl
master(@ARGV);
exit(0);
sub master
{
my($oldname);
while ($oldname = <STDIN>) {
chomp($oldname);
# only take filenames of the expected format
next unless ($oldname =~ /^(\d+)-cutout-(HSC-I)-\d+-pdr2_wide([.].+)$/);
# get the new name
my($newname) = $1 . "_" . $2 . $3;
printf("OLDNAME: %s NEWNAME: %s\n",$oldname,$newname);
# rename the file
# change to "if (1)" to actually do it
if (0) {
rename($oldname,$newname) or
die("unable to rename '$oldname' to '$newname' -- $!\n");
}
}
}
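As mentioned above, you could adapt this to take the output of readdir instead of a list on stdin. A minimal sketch of that adaptation (scanning the current directory is an assumption):

#!/usr/bin/perl
use strict;
use warnings;

# scan the current directory instead of reading names from stdin
opendir(my $dh, ".") or die("unable to open directory -- $!\n");
for my $oldname (readdir($dh)) {
    # only take filenames of the expected format
    next unless ($oldname =~ /^(\d+)-cutout-(HSC-I)-\d+-pdr2_wide([.].+)$/);
    my $newname = $1 . "_" . $2 . $3;
    printf("OLDNAME: %s NEWNAME: %s\n", $oldname, $newname);
    # uncomment to actually do it:
    # rename($oldname, $newname) or
    #     die("unable to rename '$oldname' to '$newname' -- $!\n");
}
closedir($dh);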
I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';
my $string;
# Read submitted files line by line (or STDIN if @ARGV is empty)
while (<>) {
chomp;
$string = $_;
last; # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz); # test
my $padding = 3; # for the given test sample
my @patterns = do {
my $search_file = 'search.txt';
open my $fh, '<', $search_file or die "Can't open $search_file: $!";
<$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw); # test
foreach my $patt (@patterns) {
if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
say "$1\t$2\t$3";
# or
# printf "%-3s\t%3s%3s\n", $1, $2, $3;
}
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.) up to $padding times (3 above), which I used in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after the $patt.
For your problem, change $padding to 100. With the 100-wide "padding" before and after each pattern, if a pattern is found closer to the beginning than 100 characters, the desired \t alignment could break when the position falls short of 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest.
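A minimal sketch of that invocation, where the defaults are assumptions mirroring the file names above:

use strict;
use warnings;
use Getopt::Long;

# defaults; override with --sequence/--search/--padding
my $seq_file    = 'seq.txt';
my $search_file = 'search.txt';
my $padding     = 100;

GetOptions(
    'sequence=s' => \$seq_file,
    'search=s'   => \$search_file,
    'padding=i'  => \$padding,
) or die "Usage: $0 [--sequence FILE] [--search FILE] [--padding N]\n";

print "seq=$seq_file search=$search_file padding=$padding\n"; # stand-in for the real processing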
One in awk. -v b=3 is the before-context length, -v a=3 is the after-context length, and -v n=3 is the match length, which is always constant. It hashes all the substrings of seq.txt into memory, so memory use depends on the size of seq.txt, and you might want to follow the consumption with top. For example: abcdefghij -> s["def"]="abcdefghi", s["efg"]="bcdefghij", etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
e=length()-(n+a-1)
for(i=1;i<=e;i++) {
k=substr($0,(i+b),n)
s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
}
next
}
($0 in s) {
print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
You can try the following approach to your problem:
load the string input data
load the patterns into an array
loop through each pattern and look for it in the string
form an array from the found matches
loop through the matches array and print the result
NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh, '<', $fname_s
    or die "Couldn't open $fname_s: $!";
my $data = do { local $/; <$fh> };
close $fh;
open my $fh_p, '<', $fname_p
    or die "Couldn't open $fname_p: $!";
my @patterns = <$fh_p>;
close $fh_p;
chomp @patterns;
for ( @patterns ) {
    my @found = $data =~ /(.{100}$_.{100})/g;
    s/(.{100})(.{50})(.{100})/$1\t$2\t$3/ && say for @found;
}
Test code for the provided test data (added later):
use strict;
use warnings;
use feature 'say';
my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };
for ( @pat ) {
say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz
I have a results.txt file that is structured in this format:
Uncharted 3: Javithaxx l Rampant l Graveyard l Team Deathmatch HD (D1VpWBaxR8c)
Matt Darey feat. Kate Louise Smith - See The Sun (Toby Hedges Remix) (EQHdC_gGnA0)
The Matrix State (SXP06Oax70o)
Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)
I want to create a new file extracting the YouTube URL ID given in the last characters of each line, e.g. "8aOdRACuXiU".
I'm trying to build a URL like this in a new file:
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Note, I appended the &hd=1 to the string I am trying to produce. I have tried using Linux rev and cut, but rev munges my data. The hard part here is that each line in my text file will have entries with parentheses, and I only care about getting the data between the last set of parentheses. Each line has a variable length, so that isn't helpful either. What about using grep and .$ for the end of the line?
In summary, I want to extract the youtube ID from results.txt and export it to a new file in the following format: http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Using awk:
awk '{
v = substr( $NF, 2, length( $NF ) - 2 )
printf "%s%s%s\n", "http://www.youtube.com/watch?v=", v, "&hd=1"
}' infile
It yields:
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
$ sed 's!.*(\(.*\))!http://www.youtube.com/watch?v=\1\&hd=1!' results.txt
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Here, .*(\(.*\)) looks for the last occurrence of a pair of parentheses, and captures the characters inside those parentheses. The captured group is then inserted into the URL using \1.
Using a Perl one-liner:
perl -lne 'printf "http://www.youtube.com/watch?v=%s&hd=1\n", $& if /[^\(]+(?=\)$)/' file.txt
Or the multi-line version:
perl -lne '
printf(
"http://www.youtube.com/watch?v=%s&hd=1\n",
$&
) if /[^\(]+(?=\)$)/
' file.txt
I have a file containing records delimited by the pattern /#matchee/. These records are of varying lengths, say 45-75 lines. They need to ALL be 45 lines and still maintain the record delimiter. Records can be from different departments; the department name is on line 2, following a blank line. So the record delimiter could be thought of as simply /^#matchee/, or /^#matchee/ followed by \n. There is a Deluxe edition of this problem and a Walmart edition ...
DELUXE EDITION
Pull each record by pattern range so I can sort records by department. Eg., with sed
sed -n '/^DEPARTMENT NAME/,/^#matchee/{p;}' mess-o-records.txt
Then, Print only the first 45 lines of each record in the file to conform to
the 45 line constraint.
Finally, make sure the result still has the record delimiter on line 45.
WALMART EDITION
Same as above, but instead of using a range, just use the record delimiter.
STATUS
My attempt at this might clarify what I'm trying to do.
sed -n -e '/^DEPARTMENT-A/,/^#matchee/{p;}' -e '45q' -e '$s/.*/#matchee/' mess-o-records.txt
This doesn't work, of course, because sed is operating on the entire file at each command.
I need it to operate on each range match not the whole file.
SAMPLE INPUT - 80 Lines ( truncated for space )
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
<way too many lines here>
#matchee
SAMPLE OUTPUT - now only 45 lines
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<Record now equals exactly 45 lines>
<yet record delimiter is maintained>
#matchee
CLARIFICATION UPDATE
I will never need more than the first 40 lines if this makes things easier. Maybe the process would be:
Match pattern(s)
Print first 40 lines.
Pad to appropriate length, e.g., 45 lines.
Tack delimiter back on, e.g., #matchee.
I think this would be more flexible -- i.e., it can handle records shorter than 45 lines.
Here's a riff based on @Borodin's Perl example below:
my $count = 0;
$/ = "#matchee";
while (<>) {
if (/^REDUNDANCY.*DEPT/) {
print;
$count = 0;
}
else {
print if $count++ < 40;
print "\r\n" x 5;
print "#matchee\r\n";
}
}
This adds 5 newlines to each record plus the delimiting pattern /#matchee/. So it's wrong -- but it illustrates what I want.
Print 40 lines based on department -- pad -- tack delimiter back on.
I think I understand what you want. Not sure about the bit about pulling each record by pattern range. Is #matchee always followed by a blank line and then the department line? So the department is in fact line 2 of the record?
This Perl fragment does what I understand you need.
If you prefer you can put the input file on the command line and drop the open call. Then the loop would have to be while (<>) { ... }.
Let us know if this is right so far, and what more you need from it.
use strict;
use warnings;
open my $fh, '<', 'mess-o-records.txt' or die $!;
my $count = 0;
while (<$fh>) {
if (/^#matchee/) {
print;
$count = 0;
}
else {
print if $count++ < 45;
}
}
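If you also need the padding step from your update -- pad short records out to exactly 45 lines, delimiter included -- a minimal extension of the fragment above might look like this (it assumes #matchee sits on its own line):

use strict;
use warnings;

open my $fh, '<', 'mess-o-records.txt' or die $!;

my $count = 0;
while (<$fh>) {
    if (/^#matchee/) {
        # pad short records with blank lines so the delimiter lands on line 45
        print "\n" x (44 - $count) if $count < 44;
        print;
        $count = 0;
    }
    else {
        print if $count++ < 44; # keep at most 44 body lines
    }
}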
I know this has already had an accepted answer, but I figured I'd post an awk example for anyone interested. It's not 100%, but it gets the job done.
Note: this numbers the lines so you can verify the script is working as expected. Remove the i, from print i, current[i] to remove the line numbers.
dep.awk
BEGIN { RS = "#matchee\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= 45; i++) {
print i, current[i];
}
print "#matchee\n"
}
In this example, you begin the script by setting the record separator (RS) to "#matchee\n\n". There are two newlines because the first ends the line on which #matchee occurs and the second is the blank line on its own.
The match validates that a record contains letters or numbers. You could also check that the match starts with 'DEPARTMENT-', but this would fail if there is a stray newline; checking the content is the safest route. Because this uses a block record (i.e., DEPARTMENT-A through #matchee), you could either pass $0 through awk or sed again, or use the awk split function and loop through 45 lines. Note that awk arrays are 1-indexed, not zero-indexed.
The print function includes a newline, so the block ends with print "#matchee\n" only instead of the double \n in the record separator variable.
You could also drop the same awk script into a bash script and change the number of lines and field separator. Of course, you should add validations and whatnot, but here's the start:
dep.sh
#!/bin/bash
# prints the first n lines within every block of text delimited by splitter
splitter=$1
numlines=$2
awk 'BEGIN { RS = "'$splitter'\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= '$numlines'; i++) {
print i, current[i]
}
print "'$splitter'\n"
}' "$3"
Make the script executable and run it.
./dep.sh '#matchee' 45 input.txt > output.txt
I added these files to a gist so you could also verify the output
This might work for you:
D="DEPARTMENT-A" M="#matchee"
sed '/'"$D/,/$M"'/{/'"$D"'/{h;d};H;/'"$M"'/{x;:a;s/\n/&'"$M"'/45;tb;s/'"$M"'/\n&/;ta;:b;s/\('"$M"'\).*/\1/;p};d}' file
Explanation:
Focus on range of lines /DEPARTMENT/,/#matchee/
At start of range move pattern space (PS) to hold space (HS) and delete PS /DEPARTMENT/{h;d}
All subsequent lines in the range are appended to the HS and then deleted H;...;d
At end of range:/#matchee/
Swap to HS x
Test for 45 lines in range and if successful append #matchee at the 45th line s/\n/&#matchee/45
If previous substitution was successful branch to label b. tb
If previous substitution was unsuccessful insert a linefeed before #matchee s/'"$M"'/\n&/ thus lengthening a short record to 45 lines.
Branch to label a and test for 45 lines etc . ta
Replace everything from the first occurrence of #matchee to the end of the record with #matchee itself s/\('"$M"'\).*/\1/, thus shortening a long record to 45 lines.
Print the range of records. p
All non-range records pass through untouched.
TXR Solution ( http://www.nongnu.org/txr )
For illustration purposes using the fake data, I shorten the requirement from 40 lines to 12 lines. We find records beginning with a department name, delimited by #matchee. We dump them, chopped to no more than 12 lines, with #matchee added again.
@(collect)
@  (all)
@dept
@  (and)
@    (collect)
@line
@    (until)
#matchee
@    (end)
@  (end)
@(end)
@(output)
@  (repeat)
@{line[0..12] "\n"}
#matchee
@  (end)
@(end)
Here, the dept variable is expected to come from a -D command line option, but of course the code can be changed to accept it as an argument and put out a usage if it is missing.
Run on the sample data:
$ txr -Ddept=DEPARTMENT-A trim-extract.txr mess-o-records.txt
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
#matchee
The blank lines before DEPARTMENT-A are gone, and there are exactly 12 lines, which happen to include one line of the <too many ...> junk.
Note that the semantics of @(until) is such that the #matchee is excluded from the collected material. So it is correct to unconditionally add it in the @(output) clause. This program will work even if a record happens to be shorter than 12 lines before #matchee is found.
It will not match a record if #matchee is not found.
I have a directory full of files containing records like:
FAKE ORGANIZATION
799 S FAKE AVE
Northern Blempglorff, RI 99xxx
01/26/2011
These items are being held for you at the location shown below each one.
IF YOU ASKED THAT MATERIAL BE MAILED TO YOU, PLEASE DISREGARD THIS NOTICE.
The Waltons. The complete DAXXXX12118198
Pickup at:CHUPACABRA LOCATION 02/02/2011
GRIMLY, WILFORD
29 FAKE LANE
S. BLEMPGLORFF RI 99XXX
I need to remove all entries with the expression Pickup at:CHUPACABRA LOCATION.
The "record separator" issue:
I can't touch the input file's formatting -- it must be retained as is. Each record
is separated by roughly 40+ newlines.
Here's some awk ( this works ):
BEGIN {
RS="\n\n\n\n\n\n\n\n\n+"
FS="\n"
}
!/CHUPACABRA/{print $0}
My stab with perl:
perl -a -F\n -ne '$/ = "\n\n\n\n\n\n\n\n\n+";$\ = "\n";chomp;$regex="CHUPACABRA";print $_ if $_ !~ m/$regex/i;' data/lib51.000
Nothing is returned. I'm not sure how to specify the 'field separator' in Perl except at the command line. Tried the a2p utility -- no dice. For the curious, here's what it produces:
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
#$FS = ' '; # set field separator
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$/ = "\n\n\n\n\n\n\n\n\n+";
$FS = "\n";
while (<>) {
chomp; # strip record separator
if (!/CHUPACABRA/) {
print $_;
}
}
This has to run under someone's Windows box otherwise I'd stick with awk.
Thanks!
Bubnoff
EDIT ( SOLVED )
Thanks mob!
Here's a ( working ) perl script version ( adjusted a2p output ):
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
#$FS = ' '; # set field separator
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$/ = "\n"x10;
$FS = "\n";
while (<>) {
chomp; # strip record separator
if (!/CHUPACABRA/) {
print $_;
}
}
Feel free to post improvements or CPAN goodies that make this more idiomatic and/or perl-ish. Thanks!
In Perl, the record separator is a literal string, not a regular expression. As the perlvar doc famously says:
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Still, it looks like you can get away with $/="\n" x 10 or something like that:
perl -a -F\n -ne '$/="\n"x10;$\="\n";chomp;$regex="CHUPACABRA";
print if /\S/ && !m/$regex/i;' data/lib51.000
Note the extra /\S/ &&, which will skip empty paragraphs from input that has more than 20 consecutive newlines.
Also, have you considered just installing Cygwin and having awk available on your Windows machine?
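If you'd rather have a standalone script than a one-liner, here is a minimal sketch of the same idea, with the same ten-newline separator assumption:

#!/usr/bin/perl
use strict;
use warnings;

$/ = "\n" x 10; # record separator: ten consecutive newlines

while (<>) {
    chomp;            # strip the record separator
    next unless /\S/; # skip empty records left by longer newline runs
    print "$_\n" unless /CHUPACABRA/i;
}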
There is no need for (much) conversion if you can download gawk for Windows.
Did you know that Perl comes with a program called a2p that does exactly what you described you want to do in your title?
And, if you have Perl on your machine, the documentation for this program is already there:
C> perldoc a2p
My own suggestion is to get the Llama book and learn Perl anyway. Despite what the Python people say, Perl is a great and flexible language. If you know shell, awk and grep, you'll understand many of the Perl constructs without any problems.