Dedup multi-line records with Perl

I have multi-line records in a text file I'd like to dedupe using perl:
Records are delimited by the string "#end-of-record" and look like this:
CAPTAIN GIBLET'S NEWT CORRAL
555 RANDOM ST
TARDIS, CT 99999
We regret to inform you that we must repossess your pants in part due to your being 6 months late on payments. But mostly it's maliciousness. :)
TOTAL DUE: $30.00
#end-of-record
Here is my initial attempt:
#!/usr/bin/perl -w
use strict;

{
    local $/ = "#end-of-record";

    my %seen;
    while ( my $record = <> ) {
        if (not exists $seen{$record}) {
            print $record;
            $seen{$record} = 1;
        }
    }
}
This is printing out every record, including the duplicates. Where did I go wrong?
UPDATE
Above code seems to work.

gawk 'BEGIN {ORS = RS = "#end-of-record\n"} !$seen[$0]++
END { print $ORS }' yourfile
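The !$seen[$0]++ expression is the classic awk dedup idiom: it is true only the first time a record is seen (the count for that record is still zero), and awk's default action is to print. A roughly equivalent Perl one-liner (a sketch, not from the original answer) would be:

perl -ne 'BEGIN { $/ = "#end-of-record\n" } print unless $seen{$_}++' yourfile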

Extracting info from file rows into columns using whatever it works (PERL, SED, AWK)

Maybe I'm too old for perl/awk/sed, too young to stop programming.
Here is the problem I need to solve:
I have info like this in a TXT file:
Name:
Name 1
Phone:
1111111
Email:
some@email1
DoentMatterInfo1:
whatever1
=
Name:
Name 2
Phone:
22222222
DoentMatterInfo2:
whatever2
Email:
some@email2
=
Name:
Name 3
DoentMatterInfo3:
whatever2
Email:
some@email3
=
Please note that the desired info is on the next line, there is a record separator (=) and, very importantly, some records don't have all the info, but could have info that we don't want.
So, the challenge is to extract the desired info, if exist, in an output like:
Name 1 ; 111111 ; some@email1
Name 2 ; 222222 ; some@email2
Name 3 ; ; some@email3
Here is what I have tried; it worked a little bit, but still is not what I'm looking for.
1. Using PERL
Using Perl I got the fields that matter:
while (<>) {
    if ($_ =~ /Name/) {
        print "=\n" . scalar <>;
    }
    if ($_ =~ /Email/) {
        print "; " . scalar <>;
    }
    if ($_ =~ /Phone/) {
        print "; " . scalar <>;
    }
}
Then I got a file like:
Name 1
; 1111111
; some@email1
=
Name 2
; 22222222
; some@email2
=
Name:
Name 3
; some@email3
=
Now with sed I put each record on a single line:
2. Using SED
With sed, this command removes the line feeds, putting the info on a single line:
sed ':a;N;$!ba;s/\n//g' input.txt > out1.txt
And put back the line feeds:
sed 's/|=|/\n/g' out1.txt > out2.txt
So I got a file with the info in each line:
Name 1 ; 1111111 ; some@email1
Name 2 ; 22222222 ; some@email2
Name 3 ; some@email3
Still not what I would like to get from coding. I want something better, like being able to fill the missing phone with a space, so the second column is always the phone column. Do you get it?
As you can see, the point is to find a solution, no matter whether it uses Perl, awk or sed. I'm trying Perl hashes...
Thanks in advance!!
Here is a Perl solution, since Perl was asked for and attempted.
use warnings;
use strict;
use feature 'say';
my @fields = qw(Name Phone Email);  # fields to process

my $re_fields = join '|', map { quotemeta } @fields;

my %record;

while (<>) {
    if (/^\s*($re_fields):/) {
        chomp($record{$1} = <>);
    }
    elsif (/^\s*=/) {
        say join ';', map { $record{$_} // '' } @fields;
        %record = ();
    }
}
The input is prepared in the array @fields; this is the only place where those names are spelled out, so if more fields need to be added to processing, just add them here. A regex pattern for matching any one of these fields is also prepared, in $re_fields.
Then we read line by line all files submitted on the command line, using the <> operator.
The if condition captures an expected keyword if it is there. In the body we read the next line for its value and store it with the key being the captured keyword (we need not know which one it was).
On a line starting with = the record is printed (correctly with the given sample file). I put nothing for missing fields (no spaces) and no extra spaces around ;. Adjust the output format as desired.
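For reference, a run against the sample input above should look like this (the script name is assumed):

$ perl extract.pl input.txt
Name 1;1111111;some@email1
Name 2;22222222;some@email2
Name 3;;some@email3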
In order to collect records throughout and process further (or just print) later, add them to a suitable data structure instead of printing. What storage to choose depends on what kind of processing is envisioned. The simplest way to go is to add strings for each output record to an array
my (@records, %record);

while (<>) {
    ...
    elsif (/^\s*=/) {
        push @records, join ';', map { $record{$_} // '' } @fields;
        %record = ();
    }
}
Now @records has ready strings for all records, which can be printed simply as
say for @records;
But if more involved processing may be needed, it is better to store copies of %record in an array, as hash references, so that individual components can later be manipulated more easily:
my (@records, %record);

while (<>) {
    ...
    elsif (/^\s*=/) {
        # Add a key to the hash for any fields that are missing
        $record{$_} //= '' for @fields;
        push @records, { %record };
        %record = ();
    }
}
I add a key for possibly missing fields, so that the hashrefs have all expected keys, and I assign an empty string to it. Another option is to assign undef.
Now you can access individual fields in each record as
foreach my $rec (@records) {
    foreach my $fld (sort keys %$rec) {
        say "$fld -> $rec->{$fld}";
    }
}
or of course just print the whole thing using Data::Dumper or such.
This will work using any awk in any shell on every UNIX box:
$ cat tst.awk
BEGIN { OFS=" ; " }
$0 == "=" {
    print f["Name:"], f["Phone:"], f["Email:"]
    delete f
    lineNr = 0
    next
}
++lineNr % 2 { tag = $0; next }
{ f[tag] = $0 }
$ awk -f tst.awk file
Name 1 ; 1111111 ; some@email1
Name 2 ; 22222222 ; some@email2
Name 3 ; ; some@email3
I would do it like this:
$ cat prog.awk
#!/bin/awk -f
BEGIN { OFS = ";" }
/^(Name|Phone|Email):$/ { getline arr[$0] ; next }
/^=$/ { print arr["Name:"], arr["Phone:"], arr["Email:"] ; delete arr }
Explanation:
In the BEGIN block, define the output field separator (semicolon).
For each line in the input file: if the line (in its entirety) equals Name: or Phone: or Email:, then assign that string as the key, and the following line as the value, of an element of the associative array arr. (That is how getline can be used to assign a value to a variable.) Then skip the remaining rules and move on to the next line.
If the line is =, print the three values from the arr associative array, and then clear out the array (reset all the values to the empty string).
Make it executable:
chmod +x prog.awk
Use it:
$ ./prog.awk file.txt
Name 1;1111111;some@email1
Name 2;22222222;some@email2
Name 3;;some@email3
Note - a missing value is indicated by two consecutive semicolons (not by a space). Using space as placeholder for NULL is a common bad practice (especially in relational databases, but in flat files too). You can change this to use NULL as placeholder, I am not terribly interested in that bit of the problem.
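If you do want an explicit placeholder, one possible tweak (an untested sketch, in the same style as the script above) is to filter each value through a small helper function at print time:

/^=$/ { print nz(arr["Name:"]), nz(arr["Phone:"]), nz(arr["Email:"]) ; delete arr }
function nz(v) { return (v == "" ? "NULL" : v) }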
Input file format is easy to parse: split on =\n into records, split each record on \n into a hash, and push the hash into the @result array.
Then just output each element of the @result array, specifying the fields of interest.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;

my @result;

my $data = do { local $/; <DATA> };
my @records = split('=\n?', $data);
push @result, { split "\n", $_ } for @records;

say Dumper(\@result);

my @fields = qw/Name: Phone: Email:/;

for my $record (@result) {
    $record->{$_} = $record->{$_} || '' for @fields;
    say join('; ', @$record{@fields});
}
__DATA__
Name:
Name 1
Phone:
1111111
Email:
some@email1
DoentMatterInfo1:
whatever1
=
Name:
Name 2
Phone:
22222222
DoentMatterInfo2:
whatever2
Email:
some@email2
=
Name:
Name 3
DoentMatterInfo3:
whatever2
Email:
some@email3
=
Output
$VAR1 = [
          {
            'DoentMatterInfo1:' => 'whatever1',
            'Name:' => 'Name 1',
            'Email:' => 'some@email1',
            'Phone:' => '1111111'
          },
          {
            'Phone:' => '22222222',
            'Email:' => 'some@email2',
            'Name:' => 'Name 2',
            'DoentMatterInfo2:' => 'whatever2'
          },
          {
            'DoentMatterInfo3:' => 'whatever2',
            'Name:' => 'Name 3',
            'Email:' => 'some@email3'
          }
        ];
Name 1; 1111111; some@email1
Name 2; 22222222; some@email2
Name 3; ; some@email3

Want to extract the first letter of each word

I basically have a variable COUNTRY along with variables SUBJID and TREAT, and I want to concatenate them like this: ABC002-123 /NZ/ABC.
Suppose the COUNTRY variable had the value 'New Zealand'. I want to extract the first letter of each word, but I want to extract only the first two letters of the value when there is only one word in the COUNTRY variable. I wanted to know how to simplify the below code, if possible in Perl.
IF COUNTW(COUNTRY) GT 1 THEN
    CAT_VAR = UPCASE(SUBJID || "/" ||
                     CAT(SUBSTR(SCAN(COUNTRY,1,' '),1,1),
                         SUBSTR(SCAN(COUNTRY,2,' '),1,1)) || "/" || TREAT);
my @COUNTRY = ("New Zealand", "Germany");

# expected: 'NZ', 'GE'
my @two_letters = map {
    # multi-word name: first letter of each word; single word: first two letters
    my @r = /\s/ ? /\b(\w)/g : /(..)/;
    uc(join "", @r);
} @COUNTRY;
The SAS Perl Regular Expression solution is to use CALL PRXNEXT along with PRXPOSN or CALL PRXPOSN (or a similar function, if you prefer):
data have;
    infile datalines truncover;
    input #1 country $20.;
    datalines;
New Zealand
Australia
Papua New Guinea
;;;;
run;

data want;
    set have;
    length country_letter $5.;
    prx_1 = prxparse('~(?:\b([a-z])[a-z]*\b)+~io');
    length = 0;
    start = 1;
    stop = length(country);
    position = 0;
    call prxnext(prx_1, start, stop, country, position, length);
    do while (position gt 0);
        matchletter = prxposn(prx_1, 1, country);
        country_letter = cats(country_letter, matchletter);
        call prxnext(prx_1, start, stop, country, position, length);
        put i= position= start= stop=;
    end;
run;
I realize the OP might not be interested in another answer, but for other users browsing this thread and not wanting to use Perl expressions I suggest the following simple solution (for the original COUNTRY variable):
FIRST_LETTERS = compress(propcase(COUNTRY),'','l');
The propcase function capitalizes the first letter of each word and puts the other ones in lower case. The compress function with the 'l' modifier deletes all lower-case letters.
COUNTRY may have any number of words.
How about this:
#!/usr/bin/perl
use warnings;
use strict;

my @country = ('New Zealand', 'Germany', 'Tanzania', 'Mozambique', 'Irish Republic');
my ($one_word_letters, $two_word_letters, @initials);

foreach (@country) {
    if ($_ =~ /\s+/) {  # captures CAPs if 'country' contains a space
        my ($first_letter, $second_letter) = ($_ =~ /([A-Z])/g);
        my ($two_word_letters) = ($first_letter . $second_letter);
        push @initials, $two_word_letters;  # add to array for later
    }
    else {  # if 'country' is only one word long, capture its first two letters (CAP + non-cap)
        ($one_word_letters) = ($_ =~ /([A-Z][a-z])/);
        push @initials, $one_word_letters;  # add this to the same array
    }
}

foreach (@initials) {  # print contents of the capture array:
    print "$_\n";
}
Outputs:
NZ
Ge
Ta
Mo
IR
This should do the job provided there really are no 3 word countries. Easily fixed if there are though...
This should do.
#!/usr/bin/perl

my $init = getInitials($ARGV[0]);

if ($init) {
    print $init . "\n";
    exit 0;
}
else {
    print "invalid name\n";
    exit 1;
}

sub getInitials {
    my $name = shift;
    $name =~ m/(^(\S)\S*?\s+(\S)\S*?$)|(^(\S\S)\S*?$)/i;
    if ( defined($1) and $1 ne '' ) {
        return uc($2 . $3);
    } elsif ( defined($4) and $4 ne '' ) {
        return uc($5);
    } else {
        return 0;
    }
}
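Assuming the script is saved as initials.pl and made executable, a sample run would be:

$ ./initials.pl "New Zealand"
NZ
$ ./initials.pl Germany
GE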

How to quickly find and replace many items on a list without replacing previously replaced items in BASH?

I want to perform many find-and-replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:

a.csv    b.csv    c.csv
orange   0001     fruit2
carrot   0002     vegetable1
apple    0003     fruit3
pear     0004     fruit4
ink      0005     item1
table    0006     item2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./b.csv`; do
        for j in `sed -n "$a"p ./a.csv`; do
            sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
            echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a Perl solution which does the replacement in "one phase".
#!/usr/bin/perl
use strict;

my %map = (
    orange => "fruit2",
    carrot => "vegetable1",
    apple  => "fruit3",
    pear   => "fruit4",
    ink    => "item1",
    table  => "item2",
);

# Try longer keys first, so a shorter key can never match inside text
# that a longer key should have claimed (the question's CSV is ordered
# longest-to-shortest for the same reason)
my $repl_rx = '(' . join("|", map { quotemeta }
                          sort { length($b) <=> length($a) } keys %map) . ')';

my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
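Running this should print:

I like to eat fruit3s and vegetable1s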
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
    "orange" "fruit2"
    "carrot" "vegetable1"
    "apple"  "fruit3"
    "pear"   "fruit4"
    "ink"    "item1"
    "table"  "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
    [orange]=fruit2
    [carrot]=vegetable1
    [apple]=fruit3
    [pear]=fruit4
    [ink]=item1
    [table]=item2
)
str="I like to eat apples and carrots"
echo "$str"

i=0
while (( i < ${#str} )); do
    matched=false
    for key in "${!map[@]}"; do
        if [[ ${str:$i:${#key}} = $key ]]; then
            str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
            ((i+=${#map[$key]}))
            matched=true
            break
        fi
    done
    $matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[@]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[@]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die("first parameter must be the replacement list file") unless defined($repls);
my $tmpFmt = "###%d###";

open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;

my @replsList;
my $i = 0;
while (<$replsFile>) {
    chomp;
    my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
    if (defined($from) && defined($to)) {
        push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
    }
}

while (<>) {
    foreach my $r (@replsList) {
        s/$r->[0]/$r->[1]/g;
    }
    foreach my $r (@replsList) {
        s/$r->[1]/$r->[2]/g;
    }
    print;
}
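A hypothetical invocation, assuming the replacement list is stored as quoted CSV (the format the regex above expects):

$ cat replace-list.csv
"orange","fruit2"
"carrot","vegetable1"
$ ./replace.pl replace-list.csv file.txt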
I would guess that most of your slowness is coming from creating so many sed commands, each of which needs to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running one sed per file per step.
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    cmd=""
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
            echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
    sed -i "$cmd" ./file.txt
done
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try).
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;

open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");
my @replace_pairs;
# read in the list of things to replace
while (<REPLACEFILE>) {
    chomp();
    my @cols = split /\t/, $_;
    my $to_replace = $cols[$before_replace_colnum];
    my $replace_with = $cols[$after_replace_colnum];
    push @replace_pairs, [$to_replace, $replace_with];
}

# read input from stdin, do swapping
while (<STDIN>) {
    # loop over all replacement strings
    foreach my $replace_pair (@replace_pairs) {
        my ($to_replace, $replace_with) = @{$replace_pair};
        $_ =~ s/${to_replace}/${replace_with}/g;
    }
    print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
    read countmd5sum x < <(md5sum <<< $count)
    count=$(( $count + 1 ))
    bigfrom="$bigfrom;s/$from/$countmd5sum/g"
    bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum to get some unique token. But some other mechanism can also be used to generate such a token, like reading from /dev/urandom or shuf -n1 -i 10000000-20000000.
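For instance, either of these would work as a token generator (illustrative sketches, not from the original answer):

token=$(shuf -n1 -i 10000000-20000000)                      # random integer
token=$(head -c8 /dev/urandom | od -An -tx1 | tr -d ' \n')  # random hex string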
An awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as input file.
Then uses another sed instance to operate on input.txt
Notes:
The intermediate file generated - sed_script.sed can be re-used again, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case, each character in the substitution string is replaced by itself and a \n. Finally, once all substitutions have taken place, the \n's are removed, leaving the finished string.
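To see the idea in miniature: for a csv line like table,item2, the generated script contains, roughly, a command of the form

s|table|i\nt\ne\nm\n2\n|g

whose replacement text can no longer be matched by any later s|from|to|g command, plus a single s|\n||g appended at the very end of the script to strip the markers once every substitution has run.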
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace (based on the sample data):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This is a single-pass, awk-only answer with very little regex.
It reads the "repl.csv" file into an associative array (see BEGIN{}), then attempts to match on prefixes of each word while the length of the word is bound by the key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
    while( getline repline < "repl.csv" ) {
        split( repline, replarr, "," )
        replassocarr[ replarr[1] ] = replarr[2]
        # set some bounds on the replace word sizes
        if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
            minKeyLen = length( replarr[1] )
        if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
            maxKeyLen = length( replarr[1] )
    }
    close( "repl.csv" )
}

{
    i = 1
    while( i <= NF ) { print_word( $i, i == NF ); i++ }
}

function print_word( w, end ) {
    wl = length( w )
    for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
        key = substr( w, 1, j )
        wl = length( key )
        if( wl >= minKeyLen && key in replassocarr ) {
            printf( "%s%s%s", replassocarr[ key ],
                    substr( w, j+1 ), !end ? " " : "\n" )
            return
        }
    }
    printf( "%s%s", w, !end ? " " : "\n" )
}

function prefix_len_bound( len, jlen ) {
    return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course any "savings" from not looking in replassocarr go away when the words to be replaced approach length 1, or if the average word length is much greater than that of the words to replace.

how to put a file into an array and save it in perl

Hello everyone. I'm a beginner in Perl and I'm facing some problems: I want to put my strings starting from AA to \ into an array and save it. There are about 2000-3000 strings in a txt file starting from the same initials, i.e., AA to \. I'm doing it this way; please correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
    if ( $line = m//\/g)
    {
        $flag = 1;
    }
    while ( $flag != 0)
    {
        for ($i = 0; $i <= 10000; $i++)
        {   # Missing brace added by editor
            $array[$i] = $line;
        }   # Missing brace added by editor
    }
}       # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;

my $recnum = 0;
while (<>)
{
    chomp;
    if (m/^\\$/)
    {
        print "$_\n";
        $recnum++;
    }
    else
    {
        print "$recnum $_\n";
    }
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
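On the sample input above, the output would be:

0 AA c0001
0 BB afsfjgfjgjgjflffbg
0 CC table
0 DD hhhfsegsksgk
0 EB jksgksjs
\
1 AA e0002
1 BB rejwkghewhgsejkhrj
1 CC chair
1 DD egrhjrhojohkhkhrkfs
1 VB rkgjehkrkhkh;r
\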
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my @data;
my $hashref = { };
my $nrecs = 0;

while (<>)
{
    chomp;
    if (m/^\\$/)
    {
        # End of group - save to data array and start new hash
        $data[$nrecs++] = $hashref;
        $hashref = { };
    }
    else
    {
        m/^([A-Z]+)\s+(.*)$/;
        $hashref->{$1} = $2;
    }
}

foreach my $i (0..$nrecs-1)
{
    print "Record $i:\n";
    foreach my $key (sort keys %{$data[$i]})
    {
        print "  $key = $data[$i]->{$key}\n";
    }
}
print Data::Dumper->Dump([ \@data ], [ '@data' ]);
Sample output for example input:
Record 0:
  AA = c0001
  BB = afsfjgfjgjgjflffbg
  CC = table
  DD = hhhfsegsksgk
  EB = jksgksjs
Record 1:
  AA = e0002
  BB = rejwkghewhgsejkhrj
  CC = chair
  DD = egrhjrhojohkhkhrkfs
  VB = rkgjehkrkhkh;r
$@data = [
           {
             'EB' => 'jksgksjs',
             'CC' => 'table',
             'AA' => 'c0001',
             'BB' => 'afsfjgfjgjgjflffbg',
             'DD' => 'hhhfsegsksgk'
           },
           {
             'CC' => 'chair',
             'AA' => 'e0002',
             'VB' => 'rkgjehkrkhkh;r',
             'BB' => 'rejwkghewhgsejkhrj',
             'DD' => 'egrhjrhojohkhkhrkfs'
           }
         ];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
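For example, if you later needed to look records up by their AA value, you could build a secondary index into the same hashrefs (a sketch, assuming the AA values are unique):

my %by_id = map { $_->{AA} => $_ } @data;
print $by_id{c0001}{CC}, "\n";    # prints "table"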
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out, because you do not reset the flag whenever you find a break line. You'll have to parse your input and exit the loop when necessary.
And second, you never read any input within this loop, and thus process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
    append item to array
else
    save array to file
    start with new array
end
I believe what you want is to split the file's content at \, though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandles, read perlvar.
#!perl
use strict;
use warnings;

my $content;
{
    open my $fh, '<', 'test.txt';
    local $/;  # slurp mode
    $content = <$fh>;
    close $fh;
}
my @blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;

my @blocks;
{
    open my $fh, '<', 'test.txt';
    local $/ = '\\';  # separate at \
    @blocks = <$fh>;
    close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the @data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):

1. Find a line that begins with "AA", and save that into $line.
2. Concatenate every new line from the file into $line.
3. When you find a line that begins with a backslash \, stop concatenating lines and save $line into @data.
4. Then, find the next line that begins with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;

my $line = "";
my @data;

while (<DATA>) {
    if (/^AA/) {
        $line = $_;
        while (<DATA>) {
            $line .= $_;
            last if /^\\/;
        }
    }
    push @data, $line;
}

use Data::Dumper;
print Dumper \@data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\

Parsing files that use synonyms

If I had a text file with the following:
Today (is|will be) a (great|good|nice) day.
Is there a simple way I can generate a random output like:
Today is a great day.
Today will be a nice day.
Using Perl or UNIX utils?
Closures are fun:
#!/usr/bin/perl

use strict;
use warnings;

my @gens = map { make_generator($_, qr~\|~) } (
    'Today (is|will be) a (great|good|nice) day.',
    'The returns this (month|quarter|year) will be (1%|5%|10%).',
    'Must escape %% signs here, but not here (%|#).'
);

for ( 1 .. 5 ) {
    print $_->(), "\n" for @gens;
}

sub make_generator {
    my ($tmpl, $sep) = @_;
    my @lists;

    while ( $tmpl =~ s{\( ( [^)]+ ) \)}{%s}x ) {
        push @lists, [ split $sep, $1 ];
    }

    return sub {
        sprintf $tmpl, map { $_->[rand @$_] } @lists;
    };
}
Output:
C:\Temp> h
Today will be a great day.
The returns this month will be 1%.
Must escape % signs here, but not here #.
Today will be a great day.
The returns this year will be 5%.
Must escape % signs here, but not here #.
Today will be a good day.
The returns this quarter will be 10%.
Must escape % signs here, but not here %.
Today is a good day.
The returns this month will be 1%.
Must escape % signs here, but not here %.
Today is a great day.
The returns this quarter will be 5%.
Must escape % signs here, but not here #.
Code:
#!/usr/bin/perl
use strict;
use warnings;

my $template = 'Today (is|will be) a (great|good|nice) day.';

for (1..10) {
    print pick_one($template), "\n";
}

exit;

sub pick_one {
    my ($template) = @_;
    $template =~ s{\(([^)]+)\)}{get_random_part($1)}ge;
    return $template;
}

sub get_random_part {
    my $string = shift;
    my @parts = split /\|/, $string;
    return $parts[rand @parts];
}
Logic:
Define the template of the output (my $template = ...)
Enter a loop to print random output many times (for ...)
Call pick_one to do the work
Find all "(...)" substrings, and replace each with a random part ($template =~ s...)
Print the generated string
Getting a random part is simple:
receive the extracted substring (my $string = shift)
split it using the | character (my @parts = ...)
return a random part (return $parts[...])
That's basically all. Instead of using a function you could put the same logic in s{}{}, but it would be a bit less readable:
$template =~ s{\( ( [^)]+ ) \)}
              { my @parts = split /\|/, $1;
                $parts[rand @parts];
              }gex;
Sounds like you may be looking for Regexp::Genex. From the module's synopsis:
#!/usr/bin/perl -l
use Regexp::Genex qw(:all);
$regex = shift || "a(b|c)d{2,4}?";
print "Trying: $regex";
print for strings($regex);
# abdd
# abddd
# abdddd
# acdd
# acddd
# acdddd
Use a regex to match each parenthetical (and the text inside it).
Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options.
Pick one randomly.
Return it as the replacement for that capture.
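Putting those four steps together, a compact sketch (not from the original answers) could use a substitution with the /e modifier:

perl -pe 's{\(([^)]+)\)}{ my @w = split /\|/, $1; $w[rand @w] }ge' file.txt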
Smells like a recursive algorithm
Edit: misread and thought you wanted all possibilities
#!/usr/bin/python
import re, random

def expand(line, all):
    result = re.search('\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        for v in variants:
            expand(line[:result.start()] + v + line[result.end():], all)
    else:
        all.append(line)
    return all

line = "Today (is|will be) a (great|good|nice) day."
all = expand(line, [])

# choose a random possibility at the end:
print random.choice(all)
A similar construct that produces a single random line:
def expand_rnd(line):
    result = re.search('\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        choice = random.choice(variants)
        return expand_rnd(
            line[:result.start()] + choice + line[result.end():])
    else:
        return line
Will fail however on nested constructs