Calculate difference of two values in hash - perl

Input from file
ID S1 S2 S3 S4
a R1 R2 R3 R4 R5 R6 R7 R8
. . . . . .
. . . . .
a is the ID may be string
R1,R2,R3,R4.....are numeric values
I have to use perl script.I have to read the above data from a file and read in hashes.I have to calculate the difference of R1 and R2 if difference is greater than 0 then print R1 and R2 as it is,other put 0 in place of R1 and R2. next find the difference of R3 and R4. How can we do the above problem using hashes
Output of the perl program:
ID S1 S2 S3 S4
a R1 R2 0 0 0 0 R7 R8
. . . . . .
. . . . .
. . . . .

Why would you need a hash to find the difference of pairs of numbers???
print(scalar(<>));
while (<>) {
my #parts = split ' ';
for (my $i=1; $i<#parts; $i+=2) {
if ($parts[$i+0] == $parts[$i+1]) {
$parts[$i+0] = '0';
$parts[$i+1] = '0';
}
}
print(join(' ', #parts), "\n");
}
Or if you want to preserve the formatting:
print(scalar(<>));
while (<>) {
chomp;
my #parts = split /(\s+)/;
for (my $i=2; $i<#parts; $i+=4) {
if ($parts[$i+0] == $parts[$i+2]) {
$parts[$i+0] = sprintf('%*s', length($parts[$i+0]), '0');
$parts[$i+2] = sprintf('%*s', -length($parts[$i+2]), '0');
}
}
print(join('', #parts), "\n");
}

Related

perl array from csv file creating newline in unexpected place

Hi I have a few scripts that convert an xlsx file to a tab seperated file, which then remove any commas, duplicates and then splits it by commas. (i do this to make sure users have not put any commas in a colomn)
I then do some stuff. and then convert it back to an xlsx file. This has always worked fine. But instead of opening and closing files all the time i thought i would push the file to an array and then convert it to an xlsx at the end. Unfortunatly when i try and convert back to an xlsx file it is creating a newline in the space between the name. If i OUTPUT to a csv file then Open it and convert to an xlsx file it works fine.
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::BasicRead;
use Excel::Writer::XLSX;
local $" = "'\n'";
open( STDERR, ">&STDOUT" );
#covert to csv
my $xlsx_WSD = ( "C:\\Temp\\testing_file.xlsx"),, 1;
my #csvtemp;
if ( -e $xlsx_WSD ) {
my $ss = new Spreadsheet::BasicRead($xlsx_WSD) or die;
my $col = '';
my $row = 0;
while ( my $data = $ss->getNextRow() ) {
$row++;
$col= join( "\t", #$data );
push #csvtemp, $col . "\n" if ( $col ne "" );
}
}
else {
print " C:\\Temp\\testing_file.xlsx file EXISTS ...!!\n";
print " please investigate and use the restore option if required !..\n";
exit;
}
;
my #arraynew;
my %seen;
our $Header_row = shift (#csvtemp);
foreach (#csvtemp){
chomp;
$_ =~ s/,//g;
$_ =~ s/\t/,/g;
# print $_ . "\n" if !$seen{$_}++ ;
push #arraynew, $_ . "\n" if !$seen{$_}++ ; #remove any dupes
}
#covert back to xlsx
my $workbook = Excel::Writer::XLSX->new("C:\\Temp\\testing_filet.xlsx");
my $worksheet = $workbook->add_worksheet();
my ( $x, $y ) = ( 0, 0 );
while (<#arraynew>) {
my #list = split /,/;
foreach my $c (#list) {
$worksheet->write( $x, $y++, $c );
}
$x++;
$y = 0;
}
__DATA__
Animal keeper M/F Years START DATE FRH FSM
GIRAFFE JAMES LE M 5 10/12/2007 Y
HIPPO JACKIE LEAN F 6 11/12/2007 Y
ZEBRA JAMES LEHERN M 7 12/12/2007 Y
GIRAFFE AMIE CAHORT M 5 13/12/2012 Y
GIRAFFE MICKY JAMES M 5 14/06/2007 Y
MEERKAT JOHN JONES M 9 15/12/2007 v v
LEOPPARD JIM LEE M 8 16/12/2002
unexpected result
GIRAFFE JAMES
LE M 5 10/12/2007 Y
"
HIPPO" JACKIE
LEAN F 6 11/12/2007 Y
"
ZEBRA" JAMES
LEHERN M 7 12/12/2007 Y
"
GIRAFFE" AMIE
CAHORT M 5 13/12/2012 Y
"
GIRAFFE" MICKY
JAMES M 5 14/06/2007 Y
"
MEERKAT" JOHN
JONES M 9 15/12/2007 v v
"
LEOPPARD" JIM
LEE M 8 16/12/2002
Since you are running this on Windows, have you considered using Win32::OLE instead?
use strict;
use Win32::OLE;
my $app = Win32::OLE->GetActiveObject('Excel.Application')
|| Win32::OLE->new('Excel.Application', 'Quit');
my $wb = $app->Workbooks->Open("C:/Temp/testing_file.xlsx");
my $ws = $wb->ActiveSheet;
my $max_row = $ws->UsedRange->Rows->Count;
my $max_col = $ws->UsedRange->Columns->Count;
my ($row, %already) = (1);
while ($row <= $max_row) {
my ($col, #output) = (1);
while ($col <= $max_col) {
my $val = $ws->Cells($row, $col)->{Text};
if ($val =~ /[,\t]/) {
$val =~ tr/,//d;
$val =~ tr/\t/,/;
$ws->Cells($row, $col)->{Value} = $val;
}
#output[$col - 1] = $val;
$col++;
}
if ($already{join "|", #output}++) {
$ws->Rows($row)->EntireRow->Delete;
$max_row--;
} else {
$row++;
}
}
$wb->SaveAs("C:\\temp\\testing_filet.xlsx");
This is an issue with end of line characters.
There are three conventions for marking the end of a line: with \n on Unix, \r\n on Windows and \r on Mac. It looks as though your script is assuming the Mac convention while input and output use the Windows convention.
So after reading the input, a leading \n appears on all lines except the first. As long as this is also the case with the output lines prior to composing them with \r, you end up with an output file with perfectly \r\n-delimited lines. Clearly it's better to make your script wary of what line ending convention the input is using and ensure it uses the same for splitting the lines and composing the output.

Making 2nd Degree Polys Given a Triplet

The purpose of this code is to make a subroutine that takes three coefficients k2, k1 , and k0 and returns an anonymous function/subroutine that takes one argument x and returns a string representation of the second degree polynomial with the given coefficients and its value at x, i.e., k2*x^2 + k1*x + k0.
As of right now, it returns the message "Can't use string ("6") as an ARRAY ref while "strict refs" is in use"
Does anyone see what I'm messing up?
use strict;
use warnings;
use 5.10.0;
sub make_2nd_degree_poly {
my($k2, $k1, $k0) = #_;
my $n = $_[0];
return sub {
return ($k2 . 'x^2 . ' . $k1 . 'x + ' . $k0 . ' at x = ' . $n,
$k2 * $n ** 2 + $k1 *$n + $k0
);
}
}
my #triplet0 = (1, 2, 3);
my #triplet1 = (4, 5, 6);
my $p1 = make_2nd_degree_poly(#triplet0);
my $p2 = make_2nd_degree_poly(#triplet1);
print "#{$p1->(1)}\n";
print "#{$p2->(2)}\n";
Next part of the issue: I need to write a function/sub gen_2nd_deg_polys that takes a list of 3-tuples and returns a list of anonymous 2nd
degree polynomials.
polys = gen_2nd_deg_polys([(1, 2, 3), (4, 5, 6)])
THIS SHOULD BE MY OUTPUT:
('1x^2 + 2x + 3 at x = 1 is ', 6)
('4x^2 + 5x + 6 at x = 2 is ', 32)
How can I extend this program to accomplish this task?
sub gen_2nd_deg_polys {
return map(make_2nd_degree_poly(#{$_}), #_);
}
my (#a) = #_;
#a = ([1..3],[4..6]);
my #p3 = gen_2nd_deg_polys(#a);
print #p3->(1);
You can't dereference something that's not a reference. Your created sub returns a list, make it return an anonymous array instead:
return sub {
[ "${k2}x^2 . $k1 x + $k0 at x = $n",
$k2 * $n ** 2 + $k1 *$n + $k0 ]
}
Or, return just one string:
return sub {
"${k2}x^2 . $k1 x + $k0 at x = $n "
. ($k2 * $n ** 2 + $k1 *$n + $k0)
}
Then, you need no dereference:
print $p1->(1), "\n";
print $p2->(2), "\n";
Update
Or, return a list, no dereference needed in such a case, but you might like to add a space in between the formula and the result:
return sub {
my $n = $_[0];
return "${k2}x^2 . $k1 x + $k0 at x = $n",
($k2 * $n ** 2 + $k1 *$n + $k0)
}
# ...
print join ' ', $p1->(1), "\n";
print join ' ', $p2->(2), "\n";
sub gen_2nd_deg_polys {
return map make_2nd_degree_poly(#$_), #_;
}
my #arr = ([1, 2, 3], [4, 5, 6]);
my #p3 = gen_2nd_deg_polys(#arr);
print join ' ', $_->(1), "\n" for #p3;
You can use references to force Perl to interpolate the return value of a function call inside double-quoted strings.So, here your #{} should contain [] to perl interpolate that you are returning an array reference.
You should do #{[]} in print to tell it you are returning an array reference which you are dreferencing using #{}.
Change your print statements to this. And then it should work:
print "#{[$p1->(1)]}\n";
print "#{[$p2->(2)]}\n";

How to finding intervals based on matching elements in perl.?

#t = qw(a b c d e + g h + j k m n l + h +);
#q = qw(a b c d e f g h i j k l m l j h h);
#s = qw(a b c d e f g h k j k l m l j h h);
foreach (0..$#q){
if($t[$_] eq ($q[$_] && $s[$_])){
print "$t[$_]";
}
print "$t[$_]-$t[$_]\n";
elsif($t[$_] eq '+' && $q[$_] eq $s[$_]){
print"$t[$_]";
}
else{
print "\n";
}
}
Expected Output:
abcde+gh [1-8]
jk [10-11]
l+h+ [14-17]
Here #t based on matching both #q and #s, and print the intervals also based on #t.
I am not able to get an intervals as mismatching. please give me a good solution
Your code has an syntax error you introduced with your 4th edit. You can't put any code outside an if's block and its elseif. If I understood it right you wanted to know when the arrays #q, #s and #t line up, where #t is allowed to have '+' as a wildcard.
Here is one solution. It uses a $start variable to check if we are inside an interval and stores the beginning. If we are at the end of an interval or the arrays. We print the interval lengths. There are probably nicer ways to format this. The best would be to introduce more complex ad-hoc objects. The code would be much easier if you were't interested in the indices of the beginning and end of the intervals.
For the test: I restructured it a bit. Furthermore if you already know that $q[$_] eq $s[$_] you won't have to check both $t[$_] eq $s[$_] and $t[$_] eq $q[$_]. You don't have to make that check at all if $t[$_] eq "+"
#!/usr/bin/env perl
use strict; # These aren't optional!
use warnings; # Always use them!
use 5.01; # for the // operator and say
my #t = qw(a b c d e + g h + j k m n l + h +);
my #q = qw(a b c d e f g h i j k l m l j h h);
my #s = qw(a b c d e f g h k j k l m l j h h);
my ($start);
sub print_interval{
my $end = shift;
printf((' 'x(8+$start-$end)). # inserting the whitespaces
"[%2d-%-2d]\n", $start, $end);
}
foreach (0..$#q){
my ($te, $qe, $se) = ($t[$_], $q[$_], $s[$_]); # just shorthands
if($qe eq $se && ($te eq "+" || $te eq $qe)){
$start //= $_; # if not set, set it to the current index
print $te;
}elsif (defined $start){
print_interval($_-1);
undef $start;
}
}
if (defined $start){
# if we are still in an interval at the end,
# we'll have to print that too.
print_interval($#q)
}
If you're uncomfortable with the definedness checks, you also can set $start to -1 and check 0 <= $start.
Here is a solution that uses intermediate objects and saves the results in an array, this makes for nicer formatting and the code is structured better:
# … strict, warnings, array declarations
my ($res,#results);
foreach (0..$#q){
my ($te, $qe, $se) = ($t[$_], $q[$_], $s[$_]);
if($qe eq $se && ($te eq "+" || $te eq $qe)){
$res = {start => $_, string => ''} unless defined $res;
$res->{string} .= $te;
}elsif (defined $res){
$res->{end} = $_-1;
push #results, $res;
undef $res;
}
}
if (defined $res){ # still in interval
$res->{end} = $#q;
push #results, $res;
}
printf "%-9s[%2d-%-2d]\n", #{$_}{qw|string start end|} for #results;
#!/usr/bin/perl
use strict;
use warnings;
my #t = qw(a b c d e + g h + j k m n l + h +);
my #q = qw(a b c d e f g h i j k l m l j h h);
my #s = qw(a b c d e f g h k j k l m l j h h);
my #current_interval = (); #will store the interval we are currently working on
my #intervals = (); #keeps track of all those intervals
for(0 .. $#t){
if($q[$_] eq $s[$_] and ($q[$_] eq $t[$_] or $t[$_] eq '+')){
push(#current_interval, $_);
}
else{
if(#current_interval){
push(#intervals, [$current_interval[0], $current_interval[$#current_interval]]);
#current_interval = ();
}
}
}
#when exiting the loop we dont want to lose our current interval!
if(#current_interval){
push(#intervals, [$current_interval[0], $current_interval[$#current_interval]]);}
#print intervals
for (#intervals){
my #c = #{$_};
print $c[0],"\t",$c[1],"\n";
}
I got the intervals for you.
Please note that I added "use strict; use warnings" - before adding this solution to your project.
Greetings Tim

Perl cannot open file; from Ploteig script

I am a total novice when it comes to computer programming and perl, so please forgive me if this question is simple!
I am trying to run a perl script (called ploteig, a component of a free genetics program download, Eigenstrat:Eigenstrat software) that works fine until I get to line 96-
open (YY, ">$dfile") || die "can't open $dfile\n" ;
I am given the error that the file is unable to be opened and the script dies.
Below, I have provided the entire code for you (since honestly, I have no idea what part of the code could be influencing the inability to open the file). The code uses input from a file created previously with Eigenstrat, example of 4 rows, 12 columns:
#eigvals: 20.388 7.503 4.033 2.929 2.822 2.726 2.700 2.590 2.451 2.365
GREY_BI_011_COMSTOCK_11 0.0164 0.0164 0.0382 -0.1283 -0.0658 0.0406 0.0322 0.0105 -0.0851 -0.0625 Case
GREY_BI_014_COMSTOCK_14 0.0191 0.0094 0.0567 -0.0250 0.0804 -0.0531 -0.0165 0.0321 0.1130 -0.0025 Control
GREY_BI_015_COMSTOCK_15 0.0221 -0.0042 -0.0031 0.0091 0.1448 0.0351 0.0430 0.0359 0.0049 0.0791 Control
(rows represent individual sample pca scores, columns specific pcas. First column sample names, last column case or control status)
Additionally, I call the code as follows:
perl ploteig –i combogreyout.pca.evec –p Case:Control –s Out –c 1:2 –x –o utploteig.xtxt –k
I am really unsure where to go from here. I tried changing the file permissions and ensuring it was in the working directory, but it wouldn’t allow me to change permissions and everything pointed to being in the correct directory. However, I am unsure if either of these are the real problem.
I would very much appreciate any help anyone can give me!
Thank you SO much!
> #!/usr/bin/perl -w
### ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k] [-y]
[-z sep] [-f fixgreen]
use Getopt::Std ;
use File::Basename ;
## pops : separated -x = make postscript and pdf -z use another separator
## -k keep intermediate files
## NEW if pops is a file names are read one per line
getopts('i:o:p:c:s:d:z:t:xkyf',\%opts) ;
$postscmode = $opts{"x"} ;
$oldkeystyle = $opts{"y"} ;
$kflag = $opts{"k"} ;
$keepflag = 1 if ($kflag) ;
$keepflag = 1 unless ($postscmode) ;
$dofixgreen = ( exists $opts{"f"} ? $opts{"f"} : 0 );
$zsep = ":" ;
if (defined $opts{"z"}) {
$zsep = $opts{"z"} ;
$zsep = "\+" if ($zsep eq "+") ;
}
$title = "" ;
if (defined $opts{"t"}) {
$title = $opts{"t"} ;
}
if (defined $opts{"i"}) {
$infile = $opts{"i"} ;
}
else {
usage() ;
exit 0 ;
}
open (FF, $infile) || die "can't open $infile\n" ;
#L = (<FF>) ;
chomp #L ;
$nf = 0 ;
foreach $line (#L) {
next if ($line =~ /\#/) ;
#Z = split " ", $line ;
$x = #Z ;
$nf = $x if ($nf < $x) ;
}
printf "## number of fields: %d\n", $nf ;
$popcol = $nf-1 ;
if (defined $opts{"p"}) {
$pops = $opts{"p"} ;
}
else {
die "p parameter compulsory\n" ;
}
$popsname = setpops ($pops) ;
print "$popsname\n" ;
$c1 = 1; $c2 =2 ;
if (defined $opts{"c"}) {
$cols = $opts{"c"} ;
($c1, $c2) = split ":", $cols ;
die "bad c param: $cols\n" unless (defined $cols) ;
}
$stem = "$infile.$c1:$c2" ;
if (defined $opts{"s"}) {
$stem = $opts{"s"} ;
}
$gnfile = "$stem.$popsname.xtxt" ;
if (defined $opts{"o"}) {
$gnfile = $opts{"o"} ;
}
#T = () ; ## trash
open (GG, ">$gnfile") || die "can't open $gnfile\n" ;
print GG "## " unless ($postscmode) ;
print GG "set terminal postscript color\n" ;
print GG "set title \"$title\" \n" ;
print GG "set key outside\n" unless ($oldkeystyle) ;
print GG "set xlabel \"eigenvector $c1\" \n" ;
print GG "set ylabel \"eigenvector $c2\" \n" ;
print GG "plot " ;
$np = #P ;
$lastpop = $P[$np-1] ;
$d1 = $c1+1 ;
$d2 = $c2+1 ;
foreach $pop (#P) {
$dfile = "$stem:$pop" ;
push #T, $dfile ;
print GG " \"$dfile\" using $d1:$d2 title \"$pop\" " ;
print GG ", \\\n" unless ($pop eq $lastpop) ;
chomp $dfile;
open (YY, ">$dfile") || die "can't open $dfile\n" ;
foreach $line (#L) {
next if ($line =~ /\#/) ;
#Z = split " ", $line ;
next unless (defined $Z[$popcol]) ;
next unless ($Z[$popcol] eq $pop) ;
print YY "$line\n" ;
}
close YY ;
}
print GG "\n" ;
print GG "## " if ($postscmode) ;
print GG "pause 9999\n" ;
close GG ;
if ($postscmode) {
$psfile = "$stem.ps" ;
if ($gnfile =~ /xtxt/) {
$psfile = $gnfile ;
$psfile =~ s/xtxt/ps/ ;
}
system "gnuplot < $gnfile > $psfile" ;
if ( $dofixgreen ) {
system "fixgreen $psfile" ;
}
system "ps2pdf $psfile " ;
}
unlink (#T) unless $keepflag ;
sub usage {
print "ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k]\n" ;
print "-i eigfile input file first col indiv-id last col population\n" ;
print "## as output by smartpca in outputvecs \n" ;
print "-c a:b a, b columns to plot. 1:2 would be common and leading 2 eigenvectors\n" ;
print "-p pops Populations to plot. : delimited. eg -p Bantu:San:French\n" ;
print "## pops can also be a filename. List populations 1 per line\n" ;
print "[-s stem] stem will start various output files\n" ;
print "[-o ofile] ofile will be gnuplot control file. Should have xtxt suffix\n";
print "[-x] make ps and pdf files\n" ;
print "[-k] keep various intermediate files although -x set\n" ;
print "## necessary if .xtxt file is to be hand edited\n" ;
print "[-y] put key at top right inside box (old mode)\n" ;
print "[-t] title (legend)\n" ;
print "[-f] fix green and yellow colors\n";
print "The xtxt file is a gnuplot file and can be easily hand edited. Intermediate files
needed if you want to make your own plot\n" ;
}
sub setpops {
my ($pops) = #_ ;
local (#a, $d, $b, $e) ;
if (-e $pops) {
open (FF1, $pops) || die "can't open $pops\n" ;
#P = () ;
foreach $line (<FF1>) {
($a) = split " ", $line ;
next unless (defined $a) ;
next if ($a =~ /\#/) ;
push #P, $a ;
}
$out = join ":", #P ;
print "## pops: $out\n" ;
($b, $d , $e) = fileparse($pops) ;
return $b ;
}
#P = split $zsep, $pops ;
return $pops ;
}

Parsing File Delimited Vertically in Perl

I have a file that looks like this:
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
Each record chunk has different number of lines, (e.g. CX entry does not always present).
But if CX exists, in only appear as 1 entry only.
We want to get a Hash that takes "MH" as keys and "CX" as values.
Hence parsing the above data we hope to get this structure:
$VAR = { "Urinary Bladder" => ["CYST-" , "VESIC-"]};
What's the right way to parse it?
I'm stuck with this, that doesn't give me result as I want.
use Data::Dumper;
my %bighash;
my $key = "";
my $cx = "";
while (<>) {
chomp;
if (/^MH = (\w+/)) {
$key = $1;
push #{$bighash{$key}}, " ";
}
elsif ( /^CX = (\w+/)) {
$cx = $1;
}
else {
push #{$bighash{$key}}, $cx;
}
}
This becomes simpler if you use $/ to read the data a paragraph at a time. I'm surprised that no-one else has suggested that.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %bighash;
$/ = '';
while (<DATA>) {
if (my ($k) = /^MH = (.*?)$/m and my ($v) = /^CX = (.*?)$/m) {
$bighash{$k} = [ $v =~ /([A-Z]+-)/g ];
}
}
say Dumper \%bighash;
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
The output looks like this:
$VAR1 = {
'Urinary Bladder' => [
'CYST-',
'VESIC-'
]
};
Try the following. And it's probably a good idea to examine the changes (or listen to Aki):
use strict;
use warnings;
use Data::Dumper;
my %bighash;
my $current_key;
while ( <DATA> ) {
chomp;
if ( m/^MH = (.+)/ ) {
$current_key = $1;
} elsif ( /^CX = (.+)/ ) {
my $text = $1;
$bighash{ $current_key } = [ $text =~ /([A-Z]+-)/g ];
}
}
print Dumper ( \%bighash );
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
Update: Used Regex-Captures instead of split and grep
Haven't practiced my perl kung fu lately but the last else statement looks fishy.
Try dropping the last else statement and add the 'push' statement straight after the second elsif. Basically do the push operation straight after matching the CX.
Also, you know that MH must always appear before a CX otherwise the logic breaks.
Fix the regular expressions
/^MH = (\w+/) should be /^MH (\w+)/. You may want to use \s+ or \s* instead of space
Remove push from the if block
Remove else block
In the elsif block Push $cx into hash using the key $key
List item
Add use strict; and use warnings; to your code
Try these and if you have difficulty i will help you with the code
It might be simpler to use Config::Tiny or Config::YAML to do an initial pass over the file and then loop through each record individually. Although if your file is like a gigabyte or more this might suck up all your memory.
Here is something I quickly did, I hope it gives you an idea to start from:
use Data::Dumper;
# Set your record separator
{
local $/="*NEWRECORD\n";
while(<DATA>) {
# Get rid of your separator
chomp($_);
print "Parsing record # $.\n";
push #records, $_ if ( $_ );
}
}
foreach (#records) {
# Get your sub records
#lines = split(/\n/,$_);
my %h = ();
my %result = ();
# Create a hash from your sub records
foreach (#lines) {
($k, $v) = split(/\s*=\s*/, $_);
$h{$k} = $v;
}
# Parse the CX and strip the lower case comments
$h{ 'CX' } =~ s/[a-z]//g;
$h{ 'CX' } =~ s/^\s+//g;
# Have the upper case values as an array ref in the result hash
$result{ $h{ 'MH' } } = [ split( /\s+/, $h{ 'CX' } ) ] if ( $h{ 'CX' } );
print Dumper( \%h );
print "Result:\n";
print Dumper( \%result );
}
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359