Perl file manipulation using Tie::File - perl

I am parsing an HTML file that contains data that is associated in a grid like manner and am close to being done. I had previously thought that removing all blank lines would be needed but I failed to notice that some fields in the grid are blank. I am now trying to use the Tie::File module to store the file in an array, iterate over it, and if there are three continuous blank lines, I want to insert a dummy value I can manipulate later so that the blank line stripping does not alter the structure of my data.
What I have tried so far (the file is ~2 MB):
my @lines;
my $num = 0;
tie @lines, 'Tie::File', 'results.txt';
(tied @lines)->defer;
foreach (@lines)
{
    chomp $lines[$num];
    $num++;
    if ($lines[$num-1] =~ /^$/ && $lines[$num+1] =~ /^$/)
    {
        $lines[$num] = "null";
    }
}
(tied @lines)->flush;
untie @lines;
Edit: How do I go about iterating over the array and inserting the value so there is only one space between each line so I can later get rid of all the blank lines?

If I understand your problem correctly (replace three consecutive empty lines with the word "null" and an empty line on either side), perhaps the following regex operating on your file's contents will help:
use Modern::Perl;
my $htmlFile = do { local $/; <DATA> };
$htmlFile =~ s/(?<!\S)\n{3}/\nnull\n\n/g;
say $htmlFile;
__DATA__
A
B
C
D
E
F
Output:
null
A
B
null
null
C
D
null
E
F
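For the Tie::File route described in the question, a minimal sketch could look like the following. The helper name mark_blank_fields and the rule "replace the middle line of three consecutive empty lines" are assumptions made for illustration, not something taken from the question or the answer above. The helper works on any array reference, so it can be tried on a plain array first.

```perl
use strict;
use warnings;

# Hypothetical helper: mark the middle line of each run of three
# consecutive empty lines with a placeholder. Works on any array
# reference, including one tied with Tie::File (which strips line
# endings by default, so blank lines show up as '').
sub mark_blank_fields {
    my ($lines, $placeholder) = @_;
    for my $i (1 .. $#{$lines} - 1) {
        next unless $lines->[$i - 1] eq ''
                 && $lines->[$i]     eq ''
                 && $lines->[$i + 1] eq '';
        $lines->[$i] = $placeholder;
    }
    return;
}

my @sample = ('A', '', '', '', 'B');
mark_blank_fields(\@sample, 'null');
print join('|', @sample), "\n";
```

With Tie::File, tying an array to results.txt and passing a reference to that array to the helper should apply the same change to the file itself.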

Related

Replace comma with space in just one field - from a .CSV file

I have happened upon a problem with a program that parses through a CSV file with a few million records: two fields in each line has comments that users have put in, and sometimes they use commas within their comments. If there are commas input, that field will be contained in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -
1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30
NOTE - I do not have the Text::CSV module available to me, nor will it be made available in the server I am using.
Here is part of my code in parsing this file. The first thing I do is concatenate the very first three fields and prepend that concatenated field to each line. Then I want to clear out the commas in @fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is clearing out those commas -
my @data;
# Read the lines one by one.
while ( $line = <$FH> ) {
    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
    # remove user input commas in fields[7,19]
    $_ = for fields[7,19];
    # format DATE and DATETIME fields for MySQL/sqlbatch60
    $_ = join '-', (split /\//)[2,0,1] for @fields[14,20,23];
    $_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
    # write the parsed record back to the file
    push @data, \@fields;
}
If it is ONLY the eighth field that is troubling AND you know exactly how many fields there should be, you can do it this way:
Suppose the total number of fields is always N.
Split the line on commas.
Separate and store the first seven fields.
Separate and store the last n fields, where n is N-8.
Rejoin what remains with commas. This now forms field 8,
and then you can do whatever you like with it. For example, write it to a proper CSV file.
Text::CSV_XS handles quoted commas just fine:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my $aoa = csv(in => 'file.csv'); # The file contains the sample line.
print $aoa->[0][6];
Note: The two main versions below clean up one field. The most recent change in the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.
All code has been tested with the supplied example and its variations.
Following clarifications, this deals with the case when the file need be processed by hand. A module is easily recommended for parsing .csv, but there is a problem here: reliance on the user to enter double quotes. If they end up not being there we have a malformed file.
I take it that the number of fields in the file is known with certainty and ahead of time.
The two independent solutions below use either array or string processing.
(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements by space and then overwrite the array with correct fields. This is similar to what is outlined in the answer by vanHoesel.
use strict;
use warnings;
my $num_fields = 39; # what should be, using the example
my $ibad = 6; # index of the malformed field-to-be
my @last = (-($num_fields-$ibad-1)..-1); # index-range, rest of fields
my $file = "file.csv";
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) { # chomp it if needed
    my @fields = split ',', $line;
    if (@fields != $num_fields) {
        # join extra elements by space
        my $fixed = join ' ', @fields[$ibad..$ibad+@fields-$num_fields];
        # overwrite array by good fields
        @fields = (@fields[0..$ibad-1], $fixed, @fields[@last]);
    }
    # Process @fields normally
    print "@fields";
}
close $fh;
(2) Preprocess the file, only checking for malformed lines and fixing them as needed. Uses string manipulations. (Or, the method above can be used.) The $num_fields and $ibad are the same.
while (my $line = <$fh>) {
    # Number of fields: commas + 1 (tr|,|| counts number of ",")
    my $have_fields = $line =~ tr|,|| + 1;
    if ($have_fields != $num_fields) {
        # Get indices of commas delimiting the bad field
        my ($beg, $end) = map {
            my $p = '[^,]*,' x $_;
            $line =~ /^$p/ and $+[0]-1;
        } ($ibad, $ibad+$have_fields-$num_fields);
        # Replace extra commas and overwrite that part of the string
        my $bad_field = substr($line, $beg+1, $end-$beg-1);
        (my $fixed = $bad_field) =~ tr/,/ /;
        substr($line, $beg+1, $end-$beg-1) = $fixed;
    }
    # Perhaps write the line out, for a corrected .csv file
    print $line;
}
In the last statement of the if block the bad part of $line is overwritten by assigning to substr, which this function allows since it can be used as an lvalue. The new substring $fixed is constructed with commas changed to spaces (or removed, if desired), and used to overwrite the bad part of $line. See the docs for substr.
If quotes are known to be there a regex can be used. This works with any number of bad fields.
while (my $line = <$fh>) {
    $line =~ s/"([^"]+)"/join ' ', split(',', $1)/eg; # "
    # process the line. note that double quotes are removed
}
If the quotes are to be kept, move them inside the parentheses so that they are captured as well.
This one line is all that need be done after while (...) { to clean up data.
The /e modifier makes the replacement side be evaluated as code, instead of being used as a double-quoted string. There the matched part of the line (between ") is split by comma and then joined by space, thus fixing the field. See the last item under "Search and replace" in perlretut.
All code has been tested with multiple lines and multiple commas in the bad field.
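As a variation on the regex approach, the sketch below keeps the double quotes and only swaps the commas inside them for spaces. The helper name clean_quoted_commas is made up for illustration, and the code assumes that quotes are always balanced and never escaped inside a field. It relies on the /r flag of tr, so it needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # for the non-destructive /r flag of tr

# Hypothetical helper: replace commas inside double-quoted fields with
# spaces, leaving the quotes and the field-separating commas in place.
# Assumes balanced quotes with no escaped quotes inside a field.
sub clean_quoted_commas {
    my ($line) = @_;
    $line =~ s{("[^"]*")}{ $1 =~ tr/,/ /r }ge;
    return $line;
}

print clean_quoted_commas('1,"MADE FOR DRAWDOWNS, NEVER P/U",16,"PA-3, PURE",4'), "\n";
```

After this step a plain split on commas yields one element per field, with the quotes still attached to the cleaned fields.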

Using Perl to find and fix errors in CSV files

I am dealing with very large amounts of data. Every now and then there is a slip up. I want to identify each row with an error, under a condition of my choice. With that I want the row number along with the line number of each erroneous row. I will be running this script on a handful of files and I will want to output the report to one.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab that line from the untouched original CSV file to redo (both modified and original files contain the same amount of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma , otherwise you will need to use Text::CSV to process it instead of just a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
    my @fields = split /,/;
    if( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
        print "Row#$.,$_";
    }
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to 1 at the start of each file. Without it, the count just continues upwards across all the files.
You would run this like
rob@Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
@ARGV = map { glob qq{"$_"} } @ARGV; # For Windows
while (<>) {
    if ($. == 1) {
        print "\n\nFile: $ARGV\n\n";
        next;
    }
    my @fields = split /,/;
    unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
        printf "Row#%d,%s", $., $_;
    }
    close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones
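The pattern used above only checks the shape of the date field, so a value such as 2014-02-30 would pass. If real calendar dates matter, a stricter check might look like the sketch below. The helper name is_valid_date is an assumption made for illustration, and the round-trip comparison is one possible technique, not part of the answer above. It uses the core Time::Piece module.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: true only if $text is a real calendar date in
# YYYY-MM-DD form. The parsed date is formatted back and compared with
# the input, so rolled-over or partially parsed values are rejected.
sub is_valid_date {
    my ($text) = @_;
    return 0 unless $text =~ /^\d{4}-\d{2}-\d{2}$/;
    my $parsed = eval { Time::Piece->strptime($text, '%Y-%m-%d') };
    return 0 unless $parsed;
    return $parsed->ymd eq $text ? 1 : 0;
}

for my $candidate ('2014-08-14', 'not_a_date', '2014-02-30') {
    printf "%s => %s\n", $candidate, is_valid_date($candidate) ? 'ok' : 'bad';
}
```

In the loop above, the unless test on $fields[4] could then call is_valid_date($fields[4]) in place of the regex match.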

i want to merge multiple csv files by specific condition using perl

I have multiple CSV files and I want to merge all of them.
Some of my sample CSV files are shown below.
M1DL1_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10
M1DL2_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111
M1DL3_Interpro_sum.csv
IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3
now to merge these files i have tried the following code
@ARGV = <merge_csvfiles/*.csv>;
print @ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(@ARGV))
{
    open(FILE,@ARGV[$i]) || die"Can't open ...@ARGV[$i]...\n";
    $data.=join("",<FILE>);
    close FILE;
    print"file completed...",$i+1,"\n";
    $i++;
}
@data=split("\n",$data);
@data2=@data;
print scalar(@data);
for($i=0;$i<scalar(@data);$i++)
{
    @id1=split(",",@data[$i]);
    $id_1=@id1[0];
    @data[$j]=~s/\n//;
    if(@data[$i] ne "")
    {
        print PAGE "\n@data[$i],";
        for($j=$i+1;$j<scalar(@data2);$j++)
        {
            @id2=split(",",@data2[$j]);
            $id_2=@id2[0];
            if($id_1 eq $id_2)
            {
                @data[$j]=~s/\n//;
                print PAGE "@data2[$j],";
                @data2[$j]="";
                @data[$j]="";
                print "match found at ",$i+1," and ",$j+1,"\n";
            }
        }
    }
    print $i+1,"\n";
}
merge_csvfiles is a folder which contains all the files
output of above code is
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase
but i want the output in following format....
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
Has anybody got any idea how can i do this?
Thank you for the help
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
#!/usr/bin/env perl
use strict;
use warnings;
my @input_files = glob 'merge_csvfiles/*.csv';
my %data;
for my $i (0 .. $#input_files) {
    open my $infh, '<', $input_files[$i]
        or die "Failed to open $input_files[$i]: $!";
    while (<$infh>) {
        chomp;
        my $id = (split ',', $_, 2)[0];
        $data{$id}[$i] = $_;
    }
    print "Input file read: $input_files[$i]\n";
}
open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
    my @merge_data;
    for my $i (0 .. $#input_files) {
        push @merge_data, $data{$id}[$i] || '0,0,0';
    }
    print $outfh join(',', @merge_data) . "\n";
}
The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for that ID from all files are kept together, and the value for each key is (a reference to) an array of the line associated with that ID in each file; using an array for this allows us to keep track of values which are missing as well as those which are present.
The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
The results, in outfile.csv, are:
IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
Edit: Added explanations requested by OP in comments
Can you explain how my $id = (split ',', $_, 2)[0]; and $# work in this program?
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:
Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to only produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
Putting (...)[0] around the split command takes a list slice of the fields it returns, selecting the first element. It's the same as if I'd written my @fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
$#array returns the highest-numbered index in the array @array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in @array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (@input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)
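A small standalone sketch of the two constructs explained above may help. The row is taken from the question's sample data, while the file names in @files are made up for illustration.

```perl
use strict;
use warnings;

# Sample row taken from the question's data; the file names are made up.
my $row   = 'IPR017690,Outer membrane, omp85 target,821';
my @files = ('a.csv', 'b.csv', 'c.csv');

# List slice: keep only the first field; the limit 2 stops split early.
my $id = (split ',', $row, 2)[0];

# With a limit of 2, everything after the first comma stays unsplit.
my @parts = split ',', $row, 2;

print "id: $id\n";
print "rest: $parts[1]\n";
print 'last index: ', $#files, ', count: ', scalar(@files), "\n";
```

The last line shows that $#files is always one less than the number of elements, which is why 0 .. $#files visits every index exactly once.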

Perl: How to add a line to sorted text file

I want to add a line to a text file in Perl whose data is in sorted order. I have seen examples which show how to append data at the end of the file, but that does not work for me since I want the data to stay sorted.
Please guide me how can it be done.
Basically from what I have tried so far :
(I open the file and grep its contents to see if the line I want to add already exists. If it does, I exit; otherwise I want to add it to the file such that the data remains sorted.)
open(my $FH, $file) or die "Failed to open file $file \n";
@file_data = <$FH>;
close($FH);
my $line = grep (/$string1/, @file_data);
if($line) {
    print "Found\n";
    exit(1);
}
else
{
    #add the line to the file
    print "Not found!\n";
}
Here's an approach using Tie::File so that you can easily treat the file as an array, and List::BinarySearch's bsearch_str_pos function to quickly find the insert point. Once you've found the insert point, you check to see if the element at that point is equal to your insert string. If it's not, splice it into the array. If it is equal, don't splice it in. And finish up with untie so that the file gets closed cleanly.
use strict;
use warnings;
use Tie::File;
use List::BinarySearch qw(bsearch_str_pos);
my $insert_string = 'Whatever!';
my $file = 'something.txt';
my @array;
tie @array, 'Tie::File', $file or die $!;
my $idx = bsearch_str_pos $insert_string, @array;
splice @array, $idx, 0, $insert_string
    if $array[$idx] ne $insert_string;
untie @array;
The bsearch_str_pos function from List::BinarySearch is an adaptation of a binary search implementation from Mastering Algorithms with Perl. Its convenient characteristic is that if the search string isn't found, it returns the index point where it could be inserted while maintaining the sort order.
Since you have to read the contents of the text file anyway, how about a different approach?
Read the lines in the file one-by-one, comparing against your target string. If you read a line equal to the target string, then you don't have to do anything.
Otherwise, you eventually read a line 'greater' than your current line according to your sort criteria, or you hit the end of the file. In the former case, you just insert the string at that position, and then copy the rest of the lines. In the latter case, you append the string to the end.
If you don't want to do it that way, you can do a binary search in @file_data to find the spot to add the line without having to examine all of the entries, then insert it into the array before outputting the array to the file.
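The binary-search variant could be sketched without any module as follows. The helper name insert_sorted is made up for illustration, and the code assumes the lines are already sorted in plain string order (the order used by lt and cmp) and have been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: insert $new into the sorted array behind $lines
# unless it is already present. Returns 1 if inserted, 0 if found.
# Assumes the array is sorted in plain string (lt/cmp) order.
sub insert_sorted {
    my ($lines, $new) = @_;
    my ($lo, $hi) = (0, scalar @$lines);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($lines->[$mid] lt $new) { $lo = $mid + 1 }
        else                        { $hi = $mid }
    }
    # $lo is now the first position whose element is not less than $new
    return 0 if $lo < @$lines && $lines->[$lo] eq $new;
    splice @$lines, $lo, 0, $new;
    return 1;
}

my @fruit = qw(ananas apple banana pear);
insert_sorted(\@fruit, 'lemon');
print "$_\n" for @fruit;
```

After the call the array can be written back to the file, one element per line.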
Here's a simple version that reads from stdin (or filename(s) specified on command line) and appends 'string to append' to the output if it's not found in the input. Output is printed on stdout.
#! /usr/bin/perl
$found = 0;
$append='string to append';
while(<>) {
    $found = 1 if (m/$append/o);
    print
}
print "$append\n" unless ($found);
Modifying it to edit a file in-place (with perl -i) and taking the append string from the command line would be quite simple.
A 'simple' one-liner to insert a line without using any module could be:
perl -ni -le '$insert="lemon"; $eq=($insert cmp $_); if ($eq == 0){$found++}elsif($eq==-1 && !$found){print$insert} print'
Given a list.txt whose content is:
ananas
apple
banana
pear
the output is:
ananas
apple
banana
lemon
pear
{
    local ($^I, @ARGV) = ("", $file); # Enable in-place editing of $file
    while (<>) {
        # If we found the line exactly, bail out without printing it twice
        last if $_ eq $insert;
        # If we found the place where the line should be, insert it
        if ($_ gt $insert) {
            print $insert;
            print;
            last;
        }
        print;
    }
    # We've passed the insertion point, now output the rest of the file
    print while <>;
}
Essentially the same answer as pavel's, except with a lot of readability added. Note that $insert should already contain a trailing newline.

how to put a file into an array and save it in perl

Hello everyone. I'm a beginner in Perl and I'm facing some problems: I want to put my strings, each running from AA to \, into an array and save it. There are about 2000-3000 such strings in a txt file, all starting with the same initials, i.e., from AA to \. This is how I'm doing it; please correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
    chomp;
    if (m/^\\$/)
    {
        print "$_\n";
        $recnum++;
    }
    else
    {
        print "$recnum $_\n";
    }
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
    chomp;
    if (m/^\\$/)
    {
        # End of group - save to data array and start new hash
        $data[$nrecs++] = $hashref;
        $hashref = { };
    }
    else
    {
        m/^([A-Z]+)\s+(.*)$/;
        $hashref->{$1} = $2;
    }
}
foreach my $i (0..$nrecs-1)
{
    print "Record $i:\n";
    # Dereference explicitly: keys on a bare reference fails on newer Perls
    foreach my $key (sort keys %{ $data[$i] })
    {
        print " $key = $data[$i]->{$key}\n";
    }
}
print Data::Dumper->Dump([ \@data ], [ '@data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$@data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out because you do not reset the flag whenever you find a break line. You'll have to parse your input and exit the loop if necessary.
Second, you never read any input within this loop and thus process the same $line over and over again.
You should not put the loop inside your code; instead you can use the following pattern (pseudo-code):
if flag != 0
append item to array
else
save array to file
start with new array
end
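One way to read the pseudo-code above in Perl is sketched below. It collects the records in memory and leaves writing them out to the caller. The helper name group_records is made up for illustration, and the sketch assumes that a record ends at a line holding only a backslash and that the lines have been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: collect lines into records, closing a record at
# each line that holds only a backslash. Returns a list of array refs.
sub group_records {
    my @lines = @_;
    my (@records, @current);
    for my $line (@lines) {
        if ($line =~ /^\\\s*$/) {
            push @records, [@current];    # record complete: save a copy
            @current = ();                # start with a new array
        }
        else {
            push @current, $line;         # append item to current record
        }
    }
    push @records, [@current] if @current;    # unterminated last record
    return @records;
}

my @records = group_records('AA c0001', 'BB abc', '\\', 'AA e0002', '\\');
print scalar(@records), " records\n";
```

Each element of the returned list is a reference to the lines of one record, so $records[0][0] is the AA line of the first record.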
I believe what you want is to split the files content at \ though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandles, read perlvar.
#!perl
use strict;
use warnings;
my $content;
{
    open my $fh, '<', 'test.txt' or die "Can't open test.txt: $!";
    local $/; # slurp mode
    $content = <$fh>;
    close $fh;
}
my @blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my @blocks;
{
    open my $fh, '<', 'test.txt' or die "Can't open test.txt: $!";
    local $/ = '\\'; # separate at \
    @blocks = <$fh>;
    close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the @data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA", and save that into $line
Concatenate every new line from the file into $line
When you find a line that begins with a backslash \, stop concatenating lines and save $line into @data.
Then, find the next line that begins with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA\b/, but then you need to watch out for whitespace at the beginning and end of the line. So perhaps /^\s*\\\s*$/ and /^\s*AA\b/ instead.
The code:
use warnings;
use strict;
my $line="";
my @data;
while (<DATA>) {
    if (/^AA/) {
        $line = $_;
        while (<DATA>) {
            $line .= $_;
            last if /^\\/;
        }
        push @data, $line; # save the record only once it is complete
    }
}
use Data::Dumper;
print Dumper \@data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\