I have a file that I need to parse in the following format. (All delimiters are spaces):
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.
I am familiar with how to parse a single line fixed-width file, but am stumped with how to handle multiple lines.
#!/usr/bin/env perl
use strict;
use warnings;

my (%fields, $current_field);

while (my $line = <DATA>) {
    next unless $line =~ /\S/;
    if ($line =~ /^ \s+ ( \S .+ )/x) {
        # continuation line: append to the current field's value
        if (defined $current_field) {
            $fields{$current_field} .= " $1";
        }
    }
    elsif ($line =~ /^ (.+?) : \s+ (.+) \s+ /x) {
        # "field name: value" line: start a new field
        $current_field = $1;
        $fields{$current_field} = $2;
    }
}

use Data::Dumper;
print Dumper \%fields;
__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.
Fixed-width says unpack to me. It is possible to parse this with regexes and split, but unpack should be a safer choice, as it is the Right Tool for fixed-width data.
I put the width of the first field at 12 and the empty space between at 13, which works for this data. You may need to change that. The template "A12A13A*" means "take 12 ASCII characters, then 13 more, followed by any number of remaining ASCII characters". unpack returns a list of these extracted fields. Also, unpack will use $_ if a string is not supplied, which is what we do here.
Note that if the first field is not fixed width up to the colon, as it appears to be in your sample data, you'll need to merge the fields in the template, e.g. "A25A*", and then strip the colon.
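A minimal sketch of that variant (assuming the value still starts at a fixed column, 26 here, and using $line for the current input line):
my ($label, $text) = unpack "A25A*", $line;   # label plus its padding, then the value
$label =~ s/:$//;                             # strip the trailing colon yourself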
I chose an array as the storage structure, as I do not know whether your field names are unique. A hash would overwrite fields that have the same name. Another benefit of an array is that it preserves the order of the data as it appears in the file. If these things are irrelevant and quick lookup is more of a priority, use a hash instead.
Code:
use strict;
use warnings;
use Data::Dumper;

my $last_text;
my @array;

while (<DATA>) {
    # unpack the fields and strip spaces
    my ($field, undef, $text) = unpack "A12A13A*";
    if ($field) {                         # if $field is empty, we have a multi-line value
        $field =~ s/:$//;                 # strip the colon
        $last_text = [ $field, $text ];   # store data in anonymous array
        push @array, $last_text;          # and store that array in @array
    } else {                              # multi-line values get added to the previous line's data
        $last_text->[1] .= " $text";
    }
}

print Dumper \@array;
__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value
                         with a third line
Output:
$VAR1 = [
          [
            'field name 1:',
            'Multiple word value.'
          ],
          [
            'field name 2:',
            'Multiple word value along with multiple lines.'
          ],
          [
            'field name 3:',
            'Another multiple word and multiple line value with a third line'
          ]
        ];
You could do this:
#!/usr/bin/perl
use strict;
use warnings;

my @fields;

open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";

for (<$fh>) {
    if (/^\s/) {
        $fields[$#fields] .= $_;
    } else {
        push @fields, $_;
    }
}
close $fh;
If the line starts with white space, append it to the last element in @fields, otherwise push it onto the end of the array.
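If you then want each record on a single line, a small follow-up sketch (my addition, not part of the answer above) could collapse the embedded newlines:
for my $record (@fields) {
    $record =~ s/\n\s+/ /g;   # join continuation lines onto the first line with a space
    print $record;
}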
Alternatively, slurp the entire file and split with look-around:
#!/usr/bin/perl
use strict;
use warnings;
$/=undef;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
my @fields = split /(?<=\n)(?!\s)/, <$fh>;
close $fh;
It's not a recommended approach though.
You can change the input record separator, $/, so that each "field name ..." block is read as a single record:
$/ = "\nfield name";
while (my $record = <FILE>) {   # FILE is assumed to be an already-open filehandle
    chomp $record;              # removes the trailing "\nfield name" separator
    if ($record =~ /(\d+):\s+(.+)/s) {
        print "Record $1 is $2\n";
    }
}
Related
I have written a script which collects the marks of students and prints the ones who scored above 50.
The script is below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
print Dumper(\@array);
my $class = "3";
foreach my $each_value (@array) {
    print "EACH: $each_value\n";
    my ($name, $score) = split (/,/, $each_value);
    if ($score lt 50) {
        next;
    } else {
        print "$name, \"GOOD SCORE\", $score, $class";
    }
}
Here I want to print the data for STUDENT1, since that score is greater than 50.
So output should be:
STUDENT1, "GOOD SCORE", 90, 3
But it prints the output like this:
STUDENT1, "GOOD SCORE", 90
STUDENT2, 3
Something goes wrong between the 90 and STUDENT2: nothing separates them, so they run together.
I know I am not splitting the data on the newline character, since we have a single element in the array @array.
How can I split that single element on newlines, so that inside the for loop I can split again on the comma (,) to get the values into $name and $score?
Actually, the @array comes into this script as an argument, so I have to modify this script in order to parse the right values.
As you already know, your "array" has only one "element": a string containing the actual records, so it is essentially more a scalar than an array.
And as you suspect, you can split this scalar just as you already did, using the newline as a separator instead of a comma. You can then put a foreach around the result of split() to iterate over the records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $records = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
my $class = "3";
foreach my $record (split("\n", $records)) {
    my ($name, $score) = split(',', $record);
    if ($score >= 50) {
        print("$name, \"GOOD SCORE\", $score, $class\n");
    }
}
As a small note, lt is a string comparison operator. The numeric comparisons use symbols, such as <.
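A tiny illustration of the difference (hypothetical values, not part of the original script):
print "string\n" if "100" lt "50";   # prints: as strings, "100" sorts before "50"
print "number\n" if  100  <   50;    # does not print: 100 is not less than 50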
Although you have an array, you only have a single string value in it:
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
That's not a big deal. Dave Cross has already shown you how you can break that up into multiple values, but there's another way I like to handle multi-line strings. You can open a filehandle on a reference to the string, then read lines from the string as you would from a file:
my $string = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
    chomp;
    ...
}
One of the things to consider while programming is how many times you are duplicating the data. If you have it in a big string and then split it into an array, you've now stored the data twice. That might be fine, and it's usually expedient. You can't always avoid it, but you should have some tools in your toolbox that let you avoid it.
And, here's a chance to use indented here docs:
use v5.26;
my $string = <<~"HERE";
STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
For your particular problem, I think you have a single string where the lines are separated by the '|' character. You don't show how you call this program or get the data, though.
You can choose any line ending you like by setting the value for the input record separator, $/. Set it to a pipe and this works:
use v5.10;
my $string = 'STUDENT1,90|STUDENT2,40|STUDENT3,30|STUDENT4,30';
{
    local $/ = '|'; # input record separator
    open my $string_fh, '<', \$string;
    while( <$string_fh> ) {
        chomp;
        say "Got $_";
    }
}
Now the structure of your program isn't too far away from taking the data from standard input or a file. That gives you a lot of flexibility.
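For example, here is a minimal sketch (my assumption about how you might run it) that reads the same pipe-separated records from a file named on the command line, or from standard input:
use v5.10;

{
    local $/ = '|';     # same input record separator as above
    while (<>) {        # reads from files named on the command line, or STDIN
        chomp;
        say "Got $_";   # note: the final record may still carry a trailing newline
    }
}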
The @array contains only one element. The for loop itself works correctly; you can fix the problem without any change to the for block, just by replacing the array with this:
my @array = (
    'STUDENT1,90',
    'STUDENT2,40',
    'STUDENT3,30',
    'STUDENT4,30');
Otherwise you can iterate over the records by splitting the single element into lines on the newline character \n, as sketched below.
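A minimal sketch of that approach (reusing @array and $class from the question's script):
# Split the single string element into one record per line, then split each
# record on the comma as before.
foreach my $each_value (split /\n/, $array[0]) {
    my ($name, $score) = split /,/, $each_value;
    next unless defined $score && $score >= 50;
    print "$name, \"GOOD SCORE\", $score, $class\n";
}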
How do I remove the last line of a file using Perl?
My data looks like this:
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
How do I remove the last line? I am summing all the numbers on each line, but I receive a null value at the end.
I tried using chomp, but it did not work.
Here is the code currently being used:
while (<data>) {
    chomp(my @row = (split ',', $_, -1));
    say sum @row[1 .. $#row];
}
Try this (a shell one-liner):
perl -lne '!eof() and print' file
or as part of a script :
while (defined($_ = readline ARGV)) {
    print $_ unless eof();
}
You should be using Text::CSV or Text::CSV_XS for handling comma separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);

# $fh is assumed to be a filehandle already opened on your data file.
my $csv = Text::CSV->new({binary => 1})
    or die "Cannot use CSV: " . Text::CSV->error_diag;

while (my $row = $csv->getline($fh)) {
    next unless ($row->[0] || '') =~ m/\w/; # Reject rows that don't start with an identifier.
    my $sum = sum(@$row[1 .. $#$row]);
    print "$sum\n";
}
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(@row) && length $row[0]; # Skip empty rows.
The point of this line is to detect when a row is empty: it has no elements, or its elements were empty after the chomp.
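Put together, the original loop with that guard added might look like this (a sketch; it assumes the data filehandle and say/sum imports from the rest of the script):
use feature 'say';
use List::Util qw(sum);

while (<data>) {    # 'data' is assumed to be the already-open input filehandle
    chomp(my @row = split ',', $_, -1);
    next unless scalar(@row) && length $row[0];   # skip empty rows
    say sum @row[1 .. $#row];
}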
I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
    # Chomp the input before splitting it.
    chomp;

    # Remove the -1 from your call to split().
    # This automatically removes any empty trailing fields.
    my @row = split /,/;

    # Skip lines that are empty.
    # 1/ Ensure there is data in @row.
    # 2/ Ensure at least one element in @row contains
    #    non-whitespace data.
    next unless @row and grep { /\S/ } @row;

    # Ensure that all of the data you pass to sum()
    # looks like numbers.
    say sum grep { looks_like_number $_ } @row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,
I'm new to Perl, so please bear with me and my ignorance. What I'm trying to do is read a file (I'm already using the File::Slurp module) and create variables from the data in the file. Currently I have this setup:
use File::Slurp;
my @targets = read_file("targetfile.txt");
print @targets;
Within that target file, I have the following bits of data:
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
The first line is an ID, the second line is a name, and all successive lines will be IP addresses (up to a few hundred of them).
So my goal is to read that file and create variables that look something like this:
$var1="123456789";
$var2="anytownusa";
$var3="1.2.3.4/32,5.6.7.8/32,etc,etc,etc,etc,etc";
** Note that all the IP addresses end up grouped together in a single variable, separated by a comma (,).
File::Slurp will read the complete file data in one go. This might cause an issue if the file size is very big. Let me show you a simple approach to this problem.
Read the file line by line using a while loop
Check the line number using $. and assign the line's data to the respective variable
Store the IPs in an array and, at the end, print them using join
Note: If you have to alter the line data then use search and replace in the respective conditional block before assigning the line data to the variable.
Code:
#!/usr/bin/perl
use strict;
use warnings;
my ($id, $name, @ips);

while (<DATA>) {
    chomp;
    if ($. == 1) {
        $id = $_;
    }
    elsif ($. == 2) {
        $name = $_;
    }
    else {
        push @ips, $_;
    }
}

print "$id\n";
print "$name\n";
print join ",", @ips;
__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
As it has been noted, there is no reason to "slurp" the whole file into a variable. If nothing else, it only makes the processing harder.
Also, why not store the named labels in a hash? In this example:
my %identity = (id => 123456789, name => 'anytownusa');
The code below picks up the key names from the file; they aren't hard-coded.
Then
use warnings;
use strict;
use feature 'say';
my (@ips, %identity);

my $file = 'targetfile.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

while (<$fh>)
{
    next if not /\S/;
    chomp;
    my ($m1, $m2) = split /:/;
    if ($m1 and $m2) { $identity{$m1} = $m2; }
    else             { push @ips, $m1; }
}
say "$_: $identity{$_}" for keys %identity;
say join '/', @ips;
If the line doesn't have a :, the split will return it whole; that is the IP, which is stored in an array for later processing. Otherwise split returns the named pair, for 'id' and 'name'.
We first skipped blank lines with next if not /\S/;, so the line must have some non-space content and the else suffices, as there is always something in $m1. We also need to remove the newline, with chomp.
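A tiny demonstration of that split behaviour, with made-up input:
my ($m1, $m2)   = split /:/, 'id: 123456789';   # ('id', ' 123456789') -- a named pair
my ($ip, $rest) = split /:/, '1.2.3.4/32';      # ('1.2.3.4/32', undef) -- an IP line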
Read the file into variables directly:
use Modern::Perl;
# In list context the first <DATA> reads all remaining lines,
# so $id gets line 1, $name gets line 2, and @ips gets the rest.
my ($id, $name, @ips) = (<DATA>, <DATA>, <DATA>);
chomp ($id, $name, @ips);
say $id;
say $name;
$" = ',';
say "@ips";
__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
Output:
id: 123456789
name: anytownusa
1.2.3.4/32,5.6.7.8/32
I am struggling with a Perl script that parses an eight-column CSV line into another CSV line using the split command, but I want to exclude all the text enclosed by square brackets []. The line looks like:
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
I used the following script, but when I print $fields[7] it gives me N, one of the fields inside the [] above. With print "$fields[7]" I want it to be 1399385680, which is the last field in the above line. The script I tried was:
while (my $line = <LOG>) {
    chomp $line;
    my @fields = grep { !/^[\[.*\]]$/ } split ",", $line;
    my $timestamp = $fields[7];
    print "$fields[7]";
}
Thanks for your time. I will appreciate your help.
Always include use strict; and use warnings; at the top of EVERY perl script.
Your "csv" file isn't proper csv. So the only thing I can suggest is to remove the contents in the brackets before you split:
use strict;
use warnings;
while (<DATA>) {
    chomp;
    s/\[.*?\]//g;
    my @fields = split ',', $_;
    my $timestamp = $fields[7];
    print "$timestamp\n";
}
__DATA__
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
Outputs:
1399385680
Obviously it is possible to also capture the contents of the bracketed fields, but you didn't say that was a requirement or goal.
Update
If you want to capture the bracket delimited field, one method would be to use a regex for capturing instead.
Note, this current regex requires that each field has a value.
chomp;
my @fields = $_ =~ /(\[.*?\]|[^,]+)(?:,|$)/g;
my $timestamp = $fields[7];
print "$timestamp";
Well, if you want to actually ignore the text between square brackets, you might as well get rid of it:
while ( my $line = <LOG> ) {
    chomp $line;
    $line =~ s,\[.*?\],,g;    # Delete all text between square brackets
    my @fields = split ",", $line;
    my $timestamp = $fields[7];
    print $fields[7], "\n";
}
I have a mixed-character-separated file with a header row that I am trying to read using Text::CSV, which I have used successfully on comma-separated files to pull data into an array of hashes in other scripts. I have read that Text::CSV does not support multiple separators (spaces, tabs, commas), so I was trying to clean up each row with a regex before handing it to Text::CSV. The data file also has comment lines in the middle of the file. Unfortunately, I do not have admin rights to install libraries that might accommodate multiple sep_chars, so I was hoping I could use Text::CSV or some other standard method to clean up the header and each row before adding them to the AoH. Or should I abandon Text::CSV?
I'm obviously still learning. Thanks in advance.
Example file:
#
#
#
# name scale address type
test.data.one 32768 0x1234fde0 float
test.data.two 32768 0x1234fde4 float
test.data.the 32768 0x1234fde8 float
# comment lines in middle of data
test.data.for 32768 0x1234fdec float
test.data.fiv 32768 0x1234fdf0 float
Code excerpt:
my $fh;
my $input;
my $header;
my $pkey;
my $row;
my %arrayofhashes;
my $csv=Text::CSV({sep_char = ","})
or die "Text::CSV error: " Text::CSV=error_diag;
open($fh, '<:encoding(UTF-8)', $input)
or die "Can't open $input: $!";
while (<$fh>) {
    $line = $_;
    # skip to header row
    next if($line !~ /^# name/);
    # strip off leading chars on first column name
    $header =~ s/# //g;
    # replace multiple spaces and tabs with comma
    $header =~ s/ +/,/g;
    $header =~ s/t+/,/g;
    # results in $header = "name,scale,address,type"
    last;
}
my @header = split(",", $header);
$csv->parse($header);
$csv->column_names([$csv->fields]);
# above seems to work!
$pkey = 0;
while (<$fh>) {
    $line = $_;
    # skip comment lines
    next if ($line =~ /^#/);
    # replace spaces and tabs with commas
    $line =~ s/( +|\t+)/,/g;
    # replace multiple commas from previous regex with single comma
    $line =~ s/,+/,/g;
    # results in $line = "test.data.one,32768,0x1234fdec,float"
    # need help trying to create a what I think needs to be a hash from the header and row.
    $row = ?????;
    # the following line works in my other perl scripts for CSV files when using:
    # while ($row = $csv->getline_hr($fh)) instead of the above.
    $arrayofhashes{$pkey} = $row;
    $pkey++;
}
If your columns are separated by multiple spaces, Text::CSV is useless. Your code contains a lot of repeated code trying to work around Text::CSV's limitations.
Also, your code has bad style, contains multiple syntax errors and typos, and uses confused variable names.
So You Want To Parse A Header.
We need a definition of the header line for our code. Let's take “the first comment line that contains non-space characters”. It may not be preceded by non-comment lines.
use strict; use warnings; use autodie;

open my $fh, '<:encoding(UTF-8)', "filename.tsv";  # error handling by autodie

my @headers;
while (<$fh>) {
    # no need to copy to a $line variable, the $_ is just fine.
    chomp;                                      # remove line ending
    s/\A#\s*// or die "No header line found";   # remove comment char, or die
    /\S/ or next;                               # skip if there is nothing here
    @headers = split;                           # split the header names.
                                                # The `split` defaults to `split /\s+/, $_`
    last;                                       # break out of the loop: the header was found
}
The \s character class matches space characters (spaces, tabs, newlines, etc.). The \S is the inverse and matches all non-space characters.
The Rest
Now we have our header names, and can proceed to normal parsing:
my @records;
while (<$fh>) {
    chomp;
    next if /\A#/;               # skip comments
    my @fields = split;
    my %hash;
    @hash{@headers} = @fields;   # use hash slice to assign fields to headers
    push @records, \%hash;       # add this hashref to our records
}
Voilà.
The Result
This code produces the following data structure from your example data:
@records = (
    {
        address => "0x1234fde0",
        name    => "test.data.one",
        scale   => 32768,
        type    => "float",
    },
    {
        address => "0x1234fde4",
        name    => "test.data.two",
        scale   => 32768,
        type    => "float",
    },
    {
        address => "0x1234fde8",
        name    => "test.data.the",
        scale   => 32768,
        type    => "float",
    },
    {
        address => "0x1234fdec",
        name    => "test.data.for",
        scale   => 32768,
        type    => "float",
    },
    {
        address => "0x1234fdf0",
        name    => "test.data.fiv",
        scale   => 32768,
        type    => "float",
    },
);
This data structure could be used like
for my $record (@records) {
    say $record->{name};
}
or
for my $i (0 .. $#records) {
    say "$i: $records[$i]{name}";
}
Criticism Of Your Code
You declare all your variables at the top of your script, effectively making them global variables. Don't. Create your variables in the smallest scope possible. My code uses just three variables in the outer scope: $fh, @headers and @records.
This line my $csv=Text::CSV({sep_char = ","}) doesn't work as expected.
Text::CSV is not a function; it is the name of a module. You meant Text::CSV->new(...).
The options should be a hashref, but sep_char = "," tries to assign something to sep_char; sadly, this could be valid syntax. You actually meant to specify a key-value relationship, so use the => operator instead (called the fat comma or hash rocket).
Neither does this work: or die "Text::CSV error: " Text::CSV=error_diag.
To concatenate strings, use the . concatenation operator. What you wrote is a syntax error: A literal string is always followed by an operator.
You really like assignments? The Text::CSV=error_diag does not work. You intended to call the error_diag method on the Text::CSV class. Therefore, use the correct operator ->: Text::CSV->error_diag.
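Putting those fixes together, the constructor call was presumably meant to be something like:
my $csv = Text::CSV->new({ sep_char => ',' })
    or die "Text::CSV error: " . Text::CSV->error_diag;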
The substitution s/t+/,/g replaces all sequences of ts by commas. To replace tabs, use the \t charclass.
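For example, both substitutions could be combined into one that turns any run of spaces or tabs into a single comma:
$line =~ s/[ \t]+/,/g;   # any run of spaces and/or tabs becomes one comma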
%arrayofhashes is not an array of hashes: It is a hash (as evidenced by the % sigil), but you use integer numbers as keys. Arrays have the @ sigil.
To add something to the end of an array, I'd rather not keep the index of the last item in an extra variable. Rather, use the push function to add an item to the end. This reduces the amount of bookkeeping code.
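For example (with hypothetical names):
push @array_of_hashes, $row;   # no index counter to maintain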
If you find yourself writing a loop like my $i = 0; while (condition) { do stuff; $i++ }, then you usually want a C-style for loop:
for (my $i = 0; condition; $i++) {
    do stuff;
}
This also helps with proper scoping of variables.