Parse fixed-width files - perl

I have a lot of text files with fixed-width fields:
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
The rest of the files are in a similar format, where the <c> will mark the beginning of a column, but they have various (unknown) column & space widths. What's the best way to parse these files?
I tried using Text::CSV, but since there's no delimiter it's hard to get a consistent result (unless I'm using the module wrong):
my $csv = Text::CSV->new();
$csv->sep_char(' ');
while (<FILE>) {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        print $columns[1] . "\n";
    }
}

As user604939 mentions, unpack is the tool to use for fixed width fields. However, unpack needs to be passed a template to work with. Since you say your fields can change width, the solution is to build this template from the first line of your file:
my @template = map {'A'.length}        # convert each to 'A##'
               <DATA> =~ /(\S+\s*)/g;  # split first line into segments
$template[-1] = 'A*';                  # set the last segment to be slurpy
my $template = "@template";
print "template: $template\n";
my @data;
while (<DATA>) {
    push @data, [unpack $template, $_]
}
use Data::Dumper;
print Dumper \@data;
__DATA__
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
which prints:
template: A8 A10 A*
$VAR1 = [
[
'Dave',
'Thomas',
'123 Main'
],
[
'Dan',
'Anderson',
'456 Center'
],
[
'Wilma',
'Rainbow',
'789 Street'
]
];

CPAN to the rescue!
DataExtract::FixedWidth not only parses fixed-width files, but (based on POD) appears to be smart enough to figure out column widths from header line by itself!

Just use Perl's unpack function. Something like this:
while (<FILE>) {
    my ($first, $last, $street) = unpack("A9A25A50", $_);
    # do something with the fields ...
}
In the unpack template, the number after each A is the width of that field.
There are a variety of other formats that you can mix and match with, such as integer fields.
If the file is fixed width, like mainframe files, then this should be the easiest.

Related

Split a string based on ASCII value

I need to parse a delimited file (generated by a mainframe job and FTPed over to Windows), but I have a few questions about using split on the delimiter.
As per the documentation, the file is separated by '1D'. But when I open the file in Notepad++ (the encoding tab shows 'Encode in ANSI'), the delimiter looks like a 'vertical broken bar'. Q. Not sure what '1D' is?
open my $handle, '<', 'sample.txt';
chomp(my @lines = <$handle>);
close $handle;
my @a = unpack("C*", $lines[0]);
print Dumper \@a;
# $VAR1 = [65,166,66,166,67,166];
From dumper output, we see perl considers the ASCII for vertical broken bar to be 166.
As per link1, 166 is indeed vertical broken bar, whereas as per link2, 166 is feminine ordinal indicator. Q. Any suggestion as to why the difference?
my $str = $lines[0];
print Dumper $str;
# $VAR1 = 'AªBªCª';
We can see that the output contains the 'feminine ordinal indicator', not the 'vertical broken bar'. Q. Not sure why perl reads a 'bar' but then starts treating it as something else.
# I copied the vertical broken bar from notepad++ for use below
my @b = split(/¦/, $lines[0]);
print Dumper \@b;
# $VAR1 = [ 'AªBªCª' ];
Since perl has started treating the bar as something else, as expected, there is no split here. I thought to split by giving the ASCII code of 166 directly. It seems split() doesn't support an ASCII code as an argument. Q. Any workaround to pass an ASCII code to split()?
# I copied the vertical broken bar from notepad++ and created A¦B¦C
my @c = split(/¦/, 'A¦B¦C');
print Dumper \@c;
# $VAR1 = [ 'A','B','C']; # works as expected, added here just for completion
Any pointers will be a great help!
Update:
my @a = map {ord $_} split //, $lines[0]; print Dumper \@a;
# $VAR1 = [ 65,166,66,166,67,166];
When you receive an input file from an unknown source, the most important thing you need to know about it is "what character encoding does it use?" Without that information, any processing that you do on the file is based on guesswork.
The problem isn't helped by people who talk about "extended ASCII" as though it's a meaningful term. ASCII only contains 128 characters. There are many definitions of what the next 128 character codes represent, and many of them are contradictory.
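To make the ambiguity concrete, here is a small sketch using the core Encode module that decodes the same byte, 0xA6 (decimal 166), under two legacy code pages. The choice of cp437 and cp1252 is an assumption made for illustration, and code_point_for is a hypothetical helper name; the file's real encoding has to come from whoever produced it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a single byte under a named encoding and return the
# Unicode code point it maps to.
sub code_point_for {
    my ($encoding, $byte) = @_;
    return ord( decode($encoding, $byte) );
}

# The same byte value means different characters in different code pages.
for my $encoding (qw(cp437 cp1252)) {
    printf "%-7s byte 0xA6 -> U+%04X\n",
        $encoding, code_point_for($encoding, "\xA6");
}
```

Under cp437 the byte maps to U+00AA (feminine ordinal indicator), while under cp1252 it maps to U+00A6 (broken bar), which is consistent with the two conflicting tables mentioned in the question.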
It seems that you have a solution to your problem. Splitting on '¦' (copied from Notepad++) does what you want. So I suggest you do that. If you want to use the actual character code, then you can convert 166 to hexadecimal (0xA6) and use that:
split /\xA6/, ... ;
You should always decode your inputs and encode your outputs.
my $acp;
BEGIN {
require Win32;
$acp = "cp".Win32::GetACP();
}
use open ':std', ":encoding($acp)";
Now, @lines will contain strings of Unicode Code Points. As such, you can now use the following:
use utf8; # Source code is encoded using UTF-8.
my @b = split(/¦/, $lines[0]);
Alternatively, every one of the following will also work now:
my @b = split(/\N{BROKEN BAR}/, $lines[0]);
my @b = split(/\N{U+00A6}/, $lines[0]);
my @b = split(/\x{A6}/, $lines[0]);
my @b = split(/\xA6/, $lines[0]);

Replace single space with multiple spaces in perl

I have a requirement of replacing a single space with multiple spaces so that the second field always starts at a particular position (here 36 is the position of second field always).
I have a perl script written for this:
while(<INP>)
{
my $md=35-index($_," ");
my $str;
$str.=" " for(1..$md);
$_=~s/ +/$str/;
print "$_" ;
}
Is there any better approach with just using the regex in =~s/// so that I can use it on CLI directly instead of script.
Assuming that the fields in your data are demarcated by spaces:
while (<$fh>) {
    my ($first, @rest) = split;
    printf "%-35s @rest\n", $first;
}
The first field is now padded to 35 characters and aligned left due to the - in the printf format; together with the single space that follows it, this makes the second field start at position 36. See sprintf for the many details. The rest is printed with single spaces between the original space-separated fields, but can instead be done as desired (tab separated, fixed width...).
Or you can leave the "rest" after the first field untouched by splitting the line into two parts
while (<$fh>) {
my ($first, $rest) = /(\S+)\s+(.*)/;
printf "%-35s $rest\n", $first;
}
(or use split ' ', $_, 2 instead of regex)
Please give more detail if there are other requirements.
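Since the question asks for something usable directly on the command line, the same padding can be sketched as a single substitution with the /e modifier. The helper name pad_first_field and its width parameter are illustrative and not from the posts above; the sketch assumes the first field contains no spaces.

```perl
use strict;
use warnings;

# Pad the first space-delimited field to $width characters so that the
# second field starts at column $width + 1. Everything after the first
# run of spaces is left untouched.
sub pad_first_field {
    my ($line, $width) = @_;
    $line =~ s/^(\S+) +/sprintf "%-${width}s", $1/e;
    return $line;
}

print pad_first_field("ABCD TEST EFGH don't touch\n", 35);
```

As a one-liner this would be: perl -pe 's/^(\S+) +/sprintf "%-35s", $1/e' file. Note that a first field longer than the width is not truncated, so it would run straight into the second field.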
One approach is to use plain ol' Perl formats:
#!/usr/bin/perl
use warnings;
use strict;
my($first, $second, $remainder);
format STDOUT =
#<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< #<<<<<< #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$first, $second,$remainder
.
while (<DATA>) {
($first, $second, $remainder) = split(/\s+/, $_, 3);
write;
}
exit 0;
__DATA__
ABCD TEST EFGH don't touch
FOO BAR FUD don't touch
Test output. I probably miscounted the columns, but you should get the idea:
$ perl dummy.pl
ABCD TEST EFGH don't touch
FOO BAR FUD don't touch
Another option would be Text::Table.

How can I extract specific columns in perl?

chr1 1 10 el1
chr1 13 20 el2
chr1 50 55 el3
I have this tab delimited file and I want to extract the second and third column using perl. How can I do that?
I tried reading the file using file handler and storing it in a string, then converting the string to an array but it didn't get me anywhere.
My attempt is:
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
Simply autosplit on tab
# ↓ index starts on 0
$ perl -F'\t' -lane'print join ",", @F[1,2]' inputfile
Output:
1,10
13,20
50,55
See perlrun.
use strict;
my $input=shift or die "must provide <input_file> as an argument\n";
open(my $in,"<",$input) or die "Cannot open $input for reading: $!";
while(<$in>)
{
    my @tf1=split(/\t/,$_);
    print "$tf1[1]|$tf1[2]\n"; # $tf1[1] is the second column and $tf1[2] is the third column
}
close($in);
What problem are you having? Your code already does all the hard parts.
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
You have all three columns in your @tf1 array (by the way - your variable naming needs serious work!) All you need to do now is to print the second and third elements from the array (but remember that Perl array elements are numbered from zero).
print "$tf1[1] / $tf1[2]\n";
It's possible to simplify your code quite a lot by taking advantage of Perl's default behaviours.
while (<FILE_HANDLE>) {          # Store record in $_
    my @tf1 = split(/\t/);       # Declare and initialise on one line
                                 # split() works on $_ by default
    print "$tf1[1] / $tf1[2]\n";
}
Even more pithily than @daxim as a one-liner:
perl -aE 'say "@F[1,2]" ' file
See also: How to sort an array or table by column in perl?

Perl: Replace consecutive spaces in this given scenario?

an excerpt of a big binary file ($data) looks like this:
\n1ax943021C xxx\t2447\t5
\n1ax951605B yyy\t10400\t6
\n1ax919275 G2L zzz\t6845\t6
The first 25 characters contain an article number, filled with spaces. How can I convert all spaces between the article numbers and the next column into a \x09 ? Note the one or more spaces between different parts of the article number.
I tried a workaround, but that overwrites the article number with ".{25}xxx»"
$data =~ s/\n.{25}/\n.{25}xxx/g
Anyone able to help?
Thanks so much!
Gary
You can use unpack for fixed width data:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper $_ for map join("\t", unpack("A25A*")), <DATA>;
__DATA__
1ax943021C xxx 2447 5
1ax951605B yyy 10400 6
1ax919275 G2L zzz 6845 6
Output:
$VAR1 = "1ax943021C\txxx\t2447\t5";
$VAR1 = "1ax951605B\tyyy\t10400\t6";
$VAR1 = "1ax919275 G2L\tzzz\t6845\t6";
Note that Data::Dumper's Useqq option prints whitespace characters in their escaped form.
Basically what I do here is take each line, unpack it, using 2 strings of space padded text (which removes all excess space), join those strings back together with tab and print them. Note also that this preserves the space inside the last string.
I interpret the question as there being a 25 character wide field that should have its trailing spaces stripped and then delimited by a tab character before the next field. Spaces within the article number should otherwise be preserved (like "1ax919275 G2L").
The following construct should do the trick:
$data =~ s/^(.{25})/{$t=$1;$t=~s! *$!\t!;$t}/emg;
That matches 25 characters from the beginning of each line in the data, then evaluates an expression for each article number by stripping its trailing spaces and appending a tab character.
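The same idea can be sketched in a form that also runs under use strict, with the replacement written as a small block. The sub name tab_after_article is hypothetical, and the width of 25 is taken from the question; lines shorter than 25 characters are left untouched because . does not match a newline.

```perl
use strict;
use warnings;

# For every line in $data, strip the trailing spaces from the first
# 25 characters (the article number field) and append a tab.
# Spaces inside the article number are preserved.
sub tab_after_article {
    my ($data) = @_;
    $data =~ s{^(.{25})}{
        my $field = $1;
        $field =~ s/ +\z//;   # drop the padding at the end of the field
        "$field\t";           # value of the block becomes the replacement
    }emg;
    return $data;
}

my $sample = sprintf("%-25s", "1ax919275 G2L") . "zzz\t6845\t6\n";
print tab_after_article($sample);
```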
Have a try with:
$data =~ s/ +/\t/g;
Not sure exactly what you want - this will match the two columns and print them out - with all the original spaces. Let me know the desired output and I will fix it for you...
#!/usr/bin/perl -w
use strict;
my @file = ('\n1ax943021C xxx\t2447\t5', '\n1ax951605B yyy\t10400\t6',
            '\n1ax919275 G2L zzz\t6845\t6');
foreach (@file) {
    my ($match1, $match2) = ($_ =~ /(\\n.{25})(.*)/);
    print "$match1'[insertsomethinghere]'$match2\n";
}
Output:
\n1ax943021C '[insertsomethinghere]'xxx\t2447\t5
\n1ax951605B '[insertsomethinghere]'yyy\t10400\t6
\n1ax919275 G2L '[insertsomethinghere]'zzz\t6845\t6

How can I correctly process this file containing tab separated values in Perl?

I am fairly new to Perl and know next to nothing about Perl's 'proper' syntax.
I have a text file that I use everyday with a listing of names, and other info for our users. This file changes daily and sometimes has two rows in it(tab delimited), and other times has 100+ rows in it.
The file also varies between 6-9 columns of data in a row. I have put together a Perl script that uses the split function on tabs, but the issue I am running into is that if I take row a, which has 5 columns in it and then add a second row b that has 6 columns in it that are all populated with data.
I cannot figure out how to get Perl to see that row a only has 5 columns of data and to continue parsing the text file from that point forward. It continues, but the output wraps lines strangely. How can I get around this issue? I hope that made sense.
You will have to post some code and possibly some sample data, but here's a code that is parsing rows of different lengths without issue.
Script:
#!/usr/bin/perl
use strict;
while (<STDIN>)
{
    chomp;
    my @info = split("\t");
    print join(";", @info), "\n";
}
exit;
Test File:
jsmith 101 777-222-5555 Office 1 Building 1 Manager
aposse 104 777-222-5556 Office 2 Building 2 Stock Clerk
jbraza 105 777-222-5557 Office 3
mcuzui 102 777-222-5557 Office 3 Building 3 Cashier
ghines 107 777-222-5557 Office 3
Output:
%> test.pl < file.txt
jsmith;101;777-222-5555;Office 1;Building 1;Manager
aposse;104;777-222-5556;Office 2;Building 2;Stock Clerk
jbraza;105;777-222-5557;Office 3
mcuzui;102;777-222-5557;Office 3;Building 3;Cashier
ghines;107;777-222-5557;Office 3
You should post some sample data and code and explain desired behavior in terms of what the code currently does and what you want it to do. split will give you as many fields as there are in the input.
#!/usr/bin/perl
use strict; use warnings;
while ( my $row = <DATA> ) {
    last unless $row =~ /\S/;
    chomp $row;
    my @cells = split /\t/, $row;
    print "< @cells >\n";
}
__DATA__
1 2 3 4 5
a b c d e f
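One caveat that may explain rows appearing to have different column counts: by default split discards trailing empty fields, so a row whose last columns are blank comes back shorter than its neighbours. Passing a negative limit keeps them. The sketch below, with the hypothetical helper split_row and an assumed target column count, also pads short rows so every row has the same number of cells.

```perl
use strict;
use warnings;

# Split a tab-separated row, keeping trailing empty fields, and pad it
# with empty strings up to $ncols cells. Longer rows are returned as-is.
sub split_row {
    my ($row, $ncols) = @_;
    chomp $row;
    my @cells = split /\t/, $row, -1;        # -1: keep trailing empty fields
    push @cells, '' while @cells < $ncols;   # pad short rows
    return @cells;
}

for my $row ("1\t2\t3\t4\t5\n", "a\tb\tc\td\te\tf\n") {
    print join(';', split_row($row, 6)), "\n";
}
```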
Text::CSV module can be used for parsing tab-separated-values as well. In reality, Text::CSV could parse values delimited by any character.
Relevant excerpt from its POD:
The module accepts either strings or
files as input and can utilize any
user-specified characters as
delimiters, separators, and escapes so
it is perhaps better called ASV
(anything separated values) rather
than just CSV.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { 'sep_char' => "\t" } );
open my $fh, '<', 'data.tsv' or die "Unable to open: $!";
my @rows;
while ( my $row_ref = $csv->getline($fh) ) {
    push @rows, $row_ref;
}
$csv->sep_char('|');
for my $row_ref (@rows) {
    $csv->combine(@$row_ref);
    print $csv->string(), "\n";
}