Using perl to split file in flat text


I have a flat file that is created with offsets, e.g. row 1: char 1-3 = ID, 4-19 = user name, 20-40 = last name, etc.
What's the best way to go about creating a Perl script to read this? And is there any way to make it flexible based on different offset groups? Thank you!

If the positions/lengths are in terms of Unicode Code Points:
# Use :encoding(UTF-8) on the file handle.
my @fields = unpack('A3 A16 A21', $decoded_line);
If the positions/lengths are in terms of bytes:
use Encode qw( decode );
sub trim_end(_) { $_[0] =~ s/\s+\z//r }
# Use :raw on the file handle.
my @fields =
    map trim_end(decode("UTF-8", $_)),
    unpack('a3 a16 a21', $encoded_line);
In both cases, trailing whitespace is trimmed.
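For the "different offset groups" part of the question, one option is to describe each layout as a list of name/width pairs and build the unpack template from it. This is only a minimal sketch, assuming the widths are counted in characters; the %layouts table, the "users" layout name and the parse_fixed helper are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical layout table: each layout is a list of [ field name, width ] pairs.
my %layouts = (
    users => [ [ id => 3 ], [ user_name => 16 ], [ last_name => 21 ] ],
);

# Build an unpack template such as 'A3 A16 A21' from a layout and
# return the fields of one line as a hash reference.
sub parse_fixed {
    my ($layout, $line) = @_;
    my $template = join ' ', map { "A$_->[1]" } @$layout;
    my @names    = map { $_->[0] } @$layout;
    my %record;
    @record{@names} = unpack($template, $line);
    return \%record;
}

# Sample line padded to the widths above (made-up data).
my $line = sprintf '%-3s%-16s%-21s', '042', 'jsmith', 'Smith';
my $rec  = parse_fixed($layouts{users}, $line);
print "$rec->{id}|$rec->{user_name}|$rec->{last_name}\n";
```

Supporting another file format then only means adding another entry to the layout table rather than writing a new template by hand.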

Related

Split a string based on ASCII value

I need to parse a delimited file (generated by a mainframe job and FTPed over to Windows), but I have a few queries about using split on the delimiter.
As per the documentation, the file is separated by '1D'. But when I open the file in Notepad++ (the encoding tab shows 'Encode in ANSI'), the delimiter looks to me like a 'vertical broken bar'. Q. Not sure what '1D' is?
open my $handle, '<', 'sample.txt';
chomp(my @lines = <$handle>);
close $handle;
my @a = unpack("C*", $lines[0]);
print Dumper \@a;
# $VAR1 = [65,166,66,166,67,166];
From the Dumper output, we see that Perl considers the code for the vertical broken bar to be 166.
As per link1, 166 is indeed the vertical broken bar, whereas as per link2, 166 is the feminine ordinal indicator. Q. Any suggestion as to why there is a difference?
my $str = $lines[0];
print Dumper $str;
# $VAR1 = 'AªBªCª';
We can see that the output contains the 'feminine ordinal indicator', not the 'vertical broken bar'. Q. Not sure why Perl reads a 'bar' but then starts treating it as something else.
# I copied the vertical broken bar from notepad++ for use below
my @b = split(/¦/, $lines[0]);
print Dumper \@b;
# $VAR1 = [ 'AªBªCª' ];
Since Perl has started treating the bar as something else, there is, as expected, no split here. I thought I could split by giving the character code 166 directly, but it seems split() doesn't accept a character code as an argument. Q. Is there any workaround to pass a character code to split()?
# I copied the vertical broken bar from notepad++ and created A¦B¦C
my @c = split(/¦/, 'A¦B¦C');
print Dumper \@c;
#$VAR1 = [ 'A','B','C']; # works as expected, added here just for completion
Any pointers will be a great help!
Update:
my @a = map {ord $_} split //, $lines[0]; print Dumper \@a;
# $VAR1 = [ 65,166,66,166,67,166];

When you receive an input file from an unknown source, the most important thing you need to know about it is "what character encoding does it use?" Without that information, any processing that you do on the file is based on guesswork.
The problem isn't helped by people who talk about "extended ASCII" as though it's a meaningful term. ASCII only contains 128 characters. There are many definitions of what the next 128 character codes represent, and many of them are contradictory.
It seems that you have a solution to your problem. Splitting on '¦' (copied from Notepad++) does what you want. So I suggest you do that. If you want to use the actual character code, then you can convert 166 to hexadecimal (0xA6) and use that:
split /\xA6/, ... ;

You should always decode your inputs and encode your outputs.
my $acp;
BEGIN {
require Win32;
$acp = "cp".Win32::GetACP();
}
use open ':std', ":encoding($acp)";
Now, @lines will contain strings of Unicode Code Points. As such, you can now use the following:
use utf8; # Source code is encoded using UTF-8.
my @b = split(/¦/, $lines[0]);
Alternatively, every one of the following will also work now:
my @b = split(/\N{BROKEN BAR}/, $lines[0]);
my @b = split(/\N{U+00A6}/, $lines[0]);
my @b = split(/\x{A6}/, $lines[0]);
my @b = split(/\xA6/, $lines[0]);
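The decode-then-split idea can be tried without a file. The following is a minimal sketch that rebuilds the bytes dumped in the question and assumes the data is Windows-1252 (cp1252), where byte 0xA6 is BROKEN BAR; the actual encoding of the mainframe file is not known from the question. It also decodes the same byte as cp437, a common console code page in which 0xA6 is the feminine ordinal indicator, which would be consistent with the ª seen in the question's output.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Bytes as dumped in the question: A, 0xA6, B, 0xA6, C, 0xA6.
my $bytes = pack 'C*', 65, 166, 66, 166, 67, 166;

# Assumption: the file is Windows-1252, where byte 0xA6 is BROKEN BAR (U+00A6).
my $line = decode('cp1252', $bytes);

# split drops the trailing empty field left by the final delimiter.
my @fields = split /\N{U+00A6}/, $line;
print join(',', @fields), "\n";

# The same byte read as cp437 maps to a different character (U+00AA).
my $console_char = decode('cp437', "\xA6");
printf "cp437 0xA6 is U+%04X\n", ord $console_char;
```

The same byte value means different characters in different code pages, which is why the tables linked in the question disagree about 166.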

Save a row to csv format

I have a set of rows from a DB that I would like to save to a csv file.
Taking into account that the data are ascii chars without any weird chars would the following suffice?
my $csv_row = join( ', ', @$row );
# save csv_row to file
My concern is whether that would create rows that any tool would accept as CSV, e.g. without having to worry about quoting.
Update:
Is there any difference with this?
my $csv = Text::CSV->new ( { binary => 1, eol => "\n"} );
my $header = join (',', qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 ) );
$csv->print( $fh, [$header] );
foreach my $row ( @data ) {
    $csv->print($fh, $row );
}
This gives me as a first line:
" COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4"
Please notice the double quotes and the rest of the rows are without any quotes.
How does this differ from my plain join? Also, do I need to set binary?

The safest way should be to write clean records with a comma separator. The simpler the better, especially with a format that has so much variation in real life. If needed, double quote each field.
The true strength in using the module is for reading of "real-life" data. But it makes perfect sense to use it for writing as well, for a uniform approach to CSV. Also, options can then be set in a clear way, and the module can iron out some glitches in data.
The Text::CSV documentation tells us about the binary option:
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that the fields can not contain newlines. If your data contains newlines embedded in fields, or characters above 0x7E (tilde), or binary data, you must set binary => 1 in the call to new. To cover the widest range of parsing options, you will always want to set binary.
I'd say use it. Since you write a file this may be it for options, along with eol (or use say method). But do scan the many useful options and review their defaults.
As for your header, the print method expects an array reference where each field is an element, not a single string with comma-separated fields. So it is wrong to say
my $header = join (',', qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)); # WRONG
$csv->print( $fh, [$header] );
since $header is a single string, which then becomes the sole element of the (anonymous) array reference created by [ ... ]. So it prints this string as the first field in the row, and since it detects the separator , inside it, it also double-quotes the field. Instead, you should have
$csv->print($fh, [qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)]);
or better assign column names to @header and then do $csv->print($fh, \@header).
This is also an example of why it is good to use the module for writing – if a comma slips into an element of the array, supposed to be a single field, it is handled correctly by double-quoting.
A complete example
use warnings;
use strict;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();
my $file = 'output.csv';
open my $fh_out , '>', $file or die "Can't open $file for writing: $!";
my @headers = qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 );
my @data = 1..4;
$csv->print($fh_out, \@headers);
$csv->print($fh_out, \@data);
close $fh_out;
which produces the file output.csv
COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4
1,2,3,4

How can I extract specific columns in perl?

chr1 1 10 el1
chr1 13 20 el2
chr1 50 55 el3
I have this tab delimited file and I want to extract the second and third column using perl. How can I do that?
I tried reading the file using file handler and storing it in a string, then converting the string to an array but it didn't get me anywhere.
My attempt is:
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}

Simply autosplit on tab
# ↓ index starts on 0
$ perl -F'\t' -lane'print join ",", @F[1,2]' inputfile
Output:
1,10
13,20
50,55
See perlrun.

use strict;
my $input=shift or die "must provide <input_file> as an argument\n";
open(my $in,"<",$input) or die "Cannot open $input for reading: $!";
while(<$in>)
{
    my @tf1=split(/\t/,$_);
    print "$tf1[1]|$tf1[2]\n"; # $tf1[1] is the second column and $tf1[2] is the third column
}
close($in);

What problem are you having? Your code already does all the hard parts.
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
You have all three columns in your @tf1 array (by the way - your variable naming needs serious work!) All you need to do now is to print the second and third elements from the array (but remember that Perl array elements are numbered from zero).
print "$tf1[1] / $tf1[2]\n";
It's possible to simplify your code quite a lot by taking advantage of Perl's default behaviours.
while (<FILE_HANDLE>) {            # Store record in $_
    my @tf1 = split(/\t/);         # Declare and initialise on one line
                                   # split() works on $_ by default
    print "$tf1[1] / $tf1[2]\n";
}

Even more pithily than @daxim as a one-liner:
perl -aE 'say "@F[1,2]" ' file
See also: How to sort an array or table by column in perl?

convert ASCII code to number

I have one file which contains 12 columns (a BAM file); the eleventh column contains ASCII characters. A single file contains more than one such string. I would like to convert the characters to numbers.
I think the code should be something like this:
perl -e '$a=ord("ALL_ASCII_CODES_FROM-FILE"); print "$a\t"'
I would also like to write a for loop that reads all the ASCII characters in the eleventh column, converts them to numbers and adds the results up into one number.

You need to split the string into individual characters, loop over every character, and call ord in the loop.
my @codes = map ord, split //, $str;
say join '.', map { sprintf("%02X", $_) } @codes;
Conveniently, unpack 'C*' does all of that.
my @codes = unpack 'C*', $str;
say join '.', map { sprintf("%02X", $_) } @codes;
If you do intend to print it out in hex, you can use the v modifier in a printf.
say sprintf("%v02X", $str);

The natural tool to convert a string of characters into a list of the corresponding ASCII codes would be unpack:
my @codes = unpack "C*", $string;
In particular, assuming that you're parsing the QUAL column of a SAM file (or, more generally, any FASTQ-style quality string), I believe the correct conversion would be:
my @qual = map {$_ - 33} unpack "C*", $string;
Ps. From your mention of "columns", I'm assuming you're actually parsing a SAM file, not a BAM file. If I'm reading the spec correctly, the BAM format doesn't seem to use the +33 offset for quality values, so if you are parsing BAM files, you'd simply use the first example above for that.
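Since the question also asks to add the results up into one number, the conversion above can be followed by a sum. This is a minimal sketch, assuming a Phred+33 quality string as in SAM; the sample string is made up, and in a real SAM record it would be the eleventh tab-separated column.

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Hypothetical QUAL string (made-up sample data).
my $qual = 'II5+!';

# Phred+33: subtract 33 from each character code to get the quality score.
my @scores = map { $_ - 33 } unpack 'C*', $qual;

# sum returns undef for an empty list, so fall back to 0.
my $total = sum(@scores) // 0;

print "@scores\n";
print "total: $total\n";
```

For the sample string the scores are 40, 40, 20, 10 and 0, so the total is 110.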

How can I correctly process this file containing tab separated values in Perl?

I am fairly new to Perl and know next to nothing about Perl's 'proper' syntax.
I have a text file that I use everyday with a listing of names, and other info for our users. This file changes daily and sometimes has two rows in it(tab delimited), and other times has 100+ rows in it.
The file also varies between 6-9 columns of data in a row. I have put together a Perl script that uses the split function on tabs, but the issue I am running into is that if I take row a, which has 5 columns in it and then add a second row b that has 6 columns in it that are all populated with data.
I cannot figure out how to get Perl to see that row a only has 5 columns of data and to continue parsing the text file from that point forward. It continues, but the output wraps lines strangely. How can I get around this issue? I hope that made sense.

You will have to post some code and possibly some sample data, but here's a code that is parsing rows of different lengths without issue.
Script:
#!/usr/bin/perl
use strict;
while (<STDIN>)
{
    chomp;
    my @info = split("\t");
    print join(";", @info), "\n";
}
exit;
Test File:
jsmith 101 777-222-5555 Office 1 Building 1 Manager
aposse 104 777-222-5556 Office 2 Building 2 Stock Clerk
jbraza 105 777-222-5557 Office 3
mcuzui 102 777-222-5557 Office 3 Building 3 Cashier
ghines 107 777-222-5557 Office 3
Output:
%> test.pl < file.txt
jsmith;101;777-222-5555;Office 1;Building 1;Manager
aposse;104;777-222-5556;Office 2;Building 2;Stock Clerk
jbraza;105;777-222-5557;Office 3
mcuzui;102;777-222-5557;Office 3;Building 3;Cashier
ghines;107;777-222-5557;Office 3

You should post some sample data and code and explain desired behavior in terms of what the code currently does and what you want it to do. split will give you as many fields as there are in the input.
#!/usr/bin/perl
use strict; use warnings;
while ( my $row = <DATA> ) {
    last unless $row =~ /\S/;
    chomp $row;
    my @cells = split /\t/, $row;
    print "< @cells >\n";
}
__DATA__
1 2 3 4 5
a b c d e f
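One detail that may matter when rows have a varying number of populated columns: by default, split discards trailing empty fields, so a row that ends in empty columns yields fewer cells than a fully populated one. The sketch below passes a limit of -1 to keep them and pads short rows to a fixed width. The sample rows and the $width value are assumptions for illustration, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical rows: the first ends in two empty columns, the second is short.
my @rows  = ( "a\tb\tc\t\t", "1\t2\t3" );
my $width = 5;    # assumed number of columns to report

my @out;
for my $row (@rows) {
    # A limit of -1 makes split keep trailing empty fields.
    my @cells = split /\t/, $row, -1;

    # Pad short rows so every row has the same number of cells.
    push @cells, ('') x ($width - @cells) if @cells < $width;

    push @out, join(';', @cells);
}
print "$_\n" for @out;
```

With every row padded to the same width, downstream code can index columns by position without checking each row's length first.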

Text::CSV module can be used for parsing tab-separated-values as well. In reality, Text::CSV could parse values delimited by any character.
Relevant excerpt from its POD:
The module accepts either strings or
files as input and can utilize any
user-specified characters as
delimiters, separators, and escapes so
it is perhaps better called ASV
(anything separated values) rather
than just CSV.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { 'sep_char' => "\t" } );
open my $fh, '<', 'data.tsv' or die "Unable to open: $!";
my @rows;
while ( my $row_ref = $csv->getline($fh) ) {
    push @rows, $row_ref;
}
$csv->sep_char('|');
for my $row_ref (@rows) {
    $csv->combine(@$row_ref);
    print $csv->string(), "\n";
}
