Perl parsing multiple separator character data

I have a mixed-character-separated file with a header row that I am trying to read using Text::CSV, which I have used successfully on comma-separated files to pull data into an array of hashes in other scripts. I have read that Text::CSV does not support multiple separators (spaces, tabs, commas), so I was trying to clean up each row with a regex before using Text::CSV. The data file also has comment lines in the middle of the file. Unfortunately, I do not have admin rights to install libraries that might accommodate multiple sep_chars, so I was hoping I could use Text::CSV or some other standard method to clean up the header and rows before adding them to the AoH. Or should I abandon Text::CSV?
I'm obviously still learning. Thanks in advance.
Example file:
#
#
#
# name scale address type
test.data.one 32768 0x1234fde0 float
test.data.two 32768 0x1234fde4 float
test.data.the 32768 0x1234fde8 float
# comment lines in middle of data
test.data.for 32768 0x1234fdec float
test.data.fiv 32768 0x1234fdf0 float
Code excerpt:
my $fh;
my $input;
my $header;
my $pkey;
my $row;
my %arrayofhashes;
my $csv=Text::CSV({sep_char = ","})
or die "Text::CSV error: " Text::CSV=error_diag;
open($fh, '<:encoding(UTF-8)', $input)
or die "Can't open $input: $!";
while (<$fh>) {
$line = $_;
# skip to header row
next if($line !~ /^# name/);
# strip off leading chars on first column name
$header =~ s/# //g;
# replace multiple spaces and tabs with comma
$header =~ s/ +/,/g;
$header =~ s/t+/,/g;
# results in $header = "name,scale,address,type"
last;
}
my @header = split(",", $header);
$csv->parse($header);
$csv->column_names([$csv->fields]);
# above seems to work!
$pkey = 0;
while (<$fh>) {
$line = $_;
# skip comment lines
next if ($line =~ /^#/);
# replace spaces and tabs with commas
$line =~ s/( +|\t+)/,/g;
# replace multiple commas from previous regex with single comma
$line =~ s/,+/,/g;
# results in $line = "test.data.one,32768,0x1234fdec,float"
# need help trying to create what I think needs to be a hash from the header and row.
$row = ?????;
# the following line works in my other perl scripts for CSV files when using:
# while ($row = $csv->getline_hr($fh)) instead of the above.
$arrayofhashes{$pkey} = $row;
$pkey++;
}

If your columns are separated by multiple spaces, Text::CSV is useless. Your code contains a lot of repetition, trying to work around Text::CSV's limitations.
Also, your code has bad style, contains multiple syntax errors and typos, and uses confusing variable names.
So You Want To Parse A Header.
We need a definition of the header line for our code. Let's take “the first comment line that contains non-space characters”. It may not be preceded by non-comment lines.
use strict; use warnings; use autodie;
open my $fh, '<:encoding(UTF-8)', "filename.tsv"; # error handling by autodie
my @headers;
while (<$fh>) {
# no need to copy to a $line variable, the $_ is just fine.
chomp; # remove line ending
s/\A#\s*// or die "No header line found"; # remove comment char, or die
/\S/ or next; # skip if there is nothing here
@headers = split; # split the header names.
# The `split` defaults to `split /\s+/, $_`
last; # break out of the loop: the header was found
}
The \s character class matches space characters (spaces, tabs, newlines, etc.). The \S is the inverse and matches all non-space characters.
The Rest
Now we have our header names, and can proceed to normal parsing:
my @records;
while (<$fh>) {
chomp;
next if /\A#/; # skip comments
my @fields = split;
my %hash;
@hash{@headers} = @fields; # use hash slice to assign fields to headers
push @records, \%hash; # add this hashref to our records
}
Voilà.
The Result
This code produces the following data structure from your example data:
@records = (
{
address => "0x1234fde0",
name => "test.data.one",
scale => 32768,
type => "float",
},
{
address => "0x1234fde4",
name => "test.data.two",
scale => 32768,
type => "float",
},
{
address => "0x1234fde8",
name => "test.data.the",
scale => 32768,
type => "float",
},
{
address => "0x1234fdec",
name => "test.data.for",
scale => 32768,
type => "float",
},
{
address => "0x1234fdf0",
name => "test.data.fiv",
scale => 32768,
type => "float",
},
);
This data structure could be used like
for my $record (@records) {
say $record->{name};
}
or
for my $i (0 .. $#records) {
say "$i: $records[$i]{name}";
}
Criticism Of Your Code
You declare all your variables at the top of your script, effectively making them global variables. Don't. Create your variables in the smallest scope possible. My code uses just three variables in the outer scope: $fh, @headers and @records.
This line my $csv=Text::CSV({sep_char = ","}) doesn't work as expected.
Text::CSV is not a function; it is the name of a module. You meant Text::CSV->new(...).
The options should be a hashref, but sep_char = "," tries to assign something to sep_char; sadly, that could even be valid syntax. What you actually meant was a key-value pair, so use the => operator instead (called the fat comma or hash rocket).
Neither does this work: or die "Text::CSV error: " Text::CSV=error_diag.
To concatenate strings, use the . concatenation operator. What you wrote is a syntax error: a literal string cannot be followed directly by another term; there has to be an operator in between.
You really like assignments? The Text::CSV=error_diag does not work. You intended to call the error_diag method on the Text::CSV class. Therefore, use the correct operator ->: Text::CSV->error_diag.
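Putting those fixes together, the constructor line would look something like this (a sketch; it still will not solve the multiple-separator problem):
my $csv = Text::CSV->new({ sep_char => "," })
    or die "Text::CSV error: " . Text::CSV->error_diag;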
The substitution s/t+/,/g replaces all sequences of the letter t with commas. To replace tabs, use the \t escape.
%arrayofhashes is not an array of hashes: It is a hash (as evidenced by the % sigil), but you use integer numbers as keys. Arrays have the @ sigil.
To add something to the end of an array, I'd rather not keep the index of the last item in an extra variable. Rather, use the push function to add an item to the end. This reduces the amount of bookkeeping code.
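For example, with a plain CSV file (as in your other scripts, where getline_hr works) the counter disappears entirely; a minimal sketch:
my @arrayofhashes;
while (my $row = $csv->getline_hr($fh)) {
    push @arrayofhashes, $row;   # no $pkey counter to maintain
}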
If you find yourself writing a loop like my $i = 0; while (condition) { do stuff; $i++ }, then you usually want a C-style for loop:
for (my $i = 0; condition; $i++) {
do stuff;
}
This also helps with proper scoping of variables.

Related

find duplicate filenames and append them to hash of arrays

Perl question: I have a colon-separated file containing paths that I'm using. I just split on the colon, like this:
my %unique_import_hash;
while (my $line = <$log_fh>) {
my ($log_type, $log_import_filename, $log_object_filename)
= split /:/, $line;
$log_type =~ s/^\s+|\s+$//g; # trim whitespace
$log_import_filename =~ s/^\s+|\s+$//g; # trim whitespace
$log_object_filename =~ s/^\s+|\s+$//g; # trim whitespace
}
The exact file format is:
type : source-filename : import-filename
What I want is an index file that contains the last pushed $log_object_filename for each unique key $log_import_filename. So, in English/Perl pseudo-code, what I'm going to do is push the $log_object_filename onto an array indexed by the hash %unique_import_hash. Then I want to iterate over the keys, pop the array referred to by %unique_import_hash, and store the result in an array of scalars.
My specific question is: what is the syntax for appending to an array that is the value of a hash?
You can use push, but you have to dereference the array referenced by the hash value:
push @{ $hash{$key} }, $filename;
See perlref for details.
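For example, a small self-contained sketch (the key and filenames here are invented) that pushes into a hash of arrays and then pulls the last element per key, which is what you describe wanting:
my %index;
push @{ $index{'widget.import'} }, 'widget_v1.o';   # autovivifies the array ref
push @{ $index{'widget.import'} }, 'widget_v2.o';

for my $key (keys %index) {
    my $last = $index{$key}[-1];   # last pushed value for this key
    print "$key => $last\n";       # widget.import => widget_v2.o
}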
If you only care about the last value for each key, you're over-thinking the problem. No need to fool around with arrays when a simple assignment will overwrite the previous value:
while (my $line = <$log_fh>) {
# ...
$unique_import_hash{$log_import_filename} = $log_object_filename;
}
use strict;
use warnings;
my %unique_import_hash;
my $log_filename = "file.log";
open(my $log_fh, "<" . $log_filename);
while (my $line = <$log_fh>) {
$line =~ s/ *: */:/g;
(my $log_type, my $log_import_filename, my $log_object_filename) = split /:/, $line;
push(@{$unique_import_hash{$log_import_filename}}, $log_object_filename);
}
Seek the wisdom of the Perl monks.

How to read a line and save multiple parameters separated by ; into variables?

So let's say I have a file.txt whose syntax is like this:
"1;22;333;'4444';55555",
I now want my code to do the following:
1. open the file (already done)
2. read a line and save each parameter separated by ; into a variable, like $one = 1, $two = 22, $three = 333, $four = '4444', $five = 55555
3. write the variables into a DB (already done)
4. loop until all lines of the file are processed
So I actually need help with step 2; I think I can manage the loop and DB code. Do you have any ideas or tips on how I could do this? Beginner-friendly would be nice so I can learn from it.
foreach $file (@file){
$currentfile = "$currentdir\\$file";
open(my $reader, "<", $currentfile) or die "Failed to open file: $!\n";
?????
close $reader;
}
If you're just doing 'numbered fields' then you should be thinking 'array':
use Data::Dumper;
while ( <$reader> ) {
chomp;
my @row = split /;/;
print Dumper \@row;
}
This will give you an array that you can access - e.g. $row[0] for the first element.
$VAR1 = [
'1',
'22',
'333',
'\'4444\'',
'55555'
];
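If you specifically want the five named variables from your question, a plain list assignment does it (a sketch; note that $four keeps its surrounding quotes as part of the string):
while (my $line = <$reader>) {
    chomp $line;
    my ($one, $two, $three, $four, $five) = split /;/, $line;
    # $one = 1, $two = 22, $three = 333, $four = "'4444'", $five = 55555
}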
If you know what the headers are 'named' and prefer to work on names you can do something similar with a hash:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @cols = qw ( id value fish name sprout );
while ( <DATA> ) {
my %row;
chomp;
@row{@cols} = split /;/;
print Dumper \%row;
}
__DATA__
1;22;333;'4444';55555
This gives instead:
$VAR1 = {
'fish' => '333',
'name' => '\'4444\'',
'id' => '1',
'value' => '22',
'sprout' => '55555'
};
Note: hashes are unordered, but their whole point is that you don't need to care about the 'order'; just print $row{name}, "\n";
You need to read from the filehandle $reader, line by line. See the tutorial perlopentut and the full reference open.
Then you split each line on the separator ;, which returns a list that you assign to an array.
open my $reader, "<", $currentfile or die "Failed to open file: $!\n";
while (my $line = <$reader>) {
chomp($line);
my @params = split ';', $line;
# do something with @params, it will be overwritten on next iteration
}
close $reader;
The diamond operator <> reads from a filehandle, <$fh>, returning a line at a time. See about it in perlop. When there are no more lines it returns undef and looping stops. You may assign the string that it returns to a variable which you declare (my $line), which then exists only within the body of the while loop. If you don't, but do while (<$fh>) instead, the line is assigned to the special variable $_, which is default for many things in Perl.
The chomp removes the linefeed (new line) from the end of the line.
Note that '4444' from your example is not a number and cannot be used as such.
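If you do need it as a number, one option is to strip the quotes first; a minimal sketch, assuming the single quotes only ever appear around the value:
my $four = "'4444'";
$four =~ s/^'|'$//g;       # remove the surrounding single quotes
my $number = $four + 0;    # 4444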
Alternatively, you can take a reference to the array with parameters on each line, and put it in another array which thus will in the end contain all lines.
my @all_params;
while (my $line = <$reader>) {
my @params = split ';', $line;
push @all_params, \@params;
}
Now @all_params has elements which are references, each to an array with parameters for one line. For how to work with references see the tutorial perlreftut and the Cookbook on complex data structures, perldsc.
The following is more complex but let me mention it since it's a bit of an idiom. You can do the above in one statement
my @all_params = map { [ split ';', $_ ] } <$reader>;
This uses map, which applies the code in { ... } to each element of the list that is submitted to it, returning a list. So it takes a list and returns the processed list. The [...] inside makes an anonymous array, equivalent to the reference we took of an array previously. The filehandle <$reader> returns all lines of the file in one list when invoked in list context, which is in this case imposed by map (since it must receive a list).
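With the sample line from the question, the nested structure can then be accessed like this (a sketch):
print $all_params[0][0], "\n";   # 1      (first line, first field)
print $all_params[0][3], "\n";   # '4444' (first line, fourth field)
my $first = $all_params[0];      # a reference to the first line's fields
print scalar @$first, "\n";      # 5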
An important one: always start your programs with
use warnings 'all';
use strict;
The order of these doesn't really matter. Mostly you'll see use strict; first.
Then your loop over filenames needs to be foreach my $file (@file) { ... } and you must declare all variables, so my $currentfile = ....

How to parse multiple line, fixed-width file in perl?

I have a file that I need to parse in the following format. (All delimiters are spaces):
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.
I am familiar with how to parse a single line fixed-width file, but am stumped with how to handle multiple lines.
#!/usr/bin/env perl
use strict; use warnings;
my (%fields, $current_field);
while (my $line = <DATA>) {
next unless $line =~ /\S/;
if ($line =~ /^ \s+ ( \S .+ )/x) {
if (defined $current_field) {
$fields{ $current_field} .= $1;
}
}
elsif ($line =~ /^(.+?) : \s+ (.+) \s+/x ) {
$current_field = $1;
$fields{ $current_field } = $2;
}
}
use Data::Dumper;
print Dumper \%fields;
__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.
Fixed-width says unpack to me. It is possible to parse with regexes and split, but unpack should be a safer choice, as it is the Right Tool for fixed width data.
I put the width of the first field to 12 and the empty space between to 13, which works for this data. You may need to change that. The template "A12A13A*" means "take 12 then 13 ASCII characters, followed by any number of ASCII characters". unpack will return a list of these pieces. Also, unpack will use $_ if a string is not supplied, which is what we do here.
Note that if the first field is not fixed width up to the colon, as it appears to be in your sample data, you'll need to merge the fields in the template, e.g. "A25A*", and then strip the colon.
I chose array as the storage device, as I do not know if your field names are unique. A hash would overwrite fields with the same name. Another benefit of an array is that it preserves the order of the data as it appears in the file. If these things are irrelevant and quick lookup is more of a priority, use a hash instead.
Code:
use strict;
use warnings;
use Data::Dumper;
my $last_text;
my @array;
while (<DATA>) {
# unpack the fields and strip spaces
my ($field, undef, $text) = unpack "A12A13A*";
if ($field) { # If $field is empty, that means we have a multi-line value
$field =~ s/:$//; # strip the colon
$last_text = [ $field, $text ]; # store data in anonymous array
push @array, $last_text; # and store that array in @array
} else { # multi-line values get added to the previous lines data
$last_text->[1] .= " $text";
}
}
print Dumper \@array;
__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value
                         with a third line
Output:
$VAR1 = [
[
'field name 1:',
'Multiple word value.'
],
[
'field name 2:',
'Multiple word value along with multiple lines.'
],
[
'field name 3:',
'Another multiple word and multiple line value with a third line'
]
];
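As noted above, if the field names are unique and their order does not matter, a hash can be used for storage instead. A sketch of that variant, replacing the while loop and the final print in the code above:
my %fields;
my $last_field;
while (<DATA>) {
    my ($field, undef, $text) = unpack "A12A13A*";
    if ($field) {
        $field =~ s/:$//;                    # strip the colon
        $fields{$field} = $text;
        $last_field = $field;                # remember where continuations belong
    } else {
        $fields{$last_field} .= " $text";    # append multi-line values
    }
}
print Dumper \%fields;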
You could do this:
#!/usr/bin/perl
use strict;
use warnings;
my @fields;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
for (<$fh>) {
if (/^\s/) {
$fields[$#fields] .= $_;
} else {
push @fields, $_;
}
}
close $fh;
If the line starts with white space, append it to the last element in @fields, otherwise push it onto the end of the array.
Alternatively, slurp the entire file and split with look-around:
#!/usr/bin/perl
use strict;
use warnings;
$/=undef;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
my @fields = split /(?<=\n)(?!\s)/, <$fh>;
close $fh;
It's not a recommended approach though.
You can change the input record separator so that each multi-line record is read as one chunk:
$/ = "\nfield name";
while (my $line = <FILE>) {
chomp $line; # strips the trailing "\nfield name" separator
if ($line =~ /(\d+):\s+(.+)/s) { # /s lets . match across the wrapped lines
print "Record $1 is $2\n";
}
}

Parsing an XML file line by line

So here is the issue. I am trying to parse an XML file of information from GenBank. This file contains information on multiple DNA sequences. I have this done already for two other XML formats from GenBank (TINY xml and INSD xml), but pure XML gives me a headache. Here's how my program should work. Download an XML-formatted file that contains information on X number of sequences from GenBank. Run my perl script that searches through that XML file line by line and prints the information I want to a new file, in fasta format, which is this: >Sequence_name_and_information\n sequences\n >sequence_name.... and on and on until you have all the sequences from the XML file. My issue though is that in pure XML the sequence itself comes before the identifier for the gene or locus of the sequence. The gene or locus of the sequence should go on the same line as the ">". Here is the code I have, from the point of opening the file and parsing through it:
open( New_File, "+>$PWD_file/$new_file" ) or die "\n\nCouldn't create file. Check permissions on location.\n\n";
while ( my $lines = <INSD> ) {
foreach ($lines) {
if (m/<INSDSeq_locus>.*<\/INSDSeq_locus>/) {
$lines =~ s/<INSDSeq_locus>//g and $lines =~ s/<\/INSDSeq_locus>//g and $lines =~ s/[a-z, |]//g; #this last bit may cause a bug of removing the letters in the genbank accession number
$lines =~ s/ //g;
chomp($lines);
print New_File ">$lines\_";
} elsif (m/<INSDSeq_organism>.*<\/INSDSeq_organism>/) {
$lines =~ s/<INSDSeq_organism>//g and $lines =~ s/<\/INSDSeq_organism>//g;
$lines =~ s/(\.|\?|\-| )/_/g;
$lines =~ s/_{2,}/_/g;
$lines =~ s/_{1,}$//;
$lines =~ s/^>*_{1,}//;
$lines =~ s/\s{2}//g;
chomp($lines);
print New_File "$lines\n";
} elsif (m/<INSDSeq_sequence>.*<\/INSDSeq_sequence>/) {
$lines =~ s/<INSDSeq_sequence>//g and $lines =~ s/<\/INSDSeq_sequence>//g;
$lines =~ s/ //g;
chomp($lines);
print New_File "$lines\n";
}
}
}
close INSD;
close New_File;
}
There are two places to find gene/locus information. That info is found between either one of these two tags: LOCUS_NAME or GENE_NAME. There will be one or the other; if one has info, the other will be empty. In either case, whichever has content needs to be added to the end of the >....... line.
Thanks,
AlphaA
PS: I tried to print that info to a "file" by doing open "$NA", ">", printing the sequence to that, then moving on with the program, finding the gene info, printing it to the > line, and then reading the $NA file and printing its contents on the line right after the > line. I hope this is clear.
In my opinion you should use XSLT with XPath to navigate to the data you need.
As @Brian suggests, it is easier to use established XML parsing techniques and libraries.
There is even a Perl library for XSLT.
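XPath can also be used directly from Perl without going through XSLT. A minimal sketch using XML::LibXML, assuming that module is available, that the file uses the INSD tag names from your code, and that the filename and header format below are only placeholders for what you actually want:
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(location => 'sequences.xml');  # filename assumed
for my $seq ($doc->findnodes('//INSDSeq')) {
    my $locus    = $seq->findvalue('INSDSeq_locus');
    my $organism = $seq->findvalue('INSDSeq_organism');
    my $dna      = $seq->findvalue('INSDSeq_sequence');
    print ">${locus}_$organism\n$dna\n";   # fasta-style: header line, then sequence
}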
Use an XML parser. I'm not a biologist, and I'm not sure of the final format you want, but it should be simple with this as a starting point. $_[1] in the anonymous sub contains a hash reference with, from what I can tell from your post, everything you want saved, gathered when the parent tag of the tags you want has been parsed. It should be easy to print out the elements of $_[1] in the format you want:
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;
my @rules = (
_default => '',
'INSDSeq_locus,INSDSeq_organism,INSDSeq_sequence' => 'content',
INSDSeq => sub { delete $_[1]{_content}; print Dumper $_[1]; return },
);
my $p = XML::Rules->new(rules => \@rules);
$p->parsefile('sequence.gbc.xml');
And that is just so that printing just the tags you want is easy. Or, if you want some other tags, what I really might do is this (you don't really need the @tags variable at all if you're just printing element by element):
my @tags = qw(
INSDSeq_locus
INSDSeq_organism
INSDSeq_sequence
);
my @rules = (
_default => 'content',
# Elements are, e.g. $_[1]{INSDSeq_locus}
INSDSeq => sub { print "$_: $_[1]{$_}\n" for @tags; return; },
);
with:
my $p = XML::Rules->new(rules => \@rules, stripspaces => 4);

I've unpacked it...how can I count the columns?

I am using unpack to parse some text files with some columns. Each text file is different and has a different number of columns. How can I count the columns so I don't get errors? Right now I am using 0..5, but if the text file has only 3 columns then I get an error: "Use of uninitialized value in substitution...". Thanks!
open (PARSE,"<$temp") or die $!;
my @template = map {'A'.length} <PARSE> =~ /(\S+\s*)/g;
next unless @template;
$template[-1] = 'A*'; # set the last segment to be slurpy
my $template = "@template";
my @data;
while (<PARSE>) {
push @data, [unpack $template, $_]
}
for my $dat (@data){ # for each row
for(0..5){ # for each column in that row
$dat->[$_]=~s/^\s+//g;
$dat->[$_]=~s/\s+$//g;
print $dat->[$_].',';
}
print "\n";
}
With languages like Perl, Python, Ruby, etc., you rarely need to stoop to the level of subscripts when iterating over an array:
for my $cell (@$dat){
# Work with $cell rather than $dat->[$_].
...
}
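Applied to your loop, that could look something like this (a sketch reusing the trimming and printing from your code; modifying $cell changes the row in place because foreach aliases the elements):
for my $dat (@data) {                # for each row
    for my $cell (@$dat) {           # for each column, however many there are
        $cell =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
        print $cell, ',';
    }
    print "\n";
}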
Probably easier and cleaner to use Tie::File so that you don't have to read everything into memory, but here's one way that uses the @data list you set up:
my $dataFirstLine = $data[0];
chomp($dataFirstLine);
my #dataColumns = split("\t", $dataFirstLine); # assumes delimiter is tab, replace with escaped delimiter of choice
my $dataColumnCount = scalar #dataColumns;
print "number of columns: $dataColumnCount\n";