perl - fetch column names from file - perl

I have the following command in my perl script:
my #files = `find $basedir/ -type f -iname '$sampleid*.summary.csv'`; #there are multiple summary.csv files in my basedir. I store them in an array
my $summary = `tail -n 1 $files[0]`; #Each summary.csv contains a header line and a line with data. I fetch here the last line.
chomp($summary);
my #sp = split(/,/,$summary); # I split based on ','
my $gender = $sp[11]; # the values from column 11 are stored in $gender
my $qc = $sp[2]; # the values from column 2 are stored in $gender
Now, I'm experiencing the situation where my *summary.csv files don't have the same number of columns. They do all have 2 lines, where the first line represents the header.
What I want now is not storing the values from column 11 in gender, but I want to store the values from the column 'Gender' in $gender.
How can I achieve this?
First try at solution:
my %hash = ();
my $header = `head -n 1 $files[0]`; #reading the header
chomp ($header);
my #colnames = split (/,/,$header);
my $keyfield = $colnames[#here should be the column with the name 'Gender']
push #{ $hash{$keyfield} };
my $gender = $sp[$keyfield]

You will have to read the header line as well as the data to know what column holds which information. This is done easiest by writing actual Perl code instead of shelling out to various command line utilities. See further below for that solution.
Fixing your solution also requires a hash. You need to read the header line first, store the header fields in an array (as you've already done), and then read the data line. The data needs to be a hash, not an array. A hash is a map of keys and values.
# read the header and create a list of header fields
my $header = `head -n 1 $files[0]`;
chomp ($header);
my #colnames = split (/,/,$header);
# read the data line
my $summary = `tail -n 1 $files[0]`;
chomp($summary);
my %sp; # use a hash for the data, not an array
# use a hash slice to fill in the columns
#sp{#colnames} = split(/,/,$summary);
my $gender = $sp{Gender};
The tricky part here is this line.
#sp{#colnames} = split(/,/,$summary);
We have declared %sp as a hash, but we now access it with a # sigil. That's because we are taking a hash slice, as indicated by the curly braces {}. The slice we take is all elements with the names of the values in #colnames. There is more than one value, so the return value is not a scalar (with a $) any more. There is a list of return values, so the sigil turns to #. Now we use that list on the left hand side (that's called an LVALUE), and assign the result of the split to that list.
Doing it with modern Perl
The following program will use File::Find::Rule to replace your find command, and Text::CSV to read the CSV file. It grabs all the files, then opens one at a time. The header line will be read first, and fed into the Text::CSV object, so that it can then give back a hash reference, which you can use to access every field by name.
I've written it in a way that it will only read one line for each file, as you said there are only two lines per file. You can easily extend that to be a loop.
use strict;
use warnings;
use File::Find::Rule;
use Text::CSV;
my $sampleid;
my $basedir;
my $csv = Text::CSV->new(
{
binary => 1,
sep => ',',
}
) or die "Cannot use CSV: " . Text::CSV->error_diag;
my #files = File::Find::Rule->file()->name("$sampleid*.summary.csv")->in($basedir);
foreach my $file (#files) {
open my $fh, '<', $file or die "Can't open $file: $!";
# get the headers
my #cols = #{ $csv->getline($fh) };
$csv->column_names(#cols);
# read the first line
my $row = $csv->getline_hr($fh);
# do whatever you you want with the row
print "$file: ", $row->{gender};
}
Please note that I have not tested this program.

Related

Program argument is 100 but returns the value as 0100

Right now I am trying to do an assignment where I have to
- Extract information from an HTML file
- Save it to a scalar
- Run a regular expression to find the number of seats available in the designated course (the program argument is the course number for example 100 for ICS 100)
- If the course has multiple sessions, I have to find the sum of the seats available and print
- The output is just the number of seats available
The problem here is that when I was debugging and checking to make sure that my variable I have the program arg saved to was storing the correct value, it was storing the values with an extra 0 behind it.
ex.) perl filename.pl 100
ARGV[0] returns as 0100
I've tried storing the True regular expression values to an array, saving using multiple scalar variables, and changing my regular expression but none worked.
die "Usage: perl NameHere_seats.pl course_number" if (#ARGV < 1);
# This variable will store the .html file contents
my $fileContents;
# This variable will store the sum of seats available in the array #seatAvailable
my $sum = 0;
# This variable will store the program argument
my $courseNum = $ARGV[0];
# Open the file to read contents all at once
open (my $fh, "<", "fa19_ics_class_availability.html") or die ("Couldn't open 'fa19_ics_class_availability.html'\n");
# use naked brakets to limit the $/
{
#use local $/ to get <$fh> to read the whole file, and not one line
local $/;
$fileContents = <$fh>;
}
# Close the file handle
close $fh;
# Uncomment the line below to check if you've successfully extracted the text
# print $fileContents;
# Check if the course exists
die "No courses matched...\n" if ($ARGV[0] !~ m/\b(1[0-9]{2}[A-Z]?|2[0-8][0-9][A-Z]?|29[0-3])[A-Z]?\b/);
while ($fileContents =~ m/$courseNum(.+?)align="center">(\d)</) {
my $num = $2;
$sum = $sum + $num;
}
print $sum;
# Use this line as error checking to make sure #ARGV[0] is storing proper number
print $courseNum;
The current output I am receiving when program argument is 100 is just 0, and I assume it's because the regular expression is not catching any values as true therefore the sum remains at a value of 0. The output should be 15...
This is a link to the .html page > https://laulima.hawaii.edu/access/content/user/emeyer/ics/215/FA19/01/perl/fa19_ics_class_availability.html
You're getting "0100" because you have two print() statements.
print $sum;
...
print $courseNum;
And because there are no newlines or other output between them, you get the two values printed out next to each other. $sum is '0' and $courseNum is '100'.
So why is $sum zero? Well, that's because your regex isn't picking up the data you want it to match. Your regex looks like this:
m/$courseNum(.+?)align="center">(\d)</
You're looking for $courseNum followed by a number of other characters, followed by 'align="center">' and then your digit. This doesn't work for a number of reasons.
The string "100" appears many times in your text. Many times it doesn't even mean a course number (e.g. "100%"). Perhaps you should look for something more precise (ICS $coursenum).
The .+? doesn't do what you think it does. The dot doesn't match newline characters unless you use the /s option on the match operator.
But even if you fix those first two problems, it still won't work as there are a number of numeric table cells for each course and you're doing nothing to ensure that you're grabbing the last one. Your current code will get the "Curr. Enrolled" column, not the "Seats Avail" one.
This is a non-trivial HTML parsing problem. It shouldn't be addressed using regexes (HTML should never be parsed using regexes). You should look at one of the HTML parsing modules from CPAN - I think I'd use Web::Query.
Update: An example solution using Web::Query:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use File::Basename;
use Web::Query;
my $course_num = shift
or die 'Usage: perl ' . basename $0 . " course_number\n";
my $source = 'fa19_ics_class_availability.html';
open my $fh, '<', $source
or die "Cannot open '$source': $!\n";
my $html = do { local $/; <$fh> };
my $count_free;
wq($html)
# Get each table row in the table
->find('table.listOfClasses tr')
->each(sub {
my ($i, $elem) = #_;
my #tds;
# Get each <td> in the <tr>
$elem->find('td')->each(sub { push #tds, $_[1] });
# Ignore rows that don't have 13 columns
return if #tds != 13;
# Ignore rows that aren't about the right course
return if $tds[2]->text ne "ICS $course_num";
# Add the number of available places
$count_free += $tds[8]->text;
});
say $count_free;

Save a row to csv format

I have a set of rows from a DB that I would like to save to a csv file.
Taking into account that the data are ascii chars without any weird chars would the following suffice?
my $csv_row = join( ', ', #$row );
# save csv_row to file
My concern is if that would create rows that would be acceptable as CSV by any tool and e.g not be concern with quoting etc.
Update:
Is there any difference with this?
my $csv = Text::CSV->new ( { binary => 1, eol => "\n"} );
my $header = join (',', qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 ) );
$csv->print( $fh, [$header] );
foreach my $row ( #data ) {
$csv->print($fh, $row );
}
This gives me as a first line:
" COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4"
Please notice the double quotes and the rest of the rows are without any quotes.
What is the difference than my plain join? Also do I need the binary set?
The safest way should be to write clean records with a comma separator. The simpler the better, specially with the format that has so much variation in real life. If needed, double quote each field.
The true strength in using the module is for reading of "real-life" data. But it makes perfect sense to use it for writing as well, for a uniform approach to CSV. Also, options can then be set in a clear way, and the module can iron out some glitches in data.
The Text::CSV documentation tells us about binary option
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that the fields can not contain newlines. If your data contains newlines embedded in fields, or characters above 0x7E (tilde), or binary data, you must set binary => 1 in the call to new. To cover the widest range of parsing options, you will always want to set binary.
I'd say use it. Since you write a file this may be it for options, along with eol (or use say method). But do scan the many useful options and review their defaults.
As for your header, the print method expects an array reference where each field is an element, not a single string with comma-separated fields. So it is wrong to say
my $header = join (',', qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)); # WRONG
$csv->print( $fh, [$header] );
since the $header is a single string which is then made the sole element of the (anonymous) array reference created by [ ... ]. So it prints this string as the first field in the row, and since it detects in it the separator , itself it also double-quotes. Instead, you should have
$csv->print($fh, [COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4]);
or better assign column names to #header and then do $csv->print($fh, \#header).
This is also an example of why it is good to use the module for writing – if a comma slips into an element of the array, supposed to be a single field, it is handled correctly by double-quoting.
A complete example
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
or die "Cannot use CSV: " . Text::CSV->error_diag();
my $file = 'output.csv';
open my $fh_out , '>', 'output.csv' or die "Can't open $file for writing: $!";
my #headers = qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 );
my #data = 1..4;
$csv->print($fh_out, \#headers);
$csv->print($fh_out, \#data);
close $fh_out;
what produces the file output.csv
COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4
1,2,3,4

Storing a file in a hash - Only stores first line?

I am trying to read a file and store it into a hash. When i print out the contents of the hash only the first line from the file stores.
#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dump;
local $/ = "";
my %parameters;
open(my $PARAMS, 'SimParams.conf')
or die "Unable to open file, $!";
while(<$PARAMS>) {
my #temp = split(/:\s*|\n/);
$parameters{$temp[0]} = $temp[1];
}
dd(\%parameters);
exit 0
The dd(\%parameters) shows only the first line of the file as key and value. How can I get all 3 lines to be Key and Value pairings in this hash?
EDIT: SimParams file as requested:
RamSize: 1000
PageSize: 200, 200
SysClock: 1
The datadump gives the output:
{ RamSize => "1000\r" }
The line
local $/ = "";
is reading your 3 line file as 1 chunk, the entire file. If you eliminate that code, your hash should be created.
You should probably chomp your input to remove the newline . Place it in your code before splitting to #temp.
chomp;
Borodin best explains what local $/ = ""; does.
Setting $/ to the null string enables paragraph mode. Each time you read from $PARAMS (which should be $params because it is a local variable) you will be given the next block of data until a blank line is encountered
It looks like there are no blank lines in your data, so the read will return the entire contents of the file
You don't say why you modified the value of $/, but it looks like just removing that assignment will get your code working properly

Perl - Need to append duplicates in a file and write unique value only

I have searched a fair bit and hope I'm not duplicating something someone has already asked. I have what amounts to a CSV that is specifically formatted (as required by a vendor). There are four values that are being delimited as follows:
"Name","Description","Tag","IPAddresses"
The list is quite long (and there are ~150 unique names--only 2 in the sample below) but it basically looks like this:
"2B_AppName-Environment","desc","tag","192.168.1.1"
"2B_AppName-Environment","desc","tag","192.168.22.155"
"2B_AppName-Environment","desc","tag","10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4"
"6G_ServerName-AltEnv","desc","tag","192.192.192.40"
"6G_ServerName-AltEnv","desc","tag","192.168.50.5"
I am hoping for a way in Perl (or sed/awk, etc.) to come up with the following:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
So basically, the resulting file will APPEND the duplicates to the first match -- there should only be one line per each app/server name with a list of comma-separated IP addresses just like what is shown above.
Note that the "Decription" and "Tag" fields don't need to be considered in the duplication removal/append logic -- let's assume these are blank for the example to make things easier. Also, in the vendor-supplied list, the "Name" entries are all already sorted to be together.
This short Perl program should suit you. It expects the path to the input CSV file as a parameter on the command line and prints the result to STDOUT. It keeps track of the appearance of new name fields in the #names array so that it can print the output in the order that each name first appears, and it takes the values for desc and tag from the first occurrence of each unique name.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my $name = $row->[0];
if ($data{$name}) {
$data{$name}[3] .= ','.$row->[3];
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
output
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Update
Here's a version that ignores any record that doesn't have a valid IPv4 address in the fourth field. I've used Regexp::Common as it's the simplest way to get complex regex patterns right. It may need installing on your system.
use strict;
use warnings;
use Text::CSV;
use Regexp::Common;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my ($name, $address) = #{$row}[0,3];
next unless $address =~ $RE{net}{IPv4};
if ($data{$name}) {
$data{$name}[3] .= ','.$address;
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
I would advise you to use a CSV parser like Text::CSV for this type of problem.
Borodin has already pasted a good example of how to do this.
One of the approaches that I'd advise you NOT to use are regular expressions.
The following one-liner demonstrates how one could do this, but this is a very fragile approach compared to an actual csv parser:
perl -0777 -ne '
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg) {
$s = $1;
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;
print $s
}' test.csv
Outputs:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Explanation:
Switches:
-0777: Slurp the entire file
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
while (m{^((.*)"[^"]*"\n(?:(?=\2).*\n)*)}mg): Separate text into matching sections.
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;: Join all ip addresses by a comma in matching sections.
print $s: Print the results.

i want to merge multiple csv files by specific condition using perl

i have multiple csv files, i want to merge all those files.....
i am showing some of my sample csv files below...
M1DL1_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10
M1DL2_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111
M1DL3_Interpro_sum.csv
IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3
now to merge these files i have tried the following code
#ARGV = <merge_csvfiles/*.csv>;
print #ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(#ARGV))
{
open(FILE,#ARGV[$i]) || die"Can't open ...#ARGV[$i]...\n";
$data.=join("",<FILE>);
close FILE;
print"file completed...",$i+1,"\n";
$i++;
}
#data=split("\n",$data);
#data2=#data;
print scalar(#data);
for($i=0;$i<scalar(#data);$i++)
{
#id1=split(",",#data[$i]);
$id_1=#id1[0];
#data[$j]=~s/\n//;
if(#data[$i] ne "")
{
print PAGE "\n#data[$i],";
for($j=$i+1;$j<scalar(#data2);$j++)
{
#id2=split(",",#data2[$j]);
$id_2=#id2[0];
if($id_1 eq $id_2)
{
#data[$j]=~s/\n//;
print PAGE "#data2[$j],";
#data2[$j]="";
#data[$j]="";
print "match found at ",$i+1," and ",$j+1,"\n";
}
}
}
print $i+1,"\n";
}
merge_csvfiles is a folder which contains all the files
output of above code is
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase
but i want the output in following format....
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
Has anybody got any idea how can i do this?
Thank you for the help
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
#!/usr/bin/env perl
use strict;
use warnings;
my #input_files = glob 'merge_csvfiles/*.csv';
my %data;
for my $i (0 .. $#input_files) {
open my $infh, '<', $input_files[$i]
or die "Failed to open $input_files[$i]: $!";
while (<$infh>) {
chomp;
my $id = (split ',', $_, 2)[0];
$data{$id}[$i] = $_;
}
print "Input file read: $input_files[$i]\n";
}
open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
my #merge_data;
for my $i (0 .. $#input_files) {
push #merge_data, $data{$id}[$i] || '0,0,0';
}
print $outfh join(',', #merge_data) . "\n";
}
The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for that ID from all files are kept together, and the value for each key is (a reference to) an array of the line associated with that ID in each file; using an array for this allows us to keep track of values which are missing as well as those which are present.
The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
The results, in outfile.csv, are:
IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
Edit: Added explanations requested by OP in comments
can u expalain me the working of my $id = (split ',', $_, 2)[0]; and $# in this program
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:
Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to only produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
Putting (...)[0] around the split command turns the returned list of fields into an (anonymous) array and returns the first element of that array. It's the same as if I'd written my #fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
$#array returns the highest-numbered index in the array #array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in #array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (#input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)