Read a CSV file with uneven commas but fixed number of columns - perl

I want to be able to read this CSV file into an array of arrays or hashes for manipulation. How can I go about it?
For example my file contains the following (the first line is the header):
Name,Age,Items,Available
John,29,laptop,mouse,Yes
Jane,28,desktop,keyboard,mouse,yes
Doe,56,tablet,keyboard,trackpad,touchpen,Yes
The first column is Name, the second is Age, and the third is Items, but Items can contain more than one thing separated by commas. The last column is the person's availability.
How can I accurately read this?

Well-formed CSV quotes fields that contain a comma as part of the value. If your CSV is well-formed use the Text::CSV module:
use Text::CSV;
my $csv = Text::CSV->new();
while (my $row = $csv->getline(\*DATA)) {
    my $name = $row->[0];
    my $age = $row->[1];
    my @items = split /,/, $row->[2];
    my $available = $row->[3];
    print "$name/$age/@items/$available\n";
}
__DATA__
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touchpen",Yes
Output:
Name/Age/Items/Available
John/29/laptop mouse/Yes
Jane/28/desktop keyboard mouse/yes
Doe/56/tablet keyboard trackpad touchpen/Yes
If your CSV is not well-formed you'll need to implement a custom parse based on knowledge of your data. Assuming that the Items column is the only multi-valued field you can split on a comma and then remove the fields with a known position. Whatever is left is the items.
while (my $line = <DATA>) {
    chomp $line;
    my @record = split /,/, $line;
    my $name = shift @record;
    my $age = shift @record;
    my $available = pop @record;
    my @items = @record;
    print "$name/$age/@items/$available\n";
}
__DATA__
Name,Age,Items,Available
John,29,laptop,mouse,Yes
Jane,28,desktop,keyboard,mouse,yes
Doe,56,tablet,keyboard,trackpad,touchpen,Yes
Alternately, you could use array slicing to get the same result:
my ($name, $age, $available, @items) = @record[0, 1, -1, 2 .. @record - 2];
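Since the question asked for an array of hashes, here is a minimal sketch that wraps the shift/pop idea in a helper. The name parse_ragged_line and the hash keys are made up for illustration, and it assumes Items is the only field that can contain commas.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): turn one ragged line
# into a hash, assuming Items is the only multi-valued field.
sub parse_ragged_line {
    my ($line) = @_;
    chomp $line;
    my @fields    = split /,/, $line;
    my $name      = shift @fields;    # first field
    my $age       = shift @fields;    # second field
    my $available = pop @fields;      # last field
    return {
        name      => $name,
        age       => $age,
        items     => [@fields],       # whatever remains is the item list
        available => $available,
    };
}

my @lines = (
    "John,29,laptop,mouse,Yes\n",
    "Doe,56,tablet,keyboard,trackpad,touchpen,Yes\n",
);
my @people = map { parse_ragged_line($_) } @lines;
print "$_->{name}: @{ $_->{items} }\n" for @people;
```

This prints "John: laptop mouse" and "Doe: tablet keyboard trackpad touchpen". Each element of @people is a hash reference, so the item list stays an array that you can loop over later.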

Since your data is, in reality, a properly-formatted CSV file, you can use the standard tools to read and store it
Here's the data I'm now assuming that you have
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touch pen",Yes
Solution
Like my original answer, this code uses Text::CSV to parse each line of input. But instead of having to reformat it, each row may be pushed directly onto array @data
Also as before, it conforms to the standard of reading from STDIN. But this time I have used Data::Dump to reveal the in-memory data structure that has been built. If you run it on the command line you should use
$ perl unpack_csv.pl text.csv
use strict;
use warnings 'all';
use Text::CSV;
my $csv = Text::CSV->new;
my @data;
while ( <> ) {
    $csv->parse($_);
    my @row = $csv->fields;
    push @data, \@row;
}
use Data::Dump;
dd \@data;

Update
I now realise that the OP's file may well contain properly-formatted CSV data, which makes this answer superfluous
However the question has not been changed to show the real data, so I am leaving this answer here in case the question's subject line and content entices people with a problem that this will solve
I recommend that you use an intermediate program to format your CSV file properly. Once you have a standard-format file, the resulting output can then be processed using Perl with Text::CSV, Excel, or anything similar
This program uses Text::CSV to read your input data and write the Items column enclosed in quotes if necessary
It works by using Text::CSV->parse to split each line into fields, and then reserving the first two and final fields for new fields 1, 2 and 4. Whatever is left is joined with a comma , and used for field 3. The four resulting values are passed back to Text::CSV->combine and printed
It conforms to the standard of reading from STDIN and writing to STDOUT, so if you run it on the command line you should use
$ perl reformat_csv.pl text.csv > new_text.csv
use strict;
use warnings 'all';
use Text::CSV;
my $csv = Text::CSV->new;
while ( <> ) {
    $csv->parse($_);
    my @row = $csv->fields;
    my $f1 = shift @row;
    my $f2 = shift @row;
    my $f4 = pop @row;
    my $f3 = join ',', @row;
    $csv->combine($f1, $f2, $f3, $f4);
    print $csv->string, "\n";
}
output
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touchpen",Yes

Related

Data value of array not printing properly

I have written a script which collects students' marks and prints those who scored above 50.
Script is below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
print Dumper(\@array);
my $class = "3";
foreach my $each_value (@array) {
    print "EACH: $each_value\n";
    my ($name, $score ) = split (/,/, $each_value);
    if ($score lt 50) {
        next;
    } else {
        print "$name, \"GOOD SCORE\", $score, $class";
    }
}
Here I wanted to print data of STUDENT1, since his score is greater than 50.
So output should be:
STUDENT1, "GOOD SCORE", 90, 3
But it's printing output like this:
STUDENT1, "GOOD SCORE", 90
STUDENT2, 3
Something goes wrong between 90 and STUDENT2: the two are not separated.
I know I am not splitting the data on the newline character, since there is only a single element in the array @array.
How can I split that element on newlines, so that inside the for loop I can split again on the comma (,) to get the values into $name and $score?
The @array actually comes in as an argument to this script, so I have to modify the script to parse the right values.
As you already know your "array" only has one "element" with a string with the actual records in it, so it essentially is more a scalar than an array.
And as you suspect, you can split this scalar just as you already did with the newline as a separator instead of a comma. You can then put a foreach around the result of split() to iterate over the records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $records = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
my $class = "3";
foreach my $record (split("\n", $records)) {
my ($name, $score) = split(',', $record);
if ($score >= 50) {
print("$name, \"GOOD SCORE\", $score, $class\n");
}
}
As a small note, lt is a string comparison operator. The numeric comparisons use symbols, such as <.
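To see the difference concretely, here is a small sketch. The helper name is_good_score is made up for illustration. String comparison goes character by character, so '100' sorts before '50', while the numeric operators compare the values as numbers.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: numeric pass/fail check.
sub is_good_score {
    my ($score) = @_;
    return $score >= 50;    # numeric comparison
}

# lt compares strings: '1' sorts before '5', so this is true.
print "string: ", ( '100' lt '50' ? "true" : "false" ), "\n";

# < compares numbers: 100 is not less than 50, so this is false.
print "number: ", ( 100 < 50 ? "true" : "false" ), "\n";

print "90 is good\n" if is_good_score(90);
```

This prints "string: true", "number: false" and "90 is good". In the question's script the first split leaves $score holding "90" followed by a newline and "STUDENT2", and lt happily compares that as a string, which is why no warning pointed at the real problem.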
Although you have an array, you only have a single string value in it:
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
That's not a big deal. Dave Cross has already shown you how you can break that up into multiple values, but there's another way I like to handle multi-line strings. You can open a filehandle on a reference to the string, then read lines from the string as you would a file:
my $string = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
One of the things to consider while programming is how many times you are duplicating the data. If you have it in a big string then split it into an array, you've now stored the data twice. That might be fine and it's usually expedient. You can't always avoid it, but you should have some tools in your toolbox that let you avoid it.
And, here's a chance to use indented here docs:
use v5.26;
my $string = <<~"HERE";
STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
For your particular problem, I think you have a single string where the lines are separated by the '|' character. You don't show how you call this program or get the data, though.
You can choose any line ending you like by setting the value for the input record separator, $/. Set it to a pipe and this works:
use v5.10;
my $string = 'STUDENT1,90|STUDENT2,40|STUDENT3,30|STUDENT4,30';
{
local $/ = '|'; # input record separator
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
say "Got $_";
}
}
Now the structure of your program isn't too far away from taking the data from standard input or a file. That gives you a lot of flexibility.
The @array contains only one element. The for loop itself is correct; you can fix the problem without any change to the loop body just by replacing the array with this:
my @array = (
'STUDENT1,90',
'STUDENT2,40',
'STUDENT3,30',
'STUDENT4,30');
Otherwise you can iterate on them by splitting lines using new line \n .
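Since the array arrives as an argument and you can't control its shape, a sketch of that second option might look like this. The helper name flatten_records is an assumption of mine, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: accept elements that may each hold several
# newline-separated records and return one record per list item.
sub flatten_records {
    my @elements = @_;
    # split /\n/ works on $_ inside map; grep drops any empty lines.
    return grep { length } map { split /\n/ } @elements;
}

my @array   = ( "STUDENT1,90\nSTUDENT2,40\n", "STUDENT3,30" );
my @records = flatten_records(@array);
print "$_\n" for @records;
```

This prints the three records on separate lines, and the existing for loop can then run over @records unchanged. It works whether the caller passes one big string or several elements.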

how to remove last single line available in file using perl

how to remove last single line available in file using perl.
I have my data like below.
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
how to remove the last line... I am summing all the numbers but receiving a null value at the end.
Tried using chomp but did not work.
Here is the code currently being used:
while (<data>) {
    chomp(my @row = split ',', $_, -1);
    say sum @row[1 .. $#row];
}
Try this (shell one-liner) :
perl -lne '!eof() and print' file
or as part of a script :
while (defined($_ = readline ARGV)) {
print $_ unless eof();
}
You should be using Text::CSV or Text::CSV_XS for handling comma separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);
my $csv = Text::CSV->new({binary => 1})
or die "Cannot use CSV: " . Text::CSV->error_diag;
while(my $row = $csv->getline($fh)) {
    next unless ($row->[0] || '') =~ m/\w/; # Reject rows that don't start with an identifier.
    my $sum = sum(@$row[1..$#$row]);
    print "$sum\n";
}
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(@row) && length $row[0]; # Skip empty rows.
The point to this line is to detect when a row is empty -- has no elements, or elements were empty after the chomp.
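Put together, the guard might sit in a helper like the one below. This is only a sketch: sum_line is a made-up name, and I have added a grep to drop the empty trailing field that the -1 limit keeps, which is my own addition rather than part of the advice above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: sum the numeric fields of one line, or return
# undef for a line with no leading identifier (e.g. a blank last line).
sub sum_line {
    my ($line) = @_;
    chomp $line;
    my @row = split /,/, $line, -1;          # -1 keeps trailing empty fields
    return unless @row && length $row[0];    # skip empty rows
    # Drop empty fields, and start from 0 so an empty list still sums to 0.
    return sum 0, grep { length } @row[ 1 .. $#row ];
}

for my $line ( qq{"A",1,-2,-1,-4,\n}, "\n" ) {
    my $total = sum_line($line);
    print defined $total ? "$total\n" : "(skipped)\n";
}
```

This prints -6 for the data line and "(skipped)" for the blank one.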
I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
    # Chomp the input before splitting it.
    chomp;
    # Remove the -1 from your call to split().
    # This automatically removes any empty trailing fields.
    my @row = split /,/;
    # Skip lines that are empty.
    # 1/ Ensure there is data in @row.
    # 2/ Ensure at least one element in @row contains
    #    non-whitespace data.
    next unless @row and grep { /\S/ } @row;
    # Ensure that all of the data you pass to sum()
    # looks like numbers.
    say sum grep { looks_like_number $_ } @row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,

Perl - Using an Array on a Config txt file

So I have a text file with four sets of data on a line, such as aa bb username password. So far I have been able to parse through the first line of the file using substrings and indices, and assigning each of the four to variables.
My goal is to use an array, read through each line and assign its fields to the four variables, and then to match a user-supplied argument against the first variable and use the four variables from the matching line.
For example, this would be the text file:
"aa bb cc dd"
"ee ff gg hh"
And depending on whether the user inputs "aa" or "ee" as the argument, it would use that line's set of arguments in the code.
I am trying to set up a basic array and go through it based on a condition on the first variable, essentially.
Here is my code for the four variables for the first line, but like I said, this only works for the first line in the text file:
local $/;
open(FILE, $configfile) or die "Can't read config file 'filename' [$!]\n";
my $document = <FILE>;
close (FILE);
my $string = $document;
my $substring = " ";
my $Index = index($string, $substring);
my $aa = substr($string, 0, $Index);
my $newstring = substr($string, $Index+1);
my $Index2 = index($newstring, $substring);
my $bb = substr($newstring, 0, $Index2);
my $newstring2 = substr($newstring, $Index2+1);
my $Index3 = index($newstring2, $substring);
my $cc = substr($newstring2, 0, $Index3);
my $newstring3 = substr($newstring2, $Index3+1);
my $Index4 = index($newstring3, $substring);
my $dd = substr($newstring3, 0, $Index4);
First of all, you can parse your whole line using split instead of running index and substring on them:
my ( $aa, $bb, $cc, $dd ) = split /\s+/, $line;
Even better, use an array:
my @array = split /\s+/, $line;
I think you're saying that you need to store each array of command parts into another array of lines. Is that correct? Take a look at this tutorial on references available in the Perl Documentation.
Perl has three different types of variables: scalars, arrays, and hashes. The problem is that each slot in any of them stores only a single piece of data. Arrays and hashes may store lots of data, but only one piece of data can be stored in each element of a hash or array.
References allow you to get around this limitation. A reference is simply a pointer to another piece of data. For example, if $line = "aa bb cc dd", doing this:
my @command_list = split /\s+/, $line;
Will give you the following:
$command_list[0] = "aa";
$command_list[1] = "bb";
$command_list[2] = "cc";
$command_list[3] = "dd";
You want to store @command_list into another structure. What you need is a reference to @command_list. To get a reference to it, you merely put a backslash in front of it:
my $reference = \@command_list;
This could be put into an array:
my @array;
$array[0] = $reference;
Now, I'm storing an entire array into a single element of an array.
To get back to the original structure from the reference, you put the correct sigil. Since this is an array, you put @ in front of it:
my @new_array = @{ $reference };
If you want the first item in your reference without having to copy it into another array, you could simply treat @{ $reference } as an array itself:
${ $reference }[0] = "aa";
Or, use the magic -> which makes the syntax a bit cleaner:
$reference->[0] = "aa";
Go through the tutorial. This will help you understand the full power of references, and how they can be used. Your program would look something like this:
use strict;
use warnings;
use feature qw(say); # Better print than print
use autodie; # Kills your program if the file can't be opened
my $file = [...] #Somehow get the file you're reading in...
open my $file_fh, "<", $file;
my @command_list;
while ( my $line = <$file_fh> ) {
    chomp $line;
    my @line_list = split /\s+/, $line;
    push @command_list, \@line_list;
}
Note that push @command_list, \@line_list; is pushing a reference to one array into another. How do you get it back out? Simple:
for my $cmd_line_ref ( @command_list ) {
    my $command = $cmd_line_ref->[0]; #This is the first element in your command
    next unless $command eq $user_desires; # However you figure out what the user wants
    my $line = join " ", @{ $cmd_line_ref }; #Rejoins your command line once again
    # ??? Profit
}
Read the tutorial on references, and learn about join and split.
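If all you need is the line whose first field matches the user's argument, a hash of array references keyed on that field saves the search loop. This is a variation on the approach above rather than part of it; index_config and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: index config lines by their first field so a
# user-supplied key selects the matching line directly.
sub index_config {
    my @lines = @_;
    my %by_key;
    for my $line (@lines) {
        chomp $line;
        my @fields = split ' ', $line;    # split on runs of whitespace
        next unless @fields;              # ignore blank lines
        $by_key{ $fields[0] } = \@fields; # store a reference to this line's fields
    }
    return \%by_key;
}

my $config = index_config( "aa bb cc dd\n", "ee ff gg hh\n" );
my $wanted = 'ee';    # stands in for the user's argument
if ( my $fields = $config->{$wanted} ) {
    my ( $key, $second, $user, $pass ) = @$fields;
    print "$key -> $second $user $pass\n";
}
```

This prints "ee -> ff gg hh". If the same first field can appear on more than one line, the last one wins, so this only suits files where that field is unique.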
You are reading the whole file in the my $document = <FILE> line.
Try something like:
my @lines;
open my $file, '<', $configfile or die 'xxx';
while( <$file> ) {
    chomp;
    push @lines, [ split ];
}
And now @lines has an array of arrays with the information you need.
(EDIT) don't forget to lose the local $/; -- it's what is making you read the whole file at once.
my $document = <FILE> reads a single record: just the first line by default, or the whole file when $/ is undefined, as it is in your code. Try using a while loop.
If you want to read all lines of the file at once - assuming it's a small file - you may want to use File::Slurp module:
use File::Slurp;
my @lines = File::Slurp::read_file($configfile);
foreach my $line (@lines) {
    # do whatever
}
Also, you can use CPAN modules to split the strings into fields.
If they are single-space separated, simply read the whole file using a standard CSV parser (you can configure Text::CSV_XS to use any character as separator). Example here: How can I parse downloaded CSV data with Perl?
If they are separated by a random amount of whitespace, use @massa's advice below and use the split function.

Using Perl to parse a CSV file from a particular row to the end of the file

I am very new to Perl and need your help.
I have a CSV file xyz.csv with the contents below (the level1 and er values are string names, not numbers):
level1,er
level2,er2
level3,er3
level4,er4
I parse this CSV file using the script below and pass the fields to an array in the first run
open(my $d, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$d>) {
    chomp $line;
    my @data = split "," , $line;
    @XYX = ( [ "$data[0]", "$data[1]" ], );
}
For the second run I take an input from a command prompt and store in variable $val. My program should parse the CSV file from the value stored in variable until it reaches the end of the file
For example
I input level2 so I need a script to parse from the second line to the end of the CSV file, ignoring the values before level2 in the file, and pass these values (level2 to level4) to the @XYX = (["$data[1]","$data[1]"],);}
level2,er2
level3,er3
level4,er4
I input level3 so I need a script to parse from the third line to the end of the CSV file, ignoring the values before level3 in the file, and pass these values (level3 and level4) to the @XYX = (["$data[0]","$data[1]"],);}
level3,er3
level4,er4
How do I achieve that? Please do give your valuable suggestions. I appreciate your help
As long as you are certain that there are never any commas in the data you should be OK using split. But even so it would be wise to limit the split to two fields, so that you get everything up to the first comma and everything after it
There are a few issues with your code. First of all I hope you are putting use strict and use warnings at the top of all your Perl programs. That simple measure will catch many trivial problems that you could otherwise overlook, and so it is especially important before you ask for help with your code
It isn't commonly known, but putting a newline "\n" at the end of your die string prevents Perl from giving file and line number details in the output of where the error occurred. While this may be what you want, it is usually more helpful to be given the extra information
Your variable names are very unhelpful, and by convention Perl variables consist of lower-case alphanumerics and underscores. Names like @XYX and $W don't help me understand your code at all!
Rather than splitting to an array, it looks like you would be better off putting the two fields into two scalar variables to avoid all that indexing. And I am not sure what you intend by @XYX = (["$data[1]","$data[1]"],). First of all do you really mean to use $data[1] twice? Secondly, you should never put scalar variables inside double quotes, as it does something very specific, and unless you know what that is you should avoid it. Finally, did you mean to push an anonymous array onto @XYX each time around the loop? Otherwise the contents of the array will be overwritten each time a line is read from the file, and the earlier data will be lost
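To make the push point concrete, here is a minimal sketch. The name collect_pairs is made up, and the lines are passed in as a list rather than read from a file so the example stands alone.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [level, error] pairs from CSV-like lines.
sub collect_pairs {
    my @lines = @_;
    my @pairs;
    for my $line (@lines) {
        chomp $line;
        my ( $level, $error ) = split /,/, $line, 2;    # at most two fields
        push @pairs, [ $level, $error ];                # append, don't assign
    }
    return @pairs;
}

my @pairs = collect_pairs( "level1,er\n", "level2,er2\n" );
print scalar(@pairs), " rows; last is $pairs[-1][0],$pairs[-1][1]\n";
```

This prints "2 rows; last is level2,er2". With an assignment in place of the push, only the final row would survive the loop.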
This program uses a regular expression to extract $level_num from the first field. All it does is find the first sequence of digits in the string, which can then be compared to the minimum required level $min_level to decide whether a line from the log is relevant
use strict;
use warnings;
my $file = 'xyz.csv';
my $min_level = 3;
my @list;
open my $fh, '<', $file or die "Could not open '$file' $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($level, $error) = split ',', $line, 2;
    my ($level_num) = $level =~ /(\d+)/;
    next unless $level_num >= $min_level;
    push @list, [ $level, $error ];
}
For deciding which records to process you can use the "flip-flop" operator (..) along these lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $level = shift || 'level1';
while (<DATA>) {
if (/^\Q$level,/ .. 0) {
print;
}
}
__DATA__
level1,er
level2,er2
level3,er3
level4,er4
The flip-flop operator returns false until its first operand is true. From that point it returns true until its second operand is true, after which it returns false again.
I'm assuming that your file is ordered so that once you start to process it, you never want to stop. That means that the first operand to the flip-flop can be /^\Q$level,/ (match the string $level at the start of the line) and the second operand can just be zero (as we never want it to stop processing).
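If the flip-flop feels opaque, the same "start at the first match and never stop" behaviour can be written with an explicit flag. This is an equivalent sketch, not the code from this answer, and lines_from_level is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line from the first one whose first
# field equals $level through to the end of the input.
sub lines_from_level {
    my ( $level, @lines ) = @_;
    my $started = 0;
    my @kept;
    for my $line (@lines) {
        $started ||= $line =~ /^\Q$level\E,/;    # latch on at the first match
        push @kept, $line if $started;
    }
    return @kept;
}

my @all = ( "level1,er\n", "level2,er2\n", "level3,er3\n", "level4,er4\n" );
print lines_from_level( 'level3', @all );
```

This prints the level3 and level4 lines. The flag is local to each call, so the helper can be reused on several inputs without carrying state over from the previous one.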
I'd also strongly recommend not parsing CSV records using split /,/. That may work on your current data but, in general, the fields in a CSV file are allowed to contain embedded commas which will break this approach. Instead, have a look at Text::CSV or Text::ParseWords (which is included with the standard Perl distribution).
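As a quick illustration of the Text::ParseWords route, here is a minimal sketch using its parse_line function on a line with a quoted field. The sample line is borrowed from the first question on this page. Note that this module handles quoted fields but is not a full CSV parser, so Text::CSV remains the safer choice for real data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# parse_line(delimiter, keep, line): with keep set to 0 the quotes are
# removed and a quoted delimiter stays inside its field.
my $line   = 'John,29,"laptop,mouse",Yes';
my @fields = parse_line( ',', 0, $line );
print join( '|', @fields ), "\n";
```

This prints John|29|laptop,mouse|Yes, where a plain split /,/ would have produced five fields.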
Update: I seem to have got a couple of downvotes on this. It would be great if people would take the time to explain why.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my @XYZ;
my $file = 'xyz.csv';
open my $fh, '<', $file or die "$file: $!\n";
my $level = shift; # get level from commandline
my $getall = not defined $level; # true if level not given on commandline
my $parser = Text::CSV->new({ binary => 1 }); # object for parsing lines of CSV
while (my $row = $parser->getline($fh)) # $row is an array reference containing cells from a line of CSV
{
    if ($getall # if level was not given on commandline, then put all rows into @XYZ
        or # if level *was* given on commandline, then...
        $row->[0] eq $level .. 0 # ...wait until the first cell in a row equals $level, then put that row and all subsequent rows into @XYZ
    )
    {
        push @XYZ, $row;
    }
}
close $fh;
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'xyz.csv';
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
my $level = shift || "level1";
while (my $line = <$data>) {
    chomp $line;
    my @fields = split "," , $line;
    if ($fields[0] eq $level .. 0) {
        print "\n$fields[0]\n";
        print "$fields[1]\n";
    }
}
This worked....thanks ALL for your help...

parse a huge text file in perl

I have tab-separated text files, which can be quite big (up to 1 GB). Each file has a variable number of columns depending on the number of samples in it. Each sample has eight columns, for example sample A: MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A, AR5_A. The ID1 and ID2 columns are common to all the samples. What I want to achieve is to split the whole file into separate files, one per sample.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks. I want to split it as follows:
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487
Is there an easier way of doing this than going through an array?
My logic so far is to count the headers, subtract 2, and divide by 8 to get the number of samples in the file, and then to go through each element in an array and parse it. That seems a tedious way of doing this. I would be happy to know of any simpler way of handling it.
Thanks
Sipra
#!/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
    chomp;
    my @data = split /,/;
    # A filehandle stored in a hash must be wrapped in a block for print.
    print {$fh{A}} join(',', @data[0 .. 9]), "\n";
    print {$fh{B}} join(',', @data[0, 1, 10 .. 17]), "\n";
    print {$fh{C}} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my @cols = split /,/, $head;
die 'Invalid number of records - ' . @cols . "\n"
    if (@cols - 2) % 8;
my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
    my %desc;
    $desc{start_col} = (($_ - 1) * 8) + 2;
    $desc{end_col} = $desc{start_col} + 7;
    open $desc{fh}, '>', 'file' . $name++ or die $!;
    print {$desc{fh}} join(',', @cols[0,1],
        @cols[$desc{start_col} .. $desc{end_col}]),
        "\n";
    push @files, \%desc;
}
while (<$in>) {
    chomp;
    my @data = split /,/;
    foreach my $f (@files) {
        print {$f->{fh}} join(',', @data[0,1],
            @data[$f->{start_col} .. $f->{end_col}]),
            "\n";
    }
}
This is independent of the number of samples. I'm not confident about the output file names, though, because you might have more than 26 samples. Just replace how the output file name works if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my @source_file = read_file('source_file.csv');
# you mention yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
#read each row
while ( my $raw_line = shift @source_file ) {
    $csv->parse($raw_line);
    my @fields = $csv->fields();
    #get the first 2 ids
    my @ids = splice @fields, 0, 2;
    my $group = 0;
    while (@fields) {
        #get the first 8 columns
        my @columns = splice @fields, 0, 8;
        #if you want to change the separator of the output replace ',' with "\t"
        push @{ $output_file->[$group] }, (join ',', @ids, @columns), $/;
        $group++;
    }
}
#for filename purposes
my $letter = 65;
foreach my $data (@$output_file) {
    my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
    write_file( $output_filename, @$data );
    $letter++;
}
#if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (@$output_file) {
#    my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
#    write_file( $output_filename, @$data );
#    $sample_number++;
#}
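Another option for the naming worry is Perl's string auto-increment, which carries past 'Z' to 'AA' by itself. Here is a minimal sketch; sample_file_names is a made-up helper, and the SAMPLE_ prefix just mirrors the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: generate $count output file names using Perl's
# string auto-increment, which carries past 'Z' to 'AA'.
sub sample_file_names {
    my ($count) = @_;
    my $suffix = 'A';
    # $suffix++ yields the current suffix, then advances it.
    return map { 'SAMPLE_' . $suffix++ . '.csv' } 1 .. $count;
}

my @names = sample_file_names(28);
print "$names[0] $names[25] $names[26] $names[27]\n";
```

This prints SAMPLE_A.csv SAMPLE_Z.csv SAMPLE_AA.csv SAMPLE_AB.csv, so the letter scheme keeps working however many samples the file holds.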
Here is a one liner to print the first sample, you can write a shell script to write the data for different samples into different files
perl -F, -lane 'print "@F[0..1] @F[2..9]"' <INPUT_FILE_NAME>
You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. There's an output buffer, but it is small and is flushed to the file whenever it fills.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) { #Line from the input file.
    chomp $line;
    my @input_line_array = split /\t/, $line;
    foreach my $output_file (@outputFileList) { #List of output files.
        my $fileHandle = $output_file->{FILE_HANDLE};
        my @fieldsToWrite;
        foreach my $fieldNumber (@{$output_file->{FIELD_LIST}}) {
            push @fieldsToWrite, $input_line_array[$fieldNumber];
        }
        say {$fileHandle} join "\t", @fieldsToWrite;
    }
}
I'm reading in one line of the input file into $line and dividing that up into fields which I am putting in the @input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called @outputFileList that is a list of all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for my output file $fileNumber. $outputFileList[$fileNumber]->{FIELD_LIST} is a list of fields I want to write to output file $fileNumber. This is indexed to the fields in @input_line_array. So
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
means that I want to write the following fields: $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to my output file $outputFileList[$fileNumber]->{FILE_HANDLE}, in that order, as a tab-separated list.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea on how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
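As a rough sketch of that first-line parsing, the helper below builds the FIELD_LIST entries from the header fields. It assumes two shared ID columns followed by groups of eight, as in the question's example. build_field_lists is a made-up name, and opening the FILE_HANDLE for each entry is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: given the header fields, return one hash per sample
# holding the field numbers to write for that sample.
sub build_field_lists {
    my @header = @_;
    my ( $shared, $per_sample ) = ( 2, 8 );    # assumed layout
    die "Unexpected column count: " . @header . "\n"
        if @header < $shared || ( @header - $shared ) % $per_sample;

    my @output_files;
    for ( my $start = $shared; $start < @header; $start += $per_sample ) {
        push @output_files, {
            # shared ID columns first, then this sample's eight columns
            FIELD_LIST => [ 0 .. $shared - 1, $start .. $start + $per_sample - 1 ],
        };
    }
    return @output_files;
}

my @header = split /,/, 'ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,'
                      . 'MIN_B,AVG_B,MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B';
my @files = build_field_lists(@header);
print "@{ $_->{FIELD_LIST} }\n" for @files;
```

This prints "0 1 2 3 4 5 6 7 8 9" and "0 1 10 11 12 13 14 15 16 17". For a real tab-separated file you would split the header on /\t/ instead and add an opened FILE_HANDLE to each hash.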
Although I didn't use object oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post), I would definitely use an object-oriented approach with this. It will actually make things much faster by removing errors.