perl read file from specified line through the end

I'm new to Perl. I'm trying to read a large comma-separated file, split each line, and keep only some columns. I managed to put something together with help from the internet, but I'm struggling to change the code so that it starts reading from a specific line through to the end of the file.
What I need is: open the file, start reading at line 12, split on ',', grab columns 0, 2, 10 and 11, and join those columns with '\t'.
Here is my code:
#!/usr/bin/perl
my $filename = 'file_to_read.csv';
open(FILER, $filename) or die "Could not read $filename.";
open(FILEW, ">$filename.txt") || die "couldn't create the file\n";
while (<FILER>) {
    chomp;
    my @fields = split(',', $_);
    print FILEW "$fields[0]\t$fields[3]\t$fields[10]\t$fields[11]\n";
}
close FILER;
close FILEW;
Here is an example of the file:
[Header]
GSGT Version: X
Processing Date:12/01/2010 7:20 PM
Content:
Num SNPs:
Total SNPs:
Num Samples:
Total Samples:
Sample:
[Data]
SNP Name,Chromosome,Pos,GC Score,Theta,R,X,Y,X Raw,Y Raw,B Allele Freq,Log R Ratio,Allele1 - TOP,Allele2 - TOP
1:10001102-G-T,1,10001102,0.4159,0.007,0.477,0.472,0.005,6281,126,0.0000,-0.2581,A,A
1:100011159-T-G,1,100011159,0.4259,0.972,0.859,0.036,0.822,807,3648,0.9942,-0.0304,C,C
1:10002775-GA,1,10002775,0.4234,0.977,1.271,0.043,1.228,809,5140,0.9892,0.0111,G,G

Rather than skipping until a specific line number, which may vary from file to file, it is best to keep track of the current section of the file, marked by [Header], [Data] etc.
This solution keeps a state variable $section which is updated to the current section name every time a [Section] label is encountered in the file. Everything from the Data section is summarised and printed.
A similar thing could be done with the column headers, using names instead of numbers to select the fields to be output, but I have chosen to keep the complexity down; a sketch of that variation appears after the output below.
use strict;
use warnings 'all';
use feature 'say';
my $filename = 'file_to_read.csv';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my $section = "";
while ( <$fh> ) {

    next unless /\S/;    # Skip empty lines

    if ( $section eq 'Data' ) {    # Skip unless we're in the [Data] section
        chomp;
        my @fields = split /,/;
        say join ',', @fields[0,3,10,11];
    }
    elsif ( /\[(\w+)\]/ ) {
        $section = $1;
    }
}
output
SNP Name,GC Score,B Allele Freq,Log R Ratio
1:10001102-G-T,0.4159,0.0000,-0.2581
1:100011159-T-G,0.4259,0.9942,-0.0304
1:10002775-GA,0.4234,0.9892,0.0111
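As a rough sketch of the names-instead-of-numbers variation mentioned above: it assumes the first row of the [Data] section is the column header shown in the question, and the four wanted column names are simply the ones from the output above.
use strict;
use warnings 'all';
use feature 'say';

my $filename = 'file_to_read.csv';

open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};

# Column names we want, in output order (an illustrative choice)
my @wanted = ('SNP Name', 'GC Score', 'B Allele Freq', 'Log R Ratio');

my $section = "";
my @indices;    # Positions of the wanted columns, filled from the [Data] header row

while ( <$fh> ) {

    next unless /\S/;

    if ( $section eq 'Data' ) {
        chomp;
        my @fields = split /,/;
        if ( !@indices ) {    # First [Data] row is the column header
            my %pos;
            $pos{ $fields[$_] } = $_ for 0 .. $#fields;
            @indices = map { $pos{$_} } @wanted;
        }
        say join ',', @fields[@indices];
    }
    elsif ( /\[(\w+)\]/ ) {
        $section = $1;
    }
}
The output is the same as above, including the header row.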

Please assign a variable to count the lines processed, like my $line_count = 0;
then increment the variable at the beginning of the while loop with $line_count++;
and skip while the line count is below 12, i.e. next if $line_count < 12;
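As a minimal sketch of that suggestion (Perl's built-in $. line counter can stand in for a separate variable), assuming the data really does start on line 12 of every file, which the section-tracking answer above avoids relying on:
#!/usr/bin/perl
use strict;
use warnings;

my $filename = 'file_to_read.csv';

open my $in,  '<', $filename       or die "Could not read $filename: $!";
open my $out, '>', "$filename.txt" or die "Couldn't create $filename.txt: $!";

while ( my $line = <$in> ) {
    next if $. < 12;                   # $. holds the current input line number
    chomp $line;
    my @fields = split /,/, $line;
    print {$out} join("\t", @fields[0, 3, 10, 11]), "\n";
}

close $in;
close $out;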


How to avoid loading a file into memory in Perl

I have tried a script for sorting an input text file in descending order and printing only the top-usage customers.
The input text file contains:
NAME,USAGE,IP
For example:
Abc,556,10.2.3.5
bbc,126,14.2.5.6
and so on. This is a very large file and I am trying to avoid loading the file into memory.
I have tried the following script.
use warnings ;
use strict;
my %hash = ();
my $file = $ARGV[0] ;
open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (my $line = <$fh>)
{
    chomp($line);
    my ($name, $key, $ip) = split /,/, $line;
    $hash{$key} = [ $name, $ip ];
}

my $count = 0;
foreach ( sort { $b <=> $a } keys %hash ) {
    my $value = $hash{$_};
    print "$_ @{$value} \n";
    last if (++$count == 5);
}
The output should be sorted by usage, showing the name and IP for each usage value.
I think you want to print the five lines of the file that have the highest value in the second column.
That can be done by a sort of insertion sort that checks each line of the file to see if it comes higher than the lowest of the five lines most recently found, but it's easier to just accumulate a sensible subset of the data, sort it, and discard all but the top five.
Here, the array @top contains lines from the file. When there are 100 lines in the array, it is sorted and reduced to the five maximal entries. The while loop then continues to add lines to the array until it reaches the limit again or the end of the file has been reached, at which point the process is repeated. That way, no more than 100 lines from the file are ever held in memory.
I have generated a 1,000-line data file to test this, with random values between 100 and 2,000 in column 2. The output below is the result.
use strict;
use warnings 'all';
open my $fh, '<', 'usage.txt' or die $!;
my @top;

while ( <$fh> ) {
    push @top, $_;
    if ( @top >= 100 or eof ) {
        @top = sort {
            my ($aa, $bb) = map { (split /,/)[1] } ($a, $b);
            $bb <=> $aa;
        } @top;
        @top = @top[0..4];
    }
}

print @top;
output
qcmmt,2000,10.2.3.5
ciumt,1999,10.2.3.5
eweae,1998,10.2.3.5
gvhwv,1998,10.2.3.5
wonmd,1993,10.2.3.5
The standard way to do this is to create a priority queue that contains k items, where k is the number of items you want to return. So if you want the five lines that have the highest value, you'd do the following:
pq = new priority_queue
add the first five items in the file to the priority queue
for each remaining line in the file
    if value > lowest value on pq
        remove lowest value on the pq
        add new value to pq
When you're done going through the file, pq will contain the five items with the highest value.
To do this in Perl, use the Heap::Priority module.
This will be faster and use less memory than the other suggestions.
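By way of illustration only (not using Heap::Priority), here is a minimal plain-Perl sketch of the bounded five-item queue described above; the field positions assume the NAME,USAGE,IP layout from the question:
use strict;
use warnings;

open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";

my @pq;    # [ usage, line ] pairs, kept sorted with the smallest usage first

while ( my $line = <$fh> ) {
    my $usage = ( split /,/, $line )[1];
    next unless defined $usage and $usage =~ /^\d+$/;    # skip the header row

    if ( @pq < 5 ) {
        push @pq, [ $usage, $line ];
    }
    elsif ( $usage > $pq[0][0] ) {
        shift @pq;                                       # drop the current lowest
        push @pq, [ $usage, $line ];
    }
    else {
        next;
    }
    @pq = sort { $a->[0] <=> $b->[0] } @pq;              # keep lowest usage first
}

print $_->[1] for reverse @pq;    # highest usage first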
An algorithm that remembers the five biggest rows seen so far.
Each row is checked against the lowest memorized element. If it is larger, it is inserted into the arrays just before the next bigger item, and the current lowest is shifted out.
use warnings;
use strict;
my $file = $ARGV[0] ;
my @keys = (0, 0, 0, 0, 0);
my @res;

open (my $fh, "<", $file) or die "Can't open the file $file: ";

while (<$fh>)
{
    my ($name, $key, $ip) = split /,/;
    next if ($key < $keys[0]);
    for (0..4) {
        if ($_ == 4 || $key < $keys[$_ + 1]) {
            @keys[0..$_-1] = @keys[1..$_] if ($_ > 0);
            $keys[$_] = $key;
            $res[$_]  = [ $name, $ip ];
            last;
        }
    }
}

for (0..4) {
    print "$keys[4-$_] @{$res[4-$_]}";
}
Test on file from 1M random rows (20 Mbytes):
Last items (This algorithm):
Start 1472567980.91183
End 1472567981.94729 (duration 1.03546 seconds)
full sort in memory (Algorithm of @Rishi):
Start 1472568441.00438
End 1472568443.43829 (duration 2.43391 seconds)
sort by parts of 100 rows (Algorithm of @Borodin):
Start 1472568185.21896
End 1472568195.59322 (duration 10.37426 seconds)

Modifying CSV file and Preserving Order

The question that follows is a made up simplified example of a more complex problem that I'm trying to solve. I would like to preserve the structure of the code, especially the use of the %hash to store the outcomes for each patient but I do not need to read the data file into memory (but I cannot find a way of reading my csv data file line by line from the end.)
My sample data is made up of events that occur to patients. A patient can be added to the study (Event=B), or he can die (Event=D) or exit the study (Event=F). Death and Exit are the only two possible outcomes for each patient.
For each event I have the date of occurrence (in hours from given point in time), the unique ID number of each patient, the event and the Outcome (a field set to 0 for every patient.)
I'm trying to write a code that will change the input file by putting next to each addition of a new patient, what is his eventual outcome (death or exit.)
In order to do so, I read the file from the end, and whenever I encounter a death or exit of a patient, I populate a hash that matches patient ID with outcome. When I encounter an event telling me that a new patient has been added to the study, I then match his ID with those in the hash and change the value of "Outcome" from 0 to either D or F.
I have been able to write code that reads the file from the bottom and then creates a new modified file with the updated value for Outcome. The problem is that since I read the input file from bottom to top and print each line after reading it, the output file is in reversed order, and I do not know how to change this. Also, ideally I don't want to create a new file but would like to simply modify the input one. However, I have failed with every attempt to do so.
Sample data:
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,0
25201028,562962838335408,B,0
25201100,562962838335407,D,0
25201128,562962838335408,F,0
My code:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $fh_input, "<", "mini_test2.csv")
or die "cannot open > mini_test2.csv: $!";
my @lines = <$fh_input>;
close $fh_input;

open (my $fh_output, ">>", "Revised_mini_test2.csv")
    or die "cannot open > Revised_mini_test2.csv: $!";

my $length = scalar(@lines);
my %outcome;
my @input_variables;

for (my $i = 1; $i < @lines; $i++) {
    chomp($lines[$length - $i]);
    @input_variables = split(/,/, $lines[$length - $i]);
    if ($input_variables[2] eq "D" || $input_variables[2] eq "F") {
        $outcome{$input_variables[1]} = $input_variables[2];
        my $line = join(",", @input_variables);
        print $fh_output $line . "\n";
    }
    elsif ($input_variables[2] eq "B") {
        $input_variables[3] = $outcome{$input_variables[1]};
        my $line = join(",", @input_variables);
        print $fh_output $line . "\n";
    }
    else {
        # necessary since the actual data has many more possible "Events"
        my $line = join(",", @input_variables);
        print $fh_output $line . "\n";
    }
}
close $fh_output;
EDIT: desired output should be
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
Also, an additional complication is that the unique patient ID after the exit of a patient gets re-used. This means that I cannot do a 1st pass and store the outcome for each patient and a 2nd one to update the values of Outcome.
EDIT 2: let me clarify that when I say that each patient has a "unique ID" I mean that there cannot be in the study, at the same time, two patients with the same ID. However, if a patient exits the study, his ID gets re-used.
Update
I have just read your additional information that patient numbers are re-used once they exit the study. Why you would design a system like that I don't know, but there it is.
It becomes far harder to write something straightforward without reading the file into an array, so that's what I have done here.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'mini_test2.csv';
my @data;

while ( <$fh> ) {
    chomp;
    push @data, [ split /,/ ];
}

my %outcome;

for ( my $i = $#data; $i > 0; --$i ) {

    my ($patient_number, $event) = @{$data[$i]}[1,2];

    if ( $event =~ /[DF]/ ) {
        $outcome{$patient_number} = $event;
    }
    elsif ( $event =~ /[B]/ ) {
        $data[$i][3] = delete $outcome{$patient_number} // 0;
    }
}

print join(',', @$_), "\n" for @data;
output
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
There are a few ways to approach this. I have chosen to take two passes through the file, first accumulating the outcome for each patient in a hash, and then replacing all the outcome fields in the B records
use strict;
use warnings;
use 5.010;
use autodie;
use Fcntl ':seek';
my %outcome;
open my $fh, '<', 'mini_test2.csv';
<$fh>; # Drop header
while ( <$fh> ) {
    chomp;
    my @fields = split /,/;
    my ($patient_number, $event) = @fields[1,2];
    if ( $event =~ /[DF]/ ) {
        $outcome{$patient_number} = $event;
    }
}

seek $fh, 0, SEEK_SET;    # Rewind
print scalar <$fh>;       # Copy header

while ( <$fh> ) {
    chomp;
    my @fields = split /,/;
    my ($patient_number, $event) = @fields[1,2];
    if ( $event !~ /[DF]/ ) {
        $fields[3] = $outcome{$patient_number} // 0;
    }
    print join(',', @fields), "\n";
}
output
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
What we can do is instead of printing out the line at each stage, we'll write it back to the array of lines. Then we can just print them out at the end.
for (my $i = $#lines; $i >= 0; $i--)
{
    chomp $lines[$i];
    @input_variables = split /,/, $lines[$i];
    if ($input_variables[2] eq "D" || $input_variables[2] eq "F")
    {
        $outcome{$input_variables[1]} = $input_variables[2];
    }
    else
    {
        $input_variables[3] = $outcome{$input_variables[1]};
    }
    $lines[$i] = join ",", @input_variables;
}
$, = "\n";    # Make the list separator for printing a newline.
print $fh_output @lines;
As for the second question of modifying the original file. It is possible to open a file for both reading and writing using modes "+<", "+>", or "+>>". Don't do this! It is error prone as you must replace data character by character.
The standard way to "modify" an existing file is to rename it, read from the renamed file, write to a new file with the original name, and delete the temp file.
my $file_name = "mini_test2.csv";
my $tmp_file_name = $file_name . ".tmp";
rename $file_name, $tmp_file_name;
open (my $fh_input, "<", $tmp_file_name)
or die "cannot open > $tmp_file_name: $!";
open (my $fh_output, ">>", $file_name)
or die "cannot open > $file_name: $!";
#Your code to process the data.
close $fh_input;
close $fh_output;
#delete the temp file
unlink $tmp_file_name;
But, in your case, you slurp all of the data into memory right away. Just open the file for writing, which clobbers any existing file:
open (my $fh_output, ">", "mini_test2.csv")
or die "cannot open > mini_test2.csv: $!";

In Perl, is it possible to use the same elements created in an array from file.csv and do a foreach on file1.csv as well?

In file.csv I have created an array from all the unique values found in column B. Now I want to do a foreach over the same values in column C of file1.csv. Is this possible? I can't hard-code the values of the array as they change too frequently, and every time the user runs the script there would be errors, so that is why I created the array values like this.
#!/usr/bin/perl
use strict;
use warnings;
use Tk;
use Tk::BrowseEntry;
use POSIX 'mktime';
use POSIX 'strftime';
open(STDERR, ">&STDOUT");
######## entry widget to get $yyyy $mmm $dd #######################################
print "\n Select Year = $yyyy\n";
print "\n Select Month = $mmm\n";
print "\n Number of Backup Days = $dd\n";
######## create input and output files #######################################
my $filerror = "\n\n! Cannot open File below, please check it exists or is not open already?\n";
my $OUTFILE = "C:\\Temp\\$yyyy\$mmmAudit.txt";
my $INFILE1 = "c:\\file1.csv";
my $INFILE = "c:\\file.csv";
# Open input file for reading and output file for writing
open (INPUT,"$INFILE") or die "\n$filerror\$INFILE",,1;
#open (OUTPUT,">$OUTFILE") or die "\n$filerror\n$OUTFILE",,1;
my $total_names = 0;
$total_names++ while (<INPUT>);
my $Month_total = $total_names * $dd;
######### get total number of rows in files ##################################
print "\n Total number of names is $total_names\n";
print "\n Total number of names is $Month_total\n";
close INPUT;
open (INPUT,"$INFILE") or die "\n$filerror\$INFILE",,1;
######### keep only unique names to do a foreach in file1.csv#########
my %seen;
while (<INPUT>)
{
    chomp;
    my $line = $_;
    my @elements = split (",", $line);
    my $col_name = $elements[1];
    print " $col_name \n" if ! $seen{$col_name}++;
}
## now in file1.csv I want to do a foreach on all the $col_name values
close INPUT;
... you don't need to loop over the whole file twice - it looks like you're doing it the first time just to count lines - could you not do that inside your second loop? That saves the open/close too, which for the same file is odd.
You can loop over your second file just after grabbing each column name, unless I'm missing something. It's not efficient code, but it'll do it. A rough sketch follows.
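Here is a minimal sketch of that idea, assuming file1.csv is also comma separated and that the values of interest sit in its third column (column C); the paths and column positions are illustrative only:
use strict;
use warnings;

my %seen;

# First pass: collect the unique names from column B of file.csv
open my $in, '<', 'c:\\file.csv' or die "Cannot open file.csv: $!";
while (<$in>) {
    chomp;
    my @elements = split /,/;
    $seen{ $elements[1] }++;
}
close $in;

# Second pass: visit every row of file1.csv whose column C matches one of those names
open my $in1, '<', 'c:\\file1.csv' or die "Cannot open file1.csv: $!";
while (<$in1>) {
    chomp;
    my @elements = split /,/;
    my $col_name = $elements[2];          # column C
    next unless exists $seen{$col_name};
    print "matched $col_name: $_\n";      # do the real per-name work here
}
close $in1;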

Parsing CSV files and matching

I have 10 folders, and in each folder I have two files (CSV, comma delimited) in the following formats.
File 1:
Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141,50927168,1,1,50936822,50927141,1
File 2:
number,Start pos,End Pos
1,41035,41048
3,36738,36751
3,38169,38182
3,40264,40277
I am trying to match the second file to the first file.
The number in column 1 of the second file is the key record number in the first file.
Extract the last 3 columns from the first file.
The output needed is:
1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
I have started reading the second file using Text::CSV, but I need help.
use strict;
use warnings;
use lib 'C:/Perl/lib';
use Text::CSV;
my $file1 = "infile1";
open my $fh1, "<", $file1 or die "$file1: $!";
my $file2 = "infile2";
open my $fh2, "<", $file2 or die "$file2: $!";

my $csv = Text::CSV->new ({
    binary    => 1,
    auto_diag => 1,
});

while (my $row = $csv->getline ($fh2)) {
    print "@$row\n"; # I am stuck in extraction ? do I need to put another while loop for fh1
}

close $fh1;
close $fh2;
Since there are no commas enclosed within double-quotes, you can just split on the commas instead of using Text::CSV (which is an excellent module). Given this, the following produces the output you want:
use strict;
use warnings;
use autodie;
my ( $num, %hash ) = 0;
my ( $file1, $file2 ) = qw/inFile1 inFile2/;

open my $fh1, '<', $file1;
while (<$fh1>) {
    next if $. == 1;
    chomp;
    my @fields = split /,/;
    $num++ if !$hash{ $fields[0] }++;
    push @{ $hash{$num} }, [ @fields[ 0, 6 .. 8 ] ];
}
close $fh1;

open my $fh2, '<', $file2;
while (<$fh2>) {
    next if $. == 1;
    chomp;
    my @fields = split /,/;
    if ( my @arr = @{ $hash{ $fields[0] }->[0] } ) {
        splice @arr, 1, 0, @fields[ 1, 2 ];
        print join( ',', $fields[0], @arr ), "\n";
    }
}
close $fh2;
This uses a hash to: 1) keep track of seen Gene IDs, and 2) build a hash of arrays of arrays (HoAoA). The count--your "key record"--is incremented on unique Gene IDs, so #1 keeps track of these IDs to ensure that $num is incremented only if the Gene ID hasn't yet appeared. Number 2 (the HoAoA) is used because there are multiple instances of the same Gene ID, but only the values from the first instance are used in the printing. (I did note, however, that the second file skips record #2, which is the multiple-instance Gene ID.) Perhaps you only need a hash of arrays (HoA), but it works well the way it is--or you can modify it as needed. That is, if you aren't going to use the multiple-Gene-ID info, the code could be simplified.
Output on your datasets:
1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
Hope this helps!
The interesting part of this problem is that you need logic that reads file 1 until it's ahead of file 2, logic that reads file 2 until it's ahead of file 1, and logic to know how to act when one is behind the other and when they are in balance.
You'll need to track unique gene ensembl IDs and their ordinal position in the list, so that when you read the second line of file 2 you know to skip the second and third lines of file 1, but also know not to skip any more of file 1 once you've read its third and fourth lines.
Or you can read file 1 into memory and create an array of arrays of lines, so that e.g.
$file1arr[1] = [ $line1 ];
$file1arr[2] = [ $line2, $line3 ];
$file1arr[3] = [ $line4 ];
so when you loop over file 2, all the lines from file 1 are in a neat little arrayref at the array index corresponding to the number column of file 2.
Then it's just an exercise in iterating over the array of file 1 lines, splitting them and building your output lines, as in the sketch below.
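As a rough sketch of that second approach, assuming the record number increments each time a new Ensembl Gene ID appears in file 1 (file names are illustrative):
use strict;
use warnings;

# Index file 1 by record number: the number increments on each new Gene ID
my @file1arr;
my %seen;
my $num = 0;

open my $fh1, '<', 'infile1' or die "infile1: $!";
<$fh1>;    # skip the header line
while (<$fh1>) {
    chomp;
    my @fields = split /,/;
    $num++ unless $seen{ $fields[0] }++;
    push @{ $file1arr[$num] }, \@fields;
}
close $fh1;

# Walk file 2 and combine each row with the first matching file-1 record
open my $fh2, '<', 'infile2' or die "infile2: $!";
<$fh2>;    # skip the header line
while (<$fh2>) {
    chomp;
    my ($rec, $start, $end) = split /,/;
    my $gene = $file1arr[$rec][0];    # first file-1 line for this record number
    print join(',', $rec, $gene->[0], $start, $end, @{$gene}[6 .. 8]), "\n";
}
close $fh2;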

parse a huge text file in perl

I have a text file which is tab separated. It can be quite big, up to 1 GB. It will have a variable number of columns depending on the number of samples in it. Each sample has eight columns. For example, sample A has: MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A, AR5_A, while ID1 and ID2 are common to all the samples. What I want to achieve is to split the whole file into chunks of files depending on the number of samples.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks. I want to have it split as:
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487.
Is there any easier way of doing this than going through an array?
The way I have worked out my logic is that counting the (number of headers - 2) and dividing by 8 will give me the number of samples in the file, and then going through each element in an array to parse them. That seems a tedious way of doing it. I would be happy to know of any simpler way of handling this.
Thanks
Sipra
#!/usr/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
    open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
    chomp;
    my @data = split /,/;
    print {$fh{A}} join(',', @data[0 .. 9]), "\n";
    print {$fh{B}} join(',', @data[0, 1, 10 .. 17]), "\n";
    print {$fh{C}} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my @cols = split /,/, $head;
die 'Invalid number of records - ' . @cols . "\n"
    if (@cols - 2) % 8;

my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
    my %desc;
    $desc{start_col} = (($_ - 1) * 8) + 2;
    $desc{end_col}   = $desc{start_col} + 7;
    open $desc{fh}, '>', 'file' . $name++ or die $!;
    print {$desc{fh}} join(',', @cols[0,1],
                           @cols[$desc{start_col} .. $desc{end_col}]),
                      "\n";
    push @files, \%desc;
}

while (<$in>) {
    chomp;
    my @data = split /,/;
    foreach my $f (@files) {
        print {$f->{fh}} join(',', @data[0,1],
                              @data[$f->{start_col} .. $f->{end_col}]),
                         "\n";
    }
}
This is independent of the number of samples. I'm not confident about the output file names though, because you might reach more than 26 samples. Just change how the output file name is generated if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my @source_file = read_file('source_file.csv');

# you mention yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();

my $output_file;

# read each row
while ( my $raw_line = shift @source_file ) {
    $csv->parse($raw_line);
    my @fields = $csv->fields();

    # get the first 2 ids
    my @ids = splice @fields, 0, 2;

    my $group = 0;
    while (@fields) {
        # get the first 8 columns
        my @columns = splice @fields, 0, 8;

        # if you want to change the separator of the output replace ',' with "\t"
        push @{ $output_file->[$group] }, (join ',', @ids, @columns), $/;
        $group++;
    }
}

# for filename purposes
my $letter = 65;
foreach my $data (@$output_file) {
    my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
    write_file( $output_filename, @$data );
    $letter++;
}

# if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (@$output_file) {
#    my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
#    write_file( $output_filename, @$data );
#    $sample_number++;
#}
Here is a one-liner to print the first sample; you can write a shell script to write the data for the different samples into different files:
perl -F, -lane 'print "@F[0..1] @F[2..9]"' <INPUT_FILE_NAME>
You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. There's a cache, but I believe it's emptied after a \n is written to the file.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) {                 # Line from the input file.
    chomp $line;
    my @input_line_array = split /\t/, $line;
    my $fileHandle;
    foreach my $output_file (@outputFileList) {  # List of output files.
        $fileHandle = $output_file->{FILE_HANDLE};
        my @fieldsToWrite;
        foreach my $fieldNumber (@{$output_file->{FIELD_LIST}}) {
            push @fieldsToWrite, $input_line_array[$fieldNumber];
        }
        say {$fileHandle} join "\t", @fieldsToWrite;
    }
}
I'm reading one line of the input file into $line and dividing it up into fields, which I put in @input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called @outputFileList that is a list of all the output files I want to write to. This way, you can go through the list for each file. Each item in the list contains the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for output file $fileNumber. $outputFileList[$fileNumber]->{FIELD_LIST} is a list of fields I want to write to output file $fileNumber, indexed against the fields in @input_line_array. So
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
means that I want to write the following fields to my output file: $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8], in that order, as a tab-separated list, via the handle in $outputFileList[$fileNumber]->{FILE_HANDLE}.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea on how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
Although I didn't use object-oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post), I would definitely use an object-oriented approach for this. It will actually make development much faster by reducing errors.
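As a rough sketch (not the poster's code) of the header-parsing step mentioned above, assuming the ID1, ID2 plus eight-columns-per-sample layout from the question; the file names are illustrative:
use strict;
use warnings;
use feature 'say';

open my $input_fh, '<', 'somefile.tsv' or die $!;

chomp( my $header = <$input_fh> );
my @col_names = split /\t/, $header;

my @outputFileList;
my $samples = ( @col_names - 2 ) / 8;

for my $sample_number ( 1 .. $samples ) {
    my $first = 2 + ( $sample_number - 1 ) * 8;
    my %desc  = ( FIELD_LIST => [ 0, 1, $first .. $first + 7 ] );
    open $desc{FILE_HANDLE}, '>', "sample$sample_number.tsv" or die $!;

    # Write the per-sample header line
    say { $desc{FILE_HANDLE} } join "\t", @col_names[ @{ $desc{FIELD_LIST} } ];
    push @outputFileList, \%desc;
}

# The while loop shown earlier then writes each data line through every
# FILE_HANDLE using its FIELD_LIST.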