identify and insert the missing rows - perl

An array is populated from a tab-delimited text file (5 columns) that is sometimes missing rows. I need to identify and insert the missing rows. Inserting a string "blank row found" is sufficient.
Here is an example of data from file:
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
I've created an array whose elements are the values that the second column of each row should contain, in the order the rows should appear in the file. However, I'm not sure how to continue from here, since I'm unable to install any Perl modules on the server (e.g. Arrays::Utils).
Is comparing arrays the correct way of approaching this problem? Perhaps there is a straightforward solution that doesn't require installing any CPAN modules? Thanks for your help.
#!perl
use strict;
use warnings;
use File::Basename;
#use Arrays::Utils;
opendir my $dir, "/data/test_all_runs" or die "Cannot open directory: $!";
my @run_folder = readdir $dir;
closedir $dir;
my $run_folder = pop @run_folder;
print "The folder is ".$run_folder."\n";
my $home = "/data/";
my $CNV_file = $home."test_all_runs/".$run_folder."/CNV.txt";
my @CNVarray;
open(TXT2, "$CNV_file");
while (<TXT2>){
    push (@CNVarray, $_);
}
close(TXT2);
foreach (@CNVarray){
    chop($_);
}
my @array1 = map { $_->[1] } @CNVarray;
my @array2 = qw(MTOR JAK1 NRAS DDR2 MYCN ALK IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1 FGFR3 PDGFRA KIT APC FGFR4 ROS1 ESR1 EGFR CDK6 MET SMO BRAF FGFR1 MYC JAK2 GNAQ RET FGFR2 HRAS CCND1 BIRC2 KRAS ERBB3 CDK4 AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1 GNA11 MAP2K2 JAK3 AR MED12);
my %array1_hash;
my %array2_hash;
# Create a hash entry for each element in @array1
for my $element ( @array1 ) {
    $array1_hash{$element} = @array1;
}
# Same for @array2: This time, use map instead of a loop
map { $array2_hash{$_} = 1 } @array2;
for my $entry ( @array2 ) {
    if ( not $array1_hash{$entry} ) {
        return 1; # Entry in @array2 but not @array1: differ
    } else {
        return 0; # Arrays contain the same elements
    }
    #if ( keys %array1_hash != keys %array2_hash ) {
    #    return 1; # Arrays differ
}

Note: the best version is reached at the end; it is only a few lines of code.
If I understand correctly, you have a separate reference list of keywords that need to appear in the second field of each row, with the rows in that order. One way to find skipped rows is to iterate through both lists side by side.
That approach can be fiddly and error-prone, but here it can be made easier by removing the front element from the reference list each time. Then you always compare the current line against the first element of the reference list. Here is the basic logic, with the better version further below.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
# chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
foreach my $line (@CNVarray)
{
    if ( (split /\t/, $line)[1] eq $ref_list[0] ) {  # good row
        shift @ref_list;
        print $line, "\n";
    }
    else {
        shift @ref_list;
        print "blank row found\n";
        while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
            # multiple missing rows? keep going through the reference list
            shift @ref_list;
            print "blank row found\n";
        }
    }
}
# We are done with the array, but are there more reference items?
print "blank row found\n" for @ref_list;
The while loop is needed since multiple consecutive rows can be missing, so we need to get to the place in the reference list that does match the current row. A few notes on the code.
The filehandle read <...> in list context returns a list of all lines from the resource.
The chop in the original code removes the last character, which is probably not what you want. It is chomp that removes the newline (or, really, the contents of $/).
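For example, a minimal illustration of the difference (the sample line is from the question):
my $line = "chr1:11174372\tMTOR\t42939\t42939\t7\n";

my $chomped = $line;
chomp $chomped;    # removes the trailing newline (the value of $/)

my $chopped = $line;
chop $chopped;     # removes the last character, whatever it is
chop $chopped;     # a second chop now eats the "7", which is real data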
Tested against the reference list qw(AA BB CC DD EE) with the input file (note spaces not tabs)
1 AA first
2 BB more
5 EE last
To test with this, change /\t/ to /\s/ (which will then work for tabs as well). It prints
1 AA first
2 BB more
blank row found
blank row found
5 EE last
With further elements added to @ref_list (FF etc.) further "blank row found" lines are printed.
The code above can be simplified. Lines are also collected in an array, then printed to a new file.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
my @new_lines;
foreach my $line (@CNVarray)
{
    while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
        shift @ref_list;
        push @new_lines, 'blank row found';
        print "blank row found\n";
    }
    shift @ref_list;
    push @new_lines, $line;
    print $line, "\n";
}
# There may be more items remaining on the reference list
for (@ref_list) {
    push @new_lines, 'blank row found';
    print "blank row found\n";
}
my $filled_file = 'skipped_rows_added.txt';
open my $out_fh, '>', $filled_file or die "Can't open $filled_file: $!";
print $out_fh "$_\n" for @new_lines;
close $out_fh;
This behaves the same way with the test input above. It can be simplified further yet
foreach my $line (@CNVarray)
{
    while ( (split /\t/, $line)[1] ne shift @ref_list ) {
        print "blank row found\n";
    }
    print $line, "\n";
}
The shift returns the removed element, which is exactly what needs to be tested against.
A note on split syntax, following the code update ("\t" changed to /\t/).
When invoked as split /$patt/, $str, the $patt is used as a regular expression, with a few very minor differences. So with /\s/ the string is split on whitespace as understood in a regex, which includes the tab, for example.
With double quotes "..." used instead of /.../, what is inside is interpolated first, which may result in surprises, in particular with escapes. (Unless it is used as m"...", in which case it is merely a regex with " as the delimiter.)
In the above code one can split on the tab with /\t/, or "\t", or '\t' (or /\s/, which also matches other kinds of whitespace). The "\t" was changed to /\t/, which is better in my opinion, being clearer (it is a regex, no questions asked). Thanks to Borodin for the early edit and for the comment.
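A small demonstration of that equivalence, using a row shaped like the question's data:
my $row = join "\t", 'chr1:11174372', 'MTOR', 42939, 42939, 7;

my @a = split /\t/, $row;  # a regex matching a tab
my @b = split "\t", $row;  # interpolates to a literal tab, then used as a regex
my @c = split '\t', $row;  # the two characters \ and t, still the regex \t

# All three produce the same five fields for this input.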

I would write this
The input file is read into a hash, keyed by the value of the second column. Then the hash is read back and printed in the specified sequence of keys
Most of the code is finding the input file and setting up the sequence of keys. The core of the program is only three lines of code
use strict;
use warnings 'all';
use File::Spec::Functions 'catfile';
my $home = '/data';
my @run_folder = grep -f, glob catfile($home, 'test_all_runs', '*', 'CNV.txt');
die "No CNV file found" unless @run_folder;
my $cnv_file = $run_folder[-1];
print "The file is $cnv_file\n\n";
my @sequence = qw/
MTOR JAK1 NRAS DDR2 MYCN ALK
IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1
FGFR3 PDGFRA KIT APC FGFR4 ROS1
ESR1 EGFR CDK6 MET SMO BRAF
FGFR1 MYC JAK2 GNAQ RET FGFR2
HRAS CCND1 BIRC2 KRAS ERBB3 CDK4
AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1
GNA11 MAP2K2 JAK3 AR MED12
/;
open my $fh, '<', $cnv_file or die qq{Unable to open "$cnv_file" for input: $!};
my %data;
$data{ (split)[1] } = $_ while <$fh>;
print $data{$_} // "no data for $_\n" for @sequence;
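Note that (split)[1] leans on split's defaults: with no arguments it splits $_ on whitespace, and [1] takes the second field. Spelled out, the read loop above is equivalent to:
# the defaults of the (split)[1] above, written out
while (<$fh>) {
    my $key = ( split ' ', $_ )[1];   # second whitespace-separated field
    $data{$key} = $_;
}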
output
The file is /data/test_all_runs/XXX/CNV.txt
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
no data for NRAS
no data for DDR2
no data for MYCN
no data for ALK
no data for IDH1
no data for ERBB4
no data for RAF1
no data for CTNNB1
no data for PIK3CA
no data for DCUN1D1
no data for FGFR3
no data for PDGFRA
no data for KIT
no data for APC
no data for FGFR4
no data for ROS1
no data for ESR1
no data for EGFR
no data for CDK6
no data for MET
no data for SMO
no data for BRAF
no data for FGFR1
no data for MYC
no data for JAK2
no data for GNAQ
no data for RET
no data for FGFR2
no data for HRAS
no data for CCND1
no data for BIRC2
no data for KRAS
no data for ERBB3
no data for CDK4
no data for AKT1
no data for MAP2K1
no data for IDH2
no data for NF1
no data for ERBB2
no data for BRCA1
no data for GNA11
no data for MAP2K2
no data for JAK3
no data for AR
no data for MED12

Related

Perl: Read columns and convert to array

I am new to Perl and am trying to read a file with columns and create arrays.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, holding the corresponding values from the second column.
e.g.:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, which so far only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
    chomp;
    my $line = $_;
    my @elements = split (" ", $line);
    my $row_name = join "\t", @elements[0,1];
    print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use the empty file input operator (<>), which returns data from the files named in @ARGV without you having to open them explicitly. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. It is far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is a reference to an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
    chomp;
    my ($letter, $number) = split; # splits $_ on whitespace by default
    push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
    say "$_ : @{ $data{$_} }";
}
Change the loop to be something like:
while (my $line = <FILE>)
{
    chomp($line);
    my @elements = split (" ", $line);
    push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create or append to a list for each key as it is found, resulting in a hash where the keys are the left-hand items and the values are lists of the right-hand items. You can then process or reassign the values as you wish.
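To verify the resulting structure, a quick check with the core Data::Dumper module (output shape shown for the sample input above):
use Data::Dumper;
print Dumper(\%seen);

# prints something like:
# $VAR1 = {
#           'A' => [ '15', '20', '33' ],
#           'B' => [ '20', '45' ],
#           'C' => [ '32', '78' ]
#         };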

Sorting 5th column in descending order error message

The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed in with the numbers. I solved this by substituting the value like so: s/%/ %/g for @data;. This made it easier to sort the numbers alone. Afterwards I will change it back the way it was with s/ %/%/g.
After running the script, I received this error:
@ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an @ or $
in a double-quoted string immediately before interpolating a variable,
for example in "user @$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal @
symbol followed by the contents of $twitter_id: "user \@$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my @data;
while ( <$DATA> ) {
    @data = (<$DATA>);
}
s/%/ %/g for @data;
s/---/000/ for @data;
print @data;
my @sorted = sort { $b->[5] <=> $a->[5] } @data;
print @sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
    my $keya = $a =~ /(\d+)%/ ? $1 : '0';
    my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
    $keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6, split would have created 7 columns, because the last field, 'Mounted on', also contains a space and would have been incorrectly split in two.
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above loads the data into the table. The map applies a transformation to each line from <$DATA>: each line is split into an anonymous array (created by [....]). The split is on 2 or more spaces, \s{2,}; if that weren't specified, the data 'snap reserve', which contains a single space, would have been incorrectly split.
I hope this makes what's going on clearer.
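As a small illustration of the limit argument, using the header from the question:
my $hdr = 'Filesystem total used avail capacity Mounted on';

my @seven = split ' ', $hdr;     # 7 fields: 'Mounted' and 'on' come apart
my @six   = split ' ', $hdr, 6;  # 6 fields: the last one is 'Mounted on'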
And a simpler example, which doesn't align the columns like Text::Table but leaves them in the form in which they were originally read, might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
    my $keya = $a =~ /(\d+)%/ ? $1 : '0';
    my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
    $keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my @sorted = sort { $b->[5] <=> $a->[5] } @data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array @data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, though exactly how depends on the details of your data format.
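For instance, a hypothetical version assuming the capacity is the fifth whitespace-separated field and the % has already been stripped, as described in the question:
my @sorted = sort {
    my $capacity_a = (split ' ', $a)[4];  # fifth field of each line
    my $capacity_b = (split ' ', $b)[4];
    $capacity_b <=> $capacity_a;          # descending order
} @data;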
Your code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my @data;
while( my $d = <$fh> ) {
    chomp $d;
    my @fields = split '\s{2,}', $d;
    if( scalar @fields > 4 ) {
        $fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
        push @data, [ @fields ];
    }
}
foreach my $i ( @data ) {
    print join("\t", @$i), "\n";
}
my @sorted = sort { $b->[4] <=> $a->[4] } @data;
foreach my $i ( @sorted ) {
    $i->[4] .= '%';
    print join("\t", @$i), "\n";
}
close($fh);
Let's make a few things clear:
When using the $ notation, it is customary to name filehandle variables in lower case, e.g. $fd or $fh.
You define but do not use the first three variables. If you don't chomp them they will keep their trailing newline; I have not done so since they are not used.
You are building a list with a line in each element. But then you need a list ref inside each element to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and to transform the --- into a 0.
Lines are added to the list @data using push, turning the list of fields into a list ref with [ @fields ].
A list of list refs needs two loops to get printed: one traverses the list (foreach), the other (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is index 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.

Parsing file based on column ID: perl

I have a tab-delimited file with repeated values in the first column. The single, repeated values in the first column correspond to multiple values in the second column. It looks something like this:
AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193
For every type of sequence in the first column, I am trying to print to a file with just the sequences that correspond to it. The name of the file should include the repeated sequence in the first column and the number of sequences that correspond to it in the second column. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:
m081216|101|123
m081216|100|1987
m081216|927|463729
I have seen other similar questions, but they have been answered using a hash. I don't think I can use a plain hash because I need to keep the relationship between the columns. Maybe there is a way to use a hash of hashes? I am not sure.
Here is my code so far.
use warnings;
use strict;
use List::MoreUtils 'true';
open(IN, "<", "/path/to/in_file") or die $!;
my @array;
my $queryID;
while(<IN>){
    chomp;
    my $OutputLine = $_;
    processOutputLine($OutputLine);
}
sub processOutputLine {
    my ($OutputLine) = @_;
    my @Columns = split("\t", $OutputLine);
    my ($queryID, $target) = @Columns;
    push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
    my $delineator = "\n";
    my $count = true { /$delineator/g } @array;
    open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
    foreach(@array){
        print OUT @array;
    }
}
I would still recommend a hash. However, store all sequences related to the same ID in an anonymous array, which is the value for that ID's key. The core of it is really two lines of code.
use warnings;
use strict;
use feature qw(say);
my $filename = 'rep_seqs.txt'; # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";
my %seqs;
foreach my $line (<$in_fh>) {
    chomp $line;
    my ($id, $seq) = split /\t/, $line;
    push @{$seqs{$id}}, $seq;
}
close $in_fh;
my $out_fh;
for (sort keys %seqs) {
    my $outfile = $_ . '_' . scalar @{$seqs{$_}} . '.txt';
    open $out_fh, '>', $outfile or do {
        warn "Can't open $outfile: $!";
        next;
    };
    say $out_fh $_ for @{$seqs{$_}};
}
close $out_fh;
With your input I get the desired files, named AA..._count.txt, with their corresponding three lines each. If the items separated by | should be split further, you can do that while writing out, as sketched below.
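A hypothetical tweak to the say line above, breaking each m081216|101|123 item into tab-separated fields before printing (the | must be escaped in the regex):
# instead of:  say $out_fh $_ for @{$seqs{$_}};
say $out_fh join "\t", split /\|/ for @{$seqs{$_}};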
Comments
The anonymous array for a key $seqs{$id} is created by the push if not already there (autovivification)
If there are issues with tabs (converted to spaces?), split on ' ' instead. See the comment below.
A lexical filehandle is closed implicitly when re-opened, so there is no need to close it on each iteration
The default pattern for split is ' ', which also triggers special behavior -- it matches "any contiguous whitespace" and omits leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See the more precise description on the split page. It is thus advisable to use ' ' when splitting on an unspecified number of spaces, since in the case of split it is a bit idiomatic, is perhaps the most common use, and is its default. Thanks to Borodin for prompting this comment and update (the original post had the equivalent /\s+/).
Note that in this case, since ' ' is the default along with $_, we can shorten it a little
for (<$in_fh>) {
    chomp;
    my ($id, $seq) = split;
    push @{$seqs{$id}}, $seq;
}

parse a huge text file in perl

I have a text file which is tab-separated. The files can be quite big, up to 1 GB. The number of columns varies depending on the number of samples in the file. Each sample has eight columns; for example, sampleA: ID1, ID2, MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A, AR5_A, of which ID1 and ID2 are common to all samples. What I want to achieve is to split the whole file into chunks depending on the number of samples.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks; I want to have the output as:
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487
Is there any easier way of doing this than going through an array?
The logic I have worked out is: (number of header columns - 2) / 8 gives the number of samples in the file; then go through each element in an array and parse them. That seems a tedious way of doing it. I would be happy to know of any simpler way of handling this.
Thanks
Sipra
#!/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
    open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
    chomp;
    my @data = split /,/;
    print {$fh{A}} join(',', @data[0 .. 9]), "\n";
    print {$fh{B}} join(',', @data[0, 1, 10 .. 17]), "\n";
    print {$fh{C}} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my @cols = split /,/, $head;
die 'Invalid number of records - ' . @cols . "\n"
    if (@cols - 2) % 8;
my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
    my %desc;
    $desc{start_col} = (($_ - 1) * 8) + 2;
    $desc{end_col}   = $desc{start_col} + 7;
    open $desc{fh}, '>', 'file' . $name++ or die $!;
    print {$desc{fh}} join(',', @cols[0,1],
                           @cols[$desc{start_col} .. $desc{end_col}]),
                      "\n";
    push @files, \%desc;
}
while (<$in>) {
    chomp;
    my @data = split /,/;
    foreach my $f (@files) {
        print {$f->{fh}} join(',', @data[0,1],
                              @data[$f->{start_col} .. $f->{end_col}]),
                         "\n";
    }
}
This is independent of the number of samples. I'm not confident about the output file names though, because you might have more than 26 samples. Just change how the output file name is generated if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
# I'm lazy
my @source_file = read_file('source_file.csv');
# you mention yours is tab separated:
# just add {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
# read each row
while ( my $raw_line = shift @source_file ) {
    $csv->parse($raw_line);
    my @fields = $csv->fields();
    # get the first 2 ids
    my @ids = splice @fields, 0, 2;
    my $group = 0;
    while (@fields) {
        # get the first 8 columns
        my @columns = splice @fields, 0, 8;
        # to change the separator of the output, replace ',' with "\t"
        push @{ $output_file->[$group] }, (join ',', @ids, @columns), $/;
        $group++;
    }
}
# for filename purposes
my $letter = 65;
foreach my $data (@$output_file) {
    my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
    write_file( $output_filename, @$data );
    $letter++;
}
# if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (@$output_file) {
#    my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
#    write_file( $output_filename, @$data );
#    $sample_number++;
#}
Here is a one-liner to print the first sample; you can wrap the same idea in a short script to write the data for different samples into different files.
perl -F, -lane 'print "@F[0..1] @F[2..9]"' <INPUT_FILE_NAME>
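Or, as a rough pure-Perl sketch of the same slicing idea, writing one output file per sample (the sample count of 3 and the file names are assumptions):
use strict;
use warnings;

my $samples = 3;                 # assumed; derive it from the header in practice
my @fh;
for my $i (0 .. $samples - 1) {
    open $fh[$i], '>', "sample_$i.csv" or die $!;
}
while (<>) {                     # reads the file(s) named on the command line
    chomp;
    my @f = split /,/;
    for my $i (0 .. $samples - 1) {
        my $start = 2 + $i * 8;  # two shared ID columns, then 8 per sample
        print { $fh[$i] } join(',', @f[0, 1, $start .. $start + 7]), "\n";
    }
}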
You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. There's a cache, but I believe it's emptied after a \n is written to the file.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) {  # line from the input file
    chomp $line;
    my @input_line_array = split /\t/, $line;
    foreach my $output_file (@outputFileList) {  # list of output files
        my $fileHandle = $output_file->{FILE_HANDLE};
        my @fieldsToWrite;
        foreach my $fieldNumber (@{$output_file->{FIELD_LIST}}) {
            push @fieldsToWrite, $input_line_array[$fieldNumber];
        }
        say $fileHandle join "\t", @fieldsToWrite;
    }
}
I'm reading one line of the input file into $line and dividing it up into fields, which I put into @input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called @outputFileList that holds all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for output file $fileNumber. $outputFileList[$fileNumber]->{FIELD_LIST} is a list of the fields I want to write to output file $fileNumber, indexed against the fields in @input_line_array. So if
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
it means that I want to write $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to the handle in $outputFileList[$fileNumber]->{FILE_HANDLE}, in that order, as a tab-separated list.
I hope this is making some sense.
The initial problem is reading the first line from <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea of how this structure needs to be stored, parsing that first line shouldn't be too much of an issue; a sketch follows.
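For instance, a hypothetical header parser for the two-ID-columns-plus-8-per-sample layout described in the question (the output file names are made up):
chomp(my $header = <$input_fh>);
my @columns = split /\t/, $header;
my $samples = (@columns - 2) / 8;   # per the layout described in the question

my @outputFileList;
for my $i (0 .. $samples - 1) {
    my $start = 2 + $i * 8;
    open my $fh, '>', "sample_$i.txt" or die $!;   # hypothetical file name
    push @outputFileList, {
        FILE_HANDLE => $fh,
        FIELD_LIST  => [0, 1, $start .. $start + 7],
    };
}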
Although I didn't use object-oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post), I would definitely take an object-oriented approach here. It will actually make development much faster by reducing errors.

I've unpacked it...how can I count the columns?

I am using unpack to parse some text files with columns. Each text file is different and has a different number of columns. How can I count the columns so I don't get errors? Right now I am using 0..5, but if the text file has 3 columns then I get an error: "Use of uninitialized value in substitution...". Thx!
open (PARSE,"<$temp") or die $!;
my @template = map {'A'.length} <PARSE> =~ /(\S+\s*)/g;
next unless @template;
$template[-1] = 'A*'; # set the last segment to be slurpy
my $template = "@template";
my @data;
while (<PARSE>) {
    push @data, [unpack $template, $_];
}
for my $dat (@data){ # for each row
    for(0..5){ # for each column in that row
        $dat->[$_]=~s/^\s+//g;
        $dat->[$_]=~s/\s+$//g;
        print $dat->[$_].',';
    }
    print "\n";
}
With languages like Perl, Python, Ruby, etc., you rarely need to stoop to the level of subscripts when iterating over an array:
for my $cell (@$dat){
    # Work with $cell rather than $dat->[$_].
    ...
}
It would probably be easier and cleaner to use Tie::File, so that you don't have to read everything into memory, but here's one way that uses the @data list you set up:
my $dataFirstLine = $data[0];
chomp($dataFirstLine);
my @dataColumns = split("\t", $dataFirstLine); # assumes tab delimiter; replace with escaped delimiter of choice
my $dataColumnCount = scalar @dataColumns;
print "number of columns: $dataColumnCount\n";