I am new to Perl, trying to read a file with columns and create arrays from it. I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, with its values taken from the second column, e.g.:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, which so far only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my @elements = split (" ", $line);
my $row_name = join "\t", @elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use an empty file input operator (<>) which will return data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : #{ $data{$_ } }";
}
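With the sample data above, that walk prints:
A : 15 20 33
B : 20 45
C : 32 78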
Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my @elements = split (" ", $line);
push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.
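For example, a small sketch of walking the result afterwards, reusing the %seen hash from the loop above:
for my $key (sort keys %seen) {
print "$key: @{ $seen{$key} }\n"; # e.g. "A: 15 20 33"
}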
The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed in with the numbers. I solved this by substituting the value like so: s/%/ %/g for @data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
@ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by a @ or $
in a double-quoted string immediately before interpolating a variable,
for example in "user @$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal @
symbol followed by the contents of $twitter_id: "user \@$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my @data;
while ( <$DATA> ) {
@data = (<$DATA>);
}
s/%/ %/g for @data;
s/---/000/ for @data;
print @data;
my @sorted = sort { $b->[5] <=> $a->[5] } @data;
print @sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6, split would have produced 7 columns, because the last heading, 'Mounted on', contains a space and would itself have been split in two.
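A quick illustration of the limit argument, using the header line from the sample data:
my $hdr = 'Filesystem total used avail capacity Mounted on';
my @six = split ' ', $hdr, 6; # 6 fields; the last is 'Mounted on'
my @seven = split ' ', $hdr; # 7 fields; 'Mounted' and 'on' come apart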
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above loads the data into the table. The map applies a transformation to each line from <$DATA>: each line is split into an anonymous array (created by [....]). The split is on 2 or more spaces, \s{2,}; if that weren't specified, the data 'snap reserve', which contains a single space, would have been incorrectly split.
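For example, on the 'snap reserve' line (the double spacing here is illustrative, standing in for the aligned columns of the real file):
my $line = 'snap reserve  0TB  0TB  0TB  ---%  /vol/vfiler_PROD1_SF_isci_15K01/..';
my @fields = split /\s{2,}/, $line;
# ('snap reserve', '0TB', '0TB', '0TB', '---%', '/vol/vfiler_PROD1_SF_isci_15K01/..')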
I hope this makes what's going on clearer.
And here is a simpler example that doesn't align the columns like Text::Table, but leaves them in the form in which they were originally read:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my @sorted = sort { $b->[5] <=> $a->[5] } @data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array @data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that.
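That said, here is a plausible sketch, assuming (as the other answers here do) that the capacity is the only digits-plus-% token on each line:
my @sorted = sort {
my ($pa) = $a =~ /(\d+)%/; # undef for the '---%' line
my ($pb) = $b =~ /(\d+)%/;
($pb // 0) <=> ($pa // 0); # descending; // needs perl 5.10+
} @data;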
Your code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my @data;
while( my $d = <$fh> ) {
chomp $d;
my @fields = split '\s{2,}', $d;
if( scalar @fields > 4 ) {
$fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
push @data, [ @fields ];
}
}
foreach my $i ( @data ) {
print join("\t", @$i), "\n";
}
my @sorted = sort { $b->[4] <=> $a->[4] } @data;
foreach my $i ( @sorted ) {
$i->[4] .= '%';
print join("\t", @$i), "\n";
}
close($fh);
Let's make a few things clear:
When using the $ notation, it is customary to use lower-case names for file variables, such as $fh or $fd (short for "file descriptor").
You define but do not use the first three variables. If you don't apply chomp to them, they keep their trailing newline; I have not done so since they are unused.
You are building a list with one line in each element, but you then need a list ref inside each element to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to the list @data using push, turning the list to add into a list ref with [ @list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.
I have a tab-delimited file with repeated values in the first column. Each repeated value in the first column corresponds to multiple values in the second column. It looks something like this:
AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193
For every type of sequence in the first column, I am trying to print to a file with just the sequences that correspond to it. The name of the file should include the repeated sequence in the first column and the number of sequences that correspond to it in the second column. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:
m081216|101|123
m081216|100|1987
m081216|927|463729
I have seen other similar questions, but they have been answered using a hash. I don't think I can use a hash, because I need to keep the relationships between the columns. Maybe there is a way to use a hash of hashes? I am not sure.
Here is my code so far.
use warnings;
use strict;
use List::MoreUtils 'true';
open(IN, "<", "/path/to/in_file") or die $!;
my @array;
my $queryID;
while(<IN>){
chomp;
my $OutputLine = $_;
processOutputLine($OutputLine);
}
sub processOutputLine {
my ($OutputLine) = @_;
my @Columns = split("\t", $OutputLine);
my ($queryID, $target) = @Columns;
push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
my $delineator = "\n";
my $count = true { /$delineator/g } @array;
open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
foreach(@array){
print OUT @array;
}
}
I would still recommend a hash: you store all sequences related to the same ID in an anonymous array, which is the value for that ID's key. It's really two lines of code.
use warnings;
use strict;
use feature qw(say);
my $filename = 'rep_seqs.txt'; # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";
my %seqs;
foreach my $line (<$in_fh>) {
chomp $line;
my ($id, $seq) = split /\t/, $line;
push @{$seqs{$id}}, $seq;
}
close $in_fh;
my $out_fh;
for (sort keys %seqs) {
my $outfile = $_ . '_' . scalar @{$seqs{$_}} . '.txt';
open $out_fh, '>', $outfile or do {
warn "Can't open $outfile: $!";
next;
};
say $out_fh $_ for @{$seqs{$_}};
}
close $out_fh;
With your input I get the desired files, named AA..._count.txt, each with its corresponding three lines. If items separated by | should be split, you can do that while writing out, for example.
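A minimal sketch of that, as a drop-in replacement for the say line inside the loop above (tab as a hypothetical output separator):
for my $seq (@{$seqs{$_}}) {
say $out_fh join "\t", split /\|/, $seq; # m081216|101|123 becomes three columns
}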
Comments
The anonymous array for a key $seqs{$id} is created by the push if it isn't there already (autovivification)
If there are issues with tabs (converted to spaces?), use ' ' instead. See the note on split below.
Re-opening the $out_fh filehandle implicitly closes the previous file first, so there is no need to close it on every iteration
The default pattern for split is ' ', which triggers special behavior: it matches any contiguous whitespace and also omits leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See the precise description on the split page. It is thus advisable to use ' ' when splitting on an unspecified number of spaces; in split this is a bit idiomatic, perhaps the most common use, and it is the default. Thanks to Borodin for prompting this comment and update (the original post had the near-equivalent /\s+/).
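A quick illustration of the difference:
my $s = '  a  b c';
my @w1 = split ' ', $s; # ('a', 'b', 'c') -- leading whitespace skipped
my @w2 = split / /, $s; # ('', '', 'a', '', 'b', 'c') -- one field per single space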
Note that in this case, since ' ' is the default along with $_, we can shorten it a little
for (<$in_fh>) {
chomp;
my ($id, $seq) = split;
push @{$seqs{$id}}, $seq;
}
I have a text file which is tab-separated. It can be quite big, up to 1 GB. The number of columns varies with the number of samples in the file: each sample has eight columns (for sample A, MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A and AR5_A), while ID1 and ID2 are common to all the samples. What I want to achieve is to split the whole file into chunks of files depending on the number of samples.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks; I want to have it split as:
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487
Is there any easier way of doing this than going through an array?
The logic I have worked out is: counting the header columns, subtracting 2, and dividing by 8 gives the number of samples in the file (here, (26 - 2) / 8 = 3); then I would go through each element in an array and parse them. That seems a tedious way of doing it, so I would be happy to know of any simpler way of handling this.
Thanks
Sipra
#!/usr/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
chomp;
my @data = split /,/;
print $fh{A} join(',', @data[0 .. 9]), "\n";
print $fh{B} join(',', @data[0, 1, 10 .. 17]), "\n";
print $fh{C} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. I don't have time to explain it fully, but the comments should help.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my @cols = split /,/, $head;
die 'Invalid number of records - ' . @cols . "\n"
if (@cols - 2) % 8;
# build one descriptor per sample: an output filehandle
# plus the range of columns belonging to that sample
my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
my %desc;
$desc{start_col} = (($_ - 1) * 8) + 2; # first column after the two IDs
$desc{end_col} = $desc{start_col} + 7; # eight columns per sample
open $desc{fh}, '>', 'file' . $name++ or die $!; # fileA, fileB, ...
# write this file's header: the ID columns plus its own eight
print {$desc{fh}} join(',', @cols[0,1],
@cols[$desc{start_col} .. $desc{end_col}]),
"\n";
push @files, \%desc;
}
# for each data row, write the shared IDs plus each sample's own columns
while (<$in>) {
chomp;
my @data = split /,/;
foreach my $f (@files) {
print {$f->{fh}} join(',', @data[0,1],
@data[$f->{start_col} .. $f->{end_col}]),
"\n";
}
}
This is independent of the number of samples. I'm not confident about the output file names, though, because you might have more than 26 samples. Just replace how the output file name is generated if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my @source_file = read_file('source_file.csv');
# you mention yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
#read each row
while ( my $raw_line = shift @source_file ) {
$csv->parse($raw_line);
my @fields = $csv->fields();
#get the first 2 ids
my @ids = splice @fields, 0, 2;
my $group = 0;
while (@fields) {
#get the first 8 columns
my @columns = splice @fields, 0, 8;
#if you want to change the separator of the output replace ',' with "\t"
push @{ $output_file->[$group] }, (join ',', @ids, @columns), $/;
$group++;
}
}
#for filename purposes
my $letter = 65;
foreach my $data (@$output_file) {
my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
write_file( $output_filename, @$data );
$letter++;
}
#if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (@$output_file) {
# my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
# write_file( $output_filename, @$data );
# $sample_number++;
#}
Here is a one-liner to print the first sample; you can write a shell script to write the data for the different samples into different files:
perl -F, -lane 'print "@F[0..1] @F[2..9]"' <INPUT_FILE_NAME>
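For example, a hypothetical shell script along those lines (the input file name and the comma output separator are assumptions; the column ranges follow the three-sample layout above):
perl -F, -lane 'print join ",", @F[0..1, 2..9]' input.csv > fileA
perl -F, -lane 'print join ",", @F[0..1, 10..17]' input.csv > fileB
perl -F, -lane 'print join ",", @F[0..1, 18..25]' input.csv > fileC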
You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read a file with a while loop, you're not reading the whole file into memory at once. If you read each line, parse it, then write it to the various output files, you're not taking up a lot of memory. Output is buffered, but the buffers are flushed regularly as you write.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) { # Line from the input file.
chomp $line;
my @input_line_array = split /\t/, $line;
foreach my $output_file (@outputFileList) { # List of output files.
my $fileHandle = $output_file->{FILE_HANDLE};
my @fieldsToWrite;
foreach my $fieldNumber (@{$output_file->{FIELD_LIST}}) {
push @fieldsToWrite, $input_line_array[$fieldNumber];
}
say $fileHandle join "\t", @fieldsToWrite; # assumes use feature 'say';
}
}
I'm reading one line of the input file into $line and dividing it up into fields, which I put in @input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called @outputFileList that is a list of all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the filehandle for output file $fileNumber. $outputFileList[$fileNumber]->{FIELD_LIST} is a list of the fields I want to write to output file $fileNumber, indexed to the fields in @input_line_array. So if
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
that means I want to write $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8], in that order, as a tab-separated list, to the filehandle $outputFileList[$fileNumber]->{FILE_HANDLE}.
I hope this is making some sense.
The initial problem is reading in the first line from <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea of how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
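For what it's worth, a hedged sketch of that first-line parsing, assuming (as in the question) two shared ID columns followed by eight-column sample groups; the output file names are hypothetical:
chomp(my $header = <$input_fh>);
my @col_names = split /\t/, $header;
my @outputFileList;
my $sample = 'A'; # string increment: 'A'++ gives 'B', and so on
for (my $start = 2; $start < @col_names; $start += 8) {
open my $fh, '>', "file$sample.txt" or die $!;
say $fh join "\t", @col_names[0, 1, $start .. $start + 7]; # per-file header
push @outputFileList, {
FILE_HANDLE => $fh,
FIELD_LIST => [ 0, 1, $start .. $start + 7 ],
};
$sample++;
}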
Although I didn't use object-oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post), I would definitely use an object-oriented approach here. It will actually make development much faster by preventing errors.