Fast way to compare rows in a dataset - Perl

I asked this question in R and got a lot of answers, but all of them either crash my 4 GB RAM computer after a few hours of running or take a very long time to finish.
faster way to compare rows in a data frame
Some people said that it's not a job to be done in R. As I don't know C and I'm a little bit fluent in Perl, I'll ask here.
I'd like to know if there is a fast way to compare each row of a large dataset with the other rows, identifying the rows with a specific degree of homology. Let's say for the simple example below that I want homology >= 3.
data:
sample_1,10,11,10,13
sample_2,10,11,10,14
sample_3,10,10,8,12
sample_4,10,11,10,13
sample_5,13,13,10,13
The output should be something like:
output
sample duplicate matches
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3

Matches are counted when both lines have the same numbers in the same positions:
perl -F',' -lane'
    $k = shift @F;
    for my $kk (@o) {
        $m = grep { $h{$kk}[$_] == $F[$_] } 0 .. $#F;
        $m >= 3 or next;
        print ++$i, " $kk $k $m";
    }
    push @o, $k;
    $h{$k} = [ @F ];
' file
output:
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3

This solution provides an alternative to direct pairwise comparison, which will be slow for large amounts of data.
Basic idea is to build an inverted index while reading the data.
This makes comparison faster if there are a lot of different values per column.
For each row, you look up the index and count the matches - this way you only consider the samples where this value actually occurs.
You might still have a memory problem because the index gets as large as your data.
To overcome that, you can shorten the sample name and use a persistent index (using DB_File, for example).
use strict;
use warnings;
use 5.010;

my @h;
my $LIMIT_HOMOLOGY = 3;

while (my $line = <>) {
    chomp $line;
    my @arr = split /,/, $line;
    my $sample_no = shift @arr;
    my %sim;
    foreach my $i (0 .. $#arr) {
        my $value = $arr[$i];
        our $l;
        *l = \$h[$i]->{$value};   # alias $l to the index slot for this column/value
        foreach my $s (@$l) {
            $sim{$s}++;
        }
        push @$l, $sample_no;
    }
    foreach my $s (keys %sim) {
        if ($sim{$s} >= $LIMIT_HOMOLOGY) {
            say "$sample_no: $s. Matches: $sim{$s}";
        }
    }
}
For 25,000 rows of 26 columns with random integer values between 1 and 100, the program took 69 seconds to finish on my MacBook Air.
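As a rough sketch of the persistent-index suggestion above (my own illustration; DB_File is the module mentioned earlier, while the file name inverted_index.db and the "column:value" key scheme are assumptions), the inverted index can be tied to a disk file so it does not have to fit in RAM:
use strict;
use warnings;
use 5.010;
use DB_File;
use Fcntl qw( O_RDWR O_CREAT );

my $LIMIT_HOMOLOGY = 3;

# tie the inverted index to a disk file so it does not have to fit in RAM
tie my %index, 'DB_File', 'inverted_index.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie index file: $!";

while (my $line = <>) {
    chomp $line;
    my ($sample_no, @values) = split /,/, $line;
    my %sim;
    foreach my $i (0 .. $#values) {
        my $key = "$i:$values[$i]";                      # column number plus value
        my @samples = split /,/, ($index{$key} // '');
        $sim{$_}++ for @samples;
        $index{$key} = join ',', @samples, $sample_no;   # one write back per column
    }
    foreach my $s (grep { $sim{$_} >= $LIMIT_HOMOLOGY } keys %sim) {
        say "$sample_no: $s. Matches: $sim{$s}";
    }
}
untie %index;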

Related

Count subsequences in hundreds of GB of data

I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.
To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb
Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:
abcde => 1
Next line, it moves one character to the right and does the same:
bcdef => 1
It then continues for the rest of the string and prints the 5 most frequent sequences:
open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'
my $seq = <$in>; # read whole file into string
my $len = length($seq);
my $seq_length = 5; # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
    my $kmer = substr($seq, $i, $seq_length);
    $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}
ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1
However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.
I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.
My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.
Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate it here.
Is substr the most memory efficient tool for this task?
From your own code it's looking like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if it's possible that the line has one newline character at the end, the selection of the five most frequent subsequences at the end will throw this out as it happens only once
This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory
The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer
After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file
I've also used constants for the K-mer size and the chunk size. They are constant after all!
The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort
use strict;
use warnings 'all';

use constant SEQ_LENGTH => 5;           # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB

my $in_file = shift // 'in.txt';
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};

my %data;
my $chunk;
my $length = 0;

while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
    $length += $size;
    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
        my $kmer = substr $chunk, $offset, SEQ_LENGTH;
        ++$data{$kmer};
    }
    $chunk = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}

my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];
output
ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1
Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.
The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
Even if you don't read the entire file into memory before processing it, you could still run out of memory.
A 10 GiB file contains almost 11E9 sequences.
If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.
If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 ≈ 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.
In that case, I suggest doing the following:
Create a file that contains all the sequences of the original file.
The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.
perl -e'
    use strict;
    use warnings qw( all );

    use constant SEQ_LENGTH => 20;
    use constant CHUNK_SIZE => 1024 * 1024;

    my $buf = "";
    while (1) {
        my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
        die($!) if !defined($size);
        last if !$size;

        for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
            print(substr($buf, $offset, SEQ_LENGTH), "\n");
        }

        substr($buf, 0, -(SEQ_LENGTH-1), "");
    }
' <in.txt >sequences.txt
Sort the sequences.
sort sequences.txt >sorted_sequences.txt
Count the number of instances of each sequence, and store the count along with the sequence in another file.
perl -e'
    use strict;
    use warnings qw( all );

    my $last = "";
    my $count;
    while (<>) {
        chomp;
        if ($_ eq $last) {
            ++$count;
        } else {
            print("$count $last\n") if $count;
            $last = $_;
            $count = 1;
        }
    }
    print("$count $last\n") if $count;   # do not forget the final group
' sorted_sequences.txt >counted_sequences.txt
Sort the sequences by count.
sort -rns counted_sequences.txt >sorted_counted_sequences.txt
Extract the results.
perl -e'
    use strict;
    use warnings qw( all );

    my $last_count;
    while (<>) {
        my ($count, $seq) = split;
        last if $. > 5 && $count != $last_count;
        print("$seq $count\n");
        $last_count = $count;
    }
' sorted_counted_sequences.txt
This also prints ties for 5th place.
This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.
sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
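To illustrate that point, here is a rough benchmark sketch of my own (large.dat is a placeholder for a suitably large test file; the actual numbers will depend on your perl build and your disk):
use strict;
use warnings;
use Benchmark qw( cmpthese );

use constant CHUNK_SIZE => 1024 * 1024;

cmpthese( 10, {
    'read' => sub {
        open my $fh, '<:raw', 'large.dat' or die $!;
        my $buf;
        1 while read $fh, $buf, CHUNK_SIZE;      # buffered reads via PerlIO
    },
    'sysread' => sub {
        open my $fh, '<:raw', 'large.dat' or die $!;
        my $buf;
        1 while sysread $fh, $buf, CHUNK_SIZE;   # unbuffered reads straight from the OS
    },
} );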
Given the fixed-length nature of the sequence, you could also compress the sequences into ceil(log_256(5^20)) = 6 bytes, then base64-encode them into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
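As a rough sketch of that packing idea (my own illustration; it assumes a 64-bit perl for pack 'Q>' and the five-letter alphabet [GCNTA] used for the nucleotide data later in this post):
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 );

# map the five-letter alphabet to base-5 digits
my %digit = ( G => 0, C => 1, N => 2, T => 3, A => 4 );

sub compress_seq {
    my ($seq) = @_;                               # a 20-character sequence
    my $n = 0;
    $n = $n * 5 + $digit{$_} for split //, $seq;  # 5**20 < 2**48, so 6 bytes suffice
    return substr pack('Q>', $n), 2;              # pack as 8 bytes big-endian, keep the low 6
}

my $packed = compress_seq('GCNTAGCNTAGCNTAGCNTA');
print length($packed), "\n";                      # 6
print encode_base64($packed, ''), "\n";           # 8 base64 characters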
Portions of this answer were adapted from content by user:622310 licensed under cc by-sa 3.0.
Generally speaking, Perl is really slow at character-by-character processing solutions like those posted above; it's much faster at something like regular expressions, since essentially your overhead is mainly how many operators you're executing.
So if you can turn this into a regex-based solution that's much better.
Here's an attempt to do that:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
3 ebfeb
2 febfe
2 bfebf
1 gbacb
1 fabcg
I.e. we have our string in $str, and we pass over it 5 times, generating sequences of 5 characters each time; after the first pass we start chopping a character off the front of the string. In a lot of languages this would be really slow, since you'd have to re-allocate the entire string, but perl cheats for this special case and just sets the internal offset of the string to 1 + what it was before.
I haven't benchmarked this, but I bet something like this is a much more viable way to do it than the algorithms above. You could also do the uniq counting in perl, of course, by incrementing a hash (doing that inside the regex with the /e option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.
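To see that offset trick at work (a small demonstration of my own, on a perl where this optimisation applies), Devel::Peek shows the string's internal offset growing instead of the buffer being reallocated:
use strict;
use warnings;
use Devel::Peek;

my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
$str =~ s/^.//;   # chop one character off the front
Dump($str);       # the OOK flag in the FLAGS line shows the offset trick instead of a copy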
A slightly altered implementation that does this all in perl:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca
Pretty formatting for the code behind that:
my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;
for my $pos (0..4) {
    substr($str, 0, 1) = "" if $pos;
    $occur{$_}++ for $str =~ m/(.{5})/gs;
}
for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
    say "$occur{$k} $k";
}
The most straightforward approach is to use the substr() function:
% time perl -e '$/ = \1048576;
while ($s = <>) { for $i (0..length $s) {
$hash{ substr($s, $i, 5) }++ } }
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
269.79 real 268.92 user 0.66 sys
The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @Ævar Arnfjörð Bjarmason, @Vorsprung, @ThisSuitIsBlackNot, @Borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.
As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @Borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.
Using Stream::Reader
Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.
use Stream::Reader;
use strict;
use warnings;

open( my $handler, "<", shift );
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } );

my %hash;
my $string;
while ( $stream->readto("\n", { Out => \$string }) ) {
    foreach my $i (0..length $string) {
        $hash{ substr($string, $i, 5) }++
    }
}

my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
    print "$k $hash{$k}\n";
    $it++; last if $it == 5;
}
On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.
NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210
The Stream::Reader based script was significantly faster:
time perl sequence_search_stream-reader.pl nucleotide.data
252.12s
time perl sequence_search_borodin.pl nucleotide.data
350.57s
The file nucleotide.data was 1 GB in size, consisting of a single string of approximately 1 billion characters:
% wc nucleotide.data
0 0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA
I used this command to create the file:
perl -MString::Random=random_regex -e '
    open (my $fh, ">>", "nucleotide.data");
    for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'
Lists and Strings
Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.
I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:
use List::Gen 'slide';

$/ = \1048575; # Read a million characters/bytes at a time.
my %hash;

while (my $seq = <>) {
    chomp $seq;
    foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
        next unless length $kmer == 5;
        $hash{$kmer}++;
    }
}

foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
    print "$k $hash{$k}\n";
    $it++; last if $it == 5;
}
I'm not sure this is "typical perl" (TIMTOWDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100 GB).
Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000 GB string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.
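As a minimal sketch of that last idea (my own addition, reusing the DB_File module suggested in the first answer above; the file name kmer_counts.db is a placeholder), the count hash itself can be tied to a disk file so it never has to fit entirely in memory:
use strict;
use warnings;
use DB_File;
use Fcntl qw( O_RDWR O_CREAT );

# keep the k-mer counts in a B-tree file on disk instead of an in-memory hash
tie my %hash, 'DB_File', 'kmer_counts.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie count file: $!";

$hash{'ebfeb'}++;             # increments are written through to the file
print "$hash{'ebfeb'}\n";

untie %hash;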
Postscript and Results
The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:
1 million character nucleotide string
sequence_search_stream-reader.pl : 0.26s
sequence_search_borodin.pl : 0.39s
sequence_search_listgen.pl : 2.04s
83 million character nucleotide string
With the data in file xaa:
wc xaa
0 1 83886080 xaa
% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
21.33 real 20.95 user 0.35 sys
% time perl sequence_search_borodin.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
28.13 real 28.08 user 0.03 sys
% time perl sequence_search_listgen.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
157.54 real 156.93 user 0.45 sys
1 billion character nucleotide string
In the larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command at the beginning of this post. The larger file meant more chunk boundaries and hence a larger discrepancy in the count.
sequence_search_stream-reader.pl : 252.12s
sequence_search_borodin.pl : 350.57s
sequence_search_listgen.pl : 1928.34s
The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.
I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.
1. Skipping four characters - and thus four 5-character string combinations - at the end of each megabyte read of a terabyte file would mean the results would miss roughly 4/10000ths of 1% of the final count.
echo "scale=10; 100 * (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697

Binary search—Can't use string "1" as a symbol ref while strict refs is in use

I've been browsing over the already answered questions regarding this error message.
I am trying to solve a problem from the Rosalind web site that looks for some indexes using a binary search.
When my subroutine finds the number it seems to ignore it, and if I try to print the $found variable, it gives me the error
Can't use string "1" as a symbol ref while strict refs is in use
The code is this
sub binarysearch
{
    my $numbertolook = shift;
    my @intarray = @_;
    my $lengthint = scalar @intarray;
    my @sorted = sort {$a <=> $b} @intarray;
    #print $numbertolook, " " , @sorted, "\n";
    my $low = 0;
    my $high = $lengthint - 1;
    my $found = undef;
    my $midpoint;
    while ($low < $high)
    {
        $midpoint = int(($low+$high)/2);
        #print $midpoint, " ", $low, " ", $high, " ", @sorted, "\n";
        if ($numbertolook < $sorted[$midpoint])
        {
            $high = $midpoint;
        }
        elsif ($numbertolook > $sorted[$midpoint])
        {
            $low = $midpoint;
        }
        elsif ($numbertolook == $sorted[$midpoint])
        {
            $found = 1;
            print $found "\n";
            last;
        }
        if ($low == $high-1 and $low == $midpoint)
        {
            if ($numbertolook == $sorted[$high])
            {
                $found = 1;
                print $found "\n";
                last;
            }
            $low = $high;
        }
    }
    return $found;
}
You want
print $found, "\n";
Or
print $found . "\n";
With no operator between $found and the newline, it thinks $found is the filehandle to print a newline to, and is getting an error because it isn't a filehandle.
I'll try to help
First of all, as simple as it may seem, a binary search is quite difficult to code correctly. The main reason is that it's a hotbed of off-by-one errors, which are so prevalent that they have their own Wikipedia page
The issue is that an array containing, say, the values A to Z will have 26 elements with indices 0 to 25. I think Fortran bucks the trend, as does Lua, but pretty much every other language has the first element of an array at index zero
A zero base works pretty well for everything until you start using divide and conquer algorithms. Merge Sort as well as Binary Search are such algorithms. Binary search goes
Is it in the first half?
If so then search the first half further
Else search the second half further
The hard part is when you have to decide when you've found the object, or when you need to give up looking. Splitting the data into two nearly-equal halves is easy. Knowing when to stop is hard
It's highly efficient for sorted data, but the problem comes when implementing it that, if we do it properly, we have to deal with all sorts of weird index bases beyond zero or one.
Suppose I have an array
my @alpha = 'A' .. 'Q'
If I print scalar @alpha I will see 17, meaning the array has seventeen elements, indexed from 0 to 16
Now I'm looking for E in that array, so I do a binary search, so I want the "first half" and the "second half" of @alpha. If I add 0 to 16 and divide by 2 I get a neat "8", so the middle element is at index 8, which is H
But wait. There are 17 elements, which is an odd number, so if we say the first eight (A .. H) are left of the middle and the last eight (I .. Q) are right of the middle then surely the "middle" is I?
In truth this is all a deception, because a binary search will work however we partition the data. In this case binary means two parts, and although the search would be more efficient if those parts could be equal in size it's not necessary for the algorithm to work. So it can be the first third and the last two-thirds, or just the first element and the rest
That's why using int(($low+$high)/2) is fine. It rounds down to the nearest integer so that with our 17-element array $mid is a usable 8 instead of 8.5
But your code still has to account for some unexpected things. In the case of our 17-element array we have calculated the middle index to be 8. So indexes 0 .. 7 are the "first half" while 8 .. 16 are the "second half", and the middle index is where the second half starts
But didn't we round the division down? So in the case of an odd number of elements, shouldn't our mid point be at the end of the first half, and not the start of the second? This is an arcane off-by-one error, but let's see if it still works with a simple even number of elements
@alpha = 'A' .. 'D'
The start and end indices are 0 and 3; the middle index is int((0+3)/2) == 1. So the first half is 0 .. 1 and the second half is 2 .. 3. That works fine
But there's still a lot more. Say I have to search an array with two elements X and Y. That has two clear halves, and I'm looking for A, which is before the middle. So I now search the one-element list X for A. The minimum and maximum elements of the target array are both zero. The mid-point is int((0+0)/2) == 0. So what happens next?
It is similar but rather worse when we're searching for Z in the same list. The code has to be exactly right, otherwise we will be either searching off the end of the array or checking the last element again and again
Saving the worst for last, suppose
my @alpha = ( 'A', 'B', 'Y', 'Z' )
and I'm looking for M. That lets loose all sorts of optimisations that involve checks that may make the ordinary case much slower
Because of all of this it's by far the best solution to use a library or a language's built-in function to do all of this. In particular, Perl's hashes are usually all you need to check for specific strings and any associated data. The algorithm used is vastly better than a binary search for any non-trivial data sets
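As a minimal sketch of that hash suggestion (my own illustration, using the same 10 .. 99 list as the example further below), membership becomes a single lookup:
use strict;
use warnings;

my @intarray = ( 10 .. 99 );
my %seen = map { $_ => 1 } @intarray;    # build the lookup table once

print $seen{76} ? "FOUND\n" : "not found\n";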
Wikipedia shows this algorithm for an iterative binary search
The binary search algorithm can also be expressed iteratively with two index limits that progressively narrow the search range.
int binary_search(int A[], int key, int imin, int imax)
{
    // continue searching while [imin,imax] is not empty
    while (imin <= imax)
    {
        // calculate the midpoint for roughly equal partition
        int imid = midpoint(imin, imax);
        if (A[imid] == key)
            // key found at index imid
            return imid;
        // determine which subarray to search
        else if (A[imid] < key)
            // change min index to search upper subarray
            imin = imid + 1;
        else
            // change max index to search lower subarray
            imax = imid - 1;
    }
    // key was not found
    return KEY_NOT_FOUND;
}
And here is a version of your code that is far from bug-free but does what you intended. You weren't so far off
use strict;
use warnings 'all';

print binarysearch( 76, 10 .. 99 ), "\n";

sub binarysearch {
    my $numbertolook = shift;
    my @intarray = @_;
    my $lengthint = scalar @intarray;
    my @sorted = sort { $a <=> $b } @intarray;
    my $low  = 0;
    my $high = $lengthint - 1;
    my $found = undef;
    my $midpoint;
    while ( $low < $high ) {
        $midpoint = int( ( $low + $high ) / 2 );
        #print $midpoint, " ", $low, " ", $high, " ", @sorted, "\n";
        if ( $numbertolook < $sorted[$midpoint] ) {
            $high = $midpoint;
        }
        elsif ( $numbertolook > $sorted[$midpoint] ) {
            $low = $midpoint;
        }
        elsif ( $numbertolook == $sorted[$midpoint] ) {
            $found = 1;
            print "FOUND\n";
            return $midpoint;
        }
        if ( $low == $high - 1 and $low == $midpoint ) {
            if ( $numbertolook == $sorted[$high] ) {
                $found = 1;
                print "FOUND\n";
                return $midpoint;
            }
            return;
        }
    }
    return $midpoint;
}
output
FOUND
66
If you call print with several parameters separated by a space, print expects the first one to be a filehandle. This is interpreted as print FILEHANDLE LIST in the documentation.
print $found "\n";
What you want to do is either to separate the arguments with a comma, to call it as print LIST.
print $found, "\n";
or to concatenate them as strings, which will also call it as print LIST, but with only one element in LIST.
print $found . "\n";

Merge lines and do operations if a condition is satisfied

I'm new to Perl and I would like to read a table and sum some values from specific lines. This is a simplified example of my input file:
INPUT :
Gene Size Feature
GeneA 1200 Intron 1
GeneB 100 Intron 1
GeneB 200 Intron 1
GeneB 150 Intron 2
GeneC 300 Intron 5
OUTPUT :
GeneA 1200 Intron 1
GeneB 300 Intron 1 <-- the size values are summed
GeneB 150 Intron 2
GeneC 300 Intron 5
Because Gene B is present for intron 1 with two different sizes, I would like to sum these two values and print only one line per intron number.
This is an example of the code that I want to write. But I would like to make it more elaborate once I understand how to handle this kind of data.
#!/usr/bin/perl
use strict;
use warnings;

my $sum;
my @GAP_list;
my $prevline = 'na';

open INFILE, "Table.csv";

while (my $ligne = <INFILE>)
{
    chomp ($ligne);
    my @list = split /\t/, $ligne;
    my $gene = $list[0];
    my $GAP_size = $list[2];
    my $intron = $list[3];
    my $intron_number = $list[4];
    if ($prevline eq 'na') {
        push @GAP_list, $GAP_size;
    }
    elsif ($prevline ne 'na') {
        my @list_p = split /\t/, $prevline;
        my $gene_p = $list_p[0];
        my $GAP_size_p = $list_p[2];
        my $intron_p = $list_p[3];
        my $intron_number_p = $list_p[4];
        if (($gene eq $gene_p) && ($intron eq $intron_p) && ($intron_number eq $intron_number_p)) {
            push @GAP_list, $GAP_size;
        }
    }
    else {
        $sum = doSum(@GAP_list);
        print "$gene\tGAP\t$GAP_size\t$intron\t$intron_number\t$sum\n";
        $prevline = $ligne;
    }
}

# Subroutine
sub doSum {
    my $sum = 0;
    foreach my $x (@_) {
        $sum += $x;
    }
    return $sum;
}
Assuming the fields are separated by tabs, the following strategy would work. It buffers the last line, either adding up if the other fields are equal, or printing the old data and then replacing the buffer with the current line.
After the whole input was processed, we must not forget to print out the contents that are still in the buffer.
my $first_line = do { my $l = <>; chomp $l; $l };
my ($last_gene, $last_tow, $last_intron) = split /\t/, $first_line;

while (<>) {
    chomp;
    my ($gene, $tow, $intron) = split /\t/;
    if ($gene eq $last_gene and $intron eq $last_intron) {
        $last_tow += $tow;
    } else {
        print join("\t", $last_gene, $last_tow, $last_intron), "\n";
        ($last_gene, $last_tow, $last_intron) = ($gene, $tow, $intron);
    }
}
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
This works fine as long as genes that may be folded together are always consecutive. If the joinable records are spread all over the file, we have to keep a data structure of all records. After the whole file is parsed, we can emit nicely sorted sums.
We will use a multilevel hash that uses the gene as first level key, and the intron as 2nd level key. The value is the count/tow/whatever:
my %records;

# parse the file
while (<>) {
    chomp;
    my ($gene, $tow, $intron) = split /\t/;
    $records{$gene}{$intron} += $tow;
}

# emit the data:
for my $gene (sort keys %records) {
    for my $intron (sort keys %{ $records{$gene} }) {
        print join("\t", $gene, $records{$gene}{$intron}, $intron), "\n";
    }
}
This seems more like something that can be done easily using a simple SQL query, especially as you get your files in a database table format. I couldn't comment on your question to ask you more about it, as I don't have enough reputation to do so.
So I'm assuming that you get your data from a table. Not that you can't solve this problem in Perl, but I strongly recommend using the database to do such calculations when fetching the data, as that seems much easier, especially when you have lots of such fields in a file and want to do such operations on all of them. You could still use Perl to interact with your database when solving your problem via an SQL query.
So my proposed solution in SQL, if the data is collected from a database is:
Write an SQL statement involving a GROUP BY on the GENE and feature field and aggregate the size column.
If your table looked exactly like what you described - let us call it the GeneInformation table - and you loaded your data file into the SQL database (SQLite, maybe), then your select query would be:
SELECT gene, feature, SUM(size)
FROM GeneInformation
GROUP BY gene, feature;
That should give you a list of genes, features and their corresponding total sizes.
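For completeness, here is a rough sketch of driving that query from Perl with DBI (my own illustration; it assumes DBD::SQLite is installed and that the data has already been loaded into a GeneInformation table in a file called genes.db):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=genes.db", "", "", { RaiseError => 1 } );

my $sth = $dbh->prepare(
    "SELECT gene, feature, SUM(size) FROM GeneInformation GROUP BY gene, feature"
);
$sth->execute;

# print one line per gene/feature pair, in the same layout as the desired output
while ( my ($gene, $feature, $total) = $sth->fetchrow_array ) {
    print "$gene\t$total\t$feature\n";
}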
If SQL solution is completely impossible for you then I will talk about the Perl solution.
I noticed that the Perl solutions are based on the assumption that a particular gene's values would appear consecutively in the file. If that is the case then I would like to up vote amon's answer (which I can't do at the moment).

Perl: increment 2d array cell?

I have a set of numerical data for which it is important to me to know which pairs of numbers occurred together, and how many times. Each set of data contains 7 numbers between 1 and 20. There are several hundred sets of data.
Essentially, by parsing each set of my data, I want to create a 20 x 20 array that I can use to keep a count of when pairs of numbers occurred together.
I have done a lot of searching, but maybe I've used the wrong keywords. I've seen loads of examples of how to create a "2D array" - I know perl doesn't actually do that, and that it's really an array of references - and how to print the values contained therein, but nothing really on how to work with one particular cell by number and alter it.
Below is my conceptual code. The commented lines don't work, but illustrate what I want to achieve. I'm reasonably new to coding Perl, and this just seems too advanced for me to understand the examples I've seen and translate them into something I can actually use.
my @datapairs;

while (<DATAFILE>)
{
    chomp;
    my @data = split(",",$_);
    for ($prcount=0; $prcount <=5; $prcount++)
    {
        for ($othcount=($prcount+1); $othcount<=6; $othcount++)
        {
            @data[$prcount]=@data[$prcount]+1;
            @data[$othcount]=@data[$othcount]+1;
            @data[$prcount]=@data[$prcount]-1;
            @data[$othcount]=@data[$othcount]-1;
            print @data[$prcount]." ".@data[$othcount]."; ";
            #@datapairs[@data[$prcount]][@data[$othcount]]++;
            #@datapairs[@data[$othcount]][@data[$prcount]]++;
        }
    }
}
Any input or suggestions would be much appreciated.
Accessing a "cell" in a "2D array" in Perl (as you already figured out, it's really an array of arrayrefs) is simple:
my @datapairs;
# Add 1 for a pair with indexes $i and $j
$datapairs[$i]->[$j]++;
To print that value:
print "$datapairs[$i]->[$j]\n";
It's not clear what you mean by "occur together" - if you mean "in the same length-7 array", it's easy:
my @datapairs;
while (<DATAFILE>) {
    chomp;
    my @data = split(",", $_);
    for (my $prcount = 0; $prcount <= 5; $prcount++) {
        for (my $othcount = $prcount + 1; $othcount <= 6; $othcount++) {
            $datapairs[ $data[$prcount] ]->[ $data[$othcount] ]++;
        }
    }
}
# Print
for (my $i = 0; $i < 20; $i++) {
    for (my $j = 0; $j < 20; $j++) {
        print "$datapairs[$i]->[$j], ";
    }
    print "\n";
}
As a side note, personally, just for stylistic reasons, I strongly prefer to reference EVERYTHING, e.g. use arrayref of arrayrefs instead of array of arrays. E.g.
my $datapairs;
# Add 1 for a pair with indexes $i and $j
$datapairs->[$i]->[$j]++;
To print that value:
print "$datapairs->[$i]->[$j]\n";
The second (and third...) arrow dereference operator is optional in Perl but I personally find it significantly more readable to enforce its usage - it spaces out the index expressions.

How do I replace row identifiers with sequential numbers?

I have written a Perl script that splits 3 columns into scalars and replaces various values in the second column using a regex. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id) into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements and tried this, but that doesn't seem to play well with the while procedure I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");

while (<DATA>)
{
    chomp;
    my ($item_id,$callnum,$data) = split(/\|/);
    $callnum =~ s/110/\%I/g;
    $callnum =~ s/245/\%T/g;
    $callnum =~ s/260/\%U/g;
    print "$item_id\t$callnum\t$data\n";
} #End while

close DATA;
The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
With code this could look something similar to this (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
# do your thing
$counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
$prev_item_id = $item_id;
print "$item_id\t$counter\t...\n";
}
This goes a little further than just what you asked...
Use lexical filehandles
autodie makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;

open(my $fh, "<", "text_to_be_processed.txt");

my %Callnum_Map = (
    110 => '%I',
    245 => '%T',
    260 => '%U',
);

my %item_id_count;
while (<$fh>) {
    chomp;

    my($item_id,$callnum,$data) = split m{\|};

    for my $search (keys %Callnum_Map) {
        my $replace = $Callnum_Map{$search};
        $callnum =~ s{$search}{$replace}g;
    }

    my $item_count = ++$item_id_count{$item_id};

    print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times you've seen an item id in the whole file. If you want how many times it's been seen consecutively, use Moritz's solution.
Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt");

my $counter = 0;
my $prev = '';
while (<DATA>)
{
    chomp;
    my ($item_id,$callnum,$data) = split(/\|/);
    $callnum =~ s/110/\%I/g;
    $callnum =~ s/245/\%T/g;
    $callnum =~ s/260/\%U/g;

    # reset the counter when the original $item_id changes
    $counter = 0 if ($prev ne $item_id);
    ++$counter;
    $prev = $item_id;
    $item_id = $counter;

    print "$item_id\t$callnum\t$data\n";
} #End while

close DATA;