Perl: Removing duplicates from a large set of data

Perl: Removing duplicates from a large set of data - perl

I'm using Perl to generate a list of unique exons (which are the units of genes).
I've generated a file in this format (with hundreds of thousands of lines):
chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4
Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.
Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth sets). I want to remove these "duplicate" - ie, delete gene1 or gene4 (not important which one gets removed).
I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.
Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:
for (my $i = 0; $i < scalar #exons; $i++) {
my #temp_line = line_splitter($exons[$i]); # runs subroutine turning scalar into array
for (my $j = 0; $j < scalar #exons_dup; $j++) {
my #inner_temp_line = line_splitter($exons_dup[$j]); # runs subroutine turning scalar into array
unless (($temp_line[1] == $inner_temp_line[1]) && # this loop ensures that the the loop
($temp_line[3] eq $inner_temp_line[3])) { # below skips the identical lines
if (($temp_line[1] == $inner_temp_line[1]) && # if the coordinates are the same
($temp_line[2] == $inner_temp_line[2])) { # between the comparisons
splice(#exons, $i, 1); # delete the first one
}
}
}
}

my #exons = (
'chr1 1000 2000 gene1',
'chr1 3000 4000 gene2',
'chr1 5000 6000 gene3',
'chr1 1000 2000 gene4'
);
my %unique_exons = map {
my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
"$chro $scoor $ecoor" => $gene
} #exons;
print "$_ $unique_exons{$_} \n" for keys %unique_exons;
This will give you uniqueness, and the last gene name will be included. This results in:
chr1 1000 2000 gene4
chr1 5000 6000 gene3
chr1 3000 4000 gene2

You can use a hash to dedup en passant, but you need a way to join the parts you want to use to detect duplicates into a single string.
sub extract_dup_check_string {
my $exon = shift;
my #parts = line_splitter($exon);
# modify to suit:
my $dup_check_string = join( ';', #parts[0..2] );
return $dup_check_string;
}
my %seen;
#deduped_exons = grep !$seen{ extract_dup_check_string($_) }++, #exons;

You can use a hash to keep track of duplicates you've already seen and then skip them. This example assumes the fields in your input file are space-delimited:
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
while (my $line = <>) {
my($chromosome, $exon_start, $exon_end, $gene) = split /\s+/, $line;
my $key = join ':', $chromosome, $exon_start, $exon_end;
if ($seen{$key}) {
next;
}
else {
$seen{$key}++;
print $line;
}
}

As simple as it comes. I tried to use as little magic as possible.
my %exoms = ();
my $input;
open( $input, '<', "lines.in" ) or die $!;
while( <$input> )
{
if( $_ =~ /^(\w+\s+){3}(\w+)$/ ) #ignore lines that are not in expected format
{
my #splits = split( /\s+/, $_ ); #split line in $_ on multiple spaces
my $key = $splits[1] . '_' . $splits[2];
if( !exists( $exoms{$key} ) )
{
#could output or write to a new file here, probably output to a file
#for large sets.
$exoms{$key} = \#splits;
}
}
}
#demo to show what was parsed from demo input
while( my ($key, $value) = each(%exoms) )
{
my #splits = #{$value};
foreach my $position (#splits)
{
print( "$position " );
}
print( "\n" );
}

Related

Perl: print the line with the largest number from standard input

i'm very new to Perl and I've been trying to implement a function that print out the line with the largest number from the standard input. For example, If the input is:
Hello, i'm 18
1 this year is 2019 1
1 2 3 - 4
The output should be: 1 this year is 2019 1
And secondly, I would like to know what does $line =~ /-?(?:\d+.?\d*|.\d+)/g mean?
The following is what I've tried, it is not working and I hope someone could fix it for me. I'm struggling with filtering out random characters but leaving out the digit.
Is it necessary to push the largest number onto an array? Is there any way that once we could do this in one step?
#!/usr/bin/perl -w
while ($lines = <STDIN>){
#input = $lines =~ /\d+/g;
if (!#input){
} else {
$largest_number = sort {$a <=> $b} #input[0];
push(#number, $largest_number);
}
}
if (!#number){
}else{
print $largest_number;
}

#input[0] returns just the first value from the array. You probably want to use #input instead - but this way you'd get the numbers from one line sorted. Also, you need to store the whole line somewhere in order to be able to display it later.
Here's how I would do it:
#!/usr/bin/perl
use warnings;
use strict;
my #max;
while (my $line = <>) {
my #numbers = $line =~ /\d+/g;
for my $n (#numbers) {
if (! #max || $n > $max[0]) {
#max = ($n, $line);
}
}
}
print $max[1] if #max;
The #max array stores the largest number in $max[0], while $max[1] keeps the whole line. You just compare each number to the largest one, there's no need to search for the maximum for each line.
To store all the lines containing the largest number, change the loop body to
if (! #max || $n > $max[0]) {
#max = ($n, $line);
} elsif ($n == $max[0]) {
push #max, $line;
}
and the last line to
print #max[ 1 .. $#max ] if #max;

Curiously behaving IF block in Perl run on Windows

Background: I have a Perl script that I wrote to go through two files. The basic point of the script is to identify overlaps between one list of coordinates, defining the beginnings and ends of randomly selected chromosomal segments, and a second list of coordinates, defining the beginnings and endings of actual gene transcripts.
The first input file contains three columns. The first is for the chromosome number, and the second and third are the proximal and distal coordinates, in base pairs, of the randomly selected regions. For eg,
chr1 1100349 2035647
chr1 47837656 736474584
. . .
. . .
. . .
The second input file contains four columns: chromosome number, proximal coordinate, distal coordinate, and the name of the gene. For eg,
chr1 1588354 2283765 geneA
chr1 55943837 787653743 geneB
Here is a set of test files I used to start off with. First set.
chr1 1 10
chr1 5 10
chr1 5 15
chr1 14 15
chr1 100 101
chr1 11 17
Second set.
chr1 1 5 geneA
chr1 7 10 geneB
chr1 12 16 geneC
chr1 18 21 geneD
chr10 126602211 126609396 B4galnt1
The script reads off the first line from the first list, then reads through all the lines of the second list, and prints for me whether and how the first coordinate pair overlaps with the second coordinate pair (Is the first coordinate pair outside the second pair? Is the first pair inside or overlapping with the second?) Then, the script goes back and reads off the second line from the first list, and repeats the process. The first file has 200,000 lines. The second several thousand. It is running now overnight.
The problem: When the script determines the relationship between the first and second coordinate pairs, it prints out a line to an output file. Not all these print statements need to be sent to output, so I tried to comment them out. However, when I did this, none of the print statements sending information to the output file got printed. Statements are printed to the screen, though, just not to the output file. The script is running, but all the print to output statements are being used, so the output file is getting huge. If the script would just print to output for only those coordinates that overlap, the output file would be very, very much smaller. At present, the output file is now 2,131,294 KB! And that's only up to chromosome 11. There are eight more to go through, albeit smaller ones, but the file size is still going to expand greatly.
Updated information: This is edited in after my original posting. To be more precise, it is only when I comment out the first print $output "..."; statement that is inside the loop (the very first statement is to print a header, and this is before the loop) that the script fails to print anything, even when all the others are left alone (not commented).
In case it matters: I wrote the script on my Mac, using Fraise, but I am running it on a PC, the script contained in a Notepad text file.
Here's the script: Note: there are many print statements in the file, many commented out. The print statements of interest are those printing to the output file. Those are the ones that, when one or more are commented out, wind up never sending information to the output file. Those statements look like:
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
The actual script:
#!/bin/usr/perl
use strict; use warnings;
#############
## findGenes_after_ASboot_v5.pl
#############
#############
# After making a big list of randomly placed intervals,
# this script uses RefGene.txt file and identifies the
# the gene symbols encompassed or overlapped by each random interval
#############
unless(scalar #ARGV == 2) {
# $0 name of the program being executed;
print "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n";
exit;
}
#for ( my $i = 0; $i < 25; $i++ ){
# print "#########################################\n";
#}
open( my $positions, "<", $ARGV[0] ) or die;
open( my $RefGene, "<", $ARGV[1] ) or die;
open( my $output, ">>", "output.txt") or die;
# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";
my $pos_count = 1;
my $ref_count = 1;
for my $position_line (<$positions>) {
#print "$position_line";
my #posline = split('\t', $position_line);
#print "$posline[0]\t$posline[1]\t$posline[2]";
open( my $RefGene, "<", $ARGV[1] ) or die;
for my $ref (<$RefGene>){
#print "\t$ref";
my #refline = split('\t', $ref);
# print "\t$refline[0]\t$refline[1]\t$refline[2]\t$refline[3]";
chomp $posline[2];
chomp $refline[3];
if ( $posline[0] eq $refline[0] ){
#print "\tchr match\n";
# am i entirely prox to a gene?
if ( $posline[2] < $refline[1] ){
#print "too proximal\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
#the following print statement is one I'd like to be able to comment out
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
$ref_count++;
next;
}
# am i entirely distal to a gene?
elsif ( $posline[1] > $refline[2] ){
#print "too distal\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
#the following print statement is one I'd like to be able to comment out
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
$ref_count++;
next;
}
# am i completely inside a gene?
elsif ( $posline[1] >= $refline[1] &&
$posline[2] <= $refline[2] ){
#print "inside\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
$ref_count++;
next;
}
# am i proximally overlapping?
elsif ( $posline[1] < $refline[1] &&
$posline[2] <= $refline[2] ){
#print "proximal overlap\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
$ref_count++;
next;
}
# am i distally overlapping?
elsif ( $posline[1] >= $refline[1] &&
$posline[2] > $refline[2] ){
#print "distal overlap\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
$ref_count++;
next;
}
else {
#print "encompassing\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
$ref_count++;
next;
}
} # if a match with chr
else {
next;
}
} # for each reference
$pos_count++;
} # for each position
Data Files:
http://www.filedropper.com/proxdistalpositionsofrandompositions
http://www.filedropper.com/modifiedrefgene
Some output: http://www.filedropper.com/output_17

I see two potential flaws in your code:
Always use while when processing a file instead of for.
Whenever you use the latter, you're actually loading the entire file into memory versus just doing line by line processing. If you're actually able to support doing that though, you should go ahead and load your smaller file entirely and just iterate on the lines.
Split on "\t" not on '\t'.
The latter is almost certainly a bug, unless you really do use a 2 character delimiter for your data.
Anyway, I've cleaned up your code considerably. Removing duplicated lines etc. It's likely that a lot of these changes may either not work (as it's untested) or not be what you want. However, if you go through the code, perhaps it will give you ideas at the very least:
#!/bin/usr/perl
use strict;
use warnings;
use autodie;
#############
## findGenes_after_ASboot_v5.pl
#############
#############
# After making a big list of randomly placed intervals,
# this script uses RefGene.txt file and identifies the
# the gene symbols encompassed or overlapped by each random interval
#############
die "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n"
if #ARGV != 2;
open my $positions, "<", $ARGV[0];
# Cache file by key
my %refgenes;
open my $RefGene, "<", $ARGV[1];
while (<$RefGene>) {
chomp;
my #cols = split "\t";
push #{$refgenes{$cols[0]}}, \#cols;
}
open my $output, ">>", "output.txt";
# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";
my $pos_count = 1;
my $ref_count = 1;
while (my $position_line = <$positions>) {
chomp $position_line;
my #posline = split "\t", $position_line;
# Only iterate on matching refs
for my $ref (#{ $refgenes{$posline[0]} }) {
my #refline = #$ref;
my $desc = join "\t", ($posline[0], $pos_count, #posline[1,2], $refline[0], $ref_count, #refline[1,2,3]);
my $message = '';
# am i entirely prox to a gene?
if ( $posline[2] < $refline[1] ){
$message = 'too proximal';
# am i entirely distal to a gene?
} elsif ( $posline[1] > $refline[2] ) {
$message = 'too distal';
# am i completely inside a gene?
} elsif ( $posline[1] >= $refline[1] && $posline[2] <= $refline[2] ) {
$message = 'inside';
# am i proximally overlapping?
} elsif ( $posline[1] < $refline[1] && $posline[2] <= $refline[2] ) {
$message = 'proximal overlap';
# am i distally overlapping?
} elsif ( $posline[1] >= $refline[1] && $posline[2] > $refline[2] ) {
$message = 'distal overlap';
} else {
$message = 'encompassing';
}
print "$desc\t$message\n";
print $output "$desc\t$message\n";
$ref_count++;
} # for each reference
$pos_count++;
} # for each position

How can I use the until function with appropriate way

I have a file that I want to filter which is like that:
##matrix=axtChain 16 91,-114,-31,-123,-114,100,-125,-31,-31,-125,100,-114,-123,-31,-114,91
##gapPenalties=axtChain O=400 E=30
chain 21455232 chr20 14302601 + 37457 14119338 chr22 14786829 + 3573 14759345 1
189 159 123
24 30 22
165 21 20
231 105 0
171 17 19
261 0 2231
222 2 0
253 56 48
chain 164224 chr20 14302601 + 1105938 1125118 chr22 14786829 + 1081744 1100586 8
221 352 334
24 100 112
34 56 56
26 50 47
…………………….
chain 143824 chr20 14302601 + 1105938 1125118 chr22 14786829 + 1081744 1100586 8
So, briefly,there are blocks separated by a blank line.
Each block begins with the line " chain xxxxx " and continues with lines with numbers.
I want to filter out the file and keep just the blocks with chain and the number that follows be greater than 3000.
I wrote the following script to do that:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX;
my $chain = $ARGV[0];
#It filters the chains with chains >= 3000.
open my $chain_file, $chain or die "Could not open $chain: $!";
my #array;
while( my $cline = <$chain_file>) {
#next if /^\s*#/;
chomp $cline;
#my #lines = split (/ /, $cline);
if ($cline =~/^chain/) {
my #lines = split (/\s/, $cline);
if ($lines[1] >= 3000) {
#print $lines[1];
#my #lines = split (/ /, $cline);
#print "$cline\n";
push (#array, $cline);
}
}
until ($cline ne ' ') {
push (#array, $cline);
}
foreach (#array) {
print "$_\n";
}
undef(#array);
}
The problem is that I can print just the headers (chain XXXXX…..) and not the numbers that follows at the next lines of each block.
I'm using the until function till will find the blank line, but it doesn't work.
If someone could help me with that….
Thank you very much in advance,
Vasilis.

The first problem here is that ' ' is a single space, not a blank line ("" or '' should be fine since you've already chomp-ed the line.
The second problem is that
until ( $cline ne "" )
is the same as
while ( $cline eq "" )
which is the opposite of what you need to push lines to #array.
That said, the flip-flop operator is probably a more suitable construct for what you're after:
my #array;
while ( <$chain_file> ) { # Using $_ instead of $cline
chomp;
if ( do { /^chain\s+(\d+)/ && $1 >= 3000 } .. /^$/ ) {
# Accumulate lines in #array
push #array, $_; # False until LHS evaluates to true ...
} # ... then true until RHS evaluates to true
else {
for ( #array ) {
print $_, "\n"; # Print matches
}
#array = (); # Reset/clear out #array
}
}

It's usually best not to use unless instead of while. It negates the boolean expression many times leaving you with a double negative to solve. Here's an example
while ( $foo ne $bar ) {
Even though this is a negative expression, I can pretty easily figure out when to exit my loop. However:
until ( $foo eq $bar ) {
Just takes time to figure out.
Also, ' ' does not make a blank line: Use the regular expression $cline =~ /^\s*$/. However, even beyond that the loop:
until ($cline ne ' ') {
push (#array, $cline);
}
will go on forever if $cline does equal blank. You're never changing the value of $cline.
You can use what I use to call state variables (until Perl actually created a variable type called state and now I have no idea what to call them.) This variable tracks where you are in your file. Are you inside a chain section of the file? Do you want these lines or not? This way, you only have a single loop. You set your state variables and then process your loop.
In this example, I have a state variable called $keep_line which is asking whether or not I want to keep the lines I want to read in. If the line starts with chain and the second field is greater than 3000, I want to keep the entire block (if I understand what you're attempting to do). (By the way, I'm keeping blank lines. Is that okay?)
my $keep_lines = 0; # Aren't in lines I want to keep
my #good_lines; # Array where you want to keep the good lines
while ( my $line = <$chain_file> ) {
chomp $line; # Good habit to always chomp a input as soon as it's read.
if ( $line =~ /^chain/ ) { # This is a chain line. Do I want to keep this group?
my #fields = ( /\s+/, $line );
if ( $field[1] > 3000 ) { # Yes, if the second field is greater than 3000
$keep_lines = 1; # Keep this and all other lines until the next chain line
}
else {
$keep_lines = 0; # Skip until the next chain line
}
}
if ( $keep_lines ) {
push #good_lines, $line;
}
}
I also smell a function here: Instead of the tangle of if clauses, I would probably make this a function that returns the value I set $keep_lines to.
while ( my $line = <$chain_file> ) {
chomp $line; # Good habit to always chomp a input as soon as it's read.
$keep_lines = keep_this_section( $line );
if ( $keep_lines ) {
push #good_lines, $line;
}
}
Simplifies the code quite a bit.
I would also declare some constants to remove those Mysterious Moes. Those are things like 3000 and /^chain/ that have some sort of mysterious, but important meaning in your program. You can use constant as a pragma to define Perl constants. It's part of standard Perl:
use constant {
KEEP_LIMIT => 3000,
SECTION_HEADER => qr/^chain/,
};
Now, I can do things like this:
if ( $line =~ SECTION_HEADER ) {
instead of:
if ( $line =~ /^chain/ ) {
and
if ( $field[1] > KEEP_LIMIT ) {
instead of
if ( $field[1] > 3000 ) {
There are problems with the constant pragma. The biggest is that it just doesn't interpolate in places where Perl will normally interpolate variables. This include double quoted strings and hash keys. If I have $foo{KEEP_LIMIT}, Perl will interpret the key as a string KEEP_LIMIT and not as a constant of KEEP_LIMIT.
Many developers use Readonly which is just so much better in so many ways. Unfortunately, Readonly isn't a standard Perl module, so you have to install it via CPAN, and that's sometimes not possible or desirable to do. So, I tend to use constant.

Perl: Compare keys of 2 hashes & print the difference between the closest keys

UPDATE(16/1/13)
Borodin pointed out another possibility which I completely overlooked.
In the actual files (I manually sat and started looking through 46 files, each about 10MB large), there are cases where for a particular value in File1, no smaller value exists in File2 (but a greater value does).
Likewise there exist cases where for a particular value in File1, no greater value exists in File2 (but a smaller value does)
I am updating the sample files and the desired output here to reflect this update.
UPDATE (15/1/13)
I have updated the desired output to account for a case where a value in File1 matches a value in File2. Thanks to Borodin for pointing out such a scenario.
I have 2 files which look like the following :
File1
chr1 10227
chr1 447989
chr1 535362
chr1 856788
chr1 249240496
File2
chr1 11017
chr1 11068
chr1 23525
chr1 439583
chr1 454089
chr1 460017
chr1 544711
chr1 546239
chr1 856788
chr1 249213429
chr1 249214499
chr1 249239072
What I need to do is that foreach value in file1, eg. 10227, find from file2 , two values which are closest. One of these values would be bigger, and the other smaller.
So taking 10227 in file1, the values which are closest in file2 are 9250 and 11017. Now the difference needs to be computed viz 9250 - 10227 = -977 and 11017 - 10227 = 790 to give an output like the following (tab delimited) :
Desired Output
chr1 10227 No 790 No Match
chr1 447989 No 6100 -8406
chr1 535362 No 9349 -75345
chr1 856788 Yes
chr1 249240496 No No Match -25997
I figured the fastest way to do this would be to use a hash to read in the 2 files, taking the numbers as keys and assigning 1 as value.
The code I have written so far is giving the difference of 10227 wrt all the values in file2. Similarly with 447989 and 535682.
How do I stop this and find the difference of only the closest numbers, one which is >10227 and one which is <10227
Code
use 5.014;
use warnings;
#code to enter lsdpeak and pg4 data into hash with KEYS as the numerical values, VALUE as 1
#Assign filename
my $file1 = 'lsdpeakmid.txt';
my $file2 = 'pg4mid.txt';
#Open file
open my $fh1, '<', $file1 or die $!;
open my $fh2, '<', $file2 or die $!;
#Read in file linewise
my %hash1;
while(<$fh1>){
my $key1 = (split)[1];
$hash1{$key1} = 1;
}
my %hash2;
while(<$fh2>){
my $key2 = (split)[1];
}
foreach my $key1 (sort keys %hash1){
foreach my $key2 (sort keys %hash2){
say $key2-$key1;
}
}
#Exit
exit;
Thank you for taking the time to go through the problem. I would be grateful for any and every comment/answer.

A hash is not a good choice here, as the only way to find the correct boundaries from file2 is to search through the list of values, and a hash doesn't facilitate that.
This program works by putting all the boundaries from file2 into the array #boundaries, and then searching through this array for each value read from file1 to find the first boundary value that is greater. Then this and the previous boundaries are the ones required, and the arithmetic is done in the print statement.
Note that this code will have problems if file2 contains a matching boundary, or if there is no boundary greater than or none less than a given value.
use strict;
use warnings;
use Data::Dump;
my $file1 = 'lsdpeakmid.txt';
my $file2 = 'pg4mid.txt';
my #boundaries = do {
open my $fh, '<', $file2 or die $!;
map { (split)[1] } <$fh>;
};
open my $fh, '<', $file1 or die $!;
while (my $line = <$fh>) {
chomp $line;
my #vals = split ' ', $line;
my $val = $vals[-1];
for my $i (1 .. $#boundaries) {
if ($boundaries[$i] > $val) {
print join(' ', #vals, $boundaries[$i] - $val, $boundaries[$i-1] - $val), "\n";
last;
}
}
}
output
chr1 10227 790 -977
chr1 447989 6100 -8406
chr1 535362 9349 -75345

One way:
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(first);
open my $fh1,'<','file1' or die $!;
open my $fh2,'<','file2' or die $!;
my %h1;
while(<$fh2>){
chomp;
my ($k,$v)=split(/\s+/);
push #{$h1{$k}}, $v;
}
close $fh2;
while (<$fh1>){
chomp;
my ($k, $v)=split(/\s+/);
my $bef=first{$_ >= $v}#{$h1{$k}};
$bef=defined $bef?$bef-$v:"No match";
my $aft=first{$_ <= $v}reverse #{$h1{$k}};
$aft=defined $aft?$aft-$v:"No match";
my $str=sprintf("%-8s %-10d %-5s %-8s %-8s",$k, $v,$bef?"No":"Yes",$bef?$bef:"",$aft?$aft:"");
print $str, "\n";
}
close $fh1;
The first while loop reads the second file and creates a hash where the key is chr1 and the value is an array reference containing all the values of chr1.
The foreach block sorts all the keys in numerical order.
The 2nd while loop processes the records of file1 and uses the first function of List::Util to get the results.
The first function is used twice: Once, to get the first biggest value than the current value, second: to get the last smallest value than the current value which is obtained by using the first on a reverse sorted array.
First function:
First function returns the first number in the array which satisfies the condition.
first{$_ > $v}#{$h1{$k}} => This gets the first number in the array which is bigger than the current number. Say for 10227, first will return 11017.
The next thing needed is the last smallest number before 10227. To get this, the first function is applied on the reverse array.
first{$_ < $v}reverse #{$h1{$k}} => This will return the first number which is lesser than 10227, and since the array is reversed, what we get is actually the last smallest number before 10227 which is 9250.
On running this:
chr1 10227 No 790 No match
chr1 447989 No 6100 -8406
chr1 535362 No 9349 -75345
chr1 856788 Yes
chr1 249240496 No No match -1424

First, we read in the second file and put the values into an array. I further assume that this chr1 is constant and can be discarded safely:
#!/usr/bin/perl
use strict; use warnings;
my #file2;
open my $fh2, "<", "file2" or die $!;
while (<$fh2>) {
my (undef, $num) = split;
die "the number contains illegal characters" if $num =~ /\D/;
push #file2, $num;
}
#file2 = sort {$a <=> $b} #file2; # sort ascending
# remove previous line if sorting is already guaranteed.
Then, we define a sub to find the two values in our array. It is just a variation of a basic algorithm to find a certain value in a sorted list (in O(log n)), and should perform better than iterating over each value, at least on large sets. Also, it doesn't require reversing the whole list for each value.
sub find {
my ($num, $arrayref) = #_;
# exit if array is too small
return unless #$arrayref >= 2;
# exit if $num is outside the values of this array (-1 is last element)
return if $num <= $arrayref->[0] or $arrayref->[-1] < $num;
my ($lo, $hi) = (1, $#$arrayref);
my $i = int(($lo+$hi)/2); # start in the middle
# iterate until
# a) the previous index contains a number that is smaller than $num and
# b) the current index contains a number that is greater or equal to $num.
until($arrayref->[$i-1] < $num and $num <= $arrayref->[$i]) {
# make $i the next lower or upper bound.
# instead of going into an infinite loop (which would happen if we
# assign $i to a variable that already holds the same value), we discard
# the value and move on towards the middle.
# $i is too small
if ($num > $arrayref->[$i] ) { $lo = ($lo == $i ? $i+1 : $i) }
# $i is too large
elsif ($num <= $arrayref->[$i-1]) { $hi = ($hi == $i ? $i-1 : $i) }
# in case I made an error:
else { die "illegal state" }
# calculate the next index
$i = int(($lo+$hi)/2);
}
return #{$arrayref}[$i-1, $i];
}
The rest is trivial:
open my $fh1, "<", "file1" or die $!;
while (<$fh1>) {
my ($chr, $num) = split;
die "the number contains illegal characters" if $num =~ /\D/;
if (my ($lo, $hi) = find($num, \#file2)) {
if ($hi == $num) {
print join("\t", $chr, $num, "Yes"), "\n";
} else {
print join("\t", $chr, $num, "No", $hi-$num, $lo-$num), "\n";
}
} else {
# no matching numbers were found in file 2
print join("\t", $chr, $num, "No-match"), "\n";
}
}
Output:
chr1 10227 No 790 -977
chr1 447989 No 6100 -8406
chr1 535362 No 9349 -75345
chr1 856788 Yes

Find the word with most letters in common with other words

I want Perl (5.8.8) to find out what word has the most letters in common with the other words in an array - but only letters that are in the same place. (And preferably without using libs.)
Take this list of words as an example:
BAKER
SALER
BALER
CARER
RUFFR
Her BALER is the word that has the most letters in common with the others. It matches BAxER in BAKER, xALER in SALER, xAxER in CARER, and xxxxR in RUFFR.
I want Perl to find this word for me in an arbitrary list of words with the same length and case. Seems I've hit the wall here, so help is much appreciated!
What I've tried until now
Don't really have much of a script at the moment:
use strict;
use warnings;
my #wordlist = qw(BAKER SALER MALER BARER RUFFR);
foreach my $word (#wordlist) {
my #letters = split(//, $word);
# now trip trough each iteration and work magic...
}
Where the comment is, I've tried several kinds of code, heavy with for-loops and ++ varables. Thus far, none of my attempts have done what I need it to do.
So, to better explain: What I need is to test word for word against the list, for each letterposition, to find the word that has the most letters in common with the others in the list, at that letter's position.
One possible way could be to first check which word(s) has the most in common at letter-position 0, then test letter-position 1, and so on, until you find the word that in sum has the most letters in common with the other words in the list. Then I'd like to print the list like a matrix with scores for each letterposition plus a total score for each word, not unlike what DavidO suggest.
What you'd in effect end up with is a matrix for each words, with the score for each letter position, and the sum total score fore each word in the matrix.
Purpose of the Program
Hehe, I might as well say it: The program is for hacking terminals in the game Fallout 3. :D My thinking is that it's a great way to learn Perl while also having fun gaming.
Here's one of the Fallout 3 terminal hacking tutorials I've used for research: FALLOUT 3: Hacking FAQ v1.2, and I've already made a program to shorten the list of words, like this:
#!/usr/bin/perl
# See if one word has equal letters as the other, and how many of them are equal
use strict;
use warnings;
my $checkword = "APPRECIATION"; # the word to be checked
my $match = 4; # equal to the match you got from testing your checkword
my #checkletters = split(//, $checkword); #/
my #wordlist = qw(
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
);
print "$checkword has $match letters in common with:\n";
foreach my $word (#wordlist) {
next if $word eq $checkword;
my #letters = split(//, $word);
my $length = #letters; # determine length of array (how many letters to check)
my $eq_letters = 0; # reset to 0 for every new word to be tested
for (my $i = 0; $i < $length; $i++) {
if ($letters[$i] eq $checkletters[$i]) {
$eq_letters++;
}
}
if ($eq_letters == $match) {
print "$word\n";
}
}
# Now to make a script on to find the best word to check in the first place...
This script will yield CONSTRUCTION and TRANSMISSION as its result, just as in the game FAQ. The trick to the original question, though (and the thing I didn't manage to find out on my own), is how to find the best word to try in the first place, i.e. APPRECIATION.
OK, I've now supplied my own solution based on your help, and consider this thread closed. Many, many thanks to all the contributers. You've helped tremendously, and on the way I've also learned a lot. :D

Here's one way. Having re-read your spec a couple of times I think it's what you're looking for.
It's worth mentioning that it's possible there will be more than one word with an equal top score. From your list there's only one winner, but it's possible that in longer lists, there will be several equally winning words. This solution deals with that. Also, as I understand it, you count letter matches only if they occur in the same column per word. If that's the case, here's a working solution:
use 5.012;
use strict;
use warnings;
use List::Util 'max';
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
my #scores;
foreach my $word ( #words ) {
my $score;
foreach my $comp_word ( #words ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
say "Words with most matches:";
say for #words[#max_ixs];
This solution counts how many times per letter column each word's letters match other words. So for example:
Words: Scores: Because:
ABC 1, 2, 1 = 4 A matched once, B matched twice, C matched once.
ABD 1, 2, 1 = 4 A matched once, B matched twice, D matched once.
CBD 0, 2, 1 = 3 C never matched, B matched twice, D matched once.
BAC 0, 0, 1 = 1 B never matched, A never matched, C matched once.
That gives you the winners of ABC and ABD, each with a score of four positional matches. Ie, the cumulative times that column one, row one matched column one row two, three, and four, and so on for the subsequent columns.
It may be able to be optimized further, and re-worded to be shorter, but I tried to keep the logic fairly easy to read. Enjoy!
UPDATE / EDIT
I thought about it and realized that though my existing method does exactly what your original question requested, it did it in O(n^2) time, which is comparatively slow. But if we use hash keys for each column's letters (one letter per key), and do a count of how many times each letter appears in the column (as the value of the hash element), we could do our summations in O(1) time, and our traversal of the list in O(n*c) time (where c is the number of columns, and n is the number of words). There's some setup time too (creation of the hash). But we still have a big improvement. Here is a new version of each technique, as well as a benchmark comparison of each.
use strict;
use warnings;
use List::Util qw/ max sum /;
use Benchmark qw/ cmpthese /;
my #words = qw/
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
/;
# Just a test run for each solution.
my( $top, $indexes_ref );
($top, $indexes_ref ) = find_top_matches_force( \#words );
print "Testing force method: $top matches.\n";
print "#words[#$indexes_ref]\n";
( $top, $indexes_ref ) = find_top_matches_hash( \#words );
print "Testing hash method: $top matches.\n";
print "#words[#$indexes_ref]\n";
my $count = 20000;
cmpthese( $count, {
'Hash' => sub{ find_top_matches_hash( \#words ); },
'Force' => sub{ find_top_matches_force( \#words ); },
} );
sub find_top_matches_hash {
my $words = shift;
my #scores;
my $columns;
my $max_col = max( map { length $_ } #{$words} ) - 1;
foreach my $col_idx ( 0 .. $max_col ) {
$columns->[$col_idx]{ substr $_, $col_idx, 1 }++
for #{$words};
}
foreach my $word ( #{$words} ) {
my $score = sum(
map{
$columns->[$_]{ substr $word, $_, 1 } - 1
} 0 .. $max_col
);
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
sub find_top_matches_force {
my $words = shift;
my #scores;
foreach my $word ( #{$words} ) {
my $score;
foreach my $comp_word ( #{$words} ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if
substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
The output is:
Testing force method: 39 matches.
APPRECIATION
Testing hash method: 39 matches.
APPRECIATION
Rate Force Hash
Force 2358/s -- -74%
Hash 9132/s 287% --
I realize your original spec changed after you saw some of the other options provided, and that's sort of the nature of innovation to a degree, but the puzzle was still alive in my mind. As you can see, my hash method is 287% faster than the original method. More fun in less time!

As a starting point, you can efficiently check how many letters they have in common with:
$count = ($word1 ^ $word2) =~ y/\0//;
But that's only useful if you loop through all possible pairs of words, something that isn't necessary in this case:
use strict;
use warnings;
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
# you want a hash to indicate which letters are present how many times in each position:
my %count;
for my $word (#words) {
my #letters = split //, $word;
$count{$_}{ $letters[$_] }++ for 0..$#letters;
}
# then for any given word, you get the count for each of its letters minus one (because the word itself is included in the count), and see if it is a maximum (so far) for any position or for the total:
my %max_common_letters_count;
my %max_common_letters_words;
for my $word (#words) {
my #letters = split //, $word;
my $total;
for my $position (0..$#letters, 'total') {
my $count;
if ( $position eq 'total' ) {
$count = $total;
}
else {
$count = $count{$position}{ $letters[$position] } - 1;
$total += $count;
}
if ( ! $max_common_letters_count{$position} || $count >= $max_common_letters_count{$position} ) {
if ( $max_common_letters_count{$position} && $count == $max_common_letters_count{$position} ) {
push #{ $max_common_letters_words{$position} }, $word;
}
else {
$max_common_letters_count{$position} = $count;
$max_common_letters_words{$position} = [ $word ];
}
}
}
}
# then show the maximum words for each position and in total:
for my $position ( sort { $a <=> $b } grep $_ ne 'total', keys %max_common_letters_count ) {
printf( "Position %s had a maximum of common letters of %s in words: %s\n",
$position,
$max_common_letters_count{$position},
join(', ', #{ $max_common_letters_words{$position} })
);
}
printf( "The maximum total common letters was %s in words(s): %s\n",
$max_common_letters_count{'total'},
join(', ', #{ $max_common_letters_words{'total'} })
);

Here's a complete script. It uses the same idea that ysth mentioned (although I had it independently). Use bitwise xor to combine the strings, and then count the number of NULs in the result. As long as your strings are ASCII, that will tell you how many matching letters there were. (That comparison is case sensitive, and I'm not sure what would happen if the strings were UTF-8. Probably nothing good.)
use strict;
use warnings;
use 5.010;
use List::Util qw(max);
sub findMatches
{
my ($words) = #_;
# Compare each word to every other word:
my #matches = (0) x #$words;
for my $i (0 .. $#$words-1) {
for my $j ($i+1 .. $#$words) {
my $m = ($words->[$i] ^ $words->[$j]) =~ tr/\0//;
$matches[$i] += $m;
$matches[$j] += $m;
}
}
# Find how many matches in the best word:
my $max = max(#matches);
# Find the words with that many matches:
my #wanted = grep { $matches[$_] == $max } 0 .. $#matches;
wantarray ? #$words[#wanted] : $words->[$wanted[0]];
} # end findMatches
my #words = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
say for findMatches(\#words);

Haven't touched perl in a while, so pseudo-code it is. This isn't the fastest algorithm, but it will work fine for a small amount of words.
totals = new map #e.g. an object to map :key => :value
for each word a
for each word b
next if a equals b
totals[a] = 0
for i from 1 to a.length
if a[i] == b[i]
totals[a] += 1
end
end
end
end
return totals.sort_by_key.last
Sorry about the lack of perl, but if you code this into perl, it should work like a charm.
A quick note on run-time: this will run in time number_of_words^2 * length_of_words, so on a list of 100 words, each of length 10 characters, this will run in 100,000 cycles, which is adequate for most applications.

Here's a version that relies on transposing the words in order to count the identical characters. I used the words from your original comparison, not the code.
This should work with any length words, and any length list. Output is:
Word score
---- -----
BALER 12
SALER 11
BAKER 11
CARER 10
RUFFR 4
The code:
use warnings;
use strict;
my #w = qw(BAKER SALER BALER CARER RUFFR);
my #tword = t_word(#w);
my #score;
push #score, str_count($_) for #tword;
#score = t_score(#score);
my %total;
for (0 .. $#w) {
$total{$w[$_]} = $score[$_];
}
print "Word\tscore\n";
print "----\t-----\n";
print "$_\t$total{$_}\n" for (sort { $total{$b} <=> $total{$a} } keys %total);
# transpose the words
sub t_word {
my #w = #_;
my #tword;
for my $word (#w) {
my $i = 0;
while ($word =~ s/(.)//) {
$tword[$i++] .= $1;
}
}
return #tword;
}
# turn each character into a count
sub str_count {
my $str = uc(shift);
while ( $str =~ /([A-Z])/ ) {
my $chr = $1;
my $num = () = $str =~ /$chr/g;
$num--;
$str =~ s/$chr/$num /g;
}
return $str;
}
# sum up the character counts
# while reversing the transpose
sub t_score {
my #count = #_;
my #score;
for my $num (#count) {
my $i = 0;
while( $num =~ s/(\d+) //) {
$score[$i++] += $1;
}
}
return #score;
}

Here is my attempt at an answer. This will also allow you to see each individual match if you need it. (ie. BALER matches 4 characters in BAKER). EDIT: It now catches all matches if there is a tie between words (I added "CAKER" to the list to test).
#! usr/bin/perl
use strict;
use warnings;
my #wordlist = qw( BAKER SALER BALER CARER RUFFR CAKER);
my %wordcomparison;
#foreach word, break it into letters, then compare it against all other words
#break all other words into letters and loop through the letters (both words have same amount), adding to the count of matched characters each time there's a match
foreach my $word (#wordlist) {
my #letters = split(//, $word);
foreach my $otherword (#wordlist) {
my $count;
next if $otherword eq $word;
my #otherwordletters = split (//, $otherword);
foreach my $i (0..$#letters) {
$count++ if ( $letters[$i] eq $otherwordletters[$i] );
}
$wordcomparison{"$word"}{"$otherword"} = $count;
}
}
# sort (unnecessary) and loop through the keys of the hash (words in your list)
# foreach key, loop through the other words it compares with
#Add a new key: total, and sum up all the matched characters.
foreach my $word (sort keys %wordcomparison) {
foreach ( sort keys %{ $wordcomparison{$word} }) {
$wordcomparison{$word}{total} += $wordcomparison{$word}{$_};
}
}
#Want $word with highest total
my #max_match = (sort { $wordcomparison{$b}{total} <=> $wordcomparison{$a}{total} } keys %wordcomparison );
#This is to get all if there is a tie:
my $maximum = $max_match[0];
foreach (#max_match) {
print "$_\n" if ($wordcomparison{$_}{total} >= $wordcomparison{$maximum}{total} )
}
The output is simply: CAKER BALER and BAKER.
The hash %wordcomparison looks like:
'SALER'
{
'RUFFR' => 1,
'BALER' => 4,
'BAKER' => 3,
'total' => 11,
'CARER' => 3
};

You can do this, using a dirty regex trick to execute code if a letter matches in its place, but not otherwise, thankfully it's quite easy to build the regexes as you go:
An example regular expression is:
(?:(C(?{ $c++ }))|.)(?:(A(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)(?:(E(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)
This may or may not be fast.
use 5.12.0;
use warnings;
use re 'eval';
my #words = qw(BAKER SALER BALER CARER RUFFR);
my ($best, $count) = ('', 0);
foreach my $word (#words) {
our $c = 0;
foreach my $candidate (#words) {
next if $word eq $candidate;
my $regex_str = join('', map {"(?:($_(?{ \$c++ }))|.)"} split '', $word);
my $regex = qr/^$regex_str$/;
$candidate =~ $regex or die "did not match!";
}
say "$word $c";
if ($c > $count) {
$best = $word;
$count = $c;
}
}
say "Matching: first best: $best";
Using xor trick will be fast but assumes a lot about the range of characters you might encounter. There are many ways in which utf-8 will break with that case.

Many thanks to all the contributers! You've certainly shown me that I still have a lot to learn, but you have also helped me tremendously in working out my own answer. I'm just putting it here for reference and possible feedback, since there are probably better ways of doing it. To me this was the simplest and most straight forward approach I could find on my own. Enjøy! :)
#!/usr/bin/perl
use strict;
use warnings;
# a list of words for testing
my #list = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
# populate two dimensional array with the list,
# so we can compare each letter with the other letters on the same row more easily
my $list_length = #list;
my #words;
for (my $i = 0; $i < $list_length; $i++) {
my #letters = split(//, $list[$i]);
my $letters_length = #letters;
for (my $j = 0; $j < $letters_length; $j++) {
$words[$i][$j] = $letters[$j];
}
}
# this gives a two-dimensionla array:
#
# #words = ( ["B", "A", "K", "E", "R"],
# ["S", "A", "L", "E", "R"],
# ["B", "A", "L", "E", "R"],
# ["C", "A", "R", "E", "R"],
# ["R", "U", "F", "F", "R"],
# );
# now, on to find the word with most letters in common with the other on the same row
# add up the score for each letter in each word
my $word_length = #words;
my #letter_score;
for my $i (0 .. $#words) {
for my $j (0 .. $#{$words[$i]}) {
for (my $k = 0; $k < $word_length; $k++) {
if ($words[$i][$j] eq $words[$k][$j]) {
$letter_score[$i][$j] += 1;
}
}
# we only want to add in matches outside the one we're testing, therefore
$letter_score[$i][$j] -= 1;
}
}
# sum each score up
my #scores;
for my $i (0 .. $#letter_score ) {
for my $j (0 .. $#{$letter_score[$i]}) {
$scores[$i] += $letter_score[$i][$j];
}
}
# find the highest score
my $max = $scores[0];
foreach my $i (#scores[1 .. $#scores]) {
if ($i > $max) {
$max = $i;
}
}
# and print it all out :D
for my $i (0 .. $#letter_score ) {
print "$list[$i]: $scores[$i]";
if ($scores[$i] == $max) {
print " <- best";
}
print "\n";
}
When run, the script yields the following:
BAKER: 11
SALER: 11
BALER: 12 <- best
CARER: 10
RUFFR: 4

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl: Removing duplicates from a large set of data - perl

Related

Perl: print the line with the largest number from standard input

Curiously behaving IF block in Perl run on Windows

How can I use the until function with appropriate way

Perl: Compare keys of 2 hashes & print the difference between the closest keys

Find the word with most letters in common with other words

Categories

Resources