I have some perl script to process a file (contains lots of numbers) line by line.
File content (sample data, first 3 numbers are seperated by space, then the seperate is tab between 3rd and 4th numbers):
1 2 3 15
2 9 8 30
100 106 321 92
9 8 2 59
300 302 69 88
....
Script content:
# snippet of script
open(INF, "$infile") || die "Unable to open file $infile: $!\n";
#content = <INF>;
close(INF);
foreach $line (#content) {
# blah blah, script to handle math here
# Now the numbers are stored in separate variables
# $n1 stores the 1st number, i.e.: 1
# $n2 stores the 2nd number, i.e.: 2
# $n3 stores the 3rd number, i.e.: 3
# $n4 stores the 4th number, i.e.: 15
# Solution code to be inserted here
}
I would like to:
Sort the variables $n1, $n2, $n3 and output them in ascending order.
At the end of foreach, get rid of duplicates
My approach:
# Insert below code to foreach
$numbers{$n1} = 1;
$numbers{$n2} = 1;
$numbers{$n3} = 1;
#keys = sort { $numbers{$b} <=> $numbers{$a} } keys %numbers;
#push #numbers, "$keys[0] $keys[1] $keys[2]";
$numbers2{"$keys[0] $keys[1] $keys[2]"} = 1;
This defines two hashes: 1st hash is for sorting, 2nd hash is for removing duplicates after sorting.
Is there any better approach?
Thanks,
Updated with another solution -- it is the lines that may be duplicate, not numbers on a line.
In order to remove duplicate lines it is easiest if we have all sorted lines of three numbers in an array. Then we post-process that by running them through uniq. There are (at least) two ways to do this.
Store lines in an array, each being a reference to a sorted array with the three numbers. Then for comparison construct a string of each on the fly. This is better if there is yet other processing of numbers somewhere along, as they are in arrays.
Build a string out of each sorted line and store those in an array. Then it's easier to compare.
Below I use the first approach, assuming that there is other processing of numbers.
use warnings;
use strict;
use feature wq(say);
use List::MoreUtils qw(uniq);
my $file = 'sort_nums.txt';
my #content = do {
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my #linerefs_all;
foreach my $line (#content) {
# Calculations ... numbers stored in ($n1, $n2, $n3, $n4)
my ($n1, $n2, $n3) = split '\s+' $line; # FOR TESTING
# Add to #rlines a reference to the sorted array with first three
push #linerefs, [ sort { $a <=> $b } ($n1, $n2, $n3) ];
}
# Remove dupes by comparing line-arrays as strings, then remake arrayrefs
my #linerefs = map { [ split ] } uniq map { join ' ', #$_ } #linerefs_all;
say "#$_" for #linerefs;
Using the posted lines in the file sort_nums.txt, the code above prints
1 2 3
2 8 9
100 106 321
69 300 302
Explanation of the post-processing line, read from the right.
The map on the right processes a list of arrayrefs. It dereferences each and joins its elements with a space, forming a string for the line. It returns a list of such strings, one per line.
That list is pruned of duplicates by uniq, which itself returns a list, fed into the map on the left.
In that map's block each string is split by (the default) white space into a list (of numbers on the line), and then a reference of that is taken by [ ]. This map thus returns a list of references to arrays, one for each line, what is assigned to #linerefs.
This is then printed.
If this is too much to stomach for one statement break the process into steps, generating intermediate arrays. Or switch to the second method above.
Initial post, assuming that numbers on each line may be duplicates
I take the objective to be: sort three variables and keep only unique ones, for each line.
use List::MoreUtils qw(uniq);
foreach my $line (#content) {
# Calculations, numbers stored in ($n1, $n2, $n3, $n4)
my #nums = uniq sort { $a <=> $b } ($n1, $n2, $n3);
say "#nums";
}
Remember that after this you don't know which one(s) of $n1, $n2, $n3 may have been dropped.
If, for some reason, a non-core module is not suitable, see this in perlfaq4 For example,
my %seen = ();
my #nums = sort { $a <=> $b } grep { ! $seen{$_}++ } ($n1, $n2, $n3);
or, if you need it without an extra hash around
my #nums = do { my %seen; sort { $a <=> $b } grep { !$seen{$_}++ } ($n1, $n2, $n3) };
Related
open INP,"<C:\\Users\\hanadi\\Documents\\cs.txt";
while ($line=<INP>)
{
chomp($line);
#list=split/\s+/,$line;
foreach $w (#list)
{
$wordfreq{$w}++;
}
}
foreach $w2(keys>wordfreq)
{
print "$w2==>$wordfreq{$w}";
}
I want to print each word and its frequency.now i want code in Perl to jump and Print the above information for the next
ranks (>100), but do not print
every line, print only one line for every 1000 words (otherwise there will
be too many lines to print) in decreasing order of frequency and
decreasing alphabetical order among words with the same frequency.
The first issue of this problem is to define the word "word." Am assuming, by one of your comments, that punctuation is not part of a "word," since you were asking how to "...delete punctuations from the text..."
One solution to this is to use a regex to capture only "word" characters, i.e., alphanumeric and underscore, by matching the text against \w in the regex.
Building a hash, where the keys are the words and the associated values are the counts, is the way to go. However, when doing this, you need to insure that the keys are all the same case, i.e., either all UPPER or all lower.
After you've built this hash, you can sort the output in descending order by the has values (frequency) and use a counter to print just the top 100 words. There will be words with the same frequency count--especially having only one occurrence. How do you want these printed, as it can make a difference whether they appear in the list of top 100. I'd suggest ordering these cases alphabetically.
Give the above, consider the following solution, which uses the text above the code below as the corpus:
use strict;
use warnings;
my %hash;
open my $fh, '<', 'words.txt' or die $!;
while (<$fh>) {
$hash{ lc $1 }++ while /(\w+)/g;
}
close $fh;
my $i = 1;
for my $word ( sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash ) {
print "$i. $word: ($hash{$word})\n" if $i++ < 100 or !( $i % 1000 );
}
Partial output:
1. the: (22)
2. to: (8)
3. a: (5)
4. you: (5)
5. is: (4)
6. of: (4)
7. this: (4)
8. word: (4)
9. all: (3)
10. and: (3)
...
96. punctuation: (1)
97. punctuations: (1)
98. since: (1)
99. sort: (1)
100. suggest: (1)
Limitations:
One issue that results from capturing word characters can be seen in cases of some possessive forms of words, e.g., word's. In this case, both word and s would be captured as words. If you want to retain such punctuation, and split on whitespace, you can just use the following instead of the regex line:
$hash{ lc $_ }++ for split ' ';
Hope this helps!
please pay attention to all of amons reply, and always rtq. (good spot amon).
(i have determined) your problem is that once you have constructed your wordcount hash, you now need to invert the hash so you can sort the values into some kind of order. The problem with this is that more than one word may have the same count and would overwrite earlier stored words.
To do this you need to store an array in a hash value, and this is done by using a reference to an array. Hash values may only be scalars, an array is not a scalar, but a reference to an array is.
In this re-write of your problem, I have updated the open and close calls to use scalar filehandles, with appropriate error handling (or die), and converted your foreach statements into 'maps'. These can take a bit of time to grasp so do not copy and paste them. Rather, focus on the inverting of the hash and how the array is accessed. This is possibly quite complex for you so I have left this parts in foreach style.
The 'each' keyword takes a key/value pair from the hash, and is often used this way to process hashes in while statements.
You will still need to work on converting the counts into frequencies as per amons suggestion and retrieving the top 100. There is a clue to the counting in the '$c' variable.
#!/usr/bin/perl
# word count #wct.pl
use warnings;
use strict;
my (%wordfreq);
open my $input, '<', 'wc.txt'
or die "cannot open wc txt file $!";
map { $wordfreq{ $_ }++; } (split /\s+/, $_) while <$input> ;
close $input
or die "cannot close wc txt file $!";
# print (
# map {"$_ has ". $wordfreq{$_} . "\n" } (keys %wordfreq)
# );
foreach (keys %wordfreq){
# print "$_ has ". $wordfreq{$_} . "\n"
}
my %invertedhash;
while (my ($key,$value) = each %wordfreq){
push #{$invertedhash{$value}}, $key;
}
my $c;
foreach (reverse sort keys %invertedhash){
last if $c++ == 2;
print "words with a count of $_ are #{$invertedhash{$_}} \n";
}
exit 0;
sample
one two two
three three three four
four four four five
five five five
produces
words with a count of 4 are four five
words with a count of 3 are three
Hope this helps.
I have a table like this:
+ chr13 25017807 6
+ chr10 128074490 1
- chr7 140968671 1
+ chr10 79171976 3
- chr7 140968671 1
+ chr12 4054997 6
+ chr13 25017807 6
+ chr15 99504255 6
- chr8 91568709 5
It has been already read into Perl as a string variable (the returned value of an external shell script). I need to remove the duplicated lines and sort the table by the last column, and then print it out. How should I do it in Perl? Thanks!
Assuming the data is contained in the string $string, this solution would work:
my %seen; # just needed to remove duplicates
my $deduped_string =
join "\n", # 6. join the lines to a single string
map { join(" ", #$_) } # 5. join the fields of each line to a string
sort { $a->[-1] <=> $b->[-1] } # 4. sort arrayrefs by last field, numerically
map { [split] } # 3. split line into fields, store in anon arrayref
grep { not $seen{$_}++ } # 2. dedupe the lines
split /\n/, $string; # 1. split string into lines
This gargantuan expression executes from bottom towards top (or right to left). It consists of multiple composable transformers and filters:
map {BLOCK} LIST applies the code in the block to each value of the list. It transforms the list element-wise.
grep {BLOCK} LIST selects those elements from the list where the block returns true. It therefore filters the list and only outputs elements that satisfy a certain condition.
sort {BLOCK} LIST resorts the list. The block must return -1 if $a is less than $b, 1 if it is greater, or zero when equal. The <=> operator compares scalars numerically in this fashion. If the sort function is omitted, string comparision is used.
join STRING, LIST concatenates the elements of the list with the string in between.
split REGEX, STRING splits the string into pieces. The regex matches the delimiter (not usually returned). split and join can be considered inverse operations. If the string is omitted, $_ is used. When the regex is omitted, it works similarly to split /\s+/, $_, i.e. splits at every whitespace character.
This solution uses at its heart the Schwartzian Transform, a technique/idiom that enables cheap sorting by expensive-to-calculate keys. In it's general form, it is
my #sorted_data =
map { $_->[0] } # 3. map back to the orginal value
sort { $a->[1] <=> $b->[1] } # 2. sort by the special key
map { [$_, create_the_key($_)] } # 1. annotate each value with a key
#data;
In my specific case, the special key is the last column of each record; To obtain the original data (or an equivalent form) from the annotated data I join the fields together. As mpapec points out, I could also have carried the original line through the transform; this would preserve the original alignment of the lines.
For a beginner, I'd do it like this:
use strict; use warnings;
my $file = "table.txt";
open(my $fh, "<", $file) || die "Can't open $file: $!\n";
my #lines;
# read the file and save a transformed version to #lines
while (my $line = <$fh>) {
chomp($line); # remove final newline
$line =~ s/ +/:/gi; # make ":" the new separator
my #fields = split(/:/,$line); # split at the separator
my $newline = "$fields[4]:$fields[1]:$fields[2]:$fields[3]"; # reorder fields
push(#lines, $newline); # save the new line
}
#lines = sort(#lines); # sort lines alphabetically:
# duplicate lines are now consecutive
my $uniqline=""; # the last unique line
foreach my $line (#lines) {
# do this if the current line isn't string-equal to the last line
# (i.e. skip all lines that are equal to the previous line)
if ($uniqline ne $line) {
$uniqline = $line; # remember the last line
# print fields in original order
my #fields = split(/:/,$line);
printf(" %s %7s %11s %s\n",$fields[1],$fields[2],$fields[3],$fields[0]);
}
}
I got slightly different result...
+ chr10 128074490 1
- chr7 140968671 1
+ chr10 79171976 3
- chr8 91568709 5
+ chr12 4054997 6
+ chr13 25017807 6
+ chr15 99504255 6
Filters out duplicate lines, and sorts by the last column at the end,
perl -ane 'next if $s{$_}++; push #r,[$_,#F]}{ print $$_[0] for sort { $$a[-1] <=> $$b[-1] } #r' file
Almost same as,
use strict;
use warnings;
open my $fh, "file" or die $!;
my (%seen_line, #result_unique_lines);
while (<$fh>) {
# $_ => content of current line
# skip current if it's duplicate
next if $seen_line{$_}++;
my #line_values = split;
push #result_unique_lines, [$_, #line_values];
}
close $fh;
# sort lines
#result_unique_lines = sort { $a->[-1] <=> $b->[-1] } #result_unique_lines;
for my $aref (#result_unique_lines) {
my $line = $aref->[0];
print $line;
}
I have stored values in 5 text files. The values in each text file should be considered as an array. I am trying to write a perl program, to read and print the common elements in these 5 arrays.
For Instance
#a1=(1,7,4,5);
#a2=(1,9,4,5);
#a3=qw(1,6,4,5 );
#a4=qw(1 2 4 5 );
#a5=qw(1 2 4 5 );
I expect to print
1 4 5
The perlfaq has lots of answer to questions that are frequently asked. Of course it's all a bit of a waste of time and effort if no-one bothers to check there before asking the question again :-)
How do I compute the difference of two arrays? How do I compute the
intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (#union, #intersection, #difference);
my %count = ();
foreach my $element (#array1, #array2) { $count{$element}++ }
foreach my $element (keys %count) {
push #union, $element;
push #{ $count{$element} > 1 ? \#intersection : \#difference }, $element;
}
You need the intersection of two arrays. And then do it three more times.
You don't say what format your input files have, but this program will find all digit strings in each file and list the values common to all of them.
The list of input files is expected as command-line arguments.
use strict;
use warnings;
use File::Slurp 'read_file';
my %counts;
for (#ARGV) {
$counts{$_}++ for map /\d+/g, read_file $_;
}
my #common = grep $counts{$_} == #ARGV, keys %counts;
printf "(%s)\n", join ', ', #common;
output
(4, 1, 5)
I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (which comprise a single line) are present in each.
Here is an example of input file format:
HWI-AT555:86:D0:6:2208:13551:55125 122 chr1 77028 255 94M555N7M * 0 0 GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&& NM:i:3 XS:A:+ NH:i:1
The important part is the sequence read id, which is the first column (ie HWI-....55125). This is what I want to use to compare the two files so that I can count the number of duplicates/copies.
Here is what I have so far:
unless (#ARGV == 2) {
print "Use as follows: perl program.pl in1.file in2.file\n";
die;
}
my $in1 = $ARGV[0];
my $in2 = $ARGV[1];
open ONE, $in1;
open TWO, $in2;
my %hash1;
my #hit;
while (<ONE>){
chomp;
my #hit = split(/\t/, $_);
$hash1{$hit[0]}=1;
}
close ONE;
my #col;
while (<TWO>){
chomp;
my #col = split(/\t/, $_);
if ($col[0] =~ /^H/){ #only valid sequence read lines start with "H"
print "$col[0]\n" if defined($hash1{$_});
}
}
close TWO;
So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.
I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn't make something clear enough.
Initialize your %hash1 with zeros instead of ones:
while (<ONE>){
chomp;
my #hit = split(/\t/, $_);
# Start them as "0" for "no duplicates".
$hash1{$hit[0]} = 0;
}
Then, in your second loop, you can increment $hash1{$col[0]}:
while (<TWO>){
chomp;
my #col = split(/\t/, $_);
# Increment the counter if %hash1 has what we're looking for.
++$hash1{$col[0]} if(exists($hash1{$col[0]}));
}
There's no need to check $col[0] =~ /^H/ since %hash1 will only have entries for valid sequences, so you can just do an exists check on the hash. And you want to look at $hash1{$col[0]} rather than $hash1{$_} since you're only storing the first part of the lines in your first loop, $_ will have the whole line. Furthermore, if you're just grabbing the first field of each line you don't need the chomp calls but they do no harm so you can keep them if you want.
This leaves you with the all the repeated entries in %hash1 as entries with non-zero values and you can grep those out:
my #dups = grep { $hash1{$_} > 0 } keys %hash1;
And then display them with their counts:
for my $k (sort #dups) {
print "$k\t$hash1{$k}\n";
}
You could also check the counts while displaying the matches:
for my $k (sort keys %hash1) {
print "$k\t$hash1{$k}\n" if($hash1{$k} > 0);
}
I'm looking for help sorting an array where each element is made up of "a number, then a string, then a number". I would like to sort on the first number part of the array elements, descending (so that I list the higher numbers first), while also listing the text etc.
am still a beginner so alternatives to the below are also welcome
use strict;
use warnings;
my #arr = map {int( rand(49) + 1) } ( 1..100 ); # build an array of 100 random numbers between 1 and 49
my #count2;
foreach my $i (1..49) {
my #count = join(',', #arr) =~ m/$i,/g; # maybe try to make a string only once then search trough it... ???
my $count1 = scalar(#count); # I want this $count1 to be the number of times each of the numbers($i) was found within the string/array.
push(#count2, $count1 ." times for ". $i); # pushing a "number then text and a number / scalar, string, scalar" to an array.
}
#for (#count2) {print "$_\n";}
# try to add up all numbers in the first coloum to make sure they == 100
#sort #count2 and print the top 7
#count2 = sort {$b <=> $a} #count2; # try to stop printout of this, or sort on =~ m/^anumber/ ??? or just on the first one or two \d
foreach my $i (0..6) {
print $count2[$i] ."\n"; # seems to be sorted right anyway
}
First, store your data in an array, not in a string:
# inside the first loop, replace your line with the push() with this one:
push(#count2, [$count1, $i];
Then you can easily sort by the first element of each subarray:
my #sorted = sort { $b->[0] <=> $a->[0] } #count2;
And when you print it, construct the string:
printf "%d times for %d\n", $sorted[$i][0], $sorted[$i][1];
See also: http://perldoc.perl.org/perlreftut.html, perlfaq4
Taking your requirements as is. You're probably better off not embedding count information in a string. However, I'll take it as a learning exercise.
Note, I am trading memory for brevity and likely speed by using a hash to do the counting.
However, the sort could be optimized by using a Schwartzian Transform.
EDIT: Create results array using only numbers that were drawn
#!/usr/bin/perl
use strict; use warnings;
my #arr = map {int( rand(49) + 1) } ( 1..100 );
my %counts;
++$counts{$_} for #arr;
my #result = map sprintf('%d times for %d', $counts{$_}, $_),
sort {$counts{$a} <=> $counts{$b}} keys %counts;
print "$_\n" for #result;
However, I'd probably have done something like this:
#!/usr/bin/perl
use strict; use warnings;
use YAML;
my #arr;
$#arr = 99; #initialize #arr capacity to 100 elements
my %counts;
for my $i (0 .. 99) {
my $n = int(rand(49) + 1); # pick a number
$arr[ $i ] = $n; # store it
++$counts{ $n }; # update count
}
# sort keys according to counts, keys of %counts has only the numbers drawn
# for each number drawn, create an anonymous array ref where the first element
# is the number drawn, and the second element is the number of times it was drawn
# and put it in the #result array
my #result = map [$_, $counts{$_}],
sort {$counts{$a} <=> $counts{$b} }
keys %counts;
print Dump \#result;