I would like to sort files numerically using Perl script.
My files looks like below:
1:file1:filed2
3:filed1:field2
10:filed1:field2
4:field1:field2
7:field1:field2
I would like to display it as:
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
The way sort works in perl, is that it works through your list, setting each element to $a and $b - then testing those. By default, it uses cmp which is an alphanumeric sort.
You've also got <=> which is a numeric sort, and the kind you're looking for. (Alpha sorts 10 ahead of 2 ).
So - all we need do is extract the numeric value of your key. There's a number of ways you could do this - the obvious being to take a subroutine that temporarily copies the variables:
#!/usr/bin/env perl
use strict;
use warnings;
sub compare_first_num {
my ( $a1 ) = split ( /:/, $a );
my ( $b1 ) = split ( /:/, $b );
return $a1 <=> $b1;
}
print sort compare_first_num <>;
This uses <> - the magic filehandle - to read STDIN or files specified on command line.
Or alternatively, in newer perls (5.16+):
print sort { $a =~ s/:.*//r <=> $b =~ s/:.*//r } <>;
We use the 'substitute-and-return' operation to compare just the substrings we're interested in. (Numerically).
Split on : and store in a hash of arrays. Then you can sort and print out the hash keys:
my %data;
while(<DATA>){
my #field = split(/:/);
$data{$field[0]} = [#field[1..2]];
}
print join (':', $_, #{$data{$_}}) for sort { $a <=> $b } keys %data;
print "\n";
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
For simple and fast solution, use Sort::Key::Natural (fast natural sorting) module:
use warnings;
use strict;
use Sort::Key::Natural qw( natsort );
open my $fh, "<", "file.txt" or die $!;
my #files = natsort <$fh>;
close $fh;
print #files;
Output:
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
Related
Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (#ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my #a = ();
my $x = length $a;
#a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( #a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here is 2) sequences:
seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When using: perl molecular_weight.pl protein.fasta > output.txt
in terminal, it will generate the correct results, however it also presents an error of "Use of unitialized value in addition (+) at molecular_weight.pl line36", which is just localized in line of "$b += $data{$i};" how to fix this bug ? Thanks in advance !
You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)
I am trying to write a small program that takes from command line file(s) and prints out the number of occurrence of a word from all files and in which file it occurs. The first part, finding the number of occurrence of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
while (<>) {
$count{$word}{$ARGV}++ for map +lc, /\w+/g;
}
foreach my $word ( keys %count ) {
my #files = keys %$word; # All files containing lc $word
print "Total word count for '$word': ", sum( #{ $count{$word} }{#files} ), "\n";
for my $file ( #files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}
Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push #{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}: #{ $in_file{$str} }\n";
}
I am a beginner programmer, who has been given a weeklong assignment to build a complex program, but is having a difficult time starting off. I have been given a set of data, and the goal is separate it into two separate arrays by the second column, based on whether the letter is M or F.
this is the code I have thus far:
#!/usr/local/bin/perl
open (FILE, "ssbn1898.txt");
$x=<FILE>;
split/[,]/$x;
#array1=$y;
if #array1[2]="M";
print #array2;
else;
print #array3;
close (FILE);
How do I fixed this? Please try and use the simplest terms possible I stared coding last week!
Thank You
First off - you split on comma, so I'm going to assume your data looks something like this:
one,M
two,F
three,M
four,M
five,F
six,M
There's a few problems with your code:
turn on strict and warnings. The warn you about possible problems with your code
open is better off written as open ( my $input, "<", $filename ) or die $!;
You only actually read one line from <FILE> - because if you assign it to a scalar $x it only reads one line.
you don't actually insert your value into either array.
So to do what you're basically trying to do:
#!/usr/local/bin/perl
use strict;
use warnings;
#define your arrays.
my #M_array;
my #F_array;
#open your file.
open (my $input, "<", 'ssbn1898.txt') or die $!;
#read file one at a time - this sets the implicit variable $_ each loop,
#which is what we use for the split.
while ( <$input> ) {
#remove linefeeds
chomp;
#capture values from either side of the comma.
my ( $name, $id ) = split ( /,/ );
#test if id is M. We _assume_ that if it's not, it must be F.
if ( $id eq "M" ) {
#insert it into our list.
push ( #M_array, $name );
}
else {
push ( #F_array, $name );
}
}
close ( $input );
#print the results
print "M: #M_array\n";
print "F: #F_array\n";
You could probably do this more concisely - I'd suggest perhaps looking at hashes next, because then you can associate key-value pairs.
There's a part function in List::MoreUtils that does exactly what you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use List::MoreUtils 'part';
my ($f, $m) = part { (split /,/)[1] eq 'M' } <DATA>;
say "M: #$m";
say "F: #$f";
__END__
one,M,foo
two,F,bar
three,M,baz
four,M,foo
five,F,bar
six,M,baz
The output is:
M: one,M,foo
three,M,baz
four,M,foo
six,M,baz
F: two,F,bar
five,F,bar
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my #boys=();
my #girls=();
my $fname="ssbn1898.txt"; # I keep stuff like this in a scalar
open (FIN,"< $fname")
or die "$fname:$!";
while ( my $line=<FIN> ) {
chomp $line;
my #f=split(",",$line);
push #boys,$f[0] if $f[1]=~ m/[mM]/;
push #girls,$f[1] if $f[1]=~ m/[gG]/;
}
print Dumper(\#boys);
print Dumper(\#girls);
exit 0;
# Caveats:
# Code is not tested but should work and definitely shows the concepts
#
In fact the same thing...
#!/usr/bin/perl
use strict;
my (#m,#f);
while(<>){
push (#m,$1) if(/(.*),M/);
push (#f,$1) if(/(.*),F/);
}
print "M=#m\nF=#f\n";
Or a "perl -n" (=for all lines do) variant:
#!/usr/bin/perl -n
push (#m,$1) if(/(.*),M/);
push (#f,$1) if(/(.*),F/);
END { print "M=#m\nF=#f\n";}
I have some 1000 files in a directory. Naming convention of the file is like below.
TC_01_abcd_16_07_2014_14_06.txt
TC_02_abcd_16_07_2014_14_06.txt
TC_03_abcd_16_07_2014_14_07.txt
.
.
.
.
TC_100_abcd_16_07_2014_15_16.txt
.
.
.
TC_999_abcd_16_07_2014_17_06.txt
I have written some code like this
my #dir="/var/tmp";
foreach my $inputfile (glob("$dir/*abcd*.txt")) {
print $inputfile."\n";
}
While running this it is not printing in sequence.
it it printing till 09 file then it is printing 1000th file name then
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
TC_11_abcd_16_07_2014_11_56.txt
Please guide me how to print in sequence
The files are sorted according to the rules of shell glob expansion, which is a simple alpha sort. You will need to sort them according to a numeric sort of the first numeric field.
Here is one way to do that:
# Declare a sort comparison sub, which extracts the part of the filename
# which we want to sort on and compares them numerically.
# This sub will be called by the sort function with the variables $a and $b
# set to the list items to be compared
sub compareFilenames {
my ($na) = ($a =~ /TC_(\d+)/);
my ($nb) = ($b =~ /TC_(\d+)/);
return $na <=> $nb;
}
# Now use glob to get the list of filenames, but sort them
# using this comparison
foreach my $file (sort compareFilenames glob("$dir/*abcd*.txt")) {
print "$file\n";
}
See: perldoc for sort
That's printing the files in order -- ASCII order that is.
In ASCII, the underscore (_) is after the digits when sorting. If you want to sort your files in the correct order, you'll have to sort them yourself. Without sort, there's no guarantee that they'll print in any order. Even worse for you, you don't really want to print the files in either numeric sorted order (because the file names aren't numeric) or ASCII order (because you want TC_10 to print before TC_100.
Therefore, you need to write your own sorting routine. Perl gives you the sort command. By default, it will sort in ASCII order. However, you can define your own subroutine to sort in the order you want. sort will pass two values to your in your sort routine $a and $b. What you can do is parse these two values to get the sort keys you want, then use the <=> or cmp operators to return the values in the correct sort order:
#! /usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw(say);
opendir my $dir, 'temp'; # Opens a directory for reading
my #dir_list = readdir $dir;
closedir $dir;
#dir_list = sort { # My sort routine embedded inside the sort command
my $a_val;
my $b_val;
if ( $a =~ /^TC_(\d+)_/ ) {
$a_val = $1;
}
else {
$a_val = 0;
}
if ( $b =~ /^TC_(\d+)_/ ) {
$b_val = $1;
}
else {
$b_val = 0;
}
return $a_val <=> $b_val;
} #dir_list;
for my $file (#dir_list) {
next if $file =~ /^\./;
say "$file";
}
In my sort subroutine am going to take $a and $b and pull out the number you want to sort them by and put that value into $a_val and $b_val. I also have to watch what happens if the files don't have the name I think they may have. Here I simply decide to set the sort value to 0 and hope for the best.
I am using opendir and readdir instead of globbing. This will end up including . and .. in my list, and it will include any file that starts with .. No problem, I'll remove these when I print out the list.
In my test, this prints out:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_11_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
Files are sorted numerically by the first set of digits after TC_.
Here you go:
#!/usr/bin/perl
use warnings;
use strict;
sub by_substring{
$a=~ /(\d+)/;
my $x=$1;
$b=~ /(\d+)/;
my $y=$1;
return $x <=> $y;
}
my #files=<*.txt>;
#files = sort by_substring #files;
for my $inputfile (#files){
print $inputfile."\n";
}
It will not matter if your filenames start with "TC" or "BD" or "President Carter", this will just use the first set of adjacent digits for the sorting.
the sort in the directory will be alphanumeric, hence your effect. i do not know how to sort glob by creation date, here is a workaround:
my #dir="/var/tmp";
my #files = glob("$dir/*abcd*.txt");
my #sorted_files;
for my $filename (#files) {
my ($number) = $filename =~ m/TC_(\d+)_abcd/;
$sorted_files[$number] = $filename;
}
print join "\n", #sorted_filenames;
How can I find the top 100 most used strings (words) in a .txt file using Perl? So far I have the following:
use 5.012;
use warnings;
open(my $file, "<", "file.txt");
my %word_count;
while (my $line = <$file>) {
foreach my $word (split ' ', $line) {
$word_count{$word}++;
}
}
for my $word (sort keys %word_count) {
print "'$word': $word_count{$word}\n";
}
But this only counts each word, and organizes it in alphabetical order. I want the top 100 most frequently used words in the file, sorted by number of occurrences. Any ideas?
Related: Count number of times string repeated in files perl
From reading the fine perlfaq4(1) manpage, one learns how to sort hashes by value. So try this. It’s rather more idiomatically “perlian” than your approach.
#!/usr/bin/env perl
use v5.12;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:utf8 :std);
my %seen;
while (<>) {
$seen{$_}++ for split /\W+/; # or just split;
}
my $count = 0;
for (sort {
$seen{$b} <=> $seen{$a}
||
lc($a) cmp lc($b) # XXX: should be v5.16's fc() instead
||
$a cmp $b
} keys %seen)
{
next unless /\w/;
printf "%-20s %5d\n", $_, $seen{$_};
last if ++$count > 100;
}
When run against itself, the first 10 lines of output are:
seen 6
use 5
_ 3
a 3
b 3
cmp 2
count 2
for 2
lc 2
my 2