Column number in Perl

So I'm sure this is somewhere on the site, but as always, I have looked high and low before asking a question.
In Bash, you can use certain flags on some commands (such as -k[number] on sort) to work with a certain column of a text file. What is the method for doing this in Perl? For example, given a line from my input file:
Jess 6 8 25000
Say that I want to run a statement
if (k2 =< 6)
{
print "foo";
}
Of course, k2 doesn't work in Perl. May someone show me (or link me to) how this is done?

You can try it with command-line Perl as well. With this input:
$ cat reubens.txt
0 5 0
0 10 0
0 15 0
0 20 0
0 1 0
0 10 0
$ perl -lane ' print "The second column is ", $F[1] < 10 ? "less than 10": $F[1]==10 ? "equal to 10" : "more than 10" ' reubens.txt
The second column is less than 10
The second column is equal to 10
The second column is more than 10
The second column is more than 10
The second column is less than 10
The second column is equal to 10
$

If you only need to do this for exactly one column and want to avoid keeping the result of split around as an array (otherwise, use split as mentioned in the comments):
perl -ne"/(?:\w+\s+){1}(\w+\b)/;print $1.\"\n\""
This will print the column of word-like characters between whitespace, identified by the number inside {} (here 1), counting columns from 0.
E.g. it prints "6" for the input example above, using "1".
How:
Make a regular expression for a column followed by space,
(?:\w+\s+)
require it a number of times,
{1}
then capture a column followed by anything that is not a word character (including end of line)
(\w+\b)
The desired column is found in the grabbed string
$1
I did this as a command-line one-liner which expects standard input, to be able to test it.
Please adapt it into your script (a sketch of that follows below).
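A minimal sketch of how the same idea might look inside a script; the column index and file name here are placeholders, not taken from the question:

use strict;
use warnings;

my $col  = 1;            # column to grab, counting from 0 (here: the second column)
my $file = 'input.txt';  # hypothetical input file name

open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) {
    # Skip $col whitespace-separated columns, then capture the next one
    if ($line =~ /^\s*(?:\w+\s+){$col}(\w+)\b/) {
        print "foo\n" if $1 <= 6;   # the comparison from the question
    }
}
close $fh;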

This will check the second column:
(split)[1]
Sample script:
use strict;
use warnings;

my $filename = 'input';
open my $fh, '<', $filename or die "Can not open $filename.";

print "\n";
while (<$fh>) {
    # The test
    if ((split)[1] < 10) {
        print "The second column is less than ten\n";
    }
    elsif ((split)[1] > 10) {
        print "The second column is more than ten\n";
    }
    else {
        print "The second column is equal to ten\n";
    }
}
Input:
#Input file
0 5 0
0 10 0
0 15 0
0 20 0
0 1 0
0 10 0
Output:
The second column is less than ten
The second column is equal to ten
The second column is more than ten
The second column is more than ten
The second column is less than ten
The second column is equal to ten
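If the field is needed more than once, another option (a sketch, using the same input file) is to split each line just once and keep the fields in an array:

use strict;
use warnings;

my $filename = 'input';
open my $fh, '<', $filename or die "Can not open $filename: $!";

while (my $line = <$fh>) {
    my @fields = split ' ', $line;   # split once; $fields[1] is the second column
    if    ($fields[1] < 10) { print "The second column is less than ten\n"; }
    elsif ($fields[1] > 10) { print "The second column is more than ten\n"; }
    else                    { print "The second column is equal to ten\n"; }
}
close $fh;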

Randomly pick a region and process it, a number of times

I have data like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to randomly pick a region of 10 letters from it and then count the number of F characters in it; I want to do that a certain number of times, for example 1000 times or even more.
as an example, I randomly pick
LVPSLTRYLT 0
then
ITNLRSFIHK 1
then again randomly pick 10 consecutive letters
AHSRIRKERP 0
This continues until the requested number of runs is reached. I want to store all the randomly selected regions with their counts, because I then want to calculate how many times F is seen.
So I do the following
# first I remove the header
grep -v ">" data.txt > out.txt
then, to randomly get one region of 10 letters, I tried to use shuf, with no success:
shuf -n1000 data.txt
then I tried to use awk and was not successful either
awk 'BEGIN {srand()} !/^$/ { if (rand() == 10) print $0}'
then calculate the number of F and save it in the file
grep -i -e [F] |wc -l
Note, we should not pick up the same region twice
I have to assume some things here and leave some restrictions in place:
Random regions to pick don't depend in any way on specific lines
Order doesn't matter; there need be N regions spread out through the file
File can be a Gigabyte in size, so can't read it whole (would be much easier!)
There are unhandled (edge or unlikely) cases, discussed after code
First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::MoreUtils qw(uniq);

my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);

GetOptions(
    'num-regions|n=i' => \$num_regions,
    'region-len|l=i'  => \$region_len,
    'char|c=s'        => \$count_freq_for,
) or usage();
my $file = shift || usage();

# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);

my @rand = uniq sort { $a <=> $b }
    map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";

my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0);
my $region;

while (my $line = <$fh>) {
    chomp $line;
    # Total number of characters so far, up to this line and with this line
    $nchars_prev = $nchars;
    $nchars += length $line;
    next if $line =~ /^\s*>sp/;

    # Complete the region if there weren't enough chars on the previous line
    if ($chars_left > 0) {
        $region .= substr $line, 0, $chars_left;
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1;
    };

    # Random positions that happen to be on this line
    my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
    # say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;

    for (@pos) {
        my $pos_in_line = $_ - $nchars_prev;
        $region = substr $line, $pos_in_line, $region_len;

        # Don't print if there aren't enough chars left on this line
        last if ( $chars_left =
            ($region_len - (length($line) - $pos_in_line)) ) > 0;

        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
    }
}

sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0 if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
    }
    seek $fh, $curr_pos, 0;  # leave it as we found it
    return $skip_len / ($skip_len+$data_len);
}

sub usage {
    say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
    exit;
}
Uncomment the srand line to get the same run every time, for testing.
Notes follow.
Some corner cases
If the 10-long window doesn't fit on the line from its random position it is completed in the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125 then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than having one region fewer).
When an incomplete region is filled up from the next line (the first if in the while loop), it is possible that that line is itself shorter than the remaining needed characters ($chars_left). This is very unlikely and needs an additional check there, which is left out (a sketch of one possible check follows this list).
Random numbers are pruned of duplicates. This skews the sequence, but only minutely, which should not matter here; we may also end up with slightly fewer numbers than asked for.
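For the second point above, a sketch of one possible extra check; the behaviour (if the next line is still too short, keep carrying the deficit forward rather than printing a short region) is my assumption:

if ($chars_left > 0) {
    $region .= substr $line, 0, $chars_left;
    if (length($region) == $region_len) {              # only print a full-length region
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1;
    }
    else {
        $chars_left = $region_len - length $region;    # still incomplete, keep filling
    }
}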
Handling of issues regarding randomness
"Randomness" here is pretty basic, which seems suitable. We also need to consider the following.
Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But >sp lines are skipped, and any of our numbers that fall within those lines won't be used, so we may end up with fewer regions than numbers drawn. Those lines are shorter, thus less likely to have numbers on them, so not many numbers are lost, but in some runs I saw as many as 3 out of 10 numbers skipped, ending up with a random sample 70% of the desired size.
If this is a bother, there are ways to approach it. To not skew the distribution even further they all should involve pre-processing the file.
The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but which should still produce the number of regions close to desired for large enough files.
More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines, and then re-sample to account for that. This may still mess with the distribution, which arguably isn't an issue here, but more to the point it may simply be unneeded.
In all this you read the big file twice. The extra processing time should only be in the seconds, but if this is unacceptable, change the function fraction_skipped to read through only 10-20% of the file; with large files this should still provide a reasonable estimate (a sketch of such a variant follows).
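A sketch of that sampling variant, assuming we stop after reading roughly the first 10% of the file by bytes; the 0.10 cutoff and the extra $file argument are my assumptions:

sub fraction_skipped_sampled {
    my ($fh, $file) = @_;
    my $limit = 0.10 * -s $file;                 # how much of the file to examine
    my ($skip_len, $data_len) = (0, 0);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0 if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
        last if tell($fh) > $limit;              # stop once the sample has been read
    }
    seek $fh, $curr_pos, 0;                      # leave the handle as we found it
    return $skip_len / ($skip_len + $data_len);
}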
Note on a particular test case
With srand(10) (commented-out line near the beginning) we get the random numbers such that on one line the region starts 8 characters before the end of the line! So that case does test the code to complete the region on the next line.
Here is a simple driver to run the above a given number of times, for statistics.
Doing it using builtin tools (system, qx) is altogether harder and libraries (modules) help. I use IPC::Run here. There are quite a few other options.†
Adjust and add code to process as needed for statistics; output is in files.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use IPC::Run qw(run);

my $outdir = 'rr_output';        # pick a directory name
mkdir $outdir if not -d $outdir;
my $prog  = 'random_regions.pl'; # your name for the program
my $input = 'data_file.txt';     # your name for input file
my $ch    = 'F';
my ($runs, $regions, $len) = (10, 10, 10);

GetOptions(
    'runs|n=i'  => \$runs,
    'regions=i' => \$regions,
    'length=i'  => \$len,
    'char=s'    => \$ch,
    'input=s'   => \$input
) or usage();

my @cmd = ( $prog, $input,
    '--num-regions', $regions,
    '--region-len',  $len,
    '--char',        $ch
);

say "Run: @cmd, $runs times.";

for my $n (1..$runs) {
    my $outfile = "$outdir/regions_r$n.txt";
    say "Run #$n, output in: $outfile";
    run \@cmd, '>', $outfile or die "Error with @cmd: $!";
}

sub usage {
    say STDERR "Usage: $0 [options]", "\n\toptions: ...";
    exit;
}
Please expand on the error checking. See for instance this post and its links for details.
Simplest use: driver_random.pl -n 4, but you can pass all of the main program's parameters.
The called program (random_regions.pl above) must be executable.
†   Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See links (examples) assembled in this post, for example.
awk to the rescue!
You didn't specify, but there are two random actions going on; I treated them independently, which may not be what you intend. First, picking a line, and second, picking a random 10-letter substring from that line.
This assumes the file (or actually half of it) fits in memory. Otherwise, split the file into equal chunks and run this on the chunks. Doing so will reduce some of the clustering, though I'm not sure how important that is in this case. (With one big file it's possible that all samples are drawn from the first half; splitting eliminates that possibility, which for certain cases is a desired property. I don't know your case.)
$ awk 'BEGIN {srand()}
!/^>/ {a[++n]=$0}
END {while(i++<1000)
{line=a[int(rand()*n)+1];
s=int(rand()*(length(line)-9));
print ss=substr(line,s,10), gsub(/F/,"",ss)}}' file
GERISPKDRC 0
QDEPRCLHEW 0
LLYQLFRNLF 2
GTHGAGAMES 0
TKALQDVQIR 0
FCVHTKALQD 1
SNKAQVKPGQ 0
CMECQGHGER 0
TRRFVGHTKD 1
...
Here is one solution using Perl.
It slurps the entire file into memory, then removes the lines starting with >.
Here I'm looping 10 times ($i<10); you can increase the count.
The rand function is called with the length of the remaining data, and substr takes 10 characters starting at that random offset. The $s!~/\n/ guard makes sure we don't choose a substring that crosses a newline.
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f\n" ;$i++} else { $i-- } } ' random10.txt
ENTQLLETKN 0
LSEGALSPDG 0
LRKARAEAED 0
RLWDLTTGTT 0
KWSGRCGLGY 0
TRRFVGHTKD 1
PVKRPIPHPA 0
GMVQQIQSVC 0
LTHPVLSFGI 1
KVNFPENGFL 2
$
To also print the random offset that was generated:
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f $x\n" ;$i++} else { $i-- } } ' random10.txt
QLDGSLTMSS 0 1378.61409368207
DLIAKVDELT 0 1703.46689004765
SGGGANGTSF 1 900.269562152326
PEELTLSPKL 0 1368.55540468164
TCLSEGALSP 0 1016.50744004085
NRTWNSSAVP 0 23.7868578293154
VNFPENGFLS 2 363.527933104776
NSGLTWSGND 0 48.656607650744
MILSASRDKT 0 422.67705815168
RRGEDLFMCM 1 290.828530365
AGDGLLTPDA 0 1481.78080339531
$
Since your input file is huge I'd do it in these steps:
select random 10-char strings from each line of your input file
shuffle those to get the number of samples you want in random order
count the Fs
e.g.
$ cat tst.sh
#!/bin/env bash
infile="$1"
sampleSize=10
numSamples=15
awk -v sampleSize="$sampleSize" '
BEGIN { srand() }
!/^>/ {
begPos = int((rand() * sampleSize) + 1)
endPos = length($0) - sampleSize
for (i=begPos; i<=endPos; i+=sampleSize) {
print substr($0,i,sampleSize)
}
}
' "$infile" |
shuf -n "$numSamples"
$ ./tst.sh file
HGDIKCVLNE
QDEPRCLHEW
SEVQAIIEST
THDLRVSLEE
SEWVSCVRFS
LTRYLTLNAS
KDGQKITFHG
SNSPEPQKAV
QGGSKATTPA
QLLETKNALN
LLFCDNHKKQ
DETNYGIPQR
IRFQPQLNPD
LQTIRFSPDI
SLKRCGGFLI
$ ./tst.sh file | awk '{print $0, gsub(/F/,"")}'
SPKLQLDGSL 0
IKLFCVHTKA 1
VVSRCRLRHT 0
SPEPQKAVEQ 0
AYNPKNFSND 1
FGESRPELGS 1
AGDGLLTPDA 0
VGHTKDVLSV 0
VTHDLRVSLE 0
PISLGIFPLP 1
ASQITNLRSF 1
LTRPPEELTL 0
FDRYGEEGLK 1
IYIEGQDEPR 0
WNTLGVCKYT 0
Just change the numSamples from 15 to 1000 or whatever you like when run against your real data.
The above relies on shuf -n being able to handle however much input we throw at it, presumably much as sort does, by paging to disk. If it fails in that regard then obviously you'd have to choose or implement a different tool for that part (one possible Perl stand-in is sketched below). FWIW I tried seq 100000000 | shuf -n 10000 (i.e. 10 times as many input lines as the OP's posted max file length of 10000000, to account for the awk part generating N lines of output per input line, and 10 times as many output lines as the OP's posted 1000) and it worked fine, taking only a few seconds to complete.
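If shuf -n ever did become the bottleneck, one hypothetical stand-in (my suggestion, not part of the original answer) is a one-pass reservoir sampler (Algorithm R) in Perl, used in place of the shuf -n "$numSamples" stage; it only ever holds the sample in memory. The sample size 15 is hard-coded to match numSamples above:

perl -ne '
    BEGIN { $n = 15; srand }                              # sample size
    if ($. <= $n) { $r[$. - 1] = $_ }                     # fill the reservoir
    else { my $i = int rand $.; $r[$i] = $_ if $i < $n }  # replace with probability n/NR
    END { print @r }
'

The awk part of the script would be piped into this instead of into shuf.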

Perl regex for extracting values from complex lines

Input log file:
Nservdrx_cycle 4 servdrx4_cycle
HCS_cellinfo_st[10] (type = (LTE { 2}),cell_param_id = (28)
freq_info = (10560),band_ind = (rsrp_rsrq{ -1}),Qoffset1 = (0)
Pcompensation = (0),Qrxlevmin = (-20),cell_id = (7),
agcreserved{3} = ({ 0, 0, 0 }))
channelisation_code1 16/5 { 4} channelisation_code1
sync_ul_info_st_ (availiable_sync_ul_code = (15),uppch_desired_power =
(20),power_ramping_step = (3),max_sync_ul_trans = (8),uppch_position_info =
(0))
trch_type PCH { 7} trch_type8
last_report 0 zeroth bit
I was trying to extract only the integers from the above input, but I am facing an issue when a string contains an integer at its beginning or end
(e.g. agcreserved{3}, HCS_cellinfo_st[10], Qoffset1).
Here I want to ignore {3}, [10] and the 1 in Qoffset1, but my code does not,
since it extracts every integer.
Here is the simple regex I have written for extracting only integers.
My simple code:
use strict;
use warnings;

my $Ipfile = 'data.txt';
open my $FILE, "<", $Ipfile or die "Couldn't open input file: $!";

my @array;
while (<$FILE>) {
    while ($_ =~ m/( [+-]?\d+ )/xg) {
        push @array, $1;
    }
}
print "@array \n";
Output I am getting for the above input:
4 4 10 2 28 10560 -1 1 0 0 -20 7 3 0 0 0 1 16 5 4 1 15 20 3 8 0 7 8 0
expected output:
4 2 28 10560 -1 0 0 -20 7 0 0 0 4 15 20 3 8 0 7 0
Could somebody help me with an explanation?
You are catching every integer because your regex has no restrictions on which characters can (or can not) come before/after the integer. Remember that the /x modifier only serves to allow whitespace/comments inside your pattern for readability.
Without knowing a bit more about the possible structure of your output data, this modification achieves the desired output:
while ( $_ =~ m! [^[{/\w] ( [+-]?\d+ ) [^/\w]!xg ) {
    push @array, $1;
}
I have added rules before and after the integer to exclude certain characters. So now, we will only capture if:
There is no [, {, /, or word character immediately before the number
There is no / or word character immediately after the number
If your data could have 2-digit numbers in the { N} blocks (e.g. PCH {12}) then this will not capture those and the pattern will need to become much more complex. This solution is therefore quite brittle, without knowing more of the rules about your target data.
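For context, here is a sketch of the adjusted loop dropped back into the asker's script (file name and structure are taken from the question; only the inner regex differs):

use strict;
use warnings;

my $Ipfile = 'data.txt';
open my $FILE, "<", $Ipfile or die "Couldn't open input file: $!";

my @array;
while (<$FILE>) {
    # Skip numbers preceded by [, {, / or a word character,
    # or followed by / or a word character
    while ( $_ =~ m! [^[{/\w] ( [+-]?\d+ ) [^/\w] !xg ) {
        push @array, $1;
    }
}
close $FILE;
print "@array\n";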

Using unix tools (without resorting to R), how to filter rows based on some statistic of the grouping?

I'd like to filter rows of the following table such that it goes from this:
A 1 3 SOME_OTHER_INFO
A 1 4 SOME_OTHER_INFO2
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 3 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
to that:
A 1 3 SOME_OTHER_INFO
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
The filtering criteria are:
1) Group rows based on the first 2 columns.
2) Then, for each group, select the row where the third column is the minimum within the group.
3) Return those rows.
Now it's easy to do something like this in R using a package such as plyr, with commands like this:
ddply(data, .(first_col, second_col), function(x) {
min_idx = which.min(x$third_col);
return(x[min_idx])
})
But I want to know if there is an efficient & elegant way to do this using unix tools on the command line.
Lastly, I almost found a beautiful solution to this using datamash, a recently released GNU tool, but with some glitches.
$ datamash -g 1,2 min 3 -f < file.txt | cut -f1-4
A 1 3 SOME_OTHER_INFO1
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 3 SOME_OTHER_INFO4 # <-- not the correct row I want to grab
The problem is that when using the "-f" flag, it grabs the first row from each group, not the row that the min corresponds to. So if you look at the output above, "B 2 3 SOME_OTHER_INFO4" was selected rather than "B 2 0 SOME_OTHER_INFO4".
Here are a couple more options using perl:
perl -MList::Util=min -lane'
$h{"@F[0,1]"}{$F[2]} = $_
}{
print $h{$_}{ min keys %{$h{$_}} } for sort keys %h
' file
A 1 3 SOME_OTHER_INFO
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
Create a hash of hashes with the outer key being the first two columns and the inner key the third column.
Using the min function from the core module, grab the smallest inner key and print the value, which is the entire line.
or without the core module:
perl -lane'
push @{ $h{"@F[0,1]"} }, [$F[2], $_]
}{
print $_->[1] for sort map {
(sort { $a->[0] <=> $b->[0] } @$_)[0]
} values %h
' file
A 1 3 SOME_OTHER_INFO
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
Create a hash of arrays, with the key being the first two columns and each value an array of [third column, entire line] pairs.
Pull the hash entries with values and, for each group, sort on the first element of the array.
Take the first (smallest) entry and print its second element, which is the entire line.
Dunno what you call efficient or elegant, but this seems to be what you want:
sort -k1 -k2,3n file.txt | rev | uniq -f 2 | rev
If the double rev is considered inelegant (or the actual number of columns varies, in which case it won't work),
sort -k1 -k2,3n file.txt | perl -wane'print if $.==1 || $F[0] ne $last[0] || $F[1] != $last[1]; @last=@F'
Provided you can get the lines sorted in the right order, a simple Awk filter which prints only the first line in a group should work.
sort -k1 -k2n -k3n file.txt |
awk '!a[$1 $2]++'
The Awk script populates an array a with keys from the first two fields, and prints only when it sees a new key.
How about an elegant, efficient regex?
perl -pi'.old' -0777 -e 's/^([a-z]\t[0-9]\t)([0-9]\t\w+\s*)^(\g1[0-9]\t\w+\s*){1,}/$1$2/smgi' file.txt
Slurps in file.txt and replaces consecutive lines where the first two columns are identical with the first occurrence of the line. This version modifies file.txt in place but backs up the original file as file.txt.old.

Insert the highest value among the number of times it occurs

I have two files:
1) Tab file with the following content. Let's call this reference file:
V$HMGIY_01_rc Ncor=0.405
V$CACD_01 Ncor=0.405
V$GKLF_02 Ncor=0.650
V$AML2_Q3 Ncor=0.792
V$WT1_Q6 Ncor=0.607
V$KID3_01 Ncor=0.668
V$CNOT3_01 Ncor=0.491
V$KROX_Q6 Ncor=0.423
V$ETF_Q6_rc Ncor=0.547
V$E2F_Q2_rc Ncor=0.653
V$SP1_Q6_01_rc Ncor=0.650
V$SP4_Q5 Ncor=0.660
2) The second tab file contains the search string X as shown below. Let's call this file search_string:
A X
NF-E2_SC-22827 NF-E2
NRSF NRSF
NFATC1_SC-17834 NFATC1
NFKB NFKB
TCF3_SC-349 TCF3
MEF2A MEF2A
What I have already done is: take the first search term (from the search_string file, column X) and check if it occurs in the first column of the reference file. Example: the first search term is NF-E2. I checked if this string occurs in the first column of the reference file. If it occurs, give a score of 1, else 0. I have also counted the number of times it matches the pattern. Now my output is of the format:
Keyword Keyword in file? Number of times keyword occurs in file
NF-E2 1 3
NRSF 0 0
NFATC1 0 0
NFKB 1 7
TCF3 0 0
Now, in addition to this, what I would like to add is the highest Ncor value for each string in each file. Say for example: while I search for NF-E2 in NF-E2.txt, the Ncor values present are: 3.02, 2.87 and 4.59. Then I want the value 4.59 to be printed in the next column. So now my output should look like:
Keyword Keyword in file? Number of times keyword occurs in file Ncor
NF-E2 1 3 4.59
NRSF 0 0
NFATC1 0 0
NFKB 1 7 1.66
TCF3 0 0
Please note: I need to search for each string in a different file, i.e. the first string (NF-E2) should be searched for in file NF-E2.tab, the second string (NRSF) in file NRSF.tab, and so on.
Here is my code:
perl -lanE '$str=$F[1]; $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2
Please help!
This should work:
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    my $keyword = (split /\s+/)[1];
    my $file = "/home/$keyword/list/${keyword}.txt";
    open my $reference, '<', "$file" or die "Cannot open $file: $!";

    my $key_cnt  = 0;
    my $max_ncor = 0;
    while (my $line = <$reference>) {
        my ($string, undef, $ncor) = split /\s+|=/, $line;
        if ($string =~ $keyword) {
            $key_cnt++;
            $max_ncor = $ncor if ($max_ncor < $ncor);
        }
    }
    print join("\t", $keyword, $key_cnt ? 1 : 0, $key_cnt, $key_cnt ? $max_ncor : ''), "\n";
}
Run it like this:
perl t.pl search_string.txt

Sorting lines according to the numerical value of the first element on each line?

I have just started learning Perl, hence my question might seem very silly. I apologize in advance.
I have an array, say @data, which contains a list of lines read from the input. The lines contain numbers that are separated by an unknown number of spaces.
Now, I would like to sort them and print them out, but not in the lexicographical order but according to the numerical value of the first number appearing on the line.
I know this must be something very simple, but I cannot figure out how to do it.
Thanks in advance,
You can use a Schwartzian transform, capturing the first number in the row with a regex
use strict;
use warnings;
my @sorted = map $_->[0],
             sort { $a->[1] <=> $b->[1] }
             map { [ $_, /^(-?[\d.]+)/ ] } <DATA>;
print @sorted;
__DATA__
21 13 14
0 1 2
32 0 4
11 2 3
1 3 3
Output:
0 1 2
1 3 3
11 2 3
21 13 14
32 0 4
Reading the transform from the bottom up: <DATA> is the file handle we use; it returns a list of the lines in the file. The first map statement returns an array reference [ ... ] that contains the original line plus the first number captured in the line. Alternatively, you can use the regex /^(\S+)/ here, to capture whatever non-whitespace comes first. The sort uses this captured number inside the array ref when comparing lines. Finally, the last map converts the array ref back to the original value, stored in $_->[0].
Be aware that this relies on the lines having a number at the start of the line. If that can be missing, or blank, this will have some unforeseen consequences.
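One defensive variant (a sketch; treating such lines as having a sort key of 0, so they end up first, is my assumption; the // operator needs Perl 5.10+):

my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             map  { [ $_, (/^(-?[\d.]+)/)[0] // 0 ] } <DATA>;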
Note that using only a simple numerical sort will also "work", because Perl will convert each of your lines to a number, assuming each line begins with a number followed by a space. You will get some warnings about that, such as Argument "21 13 14\n" isn't numeric in sort. For example, if I replace my code above with
my @foo = sort { $a <=> $b } <DATA>;
I will get the output:
Argument "21 13 14\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "0 1 2\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "32 0 4\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "11 2 3\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "1 3 3\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
0 1 2
1 3 3
11 2 3
21 13 14
32 0 4
But as you can see, it has sorted correctly. I would not advise this solution, but it is a nice demonstration in this context, I think.
You can use the sort function:
@sorted_data = sort(@data);