Insert the highest value among the number of times it occurs - perl

I have two files:
1) Tab file with the following content. Let's call this reference file:
V$HMGIY_01_rc Ncor=0.405
V$CACD_01 Ncor=0.405
V$GKLF_02 Ncor=0.650
V$AML2_Q3 Ncor=0.792
V$WT1_Q6 Ncor=0.607
V$KID3_01 Ncor=0.668
V$CNOT3_01 Ncor=0.491
V$KROX_Q6 Ncor=0.423
V$ETF_Q6_rc Ncor=0.547
V$E2F_Q2_rc Ncor=0.653
V$SP1_Q6_01_rc Ncor=0.650
V$SP4_Q5 Ncor=0.660
2) The second tab file contains the search string X as shown below. Let's call this file as search_string:
A X
NF-E2_SC-22827 NF-E2
NRSF NRSF
NFATC1_SC-17834 NFATC1
NFKB NFKB
TCF3_SC-349 TCF3
MEF2A MEF2A
what I have already done is: Take the first search term (from search_string file; column X), check if it occurs in first column of the reference file. Example: The first search term is NF-E2. I checked if this string occurs in the first column of the reference file. If it occurs, then give a score of 1, else give 0. Also i have counted the number of times it matches the pattern. Now my output is of the format:
Keyword Keyword in file? Number of times keyword occurs in file
NF-E2 1 3
NRSF 0 0
NFATC1 0 0
NFKB 1 7
TCF3 0 0
Now, in addition to this, what I would like to add is the highest Ncor value for each string in each file. Say for example: while I search for NF-E2 in NF-E2.txt, the Ncor values present are: 3.02, 2.87 and 4.59. Then I want the value 4.59 to be printed in the next column. So now my output should look like:
Keyword Keyword in file? Number of times keyword occurs in file Ncor
NF-E2 1 3 4.59
NRSF 0 0
NFATC1 0 0
NFKB 1 7 1.66
TCF3 0 0
Please note: I need to search each string in different files i.e. The first string (Nf-E2) should be searched in file NF-E2.tab; the second string (NRSF) should be searched in file NRSF.tab and so on.
Here is my code:
perl -lanE '$str=$F[1]; $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2
PLease help!!!

This should work:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
chomp;
my $keyword = (split /\s+/)[1];
my $file = "/home/$keyword/list/${keyword}.txt";
open my $reference, '<', "$file" or die "Cannot open $file: $!";
my $key_cnt = 0;
my $max_ncor = 0;
while (my $line = <$reference>) {
my ($string, undef, $ncor) = split /\s+|=/, $line;
if ($string =~ $keyword) {
$key_cnt++;
$max_ncor = $ncor if ($max_ncor < $ncor);
}
}
print join("\t", $keyword, $key_cnt ? 1 : 0, $key_cnt, $key_cnt ? $max_ncor : ''), "\n";
}
Run it like this:
perl t.pl search_string.txt

Related

k/Column number in Perl

So I'm sure this is somewhere on the site, but as always, I have looked high and low before asking a question.
In Bash, you can use certain flags on some commands (such as k[number] on sort) to grab a certain column from a text file. What is the method for doing this in Perl? For an example from my input file:
Jess 6 8 25000
Say that I want to run a statement
if (k2 =< 6)
{
print "foo";
}
Of course, k2 doesn't work in Perl. May someone show me (or link me to) how this is done?
You can try with the command line Perl as well
with the inputs
$ cat reubens.txt
0 5 0
0 10 0
0 15 0
0 20 0
0 1 0
0 10 0
$ perl -lane ' print "The second column is ", $F[1] < 10 ? "less than 10": $F[1]==10 ? "equal to 10" : "more than 10" ' reubens.txt
The second column is less than 10
The second column is equal to 10
The second column is more than 10
The second column is more than 10
The second column is less than 10
The second column is equal to 10
$
In case you want to do it for exactly one column and want to avoid maintaining the result of "split" in array shape. (Otherwise use split as mentioned in the comments.)
perl -ne"/(?:\w+\s+){1}(\w+\b)/;print $1.\"\n\""
Will print the column of word-like characters between space-like characters, identified by the number inside "{}", in this case "1"; counting columns starting with 0.
E.g. it prints "6" for the input example, using "1".
How:
Make a regular expression for a column followed by space,
(?:\w+\s+)
require it a number of times,
{1}
then grab a regular expression for a colum followed by anything not word-like (including end of line)
(\w+\b)
The desired column is found in the grabbed string
$1
I did this in a command line one liner which expects standard input, to be able to test it.
Please just adapt it into your script.
This will check the second column:
(split)[1]
Sample script:
use strict;
use warnings;
my $filename = 'input';
open(FILE, $filename) or die "Can not open $filename.";
print "\n";
while(<FILE>)
{
#The test
if ((split)[1] < 10)
{
print "The second column is less than ten\n";
}
elsif ((split)[1] > 10)
{
print "The second column is more than ten\n";
}
else
{
print "The second column is equal to ten\n";
}
}
Input:
#Input file
0 5 0
0 10 0
0 15 0
0 20 0
0 1 0
0 10 0
Output:
The second column is less than ten
The second column is equal to ten
The second column is more than ten
The second column is more than ten
The second column is less than ten
The second column is equal to ten

Perl: perl regex for extracting values from complex lines

Input log file:
Nservdrx_cycle 4 servdrx4_cycle
HCS_cellinfo_st[10] (type = (LTE { 2}),cell_param_id = (28)
freq_info = (10560),band_ind = (rsrp_rsrq{ -1}),Qoffset1 = (0)
Pcompensation = (0),Qrxlevmin = (-20),cell_id = (7),
agcreserved{3} = ({ 0, 0, 0 }))
channelisation_code1 16/5 { 4} channelisation_code1
sync_ul_info_st_ (availiable_sync_ul_code = (15),uppch_desired_power =
(20),power_ramping_step = (3),max_sync_ul_trans = (8),uppch_position_info =
(0))
trch_type PCH { 7} trch_type8
last_report 0 zeroth bit
I was trying to extract only integer for my above inputs but I am facing some
issue with if the string contain integer at the beginning and at the end
For ( e.g agcreserved{3},HCS_cellinfo_st[10],Qoffset1)
here I don't want to ignore {3},[10] and 1 but in my code it does.
since I was extracting only integer.
Here I have written simple regex for extracting only integer.
MY SIMPLE CODE:
use strict;
use warnings;
my $Ipfile = 'data.txt';
open my $FILE, "<", $Ipfile or die "Couldn't open input file: $!";
my #array;
while(<$FILE>)
{
while ($_ =~ m/( [+-]?\d+ )/xg)
{
push #array, ($1);
}
}
print "#array \n";
output what I am getting for above inputs:
4 4 10 2 28 10560 -1 1 0 0 -20 7 3 0 0 0 1 16 5 4 1 15 20 3 8 0 7 8 0
expected output:
4 2 28 10560 -1 0 0 -20 7 0 0 0 4 15 20 3 8 0 7 0
If some body can help me with explanation ?
You are catching every integer because your regex has no restrictions on which characters can (or can not) come before/after the integer. Remember that the /x modifier only serves to allow whitespace/comments inside your pattern for readability.
Without knowing a bit more about the possible structure of your output data, this modification achieves the desired output:
while ( $_ =~ m! [^[{/\w] ( [+-]?\d+ ) [^/\w]!xg ) {
push #array, ($1);
}
I have added rules before and after the integer to exclude certain characters. So now, we will only capture if:
There is no [, {, /, or word character immediately before the number
There is no / or word character immediately after the number
If your data could have 2-digit numbers in the { N} blocks (e.g. PCH {12}) then this will not capture those and the pattern will need to become much more complex. This solution is therefore quite brittle, without knowing more of the rules about your target data.

How to find number of numerical data for each and every line in a file

Please help me to count the numerical data in each line of a file,
and also to find the line length. The code has to written in Perl.
For example if I have a line such as:
INPUT:I was born on 24th october,1994.
Output:2
You could do something like this:
perl -ne 'BEGIN{my $x} $x += () = /[0-9]+/g; END{print($x . "\n")}' file
-n: causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed -n or awk:
LINE:
while (<>) {
... # your program goes here
}
-e: may be used to enter one line of program;
() will make /[0-9]+/g be evaluated in list context (i.e. () = /[0-9]+/g will return an array containing the sequences of one or more digits found in the default input), while $x += will make the result be evaluated again in scalar context (i.e. $x += () = /[0-9]+/g will add the number of sequences of one or more digits found in the default input to $x); END{print($x . "\n") will print $x after the whole file has been processed.
% cat file
string 123 string 1 string string string
456 string
% perl -ne 'BEGIN{my $x} $x += () = /[0-9]+/g; END{print($x . "\n")}' file
3
%
I'd do something like this
#!/usr/bin/perl
use warnings;
use strict;
my $file = 'num.txt';
open my $fh, '<', $file or die "Failed to open $file: $!\n";
while (my $line = <$fh>){
chomp $line;
my #num = $line =~ /([0-9.]+)/g;
print "On this line --- " .scalar(#num) . "\n";
}
close ($fh);
The input file I tested --
This should say 1
Line 2 should say 2
I want this line to say 5 so I have added 4 other numbers like 0.02 -1 and 5.23
The output as tested ----
On this line --- 1
On this line --- 2
On this line --- 5
Using the regex match ([0-9.]+) will match ANY number and include any decimals (I guess really you could use just ([0-9]+) since you are only counting them and not using the actually number represented.)
Hope it helps.

grep variables and give informative ouput

I want to see how many times specific word was mentioned in the file/lines.
My dummy examples looks like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
I am doing this:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
But what I actually want to get is:
1. Word that was used as a variable;
2. In how many lines (additionally to text hits) word was found.
Preferable output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - variable that was grep'ed
$2 - how many times variable was found in the text
$3 - in how many lines variable was found
Hope someone could help me doing this with grep, awk, sed as they are fast enough for the large data set, but Perl one liner would help me too.
Edit
Tried this
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it kinda looks nice, but some of the words are longer than 300 letters so I can't create file named like the word.
You can use the grep option -o which print only the matched parts of a matching line, with each match on a separate output line.
while IFS= read -r line; do
wordcount=$(grep -o "$line" text | wc -l)
linecount=$(grep -c "$line" text)
echo $line $wordcount $linecount
done < words | column -t
You can put it all in one line to make it a one liner.
If column gives the "column too long" error, you can use printf provided you know the maximum number of characters. Use the below instead of echo and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace the 20 with your max word length and the other numbers as well if you need to.
Here is a similar Perl solution; but rather written as a complete script.
#!/usr/bin/perl
use 5.012;
die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless #ARGV;
my $wordsfile = shift #ARGV;
my #wordlist = do {
open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
map {chomp; length() ? $_ : ()} <$words_fh>;
};
my %words;
while (<>) {
for my $word (#wordlist) {
my $cnt = 0;
$cnt++ for /\Q$word\E/g;
$words{$word}[0] += $cnt;
$words{$word}[1] += 1&!! $cnt; # trick to force 1 or 0.
}
}
# sorts output after frequency. remove `sort {...}` to get unsorted output.
for my $key (sort {$words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b} keys %words) {
say join "\t", $key, #{ $words{$key} };
}
Example output:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over bash script: every file is only read once.
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent on stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; #w=<$w>; chomp #w; $r{$_}=[0,{}] for #w; my $re = join "|", #w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for #w' < text
This requires perl 5.10 or later, but changing it to support 5.8 and earlier is trivial. (Change the -E to -e, change say to print, and add a \n at the end of each line of output.)
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
an awk(gawk) oneliner could save you from grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
format the code a bit:
awk 'NR==FNR{n[$0];l[$0];next;}
{for(w in n){ s=$0;
t=gsub(w,"#",s);
n[w]+=t;l[w]+=t>0?1:0;}
}END{for(x in n)print x,n[x],l[x]}' words text
test with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
if you want to format your output, you could just pipe the awk output to column -t
so it looks like:
yellow 0 0
red 0 0
green 1 1
blue 3 2
awk '
NR==FNR { words[$0]; next }
{
for (word in words) {
count = gsub(word,word)
if (count) {
counts[word] += count
lines[word]++
}
}
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' file

How do I replace row identifiers with sequential numbers?

I have written a perl script that splits 3 columns into scalars and replaces various values in the second column using regex. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements and tried this, but that doesn't seem to play well with the while procedure I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;
The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
With code this could look something similar to this (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
# do your thing
$counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
$prev_item_id = $item_id;
print "$item_id\t$counter\t...\n";
}
This goes a little further than just what you asked...
Use lexical filehandles
[autodie] makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;
open(my $fh, "<", "text_to_be_processed.txt");
my %Callnum_Map = (
110 => '%I',
245 => '%T',
260 => '%U',
);
my %item_id_count;
while (<$fh>) {
chomp;
my($item_id,$callnum,$data) = split m{\|};
for my $search (keys %Callnum_Map) {
my $replace = $Callnum_Map{$search};
$callnum =~ s{$search}{$replace}g;
}
my $item_count = ++$item_id_count{$item_id};
print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times you've seen an item id in the whole file. If you want how many times its been seen consecutively, use Mortiz's solution.
Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt");
my $counter = 0;
my $prev;
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
++$counter;
$item_id = $counter;
#reset counter if $prev is different than $item_id
$counter = 0 if ($prev ne $item_id );
$prev = $item_id;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;