How to search/replace with random array value in perl from command line - perl

I'm trying to search/replace where the replacement has a random value from an array -- all from the command line in Perl. I can't figure out what's wrong here. I've tried a lot of variations of this, and read many examples online (which are all for non-command-line).
echo "Test z|Test z|Test z|" | tr '|' '\n' | \
perl -pe '#numbers=[12.3, 45.6, 78.9]; $number = $numbers[rand #numbers]; s/z/" : ".( int ($number) )/ge'
The actual output is something like this (the numbers change):
Test : 24591392
Test : 24591752
Test : 24591416
The expected output is:
Test : 45.6
Test : 12.3
Test : 78.9
Where the actual numbers are randomly selected. Any tips are welcome, including pointing out silly typos. Thanks!

You want to say #numbers=(12.3, 45.6, 78.9), not #numbers=[12.3, 45.6, 78.9]. The latter creates an array with a reference to another array as its only element. The output you are currently seeing is the numification of a reference value, not the contents of the array.

Related

Perl illegal division by zero at -e

I am running the following script:
# Usage: sh hmmscan-parser.sh hmmscan-file
# 1. take hmmer3 output and generate the tabular output
# 2. sort on the 6th and 7th cols
# 3. remove overlapped/redundant hmm matches and keep the one with the lower e-values
# 4. calculate the covered fraction of hmm (make sure you have downloaded the "all.hmm.ps.len" file to the same directory of this perl script)
# 5. apply the E-value cutoff and the covered faction cutoff
cat $1 | perl -e 'while(<>){if(/^\/\//){$x=join("",#a);($q)=($x=~/^Query:\s+(\S+)/m);while($x=~/^>> (\S+.*?\n\n)/msg){$a=$&;#c=split(/\n/,$a);$c[0]=~s/>> //;for($i=3;$i<=$#c;$i++){#d=split(/\s+/,$c[$i]);print $q."\t".$c[0]."\t$d[6]\t$d[7]\t$d[8]\t$d[10]\t$d[11]\n" if $d[6]<1;}}#a=();}else{push(#a,$_);}}' \
| sort -k 1,1 -k 6,6n -k 7,7n | uniq \
| perl -e 'while(<>){chomp;#a=split;next if $a[-1]==$a[-2];push(#{$b{$a[0]}},$_);}foreach(sort keys %b){#a=#{$b{$_}};for($i=0;$i<$#a;$i++){#b=split(/\t/,$a[$i]);#c=split(/\t/,$a[$i+1]);$len1=$b[-1]-$b[-2];$len2=$c[-1]-$c[-2];$len3=$b[-1]-$c[-2];if($len3>0 and ($len3/$len1>0.5 or $len3/$len2>0.5)){if($b[2]<$c[2]){splice(#a,$i+1,1);}else{splice(#a,$i,1);}$i=$i-1;}}foreach(#a){print $_."\n";}}' \
| uniq | perl -e 'open(IN,"all.hmm.ps.len");
while(<IN>)
{
chomp;
#a=split;
$b{$a[0]}=$a[1]; # creates hash of hmmName : hmmLength
}
while(<>)
{
chomp;
#a=split;
$r=($a[3]-$a[2])/$b{$a[1]}; # $a[4] = hmm end $a[3] = hmm start ; $b{$a[1]} = result of the hash of the name of the hmm (hmm length).
print $_."\t".$r."\n";
}' \
| perl -e 'while(<>){#a=split(/\t/,$_);if(($a[-2]-$a[-3])>80){print $_ if $a[2]<1e-5;}else{print $_ if $a[2]<1e-3;}}' | awk '$NF>0.3'
When I run the file, I get "Illegal division by zero at -e line 14, <> line 1"
Line 14 is :
$r=($a[3]-$a[2])/$b{$a[1]};
The first input (hmmscan-file) is under this form :
Query: NODE_1_length_300803_cov_11.207433_1 [L=264]
Description: # 139 # 930 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.543
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
[No hits detected that satisfy reporting thresholds]
Domain annotation for each model (and alignments):
[No targets detected that satisfy reporting thresholds]
Internal pipeline statistics summary:
-------------------------------------
Query sequence(s): 1 (264 residues searched)
Target model(s): 641 (202466 nodes)
Passed MSV filter: 18 (0.0280811); expected 12.8 (0.02)
Passed bias filter: 18 (0.0280811); expected 12.8 (0.02)
Passed Vit filter: 1 (0.00156006); expected 0.6 (0.001)
Passed Fwd filter: 0 (0); expected 0.0 (1e-05)
Initial search space (Z): 641 [actual number of targets]
Domain search space (domZ): 0 [number of targets reported over threshold]
# CPU time: 0.02u 0.02s 00:00:00.04 Elapsed: 00:00:00.03
# Mc/sec: 1357.56
//
Query: NODE_1_length_300803_cov_11.207433_2 [L=184]
Description: # 945 # 1496 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.431
the second input named (all.hmm.ps.len) is under this form whoever the third pearl command calls him.
BM10.hmm 28
CBM11.hmm 163
I did not understand where the problem is knowing that the script in general aims to create an array to clearly read the input (hmmscan-file).
thank you very much
So you have this error:
Illegal division by zero at -e line 14, <> line 1
And you say that line 14 is this:
$r=($a[3]-$a[2])/$b{$a[1]};
Well, there's only one division in that line of code, so it seems clear that when you're executing that line $b{$a[1]} contains zero (but note that because you don't have use warnings in your code, it's possible that it might also contain an empty string or undef and that's being silently converted to a zero).
So as the programmer, your job is to trace through your code and find out where these values are being set and what might be causing it not to contain a value that can be used in your division.
I would be happy to help work out what the problem is, but for a few problems:
Your program reads from two input files and you've only given us one.
Your variables all have single-letter names, making it impossible to understand what they are for.
Your use of a string of command-line programs is bizarre and renders your code pretty much unreadable.
If you want help (not just from here, but from any forum), your first move should be to make your code as readable as possible. It's more than likely that by doing that, you'll find the problem yourself.
But your current habit of reposting the question with incremental changes is winning you no friends here.
Update: One idea might be to rewrite your line as something like this:
if (exists $b{$a[1]}) {
$r=($a[3]-$a[2])/$b{$a[1]};
} else {
warn "Hey, I can't find the key '$a[1]' in the hash \%b\n";
}

Extracting values from a single file

I have a file with multiple lines; but a specific line contains tons of information, with several repeated expressions. I'm trying to extract some specific values. I first tried some commands with sed, for instance, but with no success. So, I was wondering if you could give me some insights.
So, here you have one fraction of the unique line of the given document I mentioned:
[...]6[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01},DLOOP.rate_median=0.04131395026396427,length=
[...]
10[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61},DLOOP.rate_median=0.04131395026396427,length=
[...]
My aim here is first to extract all the values that is between the brackets, after "habitat.set.prob={". and put them in a single line in a text file.
Also, it would be important to extract the numbers that appears just before the expression "[&length_range=]", which in this case are "6" and "10". They are the label of the set of numbers after "prob={"
So the set of numbers I want to extract always appears between "habitat.set.prob={" and "},DLOOP.rate_median", while the other number (the label) is always rigth before "[&length_range="; but what is before the label is not the same expression; actually it is a random number.
The goal then is end up with a file with the following characteristcs:
6 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
10 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
and so on …
What do you think? Is this possible?
I started with this very basic command at least to try to extract the set of numbers, but it didn't work
sed -n "/habitat.set.prob={/,/},DLOOP.rate_median=/ p"
| Well... I got some improvement.
I was able to get the values at least:
awk '{gsub("habitat.set.prob={","\n");printf"%s",$0}' filename | awk -F'},' '{print $1"}"}' | grep -iv "TREE" > stats.txt
|
Many thanks in advance.
Cheers,
Luiz
Something like that:
sed -rn '/.*[0-9]+\[&length_range=\{/,/habitat.set.prob=\{/{s/.*\b([0-9]+)\[&length_range.*/\1/p; s/.*habitat.set.prob=\{([^D]+)\},DLOOP.rate.*/\1/p}' habitat
6
0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10
0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
The first part '/.a./,/.b./' searches from pattern a to b, distributed over multiple lines. The -n told sed to do non-printing as default.
In '/.a./,/.b./{s/.c./.d./p; s/.e./.f./p}'
there are two substitution commands with p=print in curly braces.
I am not sure if you really digged a little, so not providing the complete answer, but let's hope this would help you:
for the first part: getting the no(which you call as label) you didn't mention if there is any specific pattern, so try this (data is the file which contains the actual input) - you need to work on how to get the number and tweak the RE a bit
sed -n 's/.*\([0-9][0-9]*\).*length_range.*/\1/p' data
For the other part which gives the numericals between habitat and DLOOP:
sed -n 's/.*habitat.set.prob=\(.*\),DLOOP.*/\1/pg' data | tr '{' ' ' | tr '}' ' '
Now, try to take this as a starter and work on your output to get your desired result!
To explain a bit:
In the first section - I am trying to capture the numericals between anything(.*) and (.*)length_range [you can escape the character [ and & by using \ in front of them]
In the second section: I am capturing pattern in between habitat.set.prob and DLOOP and then doin a tr to remove the brackets.
#include <iostream>
using namespace std;
int main()
{
string p = "1:2:3:4"; //input your string
int arr[4] = {}; //create a new empty integer array to put the integers in it
for(int i=0, j=0; i <p.length(); i++){//loop on the string to extract integers
if( p[i] == ':'){continue;}//if the value = ':' skip it and continue
arr[j]=(int)p[i]-48;j++;//put the integer in the array we created
}
cout << "String={"<<arr[0]<<" "<<arr[1]<<" "<<arr[2]<<" "<<arr[3]<<"}";//print the array
return 0;
}

convert row to column based on text

I have a rather large file (single column) with data similar to this:
BT1111
2.2.2.2/3
3.3.3.3/4
7.2.1.1/5
BT6766
2.2.1.1/5
4.5.1.1/7
BT9898
4.4.4.4/2
8.8.8.8/9
I wish to find a function that can align it into two columns, by moving all entries starting with digit one column ($1 to $2) and enrich it with the corresponding BT field, so desired output should be
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9
I can't imagine how to ensure the "look for next occurence" should be performed, but hope there is a function for it I have managed to overlook ?
perl -nle'if (/^\D/) { $n=$_ } else { print "$n;$_" }' input.txt
See Specifying file to process to Perl one-liner for alternate usages.
$ awk '/BT/{a=$1; next}{print a ";" $1}' input.txt
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9

sed: replace letter between square brackets

I have the following string:
signal[i]
signal[bg]
output [10:0]
input [i:1]
what I want is to replace the letters between square brackets (by underscore for example) and to keep the other strings that represents table declaration:
signal[_]
signal[__]
output [10:0]
input [i:1]
thanks
try:
awk '{gsub(/\[[a-zA-Z]+\]/,"[_]")} 1' Input_file
Globally substituting the (bracket)alphabets till their longest match then with [_]. Mentioning 1 will print the lines(edited or without edited ones).
EDIT: Above will substitute all alphabets with one single _, so to get as many underscores as many characters are there following may help in same.
awk '{match($0,/\[[a-zA-Z]+\]/);VAL=substr($0,RSTART+1,RLENGTH-2);if(VAL){len=length(VAL);;while(i<len){q=q?q"_":"_";i++}};gsub(/\[[a-zA-Z]+\]/,"["q"]")}1' Input_file
OR
awk '{
match($0,/\[[a-zA-Z]+\]/);
VAL=substr($0,RSTART+1,RLENGTH-2);
if(VAL){
len=length(VAL);
while(i<len){
q=q?q"_":"_";
i++
}
};
gsub(/\[[a-zA-Z]+\]/,"["q"]")
}
1
' Input_file
Will add explanation soon.
EDIT2: Following is the one with explanation purposes for OP and users.
awk '{
match($0,/\[[a-zA-Z]+\]/); #### using match awk's built-in utility to match the [alphabets] as per OP's requirement.
VAL=substr($0,RSTART+1,RLENGTH-2); #### Creating a variable named VAL which has substr($0,RSTART+1,RLENGTH-2); which will have substring value, whose starting point is RSTART+1 and ending point is RLENGTH-2.
RSTART and RLENGTH are the variables out of the box which will be having values only when awk finds any match while using match.
if(VAL){ #### Checking if value of VAL variable is NOT NULL. Then perform following actions.
len=length(VAL); #### creating a variable named len which will have length of variable VAL in it.
while(i<len){ #### Starting a while loop which will run till the value of VAL from i(null value).
q=q?q"_":"_"; #### creating a variable named q whose value will be concatenated it itself with "_".
i++ #### incrementing the value of variable i with 1 each time.
}
};
gsub(/\[[a-zA-Z]+\]/,"["q"]") #### Now globally substituting the value of [ alphabets ] with [ value of q(which have all underscores in it) then ].
}
1 #### Mentioning 1 will print (edited or non-edited) lines here.
' Input_file #### Mentioning the Input_file here.
Alternative gawk solution:
awk -F'\\[|\\]' '$2!~/^[0-9]+:[0-9]$/{ gsub(/./,"_",$2); $2="["$2"]" }1' OFS= file
The output:
signal[_]
signal[__]
output [10:0]
-F'\\[|\\]' - treating [ and ] as field separators
$2!~/^[0-9]+:[0-9]$/ - performing action if the 2nd field does not represent table declaration
gsub(/./,"_",$2) - replace each character with _
This might work for you (GNU sed);
sed ':a;s/\(\[_*\)[[:alpha:]]\([[:alpha:]]*\]\)/\1_\2/;ta' file
Match on opening and closing square brackets with any number of _'s and at least one alpha character and replace said character by an underscore and repeat.
awk '{sub(/\[i\]/,"[_]")sub(/\[bg\]/,"[__]")}1' file
signal[_]
signal[__]
output [10:0]
input [i:1]
The explanation is as follows: Since bracket is as special character it has to be escaped to be handled literally then it becomes easy use sub.

Wordnet synsets using perl

I installed Wordnet::Similarity and Wordnet::QueryData as an easy way to calculate information content score and probability that comes with these modules. But I'm stuck at this basic problem: given a word, print n words similar to it - which should not be difficult that iterating through the synsets and doing join.
using the wn command and piping it with a whole lot of tr, sort | uniq I can get all the words:
wn cat -synsn | grep -v Sense | tr '=' ' ' | tr '>' ' ' | tr '\t' ' ' | tr ',' '\n' | sort | uniq
OUTPUT
8 senses of cat
adult female
adult male
African tea
Arabian tea
big cat
bozo
cat
cat
CAT
Caterpillar
cat-o'-nine-tails
computed axial tomography
computed tomography
computerized axial tomography
computerized tomography
CT
excitant
felid
feline
gossip
gossiper
gossipmonger
guy
hombre
kat
khat
man
newsmonger
qat
quat
rumormonger
rumourmonger
stimulant
stimulant drug
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun cat
tracked vehicle
true cat
whip
woman
X-radiation
X-raying
but its kinda nasty,and needs further clean up.
What my script looks like is below, and what I want to get is all the words in cat#n1...8.
SCRIPT
use WordNet::QueryData;
my $wn = WordNet::QueryData->new( noload => 1);
print "Senses: ", join(", ", $wn->querySense("cat#n")), "\n";
print "Synset: ", join(", ", $wn->querySense("cat", "syns")), "\n";
print "Hyponyms: ", join(", ", $wn->querySense("cat#n#1", "hypo")), "\n";
OUTPUT:
Senses: cat#n#1, cat#n#2, cat#n#3, cat#n#4, cat#n#5, cat#n#6, cat#n#7, cat#n#8
Synset: cat#n, cat#v
Hyponyms: domestic_cat#n#1, wildcat#n#3
SCRIPT
use WordNet::QueryData;
my $wn = WordNet::QueryData->new;
foreach $word (qw/cat#n/) {
#senses = $wn->querySense($word);
foreach $wps (#senses) {
#gloss = $wn -> querySense($wps, "syns");
print "$wps : #gloss\n";
}
}
OUTPUT:
cat#n#1 : cat#n#1 true_cat#n#1
cat#n#2 : guy#n#1 cat#n#2 hombre#n#1 bozo#n#2
cat#n#3 : cat#n#3
cat#n#4 : kat#n#1 khat#n#1 qat#n#1 quat#n#1 cat#n#4 Arabian_tea#n#1 African_tea#n#1
cat#n#5 : cat-o'-nine-tails#n#1 cat#n#5
cat#n#6 : Caterpillar#n#2 cat#n#6
cat#n#7 : big_cat#n#1 cat#n#7
cat#n#8 : computerized_tomography#n#1 computed_tomography#n#1 CT#n#2 computerized_axial_tomography#n#1 computed_axial_tomography#n#1 CAT#n#8
P.S.
I have never written perl before, but have been looking into perl scripts since morning - and can now understand the basic stuff. Just need to know if there is cleaner way to do this using the api docs - couldn't figure out from the api or usergroup archives.
Update:
I think I'll settle with:
wn cat -synsn | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d'
sed rocks!
I think you'll find the following hepful...
http://marimba.d.umn.edu/WordNet-Pairs/
What are the N most similar words to X, according to WordNet?
This data seeks to answer that question, where similarity is based on
measures from WordNet::Similarity. http://wn-similarity.sourceforge.net
-------------- verb data
These files were created with WordNet::Similarity version 2.05 using
WordNet 3.0. They show all the pairwise verb-verb similarities found
in WordNet according to the path, wup, lch, lin, res, and jcn measures.
The path, wup, and lch are path-based, while res, lin, and jcn are based
on information content.
As of March 15, 2011 pairwise measures for all verbs using the six
measures above are availble, each in their own .tar file. Each *.tar
file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx
2.0 - 2.4 GB compressed. In each of these .tar files you will find
25,047 files, one for each verb sense. Each file consists of 25,048 lines,
where each line (except the first) contains a WordNet verb sense and the
similarity to the sense featured in that particular file. Doing
the math here, you find that each .tar file contains about 625,000,000
pairwise similarity values. Note that these are symmetric (sim (A,B)
= sim (B,A)) so you have a bit more than 300 million unique values.
-------------- noun data
As of August 19, 2011 pairwise measures for all nouns using the path
measure are available. This file is named WordNet-noun-noun-path-pairs.tar.
It is approximately 120 GB compressed. In this file you will find
146,312 files, one for each noun sense. Each file consists of
146,313 lines, where each line (except the first) contains a WordNet
noun sense and the similarity to the sense featured in that particular
file. Doing the math here, you find that each .tar file contains
about 21,000,000,000 pairwise similarity values. Note that these
are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion
unique values.
We are currently running wup, res, and lesk, but do not have an
estimated date of availability yet.
Put this is a script, say synonym.sh
wn $1 -synsn | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d' | sed 's/ //g' | grep -iv $1 | tr '\n' ','
wn $1 -synsv | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d' | sed 's/ //g' | grep -iv $1 | tr '\n' ',';echo
From your perl script
system("/path/synonym.sh","kittens");
system("/path/synonym.sh","cats");