Junk output or random character in Data::Dumper output for Net::DNS::Resolver object - perl

I am getting familiarized with the Net::DNS library in Perl and an object is created using
my $res = Net::DNS::Resolver->new();
However, simply querying a domain name shows a lot of junk values, though the output itself is correct. Here is the code snippet:
#!/usr/bin/perl
use Net::DNS;
use Net::IP;
use Data::Dumper;
my $rr;
$domain = 'google.com';
my $res = Net::DNS::Resolver->new();
my $ns_req = $res->query($domain, "NS");
print "\n\n###\n".Dumper($ns_req)."\n###\n\n";
Here are 2 outputs for various domains tested against this object:
What are these junk values being displayed? Is there a way to clean up the output a bit in order to read the output properly?

You are dumping the internals of the object which include the buffer which holds the original response bytes.
You should use the API defined in the module documentation to access the information.
#!/usr/bin/env perl
use strict;
use warnings;
use Net::DNS;
my $resolver = Net::DNS::Resolver->new;
my $result = $resolver->query('google.com', "NS");
$result->print;
Output:
;; Answer received from x.x.x.x (100 bytes)
;; HEADER SECTION
;; id = 39595
;; qr = 1 aa = 0 tc = 0 rd = 1 opcode = QUERY
;; ra = 1 z = 0 ad = 0 cd = 0 rcode = NOERROR
;; qdcount = 1 ancount = 4 nscount = 0 arcount = 0
;; do = 0
;; QUESTION SECTION (1 record)
;; google.com. IN NS
;; ANSWER SECTION (4 records)
google.com. 21599 IN NS ns4.google.com.
google.com. 21599 IN NS ns2.google.com.
google.com. 21599 IN NS ns1.google.com.
google.com. 21599 IN NS ns3.google.com.
;; AUTHORITY SECTION (0 records)
;; ADDITIONAL SECTION (0 records)
The query method returns a Net::DNS::Packet which provides other methods to obtain specific parts of the response.
For example:
#!/usr/bin/env perl
use strict;
use warnings;
use Net::DNS;
my $resolver = Net::DNS::Resolver->new;
my $result = $resolver->query('google.com', "NS");
for my $answer ($result->answer) {
print $answer->nsdname, "\n";
}
Output:
ns2.google.com
ns1.google.com
ns3.google.com
ns4.google.com
If you are interested in the contents of the binary buffer, Net::DNS::Packet has a data method which returns the contents of that buffer. As RFC 1035 points out:
3.2. RR definitions
3.2.1. Format
All RRs have the same top level format shown below:
1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| |
/ /
/ NAME /
| |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| TYPE |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| CLASS |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| TTL |
| |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| RDLENGTH |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
/ RDATA /
/ /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
where:
NAME an owner name, i.e., the name of the node to which this
resource record pertains.
TYPE two octets containing one of the RR TYPE codes.
CLASS two octets containing one of the RR CLASS codes.
TTL a 32 bit signed integer that specifies the time interval
that the resource record may be cached before the source
of the information should again be consulted. Zero
values are interpreted to mean that the RR can only be
used for the transaction in progress, and should not be
cached. For example, SOA records are always distributed
with a zero TTL to prohibit caching. Zero values can
also be used for extremely volatile data.
RDLENGTH an unsigned 16 bit integer that specifies the length in
octets of the RDATA field.
RDATA a variable length string of octets that describes the
resource. The format of this information varies
according to the TYPE and CLASS of the resource record.
You can examine the contents of $result->data by doing a hexdump:
#!/usr/bin/env perl
use strict;
use warnings;
use Net::DNS;
my $resolver = Net::DNS::Resolver->new;
my $result = $resolver->query('google.com', "NS");
print $result->data;
C:\...\t> perl tt.pl | xxd
00000000: 3256 8180 0001 0004 0000 0000 0667 6f6f 2V...........goo
00000010: 676c 6503 636f 6d00 0002 0001 c00c 0002 gle.com.........
00000020: 0001 0000 545f 0006 036e 7333 c00c c00c ....T_...ns3....
00000030: 0002 0001 0000 545f 0006 036e 7334 c00c ......T_...ns4..
00000040: c00c 0002 0001 0000 545f 0006 036e 7332 ........T_...ns2
00000050: c00c c00c 0002 0001 0000 545f 0006 036e ..........T_...n
00000060: 7331 c00c s1..
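To connect the hexdump with the RFC 1035 layout, here is a minimal Python sketch that decodes only the 12-byte header and the question section from the first 28 bytes shown above. The helper names parse_header and read_name are made up for this illustration, and read_name deliberately does not follow compression pointers (the c00c sequences in the answer section), so it is not a full parser.

```python
import struct

# First 28 bytes of the hexdump above (header + question section).
packet = bytes.fromhex(
    "3256 8180 0001 0004 0000 0000"   # 12-byte header
    "0667 6f6f 676c 6503 636f 6d00"   # QNAME: 6 "google" 3 "com" 0
    "0002 0001"                       # QTYPE = NS (2), QCLASS = IN (1)
)

def parse_header(data):
    """Unpack the six 16-bit big-endian header fields."""
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", data[:12])
    return {
        "id": ident,
        "qr": flags >> 15,        # 1 = this is a response
        "rcode": flags & 0xF,     # 0 = NOERROR
        "qdcount": qd,
        "ancount": an,
        "nscount": ns,
        "arcount": ar,
    }

def read_name(data, offset):
    """Read an uncompressed name; returns (name, next_offset)."""
    labels = []
    while True:
        length = data[offset]
        offset += 1
        if length == 0:           # zero-length root label ends the name
            break
        labels.append(data[offset:offset + length].decode("ascii"))
        offset += length
    return ".".join(labels) + ".", offset

header = parse_header(packet)
qname, pos = read_name(packet, 12)
qtype, qclass = struct.unpack("!2H", packet[pos:pos + 4])
print(header)
print(qname, qtype, qclass)
```

The length-prefixed labels are why Data::Dumper shows control characters mixed in with readable fragments such as "google" and "com".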

Related

Verifying NSEC3 records

I'm fiddling with DNSSEC, and I'd like to try to verify NSEC3 records generated by dnssec-signzone from bind9-utils (which I presume are valid). This is my zone file:
$ORIGIN dnssectest.mvolfik.tk.
$TTL 120
@ SOA dnssectestns.mvolfik.tk. email.example.com. 15 259200 3600 300000 3600
A 192.168.0.101
s3c A 192.168.0.101
$INCLUDE zsk.key
$INCLUDE ksk.key
ZSK and KSK are generated with dnssec-keygen -a ECDSAP256SHA256 dnssectest.mvolfik.tk. (add -f KSK respectively)
I then signed it using the command dnssec-signzone -3 deadbeef -H 5 -o dnssectest.mvolfik.tk -k ksk.key zonefile zsk.key (use NSEC3 with deadbeef hex salt, 5 iterations)
I got the following NSEC3 records in the zonefile.signed: (omitted RRSIG and DNSKEY as irrelevant; A and SOA didn't change)
0 NSEC3PARAM 1 0 5 DEADBEEF
F66KKS17FM851AVA4EARFHS55I3TOO85.dnssectest.mvolfik.tk. 3600 IN NSEC3 1 0 5 DEADBEEF (
D60TA5J5RS4JD5AQK25B1BCUAHGP4DHC
A SOA RRSIG DNSKEY NSEC3PARAM )
D60TA5J5RS4JD5AQK25B1BCUAHGP4DHC.dnssectest.mvolfik.tk. 3600 IN NSEC3 1 0 5 DEADBEEF (
F66KKS17FM851AVA4EARFHS55I3TOO85
A RRSIG )
Now that I know that the only domains in this zone are s3c.dnssectest.mvolfik.tk. and dnssectest.mvolfik.tk., I assume that the following Python script would get me the same hashes as in the signed zone file above: (from pseudocode in RFC 5155)
import hashlib
def ih(salt, x, k):
    if k == 0:
        return hashlib.sha1(x + salt).digest()
    return hashlib.sha1(ih(salt, x, k-1) + salt).digest()
print(ih(bytes.fromhex("deadbeef"), b"s3c.dnssectest.mvolfik.tk.", 5).hex())
print(ih(bytes.fromhex("deadbeef"), b"dnssectest.mvolfik.tk.", 5).hex())
However, I instead got b58374998347ba833ab33f15332829a589a80d82 and 545e01397a776ee73aa0372aea015408cc384574. What am I doing wrong?
So I looked into the dnspython source code and found the nsec3_hash function. It turns out that the name must be in wire format: the dots are removed and each label is prefixed with a length byte, with a null byte (the empty root label) at the end, e.g. \x03s3c\x0adnssectest\x07mvolfik\x02tk\x00. The result is then encoded with base32 using the 0-9A-V alphabet, not hex. It is probably easier just to use the dnspython library, but here's the full (somewhat naive) code:
import hashlib, base64
# Map the standard base32 alphabet onto the 0-9A-V alphabet used by NSEC3
b32_trans = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567", "0123456789ABCDEFGHIJKLMNOPQRSTUV"
)
def ih(salt, x, k):
    if k == 0:
        return hashlib.sha1(x + salt).digest()
    return hashlib.sha1(ih(salt, x, k - 1) + salt).digest()
def nsec3(salt, name, k):
    if not name.endswith("."):
        name += "."
    # The trailing dot yields a final empty label, i.e. the terminating null byte
    labels = name.split(".")
    name_wire = b"".join(len(l).to_bytes(1, "big") + l.lower().encode() for l in labels)
    digest = ih(bytes.fromhex(salt), name_wire, k)
    return base64.b32encode(digest).decode().translate(b32_trans)
print(nsec3("deadbeef", "dnssectest.mvolfik.tk.", 5))
print(nsec3("deadbeef", "s3c.dnssectest.mvolfik.tk.", 5))
This gets the correct hashes seen in the NSEC3 records

Strange 64-bit time format from game, you can recognize it?

I captured this 64-bit time format from a game and am trying to understand it.
I cannot use a date delta because every now and then the value changes completely and even becomes negative, as seen below.
v1:int64=-5990085973098618987; //2021-01-25 13:30:00
v2:int64=-5990085973321147595; //4 mins later
v3:int64=6140958949625363349; //7 mins later
v4:int64=6140958948894898101; //11 mins later
v5:int64=-174740204032730139; //16 mins later
v6:int64=-174740204054383467; //18 mins later
v7:int64=-6490439358095090795; //23 mins later
I tried splitting the 64-bit value into two 32-bit containers to get the low and high parts, but the values are still strange.
I also tried using pdouble(@value)^ to read the 64-bit data as a float, and the values are still strange.
So I am running out of ideas; maybe it is some kind of bitfield data, or something else is going on.
hipart: -1394675573 | lopart: 1466441621 | hex: acdef08b|57681f95 | swap: -7701322112560996692
hipart: -1394675573 | lopart: 1243913013 | hex: acdef08b|4a249b35 | swap: 3862721007994330796
hipart: 1429803424 | lopart: -458425451 | hex: 553911a0|e4acfb95 | swap: -7639322244965910187
hipart: 1429803424 | lopart: -1188890699 | hex: 553911a0|b922f7b5 | swap: -5334757052947285675
hipart: -40684875 | lopart: -760849435 | hex: fd9332b5|d2a65be5 | swap: -1919757392230050819
hipart: -40684875 | lopart: -782502763 | hex: fd9332b5|d15bf495 | swap: -7641381711494605827
hipart: -1511173174 | lopart: -1467540587 | hex: a5ed53ca|a8871b95 | swap: -7702413578668347995
Any ideas welcomed, thanks in advance
//mbs
--EDIT: So far, thanks to Martin Rosenau we are able to encode like this:
func mulproc_nfsw(i:int64;key:uint32):int64;
begin
  if (blnk i) or (blnk key) then exit;
  p:pointer=@i;
  hi:uint32=uint32(p+4)^; //30864159 (hex: 01d6f31f)
  lo:uint32=uint32(p)^; //748455936 (hex: 2c9c8800)
  hi64:int64=hi*key; //0135b55a acdef08b <-- keep
  lo64:int64=lo*key; //1d566e0b a65f2800 <-- keep
  q:pointer=@result; //-5990085971773806592
  uint32(q+4)^:=hi64; //acdef08b
  uint32(q)^:=lo64; //a65f2800
end;
func encode_time_nfsw(j:juncture):int64;
begin
  if blnk j then exit; //input: '2021-01-25 13:37:07'
  key:uint32=$A85A2115; //encode key
  ft:int64=j.filetime; //hex: 01d6f31f 2c9c8800
  result:=mulproc_nfsw(ft,key);
end;
--EDIT2: Finally, thanks to fpiette we are able to decode also:
func decode_time_nfsw(i:int64):juncture;
begin
  if blnk i then exit; //input: -5990085971773806592
  key:uint32=$3069263D; //decode key
  ft:int64=mulproc_nfsw(i,key);
  result.setfiletime(ft);
end;
I checked my suspicion that the high and the low 32 bits are simply multiplied by A85A2115 (hex):
We get a FILETIME structure. Then we perform a 32x32->32 bit multiplication (this means we throw away the high 32 bits of the 64 bits result) of the high and the low word independently.
Example:
25 Jan 2021 13:37:07 (and some milliseconds)
Unencrypted FILETIME:
High dword = 1D6F31F (hex)
Low dword = 2C9CA481 (hex)
Multiplication
High dword: 1D6F31F * A85A2115 = 135B55AACDEF08B (hex)
Low dword: 2C9CA481 * A85A2115 = 1D5680CA57681F95 (hex)
Now only take the low 32 bits of the results:
High dword: ACDEF08B (hex)
Low dword: 57681F95 (hex)
Unfortunately, I don't know how to do the "reverse operation"; I did it by searching for the result in a loop with the following pseudo-code:
encryptedValue = 57681F95 (hex)
originalValue = 0
product = 0
while product not equal to encryptedValue
    // 32-bit addition discarding carry:
    product = product + A85A2115 (hex)
    originalValue = originalValue + 1
end_of_while_loop
We get the following results:
25 Jan 2021 13:37:07 => acdef08b|57681f95
25 Jan 2021 13:40:51 => acdef08b|4a249b35
25 Jan 2021 13:45:07 => 553911a0|e4acfb95
25 Jan 2021 13:49:03 => 553911a0|b922f7b5
25 Jan 2021 13:53:53 => fd9332b5|d2a65be5
25 Jan 2021 13:55:50 => fd9332b5|d15bf495
25 Jan 2021 14:00:39 => a5ed53ca|a8871b95
Addendum
The reverse operation seems to be done by multiplying with 3069263D (hex) (and only using the low 32 bits).
Encrypting:
2C9CA481 * A85A2115 = 1D5680CA57681F95
=> Result: 57681F95
Decrypting:
57681F95 * 3069263D = 10876CAF2C9CA481
=> Result: 2C9CA481
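The decode key works because 3069263D is the multiplicative inverse of A85A2115 modulo 2^32, so multiplying by one undoes multiplying by the other once only the low 32 bits are kept. A minimal Python sketch of both directions follows; the names mul32 and transform are made up for this illustration, and the FILETIME value is the one from the example above.

```python
MASK32 = 0xFFFFFFFF
ENCODE_KEY = 0xA85A2115
DECODE_KEY = 0x3069263D

def mul32(value, key):
    """32x32 -> 32 bit multiplication (high 32 bits of the product discarded)."""
    return (value * key) & MASK32

def transform(value64, key):
    """Multiply the high and low dwords independently by the key."""
    hi = mul32(value64 >> 32, key)
    lo = mul32(value64 & MASK32, key)
    return (hi << 32) | lo

# The two keys are inverses of each other modulo 2**32.
print(mul32(ENCODE_KEY, DECODE_KEY))                 # 1
# Python 3.8+ can compute the inverse directly.
print(hex(pow(ENCODE_KEY, -1, 2**32)))               # 0x3069263d

filetime = 0x01D6F31F2C9CA481                        # unencrypted FILETIME
encoded = transform(filetime, ENCODE_KEY)
print(hex(encoded))                                  # 0xacdef08b57681f95
decoded = transform(encoded, DECODE_KEY)
print(decoded == filetime)                           # True
```

An inverse exists because the encode key is odd; an even key would lose information and could not be reversed this way.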

How to convert binary to hex in Batch or Powershell?

I am wondering if there is a way to convert binary to hexadecimal in Batch or PowerShell.
Example:
10000100 to 84
01010101 to 55
101111111111 to BFF
Ideally in a simple way, since I'm not very good at Batch or PowerShell.
I would appreciate any kind of information.
Converting a binary string to an integer is pretty straightforward:
$number = [Convert]::ToInt32('10000100', 2)
Now we just need to convert it to hexadecimal:
$number.ToString('X')
or
'{0:X}' -f $number
(pure batch)
@ECHO OFF
SETLOCAL
CALL :CONVERT 10000100
CALL :CONVERT 101111111111
CALL :CONVERT 1111111111
GOTO :EOF
:: Convert %1 to hex
:CONVERT
SET "data=%1"
SET "result="
:cvtlp
:: If there are no characters left in `data` we are finished
IF NOT DEFINED data ECHO %1 ----^> %result%&GOTO :EOF
:: Get the last 4 characters of `data` and prefix with "000"
:: This way, if there are only say 2 characters left (xx), the result will be
:: 000xx. we then use the last 4 characters only
SET "hex4=000%data:~-4%"
SET "hex4=%hex4:~-4%"
:: remove last 4 characters from `data`
SET "data=%data:~0,-4%"
:: now convert to hex
FOR %%a IN (0 0000 1 0001 2 0010 3 0011 4 0100 5 0101 6 0110 7 0111
 8 1000 9 1001 A 1010 B 1011 C 1100 D 1101 E 1110 F 1111
 ) DO IF "%%a"=="%hex4%" (GOTO found) ELSE (SET "hex4=%%a")
:found
SET "result=%hex4%%result%"
GOTO cvtlp
This solution uses a parsing trick in the for %%a loop. The original value of hex4 is compared in the if and where the if fails, the value tested is assigned to hex4 so that when a match is found, the previous value tested remains in hex4.
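For readers who find the batch version hard to follow, the same right-to-left, four-bits-at-a-time idea can be sketched in Python. This is only an illustration of the algorithm, not a replacement for the Batch or PowerShell answers; the names NIBBLES and bin_to_hex are made up here.

```python
# Lookup table from a 4-bit group to its hex digit, e.g. "1100" -> "C".
NIBBLES = {format(i, "04b"): format(i, "X") for i in range(16)}

def bin_to_hex(bits):
    """Convert a string of 0s and 1s to hex, consuming 4 bits from the right."""
    result = ""
    while bits:
        chunk = bits[-4:].rjust(4, "0")   # pad a short leading group with zeros
        bits = bits[:-4]                  # drop the group just consumed
        result = NIBBLES[chunk] + result  # prepend, since we work right to left
    return result

for sample in ("10000100", "01010101", "101111111111", "1111111111"):
    print(sample, "->", bin_to_hex(sample))
```

Because it works on the string rather than on an integer, this approach has no 32-bit limit on the input length.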

How to sum values in a column grouped by values in the other

I have a large file consisting of data in 2 columns:
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in the 2nd column for rows with matching values in column 1. For this example, the output I'm expecting is
100 25
101 6
102 12
I'd prefer to do this with a bash script. Can someone explain how I can do this?
Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12
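As a Perl one-liner (single-quoted so that bash does not expand the $ variables; on Windows cmd use double quotes instead):
perl -lane '$s{$F[0]} += $F[1]; END { print qq{$_ $s{$_}} for keys %s }' file.txt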
You can use an associative array. The first column is the index and the second becomes what you add to it.
#!/bin/bash
declare -A columns=()
while read -r -a line ; do
    # Add column 2 to the running total keyed by column 1 (unset keys count as 0)
    columns[${line[0]}]=$(( ${columns[${line[0]}]:-0} + ${line[1]} ))
done < "${1}"
for idx in "${!columns[@]}" ; do
    echo "${idx} ${columns[${idx}]}"
done
Using awk and maintain the order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12
Python is my choice:
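import sys
from collections import defaultdict

d = defaultdict(int)  # missing keys start at 0
with open(sys.argv[1]) as f:  # input file given as the first argument
    for line in f:
        key, value = line.split()
        d[key] += int(value)  # values are read as strings, so convert
for key, total in d.items():
    print(key, total)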
Why would you want a bash script?

How can I search for different variants of bioinformatics motifs in a string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful when potential helpers on SO have to play 20 questions in comments to try to understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates an array of 1,000 random strings, each built from 5,429 two-character pairs (10,858 characters). I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my @bps;
push(@bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
    0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar @bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my @a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
    my $cand=substr($bps[$i], $offset, $nlen);
    my @a1=split(//, $cand);
    $diffs=0;
    foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
    next if $diffs > 3;
    push (@{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
    my @a1=split(//, $hit);
    $diffs=0;
    my $ds="";
    foreach my $j (0..$#a1) {
        if($a1[$j] eq $a2[$j]) {
            $ds.=" ";
        } else {
            $diffs++;
            $ds.=$a1[$j];
        }
    }
    print "Target: $target\n",
        "Candidate: $hit\n",
        "Differences: $ds $diffs differences\n",
        "Array element: ";
    foreach (@{$HoA{$hit}}) {
        print "$_ " ;
    }
    print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
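The comparison above counts position-by-position mismatches only, so a single insertion or deletion shifts everything after it and shows up as many differences. As a rough sketch of how insertions and deletions could be counted as well, here is plain Levenshtein edit distance in Python. The function names edit_distance and is_variant are made up for this illustration, the candidate strings other than the two taken from the program above are invented test inputs, and real sequence work would more likely use an existing alignment library.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of substitutions,
    insertions and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def is_variant(candidate, target, max_edits=3):
    """True if candidate is within max_edits edits of target."""
    return edit_distance(candidate, target) <= max_edits

target = "CGTCGCACAG"
for candidate in ("CGTCGCACAG", "CATGGCACGG", "CGTCGACAG", "TTTTTTTTTT"):
    print(candidate, edit_distance(candidate, target),
          is_variant(candidate, target))
```

Note that candidates found this way may differ in length from the target, so scanning a long string would need windows of several lengths rather than the single fixed-length substr used above.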
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(@repeat_info = $tandy->FindRepeat())
{print(join("\t",@repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note, that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good params, it can find what you set it upon finding.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).