Strange 64-bit time format from a game, can you recognize it?

I captured this 64-bit time format from a game and am trying to understand it.
I cannot use a date delta because every now and then the value changes completely and even becomes negative, as seen below.
v1:int64=-5990085973098618987; //2021-01-25 13:30:00
v2:int64=-5990085973321147595; //4 mins later
v3:int64=6140958949625363349; //7 mins later
v4:int64=6140958948894898101; //11 mins later
v5:int64=-174740204032730139; //16 mins later
v6:int64=-174740204054383467; //18 mins later
v7:int64=-6490439358095090795; //23 mins later
I tried splitting the 64-bit value into two 32-bit parts to get the low and high halves, but the values are still strange.
I also tried pdouble(@value)^ to read the 64 bits as a floating-point value, with equally strange results.
I am running out of ideas; maybe it is some kind of bitfield, or something else is going on.
hipart: -1394675573 | lopart: 1466441621 | hex: acdef08b|57681f95 | swap: -7701322112560996692
hipart: -1394675573 | lopart: 1243913013 | hex: acdef08b|4a249b35 | swap: 3862721007994330796
hipart: 1429803424 | lopart: -458425451 | hex: 553911a0|e4acfb95 | swap: -7639322244965910187
hipart: 1429803424 | lopart: -1188890699 | hex: 553911a0|b922f7b5 | swap: -5334757052947285675
hipart: -40684875 | lopart: -760849435 | hex: fd9332b5|d2a65be5 | swap: -1919757392230050819
hipart: -40684875 | lopart: -782502763 | hex: fd9332b5|d15bf495 | swap: -7641381711494605827
hipart: -1511173174 | lopart: -1467540587 | hex: a5ed53ca|a8871b95 | swap: -7702413578668347995
Any ideas welcomed, thanks in advance
//mbs
--EDIT: So far, thanks to Martin Rosenau we are able to encode like this:
func mulproc_nfsw(i:int64;key:uint32):int64;
begin
if (blnk i) or (blnk key) then exit;
p:pointer=@i;
hi:uint32=uint32(p+4)^; //30864159 (hex: 01d6f31f)
lo:uint32=uint32(p)^; //748455936 (hex: 2c9c8800)
hi64:int64=hi*key; //0135b55a acdef08b <-- keep
lo64:int64=lo*key; //1d566e0b a65f2800 <-- keep
q:pointer=@result; //-5990085971773806592
uint32(q+4)^:=hi64; //acdef08b
uint32(q)^:=lo64; //a65f2800
end;
func encode_time_nfsw(j:juncture):int64;
begin
if blnk j then exit; //input: '2021-01-25 13:37:07'
key:uint32=$A85A2115; //encode key
ft:int64=j.filetime; //hex: 01d6f31f 2c9c8800
result:=mulproc_nfsw(ft,key);
end;
--EDIT2: Finally, thanks to fpiette we are able to decode also:
func decode_time_nfsw(i:int64):juncture;
begin
if blnk i then exit; //input: -5990085971773806592
key:uint32=$3069263D; //decode key
ft:int64=mulproc_nfsw(i,key);
result.setfiletime(ft);
end;

I checked my suspicion that the high and the low 32 bits are simply multiplied by A85A2115 (hex):
We get a FILETIME structure. Then we perform a 32x32->32 bit multiplication of the high and the low dword independently (that is, we throw away the high 32 bits of each 64-bit product).
Example:
25 Jan 2021 13:37:07 (and some milliseconds)
Unencrypted FILETIME:
High dword = 1D6F31F (hex)
Low dword = 2C9CA481 (hex)
Multiplication
High dword: 1D6F31F * A85A2115 = 135B55AACDEF08B (hex)
Low dword: 2C9CA481 * A85A2115 = 1D5680CA57681F95 (hex)
Now only take the low 32 bits of the results:
High dword: ACDEF08B (hex)
Low dword: 57681F95 (hex)
Unfortunately, I don't know how to do the "reverse operation"; I did it by searching for the result in a loop with the following pseudo-code:
encryptedValue = 57681F95 (hex)
originalValue = 0
product = 0
while product not equal to encryptedValue
    // 32-bit addition discarding carry:
    product = product + A85A2115 (hex)
    originalValue = originalValue + 1
end_of_while_loop
We get the following results:
25 Jan 2021 13:37:07 => acdef08b|57681f95
25 Jan 2021 13:40:51 => acdef08b|4a249b35
25 Jan 2021 13:45:07 => 553911a0|e4acfb95
25 Jan 2021 13:49:03 => 553911a0|b922f7b5
25 Jan 2021 13:53:53 => fd9332b5|d2a65be5
25 Jan 2021 13:55:50 => fd9332b5|d15bf495
25 Jan 2021 14:00:39 => a5ed53ca|a8871b95
Addendum
The reverse operation seems to be done by multiplying with 3069263D (hex) (and only using the low 32 bits).
Encrypting:
2C9CA481 * A85A2115 = 1D5680CA57681F95
=> Result: 57681F95
Decrypting:
57681F95 * 3069263D = 10876CAF2C9CA481
=> Result: 2C9CA481
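The two keys work as a pair because A85A2115 is odd, so it has a multiplicative inverse modulo 2^32, and 3069263D is that inverse: the low 32 bits of their product are 00000001. Below is a minimal Python sketch of the per-dword scheme described above. The function names are illustrative and not taken from the game, and the FILETIME conversion assumes the timestamp is UTC.

```python
from datetime import datetime, timedelta

MASK32 = 0xFFFFFFFF
ENCODE_KEY = 0xA85A2115
DECODE_KEY = 0x3069263D  # (ENCODE_KEY * DECODE_KEY) mod 2**32 == 1


def mul_halves(value, key):
    """Multiply the high and low 32-bit halves by key, keeping the low 32 bits of each product."""
    hi = (((value >> 32) & MASK32) * key) & MASK32
    lo = ((value & MASK32) * key) & MASK32
    return (hi << 32) | lo


def to_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed int64."""
    return value - (1 << 64) if value >= (1 << 63) else value


def filetime_to_datetime(ft):
    """FILETIME counts 100 ns ticks since 1601-01-01."""
    return datetime(1601, 1, 1) + timedelta(microseconds=ft // 10)


filetime = 0x01D6F31F2C9CA481               # example value from the answer
encoded = mul_halves(filetime, ENCODE_KEY)
decoded = mul_halves(encoded, DECODE_KEY)

print(f"{encoded:016x}")                    # acdef08b57681f95
print(to_signed64(encoded))                 # the int64 the game stores
print(filetime_to_datetime(decoded))
```

Because `mul_halves` masks its input, it also accepts the negative int64 values listed in the question directly.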

Related

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are getting stored in the Cassandra table.
Below is the Code snippet:
val states = sc.textFile("state.txt")
// list of all 50 states of the USA
var n = 0 // corrected to var
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: I made the first column a PRIMARY KEY, and in my code n is incremented up to a maximum of 28 and then starts again from 1 and runs to 22 (50 in total).
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I also tried making n an accumulator variable (viz. val n = sc.accumulator(0,"Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n to stop being updated after the value 28. Also, how can I create a counter that I can use when building the RDD?

There are some misconceptions about distributed systems embedded inside your question. The real heart of this is "How do I have a counter in a distributed system?"
The short answer is you don't. For example what you've done in your code example originally is something like this.
Task One {
var x = 0
record 1: x = 1
record 2: x = 2
}
Task Two {
var x = 0
record 20: x = 1
record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
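The effect can be reproduced without a cluster. The following is a plain-Python simulation, not Spark code: the `partitions` list stands in for the two tasks, the state names are placeholders, and the dictionary stands in for a table keyed on state_id, where writing an existing key overwrites the row.

```python
def number_within_task(partition):
    """Mimic the map closure: every task starts from its own copy of n = 0."""
    n = 0
    numbered = []
    for name in partition:
        n += 1
        numbered.append((n, name))
    return numbered


# Placeholder data; the 22/28 split mirrors the two partitions in the question.
states = [f"state-{i:02d}" for i in range(50)]
partitions = [states[:22], states[22:]]

rows = [row for part in partitions for row in number_within_task(part)]

# A table keyed on state_id behaves like an upsert: a repeated key overwrites.
table = {}
for state_id, name in rows:
    table[state_id] = name

print(len(rows), len(table))  # 50 28
```

All 50 rows are produced, but only 28 distinct keys exist, which matches the row count seen in the question.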
For most use cases the "counter" question can be replaced with "How can I get a Unique Identifier per Record in a distributed system?"
For this most users end up using a UUID which can be generated on independent machines with infinitesimal chances of conflicts.
If the question is instead "How can I get a monotonically increasing unique identifier?"
then you can use zipWithUniqueId, which does not count but generates unique ids that increase within each partition (zipWithIndex gives consecutive indices instead).
If you just want them numbered to start with, it's best to do it on the local system.
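To see why consecutive numbering across partitions needs coordination, here is a small plain-Python imitation of the idea: count every partition first, derive a starting offset for each, then number locally. It is a sketch of the concept only, not Spark's implementation, and `zip_with_index` is a hypothetical helper name.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Number items consecutively across partitions using per-partition offsets."""
    sizes = [len(part) for part in partitions]      # pass 1: count each partition
    offsets = [0] + list(accumulate(sizes))[:-1]    # first index of each partition
    return [
        [(item, offset + i) for i, item in enumerate(part)]  # pass 2: local index + offset
        for part, offset in zip(partitions, offsets)
    ]


partitions = [["a", "b", "c"], ["d", "e"], ["f"]]
indexed = zip_with_index(partitions)
print(indexed)
```

The first pass over all partitions is the price of a gap-free numbering; ids that only need to be unique can skip it.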
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition{ it => it.foreach(y => x+= 1); println(x)}
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.

Two byte report count for hid report descriptor

I'm trying to create an HID report descriptor for USB 3.0 with a report count of 1024 bytes.
The documentation at usb.org for HID does not seem to mention a two byte report count. Nonetheless, I have seen some people use 0x96 (instead of 0x95) to enter a two byte count, such as:
0x96, 0x00, 0x02, // REPORT_COUNT (512)
which was taken from here:
Custom HID device HID report descriptor
Likewise, from this same example, 0x26 is used for a two byte logical maximum.
Where did this 0x96 and 0x26 field come from? I don't see any documentation for it.

REPORT_COUNT is defined in the Device Class Definition for HID 1.11 document in section 6.2.2.7 Global Items on page 36 as:
Report Count 1001 01 nn Unsigned integer specifying the number of data
fields for the item; determines how many fields are included in the
report for this particular item (and consequently how many bits are
added to the report).
The nn in the above code is the item length indicator (bSize) and is defined earlier in section 6.2.2.2 Short Items as:
bSize Numeric expression specifying size of data:
0 = 0 bytes
1 = 1 byte
2 = 2 bytes
3 = 4 bytes
Rather confusingly, the valid values of bSize are listed in decimal. So, in binary, the bits for nn would be:
00 = 0 bytes (i.e. there is no data associated with this item)
01 = 1 byte
10 = 2 bytes
11 = 4 bytes
Putting it all together for REPORT_COUNT, which is an unsigned integer, the following alternatives could be specified:
1001 01 00 = 0x94 = REPORT_COUNT with no length (can only have value 0?)
1001 01 01 = 0x95 = 1-byte REPORT_COUNT (can have a value from 0 to 255)
1001 01 10 = 0x96 = 2-byte REPORT_COUNT (can have a value from 0 to 65535)
1001 01 11 = 0x97 = 4-byte REPORT_COUNT (can have a value from 0 to 4294967295)
Similarly, for LOGICAL_MAXIMUM, which is a signed integer (usually, there is an exception):
0010 01 00 = 0x24 = LOGICAL_MAXIMUM with no length (can only have value 0?)
0010 01 01 = 0x25 = 1-byte LOGICAL_MAXIMUM (can have a value from -128 to 127)
0010 01 10 = 0x26 = 2-byte LOGICAL_MAXIMUM (can have a value from -32768 to 32767)
0010 01 11 = 0x27 = 4-byte LOGICAL_MAXIMUM (can have a value from -2147483648 to 2147483647)
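The prefix arithmetic above can be sketched in Python: OR the bSize code into the low two bits of the item prefix and append the value in little-endian order. The helper name `short_item` is made up for illustration, and choosing the smallest size that fits is a design choice of this sketch, not something the specification requires.

```python
import struct

REPORT_COUNT = 0x94      # 1001 01 nn with nn = 00
LOGICAL_MAXIMUM = 0x24   # 0010 01 nn with nn = 00


def short_item(prefix, value, signed=False):
    """Encode a short item using the smallest data size (1, 2 or 4 bytes) that holds value."""
    formats = ("b", "h", "i") if signed else ("B", "H", "I")
    for size_code, fmt in zip((1, 2, 3), formats):   # bSize codes for 1, 2 and 4 bytes
        try:
            data = struct.pack("<" + fmt, value)     # little-endian payload
        except struct.error:
            continue                                 # does not fit, try the next size
        return bytes([prefix | size_code]) + data
    raise ValueError("value does not fit in 4 bytes")


print(short_item(REPORT_COUNT, 512).hex())                  # 960002
print(short_item(REPORT_COUNT, 1024).hex())                 # 960004
print(short_item(LOGICAL_MAXIMUM, 255, signed=True).hex())  # 26ff00
```

The last line shows why 0x26 turns up so often: 255 does not fit in a signed byte, so a LOGICAL_MAXIMUM of 255 needs the 2-byte form.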
The specification is unclear on what value a zero-length item defaults to in general. It only mentions, at the end of section 6.2.2.4 Main Items, that MAIN item types and, within that type, INPUT item tags, have a default value of 0:
Remarks - The default data value for all Main items is zero (0).
- An Input item could have a data size of zero (0) bytes. In this case the value of
each data bit for the item can be assumed to be zero. This is functionally
identical to using a item tag that specifies a 4-byte data item followed by four
zero bytes.
It would be reasonable to assume 0 as the default for other item types too, but for REPORT_COUNT (a GLOBAL item) a value of 0 is not really a sensible default (IMHO). The specification doesn't really say.

Junk output or random character in Data::Dumper output for Net::DNS::Resolver object

I am getting familiarized with the Net::DNS library in Perl and an object is created using
my $res = Net::DNS::Resolver->new();
However, simply trying to query a domain name shows a lot of junk values, though the output itself is correct. Here is the code snippet:
#!/usr/bin/perl
use Net::DNS;
use Net::IP;
use Data::Dumper;
my $rr;
$domain = 'google.com';
my $res = Net::DNS::Resolver->new();
my $ns_req = $res->query($domain, "NS");
print "\n\n###\n".Dumper($ns_req)."\n###\n\n";
Here are 2 outputs for various domains tested against this object:
What are these junk values being displayed? Is there a way to clean up the output a bit in order to read the output properly?

You are dumping the internals of the object which include the buffer which holds the original response bytes.
You should use the API defined in the module documentation to access the information.
#!/usr/bin/env perl
use strict;
use warnings;
use Net::DNS;
my $resolver = Net::DNS::Resolver->new;
my $result = $resolver->query('google.com', "NS");
$result->print;
Output:
;; Answer received from x.x.x.x (100 bytes)
;; HEADER SECTION
;; id = 39595
;; qr = 1 aa = 0 tc = 0 rd = 1 opcode = QUERY
;; ra = 1 z = 0 ad = 0 cd = 0 rcode = NOERROR
;; qdcount = 1 ancount = 4 nscount = 0 arcount = 0
;; do = 0
;; QUESTION SECTION (1 record)
;; google.com. IN NS
;; ANSWER SECTION (4 records)
google.com. 21599 IN NS ns4.google.com.
google.com. 21599 IN NS ns2.google.com.
google.com. 21599 IN NS ns1.google.com.
google.com. 21599 IN NS ns3.google.com.
;; AUTHORITY SECTION (0 records)
;; ADDITIONAL SECTION (0 records)
The query method returns a Net::DNS::Packet which provides other methods to obtain specific parts of the response.
For example:
#!/usr/bin/env perl
use strict;
use warnings;
use Net::DNS;
my $resolver = Net::DNS::Resolver->new;
my $result = $resolver->query('google.com', "NS");
for my $answer ($result->answer) {
    print $answer->nsdname, "\n";
}
Output:
ns2.google.com
ns1.google.com
ns3.google.com
ns4.google.com
If you are interested in the contents of the binary buffer, Net::DNS::Packet has a data method which returns the contents of that buffer. As RFC 1035 points out:
3.2. RR definitions
3.2.1. Format
All RRs have the same top level format shown below:
                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                                               /
/                      NAME                     /
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     CLASS                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TTL                      |
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                   RDLENGTH                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
/                     RDATA                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
where:
NAME an owner name, i.e., the name of the node to which this
resource record pertains.
TYPE two octets containing one of the RR TYPE codes.
CLASS two octets containing one of the RR CLASS codes.
TTL a 32 bit signed integer that specifies the time interval
that the resource record may be cached before the source
of the information should again be consulted. Zero
values are interpreted to mean that the RR can only be
used for the transaction in progress, and should not be
cached. For example, SOA records are always distributed
with a zero TTL to prohibit caching. Zero values can
also be used for extremely volatile data.
RDLENGTH an unsigned 16 bit integer that specifies the length in
octets of the RDATA field.
RDATA a variable length string of octets that describes the
resource. The format of this information varies
according to the TYPE and CLASS of the resource record.
You can examine the contents of $result->data by doing a hexdump:
#!/usr/bin/env perl
use strict;
use warnings;
use Net::DNS;
my $resolver = Net::DNS::Resolver->new;
my $result = $resolver->query('google.com', "NS");
print $result->data;
C:\...\t> perl tt.pl | xxd
00000000: 3256 8180 0001 0004 0000 0000 0667 6f6f 2V...........goo
00000010: 676c 6503 636f 6d00 0002 0001 c00c 0002 gle.com.........
00000020: 0001 0000 545f 0006 036e 7333 c00c c00c ....T_...ns3....
00000030: 0002 0001 0000 545f 0006 036e 7334 c00c ......T_...ns4..
00000040: c00c 0002 0001 0000 545f 0006 036e 7332 ........T_...ns2
00000050: c00c c00c 0002 0001 0000 545f 0006 036e ..........T_...n
00000060: 7331 c00c s1..
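To make the hexdump less opaque, the first 28 bytes can be decoded by hand. The sketch below is Python rather than Perl and uses only the bytes shown in the dump above; the header layout (six 16-bit big-endian fields) comes from RFC 1035, and `read_name` is a simplified helper that does not follow compression pointers such as c00c.

```python
import struct

# Header (12 bytes) + question section, copied from the start of the hexdump.
packet = bytes.fromhex(
    "325681800001000400000000"   # id, flags, qdcount, ancount, nscount, arcount
    "06676f6f676c6503636f6d00"   # 6 "google" 3 "com" 0
    "00020001"                   # qtype = NS (2), qclass = IN (1)
)

ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])


def read_name(buf, pos):
    """Read length-prefixed labels up to the zero byte (no compression support)."""
    labels = []
    while buf[pos] != 0:
        length = buf[pos]
        labels.append(buf[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels), pos + 1


qname, pos = read_name(packet, 12)
qtype, qclass = struct.unpack("!2H", packet[pos:pos + 4])

print(hex(ident), hex(flags), qdcount, ancount)  # 0x3256 0x8180 1 4
print(qname, qtype, qclass)                      # google.com 2 1
```

The four answers counted in the header are the four NS records that follow in the rest of the dump.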

How can I convert a 5 digit int date and 7 digit int time to a real date?

I've come across some data where the date for today's value is 77026 and the time (as of a few minutes ago) is 4766011. FYI: today is Fri, 18 Nov 2011 12:54:46 -0600
I can't figure out how these represent a date/time, and there is no supporting documentation.
How can I convert these numbers to a date value?
Some other dates from today are:
77026 | 4765509
77026 | 4765003
77026 | 4714129
77026 | 4617107
And some dates from what is probably yesterday:
77025 | 6292509
77025 | 6238790
77025 | 4009544

Ok, with your expanded examples, it would appear the first number is a day count. That'd put this time system's epoch at
to_days(today) = 734824
734824 - 77025 = 657799
from_days(657799) = Dec 29, 1800
The time values are problematic: it looks like they're decreasing (unless you listed the most recent first?), but if they are some "number of intervals since midnight", then centiseconds seem likely. That would give a range of 0 - 8,640,000.
4765509 = 47655.09 seconds -> sec_to_time(47655) = 13:14:15
sec_to_time(47650.03) -> 13:14:10
sec_to_time(47141.29) -> 13:05:41
sec_to_time(46171.07) -> 12:49:31
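The same arithmetic can be written as a short Python sketch. It anchors the day count on the pair given in the question (77026 on 18 Nov 2011) and treats the time as centiseconds since midnight; both are the working assumptions of this answer, not documented facts about the data.

```python
from datetime import date, time, timedelta

# Anchor taken from the question: day number 77026 was 2011-11-18.
EPOCH = date(2011, 11, 18) - timedelta(days=77026)


def to_date(day_number):
    """Convert a day count to a calendar date."""
    return EPOCH + timedelta(days=day_number)


def to_time(centiseconds):
    """Interpret the value as hundredths of a second since midnight (an assumption)."""
    seconds, hundredths = divmod(centiseconds, 100)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, hundredths * 10_000)


print(to_date(1))         # 1800-12-29
print(to_date(77025))     # 2011-11-17
print(to_time(4765509))   # 13:14:15.090000
print(to_time(4617107))   # 12:49:31.070000
```

Day 1 falls on Dec 29, 1800, which agrees with the from_days result above.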

How can I search for different variants of bioinformatics motifs in a string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?

I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful when potential helpers on SO have to play 20 questions in comments to try to understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates 1,000 random strings, each built from 5,429 two-character pairs (10,858 characters), and stores them in an array. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my @bps;
push(@bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
    0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar @bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my @a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
    my $cand=substr($bps[$i], $offset, $nlen);
    my @a1=split(//, $cand);
    $diffs=0;
    foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
    next if $diffs > 3;
    push (@{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
    my @a1=split(//, $hit);
    $diffs=0;
    my $ds="";
    foreach my $j (0..$#a1) {
        if($a1[$j] eq $a2[$j]) {
            $ds.=" ";
        } else {
            $diffs++;
            $ds.=$a1[$j];
        }
    }
    print "Target: $target\n",
        "Candidate: $hit\n",
        "Differences: $ds $diffs differences\n",
        "Array element: ";
    foreach (@{$HoA{$hit}}) {
        print "$_ " ;
    }
    print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
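The program above counts substitutions at a fixed offset only. For insertions and deletions as well, one option is the standard dynamic-programming approach to approximate matching (often attributed to Sellers), sketched here in Python rather than Perl. The function name `fuzzy_find` and the test strings are illustrative; the sketch reports end positions only and makes no attempt at speed.

```python
def fuzzy_find(pattern, text, max_edits=3):
    """Return (end_index, edits) wherever a substring of text ending at end_index
    is within max_edits substitutions/insertions/deletions of pattern."""
    prev = list(range(len(pattern) + 1))   # cost of matching pattern[:i] against nothing
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0]                          # a match may start anywhere in text for free
        for i, pc in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (pc != ch),  # match or substitution
                prev[i] + 1,               # extra character in text
                cur[i - 1] + 1,            # pattern character missing from text
            ))
        if cur[-1] <= max_edits:
            hits.append((end, cur[-1]))
        prev = cur
    return hits


target = "CGTCGCACAG"
exact = fuzzy_find(target, "AAAA" + target + "TTTT")
one_deletion = fuzzy_find(target, "AAAA" + "CGTCGACAG")   # one C removed
print(exact)
print(one_deletion)
```

Neighbouring end positions are usually reported as well, each with its own edit count, so a real tool would keep only the best hit in each cluster.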

I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.

When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(@repeat_info = $tandy->FindRepeat())
{print(join("\t",@repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good parameters, it can find what you set it to find.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).
