Horrific collisions of adler32 hash

When using adler32() as a hash function, one should expect rare collisions.
We can do the exact math of the collision probability, but roughly speaking,
since it is a 32-bit hash function, there should not be many collisions
on a sample set of a few thousand items.
This is hardly the case.
Here is an example: let's take strings that include a date in the middle, something like
"Some prefix text " + date + " some postfix text."
where the date format is yyyy-mm-dd, looping over every day of 2012.
There are 91 collisions in this example!
Even worse: there are 7 cases where 3 dates collided.
How come such a commonly used hash function performs so poorly?
Or am I missing something?
Here are the detailed results of the above example:
0x592a0f1f: 2012-01-30, 2012-02-02, 2012-10-21
0x592b0f1f: 2012-02-11, 2012-10-30, 2012-11-02
0x593d0f20: 2012-01-31, 2012-02-03, 2012-10-22
0x593e0f20: 2012-02-12, 2012-10-31, 2012-11-03
0x59410f20: 2012-03-11, 2012-11-30, 2012-12-02
0x59560f21: 2012-03-30, 2012-04-02, 2012-12-21
0x59690f22: 2012-03-31, 2012-04-03, 2012-12-22
0x59020f1d: 2012-01-10, 2012-10-01
0x59150f1e: 2012-01-11, 2012-10-02
0x59160f1e: 2012-01-20, 2012-10-11
0x59170f1e: 2012-02-01, 2012-10-20
0x59180f1e: 2012-02-10, 2012-11-01
0x59280f1f: 2012-01-12, 2012-10-03
0x59290f1f: 2012-01-21, 2012-10-12
0x592c0f1f: 2012-02-20, 2012-11-11
0x592d0f1f: 2012-03-01, 2012-11-20
0x592e0f1f: 2012-03-10, 2012-12-01
0x593b0f20: 2012-01-13, 2012-10-04
0x593c0f20: 2012-01-22, 2012-10-13
0x593f0f20: 2012-02-21, 2012-11-12
0x59400f20: 2012-03-02, 2012-11-21
0x59420f20: 2012-03-20, 2012-12-11
0x59430f20: 2012-04-01, 2012-12-20
0x594e0f21: 2012-01-14, 2012-10-05
0x594f0f21: 2012-01-23, 2012-10-14
0x59500f21: 2012-02-04, 2012-10-23
0x59510f21: 2012-02-13, 2012-11-04
0x59520f21: 2012-02-22, 2012-11-13
0x59530f21: 2012-03-03, 2012-11-22
0x59540f21: 2012-03-12, 2012-12-03
0x59550f21: 2012-03-21, 2012-12-12
0x59570f21: 2012-04-11, 2012-12-30
0x59610f22: 2012-01-15, 2012-10-06
0x59620f22: 2012-01-24, 2012-10-15
0x59630f22: 2012-02-05, 2012-10-24
0x59640f22: 2012-02-14, 2012-11-05
0x59650f22: 2012-02-23, 2012-11-14
0x59660f22: 2012-03-04, 2012-11-23
0x59670f22: 2012-03-13, 2012-12-04
0x59680f22: 2012-03-22, 2012-12-13
0x596a0f22: 2012-04-12, 2012-12-31
0x596c0f22: 2012-04-30, 2012-05-02
0x59740f23: 2012-01-16, 2012-10-07
0x59750f23: 2012-01-25, 2012-10-16
0x59760f23: 2012-02-06, 2012-10-25
0x59770f23: 2012-02-15, 2012-11-06
0x59780f23: 2012-02-24, 2012-11-15
0x59790f23: 2012-03-05, 2012-11-24
0x597a0f23: 2012-03-14, 2012-12-05
0x597b0f23: 2012-03-23, 2012-12-14
0x597c0f23: 2012-04-04, 2012-12-23
0x59820f23: 2012-05-30, 2012-06-02
0x59870f24: 2012-01-17, 2012-10-08
0x59880f24: 2012-01-26, 2012-10-17
0x59890f24: 2012-02-07, 2012-10-26
0x598a0f24: 2012-02-16, 2012-11-07
0x598b0f24: 2012-02-25, 2012-11-16
0x598c0f24: 2012-03-06, 2012-11-25
0x598d0f24: 2012-03-15, 2012-12-06
0x598e0f24: 2012-03-24, 2012-12-15
0x598f0f24: 2012-04-05, 2012-12-24
0x59950f24: 2012-05-31, 2012-06-03
0x59980f24: 2012-06-30, 2012-07-02
0x599a0f25: 2012-01-18, 2012-10-09
0x599b0f25: 2012-01-27, 2012-10-18
0x599c0f25: 2012-02-08, 2012-10-27
0x599d0f25: 2012-02-17, 2012-11-08
0x599e0f25: 2012-02-26, 2012-11-17
0x599f0f25: 2012-03-07, 2012-11-26
0x59a00f25: 2012-03-16, 2012-12-07
0x59a10f25: 2012-03-25, 2012-12-16
0x59a20f25: 2012-04-06, 2012-12-25
0x59ae0f25: 2012-07-30, 2012-08-02
0x59ae0f26: 2012-01-28, 2012-10-19
0x59af0f26: 2012-02-09, 2012-10-28
0x59b00f26: 2012-02-18, 2012-11-09
0x59b10f26: 2012-02-27, 2012-11-18
0x59b20f26: 2012-03-08, 2012-11-27
0x59b30f26: 2012-03-17, 2012-12-08
0x59b40f26: 2012-03-26, 2012-12-17
0x59b50f26: 2012-04-07, 2012-12-26
0x59c10f26: 2012-07-31, 2012-08-03
0x59c40f26: 2012-08-30, 2012-09-02
0x59c40f27: 2012-02-28, 2012-11-19
0x59c50f27: 2012-03-09, 2012-11-28
0x59c60f27: 2012-03-18, 2012-12-09
0x59c70f27: 2012-03-27, 2012-12-18
0x59c80f27: 2012-04-08, 2012-12-27
0x59d70f27: 2012-08-31, 2012-09-03
0x59da0f28: 2012-03-28, 2012-12-19
0x59db0f28: 2012-04-09, 2012-12-28

Adler-32 was never intended to be and is not a hash function. Its purpose is error detection after decompression. It serves that purpose well since it is fast and since errors in the compressed data are amplified by the decompressor.
In the examples you give, you are using Adler-32 on very short strings, for which it has no chance to even make use of all 32 bits. Adler-32 requires at least a few hundred bytes to get rolling.
There are many non-cryptographic hash functions that are very fast and have good hash behavior, including avoidance of collisions. Take a look at the CityHash family of hash functions.
If you need cryptographic (non-spoofable) hash functions, then look at SHA-2 and SHA-3.
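To see why such short, similar strings collide, recall that Adler-32 keeps two 16-bit sums: A is 1 plus the sum of all bytes, and B is the sum of the successive values of A, both modulo 65521. Two strings of equal length that differ only in a few digits therefore collide whenever their digit sum and their position-weighted digit sum coincide. Here is a minimal Python sketch that reruns the experiment from the question; the prefix and suffix wording is copied from the question's example, and the CRC-32 comparison is my own addition for contrast, not something the question measured.

```python
import zlib
from collections import defaultdict
from datetime import date, timedelta

# Wording taken from the question's example; any fixed prefix/suffix would do.
PREFIX = "Some prefix text "
SUFFIX = " some postfix text."


def dates_in_year(year):
    """Yield every day of the given year as a yyyy-mm-dd string."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day.isoformat()
        day += timedelta(days=1)


def colliding_groups(hash_fn, year=2012):
    """Map each hash value shared by two or more dates to those dates."""
    buckets = defaultdict(list)
    for day in dates_in_year(year):
        text = (PREFIX + day + SUFFIX).encode("ascii")
        buckets[hash_fn(text)].append(day)
    return {h: days for h, days in buckets.items() if len(days) > 1}


adler_groups = colliding_groups(zlib.adler32)
crc_groups = colliding_groups(zlib.crc32)

print("adler32 hash values shared by several dates:", len(adler_groups))
print("crc32 hash values shared by several dates:", len(crc_groups))
for value, days in sorted(adler_groups.items())[:5]:
    print(f"0x{value:08x}: {', '.join(days)}")
```

Swapping `hash_fn` for another 32-bit function is an easy way to compare candidates on your own data before committing to one.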

Related

How do I calculate a rolling 30 day window in KDB?

I have a keyed table of the form:
t | ar av mr mv
-----------------------------| ----------------------------------------
2016.01.04D09:51:00.000000000| -0.001061315 513 -0.01507338 576
2016.01.04D11:37:00.000000000| -0.0004846135 618 -0.001100514 583
2016.01.04D12:04:00.000000000| -0.0009708739 1619 -0.001653045 1000
I want to calculate the 30 day rolling correlation ar cor mr.
I'm stuck trying to create a self join with wj, but I'm not getting anywhere. Is this the way to do it?
You could do something like:
/-Function which creates the rolling windows (w:window size, s:list)
q)f:{[w;s] (w-1)_({ 1_x,y }\[w#0;s])}
/-e.g.
q)f[3;til 5]
0 1 2
1 2 3
2 3 4
/-Apply cor to each 30-day rolling window as below:
q)ar:exec ar from t;
q)mr:exec mr from t;
q)cor'[f[30;ar]; f[30; mr]]
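Note that f[30;ar] builds windows of 30 consecutive rows, which is only the same as 30 days if the table has exactly one row per day. For readers who do not read q, here is a rough Python sketch of what the snippet computes. The function names are mine, and the Pearson correlation is written out by hand so the example stays dependency-free.

```python
from math import sqrt


def rolling_windows(w, xs):
    """All consecutive length-w slices of xs, like f[w;xs] in the q answer."""
    return [xs[i:i + w] for i in range(len(xs) - w + 1)]


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rolling_cor(w, xs, ys):
    """Correlation of each pair of aligned windows, like cor'[f[w;xs]; f[w;ys]]."""
    return [pearson(a, b)
            for a, b in zip(rolling_windows(w, xs), rolling_windows(w, ys))]


print(rolling_windows(3, list(range(5))))
print(rolling_cor(3, [1, 2, 3, 4], [2, 4, 6, 8]))
```

A true 30-day window would instead have to select rows by timestamp, which is what the wj attempt in the question was aiming at.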

WinDbg: How do I include a thread id and time value in a breakpoint .printf _without_ using pseudo registers?

I have some breakpoint "pairs," and I'd like to measure the time in between when they are hit.
The simplest thing that would allow me to do this is to include some sort of timestamp (even if it's just clock ticks or something) in the .printf I use when the breakpoint is hit.
I could use the pseudo registers $tid and $dbgtime in the breakpoint code. When I do, the performance really suffers.
bp1000 ucrtbase!malloc ".printf \"[0x%08x] [ucrtbase] [0x%04x] [0x%08x] malloc(%d): \", $dbgtime, $tid, dwo(@esp), dwo(@esp+4); gc "
When the same code is used (without using meaningful values for timestamp and thread id), things work much better.
bp1000 ucrtbase!malloc ".printf \"[0x%08x] [ucrtbase] [0x%04x] [0x%08x] malloc(%d): \", 0, 0, dwo(@esp), dwo(@esp+4); gc "
Is there some other (high-performance) way to get this information? The current time is more valuable than the thread ID. I can always make the breakpoint only apply to a specific thread so that emitting the ID is only sugar.
Try this:
0:000> bp ucrtbase!malloc "~# ; .echotime ; dd @$csp l2 ; gc ;"
0:000> bl
0 e 00007ff8`ab61c9e0 0001 (0001) 0:**** ucrtbase!malloc "~# ; .echotime ; dd @$csp l2 ; gc ;"
0:000> g
. 0 Id: 1a84.1f14 Suspend: 1 Teb: 00000018`f49d1000 Unfrozen
Start: cdb!wmainCRTStartup (00007ff6`efd2bbf0)
Priority: 0 Priority class: 32 Affinity: f
Debugger (not debuggee) time: Wed Aug 7 22:17:44.992 2019
00000018`f47eeb58 ab622762 00007ff8
. 0 Id: 1a84.1f14 Suspend: 1 Teb: 00000018`f49d1000 Unfrozen
Start: cdb!wmainCRTStartup (00007ff6`efd2bbf0)
Priority: 0 Priority class: 32 Affinity: f
Debugger (not debuggee) time: Wed Aug 7 22:17:44.992 2019 (UTC + 5:30)
00000018`f47eeb08 ab622762 00007ff8

Finding the values of positive peaks in smooth wave form

This is some code I wrote to search for the peaks of a very clean (no noise) signal where fun is an array containing evenly sampled data of a sine wave.
J=[fun(1)];
K=[1];
count=1;
for i=2:1.0:(length(fun)-2)
if fun(i-1)<fun(i) && fun(i)>fun(i+1)
J=[J,fun(i+1)];
K=[K,count+1];
end
count=count+1;
end
Included below is the data that I am trying to process.
The code found the peaks at the 664th and 991st entry, but none of the ones in between. I wrote the same algorithm in c++ and got the same result, so it is an algorithm problem, not language specific.
Please help me find the error or give me another solution.
fun = -1*pi/180*[-90.15
-90.00
-89.70
-89.10
-88.50
-87.75
-86.70
-85.65
-84.30
-82.95
-81.45
-79.80
-78.15
-76.35
-74.55
-72.30
-70.20
-67.80
-65.40
-62.70
-60.00
-57.15
-54.30
-51.15
-48.00
-44.85
-41.40
-37.95
-34.50
-30.90
-27.30
-23.55
-19.80
-16.05
-12.15
-8.25
-4.95
-1.50
1.95
4.80
7.80
10.65
13.95
17.40
20.70
23.85
27.15
30.30
33.45
36.45
39.45
42.45
45.30
48.00
50.70
53.40
55.95
58.35
60.75
63.15
65.25
67.35
69.45
71.40
73.20
74.85
76.50
78.15
79.50
80.85
82.05
83.25
84.15
85.05
85.95
86.70
87.45
88.05
88.50
88.95
89.10
89.25
89.40
89.25
89.10
88.95
88.50
88.05
87.45
86.70
86.10
85.20
84.30
83.25
82.20
81.00
79.65
78.15
76.65
75.00
73.35
71.55
69.60
67.50
65.40
63.30
60.90
58.65
56.10
53.55
51.00
48.30
45.45
42.60
39.75
36.75
33.75
30.60
27.45
24.30
21.00
17.70
14.40
11.10
7.65
4.80
1.95
-0.90
-4.35
-7.65
-11.10
-14.85
-18.75
-22.35
-26.10
-29.70
-33.30
-36.75
-40.20
-43.50
-46.80
-49.95
-52.95
-55.95
-58.65
-61.35
-63.90
-66.45
-68.85
-70.95
-73.05
-75.00
-76.80
-78.45
-80.10
-81.60
-82.95
-84.15
-85.20
-86.10
-87.00
-87.60
-88.05
-88.50
-88.80
-88.80
-88.80
-88.80
-88.50
-88.05
-87.60
-87.00
-86.25
-85.50
-84.45
-83.25
-82.05
-80.55
-79.05
-77.40
-75.60
-73.65
-71.55
-69.45
-67.20
-64.65
-62.25
-59.55
-56.70
-53.85
-50.85
-47.70
-44.55
-41.25
-37.95
-34.50
-30.90
-27.30
-23.70
-19.95
-16.20
-12.45
-8.55
-5.25
-1.95
1.50
4.35
7.20
10.05
13.35
16.65
19.95
23.10
26.40
29.55
32.55
35.55
38.55
41.40
44.25
47.10
49.80
52.35
54.90
57.30
59.70
61.95
64.05
66.30
68.25
70.20
72.00
73.65
75.30
76.80
78.30
79.65
80.85
81.90
82.95
83.85
84.75
85.50
86.10
86.55
87.00
87.45
87.60
87.75
87.75
87.75
87.60
87.30
87.00
86.55
85.95
85.35
84.60
83.70
82.80
81.75
80.55
79.35
78.00
76.50
75.00
73.35
71.70
69.75
67.95
65.85
63.75
61.50
59.25
56.85
54.45
51.90
49.35
46.65
43.80
40.95
38.10
35.10
32.10
28.95
25.95
22.65
19.50
16.20
13.05
9.75
6.90
4.05
1.05
-1.80
-5.10
-8.40
-11.70
-15.45
-19.20
-22.95
-26.55
-30.15
-33.60
-37.05
-40.35
-43.65
-46.80
-49.95
-52.80
-55.65
-58.50
-61.05
-63.60
-66.00
-68.25
-70.50
-72.45
-74.40
-76.20
-77.85
-79.35
-80.70
-81.90
-83.10
-84.15
-85.05
-85.80
-86.40
-86.85
-87.15
-87.45
-87.45
-87.45
-87.30
-87.00
-86.55
-85.95
-85.35
-84.45
-83.55
-82.50
-81.30
-79.95
-78.45
-76.95
-75.15
-73.35
-71.40
-69.30
-67.05
-64.65
-62.25
-59.70
-57.00
-54.15
-51.30
-48.30
-45.15
-41.85
-38.55
-35.25
-31.80
-28.20
-24.60
-21.00
-17.25
-13.65
-9.90
-6.60
-3.30
0.15
2.85
5.70
8.55
11.40
14.70
17.85
21.15
24.30
27.45
30.45
33.45
36.45
39.30
42.15
44.85
47.70
50.25
52.80
55.20
57.60
59.85
62.10
64.20
66.30
68.10
70.05
71.70
73.35
75.00
76.35
77.70
79.05
80.25
81.30
82.20
83.10
83.85
84.45
85.05
85.50
85.95
86.10
86.40
86.40
86.40
86.25
86.10
85.65
85.35
84.75
84.15
83.40
82.65
81.75
80.70
79.50
78.30
77.10
75.60
74.10
72.45
70.80
69.00
67.05
65.10
63.15
60.90
58.65
56.40
54.00
51.45
48.90
46.20
43.50
40.65
37.80
34.95
31.95
28.95
25.80
22.65
19.50
16.35
13.05
9.90
7.05
4.20
1.35
-1.50
-4.65
-7.95
-11.25
-15.00
-18.75
-22.35
-25.95
-29.40
-32.85
-36.30
-39.60
-42.75
-45.90
-49.05
-51.90
-54.75
-57.45
-60.15
-62.55
-64.95
-67.20
-69.30
-71.40
-73.20
-75.00
-76.65
-78.15
-79.50
-80.70
-81.90
-82.80
-83.70
-84.45
-85.05
-85.50
-85.80
-85.95
-86.10
-86.10
-85.80
-85.50
-85.05
-84.60
-83.85
-82.95
-82.05
-81.00
-79.65
-78.30
-76.95
-75.30
-73.65
-71.70
-69.75
-67.65
-65.40
-63.15
-60.60
-58.05
-55.35
-52.50
-49.65
-46.65
-43.50
-40.35
-37.05
-33.60
-30.15
-26.70
-23.10
-19.50
-15.90
-12.15
-8.55
-5.25
-1.95
1.35
4.05
6.90
9.75
12.45
15.75
18.90
22.05
25.05
28.20
31.20
34.20
37.05
39.90
42.60
45.30
48.00
50.55
53.10
55.35
57.75
60.00
62.10
64.20
66.15
67.95
69.75
71.40
73.05
74.55
75.90
77.10
78.30
79.50
80.55
81.30
82.20
82.95
83.55
84.00
84.45
84.75
84.90
85.05
85.05
84.90
84.75
84.45
84.15
83.55
83.10
82.35
81.60
80.70
79.65
78.60
77.55
76.20
74.85
73.35
71.85
70.20
68.40
66.60
64.65
62.55
60.45
58.35
55.95
53.70
51.15
48.75
46.05
43.35
40.65
37.80
34.95
32.10
29.10
25.95
22.95
19.80
16.65
13.50
10.20
7.05
4.20
1.50
-1.35
-4.50
-7.80
-11.10
-14.70
-18.30
-21.90
-25.50
-28.95
-32.40
-35.70
-39.00
-42.15
-45.30
-48.30
-51.15
-54.00
-56.70
-59.25
-61.65
-64.05
-66.30
-68.40
-70.35
-72.30
-73.95
-75.60
-77.10
-78.45
-79.65
-80.70
-81.60
-82.50
-83.10
-83.70
-84.15
-84.45
-84.60
-84.75
-84.60
-84.45
-84.15
-83.70
-83.10
-82.35
-81.45
-80.55
-79.35
-78.15
-76.80
-75.30
-73.65
-72.00
-70.05
-68.10
-66.00
-63.75
-61.35
-58.95
-56.40
-53.70
-50.85
-47.85
-44.85
-41.85
-38.70
-35.40
-32.10
-28.65
-25.05
-21.60
-18.00
-14.40
-10.80
-7.05
-3.90
-0.60
2.55
5.40
8.10
10.95
14.10
17.25
20.25
23.40
26.40
29.40
32.40
35.25
38.10
40.95
43.65
46.20
48.75
51.30
53.70
55.95
58.20
60.30
62.40
64.35
66.30
68.10
69.75
71.40
72.90
74.25
75.60
76.80
77.85
78.90
79.80
80.70
81.45
82.05
82.50
82.95
83.25
83.55
83.70
83.70
83.70
83.55
83.25
82.95
82.50
81.90
81.30
80.55
79.65
78.75
77.70
76.50
75.30
73.95
72.45
70.95
69.30
67.65
65.85
63.90
61.95
59.85
57.60
55.35
53.10
50.70
48.15
45.60
42.90
40.20
37.50
34.65
31.80
28.80
25.80
22.80
19.65
16.65
13.50
10.20
7.05
4.35
1.65
-1.20
-4.35
-7.50
-10.80
-14.40
-18.00
-21.45
-25.05
-28.50
-31.80
-35.10
-38.40
-41.55
-44.55
-47.55
-50.40
-53.25
-55.80
-58.35
-60.90
-63.15
-65.40
-67.35
-69.30
-71.25
-72.90
-74.55
-75.90
-77.25
-78.45
-79.50
-80.40
-81.30
-81.90
-82.50
-82.95
-83.25
-83.40
-83.40
-83.25
-83.10
-82.80
-82.35
-81.75
-81.00
-80.10
-79.05
-78.00
-76.65
-75.30
-73.80
-72.15
-70.50
-68.55
-66.60
-64.50
-62.25
-59.85
-57.30
-54.75
-52.05
-49.35
-46.35
-43.35
-40.35
-37.05
-33.90
-30.60
-27.15
-23.70
-20.25
-16.65
-13.05
-9.45
-6.30
-3.15
0.15
2.85
5.55
8.25
10.95
14.10
17.25
20.25
23.40
26.40
29.25
32.25
35.10
37.80
40.50
43.20
45.90
48.30
50.85
53.10
55.35
57.60
59.70
61.80
63.75
65.55
67.35
69.00
70.50
72.00
73.35
74.70
75.90
76.95
77.85
78.75
79.65
80.25
80.85
81.45
81.75
82.05
82.35
82.50
82.50
82.35
82.20
81.90
81.45
81.00
80.40
79.80
78.90
78.15
77.10
76.05
74.85
73.65
72.30
70.80
69.30
67.65
65.85
64.05
62.10
60.15
58.05
55.80
53.55
51.30
48.90
46.35
43.80
41.10
38.40
35.70
32.85
30.00
27.00
24.00
21.00
18.00
14.85
11.70
8.70
6.00
3.30
0.45
-2.25
-5.40
-8.55
-11.70
-15.30
-18.75
-22.20
-25.65
-29.10
-32.40
-35.70
-38.85
-41.85
-44.85
-47.85
-50.55
-53.25
-55.95
-58.35
-60.75
-63.00
-65.10
-67.05
-69.00
-70.80
-72.45
-73.95
-75.30
-76.50
-77.70
-78.75
-79.65
-80.40
-81.00
-81.45
-81.75
-82.05
-82.20
-82.05
-82.05
-81.75
-81.30
-80.70
-80.10];
Look at your data
First of all, look carefully at your input data whenever your algorithm does not work as expected. It may be doing exactly what it was designed to do, just not what you expect. Some of your maxima are not clean local maxima: you have consecutive samples with exactly equal values. I have plotted your data and magnified the first maximum to demonstrate it:
There are four samples, at indices 165 to 168, with identical values. Your algorithm cannot recognize a maximum of this shape.
Solutions
I have three suggestions for you.
Add precision to your data
First, look more closely at your data. It may have more precision if you keep all significant digits, and at full precision your peaks might turn out to be true local maxima.
Don't re-invent the wheel
If you can solve it in MATLAB/Octave, you could just use an existing solution that already handles complicated situations like this:
[J,K]=findpeaks(fun,'DoubleSided')
This will give the expected result:
J =
-1.5603
1.5499
-1.5315
1.5263
-1.5080
1.5027
-1.4844
1.4792
-1.4608
1.4556
-1.4399
1.4347
K =
83
165
249
332
415
499
581
664
745
827
909
991
Use an improved algorithm
If you need to implement this method yourself, you have to adapt your criterion for peak finding. For example, you could use two one-sided criteria and mark the rising, falling and flat areas:
c(i)=1*(fun(i-1) < fun(i)) + -1*(fun(i+1) < fun(i))
In MATLAB/Octave this expression produces 1 for rising parts of the signal, 0 for flat parts and -1 for falling parts. (It is also 0 at a single-sample peak or trough, where the two terms cancel.)
Now you can search this array for some conditions:
If you find a stretch without rise or fall that comes after a rise and before a fall, you have found a maximum. You have also found a maximum if a fall follows a rise immediately.
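The same idea can be written as a single pass that walks across flat tops. This is only a sketch in Python, not a port of findpeaks: the function name is mine, it reports 0-based indices, it looks for positive maxima only, and it compares samples with exact equality, which suits quantized data like yours but would need a tolerance for noisy floating-point input.

```python
def find_peaks_with_plateaus(values):
    """Return (index, value) for each local maximum in values.

    A flat top (several equal samples with a rise before and a fall after)
    counts as one peak and is reported at its first sample. Indices are
    0-based. The first and last samples are never reported as peaks.
    """
    peaks = []
    n = len(values)
    i = 1
    while i < n - 1:
        if values[i - 1] < values[i]:
            # We just climbed onto sample i; walk across any flat run.
            j = i
            while j < n - 1 and values[j + 1] == values[i]:
                j += 1
            # It is a peak only if the signal falls after the flat run.
            if j < n - 1 and values[j + 1] < values[i]:
                peaks.append((i, values[i]))
            i = j + 1
        else:
            i += 1
    return peaks


signal = [0, 1, 2, 2, 2, 1, 0, 3, 0]
print(find_peaks_with_plateaus(signal))  # [(2, 2), (7, 3)]
```

To find the minima as well, run the same function on the negated signal.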

Matlab: parsing large segmented data with empty strings

I have a complex data text file to parse. My first problem is that some of the string values are missing (such as row 5, column 4 in the Data below). I tried using treatAsEmpty with 8 blank spaces, but it didn't work: it keeps moving the B from the 5th row over and not registering the rest of the row. (To be honest, I don't need that column; if you can show me how to ignore it, that would solve this problem.)
textscan(fileName .'%4d %4d %4d %8s \t %1s %2d \b %2s %7s %5d %*[^\n]','delimiter','\r','treatAsEmpty',' ','EmptyValue',-Inf);
Data:
0439 0444 0441 S09E44SF A 13 ES 3.7E-04 10230
0727 0736 0732 S27W23SF A 29 ES 1.2E-03 10226
0937 0945 0942 S29W16SF A 23 ES 8.8E-04 10226
2000 2016 2008 S28W27SF C 23 ES 1.8E-03 10226
2134 2217 2153 B 27 ES 4.8E-02 10229

0032 0042 0037 S25W27SF C 45 ES 2.1E-03 10226
0142 0147 0145 S09E35SF C 14 ES 4.1E-04 10230
0536 0555 0541 S09E33SF C 16 ES 1.6E-03 10230

0214 0312 0252 N23W422F A 11 ES 2.3E-02 10223
My second problem is the blank lines at row 6 and row 10. I need to get rows 1-5 in cells (1x9), rows 7-9 in cells (2x9), row 11 in cells (3x9), etc.

How can I search for different variants of bioinformatics motifs in a string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful for potential helpers on SO to have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates an array of 1,000 random strings, each made of 5,429 two-character pairs. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my @bps;
push(@bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
    0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar @bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my @a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
    my $cand=substr($bps[$i], $offset, $nlen);
    my @a1=split(//, $cand);
    $diffs=0;
    foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
    next if $diffs > 3;
    push (@{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
    my @a1=split(//, $hit);
    $diffs=0;
    my $ds="";
    foreach my $j (0..$#a1) {
        if($a1[$j] eq $a2[$j]) {
            $ds.=" ";
        } else {
            $diffs++;
            $ds.=$a1[$j];
        }
    }
    print "Target: $target\n",
          "Candidate: $hit\n",
          "Differences: $ds $diffs differences\n",
          "Array element: ";
    foreach (@{$HoA{$hit}}) {
        print "$_ " ;
    }
    print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
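The character-by-character comparison in the program counts substitutions only. If insertions and deletions should count as well, one common choice is the Levenshtein edit distance. Below is a minimal Python sketch of both measures, under the assumption that each mismatch, insertion or deletion costs 1; the function names are mine, and this is not how the Perl program above (or any CPAN module) is implemented.

```python
def hamming(a, b):
    """Number of positions at which two equally long strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming() needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: fewest substitutions, insertions and deletions
    (each costing 1) needed to turn a into b."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


target = "CGTCGCACAG"
print(hamming(target, "CATGGCACGG"))       # 3, as in the sample output
print(edit_distance(target, "CGTGCACAG"))  # 1: one base deleted
```

A candidate would then be kept when `edit_distance(target, candidate) <= 3`. Note that with indels the candidate no longer has to be the same length as the target, so a fixed-length substr at the offset is not enough on its own.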
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(@repeat_info = $tandy->FindRepeat())
{print(join("\t",@repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good params, it can find what you are looking for.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).