Bob Jenkins' Hash getting bad performance - hash

I was building a Bloom filter and looked into what hashes to use and the Bob Jenkins' hash seemed like a good choice because of the evenness of the distribution.
I adapted the given C++ code to Go (possibly making a mistake but it seems to work).
I got around to benchmarking the cost of the hash and found that the SHA1 hash in the Go std library was much faster.
PASS
BenchmarkJenkins 1000000 2649 ns/op
BenchmarkSHA256 1000000 1218 ns/op
BenchmarkSHA1 5000000 462 ns/op
Was I misled when I read that you shouldn't use cryptographic hashes in this use case?
Or is the standard library code much more optimized than mine?
package jenkins
import (
"bytes"
"encoding/gob"
)
// adapted from http://bretmulvey.com/hash/7.html
func ComputeHash(key interface{}) (uint64, error) {
var buf bytes.Buffer
enc := gob.NewEncoder(&buf)
err := enc.Encode(key)
if err != nil {
return 0, err
}
data := buf.Bytes()
var a, b, c uint64
a, b = 0x9e3779b9, 0x9e3779b9
c = 0
i := 0
for i = 0; i < len(data)-12; {
a += uint64(data[i]) | uint64(data[i+1]<<8) | uint64(data[i+2]<<16) | uint64(data[i+3]<<24)
i += 4
b += uint64(data[i]) | uint64(data[i+1]<<8) | uint64(data[i+2]<<16) | uint64(data[i+3]<<24)
i += 4
c += uint64(data[i]) | uint64(data[i+1]<<8) | uint64(data[i+2]<<16) | uint64(data[i+3]<<24)
a, b, c = mix(a, b, c)
}
c += uint64(len(data))
if i < len(data) {
a += uint64(data[i])
i++
}
if i < len(data) {
a += uint64(data[i]) << 8
i++
}
if i < len(data) {
a += uint64(data[i]) << 16
i++
}
if i < len(data) {
a += uint64(data[i]) << 24
i++
}
if i < len(data) {
b += uint64(data[i])
i++
}
if i < len(data) {
b += uint64(data[i]) << 8
i++
}
if i < len(data) {
b += uint64(data[i]) << 16
i++
}
if i < len(data) {
b += uint64(data[i]) << 24
i++
}
if i < len(data) {
c += uint64(data[i]) << 8
i++
}
if i < len(data) {
c += uint64(data[i]) << 16
i++
}
if i < len(data) {
c += uint64(data[i]) << 24
i++
}
a, b, c = mix(a, b, c)
return c, nil
}
func mix(a, b, c uint64) (uint64, uint64, uint64) {
a -= b
a -= c
a ^= (c >> 13)
b -= c
b -= a
b ^= (a << 8)
c -= a
c -= b
c ^= (b >> 13)
a -= b
a -= c
a ^= (c >> 12)
b -= c
b -= a
b ^= (a << 16)
c -= a
c -= b
c ^= (b >> 5)
a -= b
a -= c
a ^= (c >> 3)
b -= c
b -= a
b ^= (a << 10)
c -= a
c -= b
c ^= (b >> 15)
return a, b, c
}
EDIT:
Benchmarking code:
package bloom
import (
"testing"
"crypto/sha1"
"crypto/sha256"
)
func BenchmarkJenkins(b *testing.B) {
j := jenkinsHash{}
for i := 0; i < b.N; i++ {
j.ComputeHash(i)
}
}
func BenchmarkSHA1(b *testing.B) {
for i := 0; i < b.N; i++ {
sha1.Sum([]byte{byte(i)})
}
}
func BenchmarkSHA256(b *testing.B) {
for i := 0; i < b.N; i++ {
sha256.Sum256([]byte{byte(i)})
}
}

I'm going to lay bets on optimization; Bob Jenkin's hash should be substantially faster than any crypto-style hash like SHA. I would bet that the standard library is calling out into heavily-optimized C (or even assembly) for that, which is why it beats your unoptimized Go.
There appears to be an efficient Murmur3 available for Go at https://github.com/reusee/mmh3 (I haven't tried it). You might have better luck with that, or by calling out into C/C++ for your Bob Jenkins implementation.

The go sha1 hash is written in assembler and has been heavily optimised (I contributed the ARM version of the code).
Your hash function looks of about equivalent complexity to sha1 to me so I'm not surprised by your run times.
You could try the md5 hash which should do for your purpose and may be faster still (it is also in assembler).
If you only need a short hash result (int64) you could try one of Go's CRC functions.

#JensG was on the right track.
The calls to gob to encode the key in binary made up the vast majority of the cost.
When I transitioned to passing in byte arrays the benchmark started getting the results I was expecting.
Thanks for the help!
BenchmarkJenkins 100000000 20.4 ns/op
BenchmarkSHA1 5000000 463 ns/op
BenchmarkSHA256 1000000 1223 ns/op
Benchmark code:
package bloom
import (
"testing"
"crypto/sha1"
"crypto/sha256"
)
func BenchmarkJenkins(b *testing.B) {
j := jenkinsHash{}
for i := 0; i < b.N; i++ {
j.ComputeHash([]byte{byte(i)})
}
}
func BenchmarkSHA1(b *testing.B) {
for i := 0; i < b.N; i++ {
sha1.Sum([]byte{byte(i)})
}
}
func BenchmarkSHA256(b *testing.B) {
for i := 0; i < b.N; i++ {
sha256.Sum256([]byte{byte(i)})
}
}
Altered code:
package bloom
type jenkinsHash struct {
}
// adapted from http://bretmulvey.com/hash/7.html
func (_ jenkinsHash) ComputeHash(data []byte) (uint64, error) {
var a, b, c uint64
a, b = 0x9e3779b9, 0x9e3779b9
c = 0
i := 0
for i = 0; i < len(data)-12; {
a += uint64(data[i]) | uint64(data[i+1]<<8) | uint64(data[i+2]<<16) | uint64(data[i+3]<<24)
i += 4
b += uint64(data[i]) | uint64(data[i+1]<<8) | uint64(data[i+2]<<16) | uint64(data[i+3]<<24)
i += 4
c += uint64(data[i]) | uint64(data[i+1]<<8) | uint64(data[i+2]<<16) | uint64(data[i+3]<<24)
a, b, c = mix(a, b, c)
}
c += uint64(len(data))
if i < len(data) {
a += uint64(data[i])
i++
}
if i < len(data) {
a += uint64(data[i]) << 8
i++
}
if i < len(data) {
a += uint64(data[i]) << 16
i++
}
if i < len(data) {
a += uint64(data[i]) << 24
i++
}
if i < len(data) {
b += uint64(data[i])
i++
}
if i < len(data) {
b += uint64(data[i]) << 8
i++
}
if i < len(data) {
b += uint64(data[i]) << 16
i++
}
if i < len(data) {
b += uint64(data[i]) << 24
i++
}
if i < len(data) {
c += uint64(data[i]) << 8
i++
}
if i < len(data) {
c += uint64(data[i]) << 16
i++
}
if i < len(data) {
c += uint64(data[i]) << 24
i++
}
a, b, c = mix(a, b, c)
return c, nil
}
func mix(a, b, c uint64) (uint64, uint64, uint64) {
a -= b
a -= c
a ^= (c >> 13)
b -= c
b -= a
b ^= (a << 8)
c -= a
c -= b
c ^= (b >> 13)
a -= b
a -= c
a ^= (c >> 12)
b -= c
b -= a
b ^= (a << 16)
c -= a
c -= b
c ^= (b >> 5)
a -= b
a -= c
a ^= (c >> 3)
b -= c
b -= a
b ^= (a << 10)
c -= a
c -= b
c ^= (b >> 15)
return a, b, c
}

Related

How to write a specification of a method that char array convert to an integer in dafny?

method atoi(a:array<char>) returns(r:int)
requires a.Length>0
requires forall k :: 0<= k <a.Length ==> (a[k] as int) - ('0' as int) <= 9
ensures ??
{
var j:int := 0;
while j < a.Length
invariant ??
{
r := r*10 + (a[j] as int) - ('0' as int);
j := j + 1;
}
}
How to write "ensures" for the atoi method and "invariant" for the while loops in dafny?
I express the idea "each bit of the return value corresponds to each bit of the character array" as following:
// Ten to the NTH power
// e.g.: ten_pos_pow(2) == 10*10 == 100
function ten_pos_pow(p:int):int
requires p>=0
ensures ten_pos_pow(p) >= 1
{
if p==0 then 1 else
10*ten_pos_pow(p-1)
}
// Count from right to left, the ith digit of integer v (i starts from zero)
// e.g.: num_in_int(123,0) == 3 num_in_int(123,1) == 2 num_in_int(123,2) == 1
function num_in_int(v:int,i:int) : int
requires i>=0
{
(v % ten_pos_pow(i+1))/ten_pos_pow(i)
}
method atoi(a:array<char>) returns(r:int)
requires a.Length>0
requires forall k :: 0<= k <a.Length ==> (a[k] as int) - ('0' as int) <= 9
ensures forall k :: 0<= k < a.Length ==> ((a[k] as int) - ('0' as int)) == num_in_int(r,a.Length-k-1)
{
var i:int := 0;
r := 0;
while i < a.Length
invariant 0<= i <= a.Length
invariant forall k :: 0<= k < i ==> ((a[k] as int) - ('0' as int)) == num_in_int(r,i-k-1) // loop invariant violation
{
r := r*10 + (a[i] as int) - ('0' as int);
i := i + 1;
}
}
But the loops invariant violation. How to write a correct and provable specification?

Check a sequence inside another sequence in matlab

I have some sequences:
{B < A < C < D < E},
{A < B < D < E < C}
How to check B < E in the above strings or not?
Thanks!
You can use regexp:
inputs = {'{B < A < C < D < E}', '{A < B < D < E < C}' };
res = regexp( inputs, 'B.*<.*E' )
Results with
res =
[2] [6]
That is res{ii} is the first location of B in the input string ii. If res{ii} is empty it means B < E is not in the ii-th string.

Scala Branch And Bound Motif Search

Below code searches for a motif (of length 8) in a sequence(String) and, as the result, it has to give back sequence with the best score. The problem is, although the code produces no errors, there is no output at all (probably infinite cycle, I observe blank console).
I am gonna give all my code online and if that is required. In order to reproduce the problem, just pass a number (between 0 and 3 - you can give 4 sequence, so you must choose 1 of them 0 is the first , 1 is the second etc) as args(0) (e.g. "0"), expected output should look something like "Motif = ctgatgta"
import scala.util.control._
object BranchAndBound {
var seq: Array[String] = new Array[String](20)
var startPos: Array[Int] = new Array[Int](20)
var pickup: Array[String] = new Array[String](20)
var bestMotif: Array[Int] = new Array[Int](20)
var ScoreMatrix = Array.ofDim[Int](5, 20)
var i: Int = _
var j: Int = _
var lmer: Int = _
var t: Int = _
def main(args: Array[String]) {
var t1: Long = 0
var t2: Long = 0
t1 = 0
t2 = 0
t1 = System.currentTimeMillis()
val seq0 = Array(
Array(
" >5 regulatory reagions with 69 bp",
" cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat",
" agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
" aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
" agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca",
" ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc"),
Array(
" 2 columns mutants",
" cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat",
" agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
" aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaattttt",
" agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca",
" ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc"),
Array(
" 2 columns mutants",
" cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat",
" agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
" aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaattttt",
" agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca",
" ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc"),
Array(
" 2 columns mutants",
" cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat",
" agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
" aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaattttt",
" agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca",
" ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc"))
var k: Int = 0
var m: Int = 0
var n: Int = 0
var bestScore: Int = 0
var optScore: Int = 0
var get: Int = 0
var ok1: Boolean = false
var ok3: Boolean = false
ok1 = false
ok3 = false
j = 1
lmer = 8
m = 1
t = 5
n = 69
optScore = 0
bestScore = 0
k = java.lang.Integer.parseInt(args(0))
j = 1
while (j <= t) {
seq(j) = new String()
i = 0
while (i < n) {
seq(j) += seq0(k)(j).charAt(i)
i += 1
}
j += 1
}
j = 1
while (j <= t) {
newPickup(1, j)
j += 1
}
j = 0
bestScore = 0
i = 1
val whilebreaker = new Breaks
whilebreaker.breakable {
while (i > 0) {
if (i < t) {
if (startPos(1) == (n - lmer)) whilebreaker.break
val sc = Score()
optScore = sc + (t - i) * lmer
if (optScore < bestScore) {
ok1 = false
j = i
val whilebreak1 = new Breaks
whilebreak1.breakable {
while (j >= 1) {
if (startPos(j) < n - lmer) {
ok1 = true
newPickup(0, j)
whilebreak1.break
} else {
ok1 = true
newPickup(1, j)
val whilebreak2 = new Breaks
whilebreak2.breakable {
while (startPos(i - 1) == (n - lmer)) {
newPickup(1, i - 1)
i -= 1
if (i == 0) whilebreak2.break
}
}
if (i > 1) {
newPickup(0, i - 1)
i -= 1
}
whilebreak1.break
}
}
}
if (ok1 == false) i = 0
} else {
newPickup(1, i + 1)
i += 1
}
} else {
get = Score()
if (get > bestScore) {
bestScore = get
m = 1
while (m <= t) {
bestMotif(m) = startPos(m)
m += 1
}
}
ok3 = false
j = t
val whilebreak3 = new Breaks
whilebreak3.breakable {
while (j >= 1) {
if (startPos(j) < n - lmer) {
ok3 = true
newPickup(0, j)
whilebreak3.break
} else {
ok3 = true
newPickup(1, j)
val whilebreak4 = new Breaks
whilebreak4.breakable {
while (startPos(i - 1) == (n - lmer)) {
newPickup(1, i - 1)
i -= 1
if (i == 0) whilebreak4.break
}
}
if (i > 1) {
newPickup(0, i - 1)
i -= 1
}
whilebreak3.break
}
}
}
if (ok3 == false) i = 0
}
}
}
println("Motiv: " + Consensus())
// println()
j = 1
while (j <= t) {
t2 = System.currentTimeMillis()
j += 1
}
println("time= " + (t2 - t1) + " ms")
}
def Score(): Int = {
var j: Int = 0
var k: Int = 0
var m: Int = 0
var max: Int = 0
var sum: Int = 0
sum = 0
max = 0
m = 1
while (m <= lmer) {
k = 1
while (k <= 4) {
ScoreMatrix(k)(m) = 0
k += 1
}
m += 1
}
m = 1
while (m <= lmer) {
k = 1
while (k <= i) pickup(k).charAt(m) match {
case 'a' => ScoreMatrix(1)(m) += 1
case 'c' => ScoreMatrix(2)(m) += 1
case 'g' => ScoreMatrix(3)(m) += 1
case 't' => ScoreMatrix(4)(m) += 1
}
m += 1
}
j = 1
while (j <= lmer) {
max = 0
m = 1
while (m <= 4) {
if (ScoreMatrix(m)(j) > max) {
max = ScoreMatrix(m)(j)
}
m += 1
}
sum += max
j += 1
}
sum
}
def Consensus(): String = {
var i: Int = 0
var j: Int = 0
var k: Int = 0
var m: Int = 0
var max: Int = 0
var imax: Int = 0
var str: String = null
i = 1
while (i <= t) {
pickup(i) = " " +
seq(i).substring(bestMotif(i), bestMotif(i) + lmer)
i += 1
}
m = 1
while (m <= lmer) {
k = 1
while (k <= 4) {
ScoreMatrix(k)(m) = 0
k += 1
}
m += 1
}
m = 1
while (m <= lmer) {
k = 1
while (k <= t) pickup(k).charAt(m) match {
case 'a' => ScoreMatrix(1)(m) += 1
case 'c' => ScoreMatrix(2)(m) += 1
case 'g' => ScoreMatrix(3)(m) += 1
case 't' => ScoreMatrix(4)(m) += 1
}
m += 1
}
str = ""
imax = 0
j = 1
while (j <= lmer) {
max = 0
i = 1
while (i <= 4) {
if (ScoreMatrix(i)(j) > max) {
max = ScoreMatrix(i)(j)
imax = i
}
i += 1
}
imax match {
case 1 => str += 'a'
case 2 => str += 'c'
case 3 => str += 'g'
case 4 => str += 't'
}
j += 1
}
str
}
def newPickup(one: Int, h: Int) {
if (one == 1) startPos(h) = 1 else startPos(h) += 1
pickup(h) = " " + seq(h).substring(startPos(h), startPos(h) + lmer)
}
}
and thanks, i hope someone gonna find my failure.
Your current implementation 'hangs' on this loop:
while (k <= i) pickup(k).charAt(m) match {
case 'a' => ScoreMatrix(1)(m) += 1
case 'c' => ScoreMatrix(2)(m) += 1
case 'g' => ScoreMatrix(3)(m) += 1
case 't' => ScoreMatrix(4)(m) += 1
}
As it stands, the exit condition is never fulfilled because the relation between k and i never changes. Either increment k or decrement i.
It looks like programming is not the key aspect of this work, but increased modularity should help contain complexity.
Also, I wonder about the choice of using Scala. There're many areas in this algorithm that would benefit of a more functional approach. In this translation, using Scala in an imperative way gets cumbersome. If you have the opportunity, I'd recommend you to explore a more functional approach to solve this problem.
A tip: The intellij debugger didn't have issues with this code.

Convert arbitrary Golang interface to byte array

I'm trying to write a hash that will accept all datatypes. Once in the function, I handle the data as a byte array. I'm having trouble figuring out how to cast an arbitrary interface{} to a byte array.
I tried using the binary package but it seemed to depend on the type of data passed in. One of the parameters of the Write() fn (docs) required knowing the byte order of the parameter.
All datatype sizes are some multiple of a byte (even the bool), so this should be simple in theory.
Code in question below,
package bloom
import (
"encoding/gob"
"bytes"
)
// adapted from http://bretmulvey.com/hash/7.html
func ComputeHash(key interface{}) (uint, error) {
var buf bytes.Buffer
enc := gob.NewEncoder(&buf)
err := enc.Encode(key)
if err != nil {
return 0, err
}
data := buf.Bytes()
var a, b, c uint
a, b = 0x9e3779b9, 0x9e3779b9
c = 0;
i := 0;
for i = 0; i < len(data)-12; {
a += uint(data[i+1] | data[i+2] << 8 | data[i+3] << 16 | data[i+4] << 24)
i += 4
b += uint(data[i+1] | data[i+2] << 8 | data[i+3] << 16 | data[i+4] << 24)
i += 4
c += uint(data[i+1] | data[i+2] << 8 | data[i+3] << 16 | data[i+4] << 24)
a, b, c = mix(a, b, c);
}
c += uint(len(data))
if i < len(data) {
a += uint(data[i])
i++
}
if i < len(data) {
a += uint(data[i] << 8)
i++
}
if i < len(data) {
a += uint(data[i] << 16)
i++
}
if i < len(data) {
a += uint(data[i] << 24)
i++
}
if i < len(data) {
b += uint(data[i])
i++
}
if i < len(data) {
b += uint(data[i] << 8)
i++
}
if i < len(data) {
b += uint(data[i] << 16)
i++
}
if i < len(data) {
b += uint(data[i] << 24)
i++
}
if i < len(data) {
c += uint(data[i] << 8)
i++
}
if i < len(data) {
c += uint(data[i] << 16)
i++
}
if i < len(data) {
c += uint(data[i] << 24)
i++
}
a, b, c = mix(a, b, c)
return c, nil
}
func mix(a, b, c uint) (uint, uint, uint){
a -= b; a -= c; a ^= (c>>13);
b -= c; b -= a; b ^= (a<<8);
c -= a; c -= b; c ^= (b>>13);
a -= b; a -= c; a ^= (c>>12);
b -= c; b -= a; b ^= (a<<16);
c -= a; c -= b; c ^= (b>>5);
a -= b; a -= c; a ^= (c>>3);
b -= c; b -= a; b ^= (a<<10);
c -= a; c -= b; c ^= (b>>15);
return a, b, c
}
Other problems in my code led me away from the gob package earlier, turns out it was the proper way as #nvcnvn suggested. Relevant code on how to solve this issue below:
package bloom
import (
"encoding/gob"
"bytes"
)
func GetBytes(key interface{}) ([]byte, error) {
var buf bytes.Buffer
enc := gob.NewEncoder(&buf)
err := enc.Encode(key)
if err != nil {
return nil, err
}
return buf.Bytes(), nil
}
Another way to convert interface{} to []bytes is to use a fmt package.
/*
* Convert variable `key` from interface{} to []byte
*/
byteKey := []byte(fmt.Sprintf("%v", key.(interface{})))
fmt.Sprintf converts interface value to string.
[]byte converts string value to byte.
※ Note ※ This method does not work if interface{} value is a pointer. Please find #PassKit's comment below.

Scala constructing a "List", read from stdin, output to stdout

I'm trying to read formatted inputs from stdin using Scala:
The equivalent C++ code is here:
int main() {
int t, n, m, p;
cin >> t;
for (int i = 0; i < t; ++i) {
cin >> n >> m >> p;
vector<Player> players;
for (int j = 0; j < n; ++j) {
Player player;
cin >> player.name >> player.pct >> player.height;
players.push_back(player);
}
vector<Player> ret = Solve(players, n, m, p);
cout << "Case #" << i + 1 << ": ";
for (auto &item : ret) cout << item.name << " ";
cout << endl;
}
return 0;
}
Where in the Scala code, I'd like to Use
players: List[Player], n: Int, m: Int, p: Int
to store these data.
Could someone provide a sample code?
Or, just let me know how to:
how the "main()" function work in scala
read formatted text from stdin
efficiently constructing a list from inputs (as list are immutable, perhaps there's a more efficient way to construct it? rather than having a new list as each element comes in?)
output formatted text to stdout
Thanks!!!
I don't know C++, but something like this should work :
def main(args: Array[String]) = {
val lines = io.Source.stdin.getLines
val t = lines.next.toInt
// 1 to t because of ++i
// 0 until t for i++
for (i <- 1 to t) {
// assuming n,m and p are all on the same line
val Array(n,m,p) = lines.next.split(' ').map(_.toInt)
// or (0 until n).toList if you prefer
// not sure about the difference performance-wise
val players = List.range(0,n).map { j =>
val Array(name,pct,height) = lines.next.split(' ')
Player(name, pct.toInt, height.toInt)
}
val ret = solve(players,n,m,p)
print(s"Case #${i+1} : ")
ret.foreach(player => print(player.name+" "))
println
}
}