Numba signature for structured arrays

Numba's documentation does not give any example of signatures for functions that take structured arrays. I have tried several ways, but all were rejected by Numba (and Pylance).
import numba as nb
import numpy as np
PairSpec = [("x", np.float32), ("y", np.float32)]
Pair = np.dtype(PairSpec)
NumbaPair = nb.from_dtype(Pair)
# BUG: none of these work
# @nb.jit(np.float32(Pair[:]))
# @nb.jit(np.float32(NumbaPair[:]))
@nb.jit
def sum(pairs):
    pair = pairs[0]
    return pair.x + pair.y
pairs = np.array([(2, 3)], dtype=PairSpec)
print(sum(pairs))
How do I give a signature to a function that takes structured arrays?

The correct signature is nb.float32(NumbaPair[:]). Note the use of nb.float32 and not np.float32. Also please note that arrays of structures (AoS) generally tend to be less efficient than structures of arrays (SoA). This is especially true for coordinates, since most fields are generally read and an AoS layout prevents any efficient vectorization (modern x86-64 processors can typically compute ~16 float32 values per cycle and per core, as opposed to 2 for scalar values).
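For reference, a minimal sketch applying that signature to the question's function (reusing the names defined in the question; illustrative only):

@nb.jit(nb.float32(NumbaPair[:]))
def sum_pairs(pairs):
    # pairs is typed as a 1-D array of the NumbaPair record dtype
    pair = pairs[0]
    return pair.x + pair.y

print(sum_pairs(pairs))  # expected: 5.0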

Related

numba error with tuple sorting containing numpy arrays

I have a (working) function that uses the heapq module to build a priority queue of tuples, and I would like to compile it with numba; however, I get a very long and unclear error. It seems to boil down to a problem with the tuple-order comparison needed for the queue. The tuples have a fixed format: the first item is a floating-point number (whose order I care about), followed by a numpy array, which I need for computation but which never gets compared when running normally. This is intended, because comparing numpy arrays yields an array, which cannot be used in conditionals and raises an exception. However, I guess numba needs a scalar-yielding comparison to be defined for all items in the tuple, and hence the numba error.
I have a very minimal example:
import numba
import numpy

@numba.njit
def f():
    return 1 if (1, numpy.arange(3)) < (2, numpy.arange(3)) else 2

f()
where the numba compilation fails (without numba it works since it never needs to actually compare the arrays, as in the original code).
Here is a slightly less minimal but maybe clearer example, which shows what I am actually doing:
from heapq import heappush
import numpy
import numba
@numba.njit
def f(n):
    heap = [(1, 0, numpy.random.rand(2, 3))]
    for unique_id in range(n):
        order = numpy.random.rand()
        data = numpy.random.rand(2, 3)
        heappush(heap, (order, unique_id, data))
    return heap[0]
f(100)
Here order is the value whose ordering I care about in the queue; unique_id is a tie-breaker that prevents the comparison from falling through to data (and raising an exception) when two order values are equal.
I tried to bypass the problem by converting the numpy array to a list inside the tuple and back to an array for computation, but while this compiles, the numba version is slower than the interpreted one, even though the array is quite small (usually 2x3). Without converting I would need to rewrite the code as loops, which I would prefer to avoid (but is doable).
Is there a better alternative to get this working with numba, hopefully running faster than the python interpreter?
I'll try to respond based on the minimal example you provided.
I think the problem here is not related to numba's ability to compare all the elements of the tuple, but rather to where to store the result of such a comparison. This is stated in the error log returned when trying to execute your example:
cannot store {i8*, i8*, i64, i64, i8*, [1 x i64], [1 x i64]} to i1*: mismatching types
Basically, you are trying to store the result of a comparison between a pair of floats and a pair of arrays into a single boolean, and numba doesn't know how to do that.
If you are only interested in comparing the first elements of the tuples, the quickest workaround I can think of is forcing the comparison to happen only on the first elements, e.g.
@numba.njit
def f():
    return 1 if (1, numpy.arange(3))[0] < (2, numpy.arange(3))[0] else 2

f()
If this is not applicable to your use case, please provide more details about it.
EDIT
According to the further information you provided, I think the best way to solve this is avoiding pushing the numpy arrays to the heap. Since you're only interested in the ordering properties of the heap, you can just push the keys to the heap and store the corresponding numpy arrays in a separate dictionary, using as keys the same values you push in the heap.
As a side note, when you use standard-library functions in nopython-jitted functions, you are relying on numba's own re-implementations of those functions rather than the "original" Python ones. A comprehensive list of the Python features available in numba can be found here.
Ok, I found a solution to the problem: since storing the array in the heap tuple is the cause of the numba error, it is enough to store it in a separate dictionary with a unique key and store only the key in the heap tuple. For instance, using an integer as the key:
from heapq import heappush
import numpy
import numba
@numba.njit
def f(n):
    key = 0
    array_storage = {key: numpy.random.rand(2, 3)}
    heap = [(1.0, key)]
    for _ in range(n):
        order = numpy.random.rand()
        data = numpy.random.rand(2, 3)
        key += 1
        heappush(heap, (order, key))
        array_storage[key] = data
    return heap[0]
f(100)
Now the tuples in the heap can be compared, yielding a boolean value, and I still get to associate the data with its tuple. I am not completely satisfied, since it feels like a workaround, but it works pretty well and is not overly complicated. If anyone has a better one, please let me know!

Doc2Vec Clustering with kmeans for a new document

I have a corpus trained with Doc2Vec as follows:
d2vmodel = Doc2Vec(vector_size=100, min_count=5, epochs=10)
d2vmodel.build_vocab(train_corpus)
d2vmodel.train(train_corpus, total_examples=d2vmodel.corpus_count, epochs=d2vmodel.epochs)
Using the vectors, the documents are clustered with kmeans:
kmeans_model = KMeans(n_clusters=NUM_CLUSTERS, init='k-means++', random_state = 42)
X = kmeans_model.fit(d2vmodel.docvecs.vectors_docs)
labels=kmeans_model.labels_.tolist()
I would like to use the k-means to cluster a new document and know which cluster it belongs to. I've tried the following but I don't think the input for predict is correct.
from numpy import array
testdocument = gensim.utils.simple_preprocess('Microsoft excel')
cluster_label = kmeans_model.predict(array(testdocument))
Any help is appreciated!
Your kmeans_model expects a feature vector similar to those it was given during its original clustering – not the list of string tokens you'll get back from gensim.utils.simple_preprocess().
In fact, you want to use the Doc2Vec model to take such lists-of-tokens and turn them into model-compatible vectors, via its infer_vector() method. For example:
testdoc_words = gensim.utils.simple_preprocess('Microsoft excel')
testdoc_vector = d2vmodel.infer_vector(testdoc_words)
cluster_label = kmeans_model.predict(array([testdoc_vector]))  # predict expects a 2-D array of samples
Note that both Doc2Vec training and inference work better on documents that are at least tens of words long (not tiny 2-word phrases like your test here), and that inference may also often benefit from a larger-than-default optional epochs parameter (especially on short documents).
Note also that your test documents should really be preprocessed and tokenized exactly the same way as your training data – so if some other process was used for preparing train_corpus, use that same process for post-training documents. (Words not recognized by the Doc2Vec model, because they weren't present during training, will be silently ignored – so an error like doing a different style of case-flattening at inference time will weaken results a lot.)
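For example, a hedged sketch of passing a larger number of inference passes (in gensim 4.x the keyword is epochs; older 3.x releases call it steps, so check your version; 50 is just an illustrative value):

testdoc_vector = d2vmodel.infer_vector(testdoc_words, epochs=50)   # more passes than the default
cluster_label = kmeans_model.predict(array([testdoc_vector]))      # predict expects a 2-D array of samples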

Access package contents from string argument in Modelica

I have a string vector with the names of some substances vec = {"H2","O2"}, and I would like to use these strings to access a record in a package such that
Modelica.Media.IdealGases.Common.SingleGasesData.'vec[1]'
returns the data of H2.
Is this possible in Modelica, or do I have to do it manually?
I ended up doing it manually:
import d = Modelica.Media.IdealGases.Common.SingleGasesData;
constant Modelica.Media.IdealGases.Common.DataRecord data[Species]={d.H2,d.O2};
It might be slow and requires some index tracking, but for small sizes it is doable.

Restriction on Range

I'm surprised. Why is the implementation of Range restricted so that its size is limited by Int.MaxValue?
Thanks.
From the NumericRange docs,
NumericRange is a more generic version of the Range class which works
with arbitrary types. It must be supplied with an Integral
implementation of the range type.
Factories for likely types include Range.BigInt, Range.Long, and
Range.BigDecimal. Range.Int exists for completeness, but the Int-based
scala.Range should be more performant.
val r1 = new Range(0, 100, 1)
val veryBig = Int.MaxValue.toLong + 1
val r2 = Range.Long(veryBig, veryBig + 100, 1)
assert(r1 sameElements r2.map(_ - veryBig))
In my opinion the other answer is just wrong.
It demonstrates that you can use other number types, but this doesn't change the fact that a Range can only hold 2³¹ elements, like every other collection in Scala/Java.
As far as I know, there is no real rationale behind this design decision. Having 64-bit collections would certainly be nice, and support for arrays with 64-bit indices is a common request for Java, but it is hard to integrate that into the existing language/collection framework. Some people say that the JVM is limited to a total of 4 billion objects, but I couldn't verify that.

What is the default hash code that Mathematica uses?

The online documentation says
Hash[expr]
gives an integer hash code for the expression expr.
Hash[expr,"type"]
gives an integer hash code of the specified type for expr.
It also gives "possible hash code types":
"Adler32" Adler 32-bit cyclic redundancy check
"CRC32" 32-bit cyclic redundancy check
"MD2" 128-bit MD2 code
"MD5" 128-bit MD5 code
"SHA" 160-bit SHA-1 code
"SHA256" 256-bit SHA code
"SHA384" 384-bit SHA code
"SHA512" 512-bit SHA code
Yet none of these correspond to the default returned by Hash[expr].
So my questions are:
What method does the default Hash use?
Are there any other hash codes built in?
The default hash algorithm is, more or less, a basic 32-bit hash function applied to the underlying expression representation, but the exact code is a proprietary component of the Mathematica kernel. It is subject to change (and has changed) between Mathematica versions, and it lacks a number of desirable cryptographic properties, so I personally recommend you use MD5 or one of the SHA variants for any serious application where security matters. The built-in hash is intended for typical data-structure use (e.g. in a hash table).
The named hash algorithms you list from the documentation are the only ones currently available. Are you looking for a different one in particular?
I've been doing some reverse engineering on the 32- and 64-bit Windows versions of Mathematica 10.4, and this is what I found:
32 BIT
It uses a Fowler–Noll–Vo hash function (FNV-1, with the multiplication applied first) with 16777619 as the FNV prime and 84696351 as the offset basis. This function is applied to the Murmur3-32 hash of the address of the expression's data (MMA uses a pointer in order to keep a single instance of each datum). The address is eventually resolved to the value - for simple machine integers the value is immediate, for others it is a bit trickier. The function implementing Murmur3-32 in fact takes an additional parameter (defaulted to 4, the special case in which it behaves as described on Wikipedia) which selects how much of the input expression struct to read. Since a normal expression is internally represented as an array of pointers, one can take the first, the second, etc. by repeatedly adding 4 (bytes = 32 bits) to the base pointer of the expression. So, passing 8 to the function gives the second pointer, 12 the third, and so on. Since the internal structs (big integers, machine integers, machine reals, big reals and so on) have different member variables (e.g. a machine integer has only a pointer to int, a complex has two pointers to numbers, etc.), for each expression struct there is a "wrapper" that combines its internal members into a single 32-bit hash (basically with FNV-1 rounds). The simplest expression to hash is an integer.
The murmur3_32() function has 1131470165 as seed, n=0 and other params as in Wikipedia.
So we have:
hash_of_number = 16777619 * (84696351 ^ murmur3_32( &number ))
with " ^ " meaning XOR.
I really didn't try it - pointers are encoded using the WINAPI EncodePointer(), so they can't be exploited at runtime. (It may be worth running it in Linux under Wine with a modified version of EncodePointer?)
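For illustration only, a small Python sketch of the 32-bit round described above; the constants are the ones quoted in this answer, and murmur_value stands in for the Murmur3-32 hash of the expression pointer, which would have to be computed separately:

FNV_PRIME_32 = 16777619       # FNV prime quoted above
OFFSET_BASIS_32 = 84696351    # offset basis quoted above

def mma_hash_32(murmur_value):
    # hash_of_number = prime * (basis ^ murmur3_32(&number)), truncated to 32 bits
    return (FNV_PRIME_32 * (OFFSET_BASIS_32 ^ murmur_value)) & 0xFFFFFFFF

print(hex(mma_hash_32(0x12345678)))  # 0x12345678 is just a placeholder murmur value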
64 BIT
It uses an FNV-1 64-bit hash function with 0xAF63BD4C8601B7DF as the offset basis and 0x100000001B3 as the FNV prime, along with a SIP64-24 hash (here's the reference code) that uses the first 64 bits of 0x0AE3F68FE7126BBF76F98EF7F39DE1521 as k0 and the last 64 bits as k1. The function is applied to the base pointer of the expression and resolved internally. As with the 32-bit murmur3, there is an additional parameter (defaulted to 8) to select how many pointers to take from the input expression struct. For each expression type there is a wrapper that condenses the struct members into a single hash by means of FNV-1 64-bit rounds.
For a machine integer, we have:
hash_number_64bit = 0x100000001B3 * (0xAF63BD4C8601B7DF ^ SIP64_24( &number ))
Again, I didn't really try it. Could anyone try?
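Likewise, a minimal Python sketch of the 64-bit round, with the quoted constants and sip_value standing in for the SIP64-24 output of the expression pointer (again, illustrative only):

FNV_PRIME_64 = 0x100000001B3          # FNV prime quoted above
OFFSET_BASIS_64 = 0xAF63BD4C8601B7DF  # offset basis quoted above

def mma_hash_64(sip_value):
    # hash_number_64bit = prime * (basis ^ SIP64_24(&number)), truncated to 64 bits
    return (FNV_PRIME_64 * (OFFSET_BASIS_64 ^ sip_value)) & 0xFFFFFFFFFFFFFFFF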
Not for the faint-hearted
If you take a look at their notes on internal implementation, they say that "Each expression contains a special form of hash code that is used both in pattern matching and evaluation."
The hash code they're referring to is the one generated by these functions - at some point in the normal expression wrapper function there's an assignment that puts the computed hash inside the expression struct itself.
It would certainly be cool to understand HOW they can make use of these hashes for pattern matching purpose. So I had a try running through the bigInteger wrapper to see what happens - that's the simplest compound expression.
It begins checking something that returns 1 - dunno what.
So it executes
var1 = 16777619 * (67918732 ^ hashMachineInteger(1));
where hashMachineInteger() is what we described before - including the constant values.
Then it reads the length in bytes of the bigInt from the struct (bignum_length) and runs
result = 16777619 * (v10 ^ murmur3_32(v6, 4 * v4));
Note that murmur3_32() is called if 4 * bignum_length is greater than 8 (may be related to the max value of machine integers $MaxMachineNumber 2^32^32 and by converse to what a bigInt is supposed to be).
So, the final code is
if (bignum_length > 8){
    result = 16777619 * (16777619 * (67918732 ^ (16777619 * (84696351 ^ murmur3_32( 1, 4 )))) ^ murmur3_32( &bignum, 4 * bignum_length ));
}
I've made some hypotheses on the properties of this construction. The presence of many XORs and the fact that 16777619 + 67918732 = 84696351 may make one think that some sort of cyclic structure is exploited to check patterns - i.e. subtracting the offset and dividing by the prime, or something like that. The software Cassandra uses the Murmur hash algorithm for token generation - see these images for what I mean by "cyclic structure". Maybe different primes are used for each expression - I still have to check.
Hope it helps
It seems that Hash calls the internal Data`HashCode function, then divides it by 2, takes the first 20 digits of N[..] and then the IntegerPart, plus one, that is:
IntegerPart[N[Data`HashCode[expr]/2, 20]] + 1