How should I use an array (or tuple) as key and values for a Numba typed dictionary? - numba

I have the following code that attempts to store a key-value pair in a Numba typed dictionary. The official Numba documentation says that the new typed dictionary supports arrays as keys, but I could not get it to work.
The error message says that the key cannot be hashed.
Any idea how to get this working?
In [7]: from numba.typed import Dict
...: from numba import types
...: import numpy as np
In [15]: dd = Dict.empty(key_type=types.int32[::1], value_type=types.int32[::1],)
In [16]: key = np.asarray([1,2,3], dtype=np.int32)
In [17]: dd[key] = key
Error Message:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'hash' of type array(int32, 1d, C)
EDIT:
I am probably missing something. I can use types.UniTuple in the interpreter (without the @jit decorator). However, when I put the following function into a script a.py and run it with the command "python a.py", I get an error saying UniTuple is not found.
@jit(nopython=True)
def go_fast2(date, starttime, id, tt, result):  # Function is compiled and runs in machine code
    prev_record = Dict.empty(key_type=types.UniTuple(types.int64, 2), value_type=types.UniTuple(types.int64, 3),)
    for i in range(1, length):
        key = np.asarray([date[i], id[i]], dtype=np.int64)
        thistt = tt[i]
        thistime = starttime[i]
        if key in prev_record:
            prev_time = prev_record[key][0]
            prev_tt = prev_record[key][1]
            prev_res = prev_record[key][2]
            if thistt == prev_tt and thistime - prev_time <= 30 * 1000 * 1000:  # within a 10 second window
                result[i] = prev_res + 1
            else:
                result[i] = 0
            prev_record[key] = np.asarray((thistime, thistt, result[i]), dtype=np.int64)
        else:
            result[i] = 0
            prev_record[key] = np.asarray((thistime, thistt, result[i]), dtype=np.int64)
    return

The current documentation says that:
Acceptable key/value types include but are not limited to: unicode
strings, arrays, scalars, tuples.
The wording does make it seem like you might be able to use an array as a key type, but that is incorrect: an array is not hashable because it is mutable, so it wouldn't work with a standard Python dict either. You can, however, convert the array to a tuple, and that works:
dd = Dict.empty(
    key_type=types.UniTuple(types.int64, 3),
    value_type=types.int64[::1],
)
key = np.asarray([1, 2, 3], dtype=np.int64)
dd[tuple(key)] = key
Note that the int32 dtype you were using previously won't work on 64-bit machines since the tuple of int32s will be automatically converted to int64 when calling tuple() on the array.
The other issue is that a tuple has a fixed size, so you couldn't use arrays of arbitrary size as the key.
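If you need tuple keys inside a jitted function (as in the edited question), here is a minimal sketch. It is my own illustration rather than part of the answer; the function name count_pairs and the two-element key width are just assumptions. The typed dictionary is created outside the compiled function and passed in, and the key is built as a plain tuple instead of an array:
import numpy as np
from numba import njit, types
from numba.typed import Dict

# Typed dictionary with a fixed-width tuple key, created outside the
# jitted function and passed in as an argument.
pair_counts = Dict.empty(
    key_type=types.UniTuple(types.int64, 2),
    value_type=types.int64,
)

@njit
def count_pairs(date, ids, counts):
    for i in range(date.shape[0]):
        # Build the key as a plain tuple instead of np.asarray(...),
        # since arrays are not hashable.
        key = (date[i], ids[i])
        if key in counts:
            counts[key] += 1
        else:
            counts[key] = 1

date = np.array([20200101, 20200101, 20200102], dtype=np.int64)
ids = np.array([7, 7, 9], dtype=np.int64)
count_pairs(date, ids, pair_counts)
print(pair_counts)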

Related

Passing Argument to a Generator to build a tf.data.Dataset

I am trying to build a tensorflow dataset from a generator. I have a list of tuples called some_list, where each tuple has an integer and some text.
When I do not pass some_list as an argument to the generator, the code works fine:
import tensorflow as tf
import random
import numpy as np

some_list = [(1, 'One'), [2, 'Two'], [3, 'Three'], [4, 'Four'],
             (5, 'Five'), [6, 'Six'], [7, 'Seven'], [8, 'Eight']]

def text_gen1():
    random.shuffle(some_list)
    size = len(some_list)
    i = 0
    while True:
        yield some_list[i][0], some_list[i][1]
        i += 1
        if i > size:
            i = 0
            random.shuffle(some_list)

# Not passing any argument
tf_dataset1 = tf.data.Dataset.from_generator(text_gen1, output_types=(tf.int32, tf.string),
                                              output_shapes=((), ()))
for count_batch in tf_dataset1.repeat().batch(3).take(2):
    print(count_batch)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([7, 1, 2])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Seven', b'One', b'Two'], dtype=object)>)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([3, 5, 4])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Three', b'Five', b'Four'], dtype=object)>)
However, when I try to pass some_list as an argument, the code fails
def text_gen2(file_list):
    random.shuffle(file_list)
    size = len(file_list)
    i = 0
    while True:
        yield file_list[i][0], file_list[i][1]
        i += 1
        if i > size:
            i = 0
            random.shuffle(file_list)

tf_dataset2 = tf.data.Dataset.from_generator(text_gen2, args=[some_list],
                                              output_types=(tf.int32, tf.string),
                                              output_shapes=((), ()))
for count_batch in tf_dataset2.repeat().batch(3).take(2):
    print(count_batch)
ValueError: Can't convert Python sequence with mixed types to Tensor.
I noticed that when I try to pass a list of integers as an argument, the code works. However, a list of tuples seems to make it crash. Can someone shed some light on this?
The problem is exactly what the error says: you cannot have heterogeneous data types (int and str) in the same tf.Tensor. I made a few changes and came up with the code below.
Separate your some_list into two lists using zip(), i.e. int_list and str_list, and make your generator function accept two lists.
I don't understand why you're manually shuffling within the generator. You can do it in a cleaner way using tf.data.Dataset.shuffle()
import tensorflow as tf
import random
import numpy as np

some_list = [(1, 'One'), [2, 'Two'], [3, 'Three'], [4, 'Four'],
             (5, 'Five'), [6, 'Six'], [7, 'Seven'], [8, 'Eight']]

def text_gen2(int_list, str_list):
    for x, y in zip(int_list, str_list):
        yield x, y

tf_dataset2 = tf.data.Dataset.from_generator(
    text_gen2,
    args=list(zip(*some_list)),
    output_types=(tf.int32, tf.string), output_shapes=((), ())
)

i = 0
for count_batch in tf_dataset2.repeat().batch(4).shuffle(buffer_size=6):
    print(count_batch)
    i += 1
    if i > 10:
        break
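A small variation (my own sketch, not part of the answer): if you want individual examples shuffled rather than whole batches, apply shuffle before batching. The buffer_size of 8 is just an illustrative choice:
import tensorflow as tf

some_list = [(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'),
             (5, 'Five'), (6, 'Six'), (7, 'Seven'), (8, 'Eight')]

# Split into homogeneous lists so each can become a single-typed tensor.
int_list, str_list = zip(*some_list)

def text_gen(int_list, str_list):
    for x, y in zip(int_list, str_list):
        yield x, y

dataset = tf.data.Dataset.from_generator(
    text_gen,
    args=(list(int_list), list(str_list)),
    output_types=(tf.int32, tf.string),
    output_shapes=((), ()),
)

# Shuffle elements first, then batch, so every batch is a random mix.
for batch in dataset.shuffle(buffer_size=8).repeat().batch(3).take(2):
    print(batch)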

Adding elements to a MLMultiArray

I have a CoreML model (created using TF and converted to CoreML). For it:
input is: MultiArray (Double 1 x 40 x 3)
output is: MultiArray (Double)
I will be getting these [a,b,c] tuples and need to collect 40 of them before sending them to the model for prediction. I am looking through the MLMultiArray documentation and am stuck, maybe because I am new to Swift.
I have a variable called modelInput that I want to initialize and then as the tuples come in, add them to the modelInput variable.
modelInput = try MLMultiArray(shape: [1, 40, 3], dataType: MLMultiArrayDataType.double)
The modelInput.count is 120 after this call. So I am guessing an empty array is created.
However now I want to add the tuples as they come in. I am not sure how to do this.
For this I have a currCount variable which is updated after every call. The following code however gives me an error.
"Value of type 'UnsafeMutableRawPointer' has no subscripts"
var currPtr : UnsafeMutableRawPointer = modelInput.dataPointer + currCount
currPtr[0] = a
currPtr[1] = b
currPtr[2] = c
currCount = currCount + 3
How do I update the multiArray?
Is my approach even correct? Is this the correct way to create a multi array for the prediction input?
I would also like to print the contents of the MLMultiArray. There doesn't appear to be any helper functions to do that though.
You can use pointers, but you have to change the raw pointer into a typed one. For example:
let ptr = UnsafeMutablePointer<Float>(OpaquePointer(multiArray.dataPointer))
ptr[0] = a
ptr[1] = b
ptr[2] = c
I figured it out. I have to do this:
modelInput[currCount+0] = NSNumber(floatLiteral: a)
modelInput[currCount+1] = NSNumber(floatLiteral: b)
modelInput[currCount+2] = NSNumber(floatLiteral: c)
I cannot use the raw pointer to access elements.

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np
# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))
# I want to print the first value using f and memmaps
def f(value):
    print(value[1])
# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)
# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np

class Defered(np.ndarray):
    """
    An array class that defers calculations applied to it, only
    calculating them when an index is requested
    """
    def __new__(cls, arr):
        arr = np.asanyarray(arr).view(cls)
        arr.toApply = []
        return arr

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        ## Convert all arguments to ndarray, otherwise arguments
        #  of type Defered will cause infinite recursion;
        #  also store self as None, to be replaced later on
        newinputs = []
        for i in inputs:
            if i is self:
                newinputs.append(None)
            elif isinstance(i, np.ndarray):
                newinputs.append(i.view(np.ndarray))
            else:
                newinputs.append(i)

        ## Store function to apply and necessary arguments
        self.toApply.append((ufunc, method, newinputs, kwargs))
        return self

    def __getitem__(self, idx):
        ## Get index and convert to regular array
        sub = self.view(np.ndarray).__getitem__(idx)

        ## Apply stored actions
        for ufunc, method, inputs, kwargs in self.toApply:
            inputs = [i if i is not None else sub for i in inputs]
            sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

        return sub
This will fail if modifications are made to it that don't use numpy's universal functions. For instance, percentile and median aren't based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array or indexes a substantial portion of it, the entire array will be loaded.
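As a quick illustration of how the class above might be used (my own sketch, assuming the memmap.npy file from the extended example exists):
a = np.load("memmap.npy", mmap_mode='r')
d = Defered(a)

# Nothing is computed here; the ufunc and its arguments are only recorded.
d = d + 1

# Only now is the stored addition applied, and only to the requested element.
print(d[1])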
This is just how Python works. By default, NumPy operations return a new array, so b never exists as a memmap; it is created when + is called on a.
There are a couple of ways to work around this. The simplest is to do all operations in place:
a += 1
This requires opening the memory-mapped array for reading and writing:
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)
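Putting those pieces together, here is a small end-to-end sketch (my own, using a tiny stand-in array rather than the question's 8 GB file):
import numpy as np

# A small on-disk array to stand in for the real data.
np.save("a.npy", np.arange(10))

a = np.load("a.npy", mmap_mode='r')

# A memory-mapped output array of the same dtype and shape, backed by b.npy.
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)

# The ufunc writes its result straight into the mapped file via out=.
np.add(a, 2, out=b)
b.flush()

print(type(b))   # <class 'numpy.memmap'>
print(b[:3])     # [2 3 4]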

convert number string into float with specific precision (without getting rounding errors)

I have a cell vector (say, of size 50x1, called tokens), each element of which is a struct with fields x, f1, f2, which are strings representing numbers. For example, tokens{15} gives:
x: "-1.4343429"
f1: "15.7947111"
f2: "-5.8196158"
and I am trying to put those numbers into 3 vectors (each is also 50x1) whose type is float. So I create 3 vectors:
x = zeros(50,1,'single');
f1 = zeros(50,1,'single');
f2 = zeros(50,1,'single');
and that works fine (why wouldn't it?). But then when I try to populate those vectors: (L is a for loop index)
x(L)=tokens{L}.x;
.. also for the other 2
I get :
The following error occurred converting from string to single:
Conversion to single from string is not possible.
Which I can understand; implicit conversion doesn't work for single. It does work if x, f1 and f2 are of type 50x1 double.
The reason I am doing it with floats is because the data I get is from a C program which writes some floats into a file to be read by MATLAB. If I try to convert the values into doubles in the C program I get rounding errors...
So, (after what I hope is a good question,) how might I be able to get the numbers in those strings, at the right precision? (all the strings have the same number of decimal places: 7).
The MCVE:
filedata = fopen('fname1.txt','rt');
%fname1.txt is created by a C program. I am quite sure that the problem isn't there.
scanned = textscan(filedata,'%s','Delimiter','\n');
raw = scanned{1};
stringValues = strings(50,1);
for K=1:length(raw)
    stringValues(K)=raw{K};
end
clear K %purely for convenience
regex = 'x=(?<x>[\-\.0-9]*),f1=(?<f1>[\-\.0-9]*),f2=(?<f2>[\-\.0-9]*)';
tokens = regexp(stringValues,regex,'names');
x = zeros(50,1,'single');
f1 = zeros(50,1,'single');
f2 = zeros(50,1,'single');
for L=1:length(tokens)
    x(L)=tokens{L}.x;
    f1(L)=tokens{L}.f1;
    f2(L)=tokens{L}.f2;
end
Use the function str2double before assigning into your arrays (and then cast to single if you want). Strings (char arrays) must be explicitly converted to numbers before using them as numbers.

Loop through values of a SPSS variable inside of a Macro

How can I pass the values of a specific variable to a list-processing loop inside a macro?
Let's say, as a simplified example, I've got a variable foo which contains the values 1, 4, 12, 33 and 51.
DATA LIST FREE / foo (F2) .
BEGIN DATA
1
4
12
33
51
END DATA.
And a macro that does some stuff with those values.
For testing reasons this Macro will just echo those values.
I'd like to find a way to run a routine that works like the following:
DEFINE !testmacro (list !CMDEND)
!DO !i !IN (!list)
ECHO !QUOTE(!i).
!DOEND.
!ENDDEFINE.
!testmacro list = 1 4 12 33 51. * <- = values from foo.
This is a situation where using the Python APIs would be a good choice.
I made myself a little bit familiar with Python recently :-)
So this is what I worked out.
If the variable is numeric:
BEGIN PROGRAM PYTHON.
import spss,spssdata
foolist = [element[0] for element in spssdata.Spssdata('foo').fetchall()]
foostring = " ".join(str(int(i)) for i in foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
If the variable is a string:
BEGIN PROGRAM PYTHON.
import spss,spssdata
foolist = [element[0].strip() for element in spssdata.Spssdata('bar').fetchall()]
foostring = " ".join(foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
Variants
Duplicates removed and list is orderd
BEGIN PROGRAM PYTHON.
import spss,spssdata
foolist = sorted(set([element[0] for element in spssdata.Spssdata('foo').fetchall()]))
foostring = " ".join(str(int(i)) for i in foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
Duplicates removed and items in order of first appearance in the dataset
Here, I use a function which I retrieved from Peter Bengtsson's Homepage (peterbe.com)
BEGIN PROGRAM PYTHON.
import spss,spssdata
def uniquify(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
foolist = uniquify([element[0] for element in spssdata.Spssdata('foo').fetchall()])
foostring = " ".join(str(int(i)) for i in foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
Non-Python Solution
Not that I recommend it, but there is even a way to do this without Python.
I got the basic idea from an SPSS programming book, which goes as follows:
Use the WRITE command to create a text file with the wanted command and variable values, and include it with the INSERT command.
DATASET COPY foolistdata.
DATASET ACTIVATE foolistdata.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK
/NumberOfCases=N.
* Variable which contains the command as string in the first case.
STRING macrocommand (A18).
IF ($casenum=1) macroCommand = "!testmacro list = ".
EXECUTE.
* variable which contains a period (.) in the last case,
* for the ending of the command string.
STRING commandEnd (A1).
IF ($casenum=NumberOfCases) commandEnd = ".".
* Write the 'table' with the command and variable values into a textfile.
WRITE OUTFILE="macrocommand.txt" /macrocommand bar commandEnd.
EXECUTE.
* Macrocall.
INSERT FILE ="macrocommand.txt".