Fast iteration over unicode string Cython - unicode

I have the following cython function.
01:
+02: cdef int count_char_in_x(unicode x,Py_UCS4 c):
03: cdef:
+04: int count = 0
05: Py_UCS4 x_k
06:
+07: for x_k in x: ## Yellow
+08: if x_k == c:
+09: count+=1
10:
+11: return count
Line 07 is not properly optimized.
The annotated HTML code is expanded as:
+07: for x_k in x: ## Yellow
if (unlikely(__pyx_v_x == Py_None)) {
PyErr_SetString(PyExc_TypeError, "'NoneType' is not iterable");
__PYX_ERR(0, 8, __pyx_L1_error)
}
__Pyx_INCREF(__pyx_v_x);
__pyx_t_1 = __pyx_v_x;
__pyx_t_6 = __Pyx_init_unicode_iteration(__pyx_t_1, (&__pyx_t_3), (&__pyx_t_4), (&__pyx_t_5)); if (unlikely(__pyx_t_6 == ((int)-1))) __PYX_ERR(0, 8, __pyx_L1_error)
for (__pyx_t_7 = 0; __pyx_t_7 < __pyx_t_3; __pyx_t_7++) {
__pyx_t_2 = __pyx_t_7;
__pyx_v_x_k = __Pyx_PyUnicode_READ(__pyx_t_5, __pyx_t_4, __pyx_t_2);
Any tips on how could this be improved?
I think it is possible to write a cdef/cpdef function that at runtime completly avoids Python None type checks. Any idea on how this could be done?

The generated C code looks pretty good to me. The loop overall is a int-iterated for loop (i.e. it's not relying on calling the Python methods __iter__ and __next__).
__Pyx_PyUnicode_READ is translated pretty directly to PyUnicode_READ (depending slightly on the Python version you're using). PyUnicode_READ is a C macro which is as close to a direct array access as you can get.
This is probably as good as it's getting. You might get a small improvement by using bytes rather than unicode (provided you're dealing with ASCII characters). You might just consider whether it's really worth reimplementing unicode.count.
If it were a regular def function you could declare x as unicode not None to remove the None check before the loop. That might make a small difference. However, as #ead points out that isn't supported for cdef functions. It's likely the cost of a def function call will be slightly larger than the cost of a None-check, but you should time it if you care.

Related

Why are macros based on abstract syntax trees better than macros based on string preprocessing?

I am beginning my journey of learning Rust. I came across this line in Rust by Example:
However, unlike macros in C and other languages, Rust macros are expanded into abstract syntax trees, rather than string preprocessing, so you don't get unexpected precedence bugs.
Why is an abstract syntax tree better than string preprocessing?
If you have this in C:
#define X(A,B) A+B
int r = X(1,2) * 3;
The value of r will be 7, because the preprocessor expands it to 1+2 * 3, which is 1+(2*3).
In Rust, you would have:
macro_rules! X { ($a:expr,$b:expr) => { $a+$b } }
let r = X!(1,2) * 3;
This will evaluate to 9, because the compiler will interpret the expansion as (1+2)*3. This is because the compiler knows that the result of the macro is supposed to be a complete, self-contained expression.
That said, the C macro could also be defined like so:
#define X(A,B) ((A)+(B))
This would avoid any non-obvious evaluation problems, including the arguments themselves being reinterpreted due to context. However, when you're using a macro, you can never be sure whether or not the macro has correctly accounted for every possible way it could be used, so it's hard to tell what any given macro expansion will do.
By using AST nodes instead of text, Rust ensures this ambiguity can't happen.
A classic example using the C preprocessor is
#define MUL(a, b) a * b
// ...
int res = MUL(x + y, 5);
The use of the macro will expand to
int res = x + y * 5;
which is very far from the expected
int res = (x + y) * 5;
This happens because the C preprocessor really just does simple text-based substitutions, it's not really an integral part of the language itself. Preprocessing and parsing are two separate steps.
If the preprocessor instead parsed the macro like the rest of the compiler, which happens for languages where macros are part of the actual language syntax, this is no longer a problem as things like precedence (as mentioned) and associativity are taken into account.

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np
# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))
# I want to print the first value using f and memmaps
def f(value):
print(value[1])
# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)
# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np
class Defered(np.ndarray):
"""
An array class that deferrs calculations applied to it, only
calculating them when an index is requested
"""
def __new__(cls, arr):
arr = np.asanyarray(arr).view(cls)
arr.toApply = []
return arr
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
## Convert all arguments to ndarray, otherwise arguments
# of type Defered will cause infinite recursion
# also store self as None, to be replaced later on
newinputs = []
for i in inputs:
if i is self:
newinputs.append(None)
elif isinstance(i, np.ndarray):
newinputs.append(i.view(np.ndarray))
else:
newinputs.append(i)
## Store function to apply and necessary arguments
self.toApply.append((ufunc, method, newinputs, kwargs))
return self
def __getitem__(self, idx):
## Get index and convert to regular array
sub = self.view(np.ndarray).__getitem__(idx)
## Apply stored actions
for ufunc, method, inputs, kwargs in self.toApply:
inputs = [i if i is not None else sub for i in inputs]
sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
return sub
This will fail if modifications are made to it that don't use numpy's universal functions. For instance percentile and median aren't based on ufuncs, and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or applies an index to substantial amounts the entire array will be loaded.
This is just how python works. By default numpy operations return a new array, so b never exists as a memmap - it is created when + is called on a.
There's a couple of ways to work around this. The simplest is to do all operations in place,
a += 1
This requires loading the memory mapped array for reading and writing,
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mmap+mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)

Turn off Warning: Extension: Conversion from LOGICAL(4) to INTEGER(4) at (1) for gfortran?

I am intentionally casting an array of boolean values to integers but I get this warning:
Warning: Extension: Conversion from LOGICAL(4) to INTEGER(4) at (1)
which I don't want. Can I either
(1) Turn off that warning in the Makefile?
or (more favorably)
(2) Explicitly make this cast in the code so that the compiler doesn't need to worry?
The code will looking something like this:
A = (B.eq.0)
where A and B are both size (n,1) integer arrays. B will be filled with integers ranging from 0 to 3. I need to use this type of command again later with something like A = (B.eq.1) and I need A to be an integer array where it is 1 if and only if B is the requested integer, otherwise it should be 0. These should act as boolean values (1 for .true., 0 for .false.), but I am going to be using them in matrix operations and summations where they will be converted to floating point values (when necessary) for division, so logical values are not optimal in this circumstance.
Specifically, I am looking for the fastest, most vectorized version of this command. It is easy to write a wrapper for testing elements, but I want this to be a vectorized operation for efficiency.
I am currently compiling with gfortran, but would like whatever methods are used to also work in ifort as I will be compiling with intel compilers down the road.
update:
Both merge and where work perfectly for the example in question. I will look into performance metrics on these and select the best for vectorization. I am also interested in how this will work with matrices, not just arrays, but that was not my original question so I will post a new one unless someone wants to expand their answer to how this might be adapted for matrices.
I have not found a compiler option to solve (1).
However, the type conversion is pretty simple. The documentation for gfortran specifies that .true. is mapped to 1, and false to 0.
Note that the conversion is not specified by the standard, and different values could be used by other compilers. Specifically, you should not depend on the exact values.
A simple merge will do the trick for scalars and arrays:
program test
integer :: int_sca, int_vec(3)
logical :: log_sca, log_vec(3)
log_sca = .true.
log_vec = [ .true., .false., .true. ]
int_sca = merge( 1, 0, log_sca )
int_vec = merge( 1, 0, log_vec )
print *, int_sca
print *, int_vec
end program
To address your updated question, this is trivial to do with merge:
A = merge(1, 0, B == 0)
This can be performed on scalars and arrays of arbitrary dimensions. For the latter, this can easily be vectorized be the compiler. You should consult the manual of your compiler for that, though.
The where statement in Casey's answer can be extended in the same way.
Since you convert them to floats later on, why not assign them as floats right away? Assuming that A is real, this could look like:
A = merge(1., 0., B == 0)
Another method to compliment #AlexanderVogt is to use the where construct.
program test
implicit none
integer :: int_vec(5)
logical :: log_vec(5)
log_vec = [ .true., .true., .false., .true., .false. ]
where (log_vec)
int_vec = 1
elsewhere
int_vec = 0
end where
print *, log_vec
print *, int_vec
end program test
This will assign 1 to the elements of int_vec that correspond to true elements of log_vec and 0 to the others.
The where construct will work for any rank array.
For this particular example you could avoid the logical all together:
A=1-(3-B)/3
Of course not so good for readability, but it might be ok performance-wise.
Edit, running performance tests this is 2-3 x faster than the where construct, and of course absolutely standards conforming. In fact you can throw in an absolute value and generalize as:
integer,parameter :: h=huge(1)
A=1-(h-abs(B))/h
and still beat the where loop.

How can I reduce the number of test cases ScalaCheck generates?

I'm trying to solve two ScalaCheck (+ specs2) problems:
Is there any way to change the number of cases that ScalaCheck generates?
How can I generate strings that contain some Unicode characters?
For example, I'd like to generate about 10 random strings that include both alphanumeric and Unicode characters. This code, however, always generates 100 random strings, and they are strictly alpha character based:
"make a random string" in {
def stringGenerator = Gen.alphaStr.suchThat(_.length < 40)
implicit def randomString: Arbitrary[String] = Arbitrary(stringGenerator)
"the string" ! prop { (s: String) => (s.length > 20 && s.length < 40) ==> { println(s); success; } }.setArbitrary(randomString)
}
Edit
I just realized there's another problem:
Frequently, ScalaCheck gives up without generating 100 test cases
Granted I don't want 100, but apparently my code is trying to generate an overly complex set of rules. The last time it ran, I saw "gave up after 47 tests."
The "gave up after 47 tests" error means that your conditions (which include both the suchThat predicate and the ==> part) are too restrictive. Fortunately it's often not too hard to bake these into your generator, and in your case you can write something like this (which also addresses the issue of picking arbitrary characters, not just alphanumeric ones):
val stringGen: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.buildableOfN[String, Char](n, arbitrary[Char])
}
Here we pick an arbitrary length in the desired range and then pick that number of arbitrary characters and concatenate them into a string.
You could also increase the maxDiscardRatio parameter:
import org.specs2.scalacheck.Parameters
implicit val params: Parameters = Parameters(maxDiscardRatio = 1024)
But that's typically not a good idea—if you're throwing away most of your generated values your tests will take longer, and refactoring your generator is generally a lot cleaner and faster.
You can also decrease the number of test cases by setting the appropriate parameter:
implicit val params: Parameters = Parameters(minTestsOk = 10)
But again, unless you have a good reason to do this, I'd suggest trusting the defaults.

kdb c++ interface: create byte list from std::string

The following is very slow for long strings:
std::string s = "long string";
K klist = DBVec::CreateList(KG , s.length());
for (int i=0; i<s.length(); i++)
{
kG(klist)[i]=s.c_str()[i];
}
It works acceptably fast (<100ms) for strings up to 100k, but slows to a crawl (tens of minutes, possibly hours) for strings of a few million characters. I don't see anything other than kG that can create nonlinearity. I don't see any reason for accessor function kG to be non-constant time, but there is just nothing else in this loop. Unfortunately I don't know how kG works due to lack of documentation.
Question: given a blob of binary data as std::string, what's the efficient way to construct a byte list?
kG is a macro defined in k.h which expands to ((x)->G0), i.e. follow the G0 pointer of the K object
http://kx.com/q/d/a/c.htm#Strings documents kp, which creates a K string object directly from a string, so presumably you could do K klist = kp(s.c_str()), which is probably faster
This works:
memcpy(kG(klist), s.c_str(), s.length());
Still wonder why that loop is not O(N).