Creating a numpy python string array with pybind11

I am trying to modify a numpy string array from C++ with pybind11. The code I am using has the following structure:
#include <pybind11/embed.h>
#include <pybind11/numpy.h>
#include <iostream>

namespace py = pybind11;

py::array_t<py::str> process_array(py::array_t<py::str> input);

PYBIND11_EMBEDDED_MODULE(fast_calc, m) {
    m.def("process_array", process_array);
}

py::array_t<py::str> process_array(py::array_t<py::str> input) {
    auto buf = input.request();
    std::cout << &buf;
    return input;
}
The problem I face is this error message:
pybind11/numpy.h:1114:19: error: static assertion failed: Attempt to use a non-POD or unimplemented POD type as a numpy dtype
static_assert(is_pod_struct::value, "Attempt to use a non-POD or unimplemented POD type as a numpy dtype");
Not sure what the catch is. In Python you can create numpy string arrays, so what am I doing wrong?
Thanks.

Fixed-length strings are supported in pybind11 (tested on v2.2.3, CentOS 7, Python 3.6.5) by using the pybind11::array_t<std::array<char, N>> or char[N] type. Likely you'll want to pad out the string with null values just in case, as the standard pitfalls of C-style strings apply (e.g. N-1 usable characters). I prefer working with std::array, as it doesn't decay to a char* without calling .data(), making your intentions clearer to other readers.
So some pseudocode would look like this for a vector of 16-byte strings:
using np_str_t = std::array<char, 16>;

pybind11::array_t<np_str_t> cstring_array(vector.size());
np_str_t* array_of_cstr_ptr = reinterpret_cast<np_str_t*>(cstring_array.request().ptr);

for (const auto& s : vector)
{
    std::strncpy(array_of_cstr_ptr->data(), s.data(), array_of_cstr_ptr->size());
    array_of_cstr_ptr++;
}

return cstring_array; // numpy array back to python code
And then in Python:
array([b'ABC', b'XYZ'], dtype='|S16')
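For completeness, here is a self-contained sketch of the same pattern (the function name and the std::vector<std::string> input are mine, not from the original question):

#include <pybind11/numpy.h>
#include <array>
#include <cstring>
#include <string>
#include <vector>

using np_str_t = std::array<char, 16>;  // maps to numpy dtype '|S16'

pybind11::array_t<np_str_t> to_numpy_strings(const std::vector<std::string>& strings)
{
    pybind11::array_t<np_str_t> arr(strings.size());
    np_str_t* ptr = reinterpret_cast<np_str_t*>(arr.request().ptr);
    for (const auto& s : strings) {
        std::memset(ptr->data(), 0, ptr->size());               // null-pad the slot
        std::strncpy(ptr->data(), s.c_str(), ptr->size() - 1);  // keep a trailing NUL
        ++ptr;
    }
    return arr;
}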

Related

Pedersen circom/circomlibjs inconsistency?

As a unit test for a larger use case, I am checking that the Pedersen hash I am computing in the frontend aligns with the expected hash computed in a circom circuit. I am using a simple assert in the circuit, generating a witness, and feeding both the hashed and unhashed values to the circuit, recreating the hash to make sure that it goes through.
I am running the Pedersen hash in my frontend using circomlibjs. As a unit test, I have a circuit with a simple assert that checks whether the results from my frontend line up with the Pedersen hash in the circom circuit.
The circuit I am using:
include "../node_modules/circomlib/circuits/bitify.circom";
include "../node_modules/circomlib/circuits/pedersen.circom";
template check() {
    signal input unhashed;
    signal input hashed;
    signal output createdHash[2];

    component hasher = Pedersen(256);
    component unhashedBits = Num2Bits(256);

    unhashedBits.in <== unhashed;
    for (var i = 0; i < 256; i++) {
        hasher.in[i] <== unhashedBits.out[i];
    }

    createdHash[0] <== hasher.out[0];
    createdHash[1] <== hasher.out[1];
    hashed === createdHash[1];
}

component main = check();
In the frontend, I am running the following:
import { buildPedersenHash } from 'circomlibjs';

export function buff2hex(buff) {
    function i2hex(i) {
        return ('0' + i.toString(16)).slice(-2);
    }
    return '0x' + Array.from(buff).map(i2hex).join('');
}

const secret = (new TextEncoder(32)).encode("Hello");
var pedersen = await buildPedersenHash();
var h = pedersen.hash(secret);

console.log(buff2hex(secret));
console.log(buff2hex(h));
The values that are printed are:
0x48656c6c6f
0x0e90d7d613ab8b5ea7f4f8bc537db6bb0fa2e5e97bbac1c1f609ef9e6a35fd8b
Which are consistent with the test done here.
So I then create an input.json file, which looks as follows:
{
    "unhashed": "0x48656c6c6f",
    "hashed": "0x0e90d7d613ab8b5ea7f4f8bc537db6bb0fa2e5e97bbac1c1f609ef9e6a35fd8b"
}
And lastly run the following script to create a witness, in the hopes that the assert will go through.
# Compile the circuit
circom ${CIRCUIT}.circom --r1cs --wasm --sym --c
# Generate the witness.wtns
node ${CIRCUIT}_js/generate_witness.js ${CIRCUIT}_js/${CIRCUIT}.wasm input.json ${CIRCUIT}_js/witness.wtns
However, I keep getting an assert error,
Error: Error: Assert Failed.
Error in template check_11 line: 26
Which describes the assert in the circuit, so I assume there is an inconsistency in the hash.
I am new to circom so any insights would be greatly appreciated!
For anyone who stumbles across this, it turns out that the cause of the issue is endianness. The issue was fixed by converting the unhashed value to little-endian in the input. I am not sure where exactly the problem is, but it seems the hasher reads it as big-endian on the frontend while the input is expected to be little-endian (or vice versa).
As I have managed to patch up a fix for this for the moment, I will stop investigating, but I implore anyone who understands this further to give a better explanation.
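To make the fix concrete, here is a minimal sketch of the byte-order flip (in C++ to match the other examples on this page; the helper name is mine, and treating a plain byte reversal as the right conversion for every input is an assumption based on the fix above):

#include <iostream>
#include <string>

// Hypothetical helper: reverse the byte order of an even-length hex string
// (without the "0x" prefix), e.g. "48656c6c6f" -> "6f6c6c6548".
std::string reverse_hex_bytes(const std::string& hex)
{
    std::string out;
    for (std::size_t i = hex.size(); i >= 2; i -= 2)
        out += hex.substr(i - 2, 2);
    return out;
}

int main()
{
    // "Hello" encodes big-endian as 0x48656c6c6f; the circuit appears to
    // want the little-endian form in input.json instead.
    std::cout << "0x" << reverse_hex_bytes("48656c6c6f") << "\n";  // 0x6f6c6c6548
}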

Circomlib assert fail on simple MimcSponge hash

I am playing around with circom and circomlib.
I am using a simple MiMCSponge hashing circuit and seeing if I can create a correct input through a JavaScript frontend.
The circuit I am running:
include "../node_modules/circomlib/circuits/mimcsponge.circom";

template sponge_test() {
    signal input l;
    signal input r;
    signal input o;

    // instantiate - 2 inputs, 220 rounds of hashing, and 1 output
    component hasher = MiMCSponge(2, 220, 1);

    // signals in hasher
    hasher.ins[0] <== l;
    hasher.ins[1] <== r;

    // addition constant
    hasher.k <== 0;

    o === hasher.outs[0];
}

component main = sponge_test();
In my JavaScript frontend I am importing circomlibjs:
import { buildMimcSponge } from 'circomlibjs';

function toHexString(byteArray) {
    return Array.from(byteArray, function(byte) {
        return ('0' + (byte & 0xFF).toString(16)).slice(-2);
    }).join('')
}

export async function getProof(message) {
    var hasher = await buildMimcSponge();
    var h = hasher.multiHash([BigInt("0x3"), BigInt("0x4")]);
    // returns byte array
    console.log(h);
    // back to hexstring
    console.log(toHexString(h));
}
I then create an input.json that looks like this:
{
    "l": "0x3",
    "r": "0x4",
    "o": "0x690f48aba976f2786371b7fa3e941df623e96329e0570dc610f59b7fcfa94723"
}
Which includes the values I used for the input of the hashing and the output I got from printing the hex value, and then run the following script
# Compile the circuit
circom ${CIRCUIT}.circom --r1cs --wasm --sym --c
# Generate the witness.wtns
node ${CIRCUIT}_js/generate_witness.js ${CIRCUIT}_js/${CIRCUIT}.wasm input.json ${CIRCUIT}_js/witness.wtns
And I get the error that the assert (o === hasher.outs[0]) fails.
Now, I know that the MiMCSponge circuit uses 220 rounds in the JavaScript implementation of circomlib as well, from looking at the node lib. Where else could I be getting inconsistent results for the hashing?
So I found that reading the hash is done using the following. I believe it is because the representation is specific to the elliptic curve being used.
hasher.F.toString(h, 16);
This produces the expected result which gets accepted by the circuit.
If anyone has further insights, I would be happy to understand it further.

Proper way to call a different method from the same C-extension module?

I'm converting a pure-Python module to a C-extension to familiarize myself with the C API.
The Python implementation is as follows:
_CRC_TABLE_ = [0] * 256

def initialize_crc_table():
    if _CRC_TABLE_[1] != 0:  # Safeguard against re-initialization
        return
    # snip

def calculate_crc(data: bytes, initial: int = 0) -> int:
    if _CRC_TABLE_[1] == 0:  # In case user forgets to initialize first
        initialize_crc_table()
    # snip

# additional non-CRC methods trimmed
# additional non-CRC methods trimmed
My C-extension thus far works:
#include <Python.h>

static Py_ssize_t CRC_TABLE_LEN = 256;
PyObject *_CRC_TABLE_;

static PyObject *method_initialize_crc_table(PyObject *self, PyObject *args) {
    // snip
}

static PyMethodDef module_methods[] = {
    {"initialize_crc_table", method_initialize_crc_table, METH_VARARGS, NULL},
    {NULL, NULL, 0, NULL}
};

void _allocate_table_() {
    _CRC_TABLE_ = PyList_New(CRC_TABLE_LEN);
    PyObject *zero = Py_BuildValue("i", 0);
    for (int i = 0; i < CRC_TABLE_LEN; i++) {
        Py_INCREF(zero);  // PyList_SetItem steals a reference
        PyList_SetItem(_CRC_TABLE_, i, zero);
    }
    Py_DECREF(zero);  // drop the original reference from Py_BuildValue
}

#if PY_MAJOR_VERSION >= 3
static struct PyModuleDef module_utilities = {
    PyModuleDef_HEAD_INIT,
    "utilities",
    NULL,
    -1,
    module_methods,
};

PyMODINIT_FUNC PyInit_utilities(void) {
    PyObject *module = PyModule_Create(&module_utilities);
    _allocate_table_();
    PyModule_AddObject(module, "_CRC_TABLE_", _CRC_TABLE_);
    return module;
}
#else
PyMODINIT_FUNC initutilities(void) {
    PyObject *module = Py_InitModule3("utilities", module_methods, NULL);
    _allocate_table_();
    PyModule_AddObject(module, "_CRC_TABLE_", _CRC_TABLE_);
}
#endif
I am able to access utilities._CRC_TABLE_ from the C-extension in the interpreter, and the values match the Python equivalent when invoking utilities.initialize_crc_table.
Now I'm trying to call initialize_crc_table at the start of calculate_crc, performing the same check as used in the Python implementation. I'm returning None for now:
static PyObject *method_calculate_crc(PyObject *self, PyObject *args) {
    if (!(uint)PyLong_AsUnsignedLong(PyList_GetItem(_CRC_TABLE_, (Py_ssize_t)1))) {
        PyObject *call_initialize_crc_table = PyObject_GetAttrString(self, "initialize_crc_table");
        PyObject_CallObject(call_initialize_crc_table, NULL);
        Py_DECREF(call_initialize_crc_table);
    }
    Py_RETURN_NONE;
}
I've added this to module_methods[] and it compiles without warnings or errors. When I run this method within the interpreter, I get a segfault. I assume it's because self isn't the module as an object.
I can do this as an alternative, which appears to work without issue:
static PyObject *method_calculate_crc(PyObject *self, PyObject *args) {
    if (!(uint)PyLong_AsUnsignedLong(PyList_GetItem(_CRC_TABLE_, (Py_ssize_t)1))) {
        method_initialize_crc_table(self, NULL);
    }
    Py_RETURN_NONE;
}
However, I am not certain if I should be passing self, NULL, or something else to the method.
What is the proper way of invoking method_initialize_crc_table from method_calculate_crc?
There was a "gotcha" here that I must clarify. While the code was intended for Python 3, development was initially done in Python 2, as the development files were not yet available on the machine I was using. This shed some light on differences in how each version handles things. DavidW's comments helped lead to this clarification.
If a method is declared METH_VARARGS but defined for a module (versus a class), Python 2 does not pass anything for the PyObject *self parameter. This is noted in the documentation but is easy to overlook if you're not careful. Python 3, however, does pass a pointer to the module. As DavidW recommended, I implemented a global variable to hold a reference to the module. Assuming his claim that Python handles the de-referencing at exit is correct, we can safely use this for accessing module-level globals.
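A minimal sketch of that global-reference approach, revising the Python 3 init shown earlier (the global's name matches the `module` used in the next snippet):

static PyObject *module = NULL;  /* global reference to this module, set at init */

PyMODINIT_FUNC PyInit_utilities(void) {
    module = PyModule_Create(&module_utilities);
    if (module == NULL)
        return NULL;
    _allocate_table_();
    PyModule_AddObject(module, "_CRC_TABLE_", _CRC_TABLE_);
    return module;
}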
With our issue of PyObject *self solved, we no longer get a segfault. We can then address the question of which approach is seemingly more correct for calling a method within the local scope of the module. Do we do this:
if (/* conditional */)
    PyObject_CallMethod(module, "initialize_crc_table", NULL);
Or this:
if (/* conditional */)
    method_initialize_crc_table(self, NULL);
Benchmarks seem to provide an answer here. Using Python's built-in timeit module, we can see a very clear difference in performance. Note that so far in our implementation, calculate_crc accesses _CRC_TABLE_ and checks whether it is initialized, but no processing occurs. Performance was identical between Python 2 and 3, so only one set of numbers is shown.
The command is as follows:
python3 -m timeit "import utilities; utilities.calculate_crc(0)"
PyObject_CallMethod: 874 nsec per loop
method_initialize_crc_table: 44.3 usec per loop
Using the PyObject_ function is roughly 50x faster (44.3 usec vs. 874 nsec), quite a significant difference. Benchmarks alone do not establish what is "more correct", but with no clear guidance they may be sufficient justification for our use. Therefore, I will be using PyObject_ calls for this project.

QGArray::at: Absolute index 7645637866 out of range

I am using doxygen to parse the Linux kernel (https://github.com/torvalds/linux). After running for more than 20 hours, while generating the call graphs, it reports errors: QGArray::at: Absolute index xxxxxxxxxx out of range. I analyzed the source code and suspect it might be caused by the type of array_data->len in doxygen-master/qtools/qgarray.h:54 (the reported index 7645637866 exceeds UINT_MAX = 4294967295, so a 32-bit uint length would overflow):
struct array_data : public QShared {  // shared array
    array_data() { data=0; len=0; }
    char *data;  // actual array data
    uint len;
};
I tried using a long type for len, then rebuilt and reinstalled Doxygen, but parsing Linux will need another 20 hours to verify the fix.
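For concreteness, a sketch of the kind of change being tested (using a fixed-width 64-bit type is my own suggestion; whether this alone resolves the error is exactly what the long run would verify):

#include <cstdint>

struct array_data : public QShared {   // shared array
    array_data() { data = 0; len = 0; }
    char *data;                        // actual array data
    uint64_t len;                      // was: uint -- 32 bits overflow past 4294967295
};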
I want to know how to fix the error properly.

Retrieving gdcm DataElement values as strings

I am basically trying to read out all or most attribute values from a DICOM file, using the gdcm C++ library. I am having a hard time getting out any non-string values. The gdcm examples generally assume I know the group/element numbers beforehand, so I can use the Attribute template classes, but I have no need or interest in them; I just have to report all attribute names and values. Actually, the values should go into an XML file, so I need a string representation. What I currently have is something like:
for (gdcm::DataSet::ConstIterator it = ds.Begin(); it != ds.End(); ++it) {
    const gdcm::DataElement& elem = *it;
    if (elem.GetVR() != gdcm::VR::SQ) {
        const gdcm::Tag& tag = elem.GetTag();
        std::cout << dict.GetDictEntry(tag).GetKeyword() << ": ";
        std::cout << elem.GetValue() << "\n";
    }
}
It seems that for numeric values like UL the output is something like "Loaded:4", presumably meaning that the library has loaded 4 bytes of data (an unsigned long). This is not helpful at all; how do I get the actual value? I must certainly be overlooking something obvious.
From the examples it seems there is a gdcm::StringFilter class which is able to do this, but it seems to want to look up each element by itself in the DICOM file, which would make the algorithm complexity quadratic; this is certainly something I would like to avoid.
TIA
Paavo
Have you looked at gdcmdump? You can use it to output the DICOM file as text or XML. You can also look at the source to see how it does this.
I ended up extracting parts of gdcm::StringFilter::ToStringPair() into a separate function. It seems to work well, for simpler DCM files at least...
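For reference, a minimal sketch of what such a loop might look like when reusing one StringFilter for the whole dataset (gdcm::StringFilter::SetFile and ToStringPair are real gdcm calls, but treat the exact output format here as an assumption):

#include <gdcmReader.h>
#include <gdcmStringFilter.h>
#include <iostream>
#include <string>
#include <utility>

void dump_attributes(const char* path)
{
    gdcm::Reader reader;
    reader.SetFileName(path);
    if (!reader.Read())
        return;

    gdcm::StringFilter sf;
    sf.SetFile(reader.GetFile());  // set once, reuse for every element

    const gdcm::DataSet& ds = reader.GetFile().GetDataSet();
    for (gdcm::DataSet::ConstIterator it = ds.Begin(); it != ds.End(); ++it) {
        if (it->GetVR() == gdcm::VR::SQ)
            continue;  // skip sequences, as in the question
        std::pair<std::string, std::string> p = sf.ToStringPair(it->GetTag());
        std::cout << p.first << ": " << p.second << "\n";  // name: value
    }
}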
You could also start by reading the FAQ, in particular "How do I convert an attribute value to a string?"
As explained there, you simply need to use gdcm::StringFilter:
sf = gdcm.StringFilter()
sf.SetFile(r.GetFile())
print sf.ToStringPair(gdcm.Tag(0x0028,0x0010))
Try something like this:
gdcm::Reader reader;
reader.SetFileName( absSlicePath.c_str() );
if( !reader.Read() )
{
    return;
}

gdcm::File file = reader.GetFile();
gdcm::DataSet ds = file.GetDataSet();

std::stringstream strm;
strm << ds;

This gets you a stringstream containing all the DICOM tag-value pairs.
Actually, most of the DICOM classes (DataElement, DataSet, etc.) overload std::ostream& operator<<(std::ostream& _os, const SomeClass& _val), so you can just expand the for loop and use operator<< to put the values into the stringstream, and then into a string, as sketched below.
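A short sketch of that expanded loop (assuming ds is the gdcm::DataSet from the snippet above):

#include <sstream>
#include <string>

std::stringstream strm;
for (gdcm::DataSet::ConstIterator it = ds.Begin(); it != ds.End(); ++it)
{
    strm << *it << "\n";  // operator<< is overloaded for DataElement
}
std::string all_tags = strm.str();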
For example, if you are using Qt:
ui->pTagsTxt->append(QString(strm.str().c_str()));