I am trying to use CUSP to solve a complex matrix using the GMRES method.
When compiling, I get a error that says:
"no suitable conversion function from "cusp::complex" to "float" exists"
and if I go see where the error comes from it gives me gmres.inl, line 143
resid[0] = s[0];
where resid and s type are
cusp::array1d<NormType,cusp::host_memory> resid(1);
cusp::array1d<ValueType,cusp::host_memory> s(R+1);
typedef typename LinearOperator::value_type ValueType;
typedef typename LinearOperator::memory_space MemorySpace;
typedef typename norm_type<ValueType>::type NormType;
Why is it telling me float, when the type of both is complex?
code:`
// create an empty sparse matrix structure (CSR format)
cusp::csr_matrix,cusp::device_memory>A;
// initialize matrix from file
cusp::io::read_matrix_market_file(A, PATH);
// allocate storage for solution (x) and right hand side (b)
cusp::array1d<complex<float>, cusp::device_memory> x(A.num_rows, 0);
cusp::array1d<complex<float>, cusp::device_memory> b(A.num_rows, 1);
// set stopping criteria:
// iteration_limit = 100
// relative_tolerance = 1e-6
cusp::verbose_monitor<complex<float>> monitor(b,10000, 1e-6);
int restart = 50;
// set preconditioner (identity)
cusp::identity_operator<complex<float>, cusp::device_memory> M(A.num_rows, A.num_rows);
// solve the linear system A x = b
cusp::krylov::gmres(A, x, b, restart, monitor,M);//`
Related
I am calculating the intersection point of two lines given in the polar coordinate system:
typedef ap_fixed<16,3,AP_RND> t_lines_angle;
typedef ap_fixed<16,14,AP_RND> t_lines_rho;
bool get_intersection(
hls::Polar_< t_lines_angle, t_lines_rho>* lineOne,
hls::Polar_< t_lines_angle, t_lines_rho>* lineTwo,
Point* point)
{
float angleL1 = lineOne->angle.to_float();
float angleL2 = lineTwo->angle.to_float();
t_lines_angle rhoL1 = lineOne->rho.to_float();
t_lines_angle rhoL2 = lineTwo->rho.to_float();
t_lines_angle ct1=cosf(angleL1);
t_lines_angle st1=sinf(angleL1);
t_lines_angle ct2=cosf(angleL2);
t_lines_angle st2=sinf(angleL2);
t_lines_angle d=ct1*st2-st1*ct2;
// we make sure that the lines intersect
// which means that parallel lines are not possible
point->X = (int)((st2*rhoL1-st1*rhoL2)/d);
point->Y = (int)((-ct2*rhoL1+ct1*rhoL2)/d);
return true;
}
After synthesis for our FPGA I saw that the 4 implementations of the float sine (and cos) take 4800 LUTs per implementation, which sums up to 19000 LUTs for these 4 functions. I want to reduce the LUT count by using a fixed point sine. I already found a implementation of CORDIC but I am not sure how to use it. The input of the function is an integer but i have a ap_fixed datatype. How can I map this ap_fixed to integer? and how can I map my 3.13 fixed point to the required 2.14 fixed point?
With the help of one of my colleagues I figured out a quite easy solution that does not require any hand written implementations or manipulation of the fixed point data:
use #include "hls_math.h" and the hls::sinf() and hls::cosf() functions.
It is important to say that the input of the functions should be ap_fixed<32, I> where I <= 32. The output of the functions can be assigned to different types e.g., ap_fixed<16, I>
Example:
void CalculateSomeTrig(ap_fixed<16,5>* angle, ap_fixed<16,5>* output)
{
ap_fixed<32,5> functionInput = *angle;
*output = hls::sinf(functionInput);
}
LUT consumption:
In my case the consumption of LUT was reduced to 400 LUTs for each implementation of the function.
You can use bit-slicing to get the fraction and the integer parts of the ap_fixed variable, and then manipulate them to get the new ap_fixed. Perhaps something like:
constexpr int max(int a, int b) { return a > b ? a : b; }
template <int W2, int I2, int W1, int I1>
ap_fixed<W2, I2> convert(ap_fixed<W1, I1> f)
{
// Read fraction part as integer:
ap_fixed<max(W2, W1) + 1, max(I2, I1) + 1> result = f(W1 - I1 - 1, 0);
// Shift by the original number of bits in the fraction part
result >>= W1 - I1;
// Add the integer part
result += f(W1 - 1, W1 - I1);
return result;
}
I haven't tested this code well, so take it with a grain of salt.
A question/problem for anyone experienced with Xilinx Vivado HLS and FPGA design:
I need help reducing the utilization numbers of a design within the confines of HLS (i.e. can't just redo the design in an HDL). I am targeting the Zedboard (Zynq 7020).
I'm trying to implement 2048-bit RSA in HLS, using the Tenca-koc multiple-word radix 2 montgomery multiplication algorithm, shown below (More algorithm details here):
I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:
#define MWR2MM_m 2048 // Bit-length of operands
#define MWR2MM_w 8 // word size
#define MWR2MM_e 257 // number of words per operand
// Type definitions
typedef ap_uint<1> bit_t; // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t; // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t; // m-bit operand size
/*
* Multiple-word radix 2 montgomery multiplication using carry-propagate adder
*/
void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out)
{
// extend operands to 2 extra words of 0
ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin;
ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min;
ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0;
ap_uint<2> C = 0; // two carry bits
bit_t qi = 0; // an intermediate result bit
// Store concatenations in a temporary variable to eliminate HLS compiler warnings about shift count
ap_uint<MWR2MM_w> temp_concat=0;
// scan X bit-by bit
for (int i=0; i<MWR2MM_m; i++)
{
qi = (X[i]*Y[0]) xor S[0];
// C gets top two bits of temp_concat, j'th word of S gets bottom 8 bits of temp_concat
temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0);
C = temp_concat.range(9,8);
S.range(MWR2MM_w-1,0) = temp_concat.range(7,0);
// scan Y and M word-by word, for each bit of X
for (int j=1; j<=MWR2MM_e; j++)
{
temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j);
C = temp_concat.range(9,8);
S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0);
S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range( MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1));
}
S.range(S.length()-1, S.length()-MWR2MM_w) = 0;
C=0;
}
// if final partial sum is greater than the modulus, bring it back to proper range
if (S >= M)
S -= M;
*out = S;
}
Unfortunately, the LUT utilization is huge.
This is problematic because I need to be able to fit multiple of these blocks in hardware as axi4-lite slaves.
Could someone please provide a few suggestions as to how I can reduce the LUT utilization, WITHIN THE CONFINES OF HLS?
I've already tried the following:
Experimenting with different word lengths
switching the top level inputs to arrays so they are BRAM (i.e. not using ap_uint<2048>, but instead ap_uint foo[MWR2MM_e])
Experimenting with all sorts of directives: compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc.
However, nothing really drives the LUT utilization down in a meaningful way. Is there a glaringly obvious way that I could reduce the utilization that is apparent to anyone?
In particular, I've seen papers on implementations of the mwr2mm algorithm that (only use one DSP block and one BRAM). Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?
Thanks for the help.
I need to make sure that at least 1 change occurred in a specific uint item X, i.e. X had 2 different values (it is unknown what specific values). Something like this:
cover some_event {
item X : uint = some_uint using no_collect;
transition X using when = (prev_X != X);
};
** The code causes compilation error
Is it possible to define such coverage in Specman?
Thank you for your help
what you wrote is almost accurate, but instead of "when" - use "ignore"
cover some_event is {
item X : uint = some_uint using no_collect;
transition X using ignore = (prev_X == X);
};
Hi I use imread function in matlab to read an image. But now if i want to use this image in a mex file what should I do?
I used im2double function to the image and then passed it to the mex file but the results i get are not acceptable.
so is there any other function which can be used
Matlab has different data types, for example uint8, in32 and, most commonly, double. You can see those in the workspace or by using the whos command.
These data types are quite similar to the ones in C(++). So for example, we could write this piece of c++/mex code printint.cpp:
#include <matrix.h>
#include <mex.h>
void mexFunction(int nargout, mxArray *argout[], int nargin, const mxArray *argin[]) {
int nCols = mxGetDimensions(argin[0])[1];
int nRows = mxGetDimensions(argin[0])[0];
int *data = (int *)mxGetData(argin[0]);
for(int row=0; row<nRows; ++row) {
for(int col=0; col<nCols-1; ++col) {
mexPrintf("%3d, ", data[nRows*col + row]);
}
mexPrintf("%3d\n", data[nRows*(nCols-1) + row]);
}
}
This code simply prints all the numbers in a matrix, but it assumes that the elements in the matrix are integers. That is, the mxGetData returns a void pointer, which may point to any kind of data, also doubles. I cast the void pointer to an integer pointer, which is tricky but allowed in C(++).
This does however allow me to run printint on doubles, or any kind of matrix for that matter:
>> doubles=randi(10,[2,5])
doubles =
8 3 1 7 10
1 1 9 4 1
>> whos('doubles')
Name Size Bytes Class Attributes
doubles 2x5 80 double
>> printint(doubles)
0, 0, 0, 0, 0
1075838976, 1072693248, 1074266112, 1072693248, 1072693248
One can see that the output is garbage, I have interpreted doubles as if they were integers. This may also cause segmentation faults, namely if the actual sizeof of the data one inputs is smaller than the sizeof of an int; like uint8. If one inputs integers, the method does work:
>> integers = int32(doubles)
integers =
8 3 1 7 10
1 1 9 4 1
>> whos('integers')
Name Size Bytes Class Attributes
integers 2x5 40 int32
>> printint(integers)
8, 3, 1, 7, 10
1, 1, 9, 4, 1
I typically assume a certain data type to be input, since I use mex to speed up certain parts of the code and don't want to spend time on creating flexible and fool proof API's. This comes at the cost of running into unexpected behavior when the input and assumed data types do not match. Therefore, one may consider programatically testing the data type that is input, in order to handle it correctly.
Your problem: In case you use I=imread(filename), Matlab returns a rows x cols x channels 3-D matrix of type uint8, which is equal to an unsigned char in C(++). In order to work with this data, one could write a simple class to wrap the image. This is done here:
#include <matrix.h>
#include <mex.h>
class MatlabImage {
public:
MatlabImage(const mxArray *data) {
this->data = (unsigned char*)mxGetData(data);
this->rows = mxGetDimensions(data)[0];
this->cols = mxGetDimensions(data)[1];
this->channels = mxGetDimensions(data)[2];
}
unsigned char *data;
int rows, cols, channels;
int get(int row, int col, int channel) {
return this->data[row + rows*(col + cols*channel)];
}
};
void mexFunction(int nargout, mxArray *argout[], int nargin, const mxArray *argin[]) {
MatlabImage image(argin[0]);
int row=316; int col=253;
mexPrintf("I(%d, %d) = (%3d, %3d, %3d)\n", row, col, image.get(row,col,0), image.get(row,col,1), image.get(row,col,2));
}
This can then be used in Matlab like this:
>> I = imread('workspace.png');
>> whos('I')
Name Size Bytes Class Attributes
I 328x344x3 338496 uint8
>> getpixel(I)
I(316, 253) = (197, 214, 233)
>> I(317,254,:)
ans(:,:,1) =
197
ans(:,:,2) =
214
ans(:,:,3) =
233
One can also work with doubles in a similar way. double(I) would convert the image to doubles, and im2double(I) does the same, but then divides all values by 255, such that they lie between 0 and 1. In both case one would need to cast the result of mxGetData(...) to (double *) instead of (unsigned char *).Finally note that C(++) starts indexing from 0 and Matlab from 1, so passing indexes from one to another always requires incrementing or decrementing them somewhere.
Convert to double and divided by 255. In Mex file Matlab double array is used to represent the image
The problem in general:
I have a big 2d point space, sparsely populated with dots.
Think of it as a big white canvas sprinkled with black dots.
I have to iterate over and search through these dots a lot.
The Canvas (point space) can be huge, bordering on the limits
of int and its size is unknown before setting points in there.
That brought me to the idea of hashing:
Ideal:
I need a hash function taking a 2D point, returning a unique uint32.
So that no collisions can occur. You can assume that the number of
dots on the Canvas is easily countable by uint32.
IMPORTANT: It is impossible to know the size of the canvas beforehand
(it may even change),
so things like
canvaswidth * y + x
are sadly out of the question.
I also tried a very naive
abs(x) + abs(y)
but that produces too many collisions.
Compromise:
A hash function that provides keys with a very low probability of collision.
Cantor's enumeration of pairs
n = ((x + y)*(x + y + 1)/2) + y
might be interesting, as it's closest to your original canvaswidth * y + x but will work for any x or y. But for a real world int32 hash, rather than a mapping of pairs of integers to integers, you're probably better off with a bit manipulation such as Bob Jenkin's mix and calling that with x,y and a salt.
a hash function that is GUARANTEED collision-free is not a hash function :)
Instead of using a hash function, you could consider using binary space partition trees (BSPs) or XY-trees (closely related).
If you want to hash two uint32's into one uint32, do not use things like Y & 0xFFFF because that discards half of the bits. Do something like
(x * 0x1f1f1f1f) ^ y
(you need to transform one of the variables first to make sure the hash function is not commutative)
Like Emil, but handles 16-bit overflows in x in a way that produces fewer collisions, and takes fewer instructions to compute:
hash = ( y << 16 ) ^ x;
You can recursively divide your XY plane into cells, then divide these cells into sub-cells, etc.
Gustavo Niemeyer invented in 2008 his Geohash geocoding system.
Amazon's open source Geo Library computes the hash for any longitude-latitude coordinate. The resulting Geohash value is a 63 bit number. The probability of collision depends of the hash's resolution: if two objects are closer than the intrinsic resolution, the calculated hash will be identical.
Read more:
https://en.wikipedia.org/wiki/Geohash
https://aws.amazon.com/fr/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/
https://github.com/awslabs/dynamodb-geo
Your "ideal" is impossible.
You want a mapping (x, y) -> i where x, y, and i are all 32-bit quantities, which is guaranteed not to generate duplicate values of i.
Here's why: suppose there is a function hash() so that hash(x, y) gives different integer values. There are 2^32 (about 4 billion) values for x, and 2^32 values of y. So hash(x, y) has 2^64 (about 16 million trillion) possible results. But there are only 2^32 possible values in a 32-bit int, so the result of hash() won't fit in a 32-bit int.
See also http://en.wikipedia.org/wiki/Counting_argument
Generally, you should always design your data structures to deal with collisions. (Unless your hashes are very long (at least 128 bit), very good (use cryptographic hash functions), and you're feeling lucky).
Perhaps?
hash = ((y & 0xFFFF) << 16) | (x & 0xFFFF);
Works as long as x and y can be stored as 16 bit integers. No idea about how many collisions this causes for larger integers, though. One idea might be to still use this scheme but combine it with a compression scheme, such as taking the modulus of 2^16.
If you can do a = ((y & 0xffff) << 16) | (x & 0xffff) then you could afterward apply a reversible 32-bit mix to a, such as Thomas Wang's
uint32_t hash( uint32_t a)
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
That way you get a random-looking result rather than high bits from one dimension and low bits from the other.
You can do
a >= b ? a * a + a + b : a + b * b
taken from here.
That works for points in positive plane. If your coordinates can be in negative axis too, then you will have to do:
A = a >= 0 ? 2 * a : -2 * a - 1;
B = b >= 0 ? 2 * b : -2 * b - 1;
A >= B ? A * A + A + B : A + B * B;
But to restrict the output to uint you will have to keep an upper bound for your inputs. and if so, then it turns out that you know the bounds. In other words in programming its impractical to write a function without having an idea on the integer type your inputs and output can be and if so there definitely will be a lower bound and upper bound for every integer type.
public uint GetHashCode(whatever a, whatever b)
{
if (a > ushort.MaxValue || b > ushort.MaxValue ||
a < ushort.MinValue || b < ushort.MinValue)
{
throw new ArgumentOutOfRangeException();
}
return (uint)(a * short.MaxValue + b); //very good space/speed efficiency
//or whatever your function is.
}
If you want output to be strictly uint for unknown range of inputs, then there will be reasonable amount of collisions depending upon that range. What I would suggest is to have a function that can overflow but unchecked. Emil's solution is great, in C#:
return unchecked((uint)((a & 0xffff) << 16 | (b & 0xffff)));
See Mapping two integers to one, in a unique and deterministic way for a plethora of options..
According to your use case, it might be possible to use a Quadtree and replace points with the string of branch names. It is actually a sparse representation for points and will need a custom Quadtree structure that extends the canvas by adding branches when you add points off the canvas but it avoids collisions and you'll have benefits like quick nearest neighbor searches.
If you're already using languages or platforms that all objects (even primitive ones like integers) has built-in hash functions implemented (Java platform Languages like Java, .NET platform languages like C#. And others like Python, Ruby, etc ).
You may use built-in hashing values as a building block and add your "hashing flavor" in to the mix. Like:
// C# code snippet
public class SomeVerySimplePoint {
public int X;
public int Y;
public override int GetHashCode() {
return ( Y.GetHashCode() << 16 ) ^ X.GetHashCode();
}
}
And also having test cases like "predefined million point set" running against each possible hash generating algorithm comparison for different aspects like, computation time, memory required, key collision count, and edge cases (too big or too small values) may be handy.
the Fibonacci hash works very well for integer pairs
multiplier 0x9E3779B9
other word sizes 1/phi = (sqrt(5)-1)/2 * 2^w round to odd
a1 + a2*multiplier
this will give very different values for close together pairs
I do not know about the result with all pairs