The following is very slow for long strings:
std::string s = "long string";
K klist = DBVec::CreateList(KG, s.length());
for (int i = 0; i < s.length(); i++)
{
    kG(klist)[i] = s.c_str()[i];
}
It works acceptably fast (under 100 ms) for strings up to 100k characters, but slows to a crawl (tens of minutes, possibly hours) for strings of a few million characters. I don't see anything other than kG that could create nonlinearity; I see no reason for the accessor function kG to be non-constant-time, but there is just nothing else in this loop. Unfortunately, I don't know how kG works due to the lack of documentation.
Question: given a blob of binary data as std::string, what's the efficient way to construct a byte list?
kG is a macro defined in k.h which expands to ((x)->G0), i.e. it follows the G0 pointer of the K object.
http://kx.com/q/d/a/c.htm#Strings documents kp, which creates a K string object directly from a C string, so presumably you could write K klist = kp(s.c_str()), which is probably faster.
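One caveat: kp expects a null-terminated C string, so binary data containing embedded '\0' bytes would be truncated. k.h also has kpn, which takes an explicit length; note that kp and kpn both build a char list (type KC), not a KG byte list. A hedged one-liner, assuming the standard kpn(S, J) signature from k.h:

K klist = kpn(const_cast<char *>(s.data()), static_cast<J>(s.length())); // char list of exactly s.length() bytes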
This works:
memcpy(kG(klist), s.c_str(), s.length());
I still wonder why that loop is not O(N), though.
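Putting it together, a minimal sketch of the fast construction. Hedged: it uses the standard k.h allocator ktn(KG, n); the question's DBVec::CreateList(KG, n) presumably wraps something equivalent, and byte_list_from_string is just an illustrative helper name.

#include <cstring>
#include <string>
#include "k.h" // kdb+ C API: K, KG, J, kG, ktn

// Build a KG byte list from arbitrary binary data with one bulk copy.
K byte_list_from_string(const std::string &s)
{
    K klist = ktn(KG, static_cast<J>(s.length())); // allocate the byte vector
    std::memcpy(kG(klist), s.data(), s.length());  // O(N) bulk copy instead of a per-byte loop
    return klist;
}

Unlike kp, this preserves embedded '\0' bytes and yields a byte list rather than a char list.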
If one sets the hash table seed during resize or table creation to a random number, will that prevent DDoS attacks on such a hash table, or, knowing the hash algorithm, will the attacker still easily get around the seed? What if the algorithm uses the Pearson hash function with randomly generated tables unknown to the attacker? Does such a table hash still need a seed, or is it safe enough?
Context: I want to use an on-disk hash table for a key-value database for my toy web server, where the keys may depend on user input.
Several approaches exist to protect your hash subsystem from "adverse selection" attacks; the most popular of them is called universal hashing, where the hash function, or some property of it, is randomly selected at initialization.
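For illustration only (this is not the answerer's code): a minimal C++ sketch of one classic universal family, multiply-shift, where the randomly chosen member of the family is fixed at initialization.

#include <cstdint>
#include <random>

// Multiply-shift hashing: pick a random odd 64-bit multiplier once at
// startup. An attacker who knows the algorithm still cannot predict
// which member of the family is in use.
struct MultiplyShiftHash {
    uint64_t a; // random odd multiplier, fixed at initialization
    explicit MultiplyShiftHash(std::mt19937_64 &rng) : a(rng() | 1) {}
    uint32_t operator()(uint64_t key) const {
        return static_cast<uint32_t>((a * key) >> 32); // top 32 bits of the product
    }
};

// Usage sketch: seed from std::random_device (or /dev/urandom, as below).
// std::mt19937_64 rng{std::random_device{}()};
// MultiplyShiftHash h(rng);
// size_t bucket = h(key) % table_size;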
In my own approach, I use a single hash function in which each character is added to the result with non-linear mixing that depends on a random array of uint32_t[256]. The array is created during system initialization; in my code this happens at each start, by reading /dev/urandom. See my implementation in the open-source emerSSL program. You're welcome to borrow the entire hash-table implementation, or just the hash function.
Currently, the hash function in the referenced source computes two independent hashes for the double-hashing search algorithm.
Here is a "reduced" form of the hash function from the source, to demonstrate the idea of non-linear mixing with the S-block array:
uint32_t S_block[0x100]; // substitution block, filled with random contents

#define NLF(h, c) (S_block[(unsigned char)((c) + (h))] ^ (c)) // non-linear mixing step
#define ROL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))        // rotate left by n bits

uint32_t hash(const char *key) {
    uint32_t h = 0x1F351F35; // Barker code * 2
    char c;
    for (int i = 0; (c = key[i]) != '\0'; i++) {
        h = ROL(h, 5);
        h += NLF(h, c);
    }
    return h;
}
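The random-table initialization described above is not shown in the snippet. Here is a minimal sketch of the idea; init_S_block is an illustrative name, and the actual emerSSL code may differ.

#include <cstdint>
#include <cstdio>
#include <cstdlib>

extern uint32_t S_block[0x100]; // the substitution block from the snippet above

// Fill S_block from /dev/urandom once at each start, as described above.
void init_S_block(void)
{
    std::FILE *f = std::fopen("/dev/urandom", "rb");
    if (!f || std::fread(S_block, sizeof S_block[0], 0x100, f) != 0x100) {
        std::perror("init_S_block");
        std::exit(EXIT_FAILURE); // refuse to run with an unseeded table
    }
    std::fclose(f);
}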
I have the following Cython function.
01:
+02: cdef int count_char_in_x(unicode x,Py_UCS4 c):
03: cdef:
+04: int count = 0
05: Py_UCS4 x_k
06:
+07: for x_k in x: ## Yellow
+08: if x_k == c:
+09: count+=1
10:
+11: return count
Line 07 is not properly optimized.
The annotated HTML code expands to:
+07: for x_k in x: ## Yellow
if (unlikely(__pyx_v_x == Py_None)) {
PyErr_SetString(PyExc_TypeError, "'NoneType' is not iterable");
__PYX_ERR(0, 8, __pyx_L1_error)
}
__Pyx_INCREF(__pyx_v_x);
__pyx_t_1 = __pyx_v_x;
__pyx_t_6 = __Pyx_init_unicode_iteration(__pyx_t_1, (&__pyx_t_3), (&__pyx_t_4), (&__pyx_t_5)); if (unlikely(__pyx_t_6 == ((int)-1))) __PYX_ERR(0, 8, __pyx_L1_error)
for (__pyx_t_7 = 0; __pyx_t_7 < __pyx_t_3; __pyx_t_7++) {
__pyx_t_2 = __pyx_t_7;
__pyx_v_x_k = __Pyx_PyUnicode_READ(__pyx_t_5, __pyx_t_4, __pyx_t_2);
Any tips on how could this be improved?
I think it is possible to write a cdef/cpdef function that at runtime completely avoids Python None type checks. Any ideas on how this could be done?
The generated C code looks pretty good to me. The loop overall is an int-iterated C for loop (i.e. it isn't relying on calling the Python methods __iter__ and __next__).
__Pyx_PyUnicode_READ is translated pretty directly to PyUnicode_READ (depending slightly on the Python version you're using). PyUnicode_READ is a C macro which is as close to a direct array access as you can get.
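For reference, PyUnicode_READ is roughly the following dispatch on the string's storage kind. This is a paraphrase of the real CPython macro, not a verbatim copy, and unicode_read is an illustrative name:

#include <Python.h> // Py_UCS1/2/4, PyUnicode_*_KIND

// Rough equivalent of PyUnicode_READ(kind, data, index): pick the element
// width once ("kind"), then index the character buffer directly.
static inline Py_UCS4 unicode_read(int kind, const void *data, Py_ssize_t i)
{
    switch (kind) {
    case PyUnicode_1BYTE_KIND: return ((const Py_UCS1 *)data)[i];
    case PyUnicode_2BYTE_KIND: return ((const Py_UCS2 *)data)[i];
    default:                   return ((const Py_UCS4 *)data)[i]; // PyUnicode_4BYTE_KIND
    }
}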
This is probably as good as it's getting. You might get a small improvement by using bytes rather than unicode (provided you're dealing with ASCII characters). You might just consider whether it's really worth reimplementing unicode.count.
If it were a regular def function you could declare x as unicode not None to remove the None check before the loop. That might make a small difference. However, as @ead points out, that isn't supported for cdef functions. It's likely the cost of a def function call will be slightly larger than the cost of a None check, but you should time it if you care.
I want to create a CoffeeScript range (like [4...496]) but using a length instead of an end range. This can be done with a loop like
myNum = getBigNumber()
newArray = ( n + myNum for n in [0...50] )
but I'm wondering if there is a range-related shortcut that I'm missing. Is there something like
[getBigNumber()...].length(50) available in CoffeeScript?
You can just do
range = [myNum...myNum + 50]
Edit: As mu points out in the comments, CoffeeScript will add some complexity whether you use the snippet above or the original code. If performance is an issue, it might be better to drop down to plain JS for the loop (using backticks in the CoffeeScript code).
Assuming you want an ascending (i.e. low to high) range, you can do:
myNum = getBigNumber()
length = 50
range = new Array length
i = 0
`for(; i < length ; i++) { range[i] = i + myNum }` # raw, escaped JS
It's a lot faster than CoffeeScript's way of doing things, but note that CoffeeScript's range syntax also supports creating descending ranges by just flipping the boundary values. So CoffeeScript is (as always) easier on the eyes and simpler to work with, but raw JS is 3.5x faster in my test.
I have a hashing function and I want to know if it runs in constant time. Since the length of the array word is constant, does that mean the function is constant in Big O notation?
public int hash(String s) {
    if (s.length() > 7)
        return -1;
    // word is a String[] field of constant length, defined elsewhere
    for (int i = 0; i < word.length; ++i) {
        if (word[i].compareTo(s) == 0)
            return i;
    }
    return -1;
}
Since the length of the array word is constant, does that mean the function is constant in Big O notation?
Big O is used to describe how the run time or memory consumption of a process grows as its input grows. If your array is of constant length, then it will not grow with the input and so has no effect. Therefore, in this context you can consider hash() to run in O(1), assuming that the string comparisons are done in roughly constant time.
One way to think about it would be to say that since the length of the array is not variable, it should always be possible to "unroll" that loop so as to have a fixed number of O(1) comparisons one after the other, which all-in-all will still be O(1). Again, this presumes that the time taken to compare the strings is also constant (which in reality may not be the case if you have very large strings of varying lengths). Of course, if you know that the contents of the array will also be constant in addition to its length, then you can say for certain that the function will be O(1).
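To make the unrolling argument concrete, here is a hedged C++ sketch (the question's code is Java, and the table contents here are made up):

#include <string>

// Hypothetical fixed table standing in for the constant-length `word` array.
const std::string word[3] = {"foo", "bar", "baz"};

// The 3-iteration loop "unrolled": a fixed number of comparisons, so the
// whole function is O(1) as long as the table (and its strings) are fixed.
int hash_unrolled(const std::string &s)
{
    if (word[0] == s) return 0;
    if (word[1] == s) return 1;
    if (word[2] == s) return 2;
    return -1;
}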
The time required to compare two strings of lengths m and n is O(min{m, n} + 1). Suppose that k is the length of the word array, that m is the length of the longest word in word, and that n is the length of the input string. In that case, the function does O(k) string comparisons, each of which takes time O(min{m, n} + 1), for a total runtime of O(k min{m, n} + k).
Now, since k is known to be a constant, we can simplify this and say that the runtime will be O(min{m, n} + 1). If all of the strings in word are fixed constants, then m is a constant as well, so O(min{m, n} + 1) = O(1) and your hash function runs in constant time. Otherwise, if they can be unboundedly long, the only thing you can claim is that the runtime is O(min{m, n} + 1).
Hope this helps!
This function is O(1) if word is constant.
s.length() runs in constant time regardless of the length of s.
The time it takes to run word[i].compareTo(s) is bounded by the length of word[i]. As long as word doesn't change, this means there is an upper bound for the time it takes to run the entire for loop.
So there's an upper bound on the time this function takes to run, and the function is O(1).
If word can change, I believe this function would be O(n) where n is the size of word. However, if the elements of word have increasing lengths, word[i].compareTo(s) will be bounded by larger and larger numbers, so the length of s might begin to matter. Perhaps the complexity is actually O(n^2). I don't know, and now I'm curious myself.
Your function has complexity O(N^2), as it has two inputs:
s - your string (length N1)
word - the array (length N2)
So the complexity is O(N1 * N2), which can be simplified to O(N^2) if both inputs are of size N.
If N2 is really constant, then the function has complexity O(N1) in the worst case.
If N1 is also constant, then we have O(1) complexity.
The function arithenco needs the input message to be a sequence of positive integers. Hence, I need to convert a message into a sequence of numbers, message_int, by using the following mapping.
‘A’→1, ‘C’→2, ‘G’→3, ‘T’→4.
From what I understand, the alphabet you are using contains only four values: A, C, G, T (DNA sequences, I suppose).
A simple comparison would suffice:
seq = 'TGGAGGCCCACAACCATTCCCTCAGCCCAATTGACCGAAAGGGCGCGA';
msg_int = zeros(size(seq));
msg_int(seq=='A') = 1;
msg_int(seq=='C') = 2;
msg_int(seq=='G') = 3;
msg_int(seq=='T') = 4;
Oh, I just reread your question: your mapping is not so simple. Sorry.
(Since darvidsOn wrote the same, I won't delete this answer; it might give you a start, but it doesn't answer your question completely.)
Have a look at http://www.matrixlab-examples.com/ascii-chart.html
You can use d = double('A') to convert a char into a double; you will then need to subtract 64 to get the mapping you want (because 'A' is ASCII code 65, so double('A') - 64 gives 1).