ICU: Which compare API to use?

I read the documentation on the different compare APIs that ICU provides, but couldn't quite get the difference between them.
int8_t icu::UnicodeString::compare(const UnicodeString &text) const

int8_t icu::UnicodeString::caseCompare(
    int32_t start,
    int32_t length,
    const UChar *srcChars,
    int32_t srcStart,
    int32_t srcLength,
    uint32_t options
)

virtual EComparisonResult icu::Collator::compare(
    const UnicodeString &source,
    const UnicodeString &target
)
To be able to do case-insensitive operations on UTF-16 strings, which API fits the bill, and why?
Thanks!

From the docs:
UnicodeString::compare — this is a bitwise (exact) compare, so A ≠ a.
UnicodeString::caseCompare — this is probably what you want to use, but keep reading: A = a, ß = ss, and so on. There is an online demo you can use to play with the comparison.
Collator - this is for locale-sensitive collation, which is a different tool. Yes, you can do case-insensitive comparison with the right options, but it also does more powerful comparisons, such as black-bird = BlackBird.
Hope this helps.
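For a quick hands-on check, here is a minimal sketch using ICU4C's plain C API (the C++ UnicodeString::caseCompare wraps the same case-folding comparison); the demo strings are mine:

#include <stdio.h>
#include <unicode/uchar.h>
#include <unicode/ustring.h>
#include <unicode/utypes.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar a[8];
    UChar b[8];
    u_uastrcpy(a, "STRASSE");  /* convert ASCII demo input to UTF-16 */
    u_uastrcpy(b, "strasse");
    /* Returns 0 when the strings are equal under default case folding */
    int32_t r = u_strCaseCompare(a, -1, b, -1, U_FOLD_CASE_DEFAULT, &status);
    printf("caseCompare result: %d\n", (int) r);
    return 0;
}

Link against ICU's common library (e.g. -licuuc).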

Related

In DPI-C, how to map data types to reg or wire

I am writing a CRC16 function in C to use in SystemVerilog.
The requirements are as follows:
Output of CRC16 has 16 bits
Input of CRC16 is wider than 72 bits
The difficulty is that I don't know whether DPI-C can map the reg/wire data types in SystemVerilog to C or not, and what the maximum length of a reg/wire is that DPI-C can support.
Can anybody help me?
Stay with compatible types across the language boundary. For the output, use shortint. For the input, use an array of byte in SystemVerilog, which maps to an array of char in C (a sketch follows below).
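A minimal sketch of that mapping, assuming the input is exactly 72 bits (9 bytes); the import declaration, function name, polynomial, and init value are all mine, for illustration:

#include <stdint.h>

/* Assumed SV-side declaration (hypothetical):
 *   import "DPI-C" function shortint crc16_dpi(input byte data[9]);
 * A fixed-size unpacked array of byte arrives as a plain char array in C,
 * and shortint maps to short (see the LRM Annex H type correspondence). */
short crc16_dpi(const char data[9])
{
    uint16_t crc = 0xFFFF;  /* illustrative CRC-16/CCITT-style init value */
    for (int i = 0; i < 9; i++) {
        crc ^= (uint16_t) (((uint8_t) data[i]) << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t) ((crc << 1) ^ 0x1021)
                                 : (uint16_t) (crc << 1);
    }
    return (short) crc;
}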
DPI support has provision for any bit width, converting packed arrays into C arrays. The question is: what are you going to do with 72-bit data on the C side?
But svBitVecVal for two-state bit and svLogicVecVal for four-state logic can be used on the C side to retrieve the values. Look at H.7.6/7 of the LRM for more info.
Here is an example from LRM H.10.2 for 4-state data (logic):
SystemVerilog:
typedef struct {int x; int y;} pair;
import "DPI-C" function void f1(input int i1, pair i2, output logic [63:0] o3);
C:
void f1(const int i1, const pair *i2, svLogicVecVal *o3)
{
    int tab[8];
    printf("%d\n", i1);
    o3[0].aval = i2->x;  /* aval carries the data bits */
    o3[0].bval = 0;      /* bval = 0 means no X/Z in this word */
    o3[1].aval = i2->y;
    o3[1].bval = 0;
    ...
}

Hash function for 8 / 16 bit "graphics" on 8 bit processor

For an implementation of coherent noise (similar to Perlin noise), I'm looking for a hash function suitable for graphics.
I don't need it to be in any way cryptographic, and really, I don't even need it to be a super brilliant hash.
I just want to combine two 16-bit numbers and output an 8-bit hash. As random as possible is good, but it should also be fast on an AVR processor (8-bit, as used by Arduino).
Currently I'm using this implementation:
#include <stdint.h>

uint32_t hash(uint32_t a)
{
    a -= (a << 6);
    a ^= (a >> 17);
    a -= (a << 9);
    a ^= (a << 4);
    a -= (a << 3);
    a ^= (a << 10);
    a ^= (a >> 15);
    return a;
}
But given that I'm truncating all but 8 bits, and I don't need anything spectacular, can I get away with something using fewer instructions?
… I'm inspired in this search by the lib8tion library that's packaged with FastLED. It has specific functions to, for example, multiply two uint8_t numbers to give a uint16_t number in the fewest possible clock cycles.
Check out Pearson hashing:
unsigned char hash(unsigned short a, unsigned short b)
{
    static const unsigned char t[256] = {...};
    return t[t[t[t[a & 0xFF] ^ (b & 0xFF)] ^ (a >> 8)] ^ (b >> 8)];
}
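One detail the elided initializer hides: t[] must contain a permutation of 0..255 for Pearson hashing to mix well. A sketch of one way to fill it at startup (the function name and the use of rand() are mine):

#include <stdlib.h>

static unsigned char t[256];

void init_pearson_table(unsigned seed)
{
    for (int i = 0; i < 256; i++)
        t[i] = (unsigned char) i;   /* start from the identity permutation */
    srand(seed);
    for (int i = 255; i > 0; i--) { /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        unsigned char tmp = t[i];
        t[i] = t[j];
        t[j] = tmp;
    }
}

On an AVR you would typically generate the table once offline and paste it in as a constant (ideally in PROGMEM) rather than shuffling at runtime.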

How can I make a good hash function without unsigned integers?

I'm looking for a simple hash function that doesn't rely on integer overflow and doesn't rely on unsigned integers.
The problem is that I have to create the hash function in Blueprint from Unreal Engine (which only has signed 32-bit integers, with undefined overflow behavior) and in PHP5, with a version that uses 64-bit signed integers.
So when I use the 'common' simple hash functions, they don't give the same result on both platforms, because they all rely on the bit-overflow behavior of unsigned integers.
The only thing that is really important is that it has good 'randomness'. Does anyone know something simple that would accomplish this?
It's meant for a very basic signing system for sending messages to a server. It doesn't need to be top security... it's for storing high scores of a simple game on a server. The idea is that I would generate several hash-integers from the message (using different 'start numbers') and append them to make a hash-signature. I just need to make sure that if people sniff the network messages sent to the server, they cannot easily send faked messages. They would need to provide the correct hash-signature with their message, which they shouldn't be able to do unless they know the hash function being used. Of course, if they reverse engineer the game they can still 'hack' it, but I wouldn't know how to counter that...
I have no access to existing hash functions in the Unreal Engine Blueprint system.
The first thing I would try would be to simulate the behavior of unsigned integers using signed integers, by explicitly applying the modulo operator whenever the accumulated hash-value gets large enough that it might risk overflowing.
Example code in C (apologies for the poor hash function, but the same technique should be applicable to any hash function, at least in principle):
#include <stdio.h>
#include <string.h>

int hashFunction(const char *buf, int numBytes)
{
    const int multiplier = 33;
    const int maxAllowedValue = 2147483647 - 255;  /* assuming 32-bit ints here */
    const int maxPreMultValue = maxAllowedValue / multiplier;
    int hash = 536870912;  /* arbitrary starting number */
    for (int i = 0; i < numBytes; i++)
    {
        hash = hash % maxPreMultValue;  /* make sure hash cannot overflow in the next operation! */
        hash = (hash * multiplier) + buf[i];
    }
    return hash;
}

int main(int argc, char **argv)
{
    while (1)
    {
        printf("Enter a string to hash:\n");
        char buf[1024];
        fgets(buf, sizeof(buf), stdin);
        printf("Hash code for that string is: %i\n", hashFunction(buf, strlen(buf)));
    }
}

Can ancillary data be portably allocated?

IEEE Std 1003.1-2008's <sys/socket.h> section doesn't provide the CMSG_SPACE or CMSG_LEN macros, and instead merely says:
Ancillary data consists of a sequence of pairs, each consisting of a cmsghdr structure followed by a data array.
Is there a portable way to allocate ancillary data without CMSG_SPACE, or to attach ancillary data to a message without CMSG_LEN? That quote suggests to me that a single buffer of size (sizeof(struct cmsghdr) + sizeof data) * nr_of_pairs (where data may change per pair, of course) would suffice, with each individual cmsghdr.cmsg_len = sizeof(struct cmsghdr) + sizeof data and msg.msg_controllen = (sizeof(struct cmsghdr) + sizeof data) * nr_of_pairs, but all of the system-specific documentation for CMSG_SPACE/CMSG_LEN suggests that there are alignment issues that may get in the way of this.
OK, so from what I can tell, my guess as to how to allocate wouldn't work in general (I couldn't get it to work on Linux; I had to use CMSG_SPACE/CMSG_LEN instead). Based on the diagram in section 4.2 of RFC 2292, I came up with the following definitions of CMSG_SPACE and CMSG_LEN that I think should be portable to conforming implementations of IEEE Std 1003.1-2008:
#include <stddef.h>
#include <sys/socket.h>

#ifndef CMSG_LEN
socklen_t CMSG_LEN(size_t len)
{
    return (CMSG_DATA((struct cmsghdr *) NULL) - (unsigned char *) NULL) + len;
}
#endif

#ifndef CMSG_SPACE
socklen_t CMSG_SPACE(size_t len)
{
    struct msghdr msg;
    struct cmsghdr cmsg;
    msg.msg_control = &cmsg;
    msg.msg_controllen = ~0ULL; /* To maximize the chance that CMSG_NXTHDR won't return NULL */
    cmsg.cmsg_len = CMSG_LEN(len);
    return (unsigned char *) CMSG_NXTHDR(&msg, &cmsg) - (unsigned char *) &cmsg;
}
#endif
Obviously this should be done with macros, but I think this shows the idea. This seems really hacky to me and, due to possible size checks in CMSG_NXTHDR, can't be shoved into a compile-time constant, so the next version of POSIX should probably define CMSG_SPACE and CMSG_LEN, since any program using ancillary data has to use them anyway.
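For reference, here is a sketch of how the real macros are used once available: passing one file descriptor over a UNIX-domain socket via SCM_RIGHTS (the function name is mine, and using CMSG_SPACE as an array size assumes it is a constant expression, which holds on common systems):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd)
{
    char dummy = 'x';  /* must send at least one byte of ordinary data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr hdr;                /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))]; /* room for one header + one int */
    } u;
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}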

Are int32s signed or unsigned in OSC (or is it unspecified?)

The OSC Specification, version 1.0 specifies the "int32" data type as "32-bit big-endian two's complement integer". This implies that it's signed (otherwise, why would you write "two's complement"...), but it doesn't come right out and say it.
This comes up most clearly in the encoding of blobs: should it be legal to have a blob of length #x90000000? This number can be encoded as an unsigned 32-bit integer, but not as a signed 32-bit integer. I grant you, that's an extremely big blob (more than 2 gigabytes).
The specification gives you no more details. I checked the code of the C++ OSC implementation I use (oscpack), and int32 is defined as:
typedef signed long int32;
the blob is defined as:
struct Blob {
    Blob() {}
    explicit Blob(const void *data_, unsigned long size_)
        : data(data_), size(size_) {}

    const void *data;
    unsigned long size;
};
So yes, it's a signed integer for the "atomic" int32 type.
The blob, on the other hand, has its size stored as an unsigned long, so it can probably be larger. You may have to try it first, because I only have the oscpack implementation here.
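To make the blob-length edge case concrete, here is a small illustration (mine, not from the spec) of what a length of #x90000000 looks like once a reader stores it in a signed 32-bit integer:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t raw = 0x90000000u;        /* the hypothetical blob length */
    int32_t as_signed = (int32_t) raw; /* two's-complement reinterpretation */
    printf("%" PRIu32 " as int32: %" PRId32 "\n", raw, as_signed);
    /* prints: 2415919104 as int32: -1879048192 */
    return 0;
}

So a decoder that uses a signed int32 for the size would see a negative length and would have to reject the blob.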
