OCaml: Unicode string length

In OCaml, how can I compute the length of a string that may contain Unicode characters? To give an example, here is my problem:
utop # "\u{02227}";;
- : string = "∧"
utop # Caml.String.length "\u{02227}";;
- : int = 3
utop # Base.String.length "\u{02227}";;
- : int = 3
and I would like to obtain the obvious answer: 1.

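OCaml strings are byte sequences, so String.length returns the number of bytes, and "∧" (U+2227) takes three bytes in UTF-8. If you want to count the number of extended grapheme clusters (i.e. user-perceived characters), you can use uuseg. For instance: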
let len = Uuseg_string.fold_utf_8 `Grapheme_cluster (fun x _ -> x + 1) 0
let n = len "∧";;
returns
val n : int = 1
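To illustrate why the two counts differ, here is a minimal Python sketch (Python is used only for illustration; the helper names are hypothetical). It contrasts the UTF-8 byte length with the code point count, and shows that code points are still not the same as grapheme clusters, which is why a segmentation library such as uuseg is needed in general.

```python
import unicodedata

def utf8_byte_length(s: str) -> int:
    """Number of bytes in the UTF-8 encoding (what OCaml's String.length reports)."""
    return len(s.encode("utf-8"))

def code_point_length(s: str) -> int:
    """Number of Unicode code points (Python's len on str)."""
    return len(s)

wedge = "\u2227"        # "∧"
e_acute = "e\u0301"     # "e" followed by a combining acute accent

print(utf8_byte_length(wedge), code_point_length(wedge))
# One user-perceived character, but two code points until normalized to NFC.
print(code_point_length(e_acute), code_point_length(unicodedata.normalize("NFC", e_acute)))
```

Counting code points gives 1 for "∧", but a decomposed "é" still counts as 2, so code point counts only approximate what a reader sees as characters.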


How to generate arbitrary instances of a language given its concrete syntax in Rascal?

Given the concrete syntax of a language, I would like to define a function "instance" with signature str (type[&T]) that could be called with the reified type of the syntax and return a valid instance of the language.
For example, with this syntax:
lexical IntegerLiteral = [0-9]+;
start syntax Exp
= IntegerLiteral
| bracket "(" Exp ")"
> left Exp "*" Exp
> left Exp "+" Exp
;
A valid return of instance(#Exp) could be "1+(2*3)".
The reified type of a concrete syntax definition does contain information about the productions, but I am not sure if this approach is better than a dedicated data structure. Any pointers on how I could implement it?
The most natural thing is to use the Tree data-type from the ParseTree module in the standard library. It is the format that the parser produces, but you can also use it yourself. To get a string from the tree, simply print it in a string like so:
str s = "<myTree>";
A relatively complete random tree generator can be found here: https://github.com/cwi-swat/drambiguity/blob/master/src/GenerateTrees.rsc
The core of the implementation is this:
Tree randomChar(range(int min, int max)) = char(arbInt(max + 1 - min) + min);
Tree randomTree(type[Tree] gr)
= randomTree(gr.symbol, 0, toMap({ <s, p> | s <- gr.definitions, /Production p:prod(_,_,_) <- gr.definitions[s]}));
Tree randomTree(\char-class(list[CharRange] ranges), int rec, map[Symbol, set[Production]] _)
= randomChar(ranges[arbInt(size(ranges))]);
default Tree randomTree(Symbol sort, int rec, map[Symbol, set[Production]] gr) {
p = randomAlt(sort, gr[sort], rec);
return appl(p, [randomTree(delabel(s), rec + 1, gr) | s <- p.symbols]);
}
default Production randomAlt(Symbol sort, set[Production] alts, int rec) {
int w(Production p) = rec > 100 ? p.weight * p.weight : p.weight;
int total(set[Production] ps) = (1 | it + w(p) | Production p <- ps);
r = arbInt(total(alts));
count = 0;
for (Production p <- alts) {
count += w(p);
if (count >= r) {
return p;
}
}
throw "could not select a production for <sort> from <alts>";
}
It is a simple recursive function which randomly selects productions from a reified grammar.
The trick towards termination lies in the weight of each rule. This is computed a priori, such that every rule has its own weight in the random selection. We take care to give the set of rules that lead to termination at least 50% chance of being selected (as opposed to the recursive rules) (code here: https://github.com/cwi-swat/drambiguity/blob/master/src/Termination.rsc)
Grammar terminationWeights(Grammar g) {
deps = dependencies(g.rules);
weights = ();
recProds = {p | /p:prod(s,[*_,t,*_],_) := g, <delabel(t), delabel(s)> in deps};
for (nt <- g.rules) {
prods = {p | /p:prod(_,_,_) := g.rules[nt]};
count = size(prods);
recCount = size(prods & recProds);
notRecCount = size(prods - recProds);
// at least 50% of the weight should go to non-recursive rules if they exist
notRecWeight = notRecCount != 0 ? (count * 10) / (2 * notRecCount) : 0;
recWeight = recCount != 0 ? (count * 10) / (2 * recCount) : 0;
weights += (p : p in recProds ? recWeight : notRecWeight | p <- prods);
}
return visit (g) {
case p:prod(_, _, _) => p[weight=weights[p]]
}
}
@memo
rel[Symbol,Symbol] dependencies(map[Symbol, Production] gr)
= {<delabel(from),delabel(to)> | /prod(Symbol from,[_*,Symbol to,_*],_) := gr}+;
Note that this randomTree algorithm will not terminate on grammars that are not "productive" (i.e. they have only a rule like syntax E = E;).
It can also generate trees that would be filtered by disambiguation rules; you can check for this by running the parser on a generated string and checking for parse errors. It can also generate ambiguous strings.
By the way, this code was inspired by the PhD thesis of Naveneetha Vasudevan of King's College, London.
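For readers who do not know Rascal, the same idea can be sketched in Python. This is not the drambiguity code: the grammar encoding, the function names and the depth limit of 20 are assumptions made for illustration. It computes weights so that non-recursive alternatives get at least half of the total, and squares the weights past the depth limit so that terminating alternatives become even more likely.

```python
import random

# Hypothetical toy grammar mirroring the Exp example: each nonterminal maps to
# a list of alternatives, and each alternative is a list of symbols.
GRAMMAR = {
    "Exp": [["Int"], ["(", "Exp", ")"], ["Exp", "*", "Exp"], ["Exp", "+", "Exp"]],
    "Int": [["Digit"], ["Digit", "Int"]],
    "Digit": [[d] for d in "0123456789"],
}

def dependencies(grammar):
    """Transitive closure of 'nonterminal A uses nonterminal B'."""
    deps = {nt: {s for alt in alts for s in alt if s in grammar}
            for nt, alts in grammar.items()}
    changed = True
    while changed:
        changed = False
        for nt, seen in deps.items():
            new = set()
            for d in seen:
                new |= deps[d]
            if not new <= seen:
                seen |= new
                changed = True
    return deps

def termination_weights(grammar):
    """Give non-recursive alternatives at least half of the total weight."""
    deps = dependencies(grammar)
    weights = {}
    for nt, alts in grammar.items():
        # An alternative is recursive if one of its symbols can lead back to nt.
        rec = [any(s in grammar and nt in deps[s] for s in alt) for alt in alts]
        n, n_rec = len(alts), sum(rec)
        n_non = n - n_rec
        w_non = (n * 10) // (2 * n_non) if n_non else 0
        w_rec = (n * 10) // (2 * n_rec) if n_rec else 0
        weights[nt] = [w_rec if r else w_non for r in rec]
    return weights

def generate(grammar, weights, symbol, rng, depth=0, limit=20):
    """Expand a symbol into a random string of terminals."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    ws = weights[symbol]
    if depth > limit:
        ws = [w * w for w in ws]  # squaring favours the heavier, terminating rules
    alt = rng.choices(grammar[symbol], weights=ws, k=1)[0]
    return "".join(generate(grammar, weights, s, rng, depth + 1, limit) for s in alt)

rng = random.Random(0)
w = termination_weights(GRAMMAR)
print(generate(GRAMMAR, w, "Exp", rng))
```

As with the Rascal version, this sketch assumes a productive grammar and ignores priorities and associativity, so generated strings may not parse to the tree that produced them.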

How to create a 16-byte array for a SipHash key in F#?

I am working on code where I need to hash values. SipHash seems like a great option.
let getSipHashValue (buffer:byte []) (key:byte []) =
match key.GetLength(0) with
| 16 -> SipHash24.Hash64(buffer, key)
| _ -> uint64(0)
Is there a way to pad the key to 16 bytes and make sure that it works?
I can use a word of exactly the right length as the key, but I would like to be able to use any word (that is shorter than 16 bytes) and just add some padding.
open System
open System.Text
let testKey : byte [] =
Encoding.UTF8.GetBytes "accumulativeness"
Console.WriteLine("Length: {0}", testKey.GetLength(0))
Is there a way to do that in F#?
I think I got it:
open System
open System.Text
let rec getPaddedBytes (s:string) =
let b = Encoding.UTF8.GetBytes s
match b.GetLength(0) with
| 16 -> b
| x when x < 16 -> getPaddedBytes (s + "0")
| _ -> b[0..15]
Console.WriteLine("Length: {0}", testKey.GetLength(0))
let testBytes = getPaddedBytes "accum"
let testString = Encoding.UTF8.GetString testBytes
Console.WriteLine("X: {0}", testString)
I need to fix getting the first 16 bytes. Not sure about that syntax.
It's not clear from your question what you want the padding to look like, but you can pad with zero bytes up to the next multiple of 16 with:
let getPadding (bs: byte[]): byte[] =
let rem = bs.Length % 16
let padBytes = if rem = 0 then 0 else (16 - rem)
Array.zeroCreate padBytes
let pad (bs: byte[]): byte[] =
Array.append bs (getPadding bs)
which you can then use with:
let padded = pad testKey
printfn "Key length: %d" padded.Length
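If the key must be exactly 16 bytes, inputs longer than 16 bytes also need truncating, since padding to a multiple of 16 would turn a 20-byte key into 32 bytes. A minimal Python sketch of "truncate, then zero-pad" follows; the function name fit_key and the zero fill byte are assumptions, not part of any SipHash library.

```python
def fit_key(key: bytes, size: int = 16, fill: bytes = b"\x00") -> bytes:
    """Return a key of exactly `size` bytes: truncate long input, right-pad short input."""
    return key[:size].ljust(size, fill)

print(fit_key("accum".encode("utf-8")))
print(len(fit_key("accumulativeness".encode("utf-8"))))
```

The same two steps translate directly to F# with array slicing and Array.append.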

Does PureScript support “format strings” like C / Java etc.?

I need to output a number with leading zeros and as six digits. In C or Java I would use "%06d" as a format string to do this. Does PureScript support format strings? Or how would I achieve this?
I don't know of any module that would support a printf-style functionality in PureScript. It would be very nice to have a type-safe way to format numbers.
In the meantime, I would write something like this:
import Data.String (length, fromCharArray)
import Data.Array (replicate)
-- | Pad a string with the given character up to a maximum length.
padLeft :: Char -> Int -> String -> String
padLeft c len str = prefix <> str
where prefix = fromCharArray (replicate (len - length str) c)
-- | Pad a number with leading zeros up to the given length.
padZeros :: Int -> Int -> String
padZeros len num | num >= 0 = padLeft '0' len (show num)
| otherwise = "-" <> padLeft '0' len (show (-num))
Which produces the following results:
> padZeros 6 8
"000008"
> padZeros 6 678
"000678"
> padZeros 6 345678
"345678"
> padZeros 6 12345678
"12345678"
> padZeros 6 (-678)
"-000678"
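For comparison with the "%06d" behaviour mentioned in the question, here is a small Python mirror of padZeros (the snake_case names are just a transliteration). Note one difference: printf-style formatting counts the minus sign as part of the width, whereas padZeros pads the digits to the full width and then adds the sign.

```python
def pad_left(c: str, length: int, s: str) -> str:
    """Pad s on the left with character c up to the given length."""
    return c * max(0, length - len(s)) + s

def pad_zeros(length: int, num: int) -> str:
    """Pad a number with leading zeros; the sign is not counted in the width."""
    if num >= 0:
        return pad_left("0", length, str(num))
    return "-" + pad_left("0", length, str(-num))

print(pad_zeros(6, 678))      # 000678
print(pad_zeros(6, -678))     # -000678
print(format(-678, "06d"))    # -00678 (printf-style: sign counts toward the width)
```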
Edit: In the meantime, I've written a small module that can format numbers in this way:
https://github.com/sharkdp/purescript-format
For your particular example, you would need to do the following:
If you want to format Integers:
> format (width 6 <> zeroFill) 123
"000123"
If you want to format Numbers:
> format (width 6 <> zeroFill <> precision 1) 12.345
"0012.3"

When assigning the value of a MEMPTR to a LONGCHAR variable using GET-STRING, I get error 9324

When assigning the value of a MEMPTR to a LONGCHAR variable using GET-STRING, I get error 9324 (Attempt to exceed maximum size of a CHARACTER variable). Is there any solution?
You should use the COPY-LOB statement.
Since this is definitely not solved (COPY-LOB does not work), use the following hand-coded approach:
DEF VAR Z64 AS MEMPTR.
DEF VAR A AS LONGCHAR.
DEF VAR Z AS CHAR.
DEF VAR size64 AS INT.
SET-SIZE(Z64) = 200000. /* base64 is a function of vpxPrint */
RUN base64("e:/temp/XXXX.jpg", Z64, 10, OUTPUT size64). /* get the size */
MESSAGE "base64 string length" size64 VIEW-AS ALERT-BOX.
RUN base64("e:/temp/XXXX.jpg", Z64, 200000, OUTPUT size64).
DEF VAR i AS INT.
DEF VAR j AS INT.
DEF VAR lastSegment AS INT.
/* Segments of 30.000 bytes (PROGRESS limit) */
j = TRUNC(size64 / 30000, 0).
IF size64 MOD 30000 <> 0 THEN DO:
j = j + 1.
lastSegment = size64 MOD 30000.
END.
ELSE
lastSegment = 30000.
DO i = 1 TO j:
Z = GET-STRING(Z64, (i - 1) * 30000 + 1,
(IF i = j THEN lastSegment ELSE 30000)).
A = A + Z.
END.
SET-SIZE(z64) = 0.
/* LONGCHAR "A" contains the string!
===================================*/
Marcel FONDACCI
www.4GL.fr
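The intended segment arithmetic (full 30,000-byte segments plus one shorter final segment when the size is not an exact multiple) can be sketched in Python as follows. The function name segments is hypothetical, and offsets are 1-based to match GET-STRING.

```python
def segments(total: int, chunk: int = 30000):
    """Yield (offset, length) pairs covering `total` bytes, with 1-based offsets."""
    full, rest = divmod(total, chunk)
    for i in range(full):
        yield (i * chunk + 1, chunk)
    if rest:
        # Only add a final short segment when there is a remainder.
        yield (full * chunk + 1, rest)

print(list(segments(45000)))  # [(1, 30000), (30001, 15000)]
print(list(segments(60000)))  # [(1, 30000), (30001, 30000)]
```

Deciding on the remainder of the total size, rather than of the segment count, avoids a spurious zero-length final segment when the size is an exact multiple of the chunk size.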

Three boolean values saved in one tinyint

Probably a simple question, but I seem to be suffering from programmer's block. :)
I have three boolean values: A, B, and C. I would like to save the state combination as an unsigned tinyint (max 255) into a database and be able to derive the states from the saved integer.
Even though there are only a limited number of combinations, I would like to avoid hard-coding each state combination to a specific value (something like if A=true and B=true has the value 1).
I tried assigning values to the variables like so (A=1, B=2, C=3) and then adding them, but I can't distinguish A and B both being true from, e.g., only C being true.
I am stumped but pretty sure that it is possible.
Thanks
Binary maths, I think. Give each flag a value that's a power of 2 (1, 2, 4, 8, etc.), then you can use the bitwise AND operator & to test each one.
Say A = 1, B = 2 , C= 4
00000111 => A B and C => 7
00000101 => A and C => 5
00000100 => C => 4
then to determine them :
if( val & 4 ) // same as if (C)
if( val & 2 ) // same as if (B)
if( val & 1 ) // same as if (A)
if((val & 4) && (val & 2) ) // same as if (C and B)
No need for a state table.
Edit: to reflect comment
If the tinyint has a maximum value of 255 => you have 8 bits to play with and can store 8 boolean values in there
Binary math, as others have said.
encoding:
myTinyInt = A*1 + B*2 + C*4 (assuming you convert A,B,C to 0 or 1 beforehand)
decoding:
bool A = (myTinyInt & 1) != 0 (& is the bitwise and operator in many languages; the parentheses matter because != usually binds tighter than &)
bool B = (myTinyInt & 2) != 0
bool C = (myTinyInt & 4) != 0
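The encode and decode steps can be checked with a small round-trip sketch in Python. The names FLAG_A, FLAG_B, FLAG_C, encode and decode are hypothetical and only illustrate the bit arithmetic.

```python
# Each flag occupies its own bit, so any combination sums to a unique value.
FLAG_A, FLAG_B, FLAG_C = 1 << 0, 1 << 1, 1 << 2

def encode(a: bool, b: bool, c: bool) -> int:
    """Pack three booleans into one small integer (0..7)."""
    return (FLAG_A if a else 0) | (FLAG_B if b else 0) | (FLAG_C if c else 0)

def decode(value: int):
    """Recover the three booleans from the packed integer."""
    return bool(value & FLAG_A), bool(value & FLAG_B), bool(value & FLAG_C)

print(encode(True, False, True))  # 5
print(decode(5))                  # (True, False, True)
```

Because the values are distinct powers of two, A and B together give 3 while C alone gives 4, which removes the ambiguity of the 1, 2, 3 scheme from the question.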
I'll add that you should find a way to not use magic numbers. You can build masks into constants using the Left Logical/Bit Shift with a constant bit position that is the position of the flag of interest in the bit field. (Wow... that makes almost no sense.) An example in C++ would be:
enum Flags {
kBitMask_A = (1 << 0),
kBitMask_B = (1 << 1),
kBitMask_C = (1 << 2),
};
uint8_t byte = 0; // byte = 0b00000000
byte |= kBitMask_A; // Set A, byte = 0b00000001
byte |= kBitMask_C; // Set C, byte = 0b00000101
if (byte & kBitMask_A) { // Test A, (0b00000101 & 0b00000001) = T
byte &= ~kBitMask_A; // Clear A, byte = 0b00000100
}
In any case, I would recommend looking for Bitset support in your favorite programming language. Many languages will abstract the logical operations away behind normal arithmetic or "test/set" operations.
Need to use binary...
A = 1,
B = 2,
C = 4,
D = 8,
E = 16,
F = 32,
G = 64,
H = 128
This means A + B = 3 but C = 4. You'll never have two conflicting values. I've listed the maximum you can have for a single byte: 8 values (bits).