How can I work with a single byte and binary! byte arrays in Rebol 3?

In Rebol 2, it is possible to use to char! to produce what is effectively a single byte, that you can use in operations on binaries such as append:
>> buffer: #{DECAFBAD}
>> data: #{FFAE}
>> append buffer (to char! (first data))
== #{DECAFBADFF}
Seems sensible. But in Rebol 3, you get something different:
>> append buffer (to char! (first data))
== #{DECAFBADC3BF}
That's because it doesn't model single characters as single bytes (due to Unicode). So the integer value of first data (255) is translated into a two-byte sequence:
>> to char! 255
== #"ÿ"
>> to binary! (to char! 255)
== #{C3BF}
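For comparison, any UTF-8 encoder does the same two-byte expansion; a quick Python illustration (nothing Rebol-specific, just the encoding rule):
# U+00FF is one code point, but UTF-8 encodes it as two bytes
assert chr(255).encode("utf-8") == b"\xc3\xbf"   # C3 BF, matching #{C3BF}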
Given that CHAR! is no longer equivalent to a byte in Rebol 3, and no BYTE! datatype was added (such that a BINARY! could be considered a series of these BYTE!s just as a STRING! is a series of CHAR!), what is one to do about this kind of situation?

Use an integer!, the closest match we have for expressing a byte in R3, at the moment.
Note that integers are range-checked when used as bytes in context of a binary!:
>> append #{} 1024
** Script error: value out of range: 1024
** Where: append
** Near: append #{} 1024
For your first example, you actually append one element of one series to another series of the same type. In R3 you can express this in the obvious and most straight-forward way:
>> append #{DECAFBAD} first #{FFAE}
== #{DECAFBADFF}
So for that matter, a binary! is a series of range-constrained integer!s.
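If a comparison outside Rebol helps, Python's bytes/bytearray types follow the same model (a sketch for illustration only): indexing yields an integer, and appended integers are range-checked.
buffer = bytearray(b"\xde\xca\xfb\xad")
data = b"\xff\xae"
buffer.append(data[0])            # data[0] is the integer 255
assert buffer == bytearray(b"\xde\xca\xfb\xad\xff")
# buffer.append(1024)             # would raise ValueError: byte must be in range(0, 256)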
Unfortunately, that won't work in R2, because its binary! model was just broken in many tiny ways, including the above. While conceptually a binary! in R2 can be considered a series of char!s, that concept is not consistently implemented.

>> buffer: #{DECAFBAD}
== #{DECAFBAD}
>> data: #{FFAE}
== #{FFAE}
>> append buffer data/1
== #{DECAFBADFF}

The easiest workaround for the issue in both Rebol 2 and Rebol 3 is to use COPY/PART. This way, rather than trying to reduce your content to a byte value, you keep it as a binary series type. You're merely appending a binary sequence of length 1:
>> append buffer (copy/part data 1)
== #{DECAFBADFF}
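The Python analogue of that workaround, again purely for comparison, is slicing, which keeps the value in the byte-series type instead of reducing it to a number:
buffer = bytearray(b"\xde\xca\xfb\xad")
data = b"\xff\xae"
buffer += data[:1]                # a length-1 bytes value, analogous to copy/part data 1
assert buffer == bytearray(b"\xde\xca\xfb\xad\xff")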
It does seem that, for completeness, Rebol 3 should have a BYTE! type, since calling a BINARY! a "series of INTEGER!" clearly does not match the precedent set by STRING!.
It's open source, so you (um, I?) might try adding it and seeing what the full ramifications are. :-)

Related

Get UTF-16 code unit at a given index in ABAP

I want to get the UTF-16 code unit at a given index in ABAP.
The same can be done in JavaScript with charCodeAt().
For example, "d".charCodeAt(); returns 100.
Is there similar functionality in ABAP?
This can be done with the class CL_ABAP_CONV_OUT_CE:
DATA(lo_converter) = cl_abap_conv_out_ce=>create( encoding = '4103' ). " 4103 = UTF-16 little endian
TRY.
    CALL METHOD lo_converter->convert
      EXPORTING
        data   = 'a'
        n      = 1
      IMPORTING
        buffer = DATA(lv_buffer). " lv_buffer will be 6100
  CATCH ...
ENDTRY.
Code page 4102 is for UTF-16 big endian.
It is possible to encode not just a single character, but a string as well:
EXPORTING
  data = 'abc'
  n    = 3
"n" always stands for the length of the string you want to encode. It can be less than the actual length of the string.
When you say you "want to get the UTF-16 code unit",
either you mean the Unicode code point, e.g. the character d is always U+0064 (the official "name" of the Unicode character, the two bytes 0x0064 being the hexadecimal representation of decimal 100),
or you mean you want to encode d to UTF-16 little endian (SAP code page 4103) or big endian (SAP code page 4102), which gives respectively the 2 bytes 0x6400 or the 2 bytes 0x0064.
For the second case, see József's answer.
For the first case, you may get it using the method UCCP (UniCode Code Point) or UCCPI (UniCode Code Point Integer) of class CL_ABAP_CONV_OUT_CE:
DATA: l_unicode_point_hex TYPE x LENGTH 2,
      l_unicode_point_int TYPE i.
l_unicode_point_hex = cl_abap_conv_out_ce=>UCCP( 'd' ).
ASSERT l_unicode_point_hex = '0064'.
l_unicode_point_int = cl_abap_conv_out_ce=>UCCPI( 'd' ).
ASSERT l_unicode_point_int = 100.
EDIT: Note that the two methods always return the same values regardless of the SAP system code page (4102, 4103 or whatever).
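If it helps, both cases can be cross-checked outside ABAP; a small Python sketch, for illustration only:
assert ord("d") == 100                            # code point U+0064
assert "d".encode("utf-16-le").hex() == "6400"    # UTF-16 little endian (SAP code page 4103)
assert "d".encode("utf-16-be").hex() == "0064"    # UTF-16 big endian (SAP code page 4102)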

Difference between string and character vector in Matlab 2015 [duplicate]

What is the difference between string and character class in MATLAB?
a = 'AX'; % This is a character.
b = string(a) % This is a string.
The documentation suggests:
There are two ways to represent text in MATLAB®. You can store text in character arrays. A typical use is to store short pieces of text as character vectors. And starting in Release 2016b, you can also store multiple pieces of text in string arrays. String arrays provide a set of functions for working with text as data.
This is how the two representations differ:
Element access. To represent char vectors of different lengths, one had to use cell arrays, e.g. ch = {'a', 'ab', 'abc'}. With strings, the pieces of text can be stored in an ordinary array: str = [string('a'), string('ab'), string('abc')].
However, to index characters in a string array directly, the curly bracket notation has to be used:
str{3}(2) % == 'b'
Memory use. Chars use exactly two bytes per character; strings have overhead:
a = 'abc'
b = string('abc')
whos a b
returns
Name Size Bytes Class Attributes
a 1x3 6 char
b 1x1 132 string
The best place to start for understanding the difference is the documentation. The key difference, as stated there:
A character array is a sequence of characters, just as a numeric array is a sequence of numbers. A typical use is to store short pieces of text as character vectors, such as c = 'Hello World';.
A string array is a container for pieces of text. String arrays provide a set of functions for working with text as data. To convert text to string arrays, use the string function.
Here are a few more key points about their differences:
They are different classes (i.e. types): char versus string. As such they will have different sets of methods defined for each. Think about what sort of operations you want to do on your text, then choose the one that best supports those.
Since a string is a container class, be mindful of how its size differs from an equivalent character array representation. Using your example:
>> a = 'AX'; % This is a character.
>> b = string(a) % This is a string.
>> whos
Name Size Bytes Class Attributes
a 1x2 4 char
b 1x1 134 string
Notice that the string container lists its size as 1x1 (and takes up more bytes in memory) while the character array is, as its name implies, a 1x2 array of characters.
They can't always be used interchangeably, and you may need to convert between the two for certain operations. For example, string objects can't be used as dynamic field names for structure indexing:
>> s = struct('a', 1);
>> name = string('a');
>> s.(name)
Argument to dynamic structure reference must evaluate to a valid field name.
>> s.(char(name))
ans =
1
Strings do have a bit of overhead, but they still grow by 2 bytes per character. The variable's size increases in steps, once every 8 characters. The red line is y = 2x + 127.
The figure is created using:
v = []; N = 100;
for ct = 1:N
    s = char(randi([0 255], [1, ct]));
    s = string(s);
    a = whos('s');
    v(ct) = a.bytes;
end
figure(1); clf
plot(v)
xlabel('# characters')
ylabel('# bytes')
p = polyfit(1:N, v, 1);
hold on
plot([0, N], [127, 2*N + 127], 'r')
hold off
One important practical thing to note is that strings and chars behave differently when combined with square brackets. This can be especially confusing when coming from Python. Consider the following example:
>>['asdf' '123']
ans =
'asdf123'
>> ["asdf" "123"]
ans =
1×2 string array
"asdf" "123"

Comparing characters in Rebol 3

I am trying to compare characters to see if they match. I can't figure out why it doesn't work. I'm expecting true on the output, but I'm getting false.
character: "a"
word: "aardvark"
(first word) = character ; expecting true, getting false
So "a" in Rebol is not a character, it is actually a string.
A single unicode character is its own independent type, with its own literal syntax, e.g. #"a". For example, it can be converted back and forth from INTEGER! to get a code point, which the single-letter string "a" cannot:
>> to integer! #"a"
== 97
>> to integer! "a"
** Script error: cannot MAKE/TO integer! from: "a"
** Where: to
** Near: to integer! "a"
A string is not a series of one-character STRING!s, it's a series of CHAR!. So what you want is therefore:
character: #"a"
word: "aardvark"
(first word) = character ;-- true!
(Note: Interestingly, binary conversions of both a single character string and that character will be equivalent:
>> to binary! "μ"
== #{CEBC}
>> to binary! #"μ"
== #{CEBC}
...those are UTF-8 byte representations.)
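For comparison only: Python has no separate character type, so the comparison from the question would succeed there, while the UTF-8 bytes come out the same as Rebol's. A quick sketch:
assert "aardvark"[0] == "a"                  # a length-1 string, not a char type
assert ord("a") == 97                        # code point, like to integer! #"a"
assert "μ".encode("utf-8") == b"\xce\xbc"    # same bytes as #{CEBC}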
For cases like this, when things start to behave differently than you expected, I recommend using things like probe and type?. They will help you get a sense of what's going on, and you can try small pieces of code in the interactive Rebol console.
For instance:
>> character: "a"
>> word: "aardvark"
>> type? first word
== char!
>> type? character
== string!
So you can indeed see that the first element of word is the character #"a", while your character is the string! "a". (Although I agree with @HostileFork that, to a human, comparing a string of length 1 and a character amounts to the same thing.)
Other places you can test things are http://tryrebol.esperconsultancy.nl or the chat room with RebolBot.

xor between two numbers (after hex to binary conversion)

I don't know why there is an error in this code:
hex_str1 = '5'
bin_str1 = dec2bin(hex2dec(hex_str1))
hex_str2 = '4'
bin_str2 = dec2bin(hex2dec(hex_str2))
c=xor(bin_str1,bin_str2)
The value of c is not correct when I convert the hex values to binary and then use the xor function. But when I use arrays, the value of c is correct. That code is:
e=[1 1 1 0];
f=[1 0 1 0];
g=xor(e,f)
What is the mistake in my first code, XORing the binary values obtained from hex? Can anyone help me find the solution?
Your mistake is applying xor on two strings instead of actual numerical arrays.
For the xor command, logical "0"s are represented by actual zero elements. Any non-zero elements are interpreted as logical "1"s.
When you apply xor on two strings, the numerical value of each character (element) is its ASCII value. From xor's point of view, the zeroes in your string are not really zeroes, but simply non-zero values (being equal to the ASCII value of the character '0'), which are interpreted as logical "1"s. The bottom line is that in your example you're xor-ing 111b and 111b, and so the result is 0.
The solution is to convert your strings to logical arrays:
num1 = (bin_str1 == '1');
num2 = (bin_str2 == '1');
c = xor(num1, num2);
To convert the result back into a string (of a binary number), use this:
bin_str3 = sprintf('%d', c);
... and to a hexadecimal string, add this:
hex_str3 = dec2hex(bin2dec(bin_str3));
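For comparison, the same computation sketched in Python, XORing the numeric values directly instead of the character codes:
a = int("5", 16)                   # hex '5' -> 5 (binary 101)
b = int("4", 16)                   # hex '4' -> 4 (binary 100)
c = a ^ b                          # bitwise XOR on numbers, not on '0'/'1' characters
assert format(c, "04b") == "0001"  # binary string of the result
assert format(c, "X") == "1"       # hexadecimal string of the result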
It is really helpful and gives me the correct conversion while forming an HMAC value in MATLAB.
Note, though, that in MATLAB you cannot convert a string of more than 52 characters with the bin2dec() function, and similarly hex2dec() cannot take a hexadecimal string longer than 13 characters.

How to use Unicode codepoints above U+FFFF in Rebol 3 strings like in Rebol 2?

I know you can't use caret-style escaping in strings for codepoints bigger than ^(FF) in Rebol 2, because it doesn't know anything about Unicode. So this doesn't generate anything good; it looks messed up:
print {Q: What does a Zen master's {Cow} Say? A: "^(03BC)"!}
Yet the code works in Rebol 3 and prints out:
Q: What does a Zen master's {Cow} Say? A: "μ"!
That's great, but apparently R3 maxes out its ability to hold a character in a string at U+FFFF:
>> type? "^(FFFF)"
== string!
>> type? "^(010000)"
** Syntax error: invalid "string" -- {"^^(010000)"}
** Near: (line 1) type? "^(010000)"
The situation is a lot better than the random behavior of Rebol 2 when it met codepoints it didn't know about. However, there used to be a workaround in Rebol for storing strings if you knew how to do your own UTF-8 encoding (or got your strings by way of loading source code off disk). You could just assemble them from individual characters.
So the UTF-8 encoding of U+010000 is #F0908080, and you could before say:
workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
And you'd get a string with that single codepoint encoded using UTF-8, that you could save to disk in code blocks and read back in again. Is there any similar trick in R3?
There is a workaround using the string! datatype as well. You cannot use UTF-8 in that case, but you can use a UTF-16 workaround as follows:
utf-16: "^(d800)^(dc00)"
which encodes the ^(10000) code point using a UTF-16 surrogate pair. In general, the following function can do the encoding:
utf-16: func [
    code [integer!]
    /local low high
] [
    case [
        code < 0 [do make error! "invalid code"]
        code < 65536 [append copy "" to char! code]
        code < 1114112 [
            code: code - 65536
            low: code and 1023
            high: code - low / 1024
            append append copy "" to char! high + 55296 to char! low + 56320
        ]
        'else [do make error! "invalid code"]
    ]
]
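To see that the arithmetic in that function matches the usual UTF-16 definition, here is a small cross-check in Python (illustration only; the surrogate_pair helper is just for this sketch):
def surrogate_pair(code):
    # valid only for supplementary-plane code points U+10000..U+10FFFF
    assert 0x10000 <= code <= 0x10FFFF
    code -= 0x10000
    high = 0xD800 + (code >> 10)    # 55296 + high 10 bits
    low = 0xDC00 + (code & 0x3FF)   # 56320 + low 10 bits
    return high, low

assert surrogate_pair(0x10000) == (0xD800, 0xDC00)   # matches "^(d800)^(dc00)"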
Yes, there is a trick...which is the trick you should have been using in R2 as well. Don't use a string! Use a binary! if you have to do this sort of thing:
good-workaround: #{F0908080}
It would've worked in Rebol2, and it works in Rebol3. You can save it and load it without any funny business.
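The byte sequence itself is easy to verify with any UTF-8 encoder; for instance, in Python:
assert chr(0x10000).encode("utf-8") == b"\xf0\x90\x80\x80"   # i.e. #{F0908080}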
In fact, if you care about Unicode at all, ever...stop doing string processing that uses codepoints higher than ^(7F) if you are stuck in Rebol 2 and not 3. We'll see why by looking at that terrible workaround:
terrible-workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
..."And you'd get a string with that single UTF-8 codepoint"...
The only thing you should get is a string with four individual character codepoints, and with 4 = length? terrible-workaround. Rebol2 is broken because string! is basically no different from binary! under the hood. In fact, in Rebol2 you could alias the two types back and forth without making a copy, look up AS-BINARY and AS-STRING. (This is impossible in Rebol3 because they really are fundamentally different, so don't get attached to the feature!)
It's somewhat deceptive to see these strings reporting a length of 4, and there's a false comfort of each character producing the same value if you convert them to integer!. Because if you ever write them out to a file or port somewhere, and they need to be encoded, you'll get bitten. Note this in Rebol2:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{80}
But in R3, you have a UTF-8 encoding when binary conversion is needed:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{C280}
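That is just standard UTF-8, which you can confirm with any other encoder; in Python, for instance:
assert chr(0x80).encode("utf-8") == b"\xc2\x80"      # C2 80, matching R3's #{C280}
assert chr(0x3BC).encode("utf-8") == b"\xce\xbc"     # the real bytes for mu, not #{BC}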
So you will be in for a surprise when your seemingly-working code does something different at a later time, and winds up serializing differently. In fact, if you want to know how "messed up" R2 is in this regard, look at why you got a weird symbol for your "mu". In R2:
>> to binary! #"^(03BC)"
== #{BC}
It just threw the "03" away. :-/
So if you need for some reason to work with Unicode strings and can't switch to R3, try something like this for the cow example:
mu-utf8: #{CEBC}
utf8: rejoin [#{} {Q: What does a Zen master's {Cow} Say? A: "} mu-utf8 {"!}]
That gets you a binary. Only convert it to string for debug output, and be ready to see gibberish. But it is the right thing to do if you're stuck in Rebol2.
And to reiterate the answer: it's also what to do if, for some odd reason, you're stuck needing to use those higher codepoints in Rebol3:
utf8: rejoin [#{} {Q: What did the Mycenaean's {Cow} Say? A: "} #{F0908080} {"!}]
I'm sure that would be a very funny joke if I knew what LINEAR B SYLLABLE B008 A was. Which leads me to say that most likely, if you're doing something this esoteric you probably only have a few codepoints being cited as examples. You can hold most of your data as string up until you need to slot them in conveniently, and hold the result in a binary series.
UPDATE: If one hits this problem, here is a utility function that can be useful for working around it temporarily:
safe-r2-char: charset [#"^(00)" - #"^(7F)"]
unsafe-r2-char: charset [#"^(80)" - #"^(FF)"]
hex-digit: charset [#"0" - #"9" #"A" - #"F" #"a" - #"f"]
r2-string-to-binary: func [
    str [string!] /string /unescape /unsafe
    /local result s e escape-rule unsafe-rule safe-rule rule
] [
    result: copy either string [{}] [#{}]
    escape-rule: [
        "^^(" s: 2 hex-digit e: ")" (
            append result debase/base copy/part s e 16
        )
    ]
    unsafe-rule: [
        s: unsafe-r2-char (
            append result to integer! first s
        )
    ]
    safe-rule: [
        s: safe-r2-char (append result first s)
    ]
    rule: compose/deep [
        any [
            (either unescape [[escape-rule |]] [])
            safe-rule
            (either unsafe [[| unsafe-rule]] [])
        ]
    ]
    unless parse/all str rule [
        print "Unsafe codepoints found in string! by r2-string-to-binary"
        print "See http://stackoverflow.com/questions/15077974/"
        print mold str
        throw "Bad codepoint found by r2-string-to-binary"
    ]
    result
]
If you use this instead of a to binary! conversion, you will get the consistent behavior in both Rebol2 and Rebol3. (It effectively implements a solution for terrible-workaround style strings.)