PowerBuilder 12 how to determine encoding of input file - encoding

I'm new to PowerBuilder 12, and would like to know is there any way to determine the encoding (e.g. Unicode, BIG5) of an input file. Any comments and code samples are appreciated! Thanks!

From the PB 12.5 help file :
FileEncoding ( filename )
filename : The name of the file you want to test for encoding type
Return Values
EncodingANSI!
EncodingUTF8!
EncodingUTF16LE!
EncodingUTF16BE!
If filename does not exist, returns null.

Finding Unicode is pretty easy, if you assume the Unicode file has a BOM prefix (although reality is that not all Unicode files do have this). Some code to do this is below. However, I have no idea about Big5; it looks to me (at first glance at the spec, never had occasion to use it) like it doesn't have a similar prefix.
Good luck,
Terry
function of_filetype (string as_filename) returns encoding
integer li_NullCount, li_NonNullCount, li_OffsetTest
long ll_File
encoding le_Return
blob lblb_UTF16BE, lblb_UTF16LE, lblb_UTF8, lblb_Test, lblb_BOMTest, lblb_Null
lblb_UTF16BE = Blob ("~hFE~hFF", EncodingANSI!)
lblb_UTF16LE = Blob ("~hFF~hFE", EncodingANSI!)
lblb_UTF8 = Blob ("~hEF~hBB~hBF", EncodingANSI!)
lblb_Null = blobmid (blob ("~h01", encodingutf16le!), 2, 1)
SetNull (le_Return)
// Get a set of bytes to test
ll_File = FileOpen (as_FileName, StreamMode!, Read!, Shared!)
FileRead (ll_File, lblb_Test)
FileClose (ll_File)
// test for BOMs: UTF-16BE (FF FE), UTF-16LE (FF FE), UTF-8 (EF BB BF)
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16BE))
IF lblb_BOMTest = lblb_UTF16BE THEN RETURN EncodingUTF16BE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF16LE))
IF lblb_BOMTest = lblb_UTF16LE THEN RETURN EncodingUTF16LE!
lblb_BOMTest = BlobMid (lblb_Test, 1, Len (lblb_UTF8))
IF lblb_BOMTest = lblb_UTF8 THEN RETURN EncodingUTF8!
//I've removed a hack from here that I wouldn't encourage others to use, basically checking for
//0x00 in places I'd "expect" them to be if it was a Unicode file, but that makes assumptions
RETURN le_Return

Related

Get UTF-16 code unit at a given index in ABAP

I want to get the UTF-16 code unit at a given index in ABAP.
Same can be done in JavaScript with charCodeAt().
For example "d".charCodeAt(); will give back 100.
Is there a similar functionality in ABAP?
This can be done with class CL_ABAP_CONV_OUT_CE
DATA(lo_converter) = cl_abap_conv_out_ce=>create( encoding = '4103' ). "Litte Endian
TRY.
CALL METHOD lo_converter->convert
EXPORTING
data = 'a'
n = 1
IMPORTING
buffer = DATA(lv_buffer). "lv_buffer will 0061
CATCH ...
ENDTRY.
Codepage 4102 is for UTF-16 Big endian.
It is possible to encode not just a single character, but a string as well:
EXPORTING
data = 'abc'
n = 3
"n" always stands for the length of the string you want to be encoded. It could be less, than the actual length of the string.
When you say you "want to get the UTF-16 code unit",
either you mean the Unicode code point, e.g. the character d is always U+0064 (official "name" of Unicode character, the two bytes 0x0064 being the hexadecimal representation of decimal 100),
or you mean you want to encode d to UTF-16 little endian (SAP code page 4103) or big endian (SAP code page 4102) which gives respectively 2 bytes 0x4400 or 2 bytes 0x0044.
For the second case, see József answer.
For the first case, you may get it using the method UCCP (UniCode Code Point) or UCCPI (UniCode Code Point Integer) of class CL_ABAP_CONV_OUT_CE:
DATA: l_unicode_point_hex TYPE x LENGTH 2,
l_unicode_point_int TYPE i.
l_unicode_point_hex = cl_abap_conv_out_ce=>UCCP( 'd' ).
ASSERT l_unicode_point_hex = '0064'.
l_unicode_point_int = cl_abap_conv_out_ce=>UCCPI( 'd' ).
ASSERT l_unicode_point_int = 100.
EDIT: Note that the two methods return always the same values whatever the SAP system code page is (4102, 4103 or whatever).

Decode a string with both Unicode and Utf-8 codes in Python 2.x

Say we have a string:
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
Somehow two symbols, '—', whose Unicode is \u2014 was not correctly encoded as '\xe2\x80\x94' in UTF-8. Is there an easy way to decode this string? It should be decoded as 寒假——厦门
Manually using the replace function is OK:
t = u'\u2014'
s.replace('\u2014', t.encode('utf-8')
print s
However, it is not automatic. If we extract the Unicode,
index = s.find('\u')
t = s[index : index+6]
then t = '\\u2014'. How to convert it to UTF-8 code?
You're missing extra slashes in your replace()
It should be:
s.replace("\\u2014", u'\u2014'.encode("utf-8") )
Check my warning in the comments of the question. You should not end up in this situation.

CoreFoundation UTF-16 un-paired surrogate

I'm trying to encode from utf16 to say utf32 using Apple Core Foundation API :
cfString = CFStringCreateWithBytes(nullptr, str, strLen, kCFStringEncodingUTF16, FALSE);
auto range = CFRangeMake(0, CFStringGetLenth(cfString));
CFStringGetBytes(cfString, range, kCFStringEncodingUTF32, 0, false, buffer, bufferSize, usedsize);
Most of the time that works, untill input buffer contains first part of surrogate pair say U+df9f, Corefoundation will simply return output without ill-formed characters.
So to be a bit unicode compliant, I have to manually determine that situation and follow unicode documentation to create standard substitution for that in form of U+FFFD: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
Same situation for other encodings: like symbol 0x80 in the middle of utf-8, then CFStringCreateWithBytes always return nullptr instead of pointing to invalid character.
Is that expected behaviour or UB of Corefoundation, or may be there is a hint to tune CF to be reporting malformed input somehow?
UPDATE:
I did exactly following:
UInt8 str[] = {0x41, 0x00, 0x9f, 0xdf}; // coresponding to unicode A + invalid surogate pair
CFStringRef mystr = CFStringCreateWithBytes(nullptr, str, 4, kCFStringEncodingUTF16, false, FALSE);
after that mystr has 2 characters len according to CFStringGetLength(), so looks invalid char gets processed
std::vector<char> str(7);
CFStringGetCString(mystr, &*str.begin(), str.size(), kCFStringEncodingUTF8);
that gives me false, so no conversion to utf8 is possible, and Xcode debug watches shows nothing for string myStr.
So output is nothing for utf8, and c-string, ok after that i checked with conversion to utf-32 with get bytes routine
result = CFStringGetBytes(s, range, kCFStringEncodingUTF32BE, 0, false, buffer, bufferSize, usedSize);
that gives me usedSize=4, result=1, and output contains 0x0041, so only A symbol converted. So that is why i’m thinking no substitution happened for malformed surogate pair.

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through datastructures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As #MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä for example). See #MartinR's answer here for more info.
If you need to check if a character is in between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to an unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
println("yep")
} else {
println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
println("yep")
} else {
println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7 year old posts? Fun I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the User perspective, which is a sequence of grapheme clusters, each (usually) which a visually separable appearance, or from the encoding perspective, which can be one of several (UTF32, UTF16, UTF8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design the encoding of ASCII values 0-127 will have an identical encoding in UTF-8, so loading that stream with a UTF8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé#¶œŎO!##"
foo.forEach { c in
print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
+ ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
# is ascii with value 64; # is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
# is ascii with value 64; # is not a letter
# is ascii with value 35; # is not a letter

C# - Get ANSI code value of a character

I'd like to retrieve the ANSI code value of a given character.
E.g. when I now get the int value of the trademark character, I get 8482.
Instead I would like to get 153, which is the value of the trademark character in codepage 1252.
Some help would be appreciated.
Jurgen
Found it myself:
Encoding ansiEncoding = Encoding.GetEncoding(1252);
byte[] bytes = ansiEncoding.GetBytes(c);
int code = bytes[0];